Snowflake
Architecture decisions, Cortex AI patterns, data quality methods, and production trade-offs written at the Advanced Certified Architect level. No surface-level tutorials. No vendor marketing. Just the decisions that actually matter in production.
Dynamic Tables vs. Streams and Tasks: The Production Decision
When Snowflake introduced Dynamic Tables, many teams asked whether they should retire their Streams and Tasks pipelines. The honest answer is: it depends — but not in the evasive way that phrase usually implies. There is a clear decision framework, and once you understand it, the choice becomes straightforward.
What Each Approach Actually Does
Streams and Tasks are Snowflake's procedural approach to incremental processing. A Stream tracks changes to a source table — inserts, updates, and deletes — using Snowflake's internal change tracking mechanism. A Task schedules SQL or a stored procedure on a defined interval or event trigger. Together they form a pipeline: the Stream captures what changed, the Task runs the logic. You control exactly what runs, when, and what happens when things go wrong. The tradeoff is that you own the complexity.
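A minimal sketch of the pattern, using hypothetical table, column, and warehouse names:

-- Capture inserts, updates, and deletes on the source table
CREATE OR REPLACE STREAM orders_stream ON TABLE raw_orders;

-- Run the merge on a schedule, but only when the stream actually has changes
CREATE OR REPLACE TASK merge_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO curated_orders t
  USING orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN INSERT (order_id, status, updated_at) VALUES (s.order_id, s.status, s.updated_at);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK merge_orders_task RESUME;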
Dynamic Tables are Snowflake's declarative approach. You write a SELECT statement that defines the target state, set a TARGET_LAG parameter specifying the maximum acceptable staleness, and Snowflake handles the rest — no merge logic, no task scheduling, no dependency management. The mental model: a Dynamic Table behaves like a materialized view whose refresh frequency you control, with support for joins, aggregations, window functions, and CTEs.
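The same intent expressed declaratively, again with hypothetical names; Snowflake decides when and how to refresh within the stated lag:

CREATE OR REPLACE DYNAMIC TABLE curated_orders
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT order_id, customer_id, status, updated_at
  FROM raw_orders
  WHERE status IS NOT NULL;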
The TARGET_LAG Parameter
TARGET_LAG sets a freshness guarantee, not a fixed schedule. You can set it as a time interval (minimum one minute) or use DOWNSTREAM, which tells the Dynamic Table to refresh only when a downstream Dynamic Table triggers a refresh — useful for aligning entire pipeline freshness around the final consumer's requirements.
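A sketch of DOWNSTREAM chaining, building on the hypothetical curated_orders table above; only the consumer-facing table states an explicit lag:

-- The intermediate table refreshes only when a downstream table needs it
CREATE OR REPLACE DYNAMIC TABLE orders_enriched
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_wh
AS
  SELECT o.order_id, o.customer_id, o.updated_at, c.segment
  FROM curated_orders o
  JOIN customers c ON o.customer_id = c.customer_id;

-- The final table drives freshness for the whole chain
CREATE OR REPLACE DYNAMIC TABLE orders_daily_summary
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
AS
  SELECT customer_id, segment, DATE_TRUNC('day', updated_at) AS order_day, COUNT(*) AS order_count
  FROM orders_enriched
  GROUP BY customer_id, segment, order_day;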
REFRESH_MODE: INCREMENTAL vs. FULL vs. AUTO
INCREMENTAL — Snowflake processes only the delta from source changes.
FULL — Snowflake recomputes the entire result set on every refresh, used when the query cannot be incrementally refreshed.
AUTO (default) — Snowflake decides based on query structure. Check the Dynamic Table's REFRESH_MODE after creation — Snowflake shows which mode it selected and why. Non-deterministic functions and certain complex subquery patterns force FULL mode.
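One way to inspect the selected mode, assuming the SHOW output exposes refresh_mode and refresh_mode_reason columns:

SHOW DYNAMIC TABLES LIKE 'curated_orders';

-- Read the mode Snowflake selected and its stated reason from the SHOW output
SELECT "name", "refresh_mode", "refresh_mode_reason"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));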
Cost Considerations
Dynamic Tables incur storage cost, cloud services compute (only billed if daily cloud services cost exceeds 10% of daily warehouse cost), and virtual warehouse compute — the primary cost driver. Dynamic Tables do not yet support serverless compute; they require a named virtual warehouse. The offsetting saving is engineering time: one team reported cutting their pipeline codebase by 40% after switching to Dynamic Tables. The question is whether that time saving justifies any compute cost increase for your specific workload.
When to Use Dynamic Tables
Use when: building transformation layers expressible as a SELECT statement; acceptable staleness is one minute or greater; you have chains of dependent transformations benefiting from automatic DAG management; implementing SCD Type 1.
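A minimal SCD Type 1 sketch, with hypothetical names: only the latest version of each key survives, and history is intentionally overwritten.

CREATE OR REPLACE DYNAMIC TABLE dim_customer_current
  TARGET_LAG = '30 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT customer_id, name, email, address, updated_at
  FROM staged_customers
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1;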
When to Use Streams and Tasks
Use when: sub-minute latency is required; pipeline logic needs multi-step procedural operations, conditional branching, external API calls, or JavaScript UDFs; implementing SCD Type 2; needing append-only change capture; requiring precise control over error handling and retry logic per step.
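A condensed SCD Type 2 sketch using a Stream and a Task; names are hypothetical, and a production version would also handle deletes and late-arriving changes. Pending changes are snapshotted into a temporary table so the stream is consumed exactly once and reused by both statements.

CREATE OR REPLACE STREAM customer_changes ON TABLE staged_customers;

CREATE OR REPLACE TASK scd2_customer_task
  WAREHOUSE = transform_wh
  SCHEDULE = '10 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('customer_changes')
AS
BEGIN
  -- Snapshot the pending changes from the stream
  CREATE OR REPLACE TEMPORARY TABLE customer_delta AS
    SELECT customer_id, name, email
    FROM customer_changes
    WHERE METADATA$ACTION = 'INSERT';

  -- Close out the current version of each changed customer
  UPDATE dim_customer d
    SET valid_to = CURRENT_TIMESTAMP(), is_current = FALSE
    WHERE d.is_current
      AND d.customer_id IN (SELECT customer_id FROM customer_delta);

  -- Insert the new version as the current row
  INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
    SELECT customer_id, name, email, CURRENT_TIMESTAMP(), NULL, TRUE
    FROM customer_delta;
END;

ALTER TASK scd2_customer_task RESUME;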
A Practical Decision Framework
1. Can the transformation be expressed as a single SELECT statement? If yes, Dynamic Tables are a strong candidate.
2. What is the acceptable staleness? Under one minute — use Streams and Tasks. One minute or more — Dynamic Tables are viable.
3. How complex is the dependency chain? Many dependent steps favor Dynamic Tables' automatic DAG management.
The mistake to avoid is treating Dynamic Tables as universally superior because they are newer. Both tools exist for good reasons. The right architecture uses each where its design strengths align with the pipeline requirements.
Building Production RAG with Snowflake Cortex Search
Building a retrieval-augmented generation system from scratch requires assembling multiple components: an embedding model, a vector database, an indexing pipeline, a retrieval layer, a reranking model, and a generation model. When your data lives in Snowflake, every external hop introduces data movement — security review, compliance consideration, and additional latency. Snowflake Cortex Search addresses this by packaging the retrieval components into a managed service that runs natively inside Snowflake. Your data never leaves the platform.
How Cortex Search Works
Cortex Search implements hybrid search — combining vector search, keyword search, and semantic reranking in a single retrieval operation. The vector search component is powered by Arctic Embed. The current recommended model is snowflake-arctic-embed-l-v2.0, which achieves retrieval performance that only models with over one billion parameters have historically surpassed, while remaining practical to deploy at scale. The result is 200–300 millisecond query latency over large volumes of text, with metadata filtering support.
Creating a Cortex Search Service
CREATE OR REPLACE CORTEX SEARCH SERVICE support_search_service
  ON document_text
  ATTRIBUTES category, department
  WAREHOUSE = search_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
AS (
  SELECT document_text, category, department, document_id
  FROM enterprise_documents
  WHERE is_active = TRUE
);
The ON clause names the column Cortex Search indexes and searches over. Columns listed in ATTRIBUTES are available as query-time filters. TARGET_LAG controls how often the index is refreshed from the source query. EMBEDDING_MODEL selects the Arctic Embed variant.
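Retrieval can then be tested directly in SQL. A sketch assuming the SNOWFLAKE.CORTEX.SEARCH_PREVIEW function and the service defined above; the filter values and fully qualified service name are illustrative:

SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    'my_db.my_schema.support_search_service',
    '{
       "query": "how do I reset my VPN password",
       "columns": ["document_text", "document_id", "category"],
       "filter": {"@eq": {"department": "IT"}},
       "limit": 5
     }'
  )
)['results'] AS results;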
Chunking Strategy: The Decision That Matters Most
Snowflake recommends keeping text chunks at 512 tokens or fewer (approximately 385 English words). When a chunk is too large, the embedding vector averages across too much semantic content — a query retrieves a chunk containing the relevant topic plus unrelated content, reducing precision and diluting LLM context. Three approaches:
Fixed-size with overlap — 10–20% overlap prevents context loss at chunk boundaries; the simplest approach and a good starting point.
Semantic — split at natural boundaries such as paragraphs; more coherent chunks, variable size.
Hierarchical — a summary chunk plus fine-grained chunks; the most complex, best for long structured documents.
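For fixed-size chunking with overlap, one option that keeps everything inside Snowflake is the Cortex text-splitting function. A sketch assuming SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER with arguments (text, format, chunk_size, overlap); confirm the argument units and availability against current documentation before sizing chunks:

CREATE OR REPLACE TABLE document_chunks AS
SELECT
  d.document_id,
  d.category,
  d.department,
  c.index         AS chunk_index,
  c.value::STRING AS document_text
FROM enterprise_documents d,
     LATERAL FLATTEN(
       input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(d.document_text, 'none', 512, 64)
     ) c
WHERE d.is_active = TRUE;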
Production Considerations
Governance — Cortex Search services inherit Snowflake's row-level and column-level security. Users without table access cannot retrieve content from a service built on that table.
Cost — the initial index build is the most significant cost event; size your warehouse appropriately for the initial load, then scale down.
Evaluation — build retrieval quality and answer quality evaluation into your pipeline from the start.
What Cortex Search Does Not Replace
For structured data question-answering — queries answerable by generating and executing SQL — Cortex Analyst is more appropriate. Cortex Search is the right choice for enterprise knowledge bases, document search, and conversational AI over text data that lives in Snowflake.
Medallion Architecture on Snowflake: What the Diagrams Leave Out
The Medallion Architecture — Bronze, Silver, Gold — has become the dominant pattern for organizing data pipelines on modern cloud platforms. What the standard diagrams omit is everything that makes the difference between an architecture that looks right on paper and one that holds up when real data arrives from real systems under real operational pressure.
The Bronze Layer: Immutability Is Not Optional
The most common mistake is transforming data before storing it. Bronze exists to preserve raw source data unchanged — duplicates, nulls, malformed records, unexpected schema changes all land in Bronze as-is. This immutability gives you the ability to reprocess data when business logic changes and recover from downstream errors without re-extracting from source systems. In Snowflake: Snowpipe for automated ingestion without transformation; VARIANT columns for semi-structured data parsed at Silver, not Bronze; Time Travel configured for extended retention. Add metadata columns (_source_file_name, _load_timestamp, _row_hash) — they cost almost nothing and make debugging dramatically easier.
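A Bronze ingestion sketch with hypothetical stage, pipe, and table names; the VARIANT payload is stored unparsed and the metadata columns are populated at load time. If a function such as MD5 is not permitted inside a COPY transformation in your account, compute the row hash in a follow-on statement instead.

CREATE OR REPLACE TABLE bronze_orders (
  raw_payload        VARIANT,
  _source_file_name  STRING,
  _load_timestamp    TIMESTAMP_NTZ,
  _row_hash          STRING
)
DATA_RETENTION_TIME_IN_DAYS = 90;   -- extended Time Travel (more than 1 day requires Enterprise edition)

-- AUTO_INGEST assumes event notifications are configured on the stage's cloud storage
CREATE OR REPLACE PIPE bronze_orders_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO bronze_orders
FROM (
  SELECT $1,
         METADATA$FILENAME,
         CURRENT_TIMESTAMP(),
         MD5(TO_JSON($1))
  FROM @orders_stage
)
FILE_FORMAT = (TYPE = 'JSON');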
The Silver Layer: Where Complexity Lives
Deduplication — row hash for exact duplicates, primary key with timestamp ordering for latest version, QUALIFY with ROW_NUMBER() for partition deduplication. Define the logic explicitly per source; a sketch follows this list.
Validation and exception strategy — failed records should not be silently dropped. Route them to an exceptions table with rejection reason codes, as the sketch below shows.
Entity resolution — a customer appears as CUST_12345 in the CRM and C-12345 in the ERP. Silver is where you resolve identities into a canonical entity with a defined identifier hierarchy. This is one of the most underestimated design challenges in Silver layer architecture — it cannot be retrofitted easily.
Incremental vs. full refresh — Dynamic Tables with INCREMENTAL refresh work well for Silver transformations expressible as SELECT statements. For complex MERGE or multi-step CDC, Streams and Tasks remain more appropriate.
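A Silver-layer sketch combining QUALIFY deduplication with exception routing; table names, validation rules, and reason codes are hypothetical and build on the Bronze example above:

-- Keep only the latest valid version of each order
CREATE OR REPLACE TABLE silver_orders AS
SELECT
  raw_payload:order_id::STRING      AS order_id,
  raw_payload:customer_id::STRING   AS customer_id,
  raw_payload:amount::NUMBER(12,2)  AS amount,
  _load_timestamp
FROM bronze_orders
WHERE raw_payload:order_id IS NOT NULL
  AND raw_payload:amount::NUMBER(12,2) >= 0
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY raw_payload:order_id
  ORDER BY _load_timestamp DESC
) = 1;

-- Route records that fail validation to an exceptions table with a reason code
INSERT INTO silver_orders_exceptions (raw_payload, _load_timestamp, rejection_reason)
SELECT raw_payload, _load_timestamp,
       CASE
         WHEN raw_payload:order_id IS NULL THEN 'MISSING_ORDER_ID'
         WHEN raw_payload:amount::NUMBER(12,2) < 0 THEN 'NEGATIVE_AMOUNT'
       END
FROM bronze_orders
WHERE raw_payload:order_id IS NULL
   OR raw_payload:amount::NUMBER(12,2) < 0;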
The Gold Layer: Serving Specific Consumers
Each Gold table should serve a defined consumer with data matching that consumer's query patterns. A dashboard table and an ML feature table have different requirements — one prioritizes pre-aggregated denormalized data, the other prioritizes feature definition stability. Building one Gold table that tries to serve both typically serves neither well. Metric definitions live in Gold — what exactly is an "active customer"? These must be explicit, documented, and consistent across the organization. Clustering keys improve performance for high-frequency filtered queries — align them with the most common query filter patterns and monitor with SYSTEM$CLUSTERING_INFORMATION.
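Checking whether a clustering key is earning its keep; the table and key below are hypothetical:

ALTER TABLE gold_daily_revenue CLUSTER BY (region, order_day);

-- Returns depth and overlap statistics for the specified key
SELECT SYSTEM$CLUSTERING_INFORMATION('gold_daily_revenue', '(region, order_day)');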
Governance Across Layers
Bronze — restrict access to ingestion pipeline and Silver jobs only. Silver — apply row-level security and column masking before data reaches Gold. Gold — this is what business users access and where Snowflake's Secure Data Sharing is most relevant. Snowflake's tag-based policy framework scales governance across large numbers of tables without per-table management.
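A tag-based masking sketch; schema, tag, policy, and role names are hypothetical:

CREATE OR REPLACE TAG governance.pii_type;

CREATE OR REPLACE MASKING POLICY governance.mask_email AS (val STRING)
  RETURNS STRING ->
  CASE WHEN IS_ROLE_IN_SESSION('PII_READER') THEN val ELSE '*** MASKED ***' END;

-- Bind the policy to the tag once; every tagged column inherits it
ALTER TAG governance.pii_type SET MASKING POLICY governance.mask_email;

-- Tag columns as they are modeled in Silver; no per-table policy management needed
ALTER TABLE silver_customers MODIFY COLUMN email SET TAG governance.pii_type = 'email';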
The Mistake That Costs the Most
Getting layer boundaries wrong early and discovering it late. The safest guard is a written specification of what each layer is responsible for, agreed upon before implementation begins. That document does not just survive the architecture review — it becomes the architecture.
Architecture Notes
Shorter observations from active practice — patterns encountered this week, trade-offs worth documenting, decisions I would make differently now.
Bronze to Gold with incremental refresh
Layer design, quality gates, and governance decisions that determine whether the architecture holds at enterprise scale.
Domain ownership via Snowflake Data Sharing
Federated governance, cross-account secure sharing, and the organizational decisions that determine whether Data Mesh actually works.
Natural language queries over structured data
Real prompts, real outputs, real limitations — what Cortex Analyst can and cannot do in a production context.
Community Work
Contributions to the Snowflake community — Stack Overflow answers, forum posts, meetup recordings, and public talks.
Snowflake Tag Contributions
Active answers on the Snowflake tag — architecture questions, Snowpark, Dynamic Tables, and Cortex AI.
View on Stack Overflow
Snowflake User Group Sessions
Recordings from user group meetups and virtual sessions on Dynamic Tables, Cortex AI, and data quality at scale.
Speaking Kit
Forum Participation
Questions answered, patterns shared, and architecture discussions in the official Snowflake Community forum.
Snowflake Community
Peer-Reviewed Research on Data Quality
Published research on immune-inspired anomaly detection applied to large-scale data engineering — directly relevant to Snowflake migration validation.
Read the Paper
Research Relevant to Snowflake
Published work connecting directly to data engineering, data quality, and large-scale system design on Snowflake.