Snowflake
Architecture decisions, Cortex AI patterns, data quality methods, and production trade-offs written at the Advanced Certified Architect level. No surface-level tutorials. No vendor marketing. Just the decisions that actually matter in production.
Dynamic Tables vs. Streams and Tasks: The Production Decision
When Snowflake introduced Dynamic Tables, many teams asked whether they should retire their Streams and Tasks pipelines. The honest answer is: it depends — but not in the evasive way that phrase usually implies. There is a clear decision framework, and once you understand it, the choice becomes straightforward.
What Each Approach Actually Does
Streams and Tasks are Snowflake's procedural approach to incremental processing. A Stream tracks changes to a source table — inserts, updates, and deletes — using Snowflake's internal change tracking mechanism. A Task schedules SQL or a stored procedure on a defined interval or event trigger. Together they form a pipeline: the Stream captures what changed, the Task runs the logic. You control exactly what runs, when, and what happens when things go wrong. The tradeoff is that you own the complexity.
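A minimal sketch of the pattern, using hypothetical table, column, and warehouse names:

-- Capture inserts, updates, and deletes on the source table
CREATE OR REPLACE STREAM orders_stream ON TABLE raw_orders;

-- Run the merge on a schedule, but only when the stream actually has changes
CREATE OR REPLACE TASK merge_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO curated_orders t
  USING orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN INSERT (order_id, status, updated_at) VALUES (s.order_id, s.status, s.updated_at);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK merge_orders_task RESUME;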
Dynamic Tables are Snowflake's declarative approach. You write a SELECT statement that defines the target state, set a TARGET_LAG parameter specifying the maximum acceptable staleness, and Snowflake handles the rest — no merge logic, no task scheduling, no dependency management. The mental model: a Dynamic Table behaves like a materialized view whose refresh frequency you control, with support for joins, aggregations, window functions, and CTEs.
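The same intent expressed declaratively, again with hypothetical names; Snowflake decides when and how to refresh within the stated lag:

CREATE OR REPLACE DYNAMIC TABLE curated_orders
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT order_id, customer_id, status, updated_at
  FROM raw_orders
  WHERE status IS NOT NULL;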
The TARGET_LAG Parameter
TARGET_LAG sets a freshness guarantee, not a fixed schedule. You can set it as a time interval (minimum one minute) or use DOWNSTREAM, which tells the Dynamic Table to refresh only when a downstream Dynamic Table triggers a refresh — useful for aligning entire pipeline freshness around the final consumer's requirements.
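A sketch of DOWNSTREAM chaining, building on the hypothetical curated_orders table above; only the consumer-facing table states an explicit lag:

-- The intermediate table refreshes only when a downstream table needs it
CREATE OR REPLACE DYNAMIC TABLE orders_enriched
  TARGET_LAG = DOWNSTREAM
  WAREHOUSE = transform_wh
AS
  SELECT o.order_id, o.customer_id, o.updated_at, c.segment
  FROM curated_orders o
  JOIN customers c ON o.customer_id = c.customer_id;

-- The final table drives freshness for the whole chain
CREATE OR REPLACE DYNAMIC TABLE orders_daily_summary
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
AS
  SELECT customer_id, segment, DATE_TRUNC('day', updated_at) AS order_day, COUNT(*) AS order_count
  FROM orders_enriched
  GROUP BY customer_id, segment, order_day;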
REFRESH_MODE: INCREMENTAL vs. FULL vs. AUTO
INCREMENTAL — Snowflake processes only the delta from source changes.
FULL — Snowflake recomputes the entire result set on every refresh, used when the query cannot be incrementally refreshed.
AUTO (default) — Snowflake decides based on query structure. Check the Dynamic Table's REFRESH_MODE after creation — Snowflake shows which mode it selected and why. Non-deterministic functions and certain complex subquery patterns force FULL mode.
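One way to inspect the selected mode, assuming the SHOW output exposes refresh_mode and refresh_mode_reason columns:

SHOW DYNAMIC TABLES LIKE 'curated_orders';

-- Read the mode Snowflake selected and its stated reason from the SHOW output
SELECT "name", "refresh_mode", "refresh_mode_reason"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));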
Cost Considerations
Dynamic Tables incur storage cost, cloud services compute (only billed if daily cloud services cost exceeds 10% of daily warehouse cost), and virtual warehouse compute — the primary cost driver. Dynamic Tables do not yet support serverless compute; they require a named virtual warehouse. The offsetting saving is engineering time: one team reported cutting their pipeline codebase by 40% after switching to Dynamic Tables. The question is whether that time saving justifies any compute cost increase for your specific workload.
When to Use Dynamic Tables
Use when: building transformation layers expressible as a SELECT statement; acceptable staleness is one minute or greater; you have chains of dependent transformations benefiting from automatic DAG management; implementing SCD Type 1.
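A minimal SCD Type 1 sketch, with hypothetical names: only the latest version of each key survives, and history is intentionally overwritten.

CREATE OR REPLACE DYNAMIC TABLE dim_customer_current
  TARGET_LAG = '30 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT customer_id, name, email, address, updated_at
  FROM staged_customers
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1;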
When to Use Streams and Tasks
Use when: sub-minute latency is required; pipeline logic needs multi-step procedural operations, conditional branching, external API calls, or JavaScript UDFs; implementing SCD Type 2; needing append-only change capture; requiring precise control over error handling and retry logic per step.
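A condensed SCD Type 2 sketch using a Stream and a Task; names are hypothetical, and a production version would also handle deletes and late-arriving changes. Pending changes are snapshotted into a temporary table so the stream is consumed exactly once and reused by both statements.

CREATE OR REPLACE STREAM customer_changes ON TABLE staged_customers;

CREATE OR REPLACE TASK scd2_customer_task
  WAREHOUSE = transform_wh
  SCHEDULE = '10 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('customer_changes')
AS
BEGIN
  -- Snapshot the pending changes from the stream
  CREATE OR REPLACE TEMPORARY TABLE customer_delta AS
    SELECT customer_id, name, email
    FROM customer_changes
    WHERE METADATA$ACTION = 'INSERT';

  -- Close out the current version of each changed customer
  UPDATE dim_customer d
    SET valid_to = CURRENT_TIMESTAMP(), is_current = FALSE
    WHERE d.is_current
      AND d.customer_id IN (SELECT customer_id FROM customer_delta);

  -- Insert the new version as the current row
  INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
    SELECT customer_id, name, email, CURRENT_TIMESTAMP(), NULL, TRUE
    FROM customer_delta;
END;

ALTER TASK scd2_customer_task RESUME;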
A Practical Decision Framework
1. Can the transformation be expressed as a single SELECT statement? If yes, Dynamic Tables are a strong candidate.
2. What is the acceptable staleness? Under one minute — use Streams and Tasks. One minute or more — Dynamic Tables are viable.
3. How complex is the dependency chain? Many dependent steps favor Dynamic Tables' automatic DAG management.
The mistake to avoid is treating Dynamic Tables as universally superior because they are newer. Both tools exist for good reasons. The right architecture uses each where its design strengths align with the pipeline requirements.
Building Production RAG with Snowflake Cortex Search
Building a retrieval-augmented generation system from scratch requires assembling multiple components: an embedding model, a vector database, an indexing pipeline, a retrieval layer, a reranking model, and a generation model. When your data lives in Snowflake, every external hop introduces data movement — security review, compliance consideration, and additional latency. Snowflake Cortex Search addresses this by packaging the retrieval components into a managed service that runs natively inside Snowflake. Your data never leaves the platform.
How Cortex Search Works
Cortex Search implements hybrid search — combining vector search, keyword search, and semantic reranking in a single retrieval operation. The vector search component is powered by Arctic Embed. The current recommended model is snowflake-arctic-embed-l-v2.0, which achieves retrieval performance that only models with over one billion parameters have historically surpassed, while remaining practical to deploy at scale. The result is 200–300 millisecond query latency over large volumes of text, with metadata filtering support.
Creating a Cortex Search Service
CREATE OR REPLACE CORTEX SEARCH SERVICE support_search_service
  ON document_text
  ATTRIBUTES category, department
  WAREHOUSE = search_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
AS (
  SELECT document_text, category, department, document_id
  FROM enterprise_documents
  WHERE is_active = TRUE
);
The ON clause names the column Cortex Search indexes and searches over. Columns listed in ATTRIBUTES are available as query-time filters. TARGET_LAG controls how often the index is refreshed from the source query. EMBEDDING_MODEL selects the Arctic Embed variant.
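Retrieval can then be tested directly in SQL. A sketch assuming the SNOWFLAKE.CORTEX.SEARCH_PREVIEW function and the service defined above; the filter values and fully qualified service name are illustrative:

SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    'my_db.my_schema.support_search_service',
    '{
       "query": "how do I reset my VPN password",
       "columns": ["document_text", "document_id", "category"],
       "filter": {"@eq": {"department": "IT"}},
       "limit": 5
     }'
  )
)['results'] AS results;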
Chunking Strategy: The Decision That Matters Most
Snowflake recommends keeping text chunks at 512 tokens or fewer (approximately 385 English words). When a chunk is too large, the embedding vector averages across too much semantic content — a query retrieves a chunk containing the relevant topic plus unrelated content, reducing precision and diluting LLM context. Three approaches:
Fixed-size with overlap — 10–20% overlap prevents context loss at chunk boundaries; the simplest approach and a good starting point.
Semantic — split at natural boundaries such as paragraphs; more coherent chunks, variable size.
Hierarchical — a summary chunk plus fine-grained chunks; the most complex, best for long structured documents.
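For fixed-size chunking with overlap, one option that keeps everything inside Snowflake is the Cortex text-splitting function. A sketch assuming SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER with arguments (text, format, chunk_size, overlap); confirm the argument units and availability against current documentation before sizing chunks:

CREATE OR REPLACE TABLE document_chunks AS
SELECT
  d.document_id,
  d.category,
  d.department,
  c.index         AS chunk_index,
  c.value::STRING AS document_text
FROM enterprise_documents d,
     LATERAL FLATTEN(
       input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(d.document_text, 'none', 512, 64)
     ) c
WHERE d.is_active = TRUE;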
Production Considerations
Governance — Cortex Search services inherit Snowflake's row-level and column-level security. Users without table access cannot retrieve content from a service built on that table.
Cost — the initial index build is the most significant cost event; size your warehouse appropriately for the initial load, then scale down.
Evaluation — build retrieval quality and answer quality evaluation into your pipeline from the start.
What Cortex Search Does Not Replace
For structured data question-answering — queries answerable by generating and executing SQL — Cortex Analyst is more appropriate. Cortex Search is the right choice for enterprise knowledge bases, document search, and conversational AI over text data that lives in Snowflake.
Medallion Architecture on Snowflake: What the Diagrams Leave Out
The Medallion Architecture — Bronze, Silver, Gold — has become the dominant pattern for organizing data pipelines on modern cloud platforms. What the standard diagrams omit is everything that makes the difference between an architecture that looks right on paper and one that holds up when real data arrives from real systems under real operational pressure.
The Bronze Layer: Immutability Is Not Optional
The most common mistake is transforming data before storing it. Bronze exists to preserve raw source data unchanged — duplicates, nulls, malformed records, unexpected schema changes all land in Bronze as-is. This immutability gives you the ability to reprocess data when business logic changes and recover from downstream errors without re-extracting from source systems. In Snowflake: Snowpipe for automated ingestion without transformation; VARIANT columns for semi-structured data parsed at Silver, not Bronze; Time Travel configured for extended retention. Add metadata columns (_source_file_name, _load_timestamp, _row_hash) — they cost almost nothing and make debugging dramatically easier.
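A Bronze ingestion sketch with hypothetical stage, pipe, and table names; the VARIANT payload is stored unparsed and the metadata columns are populated at load time. If a function such as MD5 is not permitted inside a COPY transformation in your account, compute the row hash in a follow-on statement instead.

CREATE OR REPLACE TABLE bronze_orders (
  raw_payload        VARIANT,
  _source_file_name  STRING,
  _load_timestamp    TIMESTAMP_NTZ,
  _row_hash          STRING
)
DATA_RETENTION_TIME_IN_DAYS = 90;   -- extended Time Travel (more than 1 day requires Enterprise edition)

-- AUTO_INGEST assumes event notifications are configured on the stage's cloud storage
CREATE OR REPLACE PIPE bronze_orders_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO bronze_orders
FROM (
  SELECT $1,
         METADATA$FILENAME,
         CURRENT_TIMESTAMP(),
         MD5(TO_JSON($1))
  FROM @orders_stage
)
FILE_FORMAT = (TYPE = 'JSON');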
The Silver Layer: Where Complexity Lives
Deduplication — row hash for exact duplicates, primary key with timestamp ordering for latest version, QUALIFY with ROW_NUMBER() for partition deduplication. Define the logic explicitly per source; a sketch follows this list.
Validation and exception strategy — failed records should not be silently dropped. Route them to an exceptions table with rejection reason codes, as the sketch below shows.
Entity resolution — a customer appears as CUST_12345 in the CRM and C-12345 in the ERP. Silver is where you resolve identities into a canonical entity with a defined identifier hierarchy. This is one of the most underestimated design challenges in Silver layer architecture — it cannot be retrofitted easily.
Incremental vs. full refresh — Dynamic Tables with INCREMENTAL refresh work well for Silver transformations expressible as SELECT statements. For complex MERGE or multi-step CDC, Streams and Tasks remain more appropriate.
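A Silver-layer sketch combining QUALIFY deduplication with exception routing; table names, validation rules, and reason codes are hypothetical and build on the Bronze example above:

-- Keep only the latest valid version of each order
CREATE OR REPLACE TABLE silver_orders AS
SELECT
  raw_payload:order_id::STRING      AS order_id,
  raw_payload:customer_id::STRING   AS customer_id,
  raw_payload:amount::NUMBER(12,2)  AS amount,
  _load_timestamp
FROM bronze_orders
WHERE raw_payload:order_id IS NOT NULL
  AND raw_payload:amount::NUMBER(12,2) >= 0
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY raw_payload:order_id
  ORDER BY _load_timestamp DESC
) = 1;

-- Route records that fail validation to an exceptions table with a reason code
INSERT INTO silver_orders_exceptions (raw_payload, _load_timestamp, rejection_reason)
SELECT raw_payload, _load_timestamp,
       CASE
         WHEN raw_payload:order_id IS NULL THEN 'MISSING_ORDER_ID'
         WHEN raw_payload:amount::NUMBER(12,2) < 0 THEN 'NEGATIVE_AMOUNT'
       END
FROM bronze_orders
WHERE raw_payload:order_id IS NULL
   OR raw_payload:amount::NUMBER(12,2) < 0;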
The Gold Layer: Serving Specific Consumers
Each Gold table should serve a defined consumer with data matching that consumer's query patterns. A dashboard table and an ML feature table have different requirements — one prioritizes pre-aggregated denormalized data, the other prioritizes feature definition stability. Building one Gold table that tries to serve both typically serves neither well. Metric definitions live in Gold — what exactly is an "active customer"? These must be explicit, documented, and consistent across the organization. Clustering keys improve performance for high-frequency filtered queries — align them with the most common query filter patterns and monitor with SYSTEM$CLUSTERING_INFORMATION.
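Checking whether a clustering key is earning its keep; the table and key below are hypothetical:

ALTER TABLE gold_daily_revenue CLUSTER BY (region, order_day);

-- Returns depth and overlap statistics for the specified key
SELECT SYSTEM$CLUSTERING_INFORMATION('gold_daily_revenue', '(region, order_day)');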
Governance Across Layers
Bronze — restrict access to ingestion pipeline and Silver jobs only. Silver — apply row-level security and column masking before data reaches Gold. Gold — this is what business users access and where Snowflake's Secure Data Sharing is most relevant. Snowflake's tag-based policy framework scales governance across large numbers of tables without per-table management.
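A tag-based masking sketch; schema, tag, policy, and role names are hypothetical:

CREATE OR REPLACE TAG governance.pii_type;

CREATE OR REPLACE MASKING POLICY governance.mask_email AS (val STRING)
  RETURNS STRING ->
  CASE WHEN IS_ROLE_IN_SESSION('PII_READER') THEN val ELSE '*** MASKED ***' END;

-- Bind the policy to the tag once; every tagged column inherits it
ALTER TAG governance.pii_type SET MASKING POLICY governance.mask_email;

-- Tag columns as they are modeled in Silver; no per-table policy management needed
ALTER TABLE silver_customers MODIFY COLUMN email SET TAG governance.pii_type = 'email';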
The Mistake That Costs the Most
Getting layer boundaries wrong early and discovering it late. The safest guard is a written specification of what each layer is responsible for, agreed upon before implementation begins. That document does not just survive the architecture review — it becomes the architecture.
Architecture Notes
Shorter observations from active practice — patterns encountered this week, trade-offs worth documenting, decisions I would make differently now.
Bronze to Gold with incremental refresh
Layer design, quality gates, and governance decisions that determine whether the architecture holds at enterprise scale.
Domain ownership via Snowflake Data Sharing
Federated governance, cross-account secure sharing, and the organizational decisions that determine whether Data Mesh actually works.
Natural language queries over structured data
Real prompts, real outputs, real limitations — what Cortex Analyst can and cannot do in a production context.
Community Work
Contributions to the Snowflake community — Stack Overflow answers, forum posts, meetup recordings, and public talks.
Snowflake Tag Contributions
Active answers on the Snowflake tag — architecture questions, Snowpark, Dynamic Tables, and Cortex AI.
View on Stack Overflow
Snowflake User Group Sessions
Recordings from user group meetups and virtual sessions on Dynamic Tables, Cortex AI, and data quality at scale.
Speaking Kit
Forum Participation
Questions answered, patterns shared, and architecture discussions in the official Snowflake Community forum.
Snowflake Community
Peer-Reviewed Research on Data Quality
Published research on immune-inspired anomaly detection applied to large-scale data engineering — directly relevant to Snowflake migration validation.
Read the Paper
Research Relevant to Snowflake
Published work connecting directly to data engineering, data quality, and large-scale system design on Snowflake.