Easy Time‑Travel for your Data Warehouse with dbt snapshots

Freelance Data Architect (ex-Criteo, JobTeaser, Welcome to the Jungle...) — I work 2–3 days/week with clients so I can spend the rest building. Currently: Limner (coming soon), a visual conversation threading layer for AI agents. Writing here at bluu.dev

Why your "Sales by Region" keeps lying to you

If a "simple" KPI like Sales by Region keeps changing every time someone moves house or changes team, that's not your dashboard being flaky. That's your data model telling you it only remembers the latest truth.

As long as dimension tables keep one row per customer/account/contract, they can only answer one question:

"What does this entity look like right now?"

The questions stakeholders actually ask are different:

  • "What did this customer look like when they ordered?"

  • "Who owned this account when the deal closed?"

  • "What was the contract status at quarter-end?"

To answer those, the warehouse needs some form of time travel.

In an ideal world, that comes from CDC (Change Data Capture): reading the database's transaction logs (WAL, binlog, etc.) and replaying every insert, update, and delete as a continuous event stream.

In the real world, that often means begging infra for Debezium, Kafka, or cloud-$VENDOR$ pipelines you will never get.

This is where dbt snapshots are useful: a very simple, very boring way to add just enough history to make your warehouse behave like it remembers the past.


1. The real problem SCD Type 2 solves

TL;DR: Without SCD Type 2, the warehouse only remembers the latest state. With it, you can reconstruct what things looked like at any past date.

Think about a classic e‑commerce case: "Sales by Region" over the last year.

  • A customer lives in New York for 6 months and buys 5 items.

  • Then they move to California and buy 3 more.

If the customers dimension only keeps their current address:

  • All 8 orders get attributed to California.

  • New York's historical revenue is under‑reported.

  • California looks more successful than it really was.

The warehouse is effectively rewriting the past every time someone moves.

SCD Type 2 is the pattern that fixes this. It lets the dimensional model answer:

"What was the state of this customer at the time of each order?"

Once that exists, a whole class of questions becomes actually answerable:

  • Sales attribution by region / rep at deal time

  • Funnel conversion by segment at the time of signup

  • Regulatory views of contracts as of reporting date

Without SCD2, those are guesswork. With it, they're just SQL.


2. SCD Type 2 in two columns

TL;DR: Add valid_from and valid_to to the dimension, and stop overwriting rows.

Instead of overwriting a customer row when something changes, SCD Type 2 adds a new row and marks the old one as expired.

Minimal version of a customer address history table:

customer_id | name | region | valid_from  | valid_to
----------- | ---- | ------ | ----------- | ----------
CUST-001    | John | NY     | 2024-01-01  | 2024-07-01
CUST-001    | John | CA     | 2024-07-01  | NULL

  • customer_id: same across all versions for the same person.

  • valid_from: when this version became active.

  • valid_to: when this version stopped being true. NULL = "still active".

dbt will also add a surrogate key like dbt_scd_id under the hood, one per version, which is what you'd typically join from facts in a more "by‑the‑book" warehouse.

Point‑in‑time lookups

To know where John lived on March 15th:

SELECT *
FROM customer_scd2
WHERE '2024-03-15' BETWEEN valid_from AND COALESCE(valid_to, CURRENT_DATE);

March 15th sits inside the [2024‑01‑01, 2024‑07‑01) window, so you get the NY row.

That one pattern—"date BETWEEN valid_from AND valid_to (or COALESCE)"—is basically the entire mechanic.
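The mechanic is easy to verify locally. Here is a minimal sketch using Python's built-in sqlite3, loading the two John rows from the table above and running the same BETWEEN lookup (table and column names follow the article's example):

```python
import sqlite3

# In-memory table matching the article's customer address history.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer_scd2 (
        customer_id TEXT, name TEXT, region TEXT,
        valid_from TEXT, valid_to TEXT
    )
""")
con.executemany(
    "INSERT INTO customer_scd2 VALUES (?, ?, ?, ?, ?)",
    [
        ("CUST-001", "John", "NY", "2024-01-01", "2024-07-01"),
        ("CUST-001", "John", "CA", "2024-07-01", None),  # NULL = still active
    ],
)

# Where did John live on 2024-03-15? COALESCE maps the open-ended
# current row to "today" so BETWEEN works for both cases.
row = con.execute("""
    SELECT region FROM customer_scd2
    WHERE ? BETWEEN valid_from AND COALESCE(valid_to, DATE('now'))
""", ("2024-03-15",)).fetchone()
print(row[0])  # NY
```

ISO-8601 date strings compare correctly as plain strings, which is why this works in SQLite without any date casting.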


3. dbt snapshots: an easy batch SCD2

TL;DR: dbt snapshots implement SCD Type 2 for you. You point them at a table or model, configure how to detect changes, and run them on a schedule.

What snapshots actually do

On each dbt snapshot run, dbt:

  1. Reads the current state of your source/model.

  2. Compares it to the existing snapshot table.

  3. For rows that changed:

    • Finds the existing "current" version (where dbt_valid_to IS NULL) for that entity.

    • Updates it by setting dbt_valid_to = snapshot timestamp.

    • Inserts a new version with dbt_valid_from = snapshot timestamp, dbt_valid_to = NULL.

  4. Leaves unchanged rows alone.

  5. Keeps current rows with dbt_valid_to = NULL (or a configurable max date like 9999‑12‑31).
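The steps above can be sketched in a few lines of Python. This is a toy illustration, not dbt's actual implementation: change detection is simplified to a full-row comparison, and dbt_scd_id generation is omitted.

```python
from datetime import datetime, timezone

def run_snapshot(snapshot, source_rows, unique_key="customer_id"):
    """One simulated snapshot run. `snapshot` is a list of dicts with
    dbt_valid_from/dbt_valid_to columns; `source_rows` is the current
    state of the source table."""
    now = datetime.now(timezone.utc).isoformat()
    # Index the "current" versions (dbt_valid_to IS NULL) by entity key.
    current = {r[unique_key]: r for r in snapshot if r["dbt_valid_to"] is None}
    for row in source_rows:
        live = current.get(row[unique_key])
        attrs = {k: v for k, v in (live or {}).items()
                 if not k.startswith("dbt_")}
        if live is not None and attrs == row:
            continue                      # unchanged row: leave it alone
        if live is not None:
            live["dbt_valid_to"] = now    # expire the old version
        snapshot.append({**row,           # insert the new current version
                         "dbt_valid_from": now,
                         "dbt_valid_to": None})
    return snapshot
```

Running it twice with the same source state changes nothing; changing a tracked attribute expires the old row and appends a new one, exactly as in steps 3 and 4.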

You get:

  • dbt_valid_from

  • dbt_valid_to

  • dbt_scd_id (surrogate key, unique per version)

  • Optional extras (like dbt_is_deleted if you enable hard deletes).

How to configure snapshots (modern way)

As of dbt Core v1.9+, snapshots are configured in YAML, not in .sql blocks. If any transformation is needed, do it in a model (ephemeral or not), then snapshot that.

Example: snapshotting a raw customers table:

# snapshots/properties.yml
snapshots:
  - name: customer_snapshot
    relation: source('raw_database', 'customers')
    config:
      schema: snapshots
      unique_key: customer_id
      strategy: timestamp
      updated_at: updated_at

Key bits:

  • relation: what to snapshot (source() or ref()).

  • unique_key: stable business identifier for the entity (identifies the "same customer across versions").

  • strategy: how to detect changes.

  • updated_at: which column to use for timestamp strategy.

  • schema: where to store the snapshot table.

Optional, but very useful:

  • dbt_valid_to_current: store a max date instead of NULL for current rows (e.g. to_date('9999-12-31')), so queries can use a plain BETWEEN without COALESCE.

  • hard_deletes: 'new_record': add rows for hard deletions with a dbt_is_deleted flag, instead of silently losing them.
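Both options slot straight into the earlier YAML. A sketch (adjust the max-date expression to your warehouse's SQL dialect):

```yaml
# snapshots/properties.yml — same snapshot as above, with the two
# optional configs enabled
snapshots:
  - name: customer_snapshot
    relation: source('raw_database', 'customers')
    config:
      schema: snapshots
      unique_key: customer_id
      strategy: timestamp
      updated_at: updated_at
      dbt_valid_to_current: "to_date('9999-12-31')"
      hard_deletes: new_record
```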

Two change‑detection strategies

dbt supports two built‑in strategies to decide when "this row is different enough to version it."

1. timestamp (use this whenever you can)

strategy: timestamp
updated_at: updated_at

  • Compares only the updated_at column.

  • If updated_at is newer than what's in the snapshot, dbt records a new version.

  • Handles schema changes gracefully—new columns don't require config changes.

Constraint: updated_at must actually change whenever any tracked attribute changes. If the source only bumps it for some columns, you will miss changes in the others.

2. check (fallback when no good timestamp)

strategy: check
check_cols:
  - region
  - customer_segment

  • Compares the actual values of the listed columns against the snapshot.

  • If any differ, new version.

Pros:

  • Works when there's no reliable updated_at.

  • Explicit about which attributes you care about.

Cons:

  • Scales worse than timestamp as rows get wider and tables bigger, because more data must be compared per row.

  • Requires config maintenance when adding/removing columns.

Good rule of thumb:

If there is a trustworthy updated_at, use timestamp.
If there isn't, use check on a small, intentional subset of columns.

How often to run snapshots

dbt's own recommendation: somewhere between hourly and daily. If you feel tempted to run snapshots every minute, you're trying to force a batch tool to behave like real‑time CDC.

  • Run too rarely → you miss more changes between runs.

  • Run too often → you're paying a lot for a poor approximation of a log‑based system.

Running snapshots vs `dbt run`

Snapshots are not run as part of dbt run.

  • dbt run materializes your models.

  • dbt snapshot runs only your snapshots.

So you’ll generally want to schedule both:

  • dbt snapshot on your chosen cadence (hourly/daily), and

  • dbt run for your transformation models.
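A hypothetical scheduling setup (the project path is a placeholder; `dbt snapshot` and `dbt run` are the real commands) could be as simple as two cron entries:

```shell
# Snapshots hourly, models once a day.
# /path/to/project is a placeholder for your dbt project directory.
0 * * * *  cd /path/to/project && dbt snapshot
0 6 * * *  cd /path/to/project && dbt run
```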

Note: Snapshots only track changes from the moment you create and start running them. They can't reconstruct history from before the first snapshot run.


4. Putting it together in SQL

TL;DR: The pattern is always "join facts to the snapshot where the fact's date is between dbt_valid_from and dbt_valid_to (or its COALESCE)."

Point‑in‑time lookup

Basic "state on a given date":

SELECT 
  customer_id,
  region
FROM customer_snapshot
WHERE 
  '2024-03-15' BETWEEN dbt_valid_from
                   AND COALESCE(dbt_valid_to, CURRENT_DATE);

If you configure dbt_valid_to_current to store 9999‑12‑31 for current rows, the COALESCE can go away:

WHERE '2024-03-15' BETWEEN dbt_valid_from AND dbt_valid_to;

Sales by customer region at order time

SELECT
  cs.region,
  SUM(o.order_total) AS total_revenue
FROM fact_orders AS o
JOIN customer_snapshot AS cs
  ON o.customer_id = cs.customer_id
 AND o.order_date BETWEEN cs.dbt_valid_from
                      AND COALESCE(cs.dbt_valid_to, CURRENT_DATE)
GROUP BY
  cs.region;

The join condition is doing all the temporal heavy lifting: each order row finds the version of the customer record that was valid on the order date.

The result:

  • The 5 "New York" orders are counted for NY.

  • The 3 "California" orders are counted for CA.

  • No one has to hack around it in Looker/Power BI with window functions and CASE statements.
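The whole join can be reproduced with in-memory SQLite. This sketch uses the article's table names and the 5 NY / 3 CA example, with every order worth $100:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer_snapshot (
        customer_id TEXT, region TEXT,
        dbt_valid_from TEXT, dbt_valid_to TEXT
    );
    CREATE TABLE fact_orders (
        customer_id TEXT, order_date TEXT, order_total REAL
    );
""")
con.executemany("INSERT INTO customer_snapshot VALUES (?, ?, ?, ?)", [
    ("CUST-001", "NY", "2024-01-01", "2024-07-01"),
    ("CUST-001", "CA", "2024-07-01", None),
])
orders = [("CUST-001", f"2024-0{m}-10", 100.0) for m in range(2, 7)]    # 5 orders while in NY
orders += [("CUST-001", f"2024-0{m}-10", 100.0) for m in range(7, 10)]  # 3 orders after the move
con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", orders)

# Same temporal join as above: each order finds the customer version
# that was valid on its order date.
rows = con.execute("""
    SELECT cs.region, SUM(o.order_total)
    FROM fact_orders AS o
    JOIN customer_snapshot AS cs
      ON o.customer_id = cs.customer_id
     AND o.order_date BETWEEN cs.dbt_valid_from
                          AND COALESCE(cs.dbt_valid_to, DATE('now'))
    GROUP BY cs.region ORDER BY cs.region
""").fetchall()
print(rows)  # [('CA', 300.0), ('NY', 500.0)]
```

The five pre-move orders land in NY and the three post-move orders land in CA, with no BI-tool gymnastics.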


5. When snapshots are enough, and when you really need CDC

TL;DR: Snapshots are batch SCD2. They capture differences between runs, but not everything that happens in between those runs.

What snapshots are genuinely good at

  • Capturing most business‑relevant changes in CRUD‑style apps.

  • Making point‑in‑time joins trivial.

  • Giving analysts a reliable way to time‑travel dimensions using standard SQL.

  • Providing a simple audit trail for "what did this row look like before/after X?".

All with:

  • No log readers.

  • No Kafka, unless you otherwise need it.

  • No extra infra dependencies beyond "warehouse + dbt + scheduler".

Built‑in limitations (by design)

Snapshots cannot and will not:

  • Show every intermediate state between runs. If a field flips A → B → A between two snapshots, the snapshot will never know it changed at all.

  • React in real time. Until the next run, the snapshot is out of date.

  • Magically turn a flaky updated_at into a reliable change log. Garbage in, garbage out.
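The A → B → A blind spot is worth internalizing. A toy illustration (a hypothetical helper, not dbt code): snapshotting compares state at run time, so a flip that fully reverts between two runs is invisible.

```python
# Snapshot-style change detection only sees state at run time.
def changed_between_runs(state_at_last_run, state_at_this_run):
    return state_at_last_run != state_at_this_run

# region flips NY -> CA -> NY entirely between two runs:
seen_at_run_1 = {"customer_id": "CUST-001", "region": "NY"}
# ... intermediate CA state happens here, but no run observes it ...
seen_at_run_2 = {"customer_id": "CUST-001", "region": "NY"}

print(changed_between_runs(seen_at_run_1, seen_at_run_2))  # False: the flip is invisible
```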

If a regulator, auditor, or core business requirement says:

"Every change to this record must be permanently recorded, in order."

then snapshots are the wrong tool. That's where real CDC—consuming transaction logs or event streams—earns its keep.

A healthy mental model:

  • Snapshots = batch time‑travel for analytics.

  • CDC = continuous ledger for systems that must never forget anything.

“But why not just freeze the numbers?”

You can also solve some of these problems with incremental data marts:

  • Each run upserts a new batch of results for the current period (for example, sales_by_region_2024_Q1).

  • Older batches are never updated, so your dashboards don’t change after the period closes.

This works, but there’s a trade-off:

  • If your business logic changes later (new definition of “active customer”, new region mapping, etc.), you simply can’t re-run the past without rebuilding the whole mart.

  • So you’d either live with inconsistent logic across periods, or you’d lose the original “frozen” numbers.

Snapshots give you a different superpower: you preserve row-level history and can re-run models with new logic while still joining facts to the state that was true at the time.


6. Operations: keeping history without drowning in it

Snapshots accumulate forever. That's their job. But large history tables can still hurt if left completely unmanaged.

Practical guardrails:

  • Partition by dbt_valid_from (or similar) so most queries only scan relevant periods.

  • Cluster on a key like customer_id or order_id to speed up point‑in‑time joins.

  • Retention rules: if no one cares about row‑level history older than a few years, consider archiving or aggregating old partitions.

None of this is specific to dbt; it's just basic warehouse hygiene applied to a table that, by definition, only grows.


Summary

Concept     | Implementation
----------- | --------------
Problem     | Warehouse only remembers the latest state; business questions need the past.
Solution    | SCD Type 2: keep multiple versions per entity with validity windows.
Setup       | dbt snapshot defined in YAML; timestamp strategy with a good updated_at when possible.
Querying    | date BETWEEN dbt_valid_from AND COALESCE(dbt_valid_to, CURRENT_DATE)
Maintenance | Run dbt snapshot on a sensible schedule (hourly–daily); monitor table growth.
Limitations | No visibility into changes between runs; not a replacement for true CDC.

Next steps

Set up a small snapshot on a single dimension (like customers), wire one existing report to use it, and watch how many "wait, why did this change?" conversations disappear. Once the pattern clicks, the rest of the team can use it everywhere without having to understand CDC or event logs.

Note: This article builds on the official dbt snapshots documentation and config reference changed with dbt Core 1.9 and documented on the dbt Developer Hub as of January 2026 (current dbt Core version at time of writing: 1.11.2).