It was 2 AM on a Saturday when I realized we’d been loading data wrong for six months.


The situation: a customer dimension with three years of history needed to be backfilled after a source system migration. The previous team’s approach was straightforward—run the daily incremental process 1,095 times, once for each day of history. They estimated three weeks to complete.

What they hadn’t accounted for was how errors compound. By the time I looked at the data, we had 47,000 records with overlapping date ranges, 12,000 timeline gaps where customers seemed to vanish and reappear, and an unknowable number of missed changes from when source systems updated the same record multiple times in a single day.

The dimension wasn’t just wrong. It was unfixably wrong using traditional methods. Every incremental run had layered new errors on top of old ones, creating a Jenga tower of data quality issues that couldn’t be untangled without starting over.

That night, I started building what I now call a Healing Table—a dimension that can be completely rebuilt from source data at any point, “healing” whatever accumulated inconsistencies have crept in over months or years of incremental loads.

This isn’t just a technique for disaster recovery. It’s a fundamentally different approach to SCD Type 2 that separates change detection from period construction, processes entire historical ranges in a single pass, and produces dimensions that are deterministically reproducible from source data.

Here’s how it works.




Why Day-by-Day Backfills Fail


The traditional approach to backfilling historical SCD2 dimensions follows a pattern most data engineers know well: take your daily incremental process, wrap it in a loop, and execute it once for each historical date. Simple, reuses existing code, and catastrophically fragile.

The problems start immediately.

Errors compound across iterations. If your change detection logic misses an edge case on January 3rd, that incorrect record becomes the baseline for January 4th’s comparison. The error doesn’t stay contained—it propagates forward through every subsequent run. By the time you discover the issue months later, you can’t simply fix January 3rd because hundreds of downstream records depend on that incorrect state.

Performance degrades non-linearly. Each daily run needs to compare incoming records against the existing dimension. As the dimension grows through backfill, comparison costs increase. A process that takes 5 minutes for a single day’s delta might take 45 minutes per day when the target table contains three years of history. That three-week estimate becomes three months.

Source system quirks multiply. Real source systems don’t change data once per day at a predictable time. They update records multiple times, delete and recreate rows, and occasionally backdate changes. Day-by-day processing either misses these patterns entirely or handles them inconsistently across runs.

Recovery requires complete rebuild anyway. When—not if—something goes wrong, fixing it requires blowing away the dimension and starting over. But if you’re going to rebuild from scratch regardless, why not design for that from the beginning?

The Healing Tables framework embraces this reality. Instead of trying to prevent rebuilds, it makes them fast, reliable, and deterministic.




The Six-Step Framework


Healing Tables work by separating two concerns that traditional SCD2 implementations conflate: change detection (identifying when attributes changed) and period construction (building valid time slices with proper start and end dates).

Traditional approaches detect changes and construct periods simultaneously, comparing incoming records to existing dimension state. This creates tight coupling between current dimension contents and processing logic—if the dimension is wrong, future processing will also be wrong.

Healing Tables decouple these concerns through a six-step pipeline that operates entirely on source data, constructing the dimension from scratch without reference to any existing target state:

  1. Effectivity Table Creation — Extract all change points from sources
  2. Time Slice Generation — Build date ranges with proper valid_from/valid_to
  3. Source Table Joining — Conform attributes from multiple sources
  4. Hash Computation — Enable efficient change detection
  5. Row Compression — Eliminate consecutive identical states
  6. Validation Testing — Verify temporal integrity before loading

Let me walk through each step.




Step 1: Building the Effectivity Table


The Effectivity Table captures every moment when any tracked attribute changed for any business key. Think of it as a timeline of “something happened here” markers that we’ll later fill in with actual attribute values.

For sources with explicit timestamps, extraction looks like this:

-- Extract all change points from source
WITH source_ordered AS (
    SELECT
         customer_id
        ,customer_name
        ,customer_status
        ,credit_limit
        ,updated_at AS change_timestamp
        ,ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at
        ) AS version_sequence
    FROM
        raw_customers
    WHERE
        updated_at IS NOT NULL
),

-- Detect actual changes using LAG comparison
change_detection AS (
    SELECT
         *
        ,LAG(customer_name) OVER (
            PARTITION BY customer_id
            ORDER BY change_timestamp
        ) AS prev_name
        ,LAG(customer_status) OVER (
            PARTITION BY customer_id
            ORDER BY change_timestamp
        ) AS prev_status
        ,LAG(credit_limit) OVER (
            PARTITION BY customer_id
            ORDER BY change_timestamp
        ) AS prev_credit_limit
    FROM
        source_ordered
)

-- Keep only first record OR records where attributes actually changed
SELECT
     customer_id
    ,customer_name
    ,customer_status
    ,credit_limit
    ,change_timestamp
FROM
    change_detection
WHERE
    version_sequence = 1
    OR customer_name IS DISTINCT FROM prev_name
    OR customer_status IS DISTINCT FROM prev_status
    OR credit_limit IS DISTINCT FROM prev_credit_limit

The critical insight here is filtering to actual changes only. Raw source tables often contain records where nothing meaningful changed—perhaps a batch process touched every row, or a status field was updated and then immediately reverted. Including these non-changes creates unnecessary dimension versions that complicate queries and waste storage.

When working with daily snapshot sources (where you don’t have change timestamps, just daily extracts), the pattern shifts to detecting differences between consecutive snapshots:

WITH daily_ordered AS (
    SELECT
         customer_id
        ,customer_name
        ,customer_status
        ,snapshot_date
        ,ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY snapshot_date
        ) AS version_sequence
        ,LAG(customer_name) OVER (
            PARTITION BY customer_id
            ORDER BY snapshot_date
        ) AS prev_name
        ,LAG(customer_status) OVER (
            PARTITION BY customer_id
            ORDER BY snapshot_date
        ) AS prev_status
    FROM
        daily_customer_snapshots
)

SELECT
     customer_id
    ,customer_name
    ,customer_status
    ,snapshot_date AS effective_from
FROM
    daily_ordered
WHERE
    version_sequence = 1  -- First snapshot for this customer
    OR customer_name IS DISTINCT FROM prev_name
    OR customer_status IS DISTINCT FROM prev_status

The IS DISTINCT FROM operator handles NULL comparisons correctly—a common source of bugs in change detection logic. Standard != treats NULL as unknown, meaning NULL != 'value' returns NULL rather than TRUE. IS DISTINCT FROM treats NULL as a concrete value, giving us the comparison semantics we actually want.
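
A minimal demonstration of the difference, runnable on any engine that supports the operator:

-- Standard comparison returns NULL (row silently dropped from a WHERE);
-- IS DISTINCT FROM returns a definite TRUE or FALSE
SELECT
     'Active' <> NULL                AS standard_result    -- NULL
    ,'Active' IS DISTINCT FROM NULL  AS distinct_result    -- TRUE (change detected)
    ,NULL IS DISTINCT FROM NULL      AS both_null_result   -- FALSE (no change)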




Step 2: Generating Time Slices


Time Slices transform point-in-time change events into date ranges with valid_from and valid_to columns. The LEAD window function does the heavy lifting, calculating when each version expires based on when the next version begins:

WITH effectivity_data AS (
    SELECT
        customer_id, customer_name, customer_status, change_timestamp
    FROM
        effectivity_table
)

SELECT
     customer_id
    ,customer_name
    ,customer_status
    ,change_timestamp AS valid_from
    -- Calculate valid_to from next version's start date
    ,COALESCE(
        LEAD(change_timestamp) OVER (
            PARTITION BY customer_id
            ORDER BY change_timestamp
        ),
        DATE '9999-12-31'  -- High date for current records
    ) AS valid_to
    -- Flag current record for easy filtering
    ,CASE
        WHEN LEAD(change_timestamp) OVER (
            PARTITION BY customer_id
            ORDER BY change_timestamp
        ) IS NULL THEN TRUE
        ELSE FALSE
    END AS is_current
FROM
    effectivity_data

A few implementation decisions deserve attention here.

High dates versus NULLs for current records. Using 9999-12-31 for current record end dates simplifies BETWEEN queries but can confuse some BI tools that try to calculate date differences. Using NULL for valid_to has clearer semantic meaning but requires IS NULL predicates everywhere. My recommendation: use the high date for the column value, but also include an explicit is_current boolean flag. This gives query authors flexibility without forcing them to remember magic date values.
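
With both in place, a current-state query can use either predicate; the flag tends to read more clearly:

-- Current-state query: the flag and the high date are interchangeable
SELECT *
FROM
    dim_customer
WHERE
    is_current = TRUE  -- equivalent: valid_to = DATE '9999-12-31'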

Interval conventions matter. Use left-closed, right-open intervals where valid_from <= date < valid_to. This ensures every point in time maps to exactly one dimension version without ambiguity at boundaries. The alternative—inclusive on both ends—creates situations where midnight timestamps belong to two periods simultaneously.

Point-in-time lookup pattern:

SELECT d.*
FROM
    dim_customer d
WHERE
    @lookup_date >= d.valid_from
    AND @lookup_date < d.valid_to
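
The same half-open predicate carries over to fact loads. A sketch, assuming a hypothetical fact_orders table with customer_id and order_date columns:

-- Attach customer attributes as they were on the order date (fact_orders is illustrative)
SELECT
     f.order_id
    ,f.order_date
    ,d.customer_status
    ,d.credit_limit
FROM
    fact_orders f

    LEFT JOIN dim_customer d
    ON f.customer_id = d.customer_id
    AND f.order_date >= d.valid_from
    AND f.order_date < d.valid_to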



Step 3: Joining Multiple Sources


When dimension attributes come from multiple systems, you need to conform those sources before SCD2 processing begins. The temptation is to load directly from multiple sources into a single target—don’t do this. You’ll create conflicting versions where the same timestamp has different attribute combinations depending on which source processed first.

The solution is a Unified Timeline that collects all change points from all sources, then joins each source back using as-of-date logic:

-- Step 1: Collect ALL change points from ALL sources
WITH unified_change_points AS (
    SELECT customer_id, change_timestamp FROM crm_customers
    UNION
    SELECT customer_id, change_timestamp FROM erp_customers
    UNION
    SELECT customer_id, change_timestamp FROM web_profiles
),

-- Step 2: Build time slices from unified timeline
unified_time_slices AS (
    SELECT
         customer_id
        ,change_timestamp AS valid_from
        ,COALESCE(
            LEAD(change_timestamp) OVER (
                PARTITION BY customer_id
                ORDER BY change_timestamp
            ),
            TIMESTAMP '9999-12-31 23:59:59'
        ) AS valid_to
    FROM
        unified_change_points
),

-- Step 3: Join each source back using as-of logic
final_dimension AS (
    SELECT
         ts.customer_id
        ,ts.valid_from
        ,ts.valid_to
        -- CRM attributes (authoritative for name)
        ,crm.customer_name
        -- ERP attributes (authoritative for financial data)
        ,erp.credit_limit
        ,erp.payment_terms
        -- Web attributes (authoritative for contact preferences)
        ,web.email_opt_in
        ,web.preferred_contact_method
    FROM
        unified_time_slices ts

        LEFT JOIN crm_customers crm
        ON ts.customer_id = crm.customer_id
        AND crm.change_timestamp = (
            SELECT MAX(change_timestamp)
            FROM crm_customers
            WHERE customer_id = ts.customer_id
              AND change_timestamp <= ts.valid_from
        )

        LEFT JOIN erp_customers erp
        ON ts.customer_id = erp.customer_id
        AND erp.change_timestamp = (
            SELECT MAX(change_timestamp)
            FROM erp_customers
            WHERE customer_id = ts.customer_id
              AND change_timestamp <= ts.valid_from
        )

        LEFT JOIN web_profiles web
        ON ts.customer_id = web.customer_id
        AND web.change_timestamp = (
            SELECT MAX(change_timestamp)
            FROM web_profiles
            WHERE customer_id = ts.customer_id
              AND change_timestamp <= ts.valid_from
        )
)

SELECT * FROM final_dimension

Notice the correlated subqueries for each join—we’re finding the most recent version from each source that was effective at each unified timeline point. This handles cases where one source updates frequently while another updates rarely.
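
If the correlated subqueries become a bottleneck at your volumes, the same as-of logic can be expressed with a window function instead. A sketch for the CRM source only, assuming an engine that supports QUALIFY (Snowflake and DuckDB do; on Trino, wrap the ROW_NUMBER in a subquery and filter on it):

-- As-of join via ROW_NUMBER: keep the latest CRM version at or before each slice start
SELECT
     ts.customer_id
    ,ts.valid_from
    ,ts.valid_to
    ,crm.customer_name
FROM
    unified_time_slices ts

    LEFT JOIN crm_customers crm
    ON ts.customer_id = crm.customer_id
    AND crm.change_timestamp <= ts.valid_from
QUALIFY
    ROW_NUMBER() OVER (
        PARTITION BY ts.customer_id, ts.valid_from
        ORDER BY crm.change_timestamp DESC
    ) = 1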

Document attribute ownership explicitly. When CRM and ERP both have a customer name field, which wins? When web profile email conflicts with CRM email, which is authoritative? These decisions belong in documentation and code comments, not in tribal knowledge. Build a table:

Attribute     | Primary Source | Fallback | Resolution Rule
------------- | -------------- | -------- | ------------------
Customer Name | CRM            | ERP      | CRM always wins
Credit Limit  | ERP            | None     | ERP authoritative
Email         | Web            | CRM      | Most recent wins

Then implement those rules explicitly in your SQL using COALESCE or CASE expressions.
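
For the table above, those rules might look like the following column expressions inside the final SELECT; the email columns and the carried-through change timestamps are assumptions for illustration:

-- One expression per ownership rule (illustrative column names)
,COALESCE(crm.customer_name, erp.customer_name) AS customer_name  -- CRM wins, ERP fallback
,erp.credit_limit                               AS credit_limit   -- ERP authoritative, no fallback
,CASE
    WHEN web.change_timestamp >= crm.change_timestamp THEN web.email
    ELSE crm.email
 END                                            AS email          -- Most recent source wins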




Step 4: Hash Computation


Hash-based change detection replaces expensive multi-column comparisons with single-column hash comparisons. The Healing Tables framework uses a two-hash strategy: one hash for business key identification, another for attribute change detection.

key_hash identifies the business entity. Hash the natural key columns together:

-- Trino/Starburst syntax
lower(to_hex(sha256(to_utf8(
    COALESCE(CAST(source_system AS VARCHAR), '') || '|' ||
    COALESCE(CAST(customer_id AS VARCHAR), '')
)))) AS key_hash

row_hash detects when any tracked attribute changes. Hash all business attributes excluding keys and metadata:

lower(to_hex(sha256(to_utf8(
    COALESCE(customer_name, '^^NULL^^') || '|' ||
    COALESCE(customer_status, '^^NULL^^') || '|' ||
    COALESCE(CAST(credit_limit AS VARCHAR), '^^NULL^^') || '|' ||
    COALESCE(CAST(birth_date AS VARCHAR), '^^NULL^^')
)))) AS row_hash

The ^^NULL^^ placeholder distinguishes NULL values from empty strings—a critical detail that trips up many implementations. Without this, ('John', NULL, 'Active') and ('John', '', 'Active') produce identical hashes despite having semantically different data.

Why not use the built-in CONCAT_WS? Because CONCAT_WS skips NULL values entirely. The inputs ('A', NULL, 'B') and ('A', 'B') produce identical output, so genuinely different rows hash to the same value and real changes slip through undetected. Always wrap columns in COALESCE before concatenation.
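
A quick demonstration of the collision on engines where CONCAT_WS skips NULLs (MySQL, Spark SQL, and SQL Server, among others):

-- Both expressions return 'A|B', so rows with different data would hash identically
SELECT
     CONCAT_WS('|', 'A', NULL, 'B') AS with_null_column     -- 'A|B'
    ,CONCAT_WS('|', 'A', 'B')       AS without_null_column  -- 'A|B' as well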

Column ordering best practice: concatenate columns in a fixed, documented order; alphabetical is a common choice because it is unambiguous. This keeps hashes deterministic and reproducible across rebuilds. Note that adding a new tracked column changes the hash inputs, so the next full rebuild will regenerate every row_hash (which this pattern handles as a matter of course).
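
In dbt, one way to enforce consistent ordering and NULL handling is a small macro. A sketch using the Trino functions above; the macro name and file path are illustrative:

-- macros/row_hash.sql (illustrative)
-- Usage: {{ row_hash(['customer_name', 'customer_status', 'credit_limit', 'birth_date']) }} AS row_hash
{% macro row_hash(column_names) %}
lower(to_hex(sha256(to_utf8(
    {%- for col in column_names | sort %}
    COALESCE(CAST({{ col }} AS VARCHAR), '^^NULL^^')
    {%- if not loop.last %} || '|' || {%- endif %}
    {%- endfor %}
))))
{% endmacro %}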

Platform-specific hash functions vary:

Platform        | Function        | Syntax
--------------- | --------------- | ------------------------------------
Snowflake       | SHA2            | SHA2(expression, 256)
BigQuery        | SHA256 + TO_HEX | TO_HEX(SHA256(expression))
Trino/Starburst | sha256 + to_hex | to_hex(sha256(to_utf8(expression)))
SQL Server      | HASHBYTES       | HASHBYTES('SHA2_256', expression)



Step 5: Row Compression


Row compression eliminates consecutive duplicate states—adjacent time periods where all attributes are identical. This happens more often than you’d expect: source systems frequently touch records without changing meaningful data, or changes get reverted within the same processing window.

The technique is a classic Islands and Gaps pattern. Assign a group identifier to runs of consecutive identical rows, then collapse each group into a single time slice:

-- Step 1: Detect group boundaries
WITH boundary_detection AS (
    SELECT
         *
        ,CASE
            -- New group if: periods don't connect OR attributes changed
            WHEN LAG(valid_to) OVER (
                    PARTITION BY customer_id
                    ORDER BY valid_from
                ) = valid_from
             AND row_hash = LAG(row_hash) OVER (
                    PARTITION BY customer_id
                    ORDER BY valid_from
                )
            THEN 0  -- Continuation of previous group
            ELSE 1  -- Start of new group
        END AS is_group_start
    FROM
        dimension_with_hashes
),

-- Step 2: Assign group IDs via running sum
group_assignment AS (
    SELECT
         *
        ,SUM(is_group_start) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
            ROWS UNBOUNDED PRECEDING
        ) AS group_id
    FROM
        boundary_detection
)

-- Step 3: Collapse each group to single row
SELECT
     customer_id
    ,MIN(valid_from) AS valid_from
    ,MAX(valid_to) AS valid_to
    ,MAX(CASE WHEN is_current THEN 1 ELSE 0 END) = 1 AS is_current
    ,MAX(customer_name) AS customer_name  -- All identical within group
    ,MAX(customer_status) AS customer_status
    ,MAX(row_hash) AS row_hash
    ,MAX(key_hash) AS key_hash
    ,COUNT(*) AS compressed_row_count  -- Audit trail
FROM
    group_assignment
GROUP BY
    customer_id, group_id
ORDER BY
    customer_id, valid_from

Two conditions must both be true for rows to compress:

  1. Temporal contiguity: Previous row’s valid_to equals current row’s valid_from
  2. Attribute identity: row_hash values match

The contiguity check prevents incorrectly merging records that happen to have identical attributes but represent separate valid periods. If a customer had credit limit 10000 from January to March, then credit limit 5000 from March to June, then credit limit 10000 again from June to December—those are three distinct business states, not one, even though the first and third have identical attribute values.

Before compression (6 rows):

customer_id | valid_from | valid_to   | credit_limit
----------- | ---------- | ---------- | ------------
1002        | 2020-01-01 | 2020-01-05 | 40000
1002        | 2020-01-05 | 2020-01-09 | 40000
1002        | 2020-01-09 | 2020-01-11 | 30000
1002        | 2022-01-01 | 2022-03-01 | 30000
1002        | 2022-03-01 | 2022-05-01 | 30000
1002        | 2022-05-01 | 2022-06-01 | 30000

After compression (3 rows):

customer_id | valid_from | valid_to   | credit_limit
----------- | ---------- | ---------- | ------------
1002        | 2020-01-01 | 2020-01-09 | 40000
1002        | 2020-01-09 | 2020-01-11 | 30000
1002        | 2022-01-01 | 2022-06-01 | 30000

The gap between 2020-01-11 and 2022-01-01 correctly prevents merging the two “30000” groups—they represent distinct validity periods separated by time when the customer didn’t exist in source data.




Step 6: Validation Testing


Production SCD2 dimensions require comprehensive validation to ensure temporal integrity. These tests should run after every load and block pipeline completion on failure. Silent failures in dimension loading corrupt downstream fact tables in ways that are extremely difficult to diagnose later.

Test 1: Single current record per key

Each business key must have exactly one active (current) record:

-- CRITICAL: Fails if any key has multiple current records
SELECT
     customer_id
    ,COUNT(*) AS current_record_count
FROM
    dim_customer
WHERE
    is_current = TRUE
GROUP BY
    customer_id
HAVING
    COUNT(*) <> 1

-- Expected: Zero rows returned

Test 2: No overlapping date ranges

Time periods for the same key must not overlap. This is where I learned a painful lesson: the obvious LEAD-based check is insufficient.

-- INSUFFICIENT - Only checks adjacent records
WITH overlap_check AS (
    SELECT
         customer_id
        ,valid_from
        ,valid_to
        ,LEAD(valid_from) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
        ) AS next_valid_from
    FROM
        dim_customer
)
SELECT *
FROM
    overlap_check
WHERE
    next_valid_from < valid_to  -- Strict: touching boundaries are contiguous, not overlapping

This misses overlaps between non-adjacent records. Records at positions 1 and 3 could overlap even if record 2 doesn’t overlap with either. The comprehensive check requires a self-join:

-- COMPREHENSIVE - Checks all pairs
SELECT a.*, b.*
FROM
    dim_customer a

    INNER JOIN dim_customer b
    ON a.customer_id = b.customer_id
    AND a.valid_from <> b.valid_from  -- Different records (customer_id + valid_from is unique)
WHERE
    a.valid_from < b.valid_to
    AND a.valid_to > b.valid_from

-- Expected: Zero rows returned

Test 3: No timeline gaps

For dimensions requiring continuous history, adjacent periods must connect:

WITH gap_check AS (
    SELECT
         customer_id
        ,valid_from
        ,LAG(valid_to) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
        ) AS prev_valid_to
    FROM
        dim_customer
)
SELECT
     customer_id
    ,prev_valid_to AS gap_start
    ,valid_from AS gap_end
FROM
    gap_check
WHERE
    prev_valid_to IS NOT NULL
    AND
    prev_valid_to <> valid_from

-- Expected: Zero rows (or documented acceptable gaps)

Test 4: Valid date ordering

Every record must have valid_from < valid_to:

SELECT *
FROM
    dim_customer
WHERE
    valid_from >= valid_to

-- Expected: Zero rows

Test 5: No consecutive duplicate versions

After compression, no adjacent rows should have identical attributes:

WITH version_check AS (
    SELECT
         customer_id
        ,valid_from
        ,row_hash
        ,LAG(row_hash) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
        ) AS prev_row_hash
        ,LAG(valid_to) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
        ) AS prev_valid_to
    FROM
        dim_customer
)
SELECT *
FROM
    version_check
WHERE
    row_hash = prev_row_hash
    AND valid_from = prev_valid_to  -- Contiguous AND identical

-- Expected: Zero rows

In dbt, implement these as reusable tests:

# schema.yml
models:
  - name: dim_customer
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - customer_id
            - valid_from
    columns:
      - name: customer_id
        tests:
          - unique:
              config:
                where: "is_current = TRUE"
      - name: valid_from
        tests:
          - not_null
      - name: valid_to
        tests:
          - not_null
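
The generic tests above cover uniqueness and nulls; the overlap, gap, and ordering checks fit naturally as dbt singular tests. A sketch of the overlap check as a file under tests/ (the file name is illustrative), which passes when it returns zero rows:

-- tests/assert_no_overlapping_customer_periods.sql
-- dbt singular test: any returned row is a failure
SELECT
     a.customer_id
    ,a.valid_from
    ,a.valid_to
    ,b.valid_from AS overlapping_valid_from
    ,b.valid_to   AS overlapping_valid_to
FROM
    {{ ref('dim_customer') }} a

    INNER JOIN {{ ref('dim_customer') }} b
    ON a.customer_id = b.customer_id
    AND a.valid_from <> b.valid_from  -- Different records for the same key
WHERE
    a.valid_from < b.valid_to
    AND a.valid_to > b.valid_from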



The Self-Healing Property


The name “Healing Tables” comes from a property that emerges from this design: dimensions built this way can recover from accumulated data quality issues simply by reprocessing source data.

Traditional incremental SCD2 implementations are path-dependent—the current state depends on the exact sequence of historical loads. If Tuesday’s load had a bug, and Wednesday through Friday ran correctly, you can’t just re-run Tuesday. Those later loads built on Tuesday’s incorrect state.

Healing Tables are path-independent. The dimension produced depends only on source data contents, not on the history of load operations. Re-run any date range with corrected logic, and the result is identical to what you’d get from a full rebuild.

This property transforms how teams think about data quality issues:

  • Bug in change detection logic? Fix it, reprocess affected date range, done.
  • Source system correction applied retroactively? Reprocess includes those corrections automatically.
  • New attribute added to dimension? Reprocess rebuilds all historical versions with the new column.
  • Suspected data corruption? Rebuild and compare—if results differ, you’ve found your problem.

The dimension becomes reproducible in the same way good software builds are reproducible. Given the same inputs and logic, you always get the same outputs. This isn’t just convenient for debugging—it’s a fundamental quality characteristic that enables reliable data pipelines.




When to Use This Approach


Healing Tables aren’t appropriate for every situation. They work best when:

  • You have access to complete source history. The pattern assumes you can extract all historical changes, not just deltas. If your source system only provides “current state” with no history, you’ll need a different approach.

  • Source data is contiguous. The timeline construction assumes no gaps in source timestamps. If customers can appear, disappear, and reappear in source data, you’ll need additional logic to handle these discontinuities.

  • Rebuild time is acceptable. Processing years of history takes longer than processing a single day’s changes. For dimensions with billions of historical rows, full rebuilds might be impractical even with optimization.

  • Your sources don’t have hard deletes. The pattern assumes deletions are logical (flagged) rather than physical. If source systems physically remove rows, you’ll lose that history.

Traditional incremental approaches remain appropriate for:

  • Very high-volume dimensions where full rebuilds are impractical
  • Real-time or near-real-time requirements
  • Situations where source history is unavailable
  • Simple dimensions with stable, well-understood source systems

The choice isn’t binary. Many teams use Healing Tables for periodic “refresh” operations while running standard incremental processes daily. If the daily process accumulates issues, the weekly or monthly heal repairs them.




Wrapping Up


That 2 AM debugging session taught me something important: the traditional approach to SCD2—incremental changes applied to existing dimension state—creates tight coupling between historical load operations and current data quality. Every bug, every edge case, every source system quirk gets baked into the dimension permanently.

Healing Tables break that coupling. By extracting change points from source data, constructing time slices independently, compressing duplicate states, and validating results before loading, you create dimensions that are deterministically reproducible from source data.

The six-step framework isn’t complicated, but it requires discipline:

  1. Build the Effectivity Table with actual changes only
  2. Generate Time Slices with proper valid_from/valid_to boundaries
  3. Join multiple sources using unified timeline logic
  4. Compute hashes for efficient change detection
  5. Compress consecutive identical states
  6. Validate temporal integrity before loading

Implement those steps correctly, and you’ll have dimensions that can heal themselves from whatever accumulated issues have crept in over months or years of operation.

The next time someone asks you to backfill three years of history by running the daily process 1,095 times, you’ll have a better answer.