In [Part I], I made the case that your legacy data model isn’t the disaster it looks like. That the strange WHERE clauses, the bridge tables nobody can explain, and the slowly-changing-dimension-within-a-slowly-changing-dimension aren’t bugs — they’re business rules earned through years of production reality. I argued that big-bang rebuilds fail at alarming rates, that the complexity you’re fighting is mostly essential rather than accidental, and that the impulse to “start from scratch” is driven more by cognitive bias than by engineering judgment.

A few people messaged me afterward and said, roughly: “Okay, I’m convinced. But what am I supposed to do about it?”

Fair question. Let’s talk about the playbook.




The vine that eats the tree


In 2004, Martin Fowler was visiting a rainforest in Queensland when he noticed something remarkable about the strangler fig trees. These vines start as seeds deposited in the upper branches of a host tree by birds. They grow downward, wrapping around the trunk, eventually establishing their own root system in the soil. Over years — sometimes decades — the fig gradually replaces the host tree. The original tree decays inside the fig’s lattice until there’s nothing left of it. But at no point during this process does the canopy collapse. The forest continues to function the entire time.

Fowler looked at those trees and thought: that’s how you should replace software systems.

The Strangler Fig pattern is, in my experience, the single most useful idea in all of software architecture — and it maps onto data warehouse migrations with almost uncomfortable precision. The concept is deceptively simple: instead of replacing the old system all at once, you build the new system around it. You intercept one piece of functionality at a time, route it through the new implementation, verify it works, and move on. The old system slowly shrinks. The new system slowly grows. And at no point does the business lose access to anything.

Fowler updated his essay on this pattern as recently as August 2024, and his framing is worth internalising. He described the alternative — the big-bang rewrite — and noted that he’d watched it fail most of the time. The strangler fig approach, by contrast, gives value steadily and allows you to monitor progress more carefully through frequent releases.

Now, here’s where it gets interesting for data teams specifically. Application systems have clean interfaces — APIs, endpoints, well-defined contracts. You can intercept a request, route it to the new system, and nobody downstream knows the difference. Data systems are messier. Elliott Cordo, who documented migrating three legacy data warehouses using this pattern, identified the core challenge: analytics systems have porous boundaries and poorly defined interfaces with consumers. You don’t always know who’s querying what, or how.

Cordo’s solution — and this is the variant I’ve seen work best — is what he calls the Legacy Façade. You build a new interface that mimics the legacy system’s contract. Same table names, same column names, same output format. Behind the façade, you’ve plumbed in the new logic. Consumers don’t change anything. They don’t even need to know a migration is happening. And this, critically, eliminates the organisational bottleneck that kills most migrations: getting twenty different teams to simultaneously update their queries, dashboards, and downstream pipelines to point at the new system.
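To make the façade concrete: at its simplest it's a view that keeps the legacy table and column names while selecting from the new model underneath. The sketch below is illustrative — every name in it (`LEGACY.CUSTOMER_MASTER`, `analytics.dim_customer`, the column mapping) is invented, and the helper just emits the DDL so the legacy-to-new mapping lives in one visible place:

```python
# Sketch of a Legacy Façade: a view that preserves the legacy contract
# (table and column names) while reading from the new model underneath.
# All names here are illustrative, not from any real system.

LEGACY_TO_NEW = {
    # legacy column name -> column in the new model
    "CUST_NM": "customer_name",
    "CUST_SEG_CD": "segment_code",
    "REV_AMT": "lifetime_revenue",
}

def facade_view_ddl(legacy_name: str, new_relation: str, mapping: dict) -> str:
    """Emit a CREATE VIEW statement that exposes the new model
    under the legacy table's name and column names."""
    select_list = ",\n    ".join(
        f"{new_col} AS {legacy_col}" for legacy_col, new_col in mapping.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {legacy_name} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {new_relation};"
    )

ddl = facade_view_ddl("LEGACY.CUSTOMER_MASTER", "analytics.dim_customer", LEGACY_TO_NEW)
print(ddl)
```

Consumers keep querying `LEGACY.CUSTOMER_MASTER` and never notice that the plumbing behind it changed.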

Cordo’s experience is particularly telling because one of the three warehouses he migrated existed precisely because a previous attempt to replace one of the others had failed — mainly due to change management. The strangler fig approach bypasses that failure mode entirely by asking consumers to do little or nothing to adopt the new system.

I can’t emphasise this enough: the majority of data warehouse migrations don’t fail for technical reasons. They fail because you can’t convince the entire organisation to cut over on the same Tuesday. The Legacy Façade removes that dependency. You migrate at your pace, not at the pace of every team that consumes your data.




Stop publishing before you’ve checked


I’m continually surprised by how few people have heard of this pattern. When Robin Moffatt polled data engineers on Reddit about Write-Audit-Publish, most hadn’t even encountered the term — let alone used it. That’s a problem, because WAP solves the single scariest moment in any refactoring effort: the moment you push changed logic to production and hope you didn’t break anything.

Netflix introduced WAP at the DataWorks Summit in 2017, and the concept is elegant in its simplicity. Instead of the typical pipeline flow — transform data, write it to the production table, then maybe run some checks afterward — you split the process into three distinct stages:

Write to an isolated staging area. Not production. Not anywhere consumers can see it.

Audit the output against your quality checks. Row counts. Schema validation. Business rule assertions. Whatever gives you confidence.

Publish only if the audit passes. If it doesn’t, production never sees the bad data. Nobody’s dashboard breaks. Nobody’s quarterly report includes garbage. The corrupted data dies quietly in staging where it belongs.
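The three stages above can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in warehouse; the table name and the specific audit checks are assumptions for the example, not prescriptions:

```python
import sqlite3

# Minimal Write-Audit-Publish sketch. SQLite stands in for the warehouse;
# the table and checks are illustrative.

def write_audit_publish(conn, rows):
    cur = conn.cursor()

    # WRITE: load into an isolated staging table, never straight to prod.
    cur.execute("DROP TABLE IF EXISTS orders_staging")
    cur.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # AUDIT: run quality checks against staging only.
    n_rows = cur.execute("SELECT COUNT(*) FROM orders_staging").fetchone()[0]
    n_bad = cur.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount IS NULL OR amount < 0"
    ).fetchone()[0]
    if n_rows == 0 or n_bad > 0:
        cur.execute("DROP TABLE orders_staging")  # bad data dies in staging
        conn.commit()
        return False

    # PUBLISH: promote staging into the production name only after the
    # audit passes.
    cur.execute("DROP TABLE IF EXISTS orders")
    cur.execute("ALTER TABLE orders_staging RENAME TO orders")
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
write_audit_publish(conn, [(1, 10.0), (2, 25.5)])   # audit passes, published
write_audit_publish(conn, [(3, -1.0)])              # audit fails, prod untouched
```

The second run fails its audit, and the production `orders` table still holds the two rows from the first run — which is the whole point.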

The reason this matters for refactoring specifically is that refactoring requires you to change transformation logic while producing identical output. You’re restructuring the internals — breaking a 400-line SQL query into modular CTEs, moving logic from stored procedures into dbt models, replacing hardcoded values with reference table lookups. Each of these changes is supposed to be behaviour-preserving. WAP gives you the safety net to verify that it actually was.

Julien Hurault described the common anti-pattern — what he calls Write-(Publish)-Audit — where if an error is detected during testing, it’s already too late because the corrupted data has already been released to downstream systems. I’ve seen this play out more times than I want to admit. Someone refactors a model, the tests pass in the development environment, the change gets deployed, and then a stakeholder notices that revenue is off by 12% in a dashboard three hours later. By that point, the bad data has propagated through six downstream models and someone’s already screenshot the dashboard for a presentation.

WAP prevents this entirely. And with Apache Iceberg now supporting it natively through branch-based isolation, the tooling has finally caught up with the idea. You write to an audit branch, run your validations, and only fast-forward merge to main when everything checks out. It’s the data engineering equivalent of a pull request with automated tests — and honestly, it’s bizarre that we’ve accepted as normal the practice of deploying data transformations straight to production without a verification step in between.

If you’re on Snowflake, the implementation is straightforward using CLONE and SWAP operations. Clone the production table, run your refactored pipeline against the clone, validate the output, and swap the clone into the production role. If validation fails, the clone gets dropped and production is untouched. I’ve written about this pattern in more detail here — but the key point for this article is simpler: WAP turns refactoring from a high-wire act into a methodical, reversible process.
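For illustration, here is that clone-validate-swap sequence sketched as the ordered statements a pipeline might issue. The table name and `_WAP` suffix are assumptions; `CREATE TABLE … CLONE` and `ALTER TABLE … SWAP WITH` are the Snowflake operations named above:

```python
# Sketch of the Snowflake clone-validate-swap sequence as the ordered
# statements a pipeline would issue. Names are illustrative.

def clone_swap_statements(table: str) -> list:
    clone = f"{table}_WAP"
    return [
        # 1. Zero-copy clone of production as an isolated workspace.
        f"CREATE OR REPLACE TABLE {clone} CLONE {table}",
        # 2. Run the refactored pipeline against the clone (placeholder).
        f"-- run refactored transformations and validations against {clone}",
        # 3. Validation passed: atomically promote the clone to production.
        f"ALTER TABLE {clone} SWAP WITH {table}",
        # 4. Drop what is now the old production copy.
        f"DROP TABLE {clone}",
    ]

stmts = clone_swap_statements("FCT_ORDERS")
```

If validation fails at step 2, you simply drop the clone instead of swapping, and production never changes.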




The archaeology of SQL


Let’s talk about how refactoring a data model actually works in practice, because it’s less dramatic than people expect. That’s sort of the point.

dbt Labs published a refactoring course that formalises the approach, and having used variants of this workflow on multiple teams, I think they’ve got it right. The core philosophy is one sentence: while refactoring, you’ll be moving around a lot of logic, but ideally you won’t be changing the logic.

Read that again. You are not improving business rules during a refactor. You are not fixing data quality issues during a refactor. You are not adding new features during a refactor. You are rearranging the existing logic into a better structure while proving, at every step, that the output hasn’t changed. The improvements come later, once the structure is clean enough to support them safely.

The workflow goes something like this. First, you take the legacy SQL — the 800-line stored procedure, the five nested subqueries, the transformation that nobody wants to touch because the last person who edited it left the company — and you port it into a dbt model unchanged. You don’t clean it up. You don’t refactor it. You bring it over exactly as it is, warts and all, and you verify that it produces identical output.

This is the step most people want to skip, and it’s the most important one. That ugly SQL is your baseline. It’s your source of truth. Until you’ve proved your refactored version matches it row-for-row, column-for-column, you don’t have a refactor — you have a rewrite wearing a refactor’s clothing.

Once you’ve got your baseline, you start decomposing. Extract the source references into dbt sources. Break the monolithic query into CTEs, each one doing a single logical operation. Move shared logic into staging models. Separate business rules from data cleaning from aggregation. At each step — and this is non-negotiable — you use something like dbt’s audit_helper package to compare the output of your refactored model against the original. If the outputs match, you move on. If they don’t, you figure out what you changed and fix it before going further.

Tristan Handy talks about his formative experience at Casper in 2016, where his team refactored all of their existing pipelines and brought them over to dbt in a single week, delivering work at least ten times faster than they would have been able to do otherwise. That speed came not from heroics but from methodology: migrate unchanged, then decompose with verification at every step.

I’ve watched a team refactor over 200 dbt models with complex logic in three days using audit_helper. Three days. Not three months. Not the eighteen-month rebuild timeline that some stakeholder approved in a steering committee. Three days, because the tooling verified every change automatically and the team could move fast with confidence.

That’s the real payoff of refactoring over rebuilding. Not that it’s safer — though it is. Not that it preserves business logic — though it does. The payoff is that it’s faster. Faster to start delivering value. Faster to reach a clean architecture. Faster to build the confidence you need to actually make meaningful improvements.




Expand, then contract


There’s a pattern for handling breaking schema changes that I don’t see data teams use nearly enough, and it deserves more attention. Fowler calls it Parallel Change. Others call it Expand and Contract. The name doesn’t matter.

Imagine you need to rename a column in a fact table that thirty dashboards reference. In a rebuild mindset, you’d design the new schema, build it, and then coordinate a cutover where every consumer updates their queries simultaneously. Good luck with that. In my experience, “simultaneously” means “over the course of several painful weeks, with a growing list of things that are broken in the meantime.”

Expand and Contract takes a different approach, in three phases:

In the Expand phase, you add the new column alongside the old one. Both exist. Both contain the same data. Nothing breaks, because nothing has been removed.

In the Migrate phase, you update consumers one at a time. Dashboard A switches to the new column. Pipeline B switches. Report C switches. Each migration is independent, low-risk, and reversible. There’s no coordination required between teams.

In the Contract phase — and only after every consumer has migrated — you drop the old column. By this point, nobody is using it, so removing it is a non-event.
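Played out on a toy table, the three phases look like this. SQLite stands in for the warehouse, the table and column names are invented, and the contract step rebuilds the table rather than dropping the column (older SQLite versions lack `DROP COLUMN`; in a real warehouse it would be a plain `ALTER TABLE … DROP COLUMN`):

```python
import sqlite3

# Expand-Migrate-Contract on a toy fact table. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id INTEGER, cust_nm TEXT)")
conn.execute("INSERT INTO fct_orders VALUES (1, 'Acme')")

# EXPAND: add the new column alongside the old one and backfill it.
# Nothing breaks, because nothing has been removed.
conn.execute("ALTER TABLE fct_orders ADD COLUMN customer_name TEXT")
conn.execute("UPDATE fct_orders SET customer_name = cust_nm")

# MIGRATE: consumers switch from cust_nm to customer_name one at a
# time; each switch is independent and reversible.
row = conn.execute("SELECT customer_name FROM fct_orders").fetchone()
assert row[0] == "Acme"

# CONTRACT: once nothing references cust_nm, remove it. SQLite-safe
# version: rebuild the table without the old column.
conn.executescript("""
    CREATE TABLE fct_orders_new (order_id INTEGER, customer_name TEXT);
    INSERT INTO fct_orders_new SELECT order_id, customer_name FROM fct_orders;
    DROP TABLE fct_orders;
    ALTER TABLE fct_orders_new RENAME TO fct_orders;
""")
conn.commit()
```

Each phase is a separate, independently safe deployment; the dangerous part — removal — happens only when it has become a non-event.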

Does this take more deployments? Yes. It might require up to five deployments before the change is fully in effect. But each deployment is safe. Each deployment is reversible. And at no point does anyone lose access to their data because you renamed a column in a way they weren’t expecting.

For data models specifically, Expand and Contract shines when you’re restructuring dimensions, splitting fact tables, or changing grain. Add the new structure alongside the old one. Migrate consumers gradually. Remove the old structure once it’s no longer referenced. It’s boring. It’s methodical. It works.




The ship that replaced itself


If you want a single case study that demonstrates what successful incremental replacement looks like at scale, look at what Slack did with their desktop client in 2019.

Slack’s engineering team faced a classic dilemma. Their desktop application — built on jQuery and a custom framework — had accumulated years of complexity. It was slow. It used too much memory. The architecture made certain improvements difficult or impossible. Every instinct said: rewrite it in React.

Their engineering blog explicitly acknowledged the risk: running code knows things. Hard-won knowledge gained through billions of hours of cumulative usage and tens of thousands of bug fixes.

So instead of a big-bang rewrite, they did something much harder and much smarter. Over two years, they replaced every component of the desktop client while shipping continuously. They called it modernising “bit by bit” — enforcing strict interfaces between existing and modern code, shipping every change to production as it was completed, and never asking users to endure a disruptive cutover.

The results: 50% memory reduction, 33% load time improvement. Users never experienced a broken release. Had they waited until the entire application was rewritten before releasing it, their users would have had a worse experience for years while the team worked in a vacuum.

Now, Slack is an application, not a data warehouse. But the principles translate directly. Strict interfaces between old and new code? That’s the Legacy Façade. Shipping continuously? That’s WAP combined with Expand and Contract. Never asking users to endure a disruption? That’s the entire point.

The ancient Greeks had a thought experiment about this: if you replace every plank of a ship over time, is it still the same ship? Philosophers argue about it. Engineers don’t care. The ship still floats. The passengers still get where they’re going.




The numbers, for the sceptics


I held back on the quantitative evidence in Part I because I wanted to make the argument from experience and conviction. But some people want receipts, so here they are.

McKinsey and Oxford University studied 5,400 IT projects with initial budgets exceeding $15 million. On average, large projects ran 45% over budget and 7% over time, while delivering 56% less value than predicted. Seventeen percent were “black swans” — cost overruns exceeding 200% that threatened the company’s viability. One retailer abandoned a $1.4 billion IT modernisation, failed on a $600 million follow-up, and filed for bankruptcy.

The Standish Group has been tracking IT project outcomes since 1994. In their original CHAOS report, only 16.2% of projects succeeded on time, on budget, with specified features. Average cost overrun: 189%. By 2020, the overall success rate had improved to 31% — but here’s the statistic that matters most: small projects succeeded approximately 90% of the time. Large projects succeeded less than 10% of the time.

Read that again. Small projects: 90%. Large projects: under 10%.

That’s not a technology problem. That’s a scope problem. And the single most reliable way to reduce scope is to stop rebuilding entire systems and start refactoring them one piece at a time.

For data systems specifically, Gartner reports that more than 50% of data warehouse projects fail to reach user acceptance. Other analyses put the number closer to 80%. CDInsights reports that 70% of data warehouse modernisation projects exceed budget or fail. Integrate.io found that approximately 80% of data migration projects exceed timelines or budgets, with large-scale projects showing 50% higher failure rates than incremental approaches.

I don’t know how much clearer the data can be. Big-bang rewrites of data platforms fail most of the time. Incremental approaches succeed most of the time. The methodology I’ve described in this article — strangler fig, WAP, dbt refactoring, expand and contract — isn’t just safer or more philosophically sound. It’s what the evidence says actually works.




Refactoring as a practice, not a project


Refactoring should not be a project. It should not have a start date and an end date and a steering committee and a Gantt chart. Refactoring should be a practice — something your team does continuously, as a natural part of how you work with your data models.

Every time you touch a model to add a feature or fix a bug, you leave it a little cleaner than you found it. You extract a hardcoded value into a reference table. You add a test that didn’t exist before. You rename a column from col_7 to something a human can understand. You break a 200-line CTE into two smaller, focused ones.

None of these changes are dramatic. None of them require a business case or a migration plan. They’re small, safe, incremental improvements that compound over time. A year from now, the model is cleaner, better-tested, and easier to understand — not because you stopped everything to rebuild it, but because you improved it a little bit every time you were in the neighbourhood.

This is what Fowler meant when he defined refactoring as changing internal structure without modifying observable behaviour. It’s not a phase. It’s a discipline. And it’s the discipline that separates data teams that thrive from data teams that periodically torch everything and start over every three years, wondering why they never seem to make lasting progress.




The closing argument


In Part I, I said your data model isn’t broken — it’s battle-scarred, and those scars are knowledge. In Part II, I’ve tried to show you what the alternative to demolition looks like: patient, methodical, verified improvement. Strangler figs instead of bulldozers. Write-Audit-Publish instead of deploy-and-pray. Expand and contract instead of coordinate-and-hope.

It’s not glamorous work. Nobody’s going to invite you to give a conference talk about the time you renamed some columns and extracted a few staging models. You won’t get a promotion for migrating a legacy fact table so smoothly that nobody noticed it happened.

But that’s exactly the point. The best infrastructure work is invisible. The best migration is the one where the business never had to care. The best data model is the one that quietly absorbed twenty years of business complexity and still answers questions accurately at 7am on a Monday when a Department Head needs numbers for a board meeting.

I wrote Part I of this series last week, but the argument is decades old. The technology has changed — Netscape to Snowflake, CGI scripts to dbt — but the lesson hasn’t. Old code that works is more valuable than new code that doesn’t exist yet. Embedded knowledge is harder to create than clean architecture. And the patient, unglamorous work of incremental improvement will always outperform the seductive fantasy of starting fresh.

Your data model isn’t broken. Stop trying to replace it. Start making it better.