<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Pipeline Architecture on Ghost in the data</title><link>https://ghostinthedata.info/tags/pipeline-architecture/</link><description>Ghost in the data</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Ghost in the data</copyright><lastBuildDate>Sat, 09 May 2026 09:00:00 +1000</lastBuildDate><atom:link href="https://ghostinthedata.info/tags/pipeline-architecture/index.xml" rel="self" type="application/rss+xml"/><item><title>The Broken Window in Your Data Pipeline</title><link>https://ghostinthedata.info/posts/2026/2026-05-09-broken-window-theory/</link><pubDate>Sat, 09 May 2026 09:00:00 +1000</pubDate><guid>https://ghostinthedata.info/posts/2026/2026-05-09-broken-window-theory/</guid><author>Chris Hillman</author><description>A single ignored data quality issue doesn't stay local. In pipelines, broken windows travel — and by the time anyone notices, the damage is already downstream.</description><content:encoded>&lt;p>There&amp;rsquo;s a particular kind of data problem that doesn&amp;rsquo;t announce itself. It accumulates.&lt;/p>
&lt;p>We were receiving Salesforce data through delta extraction — sensible in theory, because full snapshots can run to hundreds of terabytes and less than 1% of records change on any given day. The problem is that deltas require someone to know what &amp;ldquo;changed&amp;rdquo; means. In Salesforce, that&amp;rsquo;s less obvious than it sounds. Watch a &lt;code>last_modified&lt;/code> column and you&amp;rsquo;ll miss objects that get updated when a related object changes, without their own timestamp reflecting it.&lt;/p>
&lt;p>Over time: drift. Orphan records. Data that &lt;em>looks&lt;/em> current because the record exists, but isn&amp;rsquo;t. The fix was documented — run a full snapshot periodically to correct the accumulated drift, painful as that was — and everyone with any context on the system knew it.&lt;/p>
&lt;p>What happened was roughly this: the drift accumulated silently until something downstream looked wrong. An investigation traced it back to the delta logic. The full snapshot was run. The problem was resolved. Notes were written. The workaround was filed away.&lt;/p>
&lt;p>And then, about twelve months later, the same conversation happened again.&lt;/p>
&lt;hr>
&lt;p>What made that moment stick wasn&amp;rsquo;t the technical failure. It was the recognition that the window had been broken for a while. Everybody who walked past it knew it was broken. We&amp;rsquo;d even put a note next to it.&lt;/p>
&lt;p>I&amp;rsquo;ve been that person — the one who wrote the documentation, filed the Jira ticket, and moved on. Which is probably why I remember the twelve-month cycle so clearly. And why I started paying attention to the pattern of it, across teams and companies, long after that particular incident.&lt;/p>
&lt;p>We just hadn&amp;rsquo;t fixed it.&lt;/p>
&lt;hr>
&lt;p>Bear with me here, because what I&amp;rsquo;m about to describe starts with an abandoned car in the Bronx in 1969 — and I promise it ends somewhere relevant to your dbt models.&lt;/p>
&lt;br/>
&lt;br/>
&lt;h3 id="a-broken-window-in-the-bronx-a-broken-window-in-your-warehouse">A broken window in the Bronx, a broken window in your warehouse&lt;/h3>
&lt;br/>
&lt;p>A Stanford psychologist named Philip Zimbardo ran a strange experiment. He abandoned two cars — one in the Bronx, one in Palo Alto — and watched what happened.&lt;/p>
&lt;p>The Bronx car was stripped within twenty-four hours. Within three days it was completely gutted.
The Palo Alto car sat untouched for a week — until Zimbardo himself walked up with a sledgehammer and broke a window. Within hours, it had been stripped too.&lt;/p>
&lt;p>Same car. Different signal.&lt;/p>
&lt;p>Thirteen years later, the political scientist James Q. Wilson and the criminologist George Kelling built a theory on that experiment. If a window in a building is broken and left unrepaired, they argued, all the rest will soon follow. Not because there&amp;rsquo;s a particular breed of window-breaker lurking around, but because an unrepaired window sends a message: &lt;em>nobody here cares&lt;/em>. And once that signal is broadcast, breaking more windows costs nothing.&lt;/p>
&lt;p>The insight was semiotic, not structural. It wasn&amp;rsquo;t about windows. It was about what an unrepaired window communicates about the norms of a place.&lt;/p>
&lt;p>The theory eventually found its way into software. Andrew Hunt and David Thomas put it into &lt;em>The Pragmatic Programmer&lt;/em> almost verbatim: don&amp;rsquo;t leave broken windows — bad designs, wrong decisions, poor code — unrepaired. They&amp;rsquo;d watched clean systems deteriorate quickly once windows started breaking. The mechanism was the same. A developer looking at a messy codebase thinks: &lt;em>if someone else got away with being careless, maybe I can too.&lt;/em> The norm shifts. The entropy accelerates.&lt;/p>
&lt;p>A controlled experiment published in &lt;em>Empirical Software Engineering&lt;/em> — twenty-nine developers working in codebases seeded with either high or low technical-debt density — found exactly what Hunt and Thomas had intuited. Pre-existing debt measurably caused developers to introduce &lt;em>more&lt;/em> debt. Non-descriptive variable names. Duplicated logic instead of reuse. Additional code smells. The broken window was statistically contagious.&lt;/p>
&lt;p>It&amp;rsquo;s a compelling idea, and it maps well to software. But it doesn&amp;rsquo;t map perfectly to data engineering. And the gap between &amp;ldquo;maps well&amp;rdquo; and &amp;ldquo;maps perfectly&amp;rdquo; is where teams get into serious trouble.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="the-thing-thats-different-about-data">The thing that&amp;rsquo;s different about data&lt;/h3>
&lt;br/>
&lt;p>When a broken window appears in a codebase, it&amp;rsquo;s local. A messy function, an undocumented module, a class that&amp;rsquo;s doing five things at once — these are ugly, they invite imitation, and they slow future development. But they stay where they are. They don&amp;rsquo;t go anywhere.&lt;/p>
&lt;p>A broken window in a data pipeline doesn&amp;rsquo;t stay where it is.&lt;/p>
&lt;p>It travels.&lt;/p>
&lt;p>That&amp;rsquo;s the thing nobody really talks about when they apply broken-windows thinking to data work. In a pipeline, everything is connected. A schema drift in a source table doesn&amp;rsquo;t just make that table annoying to work with — it silently corrupts every model downstream that touches that field. Which means it corrupts every dashboard that uses those models. Which means it corrupts the metrics those dashboards expose. Which means it corrupts the business decisions made from those metrics.&lt;/p>
&lt;p>The broken window is in row one. By the time someone notices, the damage is in the boardroom.&lt;/p>
&lt;p>And unlike software, where the damage is visible — a stacktrace, a failing build, a crash — data damage is often invisible. The pipeline still runs. The dashboard still renders. The report still reconciles. The numbers just happen to be wrong, quietly, for reasons nobody can easily trace.&lt;/p>
&lt;p>Software bugs produce noise. Bad data produces silence.&lt;/p>
&lt;p>In software, the broken window signals disorder and invites imitation. In data pipelines, it does all of that &lt;em>and&lt;/em> it propagates at machine speed through every downstream system that trusts the upstream to be clean. By the time the propagation is discovered, it&amp;rsquo;s usually been underway for a while.&lt;/p>
&lt;p>This is the propagation problem. It&amp;rsquo;s not just that bad data begets more bad data. It&amp;rsquo;s that one bad window can contaminate an entire watershed.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="what-a-contaminated-watershed-actually-looks-like">What a contaminated watershed actually looks like&lt;/h3>
&lt;br/>
&lt;p>In May 2022, Unity Software — the game engine company — disclosed something remarkable in their SEC earnings filing. Their Audience Pinpointer ML model, the system that powered their ad targeting business, had ingested bad data from a large customer. That data had corrupted their training set.&lt;/p>
&lt;p>CEO John Riccitiello on the earnings call: &amp;ldquo;we lost the value of a portion of our training data due in part to us ingesting bad data from a large customer.&amp;rdquo;&lt;/p>
&lt;p>The bad data didn&amp;rsquo;t just produce one wrong prediction. It poisoned the model weights. Every subsequent retrain was built on a contaminated foundation. The model had to be taken offline, the bad data removed, and training restarted from scratch. The estimated impact was $110 million in revenue for 2022. The stock dropped 37% in a single day. Market cap losses in the billions.&lt;/p>
&lt;p>The broken window wasn&amp;rsquo;t in Unity&amp;rsquo;s systems — it was in data they were &lt;em>ingesting&lt;/em>. Once it crossed the boundary into their training pipeline, it propagated in the only direction data knows: forward and downstream, embedding itself into every layer of the system that touched it.&lt;/p>
&lt;p>Riccitiello&amp;rsquo;s pledge after the fact was almost poignant: &amp;ldquo;We are deploying monitoring, alerting and recovery systems and processes to promptly mitigate future events.&amp;rdquo; The observability came after the disaster. The window had already broken every other window in the building.&lt;/p>
&lt;hr>
&lt;p>Incidents like Unity&amp;rsquo;s are the normal failure mode of connected data systems. The propagation isn&amp;rsquo;t a bug in the design — it&amp;rsquo;s inherent to the architecture. Data flows in one direction. Trust flows with it. The moment you have a pipeline, you have propagation risk.&lt;/p>
&lt;p>The question isn&amp;rsquo;t whether your broken windows will propagate. It&amp;rsquo;s how far they&amp;rsquo;ll travel before anyone notices.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="the-tools-we-built-to-tolerate-this">The tools we built to tolerate this&lt;/h3>
&lt;br/>
&lt;p>Modern data tooling has given us genuinely sophisticated ways to formally accept broken windows.&lt;/p>
&lt;p>dbt — the transformation tool most data engineering teams live inside — has a &lt;code>severity&lt;/code> configuration on data tests. You can set a test to &lt;code>warn&lt;/code> instead of &lt;code>error&lt;/code>. The test runs, detects a problem, and&amp;hellip; doesn&amp;rsquo;t fail the pipeline. It records the warning and moves on.&lt;/p>
&lt;p>The documentation is cheerful about it: &amp;ldquo;Maybe 1 duplicate record can count as a warning, but 10 duplicate records should count as an error.&amp;rdquo;&lt;/p>
&lt;p>In principle, this is sensible. In practice, the &lt;code>warn&lt;/code> threshold becomes a Schelling point. Teams configure tests to warn to avoid breaking CI. The warnings accumulate. And then — almost invariably — the warnings become wallpaper. &lt;em>(Go check your own test results right now. Count how many have been in warn state for more than two weeks. I&amp;rsquo;ll wait.)&lt;/em>&lt;/p>
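&lt;p>If you want to put a number on that, the warn-state results are sitting in dbt&amp;rsquo;s &lt;code>run_results.json&lt;/code> artifact after every &lt;code>dbt test&lt;/code> or &lt;code>dbt build&lt;/code>. A minimal sketch, assuming the standard artifact layout in which each result carries a &lt;code>status&lt;/code>, a &lt;code>unique_id&lt;/code> and a &lt;code>failures&lt;/code> count:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from collections import Counter
from pathlib import Path

# The artifact dbt writes after `dbt test` / `dbt build`.
RUN_RESULTS = Path("target/run_results.json")

results = json.loads(RUN_RESULTS.read_text())["results"]

# Count outcomes and list every test that finished in "warn" state.
statuses = Counter(r["status"] for r in results)
warned = [r for r in results if r["status"] == "warn"]

print(f"Test outcomes: {dict(statuses)}")
for r in sorted(warned, key=lambda res: res.get("failures") or 0, reverse=True):
    print(f'{r["unique_id"]}: {r.get("failures")} failing rows (warn)')
&lt;/code>&lt;/pre>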
&lt;p>dbt also has a feature called &lt;code>store_failures&lt;/code> that saves failing records to a table in an audit schema. The test is doing its job. The failures are being recorded, then overwritten by the next run&amp;rsquo;s failures. If nobody is actively querying that audit table — and almost nobody is — the failures exist only to make the test feel like it&amp;rsquo;s being taken seriously.&lt;/p>
&lt;p>It&amp;rsquo;s a passive graveyard. The window is monitored. Nobody fixes the glass.&lt;/p>
&lt;p>Airflow has an equivalent pattern. The &lt;code>soft_fail&lt;/code> parameter on sensors means that if an exception is raised — the source system is down, the file hasn&amp;rsquo;t arrived — the task is marked as &lt;em>skipped&lt;/em> rather than &lt;em>failed&lt;/em>. Downstream tasks, depending on their trigger rules, may also skip. An entire branch of your DAG quietly collapses to a skipped state, which most pipelines treat as benign, and your stakeholders get a dashboard with no data in it instead of an error message.&lt;/p>
&lt;p>Retries do something similar. A flaky source that fails 30% of the time gets &lt;code>retries=3&lt;/code> configured. The task eventually succeeds on the third attempt. The 30% failure rate never surfaces as an anomaly in any meaningful way. Until the source dies entirely, at which point the symptom everyone responds to is not &amp;ldquo;this has been flaky for six months&amp;rdquo; but &amp;ldquo;this suddenly started failing today.&amp;rdquo;&lt;/p>
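&lt;p>Both patterns amount to a couple of keyword arguments. A sketch of how they typically appear in a DAG, assuming a recent Airflow 2.x release; the DAG id, file path and callable are hypothetical stand-ins:&lt;/p>
&lt;pre>&lt;code class="language-python">from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="salesforce_daily_load",  # hypothetical
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
):
    # soft_fail=True: if the file never arrives, the sensor is marked
    # "skipped" rather than "failed" -- and with the default all_success
    # trigger rule, every task downstream of it skips too.
    wait_for_extract = FileSensor(
        task_id="wait_for_extract",
        filepath="/data/incoming/accounts_delta.csv",  # hypothetical
        poke_interval=300,
        timeout=60 * 60,
        soft_fail=True,
    )

    # retries=3: a source that fails 30% of the time will almost always
    # succeed on a retry, so the flakiness never surfaces as a failure.
    load_delta = PythonOperator(
        task_id="load_delta",
        python_callable=lambda: None,  # stand-in for the real load
        retries=3,
        retry_delay=timedelta(minutes=10),
    )

    wait_for_extract >> load_delta
&lt;/code>&lt;/pre>
&lt;p>Nothing in that sketch is misconfigured. It is simply optimised for not paging anyone.&lt;/p>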
&lt;p>None of this is the fault of the tools. dbt and Airflow are doing what they&amp;rsquo;re designed to do. The issue is that the default ergonomics of both make &lt;em>tolerating&lt;/em> failure significantly easier than &lt;em>stopping propagation&lt;/em>. &amp;ldquo;Don&amp;rsquo;t break the build&amp;rdquo; is a more convenient goal than &amp;ldquo;don&amp;rsquo;t ship broken data downstream.&amp;rdquo; The tools give you excellent knobs for the former and adequate knobs for the latter.&lt;/p>
&lt;p>Chad Sanderson — formerly at Convoy, now building in the data contracts space — has a name for what this produces over time, borrowed from the cybernetician Stafford Beer: POSIWID, the Purpose Of a System Is What It Does. If your data pipelines are consistently producing low-quality data, and your team consistently tolerates that, then whatever you think the purpose of your data platform is, its actual purpose is to enable teams to move fast, ship without accountability, and tolerate breakages.&lt;/p>
&lt;p>The broken windows aren&amp;rsquo;t exceptions. They&amp;rsquo;re the product.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="weve-built-monitoring-why-isnt-it-working">We&amp;rsquo;ve built monitoring. Why isn&amp;rsquo;t it working?&lt;/h3>
&lt;br/>
&lt;p>Data observability emerged as a formal discipline. The concept — largely popularised by Barr Moses at Monte Carlo — drew directly from the software world&amp;rsquo;s SRE and distributed systems monitoring practices: freshness, distribution, volume, schema, lineage. Five pillars. Automated anomaly detection. Alerts in Slack when something looks wrong.&lt;/p>
&lt;p>The framing was right. The problem is what happened next.&lt;/p>
&lt;p>Monte Carlo&amp;rsquo;s own telemetry — across millions of monitored tables — shows that alert engagement drops 15% when a Slack channel exceeds 50 alerts per week, and a further 20% past 100 per week. The data observability system has a data quality problem: the signal-to-noise ratio degrades, and so does the human response.&lt;/p>
&lt;p>This isn&amp;rsquo;t specific to data. The clinical literature on alarm fatigue is sobering. ICU environments average more than 150 alarms per bed per day. Studies consistently find that 72% to 99% of those alarms are non-actionable. The clinical community&amp;rsquo;s response to this was institutional — the Joint Commission made alarm safety a National Patient Safety Goal in 2014 — because they recognised that a monitoring system that produces more noise than signal doesn&amp;rsquo;t just fail to help. It actively worsens outcomes by training humans to stop responding.&lt;/p>
&lt;p>Cybersecurity teams face the same thing. Security operations centres receive thousands of alerts daily; most go unaddressed, not because analysts are lazy, but because the ratio of genuine signals to false positives has degraded to the point where sustained attention is cognitively impossible.&lt;/p>
&lt;p>The Google SRE book is direct on this: &amp;ldquo;Every page should be actionable. If a page merely merits a robotic response, it shouldn&amp;rsquo;t be a page.&amp;rdquo;&lt;/p>
&lt;p>In data engineering, the equivalent of a &amp;ldquo;robotic response&amp;rdquo; is the acknowledged-and-unresolved alert. The monitor that fires every Tuesday morning, gets a thumbs up in Teams, gets added to the &amp;ldquo;known issues&amp;rdquo; document, and never gets fixed. The window is now monitored. The fact that it&amp;rsquo;s monitored makes the team feel responsible. The window stays broken.&lt;/p>
&lt;p>There&amp;rsquo;s a term for this: observability theatre. The infrastructure of visibility exists. The dashboards are green. Nobody&amp;rsquo;s actually looking at the glass.&lt;/p>
&lt;p>This is where the broken windows metaphor earns its keep most fully. Wilson and Kelling weren&amp;rsquo;t saying that disorder is bad because it looks bad. They were saying that an unrepaired broken window sends a signal that no one cares — and that signal is the actual mechanism of decay. Monitoring a broken window without repairing it sends exactly the same signal. Possibly a worse one, because now everyone knows the problem is being tracked and nobody&amp;rsquo;s acting on it. The norm becomes: acknowledged problems are not necessarily fixed problems.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="the-stakeholder-sees-it-differently">The stakeholder sees it differently&lt;/h3>
&lt;br/>
&lt;p>Here is the asymmetry that data teams consistently underestimate.&lt;/p>
&lt;p>When an engineer looks at a broken window in a data pipeline — a column with a known null rate, a test that&amp;rsquo;s been in warn state for three weeks, a DAG that soft-fails every second run — they see technical debt. A problem to eventually be addressed. Something that exists on a spectrum of severity, probably not at the top of the list.&lt;/p>
&lt;p>When a business stakeholder encounters the downstream consequence of that broken window — a dashboard that contradicts a number they just presented to the CFO, a metric that moved in an unexplained direction, a report that doesn&amp;rsquo;t reconcile with another report — they don&amp;rsquo;t think &amp;ldquo;technical debt.&amp;rdquo; They think: &lt;em>can I trust the data team?&lt;/em>&lt;/p>
&lt;p>Benn Stancil — one of the clearest thinkers writing about this — put it well: &amp;ldquo;Trust is built, and blown up, by the outputs — and specifically, the consistency of those outputs.&amp;rdquo; He uses a vivid image that I keep coming back to: &amp;ldquo;Data and the dashboards that display it create a shared sense of reality. Looking at two dashboards that don&amp;rsquo;t match is like looking out two adjacent windows and not seeing the same thing.&amp;rdquo;&lt;/p>
&lt;p>That&amp;rsquo;s the experience of broken-window propagation from the stakeholder&amp;rsquo;s side. Two adjacent windows. Different views. A reality that doesn&amp;rsquo;t cohere.&lt;/p>
&lt;p>Monte Carlo&amp;rsquo;s 2023 State of Data Quality research put a number to the trust inversion that most data engineers already sense. In 2022, 47% of respondents reported that business stakeholders identified data issues first &amp;ldquo;all or most of the time&amp;rdquo; — more often than the data team itself. By 2023, that figure had risen to 74%. &lt;em>(Three in four. Let that land.)&lt;/em>&lt;/p>
&lt;p>Think about what that means. In three out of four data incidents, the people who rely on the data found the problem before the people who built it. The pipeline runs, the test passes, the dashboard renders, and somewhere downstream a business analyst is staring at a number that doesn&amp;rsquo;t look right and is about to send a message that starts with: &amp;ldquo;Quick question about this figure&amp;hellip;&amp;rdquo;&lt;/p>
&lt;p>The damage isn&amp;rsquo;t technical. It&amp;rsquo;s relational. And it compounds in a way that&amp;rsquo;s harder to reverse than any schema migration.&lt;/p>
&lt;p>Thomas Redman — who has spent decades studying data quality — made this point in Harvard Business Review: &amp;ldquo;When data are unreliable, managers quickly lose faith in them and fall back on their intuition to make decisions.&amp;rdquo; Once that happens, you haven&amp;rsquo;t just produced bad data. You&amp;rsquo;ve trained decision-makers to ignore good data too, because they can no longer distinguish between the two.&lt;/p>
&lt;p>The broken window didn&amp;rsquo;t just propagate through the pipeline. It propagated into the culture.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="nobody-owns-the-broken-window">Nobody owns the broken window&lt;/h3>
&lt;br/>
&lt;p>There&amp;rsquo;s a specific pattern worth naming, because it&amp;rsquo;s where most data teams I&amp;rsquo;ve worked with have the most unacknowledged broken windows: the orphaned asset.&lt;/p>
&lt;p>The pipeline that was built by an engineer who left eighteen months ago. The table in the warehouse that exists in the data catalogue but whose owner field says &amp;ldquo;Unknown&amp;rdquo; or, worse, the name of someone who&amp;rsquo;s no longer at the company. The dbt model that runs in production, that feeds two dashboards, that nobody on the current team can fully explain.&lt;/p>
&lt;p>These aren&amp;rsquo;t broken in any obvious sense. They run. They load. The tests pass, or they&amp;rsquo;re set to warn. But they&amp;rsquo;re broken windows in the original sense: they signal that nobody cares. And because nobody owns them, nobody&amp;rsquo;s in a position to repair them even when something goes wrong.&lt;/p>
&lt;p>What typically happens is this: a new engineer joins the team. They need to understand how the data flows. They look at the catalogue. They look at the undocumented table. They look at the model that references it. They look at the Jira ticket from fourteen months ago that says &amp;ldquo;Known issue — downstream teams aware.&amp;rdquo; And they make a rational decision: don&amp;rsquo;t touch it, build around it, replicate it if necessary.&lt;/p>
&lt;p>The broken window has now inspired a new window. Same mechanism as the Bronx car. Different materials.&lt;/p>
&lt;p>The organisational research on this is unambiguous. Ron Westrum&amp;rsquo;s typology of organisational cultures — validated empirically by the DORA research programme across thousands of software teams — found that information flow predicts safety and performance more reliably than almost any structural variable. In pathological cultures, information is hoarded or withheld for political reasons. In generative cultures, information flows freely, failures are shared, and problems get fixed because surfacing problems is rewarded rather than penalised.&lt;/p>
&lt;p>Amy Edmondson&amp;rsquo;s research on psychological safety adds the other half: 85% of employees have withheld important information from their manager due to fear of speaking up. In data teams, this looks like: the junior engineer who noticed the null rate had been climbing for two weeks and didn&amp;rsquo;t raise it because they weren&amp;rsquo;t sure it was their call to make. The analyst who suspected the metric definition had drifted but didn&amp;rsquo;t want to slow down the dashboard delivery. The data engineer who knew the pipeline was flaky but figured someone more senior would have noticed if it really mattered.&lt;/p>
&lt;p>The broken window gets left unrepaired not because nobody sees it, but because the culture hasn&amp;rsquo;t made repair feel safe or worthwhile.&lt;/p>
&lt;p>Edmondson on this: &amp;ldquo;If there&amp;rsquo;s no bad news, remind yourself: It&amp;rsquo;s not that it&amp;rsquo;s not there. It&amp;rsquo;s that you&amp;rsquo;re not hearing about it.&amp;rdquo;&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="the-original-theorys-mistake--and-ours">The original theory&amp;rsquo;s mistake — and ours&lt;/h3>
&lt;br/>
&lt;p>Before we get to what to actually do about this, it&amp;rsquo;s worth pausing on where the original theory went wrong. Because data engineering is at risk of making exactly the same mistake.&lt;/p>
&lt;p>When Wilson and Kelling published their 1982 Atlantic article, the theory was nuanced. It was about signals and norms. It was explicitly a theory about community — about how residents and police could work together to maintain shared standards in public spaces. It was, in Kelling&amp;rsquo;s own framing, a theory of collective efficacy.&lt;/p>
&lt;p>What cities did with it was zero tolerance. Mass arrests for minor offences. Stop-and-frisk. 685,000 stops in New York City in a single year, more than 85% of them finding nothing at all.&lt;/p>
&lt;p>Kelling&amp;rsquo;s reaction, when he learned how comprehensively his theory had been misapplied: &amp;ldquo;Oh, shit.&amp;rdquo;&lt;/p>
&lt;p>The software world made a version of the same mistake when it imported broken windows thinking. &amp;ldquo;Don&amp;rsquo;t live with broken windows&amp;rdquo; became, for some teams, a linting rule for everything, a zero-tolerance policy for code smells, a culture of perfectionism that burned people out and produced beautiful codebases that never shipped.&lt;/p>
&lt;p>Adam Tornhill&amp;rsquo;s research at CodeScene offers the corrective: not all broken windows matter equally. In a 400,000-line codebase with 89 developers, his analysis found that 4% of the code was responsible for 72% of the defects. The broken windows that needed fixing weren&amp;rsquo;t distributed evenly. They clustered in hotspots — files that were both frequently changed and highly complex. Fix the hotspots. Let the cold, ugly, stable corner of the codebase sit.&lt;/p>
&lt;p>The principle for data engineering is the same. Zero-tolerance data quality enforcement — failing every pipeline on every test at severity error — produces fragile systems and tired teams. It&amp;rsquo;s the data equivalent of arresting everyone for jaywalking. The signal gets lost in the noise.&lt;/p>
&lt;p>What the original theory was actually pointing at was something more like: maintain the norm. Make it visible that people care. Ensure that the broken windows that matter — the ones that propagate, the ones in the high-traffic, high-trust parts of the system — get fixed promptly and publicly. Not every window. The ones that signal whether anyone&amp;rsquo;s paying attention.&lt;/p>
&lt;hr>
&lt;h3 id="what-collective-efficacy-looks-like-in-a-data-team">What collective efficacy looks like in a data team&lt;/h3>
&lt;br/>
&lt;p>Sampson, Raudenbush, and Earls — the Chicago sociologists who ran the most rigorous empirical test of broken windows theory — found that the variable that actually predicted neighbourhood safety wasn&amp;rsquo;t the presence or absence of disorder. It was &lt;em>collective efficacy&lt;/em>: the combination of social cohesion and shared willingness to intervene. Neighbours who knew each other, trusted each other, and were willing to act on behalf of each other&amp;rsquo;s wellbeing.&lt;/p>
&lt;p>The policy prescription that emerges from their work is very different from zero tolerance. It&amp;rsquo;s community investment. Shared ownership. Making the costs of non-intervention visible.&lt;/p>
&lt;p>The equivalent in data engineering starts with one thing, and if you do nothing else in this list, do this one:&lt;/p>
&lt;p>&lt;strong>Make propagation visible before anything else.&lt;/strong>&lt;/p>
&lt;p>Column-level data lineage — now available in most modern data observability platforms and increasingly in the open-source OpenLineage standard — lets you answer the question: if this column is wrong, what does it break? That question should be answerable in seconds, not hours. Teams that can visualise propagation chains respond to broken windows faster because they can see the radius of the damage before they decide whether to act.&lt;/p>
&lt;p>This matters beyond incident response. When you can show an engineer that the null column they&amp;rsquo;re tolerating in a staging model feeds seven downstream gold-layer tables, three dashboards, and a Snowflake share that two other teams consume — the calculus on whether to fix it changes. The broken window stops being an abstract code quality concern and becomes a blast radius. That&amp;rsquo;s a much more compelling argument for repair than &amp;ldquo;we should improve our data quality culture.&amp;rdquo;&lt;/p>
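&lt;p>Even without an observability platform, you can approximate the blast radius from dbt&amp;rsquo;s &lt;code>manifest.json&lt;/code>, which records node-level (not column-level) dependencies in its &lt;code>child_map&lt;/code>. A rough sketch, with the project and model names hypothetical:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from collections import deque
from pathlib import Path

# manifest.json carries a node-level child_map:
# {unique_id: [direct downstream unique_ids]}
manifest = json.loads(Path("target/manifest.json").read_text())
child_map = manifest["child_map"]

def blast_radius(node_id: str) -> set[str]:
    """Breadth-first walk over everything downstream of node_id."""
    seen, queue = set(), deque([node_id])
    while queue:
        current = queue.popleft()
        for child in child_map.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical node: the staging model with the tolerated null column.
downstream = blast_radius("model.my_project.stg_salesforce__accounts")
print(f"{len(downstream)} downstream nodes depend on this model:")
for node in sorted(downstream):
    print(" ", node)
&lt;/code>&lt;/pre>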
&lt;p>&lt;strong>Treat ownership as load-bearing infrastructure, not housekeeping.&lt;/strong>&lt;/p>
&lt;p>Every pipeline, every table, every model should have a named owner. Not a team. A person. The Jira ticket that says &amp;ldquo;Known issue — downstream teams aware&amp;rdquo; with no assigned owner is the data equivalent of a broken window with an orange cone next to it. The cone acknowledges the hazard. Nobody&amp;rsquo;s fixing the glass.&lt;/p>
&lt;p>The counterargument is always resourcing — &amp;ldquo;we don&amp;rsquo;t have time to own everything properly.&amp;rdquo; That&amp;rsquo;s true, and worth taking seriously. But the right response isn&amp;rsquo;t to pretend you own things you don&amp;rsquo;t. It&amp;rsquo;s to make orphaned assets visible and have an honest conversation about whether the organisation can afford to run production pipelines with no accountable maintainer. Most of the time, when the question is asked that directly, the answer is no.&lt;/p>
&lt;p>&lt;strong>Calibrate your tolerance patterns deliberately — and actually revisit them.&lt;/strong>&lt;/p>
&lt;p>The dbt &lt;code>severity: warn&lt;/code> setting and Airflow&amp;rsquo;s &lt;code>soft_fail&lt;/code> are legitimate tools when they&amp;rsquo;re conscious decisions: &amp;ldquo;this condition is a signal worth tracking but not a pipeline-stopper, and here&amp;rsquo;s the threshold at which that changes.&amp;rdquo; The problem is that almost nobody uses them that way. They&amp;rsquo;re the path of least resistance to avoid a broken CI build — and six months later you audit your test results and discover forty tests set to warn that haven&amp;rsquo;t been at zero failures since the day they were written.&lt;/p>
&lt;p>Treat warn-severity tests the way you treat Jira tickets that never get triaged. Set a review cadence. If a test has been consistently warning for more than two weeks without an associated investigation, it&amp;rsquo;s either a bug that needs fixing or a threshold that needs changing. &amp;ldquo;Known issue&amp;rdquo; is not a status. It&amp;rsquo;s an admission.&lt;/p>
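&lt;p>One way to run that cadence, sketched under the assumption that each run&amp;rsquo;s &lt;code>run_results.json&lt;/code> is archived to a dated folder (dbt itself only keeps the most recent run, so the history has to be kept deliberately):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from pathlib import Path

# Hypothetical archive layout: artifacts/2026-05-01/run_results.json, one per run.
ARCHIVE = Path("artifacts")
RUNS_TO_CHECK = 14  # roughly two weeks of daily runs

paths = sorted(ARCHIVE.glob("*/run_results.json"))[-RUNS_TO_CHECK:]
warn_sets = []
for path in paths:
    results = json.loads(path.read_text())["results"]
    warn_sets.append({r["unique_id"] for r in results if r["status"] == "warn"})

# Tests that have warned in every one of the recent runs: either a bug
# that needs fixing or a threshold that needs changing.
persistent = set.intersection(*warn_sets) if warn_sets else set()
for test_id in sorted(persistent):
    print(f"{test_id} has warned in each of the last {len(warn_sets)} runs")
&lt;/code>&lt;/pre>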
&lt;p>&lt;strong>Fix alert fatigue before it hollows out your monitoring culture.&lt;/strong>&lt;/p>
&lt;p>The Teams channel where every data quality alert lands is the digital equivalent of a neighbourhood where every broken window gets photographed and logged and nobody ever fixes one. The log is evidence that someone noticed. It&amp;rsquo;s not evidence that anyone will act.&lt;/p>
&lt;p>Set alert thresholds that produce actionable signals. The Google SRE principle applies directly: if an alert merits a robotic response, it shouldn&amp;rsquo;t be an alert. If your first instinct on seeing a particular monitor fire is to click acknowledge and move on, that monitor is producing noise, not signal. Change it or delete it. Grouping alerts by lineage — &amp;ldquo;these five monitors fired because of one upstream schema change&amp;rdquo; rather than five separate pings — reduces volume while making propagation visible at the moment of failure, which is exactly when you want it.&lt;/p>
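&lt;p>Grouping by lineage can be as simple as collapsing any alert whose upstream ancestor is also alerting, so that only the root-most failures page anyone. A sketch against the &lt;code>parent_map&lt;/code> in dbt&amp;rsquo;s &lt;code>manifest.json&lt;/code>, with hypothetical model names:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from pathlib import Path

manifest = json.loads(Path("target/manifest.json").read_text())
parent_map = manifest["parent_map"]  # {unique_id: [direct upstream ids]}

def ancestors(node_id: str) -> set[str]:
    """Everything upstream of node_id."""
    seen, stack = set(), list(parent_map.get(node_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parent_map.get(current, []))
    return seen

def root_alerts(alerting: set[str]) -> set[str]:
    """Keep only the upstream-most alerting nodes; the rest are symptoms."""
    return {n for n in alerting if alerting.isdisjoint(ancestors(n))}

# Hypothetical example: five monitors fired, one upstream change caused them all.
fired = {
    "model.my_project.stg_salesforce__accounts",
    "model.my_project.int_accounts_enriched",
    "model.my_project.dim_accounts",
    "model.my_project.fct_account_health",
    "model.my_project.rpt_account_summary",
}
print("Page on:", sorted(root_alerts(fired)))
&lt;/code>&lt;/pre>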
&lt;p>&lt;strong>Make broken windows visible to leadership, because right now the cost is invisible.&lt;/strong>&lt;/p>
&lt;p>Data quality work is famously hard to demonstrate. Pipelines that run cleanly leave no artefact. A well-maintained model with a 0% null rate looks identical to a freshly built one. There&amp;rsquo;s no natural demo format for &amp;ldquo;nothing bad happened this week.&amp;rdquo;&lt;/p>
&lt;p>This invisibility is part of why broken windows accumulate. The cost of prevention is hidden in maintenance time that doesn&amp;rsquo;t get counted. The cost of failure gets absorbed by analysts who spend their Tuesdays reconciling numbers instead of answering strategic questions, by data engineers who spend their Fridays investigating stakeholder tickets, by business decisions made on incorrect information that can&amp;rsquo;t be traced back to a specific incident.&lt;/p>
&lt;p>Quantify the propagation radius when incidents do occur. How many downstream models were affected? How many stakeholders were exposed to incorrect data, and for how long? What was the resolution effort in hours? Those numbers, tracked consistently, build a case for investment that abstract arguments about data quality never will.&lt;/p>
&lt;hr>
&lt;br/>
&lt;br/>
&lt;h3 id="the-window-was-broken-before-i-noticed">The window was broken before I noticed&lt;/h3>
&lt;br/>
&lt;p>What we ended up doing was building the fix into the operating rhythm. A scheduled data drift catch-up: a full snapshot weekly, or monthly, depending on the volatility of the Salesforce object — not to respond to an incident, but to systematically correct the accumulated drift before it became visible to anyone downstream. We stopped waiting for the escalation. We made the repair part of the architecture.&lt;/p>
&lt;p>The broken window was still there, technically. The delta logic still had its blind spots around hidden relationships. But we stopped waiting for it to propagate before we addressed it. Some broken windows aren&amp;rsquo;t things you permanently seal — they&amp;rsquo;re things you build a maintenance routine around. Knowing that is the fix.&lt;/p>
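&lt;p>The reconciliation itself is not much more than a set comparison between the full snapshot and the delta-maintained table. A sketch, with hypothetical helper functions standing in for the real extracts:&lt;/p>
&lt;pre>&lt;code class="language-python">from datetime import datetime

# Hypothetical stand-ins for pulling a full Salesforce snapshot and
# reading the delta-maintained warehouse table.
def fetch_snapshot_keys() -> dict[str, datetime]:
    """Full extract: {record_id: last modification time} for every live record."""
    return {"001A": datetime(2026, 5, 1), "001B": datetime(2026, 5, 8)}

def fetch_warehouse_keys() -> dict[str, datetime]:
    """Delta-maintained table: {record_id: modification time as last loaded}."""
    return {"001A": datetime(2026, 5, 1), "001C": datetime(2026, 2, 1)}

snapshot = fetch_snapshot_keys()
warehouse = fetch_warehouse_keys()

# Records the deltas never delivered, or delivered and then lost track of.
missing = set(snapshot) - set(warehouse)
# Records gone from the source but still looking current downstream.
orphans = set(warehouse) - set(snapshot)
# Records whose source timestamp moved without a matching delta load.
shared = set(snapshot).intersection(warehouse)
stale = {k for k in shared if snapshot[k] > warehouse[k]}

print(f"missing={sorted(missing)} orphans={sorted(orphans)} stale={sorted(stale)}")
&lt;/code>&lt;/pre>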
&lt;p>What stayed with me wasn&amp;rsquo;t the technical solution, which was straightforward enough once we committed to it. What stayed with me was the calendar predictability of the escalation cycle before the fix. The fact that we had a known issue, a known workaround, documentation of both — and still managed to have the same downstream discovery, the same investigation, the same remediation conversation, almost exactly twelve months apart.&lt;/p>
&lt;p>The window wasn&amp;rsquo;t hidden. It was visible in the delta logs if you knew where to look. We just hadn&amp;rsquo;t made it someone&amp;rsquo;s job to look, and we hadn&amp;rsquo;t made repair part of the routine. We&amp;rsquo;d fixed the glass, filed the paperwork, and assumed the problem was solved. Until the same signal appeared in the same downstream reports, a year later.&lt;/p>
&lt;hr>
&lt;p>What I keep coming back to, though, is something that was true of that experience and is true of most of the data quality failures I&amp;rsquo;ve seen since: the window wasn&amp;rsquo;t broken secretly. It wasn&amp;rsquo;t hidden. It was visible, it was acknowledged, and it was left.&lt;/p>
&lt;p>We had the monitoring. We had the test. We had the Jira ticket.&lt;/p>
&lt;p>We had, in other words, all the infrastructure of concern. What we didn&amp;rsquo;t have was the collective agreement that repair mattered — that the propagation radius of that one broken window was large enough, and trust-eroding enough, that it justified stopping what we were doing and fixing the glass.&lt;/p>
&lt;p>That&amp;rsquo;s the thing broken windows theory is actually about, underneath all the criminology and the code smells and the schema drift. It&amp;rsquo;s about the signal that unrepaired damage sends. Not to the criminals or the developers or the data consumers. To the people who are supposed to care about the system.&lt;/p>
&lt;p>When a broken window sits long enough in a data pipeline, it stops being a problem and starts being a norm. The new engineer doesn&amp;rsquo;t flag it — they work around it. The analyst doesn&amp;rsquo;t escalate it — they add a caveat to their report. The data engineer doesn&amp;rsquo;t fix it — they document it.&lt;/p>
&lt;p>And somewhere downstream, a business decision gets made on numbers that were broken before anyone thought to check.&lt;/p>
&lt;p>The window isn&amp;rsquo;t just in the pipeline. The window is in the standard you&amp;rsquo;re willing to keep.&lt;/p></content:encoded><category>Data Engineering</category><category>Data Quality</category><category>Data Pipelines</category><category>Technical Debt</category><category>Data Observability</category><category>dbt</category><category>Apache Airflow</category><category>Data Culture</category><category>Pipeline Architecture</category></item></channel></rss>