The Data Quality Test: 10 Questions That Predict Pipeline Disasters
I’ve been writing about data quality a lot lately.
Enough that I notice myself doing it. Enough that a small voice says: haven’t you made this point already? Schema drift, NULL propagation, duplicate records, the whole catalogue of things that go wrong in the space between a source system and a warehouse. I keep circling back to it.
And every time, I almost talk myself out of writing the piece. Then I reflect on what’s happened in the last few years of work. The postmortems I’ve read, the pipelines I’ve inherited — and the same pattern shows up with depressing regularity. Not exotic failures. Not edge cases. The boring stuff. The questions nobody asked before the first row hit the warehouse.
That’s why I keep writing about this.
Not because data quality is a trendy topic. Not because it makes for dramatic storytelling. Because I believe, with the kind of quiet conviction that comes from watching the same preventable disaster play out across different companies and different years, that the measure of a data engineer isn’t how fast they respond to incidents — it’s whether those incidents needed to happen at all.
Prevention isn’t glamorous. Nobody gets a shoutout in the all-hands for the outage that didn’t occur. There’s no adrenaline rush from a pipeline that just… works. But there’s something else — something I’ve come to value more than any firefighting war story. It’s the feeling of walking into a Monday morning, seeing that a source system changed its schema over the weekend, and watching your pipeline handle it exactly the way you designed it to. No pages. No panic. No scramble. Just a notification in Teams that says: schema change detected in orders table, quarantined for review.
That feeling is what I want more data engineers to experience. And the path to it isn’t complicated. It starts with asking the right questions before you build anything.
The pre-mortem nobody wants to do
I have inherited a lot of data pipelines over the years, and the ones that blow up spectacularly are almost never the ones with exotic technical problems. They’re the ones where somebody skipped the boring questions.
Not the “what framework should we use” questions. Not the “should we go with Spark or DuckDB” questions. The uncomfortable, mundane questions that feel like they’re slowing you down when there’s pressure to ship.
I’ve distilled these into ten questions. Ten things you should be able to answer about every single source system before a single row hits your warehouse. For each one, I’ll walk you through what a good answer looks like — the baseline that keeps you out of trouble — and what an amazing answer looks like — the kind of engineering that lets you sleep at night.
If you can’t answer these questions, you’re not building a pipeline. You’re building a time bomb with a variable-length fuse.
1. What happens when the schema changes without warning?
This is the one that gets everyone eventually. Not if — when.
Source systems change. Application developers add columns, rename fields, change data types, and refactor table structures. They do this because it’s their job. And in most organisations, they have absolutely no obligation to tell you about it first. Your pipeline wakes up one morning expecting a VARCHAR and gets handed an INTEGER, and suddenly you’re in an incident channel explaining why the dashboard is blank.
Research from Integrate.io found that production incidents increase by roughly 27% for each percentage-point rise in schema drift. That’s a compounding reliability problem — each unhandled change makes the next one more likely to cause real damage.
The dangerous part isn’t the changes that cause immediate failures. It’s the silent ones. A column gets renamed, and your pipeline continues running because it doesn’t reference that column directly — but a downstream model does. A data type widens from INT to BIGINT, and everything works fine until an aggregation overflows six months later. A field that was never NULL suddenly starts arriving NULL, and your NOT NULL constraint in the warehouse? You never actually enforced it.
What good looks like
A good team validates incoming schemas before loading. At minimum, you’re comparing expected versus actual column names, data types, and nullability on every run. When something changes, the pipeline halts gracefully and sends a clear alert — not a cryptic stack trace, but a human-readable message that says “the user_status column changed from VARCHAR(20) to VARCHAR(50) in the orders table.”
You’re also versioning your schema expectations somewhere. Whether that’s a schema registry, a YAML file in version control, or even a documented contract in Confluence — the point is you’ve written down what you expect, so deviations are detectable.
Good teams also quarantine unexpected changes rather than silently ingesting them. If a new column appears, it goes into a staging area or metadata log — not straight into the warehouse where it might break a SELECT * somewhere downstream.
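To make the "good" baseline concrete, here is a minimal sketch of per-run schema validation. It assumes a dict-based expected schema kept in version control; the column names, types, and return shape are illustrative, not from any particular tool.

```python
# Expected schema, version-controlled alongside the pipeline code.
# (Names and types here are illustrative assumptions.)
EXPECTED = {
    "order_id":    {"type": "INTEGER", "nullable": False},
    "user_status": {"type": "VARCHAR(20)", "nullable": True},
}

def validate_schema(actual: dict) -> list[str]:
    """Compare the observed schema against expectations.

    Returns human-readable problem descriptions, suitable for an alert
    message rather than a cryptic stack trace.
    """
    problems = []
    for col, spec in EXPECTED.items():
        if col not in actual:
            problems.append(f"column '{col}' is missing")
            continue
        if actual[col]["type"] != spec["type"]:
            problems.append(
                f"column '{col}' changed from {spec['type']} to {actual[col]['type']}"
            )
        if actual[col]["nullable"] != spec["nullable"]:
            problems.append(f"column '{col}' nullability changed")
    for col in actual:
        if col not in EXPECTED:
            problems.append(f"unexpected new column '{col}' — quarantine for review")
    return problems
```

The point isn’t the specific data structure — a schema registry or YAML file works just as well — it’s that the expectation is written down and the diff is legible to a human at 2 AM.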
What amazing looks like
If that’s good, what does amazing look like? Amazing teams have schema contracts with source system owners. Not informal agreements — actual documented contracts that specify: these are the columns, these are the types, these are the nullability constraints, and here’s the process for changing any of them.
Amazing means automated schema evolution handling. When a new non-breaking column appears, the pipeline can automatically accommodate it — adding the column to the target table, logging the change, and notifying the team — without human intervention. When a breaking change arrives (column removed, type narrowed, key structure changed), it’s automatically classified by severity and routed to the right person.
Amazing also means your CI/CD pipeline includes schema validation. Before any source system deployment goes live, there’s an automated check that compares the proposed schema against all downstream consumers. The application team can’t deploy a breaking change without explicitly acknowledging the impact.
Think of it this way: good catches the fire. Amazing prevents the spark.
2. What happens when yesterday’s data arrives tomorrow?
Late-arriving data is one of those problems that sounds simple until you actually try to solve it. A mobile app loses connectivity and batches up events. A partner system has an outage and replays three days of data on Monday morning. An IoT sensor comes back online after a firmware update and dumps a week’s worth of readings.
Your pipeline ran for yesterday. It produced correct results based on the data available. Now more data shows up that belongs to yesterday. What happens?
If your answer is “we’d just rerun the pipeline,” congratulations — you’ve just introduced a different set of problems. Did that rerun overwrite aggregations? Did downstream dashboards update? Did anyone notice the numbers changed?
Late-arriving data is especially painful in dimensional modelling. The Kimball methodology specifically warns about late-arriving dimensions — facts that arrive before their corresponding dimension records exist. You end up with orphan foreign keys pointing at nothing, or worse, pointing at a placeholder “Unknown” record that never gets reconciled.
What good looks like
Good teams have defined a tolerance window for late data. You know, explicitly, that your pipeline handles data arriving up to N hours late, and you’ve designed your partitioning and processing logic around that window.
The most common pattern is a lookback window — your daily pipeline doesn’t just process today’s partition, it reprocesses the last N days to catch any late arrivals. It’s simple, it’s battle-tested, and for most batch workloads it’s perfectly adequate. You pair this with event-time partitioning (using the time the event actually occurred, not when it arrived) so that late data lands in the correct partition.
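The lookback pattern is simple enough to sketch in a few lines. This assumes a daily job and ISO-date partition names; the window size is an illustrative choice, not a recommendation for your workload.

```python
from datetime import date, timedelta

# How far back this pipeline tolerates late data (an illustrative assumption).
LOOKBACK_DAYS = 3

def partitions_to_process(run_date: date, lookback: int = LOOKBACK_DAYS) -> list[str]:
    """Return the event-time partitions a run on `run_date` should (re)process.

    The run covers today plus the last `lookback` days, so any data that
    arrived late for those days gets picked up and re-aggregated.
    """
    return [
        (run_date - timedelta(days=offset)).isoformat()
        for offset in range(lookback + 1)
    ]
```

Each returned partition is processed idempotently (see question 6), so rerunning the window never double-counts.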
Good also means you have observability on late data volumes. You’re tracking how much data arrives outside your expected window, and you’re alerting when that percentage exceeds a threshold. If 0.1% of your data normally arrives late and suddenly it’s 15%, something upstream has gone wrong.
What amazing looks like
Amazing teams implement bi-temporal modelling. Every record carries two timestamps: when the event happened (event time) and when the system recorded it (processing time). This means you can always reconstruct what you knew at any point in time, and you can always find the data that arrived after a particular processing run.
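A bi-temporal record is cheap to represent — the value comes from querying it. A minimal sketch, with field names as assumptions:

```python
from datetime import datetime

# Each record carries two timestamps: when the event happened (event_at)
# and when our system recorded it (processed_at).
def known_as_of(records: list[dict], as_of: datetime) -> list[dict]:
    """Reconstruct what we knew at time `as_of`.

    Only records whose processing time is at or before `as_of` existed in
    the warehouse then, regardless of when their events happened.
    """
    return [r for r in records if r["processed_at"] <= as_of]
```

The inverse filter (`processed_at > as_of`) answers the other question bi-temporality enables: which records arrived after a particular processing run.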
Amazing means automated reconciliation. When late data arrives, the system detects it, identifies affected downstream aggregations and models, and triggers targeted reprocessing — not a full pipeline rerun, but surgical updates to just the affected partitions and metrics. Databricks published a reconciliation pattern where unjoined records (facts without matching dimensions) get written to an error table, automatically retried on a schedule, and merged back into the target table once the dimension arrives.
Amazing also means your SLAs explicitly account for late data. Your stakeholders know that the 7 AM dashboard reflects data up to midnight, with a reconciliation pass at 10 AM that catches stragglers. No surprises, no “the numbers changed” panic.
3. What happens when a primary key gets reused?
This is more common than most engineers realise. Source systems recycle keys. It happens during migrations, database resets, test-to-production bleed, or when an auto-increment counter gets reset. I’ve seen it happen when a company acquired another business and merged customer databases — both systems had customer_id = 1, and they meant completely different people.
The problem is insidious because your pipeline has no way to know that customer_id = 47382 in today’s extract refers to a different entity than customer_id = 47382 from two years ago. Your MERGE statement happily updates the old record with the new data. Your historical analyses silently corrupt.
What good looks like
Good teams never use source system keys (business keys) as their warehouse primary keys. Full stop. You generate surrogate keys in the warehouse — monotonically increasing integers that are completely independent of whatever identifier the source system uses. The source key gets stored as a natural key attribute, but it’s not what your fact tables join on.
Good also means you have uniqueness monitoring. You’re checking that natural key + effective date combinations are unique, and you’re alerting when they’re not. If the same customer_id appears twice in a single extract with different attributes, your pipeline flags it rather than silently picking one.
You’re also maintaining a mapping table — source system key to warehouse surrogate key — with effective date ranges. When a key gets reused, the old mapping gets closed and a new one opens.
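Here is a sketch of that mapping in miniature. An in-memory dict stands in for the warehouse mapping table, and a single "fingerprint" string stands in for the richer attribute comparison a real system would use — both are simplifying assumptions.

```python
import itertools

_next_key = itertools.count(1)          # surrogate key generator
_mapping: dict[tuple[str, str], dict] = {}  # (source, natural_key) -> current mapping

def surrogate_key_for(source: str, natural_key: str, fingerprint: str) -> int:
    """Return the warehouse surrogate key for a source record.

    If the natural key reappears with different identifying attributes
    (a suspected key reuse), the old mapping is abandoned and a fresh
    surrogate key is issued, so the two entities stay distinct.
    """
    current = _mapping.get((source, natural_key))
    if current is not None and current["fingerprint"] == fingerprint:
        return current["surrogate"]
    sk = next(_next_key)
    _mapping[(source, natural_key)] = {"surrogate": sk, "fingerprint": fingerprint}
    return sk
```

In production the mapping lives in a table with effective date ranges, and "fingerprint" is a comparison against historical attributes — but the invariant is the same: a recycled source key never overwrites a different entity’s history.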
What amazing looks like
Amazing teams establish immutable business keys that are independent of any source system. Instead of relying on customer_id from the CRM, you create a composite business key that combines system identifier, source system name, and the natural key — something like CRM|customer_id|47382. This makes key reuse impossible by construction, because the key includes its origin.
Amazing means your data contracts specify key semantics. The source team has explicitly documented whether keys are permanent, recyclable, or sequential. Your pipeline’s handling logic adapts based on these documented properties.
Amazing also means you have automated key collision detection. When a key arrives that previously belonged to a different entity (detected by comparing against historical attributes), the pipeline automatically creates a new entity, generates a new surrogate key, and flags the collision for review — all without human intervention.
4. What happens when NULLs appear in non-nullable fields?
NULL is the ghost in your data. It doesn’t fail assertions cleanly. It propagates silently through calculations, turning sums into NULLs, joining on nothing, and making counts unreliable. And the most common source of NULLs in your warehouse? Fields that were never supposed to be NULL in the source system, but started arriving that way because a developer forgot to add a NOT NULL constraint, or an API started returning partial records during degraded performance, or a new code path doesn’t populate a field that the old one always did.
I’ve watched a NULL in a single country_code field cascade through an entire reporting layer, producing “Unknown” slices in regional dashboards that the leadership team took weeks to notice — and then two more weeks to trace back to the source.
What good looks like
Good teams define explicit NULL handling rules for every column in every source. You have a document (or better, code) that says: for email_address, NULLs are acceptable and should be stored as-is. For order_total, NULLs indicate a data quality problem and should be quarantined. For created_date, NULLs are logically impossible and the record should be rejected.
Good means you’re asserting non-nullability in your transformation layer, not just hoping the source enforces it. In dbt, that’s a not_null test on the relevant column. In raw SQL, it’s a WHERE column IS NOT NULL predicate that filters before load, with the rejects sent somewhere visible.
Good also means your warehouse distinguishes between “NULL because the value is genuinely unknown” and “NULL because the source didn’t send it.” A dedicated default value like -1 for numeric foreign keys, or “Not Provided” for string fields, makes this distinction queryable.
What amazing looks like
Amazing teams implement NULL anomaly detection. You’re not just testing for NULLs — you’re tracking the NULL rate for every column over time and alerting when it deviates from the baseline. If email_address is typically 2% NULL and suddenly jumps to 40%, that’s an upstream problem, and you catch it before anyone sees broken reports.
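The core of NULL-rate tracking fits in a few lines. This sketch compares a batch against a stored baseline rate; the tolerance value is an illustrative assumption, and a real system would track the baseline per column over time.

```python
def null_rate(values: list) -> float:
    """Fraction of values in a column sample that are None."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def null_rate_alert(values: list, baseline: float, tolerance: float = 0.05) -> bool:
    """True when today's NULL rate exceeds the historical baseline by
    more than `tolerance` — the signal that something upstream changed."""
    return null_rate(values) - baseline > tolerance
```

A column that is "typically 2% NULL and suddenly 40% NULL" trips this check long before anyone notices broken reports.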
Amazing means you have NULL semantics documented in your data contracts. The source team has committed to: this field will never be NULL under normal operation; if it’s NULL, that indicates a system error and should be treated as a defect. That turns a data quality conversation into a bug report.
Amazing also means your data models are NULL-resilient by design. Your aggregations use COALESCE and NULLIF defensively. Your joins handle NULL keys without silently dropping records. Your BI layer renders “Data Unavailable” instead of showing blank cells. The NULLs might still arrive, but they can’t break anything.
5. What happens when the source sends duplicates?
Sources send duplicates. Constantly. Webhook retries, at-least-once delivery guarantees in message queues, CDC replay after an outage, batch extracts that overlap, operators who click the “send” button twice. If your pipeline assumes every record it receives is unique, it’s only a matter of time before your dashboards start reporting inflated numbers.
The worst part is that duplicate records don’t announce themselves. They look exactly like legitimate data. The only way to catch them is to have already decided what “unique” means and to actively enforce it.
What good looks like
Good teams have defined the grain of every table and enforce it. You know that a fact record is uniquely identified by order_id + line_number + event_timestamp, and you have a uniqueness constraint or test that validates this on every load.
Good means your ingestion layer deduplicates before writing to the warehouse. Whether that’s a QUALIFY ROW_NUMBER() OVER (PARTITION BY key ORDER BY timestamp DESC) = 1 pattern in SQL, a dropDuplicates() in Spark, or a MERGE statement that handles duplicates as upserts — you have a defined pattern and you apply it consistently.
Good also means you track duplicate rates. You know that 0.3% of records from Source A arrive duplicated, and when that spikes to 5%, you investigate. Duplicates are a signal, not just a nuisance.
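The "keep the latest record per key" pattern behind that QUALIFY statement looks like this in plain Python — a sketch with illustrative field names, not a replacement for doing it in the engine:

```python
def deduplicate(records: list[dict], key: str, order_by: str) -> list[dict]:
    """Keep exactly one record per `key` value: the one with the greatest
    `order_by` value (typically the most recent timestamp).

    Mirrors QUALIFY ROW_NUMBER() OVER (PARTITION BY key
                                       ORDER BY timestamp DESC) = 1.
    """
    latest: dict = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[order_by] > latest[k][order_by]:
            latest[k] = rec
    return list(latest.values())
```

Whatever the implementation, the essential decisions are the same ones: what the key is, and which copy wins.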
What amazing looks like
Amazing teams deduplicate at multiple layers — a defense-in-depth approach. Raw ingestion deduplicates by event ID. The staging layer deduplicates by business key. The transformation layer validates uniqueness at the grain. Any duplicate that makes it through all three layers triggers an alert.
Amazing means your streaming pipelines use stateful deduplication with sliding windows. You maintain a cache of recently processed event IDs (in-memory, in Redis, or in the target database) and reject duplicates within a configurable window — say, the last 24 hours. Outside that window, you trade perfect deduplication for practical performance, which is an explicit, documented tradeoff.
Amazing also means you never delete or discard duplicates silently. You route them to a dead-letter table with metadata about when they arrived, why they were classified as duplicates, and what the primary record looks like. This gives you an audit trail and lets you spot systematic issues — like a webhook that’s been misconfigured to retry on success.
6. What happens when the extract job runs twice?
This is idempotency, and it’s the single most important design principle for data pipelines that most teams get wrong.
Your Airflow DAG fails halfway through. You hit retry. Your scheduler has a hiccup and triggers the same job twice. An engineer reruns a backfill without checking whether the original succeeded. In every one of these scenarios, your pipeline gets the same input more than once. What happens to the output?
If you’re appending records with INSERT INTO, you just doubled your data. If you’re overwriting a table, you might lose records that were added by another process. If you’re updating aggregations incrementally, your totals are now wrong.
Research from Airbyte describes how a multinational bank’s payment-processing system, lacking idempotency controls, created duplicate transactions worth millions during a simple retry. The root cause wasn’t a complex failure — it was a fundamental design flaw.
What good looks like
Good teams use the delete-write pattern. Before writing output for a given partition or date, you delete any existing data for that partition, then write the new data. Run it once, run it ten times — the result is the same. In SQL, that looks like:
DELETE FROM target_table WHERE partition_date = '2025-03-09';
INSERT INTO target_table
SELECT * FROM staging_table WHERE partition_date = '2025-03-09';
Good means every pipeline can be safely rerun without manual intervention. No “before you rerun, make sure you delete the output from the last run” instructions. If your runbook has manual steps for reruns, your pipeline isn’t idempotent.
Good also means you test for idempotency explicitly. Run the pipeline. Note the row count. Run it again with the same input. If the row count changes, you have a problem. This should be an automated test in your CI pipeline.
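Here is that test in miniature, using an in-memory SQLite table and the delete-write pattern from above. Table and column names are illustrative; the shape of the check — run twice, compare row counts — is the point.

```python
import sqlite3

def load_partition(conn, partition_date: str, rows: list[tuple]) -> None:
    """Idempotent load: delete the partition, then insert, in one transaction."""
    with conn:  # commits the delete + insert together, or rolls both back
        conn.execute("DELETE FROM target WHERE partition_date = ?", (partition_date,))
        conn.executemany(
            "INSERT INTO target (partition_date, amount) VALUES (?, ?)",
            [(partition_date, amount) for (amount,) in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (partition_date TEXT, amount REAL)")

rows = [(10.0,), (20.0,)]
load_partition(conn, "2025-03-09", rows)
load_partition(conn, "2025-03-09", rows)  # the rerun — same input

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# One run or two, the partition holds exactly the input rows.
```

Swap `INSERT INTO` for the delete-write and the same test would catch the doubling immediately — which is exactly why it belongs in CI.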
What amazing looks like
Amazing teams use MERGE / upsert patterns that handle inserts, updates, and deletes in a single atomic operation. Combined with table formats like Delta Lake or Iceberg that provide ACID transactions, you get idempotent writes that survive partial failures.
Amazing means your pipeline’s idempotency extends beyond the write. If your extract hits an API, it uses idempotency keys. If it publishes events downstream, it includes deduplication IDs. The entire pipeline, from source to serving layer, produces identical output regardless of how many times it’s executed.
Amazing also means you maintain processing metadata. Each pipeline run records its run ID, input parameters, record counts, and checksums. Before writing, the pipeline checks whether a run with identical parameters has already produced output — and if so, skips processing entirely. Not just idempotent writes — idempotent execution.
7. What happens when timestamps change timezone?
Timezone bugs are the cockroaches of data engineering — they survive everything. They lurk in corner cases, they reproduce during daylight saving transitions, and they’re nearly impossible to fully eradicate once they’ve infested a codebase.
Here’s a scenario I’ve seen multiple times: a source system stores timestamps in local time without timezone information. Your pipeline ingests them and assumes UTC. Everything works for months — until the clocks change and suddenly your event timestamps are off by an hour. A financial services company found that a source system migration introduced a timezone mismatch between loan default dates and origination dates. The resulting temporal inconsistency corrupted months of historical analysis.
The sneaky part? Timezone bugs don’t always cause failures. They cause subtle inaccuracies. Events that happened at 11:55 PM on Tuesday show up as Wednesday. Daily aggregations double-count an hour during the spring transition and miss an hour in autumn. A job scheduled for 2 AM during a daylight saving transition either runs twice or not at all.
What good looks like
Good teams store everything in UTC, always. Not “usually UTC.” Not “UTC except for that one source that sends in Eastern.” UTC, full stop. Local time is a presentation-layer concern and has no business being in your warehouse.
Good means you know the timezone semantics of every source. You’ve documented that System A sends timestamps in UTC, System B sends in US/Eastern with DST adjustments, and System C sends timezone-naive timestamps that are implicitly Australia/Sydney. Your pipeline converts explicitly on ingestion.
Good also means you handle DST transitions in your scheduling. If your pipeline runs at 2 AM local time, you’ve tested what happens when that hour doesn’t exist (spring forward) or exists twice (fall back). Most orchestrators evaluate cron expressions in UTC — you’ve confirmed this for yours.
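Explicit per-source conversion is short enough to sketch with the standard library’s zoneinfo module. The source-to-zone mapping is an illustrative assumption standing in for your documented timezone semantics.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Documented timezone semantics per source (illustrative assumptions).
SOURCE_ZONES = {
    "system_a": "UTC",
    "system_b": "US/Eastern",          # sends local time with DST adjustments
    "system_c": "Australia/Sydney",    # timezone-naive, implicitly local
}

def to_utc(naive_ts: datetime, source: str) -> datetime:
    """Attach the source's documented zone to a naive timestamp,
    then convert explicitly to UTC for storage."""
    local = naive_ts.replace(tzinfo=ZoneInfo(SOURCE_ZONES[source]))
    return local.astimezone(timezone.utc)
```

Because ZoneInfo carries the full DST rule set, a January timestamp and a July timestamp from the same US source convert with different offsets — which is precisely the behaviour an "assume UTC" pipeline silently gets wrong.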
What amazing looks like
Amazing teams store timestamps with timezone metadata. Not just “2025-03-09 14:30:00” — but “2025-03-09 14:30:00 UTC” alongside the original timezone identifier (like America/New_York). This lets you reconstruct the local time accurately for any historical record, even if timezone rules change.
Amazing means you have automated timezone consistency checks. Every timestamp column in your warehouse is tested to verify it’s in UTC. If a record arrives with a timestamp that’s suspiciously offset from its expected range (events from a US system arriving with timestamps that look like Asian business hours), it’s flagged for investigation.
Amazing also means your data models account for the fact that “today” means different things in different timezones. A global company’s “daily active users” metric explicitly defines which timezone’s midnight marks the boundary. Your models support configurable timezone-based aggregation windows, not just UTC midnight.
8. What happens when the source has a bad day and sends garbage?
Systems degrade. APIs return malformed JSON. Database exports get truncated. Character encoding changes silently. A deployment goes wrong and half the fields come back as their default values. A logging system buffer overflows and starts concatenating records.
The question isn’t whether you’ll receive garbage data — it’s whether your pipeline can tell the difference between garbage and legitimate outliers. A revenue figure of $0 might be a real zero-dollar transaction or it might be a failed parse. A customer name of “NULL” might be an actual person named Null (they exist) or a serialisation bug.
What good looks like
Good teams implement data quality checks at the boundaries. Before data enters the warehouse, you validate: record counts are within expected ranges, key fields are populated, data types match expectations, and categorical fields contain only known values.
Good means you have circuit-breakers. If a source extract is more than 30% smaller than the previous day (or 30% larger), the pipeline pauses and alerts. If more than 5% of records fail validation, the entire batch is quarantined for review rather than partially loaded.
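The volume circuit-breaker is the simplest of these to express. A sketch, with the 30% tolerance carried over from above as a default rather than a recommendation:

```python
def volume_ok(today_count: int, previous_count: int, tolerance: float = 0.30) -> bool:
    """False when today's extract is more than `tolerance` smaller or larger
    than the previous run — the trigger to pause the load and alert."""
    if previous_count == 0:
        return today_count == 0  # no baseline: only an empty extract matches
    change = abs(today_count - previous_count) / previous_count
    return change <= tolerance
```

The same shape — compute a ratio, compare to a threshold, quarantine on breach — applies to the validation-failure-rate breaker too.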
Good also means you maintain a data quality score for each source. Not a single binary pass/fail, but a composite metric that tracks completeness, accuracy, consistency, and timeliness over time. When the score dips below a threshold, stakeholders are notified with context, not just “data quality alert.”
What amazing looks like
Amazing teams implement statistical anomaly detection. You’re not just checking against static thresholds — you’re comparing today’s data against a rolling baseline of the last 30 days. A revenue column with a standard deviation of $50 that suddenly shows values of $50,000 triggers an investigation, even though $50,000 is a perfectly valid number in isolation.
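A minimal version of that rolling-baseline check, using a z-score against the recent window. The three-sigma cutoff is a common illustrative default, not a universal threshold.

```python
import statistics

def is_anomalous(value: float, baseline: list[float], k: float = 3.0) -> bool:
    """True when `value` sits more than `k` standard deviations from the
    mean of the rolling baseline — anomalous relative to recent history,
    even if it would be a perfectly valid number in isolation."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return value != mean  # flat baseline: any deviation is anomalous
    return abs(value - mean) > k * stdev
```

Fed a 30-day window per column, this is the difference between "is $50,000 a valid revenue figure?" (yes) and "is it plausible for this column today?" (no).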
Amazing means you have data quality rules that are co-authored with the business. The data team doesn’t decide alone what constitutes “garbage.” The finance team has specified that order totals should always be positive, that discount percentages can’t exceed 100%, and that order_date must be within the last 90 days. These rules are version-controlled and tested automatically.
Amazing also means you have automated rollback capabilities. When a bad batch is detected after it’s already been loaded, the system can automatically revert to the last known good state — using table time-travel features in Delta Lake or Iceberg — without manual intervention. Garbage got in, garbage gets out, and the dashboard never shows corrupted data.
9. Who gets paged when this breaks at 2am?
This is where engineering meets organisational design, and it’s where most data teams are embarrassingly underprepared.
When a pipeline fails at 2 AM, what actually happens? In too many organisations, the answer is: nothing, until someone opens a dashboard at 9 AM and notices the numbers are stale. Or worse: the alert fires, it goes to a Teams channel that nobody’s watching, and it sits there until Monday.
The question isn’t really about paging infrastructure. It’s about whether you’ve decided, ahead of time, who is responsible for what. Is a missing report the responsibility of the data engineers who built the pipeline, the analytics team who created the dashboard, the application team whose schema change caused the failure, or the platform team whose Kubernetes cluster ran out of memory?
If you haven’t answered that question before the incident, you’ll waste the first hour of every incident answering it during the crisis.
What good looks like
Good teams have on-call rotations with clear escalation paths. Pipeline failures route to a specific person, not a channel. That person has documented runbooks for the most common failure modes, and they have the access and authority to restart jobs, quarantine bad data, and communicate with stakeholders.
Good means you’ve classified your pipelines by criticality. The CEO dashboard pipeline and the experimental marketing attribution pipeline don’t get the same SLA. Critical pipelines page immediately. Non-critical pipelines send a notification that gets triaged during business hours.
Good also means you have defined SLAs with your stakeholders. They know that the revenue dashboard refreshes by 7 AM, that freshness is guaranteed within 4 hours, and that if something breaks, they’ll get a status update within 30 minutes. No surprises.
What amazing looks like
Amazing teams practice incident response. Not just documenting it — rehearsing it. They run game days where they simulate pipeline failures and walk through the response process. They measure time-to-detection, time-to-response, and time-to-resolution, and they track these metrics over time.
Amazing means every incident becomes a rule. After every data incident, there’s a blameless postmortem that produces concrete action items: a new test, a new alert, a new runbook entry, a new data contract clause. The goal is that no failure mode catches you twice.
Amazing also means your monitoring is proactive, not just reactive. You’re not just detecting failures — you’re detecting the conditions that precede failures. If processing time has been creeping up by 5% per week, you’ll hit your SLA window in six weeks. Amazing teams notice that trend and fix it before it becomes an incident.
10. Who actually owns this data?
This is the most important question on the list, and it’s the one that gets the vaguest answers. “The data team owns it.” “Engineering owns it.” “It’s in the data platform, so I guess… platform engineering?”
When nobody owns the data, everybody suffers. Source system teams change schemas without notification because it’s not their job to worry about downstream consumers. Data engineers treat data quality as someone else’s problem because they’re just moving bytes. Analysts blame engineers, engineers blame source systems, and source systems are blissfully unaware that anyone is even using their data.
The question of data ownership is really three questions: Who is accountable for the data’s content (is it correct)? Who is responsible for its delivery (does it arrive on time)? And who decides its definition (what does “active customer” actually mean)?
What good looks like
Good teams have explicit ownership documented for every source. There’s a human name attached to each data source — not a team, not a Teams channel, but a person who is accountable for its quality, availability, and definition. When that person changes roles, ownership is explicitly transferred, not silently abandoned.
Good means the data team has data contracts with source system owners. These contracts specify: what data is provided, in what format, with what SLAs, with what change notification process, and what happens when the contract is violated. It’s not adversarial — it’s professional.
Good also means you have a data catalog that maps lineage from source to consumption. When a downstream dashboard breaks, you can trace back through every transformation to the specific source table and the specific owner. No detective work required.
What amazing looks like
Amazing teams treat data as a product. Each critical dataset has a product owner who is accountable for its fitness for use — not just its technical availability. They gather requirements from consumers, they prioritise improvements, and they measure satisfaction. Data isn’t just “there.” It’s maintained, versioned, and evolved with the same rigour as an API.
Amazing means ownership boundaries are encoded in the infrastructure. Data contracts aren’t just documents — they’re executable specifications that are validated on every pipeline run. When a source system violates its contract, the pipeline doesn’t just fail — it identifies the violation, classifies the severity, notifies the source owner, and quarantines the affected data. The accountability is automated.
Amazing also means you conduct regular data ownership reviews. Quarterly, you audit every data source: Is the owner still accurate? Is the contract still valid? Are the SLAs still appropriate? Are there new consumers who need to be included in change notification? Data ownership isn’t a one-time assignment — it’s an ongoing practice.
The checklist that saves you
Here are all ten questions in a format you can bring to your next pipeline design review. Print it out. Tape it to your monitor. Make it a required section in your technical design documents.
Before you ingest a new source system, answer these:
- Schema changes — What’s our detection and response strategy?
- Late-arriving data — What’s our tolerance window and reconciliation process?
- Primary key reuse — How do we generate warehouse-independent identity?
- NULL invasion — What are our per-column NULL handling rules?
- Duplicate records — How do we define and enforce uniqueness at every layer?
- Double execution — Can this pipeline run twice safely?
- Timezone chaos — Are we storing UTC with source timezone metadata?
- Garbage data — What are our circuit-breakers and anomaly thresholds?
- Incident response — Who gets paged, and what’s the runbook?
- Data ownership — Who’s accountable for content, delivery, and definition?
If you can answer all ten, you’ve done more due diligence than most. If you can answer them at the “amazing” level, you’ve built something that’ll survive the chaos of production for years.
Why I keep writing about this
Every one of these ten questions is, at its core, about the same thing: caring enough to think ahead. Not just building something that works today, but building something that handles the mess that tomorrow will inevitably deliver. That’s not a technical skill. It’s a disposition. It’s deciding that the quiet discipline of prevention matters more than the visible heroism of firefighting.
The best data engineers I’ve worked with all share this trait. They’re not the ones with the most dramatic incident stories. They’re the ones whose systems are boring. Predictably, reliably, beautifully boring.
That’s the work I want to celebrate.
