When Your Data Quality Fails at 9 PM on a Friday
When everything goes wrong at once
It’s 9 PM on a Friday. You’re halfway through your second beer, finally relaxing after a brutal week. Your phone buzzes. Then it buzzes again. And again. The support team’s in full panic mode, your manager’s calling, and somewhere in Melbourne, two very angry guests are standing outside the same Airbnb property—both holding confirmation emails that say the place is theirs for the weekend.
The keys don’t work for one of them. The host isn’t answering. Both guests are already tweeting about it. Your company’s about to become the main character on social media, and not in a good way.
This isn’t hypothetical. This is what happens when data quality breaks down in production, and it’s more common than you’d think. I’ve been there. Most data professionals have been there at some point—that sinking feeling when you realize a query you wrote, a constraint you didn’t add, or a test you didn’t think was necessary just created a real-world disaster.
Today we’re going to walk through exactly how a seemingly simple SQL query can create this nightmare. More importantly, we’re going to build a framework for preventing it—one that goes beyond “just write better code” and addresses the systemic issues that let these problems slip through in the first place.
The real cost of getting it wrong
When data quality fails, the consequences don’t show up as a ticket in your project management system marked “P3 - Low Priority.” They show up as:
Customer trust evaporating in real-time. Those two guests aren’t just filing support tickets. They’re telling everyone they know. They’re warning their friends. They’re writing detailed reviews about how your platform failed them when they needed it most. You’ve lost them forever, plus everyone they influence.
Revenue hemorrhaging faster than you can calculate it. Immediate refunds, sure. Compensation packages to prevent lawsuits, definitely. But what about the bookings those guests would’ve made next year? The referrals you’ll never get? The hosts who lose confidence in your platform and stop accepting new bookings?
Operational chaos that cascades. Your support team’s drowning. They can’t keep up with the volume. Other tickets are backing up. Response times are slipping across the board. The hosts are freaking out because they’re caught in the middle. Your trust and safety team is scrambling to figure out if this is isolated or systemic.
Regulatory nightmares you didn’t see coming. Depending on your jurisdiction, you might’ve just violated consumer protection laws. The legal team’s getting involved. Someone’s drafting incident reports for regulators. This isn’t just an engineering problem anymore.
Brand damage that’s impossible to quantify. Social media amplifies everything. That tweet from Guest #1 is getting retweeted. Tech bloggers are picking it up. Someone on Hacker News is explaining why your engineering team clearly doesn’t understand basic database constraints. Your CEO’s getting messages from investors asking what’s going on.
And here’s the thing—all of this stemmed from a single flawed SQL query.
The bug that broke everything
Let me show you the actual data that caused this disaster. We’ve got a bookings table—nothing fancy, just the basics:
CREATE TABLE bookings (
room_id INT,
booking_id INT,
guest_name VARCHAR(100),
start_date DATE,
end_date DATE
);
INSERT INTO bookings VALUES
(101, 1, 'Alice', '2024-01-01', '2024-01-10'),
(101, 2, 'Bob', '2024-01-05', '2024-01-07'),
(101, 3, 'Charlie', '2024-01-08', '2024-01-12'),
(101, 4, 'Diana', '2024-01-15', '2024-01-18'),
(102, 5, 'Eve', '2024-01-03', '2024-01-06'),
(102, 6, 'Frank', '2024-01-08', '2024-01-11');
Look at Room 101. Alice booked January 1-10. Bob’s there January 5-7. Charlie arrives January 8 and stays through the 12th.
You see the problem, right? Alice and Charlie both think they have the place from January 8-10. That’s three nights where two different guests are expecting exclusive access to the same property.
Now here’s where it gets interesting (and by interesting, I mean horrifying). The development team wrote what seemed like a clever query to detect overlaps:
-- The query that seemed so clever in code review
WITH ordered_bookings AS (
    SELECT
        room_id,
        booking_id,
        start_date,
        end_date,
        LEAD(start_date) OVER (
            PARTITION BY room_id
            ORDER BY start_date
        ) AS next_start
    FROM bookings
)
SELECT room_id, booking_id, start_date, end_date, next_start
FROM ordered_bookings
WHERE next_start < end_date;
The LEAD() function looks at the next booking after sorting by start date. It compares each booking to the one immediately following it. Sounds reasonable, right?
Except it completely misses Alice and Charlie’s overlap.
Why? Because Bob’s booking sits between them chronologically. The query compares Alice to Bob (and catches that conflict), compares Bob to Charlie (no conflict there, since Bob checks out before Charlie arrives), but never compares Alice to Charlie at all. It’s like having three people in a room where person A and person C are standing on each other’s toes, but you only ever check whether adjacent people are touching.
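To make the blind spot concrete, here is roughly what that window-function step produces for Room 101 with the sample data above (illustrative output, written as comments):
-- Room 101, ordered by start_date:
-- booking_id | guest   | start_date | end_date   | next_start
-- 1          | Alice   | 2024-01-01 | 2024-01-10 | 2024-01-05   <- flagged: Bob starts before Alice leaves
-- 2          | Bob     | 2024-01-05 | 2024-01-07 | 2024-01-08   <- not flagged: Charlie starts after Bob leaves
-- 3          | Charlie | 2024-01-08 | 2024-01-12 | NULL
-- The Alice-Charlie pair is never examined, so their January 8-10 conflict slips through.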
This is the kind of bug that passes code review. It passes initial testing if you’re not thorough. It might even work perfectly fine in production for months until you hit just the right combination of bookings.
And when it fails? It fails spectacularly, on a Friday night, when two guests are standing outside the same property.
The six dimensions that define quality
Data quality isn’t a binary thing where data is either “good” or “bad.” It’s multidimensional, and understanding these dimensions changes how you think about building systems.
Accuracy asks whether your data correctly represents reality. Guest email is “john.smith@email.com” and that’s actually John’s email? Accurate. Phone number has a typo? Inaccurate. Here’s the tricky part—in our booking scenario, even if the dates are recorded wrong (user books Jan 8-12 but system records Jan 8-10), every downstream calculation becomes inaccurate, even if your overlap detection logic is perfect.
Completeness means all necessary data is present. Your booking has room_id, booking_id, guest info, dates, contact details? Complete. Missing contact information so you can’t notify guests of problems? Incomplete. But here’s what most people miss—the overlap detection logic itself was incomplete. It checked some overlaps but not all overlaps. Missing validation logic is as critical as missing data.
Consistency asks whether data agrees across systems. Guest’s booking shows “Room 101” in both booking and property management systems? Consistent. Booking system shows checkout on Jan 10 but calendar sync shows Jan 11? Inconsistent. And this is where things get messy in real systems—you’ve got data flowing between multiple databases, cache layers, message queues, third-party integrations. Maintaining consistency across all of that is harder than it looks.
Timeliness measures whether data is current and available when needed. Booking confirmation sent within 30 seconds? Timely. Cancellation not reflected in availability for 24 hours? Untimely, and now you’re creating new double-bookings. Even perfect overlap detection doesn’t help if it runs as a nightly batch job instead of real-time validation.
Validity checks if data conforms to business rules. Email follows format rules? Valid. Booking end_date after start_date? Valid. Two bookings overlapping for the same room? Invalid—this was our core problem. The business rule was crystal clear: “No room can have overlapping bookings.” The LEAD() query failed to properly validate this rule.
Uniqueness ensures entities are recorded once and only once. Each booking has a unique booking_id? Unique. Same guest creates account twice with slightly different emails? Duplicate. And while not our primary issue, if bookings get recorded multiple times (user double-clicks submit), you’ve got duplicate records complicating your overlap detection.
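To make a couple of these dimensions concrete, here are the kinds of one-off checks you might run against the bookings table (a sketch; the duplicate heuristic is an illustration, not a hard rule):
-- Validity: the end date must come after the start date.
SELECT * FROM bookings WHERE end_date <= start_date;

-- Uniqueness: the same guest holding several bookings for the same room and start date
-- is worth a second look (a heuristic, not proof of a duplicate).
SELECT room_id, guest_name, start_date, COUNT(*) AS copies
FROM bookings
GROUP BY room_id, guest_name, start_date
HAVING COUNT(*) > 1;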
These dimensions interact in ways that’ll bite you if you’re not careful. A timeliness issue creates stale data, which becomes an accuracy problem, which breaks consistency, which produces invalid states. In our case, validity failed first (overlap logic was wrong), causing accuracy issues (booking data didn’t reflect reality), leading to consistency problems (booking table disagreed with availability calendar), creating timeliness challenges (support scrambling to fix in real-time).
How the disaster unfolds
Let’s walk through what actually happened when our buggy query made it to production.
Development phase: Someone writes the LEAD() query. It looks elegant. Uses a window function. Modern SQL. The code reviewer sees it’s checking for overlaps and approves it. Nobody thinks to test the non-adjacent overlap case because, honestly, who thinks about that?
Testing phase: QA creates test cases. They test two adjacent bookings. Works perfectly. They test bookings with gaps. Works perfectly. They test bookings in different rooms. Works perfectly. But they don’t test three bookings where A overlaps C with B sitting between them. Why would they? The spec didn’t call it out.
Production deployment: Everything’s smooth. No errors during rollout. Initial bookings look good. The system’s humming along. For weeks, maybe months, you don’t see the problem because your booking patterns haven’t hit that specific edge case.
The incident: Then one day, someone books around an existing booking. Alice extends her stay. Or Charlie books early. Or the timing just lines up wrong. Suddenly you’ve got overlapping bookings in the database, both confirmed, both valid as far as the system knows.
The discovery: It’s not immediate. Both guests receive confirmation emails. Both guests make travel plans. Both guests show up. That’s when someone discovers the problem—when it’s too late to fix gracefully.
The response: Support gets flooded. Engineering gets paged. Someone runs a query to check for other overlaps. Oh no, there are dozens. Some in the future. Some happening right now. The scope of the problem becomes clear, and it’s worse than anyone thought.
The aftermath: You’re fixing data in production. You’re calling customers. You’re issuing refunds. You’re updating hosts. You’re explaining to executives. You’re writing post-mortems. You’re implementing new tests. You’re adding database constraints. You’re probably looking for a new job.
This sequence plays out somewhere in the data world every single day. The specific domain changes—it might be financial transactions, healthcare records, e-commerce inventory, user permissions. But the pattern stays the same: subtle bug, insufficient testing, production deployment, catastrophic failure, expensive cleanup.
Building a framework that actually works
Fixing the query is easy. Preventing the next incident requires systemic change. Let me show you a framework that’s saved my ass more times than I can count.
Principle 1: The database should enforce truth
Your database knows what valid data looks like. Make it impossible to insert invalid data. Not hard, not inconvenient—impossible.
For our booking scenario:
-- Basic date validation
ALTER TABLE bookings ADD CONSTRAINT valid_date_range
CHECK (end_date > start_date);
-- Now comes the interesting part
CREATE OR REPLACE FUNCTION prevent_booking_overlap()
RETURNS TRIGGER AS $$
BEGIN
IF EXISTS (
SELECT 1
FROM bookings
WHERE room_id = NEW.room_id
AND booking_id != NEW.booking_id
AND start_date < NEW.end_date
AND end_date > NEW.start_date
) THEN
RAISE EXCEPTION 'Booking overlap detected for room % between % and %',
NEW.room_id, NEW.start_date, NEW.end_date;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER validate_booking_overlap
BEFORE INSERT OR UPDATE ON bookings
FOR EACH ROW
EXECUTE FUNCTION prevent_booking_overlap();
This trigger runs before every insert or update. It checks every existing booking for that room. If there’s any overlap—adjacent or not, doesn’t matter—the insert fails. You can’t create invalid data even if you try.
Some people worry about performance. “Running a query on every insert will be slow!” Maybe. But you know what’s slower? Manually fixing data corruption at 9 PM on Friday. Besides, this check is a simple lookup that hits an index (assuming room_id and the date columns are indexed) and finishes in milliseconds. Profile it if you’re worried, but in my experience, correctness beats speed every time.
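If you’re on PostgreSQL, there’s an even more direct way to express the same rule: an exclusion constraint over a date range. A minimal sketch, assuming the btree_gist extension is available and existing data has already been cleaned up:
-- Alternative to the trigger: declare "no two bookings for the same room may have
-- overlapping date ranges" as a constraint the database enforces on its own.
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE bookings ADD CONSTRAINT no_overlapping_bookings
EXCLUDE USING gist (
    room_id WITH =,
    daterange(start_date, end_date) WITH &&
);
A nice side effect: daterange(start_date, end_date) is half-open by default, so a booking that ends on January 10 doesn’t conflict with one that starts on January 10, which matches the checkout-day rule in the test cases below.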
Principle 2: Tests must cover your assumptions
You can’t test everything. But you absolutely must test your critical business rules, and you need to think adversarially about edge cases.
For overlap detection, the test suite needs to cover:
-- Test Case 1: Adjacent overlaps (the easy one)
-- Booking A: Jan 1-10, Booking B: Jan 5-7
-- Expected: Overlap detected
-- Test Case 2: Non-adjacent overlaps (the one we missed)
-- Booking A: Jan 1-10, Booking B: Jan 5-7, Booking C: Jan 8-12
-- Expected: A-C overlap detected
-- Test Case 3: Exact match
-- Booking A: Jan 1-10, Booking B: Jan 1-10
-- Expected: Overlap detected
-- Test Case 4: One booking completely contains another
-- Booking A: Jan 1-15, Booking B: Jan 5-7
-- Expected: Overlap detected
-- Test Case 5: Bookings touch but don't overlap
-- Booking A: Jan 1-10, Booking B: Jan 10-15
-- Expected: No overlap (checkout day is available)
-- Test Case 6: Different rooms
-- Room 101 Booking A: Jan 1-10, Room 102 Booking B: Jan 5-7
-- Expected: No overlap
Notice Test Case 5. Should a checkout day that matches the next check-in day count as an overlap? That depends on your business rules. Maybe you need a few hours between bookings for cleaning. Document these assumptions explicitly.
Write these tests before fixing the bug. Watch them fail. Then fix the bug and watch them pass. Now you’ve got protection against regression.
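As a sketch of what Test Case 2 looks like when expressed directly in SQL (a scratch table, so nothing touches real data):
-- Scratch data for Test Case 2: A overlaps C, with B sitting between them by start date.
CREATE TEMP TABLE test_bookings (LIKE bookings);
INSERT INTO test_bookings VALUES
    (101, 1, 'A', '2024-01-01', '2024-01-10'),
    (101, 2, 'B', '2024-01-05', '2024-01-07'),
    (101, 3, 'C', '2024-01-08', '2024-01-12');
-- Run your overlap detector against this table and assert it reports exactly two
-- conflicts: A-B and A-C. The LEAD()-based query only notices that A conflicts with
-- its immediate neighbour and never surfaces the A-C conflict.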
Principle 3: Monitoring catches what testing misses
Tests verify behavior under known conditions. Monitoring detects problems under real conditions.
Create a daily data quality report:
CREATE OR REPLACE VIEW daily_data_quality_report AS
WITH overlap_check AS (
SELECT
'Booking Overlaps' as metric,
COUNT(*) as issue_count,
'CRITICAL' as severity
FROM bookings a
INNER JOIN bookings b
ON a.room_id = b.room_id
AND a.booking_id < b.booking_id
AND a.start_date < b.end_date
AND a.end_date > b.start_date
),
date_check AS (
SELECT
'Invalid Date Ranges' as metric,
COUNT(*) as issue_count,
'HIGH' as severity
FROM bookings
WHERE end_date <= start_date
),
completeness_check AS (
SELECT
'Missing Guest Names' as metric,
COUNT(*) as issue_count,
'MEDIUM' as severity
FROM bookings
WHERE guest_name IS NULL OR guest_name = ''
)
SELECT * FROM overlap_check
UNION ALL SELECT * FROM date_check
UNION ALL SELECT * FROM completeness_check;
Run this every morning. Alert if any check returns issue_count > 0. You’ll catch problems before customers do.
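The alert itself can be as simple as a query that returns rows only when something is wrong; whatever scheduler you already use can email or page on a non-empty result. A minimal sketch:
-- An empty result means "all clear"; any row means someone should look today.
SELECT metric, issue_count, severity
FROM daily_data_quality_report
WHERE issue_count > 0
ORDER BY
    CASE severity
        WHEN 'CRITICAL' THEN 1
        WHEN 'HIGH' THEN 2
        ELSE 3
    END;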
Better yet, surface these metrics on a dashboard everyone can see. Make data quality visible. When the overlap count goes from zero to three, someone notices immediately.
Principle 4: Culture eats process for breakfast
You can have perfect constraints, comprehensive tests, and real-time monitoring. If your team culture doesn’t value data quality, you’ll still have problems.
Building the right culture means:
Making data quality everyone’s job, not just the data team’s. When a product manager writes requirements, they specify data validation rules. When a designer creates forms, they consider what invalid input looks like. When an engineer writes code, they think about data integrity.
Treating data quality incidents as learning opportunities, not blame games. When something breaks, the question isn’t “whose fault is this?” It’s “what system failed and how do we fix it?” The engineer who wrote the buggy query isn’t the problem. The process that let it reach production is.
Celebrating prevention as much as firefighting. The engineer who adds a constraint that prevents a future incident deserves recognition. The QA who writes a test that catches a bug before production deserves praise. But this rarely happens—we celebrate the hero who fixed production at 2 AM, not the person who prevented the incident entirely.
Investing time in foundational work. Yes, building that monitoring dashboard takes time away from new features. Yes, writing comprehensive tests slows down initial development. But it speeds up everything else—debugging, onboarding, refactoring, building on top of existing code.
The correct query (finally)
Let’s fix the actual overlap detection query. The correct approach uses a self-join:
SELECT DISTINCT
a.room_id,
a.booking_id as booking_a,
a.guest_name as guest_a,
a.start_date as start_a,
a.end_date as end_a,
b.booking_id as booking_b,
b.guest_name as guest_b,
b.start_date as start_b,
b.end_date as end_b,
GREATEST(a.start_date, b.start_date) as overlap_start,
LEAST(a.end_date, b.end_date) as overlap_end
FROM bookings a
INNER JOIN bookings b
ON a.room_id = b.room_id
AND a.booking_id < b.booking_id
AND a.start_date < b.end_date
AND a.end_date > b.start_date
ORDER BY a.room_id, a.start_date;
This query compares every booking to every other booking in the same room. No more “adjacent only” limitation. If any two bookings overlap, we find them.
The join conditions are crucial:
- a.room_id = b.room_id: only compare bookings in the same room
- a.booking_id < b.booking_id: avoid comparing a booking to itself and avoid duplicate pairs
- a.start_date < b.end_date: booking A starts before booking B ends
- a.end_date > b.start_date: booking A ends after booking B starts
This is the standard interval overlap formula. It’s been used for decades. It works. Use it.
Run this query against your current data:
-- Emergency audit script
SELECT
COUNT(*) as total_overlaps,
COUNT(DISTINCT room_id) as affected_rooms
FROM (
SELECT DISTINCT a.room_id, a.booking_id
FROM bookings a
INNER JOIN bookings b
ON a.room_id = b.room_id
AND a.booking_id < b.booking_id
AND a.start_date < b.end_date
AND a.end_date > b.start_date
) overlaps;
If this returns anything other than zero, you’ve got work to do.
Implementing the fix without breaking everything else
You can’t just deploy new constraints to production and hope for the best. You need a careful rollout strategy that accounts for existing bad data.
Phase 1: Discovery (Day 1)
Run the corrected query against production. Export all overlaps. Calculate the scope of damage. How many customers affected? How many rooms? How far into the future?
-- Generate a report of all overlaps
CREATE TEMP TABLE overlap_report AS
SELECT
a.room_id,
a.booking_id as first_booking,
a.guest_name as first_guest,
b.booking_id as second_booking,
b.guest_name as second_guest,
GREATEST(a.start_date, b.start_date) as overlap_start,
LEAST(a.end_date, b.end_date) as overlap_end,
CASE
WHEN GREATEST(a.start_date, b.start_date) < CURRENT_DATE
THEN 'Past'
WHEN GREATEST(a.start_date, b.start_date) = CURRENT_DATE
THEN 'Today'
ELSE 'Future'
END as urgency
FROM bookings a
INNER JOIN bookings b
ON a.room_id = b.room_id
AND a.booking_id < b.booking_id
AND a.start_date < b.end_date
AND a.end_date > b.start_date;
-- Prioritize by urgency
SELECT urgency, COUNT(*) as count
FROM overlap_report
GROUP BY urgency
ORDER BY
CASE urgency
WHEN 'Today' THEN 1
WHEN 'Future' THEN 2
WHEN 'Past' THEN 3
END;
Phase 2: Triage (Days 2-3)
Handle immediate issues first. Anyone checking in today or tomorrow needs manual resolution. Contact guests, contact hosts, figure out alternative arrangements. This is messy human work that can’t be automated.
For future bookings, you’ve got options:
- Cancel the later booking and issue a refund
- Find alternative accommodations
- Negotiate with guests to change dates
- In extreme cases, cancel both and eat the compensation cost
There’s no good answer here. Every option makes someone unhappy. Document your decisions and learn from them.
Phase 3: Data Cleanup (Days 3-7)
Once you’ve handled the urgent cases, clean up the historical data:
-- Mark overlapping bookings for review
ALTER TABLE bookings ADD COLUMN IF NOT EXISTS data_quality_flag VARCHAR(50);
UPDATE bookings
SET data_quality_flag = 'overlap_detected'
WHERE booking_id IN (
SELECT DISTINCT a.booking_id
FROM bookings a
INNER JOIN bookings b
ON a.room_id = b.room_id
AND a.booking_id < b.booking_id
AND a.start_date < b.end_date
AND a.end_date > b.start_date
);
Don’t delete data. Flag it. You might need it for analysis, audits, or legal purposes.
Phase 4: Prevention (Week 2)
Now you can add the database constraint:
-- First, make sure you've cleaned up all existing overlaps
-- After cleanup, this count should be 0
SELECT COUNT(*) FROM bookings WHERE data_quality_flag = 'overlap_detected';
-- Deploy the trigger
CREATE OR REPLACE FUNCTION prevent_booking_overlap()
RETURNS TRIGGER AS $$
BEGIN
IF EXISTS (
SELECT 1
FROM bookings
WHERE room_id = NEW.room_id
AND booking_id != NEW.booking_id
AND start_date < NEW.end_date
AND end_date > NEW.start_date
) THEN
RAISE EXCEPTION 'Booking overlap detected for room % between % and %',
NEW.room_id, NEW.start_date, NEW.end_date;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER validate_booking_overlap
BEFORE INSERT OR UPDATE ON bookings
FOR EACH ROW
EXECUTE FUNCTION prevent_booking_overlap();
Test the trigger in staging first. Try to insert overlapping bookings. Verify they’re rejected. Try valid bookings. Verify they succeed. Only then deploy to production.
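A quick staging smoke test might look like this (a sketch: room 999 and the far-future dates are placeholders chosen so they won’t collide with real data):
BEGIN;
-- Should succeed: nothing else is booked in room 999 for these dates.
INSERT INTO bookings VALUES (999, 9001, 'Smoke Test A', '2030-01-01', '2030-01-05');
-- Should fail with the trigger's 'Booking overlap detected' exception.
INSERT INTO bookings VALUES (999, 9002, 'Smoke Test B', '2030-01-03', '2030-01-08');
ROLLBACK;  -- leave staging exactly as we found it either way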
Phase 5: Testing (Week 3)
Write automated tests that run against production-like data:
import pytest
from datetime import date

# insert_booking and OverlapError come from your application's data-access layer.

def test_adjacent_overlap_rejected():
    """Adjacent overlapping bookings are rejected."""
    # Insert the first booking
    insert_booking(room_id=101, start=date(2024, 1, 1), end=date(2024, 1, 10))
    # A booking that overlaps it directly must be rejected
    with pytest.raises(OverlapError):
        insert_booking(room_id=101, start=date(2024, 1, 5), end=date(2024, 1, 7))

def test_non_adjacent_overlap_rejected():
    """Replay the Alice/Bob/Charlie incident: every overlapping insert is rejected."""
    # Alice's booking goes in cleanly
    insert_booking(room_id=101, start=date(2024, 1, 1), end=date(2024, 1, 10))
    # Bob overlaps Alice, so the trigger rejects him
    with pytest.raises(OverlapError):
        insert_booking(room_id=101, start=date(2024, 1, 5), end=date(2024, 1, 7))
    # Charlie overlaps Alice too, even though Bob never made it into the table
    with pytest.raises(OverlapError):
        insert_booking(room_id=101, start=date(2024, 1, 8), end=date(2024, 1, 12))
def test_different_rooms_allowed():
"""Test that same dates in different rooms work"""
insert_booking(room_id=101, start=date(2024, 1, 1), end=date(2024, 1, 10))
# Should succeed - different room
insert_booking(room_id=102, start=date(2024, 1, 1), end=date(2024, 1, 10))
Run these tests on every deployment. If they fail, the deployment fails. No exceptions. I’d even reference the original incident in a comment next to these tests, so that if anyone thinks about removing them later, they can see exactly why they exist.
What this looks like at scale
Everything I’ve described works at small scale. But what happens when you’re dealing with millions of bookings, thousands of rooms, and hundreds of transactions per second?
Performance considerations become real. That trigger checking for overlaps? It’s running a query on every insert. With proper indexing on (room_id, start_date, end_date), it’ll stay fast. Without indexing, you’ve got problems.
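For reference, the supporting index might look like this (a sketch; the exact shape depends on your query patterns):
-- Lets the trigger's per-room overlap check and the audit self-join use an index
-- scan instead of reading the whole table.
CREATE INDEX IF NOT EXISTS idx_bookings_room_dates
    ON bookings (room_id, start_date, end_date);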
Monitoring needs to scale too. You can’t run a full table scan every morning to check for overlaps. You need incremental checks, sampling strategies, or streaming validation.
Data cleanup gets complicated. You can’t manually review millions of records. You need automated classification, bulk correction tools, and clear escalation paths for edge cases.
Testing requires different strategies. Property-based testing becomes valuable. Generate random booking scenarios and verify overlaps are always caught. Run chaos engineering experiments where you deliberately try to break constraints.
But here’s the thing—the fundamentals don’t change. Database constraints still enforce truth. Tests still verify behavior. Monitoring still catches problems. Culture still determines whether people care.
The implementation details scale up, but the principles stay the same.
The post-mortem that actually matters
After you’ve fixed the bug, cleaned the data, and added constraints, you need a post-mortem. Not the kind where everyone pretends to be honest while actually covering their ass. The kind where you genuinely figure out what went wrong and how to prevent it.
Here’s what a good post-mortem covers:
What happened, precisely. Timeline of events. Who discovered the issue, when, how. What was the immediate impact. What systems were affected. Use facts, not blame.
Root cause analysis. Not “the engineer wrote bad code” (that’s always true). Dig deeper. Why did the code review miss it? Why didn’t tests catch it? Why wasn’t there a database constraint? Why didn’t monitoring alert us? Why weren’t business rules documented?
Contributing factors. Was the team under time pressure? Were requirements unclear? Did we lack domain expertise? Were we using unfamiliar technology? Did we skip steps in the normal process?
What worked. Don’t only focus on failures. What did work? How did we detect the problem? What allowed us to respond quickly? Who stepped up? What processes helped?
Action items with owners and deadlines. Vague “we should do better” doesn’t help. Specific “Jane will implement overlap detection tests by next Friday” does. Assign owners. Set deadlines. Follow up.
Distribution and discussion. Share the post-mortem widely. Engineering, product, support, leadership. Create space for people to ask questions and add context. Update it based on feedback.
The goal isn’t to make everyone feel bad. It’s to extract maximum learning from a painful experience so you don’t repeat it.
Building data quality into your DNA
Everything we’ve discussed—constraints, tests, monitoring, culture—needs to become reflexive, not something you remember to do when you have time.
This means changing how you approach new features. Before you write a single line of code, you ask:
- What are the business rules this needs to enforce?
- What invalid states need to be impossible?
- What edge cases could break this?
- How will we verify it works correctly?
- How will we know if it breaks in production?
This means changing how you do code reviews. You’re not just checking for syntax errors or performance issues. You’re asking:
- What data quality problems could this create?
- What constraints are missing?
- What tests would catch regressions?
- What monitoring would detect problems?
This means changing how you measure success. You don’t just track feature velocity or query performance. You track:
- Data quality metrics over time
- Time to detect data quality issues
- Time to resolve data quality issues
- Percentage of issues caught before production
- Customer impact of data quality incidents
This means changing your hiring and onboarding. New engineers learn about your data quality framework on day one. They see examples of past incidents and what was learned. They understand that data quality isn’t someone else’s job.
This means changing your documentation. Your data dictionary doesn’t just list column names and types. It documents business rules, constraints, validation logic, known edge cases, and monitoring approaches.
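One lightweight way to keep that documentation next to the data itself, at least in PostgreSQL, is a schema comment. A minimal sketch:
COMMENT ON TABLE bookings IS
    'One row per confirmed booking. Invariant: no two bookings for the same room may overlap (enforced by trigger validate_booking_overlap).';
COMMENT ON COLUMN bookings.end_date IS
    'Checkout date, treated as exclusive: a booking ending January 10 leaves January 10 free for the next check-in.';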
This means changing your architecture. You design systems with data quality in mind from the start. You choose tools that support validation. You build pipelines that can detect and quarantine bad data. You create feedback loops so quality issues surface quickly.
When you’re the person who has to fix this
Maybe you’re reading this because you’re dealing with a data quality crisis right now. You’ve discovered overlapping bookings, or duplicate transactions, or inconsistent user states, or some other nightmare. Everyone’s freaking out. Leadership wants answers. Customers are angry. And somehow, you’re the person who has to fix it.
First, breathe. This isn’t the end of the world, even though it feels like it. I’ve been there. Most experienced data professionals have been there. You’ll get through it.
Second, assess the scope quickly. How bad is it? How many records affected? How far back does it go? How many customers impacted? You need these numbers to make decisions.
Third, stabilize before you optimize. Stop the bleeding. Prevent new bad data from entering the system. That might mean temporarily disabling features, adding emergency validation, or manually reviewing submissions. Do what’s necessary to prevent the problem from growing.
Fourth, communicate clearly and often. Tell leadership what you know, what you don’t know, and what you’re doing about it. Tell affected customers what happened and how you’re resolving it. Tell your team what needs to happen and when.
Fifth, fix the immediate problem before solving the systemic issue. Get the data cleaned up, get customers taken care of, get operations stable. Then you can think about long-term improvements.
Sixth, document everything. What you found, what you did, what worked, what didn’t. This documentation becomes the foundation for your post-mortem and future prevention.
Finally, remember that mistakes don’t define you. How you respond does. Handle this professionally, learn from it, and implement systems to prevent recurrence. That’s what separates senior engineers from junior engineers—not avoiding mistakes, but responding to them effectively.
From nightmare to framework
The Airbnb double-booking scenario isn’t just a cautionary tale. It’s a template for thinking about data quality across every domain.
Maybe you’re working on financial transactions where duplicate charges cause customer disputes. Maybe you’re dealing with medical records where data inconsistencies impact patient care. Maybe you’re managing inventory where overselling creates fulfillment nightmares. Maybe you’re building user permissions where invalid states create security vulnerabilities.
The specifics change, but the framework stays the same:
Define your business rules clearly. What states should be impossible? What invariants must hold true? Document these explicitly.
Enforce rules at the database level. Make invalid states impossible to create. Use constraints, triggers, and checks liberally.
Test your assumptions thoroughly. Cover edge cases. Think adversarially. Verify that your logic actually prevents what you think it prevents.
Monitor continuously. Don’t wait for customers to discover problems. Detect issues before they impact operations.
Build a culture where data quality matters. Make it everyone’s job. Celebrate prevention. Learn from incidents. Invest in foundational work.
This approach doesn’t eliminate all data quality issues. Nothing does. But it dramatically reduces their frequency and impact. And when issues do occur, you’ve got the systems to detect and resolve them quickly.
The investment pays for itself the first time you don’t have a data quality crisis at 9 PM on Friday.
Moving forward with confidence
Data quality isn’t a project you complete. It’s a continuous practice that becomes part of how you build systems.
Start small. Pick your most critical business rule. Implement database constraints. Write tests. Add monitoring. Learn what works in your environment.
Expand gradually. Apply the same approach to other critical rules. Build momentum. Share successes. Show ROI.
Make it systematic. Create templates for new features. Establish patterns for common scenarios. Document your approaches.
Keep improving. When incidents occur, extract lessons. When new patterns emerge, update your framework. When technology changes, adapt your tools.
And remember—the goal isn’t perfection. It’s building systems that fail gracefully, detect problems quickly, and learn continuously.
Your customers won’t thank you for preventing crises they never knew about. Your boss might not notice the incidents that didn’t happen. But you’ll know. You’ll sleep better. You’ll respond faster when problems occur. You’ll build with confidence instead of fear.
That’s worth the investment.
Now go add some database constraints. Future you will thank you for it—probably at 9 PM on a Friday when you’re enjoying that second beer instead of fielding angry support calls.
Reference to code: https://github.com/krshillman/ghost_in_the_data_data_quality/blob/main/examples/2025-11-22-data-quality.sql
