Data Quality

Ghost Skills: Teaching AI Agents to Think Like Data Engineers

Another week, another skills repo on the GitHub trending page. I know. There are roughly seventeen of them now, all promising to turn your AI coding agent from a confident intern into a slightly-less-confident intern. Most of them are great. Most of them are also built by solo devs, for solo devs, on solo-dev codebases that fit comfortably in a context window. Which is fine, if that’s your world. Less fine if your world involves a Snowflake warehouse with four tables that could be the source of truth for “customer”, an SCD2 someone half-built in 2021 and quietly walked away from, and a dbt project where stg_users_final_v3_actually_use_this is, somehow, the one you’re meant to use. (Don’t laugh. You’ve seen worse.)

Sunday, June 14, 2026 Read

SQL Tells You What. Comments Tell You Why.

The best SQL doesn’t need comments. Write meaningful CTE names, descriptive aliases, clear column labels — and a skilled reader will follow your logic without a single annotation. That’s the right instinct. It’s also only half right. SQL is a declarative language. You’re not writing how the database retrieves your data; you’re writing what you want. That’s a useful distinction, because “what” and “why” are very different questions, and SQL can answer exactly one of them.

Saturday, June 6, 2026 Read

Don't Go Dark: Visibility Is a Data Engineering Skill

There’s a specific kind of silence in data engineering that I’ve learned to fear. Not the silence of a system that’s working well. Not the comfortable quiet of a team in flow. I mean the silence of a project that’s been running for three weeks and you still can’t point to a single visible thing it has produced. The kind of silence where, if your manager stopped you in the hallway and asked “how’s that migration going?”, you’d say “fine” because saying anything more accurate would require explaining things you haven’t fully articulated yet — even to yourself.

Saturday, May 23, 2026 Read

The Broken Window in Your Data Pipeline

There’s a particular kind of data problem that doesn’t announce itself. It accumulates. We were receiving Salesforce data through delta extraction — sensible in theory, because full snapshots can run to hundreds of terabytes and less than 1% of records change on any given day. The problem is that deltas require someone to know what “changed” means. In Salesforce, that’s less obvious than it sounds. Watch a last_modified column and you’ll miss objects that get updated when a related object changes, without their own timestamp reflecting it.

Saturday, May 9, 2026 Read

Your Data Model Isn't Broken, Part II: The Refactoring Playbook

In [Part I], I made the case that your legacy data model isn’t the disaster it looks like. That the strange WHERE clauses, the bridge tables nobody can explain, and the slowly-changing-dimension-within-a-slowly-changing-dimension aren’t bugs — they’re business rules earned through years of production reality. I argued that big-bang rebuilds fail at alarming rates, that the complexity you’re fighting is mostly essential rather than accidental, and that the impulse to “start from scratch” is driven more by cognitive bias than by engineering judgment.

Saturday, March 28, 2026 Read

You Don't Need Permission to Fix Your Data

Let me tell you about a junior engineer called Sam. Sam had been on the team about four months when I noticed something in a pull request. Tucked between two routine model changes was a new schema.yml entry — five accepted_values tests on a column called customer_status that had been silently accumulating fourteen different spellings of “active” for the better part of a year. Nobody asked Sam to do this. It wasn’t in a sprint. There was no Jira ticket. Sam had just been working in that part of the warehouse, noticed the mess, and decided to clean it up on the way through.

Saturday, March 21, 2026 Read

Your Data Model Isn't Broken, Part I: Why Refactoring Beats Rebuilding

In the early 2000’s - Netscape’s decision to rewrite their browser from scratch was the single worst strategic mistake a software company could make. At the time, Netscape was winning. They had the dominant browser. They had market share. They had momentum. And then they decided the codebase was too messy, too tangled, too hard to work with — so they threw it all away and started over. Navigator 4.0 became the foundation for a rewrite that would eventually ship as version 6.0. There was no 5.0. Three years of development. No shipping product. And while Netscape’s engineers were busy building their beautiful new browser in a vacuum, Internet Explorer ate their lunch, their dinner, and most of their market share.

Saturday, March 14, 2026 Read

12 Steps to Better Data Engineering

Let me tell you about the moment I stopped trusting architecture diagrams. I was three days into a new role, getting up to speed with the data team. Smart people. Modern stack. On paper, everything looked right. They walked me through a beautiful data platform diagram: clean lines, labelled layers, colour-coded domains. It looked like something you’d see in a data conference. Then I asked a question that changed everything: “Can you rebuild your finance table from scratch right now?”

Saturday, March 7, 2026 Read

The CSV Test Suite Nobody Writes

In October 2020, roughly 16,000 positive COVID-19 test results vanished from the UK’s public health reporting for nearly a week. Not because the tests weren’t run. Not because the labs didn’t report them. The results were collected, transmitted, and received — inside CSV files. The problem? Public Health England was importing those CSV files into Microsoft Excel’s legacy .xls format. The format has a hard row limit of 65,536. When the files grew past that limit, Excel didn’t throw an error. It didn’t warn anyone. It just silently dropped the extra rows. Sixteen thousand people who tested positive for a deadly virus during a second wave went untraced. An estimated 50,000 of their contacts were never notified. And the system this happened in? Part of a £12 billion Test and Trace programme.

Wednesday, March 4, 2026 Read

Write-Audit-Publish with Iceberg Tables in Snowflake

It was a Tuesday afternoon when the analyst pinged me on Microsoft Teams: “Hey, the Total Portfolio numbers just jumped 40% overnight. Did we land a whale?” We hadn’t. What actually happened was more mundane and significantly more painful. A schema change in the source system introduced a currency conversion bug. Our pipeline dutifully loaded the corrupted data into production at 3 AM, the dashboards updated by 6 AM, and the Department Head opened her morning report to numbers that looked like champagne-worthy growth.

Friday, February 27, 2026 Read

Healing Tables: When Day-by-Day Backfills Become a Slow-Motion Disaster

It was 2 AM on a Saturday when I realized we’d been loading data wrong for six months. The situation: a customer dimension with three years of history needed to be backfilled after a source system migration. The previous team’s approach was straightforward—run the daily incremental process 1,095 times, once for each day of history. They estimated three weeks to complete. What they hadn’t accounted for was how errors compound. By the time I looked at the data, we had 47,000 records with overlapping date ranges, 12,000 timeline gaps where customers seemed to vanish and reappear, and an unknowable number of missed changes from when source systems updated the same record multiple times in a single day.

Saturday, February 7, 2026 Read

Context Engineering: The New Must-Have Skill for Data Engineers

Last year I watched a colleague ask AI to help write a dbt model. The AI spit out perfectly functional SQL—clean syntax, proper CTEs, the works. Looked great. Then I noticed the table would eventually hold 800 million rows. No partitioning. No clustering. Just a raw, unoptimised heap waiting to turn into a query performance nightmare (that would likely become my nightmare to fix). The engineer wasn’t at fault. The AI wasn’t at fault either, really. The AI simply didn’t know that our environment clusters large tables by date. It didn’t know our team’s conventions around incremental models. It couldn’t know, because nobody had told it.

Saturday, January 31, 2026 Read