Ghost in the data
Stop Building Salesforce Integrations From Scratch

Let me tell you about Marcus. Marcus was on a team I led a few years back. Sharp, motivated, the kind of engineer who actually read documentation before writing code. When the business asked us to get Salesforce data into our warehouse, Marcus volunteered. He’d done API work before. He figured a few weeks, tops. He scoped it carefully. Built a Python service that authenticated via OAuth, pulled Account, Contact, and Opportunity objects through the Bulk API, flattened the nested JSON into relational tables, handled pagination, managed rate limits. Wrote solid tests. Documented everything. The kind of work you’d point to in a code review and say this is how it’s done.
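The flattening step in a pipeline like Marcus's fits in a few lines. This is an illustrative sketch, not his actual code — the function name and the double-underscore separator are arbitrary choices:

```python
def flatten(record, parent_key="", sep="__"):
    """Recursively flatten a nested dict (e.g. a Salesforce record with
    nested relationship fields) into a single-level dict whose keys map
    cleanly onto relational column names."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing with the parent path.
            items.update(flatten(value, new_key, sep))
        else:
            # Scalars (and lists) pass through unchanged.
            items[new_key] = value
    return items
```

A record like `{"Id": "001", "Account": {"Name": "Acme"}}` comes out as `{"Id": "001", "Account__Name": "Acme"}`, ready to load into a flat table.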

  • Data Engineering
  • Snowflake
  • OpenFlow
  • Salesforce
  • API Integration
  • Schema Evolution
  • Fivetran
  • Data Pipelines
Saturday, April 4, 2026
Your Data Model Isn't Broken, Part II: The Refactoring Playbook

In Part I, I made the case that your legacy data model isn’t the disaster it looks like. That the strange WHERE clauses, the bridge tables nobody can explain, and the slowly-changing-dimension-within-a-slowly-changing-dimension aren’t bugs — they’re business rules earned through years of production reality. I argued that big-bang rebuilds fail at alarming rates, that the complexity you’re fighting is mostly essential rather than accidental, and that the impulse to “start from scratch” is driven more by cognitive bias than by engineering judgment.

  • Data Engineering
  • Refactoring
  • Data Warehousing
  • dbt
  • Snowflake
  • Apache Iceberg
  • Write-Audit-Publish
  • Strangler Fig
  • Data Quality
Saturday, March 28, 2026
You Don't Need Permission to Fix Your Data

Let me tell you about a junior engineer called Sam. Sam had been on the team about four months when I noticed something in a pull request. Tucked between two routine model changes was a new schema.yml entry — five accepted_values tests on a column called customer_status that had been silently accumulating fourteen different spellings of “active” for the better part of a year. Nobody asked Sam to do this. It wasn’t in a sprint. There was no Jira ticket. Sam had just been working in that part of the warehouse, noticed the mess, and decided to clean it up on the way through.
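A check like Sam’s is only a few lines of dbt YAML. A hedged sketch — the model name, file path, and accepted values here are illustrative, not the ones from Sam’s actual PR:

```yaml
# models/staging/schema.yml (illustrative path and names)
version: 2

models:
  - name: stg_customers          # hypothetical model name
    columns:
      - name: customer_status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'pending', 'suspended', 'churned']
```

Once merged, any fifteenth spelling of “active” fails the build instead of accumulating silently.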

  • Data Quality
  • dbt
  • SQL
  • Testing
  • Documentation
  • Junior Engineer
  • Career Growth
  • Psychological Safety
Saturday, March 21, 2026
Your Friends Will Be There for You. Your Work Won't.

There’s a table in a rented house in Queenscliff — nothing fancy, just whatever furniture came with the place — with a hand-drawn map spread across it, dice scattered at the edges, and snacks slowly migrating toward the centre of the board. Once a year, Steve, one of our friends, organises for us to make a trip down to Queenscliff to play tabletop RPGs for a weekend. More recently, we’ve started playing monthly or fortnightly at home too. Nothing glamorous about any of it. We’re not Glass Cannon Network or Critical Role — no cameras, no production values, no audience watching us roleplay with solemn gravity.

  • Leadership
  • Career Development
  • Mental Health
  • Burnout
  • Wellbeing
  • Friendship
  • Work-Life Balance
Wednesday, March 18, 2026
Your Data Model Isn't Broken, Part I: Why Refactoring Beats Rebuilding

In the early 2000s, Netscape’s decision to rewrite their browser from scratch proved to be the single worst strategic mistake a software company could make. At the time, Netscape was winning. They had the dominant browser. They had market share. They had momentum. And then they decided the codebase was too messy, too tangled, too hard to work with — so they threw it all away and started over. Navigator 4.0 was abandoned, and the rewrite eventually shipped as version 6.0. There was no 5.0. Three years of development. No shipping product. And while Netscape’s engineers were busy building their beautiful new browser in a vacuum, Internet Explorer ate their lunch, their dinner, and most of their market share.

  • Data Engineering
  • Refactoring
  • Data Warehousing
  • Technical Debt
  • Snowflake
  • dbt
  • Legacy Systems
  • Data Quality
Saturday, March 14, 2026
12 Steps to Better Data Engineering

Let me tell you about the moment I stopped trusting architecture diagrams. I was three days into a new role, getting up to speed with the data team. Smart people. Modern stack. On paper, everything looked right. They walked me through a beautiful data platform diagram: clean lines, labelled layers, colour-coded domains. It looked like something you’d see in a data conference. Then I asked a question that changed everything: “Can you rebuild your finance table from scratch right now?”

  • Data Engineering
  • dbt
  • Snowflake
  • GitHub Actions
  • AWS
  • Data Quality
  • CI/CD
  • Data Contracts
Saturday, March 7, 2026
The CSV Test Suite Nobody Writes

In October 2020, roughly 16,000 positive COVID-19 test results vanished from the UK’s public health reporting for nearly a week. Not because the tests weren’t run. Not because the labs didn’t report them. The results were collected, transmitted, and received — inside CSV files. The problem? Public Health England was importing those CSV files into Microsoft Excel’s legacy .xls format. The format has a hard row limit of 65,536. When the files grew past that limit, Excel didn’t throw an error. It didn’t warn anyone. It just silently dropped the extra rows. Sixteen thousand people who tested positive for a deadly virus during a second wave went untraced. An estimated 50,000 of their contacts were never notified. And the system this happened in? Part of a £12 billion Test and Trace programme.
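The guard against exactly this failure is a handful of lines. A minimal sketch — the function name is illustrative, and it assumes the file is small enough to stream through Python’s csv module:

```python
import csv

# Hard row limit of the legacy .xls format that silently truncated
# Public Health England's case files.
XLS_ROW_LIMIT = 65_536

def check_row_count(path, limit=XLS_ROW_LIMIT):
    """Count rows (header included) and fail loudly, instead of letting a
    downstream tool drop the overflow without a word."""
    with open(path, newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    if rows > limit:
        raise ValueError(f"{path}: {rows} rows exceeds the {limit}-row limit")
    return rows
```

A test like this in the hand-off pipeline would have turned a week of silent data loss into an immediate, visible failure.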

  • CSV
  • Data Quality
  • Testing
  • RFC 4180
  • Python
  • Data Engineering
  • Data Pipelines
Wednesday, March 4, 2026
Write-Audit-Publish with Iceberg Tables in Snowflake

It was a Tuesday afternoon when the analyst pinged me on Microsoft Teams: “Hey, the Total Portfolio numbers just jumped 40% overnight. Did we land a whale?” We hadn’t. What actually happened was more mundane and significantly more painful. A schema change in the source system introduced a currency conversion bug. Our pipeline dutifully loaded the corrupted data into production at 3 AM, the dashboards updated by 6 AM, and the Department Head opened her morning report to numbers that looked like champagne-worthy growth.

  • Apache Iceberg
  • Snowflake
  • WAP Pattern
  • Data Quality
  • SQL
  • Lakehouse
  • Data Pipelines
  • Best Practices
Friday, February 27, 2026
Healing Tables: When Day-by-Day Backfills Become a Slow-Motion Disaster

It was 2 AM on a Saturday when I realized we’d been loading data wrong for six months. The situation: a customer dimension with three years of history needed to be backfilled after a source system migration. The previous team’s approach was straightforward—run the daily incremental process 1,095 times, once for each day of history. They estimated three weeks to complete. What they hadn’t accounted for was how errors compound. By the time I looked at the data, we had 47,000 records with overlapping date ranges, 12,000 timeline gaps where customers seemed to vanish and reappear, and an unknowable number of missed changes from when source systems updated the same record multiple times in a single day.

  • SCD
  • Historical Load
  • dbt
  • SQL
  • Data Quality
  • Dimensional Modeling
  • Delta Lake
  • Best Practices
Saturday, February 7, 2026
For Sooty

This one isn’t about data pipelines. There’s no framework, no architecture diagram, no code snippet at the end. Yesterday I said goodbye to my best friend. Sooty was a miniature schnauzer — nine kilograms of stubbornness, loyalty, and heart. Born October 2011. Gone February 2026. Fourteen years that changed the shape of everything. I don’t have the right words for this. I’m not sure anyone does. But I tried to write something that comes close, and I wanted to share it here — because this is my corner of the internet, and she deserves a place in it.

  • Personal
  • Life
  • Loss
  • Grief
Tuesday, February 3, 2026
What an NBA Coach Can Teach Data Leaders About Building Teams That Actually Work

I was three hours into a retrospective that had devolved into blame-shifting when the most senior engineer on the team finally spoke up. “Look,” he said, “we can keep pointing fingers at the data model, or we can admit we don’t actually trust each other enough to have an honest conversation about what went wrong.” The room went quiet. He was right. That moment stuck with me because it exposed something I’ve seen destroy more data teams than bad architecture ever could: the absence of genuine connection between people who spend forty-plus hours a week depending on each other.

  • Leadership
  • Team Building
  • Culture
  • Management
  • Data Teams
  • Remote Work
  • Psychological Safety
Monday, February 2, 2026
Context Engineering: The New Must-Have Skill for Data Engineers

Last year I watched a colleague ask AI to help write a dbt model. The AI spit out perfectly functional SQL—clean syntax, proper CTEs, the works. Looked great. Then I noticed the table would eventually hold 800 million rows. No partitioning. No clustering. Just a raw, unoptimised heap waiting to turn into a query performance nightmare (that would likely become my nightmare to fix). The engineer wasn’t at fault. The AI wasn’t at fault either, really. The AI simply didn’t know that our environment clusters large tables by date. It didn’t know our team’s conventions around incremental models. It couldn’t know, because nobody had told it.

  • AI
  • dbt
  • Data Quality
  • SQL
  • Productivity
  • VSCode
  • Claude
Saturday, January 31, 2026