Ghost in the data
  • Home
  • About
  • Posts
  • Topics
  • Resources
  • RSS
  • Posts
  • 2026
    • Talk
    • Brainstorming
    • Guerrilla Interview Guide
    • 2026 Strategy
    • Dimensional Modeling AWS
    • Duct Tape Data Engineer
    • AI Peer Reviewer
    • NBA Coach Lessons for Data Leaders
    • For Sooty
    • Healing Tables SCD2
    • WAP Iceberg Snowflake
    • The CSV Test Suite Nobody Writes
    • 12 Steps to Better Data Engineering
    • Your Data Model Isn't Broken Pt I
    • Your Friends Will Be There
    • Fix Your Data Without Permission
    • Your Data Model Isn't Broken Pt II
    • Stop Building Salesforce Integrations
    • Why Your Pipeline Finishes Later Every Month
    • Your Data Platform Costs More Than It Should
    • Five Worlds
    • The Broken Window
    • Don't Go Dark
  • 2025
    • UV Tools
    • Zsh Virtual Environments
    • Piracy Service Problem
    • 2025 Data Trends
    • Data Modeling Approaches
    • MacOS Dev Setup
    • Windows Dev Setup
    • Business Context Guide
    • Data Impact
    • Data Engineering Interviews
    • First 90 Days as Data Engineer
    • Senior to Staff Engineer
    • LLMs for Business Part 1
    • LLMs for Business Part 2
    • Mastering 1:1 Meetings
    • Data Quality Test
    • AI Prompting Secret
    • Conceptual Data Modeling
    • WAP Pattern for Data Pipelines
    • AI Simplified
    • dbt Fusion: The Engine Upgrade
    • Continuous Integration for Data Teams
    • Claude Code AI Agents
    • Clear Communication Superpower
    • Compliance vs Commitment
    • D&D Leadership
    • Reflective Best Self
    • Financial Independence
    • Dimensional Modeling Lives
    • Balancing Data Accessibility & Privacy
    • Data Quality Crisis
    • Data Quality Framework
    • AWS Data Pipeline
    • Invisible PR
    • AI's Twin Crises
  • 2024
    • Delta-lake
    • Data Normalisation
    • Data Profiling
    • Defensive Engineering
    • CI/CD
    • Setup Docker and Airflow
    • Find and Attract Data Engineers
    • 17 Years of Insights
    • Relationship Building
    • Individual Contributor
  • 2023
    • GitBash with SSH
    • Journalling
    • Minecraft Server in GCP
    • Onboarding a data team
    • File Format for Big Data
    • Incident Management
    • Data Vault
    • Books that are worth you time?
Hero Image
Don't Go Dark: Visibility Is a Data Engineering Skill

There’s a specific kind of silence in data engineering that I’ve learned to fear. Not the silence of a system that’s working well. Not the comfortable quiet of a team in flow. I mean the silence of a project that’s been running for three weeks and you still can’t point to a single visible thing it has produced. The kind of silence where, if your manager stopped you in the hallway and asked “how’s that migration going?”, you’d say “fine” because saying anything more accurate would require explaining things you haven’t fully articulated yet — even to yourself.

  • Communication
  • Data Quality
  • GitHub
  • dbt
  • Remote Work
  • Career Development
  • Engineering Culture
  • Psychological Safety
Saturday, May 23, 2026 Read
Hero Image
The Broken Window in Your Data Pipeline

There’s a particular kind of data problem that doesn’t announce itself. It accumulates. We were receiving Salesforce data through delta extraction — sensible in theory, because full snapshots can run to hundreds of terabytes and less than 1% of records change on any given day. The problem is that deltas require someone to know what “changed” means. In Salesforce, that’s less obvious than it sounds. Watch a last_modified column and you’ll miss objects that get updated when a related object changes, without their own timestamp reflecting it.

  • Data Quality
  • Data Pipelines
  • Technical Debt
  • Data Observability
  • dbt
  • Apache Airflow
  • Data Culture
  • Pipeline Architecture
Saturday, May 9, 2026 Read
Hero Image
Five Worlds of Data Engineering

You watch a conference talk about implementing data contracts, and nobody mentions that the advice assumes you have multiple teams producing data — which you don’t. You read a post declaring “if you’re still using stored procedures in 2026, you’re doing it wrong,” and the comments erupt. Half the people are nodding along. Half are furious. Both sides are right. They’re just living in different worlds and don’t realise it.

  • Data Engineering
  • Modern Data Stack
  • Enterprise
  • Data Architecture
  • Career Development
  • Leadership
Saturday, May 2, 2026 Read
Hero Image
Your Data Platform Costs More Than It Should

Let me tell you about the moment I stopped treating cloud costs as someone else’s problem. We were three months into a Snowflake migration. Everything was humming. Pipelines were green, dashboards were fast, the analytics team was happier than I’d seen them before. I felt good about the work we’d done. Then finance forwarded me the invoice. The number wasn’t catastrophic. But it was significantly higher than what we’d budgeted, and when I started digging, I couldn’t explain where most of it was going. I knew we had warehouses running. I knew we had pipelines executing. But I couldn’t tell you which warehouse was responsible for what cost, which pipelines were the expensive ones, or whether the money was well spent. I had built a platform I was proud of — and I had no idea what it actually cost to operate.

  • Snowflake
  • AWS
  • Cost Optimization
  • FinOps
  • dbt
  • Data Platform
Saturday, April 25, 2026 Read
Hero Image
Why Your Pipeline Finishes Later Every Month

Let me tell you about a graph that changed how I think about data engineering. A junior engineer on my team — let’s call her Priya — had been tracking something nobody asked her to track. Every morning for two months, she’d noted the timestamp when our main analytics pipeline completed. She wasn’t trying to make a point. She was just curious, because the finance team kept mentioning their dashboards weren’t ready when they arrived at 8 AM anymore.

  • Snowflake
  • AWS
  • Airflow
  • Pipeline Optimization
  • dbt
  • Data Freshness
Saturday, April 18, 2026 Read
Hero Image
Stop Building Salesforce Integrations From Scratch

Let me tell you about Marcus. Marcus was on a team I led a few years back. Sharp, motivated, the kind of engineer who actually read documentation before writing code. When the business asked us to get Salesforce data into our warehouse, Marcus volunteered. He’d done API work before. He figured a few weeks, tops. He scoped it carefully. Built a Python service that authenticated via OAuth, pulled Account, Contact, and Opportunity objects through the Bulk API, flattened the nested JSON into relational tables, handled pagination, managed rate limits. Wrote solid tests. Documented everything. The kind of work you’d point to in a code review and say this is how it’s done.

  • Data Engineering
  • Snowflake
  • OpenFlow
  • Salesforce
  • API Integration
  • Schema Evolution
  • Fivetran
  • Data Pipelines
Saturday, April 4, 2026 Read
Hero Image
Your Data Model Isn't Broken, Part II: The Refactoring Playbook

In [Part I], I made the case that your legacy data model isn’t the disaster it looks like. That the strange WHERE clauses, the bridge tables nobody can explain, and the slowly-changing-dimension-within-a-slowly-changing-dimension aren’t bugs — they’re business rules earned through years of production reality. I argued that big-bang rebuilds fail at alarming rates, that the complexity you’re fighting is mostly essential rather than accidental, and that the impulse to “start from scratch” is driven more by cognitive bias than by engineering judgment.

  • Data Engineering
  • Refactoring
  • Data Warehousing
  • dbt
  • Snowflake
  • Apache Iceberg
  • Write-Audit-Publish
  • Strangler Fig
  • Data Quality
Saturday, March 28, 2026 Read
Hero Image
You Don't Need Permission to Fix Your Data

Let me tell you about a junior engineer called Sam. Sam had been on the team about four months when I noticed something in a pull request. Tucked between two routine model changes was a new schema.yml entry — five accepted_values tests on a column called customer_status that had been silently accumulating fourteen different spellings of “active” for the better part of a year. Nobody asked Sam to do this. It wasn’t in a sprint. There was no Jira ticket. Sam had just been working in that part of the warehouse, noticed the mess, and decided to clean it up on the way through.

  • Data Quality
  • dbt
  • SQL
  • Testing
  • Documentation
  • Junior Engineer
  • Career Growth
  • Psychological Safety
Saturday, March 21, 2026 Read
Hero Image
Your Friends Will Be There for You. Your Work Won't.

There’s a table in a rented house in Queenscliff — nothing fancy, just whatever furniture came with the place — with a hand-drawn map spread across it, dice scattered at the edges, and snacks slowly migrating toward the centre of the board. Once a year, Steve, one of our friends, organises for us to make a trip down to Queenscliff to play tabletop RPGs for a weekend. More recently, we’ve started playing monthly or fortnightly at home too. Nothing glamorous about any of it. We’re not Glass Cannon Network or Critical Role — no cameras, no production values, no audience watching us roleplay with solemn gravity.

  • Leadership
  • Career Development
  • Mental Health
  • Burnout
  • Wellbeing
  • Friendship
  • Work-Life Balance
Wednesday, March 18, 2026 Read
Hero Image
Your Data Model Isn't Broken, Part I: Why Refactoring Beats Rebuilding

In the early 2000’s - Netscape’s decision to rewrite their browser from scratch was the single worst strategic mistake a software company could make. At the time, Netscape was winning. They had the dominant browser. They had market share. They had momentum. And then they decided the codebase was too messy, too tangled, too hard to work with — so they threw it all away and started over. Navigator 4.0 became the foundation for a rewrite that would eventually ship as version 6.0. There was no 5.0. Three years of development. No shipping product. And while Netscape’s engineers were busy building their beautiful new browser in a vacuum, Internet Explorer ate their lunch, their dinner, and most of their market share.

  • Data Engineering
  • Refactoring
  • Data Warehousing
  • Technical Debt
  • Snowflake
  • dbt
  • Legacy Systems
  • Data Quality
Saturday, March 14, 2026 Read
Hero Image
12 Steps to Better Data Engineering

Let me tell you about the moment I stopped trusting architecture diagrams. I was three days into a new role, getting up to speed with the data team. Smart people. Modern stack. On paper, everything looked right. They walked me through a beautiful data platform diagram: clean lines, labelled layers, colour-coded domains. It looked like something you’d see in a data conference. Then I asked a question that changed everything: “Can you rebuild your finance table from scratch right now?”

  • Data Engineering
  • dbt
  • Snowflake
  • GitHub Actions
  • AWS
  • Data Quality
  • CI/CD
  • Data Contracts
Saturday, March 7, 2026 Read
Hero Image
The CSV Test Suite Nobody Writes

In October 2020, roughly 16,000 positive COVID-19 test results vanished from the UK’s public health reporting for nearly a week. Not because the tests weren’t run. Not because the labs didn’t report them. The results were collected, transmitted, and received — inside CSV files. The problem? Public Health England was importing those CSV files into Microsoft Excel’s legacy .xls format. The format has a hard row limit of 65,536. When the files grew past that limit, Excel didn’t throw an error. It didn’t warn anyone. It just silently dropped the extra rows. Sixteen thousand people who tested positive for a deadly virus during a second wave went untraced. An estimated 50,000 of their contacts were never notified. And the system this happened in? Part of a £12 billion Test and Trace programme.

  • CSV
  • Data Quality
  • Testing
  • RFC 4180
  • Python
  • Data Engineering
  • Data Pipelines
Wednesday, March 4, 2026 Read
  • ««
  • «
  • 1
  • 2
  • »
  • »»