Ghost in the data
The Broken Window in Your Data Pipeline

There’s a particular kind of data problem that doesn’t announce itself. It accumulates. We were receiving Salesforce data through delta extraction — sensible in theory, because full snapshots can run to hundreds of terabytes and less than 1% of records change on any given day. The problem is that deltas require someone to know what “changed” means. In Salesforce, that’s less obvious than it sounds. Watch a last_modified column and you’ll miss objects that get updated when a related object changes, without their own timestamp reflecting it.
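
To make the trap concrete, here is the kind of watermark filter a delta extraction typically leans on (table and column names are hypothetical):

    -- Naive delta extraction: only sees rows whose own timestamp moved.
    SELECT *
    FROM salesforce.opportunity
    WHERE last_modified_date > :last_watermark;
    -- A record updated as a side effect of a change to a related object
    -- can keep its old last_modified_date and slip through this filter.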

  • Data Quality
  • Data Pipelines
  • Technical Debt
  • Data Observability
  • dbt
  • Apache Airflow
  • Data Culture
  • Pipeline Architecture
Saturday, May 9, 2026
Your Data Platform Costs More Than It Should

Let me tell you about the moment I stopped treating cloud costs as someone else’s problem. We were three months into a Snowflake migration. Everything was humming. Pipelines were green, dashboards were fast, the analytics team was happier than I’d seen them before. I felt good about the work we’d done. Then finance forwarded me the invoice. The number wasn’t catastrophic. But it was significantly higher than what we’d budgeted, and when I started digging, I couldn’t explain where most of it was going. I knew we had warehouses running. I knew we had pipelines executing. But I couldn’t tell you which warehouse was responsible for what cost, which pipelines were the expensive ones, or whether the money was well spent. I had built a platform I was proud of — and I had no idea what it actually cost to operate.
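
If the platform is Snowflake, one place to start untangling that question is the account usage views. A minimal sketch of credits by warehouse over the last 30 days:

    -- Which warehouses are burning the credits?
    SELECT warehouse_name,
           SUM(credits_used) AS credits_last_30d
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits_last_30d DESC;

Attributing credits to individual pipelines takes more work (query tags, per-team warehouses), but this one query ends the "no idea where it goes" phase.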

  • Snowflake
  • AWS
  • Cost Optimization
  • FinOps
  • dbt
  • Data Platform
Saturday, April 25, 2026
Why Your Pipeline Finishes Later Every Month

Let me tell you about a graph that changed how I think about data engineering. A junior engineer on my team — let’s call her Priya — had been tracking something nobody asked her to track. Every morning for two months, she’d noted the timestamp when our main analytics pipeline completed. She wasn’t trying to make a point. She was just curious, because the finance team kept mentioning their dashboards weren’t ready when they arrived at 8 AM anymore.
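
Priya's graph needs no special tooling. If the pipeline runs on Airflow, the metadata database already holds every completion timestamp (the DAG id below is hypothetical):

    -- Daily completion time for one DAG, from Airflow's dag_run table.
    SELECT CAST(end_date AS DATE) AS run_day,
           MAX(end_date)          AS finished_at
    FROM dag_run
    WHERE dag_id = 'main_analytics'
      AND state = 'success'
    GROUP BY CAST(end_date AS DATE)
    ORDER BY run_day;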

  • Snowflake
  • AWS
  • Airflow
  • Pipeline Optimization
  • dbt
  • Data Freshness
Saturday, April 18, 2026
Your Data Model Isn't Broken, Part II: The Refactoring Playbook

In [Part I], I made the case that your legacy data model isn’t the disaster it looks like. That the strange WHERE clauses, the bridge tables nobody can explain, and the slowly-changing-dimension-within-a-slowly-changing-dimension aren’t bugs — they’re business rules earned through years of production reality. I argued that big-bang rebuilds fail at alarming rates, that the complexity you’re fighting is mostly essential rather than accidental, and that the impulse to “start from scratch” is driven more by cognitive bias than by engineering judgment.

  • Data Engineering
  • Refactoring
  • Data Warehousing
  • dbt
  • Snowflake
  • Apache Iceberg
  • Write-Audit-Publish
  • Strangler Fig
  • Data Quality
Saturday, March 28, 2026
You Don't Need Permission to Fix Your Data

Let me tell you about a junior engineer called Sam. Sam had been on the team about four months when I noticed something in a pull request. Tucked between two routine model changes was a new schema.yml entry — five accepted_values tests on a column called customer_status that had been silently accumulating fourteen different spellings of “active” for the better part of a year. Nobody asked Sam to do this. It wasn’t in a sprint. There was no Jira ticket. Sam had just been working in that part of the warehouse, noticed the mess, and decided to clean it up on the way through.
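
Sam's guard lived in schema.yml as accepted_values tests. The same check can also be written as a dbt singular test, a SQL file that fails the build if it returns any rows (the model name and status values here are illustrative):

    -- tests/customer_status_accepted_values.sql
    SELECT customer_status
    FROM {{ ref('dim_customers') }}
    WHERE customer_status NOT IN
          ('active', 'inactive', 'pending', 'churned', 'suspended')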

  • Data Quality
  • dbt
  • SQL
  • Testing
  • Documentation
  • Junior Engineer
  • Career Growth
  • Psychological Safety
Saturday, March 21, 2026
Your Data Model Isn't Broken, Part I: Why Refactoring Beats Rebuilding

In the early 2000s, Netscape’s decision to rewrite their browser from scratch was the single worst strategic mistake a software company could make. At the time, Netscape was winning. They had the dominant browser. They had market share. They had momentum. And then they decided the codebase was too messy, too tangled, too hard to work with — so they threw it all away and started over. After Navigator 4.0, they began a rewrite that would eventually ship as version 6.0. There was no 5.0. Three years of development. No shipping product. And while Netscape’s engineers were busy building their beautiful new browser in a vacuum, Internet Explorer ate their lunch, their dinner, and most of their market share.

  • Data Engineering
  • Refactoring
  • Data Warehousing
  • Technical Debt
  • Snowflake
  • dbt
  • Legacy Systems
  • Data Quality
Saturday, March 14, 2026
12 Steps to Better Data Engineering

Let me tell you about the moment I stopped trusting architecture diagrams. I was three days into a new role, getting up to speed with the data team. Smart people. Modern stack. On paper, everything looked right. They walked me through a beautiful data platform diagram: clean lines, labelled layers, colour-coded domains. It looked like something you’d see in a data conference. Then I asked a question that changed everything: “Can you rebuild your finance table from scratch right now?”
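
In a healthy dbt project, that question has a roughly one-line answer (the model name is hypothetical):

    dbt build --select +fct_finance --full-refresh

The + pulls in the model's entire upstream graph, and --full-refresh rebuilds incremental models from scratch. If nobody on the team is confident running it, the diagram and the platform have drifted apart.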

  • Data Engineering
  • dbt
  • Snowflake
  • GitHub Actions
  • AWS
  • Data Quality
  • CI/CD
  • Data Contracts
Saturday, March 7, 2026
Healing Tables: When Day-by-Day Backfills Become a Slow-Motion Disaster

It was 2 AM on a Saturday when I realized we’d been loading data wrong for six months. The situation: a customer dimension with three years of history needed to be backfilled after a source system migration. The previous team’s approach was straightforward—run the daily incremental process 1,095 times, once for each day of history. They estimated three weeks to complete. What they hadn’t accounted for was how errors compound. By the time I looked at the data, we had 47,000 records with overlapping date ranges, 12,000 timeline gaps where customers seemed to vanish and reappear, and an unknowable number of missed changes from when source systems updated the same record multiple times in a single day.
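
Overlaps like those are easy to detect once you go looking. A sketch, assuming an SCD Type 2 table with valid_from/valid_to columns:

    -- Flag customer versions whose validity overlaps the next version.
    SELECT *
    FROM (
        SELECT customer_id, valid_from, valid_to,
               LEAD(valid_from) OVER (
                   PARTITION BY customer_id
                   ORDER BY valid_from
               ) AS next_valid_from
        FROM dim_customer
    ) AS versions
    WHERE valid_to > next_valid_from;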

  • SCD
  • Historical Load
  • dbt
  • SQL
  • Data Quality
  • Dimensional Modeling
  • Delta Lake
  • Best Practices
Saturday, February 7, 2026
Context Engineering: The New Must-Have Skill for Data Engineers

Last year I watched a colleague ask AI to help write a dbt model. The AI spit out perfectly functional SQL—clean syntax, proper CTEs, the works. Looked great. Then I noticed the table would eventually hold 800 million rows. No partitioning. No clustering. Just a raw, unoptimised heap waiting to turn into a query performance nightmare (that would likely become my nightmare to fix). The engineer wasn’t at fault. The AI wasn’t at fault either, really. The AI simply didn’t know that our environment clusters large tables by date. It didn’t know our team’s conventions around incremental models. It couldn’t know, because nobody had told it.
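
For scale, "our environment clusters large tables by date" is a one-line config in a dbt-on-Snowflake setup, which is exactly what the generated model lacked (names are illustrative):

    -- models/fct_events.sql
    {{ config(
        materialized = 'incremental',
        unique_key   = 'event_id',
        cluster_by   = ['event_date']
    ) }}

    SELECT event_id, event_date, payload
    FROM {{ ref('stg_events') }}
    {% if is_incremental() %}
    WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
    {% endif %}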

  • AI
  • dbt
  • Data Quality
  • SQL
  • Productivity
  • VSCode
  • Claude
Saturday, January 31, 2026
The Guerrilla Guide to Data Engineering Interviews

The Scenario That Changes Everything

Picture this: You’re sitting in an interview room—or more likely these days, staring at a Zoom window with your carefully curated bookshelf background—and the interviewer asks you about data quality. “Tell me about your experience with data quality,” they say. You have two choices. Choice A: “Data quality is really important in data engineering. It involves ensuring data is accurate, complete, consistent, and timely. I believe strongly in implementing data quality checks throughout the pipeline.”

  • Interviews
  • Career Growth
  • Technical Assessment
  • SQL
  • Data Modeling
  • Problem Solving
  • Delta Lake
  • dbt
  • Data Quality
Sunday, January 11, 2026
Building AI Agents with Claude Code

Introduction

Imagine you’re reviewing a pull request with dozens of SQL files, each containing complex queries for your data pipeline. You spot inconsistent formatting, or syntax that doesn’t work with your infrastructure. Sound familiar? Data professionals commonly struggle to maintain consistent SQL standards across their projects, especially when working with specialised platforms, and reviewing these details in a peer review is time-consuming. That time would be better spent on the hard thinking, like the logic itself. But these small syntax and style issues can be distracting. Well, at least they are for me.

  • Claude-Code
  • Sql-Agents
  • Starburst
  • Delta Lake
  • Trino
  • Sql-Validation
  • dbt
  • Data Engineering
  • AI Tools
  • VSCode
Saturday, September 13, 2025
Continuous Integration for Data Teams: Beyond the Buzzwords

The Day Everything Broke (And How CI Could Have Saved Us)

Picture this: It’s 9 AM on a Monday, and your Slack is exploding. The executive dashboard is showing impossible numbers. Customer support is fielding complaints about incorrect billing amounts. The marketing team is questioning why their conversion metrics suddenly dropped to zero. You trace it back to a seemingly innocent change you merged Friday afternoon—a simple column rename that seemed harmless enough. But that “harmless” change cascaded through your entire data pipeline, breaking downstream models, dashboards, and automated reports.
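
This is the failure mode dbt's "Slim CI" pattern exists for: on every pull request, build only the modified models plus everything downstream of them, deferring unchanged models to production artifacts. Roughly:

    dbt build --select state:modified+ --defer --state path/to/prod-artifacts

Run in CI, the renamed column fails its downstream models on Friday's pull request instead of in production on Monday morning.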

  • ContinuousIntegration
  • DataQuality
  • dbt
  • DevOps
  • DataEngineering
  • GitHub
  • Datafold
  • DataValidation
Saturday, June 28, 2025
  • ««
  • «
  • 1
  • 2
  • »
  • »»