Last year I watched a colleague ask AI to help write a dbt model. The AI spit out perfectly functional SQL—clean syntax, proper CTEs, the works. Looked great.

Then I noticed the table would eventually hold 800 million rows. No partitioning. No clustering. Just a raw, unoptimised heap waiting to turn into a query performance nightmare (that would likely become my nightmare to fix).

The engineer wasn’t at fault. The AI wasn’t at fault either, really. The AI simply didn’t know that our environment clusters large tables by date. It didn’t know our team’s conventions around incremental models. It couldn’t know, because nobody had told it.

Here’s the thing: most data engineers treat AI assistants like they’re infinitely capable strangers. We give them tasks, accept their output, maybe tweak a few lines. But we never invest in teaching them our context—the hard-won lessons, the team conventions, the mistakes we’ve already made (and would rather not make again).

What if you could? What if every partitioning decision, every naming convention, every “never do this” rule you’ve learned over the years lived in a file that your AI assistant read before every single interaction?

That’s exactly what context engineering enables. Once you set it up, your AI stops being a generic autocomplete and starts acting like a peer who actually knows your codebase.




The Problem With Generic AI Assistance


I’ve been using AI coding assistants for a while. First Copilot (that was a terrible experience), then ChatGPT, and more recently Claude. With the right prompts, I can turn out blocks of code noticeably faster.

But something kept nagging at me.

The AI would suggest something technically correct but contextually wrong. It would generate a model without tests. Propose a column name that violated our conventions. Use VARCHAR when we standardise on STRING. Small things, individually, but each one meant another round of refining the prompt.

The root cause was simple: the AI had no memory. Every conversation started fresh. It didn’t know that we’d spent three painful weeks last quarter cleaning up inconsistent date formats. It didn’t know that our dim_customer table had a specific SCD Type 2 pattern we’d refined over months.

Every session, I found myself re-explaining the same context. “Remember to add tests.” “We use snake_case for everything.” “Always partition by event_date.”

This isn’t unique to AI; it’s a problem at work generally. You could have an equally talented data engineer sitting next to you, but you don’t share the same problems faced or the same challenges solved. Weeks or months later you’d find out: we ran into this problem in that table, and we resolved it by doing this. And you’d say, oh, we had the same problem elsewhere, but here’s how we tackled it.

Better prompting helps, as I discussed in this article - https://ghostinthedata.info/posts/2026/2026-01-04-talk/.

But there’s another approach that’s more fundamental: teaching your AI assistant your context once, so it remembers across every session. That’s what context engineering is about.




What Context Engineering Actually Looks Like


At its simplest, context engineering means maintaining a file (or set of files) that contains your team’s accumulated wisdom, loaded automatically into every AI session. Think of it as institutional memory for your AI assistant.

The core concept is straightforward: you maintain markdown files that describe your conventions, standards, and hard-won lessons. Your AI assistant reads these files at the start of every session.

Different tools implement this differently:

Claude Code (Anthropic’s CLI tool) looks for a CLAUDE.md file in your project root. It loads this automatically when you start a session.

Cline (the VSCode extension I use) supports custom system prompts and can reference local files for context. You can configure it to always include specific markdown files in its context window.

Cursor has similar capabilities with its rules files.

The implementation details vary, but the principle is identical: give the AI your context before it starts generating code.

Here’s what a basic context file might look like for a data engineering project:

# Analytics Pipeline Context

## Stack

- Snowflake (Enterprise tier)
- dbt Core 1.7+
- Fivetran for ingestion
- Tableau for BI

## Critical Standards

### Partitioning & Clustering

- **ALWAYS** cluster tables >10M rows by their primary date column
- Use `cluster_by` in dbt config, not raw DDL
- Partition staging tables by `data_date`

### Naming Conventions

- Models: `stg_source__entity`, `fct_entity`, `dim_entity`
- Columns: snake_case, never camelCase
- Booleans: prefix with `is_` or `has_`
- Dates: suffix with `_at` for timestamps, `_date` for dates
- IDs: suffix with `_id` (never `_key` for surrogate keys)

### Testing Requirements

- Every model needs at least: unique, not_null on primary key
- Fact tables need referential integrity tests to all dimensions
- SCD2 dimensions need: valid_from < valid_to, no gaps or overlaps in history

This isn’t complicated. It’s just written-down knowledge. The magic happens when this knowledge persists across sessions.
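To see what following these standards looks like in practice, here’s roughly how the clustering rule translates into a dbt model config. This is a sketch: `fct_orders`, `order_date` and the `stg_shop__orders` ref are hypothetical names chosen to match the conventions above.

```sql
-- fct_orders.sql: a large fact table, so clustering is declared in the dbt
-- config rather than in raw DDL
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        cluster_by=['order_date']
    )
}}

select
    order_id,
    customer_id,
    order_date,
    order_total_amount
from {{ ref('stg_shop__orders') }}

{% if is_incremental() %}
-- only process new or late-arriving days on incremental runs
where order_date >= (select max(order_date) from {{ this }})
{% endif %}
```

Because the rule lives in the context file, the assistant reaches for `cluster_by` on large tables without being reminded.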




Structuring Your Context Files


I use Cline for most of my dbt work because it integrates cleanly with VSCode and connects directly to the Claude API. Rather than one massive context file, I structure my context into logical chunks—this helps the AI (and me) find relevant information quickly.

Here’s the directory structure I use:

```
your-dbt-project/
├── .ai/
│   ├── conventions.md
│   ├── testing-patterns.md
│   └── common-mistakes.md
├── models/
├── macros/
└── dbt_project.yml
```

When I start a session, this context is already loaded. The AI knows our conventions before I type a single character.

The specific configuration will depend on your tool of choice—Cline, Cursor, and Claude Code each have their own settings. The important part is the content itself: clear, specific rules that capture how your team actually works.




The “Every Mistake Becomes a Rule” Philosophy


Here’s where context engineering gets genuinely powerful.

Boris Cherny, who created Claude Code, maintains his context file at exactly 2,500 tokens. His practice: “Every mistake becomes a rule.” When something goes wrong that’s generalisable, it gets added to the context file immediately.

This resonated with me because it mirrors how data engineers actually accumulate wisdom. We don’t learn by reading documentation. We learn by screwing things up, fixing them, and vowing to never make that mistake again.

The difference now? Those lessons don’t just live in our heads. They live in a file that our AI assistant reads before every interaction.

Last quarter, a seemingly innocent change to an SCD2 dimension caused our dashboards to double-count certain metrics. The root cause was subtle: we’d accidentally created overlapping validity windows during a backfill. It took three days to diagnose.

The fix took an hour. But more importantly, I added this to my testing patterns file:


## SCD Type 2 Validation

Every SCD2 dimension MUST include these tests:

### No Overlapping Windows

```yaml
models:
  - name: dim_customer
    tests:
      # one row per customer per validity window
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - customer_id
            - valid_from
      # windows must run forward: valid_from strictly before valid_to
      - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
          column_A: valid_to
          column_B: valid_from
          or_equal: false
```
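The uniqueness and ordering tests above catch the most common failure modes, but they don’t directly prove that windows never overlap. A singular dbt test can check that explicitly. Here’s a sketch, assuming `dim_customer` carries `customer_id`, `valid_from` and `valid_to`; the file name and the far-future fallback date are my own choices, so adjust them to how your current rows are closed off.

```sql
-- tests/assert_no_overlapping_scd2_windows.sql
-- Singular test: returns a row (and therefore fails) whenever two validity
-- windows for the same customer overlap.
select
    a.customer_id,
    a.valid_from,
    a.valid_to,
    b.valid_from as overlapping_valid_from,
    b.valid_to   as overlapping_valid_to
from {{ ref('dim_customer') }} as a
join {{ ref('dim_customer') }} as b
  on  a.customer_id = b.customer_id
  and a.valid_from < b.valid_from                               -- compare each pair of windows once
  and coalesce(a.valid_to, '9999-12-31'::date) > b.valid_from   -- earlier window ends after the later one starts
```

Dropped into `tests/`, it runs with every `dbt test`, which is exactly the kind of lesson worth writing down once.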
When every mistake becomes a rule, you’re not just fixing a bug; you’re also preventing future ones.