Ghost Skills: Teaching AI Agents to Think Like Data Engineers

Another week, another skills repo on the GitHub trending page. I know. There are roughly seventeen of them now, all promising to turn your AI coding agent from a confident intern into a slightly-less-confident intern. Most of them are great. Most of them are also built by solo devs, for solo devs, on solo-dev codebases that fit comfortably in a context window.

Which is fine, if that’s your world.

Less fine if your world involves a Snowflake warehouse with four tables that could be the source of truth for “customer”, an SCD2 someone half-built in 2021 and quietly walked away from, and a dbt project where stg_users_final_v3_actually_use_this is, somehow, the one you’re meant to use. (Don’t laugh. You’ve seen worse.)

I’ve been using AI agents day-to-day for a while now — mostly as a peer reviewer, a second pair of eyes, a thinking partner that doesn’t sigh when I ask it to look at the same query three times. They’re genuinely useful.

So I built something: Ghost Skills, a collection of data-engineering methodology skills for AI coding agents. Before I get into what’s in it, let me explain why I bothered.

The problem isn’t the model. It’s that data work has rules nobody told it about.

Your AI agent is, by default, an extremely bright graduate on their first day. It knows Python. It knows SQL. It can write a beautifully-commented for loop and explain window functions to you in five different ways. It will happily generate a fact table, a dbt model, an Airflow DAG, all of it looking very professional.

What it doesn’t know is everything that actually matters.

It doesn’t know that the customer_id in that source isn’t really unique because of the 2022 CRM migration. It doesn’t know that Roger pushed a release on Friday afternoon (because of course he did) and the platform’s been wheezing through a seven-year backfill ever since. It doesn’t know that this dimension changes slowly and the audit team will lose their minds if you flatten the history. It doesn’t know that the business definition of “active customer” changed in March, but only in the marketing data mart, and only sometimes.

These aren’t model intelligence problems. The model is plenty smart. They’re context problems — and the context is exactly the stuff that takes a data engineer two years to learn and roughly thirty minutes to forget when they leave the company.

What agents are missing isn’t capability. It’s methodology. It’s the durable craft of data engineering — the part that doesn’t change when you swap Snowflake for BigQuery, or dbt for SQLMesh, or your warehouse for a “lakehouse” that’s somehow priced like a warehouse anyway.

That’s the gap this is attempting to fill.

Skills, briefly, for anyone who hasn’t drunk this particular Kool-Aid yet

If you’ve not been living inside Claude Code or Cursor for the last six months: a skill is a folder of instructions you give an AI coding agent. “When you build a fact table, declare the grain first.” “When you write tests, follow this severity model.” “When something breaks, here’s the comms workflow.” You version it, you commit it, you keep it in the repo. The agent reads it before doing the thing, and re-reads it next session because agents have the long-term memory of a goldfish.

It’s a deceptively powerful pattern. Instead of pasting the same prompt every session (and forgetting half of it, and writing it slightly differently this time), the standard lives in the repo. New team member joins? They get the skills. Agent updates next month? It still reads the skills.

The catch — and we’ll come back to this — is that skills tell an agent how you do something. They don’t tell it what you’ve built, why you built it that way, or which of your seven utility tables is the one that’s still maintained.

For solo devs, that’s fine. There is no “seven utility tables”. For data teams at any kind of scale, it’s a real limit. More on that at the end.

Introducing Ghost Skills

Right. The repo: github.com/ghostinthedata-info/skills. Built mostly for myself, polished up for sharing.

30-second setup:

npx skills@latest add ghostinthedata-info/skills

Pick the skills you want, pick your agent (Claude Code, Codex, Cursor — whatever you’re using), and run /setup-ghost-skills. It’ll ask you three questions: warehouse dialect, transform tooling, and how your domain docs are laid out. Then it writes itself a configuration block into your CLAUDE.md or AGENTS.md and gets out of your way.

The agent reads it next session. And the one after that. And the one after that. The same standards, every time, without you re-explaining them.

That’s it. That’s the whole thing.

What’s in the catalogue

The skills group into four areas, mapped to roughly how data work actually happens.

Discovery is everything you should do before you build anything, and frequently don’t. profile-data runs the baseline checks on a new dataset — row counts, cardinality, null analysis, key uniqueness — so the agent isn’t generating models against assumptions that fall apart in week two. gather-requirements pins down grain, sources, consumers, freshness SLAs, and acceptance criteria one question at a time, instead of letting the agent guess. refine-context stress-tests a plan against your documented domain model and writes the decisions back to CONTEXT.md and ADRs as they crystallise.
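To make those baseline checks concrete, here is a minimal Python sketch of the kind of profile they describe. It is not the skill's own output: the function name profile_rows, the in-memory list of dicts and the sample rows are all assumptions made for illustration, and a real run would query your warehouse instead.

```python
from collections import Counter
from typing import Any


def profile_rows(rows: list[dict[str, Any]], key_columns: list[str]) -> dict[str, Any]:
    """Baseline profile: row count, per-column nulls and cardinality, key uniqueness."""
    columns = sorted({col for row in rows for col in row})
    column_stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        column_stats[col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }

    # A candidate key is only usable if no combination of its values repeats.
    key_counts = Counter(tuple(row.get(c) for c in key_columns) for row in rows)
    duplicates = {key: n for key, n in key_counts.items() if n > 1}

    return {
        "row_count": len(rows),
        "columns": column_stats,
        "key_is_unique": not duplicates,
        "duplicate_keys": duplicates,
    }


# Hypothetical sample: customer_id 2 appears twice, as it might after a CRM migration.
sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "b@example.com"},
]
report = profile_rows(sample, key_columns=["customer_id"])
print(report["row_count"], report["key_is_unique"], report["duplicate_keys"])
```

The point of running something like this first is that a duplicated "key" shows up before any model is built on top of it, not in week two.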

Modeling is the methodology your agent should already be applying and usually isn’t. dimensional-modeling walks the Kimball four-step process: pick the business process, declare the grain, identify the dimensions, identify the facts. fact-table-design forces grain declaration and measure classification before a single line of SQL gets generated. keys covers business, natural, surrogate, composite, and durable keys, along with the anti-patterns that bite you eighteen months later when the source system gets re-platformed. slowly-changing-dimensions covers SCD types 0 through 7, including my Healing Tables approach to deterministic, path-independent SCD2.
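For anyone who hasn't met SCD2 in the wild, here is a rough Python sketch of the general idea of rebuilding type 2 history deterministically from source snapshots. To be clear, this is a generic illustration and not the Healing Tables implementation. The Snapshot and Scd2Row classes and the build_scd2 function are hypothetical names, and the sketch assumes at most one snapshot per business key per date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    business_key: str
    observed_on: date
    attributes: tuple  # tracked attribute values, e.g. (tier,)


@dataclass(frozen=True)
class Scd2Row:
    business_key: str
    attributes: tuple
    valid_from: date
    valid_to: Optional[date]  # None marks the current version


def build_scd2(snapshots: list[Snapshot]) -> list[Scd2Row]:
    """Rebuild the full SCD2 history from snapshots, whatever order they arrive in.

    Assumes at most one snapshot per business key per date.
    """
    ordered = sorted(snapshots, key=lambda s: (s.business_key, s.observed_on))
    rows: list[Scd2Row] = []
    for snap in ordered:
        previous = rows[-1] if rows and rows[-1].business_key == snap.business_key else None
        if previous is not None and previous.attributes == snap.attributes:
            continue  # nothing changed, so the open version still stands
        if previous is not None:
            # Close the previous version on the date the new one starts.
            rows[-1] = Scd2Row(previous.business_key, previous.attributes,
                               previous.valid_from, snap.observed_on)
        rows.append(Scd2Row(snap.business_key, snap.attributes, snap.observed_on, None))
    return rows


# Hypothetical history, deliberately out of order.
history = [
    Snapshot("C1", date(2024, 3, 1), ("gold",)),
    Snapshot("C1", date(2024, 1, 1), ("silver",)),
    Snapshot("C1", date(2024, 2, 1), ("silver",)),
]
dimension = build_scd2(history)
for row in dimension:
    print(row)
```

Because the output depends only on the set of snapshots and not on the order they were loaded in, a rebuild after a late-arriving record gives the same dimension as loading everything in sequence. That is the property "path-independent" is pointing at, under the assumptions stated above.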

Quality is where defensive engineering lives. test-data produces a test plan you can actually defend in a code review — uniqueness, referential integrity, nulls, accepted values, freshness, volume variance — and maps each test to a severity level. performance-tuning follows the measure-first philosophy: find the critical path, then fix partition pruning, incremental processing, and the phantom dependencies that are silently doubling your costs. spark-performance handles the distributed case — partition counts, shuffle minimisation, skew handling, broadcasting small tables.
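As an illustration of mapping tests to severities, here is a small Python sketch. The two-level WARN/ERROR model, the DataTest class, the test names and the sample orders are all assumptions for the example; the skill's actual severity model may have more levels or different names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable

Row = dict[str, Any]


class Severity(Enum):
    WARN = "warn"    # report it, but let the model publish
    ERROR = "error"  # block publication until it is fixed


@dataclass(frozen=True)
class DataTest:
    name: str
    severity: Severity
    check: Callable[[list[Row]], bool]  # returns True when the data passes


def run_tests(rows: list[Row], tests: list[DataTest]) -> tuple[bool, list[tuple[str, Severity]]]:
    """Run every test and report whether the data is safe to publish."""
    failures = [(t.name, t.severity) for t in tests if not t.check(rows)]
    publishable = all(severity is not Severity.ERROR for _, severity in failures)
    return publishable, failures


# Hypothetical test plan for an orders model.
tests = [
    DataTest("order_id_unique", Severity.ERROR,
             lambda rows: len({r["order_id"] for r in rows}) == len(rows)),
    DataTest("customer_id_not_null", Severity.ERROR,
             lambda rows: all(r["customer_id"] is not None for r in rows)),
    DataTest("status_accepted_values", Severity.WARN,
             lambda rows: all(r["status"] in {"placed", "shipped", "returned"} for r in rows)),
]

orders = [
    {"order_id": 1, "customer_id": 10, "status": "shipped"},
    {"order_id": 2, "customer_id": 11, "status": "teleported"},
]
publishable, failures = run_tests(orders, tests)
print(publishable, [name for name, _ in failures])
```

Here the unexpected status only warns, so the data still publishes. A duplicate order_id would have blocked it, and deciding which failures block is the conversation the severity mapping forces you to have up front.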

Operations is what should happen when things go pop. incident-comms gives the agent the severity classification, notification workflow, update cadence, and post-incident review template — the “Don’t Go Dark” pattern in skill form. pipeline-design encodes idempotency, reproducibility, and defensive engineering as defaults instead of afterthoughts. data-as-a-product brings data mesh thinking into the agent — domain ownership, discoverability, SLAs, federated governance. data-security-classification covers the four A’s, PII/PHI handling, and least privilege.
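Idempotency is easier to show than to describe, so here is a minimal Python sketch contrasting a partition overwrite with a plain append. The dict standing in for a partitioned warehouse table, the function names load_partition and append_rows, and the sample batch are all hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def load_partition(table: dict[str, list[Row]], partition: str, rows: list[Row]) -> None:
    """Idempotent load: replace the whole partition instead of appending to it."""
    table[partition] = [dict(row) for row in rows]


def append_rows(table: dict[str, list[Row]], partition: str, rows: list[Row]) -> None:
    """Non-idempotent load, for contrast: every rerun adds the rows again."""
    table.setdefault(partition, []).extend(dict(row) for row in rows)


# Hypothetical daily batch; the dict stands in for a partitioned warehouse table.
batch = [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 15}]

safe_table: dict[str, list[Row]] = {}
unsafe_table: dict[str, list[Row]] = {}
for _ in range(2):  # simulate a retry after a failed run
    load_partition(safe_table, "2024-05-01", batch)
    append_rows(unsafe_table, "2024-05-01", batch)

print(len(safe_table["2024-05-01"]), len(unsafe_table["2024-05-01"]))
```

After the retry, the overwritten partition still holds two rows while the appended one holds four. Making the first behaviour the default is what lets you rerun a failed job without thinking twice.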

There’s also the setup skill (setup-ghost-skills), which runs once per repo and handles the configuration. Everything else, you opt into.

Why tool-agnostic, even though tool-specific would sell better

The temptation when building something like this is to specialise. Snowflake-only. dbt-only. Make every skill assume your exact stack and produce immediately runnable code.

I deliberately didn’t do that, and the tradeoff is worth naming honestly. Generic skills are less immediately useful than highly specific ones. A skill that knows your exact dbt project structure, your column naming conventions, your team’s snake_case-vs-camelCase argument that’s been going on since 2022 — that’s more powerful, day one.

But it’s also yours. It’s not shareable. And the bits that are genuinely universal — declare the grain before you build the fact, a business key isn’t a real business key until you’ve checked it’s unique, test before publish — those don’t change between platforms. They didn’t change when we moved from Teradata to Snowflake. They won’t change when we move from Snowflake to whatever everyone’s furiously rebranding next year.

Ghost Skills is the base layer. Your repo’s CONTEXT.md is your specialisation layer. They’re meant to work together, not replace each other. Fork it. Add a snowflake-cost-optimisation skill on top. Open a PR if you want to share it back. (Or don’t. Do what you like.)

Skills help. They don’t solve everything. Let’s not pretend.

The honest thing to say at this point is that skills are not a silver bullet, and anyone telling you otherwise is selling you a course.

Skills tell an agent how to do things. They don’t tell it what’s there or why. An agent that knows the Kimball four-step process will still make the wrong grain declaration if it doesn’t understand the business process it’s modelling. A skill can encode what SCD2 is; it can’t replace a conversation with the stakeholder who actually needs the history. A skill can encode the Write-Audit-Publish pattern; it can’t tell the agent which of your fourteen “users” tables is the one to run it against.

The other thing skills can’t do is keep themselves current. The conventions you encode today are the conventions you knew today. If your team’s standards evolve (and they should), your skills need to evolve with them. A stale skill is worse than no skill, because the agent acts on a confident wrong answer instead of asking a question.

So treat them as a starting point. Use them. Fork them. Argue with them in a code review. Update them when your standards shift. The agent will keep reading whatever’s in the repo, faithfully, every session — which means the skills are only ever as good as the last person who maintained them.

The repo is at github.com/ghostinthedata-info/skills. Fork it for your cloud-specific layer. Open issues if something’s wrong. Open a PR if you’ve got methodology worth encoding and you don’t want to keep it to yourself.

And if you build something better on top of it, please tell me.

Chris Hillman