Five Worlds of Data Engineering

Chris Hillman — Sat, 02 May 2026 09:00:00 +1000

You watch a conference talk about implementing data contracts, and nobody mentions that the advice assumes you have multiple teams producing data — which you don’t. You read a post declaring “if you’re still using stored procedures in 2026, you’re doing it wrong,” and the comments erupt. Half the people are nodding along. Half are furious. Both sides are right. They’re just living in different worlds and don’t realise it.

That mismatch — smart people giving each other advice that doesn’t apply — is the thing that almost never gets named. And the reason it doesn’t get named is that most public data engineering discourse is produced by and for one particular world, while pretending to speak for all of them.

I’ve worked across a few of these worlds myself: SQL Server migrations to Teradata, then Teradata to S3, regulatory reporting under APRA and ASIC, a data mesh initiative at scale, and now a university environment that runs closer to a startup than anything I expected. The advice that kept me out of trouble in one would have gotten me fired in another. That experience is what’s behind this taxonomy.

I think there are five worlds here. Sometimes they intersect. Often they don’t.

World 1: The Modern Analytics Shop

This is the world most people picture when they hear “data engineering.” A team of two to ten engineers at a startup, scale-up, or mid-size company. Cloud-native from day one, or recently migrated. The stack reads like a vendor sponsor list: Snowflake or BigQuery or Databricks, dbt for transformations, Fivetran or Airbyte for ingestion, Looker or Metabase for dashboards. Everything managed. Everything SaaS. GitHub Actions wiring it together.

What matters here is speed. Getting value to stakeholders fast. Shipping a working dashboard before the quarterly review. Iterating on models without a change advisory board. The team is small enough that everyone knows the codebase, and requirements come from a Teams message, not a 40-page specification document.

I find myself in this world more than I expected right now. My current role is at an institution that is large and complex by any measure, but the data team is small, the autonomy is real, and the energy genuinely feels like a startup. New tools, fast decisions, the ability to actually make change. It’s a reminder that World 1 isn’t only about company size. It’s about how the team operates.

This is also a great world to work in, especially early in your career. The tooling is mature, the feedback loops are tight, and the problems are tractable.

Here’s the thing, though. This is also the world that roughly 80% of public data engineering content is written for and about. Not because it’s the most common world — it isn’t — but because it’s the most fundable. The vendor-funded conference talks, the sponsored podcasts, the developer advocate blog posts, the LinkedIn hot takes — they overwhelmingly reflect this world. Developer advocates write about the stack their employer sells (this is not a dig — it’s the job). Conference sponsors want talks that showcase their tools in the most flattering light. The technical depth suffers because the incentive is awareness, not education.

None of that is wrong. But it creates a gravitational pull that distorts the whole discourse. When someone writes “the right way to do data engineering,” they almost always mean this world. And if you’re in a different one and don’t recognise the mismatch, you end up feeling like you’re doing it wrong when you’re actually just solving a different problem.

Advice that works here but rarely travels: “Just use dbt.” “Schema-on-read is fine for now.” “You don’t need a data catalog yet.” “Start with a star schema and iterate.”

World 2: The Enterprise Legacy Estate

This is the world of large organisations — banks, insurers, manufacturers, utilities, healthcare systems, government agencies — with data infrastructure that predates the cloud. Often predates the people currently maintaining it. Teams of twenty to two hundred data professionals spread across business units that don’t always talk to each other.

The stack tells a different story: Informatica, SSIS, Ab Initio, Teradata, Oracle, maybe an on-prem Hadoop cluster that someone championed in 2014 and nobody’s had the political capital to decommission. Perhaps some Snowflake or Databricks grafted on top, creating a hybrid that’s more complex than either system alone.

I lived this migration arc twice — once from SQL Server to Teradata, and again from Teradata to S3 and Starburst. Each time, the temptation was to treat the old system as a problem to escape rather than a body of knowledge to decode. Each time, the engineers who did the most damage were the ones who arrived with the new stack and immediately started designing around the old one rather than understanding it first.

What matters here is stability. Not breaking things. Migration plans that span years, not sprints. Governance and lineage aren’t aspirational — they’re audit requirements. And politics, because the data warehouse your predecessor built in 2011 is somebody’s empire (you know the one), and you can’t “just replace it” without navigating a web of organisational power dynamics that no architecture diagram captures.

This is the world where refactoring beats rebuilding every time. That fact table with 200 columns? Those bridge tables nobody understands? The slowly-changing-dimension-within-a-slowly-changing-dimension? They’re not bugs — they’re reality encoded. Every weird modelling choice represents a business rule someone fought to understand. The “clean” data vault remodel will eventually end up with the same complexity, just distributed across more tables with more confusing names.

Advice from World 1 can be actively dangerous here. “Just rewrite it” destroys institutional memory. “Adopt a lakehouse architecture” sounds great until you realise you have 4,000 stored procedures that encode fifteen years of business rules, and nobody documented them. The engineer in this world isn’t slow because they’re behind the curve. They’re careful because the cost of breaking something is measured in regulatory findings and executive phone calls, not a failed CI check.

Advice that works here but rarely travels: “Document everything before you touch it.” “Strangler fig, never big bang.” “Spend more time understanding why it was built this way than planning what to replace it with.” “The weird WHERE clause isn’t a bug — it’s institutional memory.”

World 3: The Product Engine

This is the world where data engineering directly powers an outcome the business delivers. That might be real-time personalisation, recommendation engines, or pricing algorithms. But it also includes building data products that drive campaigns, marketing attribution, and business decisions — cases where the engineer’s output flows into something customers or commercial teams act on directly, rather than landing in a dashboard that an analyst reviews on Monday morning.

The real-time variant of this world has its own stack: Kafka or Kinesis for streaming, Spark or Flink for processing, feature stores for serving machine learning models. Often custom infrastructure because off-the-shelf tools can’t meet the latency requirements.

But the defining characteristic isn’t milliseconds — it’s consequence. If a pipeline breaks in this world, someone notices immediately. A campaign fires with the wrong audience. A pricing decision gets made on stale data. A recommendation engine serves the same product to everyone because the features stopped updating overnight. The engineer is accountable to an outcome, not just a pipeline.

I’ve worked in this space building data products that powered business campaigns and marketing decisions. Not real-time serving in the p99 latency sense, but consequential enough that a broken pipeline meant a broken business process. That accountability changes how you think about testing, monitoring, and what “done” actually means.

The skills that matter here — operational thinking, system design, understanding how your data is consumed downstream — overlap more with software engineering and product thinking than with SQL and dbt. When someone says “data engineering is just SQL and orchestration,” someone in this world quietly closes the tab.

Advice that works here but rarely travels: “Treat pipelines like production services.” “Your tests need to run in CI, not in a notebook.” “Schema evolution is a deployment problem, not a modelling problem.” “The consumer of your data is your customer — know what breaks their day.”
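One way to read “treat pipelines like production services” is that a consumer refuses to act on stale data instead of quietly using it. Below is a minimal sketch of that idea in Python. The names `assert_fresh` and `StaleDataError`, and the thresholds, are hypothetical and only for illustration; a real setup would more likely lean on whatever freshness checks its orchestrator or warehouse already provides.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class StaleDataError(RuntimeError):
    """Raised when upstream data is older than the consumer can tolerate."""


def assert_fresh(
    last_loaded_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> timedelta:
    """Return the age of the data, or raise if it exceeds max_age.

    `now` is injectable so the check can be exercised in CI
    without depending on the wall clock.
    """
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    if age > max_age:
        raise StaleDataError(f"data is {age} old; limit is {max_age}")
    return age
```

A campaign job would call `assert_fresh` before selecting its audience and fail loudly if it raises, so the failure shows up as a halted job and not as a campaign sent to the wrong people.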

World 4: The Regulated Pipeline

This is the world defined by compliance. In Australia: APRA for prudential standards, ASIC for financial services conduct, AFCA for dispute resolution obligations, and the Privacy Act for anything touching personal information. Internationally: HIPAA, SOX, MiFID, GDPR. The defining characteristic isn’t the technology — it’s that regulatory requirements shape every architectural decision before the first line of code is written.

The stack is whatever passed the security review. Often years behind the cutting edge because new tools need months of compliance evaluation before they’re approved. Data lineage tools aren’t nice-to-haves — they’re audit requirements. Encryption isn’t a best practice — it’s a legal mandate. Access controls aren’t “we should get around to that” — they’re the first thing you build.

What matters is auditability. Can you answer “who accessed what, when, and why” for any record in the system? Can you prove data lineage from source to report? Can you demonstrate that a deletion request was honoured within the legally mandated timeframe? Your data deletion pipeline is as important as your ingestion pipeline, and it needs the same rigour.
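To make the deletion question concrete, here is a rough Python sketch of the kind of check that answers “was every deletion request honoured in time?” The 30-day window, the `DeletionRequest` shape, and the function names are all assumptions for illustration. The real deadline comes from the applicable regulation and your legal team, and the real records would come from an audit store, not an in-memory list.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional

# Hypothetical deadline, used only for this sketch.
DELETION_DEADLINE = timedelta(days=30)


@dataclass(frozen=True)
class DeletionRequest:
    request_id: str
    received_on: date
    completed_on: Optional[date] = None  # None means still outstanding


def is_breached(req: DeletionRequest, today: date) -> bool:
    """True if the request finished late, or is outstanding past its due date."""
    due = req.received_on + DELETION_DEADLINE
    finished = req.completed_on or today
    return finished > due


def breached_requests(requests: Iterable[DeletionRequest], today: date) -> List[str]:
    """Return the ids of every request that missed, or is missing, its deadline."""
    return [r.request_id for r in requests if is_breached(r, today)]
```

The point of the sketch is that an outstanding request is evaluated against today’s date, so a breach surfaces while it is happening instead of at the next audit.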

I spent significant time in this world — financial services, where regulatory reporting wasn’t a background concern but a core deliverable. APRA and ASIC don’t ask nicely. The audit wasn’t a hypothetical and the regulator’s question list arrived without warning. The engineers I worked alongside weren’t slow because they lacked ambition. They were deliberate because the cost of getting it wrong wasn’t a postmortem — it was a regulatory finding, a remediation programme, and occasionally a front-page story.

The trap for engineers entering this world from World 1 is assuming that governance is bureaucracy. It isn’t. Governance is architecture. The compliance requirements aren’t obstacles to good engineering — they’re constraints that shape what good engineering looks like. The best engineers I’ve worked with in regulated environments don’t fight the constraints. They design systems where compliance is a property of the architecture itself, not a layer bolted on top.

Advice that works here but rarely travels: “Governance is architecture, not bureaucracy.” “If you can’t prove lineage, you can’t ship it.” “The security review IS the sprint.” “Your data deletion pipeline is as important as your data ingestion pipeline.”

World 5: The Internal Platform

This is the world of organisations large enough that the data team has split into “platform” and “domain” functions. The platform team builds the infrastructure, tooling, and self-service capabilities that other data teams consume. Data mesh adopters. Companies with fifty or more data practitioners who realised that a centralised team can’t scale to serve every domain’s needs.

The stack is about abstraction and enablement: Kubernetes, self-hosted Airflow or Dagster or Prefect, internal developer portals, data contracts, schema registries, internal PyPI packages. Often custom abstractions layered on top of cloud services, designed to give domain teams guardrails without bottlenecks.

I experienced this world through ANZx — ANZ’s digital platform initiative — where data mesh concepts were applied at scale. Not as a theoretical framework on a conference slide, but as a real attempt to distribute data ownership across domains while maintaining coherence at the platform level. The challenge wasn’t the technology. It was getting domain teams to think like data product owners rather than data consumers. That shift is harder than any infrastructure problem.

What matters here is adoption. Not how many pipelines you build, but how many pipelines your consumers build without needing your help. Your success metric isn’t “pipelines shipped” — it’s “time to first pipeline for a new domain team.” Developer experience for internal consumers is your product, and if the experience is poor, your consumers will route around you. They’ll spin up their own Snowflake account, write their own ingestion scripts, and create exactly the kind of ungoverned sprawl your platform was supposed to prevent.

If adopting your platform requires a Jira ticket and a two-week wait, you’ve already lost. The shadow pipelines will multiply, and nobody will tell you until the audit.
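The “time to first pipeline” metric is simple enough to sketch. The Python below assumes you can get two dates per domain team: when they were onboarded and when their first pipeline ran. The team names, dates, and the function name `days_to_first_pipeline` are hypothetical, and where those dates come from will depend on your own platform’s records.

```python
from datetime import date
from statistics import median
from typing import Dict, List, Tuple


def days_to_first_pipeline(
    onboarded: Dict[str, date],
    first_run: Dict[str, date],
) -> Tuple[Dict[str, int], List[str]]:
    """Return days from onboarding to first pipeline per team,
    plus the teams that have not shipped a pipeline yet."""
    shipped = {
        team: (first_run[team] - start).days
        for team, start in onboarded.items()
        if team in first_run
    }
    waiting = sorted(team for team in onboarded if team not in first_run)
    return shipped, waiting


# Hypothetical onboarding data, for illustration only.
onboarded = {
    "payments": date(2026, 1, 5),
    "lending": date(2026, 1, 12),
    "marketing": date(2026, 2, 2),
}
first_run = {
    "payments": date(2026, 1, 17),
    "lending": date(2026, 2, 21),
}

shipped, waiting = days_to_first_pipeline(onboarded, first_run)
print("median days to first pipeline:", median(shipped.values()))
print("still waiting:", waiting)
```

The teams in the waiting list are arguably the more important number: they are the ones most likely to route around the platform.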

Advice that works here but rarely travels: “Treat internal teams as customers with choices.” “Self-service is the goal, not centralised delivery.” “Your documentation IS the product.” “Measure adoption, not output.”

So What?

Most things in data engineering are the same no matter which world you’re in. Data quality matters everywhere. Testing matters everywhere. Documentation matters everywhere. Understanding the business context behind the data — that’s universal, and I’d argue it’s the most undervalued skill in the profession.

But not everything transfers. And when somebody tells you about methodology — at a conference, on LinkedIn, in a blog post (including mine) — it’s worth thinking about which world they’re coming from.

Here’s something I’ve noticed when hiring veteran data engineers: the ones who stand out aren’t necessarily the deepest specialists in any one world. They’re the ones who’ve visited enough worlds to know the laws of each. They don’t have to have lived in each one for years, but they’ve spent enough time in the regulated environment to understand why governance isn’t bureaucracy. Enough time in the enterprise to know that “just rewrite it” destroys things. Enough time in the product engine to feel what accountability to an outcome actually means. That layered experience is what hardens an engineer. It’s what lets them walk into an unfamiliar environment and read it quickly, recognising the constraints, the trade-offs, and the things that will matter before anyone tells them.

There’s also something worth saying about what these worlds are not: interchangeable, or combinable. I’ve never seen an organisation that genuinely encompasses all five. You occasionally see attempts — usually during a data mesh initiative — and they tend to produce something more complex than any individual world, with the clarity of none of them. The worlds don’t collapse into a super-universe. They coexist, imperfectly, inside large organisations. A regulated pipeline team and a modern analytics shop can operate within the same company and have almost nothing in common in terms of how they work, what they value, and what good looks like.

The trade-off is real and worth naming plainly. The most abundant content (the conference talks, the vendor posts, the LinkedIn hot takes) is calibrated for World 1, which isn’t yours if you’re in World 2, 3, 4, or 5. There’s more of it than ever, and it’s easier than ever to access. What you give up is the ability to consume it passively. Filtering for relevance is now part of the job, and nobody puts that in the job description.

When you read advice — any advice, including everything on this blog — ask yourself which world it’s coming from. If it doesn’t apply to yours, that’s not a failing on your part or theirs. It just means they’re in a different world.

Now you know to notice.