The Guerrilla Guide to Data Engineering Interviews
The Scenario That Changes Everything
Picture this: You’re sitting in an interview room—or more likely these days, staring at a Zoom window with your carefully curated bookshelf background—and the interviewer asks you about data quality.
“Tell me about your experience with data quality,” they say.
You have two choices.
Choice A: “Data quality is really important in data engineering. It involves ensuring data is accurate, complete, consistent, and timely. I believe strongly in implementing data quality checks throughout the pipeline.”
Choice B: “Last year, we were bleeding money. Our data wasn’t running on time, and it wasn’t consistent: we had a table with a 12% duplicate rate—we only discovered this when a department head noticed our spend per customer was mysteriously 12% higher than industry benchmarks. I implemented row-level assertions on the incoming CDC stream that caught duplicates before they hit the merge, then backfilled three months of historical data by running a deduplication job that prioritized records based on source system hierarchy and last-updated timestamps. Took two weeks. Cut the duplicate rate to 0.3%.”
One of these answers gets you hired. The other gets you a polite “we’ll be in touch.”
Here’s the thing about data engineering interviews: they’re not testing whether you know things. Any idiot with ChatGPT can tell you what a slowly changing dimension is. What interviewers desperately want to know is whether you’ve actually built things. Have you stood in the wreckage of a failed pipeline at 2 AM and figured out what went wrong? Have you stared at two conflicting requirements from two department heads and architected a solution that made both of them reasonably happy?
This guide is about demonstrating that you’ve been in the trenches. Because that’s what separates the candidates who get offers from the candidates who get “we decided to move forward with other applicants.”
The Two Things That Actually Matter
After nearly two decades of interviewing data engineers—and being interviewed plenty of times myself—I’ve come to believe that good data engineering candidates have exactly two qualities:
- Smart, and
- Get things done
That’s it. That’s the whole list.
Sound overly simplistic? It’s not. These two qualities encompass everything an interviewer is actually trying to assess. Technical knowledge? That’s a component of smart. Communication skills? Part of getting things done. Problem-solving ability? Both.
The beauty of this framework is that it helps you understand what interviewers are really looking for beneath all those questions about window functions and star schemas.
People who are Smart but don’t Get Things Done often have impressive credentials but struggle to ship anything. They’re the ones who want to spend three weeks “properly” designing a data model for a proof-of-concept that needs to be done by Friday. They’ll tell you why your approach has theoretical limitations but won’t propose a practical alternative. In interviews, they can answer technical questions perfectly but give vague, abstract answers when asked about projects they’ve completed.
People who Get Things Done but aren’t Smart will build things that barely work, create technical debt that takes months to unwind, and make decisions that seem reasonable in the moment but reveal fundamental misunderstandings later. In interviews, they can talk enthusiastically about all the pipelines they’ve built but can’t explain why they made specific technical choices.
The magic happens when you have both. Smart people who get things done understand the theory, recognize when it matters and when it doesn’t, and ship production-quality solutions that their colleagues can maintain.
Your job in an interview is to demonstrate that you’re one of these people.
Show Me What You’ve Built
The single most important piece of advice I can give you is this: Come to every interview with a mental catalog of specific implementations you’ve delivered.
Not concepts you understand. Not technologies you’ve used. Not certifications you’ve earned. Actual problems you’ve solved, with specific details about what you did and why.
I’ve seen brilliant engineers fumble interviews because when asked “Tell me about a time you implemented a backfill strategy,” they gave a generic answer about what backfills are. Meanwhile, mediocre engineers who happened to prepare good stories about their work sailed through.
The stories matter. Let me show you why.
The Backfill Story
Every data engineer has done backfills. But here’s how most candidates talk about them:
“Yes, I’ve done backfills. We had to reload historical data when we changed the schema. I wrote a script to process the data in batches.”
Okay. That tells me nothing about your judgment, your problem-solving, or your ability to handle complexity.
Here’s what I want to hear:
“Our analytics team realized they needed two years of historical data for a new churn model, but our pipeline had only been running for six months. The source system had the data, but it was in a different format—they’d migrated from Oracle to Postgres about 18 months prior. So I had three different data formats to reconcile: the current CDC stream, the Postgres historical data, and the Oracle archives.
I built a unified transformation layer that normalized all three formats, then created a DAG that processed the Oracle data first—about 400 million records—in weekly chunks to avoid overwhelming the warehouse. The tricky part was handling the transition period where we had data in both Oracle and Postgres with potential duplicates. I used a combination of source system timestamps and record hashes to deduplicate across the boundary.
The whole backfill took about four days to run, but the prep work took two weeks. The churn model team was able to start their training three weeks after the initial request.”
See the difference? The second answer demonstrates:
- Handling messy real-world complexity
- Breaking down a large problem into manageable pieces
- Making thoughtful tradeoffs (weekly chunks to avoid warehouse load)
- Understanding the business context (enabling the churn model team)
- Concrete numbers (400 million records, three weeks total)
That’s what gets you hired.
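The deduplication step in a story like that—keep one record per natural key, preferring the more authoritative source system and then the latest timestamp—is worth being able to whiteboard. Here is a minimal in-memory sketch; the source ranking, key name, and field names are all hypothetical:

```python
from typing import Dict, List

# Hypothetical source hierarchy: a lower rank wins a tie on the same key.
SOURCE_RANK = {"postgres": 0, "oracle": 1}

def deduplicate(records: List[dict]) -> List[dict]:
    """Keep one record per natural key, preferring the higher-priority
    source system, then the most recent last_updated value."""
    best: Dict[str, dict] = {}
    for rec in records:
        key = rec["customer_id"]
        # Sort key: lower source rank first, then newer timestamp.
        rank = (SOURCE_RANK[rec["source"]], -rec["last_updated"])
        kept = best.get(key)
        if kept is None or rank < (SOURCE_RANK[kept["source"]], -kept["last_updated"]):
            best[key] = rec
    return list(best.values())
```

At warehouse scale the same logic is usually a ROW_NUMBER() window partitioned by the key, ordered by source rank and last_updated descending, keeping row 1.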
The Technical Questions They Actually Ask
Let’s get practical. Based on my experience on both sides of the table, here are the technical areas that come up most frequently in data engineering interviews—and more importantly, what interviewers are really trying to learn from each question.
Data Modeling: Don’t Just Know It, Defend It
When someone asks about your approach to data modeling, they’re not testing whether you can define a star schema. They’re trying to understand your judgment.
“We need to model customer order data. Walk me through your approach.”
The wrong answer starts rattling off dimensional modeling terminology. The right answer asks questions:
“Before I design anything, I need to understand a few things. What are the primary queries this data needs to support? Are we optimizing for dashboard performance, ad-hoc analysis, or ML feature generation? What’s the data volume and growth rate? How frequently will it be updated? And who are the consumers—analysts writing SQL, a BI tool like Tableau, or data scientists in notebooks?”
Only after understanding the requirements do you start talking about your design choices. And when you do, explain why:
“For this use case, I’d go with a denormalized wide table rather than a traditional star schema. Your analysts are primarily doing ad-hoc analysis in notebooks, and they’ve told you query simplicity matters more than storage efficiency. A star schema would mean teaching everyone to join fact to dimension tables correctly—and in my experience, that’s where most analytical errors come from. The denormalized approach trades some storage cost for query simplicity and reduces the chance of incorrect joins.”
This shows you understand that data modeling isn’t about following rules—it’s about making tradeoffs based on specific requirements.
SCD Type 2: The Implementation Details Matter
Every data engineer can explain what SCD Type 2 is. Few can explain how to actually implement it efficiently at scale.
If you’re working with Delta Lake or Iceberg, this is where things get interesting:
“Our customer dimension has about 50 million records and changes maybe 100,000 times per day. We’re on Delta Lake. How would you implement SCD Type 2?”
Here’s where I want to see you’ve actually done this:
“With Delta Lake, I’d use the MERGE statement with a match condition on the natural key, but here’s where it gets nuanced. You can’t just do a simple merge because you need to both update existing records (set the end date) and insert new records (the current version) in the same operation.
The approach I’ve used is to structure the merge to match on the natural key AND where the record is current (end_date is null). When matched, and when there’s an actual change in the tracked columns, update the end_date to yesterday, and separately insert the new record. But that requires two passes—or you can use the multi-action MERGE syntax if your Delta version supports it.
The gotcha is handling late-arriving data. If you get yesterday’s changes after you’ve already processed today’s changes, your end_dates can get out of order. We solved this by including a logical sequence number in our CDC stream and processing changes in order within each merge batch.”
This answer shows implementation experience, awareness of edge cases, and practical problem-solving.
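The two-action logic in that answer (close the current version, insert the new one, process changes in sequence order) can be illustrated with a small in-memory simulation. This is a sketch of the logic only, not Delta Lake MERGE syntax, and the row shape is an assumption:

```python
from datetime import date, timedelta

def scd2_apply(dim: list, changes: list, as_of: date) -> None:
    """Apply one batch of changes to an SCD Type 2 dimension in place.
    Each dim row: {key, attrs, start_date, end_date} (end_date None = current)."""
    current = {row["key"]: row for row in dim if row["end_date"] is None}
    # Sort by sequence number so late-arriving changes apply in order.
    for change in sorted(changes, key=lambda c: c["seq"]):
        row = current.get(change["key"])
        if row is not None and row["attrs"] == change["attrs"]:
            continue  # no change in tracked columns: do nothing
        if row is not None:
            row["end_date"] = as_of - timedelta(days=1)  # close the old version
        new_row = {"key": change["key"], "attrs": change["attrs"],
                   "start_date": as_of, "end_date": None}
        dim.append(new_row)
        current[change["key"]] = new_row
```

In a real Delta MERGE the same effect comes from matching on the natural key plus end_date IS NULL with multi-action WHEN MATCHED / WHEN NOT MATCHED clauses; the sequence-number sort mirrors the late-arriving-data fix described above.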
Merge vs. Delete-Insert: Know When to Use Each
This comes up constantly, and the wrong answer is “I always use MERGE” or “I always use DELETE-INSERT.” The right answer is “it depends, and here’s how I decide.”
MERGE is better when:
- You need atomic upsert behavior
- You’re implementing SCD Type 2
- The update volume is small relative to table size
- Your warehouse has efficient MERGE support (most modern ones do)
DELETE-INSERT is better when:
- You’re doing full partition replacement
- The “update” volume is close to 100% of the existing data
- You want simpler logic that’s easier to debug
- You need to handle deletes from the source system
Here’s how I’d explain it in an interview:
“In my experience, DELETE-INSERT is actually more common in analytics pipelines than people expect. When we refresh a daily partition, we don’t mess around trying to figure out what changed—we drop the partition and reload it. It’s simpler, it’s idempotent, and if something goes wrong, we just run it again.
MERGE is what we use for dimensions where we need SCD Type 2 behavior, or for fact tables where we’re getting late-arriving facts that need to be upserted into historical partitions. The key decision point is: do I actually need to identify and handle individual changes, or can I just reload the whole thing? Nine times out of ten, the answer is reload.”
This shows practical wisdom about tradeoffs, not textbook regurgitation.
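The contrast between the two patterns is easy to show with a toy in-memory version. This is pure illustration (a real warehouse would use MERGE and a transactional delete-and-reload or INSERT OVERWRITE), but it makes the idempotency point concrete:

```python
def merge_upsert(table: list, updates: list, key: str) -> None:
    """MERGE-style: update rows that match on the key, insert the rest."""
    index = {row[key]: row for row in table}
    for u in updates:
        if u[key] in index:
            index[u[key]].update(u)   # WHEN MATCHED THEN UPDATE
        else:
            table.append(dict(u))     # WHEN NOT MATCHED THEN INSERT
            index[u[key]] = table[-1]

def delete_insert(partitions: dict, partition: str, rows: list) -> None:
    """DELETE-INSERT: drop the whole partition and reload it. Idempotent:
    re-running with the same extract always produces the same state."""
    partitions[partition] = [dict(r) for r in rows]
```

Run `delete_insert` twice with the same extract and the partition is unchanged, which is exactly why it is the default for daily partition refreshes.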
Problem-Solving Questions: Your Chance to Shine
The questions I love most as an interviewer—and the ones you should love as a candidate—are the open-ended problem-solving questions. These are where you demonstrate that you’re smart and get things done.
“You’ve been asked to build a pipeline that ingests data from a new source system. Walk me through how you’d approach it.”
This is a goldmine of an opportunity. Here’s how a strong candidate handles it:
Step 1: Ask clarifying questions (shows you understand context matters)
“What’s the source system? Is it pushing data to us, or do we need to pull it? What’s the volume and velocity? What’s the latency requirement—does it need to be near-real-time, or is daily batch okay? Who are the consumers, and what questions are they trying to answer? Is there existing documentation, or do we need to do discovery?”
Step 2: Outline your approach at a high level (shows you can structure complex problems)
“Assuming this is a reasonably standard scenario—a REST API that we need to poll daily—here’s my approach. First, I’d spend time understanding the data. What entities are exposed? What are the relationships? What’s the grain? Are there any gotchas like soft deletes or non-standard timestamp formats?
Then I’d design a landing zone—probably a staging schema with minimal transformation, just enough to make the data queryable. From there, I’d build the transformation layer to reshape it into our target schema, applying data quality checks along the way.
Finally, I’d implement monitoring: row counts, schema change detection, and anomaly alerts for unexpected patterns.”
Step 3: Dive into specific challenges (shows depth of experience)
“The piece that often trips people up is handling incremental loads with APIs that don’t support proper change tracking. If the API doesn’t give you a reliable modified_timestamp, you might need to do full loads and then diff against your existing data—which gets expensive fast. I’ve handled this by implementing a hash-based change detection system where we store a hash of each record and only process records whose hash has changed.”
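The hash-based change detection mentioned in Step 3 is another thing worth being able to sketch on a whiteboard. A minimal version (the key field and the shape of the hash store are assumptions):

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable hash of a record: sort keys so field order doesn't matter."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def detect_changes(incoming: list, known_hashes: dict, key: str = "id") -> list:
    """Return only new or changed records; known_hashes is updated in place."""
    changed = []
    for rec in incoming:
        h = record_hash(rec)
        if known_hashes.get(rec[key]) != h:
            changed.append(rec)
            known_hashes[rec[key]] = h
    return changed
```

In a real pipeline the hash typically lives as a column on the staging table, and the diff against the previous load is a single join rather than a Python loop.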
Step 4: Discuss what could go wrong (shows you’ve been burned before)
“The things I’d watch out for: API rate limits that could cause us to fall behind, schema changes from the source system that break our transformations, and timestamp timezone issues—I’ve been bitten by that one more times than I’d like to admit.”
This kind of answer demonstrates everything an interviewer is looking for: structured thinking, technical depth, practical experience, and awareness of edge cases.
The Incident Management Question
At some point, you’ll be asked about handling production incidents. This is where your war stories come in handy.
“Tell me about a time you had to debug a production data issue.”
Here’s the thing: everyone has these stories. The difference between a good answer and a great answer is how you structure it.
The Great Answer Structure:
- The alert: How did you find out something was wrong?
- The triage: How did you assess the severity and scope?
- The investigation: How did you identify the root cause?
- The fix: What did you do to resolve it?
- The prevention: What did you do to prevent it from happening again?
Here’s an example:
“We got an alert at 9 PM on a Friday—because of course it was Friday—that our customer metrics dashboard was showing a 30% drop in active users. At first, I thought it might be a real business event, but the drop was too sudden and too large.
I started with the basics: checked the pipeline run logs, all green. Checked row counts in the metrics table, normal. Checked upstream tables, normal. Nothing obviously broken.
Then I looked at the actual data. Our active user count was filtering by 'last_activity_date >= current_date - 7'. Turns out, a schema change in the source system had changed the last_activity_date column from a naive timestamp to a timestamp with timezone, and our transformation was truncating it incorrectly. Users in certain timezones were getting their dates shifted by a day, which pushed them outside the 7-day window.
The immediate fix was straightforward—I adjusted the timezone handling in the transformation and backfilled the last two weeks of data. Took about an hour.
But the prevention was the interesting part. I added explicit timezone assertions to our data quality framework. Any time we ingest a timestamp column, we now validate that the timezone handling matches our expectations. We’ve caught three similar issues since then before they hit production.”
This answer hits all the marks: structured approach, technical depth, practical resolution, and systemic improvement.
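The timezone failure mode in that story is easy to reproduce, and worth having at your fingertips in an interview: truncating a timezone-aware timestamp to a date gives different answers depending on which timezone you truncate in. A minimal demonstration using standard-library zoneinfo (the example timestamp is made up):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A user active at 11 PM on June 1st, US Pacific time.
activity = datetime(2024, 6, 1, 23, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

utc_date = activity.astimezone(timezone.utc).date()  # truncated after converting to UTC
local_date = activity.date()                         # truncated in the user's local time

assert local_date.day == 1
assert utc_date.day == 2  # the same event lands on "the next day" in UTC
```

A pipeline that silently switches between these two truncations moves every late-evening user by a day, which is exactly how records slip outside a rolling 7-day window.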
The AI and LLM Question
If you’re interviewing in 2026, you will be asked about AI. How you answer tells the interviewer a lot about whether you’re keeping current.
“How have you used AI or LLMs in your data engineering work?”
The weak answer is “I’ve experimented with ChatGPT for writing queries.”
The strong answer shows you’ve actually integrated AI into production workflows:
“We’ve implemented LLM-assisted data classification in our pipeline. We have a data catalog with about 3,000 tables, and maintaining accurate tags for sensitivity, domain, and data type was a nightmare—the manual process was always out of date.
I built a service that uses Claude’s API to analyze table schemas and sample data, then suggests classifications. It’s not fully automated—we have a human review step—but it reduced the time to classify a new table from 20 minutes of manual analysis to about 2 minutes of review.
The tricky part was prompt engineering. The first version was too aggressive with PII classification—it was flagging everything with ‘ID’ in the name as personally identifiable. We refined the prompts to include context about our specific data domain and examples of what we consider PII versus internal identifiers.”
This shows practical application, awareness of limitations (human review), and iteration based on real-world feedback.
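The context-rich prompting described in that answer can be assembled mechanically from catalog metadata. Here is one possible sketch; the table metadata, the label set, and the wording are all hypothetical, and a real service would send the string to an LLM API and parse the response:

```python
def build_classification_prompt(table_name: str, columns: list, sample_rows: list) -> str:
    """Assemble a table-classification prompt with domain context and
    explicit guidance distinguishing PII from internal identifiers."""
    col_lines = "\n".join(f"- {c['name']} ({c['type']})" for c in columns)
    samples = "\n".join(str(r) for r in sample_rows[:5])  # cap the sample size
    return (
        f"Classify the table '{table_name}' by sensitivity and domain.\n"
        "Context: internal identifiers (order_id, batch_id) are NOT PII; "
        "only fields tied to a person (email, phone, name) are.\n"
        f"Columns:\n{col_lines}\n"
        f"Sample rows:\n{samples}\n"
        "Answer with one of: public, internal, confidential, pii."
    )
```

The hard-won lesson from the answer lives in the prompt itself: the "internal identifiers are NOT PII" line is the kind of domain context that was added after the first version over-flagged anything with 'ID' in the name.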
Building Experience in Areas You Don’t Know
Here’s a question I hear all the time: “What if I don’t have experience with [Delta Lake / Iceberg / dbt / whatever technology they’re asking about]?”
First, be honest. Don’t claim experience you don’t have—good interviewers will expose this within two follow-up questions.
Second, go build something. You can spin up a personal project in a weekend that gives you legitimate hands-on experience.
Want to learn Delta Lake? Create a Databricks Community Edition account, load a public dataset, and implement a basic SCD Type 2 merge. Want to learn dbt? Fork the jaffle_shop demo project and extend it with some real transformations. Want to understand data quality at scale? Implement Great Expectations on one of your side projects.
The beauty of data engineering is that the tools are mostly free for small-scale learning. You don’t need a $50,000/month Snowflake instance to learn Snowflake—you need the free trial and a CSV file.
When you interview, you can honestly say: “I haven’t used Delta Lake in production, but I’ve built a personal project that implements [specific thing]. Here’s what I learned, and here’s how I’d apply it at scale.”
That’s infinitely better than “I’ve heard of Delta Lake but haven’t used it.”
The Questions You Should Ask Them
At the end of every interview, you’ll be asked “Do you have any questions for us?” This isn’t a formality—it’s a real opportunity to demonstrate your sophistication as a data engineer.
Bad questions:
- “What’s the work-life balance like?” (Important, but not what shows you’re a great candidate)
- “What technologies do you use?” (You should have researched this already)
- “What’s the career growth path?” (Reasonable, but generic)
Great questions:
- “What’s the biggest data quality challenge you’re facing right now, and what’s your current approach to solving it?”
- “Walk me through your deployment process for a new data pipeline—from development to production.”
- “How do you handle schema evolution in your source systems? Is it a significant pain point?”
- “What’s the split between building new pipelines versus maintaining existing ones?”
- “If I joined, what would success look like in the first 90 days?”
These questions show you understand real data engineering challenges and you’re thinking about how you’d actually do the job, not just whether you want it.
The Meta-Game: Time Management During Interviews
Here’s something nobody tells you about interviews: time management matters as much as your answers.
If you spend 15 minutes on the first whiteboard question and only have 5 minutes for the next two, you’ve failed—even if your first answer was brilliant. Interviewers have a checklist of things they need to assess, and if you don’t give them enough data points, you’ll get a “no hire” simply because they couldn’t tell.
The Rule of Thirds:
For a 60-minute technical interview:
- First 20 minutes: The introductory/behavioral question. Don’t ramble. Get your point across clearly and move on.
- Middle 30 minutes: The core technical assessment. This is where you prove your skills. Spend your time wisely here.
- Final 10 minutes: Wrap-up, your questions, selling you on the role.
If you find yourself 25 minutes into a detailed answer about your architecture philosophy and haven’t written any code yet, you’re in trouble. Learn to recognize when you’ve made your point and need to move on.
Breaking Down Problems:
When faced with a complex technical question, don’t start writing code immediately. Spend 2-3 minutes structuring your approach out loud:
“Okay, let me break this down. There are three main pieces here: the ingestion, the transformation, and the output. For the ingestion, I need to handle [X]. For transformation, the key challenge is [Y]. Let me start with the ingestion piece, since that’s the foundation.”
This shows structured thinking and gives the interviewer a roadmap of what you’re going to do. If you run out of time, they at least know you understood the full scope.
The Uncomfortable Truth About Interviews
I’m going to tell you something that might be unsettling: interviews are imperfect signals.
Smart, talented people fail interviews all the time. Mediocre people sometimes pass them. The best interviewers are wrong about 20-30% of the time.
This is why you shouldn’t let a failed interview devastate you, and you shouldn’t let a passed interview make you complacent. The skills that make you successful in interviews are related to—but not identical to—the skills that make you successful in the job.
What you can control:
- Your preparation
- Your catalog of project stories
- Your ability to explain technical concepts clearly
- Your structured approach to problem-solving
- Your honesty about what you know and don’t know
What you can’t control:
- Whether the interviewer is having a bad day
- Whether they’re biased toward a technology you don’t know
- Whether they’ve already decided to hire an internal candidate
- Whether you remind them of someone they didn’t like in their last job
Do everything you can to control the controllables. Then recognize that sometimes it just doesn’t work out, and that’s not a reflection of your worth as a data engineer.
Your Pre-Interview Checklist
Before any data engineering interview, make sure you can speak confidently about:
At least three implementation stories covering different areas: one about pipeline building, one about problem-solving/debugging, one about data modeling or architecture decisions.
Your approach to data quality: What checks you implement, how you handle failures, how you balance coverage with performance.
Tradeoff decisions: When to use batch vs. streaming, when to denormalize vs. normalize, when to merge vs. delete-insert, when to build vs. buy.
Your technical fundamentals: Window functions in SQL, how you handle slowly changing dimensions, your experience with your primary orchestration tool (Airflow, Dagster, dbt).
Technology-specific depth: Whatever platforms they use (Databricks, Snowflake, BigQuery), have at least surface-level familiarity and one or two interesting insights.
Questions for them: At least three substantive questions that show you’re thinking about the actual work.
The Final Word
Here’s what I’ve learned after interviewing hundreds of data engineers: the candidates who get hired aren’t necessarily the ones who know the most. They’re the ones who can demonstrate that they’ve taken on hard problems and found practical solutions.
Knowledge without execution is academic. Execution without knowledge is dangerous. The combination is rare and valuable.
Your job in an interview is to prove you have both.
So go catalog your victories. Remember the gnarly problems you solved, the architectural decisions you made, the production fires you put out. Write them down if you need to. Practice telling those stories until they flow naturally.
Because in the end, data engineering interviews aren’t about reciting definitions or demonstrating mastery of every possible technology. They’re about convincing a group of smart people that if they hire you, things will get built, problems will get solved, and the pipelines will run.
Everything else is details.
