Mid level · data

Data Engineer
Interview Questions

Data Engineer interview questions covering pipelines, ETL, Spark, SQL, and data architecture. Free, no signup required.

10 questions ready

Q1
Walk me through how you would design an ETL pipeline to ingest 500GB of daily data from multiple sources into a data warehouse, ensuring data quality and handling late-arriving data.
Why they ask this: They want to assess your understanding of end-to-end pipeline architecture, scalability considerations, error handling, and the real-world constraints that mid-level engineers face daily.
Q2
Explain the differences between batch processing and stream processing. When would you choose Apache Spark Streaming over Kafka Streams, and what are the trade-offs?
Why they ask this: This tests your knowledge of fundamental data engineering paradigms and your ability to make informed technology choices based on business requirements and technical constraints.
Q3
You notice a Spark job is running 40% slower than last week. How would you approach diagnosing the performance issue, and what are the common bottlenecks you'd investigate?
Why they ask this: They're evaluating your troubleshooting methodology, familiarity with profiling tools, and ability to optimize distributed computing systems, a critical skill for mid-level engineers.
Q4
Design a data lake schema using a medallion architecture (bronze, silver, gold layers). What transformations would occur at each layer, and how would you handle schema evolution?
Q5
Describe a situation where you had to work with a team to migrate data from a legacy system to a new platform. What was your role, what challenges did you encounter, and how did you ensure data integrity throughout the process?
Q6
Tell me about a time when you identified a data quality issue in production that impacted downstream users. How did you approach the problem, what did you do to fix it, and what preventive measures did you implement?
Q7
Describe a project where you had to learn a new tool or framework quickly to meet a deadline. How did you approach the learning process, and what was the outcome?
Q8
What would you do if you discovered that your team's data pipeline is processing and storing personally identifiable information (PII) without proper encryption or access controls?
Q9
How would you handle a situation where a business stakeholder requests a new data model with a tight deadline, but your current infrastructure can't support the required query performance without significant optimization work?
Q10
Imagine a critical data pipeline failed mid-run at 2 AM, affecting morning reports. You're on-call. Walk me through how you'd handle this: diagnosis, communication, resolution, and post-incident actions.