Q1
Walk me through how you would design an ETL pipeline to ingest 500GB of daily data from multiple sources into a data warehouse, ensuring data quality and handling late-arriving data.
Why they ask this: They want to assess your understanding of end-to-end pipeline architecture, scalability considerations, error handling, and real-world constraints that mid-level engineers face daily.
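One way to make the late-arriving-data part of an answer concrete is a lateness policy: records are routed to the partition of their event date, and anything older than an allowed window is set aside for review. The sketch below is only illustrative. The `route_records` function, the `event_date` field, and the three-day window are assumptions made for this example, not part of the question.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical policy: accept events up to 3 days late; quarantine older ones.
ALLOWED_LATENESS = timedelta(days=3)


def route_records(records, run_date, allowed_lateness=ALLOWED_LATENESS):
    """Split one ingested batch by event date.

    Returns (partitions, quarantined):
      partitions  maps event_date -> records to upsert into that partition
      quarantined holds records older than the lateness cutoff
    """
    cutoff = run_date - allowed_lateness
    partitions = defaultdict(list)
    quarantined = []
    for rec in records:
        if rec["event_date"] < cutoff:
            # Too late to merge automatically; keep for manual backfill.
            quarantined.append(rec)
        else:
            # On-time or acceptably late: goes to its own event-date partition.
            partitions[rec["event_date"]].append(rec)
    return dict(partitions), quarantined


if __name__ == "__main__":
    batch = [
        {"id": 1, "event_date": date(2024, 1, 10)},  # on time
        {"id": 2, "event_date": date(2024, 1, 8)},   # late, within window
        {"id": 3, "event_date": date(2024, 1, 1)},   # too late
    ]
    parts, late = route_records(batch, run_date=date(2024, 1, 10))
    print(sorted(parts), [r["id"] for r in late])
```

Because every touched partition is rewritten from its event date rather than its arrival date, rerunning the same batch should produce the same result, which is the idempotency property interviewers often probe for.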
Q2
Explain the differences between batch processing and stream processing. When would you choose Apache Spark Streaming over Kafka Streams, and what are the trade-offs?
Why they ask this: This tests your knowledge of fundamental data engineering paradigms and your ability to make informed technology choices based on business requirements and technical constraints.
Q3
You notice a Spark job is running 40% slower than last week. How would you approach diagnosing the performance issue, and what are the common bottlenecks you'd investigate?
Why they ask this: They're evaluating your troubleshooting methodology, familiarity with profiling tools, and ability to optimize distributed computing systems, a critical skill for mid-level engineers.
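Data skew is one of the bottlenecks candidates commonly raise here, and a quick check is to compare the slowest task in a stage against the median task. The sketch below assumes you have already collected per-task durations for a stage (for example, from the Spark UI). The function names and the threshold of 3.0 are illustrative assumptions, not a Spark API.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task within one stage."""
    if not task_durations_s:
        raise ValueError("no task durations given")
    med = median(task_durations_s)
    # A zero median means most tasks were instantaneous; treat as extreme skew.
    return max(task_durations_s) / med if med > 0 else float("inf")


def is_skewed(task_durations_s, threshold=3.0):
    """Flag a stage whose slowest task far exceeds the typical task.

    The threshold is an arbitrary starting point, not a Spark default.
    """
    return skew_ratio(task_durations_s) >= threshold


if __name__ == "__main__":
    balanced = [10, 11, 12, 13]
    straggler = [10, 10, 10, 50]
    print(is_skewed(balanced), is_skewed(straggler))
```

A high ratio points toward a hot key or uneven partitioning. A uniformly slower stage points instead toward data growth, resource contention, or a changed query plan.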
Q4
Design a data lake schema using a medallion architecture (bronze, silver, gold layers). What transformations would occur at each layer, and how would you handle schema evolution?