Q1
Design a data pipeline that ingests 500GB of daily log data from multiple sources, transforms it, and loads it into a data warehouse. Walk me through your architecture, tools, and how you'd handle schema changes.
Why they ask this: They want to assess your understanding of ETL/ELT design patterns, scalability, tool selection (Spark, Airflow, dbt, etc.), and your ability to handle real-world data complexity at scale.
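For the schema-change part of Q1, one talking point is to classify drift before loading: additive changes (new fields) can usually flow through, while missing fields or type changes should stop or quarantine the batch. The sketch below is a minimal illustration in plain Python, not a production design; `EXPECTED_SCHEMA`, `SchemaDrift`, and `detect_drift` are hypothetical names invented for this example, and a real pipeline would more likely rely on a schema registry or the warehouse's own schema-evolution features.

```python
from dataclasses import dataclass, field

# Hypothetical expected schema for one log source: field name -> Python type.
EXPECTED_SCHEMA = {"ts": str, "user_id": int, "event": str}


@dataclass
class SchemaDrift:
    added: dict = field(default_factory=dict)         # new field -> observed type name
    missing: list = field(default_factory=list)       # expected fields not present
    type_changed: dict = field(default_factory=dict)  # field -> (expected, observed)

    @property
    def breaking(self) -> bool:
        # Assumption: additive changes are safe; missing fields or type changes are not.
        return bool(self.missing or self.type_changed)


def detect_drift(record: dict, expected: dict) -> SchemaDrift:
    """Compare one parsed log record against the expected schema."""
    drift = SchemaDrift()
    for name, value in record.items():
        if name not in expected:
            drift.added[name] = type(value).__name__
        elif value is not None and not isinstance(value, expected[name]):
            drift.type_changed[name] = (expected[name].__name__, type(value).__name__)
    drift.missing = [name for name in expected if name not in record]
    return drift
```

In an interview answer, the `breaking` flag is the hook for discussing what happens next: route the batch to a quarantine location and alert, versus evolving the target table automatically for additive changes.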
Q2
Explain the differences between batch processing and stream processing. When would you use Apache Kafka vs. Apache Spark for a real-time analytics use case, and what are the trade-offs?
Why they ask this: This tests your foundational knowledge of data processing paradigms and your ability to make informed technology choices based on use case requirements like latency, throughput, and cost.
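To make the batch-versus-stream contrast in Q2 concrete, the sketch below computes the same per-minute event count two ways: once over a complete dataset, and once incrementally as events arrive. It uses only the Python standard library and does not involve Kafka or Spark; the 60-second tumbling window and the assumption that events arrive in timestamp order (no late data) are simplifications chosen for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(ts: int) -> int:
    """Map an epoch-second timestamp to the start of its window."""
    return ts - ts % WINDOW_SECONDS


def batch_counts(events):
    """Batch: the whole dataset is available, so aggregate it in one pass."""
    return dict(Counter(window_start(ts) for ts in events))


def stream_counts(events):
    """Streaming: consume events one at a time and emit each window
    as soon as an event from a later window shows up.

    Assumes events arrive in timestamp order (no late or out-of-order data).
    """
    current, count = None, 0
    for ts in events:
        start = window_start(ts)
        if current is not None and start != current:
            yield current, count
            count = 0
        current = start
        count += 1
    if current is not None:
        yield current, count  # flush the final open window
```

Both functions produce the same totals; the difference is when results become available. The batch version returns nothing until all input is read, while the streaming version yields each window shortly after it closes, which is the latency trade-off the question is probing. Handling late data (watermarks, allowed lateness) is the natural follow-up topic.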
Q3
You're optimizing a slow-running SQL query that joins three large tables and filters on multiple conditions. Walk me through your debugging and optimization approach, including indexing strategies.
Why they ask this: They're evaluating your hands-on SQL proficiency, query optimization skills, and understanding of database internals, all core competencies for a mid-level Data Engineer.
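A typical first step for Q3 is to read the query plan before and after adding an index. The sketch below shows that loop using Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the `orders` table, its columns, and the index name are hypothetical, and a single-table filter stands in for the three-table join to keep the example short. Other databases expose the same idea through their own `EXPLAIN` variants, with different output formats.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # columns: id, parent, notused, detail


conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

SQL = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the planner has to scan the whole table.
plan_before = query_plan(conn, SQL, (42,))

# Index the filter column, then check the plan again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, SQL, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The plan should change from a full table scan to an index search. For the multi-table case in the question, the same check is repeated per join and filter column, watching for scans on the largest tables.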
Q4
How would you implement a data quality framework for a data lake containing hundreds of tables? What metrics would you track, and which tools would you use?