Q1
Walk me through how you would design a machine learning pipeline to detect fraudulent transactions in real-time, including data ingestion, feature engineering, model selection, and monitoring considerations.
Why they ask this: They want to assess your end-to-end ML system design skills, understanding of production constraints, scalability considerations, and ability to think beyond model training to deployment and monitoring.
Q2
Explain the trade-offs between using a batch processing framework like Spark versus a streaming platform like Kafka for a recommendation system that needs to update predictions every hour. What would influence your choice?
Why they ask this: This tests your knowledge of distributed data processing tools, understanding of latency vs. throughput trade-offs, and ability to match architectural decisions to business requirements in a data-heavy environment.
Q3
You've trained a model that performs well on validation data but poorly in production. Walk through your debugging process—what metrics would you check, and what are common causes you'd investigate?
Why they ask this: They're evaluating your practical troubleshooting skills, understanding of data drift, model degradation, feature engineering issues, and your ability to bridge the gap between development and production environments.
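One concrete check you could mention when discussing data drift is the Population Stability Index (PSI), which compares how a feature is distributed in training data versus production data. The sketch below is only an illustration, not a prescribed answer: the function name `psi`, the bin count, the `eps` floor, and the sample data are all assumptions, and the commonly quoted thresholds (roughly 0.1 for "worth a look" and 0.25 for "significant shift") are rules of thumb rather than fixed standards.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a production sample.

    Hypothetical helper for illustration; bins and eps are arbitrary defaults.
    """
    ref = sorted(expected)
    # Interior bin edges taken at evenly spaced quantiles of the baseline.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so the logarithm stays defined for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up data: a feature as seen in training, and the same feature shifted upward.
training_feature = [i / 100 for i in range(1000)]
production_feature = [v + 3 for v in training_feature]

same_score = psi(training_feature, training_feature)
drift_score = psi(training_feature, production_feature)
print(f"PSI (identical samples): {same_score:.4f}")
print(f"PSI (shifted sample):    {drift_score:.4f}")
```

In an interview answer, a check like this would sit alongside comparing feature pipelines between training and serving, looking for label leakage, and tracking prediction distributions over time.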
Q4
How would you approach feature engineering for a dataset with 500 columns where many are highly correlated? Describe your feature selection strategy and the tools or techniques you'd use.