Q1
Walk me through how you would design a monitoring and alerting strategy for a microservices architecture running on Kubernetes. What metrics would you prioritize, and which tools have you used?
Why they ask this: They want to assess your understanding of observability fundamentals, cloud-native monitoring patterns, and your hands-on experience with tools like Prometheus, Datadog, or New Relic, all core competencies for an SRE.
Q2
Describe your experience with Infrastructure as Code. How have you used tools like Terraform or CloudFormation to manage cloud resources, and how do you ensure idempotency and prevent configuration drift?
Why they ask this: This tests whether you can automate infrastructure provisioning at scale, manage complex cloud environments reliably, and apply version control to infrastructure, all essential SRE responsibilities.
Q3
Explain how you would implement and manage CI/CD pipelines to ensure reliable deployments with minimal downtime. What strategies have you used to handle rollbacks and canary deployments?
Why they ask this: They're evaluating your ability to balance deployment velocity with reliability, your understanding of deployment patterns, and your experience with tools like Jenkins, GitLab CI, or GitHub Actions.
Q4
How would you troubleshoot and optimize a cloud database experiencing intermittent latency spikes? Walk through your diagnostic approach and the tools you'd use.