Production AI Systems
The Staff Engineer's Playbook
The system design course written for the interviewer's rubric, not the textbook's table of contents. Original frameworks, calibration ladders, real post-mortems, and the downloadable one-pagers you take into the interview.
Proprietary frameworks
Every major topic gets a named, memorable framework you use by name in the room. CLARO. The 5-Phase Latency Anatomy. The Bounded Autonomy Pattern. Original to this course, not repackaged.
Calibration ladders
Every key question shows L4 → L5 → L6 → L7 answers verbatim, with a gloss on what each level missed. You see exactly where you sit today and the specific move that gets you to the next tier.
The unspoken rubric
What interviewers grade but never write down. The first-5-minute signals. The pushback decoder. The recovery playbook. The judgments separating someone ready to lead from someone ready to execute.
Decision artifacts
Twenty printable one-pagers — cheatsheets, decision trees, checklists, reference cards. The CLARO One-Pager. The LLM Serving Diagnostic Tree. The Latency Budget Worksheet. Take them into the room.
Real post-mortems
Anonymized real interview failures with the exact moment things went wrong, what the candidate should have said, and the structural lesson. Pattern-matched to your loop.
Active practice
Every lesson ends with a 7- to 15-minute drill, a 4-dimension self-grading rubric, a full model solution, and the common failures so you know if you actually got it.
Curriculum
Lesson 1.2 — The CLARO Framework — is fully written and free to read. The remaining lessons are scaffolded with module summaries, premises, and framework names. They unlock as the course ships.
The Staff Bar
What L6/L7 actually scores in an AI design interview, and the opening framework you run before drawing anything.
- 1.1
The Interviewer's Rubric: What L6/L7 Actually Scores
· 28 min
The official rubric is generic. The unspoken rubric is what gets you hired. This lesson names the seven specific signals interviewers grade in any AI design round, and the moves that signal each one. Every later lesson in this course is one or two of these seven signals expanded into a technique.
Framework: The 7-Signal Rubric
- 1.2
The CLARO Framework: How to Open Any AI Design Question
· 42 min
Most candidates open AI design questions by drawing boxes. Staff candidates spend the first 5–7 minutes defining the problem so precisely that the architecture becomes mechanical. CLARO is the 5-step opening you run before any line is drawn.
Framework: CLARO
- 1.3
The Trade-off Vocabulary: Naming What You're Choosing Between
· 30 min
Staff candidates name the dimensions of their trade-offs out loud. Senior candidates describe options. This lesson builds the precise vocabulary so you stop saying 'it depends' and start saying 'it depends on whether [X], and here is the rule.' TRACK is the five dimensions every AI system design trade-off lives on.
Framework: The TRACK Dimensions
LLM Systems at Scale
Serving, retrieval, agents, and the prompt-as-distributed-system mental model. Where most candidates miss the structural answer and reach for tactical fixes.
- 2.1
Serving LLMs: The Latency Anatomy Framework
· 48 min
Every LLM latency conversation gets stuck on quantization and batching because candidates don't have a structural way to decompose where time is actually spent. The 5-Phase Latency Anatomy gives you a vocabulary for diagnosing any LLM serving system in under 60 seconds.
Framework: The 5-Phase Latency Anatomy
- 2.2
RAG Beyond the Tutorial: The Retrieval Quality Loop
· 34 min
Every production RAG system fails for one of four reasons, and most candidates conflate them. The Retrieval Quality Loop is the four-stage diagnostic that decomposes 'RAG accuracy is bad' into a fixable problem instead of a category.
Framework: The Retrieval Quality Loop
- 2.3
Agentic Systems: The Bounded Autonomy Pattern
· 36 min
Agentic systems fail at the boundary, not in the model. The Bounded Autonomy Pattern names the five dimensions where autonomy lives and lets you design the boundary deliberately. Without the pattern, every agent design becomes a debate about ReAct vs Reflexion; with it, the architecture follows from the bounds.
Framework: The Bounded Autonomy Pattern
- 2.4
Production Prompts: Why Your Prompt Is a Distributed System
· 28 min
A production prompt is not a string. It is a versioned, evaluated, cached, observably degraded distributed system in disguise. The Prompt Lifecycle Stack names the five layers; once you see prompts that way, the right interview answers become obvious, and so do the failure modes most production teams hit.
Framework: The Prompt Lifecycle Stack
ML Platform
Feature stores, training pipelines, deployment patterns, and silent-failure detection. The platform layer that decides whether models survive contact with reality.
- 3.1
Feature Stores: The Freshness/Consistency/Cost Triangle
· 32 min
Every feature-store debate is the same triangle: freshness, consistency, cost. Pick two. This lesson builds the vocabulary to name which two you picked, why, and what corner case the interviewer is trying to push you into.
Framework: The Freshness/Consistency/Cost Triangle
- 3.2
Training Pipelines: Reproducibility as a System Property
· 30 min
Reproducibility isn't a checkbox; it's a system property that requires versioning data, code, environment, randomness, and intent. This lesson covers the 5-Axis Model and how to design for it without paying 10× in operational overhead.
Framework: The 5-Axis Reproducibility Model
- 3.3
Deployment Patterns for ML: Why Blue/Green Fails for Models
· 32 min
Blue/green is a deploy pattern for code, where the failure mode is crashes. Models need a different one because their failure mode is silent quality drift. This lesson covers the shadow → canary → interleaved/A/B → ramped pattern and when each step is the right call.
Framework: The Quality-Aware Rollout Ladder
- 3.4
ML Observability: The Silent Failure Detection Stack
· 34 min
ML systems fail silently. The interview question is: how would you know? This lesson builds the Silent Failure Detection Stack: the four observability layers a Staff candidate names and the three layers a Senior candidate forgets exist.
Framework: The Silent Failure Detection Stack
Canonical Design Problems
Five 45-minute design problems walked through phase by phase, with the interviewer's internal scoring shown alongside the candidate's design.
- 4.1
Real-Time Recommendation Engine (200M users, 100 ms p99) — Simulated Interview
· 55 min
A 45-minute simulated Staff interview on a 200M-user video platform serving personalized recommendations at 100 ms p99. Walked phase by phase: the CLARO opening, the five clarifying questions with dependency mapping, the architecture, five deep-dive calibration ladders, the latency Gantt, the scale math, the interviewer's follow-up probes, and the cross-company scoring lens. Includes two anonymized post-mortems and two downloadable artifacts. The lesson assumes you know what a two-tower model is, and tells you what most candidates miss when designing one under SLA.
Framework: The 100ms Recsys Spine
- 4.2
LLM-Powered Search (Perplexity-scale) — Simulated Interview
· 42 min
LLM-powered search systems all converge on the same four-stage architecture: Retrieve, Ground, Stream, Audit (RGSA). This walkthrough applies RGSA to a Perplexity-class system at 10M monthly users with a 1.5-second time-to-first-token (TTFT) bar, and shows where the structural decisions live.
Framework: Retrieve → Ground → Stream → Audit
- 4.3
Fraud Detection (Stripe-scale, <50ms decision) — Simulated Interview
· 40 min
Inline risk scoring within the payment-authorization latency budget. Covers feature freshness, model size constraints, and the rules-vs-ML composition pattern that always comes up. The Inline Risk Budget is the framework that converts 'design fraud detection' from a tour of ML techniques into a budget-driven architectural decomposition.
Framework: The Inline Risk Budget
- 4.4
Multi-Modal Content Moderation (Meta-scale) — Simulated Interview
· 44 min
Meta-scale content moderation is two systems with two SLAs and a human-review feedback loop, not one pipeline that does everything. The Cascaded Sync + Streaming Async framework names the decomposition and forces the candidate to address the structural fact that policy categories have different latency budgets.
Framework: Cascaded Sync + Streaming Async
- 4.5
AI Coding Assistant (Copilot-scale) — Simulated Interview
· 42 min
Copilot-class coding assistants serve completions inline as the user types in the IDE, with sub-100 ms perceived latency. The Ghost-Text Completion Stack names the five layers (anticipatory fetch, fill-in-the-middle (FIM) prompting, speculative cancellation, privacy-preserving tenancy, acceptance feedback) and shows where the architectural decisions live.
Framework: The Ghost-Text Completion Stack
Interview Craft
The opening 5 minutes, handling pushback, and the recovery playbook for when you've made a visible mistake mid-interview.
- 5.1
The First 5 Minutes: How Staff Candidates Open Differently
· 26 min
The first 5 minutes set the floor for your final score. The Opening Triplet is the three-move choreography Staff candidates execute by reflex: name a framework, surface the highest-leverage constraint, commit under uncertainty. Done in 5 minutes, those three moves score three of the seven signals from Lesson 1.1 before the design even begins.
Framework: The Opening Triplet
- 5.2
Handling Pushback: The 4 Interviewer Probe Types
· 28 min
Interviewers push back for four distinct reasons: stress-testing, ceiling-finding, debugging the candidate, or offering a hint. Each requires a different response. The 4-Probe Decoder names the four types and the first-sentence cues that distinguish them, so you respond appropriately instead of defending against a stress test when you were actually being taught.
Framework: The 4-Probe Decoder
- 5.3
The Recovery Playbook: When You've Made a Mistake Mid-Interview
· 24 min
Everyone makes a visible mistake. The interview is decided by what you do in the next 90 seconds. The 3-Move Recovery (acknowledge, walk back, commit) turns a stumble into a Staff signal. Done well, it is the highest-leverage moment in the entire interview because it demonstrates the metacognitive ability the rest of the loop only tests indirectly.
Framework: The 3-Move Recovery
Lesson 1.2 — The CLARO Framework
The 5-step opening every Staff candidate runs before drawing anything. Includes a 12-turn simulated interview transcript, a calibration ladder showing L4 → L7 verbatim, three mental models with ASCII visuals, an unspoken-rubric block on the first-5-minute signals, a 7-minute drill with self-grading rubric, the CLARO One-Pager, and a real post-mortem.
Read Lesson 1.2 →
Who this is for
You should buy this if
- ✓ You're preparing for a Staff or Principal AI/ML system design loop at Google, Meta, OpenAI, Anthropic, Stripe, Databricks, or a peer.
- ✓ You can already ship a RAG system and serve a model, but you've been told your interviews need more ‘technical leadership’ signal.
- ✓ You read engineering blogs and skip the tutorial sections. You want trade-offs and the unspoken rubric, not a refresher.
- ✓ You're willing to do 7-minute drills against a rubric, not just read.
You should skip this if
- You're early-career and looking for a primer on ML systems. This course assumes you already operate them in production.
- You want a pure cert-prep checklist for cloud ML platforms. This is interview craft, not vendor certification.
- You're looking for code examples to copy. The frameworks are intentionally code-light because interview rooms are.
What we won't do
Start with Lesson 1.2
The CLARO Framework is the keystone of the course. Read it first. If it changes how you'd open your next mock interview, the rest of the course is built against the same bar.
Read Lesson 1.2 — The CLARO Framework