The CLARO Framework: How to Open Any AI Design Question
Most candidates open AI design questions by drawing boxes. Staff candidates spend the first 5–7 minutes defining the problem so precisely that the architecture becomes mechanical. CLARO is the 5-step opening you run before any line is drawn.
There is a recognizable shape to a candidate who is about to fail a Staff system design interview. They hear the prompt, nod, and start drawing. Within four minutes the whiteboard has a load balancer, a queue, a model server, and a vector database, and the interviewer is asking the first real question — what's the objective? — while the candidate is already committed to an architecture, defending decisions they never made. The interviewer does not interrupt them. They write a note. They will write more notes.
The Staff candidate, faced with the same prompt, does not draw anything for the first five to seven minutes. They run a sequence. They name constraints out loud. They allocate a latency budget before they know the architecture. They ask one shape-of-load question that collapses half the possible designs. They state the objective in one sentence that ends with a number. By the time they draw the first box, the architecture is almost mechanical — because the constraints, the budget, the access pattern, and the objective only admit a small set of viable designs. The interviewer's notes for those first minutes read differently.
CLARO is that sequence, written down. It is not a checklist. It is a five-move opening you commit to muscle memory so that on the morning of the interview, when adrenaline is competing with sleep deprivation, you do not have to remember what to do first. You run CLARO. You earn the right to draw the first box.
CLARO
CLARO is a 5-step opening sequence. Each step is a small constraint-discovery move, not a hand-wave. The goal isn't to look thorough — it's to extract the four or five facts that determine the rest of the design. By the end of CLARO, you should be able to write the system's objective on one line and defend why a different architecture would be wrong, not just suboptimal.
- C — Constraints: What cannot change. Regulatory, contractual, latency SLA, existing data residency, on-call team's current skill set, deadline. You're not asking for nice-to-haves. You're surfacing the hard walls of the design space. Two or three real constraints are usually enough.
- L — Latency budget: The end-to-end SLA, then a deliberate allocation across the request path. Not 'we'll be fast.' A budget tree: client + network + auth + retrieval + model + post-processing + return. Allocate before you design. If you can't fit the model's likely latency inside the budget, you've just earned the right to push back on the SLA.
- A — Access patterns: Shape of the load, not the volume. Peak vs steady-state ratio. Read-heavy vs write-heavy. Bursty vs uniform. Long-tail vs hot-keyed. Synchronous user-facing vs async batch. The single highest-leverage question is usually 'is this user-facing or background?' — because the answer collapses half the design space.
- R — Read/write asymmetry: Now name what falls out of the access pattern. If reads dominate writes 1000:1, you cache aggressively and accept eventual consistency. If writes dominate, you optimize the write path and the read path lives on materialized views. This step turns 'access patterns' into a design commitment.
- O — Objective function: What the system is being optimized for, in one sentence with one measurable number. Not 'good user experience.' 'p99 first-token latency under 800 ms at $0.004/request at 99.9% safety classifier agreement.' If you can't write the objective in one line, the rest of the interview becomes a tour of options instead of a defense of a decision.
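The R step above can be read as a small decision rule. The sketch below is only an illustration: the function name `design_commitment` and the 100:1 `dominance` cutoff are assumptions chosen for the example, not part of CLARO itself.

```python
def design_commitment(reads_per_sec: float, writes_per_sec: float,
                      dominance: float = 100.0) -> str:
    """Map a read/write ratio to where the design effort goes.

    `dominance` is a hypothetical cutoff for "one side clearly dominates".
    """
    if reads_per_sec >= dominance * writes_per_sec:
        return "read-dominated: cache aggressively, accept eventual consistency"
    if writes_per_sec >= dominance * reads_per_sec:
        return "write-dominated: optimize the write path, serve reads from materialized views"
    return "mixed: both paths need design effort"


print(design_commitment(1000, 1))      # reads dominate 1000:1
print(design_commitment(50, 50_000))   # writes dominate
print(design_commitment(100, 80))      # no clear winner
```

The point of writing it this way is that the step returns a commitment, not a measurement.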
Open every AI system design question with CLARO. It takes 5–7 minutes and is most of what separates a Staff answer from a Senior one in the first 10 minutes. Skip it only when the interviewer explicitly hands you the constraints and objective up front — and even then, restate them out loud so the room agrees on them.
Prompt: 'Design a real-time content moderation system.' After CLARO: Constraint = must run inside the post-write path of the publishing service, EU residency required. Latency = 200 ms p99 budget, 80 ms allocated to inference. Access = write-heavy, 50k posts/sec at peak, 10k steady-state. Asymmetry = writes dominate; reads are auditor queries that can be batch-served. Objective = maximize harmful-content recall at fixed false-positive rate (say, <1%) while staying inside the 80 ms inference budget. The architecture (cascaded classifier, small-model first, large-model second-pass on uncertain) is now nearly forced.
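A minimal sketch of the cascaded classifier named above, to make the "small model first, large model on uncertain" idea concrete. The scorers `toy_small` and `toy_large`, the 0.1/0.9 confidence band, and the 0.5 threshold are illustrative assumptions, not values from the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    harmful: bool
    tier: int      # 1 = small model decided, 2 = large model decided
    score: float


def cascade(post: str,
            small_model: Callable[[str], float],
            large_model: Callable[[str], float],
            low: float = 0.1, high: float = 0.9,
            threshold: float = 0.5) -> Decision:
    """Small model handles confident cases; the uncertain band escalates."""
    s = small_model(post)
    if s <= low:                      # confidently benign: pass through
        return Decision(False, 1, s)
    if s >= high:                     # confidently harmful: block
        return Decision(True, 1, s)
    s2 = large_model(post)            # uncertain band: second pass
    return Decision(s2 >= threshold, 2, s2)


# Stand-in scorers (hypothetical); real ones would be model calls.
def toy_small(post: str) -> float:
    if "attack" in post:
        return 0.95
    if "maybe" in post:
        return 0.50
    return 0.02


def toy_large(post: str) -> float:
    return 0.80 if "bad" in post else 0.20


for text in ["nice day", "attack now", "maybe bad", "maybe fine"]:
    print(text, cascade(text, toy_small, toy_large))
```

Only posts in the uncertain band pay the large model's latency, which is what lets the design fit a tight inference budget.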
L6 candidate, AI infra org at a top lab, on-site loop. Prompt: 'Design a real-time content moderation system for a social platform.'
Interviewer: Design a real-time content moderation system for a social platform. You can ask clarifying questions.
Candidate: Before I start sketching, I want to walk through five things — constraints, latency, access pattern, read/write shape, and the objective. That should take five to seven minutes and it'll keep me from designing the wrong system.
Candidate: Constraints first. Two questions. One: are we deciding whether to publish before the post is visible, or scanning after publish and taking down later? Two: any data residency or regulatory regime — EU DSA, COPPA?
Interviewer: Good. Pre-publish for known categories — CSAM, terror, copyright. Post-publish for everything else. We operate globally including the EU.
Candidate: Got it. So I have a synchronous pre-publish path that gates publishing, and an async post-publish path that scans and takes down. The first has a hard latency budget tied to publish UX. The second has a recall budget — how long after publish can we tolerate harmful content being live.
Candidate: L — latency budget. For the synchronous path, what's the publish SLA today? And for the async path, what's our policy commitment — within 60 seconds, 5 minutes?
Interviewer: Publish API is 250 ms p99 today. Policy commitment for async is 60 seconds for high-severity, 5 minutes for medium.
Candidate: Inside 250 ms, I'll allocate 50 ms for auth and routing, 30 ms for response, leaving roughly 170 ms for the moderation decision. That's tight for any large model. I'm going to plan for a cascaded classifier — a small fast model with a confident-pass-through, and a slower model for the uncertain band. I'll commit to that number now and we can revisit if it's wrong.
Candidate: A — access pattern. Two questions. What's peak posts-per-second versus steady-state? And what fraction of posts have media — images and video — versus text-only?
Interviewer: 50k posts per second at peak, around 10k steady. Media: about 30% have images, 5% have video.
Candidate: R — read/write asymmetry. This is write-dominated by orders of magnitude. The read side — auditor dashboards, appeals — is low QPS and can run off materialized views or a separate analytics store. I won't spend time on the read path unless you want me to. So the design effort goes entirely into the write path.
Candidate: O — objective. I'd state it as: maximize harmful-content recall at a fixed false-positive rate that the policy team is willing to commit to publicly — say 1% — subject to the 170 ms p99 inference budget on the sync path and 60 seconds end-to-end on the async path. That's one sentence with three numbers. If those three numbers are right, the rest of the design follows. Does this match how the policy team thinks about it?
Seven minutes elapsed. The candidate has not drawn a single component yet. The interviewer's notes say: 'opened with CLARO, named the two sub-systems, fixed the latency budget arithmetically, deprioritized the read path with reason, stated objective as a measurable trade-off, asked the policy-alignment question.' This is the rubric. Most candidates do not reach minute seven still un-architected — and that's why most candidates don't score Staff on this question.
The interviewer uses the phrase 'real-time' anywhere in the prompt.
Build a latency budget tree before anything else. 'Real-time' is not a number — it spans roughly four orders of magnitude depending on context — and the design space collapses differently at each one.
How would you start solving this problem?
Interviewer's opening prompt: 'Design a real-time content moderation system.' The next sentence out of the candidate's mouth is graded.
I'd start by drawing the high-level components — we'd need a publishing service, a moderation service, probably a model API, and some kind of storage. Then I'd refine each one.
I'd ask a few clarifying questions first. What's the scale? Are we doing text only, or media too? Is this synchronous or async? Then I'd sketch the pipeline.
Before I draw anything, I want to lock down five things: constraints, latency budget, access patterns, read/write shape, and the objective function. I'm going to spend about five minutes on that because the architecture will fall out of those four or five facts.
I want to start by asking the question that splits the problem into the right sub-systems. The single largest variable in content moderation is pre-publish vs post-publish — they're effectively two different systems with two different SLAs. I'll ask that first, then I'll run constraints, latency budget, access pattern, and objective on each sub-system separately. The reason I'm not just naming a checklist is that I've seen candidates run a great clarification pass on the wrong unit of design and end up with one pipeline trying to be two systems. So the meta-move here is: find the fork first, then run the framework on each branch.
Saw the framework one level above itself: when a problem contains a hidden fork (two sub-systems with different SLAs, two user populations with different costs, two latency regimes), running CLARO on the combined system produces an incoherent answer. The Staff move is to find the fork first, then run CLARO on each branch. This is a pattern: search for the hidden fork before applying any framework. Try it on caching ('hot vs cold reads are two systems'), on RAG ('factual vs creative queries route differently'), on rate limiting ('per-user vs per-IP are two policies that compose'). Once a candidate sees this, they apply it across the entire loop.
The Constraint Wall
Design space (all possible systems) +----------------------------------------+ | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | | ░░░░░░░ ┌────────────┐ ░░░░░░░░░ | | ░░░░░░░ │ Designs │ ░░░░░░░░░ | | ░░░░░░░ │ that fit: │ ░░░░░░░░░ | | ░░░░░░░ │ │ ░░░░░░░░░ | | ░░░░░░░ │ • CLM-1 │ ░░░░░░░░░ | | ░░░░░░░ │ • CLM-2 │ ░░░░░░░░░ | | ░░░░░░░ │ • CLM-3 │ ░░░░░░░░░ | | ░░░░░░░ └────────────┘ ░░░░░░░░░ | | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | | ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | +----------------------------------------+ ░ = ruled out by constraints Each constraint shrinks the feasible region.
Think of constraints as walls that shrink the feasible design space, not as boxes to check. A regulatory constraint (EU data residency) removes every architecture that ships data to US-only inference clusters. A latency constraint (170 ms p99) removes every architecture whose model alone takes 300 ms. After two or three real constraints, the number of viable architectures usually drops to two or three, and they have visible names: cascaded classifier, single large model with quantization, ensemble with majority vote. You're not choosing creatively — you're picking from a short list the constraints already wrote.
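The wall metaphor can be sketched as successive filters over a candidate list. Everything in the list below is hypothetical: the candidate names echo the paragraph above, but the latency figures and residency flags are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    p99_inference_ms: int   # assumed figure, for illustration only
    eu_resident: bool


CANDIDATES = [
    Candidate("single large model, US-only cluster", 300, False),
    Candidate("single large model, EU cluster", 300, True),
    Candidate("cascaded classifier, EU cluster", 160, True),
    Candidate("quantized large model, EU cluster", 150, True),
    Candidate("ensemble with majority vote, EU cluster", 165, True),
]

# Each constraint is a wall: a predicate every surviving design must satisfy.
CONSTRAINTS = {
    "EU data residency": lambda c: c.eu_resident,
    "p99 inference <= 170 ms": lambda c: c.p99_inference_ms <= 170,
}


def feasible(candidates, constraints):
    remaining = list(candidates)
    for label, ok in constraints.items():
        remaining = [c for c in remaining if ok(c)]
        print(f"after {label!r}: {len(remaining)} left")
    return remaining


survivors = feasible(CANDIDATES, CONSTRAINTS)
for c in survivors:
    print(" -", c.name)
```

With these assumed numbers, two constraints cut five candidates to three, which is the "short list the constraints already wrote."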
The Latency Budget Tree
End-to-end SLA: 250 ms p99
├── Network round-trip: 20 ms
├── Auth + rate limit + routing: 30 ms
├── Moderation decision: 170 ms ◄── the only flexible budget
│   ├── Tier-1 model (text): 40 ms (handles ~85% of traffic)
│   ├── Tier-2 model (uncertain): 120 ms (handles ~15%)
│   └── Vote / merge: 10 ms
└── Response serialization: 30 ms

Children sum to parent. Any node over budget = redesign.
Every latency-bound system is a tree where each node has a budget and the children must sum to the parent. Most candidates think of latency as 'how fast is my model.' Staff candidates think of it as a budget tree where the model is one node and most of the budget is already spent on infrastructure they didn't choose. Drawing the tree out loud, with numbers, is the single highest-signal move in the L step. It also gives you a defensible reason to push back on the SLA if the math doesn't work — and pushing back well is a Staff signal in itself.
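The "children must sum to the parent" rule is easy to check mechanically. Here is a minimal sketch using the numbers from the tree above; the `Node` class and `check` function are hypothetical helpers written for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    budget_ms: int
    children: list[Node] = field(default_factory=list)


def check(node: Node) -> list[str]:
    """Return one message per parent whose children do not sum to its budget."""
    problems = []
    if node.children:
        total = sum(c.budget_ms for c in node.children)
        if total != node.budget_ms:
            problems.append(
                f"{node.name}: children sum to {total} ms, budget is {node.budget_ms} ms"
            )
        for child in node.children:
            problems.extend(check(child))
    return problems


tree = Node("End-to-end SLA", 250, [
    Node("Network round-trip", 20),
    Node("Auth + rate limit + routing", 30),
    Node("Moderation decision", 170, [
        Node("Tier-1 model", 40),
        Node("Tier-2 model", 120),
        Node("Vote / merge", 10),
    ]),
    Node("Response serialization", 30),
])

print(check(tree))  # empty list: every parent's budget is fully allocated
```

If the Tier-2 model turned out to need 150 ms, `check` would flag the moderation node (200 ms allocated against a 170 ms budget), which is the arithmetic you would bring to a conversation about the SLA.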
The One-Number Objective
Bad:  "We want a good user experience."
      "Optimize cost, latency, and quality."
      "Maximize accuracy."

Good: "Maximize harmful-content recall at 1% FPR,   ◄── primary metric
       subject to p99 inference ≤ 170 ms,           ◄── constraint (latency)
       at cost ≤ $0.002 per post."                  ◄── constraint (cost)
Real objectives have one primary metric and two or three numbered constraints. Multi-objective phrasings ('cost, latency, and quality') hide the fact that you have to pick which is the optimization target and which are walls. If you can't name the primary metric, you can't make a single trade-off cleanly, and every architectural choice becomes a small argument with the room. State the objective as 'maximize X subject to Y and Z.' Then when an interviewer probes — 'why not use a bigger model?' — your answer is 'because that violates the cost constraint at this load,' not a vague preference.
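The "maximize X subject to Y and Z" shape maps directly onto a filter followed by a max. In the sketch below the three options and all their numbers are invented for illustration; only the 170 ms and $0.002 limits come from the example objective.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    recall_at_1pct_fpr: float   # primary metric (hypothetical values)
    p99_inference_ms: int       # constraint: latency
    cost_per_post_usd: float    # constraint: cost


def choose(options, max_p99_ms: int = 170,
           max_cost_usd: float = 0.002) -> Optional[Option]:
    """Maximize the primary metric among options that satisfy both constraints."""
    feasible = [o for o in options
                if o.p99_inference_ms <= max_p99_ms
                and o.cost_per_post_usd <= max_cost_usd]
    return max(feasible, key=lambda o: o.recall_at_1pct_fpr, default=None)


options = [
    Option("bigger model", 0.95, 160, 0.004),      # best recall, violates cost
    Option("cascade", 0.91, 165, 0.0015),
    Option("small model only", 0.84, 40, 0.0005),
]

best = choose(options)
print(best.name if best else "no feasible option: push back on a constraint")
```

Because constraints are walls and recall is the only thing being maximized, the answer to "why not the bigger model?" is a failed predicate, not a preference.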
Why CLARO is not a checklist
Senior candidates who learn CLARO sometimes turn it into a checklist they recite mechanically. This is worse than not knowing CLARO at all, because it produces five minutes of low-signal questions ('what's the scale, what's the latency, what's the read pattern…') that sound like preparation but are actually stalling. The interviewer notices.
CLARO is a sequence of decisions, not a sequence of questions. Each step ends with a commitment, not a fact. C ends with 'these are the walls.' L ends with a budget tree. A ends with 'this is write-heavy, here's what that forces.' R ends with 'I'm deprioritizing the read path because…' O ends with a one-line measurable objective. If a step ends with the candidate still asking, the step isn't done.
| Dimension | Full CLARO (5–7 min) | Quick-open (1 min) | Constraint-surface only (3 min) |
|---|---|---|---|
| Time cost | 5–7 min | 1 min | 2–3 min |
| Staff signal | Strongest. Names the framework, commits at each step. | Reads as eager but premature. Senior-tier ceiling. | Strong on constraints, weak on objective alignment. |
| Risk of designing wrong system | Low. Each step closes a class of failure modes. | High. Architecture committed before constraints surfaced. | Medium. May design for wrong primary metric. |
| Best fit | Under-specified prompts. Ambiguous scope. Multi-system problems. | Prompts where the interviewer pre-states constraints and objective ('Design a 10ms fraud-scoring API, here's the QPS, here's the budget'). | Time-boxed rounds (30 min total). Highly technical prompts where the objective is implicit. |
| Choose when | Default. Use unless the interviewer signals time pressure or hands you the constraints. | Only when the prompt is fully specified up front. Restate the constraints back to confirm, then proceed. | 30-min rounds. Compress C+L+O into ~3 minutes; skip A and R as separate steps but address them when designing the write path. |
Default to full CLARO. The 5–7 minute cost is dwarfed by the cost of designing the wrong system and arguing with the interviewer for 25 minutes. Use the abbreviated forms only when the prompt or the round explicitly forces them.
Radar — real-time fraud detection
Stripe's Radar makes a fraud decision on every authorization in the synchronous payment path. The system has roughly a 100 ms p99 budget for the entire risk decision — including feature lookup, model inference, and rules evaluation — because anything slower starts shedding payment authorization revenue. The architecture follows directly from that constraint: feature stores are co-located with model serving, models are small and quantized, and rules engines run inline rather than as a separate service.
Figure: CLARO timing on 'Design an AI-generated essay detector at university scale.' Total budget is 7 minutes (420 seconds) of interview time, divided in the proportions a Staff candidate actually spends per step. C and L together take more than half the time, because they do most of the work.
What interviewers at Google L6+, Meta E6+, Anthropic, and OpenAI are actually scoring in the first 5 minutes, and what the rubric document will not tell you.
What they score
- Did the candidate commit, or did they just enumerate? Listing 'we'd need to consider X, Y, Z' without making a choice is the most common Senior-tier failure pattern.
- Did they name the highest-leverage question first, or did they go in textbook order? Asking 'pre-publish vs post-publish' before 'what's the QPS' is a Staff signal.
- Did they ask one question that, depending on the answer, would change the architecture, or did they ask five questions whose answers would only fill in details?
- When they were given a constraint they didn't expect, did they restate it back in their own words, or did they keep going on their original plan?
- Did the room have alignment by minute 7 — did the interviewer know what the candidate was about to build — or was there still ambiguity?
Why it's not on the rubric
These aren't on the rubric because they're judgments about engineering maturity, not technical knowledge. They're what separates someone who's ready to lead a team from someone who's ready to execute on one. The rubric document says things like 'demonstrates strong problem decomposition' — these bullets are what 'strong problem decomposition' actually looks like in the room.
How to signal it
- End every CLARO step with a sentence that starts with 'I'll commit to…' or 'My working assumption is…'. Then say what you'd change your mind on.
- If the prompt has a hidden fork (pre/post-publish, sync/async, hot/cold reads), ask about it inside the first 90 seconds. Do not wait for the interviewer to surface it.
- When the interviewer answers a question with a number, restate the number out loud and what it implies. 'Fifty thousand posts per second at peak and ten thousand steady: that's roughly 0.9 billion posts per day at the steady rate, and up to 4.3 billion if the peak were sustained, so the daily inference cost dominates the design.'
- When a constraint surprises you, say 'OK, that changes the design — let me walk back the latency budget' rather than continuing as if the constraint didn't reset something.
- At minute 6 or 7, summarize aloud: 'So we're building two sub-systems. Sync gates publishing under a 170 ms inference budget; async catches high-severity content within 60 seconds and medium-severity within 5 minutes. Primary metric is recall at 1% FPR. Sound right?' If the interviewer nods, you've earned the room.
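The "say what the number implies" move is plain arithmetic, sketched below. The 50k and 10k posts-per-second figures come from the worked example; the $0.002 per-post cost is borrowed from the sample objective and is an assumption here, as are the helper names.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def posts_per_day(posts_per_second: int) -> int:
    """Daily volume if the given rate were held for a full day."""
    return posts_per_second * SECONDS_PER_DAY


def daily_cost_usd(posts: int, cost_per_post_usd: float) -> float:
    return posts * cost_per_post_usd


peak_sustained = posts_per_day(50_000)   # upper bound: peak held all day
steady = posts_per_day(10_000)           # steady-state rate

print(f"peak sustained: {peak_sustained:,} posts/day")
print(f"steady state:   {steady:,} posts/day")
print(f"steady-state cost at $0.002/post: ${daily_cost_usd(steady, 0.002):,.0f}/day")
```

Peak rate sustained for a day gives about 4.3 billion posts; the steady rate gives about 0.86 billion. Either way the per-post inference cost multiplies into a seven-figure daily bill under the assumed price, which is why the cost constraint belongs in the objective.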
Practice this. Time yourself.
You have 7 minutes. Apply CLARO to this prompt: 'Design a system that detects AI-generated text in student essays at university scale.' Write your output as a 5-paragraph response, one per CLARO letter, ending with a one-sentence objective function containing exactly one primary metric and two numbered constraints. Time yourself. Do not draw any architecture during these 7 minutes. After the timer ends, compare your response against the rubric below.
Self-assessment rubric
| Dimension | Weak | Passing | Strong | Staff bar |
|---|---|---|---|---|
| Constraint discovery | Listed scale and tech-stack constraints only ('we'd use Python, maybe a queue'). | Named 2–3 real constraints including at least one regulatory or organizational one. | Named the false-positive cost (an accusation of cheating) as a primary constraint that shapes the model choice. | Surfaced the non-obvious constraint that the system must produce auditable evidence per decision — not just a confidence score — because of due-process expectations in academic discipline. Connected that to a specific architectural commitment (evidence payload alongside the prediction). |
| Latency budget | Said 'this can be async' and moved on. | Said async, gave a rough budget like '<60 seconds.' | Named the bursty access pattern (assignment deadlines) as the reason the budget can be soft on the average and tight on the p99. | Allocated the budget tree even for an async system: queue wait time, detection inference, evidence-extraction pass, indexing for instructor view. Identified which step is the controllable lever. |
| Access pattern | Said 'students submit essays, instructors read results.' | Identified the deadline-driven burst pattern. | Connected the burst to an autoscaling decision and called out the cold-start cost of model serving under sudden load. | Noted that the read pattern is two-sided — instructors during grading, students during appeals — and that the appeals path has different latency and audit requirements than the instructor dashboard. Routed them differently. |
| Objective clarity | Wrote a multi-clause sentence with no measurable numbers ('detect AI text accurately and fairly'). | Wrote 'maximize recall at low FPR' without committing to a number. | Stated 'maximize recall at <0.5% institution-wide FPR' or similar, with one concrete number. | Wrote the full shape: 'maximize X subject to Y and Z' with one primary metric and two numbered constraints. Asked or noted that the FPR threshold is a policy decision, not an engineering one, and would need confirmation from the academic integrity office before final commitment. |
Common failures
- Treated 'AI-generated text detection' as a model problem, not a system problem. Spent the 7 minutes on classifier choice, missed the auditability constraint entirely.
- Wrote the objective as 'high accuracy' without naming a primary metric or any numbered constraint. The objective has to fit the shape 'maximize X subject to Y and Z.'
- Ignored the burst pattern. A system that's fine at average rate and falls over at deadline-hour peak is the canonical fail mode here.
- Forgot the read path entirely. The appeals workflow is the part of the system that, when broken, makes the news.
- Allocated the latency budget as one number ('60 seconds') without decomposing it into queue / inference / evidence / index. Without the tree, you have no controllable lever when the SLA slips.
The CLARO One-Pager
- C — Constraints: 'What can't change here? Regulatory, residency, existing SLA, current team capacity, deadline?'
- L — Latency budget: 'What's the end-to-end SLA, and how is it allocated across network / auth / retrieval / model / response?'
- A — Access patterns: 'Is this user-facing or background? Peak vs steady-state ratio? Bursty or uniform? Hot keys?'
- R — Read/write asymmetry: 'Given the pattern, where am I spending design effort — write path, read path, or both?'
- O — Objective: 'In one sentence, what is the system optimizing for? Primary metric plus two numbered constraints, in the shape: maximize X subject to Y and Z.'
- End every step with a commitment, not a question. 'I'll plan for…' / 'My working assumption is…' — then move on.
- Find the hidden fork before applying the framework. Pre-publish vs post-publish, sync vs async, hot vs cold reads — when a prompt has two sub-systems, run CLARO on each branch.
- When given a number, say what it implies out loud. '50k posts/sec at peak is about 4.3 billion/day if sustained (about 0.9 billion/day at 10k steady), which means daily inference cost dominates.'
- When surprised by a constraint, walk the budget back explicitly. Don't keep going on your original plan as if the constraint didn't reset it.
- At minute 6 or 7, summarize back: 'Two sub-systems, sync 170 ms inference, async 60 s, recall at 1% FPR. Sound right?' Then draw.
Examples
- C: pre-publish gate + post-publish scan; EU residency; high-severity categories only sync.
- L: 250 ms publish SLA → 170 ms inference budget on sync; 60 s on async.
- A: 50k posts/sec peak; 30% have images.
- R: write-dominated; read path is auditor dashboards (low QPS, separate store).
- O: maximize harmful-content recall at 1% FPR subject to 170 ms sync inference and 60 s async end-to-end.

- C: corpus updates daily; English + 3 markets; results must cite sources.
- L: 1.5 s p99 user-facing → 200 ms retrieval + 800 ms generation + 500 ms first-token streaming.
- A: 10k QPS steady, 50k peak; 80/20 query head, fat long-tail.
- R: read-dominated 1000:1; write path is corpus indexing, runs nightly.
- O: maximize answer faithfulness (LLM-as-judge or human eval) at p99 ≤ 1.5 s and cost ≤ $0.005/query.

- C: inline to payment auth; PCI scope; <100 ms p99 hard limit.
- L: 100 ms total → 30 ms feature lookup + 40 ms inference + 20 ms rules + 10 ms wire.
- A: 3k TPS steady, 30k peak (Black Friday); skewed by merchant.
- R: write = decision log (async); read = on next transaction (online feature lookup).
- O: maximize fraud recall at 0.1% legitimate-decline rate subject to 100 ms p99 and $0.0005/decision.
L6 candidate, AI infrastructure org at a top AI lab, third interview of a six-round on-site. Prompt: 'Design a RAG system that lets engineers query our internal documentation.' Candidate had 4 years of RAG production experience and was highly technical. Interviewer was a Staff engineer on the platform team.
Candidate heard 'RAG' and immediately started drawing. By minute 3, the whiteboard had a document loader, a chunker, an embedding model, a vector database (Pinecone, with rationale), a retriever, a reranker, and an LLM. By minute 8, they were debating chunk size and overlap strategies. By minute 12, they were comparing HNSW vs IVF-PQ. The interviewer was nodding and asking pointed follow-ups about each choice. At minute 22, the interviewer asked, 'What's the relevance metric you're optimizing?' The candidate paused and said 'recall at K… or maybe MRR, depending.' The interviewer asked, 'What does the team currently measure?' The candidate didn't know. The interviewer asked, 'How often does the corpus update — daily, weekly, or every commit?' The candidate didn't know. They guessed weekly. The interviewer asked, 'And how much of the docs are out of date right now?' The candidate said 'probably some, hard to know.' The interviewer wrote a note.
Minute 22 was the visible failure, but the actual failure was at minute 2. When the candidate started drawing, they committed to a system that retrieves and generates — when in fact the team's biggest problem with internal docs is that 40% of the docs are stale and engineers don't trust the answers regardless of how good retrieval is. The right system isn't a better RAG pipeline; it's a corpus-staleness detector that flags low-confidence answers and routes them to a human curator. The candidate never had the chance to discover that, because they never asked.
Before any drawing: 'Before I sketch a RAG system, I want to understand whether this is a retrieval problem or a corpus problem. Two questions: how fresh is the corpus typically, and what's the current trust level — do engineers actually use the docs they have, or do they Slack each other instead? If the docs are mostly trustworthy and the issue is finding the right one, that's a retrieval problem and we're building RAG. If the docs are mostly stale and engineers have lost trust, no retriever fixes that — we'd be building a corpus-confidence system with a small retrieval layer attached.' That single 30-second exchange would have changed the entire architecture, and the interviewer's feedback later in the loop confirmed it: 'Strong implementation knowledge but jumped past the system definition. We don't need someone who can ship RAG; we need someone who can tell us when RAG is the wrong system.'
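The freshness question in that exchange has a measurable form. The sketch below is one hedged way to phrase it: the 180-day age cutoff, the 25% threshold, and both function names are assumptions for illustration, and a real team would pick its own definition of "stale."

```python
from datetime import date, timedelta


def stale_fraction(last_updated: list[date], today: date,
                   max_age_days: int = 180) -> float:
    """Fraction of documents not updated within `max_age_days` (assumed cutoff)."""
    if not last_updated:
        return 0.0
    cutoff = today - timedelta(days=max_age_days)
    return sum(d < cutoff for d in last_updated) / len(last_updated)


def diagnose(fraction: float, threshold: float = 0.25) -> str:
    """Hypothetical rule: a largely stale corpus is not a retrieval problem."""
    return "corpus problem" if fraction >= threshold else "retrieval problem"


# Ten documents: four untouched since early 2022, six updated recently.
docs = [date(2022, 1, 1)] * 4 + [date(2024, 5, 1)] * 6
frac = stale_fraction(docs, today=date(2024, 6, 1))
print(f"{frac:.0%} stale -> {diagnose(frac)}")
```

A corpus like the one in the case study, with 40% of documents stale, lands on the "corpus problem" side of this rule before any retriever is drawn.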
CLARO doesn't exist to be thorough. It exists to prevent the failure mode where you build the right architecture for the wrong system. The candidate above wasn't weak at RAG — they were probably one of the strongest RAG engineers in the loop. They failed because they assumed the prompt was an architecture question when it was actually a 'name the right problem' question. Running CLARO at minute 2 would have forced them to ask the corpus-staleness question, which would have revealed that the team's actual problem was upstream of retrieval. The most expensive interview mistakes are not in the architecture you draw — they're in the architecture you started drawing too early.