Module 1 · Lesson 2 · Foundations · 42 min

The CLARO Framework: How to Open Any AI Design Question

Most candidates open AI design questions by drawing boxes. Staff candidates spend the first 5–7 minutes defining the problem so precisely that the architecture becomes mechanical. CLARO is the 5-step opening you run before any line is drawn.

There is a recognizable shape to a candidate who is about to fail a Staff system design interview. They hear the prompt, nod, and start drawing. Within four minutes the whiteboard has a load balancer, a queue, a model server, and a vector database. The interviewer asks the first real question (what's the objective?), and the candidate is already committed to an architecture, defending decisions they never made. The interviewer does not interrupt them. They write a note. They will write more notes.

The Staff candidate, faced with the same prompt, does not draw anything for the first five to seven minutes. They run a sequence. They name constraints out loud. They allocate a latency budget before they know the architecture. They ask one shape-of-load question that collapses half the possible designs. They state the objective in one sentence that ends with a number. By the time they draw the first box, the architecture is almost mechanical — because the constraints, the budget, the access pattern, and the objective only admit a small set of viable designs. The interviewer's notes for those first minutes read differently.

CLARO is that sequence, written down. It is not a checklist. It is a five-move opening you commit to muscle memory so that on the morning of the interview, when adrenaline is competing with sleep deprivation, you do not have to remember what to do first. You run CLARO. You earn the right to draw the first box.

Framework

CLARO

CLARO is a 5-step opening sequence. Each step is a small constraint-discovery move, not a hand-wave. The goal isn't to look thorough — it's to extract the four or five facts that determine the rest of the design. By the end of CLARO, you should be able to write the system's objective on one line and defend why a different architecture would be wrong, not just suboptimal.

1
C — Constraints
What cannot change. Regulatory, contractual, latency SLA, existing data residency, on-call team's current skill set, deadline. You're not asking for nice-to-haves. You're surfacing the hard walls of the design space. Two or three real constraints are usually enough.
2
L — Latency budget
The end-to-end SLA, then a deliberate allocation across the request path. Not 'we'll be fast.' A budget tree: client + network + auth + retrieval + model + post-processing + return. Allocate before you design. If you can't fit the model's likely latency inside the budget, you've just earned the right to push back on the SLA.
3
A — Access patterns
Shape of the load, not the volume. Peak vs steady-state ratio. Read-heavy vs write-heavy. Bursty vs uniform. Long-tail vs hot-keyed. Synchronous user-facing vs async batch. The single highest-leverage question is usually 'is this user-facing or background?' — because the answer collapses half the design space.
4
R — Read/write asymmetry
Now name what falls out of the access pattern. If reads dominate writes 1000:1, you cache aggressively and accept eventual consistency. If writes dominate, you optimize the write path and the read path lives on materialized views. This step turns 'access patterns' into a design commitment.
5
O — Objective function
What the system is being optimized for, in one sentence in which every term carries a measurable number. Not 'good user experience.' 'p99 first-token latency under 800 ms at $0.004/request at 99.9% safety classifier agreement.' If you can't write the objective in one line, the rest of the interview becomes a tour of options instead of a defense of a decision.

When to use

Open every AI system design question with CLARO. It takes 5–7 minutes and is most of what separates a Staff answer from a Senior one in the first 10 minutes. Skip it only when the interviewer explicitly hands you the constraints and objective up front — and even then, restate them out loud so the room agrees on them.

Worked example

Prompt: 'Design a real-time content moderation system.' After CLARO: Constraint = must run inside the post-write path of the publishing service, EU residency required. Latency = 200 ms p99 budget, 80 ms allocated to inference. Access = write-heavy, 50k posts/sec at peak, 10k steady-state. Asymmetry = writes dominate; reads are auditor queries that can be batch-served. Objective = maximize harmful-content recall at fixed false-positive rate (say, <1%) while staying inside the 80 ms inference budget. The architecture (cascaded classifier, small-model first, large-model second-pass on uncertain) is now nearly forced.
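The cascaded classifier named at the end of the worked example can be sketched in a few lines. This is a minimal illustration, not a production design: the thresholds (0.05, 0.95, 0.5) and the two toy scoring functions are hypothetical stand-ins, and real models would return calibrated harmful-content probabilities.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeDecision:
    harmful: bool
    tier: int      # which tier made the call (1 or 2)
    score: float   # harmful-probability from the deciding tier


def moderate(
    post: str,
    small_model: Callable[[str], float],
    large_model: Callable[[str], float],
    low: float = 0.05,      # assumed: below this, tier 1 passes the post
    high: float = 0.95,     # assumed: above this, tier 1 blocks the post
    block_at: float = 0.5,  # assumed: tier-2 decision threshold
) -> CascadeDecision:
    """Tier 1 decides the confident cases; only the uncertain band pays for tier 2."""
    s1 = small_model(post)
    if s1 <= low:
        return CascadeDecision(False, 1, s1)
    if s1 >= high:
        return CascadeDecision(True, 1, s1)
    s2 = large_model(post)
    return CascadeDecision(s2 >= block_at, 2, s2)


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_small(post: str) -> float:
    if "attack" in post:
        return 0.99
    if "?" in post:
        return 0.50  # ambiguous: lands in the uncertain band
    return 0.01


def toy_large(post: str) -> float:
    return 0.80 if "threat" in post else 0.10


for text in ["nice photo", "plan the attack", "is this a threat?"]:
    print(text, "->", moderate(text, toy_small, toy_large))
```

The design choice to notice is that the latency budget is spent unevenly: most posts exit at tier 1, and the expensive model only runs on the band where the cheap one is unsure.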

Simulated interview

L6 candidate, AI infra org at a top lab, on-site loop. Prompt: 'Design a real-time content moderation system for a social platform.'

Interviewer

Design a real-time content moderation system for a social platform. You can ask clarifying questions.

Candidate

Before I start sketching, I want to walk through five things — constraints, latency, access pattern, read/write shape, and the objective. That should take five to seven minutes and it'll keep me from designing the wrong system.

Candidate

Constraints first. Two questions. One: are we deciding whether to publish before the post is visible, or scanning after publish and taking down later? Two: any data residency or regulatory regime — EU DSA, COPPA?

Interviewer

Good. Pre-publish for known categories — CSAM, terror, copyright. Post-publish for everything else. We operate globally including the EU.

Candidate

Got it. So I have a synchronous pre-publish path that gates publishing, and an async post-publish path that scans and takes down. The first has a hard latency budget tied to publish UX. The second has a time-to-takedown budget: how long after publish can we tolerate harmful content being live?

Candidate

L — latency budget. For the synchronous path, what's the publish SLA today? And for the async path, what's our policy commitment — within 60 seconds, 5 minutes?

Interviewer

Publish API is 250 ms p99 today. Policy commitment for async is 60 seconds for high-severity, 5 minutes for medium.

Candidate

Inside 250 ms, I'll allocate 50 ms for auth and routing, 30 ms for response, leaving roughly 170 ms for the moderation decision. That's tight for any large model. I'm going to plan for a cascaded classifier — a small fast model with a confident-pass-through, and a slower model for the uncertain band. I'll commit to that number now and we can revisit if it's wrong.

Candidate

A — access pattern. Two questions. What's peak posts-per-second versus steady-state? And what fraction of posts have media — images and video — versus text-only?

Interviewer

50k posts per second at peak, around 10k steady. Media: about 30% have images, 5% have video.

Candidate

R — read/write asymmetry. This is write-dominated by orders of magnitude. The read side — auditor dashboards, appeals — is low QPS and can run off materialized views or a separate analytics store. I won't spend time on the read path unless you want me to. So the design effort goes entirely into the write path.

Candidate

O — objective. I'd state it as: maximize harmful-content recall at a fixed false-positive rate that the policy team is willing to commit to publicly — say 1% — subject to the 170 ms p99 inference budget on the sync path and 60 seconds end-to-end on the async path. That's one sentence with three numbers. If those three numbers are right, the rest of the design follows. Does this match how the policy team thinks about it?

Outcome

Seven minutes elapsed. The candidate has not drawn a single component yet. The interviewer's notes say: 'opened with CLARO, named the two sub-systems, fixed the latency budget arithmetically, deprioritized the read path with reason, stated objective as a measurable trade-off, asked the policy-alignment question.' This is the rubric. Most candidates do not reach minute seven still un-architected — and that's why most candidates don't score Staff on this question.

Pattern recognition

When you see

The interviewer uses the phrase 'real-time' anywhere in the prompt.

→

Think

Build a latency budget tree before anything else. 'Real-time' is not a number — it spans roughly four orders of magnitude depending on context — and the design space collapses differently at each one.

'Real-time' content moderation means ~200 ms. 'Real-time' fraud scoring means ~50 ms. 'Real-time' analytics dashboards mean ~5 s. 'Real-time' ML training means ~5 min. The word is interview-room shorthand for 'figure out the actual SLA and the budget tree before designing.' Candidates who treat 'real-time' as a vibe instead of a number end up designing a system that's two orders of magnitude in the wrong direction. The interviewer is testing whether you ask, or whether you assume.
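The "roughly four orders of magnitude" claim can be checked against the four SLAs quoted above. A quick sketch, using only those numbers converted to milliseconds:

```python
import math

# SLA examples quoted in the lesson, converted to milliseconds.
REAL_TIME_SLA_MS = {
    "fraud scoring": 50,
    "content moderation": 200,
    "analytics dashboard": 5_000,
    "ML training": 5 * 60 * 1_000,
}


def orders_of_magnitude(sla_ms):
    """log10 of the ratio between the slowest and the fastest SLA."""
    return math.log10(max(sla_ms.values()) / min(sla_ms.values()))


span = orders_of_magnitude(REAL_TIME_SLA_MS)
print(f"'real-time' spans about {span:.1f} orders of magnitude")
```

The ratio between the slowest and fastest example is 6,000x, which is just under four orders of magnitude.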

Calibration ladder

How would you start solving this problem?

Interviewer's opening prompt: 'Design a real-time content moderation system.' The next sentence out of the candidate's mouth is graded.

L4 · Mid

I'd start by drawing the high-level components — we'd need a publishing service, a moderation service, probably a model API, and some kind of storage. Then I'd refine each one.

Missed: Treated the prompt as an architecture task instead of a definition task. Will be defending decisions they never made by minute eight.

L5 · Senior

I'd ask a few clarifying questions first. What's the scale? Are we doing text only, or media too? Is this synchronous or async? Then I'd sketch the pipeline.

Missed: Knew to ask questions but didn't have a structured way to extract the load-bearing facts. Will get partial information and miss a constraint that resets the design later in the interview.

L6 · Staff

Before I draw anything, I want to lock down five things: constraints, latency budget, access patterns, read/write shape, and the objective function. I'm going to spend about five minutes on that because the architecture will fall out of those four or five facts.

Missed: Strong. The single thing missing is the meta-move — recognizing that some problems contain a fork that has to be resolved before the framework is applied, otherwise the framework runs on the wrong unit.

L7 · Principal

I want to start by asking the question that splits the problem into the right sub-systems. The single largest variable in content moderation is pre-publish vs post-publish — they're effectively two different systems with two different SLAs. I'll ask that first, then I'll run constraints, latency budget, access pattern, and objective on each sub-system separately. The reason I'm not just naming a checklist is that I've seen candidates run a great clarification pass on the wrong unit of design and end up with one pipeline trying to be two systems. So the meta-move here is: find the fork first, then run the framework on each branch.

What scored L7

Saw the framework one level above itself: when a problem contains a hidden fork (two sub-systems with different SLAs, two user populations with different costs, two latency regimes), running CLARO on the combined system produces an incoherent answer. The L7 move is to find the fork first, then run CLARO on each branch. This is a pattern: search for the hidden fork before applying any framework. Try it on caching ('hot vs cold reads are two systems'), on RAG ('factual vs creative queries route differently'), on rate limiting ('per-user vs per-IP are two policies that compose'). Once a candidate sees this, they apply it across the entire loop.

Mental model

The Constraint Wall

Design space (all possible systems)
+----------------------------------------+
|   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   |
|   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   |
|   ░░░░░░░  ┌────────────┐  ░░░░░░░░░   |
|   ░░░░░░░  │ Designs    │  ░░░░░░░░░   |
|   ░░░░░░░  │ that fit:  │  ░░░░░░░░░   |
|   ░░░░░░░  │            │  ░░░░░░░░░   |
|   ░░░░░░░  │  • CLM-1   │  ░░░░░░░░░   |
|   ░░░░░░░  │  • CLM-2   │  ░░░░░░░░░   |
|   ░░░░░░░  │  • CLM-3   │  ░░░░░░░░░   |
|   ░░░░░░░  └────────────┘  ░░░░░░░░░   |
|   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   |
|   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   |
+----------------------------------------+
  ░ = ruled out by constraints
  Each constraint shrinks the feasible region.

Think of constraints as walls that shrink the feasible design space, not as boxes to check. A regulatory constraint (EU data residency) removes every architecture that ships data to US-only inference clusters. A latency constraint (170 ms p99) removes every architecture whose model alone takes 300 ms. After two or three real constraints, the number of viable architectures usually drops to two or three, and they have visible names: cascaded classifier, single large model with quantization, ensemble with majority vote. You're not choosing creatively — you're picking from a short list the constraints already wrote.

Use it when: You catch yourself proposing an architecture and then reverse-checking constraints. Reverse the order: surface the walls first, and the architectures that survive them become obvious. If you find yourself defending a design that violates a constraint you discovered later, you ran the model in the wrong order.
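The wall metaphor maps directly onto a filter: each constraint is a predicate, and a design survives only if it passes all of them. The sketch below assumes illustrative latency figures and residency flags for four candidate designs; none of these numbers come from a real benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    p99_inference_ms: float  # assumed, illustrative figure
    eu_resident: bool        # assumed: can serve entirely inside the EU


Constraint = Callable[[Candidate], bool]


def surviving(candidates: List[Candidate], walls: List[Constraint]) -> List[str]:
    """Names of the designs that pass every constraint."""
    return [c.name for c in candidates if all(wall(c) for wall in walls)]


CANDIDATES = [
    Candidate("cascaded classifier", 160, True),
    Candidate("single large model", 300, True),
    Candidate("single large model, quantized", 150, True),
    Candidate("US-only hosted API", 120, False),
]

WALLS: List[Constraint] = [
    lambda c: c.eu_resident,              # EU data residency
    lambda c: c.p99_inference_ms <= 170,  # 170 ms p99 inference budget
]

print(surviving(CANDIDATES, WALLS))
```

With both walls applied, four candidates drop to two. Run it with an empty list of walls and all four survive, which is the situation a candidate is in when they draw before asking.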

Mental model

The Latency Budget Tree

End-to-end SLA: 250 ms p99
├── Network round-trip:           20 ms
├── Auth + rate limit + routing:  30 ms
├── Moderation decision:         170 ms  ◄── the only flexible budget
│   ├── Tier-1 model (text):      40 ms   (handles ~85% of traffic)
│   ├── Tier-2 model (uncertain): 120 ms  (handles ~15%)
│   └── Vote / merge:             10 ms
└── Response serialization:       30 ms

Children sum to parent. Any node over budget = redesign.

Every latency-bound system is a tree where each node has a budget and the children must sum to the parent. Most candidates think of latency as 'how fast is my model.' Staff candidates think of it as a budget tree where the model is one node and most of the budget is already spent on infrastructure they didn't choose. Drawing the tree out loud, with numbers, is the single highest-signal move in the L step. It also gives you a defensible reason to push back on the SLA if the math doesn't work — and pushing back well is a Staff signal in itself.

Use it when: A prompt mentions p99 latency, real-time, low-latency, or any specific millisecond number. Also use it as a recovery move: if you discover at minute 25 that your design can't hit the SLA, redraw the tree on the spot, find the over-budget node, and propose the single change that fixes it.
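The "children sum to parent" rule is easy to check mechanically. The sketch below encodes the 250 ms tree from the diagram and reports any node whose children overspend it. The `Node` class and the `build_tree` helper are illustrative names, and the 300 ms tier-2 figure in the second call is a hypothetical slower model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    budget_ms: int
    children: List["Node"] = field(default_factory=list)


def over_budget(node: Node) -> List[str]:
    """List every node whose children, summed, exceed that node's own budget."""
    problems: List[str] = []
    if node.children:
        spent = sum(child.budget_ms for child in node.children)
        if spent > node.budget_ms:
            problems.append(
                f"{node.name}: children need {spent} ms, budget is {node.budget_ms} ms"
            )
        for child in node.children:
            problems.extend(over_budget(child))
    return problems


def build_tree(tier2_ms: int = 120) -> Node:
    """The 250 ms publish-path tree, with the tier-2 latency as a parameter."""
    return Node("End-to-end SLA", 250, [
        Node("Network round-trip", 20),
        Node("Auth + rate limit + routing", 30),
        Node("Moderation decision", 170, [
            Node("Tier-1 model (text)", 40),
            Node("Tier-2 model (uncertain)", tier2_ms),
            Node("Vote / merge", 10),
        ]),
        Node("Response serialization", 30),
    ])


print(over_budget(build_tree()))     # every level fits
print(over_budget(build_tree(300)))  # a slower tier-2 breaks the moderation node
```

The second call is the recovery move in miniature: the check names the over-budget node, so the conversation can go straight to the one change that fixes it.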

Mental model

The One-Number Objective

Bad:  "We want a good user experience."
      "Optimize cost, latency, and quality."
      "Maximize accuracy."

Good: "Maximize harmful-content recall at 1% FPR,
       subject to p99 inference ≤ 170 ms,
       at cost ≤ $0.002 per post."

       ──┬──        ──┬──         ──┬──
         │            │             │
       primary    constraint     constraint
       metric      (latency)      (cost)

Real objectives have one primary metric and two or three numbered constraints. Multi-objective phrasings ('cost, latency, and quality') hide the fact that you have to pick which is the optimization target and which are walls. If you can't name the primary metric, you can't make a single trade-off cleanly, and every architectural choice becomes a small argument with the room. State the objective as 'maximize X subject to Y and Z.' Then when an interviewer probes — 'why not use a bigger model?' — your answer is 'because that violates the cost constraint at this load,' not a vague preference.

Use it when: You reach the end of CLARO. If the sentence you write doesn't fit the shape 'maximize X subject to Y and Z,' rewrite it until it does. If the interviewer pushes back on the framing, that's good: it surfaces the real objective faster than designing for the wrong one.
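One way to hold yourself to the 'maximize X subject to Y and Z' shape is to treat the objective as data: one primary metric, a list of numbered ceilings. The sketch below uses the moderation objective from the diagram. The `Objective` and `Limit` types are illustrative, and the measured values for the "bigger model" are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limit:
    metric: str
    ceiling: float
    unit: str


@dataclass(frozen=True)
class Objective:
    maximize: str
    subject_to: List[Limit]

    def sentence(self) -> str:
        limits = " and ".join(
            f"{lim.metric} <= {lim.ceiling:g} {lim.unit}" for lim in self.subject_to
        )
        return f"Maximize {self.maximize} subject to {limits}."

    def violations(self, measured: Dict[str, float]) -> List[str]:
        """Names of the constraints that a candidate design breaks."""
        return [lim.metric for lim in self.subject_to if measured[lim.metric] > lim.ceiling]


MODERATION = Objective(
    maximize="harmful-content recall at 1% FPR",
    subject_to=[
        Limit("p99 inference", 170, "ms"),
        Limit("cost per post", 0.002, "USD"),
    ],
)

print(MODERATION.sentence())

# Hypothetical measurements for a bigger model: fast enough, but too expensive.
bigger_model = {"p99 inference": 165, "cost per post": 0.006}
print(MODERATION.violations(bigger_model))
```

When the interviewer asks "why not use a bigger model?", the answer is whatever `violations` returns, stated as a sentence.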

Why CLARO is not a checklist

Senior candidates who learn CLARO sometimes turn it into a checklist they recite mechanically. This is worse than not knowing CLARO at all, because it produces five minutes of low-signal questions ('what's the scale, what's the latency, what's the read pattern…') that sound like preparation but are actually stalling. The interviewer notices.

CLARO is a sequence of decisions, not a sequence of questions. Each step ends with a commitment, not a fact. C ends with 'these are the walls.' L ends with a budget tree. A ends with 'this is write-heavy, here's what that forces.' R ends with 'I'm deprioritizing the read path because…' O ends with a one-line measurable objective. If a step ends with the candidate still asking, the step isn't done.

Dimension	Full CLARO (5–7 min)	Quick-open (1 min)	Constraint-surface only (2–3 min)
Time cost	5–7 min	1 min	2–3 min
Staff signal	Strongest. Names the framework, commits at each step.	Reads as eager but premature. Senior-tier ceiling.	Strong on constraints, weak on objective alignment.
Risk of designing wrong system	Low. Each step closes a class of failure modes.	High. Architecture committed before constraints surfaced.	Medium. May design for wrong primary metric.
Best fit	Under-specified prompts. Ambiguous scope. Multi-system problems.	Prompts where the interviewer pre-states constraints and objective ('Design a 10ms fraud-scoring API, here's the QPS, here's the budget').	Time-boxed rounds (30 min total). Highly technical prompts where the objective is implicit.
Choose when	Default. Use unless the interviewer signals time pressure or hands you the constraints.	Only when the prompt is fully specified up front. Restate the constraints back to confirm, then proceed.	30-min rounds. Compress C+L+O into ~3 minutes; skip A and R as separate steps but address them when designing the write path.

Verdict

Default to full CLARO. The 5–7 minute cost is dwarfed by the cost of designing the wrong system and arguing with the interviewer for 25 minutes. Use the abbreviated forms only when the prompt or the round explicitly forces them.

Real-world reference · Stripe

Radar — real-time fraud detection

Stripe's Radar makes a fraud decision on every authorization in the synchronous payment path. The system has roughly a 100 ms p99 budget for the entire risk decision — including feature lookup, model inference, and rules evaluation — because anything slower starts shedding payment authorization revenue. The architecture follows directly from that constraint: feature stores are co-located with model serving, models are small and quantized, and rules engines run inline rather than as a separate service.

Takeaway: Notice how the constraint (100 ms inline, sync to payment auth) wrote the architecture. Stripe's engineers didn't choose small quantized models because they prefer them — they chose them because the constraint left no room for anything else. When you apply CLARO and the constraints rule out every familiar architecture, that's not a problem with the framework; that's the signal that the design space has narrowed enough to commit. The shape of any real production AI system is almost always 'this is what fit inside the constraints,' not 'this is what we'd build greenfield.'

Stripe Engineering Blog: 'Similarity clustering to improve flagged transaction detection' and related Radar posts

Timing anatomy · budget 420 s

CLARO timing on 'Design an AI-generated essay detector at university scale.' Total budget is 7 minutes (420 seconds) of interview time. The proportions below are what a Staff candidate actually spends per step. Notice that C and L together take 210 seconds, half the total: they do most of the work.

C — Constraints · 120 s

Ask: per-student or per-essay submission? FERPA implications. False-positive cost (an accusation of cheating) vs false-negative cost. Existing LMS integration vs greenfield. Commit: 'high FP cost, FERPA-relevant, must surface evidence not just a score.'

L — Latency budget · 90 s

Async, batch-acceptable. Budget is in seconds to minutes, not milliseconds. Commit: 'soft SLA of <60 seconds per essay, can run as a background job on submission.'

A — Access pattern · 60 s

Highly bursty — submissions cluster around assignment deadlines. Average is low, peak is 100x average. Commit: 'autoscaling worker pool, queue between submission and detection.'

R — Read/write asymmetry · 60 s

Detection results read by instructors during grading, by students during appeals. Write once at submission, read multiple times. Cache aggressively. Commit: 'write-once result with evidence payload, indexed for instructor dashboard.'

O — Objective function · 90 s

Primary: maximize detection recall at <0.5% false-positive rate institution-wide. Constraint 1: must produce auditable evidence per decision. Constraint 2: must run at <$0.05 per essay. Commit: stated as one sentence, asked to confirm with the team.
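The per-step time allocation is itself a small budget that should sum to the total. A quick check, reading the per-step figures as seconds to match the 420-second total stated above:

```python
# Seconds of interview time per CLARO step, as listed in the breakdown.
STEP_SECONDS = {"C": 120, "L": 90, "A": 60, "R": 60, "O": 90}

total = sum(STEP_SECONDS.values())
shares = {step: seconds / total for step, seconds in STEP_SECONDS.items()}
c_and_l = STEP_SECONDS["C"] + STEP_SECONDS["L"]

print(f"total: {total} s")
for step, share in shares.items():
    print(f"{step}: {share:.0%}")
print(f"C + L: {c_and_l} s ({c_and_l / total:.0%} of the opening)")
```

The steps sum to exactly 420 seconds, and C and L together account for 210 of them.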

Unspoken rubric

What interviewers at Google L6+, Meta E6+, Anthropic, and OpenAI are actually scoring in the first 5 minutes — that they will not tell you in the rubric document.

What they score

· Did the candidate commit, or did they just enumerate? Listing 'we'd need to consider X, Y, Z' without making a choice is the most common Senior-tier failure pattern.
· Did they name the highest-leverage question first, or did they go in textbook order? Asking 'pre-publish vs post-publish' before 'what's the QPS' is a Staff signal.
· Did they ask one question that, depending on the answer, would change the architecture, or did they ask five questions whose answers would only fill in details?
· When they were given a constraint they didn't expect, did they restate it back in their own words, or did they keep going on their original plan?
· Did the room have alignment by minute 7 (did the interviewer know what the candidate was about to build), or was there still ambiguity?

Why it's not on the rubric

These aren't on the rubric because they're judgments about engineering maturity, not technical knowledge. They're what separates someone who's ready to lead a team from someone who's ready to execute on one. The rubric document says things like 'demonstrates strong problem decomposition' — these bullets are what 'strong problem decomposition' actually looks like in the room.

How to signal it

→ End every CLARO step with a sentence that starts with 'I'll commit to…' or 'My working assumption is…'. Then say what you'd change your mind on.
→ If the prompt has a hidden fork (pre/post-publish, sync/async, hot/cold reads), ask about it inside the first 90 seconds. Do not wait for the interviewer to surface it.
→ When the interviewer answers a question with a number, restate the number out loud and what it implies. 'Fifty thousand posts per second at peak: sustained, that's about 4.3 billion per day, and even the 10k steady rate is roughly 860 million, so the daily inference cost dominates the design.'
→ When a constraint surprises you, say 'OK, that changes the design. Let me walk back the latency budget' rather than continuing as if the constraint didn't reset something.
→ At minute 6 or 7, summarize aloud: 'So we're building two sub-systems. Sync gates publishing under a 170 ms inference budget; async catches high-severity content within 60 seconds and medium-severity within 5 minutes. Primary metric is recall at 1% FPR. Sound right?' If the interviewer nods, you've earned the room.
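The "restate the number and what it implies" move is plain arithmetic, and it is worth having the conversion memorized. The sketch below uses the 50k peak and 10k steady-state rates from the simulated interview. The $0.002 per-post cost ceiling is borrowed from the One-Number Objective example and is an assumption here, not something the interviewer stated.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def posts_per_day(posts_per_second: int) -> int:
    return posts_per_second * SECONDS_PER_DAY


def daily_cost_usd(posts_per_second: int, cost_per_post_usd: float) -> float:
    return posts_per_day(posts_per_second) * cost_per_post_usd


PEAK_PPS = 50_000
STEADY_PPS = 10_000
COST_CEILING_USD = 0.002  # assumed per-post cost ceiling

for label, pps in [("peak, if sustained all day", PEAK_PPS), ("steady-state", STEADY_PPS)]:
    volume = posts_per_day(pps)
    cost = daily_cost_usd(pps, COST_CEILING_USD)
    print(f"{label}: {volume:,} posts/day, up to ${cost:,.0f}/day")
```

Peak sustained for a full day is an upper bound of about 4.3 billion posts. The steady-state rate gives about 864 million, which at the assumed ceiling is still over a million dollars a day in inference.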

Drill · 7 minutes

Practice this. Time yourself.

You have 7 minutes. Apply CLARO to this prompt: 'Design a system that detects AI-generated text in student essays at university scale.' Write your output as a 5-paragraph response, one per CLARO letter, ending with a one-sentence objective function containing exactly one primary metric and two numbered constraints. Time yourself. Do not draw any architecture during these 7 minutes. After the timer ends, compare your response against the rubric below.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Constraint discovery	Listed scale and tech-stack constraints only ('we'd use Python, maybe a queue').	Named 2–3 real constraints including at least one regulatory or organizational one.	Named the false-positive cost (an accusation of cheating) as a primary constraint that shapes the model choice.	Surfaced the non-obvious constraint that the system must produce auditable evidence per decision — not just a confidence score — because of due-process expectations in academic discipline. Connected that to a specific architectural commitment (evidence payload alongside the prediction).
Latency budget	Said 'this can be async' and moved on.	Said async, gave a rough budget like '<60 seconds.'	Named the bursty access pattern (assignment deadlines) as the reason the budget can be soft on the average and tight on the p99.	Allocated the budget tree even for an async system: queue wait time, detection inference, evidence-extraction pass, indexing for instructor view. Identified which step is the controllable lever.
Access pattern	Said 'students submit essays, instructors read results.'	Identified the deadline-driven burst pattern.	Connected the burst to an autoscaling decision and called out the cold-start cost of model serving under sudden load.	Noted that the read pattern is two-sided — instructors during grading, students during appeals — and that the appeals path has different latency and audit requirements than the instructor dashboard. Routed them differently.
Objective clarity	Wrote a multi-clause sentence with no measurable numbers ('detect AI text accurately and fairly').	Wrote 'maximize recall at low FPR' without committing to a number.	Stated 'maximize recall at <0.5% institution-wide FPR' or similar, with one concrete number.	Wrote the full shape: 'maximize X subject to Y and Z' with one primary metric and two numbered constraints. Asked or noted that the FPR threshold is a policy decision, not an engineering one, and would need confirmation from the academic integrity office before final commitment.

Model solution

C — Constraints. This is an academic-discipline-adjacent system: a false positive is an accusation of cheating with real consequences, so the dominant constraint is auditability of every decision, not raw accuracy. FERPA applies to any identifiable student data we cross-reference. The detector must produce per-decision evidence (which spans, which features) that an instructor can review without an ML background, because instructors will be reviewing it in appeals. The model itself cannot be a pure black-box classifier.

L — Latency budget. Async per essay. Soft SLA: instructors expect results within a few minutes of submission, definitely before they start grading the batch. I'll commit to <60 seconds p99 from submission to detection result indexed. The budget tree inside that: queue wait under 5 seconds at steady-state (worse at peak; see the access pattern), detector inference under 10 seconds for a typical 2000-word essay, evidence extraction under 30 seconds, write/index under 5 seconds. That allocates 50 of the 60 seconds and leaves 10 as headroom. The detector inference and evidence extraction are the controllable levers.

A — Access pattern. Highly bursty. Most assignments have deadlines, and submissions cluster heavily in the last hour. Average rate institution-wide might be 10 essays/minute; peak rate is 100–500x average, sustained for 30–60 minutes around major deadlines. Autoscaling has to absorb at least a 100x burst without violating the soft SLA. This is the access pattern that will dominate the cost model and the on-call story.

R — Read/write asymmetry. Write once at submission. Read on three paths: instructor dashboard during grading (low QPS, moderate latency tolerance), student appeals (very low QPS, high audit requirement), and aggregate institutional reporting (batch, no latency requirement). I'll commit to a single write-once result with the full evidence payload, indexed for instructor dashboard, with the appeals path going through the same record but adding an explicit access log. No separate read-side store: the asymmetry isn't extreme enough to justify the complexity.

O — Objective function. Maximize detection recall at <0.5% institution-wide false-positive rate, subject to <$0.05 per essay total cost (including evidence extraction) and <60 seconds p99 end-to-end at the 100x burst rate. The FPR threshold is the policy lever: I'd confirm 0.5% with the academic integrity office before treating it as fixed, because their willingness to defend a public number determines everything downstream.
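The burst numbers in the model solution translate into a worker count through Little's law (work in progress = arrival rate x time in service). The sketch below is a rough sizing under stated assumptions: each worker handles one essay at a time, and per-essay service time is taken as the full inference and evidence-extraction budgets (10 s + 30 s), which is the worst case the budget allows.

```python
import math


def workers_needed(arrivals_per_minute: float, service_seconds: float) -> float:
    """Little's law: concurrent work = arrival rate x time in service."""
    return (arrivals_per_minute / 60.0) * service_seconds


AVG_ESSAYS_PER_MIN = 10    # average rate from the model solution
BURST_MULTIPLIER = 100     # the 100x deadline burst
SERVICE_SECONDS = 10 + 30  # assumed: inference budget + evidence-extraction budget

steady = workers_needed(AVG_ESSAYS_PER_MIN, SERVICE_SECONDS)
peak = workers_needed(AVG_ESSAYS_PER_MIN * BURST_MULTIPLIER, SERVICE_SECONDS)

print(f"steady-state: about {math.ceil(steady)} busy workers")
print(f"100x burst:   about {math.ceil(peak)} busy workers")
```

Going from a handful of workers to several hundred within the last hour before a deadline is why the autoscaling and cold-start story dominates this design.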

Common failures

✗ Treated 'AI-generated text detection' as a model problem, not a system problem. Spent the 7 minutes on classifier choice, missed the auditability constraint entirely.
✗ Wrote the objective as 'high accuracy' without naming a primary metric or any numbered constraint. The objective has to fit the shape 'maximize X subject to Y and Z.'
✗ Ignored the burst pattern. A system that's fine at average rate and falls over at deadline-hour peak is the canonical fail mode here.
✗ Forgot the read path entirely. The appeals workflow is the part of the system that, when broken, makes the news.
✗ Allocated the latency budget as one number ('60 seconds') without decomposing it into queue / inference / evidence / index. Without the tree, you have no controllable lever when the SLA slips.

Artifact · cheatsheet

The CLARO One-Pager

Front — The 5 prompts (take into the interview)

C — Constraints: 'What can't change here? Regulatory, residency, existing SLA, current team capacity, deadline?'
L — Latency budget: 'What's the end-to-end SLA, and how is it allocated across network / auth / retrieval / model / response?'
A — Access patterns: 'Is this user-facing or background? Peak vs steady-state ratio? Bursty or uniform? Hot keys?'
R — Read/write asymmetry: 'Given the pattern, where am I spending design effort — write path, read path, or both?'
O — Objective: 'In one sentence, what is the system optimizing for? Primary metric plus two numbered constraints, in the shape: maximize X subject to Y and Z.'

Back — Five tactical rules

· End every step with a commitment, not a question. 'I'll plan for…' / 'My working assumption is…', then move on.
· Find the hidden fork before applying the framework. Pre-publish vs post-publish, sync vs async, hot vs cold reads: when a prompt has two sub-systems, run CLARO on each branch.
· When given a number, say what it implies out loud. '50k posts/sec at peak is up to 4.3 billion/day if sustained, which means daily inference cost dominates.'
· When surprised by a constraint, walk the budget back explicitly. Don't keep going on your original plan as if the constraint didn't reset it.
· At minute 6 or 7, summarize back: 'Two sub-systems, sync 170 ms inference, async 60s, recall at 1% FPR. Sound right?' Then draw.

Examples

Example 1 — Content moderation (sync + async)

C: pre-publish gate + post-publish scan; EU residency; high-severity categories only sync. L: 250ms publish SLA → 170ms inference budget on sync; 60s on async. A: 50k posts/sec peak; 30% have images. R: write-dominated; read path is auditor dashboards (low QPS, separate store). O: maximize harmful-content recall at 1% FPR subject to 170 ms sync inference and 60 s async end-to-end.

Example 2 — LLM-powered search (read-heavy)

C: corpus updates daily; English + 3 markets; results must cite sources. L: 1.5 s p99 user-facing → 200ms retrieval + 800ms generation + 500ms first-token streaming. A: 10k QPS steady, 50k peak; 80/20 query head, fat long-tail. R: read-dominated 1000:1; write path is corpus indexing, runs nightly. O: maximize answer faithfulness (LLM-as-judge or human eval) at p99 ≤ 1.5 s and cost ≤ $0.005/query.

Example 3 — Real-time fraud scoring (latency-tight)

C: inline to payment auth; PCI scope; <100ms p99 hard limit. L: 100ms total → 30ms feature lookup + 40ms inference + 20ms rules + 10ms wire. A: 3k TPS steady, 30k peak (Black Friday); skewed by merchant. R: write = decision log (async); read = on next transaction (online feature lookup). O: maximize fraud recall at 0.1% legitimate-decline rate subject to 100ms p99 and $0.0005/decision.
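The L lines in Examples 2 and 3 can be checked the same way as any budget tree: subtract the allocations from the SLA and see what is left. A short sketch using only the numbers from those two examples:

```python
from typing import Dict


def slack_ms(sla_ms: int, allocation: Dict[str, int]) -> int:
    """Milliseconds left over after every stage takes its allocation."""
    return sla_ms - sum(allocation.values())


EXAMPLES = {
    "LLM-powered search": (
        1500,
        {"retrieval": 200, "generation": 800, "first-token streaming": 500},
    ),
    "Real-time fraud scoring": (
        100,
        {"feature lookup": 30, "inference": 40, "rules": 20, "wire": 10},
    ),
}

for name, (sla, allocation) in EXAMPLES.items():
    print(f"{name}: {slack_ms(sla, allocation)} ms of slack")
```

Both allocations spend the SLA exactly, with no headroom. That is fine for a whiteboard opening, but it is worth saying out loud that any stage running over has to be paid for by another.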

Post-mortem · anonymized

Setup

L6 candidate, AI infrastructure org at a top AI lab, third interview of a six-round on-site. Prompt: 'Design a RAG system that lets engineers query our internal documentation.' Candidate had 4 years of RAG production experience and was highly technical. Interviewer was a Staff engineer on the platform team.

What happened

Candidate heard 'RAG' and immediately started drawing. By minute 3, the whiteboard had a document loader, a chunker, an embedding model, a vector database (Pinecone, with rationale), a retriever, a reranker, and an LLM. By minute 8, they were debating chunk size and overlap strategies. By minute 12, they were comparing HNSW vs IVF-PQ. The interviewer was nodding and asking pointed follow-ups about each choice. At minute 22, the interviewer asked, 'What's the relevance metric you're optimizing?' The candidate paused and said 'recall at K… or maybe MRR, depending.' The interviewer asked, 'What does the team currently measure?' The candidate didn't know. The interviewer asked, 'How often does the corpus update — daily, weekly, or every commit?' The candidate didn't know. They guessed weekly. The interviewer asked, 'And how much of the docs are out of date right now?' The candidate said 'probably some, hard to know.' The interviewer wrote a note.

The moment

Minute 22 was the visible failure, but the actual failure was at minute 2. When the candidate started drawing, they committed to a system that retrieves and generates — when in fact the team's biggest problem with internal docs is that 40% of the docs are stale and engineers don't trust the answers regardless of how good retrieval is. The right system isn't a better RAG pipeline; it's a corpus-staleness detector that flags low-confidence answers and routes them to a human curator. The candidate never had the chance to discover that, because they never asked.

What they should have said

Before any drawing: 'Before I sketch a RAG system, I want to understand whether this is a retrieval problem or a corpus problem. Two questions: how fresh is the corpus typically, and what's the current trust level — do engineers actually use the docs they have, or do they Slack each other instead? If the docs are mostly trustworthy and the issue is finding the right one, that's a retrieval problem and we're building RAG. If the docs are mostly stale and engineers have lost trust, no retriever fixes that — we'd be building a corpus-confidence system with a small retrieval layer attached.' That single 30-second exchange would have changed the entire architecture, and the interviewer's feedback later in the loop confirmed it: 'Strong implementation knowledge but jumped past the system definition. We don't need someone who can ship RAG; we need someone who can tell us when RAG is the wrong system.'

Lesson

CLARO doesn't exist to be thorough. It exists to prevent the failure mode where you build the right architecture for the wrong system. The candidate above wasn't weak at RAG — they were probably one of the strongest RAG engineers in the loop. They failed because they assumed the prompt was an architecture question when it was actually a 'name the right problem' question. Running CLARO at minute 2 would have forced them to ask the corpus-staleness question, which would have revealed that the team's actual problem was upstream of retrieval. The most expensive interview mistakes are not in the architecture you draw — they're in the architecture you started drawing too early.