Module 2 · Lesson 1 · Core · 48 min

Serving LLMs: The Latency Anatomy Framework

Every LLM latency conversation gets stuck on quantization and batching because candidates don't have a structural way to decompose where time is actually spent. The 5-Phase Latency Anatomy gives you a vocabulary for diagnosing any LLM serving system in under 60 seconds.

There is a script most candidates run when an interviewer asks about LLM latency. It goes: 'I'd look at batching, maybe quantize the model, possibly distill it to something smaller, and add more GPUs.' Every word of that script is technically true. None of it is wrong. And it is the answer that locks the candidate at Senior, every time. The reason is structural. The interviewer is not asking what techniques exist. They are asking whether the candidate has a model of where time is actually being spent — and they will know inside thirty seconds whether the answer is 'a list of fixes' or 'a diagnosis followed by a fix.'

LLM serving has five phases, not one. Each phase has a different bottleneck and a different fix. Quantization, the most over-suggested intervention in this entire interview category, helps one of the five phases meaningfully and is marginal, irrelevant, or harmful in the others. Continuous batching, the second most over-suggested, helps a different phase. Mixing them up is what makes candidates flail under pushback: they propose a fix, the interviewer asks why, and the candidate cannot articulate which phase the fix targets, because they were treating LLM latency as a single number. The framework in this lesson is the vocabulary that prevents that conversation from happening.

The phases are Queue, Prefill, First Token, Decode, and Detok-and-Stream. Time-to-first-token (TTFT) spans the first three: queue wait, prefill, and the first decode step. Inter-token latency is the per-token cost of the fourth. End-to-end latency is TTFT plus the remaining decode steps plus Detok-and-Stream; First Token is a derived checkpoint rather than extra work, so it is not added a second time. You will read those two metrics, TTFT and inter-token latency, separately, and the interviewer will respect you for it, because no Senior candidate does. By the time you've named the five phases and asked for those two numbers, you have already done the only thing that matters in the first ninety seconds: you have refused the single-number framing and forced a real diagnosis.

Framework

The 5-Phase Latency Anatomy

LLM latency is not a number. It is five sequential phases with five distinct bottlenecks and five distinct fixes. The job in the interview is to refuse the single-number framing and decompose by phase out loud, because every Senior-level answer collapses two phases together and proposes a fix that only addresses one of them. The phases are Queue, Prefill, First Token, Decode, and Detok-and-Stream — in that order, on every request, every time.

Phase 1 — Queue
The request waits for a batch slot or for the server to finish a previous batch step. Bottleneck: replica count, batch-scheduling policy, admission control. Symptom: TTFT is dominated by wait time and grows non-linearly with QPS. Fix: more replicas, smaller max-batch, smarter admission. Quantization does nothing here — the model never runs.
Phase 2 — Prefill
The prompt is processed in parallel through the transformer. Attention is O(n²) over prompt length. Bottleneck: GPU compute, prompt length, lack of prefix caching. Symptom: TTFT scales roughly with prompt-length squared at long context. Fix: chunked prefill, KV-cache reuse for shared prefixes (system prompts, few-shot examples), prompt budgets. Quantization helps a little; FlashAttention and sequence parallelism help more.
Phase 3 — First Token (TTFT)
The user-facing 'is it working?' moment. TTFT = Queue + Prefill + the first decode step. It is a derived metric, not a fixable one — you fix it by fixing the phase that dominates it. The point of carving it out as its own phase is that the interviewer will use TTFT and inter-token-latency as the two numbers that diagnose the system, and you need vocabulary that separates them.
Phase 4 — Decode
Autoregressive token generation. One forward pass per token. Compute per token is tiny; what dominates is reading the model weights and the KV cache from HBM on every step. Bottleneck: memory bandwidth, KV cache size, batch composition. Symptom: inter-token latency is high or unstable, especially under mixed-length traffic. Fix: continuous batching, GQA/MQA, KV-cache quantization, speculative decoding, PagedAttention. Quantization helps here because the bottleneck is memory.
Phase 5 — Detok and Stream
Tokens are converted to text, framed for SSE or HTTP/2, and pushed across the wire. Bottleneck: usually nothing — but it is non-zero, and at very high QPS or when streaming through buffering proxies, this phase can add 20–100 ms invisibly. Fix: profile it before assuming it's free. The reason it's its own phase is that 'we'll just stream' is not a fix, it's an architectural commitment with its own failure modes.
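Phase 1's "grows non-linearly with QPS" symptom can be made concrete with a textbook queueing approximation. The sketch below treats each replica as an M/M/1 queue (Poisson arrivals, exponential service times). That is an assumption, and only a rough stand-in for a batching inference server; the function names and the `service_rate` parameter are hypothetical.

```python
import math

def p99_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """p99 of time in system (wait + service) for an M/M/1 queue, in seconds.

    In M/M/1, time in system is exponentially distributed with rate
    (service_rate - arrival_rate), so the 99th percentile is
    ln(100) / (service_rate - arrival_rate).
    """
    if arrival_rate >= service_rate:
        return math.inf  # unstable: the queue grows without bound
    return math.log(100) / (service_rate - arrival_rate)

def replicas_needed(total_qps: float, service_rate: float, target_p99_s: float) -> int:
    """Smallest replica count that meets the p99 target, assuming traffic
    is split evenly across identical replicas."""
    floor_s = math.log(100) / service_rate  # p99 of an idle replica
    if target_p99_s <= floor_s:
        raise ValueError("target is below what a single idle replica can deliver")
    n = 1
    while p99_time_in_system(total_qps / n, service_rate) > target_p99_s:
        n += 1
    return n

# Hypothetical numbers: each replica serves 10 req/s, total load is 60 QPS.
for load in (5.0, 9.0, 9.9):
    print(load, round(p99_time_in_system(load, 10.0), 2))
print(replicas_needed(60.0, 10.0, target_p99_s=1.0))
```

The loop prints how p99 climbs as a single replica approaches saturation, which is the non-linear shape the Phase 1 symptom describes. Real schedulers batch requests, so use this for the shape of the curve rather than for capacity numbers.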

When to use

Run this any time the interview mentions LLM latency, p99, TTFT, or inter-token latency. The framework is also the right opening for 'how would you reduce cost' questions, because cost reduction and latency reduction touch the same five phases with different priorities. Skip it only for prompts that are explicitly about training, fine-tuning, or evaluation — those are different anatomies.

Worked example

Prompt: 'Our LLM API has p99 of 8 seconds.' Senior answer: 'Probably batching, try continuous batching and quantization.' Staff answer using the framework: 'Before I propose a fix, I need TTFT p99 and inter-token p99 separately. If TTFT is 7 seconds and inter-token is 50 ms, we have a queue or prefill problem and quantization is the wrong tool. If TTFT is 200 ms and inter-token is 250 ms, we have a decode problem and speculative decoding will move the needle. Without that decomposition I'd be guessing.' The single sentence — 'I need TTFT and inter-token separately' — is the entire Staff signal.
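The two-number decomposition in the worked example can be sketched as a small function. This is an illustration, not a production rule: the 1000 ms and 50 ms defaults are borrowed from the thresholds in this lesson's diagnostic tree, while the 50% queue-wait cutoff and all names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LatencySample:
    ttft_p99_ms: float
    inter_token_p99_ms: float
    queue_wait_p99_ms: Optional[float] = None  # often missing from dashboards

def diagnose(s: LatencySample,
             ttft_limit_ms: float = 1000.0,
             inter_token_limit_ms: float = 50.0) -> List[str]:
    """Map the two headline numbers onto the phases that need attention."""
    findings = []
    if s.ttft_p99_ms > ttft_limit_ms:
        if s.queue_wait_p99_ms is None:
            findings.append("TTFT high: queue- or prefill-bound; need queue-wait p99")
        elif s.queue_wait_p99_ms >= 0.5 * s.ttft_p99_ms:  # assumed cutoff
            findings.append("TTFT high: queue-bound (Phase 1)")
        else:
            findings.append("TTFT high: prefill-bound (Phase 2)")
    if s.inter_token_p99_ms > inter_token_limit_ms:
        findings.append("inter-token high: decode-bound (Phase 4)")
    return findings or ["both within limits: check Phase 5 and the SLA itself"]

# The two scenarios from the worked example.
print(diagnose(LatencySample(ttft_p99_ms=7000, inter_token_p99_ms=50)))
print(diagnose(LatencySample(ttft_p99_ms=200, inter_token_p99_ms=250)))
```

The same 8-second p99 produces two different findings depending on which number is large, which is the point of asking for both.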

Calibration ladder

Your LLM API has p99 of 8 seconds. Where do you look first?

The interviewer hands you one number — p99 of 8 seconds for a chat-style LLM API. The next sentence out of the candidate's mouth is the whole point of this question.

L4 · Mid

I'd add more GPUs. 8 seconds is way too slow — sounds like we're under-provisioned.

Missed: Treated latency as a capacity problem. Adding GPUs without diagnosing the phase usually just adds idle capacity at high TTFT and does nothing for high inter-token latency. Wrong tool for both failure modes.

L5 · Senior

I'd profile the request path first. Probably it's batching. I'd look at whether we're using continuous batching, and I'd check if the model is quantized — both should bring p99 down. If not, larger replicas.

Missed: Named the right two interventions but didn't say which one targets which phase. Will fail under pushback when the interviewer asks 'why quantization?' because the answer 'it makes things faster' isn't true in the prefill or queue phases.

L6 · Staff

I want to decompose. Is the 8 seconds TTFT or end-to-end? If TTFT is high, we're either queue-bound or prefill-bound — those have different fixes. If inter-token latency is high, we're decode-bound, and the fix is speculative decoding or better continuous batching, not quantization. Quantization helps memory, not latency directly. Show me TTFT p99 and inter-token p99 separately and I can give you a real answer.

Missed: Strong technical decomposition. The single thing missing is the meta-move — converting 'don't trust a single latency number' into a portable pattern that applies beyond this question (cost numbers, throughput numbers, error-rate numbers all have the same structural failure).

L7 · Principal

Before I propose anything, I want TTFT p99 and inter-token p99 separately. They diagnose totally different systems. If TTFT is 7 seconds and inter-token is 50 ms, this is a queue or prefill problem — likely under-provisioned replicas if QPS is moderate, or unbounded prompt length if the use case is RAG with stuffed context. Fix is more replicas with admission control, or chunked prefill plus shared-prefix KV caching. If TTFT is 200 ms and inter-token is 250 ms, this is decode-bound — fix is speculative decoding, continuous-batching policy tuning, or cascading to a smaller model on easy queries. Quantization is a memory fix; it only helps latency indirectly, by letting you raise batch size before hitting OOM. The pattern I'd take away from this question is: any time you're handed one latency number, refuse it. Ask for two. The two-number decomposition isolates which phase to fix and usually rules out the most-suggested fix on the first pass.

What scored L7

Refused the single-number framing and named the two-number decomposition as the diagnostic, not the answer. Then ran two parallel paths (TTFT-high vs inter-token-high) and showed how the same prompt produces totally different system fixes. Closed by abstracting the move — 'when handed one number, ask for two' — into a pattern the reader uses on cost ($/request is meaningless without $/input-token and $/output-token), on throughput (QPS hides batch size), and on error rate (overall doesn't tell you which class of request is failing). This is the same meta-pattern as the hidden fork from CLARO: don't trust the unit of analysis the prompt hands you. Re-decompose it before designing a fix.

Architecture

The 5 phases mapped to the actual hardware and software they touch. Notice that the bottleneck resource is different in every phase — and only one phase is bottlenecked by the thing candidates usually fix.

Client (browser, app, or upstream service)

“The user's experience is a function of TTFT and streaming smoothness — those are the only two latency numbers the user perceives directly.”

API gateway · auth, rate limit, admission control

“Phase 1 begins here. Admission control is the cheapest fix for queue starvation — drop or downgrade lower-priority traffic at the edge before it consumes scheduler time downstream.”

Inference scheduler · continuous batching, KV-cache mgmt

“Phase 1 lives here. The scheduler decides when a request enters the batch and when it gets evicted. Misconfigured max-batch and KV-cache reservation are responsible for more 'mysterious latency' than any model-side issue.”

GPU compute path · prefill (compute-bound, O(n²) attention)

“Phase 2. Bound by FLOPs and attention complexity. FlashAttention, chunked prefill, and prefix-cache reuse all live here. Quantization helps a little but is not the lever.”

KV cache · per-layer K and V tensors in HBM

“The hidden state of the system. KV cache size determines max batch size, which determines throughput, which determines economics. Most production tuning is implicitly KV-cache tuning.”

GPU memory path · decode (memory-bandwidth-bound)

“Phase 4. Bound by HBM bandwidth. Speculative decoding, KV-cache quantization, GQA/MQA, and PagedAttention all live here. This is where quantization actually pays off.”

Detokenizer + streaming framer (SSE / HTTP/2 / gRPC)

“Phase 5. Usually fast, but proxy buffering, Nagle's algorithm on raw TCP, or middleware that waits for newlines can silently add tens of milliseconds. Profile before assuming.”

User-perceived stream

client → gateway · request

gateway → scheduler · Phase 1 (Queue)

scheduler → gpu-prefill · Phase 2 (Prefill)

gpu-prefill → kv-cache · writes K, V

kv-cache → gpu-decode · reads K, V

gpu-decode → detokenizer · Phase 4 (Decode, one token at a time)

detokenizer → user · Phase 5 (Detok + stream)
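One way to act on the Phase 5 advice to profile before assuming is to record when each streamed chunk reaches the client and inspect the gaps. The helper below is a hypothetical sketch, not part of any serving framework: it takes client-side arrival timestamps and flags the bursty pattern that buffering tends to produce, where several chunks land almost together after a long pause. Both thresholds are assumptions to tune.

```python
from typing import List, Sequence

def stream_gaps_ms(arrival_times_s: Sequence[float]) -> List[float]:
    """Gaps between consecutive chunk arrivals, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(arrival_times_s, arrival_times_s[1:])]

def looks_buffered(arrival_times_s: Sequence[float],
                   burst_gap_ms: float = 1.0,
                   min_burst_fraction: float = 0.5) -> bool:
    """True when most chunks arrive back-to-back, which suggests something
    between the server and the client is holding tokens and flushing in bursts."""
    gaps = stream_gaps_ms(arrival_times_s)
    if not gaps:
        return False
    bursts = sum(1 for g in gaps if g < burst_gap_ms)
    return bursts / len(gaps) >= min_burst_fraction

# Made-up traces: a smooth ~6 ms cadence vs. flushes every 100 ms.
smooth = [0.000, 0.006, 0.012, 0.018, 0.024]
bursty = [0.1000, 0.1001, 0.1002, 0.2000, 0.2001, 0.2002]
print(looks_buffered(smooth), looks_buffered(bursty))
```

Comparing the gap pattern at the client with the inter-token latency measured at the server is what separates a Phase 4 problem from a Phase 5 one.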

Latency anatomy · budget 1500 ms

Realistic per-phase latency for a 70B model with GQA, tensor-parallel across 4×H100 80GB, ~1500-token prompts, ~150-token outputs, continuous batching at max-batch ~64, and 100 QPS aggregate. Numbers are typical engineering-blog ranges, not benchmarks — your mileage will vary by ±30%. The point of the chart is the shape, not the absolute values.

Phase 1 — Queue · 90 ms

Wait for a batch slot. At 100 QPS into a continuous-batching server with max-batch 64, queue is usually short. At 200 QPS it doubles and starts dominating TTFT before any model bottleneck kicks in.

Phase 2 — Prefill (1500 tokens) · 220 ms

Compute-bound. FlashAttention helps; chunked prefill smooths the spike when a long prompt arrives mid-batch. At 4k context prefill jumps to ~700 ms; at 8k it's ~2.5 s. This is why prompt-length budget matters.

Phase 3 — First token (derived) · 320 ms

TTFT ≈ Queue + Prefill + one decode step. This is the user-facing 'is it working?' number. The fix is whichever sub-phase dominates — at this load it's prefill.

Phase 4 — Decode (150 tokens × ~6 ms) · 900 ms

Memory-bandwidth-bound. Each token reads the KV cache from HBM. Inter-token latency is ~6 ms with GQA at this batch size; ~12 ms without GQA. Speculative decoding can effectively halve this when draft acceptance is high.

Phase 5 — Detok + stream · 30 ms

Mostly free. The 30 ms accounts for the tokenizer call and one SSE frame trip back to the client. If a buffering proxy sits in the path, this can quietly become 200 ms — profile it before assuming.
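The arithmetic behind this budget is worth writing out once, because First Token is a derived row: adding all five rows would count queue and prefill twice. A minimal sketch using the numbers above; the class and field names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class PhaseBudget:
    queue_ms: float
    prefill_ms: float
    inter_token_ms: float
    output_tokens: int
    stream_ms: float

    @property
    def ttft_ms(self) -> float:
        # TTFT = queue + prefill + one decode step (a derived number).
        return self.queue_ms + self.prefill_ms + self.inter_token_ms

    @property
    def decode_ms(self) -> float:
        return self.output_tokens * self.inter_token_ms

    @property
    def end_to_end_ms(self) -> float:
        # Sum the four work phases; TTFT is not added again.
        return self.queue_ms + self.prefill_ms + self.decode_ms + self.stream_ms

b = PhaseBudget(queue_ms=90, prefill_ms=220, inter_token_ms=6,
                output_tokens=150, stream_ms=30)
print(b.ttft_ms, b.decode_ms, b.end_to_end_ms, 1500 - b.end_to_end_ms)
```

With these inputs TTFT comes to 316 ms (the chart rounds to 320), decode to 900 ms, and end-to-end to 1240 ms, which leaves about 260 ms of headroom against the 1500 ms budget.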

Why the same fix has opposite effects in different phases

Most LLM serving advice is phase-blind. 'Use a smaller model.' 'Quantize.' 'Add more GPUs.' Each of these helps exactly one or two phases and is irrelevant or counterproductive in the others. A smaller model fixes prefill and decode but does nothing for queue starvation. More GPUs (more replicas) fix queue but cost money to fix something a config change could fix. Quantization helps decode meaningfully, prefill a little, and queue not at all — and at high batch sizes, aggressively quantized models actually increase tail latency because of dequant overhead on the activation path. The framework lets you say which fix targets which phase so the interviewer hears engineering, not enumeration.
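The phase-by-fix claims in this lesson can be written down as a small lookup table, which makes "which phase does this fix target?" a one-line query. The scores are a subjective encoding of the lesson's own statements (0 = no effect, 1 = helps a little, 2 = helps meaningfully), not measurements; the continuous-batching row follows the "Phase 1 and Phase 4 win" phrasing used elsewhere in the lesson.

```python
PHASES = ("queue", "prefill", "decode")

# Subjective encoding of the lesson's claims, not benchmark data.
EFFECT = {
    "quantization":         {"queue": 0, "prefill": 1, "decode": 2},
    "smaller model":        {"queue": 0, "prefill": 2, "decode": 2},
    "more replicas":        {"queue": 2, "prefill": 0, "decode": 0},
    "continuous batching":  {"queue": 2, "prefill": 0, "decode": 2},
    "speculative decoding": {"queue": 0, "prefill": 0, "decode": 2},
    "prefix caching":       {"queue": 0, "prefill": 2, "decode": 0},
}

def fixes_for(phase: str, min_effect: int = 2) -> list:
    """Fixes whose encoded effect on `phase` is at least `min_effect`."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return sorted(name for name, row in EFFECT.items() if row[phase] >= min_effect)

for phase in PHASES:
    print(phase, "->", fixes_for(phase))
```

Querying by phase rather than by fix is the habit the framework is trying to build: start from the bottleneck, then look up what moves it.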

Dimension	Static batching	Continuous (iteration-level) batching	Chunked prefill + continuous batching
TTFT behavior	Bad at low QPS (waits for batch fill), unstable at high QPS (long requests block batch).	Good. New requests enter the batch on the next iteration, not on batch drain.	Best. Long prompts are sliced into chunks of K tokens that interleave with decode steps — TTFT for the long prompt is bounded, and other requests don't stall.
Throughput at saturation	OK but plateaus early — slot underutilization is real.	High. Saturates GPU memory bandwidth.	Equal to or slightly below continuous batching; tuning K trades TTFT vs throughput.
Mixed-length fairness	Bad. A 4000-token output in the batch starves every other slot for the full decode.	Good. Short outputs leave the batch quickly, freeing slots for new arrivals.	Best. Prefill no longer blocks decode for other in-flight requests.
Implementation complexity	Trivial. Two settings: batch size, timeout.	Moderate. Requires KV-cache management; PagedAttention or equivalent is essentially required at scale.	Higher. Needs scheduler awareness of chunk boundaries and a chunk-size policy.
Long-prompt failure mode	Long prompt forces the whole batch to wait through prefill. p99 explodes.	Long prompt arrival still spikes TTFT for that request; doesn't fully solve it.	Solved by construction. The whole reason to deploy this.
Choose when	Only for batch-mode offline workloads with no SLA (summarize this corpus overnight). Never for serving.	Default for all online serving. If prompt lengths are uniformly short (chat with small context), this is enough.	Whenever prompt length distribution has a long tail — RAG, long-document Q&A, agentic flows with growing context. The TTFT improvement on the long tail is the entire point.

Verdict

Continuous batching is the modern default for online serving. Chunked prefill is the upgrade that prevents the long-prompt p99 disaster. Static batching has one legitimate use — offline batch jobs — and naming it in an interview is a signal you know the option exists, not a recommendation.
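A toy simulation shows why static batching wastes slots. It is a deliberately simplified sketch: every decode step costs the same regardless of batch occupancy, prefill and KV-cache limits are ignored, and all requests are already waiting at time zero. The function names are hypothetical.

```python
import heapq
from typing import List

def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total

def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """A slot is handed to the next waiting request as soon as it frees up."""
    slots = [0] * batch_size  # decode step at which each slot becomes free
    heapq.heapify(slots)
    for n in output_lens:
        free_at = heapq.heappop(slots)       # earliest available slot
        heapq.heappush(slots, free_at + n)   # occupied for n decode steps
    return max(slots)

def utilization(output_lens: List[int], batch_size: int, steps: int) -> float:
    """Fraction of slot-steps that produced a token."""
    return sum(output_lens) / (steps * batch_size)

# One 400-token output mixed in with 39 short 20-token outputs, 4 slots.
lens = [400] + [20] * 39
s = static_batching_steps(lens, 4)
c = continuous_batching_steps(lens, 4)
print(s, round(utilization(lens, 4, s), 2))
print(c, round(utilization(lens, 4, c), 2))
```

In this made-up trace static batching needs 580 decode steps and continuous batching needs 400. The per-step math is identical in both cases; the difference is entirely idle slots, which is the mechanism the table describes.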

Chart

Throughput grows roughly linearly with batch size until KV-cache or memory-bandwidth saturation, but TTFT grows non-linearly because long-prompt prefill blocks more in-flight requests as the batch fills. The right operating point is just before the TTFT knee — typically around max-batch 24–48 for 70B-class models on 4×H100, depending on prompt length distribution. The interview move is to name this curve out loud: 'I'd find the batch size where TTFT p99 just starts climbing and operate one notch below that, then add replicas instead of more batch.'
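The "operate one notch below the knee" rule can be sketched as a scan over a batch-size sweep. The measurements below are invented for illustration, and the 10% step-to-step tolerance is an assumption about what counts as "starts climbing".

```python
from typing import List, Tuple

def pick_max_batch(measurements: List[Tuple[int, float]],
                   climb_tolerance: float = 0.10) -> int:
    """Return the largest batch size measured before TTFT p99 starts climbing.

    measurements: (max_batch, ttft_p99_ms) pairs from a load sweep, any order.
    A "climb" is a relative jump over the previous point above climb_tolerance.
    """
    points = sorted(measurements)
    if not points:
        raise ValueError("need at least one measurement")
    for (prev_batch, prev_ttft), (_, ttft) in zip(points, points[1:]):
        if (ttft - prev_ttft) / prev_ttft > climb_tolerance:
            return prev_batch  # one notch below the knee
    return points[-1][0]  # no knee observed in the sweep

# Hypothetical sweep: TTFT is flat, then jumps between batch 32 and 48.
sweep = [(8, 300.0), (16, 310.0), (24, 325.0), (32, 345.0), (48, 520.0), (64, 900.0)]
print(pick_max_batch(sweep))
```

On this sweep the function returns 32. Past that point the interview answer is the same as the text's: add replicas instead of more batch.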

Mental model

The KV cache as a working set

HBM (GPU memory) — bounded resource
+-----------------------------------------------+
|  Model weights (140 GB for 70B in bf16)       |
+-----------------------------------------------+
|  KV cache — the "working set"                 |
|                                               |
|  [req A · 1500 tok · 600 MB]                  |
|  [req B · 300 tok · 120 MB]                   |
|  [req C · 4000 tok · 1.6 GB]   ← hot key      |
|  [req D · 200 tok · 80 MB]                    |
|  ... up to max_batch_size                     |
+-----------------------------------------------+
|  Free                                         |
+-----------------------------------------------+

When KV cache fills → no new requests admitted → queue grows
When a request's KV is evicted → that request is killed/restarted
KV size per request ≈ 2 × num_layers × hidden_dim × seq_len × dtype_bytes
                       (the leading 2 covers K and V; with GQA, use num_kv_heads × head_dim in place of hidden_dim)

Think of the KV cache as your inference server's working set, in the operating-systems sense. Like a page cache, it has bounded size; like a working set, the things you 'reuse' (shared system-prompt prefixes, recent decode steps) are cheap and the things you don't (one-shot long prompts) are expensive. PagedAttention (vLLM) and similar systems explicitly treat the KV cache like virtual memory pages — fragmentable, swappable, and reservable. Most production tuning of LLM servers is implicitly KV-cache tuning: max batch size is a KV-cache budget; GQA is a KV-cache compression scheme; KV quantization is a working-set shrink; chunked prefill is a way to admit a new request without doubling the working set in one step.
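The sizing formula in the diagram is easy to turn into a calculator. The sketch below assumes a Llama-2-70B-style shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) in bf16; that shape, the 10% reserve for activations and fragmentation, and the function names are all assumptions for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Leading 2 = one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_resident_tokens(hbm_bytes: float, weight_bytes: float,
                        bytes_per_token: int, reserve_fraction: float = 0.10) -> int:
    """Tokens of KV cache that fit after weights and a safety reserve."""
    usable = hbm_bytes * (1.0 - reserve_fraction) - weight_bytes
    return max(0, int(usable // bytes_per_token))

GQA = dict(num_layers=80, num_kv_heads=8, head_dim=128)    # grouped-query attention
MHA = dict(num_layers=80, num_kv_heads=64, head_dim=128)   # one KV head per query head

gqa_tok = kv_bytes_per_token(**GQA)   # 327,680 bytes per token
mha_tok = kv_bytes_per_token(**MHA)   # 2,621,440 bytes per token (8x larger)

print(1500 * gqa_tok / 1e9)           # KV cache for a 1500-token request, in GB
print(max_resident_tokens(4 * 80e9, 140e9, gqa_tok))   # 4x80 GB HBM, 140 GB of weights
print(max_resident_tokens(4 * 80e9, 140e9, mha_tok))
```

Under this assumed shape a 1500-token request holds about 0.49 GB of KV cache, the same order as the diagram's 600 MB (the diagram's sizes work out to roughly 0.4 MB per token). The last two lines show why GQA reads as a KV-cache compression scheme: the same HBM holds about eight times as many resident tokens.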

Use it when: Use this any time the interviewer mentions 'OOM,' 'batch size,' 'memory pressure,' 'long context,' or 'how do you handle 100k context.' Reframing the conversation around the KV-cache working set shows the interviewer you understand why memory is the binding constraint, not compute, in modern LLM serving.

Real-world reference · Anyscale

Continuous batching for LLM inference

Anyscale published a comparison showing continuous batching delivering roughly an order of magnitude higher throughput than static batching for the same model and hardware, with most of the gain coming from no longer leaving GPU slots idle while long-output requests drain the batch. The published numbers are workload-specific: the post's benchmark ran a 13B-parameter model (Meta's OPT-13B) on a single A100 with synthetic variable-length traces, so treat the headline multiple as a property of that regime rather than a universal constant.

Takeaway: The headline number is real but the load-bearing fact is the mechanism: continuous batching wins because the alternative wastes slots, not because the math gets faster. When an interviewer asks 'how much would continuous batching help?', the honest answer is 'a lot at saturation, very little at low QPS' — because at low QPS, static batching's slot waste doesn't matter. This is the right altitude for a Staff answer: name the regime, then quote the gain.

Anyscale Engineering — 'How continuous batching enables 23× throughput'

Pattern recognition

When you see

Workload is decode-dominated — long outputs, repetitive style, or any 'reasoning' / chain-of-thought traffic.

→

Think

Reach for speculative decoding before quantization. Speculative decoding tackles the memory-bandwidth bottleneck of decode head-on by amortizing one expensive forward pass over multiple accepted tokens; quantization only helps the memory bottleneck indirectly through larger batches.

Most candidates name quantization first because it sounds aggressive. But in decode-bound serving, speculative decoding (Medusa, EAGLE, draft-and-verify with a small draft model) typically reduces inter-token latency by 1.5–3× when accept rate is high, with zero quality loss — because the large model still verifies every accepted token. Quantization gets you maybe 1.3× decode improvement plus larger batch capacity, but with a quality risk that the interviewer will ask you to defend. Lead with speculative decoding when the workload supports it (predictable style, long outputs); quantization is a follow-up, not the headline.
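The "when accept rate is high" qualifier can be quantified with the usual back-of-envelope model for draft-and-verify decoding. The sketch assumes each drafted token is accepted independently with the same probability, and that one draft step costs a fixed fraction of a target-model step; both are simplifications, and the parameter names are hypothetical.

```python
def expected_tokens_per_verify(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    With independent per-token acceptance probability a and k drafted tokens,
    the expectation is (1 - a**(k + 1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def speculative_speedup(accept_rate: float, draft_len: int,
                        draft_cost_ratio: float) -> float:
    """Decode speedup vs. plain autoregressive decoding.

    draft_cost_ratio: cost of one draft-model step relative to one
    target-model step (assumed constant).
    """
    cost_per_round = draft_len * draft_cost_ratio + 1.0  # k draft steps + 1 verify
    return expected_tokens_per_verify(accept_rate, draft_len) / cost_per_round

for a in (0.5, 0.8, 0.9):
    print(a, round(speculative_speedup(a, draft_len=4, draft_cost_ratio=0.05), 2))
```

With four drafted tokens and a draft model at 5% of the target's cost, this model gives roughly 1.6× at a 50% accept rate and 2.8× at 80%, which is consistent with the 1.5–3× range quoted above. It also shows the failure mode: at low accept rates the draft overhead eats the gain.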

Real-world reference · Character.AI

Custom inference stack for high-concurrency chat

Character.AI published an engineering post describing how they reduced cost-per-token by an order of magnitude over generic stacks, primarily through aggressive KV cache sharing (a single user's conversation history becomes a shared prefix), int8 quantization on attention, and a custom multi-query attention variant that radically shrinks the KV working set. The combination lets them keep many more concurrent users per GPU without dropping into queue starvation.

Takeaway: Notice the order of operations: they did not start with quantization. They started by shrinking the KV cache working set (MQA, sharing prefixes across turns of the same conversation) and only then quantized the residual. When a candidate proposes quantization first in an interview about a chat-style workload, they're skipping the bigger lever. The Staff move is: name the dominant resource (memory, specifically KV cache), then propose fixes in order of impact — working-set reduction, then compression.

Character.AI — 'Optimizing AI Inference at Character.AI'

Scale calculator presets

Three serving regimes. Same model class, totally different optimal configurations. Use this as the answer when an interviewer asks 'how would the design change for use case X?' — and use it as the diagnostic when an interviewer hands you ambiguous SLA numbers.

Preset A — Chatbot (low concurrency, latency-sensitive)

Hold batch smaller than you think you can — TTFT is the user-perceived metric and grows fast above ~20. Spend on replicas, not on bigger batches. Decode optimization is moderate-priority because output length is short.

Tokens/sec per request · 25 tok/s

Aggregate decode tok/s · 400 tok/s

TTFT headroom over prefill · 380 ms

Decoder slot utilization (proxy) · 17%

Preset B — Batch summarization (high throughput, latency-tolerant)

Max out batch. TTFT is irrelevant. Quantization is finally the right answer because batch size is the binding constraint and memory is the resource. Spot/preemptible compute is acceptable; this workload is restart-tolerant.

Tokens/sec per request · 13 tok/s

Aggregate decode tok/s · 1.20k tok/s

TTFT headroom over prefill · 4.40k ms

Decoder slot utilization (proxy) · 83%

Preset C — Code completion (extreme TTFT sensitivity)

TTFT < 150ms is the entire product. Lead with prefix caching (the user's file context is reused across keystrokes), smaller batches, and consider a smaller model for the first-pass with a bigger model for accepted suggestions. Quantization helps decode but the binding constraint is prefill — chunked prefill and FlashAttention are higher-impact.

Tokens/sec per request · 40 tok/s

Aggregate decode tok/s · 320 tok/s

TTFT headroom over prefill · -225 ms

Decoder slot utilization (proxy) · 13%

Unspoken rubric

What the interviewer is scoring when they ask 'how would you reduce latency?' that is not in the rubric document.

What they score

·Did the candidate ask for the user-facing metric before proposing fixes? (TTFT is what users feel; e2e is what dashboards show. They differ.)
·Did they separate TTFT from inter-token latency in the first 60 seconds — without being prompted?
·Did they avoid the quantization trap — proposing quantization without naming the phase it targets and the regime where it helps?
·Did they ask about prompt-length distribution before recommending architectural changes?
·When the interviewer pushed back with 'why not just add more GPUs?', did they answer with the per-phase economic argument, or did they hedge?
·Did they distinguish 'fix for steady-state' from 'fix for spike'? Real serving has both; most candidates only address one.

Why it's not on the rubric

These aren't in the rubric because they look like 'soft skills' but they're actually structural knowledge — you only ask the right question if you have the right model. The rubric says 'demonstrates depth in LLM serving systems'; these bullets are what depth actually looks like in conversation. An interviewer who has watched 50 candidates can grade these in real time.

How to signal it

→Open with 'before I propose a fix, I want TTFT p99 and inter-token p99 separately' — the single highest-signal sentence in the entire question category.
→Name phases out loud when proposing fixes: 'continuous batching is a Phase 1 and Phase 4 win; it doesn't help Phase 2.'
→When suggesting quantization, qualify the regime: 'this helps decode meaningfully — Phase 4 — by letting us hold more requests in HBM at once; it does very little for prefill or queue.'
→Ask one prompt-distribution question explicitly: 'what's the p50 vs p99 input length, and is it a chat workload or RAG?' RAG changes the answer.
→When pushed on adding GPUs, respond with the economic split: 'replicas fix queue at $X/QPS; batching policy fixes queue at config-change cost. I'd exhaust the config change before signing the GPU PO.'
→Address steady-state and spike separately: 'for steady-state I'd tune batch and admission; for spike I'd commit to a smaller-model fallback path with explicit quality budget.'

Drill · 12 minutes

Practice this. Time yourself.

You're handed a profiling dump: a 13B model on 2×A100 80GB, p99 TTFT = 3.2 s, p99 inter-token latency = 80 ms, QPS = 60, avg prompt = 2200 input tokens, avg output = 180 tokens. Diagnose the dominant phase, name the secondary problem, and propose three interventions ranked by expected impact. State which phase each intervention targets. You have 12 minutes. Write the answer as a 4-paragraph response — one paragraph per heading: Diagnosis, Secondary, Interventions (ranked), Open questions.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Phase diagnosis	Said 'high latency, probably batching.' Did not separate TTFT from inter-token.	Identified TTFT as the primary target (3.2 s against 80 ms × 180 = 14.4 s of decode, so TTFT is only ~18% of e2e, but a 3.2 s wait for the first token means something is wrong upstream).	Diagnosed TTFT as the primary problem AND noted that 80 ms inter-token at 13B on A100 is mildly elevated (expected ~30–50 ms); flagged decode as a secondary concern.	Called out that the 14.4-second decode total is the actual end-to-end story, that the user is waiting 17+ seconds, and that the breakdown points to (a) prefill or queue dominating TTFT and (b) decode being slow enough to suggest KV-cache pressure or under-batched continuous batching. Named the dominant phase as prefill or queue with 'I'd want to see the queue-wait component to separate them.'
Secondary problem identification	Did not identify a secondary problem.	Noted inter-token latency is on the high side.	Connected the inter-token latency to a likely cause (KV-cache fragmentation, max-batch too high causing memory pressure, no GQA on a 13B that doesn't have it natively).	Reframed the 80 ms inter-token as 'expected for 13B with vanilla MHA, but you'd want to confirm whether GQA or MQA is in use and what the effective batch utilization is — because if continuous batching is starving on KV space, decode degrades.'
Intervention ranking	Listed unranked options. Recommended quantization without phase justification.	Ranked options by perceived impact but did not tag phases.	Each intervention tagged with the phase it targets, expected impact range, and a reason it's ranked where it is.	Ranked with explicit dependencies: 'Intervention 1 (admission control + chunked prefill) targets Phase 1 + 2 and should cut TTFT by 50–70%. If after that TTFT is still high, Intervention 2 (prefix caching for any shared system prompt) targets Phase 2 and is workload-dependent. Intervention 3 (speculative decoding or KV-cache compression) targets Phase 4 and should be deferred until Phase 1+2 are fixed because optimizing decode while TTFT is broken is solving the wrong problem.'
Open questions	No open questions. Treated the dump as complete information.	Asked one or two questions about the workload.	Asked specifically about queue-wait, prompt-length distribution, and current batching policy.	Asked the five questions that distinguish queue-bound from prefill-bound TTFT, named which sub-diagnosis each question collapses, and stated explicitly which assumption their ranked-intervention list depends on. 'If queue wait is 2 seconds, my intervention order changes.'

Model solution

Diagnosis. TTFT p99 of 3.2 s on 2200-token prompts is dominated by either queue wait or prefill compute. Inter-token p99 of 80 ms is on the high side for a 13B on A100 (expected 30–50 ms with reasonable batching), but not the primary failure. The end-to-end p99 is roughly 3.2 s + (180 × 80 ms) ≈ 17.6 s, so the user waits around 18 seconds and TTFT is only ~18% of that. TTFT is still the right primary target because it is the user-perceived "is it working?" number. Diagnosis: TTFT is the dominant problem and is most likely prefill-bound or queue-bound; I'd need queue-wait p99 to disambiguate.

Secondary. The 80 ms inter-token latency is the secondary signal. At 13B on A100, this suggests either (a) MHA without GQA, which multiplies the KV-cache read per token by the ratio of query heads to KV heads, (b) continuous batching starving on KV-cache space (max-batch too high, KV cache fragmented), or (c) a batch made up mostly of long-output requests that occupy slots for many decode steps. None of these is catastrophic individually; they compound.

Interventions, ranked. (1) Admission control plus chunked prefill: targets Phase 1 (queue) and Phase 2 (prefill). Admission control bounds queue-wait spikes; chunked prefill slices long prompts so a 2200-token arrival doesn't block other in-flight requests through prefill. Expected impact: TTFT p99 drops by 50–70%. (2) Prefix caching for shared system-prompt prefixes: targets Phase 2. Only helps if there is a meaningful shared prefix across requests (system prompt, few-shot template). At a 2200-token average prompt with a 200-token shared prefix, this saves roughly 10% of prefill work (200 / 2200 ≈ 9%) at no quality cost; not the headline, but cheap. (3) Speculative decoding with a small draft model: targets Phase 4. Cuts inter-token latency by 1.5–2× when draft-accept rate is high (likely for structured chat outputs). Deferred to third because optimizing decode while TTFT is 3.2 s is solving the wrong problem; do this after (1) and (2) land.

Open questions. (1) What is queue-wait p99 separately from TTFT? Determines whether (1a) admission control or (1b) chunked prefill is the larger lever. (2) What's the prompt-length distribution: is 2200 the mean of a tight distribution or the mean of a long tail? A long tail makes chunked prefill the primary lever. (3) What's the current max-batch, and what fraction of decode steps run at max-batch? Tells me if the inter-token degradation is from KV pressure. (4) Is there a shared system prompt across most requests? Determines whether prefix caching is worth the implementation cost. (5) What's the SLA: is 3.2 s TTFT actually breaching, or is the team trying to hit a stretch goal? Changes the urgency and the budget for interventions.
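The numbers quoted in the model solution follow from four lines of arithmetic on the profiling dump. A minimal sketch for checking them; note that adding a p99 TTFT to a p99-based decode time is only a rough estimate of end-to-end p99, and the 200-token shared prefix is the solution's illustrative figure rather than something in the dump.

```python
def drill_numbers(ttft_s: float = 3.2, inter_token_s: float = 0.080,
                  output_tokens: int = 180, prompt_tokens: int = 2200,
                  shared_prefix_tokens: int = 200) -> dict:
    """Back-of-envelope figures used in the drill's diagnosis."""
    decode_s = output_tokens * inter_token_s   # total time spent in Phase 4
    end_to_end_s = ttft_s + decode_s           # rough end-to-end estimate
    return {
        "decode_s": decode_s,
        "end_to_end_s": end_to_end_s,
        "ttft_share": ttft_s / end_to_end_s,   # fraction of the wait before the first token
        "prefix_fraction": shared_prefix_tokens / prompt_tokens,  # prefill work a prefix cache could skip
    }

for name, value in drill_numbers().items():
    print(f"{name}: {value:.3f}")
```

Decode comes to about 14.4 s, end-to-end to about 17.6 s, TTFT to about 18% of the total, and the shared prefix to about 9% of the prompt, which matches the figures the solution cites.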

Common failures

✗Proposed quantization as Intervention #1 without naming a phase. Quantization is the wrong primary lever for a TTFT-dominated problem.
✗Treated the 80 ms inter-token as fine because 'it's not the worst number,' missing that 80 ms on 13B-on-A100 is already elevated and points to a compounding problem.
✗Wrote a single 17-second end-to-end number and proposed fixes to that — instead of separating TTFT and decode-time and addressing them with different interventions.
✗Ranked interventions by familiarity (speculative decoding sounds advanced) instead of by phase + expected impact. Speculative decoding is the right intervention for the secondary problem, not the primary.
✗Asked no open questions. The dump is incomplete on purpose; designing without naming the missing data is the canonical Senior failure.

Artifact · decision tree

The LLM Serving Diagnostic Tree

Get TTFT p99 and inter-token p99 separately. Which is the dominant problem?

→TTFT is high (>1s)

Is queue-wait p99 a large fraction of TTFT (i.e., requests are sitting before any compute)?

→Yes — queue-bound

Phase 1 fix. Pick the cheapest lever that fits the regime.

→Bursty traffic: Admission control + priority queues. Drop or downgrade low-priority traffic at the edge before it queues. Quantization does not help here.

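→Steady high QPS: More replicas. Size the replica count with basic queueing: treating each replica as an M/M/1 queue with arrival rate λ and service rate μ, mean time in system is 1 / (μ − λ) and queue-wait p99 is ln(100ρ) / (μ − λ), where ρ = λ / μ. Add headroom for spikes.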

→Long-output requests blocking batch: Switch to continuous batching if not already, then tune max-batch downward. Static or near-static batching is the actual cause.

→No — prefill-bound

Phase 2 fix. What does the prompt-length distribution look like?

→Long tail with frequent >4k arrivals: Chunked prefill. Bounds the worst-case TTFT spike caused by long-prompt arrivals. This is the highest-leverage fix in this branch.

→Heavy shared system prompt or few-shot examples: Prefix caching (cached KV for shared prefix). Free TTFT savings when the prefix is large. vLLM, TensorRT-LLM, and similar support this natively.

→Uniform short prompts but still slow: FlashAttention if not enabled. Verify TP/PP placement; cross-node TP is silent prefill death. Check FP8/bf16 dtype consistency.

→Inter-token is high (>50ms on 13B / >15ms on 70B)

Phase 4 fix. What's the workload shape?

→Long outputs with predictable style (chat, reasoning): Speculative decoding (draft-and-verify, Medusa, EAGLE). 1.5–3× decode speedup when the draft acceptance rate is high, with no quality loss as long as the target model verifies every token exactly. Lead with this.

→Mixed-length traffic, KV-cache likely under pressure: Continuous batching policy tuning + PagedAttention. Investigate KV-cache fragmentation; consider GQA/MQA model variants if not already.

→Memory-pressured at current batch size: KV-cache quantization (INT8). Then weight quantization. This is the regime where quantization is the right call.

→Latency-tight, can accept quality trade-off: Cascade routing: small model for easy queries, big model for hard. Often beats decode optimization for chat-style workloads.

→Both

Both TTFT and inter-token are bad. The system is likely OOM-thrashing or grossly under-provisioned.

→Check OOM-recovery logs first: If the server is silently OOM-killing and restarting requests, every metric is corrupt. Fix this before doing anything else.

→Max-batch too high for KV cache available: Reduce max-batch until KV cache fits cleanly. Throughput drops slightly but tail latency stabilizes. Then re-diagnose.

→Wrong tensor-parallel placement: Cross-node TP can degrade everything. Verify TP groups are within a single node with NVLink/NVSwitch.
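For the "steady high QPS" branch of the tree, the replica count can be estimated with a few lines of queueing math. This is a minimal sketch under strong assumptions: each replica is modeled as an independent M/M/1 queue (Poisson arrivals, exponential service times, traffic split evenly), which a continuous-batching server only loosely resembles. Under that model the probability that queue wait exceeds t is ρ·e^(−(μ−λ)t), so queue-wait p99 is ln(100ρ) / (μ − λ). The function names, the 20% headroom default, and the traffic numbers are all hypothetical.

```python
import math


def queue_wait_p99(arrival_rate: float, service_rate: float) -> float:
    """p99 queue wait in seconds for a single M/M/1 server.

    P(wait > t) = rho * exp(-(mu - lam) * t), so the 99th percentile is
    ln(100 * rho) / (mu - lam), and zero when rho <= 0.01.
    """
    if arrival_rate >= service_rate:
        return math.inf  # the queue grows without bound
    rho = arrival_rate / service_rate
    if rho <= 0.01:
        return 0.0  # fewer than 1% of requests wait at all
    return math.log(100 * rho) / (service_rate - arrival_rate)


def replicas_needed(
    total_qps: float,
    per_replica_rate: float,
    target_p99_s: float,
    headroom: float = 1.2,
    max_replicas: int = 10_000,
) -> int:
    """Smallest replica count meeting the p99 target, scaled by headroom."""
    for n in range(1, max_replicas + 1):
        # Assumption: the load balancer splits traffic evenly across replicas.
        if queue_wait_p99(total_qps / n, per_replica_rate) <= target_p99_s:
            return math.ceil(n * headroom)
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    # Hypothetical workload: 40 req/s total, each replica serves 5 req/s.
    n = replicas_needed(total_qps=40, per_replica_rate=5, target_p99_s=2.0)
    print(f"replicas (with 20% headroom): {n}")
```

Treat the result as a starting point for a load test rather than a capacity plan; real serving stacks batch requests, so measured per-replica throughput should replace the assumed service rate.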

Post-mortem · anonymized

Setup

Series C company building an LLM-powered developer tool. ~30-person engineering org. Self-hosted serving stack on H100s. The serving team — three senior engineers, one staff — spent roughly three months optimizing quantization to fix a 'latency problem.' They tried INT8, GPTQ, AWQ, and finally a custom mixed-precision scheme. Quality on internal evals dropped slightly each time. p99 latency improved by maybe 8% across the entire effort.

What happened

The team was paged repeatedly on p99 latency violations. The default reaction from anyone with ML serving experience is 'the model is too slow, optimize the model.' They never separated TTFT from inter-token latency in their dashboards. Their default Grafana panel showed end-to-end p99 only. They had a continuous-batching server, but their max-batch was set to 96 — a number their original benchmarking had picked when their prompt distribution was much shorter. As their product matured and average prompt length grew from ~800 tokens to ~2400 tokens, the max-batch-96 setting caused continuous batching to fragment KV cache aggressively, and the scheduler ended up holding inbound requests in a queue for 4–6 seconds at p99 while waiting for batch slots to free. Their TTFT was queue-bound. The model was running fine.
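A back-of-the-envelope calculation makes the KV-cache pressure concrete. The post-mortem does not state the model's architecture, so the shape below (40 layers, 40 KV heads, head dimension 128, fp16 cache, roughly a Llama-style 13B with full multi-head attention) is purely an assumption for illustration. Only the max-batch of 96 and the growth from ~800 to ~2400 prompt tokens come from the story, and output tokens are ignored for simplicity.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(batch: int, tokens_per_seq: int, bytes_per_token: int) -> float:
    """KV cache needed to hold `batch` sequences of `tokens_per_seq` tokens."""
    return batch * tokens_per_seq * bytes_per_token / 2**30


# Hypothetical model shape (NOT from the post-mortem).
per_token = kv_bytes_per_token(layers=40, kv_heads=40, head_dim=128)

MAX_BATCH = 96  # from the post-mortem config
for prompt_len in (800, 2400):  # before and after the prompt-length growth
    need = kv_cache_gib(MAX_BATCH, prompt_len, per_token)
    print(f"prompt={prompt_len:>4} tokens -> {need:6.1f} GiB of KV cache "
          f"to fill all {MAX_BATCH} slots")
```

Whatever the real shape was, KV demand at a fixed max-batch scales linearly with sequence length, so prompts three times longer need three times the cache. When that no longer fits, the scheduler's only option is to make new requests wait, which is consistent with the queue-bound TTFT the team eventually measured.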

The moment

Month three of the quantization effort, the staff engineer's manager — who happened to have come from a high-frequency trading background — asked, 'how long are requests waiting before they even hit the GPU?' Nobody knew. They added the metric in an afternoon. p99 queue-wait was 4.8 seconds. The quantization effort had been optimizing a part of the request path that contributed less than 200 ms to the problem, while the actual 4.8 seconds was sitting in a config file under 'max_batch_size: 96.'

What they should have said

At the first paging incident, three months earlier: 'Before I touch the model, I need TTFT p99 broken down into queue-wait p99 and prefill p99, and I need inter-token p99 separately. If queue dominates, the fix is a config change or admission control. If prefill dominates, it's chunked prefill or prefix caching. If decode dominates, it's continuous-batching tuning or speculative decoding. Quantization is a Phase 4 memory-pressure fix; it would be the wrong lever for any of the first three.' That sentence, spoken by anyone on the team, would have saved roughly three months of team effort and a small quality regression in eval scores. They had the technical capacity to do that diagnosis from day one. What they were missing was the framework that names the diagnostic step as load-bearing.
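The breakdown requested in that quote needs only four timestamps per request. The sketch below shows one way to compute it; the `RequestTrace` fields and the `phase_breakdown` helper are hypothetical names, not the API of any particular serving stack, and the time between scheduling and the first token is used as an approximation of prefill.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Per-request timestamps in seconds (hypothetical schema)."""
    arrival_s: float      # request received by the server
    scheduled_s: float    # request admitted to a batch, prefill begins
    first_token_s: float  # first output token emitted
    done_s: float         # last output token emitted
    output_tokens: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def phase_breakdown(traces: list[RequestTrace], pct: float = 99.0) -> dict[str, float]:
    """Report queue wait, prefill, TTFT, and inter-token latency separately."""
    queue = [t.scheduled_s - t.arrival_s for t in traces]
    prefill = [t.first_token_s - t.scheduled_s for t in traces]
    ttft = [t.first_token_s - t.arrival_s for t in traces]
    # Average gap between consecutive tokens within each request.
    inter = [
        (t.done_s - t.first_token_s) / (t.output_tokens - 1)
        for t in traces
        if t.output_tokens > 1
    ]
    return {
        "queue_wait": percentile(queue, pct),
        "prefill": percentile(prefill, pct),
        "ttft": percentile(ttft, pct),
        "inter_token": percentile(inter, pct) if inter else float("nan"),
    }


if __name__ == "__main__":
    traces = [
        RequestTrace(0.0, 4.0, 4.5, 8.5, 101),  # waited 4 s before any compute
        RequestTrace(0.0, 0.1, 0.4, 2.4, 51),
    ]
    for phase, value in phase_breakdown(traces).items():
        print(f"{phase:>11}: {value:.3f} s")
```

Note that the per-phase percentiles are computed independently, so they do not add up to the TTFT percentile; the point is to see which phase dominates, not to reconcile the totals.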

Lesson

The framework is not a fancy way to know what you already know. It is the thing that forces you to slow down for sixty seconds and ask the diagnostic question before reaching for the familiar tool. In the interview, the same dynamic plays out in compressed form: the candidate who proposes quantization in the first ninety seconds is the candidate who, in production, would have spent three months on it. The interviewer is not looking for someone who knows about quantization. They are looking for someone who would not have started there. The framework is the difference.