Serving LLMs: The Latency Anatomy Framework
Every LLM latency conversation gets stuck on quantization and batching because candidates don't have a structural way to decompose where time is actually spent. The 5-Phase Latency Anatomy gives you a vocabulary for diagnosing any LLM serving system in under 60 seconds.
There is a script most candidates run when an interviewer asks about LLM latency. It goes: 'I'd look at batching, maybe quantize the model, possibly distill it to something smaller, and add more GPUs.' Every word of that script is technically true, and it is the answer that locks the candidate at Senior, every time. The reason is structural. The interviewer is not asking what techniques exist. They are asking whether the candidate has a model of where time is actually being spent, and they will know inside thirty seconds whether the answer is 'a list of fixes' or 'a diagnosis followed by a fix.'
LLM serving has five phases, not one. Each phase has a different bottleneck and a different fix. Quantization, the most over-suggested intervention in this entire interview category, helps one of the five phases meaningfully (decode), helps another only a little (prefill), and is irrelevant or harmful to the rest. Continuous batching, the second most over-suggested, helps a different set of phases. Mixing them up is what makes candidates flail under pushback: they propose a fix, the interviewer asks why, and the candidate cannot articulate which phase the fix targets, because they were treating LLM latency as a single number. The framework in this lesson is the vocabulary that prevents that conversation from happening.
The phases are Queue, Prefill, First Token, Decode, and Detok-and-Stream. Time-to-first-token (TTFT) spans the first three: queue wait, prefill, and the first decode step. Inter-token latency is the per-token cost of the fourth. End-to-end latency is TTFT plus the remaining decode steps plus detok-and-stream overhead. You will read those two metrics, TTFT and inter-token latency, separately, and the interviewer will respect you for it, because no Senior candidate does. By the time you've named the five phases and asked for those two numbers, you have already done the only thing that matters in the first ninety seconds: you have refused the single-number framing and forced a real diagnosis.
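As a minimal sketch of how the two metrics separate, the snippet below derives TTFT, mean inter-token latency, and end-to-end latency from per-token timestamps. The `RequestTrace` type and its field names are hypothetical, chosen only for illustration; a real server would expose these timestamps through its own metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical field names)."""
    arrival: float            # when the server received the request
    token_times: List[float]  # wall-clock time each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: queue wait + prefill + first decode step."""
    return trace.token_times[0] - trace.arrival


def mean_inter_token_latency(trace: RequestTrace) -> float:
    """Average gap between consecutive output tokens (the decode phase)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def end_to_end(trace: RequestTrace) -> float:
    """Arrival until the last token is emitted."""
    return trace.token_times[-1] - trace.arrival


# Illustrative trace: first token after 0.5 s, then one token every 50 ms.
trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.55, 0.6, 0.65])
print(ttft(trace), mean_inter_token_latency(trace), end_to_end(trace))
```

Note that the same end-to-end number can come from a large TTFT with fast decode or a small TTFT with slow decode, which is why the two are read separately.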
The 5-Phase Latency Anatomy
LLM latency is not a number. It is five sequential phases with five distinct bottlenecks and five distinct fixes. The job in the interview is to refuse the single-number framing and decompose by phase out loud, because every Senior-level answer collapses two phases together and proposes a fix that only addresses one of them. The phases are Queue, Prefill, First Token, Decode, and Detok-and-Stream — in that order, on every request, every time.
- Phase 1 (Queue): The request waits for a batch slot or for the server to finish a previous batch step. Bottleneck: replica count, batch-scheduling policy, admission control. Symptom: TTFT is dominated by wait time and grows non-linearly with QPS. Fix: more replicas, smaller max-batch, smarter admission. Quantization does nothing here: the model never runs.
- Phase 2 (Prefill): The prompt is processed in parallel through the transformer. Attention is O(n²) over prompt length. Bottleneck: GPU compute, prompt length, lack of prefix caching. Symptom: TTFT scales roughly with prompt-length squared at long context. Fix: chunked prefill, KV-cache reuse for shared prefixes (system prompts, few-shot examples), prompt budgets. Quantization helps a little; FlashAttention and sequence parallelism help more.
- Phase 3 (First Token, TTFT): The user-facing 'is it working?' moment. TTFT = Queue + Prefill + the first decode step. It is a derived metric, not a fixable one; you fix it by fixing the phase that dominates it. The point of carving it out as its own phase is that the interviewer will use TTFT and inter-token latency as the two numbers that diagnose the system, and you need vocabulary that separates them.
- Phase 4 (Decode): Autoregressive token generation. One forward pass per token. Compute per token is tiny; what dominates is reading the KV cache from HBM. Bottleneck: memory bandwidth, KV cache size, batch composition. Symptom: inter-token latency is high or unstable, especially under mixed-length traffic. Fix: continuous batching, GQA/MQA, KV-cache quantization, speculative decoding, PagedAttention. Quantization helps here because the bottleneck is memory.
- Phase 5 (Detok and Stream): Tokens are converted to text, framed for SSE or HTTP/2, and pushed across the wire. Bottleneck: usually nothing, but it is non-zero, and at very high QPS or when streaming through buffering proxies, this phase can add 20–100 ms invisibly. Fix: profile it before assuming it's free. The reason it's its own phase is that 'we'll just stream' is not a fix, it's an architectural commitment with its own failure modes.
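Phase 1's symptom, wait time that grows non-linearly with QPS, can be illustrated with the textbook M/M/1 queue. This is only a sketch under a strong assumption: a real LLM server with batching is not an M/M/1 system, and the 100 QPS capacity below is a made-up figure. The shape of the curve is the point.

```python
def mm1_mean_queue_wait(arrival_qps: float, service_qps: float) -> float:
    """Mean time (seconds) a request waits in queue in an M/M/1 system.

    W_q = rho / (mu - lambda), where rho = lambda / mu is utilization.
    """
    if arrival_qps >= service_qps:
        return float("inf")  # arrivals outpace service: the queue never drains
    rho = arrival_qps / service_qps
    return rho / (service_qps - arrival_qps)


# Hypothetical replica that can serve 100 requests per second.
for qps in (50, 80, 90, 95, 99):
    wait_ms = mm1_mean_queue_wait(qps, 100.0) * 1000
    print(f"{qps} QPS -> mean queue wait {wait_ms:.0f} ms")
```

Going from 50% to 90% utilization multiplies the mean wait by roughly nine in this model, which is why a modest QPS increase near saturation can dominate TTFT without the model getting any slower.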
Run this any time the interview mentions LLM latency, p99, TTFT, or inter-token latency. The framework is also the right opening for 'how would you reduce cost' questions, because cost reduction and latency reduction touch the same five phases with different priorities. Skip it only for prompts that are explicitly about training, fine-tuning, or evaluation — those are different anatomies.
Prompt: 'Our LLM API has p99 of 8 seconds.' Senior answer: 'Probably batching, try continuous batching and quantization.' Staff answer using the framework: 'Before I propose a fix, I need TTFT p99 and inter-token p99 separately. If TTFT is 7 seconds and inter-token is 50 ms, we have a queue or prefill problem and quantization is the wrong tool. If TTFT is 200 ms and inter-token is 250 ms, we have a decode problem and speculative decoding will move the needle. Without that decomposition I'd be guessing.' The single sentence — 'I need TTFT and inter-token separately' — is the entire Staff signal.
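The two-number decomposition in the Staff answer can be written down as a small triage function. This is a sketch, not a rule: the function name and the default budgets (1 s for TTFT, 100 ms for inter-token latency) are assumptions for illustration, and real budgets depend on the product's SLA.

```python
def diagnose(ttft_p99_s: float, inter_token_p99_s: float,
             ttft_budget_s: float = 1.0,
             inter_token_budget_s: float = 0.1) -> str:
    """Map the two latency numbers to the phase group worth investigating.

    The budgets are hypothetical defaults, not recommended values.
    """
    ttft_high = ttft_p99_s > ttft_budget_s
    decode_high = inter_token_p99_s > inter_token_budget_s
    if ttft_high and decode_high:
        return "both: fix queue/prefill first, then decode"
    if ttft_high:
        return "queue or prefill: split TTFT into queue wait and prefill time"
    if decode_high:
        return "decode: continuous-batching tuning or speculative decoding"
    return "within budget"


print(diagnose(7.0, 0.05))   # the first scenario from the text
print(diagnose(0.2, 0.25))   # the second scenario from the text
```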
Your LLM API has p99 of 8 seconds. Where do you look first?
The interviewer hands you one number — p99 of 8 seconds for a chat-style LLM API. The next sentence out of the candidate's mouth is the whole point of this question.
I'd add more GPUs. 8 seconds is way too slow — sounds like we're under-provisioned.
I'd profile the request path first. Probably it's batching. I'd look at whether we're using continuous batching, and I'd check if the model is quantized — both should bring p99 down. If not, larger replicas.
I want to decompose. Is the 8 seconds TTFT or end-to-end? If TTFT is high, we're either queue-bound or prefill-bound — those have different fixes. If inter-token latency is high, we're decode-bound, and the fix is speculative decoding or better continuous batching, not quantization. Quantization helps memory, not latency directly. Show me TTFT p99 and inter-token p99 separately and I can give you a real answer.
Before I propose anything, I want TTFT p99 and inter-token p99 separately. They diagnose totally different systems. If TTFT is 7 seconds and inter-token is 50 ms, this is a queue or prefill problem — likely under-provisioned replicas if QPS is moderate, or unbounded prompt length if the use case is RAG with stuffed context. Fix is more replicas with admission control, or chunked prefill plus shared-prefix KV caching. If TTFT is 200 ms and inter-token is 250 ms, this is decode-bound — fix is speculative decoding, continuous-batching policy tuning, or cascading to a smaller model on easy queries. Quantization is a memory fix; it only helps latency indirectly, by letting you raise batch size before hitting OOM. The pattern I'd take away from this question is: any time you're handed one latency number, refuse it. Ask for two. The two-number decomposition isolates which phase to fix and usually rules out the most-suggested fix on the first pass.
Refused the single-number framing and named the two-number decomposition as the diagnostic, not the answer. Then ran two parallel paths (TTFT-high vs inter-token-high) and showed how the same prompt produces totally different system fixes. Closed by abstracting the move — 'when handed one number, ask for two' — into a pattern the reader uses on cost ($/request is meaningless without $/input-token and $/output-token), on throughput (QPS hides batch size), and on error rate (overall doesn't tell you which class of request is failing). This is the same meta-pattern as the hidden fork from CLARO: don't trust the unit of analysis the prompt hands you. Re-decompose it before designing a fix.
The 5 phases mapped to the actual hardware and software they touch. Notice that the bottleneck resource is different in every phase — and only one phase is bottlenecked by the thing candidates usually fix.
Realistic per-phase latency for a 70B model with GQA, tensor-parallel across 4×H100 80GB, ~1500-token prompts, ~150-token outputs, continuous batching at max-batch ~64, and 100 QPS aggregate. Numbers are typical engineering-blog ranges, not benchmarks — your mileage will vary by ±30%. The point of the chart is the shape, not the absolute values.
Why the same fix has opposite effects in different phases
Most LLM serving advice is phase-blind. 'Use a smaller model.' 'Quantize.' 'Add more GPUs.' Each of these helps exactly one or two phases and is irrelevant or counterproductive in the others. A smaller model fixes prefill and decode but does nothing for queue starvation. More GPUs (more replicas) fix queue but cost money to fix something a config change could fix. Quantization helps decode meaningfully, prefill a little, and queue not at all — and at high batch sizes, aggressively quantized models actually increase tail latency because of dequant overhead on the activation path. The framework lets you say which fix targets which phase so the interviewer hears engineering, not enumeration.
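One way to keep fixes tied to phases is a small lookup table. The sketch below encodes only the effects stated in the paragraph above; the entries marked as assumed are not stated in the text, and the names `FIX_EFFECTS` and `fixes_for` are illustrative.

```python
from typing import Dict, List

# Effect of each fix on each phase, as described in the text.
FIX_EFFECTS: Dict[str, Dict[str, str]] = {
    "smaller model": {"queue": "none", "prefill": "helps", "decode": "helps"},
    # The text only says replicas fix queue; "none" elsewhere is assumed.
    "more replicas": {"queue": "helps", "prefill": "none", "decode": "none"},
    "quantization": {"queue": "none", "prefill": "helps a little", "decode": "helps"},
}


def fixes_for(phase: str) -> List[str]:
    """Return the fixes that have any effect on the given phase."""
    return sorted(fix for fix, effects in FIX_EFFECTS.items()
                  if effects[phase] != "none")


for phase in ("queue", "prefill", "decode"):
    print(phase, "->", fixes_for(phase))
```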
| Dimension | Static batching | Continuous (iteration-level) batching | Chunked prefill + continuous batching |
|---|---|---|---|
| TTFT behavior | Bad at low QPS (waits for batch fill), unstable at high QPS (long requests block batch). | Good. New requests enter the batch on the next iteration, not on batch drain. | Best. Long prompts are sliced into chunks of K tokens that interleave with decode steps — TTFT for the long prompt is bounded, and other requests don't stall. |
| Throughput at saturation | OK but plateaus early — slot underutilization is real. | High. Saturates GPU memory bandwidth. | Equal to or slightly below continuous batching; tuning K trades TTFT vs throughput. |
| Mixed-length fairness | Bad. A 4000-token output in the batch starves every other slot for the full decode. | Good. Short outputs leave the batch quickly, freeing slots for new arrivals. | Best. Prefill no longer blocks decode for other in-flight requests. |
| Implementation complexity | Trivial. Two settings: batch size, timeout. | Moderate. Requires KV-cache management; PagedAttention or equivalent is essentially required at scale. | Higher. Needs scheduler awareness of chunk boundaries and a chunk-size policy. |
| Long-prompt failure mode | Long prompt forces the whole batch to wait through prefill. p99 explodes. | Long prompt arrival still spikes TTFT for that request; doesn't fully solve it. | Solved by construction. The whole reason to deploy this. |
| Choose when | Only for batch-mode offline workloads with no SLA (summarize this corpus overnight). Never for serving. | Default for all online serving. If prompt lengths are uniformly short (chat with small context), this is enough. | Whenever prompt length distribution has a long tail — RAG, long-document Q&A, agentic flows with growing context. The TTFT improvement on the long tail is the entire point. |
Continuous batching is the modern default for online serving. Chunked prefill is the upgrade that prevents the long-prompt p99 disaster. Static batching has one legitimate use — offline batch jobs — and naming it in an interview is a signal you know the option exists, not a recommendation.
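The slot-underutilization argument in the table can be seen in a toy simulation. This is a deliberately simplified sketch: it counts decode steps only, ignores prefill and KV-cache limits, assumes every request is already waiting at step zero, and uses made-up output lengths.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes; then the next starts."""
    total = 0
    for i in range(0, len(output_lens), slots):
        total += max(output_lens[i:i + slots])
    return total


def continuous_batching_steps(output_lens: List[int], slots: int) -> int:
    """Free slots are refilled from the queue at every decode iteration."""
    waiting = deque(output_lens)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1  # one decode iteration emits one token per active request
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# Two long outputs mixed with short ones, four slots (illustrative numbers).
lens = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:", static_batching_steps(lens, slots=4))
print("continuous:", continuous_batching_steps(lens, slots=4))
```

Under static batching the second long request cannot start until the first batch drains, so the two long outputs run back to back. With iteration-level scheduling they overlap, and the short requests finish within a few steps of being admitted.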
Throughput grows roughly linearly with batch size until KV-cache or memory-bandwidth saturation, but TTFT grows non-linearly because long-prompt prefill blocks more in-flight requests as the batch fills. The right operating point is just before the TTFT knee — typically around max-batch 24–48 for 70B-class models on 4×H100, depending on prompt length distribution. The interview move is to name this curve out loud: 'I'd find the batch size where TTFT p99 just starts climbing and operate one notch below that, then add replicas instead of more batch.'
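The 'operate one notch below the knee' move can be sketched as a small helper over a batch-size sweep. The sweep data and the 1.5× knee threshold below are hypothetical; in practice the measurements would come from a load test against the actual prompt-length distribution.

```python
from typing import List, Tuple


def pick_max_batch(sweep: List[Tuple[int, float]], knee_ratio: float = 1.5) -> int:
    """Return the largest batch size tested before TTFT p99 starts climbing.

    sweep: (max_batch, ttft_p99_seconds) pairs from a load test.
    The knee is the first point whose TTFT exceeds knee_ratio times the
    TTFT at the smallest batch size (knee_ratio is an assumed threshold).
    """
    ordered = sorted(sweep)
    baseline = ordered[0][1]
    chosen = ordered[0][0]
    for max_batch, ttft_p99 in ordered:
        if ttft_p99 > knee_ratio * baseline:
            break  # this point is past the knee; keep the previous one
        chosen = max_batch
    return chosen


# Made-up sweep results for illustration only.
sweep = [(8, 0.30), (16, 0.32), (24, 0.35), (32, 0.40), (48, 0.70), (64, 1.60)]
print(pick_max_batch(sweep))
```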
The KV cache as a working set
HBM (GPU memory) — bounded resource
+-----------------------------------------------+
| Model weights (140 GB for 70B in bf16) |
+-----------------------------------------------+
| KV cache — the "working set" |
| |
| [req A · 1500 tok · 600 MB] |
| [req B · 300 tok · 120 MB] |
| [req C · 4000 tok · 1.6 GB] ← hot key |
| [req D · 200 tok · 80 MB] |
| ... up to max_batch_size |
+-----------------------------------------------+
| Free |
+-----------------------------------------------+
When KV cache fills → no new requests admitted → queue grows
When a request's KV is evicted → that request is killed/restarted
KV size per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes
(×2 for K and V; with plain multi-head attention, num_kv_heads × head_dim equals hidden_dim, and GQA shrinks num_kv_heads)
Think of the KV cache as your inference server's working set, in the operating-systems sense. Like a page cache, it has bounded size; like a working set, the things you 'reuse' (shared system-prompt prefixes, recent decode steps) are cheap and the things you don't (one-shot long prompts) are expensive. PagedAttention (vLLM) and similar systems explicitly treat the KV cache like virtual memory pages — fragmentable, swappable, and reservable. Most production tuning of LLM servers is implicitly KV-cache tuning: max batch size is a KV-cache budget; GQA is a KV-cache compression scheme; KV quantization is a working-set shrink; chunked prefill is a way to admit a new request without doubling the working set in one step.
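The working-set framing becomes concrete with a back-of-the-envelope calculator. The sketch below assumes a Llama-2-70B-style configuration (80 layers, 8 KV heads under GQA, head dimension 128, bf16) and a KV budget of 4×80 GB minus 140 GB of weights, ignoring activations and allocator overhead. The results are the same order of magnitude as the illustrative sizes in the diagram, not an exact match.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes stored per token: K and V, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(kv_budget_bytes: int, seq_len: int,
                            bytes_per_token: int) -> int:
    """How many requests of seq_len tokens fit in the KV budget."""
    return kv_budget_bytes // (seq_len * bytes_per_token)


# Assumed Llama-2-70B-style config with GQA, bf16 KV cache.
gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
# The same model with plain multi-head attention (64 KV heads) for comparison.
mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)

# Assumed budget: 4 x 80 GB HBM minus 140 GB of bf16 weights.
budget = (4 * 80 - 140) * 10**9

print("bytes/token with GQA:", gqa)
print("bytes/token with MHA:", mha)
print("1500-token requests that fit (GQA):", max_concurrent_requests(budget, 1500, gqa))
print("1500-token requests that fit (MHA):", max_concurrent_requests(budget, 1500, mha))
```

Read this way, max batch size is a quotient of the KV budget, which is why changes in prompt length silently change how large a batch the server can actually sustain.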
Continuous batching for LLM inference
Anyscale published a comparison showing continuous batching delivering roughly an order of magnitude higher throughput than static batching for the same model and hardware, with most of the gain coming from no longer leaving GPU slots idle while long-output requests drain the batch. The published numbers are workload-specific and cited in the original post (its headline benchmark ran OPT-13B on a single A100 with synthetic prompts and randomly sampled generation lengths).
Workload is decode-dominated — long outputs, repetitive style, or any 'reasoning' / chain-of-thought traffic.
Reach for speculative decoding before quantization. Speculative decoding tackles the memory-bandwidth bottleneck of decode head-on by amortizing one expensive forward pass over multiple accepted tokens; quantization only helps the memory bottleneck indirectly through larger batches.
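The amortization claim can be quantified with the standard expected-value formula for speculative decoding. The sketch assumes each drafted token is accepted independently with the same probability, which real workloads only approximate, and it ignores the cost of running the draft model. The acceptance rate and draft length below are illustrative, not measured.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per forward pass of the large model.

    With i.i.d. acceptance probability a and k drafted tokens, the
    expectation is (1 - a**(k + 1)) / (1 - a). Draft-model cost is ignored.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)  # every draft accepted, plus one bonus token
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Illustrative values: 80% acceptance, 4 drafted tokens per step.
print(expected_tokens_per_target_pass(0.8, 4))
```

Since decode cost is dominated by reading weights and KV cache once per pass, producing several tokens per pass lowers effective inter-token latency roughly in proportion, minus the draft overhead.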
Custom inference stack for high-concurrency chat
Character.AI published an engineering post describing how they reduced cost-per-token by an order of magnitude over generic stacks, primarily through aggressive KV cache sharing (a single user's conversation history becomes a shared prefix), int8 quantization on attention, and a custom multi-query attention variant that radically shrinks the KV working set. The combination lets them keep many more concurrent users per GPU without dropping into queue starvation.
Three serving regimes. Same model class, totally different optimal configurations. Use this as the answer when an interviewer asks 'how would the design change for use case X?' — and use it as the diagnostic when an interviewer hands you ambiguous SLA numbers.
What the interviewer is scoring when they ask 'how would you reduce latency?', beyond what the rubric document lists.
What they score
- Did the candidate ask for the user-facing metric before proposing fixes? (TTFT is what users feel; e2e is what dashboards show. They differ.)
- Did they separate TTFT from inter-token latency in the first 60 seconds, without being prompted?
- Did they avoid the quantization trap: proposing quantization without naming the phase it targets and the regime where it helps?
- Did they ask about prompt-length distribution before recommending architectural changes?
- When the interviewer pushed back with 'why not just add more GPUs?', did they answer with the per-phase economic argument, or did they hedge?
- Did they distinguish 'fix for steady-state' from 'fix for spike'? Real serving has both; most candidates only address one.
Why it's not on the rubric
These aren't in the rubric because they look like 'soft skills' but they're actually structural knowledge — you only ask the right question if you have the right model. The rubric says 'demonstrates depth in LLM serving systems'; these bullets are what depth actually looks like in conversation. An interviewer who has watched 50 candidates can grade these in real time.
How to signal it
- Open with 'before I propose a fix, I want TTFT p99 and inter-token p99 separately.' It is the single highest-signal sentence in the entire question category.
- Name phases out loud when proposing fixes: 'continuous batching is a Phase 1 and Phase 4 win; it doesn't help Phase 2.'
- When suggesting quantization, qualify the regime: 'this helps decode meaningfully (Phase 4) by letting us hold more requests in HBM at once; it does very little for prefill or queue.'
- Ask one prompt-distribution question explicitly: 'what's the p50 vs p99 input length, and is it a chat workload or RAG?' RAG changes the answer.
- When pushed on adding GPUs, respond with the economic split: 'replicas fix queue at $X/QPS; batching policy fixes queue at config-change cost. I'd exhaust the config change before signing the GPU PO.'
- Address steady-state and spike separately: 'for steady-state I'd tune batch and admission; for spike I'd commit to a smaller-model fallback path with explicit quality budget.'
Practice this. Time yourself.
You're handed a profiling dump: a 13B model on 2×A100 80GB, p99 TTFT = 3.2 s, p99 inter-token latency = 80 ms, QPS = 60, avg prompt = 2200 input tokens, avg output = 180 tokens. Diagnose the dominant phase, name the secondary problem, and propose three interventions ranked by expected impact. State which phase each intervention targets. You have 12 minutes. Write the answer as a 4-paragraph response — one paragraph per heading: Diagnosis, Secondary, Interventions (ranked), Open questions.
Self-assessment rubric
| Dimension | Weak | Passing | Strong | Staff bar |
|---|---|---|---|---|
| Phase diagnosis | Said 'high latency, probably batching.' Did not separate TTFT from inter-token. | Identified TTFT as the primary problem: 3.2 s is far too long to wait for a first token. Also computed the decode total as 80 ms × 180 = 14.4 s, so TTFT is only ~18% of e2e, but the e2e is huge, so something is wrong upstream. | Diagnosed TTFT as the primary problem AND noted that 80 ms inter-token at 13B on A100 is mildly elevated (expected ~30–50 ms); flagged decode as a secondary concern. | Called out that the 14.4-second decode total is the actual end-to-end story, that the user is waiting 17+ seconds, and that the breakdown points to (a) prefill or queue dominating TTFT and (b) decode being slow enough to suggest KV-cache pressure or under-batched continuous batching. Named the dominant phase as prefill or queue with 'I'd want to see the queue-wait component to separate them.' |
| Secondary problem identification | Did not identify a secondary problem. | Noted inter-token latency is on the high side. | Connected the inter-token latency to a likely cause (KV-cache fragmentation, max-batch too high causing memory pressure, no GQA on a 13B that doesn't have it natively). | Reframed the 80 ms inter-token as 'expected for 13B with vanilla MHA, but you'd want to confirm whether GQA or MQA is in use and what the effective batch utilization is — because if continuous batching is starving on KV space, decode degrades.' |
| Intervention ranking | Listed unranked options. Recommended quantization without phase justification. | Ranked options by perceived impact but did not tag phases. | Each intervention tagged with the phase it targets, expected impact range, and a reason it's ranked where it is. | Ranked with explicit dependencies: 'Intervention 1 (admission control + chunked prefill) targets Phase 1 + 2 and should cut TTFT by 50–70%. If after that TTFT is still high, Intervention 2 (prefix caching for any shared system prompt) targets Phase 2 and is workload-dependent. Intervention 3 (speculative decoding or KV-cache compression) targets Phase 4 and should be deferred until Phase 1+2 are fixed because optimizing decode while TTFT is broken is solving the wrong problem.' |
| Open questions | No open questions. Treated the dump as complete information. | Asked one or two questions about the workload. | Asked specifically about queue-wait, prompt-length distribution, and current batching policy. | Asked the five questions that distinguish queue-bound from prefill-bound TTFT, named which sub-diagnosis each question collapses, and stated explicitly which assumption their ranked-intervention list depends on. 'If queue wait is 2 seconds, my intervention order changes.' |
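The arithmetic behind the 'Phase diagnosis' row can be checked in a few lines. This sketch follows the rubric's simplification of multiplying p99 inter-token latency by the average output length, which mixes a tail statistic with a mean and slightly overstates a typical request. The function name is illustrative.

```python
from typing import Dict


def latency_budget(ttft_s: float, inter_token_s: float,
                   output_tokens: int) -> Dict[str, float]:
    """Split an end-to-end estimate into TTFT and decode time.

    Follows the rubric's approximation: decode total is
    inter-token latency times the number of output tokens.
    """
    decode_total = inter_token_s * output_tokens
    e2e = ttft_s + decode_total
    return {
        "decode_total_s": decode_total,
        "e2e_s": e2e,
        "ttft_share": ttft_s / e2e,
    }


# Numbers from the practice prompt: TTFT 3.2 s, 80 ms per token, 180 tokens.
budget = latency_budget(ttft_s=3.2, inter_token_s=0.080, output_tokens=180)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```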
Common failures
- Proposed quantization as Intervention #1 without naming a phase. Quantization is the wrong primary lever for a TTFT-dominated problem.
- Treated the 80 ms inter-token as fine because 'it's not the worst number,' missing that 80 ms on 13B-on-A100 is already elevated and points to a compounding problem.
- Wrote a single 17-second end-to-end number and proposed fixes to that, instead of separating TTFT and decode-time and addressing them with different interventions.
- Ranked interventions by familiarity (speculative decoding sounds advanced) instead of by phase + expected impact. Speculative decoding is the right intervention for the secondary problem, not the primary.
- Asked no open questions. The dump is incomplete on purpose; designing without naming the missing data is the canonical Senior failure.
The LLM Serving Diagnostic Tree
Series C company building an LLM-powered developer tool. ~30-person engineering org. Self-hosted serving stack on H100s. The serving team — three senior engineers, one staff — spent roughly three months optimizing quantization to fix a 'latency problem.' They tried INT8, GPTQ, AWQ, and finally a custom mixed-precision scheme. Quality on internal evals dropped slightly each time. p99 latency improved by maybe 8% across the entire effort.
The team was paged repeatedly on p99 latency violations. The default reaction from anyone with ML serving experience is 'the model is too slow, optimize the model.' They never separated TTFT from inter-token latency in their dashboards. Their default Grafana panel showed end-to-end p99 only. They had a continuous-batching server, but their max-batch was set to 96 — a number their original benchmarking had picked when their prompt distribution was much shorter. As their product matured and average prompt length grew from ~800 tokens to ~2400 tokens, the max-batch-96 setting caused continuous batching to fragment KV cache aggressively, and the scheduler ended up holding inbound requests in a queue for 4–6 seconds at p99 while waiting for batch slots to free. Their TTFT was queue-bound. The model was running fine.
Month three of the quantization effort, the staff engineer's manager — who happened to have come from a high-frequency trading background — asked, 'how long are requests waiting before they even hit the GPU?' Nobody knew. They added the metric in an afternoon. p99 queue-wait was 4.8 seconds. The quantization effort had been optimizing a part of the request path that contributed less than 200 ms to the problem, while the actual 4.8 seconds was sitting in a config file under 'max_batch_size: 96.'
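The metric the team added in an afternoon can be sketched as follows. This is an assumption about its shape, not their implementation: it takes two timestamps per request, when the request was enqueued and when the scheduler first placed it in a batch, and reports a nearest-rank percentile of the difference. The sample events are made up.

```python
import math
from typing import List, Tuple


def queue_waits(events: List[Tuple[float, float]]) -> List[float]:
    """events: (enqueued_at, first_scheduled_at) in seconds, one per request."""
    return [scheduled - enqueued for enqueued, scheduled in events]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for a dashboard sanity check."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up timestamps: two requests scheduled quickly, one stuck in the queue.
events = [(0.0, 0.1), (1.0, 1.2), (2.0, 6.8)]
waits = queue_waits(events)
print("p50 queue wait:", percentile(waits, 50))
print("p99 queue wait:", percentile(waits, 99))
```

Subtracting this value from TTFT gives an estimate of prefill plus first-decode time, which is the split the diagnosis depends on.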
At the first paging incident, three months earlier: 'Before I touch the model, I need TTFT p99 broken down into queue-wait p99 and prefill p99, and I need inter-token p99 separately. If queue dominates, the fix is a config change or admission control. If prefill dominates, it's chunked prefill or prefix caching. If decode dominates, it's continuous-batching tuning or speculative decoding. Quantization is a Phase 4 memory-pressure fix; it would be the wrong lever for any of the first three.' That sentence — spoken by anyone on the team — would have saved roughly three person-months of effort and a small quality regression in eval scores. They had the technical capacity to do that diagnosis from day one. What they were missing was the framework that names the diagnostic step as load-bearing.
The framework is not a fancy way to know what you already know. It is the thing that forces you to slow down for sixty seconds and ask the diagnostic question before reaching for the familiar tool. In the interview, the same dynamic plays out in compressed form: the candidate who proposes quantization in the first ninety seconds is the candidate who, in production, would have spent three months on it. The interviewer is not looking for someone who knows about quantization. They are looking for someone who would not have started there. The framework is the difference.