LLM-powered search is the cleanest design interview prompt in the 2025 loop. The product is well-known (Perplexity, Phind, You.com), the latency budget is brutal (sub-2-second TTFT), the failure modes are visible (hallucination, slow streaming, broken citations), and the architecture decomposes cleanly into four stages once you name them. The reason it's still hard is that most candidates default to discussing RAG techniques (chunk size, embedding model, reranker) instead of naming the four stages and the budget tree across them.

The Retrieve → Ground → Stream → Audit framework is the structural opening. Each stage has a budget, a failure mode, and a quality metric. The walkthrough below applies the framework to the prompt 'design an LLM-powered search product, 10M MAU, 1.5-second TTFT bar, with citation correctness as a hard requirement,' and shows where the L5/L6/L7 differentiation lives.

Framework

Retrieve → Ground → Stream → Audit

LLM-powered search systems all converge to the same four-stage architecture: retrieve relevant sources, ground generation in retrieved passages, stream the response to the user, audit citations and content for trust. Each stage has its own latency budget, its own failure mode, and its own quality metric. Teams that conflate stages produce systems that hallucinate (no grounding), feel slow (no streaming), or get sued (no audit). Naming the four stages is the structural opening to any LLM-search interview prompt.

1. Retrieve
Hybrid retrieval over the corpus: dense for semantic matches, lexical (BM25) for proper nouns and exact matches. ~150–300 ms budget for retrieval + rerank. The job is recall, not precision; the LLM does precision via grounding. Most failures here are the retrieval failures from Lesson 2.2: measurable and fixable, but only if Stage 1 of the Retrieval Quality Loop is in place.
2. Ground
Construct the prompt with retrieved passages, citation markers, and explicit grounding instructions. Use constrained decoding or post-hoc citation validation to ensure every claim ties to a passage. ~50 ms budget. This is the stage where 'faithful but wrong' answers come from: the model cites a passage, but the passage itself was wrong or out of date.
3. Stream
Stream the response token by token to the user via Server-Sent Events (SSE), typically over HTTP/2. The user-perceived 'is it working?' moment lives here. Time-to-first-token (TTFT) under 1.5 s is the Perplexity-class bar. Streaming is not a technical detail; for sub-2-second perceived latency it is the difference between a usable product and an unshippable one.
4. Audit
Post-generation validation: do citations resolve to passages that actually exist? Do the cited passages support the claim? Are there policy violations in the generated output? Audit happens in parallel with streaming when possible, or as a synchronous gate when correctness matters more than latency. The audit stage is what separates 'demo' search systems from 'enterprise' search systems.
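
To make the Retrieve stage concrete, here is a minimal sketch of how the BM25 and dense result lists might be merged before reranking. The lesson only says that hybrid retrieval is used; reciprocal rank fusion (RRF) is one common fusion method and is an assumption here, as are the function name, the `k=60` default, and the toy document ids.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical hits: BM25 ranks the exact-match SKU page first, dense does not.
bm25_hits = ["doc_sku_123", "doc_a", "doc_b"]
dense_hits = ["doc_a", "doc_c", "doc_sku_123"]
candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(candidates)  # ['doc_a', 'doc_sku_123', 'doc_c', 'doc_b']
```

Because the fusion only needs ranks, not comparable scores, the two retrievers can run in parallel and be merged cheaply. The fused top-50 would then go to the cross-encoder reranker.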

When to use

Apply RGSA to any LLM-powered search, Q&A, or document-assistant interview prompt. The four-stage decomposition is the cleanest opening because it forces you to talk about latency budget, hallucination, streaming UX, and trust — the four things every interviewer for this prompt is checking.

Worked example

Senior answer to 'design LLM search like Perplexity': 'Retrieve relevant pages, pass them to GPT-4, return the answer.'

Staff answer: 'Four stages. Retrieve: hybrid retrieval, 200 ms budget, recall-focused. Ground: construct the prompt with citation markers, constrained decoding to enforce citation correctness, 50 ms budget. Stream: SSE response, TTFT under 1.5 s. Audit: post-hoc citation resolution and policy check, in parallel with streaming. The 1.5 s TTFT budget forces specific decisions in the retrieve stage: the reranker can use a small model, and hybrid is mandatory because dense alone misses entity-heavy queries. Prefix caching for the grounding prompt is the cheapest cost lever.'

Without naming the stages, the answer collapses into a tour of techniques.
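
The Ground step in the staff answer can be sketched as a small prompt builder. This is an illustration under stated assumptions, not a prescribed format: the `Passage` type, the instruction wording, and the `[n]` marker convention are all hypothetical. The one design point taken from the lesson is that the static instructions come first, so a prefix cache can reuse them across requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    url: str
    text: str


# Hypothetical instruction text; kept static so it can be prefix-cached.
SYSTEM_PROMPT = (
    "Answer using only the numbered passages. "
    "End every factual sentence with the marker of the passage that supports it, "
    "for example [1]. If the passages do not answer the question, say so."
)


def build_grounded_prompt(question, passages):
    """Return (prompt, marker_to_url) for the Ground stage."""
    marker_to_url = {}
    blocks = []
    for index, passage in enumerate(passages, start=1):
        marker_to_url[index] = passage.url
        blocks.append(f"[{index}] {passage.url}\n{passage.text}")
    # Static prefix first, per-request passages and question last.
    prompt = "\n\n".join([SYSTEM_PROMPT, *blocks, f"Question: {question}"])
    return prompt, marker_to_url


prompt, markers = build_grounded_prompt(
    "What is hybrid retrieval?",
    [
        Passage("https://example.com/a", "Hybrid retrieval combines BM25 and dense search."),
        Passage("https://example.com/b", "BM25 is a lexical ranking function."),
    ],
)
```

The returned `marker_to_url` map is what the Audit stage would later use to check that every marker in the answer points at a passage that was actually retrieved.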

Calibration ladder

How do you ensure citations point to passages that actually support the claim, not just passages that exist?

Most candidates conflate 'citation exists' with 'citation supports the claim.' The interviewer wants to see whether you have the distinction.

L4 · Mid

Add citations to the prompt and check that they're in the right format.

Missed: Treated citations as formatting.

L5 · Senior

Validate that cited URLs resolve to documents in the corpus. If they don't, regenerate or remove the citation.

Missed: Knew about citation resolution but not citation support. Will ship a system that 'has citations' but still hallucinates.

L6 · Staff

Two-level audit. (1) Citation resolution — the cited passage exists in the retrieved set. (2) Citation support — the cited passage actually supports the specific claim. Level 1 is a string match; level 2 requires an LLM-as-judge or entailment model that checks claim-passage support. Level 1 catches the easier failure ('cited a passage that doesn't exist'); level 2 catches the harder one ('cited a real passage that doesn't actually support what I claimed').

Missed: Strong two-level decomposition. Missing the meta-move — that Level 2 audit is a continuing operational system requiring calibration.

L7 · Principal

Same two levels with the meta-acknowledgment that Level 2 — claim-support validation — is the structural defense against hallucination, and the team has to commit to it as a continuing system, not a one-time eval. LLM-as-judge for support is itself imperfect; it needs calibration against human labels and re-calibration as the model evolves. The trade-off is real: synchronous Level 2 audit blocks streaming (bad UX), async Level 2 audit catches problems after they reach users (bad trust). The right design is hybrid: synchronous Level 1 (cheap, fast) blocks publish; async Level 2 runs in parallel with streaming and retroactively flags or rewrites the response if a claim is unsupported. Most teams design only Level 1 and call it done. Level 2 — the structural defense against the failure mode that actually hurts users — is the L7 design choice that separates 'we have citations' from 'our citations are trustworthy.'

What scored L7

Named that Level 2 audit is the structural defense against the failure mode users care about, that it requires continuing calibration, and that the right design is a hybrid sync/async pattern. Conflating Level 1 and Level 2 is the canonical Senior-level failure on this question.
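
The two audit levels from the ladder can be sketched as follows. Everything here is illustrative: the `[n]` marker format, the naive sentence splitter, and especially `toy_judge`, which is a keyword-overlap placeholder standing in for the LLM-as-judge or entailment model the L6 and L7 answers describe. It is not a usable support check.

```python
import re

MARKER = re.compile(r"\s*\[(\d+)\]")


def split_claims(answer):
    """Naive sentence split; returns (claim_text, cited_markers) pairs."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if sentence:
            cited = [int(m) for m in MARKER.findall(sentence)]
            claims.append((MARKER.sub("", sentence), cited))
    return claims


def level1_resolves(cited, passages):
    """Level 1: the claim cites something, and every marker is in the retrieved set."""
    return bool(cited) and all(m in passages for m in cited)


def level2_supported(claim, cited, passages, judge):
    """Level 2: at least one cited passage supports the claim, per the judge."""
    return any(judge(claim, passages[m]) for m in cited)


def toy_judge(claim, passage):
    # Placeholder only: a real system would call an entailment model or LLM judge.
    def words(text):
        return {w.lower().strip(".,") for w in text.split() if len(w) > 3}
    return bool(words(claim)) and words(claim) <= words(passage)


def audit(answer, passages, judge):
    report = []
    for claim, cited in split_claims(answer):
        resolved = level1_resolves(cited, passages)
        supported = resolved and level2_supported(claim, cited, passages, judge)
        report.append({"claim": claim, "resolved": resolved, "supported": supported})
    return report


passages = {1: "Paris is the capital of France.", 2: "The Seine flows through Paris."}
answer = ("Paris is the capital of France [1]. "
          "Paris has ten million residents [2]. Lyon is larger [3].")
report = audit(answer, passages, toy_judge)
```

In this toy run the first claim passes both levels, the second resolves but is unsupported (the Level 2 failure that Level 1 cannot see), and the third fails Level 1 because marker 3 was never retrieved.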

Architecture

Perplexity-class LLM-powered search at 10M MAU. The four RGSA stages are visible across the diagram. Notice that retrieve and ground are sequential, streaming overlaps with audit, and the audit stage has both sync and async paths.

Edge gateway · auth + rate limit

“Per-user QPS limits live here; abusive prompts get filtered before any retrieval cost.”

Stage 1: Hybrid retriever (BM25 + dense)

“Dense alone misses proper nouns, codes, exact matches — entity-heavy queries are a 30%+ failure class without lexical.”

Stage 1: Cross-encoder reranker

“Top-50 from retrieval reranked to top-10. Highest impact-per-cost lever in retrieval (Lesson 2.2 tradeoff).”

Stage 2: Prompt assembly + grounding

“Constructs the LLM prompt with retrieved passages, citation markers, and explicit grounding instructions. Prefix-cached system prompt.”

Stage 3: LLM inference with streaming

“Streams response via SSE. Continuous batching + speculative decoding for sub-1.5 s TTFT under load.”

Stage 3: SSE/HTTP2 stream to client

“Token-by-token to user. TTFT is the binding UX metric.”

Stage 4: Sync citation resolution (Level 1)

“String-match: cited URLs in retrieved set. Cheap; blocks stream end if citations don't resolve.”

Stage 4: Async claim-support audit (Level 2)

“LLM-as-judge entailment per claim. Runs parallel with streaming; flags or retracts unsupported claims after stream end.”

Document corpus + indexes

“Web crawl + curated content. Dense index (~150 GB embeddings) + BM25 index (~600 GB inverted).”

Per-stage observability + per-version tracking

“TTFT and per-stage latency; recall@10; faithfulness; citation correctness — all per LLM version.”

edge → retriever

retriever → corpus · dense + BM25 queries

retriever → rerank

rerank → ground

ground → llm · prompt with passages

llm → stream · tokens

stream → audit-sync · stream end

llm → audit-async · claims to validate
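
The last three edges (tokens to the stream, a sync check at stream end, claims to the async audit) can be sketched with `asyncio`. This is a toy under stated assumptions: `fake_llm_tokens` stands in for the inference server, the audit task only counts sentences instead of calling a judge, and the SSE event names (`token`, `citations`, `audit`) are invented for illustration.

```python
import asyncio
import json
import re


def sse_event(event, payload):
    """Frame one Server-Sent Events message: event name, JSON data, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


async def fake_llm_tokens():
    for token in ["Paris is ", "in France [1]. ", "It is large [1]."]:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield token


async def level2_audit(sentences):
    checked = []
    while (sentence := await sentences.get()) is not None:
        checked.append(sentence)  # a real audit would call the judge here
    return {"checked": len(checked), "unsupported": []}


async def respond(retrieved_markers):
    events, text, buffer = [], "", ""
    sentences = asyncio.Queue()
    audit_task = asyncio.create_task(level2_audit(sentences))
    async for token in fake_llm_tokens():
        events.append(sse_event("token", {"text": token}))
        text += token
        buffer += token
        if buffer.rstrip().endswith("."):
            # Hand each finished sentence to the audit while streaming continues.
            await sentences.put(buffer.strip())
            buffer = ""
    await sentences.put(None)  # signal end of stream to the audit task
    # Level 1 is cheap, so it gates the end of the stream synchronously.
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", text)}
    events.append(sse_event("citations", {"resolved": cited <= retrieved_markers}))
    # Level 2 reports after the stream body and could flag or retract claims.
    events.append(sse_event("audit", await audit_task))
    return events


events = asyncio.run(respond(retrieved_markers={1, 2}))
```

The point of the shape is that token events are emitted before either audit result exists, so neither audit level adds to TTFT.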

Latency anatomy · budget 1500 ms

Latency budget for the 1.5-second TTFT bar. Retrieve and rerank dominate; ground is cheap; first token of LLM stream is the user-perceived signal. Audit runs in parallel with streaming and doesn't count toward TTFT.

Edge + auth · 80 ms

Standard.

Hybrid retrieve (BM25 + dense) · 200 ms

BM25 and dense run in parallel; result fusion is cheap.

Rerank (cross-encoder) · 150 ms

Top-50 to top-10.

Ground (prompt assembly) · 50 ms

Prefix-cached system prompt; only the passages need fresh encoding.

LLM first token (prefill) · 950 ms

Dominated by prefill of a ~2–3k-token grounded prompt. Speculative decoding kicks in after the first token, so it doesn't affect TTFT.

Response framing · 70 ms

Buffer first chunk for SSE.
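
The budget above can be checked mechanically. The sketch below uses the stage numbers from this section; the dictionary keys, the `over_budget` helper, and the `measured` values are hypothetical and only illustrate how per-stage overruns might be attributed.

```python
TTFT_BUDGET_MS = 1500

# Per-stage budgets from the latency anatomy above.
stage_budget_ms = {
    "edge_auth": 80,
    "hybrid_retrieve": 200,
    "rerank": 150,
    "ground": 50,
    "llm_prefill": 950,
    "response_framing": 70,
}


def over_budget(measured_ms, budget_ms=stage_budget_ms):
    """Return {stage: overrun_ms} for stages whose measured latency exceeds budget."""
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in budget_ms.items()
        if measured_ms.get(stage, 0) > budget
    }


total = sum(stage_budget_ms.values())
prefill_share = stage_budget_ms["llm_prefill"] / total

# Hypothetical measurements for one slow request.
measured = {
    "edge_auth": 75,
    "hybrid_retrieve": 240,
    "rerank": 150,
    "ground": 40,
    "llm_prefill": 1100,
    "response_framing": 60,
}
overruns = over_budget(measured)
print(total, round(prefill_share, 2), overruns)
```

The stages sum to exactly 1500 ms, and prefill alone is about 63% of the budget, which is why the diagnostic that follows treats prefill as the primary suspect.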

Dimension	Weak	Passing	Strong	Staff bar
Stage attribution	Generic 'we're slow.'	Attributed the gap to one stage (probably retrieve+rerank or prefill).	Decomposed: retrieve (~200 ms), rerank (~150 ms), ground (~50 ms), prefill (~950 ms), framing (~70 ms). Named prefill as primary suspect.	Same plus: identified the grounded prompt size as the controllable lever and named that reducing passages from 10 to 5 cuts prefill by ~40% at small recall cost.
Ranked interventions	Listed unranked options.	Ranked.	Ranked by impact-per-cost: (1) tighter reranker to fewer passages, (2) prefix caching for system prompt, (3) speculative decoding setup.	Same plus: identified what would NOT work — quantization (memory fix, not TTFT fix from Lesson 2.1), bigger model (no), more replicas (helps p99 queue contribution but not the model time).
Architectural non-answer	Did not identify.	Said 'quantization wouldn't help TTFT.'	Named quantization specifically as a memory fix that doesn't address prefill (the actual bottleneck here).	Connected back to Lesson 2.1 — quantization is a Phase 4 (decode) fix; TTFT is Phase 1+2 (queue+prefill); the wrong tool for the wrong phase.
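
The '~40% prefill cut' in the staff-bar cell can be reproduced with back-of-envelope arithmetic. The model below is a deliberate simplification and every constant is an assumption chosen to match the section's numbers: prefill time is treated as linear in prompt tokens, passages are taken as 200 tokens each, and 500 tokens are fixed (system prompt plus query), giving the ~2.5k-token prompt and 950 ms prefill from the latency budget.

```python
def prefill_ms(passages, tokens_per_passage=200, fixed_tokens=500, ms_per_token=0.38):
    """Estimate prefill latency, assuming it scales linearly with prompt tokens."""
    prompt_tokens = fixed_tokens + passages * tokens_per_passage
    return prompt_tokens * ms_per_token


before = prefill_ms(10)  # 2,500 tokens -> about 950 ms
after = prefill_ms(5)    # 1,500 tokens -> about 570 ms
reduction = 1 - after / before
print(round(before), round(after), round(reduction, 2))  # 950 570 0.4
```

Under these assumptions, halving the passages saves roughly 380 ms of TTFT. If the fixed tokens are fully served from a prefix cache, the passages make up the whole uncached prompt and the relative saving would be closer to 50%; real prefill cost is also not perfectly linear, so this should be measured rather than trusted.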


The LLM Search Latency Diagnostic Tree