LLM-Powered Search (Perplexity-scale) — Simulated Interview
LLM-powered search systems all converge to the same four-stage architecture: Retrieve, Ground, Stream, Audit. This walkthrough applies RGSA to a Perplexity-class system at 10M monthly users with the 1.5-second TTFT bar, and shows where the structural decisions live.
LLM-powered search is the cleanest design interview prompt in the 2025 loop. The product is well-known (Perplexity, Phind, You.com), the latency budget is brutal (sub-2-second TTFT), the failure modes are visible (hallucination, slow streaming, broken citations), and the architecture decomposes cleanly into four stages once you name them. The reason it's still hard is that most candidates default to discussing RAG techniques (chunk size, embedding model, reranker) instead of naming the four stages and the budget tree across them.
The Retrieve → Ground → Stream → Audit framework is the structural opening. Each stage has a budget, a failure mode, and a quality metric. The walkthrough below applies the framework to the prompt 'design an LLM-powered search product, 10M MAU, 1.5-second TTFT bar, with citation correctness as a hard requirement,' and shows where the L5/L6/L7 differentiation lives.
Retrieve → Ground → Stream → Audit
LLM-powered search systems all converge to the same four-stage architecture: retrieve relevant sources, ground generation in retrieved passages, stream the response to the user, audit citations and content for trust. Each stage has its own latency budget, its own failure mode, and its own quality metric. Teams that conflate stages produce systems that hallucinate (no grounding), feel slow (no streaming), or get sued (no audit). Naming the four stages is the structural opening to any LLM-search interview prompt.
- 1. Retrieve. Hybrid retrieval over the corpus: dense for semantic matches, lexical (BM25) for proper nouns and exact matches. ~150–300 ms budget for retrieval + rerank. The job is recall, not precision; the LLM does precision via grounding. Most failures here are the retrieval failures from Lesson 2.2: measurable and fixable, but only if Stage 1 of the Retrieval Quality Loop is in place.
- 2. Ground. Construct the prompt with retrieved passages, citation markers, and explicit grounding instructions. Use constrained decoding or post-hoc citation validation to ensure every claim ties to a passage. ~50 ms budget. This is the stage where 'faithful but wrong' answers come from: the model cites a passage, but the passage itself was wrong or out of date.
- 3. Stream. Stream the response token by token to the user via SSE or HTTP/2. The user-perceived 'is it working?' moment lives here. Time-to-first-token (TTFT) under 1.5 s is the Perplexity-class bar. Streaming is not a technical detail; it's the difference between a usable product and an unshippable one for sub-2-second perceived latency.
- 4. Audit. Post-generation validation: do citations resolve to passages that actually exist? Do the cited passages support the claim? Are there policy violations in the generated output? Audit happens in parallel with streaming when possible, or as a synchronous gate when correctness matters more than latency. The audit stage is what separates 'demo' search systems from 'enterprise' search systems.
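The Ground stage above can be sketched as a small prompt builder. This is a minimal illustration, not a reference implementation: the `Passage` type, the `build_grounded_prompt` name, the `[n]` marker format, and the instruction wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """One retrieved passage. Field names are illustrative assumptions."""
    source_url: str
    text: str


def build_grounded_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt whose passages carry numeric citation markers."""
    # Number passages from 1 so the model can cite them as [1], [2], ...
    numbered = [
        f"[{i}] ({p.source_url})\n{p.text}"
        for i, p in enumerate(passages, start=1)
    ]
    # Hypothetical grounding instructions; real wording would be tuned.
    instructions = (
        "Answer using only the passages below. "
        "Cite the supporting passage after every claim, for example [1]. "
        "If the passages do not answer the question, say so."
    )
    # Static instructions go first, per-query content after them.
    return "\n\n".join([instructions, *numbered, f"Question: {question}"])
```

Keeping the static instructions at the front means the part of the prompt that never changes is a shared prefix across queries, which is the shape prefix caching generally needs.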
Apply RGSA to any LLM-powered search, Q&A, or document-assistant interview prompt. The four-stage decomposition is the cleanest opening because it forces you to talk about latency budget, hallucination, streaming UX, and trust — the four things every interviewer for this prompt is checking.
Senior answer to 'design LLM search like Perplexity': 'Retrieve relevant pages, pass them to GPT-4, return the answer.' Staff answer: 'Four stages. Retrieve — hybrid retrieval, 200 ms budget, recall-focused. Ground — construct prompt with citation markers, constrained decoding to enforce citation correctness, 50 ms budget. Stream — SSE response, TTFT under 1.5 s. Audit — post-hoc citation resolution and policy check, parallel with streaming. The 1.5 s TTFT budget forces specific decisions in the retrieve stage: rerank can use a small model, hybrid is mandatory because dense alone misses entity-heavy queries, prefix caching for the grounding prompt is the cheapest cost lever. Without naming the stages, the answer collapses into a tour of techniques.'
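The budget arithmetic in the staff answer can be made explicit. The sketch below uses only the figures quoted there (1.5 s bar, 200 ms retrieve, 50 ms ground) and treats audit as off the TTFT path; the function name and the idea of a "remaining" bucket for queueing, prefill, and first-token emission are assumptions for illustration.

```python
TTFT_BAR_MS = 1500  # the 1.5 s TTFT bar from the prompt

# Stage budgets quoted in the staff answer. Audit runs in parallel with
# streaming, so it is deliberately left out of the TTFT path.
stage_budget_ms = {"retrieve": 200, "ground": 50}


def remaining_for_first_token(bar_ms: int, budgets: dict[str, int]) -> int:
    """Return what is left of the TTFT bar after the pre-generation stages."""
    spent = sum(budgets.values())
    if spent > bar_ms:
        raise ValueError(f"stage budgets ({spent} ms) exceed the bar ({bar_ms} ms)")
    return bar_ms - spent


remaining_ms = remaining_for_first_token(TTFT_BAR_MS, stage_budget_ms)
print(f"left for queueing, prefill, and first token: {remaining_ms} ms")
```

With these numbers, 1250 ms remain for everything between prompt assembly and the first streamed token.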
How do you ensure citations point to passages that actually support the claim, not just passages that exist?
Most candidates conflate 'citation exists' with 'citation supports the claim.' The interviewer wants to see whether you have the distinction.
Add citations to the prompt and check that they're in the right format.
Validate that cited URLs resolve to documents in the corpus. If they don't, regenerate or remove the citation.
Two-level audit. (1) Citation resolution — the cited passage exists in the retrieved set. (2) Citation support — the cited passage actually supports the specific claim. Level 1 is a string match; level 2 requires an LLM-as-judge or entailment model that checks claim-passage support. Level 1 catches the easier failure ('cited a passage that doesn't exist'); level 2 catches the harder one ('cited a real passage that doesn't actually support what I claimed').
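Level 1 really is close to a string match. A minimal sketch, assuming citations appear in the answer as numeric markers like `[2]` that index into the retrieved passages (the marker format and the function name are assumptions, not something the prompt specifies):

```python
import re

# Assumed marker format: [1], [2], ... indexing the retrieved passages.
CITATION_RE = re.compile(r"\[(\d+)\]")


def unresolved_citations(answer: str, num_passages: int) -> list[int]:
    """Return cited marker numbers that do not map to a retrieved passage."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_passages)
```

An empty result only shows that every citation resolves. It says nothing about whether the cited passage supports the claim, which is the Level 2 question.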
Same two levels with the meta-acknowledgment that Level 2 — claim-support validation — is the structural defense against hallucination, and the team has to commit to it as a continuing system, not a one-time eval. LLM-as-judge for support is itself imperfect; it needs calibration against human labels and re-calibration as the model evolves. The trade-off is real: synchronous Level 2 audit blocks streaming (bad UX), async Level 2 audit catches problems after they reach users (bad trust). The right design is hybrid: synchronous Level 1 (cheap, fast) blocks publish; async Level 2 runs in parallel with streaming and retroactively flags or rewrites the response if a claim is unsupported. Most teams design only Level 1 and call it done. Level 2 — the structural defense against the failure mode that actually hurts users — is the L7 design choice that separates 'we have citations' from 'our citations are trustworthy.'
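The hybrid pattern can be sketched with `asyncio`: a cheap synchronous Level 1 gate, and an awaitable Level 2 audit that can run as a background task. Everything here is hypothetical scaffolding: the `[n]` marker format, the assumption that markers sit before the sentence's closing punctuation, and `toy_judge`, which is a word-overlap stand-in for the LLM-as-judge or entailment model and is not a real support check.

```python
import asyncio
import re
from typing import Awaitable, Callable

CITATION_RE = re.compile(r"\[(\d+)\]")  # assumed marker format: [1], [2], ...

# A judge receives (claim, passage) and reports whether the passage supports it.
SupportJudge = Callable[[str, str], Awaitable[bool]]


def level1_ok(answer: str, passages: list[str]) -> bool:
    """Synchronous gate: every [n] marker must map to a retrieved passage."""
    return all(1 <= int(n) <= len(passages) for n in CITATION_RE.findall(answer))


async def level2_unsupported(
    answer: str, passages: list[str], judge: SupportJudge
) -> list[str]:
    """Async audit: return sentences whose cited passage fails the judge.

    Assumes level1_ok() already passed, so every marker is a valid index.
    """
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = CITATION_RE.sub("", sentence).strip()
        for n in CITATION_RE.findall(sentence):
            if not await judge(claim, passages[int(n) - 1]):
                flagged.append(sentence)
                break  # one failed citation is enough to flag the sentence
    return flagged


async def toy_judge(claim: str, passage: str) -> bool:
    """Stand-in only: true if every word of the claim appears in the passage."""
    words = claim.lower().rstrip(". ").split()
    return all(word in passage.lower() for word in words)
```

In a serving loop, `level1_ok` would gate the response, and `asyncio.create_task(level2_unsupported(...))` would let the Level 2 audit proceed without blocking the stream. This sketch audits the full answer at once; auditing sentence by sentence as they complete is a natural variation.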
Named that Level 2 audit is the structural defense against the failure mode users care about, that it requires continuing calibration, and that the right design is a hybrid sync/async pattern. Conflating Level 1 and Level 2 is the canonical Senior-level failure on this question.
Perplexity-class LLM-powered search at 10M MAU. The four RGSA stages are visible across the diagram. Notice that retrieve and ground are sequential, streaming overlaps with audit, and the audit stage has both sync and async paths.
Latency budget for the 1.5-second TTFT bar. Retrieve and rerank dominate; ground is cheap; first token of LLM stream is the user-perceived signal. Audit runs in parallel with streaming and doesn't count toward TTFT.
Practice this. Time yourself.
You have 12 minutes. A team is shipping an LLM-powered search product and the marketing team has set a hard TTFT target of 1.2 seconds. The current p99 TTFT is 1.8 seconds. Diagnose where the 600 ms gap is most likely living, propose three interventions ranked by expected impact and cost, and name the one architectural change that would not earn the TTFT win.
Self-assessment rubric
| Dimension | Weak | Passing | Strong | Staff bar |
|---|---|---|---|---|
| Stage attribution | Generic 'we're slow.' | Attributed the gap to one stage (probably retrieve+rerank or prefill). | Decomposed: retrieve (~350 ms), rerank (~150 ms), ground (~50 ms), prefill (~950 ms), framing (~70 ms). Named prefill as primary suspect. | Same plus: identified the grounded prompt size as the controllable lever and named that reducing passages from 10 to 5 cuts prefill by ~40% at small recall cost. |
| Ranked interventions | Listed unranked options. | Ranked. | Ranked by impact-per-cost: (1) tighter reranker to fewer passages, (2) prefix caching for system prompt, (3) speculative decoding setup. | Same plus: identified what would NOT work: quantization (a memory fix, not a TTFT fix, per Lesson 2.1), a bigger model (more prefill compute, so slower), more replicas (helps the p99 queue contribution but not the model time). |
| Architectural non-answer | Did not identify. | Said 'quantization wouldn't help TTFT.' | Named quantization specifically as a memory fix that doesn't address prefill (the actual bottleneck here). | Connected back to Lesson 2.1 — quantization is a Phase 4 (decode) fix; TTFT is Phase 1+2 (queue+prefill); the wrong tool for the wrong phase. |
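The stage attribution in the rubric is easier to reason about once the arithmetic is written down. The sketch below uses only the rubric's approximate figures and the 1.8 s p99 from the exercise. Note that the quoted stages sum to less than the observed p99; the sketch reports the difference as unattributed rather than guessing where it lives.

```python
OBSERVED_P99_MS = 1800  # current p99 TTFT from the exercise
TARGET_MS = 1200        # marketing's TTFT target

# Approximate per-stage figures from the "Strong" column of the rubric.
stage_ms = {"retrieve": 350, "rerank": 150, "ground": 50, "prefill": 950, "framing": 70}

total_ms = sum(stage_ms.values())
prefill_share = stage_ms["prefill"] / total_ms
unattributed_ms = OBSERVED_P99_MS - total_ms

# Staff-bar lever: halving passages (10 to 5) cuts prefill by roughly 40%.
PREFILL_CUT_PCT = 40
saved_ms = stage_ms["prefill"] * PREFILL_CUT_PCT // 100
projected_ms = total_ms - saved_ms

print(f"attributed total: {total_ms} ms, prefill share: {prefill_share:.0%}")
print(f"unattributed vs observed p99: {unattributed_ms} ms")
print(f"after prefill cut: {projected_ms} ms (saves {saved_ms} ms)")
```

On these figures prefill is about 60% of the attributed time, and the passage cut alone recovers 380 ms of the 600 ms gap, which is why it ranks first.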
Common failures
- ✗ Did not decompose the latency by stage. Generic 'slow' diagnosis doesn't earn the staff signal.
- ✗ Suggested quantization as the primary fix. Wrong phase from Lesson 2.1.
- ✗ Did not identify the grounded prompt size as the controllable lever. Most teams obsess over reranker quality and miss that the prompt itself is the cost driver.
- ✗ Did not propose the prefix cache. Free TTFT win that most teams skip.
The LLM Search Latency Diagnostic Tree
A mid-stage AI startup with an LLM-powered search product hit a TTFT regression: from 1.4 s to 2.3 s over four weeks. The team spent two weeks investigating and focused on inference optimization (quantization, a smaller model).
The team had been improving retrieval quality by increasing the number of passages in the grounded prompt — from 5 to 10 over the quarter, justified by faithfulness improvements. Each addition was a small improvement; aggregated, they doubled prefill time. The TTFT regression was a direct consequence of decisions the team had been making, but those decisions had been instrumented for retrieval quality and not for latency. The investigation focused on the model because that's where the team's instinct went; the actual cause was upstream in the grounded prompt assembly.
Week three of the investigation, a new engineer asked what the average grounded prompt size had been four weeks ago vs now. The answer was 1.4k tokens then, 2.9k tokens now. The TTFT regression was explained immediately; the team rolled back the most recent passage additions, hit the original TTFT, and resumed the faithfulness improvements with explicit latency budgeting per addition.
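The new engineer's question amounts to a two-point fit. The sketch below is a back-of-envelope estimate under two loud assumptions that the case study does not state: that the whole regression came from prefill, and that prefill time grows linearly with prompt tokens (a simplification, since attention cost grows faster than linearly with length).

```python
# Figures from the case study.
tokens_before, tokens_after = 1400, 2900
ttft_before_ms, ttft_after_ms = 1400, 2300

token_ratio = tokens_after / tokens_before

# ASSUMPTION: all added latency is prefill, and prefill is linear in tokens.
ms_per_token = (ttft_after_ms - ttft_before_ms) / (tokens_after - tokens_before)

# Under that assumption, split the original TTFT into prefill and the rest.
implied_prefill_before_ms = ms_per_token * tokens_before
implied_fixed_ms = ttft_before_ms - implied_prefill_before_ms

print(f"prompt grew {token_ratio:.2f}x")
print(f"implied prefill cost: {ms_per_token:.2f} ms per prompt token")
print(f"implied non-prefill overhead: {implied_fixed_ms:.0f} ms")
```

The estimate is rough, but it is the kind of number that would have pointed the investigation at prompt assembly in an afternoon.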
What should have been said when the first passage addition was being discussed: 'Each passage added increases prefill latency. Before we add more passages, we need a latency budget for grounded prompt size and a faithfulness-vs-latency willingness-to-trade ratio. Otherwise we'll improve faithfulness and regress TTFT silently, which is what users will feel.' Having that conversation at the time of the first passage addition would have prevented the entire regression.
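An explicit latency budget for grounded prompt size could look like the gate below. It is a sketch under stated assumptions: the `PromptBudget` type is hypothetical, the linear prefill model is a simplification, and the example numbers (560 ms fixed overhead, 0.6 ms per token) are illustrative values that a team would replace with its own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptBudget:
    """Hypothetical latency gate for grounded prompt size."""
    ttft_budget_ms: float
    fixed_overhead_ms: float      # everything on the TTFT path except prefill
    prefill_ms_per_token: float   # ASSUMPTION: prefill is linear in tokens

    def projected_ttft_ms(self, prompt_tokens: int) -> float:
        return self.fixed_overhead_ms + self.prefill_ms_per_token * prompt_tokens

    def max_prompt_tokens(self) -> int:
        headroom_ms = self.ttft_budget_ms - self.fixed_overhead_ms
        return int(headroom_ms / self.prefill_ms_per_token)

    def allows(self, prompt_tokens: int) -> bool:
        return self.projected_ttft_ms(prompt_tokens) <= self.ttft_budget_ms


# Illustrative numbers only; measure these in the real system.
budget = PromptBudget(ttft_budget_ms=1500, fixed_overhead_ms=560, prefill_ms_per_token=0.6)
print(budget.max_prompt_tokens())
print(budget.allows(1400), budget.allows(2900))
```

Checking each proposed passage addition against `allows` turns the faithfulness-versus-latency trade into a visible decision instead of a silent drift.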
LLM search latency lives in the grounded prompt, not the model. Faithfulness improvements via more passages compound into TTFT regressions if the budget isn't explicit. The RGSA framework forces the budget conversation by naming the stages and their costs. The wrong place to look during a TTFT regression is the model; the right place is upstream in the prompt construction.