Module 4 · Lesson 2 · Advanced · 42 min

LLM-Powered Search (Perplexity-scale) — Simulated Interview

LLM-powered search systems all converge to the same four-stage architecture: Retrieve, Ground, Stream, Audit. This walkthrough applies RGSA to a Perplexity-class system at 10M monthly users with the 1.5-second TTFT bar, and shows where the structural decisions live.

LLM-powered search is the cleanest design interview prompt in the 2025 loop. The product is well-known (Perplexity, Phind, You.com), the latency budget is brutal (sub-2-second TTFT), the failure modes are visible (hallucination, slow streaming, broken citations), and the architecture decomposes cleanly into four stages once you name them. The reason it's still hard is that most candidates default to discussing RAG techniques (chunk size, embedding model, reranker) instead of naming the four stages and the budget tree across them.

The Retrieve → Ground → Stream → Audit framework is the structural opening. Each stage has a budget, a failure mode, and a quality metric. The walkthrough below applies the framework to the prompt 'design an LLM-powered search product, 10M MAU, 1.5-second TTFT bar, with citation correctness as a hard requirement,' and shows where the L5/L6/L7 differentiation lives.

Framework

Retrieve → Ground → Stream → Audit

LLM-powered search systems all converge to the same four-stage architecture: retrieve relevant sources, ground generation in retrieved passages, stream the response to the user, audit citations and content for trust. Each stage has its own latency budget, its own failure mode, and its own quality metric. Teams that conflate stages produce systems that hallucinate (no grounding), feel slow (no streaming), or get sued (no audit). Naming the four stages is the structural opening to any LLM-search interview prompt.

  1. Retrieve
    Hybrid retrieval over the corpus: dense for semantic matches, lexical (BM25) for proper nouns and exact matches. ~150–300 ms budget for retrieval + rerank. The job is recall, not precision; the LLM does precision via grounding. Most failures here are retrieval failures from Lesson 2.2: measurable and fixable, but only if Stage 1 of the Retrieval Quality Loop is in place.
  2. Ground
    Construct the prompt with retrieved passages, citation markers, and explicit grounding instructions. Use constrained decoding or post-hoc citation validation to ensure every claim ties to a passage. ~50 ms budget. This is the stage where 'faithful but wrong' answers come from: the model cites a passage but the passage itself was wrong or out of date.
  3. Stream
    Stream the response token by token to the user via SSE or HTTP/2. The user-perceived 'is it working?' moment lives here. Time-to-first-token (TTFT) under 1.5 s is the Perplexity-class bar. Streaming is not a technical detail; it is the difference between a usable product and an unshippable one for sub-2-second perceived latency.
  4. Audit
    Post-generation validation: do citations resolve to passages that actually exist? Do the cited passages support the claim? Are there policy violations in the generated output? Audit happens in parallel with streaming when possible, or as a synchronous gate when correctness matters more than latency. The audit stage is what separates 'demo' search systems from 'enterprise' search systems.
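The Retrieve stage above says BM25 and dense results are combined but does not name a fusion method. A minimal sketch, assuming reciprocal rank fusion (RRF) as the combiner; the function name, the document ids, and the constant `k=60` are illustrative choices, not something the lesson prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several ranked lists of doc ids into one list by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents found by
            # both retrievers accumulate score from both.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical result lists for an entity-heavy query.
bm25_hits = ["doc_sku_4411", "doc_a", "doc_b"]   # lexical: exact-match hit first
dense_hits = ["doc_a", "doc_c", "doc_b"]         # dense: misses the exact match
candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The exact-match document that only BM25 found stays in the candidate set, which is the recall-first behavior the stage asks for; the reranker decides the final order.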
When to use

Apply RGSA to any LLM-powered search, Q&A, or document-assistant interview prompt. The four-stage decomposition is the cleanest opening because it forces you to talk about latency budget, hallucination, streaming UX, and trust — the four things every interviewer for this prompt is checking.

Worked example

Senior answer to 'design LLM search like Perplexity': 'Retrieve relevant pages, pass them to GPT-4, return the answer.' Staff answer: 'Four stages. Retrieve: hybrid retrieval, 200 ms budget, recall-focused. Ground: construct prompt with citation markers, constrained decoding to enforce citation correctness, 50 ms budget. Stream: SSE response, TTFT under 1.5 s. Audit: post-hoc citation resolution and policy check, parallel with streaming. The 1.5 s TTFT budget forces specific decisions: rerank can use a small model, hybrid is mandatory because dense alone misses entity-heavy queries, prefix caching for the grounding prompt is the cheapest cost lever.' Without naming the stages, the answer collapses into a tour of techniques.
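The Ground step in the staff answer can be sketched as a small prompt builder. This is an illustration under assumptions: the `Passage` type, the instruction wording, the `[n]` marker format, and the default of five passages are all hypothetical, chosen only to show where citation markers and the stable prefix sit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    url: str
    text: str


# Hypothetical instruction text. It is identical across requests, so it is
# the part an inference engine's prefix cache could reuse.
GROUNDING_INSTRUCTIONS = (
    "Answer using only the numbered passages below. "
    "After every claim, cite the supporting passage as [n]. "
    "If the passages do not answer the question, say so."
)


def build_grounded_prompt(question, passages, max_passages=5):
    """Return (prompt, citation_map); citation_map maps marker number to URL."""
    kept = passages[:max_passages]  # cap the prompt size explicitly
    citation_map = {i: p.url for i, p in enumerate(kept, start=1)}
    blocks = [f"[{i}] ({p.url})\n{p.text}" for i, p in enumerate(kept, start=1)]
    # Stable instructions first, per-request passages and question last.
    prompt = "\n\n".join([GROUNDING_INSTRUCTIONS, *blocks, f"Question: {question}"])
    return prompt, citation_map
```

Returning the citation map alongside the prompt gives the Audit stage something to resolve markers against later.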

Calibration ladder

How do you ensure citations point to passages that actually support the claim, not just passages that exist?

Most candidates conflate 'citation exists' with 'citation supports the claim.' The interviewer wants to see whether you have the distinction.

L4 · Mid

Add citations to the prompt and check that they're in the right format.

Missed: Treated citations as formatting.
L5 · Senior

Validate that cited URLs resolve to documents in the corpus. If they don't, regenerate or remove the citation.

Missed: Knew about citation resolution but not citation support. Will ship a system that 'has citations' but still hallucinates.
L6 · Staff

Two-level audit. (1) Citation resolution — the cited passage exists in the retrieved set. (2) Citation support — the cited passage actually supports the specific claim. Level 1 is a string match; level 2 requires an LLM-as-judge or entailment model that checks claim-passage support. Level 1 catches the easier failure ('cited a passage that doesn't exist'); level 2 catches the harder one ('cited a real passage that doesn't actually support what I claimed').

Missed: Strong two-level decomposition. Missing the meta-move — that Level 2 audit is a continuing operational system requiring calibration.
L7 · Principal

Same two levels with the meta-acknowledgment that Level 2 — claim-support validation — is the structural defense against hallucination, and the team has to commit to it as a continuing system, not a one-time eval. LLM-as-judge for support is itself imperfect; it needs calibration against human labels and re-calibration as the model evolves. The trade-off is real: synchronous Level 2 audit blocks streaming (bad UX), async Level 2 audit catches problems after they reach users (bad trust). The right design is hybrid: synchronous Level 1 (cheap, fast) blocks publish; async Level 2 runs in parallel with streaming and retroactively flags or rewrites the response if a claim is unsupported. Most teams design only Level 1 and call it done. Level 2 — the structural defense against the failure mode that actually hurts users — is the L7 design choice that separates 'we have citations' from 'our citations are trustworthy.'

What scored L7

Named that Level 2 audit is the structural defense against the failure mode users care about, that it requires continuing calibration, and that the right design is a hybrid sync/async pattern. Conflating Level 1 and Level 2 is the canonical Senior-level failure on this question.
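The two audit levels can be sketched in a few lines. Assumptions are loud here: citations are written as `[n]` before the sentence's final punctuation, claims are split naively by sentence, and `toy_supports` is a keyword-overlap stand-in for the entailment model or LLM-as-judge that a real Level 2 audit would call.

```python
import re

CITATION = re.compile(r"\s*\[(\d+)\]")


def split_claims(answer):
    """Naive sentence split; returns (claim_text, cited_markers) pairs."""
    claims = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if sentence:
            markers = [int(m) for m in CITATION.findall(sentence)]
            claims.append((CITATION.sub("", sentence).strip(), markers))
    return claims


def level1_unresolved(answer, passages):
    """Level 1: cited markers that do not resolve to a retrieved passage."""
    cited = {int(m) for m in CITATION.findall(answer)}
    return sorted(cited - set(passages))


def level2_unsupported(answer, passages, supports):
    """Level 2: claims that no cited passage supports, per `supports`.

    Claims with no resolvable citation are flagged as well.
    """
    flagged = []
    for claim, markers in split_claims(answer):
        texts = [passages[m] for m in markers if m in passages]
        if not any(supports(claim, text) for text in texts):
            flagged.append(claim)
    return flagged


def toy_supports(claim, passage):
    # Stand-in only: a real system would call an entailment model or judge.
    words = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    return bool(words) and all(w in passage.lower() for w in words)


passages = {
    1: "The Eiffel Tower is 330 metres tall.",
    2: "Paris is the capital of France.",
}
answer = (
    "The Eiffel Tower is 330 metres tall [1]. "
    "It was built in 1850 [2]. "
    "It is in Berlin [3]."
)
unresolved = level1_unresolved(answer, passages)
unsupported = level2_unsupported(answer, passages, toy_supports)
```

The second sentence is the case the ladder cares about: its citation resolves, so Level 1 passes it, and only the Level 2 check flags it.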

Architecture

Perplexity-class LLM-powered search at 10M MAU. The four RGSA stages are visible across the diagram. Notice that retrieve and ground are sequential, streaming overlaps with audit, and the audit stage has both sync and async paths.

Edge gateway · auth + rate limit
Per-user QPS limits live here; abusive prompts get filtered before any retrieval cost.
Stage 1: Hybrid retriever (BM25 + dense)
Dense alone misses proper nouns, codes, exact matches — entity-heavy queries are a 30%+ failure class without lexical.
Stage 1: Cross-encoder reranker
Top-50 from retrieval reranked to top-10. Highest impact-per-cost lever in retrieval (Lesson 2.2 tradeoff).
Stage 2: Prompt assembly + grounding
Constructs the LLM prompt with retrieved passages, citation markers, and explicit grounding instructions. Prefix-cached system prompt.
Stage 3: LLM inference with streaming
Streams response via SSE. Continuous batching keeps TTFT under 1.5 s under load; speculative decoding speeds up decode after the first token and does not affect TTFT.
Stage 3: SSE/HTTP2 stream to client
Token-by-token to user. TTFT is the binding UX metric.
Stage 4: Sync citation resolution (Level 1)
String-match: cited URLs in retrieved set. Cheap; blocks stream end if citations don't resolve.
Stage 4: Async claim-support audit (Level 2)
LLM-as-judge entailment per claim. Runs parallel with streaming; flags or retracts unsupported claims after stream end.
Document corpus + indexes
Web crawl + curated content. Dense index (~150 GB embeddings) + BM25 index (~600 GB inverted).
Per-stage observability + per-version tracking
TTFT and per-stage latency; recall@10; faithfulness; citation correctness — all per LLM version.
edge → retriever
retriever → corpus · dense + BM25 queries
retriever → rerank
rerank → ground
ground → llm · prompt with passages
llm → stream · tokens
stream → audit-sync · stream end
llm → audit-async · claims to validate
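The sync/async split in the audit paths can be sketched with `asyncio`. This is a toy under assumptions: `fake_token_stream` stands in for an inference engine's streaming API, `send` stands in for the SSE writer, and the event names (`token`, `done`, `retract`) are invented for illustration. Claims are detected by naive end-of-sentence punctuation.

```python
import asyncio


async def fake_token_stream(text):
    # Stand-in for an inference engine's streaming output.
    for word in text.split(" "):
        await asyncio.sleep(0)
        yield word + " "


async def serve(token_stream, resolves, supports, send):
    """Stream tokens, gate stream end on Level 1, retract after Level 2."""
    audits, sentence, answer = [], "", ""
    async for token in token_stream:
        await send(("token", token))
        answer += token
        sentence += token
        if sentence.rstrip().endswith((".", "!", "?")):
            # Level 2 starts per claim while later tokens are still streaming.
            audits.append(asyncio.create_task(supports(sentence.strip())))
            sentence = ""
    if sentence.strip():
        audits.append(asyncio.create_task(supports(sentence.strip())))
    # Level 1 is cheap, so it gates the end of the stream synchronously.
    await send(("done", "ok" if resolves(answer) else "unresolved-citation"))
    # Level 2 verdicts can land after stream end and trigger a retraction.
    for verdict in await asyncio.gather(*audits):
        if not verdict["supported"]:
            await send(("retract", verdict["claim"]))


async def demo():
    events = []

    async def send(event):
        events.append(event)

    async def supports(claim):
        await asyncio.sleep(0)  # placeholder for a judge-model call
        return {"claim": claim, "supported": "[" in claim}

    def resolves(answer):
        return True  # placeholder Level 1 check

    await serve(fake_token_stream("A is true [1]. B is true."), resolves, supports, send)
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Tokens reach the user without waiting on the judge, which is why the audit does not count toward TTFT; the cost is that a retraction can arrive after the user has already read the claim.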
Latency anatomy · budget 1500 ms

Latency budget for the 1.5-second TTFT bar. LLM prefill dominates at 950 ms of the 1,500 ms; retrieve and rerank together take 350 ms; ground is cheap; the first token of the LLM stream is the user-perceived signal. Audit runs in parallel with streaming and doesn't count toward TTFT.

Edge + auth · 80 ms
Standard.
Hybrid retrieve (BM25 + dense) · 200 ms
BM25 and dense run in parallel; result fusion is cheap.
Rerank (cross-encoder) · 150 ms
Top-50 to top-10.
Ground (prompt assembly) · 50 ms
Prefix-cached system prompt; only the passages need fresh encoding.
LLM first token (prefill) · 950 ms
Dominated by prefill of ~2-3k token grounded prompt. Speculative decoding kicks in after first token; doesn't affect TTFT.
Response framing · 70 ms
Buffer first chunk for SSE.
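The budget arithmetic can be checked in a few lines. The numbers are the ones in the latency anatomy above; the dictionary keys and the `over_budget` helper are hypothetical names for illustration.

```python
TTFT_BUDGET_MS = 1500

# Per-stage budget from the latency anatomy.
stage_budget_ms = {
    "edge_auth": 80,
    "hybrid_retrieve": 200,
    "rerank": 150,
    "ground": 50,
    "llm_prefill": 950,
    "response_framing": 70,
}

total_ms = sum(stage_budget_ms.values())
dominant_stage = max(stage_budget_ms, key=stage_budget_ms.get)
shares = {stage: ms / total_ms for stage, ms in stage_budget_ms.items()}


def over_budget(measured_ms):
    """Return {stage: overshoot_ms} for stages that exceed their budget."""
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in stage_budget_ms.items()
        if measured_ms.get(stage, 0) > budget
    }
```

Prefill takes roughly 63% of the budget and retrieve plus rerank about 23%, so a regression hunt that starts anywhere other than prefill needs a stated reason.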
Drill · 12 minutes

Practice this. Time yourself.

You have 12 minutes. A team is shipping an LLM-powered search product and the marketing team has set a hard TTFT target of 1.2 seconds. The current p99 TTFT is 1.8 seconds. Diagnose where the 600 ms gap is most likely living, propose three interventions ranked by expected impact and cost, and name the one architectural change that would not earn the TTFT win.

Self-assessment rubric

Dimension · Weak · Passing · Strong · Staff bar

Stage attribution
  Weak: Generic 'we're slow.'
  Passing: Attributed the gap to one stage (probably retrieve+rerank or prefill).
  Strong: Decomposed: retrieve (~350 ms), rerank (~150 ms), ground (~50 ms), prefill (~950 ms), framing (~70 ms). Named prefill as primary suspect.
  Staff bar: Same plus: identified the grounded prompt size as the controllable lever and named that reducing passages from 10 to 5 cuts prefill by ~40% at small recall cost.

Ranked interventions
  Weak: Listed unranked options.
  Passing: Ranked.
  Strong: Ranked by impact-per-cost: (1) tighter reranker to fewer passages, (2) prefix caching for system prompt, (3) speculative decoding setup.
  Staff bar: Same plus: identified what would NOT work: quantization (memory fix, not TTFT fix from Lesson 2.1), bigger model (no), more replicas (helps p99 queue contribution but not the model time).

Architectural non-answer
  Weak: Did not identify.
  Passing: Said 'quantization wouldn't help TTFT.'
  Strong: Named quantization specifically as a memory fix that doesn't address prefill (the actual bottleneck here).
  Staff bar: Connected back to Lesson 2.1: quantization is a Phase 4 (decode) fix; TTFT is Phase 1+2 (queue+prefill); the wrong tool for the wrong phase.
Model solution

Stage attribution. Budget breakdown: edge ~80 ms, retrieve ~350 ms, rerank ~150 ms, ground ~50 ms, prefill ~950 ms, framing ~70 ms = 1.65 s actual, which leaves ~150 ms of the 1.8 s p99 unattributed (most plausibly queueing under load). The 600 ms gap from 1.8 to 1.2 s is most likely in prefill (~950 ms) and secondarily in retrieve (~350 ms). Prefill scales with grounded prompt size; if the team is putting 10 passages in the prompt at ~300 tokens each, the prompt is ~3k tokens and prefill is the dominant line item.

Ranked interventions.
(1) Tighten the reranker output from 10 passages to 5 in the grounded prompt. At this prompt length, prefill time grows roughly linearly with prompt tokens (the quadratic attention term only takes over at much longer contexts), so halving the passage tokens cuts prefill by ~40-50%. Expected TTFT impact: ~400 ms saved. Cost: a day of work, with a small recall trade-off (typically <2% on faithfulness). This is the highest-impact lowest-cost lever and it should be done first.
(2) Prefix-cache the system prompt and grounding instructions. The instruction component of the prompt is stable across requests; caching the KV for it saves ~100-150 ms on cache hit. Cost: trivial, since most inference engines support this natively.
(3) Speculative decoding for the decode phase. Doesn't help TTFT directly but improves perceived streaming smoothness. Mention but defer.

Architectural non-answer. Quantization. The team's instinct will be to quantize the model to make it faster; that's the canonical wrong move here. Quantization is a memory-bandwidth fix that helps decode (Phase 4 from Lesson 2.1); TTFT is queue+prefill (Phases 1-2). Quantization on a prefill-bound workload may slightly help by allowing larger batches, but it won't address the 600 ms gap. The right framing: 'we're prefill-bound; quantization is a decode fix; we should not be reaching for it on this problem.'
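The passage-count intervention can be sanity-checked with a back-of-envelope model. This is a sketch, assuming prefill time is linear in prompt tokens at this scale and that the stable instructions take about 400 tokens; the 400 is an illustrative guess, while the ~300 tokens per passage and the ~950 ms prefill at 10 passages come from the drill.

```python
INSTRUCTION_TOKENS = 400    # assumption: size of the stable system prompt
TOKENS_PER_PASSAGE = 300    # from the drill: ~300 tokens per passage
PREFILL_MS_AT_10 = 950      # from the drill: measured prefill at 10 passages


def prompt_tokens(n_passages):
    return INSTRUCTION_TOKENS + n_passages * TOKENS_PER_PASSAGE


# Calibrate a linear model from the one measured point.
ms_per_token = PREFILL_MS_AT_10 / prompt_tokens(10)


def estimated_prefill_ms(n_passages):
    return prompt_tokens(n_passages) * ms_per_token


saved_ms = estimated_prefill_ms(10) - estimated_prefill_ms(5)
saved_fraction = saved_ms / estimated_prefill_ms(10)
```

Under these assumptions, dropping from 10 passages to 5 saves about 419 ms, or 44% of prefill, in line with the ~400 ms and ~40-50% figures in the solution. The saving is below 50% because the instruction tokens are not halved.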

Common failures

  • Did not decompose the latency by stage. Generic 'slow' diagnosis doesn't earn the staff signal.
  • Suggested quantization as the primary fix. Wrong phase from Lesson 2.1.
  • Did not identify the grounded prompt size as the controllable lever. Most teams obsess over reranker quality and miss that the prompt itself is the cost driver.
  • Did not propose the prefix cache. Free TTFT win that most teams skip.
Artifact · decision tree

The LLM Search Latency Diagnostic Tree

Get TTFT broken down by stage. Which stage dominates?

  • Retrieve dominates (>300 ms)
    Are BM25 and dense running in parallel?
      • No, sequential: Make them parallel. Free latency save.
      • Yes, parallel. Reranker is the next suspect: Use a smaller reranker model or score fewer candidates. Cross-encoder is the cost driver.
  • Prefill dominates (>800 ms)
    What's the grounded prompt size?
      • Above 2.5k tokens: Tighten reranker output. 5-7 passages typically beats 10 at <2% recall cost. Biggest single TTFT lever.
      • Under 1.5k tokens: Already efficient. Try prefix caching for stable system prompt and speculative decoding for decode-phase improvements.
      • Variable / unbounded: Cap the grounded prompt length. Unbounded grounded prompts are the canonical TTFT regression mechanism.
  • Both significant: Address prefill first, since it is the bigger lever. Then revisit retrieve.
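The tree above translates directly into a function. A minimal sketch: the stage keys (`retrieve`, `prefill`), the use of `None` for an unbounded prompt, and the message for the 1.5k to 2.5k range (which the tree does not cover) are assumptions made for illustration.

```python
def diagnose(stage_ms, prompt_tokens=None, retrieval_parallel=True):
    """Walk the latency diagnostic tree and return the first recommendation.

    stage_ms: measured per-stage latency in ms, keys 'retrieve' and 'prefill'.
    prompt_tokens: typical grounded prompt size, or None if variable/unbounded.
    """
    retrieve_heavy = stage_ms.get("retrieve", 0) > 300
    prefill_heavy = stage_ms.get("prefill", 0) > 800

    # 'Both significant' resolves to prefill first, so check prefill first.
    if prefill_heavy:
        if prompt_tokens is None:
            return "cap the grounded prompt length"
        if prompt_tokens > 2500:
            return "tighten reranker output to 5-7 passages"
        if prompt_tokens < 1500:
            return "prefix-cache the system prompt; consider speculative decoding"
        # The tree is silent between 1.5k and 2.5k tokens.
        return "range not covered by the tree; measure before changing"
    if retrieve_heavy:
        if not retrieval_parallel:
            return "run BM25 and dense in parallel"
        return "use a smaller reranker or score fewer candidates"
    return "no stage over threshold; re-check the per-stage breakdown"
```

For the drill's numbers, `diagnose({"retrieve": 350, "prefill": 950}, prompt_tokens=3000)` lands on tightening the reranker output, matching the model solution's first intervention.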
Post-mortem · anonymized
Setup

Mid-stage AI startup, LLM-powered search product. TTFT regressed from 1.4 s to 2.3 s over four weeks. The team spent the first two weeks of the investigation on inference optimization (quantization, smaller model).

What happened

The team had been improving retrieval quality by increasing the number of passages in the grounded prompt — from 5 to 10 over the quarter, justified by faithfulness improvements. Each addition was a small improvement; aggregated, they doubled prefill time. The TTFT regression was a direct consequence of decisions the team had been making, but those decisions had been instrumented for retrieval quality and not for latency. The investigation focused on the model because that's where the team's instinct went; the actual cause was upstream in the grounded prompt assembly.

The moment

Week three of the investigation, a new engineer asked what the average grounded prompt size had been four weeks ago vs now. The answer was 1.4k tokens then, 2.9k tokens now. The TTFT regression was explained immediately; the team rolled back the most recent passage additions, hit the original TTFT, and resumed the faithfulness improvements with explicit latency budgeting per addition.

What they should have said

When the first passage addition was being discussed: 'Each passage added increases prefill latency. Before we add more passages, we need a latency budget for grounded prompt size and a faithfulness-vs-latency willingness-to-trade ratio. Otherwise we'll improve faithfulness and regress TTFT silently, which is what users will feel.' That conversation at the time of the first passage addition would have prevented the entire regression.

Lesson

LLM search latency lives in the grounded prompt, not the model. Faithfulness improvements via more passages compound into TTFT regressions if the budget isn't explicit. The RGSA framework forces the budget conversation by naming the stages and their costs. The wrong place to look during a TTFT regression is the model; the right place is upstream in the prompt construction.
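One way to make the budget explicit is a guard on grounded prompt size that runs alongside the faithfulness evals. A minimal sketch: the 1,800-token budget, the p95 choice, and the sample token counts are hypothetical; only the ~1.4k and ~2.9k averages come from the post-mortem.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_prompt_budget(prompt_token_counts, budget_tokens, pct=95):
    """Compare the observed prompt-size percentile against an agreed budget."""
    observed = percentile(prompt_token_counts, pct)
    return {
        "observed": observed,
        "budget": budget_tokens,
        "ok": observed <= budget_tokens,
    }


PROMPT_BUDGET_TOKENS = 1800  # hypothetical budget agreed before adding passages

# Hypothetical samples around the post-mortem's ~1.4k and ~2.9k averages.
four_weeks_ago = [1300, 1400, 1450, 1500]
now = [2700, 2900, 3000, 3100]

before = check_prompt_budget(four_weeks_ago, PROMPT_BUDGET_TOKENS)
after = check_prompt_budget(now, PROMPT_BUDGET_TOKENS)
```

A check like this would have failed on the first passage addition that crossed the budget, turning a silent regression into a trade-off discussion at review time.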