Module 2 · Lesson 2 · Core · 34 min

RAG Beyond the Tutorial: The Retrieval Quality Loop

Production RAG systems fail at a small number of identifiable stages, and most candidates conflate them. The Retrieval Quality Loop is the four-stage diagnostic that turns 'RAG accuracy is bad' from a vague category into a specific, fixable problem.

Talk to a production RAG team that is six months in and the conversation always sounds the same. 'We tried smaller chunks, bigger chunks, semantic chunks, overlapping chunks. We tried three embedding models. We added a reranker. We tried query rewriting. Accuracy went from 60% to 64% and we're stuck.' The reason they are stuck is not in any of the levers they pulled. It is in the absence of a measurement that would have told them which lever to pull. They were optimizing retrieval when the bottleneck was synthesis, or vice versa, and they had no way to know.

The Retrieval Quality Loop is the four-stage framework that separates retrieval from synthesis, measures each independently, and gives you the cheapest-to-most-expensive sequence of fixes for whichever stage is binding. The framework's job is not to be a checklist; it is to make the most expensive RAG conversation — 'we tried everything and nothing worked' — impossible to have.

Framework

The Retrieval Quality Loop

RAG fails for structural reasons that no amount of chunking strategy fixes. The Retrieval Quality Loop is the four-stage diagnostic that lets you locate the failing stage of a production RAG system in minutes once the measurements exist. It is also the structure most candidates wish they had used when answering 'why is your RAG system's accuracy low?' in interviews: without it, every answer collapses into a tour of chunking, embeddings, and rerankers without naming where in the loop the failure lives.

Stage 1 — Measure retrieval quality independent of generation
Build a labeled query → relevant-passages set. Compute recall@k and MRR. This number must exist and must be tracked separately from end-to-end answer quality. Skipping this stage means every later debugging conversation is guessing.
Stage 2 — Fix retrieval (if Stage 1 is the bottleneck)
The levers are the embedding model, chunking strategy, hybrid retrieval (BM25 + dense), reranking, and query rewriting. The order matters. Ranked roughly by impact per unit of cost, best first: reranking, query rewriting, hybrid retrieval, embedding model, chunking. Most teams reach for chunking first; chunking is usually the last lever.
Stage 3 — Measure synthesis quality assuming perfect retrieval
Feed the model the labeled relevant passages and measure faithfulness, completeness, citation correctness. This isolates whether the generation step is the bottleneck — and most production RAG failures that look like 'wrong answer' turn out to be Stage 3 failures, not retrieval failures.
Stage 4 — Fix synthesis (if Stage 3 is the bottleneck)
Prompt structure, model choice, constrained decoding, citation enforcement, contradiction detection. Often the cheapest fix and the most under-attempted because teams default to blaming retrieval.
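
The Stage 1 numbers can be computed with very little code once the labeled set exists. The following is a minimal sketch, assuming each query maps to a set of relevant passage IDs and each retrieval run is a ranked list of passage IDs. The function names and the toy data are illustrative, not part of any particular library.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant passages that appear in the top k."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant passage")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(runs: Dict[str, List[str]],
                       labels: Dict[str, Set[str]],
                       k: int = 10) -> Dict[str, float]:
    """Average recall@k and MRR over every labeled query."""
    if not labels:
        raise ValueError("the labeled set is empty")
    recalls = [recall_at_k(runs.get(q, []), rel, k) for q, rel in labels.items()]
    rrs = [reciprocal_rank(runs.get(q, []), rel) for q, rel in labels.items()]
    n = len(labels)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(rrs) / n}


# Hypothetical toy data: the passage IDs are made up for illustration.
labels = {"q1": {"p1", "p4"}, "q2": {"p9"}}
runs = {"q1": ["p4", "p2", "p1"], "q2": ["p3", "p7", "p8"]}
metrics = evaluate_retrieval(runs, labels, k=3)
print(metrics)
```

A query missing from `runs` counts as a total miss rather than being skipped, so gaps in logging cannot inflate the score.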

When to use

Apply the loop to any RAG interview prompt, any production RAG debugging conversation, any 'why is the AI giving wrong answers?' question. The loop's value is in the order: you cannot fix what you have not isolated, and most RAG conversations skip Stage 1 entirely.

Worked example

Interview prompt: 'Your RAG system has 60% answer accuracy. How would you improve it?' Mid-level answer: 'I'd try a better embedding model, smaller chunks, maybe a reranker.' Staff answer: 'Before improving anything, I need to know whether the failure is at Stage 1 or Stage 3. Stage 1: run a labeled query set through the retriever and measure recall@k. If recall is high but answers are wrong, the failure is at Stage 3, synthesis. If recall is low, fix retrieval first. Without that decomposition, every change is a guess. Most production RAG systems I've seen with 60% accuracy turn out to have 85% recall@10, and the bottleneck is at Stage 3: the model is getting the right passages and still hallucinating.'
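
One way to run that decomposition is to score the same questions twice: once with whatever the retriever returns, and once with the labeled gold passages. The sketch below assumes three hypothetical callables (`retrieve`, `answer`, `judge`) that stand in for your retriever, your generation step, and your grading method. The stubs at the bottom are toys that only make the example runnable.

```python
from typing import Callable, Dict, List

Answerer = Callable[[str, List[str]], str]   # (question, passages) -> answer
Judge = Callable[[str, str], bool]           # (answer, gold_answer) -> correct?


def decompose(examples: List[Dict],
              retrieve: Callable[[str], List[str]],
              answer: Answerer,
              judge: Judge) -> Dict[str, float]:
    """Accuracy with retrieved passages vs. accuracy with gold passages."""
    end_to_end = oracle = 0
    for ex in examples:
        retrieved = retrieve(ex["question"])
        end_to_end += judge(answer(ex["question"], retrieved), ex["gold_answer"])
        oracle += judge(answer(ex["question"], ex["gold_passages"]), ex["gold_answer"])
    n = len(examples)
    return {"end_to_end_accuracy": end_to_end / n, "oracle_accuracy": oracle / n}


# Toy stand-ins (assumptions for illustration only).
examples = [
    {"question": "q1", "gold_passages": ["Paris"], "gold_answer": "Paris"},
    {"question": "q2", "gold_passages": ["Berlin"], "gold_answer": "Berlin"},
]
fake_index = {"q1": ["Paris"], "q2": ["Madrid"]}
result = decompose(
    examples,
    retrieve=lambda q: fake_index[q],
    answer=lambda q, passages: passages[0],   # stub "model": echoes first passage
    judge=lambda ans, gold: ans == gold,      # stub judge: exact match
)
print(result)
```

If `oracle_accuracy` is low, synthesis is the bottleneck no matter how good retrieval is. If it is high and `end_to_end_accuracy` is much lower, the gap points at retrieval.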

Calibration ladder

Your RAG system returns the wrong answer 30% of the time. How do you start fixing it?

The interviewer wants to see whether you can decompose the failure or whether you reach for fixes.

L4 · Mid

I'd try a better embedding model and smaller chunks. Maybe add a reranker.

Missed: Reached for fixes without measuring. Will spend three months trying levers and report a 4% accuracy gain.

L5 · Senior

I'd evaluate where the failure lives. Is it retrieving the wrong documents, or generating wrong answers from right documents? I'd build a labeled set and measure recall@k. If recall is low, work on retrieval; if recall is high but answers are wrong, work on the prompt or model.

Missed: Knew to decompose but didn't commit to which order to fix. Will work through fixes inefficiently.

L6 · Staff

Stage 1 first — labeled query-to-passage set, measure recall@k and MRR. If retrieval is the problem, fix in cost order: reranker first, then query rewriting, then hybrid (BM25 + dense), then embedding model, then chunking. If retrieval is good (say recall@10 above 85%) and answers are still wrong, the problem is synthesis. Stage 3 — feed the labeled passages and measure faithfulness with a separate eval. Most production RAG failures turn out to be Stage 3, not Stage 1, and teams that skip the decomposition end up over-investing in retrieval changes that don't move accuracy.

Missed: Strong decomposition with cost-ordering. Missing the meta-move — that the labeled eval set is itself a system commitment, and that corpus quality upstream of retrieval can be the actual problem.

L7 · Principal

Same decomposition with two additions. (1) The labeled query set itself is a non-trivial investment that the team has to commit to as a continuing system, not a one-time effort. Labeling drifts as the corpus evolves; if the team can't commit to that, they can't run the loop, and the design has to account for that — possibly with LLM-as-judge as a noisier substitute with explicit acknowledgment of its limitations. (2) The Stage 3 failure mode that most teams miss is when the model is faithful to retrieved passages that are themselves wrong or out of date. Faithfulness is necessary but not sufficient; corpus freshness and document-level quality monitoring are the upstream failure mode that no amount of synthesis fixing addresses. The pattern: RAG is a multi-stage pipeline and 'fix the model' or 'fix the retriever' usually misses that the upstream data system is the binding constraint. Same pattern as the training-data-is-the-system insight from the recsys lesson.

What scored L7

Named the labeled-set-as-continuing-commitment dependency, and pointed at corpus freshness and document-level quality as the upstream failure mode most teams miss. Connected it to the training-data-as-system pattern from the recsys lesson — RAG and recsys share the structural truth that the data infrastructure is more important than the model. This is the L7 move that scales across lessons.
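
A corpus-freshness check can start very simply. The sketch below assumes you can obtain a last-updated date per document. The 180-day threshold and the file names are arbitrary placeholders, and a real audit would also need document-level quality signals that this does not attempt.

```python
from datetime import date
from typing import Dict


def stale_fraction(last_updated: Dict[str, date], today: date,
                   max_age_days: int = 180) -> float:
    """Share of documents not updated within max_age_days of `today`."""
    if not last_updated:
        raise ValueError("corpus is empty")
    stale = [doc for doc, updated in last_updated.items()
             if (today - updated).days > max_age_days]
    return len(stale) / len(last_updated)


# Hypothetical corpus metadata, for illustration only.
corpus = {
    "pricing.md": date(2023, 1, 10),
    "onboarding.md": date(2024, 5, 1),
    "sla.md": date(2024, 6, 15),
    "faq.md": date(2022, 11, 3),
}
fraction = stale_fraction(corpus, today=date(2024, 7, 1))
print(fraction)
```

Passing `today` explicitly keeps the check reproducible, which matters if the number is tracked over time alongside recall@k and faithfulness.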

Pattern recognition

When you see

Anyone says 'our RAG is bad, we should try [retrieval lever]' without showing you a retrieval-quality measurement.

→

Think

Stop. Demand the Stage 1 measurement first. If they don't have it, the design conversation pauses until they build it.

Roughly half of the production RAG teams working on 'retrieval quality' are tuning the wrong stage. The cost of building the labeled set is small relative to the cost of three months of unfocused tuning. The cost of not building it is open-ended: without the measurement there is no closed loop, and the system stays stuck at whatever accuracy the most recent unprincipled tweak landed on.

Dimension	Add a reranker	Query rewriting	Hybrid (BM25 + dense)	Better embedding model	Re-chunking strategy
C — implementation cost	Low. Drop-in cross-encoder.	Low. Prompt change + LLM call.	Moderate. New index, fusion logic.	High. Re-embed corpus, A/B.	High. Re-chunk and re-embed full corpus.
A — typical recall@10 gain	+5–15% on recall@10 typical.	+2–10%. High variance.	+3–8%. Best on entity-heavy corpora.	+1–8%. Often disappointing.	+1–5% typical. Sometimes nothing.
R — risk of regression	Low.	Moderate — bad rewrites can degrade.	Low.	Real — embedding space change can shift behavior unpredictably.	Moderate. Boundary effects.
K — ops complexity added	Adds an inference step; usually fine.	Adds latency and a failure path.	Two indices to maintain.	Full reindex and version-aware retrieval.	Full reindex.
Choose when	Default first lever. Highest impact-per-cost ratio.	When queries are short, vague, or under-specified — typical for end-user chat.	When corpus has many proper nouns, codes, IDs that dense embeddings smooth over.	Last resort within retrieval. Usually disappointing relative to cost.	Often the wrong lever. Try the others first.

Verdict

The order in which most teams pull these levers is roughly the reverse of the optimal order. Rerank first. Chunking last. The cost-to-impact ratio dominates, and most teams underweight the reranker because it feels like a small change.
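
The 'fusion logic' for the hybrid lever is often the least familiar piece. One common option is reciprocal rank fusion (RRF), sketched below. It is chosen here for illustration and is not named by this lesson. The sketch assumes you already have one ranked list from BM25 and one from the dense index. The document IDs are made up, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_sku_4471", "doc_a", "doc_b"]    # hypothetical lexical results
dense_hits = ["doc_a", "doc_c", "doc_sku_4471"]   # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because RRF uses ranks rather than raw scores, it avoids having to calibrate BM25 scores against cosine similarities. Whether the fused list actually helps is a question for the Stage 1 measurement.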

Drill · 12 minutes

Practice this. Time yourself.

You have 12 minutes. A team tells you their production RAG system has 65% answer accuracy and they've already tried two embedding models, three chunk sizes, and a reranker with no improvement. Write a 4-paragraph diagnostic: (1) the first three measurements you'd demand, (2) the most likely failure stage given what they've tried, (3) the next two interventions in cost order, (4) the meta-question you'd ask the team about their commitment to the eval set.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Diagnostic priority	Suggested more retrieval levers.	Demanded Stage 1 measurement (recall@k on labeled set).	Demanded recall@k AND faithfulness on labeled passages AND end-to-end answer eval — three independent measurements.	Demanded all three plus a corpus-staleness audit — the upstream failure mode that no amount of retrieval/synthesis tuning addresses.
Failure stage inference	Could not identify a likely stage.	Said 'probably synthesis given retrieval levers were already tried.'	Said 'almost certainly Stage 3 — they've tuned retrieval to a wall; the next 30% is in synthesis or upstream.'	Said Stage 3 and named the specific Stage 3 failure modes to investigate first: faithfulness to wrong passages, citation correctness, contradiction handling.
Cost-ordered interventions	Listed unranked options.	Ranked by perceived impact.	Ranked by cost-to-impact ratio: prompt structure, constrained decoding, citation enforcement.	Ranked by C/A ratio AND named which would be ruled out at each step, so the team has a fast decision tree, not just an option list.
Meta-question about eval set	Did not ask.	Asked if a labeled set existed.	Asked if it exists, how often it's updated, and whether the team has owned that as a continuing investment.	Named that the labeled set is a continuing system requiring dedicated ownership and asked what the team's commitment to that ownership is. If they can't commit, the framework can't run — and that itself is the most important diagnostic finding.

Model solution

The three measurements I'd demand before any further work. (1) Recall@10 on a labeled query → relevant-passage set. If the team doesn't have this set, that is the diagnostic finding: they've been tuning blindly. (2) Faithfulness score on a labeled (passage, answer) set, computed with an LLM judge anchored to human-labeled gold examples. This isolates Stage 3 quality from retrieval quality. (3) End-to-end answer accuracy on a holdout set (the existing 65% metric), with the per-query error category labeled: wrong passages retrieved, right passages but wrong answer, right passages but unsupported claim. The error-category breakdown is the single highest-information measurement and the one most teams skip.

Most likely failure stage. Given the team has already tuned retrieval levers with no movement, the bottleneck is almost certainly Stage 3, synthesis. A reranker plus embedding tuning typically gets recall@10 above 85% on most corpora, which means the model is being handed relevant passages most of the time. If accuracy is still 65%, the failure is happening at generation: the model is hallucinating beyond the passages, citing wrong passages, or being confidently wrong on edge cases. There's a less likely third possibility: the corpus itself is stale or low-quality, which no retrieval or synthesis fix addresses.

Cost-ordered interventions. (1) Prompt structure changes: adding explicit grounding instructions, requiring a citation per claim, structured-output formatting. Cost: hours. Typical impact: +5-15% on faithfulness. (2) Constrained decoding for citation correctness: every claim must be tied to a span of a retrieved passage. Cost: days. Typical impact: +5-10% on accuracy, and it dramatically reduces 'confident wrong' answers, which are the worst failure mode for trust. If neither of these moves the needle by week 2, the problem is upstream of synthesis and we audit the corpus: staleness, document quality, conflicting sources within the corpus that the model can't reconcile.

Meta-question about the eval set. Does the team own the labeled query set as a continuing system, with a named owner, an update cadence (weekly or biweekly), and a budget for re-labeling as the corpus evolves? If yes, the framework runs. If no (and most teams say no), the team has been operating blind, and any improvement we make is uninstrumented and can regress without notice. The eval set is the system that makes the rest of the system improvable. If the team can't commit to owning it, the right answer is to scope down the RAG ambition to a corpus stable enough to evaluate, because nothing about retrieval or synthesis tuning is reliable without that closed loop.
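
The per-query error-category breakdown from measurement (3) is easy to tabulate once each failed query carries a few labels. This is a minimal sketch, assuming each record has three hand-labeled booleans. The field names are hypothetical and the records are toy data.

```python
from collections import Counter
from typing import Dict, List


def categorize(record: Dict) -> str:
    """Assign one category per query from three labeled booleans."""
    if record["answer_correct"]:
        return "correct"
    if not record["gold_passage_retrieved"]:
        return "wrong_passages_retrieved"
    if not record["claims_supported"]:
        return "right_passages_unsupported_claim"
    return "right_passages_wrong_answer"


def breakdown(records: List[Dict]) -> Dict[str, float]:
    """Share of queries falling into each category."""
    counts = Counter(categorize(r) for r in records)
    return {category: count / len(records) for category, count in counts.items()}


# Toy labeled records (assumptions for illustration only).
records = [
    {"answer_correct": True, "gold_passage_retrieved": True, "claims_supported": True},
    {"answer_correct": False, "gold_passage_retrieved": False, "claims_supported": False},
    {"answer_correct": False, "gold_passage_retrieved": True, "claims_supported": False},
    {"answer_correct": False, "gold_passage_retrieved": True, "claims_supported": True},
]
report = breakdown(records)
print(report)
```

The first category that dominates the failures tells you which stage to work on. The order of the checks in `categorize` is a design choice: a retrieval miss is attributed to retrieval even if the answer also contained unsupported claims.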

Common failures

✗ Suggested more retrieval levers despite the team having already exhausted them. The whole point of the diagnostic is to move past retrieval when retrieval isn't the bottleneck.
✗ Did not demand the labeled eval set as a precondition. The framework is unrunnable without it; treating it as optional misses the load-bearing point.
✗ Treated synthesis as 'just prompt engineering.' The Stage 4 fixes are real engineering (constrained decoding, citation enforcement, structured outputs), not vibes-based prompt tweaking.
✗ Did not name corpus staleness as an upstream failure mode. The data system is usually the real bottleneck in mature RAG; missing this is the canonical L6-vs-L7 gap.
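
To make 'citation enforcement' concrete, here is a minimal post-hoc check, assuming the model is prompted to return each claim with a cited passage ID and a verbatim quote. The output schema and field names are hypothetical. This validates citations after generation; it is not constrained decoding, which restricts what the model can emit in the first place.

```python
from typing import Dict, List, Tuple


def check_citations(claims: List[Dict[str, str]],
                    passages: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (claim index, reason) for each claim whose citation fails."""
    failures = []
    for i, claim in enumerate(claims):
        passage = passages.get(claim.get("passage_id", ""))
        quote = claim.get("quote", "")
        if passage is None:
            failures.append((i, "cites a passage that was not retrieved"))
        elif not quote or quote not in passage:
            failures.append((i, "quoted span not found in cited passage"))
    return failures


# Toy passages and model output (assumptions for illustration only).
passages = {"p1": "Refunds are processed within 14 days of the request."}
claims = [
    {"text": "Refunds take up to 14 days.", "passage_id": "p1", "quote": "within 14 days"},
    {"text": "Refunds are instant.", "passage_id": "p1", "quote": "processed instantly"},
    {"text": "Fees are waived.", "passage_id": "p7", "quote": "fees are waived"},
]
failures = check_citations(claims, passages)
print(failures)
```

An exact-substring match is strict and will reject paraphrased quotes. That is usually the safer default for a trust-sensitive product, but it is a judgment call.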

Artifact · decision tree

The RAG Failure Triage Decision Tree

Do you have a labeled query → relevant-passage set with recent measurements?

→ No: STOP all tuning. Build the eval set first. Every change without this measurement is a guess. Assign an owner and an update cadence.

→ Yes: what's recall@10?

What's recall@10 on the labeled set?

→ < 60%: Retrieval is the bottleneck. Stage 2 fixes in cost order: reranker → query rewrite → hybrid retrieval → embedding model → chunking.

→ 60-80%: Mixed bottleneck. Add a reranker if not present (recall ceiling), then measure faithfulness on labeled passages.

→ > 80%: Retrieval is mostly fine. Go to Stage 3 and measure synthesis quality on labeled (passage, answer) pairs.
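
The tree is small enough to encode directly, which makes it usable in an eval report or a CI check. This sketch takes the thresholds from the tree above, treats `None` as 'no labeled set or no recent measurement', and puts the exact boundary values (0.60 and 0.80) in the middle band. The function name and return strings are illustrative.

```python
from typing import Optional


def triage(recall_at_10: Optional[float]) -> str:
    """Map a recall@10 measurement (0.0 to 1.0) to the next step in the tree."""
    if recall_at_10 is None:
        return "stop tuning: build the labeled eval set, assign an owner and cadence"
    if not 0.0 <= recall_at_10 <= 1.0:
        raise ValueError("recall_at_10 must be a fraction between 0 and 1")
    if recall_at_10 < 0.60:
        return "stage 2: reranker, query rewrite, hybrid, embedding model, chunking"
    if recall_at_10 <= 0.80:
        return "mixed: add a reranker if absent, then measure faithfulness"
    return "stage 3: measure synthesis on labeled (passage, answer) pairs"


for value in (None, 0.45, 0.70, 0.87):
    print(value, "->", triage(value))
```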

Post-mortem · anonymized

Setup

Series B startup, 6 engineers, building an internal-docs RAG product. Three months invested in retrieval tuning — three embedding models, three chunking strategies, two rerankers, custom query rewriter. Accuracy moved from 58% to 64%.

What happened

The team treated 'low accuracy' as a retrieval problem because that's where the interesting research papers are. They had no labeled retrieval eval set; they tracked only end-to-end accuracy on a single holdout set that hadn't been updated in two months. After three months, the founder hired a Staff engineer to figure out why the product wasn't shipping. Her first move was to build a 200-query labeled eval set in a week. Recall@10 was 87%. The 36% of answers that were wrong were almost entirely synthesis failures: the model was being handed correct passages and producing confidently wrong claims about them.

The moment

Week one of the new Staff engineer's investigation. She ran the labeled eval and showed the team that recall@10 was 87% and faithfulness was 64%. The team had been optimizing the wrong stage for three months. The fix was prompt structure changes and citation enforcement — three days of work — that took accuracy from 64% to 81% with zero retrieval changes. The retrospective conclusion was that the team had been operating without the measurement that would have told them which lever to pull.

What they should have said

At week one of the project: 'We need a labeled eval set before any tuning. The labeled set is the system that makes the rest of the system improvable. We'll spend a week building 200 labeled queries and we'll update it every two weeks as the corpus evolves.' Three months of tuning would have collapsed into roughly four weeks of measured, targeted work. None of it required new technical skills; it required the framework that orders the decisions.

Lesson

Production RAG fails for measurable reasons. The Retrieval Quality Loop is the measurement that makes the failures fixable in order of impact. Teams that skip the loop spend months tuning the wrong stage; teams that run it ship in weeks. The framework's value is not its sophistication — it is its order.