Hybrid Retrieval with BM25 and Dense Embeddings

Lexical and semantic retrieval fail on opposite query distributions. Hybrid retrieval with reciprocal rank fusion does not interpolate, it votes - and the vote wins on every query class.

Type: Build Languages: Python Prerequisites: Phase 11 lessons 04 (embeddings), 06 (RAG); Phase 19 Track B foundations (lessons 20-29); Phase 19 lesson 64 (chunking strategies) Time: ~90 minutes

Learning Objectives

Implement BM25 from scratch from the Robertson and Sparck Jones formulation, with field weighting, document length normalization, and tunable k1 and b.
Build a dense retriever on top of a deterministic mock embedding so the loop runs offline.
Implement reciprocal rank fusion exactly as Cormack, Clarke, and Buettcher published it in 2009, and explain why it dominates score-weighted interpolation.
Tune the RRF k constant and the per-modality weights and read the trade-offs on a small fixture corpus.

The Problem

Lexical search wins when the query carries a literal identifier the corpus contains verbatim. A query for AbortMultipartOnFail returns the right Go function via BM25 in microseconds. The same query, embedded, sits at the boundary of three similarity clusters and a dense retriever ranks the wrong file first.

Dense search wins when the query is paraphrased away from the corpus's literal tokens. A user asking "how do we handle cancelled uploads" never typed the word abort or multipart. BM25 returns the documentation chunk on "uploading large files" because that page contains the word uploads. Dense retrieval finds the abort function whose summary mentions cancellation.

The choice between the two is not a static one. The query distribution is the variable. A production RAG system handles both classes from the same endpoint, so retrieval has to handle both at once. That is hybrid retrieval. The merge step is the part that has to be right.

The Concept

flowchart LR
  Query[Query] --> BM25[BM25 Index]
  Query --> Dense[Dense Index]
  BM25 --> RanksA[Ranked List A]
  Dense --> RanksB[Ranked List B]
  RanksA --> RRF[Reciprocal Rank Fusion]
  RanksB --> RRF
  RRF --> Top[Top-k Chunks]

BM25 in one paragraph

BM25 scores a query-document pair by summing, over query terms, an inverse document frequency factor multiplied by a saturating term-frequency factor that includes a length-normalization correction. Two knobs. k1 controls term-frequency saturation; the default 1.5 is the published recommendation and you should not move it without a benchmark. b controls how much document length matters; the default 0.75 says longer documents are penalized, but not linearly.

The IDF formula uses the smoothed Robertson and Sparck Jones definition, which is log((N - df + 0.5) / (df + 0.5) + 1). The plus-one inside the log keeps the IDF positive when a term appears in more than half the corpus. This matters in small corpora where stopwords are technically rare.

Field weighting lets you tell BM25 that a match on the symbol name counts more than a match in the body. Implementation is a multiplier on the term counts during indexing, not at scoring time. That keeps the math identical and avoids a separate score per field.

Dense retrieval in one paragraph

Embed each chunk into a fixed-dimension vector with an embedding model. At query time, embed the query, cosine-rank every chunk by similarity, and return the top-k. The model is the variable that decides quality. The retrieval algorithm itself is two lines: dot product and sort.

This lesson uses a deterministic hash-based embedding so you can read the fusion math without a network call. The hash sums token-keyed offsets into a 96-dimensional vector and normalizes. The cosine ranks are deterministic across runs, which is what the test suite requires.

Reciprocal rank fusion, the published formula

Two ranked lists. For each candidate that appears in either list, sum its reciprocal-rank contributions. The 2009 paper used 1 / (k + rank) with k equal to 60 as the default. Sort by total score. That is the whole algorithm.

The published constant k = 60 is not arbitrary. With k = 60 the rank-1 contribution is 1 / 61 and the rank-10 contribution is 1 / 70. The contribution decays slowly so deep candidates still vote. Smaller k makes the top results dominate. Larger k flattens the contribution curve.

Two tunable knobs in our implementation. The k constant. A pair of per-modality weights so you can boost BM25 or dense when you have prior evidence one is better on your corpus. Multiplying the rank contribution by the weight is the simplest principled implementation; it preserves the rank-decay shape and stays scale-free.

Why fusion beats score-weighted interpolation

BM25 scores are unbounded and corpus-dependent. Cosine similarities are bounded in -1 to 1. A linear combination alpha * bm25 + (1 - alpha) * cosine requires per-corpus alpha tuning and breaks every time you reindex. The rank-based fusion does not. Two ranks are comparable across modalities. The published RRF baseline beats score-interpolation in every public TREC track since 2010.

This is the same argument you hear about RankFusion vs RRF in Vespa and Weaviate documentation. They came to the same conclusion: stay rank-based unless you have very strong evidence to interpolate scores.

Build It

code/main.py implements:

tokenize(text) - a fast regex tokenizer.
BM25Index - field-weighted, with add and search and tunable k1, b.
mock_embed, DenseIndex - the same deterministic embedding as lesson 64 so chunks are comparable.
rrf(rankings, k, weights) - the published fusion with multi-modality weights.
HybridRetriever - combines BM25 and dense.
A demo main() that loads a small fixture corpus, runs three queries that target each retriever's strength and weakness, and prints the rankings each modality produced plus the fused list.

Run it:

python3 code/main.py

Read the demo output side by side. The literal identifier query lands at BM25 rank 1, dense rank 4, RRF rank 1. The paraphrased query lands at BM25 rank 6, dense rank 1, RRF rank 1. The ambiguous query lands at BM25 rank 3, dense rank 3, RRF rank 1. The fusion is not a tie-breaker; it is the system that wins on every query class.

Tuning the knobs

Knob	Default	Move it up when	Move it down when
BM25 k1	1.5	Terms repeat in documents and you want frequency to matter more	Documents are short and term repetition is noise
BM25 b	0.75	Long documents really do say less per word	Document length is uncorrelated with topic
RRF k	60	Deep candidates should keep voting	The top-1 should dominate
BM25 weight	1.0	Your corpus contains literal identifiers and queries match them	Your queries are user-paraphrased
Dense weight	1.0	Queries are paraphrased	Queries are literal

Tune by re-running lesson 68's eval harness on your held-out query set, not by intuition.

Failure modes the demo will hide

Out-of-vocabulary tokens. BM25's IDF is computed from the corpus, so terms only in the query contribute zero. Dense embeddings hallucinate a vector for the same term. On out-of-corpus identifiers the dense modality returns plausible-looking but wrong neighbors. The fusion absorbs this because BM25 returns nothing and the rank contribution drops out, but only if you de-duplicate by document, not by chunk.

Stop-token domination. BM25 against the word "the" produces a uniform ranking over the corpus. Filter stop tokens in the indexer or accept that high-IDF terms dominate naturally.

Identical content across modalities. If your corpus is small enough that the top-1 of BM25 is also the top-1 of dense, RRF gives you the same top-1 with the same neighbors. That is correct behavior, not a failure, but it makes the fusion look invisible. Add an adversarial query pair in your eval to verify the fusion is actually working.

Use It

Production patterns:

Index BM25 in process; the bottleneck is the term-frequency dictionary, not the vectors.
Index dense vectors in a separate store (in this lesson we use a flat list; in production you would use HNSW).
Run both queries in parallel; the fusion is a constant-time merge over the union.
Persist the modality of each retrieved hit so a downstream reranker can see which modality voted for it.

Ship It

Lesson 66 takes the fused top-k from this lesson and reranks with a cross-encoder. Lesson 68 evaluates the entire pipeline with precision, recall, MRR, and nDCG. The hybrid retriever in this lesson is the first stage of the end-to-end system in lesson 69.

Exercises

Replace mock_embed with a real model from your provider. Re-run the demo and report how the dense-only ranking changes on the paraphrased query.
Add a third modality: chunk summaries indexed separately and fused as a third ranked list. Measure the gain.
Sweep RRF k across 10, 30, 60, 100, 200. Plot the recall@k curve from lesson 68. Report the value of k where the curve peaks on your corpus.
Implement BM25F properly (per-field length normalization rather than the multiplier trick) and compare on a corpus where symbol matches matter most.

Key Terms

Term	What people say	What it actually means
BM25	"Lexical search"	Probabilistic ranking with idf x saturating tf x length normalization
RRF	"Rank fusion"	Sum of 1 / (k + rank) across ranked lists; k = 60 default
k1	"TF saturation"	Controls how fast a repeated term stops adding more score
b	"Length penalty"	0 means ignore document length, 1 means full normalization
Field weighting	"Symbol boost"	Repeat tokens during indexing to boost matches in that field
Rank-based vs score-based fusion	"Why RRF beats linear"	Ranks are comparable across modalities; scores are not