ML Systems

Recommendation Ranking System

Two-stage candidate generation + ranking for personalized recommendations at low latency.

Scale to anchor on

Hundreds of millions of users, sub-100 ms p99 latency, candidate generation over billions of items, and ranking applied to ~1k candidates per request.

Requirements

Functional

Generate a personalized candidate set per user.
Rank candidates by predicted engagement and business value.
Apply business rules, freshness, and diversity constraints.
Support A/B testing of new models without downtime.

Non-functional

Low latency under load: the p99 target has to hold at peak traffic, not just on average.
Calibrated CTR predictions where pCTR multiplies bid or value.
Resilience to feedback loops (filter bubbles, exploration vs. exploitation).

High-level architecture

Two-tower retrieval produces a candidate set via ANN over precomputed item embeddings. A heavier ranking model (multi-task, multi-objective) scores the candidates. A re-ranker enforces diversity, freshness, and business policy.

Components

User and item embedding service

Maintains tower outputs; refreshes embeddings as users and items evolve.

ANN index

Approximate nearest-neighbor vector index (for example HNSW or IVF-PQ) for sub-10 ms retrieval over billions of items.
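
As a rough illustration of the IVF idea named above (without the PQ compression step), the sketch below buckets each item under its best-matching centroid and scans only the nprobe best buckets at query time. The centroids are written by hand here; in practice they would come from clustering the item embeddings. All names and data are hypothetical, and a production system would use a dedicated ANN library rather than this code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_ivf(items, centroids):
    """Assign each item id to the inverted list of its highest-scoring centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for item_id, vec in items.items():
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        lists[best].append(item_id)
    return lists

def search_ivf(query, items, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids score highest."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]),
                   reverse=True)[:nprobe]
    candidates = [item_id for c in probe for item_id in lists[c]]
    candidates.sort(key=lambda i: dot(query, items[i]), reverse=True)
    return candidates[:k]

# Toy data: four 2-d item embeddings and two hand-picked centroids (assumptions).
items = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0), "d": (0.1, 0.9)}
centroids = [(1.0, 0.0), (0.0, 1.0)]
lists = build_ivf(items, centroids)
result = search_ivf((1.0, 0.2), items, centroids, lists, k=2, nprobe=1)
print(result)
```

Raising nprobe scans more buckets, which trades latency for recall; that is the main tuning knob in this family of indexes.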

Ranking model

Multi-task DNN scoring engagement, dwell, value, and safety per candidate.

Re-ranker

Applies diversity, freshness, exploration, and business constraints.
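
The source does not say which diversity method the re-ranker uses. One common option is maximal marginal relevance (MMR), sketched below under that assumption: each step picks the item with the best blend of ranker score and dissimilarity to what is already selected. The function name, the lam weight, and the topic-based similarity are all hypothetical.

```python
def mmr_rerank(scores, similarity, k, lam=0.7):
    """Greedy MMR: lam weights relevance, (1 - lam) penalizes redundancy."""
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:
        def mmr(item):
            # Redundancy is the highest similarity to anything already chosen.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * scores[item] - (1 - lam) * max_sim
        best = max(sorted(remaining), key=mmr)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: similarity is 1.0 for items on the same topic, else 0.0.
topics = {"news1": "news", "news2": "news", "sports1": "sports"}
scores = {"news1": 0.9, "news2": 0.85, "sports1": 0.6}

def same_topic(a, b):
    return 1.0 if topics[a] == topics[b] else 0.0

order = mmr_rerank(scores, same_topic, k=3, lam=0.7)
print(order)  # the sports item moves ahead of the second news item
```

Freshness and business rules would typically enter as extra terms in the score or as hard filters before this step.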

Feature store

Serves the same features online and offline, with point-in-time-correct lookups so training rows never see values from after the logged event.
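
A minimal sketch of what point-in-time correctness means for a single feature, assuming the feature's history is stored as timestamp-sorted (timestamp, value) pairs. The function name and the data are hypothetical; a value written exactly at the event timestamp is treated as visible here, which is a design choice a real feature store has to make explicitly.

```python
from bisect import bisect_right

def feature_as_of(history, ts):
    """Return the latest value written at or before ts, or None if none exists.

    history: list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in history]
    idx = bisect_right(times, ts)
    return history[idx - 1][1] if idx else None

# Hypothetical feature: a user's 7-day click count, updated at t=100, 200, 300.
clicks_7d = [(100, 3), (200, 5), (300, 9)]

# A training example logged at t=250 must see 5, not the later value 9.
print(feature_as_of(clicks_7d, 250))
```

Joining training labels against the latest feature value instead of the as-of value leaks future information into training and inflates offline metrics.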

Experimentation framework

Holdouts, interleaving, and counterfactual eval.
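
Interleaving can be sketched with the team-draft scheme: two rankers take turns contributing their best not-yet-shown item, and each click is credited to the ranker that contributed the clicked item. This is one well-known interleaving variant, not necessarily the one intended here; the function names and toy rankings are hypothetical.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; return the merged list and an item -> team credit map."""
    rankings = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    interleaved, credit = [], {}
    count = {"A": 0, "B": 0}
    while len(interleaved) < total:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if count["A"] < count["B"]:
            team = "A"
        elif count["B"] < count["A"]:
            team = "B"
        else:
            team = rng.choice(["A", "B"])
        pick = next((i for i in rankings[team] if i not in credit), None)
        if pick is None:
            # This team has nothing left to add, so the other team picks.
            team = "B" if team == "A" else "A"
            pick = next(i for i in rankings[team] if i not in credit)
        interleaved.append(pick)
        credit[pick] = team
        count[team] += 1
    return interleaved, credit

def credit_clicks(clicked_items, credit):
    """Count clicks per team; the team with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        wins[credit[item]] += 1
    return wins

rng = random.Random(0)  # seeded only to make the example repeatable
mixed, credit = team_draft_interleave(["x", "y", "z"], ["y", "x", "w"], rng)
print(mixed, credit_clicks(["z"], credit))
```

Because both rankers are compared within the same impression, interleaving usually needs far less traffic than a bucketed A/B test to detect a preference, though it measures relative preference rather than absolute metric lift.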

Key decisions

Two-tower over cross-encoder for retrieval.

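Item embeddings are precomputed and indexed offline, so a request costs one user-tower forward pass plus an ANN lookup. A cross-encoder has to score every user-item pair at request time, which does not fit a sub-100 ms budget over billions of items.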
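
The split between offline and online work can be sketched as follows. Each "tower" is a single hand-written linear layer, which is an assumption purely for illustration (real towers are trained networks), and the exact top-k scan stands in for the ANN lookup. All names, weights, and features are hypothetical.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tower(features, weights):
    """Toy tower: one linear layer mapping raw features to an embedding."""
    return tuple(dot(features, row) for row in weights)

# Hand-written weights (assumptions): 3 item features -> 2 dims, 2 user features -> 2 dims.
ITEM_W = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
USER_W = [(1.0, 0.0), (0.0, 1.0)]

# Offline: run the item tower once per item and store the embeddings.
item_features = {
    "i1": (1.0, 0.0, 0.0),
    "i2": (0.0, 1.0, 0.0),
    "i3": (0.5, 0.5, 0.5),
}
item_index = {item: tower(f, ITEM_W) for item, f in item_features.items()}

def retrieve(user_features, k):
    """Online: one user-tower pass, then a dot product against stored embeddings."""
    u = tower(user_features, USER_W)
    # Exact scan for clarity; an ANN index would replace this at scale.
    return heapq.nlargest(k, item_index, key=lambda item: dot(u, item_index[item]))

top = retrieve((0.2, 1.0), k=2)
print(top)
```

The key property is that the user and item sides interact only through the final dot product, which is what allows the item side to be computed ahead of time.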

Multi-task ranker over CTR-only.

Optimizing for CTR alone tends to reward clickbait; heads for deeper engagement and value pull the ranking toward long-term metrics.
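
How the heads are combined into one score is not specified here. A weighted sum with a hard safety floor is one simple possibility, sketched below; the weights, the floor, and the candidate predictions are all made-up values for illustration.

```python
def combined_score(heads, weights, safety_floor=0.5):
    """Blend per-task predictions; items below the safety floor score zero."""
    if heads["safety"] < safety_floor:
        return 0.0
    return sum(w * heads[name] for name, w in weights.items())

# Assumed weights; in practice these are tuned against online metrics.
WEIGHTS = {"engagement": 1.0, "dwell": 0.5, "value": 0.5}

# Hypothetical per-head predictions for three candidates.
candidates = {
    "clickbait": {"engagement": 0.30, "dwell": 0.10, "value": 0.10, "safety": 0.9},
    "solid": {"engagement": 0.20, "dwell": 0.60, "value": 0.50, "safety": 0.9},
    "unsafe": {"engagement": 0.90, "dwell": 0.90, "value": 0.90, "safety": 0.1},
}

by_engagement = sorted(
    candidates, key=lambda c: candidates[c]["engagement"], reverse=True)
by_combined = sorted(
    candidates, key=lambda c: combined_score(candidates[c], WEIGHTS), reverse=True)

print(by_engagement)  # single-head ordering
print(by_combined)    # multi-objective ordering
```

In this toy data the single-head ordering puts the unsafe and clickbait items first, while the blended score promotes the item with strong dwell and value.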

Explicit exploration budget.

A closed-loop system trains on what it has already shown, so its recommendations narrow over time. Bandit-style exploration collects feedback on cold items and counteracts filter-bubble collapse.
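
An exploration budget can be sketched with epsilon-greedy, the simplest bandit-style policy: each slot shows a random cold item with probability explore_prob and the next ranked item otherwise. More refined policies (Thompson sampling, UCB) follow the same shape. The function name and parameters are hypothetical, and the sketch assumes the ranked and cold lists do not overlap.

```python
import random

def fill_slots(ranked, cold_items, n_slots, explore_prob, rng):
    """Build a slate of (item, explored) pairs under an exploration budget."""
    ranked = list(ranked)
    cold = list(cold_items)
    slate = []
    while len(slate) < n_slots and (ranked or cold):
        # Explore only if a cold item is available; always explore if nothing is ranked.
        explore = bool(cold) and (not ranked or rng.random() < explore_prob)
        if explore:
            item = cold.pop(rng.randrange(len(cold)))
        else:
            item = ranked.pop(0)
        # The flag is logged so exploration traffic can be analyzed separately.
        slate.append((item, explore))
    return slate

rng = random.Random(42)  # seeded only to make the example repeatable
slate = fill_slots(["a", "b", "c", "d"], ["new1", "new2"], n_slots=4,
                   explore_prob=0.1, rng=rng)
print(slate)
```

Setting explore_prob is the budget decision: it caps how much short-term engagement is spent on learning about items the model has no data for.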

Calibration layer.

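When pCTR multiplies a bid or downstream value, the predicted probability itself matters, not just the ordering. Ranking-only metrics such as AUC are unchanged by any monotone distortion of the scores, so they cannot detect the miscalibration that skews expected value.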
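
A small worked example of why this matters, using made-up numbers: a monotone distortion of pCTR (a square root here, chosen only for illustration) leaves the pCTR ordering untouched but flips the ordering by pCTR times bid.

```python
def rank_by_pctr(pctr):
    """Order ads by predicted CTR alone, highest first."""
    return sorted(pctr, key=pctr.get, reverse=True)

def rank_by_expected_value(pctr, bids):
    """Order ads by pCTR * bid, highest first."""
    return sorted(bids, key=lambda ad: pctr[ad] * bids[ad], reverse=True)

# Hypothetical bids and well-calibrated click probabilities.
bids = {"ad1": 1.0, "ad2": 2.0}
calibrated = {"ad1": 0.10, "ad2": 0.04}

# Same model up to a monotone distortion: order-preserving, so AUC is unchanged.
distorted = {ad: p ** 0.5 for ad, p in calibrated.items()}

print(rank_by_pctr(calibrated), rank_by_pctr(distorted))
print(rank_by_expected_value(calibrated, bids))  # 0.10 vs 0.08
print(rank_by_expected_value(distorted, bids))   # about 0.32 vs 0.40
```

A calibration layer (for example Platt scaling or isotonic regression fitted on held-out data) maps raw scores back to probabilities so the product with bid or value is meaningful.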

Pitfalls

Treating offline AUC as ground truth; it often fails to predict online lift.
Forgetting position bias in training data: items shown higher collect more clicks regardless of relevance.
Skipping calibration; downstream value gets distorted silently.
Evaluating retrieval and ranking only together, so failures get misattributed.
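
On the last pitfall, one way to score retrieval on its own is recall@k: the fraction of items the user later engaged with that appear in the retrieved candidate set, before any ranking is applied. This is a sketch under the assumption that such engagement labels are available; the function name and data are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items present in the top-k retrieved candidates.

    Returns None when there are no relevant items to find.
    """
    relevant = set(relevant)
    if not relevant:
        return None
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical request: retrieval output vs. items the user later engaged with.
retrieved = ["a", "b", "c", "d"]
engaged = {"b", "d", "z"}

print(recall_at_k(retrieved, engaged, k=3))  # only "b" is in the top 3
print(recall_at_k(retrieved, engaged, k=4))  # "b" and "d" are found; "z" was never retrieved
```

If recall@k is low, no ranking improvement can recover the missing items, which is why the two stages need separate metrics.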

Follow-up questions

How do you measure retrieval quality independently of ranking?
How do you balance exploration vs. exploitation in this loop?
How do new items get a chance to be ranked?
How do you safely roll out a new ranking model?

Related patterns

caching, queue-decoupling, feature-flags