ML Systems

Recommendation Ranking System

Two-stage pipeline (candidate generation, then ranking) that serves personalized recommendations at low latency.

Scale to anchor on

Hundreds of millions of users, sub-100 ms p99, candidate generation over billions of items, ranking applied to ~1k candidates per request.

Requirements

Functional

  • Generate a personalized candidate set per user.
  • Rank candidates by predicted engagement and business value.
  • Apply business rules, freshness, and diversity constraints.
  • Support A/B testing of new models without downtime.
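One way to support A/B testing without downtime is to make arm assignment a pure function of the user ID, so both models stay deployed and each request is routed on the fly. The sketch below assumes hash-based bucketing. The function name `assign_arm`, the experiment name, and the 10% treatment share are illustrative and not part of the original design.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.1) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent
    # across experiments that run at the same time.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if point < treatment_share else "control"


# Hypothetical usage: the observed share should land near the configured one.
arms = [assign_arm(f"user-{i}", "ranker_v2") for i in range(10_000)]
print(arms.count("treatment") / len(arms))
```

Because the mapping is stateless, ramping a model up or down is a configuration change to `treatment_share` rather than a redeploy, and a given user keeps seeing the same arm for the life of the experiment.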

Non-functional

  • Low latency under load: sub-100 ms p99 at peak traffic.
  • Calibrated CTR predictions where pCTR multiplies bid or value.
  • Resilience to feedback loops (filter bubbles, exploration vs. exploitation).
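The calibration requirement can be illustrated with a toy check. A monotone distortion of the scores leaves a ranking metric such as AUC unchanged, yet it shifts the mean predicted CTR away from the observed rate, which is what matters once pCTR is multiplied by a bid. The labels, scores, and helper names below are made up for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_ratio(labels, scores):
    """Mean predicted CTR divided by observed CTR; 1.0 means calibrated on average."""
    return (sum(scores) / len(scores)) / (sum(labels) / len(labels))


# Toy data (hypothetical): 3 clicks in 10 impressions, mean prediction 0.3.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p_ctr = [0.6, 0.2, 0.1, 0.4, 0.5, 0.2, 0.1, 0.6, 0.2, 0.1]
# A monotone distortion: the ordering is preserved, the probabilities are not.
inflated = [p ** 0.5 for p in p_ctr]

print(auc(labels, p_ctr), auc(labels, inflated))            # identical AUC
print(calibration_ratio(labels, p_ctr))                     # about 1.0
print(calibration_ratio(labels, inflated))                  # well above 1.0
```

A ratio well above 1.0 means the system would systematically overvalue every impression even though offline ranking metrics look untouched.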

High-level architecture

Two-tower retrieval produces a candidate set via ANN over precomputed item embeddings. A heavier ranking model (multi-task, multi-objective) scores the candidates. A re-ranker enforces diversity, freshness, and business policy.
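The three stages described above can be sketched end to end in a few lines. This is a toy under loud assumptions: an exhaustive dot-product scan stands in for the ANN index, a lookup table stands in for the multi-task ranker, and a per-category cap stands in for the diversity policy. All names (`retrieve`, `rank`, `rerank`) and all data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec: Vector, item_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: top-k items by dot product (exhaustive scan in place of ANN)."""
    ranked = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return ranked[:k]


def rank(candidates: List[str], score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Stage 2: score every candidate with the heavier model, best first."""
    return sorted(((c, score_fn(c)) for c in candidates), key=lambda p: p[1], reverse=True)


def rerank(scored: List[Tuple[str, float]], category: Dict[str, str],
           max_per_category: int, n: int) -> List[str]:
    """Stage 3: greedy pass that caps how many items one category may contribute."""
    taken: Dict[str, int] = {}
    slate: List[str] = []
    for item, _ in scored:
        if len(slate) >= n:
            break
        cat = category[item]
        if taken.get(cat, 0) < max_per_category:
            slate.append(item)
            taken[cat] = taken.get(cat, 0) + 1
    return slate


# Toy data (hypothetical): precomputed item-tower outputs and one user-tower output.
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5], "e": [-1.0, 0.0]}
user_vec = [1.0, 0.2]
ranker_scores = {"a": 0.30, "b": 0.45, "c": 0.10, "d": 0.25}
category = {"a": "news", "b": "news", "c": "sports", "d": "music"}

candidates = retrieve(user_vec, item_vecs, k=4)
scored = rank(candidates, ranker_scores.__getitem__)
print(rerank(scored, category, max_per_category=1, n=3))
```

Keeping the stages as separate functions mirrors the architecture: only stage 1 touches the full catalog, so only stage 1 needs the precomputed embeddings, while the expensive scorer sees a bounded candidate list.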

Components

User and item embedding service
Maintains tower outputs; refreshes embeddings as users and items evolve.
ANN index
Vector store (HNSW, IVF-PQ) for sub-10 ms retrieval over billions of items.
Ranking model
Multi-task DNN scoring engagement, dwell, value, and safety per candidate.
Re-ranker
Applies diversity, freshness, exploration, and business constraints.
Feature store
Serves point-in-time-correct features, with the same definitions online (serving) and offline (training).
Experimentation framework
Holdouts, interleaving, and counterfactual evaluation.
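Point-in-time correctness in the feature store means a training example only sees feature values that existed when the logged event happened. A minimal sketch, assuming each feature keeps a timestamp-ordered history; the class name `FeatureHistory`, its methods, and the example values are hypothetical and do not correspond to any specific feature-store API.

```python
import bisect
from typing import List, Optional


class FeatureHistory:
    """Timestamped values for one (entity, feature) pair."""

    def __init__(self) -> None:
        self._times: List[float] = []
        self._values: List[float] = []

    def write(self, ts: float, value: float) -> None:
        # Keep entries sorted by timestamp so reads can binary-search.
        i = bisect.bisect_right(self._times, ts)
        self._times.insert(i, ts)
        self._values.insert(i, value)

    def as_of(self, ts: float) -> Optional[float]:
        """Latest value written at or before ts, or None if nothing existed yet."""
        i = bisect.bisect_right(self._times, ts)
        return self._values[i - 1] if i else None


# Hypothetical feature: a user's 7-day click count, updated three times.
clicks_7d = FeatureHistory()
clicks_7d.write(100.0, 3)
clicks_7d.write(200.0, 5)
clicks_7d.write(300.0, 9)

# An impression logged at t=250 must be joined with 5, not the later value 9.
print(clicks_7d.as_of(250.0))
```

Joining on the latest value instead of the as-of value leaks future information into training and produces offline metrics that the online system cannot reproduce.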

Key decisions

Two-tower over cross-encoder for retrieval.
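A cross-encoder has to score every user-item pair at request time, which is infeasible over billions of items. With two towers, item embeddings are precomputed offline, so retrieval reduces to one user-tower pass plus an ANN lookup, which fits the sub-100 ms budget.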
Multi-task ranker over CTR-only.
Optimizing CTR alone produces clickbait; engagement and value heads improve long-term metrics.
Explicit exploration budget.
Closed-loop systems narrow over time because the model trains mostly on what it already chose to show. Bandit-style exploration collects feedback on cold items and prevents filter-bubble collapse.
Calibration layer.
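When pCTR multiplies a bid or downstream value, the absolute probability matters, not just the ordering. Ranking metrics such as AUC are unchanged by any monotone rescaling of scores, so they cannot detect miscalibration; a calibration layer keeps pCTR × value meaningful.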
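The explicit exploration budget can be sketched as a per-slot coin flip in the re-ranker. This uses epsilon-greedy, the simplest bandit-style scheme, as a stand-in for whatever policy a production system would choose. The function `fill_slate`, its parameters, and the item names are hypothetical.

```python
import random
from typing import List, Optional


def fill_slate(ranked: List[str], cold_items: List[str], n: int,
               explore_share: float = 0.1,
               rng: Optional[random.Random] = None) -> List[str]:
    """Fill n slots; each slot goes to a random cold item with probability explore_share."""
    rng = rng or random.Random()
    exploit = list(ranked)      # best first, consumed from the front
    explore = list(cold_items)  # sampled uniformly at random
    slate: List[str] = []
    seen = set()
    while len(slate) < n and (exploit or explore):
        # Explore on a coin flip, or when the ranked list has run out.
        use_explore = bool(explore) and (not exploit or rng.random() < explore_share)
        item = explore.pop(rng.randrange(len(explore))) if use_explore else exploit.pop(0)
        if item not in seen:    # an item may appear in both lists
            seen.add(item)
            slate.append(item)
    return slate


# Hypothetical usage with a fixed seed so the run is repeatable.
print(fill_slate(["a", "b", "c", "d"], ["new1", "new2"], n=4,
                 explore_share=0.25, rng=random.Random(7)))
```

A real system would also log the probability with which each slot was chosen, since counterfactual evaluation needs those propensities; that bookkeeping is omitted here.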

Pitfalls

  • Treating offline AUC as ground truth; it is a weak predictor of online lift.
  • Ignoring position bias in training data: clicks reflect where an item was shown as well as how relevant it was.
  • Skipping calibration; downstream value gets distorted silently.
  • Evaluating retrieval and ranking only together, so failures get misattributed to the wrong stage.
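The position-bias pitfall can be made concrete with a position-based click model, which assumes P(click) = P(examined at position) × P(item is attractive). Dividing clicks by expected examinations, rather than by raw impressions, removes the position effect. The examination probabilities below are invented; in practice they would be estimated, for example from randomized traffic. The name `debiased_ctr` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def debiased_ctr(impressions: Iterable[Tuple[str, int, int]],
                 examine_prob: Dict[int, float]) -> Dict[str, float]:
    """Clicks divided by expected examinations, per item.

    impressions: (item, position, clicked) with clicked in {0, 1}.
    examine_prob: assumed P(user examines a slot), keyed by position, all > 0.
    """
    clicks: Dict[str, float] = defaultdict(float)
    exams: Dict[str, float] = defaultdict(float)
    for item, pos, clicked in impressions:
        clicks[item] += clicked
        exams[item] += examine_prob[pos]
    return {item: clicks[item] / exams[item] for item in exams}


# Assumed examination probabilities (hypothetical values).
examine_prob = {1: 1.0, 2: 0.5, 3: 0.25}
log = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 1, 0),  # naive CTR 0.50
    ("B", 3, 1), ("B", 3, 0), ("B", 3, 0), ("B", 3, 0),  # naive CTR 0.25
]
print(debiased_ctr(log, examine_prob))
```

On raw CTR, item A looks twice as good as item B. After correcting for the fact that B was only ever shown in a rarely examined slot, B comes out ahead, which is the distortion a model trained on raw clicks would inherit.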

Follow-up questions

  • How do you measure retrieval quality independently of ranking?
  • How do you balance exploration vs. exploitation in this loop?
  • How do new items get a chance to be ranked?
  • How do you safely roll out a new ranking model?
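For the first question, one common starting point is recall@k computed on the retrieval output alone, against held-out items the user later engaged with, before any ranker is involved. This is a sketch of that metric, not a complete evaluation protocol; the function name and the data are illustrative.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant items that appear among the first k retrieved candidates."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)


# Hypothetical example: the user later engaged with b, d, and z.
retrieved = ["a", "b", "c", "d"]
engaged = {"b", "d", "z"}
print(recall_at_k(retrieved, engaged, k=3))
print(recall_at_k(retrieved, engaged, k=4))
```

If recall@k is low, no ranker can recover the missing items, so a drop in this number points at retrieval rather than ranking.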
