ML Systems
Recommendation Ranking System
Two-stage candidate generation + ranking for personalized recommendations at low latency.
Scale to anchor on
Hundreds of millions of users, sub-100 ms p99 latency, candidate generation over billions of items, ranking applied to ~1k candidates per request.
Requirements
Functional
- Generate a personalized candidate set per user.
- Rank candidates by predicted engagement and business value.
- Apply business rules, freshness, and diversity constraints.
- Support A/B testing of new models without downtime.
Non-functional
- Low latency under load: the sub-100 ms p99 budget has to hold at peak traffic.
- Calibrated CTR predictions where pCTR multiplies bid or value.
- Resilience to feedback loops (filter bubbles, exploration vs. exploitation).
High-level architecture
Two-tower retrieval produces a candidate set via ANN over precomputed item embeddings. A heavier ranking model (multi-task, multi-objective) scores the candidates. A re-ranker enforces diversity, freshness, and business policy.
Components
User and item embedding service
Maintains tower outputs; refreshes embeddings as users and items evolve.
ANN index
Vector index (e.g. HNSW or IVF-PQ) for sub-10 ms approximate nearest-neighbor retrieval over billions of items.
Ranking model
Multi-task DNN scoring engagement, dwell, value, and safety per candidate.
Re-ranker
Applies diversity, freshness, exploration, and business constraints.
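One common way to trade relevance against diversity is greedy maximal marginal relevance (MMR). The sketch below is an illustration under stated assumptions, not necessarily the re-ranker this design intends: the `lam` weight, the category-based similarity, and the item names are all hypothetical.

```python
def mmr_rerank(scores, similarity, k, lam=0.7):
    """Greedy MMR: pick the item maximizing
    lam * relevance - (1 - lam) * (max similarity to items already picked)."""
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:
        def mmr(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * scores[item] - (1 - lam) * max_sim
        # sorted() only makes tie-breaking deterministic
        best = max(sorted(remaining), key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical ranker scores and item categories.
SCORES = {"news1": 0.90, "news2": 0.85, "sports1": 0.60}
CATEGORY = {"news1": "news", "news2": "news", "sports1": "sports"}

def same_category(a, b):
    # Crude similarity: 1.0 if two items share a category, else 0.0.
    return 1.0 if CATEGORY[a] == CATEGORY[b] else 0.0

slate = mmr_rerank(SCORES, same_category, k=3)
print(slate)  # ['news1', 'sports1', 'news2']
```

The second news item is pushed below the lower-scored sports item because it is penalized for duplicating a category already on the slate. Freshness and business rules could enter as extra terms or as hard filters before this step.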
Feature store
Point-in-time correct features online and offline.
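Point-in-time correctness means a training example only sees the feature value that was available when the event happened. A minimal sketch of that lookup follows; the `feature_as_of` helper, the integer timestamps, and the `clicks_7d` history are hypothetical, and a real feature store does this as a join over many keys.

```python
import bisect

def feature_as_of(history, ts):
    """Return the latest feature value written at or before ts.

    history: list of (timestamp, value) pairs sorted by timestamp.
    Returns None if no value existed yet at ts.
    Assumption: a value written exactly at ts counts as visible; use
    bisect_left instead if writes at the event time must be excluded.
    """
    times = [t for t, _ in history]
    idx = bisect.bisect_right(times, ts)
    return history[idx - 1][1] if idx else None

# Hypothetical history of a user's 7-day click count.
clicks_7d = [(100, 3), (200, 5), (300, 9)]

print(feature_as_of(clicks_7d, 250))  # 5, not the later value 9
print(feature_as_of(clicks_7d, 50))   # None, the feature did not exist yet
```

Joining on the latest value regardless of timestamp would leak the value 9 into an example logged at time 250, which is the kind of train/serve mismatch this component exists to prevent.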
Experimentation framework
Holdouts, interleaving, and counterfactual eval.
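Interleaving compares two rankers inside a single result list instead of splitting users. Below is a sketch of team-draft interleaving, one well-known variant; the function names and the toy rankings are assumptions, and the source does not say which interleaving scheme is used.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Merge two rankings; remember which ranker contributed each item."""
    rankings = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    while len(interleaved) < k:
        # The team with fewer picks drafts next; ties are a coin flip.
        if count["A"] != count["B"]:
            team = "A" if count["A"] < count["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        pick = next((x for x in rankings[team] if x not in team_of), None)
        if pick is None:
            # This team's list is exhausted; fall back to the other one.
            team = "B" if team == "A" else "A"
            pick = next((x for x in rankings[team] if x not in team_of), None)
            if pick is None:
                break
        interleaved.append(pick)
        team_of[pick] = team
        count[team] += 1
    return interleaved, team_of

def credit(clicked, team_of):
    """Count clicks per contributing ranker."""
    wins = {"A": 0, "B": 0}
    for item in clicked:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins

mixed, team_of = team_draft_interleave(
    ["a", "b", "c"], ["b", "d", "a"], k=4, rng=random.Random(0)
)
print(mixed, credit(["d"], team_of))
```

The ranker whose picks collect more clicks across many requests is preferred. Because both rankers are exposed to the same user and context, this tends to need less traffic than a user-split A/B test, at the cost of only measuring list-level preference.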
Key decisions
Two-tower over cross-encoder for retrieval.
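A cross-encoder needs one forward pass per user-item pair, which is infeasible over billions of items inside a sub-100 ms budget. Two towers let item embeddings be precomputed offline, so a request costs one user-tower pass plus an ANN lookup.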
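The request-time cost of two-tower retrieval can be sketched as a single inner-product search against stored item vectors. The example uses an exact brute-force scan as a stand-in for the ANN index; the embeddings, item ids, and the `retrieve` helper are hypothetical.

```python
import heapq

# Hypothetical item-tower outputs, computed offline and stored in the index.
ITEM_EMBEDDINGS = {
    "a": [0.9, 0.1],
    "b": [0.1, 0.9],
    "c": [0.7, 0.7],
    "d": [-0.5, 0.2],
}

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def retrieve(user_vec, item_embeddings, k):
    """Exact top-k by inner product. An ANN index (HNSW, IVF-PQ)
    approximates this scan in sublinear time."""
    return heapq.nlargest(
        k, item_embeddings, key=lambda item: dot(user_vec, item_embeddings[item])
    )

# At request time only the user tower runs; here its output is hard-coded.
user_vec = [1.0, 0.0]
top = retrieve(user_vec, ITEM_EMBEDDINGS, k=2)
print(top)  # ['a', 'c']
```

No item-side model runs on the request path, which is what a cross-encoder cannot offer, since its score depends on the user and item jointly.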
Multi-task ranker over CTR-only.
Optimizing CTR alone produces clickbait; engagement and value heads improve long-term metrics.
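As a rough illustration of why extra heads change the ordering, the sketch below combines per-head predictions with a weighted sum. The head names, the weights, and the two example items are made up for the illustration; real systems tune the combination against long-term metrics.

```python
# Hypothetical weights over ranking-model heads (not from the source).
WEIGHTS = {"p_click": 1.0, "expected_dwell_min": 0.5, "p_unsafe": -5.0}

def final_score(heads, weights=WEIGHTS):
    """Weighted sum of per-head predictions for one candidate."""
    return sum(weights[name] * heads[name] for name in weights)

# A clickbait-like item: high click probability, little dwell, some risk.
clickbait = {"p_click": 0.30, "expected_dwell_min": 0.1, "p_unsafe": 0.02}
# A solid item: fewer clicks, but users stay and it is safe.
solid = {"p_click": 0.20, "expected_dwell_min": 0.6, "p_unsafe": 0.00}

print(final_score(clickbait))  # about 0.25
print(final_score(solid))      # about 0.50
```

A CTR-only ranker would place the clickbait item first; with dwell and safety heads in the score, the solid item wins.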
Explicit exploration budget.
Closed-loop systems narrow over time because the model only gets feedback on what it already chose to show. Bandit-style exploration gathers feedback on cold items and prevents filter-bubble collapse.
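The simplest form of an explicit budget is to reserve a fixed fraction of slate positions for items with few impressions. The sketch below does that with uniform sampling, which is cruder than a true bandit policy such as Thompson sampling; the `fill_slate` helper, the thresholds, and the impression counts are assumptions.

```python
import random

def fill_slate(ranked, impressions, k, explore_frac, min_impressions, rng):
    """Fill k slots: most by rank, a fixed share with under-exposed items."""
    n_explore = int(k * explore_frac)
    # Items below the impression threshold are treated as cold.
    cold = [i for i in ranked if impressions.get(i, 0) < min_impressions]
    explore = rng.sample(cold, min(n_explore, len(cold)))
    exploit = [i for i in ranked if i not in explore][: k - len(explore)]
    return exploit + explore

RANKED = ["a", "b", "c", "d", "e", "f"]  # ranker output, best first
IMPRESSIONS = {"a": 1000, "b": 900, "c": 800, "d": 700, "e": 3, "f": 0}

slate = fill_slate(
    RANKED, IMPRESSIONS, k=4, explore_frac=0.25, min_impressions=100,
    rng=random.Random(0),
)
print(slate)  # three top-ranked items plus one cold item
```

Logging which slots were exploration slots, and with what probability each item was chosen, is what later makes the collected feedback usable for counterfactual evaluation.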
Calibration layer.
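When pCTR multiplies bid or downstream value, the predicted probabilities have to be accurate in absolute terms, not just correctly ordered. Ranking-only metrics such as AUC are unchanged by any monotone distortion of the scores, so they cannot detect miscalibration.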
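A small worked example shows the effect. Squaring every pCTR is a monotone distortion, so the ordering of items by pCTR (and therefore AUC) is untouched, yet the winner under pCTR times bid changes. The two ads, their bids, and the distortion are invented for the illustration.

```python
def expected_value(p_ctr, bid):
    return p_ctr * bid

# Hypothetical candidates with well-calibrated click probabilities.
ADS = {
    "A": {"p_ctr": 0.10, "bid": 1.0},
    "B": {"p_ctr": 0.05, "bid": 2.5},
}

def winner(ads, distort=lambda p: p):
    """Pick the ad with the highest (possibly distorted) pCTR * bid."""
    return max(
        ads, key=lambda a: expected_value(distort(ads[a]["p_ctr"]), ads[a]["bid"])
    )

print(winner(ADS))                             # B: 0.05 * 2.5 beats 0.10 * 1.0
print(winner(ADS, distort=lambda p: p ** 2))   # A: same pCTR order, wrong winner
```

A calibration layer (for example Platt scaling or isotonic regression fitted on held-out data) maps raw scores back to probabilities so that products like this one are meaningful.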
Pitfalls
- Treating offline AUC as ground truth; it often fails to predict online lift.
- Forgetting position bias in training data: items shown higher get more clicks regardless of relevance.
- Skipping calibration; downstream value gets distorted silently.
- Evaluating retrieval and ranking only end to end; failures then get misattributed to the wrong stage.
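To keep the last pitfall in check, retrieval can be scored on its own with recall@k: the share of items the user later engaged with that made it into the candidate set, regardless of how the ranker ordered them. A minimal sketch, with hypothetical item ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items present in the first k retrieved candidates."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Candidates from the retrieval stage, and items the user engaged with later.
retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}

print(recall_at_k(retrieved, relevant, k=3))  # 1 of 3 relevant items retrieved
```

If recall@k is low, no ranking model can recover the missing items, so the fix belongs in retrieval. If recall@k is high but online results are poor, the ranker or re-ranker is the more likely culprit.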
Follow-up questions
- How do you measure retrieval quality independently of ranking?
- How do you balance exploration vs. exploitation in this loop?
- How do new items get a chance to be ranked?
- How do you safely roll out a new ranking model?