Platform
A/B Testing & Experimentation Infrastructure
Run thousands of overlapping experiments with statistically valid analysis and safe rollouts.
Scale to anchor on
Thousands of concurrent experiments, hundreds of millions of users, low-latency assignment, daily analysis pipeline over petabytes.
Requirements
Functional
- Deterministic per-user assignment to variants.
- Layered experiments to isolate interactions.
- Manual or automatic ramping.
- Statistical analysis with confidence intervals.
Non-functional
- Sub-ms assignment latency.
- Robust to clock skew, user churn, and tracking gaps.
- Audit trail for decisions.
High-level architecture
The assignment service hashes user_id with a per-experiment salt to deterministically pick a variant. Exposure events flow to a logging pipeline. A daily or hourly analysis job joins exposures with outcome metrics and computes per-variant statistics, applying variance reduction (such as CUPED) and multiple-testing corrections where appropriate.
Components
Experiment config service
Holds experiment definitions, variants, and layer membership.
Assignment SDK
In-process variant computation; deterministic and fast.
Exposure logger
Records who saw what variant when.
Analysis pipeline
Joins exposures to metrics, computes stats, surfaces results.
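As a rough illustration of the join-then-compute step, the sketch below joins exposures to a conversion outcome on user id and reports a normal-approximation confidence interval for the difference in conversion rate. The function name, the input shapes (a dict of first exposures and a set of converters), and the variant labels "control" and "treatment" are assumptions made for the example; a real pipeline would run this as a distributed join over event tables.

```python
import math
from statistics import NormalDist

def conversion_diff_ci(exposures, conversions, confidence=0.95):
    """Return (diff, lower, upper) for treatment minus control conversion rate.

    exposures:   dict of user_id -> "control" or "treatment" (hypothetical shape)
    conversions: set of user_ids that converted (hypothetical shape)
    """
    n = {"control": 0, "treatment": 0}
    k = {"control": 0, "treatment": 0}
    for user_id, variant in exposures.items():
        n[variant] += 1
        # Join on user_id; converters who were never exposed are ignored.
        k[variant] += user_id in conversions
    p_c = k["control"] / n["control"]
    p_t = k["treatment"] / n["treatment"]
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_c * (1 - p_c) / n["control"] + p_t * (1 - p_t) / n["treatment"])
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff, diff - z * se, diff + z * se
```

An interval that excludes zero suggests a real difference at the chosen confidence level, but only if the experiment was sized in advance and the result is read once rather than repeatedly.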
Experiment review UI
Drives go/no-go decisions.
Key decisions
Deterministic hash-based assignment.
Stateless, reproducible, and fast — no DB lookup per request.
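A minimal sketch of hash-based assignment, assuming SHA-256 as the hash, 10,000 buckets, and an allocation expressed as bucket counts per variant. The names bucket_for, assign_variant, and NUM_BUCKETS are hypothetical and chosen only for illustration.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumed granularity: each bucket is 0.01% of traffic

def bucket_for(user_id: str, salt: str) -> int:
    """Hash salt and user_id together and map the result to a bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def assign_variant(user_id, experiment_salt, allocation):
    """Pick a variant from allocation, a list of (variant_name, bucket_count).

    Bucket counts may sum to less than NUM_BUCKETS; users whose bucket falls
    outside the allocated range get None (not in the experiment).
    """
    bucket = bucket_for(user_id, experiment_salt)
    upper = 0
    for name, count in allocation:
        upper += count
        if bucket < upper:
            return name
    return None

# Same inputs always produce the same variant, with no stored state.
split = [("control", 5000), ("treatment", 5000)]
print(assign_variant("user-123", "exp-checkout-v2", split))
```

Because each experiment uses its own salt, a user's bucket in one experiment says nothing about their bucket in another.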
Layered experiments.
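Without layers, overlapping experiments confound each other. Experiments within a layer split that layer's traffic and are mutually exclusive, so changes that could interact are kept apart; separate layers are hashed independently, so each layer's effect averages out across the variants of the others.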
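One way to picture layering, sketched under assumptions: each layer has its own salt, and experiments in a layer own disjoint bucket ranges. The Slot type, the range layout, and the layer and experiment names below are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

NUM_BUCKETS = 10_000  # assumed bucket count per layer

@dataclass(frozen=True)
class Slot:
    experiment: str
    start: int  # inclusive bucket
    end: int    # exclusive bucket

def layer_bucket(user_id: str, layer_salt: str) -> int:
    """Bucket a user within one layer; each layer uses a different salt."""
    digest = hashlib.sha256(f"{layer_salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

def experiments_for(user_id: str, layers: dict) -> dict:
    """Return, per layer, the one experiment (or None) that owns the user's bucket."""
    result: dict = {}
    for layer_salt, slots in layers.items():
        bucket = layer_bucket(user_id, layer_salt)
        owner: Optional[str] = next(
            (s.experiment for s in slots if s.start <= bucket < s.end), None
        )
        result[layer_salt] = owner
    return result

layers = {
    "layer-ui": [Slot("button_color", 0, 5000), Slot("new_nav", 5000, 10000)],
    "layer-ranking": [Slot("ranker_v2", 0, 2000)],
}
print(experiments_for("user-123", layers))
```

A user lands in at most one experiment per layer, while membership across layers is effectively independent because the salts differ.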
Pre-experiment power analysis.
Running an underpowered experiment wastes weeks of traffic and produces no signal.
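For a conversion-rate metric, a common back-of-the-envelope sizing uses the two-proportion normal approximation. The sketch below assumes a two-sided test, equal-sized arms, and a relative minimum detectable lift; the function name and defaults are illustrative, not a prescribed method.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    min_detectable_lift is relative: 0.05 means a 5% lift over baseline_rate.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 5% relative lift on a 10% baseline needs tens of thousands
# of users per arm at the default alpha and power.
print(sample_size_per_arm(0.10, 0.05))
```

Required sample size grows with the inverse square of the detectable difference, so halving the lift you want to detect roughly quadruples the traffic needed.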
Variance reduction techniques.
Real product metrics are noisy; CUPED-style corrections use pre-experiment data as a covariate to remove predictable variance, which tightens confidence intervals without extra traffic.
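A minimal CUPED sketch, assuming one pre-experiment covariate per user (for example, the same metric measured before the experiment started). The function name is hypothetical. In practice theta is typically estimated on data pooled across arms, and the covariate must predate exposure so the treatment cannot affect it.

```python
from statistics import fmean, variance

def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values.

    metric:    in-experiment values, one per user
    covariate: pre-experiment values for the same users, in the same order
    """
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    # Sample covariance between covariate and metric.
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)
    # Subtract the part of the metric explained by the covariate; the mean is unchanged.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]
```

The adjusted values keep the same mean as the raw metric, while their variance shrinks by a factor of roughly one minus the squared correlation between metric and covariate.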
Pitfalls
- Skipping layered experiments and getting confounded results.
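- Ignoring sample ratio mismatch (SRM), which usually signals broken bucketing or exposure logging.
- Acting on peeked p-values during ramp; repeated looks inflate the false-positive rate unless the analysis uses a sequential method.
- Running without a global holdout, which leaves long-term effects unmeasurable.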
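The SRM pitfall above is commonly caught with a chi-square goodness-of-fit test on observed versus expected arm sizes. The sketch below covers the two-arm case using only the standard library; the function name and the alert threshold mentioned afterwards are assumptions.

```python
import math

def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_count - expected_control) ** 2 / expected_control
        + (treatment_count - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# A 50,000 vs 51,000 split under an intended 50/50 allocation yields a small p-value.
print(srm_p_value(50_000, 51_000))
```

Teams often alert on a strict threshold (0.001 is a frequently used choice, assumed here rather than taken from the source) because the check runs on every experiment, and a flagged experiment should be debugged before its metrics are trusted.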
Follow-up questions
- How do you handle experiments with network effects, where one user's treatment affects other users' outcomes (recsys, marketplace)?
- What's the SRM detection mechanism?
- How do you handle multiple overlapping experiments mathematically?