
A/B Testing & Experimentation Infrastructure

Run thousands of overlapping experiments with statistically valid analysis and safe rollouts.

Scale to anchor on

Thousands of concurrent experiments, hundreds of millions of users, low-latency assignment, daily analysis pipeline over petabytes.

Requirements

Functional

  • Deterministic per-user assignment to variants.
  • Layered experiments to isolate interactions.
  • Manual or automatic ramping.
  • Statistical analysis with confidence intervals.

Non-functional

  • Sub-ms assignment latency.
  • Robust to clock skew, user churn, and tracking gaps.
  • Audit trail for decisions.

High-level architecture

The assignment service hashes user_id with a per-experiment salt to pick a variant deterministically. Exposure events flow to a logging pipeline. A daily or hourly analysis job joins exposures with outcome metrics and computes per-variant statistics, applying variance reduction (e.g. CUPED) and multiple-testing corrections.
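A minimal sketch of salted hash assignment, assuming an equal traffic split and SHA-256 as the hash. The function name assign_variant, the salt format, and the 10,000-bucket resolution are illustrative choices, not part of the design above.

```python
import hashlib


def assign_variant(user_id: str, experiment_salt: str, variants: list[str],
                   num_buckets: int = 10_000) -> str:
    """Deterministically map a user to one of the variants (equal split)."""
    # Salting with the experiment keeps bucketing independent across experiments.
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # The first 8 bytes give a uniform integer; reduce it to a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    # Slice the bucket space into equal-width ranges, one per variant.
    return variants[bucket * len(variants) // num_buckets]


variant = assign_variant("user-42", "checkout_button_v1", ["control", "treatment"])
```

The same (user_id, salt) pair always yields the same variant, so any service can recompute an assignment without shared state.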

Components

Experiment config service
Holds experiment definitions, variants, and layer membership.
Assignment SDK
In-process variant computation; deterministic and fast.
Exposure logger
Records who saw what variant when.
Analysis pipeline
Joins exposures to metrics, computes stats, surfaces results.
Experiment review UI
Drives go/no-go decisions.

Key decisions

Deterministic hash-based assignment.
Stateless, reproducible, and fast — no DB lookup per request.
Layered experiments.
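Without layers, overlapping experiments that touch the same surface can confound each other. A layer makes its experiments mutually exclusive, while separate layers are randomized independently.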
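One way to sketch layering, assuming each layer owns a bucket space that its experiments partition, and that the layer name doubles as the hash salt. The Layer and Experiment types and the 1,000-bucket resolution are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

NUM_BUCKETS = 1000


def bucket(key: str, salt: str) -> int:
    """Uniform bucket in [0, NUM_BUCKETS) for a salted key."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


@dataclass(frozen=True)
class Experiment:
    name: str
    start: int  # first layer bucket owned (inclusive)
    end: int    # end of owned range (exclusive)
    variants: tuple


@dataclass(frozen=True)
class Layer:
    name: str
    experiments: tuple


def assign_in_layer(user_id: str, layer: Layer) -> Optional[tuple]:
    """Return (experiment, variant), or None if the user's bucket is unallocated."""
    b = bucket(user_id, layer.name)
    for exp in layer.experiments:
        if exp.start <= b < exp.end:
            # Re-hash with the experiment name so the variant split does not
            # depend on where the user landed inside the layer.
            idx = bucket(user_id, exp.name) * len(exp.variants) // NUM_BUCKETS
            return exp.name, exp.variants[idx]
    return None


ui_layer = Layer("ui_layer", (
    Experiment("button_color", 0, 500, ("control", "blue")),
    Experiment("button_size", 500, 1000, ("control", "large")),
))
```

A user lands in at most one experiment per layer. Because each layer hashes with its own salt, assignments in different layers are approximately independent.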
Pre-experiment power analysis.
An underpowered experiment can consume weeks of traffic and still be unable to detect the effect it was designed to measure.
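A rough sketch of the pre-experiment calculation, assuming a two-sided z-test on a difference in means with equal-sized arms and a known metric standard deviation. The function name and the defaults (alpha 0.05, power 0.8) are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma: float, min_detectable_effect: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in each arm to detect an absolute mean shift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / min_detectable_effect ** 2
    return math.ceil(n)


n = sample_size_per_arm(sigma=1.0, min_detectable_effect=0.1)
```

Under these assumptions, detecting a shift of 0.1 standard deviations takes roughly 1,570 users per arm, and halving the detectable effect roughly quadruples the requirement.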
Variance reduction techniques.
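Real product metrics are noisy; CUPED-style corrections use pre-experiment data as a covariate and tighten confidence intervals in proportion to how strongly that covariate correlates with the metric.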
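A minimal sketch of the CUPED adjustment, assuming one pre-experiment covariate per user (for example, the same metric measured before exposure) and a coefficient theta estimated on data pooled across variants. The data below is synthetic and only illustrates the effect.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y: list, x: list) -> list:
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
    theta = cov / pvariance(x, mx)
    # Subtracting the centered covariate leaves the mean unchanged and
    # removes the part of the variance that the covariate explains.
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


# Synthetic data: the in-experiment metric tracks the pre-period metric closely.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [p + rng.gauss(0, 1) for p in pre]
adjusted = cuped_adjust(post, pre)
```

The adjusted metric keeps the same mean as the raw one, while its variance shrinks by about a factor of 1 - rho^2, where rho is the correlation between covariate and metric.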

Pitfalls

  • Skipping layered experiments and getting confounded results.
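  • Ignoring sample ratio mismatch (SRM), which usually signals broken bucketing or exposure logging.
  • Acting on p-values peeked at during ramp, which inflates false positives unless a sequential testing method is used.
  • Running without a global holdout, which leaves long-term effects unmeasurable.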
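The SRM pitfall can be checked with a chi-square goodness-of-fit test on exposure counts. This is a sketch for the two-arm case only; the function name and the alert threshold mentioned afterwards are assumptions.

```python
import math


def srm_p_value(control_count: int, treatment_count: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value that the observed split matches the configured split."""
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


p = srm_p_value(50_000, 51_500)
```

A conservative threshold such as p < 0.001 is often used to flag SRM. A 50,000 vs 51,500 split on a configured 50/50 experiment falls well below it, which suggests pausing analysis until the bucketing or logging fault is found.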

Follow-up questions

  • How do you handle network experiments (recsys, marketplace)?
  • What's the SRM detection mechanism?
  • How do you handle multiple overlapping experiments mathematically?
