Platform

A/B Testing & Experimentation Infrastructure

Run thousands of overlapping experiments with statistically valid analysis and safe rollouts.

Scale to anchor on

Thousands of concurrent experiments, hundreds of millions of users, low-latency assignment, daily analysis pipeline over petabytes.

Requirements

Functional

Deterministic per-user assignment to variants.
Layered experiments to isolate interactions.
Manual or automatic ramping.
Statistical analysis with confidence intervals.

Non-functional

Sub-ms assignment latency.
Robust to clock skew, user churn, and tracking gaps.
Audit trail for decisions.

High-level architecture

The assignment service hashes user_id together with a per-experiment salt to pick a variant deterministically. Exposure events flow to a logging pipeline. A daily or hourly analysis job joins exposures with outcome metrics and computes per-variant statistics with the appropriate corrections: variance reduction (for example CUPED) and multiple-testing adjustment.
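A minimal sketch of salted hash assignment, assuming SHA-256 as the hash and a 10,000-bucket space. The names bucket_for, assign_variant, and NUM_BUCKETS are illustrative and not part of any specific SDK.

```python
import hashlib

NUM_BUCKETS = 10_000  # assumption: allocation granularity of 0.01%


def bucket_for(user_id, salt):
    """Hash (salt, user_id) into a stable bucket in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign_variant(user_id, experiment_salt, variants):
    """Pick a variant for the user, or None if the user is outside the experiment.

    variants is a list of (name, bucket_count) pairs whose counts sum to
    at most NUM_BUCKETS; any remaining buckets are not enrolled.
    """
    bucket = bucket_for(user_id, experiment_salt)
    upper = 0
    for name, size in variants:
        upper += size
        if bucket < upper:
            return name
    return None


if __name__ == "__main__":
    split = [("control", 5000), ("treatment", 5000)]
    print(assign_variant("user-42", "exp_checkout_2024", split))
```

Because the result depends only on the user id, the salt, and the split, any server can recompute it without shared state. Giving each experiment its own salt keeps bucket membership uncorrelated across experiments.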

Components

Experiment config service

Holds experiment definitions, variants, and layer membership.

Assignment SDK

In-process variant computation; deterministic and fast.

Exposure logger

Records who saw what variant when.

Analysis pipeline

Joins exposures to metrics, computes stats, surfaces results.
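The join-then-compare step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual implementation: exposures are (user_id, variant, timestamp) tuples, outcomes is a per-user metric dictionary, exposed users with no outcome count as 0, and the interval uses a normal approximation with unpooled variances. The function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, variance


def values_by_variant(exposures, outcomes):
    """Join exposures to outcomes on user_id, keeping each user's first exposure."""
    first_variant = {}
    for user_id, variant, _ts in sorted(exposures, key=lambda e: e[2]):
        first_variant.setdefault(user_id, variant)
    grouped = defaultdict(list)
    for user_id, variant in first_variant.items():
        # Assumption: an exposed user with no recorded outcome contributes 0.
        grouped[variant].append(outcomes.get(user_id, 0.0))
    return grouped


def diff_in_means_ci(control, treatment, confidence=0.95):
    """Return (delta, (low, high)) for mean(treatment) - mean(control)."""
    delta = mean(treatment) - mean(control)
    std_err = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return delta, (delta - z * std_err, delta + z * std_err)
```

Analysing every exposed user, not only those who produced an outcome event, keeps the comparison tied to the randomized population.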

Experiment review UI

Drives go/no-go decisions.

Key decisions

Deterministic hash-based assignment.

Stateless, reproducible, and fast: no DB lookup per request.

Layered experiments.

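Without layers, overlapping experiments confound each other. Experiments in the same layer split that layer's traffic and never share a user, so they cannot interact; each layer is randomized independently, so effects from other layers spread evenly across variants.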
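One way to picture layering, as a sketch: each layer hashes the user with its own salt, and experiments inside a layer own disjoint bucket ranges. The LAYERS config, the experiment names, and the 1,000-bucket space are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 1000  # assumption: bucket space per layer

# Hypothetical config: layer -> {experiment: half-open bucket range}.
LAYERS = {
    "ranking": {"rank_model_v2": (0, 500), "rank_freshness": (500, 1000)},
    "ui": {"new_checkout_button": (0, 300)},  # buckets 300-999 are unallocated
}


def layer_bucket(user_id, layer_name):
    """Hash the user with the layer name as salt, so layers are independent."""
    digest = hashlib.sha256(f"{layer_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def active_experiments(user_id):
    """Return {layer: experiment} with at most one experiment per layer."""
    active = {}
    for layer_name, experiments in LAYERS.items():
        bucket = layer_bucket(user_id, layer_name)
        for experiment, (low, high) in experiments.items():
            if low <= bucket < high:
                active[layer_name] = experiment
                break
    return active
```

A user can be in one ranking experiment and one UI experiment at once, but never in two ranking experiments. The variant inside each experiment would then come from a separate hash salted per experiment.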

Pre-experiment power analysis.

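An underpowered experiment wastes weeks of traffic and cannot distinguish a real effect from noise. Fix the minimum detectable effect, significance level, and power up front, and derive the required sample size from them.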
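As a rough sketch of the calculation, the standard normal-approximation formula for a two-sided, two-sample comparison of means with equal allocation is n = 2 (z_(1-alpha/2) + z_power)^2 sigma^2 / delta^2 per arm. The function name and defaults below are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute difference `mde` in means.

    sigma is the metric's standard deviation (assumed equal in both arms),
    alpha the two-sided significance level, power the target power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / mde) ** 2)


if __name__ == "__main__":
    # Detect a shift of 0.1 standard deviations at alpha=0.05, power=0.8.
    print(sample_size_per_arm(sigma=1.0, mde=0.1))
```

With these defaults, detecting a 0.1 standard deviation shift needs about 1,570 users per arm. The requirement grows with the inverse square of the effect, so halving the detectable effect roughly quadruples the traffic needed.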

Variance reduction techniques.

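Real product metrics are noisy. CUPED-style corrections use each user's pre-experiment behaviour as a covariate to tighten confidence intervals: the adjusted metric's variance is (1 - ρ²) times the original, where ρ is the correlation between the metric and its pre-experiment value.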
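A minimal sketch of the CUPED adjustment, assuming one pre-experiment covariate per user and theta estimated on data pooled across all variants. The function name cuped_adjust is illustrative.

```python
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted metric values.

    y: in-experiment metric per user (all variants pooled).
    x: the same users' pre-experiment metric, in the same order.
    Adjusted value: y_i - theta * (x_i - mean(x)), theta = cov(x, y) / var(x).
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
```

The adjustment leaves the overall mean unchanged, because the correction terms sum to zero, while removing the part of the variance explained by the covariate. After adjusting, split the values by variant and compare means as usual.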

Pitfalls

Skipping layered experiments and getting confounded results.
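Ignoring sample ratio mismatch (SRM), which usually signals broken bucketing or exposure logging.
Acting on peeked p-values during ramp without a sequential-testing correction, which inflates false positives.
Running no global holdout, which leaves long-term effects unmeasurable.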
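For the SRM pitfall above, a common check is a chi-square goodness-of-fit test on exposure counts. The sketch below covers the two-arm case only, where the one-degree-of-freedom p-value has the closed form erfc(sqrt(chi2 / 2)). The function name and the 0.001 alert threshold in the usage note are assumptions, not a fixed standard.

```python
import math


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """P-value that observed arm sizes match the configured split (two arms)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    p = srm_p_value(50_000, 51_500)
    print("SRM suspected" if p < 0.001 else "split looks consistent")
```

A very small p-value means the observed split is unlikely under the configured ratio, so results should be treated as untrustworthy until the bucketing or logging cause is found.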

Follow-up questions

How do you handle network experiments (recsys, marketplace)?
What's the SRM detection mechanism?
How do you handle multiple overlapping experiments mathematically?

Related patterns

feature-flags, queue-decoupling, cdc