Deployment Patterns for ML: Why Blue/Green Fails for Models
Blue/green is a deploy pattern for code, where the failure mode is crashes. Models need a different one because their failure mode is silent quality drift. This lesson covers the shadow → canary → interleaved/A/B → ramped pattern and when each step is the right call.
The deploy patterns engineering teams have spent two decades refining for code (blue/green, canary, feature flags) were designed around a specific failure mode: the new code crashes or returns errors, and you roll back. ML changes don't fail that way. They fail by being subtly worse on a slice of traffic that nobody notices for three weeks, by which time the team has shipped two more model versions on top. Applying blue/green to ML is like installing a fire alarm to detect floods: the right alarm for the wrong failure.
The Quality-Aware Rollout Ladder is the four-step progression designed around the silent-quality-drift failure mode: shadow, canary, interleaving or A/B, then ramped rollout with holdback. Each step catches a class of failure that the previous step couldn't, and the order matters. Shadow catches gross failures before any user sees them, canary catches quality regressions on small traffic, interleaving or A/B measures whether the change is actually better, and the permanent holdback measures whether the improvement persists at retention timescale. Skipping a single step is rarely catastrophic, but compounding skipped steps is how ML platforms drift into a state where no one trusts the model changes anyone ships.
The Quality-Aware Rollout Ladder
Blue/green is a deploy pattern for code, where the failure mode is crashes. Models have a different failure mode — silent quality drift — and need a deploy pattern designed around quality, not uptime. The Quality-Aware Rollout Ladder is the four-step progression that lets you ship model changes with the same operational confidence as code, while catching the failure modes that blue/green hides.
- Step 1: Shadow mode. Run the new model in parallel with the current production model, log both predictions, but serve only the current model's output. Lasts as long as it takes to build confidence, typically 1-7 days. The single most under-used deploy pattern in ML. Catches gross failures (model loading errors, latency regressions, output distribution shifts) at zero user risk.
- Step 2: Canary at low traffic. Serve the new model to 1-5% of traffic, ideally to a known low-stakes population (employees, beta cohort, low-value users). Measure quality metrics, not just system metrics. Hold for 24-72 hours depending on metric variance. Most quality regressions show up here if the rollout discipline includes per-version observability.
- Step 3: Interleaving or A/B at moderate traffic. Either interleave (mix old and new model outputs in the same response and observe user behavior) or A/B (split users between models and compare aggregate metrics). Interleaving is more efficient because the same user is exposed to both, but it only works for slate-based products. A/B is universal but slower and bias-prone. Both run for 1-3 weeks depending on the metric.
- Step 4: Ramped rollout with holdback. Move traffic up in stages (10% → 25% → 50% → 100%) with a permanent holdback group (1-5% of users) on the old model for long-term comparison. The holdback is what tells you whether the new model's improvement persists over months; without it, you cannot measure retention-level effects on the timescale they actually move.
- The hidden requirement: quality observability per version. None of the four steps works without per-version quality metrics. If your dashboard shows aggregate quality but not per-model-version quality, you cannot tell which step's traffic is degrading. The Ladder's prerequisite is observability that tags every prediction with the model version and exposes per-version quality at the same fidelity as aggregate latency.
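Step 1 and the tagging requirement can be sketched together. The following is a minimal illustration, not a reference implementation: `serve_with_shadow`, `PredictionRecord`, the `(version, predict)` tuples, and the in-memory `log` list are all hypothetical names standing in for whatever serving framework and prediction log you actually run. The shadow call is sequential here for brevity; a production system would typically run it off the request path so it adds no user-facing latency.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Model = Tuple[str, Callable[[Any], Any]]  # (model_version, predict function)


@dataclass
class PredictionRecord:
    request_id: str
    model_version: str  # every logged prediction carries its version tag
    output: Any
    latency_ms: float
    served: bool  # True only for the model on the user-facing path
    error: Optional[str] = None


def _timed_call(predict: Callable[[Any], Any], features: Any):
    """Run one model, capturing its output, any exception, and latency."""
    start = time.perf_counter()
    try:
        output, error = predict(features), None
    except Exception as exc:
        output, error = None, repr(exc)
    return output, error, (time.perf_counter() - start) * 1000.0


def serve_with_shadow(request_id: str, features: Any, current: Model,
                      candidate: Model, log: List[PredictionRecord]) -> Any:
    """Serve the current model; run the candidate in shadow and log both."""
    out, err, ms = _timed_call(current[1], features)
    log.append(PredictionRecord(request_id, current[0], out, ms, True, err))

    # The candidate sees the same input, but its output is only logged.
    # A candidate exception is recorded, never propagated to the user.
    s_out, s_err, s_ms = _timed_call(candidate[1], features)
    log.append(PredictionRecord(request_id, candidate[0], s_out, s_ms, False, s_err))

    if err is not None:
        raise RuntimeError(f"production model failed: {err}")
    return out


# Toy usage: `sum` and `max` stand in for the two models.
log: List[PredictionRecord] = []
served = serve_with_shadow("req-1", [1, 2, 3], ("v1", sum), ("v2", max), log)
```

Because both records share a `request_id` and differ only in `model_version` and `served`, the same log can feed loading-error counts, latency comparisons, and output-distribution comparisons per version.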
Apply the Ladder to any model deployment in a system that serves real users. Skip it only for offline-only models with no user-facing impact. The Ladder is also the right opening for 'how do you safely roll out an ML change?' questions, because the answer is the four steps — not 'we'd use blue/green.'
Senior answer to 'how do you deploy a new model safely': 'Blue/green with monitoring.' Staff answer: 'Four steps. Shadow for 3 days to catch loading errors, latency regressions, and output distribution shifts. Canary to 2% of low-stakes traffic for 48 hours with per-version quality metrics. Interleaving (if slate-based) or A/B with proper randomization for 2 weeks against the primary metric and guardrails. Ramped rollout to 100% with a 2% permanent holdback for long-term retention measurement. The holdback runs for months because that's the timescale retention actually moves on. Without per-version observability, none of these steps tell you anything you don't already know.'
Your team ships a new recsys ranking model. Walk me through the rollout.
Operational reality probe. The interviewer wants to see whether you have the Ladder mental model or whether you'll default to blue/green.
We'd blue/green it — bring up the new model, route some traffic, monitor, then ramp.
Canary first. Send 5% to the new model, monitor metrics for 24 hours, then ramp if everything looks good.
Four steps. Shadow for 2-3 days to catch loading errors and confirm latency. Canary at 2% of low-stakes traffic for 48 hours with per-version quality metrics, not just system metrics. A/B at 50/50 for 2 weeks against the primary metric (retention) and guardrails (CTR, watch time). Ramp to 100% with a permanent 2% holdback on the old model. Each step has a different question — shadow asks 'does it work?', canary asks 'does it work on real users without breaking?', A/B asks 'is it better?', holdback asks 'is it still better in 3 months?'
Same four steps with two additions. (1) The 'is it better' step has to define what 'better' means before the rollout starts. The willingness-to-trade ratio between primary metric and guardrails must be negotiated with product upfront — not at the moment of decision when the data is in. Otherwise the team will retroactively justify whichever model has the cleanest story. (2) The permanent holdback is operationally expensive — 2% of traffic running on a stale model accumulates drift over time, and the team has to commit to maintaining it as a system, not setting it up once and forgetting. Many teams skip the holdback because they don't want to operate two models long-term, and then they cannot measure retention effects properly. The pattern: rollout discipline is a commitment to ongoing operational cost, not a one-time setup. Treating it as one-time is the canonical Senior-vs-Staff gap in ML platform conversations.
Made the willingness-to-trade ratio a pre-rollout decision (avoiding retroactive justification) and named the holdback as ongoing operational cost (avoiding the 'set it up once' failure). Both are L7 because they convert the framework from a procedure into a commitment.
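One way to make the willingness-to-trade ratio a pre-rollout commitment is to write it down as data before the experiment starts, so the ship decision becomes a lookup instead of a negotiation. The sketch below is illustrative only: `TradePolicy`, `ship_decision`, and every number in the usage example are assumptions, and a linear "tolerated guardrail drop per unit of primary lift" is just one possible way to express the ratio.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TradePolicy:
    """Agreed with product BEFORE the rollout starts (values are hypothetical)."""
    min_primary_lift: float  # smallest relative primary lift worth shipping
    # guardrail name -> tolerated relative drop per unit of primary lift
    max_drop_per_unit_lift: Dict[str, float]


def ship_decision(policy: TradePolicy, primary_lift: float,
                  guardrail_deltas: Dict[str, float]) -> Tuple[bool, str]:
    """Apply the pre-agreed policy to measured relative changes.

    Deltas are relative changes versus the old model; negative means worse.
    """
    if primary_lift < policy.min_primary_lift:
        return False, "primary lift below pre-agreed minimum"
    for name, ratio in policy.max_drop_per_unit_lift.items():
        allowed_drop = ratio * primary_lift
        observed_drop = -guardrail_deltas[name]
        if observed_drop > allowed_drop:
            return False, f"{name} drop exceeds pre-agreed trade ratio"
    return True, "within pre-agreed trade ratio"


# Hypothetical policy: each 1% of retention lift buys at most a 0.5% CTR drop
# and a 0.25% watch-time drop.
policy = TradePolicy(0.01, {"ctr": 0.5, "watch_time": 0.25})
decision = ship_decision(policy, 0.02, {"ctr": -0.005, "watch_time": 0.0})
```

The point of the frozen dataclass is that the policy is fixed before the data arrives; changing it after the A/B finishes should require the same conversation with product that created it.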
Someone proposes deploying a new model directly to ≥25% of traffic.
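Stop. The Ladder's lower steps catch a class of failure that a ≥25% deployment cannot catch cheaply. Going straight to 25% without shadow and canary exposes a quarter of users to failures that shadow would catch at zero user exposure and canary at about 2%, all to save roughly five days of slow rollout (3 of shadow, 2 of canary).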
Practice this. Time yourself.
You have 12 minutes. A team wants to roll out a new ranker that 'looked great in offline eval — 5% better on the primary metric.' They propose going directly to 100%. Walk them through the Ladder, naming what each step would catch that they're proposing to skip. Then propose the timeline they should commit to. Write 4 paragraphs: (1) the failure modes a 100%-direct rollout would expose them to, (2) the catch-cost per Ladder step, (3) the timeline, (4) the observability prerequisite.
Self-assessment rubric
| Dimension | Weak | Passing | Strong | Staff bar |
|---|---|---|---|---|
| Direct-rollout failure modes | Said 'they might have a regression.' | Named 2-3 specific failure modes. | Named: model loading errors, latency regression, output distribution shift, offline-online metric divergence (the recurring failure mode from Lesson 3.1). | Same plus: named that 'looked great in offline' is the canonical pre-incident signal, because offline metrics overstate online performance routinely for structural reasons. |
| Catch-cost per step | Said each step 'catches problems.' | Per-step catch with one example. | Per-step catch with specific failure class and 'what would have happened on direct rollout' counterfactual. | Same plus: explicit cost-benefit per step. Shadow costs 3 days and catches gross failures at zero risk. Canary costs 2 days and catches quality regressions on 2% of traffic instead of 100%. Each step is justified as insurance with a price tag. |
| Timeline commitment | Vague 'a few weeks.' | Total timeline of ~3 weeks. | Per-step duration: 3 days shadow, 2 days canary, 14 days A/B, then ramp. | Per-step duration AND the criteria for advancing to the next step (e.g., 'advance from canary if quality metric within 1% of current production and no per-class regression detected'). |
| Observability prerequisite | Did not name. | Said 'we need monitoring.' | Named per-version quality metrics as the prerequisite for the Ladder to work. | Per-version observability AND distribution-shift detection AND quality-regression alerting at the per-class level. Without these, the Ladder is theater. |
Common failures
- ✗ Compressed the timeline to 'a few days' to please the team's urgency. The Ladder's duration is the cost of insurance; compressing it is volunteering for risk you don't have to take.
- ✗ Did not connect to per-version observability. Without it, the Ladder steps are theater.
- ✗ Treated the 5% offline lift as evidence of safety. Offline metrics overstate online performance routinely.
- ✗ Did not propose the permanent holdback. Without it, retention effects are unmeasurable.
The Quality-Aware Rollout Checklist
Step 1 — Shadow (Days 0-3)
- ☐ New model serving real production traffic in parallel; outputs logged but not served.
- ☐ Monitor: model loads correctly, p99 latency vs current production, output distribution KL divergence on labeled set.
- ☐ Advance criterion: latency within 10%, KL divergence < threshold, no errors.
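The Step 1 advance criterion can be expressed as a small gate function. This is a sketch under stated assumptions: model outputs are scalar scores in [0, 1], distributions are compared as 10-bin smoothed histograms, and the divergence is computed as KL(candidate || current). The function names are hypothetical, and `kl_threshold` is deliberately a required argument because the checklist does not fix a value.

```python
import math
from typing import Dict, List, Sequence


def histogram(scores: Sequence[float], bins: int = 10, lo: float = 0.0,
              hi: float = 1.0, smoothing: float = 1e-6) -> List[float]:
    """Bin scores into a smoothed probability distribution over [lo, hi]."""
    counts = [0] * bins
    for s in scores:
        idx = int((s - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range scores
    total = len(scores) + smoothing * bins
    return [(c + smoothing) / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; inputs must be strictly positive, same length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def shadow_gate(current_scores: Sequence[float], candidate_scores: Sequence[float],
                current_p99_ms: float, candidate_p99_ms: float,
                candidate_errors: int, kl_threshold: float,
                max_latency_ratio: float = 1.10) -> Dict[str, object]:
    """Evaluate the three shadow checks: latency, KL divergence, errors."""
    kl = kl_divergence(histogram(candidate_scores), histogram(current_scores))
    checks = {
        "latency_ok": candidate_p99_ms <= current_p99_ms * max_latency_ratio,
        "kl_ok": kl < kl_threshold,
        "errors_ok": candidate_errors == 0,
    }
    return {"kl": kl, **checks, "advance": all(checks.values())}
```

The smoothing term keeps empty bins from producing a division by zero or an infinite divergence; the bin count and smoothing value are tuning choices, not recommendations.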
Step 2 — Canary (Days 3-5)
- ☐ New model serves 2% of traffic, ideally to low-stakes population.
- ☐ Monitor: per-version quality metrics, per-input-class regression, error rate.
- ☐ Advance criterion: quality within 1% of current AND no per-class regression beyond noise threshold.
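The Step 2 advance criterion has two halves, and the per-class half is the one aggregate dashboards hide. A minimal sketch, assuming quality is a higher-is-better score per input class and that "within 1%" means a relative drop of at most 1%. `canary_gate` is a hypothetical name, and `per_class_noise` is a required argument because the noise threshold depends on your metric's variance.

```python
from typing import Dict, List, Tuple


def canary_gate(current_overall: float, candidate_overall: float,
                current_by_class: Dict[str, float],
                candidate_by_class: Dict[str, float],
                per_class_noise: float,
                max_overall_drop: float = 0.01) -> Tuple[bool, List[str]]:
    """Return (advance, regressed_classes) for the canary step."""
    overall_drop = (current_overall - candidate_overall) / current_overall
    overall_ok = overall_drop <= max_overall_drop

    regressed = []
    for cls in sorted(current_by_class):
        cand = candidate_by_class.get(cls)
        # A class with no candidate data is treated as a regression so that
        # missing coverage blocks the rollout instead of passing silently.
        if cand is None or current_by_class[cls] - cand > per_class_noise:
            regressed.append(cls)
    return overall_ok and not regressed, regressed


# Aggregate quality is within 1%, but one class has regressed badly.
advance, regressed = canary_gate(
    0.80, 0.795,
    {"new_users": 0.70, "power_users": 0.85},
    {"new_users": 0.60, "power_users": 0.86},
    per_class_noise=0.02,
)
```

In the usage example the aggregate check passes while `new_users` fails, which is exactly the silent-slice failure the Ladder is designed to surface.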
Step 3 — A/B (Days 5-19)
- ☐ 50/50 split. Primary metric and guardrails measured.
- ☐ Willingness-to-trade ratio between primary and guardrails pre-negotiated with product before A/B starts.
- ☐ Advance criterion: primary metric improvement statistically significant AND guardrails within ratio.
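For the "statistically significant" half of the Step 3 criterion, one common choice when the primary metric is a per-user binary outcome (for example, retained or not) is a two-proportion z-test. The sketch below assumes that setup and a conventional significance level of 0.05; both are assumptions, and a continuous or heavily skewed metric would call for a different test. The function names and the counts in the usage example are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference in proportions (B minus A)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def primary_metric_improved(successes_old: int, n_old: int,
                            successes_new: int, n_new: int,
                            alpha: float = 0.05) -> bool:
    """True when the new model's rate is higher and the difference is significant."""
    z, p_value = two_proportion_z_test(successes_old, n_old, successes_new, n_new)
    return z > 0 and p_value < alpha


# Illustrative counts only: 40.0% vs 42.0% retained, 10,000 users per arm.
z, p_value = two_proportion_z_test(4000, 10_000, 4200, 10_000)
```

This covers only the primary metric. The guardrail half of the criterion still has to be checked against the ratio agreed with product before the A/B started.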
Step 4 — Ramped + holdback (Days 19+)
- ☐ Ramp to 100% in stages (10% → 25% → 50% → 100% over a week).
- ☐ Permanent holdback (1-5% of users) retained on old model for long-term comparison.
- ☐ Holdback is an ongoing operational commitment; assign owner.
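A ramp with a permanent holdback needs two properties: holdback membership must never change as the ramp advances, and a user who has already seen the new model should keep seeing it at the next stage. One way to get both is deterministic hashing with separate salts, sketched below. The salt strings, version labels, and `assign_model` itself are hypothetical; the 2% default holdback is one value from the 1-5% range above.

```python
import hashlib

BUCKETS = 10_000  # basis-point resolution


def _bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) for a given salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign_model(user_id: str, ramp_fraction: float,
                 holdback_fraction: float = 0.02,
                 old_version: str = "v1", new_version: str = "v2") -> str:
    """Pick the model version for a user at the current ramp stage."""
    # The holdback uses its own salt, so its membership is independent of
    # the ramp and stays fixed even when ramp_fraction reaches 1.0.
    if _bucket(user_id, "holdback") < round(holdback_fraction * BUCKETS):
        return old_version
    # Ramp buckets are compared against a growing cutoff, so anyone on the
    # new model at 10% is still on it at 25%, 50%, and 100%.
    if _bucket(user_id, "ramp") < round(ramp_fraction * BUCKETS):
        return new_version
    return old_version
```

Because the ramp is applied only to users outside the holdback, the effective share on the new model at each stage is slightly below the nominal ramp fraction, and "100%" in practice means everyone except the holdback.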
Prerequisite — per-version observability
- ☐ Every prediction tagged with model_version.
- ☐ Per-version quality dashboards at same fidelity as latency.
- ☐ Distribution-shift alerts on input features per version.
- ☐ Per-input-class quality breakdowns visible without ad-hoc queries.
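Once every prediction carries a `model_version` tag, the per-version and per-class breakdowns reduce to a grouped aggregation. A minimal sketch, assuming each logged record is a dict with hypothetical `input_class` and `quality` fields alongside `model_version`, where `quality` is any numeric per-prediction outcome such as a click or a label match. In practice this would be a query against your metrics store, not an in-memory loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

ALL_CLASSES = "__all__"  # key used for the per-version aggregate


def quality_by_version(records: Iterable[Mapping]) -> Dict[Tuple[str, str], float]:
    """Mean quality keyed by (model_version, input_class).

    Each record contributes to its own class and to the per-version
    aggregate, so both views come from the same pass over the log.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for r in records:
        version = r["model_version"]
        for key in ((version, ALL_CLASSES), (version, r["input_class"])):
            totals[key][0] += r["quality"]
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


records = [
    {"model_version": "v1", "input_class": "new_users", "quality": 1.0},
    {"model_version": "v1", "input_class": "power_users", "quality": 0.0},
    {"model_version": "v2", "input_class": "new_users", "quality": 0.0},
    {"model_version": "v2", "input_class": "new_users", "quality": 1.0},
    {"model_version": "v2", "input_class": "power_users", "quality": 1.0},
]
metrics = quality_by_version(records)
```

The output of a function like this is what the canary and A/B gates consume: without the `model_version` key there is nothing to compare.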
Large e-commerce platform. Recommendation ranker team had a clean rollout discipline — shadow, canary, A/B, ramp — for two years. A new tech lead joined and proposed compressing rollouts to ship faster. The proposed compression: skip shadow when 'offline metrics look great' (defined as >3% lift on primary metric).
Six weeks after the compression policy went into effect, a new ranker shipped through skip-shadow → canary at 5% → straight to 50% A/B. Canary monitoring caught nothing visible. The A/B showed the new ranker was 2% better on aggregate. Three weeks into the A/B, an enterprise customer reported that their product detail pages were sometimes showing related products from a completely different category. Investigation revealed the new ranker had a subtle output schema change that downstream caching had been silently dropping for 8% of slates. Shadow would have caught this in 24 hours by comparing output distributions. The compression policy cost roughly two weeks of investigation, a customer escalation, and a full rollback.
The post-incident review identified that the skip-shadow policy had been justified as 'we're confident in our offline metrics.' The metrics didn't lie; the offline pipeline didn't measure schema compatibility because it had no reason to. The failure mode was exactly the kind shadow exists to catch — invisible to quality metrics, visible only to output-distribution monitoring. The tech lead's 'ship faster' policy had removed the step that catches the cheapest failures cheapest.
When the tech lead proposed the compression: 'Each Ladder step exists to catch a specific class of failure that the next step cannot catch as cheaply. Shadow catches schema, loading, latency, distribution-shift failures at zero user risk and 3 days of compute. Canary catches quality regressions at 2% user risk. Skipping shadow saves 3 days and trades zero-user-risk insurance for 2%-user-risk insurance on the same class of failure. The math doesn't work — we're paying with users instead of compute. The compression I'd support is in A/B duration when the primary metric moves fast (recsys engagement signals stabilize in 7-10 days); shadow is the wrong step to compress.'
Each step of the Quality-Aware Rollout Ladder catches a different class of failure at a different cost. Compressing the wrong step trades a cheap failure category for an expensive one. The Staff move in 'ship faster' conversations is to name which step compresses safely (usually A/B duration when metrics stabilize quickly) and which step is non-negotiable (usually shadow, because the failures it catches are invisible elsewhere). Treating the Ladder as a fixed pipeline misses where the real flexibility lives.