Module 3 · Lesson 3 · Core · 32 min

Deployment Patterns for ML: Why Blue/Green Fails for Models

Blue/green is a deploy pattern for code, where the failure mode is crashes. Models need a different one because their failure mode is silent quality drift. This lesson covers the shadow → canary → interleaved/A/B → ramped pattern and when each step is the right call.

The deploy patterns engineering teams have spent two decades refining for code — blue/green, canary, feature flags — were designed around a specific failure mode: the new code crashes or returns errors, and you roll back. ML changes don't fail that way. ML changes fail by being subtly worse on a slice of traffic that nobody notices for three weeks, by the time the team has shipped two more model versions on top. Applying blue/green to ML is like applying a fire alarm to flood detection — the right alarm for the wrong failure.

The Quality-Aware Rollout Ladder is the four-step progression designed around the silent-quality-drift failure mode. Shadow, canary, interleaving or A/B, ramped with holdback. Each step catches a class of failure that the previous step couldn't, and the order matters — shadow catches gross failures before any user sees them, canary catches quality regressions on small traffic, interleaving or A/B measures whether the change is actually better, and the permanent holdback measures whether the improvement persists at retention timescale. Skipping a step is rarely catastrophic, but compounding skipped steps is how ML platforms drift into a state where no one trusts the model changes anyone ships.

Framework

The Quality-Aware Rollout Ladder

Blue/green is a deploy pattern for code, where the failure mode is crashes. Models have a different failure mode — silent quality drift — and need a deploy pattern designed around quality, not uptime. The Quality-Aware Rollout Ladder is the four-step progression that lets you ship model changes with the same operational confidence as code, while catching the failure modes that blue/green hides.

1
Step 1 — Shadow mode
Run the new model in parallel with the current production model, log both predictions, but serve only the current model's output. Lasts as long as it takes to build confidence — typically 1-7 days. The single most under-used deploy pattern in ML. Catches gross failures (model loading errors, latency regressions, output distribution shifts) at zero user risk.
2
Step 2 — Canary at low traffic
Serve the new model to 1-5% of traffic, ideally to a known low-stakes population (employees, beta cohort, low-value users). Measure quality metrics, not just system metrics. Hold for 24-72 hours depending on metric variance. Most quality regressions show up here if the rollout discipline includes per-version observability.
3
Step 3 — Interleaving or A/B at moderate traffic
Either interleave (mix old and new model outputs in the same response and observe user behavior) or A/B (split users between models and compare aggregate metrics). Interleaving is more efficient — same user is exposed to both — but only works for slate-based products. A/B is universal but slower and bias-prone. Both run for 1-3 weeks depending on metric.
4
Step 4 — Ramped rollout with holdback
Move traffic up in stages (10% → 25% → 50% → 100%) with a permanent holdback group (1-5% of users) on the old model for long-term comparison. The holdback is what tells you whether the new model's improvement persists over months; without it, you cannot measure retention-level effects on the timescale they actually move.
5
The hidden requirement — quality observability per version
None of the four steps work without per-version quality metrics. If your dashboard shows aggregate quality but not per-model-version quality, you cannot tell which step's traffic is degrading. The Ladder's prerequisite is observability that tags every prediction with the model version and exposes per-version quality at the same fidelity as aggregate latency.

When to use

Apply the Ladder to any model deployment in a system that serves real users. Skip it only for offline-only models with no user-facing impact. The Ladder is also the right opening for 'how do you safely roll out an ML change?' questions, because the answer is the four steps — not 'we'd use blue/green.'

Worked example

Senior answer to 'how do you deploy a new model safely': 'Blue/green with monitoring.' Staff answer: 'Four steps. Shadow for 3 days to catch loading errors, latency regressions, and output distribution shifts. Canary to 2% of low-stakes traffic for 48 hours with per-version quality metrics. Interleaving (if slate-based) or A/B with proper randomization for 2 weeks against the primary metric and guardrails. Ramped rollout to 100% with a 2% permanent holdback for long-term retention measurement. The holdback runs for months because that's the timescale retention actually moves on. Without per-version observability, none of these steps tell you anything you don't already know.'

Calibration ladder

Your team ships a new recsys ranking model. Walk me through the rollout.

Operational reality probe. The interviewer wants to see whether you have the Ladder mental model or whether you'll default to blue/green.

L4 · Mid

We'd blue/green it — bring up the new model, route some traffic, monitor, then ramp.

Missed: Treated ML like code. Will ship silent quality regressions.

L5 · Senior

Canary first. Send 5% to the new model, monitor metrics for 24 hours, then ramp if everything looks good.

Missed: Knew about canary but didn't decompose the rollout into purpose-driven steps.

L6 · Staff

Four steps. Shadow for 2-3 days to catch loading errors and confirm latency. Canary at 2% of low-stakes traffic for 48 hours with per-version quality metrics, not just system metrics. A/B at 50/50 for 2 weeks against the primary metric (retention) and guardrails (CTR, watch time). Ramp to 100% with a permanent 2% holdback on the old model. Each step has a different question — shadow asks 'does it work?', canary asks 'does it work on real users without breaking?', A/B asks 'is it better?', holdback asks 'is it still better in 3 months?'

Missed: Strong four-step pattern. Missing the meta-move — naming that the willingness-to-trade ratio must be pre-negotiated and that the holdback is ongoing operational commitment.

L7 · Principal

Same four steps with two additions. (1) The 'is it better' step has to define what 'better' means before the rollout starts. The willingness-to-trade ratio between primary metric and guardrails must be negotiated with product upfront — not at the moment of decision when the data is in. Otherwise the team will retroactively justify whichever model has the cleanest story. (2) The permanent holdback is operationally expensive — 2% of traffic running on a stale model accumulates drift over time, and the team has to commit to maintaining it as a system, not setting it up once and forgetting. Many teams skip the holdback because they don't want to operate two models long-term, and then they cannot measure retention effects properly. The pattern: rollout discipline is a commitment to ongoing operational cost, not a one-time setup. Treating it as one-time is the canonical Senior-vs-Staff gap in ML platform conversations.

What scored L7

Made the willingness-to-trade ratio a pre-rollout decision (avoiding retroactive justification) and named the holdback as ongoing operational cost (avoiding the 'set it up once' failure). Both are L7 because they convert the framework from a procedure into a commitment.

Pattern recognition

When you see

Someone proposes deploying a new model directly to ≥25% of traffic.

→

Think

Stop. The Ladder's lower steps catch a class of failure that ≥25% deployment cannot. Going to 25% without shadow and canary trades 21% of users against the cost of 3 days of slow rollout.

The 'we're confident, let's ship to half' move is the canonical pre-incident move. The model team is confident because offline metrics look great; the offline metrics overstate online performance for the reasons in Lesson 3.1 (point-in-time leaks, feature skew, distribution shift); the rollout exposes the gap to half of users; the regression is invisible for a week because there's no per-version quality observability. Every step the Ladder skips is risk you carry on real traffic. The slow rollout is the cheap version of insurance against the failure mode you don't know exists yet.

Drill · 12 minutes

Practice this. Time yourself.

You have 12 minutes. A team wants to roll out a new ranker that 'looked great in offline eval — 5% better on the primary metric.' They propose going directly to 100%. Walk them through the Ladder, naming what each step would catch that they're proposing to skip. Then propose the timeline they should commit to. Write 4 paragraphs: (1) the failure modes a 100%-direct rollout would expose them to, (2) the catch-cost per Ladder step, (3) the timeline, (4) the observability prerequisite.

Self-assessment rubric

Dimension	Weak	Passing	Strong	Staff bar
Direct-rollout failure modes	Said 'they might have a regression.'	Named 2-3 specific failure modes.	Named: model loading errors, latency regression, output distribution shift, offline-online metric divergence (the recurring failure mode from Lesson 3.1).	Same plus: named that 'looked great in offline' is the canonical pre-incident signal, because offline metrics overstate online performance routinely for structural reasons.
Catch-cost per step	Said each step 'catches problems.'	Per-step catch with one example.	Per-step catch with specific failure class and 'what would have happened on direct rollout' counterfactual.	Same plus: explicit cost-benefit per step. Shadow costs 3 days and catches gross failures at zero risk. Canary costs 2 days and catches quality regressions on 2% of traffic instead of 100%. Each step is justified as insurance with a price tag.
Timeline commitment	Vague 'a few weeks.'	Total timeline of ~3 weeks.	Per-step duration: 3 days shadow, 2 days canary, 14 days A/B, then ramp.	Per-step duration AND the criteria for advancing to the next step (e.g., 'advance from canary if quality metric within 1% of current production and no per-class regression detected').
Observability prerequisite	Did not name.	Said 'we need monitoring.'	Named per-version quality metrics as the prerequisite for the Ladder to work.	Per-version observability AND distribution-shift detection AND quality-regression alerting at the per-class level. Without these, the Ladder is theater.

Reveal model solution

Direct-rollout failure modes. The 'looked great offline' framing is the most common precursor to an online regression. Specific failure modes the team is volunteering for: (1) Model loading errors — the new model artifact has a different schema or fails on the production hardware; 100% deployment means 100% of users see errors. (2) Latency regression — the new model has higher inference cost than measured offline; production p99 violates SLA. (3) Output distribution shift — the model's outputs are systematically different from the old model in ways that break downstream consumers (cache invalidation, re-rank assumptions). (4) Offline-online metric divergence — the 5% offline improvement is partially or entirely a measurement artifact (point-in-time leak, feature skew, distribution shift between training and serving cohorts). Each of these is a known failure mode and each is roughly an order of magnitude cheaper to catch in shadow or canary than in production. Catch-cost per step. Shadow (3 days, zero user impact): catches the first three failure modes — loading errors, latency, distribution shift — at zero risk. Cost: 3 days of dual-serving compute. Counterfactual: any of these in direct rollout is a same-day Sev-1 incident. Canary (2 days, 2% traffic): catches quality regressions on real users at 50× lower exposure than direct rollout. Cost: 2 days of running two models on production traffic. Counterfactual: a 4% regression on 2% of users costs the same as a 0.08% regression on 100% — the canary buys two orders of magnitude in trade-off. A/B (14 days, 50/50): catches the offline-online metric divergence by measuring online effect directly. Cost: 14 days of split traffic with rigorous measurement. Counterfactual: shipping to 100% with no A/B means the team cannot distinguish 'new model is better' from 'new model is worse but seasonal effects are masking it.' Timeline. Day 0 to 3: shadow. Advance if: model loads correctly, p99 latency within 10% of current production, output distribution shift below 0.1 KL on labeled set. Day 3 to 5: canary at 2%. Advance if: quality metric within 1% of current production AND no per-input-class regression beyond noise threshold. Day 5 to 19: A/B at 50/50. Advance to ramped rollout if: primary metric improvement is statistically significant AND guardrail metrics within pre-negotiated trade ratio. Day 19 to 25: ramp from 50% to 100% in stages with a permanent 2% holdback retained on the old model. Total ~25 days; team can compress if specific risks are explicitly accepted, but the compression should be a documented decision with named accountability. Observability prerequisite. None of this works without per-version quality metrics. Specifically: every prediction tagged with model_version, quality proxies (LLM-as-judge sample for generation tasks, downstream engagement for recsys) reported per model_version, distribution-shift alerts on input feature distributions per model_version, per-input-class quality breakdowns to catch slice-specific regressions that aggregate metrics hide. If the team's observability does not include these, the Ladder is theater — they will advance through steps based on the absence of obvious failures, not on the presence of quality signal. The right answer to a team without per-version observability is 'we build that first, then we deploy.' The week of observability work pays back on the first non-trivial rollout.

Common failures

✗Compressed the timeline to 'a few days' to please the team's urgency. The Ladder's duration is the cost of insurance; compressing it is volunteering for risk you don't have to take.
✗Did not connect to per-version observability. Without it, the Ladder steps are theater.
✗Treated the 5% offline lift as evidence of safety. Offline metrics overstate online performance routinely.
✗Did not propose the permanent holdback. Without it, retention effects are unmeasurable.

Artifact · checklist

The Quality-Aware Rollout Checklist

Step 1 — Shadow (Days 0-3)

☐New model serving real production traffic in parallel; outputs logged but not served.
☐Monitor: model loads correctly, p99 latency vs current production, output distribution KL divergence on labeled set.
☐Advance criterion: latency within 10%, KL divergence < threshold, no errors.

Step 2 — Canary (Days 3-5)

☐New model serves 2% of traffic, ideally to low-stakes population.
☐Monitor: per-version quality metrics, per-input-class regression, error rate.
☐Advance criterion: quality within 1% of current AND no per-class regression beyond noise threshold.

Step 3 — A/B (Days 5-19)

☐50/50 split. Primary metric and guardrails measured.
☐Willingness-to-trade ratio between primary and guardrails pre-negotiated with product before A/B starts.
☐Advance criterion: primary metric improvement statistically significant AND guardrails within ratio.

Step 4 — Ramped + holdback (Days 19+)

☐Ramp to 100% in stages (10% → 25% → 50% → 100% over a week).
☐Permanent holdback (1-5% of users) retained on old model for long-term comparison.
☐Holdback is an ongoing operational commitment; assign owner.

Prerequisite — per-version observability

☐Every prediction tagged with model_version.
☐Per-version quality dashboards at same fidelity as latency.
☐Distribution-shift alerts on input features per version.
☐Per-input-class quality breakdowns visible without ad-hoc queries.

Post-mortem · anonymized

Setup

Large e-commerce platform. Recommendation ranker team had a clean rollout discipline — shadow, canary, A/B, ramp — for two years. A new tech lead joined and proposed compressing rollouts to ship faster. The proposed compression: skip shadow when 'offline metrics look great' (defined as >3% lift on primary metric).

What happened

Six weeks after the compression policy went into effect, a new ranker shipped through skip-shadow → canary at 5% → straight to 50% A/B. Canary monitoring caught nothing visible. The A/B showed the new ranker was 2% better on aggregate. Three weeks into the A/B, an enterprise customer reported that their product detail pages were sometimes showing related products from a completely different category. Investigation revealed the new ranker had a subtle output schema change that downstream caching had been silently dropping for 8% of slates. Shadow would have caught this in 24 hours by comparing output distributions. The compression policy cost roughly two weeks of investigation, a customer escalation, and a full rollback.

The moment

The post-incident review identified that the skip-shadow policy had been justified as 'we're confident in our offline metrics.' The metrics didn't lie; the offline pipeline didn't measure schema compatibility because it had no reason to. The failure mode was exactly the kind shadow exists to catch — invisible to quality metrics, visible only to output-distribution monitoring. The tech lead's 'ship faster' policy had removed the step that catches the cheapest failures cheapest.

What they should have said

When the tech lead proposed the compression: 'Each Ladder step exists to catch a specific class of failure that the next step cannot catch as cheaply. Shadow catches schema, loading, latency, distribution-shift failures at zero user risk and 3 days of compute. Canary catches quality regressions at 2% user risk. Skipping shadow saves 3 days and trades zero-user-risk insurance for 2%-user-risk insurance on the same class of failure. The math doesn't work — we're paying with users instead of compute. The compression I'd support is in A/B duration when the primary metric moves fast (recsys engagement signals stabilize in 7-10 days); shadow is the wrong step to compress.'

Lesson

Each step of the Quality-Aware Rollout Ladder catches a different class of failure at a different cost. Compressing the wrong step trades a cheap failure category for an expensive one. The Staff move in 'ship faster' conversations is to name which step compresses safely (usually A/B duration when metrics stabilize quickly) and which step is non-negotiable (usually shadow, because the failures it catches are invisible elsewhere). Treating the Ladder as a fixed pipeline misses where the real flexibility lives.