
The outage that outlived its cause

A dependency hiccups for fifteen seconds — a GC pause, a brief network blip, nothing unusual. Requests to it time out. Your service, being robust, retries them. Each retry is more load on a dependency that was already struggling, so more requests time out, so more get retried. The fifteen-second blip ends. The dependency’s underlying problem is completely gone. And your system stays down.

It stays down because the retries are now the problem. The dependency is pinned at 100% utilization serving retry traffic for requests whose original callers gave up long ago, and every request that times out spawns new retries as fast as old ones complete. The system has found a stable, terrible equilibrium (a metastable failure), and it will sit there until a human sheds the load. The thing that was supposed to make you reliable (retries) is the thing keeping you down. This module is about the layered defenses that prevent this, and the precise way each one, misconfigured, causes the outage it was meant to stop.

Reliability isn’t one trick; it’s a stack of defenses applied in order, each catching a specific failure and each capable of causing one if you set it wrong. The senior skill isn’t knowing the patterns — it’s knowing the second-order effect of each, because the most dangerous outages are caused by a reliability mechanism doing exactly what it was told.

The failure mode that defines the module

A normal overload is self-correcting: remove the spike and the system recovers. A metastable failure is not. A trigger (a blip, a deploy, a traffic spike) pushes the system into a state that sustains itself through a feedback loop — almost always retries — even after the trigger is gone. The defining test: if removing the original cause does not restore service, you’re in a metastable failure, and you cannot wait it out. Drive the simulator below into one.

Retry-storm simulator · capacity 100, base load 70

Controls: retries per request (shown: 2) · retry budget (caps retries at 20% of load)

Legend: offered load · over capacity · per-step capacity

Metastable collapse. The blip is long over (capacity back to 100 since step 8), yet offered load is stuck at 250 — above capacity — because retries keep feeding the fire. The system found a stable bad equilibrium and won’t leave it without intervention (shed load, or stop retrying). This is the Roblox-2021 shape.

Try it: set retries to 2–3 with the budget off and watch the bars stay red after the blip. Then turn the budget on — same blip, full recovery.
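If you can't run the interactive version, the same shape can be reproduced with a few lines of Go. This is a toy sketch, not the course simulator: it assumes that any step whose offered load exceeds capacity delivers zero goodput (every response arrives after the client's timeout), and that the blip drops capacity to 50. The names simulate, blipCapacity, and the cohort bookkeeping are all illustrative. Because the model differs, its stuck value (210) differs from the simulator's 250; the point is only that the load stays above capacity after the blip ends.

```go
package main

import "fmt"

const (
	base         = 70  // new requests arriving each step
	fullCapacity = 100 // requests the dependency can serve per step
	blipCapacity = 50  // capacity during the blip (assumed value)
)

// simulate returns the offered load at each step.
// Assumption: a step with load > capacity has zero goodput, so every
// attempt in it fails and is retried next step, up to maxRetries times.
func simulate(steps, maxRetries int, budget bool) []int {
	// pending[k] = requests waiting to make their (k+1)-th retry.
	pending := make([]int, maxRetries)
	offered := make([]int, steps)

	limit := base * (maxRetries + 1) // larger than any possible retry volume
	if budget {
		limit = base / 5 // cap retries at 20% of base load
	}

	for t := 0; t < steps; t++ {
		capacity := fullCapacity
		if t >= 3 && t < 8 {
			capacity = blipCapacity // the blip: steps 3 through 7
		}

		sent := make([]int, maxRetries)
		load, remaining := base, limit
		for k, n := range pending {
			if n > remaining {
				n = remaining // over budget: the rest fail fast
			}
			sent[k] = n
			remaining -= n
			load += n
		}
		offered[t] = load

		next := make([]int, maxRetries)
		if load > capacity && maxRetries > 0 {
			next[0] = base                      // failed first attempts
			copy(next[1:], sent[:maxRetries-1]) // failed retries move up a cohort
		}
		pending = next
	}
	return offered
}

func main() {
	fmt.Println("no budget:  ", simulate(14, 2, false))
	fmt.Println("with budget:", simulate(14, 2, true))
}
```

In this model the budget run carries a little extra load during the blip and then drops back to base load once capacity returns, while the no-budget run never leaves the overloaded state.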

The Resilience Stack

Here is the full stack, in the order a request encounters it. Read it as a defense-in-depth diagram: each layer catches a failure the layers above let through — and each, if misconfigured, manufactures the failure in the right-hand column. That second column is the part nobody teaches and every incident review rediscovers.

#	Layer	Catches	…but misconfigured, causes
1	Timeout	A hung dependency holding your thread or connection forever.	Set too short, it declares healthy-but-slow requests dead and manufactures failures under load (AWS DynamoDB 2015).
2	Deadline propagation	Downstream work continuing after the client already gave up: capacity burned on results nobody will read.	Omitted, every hop adds its own timeout, so end-to-end latency balloons past any single service's budget.
3	Retry + backoff + jitter	Transient, isolated blips that a second attempt clears.	No jitter → clients retry in synchronized waves, hammering the dependency on a schedule (thundering herd).
4	Retry budget	Retries amplifying a partial outage into a self-sustaining storm.	Too generous, it permits the storm; too tight, it fails requests a single retry would have saved.
5	Circuit breaker	Pounding a known-dead dependency; trips open to give it room to recover.	Opens too eagerly and converts a slow dependency into a fully failed one, turning 'slow' into 'broken'.
6	Bulkhead	One slow dependency exhausting the shared thread/connection pool and taking down unrelated features.	Pools sized too small → false rejections and underutilization even when nothing is wrong.
7	Load shedding / backpressure	Accepting more work than you can serve, so the queue grows unbounded and latency → ∞.	Shedding indiscriminately drops high-value traffic; backpressure not propagated upstream lets callers keep pushing.

The Resilience Stack — each layer catches what the ones above miss, and causes the failure on the right if misconfigured. Screenshot this and put it in your design doc.
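Layers 1 and 2 are easiest to see side by side. The sketch below uses Go's standard context package to give a whole request one deadline and thread it through every hop, so downstream work stops as soon as the caller's time is up. callDependency, handle, and demo are hypothetical names, and the hops are simulated with timers rather than real network calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callDependency stands in for a downstream call that needs `work` to finish.
// It abandons the work as soon as the caller's context is done.
func callDependency(ctx context.Context, work time.Duration) error {
	timer := time.NewTimer(work)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err() // caller gave up: stop burning capacity
	}
}

// handle sets ONE deadline for the whole request and passes the same
// context to every hop, instead of giving each hop its own fresh timeout.
func handle(parent context.Context, budget time.Duration, hops []time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	for i, work := range hops {
		if err := callDependency(ctx, work); err != nil {
			return i, err // i hops completed before the deadline hit
		}
	}
	return len(hops), nil
}

// demo runs handle with millisecond inputs and reports how many hops
// completed and whether the shared deadline was exceeded.
func demo(budgetMs int, hopsMs ...int) (completed int, timedOut bool) {
	hops := make([]time.Duration, len(hopsMs))
	for i, ms := range hopsMs {
		hops[i] = time.Duration(ms) * time.Millisecond
	}
	n, err := handle(context.Background(), time.Duration(budgetMs)*time.Millisecond, hops)
	return n, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demo(500, 10, 10))         // generous budget: both hops finish
	fmt.Println(demo(50, 10, 10, 200, 10)) // third hop blows the shared budget
}
```

The design choice worth noticing: the fourth hop in the second call never starts. With per-hop timeouts and no propagated deadline, it would have run anyway for a caller that had already left.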

The Retry Budget: the one defense that stops the storm

Of all these layers, one is non-negotiable for preventing metastable collapse: the Retry Budget. Backoff and jitter spread retries out in time, but they don’t bound the total retry volume, and under a sustained partial outage even well-spaced retries add up to a storm. A retry budget caps retries as a fraction of total traffic (say, 10%): each top-level call deposits a fraction of a token into a shared bucket, each retry withdraws a whole token, and when the bucket is empty you stop retrying and fail fast. That cap is what breaks the feedback loop.

Framework · Token bucket

The Retry Budget

Retries are a privilege drawn from a shared, refilling budget — not a right every failed request gets. When the budget is empty, fail fast. This is the cap that prevents the storm.

resilience.go — bound the total retry volume

type RetryBudget struct {
	tokens, max, depositPerCall float64
}

// Each top-level call deposits a little (e.g. 0.1 token/call -> retries
// capped at ~10% of traffic). A healthy system refills the budget faster
// than retries drain it; a struggling one runs the budget dry and STOPS
// retrying, which is exactly what breaks the metastable loop.
func (b *RetryBudget) Deposit() {
	b.tokens += b.depositPerCall
	if b.tokens > b.max {
		b.tokens = b.max
	}
}

func (b *RetryBudget) TryRetry() bool {
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false // budget exhausted -> do not retry, fail fast
}

The mental shift: a retry is not something every failed request is entitled to. It’s a scarce resource the whole service shares. When many requests are failing, the budget runs dry and the service correctly decides that adding retry load to a struggling dependency is worse than failing fast. The simulator above is exactly this: budget off → collapse, budget on → recovery, same blip.
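To see the cap in numbers, here is a small usage sketch that repeats the RetryBudget type from above and drives it through the worst case: every top-level call fails and asks for a retry. The helper retriesAllowed is hypothetical. The deposit is set to 0.125 rather than 0.1 purely so the float arithmetic is exact in the demo (ten additions of 0.1 land just under 1.0 in float64), which gives a cap of one retry per eight calls.

```go
package main

import "fmt"

// RetryBudget is the token bucket from the listing above.
type RetryBudget struct {
	tokens, max, depositPerCall float64
}

// Deposit is called once per top-level call and tops up the bucket.
func (b *RetryBudget) Deposit() {
	b.tokens += b.depositPerCall
	if b.tokens > b.max {
		b.tokens = b.max
	}
}

// TryRetry spends one whole token if available.
func (b *RetryBudget) TryRetry() bool {
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// retriesAllowed simulates `calls` top-level calls that ALL fail and
// counts how many of them the budget lets retry.
func retriesAllowed(b *RetryBudget, calls int) int {
	granted := 0
	for i := 0; i < calls; i++ {
		b.Deposit() // every top-level call refills a little
		if b.TryRetry() {
			granted++ // this failure gets a retry
		} // otherwise: fail fast, add no load
	}
	return granted
}

func main() {
	b := &RetryBudget{max: 10, depositPerCall: 0.125}
	fmt.Println("retries granted for 80 failing calls:", retriesAllowed(b, 80))
}
```

Even with a 100% failure rate, the dependency sees at most 12.5% extra load from this caller, instead of the 2x to 4x an unbudgeted retry loop would send.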

Runnable reference implementation

courses/distributed-systems/reference-impl/07-resilience/

A circuit breaker (closed → open → half-open), the token-bucket retry budget, a bulkhead, and the deterministic storm simulation. The demo shows the same blip collapse to a sustained offered-load of 250 (capacity 100) without a budget, and recover to 70 with one. Run the demo with go run . from that directory; every layer has tests.

Mental model

Every retry is a small DDoS you aim at yourself

A single client retrying 3× turns one request into up to four. Now multiply by every client, during the exact window when the dependency can least afford it. Retries convert a dependency’s partial failure into a traffic amplification aimed precisely at its weakest moment. That’s why the defenses are all about subtraction — backoff (retry later), jitter (not all at once), budget (not too many), circuit breaker (not at all, for now).

The reframe that makes this stick: you are not adding reliability when you add a retry. You are adding load, and hoping the reliability benefit outweighs it. Under partial outage, it doesn’t — which is why the budget and the breaker exist to take the retry away.

Use it when: Whenever you add or review a retry. Ask: if every caller did this during a partial outage, what's the multiplier on the struggling dependency?
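That multiplier question can be answered with arithmetic. A rough sketch, assuming each attempt fails independently with probability p (real outages are correlated, so treat this as a lower bound on how bad it gets): the expected attempts per request are 1 + p + p² + … + p^r for r retries, and when several layers each retry, the worst case multiplies. Both function names are illustrative.

```go
package main

import "fmt"

// expectedAttempts returns the mean number of attempts one request
// generates when each attempt fails with probability p and the client
// retries up to maxRetries times: 1 + p + p^2 + ... + p^maxRetries.
func expectedAttempts(p float64, maxRetries int) float64 {
	total, term := 0.0, 1.0
	for k := 0; k <= maxRetries; k++ {
		total += term
		term *= p
	}
	return total
}

// worstCaseFanout returns the attempts reaching the bottom dependency
// when every one of `layers` stacked callers retries maxRetries times
// and everything fails: (maxRetries + 1) ^ layers.
func worstCaseFanout(layers, maxRetries int) int {
	fanout := 1
	for i := 0; i < layers; i++ {
		fanout *= maxRetries + 1
	}
	return fanout
}

func main() {
	fmt.Println(expectedAttempts(0.5, 3)) // half of attempts failing
	fmt.Println(expectedAttempts(1.0, 3)) // full outage: the "up to four" case
	fmt.Println(worstCaseFanout(3, 3))    // three layers, each retrying 3x
}
```

The second function is the one to bring to a design review: retries at every layer of a call chain compound, which is an argument for retrying at one layer only.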

Dimension	No retry	Immediate retry	Exp. backoff	Backoff + jitter + budget
Recovers transient errors	No	Yes	Yes	Yes
Thundering-herd risk	None	High — instant re-hammer	Spaced out over time	Spaced + de-synchronized
Metastable-storm risk	None	Severe — tightest feedback loop	Still high — volume unbounded	Bounded by the budget
Complexity	Trivial	Trivial	Low	Moderate (budget + jitter)
Choose when	The operation isn't safely retryable (non-idempotent write) or the caller can handle the error better than a retry can.	Almost never. Immediate retry is the canonical way to turn a blip into an outage.	Transient errors are common and you've confirmed retry volume is naturally bounded (rare). Still missing the budget.	The default for any retryable call to a shared dependency. Backoff spaces them, jitter de-synchronizes them, the budget caps them.

Verdict

The only safe default for retrying a shared dependency is exponential backoff + full jitter + a retry budget. Backoff and jitter are necessary but not sufficient — they spread retries in time without bounding their total volume, so they slow the storm without preventing it. The budget is the part that actually breaks the metastable loop, and it’s the part most teams omit.
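For the backoff-and-jitter half of that default, a minimal sketch of full jitter: the delay is drawn uniformly from zero up to an exponentially growing ceiling. The cap (maxDelay) and the helper names ceilingFor, fullJitter, and newRNG are assumptions added for illustration; the budget half is the RetryBudget shown earlier and still has to be checked before any of these delays are used.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ceilingFor returns min(maxDelay, base * 2^attempt), doubling step by
// step so a large attempt number cannot overflow the duration.
func ceilingFor(base, maxDelay time.Duration, attempt int) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < maxDelay; i++ {
		ceiling *= 2
	}
	if ceiling > maxDelay {
		ceiling = maxDelay
	}
	return ceiling
}

// fullJitter picks a uniformly random delay in [0, ceiling). Spreading
// over the whole range is what de-synchronizes clients that all failed
// at the same instant.
func fullJitter(rng *rand.Rand, base, maxDelay time.Duration, attempt int) time.Duration {
	ceiling := ceilingFor(base, maxDelay, attempt)
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rng.Int63n(int64(ceiling)))
}

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	rng := newRNG(42)
	for attempt := 0; attempt < 6; attempt++ {
		d := fullJitter(rng, 100*time.Millisecond, 2*time.Second, attempt)
		fmt.Printf("attempt %d: sleep %v (ceiling %v)\n",
			attempt, d, ceilingFor(100*time.Millisecond, 2*time.Second, attempt))
	}
}
```

Note what this code cannot do: it decides when a retry happens, never whether. Every failed call still gets its retries, just later and less synchronized, which is why the verdict insists on the budget as well.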

How this fails in production · Roblox

The 73-hour metastable outage, 28–31 October 2021

The setup

Roblox ran HashiCorp Consul as the backbone for service discovery and configuration across its fleet. Per Roblox’s post-incident report, enabling Consul’s newer streaming feature under unusually high read and write load caused contention, and the same load conditions triggered a pathological performance issue in BoltDB, Consul’s underlying storage. Together they formed a classic trigger.

What happened

Once Consul slowed, the services depending on it retried and re-queued their requests, piling load onto an already-struggling Consul, which slowed further. The system entered a state where Consul could not catch up because the very act of services trying to use it generated more load than it could clear. Crucially, the outage sustained itself: even as the team worked the problem, bringing traffic back re-triggered the overload. Full recovery took roughly 73 hours and ultimately required bringing the system back up in a controlled, low-traffic way to let Consul stabilize before reopening the gates.

The moment it went wrong

This is a textbook metastable failure: the trigger (a Consul performance bug under new load) became almost irrelevant once the feedback loop closed. The system was stuck in a stable bad equilibrium held in place by its own retry and request load. You cannot fix that by removing the trigger — you have to break the loop by removing load, which is exactly why recovery meant deliberately keeping traffic out until the core was healthy.

The transferable lesson

Build the load-shedding and retry-budget defenses before you need them, and design an explicit, practiced way to bring the system up under reduced load. When you’re in a metastable failure, the recovery lever is subtraction — shed traffic, cap retries, drain queues — not “wait for it to pass” and not “add more capacity,” which the loop will happily consume.
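One way to build that shedding lever before you need it is an admission gate with two limits, so that under pressure low-value traffic is dropped first and critical traffic keeps a reserve. The sketch below is an illustration under assumptions: the Shedder type, its soft and hard limits, and the critical flag are all hypothetical, and a real system would derive priority from the request itself.

```go
package main

import (
	"fmt"
	"sync"
)

// Shedder admits work based on how much is already in flight.
// Non-critical requests are refused at softLimit; critical requests
// are refused only at hardLimit.
type Shedder struct {
	mu        sync.Mutex
	inFlight  int
	softLimit int
	hardLimit int
}

// Admit reports whether the request may proceed. A false return means
// "reject now, cheaply" instead of queueing work that will time out.
func (s *Shedder) Admit(critical bool) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	limit := s.softLimit
	if critical {
		limit = s.hardLimit
	}
	if s.inFlight >= limit {
		return false
	}
	s.inFlight++
	return true
}

// Done must be called once for every admitted request when it finishes.
func (s *Shedder) Done() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.inFlight > 0 {
		s.inFlight--
	}
}

func main() {
	s := &Shedder{softLimit: 2, hardLimit: 3}
	fmt.Println(s.Admit(false), s.Admit(false)) // room for both
	fmt.Println(s.Admit(false))                 // soft limit reached: shed
	fmt.Println(s.Admit(true))                  // critical still gets in
	fmt.Println(s.Admit(true))                  // hard limit reached: shed
}
```

For a controlled restart, the same gate doubles as the "reopen slowly" dial: start with both limits low and raise them as the core stabilizes.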

Source: Roblox, “Return to Service 10/28–10/31 2021”

What this sounds like in an interview

Calibration ladder · L3 → L6

One of your downstream dependencies starts responding slowly. What does your service do?

The interviewer is listening for whether you reach for retries (and stop there) or think about the whole stack and the feedback loop.

L3 · Junior

I'd add retries so that if a request to the dependency fails or times out, we try again and the user still gets a response.

Missed: Retries with no timeout, budget, or backoff — this is the configuration that causes the metastable outage, not prevents it.

L4 · Mid

I'd set a timeout so we don't hang, and retry with exponential backoff so we don't hammer it. Maybe a circuit breaker so if it's really down we stop calling it for a bit.

Missed: Good instincts and the right primitives, but no retry budget and no mention of the storm — so under a real partial outage this still amplifies into collapse.

L5 · Senior

First, a tight-but-realistic timeout and retries with backoff AND jitter, because synchronized retries are their own outage. But the key risk with a slow dependency is a retry storm, so I'd put a retry budget on it — cap retries at a fraction of traffic so a partial outage can't be amplified into a metastable failure. A circuit breaker to fail fast when it's clearly down, and a bulkhead so this one slow dependency can't exhaust the thread pool and take down everything else my service does.

Missed: Strong and complete on the patterns. Missing the control-loop framing, deadline propagation, the breaker's 'slow→broken' nuance on critical paths, and the idempotency precondition for retries.

L6 · Staff

Same stack, but I'd reason about it as a control loop and be explicit about the failure I'm preventing. The danger isn't the slow dependency — it's my service turning that slowness into a self-sustaining storm via retries. So the non-negotiable is the retry budget plus load shedding, because those are what actually break the feedback loop; backoff and jitter only slow it. I'd propagate deadlines so we stop doing work the caller has abandoned, and size the bulkhead so the blast radius is contained to features that truly need this dependency. I'd also be careful with the circuit breaker: opening it converts 'slow' into 'failed', which is the right call for a non-critical dependency and the wrong one for a critical-path call I can't actually serve without — there I'd rather shed load than blanket-fail. And I'd make sure retries only happen on idempotent operations, because retrying a non-idempotent write under timeout is a different disaster. The trade across all of it is availability vs. amplification: every retry I allow is load I'm aiming at the weakest point in my system at its weakest moment.

What scored L6

Framed it as a feedback loop, named the budget + load shedding as the things that actually break the loop (vs. backoff which only slows it), reasoned about the circuit breaker's downside on critical paths, and caught the idempotency precondition. That's someone who has been in the 3am bridge call for one of these.

When NOT to use this

Don't retry a non-idempotent write on timeout

A timeout means “no answer,” not “it failed” (Module 1). Retrying a charge, a send, or an increment after a timeout double-applies it when the first attempt actually succeeded. Retries are only safe on idempotent operations — make the operation idempotent first (Module 6), then retry.

Don't put a circuit breaker on a call you can't fail

A circuit breaker’s job is to fail fast. If the call is on a critical path you genuinely cannot serve the request without (the auth check, the payment), opening the breaker just converts “slow” into “definitely broken” for every user. There, prefer load shedding (serve fewer requests well) over blanket-failing all of them, or degrade to a safe fallback.
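It helps to see how blunt the mechanism is. Below is a minimal closed → open → half-open sketch, not the reference implementation: the Breaker type, its threshold and cooldown fields, and the use of an integer logical clock (so the example is deterministic) are all assumptions, and it is not safe for concurrent use as written.

```go
package main

import "fmt"

type State int

const (
	Closed   State = iota // calls flow normally
	Open                  // calls are refused without touching the dependency
	HalfOpen              // one probe call is in flight
)

// Breaker trips after `threshold` consecutive failures and stays open
// for `cooldown` ticks before letting a single probe through.
type Breaker struct {
	state     State
	failures  int
	threshold int
	openedAt  int64
	cooldown  int64
}

// Allow reports whether a call may be made at logical time `now`.
func (b *Breaker) Allow(now int64) bool {
	switch b.state {
	case Closed:
		return true
	case Open:
		if now-b.openedAt >= b.cooldown {
			b.state = HalfOpen // let exactly one probe through
			return true
		}
		return false // fail fast: EVERY caller is refused
	default: // HalfOpen: a probe is already in flight
		return false
	}
}

// Record reports the outcome of an allowed call.
func (b *Breaker) Record(now int64, ok bool) {
	if b.state == Open {
		return // late results from before the trip are ignored
	}
	if ok {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.threshold {
		b.state, b.openedAt = Open, now
	}
}

func main() {
	b := &Breaker{threshold: 3, cooldown: 10}
	for now := int64(0); now < 3; now++ {
		b.Record(now, false) // three consecutive failures
	}
	fmt.Println(b.Allow(5))  // still cooling down
	fmt.Println(b.Allow(12)) // cooldown elapsed: probe allowed
	b.Record(12, true)       // probe succeeded
	fmt.Println(b.Allow(13)) // closed again
}
```

Look at the Open branch of Allow: it returns false for every caller, with no notion of which requests matter. That is exactly the behavior you want around a recommendations widget and exactly the behavior you don't want around the auth check.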

Don't add retries without a budget and idempotency

Backoff and jitter make retries polite, not safe. Without a budget capping total retry volume, a sustained partial outage still amplifies into a storm. A retry without a budget is a metastable failure waiting for a trigger.

Don't bulkhead everything into tiny pools

Bulkheads contain blast radius, but over-partitioning into many small pools wastes capacity (each pool idle while another is saturated) and produces false rejections under normal load. Isolate the genuinely independent, genuinely risky dependencies — not every call into its own thimble.
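For a sense of scale, a bulkhead can be as small as a buffered channel used as a semaphore. The sketch below is illustrative: Bulkhead, NewBulkhead, and TryRun are hypothetical names, and it rejects immediately when the pool is full rather than queueing.

```go
package main

import "fmt"

// Bulkhead limits how many calls to one dependency run at once.
// Each dependency worth isolating gets its own instance.
type Bulkhead struct {
	slots chan struct{}
}

// NewBulkhead creates a pool with room for n concurrent calls.
func NewBulkhead(n int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, n)}
}

// TryRun executes fn if a slot is free and reports whether it ran.
// When the pool is full it returns false at once instead of blocking,
// so a slow dependency cannot pile up waiting callers.
func (b *Bulkhead) TryRun(fn func()) bool {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release it when fn returns
		fn()
		return true
	default:
		return false // pool exhausted: reject
	}
}

func main() {
	slowDep := NewBulkhead(1)
	otherDep := NewBulkhead(1)

	slowDep.TryRun(func() {
		// While the only slowDep slot is busy...
		fmt.Println("second slowDep call ran:", slowDep.TryRun(func() {}))
		// ...an unrelated dependency is unaffected.
		fmt.Println("otherDep call ran:", otherDep.TryRun(func() {}))
	})
}
```

The sizing warning above shows up directly in NewBulkhead's argument: a pool of 1 rejects the second concurrent call even when the dependency is perfectly healthy.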

Exercises

Exercise · Design scenario

Your API gateway fans each user request out to five backend services; the page can render usefully with any four of them. One backend begins to degrade during peak traffic. Design the gateway’s behavior so (a) the slow backend can’t take down the whole page, (b) retries can’t amplify its slowness into a storm, and (c) the page still renders. Specify timeouts, the retry policy, isolation, and what “degrade gracefully” concretely means here.

Exercise · Implementation task

In 07-resilience, add two things to the retry path: a Deadline (a logical time budget threaded through a call so a retry is refused once the deadline has passed) and exponential backoff with full jitter (sleep = random(0, base · 2^attempt)). Add a test showing two clients with jittered backoff do not retry on the same tick, while fixed backoff makes them synchronize.

Exercise · Find the race

This generic retry wrapper is used everywhere, including around a payment call. It looks correct and is the source of sporadic double charges. Find the bug.

retry.ts — shipped, double-charged

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn()                 // e.g. () => paymentGateway.charge(...)
    } catch (err) {
      lastErr = err                     // retries on ANY error, including timeout
      await sleep(backoff(i))
    }
  }
  throw lastErr
}

Walk away with this

01 The defining failure of this module is metastable collapse: a trigger starts a feedback loop (usually retries) that sustains the outage after the trigger is gone. Test: if removing the cause doesn’t restore service, you’re in one, and only shedding load gets you out.
02 Reliability is the Resilience Stack: timeout → deadline propagation → retry+backoff+jitter → retry budget → circuit breaker → bulkhead → load shedding. Each catches what the others miss and causes an outage if misconfigured.
03 The Retry Budget is the non-negotiable defense. Backoff and jitter space retries out; only a budget bounds their total volume and breaks the storm loop. Most teams skip it.
04 Every retry is load you aim at the weakest point in your system at its weakest moment. Retry only idempotent operations, only retryable errors, and only while the budget has tokens.
05 A circuit breaker turns “slow” into “failed”: right for a non-critical dependency, wrong for a critical-path call you can’t serve without (shed load there instead). Build these defenses, and a controlled low-traffic recovery path, before the incident (Roblox 2021).