Reliability: Retries, Breakers, Backpressure, Bulkheads
The layered defense — and how each layer causes the outage it was meant to prevent
Reliability isn’t one trick; it’s a stack of defenses applied in order, each catching a specific failure and each capable of causing one if you set it wrong. The senior skill isn’t knowing the patterns — it’s knowing the second-order effect of each, because the most dangerous outages are caused by a reliability mechanism doing exactly what it was told.
The failure mode that defines the module
A normal overload is self-correcting: remove the spike and the system recovers. A metastable failure is not. A trigger (a blip, a deploy, a traffic spike) pushes the system into a state that sustains itself through a feedback loop — almost always retries — even after the trigger is gone. The defining test: if removing the original cause does not restore service, you’re in a metastable failure, and you cannot wait it out. Drive the simulator below into one.
Try it: set retries to 2–3 with the budget off and watch the bars stay red after the blip. Then turn the budget on — same blip, full recovery.
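For readers without the interactive widget, the same behavior can be sketched as a deterministic tick loop. This is a toy model, not the reference implementation: the traffic figures (70 new requests per tick, capacity 100), the three-tick blip, the proportional scaling of over-budget retries, and the rule that an overloaded tick times out every request are all assumptions chosen to make the feedback loop visible.

```go
package main

import "fmt"

// simulate returns the offered load (new requests + retries) for each tick.
// Toy assumptions: 70 new requests per tick, capacity 100, capacity 0 during
// ticks 5-7, and any overloaded tick times out every request in it (a harsh
// stand-in for goodput collapse).
func simulate(ticks, maxRetries int, useBudget bool) []float64 {
	const newPerTick, capacity = 70.0, 100.0
	const budgetRatio = 0.10 // retries capped at 10% of new traffic

	// cohorts[k] holds the requests making their k-th retry this tick.
	cohorts := make([]float64, maxRetries+1)
	offered := make([]float64, ticks)

	for t := 0; t < ticks; t++ {
		cohorts[0] = newPerTick

		retries := 0.0
		for _, c := range cohorts[1:] {
			retries += c
		}
		if allowed := budgetRatio * newPerTick; useBudget && retries > allowed {
			scale := allowed / retries // over-budget retries fail fast
			for k := 1; k <= maxRetries; k++ {
				cohorts[k] *= scale
			}
			retries = allowed
		}
		offered[t] = newPerTick + retries

		limit := capacity
		if t >= 5 && t < 8 {
			limit = 0 // the trigger: a three-tick blip
		}
		failFrac := 0.0
		if offered[t] > limit {
			failFrac = 1.0
		}
		// Failed requests become next tick's retries; the last cohort gives up.
		for k := maxRetries; k >= 1; k-- {
			cohorts[k] = cohorts[k-1] * failFrac
		}
	}
	return offered
}

func main() {
	without := simulate(30, 3, false)
	with := simulate(30, 3, true)
	fmt.Println("final offered load, no budget:", without[len(without)-1])
	fmt.Println("final offered load, budget:   ", with[len(with)-1])
}
```

Under these assumptions the unbudgeted run settles at an offered load of 280 against a capacity of 100 and never leaves it, even though the blip ended at tick 8. The budgeted run peaks at 77 and returns to 70. The exact numbers belong to this toy and differ from the reference implementation's.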
The Resilience Stack
Here is the full stack, in the order a request encounters it. Read it as a defense-in-depth diagram: each layer catches a failure the layers above let through — and each, if misconfigured, manufactures the failure in the right-hand column. That second column is the part nobody teaches and every incident review rediscovers.
The Retry Budget: the one defense that stops the storm
Of all these layers, one is non-negotiable for preventing metastable collapse: the Retry Budget. Backoff and jitter spread retries out in time, but they don’t bound the total retry volume, and under a sustained partial outage even well-spaced retries add up to a storm. A retry budget caps retries as a fraction of total traffic (say, 10%): every top-level call deposits a fraction of a token (0.1 for a 10% cap) into a shared bucket, each retry withdraws a whole token, and when the bucket is empty you stop retrying and fail fast. That cap is what breaks the feedback loop.
The Retry Budget
Retries are a privilege drawn from a shared, refilling budget — not a right every failed request gets. When the budget is empty, fail fast. This is the cap that prevents the storm.
```go
type RetryBudget struct {
	tokens, max, depositPerCall float64
}

// Each top-level call deposits a little (e.g. 0.1 token/call -> retries
// capped at ~10% of traffic). A healthy system refills the budget faster
// than retries drain it; a struggling one runs the budget dry and STOPS
// retrying, which is exactly what breaks the metastable loop.
func (b *RetryBudget) Deposit() {
	b.tokens += b.depositPerCall
	if b.tokens > b.max {
		b.tokens = b.max
	}
}

func (b *RetryBudget) TryRetry() bool {
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false // budget exhausted -> do not retry, fail fast
}
```

The mental shift: a retry is not something every failed request is entitled to. It’s a scarce resource the whole service shares. When many requests are failing, the budget runs dry and the service correctly decides that adding retry load to a struggling dependency is worse than failing fast. The simulator above is exactly this: budget off → collapse, budget on → recovery, same blip.
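One way a call path might draw on such a budget is sketched below. The names `Budget`, `Do`, `ErrBudgetExhausted`, and `errDown` are hypothetical, a mutex is added on the assumption that many goroutines share one budget, and backoff is left out to keep the sketch short. The deposit of 0.5 tokens per call (a 50% cap) is only there to make the arithmetic easy to follow.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Budget is a mutex-guarded token bucket (hypothetical name).
type Budget struct {
	mu                          sync.Mutex
	tokens, max, depositPerCall float64
}

// Deposit credits the bucket once per top-level call, up to max.
func (b *Budget) Deposit() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += b.depositPerCall
	if b.tokens > b.max {
		b.tokens = b.max
	}
}

// TryRetry spends one whole token, or reports that the budget is empty.
func (b *Budget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

var (
	ErrBudgetExhausted = errors.New("retry budget exhausted")
	errDown            = errors.New("dependency down") // stand-in failure
)

// Do runs fn once, then retries up to maxRetries times while tokens remain.
func Do(b *Budget, maxRetries int, fn func() error) error {
	b.Deposit() // every top-level call funds the budget a little
	err := fn()
	for i := 0; err != nil && i < maxRetries; i++ {
		if !b.TryRetry() {
			return fmt.Errorf("%w: %v", ErrBudgetExhausted, err)
		}
		err = fn()
	}
	return err
}

func main() {
	b := &Budget{max: 10, depositPerCall: 0.5}
	attempts := 0
	failing := func() error { attempts++; return errDown }
	for i := 0; i < 4; i++ {
		_ = Do(b, 3, failing)
	}
	fmt.Println("attempts made for 4 failing calls:", attempts)
}
```

With every call failing, four top-level calls produce six attempts in total rather than the sixteen that three unbudgeted retries per call would allow.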
courses/distributed-systems/reference-impl/07-resilience/

A circuit breaker (closed → open → half-open), the token-bucket retry budget, a bulkhead, and the deterministic storm simulation. The demo shows the same blip collapse to a sustained offered load of 250 (capacity 100) without a budget, and recover to 70 with one. Run it with `go run .`; there are tests for every layer.
Every retry is a small DDoS you aim at yourself
A single client retrying 3× turns one request into up to four. Now multiply by every client, during the exact window when the dependency can least afford it. Retries convert a dependency’s partial failure into a traffic amplification aimed precisely at its weakest moment. That’s why the defenses are all about subtraction — backoff (retry later), jitter (not all at once), budget (not too many), circuit breaker (not at all, for now).
The reframe that makes this stick: you are not adding reliability when you add a retry. You are adding load, and hoping the reliability benefit outweighs it. Under partial outage, it doesn’t — which is why the budget and the breaker exist to take the retry away.
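The amplification can be put in numbers. The sketch below assumes each attempt fails independently with the same probability `p`, which real outages rarely satisfy, so treat it as a back-of-the-envelope estimate. The function name `expectedAttempts` is hypothetical.

```go
package main

import "fmt"

// expectedAttempts returns the average number of attempts one request
// generates when each attempt fails independently with probability p and
// the client retries up to maxRetries times: 1 + p + p^2 + ... + p^maxRetries.
func expectedAttempts(p float64, maxRetries int) float64 {
	total, term := 0.0, 1.0
	for k := 0; k <= maxRetries; k++ {
		total += term
		term *= p
	}
	return total
}

func main() {
	for _, p := range []float64{0, 0.5, 1} {
		fmt.Printf("failure rate %.1f -> %.3f attempts per request\n",
			p, expectedAttempts(p, 3))
	}
}
```

At a failure rate of 0 the multiplier is 1, at 0.5 it is 1.875, and at 1 it reaches the full 4×: the load multiplier grows exactly as the dependency gets sicker.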
| Dimension | No retry | Immediate retry | Exp. backoff | Backoff + jitter + budget |
|---|---|---|---|---|
| Recovers transient errors | No | Yes | Yes | Yes |
| Thundering-herd risk | None | High — instant re-hammer | Spaced out over time | Spaced + de-synchronized |
| Metastable-storm risk | None | Severe — tightest feedback loop | Still high — volume unbounded | Bounded by the budget |
| Complexity | Trivial | Trivial | Low | Moderate (budget + jitter) |
| Choose when | The operation isn't safely retryable (non-idempotent write) or the caller can handle the error better than a retry can. | Almost never. Immediate retry is the canonical way to turn a blip into an outage. | Transient errors are common and you've confirmed retry volume is naturally bounded (rare). Still missing the budget. | The default for any retryable call to a shared dependency. Backoff spaces them, jitter de-synchronizes them, the budget caps them. |
The only safe default for retrying a shared dependency is exponential backoff + full jitter + a retry budget. Backoff and jitter are necessary but not sufficient — they spread retries in time without bounding their total volume, so they slow the storm without preventing it. The budget is the part that actually breaks the metastable loop, and it’s the part most teams omit.
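A minimal sketch of the backoff-plus-full-jitter half of that default follows, using the formula sleep = random(0, base · 2^attempt). The cap `maxDelay`, the constants, and the helper `newRNG` are assumptions added for illustration. The budget is a separate piece and is not shown here.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only; tune them to the dependency's real latency.
const (
	baseDelay = 100 * time.Millisecond
	maxDelay  = 10 * time.Second
)

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// fullJitter returns a random delay in [0, min(limit, base*2^attempt)).
func fullJitter(rng *rand.Rand, base, limit time.Duration, attempt int) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < limit; i++ {
		ceiling *= 2 // exponential growth, stopped once the cap is reached
	}
	if ceiling > limit {
		ceiling = limit
	}
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rng.Int63n(int64(ceiling)))
}

func main() {
	a, b := newRNG(1), newRNG(2) // two clients with independent randomness
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt,
			fullJitter(a, baseDelay, maxDelay, attempt),
			fullJitter(b, baseDelay, maxDelay, attempt))
	}
}
```

Drawing from the whole interval rather than adding a small random offset to a fixed delay is what de-synchronizes clients that all failed at the same instant.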
Roblox: the 73-hour metastable outage, 28–31 October 2021
Build the load-shedding and retry-budget defenses before you need them, and design an explicit, practiced way to bring the system up under reduced load. When you’re in a metastable failure, the recovery lever is subtraction — shed traffic, cap retries, drain queues — not “wait for it to pass” and not “add more capacity,” which the loop will happily consume.
What this sounds like in an interview
One of your downstream dependencies starts responding slowly. What does your service do?
The interviewer is listening for whether you reach for retries (and stop there) or think about the whole stack and the feedback loop.
I'd add retries so that if a request to the dependency fails or times out, we try again and the user still gets a response.
I'd set a timeout so we don't hang, and retry with exponential backoff so we don't hammer it. Maybe a circuit breaker so if it's really down we stop calling it for a bit.
First, a tight-but-realistic timeout and retries with backoff AND jitter, because synchronized retries are their own outage. But the key risk with a slow dependency is a retry storm, so I'd put a retry budget on it — cap retries at a fraction of traffic so a partial outage can't be amplified into a metastable failure. A circuit breaker to fail fast when it's clearly down, and a bulkhead so this one slow dependency can't exhaust the thread pool and take down everything else my service does.
Same stack, but I'd reason about it as a control loop and be explicit about the failure I'm preventing. The danger isn't the slow dependency — it's my service turning that slowness into a self-sustaining storm via retries. So the non-negotiable is the retry budget plus load shedding, because those are what actually break the feedback loop; backoff and jitter only slow it. I'd propagate deadlines so we stop doing work the caller has abandoned, and size the bulkhead so the blast radius is contained to features that truly need this dependency. I'd also be careful with the circuit breaker: opening it converts 'slow' into 'failed', which is the right call for a non-critical dependency and the wrong one for a critical-path call I can't actually serve without — there I'd rather shed load than blanket-fail. And I'd make sure retries only happen on idempotent operations, because retrying a non-idempotent write under timeout is a different disaster. The trade across all of it is availability vs. amplification: every retry I allow is load I'm aiming at the weakest point in my system at its weakest moment.
Framed it as a feedback loop, named the budget + load shedding as the things that actually break the loop (vs. backoff which only slows it), reasoned about the circuit breaker's downside on critical paths, and caught the idempotency precondition. That's someone who has been in the 3am bridge call for one of these.
Don't retry a non-idempotent write on timeout
A timeout means “no answer,” not “it failed” (Module 1). Retrying a charge, a send, or an increment after a timeout double-applies it when the first attempt actually succeeded. Retries are only safe on idempotent operations — make the operation idempotent first (Module 6), then retry.
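One common way to make a charge idempotent is to have the caller attach a unique key and have the server remember which keys it has already applied. The sketch below is an in-memory illustration under that assumption. `Ledger`, `Charge`, and `Total` are hypothetical names, and a real service would persist the keys alongside the write.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a hypothetical payment store that deduplicates charges by an
// idempotency key supplied by the caller.
type Ledger struct {
	mu      sync.Mutex
	applied map[string]int // idempotency key -> amount already charged
	total   int
}

func NewLedger() *Ledger { return &Ledger{applied: make(map[string]int)} }

// Charge applies the charge at most once per key. A replayed key returns
// the original amount and replay=true without charging again.
func (l *Ledger) Charge(key string, amount int) (charged int, replay bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if prev, ok := l.applied[key]; ok {
		return prev, true
	}
	l.applied[key] = amount
	l.total += amount
	return amount, false
}

// Total reports the sum of all charges actually applied.
func (l *Ledger) Total() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.total
}

func main() {
	l := NewLedger()
	l.Charge("order-123", 100) // first attempt succeeds, but the reply is lost
	l.Charge("order-123", 100) // retry after the timeout reuses the same key
	fmt.Println("total charged:", l.Total())
}
```

The retry is safe only because both attempts carry the same key. Generating a fresh key per attempt would bring the double charge straight back.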
Don't put a circuit breaker on a call you can't fail
A circuit breaker’s job is to fail fast. If the call is on a critical path you genuinely cannot serve the request without (the auth check, the payment), opening the breaker just converts “slow” into “definitely broken” for every user. There, prefer load shedding (serve fewer requests well) over blanket-failing all of them, or degrade to a safe fallback.
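For reference, the closed → open → half-open cycle behind this warning can be sketched as a small state machine. This is a single-goroutine illustration that uses logical ticks instead of wall-clock time so it stays deterministic. The names, the consecutive-failure threshold, and the single-probe rule are assumptions, not the reference implementation.

```go
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Closed   State = iota // calls flow through
	Open                  // calls fail fast
	HalfOpen              // one probe decides which way to go
)

var (
	ErrOpen = errors.New("circuit open: failing fast")
	errDown = errors.New("dependency down") // stand-in failure
)

// Breaker is not goroutine-safe; guard it with a mutex before sharing it.
type Breaker struct {
	state     State
	failures  int // consecutive failures
	threshold int // consecutive failures that trip the breaker
	cooldown  int // ticks to stay open before allowing a probe
	openedAt  int // tick at which the breaker last opened
}

func NewBreaker(threshold, cooldown int) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. now is a logical tick.
func (b *Breaker) Call(now int, fn func() error) error {
	if b.state == Open {
		if now-b.openedAt < b.cooldown {
			return ErrOpen // the dependency receives no traffic at all
		}
		b.state = HalfOpen // cooldown elapsed: let this call through as a probe
	}
	if err := fn(); err != nil {
		b.failures++
		if b.state == HalfOpen || b.failures >= b.threshold {
			b.state, b.openedAt = Open, now
		}
		return err
	}
	b.state, b.failures = Closed, 0
	return nil
}

func main() {
	b := NewBreaker(2, 10)
	fail := func() error { return errDown }
	ok := func() error { return nil }
	fmt.Println(b.Call(0, fail), b.Call(1, fail)) // second failure trips it
	fmt.Println(b.Call(5, ok))                    // still cooling down
	fmt.Println(b.Call(11, ok))                   // probe succeeds, closes
}
```

Note what the open state does to callers: between ticks 1 and 11 every request fails whether or not the dependency has recovered, which is the cost described above for critical-path calls.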
Don't add retries without a budget and idempotency
Backoff and jitter make retries polite, not safe. Without a budget capping total retry volume, a sustained partial outage still amplifies into a storm. A retry without a budget is a metastable failure waiting for a trigger.
Don't bulkhead everything into tiny pools
Bulkheads contain blast radius, but over-partitioning into many small pools wastes capacity (each pool idle while another is saturated) and produces false rejections under normal load. Isolate the genuinely independent, genuinely risky dependencies — not every call into its own thimble.
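A bulkhead can be as small as a counting semaphore per dependency. The sketch below uses a buffered channel and rejects immediately when the pool is full rather than queueing. The names `Bulkhead` and `ErrBulkheadFull` and the reject-instead-of-wait policy are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBulkheadFull = errors.New("bulkhead full: rejecting call")

// Bulkhead caps the number of concurrent calls to one dependency.
type Bulkhead struct{ slots chan struct{} }

func NewBulkhead(limit int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, limit)}
}

// Do runs fn if a slot is free and rejects immediately otherwise, so a slow
// dependency can hold at most `limit` callers hostage.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquired a slot
		defer func() { <-b.slots }() // release it when fn returns
		return fn()
	default:
		return ErrBulkheadFull
	}
}

func main() {
	b := NewBulkhead(1)
	// The outer call holds the only slot, so the nested call is rejected.
	_ = b.Do(func() error {
		fmt.Println("nested call:", b.Do(func() error { return nil }))
		return nil
	})
	fmt.Println("after release:", b.Do(func() error { return nil }))
}
```

The sizing trade-off from the paragraph above lives in the `limit` argument: too small and healthy traffic is rejected, too large and the pool no longer contains anything.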
Exercises
In 07-resilience, add two things to the retry path: a Deadline (a logical time budget threaded through a call so a retry is refused once the deadline has passed) and exponential backoff with full jitter (sleep = random(0, base · 2^attempt)). Add a test showing two clients with jittered backoff do not retry on the same tick, while fixed backoff makes them synchronize.

```typescript
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn() // e.g. () => paymentGateway.charge(...)
    } catch (err) {
      lastErr = err // retries on ANY error, including timeout
      await sleep(backoff(i))
    }
  }
  throw lastErr
}
```

1. The defining failure of this module is metastable collapse: a trigger starts a feedback loop (usually retries) that sustains the outage after the trigger is gone. Test: if removing the cause doesn’t restore service, you’re in one, and only shedding load gets you out.
2. Reliability is the Resilience Stack: timeout → deadline propagation → retry+backoff+jitter → retry budget → circuit breaker → bulkhead → load shedding. Each catches what the others miss and causes an outage if misconfigured.
3. The Retry Budget is the non-negotiable defense. Backoff and jitter space retries out; only a budget bounds their total volume and breaks the storm loop. Most teams skip it.
4. Every retry is load you aim at the weakest point in your system at its weakest moment. Retry only idempotent operations, only retryable errors, and only while the budget has tokens.
5. A circuit breaker turns “slow” into “failed”: right for a non-critical dependency, wrong for a critical-path call you can’t serve without (shed load there instead). Build these defenses, and a controlled low-traffic recovery path, before the incident (Roblox 2021).