Field Manual
Module 9 · Operations · 55 min

Operating It: Observability, Tail Latency & Capacity

Debugging partial failure, the tail-at-scale math, and capacity planning that holds

Framework: The Distributed Debugging Loop · Tail-at-Scale Math
Anchored to: Slack outage postmortem (Jan 2021)

Averages lie in distributed systems, and they lie in the direction that hurts: they hide the unhappy minority that is actually your problem. Operating well means measuring the tail, not the middle; debugging partial failure with a method instead of a hunch; and doing the capacity math that keeps you off the wrong side of a queue.

Your p99 is not your users' p99

Start with the number that reframes everything. If a single request fans out to many services (and in a microservice architecture it does), then the request is only as fast as its slowest participant. Each service being slow just 1% of the time sounds fine until you multiply. Assuming services are slow independently of one another, the probability that a request touching N services hits at least one slow one is 1 − (0.99)^N.

Tail-at-scale amplifier (interactive slider, shown at a fan-out of 100)
P(request is slow) = 1 − (0.990)^100 ≈ 63%

At a fan-out of 100 with each service slow 1% of the time, 63% of requests touch at least one slow service. This is why optimizing the p99 of a single service barely moves user-perceived latency in a fan-out architecture — and why the real levers are reducing fan-out, hedged requests, and tightening the tail, not the median.
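The formula is small enough to check by hand. The sketch below simply evaluates it for a few fan-outs; the function name `pSlowRequest` is made up for illustration, and it assumes each service is slow independently of the others.

```typescript
// P(request is slow) = 1 - (1 - pServiceSlow)^fanOut
// Assumption: services are slow independently of each other.
function pSlowRequest(fanOut: number, pServiceSlow: number): number {
  return 1 - Math.pow(1 - pServiceSlow, fanOut);
}

for (const n of [1, 10, 50, 100]) {
  const pct = (100 * pSlowRequest(n, 0.01)).toFixed(1);
  console.log(`fan-out ${n}: ${pct}% of requests touch a slow service`);
}
// fan-out 1: 1.0%
// fan-out 10: 9.6%
// fan-out 50: 39.5%
// fan-out 100: 63.4%
```

Halving the fan-out from 100 to 50 cuts the slow fraction from about 63% to about 40%, which is why fan-out is a lever in its own right.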

Framework · Formula + levers

Tail-at-Scale Math

P(slow request) = 1 − (per-service OK fraction)^fan-out. At 100 services and 1% slow each, ~63% of requests are slow. The levers are not where you'd guess.

The counterintuitive consequence: improving the median of a single service barely moves user-perceived latency. If most requests hit a tail somewhere, shaving the p50 of one service does nothing for them. The levers that actually work are about the tail and the fan-out:

  • Reduce fan-out: fewer services per request shrinks the exponent directly.
  • Hedged requests: after the p95, send a duplicate request to another replica and take the first answer; this collapses the tail at the cost of a little extra load.
  • Tighten the tail, not the median: a service whose p50 is 5ms and p99 is 800ms is the problem; fixing the 800ms (GC, lock contention, a cold cache) helps far more than shaving the 5ms.
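A hedged request can be sketched in a few lines. This is a minimal illustration, not the reference implementation: `hedged` and the `call(replica)` callback are hypothetical names, `hedgeAfterMs` stands in for the observed p95, and the losing request is left to finish rather than cancelled.

```typescript
// Hypothetical shape: `call(replica)` issues one request to the given replica.
type Call<T> = (replica: number) => Promise<T>;

// Send to replica 0; if no answer arrives within hedgeAfterMs (for example the
// observed p95), send a duplicate to replica 1 and take whichever settles first.
async function hedged<T>(call: Call<T>, hedgeAfterMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const primary = call(0);
  const backup = new Promise<T>((resolve, reject) => {
    timer = setTimeout(() => {
      call(1).then(resolve, reject);
    }, hedgeAfterMs);
  });
  try {
    // Note: a fast failure of the primary also ends the race in this sketch.
    return await Promise.race([primary, backup]);
  } finally {
    // If the primary won before the deadline, the duplicate is never sent.
    clearTimeout(timer);
  }
}
```

Because the duplicate only goes out once the primary has already exceeded the hedge delay, the extra load is bounded by the fraction of requests slower than that delay (about 5% when hedging at the p95).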

Mental model
The critical path is the only path that matters

A trace of a slow request is mostly noise. Auth took 20ms, the cache lookup took 2ms, six things ran in parallel — none of it matters except the critical path: the chain of spans that actually gated completion. Speeding up anything off the critical path changes the end-to-end latency by exactly zero, and engineers waste days optimizing off-path spans because they were the easy ones to see.

The reference implementation computes this: it follows the latest-finishing child at each step to find the path, then identifies the span with the most self-time on it. That span — not the biggest span, not the scariest-looking service — is where the latency actually is.

Use it when: Every latency investigation. Before optimizing anything, confirm it's on the critical path — most of the trace isn't.
trace.ts — find where the latency actually is
// The critical path: from the root, repeatedly follow the child that
// FINISHES LAST (the one gating completion). Optimizing anything else is wasted.
export function criticalPath(spans: Span[]): Span[] {
  const kids = childrenOf(spans)
  let cur = spans.find((s) => s.parentId === null) ?? spans[0]
  const path: Span[] = []
  while (cur) {
    path.push(cur)
    const children = kids.get(cur.id) ?? []
    if (children.length === 0) break
    cur = children.reduce((a, b) => (finish(b) > finish(a) ? b : a))
  }
  return path
}
Runnable reference implementation
TypeScript
courses/distributed-systems/reference-impl/09-trace-causality/

A trace reconstructor that finds the critical path and the tail culprit (a fan-out trace where auth and db are off-path while an ML model dominates), plus the capacity math — Little’s Law and the tail-amplification formula. The demo prints “tail culprit: ml-model (260 ms)” and the 63% number. npm run demo, 6 tests.
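The trace.ts excerpt relies on helpers it does not show (`Span`, `childrenOf`, `finish`) and stops before the self-time step. The sketch below fills those in under stated assumptions: the `Span` fields are taken from the span.ts listing in the exercises, the helper bodies and `selfTime` / `tailCulprit` are guesses at a reasonable shape rather than the reference code, and the example trace is invented (its numbers are not the demo's data).

```typescript
interface Span {
  id: string;
  parentId: string | null;
  service: string;
  startMs: number;
  durationMs: number;
}

const finish = (s: Span): number => s.startMs + s.durationMs;

// Index spans by parent so each step of the walk is a map lookup.
function childrenOf(spans: Span[]): Map<string, Span[]> {
  const kids = new Map<string, Span[]>();
  for (const s of spans) {
    if (s.parentId === null) continue;
    const list = kids.get(s.parentId) ?? [];
    list.push(s);
    kids.set(s.parentId, list);
  }
  return kids;
}

// Same walk as the excerpt: follow the latest-finishing child from the root.
function criticalPath(spans: Span[]): Span[] {
  const kids = childrenOf(spans);
  let cur: Span | undefined = spans.find((s) => s.parentId === null) ?? spans[0];
  const path: Span[] = [];
  while (cur) {
    path.push(cur);
    const children = kids.get(cur.id) ?? [];
    if (children.length === 0) break;
    cur = children.reduce((a, b) => (finish(b) > finish(a) ? b : a));
  }
  return path;
}

// Self-time: the part of a span's duration not covered by any child span.
// Overlapping (parallel) children are counted once via a moving cursor.
function selfTime(span: Span, children: Span[]): number {
  const sorted = [...children].sort((a, b) => a.startMs - b.startMs);
  let covered = 0;
  let cursor = span.startMs;
  for (const c of sorted) {
    const start = Math.max(c.startMs, cursor);
    const end = Math.min(finish(c), finish(span));
    if (end > start) {
      covered += end - start;
      cursor = end;
    }
  }
  return span.durationMs - covered;
}

// The span on the critical path with the most self-time.
// Assumes the trace contains at least one span.
function tailCulprit(spans: Span[]): { span: Span; selfMs: number } {
  const kids = childrenOf(spans);
  return criticalPath(spans)
    .map((span) => ({ span, selfMs: selfTime(span, kids.get(span.id) ?? []) }))
    .reduce((a, b) => (b.selfMs > a.selfMs ? b : a));
}

// Invented fan-out trace: auth and db finish early, ranking gates completion.
const exampleTrace: Span[] = [
  { id: "1", parentId: null, service: "gateway", startMs: 0, durationMs: 300 },
  { id: "2", parentId: "1", service: "auth", startMs: 0, durationMs: 20 },
  { id: "3", parentId: "1", service: "db", startMs: 20, durationMs: 40 },
  { id: "4", parentId: "1", service: "ranking", startMs: 20, durationMs: 275 },
  { id: "5", parentId: "4", service: "ml-model", startMs: 30, durationMs: 260 },
];

const culprit = tailCulprit(exampleTrace);
console.log(`tail culprit: ${culprit.span.service} (${culprit.selfMs} ms)`);
```

In this trace the ranking span is the longest child of the gateway, but almost all of its time is spent waiting on ml-model, so the self-time step points one level deeper than the biggest-looking span.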

The Distributed Debugging Loop

When the aggregate is green but users are unhappy, you cannot debug by staring at dashboards — you need a method that turns “something is slow somewhere” into a specific, attributable cause. This is the loop: four steps, each answering one question, each with the observability signal that answers it.

Framework · Method

The Distributed Debugging Loop

Locate → Correlate → Isolate → Attribute. Each step narrows 'somewhere, something' to a specific cause — and tells you which observability signal to reach for.

  1. Locate: Which slice is actually affected?
     Stop looking at the average. Break latency down by percentile, then by dimension: host, shard, tenant, region, app version, endpoint. The slow requests are concentrated somewhere; find the slice. (High-cardinality metrics / exemplars.)
  2. Correlate: What do the slow requests share?
     Pull the slow requests and find their common factor. All on one host? One database shard? One customer? Requests that started after the 10:32 deploy? The shared attribute is the lead. (Traces with attributes, logs joined on a request ID.)
  3. Isolate: Which component on the path is responsible?
     Take an exemplar slow trace and find its critical path. The span with the most self-time is the suspect component. Now you've gone from 'the app is slow' to 'the ranking service's DB query is slow on shard 7'. (Distributed tracing, critical-path analysis.)
  4. Attribute: What is the root cause inside that component?
     GC pause? Lock contention? A hot key (Module 4)? A slow dependency (Module 7)? A cold cache after the deploy? This is where profiles and component-level metrics close the case. (Profiles, GC logs, lock metrics.)

The loop is what makes partial failure tractable. Skip Locate and you debug the whole system at once; skip Correlate and you isolate the wrong request; skip Isolate and you guess at the component. Run it in order and “the app is slow” becomes “shard 7’s host is GC-thrashing since the 10:32 deploy” in four steps.
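The Locate step is mostly a group-by followed by a percentile. The sketch below shows that shape on in-memory samples; `RequestSample`, `percentile`, and `p99BySlice` are illustrative names, and a real system would run the equivalent query in its metrics or tracing backend rather than in application code.

```typescript
// Hypothetical record: one request's latency plus its dimension tags.
interface RequestSample {
  latencyMs: number;
  dims: Record<string, string>;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Locate: break the p99 down by one dimension (host, shard, tenant, ...).
function p99BySlice(samples: RequestSample[], dimension: string): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const s of samples) {
    const key = s.dims[dimension] ?? "(unset)";
    const bucket = groups.get(key) ?? [];
    bucket.push(s.latencyMs);
    groups.set(key, bucket);
  }
  const result = new Map<string, number>();
  groups.forEach((latencies, key) => result.set(key, percentile(latencies, 99)));
  return result;
}
```

Running `p99BySlice(samples, "host")` and then again with `"shard"` or `"tenant"` is the manual version of the drill-down: the dimension whose slices diverge most is where the slow requests are concentrated.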

Dimension | Metrics | Logs | Traces | Profiles
Answers 'what' vs 'why' | What: aggregate health | What happened, in detail | Where: across services | Why: inside a component
Per-request detail | None (pre-aggregated) | Full, per event | Full request path + timings | Statistical, not per-request
Cost at scale | Cheap (unless high-cardinality) | Expensive at volume | Expensive, so sample it | Cheap if continuous + sampled
Best for | Alerting, dashboards, SLOs | Forensics on a known request | Locating the slow component (the loop) | Attributing CPU/alloc/lock time
Choose when | Always-on health, SLOs, and alerting. Your first line, but aggregate metrics hide the tail, so percentile-bucket and add exemplars. | You need the full detail of specific events and can afford the volume (or sample aggressively). Don't make logs your primary latency tool. | You're locating a slow component across services (the heart of the debugging loop). Sample heads/tails so cost stays sane. | You've isolated a component and need to know why it's slow inside (CPU, allocations, lock contention).
Verdict

They’re a pipeline, not a menu: metrics tell you something’s wrong (and for whom, if you bucket by percentile and dimension), traces tell you where, and profiles tell you why. Logs are forensics for a known request, not a discovery tool. The most common operational gap is having only metrics — averaged, low-cardinality metrics that render every partial failure invisible. Add exemplars and traces before you add another dashboard.

How this fails in production · Slack

The January 4th, 2021 outage

The setup
The first Monday after the winter holidays — a predictable, enormous traffic spike as the workforce came back online all at once. Slack ran on AWS, with traffic crossing AWS Transit Gateways (TGWs) between their network segments, and an autoscaling fleet of web servers.
What happened
The traffic ramp saturated the Transit Gateways, which don’t scale instantly — and the resulting packet loss made everything slower. Slower backends caused threads to pile up, which triggered the autoscaler to add servers, which created more connections through the already-saturated TGWs, making things worse. Meanwhile their own monitoring was partially degraded by the same network problems, so the people trying to debug it were working with impaired visibility. It was a partial, saturation-driven degradation with a feedback loop — not a clean “X is down.”
The moment it went wrong
Two lessons converge here. First, the failure was a capacity problem (a network component that couldn’t scale as fast as demand) dressed up as a latency problem — exactly the kind of thing Little’s Law and headroom planning exist to prevent. Second, the observability needed to debug it was itself a casualty of the incident, which is the recurring nightmare of operating distributed systems: the tools you need most degrade exactly when you need them.
The transferable lesson

Plan capacity for the spike you can predict (the Monday-morning ramp), including the components that don’t autoscale instantly, and leave headroom: queues explode as utilization approaches 1, well before you actually reach 100%. And make your observability independent enough to survive the outage it’s meant to diagnose: out-of-band metrics, separate failure domains for monitoring, and dashboards that degrade gracefully.

Slack Engineering — Slack's Outage on January 4th 2021

What this sounds like in an interview

Calibration ladder · L3 → L6

Users report the app is slow, but every dashboard you have is green. Walk me through what you do.

The interviewer wants to see a method for partial failure — not 'I'd check the logs.'

L3 · Junior

I'd check the logs and the CPU and memory dashboards to see if anything is spiking, and restart the service if something looks off.

Missed: No method, and 'restart it' on a partial failure usually just moves the problem. Doesn't know to look past the average.
L4 · Mid

Green dashboards probably means it's averages hiding it. I'd look at p99 latency instead of p50, and check if it's a specific endpoint or region. Then trace a slow request to see which service is slow.

Missed: Right instincts (p99, trace) but ad hoc — no systematic loop, and doesn't address why the dashboards failed to show it.
L5 · Senior

The averages are hiding a partial failure, so I'd run a method: break latency down by percentile and then by dimension — host, shard, tenant, version — to locate which slice is affected. Then correlate the slow requests to find what they share, pull an exemplar trace and find its critical path to isolate the slow component, and use profiles or component metrics to attribute the root cause. The dashboards are green because they're aggregated; the problem is a minority that the mean erased.

Missed: Strong method. Missing the observability-gap framing (the green dashboard is itself a bug to fix), the tail-at-scale reasoning, and the capacity-vs-latency question.
L6 · Staff

Same Locate-Correlate-Isolate-Attribute method, but I'd also reason about why I couldn't see it and fix that. The green dashboards are an observability gap: I'm measuring averages and low-cardinality metrics, so I'd add percentile breakdowns with exemplars that link straight to traces, and high-cardinality dimensions like tenant and shard so a partial failure is visible by construction. On the tail itself, I'd remember that in a fan-out architecture my users' latency is dominated by the worst service per request — so I'd look at the slowest dependency on the critical path, not the average across services, and consider hedged requests if one replica's tail is the issue. And I'd ask whether this is a capacity problem masquerading as latency — a component near saturation where queueing is blowing up the tail — because the fix for that is headroom, not a code change. The meta-point: a partial failure that's invisible on the dashboard is a monitoring bug as much as a system bug, and I'd close both.

What scored L6

Ran a named method, treated the invisible failure as an observability bug to fix (high-cardinality + exemplars), reasoned about tail-at-scale and hedging, and asked whether it was really a capacity problem. That's someone who has debugged partial failure at 3am with half their tools down.

When NOT to use this
Don't alert on or reason from averages

A mean latency is the one number guaranteed to hide your worst users. Alert on percentiles (p99, p99.9) and bucket by dimension. An average that looks healthy while 2% of requests time out is not “mostly fine” — it’s a partial outage you’ve chosen not to see.

Don't trace 100% of requests at scale

Full tracing of every request is ruinously expensive in storage and throughput at high volume. Sample intelligently — keep a baseline rate plus all the slow/error traces (tail-based sampling) — so you pay for the traces that teach you something, not the millions of identical fast ones.
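The keep-or-drop decision can be sketched as a single predicate evaluated once a trace has completed. Everything here is illustrative: `TraceSummary`, `SamplingPolicy`, and `shouldKeep` are made-up names, and the random source is injectable only so the behaviour can be tested deterministically.

```typescript
// Hypothetical summary of a finished trace, available to the sampler.
interface TraceSummary {
  traceId: string;
  durationMs: number;
  error: boolean;
}

interface SamplingPolicy {
  slowThresholdMs: number; // keep every trace at or above this duration
  baselineRate: number;    // fraction of ordinary traces to keep, e.g. 0.01
}

// Tail-based sampling: decide after the trace finishes, so slow and failed
// traces are always kept and only the fast, successful ones are thinned out.
function shouldKeep(
  trace: TraceSummary,
  policy: SamplingPolicy,
  random: () => number = Math.random
): boolean {
  if (trace.error) return true;
  if (trace.durationMs >= policy.slowThresholdMs) return true;
  return random() < policy.baselineRate;
}
```

The trade-off is that the decision needs the whole trace, so spans have to be buffered somewhere until the request finishes; a head-based sampler avoids the buffering but cannot know in advance which requests will be slow.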

Don't capacity-plan on the average or at 100% utilization

Plan for the peak (the predictable Monday-morning spike), not the mean, and leave headroom: queueing latency rises sharply as utilization approaches 1, so a system sized for 100% util is already in trouble at 85%. Little’s Law gives you the concurrency; multiply for headroom.
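The arithmetic behind that advice fits in a few lines. This is a sketch under explicit assumptions: the function names and the example numbers are invented, `slotsPerServer` assumes each server handles a fixed number of concurrent requests, and the slowdown curve uses the textbook M/M/1 queue (mean time in system = service time / (1 − ρ)) as a stand-in for whatever queueing behaviour the real system has.

```typescript
// Little's Law: average requests in flight L = arrival rate (λ) × time in system (W).
function inFlight(arrivalsPerSec: number, latencySec: number): number {
  return arrivalsPerSec * latencySec;
}

// Servers needed to hold that concurrency while staying at a target utilization.
// Assumption: each server has a fixed number of concurrent request slots.
function serversNeeded(
  arrivalsPerSec: number,
  latencySec: number,
  slotsPerServer: number,
  targetUtilization: number
): number {
  if (targetUtilization <= 0 || targetUtilization > 1) {
    throw new RangeError("targetUtilization must be in (0, 1]");
  }
  const concurrency = inFlight(arrivalsPerSec, latencySec);
  return Math.ceil(concurrency / (slotsPerServer * targetUtilization));
}

// M/M/1 model (an assumption, not a measurement): mean latency relative to
// the unloaded service time is 1 / (1 - utilization).
function mm1Slowdown(utilization: number): number {
  if (utilization < 0 || utilization >= 1) return Infinity;
  return 1 / (1 - utilization);
}

// Hypothetical peak: 1,000 req/s at 200 ms each, 16 slots per server.
console.log(inFlight(1000, 0.2));                // about 200 requests in flight
console.log(serversNeeded(1000, 0.2, 16, 1.0));  // 13 servers with zero headroom
console.log(serversNeeded(1000, 0.2, 16, 0.6));  // 21 servers at 60% utilization
for (const rho of [0.5, 0.85, 0.95, 0.99]) {
  console.log(`utilization ${rho}: ~${mm1Slowdown(rho).toFixed(1)}x unloaded latency`);
}
```

Under the M/M/1 assumption, latency is already roughly 6.7x the unloaded value at 85% utilization and about 20x at 95%, which is the shape behind "already in trouble at 85%".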

Don't chase the p99.99 of everything

Tightening the extreme tail is expensive and has diminishing returns; not every endpoint deserves it. Spend the tail-latency budget where users feel it (the interactive request path) and accept a looser tail on background and batch work. Reliability effort, like everything else, should follow where the cost of slowness actually lands.

Exercises

Exercise · Design scenario
Design the observability and capacity plan for a new checkout service expected to do 2,000 req/s at peak with a 250ms p99 target, fanning out to inventory, pricing, fraud, and payments. Specify: what you measure and alert on, how you’d detect a partial failure affecting one downstream, how many servers you provision (state your Little’s Law assumptions), and how you keep one slow downstream from blowing the p99.
Exercise · Implementation task
In 09-trace-causality, implement hedged requests: given a service’s latency distribution, model sending a second request to a replica after the p95 and taking the first to return, and compute the resulting effective p99. Show that hedging meaningfully shrinks the tail at a small (~5%) increase in total request load.
Exercise · Find the race
Your trace viewer occasionally shows a child span starting before its parent, and the critical-path analysis comes out nonsensical. The trace-building code is correct. Find the cause — it ties straight back to Module 1.
span.ts — the timestamps that lie
// Each service stamps its own span using its local wall clock.
function startSpan(service: string, parentId: string | null): Span {
  return {
    id: newId(),
    parentId,
    service,
    startMs: Date.now(), // <- this machine's wall clock
    durationMs: 0,
  }
}
Walk away with this
  • 01 · Averages hide the partial failures that are your actual problem. Measure and alert on percentiles (p99, p99.9), bucketed by high-cardinality dimensions, so the unhappy minority is visible.
  • 02 · Tail-at-scale: with fan-out N and each service slow 1% of the time, ~1−0.99^N of requests are slow (≈63% at N=100). Your p99 is not your users’ p99. The levers are reducing fan-out, hedged requests, and tightening the tail, not the median.
  • 03 · Debug partial failure with the Distributed Debugging Loop: Locate (which slice) → Correlate (what they share) → Isolate (which component, via the critical path) → Attribute (root cause, via profiles).
  • 04 · Optimize only what’s on the critical path: the chain of spans that gates completion. Most of a trace is off-path and speeding it up changes end-to-end latency by zero.
  • 05 · Capacity is math: Little’s Law (L = λ·W) sizes concurrency; plan for the predictable peak with headroom, because queueing latency explodes as utilization → 1 (Slack 2021). And keep your observability alive when the system isn’t.