Operating It: Observability, Tail Latency & Capacity
Debugging partial failure, the tail-at-scale math, and capacity planning that holds
Averages lie in distributed systems, and they lie in the direction that hurts: they hide the unhappy minority that is actually your problem. Operating well means measuring the tail, not the middle; debugging partial failure with a method instead of a hunch; and doing the capacity math that keeps you off the wrong side of a queue.
Your p99 is not your users' p99
Start with the number that reframes everything. If a single request fans out to many services (and in a microservice architecture it does), then the request is only as fast as its slowest participant. Each service being slow just 1% of the time sounds fine until you multiply. If slowness is independent across services, the probability that a request touching N services hits at least one slow one is 1 − (0.99)^N.
At a fan-out of 100 with each service slow 1% of the time, about 63% of requests touch at least one slow service. This is why optimizing the median of a single service barely moves user-perceived latency in a fan-out architecture, and why the real levers are reducing fan-out, hedged requests, and tightening the tail, not the median.
Tail-at-Scale Math
P(slow request) = 1 − (per-service OK fraction)^fan-out. At 100 services and 1% slow each, ~63% of requests are slow. The levers are not where you'd guess.
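The formula is easy to check numerically. The sketch below is a minimal illustration, assuming each service is slow independently of the others; `probSlowRequest` is a name chosen here for the example, not one taken from the reference implementation.

```typescript
// Probability that a fan-out request touches at least one slow service,
// assuming each service is independently slow with probability pSlow.
function probSlowRequest(pSlow: number, fanOut: number): number {
  return 1 - Math.pow(1 - pSlow, fanOut)
}

// Each service slow 1% of the time, at increasing fan-out.
for (const n of [1, 10, 50, 100]) {
  const pct = (probSlowRequest(0.01, n) * 100).toFixed(1)
  console.log(`fan-out ${n}: ${pct}% of requests hit a slow service`)
}
// fan-out 1: 1.0%, fan-out 10: 9.6%, fan-out 50: 39.5%, fan-out 100: 63.4%
```

The curve is steep early: by a fan-out of 10, roughly one request in ten is already paying for somebody's tail.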
The counterintuitive consequence: improving the median of a single service barely moves user-perceived latency. If most requests hit a tail somewhere, shaving the p50 of one service does nothing for them. The levers that actually work are about the tail and the fan-out:
- Reduce fan-out: fewer services per request shrinks the exponent directly.
- Hedged requests: after the p95, send a duplicate request to another replica and take the first answer; this collapses the tail at the cost of a little extra load.
- Tighten the tail, not the median: a service whose p50 is 5ms and p99 is 800ms is the problem; fixing the 800ms (GC, lock contention, a cold cache) helps far more than shaving the 5ms.
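To see why hedging collapses the tail, it helps to model the latency of a single hedged request. This is a rough sketch, not the reference implementation: `hedgedLatency` is a hypothetical helper, the backup replica's latency is assumed to be independent of the primary's, and the 20 ms hedge threshold stands in for a measured p95.

```typescript
// Effective latency of one hedged request: if the primary has not answered
// by hedgeAfterMs, a duplicate goes to a second replica and the first reply wins.
function hedgedLatency(primaryMs: number, backupMs: number, hedgeAfterMs: number): number {
  if (primaryMs <= hedgeAfterMs) return primaryMs // fast enough, no hedge is sent
  return Math.min(primaryMs, hedgeAfterMs + backupMs)
}

// Using the numbers from the text: a 5 ms typical request and an 800 ms tail request.
// hedgeAfterMs = 20 is an assumed p95 threshold.
console.log(hedgedLatency(5, 5, 20)) // 5: no hedge needed
console.log(hedgedLatency(800, 5, 20)) // 25: slow primary rescued by a fast backup
console.log(hedgedLatency(800, 800, 20)) // 800: both replicas slow, nothing gained
```

Only requests slower than the threshold trigger a duplicate, which is why hedging at the p95 costs roughly 5% extra load. The request stays slow only when both replicas land in their tail at once.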
The critical path is the only path that matters
A trace of a slow request is mostly noise. Auth took 20ms, the cache lookup took 2ms, six things ran in parallel — none of it matters except the critical path: the chain of spans that actually gated completion. Speeding up anything off the critical path changes the end-to-end latency by exactly zero, and engineers waste days optimizing off-path spans because they were the easy ones to see.
The reference implementation computes this: it follows the latest-finishing child at each step to find the path, then identifies the span with the most self-time on it. That span — not the biggest span, not the scariest-looking service — is where the latency actually is.
```ts
// The critical path: from the root, repeatedly follow the child that
// FINISHES LAST (the one gating completion). Optimizing anything else is wasted.
// (Span, childrenOf, and finish are not shown in this excerpt.)
export function criticalPath(spans: Span[]): Span[] {
  const kids = childrenOf(spans)
  let cur = spans.find((s) => s.parentId === null) ?? spans[0]
  const path: Span[] = []
  while (cur) {
    path.push(cur)
    const children = kids.get(cur.id) ?? []
    if (children.length === 0) break
    cur = children.reduce((a, b) => (finish(b) > finish(a) ? b : a))
  }
  return path
}
```

courses/distributed-systems/reference-impl/09-trace-causality/

A trace reconstructor that finds the critical path and the tail culprit (a fan-out trace where auth and db are off-path while an ML model dominates), plus the capacity math: Little’s Law and the tail-amplification formula. The demo prints “tail culprit: ml-model (260 ms)” and the 63% number. npm run demo, 6 tests.
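The second half of that analysis, finding the span with the most self-time on the path, can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation's code: the `Span` shape is inferred from the fields used elsewhere in this lesson, `selfTime` and `tailCulprit` are names chosen here, and the trace is a made-up example.

```typescript
interface Span {
  id: string
  parentId: string | null
  service: string
  startMs: number
  durationMs: number
}

// Self-time: the part of a span's duration not covered by any of its children.
// Child intervals are merged as we go so parallel children are not double-counted.
function selfTime(span: Span, all: Span[]): number {
  const end = span.startMs + span.durationMs
  const intervals = all
    .filter((s) => s.parentId === span.id)
    .map((s): [number, number] => [
      Math.max(s.startMs, span.startMs),
      Math.min(s.startMs + s.durationMs, end),
    ])
    .filter(([from, to]) => to > from)
    .sort((x, y) => x[0] - y[0])
  let covered = 0
  let cursor = span.startMs
  for (const [from, to] of intervals) {
    const begin = Math.max(from, cursor)
    if (to > begin) {
      covered += to - begin
      cursor = to
    }
  }
  return span.durationMs - covered
}

// The tail culprit: the span on the critical path with the most self-time.
function tailCulprit(path: Span[], all: Span[]): Span {
  return path.reduce((worst, s) => (selfTime(s, all) > selfTime(worst, all) ? s : worst))
}

// Hypothetical fan-out trace: auth and db run in parallel and finish early,
// while ml-model gates completion.
const root: Span = { id: "r", parentId: null, service: "gateway", startMs: 0, durationMs: 300 }
const auth: Span = { id: "a", parentId: "r", service: "auth", startMs: 5, durationMs: 20 }
const db: Span = { id: "d", parentId: "r", service: "db", startMs: 5, durationMs: 40 }
const ml: Span = { id: "m", parentId: "r", service: "ml-model", startMs: 30, durationMs: 260 }
const trace = [root, auth, db, ml]

const culprit = tailCulprit([root, ml], trace) // critical path: gateway, then ml-model
console.log(`tail culprit: ${culprit.service} (${selfTime(culprit, trace)} ms)`)
```

Here the root span is the longest at 300 ms, yet its self-time is only 15 ms. Ranking by self-time rather than duration is what keeps the analysis from blaming the parent for time spent in its children.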
The Distributed Debugging Loop
When the aggregate is green but users are unhappy, you cannot debug by staring at dashboards — you need a method that turns “something is slow somewhere” into a specific, attributable cause. This is the loop: four steps, each answering one question, each with the observability signal that answers it.
Locate → Correlate → Isolate → Attribute. Each step narrows 'somewhere, something' to a specific cause — and tells you which observability signal to reach for.
1. Locate: Which slice is actually affected? Stop looking at the average. Break latency down by percentile, then by dimension: host, shard, tenant, region, app version, endpoint. The slow requests are concentrated somewhere; find the slice. (High-cardinality metrics / exemplars.)
2. Correlate: What do the slow requests share? Pull the slow requests and find their common factor. All on one host? One database shard? One customer? Requests that started after the 10:32 deploy? The shared attribute is the lead. (Traces with attributes, logs joined on a request ID.)
3. Isolate: Which component on the path is responsible? Take an exemplar slow trace and find its critical path. The span with the most self-time is the suspect component. Now you've gone from 'the app is slow' to 'the ranking service's DB query is slow on shard 7'. (Distributed tracing, critical-path analysis.)
4. Attribute: What is the root cause inside that component? GC pause? Lock contention? A hot key (Module 4)? A slow dependency (Module 7)? A cold cache after the deploy? This is where profiles and component-level metrics close the case. (Profiles, GC logs, lock metrics.)
The loop is what makes partial failure tractable. Skip Locate and you debug the whole system at once; skip Correlate and you isolate the wrong request; skip Isolate and you guess at the component. Run it in order and “the app is slow” becomes “shard 7’s host is GC-thrashing since the 10:32 deploy” in four steps.
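The Locate step amounts to a group-by followed by a percentile. A minimal sketch, assuming request samples are already available in memory with their dimensions attached; `RequestSample`, `percentile`, and `p99BySlice` are hypothetical names, and a real system would run this as a query against its metrics store.

```typescript
interface RequestSample {
  latencyMs: number
  dims: Record<string, string> // e.g. { host: "h1", shard: "7" }
}

// Nearest-rank percentile over a list of latencies (p in (0, 100]).
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Locate: p99 for each value of one dimension, worst slice first.
function p99BySlice(samples: RequestSample[], dim: string): Array<[string, number]> {
  const groups = new Map<string, number[]>()
  for (const s of samples) {
    const key = s.dims[dim] ?? "unknown"
    const bucket = groups.get(key) ?? []
    bucket.push(s.latencyMs)
    groups.set(key, bucket)
  }
  return Array.from(groups.entries())
    .map(([key, vals]): [string, number] => [key, percentile(vals, 99)])
    .sort((a, b) => b[1] - a[1])
}

// Made-up data: ten shards, only shard 7 is slow.
const samples: RequestSample[] = []
for (let i = 0; i < 100; i++) {
  const shard = String(i % 10)
  samples.push({ latencyMs: shard === "7" ? 900 : 20, dims: { shard } })
}
console.log(p99BySlice(samples, "shard")[0]) // worst slice first: shard 7 at 900 ms
```

In this example the overall mean is 108 ms, which says nothing about where the problem lives. The per-shard breakdown puts shard 7 at the top of the list.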
| Dimension | Metrics | Logs | Traces | Profiles |
|---|---|---|---|---|
| Answers 'what' vs 'why' | What: aggregate health | What happened, in detail | Where: across services | Why: inside a component |
| Per-request detail | None (pre-aggregated) | Full, per event | Full request path + timings | Statistical, not per-request |
| Cost at scale | Cheap (unless high-cardinality) | Expensive at volume | Expensive; sample it | Cheap if continuous + sampled |
| Best for | Alerting, dashboards, SLOs | Forensics on a known request | Locating the slow component (the loop) | Attributing CPU/alloc/lock time |
| Choose when | Always-on health, SLOs, and alerting. Your first line, but aggregate metrics hide the tail, so percentile-bucket and add exemplars. | You need the full detail of specific events and can afford the volume (or sample aggressively). Don't make logs your primary latency tool. | You're locating a slow component across services, the heart of the debugging loop. Sample heads/tails so cost stays sane. | You've isolated a component and need to know why it's slow inside (CPU, allocations, lock contention). |
They’re a pipeline, not a menu: metrics tell you something’s wrong (and for whom, if you bucket by percentile and dimension), traces tell you where, and profiles tell you why. Logs are forensics for a known request, not a discovery tool. The most common operational gap is having only metrics — averaged, low-cardinality metrics that render every partial failure invisible. Add exemplars and traces before you add another dashboard.
The January 4th, 2021 outage
Two lessons from the Slack outage that day. First, plan capacity for the spike you can predict (the Monday-morning ramp), including the components that don’t autoscale instantly, and leave headroom: queueing delay explodes as utilization approaches 1, well before you actually reach 100%. Second, make your observability independent enough to survive the outage it’s meant to diagnose: out-of-band metrics, separate failure domains for monitoring, and dashboards that degrade gracefully.
What this sounds like in an interview
Users report the app is slow, but every dashboard you have is green. Walk me through what you do.
The interviewer wants to see a method for partial failure — not 'I'd check the logs.'
I'd check the logs and the CPU and memory dashboards to see if anything is spiking, and restart the service if something looks off.
Green dashboards probably means it's averages hiding it. I'd look at p99 latency instead of p50, and check if it's a specific endpoint or region. Then trace a slow request to see which service is slow.
The averages are hiding a partial failure, so I'd run a method: break latency down by percentile and then by dimension — host, shard, tenant, version — to locate which slice is affected. Then correlate the slow requests to find what they share, pull an exemplar trace and find its critical path to isolate the slow component, and use profiles or component metrics to attribute the root cause. The dashboards are green because they're aggregated; the problem is a minority that the mean erased.
Same Locate-Correlate-Isolate-Attribute method, but I'd also reason about why I couldn't see it and fix that. The green dashboards are an observability gap: I'm measuring averages and low-cardinality metrics, so I'd add percentile breakdowns with exemplars that link straight to traces, and high-cardinality dimensions like tenant and shard so a partial failure is visible by construction. On the tail itself, I'd remember that in a fan-out architecture my users' latency is dominated by the worst service per request — so I'd look at the slowest dependency on the critical path, not the average across services, and consider hedged requests if one replica's tail is the issue. And I'd ask whether this is a capacity problem masquerading as latency — a component near saturation where queueing is blowing up the tail — because the fix for that is headroom, not a code change. The meta-point: a partial failure that's invisible on the dashboard is a monitoring bug as much as a system bug, and I'd close both.
Ran a named method, treated the invisible failure as an observability bug to fix (high-cardinality + exemplars), reasoned about tail-at-scale and hedging, and asked whether it was really a capacity problem. That's someone who has debugged partial failure at 3am with half their tools down.
Don't alert on or reason from averages
A mean latency is the one number guaranteed to hide your worst users. Alert on percentiles (p99, p99.9) and bucket by dimension. An average that looks healthy while 2% of requests time out is not “mostly fine” — it’s a partial outage you’ve chosen not to see.
Don't trace 100% of requests at scale
Full tracing of every request is ruinously expensive in storage and throughput at high volume. Sample intelligently — keep a baseline rate plus all the slow/error traces (tail-based sampling) — so you pay for the traces that teach you something, not the millions of identical fast ones.
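The keep-or-drop decision behind tail-based sampling can be sketched in a few lines. This is an illustrative sketch only: `FinishedTrace`, `shouldKeep`, and `hashToUnit` are hypothetical names, the threshold and rate are assumed parameters, and a production collector also has to buffer spans until the trace is complete, which is not shown.

```typescript
interface FinishedTrace {
  traceId: string
  durationMs: number
  hasError: boolean
}

// Maps a trace ID to a stable value in [0, 1) using a 32-bit FNV-1a hash, so
// every collector makes the same baseline decision for the same trace.
function hashToUnit(id: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h / 0x100000000
}

// Tail-based sampling decision, made after the trace has completed:
// keep every slow or failed trace, plus a small baseline of the rest.
function shouldKeep(trace: FinishedTrace, slowThresholdMs: number, baselineRate: number): boolean {
  if (trace.hasError || trace.durationMs >= slowThresholdMs) return true
  return hashToUnit(trace.traceId) < baselineRate
}

// Assumed settings: anything over 500 ms counts as slow, keep 1% of the rest.
console.log(shouldKeep({ traceId: "t-1", durationMs: 900, hasError: false }, 500, 0.01)) // true
console.log(shouldKeep({ traceId: "t-2", durationMs: 12, hasError: true }, 500, 0.01)) // true
```

Hashing the trace ID instead of calling a random number generator keeps the baseline decision repeatable, which matters when more than one collector sees spans from the same trace.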
Don't capacity-plan on the average or at 100% utilization
Plan for the peak (the predictable Monday-morning spike), not the mean, and leave headroom: queueing latency rises sharply as utilization approaches 1, so a system sized for 100% util is already in trouble at 85%. Little’s Law gives you the concurrency; multiply for headroom.
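Both halves of that advice reduce to short formulas. The sketch below is a back-of-the-envelope illustration, not the reference implementation: the function names are chosen here, the queueing curve uses the idealized M/M/1 model (mean time in system W = S / (1 − ρ)) purely to show the shape, and the 70% target utilization is an assumed headroom policy.

```typescript
// Little's Law: L = λ · W. Average requests in flight equals the arrival rate
// times the average time each request spends in the system.
function requiredConcurrency(arrivalsPerSec: number, avgLatencyMs: number): number {
  return (arrivalsPerSec * avgLatencyMs) / 1000
}

// Headroom: size capacity so the peak runs at a target utilization below 1.
function capacityWithHeadroom(concurrency: number, targetUtilization: number): number {
  return Math.ceil(concurrency / targetUtilization)
}

// Mean time in system for an M/M/1 queue: W = S / (1 − ρ), where S is the
// service time and ρ the utilization. Idealized, but the shape is the point.
function mm1LatencyMs(serviceTimeMs: number, utilization: number): number {
  if (utilization >= 1) return Infinity
  return serviceTimeMs / (1 - utilization)
}

const inFlight = requiredConcurrency(2000, 50) // 2000 req/s at 50 ms each
console.log(`in flight at peak: ${inFlight}`) // 100
console.log(`slots at 70% target utilization: ${capacityWithHeadroom(inFlight, 0.7)}`) // 143

for (const rho of [0.5, 0.85, 0.95, 0.99]) {
  console.log(`utilization ${rho}: ${mm1LatencyMs(10, rho).toFixed(0)} ms`)
}
```

With a 10 ms service time, the modeled latency is about 20 ms at 50% utilization, 67 ms at 85%, 200 ms at 95%, and 1000 ms at 99%. Real systems are not M/M/1 queues, but the hockey-stick shape is the reason headroom is sized against the peak rather than the mean.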
Don't chase the p99.99 of everything
Tightening the extreme tail is expensive and has diminishing returns; not every endpoint deserves it. Spend the tail-latency budget where users feel it (the interactive request path) and accept a looser tail on background and batch work. Reliability effort, like everything else, should follow where the cost of slowness actually lands.
Exercises
In 09-trace-causality, implement hedged requests: given a service’s latency distribution, model sending a second request to a replica after the p95 and taking the first to return, and compute the resulting effective p99. Show that hedging meaningfully shrinks the tail at a small (~5%) increase in total request load.

```ts
// Each service stamps its own span using its local wall clock.
function startSpan(service: string, parentId: string | null): Span {
  return {
    id: newId(),
    parentId,
    service,
    startMs: Date.now(), // <- this machine's wall clock
    durationMs: 0,
  }
}
```

1. Averages hide the partial failures that are your actual problem. Measure and alert on percentiles (p99, p99.9), bucketed by high-cardinality dimensions, so the unhappy minority is visible.
2. Tail-at-scale: with fan-out N and each service slow 1% of the time, ~1 − 0.99^N of requests are slow (≈63% at N=100). Your p99 is not your users’ p99. The levers are reducing fan-out, hedged requests, and tightening the tail, not the median.
3. Debug partial failure with the Distributed Debugging Loop: Locate (which slice) → Correlate (what they share) → Isolate (which component, via the critical path) → Attribute (root cause, via profiles).
4. Optimize only what’s on the critical path, the chain of spans that gates completion. Most of a trace is off-path and speeding it up changes end-to-end latency by zero.
5. Capacity is math: Little’s Law (L = λ·W) sizes concurrency; plan for the predictable peak with headroom, because queueing latency explodes as utilization → 1 (Slack 2021). And keep your observability alive when the system isn’t.