Distributed Systems in Practice
The Senior Engineer's Field Manual
You've shipped on a single Postgres, a Redis, and an app server. This is the course that takes you from there to leading an L5 system design interview and owning a distributed feature in production — built around named frameworks, runnable code, real postmortems, and visualizations you can actually poke at.
Named, citable frameworks
Every module ships an original framework you can name in a design review — The Assumption Ledger, The Write-Path Trilemma, The Resilience Stack, The Effectively-Once Triangle. Built to be screenshotted.
Runnable reference code
Not pseudocode. Every pattern has a real project, in Go or TypeScript, that runs locally from its README. Vector clocks, consistent hashing, an idempotent service, a retry-storm simulator you can break.
Real postmortems
Each module is anchored to a public incident — AWS, GitHub, Discord, Roblox, the Redlock debate — with the exact moment it went wrong and the transferable lesson extracted.
Interactive, not static
Step a vector clock through three replicas and watch concurrency form. Drive a retry storm into metastable collapse. Drag nodes in and out of a hash ring. The visuals teach what prose can't.
When NOT to use it
The differentiator. Every pattern comes with the contrarian section: where it's the wrong tool, where it adds cost with no benefit, and the simpler thing you probably actually need.
Interview-calibrated
Every key question shows L3 → L4 → L5 → L6 answers verbatim, with what each level missed — so you can see exactly where you sit and the specific move to the next tier.
Curriculum
All ten modules are written — each with a named framework, a runnable reference implementation, a real postmortem, and interactive visualizations. Start anywhere; Module 1 is the foundation everything else builds on.
The Single-Node Lie
Foundations · 55 min
Time, order, and failure when there's more than one machine
Consistency Is a Per-Operation Decision
Foundations · 50 min
The spectrum, PACELC, and why 'strong vs eventual' is the wrong question
Replication: Keeping Copies Honest
Core · Data Layer · 60 min
Leader-follower, multi-leader, quorums, and the lag you must budget for
Partitioning & the Physics of Hot Keys
Core · Data Layer · 55 min
Hash vs range, consistent hashing, rebalancing, and the one key that melts a shard
Transactions Across the Network
Core · Data Layer · 60 min
Real isolation levels, the anomalies they miss, and why 2PC is an availability liability
Idempotency & 'Exactly-Once' That Survives Contact
Communication · 60 min
Idempotency keys, the transactional outbox, and effectively-once delivery
Reliability: Retries, Breakers, Backpressure, Bulkheads
Reliability · 65 min
The layered defense, and how each layer causes the outage it was meant to prevent
Coordination: Consensus, Leases, Locks, Sagas
Coordination · 65 min
When to reach for Raft, when a lock needs a fencing token, when to avoid coordination entirely
Operating It: Observability, Tail Latency & Capacity
Operations · 55 min
Debugging partial failure, the tail-at-scale math, and capacity planning that holds
Canonical Systems Teardown
Capstone · 70 min
Kafka · Spanner · DynamoDB · Cassandra through one repeatable lens
Who this is for
You should read this if
- ✓ You've shipped real features but never designed a system bigger than one database, one cache, and an app server.
- ✓ You've heard of CAP, consensus, and idempotency but can't yet make the trade-off under interview pressure or implement the pattern yourself.
- ✓ You want to lead an L5 system-design loop and own a distributed feature in production, not just pass a quiz on it.
- ✓ You read engineering blogs and want trade-offs and runnable code, not a glossary.
You should skip this if
- You're looking for a gentle, code-free 'intro to distributed systems'. This goes deeper than Designing Data-Intensive Applications (DDIA) on the parts people get wrong.
- You want a cloud-vendor certification checklist. This is engineering judgment, not a product tour.
- You're already operating multi-region consensus systems at Staff+. Read the AI Systems course instead.
Module 1 — The Single-Node Lie
Time, order, and failure when there's more than one machine. Includes the Assumption Ledger, an interactive vector-clock stepper, a runnable reference implementation, the AWS DynamoDB 2015 postmortem, and an L3 → L6 calibration ladder.