Premium course · Mid → Senior (L4 → L5)

Distributed Systems in Practice

The Senior Engineer's Field Manual

You've shipped on a single Postgres, a Redis, and an app server. This is the course that takes you from there to leading an L5 system design interview and owning a distributed feature in production — built around named frameworks, runnable code, real postmortems, and visualizations you can actually poke at.

Read Module 1 free →
See the curriculum

Named, citable frameworks

Every module ships an original framework you can name in a design review — The Assumption Ledger, The Write-Path Trilemma, The Resilience Stack, The Effectively-Once Triangle. Built to be screenshotted.

Runnable reference code

Not pseudocode. Every pattern has a real project in Go or TypeScript, with a README that gets it running locally. Vector clocks, consistent hashing, an idempotent service, a retry-storm simulator you can break.

Real postmortems

Each module is anchored to a public incident — AWS, GitHub, Discord, Roblox, the Redlock debate — with the exact moment it went wrong and the transferable lesson extracted.

Interactive, not static

Step a vector clock through three replicas and watch concurrency form. Drive a retry storm into metastable collapse. Drag nodes in and out of a hash ring. The visuals teach what prose can't.

When NOT to use it

The differentiator. Every pattern comes with the contrarian section: where it's the wrong tool, where it adds cost with no benefit, and the simpler thing you probably actually need.

Interview-calibrated

Every key question shows L3 → L4 → L5 → L6 answers verbatim, with what each level missed — so you can see exactly where you sit and the specific move to the next tier.

Curriculum

All ten modules are written — each with a named framework, a runnable reference implementation, a real postmortem, and interactive visualizations. Start anywhere; Module 1 is the foundation everything else builds on.

The Single-Node Lie

Foundations · 55 min

Time, order, and failure when there's more than one machine

Framework: The Assumption Ledger · The Failure-Detector Trilemma
Anchored to: AWS DynamoDB metadata storm (Sept 2015)

Consistency Is a Per-Operation Decision

Foundations · 50 min

The spectrum, PACELC, and why 'strong vs eventual' is the wrong question

Framework: The Consistency Spectrum Dial · PACELC Decision Grid
Anchored to: GitHub 24-hour outage (Oct 2018)

Replication: Keeping Copies Honest

Core · Data Layer · 60 min

Leader-follower, multi-leader, quorums, and the lag you must budget for

Framework: The Write-Path Trilemma · The Lag Budget
Anchored to: GitHub Orchestrator cross-region failover (Oct 2018)

Partitioning & the Physics of Hot Keys

Core · Data Layer · 55 min

Hash vs range, consistent hashing, rebalancing, and the one key that melts a shard

Framework: The Skew Budget · Hot-Key Triage Tree
Anchored to: Discord, storing trillions of messages

Transactions Across the Network

Core · Data Layer · 60 min

Real isolation levels, the anomalies they miss, and why 2PC is an availability liability

Framework: The Isolation Anomaly Ladder · The 2PC Liability Matrix
Anchored to: Jepsen isolation-anomaly analysis

Idempotency & 'Exactly-Once' That Survives Contact

Communication · 60 min

Idempotency keys, the transactional outbox, and effectively-once delivery

Framework: The Effectively-Once Triangle · Idempotency-Key Lifecycle
Anchored to: Stripe idempotency design + duplicate-charge incidents

Reliability: Retries, Breakers, Backpressure, Bulkheads

Reliability · 65 min

The layered defense, and how each layer causes the outage it was meant to prevent

Framework: The Resilience Stack · The Retry Budget
Anchored to: Roblox 73-hour metastable outage (Oct 2021)

Coordination: Consensus, Leases, Locks, Sagas

Coordination · 65 min

When to reach for Raft, when a lock needs a fencing token, when to avoid coordination entirely

Framework: The Coordination Cost Ladder · The Fencing-Token Rule
Anchored to: The Redlock debate (Kleppmann vs antirez)

Operating It: Observability, Tail Latency & Capacity

Operations · 55 min

Debugging partial failure, the tail-at-scale math, and capacity planning that holds

Framework: The Distributed Debugging Loop · Tail-at-Scale Math
Anchored to: Slack outage postmortem (Jan 2021)

Canonical Systems Teardown

Capstone · 70 min

Kafka · Spanner · DynamoDB · Cassandra through one repeatable lens

Framework: The 6-Lens Teardown Template
Anchored to: Spanner (OSDI'12), Dynamo (SOSP'07), DynamoDB (USENIX ATC'22)

Who this is for

You should read this if

✓ You've shipped real features but never designed a system bigger than one database, one cache, and an app server.
✓ You've heard of CAP, consensus, and idempotency but can't yet make the trade-off under interview pressure or implement the pattern yourself.
✓ You want to lead an L5 system-design loop and own a distributed feature in production, not just pass a quiz on it.
✓ You read engineering blogs and want trade-offs and runnable code, not a glossary.

You should skip this if

· You're looking for a gentle, code-free 'intro to distributed systems'. This goes deeper than DDIA (Designing Data-Intensive Applications) on the parts people get wrong.
· You want a cloud-vendor certification checklist. This is engineering judgment, not a product tour.
· You're already operating multi-region consensus systems at Staff+. Read the AI Systems course instead.

Start free

Module 1 — The Single-Node Lie

Time, order, and failure when there's more than one machine. Includes the Assumption Ledger, an interactive vector-clock stepper, a runnable reference implementation, the AWS DynamoDB 2015 postmortem, and an L3 → L6 calibration ladder.

Read Module 1 →