Field Manual
Module 10 · Capstone · 70 min

Canonical Systems Teardown

Kafka · Spanner · DynamoDB · Cassandra through one repeatable lens

Framework: The 6-Lens Teardown Template
Anchored to: Spanner (OSDI'12), Dynamo (SOSP'07), DynamoDB (USENIX'22)

The point of studying canonical systems is not to memorize them — it’s to internalize a repeatable teardown you can apply to anything. Six lenses, applied in the same order, plus one question that finds the soul of the design: what did this system bet on?

The 6-Lens Teardown Template

Point these six lenses at any storage or messaging system, in order, and you reconstruct its entire design — and expose where it will hurt. The first five are the decisions from this course; the sixth is the synthesis that separates a description from an understanding.

Framework · Method

Six questions that reconstruct any system's design — and reveal its failure modes. The last lens is the one that matters most: what did it bet on?

| Lens | The question | Where it was covered |
| --- | --- | --- |
| 1 · Data model | What is the fundamental unit, and what access patterns does it privilege? | Module 4 |
| 2 · Partitioning | How is data split across machines, and what happens to a hot key? | Module 4 |
| 3 · Replication | How are copies kept, and what's the durability-vs-latency trade? | Module 3 |
| 4 · Consistency | What does a read guarantee, and is it chosen per operation? | Module 2 |
| 5 · Failure handling | What happens when a node dies, pauses, or partitions? | Modules 1, 7, 8 |
| 6 · The one genius idea | What single bet defines this system, and what did it cost? | All of it |

Run them in order and a system that looked like a black box becomes a sequence of legible trade-offs. The sixth lens is the test of real understanding: anyone can list features; naming the one bet and its price is what a Staff engineer does.

Four teardowns

Each of these systems made a different bet, and each bet is a concept from earlier in this course taken to its logical extreme. Read them side by side and the modules connect into a single map.

Teardown

Apache Kafka

The bet (lens 6)

The log is the universal abstraction. Make it append-only, partitioned, and let consumers track their own position — and fan-out becomes nearly free.

Data model
An append-only, ordered log per partition. Not a queue (messages aren't deleted on read) — a durable, replayable sequence. Consumers track an offset into it.
Partitioning
Topics split into partitions; a record's key hashes to a partition. Ordering is guaranteed within a partition, not across — so the partition key IS your ordering and parallelism boundary. A hot key is a hot partition (Module 4).
Replication
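Per-partition leader-follower with an in-sync replica set (ISR). When a write is acknowledged is configurable: with acks=all the leader waits until every replica currently in the ISR has the record, which is the durable choice; with acks=1 it acknowledges as soon as the leader alone has the record, which is the async data-loss trade from Module 3.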
Consistency
Strong order within a partition; at-least-once delivery by default, exactly-once effect via idempotent producers + transactions (the Effectively-Once Triangle, Module 6, productized).
Failure handling
A failed leader triggers election of a new one from the ISR (via the KRaft/ZooKeeper controller — consensus, Module 8). Unclean leader election trades consistency for availability and can lose data.
Connects to: Kafka is Module 6 (the log, idempotent producers, exactly-once effect) sitting on Module 3 (ISR replication) coordinated by Module 8 (controller consensus).
Kafka documentation & the original LinkedIn paper
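
To make the log-plus-offsets idea concrete, here is a minimal Python sketch of the data model and partitioning lenses. It is a toy under stated assumptions, not Kafka's implementation: the `MiniLog` and `Consumer` classes are hypothetical, everything lives in memory, and CRC32 stands in for whatever hash a real client's partitioner uses.

```python
import zlib
from collections import defaultdict


class MiniLog:
    """Toy partitioned, append-only log (hypothetical, in-memory only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash (CRC32); an assumption, not Kafka's partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset, max_records=100):
        """Return records starting at offset; nothing is deleted on read."""
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """Each consumer owns its position; the log keeps no per-consumer state."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition):
        records = self.log.read(partition, self.offsets[partition])
        self.offsets[partition] += len(records)
        return records

    def seek(self, partition, offset):
        # Replay is just moving the offset backwards.
        self.offsets[partition] = offset


log = MiniLog(num_partitions=4)
for i in range(3):
    log.append("user-42", f"event-{i}")  # same key -> same partition, in order

p = log.partition_for("user-42")
a, b = Consumer(log), Consumer(log)
first_a = a.poll(p)
first_b = b.poll(p)   # a second consumer reads the same records (fan-out)
a.seek(p, 0)
replayed = a.poll(p)  # replay from the start of the partition
```

Both consumers see the same three records because reading never removes anything, and replay is only a matter of moving an offset back. Since every record for `user-42` lands on one partition, a very busy key would load that single partition, which is the hot-partition risk named under Partitioning.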
Teardown

Google Spanner

The bet (lens 6)

Bound clock uncertainty with hardware (GPS + atomic clocks) and simply WAIT it out — and you can have externally-consistent transactions across the planet.

Data model
Relational rows with SQL and schematized semi-relational tables — a real database, not a key-value store. Rows are organized into hierarchies for locality.
Partitioning
Data split into ranges (tablets) that split and move based on load and size — range partitioning (Module 4) with automatic rebalancing.
Replication
Each tablet is a Paxos group replicated synchronously across zones/regions. Writes go through Paxos — a majority must agree (Module 8). Durable and consistent, at the cost of cross-region write latency.
Consistency
External consistency (linearizable, Module 2's strongest rung) globally. Achieved with TrueTime: an interval clock with a bounded error, where commits WAIT OUT the uncertainty before becoming visible (Module 1, done right).
Failure handling
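A Paxos group keeps making progress as long as a majority of its replicas are up and can reach each other, so it survives minority failures; during a partition, only the majority side continues. Leader leases (Module 8) ensure a single writer per group.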
Connects to: Spanner is Module 1 (treat clock error as a first-class value, don't round it to zero) enabling Module 2's strongest consistency at global scale, via Module 8's consensus.
Spanner: Google's Globally-Distributed Database (OSDI 2012)
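
The commit-wait rule can be sketched in a few lines of Python. This is an illustration under assumptions, not Spanner's code: `SimulatedTrueTime` is a hypothetical stand-in for TrueTime that reports an interval around a manually advanced clock, times are integer microseconds, and the 4 ms error bound is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: int  # true time is guaranteed to be >= earliest
    latest: int    # true time is guaranteed to be <= latest


class SimulatedTrueTime:
    """Hypothetical clock with a known error bound (all values in microseconds)."""

    def __init__(self, epsilon_us):
        self.epsilon_us = epsilon_us  # assumed worst-case clock error
        self.true_time_us = 0         # simulated "real" time, advanced manually

    def now(self):
        return TTInterval(self.true_time_us - self.epsilon_us,
                          self.true_time_us + self.epsilon_us)

    def advance(self, us):
        self.true_time_us += us


def commit(tt, tick_us=1000):
    """Pick a commit timestamp, then wait until it is definitely in the past.

    Returns (commit_timestamp, microseconds_spent_waiting).
    """
    s = tt.now().latest            # a timestamp no earlier than true time
    waited = 0
    while tt.now().earliest <= s:  # commit wait: s might not have passed yet
        tt.advance(tick_us)        # a real system would sleep here
        waited += tick_us
    return s, waited


tt = SimulatedTrueTime(epsilon_us=4000)
ts, waited = commit(tt)  # ts == 4000, waited == 9000
```

With these numbers the commit waits about twice the error bound (2 × 4000 µs, plus one tick of the simulation loop). That is the price of the bet: the tighter the clock bound, the shorter every commit has to wait.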
Teardown

Amazon DynamoDB

The bet (lens 6)

Make every dimension a dial you can turn, so performance is PREDICTABLE at any scale. No surprises is the feature.

Data model
Key-value / document, addressed by a partition key (+ optional sort key). The data model deliberately privileges point lookups and bounded queries — the access patterns that stay predictable at scale.
Partitioning
Hash on the partition key, spread across many storage nodes; partitions split automatically as they grow or get hot. Hot keys are the known enemy (Module 4) — adaptive capacity and split-for-heat exist precisely to fight them.
Replication
Synchronous replication across multiple availability zones; a write is durable once a quorum of replicas in the partition's replication group acknowledges (Module 3).
Consistency
Per-operation, by design (Module 2): eventually-consistent reads by default (cheap, fast), strongly-consistent reads on request (a flag on the read). You choose at the call site.
Failure handling
Transparent to the caller — AZ failures, node failures, and splits are handled below the API. The original Dynamo used vector clocks (Module 1) for conflict reconciliation; DynamoDB favors predictability and managed conflict handling.
Connects to: DynamoDB is Module 4 (partitioning + relentless hot-key defense) plus Module 2 (per-operation consistency as an explicit dial) in service of one goal: predictable latency at any scale.
Dynamo (SOSP 2007) & DynamoDB (USENIX ATC 2022)
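
A toy model shows what the per-operation choice buys. The sketch below is an assumption-laden illustration, not DynamoDB's internals: `ReplicationGroup` is hypothetical, the group has exactly three in-memory replicas, strong reads are assumed to be served by the leader, and the `replica` argument exists only to pin which replica an eventually-consistent read happens to hit.

```python
class ReplicationGroup:
    """Toy three-replica group: index 0 is the leader (hypothetical model)."""

    def __init__(self):
        self.replicas = [{}, {}, {}]
        self.pending = []  # writes not yet applied to the lagging follower

    def put(self, key, value):
        # Acknowledge once a quorum (2 of 3) has the write.
        self.replicas[0][key] = value
        self.replicas[1][key] = value
        self.pending.append((key, value))  # the third replica catches up later

    def catch_up(self):
        """Apply outstanding writes to the lagging follower."""
        for key, value in self.pending:
            self.replicas[2][key] = value
        self.pending.clear()

    def get(self, key, consistent_read=False, replica=2):
        # Strong reads go to the leader; eventual reads may hit any replica.
        source = self.replicas[0] if consistent_read else self.replicas[replica]
        return source.get(key)


group = ReplicationGroup()
group.put("cart#1", "v1")

stale = group.get("cart#1")                         # lagging replica: None
strong = group.get("cart#1", consistent_read=True)  # leader: "v1"
group.catch_up()
converged = group.get("cart#1")                     # now "v1" everywhere
```

In the real service the dial is the `ConsistentRead` parameter on read calls such as GetItem. The sketch only shows why the cheap default can briefly return older data while the flagged read cannot.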
Teardown

Apache Cassandra

The bet (lens 6)

No leader, no single point of failure. Any node takes any write, consistency is a per-query dial, and the cluster heals itself — availability above all.

Data model
Wide-column: a partition key plus clustering columns that keep rows sorted within a partition. The data model is designed around your queries — you denormalize per access pattern.
Partitioning
Consistent hashing on a token ring with virtual nodes (Module 4, exactly the reference implementation), so the fleet can grow and shrink with minimal data movement.
Replication
Leaderless. Each key is replicated to N nodes around the ring; any replica can take a read or write. Multi-datacenter aware.
Consistency
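Tunable per query: ONE, QUORUM, ALL (Module 2's spectrum, exposed as a parameter). When W + R > N, every read quorum overlaps every write quorum, so a read sees the latest acknowledged write (read-your-writes); anything lower trades that guarantee for speed and availability. Conflicts are resolved last-write-wins (Module 1's data-loss trap, so beware).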
Failure handling
Designed to never stop taking writes: hinted handoff stores writes for a down node, read-repair and anti-entropy (Merkle trees) converge replicas later. No election, no failover — there's no leader to lose.
Connects to: Cassandra is Module 4 (consistent hashing) + Module 3 (leaderless quorum replication) + Module 2 (tunable consistency as a query parameter), betting everything on availability — and inheriting Module 1's LWW data-loss risk as the price.
Cassandra documentation & the Lakshman/Malik paper
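
The partitioning, replication, and consistency lenses for a leaderless ring fit in one short Python sketch. Treat it as a teaching model under assumptions rather than Cassandra's code: `TokenRing`, `token`, and `quorums_overlap` are hypothetical names, MD5 is a stand-in hash, and eight virtual nodes per machine is an arbitrary choice.

```python
import bisect
import hashlib


def token(value):
    # Stand-in hash (first 8 bytes of MD5); not Cassandra's actual partitioner.
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    """Consistent-hash ring where each node owns several virtual-node tokens."""

    def __init__(self, nodes, vnodes_per_node=8):
        self.ring = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.tokens = [t for t, _ in self.ring]

    def replicas_for(self, key, n):
        """Walk clockwise from the key's token, collecting n distinct nodes."""
        start = bisect.bisect_right(self.tokens, token(key))
        found = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == n:
                break
        return found


def quorums_overlap(n, w, r):
    """True when every read set must intersect every write set."""
    return w + r > n


ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas_for("user-42", n=3)  # three distinct replicas

# Growing the fleet: keys either stay put or move to the new node.
bigger = TokenRing(["node-a", "node-b", "node-c", "node-d", "node-e"])
keys = [f"key-{i}" for i in range(200)]
moved = [k for k in keys if ring.replicas_for(k, 1) != bigger.replicas_for(k, 1)]
```

Every key in `moved` now has `node-e` as its first replica, which is the "minimal data movement" property. `quorums_overlap(3, 2, 2)` is true (QUORUM reads and writes with N = 3), while `quorums_overlap(3, 1, 1)` is false, which is why ONE/ONE can return stale data.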
| Dimension | Kafka | Spanner | DynamoDB | Cassandra |
| --- | --- | --- | --- | --- |
| Data model | Partitioned append-only log | Relational + SQL | Key-value / document | Wide-column |
| Consistency default | Per-partition order; EOS optional | External (linearizable), global | Eventual; strong on request | Tunable per query (ONE/QUORUM/ALL) |
| Topology | Leader-follower per partition (ISR) | Paxos groups, synchronous | Quorum across AZs, managed | Leaderless ring, no SPOF |
| The defining bet | The log as universal abstraction | TrueTime: bound clock error, wait it out | Predictability as the feature | Availability above all; tunable |
| Choose when | You need a durable, replayable, ordered event stream and cheap fan-out to many consumers: event sourcing, stream processing, the backbone of an event-driven architecture. | You need strong, global, transactional consistency (financial ledgers, inventory) and can pay cross-region write latency and the operational cost. | You need predictable single-digit-ms performance at any scale with point-lookup access patterns and don't want to operate the database. | You need always-on writes, multi-datacenter availability, and can model your data around queries and tolerate (or avoid) LWW conflicts. |
Verdict

There is no “best” — each is the logical conclusion of a different bet. Kafka bet on the log, Spanner on bounded time, DynamoDB on predictability, Cassandra on availability. The Staff move in an interview (and in a vendor evaluation) is to name the bet and then ask whether your workload wants what it bought — and can afford what it cost. A system is only “wrong” relative to a workload.

How this fails in production · Apache Kafka

Unclean leader election — when availability silently eats your data

The setup
A Kafka partition has a leader and an in-sync replica set (ISR). With acks=all, a producer’s write is acknowledged only once every in-sync replica has it — the durable choice. But operators control a setting, unclean.leader.election.enable, that decides what happens when every in-sync replica is unavailable.
What happened
If the leader and all in-sync replicas fail, Kafka faces a choice: wait (the partition is unavailable until an in-sync replica returns) or elect an out-of-sync replica that’s missing the most recent writes (available, but those acknowledged writes are gone). With unclean election enabled, Kafka picks availability — and silently truncates the data the new leader never received. Teams that enabled it for uptime discovered, during an incident, that they’d also enabled silent data loss of acknowledged records.
The moment it went wrong
This is the Write-Path Trilemma (Module 3) and the CAP choice (Module 2) as a single config flag. “Available even when all in-sync replicas are gone” and “never lose an acknowledged write” are mutually exclusive, and Kafka makes you choose explicitly. The failure is choosing it implicitly — flipping the flag for availability without realizing you traded away the durability you thought acks=all guaranteed.
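
The choice can be replayed in a small simulation. The Python below is a sketch of the decision described here, not broker code: `Replica`, `elect_leader`, and `scenario` are hypothetical names, and the three-broker layout with one lagging replica is an assumed example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    log: list = field(default_factory=list)
    alive: bool = True
    in_sync: bool = True


def elect_leader(replicas, unclean_allowed):
    """Return the new leader, or None if the partition must stay offline."""
    live = [r for r in replicas if r.alive]
    in_sync = [r for r in live if r.in_sync]
    if in_sync:
        return in_sync[0]  # clean election: no acknowledged write is lost
    if unclean_allowed and live:
        return live[0]     # unclean: an out-of-sync log becomes the truth
    return None            # wait for an in-sync replica to come back


def scenario(unclean_allowed):
    """All ISR members fail; report who leads and which acked writes vanish."""
    acked = ["w1", "w2", "w3"]
    leader = Replica("broker-1", log=list(acked))
    follower = Replica("broker-2", log=list(acked))
    laggard = Replica("broker-3", log=["w1"], in_sync=False)  # fell out of the ISR
    leader.alive = follower.alive = False
    new_leader = elect_leader([leader, follower, laggard], unclean_allowed)
    if new_leader is None:
        return None, []
    lost = [w for w in acked if w not in new_leader.log]
    return new_leader.name, lost


print(scenario(unclean_allowed=False))  # (None, [])
print(scenario(unclean_allowed=True))   # ('broker-3', ['w2', 'w3'])
```

With the flag off, the partition is unavailable but nothing acknowledged is lost. With it on, the partition is serving again and two acknowledged writes are gone, with no error raised anywhere.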
The transferable lesson

Every canonical system encodes the trade-offs from this course as configuration, and the defaults are someone else’s opinion about your workload. Tearing a system down with the six lenses isn’t academic — it’s how you find the flags that decide whether you lose data, and set them on purpose. Read the trade-off before the incident reads it to you.

Kafka documentation — replication & unclean.leader.election.enable

What this sounds like in an interview

Calibration ladder · L3 → L6

Walk me through how you'd design a system like Kafka — a durable, high-throughput event log.

The interviewer wants to see whether you reach for the defining abstraction (the log) and reason about the trade-offs, or just draw queues and brokers.

L3 · Junior

I'd have producers send messages to a broker, store them in a queue, and have consumers pull from the queue and acknowledge each message so it gets removed.

Missed: Designed a queue, not a log — messages deleted on read, no replay, no fan-out. Missed the entire defining idea.
L4 · Mid

An append-only log instead of a queue, partitioned across brokers for throughput, replicated for durability. Consumers track an offset so they can replay, and messages aren't deleted on read — they age out by retention.

Missed: Has the log and the offset — the right abstraction — but treats it as a feature list rather than reasoning about the trade-offs (durability flags, hot partitions, exactly-once).
L5 · Senior

The core bet is the log as the abstraction: append-only, partitioned (ordering and parallelism per partition, keyed so related events stay ordered), with consumers owning their offset — that's what makes fan-out cheap, because the broker doesn't track per-consumer state. Replication is leader-follower per partition with an in-sync replica set; acks=all gives durability at the cost of write latency. For exactly-once I'd lean on idempotent producers plus transactional writes — at-least-once delivery plus dedup, since exactly-once delivery isn't real.

Missed: The explicit failure-mode flags (unclean election), the hot-partition risk, and the 'partition key does three jobs' insight. Otherwise strong: names the bet and reasons about replication and exactly-once.
L6 · Staff

Same log-centric design, but I'd build it as a stack of the trade-offs and be explicit about each bet. Partitioning: the partition key is simultaneously the ordering boundary, the parallelism unit, and the hot-key risk, so I'd call out that a skewed key melts a partition (Module 4) and how I'd detect it. Replication: per-partition ISR with acks=all for durability, and I'd flag the unclean-leader-election trade explicitly — it's the Write-Path Trilemma as a config flag, and I will not enable it silently. Consistency: per-partition total order only, never global, so cross-partition ordering is the consumer's problem; exactly-once is the effectively-once triangle (idempotent producer + transactional offset commit + idempotent consumer). Coordination: leader election and membership go through a consensus layer (KRaft) — I wouldn't roll my own. And I'd close by naming the bet: the log plus consumer-owned offsets is what turns one write into N cheap reads, and everything else is in service of keeping that log durable and ordered. The trade is that I've given up global ordering and made the partition key load-bearing for three different things at once.

What scored L6

Reached for the defining abstraction, built it as a stack of named trade-offs (each traceable to a module), surfaced the dangerous config (unclean election) before being asked, and closed by naming the bet and its cost. That's the six-lens teardown run in reverse — design as teardown.

When NOT to use this
Don't use Kafka as a database or a task queue

Kafka is a durable, ordered log optimized for sequential throughput and replay — not random point lookups (no index by arbitrary key) and not per-message ack/retry/priority semantics. Using it as a primary datastore or a job queue fights its data model. Reach for it when you want an ordered, replayable stream with cheap fan-out.

Don't use Spanner when you don't need global strong consistency

Spanner’s external consistency is expensive — cross-region Paxos writes and the operational/financial cost of the platform. If your workload is single-region, or tolerates per-operation weaker consistency (most do), you’re paying for a guarantee you aren’t using. Match the consistency to the operation (Module 2), not to the most impressive database.

Don't pick Cassandra for strong consistency or ad-hoc queries

Cassandra bets on availability and tunable consistency, with last-write-wins conflict resolution that can silently drop concurrent writes (Module 1). It rewards data modeled tightly around known queries and punishes ad-hoc access and read-modify-write invariants. If you need strong consistency or flexible querying, it’s the wrong bet.
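
A few lines of Python show how last-write-wins drops a concurrent update. This is a generic LWW register under assumptions, not Cassandra's storage engine: `Version` and `lww_merge` are hypothetical names and the timestamps are made-up microsecond values.

```python
from typing import NamedTuple


class Version(NamedTuple):
    timestamp: int  # writer-supplied clock reading (hypothetical values)
    value: int


def lww_merge(a, b):
    """Last-write-wins: keep the higher timestamp, discard the other version."""
    return a if a.timestamp >= b.timestamp else b


stored = Version(timestamp=1_000, value=100)  # balance is 100

# Two clients read 100 at about the same time and each add a deposit.
from_client_a = Version(timestamp=2_000, value=stored.value + 50)
from_client_b = Version(timestamp=2_001, value=stored.value + 30)

# Replicas converge by merging; the order of merging does not matter.
final = lww_merge(lww_merge(stored, from_client_a), from_client_b)
# final.value == 130, although 50 + 30 was deposited on top of 100
```

The replicas agree on 130, so the system is convergent, but client A's deposit has disappeared without any error. That is the read-modify-write invariant this section warns about.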

Don't choose any of them by reputation

“Netflix uses Cassandra” and “Google built Spanner” tell you nothing about your workload. Each system is the right answer to the bet it made and the wrong answer to the others. Run the six lenses against your access patterns, consistency needs, and scale — then choose. The teardown is the selection method.

Exercises

Exercise · Design scenario
Apply the 6-Lens Teardown Template to a system you use but haven’t studied internally: pick one of Redis, PostgreSQL, Elasticsearch, or S3. Write up the first five lenses (data model, partitioning, replication, consistency, failure handling), then name its one genius idea and what that bet cost. The goal is to prove the lens works on a system nobody handed you notes for.
Exercise · Implementation task
Extend a reference implementation from an earlier module into a mini-version of a canonical system. Two good options: turn 06-idempotency-outbox into a tiny Kafka — a partitioned, append-only log with per-consumer offsets and replay — or extend 04-consistent-hashing with quorum reads/writes and read-repair (from 03-log-replication) into a tiny Cassandra. Document which of the six lenses your version implements and which it punts on, and why.
Exercise · Find the race
A teammate proposes: “We’ll use Cassandra with consistency level ONE for both reads and writes — it’ll be blazing fast and always available.” The data is user account balances. Find the bug in this design before it ships.
Walk away with this
  • 01 · Don’t memorize systems; carry the 6-Lens Teardown Template: data model, partitioning, replication, consistency, failure handling, and the one genius idea. It works on any system, including ones nobody briefed you on.
  • 02 · The sixth lens is the one that matters: name the bet. Kafka bet on the log, Spanner on bounded time, DynamoDB on predictability, Cassandra on availability. A system is only “right” relative to the workload that wants its bet and can afford its cost.
  • 03 · Canonical systems encode this course’s trade-offs as configuration (Kafka’s unclean leader election, Cassandra’s consistency level, DynamoDB’s strong-read flag). The defaults are someone else’s opinion about your workload, so read the trade before the incident does.
  • 04 · Choose by teardown, not reputation. “Company X uses Y” is not a reason. Run the six lenses against your access patterns, consistency needs, and scale.
  • 05 · Designing a system is the teardown run in reverse: name the abstraction, then make each of the six decisions as a deliberate, defensible trade. That is the skill this entire course was building.