Replication: Keeping Copies Honest
Leader-follower, multi-leader, quorums, and the lag you must budget for
Replication exists for three reasons that pull against each other: durability (don’t lose acknowledged writes), read scaling and latency (serve reads from nearby copies), and availability (survive a node dying). You cannot maximize all three. This module is about the trade you’re actually making when you pick a replication mode — and the lag budget that decides which reads are even allowed to touch a replica.
The write path can't have all three
When a write arrives, the leader faces one question: how many copies must hold this before I tell the client “done”? Answer “just me” (asynchronous) and the ack is fast and survives a dead follower — but if the leader dies before replicating, the acknowledged write is gone. Answer “me and every follower” (synchronous) and no acked write is ever lost — but one slow or dead follower stalls or blocks every write. Answer “a majority” (quorum) and you get the balanced middle. Three modes, one trilemma.
The Write-Path Trilemma
Durability, write latency, and availability under failure. Each replication mode is a different corner. Naming the corner you chose is the whole skill.
The reason this matters in an interview and at 3am: when someone says “we have a replica, so we’re safe,” ask which corner. A bank running async replication for write latency has chosen to lose acknowledged transactions on a leader crash, whether or not anyone decided that on purpose.
The clearest way to feel it: an asynchronous ack is a promise the system can’t yet keep. Here’s the policy knob from the reference implementation — the only difference between “safe” and “fast” is how many holders you count before acking.
1func (c *Cluster) Durable(index int, mode AckMode) bool {2 switch mode {3 case Async:4 return c.Leader.LastIndex() >= index // the leader alone is "enough"5 case Quorum:6 holders := 07 if c.Leader.LastIndex() >= index {8 holders++9 }10 for _, f := range c.Followers {11 if f.LastIndex() >= index {12 holders++13 }14 }15 return holders > (len(c.Followers)+1)/2 // a majority must hold it16 }17 return false18}courses/distributed-systems/reference-impl/03-log-replication/A single-leader log with controllable follower lag. The demo shows an async-acked write lost when the leader crashes before replicating, the same write surviving under quorum, and a stale replica healed by read-repair. go run . · tests with go test ./....
The Lag Budget: which reads may touch a replica
The double-transfer bug isn’t fixed by “turn off replicas.” It’s fixed by deciding, per read path, how stale a value is allowed to be — and routing accordingly. That decision is the Lag Budget: every read in your system gets an explicit maximum tolerable staleness, and only reads whose budget exceeds a replica’s current lag are allowed to use it.
The Lag Budget
Assign every read path a staleness budget. Reads with a zero budget go to the leader (or carry a session token). Everything else earns the right to scale on replicas.
| Read path | Tolerable lag | Where it's allowed to read |
|---|---|---|
| Your own balance right after a transfer | 0 — must be exact | Leader, or a replica caught up to your write (session token). |
| Someone else's profile / post | Seconds | Any healthy replica. The 90% case for read scaling. |
| Search index / 'people you may know' | Minutes | A dedicated, possibly batch-updated replica or index. |
| Analytics / reporting dashboard | Minutes to hours | A read replica explicitly isolated from serving traffic. |
| Fraud check before approving a charge | 0 — correctness-critical | Leader only, or a linearizable read. Never a lagging replica. |
The budget is the artifact you put in the design doc. It turns “reads scale on replicas” from a hope into a routing rule, and it makes the dangerous reads (the two with a zero budget) visible before they ship.
Topologies are just 'where do writes enter?'
Single-leader: all writes go to one node, replicated out. Simple, no write conflicts, but the leader is a write bottleneck and a failover event. The default, and right far more often than people admit.
Multi-leader: writes accepted in multiple regions, replicated between them. Buys local write latency and survives a region outage — at the cost of write conflicts (two regions edit the same row) you must resolve with the tools from Module 1 (vector clocks, CRDTs). Don’t reach for it for “HA”; reach for it when geography forces local writes.
Leaderless (quorum): any replica takes writes; reads and writes hit a quorum and reconcile (Dynamo, Cassandra). Maximum availability, but you own conflict resolution and read-repair. This is where the read-repair in the reference impl lives.
| Dimension | Sync | Async | Semi-sync | Leaderless quorum |
|---|---|---|---|---|
| Acked write lost on leader crash? | Never | Highest — wait for all | No — one dead follower blocks writes | Zero on synced replicas |
| Write latency | Yes — the default data-loss bug | Lowest — ack on leader | Yes | Unbounded by lag |
| Writes survive a dead follower? | Rare — needs ≥1 follower acked | Moderate — wait for one | Yes (falls back to async) | Low on the synced follower |
| Read staleness | No if W+R>N quorum | Moderate — wait for a quorum | Yes — any node takes writes | Bounded; healed by read-repair |
| Operational complexity | — | — | — | — |
| Choose when | Tiny clusters where losing an acked write is unthinkable and you can tolerate a follower blocking writes (rare; e.g. a config store). | Staleness and a small data-loss window on crash are acceptable, and write latency is king. Honest default for non-critical data. | You want most of sync's durability without one slow follower halting writes. A strong default for important relational data. | You need multi-region writes and maximum availability, and you've budgeted for conflict resolution and read-repair. |
For important data on a relational store, semi-synchronous (ack after the leader + at least one follower) is the pragmatic sweet spot — it removes the async-data-loss bug without letting a single slow follower take down writes. Reserve full sync for small, correctness-critical clusters and leaderless quorum for when geography genuinely demands it. Plain async is fine — as long as someone decided the data-loss window is acceptable, rather than inheriting it.
The database outage of 31 January 2017
rm -rf on what they believed was the empty, to-be-rebuilt replica’s data directory — but it was the primary. ~300 GB of production data gone in seconds. Then the real horror unfolded: of five configured backup/replication mechanisms, none were working. Replication was lagging, the backups were silently failing, the snapshots weren’t taken. They recovered from a six-hour-old staging copy a team member happened to have, losing ~6 hours of data.rm -rf: a replica had been quietly treated as a backup. A replica protects against a node failing; it does nothing against a write you didn’t mean to make — including a destructive command — because it faithfully replicates that too. The replication lag that started the night was a symptom that the durability story had been broken for a long time and nobody was watching the lag.A replica is not a backup. Replication gives you availability and read scaling; it propagates your mistakes at the speed of the network. You need independent, tested, point-in-time backups and active monitoring of replication lag — because lag isn’t just a staleness problem, it’s the early warning that your durability guarantees have quietly stopped being true.
What this sounds like in an interview
Reads are slow, so you add a read replica and send reads to it. What did you just break, and how do you keep it safe?
The interviewer is checking whether you understand that a replica is a new consistency surface, not just more capacity.
Nothing should break — the replica has the same data as the primary, it just shares the read load.
The replica can lag, so users might see slightly old data. I'd send reads that need to be fresh to the primary and the rest to the replica.
I added replication lag as a new failure surface. The specific bug is read-your-writes: a user does a write, reads from a lagging replica, and sees the old value. I'd define a lag budget per read path — fresh-critical reads (their own balance, a fraud check) go to the primary or use a session token; tolerant reads (other users' data, search) go to replicas. I'd also monitor replication lag as a first-class metric, because once lag exceeds a read's budget I need to route around that replica or shed it.
Same lag-budget framing, plus the durability and operational layers. Adding a replica changed my write-path trade: if it's async, I haven't improved durability at all and I may now believe I have redundancy I don't. I'd be explicit that a replica is not a backup — it replicates mistakes — so it doesn't change my backup/PITR requirements. Operationally, replication lag becomes a paging signal, and I'd put a hard staleness bound in the read router so a badly-lagging replica is automatically removed from rotation rather than silently serving 40-second-old balances. The trade I'm making is read scale and availability for a new consistency surface I now have to budget, monitor, and bound.
Named the new consistency surface, gave it a budget AND a monitor AND an automated eviction bound, and separated replication (availability/scaling) from backups (durability) — including that an async replica may add zero durability. That's someone who has been paged for replication lag.
Don't add a read replica to fix write contention
Replicas scale reads. If your bottleneck is write throughput or lock contention on the primary, a replica does nothing — you’ve added a lagging copy and a consistency surface while the actual problem (the write path) is untouched. That’s a partitioning or batching problem (Module 4), not a replication one.
Don't treat a replica as a backup
A replica replicates everything, including your DELETE FROM with a bad WHERE and your rm -rf. It protects against a node failing, not against a write you regret. You need independent, tested, point-in-time backups in addition. GitLab learned this with production down and five broken backup mechanisms.
Don't run synchronous replication across regions
Sync replication waits for the follower to ack. Across regions that’s 50–150 ms added to every write, and a single slow region stalls your write path globally. Cross-region wants async or quorum replication with explicit conflict handling — not sync, unless you genuinely cannot lose a single write and have accepted the latency.
Don't reach for multi-leader just to feel highly available
Multi-leader replication sounds like free HA, but it hands you write conflicts: two leaders accept edits to the same row and now you own conflict resolution forever. Unless geography forces local writes, a single leader with fast failover is simpler and correct. Multi-leader is a tool for a latency/geography problem, not an availability wish.
Exercises
03-log-replication, add a SemiSync ack mode (durable once the leader plus at least one follower hold the entry, falling back to async if no follower is reachable within a bound) and a MaxStaleness parameter on reads that refuses any follower lagging beyond N entries. Add tests showing semi-sync survives a single-follower crash and that a too-stale follower is skipped.1func (c *Cluster) Failover() *Node {2 // leader is presumed dead; promote whoever is most caught up3 newLeader := c.PromoteMostCaughtUp()4 5 // immediately start accepting writes on the new leader6 for _, f := range c.Followers {7 if f != newLeader {8 // ...point each remaining follower at newLeader and keep streaming9 f.ReplicateFrom(newLeader, newLeader.LastIndex())10 }11 }12 return newLeader13}- 01Replication trades among durability, write latency, and availability — the Write-Path Trilemma. Async sacrifices durability on crash; sync sacrifices latency and availability; quorum and semi-sync are the balanced middle. Name your corner.
- 02“Acknowledged” is a policy: async acks before the data is safe on more than one machine. For important data, semi-sync (leader + ≥1 follower) removes the silent data-loss window.
- 03Give every read path a Lag Budget. Zero-budget reads (your own balance, fraud checks) go to the leader or carry a session token; everything else earns the right to scale on replicas. Monitor lag and auto-evict replicas past the bound.
- 04A replica is not a backup. It replicates your mistakes too. Replication buys availability and read scale; durability still requires independent, tested, point-in-time backups (GitLab 2017).
- 05Topology follows where writes enter. Single-leader is the right default; multi-leader is a geography tool that hands you write conflicts; leaderless quorum is maximum availability at the cost of owning conflict resolution and read-repair.