Interviews Vector

Replication exists for three reasons that pull against each other: durability (don’t lose acknowledged writes), read scaling and latency (serve reads from nearby copies), and availability (survive a node dying). You cannot maximize all three. This module is about the trade you’re actually making when you pick a replication mode — and the lag budget that decides which reads are even allowed to touch a replica.

The write path can't have all three

When a write arrives, the leader faces one question: how many copies must hold this before I tell the client “done”? Answer “just me” (asynchronous) and the ack is fast and survives a dead follower — but if the leader dies before replicating, the acknowledged write is gone. Answer “me and every follower” (synchronous) and no acked write is ever lost — but one slow or dead follower stalls or blocks every write. Answer “a majority” (quorum) and you get the balanced middle. Three modes, one trilemma.

Framework · Trilemma

The Write-Path Trilemma

Durability, write latency, and availability under failure. Each replication mode is a different corner. Naming the corner you chose is the whole skill.

Async = latency + availability, sacrifices durability on leader crash. Sync = durability, sacrifices latency and availability (a slow follower blocks writes). Quorum = durability + availability, pays the latency of the majority's slowest member.

The reason this matters in an interview and at 3am: when someone says “we have a replica, so we’re safe,” ask which corner. A bank running async replication for write latency has chosen to lose acknowledged transactions on a leader crash, whether or not anyone decided that on purpose.

The clearest way to feel it: an asynchronous ack is a promise the system can’t yet keep. Here’s the policy knob from the reference implementation — the only difference between “safe” and “fast” is how many holders you count before acking.

replication.go — 'durable' is a policy, not a fact

1func (c *Cluster) Durable(index int, mode AckMode) bool {
2  switch mode {
3  case Async:
4    return c.Leader.LastIndex() >= index // the leader alone is "enough"
5  case Quorum:
6    holders := 0
7    if c.Leader.LastIndex() >= index {
8      holders++
9    }
10    for _, f := range c.Followers {
11      if f.LastIndex() >= index {
12        holders++
13      }
14    }
15    return holders > (len(c.Followers)+1)/2 // a majority must hold it
16  }
17  return false
18}

Runnable reference implementation

courses/distributed-systems/reference-impl/03-log-replication/

A single-leader log with controllable follower lag. The demo shows an async-acked write lost when the leader crashes before replicating, the same write surviving under quorum, and a stale replica healed by read-repair. go run . · tests with go test ./....

The Lag Budget: which reads may touch a replica

The double-transfer bug isn’t fixed by “turn off replicas.” It’s fixed by deciding, per read path, how stale a value is allowed to be — and routing accordingly. That decision is the Lag Budget: every read in your system gets an explicit maximum tolerable staleness, and only reads whose budget exceeds a replica’s current lag are allowed to use it.

Framework · Worksheet

The Lag Budget

Assign every read path a staleness budget. Reads with a zero budget go to the leader (or carry a session token). Everything else earns the right to scale on replicas.

Read path	Tolerable lag	Where it's allowed to read
Your own balance right after a transfer	0 — must be exact	Leader, or a replica caught up to your write (session token).
Someone else's profile / post	Seconds	Any healthy replica. The 90% case for read scaling.
Search index / 'people you may know'	Minutes	A dedicated, possibly batch-updated replica or index.
Analytics / reporting dashboard	Minutes to hours	A read replica explicitly isolated from serving traffic.
Fraud check before approving a charge	0 — correctness-critical	Leader only, or a linearizable read. Never a lagging replica.

The budget is the artifact you put in the design doc. It turns “reads scale on replicas” from a hope into a routing rule, and it makes the dangerous reads (the two with a zero budget) visible before they ship.

Mental model

Topologies are just 'where do writes enter?'

Single-leader: all writes go to one node, replicated out. Simple, no write conflicts, but the leader is a write bottleneck and a failover event. The default, and right far more often than people admit.

Multi-leader: writes accepted in multiple regions, replicated between them. Buys local write latency and survives a region outage — at the cost of write conflicts (two regions edit the same row) you must resolve with the tools from Module 1 (vector clocks, CRDTs). Don’t reach for it for “HA”; reach for it when geography forces local writes.

Leaderless (quorum): any replica takes writes; reads and writes hit a quorum and reconcile (Dynamo, Cassandra). Maximum availability, but you own conflict resolution and read-repair. This is where the read-repair in the reference impl lives.

Use it when: Deciding between a single leader, multiple leaders, or no leader — usually forced by geography and write-conflict tolerance, not preference.

Dimension	Sync	Async	Semi-sync	Leaderless quorum
Acked write lost on leader crash?	Never	Highest — wait for all	No — one dead follower blocks writes	Zero on synced replicas
Write latency	Yes — the default data-loss bug	Lowest — ack on leader	Yes	Unbounded by lag
Writes survive a dead follower?	Rare — needs ≥1 follower acked	Moderate — wait for one	Yes (falls back to async)	Low on the synced follower
Read staleness	No if W+R>N quorum	Moderate — wait for a quorum	Yes — any node takes writes	Bounded; healed by read-repair
Operational complexity	—	—	—	—
Choose when	Tiny clusters where losing an acked write is unthinkable and you can tolerate a follower blocking writes (rare; e.g. a config store).	Staleness and a small data-loss window on crash are acceptable, and write latency is king. Honest default for non-critical data.	You want most of sync's durability without one slow follower halting writes. A strong default for important relational data.	You need multi-region writes and maximum availability, and you've budgeted for conflict resolution and read-repair.

Verdict

For important data on a relational store, semi-synchronous (ack after the leader + at least one follower) is the pragmatic sweet spot — it removes the async-data-loss bug without letting a single slow follower take down writes. Reserve full sync for small, correctness-critical clusters and leaderless quorum for when geography genuinely demands it. Plain async is fine — as long as someone decided the data-loss window is acceptable, rather than inheriting it.

How this fails in production · GitLab

The database outage of 31 January 2017

The setup

GitLab.com ran PostgreSQL with a primary and a streaming replica. Late at night, replication lag spiked under load (amplified by spam cleanup), and an engineer worked to get the replica re-syncing — the normal operational response to a lagging follower.

What happened

While troubleshooting the broken replication, the engineer ran rm -rf on what they believed was the empty, to-be-rebuilt replica’s data directory — but it was the primary. ~300 GB of production data gone in seconds. Then the real horror unfolded: of five configured backup/replication mechanisms, none were working. Replication was lagging, the backups were silently failing, the snapshots weren’t taken. They recovered from a six-hour-old staging copy a team member happened to have, losing ~6 hours of data.

The moment it went wrong

The fatal assumption sat upstream of the rm -rf: a replica had been quietly treated as a backup. A replica protects against a node failing; it does nothing against a write you didn’t mean to make — including a destructive command — because it faithfully replicates that too. The replication lag that started the night was a symptom that the durability story had been broken for a long time and nobody was watching the lag.

The transferable lesson

A replica is not a backup. Replication gives you availability and read scaling; it propagates your mistakes at the speed of the network. You need independent, tested, point-in-time backups and active monitoring of replication lag — because lag isn’t just a staleness problem, it’s the early warning that your durability guarantees have quietly stopped being true.

GitLab — Postmortem of the database outage of January 31 ↗

What this sounds like in an interview

Calibration ladder · L3 → L6

Reads are slow, so you add a read replica and send reads to it. What did you just break, and how do you keep it safe?

The interviewer is checking whether you understand that a replica is a new consistency surface, not just more capacity.

L3 · Junior

Nothing should break — the replica has the same data as the primary, it just shares the read load.

Missed: Believes a replica is a perfect copy. This is the mental model that ships the double-transfer bug.

L4 · Mid

The replica can lag, so users might see slightly old data. I'd send reads that need to be fresh to the primary and the rest to the replica.

Missed: Knows lag exists and routes around it ad hoc, but has no systematic budget and no monitoring — so the first bad-lag incident is a surprise.

L5 · Senior

I added replication lag as a new failure surface. The specific bug is read-your-writes: a user does a write, reads from a lagging replica, and sees the old value. I'd define a lag budget per read path — fresh-critical reads (their own balance, a fraud check) go to the primary or use a session token; tolerant reads (other users' data, search) go to replicas. I'd also monitor replication lag as a first-class metric, because once lag exceeds a read's budget I need to route around that replica or shed it.

Missed: Strong. Missing the durability and backup distinction (a replica isn't redundancy for your data) and the automated staleness-bound-eviction that production needs.

L6 · Staff

Same lag-budget framing, plus the durability and operational layers. Adding a replica changed my write-path trade: if it's async, I haven't improved durability at all and I may now believe I have redundancy I don't. I'd be explicit that a replica is not a backup — it replicates mistakes — so it doesn't change my backup/PITR requirements. Operationally, replication lag becomes a paging signal, and I'd put a hard staleness bound in the read router so a badly-lagging replica is automatically removed from rotation rather than silently serving 40-second-old balances. The trade I'm making is read scale and availability for a new consistency surface I now have to budget, monitor, and bound.

What scored L6

Named the new consistency surface, gave it a budget AND a monitor AND an automated eviction bound, and separated replication (availability/scaling) from backups (durability) — including that an async replica may add zero durability. That's someone who has been paged for replication lag.

When NOT to use this

Don't add a read replica to fix write contention

Replicas scale reads. If your bottleneck is write throughput or lock contention on the primary, a replica does nothing — you’ve added a lagging copy and a consistency surface while the actual problem (the write path) is untouched. That’s a partitioning or batching problem (Module 4), not a replication one.

Don't treat a replica as a backup

A replica replicates everything, including your DELETE FROM with a bad WHERE and your rm -rf. It protects against a node failing, not against a write you regret. You need independent, tested, point-in-time backups in addition. GitLab learned this with production down and five broken backup mechanisms.

Don't run synchronous replication across regions

Sync replication waits for the follower to ack. Across regions that’s 50–150 ms added to every write, and a single slow region stalls your write path globally. Cross-region wants async or quorum replication with explicit conflict handling — not sync, unless you genuinely cannot lose a single write and have accepted the latency.

Don't reach for multi-leader just to feel highly available

Multi-leader replication sounds like free HA, but it hands you write conflicts: two leaders accept edits to the same row and now you own conflict resolution forever. Unless geography forces local writes, a single leader with fast failover is simpler and correct. Multi-leader is a tool for a latency/geography problem, not an availability wish.

Exercises

Exercise · Design scenario

Design replication for a multi-region e-commerce inventory service. Users in the US, EU, and APAC read product availability constantly and occasionally place orders that decrement stock. Stock must never go negative (you can’t sell what you don’t have). Specify: topology, replication mode, the lag budget for “show availability” vs “decrement on purchase,” and what happens during a cross-region partition.

Exercise · Implementation task

In 03-log-replication, add a SemiSync ack mode (durable once the leader plus at least one follower hold the entry, falling back to async if no follower is reachable within a bound) and a MaxStaleness parameter on reads that refuses any follower lagging beyond N entries. Add tests showing semi-sync survives a single-follower crash and that a too-stale follower is skipped.

Exercise · Find the race

This failover picks the most caught-up follower and promotes it. In a real cluster it can fork the log — produce a history where two nodes disagree about what entry 7 was. Find the window.

failover.go — shipped, subtly broken

1func (c *Cluster) Failover() *Node {
2  // leader is presumed dead; promote whoever is most caught up
3  newLeader := c.PromoteMostCaughtUp()
4 
5  // immediately start accepting writes on the new leader
6  for _, f := range c.Followers {
7    if f != newLeader {
8      // ...point each remaining follower at newLeader and keep streaming
9      f.ReplicateFrom(newLeader, newLeader.LastIndex())
10    }
11  }
12  return newLeader
13}

Walk away with this

01Replication trades among durability, write latency, and availability — the Write-Path Trilemma. Async sacrifices durability on crash; sync sacrifices latency and availability; quorum and semi-sync are the balanced middle. Name your corner.
02“Acknowledged” is a policy: async acks before the data is safe on more than one machine. For important data, semi-sync (leader + ≥1 follower) removes the silent data-loss window.
03Give every read path a Lag Budget. Zero-budget reads (your own balance, fraud checks) go to the leader or carry a session token; everything else earns the right to scale on replicas. Monitor lag and auto-evict replicas past the bound.
04A replica is not a backup. It replicates your mistakes too. Replication buys availability and read scale; durability still requires independent, tested, point-in-time backups (GitLab 2017).
05Topology follows where writes enter. Single-leader is the right default; multi-leader is a geography tool that hands you write conflicts; leaderless quorum is maximum availability at the cost of owning conflict resolution and read-repair.