Interviews Vector

11:42 PM · the support ticket that doesn't make sense

A customer emails: “Your app sent my order to my old address. I changed it an hour before I checked out. I watched it save.” You pull the audit log. The address change is there, timestamped, committed. So is the checkout, a full minute later, with the old address. Both writes succeeded. Neither errored. The newer, correct value is simply gone.

The cause isn’t a dropped write or a cache bug. Two services updated the same customer row at almost the same moment — the mobile app, and a background “enrich profile” job — and your conflict resolution kept the one with the higher timestamp. The background job’s server clock happened to be 1.2 seconds ahead of the app server’s. The older write carried the larger number. Last-writer-wins did exactly what it says, and last-writer-wins was a lie the whole time.

Every bug in your first distributed system traces back to an assumption that cost nothing on one machine and became a lie on two. You didn’t choose these assumptions — single-node programming gave them to you for free, and they were correct for years. This module is about the three that break first, why they break, and the one tool that repairs all three: logical time.

DDIA spends three chapters circling this; we’re going to compress the load-bearing parts and then go one level deeper than the book on the part everyone gets wrong — the difference between a node that has crashed and a node that is merely slow, which turns out to be the same problem as “which write happened first,” wearing a different hat.

Three assumptions that were free

On a single machine, three things are true and you never think about them. There is one clock, and it only moves forward. A function call either returns or throws — there is no third outcome. And when something stops responding, it has stopped: there is no meaningful distinction between “crashed” and “slow,” because your own process being slow doesn’t change while you wait for it.

Add one more machine and all three become false in ways that produce silent data loss, phantom orderings, and outages that survive the thing that triggered them. Here is the full ledger, because naming the assumption is half of catching it in a design review.

Framework · Taxonomy

The Assumption Ledger

Every single-node freebie, the distributed failure it becomes, and the repair. Cite a row by name in a design review and you've caught the bug before it ships.

The free assumption	How it breaks on two machines	The repair
There is one clock	Wall clocks on two servers disagree by milliseconds to seconds. Comparing timestamps across machines orders events wrong.	Logical clocks (Lamport / vector) for ordering; physical time only for human display and coarse TTLs.
Time only moves forward	NTP steps the clock backward; leap seconds repeat a second. Durations measured with the wall clock go negative.	Measure elapsed time with a monotonic clock, never the wall clock.
A call returns or throws	A network call has a third outcome: no answer yet. The request may have succeeded, failed, or still be in flight.	Timeouts + idempotency + retries (Modules 6–7). Assume every call may have already happened.
Crashed is distinguishable from slow	Across an async network you cannot tell a dead node from a slow one. Any failure detector that says you can is guessing.	Failure detectors with explicit accuracy/completeness trade-offs; fencing for safety (Module 8).
Writes apply in the order I made them	Concurrent writes have no inherent order. 'Last' is undefined; last-writer-wins silently discards data.	Detect concurrency (vector clocks), then resolve deliberately — merge, or surface a conflict.
Everyone sees the same state	Replicas diverge for as long as replication lag lasts. Two readers see two truths.	Pick a consistency model per operation (Module 2); design for the lag, don't pretend it's zero.
The network is reliable enough to ignore	Partitions happen — partial ones, where A reaches B but not C. The system splits without anyone going down.	Quorums and partition-aware design (Modules 3, 8). Treat the partition as a normal Tuesday.

Lie #1: there is one clock

The clock on a server is two clocks wearing a trench coat. The wall clock (clock_gettime(CLOCK_REALTIME), Date.now(), time.Now()) tells you the calendar time, and it is constantly being corrected by NTP to track an external reference. Those corrections jump — forward when the clock was slow, and backward when it was fast. The monotonic clock (CLOCK_MONOTONIC, performance.now()) only ever increases, but its zero is arbitrary and it’s meaningless across machines.

Two failures fall out of this immediately. First: if you measure “how long did this take” with the wall clock and NTP steps backward mid-measurement, your duration is negative — a well-known source of timeouts that fire instantly and rate limiters that grant infinite budget. Always measure elapsed time with the monotonic clock. Second, and the one that ate the customer’s address: comparing wall-clock timestamps across machines to order events is comparing two independently-wrong numbers. A 1-second skew is ordinary. Under load it’s worse, because NTP corrections get deprioritized exactly when the machine is busy.

Mental model

Wall-clock time answers the wrong question

“What time did this happen?” and “Did this happen before that?” feel like the same question. On one machine they are. Across machines they come apart: the first needs a synchronized physical clock you don’t have, while the second only needs to know whether information could have flowed from one event to the other.

That second relation is happens-before (Lamport, 1978). Event A happens-before B if A is on the same node and earlier, or A is a message-send and B is its receive — or any chain of those. If no such chain exists in either direction, the events are concurrent. Concurrent doesn’t mean “at the same instant.” It means causally independent: neither could have known about the other.

Use it when: Any time you're about to compare timestamps from two machines to decide what happened first. Stop, and reach for happens-before instead.

That distinction is abstract until you watch it form. Below, three replicas of a shopping cart take writes and gossip them around. Step through it, then use the probe to ask whether any two events are ordered — or concurrent. Watch what happens to A’s write and C’s write.

Vector-clock stepper · 3 replicas, one cart

step 1/7

a1 · Replica A (US)

write cart += milk

[1,0,0]

[0,0,0]

Causality probe

Step forward until both events are revealed to compare them.

The repair: logical clocks

A Lamport clock is a single counter per process: increment on every event, and on receiving a message set your counter to max(local, received) + 1. It guarantees that if A happens-before B then stamp(A) < stamp(B). That’s enough to build a consistent total order (break ties by node id) — and it’s all you need for, say, deterministic tie-breaking. But it has a fatal blind spot for conflict detection: a smaller stamp does not prove causality. From two Lamport stamps alone you can never tell “A caused B” from “A and B were concurrent.” That information was never captured.

A vector clock captures it. Keep one counter per replica. On a local event, bump your own. On receive, take the element-wise max, then bump your own. Now the comparison is total information: A happens-before B iff every component of A is ≤ B and at least one is strictly less; if each clock leads in some component, they’re concurrent. That “each leads somewhere” case is the one LWW cannot see.

vector-clock.ts — the comparison that LWW can't make

1compare(other: VectorClock): Ordering {
2  let lessSomewhere = false
3  let greaterSomewhere = false
4  for (const id of VectorClock.union(this, other)) {
5    const a = this.get(id)
6    const b = other.get(id)
7    if (a < b) lessSomewhere = true
8    else if (a > b) greaterSomewhere = true
9  }
10  // leads in both directions => causally independent
11  if (lessSomewhere && greaterSomewhere) return "concurrent"
12  if (lessSomewhere) return "before"
13  if (greaterSomewhere) return "after"
14  return "equal"
15}

The point of returning concurrent as a first-class answer is that your application now gets to decide. For a shopping cart, the right resolution is to merge (union the items) — Amazon’s Dynamo paper made exactly this choice. For a username, it might be to surface the conflict to a human. The bug in the opening ticket was never “we picked the wrong winner.” It was “we treated a concurrent pair as if one had to win.”

Runnable reference implementation

TypeScript

courses/distributed-systems/reference-impl/01-logical-clocks/

A map-keyed vector clock (survives replicas joining and leaving), a scalar Lamport clock for contrast, unit tests, and a runnable demo that reproduces the opening ticket: two concurrent cart writes resolved two ways, where LWW prints >>> SILENTLY DROPPED: eggs <<< and the vector-clock path keeps both. npm install && npm run demo.

Dimension	Physical (NTP)	Lamport	Vector clock	Hybrid logical (HLC)
Orders causally-related events	Only if clocks are synced — i.e. never exactly	Yes (consistent total order)	Yes	Yes
Detects concurrency	No	No — blind spot	Yes — the whole point	Approximate (within clock bound)
Space per event	8 bytes	8 bytes	O(replicas) — grows, must be pruned	~16 bytes
Needs synced wall clocks	Critically	No	No	Yes, but tolerates bounded skew
Operational cost	NTP everywhere; still wrong	Trivial	Pruning + actor-id management	Moderate; needs a skew bound
Choose when	Human-readable timestamps and coarse TTLs only. Never for ordering or conflict resolution.	You need a cheap consistent total order and will never have to detect concurrent writes.	You must distinguish concurrent from causal writes — multi-master replication, collaborative editing, Dynamo-style stores.	You want near-physical timestamps that are also causally sound, and can bound clock skew (Spanner, CockroachDB, modern Cassandra).

Verdict

For ordering, never trust physical time across machines. For detecting concurrent writes, you need vector clocks or their pruned descendants (dotted version vectors). If you can bound clock skew, HLC gives you 90% of vector clocks’ safety with physical-looking timestamps and constant size — which is why it’s the modern default in new databases.

Lies #2 and #3: instant calls, and crashed vs. slow

These two are the same lie. On one machine a call returns or throws. Across a network there is a third outcome — silence — and you cannot tell which of three things it means: the request never arrived, it arrived and the work was done but the reply was lost, or it’s all still happening and you’re just early. The only tool you have is a timeout, and a timeout is a guess.

This is not an engineering shortcoming you can buy your way out of. The FLP result (1985) proved that in a fully asynchronous network — one with no bound on message delay — there is no algorithm that can reliably tell a crashed node from a slow one. Real networks aren’t perfectly asynchronous, so we get to cheat with timeouts. But every timeout is a bet on the maximum normal delay, and every such bet is sometimes wrong. A failure detector is just the formal name for that bet, and the useful insight is that you get to choose how it’s wrong.

Framework · Trilemma

The Failure-Detector Trilemma

A timeout is a dial with three settings and you only get two. Where you set it is a design decision, not a default.

Tighten the timeout → fast + complete, but you wrongly declare slow nodes dead (low accuracy). Loosen it → accurate, but slow to notice real deaths. You cannot have all three at once.

A 50 ms timeout catches a real crash almost instantly and never misses one — but it also declares a node dead every time a GC pause or a network hiccup stretches a healthy response past 50 ms. Those false positives trigger failovers, re-replication, and retries against a node that was fine. A 30-second timeout is accurate — if it fires, the node is genuinely gone — but for 30 seconds your system routes traffic into a black hole. The accident most teams make is choosing a side by accident, inheriting a library default that silently optimizes for the wrong corner under load.

How this fails in production · AWS

DynamoDB's metadata service, 20 September 2015

The setup

DynamoDB storage servers periodically fetch their partition assignments (“membership”) from an internal metadata service. Each fetch had a timeout. Recently, a new feature (Global Secondary Indexes) had grown the size of some membership responses, so a healthy fetch now took longer than it used to — closer to that timeout.

What happened

A brief network blip slowed metadata responses just enough that a wave of storage servers didn’t get their membership data within the timeout. Treating the slow response as a failure, they retried. The retries piled onto the metadata service, slowing it further, which pushed more servers past the timeout, which produced more retries. The failure sustained itself long after the original blip was gone.

The moment it went wrong

The timeout was set for the old, smaller response size. Membership had quietly grown until a normal, healthy response no longer fit inside the budget — so the failure detector began reading “slow” as “dead” under perfectly ordinary latency. The system had been sitting one traffic blip away from a retry storm, and nobody had re-checked the bet.

The transferable lesson

A timeout is a claim about the maximum normal latency, and that claim rots as the system grows. Two repairs, both from this module: size failure detectors for today’s latency distribution (or make them adaptive, like phi-accrual), and ensure retries can’t amplify a slowdown into a self-sustaining storm — the metastable-failure pattern we build defenses for in Module 7.

AWS — Summary of the Amazon DynamoDB Service Disruption (us-east-1) ↗

What this sounds like in an interview

Calibration ladder · L3 → L6

Two app servers each write a log line for the same user, a few milliseconds apart. How do you decide which event happened first?

The interviewer is probing whether you understand that 'first' is not a property physical timestamps can give you across machines.

L3 · Junior

I'd compare the timestamps on the two log lines and take the earlier one.

Missed: Believes physical timestamps are comparable across machines. This is the exact bug in the opening ticket.

L4 · Mid

Timestamps across servers can be skewed, so I'd sync them with NTP and maybe add a sequence number to reduce ties.

Missed: Knows clocks drift but thinks NTP fixes it. NTP reduces skew; it never eliminates it, and it can move time backward. Sequence numbers across independent servers don't compose into an order.

L5 · Senior

Across machines, 'first' isn't well-defined by physical time — clocks drift and NTP can step backward. If the two events are causally related (one triggered the other), I'd capture that with logical clocks. If they're genuinely concurrent, there is no 'first', and the honest answer is to detect the concurrency and resolve it deliberately rather than letting a timestamp pick a silent winner.

Missed: Strong and correct. The one thing missing is the meta-move: asking whether the ordering is needed at all before paying for it, and naming the concrete designs (HLC, commit-wait) for the case where real global ordering is genuinely required.

L6 · Staff

I'd first ask whether ordering these two events is even load-bearing for correctness, because the cheapest fix is often to not need the order. If it is load-bearing, I separate two cases. For causal ordering, logical clocks — Lamport if I only need a total order for tie-breaking, vector clocks if I need to distinguish concurrent from causal writes. For conflict resolution under concurrency, vector clocks plus an application-level merge, or a CRDT if I can model the data that way. If the business genuinely needs a single global order with real timestamps, that's a TrueTime/HLC-style design with an explicit clock-skew bound and commit-wait — and I'd flag the operational cost of that up front rather than discovering it later.

What scored L6

Treated 'order these events' as a decision with a cost, not a given. Asked whether the ordering is load-bearing before solving it; separated causal ordering from concurrency-detection from global-real-time ordering; named the specific mechanism for each and flagged the operational price of the expensive one. That's someone who has paid for this in production.

When NOT to use this

Don't reach for vector clocks when there's a single writer

Vector clocks earn their keep only when writes can originate from multiple replicas concurrently. If every write to a key goes through one leader (single-leader replication, or a partition with one owner), a plain version number or sequence is simpler, smaller, and sufficient. Vector clocks for a single-writer key are cost with no benefit.

Don't let vector clocks grow unbounded

A vector clock keyed by client has one entry per client that ever wrote — millions of mobile devices become a multi-megabyte clock attached to every value. Real systems key by a small set of server-side replicas, or use dotted version vectors with explicit pruning. If your “actors” are unbounded and ephemeral, vector clocks are the wrong tool.

Don't use a wall-clock timestamp as a fence or a lock token

It is tempting to use “most recent timestamp wins” to arbitrate who holds a lock or whose write is authoritative. Because clocks skew, two nodes can both believe they’re newest. Safety that depends on physical time is safety that fails during clock skew — use monotonic fencing tokens instead (Module 8).

Don't tighten a failure-detector timeout to feel fast

A snappy timeout looks great on a dashboard and quietly optimizes for the worst corner of the trilemma under load — declaring healthy-but-slow nodes dead exactly when the system is stressed, which is when false failovers hurt most. Prefer adaptive detection, and make the timeout a measured decision against your real latency distribution, not a number that felt responsive.

Exercises

Exercise · Design scenario

You’re designing the sync layer for a note-taking app. A user edits the same note on their phone (offline on the subway) and their laptop (online) within the same ten minutes. Both eventually reach the server. Design how the server decides what the note’s content becomes.

Address: how you order or detect concurrency between the two edits; what you do when they’re concurrent; and what you store alongside each note version to make that decision possible. Then state the one question you’d ask the product team that most changes the design.

Exercise · Implementation task

In the reference implementation (01-logical-clocks), add two things to VectorClock: a dominates(other) helper (true iff this happens-after other), and a module-level mergeSiblings(versions) that takes an array of concurrent versions and returns a single reconciled value (union the contents) with a clock equal to the element-wise max of all inputs. Add tests proving that merging a causal pair returns the dominant version unchanged, and merging a concurrent pair keeps both items.

Exercise · Find the race

This receive shipped in a multi-replica store. Reads were occasionally reporting two genuinely different concurrent writes as equal, so the conflict-resolution path never ran and one write was dropped. The clock arithmetic looks right. Find the bug.

vector-clock.ts — shipped, subtly broken

1// Receive a message carrying `msg`, then record that WE observed it.
2receive(msg: VectorClock, selfId: string): VectorClock {
3  // take the element-wise max so we incorporate the sender's history
4  const merged = this.merge(msg)
5  // ...and return it.
6  return merged
7  // (we already merged in the sender's counters — what else is there?)
8}

Walk away with this

01Three single-node freebies break first on two machines: one clock, calls that return-or-throw, and crashed-vs-slow being distinguishable. The Assumption Ledger names each one and its repair.
02Across machines, never use physical timestamps to order events or resolve conflicts. Use happens-before: Lamport clocks for a cheap total order, vector clocks when you must tell concurrent writes apart from causal ones.
03Last-writer-wins is concurrency detection that throws the answer away. It silently drops data whenever two writes are concurrent — which is most of the time in a multi-writer system.
04You cannot distinguish a crashed node from a slow one; a timeout is a bet. The Failure-Detector Trilemma says you choose two of speed, accuracy, completeness — so choose on purpose, and prefer adaptive detectors over a constant that rots as you grow.
05Physical time is usable for ordering only if you treat its error as a first-class value (Spanner’s TrueTime) — never if you round the uncertainty to zero (the opening ticket).