The Single-Node Lie
Time, order, and failure when there's more than one machine
Every bug in your first distributed system traces back to an assumption that cost nothing on one machine and became a lie on two. You didn’t choose these assumptions — single-node programming gave them to you for free, and they were correct for years. This module is about the three that break first, why they break, and the one tool that repairs all three: logical time.
DDIA spends three chapters circling this; we’re going to compress the load-bearing parts and then go one level deeper than the book on the part everyone gets wrong — the difference between a node that has crashed and a node that is merely slow, which turns out to be the same problem as “which write happened first,” wearing a different hat.
Three assumptions that were free
On a single machine, three things are true and you never think about them. There is one clock, and it only moves forward. A function call either returns or throws — there is no third outcome. And when something stops responding, it has stopped: there is no meaningful distinction between “crashed” and “slow,” because your own process being slow doesn’t change while you wait for it.
Add one more machine and all three become false in ways that produce silent data loss, phantom orderings, and outages that survive the thing that triggered them. Here is the full ledger, because naming the assumption is half of catching it in a design review.
The Assumption Ledger
Every single-node freebie, the distributed failure it becomes, and the repair. Cite a row by name in a design review and you've caught the bug before it ships.
| The free assumption | How it breaks on two machines | The repair |
|---|---|---|
| There is one clock | Wall clocks on two servers disagree by milliseconds to seconds. Comparing timestamps across machines orders events wrong. | Logical clocks (Lamport / vector) for ordering; physical time only for human display and coarse TTLs. |
| Time only moves forward | NTP steps the clock backward; leap seconds repeat a second. Durations measured with the wall clock go negative. | Measure elapsed time with a monotonic clock, never the wall clock. |
| A call returns or throws | A network call has a third outcome: no answer yet. The request may have succeeded, failed, or still be in flight. | Timeouts + idempotency + retries (Modules 6–7). Assume every call may have already happened. |
| Crashed is distinguishable from slow | Across an async network you cannot tell a dead node from a slow one. Any failure detector that says you can is guessing. | Failure detectors with explicit accuracy/completeness trade-offs; fencing for safety (Module 8). |
| Writes apply in the order I made them | Concurrent writes have no inherent order. 'Last' is undefined; last-writer-wins silently discards data. | Detect concurrency (vector clocks), then resolve deliberately — merge, or surface a conflict. |
| Everyone sees the same state | Replicas diverge for as long as replication lag lasts. Two readers see two truths. | Pick a consistency model per operation (Module 2); design for the lag, don't pretend it's zero. |
| The network is reliable enough to ignore | Partitions happen — partial ones, where A reaches B but not C. The system splits without anyone going down. | Quorums and partition-aware design (Modules 3, 8). Treat the partition as a normal Tuesday. |
Lie #1: there is one clock
The clock on a server is two clocks wearing a trench coat. The wall clock (clock_gettime(CLOCK_REALTIME), Date.now(), time.Now()) tells you the calendar time, and it is constantly being corrected by NTP to track an external reference. Those corrections jump — forward when the clock was slow, and backward when it was fast. The monotonic clock (CLOCK_MONOTONIC, performance.now()) only ever increases, but its zero is arbitrary and it’s meaningless across machines.
Two failures fall out of this immediately. First: if you measure “how long did this take” with the wall clock and NTP steps backward mid-measurement, your duration is negative — a well-known source of timeouts that fire instantly and rate limiters that grant infinite budget. Always measure elapsed time with the monotonic clock. Second, and the one that ate the customer’s address: comparing wall-clock timestamps across machines to order events is comparing two independently-wrong numbers. A 1-second skew is ordinary. Under load it’s worse, because NTP corrections get deprioritized exactly when the machine is busy.
Wall-clock time answers the wrong question
“What time did this happen?” and “Did this happen before that?” feel like the same question. On one machine they are. Across machines they come apart: the first needs a synchronized physical clock you don’t have, while the second only needs to know whether information could have flowed from one event to the other.
That second relation is happens-before (Lamport, 1978). Event A happens-before B if A is on the same node and earlier, or A is a message-send and B is its receive — or any chain of those. If no such chain exists in either direction, the events are concurrent. Concurrent doesn’t mean “at the same instant.” It means causally independent: neither could have known about the other.
That distinction is abstract until you watch it form. Below, three replicas of a shopping cart take writes and gossip them around. Step through it, then use the probe to ask whether any two events are ordered — or concurrent. Watch what happens to A’s write and C’s write.
The repair: logical clocks
A Lamport clock is a single counter per process: increment on every event, and on receiving a message set your counter to max(local, received) + 1. It guarantees that if A happens-before B then stamp(A) < stamp(B). That’s enough to build a consistent total order (break ties by node id) — and it’s all you need for, say, deterministic tie-breaking. But it has a fatal blind spot for conflict detection: a smaller stamp does not prove causality. From two Lamport stamps alone you can never tell “A caused B” from “A and B were concurrent.” That information was never captured.
A vector clock captures it. Keep one counter per replica. On a local event, bump your own. On receive, take the element-wise max, then bump your own. Now the comparison is total information: A happens-before B iff every component of A is ≤ B and at least one is strictly less; if each clock leads in some component, they’re concurrent. That “each leads somewhere” case is the one LWW cannot see.
1compare(other: VectorClock): Ordering {2 let lessSomewhere = false3 let greaterSomewhere = false4 for (const id of VectorClock.union(this, other)) {5 const a = this.get(id)6 const b = other.get(id)7 if (a < b) lessSomewhere = true8 else if (a > b) greaterSomewhere = true9 }10 // leads in both directions => causally independent11 if (lessSomewhere && greaterSomewhere) return "concurrent"12 if (lessSomewhere) return "before"13 if (greaterSomewhere) return "after"14 return "equal"15}The point of returning concurrent as a first-class answer is that your application now gets to decide. For a shopping cart, the right resolution is to merge (union the items) — Amazon’s Dynamo paper made exactly this choice. For a username, it might be to surface the conflict to a human. The bug in the opening ticket was never “we picked the wrong winner.” It was “we treated a concurrent pair as if one had to win.”
courses/distributed-systems/reference-impl/01-logical-clocks/A map-keyed vector clock (survives replicas joining and leaving), a scalar Lamport clock for contrast, unit tests, and a runnable demo that reproduces the opening ticket: two concurrent cart writes resolved two ways, where LWW prints >>> SILENTLY DROPPED: eggs <<< and the vector-clock path keeps both. npm install && npm run demo.
| Dimension | Physical (NTP) | Lamport | Vector clock | Hybrid logical (HLC) |
|---|---|---|---|---|
| Orders causally-related events | Only if clocks are synced — i.e. never exactly | Yes (consistent total order) | Yes | Yes |
| Detects concurrency | No | No — blind spot | Yes — the whole point | Approximate (within clock bound) |
| Space per event | 8 bytes | 8 bytes | O(replicas) — grows, must be pruned | ~16 bytes |
| Needs synced wall clocks | Critically | No | No | Yes, but tolerates bounded skew |
| Operational cost | NTP everywhere; still wrong | Trivial | Pruning + actor-id management | Moderate; needs a skew bound |
| Choose when | Human-readable timestamps and coarse TTLs only. Never for ordering or conflict resolution. | You need a cheap consistent total order and will never have to detect concurrent writes. | You must distinguish concurrent from causal writes — multi-master replication, collaborative editing, Dynamo-style stores. | You want near-physical timestamps that are also causally sound, and can bound clock skew (Spanner, CockroachDB, modern Cassandra). |
For ordering, never trust physical time across machines. For detecting concurrent writes, you need vector clocks or their pruned descendants (dotted version vectors). If you can bound clock skew, HLC gives you 90% of vector clocks’ safety with physical-looking timestamps and constant size — which is why it’s the modern default in new databases.
Lies #2 and #3: instant calls, and crashed vs. slow
These two are the same lie. On one machine a call returns or throws. Across a network there is a third outcome — silence — and you cannot tell which of three things it means: the request never arrived, it arrived and the work was done but the reply was lost, or it’s all still happening and you’re just early. The only tool you have is a timeout, and a timeout is a guess.
This is not an engineering shortcoming you can buy your way out of. The FLP result (1985) proved that in a fully asynchronous network — one with no bound on message delay — there is no algorithm that can reliably tell a crashed node from a slow one. Real networks aren’t perfectly asynchronous, so we get to cheat with timeouts. But every timeout is a bet on the maximum normal delay, and every such bet is sometimes wrong. A failure detector is just the formal name for that bet, and the useful insight is that you get to choose how it’s wrong.
The Failure-Detector Trilemma
A timeout is a dial with three settings and you only get two. Where you set it is a design decision, not a default.
A 50 ms timeout catches a real crash almost instantly and never misses one — but it also declares a node dead every time a GC pause or a network hiccup stretches a healthy response past 50 ms. Those false positives trigger failovers, re-replication, and retries against a node that was fine. A 30-second timeout is accurate — if it fires, the node is genuinely gone — but for 30 seconds your system routes traffic into a black hole. The accident most teams make is choosing a side by accident, inheriting a library default that silently optimizes for the wrong corner under load.
DynamoDB's metadata service, 20 September 2015
A timeout is a claim about the maximum normal latency, and that claim rots as the system grows. Two repairs, both from this module: size failure detectors for today’s latency distribution (or make them adaptive, like phi-accrual), and ensure retries can’t amplify a slowdown into a self-sustaining storm — the metastable-failure pattern we build defenses for in Module 7.
What this sounds like in an interview
Two app servers each write a log line for the same user, a few milliseconds apart. How do you decide which event happened first?
The interviewer is probing whether you understand that 'first' is not a property physical timestamps can give you across machines.
I'd compare the timestamps on the two log lines and take the earlier one.
Timestamps across servers can be skewed, so I'd sync them with NTP and maybe add a sequence number to reduce ties.
Across machines, 'first' isn't well-defined by physical time — clocks drift and NTP can step backward. If the two events are causally related (one triggered the other), I'd capture that with logical clocks. If they're genuinely concurrent, there is no 'first', and the honest answer is to detect the concurrency and resolve it deliberately rather than letting a timestamp pick a silent winner.
I'd first ask whether ordering these two events is even load-bearing for correctness, because the cheapest fix is often to not need the order. If it is load-bearing, I separate two cases. For causal ordering, logical clocks — Lamport if I only need a total order for tie-breaking, vector clocks if I need to distinguish concurrent from causal writes. For conflict resolution under concurrency, vector clocks plus an application-level merge, or a CRDT if I can model the data that way. If the business genuinely needs a single global order with real timestamps, that's a TrueTime/HLC-style design with an explicit clock-skew bound and commit-wait — and I'd flag the operational cost of that up front rather than discovering it later.
Treated 'order these events' as a decision with a cost, not a given. Asked whether the ordering is load-bearing before solving it; separated causal ordering from concurrency-detection from global-real-time ordering; named the specific mechanism for each and flagged the operational price of the expensive one. That's someone who has paid for this in production.
Don't reach for vector clocks when there's a single writer
Vector clocks earn their keep only when writes can originate from multiple replicas concurrently. If every write to a key goes through one leader (single-leader replication, or a partition with one owner), a plain version number or sequence is simpler, smaller, and sufficient. Vector clocks for a single-writer key are cost with no benefit.
Don't let vector clocks grow unbounded
A vector clock keyed by client has one entry per client that ever wrote — millions of mobile devices become a multi-megabyte clock attached to every value. Real systems key by a small set of server-side replicas, or use dotted version vectors with explicit pruning. If your “actors” are unbounded and ephemeral, vector clocks are the wrong tool.
Don't use a wall-clock timestamp as a fence or a lock token
It is tempting to use “most recent timestamp wins” to arbitrate who holds a lock or whose write is authoritative. Because clocks skew, two nodes can both believe they’re newest. Safety that depends on physical time is safety that fails during clock skew — use monotonic fencing tokens instead (Module 8).
Don't tighten a failure-detector timeout to feel fast
A snappy timeout looks great on a dashboard and quietly optimizes for the worst corner of the trilemma under load — declaring healthy-but-slow nodes dead exactly when the system is stressed, which is when false failovers hurt most. Prefer adaptive detection, and make the timeout a measured decision against your real latency distribution, not a number that felt responsive.
Exercises
You’re designing the sync layer for a note-taking app. A user edits the same note on their phone (offline on the subway) and their laptop (online) within the same ten minutes. Both eventually reach the server. Design how the server decides what the note’s content becomes.
Address: how you order or detect concurrency between the two edits; what you do when they’re concurrent; and what you store alongside each note version to make that decision possible. Then state the one question you’d ask the product team that most changes the design.
01-logical-clocks), add two things to VectorClock: a dominates(other) helper (true iff this happens-after other), and a module-level mergeSiblings(versions) that takes an array of concurrent versions and returns a single reconciled value (union the contents) with a clock equal to the element-wise max of all inputs. Add tests proving that merging a causal pair returns the dominant version unchanged, and merging a concurrent pair keeps both items.receive shipped in a multi-replica store. Reads were occasionally reporting two genuinely different concurrent writes as equal, so the conflict-resolution path never ran and one write was dropped. The clock arithmetic looks right. Find the bug.1// Receive a message carrying `msg`, then record that WE observed it.2receive(msg: VectorClock, selfId: string): VectorClock {3 // take the element-wise max so we incorporate the sender's history4 const merged = this.merge(msg)5 // ...and return it.6 return merged7 // (we already merged in the sender's counters — what else is there?)8}- 01Three single-node freebies break first on two machines: one clock, calls that return-or-throw, and crashed-vs-slow being distinguishable. The Assumption Ledger names each one and its repair.
- 02Across machines, never use physical timestamps to order events or resolve conflicts. Use happens-before: Lamport clocks for a cheap total order, vector clocks when you must tell concurrent writes apart from causal ones.
- 03Last-writer-wins is concurrency detection that throws the answer away. It silently drops data whenever two writes are concurrent — which is most of the time in a multi-writer system.
- 04You cannot distinguish a crashed node from a slow one; a timeout is a bet. The Failure-Detector Trilemma says you choose two of speed, accuracy, completeness — so choose on purpose, and prefer adaptive detectors over a constant that rots as you grow.
- 05Physical time is usable for ordering only if you treat its error as a first-class value (Spanner’s TrueTime) — never if you round the uncertainty to zero (the opening ticket).