Transactions Across the Network
Real isolation levels, the anomalies they miss, and why 2PC is an availability liability
“ACID” and “the database handles concurrency” are true in a way that lulls you. Your database is almost certainly running at Read Committed or Snapshot Isolation, both of which permit anomalies that lose money. And the moment a transaction spans two services, the single-DB guarantees evaporate and you’re choosing between two-phase commit (an availability liability) and giving up atomicity for something weaker. This module makes both choices precise.
The Isolation Anomaly Ladder
Isolation levels are usually taught as a list of names. That’s useless under pressure. The way to hold them is by the specific anomaly each level newly prevents — because that’s what you actually reason about when a bug appears. Climb the ladder only as far as the anomalies that would hurt you.
The trap: most databases DEFAULT to Read Committed (PostgreSQL, Oracle, and SQL Server do; MySQL's InnoDB defaults to Repeatable Read), two rungs below where most invariants are actually safe. 'Repeatable Read' sounds strong but still allows write skew.
The reference implementation makes the two dangerous anomalies happen on every run, deterministically — and shows that they need different fixes. A row lock stops the lost update; it does nothing for write skew, because the two transactions write different rows after reading a shared predicate.
    export function* naiveWithdraw(s, acct, amount, log) {
      const bal = s.get(acct)
      yield // <-- the database can schedule the OTHER transaction right here
      if (bal >= amount) {          // both transactions saw the same stale bal
        s.set(acct, bal - amount)   // both subtract from it -> one debit vanishes
        log.push(`withdrew ${amount}`)
      }
    }

courses/distributed-systems/reference-impl/05-isolation-anomalies/

A deterministic interleaving harness (each transaction is a generator that yields at every scheduling point). The demo dispenses $120 against a $100 balance under naive code, then prevents it with SELECT FOR UPDATE, and separately shows that a row lock does not fix write skew; only serializing the predicate does. npm run demo, 4 passing tests.
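To make the interleaving concrete, here is a minimal self-contained sketch. The round-robin `runInterleaved` scheduler and the `Map`-backed store are assumptions made for illustration; they are not the reference implementation's actual harness. Only `naiveWithdraw` mirrors the snippet above.

```typescript
// Illustrative harness (hypothetical): each transaction is a generator that
// yields at its scheduling points, and the scheduler alternates between them.
type Store = Map<string, number>;
type Txn = Generator<void, void, void>;

function* naiveWithdraw(s: Store, acct: string, amount: number, log: string[]): Txn {
  const bal = s.get(acct) ?? 0;
  yield; // the other transaction runs here and reads the same balance
  if (bal >= amount) {
    s.set(acct, bal - amount); // overwrites the other transaction's write
    log.push(`withdrew ${amount}`);
  }
}

// Advance every live transaction by one step per pass until all have finished.
function runInterleaved(txns: Txn[]): void {
  let live = txns;
  while (live.length > 0) {
    live = live.filter((t) => !t.next().done);
  }
}

const store: Store = new Map([["alice", 100]]);
const log: string[] = [];
runInterleaved([
  naiveWithdraw(store, "alice", 60, log),
  naiveWithdraw(store, "alice", 60, log),
]);

const dispensed = log.length * 60;        // 120: both withdrawals succeeded
const finalBalance = store.get("alice");  // 40: one debit was lost
```

Both transactions read 100 before either writes, so each passes the check and each writes 40. The store records one debit while the log records two.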
Write skew: the anomaly that looks like nothing
Lost update is intuitive: two writers, one row, one write disappears. Write skew is the subtle one. Two transactions each read a set of rows (the predicate), each decide their write is safe based on that read, and each write a different row. No single row is contended, so a row lock sees no conflict — yet the combined effect violates an invariant neither transaction could have violated alone. Two doctors each go off-call because each sees the other still on; now nobody is.
This is why “Repeatable Read” isn’t enough for invariants: snapshot isolation gives each transaction a clean, consistent read, and that’s exactly the problem — both snapshots are from before either write. Only serializable execution (or an explicit lock over the predicate, or a materialized conflict) closes it.
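The on-call example can be sketched with the same generator idea. Everything here (the `Roster` map, `goOffCall`, and both schedulers) is a hypothetical illustration rather than the course's reference code. Running the two transactions one after the other stands in for serializable execution.

```typescript
type Roster = Map<string, boolean>; // doctor name -> currently on call?
type Txn = Generator<void, void, void>;

function countOnCall(roster: Roster): number {
  return Array.from(roster.values()).filter((v) => v).length;
}

// Each doctor reads the shared predicate, then updates only their own row.
function* goOffCall(roster: Roster, me: string): Txn {
  const onCall = countOnCall(roster); // predicate read
  yield; // scheduling point: the other transaction reads here too
  if (onCall >= 2) {
    roster.set(me, false); // a different row per transaction, so no row conflict
  }
}

// Alternate between transactions, one step each per pass.
function runInterleaved(txns: Txn[]): void {
  let live = txns;
  while (live.length > 0) {
    live = live.filter((t) => !t.next().done);
  }
}

// Run each transaction to completion before starting the next.
function runSerially(txns: Txn[]): void {
  for (const t of txns) {
    while (!t.next().done) {
      // keep stepping until this transaction finishes
    }
  }
}

const skewed: Roster = new Map([["alice", true], ["bob", true]]);
runInterleaved([goOffCall(skewed, "alice"), goOffCall(skewed, "bob")]);
const onCallAfterSkew = countOnCall(skewed); // 0: invariant violated

const serial: Roster = new Map([["alice", true], ["bob", true]]);
runSerially([goOffCall(serial, "alice"), goOffCall(serial, "bob")]);
const onCallAfterSerial = countOnCall(serial); // 1: invariant holds
```

A lock on either doctor's own row would change nothing in the interleaved run, because the two transactions never touch the same row.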
Across services: why 2PC is an availability liability
Everything above assumed one database. The instant a transaction must be atomic across two services — debit the wallet service and create the order in the order service, all-or-nothing — the single-database machinery is gone. The textbook answer is two-phase commit: a coordinator asks every participant to prepare (promise it can commit, and lock the rows), and once all say yes, tells them to commit. It does give atomicity. It also has a failure mode that makes it dangerous in practice.
The 2PC Liability Matrix
What two-phase commit does on each failure — and why the 'prepared but not committed' window is where availability goes to die.
| Failure | What 2PC does | Consequence |
|---|---|---|
| A participant votes NO (can't commit) | Coordinator aborts everyone. | Clean — this is the case 2PC handles well. |
| Coordinator crashes AFTER prepare, BEFORE commit | Participants are stuck 'prepared': locks held, can't commit, can't abort (only the coordinator knows the decision). | Blocking. Locked rows, frozen transactions, until the coordinator recovers. This is the liability. |
| Network partition between coordinator and a participant | The participant holds its prepared locks indefinitely. | One unreachable participant freezes the whole transaction — and anything contending those locks. |
| A participant crashes while prepared | On recovery it must reconnect and ask the coordinator the outcome before releasing locks. | Recovery coupling: participants can't make progress independently. |
The pattern: 2PC converts an availability problem in one participant into an availability problem for the whole transaction and everything touching its locked rows. Under partition, it chooses consistency by blocking — which is correct, and often unacceptable. That’s why high-scale systems reach for sagas and idempotency (Module 6) instead of distributed ACID transactions.
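The blocking window can be shown with a toy in-memory model. This is a sketch under heavy assumptions: no network, no durable log, and a `crashAfterPrepare` flag that stands in for a coordinator failure. All names are hypothetical.

```typescript
type State = "idle" | "prepared" | "committed" | "aborted";

interface Participant {
  name: string;
  canCommit: boolean; // how this participant will vote in the prepare phase
  state: State;
}

type Outcome = "committed" | "aborted" | "coordinator-crashed";

function twoPhaseCommit(parts: Participant[], crashAfterPrepare: boolean): Outcome {
  // Phase 1: collect votes. A YES voter becomes "prepared" and now holds its locks.
  const votes = parts.map((p) => {
    p.state = p.canCommit ? "prepared" : "aborted";
    return p.canCommit;
  });

  // Simulated crash: the decision is never delivered to anyone.
  if (crashAfterPrepare) return "coordinator-crashed";

  // Phase 2: commit only if every vote was YES, otherwise abort everyone.
  const decision = votes.every((v) => v) ? "committed" : "aborted";
  for (const p of parts) p.state = decision;
  return decision;
}

// Participants that are still holding locks while they wait for a decision.
function lockedParticipants(parts: Participant[]): string[] {
  return parts.filter((p) => p.state === "prepared").map((p) => p.name);
}

const mk = (name: string, canCommit: boolean): Participant => ({ name, canCommit, state: "idle" });

const happy = [mk("wallet", true), mk("orders", true)];
const happyOutcome = twoPhaseCommit(happy, false);

const vetoed = [mk("wallet", true), mk("orders", false)];
const vetoedOutcome = twoPhaseCommit(vetoed, false);

const stuck = [mk("wallet", true), mk("orders", true)];
const stuckOutcome = twoPhaseCommit(stuck, true);
const stillLocked = lockedParticipants(stuck);
```

In the crashed run both participants remain `prepared`: neither can commit or abort on its own, which is the blocking row of the matrix.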
| Dimension | Read Committed | Snapshot / RR | Serializable | 2PC (cross-service) |
|---|---|---|---|---|
| Anomalies prevented | Dirty reads only | + non-repeatable reads (allows write skew) | All — incl. write skew, phantoms | Atomicity across services |
| Throughput | Highest | High | Lower — aborts/locking | Low — locks held across round trips |
| Availability under failure | High | High | High (single DB) | Blocks on coordinator/participant failure |
| Where it fits | Reads that don't gate writes | Snapshot reads, reporting | Any multi-row invariant; money | Rare: small, trusted, co-located participants |
| Choose when | High-volume reads and writes that don't enforce a cross-row invariant. Know that it permits lost updates — guard money paths explicitly. | You want consistent-snapshot reads and can prove no write-skew invariant is at stake (or you add explicit locks where one is). | Any operation enforcing an invariant — balances, uniqueness, scheduling. Default to it for correctness-critical writes; optimize down only if measured. | You genuinely need cross-service atomicity AND can tolerate blocking. Usually a sign you should redesign toward a saga + idempotency. |
For correctness-critical writes on one database, default to Serializable (or guard the specific path with SELECT FOR UPDATE / a uniqueness constraint) and optimize down only when you’ve measured a throughput problem. Across services, treat 2PC as a last resort — its blocking failure mode means most teams are better served by a saga with compensations and idempotent steps. The expensive mistake is trusting the default isolation level to protect an invariant it doesn’t.
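Choosing Serializable usually means the application must retry transactions that the database aborts. A minimal sketch of that loop follows, assuming PostgreSQL (which reports serialization failures as SQLSTATE 40001) and a hypothetical single-connection `Db` interface; adapt both to your driver.

```typescript
// Hypothetical minimal client interface. All calls must go to the same
// connection, not to a pool that may hand out a different one per query.
interface Db {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Assumption: the driver exposes the SQLSTATE as an error `code` property.
function isSerializationFailure(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "40001";
}

async function withSerializableRetry<T>(
  db: Db,
  work: (db: Db) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await db.query("BEGIN ISOLATION LEVEL SERIALIZABLE");
    try {
      const result = await work(db);
      await db.query("COMMIT");
      return result;
    } catch (err) {
      await db.query("ROLLBACK");
      // Only serialization failures are safe to retry blindly.
      if (!isSerializationFailure(err)) throw err;
      lastError = err;
    }
  }
  throw lastError; // retries exhausted
}
```

The whole unit of work is re-run on retry, so `work` should contain only database operations or otherwise be safe to repeat.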
Isolation anomalies as a repeatable money exploit
The bug is not a missing WHERE clause; the code looks correct on a single thread. It’s that “check then act” across two statements is only safe under serializable isolation or an explicit lock, and the application ran at the database’s permissive default. The isolation level was a security boundary nobody had reviewed.

Treat your isolation level as part of your threat model. Any read-check-write over money or a limited resource must be one atomic, serializable unit: a single UPDATE ... WHERE balance >= amount, a SELECT FOR UPDATE, a uniqueness constraint, or serializable isolation. “It worked in testing” means nothing; the anomaly only appears under the concurrency an attacker will happily supply.
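The single conditional UPDATE can be sketched as follows. The `Db` interface, its `execute` method, and the `accounts` table with `id` and `balance` columns are assumptions for illustration; the placeholders use PostgreSQL's `$1` style.

```typescript
// Hypothetical client interface: resolves with the number of rows changed.
interface Db {
  execute(sql: string, params: unknown[]): Promise<{ rowCount: number }>;
}

// Check and write happen in one statement, so there is no gap between them
// for another transaction to slip into.
async function debit(db: Db, accountId: string, amount: number): Promise<boolean> {
  if (!(amount > 0)) throw new Error("amount must be positive");
  const res = await db.execute(
    "UPDATE accounts SET balance = balance - $1 WHERE id = $2 AND balance >= $1",
    [amount, accountId],
  );
  // Zero rows changed means the guard failed: insufficient funds or no such account.
  return res.rowCount === 1;
}
```

The caller treats `false` as a declined debit. No balance is ever read into application memory, so there is no stale value to write back.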
What this sounds like in an interview
Two concurrent requests each deduct from a user's account balance. How do you make sure the balance can't go negative?
The interviewer is probing whether you know your default isolation level doesn't protect this, and what the right-sized fix is.
I'd read the balance, check it's enough, and if so subtract the amount and save it. Maybe wrap it in a transaction.
A plain transaction at the default isolation can still let both reads see the old balance — that's a lost update. I'd use SELECT ... FOR UPDATE to lock the row, so the second request waits for the first to commit.
The default is Read Committed, which permits lost updates, so I need to make the read-check-write atomic. Cleanest is a single conditional update — UPDATE accounts SET balance = balance - :amt WHERE id = :id AND balance >= :amt — and treat zero rows affected as 'insufficient funds'. That pushes the check into the database atomically with no app-side gap. SELECT FOR UPDATE also works but holds a lock across a round trip. I'd avoid serializable for the whole app and apply the guarantee surgically to this path.
Same atomic conditional update as the primary fix, but I'd frame it around the invariant and the failure modes. The invariant 'balance >= 0' is single-row, so a conditional update or a CHECK constraint enforces it without serializable isolation — but I'd note that if the invariant were multi-row (e.g. 'sum of sub-accounts <= limit'), a row lock wouldn't help and I'd need serializable or a materialized conflict. I'd also make the operation idempotent with a request key (Module 6), because the client will retry on a timeout and a non-idempotent debit double-charges — which is a different bug than the race and needs its own fix. And I'd treat the isolation level as a reviewed security property, not a default. The trade is a tiny bit of write contention on the hot account for a guarantee that holds under adversarial concurrency.
Enforced the invariant atomically at the right altitude (conditional update, not global serializable), distinguished single-row from multi-row invariants, AND caught that retries need idempotency — a separate failure from the race. Plus treated isolation as a security property. That's production-scarred.
Don't run the whole application at Serializable
Serializable is the right default for invariant-enforcing writes, not for every query. Forcing it globally means more aborts and lower throughput on reads and writes that never needed it. Apply the strong guarantee surgically — to the money path, the booking path — and let lag-tolerant reads run cheaper.
Don't use 2PC across services if you can avoid it
Two-phase commit makes the whole transaction only as available as its least-available participant, and its prepared-but-not-committed window holds locks during exactly the failures you can’t control. Unless participants are few, trusted, and co-located, a saga with compensating actions and idempotent steps is more available and easier to reason about.
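As a rough illustration of the alternative, a saga runs local steps in order and undoes the completed ones if a later step fails. This sketch assumes compensations always succeed and keeps no durable record of progress, both of which a production saga would need; `SagaStep` and `runSaga` are hypothetical names.

```typescript
interface SagaStep {
  name: string;
  run: () => Promise<void>;        // local transaction in one service
  compensate: () => Promise<void>; // semantic undo of that local transaction
}

async function runSaga(steps: SagaStep[]): Promise<"completed" | "compensated"> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch {
      // Undo completed steps in reverse order. No locks were held in the meantime.
      for (const s of done.reverse()) {
        await s.compensate();
      }
      return "compensated";
    }
  }
  return "completed";
}
```

The trade is explicit: other requests can observe the intermediate state between a step and its compensation, which 2PC would have hidden behind locks.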
Don't trust the database's default isolation for an invariant
The default is Read Committed almost everywhere, and “Repeatable Read” is usually snapshot isolation, which allows write skew. If correctness depends on an invariant, you must deliberately choose the level (or the lock, or the constraint) that protects it — never inherit it.
Don't reach for a distributed transaction when idempotency suffices
Many “we need cross-service atomicity” cases are really “this step must happen exactly once even if retried.” That’s an idempotency-key problem (Module 6), not a 2PC problem. Reaching for distributed transactions when an idempotent, retryable step would do is paying coordination cost for a guarantee you can get cheaper.
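A minimal in-memory sketch of the idempotency-key idea: the first result for a request key is recorded, and any retry with the same key replays it instead of debiting again. `makeWallet` and the key format are hypothetical; in a real service the key record and the debit would have to be written in the same database transaction.

```typescript
interface DebitResult {
  ok: boolean;
  balance: number;
}

function makeWallet(initial: number) {
  let balance = initial;
  const seen = new Map<string, DebitResult>(); // request key -> first recorded result

  return {
    debit(requestKey: string, amount: number): DebitResult {
      const prior = seen.get(requestKey);
      if (prior) return prior; // retry: replay the outcome, do not debit again

      const ok = balance >= amount;
      if (ok) balance -= amount;
      const result: DebitResult = { ok, balance };
      seen.set(requestKey, result);
      return result;
    },
    balance: (): number => balance,
  };
}

const wallet = makeWallet(100);
const first = wallet.debit("req-1", 60);  // applies the debit
const retry = wallet.debit("req-1", 60);  // client timed out and retried
const second = wallet.debit("req-2", 60); // a genuinely new request
```

The retry returns the same result as the first call and the balance moves once, with no coordinator involved.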
Exercises
In 05-isolation-anomalies, add an optimistic concurrency strategy: version each account, have the transaction read the version, and on write require the version to be unchanged (abort + retry on mismatch). Add it as a third withdraw variant and a test showing it prevents the overdraft by retrying rather than blocking, and discuss when optimistic beats pessimistic locking (low contention) and when it doesn’t (hot row → retry storm).

    async function redeemGiftCard(cardId: string, amount: number) {
      await db.begin() // default isolation: READ COMMITTED
      const card = await db.query("SELECT balance FROM cards WHERE id = $1", [cardId])
      if (card.balance < amount) {
        await db.rollback()
        throw new Error("insufficient balance")
      }
      await db.query("UPDATE cards SET balance = $1 WHERE id = $2",
        [card.balance - amount, cardId]) // writes back a value computed from a stale read
      await db.commit()
    }

1. Hold isolation levels by the anomaly each one prevents, not by name. Your database’s default is Read Committed, which permits lost updates; “Repeatable Read” is snapshot isolation, which permits write skew.
2. A read-check-write across separate statements is a money-losing race unless it’s one atomic unit. Use an atomic conditional UPDATE ... WHERE, SELECT FOR UPDATE, a constraint, or serializable, and treat the isolation level as part of your threat model (ACIDRain).
3. Write skew defeats row locks: two transactions write different rows after reading a shared predicate. Multi-row invariants need serializable or an explicit predicate lock.
4. 2PC buys cross-service atomicity at the price of blocking: a coordinator crash in the prepared window freezes participants holding locks. Reach for it only as a last resort.
5. Most “distributed transaction” needs are really idempotency (Module 6) or a saga (Module 8) needs. Knowing when not to use 2PC is the senior signal.