Field Manual
Module 5 · Core · Data Layer · 60 min

Transactions Across the Network

Real isolation levels, the anomalies they miss, and why 2PC is an availability liability

Framework: The Isolation Anomaly Ladder · The 2PC Liability Matrix
Anchored to: Jepsen isolation-anomaly analysis

“ACID” and “the database handles concurrency” are true in a way that lulls you. Your database is almost certainly running at Read Committed or Snapshot Isolation, both of which permit anomalies that lose money. And the moment a transaction spans two services, the single-DB guarantees evaporate and you’re choosing between two-phase commit (an availability liability) and giving up atomicity for something weaker. This module makes both choices precise.

The Isolation Anomaly Ladder

Isolation levels are usually taught as a list of names. That’s useless under pressure. The way to hold them is by the specific anomaly each level newly prevents — because that’s what you actually reason about when a bug appears. Climb the ladder only as far as the anomalies that would hurt you.

The Isolation Anomaly Ladder (strongest isolation at the top)
4 · Serializable · prevents everything
As if transactions ran one at a time. Prevents write skew and phantoms, the anomalies SI misses. The only level safe for arbitrary invariants. Cost: lowest concurrency (more aborts or locking).
3 · Snapshot Isolation / Repeatable Read · allows write skew
Every transaction reads from a consistent snapshot; no dirty or non-repeatable reads. But two transactions can read the same snapshot, write different rows, and break a multi-row invariant: write skew. Postgres 'Repeatable Read' is SI.
2 · Read Committed · allows lost update
Only prevents dirty reads (you never see uncommitted data). Allows non-repeatable reads, phantoms, and lost updates, including the gift-card double-spend. The DEFAULT in Postgres, Oracle, SQL Server.
1 · Read Uncommitted · allows dirty reads
Prevents nothing meaningful: you can read another transaction's uncommitted, soon-to-be-rolled-back writes. Almost never what you want.

The trap: your database's DEFAULT is Read Committed, two rungs below where most invariants are actually safe. 'Repeatable Read' sounds strong but still allows write skew.

The reference implementation makes the two dangerous anomalies happen on every run, deterministically — and shows that they need different fixes. A row lock stops the lost update; it does nothing for write skew, because the two transactions write different rows after reading a shared predicate.

anomalies.ts — the gap that loses the money
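// Assumed shape of the store: the listing only shows that it has get() and set().
type Store = { get(acct: string): number; set(acct: string, bal: number): void }

// Each transaction is a generator; every `yield` is a point where the scheduler may switch.
export function* naiveWithdraw(s: Store, acct: string, amount: number, log: string[]): Generator<void, void, void> {
  const bal = s.get(acct)
  yield // <-- the database can schedule the OTHER transaction right here
  if (bal >= amount) { // both transactions saw the same stale bal
    s.set(acct, bal - amount) // both subtract from it -> one debit vanishes
    log.push(`withdrew ${amount}`)
  }
}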
Runnable reference implementation
TypeScript
courses/distributed-systems/reference-impl/05-isolation-anomalies/

A deterministic interleaving harness (each transaction is a generator that yields at every scheduling point). The demo dispenses $120 against a $100 balance under naive code, then prevents it with SELECT FOR UPDATE — and separately shows that a row lock does not fix write skew, only serializing the predicate does. npm run demo, 4 passing tests.
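The interleaving described above can be reproduced in a few lines. This is a minimal sketch, not the reference implementation itself: the `Map`-backed store, the round-robin `runInterleaved` scheduler, and the two $60 withdrawals are assumptions chosen so that the naive code dispenses $120 against a $100 balance.

```typescript
// Hypothetical in-memory store standing in for the database.
type Store = Map<string, number>;
type Txn = Generator<void, void, void>;

// Same shape as naiveWithdraw: read, yield, then check and write.
function* naiveWithdraw(s: Store, acct: string, amount: number, log: string[]): Txn {
  const bal = s.get(acct) ?? 0;
  yield; // scheduling point: another transaction may run here
  if (bal >= amount) {
    s.set(acct, bal - amount); // computed from the stale read
    log.push(`withdrew ${amount}`);
  }
}

// Round-robin scheduler: advance every live transaction one step per pass.
function runInterleaved(txns: Txn[]): void {
  let live = txns;
  while (live.length > 0) {
    live = live.filter((t) => !t.next().done);
  }
}

// Two $60 withdrawals race against a $100 balance.
function demoLostUpdate(): { balance: number; dispensed: number } {
  const store: Store = new Map([["acct", 100]]);
  const log: string[] = [];
  runInterleaved([
    naiveWithdraw(store, "acct", 60, log),
    naiveWithdraw(store, "acct", 60, log),
  ]);
  return { balance: store.get("acct") ?? 0, dispensed: log.length * 60 };
}
```

Both transactions read 100 before either writes, so both pass the check. The store ends at 40 while 120 has been dispensed: one debit has vanished.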

Mental model
Write skew: the anomaly that looks like nothing

Lost update is intuitive: two writers, one row, one write disappears. Write skew is the subtle one. Two transactions each read a set of rows (the predicate), each decide their write is safe based on that read, and each write a different row. No single row is contended, so a row lock sees no conflict — yet the combined effect violates an invariant neither transaction could have violated alone. Two doctors each go off-call because each sees the other still on; now nobody is.

This is why “Repeatable Read” isn’t enough for invariants: snapshot isolation gives each transaction a clean, consistent read, and that’s exactly the problem — both snapshots are from before either write. Only serializable execution (or an explicit lock over the predicate, or a materialized conflict) closes it.

Use it when: Any invariant that spans multiple rows — 'at least one doctor on call', 'no double-booking', 'sum of allocations ≤ budget'. Row locks won't protect these.
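The on-call example can be sketched the same way. This is an illustration under stated assumptions, not the course's code: the names `goOffCall`, `runSerial`, `runInterleaved`, and `doctorsOnCallAfter` are hypothetical, and snapshot isolation is modelled by copying the map when the transaction starts.

```typescript
type OnCall = Map<string, boolean>;
type Txn = Generator<void, void, void>;

// One "go off call" transaction. It reads a private snapshot (as under snapshot
// isolation), checks the predicate, then writes only its own row.
function* goOffCall(db: OnCall, doctor: string): Txn {
  const snapshot = new Map(db); // consistent snapshot taken when the transaction starts
  yield; // scheduling point: the other transaction may run here
  const othersOnCall = Array.from(snapshot).filter(([name, on]) => name !== doctor && on).length;
  if (othersOnCall >= 1) db.set(doctor, false); // a different row per transaction: no row conflict
}

// Serial schedule: each transaction runs to completion before the next starts.
function runSerial(txns: Txn[]): void {
  for (const t of txns) {
    while (!t.next().done) { /* keep stepping this transaction */ }
  }
}

// Interleaved schedule: every live transaction advances one step per pass.
function runInterleaved(txns: Txn[]): void {
  let live = txns;
  while (live.length > 0) live = live.filter((t) => !t.next().done);
}

// Returns how many doctors are still on call after both transactions finish.
function doctorsOnCallAfter(schedule: (txns: Txn[]) => void): number {
  const db: OnCall = new Map([["alice", true], ["bob", true]]);
  schedule([goOffCall(db, "alice"), goOffCall(db, "bob")]);
  return Array.from(db.values()).filter((on) => on).length;
}
```

Run serially, the second transaction sees the first one's write and stays on call, so one doctor remains. Interleaved, both snapshots predate both writes and nobody is left on call, even though no row was written twice.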

Across services: why 2PC is an availability liability

Everything above assumed one database. The instant a transaction must be atomic across two services — debit the wallet service and create the order in the order service, all-or-nothing — the single-database machinery is gone. The textbook answer is two-phase commit: a coordinator asks every participant to prepare (promise it can commit, and lock the rows), and once all say yes, tells them to commit. It does give atomicity. It also has a failure mode that makes it dangerous in practice.

Framework · Failure analysis

The 2PC Liability Matrix

What two-phase commit does on each failure — and why the 'prepared but not committed' window is where availability goes to die.

Failure | What 2PC does | Consequence
A participant votes NO (can't commit) | Coordinator aborts everyone. | Clean: this is the case 2PC handles well.
Coordinator crashes AFTER prepare, BEFORE commit | Participants are stuck 'prepared': locks held, can't commit, can't abort (only the coordinator knows the decision). | Blocking. Locked rows, frozen transactions, until the coordinator recovers. This is the liability.
Network partition between coordinator and a participant | The participant holds its prepared locks indefinitely. | One unreachable participant freezes the whole transaction, and anything contending those locks.
A participant crashes while prepared | On recovery it must reconnect and ask the coordinator the outcome before releasing locks. | Recovery coupling: participants can't make progress independently.

The pattern: 2PC converts an availability problem in one participant into an availability problem for the whole transaction and everything touching its locked rows. Under partition, it chooses consistency by blocking — which is correct, and often unacceptable. That’s why high-scale systems reach for sagas and idempotency (Module 6) instead of distributed ACID transactions.
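The first two rows of the matrix can be made concrete with a toy model. This is a sketch under simplifying assumptions (no network, no timeouts, no recovery log); `Participant`, `twoPhaseCommit`, and the `crashAfterPrepare` flag are hypothetical names introduced only for illustration.

```typescript
type TxnState = "idle" | "prepared" | "committed" | "aborted";
type Decision = "committed" | "aborted";

class Participant {
  state: TxnState = "idle";
  readonly name: string;
  private readonly canCommit: boolean;

  constructor(name: string, canCommit: boolean) {
    this.name = name;
    this.canCommit = canCommit;
  }

  // Phase 1: vote. A YES vote means locks are held until the outcome arrives.
  prepare(): boolean {
    if (this.canCommit) this.state = "prepared";
    return this.canCommit;
  }

  // Phase 2: only the coordinator can deliver the decision.
  finish(decision: Decision): void {
    this.state = decision;
  }

  get holdingLocks(): boolean {
    return this.state === "prepared";
  }
}

// Returns the decision, or "unknown" if the coordinator crashed before sending it.
function twoPhaseCommit(participants: Participant[], crashAfterPrepare = false): Decision | "unknown" {
  const allYes = participants.map((p) => p.prepare()).every((vote) => vote);
  if (crashAfterPrepare) return "unknown"; // the decision never reaches the participants
  const decision = allYes ? "committed" : "aborted";
  participants.forEach((p) => p.finish(decision));
  return decision;
}
```

With `crashAfterPrepare` set, every participant that voted YES is left in `prepared` with `holdingLocks` true, and nothing in the participant's own code can change that. A NO vote, by contrast, resolves cleanly to `aborted` for everyone.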

Dimension | Read Committed | Snapshot / RR | Serializable | 2PC (cross-service)
Anomalies prevented | Dirty reads only | + non-repeatable reads (allows write skew) | All, incl. write skew and phantoms | Atomicity across services
Throughput | Highest | High | Lower (aborts/locking) | Low (locks held across round trips)
Availability under failure | High | High | High (single DB) | Blocks on coordinator/participant failure
Where it fits | Reads that don't gate writes | Snapshot reads, reporting | Any multi-row invariant; money | Rare: small, trusted, co-located participants
Choose when | High-volume reads and writes that don't enforce a cross-row invariant. It permits lost updates, so guard money paths explicitly. | You want consistent-snapshot reads and can prove no write-skew invariant is at stake (or you add explicit locks where one is). | Any operation enforcing an invariant: balances, uniqueness, scheduling. Default to it for correctness-critical writes; optimize down only if measured. | You genuinely need cross-service atomicity AND can tolerate blocking. Usually a sign you should redesign toward a saga + idempotency.
Verdict

For correctness-critical writes on one database, default to Serializable (or guard the specific path with SELECT FOR UPDATE / a uniqueness constraint) and optimize down only when you’ve measured a throughput problem. Across services, treat 2PC as a last resort — its blocking failure mode means most teams are better served by a saga with compensations and idempotent steps. The expensive mistake is trusting the default isolation level to protect an invariant it doesn’t.
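Defaulting to Serializable means the application has to retry transactions the database aborts. Below is a minimal sketch of such a retry wrapper. It assumes the driver exposes the SQLSTATE on the error's `code` property (PostgreSQL reports serialization failures as `40001`). `withSerializableRetry` and `runTxn` are hypothetical names, and `runTxn` is assumed to open, run, and commit or roll back one complete transaction per call.

```typescript
// SQLSTATE that PostgreSQL uses for serialization failures.
const SERIALIZATION_FAILURE = "40001";

function isSerializationFailure(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === SERIALIZATION_FAILURE
  );
}

// Re-runs the whole transaction when the database aborts it for serializability.
// Any other error, or running out of attempts, is rethrown to the caller.
async function withSerializableRetry<T>(
  runTxn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTxn();
    } catch (err) {
      if (!isSerializationFailure(err) || attempt >= maxAttempts) throw err;
      // Otherwise fall through and retry from the start of the transaction.
    }
  }
}
```

The retry must re-execute the reads as well as the writes, since decisions made from the aborted attempt's reads are exactly what the database rejected. A production version would likely add backoff between attempts.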

How this fails in production · Real exchanges & e-commerce (ACIDRain)

Isolation anomalies as a repeatable money exploit

The setup
The ACIDRain researchers analyzed widely-used e-commerce platforms (and the pattern matches real exchange incidents like the Poloniex 2014 withdrawal race). The common shape: application code reads a balance or a coupon/gift-card state, checks it in application memory, then writes — across separate statements, at the database’s default Read Committed isolation.
What happened
By firing concurrent requests, an attacker exploits the gap between the check and the write. A gift card or store credit gets redeemed multiple times; a withdrawal passes the balance check several times before any debit lands. The researchers found 22 exploitable anomalies in widely-used workflows across 12 popular self-hosted e-commerce applications, and the same class of race drained real cryptocurrency exchanges of funds via concurrent withdrawals.
The moment it went wrong
The vulnerability isn’t a missing WHERE clause — the code looks correct on a single thread. It’s that “check then act” across two statements is only safe under serializable isolation or an explicit lock, and the application ran at the database’s permissive default. The isolation level was a security boundary nobody had reviewed.
The transferable lesson

Treat your isolation level as part of your threat model. Any read-check-write over money or a limited resource must be one atomic, serializable unit — a single UPDATE ... WHERE balance >= amount, a SELECT FOR UPDATE, a uniqueness constraint, or serializable isolation. “It worked in testing” means nothing; the anomaly only appears under the concurrency an attacker will happily supply.
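The single-statement option can be sketched as follows. The `Db` interface is a hypothetical stand-in for whatever driver is in use (an assumption, not a real library API), and the `accounts` table with `id` and `balance` columns is likewise assumed for illustration.

```typescript
// Hypothetical minimal database interface (an assumption for this sketch):
// execute() runs one statement and resolves to the number of rows affected.
interface Db {
  execute(sql: string, params: unknown[]): Promise<number>;
}

type DebitOutcome = "debited" | "insufficient_funds";

// Check and write in ONE statement: the WHERE clause is the balance check,
// so there is no application-side gap for a concurrent request to slip into.
async function debit(db: Db, accountId: string, amount: number): Promise<DebitOutcome> {
  if (!(amount > 0)) throw new Error("amount must be positive");
  const rows = await db.execute(
    "UPDATE accounts SET balance = balance - $1 WHERE id = $2 AND balance >= $1",
    [amount, accountId],
  );
  // Zero rows affected means the guard failed (or the account does not exist).
  return rows === 1 ? "debited" : "insufficient_funds";
}
```

Because the database evaluates the guard and applies the write as one statement, two concurrent calls cannot both act on the same stale balance. The second one sees the reduced balance and affects zero rows.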

Warszawski & Bailis — ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications (SIGMOD 2017)

What this sounds like in an interview

Calibration ladder · L3 → L6

Two concurrent requests each deduct from a user's account balance. How do you make sure the balance can't go negative?

The interviewer is probing whether you know your default isolation level doesn't protect this, and what the right-sized fix is.

L3 · Junior

I'd read the balance, check it's enough, and if so subtract the amount and save it. Maybe wrap it in a transaction.

Missed: Doesn't know the default isolation permits this exact race. Ships the ACIDRain bug.
L4 · Mid

A plain transaction at the default isolation can still let both reads see the old balance — that's a lost update. I'd use SELECT ... FOR UPDATE to lock the row, so the second request waits for the first to commit.

Missed: Correct fix, but reaches for a lock without knowing the lighter atomic-conditional-update option, and doesn't mention the idempotency-on-retry problem.
L5 · Senior

The default is Read Committed, which permits lost updates, so I need to make the read-check-write atomic. Cleanest is a single conditional update — UPDATE accounts SET balance = balance - :amt WHERE id = :id AND balance >= :amt — and treat zero rows affected as 'insufficient funds'. That pushes the check into the database atomically with no app-side gap. SELECT FOR UPDATE also works but holds a lock across a round trip. I'd avoid serializable for the whole app and apply the guarantee surgically to this path.

Missed: Strong. Missing the multi-row-invariant caveat (where locks fail) and the idempotency-on-retry concern that turns one bug into two.
L6 · Staff

Same atomic conditional update as the primary fix, but I'd frame it around the invariant and the failure modes. The invariant 'balance >= 0' is single-row, so a conditional update or a CHECK constraint enforces it without serializable isolation — but I'd note that if the invariant were multi-row (e.g. 'sum of sub-accounts <= limit'), a row lock wouldn't help and I'd need serializable or a materialized conflict. I'd also make the operation idempotent with a request key (Module 6), because the client will retry on a timeout and a non-idempotent debit double-charges — which is a different bug than the race and needs its own fix. And I'd treat the isolation level as a reviewed security property, not a default. The trade is a tiny bit of write contention on the hot account for a guarantee that holds under adversarial concurrency.

What scored L6

Enforced the invariant atomically at the right altitude (conditional update, not global serializable), distinguished single-row from multi-row invariants, AND caught that retries need idempotency — a separate failure from the race. Plus treated isolation as a security property. That's production-scarred.
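The retry concern in the L6 answer is separate from the race, and a small in-memory sketch shows its shape. `IdempotentLedger` and `requestKey` are hypothetical names. In a real system the stored key and the debit would need to commit in the same database transaction, which this sketch does not model.

```typescript
type DebitResult = { ok: boolean; balance: number };

class IdempotentLedger {
  private balance: number;
  // Results already produced, indexed by the client-supplied request key.
  private readonly seen = new Map<string, DebitResult>();

  constructor(initialBalance: number) {
    this.balance = initialBalance;
  }

  // requestKey is chosen by the client and reused unchanged on every retry.
  debit(requestKey: string, amount: number): DebitResult {
    const prior = this.seen.get(requestKey);
    if (prior) return prior; // a retry: replay the stored result, do not debit again

    const ok = this.balance >= amount;
    if (ok) this.balance -= amount;
    const result: DebitResult = { ok, balance: this.balance };
    this.seen.set(requestKey, result);
    return result;
  }
}
```

A client that times out and resends the same key gets the original outcome back, so the account is debited once. A genuinely new request carries a new key and is checked against the current balance.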

When NOT to use this
Don't run the whole application at Serializable

Serializable is the right default for invariant-enforcing writes, not for every query. Forcing it globally means more aborts and lower throughput on reads and writes that never needed it. Apply the strong guarantee surgically, to the money path and the booking path, and let reads that tolerate weaker guarantees run at a cheaper level.

Don't use 2PC across services if you can avoid it

Two-phase commit makes the whole transaction only as available as its least-available participant, and its prepared-but-not-committed window holds locks during exactly the failures you can’t control. Unless participants are few, trusted, and co-located, a saga with compensating actions and idempotent steps is more available and easier to reason about.
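For contrast, the saga alternative can be sketched in its simplest synchronous form. This is an illustration only: `Step` and `runSaga` are hypothetical names, and a real saga would persist its progress and make each step and compensation idempotent.

```typescript
type Step = {
  name: string;
  action: () => void;      // the forward step; throws on failure
  compensate: () => void;  // undoes the step's effect if a later step fails
};

// Runs steps in order. On the first failure, compensates the completed steps
// in reverse order instead of holding locks while waiting for a global decision.
function runSaga(steps: Step[]): { ok: boolean; log: string[] } {
  const log: string[] = [];
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
      log.push(`did ${step.name}`);
    } catch {
      for (let i = completed.length - 1; i >= 0; i--) {
        completed[i].compensate();
        log.push(`undid ${completed[i].name}`);
      }
      return { ok: false, log };
    }
  }
  return { ok: true, log };
}
```

Each step commits locally and releases its locks immediately, so a slow or failed participant does not freeze the others. The price is that intermediate states are visible, and atomicity is replaced by eventual compensation.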

Don't trust the database's default isolation for an invariant

The default is Read Committed in Postgres, Oracle, and SQL Server (MySQL's InnoDB defaults to Repeatable Read), and “Repeatable Read” is usually snapshot isolation, which allows write skew. If correctness depends on an invariant, you must deliberately choose the level (or the lock, or the constraint) that protects it; never inherit it.

Don't reach for a distributed transaction when idempotency suffices

Many “we need cross-service atomicity” cases are really “this step must happen exactly once even if retried.” That’s an idempotency-key problem (Module 6), not a 2PC problem. Reaching for distributed transactions when an idempotent, retryable step would do is paying coordination cost for a guarantee you can get cheaper.

Exercises

Exercise · Design scenario
Design the concurrency safety for a movie-ticket booking system. Seats are individual rows; a booking reserves N specific seats atomically (all or nothing), and the same seat must never be sold twice. Specify: the isolation level or locking strategy for a booking, what happens when two users grab overlapping seat sets at once, and how you keep a held-but-unpaid seat from being locked forever. Identify which anomaly is the real threat.
Exercise · Implementation task
In 05-isolation-anomalies, add an optimistic concurrency strategy: version each account, have the transaction read the version, and on write require the version to be unchanged (abort + retry on mismatch). Add it as a third withdraw variant and a test showing it prevents the overdraft by retrying rather than blocking — and discuss when optimistic beats pessimistic locking (low contention) and when it doesn’t (hot row → retry storm).
Exercise · Find the race
This is the gift-card redemption from the opening scenario, as it actually shipped. It passes every single-threaded test. Find the window, and name the anomaly.
redeem.ts — shipped, lost real money
async function redeemGiftCard(cardId: string, amount: number) {
  await db.begin() // default isolation: READ COMMITTED
  const card = await db.query("SELECT balance FROM cards WHERE id = $1", [cardId])
  if (card.balance < amount) {
    await db.rollback()
    throw new Error("insufficient balance")
  }
  await db.query("UPDATE cards SET balance = $1 WHERE id = $2",
    [card.balance - amount, cardId]) // writes back a value computed from a stale read
  await db.commit()
}
Walk away with this
  • 01 Hold isolation levels by the anomaly each one prevents, not by name. Your database’s default is Read Committed, which permits lost updates; “Repeatable Read” is snapshot isolation, which permits write skew.
  • 02 A read-check-write across separate statements is a money-losing race unless it’s one atomic unit. Use an atomic conditional UPDATE ... WHERE, SELECT FOR UPDATE, a constraint, or serializable, and treat the isolation level as part of your threat model (ACIDRain).
  • 03 Write skew defeats row locks: two transactions write different rows after reading a shared predicate. Multi-row invariants need serializable or an explicit predicate lock.
  • 04 2PC buys cross-service atomicity at the price of blocking: a coordinator crash in the prepared window freezes participants holding locks. Reach for it only as a last resort.
  • 05 Most “distributed transaction” needs are really idempotency (Module 6) or a saga (Module 8) needs. Knowing when not to use 2PC is the senior signal.