Field Manual
Module 4 · Core · Data Layer · 55 min

Partitioning & the Physics of Hot Keys

Hash vs range, consistent hashing, rebalancing, and the one key that melts a shard

Framework: The Skew Budget · Hot-Key Triage treeAnchored to: Discord — storing trillions of messages

Partitioning splits data across machines so the system scales past one box. Two questions decide everything: how you split (the partitioning scheme) and what happens when one piece gets hot(the part everyone underestimates). This module covers both, with a consistent-hash ring you can poke and a triage tree for the 8 PM page.

How to split: range, hash, and the ring

Range partitioning keeps keys sorted and splits by boundaries (A–F on shard 1, G–M on shard 2). Range scans are cheap (“all orders in March” hits one shard), but sequential keys — timestamps, auto-increment IDs — pile every new write onto the last shard. Hash partitioning scatters keys by hash(key), which spreads writes evenly but destroys range scans (adjacent keys land anywhere). The catch with the naive version, hash(key) % N: change N and almost every key remaps — resharding becomes a full data migration, i.e. an outage.

Consistent hashing fixes the resharding problem. Place nodes and keys on a ring; each key belongs to the next node clockwise. Add or remove a node and only the keys in that node’s arc move — about K/N of them, not all. Drag the fleet size below and watch how few keys move, versus what modulo would do:

Consistent-hash ring · 3 nodes
ABC
Each key (dot) is owned by the first node (square) clockwise. Add or remove a node and only the keys in that arc move — highlighted with a ring. With hash(key) % N, changing N reshuffles almost everything, which is why it’s an outage to reshard.
Watch for skew: with few nodes, arcs are uneven — one node can own twice its share. That imbalance is what virtual nodes fix, and it’s the seed of a hot partition.
Runnable reference implementation
Go
courses/distributed-systems/reference-impl/04-consistent-hashing/

A consistent-hash ring with virtual nodes and a hot-key detector. The demo proves adding a node moves ~25% of keys (vs ~75% for modulo), that virtual nodes flatten skew, and that a viral key is detected even though the ring placed it perfectly evenly. go run .

The Skew Budget: distribution is a number, not a vibe

“Evenly distributed” is not a yes/no. It’s a ratio you measure and bound. The Skew Budget is the most-loaded partition’s load divided by the average, and the maximum value you’re willing to tolerate before you act. Pick the number up front (say 1.5×), measure it continuously, and treat a breach as a signal — not a surprise during an incident.

Framework · Metric + threshold

The Skew Budget

max-partition-load ÷ mean-partition-load. Set the ceiling, measure it, and know what each breach means.

Ratio ≈ 1.0–1.5× and stable — healthy. Virtual nodes are doing their job; key distribution is even.

Ratio creeping up across all partitions — a distribution problem. Too few virtual nodes, or a partitioning key with low cardinality. Fix the scheme.

One partition spiking while the rest are flat — a hot key, not a distribution problem. Re-sharding won’t help (the key still lands on one partition). Go to the triage tree.

The screenshot-worthy distinction: a distribution problem raises the whole floor; a hot key spikes one partition. They look similar on a dashboard and have opposite fixes.

Mental model
Consistent hashing solves distribution, never hotness

This is the single most common confusion at the L4→L5 boundary. Consistent hashing and virtual nodes make sure keys are spread evenly across nodes. They do absolutely nothing about the fact that all the traffic for one key goes to one node. A celebrity user, a viral channel, a flash-sale SKU — these are single keys, and their heat is a property of the workload, not your hashing.

So when one shard is melting, resharding is the wrong reflex: the hot key just moves to a different single shard. You need a different class of fix entirely — and which one depends on whether the key is hot on reads or writes.

Use it when: Any time a single partition is hot. First ask: is this the whole fleet creeping up (distribution) or one shard spiking (a hot key)? Only the second needs the triage tree.

The Hot-Key Triage tree

When the Skew Budget says “one partition, spiking,” run this. The first branch — read-hot vs write-hot — splits the whole solution space, because a hot read is a caching/replication problem and a hot write is a sharding/data-model problem.

Hot-Key Triage
Is the key hot on READS or WRITES?
Reads
Can the value tolerate seconds of staleness?
Yes
Cache itEdge/CDN or local in-memory cache; replicate the key to many nodes. Cheapest fix.
No — must be fresh
Replicate readsFan the key's reads across dedicated replicas; accept the replication cost.
Writes
Can the key be split into independent sub-keys?
Yes (e.g. a counter)
Shard the keySalt into key#0..key#K across partitions, aggregate on read. Classic counter-sharding.
No — single ordered entity
Dedicated partition + bufferGive it its own partition, put a queue/batcher in front to absorb bursts, or rethink the data model.

Read-hot → caching/replication. Write-hot → splitting/buffering. The branch you take is forced by the access pattern, not preference.

counter-sharding — the canonical write-hot fix
1// A single hot counter ("likes on the viral post") serializes every
2// increment onto one partition. Split it into K shards, each on its own
3// partition, and sum on read. Writes fan out; reads pay a small gather.
4const shards = 16
5
6func incrementLikes(postID string) {
7 shard := rand.Intn(shards)
8 db.Incr(fmt.Sprintf("likes:%s#%d", postID, shard)) // lands on K partitions
9}
10
11func readLikes(postID string) int64 {
12 var total int64
13 for s := 0; s < shards; s++ {
14 total += db.Get(fmt.Sprintf("likes:%s#%d", postID, s))
15 }
16 return total // the read got K× more expensive; the write stopped melting one shard
17}
How this fails in production · Discord

Hot partitions at trillions of messages

The setup
Discord stored messages in Cassandra, partitioned by channel and a time bucket. Most channels are small and quiet. But some channels — huge public servers, a celebrity announcement channel — receive an enormous, concentrated share of all message traffic.
What happened
Those few enormous channels became hot partitions: a single partition receiving so many reads and writes that it couldn’t keep up. Cassandra’s performance on those partitions degraded badly — high latencies, and the dreaded behavior where one hot partition’s struggles spill over and affect the nodes hosting it. The data was evenly partitioned by channel; the load was not, because channel popularity is wildly skewed.
The moment it went wrong
The partitioning key (channel + time bucket) was correct for the median channel and catastrophic for the head of the distribution. No partitioning scheme makes a power-law workload uniform — the hottest channels were always going to concentrate. Discord ultimately moved to ScyllaDB and invested heavily in handling exactly this: detecting hot partitions and shielding the system from them (including request-coalescing so a thundering herd of identical reads collapses into one).
The transferable lesson

Choose your partition key for the tail of the distribution, not the median. Then assume the head will still concentrate, and build the hot-partition defenses in from the start: detection, request coalescing for hot reads, and a plan to split or buffer hot writes. Even partitioning is necessary and nowhere near sufficient.

Discord Engineering — How Discord Stores Trillions of Messages

What this sounds like in an interview

Calibration ladder · L3 → L6

How would you shard the messages table for a chat app like Slack or Discord?

The interviewer wants the scheme — and then wants to see if you anticipate the hot channel before they have to bring it up.

L3 · Junior

I'd shard by message ID, maybe hash it across N database shards so the load is spread evenly.

Missed: Sharding by message ID spreads load but makes 'fetch this channel's recent messages' a scatter-gather across every shard — the most common read is now the most expensive.
L4 · Mid

I'd shard by channel ID so all of a channel's messages live together and are cheap to fetch in order. I'd use consistent hashing so adding shards doesn't reshuffle everything.

Missed: Correct scheme, but treats it as solved. Hasn't anticipated the hot channel, which is the entire difficulty of this problem.
L5 · Senior

Shard by channel ID (keeps a channel's history together and ordered), consistent hashing with virtual nodes for even distribution and cheap rebalancing. But channel popularity follows a power law, so I'd plan for hot channels from day one: detect a partition exceeding a skew budget, and for a hot channel use request coalescing on reads and a time-bucketed sub-partition so a single huge channel's writes spread across partitions instead of melting one.

Missed: Strong — names the power law and the right defenses. Missing the explicit framing that the design exists for the tail, and the data-model lever (time bucketing) that bounds a giant partition's size.
L6 · Staff

Same channel-ID + consistent-hashing base, but I'd frame the whole thing around the head of the distribution. The median channel is trivial; the design exists for the top 0.1% of channels. So: (1) partition by channel plus a time bucket so even a giant channel's data is bounded per partition and old buckets age out cheaply; (2) a hot-partition detector wired to the skew budget that distinguishes a distribution problem from a hot-key problem, because they have opposite fixes; (3) request coalescing so 50k identical reads of the viral channel become one backend read; (4) for write-hot, accept that a single ordered channel can't be infinitely split, so a per-channel write buffer that batches. The trade I'm making is read-path complexity (gather across buckets, coalescing) to protect the write path of the few channels that would otherwise take down shared infrastructure.

What scored L6

Designed for the head of the distribution from the first sentence, distinguished distribution-skew from hot-key (opposite fixes), and named concrete defenses — coalescing, bucketing, write buffering — with the read-vs-write trade made explicit. That's someone who has operated a system like this.

When NOT to use this
Don't shard before you have to

Partitioning multiplies your operational complexity: cross-shard queries, distributed transactions, rebalancing, hot keys. A single well-indexed Postgres handles far more than people assume (tens of thousands of writes/sec, terabytes). Shard when you have a measured capacity wall, not because it feels like what real systems do. Premature sharding buys you all the distributed-systems problems and none of the scale benefit.

Don't hash-partition data you need to range-scan

Hashing destroys ordering, so “all events between T1 and T2” becomes a query against every shard. If your dominant access pattern is range scans (time series, ledgers, leaderboards), range partitioning is right despite its sequential-write hotspot — which you solve by salting the range prefix, not by hashing the whole key.

Don't use consistent hashing without virtual nodes

Plain consistent hashing with one position per node has bad luck built in: arcs are uneven, and removing a node dumps its entire arc onto a single neighbor. Virtual nodes (100–200 per node) are not optional polish — they’re what makes the distribution even and rebalancing smooth. Skipping them recreates the hotspot you sharded to avoid.

Don't reshard to fix a hot key

A hot key is one key taking disproportionate traffic. Resharding redistributes keys across nodes — the hot key just lands on a different single node and melts that one instead. Hot keys need caching, replication, splitting, or buffering (the triage tree), never a different partition count.

Exercises

Exercise · Design scenario
Design partitioning for a URL shortener that does 100k redirects/sec (reads) and 1k creates/sec (writes), where 1% of links account for 90% of redirect traffic (a few viral links). Specify the partition key and scheme for the links table, how you handle the viral links, and what your skew budget is. Then say what changes if creates must return a guaranteed-unique short code.
Exercise · Implementation task
In 04-consistent-hashing, add key-splitting: a HotKeyManager that, when the detector flags a key, starts routing it as key#0..#K across K ring positions and provides a GetAll(key) that gathers from all K. Add a test showing the hot key’s load now spans multiple nodes while cold keys are untouched.
Exercise · Find the race
This resharding routine rebuilds the ring when a node is added. It runs while Get is being called by request handlers on other goroutines. Find the window where a key resolves to a node that doesn’t have its data.
reshard.go — shipped, subtly broken
1func (r *Ring) AddAndReshard(node string) {
2 // add the new node's vnodes to the live ring, in place
3 for i := 0; i < r.replicas; i++ {
4 pos := hashKey(strconv.Itoa(i) + "#" + node)
5 r.sorted = append(r.sorted, pos) // mutates the slice Get() is reading
6 r.owner[pos] = node // mutates the map Get() is reading
7 }
8 sort.Slice(r.sorted, ...) // Get() may observe a half-sorted ring
9 migrateData(node) // ...and data hasn't moved yet anyway
10}
DimensionRangeHash (mod N)Consistent hashHash + key-split
Range scansCheap — sortedImpossible — scatteredImpossible — scatteredImpossible
Write distributionSequential-key hotspotEvenEven (with vnodes)Even, even for hot keys
Resharding costSplit/merge rangesCatastrophic — ~all keys move~K/N keys move~K/N keys move
Hot-key handlingSalt the prefixNone built inDetect, then handle separatelySplitting built in
Operational complexityLow; manage boundariesLow until you reshardModerate (vnodes, ring)High (gather reads)
Choose whenRange scans dominate — time series, ledgers, leaderboards. Accept the sequential hotspot and salt the prefix.A fixed, never-changing shard count and point lookups only. Rare; the resharding cliff makes this a trap.Point lookups with an evolving fleet — the default for sharded key-value and document stores.You have known hot keys (counters, viral entities) and can pay gather-on-read to keep writes from concentrating.
Verdict

Consistent hashing with virtual nodes is the default for point-lookup workloads with a changing fleet. Use range only when range scans dominate, and reach for key-splitting surgically, for the specific keys the detector flags — not globally. Naive hash % N is almost always a future outage disguised as simplicity.

Walk away with this
  • 01Partitioning has two separate questions: how you split (range vs hash vs consistent hashing) and what happens when one key gets hot. The second is harder and more commonly missed.
  • 02Consistent hashing + virtual nodes gives even distribution and cheap rebalancing (~K/N keys move on a fleet change, vs ~all for hash % N). Vnodes are mandatory, not optional.
  • 03The Skew Budget (max ÷ mean partition load) turns “evenly distributed” into a measured threshold — and distinguishes a distribution problem (whole fleet creeps up) from a hot key (one shard spikes). Opposite fixes.
  • 04Consistent hashing spreads keys; it does nothingabout one hot key. Run the Hot-Key Triage tree: read-hot → cache/replicate, write-hot → split/buffer. Never reshard to fix a hot key.
  • 05Choose the partition key for the tail of the distribution (Discord’s giant channels), then build hot-partition defenses — detection, request coalescing, splitting — from the start.