Partitioning & the Physics of Hot Keys
Hash vs range, consistent hashing, rebalancing, and the one key that melts a shard
Partitioning splits data across machines so the system scales past one box. Two questions decide everything: how you split (the partitioning scheme) and what happens when one piece gets hot(the part everyone underestimates). This module covers both, with a consistent-hash ring you can poke and a triage tree for the 8 PM page.
How to split: range, hash, and the ring
Range partitioning keeps keys sorted and splits by boundaries (A–F on shard 1, G–M on shard 2). Range scans are cheap (“all orders in March” hits one shard), but sequential keys — timestamps, auto-increment IDs — pile every new write onto the last shard. Hash partitioning scatters keys by hash(key), which spreads writes evenly but destroys range scans (adjacent keys land anywhere). The catch with the naive version, hash(key) % N: change N and almost every key remaps — resharding becomes a full data migration, i.e. an outage.
Consistent hashing fixes the resharding problem. Place nodes and keys on a ring; each key belongs to the next node clockwise. Add or remove a node and only the keys in that node’s arc move — about K/N of them, not all. Drag the fleet size below and watch how few keys move, versus what modulo would do:
hash(key) % N, changing N reshuffles almost everything, which is why it’s an outage to reshard.courses/distributed-systems/reference-impl/04-consistent-hashing/A consistent-hash ring with virtual nodes and a hot-key detector. The demo proves adding a node moves ~25% of keys (vs ~75% for modulo), that virtual nodes flatten skew, and that a viral key is detected even though the ring placed it perfectly evenly. go run .
The Skew Budget: distribution is a number, not a vibe
“Evenly distributed” is not a yes/no. It’s a ratio you measure and bound. The Skew Budget is the most-loaded partition’s load divided by the average, and the maximum value you’re willing to tolerate before you act. Pick the number up front (say 1.5×), measure it continuously, and treat a breach as a signal — not a surprise during an incident.
The Skew Budget
max-partition-load ÷ mean-partition-load. Set the ceiling, measure it, and know what each breach means.
Ratio ≈ 1.0–1.5× and stable — healthy. Virtual nodes are doing their job; key distribution is even.
Ratio creeping up across all partitions — a distribution problem. Too few virtual nodes, or a partitioning key with low cardinality. Fix the scheme.
One partition spiking while the rest are flat — a hot key, not a distribution problem. Re-sharding won’t help (the key still lands on one partition). Go to the triage tree.
The screenshot-worthy distinction: a distribution problem raises the whole floor; a hot key spikes one partition. They look similar on a dashboard and have opposite fixes.
Consistent hashing solves distribution, never hotness
This is the single most common confusion at the L4→L5 boundary. Consistent hashing and virtual nodes make sure keys are spread evenly across nodes. They do absolutely nothing about the fact that all the traffic for one key goes to one node. A celebrity user, a viral channel, a flash-sale SKU — these are single keys, and their heat is a property of the workload, not your hashing.
So when one shard is melting, resharding is the wrong reflex: the hot key just moves to a different single shard. You need a different class of fix entirely — and which one depends on whether the key is hot on reads or writes.
The Hot-Key Triage tree
When the Skew Budget says “one partition, spiking,” run this. The first branch — read-hot vs write-hot — splits the whole solution space, because a hot read is a caching/replication problem and a hot write is a sharding/data-model problem.
Read-hot → caching/replication. Write-hot → splitting/buffering. The branch you take is forced by the access pattern, not preference.
1// A single hot counter ("likes on the viral post") serializes every2// increment onto one partition. Split it into K shards, each on its own3// partition, and sum on read. Writes fan out; reads pay a small gather.4const shards = 165 6func incrementLikes(postID string) {7 shard := rand.Intn(shards)8 db.Incr(fmt.Sprintf("likes:%s#%d", postID, shard)) // lands on K partitions9}10 11func readLikes(postID string) int64 {12 var total int6413 for s := 0; s < shards; s++ {14 total += db.Get(fmt.Sprintf("likes:%s#%d", postID, s))15 }16 return total // the read got K× more expensive; the write stopped melting one shard17}Hot partitions at trillions of messages
Choose your partition key for the tail of the distribution, not the median. Then assume the head will still concentrate, and build the hot-partition defenses in from the start: detection, request coalescing for hot reads, and a plan to split or buffer hot writes. Even partitioning is necessary and nowhere near sufficient.
What this sounds like in an interview
How would you shard the messages table for a chat app like Slack or Discord?
The interviewer wants the scheme — and then wants to see if you anticipate the hot channel before they have to bring it up.
I'd shard by message ID, maybe hash it across N database shards so the load is spread evenly.
I'd shard by channel ID so all of a channel's messages live together and are cheap to fetch in order. I'd use consistent hashing so adding shards doesn't reshuffle everything.
Shard by channel ID (keeps a channel's history together and ordered), consistent hashing with virtual nodes for even distribution and cheap rebalancing. But channel popularity follows a power law, so I'd plan for hot channels from day one: detect a partition exceeding a skew budget, and for a hot channel use request coalescing on reads and a time-bucketed sub-partition so a single huge channel's writes spread across partitions instead of melting one.
Same channel-ID + consistent-hashing base, but I'd frame the whole thing around the head of the distribution. The median channel is trivial; the design exists for the top 0.1% of channels. So: (1) partition by channel plus a time bucket so even a giant channel's data is bounded per partition and old buckets age out cheaply; (2) a hot-partition detector wired to the skew budget that distinguishes a distribution problem from a hot-key problem, because they have opposite fixes; (3) request coalescing so 50k identical reads of the viral channel become one backend read; (4) for write-hot, accept that a single ordered channel can't be infinitely split, so a per-channel write buffer that batches. The trade I'm making is read-path complexity (gather across buckets, coalescing) to protect the write path of the few channels that would otherwise take down shared infrastructure.
Designed for the head of the distribution from the first sentence, distinguished distribution-skew from hot-key (opposite fixes), and named concrete defenses — coalescing, bucketing, write buffering — with the read-vs-write trade made explicit. That's someone who has operated a system like this.
Don't shard before you have to
Partitioning multiplies your operational complexity: cross-shard queries, distributed transactions, rebalancing, hot keys. A single well-indexed Postgres handles far more than people assume (tens of thousands of writes/sec, terabytes). Shard when you have a measured capacity wall, not because it feels like what real systems do. Premature sharding buys you all the distributed-systems problems and none of the scale benefit.
Don't hash-partition data you need to range-scan
Hashing destroys ordering, so “all events between T1 and T2” becomes a query against every shard. If your dominant access pattern is range scans (time series, ledgers, leaderboards), range partitioning is right despite its sequential-write hotspot — which you solve by salting the range prefix, not by hashing the whole key.
Don't use consistent hashing without virtual nodes
Plain consistent hashing with one position per node has bad luck built in: arcs are uneven, and removing a node dumps its entire arc onto a single neighbor. Virtual nodes (100–200 per node) are not optional polish — they’re what makes the distribution even and rebalancing smooth. Skipping them recreates the hotspot you sharded to avoid.
Don't reshard to fix a hot key
A hot key is one key taking disproportionate traffic. Resharding redistributes keys across nodes — the hot key just lands on a different single node and melts that one instead. Hot keys need caching, replication, splitting, or buffering (the triage tree), never a different partition count.
Exercises
04-consistent-hashing, add key-splitting: a HotKeyManager that, when the detector flags a key, starts routing it as key#0..#K across K ring positions and provides a GetAll(key) that gathers from all K. Add a test showing the hot key’s load now spans multiple nodes while cold keys are untouched.Get is being called by request handlers on other goroutines. Find the window where a key resolves to a node that doesn’t have its data.1func (r *Ring) AddAndReshard(node string) {2 // add the new node's vnodes to the live ring, in place3 for i := 0; i < r.replicas; i++ {4 pos := hashKey(strconv.Itoa(i) + "#" + node)5 r.sorted = append(r.sorted, pos) // mutates the slice Get() is reading6 r.owner[pos] = node // mutates the map Get() is reading7 }8 sort.Slice(r.sorted, ...) // Get() may observe a half-sorted ring9 migrateData(node) // ...and data hasn't moved yet anyway10}| Dimension | Range | Hash (mod N) | Consistent hash | Hash + key-split |
|---|---|---|---|---|
| Range scans | Cheap — sorted | Impossible — scattered | Impossible — scattered | Impossible |
| Write distribution | Sequential-key hotspot | Even | Even (with vnodes) | Even, even for hot keys |
| Resharding cost | Split/merge ranges | Catastrophic — ~all keys move | ~K/N keys move | ~K/N keys move |
| Hot-key handling | Salt the prefix | None built in | Detect, then handle separately | Splitting built in |
| Operational complexity | Low; manage boundaries | Low until you reshard | Moderate (vnodes, ring) | High (gather reads) |
| Choose when | Range scans dominate — time series, ledgers, leaderboards. Accept the sequential hotspot and salt the prefix. | A fixed, never-changing shard count and point lookups only. Rare; the resharding cliff makes this a trap. | Point lookups with an evolving fleet — the default for sharded key-value and document stores. | You have known hot keys (counters, viral entities) and can pay gather-on-read to keep writes from concentrating. |
Consistent hashing with virtual nodes is the default for point-lookup workloads with a changing fleet. Use range only when range scans dominate, and reach for key-splitting surgically, for the specific keys the detector flags — not globally. Naive hash % N is almost always a future outage disguised as simplicity.
- 01Partitioning has two separate questions: how you split (range vs hash vs consistent hashing) and what happens when one key gets hot. The second is harder and more commonly missed.
- 02Consistent hashing + virtual nodes gives even distribution and cheap rebalancing (~K/N keys move on a fleet change, vs ~all for
hash % N). Vnodes are mandatory, not optional. - 03The Skew Budget (max ÷ mean partition load) turns “evenly distributed” into a measured threshold — and distinguishes a distribution problem (whole fleet creeps up) from a hot key (one shard spikes). Opposite fixes.
- 04Consistent hashing spreads keys; it does nothingabout one hot key. Run the Hot-Key Triage tree: read-hot → cache/replicate, write-hot → split/buffer. Never reshard to fix a hot key.
- 05Choose the partition key for the tail of the distribution (Discord’s giant channels), then build hot-partition defenses — detection, request coalescing, splitting — from the start.