Consistent Hashing
Real Incident: Memcached at Facebook, 2010
Facebook had 1000+ Memcached servers. Adding one server with naive hash % N invalidated billions of cache entries simultaneously. Thundering herd → database crushed → partial outage. They switched to consistent hashing. Problem solved.
Why This Comes Up in Interviews
Every time you say "let's distribute data across N servers" in a system design interview, the interviewer is waiting for you to explain HOW. If you say hash % N, you've failed. Consistent hashing is the foundation of:
- Distributed caches (Memcached, Redis Cluster)
- Distributed databases (DynamoDB, Cassandra)
- CDNs (Akamai)
- Load balancers (consistent routing)
- Distributed file systems
The Problem: Why hash % N Breaks
Setup: 4 cache servers. Key "user:123" → hash("user:123") % 4 = 2 → goes to Server 2.
Now add a 5th server: hash("user:123") % 5 = 3 → goes to Server 3.
| Servers | Keys that MOVE | Cache Hit Rate After |
|---|---|---|
| 4 → 5 | ~80% of all keys | ~20% (catastrophic) |
| 10 → 11 | ~90% of all keys | ~10% |
| 100 → 101 | ~99% of all keys | ~1% |
The math: When going from N to N+1 servers, (N)/(N+1) fraction of keys remap. For N=100, that's 99% of keys moving.
What happens in production:
- 99% of cache keys miss simultaneously
- All requests flood the database
- Database can't handle 100x normal load
- Cascading failure → outage
This is called a thundering herd or cache stampede. Consistent hashing reduces key movement from ~100% to ~1/N.
How the Hash Ring Works
The idea: Instead of hash % N, place both servers and keys on a circular hash space (0 to 2³² - 1). Each key is assigned to the first server encountered moving clockwise.
Step by step:
-
Hash each server name to get its position on the ring
hash("ServerA") = 0x3A...→ position 230°hash("ServerB") = 0x7F...→ position 120°hash("ServerC") = 0xB2...→ position 45°
-
Hash each key to get its position
hash("user:123") = 0x55...→ position 180°
-
Walk clockwise from key position → first server hit owns that key
- "user:123" at 180° → walks clockwise → hits ServerA at 230°
Adding/Removing a Server
Add ServerD at position 200°:
- Only keys between 180° and 200° move (from ServerA to ServerD)
- All other keys stay exactly where they are
- ~1/N keys move instead of ~100%
Remove ServerB at position 120°:
- Only ServerB's keys (between previous server and 120°) move to next server clockwise
- Everything else untouched
Back-of-envelope: With 100 servers and 10 billion keys, adding a server moves ~100M keys (1%). With hash % N, you'd move ~9.9 billion keys (99%).
Virtual Nodes — Why You Always Need Them
The problem with basic consistent hashing: With 3 servers on a ring, one might own 60% of the space, another 25%, another 15%. Distribution is wildly uneven.
Mathematical reason: When placing N points randomly on a circle, the expected maximum gap is O(log N / N), but variance is very high for small N.
Solution: Virtual Nodes (Vnodes)
Each physical server gets 100-200 positions on the ring:
ServerA → hash("ServerA-0"), hash("ServerA-1"), ..., hash("ServerA-149")
ServerB → hash("ServerB-0"), hash("ServerB-1"), ..., hash("ServerB-149")
| Vnodes per server | Load variance | Key redistribution on removal |
|---|---|---|
| 1 | 50%+ (unusable) | All to one neighbor |
| 10 | ~20% | Somewhat spread |
| 100 | ~10% | Well distributed |
| 150 | ~5-7% | Production standard |
| 200 | <5% | Diminishing returns beyond this |
Critical insight for interviews: When a server with 150 vnodes is removed, its keys are distributed across ALL remaining servers (not just one neighbor). Each remaining server absorbs roughly 1/(N-1) of the dead server's load.
Back-of-Envelope: Capacity Planning
Scenario: Design a distributed cache for 50M keys across 20 servers.
| Parameter | Value | Reasoning |
|---|---|---|
| Keys per server | ~2.5M | 50M / 20 (with vnodes, ±5%) |
| Vnodes per server | 150 | Standard for <5% variance |
| Total ring positions | 3000 | 20 × 150 |
| Keys moved on add | ~2.5M | 50M / 20 = 1/N |
| Keys moved on remove | ~2.5M | Same, spread across 19 remaining |
| Memory for ring metadata | ~150KB | 3000 entries × (hash + server pointer) |
Replication on the Ring
For fault tolerance, replicate each key to the next K distinct physical servers clockwise.
| Replication Factor | Behavior | Used By |
|---|---|---|
| RF = 1 | No redundancy. Node death = data loss. | Test environments |
| RF = 3 | Survives 2 simultaneous failures. Standard. | DynamoDB, Cassandra |
| RF = 3 + rack-aware | Next 3 servers in DIFFERENT racks/AZs | Production systems |
Why "distinct physical servers": With vnodes, consecutive ring positions might belong to the same physical server. You must skip until you find K distinct physical machines.
Consistency trade-off (connects to CAP):
- Write to W replicas before ACK
- Read from R replicas
- If W + R > RF → strong consistency (guaranteed overlap)
- DynamoDB default: RF=3, W=2, R=2 → strongly consistent reads available
Handling Hotspots
Problem: Even with vnodes, some keys are naturally "hot" (celebrity accounts, viral content).
| Strategy | How | Used By |
|---|---|---|
| Micro-sharding | Split hot key into sub-keys (user:123:0, user:123:1, ...) | |
| Read replicas per partition | Multiple read replicas for hot partition | DynamoDB |
| Application-level caching | L1 cache in front of distributed cache | Facebook TAO |
| Key-aware routing | Route hot keys to beefier nodes | Custom solutions |
How Real Systems Implement It
| System | Implementation Details |
|---|---|
| DynamoDB | Consistent hashing with "partition splits" — hot partitions automatically split. No manual vnodes. |
| Cassandra | Token ring. Each node owns a range. Vnodes (default 256 per node). Murmur3 hash. |
| Memcached | Client-side consistent hashing (ketama algorithm). Server doesn't know about ring. |
| Redis Cluster | 16384 hash slots. Slots assigned to nodes. Not classic consistent hashing but similar concept. |
| Akamai CDN | Consistent hashing for content routing to edge servers. |
| Discord | Route messages to guild servers using consistent hashing on guild_id. |
Consistent Hashing vs Alternatives
| Approach | Key Movement | Complexity | Best For |
|---|---|---|---|
| hash % N | ~100% on any change | Trivial | Never changes (fixed N) |
| Consistent Hashing | ~1/N | Medium | Caches, DBs, CDNs |
| Rendezvous (HRW) Hashing | ~1/N | Medium | When K replicas needed elegantly |
| Jump Consistent Hash | ~1/N | Low | Sequential server IDs, no removal |
Rendezvous hashing (Highest Random Weight): For each key, compute score with every server. Pick highest. Elegant for K replicas (pick top K). Used by some CDNs. Downside: O(N) per lookup vs O(log N) for ring.
Interview Framework: How to Present This
When the interviewer asks "How do you distribute data across servers?":
Step 1 — State the problem: "With simple modular hashing, adding or removing any server remaps nearly all keys, causing a cache stampede that can take down the database."
Step 2 — Introduce the ring: "I'd use consistent hashing — place servers and keys on a hash ring. Each key belongs to the next server clockwise. Adding a server only moves ~1/N of keys."
Step 3 — Address distribution: "To ensure even load, each server gets 100-200 virtual nodes on the ring, reducing variance to under 5%."
Step 4 — Replication: "For fault tolerance, each key is replicated to the next 2-3 distinct physical servers clockwise, giving us RF=3."
Step 5 — Connect to consistency: "With RF=3, I can configure W=2, R=2 for strong consistency, or W=1, R=1 for availability-optimized eventual consistency."
Common Follow-Up Questions
| Question | Strong Answer |
|---|---|
| "What if a server is slow but not dead?" | "Virtual node reassignment — temporarily remove its vnodes from the ring. Or use a gossip protocol with phi-accrual failure detection." |
| "How do you handle a celebrity/hot key?" | "Micro-shard: append random suffix to key, scatter reads across multiple partitions, aggregate at application layer." |
| "How does this work with auto-scaling?" | "Controlled scaling — add one server at a time, let data rebalance (~1/N movement), then add next. Monitor rebalance completion before proceeding." |
| "Memory overhead of the ring?" | "Negligible. 200 servers × 150 vnodes = 30K entries. Each entry is a hash + pointer. Under 1MB total." |
Quick Recall
| Question | Answer |
|---|---|
| Why not hash % N? | Adding 1 server remaps ~100% of keys → cache stampede |
| How much moves with consistent hashing? | ~1/N keys (N = number of servers) |
| Virtual nodes purpose? | Even distribution + spread load on failure |
| How many vnodes? | 100-200 per server (<5% variance) |
| Replication? | Next K distinct physical servers clockwise |
| Strong consistency formula? | W + R > RF (e.g., W=2, R=2, RF=3) |
| Real systems? | DynamoDB, Cassandra, Memcached, Redis Cluster, Akamai |