Distributed Locks
Real Incident: Ticket Booking System, 2019
A flight booking system sells Seat 14A to 47 people simultaneously. Two app servers both read available = true at the same millisecond, both wrote available = false. Each thought it won the race. Result: 47 angry customers, refunds, and a P0 postmortem. This is what happens without coordination.
The 30-Second Explanation
Distributed lock = a mechanism that ensures only ONE process across multiple servers can access a shared resource at a time.
Single Machine
Mutex/synchronized — trivial. OS handles it.
Distributed
Network failures, clock skew, process crashes mid-lock. Much harder.
The key insight: The lock itself must be distributed AND fault-tolerant. A single Redis node as your lock server is a single point of failure.
The Bathroom Key Analogy
Single-stall bathroom in a coffee shop. Only one key.
- Acquire lock = take the key from the counter
- Hold lock = you're in the bathroom
- Release lock = return the key
- TTL = if you don't return in 5 min, staff opens with master key (prevents deadlock)
- Fencing token = your receipt number. If someone got a NEWER receipt, your key is invalid.
When You Need Distributed Locks
| Use Case | What Breaks Without It |
|---|---|
| Inventory/Booking | Overselling (same seat sold twice) |
| Payment Processing | Double-charge (same transaction processed twice) |
| Leader Election | Split brain (two leaders, data diverges) |
| Cron Jobs | Duplicate execution across instances |
| Rate Limiting | Inaccurate counts (race on counter) |
| File/Resource Access | Corruption (concurrent writes) |
Implementation Options
| Approach | How | Pros | Cons |
|---|---|---|---|
| Redis (SETNX) | SET key value NX EX 30 | Fast, simple, widely used | Not safe during Redis failover |
| Redlock | Acquire on majority (3/5) Redis nodes | Tolerates single node failure | Controversial (clock assumptions) |
| ZooKeeper | Ephemeral sequential znodes | Strong guarantees, auto-release on crash | Complex, higher latency |
| etcd | Lease-based with Raft consensus | Linearizable, built into K8s | Slower than Redis |
| Database (SELECT FOR UPDATE) | Row-level lock in RDBMS | No extra infra needed | Doesn't scale, long lock = blocked conns |
The Redlock Controversy
Martin Kleppmann (author of "Designing Data-Intensive Applications") vs. Salvatore Sanfilippo (Redis creator).
| Kleppmann's Argument | Antirez's Response |
|---|---|
| Clock drift can violate safety | Clock drift is bounded in practice |
| GC pauses can outlive TTL | Use fencing tokens |
| Redlock isn't formally proven safe | It's practically safe for real workloads |
The takeaway for interviews: Mention the controversy. Say "For correctness-critical locks (payment, inventory), I'd use ZooKeeper/etcd. For performance-critical but best-effort locks (deduplication, cron), Redis SETNX with TTL is fine."
Properties of a Good Distributed Lock
| Property | What It Means |
|---|---|
| Mutual Exclusion | At most one client holds the lock at any time |
| Deadlock Freedom | Lock always eventually becomes available (TTL) |
| Fault Tolerance | Lock service survives node failures |
| Fencing | Stale lock holders can't corrupt data (fencing tokens) |
Fencing Tokens — The Detail That Shows Seniority
Problem: Client A acquires lock, then has a long GC pause. Lock TTL expires. Client B acquires lock. Client A wakes up, thinks it still has the lock, writes data.
Solution: Fencing token = monotonically increasing number given with each lock acquisition.
- Client A gets lock with token 34
- Lock expires, Client B gets lock with token 35
- Client A wakes up, sends write with token 34
- Resource rejects token 34 because it's seen token 35
Interview Gold
"I'd use fencing tokens to handle the GC-pause scenario. Each lock acquisition gets a monotonically increasing token. The protected resource rejects operations with stale tokens. This is how Google's Chubby lock service works."
Decision Framework
| Need | Use |
|---|---|
| Fast, best-effort (dedup, cron) | Redis SETNX + TTL |
| Correctness-critical (payments) | ZooKeeper or etcd |
| Already have a database, low scale | SELECT FOR UPDATE |
| Multi-region | Spanner's TrueTime locks |
| Kubernetes-native | etcd (already there) |
The 3 Mistakes That Get You Rejected
Don't Say These
- "Just use Redis SETNX" — Without discussing TTL, fencing tokens, and what happens during Redis failover, this answer is incomplete and shows junior thinking.
- "Set a long TTL to be safe" — Long TTL = long deadlock if the holder crashes. Short TTL = risk of premature expiry. The right answer is short TTL + renewal (heartbeat).
- "Locks solve all concurrency problems" — Locks are ONE tool. Sometimes optimistic concurrency (CAS), idempotency keys, or queue-based serialization is better.
Interview Answer Template
"For [use case], I'd use [Redis/ZooKeeper/etcd] because [correctness vs performance trade-off]. The lock would have a TTL of [N]s with heartbeat renewal every [N/3]s. For safety against GC pauses and clock drift, I'd use fencing tokens. The protected resource rejects operations with stale tokens."
Quick Recall Card
| Question | Answer |
|---|---|
| Simplest distributed lock? | Redis SETNX with TTL |
| Safest distributed lock? | ZooKeeper (ephemeral znodes) or etcd (leases) |
| What's Redlock? | Acquire on majority (⅗) Redis nodes |
| What's a fencing token? | Monotonically increasing number to detect stale locks |
| TTL too short? | Premature expiry → two holders |
| TTL too long? | Deadlock if holder crashes |
| How to fix TTL dilemma? | Short TTL + heartbeat renewal |