Skip to content
3 min read

Distributed Locks

Real Incident: Ticket Booking System, 2019

A flight booking system sells Seat 14A to 47 people simultaneously. Two app servers both read available = true at the same millisecond, both wrote available = false. Each thought it won the race. Result: 47 angry customers, refunds, and a P0 postmortem. This is what happens without coordination.


The 30-Second Explanation

Distributed lock = a mechanism that ensures only ONE process across multiple servers can access a shared resource at a time.

🔑

Single Machine

Mutex/synchronized — trivial. OS handles it.

🌐

Distributed

Network failures, clock skew, process crashes mid-lock. Much harder.

The key insight: The lock itself must be distributed AND fault-tolerant. A single Redis node as your lock server is a single point of failure.


The Bathroom Key Analogy

Single-stall bathroom in a coffee shop. Only one key.

  • Acquire lock = take the key from the counter
  • Hold lock = you're in the bathroom
  • Release lock = return the key
  • TTL = if you don't return in 5 min, staff opens with master key (prevents deadlock)
  • Fencing token = your receipt number. If someone got a NEWER receipt, your key is invalid.

When You Need Distributed Locks

Use Case What Breaks Without It
Inventory/Booking Overselling (same seat sold twice)
Payment Processing Double-charge (same transaction processed twice)
Leader Election Split brain (two leaders, data diverges)
Cron Jobs Duplicate execution across instances
Rate Limiting Inaccurate counts (race on counter)
File/Resource Access Corruption (concurrent writes)

Implementation Options

Approach How Pros Cons
Redis (SETNX)SET key value NX EX 30Fast, simple, widely usedNot safe during Redis failover
RedlockAcquire on majority (3/5) Redis nodesTolerates single node failureControversial (clock assumptions)
ZooKeeperEphemeral sequential znodesStrong guarantees, auto-release on crashComplex, higher latency
etcdLease-based with Raft consensusLinearizable, built into K8sSlower than Redis
Database (SELECT FOR UPDATE)Row-level lock in RDBMSNo extra infra neededDoesn't scale, long lock = blocked conns

The Redlock Controversy

Martin Kleppmann (author of "Designing Data-Intensive Applications") vs. Salvatore Sanfilippo (Redis creator).

Kleppmann's Argument Antirez's Response
Clock drift can violate safety Clock drift is bounded in practice
GC pauses can outlive TTL Use fencing tokens
Redlock isn't formally proven safe It's practically safe for real workloads

The takeaway for interviews: Mention the controversy. Say "For correctness-critical locks (payment, inventory), I'd use ZooKeeper/etcd. For performance-critical but best-effort locks (deduplication, cron), Redis SETNX with TTL is fine."


Properties of a Good Distributed Lock

Property What It Means
Mutual Exclusion At most one client holds the lock at any time
Deadlock Freedom Lock always eventually becomes available (TTL)
Fault Tolerance Lock service survives node failures
Fencing Stale lock holders can't corrupt data (fencing tokens)

Fencing Tokens — The Detail That Shows Seniority

Problem: Client A acquires lock, then has a long GC pause. Lock TTL expires. Client B acquires lock. Client A wakes up, thinks it still has the lock, writes data.

Solution: Fencing token = monotonically increasing number given with each lock acquisition.

  • Client A gets lock with token 34
  • Lock expires, Client B gets lock with token 35
  • Client A wakes up, sends write with token 34
  • Resource rejects token 34 because it's seen token 35

Interview Gold

"I'd use fencing tokens to handle the GC-pause scenario. Each lock acquisition gets a monotonically increasing token. The protected resource rejects operations with stale tokens. This is how Google's Chubby lock service works."


Decision Framework

Need Use
Fast, best-effort (dedup, cron) Redis SETNX + TTL
Correctness-critical (payments) ZooKeeper or etcd
Already have a database, low scale SELECT FOR UPDATE
Multi-region Spanner's TrueTime locks
Kubernetes-native etcd (already there)

The 3 Mistakes That Get You Rejected

Don't Say These

  1. "Just use Redis SETNX" — Without discussing TTL, fencing tokens, and what happens during Redis failover, this answer is incomplete and shows junior thinking.
  2. "Set a long TTL to be safe" — Long TTL = long deadlock if the holder crashes. Short TTL = risk of premature expiry. The right answer is short TTL + renewal (heartbeat).
  3. "Locks solve all concurrency problems" — Locks are ONE tool. Sometimes optimistic concurrency (CAS), idempotency keys, or queue-based serialization is better.

Interview Answer Template

"For [use case], I'd use [Redis/ZooKeeper/etcd] because [correctness vs performance trade-off]. The lock would have a TTL of [N]s with heartbeat renewal every [N/3]s. For safety against GC pauses and clock drift, I'd use fencing tokens. The protected resource rejects operations with stale tokens."


Quick Recall Card

Question Answer
Simplest distributed lock? Redis SETNX with TTL
Safest distributed lock? ZooKeeper (ephemeral znodes) or etcd (leases)
What's Redlock? Acquire on majority (⅗) Redis nodes
What's a fencing token? Monotonically increasing number to detect stale locks
TTL too short? Premature expiry → two holders
TTL too long? Deadlock if holder crashes
How to fix TTL dilemma? Short TTL + heartbeat renewal