Leader Election
Real Incident: Split-Brain at GitHub, 2018
GitHub's MySQL cluster experienced a network partition. Two nodes both believed they were the primary. Both accepted writes for 43 seconds. Result: data divergence across 100K+ repositories, 24 hours of manual reconciliation, and a postmortem that redesigned their entire failover mechanism. Without proper leader election and fencing, distributed systems write themselves into irrecoverable states.
Why This Comes Up in Interviews
Any time you design a system with replication, the interviewer will ask: "What happens when the leader dies?" You need to reason about:
- How is a new leader elected?
- What prevents two nodes from both thinking they're leader (split-brain)?
- What happens to in-flight writes during failover?
- How fast is failover? What's the unavailability window?
This comes up in: database replication, distributed locks, scheduler design, message broker design, and any system that needs a single coordinator.
Why You Need a Leader
| Problem | Without Leader | With Leader |
|---|---|---|
| Write conflicts | Multiple nodes accept conflicting writes | One node sequences all writes |
| Ordering | No global order of events | Leader provides total order |
| Decision making | Expensive consensus on every operation | Leader decides, others follow |
| Coordination | Every node must talk to every other node O(N²) | All coordinate through one node O(N) |
Trade-off: Leader = performance optimization (batch decisions) at the cost of availability (leader death = unavailability until new election).
Election Algorithms Compared
| Algorithm | How | Partition-Safe? | Complexity | Used By |
|---|---|---|---|---|
| Bully | Highest ID wins | No (both partitions elect) | Simple | Educational |
| Raft | Term + majority vote | Yes (need N/2+1) | Medium | etcd, Consul, CockroachDB, Kafka (KRaft) |
| Paxos | Multi-phase consensus | Yes | Hard to implement | Google Chubby, Spanner |
| ZAB | Epoch + quorum | Yes | Medium | ZooKeeper |
| Lease-based | Time-limited ownership | Yes (with correct clocks) | Simple | DynamoDB, Chubby |
For interviews: Know Raft well. It's the modern standard and designed to be understandable.
Raft Consensus (Deep Dive)
Node States
Every node is in exactly one state:
| State | Behavior | Transitions To |
|---|---|---|
| Follower | Passive. Responds to leader's heartbeats. | → Candidate (if election timeout) |
| Candidate | Requests votes from all peers. | → Leader (if majority) or → Follower (if higher term seen) |
| Leader | Sends heartbeats, replicates log entries. | → Follower (if higher term discovered) |
Election Process
- Trigger: Follower doesn't receive heartbeat within election timeout (randomized 150-300ms)
- Candidate starts: Increments term, votes for self, requests votes from all
- Voting rules: Each node votes for at most ONE candidate per term. Votes for candidate only if candidate's log is at least as up-to-date.
- Win condition: Receives votes from majority (N/2 + 1)
- Split vote: If no majority, timeout → new election with higher term. Random timeout ensures convergence.
Why Raft Prevents Split-Brain
Mathematical guarantee: In a cluster of N nodes, majority = N/2 + 1. Two majorities MUST overlap (pigeonhole principle). Therefore, at most ONE leader can be elected per term.
| Cluster Size | Majority | Tolerates Failures |
|---|---|---|
| 3 | 2 | 1 node |
| 5 | 3 | 2 nodes |
| 7 | 4 | 3 nodes |
Why 5 is the sweet spot: Tolerates 2 failures with only 5 nodes. Going to 7 gives marginal benefit (3 failures) at significant cost (more replication traffic).
Log Replication (After Election)
- Client sends write to leader
- Leader appends to its log
- Leader sends AppendEntries to all followers
- Once majority ACK (2 of 3 in a 3-node cluster), entry is committed
- Leader responds to client: "write successful"
- Followers apply committed entries to state machine
Key insight: A write is durable once majority has it in their log, even before they apply it. If leader dies after commit, new leader WILL have this entry (because it got majority, and new leader needs majority vote — overlap guaranteed).
The Split-Brain Problem — Deep Dive
Scenario: Network partition divides 5-node cluster into [A, B, C] and [D, E].
| Partition | Nodes | Can Elect? | Why |
|---|---|---|---|
| [A, B, C] | 3 nodes | Yes | 3 ≥ majority (3) |
| [D, E] | 2 nodes | No | 2 < majority (3) |
Result: Only one partition can elect a leader. The other is unavailable for writes. This is the correct behavior — availability sacrificed to prevent divergence (CP from CAP).
But what if old leader is in the minority partition?
- Old leader (say D) is in [D, E] partition
- D can't get heartbeat ACKs from majority → steps down
- New leader elected in [A, B, C]
- When partition heals, D discovers higher term → becomes follower
Fencing — The Critical Safety Mechanism
Problem: Even with elections, brief windows exist where two nodes might both think they're leader (old leader hasn't realized it's been replaced).
Fencing token: Every leader gets a monotonically increasing number (epoch/term).
| Time | Node A (old leader) | Node B (new leader) | Storage |
|---|---|---|---|
| T=0 | Leader, token=5 | Follower | Accepts writes with token≥5 |
| T=1 | Network partition | Elected, token=6 | Accepts writes with token≥6 |
| T=2 | Tries write with token=5 | Active with token=6 | Rejects A's write (5 < 6) |
How storage implements fencing:
- Every write carries the leader's fencing token
- Storage remembers highest token seen
- Rejects any write with a lower token
- Result: Even if two leaders exist briefly, only one can write successfully
Lease-Based Election
Alternative to consensus-based election: Use time-limited locks.
| Step | What |
|---|---|
| 1 | Node acquires lease from coordination service (ZooKeeper/etcd) |
| 2 | Lease = leadership for T seconds |
| 3 | Leader must renew before T expires |
| 4 | If leader fails to renew → lease expires → others compete |
Advantage: Simpler than full consensus. Good enough for many use cases.
Danger — Clock Skew:
- Leader thinks: "My lease is valid for 5 more seconds" (leader's clock is slow)
- Other nodes think: "Lease expired 2 seconds ago" (their clocks are faster)
- Both think they're leader
Mitigation: Leader stops doing work BEFORE lease expires (with safety margin). If lease is 10s, leader stops accepting writes at 7s and renews. This accounts for clock skew up to 3s.
Failover — What Happens When Leader Dies
Detection
| Method | Speed | False Positive Risk |
|---|---|---|
| Heartbeat timeout (fixed) | 5-30s | Medium (network blip = false alarm) |
| Phi accrual failure detector | Adaptive | Low (adjusts to network conditions) |
| Lease expiry | Predictable (lease duration) | None (time-based) |
Failover Process (Raft)
- Detection: Followers notice missing heartbeats (150-300ms timeout)
- Election: Random timeout → first to timeout becomes candidate → requests votes
- New leader: Wins majority → starts sending heartbeats
- Log reconciliation: New leader identifies uncommitted entries, ensures consistency
- Client redirect: Clients discover new leader (via redirect or service discovery)
Total failover time: Typically 1-5 seconds end-to-end.
What About In-Flight Writes?
| Write Status | After Failover |
|---|---|
| Committed (majority ACKed) | Safe — new leader has it |
| Uncommitted (only on old leader) | LOST — wasn't replicated to majority |
| Sent but not ACKed to client | Unknown — client must retry (needs idempotency) |
This is why idempotency matters: Client doesn't know if write succeeded before leader died. Must be safe to retry.
Real Systems
| System | Election Method | Failure Detection | Failover Time |
|---|---|---|---|
| etcd | Raft | Heartbeat + timeout | 1-3s |
| ZooKeeper | ZAB (Zookeeper Atomic Broadcast) | Session timeout | 2-10s |
| Kafka (KRaft) | Raft | Controller heartbeat | 1-5s |
| Redis Sentinel | Quorum voting | Heartbeat + configurable timeout | 5-30s (configurable) |
| MongoDB | Raft-like (replica set elections) | Heartbeat | 10-12s (default) |
| PostgreSQL | External (Patroni/etcd) | Health checks | 5-30s |
| CockroachDB | Raft (per range) | Liveness table + leases | 9s (default lease duration) |
Leader Election in System Design Interviews
| When You're Designing... | Leader Election Applies To... |
|---|---|
| Replicated database | Primary that accepts writes |
| Distributed task scheduler | Coordinator that assigns tasks (prevents duplicates) |
| Kafka | Partition leader (accepts writes for that partition) |
| Distributed lock service | The lock service itself needs an elected leader |
| Rate limiter | Centralized counter or coordination |
| Configuration service | Single source of truth for configuration |
Interview Framework
When asked "What happens if the primary/leader fails?":
Step 1 — Detection: "Followers detect leader failure via heartbeat timeout. I'd use a randomized timeout (150-300ms for Raft) to avoid split votes."
Step 2 — Election: "Using Raft consensus, a follower becomes candidate, requests votes. Needs majority (N/2 + 1 of N nodes). This guarantees only one leader per term."
Step 3 — Safety: "Split-brain is prevented by the majority requirement — two majorities can't exist simultaneously. Additionally, fencing tokens ensure storage rejects writes from old leaders."
Step 4 — Data safety: "Committed writes (ACKed by majority) survive failover. Uncommitted writes may be lost — clients handle this via idempotent retries."
Step 5 — Availability: "Total failover takes 1-5 seconds. For stricter availability, I'd use multi-region deployment with leader in the primary region and fast failover to the secondary."
Quick Recall
| Question | Answer |
|---|---|
| Why a leader? | Sequences operations, avoids conflicts, O(N) coordination |
| How does Raft elect? | Majority vote + term number. One leader per term guaranteed. |
| Split-brain prevention? | Majority quorum (pigeonhole: two majorities must overlap) |
| Fencing token? | Monotonic number — storage rejects writes from old leaders |
| Failover time? | 1-5 seconds (Raft). Up to 30s (some systems). |
| Committed writes safe? | Yes — majority has them, new leader must have them |
| Uncommitted writes? | May be lost — need idempotent client retries |
| Cluster size sweet spot? | 5 nodes (tolerates 2 failures, manageable replication) |
| Lease-based risk? | Clock skew — leader must stop early with safety margin |