3 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Leader Election

Real Incident: Split-Brain at GitHub, 2018

GitHub's MySQL cluster experienced a network partition. Two nodes both believed they were the primary. Both accepted writes for 43 seconds. Result: data divergence across 100K+ repositories, 24 hours of manual reconciliation, and a postmortem that redesigned their entire failover mechanism. Without proper leader election and fencing, distributed systems write themselves into irrecoverable states.

Why This Comes Up in Interviews

Any time you design a system with replication, the interviewer will ask: "What happens when the leader dies?" You need to reason about:

How is a new leader elected?
What prevents two nodes from both thinking they're leader (split-brain)?
What happens to in-flight writes during failover?
How fast is failover? What's the unavailability window?

This comes up in: database replication, distributed locks, scheduler design, message broker design, and any system that needs a single coordinator.

Why You Need a Leader

Problem	Without Leader	With Leader
Write conflicts	Multiple nodes accept conflicting writes	One node sequences all writes
Ordering	No global order of events	Leader provides total order
Decision making	Expensive consensus on every operation	Leader decides, others follow
Coordination	Every node must talk to every other node O(N²)	All coordinate through one node O(N)

Trade-off: Leader = performance optimization (batch decisions) at the cost of availability (leader death = unavailability until new election).

Election Algorithms Compared

Algorithm	How	Partition-Safe?	Complexity	Used By
Bully	Highest ID wins	No (both partitions elect)	Simple	Educational
Raft	Term + majority vote	Yes (need N/2+1)	Medium	etcd, Consul, CockroachDB, Kafka (KRaft)
Paxos	Multi-phase consensus	Yes	Hard to implement	Google Chubby, Spanner
ZAB	Epoch + quorum	Yes	Medium	ZooKeeper
Lease-based	Time-limited ownership	Yes (with correct clocks)	Simple	DynamoDB, Chubby

For interviews: Know Raft well. It's the modern standard and designed to be understandable.

Raft Consensus (Deep Dive)

Node States

Every node is in exactly one state:

State	Behavior	Transitions To
Follower	Passive. Responds to leader's heartbeats.	→ Candidate (if election timeout)
Candidate	Requests votes from all peers.	→ Leader (if majority) or → Follower (if higher term seen)
Leader	Sends heartbeats, replicates log entries.	→ Follower (if higher term discovered)

Election Process

Trigger: Follower doesn't receive heartbeat within election timeout (randomized 150-300ms)
Candidate starts: Increments term, votes for self, requests votes from all
Voting rules: Each node votes for at most ONE candidate per term. Votes for candidate only if candidate's log is at least as up-to-date.
Win condition: Receives votes from majority (N/2 + 1)
Split vote: If no majority, timeout → new election with higher term. Random timeout ensures convergence.

Why Raft Prevents Split-Brain

Mathematical guarantee: In a cluster of N nodes, majority = N/2 + 1. Two majorities MUST overlap (pigeonhole principle). Therefore, at most ONE leader can be elected per term.

Cluster Size	Majority	Tolerates Failures
3	2	1 node
5	3	2 nodes
7	4	3 nodes

Why 5 is the sweet spot: Tolerates 2 failures with only 5 nodes. Going to 7 gives marginal benefit (3 failures) at significant cost (more replication traffic).

Log Replication (After Election)

Client sends write to leader
Leader appends to its log
Leader sends AppendEntries to all followers
Once majority ACK (2 of 3 in a 3-node cluster), entry is committed
Leader responds to client: "write successful"
Followers apply committed entries to state machine

Key insight: A write is durable once majority has it in their log, even before they apply it. If leader dies after commit, new leader WILL have this entry (because it got majority, and new leader needs majority vote — overlap guaranteed).

The Split-Brain Problem — Deep Dive

Scenario: Network partition divides 5-node cluster into [A, B, C] and [D, E].

Partition	Nodes	Can Elect?	Why
[A, B, C]	3 nodes	Yes	3 ≥ majority (3)
[D, E]	2 nodes	No	2 < majority (3)

Result: Only one partition can elect a leader. The other is unavailable for writes. This is the correct behavior — availability sacrificed to prevent divergence (CP from CAP).

But what if old leader is in the minority partition?

Old leader (say D) is in [D, E] partition
D can't get heartbeat ACKs from majority → steps down
New leader elected in [A, B, C]
When partition heals, D discovers higher term → becomes follower

Fencing — The Critical Safety Mechanism

Problem: Even with elections, brief windows exist where two nodes might both think they're leader (old leader hasn't realized it's been replaced).

Fencing token: Every leader gets a monotonically increasing number (epoch/term).

Time	Node A (old leader)	Node B (new leader)	Storage
T=0	Leader, token=5	Follower	Accepts writes with token≥5
T=1	Network partition	Elected, token=6	Accepts writes with token≥6
T=2	Tries write with token=5	Active with token=6	Rejects A's write (5 < 6)

How storage implements fencing:

Every write carries the leader's fencing token
Storage remembers highest token seen
Rejects any write with a lower token
Result: Even if two leaders exist briefly, only one can write successfully

Lease-Based Election

Alternative to consensus-based election: Use time-limited locks.

Step	What
1	Node acquires lease from coordination service (ZooKeeper/etcd)
2	Lease = leadership for T seconds
3	Leader must renew before T expires
4	If leader fails to renew → lease expires → others compete

Advantage: Simpler than full consensus. Good enough for many use cases.

Danger — Clock Skew:

Leader thinks: "My lease is valid for 5 more seconds" (leader's clock is slow)
Other nodes think: "Lease expired 2 seconds ago" (their clocks are faster)
Both think they're leader

Mitigation: Leader stops doing work BEFORE lease expires (with safety margin). If lease is 10s, leader stops accepting writes at 7s and renews. This accounts for clock skew up to 3s.

Failover — What Happens When Leader Dies

Detection

Method	Speed	False Positive Risk
Heartbeat timeout (fixed)	5-30s	Medium (network blip = false alarm)
Phi accrual failure detector	Adaptive	Low (adjusts to network conditions)
Lease expiry	Predictable (lease duration)	None (time-based)

Failover Process (Raft)

Detection: Followers notice missing heartbeats (150-300ms timeout)
Election: Random timeout → first to timeout becomes candidate → requests votes
New leader: Wins majority → starts sending heartbeats
Log reconciliation: New leader identifies uncommitted entries, ensures consistency
Client redirect: Clients discover new leader (via redirect or service discovery)

Total failover time: Typically 1-5 seconds end-to-end.

What About In-Flight Writes?

Write Status	After Failover
Committed (majority ACKed)	Safe — new leader has it
Uncommitted (only on old leader)	LOST — wasn't replicated to majority
Sent but not ACKed to client	Unknown — client must retry (needs idempotency)

This is why idempotency matters: Client doesn't know if write succeeded before leader died. Must be safe to retry.

Real Systems

System	Election Method	Failure Detection	Failover Time
etcd	Raft	Heartbeat + timeout	1-3s
ZooKeeper	ZAB (Zookeeper Atomic Broadcast)	Session timeout	2-10s
Kafka (KRaft)	Raft	Controller heartbeat	1-5s
Redis Sentinel	Quorum voting	Heartbeat + configurable timeout	5-30s (configurable)
MongoDB	Raft-like (replica set elections)	Heartbeat	10-12s (default)
PostgreSQL	External (Patroni/etcd)	Health checks	5-30s
CockroachDB	Raft (per range)	Liveness table + leases	9s (default lease duration)

Leader Election in System Design Interviews

When You're Designing...	Leader Election Applies To...
Replicated database	Primary that accepts writes
Distributed task scheduler	Coordinator that assigns tasks (prevents duplicates)
Kafka	Partition leader (accepts writes for that partition)
Distributed lock service	The lock service itself needs an elected leader
Rate limiter	Centralized counter or coordination
Configuration service	Single source of truth for configuration

Interview Framework

When asked "What happens if the primary/leader fails?":

Step 1 — Detection: "Followers detect leader failure via heartbeat timeout. I'd use a randomized timeout (150-300ms for Raft) to avoid split votes."

Step 2 — Election: "Using Raft consensus, a follower becomes candidate, requests votes. Needs majority (N/2 + 1 of N nodes). This guarantees only one leader per term."

Step 3 — Safety: "Split-brain is prevented by the majority requirement — two majorities can't exist simultaneously. Additionally, fencing tokens ensure storage rejects writes from old leaders."

Step 4 — Data safety: "Committed writes (ACKed by majority) survive failover. Uncommitted writes may be lost — clients handle this via idempotent retries."

Step 5 — Availability: "Total failover takes 1-5 seconds. For stricter availability, I'd use multi-region deployment with leader in the primary region and fast failover to the secondary."

Quick Recall

Question	Answer
Why a leader?	Sequences operations, avoids conflicts, O(N) coordination
How does Raft elect?	Majority vote + term number. One leader per term guaranteed.
Split-brain prevention?	Majority quorum (pigeonhole: two majorities must overlap)
Fencing token?	Monotonic number — storage rejects writes from old leaders
Failover time?	1-5 seconds (Raft). Up to 30s (some systems).
Committed writes safe?	Yes — majority has them, new leader must have them
Uncommitted writes?	May be lost — need idempotent client retries
Cluster size sweet spot?	5 nodes (tolerates 2 failures, manageable replication)
Lease-based risk?	Clock skew — leader must stop early with safety margin