4 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Raft & Paxos Consensus Algorithms — Leader Election, Log Replication Explained

Q: Explain how Raft leader election works.

Answer: Each node starts as a Follower with a randomized election timeout (150-300ms) If a follower doesn't receive a heartbeat before timeout → becomes Candidate Candidate increments its term, votes for itself, sends RequestVote to all others Other nodes grant vote if: (a) they haven't voted in this term yet, and (b) candidate's log is at least as up-to-date as theirs If candidate gets majority votes → becomes Leader, starts sending heartbeats If two candidates split the vote → both timeout, re

Q: How does Raft handle a leader crash mid-replication?

Answer: If the leader crashes after sending AppendEntries to some but not all followers: Entries NOT replicated to majority: will be overwritten by new leader's log Entries replicated to majority: are committed and will survive The new leader (elected from the majority) will have all committed entries (election safety guarantee). It replays its log to followers, overwriting any uncommitted entries from the old leader.

Q: Why can't you have consensus with just 2 nodes?

Answer: With 2 nodes, a majority is 2. If either node crashes, the remaining node can't form a majority alone — it can't distinguish between "the other node crashed" and "I'm network-partitioned from a healthy cluster." This leads to split-brain (both nodes think they're the leader) or deadlock (neither can proceed). Minimum viable consensus requires 3 nodes.

How do distributed systems agree on a single truth when machines crash, networks partition, and messages get lost? Consensus is the hardest problem in distributed computing.

Real-World Analogy

Imagine a jury deliberation where jurors can't all be in the same room — they communicate via unreliable mail. Some jurors might be absent (crashed), letters might get lost (network failures), and they still need to reach a unanimous verdict that all present jurors agree on. That's consensus.

Why Consensus Matters

Every distributed database, every replicated state machine, and every leader election mechanism needs consensus:

System	Uses Consensus For
etcd / ZooKeeper	Distributed key-value agreement
Kafka	Partition leader election
CockroachDB / Spanner	Transaction commit across replicas
Kubernetes	etcd stores cluster state via Raft
MongoDB	Replica set leader election

The Problem Statement

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Problem["The Consensus Problem"]
        C(("Client")) -->|"write X=5"| N1["Node 1"]
        C -->|"write X=7"| N2["Node 2"]
        N1 & N2 & N3["Node 3"] -->|"???"| Q{"What is X?<br/>All must agree!"}
    end

    style Q fill:#FFCDD2,stroke:#C62828,color:#000

Requirements for valid consensus:

Property	Meaning
Agreement	All non-faulty nodes decide the same value
Validity	The decided value was proposed by some node
Termination	All non-faulty nodes eventually decide
Integrity	A node decides at most once

FLP Impossibility (1985)

Fischer, Lynch, and Paterson proved that in an asynchronous system with even ONE crash failure, deterministic consensus is impossible. Every practical algorithm "cheats" by using timeouts (partial synchrony), randomization, or failure detectors.

Raft — Consensus Made Understandable

Raft was designed in 2014 explicitly to be easier to understand than Paxos while providing the same guarantees.

Three Roles

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
stateDiagram-v2
    [*] --> Follower: Starts as follower
    Follower --> Candidate: Election timeout<br/>(no heartbeat received)
    Candidate --> Leader: Wins election<br/>(majority votes)
    Candidate --> Follower: Loses election<br/>(higher term seen)
    Leader --> Follower: Higher term discovered

Role	Responsibility
Leader	Handles ALL client requests, replicates log to followers
Follower	Passively receives log entries, responds to leader
Candidate	Temporarily during election (requesting votes)

Leader Election

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
sequenceDiagram
    participant F1 as Node A (Follower)
    participant F2 as Node B (Follower)
    participant F3 as Node C (Follower)

    Note over F1,F3: Leader crashed. Election timeout fires on Node A first.
    F1->>F1: Increment term (term=2), become Candidate
    F1->>F2: RequestVote(term=2, lastLogIndex=5)
    F1->>F3: RequestVote(term=2, lastLogIndex=5)
    F2->>F1: VoteGranted (term=2)
    F3->>F1: VoteGranted (term=2)
    Note over F1: 3/3 votes = majority ✅
    F1->>F1: Become Leader (term=2)
    F1->>F2: AppendEntries (heartbeat)
    F1->>F3: AppendEntries (heartbeat)

Election rules: 1. Follower's election timeout fires (randomized 150-300ms) 2. Becomes Candidate, increments term, votes for self 3. Sends RequestVote to all other nodes 4. Wins if receives majority votes (>N/2) 5. Each node votes at most ONCE per term 6. Candidate with stale log (lower lastLogIndex) gets rejected

Log Replication

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Write X=5
    L->>L: Append to local log (uncommitted)
    L->>F1: AppendEntries(entry: X=5)
    L->>F2: AppendEntries(entry: X=5)
    F1->>L: ACK (replicated)
    F2->>L: ACK (replicated)
    Note over L: Majority ACKed (2/3) → COMMIT
    L->>L: Apply to state machine
    L->>C: Success (X=5 committed)
    L->>F1: Next heartbeat (commitIndex updated)
    L->>F2: Next heartbeat (commitIndex updated)
    Note over F1,F2: Apply committed entries to local state machine

Commit rule: A log entry is committed when replicated to a majority of nodes. Even if the leader crashes after commit, the entry is safe (a new leader will have it).

Safety Guarantee

Text Only

If a log entry is committed in a given term, that entry will be present 
in the logs of ALL leaders for all higher-numbered terms.

This ensures: once you tell the client "write succeeded," that write survives any future leader changes.

Paxos — The Original (Theoretical)

Paxos (1989, Lamport) is mathematically proven correct but notoriously difficult to implement.

Roles in Paxos

Role	Function
Proposer	Proposes values (like Raft's leader)
Acceptor	Votes on proposals (like Raft's followers)
Learner	Learns the decided value (client/observer)

Basic Paxos (Single Value)

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3

    Note over P,A3: Phase 1: PREPARE
    P->>A1: Prepare(n=1)
    P->>A2: Prepare(n=1)
    P->>A3: Prepare(n=1)
    A1->>P: Promise(n=1, no prior value)
    A2->>P: Promise(n=1, no prior value)
    A3->>P: Promise(n=1, no prior value)

    Note over P,A3: Phase 2: ACCEPT
    P->>A1: Accept(n=1, value="X=5")
    P->>A2: Accept(n=1, value="X=5")
    P->>A3: Accept(n=1, value="X=5")
    A1->>P: Accepted(n=1)
    A2->>P: Accepted(n=1)
    Note over P: Majority accepted → VALUE CHOSEN

Phase 1 (Prepare): - Proposer picks a unique proposal number n - Sends Prepare(n) to all acceptors - Acceptors promise not to accept proposals with number < n

Phase 2 (Accept): - If majority promised, proposer sends Accept(n, value) - Acceptors accept if they haven't promised a higher number - If majority accepts, value is chosen

Basic Paxos vs Multi-Paxos

Basic Paxos agrees on a single value — useful for theory but impractical alone. Production systems (Google Chubby, Spanner, etcd) use Multi-Paxos or Raft to agree on a sequence of values (a replicated log). Multi-Paxos elects a stable leader and skips Phase 1 for subsequent proposals, making it equivalent in structure to Raft. This is the form that actually matters in practice.

Raft vs Paxos

Aspect	Raft	Paxos
Understandability	Designed to be simple	Notoriously complex
Leader	Strong leader (all writes go through leader)	No explicit leader (any node can propose)
Log structure	Sequential, append-only	Can have gaps
Membership changes	Joint consensus (built-in)	Separate protocol needed
Implementations	etcd, Consul, CockroachDB	Chubby (Google), Spanner
Learning curve	Days	Weeks/months
Correctness proofs	Simpler (stronger invariants)	Harder (more general)

For Interviews

Always explain Raft — it's designed to be explainable. Mention Paxos to show awareness, but dive deep on Raft's leader election and log replication. Interviewers rarely ask for Paxos details.

Practical Considerations

Quorum Size

Text Only

For N nodes: quorum = ⌊N/2⌋ + 1 (majority)

3 nodes → quorum 2 → tolerates 1 failure
5 nodes → quorum 3 → tolerates 2 failures  
7 nodes → quorum 4 → tolerates 3 failures

Why Odd Numbers?

With 4 nodes, quorum is still 3 — same fault tolerance as 3 nodes! You pay for an extra node but gain nothing. Always use odd cluster sizes: 3, 5, or 7.

Performance Implications

Operation	Latency	Why
Read from leader	1 RTT	Leader has latest committed state
Write	1 RTT (client-perceived)	Client → leader, leader replicates to followers in parallel, majority ACK → responds to client. Internal leader-follower communication is pipelined within this single client RTT.
Read from follower	1 RTT + staleness	May read uncommitted data (linearizability trade-off)
Leader election	150-300ms	Election timeout + vote collection

Network Partitions

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Partition1["Partition A (majority)"]
        L["Leader"] & F1["Follower 1"]
    end
    subgraph Partition2["Partition B (minority)"]
        F2["Follower 2"]
    end

    Partition1 -.-|"network<br/>partition"| Partition2

    style Partition1 fill:#E8F5E9,stroke:#2E7D32,color:#000
    style Partition2 fill:#FFCDD2,stroke:#C62828,color:#000

Partition A (leader + 1 follower = majority): continues serving reads AND writes
Partition B (1 follower = minority): cannot elect a new leader, becomes read-only (stale)
When partition heals: follower in B catches up from leader's log

Interview Questions

Explain how Raft leader election works.

Answer:

Each node starts as a Follower with a randomized election timeout (150-300ms)
If a follower doesn't receive a heartbeat before timeout → becomes Candidate
Candidate increments its term, votes for itself, sends RequestVote to all others
Other nodes grant vote if: (a) they haven't voted in this term yet, and (b) candidate's log is at least as up-to-date as theirs
If candidate gets majority votes → becomes Leader, starts sending heartbeats
If two candidates split the vote → both timeout, retry with new term (randomized timeout prevents infinite ties)

How does Raft handle a leader crash mid-replication?

Answer: If the leader crashes after sending AppendEntries to some but not all followers:

Entries NOT replicated to majority: will be overwritten by new leader's log
Entries replicated to majority: are committed and will survive

The new leader (elected from the majority) will have all committed entries (election safety guarantee). It replays its log to followers, overwriting any uncommitted entries from the old leader.

Why can't you have consensus with just 2 nodes?

Answer: With 2 nodes, a majority is 2. If either node crashes, the remaining node can't form a majority alone — it can't distinguish between "the other node crashed" and "I'm network-partitioned from a healthy cluster." This leads to split-brain (both nodes think they're the leader) or deadlock (neither can proceed). Minimum viable consensus requires 3 nodes.