Replication
Real Incident: GitLab Database Loss, 2017
GitLab's primary PostgreSQL went down. Their replica had 6 hours of replication lag. The backup was also stale. Result: 6 hours of production data lost — merge requests, comments, CI pipelines, issues. 300GB gone. Replication without monitoring replication lag is a ticking time bomb disguised as safety.
Why This Comes Up in Every Interview
Every system design interview involves a database. The moment you say "we need high availability" or "we need to scale reads," replication is the mechanism. Interviewers test:
- Do you understand sync vs async trade-offs (the core tension)?
- Can you handle replication lag problems (read-after-write, causality)?
- Do you know when to use which topology (single-leader vs multi-leader vs leaderless)?
- Can you calculate the consistency implications of your choices?
The Three Reasons to Replicate
| Goal | How Replication Helps | Without It |
|---|---|---|
| High availability | Primary dies → replica promoted in seconds | Hours of downtime while restoring from backup |
| Read scaling | Spread reads across N replicas | Single machine handles all traffic |
| Latency reduction | Replica in each region → local reads | All users hit a single-region database |
Back-of-envelope for read scaling:
- Single PostgreSQL instance: ~10K reads/sec
- With 5 read replicas: ~50K reads/sec (for eventually-consistent reads)
- With caching layer: ~500K reads/sec (but adds complexity)
Replication Topologies
Single-Leader (Master-Slave)
| Property | Detail |
|---|---|
| Write path | All writes → single leader |
| Read path | Read from leader (consistent) OR replicas (may be stale) |
| Failover | Replica promoted to leader (election) |
| Consistency | Strong on leader reads, eventual on replica reads |
| Used by | PostgreSQL, MySQL, MongoDB, Redis |
When to use: 95% of cases. Simple, well-understood, sufficient for most workloads.
Multi-Leader (Master-Master)
| Property | Detail |
|---|---|
| Write path | Multiple nodes accept writes |
| Replication | Leaders replicate to each other asynchronously |
| Challenge | Same data modified on two leaders → CONFLICT |
| Used by | CouchDB, multi-datacenter setups, collaborative editing |
When to use: Multi-datacenter writes (each DC has a leader for low-latency local writes). Collaborative editing (Google Docs).
Leaderless (Dynamo-Style)
| Property | Detail |
|---|---|
| Write path | Write to W nodes (any nodes) |
| Read path | Read from R nodes, take most recent |
| Consistency | Tunable via W + R > N |
| Used by | DynamoDB, Cassandra, Riak |
When to use: High availability critical (no single leader = no SPOF). Tunable consistency needed.
Synchronous vs Asynchronous — The Core Trade-off
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| How | Leader waits for replica ACK before confirming to client | Leader confirms immediately, replica catches up later |
| Latency | Higher (add replica RTT to every write) | Lower (just local write) |
| Durability | No data loss on leader failure | Possible data loss (un-replicated writes) |
| Availability | If replica is slow/down, writes BLOCK | Writes always succeed regardless of replica state |
The math for sync replication latency:
- Local disk write: ~0.1ms
- Network RTT to replica (same region): ~1ms
- Network RTT (cross-region): ~50-200ms
- Sync replication cost: local_write + replica_RTT = noticeable
Real systems choose:
| System | Default | Reasoning |
|---|---|---|
| PostgreSQL | Async | Performance. Accepts brief data loss window. |
| MySQL | Async | Same. Semi-sync available. |
| DynamoDB | Sync to 2 of 3 | Durability guaranteed for committed writes |
| Spanner | Sync (Paxos) | Strong consistency globally at any cost |
| MongoDB | Configurable (writeConcern) | Application decides per-write |
Semi-synchronous (common production pattern):
- Write synced to ONE replica (guaranteed no data loss)
- Other replicas async (for performance)
- If the sync replica dies, another is promoted to sync role
- Trade-off: tolerable latency hit + durability guarantee
Replication Lag — The Problems It Creates
When using async replication, there's always a window where replicas are behind. This creates real user-facing bugs:
Problem 1: Read-After-Write Inconsistency
Scenario: User posts a comment → immediately refreshes → reads from replica → comment not there yet.
| Fix | How | Trade-off |
|---|---|---|
| Read from leader for own writes | After writing, read from leader for T seconds | Leader gets more load |
| Sticky sessions | Route user to same replica always | Less load distribution |
| Client-side tracking | Client sends last-write timestamp, replica waits if behind | Added complexity |
Problem 2: Monotonic Read Violations
Scenario: User loads feed from Replica A (up-to-date) → refreshes → hits Replica B (behind) → content DISAPPEARS.
Fix: Sticky sessions — same user always reads from same replica. Or: client sends last-seen version, replica B redirects to leader if it's behind.
Problem 3: Causal Ordering Violations
Scenario: User A posts "Who wants pizza?" → User B replies "Me!" → Reader sees "Me!" before the question (different replicas, different lag).
Fix: Causal ordering via vector clocks or lamport timestamps. Or: keep related data on same partition.
Leaderless Replication — Quorum Deep Dive
The quorum formula: W + R > N guarantees at least one node in the read set has the latest write.
| N | W | R | Behavior |
|---|---|---|---|
| 3 | 2 | 2 | Standard (W+R=4 > 3). Balanced. |
| 3 | 3 | 1 | Write to all, read from one. Fast reads, slow writes. |
| 3 | 1 | 3 | Write to one, read all. Fast writes, slow reads. |
| 3 | 1 | 1 | W+R=2 ≤ 3. NOT consistent. Eventual only. |
What "quorum" actually guarantees:
- If W=2 and we wrote to nodes A, B
- Then read from R=2 nodes. At least ONE of our read nodes is A or B.
- We'll see the latest write.
- BUT: during write propagation, a read might hit the one stale node → stale read (brief window)
Sloppy quorum (DynamoDB pattern):
- During partition, write to ANY W available nodes (not necessarily the "home" nodes)
- Handed off later when home nodes recover
- Improves availability at cost of consistency during partitions
Read repair:
- During read, if nodes disagree, update stale nodes
- Lazy consistency — eventually heals itself
- Not sufficient alone (what if stale data is never read?)
Anti-entropy (background repair):
- Background process compares all replicas
- Copies missing data
- Eventual consistency guarantee
Conflict Resolution (Multi-Leader / Leaderless)
When two nodes accept conflicting writes:
| Strategy | How | Data Loss? | Used By |
|---|---|---|---|
| Last Writer Wins (LWW) | Highest timestamp wins | Yes (silently drops "loser") | Cassandra, DynamoDB |
| Custom merge function | Application-specific logic | Depends | CouchDB |
| CRDT (Conflict-free Replicated Data Types) | Mathematical merge guarantee | No | Riak, Redis (CRDT types) |
| Operational Transformation | Transform concurrent operations | No | Google Docs |
| Show both to user | Let user resolve | No | Git (merge conflicts) |
LWW dangers: Two users edit same document. One wins, one loses work silently. Acceptable for "last update wins" data (sessions, counters), dangerous for "all updates matter" data (edits, comments).
Back-of-Envelope: Replication Lag Calculation
Given: - Write throughput: 10K writes/sec - Each write: 1KB - Network bandwidth to replica: 100 Mbps (12.5 MB/s) - Replication stream: 10K × 1KB = 10 MB/s
Normal operation: 10 MB/s stream on 12.5 MB/s link = 80% utilization. Lag = milliseconds.
During traffic spike (3x): 30 MB/s stream on 12.5 MB/s link = OVERWHELMED. Lag grows rapidly.
During replica maintenance (VACUUM, backup): Replica I/O saturated → can't keep up → lag grows.
GitLab's problem: Their replication lag was 6 hours. That's 6 hours × 10K writes/sec × 1KB = ~216 GB of un-replicated data. When primary died, all of it was lost.
Lesson: MONITOR replication lag. Alert if > N seconds. LinkedIn alerts at 30 seconds.
Failover — When the Primary Dies
Automatic Failover Process
- Detect failure: Heartbeat timeout (typically 5-30 seconds)
- Choose new leader: Most up-to-date replica (least replication lag)
- Reconfigure clients: Point writes to new leader
- Reconcile old leader (when it returns): Discard unreplicated writes, rejoin as replica
Failover Dangers
| Problem | What Happens | Mitigation |
|---|---|---|
| Data loss | Async replication → un-replicated writes lost | Semi-sync replication |
| Split-brain | Old leader comes back, both accept writes | Fencing (STONITH — Shoot The Other Node In The Head) |
| Client cached routing | Clients still send writes to old leader | Short DNS TTL, smart client with discovery |
| Cascading failure | New leader overwhelmed by traffic it wasn't scaled for | Pre-warm replicas to handle full load |
Real Systems — Replication Details
| System | Topology | Sync/Async | Consistency | Failover |
|---|---|---|---|---|
| PostgreSQL | Single-leader | Async default, sync optional | Strong on leader, eventual on replicas | External tool (Patroni) |
| MySQL | Single-leader | Async/Semi-sync | Configurable | MySQL Group Replication / Orchestrator |
| DynamoDB | Leaderless | Sync (W=⅔) | Tunable (strongly consistent reads optional) | Automatic |
| Cassandra | Leaderless | Tunable (W, R, N) | Tunable per query | No leader to fail |
| CockroachDB | Single-leader (Raft per range) | Sync (Raft consensus) | Strong (serializable) | Automatic (Raft) |
| Spanner | Single-leader (Paxos per split) | Sync | Strong + external consistency | Automatic (Paxos) |
| MongoDB | Single-leader (replica set) | Async default | Configurable per-operation | Automatic (raft-like) |
Interview Framework
When the interviewer asks about your data layer availability:
Step 1 — Topology: "I'd use single-leader replication with [N] replicas. Writes go to the leader, reads distributed across replicas."
Step 2 — Sync/Async: "For this use case, [sync to one replica for durability / async for performance]. The trade-off is [latency vs data loss risk]."
Step 3 — Consistency: "Reads from replicas are eventually consistent. For read-after-write consistency, I'd route a user's reads to the leader for [T] seconds after they write."
Step 4 — Failover: "If the leader dies, [Patroni/etcd/Raft] detects via heartbeat and promotes the most up-to-date replica. Total failover: [T] seconds. In-flight uncommitted writes may be lost — clients retry with idempotency keys."
Step 5 — Monitoring: "I'd alert on replication lag > [X] seconds. If lag grows, we're at risk of data loss proportional to the lag window."
Quick Recall
| Question | Answer |
|---|---|
| Sync vs async? | Sync = no data loss, higher latency. Async = fast, possible loss. |
| Semi-sync? | Sync to ONE replica, rest async. Best of both worlds. |
| Read-after-write fix? | Route user's reads to leader for T seconds after write |
| Quorum formula? | W + R > N guarantees overlap |
| Failover time? | 5-30 seconds depending on system and detection config |
| Data loss on failover? | Async: lose uncommitted writes. Sync: zero loss. |
| Split-brain fix? | Fencing (STONITH) + majority-based election |
| Monitor what? | Replication lag — alert if > seconds, investigate if > minutes |
| LWW danger? | Silent data loss — concurrent writer's data discarded without notice |