Chat System (WhatsApp / Slack Scale)
Real Incident: WhatsApp Outage, October 4, 2021 (6 hours)
A routine BGP maintenance change at Facebook's backbone routers withdrew all routes to Facebook's data centers from the global routing table. Within seconds, every WhatsApp client on Earth lost its WebSocket connection. 2 billion users went offline simultaneously. DNS resolution failed because Facebook's authoritative nameservers were inside the same unreachable network. The outage lasted 6 hours and 7 minutes — the longest in WhatsApp's history.
But the truly interesting engineering challenge came after the fix. When BGP routes were restored, 2 billion clients attempted to reconnect simultaneously — a classic thundering herd problem. WhatsApp's gateway fleet faced a surge 200x normal connection rate. Here is why their architecture survived the recovery:
- Decoupled persistence from delivery — messages sent during the outage were queued in Kafka and persisted to Cassandra. No messages were lost.
- Exponential backoff with jitter — clients used randomized retry intervals (base 1s, max 5 min, ± 20% jitter) to spread reconnection over minutes rather than milliseconds.
- Cursor-based sync — reconnected clients pulled missed messages using a
last_seen_message_idcursor, paginated in batches of 50, preventing any single client from overwhelming the storage tier. - Connection admission control — gateways throttled new connections at 80% capacity, returning HTTP 503 with
Retry-Afterheaders to shed load gracefully.
Lesson: Design for the recovery stampede, not just steady state. Your system's most dangerous moment is the 60 seconds after it comes back online.
System Design Concepts Used
WebSocket Long Polling (fallback) Pub/Sub Consistent Hashing Snowflake IDs Write-Ahead Log Fan-out (write vs. read) Heartbeat / Gossip Protocol End-to-End Encryption (Signal Protocol) Cursor-based Pagination Exponential Backoff with Jitter Partitioned Storage LSM-tree (Cassandra) Connection Multiplexing Push Notification Gateway Rate Limiting Circuit Breaker Message Queuing (Kafka) Sharded Redis Cluster Consistent Hashing Ring Object Storage (S3/CDN) Idempotency Keys
Functional Requirements
- One-to-one messaging — Users can send text messages to any other user with delivery guarantees (at-least-once, deduplicated at client).
- Group chat — Support groups up to 1,024 members with all members receiving messages in consistent order.
- Online presence and typing indicators — Show real-time online/offline/last-seen status and "typing..." indicators per conversation.
- Read receipts — Single-tick (sent), double-tick (delivered), blue-tick (read) semantics, propagated to sender.
- Media messages — Support images (up to 16MB), voice notes (up to 100MB), videos (up to 2GB), and documents with thumbnail previews.
- Offline message sync — When a user comes back online, deliver all missed messages in order without duplicates.
- End-to-end encryption — Messages must be unreadable by the server. Key exchange via Signal Protocol (X3DH + Double Ratchet).
- Message search — Full-text search across a user's message history (client-side index for E2E encrypted messages).
- Multi-device sync — A user logged in on phone and desktop receives messages on both devices simultaneously.
- Message reactions and replies — Users can react with emoji or reply to a specific message with threading context.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Message delivery latency | < 200ms (P99 for online recipients) | Users perceive chat as "instant" below 300ms; 200ms gives headroom for network jitter |
| Availability | 99.99% (52 min downtime/year) | Chat is a primary communication channel; even minutes of downtime affect billions |
| Concurrent connections | 170M simultaneous WebSockets | Peak concurrency = ~34% of 500M DAU (measured from WhatsApp public stats) |
| Message ordering | Per-conversation total order | Users must see messages in same sequence; global order is impossible and unnecessary |
| Durability | Zero message loss once acknowledged | Messages are users' memories — losing even one is unacceptable |
| Throughput | 580K messages/sec sustained, 2.9M peak | 50B messages/day with 5x peak-to-average ratio |
| Storage retention | 30 days server-side (encrypted) | Balance between sync needs and storage costs; clients hold full history |
| Horizontal scalability | Linear scale-out to 10x current load | Must handle growth and events (New Year midnight, elections) without re-architecture |
Capacity Estimation
===============================================
CHAT SYSTEM — NAPKIN MATH (WhatsApp Scale)
===============================================
GIVEN:
DAU = 500,000,000 users
Messages per user/day = 100 (avg)
Total messages/day = 500M × 100 = 50,000,000,000 (50B)
Avg message size = 200 bytes (encrypted text + metadata)
Media messages = 5% of total
Avg media size = 500 KB
Peak-to-average ratio = 5x
───────────────────────────────────────────────
THROUGHPUT
───────────────────────────────────────────────
Avg messages/sec = 50B ÷ 86,400 = ~580,000 msg/sec
Peak messages/sec = 580K × 5 = ~2,900,000 msg/sec
Avg media uploads/sec = 50B × 0.05 ÷ 86,400 = ~29,000 uploads/sec
───────────────────────────────────────────────
CONNECTIONS
───────────────────────────────────────────────
Concurrent connections = 500M × 0.34 = 170,000,000
Connections per server = 100,000 (epoll limit with 32GB RAM)
Gateway servers needed = 170M ÷ 100K = 1,700 servers
RAM per connection = ~10 KB (buffer + session state)
RAM per gateway = 100K × 10 KB = 1 GB (connection state only)
───────────────────────────────────────────────
STORAGE (TEXT MESSAGES)
───────────────────────────────────────────────
Daily text storage = 50B × 200 bytes = 10 TB/day
30-day retention = 10 TB × 30 = 300 TB
Annual (if retained) = 10 TB × 365 = 3.6 PB/year
Cassandra replication = 3x
Actual disk needed = 300 TB × 3 = 900 TB (30-day window)
Cassandra nodes (4TB) = 900 TB ÷ 4 TB = 225 nodes
───────────────────────────────────────────────
STORAGE (MEDIA)
───────────────────────────────────────────────
Daily media storage = 50B × 0.05 × 500 KB = 1.25 PB/day
30-day media = 1.25 PB × 30 = 37.5 PB
(Stored in S3 with CDN, not Cassandra)
───────────────────────────────────────────────
BANDWIDTH
───────────────────────────────────────────────
Text bandwidth (avg) = 580K × 200 B = 116 MB/sec = ~1 Gbps
Media bandwidth (avg) = 29K × 500 KB = 14.5 GB/sec = ~116 Gbps
Total egress (peak) = (1 + 116) × 5 = ~585 Gbps
───────────────────────────────────────────────
REDIS PUB/SUB (ROUTING LAYER)
───────────────────────────────────────────────
Messages routed/sec = 580K (each needs pub/sub lookup)
Redis shards needed = 580K ÷ 100K ops/shard = 6 shards (min)
With headroom (3x) = 18 Redis shards
───────────────────────────────────────────────
KAFKA (PERSISTENCE PIPELINE)
───────────────────────────────────────────────
Partitions needed = Peak msg/sec ÷ 10K per partition
= 2.9M ÷ 10K = 290 partitions
Brokers (20 part/each) = 290 ÷ 20 = ~15 brokers (min)
With replication (3x) = 45 Kafka brokers
"Why X, Not Y?" Tradeoff Analysis
Why WebSocket, Not HTTP Long Polling or Server-Sent Events?
| Criterion | WebSocket | HTTP Long Polling | Server-Sent Events (SSE) |
|---|---|---|---|
| Direction | Full-duplex (bidirectional) | Simulated bidirectional (new request per message) | Server-to-client only |
| Overhead per message | 2-6 bytes framing | ~800 bytes (full HTTP headers each time) | ~50 bytes (event stream format) |
| Latency | Sub-millisecond (connection already open) | 50-200ms (TCP + TLS handshake per request) | Sub-ms receive, but needs separate channel for sends |
| Connection count at 170M users | 170M persistent sockets | 170M × 2 (send + receive) = 340M connections | 170M SSE + 170M HTTP POST = 340M |
| Mobile battery impact | Low (single persistent connection) | High (constant reconnection) | Medium (still needs POST for sends) |
Decision: WebSocket with long-polling fallback.
The math is decisive: at 580K messages/sec, HTTP long polling would require 580K × 800 bytes = 464 MB/sec of pure header overhead vs. 580K × 6 bytes = 3.5 MB/sec with WebSocket. That is a 130x bandwidth reduction. For mobile users on metered connections, this translates directly to data charges and battery life.
Long polling remains as a fallback for restrictive corporate proxies that terminate WebSocket upgrades. Approximately 2-3% of connections fall back to this path.
Why Cassandra, Not PostgreSQL for Message Storage?
| Criterion | Cassandra | PostgreSQL |
|---|---|---|
| Write throughput | 580K writes/sec across cluster (linear scale) | ~50K writes/sec per node (vertical limit) |
| Scale-out model | Add nodes, automatic rebalancing via consistent hashing | Complex sharding (Citus), manual shard management |
| Partition model | Native partition by conversation_id, sorted by message_id | Requires explicit table partitioning, harder to range-scan |
| Availability model | AP system (tunable consistency: QUORUM for writes) | CP system (single-leader, failover = downtime) |
| Storage engine | LSM-tree (optimized for sequential writes) | B-tree (optimized for random reads) |
| Multi-datacenter | Built-in multi-DC replication, conflict-free | Logical replication, complex conflict resolution |
Decision: Cassandra with LOCAL_QUORUM consistency.
Chat message storage is a write-heavy, time-series workload. At 580K msg/sec sustained, PostgreSQL would need 12+ shards with complex routing logic. Cassandra handles this natively: conversation_id as partition key gives us all messages for a conversation on the same node, sorted by message_id (Snowflake timestamp). Range queries like "give me the next 50 messages after cursor X" are a single sequential disk read.
The tradeoff: we lose ad-hoc SQL queries and transactions. But chat messages are immutable once written (append-only), so we never need UPDATE or multi-row transactions. The one case where we need atomic read-modify-write (message status: sent -> delivered -> read) is handled via Cassandra's lightweight transactions (Paxos) on the status field only.
Why Redis Pub/Sub, Not Kafka for Cross-Gateway Message Routing?
| Criterion | Redis Pub/Sub | Kafka |
|---|---|---|
| Latency | < 1ms (in-memory, no disk) | 2-10ms (disk write, batch flush) |
| Delivery semantics | Fire-and-forget (at-most-once) | At-least-once with consumer offsets |
| Connection model | Gateway subscribes to user channels | Gateway polls topic partitions |
| Message retention | None (if subscriber is offline, message is lost) | Configurable retention (hours/days) |
| Throughput | ~100K msg/sec per shard | ~100K msg/sec per partition |
| Operational complexity | Simple (no ZooKeeper, no offsets) | Complex (partition rebalancing, consumer groups) |
Decision: Redis Pub/Sub for real-time routing, Kafka for durable persistence pipeline.
This is a two-tier architecture. Redis Pub/Sub is the "hot path" — when User A sends a message to User B who is currently online, we need sub-millisecond routing from Gateway-1 to Gateway-2. Redis delivers this because it is purely in-memory with no disk I/O.
But Redis Pub/Sub is lossy: if a gateway crashes mid-delivery, the message vanishes from the pub/sub channel. This is acceptable because Kafka is the "cold path" — every message is also written to Kafka (partitioned by conversation_id) for durable persistence to Cassandra. If Redis fails to deliver, the recipient pulls missed messages from Cassandra on their next sync.
This separation gives us: sub-millisecond delivery for the happy path (Redis) + zero message loss guarantee (Kafka + Cassandra).
Why Partition by conversation_id, Not user_id?
| Criterion | Partition by conversation_id | Partition by user_id (recipient inbox) |
|---|---|---|
| Write amplification | 1 write per message (single partition) | N writes per group message (one per member) |
| Read pattern: "Load conversation" | Single partition scan (fast) | Scatter-gather across sender partitions (slow) |
| Read pattern: "Load all conversations" | Requires secondary index or separate inbox table | Single partition scan (fast) |
| Group messages | 1 write regardless of group size | 1,024 writes for a 1,024-member group |
| Hot partition risk | Celebrity conversations (millions of fans) | Celebrity user partitions |
| Ordering guarantee | Natural: all messages in same partition, sorted by time | Requires cross-partition coordination |
Decision: Primary partition by conversation_id with a lightweight inbox index.
The key insight is that the dominant read pattern in chat is "open a conversation and scroll through messages" — this is a range scan within a single partition. Partitioning by user_id would make this a scatter-gather across potentially thousands of partitions (one per conversation the user is in).
The tradeoff: listing all conversations for a user requires a separate lightweight index table (user_conversations) that maps user_id -> [conversation_id, last_message_preview, unread_count]. This table is small (one row per conversation, not per message) and handles the "conversation list" screen efficiently.
For group messages, this decision saves massive write amplification: a message to a 1,024-member group is 1 write (to the conversation partition) instead of 1,024 writes (to each member's inbox).
High-Level Architecture
Backend Services Explained
WebSocket Gateway Service
The gateway is the edge of the system — it maintains persistent WebSocket connections with every online client. Each gateway server handles approximately 100,000 concurrent connections using Linux epoll for event-driven I/O. The gateway is intentionally "dumb": it does not process business logic. Its responsibilities are: (1) authenticate the WebSocket handshake via JWT, (2) register the connection in the session store (user_id -> gateway_id), (3) forward incoming messages to the Message Service, and (4) push outbound messages to the client's socket. This separation means we can scale gateways independently of application logic, and a gateway crash only requires clients to reconnect (no state is lost server-side).
Message Service
The Message Service is the brain of the system. When it receives a message from a gateway, it performs: (1) validation (message size, rate limiting, spam detection), (2) Snowflake ID generation for global ordering, (3) encryption verification (ensuring the payload is E2E encrypted), (4) session store lookup to determine where the recipient is connected, (5) publishing to Redis Pub/Sub for real-time delivery, and (6) writing to Kafka for durable persistence. The service is stateless and horizontally scalable behind an internal load balancer.
Presence Service
Presence answers the question "is this user online?" at scale. Every connected client sends a heartbeat every 30 seconds. The Presence Service maintains an in-memory map of user_id -> {status, last_seen, gateway_id}. For efficiency, it uses a gossip protocol between instances to propagate state changes without a central coordinator. A user is marked "offline" if no heartbeat is received for 90 seconds (3 missed heartbeats). Typing indicators flow through the same service but are ephemeral — they are never persisted and only forwarded to the other participant(s) in the conversation.
Group Service
The Group Service manages group membership and implements the fan-out strategy. For small groups (2-100 members), it uses write fan-out: the message is routed individually to each member's gateway via Redis Pub/Sub. For large groups (101-1,024 members), it uses read fan-out: the message is written once to the group conversation partition in Cassandra, and members pull it on their next sync. The service maintains a membership cache in Redis to avoid hitting PostgreSQL on every group message.
Push Notification Service
When the session store indicates a recipient is offline (no active WebSocket), the Push Notification Service takes over. It formats platform-specific payloads (APNs for iOS, FCM for Android, Web Push for browsers) and delivers them with exponential backoff. It respects user notification preferences (mute, DND schedules) and implements priority levels (high for direct messages, normal for group messages). To avoid notification storms, it batches multiple messages into a single notification with a count badge after a 2-second debounce window.
Session Store (Redis)
A dedicated Redis cluster that maps user_id -> [gateway_id, connection_id, device_type, connected_at]. This is the routing table for the entire system. When a message arrives for a user, the Message Service queries this store to determine which gateway holds the recipient's connection. For multi-device users, there are multiple entries per user_id, and the message is delivered to all active sessions.
Architecture Flow
Scenario: Alice in Tokyo sends "Hey, want to grab dinner?" to Bob in NYC
T+0ms Alice types message, client encrypts with Bob's public key (Signal Protocol)
Client generates client-side message ID (UUID) for idempotency
T+1ms Encrypted payload sent over Alice's WebSocket to Gateway-47 (Tokyo DC)
Frame: [opcode=0x1, payload={"to":"bob_123", "conv":"conv_abc", "body":"<encrypted>", "client_id":"uuid-xyz"}]
T+2ms Gateway-47 authenticates frame (JWT session valid)
Forwards to Message Service via gRPC
T+5ms Message Service:
├─ Validates: message < 64KB, Alice not rate-limited (100 msg/min)
├─ Generates Snowflake ID: timestamp(41bit) + datacenter(5bit) + machine(5bit) + sequence(12bit)
├─ Deduplication check: client_id "uuid-xyz" not in recent 5-min cache
└─ Publishes to Kafka topic "messages" partition = hash(conv_abc) % 290
T+7ms Message Service queries Session Store:
GET user:bob_123 → {"gateway": "gateway-1201", "dc": "us-east-1"}
T+8ms Message Service publishes to Redis Pub/Sub:
PUBLISH channel:gateway-1201 {"msg_id":"sf-123", "to":"bob_123", "body":"<encrypted>"}
T+9ms Redis routes to Gateway-1201 (NYC DC) subscriber
Gateway-1201 finds Bob's WebSocket in its local connection map
T+10ms Gateway-1201 pushes message frame to Bob's WebSocket
Bob's client decrypts with private key, renders in UI
T+11ms Bob's client sends delivery ACK back through WebSocket
ACK propagates: Gateway-1201 → Message Service → status update
T+15ms Message Service publishes delivery receipt to Alice via same Redis path
Alice sees double-tick (✓✓ delivered)
T+200ms Kafka consumer persists message to Cassandra (async, non-blocking)
INSERT INTO messages (conversation_id, message_id, sender, body, status)
VALUES ('conv_abc', 'sf-123', 'alice_456', '<encrypted>', 'delivered')
Total end-to-end latency: ~10ms (both users online, cross-continent)
Scenario: Alice sends a message but Bob is OFFLINE
T+0ms Same flow as above through T+7ms
T+7ms Message Service queries Session Store:
GET user:bob_123 → NULL (no active connection)
T+8ms Message Service marks message status as "sent" (single tick ✓)
Sends ACK to Alice: "message accepted, recipient offline"
T+9ms Message Service enqueues push notification request:
→ Push Service: {to: "bob_123", title: "Alice", body: "New message", badge: 3}
T+12ms Push Service queries Bob's device tokens from PostgreSQL
Sends APNs payload to Bob's iPhone
T+200ms Kafka consumer persists message to Cassandra with status="sent"
... 2 hours later ...
T+2h Bob opens app, WebSocket connects to Gateway-892
Gateway-892 registers: SET user:bob_123 → {"gateway": "gateway-892"}
T+2h+1ms Bob's client sends sync request:
{"action": "sync", "last_seen_msg_id": "sf-100"}
T+2h+3ms Message Service queries Cassandra:
SELECT * FROM messages WHERE conversation_id IN (bob's conversations)
AND message_id > 'sf-100' LIMIT 50
T+2h+8ms Returns batch of missed messages (paginated)
Bob's client decrypts and renders all missed messages
Sends delivery ACKs for each → Alice sees ✓✓ for all
Scenario: Group message fan-out (50-member group)
T+0ms Alice sends message to group "Weekend Hiking" (50 members)
Encrypted with group shared key (sender keys protocol)
T+5ms Message Service receives, generates Snowflake ID
Queries Group Service: GET group:hiking_789 → [50 member user_ids]
T+6ms Group Service determines: 50 members → WRITE FAN-OUT strategy
T+7ms Session Store batch query: MGET user:m1, user:m2, ... user:m50
Result: 32 online (on various gateways), 18 offline
T+8ms Redis Pub/Sub fan-out to online members:
├─ PUBLISH channel:gateway-47 (Alice + 5 others on this gateway)
├─ PUBLISH channel:gateway-201 (8 members)
├─ PUBLISH channel:gateway-892 (7 members)
├─ PUBLISH channel:gateway-1201 (6 members)
└─ PUBLISH channel:gateway-550 (6 members)
T+10ms All 32 online members receive message via WebSocket
T+12ms Push Service: batch notification to 18 offline members
Grouped by platform: 11 APNs, 7 FCM
T+200ms Single write to Cassandra (conversation partition for hiking_789)
NOT 50 writes — message stored once, all members read from same partition
Failure & Recovery Scenarios
Scenario 1: WebSocket Gateway Dies (Gateway-47 crashes)
Impact: 100,000 users on Gateway-47 lose their WebSocket connections immediately.
Detection: The L4 load balancer detects failed health checks within 5 seconds. The Session Store entries for those users have a TTL of 60 seconds and will auto-expire.
Recovery:
- Clients detect connection drop (TCP RST or timeout) within 1-3 seconds.
- Clients initiate reconnection with exponential backoff: 1s, 2s, 4s, 8s (+ 20% jitter).
- Load balancer routes reconnections to healthy gateways (Gateway-48, 49, etc.).
- New gateway registers fresh session in Redis Session Store.
- Client sends sync request with
last_seen_msg_idto pull any messages missed during the 5-10 second reconnection window. - Messages sent TO these users during the gap were published to Redis Pub/Sub for the dead gateway — they were lost from the real-time path. But Kafka persisted them, so the sync request retrieves them from Cassandra.
Key insight: No messages are lost because the persistence path (Kafka -> Cassandra) is independent of the delivery path (Redis Pub/Sub -> Gateway). The real-time path is best-effort; the storage path is durable.
Scenario 2: Redis Pub/Sub Shard Fails
Impact: Messages destined for gateways subscribed to the failed shard are not delivered in real-time. Users appear to not receive messages for 10-30 seconds.
Detection: Gateway health checks to Redis fail. Circuit breaker opens after 3 consecutive failures (500ms timeout each).
Recovery:
- Circuit breaker opens — gateways stop publishing to the failed shard.
- Fallback path activates: Message Service writes a "pending delivery" record to a separate Redis list (different shard) for each affected user.
- Gateways poll the pending delivery list every 5 seconds as a fallback.
- Once the Redis shard recovers (typically auto-failover via Redis Sentinel in 10-30s), subscriptions are re-established.
- Any messages in the pending delivery list are drained.
Why not Kafka for real-time routing? Even during Redis failure, the added latency is only 5 seconds (polling interval) vs. Kafka's 2-10ms per message under normal conditions. The failure case is rare enough that the simpler architecture wins.
Scenario 3: Cassandra Cluster Degradation (1 node down in 3-node replica set)
Impact: With LOCAL_QUORUM consistency (2 of 3 replicas must ACK), losing 1 node means writes still succeed (2 remaining nodes can form quorum). BUT if a second node in the same replica set fails, writes to those partitions fail.
Detection: Cassandra gossip protocol detects node failure within 10 seconds. Monitoring alerts fire.
Recovery:
- Single node failure: No user-visible impact. Reads and writes continue at
LOCAL_QUORUM. - Cassandra automatically streams data to replacement node (or the failed node when it recovers).
- Kafka consumers that fail to write to Cassandra will retry with backoff. Kafka retention (72 hours) provides a buffer — no messages are lost even if Cassandra is fully down for hours.
- If multiple nodes fail: Message Service detects write failures, activates "degraded mode" — messages are kept in Kafka and delivered in real-time via Redis Pub/Sub, but not persisted. A banner appears in client: "Messages may not be saved to history."
Scenario 4: Full Datacenter Failure
Impact: All users connected to gateways in the failed DC lose connections. Cross-DC routing via Redis Pub/Sub is interrupted.
Recovery:
- DNS failover (or Anycast BGP withdrawal) redirects clients to nearest healthy DC within 30-60 seconds.
- Clients reconnect to gateways in the healthy DC.
- Cassandra's multi-DC replication means message history is available from the other DC (assuming
EACH_QUORUMwas not used for writes). - Kafka MirrorMaker ensures unprocessed messages in the failed DC's Kafka are replicated to the surviving DC.
Data Model
Messages Table (Cassandra)
CREATE TABLE messages (
conversation_id UUID, -- Partition key
message_id BIGINT, -- Clustering key (Snowflake ID, sorts by time)
sender_id BIGINT,
message_type TEXT, -- 'text', 'image', 'video', 'audio', 'document'
body BLOB, -- E2E encrypted payload
media_url TEXT, -- S3 presigned URL (null for text)
media_thumbnail BLOB, -- Encrypted thumbnail (< 5KB)
reply_to BIGINT, -- Snowflake ID of parent message (null if not reply)
status TEXT, -- 'sent', 'delivered', 'read'
created_at TIMESTAMP,
expires_at TIMESTAMP, -- For disappearing messages
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': 1, 'compaction_window_unit': 'DAYS'}
AND gc_grace_seconds = 864000
AND default_time_to_live = 2592000; -- 30 days server-side retention
User Conversations Index (Cassandra)
CREATE TABLE user_conversations (
user_id BIGINT, -- Partition key
last_activity_at TIMESTAMP, -- Clustering key (most recent first)
conversation_id UUID,
conversation_type TEXT, -- '1:1', 'group'
conversation_name TEXT, -- Group name or other user's name
last_message_preview TEXT, -- Truncated, encrypted
unread_count INT,
is_muted BOOLEAN,
PRIMARY KEY (user_id, last_activity_at)
) WITH CLUSTERING ORDER BY (last_activity_at DESC);
Users Table (PostgreSQL)
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
phone_number VARCHAR(20) UNIQUE NOT NULL,
display_name VARCHAR(100),
avatar_url TEXT,
public_key BYTEA, -- Signal Protocol identity key
created_at TIMESTAMP DEFAULT NOW(),
last_seen_at TIMESTAMP,
status_text VARCHAR(500)
);
CREATE INDEX idx_users_phone ON users(phone_number);
Groups Table (PostgreSQL)
CREATE TABLE groups (
group_id UUID PRIMARY KEY,
conversation_id UUID NOT NULL REFERENCES conversations(conversation_id),
name VARCHAR(200) NOT NULL,
description TEXT,
avatar_url TEXT,
created_by BIGINT REFERENCES users(user_id),
max_members INT DEFAULT 1024,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE group_members (
group_id UUID REFERENCES groups(group_id),
user_id BIGINT REFERENCES users(user_id),
role VARCHAR(20) DEFAULT 'member', -- 'admin', 'member'
joined_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (group_id, user_id)
);
CREATE INDEX idx_group_members_user ON group_members(user_id);
Session Store (Redis)
# Active session mapping
KEY: session:user:{user_id}:{device_id}
VALUE: {"gateway_id": "gw-47", "connected_at": 1698234567, "device_type": "ios"}
TTL: 300 seconds (refreshed on every heartbeat)
# Presence
KEY: presence:{user_id}
VALUE: {"status": "online", "last_seen": 1698234567}
TTL: 90 seconds (3 missed heartbeats = offline)
# Group membership cache
KEY: group:members:{group_id}
VALUE: SET of user_ids
TTL: 3600 seconds (1 hour, invalidated on membership change)
# Message deduplication
KEY: dedup:{client_message_id}
VALUE: server_message_id (Snowflake ID)
TTL: 300 seconds (5-minute dedup window)
Algorithms Under the Hood
Snowflake ID Generation (Message Ordering)
class SnowflakeIDGenerator:
"""
64-bit ID: [1 bit unused][41 bits timestamp][5 bits datacenter][5 bits machine][12 bits sequence]
Properties:
- Time-ordered: IDs generated later are always larger
- Unique across all machines: datacenter + machine bits prevent collision
- High throughput: 4096 IDs per millisecond per machine (12-bit sequence)
- No coordination: each machine generates independently
Why not UUID? UUIDs are 128 bits (wastes storage), not time-ordered (can't sort),
and random (destroys Cassandra write locality in LSM-tree).
"""
EPOCH = 1288834974657 # Custom epoch (Nov 04, 2010) to extend 41-bit range
DATACENTER_BITS = 5 # 32 datacenters
MACHINE_BITS = 5 # 32 machines per DC
SEQUENCE_BITS = 12 # 4096 per ms per machine
def __init__(self, datacenter_id: int, machine_id: int):
self.datacenter_id = datacenter_id
self.machine_id = machine_id
self.sequence = 0
self.last_timestamp = -1
def next_id(self) -> int:
timestamp = current_time_ms() - self.EPOCH
if timestamp == self.last_timestamp:
# Same millisecond: increment sequence
self.sequence = (self.sequence + 1) & 0xFFF # 4095 max
if self.sequence == 0:
# Sequence exhausted: wait for next millisecond
timestamp = self._wait_next_ms(timestamp)
else:
self.sequence = 0 # New millisecond: reset sequence
self.last_timestamp = timestamp
# Bit-pack: timestamp(41) | datacenter(5) | machine(5) | sequence(12)
return (
(timestamp << 22) |
(self.datacenter_id << 17) |
(self.machine_id << 12) |
self.sequence
)
def _wait_next_ms(self, current_ts: int) -> int:
"""Spin-wait until clock advances (handles clock skew)."""
ts = current_time_ms() - self.EPOCH
while ts <= current_ts:
ts = current_time_ms() - self.EPOCH
return ts
WebSocket Connection Routing
class ConnectionRouter:
"""
Routes a message to the correct gateway holding the recipient's WebSocket.
Architecture:
- Session Store (Redis): user_id -> gateway_id mapping
- Redis Pub/Sub: gateway subscribes to its own channel
- Fallback: if real-time delivery fails, message persists in Cassandra for sync
"""
def __init__(self, session_store: Redis, pubsub: Redis, push_service: PushService):
self.session_store = session_store
self.pubsub = pubsub
self.push_service = push_service
async def route_message(self, message: Message, recipient_id: str) -> DeliveryStatus:
# Step 1: Find recipient's active sessions
sessions = await self.session_store.smembers(f"session:user:{recipient_id}")
if not sessions:
# User is offline → push notification + persist for later sync
await self.push_service.send_notification(
user_id=recipient_id,
title=message.sender_name,
body="New message",
data={"conversation_id": message.conversation_id}
)
return DeliveryStatus.SENT # Single tick ✓
# Step 2: Publish to each gateway holding a session (multi-device)
delivered_count = 0
for session in sessions:
gateway_id = session["gateway_id"]
try:
await self.pubsub.publish(
channel=f"channel:{gateway_id}",
message=serialize(message, recipient_id, session["device_id"])
)
delivered_count += 1
except RedisError:
# Circuit breaker: gateway might be dead
# Message will be picked up via Cassandra sync
logger.warning(f"Failed to route to {gateway_id}")
if delivered_count > 0:
return DeliveryStatus.DELIVERED # Double tick ✓✓
else:
# All gateways failed — treat as offline
await self.push_service.send_notification(recipient_id, message)
return DeliveryStatus.SENT
async def handle_delivery_ack(self, message_id: str, user_id: str):
"""Called when client ACKs receipt. Updates status and notifies sender."""
await self.session_store.set(
f"delivered:{message_id}:{user_id}", "1", ex=86400
)
# Notify sender of delivery (double tick)
sender_id = await self.get_sender(message_id)
await self.route_message(
DeliveryReceipt(message_id=message_id, status="delivered"),
recipient_id=sender_id
)
Group Message Fan-out
class GroupFanoutService:
"""
Hybrid fan-out strategy:
- Small groups (≤ 100): Write fan-out (push to each member individually)
- Large groups (> 100): Read fan-out (write once, members pull on sync)
Why hybrid?
- 99% of groups have < 50 members → write fan-out gives instant delivery
- 1% large groups would cause 1000+ Redis publishes per message → read fan-out avoids this
- Threshold is tunable based on real-time delivery SLA vs. write amplification cost
"""
FANOUT_THRESHOLD = 100 # Groups larger than this use read fan-out
async def fan_out(self, message: Message, group_id: str) -> FanoutResult:
# Step 1: Get group members (cached in Redis, TTL 1 hour)
members = await self.get_group_members(group_id)
group_size = len(members)
if group_size <= self.FANOUT_THRESHOLD:
return await self._write_fanout(message, members)
else:
return await self._read_fanout(message, group_id, members)
async def _write_fanout(self, message: Message, members: List[str]) -> FanoutResult:
"""Push message individually to each member. O(N) publishes."""
online_members = []
offline_members = []
# Batch lookup: which members are online?
sessions = await self.session_store.mget(
[f"session:user:{m}" for m in members]
)
for member_id, session in zip(members, sessions):
if member_id == message.sender_id:
continue # Don't deliver to sender
if session:
online_members.append((member_id, session))
else:
offline_members.append(member_id)
# Fan-out to online members via Redis Pub/Sub
# Group by gateway to reduce publish calls
gateway_groups = defaultdict(list)
for member_id, session in online_members:
gateway_groups[session["gateway_id"]].append(member_id)
for gateway_id, recipient_ids in gateway_groups.items():
# Single publish per gateway with list of recipients
await self.pubsub.publish(
channel=f"channel:{gateway_id}",
message=serialize_group_message(message, recipient_ids)
)
# Batch push notifications for offline members
if offline_members:
await self.push_service.send_batch(
user_ids=offline_members,
title=f"{message.sender_name} in {message.group_name}",
body="New message",
collapse_key=f"group:{message.group_id}" # Collapse multiple into one notification
)
return FanoutResult(
online_delivered=len(online_members),
push_sent=len(offline_members),
strategy="write_fanout"
)
async def _read_fanout(self, message: Message, group_id: str, members: List[str]) -> FanoutResult:
"""Write once to group conversation. Members pull on next sync. O(1) write."""
# Message is already being written to Cassandra via Kafka (normal persistence path)
# For online members, we still send a lightweight "new message" notification
# so they know to pull from Cassandra
# Send lightweight "nudge" (not the full message) to online members
online_count = 0
for batch in chunk(members, 100):
sessions = await self.session_store.mget(
[f"session:user:{m}" for m in batch]
)
for member_id, session in zip(batch, sessions):
if session and member_id != message.sender_id:
await self.pubsub.publish(
channel=f"channel:{session['gateway_id']}",
message=serialize_nudge(group_id, message.message_id)
)
online_count += 1
# Client receives nudge → fetches latest messages from Cassandra
# This adds ~50ms latency vs. write fan-out but prevents write amplification
return FanoutResult(
online_delivered=online_count,
push_sent=0, # Large groups: rely on nudge, not individual push
strategy="read_fanout"
)
Scaling Considerations
| Challenge | Solution | Why It Works |
|---|---|---|
| 170M concurrent WebSockets | 1,700+ gateway servers, each handling 100K connections via epoll | Linux kernel supports 1M+ file descriptors; 100K is conservative to leave headroom for CPU and memory |
| Cross-gateway message routing | Sharded Redis Pub/Sub (18 shards, consistent hash ring) | Each gateway subscribes to its own channel; routing is O(1) lookup + O(1) publish |
| Message ordering | Kafka partition by conversation_id + Snowflake IDs | Single partition = single consumer = total order; Snowflake provides wall-clock correlation |
| Thundering herd on recovery | Exponential backoff with jitter (base=1s, max=300s, jitter=20%) | Spreads 170M reconnections over ~5 minutes instead of 1 second |
| Hot conversations | Rate limiting per conversation (1000 msg/min) + connection-level throttling | Prevents viral group chats from overwhelming a single Cassandra partition |
| Media messages | Presigned S3 URLs + CDN; only URL reference stored in message | Keeps message path fast (200 bytes); media uploaded/downloaded directly to S3 |
| Read receipts at scale | Batched status updates (aggregate per conversation, flush every 500ms) | Without batching: 580K receipts/sec doubles write load; batching reduces to ~50K/sec |
| Multi-datacenter | Active-active with Cassandra multi-DC replication + Redis Cluster per DC | Users connect to nearest DC; messages replicate async (50-100ms cross-DC) |
| Offline sync storm | Cursor-based pagination (50 messages/batch) + server-side rate limiting (10 batches/sec) | Prevents a user with 10K missed messages from saturating Cassandra reads |
| Group membership changes | Invalidate Redis cache + Kafka event for async propagation | Membership is read-heavy (every group message) but write-rare (joins/leaves) |
| End-to-end encryption key distribution | Prekey bundles stored in PostgreSQL; fetched once per new conversation | Signal Protocol X3DH: sender fetches recipient's prekeys, establishes shared secret without interaction |
| Message search | Client-side index (E2E encrypted messages can't be server-searched); optional server-side for enterprise (Slack model) | Privacy vs. functionality tradeoff; WhatsApp chose privacy, Slack chose searchability |
Quick Recall
| Interview Question | Concise Answer |
|---|---|
| Why WebSocket over HTTP polling? | Full-duplex, 130x less overhead (6 bytes vs 800 bytes/msg), persistent connection saves mobile battery |
| How do you route messages between gateways? | Redis Pub/Sub — each gateway subscribes to its own channel; Message Service publishes to recipient's gateway channel |
| Why Cassandra for messages? | Write-heavy workload (580K/sec), partition by conversation_id gives range-scan for conversation history, LSM-tree optimized for sequential writes |
| How do you guarantee ordering? | Per-conversation only: Kafka partition key = conversation_id; within partition, Snowflake IDs provide total order |
| What happens if Redis Pub/Sub loses a message? | No data loss — Kafka persists every message to Cassandra independently. Client syncs missed messages on next heartbeat or reconnect |
| How does group chat work at scale? | Hybrid fan-out: write fan-out for groups <= 100 (push individually), read fan-out for larger groups (write once, members pull) |
| How do offline users get messages? | Push notification (APNs/FCM) for alerting; full message retrieval via cursor-based sync from Cassandra on reconnect |
| How does presence work at 500M users? | Heartbeat every 30s + gossip protocol between presence instances; approximate (90s TTL) because exact is too expensive |
| Why Snowflake IDs instead of UUIDs? | 64 bits (vs 128), time-ordered (sortable without timestamp column), preserves Cassandra write locality in LSM-tree |
| How do you handle the thundering herd? | Exponential backoff with jitter on client reconnect; gateway admission control (503 + Retry-After); Kafka buffers messages during recovery |
| How does E2E encryption work? | Signal Protocol: X3DH key exchange + Double Ratchet for forward secrecy; server never sees plaintext; group chats use Sender Keys |
| Why partition by conversation_id not user_id? | Dominant access pattern is "load conversation history" (range scan); avoids N writes per group message; separate lightweight inbox index for conversation list |
| How do you handle media messages? | Client uploads directly to S3 (presigned URL); message contains only the S3 reference URL; CDN serves downloads; keeps message path fast |
| What is the single point of failure? | None by design — every layer is horizontally scaled and multi-DC. Closest risk: Redis Session Store (mitigated by Redis Cluster + fallback to Cassandra sync) |