Message Queues & Event Streaming
Real Incident: LinkedIn, 2010
LinkedIn's synchronous service calls created cascading failures — one slow service blocked 14 others. 30-minute outages weekly. They built Kafka to decouple services. Today Kafka processes 7 trillion messages/day at LinkedIn. The entire modern data infrastructure (real-time analytics, event sourcing, stream processing) exists because of this architectural shift.
Why This Comes Up in Every Design Interview
The moment you say "Service A needs to tell Service B something," the interviewer expects you to reason about:
- Sync vs async trade-offs
- Delivery guarantees
- Ordering requirements
- What happens when consumers are slower than producers
- How to handle poison messages
If you just say "I'll use Kafka" without reasoning through WHY and what delivery guarantees you need, you've lost points.
The Core Decision: Sync vs Async
| Aspect | Synchronous (HTTP/gRPC) | Asynchronous (Queue) |
|---|---|---|
| Coupling | Temporal + spatial (both up, both known) | Decoupled in time and space |
| Failure propagation | A fails if B fails | A succeeds even if B is down |
| Latency | Immediate response | Eventually processed |
| Throughput | Limited by slowest service | Burst absorbed by queue |
| Debugging | Simple request/response trace | Harder (async, delayed) |
| Data loss risk | Response confirms processing | Queue must be durable |
When to choose async:
- Producer doesn't need immediate confirmation of processing
- Consumer is slower than producer (rate mismatch)
- You need to survive downstream failures
- Fan-out to multiple consumers
- You need replay capability
When to stay sync:
- User expects immediate response (login, payment confirmation)
- Simple request/response (read a user profile)
- Low latency critical (< 10ms)
Two Fundamental Models
Point-to-Point (Queue)
- Message consumed by exactly ONE consumer
- Once consumed, message is gone
- Use case: Job distribution, task processing
- Example: "Process this image," "Send this email"
- Systems: RabbitMQ, SQS, Celery
Publish/Subscribe (Topic)
- Message delivered to ALL subscribers
- Each subscriber gets its own copy
- Use case: Event broadcasting, multiple consumers need same event
- Example: "Order placed" → inventory service + notification service + analytics
- Systems: Kafka, SNS, Google Pub/Sub
Kafka's Consumer Groups: The Hybrid
Kafka combines both: within a consumer group, each partition goes to one consumer (point-to-point). Across groups, all groups get all messages (pub/sub).
Delivery Guarantees — The Most Important Trade-off
| Guarantee | How It Works | Trade-off | When to Use |
|---|---|---|---|
| At-most-once | Send and forget. No retry. | Fastest. May lose messages. | Metrics, logs (loss is tolerable) |
| At-least-once | Retry until consumer ACKs. May deliver duplicates. | Safe but requires idempotent consumers. | 95% of use cases. Standard. |
| Exactly-once | Transactional produces + idempotent consumers + offset commit in same transaction | Slowest, most complex. | Financial transactions, billing. |
The practical answer for interviews:
"I'd use at-least-once delivery with idempotent consumers. Each message has a unique ID. Consumer checks if it's already processed (idempotency key in DB). This gives us effective exactly-once semantics without the complexity and throughput cost of true exactly-once."
How idempotent consumers work:
- Message arrives with ID
msg-abc-123 - Consumer checks: "Have I processed
msg-abc-123?" - If yes → ACK, skip. If no → process, record ID, ACK.
- Safe to retry delivery — duplicate is harmless.
Ordering — When It Matters and When It Doesn't
| Ordering Need | Solution | Example |
|---|---|---|
| No ordering needed | Any queue, multiple consumers | "Send welcome email" — order doesn't matter |
| Per-entity ordering | Partition by entity key | All events for user_123 in order (but user_456 can interleave) |
| Total ordering | Single partition/queue | Financial ledger, consensus log (slow — no parallelism) |
Kafka's approach: Ordering guaranteed WITHIN a partition. Use partition key to co-locate related messages.
Partition key = user_id
→ All messages for user_123 go to partition 7
→ Processed in order by one consumer
→ But user_456's messages in partition 3 are processed in parallel
Back-of-envelope for partition count:
- Target throughput: 100K messages/sec
- Single consumer processes: 5K messages/sec
- Partitions needed: 100K / 5K = 20 partitions minimum
- With headroom (2x): 40 partitions
Kafka Architecture Deep Dive
| Component | What | Why |
|---|---|---|
| Broker | Server storing message log | Distributed storage |
| Topic | Named feed ("orders", "clicks") | Logical separation |
| Partition | Ordered, immutable log within topic | Unit of parallelism |
| Offset | Position of message in partition | Enables replay, exactly-once |
| Consumer Group | Set of consumers sharing partitions | Load balancing |
| Replication Factor | Copies per partition (usually 3) | Fault tolerance |
| ISR (In-Sync Replicas) | Replicas caught up to leader | Durability guarantee |
Key design decisions:
- Log-based: Messages are APPENDED, never deleted (until retention expires). This enables replay.
- Pull-based consumers: Consumers pull at their own pace (vs push). Natural backpressure.
- Sequential I/O: Appending to log = sequential disk writes = incredibly fast (even faster than random SSD).
Why Kafka is fast (interview gold):
- Sequential disk I/O (600MB/s vs 100MB/s random)
- Zero-copy (sendfile syscall — data goes from disk to network without copying to user space)
- Batch compression (many messages compressed together)
- Page cache utilization (OS caches frequently read segments)
Back-of-Envelope: Kafka Capacity
Scenario: Design messaging for a social media platform, 500M daily active users, each generates ~20 events/day.
| Parameter | Calculation | Value |
|---|---|---|
| Daily messages | 500M × 20 | 10 billion |
| Messages/sec (average) | 10B / 86400 | ~115K/sec |
| Peak (10x average) | 115K × 10 | ~1.15M/sec |
| Message size | Average event payload | ~500 bytes |
| Daily data volume | 10B × 500B | ~5 TB/day |
| Retention (7 days) | 5TB × 7 | ~35 TB storage |
| Replication (RF=3) | 35TB × 3 | ~105 TB total |
| Brokers needed | 105TB / 10TB per broker | ~11 brokers |
| Partitions (for 1.15M/sec, 10K/partition) | 1.15M / 10K | ~115 partitions |
Handling Failures: Dead Letter Queues & Retry
What happens when processing fails?
| Strategy | How | When |
|---|---|---|
| Immediate retry | Retry 3 times with backoff | Transient failures (network blip) |
| Dead Letter Queue (DLQ) | After N retries, move to separate queue | Poison messages, persistent failures |
| Delay queue | Retry after T seconds/minutes | Rate limits, temporary downstream outage |
| Human review | Alert + DLQ inspection tools | Business-critical failures |
Poison message problem: One bad message blocks the entire queue (head-of-line blocking). DLQ solves this — move it aside, continue processing.
Backpressure — When Producers Outrun Consumers
| Strategy | How | Trade-off |
|---|---|---|
| Queue depth limit | Reject new messages when queue is full | Producer gets error, must handle |
| Consumer scaling | Auto-scale consumers based on lag | Cost, spin-up time |
| Sampling/dropping | Drop low-priority messages | Data loss (acceptable for metrics) |
| Producer rate limiting | Throttle producer | Producer blocked |
| Kafka approach | Consumers pull at own pace, messages retained for days | Need enough disk for retention |
System Comparison: When to Use What
| Criteria | Kafka | RabbitMQ | SQS | Pulsar |
|---|---|---|---|---|
| Throughput | Millions/sec | 10-50K/sec | Unlimited (managed) | Millions/sec |
| Ordering | Per-partition | Per-queue | Best-effort (FIFO available) | Per-partition |
| Retention | Days/weeks (configurable) | Until consumed | 14 days max | Tiered (infinite) |
| Replay | Yes (offset reset) | No (consumed = gone) | No | Yes |
| Delivery | At-least-once (exactly-once available) | At-least-once | At-least-once | At-least-once |
| Use case | Event streaming, data pipeline | Task queue, RPC | AWS serverless, simple | Multi-tenant, geo-replicated |
| Operations | Complex (ZK/KRaft, brokers) | Moderate | Zero (managed) | Complex |
Decision framework:
- Need replay? → Kafka or Pulsar
- Simple job queue on AWS? → SQS
- Complex routing (priority, routing keys)? → RabbitMQ
- Multi-region, multi-tenant? → Pulsar
- Don't want to manage infrastructure? → SQS or managed Kafka (MSK, Confluent)
Event Sourcing & CQRS (Advanced Pattern)
Event Sourcing: Store every state change as an immutable event, not just current state.
| Traditional | Event Sourced |
|---|---|
UPDATE account SET balance = 500 | Event: "Withdrawn $100 at 3:04pm" |
| Current state only | Full history, audit trail |
| Hard to debug "how did we get here?" | Replay events to reconstruct any point in time |
CQRS (Command Query Responsibility Segregation): Separate write model (events/commands) from read model (materialized views). Write to Kafka → materialize into read-optimized DB.
Used by: Banking systems, audit trails, collaborative editing, gaming (replay systems).
Interview Framework
When asked "How do services communicate in your design?":
Step 1: "For this interaction, I need to decide sync vs async. [Service B] doesn't need to respond in real-time to [Service A], and we need to survive B being down, so I'd use async messaging."
Step 2: "I'd use [Kafka/SQS] with [at-least-once] delivery. Consumers are idempotent using a message ID stored in the processing DB."
Step 3: "For ordering, I'd partition by [entity_id] so all events for one entity are strictly ordered. With [N] partitions, I can parallelize to [N] consumers."
Step 4: "For failures — retry 3x with exponential backoff, then DLQ. We monitor DLQ depth and alert if it grows."
Step 5: "Back-of-envelope: [X] messages/sec, [Y] bytes each, [Z] retention = [W] storage needed."
Quick Recall
| Question | Answer |
|---|---|
| When async over sync? | Producer doesn't need immediate response, need to survive downstream failure, rate mismatch |
| Ordering guarantee? | Within partition only. Partition by entity key for per-entity ordering. |
| At-least-once vs exactly-once? | At-least-once + idempotent consumer is the practical standard |
| Kafka vs RabbitMQ? | Kafka = event streaming, replay, high throughput. RabbitMQ = task queue, routing. |
| Partition count? | Target throughput / per-consumer throughput (with 2x headroom) |
| Dead Letter Queue? | Where failed messages go after N retries — prevents head-of-line blocking |
| Why Kafka is fast? | Sequential I/O, zero-copy, batching, page cache |
| Consumer group? | Load-balances partitions across consumers. Max consumers = partition count. |