Event Sourcing
Real Incident: LMAX Exchange — 6 Million Transactions/Second on a Single Thread
In 2010, LMAX Exchange needed to process financial trades with microsecond latency. Traditional CRUD with a database? 10,000 ops/s max. Their solution: event sourcing with an in-memory model, replaying events at startup. The result — a single-threaded business logic processor handling 6 million transactions per second with deterministic replay for recovery. When a node crashes, it replays the event log and is back in service in seconds. Event sourcing didn't just solve their consistency problem — it gave them 600x throughput over traditional architectures.
Why This Comes Up in Interviews
Event sourcing appears in designs for financial systems, audit-heavy domains, and any system needing "time travel." Interviewers want to hear:
- Why append-only is fundamentally different from update-in-place
- How you handle event replay and system recovery
- When CQRS is needed alongside event sourcing
- Concrete trade-offs: storage cost, complexity, query patterns
- How eventual consistency manifests and how you manage it
Event Sourcing vs Traditional CRUD
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph TB
subgraph Traditional CRUD
CMD1[Transfer $100] --> DB[(Database)]
DB -->|UPDATE accounts SET balance = balance - 100| STATE1[Current State Only]
end
subgraph Event Sourcing
CMD2[Transfer $100] --> STORE[(Event Store)]
STORE --> E1[MoneyDeposited $500]
STORE --> E2[MoneyWithdrawn $200]
STORE --> E3[MoneyTransferred $100]
E1 --> PROJECTION[Current State = Replay All Events]
E2 --> PROJECTION
E3 --> PROJECTION
end
style DB fill:#e74c3c,color:#fff
style STORE fill:#27ae60,color:#fff
style PROJECTION fill:#3498db,color:#fff | Aspect | Traditional CRUD | Event Sourcing |
|---|---|---|
| Storage | Current state only | Full history of changes |
| What's saved | balance = 300 | [deposited 500, withdrew 200, transferred 100] |
| Audit trail | Must be bolted on (often incomplete) | Built-in, immutable, complete |
| Recovery | Restore from backup (point-in-time) | Replay events to any point in time |
| Schema changes | Migrate rows in-place | Old events preserved, new projections adapt |
| Debugging | "Why is balance 300?" — no idea | Replay shows exactly what happened and when |
| Storage cost | Lower (only current state) | Higher (every event stored forever) |
Core Components
| Component | Role | Example |
|---|---|---|
| Event | Immutable fact that happened | OrderPlaced { orderId: 123, items: [...], timestamp: ... } |
| Event Store | Append-only log of events | Kafka topic, EventStoreDB, DynamoDB stream |
| Aggregate | Domain entity rebuilt from events | Order aggregate replayed from its event stream |
| Projection | Read-optimized view built from events | "All orders by customer" table |
| Snapshot | Cached state at a point to avoid full replay | Order#123 state at event #5000 |
Event Store Design
| Property | Requirement | Why |
|---|---|---|
| Append-only | Events are never updated or deleted | Immutability guarantees audit + replay correctness |
| Ordered | Events within a stream have strict ordering | Replay must be deterministic |
| Durable | Events survive crashes | They are the source of truth |
| Partitioned | Events grouped by aggregate ID | Efficient replay of single entity |
Back-of-envelope: Storage sizing
| Metric | Value |
|---|---|
| Events per order | ~10 (placed, confirmed, paid, picked, shipped, delivered...) |
| Orders per day | 1 million |
| Events per day | 10 million |
| Average event size | 500 bytes |
| Daily storage | 5 GB |
| Annual storage | 1.8 TB |
| With 3x replication | 5.4 TB/year |
At ~\(0.02/GB/month for cold storage: **~\)1,300/year** — cheap for a complete audit trail.
Snapshots — Avoiding Full Replay
The problem: An aggregate with 100,000 events takes seconds to replay on every request.
The solution: Periodically save a snapshot of the current state.
| Strategy | How | When to Snapshot |
|---|---|---|
| Every N events | Snapshot after every 100 events | Simple, predictable |
| Time-based | Snapshot daily | Good for batch systems |
| On demand | Snapshot when replay > threshold | Adaptive |
Replay with snapshot:
- Load latest snapshot (state at event #9500)
- Replay only events #9501 through #9612 (112 events instead of 9612)
- Latency: 2ms instead of 200ms
CQRS Integration
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph LR
subgraph Write Side
CMD[Command] --> HANDLER[Command Handler]
HANDLER --> VALIDATE{Validate}
VALIDATE -->|OK| STORE[(Event Store)]
end
subgraph Read Side
STORE -->|Subscribe| PROJECTOR[Event Projector]
PROJECTOR --> READDB[(Read DB - Optimized)]
QUERY[Query] --> READDB
end
style STORE fill:#27ae60,color:#fff
style READDB fill:#3498db,color:#fff
style HANDLER fill:#f39c12,color:#fff | Aspect | Write Model | Read Model |
|---|---|---|
| Optimized for | Consistency, business rules | Query speed, denormalization |
| Storage | Event store (append-only) | SQL/NoSQL/Elasticsearch (per query need) |
| Schema | Events (immutable facts) | Projections (derived, rebuildable) |
| Scaling | Fewer writes than reads | Scale independently, multiple projections |
| Consistency | Strongly consistent | Eventually consistent (lag = projection delay) |
Why CQRS + Event Sourcing pair naturally: Events are great for writes (append-only, fast) but terrible for reads ("show all orders over $100 this month" requires replaying millions of events). CQRS lets you build query-optimized read models from the same event stream.
Event Versioning — Schema Evolution
Events are stored forever, but your domain evolves. How do you handle old events?
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Upcasting | Transform old events to new schema on read | Old events untouched, transparent | Upcast logic grows over time |
| Versioned events | OrderPlacedV1, OrderPlacedV2 | Explicit, clear | Multiple handler versions |
| Weak schema | Store events as flexible JSON | Forward compatible | No compile-time safety |
| Copy-transform | Migrate event store to new schema | Clean slate | Expensive, risky for large stores |
Best practice: Use upcasting for minor changes, versioned events for breaking changes. Never modify stored events.
Eventual Consistency — The Trade-off
| Scenario | Consistency Lag | Acceptable? |
|---|---|---|
| User places order, sees it in "My Orders" | 50-200ms projection lag | Usually yes (show optimistic UI) |
| Payment processed, inventory updated | 100-500ms | Yes (async is fine) |
| Admin views real-time dashboard | 1-5s | Yes (dashboards tolerate lag) |
| User withdraws money, checks balance | Must be immediate | No — use read-your-own-writes pattern |
Read-your-own-writes pattern: After a command succeeds, the response includes the new version number. Client passes this version to read queries. Read side either waits for projection to catch up or reads directly from event store.
When to Use Event Sourcing
| Good Fit | Poor Fit |
|---|---|
| Financial systems (audit trail required) | Simple CRUD apps (blog, todo list) |
| Systems needing "time travel" / replay | Systems where history has no value |
| Complex domain with many state transitions | High-write, low-read systems with no audit needs |
| Regulatory compliance (GDPR right to explanation) | Teams unfamiliar with eventual consistency |
| Event-driven microservices | Tight latency requirements on reads (without CQRS) |
Real-World Implementations
| System | How They Use Event Sourcing |
|---|---|
| LMAX | In-memory event sourcing, single-threaded replay, 6M tx/s |
| Event Store (Greg Young) | Purpose-built event store database with projections |
| Kafka | Append-only log as event store (with compaction trade-offs) |
| Axon Framework | Java framework for event sourcing + CQRS |
| DynamoDB Streams | Event stream from DynamoDB changes (CDC pattern) |
Interview Framework
When event sourcing is relevant in system design:
Step 1: "For this domain (e.g., banking, e-commerce orders), I'd use event sourcing because we need a complete audit trail and the ability to reconstruct state at any point in time."
Step 2: "Every state change is stored as an immutable event in an append-only log. The current state is derived by replaying events. For performance, I'd snapshot every 100 events to avoid full replay."
Step 3: "For reads, I'd use CQRS — project events into query-optimized read models. This gives us fast reads without compromising the write model's simplicity."
Step 4: "The trade-off is eventual consistency between write and read sides (~100-200ms lag) and increased storage. But we gain: complete audit, easy debugging, ability to rebuild read models, and replay for recovery."
Quick Recall
| Question | Answer |
|---|---|
| What is event sourcing? | Store state changes as immutable events; derive current state by replay |
| Why pair with CQRS? | Events are write-optimized; CQRS adds read-optimized projections |
| Snapshot purpose? | Avoid replaying entire event history; checkpoint state periodically |
| Biggest trade-off? | Eventual consistency on reads + storage growth + complexity |
| Event versioning? | Upcasting (transform old to new on read) or versioned event types |
| When NOT to use? | Simple CRUD, no audit needs, team inexperienced with async patterns |
| LMAX throughput? | 6 million tx/s single-threaded via in-memory event sourcing |
| Storage cost estimate? | ~1.8 TB/year for 1M orders/day at 10 events/order |