Distributed Caching
Real Incident: Instagram, 2018
One Memcached node dies. 50M requests/sec that were served from cache slam into PostgreSQL. DB CPU hits 100% in 3 seconds. Connection pools exhaust in 5. 1 billion users see errors. Fix: single cache node restart. Downtime: 4 minutes. Cost: ~$150K. That cache node? $0.12/hour.
The 30-Second Explanation
Cache = a fast storage layer (memory) that sits between your application and the slow storage layer (database).
Speed
Redis: 0.1ms. Database: 5-50ms. 50-500x faster.
Cost
$0.12/hr cache vs. $10K/hr DB scale-up
Protection
Cache absorbs traffic spikes DB can't handle
The key insight: A cache isn't just for speed — it's a shield protecting your database from traffic it was never designed to handle.
Caching Strategies
Read Strategies
| Strategy | How | Best For |
|---|---|---|
| Cache-Aside (Lazy) | App checks cache → miss → reads DB → writes to cache | Most common. Simple. App controls. |
| Read-Through | Cache itself fetches from DB on miss | Cleaner app code. Cache handles logic. |
Write Strategies
| Strategy | How | Trade-off |
|---|---|---|
| Write-Through | Write to cache AND DB synchronously | Strong consistency, higher write latency |
| Write-Behind (Write-Back) | Write to cache, async flush to DB | Fast writes, risk of data loss on crash |
| Write-Around | Write directly to DB, skip cache | Avoids cache pollution for rarely-read data |
The Most Common Pattern (FAANG Default)
Cache-Aside + Write-Around — Read from cache (populate on miss), write directly to DB, invalidate cache on write.
Interview Gold
"I'd use cache-aside with write-around. On read miss, fetch from DB and populate cache with a TTL. On write, update DB and invalidate the cache entry. This avoids stale data AND cache pollution from write-heavy keys that nobody reads."
Cache Invalidation — The Hardest Problem
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton
| Strategy | How | When |
|---|---|---|
| TTL-based | Entry expires after N seconds | Default. Works for most use cases. |
| Event-based | DB write triggers cache delete | Strong consistency needs |
| Version-based | Key includes version number | When you need atomic updates |
The Facebook approach: On write, DELETE the cache key (don't update it). Next read will miss and repopulate from DB. This avoids race conditions between concurrent writes.
Cache Failure Patterns
| Pattern | What Happens | Solution |
|---|---|---|
| Cache Stampede | Popular key expires → 1000 threads all hit DB simultaneously | Lock (only one thread fetches), stale-while-revalidate |
| Cache Penetration | Requests for keys that DON'T EXIST → always miss, always hit DB | Bloom filter, cache null results with short TTL |
| Cache Avalanche | Many keys expire at same time → massive DB load spike | Randomize TTLs, stagger expiration |
| Hot Key | One key gets millions of reads → single cache node overwhelmed | Replicate hot keys across multiple nodes, local cache |
Local Cache vs. Distributed Cache
| Aspect | Local (in-process) | Distributed (Redis/Memcached) |
|---|---|---|
| Latency | ~1μs (memory access) | ~0.1-1ms (network hop) |
| Shared across servers | No (each server has own copy) | Yes |
| Consistency | Hard (invalidation across processes) | Easier (single source) |
| Size | Limited by server RAM | Can scale independently |
| Best for | Config, hot reference data | Session, user data, query results |
Multi-tier approach (Netflix, Uber): L1 = local cache (Caffeine/Guava, 30s TTL) → L2 = distributed cache (Redis, 5min TTL) → L3 = database.
Eviction Policies
| Policy | How | Best For |
|---|---|---|
| LRU | Remove least recently used | General purpose (most common) |
| LFU | Remove least frequently used | Stable hot set (trending content) |
| TTL | Remove expired entries | Time-sensitive data |
| Random | Remove random entry | Surprisingly effective, zero overhead |
| FIFO | Remove oldest entry | Simple, when access pattern is flat |
Real Systems
| Company | Stack | Notable |
|---|---|---|
| Memcached (TAO) | Billions of queries/sec. Lease-based invalidation. | |
| Redis + Memcached | Timeline cache. Pre-computed feeds. | |
| Netflix | EVCache (Memcached) | Multi-region replication. 30M ops/sec. |
| Uber | Redis + Docstore cache | Geospatial caching for nearby drivers. |
| Amazon | ElastiCache (Redis/Memcached) | Session store, product catalog cache. |
Redis vs. Memcached
| Aspect | Redis | Memcached |
|---|---|---|
| Data structures | Strings, Lists, Sets, Hashes, Sorted Sets | Strings only |
| Persistence | Optional (RDB/AOF) | None (pure cache) |
| Clustering | Built-in (Redis Cluster) | Client-side sharding |
| Memory efficiency | Higher overhead per key | More memory-efficient |
| Use when | Need data structures, persistence, pub/sub | Pure key-value cache at massive scale |
The 3 Mistakes That Get You Rejected
Don't Say These
- "Cache everything" — Caching write-heavy, rarely-read data wastes memory and adds consistency complexity. Cache what's read-heavy.
- "Just set TTL to 24 hours" — Long TTLs mean stale data. Short TTLs mean more DB hits. The right TTL depends on data's change frequency and staleness tolerance.
- "The cache can never go down" — It will. Discuss graceful degradation: circuit breaker to DB, connection pooling, backpressure.
Interview Answer Template
"For [system], I'd use [Redis/Memcached] as a distributed cache with [cache-aside/read-through] strategy. Write path: [write-through/write-around] with cache invalidation on write. TTL: [N]s based on [data staleness tolerance]. For cache stampede protection, I'd use [mutex/stale-while-revalidate]. For hot keys, I'd [replicate/local cache]."
Quick Recall Card
| Question | Answer |
|---|---|
| Most common pattern? | Cache-aside + write-around + TTL |
| Cache stampede? | Popular key expires → thundering herd. Fix: lock or stale-while-revalidate |
| Cache penetration? | Non-existent keys. Fix: bloom filter or cache nulls |
| Redis vs Memcached? | Redis = data structures + persistence. Memcached = simple, faster for strings |
| Best eviction? | LRU (general purpose) |
| Multi-tier? | L1 local (μs) → L2 distributed (ms) → L3 database |
See Also
- Redis — Data structures, persistence, and clustering
- CAP Theorem — Consistency vs availability trade-offs
- Consistent Hashing — Distributing cache keys across nodes
- Rate Limiting — Throttling with token bucket and sliding window
- Spring Data Redis — Redis integration in Spring Boot