Skip to content
3 min read

System Design Estimation Cheat Sheet

Real Incident: Amazon EBS Outage, 2012

A capacity planning error for EBS volumes in US-East-1 triggered a cascading failure. The re-mirroring storm after a network event consumed more capacity than estimated, overwhelming the control plane. Reddit, Netflix, Instagram, and Pinterest went dark for hours. Root cause: the team underestimated how many volumes would simultaneously attempt recovery. Back-of-envelope math isn't optional — it's the difference between a design that survives and one that cascades.


Why This Comes Up in Every Interview

Interviewers use estimation to test:

  • Can you reason about scale without a calculator?
  • Do you know the practical limits of hardware and systems?
  • Can you identify bottlenecks BEFORE designing the solution?
  • Do you make reasonable assumptions and state them explicitly?

Every system design answer should start with: "Let me estimate the scale first."


Powers of 2 — The Foundation

Power Exact Approximate Name
2^10 1,024 ~1 Thousand 1 KB
2^20 1,048,576 ~1 Million 1 MB
2^30 1,073,741,824 ~1 Billion 1 GB
2^40 ~1.1 Trillion ~1 Trillion 1 TB
2^50 ~1.1 Quadrillion ~1 Quadrillion 1 PB

Quick conversions:

From To Multiply by
Bytes → KB ÷ 1,000 (or shift right 10 bits)
KB → MB ÷ 1,000
MB → GB ÷ 1,000
Seconds/day 86,400 ≈ ~100K
Seconds/month ~2.5M
Seconds/year ~31.5M ≈ ~30M

Latency Numbers Every Programmer Should Know

Operation Latency Notes
L1 cache reference 1 ns
L2 cache reference 4 ns
Branch mispredict 3 ns
Mutex lock/unlock 17 ns
Main memory (RAM) reference 100 ns
Compress 1KB (Snappy) 2 μs
Read 1MB sequentially from RAM 3 μs
SSD random read 16 μs
Read 1MB sequentially from SSD 49 μs
Round trip within same datacenter 500 μs 0.5 ms
Read 1MB sequentially from HDD 825 μs ~1 ms
Disk seek (HDD) 2 ms
Round trip same region (cross-AZ) 1-2 ms
Send packet CA → Netherlands → CA 150 ms Speed of light in fiber
Round trip cross-region (US→Europe) 70-150 ms
Round trip US → Asia 150-300 ms

Key takeaways for design:

  • Memory is 100x faster than SSD, 1000x faster than HDD
  • Network within datacenter (~0.5ms) is acceptable for synchronous calls
  • Cross-region (~150ms) means you MUST cache locally or accept latency
  • Sequential reads are dramatically faster than random (SSD: 20x, HDD: 100x)

Throughput Rules of Thumb

Single Server Capacity

Resource Throughput Notes
Web server (Nginx) 100K-200K RPS Static content / simple proxy
Application server 5K-50K RPS Depends on computation per request
PostgreSQL reads 10K-50K QPS With connection pooling, indexed queries
PostgreSQL writes 5K-20K TPS Depends on fsync, WAL
MySQL reads 10K-50K QPS Similar to PostgreSQL
Redis (single instance) 100K-300K ops/sec In-memory, single-threaded
Kafka (single broker) 200K-2M msgs/sec Depends on message size
Elasticsearch 5K-50K queries/sec Depends on query complexity
gRPC server 10K-100K RPS Lower latency than REST

Network Bandwidth

Connection Bandwidth Effective Throughput
1 Gbps NIC 1 Gbps ~100 MB/s effective
10 Gbps NIC 10 Gbps ~1 GB/s effective
25 Gbps NIC (cloud) 25 Gbps ~2.5 GB/s effective
Inter-AZ bandwidth Typically 5-25 Gbps Cloud dependent
Internet (user connection) 10-100 Mbps Highly variable

Availability Math

SLA Downtime/Year Downtime/Month Downtime/Week
99% (two 9s) 3.65 days 7.3 hours 1.68 hours
99.9% (three 9s) 8.77 hours 43.8 minutes 10.1 minutes
99.95% 4.38 hours 21.9 minutes 5.04 minutes
99.99% (four 9s) 52.6 minutes 4.38 minutes 1.01 minutes
99.999% (five 9s) 5.26 minutes 26.3 seconds 6.05 seconds

Compound availability:

  • Two components in SERIES: A × B (e.g., 99.9% × 99.9% = 99.8%)
  • Two components in PARALLEL: 1 - (1-A)(1-B) (e.g., 1 - 0.001² = 99.9999%)

Interview gold: "If we need 99.99% availability and each server is 99.9% available, we need redundancy. Two servers in active-active gives us 1 - (0.001)² = 99.9999% — well above our target."


The Estimation Template

Step 1: Clarify Scale

Question to Ask Why
How many users? (DAU, MAU) Determines request volume
Read-heavy or write-heavy? Determines architecture (cache vs queue)
What's the read:write ratio? Usually 10:1 to 1000:1
What's the expected growth? Size for 3-5 years ahead
Any traffic spikes? (peak:avg ratio) Usually 3x-10x average

Step 2: Estimate Traffic

Text Only
Daily active users (DAU) × actions per user = daily requests
Daily requests ÷ 86,400 = average QPS
Average QPS × peak factor (3-10x) = peak QPS

Step 3: Estimate Storage

Text Only
Daily new records × record size = daily storage
Daily storage × 365 × years × replication factor = total storage

Step 4: Estimate Bandwidth

Text Only
Peak QPS × average response size = outbound bandwidth
Peak write QPS × average request size = inbound bandwidth

Step 5: Derive Infrastructure

Text Only
Peak QPS ÷ single-server capacity = number of servers needed
Total storage ÷ per-machine storage = number of storage nodes
Add replication factor (usually 3x)

Worked Examples

Example 1: Twitter-like Feed (500M DAU)

Parameter Calculation Result
Daily tweets 500M × 2 tweets/user/day 1B tweets/day
Tweet size 280 chars + metadata ≈ 500 bytes
Daily storage 1B × 500 bytes 500 GB/day
Yearly storage 500 GB × 365 ~180 TB/year
With replication (3x) 180 TB × 3 ~540 TB/year
Write QPS (avg) 1B ÷ 86,400 ~12K writes/sec
Peak write QPS (10x) 12K × 10 ~120K writes/sec
Read QPS (100:1 ratio) 12K × 100 ~1.2M reads/sec
Feed reads (timeline) Need caching + fanout Redis for hot timelines

Conclusion: Need sharding (120K peak writes > single DB), heavy caching (1.2M reads/sec), and ~540TB storage over 3 years.

Example 2: Chat System (WhatsApp-scale, 2B users, 100B messages/day)

Parameter Calculation Result
Messages/sec (avg) 100B ÷ 86,400 ~1.15M msg/sec
Peak (5x) 1.15M × 5 ~5.8M msg/sec
Message size ~100 bytes avg (text)
Daily storage 100B × 100 bytes 10 TB/day
Monthly storage 10 TB × 30 300 TB/month
With replication (3x) 300 TB × 3 900 TB/month
WebSocket connections 500M concurrent (DAU/4)
Servers for connections 500M ÷ 100K per server ~5,000 servers
Bandwidth (outbound) 5.8M × 100 bytes ~580 MB/sec peak

Conclusion: Need message queue backbone (Kafka at 5.8M/sec), ~5000 WebSocket gateway servers, partition by conversation_id.

Example 3: Image Hosting (Instagram-scale, 2B users, 100M uploads/day)

Parameter Calculation Result
Uploads/sec 100M ÷ 86,400 ~1,150 uploads/sec
Average image size 2 MB (after compression)
Daily storage 100M × 2 MB 200 TB/day
Yearly storage 200 TB × 365 73 PB/year
With CDN replicas 73 PB × 3 regions ~220 PB
Upload bandwidth 1,150 × 2 MB 2.3 GB/sec inbound
Read QPS (50:1 ratio) 1,150 × 50 57,500 reads/sec
CDN hit rate (95%) Only 5% reach origin ~2,875 origin reads/sec
Origin bandwidth 2,875 × 2 MB ~5.7 GB/sec

Conclusion: CDN is critical (reduces origin load 20x), object storage (S3) for images, metadata in sharded DB.


Common Interview Estimates (Quick Reference)

System Users Key Metric Key Number
URL shortener 100M URLs/month Redirects/sec ~40K (read-heavy)
Chat system 500M DAU Messages/sec ~1M
Social feed 500M DAU Feed reads/sec ~1M
Video streaming 1B views/day Bandwidth Petabytes/day
Search engine 5B searches/day QPS ~60K
Payment system 1M transactions/day TPS ~12 (low but critical)

Common Estimation Mistakes

Mistake Reality
Forgetting peak vs average Peak is typically 3-10x average. Design for peak.
Ignoring replication Storage × 3 for fault tolerance
Assuming uniform traffic Time zones, events, and seasonality cause spikes
Not accounting for metadata Indexes, logs, and metadata often 2-3x raw data
Forgetting retention/growth "How long do we keep data?" changes everything
Single point math Show range: "between X and Y, I'll use Z for design"

Interview Framework

When the interviewer says "Design X":

Step 1 — Clarify and scope: "Before diving in, let me estimate the scale. We have [N] users doing [M] actions per day. That's [X] requests/sec average, [Y] peak."

Step 2 — Storage: "Each [entity] is approximately [Z] bytes. At [N] per day with [R] replication, that's [T] TB over [Y] years."

Step 3 — Bandwidth: "Peak [X] requests/sec × [Z] KB average response = [B] GB/sec. This tells me we need [CDN / multiple servers / caching]."

Step 4 — Identify bottleneck: "The bottleneck is [reads at 1M QPS → need caching | writes at 100K TPS → need sharding | storage at PB scale → need object storage + tiering]."

Step 5 — Derive infrastructure: "At [X] QPS per server, we need [N] servers. With [Z] TB per machine, we need [M] storage nodes. Replication factor 3 for durability."


Quick Recall

Question Answer
Seconds in a day? ~86,400 ≈ ~100K (for quick math)
QPS for 1B requests/day? ~12K average
RAM vs SSD speed? RAM is ~100x faster (100ns vs 16μs)
Cross-region latency? 70-300ms (speed of light limit)
Redis throughput? 100K-300K ops/sec per instance
99.99% downtime? ~52 minutes/year
Compound availability (series)? Multiply: 99.9% × 99.9% = 99.8%
Peak vs average ratio? Usually 3-10x
Replication factor? Usually 3x for durability
When to shard? Single DB can't handle the write QPS or storage