Data Partitioning
Real Incident: Uber's Schemaless, 2014
Uber's trip data grew to billions of rows on PostgreSQL. Queries that took 5ms at 1M rows took 500ms at 1B rows. Backups took 18 hours. They built Schemaless — partitioned by city + time. Trip query in São Paulo only scans São Paulo's partition. Query time back to 5ms regardless of global data size. Partitioning makes the impossible query fast by narrowing the search space.
Why This Comes Up in Interviews
Partitioning is more fundamental than sharding — it's the STRATEGY, while sharding is one implementation. It appears when:
- "How do you handle 1B+ rows efficiently?" → Partition
- "How do you design DynamoDB's key schema?" → Partition key + sort key
- "How do you avoid hot partitions in Kafka?" → Partition key design
- "How do you optimize time-series queries?" → Time-based partitioning
Key distinction: Partitioning = dividing data by some criteria. Sharding = partitioning across MACHINES. You can partition within one database.
Partitioning vs Sharding vs Table Partitioning
| Concept | Where Data Lives | Who Manages | Example |
|---|---|---|---|
| Table Partitioning | Same machine, managed by DB | Database engine | PostgreSQL PARTITION BY RANGE (created_at) |
| Partitioning (general) | Any arrangement | Application or middleware | Kafka partitions, DynamoDB partitions |
| Sharding | Different machines | Application or proxy | Instagram's user_id-based PostgreSQL shards |
Why it matters: Table partitioning is free optimization (DB handles it). Sharding is an architectural decision with major trade-offs.
Three Dimensions of Partitioning
Horizontal Partitioning (Most Common)
Split by rows: Different rows in different partitions. Same schema everywhere.
| Method | How | Best For | Worst For |
|---|---|---|---|
| Hash | partition = hash(key) % N | Even distribution, point lookups | Range queries (must scan all) |
| Range | Partition by value ranges | Time-series, range scans | Hot partition on active range |
| List | Partition by explicit values | Geographic, categorical data | Uneven category sizes |
| Composite | Partition key (hash) + sort key (range) | Flexible (DynamoDB model) | Complex key design |
Vertical Partitioning
Split by columns: Frequently-accessed columns separated from rarely-accessed ones.
| Example | Partition A (hot) | Partition B (cold) |
|---|---|---|
| User table | id, name, email, status | bio, avatar_url, preferences, settings |
| Product table | id, name, price, stock | description, reviews, specifications |
Why: If 95% of queries only need 4 columns, why load 40 columns from disk?
Functional Partitioning
Split by feature/domain: Each bounded context gets its own database.
| Service | Database | Data |
|---|---|---|
| User service | users_db | Profiles, auth, preferences |
| Order service | orders_db | Orders, line items, payments |
| Inventory service | inventory_db | Stock levels, warehouses |
| Analytics service | analytics_db | Events, metrics, logs |
This is essentially the microservices "database per service" pattern.
Partition Key Design — The DynamoDB Mental Model
DynamoDB's key model is the best framework for thinking about partition design in interviews:
Partition Key (PK): Determines WHICH partition. Must be high-cardinality, evenly distributed. Sort Key (SK): Determines ORDER within a partition. Enables range queries.
Design Patterns
| Access Pattern | Partition Key | Sort Key | Query |
|---|---|---|---|
| User's recent orders | user_id | order_timestamp | "Last 10 orders for user X" → single partition |
| Chat messages | conversation_id | message_timestamp | "Messages in convo Y since Z" → single partition |
| IoT sensor readings | device_id | reading_timestamp | "Device Z readings for today" → single partition |
| Product reviews | product_id | review_timestamp | "Latest reviews for product P" → single partition |
The golden rule: Design so that 95%+ of queries hit a SINGLE partition.
Anti-Patterns (What NOT to Do)
| Bad Key | Why It Fails | Fix |
|---|---|---|
date as PK | Today's partition gets ALL writes (hot) | Use entity_id as PK, date as SK |
status as PK ("active"/"inactive") | Two partitions, 90% data in "active" | Use user_id as PK, query with filter |
country as PK | US/India partition 100x larger than Iceland | Hash-based PK, country as attribute |
Sequential order_id as PK | Recent orders all go to latest partition | Hash of order_id, or user_id as PK |
The Hot Partition Problem — Deep Dive
What makes a partition hot: - Celebrity account: one partition key gets 1000x traffic (Justin Bieber's timeline) - Temporal hotspot: "today" partition gets all writes in time-series data - Category skew: "electronics" category has 10x items vs "garden supplies"
Back-of-envelope: - DynamoDB partition limit: ~3000 RCU / 1000 WCU per partition (baseline planning numbers — since 2018-2019, adaptive capacity and burst capacity allow individual partitions to temporarily exceed these limits, so they are no longer hard ceilings) - Celebrity with 100M followers: follower fanout = 1M writes/sec during a post - Single partition handles 1000 WCU → needs 1000 partitions for that ONE user
Solutions by Pattern
| Pattern | Strategy | How | Trade-off |
|---|---|---|---|
| Write sharding | Add random suffix to PK | celebrity_123#0 through celebrity_123#9 | Reads scatter across 10 partitions → parallel read + merge |
| GSI overloading | Use multiple access patterns in one table | Different SK patterns for different queries | Complex schema |
| Time bucketing | Split hot time partition | 2024-01-15#0, 2024-01-15#1 | Reads need multiple queries |
| Separate table | Hot data in dedicated table with higher capacity | VIP users → separate table | Operational overhead |
| Caching | Cache hot partition reads | Redis in front of hot partition | Cache invalidation |
Instagram's approach: For celebrity accounts, partition key = user_id + random(0-9). Writes spread across 10 partitions. Reads: 10 parallel queries → merge → return. Works because read fanout (10 queries) is cheap, write hotspot (10,000 WCU on one partition) is impossible.
Read Fan-Out Tradeoff
Spreading writes across 10 partitions means reads now require scatter-gather across all 10, adding tail latency variance (p99 is bounded by the slowest partition). This is acceptable for write-heavy workloads (celebrity posts) but problematic for read-heavy patterns — if a key is read 1000x per second, you're paying 10,000 partition reads instead of 1,000. Only apply write sharding to genuinely write-hot keys.
Secondary Indexes with Partitioned Data
The core problem: Primary data is partitioned by entity_id. But you also need to query by email, timestamp, or category.
| Index Type | Where It Lives | Read | Write | Consistency |
|---|---|---|---|---|
| Local secondary index | Same partition as data | Fast (same partition) | Fast (same partition) | Strong |
| Global secondary index | Separate partitions (by indexed attribute) | Fast (goes to right index partition) | Slow (async replication) | Eventually consistent |
DynamoDB Example
- Table: PK =
user_id, SK =order_timestamp - LSI: Sort by
order_total(same partition, different sort order) - GSI: PK =
product_id, SK =order_timestamp(query: "all orders for product X")
GSI write amplification: Every write to main table → also writes to GSI partitions. 3 GSIs = 4x write cost.
Partitioning Strategies in Real Systems
| System | Partitioning | Key Design | Rebalancing |
|---|---|---|---|
| DynamoDB | Hash(PK) → partition | Composite key (PK + SK) | Automatic split on hot partition |
| Cassandra | Consistent hash ring | Partition key → token | Vnodes for even distribution |
| Kafka | Hash(message key) % partitions | Message key decides partition | Manual partition reassignment |
| PostgreSQL | Declarative (RANGE, LIST, HASH) | Any column(s) | Manual (attach/detach partitions) |
| Elasticsearch | Hash(doc_id) % shards | Document _id or custom routing | Shard allocation API |
| BigQuery | Time-based + clustering | Ingestion time or column | Automatic |
| MongoDB | Range or hash | Shard key field | Automatic chunk splitting + migration |
PostgreSQL Table Partitioning (Free Performance)
When: Single table > 10M rows with identifiable access patterns (usually time-based).
| Benefit | How |
|---|---|
| Query speed | Partition pruning — DB skips irrelevant partitions entirely |
| Vacuum speed | Vacuum each partition independently (smaller tables) |
| Bulk delete | Drop old partition instead of DELETE (instant vs hours) |
| Index size | Smaller indexes per partition (fit in memory) |
Common patterns:
| Data Type | Partition By | Retention |
|---|---|---|
| Logs/events | Monthly or weekly | Drop partitions > 90 days |
| Orders | Monthly | Keep forever, archive old to cold storage |
| Time-series metrics | Daily | Drop partitions > 30 days |
| Audit trail | Yearly | Compress old partitions |
Back-of-envelope benefit: - 1B row table, query needs last 7 days - Without partitioning: index scan on 1B rows → seconds - With daily partitions: only 7 partitions scanned, each ~3M rows → milliseconds - Speedup: ~50x for time-bounded queries
Partition Rebalancing
When needed: Partition grows too large, or load becomes uneven.
| Strategy | How | Downtime | Used By |
|---|---|---|---|
| Fixed partitions | Create many upfront, never change | None | Kafka (can't change partition count easily) |
| Dynamic splitting | Hot/large partition automatically splits | None | DynamoDB, HBase |
| Consistent hashing | Add node → steal ~1/N from all others | Minimal | Cassandra |
| Manual rebalance | Operator triggers migration | Depends | Elasticsearch (shard allocation) |
DynamoDB auto-split: - Partition gets > 10GB or > 3000 RCU → automatically splits into two - No downtime, no application awareness needed - But: too many splits = too many partitions = metadata overhead
Interview Framework
When designing a data model:
Step 1 — Identify access patterns: "The primary queries are [per-user lookups / time-range scans / category filtering]. This tells me the partition key should be [entity that 95%+ queries filter by]."
Step 2 — Choose partition key: "I'd partition by [user_id/device_id/conversation_id] because it's [high cardinality, immutable, matches access pattern]. Sort key would be [timestamp] for range queries within partition."
Step 3 — Hot partition mitigation: "If any entity becomes disproportionately hot, I'd use write sharding — append random suffix to PK, reads do parallel scatter-gather across suffixes."
Step 4 — Secondary access: "For queries not aligned with PK (e.g., 'all orders for product X'), I'd add a [GSI / Elasticsearch index / denormalized table]."
Step 5 — Rebalancing: "I'd use [DynamoDB auto-split / logical partition indirection / over-provisioned partition count] so growth doesn't require manual intervention."
Quick Recall
| Question | Answer |
|---|---|
| Partitioning vs sharding? | Partitioning = logical division. Sharding = across machines. |
| Hash vs range partition? | Hash = even distribution, no range queries. Range = range queries, hotspot risk. |
| Composite key model? | Partition key (hash → which partition) + Sort key (range within partition) |
| Hot partition fix? | Write sharding (random suffix), caching, separate table |
| DynamoDB auto-split? | Partition > 10GB or > 3000 RCU → splits automatically |
| Table partitioning benefit? | Partition pruning, faster vacuum, instant bulk delete |
| Secondary index trade-off? | Local = strong consistency. Global = eventual, async replication. |
| Golden rule? | 95%+ queries should hit single partition |