Notification System
Real Incident: Facebook Duplicate Push Notification Storm, 2019
In September 2019, Facebook sent approximately 3x duplicate push notifications to hundreds of millions of users over a 30-minute window. Users received the same notification three times in rapid succession — likes, comments, friend requests, all tripled. The root cause: a deploy to the notification deduplication service introduced a race condition where the bloom filter check and the exact-match check returned conflicting results under high concurrency. The dedup layer "passed" messages it should have suppressed. Because there was no secondary dedup at the channel-worker level, every duplicate sailed through to APNs/FCM unchecked. The incident burned user trust — some disabled notifications entirely, tanking engagement metrics for weeks. Facebook's post-mortem mandated deduplication at every layer (ingestion, queue consumer, and provider-level collapse keys) so that no single-layer failure can produce duplicates. A notification system without layered dedup is a ticking time bomb.
System Design Concepts Used
Priority Queues · Channel Isolation · Bloom Filter Deduplication · Rate Limiting (Token Bucket) · Dead Letter Queue · Exponential Backoff with Jitter · Template Engine (i18n) · Fan-out on Write · User Preference Engine · Idempotency Keys · Queue Backpressure · Circuit Breaker · Kafka Consumer Groups · Multi-Tenant Rate Limiting · Cassandra (Write-Optimized Storage) · ClickHouse (OLAP Analytics)
1. Functional Requirements
- Multi-channel delivery — send notifications via push (iOS/Android), SMS, email, in-app, and webhook
- Priority classification — route notifications to priority lanes (P0=security/OTP, P1=transactional, P2=social, P3=marketing)
- User preference management — honor quiet hours, channel opt-outs, frequency caps, and language preferences
- Template engine with i18n — render notifications from templates with variable substitution, supporting 50+ locales
- Deduplication — prevent duplicate notifications from reaching users across all channels
- Delivery tracking — track notification lifecycle (queued, sent, delivered, read, failed) with analytics
- Scheduling — support delayed/scheduled notifications (e.g., "send at 9 AM user's local time")
- Rate limiting — enforce per-user and per-channel rate limits to prevent notification fatigue
- Retry with fallback — if push fails, fall back to SMS; if SMS fails, fall back to email
2. Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Throughput | 1B notifications/day (~12K/sec avg, 60K/sec peak) | Supports global consumer apps with flash sales, breaking news |
| Latency (P0 critical) | < 3s end-to-end (p99) | OTP/security alerts must arrive before user loses patience |
| Latency (P1 transactional) | < 10s end-to-end (p99) | Order confirmations, payment receipts expected quickly |
| Latency (P2/P3 bulk) | < 5 minutes (soft real-time) | Social/marketing can tolerate batching for efficiency |
| Delivery rate | 99.9% (across all channels combined) | Fallback channels compensate for primary channel failures |
| Duplicate rate | < 0.01% | Users notice and are annoyed by duplicates; trust erosion |
| Availability | 99.95% (< 4.4 hours/year downtime) | Notifications are revenue-critical (OTPs, order updates) |
| Data retention | 90 days hot, 2 years cold | Compliance, debugging, analytics |
3. Capacity Estimation
/* ━━━ NAPKIN MATH: Start From Daily Volume ━━━ */
Total notifications/day: 1,000,000,000 (1B)
Notifications/sec (avg): 1B / 86,400 ≈ 11,574/sec → ~12K/sec
Peak (5x daily avg): 12K × 5 = 60K/sec (flash sales, breaking news)
Absolute spike (10x): 120K/sec (New Year countdown, election results)
/* ━━━ CHANNEL BREAKDOWN ━━━ */
Push notifications (60%): 600M/day → 7,000/sec avg → 35K/sec peak
Email (25%): 250M/day → 2,900/sec avg → 14.5K/sec peak
SMS (10%): 100M/day → 1,150/sec avg → 5,750/sec peak
In-App (5%): 50M/day → 580/sec avg → 2,900/sec peak
/* ━━━ STORAGE ━━━ */
Notification record size: ~500 bytes (template rendered + metadata)
Delivery log per notification: ~200 bytes (status, timestamps, provider response)
Daily storage (notifications): 1B × 500B = 500 GB/day
Daily storage (delivery logs): 1B × 200B = 200 GB/day
90-day hot storage: 700 GB × 90 = 63 TB (Cassandra cluster)
2-year cold storage: 700 GB × 730 = 511 TB (S3/compressed)
/* ━━━ QUEUE DEPTH ━━━ */
Kafka partitions per priority:
P0 (critical): 32 partitions, 4 consumer groups → max 500K in-flight
P1 (high): 64 partitions, 8 consumer groups → max 2M in-flight
P2 (normal): 128 partitions, 16 consumer groups → max 10M in-flight
P3 (low): 128 partitions, 8 consumer groups → max 50M in-flight
Total Kafka retention: 24 hours × 60K/sec × 1KB avg = ~5 TB
/* ━━━ COST AWARENESS ━━━ */
SMS cost (at $0.01/msg): 100M/day × $0.01 = $1M/day (!!)
Push cost (FCM/APNs): Free (volume-based, no per-message cost)
Email cost (SES at $0.10/1K): 250M/day × $0.0001 = $25K/day
→ SMS rate limiting is a BUSINESS REQUIREMENT, not just UX
System Nature
Write-heavy, latency-sensitive, cost-constrained. Unlike a URL shortener (read-heavy), a notification system is write-heavy: every notification is written once and read zero times by the system (it's pushed to users). The architecture optimizes for high-throughput writes, channel isolation, and cost control (especially SMS).
4. "Why X, Not Y?" — Tradeoff Analysis
Why priority queues, not a single FIFO?
Priority queues are mandatory because a single FIFO creates priority inversion — a 10M marketing blast queued at 2 PM blocks an OTP generated at 2:01 PM from reaching the user for minutes. With priority topics (P0/P1/P2/P3), each priority level has dedicated consumer groups. P0 consumers are always idle (low volume, instant consumption), so an OTP goes from queue to delivery in <100ms regardless of how many P3 marketing messages are backed up.
The cost is operational complexity: 4 topic groups instead of 1, separate monitoring/alerting per priority, and priority assignment logic in the router. But the Uber 2019 incident proved this is non-negotiable — a single queue means your most important messages (security, authentication) are held hostage by your least important (marketing).
Single FIFO advantage: Simpler operations, guaranteed ordering, easier capacity planning. Only acceptable for systems with <1M notifications/day where all notifications are equally important.
Why channel-specific workers, not a unified sender?
Channel isolation ensures that an APNs outage (Apple push) doesn't cascade to email or SMS delivery. Each channel has fundamentally different characteristics: push is fire-and-forget over HTTP/2 multiplexed connections, SMS requires serial API calls with per-provider rate limits (Twilio caps at 400 msg/sec per number), email uses SMTP batching with connection pooling. A unified sender would need to handle all protocols, and a bug/outage in one protocol's handling would stall the entire worker pool.
With isolated workers: Push workers maintain persistent HTTP/2 connections to APNs/FCM. SMS workers manage Twilio connection pools with provider-specific retry logic. Email workers batch via SES with 50-email-per-API-call optimization. Each can scale independently based on channel volume.
Unified sender advantage: Simpler deployment, single codebase, easier to implement cross-channel dedup. Acceptable for startups sending <10K notifications/day.
Why deduplication at every layer, not just ingestion?
A single dedup layer is a single point of failure for duplicates — and duplicate notifications are among the most user-visible failures. The Facebook 2019 incident proved that ingestion-layer dedup alone is insufficient: if the dedup service has a bug, a cache eviction, or a race condition, duplicates flow through unimpeded to billions of users.
Layered dedup provides defense-in-depth:
- API layer: Idempotency key in Redis (24h TTL) — catches duplicate API calls from producers
- Queue consumer layer: Check
notification_idin delivery log before processing — catches messages redelivered by Kafka (at-least-once semantics) - Provider layer: APNs
apns-collapse-id, FCMcollapse_key, emailMessage-ID— the provider itself suppresses duplicates
Each layer is cheap (Redis lookup, DB exists check, free header). The combined probability of all three failing simultaneously is negligible.
Ingestion-only dedup advantage: Lower latency (single check point), simpler architecture. Acceptable only if your dedup service has 99.999% reliability and you accept the risk.
Why rate limiting per-user, not just per-channel?
Per-channel rate limiting (e.g., max 1M push/sec globally) protects infrastructure but doesn't protect user experience. A user who receives 47 push notifications in an hour will disable notifications entirely — regardless of whether your servers are healthy. Per-user rate limiting (e.g., max 5 push/hour, 2 SMS/day, 10 email/day) prevents notification fatigue.
Implementation uses a sliding window counter per user per channel in Redis: INCR user:{id}:push:window:{hour}. When the limit is hit, lower-priority notifications are either dropped (P3 marketing) or queued for the next window (P2 social). P0/P1 are never rate-limited — security alerts and OTPs always go through.
Per-channel-only advantage: Much simpler (single global counter), no per-user Redis keys (saves memory). Acceptable for B2B systems where users expect high notification volume (e.g., monitoring dashboards).
5. High-Level Architecture
6. Backend Services Explained
Notification API Gateway
The single entry point for all notification requests. Validates the payload schema (channel, recipient, template_id, priority), checks the idempotency key in Redis (reject if seen within 24h), enforces API-level rate limits per producer service (prevents a buggy microservice from flooding the system), enriches the request with user preferences (quiet hours, channel opt-outs), and renders the notification body from the template engine. If the user has push disabled, the API immediately routes to the fallback channel (email/in-app) without hitting the queue. Stateless, horizontally scaled behind an L7 load balancer.
Template Engine
Maintains a library of notification templates with i18n support (50+ locales). Templates use variable substitution ({{order_id}}, {{driver_name}}) and channel-specific formatting: push templates are limited to 4KB (APNs limit), email templates include HTML/plaintext variants, SMS templates are capped at 160 characters with URL shortening. Templates are versioned and cached in Redis — a template update propagates to all instances within seconds via pub/sub invalidation. Pre-compilation avoids runtime template parsing overhead.
Priority Router
Examines the notification type and maps it to a priority level: P0 (OTP, security alerts, 2FA), P1 (order confirmations, payment receipts, delivery updates), P2 (likes, comments, friend requests), P3 (marketing, digests, recommendations). The router also handles scheduling: if a notification is marked "send at 9 AM user's local time," it calculates the delay and publishes to a delayed-execution topic (Kafka with timestamp-based consumption). For immediate notifications, it publishes directly to the appropriate priority topic.
Channel Workers (Push, SMS, Email, In-App, Webhook)
Each channel has an independent worker pool that subscribes to all four priority topics but with weighted consumption: P0 messages are always consumed first (Kafka consumer priority via separate consumer groups with lag-based autoscaling). Each worker handles the specifics of its channel: the Push Worker maintains persistent HTTP/2 connections to APNs and FCM, batching up to 1,000 notifications per HTTP/2 stream. The SMS Worker manages provider failover (primary: Twilio, fallback: Vonage) with per-provider rate limiting. The Email Worker batches up to 50 recipients per SES API call. Before sending, each worker performs a final dedup check against the delivery log.
Delivery Tracker
Receives delivery callbacks from external providers (APNs delivery receipts, Twilio status callbacks, SES bounce/complaint notifications) and updates the notification status in Cassandra. Feeds real-time delivery metrics to ClickHouse for analytics dashboards (delivery rate by channel, by provider, by geography). Detects systematic failures (e.g., APNs returning errors for 50%+ of messages) and triggers circuit breakers.
Dead Letter Queue (DLQ)
Messages that fail after maximum retries (e.g., 5 attempts with exponential backoff) are routed to the DLQ rather than being dropped. An operations team monitors the DLQ, inspects failure reasons, and can replay messages after the root cause is fixed. The DLQ is also used for messages that fail schema validation or reference non-existent users — these are logged for debugging but never retried automatically.
7. Architecture Flow — Maria's Food Delivery Notification
A food delivery app sends an "order ready for pickup" notification to Maria in Sao Paulo, Brazil. Her phone is on Android (FCM), she has push enabled, quiet hours set to 22:00-08:00 (local BRT timezone), and she prefers Portuguese.
Phase 1 — High-Priority Push Notification (P1 Transactional)
T+0ms: The Order Service publishes a notification request to the Notification API:
POST /v1/notifications
{
"user_id": "u_789012",
"type": "ORDER_READY",
"priority": "P1",
"template_id": "order_ready_v3",
"variables": {"restaurant": "Burger Palace", "order_id": "ORD-4521"},
"channels": ["push", "sms"],
"idempotency_key": "ord-4521-ready-notif"
}
T+2ms: API Gateway checks Redis for idempotency key ord-4521-ready-notif — not found (first attempt). Checks user preferences: push is enabled, current time in BRT is 14:30 (outside quiet hours). Rate limit check: Maria has received 2 push notifications this hour (limit is 10/hour for P1) — allowed.
T+5ms: Template Engine renders the Portuguese template: "Seu pedido no Burger Palace esta pronto! Pedido #ORD-4521". Body is 89 bytes — well within APNs/FCM limits.
T+8ms: Priority Router publishes to Kafka topic notifications.p1 with partition key hash(u_789012) mod 64 = partition 17. This ensures all of Maria's notifications are processed in order.
T+12ms: P1 Push Worker consumes the message (consumer lag for P1 is typically <50ms). Performs final dedup check: queries Cassandra delivery log for notification_id — not found. Proceeds.
T+15ms: Push Worker sends to FCM via HTTP/2:
POST https://fcm.googleapis.com/v1/projects/foodapp/messages:send
{
"message": {
"token": "maria_fcm_device_token_abc123",
"notification": {"title": "Pedido Pronto!", "body": "Seu pedido no Burger Palace..."},
"android": {"collapse_key": "order_ready_ORD-4521", "priority": "high"}
}
}
T+180ms: FCM returns 200 OK with message ID. Push Worker writes delivery status to Cassandra: {notification_id, status: "sent", provider: "fcm", sent_at: now()}.
T+2.1s: FCM delivers to Maria's phone. FCM sends delivery receipt callback. Delivery Tracker updates status to "delivered."
Phase 2 — Email Batch Delivery (P3 Marketing)
T+0: Marketing Engine triggers a campaign: "Weekend special: 30% off orders over R$50" to 5 million users in Brazil.
T+1ms: API Gateway receives the batch request. Rate limiter applies: marketing emails are capped at 500K/hour per campaign to avoid overwhelming SES and triggering spam filters. The 5M messages are spread over 10 hours.
T+10ms: Priority Router publishes all 5M messages to Kafka topic notifications.p3. Because P3 has the lowest consumer priority, these will only be processed when P0/P1/P2 queues are drained.
T+30s to T+10h: Email Workers consume from P3 at a controlled rate. Each worker batches 50 recipients per SES API call. SES rate limit: 100 calls/sec = 5,000 emails/sec. Total throughput: 500K/hour as designed.
Key insight: Maria's P1 order notification at T+0 was delivered in 2.1 seconds. The 5M marketing emails do NOT interfere because they're in a completely separate priority lane (P3) with separate consumer groups.
Phase 3 — SMS Fallback When Push Fails
T+0ms: Auth Service sends OTP to Maria (P0 critical):
{"user_id": "u_789012", "type": "OTP", "priority": "P0", "template_id": "otp_v2",
"variables": {"code": "847291"}, "channels": ["push", "sms"], "fallback_chain": true}
T+8ms: P0 Push Worker sends to FCM. But Maria's phone is turned off (airplane mode for a flight).
T+5.2s: FCM returns 200 OK (accepted) but no delivery receipt arrives within the 5-second timeout. Push Worker marks status as sent_unconfirmed.
T+5.3s: Fallback logic triggers: since fallback_chain: true and push delivery is unconfirmed after 5s, the system escalates to SMS.
T+5.4s: SMS Worker sends via Twilio:
POST https://api.twilio.com/2010-04-01/Accounts/.../Messages.json
Body: "847291 e seu codigo de verificacao. Nao compartilhe."
To: +5511987654321
T+6.8s: Twilio delivers SMS via carrier network. Maria (who has SMS available despite airplane mode in some carriers, or when she lands) receives the OTP.
8. Failure & Recovery Scenarios
Push Gateway (APNs/FCM) Down
Scenario: FCM returns 503 for all requests for 15 minutes (Google infrastructure issue).
Detection: Push Worker circuit breaker trips after 10 consecutive failures in 5 seconds. Metrics alert fires.
Impact: All push notifications queue up in Kafka. No data loss — Kafka retains messages for 24 hours.
Recovery: 1. Circuit breaker enters half-open state every 30 seconds, sending a single probe request. 2. P0/P1 notifications with fallback_chain: true immediately fall back to SMS/email after 3 failed push attempts (no waiting for FCM recovery). 3. P2/P3 notifications accumulate in Kafka. When FCM recovers, consumers drain the backlog at maximum rate. 4. Dedup layer ensures that messages retried during the outage are not sent twice once service resumes.
Key metric: During a 15-minute FCM outage, <0.1% of P0 notifications are delayed >10s (they fall back to SMS within 5s).
SMS Provider Fails (Twilio Outage)
Scenario: Twilio API returns 500 errors. Primary SMS path is broken.
Recovery: 1. SMS Worker detects Twilio failures (circuit breaker trips after 5 failures in 2 seconds). 2. Automatic failover to secondary provider (Vonage/AWS SNS). Provider selection is configurable per region. 3. If all SMS providers are down, P0 messages (OTP) fall back to email with a disclaimer: "Use this code within 2 minutes." 4. P3 marketing SMS are paused entirely (cost savings during outage — no point sending marketing to a black hole).
Queue Backpressure (Kafka Consumer Lag Spike)
Scenario: A flash sale triggers 10x normal notification volume. P2/P3 consumer lag grows to millions.
Recovery: 1. Autoscaler detects consumer lag >100K and spins up additional P2/P3 consumer instances (Kubernetes HPA based on Kafka lag metric). 2. P0/P1 consumers are over-provisioned by 5x — they never experience lag even during spikes. 3. If P3 lag exceeds 50M messages (>4 hours of backlog), the system activates "marketing throttle" — dropping oldest P3 messages that are >2 hours stale (marketing notifications lose value rapidly). 4. Kafka retention guarantees no P0/P1 message is ever dropped.
Dedup Service Failure (Redis Cluster Partial Failure)
Scenario: 2 of 6 Redis nodes in the dedup cluster fail. ~33% of idempotency checks cannot be performed.
Recovery: 1. Redis Cluster redistributes slots to surviving nodes (automatic within 15 seconds for Redis Cluster). 2. During the gap, the API layer's dedup check returns "unknown" for affected keys. 3. Fallback dedup: the system proceeds with a warning flag, and the channel worker performs the secondary dedup check against Cassandra delivery log before sending. 4. Post-recovery: no duplicates were sent because the Cassandra-level dedup caught them. The multi-layer approach means Redis failure does not cause duplicate sends.
9. Data Model
/* ━━━ PostgreSQL: Notification Templates ━━━ */
CREATE TABLE notification_templates (
template_id VARCHAR(64) PRIMARY KEY,
version INT NOT NULL DEFAULT 1,
channel VARCHAR(16) NOT NULL, -- push, sms, email, in_app
locale VARCHAR(10) NOT NULL, -- en-US, pt-BR, es-MX, etc.
title_template TEXT, -- "Your order from {{restaurant}} is ready!"
body_template TEXT NOT NULL, -- "Order #{{order_id}} is ready for pickup"
metadata JSONB, -- {max_length: 160, supports_html: false}
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (template_id, version, channel, locale)
);
/* ━━━ PostgreSQL: User Preferences ━━━ */
CREATE TABLE user_preferences (
user_id BIGINT PRIMARY KEY,
push_enabled BOOLEAN DEFAULT TRUE,
sms_enabled BOOLEAN DEFAULT TRUE,
email_enabled BOOLEAN DEFAULT TRUE,
quiet_start TIME, -- e.g., 22:00
quiet_end TIME, -- e.g., 08:00
timezone VARCHAR(64) DEFAULT 'UTC',
locale VARCHAR(10) DEFAULT 'en-US',
rate_limits JSONB DEFAULT '{"push_per_hour": 10, "sms_per_day": 3, "email_per_day": 20}',
channel_priority JSONB DEFAULT '["push", "sms", "email"]', -- fallback order
unsubscribed JSONB DEFAULT '[]', -- ["marketing", "social"]
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_user_prefs_timezone ON user_preferences (timezone);
/* ━━━ Cassandra: Delivery Log (Write-Optimized, Time-Series) ━━━ */
CREATE TABLE delivery_log (
user_id BIGINT,
notification_id UUID,
channel TEXT, -- push, sms, email
priority TEXT, -- P0, P1, P2, P3
status TEXT, -- queued, sent, delivered, read, failed, bounced
provider TEXT, -- fcm, apns, twilio, ses
provider_msg_id TEXT, -- external reference for tracking
template_id TEXT,
created_at TIMESTAMP,
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
failed_reason TEXT, -- null if successful
retry_count INT,
PRIMARY KEY ((user_id), created_at, notification_id)
) WITH CLUSTERING ORDER BY (created_at DESC)
AND default_time_to_live = 7776000; -- 90-day TTL
CREATE TABLE delivery_dedup (
notification_id UUID PRIMARY KEY,
created_at TIMESTAMP,
) WITH default_time_to_live = 86400; -- 24h TTL for dedup window
/* ━━━ ClickHouse: Analytics (OLAP, Columnar) ━━━ */
CREATE TABLE notification_analytics (
notification_id UUID,
user_id UInt64,
channel LowCardinality(String),
priority LowCardinality(String),
template_id LowCardinality(String),
status LowCardinality(String),
provider LowCardinality(String),
country LowCardinality(String),
created_at DateTime,
delivered_at Nullable(DateTime),
latency_ms UInt32 -- end-to-end delivery latency
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (channel, priority, created_at);
10. Algorithms Under the Hood
Deduplication (Bloom Filter + Exact Check)
/* Two-phase dedup: probabilistic fast-path + exact slow-path */
function is_duplicate(notification_id, idempotency_key):
// Phase 1: Bloom filter (in-memory, O(1), 1% false positive rate)
if bloom_filter.might_contain(idempotency_key):
// Phase 2: Exact check in Redis (only when bloom says "maybe")
if redis.EXISTS("dedup:" + idempotency_key):
return TRUE // confirmed duplicate — reject
// Bloom false positive — not actually a duplicate
// Not a duplicate — record it
bloom_filter.add(idempotency_key)
redis.SET("dedup:" + idempotency_key, notification_id, EX=86400) // 24h TTL
return FALSE
/* Why bloom filter first?
- 1B notifications/day = 12K/sec Redis queries for dedup alone
- Bloom filter eliminates 99% of Redis lookups (only new messages hit Redis)
- At 1B entries/day with 1% FPR: bloom filter size = ~1.2 GB RAM
- Rotated daily (swap with fresh filter at midnight UTC)
*/
Priority Queue Scheduling
/* Weighted consumption across priority topics */
function consume_next_batch(worker):
// P0 always checked first — starvation-free for critical messages
batch = kafka.poll("notifications.p0", timeout=10ms, max=100)
if batch.size > 0:
return batch // Always serve P0 first
// P1 checked next — weighted 4x over P2
batch = kafka.poll("notifications.p1", timeout=10ms, max=100)
if batch.size > 0:
return batch
// P2 and P3 use weighted round-robin
if worker.round_robin_counter % 5 < 4: // 80% of cycles go to P2
batch = kafka.poll("notifications.p2", timeout=50ms, max=200)
else: // 20% of cycles go to P3
batch = kafka.poll("notifications.p3", timeout=50ms, max=500)
worker.round_robin_counter += 1
return batch
/* Alternative: Separate consumer groups per priority
P0: 4 dedicated consumers (always idle, instant drain)
P1: 8 consumers (autoscale 4-16 based on lag)
P2: 16 consumers (autoscale 8-32)
P3: 8 consumers (autoscale 4-64, burst during marketing campaigns)
This is the preferred approach — simpler and truly isolated.
*/
Rate Limiting Per User (Sliding Window Counter)
/* Sliding window rate limiter using Redis sorted sets */
function check_rate_limit(user_id, channel, priority):
// P0 (critical) is NEVER rate-limited
if priority == "P0":
return ALLOW
key = "ratelimit:" + user_id + ":" + channel
now = current_timestamp_ms()
window = get_window_duration(channel) // push=1hour, sms=1day, email=1day
window_start = now - window
// Remove expired entries
redis.ZREMRANGEBYSCORE(key, 0, window_start)
// Count entries in current window
count = redis.ZCARD(key)
limit = get_user_limit(user_id, channel) // from user_preferences table
if count >= limit:
if priority == "P1":
// Transactional: queue for next window (don't drop)
schedule_for_next_window(notification)
return DEFER
else:
// P2/P3: drop silently (user won't miss marketing)
return DROP
// Under limit — allow and record
redis.ZADD(key, now, notification_id)
redis.EXPIRE(key, window / 1000) // TTL = window duration
return ALLOW
/* Memory calculation:
Active users/day: 100M
Keys per user: 3 (push, sms, email)
Sorted set per key: ~20 entries × 16 bytes = 320 bytes
Total: 100M × 3 × 320B = ~96 GB Redis
→ Shard across 12 Redis nodes (8 GB per node)
*/
Retry with Exponential Backoff + Jitter
/* Retry strategy for failed deliveries */
MAX_RETRIES = 5
BASE_DELAY_MS = 1000 // 1 second
function send_with_retry(notification, channel_worker):
for attempt in range(1, MAX_RETRIES + 1):
result = channel_worker.send(notification)
if result.success:
update_status(notification.id, "delivered")
return SUCCESS
if result.error_type == "PERMANENT":
// Invalid token, unsubscribed, blocked — don't retry
update_status(notification.id, "failed_permanent", result.error)
move_to_dlq(notification, reason=result.error)
return PERMANENT_FAILURE
// Transient error — retry with exponential backoff + jitter
delay = BASE_DELAY_MS * (2 ^ (attempt - 1)) // 1s, 2s, 4s, 8s, 16s
jitter = random(0, delay * 0.3) // ±30% jitter
actual_delay = delay + jitter
log("Retry {attempt}/{MAX_RETRIES} for {notification.id} in {actual_delay}ms")
sleep(actual_delay)
// All retries exhausted
update_status(notification.id, "failed_exhausted")
move_to_dlq(notification, reason="max_retries_exceeded")
// Trigger fallback channel if configured
if notification.fallback_chain:
next_channel = get_next_fallback(notification)
if next_channel:
re_enqueue(notification, channel=next_channel)
return FAILURE
/* Retry timeline for a transient FCM error:
Attempt 1: immediate → fails
Attempt 2: +1.0-1.3s delay → fails
Attempt 3: +2.0-2.6s delay → fails
Attempt 4: +4.0-5.2s delay → fails
Attempt 5: +8.0-10.4s delay → fails → DLQ + fallback to SMS
Total time before DLQ: ~15-20 seconds
*/
11. Scaling Considerations
| Challenge | Solution | Numbers |
|---|---|---|
| Flash sale spike (10x traffic) | Kafka absorbs burst; autoscale consumers based on lag | P3 consumers scale 8 to 64 pods in 90 seconds |
| SMS cost explosion | Per-user rate limiting + priority-based throttling | Cap at $200K/day SMS spend; defer P3 SMS to email |
| Global latency (users in 190+ countries) | Regional Kafka clusters + geo-routed workers | Push workers in us-east, eu-west, ap-southeast |
| Provider outage (APNs/FCM/Twilio) | Circuit breaker + multi-provider failover | Failover in <5s; no manual intervention needed |
| Hot partition (celebrity with 100M followers) | Fan-out limiter: batch celebrity notifications, shard by recipient | Max 10K messages per Kafka partition per second |
| Template update propagation | Redis pub/sub invalidation + versioned templates | All workers get new template in <2 seconds |
| Quiet hours across timezones | Scheduler service with per-user timezone offset | Delayed publish to Kafka with timestamp-based consumption |
| Delivery log growth (63 TB/90 days) | Cassandra with TTL + tiered storage (hot/warm/cold) | Auto-expire after 90 days; archive to S3 for compliance |
| Monitoring at 1B/day | ClickHouse real-time dashboards + anomaly detection | Alert if delivery rate drops below 99.5% for any channel |
| Dedup bloom filter memory | Daily rotation + Redis exact-match fallback | 1.2 GB RAM per day; swap at midnight UTC |
12. Quick Recall
| Question | Answer |
|---|---|
| Why priority queues? | OTP must never wait behind 10M marketing emails. P0 has dedicated consumers that are always idle. |
| Why channel isolation? | APNs down does NOT affect email. Separate worker pools, separate failure domains. |
| Dedup strategy? | 3 layers: bloom filter + Redis at API, Cassandra check at worker, collapse_key at provider. |
| Why rate limit per-user? | SMS costs $0.01/msg. 100M uncontrolled = $1M/day. Also prevents notification fatigue → uninstalls. |
| Why Kafka not RabbitMQ? | 1B/day throughput + replay capability + consumer groups. RabbitMQ struggles above 50K/sec. |
| DLQ purpose? | Failed messages after 5 retries go here. Nothing is lost. Ops can inspect + replay after fix. |
| Retry strategy? | Exponential backoff with 30% jitter. Prevents thundering herd on provider recovery. |
| Why Cassandra for delivery log? | Write-optimized (1B writes/day), time-series partitioning, built-in TTL for auto-expiry. |
| How to handle quiet hours? | Scheduler delays publication until user's quiet hours end. Uses timezone from user_preferences. |
| Fallback chain? | push fails → SMS → email. Each channel independently retried before escalating. |
| Template vs pre-render? | Templates: store O(templates × locales) not O(users × templates). Render at send time. |
| How to prevent fan-out bomb? | Celebrity notifications batched + sharded by recipient. Max 10K/partition/sec enforced. |