Distributed Rate Limiter
Real Incident: GitHub API Abuse, 2018
A misconfigured CI/CD pipeline at a Fortune 500 company spawned 200 parallel Jenkins jobs, each polling the GitHub API for status checks. Combined traffic: 50,000 requests/minute from a single organization — 10x their rate limit. GitHub's rate limiter correctly identified and blocked the excess traffic, returning 403 Forbidden responses. However, the client library had no exponential backoff logic and the responses lacked a Retry-After header. Every rejected request was retried immediately with zero delay, creating a thundering herd of 50K retries/second — each retry generating its own rejection, which triggered yet another retry. The retry storm consumed 3x more compute resources than the original traffic: each request required rate-limit lookup, response serialization, and connection handling regardless of outcome. GitHub's post-mortem led to three changes: (1) mandatory Retry-After headers on all 429 responses, (2) progressive penalties for clients ignoring backoff (increasing cooldown windows), and (3) connection-level throttling for repeat offenders. Lesson: A rate limiter that doesn't communicate "when to retry" creates more problems than it solves. The limiter itself becomes a DoS amplifier.
System Design Concepts Used
token bucket algorithm | sliding window counter | distributed state management | Redis Lua scripting | consistent hashing | fail-open vs fail-closed | circuit breaker pattern | thundering herd mitigation | rate limit headers (RFC 6585) | atomic operations | local cache with eventual consistency | hot-reload configuration | connection pooling | sharded counters | backpressure signaling
Functional Requirements
- Per-key rate limiting — Enforce separate limits per API key (each client has independent quota)
- Per-endpoint granularity — Different endpoints have different limits (e.g.,
/search= 30 req/min,/users= 100 req/min) - Multiple algorithm support — Token bucket (default), sliding window log, sliding window counter, and fixed window counter selectable per rule
- Rate limit response headers — Return
X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Reset, andRetry-Afteron every response (both allowed and rejected) - Configurable rules with hot-reload — Operators can update rate limit rules (thresholds, algorithms, windows) without service restart
- Whitelist/blacklist support — Permanently allow internal services (bypass checks) or permanently block known-abusive keys
- Tiered rate limiting — Different limits per subscription tier (free: 100 req/min, pro: 1000 req/min, enterprise: 10000 req/min)
- Burst tolerance — Allow brief bursts above steady-state rate (token bucket capacity > refill rate) for legitimate traffic spikes
- Distributed enforcement — Rate limits enforced consistently across all API gateway nodes (not per-node)
- Graceful degradation signaling — On rejection, provide actionable information (retry timing, remaining quota, limit ceiling)
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Throughput | 1.7M rate-limit checks/sec | 1M clients x 100 req/min average = full production load |
| Latency (p99) | < 1ms per check | Must not add perceptible delay to API calls; sub-ms for local fast path |
| Accuracy | < 0.1% false rejections | Rejecting legitimate traffic directly impacts revenue and user trust |
| Availability | 99.99% (52 min downtime/year) | Rate limiter failure path must not block legitimate traffic |
| Consistency | Eventually consistent (< 1s lag) | Brief over-limit acceptable vs. rejecting legitimate bursts across nodes |
| Scalability | Linear horizontal scaling | Adding Redis shards or gateway nodes should proportionally increase capacity |
| Failure mode | Fail-open | On Redis outage, allow all traffic rather than reject everything |
| Recovery time | < 5s failover | Redis sentinel/cluster auto-failover must be faster than typical burst duration |
| State storage | < 500 MB total | All rate-limit state must fit in memory for sub-ms access |
| Config propagation | < 2s rule update | New rules must take effect across all nodes within 2 seconds |
Capacity Estimation
=== Napkin Math: Distributed Rate Limiter ===
Traffic Volume:
Active clients: 1,000,000
Avg requests per client: 100 req/min
Total request rate: 1M × 100 / 60 = ~1,700,000 req/sec (1.7M)
Peak (3x average): ~5.1M req/sec
State Per Client:
Token bucket state: tokens (8B) + last_refill (8B) + key overhead (40B) = ~56 bytes
Endpoints per client: 10 (average distinct endpoints hit)
State per client: 56 × 10 = 560 bytes
Total Redis Memory:
Active keys: 1M clients × 10 endpoints = 10M keys
Raw state: 10M × 56 bytes = 560 MB
Redis overhead (2x): ~1.12 GB (hash encoding, pointers, metadata)
With TTL metadata: ~1.3 GB total
Safety margin (1.5x): ~2 GB allocated
Redis Cluster Sizing:
Single Redis throughput: ~300,000 ops/sec (pipelined, single thread)
Required throughput: 1.7M ops/sec (each check = 1 EVALSHA call)
Shards needed: 1.7M / 300K = 6 shards (minimum)
With headroom (2x): 12 shards (6 primary + 6 replica)
Memory per shard: 2 GB / 6 = ~333 MB per primary
Network Overhead:
Request to Redis: key (50B) + args (30B) + protocol (20B) = ~100 bytes
Response from Redis: result (8B) + remaining (8B) + protocol (20B) = ~36 bytes
Per-second bandwidth: 1.7M × 136 bytes = ~231 MB/sec (Redis network)
Gateway egress headers: 1.7M × 80 bytes (4 headers) = ~136 MB/sec
Lua Script Execution:
Script size: ~400 bytes (cached as SHA after first EVAL)
Execution time: ~0.05ms per call (single Redis thread)
Total script CPU: 1.7M × 0.05ms = 85 sec of CPU / sec → distributed across 6 shards
Per-shard CPU utilization: 85 / 6 = ~14 sec/sec → ~23% utilization (safe)
Local Counter Fast Path:
Hot clients (top 1%): 10,000 clients
Local cache memory: 10K × 56 bytes = ~560 KB per gateway node
Cache hit rate: ~40% (Zipfian distribution of API usage)
Effective Redis load: 1.7M × 0.6 = 1.02M ops/sec (with local cache)
Shards with cache: 1.02M / 300K = 4 shards sufficient
"Why X, Not Y?" Tradeoff Analysis
Why Token Bucket, Not Fixed Window or Leaky Bucket?
| Criterion | Token Bucket | Fixed Window | Leaky Bucket |
|---|---|---|---|
| Burst handling | Allows bursts up to capacity (graceful) | 2x burst at window boundary (vulnerability) | No bursts at all (too strict) |
| Implementation in Redis | Single HMGET + HMSET (2 fields) | Single INCR + EXPIRE | Requires ordered queue (complex) |
| Memory per key | 16 bytes (tokens + timestamp) | 12 bytes (counter + expiry) | Unbounded (queue of timestamps) |
| User experience | Feels fair — saved tokens reward inactivity | Punishes users at boundary | Punishes bursty legitimate usage |
| Configurability | Two knobs: capacity + refill rate | One knob: limit per window | Two knobs: bucket size + drain rate |
Decision: Token bucket. It handles the real-world pattern of bursty API usage (e.g., a mobile app syncing on wake) while preventing sustained abuse. Fixed window's boundary problem means a client can send 2x the limit in a 1-second span straddling two windows. Leaky bucket's strict uniform output punishes legitimate batch operations.
Why Redis, Not Local In-Memory Counters?
| Criterion | Redis (Centralized) | Local In-Memory |
|---|---|---|
| Accuracy | Exact count across all nodes | Each node sees partial view (10 nodes = 10x over-limit possible) |
| Latency | ~0.5ms (network hop) | ~0.01ms (memory access) |
| Consistency | Strong (single source of truth) | Eventual (sync lag = over-limit window) |
| Failure mode | Redis down = degraded limiting | Node restart = state loss |
| Scalability | Add shards (linear) | Add nodes = worse accuracy |
Decision: Redis as source of truth with local counters as an optimization layer. Pure local counters fail the core requirement: if you have 10 gateway nodes and a 100 req/min limit, a client sending 10 requests to each node would consume 100 requests while each node thinks "only 10 — well within limit." The hybrid approach (local fast-path for hot keys, async sync to Redis every 100ms) gives sub-ms checks for 40% of traffic while maintaining accuracy.
Why Fail-Open, Not Fail-Closed on Redis Failure?
| Criterion | Fail-Open (Allow All) | Fail-Closed (Reject All) |
|---|---|---|
| Revenue impact | Potential brief over-limit (minor) | All legitimate traffic blocked (catastrophic) |
| User experience | Users unaffected during outage | 100% of users get 503/429 errors |
| Abuse risk | Temporary abuse window (~seconds) | No abuse but also no service |
| Recovery | Seamless when Redis returns | Thundering herd of retries on recovery |
| Business alignment | Availability > perfect limiting | Security > availability |
Decision: Fail-open. A rate limiter exists to protect services from abuse, not to be a single point of failure. During the ~5 seconds of Redis failover, allowing excess traffic is far less damaging than returning errors to every customer. Exception: for security-critical endpoints (login, password reset), implement a separate fail-closed circuit with local state.
Why Lua Scripts, Not MULTI/EXEC for Atomicity?
| Criterion | Lua Script (EVALSHA) | MULTI/EXEC (Transaction) |
|---|---|---|
| Round trips | 1 (script executes server-side) | 3 minimum (WATCH + MULTI + EXEC) |
| Atomicity | Guaranteed (single-threaded execution) | Optimistic (EXEC fails if key changed) |
| Retry logic | None needed | Must retry on EXEC failure (contention) |
| Latency | ~0.05ms execution | ~0.15ms (3 round trips + potential retry) |
| Cluster support | Works with single-key scripts | MULTI limited to same-slot keys |
| Complexity | Script handles read + compute + write | Client must handle CAS retry loop |
Decision: Lua scripts. The token bucket algorithm requires read-then-write semantics (read current tokens, compute refill, decrement, write back). With MULTI/EXEC, a hot key (popular API) would cause frequent EXEC failures under contention — exactly when low latency matters most. Lua scripts execute atomically in a single Redis thread with zero contention overhead. The script is cached server-side (EVALSHA) so only the SHA hash is transmitted after first load.
High-Level Architecture
Backend Services Explained
API Gateway Rate Limiter Middleware — This is the core component, implemented as middleware in the API gateway (e.g., Kong, Envoy, or custom Nginx/OpenResty). Every inbound request passes through this layer before reaching any backend. It extracts the API key (from header, query param, or JWT claim) and the normalized endpoint path, constructs the rate-limit key ({api_key}:{endpoint}:{window}), and executes the rate-check algorithm. The middleware attaches rate-limit headers to ALL responses (both allowed and rejected) so clients always know their remaining quota. It operates as a pre-request hook with sub-millisecond overhead on cache hit. The middleware also handles whitelist/blacklist lookups from an in-memory set that refreshes every 2 seconds.
Redis State Store (Cluster) — Six Redis primary shards with six replicas form the distributed state backend. Each shard handles ~300K operations/sec using pipelined connections. Keys are distributed across shards via consistent hashing on {api_key} (using Redis hash tags to ensure all endpoints for a single client land on the same shard, enabling multi-key operations). The Lua token-bucket script is loaded once via SCRIPT LOAD and invoked via EVALSHA — eliminating script transmission overhead on every call. Keys have a TTL of 2x the rate-limit window (e.g., 2 hours for hourly limits) to auto-expire inactive clients. Redis Sentinel handles automatic failover within 5 seconds; during failover, the gateway falls back to fail-open mode.
Config Service — A lightweight service (or etcd/Consul KV store) that holds rate-limit rules as structured configuration. Rules specify: API key or tier pattern, endpoint regex, algorithm choice, window size, limit, and burst capacity. Changes are propagated to all gateway nodes via Redis Pub/Sub within 2 seconds. The config service supports rule inheritance (global defaults < tier overrides < per-key overrides) and validation (prevents deploying rules that would reject 100% of traffic). Operators interact via a REST API or admin UI. Rules are versioned for rollback capability.
Monitoring and Alerting — Every rate-limit decision emits a metric (allowed/rejected, latency, algorithm used, Redis response time). These flow to Prometheus via StatsD/OpenTelemetry. Grafana dashboards show: rejection rate per endpoint, per-key usage patterns, Redis cluster health, and fail-open events. Alerts fire on: rejection rate > 5% (potential misconfiguration), Redis latency > 5ms (degradation), fail-open mode triggered (infrastructure issue), or single key exceeding 10K req/sec (potential abuse or misconfigured client). The monitoring pipeline is fire-and-forget — it never blocks the request path.
Local Counter Cache — An in-memory hash map on each gateway node that caches token counts for the top ~1% of clients (by request volume). The local counter handles ~40% of all rate-limit checks without a Redis round-trip, achieving 0.01ms decision time. Every 100ms, the local counter syncs its state to Redis (subtracting consumed tokens) and pulls fresh counts. The tradeoff: during the sync interval, a client could exceed their limit by up to (number_of_gateway_nodes * local_allowance) requests. For most use cases, this brief over-limit (< 1 second) is acceptable. For strict-enforcement endpoints (payments, auth), the local cache is bypassed and every check goes to Redis.
Architecture Flow
Scenario: Fintech API Trading Bot Burst
A fintech API receives a burst of 500 requests in 1 second from a trading bot. The rate limit is 100 req/min (token bucket: capacity=100, refill=1.67 tokens/sec).
Normal Request (Allowed):
The first request arrives at the API gateway. The middleware extracts the API key tk_bot_9382 from the Authorization header and normalizes the endpoint to /v1/orders. It constructs the key tk_bot_9382:/v1/orders:bucket. The local counter cache does not have this key (cold start), so it invokes the Redis Lua script via EVALSHA. The script reads the current bucket: 100 tokens available (full bucket, client hasn't made requests recently). It refills 0 tokens (0 elapsed time since bucket was just initialized), decrements 1 token, writes {tokens: 99, last_refill: now} back to the hash, and returns {1, 99} (allowed, 99 remaining). The middleware adds headers: X-RateLimit-Limit: 100, X-RateLimit-Remaining: 99, X-RateLimit-Reset: 1640000060. The request is forwarded to the orders backend service. Total added latency: 0.6ms.
Burst Continues (Requests 2-100):
The trading bot's burst continues. Requests 2 through 100 each consume one token. The local counter cache is now warm — after the first Redis call, the gateway caches the token count locally. Subsequent requests decrement the local counter (0.01ms each) without Redis round-trips. Every 100ms, the local counter syncs: "I consumed 45 tokens locally" → Redis decrements by 45. After 100 requests total, the bucket is empty. The 100th request gets X-RateLimit-Remaining: 0. Total time elapsed: ~200ms (all 100 allowed within the burst capacity).
Rate-Limited Request (Rejected with Headers):
Request 101 arrives. The local counter shows 0 tokens. The gateway immediately returns HTTP 429 Too Many Requests with body:
{
"error": "rate_limit_exceeded",
"message": "Rate limit of 100 requests per minute exceeded for /v1/orders",
"retry_after": 36
}
Retry-After: 36 (seconds until enough tokens refill), X-RateLimit-Limit: 100, X-RateLimit-Remaining: 0, X-RateLimit-Reset: 1640000060. The remaining 399 requests from the burst all receive 429s. Each rejection takes ~0.01ms (local counter, no Redis needed for rejections). The trading bot's well-implemented SDK reads Retry-After, pauses for 36 seconds, then resumes at a sustainable rate. Redis Failure (Fail-Open):
During the burst, Redis shard 3 (which holds this client's key) experiences a network partition. The gateway's Redis client gets a connection timeout after 50ms. The circuit breaker trips. The middleware switches to fail-open mode: all requests from this client are ALLOWED without rate checking. The gateway emits an alert: "fail-open triggered for shard 3, affected keys: ~1.6M". It continues adding headers with stale data: X-RateLimit-Remaining: -1 (special value indicating degraded mode). After 4.2 seconds, Redis Sentinel promotes the replica to primary. The gateway reconnects, reloads cached script SHA, and resumes normal rate limiting. During the 4.2-second window, the trading bot sent ~50 extra requests above its limit — acceptable vs. rejecting all traffic from 1.6M clients.
Rule Hot-Reload:
An operator decides the trading bot's tier should be upgraded: 100 req/min → 500 req/min. They update the config service via REST API: PUT /rules/tier/pro {"limit": 500, "capacity": 500}. The config service validates the change, writes to its backing store, and publishes to Redis Pub/Sub channel rate_limit_rules_v2. All 20 gateway nodes receive the message within 800ms. Each node updates its in-memory rule cache. The next rate-limit check for any pro tier client uses the new limit. No restart, no deploy, no dropped connections. The token bucket's capacity is immediately set to 500 — existing tokens are preserved, new tokens refill at the higher rate.
Failure & Recovery Scenarios
Redis Cluster Down (Complete Outage)
Trigger: All 6 Redis primaries become unreachable (network partition, cloud availability zone failure).
Detection: Gateway Redis client pool reports 100% connection failures within 200ms. Circuit breaker trips.
Response: ALL gateway nodes switch to fail-open mode simultaneously. Rate limiting is effectively disabled. Local counters continue to operate with last-known state but cannot sync.
Risk: Abusive clients can send unlimited traffic. Backend services are unprotected.
Mitigation: (1) Local counters still provide approximate limiting for warm keys. (2) Backend services have their own circuit breakers/load shedding. (3) Emergency mode: gateway can enable a hard-coded conservative limit (e.g., 10 req/sec per IP) using local-only state.
Recovery: When Redis cluster recovers, all keys will have expired (TTL < outage duration) or contain stale data. Buckets reset to full capacity. Brief burst of previously-limited clients is expected. Gateway nodes detect recovered connections, re-register Lua scripts, and resume normal operation within 1 request cycle.
Config Service Unreachable
Trigger: Config service crashes or is unreachable.
Detection: Gateway's periodic config poll (every 10s) fails 3 consecutive times.
Response: Gateway continues using last-known-good configuration cached in memory. An operator alert fires.
Risk: Rule changes cannot be deployed. If a bad rule was just pushed to some nodes but not others, inconsistent behavior between gateway nodes.
Mitigation: Rules are versioned. Gateways compare rule versions on each sync. If a gateway has a newer version than another, the older gateway won't degrade — it just can't upgrade. Config service should be deployed with 3+ replicas and automatic failover.
Recovery: When config service returns, gateways pull latest rules on next poll cycle (< 10s). Any queued updates are applied in version order.
Clock Skew Between Nodes
Trigger: Gateway node 7's clock drifts 3 seconds ahead of other nodes due to NTP sync failure.
Impact on Token Bucket: The Lua script receives now as a parameter from the gateway (not Redis server time). If node 7's clock is 3 seconds ahead, it computes 3 extra seconds of token refill: 3s * 1.67 tokens/sec = 5 extra tokens. Client requests hitting node 7 get slightly more generous limits.
Impact on Fixed Window: Worse. Node 7 might roll over to a new window 3 seconds early, resetting counters while other nodes are still in the old window. Client gets 2x limits during the overlap.
Mitigation: (1) Use Redis TIME command instead of client-provided timestamps (adds 1 RTT but guarantees consistency). (2) NTP monitoring with alerts on drift > 500ms. (3) Token bucket's continuous refill model is inherently more tolerant of clock skew than window-based algorithms.
Hot Key (Single API Key: 100K req/sec)
Trigger: A viral app's API key generates 100K requests/sec — 100x its limit and 33% of a single Redis shard's capacity.
Detection: Monitoring alerts on single-key request rate > 10K/sec.
Impact: The single Redis shard handling this key ({api_key}:{endpoint}:{window} consistently hashes to shard 4) becomes a hotspot. Lua script executions for this key consume 33% of shard 4's capacity, increasing p99 latency for all other keys on shard 4.
Response: (1) Local counter absorbs most checks — after the first Redis call confirms the bucket is empty, the local counter caches "rejected" state and handles all subsequent requests without Redis. (2) Gateway implements request coalescing: batch N concurrent checks for the same key into 1 Redis call, broadcast the result. (3) Connection-level throttling: after 1000 consecutive rejections from the same IP, drop the TCP connection (save response serialization cost).
Mitigation: (1) Per-IP connection limits at the load balancer layer (before gateway). (2) Adaptive blacklisting: automatically add the key to the blacklist set after sustained abuse. (3) Shard splitting: if one key dominates, split that key's counter across multiple shards using sub-keys.
Data Model
Redis Key Structure
Key Pattern: {api_key}:{endpoint}:{algorithm_suffix}
Examples:
sk_prod_abc123:/v1/users:bucket → Token bucket state
sk_prod_abc123:/v1/search:bucket → Separate bucket per endpoint
sk_prod_abc123:/v1/orders:swlog → Sliding window log (sorted set)
sk_prod_abc123:/v1/orders:fwin:1640000 → Fixed window counter
Token Bucket Hash:
HSET sk_prod_abc123:/v1/users:bucket tokens 73 last_refill 1640000042.123
Sliding Window Log (Sorted Set):
ZADD sk_prod_abc123:/v1/orders:swlog 1640000042.123 "req_uuid_1"
ZADD sk_prod_abc123:/v1/orders:swlog 1640000042.456 "req_uuid_2"
(Score = timestamp, member = unique request ID)
Fixed Window Counter:
SET sk_prod_abc123:/v1/orders:fwin:1640000 47
EXPIRE sk_prod_abc123:/v1/orders:fwin:1640000 120
TTL Policy:
Token bucket keys: EXPIRE = 2 × window_size (auto-cleanup inactive keys)
Sliding window log: EXPIRE = window_size + 10s buffer
Fixed window: EXPIRE = window_size + 60s buffer
Config Schema (YAML)
# Rate limit rule configuration
rules:
- id: "rule_001"
version: 42
tier: "pro"
match:
api_key_pattern: "sk_prod_*"
endpoint_regex: "^/v1/orders.*"
algorithm: "token_bucket"
params:
capacity: 100 # Maximum tokens (burst size)
refill_rate: 1.67 # Tokens per second (100/min)
window: 60 # Window for header calculation (seconds)
override:
# Per-key override (higher priority)
"sk_prod_vip_001":
capacity: 1000
refill_rate: 16.67
- id: "rule_002"
tier: "free"
match:
api_key_pattern: "sk_free_*"
endpoint_regex: ".*"
algorithm: "sliding_window_counter"
params:
limit: 100
window: 60
whitelist:
- "sk_internal_*" # Internal services bypass all limits
- "sk_monitoring_*" # Health check services
blacklist:
- "sk_revoked_*" # Permanently blocked keys
- "sk_abuse_flagged_*" # Flagged by abuse detection
Rate Limit Response Headers
# On EVERY response (both 200 and 429):
X-RateLimit-Limit: 100 # Maximum requests allowed in window
X-RateLimit-Remaining: 73 # Requests remaining in current window
X-RateLimit-Reset: 1640000060 # Unix timestamp when window resets (UTC)
# Only on 429 responses:
Retry-After: 36 # Seconds until client should retry
# On fail-open (degraded mode):
X-RateLimit-Remaining: -1 # Special value: rate limiting degraded
X-RateLimit-Policy: degraded # Explicit degraded state indicator
# Full 429 response example:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 36
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000060
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit of 100 requests per 60 seconds exceeded",
"details": {
"limit": 100,
"window_seconds": 60,
"retry_after_seconds": 36,
"reset_at": "2021-12-20T12:01:00Z"
}
}
}
Algorithms Under the Hood
1. Token Bucket (Primary Algorithm)
The token bucket is a metaphor: imagine a bucket that holds tokens. Tokens are added at a constant rate (refill). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity (burst allowance).
Properties: Allows bursts up to capacity. Smooths traffic over time. Two tuning knobs (capacity + refill rate). Memory-efficient (2 values per key).
-- Token Bucket: Redis Lua Script (atomic execution)
-- KEYS[1] = rate limit key
-- ARGV[1] = bucket capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp (epoch with ms precision)
-- ARGV[4] = tokens to consume (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4]) or 1
-- Fetch current bucket state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
-- Initialize if first request (new key)
if tokens == nil then
tokens = capacity
last_refill = now
end
-- Calculate token refill since last request
local elapsed = math.max(0, now - last_refill)
local refilled = elapsed * refill_rate
tokens = math.min(capacity, tokens + refilled)
-- Attempt to consume tokens
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
-- Return: allowed=1, remaining tokens, ms until next token
return {1, math.floor(tokens), 0}
else
-- Rejected: calculate wait time until enough tokens available
local deficit = requested - tokens
local wait_ms = math.ceil((deficit / refill_rate) * 1000)
redis.call('HSET', key, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
return {0, 0, wait_ms}
end
2. Sliding Window Log
Stores the timestamp of every request in the current window. Counts requests by checking how many timestamps fall within [now - window, now]. Most accurate but highest memory usage.
Properties: Perfectly accurate (no boundary issues). Memory scales with request count (not fixed). Expensive for high-volume keys. Best for low-limit, high-accuracy scenarios (login attempts).
-- Sliding Window Log: Redis Lua Script
-- Uses a Sorted Set where score = timestamp, member = unique request ID
-- KEYS[1] = rate limit key (sorted set)
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests per window
-- ARGV[3] = current timestamp (epoch ms)
-- ARGV[4] = unique request ID (for dedup)
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local request_id = ARGV[4]
-- Remove all entries outside the current window
local window_start = now - (window * 1000)
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
-- Count requests in current window
local count = redis.call('ZCARD', key)
if count < limit then
-- Add this request to the log
redis.call('ZADD', key, now, request_id)
redis.call('EXPIRE', key, window + 10)
return {1, limit - count - 1, 0} -- allowed, remaining, wait_ms
else
-- Rejected: calculate when oldest entry expires
local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
local wait_ms = 0
if #oldest > 0 then
wait_ms = math.ceil((tonumber(oldest[2]) + (window * 1000)) - now)
end
return {0, 0, math.max(0, wait_ms)}
end
3. Sliding Window Counter (Hybrid)
Combines fixed window efficiency with sliding window accuracy. Maintains counters for the current and previous windows, then weights them based on the position within the current window.
Properties: Approximate but within ~1% accuracy. Fixed memory (2 counters per key). Eliminates the fixed-window boundary problem. Good balance of accuracy and performance.
-- Sliding Window Counter: Redis Lua Script
-- Weighted average of current and previous window counters
-- KEYS[1] = current window counter key
-- KEYS[2] = previous window counter key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests per window
-- ARGV[3] = current timestamp (epoch seconds)
local curr_key = KEYS[1]
local prev_key = KEYS[2]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
-- Determine position within current window (0.0 to 1.0)
local window_start = math.floor(now / window) * window
local elapsed_in_window = now - window_start
local weight_current = elapsed_in_window / window -- e.g., 0.7 means 70% through window
local weight_previous = 1 - weight_current -- e.g., 0.3
-- Get counts for current and previous windows
local curr_count = tonumber(redis.call('GET', curr_key) or "0")
local prev_count = tonumber(redis.call('GET', prev_key) or "0")
-- Calculate weighted request count
-- "How many requests would be in a sliding window ending now?"
local estimated_count = (prev_count * weight_previous) + curr_count
if estimated_count < limit then
-- Allowed: increment current window counter
redis.call('INCR', curr_key)
redis.call('EXPIRE', curr_key, window * 2)
local remaining = math.floor(limit - estimated_count - 1)
return {1, math.max(0, remaining), 0}
else
-- Rejected: estimate when capacity frees up
local wait_ms = math.ceil((1 - weight_current) * window * 1000)
return {0, 0, wait_ms}
end
4. Fixed Window Counter
The simplest algorithm. Divides time into fixed windows (e.g., per minute). Counts requests in the current window. Resets to zero when the window expires.
Properties: Minimal memory (1 counter + TTL). Simple implementation. Vulnerable to boundary bursts (2x limit at window edge). Best for non-critical, high-volume scenarios.
-- Fixed Window Counter: Redis Lua Script
-- KEYS[1] = window key (includes window ID)
-- ARGV[1] = max requests per window
-- ARGV[2] = window size in seconds (for TTL)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
-- Atomic increment and check
local count = redis.call('INCR', key)
-- Set TTL on first request of the window
if count == 1 then
redis.call('EXPIRE', key, window)
end
if count <= limit then
return {1, limit - count, 0} -- allowed, remaining, wait_ms
else
-- Calculate time remaining in current window
local ttl = redis.call('TTL', key)
local wait_ms = ttl * 1000
return {0, 0, wait_ms}
end
Algorithm Comparison
| Criterion | Token Bucket | Sliding Window Log | Sliding Window Counter | Fixed Window |
|---|---|---|---|---|
| Accuracy | Exact (per-token) | Exact (per-request) | ~99% (weighted estimate) | Exact within window, 2x at boundary |
| Memory/key | 16 bytes | O(n) where n=requests | 24 bytes (2 counters) | 12 bytes |
| Redis ops/check | 1 (EVALSHA) | 1 (EVALSHA, but heavier) | 1 (EVALSHA, 2 keys) | 1 (INCR + conditional EXPIRE) |
| Burst tolerance | Configurable (capacity knob) | None (strict) | Minimal | 2x at boundary (vulnerability) |
| Best for | General API limiting | Login/auth (low-limit, high-accuracy) | High-volume APIs | Simple use cases, non-critical |
| Boundary problem | No (continuous refill) | No (true sliding) | Minimal (~1% error) | Yes (double-spend at edge) |
| Clock sensitivity | Low (relative elapsed) | Medium (absolute timestamps) | Medium (window alignment) | Low (Redis TTL-based) |
| Recommended for | Default choice | Security-critical endpoints | High-volume, cost-sensitive | Internal/dev environments |
Scaling Considerations
| Dimension | Current (1.7M req/sec) | 10x Scale (17M req/sec) | Strategy |
|---|---|---|---|
| Redis shards | 6 primaries + 6 replicas | 60 primaries + 60 replicas | Horizontal shard scaling, consistent hash ring expansion |
| Gateway nodes | 20 nodes | 200 nodes | Stateless, auto-scale on CPU/connection count |
| Local cache hit rate | 40% (saves 680K Redis ops/sec) | 60% (Zipfian: more clients = more hot keys) | Increase local cache size, reduce sync interval |
| Memory (Redis total) | 2 GB across 6 shards | 20 GB across 60 shards | Linear scaling, add shards |
| Network bandwidth | 231 MB/sec (Redis) | 2.3 GB/sec | Dedicated Redis network, RDMA, or Unix sockets |
| Lua script CPU | 23% per shard | 23% per shard (distributes) | Constant per-shard with proper sharding |
| Hot key mitigation | Single-shard hotspot possible | Must shard hot keys | Sub-key splitting: {api_key}:{endpoint}:{shard_n} |
| Config propagation | < 2s (20 nodes via Pub/Sub) | < 5s (200 nodes) | Hierarchical pub/sub or gossip protocol |
| Monitoring volume | 1.7M metrics/sec | 17M metrics/sec | Sampling (emit 1% of allowed, 100% of rejected) |
| Cross-region | Single region | Multi-region | Per-region rate limits with global aggregation (async) |
Multi-Region Considerations
At 10x+ scale, the system likely spans multiple regions. Two approaches:
-
Per-region independent limits — Each region enforces its own rate limit (e.g., 50 req/min per region). Simple, no cross-region coordination. Downside: client can get N * region_count total requests by rotating regions.
-
Global rate with local enforcement — Each region gets a "budget" (proportional to its traffic share). A global coordinator periodically rebalances budgets. Allows burst absorption locally while maintaining global limits. Complexity: budget rebalancing lag, split-brain handling.
Quick Recall
| Question | Answer |
|---|---|
| Why token bucket over sliding window? | Token bucket allows configurable bursts (capacity knob). Sliding window log is strictly uniform — punishes legitimate bursty usage. Token bucket uses 16 bytes/key vs O(n) for sliding log. |
| Why fail-open on Redis failure? | Redis down = allow all traffic. False rejects (blocking paying customers) cause more business damage than brief over-limit (temporary abuse window). Backend services have their own protection (circuit breakers). |
| How is distributed counting solved? | Redis Cluster as single source of truth. Local counters as optimization for hot keys (40% hit rate). Hybrid approach: sub-ms local path + eventual consistency (100ms sync) to Redis. |
| What's the key structure? | {api_key}:{endpoint}:{algorithm_suffix}. Hash tag on {api_key} ensures all endpoints for one client land on same shard (multi-key ops). TTL = 2x window for auto-cleanup. |
| Why is Retry-After critical? | Without it, rejected clients retry immediately → thundering herd → retry storm consumes 3x resources of original traffic. Retry-After + exponential backoff eliminates this amplification. |
| What's Redis cluster capacity? | 6 shards x 300K ops/sec = 1.8M ops/sec capacity. ~200MB raw state (10M keys x 56 bytes). With local cache reducing load by 40%, effective demand is ~1M ops/sec (55% utilization). |
| Why Lua scripts over MULTI/EXEC? | Lua executes atomically in one round-trip (read + compute refill + decrement + write). MULTI/EXEC needs 3 round-trips + CAS retry on contention. Hot keys would cause frequent EXEC failures. |
| How are rules updated without restart? | Config service pushes updates via Redis Pub/Sub. All gateway nodes subscribe. Rule change propagates in < 2s. In-memory rule cache updated atomically. No connection drops, no deploys. |
| What happens during clock skew? | Token bucket is resilient (uses relative elapsed time). Fix: use Redis TIME command instead of client clock. Monitor NTP drift > 500ms. Fixed window is most vulnerable (window misalignment). |
| How to handle a hot key (100K req/sec)? | Local counter absorbs most checks (cache "rejected" state). Request coalescing batches concurrent checks. After sustained abuse, auto-blacklist the key. Shard splitting for legitimate hot keys. |
| What's the false rejection budget? | < 0.1%. Primary cause: clock skew between gateway and Redis, or sync lag in local counter. Mitigate: generous capacity (10% buffer), prefer Redis TIME, reduce sync interval for high-value keys. |
| How does the system handle 10x growth? | Linear horizontal scaling: add Redis shards (consistent hashing), add gateway nodes (stateless). Local cache hit rate improves with scale (Zipfian distribution). Per-shard load stays constant. |