Ride Sharing (Uber/Lyft)

11 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Real Incident: Uber's New Year's Eve Outage, 2015

At midnight on NYE 2015, Uber's matching service received 5x its normal traffic across every major US city simultaneously. The dispatch system — which handled both location ingestion and driver matching in a single monolith — collapsed under the write amplification: 1.7M location updates/sec were contending with matching queries on the same Redis instance. Riders saw "no cars available" despite thousands of idle drivers nearby. The matching algorithm was scanning stale location data (30-60 seconds old), dispatching drivers who had already moved miles away. The outage lasted 25 minutes in NYC and SF. Post-mortem fix: Uber decoupled the hot write path (location ingestion) from the hot read path (matching queries) into separate services with independent scaling. They also built the H3 hexagonal spatial index to replace geohash-based lookups, eliminating boundary edge effects that caused 15% of matches to fail at cell borders. The lesson: never let your highest-throughput write path share infrastructure with your latency-sensitive read path.

System Design Concepts Used

Geospatial Indexing (GEOADD/GEORADIUS) · WebSocket (Real-Time Tracking) · Expanding Radius Search · Consistent Hashing · City-Based Sharding · Event-Driven Architecture (Kafka) · State Machine (Trip Lifecycle) · Surge Pricing (Supply/Demand) · Pre-Authorization (Payments) · ETA Estimation (Graph + Traffic) · Heartbeat / Liveness Detection · CQRS (Location Writes vs. Matching Reads)

1. Functional Requirements

Request a ride — rider specifies pickup and dropoff locations, receives fare estimate and ETA
Match rider with nearest available driver — system finds optimal driver using geospatial proximity, ETA, and driver rating
Real-time tracking — both rider and driver see each other's live location on a map (updates every 3-4 seconds)
Dynamic fare calculation — base fare + distance + time + surge multiplier, calculated at trip completion for accuracy
Payment processing — pre-authorize estimated fare, hold during trip, capture exact amount on completion
ETA estimation — predict arrival time using road network graph weighted by real-time traffic data
Trip state management — full lifecycle: REQUESTED → MATCHED → EN_ROUTE → IN_PROGRESS → COMPLETED/CANCELLED
Driver availability management — track online/offline status, current trip status, and location in real time

2. Non-Functional Requirements

Requirement	Target	Rationale
Availability	99.99% (< 52 min/year)	Riders stranded without rides is unacceptable
Match latency	< 30 seconds end-to-end	Rider patience threshold before cancellation
Location freshness	< 4 seconds stale	Stale locations cause bad matches (driver 2km away, not 200m)
Location throughput	1.7M updates/sec	5M concurrent drivers × 1 update per 3 seconds
ETA accuracy	± 2 minutes	Rider trust depends on accurate predictions
Payment reliability	Zero double-charges	Financial errors destroy user trust permanently
Global scale	20M rides/day across 900+ cities	City-based isolation for fault containment
Consistency model	Eventual for locations, strong for payments	Location staleness acceptable; payment errors are not

3. Capacity Estimation

Text Only

/* ━━━ NAPKIN MATH: Uber-Scale Ride Sharing ━━━ */

/* ━━━ RIDES ━━━ */
Rides/day:               20M
Rides/sec (avg):         20M / 86,400 ≈ 230 rides/sec
Rides/sec (peak 5x):    230 × 5 = 1,150 rides/sec (Friday night, NYE)
Rides/sec (NYE spike):   230 × 10 = 2,300 rides/sec

/* ━━━ LOCATION UPDATES (Hot Write Path) ━━━ */
Active drivers:          5M concurrent
Update frequency:        1 update every 3 seconds
Location updates/sec:    5M / 3 ≈ 1,700,000 updates/sec  ← THIS dominates
Payload per update:      driver_id(8B) + lat(8B) + lon(8B) + ts(8B) + heading(4B) + speed(4B) = 40 bytes
Ingestion bandwidth:     1.7M × 40 bytes = 68 MB/sec sustained

/* ━━━ MATCHING QUERIES ━━━ */
Matching attempts/sec:   1,150 × 3 (avg retries per ride) = 3,450 geo queries/sec
Each GEORADIUS scans:    ~50-200 drivers per query
Matching decisions/sec:  3,450 queries/sec peak

/* ━━━ GEOSPATIAL INDEX (In-Memory) ━━━ */
Drivers in index:        5M
Per driver entry:        driver_id(8B) + geo_score(8B) + metadata(34B) = 50 bytes
Total index size:        5M × 50 bytes = 250 MB  ← fits in ONE Redis instance
With replication (3x):   750 MB total

/* ━━━ STORAGE ━━━ */
Trip record size:        ~1 KB (locations, fare, metadata, payment ref)
Trip storage/day:        20M × 1 KB = 20 GB/day
Trip storage/year:       7.3 TB/year
Location history/day:    1.7M/sec × 86,400 × 40B = 5.9 TB/day (cold storage only)

/* ━━━ WEBSOCKET CONNECTIONS ━━━ */
Active riders tracking:  ~10M concurrent (rider + driver per trip × 5M active trips)
WebSocket messages/sec:  10M connections × 1 msg/3s = 3.3M messages/sec outbound

System Nature

Write-heavy on location ingestion, read-heavy on matching. The architecture must separate these two paths: locations stream in at 1.7M/sec (write-optimized), while matching queries need sub-5ms geospatial lookups (read-optimized). These are fundamentally different access patterns that cannot share infrastructure.

4. "Why X, Not Y?" — Tradeoff Analysis

Why Redis GEOADD/GEORADIUS and not PostgreSQL PostGIS?

Redis wins because 1.7M writes/sec with sub-millisecond read latency is impossible for any disk-based system. The entire driver location dataset (250 MB for 5M drivers) fits in RAM. Redis GEOADD is O(log N) — updating 1.7M locations/sec is feasible because it's pure memory operations with no disk I/O, no WAL, no MVCC overhead. GEORADIUS returns all drivers within a radius in < 1ms.

PostGIS stores geospatial data on disk with B-tree/GiST indexes. Even with SSDs, you'd get ~50K writes/sec per PostgreSQL instance — 34 shards just to handle ingestion. Reads require disk I/O for cold data. PostGIS is ideal for static geospatial data (zip code boundaries, map tiles), but not for data that changes 1.7M times per second.

PostGIS advantage: Complex spatial queries (polygon intersections, route containment), persistent storage, SQL joins with trip data. Use PostGIS for analytics queries on historical location data, not the real-time hot path.

Why expanding radius search and not fixed radius?

Expanding radius guarantees a match in dense AND sparse areas without wasting compute. Start at 1 km — in Manhattan, you'll find 50+ drivers immediately. In rural Kansas, you'll find zero. Fixed radius forces a choice: too small (no matches in sparse areas) or too large (scanning thousands of candidates in dense areas, wasting CPU).

Expanding radius: 1 km → 2 km → 4 km → 8 km → 16 km. In NYC, the first ring (1 km) returns enough candidates — done in one query. In rural areas, you expand 4-5 times but still find a match. The algorithm is self-tuning to density.

Fixed radius advantage: Simpler implementation, predictable latency (always one query). Use if your service only operates in uniformly dense areas (e.g., a bike-sharing app in a single city center).

Why eventual consistency for locations and not strong consistency?

Strong consistency for 1.7M writes/sec would require distributed consensus (Raft/Paxos) on every update — adding 10-50ms latency per write, reducing throughput by 10x. A driver's location being 3-4 seconds stale is perfectly acceptable: at 30 mph, a car moves ~40 meters in 3 seconds. The matching algorithm already accounts for ETA, not exact position.

If we required strong consistency (every matching query sees the absolute latest position), we'd need synchronous replication across geo-index replicas before acknowledging writes. At 1.7M writes/sec, this would create a bottleneck that reduces throughput to ~100K writes/sec — requiring 17x more Redis instances.

Strong consistency advantage: Guarantees no stale matches. Use for systems where position accuracy is safety-critical (air traffic control, autonomous vehicle coordination).

Why city-based sharding and not global?

A driver in NYC will NEVER be matched with a rider in San Francisco — geographic locality is inherent to the domain. City-based sharding gives us: (1) fault isolation — an outage in the NYC shard doesn't affect SF; (2) data locality — all matching queries stay within one shard, zero cross-shard joins; (3) independent scaling — NYC needs 10x the capacity of Des Moines.

Global sharding (e.g., hash by driver_id) would scatter NYC drivers across all shards, requiring fan-out queries to find nearby drivers. A GEORADIUS for one rider would hit every shard — turning a 1ms query into a 20ms distributed gather operation.

Global sharding advantage: Uniform load distribution, simpler capacity planning. Use for systems without inherent geographic locality (e.g., a global chat system where any user can message any other user).

5. High-Level Architecture

6. Backend Services Explained

Trip Service (Orchestrator)

The brain of the system. Manages the entire ride lifecycle through a state machine: REQUESTED → MATCHED → EN_ROUTE → IN_PROGRESS → COMPLETED. It coordinates between Matching, Pricing, Payment, and Notification services using a saga pattern. Every state transition is persisted to PostgreSQL before side effects are triggered. If the Trip Service crashes mid-transition, it recovers by replaying from the last persisted state. It also handles cancellation logic (who pays the cancellation fee based on timing) and dispute resolution triggers.

Location Service (Hot Write Path)

The highest-throughput service in the system — ingesting 1.7M GPS updates per second. Each update contains latitude, longitude, heading, speed, and timestamp. The service validates coordinates (rejects GPS drift > 200 km/h implied speed), then writes to the city-specific Redis geo-index via GEOADD. It also publishes to Kafka's location-history topic for offline analytics and route reconstruction. This service is completely stateless — any instance can process any driver's update. Scaling is horizontal: add more instances behind the WebSocket proxy.

Matching Service (Core Algorithm)

Receives a ride request (pickup lat/lon, destination) and finds the optimal driver. Executes an expanding radius GEORADIUS query against the city's Redis geo-index. Candidates are scored by a weighted function: 0.5 * ETA + 0.3 * distance + 0.2 * driver_rating. The top candidate receives a ride offer via push notification. If the driver doesn't accept within 15 seconds, the next candidate is offered. After 3 rejections or timeouts, the radius expands and a new candidate pool is generated. The service is idempotent — if it crashes mid-matching, the Trip Service detects the timeout and retriggers.

Pricing Service (Surge)

Calculates fare estimates (before trip) and final fares (after trip). Maintains a real-time supply/demand heatmap using H3 hexagonal cells (resolution 7, ~5 km). Every 30 seconds, it recalculates the surge multiplier per cell: surge = max(1.0, demand_count / supply_count * scaling_factor). Surge is capped at 8x and smoothed (changes by at most 0.5x per interval to avoid price shock). The service exposes both a "current surge" API (for fare estimates) and a "lock surge" API (freezes the multiplier for 5 minutes after a rider confirms).

Payment Service

Handles the financial lifecycle: pre-authorize estimated fare when ride is confirmed, adjust hold if route deviates significantly, capture exact amount on completion. Uses idempotency keys to prevent double-charges during retries. Integrates with Stripe/Braintree via a circuit breaker pattern — if the payment gateway is down, trips complete normally and charges are retried asynchronously. Maintains a ledger of all transactions for dispute resolution and regulatory compliance.

ETA Service

Predicts time-of-arrival using a weighted road network graph. Edges represent road segments with weights based on: (1) segment length, (2) speed limit, (3) real-time traffic coefficient (updated every 60 seconds from driver GPS traces), (4) time-of-day patterns (ML model). Uses A* algorithm with traffic-weighted heuristics. Returns both "time to pickup" (driver → rider) and "time to destination" (pickup → dropoff). The graph is city-specific and pre-loaded in memory (~2 GB per major city).

Notification Service

Manages all push notifications (ride offer to driver, driver arriving, trip completed) and real-time WebSocket updates (live location tracking). Maintains persistent WebSocket connections to all active riders and drivers (~10M concurrent). When a driver's location updates, the Notification Service fans out the new position to the matched rider's WebSocket connection. Uses connection affinity (sticky sessions) to avoid reconnection overhead.

7. Architecture Flow — Raj Requests a Ride in Bangalore

A rider named Raj in Bangalore opens the Uber app at 8:45 AM to request a ride to the airport (Kempegowda International, 35 km away) during morning rush hour.

Phase 1 — Ride Request and Fare Estimate

T+0ms: Raj taps "Request Ride." The Rider App sends his pickup location (12.9716N, 77.5946E) and destination to the API Gateway via HTTPS.

T+15ms: API Gateway authenticates Raj's token, applies rate limiting, and routes to the Trip Service in the bangalore shard.

T+20ms: Trip Service calls Pricing Service: "What's the fare estimate for this route with current surge?"

T+25ms: Pricing Service checks the H3 cell containing Raj's pickup. Current state: 45 ride requests in the last 5 minutes, 30 available drivers → surge multiplier = 1.5x. Estimated fare: base (50 INR) + distance (35 km × 12 INR/km) + time (45 min × 2 INR/min) + surge (1.5x) = approximately 840 INR.

T+30ms: Trip Service returns fare estimate to Raj. He confirms. Trip state → REQUESTED.

Text Only

Raj (Bangalore) → API Gateway → Trip Service → Pricing Service (surge 1.5x)
                                              → fare estimate: ₹840
                                              → Trip state: REQUESTED

Phase 2 — Driver Matching (Expanding Radius)

T+50ms: Trip Service calls Matching Service with Raj's pickup coordinates.

T+52ms: Matching Service executes GEORADIUS drivers:bangalore 77.5946 12.9716 1 km on Redis. Returns 8 available drivers within 1 km.

T+55ms: For each candidate, Matching Service calls ETA Service to get real pickup time (accounting for one-way streets, traffic). Results: Driver A = 4 min, Driver B = 6 min, Driver C = 3 min (closest by road), Driver D = 8 min.

T+60ms: Scoring: Driver C wins (lowest ETA, 4.8 rating). Matching Service sends ride offer to Driver C via push notification.

T+65ms - T+72s: Driver C (Priya) sees the offer on her app: "Pickup: 3 min away, Destination: Airport (35 km), Est. fare: 840 INR." She taps Accept at T+8s.

T+8.1s: Trip state → MATCHED. Both Raj and Priya receive confirmation. Raj sees Priya's photo, car model (Honda City, KA-01-XX-1234), and live location.

Text Only

Matching Service → GEORADIUS (1km, 8 candidates) → ETA scoring → offer to Driver C
Driver C accepts → Trip state: MATCHED → notifications to both parties

Phase 3 — Real-Time Tracking (Driver En Route to Pickup)

T+8s to T+3.5min: Priya's Driver App streams GPS coordinates every 3 seconds via WebSocket → Location Service → Redis GEOADD.

Every 3 seconds: Notification Service reads Priya's latest location from Redis and pushes it to Raj's WebSocket connection. Raj sees her car icon moving on the map in real-time.

T+3min: Priya is 200m away. Notification Service sends push: "Your driver is arriving." ETA updates from "3 min" to "1 min" to "Arriving now."

T+3.5min: Priya arrives at pickup. She taps "Arrived." Trip state → EN_ROUTE (waiting for rider).

Text Only

Driver App → WebSocket → Location Service → Redis GEOADD (every 3s)
Notification Service → WebSocket → Rider App (live map update)

Phase 4 — Trip In Progress

T+4min: Raj gets in. Priya taps "Start Trip." Trip state → IN_PROGRESS.

T+4min to T+49min: The route to the airport. Location Service continues streaming Priya's coordinates. The Trip Service records the route polyline for fare calculation. ETA Service updates "time to destination" in real-time as traffic conditions change.

T+35min: Priya takes an alternate route to avoid a traffic jam detected by the ETA Service. The fare is distance-based, so the detour slightly increases the final fare. Trip Service recalculates continuously.

Phase 5 — Trip Completion and Payment

T+49min: Priya arrives at the airport. She taps "End Trip." Trip state → COMPLETED.

T+49.1s: Trip Service calculates final fare: base (50) + actual distance (37.2 km × 12) + actual time (45 min × 2) + surge (1.5x) = 893 INR (slightly more than estimate due to detour).

T+49.2s: Trip Service calls Payment Service: "Capture 893 INR from Raj's pre-authorized hold."

T+49.5s: Payment Service calls Stripe with idempotency key trip_12345_capture. Stripe captures from the pre-authorized hold (which was 1000 INR, 15% buffer above estimate). Remaining hold is released.

T+50s: Both Raj and Priya receive trip summary: route map, fare breakdown, option to rate each other. Trip event published to Kafka for analytics.

Text Only

Trip ends → final fare calculated → Payment capture → receipts sent
Trip event → Kafka → analytics, ML training, driver payouts

8. Failure & Recovery Scenarios

Location Service Lag (Stale Driver Positions)

Problem: A Location Service instance becomes slow (GC pause, network issue). Driver positions in Redis become 15-30 seconds stale instead of 3-4 seconds. Matching Service dispatches drivers who've moved 500m+ from their indexed position.

Detection: Heartbeat monitor checks timestamp of latest update per driver. If now - last_update > 10s, mark driver as "position_uncertain."

Recovery: (1) Matching Service excludes "position_uncertain" drivers from initial candidate pool. (2) If all nearby drivers are uncertain, match anyway but add a "driver may be further than shown" warning. (3) Auto-scale Location Service horizontally — K8s HPA triggers on processing lag > 5s. (4) Driver app detects its updates aren't being acknowledged and switches to HTTP fallback (bypassing the slow WebSocket proxy).

Matching Service Overload (NYE Scenario)

Problem: 5x normal ride requests flood the Matching Service. Queue depth grows, match latency exceeds 30 seconds. Riders cancel and re-request, creating a thundering herd.

Detection: Match latency p99 > 15s triggers alert. Queue depth > 10,000 triggers emergency scaling.

Recovery: (1) Shed load gracefully — return "high demand, longer wait times" to riders instead of timing out silently. (2) Increase batch size — instead of matching one-at-a-time, batch 50 requests in the same area and run a bipartite matching algorithm (Hungarian method) to optimally assign all at once. (3) Reduce candidate scoring — skip ETA Service calls, use straight-line distance as a proxy. Accuracy drops but throughput 10x's. (4) Auto-scale — Matching Service is stateless, new pods come up in 15s on pre-warmed nodes.

Payment Gateway Timeout

Problem: Stripe/Braintree returns timeouts during trip completion. Riders complete trips but charges hang in "pending."

Detection: Circuit breaker opens after 3 consecutive timeouts (> 5 seconds each).

Recovery: (1) Trip still completes normally — payment is decoupled from trip state. Rider sees "payment processing" instead of a final receipt. (2) Async retry with exponential backoff — Payment Service writes charge intent to a durable queue (Kafka payment-retries topic). A consumer retries every 30s, 60s, 120s with the same idempotency key. (3) Hold protects against loss — the pre-authorized hold remains until either captured or expired (7 days). Uber never loses money; worst case is delayed settlement. (4) If gateway is down > 1 hour — switch to backup payment processor (Braintree → Adyen failover).

Driver App Disconnects Mid-Trip

Problem: Priya's phone loses connectivity in a tunnel. No location updates for 2 minutes. Raj sees the driver icon frozen on the map.

Detection: Location Service detects no heartbeat for 30 seconds. Marks trip as "tracking_interrupted."

Recovery: (1) Rider notification — "Live tracking temporarily unavailable. Your trip is still active." (2) Trip continues — the trip state remains IN_PROGRESS. When the driver reconnects, all buffered GPS points from the phone's local storage are uploaded in bulk (the Driver App caches locally during disconnection). (3) Fare calculation unaffected — the phone's local GPS log is authoritative for distance. Even if server-side tracking has gaps, the driver's app reconstructs the full route on reconnection. (4) Safety timeout — if no reconnection for 10 minutes, Trip Service alerts a safety team and contacts both rider and driver via SMS.

9. Data Model

Text Only

/* ═══ PostgreSQL — Trip Store (Sharded by city_id) ═══ */

CREATE TABLE trips (
    trip_id         UUID        PRIMARY KEY,
    city_id         INTEGER     NOT NULL,              -- shard key
    rider_id        BIGINT      NOT NULL,
    driver_id       BIGINT,                            -- NULL until matched
    status          VARCHAR(20) NOT NULL DEFAULT 'REQUESTED',
                    -- REQUESTED | MATCHED | EN_ROUTE | IN_PROGRESS | COMPLETED | CANCELLED
    pickup_lat      DECIMAL(9,6) NOT NULL,
    pickup_lon      DECIMAL(9,6) NOT NULL,
    dropoff_lat     DECIMAL(9,6),
    dropoff_lon     DECIMAL(9,6),
    surge_multiplier DECIMAL(3,2) DEFAULT 1.00,        -- locked at request time
    estimated_fare  INTEGER,                           -- cents, shown to rider
    final_fare      INTEGER,                           -- cents, calculated at completion
    distance_meters INTEGER,                           -- actual route distance
    duration_seconds INTEGER,                          -- actual trip time
    route_polyline  TEXT,                              -- encoded polyline for map display
    requested_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    matched_at      TIMESTAMPTZ,
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    payment_id      UUID                               -- FK to payment ledger
);

CREATE INDEX idx_trips_rider   ON trips (rider_id, requested_at DESC);
CREATE INDEX idx_trips_driver  ON trips (driver_id, requested_at DESC);
CREATE INDEX idx_trips_status  ON trips (city_id, status) WHERE status IN ('REQUESTED','MATCHED','IN_PROGRESS');


/* ═══ Redis — Geospatial Index (Per City) ═══ */

-- One sorted set per city. Members = driver_id. Score = geohash encoding.
GEOADD drivers:bangalore 77.5946 12.9716 "driver_1001"
GEOADD drivers:bangalore 77.5980 12.9750 "driver_1002"

-- Query: find all drivers within 2km of rider
GEORADIUS drivers:bangalore 77.5946 12.9716 2 km WITHCOORD WITHDIST COUNT 20 ASC

-- Driver metadata (availability, vehicle type)
HSET driver:1001 status "available" vehicle "sedan" rating "4.8" last_update "1716100800"


/* ═══ Trip State Machine ═══ */

    REQUESTED ──→ MATCHED ──→ EN_ROUTE ──→ IN_PROGRESS ──→ COMPLETED
        │              │           │              │
        ▼              ▼           ▼              ▼
    CANCELLED    CANCELLED    CANCELLED     CANCELLED
   (by rider)  (by driver)  (by rider)    (by either - rare)

-- Transitions are atomic: UPDATE trips SET status = $1 WHERE trip_id = $2 AND status = $3
-- If affected_rows = 0, state already changed (concurrent update) → retry or abort

10. Algorithms Under the Hood

Geospatial Nearest-Driver Search (GEORADIUS)

Text Only

function find_nearby_drivers(pickup_lat, pickup_lon, city_id, radius_km):
    /*
     * Redis GEORADIUS returns drivers sorted by distance.
     * O(N+log(M)) where N = elements in radius, M = total elements in set.
     * For 5M drivers, log(5M) ≈ 23 comparisons. Sub-millisecond.
     */
    candidates = redis.GEORADIUS(
        key   = "drivers:" + city_id,
        lon   = pickup_lon,
        lat   = pickup_lat,
        radius = radius_km,
        unit  = "km",
        WITHCOORD,           -- return each driver's lat/lon
        WITHDIST,            -- return distance from pickup
        COUNT 50,            -- limit to 50 nearest
        ASC                  -- sort by distance ascending
    )

    // Filter out unavailable drivers
    available = []
    for driver in candidates:
        meta = redis.HGETALL("driver:" + driver.id)
        if meta.status == "available" AND meta.last_update > (now - 30s):
            available.append(driver)

    return available

Expanding Radius Matching

Text Only

function match_rider_to_driver(trip_id, pickup_lat, pickup_lon, city_id):
    radius = 1.0        // start at 1 km
    max_radius = 16.0   // give up at 16 km
    max_attempts = 3    // per radius level

    while radius <= max_radius:
        candidates = find_nearby_drivers(pickup_lat, pickup_lon, city_id, radius)

        if candidates is empty:
            radius = radius * 2     // expand: 1 → 2 → 4 → 8 → 16
            continue

        // Score candidates: lower is better
        scored = []
        for driver in candidates:
            eta = eta_service.get_eta(driver.lat, driver.lon, pickup_lat, pickup_lon)
            score = 0.5 * normalize(eta) +
                    0.3 * normalize(driver.distance) +
                    0.2 * (1 - normalize(driver.rating))  // higher rating = lower score
            scored.append({driver, score, eta})

        scored.sort_by(score, ASC)

        // Offer to candidates in order
        for i in range(min(max_attempts, len(scored))):
            offer = send_ride_offer(scored[i].driver, trip_id, timeout=15s)
            if offer.accepted:
                update_trip_status(trip_id, "MATCHED", scored[i].driver.id)
                mark_driver_unavailable(scored[i].driver.id)
                return SUCCESS

        // All candidates in this radius rejected/timed out
        radius = radius * 2

    return NO_DRIVERS_AVAILABLE   // trigger "no cars nearby" UX

Surge Pricing Formula

Text Only

function calculate_surge(city_id, h3_cell_index):
    /*
     * H3 resolution 7 cell ≈ 5 km diameter.
     * Recalculated every 30 seconds.
     */
    time_window = 5 minutes

    // Count ride requests in this cell in last 5 minutes
    demand = count_requests(city_id, h3_cell_index, time_window)

    // Count available drivers currently in this cell
    supply = count_available_drivers(city_id, h3_cell_index)

    if supply == 0:
        return MAX_SURGE    // cap at 8.0x

    raw_ratio = demand / supply

    // Apply scaling function (not linear — gradual ramp)
    if raw_ratio <= 1.0:
        surge = 1.0                              // no surge
    else if raw_ratio <= 2.0:
        surge = 1.0 + (raw_ratio - 1.0) * 0.5   // 1.0x to 1.5x
    else if raw_ratio <= 4.0:
        surge = 1.5 + (raw_ratio - 2.0) * 0.75  // 1.5x to 3.0x
    else:
        surge = 3.0 + (raw_ratio - 4.0) * 0.5   // 3.0x to 8.0x (capped)

    // Smoothing: don't change by more than 0.5x per interval
    previous_surge = get_previous_surge(city_id, h3_cell_index)
    surge = clamp(surge, previous_surge - 0.5, previous_surge + 0.5)

    // Hard cap
    surge = min(surge, 8.0)

    store_surge(city_id, h3_cell_index, surge)
    return surge


function calculate_fare(trip):
    base_fare = city_config[trip.city_id].base_fare          // e.g., $2.50
    per_km    = city_config[trip.city_id].per_km             // e.g., $1.50/km
    per_min   = city_config[trip.city_id].per_minute         // e.g., $0.25/min
    minimum   = city_config[trip.city_id].minimum_fare       // e.g., $7.00

    raw_fare = base_fare +
               (trip.distance_meters / 1000.0) * per_km +
               (trip.duration_seconds / 60.0) * per_min

    surged_fare = raw_fare * trip.surge_multiplier
    return max(surged_fare, minimum)

ETA Calculation (Road Network Graph)

Text Only

function calculate_eta(origin_lat, origin_lon, dest_lat, dest_lon, city_id):
    /*
     * A* algorithm on a pre-loaded road network graph.
     * Edge weights = segment_length / (speed_limit * traffic_coefficient)
     * Traffic coefficients updated every 60s from driver GPS traces.
     */
    graph = road_networks[city_id]   // pre-loaded in memory (~2 GB per city)

    origin_node = graph.snap_to_nearest_node(origin_lat, origin_lon)
    dest_node   = graph.snap_to_nearest_node(dest_lat, dest_lon)

    // A* with traffic-weighted edges
    path = a_star(
        graph,
        start     = origin_node,
        goal      = dest_node,
        heuristic = haversine_distance,    // admissible heuristic
        weight_fn = function(edge):
            base_time = edge.length_meters / edge.speed_limit_mps
            traffic   = get_traffic_coefficient(edge.id)  // 1.0 = free flow, 3.0 = heavy
            return base_time * traffic
    )

    total_seconds = sum(edge.weight for edge in path.edges)

    // Add buffer for pickup overhead (driver finding exact spot, rider walking out)
    pickup_buffer = 60  // 1 minute

    return total_seconds + pickup_buffer

11. Scaling Considerations

Challenge	Solution	Numbers
1.7M location writes/sec	Location Service horizontally scaled + Redis sharded by city	50 cities × 34K writes/sec each (manageable per Redis instance)
Matching at peak (1,150/sec)	Stateless Matching Service, auto-scaled by queue depth	Each instance handles ~100 matches/sec; 12 instances at peak
City isolation	Each city has its own Redis geo-index, Kafka partition, and DB shard	NYC outage doesn't affect SF; independent deployments
WebSocket connections (10M)	Connection-oriented proxy layer (Envoy) with sticky sessions	100 proxy instances × 100K connections each
Database growth (7.3 TB/year trips)	Shard by city_id; archive completed trips > 90 days to cold storage	Hot partition: only active + recent trips (~500 GB)
Surge accuracy	Recompute every 30s per H3 cell; ML model trained on historical patterns	Demand prediction reduces reactive surge by 40%
ETA accuracy in traffic	Road graph weights updated every 60s from real driver GPS traces	95^th percentile ETA error < 3 minutes
Global expansion (new city)	Spin up new shard (Redis + Postgres + services); import road graph; start with conservative surge	New city operational in < 48 hours
Driver app battery drain	Reduce GPS frequency to every 10s when idle (no active trip); every 3s when matched	40% battery savings for idle drivers
Payment failures	Async retry queue + idempotency keys + backup payment processor	99.97% charges succeed within 5 minutes

12. Quick Recall

Question	Answer
Location update rate?	1.7M/sec (5M drivers × 1 update per 3 seconds)
Why Redis for geo-index?	In-memory, O(log N) GEOADD, sub-ms GEORADIUS, 250 MB fits one instance per city
Matching strategy?	Expanding radius (1→2→4→8→16 km), score by ETA + distance + rating
Why city-based sharding?	Geographic locality (NYC driver never matches SF rider), fault isolation, independent scaling
Trip state machine?	REQUESTED → MATCHED → EN_ROUTE → IN_PROGRESS → COMPLETED (or CANCELLED at any step)
Surge pricing formula?	demand/supply ratio per H3 cell, recalculated every 30s, smoothed ± 0.5x per interval, capped at 8x
Why H3 over geohash?	Uniform cell area (no distortion at poles), no edge effects at boundaries, hierarchical k-ring expansion
Payment flow?	Pre-authorize estimated fare → hold during trip → capture exact amount on completion
Driver goes offline?	No heartbeat for 30s → remove from geo-index. Reconnection re-adds automatically.
How does ETA work?	A* on road network graph with traffic-weighted edges (updated every 60s from GPS traces)
What if Matching Service is slow?	Batch matching (bipartite assignment), skip ETA calls (use distance proxy), auto-scale stateless pods
Why eventual consistency for locations?	Strong consistency at 1.7M writes/sec needs distributed consensus = 10-50ms overhead per write. 3-4s staleness is acceptable (40m drift at 30 mph).
How to handle payment gateway down?	Circuit breaker → async retry queue with idempotency key → failover to backup processor. Trip still completes normally.
WebSocket vs polling?	10M connections × update every 3s. Polling = 3.3M HTTP requests/sec overhead. WebSocket = persistent connection, server pushes only when data changes.

Ride Sharing (Uber/Lyft)

System Design Concepts Used

1. Functional Requirements

2. Non-Functional Requirements

3. Capacity Estimation

4. "Why X, Not Y?" — Tradeoff Analysis

Why Redis GEOADD/GEORADIUS and not PostgreSQL PostGIS?

Why expanding radius search and not fixed radius?

Why eventual consistency for locations and not strong consistency?

Why city-based sharding and not global?

5. High-Level Architecture

6. Backend Services Explained

Trip Service (Orchestrator)

Location Service (Hot Write Path)

Matching Service (Core Algorithm)

Pricing Service (Surge)

Payment Service

ETA Service

Notification Service

7. Architecture Flow — Raj Requests a Ride in Bangalore

Phase 1 — Ride Request and Fare Estimate

Phase 2 — Driver Matching (Expanding Radius)

Phase 3 — Real-Time Tracking (Driver En Route to Pickup)

Phase 4 — Trip In Progress

Phase 5 — Trip Completion and Payment

8. Failure & Recovery Scenarios

Location Service Lag (Stale Driver Positions)

Matching Service Overload (NYE Scenario)

Payment Gateway Timeout

Driver App Disconnects Mid-Trip

9. Data Model

10. Algorithms Under the Hood

Geospatial Nearest-Driver Search (GEORADIUS)

Expanding Radius Matching

Surge Pricing Formula

ETA Calculation (Road Network Graph)

11. Scaling Considerations

12. Quick Recall

5-Minute System Design — Weekly