11 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Video Streaming Platform (YouTube / Netflix)

Real Incident: YouTube Global Outage, October 16, 2018

At 9:10 PM ET, YouTube went completely dark worldwide for 90 minutes. Every video returned a blank page or a monkey holding a wrench. The outage impacted 1.9 billion monthly active users across YouTube, YouTube TV, YouTube Music, and Google-served video ads. The root cause was a bug in YouTube's internal traffic management system that caused servers to incorrectly route all requests to a non-existent backend pool. Because YouTube serves over 500 million hours of video daily and accounts for approximately 15% of global internet traffic, the outage caused a measurable 4% dip in global internet traffic according to multiple ISPs. Recovery required a manual rollback of the configuration change, but the cascading effects (CDN cache invalidation storms, client retry amplification, cold-start thundering herds) extended the resolution window well beyond what a simple rollback should have taken.

Lesson: When your system IS the internet for 15% of traffic, even a single misconfiguration in the control plane cascades into a global catastrophe. Defense-in-depth must include canary deployments for configuration changes, not just code.

System Design Concepts Used

adaptive bitrate streaming, CDN edge caching, chunked/resumable upload, async job queues, transcoding pipelines, object storage, content-based routing, consistent hashing, eventual consistency, collaborative filtering, full-text search, sharding, replication, DRM/encryption, HLS/DASH protocols, pre-computed recommendations, write-behind caching, circuit breakers, rate limiting, horizontal auto-scaling

Functional Requirements

Video Upload - Creators upload videos up to 12 hours long (256 GB max) via chunked, resumable uploads with progress tracking
Video Transcoding - Uploaded videos are automatically transcoded into 8+ resolution/codec combinations (144p through 4K, H.264/H.265/VP9/AV1)
Adaptive Bitrate Streaming - Viewers watch videos with seamless quality switching based on real-time network conditions using HLS or DASH
Video Search - Full-text search across titles, descriptions, tags, captions, and auto-generated transcripts with ranked results
Personalized Recommendations - Home feed and "Up Next" suggestions powered by collaborative filtering, watch history, and content embeddings
Social Engagement - Comments, likes/dislikes, subscriptions, live chat during premieres, and community posts
Video Analytics - Real-time view counts, audience retention graphs, traffic source breakdown for creators
Content Moderation - Automated detection of copyright violations (Content ID), harmful content, and policy violations before publishing

Non-Functional Requirements

Requirement	Target	Rationale
Availability	99.99% (52 min downtime/year)	YouTube accounts for 15% of global internet traffic; even minutes of downtime affect billions
Video Start Latency	< 2 seconds (p99)	Users abandon videos with > 2s start time; every 100ms delay loses 1% of viewers
Rebuffer Rate	< 0.5% of watch time	Rebuffering is the #1 predictor of viewer churn; must be near-zero
Upload-to-Playable	< 10 minutes for 1080p, < 60 min for 4K	Creators expect fast turnaround; competitors offer "instant" short-form posting
Global Reach	< 50ms latency to nearest CDN PoP for 95% of users	Must serve 200+ countries; edge presence in all major metros
Throughput	170 Tbps peak egress bandwidth	Prime-time concurrent viewership requires massive CDN capacity
Durability	99.999999999% (11 nines) for uploaded content	Losing a creator's original content is unrecoverable and destroys trust
Consistency	Eventual (< 5s) for view counts; strong for uploads	Users accept slightly stale counts; they do NOT accept "video not found" after upload

Capacity Estimation

Text Only

=== YouTube-Scale Video Streaming — Napkin Math ===

DAU:                    500 million
Video views/day:        1 billion (2 views/user avg)
Peak stream starts/sec: 1B / 86400 * 3x peak factor = ~35,000 new streams per second
Avg session duration:   40 minutes

=== Upload Volume ===
Content uploaded/minute:     500 hours of video
Avg video length:            7 minutes
Videos/day:                  (500 * 60 / 7) * 1440 = ~6.2 million videos/day
Avg raw file size (1080p):   ~1.5 GB per hour → ~175 MB per 7-min video
Raw upload storage/day:      6.2M * 175 MB = ~1.1 PB/day (raw)

=== Transcoding Output ===
Renditions per video:        8 (144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 4K)
Avg total output per video:  ~500 MB (all renditions combined, compressed)
Transcoded storage/day:      6.2M * 500 MB = ~3.1 PB/day
Total new storage/day:       ~4.2 PB/day (raw + transcoded)
Cumulative storage:          ~1.5 EB/year at this rate (4.2 PB * 365); multiple exabytes over years of content

=== Bandwidth ===
Avg bitrate per stream:      5 Mbps (weighted avg across resolutions)
Peak concurrent viewers:     ~35 million simultaneous
Peak bandwidth:              35M * 5 Mbps = ~170 Tbps
CDN cache hit ratio:         95%+ (popular content, long-tail from origin)
Origin bandwidth:            170 Tbps * 5% = ~8.5 Tbps origin egress

=== CDN Infrastructure ===
Global PoPs:                 200+ edge locations
Avg PoP capacity:            ~1 Tbps
Storage per PoP:             ~500 TB SSD cache (hot content)
Total CDN edge storage:      200 * 500 TB = ~100 PB edge cache

=== Metadata ===
Video metadata records:      ~800 million videos total
Avg metadata size:           ~2 KB per video (title, desc, tags, stats)
Metadata DB size:            800M * 2 KB = ~1.6 TB
Watch history events/day:    1B views * 10 events each = 10B events/day
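
The arithmetic above can be rechecked with a few lines of Python. This is only a sketch that replays the figures stated in the estimate; the variable names are illustrative, and decimal units (1 PB = 10^9 MB) are assumed. Note that 35M viewers at 5 Mbps comes to 175 Tbps, which the estimate rounds to ~170 Tbps.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440

# Traffic
views_per_day = 1_000_000_000
peak_factor = 3
peak_starts_per_sec = views_per_day / SECONDS_PER_DAY * peak_factor

# Uploads: 500 hours of content per minute, 7-minute average video
videos_per_day = 500 * 60 / 7 * MINUTES_PER_DAY
raw_mb_per_video = 1_500 * 7 / 60                  # 1.5 GB per hour of 1080p
raw_pb_per_day = videos_per_day * raw_mb_per_video / 1e9
transcoded_pb_per_day = videos_per_day * 500 / 1e9  # ~500 MB of renditions each

# Bandwidth: 35M concurrent viewers at 5 Mbps, 95% CDN hit ratio
peak_tbps = 35_000_000 * 5 / 1e6
origin_tbps = peak_tbps * 0.05

print(f"stream starts/sec at peak: {peak_starts_per_sec:,.0f}")
print(f"videos/day:                {videos_per_day / 1e6:.1f} M")
print(f"raw storage/day:           {raw_pb_per_day:.1f} PB")
print(f"transcoded storage/day:    {transcoded_pb_per_day:.1f} PB")
print(f"peak egress:               {peak_tbps:.0f} Tbps")
print(f"origin egress:             {origin_tbps:.2f} Tbps")
```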

"Why X, Not Y?" Tradeoff Analysis

Why Adaptive Bitrate (HLS/DASH), Not Single-Quality Streaming?

Factor	Single Quality	Adaptive Bitrate (HLS/DASH)
Network variability	Buffers or fails on bandwidth drops	Seamlessly switches to lower quality
Device diversity	Wastes bandwidth on mobile, starves 4K TVs	Serves optimal quality per device
Start time	Must buffer full segment at target quality	Starts at low quality, ramps up quickly
CDN efficiency	Single rendition cached	Popular renditions cached per-PoP
User experience	Binary: plays or stalls	Graceful degradation, near-zero rebuffering

Decision: Adaptive bitrate (ABR) is the only viable approach at scale. The upfront cost of transcoding into 8 renditions pays for itself immediately in reduced rebuffering (the #1 cause of viewer abandonment). HLS is used for Apple ecosystem, DASH for everything else; most platforms support both.

Why Async Transcoding Queue, Not Synchronous?

Factor	Synchronous Transcode	Async Queue + Workers
Upload latency	Minutes to hours (blocks response)	Immediate acknowledgment (< 1s)
Resource isolation	Upload server needs GPU	Dedicated GPU fleet, independently scaled
Failure handling	Upload fails if transcode fails	Retry with backoff, dead-letter queue
Priority management	FIFO only	Priority queues (premium creators first)
Cost optimization	Always-on GPU	Spot instances, scale to zero off-peak
Throughput	Limited by single machine	Horizontally scalable worker fleet

Decision: Video transcoding is inherently CPU/GPU-intensive (a 10-min 4K video takes 30+ minutes to encode all renditions). Blocking the upload request would create unacceptable latency and couple unrelated failure domains. The async pattern (upload -> ack -> enqueue -> process -> notify) decouples ingestion from processing and allows independent scaling of the expensive GPU fleet.
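
A minimal in-memory sketch of the "ack, then enqueue with priority" step, assuming two tiers named "premium" and "standard". The class and field names are hypothetical, and a production system would use a durable broker rather than a Python heap.

```python
import heapq
import itertools
from dataclasses import dataclass, field

PRIORITY = {"premium": 0, "standard": 1}  # lower number is served first


@dataclass(order=True)
class TranscodeJob:
    priority: int
    seq: int                                # tie-breaker: FIFO within a tier
    video_id: str = field(compare=False)
    source: str = field(compare=False)


class TranscodeQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def enqueue(self, video_id, source, tier="standard"):
        job = TranscodeJob(PRIORITY[tier], next(self._counter), video_id, source)
        heapq.heappush(self._heap, job)
        # The upload request returns here; encoding happens later on a worker.
        return {"status": "accepted", "video_id": video_id}

    def dequeue(self):
        return heapq.heappop(self._heap) if self._heap else None


queue = TranscodeQueue()
queue.enqueue("vid_a", "s3://raw/a.mp4")
queue.enqueue("vid_b", "s3://raw/b.mp4", tier="premium")
queue.enqueue("vid_c", "s3://raw/c.mp4")
order = [queue.dequeue().video_id for _ in range(3)]
print(order)
```

The sequence counter keeps ordering stable inside a tier, so a premium job jumps the line without reordering standard jobs among themselves.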

Why CDN for Delivery, Not Origin Servers?

Factor	Origin Servers	CDN (200+ PoPs)
Latency	100-300ms (cross-continent)	< 20ms (nearest PoP)
Bandwidth cost	$0.05-0.09/GB at origin	$0.01-0.02/GB at edge (volume discounts)
Peak handling	Must provision for 170 Tbps	Distributed across 200+ PoPs
Failure blast radius	One region down = millions affected	One PoP down = auto-failover to next
Video start time	2-5 seconds (cold)	< 500ms for cached content
Cache hit ratio	N/A	95%+ for popular content

Decision: At 170 Tbps peak bandwidth, no single origin can serve this traffic. CDN edge caching ensures that the top 20% of content (which represents 80%+ of views due to Zipf distribution) is served from edge cache in < 20ms. The economics are compelling: edge delivery costs 3-5x less than origin delivery at scale.
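
The Zipf claim can be sanity-checked numerically. The sketch below assumes an idealized Zipf distribution with exponent 1 over a hypothetical catalog; real popularity curves differ, so treat the result as a rough illustration rather than a measurement.

```python
def zipf_top_share(num_videos, top_fraction, exponent=1.0):
    """Fraction of all views captured by the most popular `top_fraction` of videos."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_videos + 1)]
    top_n = int(num_videos * top_fraction)
    return sum(weights[:top_n]) / sum(weights)


share = zipf_top_share(num_videos=10_000, top_fraction=0.2)
print(f"top 20% of a 10,000-video catalog gets {share:.0%} of views")
```

Under these assumptions the top fifth of the catalog attracts more than 80% of views, which is why a cache far smaller than the full library can still absorb most requests.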

Why Chunked Upload, Not Single-File Upload?

Factor	Single-File Upload	Chunked Resumable Upload
Large files	Timeout on 4K videos (50+ GB)	Upload in 5 MB chunks, no timeout
Network interruption	Restart from byte 0	Resume from last successful chunk
Progress tracking	Binary (uploading or not)	Granular progress (chunk 47/1000)
Parallel upload	Single stream	Upload multiple chunks in parallel
Server memory	Buffer entire file	Process chunk by chunk (5 MB buffer)
Mobile support	Fails on spotty connections	Works on 3G, handles disconnects

Decision: YouTube's upload limit is 256 GB per video. A single HTTP request for a 50 GB 4K file would timeout, consume excessive server memory, and force complete restart on any network blip. Chunked upload (5 MB chunks) with server-side reassembly provides resumability, parallel uploads, and graceful handling of the unreliable last-mile connections that mobile creators depend on.
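
One practical wrinkle when chunks map one-to-one onto S3 multipart parts: S3 allows at most 10,000 parts per upload and requires every part except the last to be at least 5 MiB. The helper below is a sketch (function names are hypothetical) that picks the smallest whole-MiB part size that stays under the part limit.

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB     # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000     # S3 maximum number of parts per multipart upload


def choose_part_size(file_size, preferred=MIN_PART):
    """Smallest part size >= preferred, in whole MiB, that fits in MAX_PARTS parts."""
    needed = math.ceil(file_size / MAX_PARTS)
    part = max(preferred, needed)
    return math.ceil(part / MIB) * MIB


def part_count(file_size, part_size):
    return math.ceil(file_size / part_size)


for size in (8_500_000_000, 256 * 10**9):
    part = choose_part_size(size)
    print(f"{size / 1e9:.1f} GB -> {part // MIB} MiB parts, {part_count(size, part)} parts")
```

With 5 MiB parts the limit is reached at roughly 52 GB, so a 256 GB upload needs larger parts or a server that groups several client chunks into one part.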

High-Level Architecture

Backend Services Explained

Upload Service

The Upload Service handles the ingestion of raw video files from creators. It implements the tus resumable upload protocol (or a proprietary equivalent) where clients split files into 5 MB chunks and upload them independently with each chunk identified by a byte-range offset. The server maintains an upload session state in Redis (tracking which chunks have been received) and writes completed chunks directly to object storage (S3/GCS) using multipart upload APIs. Once all chunks arrive, it triggers a "video uploaded" event to the transcode queue. The service also performs pre-upload validation: codec sniffing, file-size limits (256 GB max), virus scanning, and duplicate detection via perceptual hashing. It runs behind a rate limiter (per-creator: 50 uploads/day for free tier, unlimited for premium) and authenticates every request via short-lived signed upload URLs.

Transcoder Fleet

The Transcoder Fleet is a horizontally-scalable pool of GPU-accelerated workers (typically NVIDIA T4 or A10G instances) that consume jobs from the transcode queue. Each job specifies the source video location in raw storage and produces a resolution ladder — eight renditions from 144p to 4K. The transcoding pipeline is a DAG: first, the source is decoded and analyzed (scene detection, bitrate optimization via per-title encoding), then each resolution is encoded in parallel using FFmpeg with hardware-accelerated codecs (NVENC for H.264/H.265, libvpx-vp9, libaom-av1). Each rendition is chunked into 2-10 second segments (aligned to keyframes for clean switching). The fleet auto-scales based on queue depth with a target of < 10 min processing time for 1080p content. Failed jobs are retried 3 times before moving to a dead-letter queue for manual investigation. The fleet uses spot/preemptible instances for 60-70% cost savings with on-demand fallback for priority jobs.
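
As a rough illustration of one node in that DAG, the sketch below builds an FFmpeg command line per rendition. It assumes software H.264 encoding with libx264 (not the NVENC path described above), a 30 fps source, and 4-second segments; the bitrates reuse the ladder from this article, and the function name and paths are hypothetical. The commands are only constructed, not executed.

```python
LADDER = [  # (name, height in pixels, video bitrate in bits/s)
    ("144p", 144, 400_000),
    ("240p", 240, 800_000),
    ("360p", 360, 1_500_000),
    ("480p", 480, 3_000_000),
    ("720p", 720, 5_000_000),
    ("1080p", 1080, 8_000_000),
    ("1440p", 1440, 12_000_000),
    ("4k", 2160, 20_000_000),
]


def hls_encode_command(source, out_dir, name, height, bitrate, fps=30, segment_sec=4):
    gop = fps * segment_sec  # one keyframe per segment so segments align across renditions
    return [
        "ffmpeg", "-i", source,
        "-vf", f"scale=-2:{height}",           # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", str(bitrate),
        "-g", str(gop), "-keyint_min", str(gop), "-sc_threshold", "0",
        "-c:a", "aac",
        "-f", "hls", "-hls_time", str(segment_sec), "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{name}/seg_%03d.ts",
        f"{out_dir}/{name}/playlist.m3u8",
    ]


commands = [hls_encode_command("source.mp4", "out", *rung) for rung in LADDER]
print(" ".join(commands[4]))
```

Fixing the GOP length and disabling scene-cut keyframes is what keeps segment boundaries identical across renditions, which the player relies on when it switches quality.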

Streaming / Origin Service

The Origin Service generates manifest files (HLS .m3u8 or DASH .mpd) that describe all available renditions and their segment URLs, and serves video segments on CDN cache misses. The manifest is the "control plane" of video playback — it tells the client player what quality levels exist, segment durations, and byte ranges. For DRM-protected content (Netflix, premium YouTube), the origin injects license acquisition URLs and content key IDs into the manifest. It also handles range requests for byte-range seeking, token authentication (signed URLs with expiry to prevent hotlinking), and geo-restriction enforcement. The origin is deployed across multiple regions with Anycast DNS for automatic failover.
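
Signed URLs with expiry are commonly built from an HMAC over the path and the expiry timestamp. The sketch below shows one way to do that; the query parameter names, the secret, and both function names are assumptions for illustration, not the scheme any particular CDN uses.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical; a real deployment would load this from a secret store


def sign_url(path, expires_at, secret=SECRET):
    payload = f"{path}?expires={expires_at}"
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}&sig={sig}"


def verify_url(url, now=None, secret=SECRET):
    now = time.time() if now is None else now
    base, _, sig = url.rpartition("&sig=")
    if not base:
        return False                        # no signature present
    _, _, expires = base.partition("?expires=")
    if not expires.isdigit() or int(expires) < now:
        return False                        # malformed or expired
    expected = hmac.new(secret, base.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)


url = sign_url("/segments/dQw4w9WgXcQ/720p/seg_000.ts", expires_at=2_000)
print(verify_url(url, now=1_000))                           # within the validity window
print(verify_url(url, now=3_000))                           # after expiry
print(verify_url(url.replace("720p", "4k"), now=1_000))     # path was altered
```

Because the expiry is part of the signed payload, a client cannot extend a link's lifetime by editing the query string.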

CDN (Content Delivery Network)

The CDN layer consists of 200+ Points of Presence (PoPs) globally, each equipped with 100+ TB of SSD cache and 1+ Tbps of egress capacity. When a viewer requests a segment, DNS routes them to the nearest PoP (< 20ms latency). The PoP checks its local cache: on a hit (95% of requests for popular content), it serves the segment directly from SSD with zero origin contact. On a miss, it fetches from the origin (or a mid-tier cache), serves the viewer, and caches the segment for future requests. Cache eviction uses a combination of LRU and popularity scoring — viral content stays cached longer. The CDN also handles pre-warming: when a high-profile video is published (e.g., a new music video from a major artist), the system proactively pushes segments to all PoPs before viewers request them, eliminating first-viewer latency.
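
One simple way to combine LRU with popularity is to look at the few least-recently-used entries and evict the one with the fewest hits. The class below is a toy sketch of that idea; the name, the `scan` parameter, and the raw hit counter are assumptions, and real CDN caches use more elaborate admission and scoring policies.

```python
from collections import OrderedDict
from itertools import islice


class PopularityAwareLRU:
    def __init__(self, capacity, scan=4):
        self.capacity = capacity
        self.scan = scan                  # how many LRU candidates to compare
        self._items = OrderedDict()       # key -> [value, hits], oldest first

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        entry[1] += 1
        self._items.move_to_end(key)      # mark as most recently used
        return entry[0]

    def put(self, key, value):
        if key in self._items:
            self._items[key][0] = value
            self._items.move_to_end(key)
            return
        if len(self._items) >= self.capacity:
            self._evict()
        self._items[key] = [value, 0]

    def _evict(self):
        candidates = islice(self._items.items(), self.scan)
        victim = min(candidates, key=lambda kv: kv[1][1])[0]
        del self._items[victim]


cache = PopularityAwareLRU(capacity=3, scan=2)
for key in ("viral", "b", "c"):
    cache.put(key, f"segment-{key}")
for _ in range(5):
    cache.get("viral")
cache.get("b")
cache.get("c")            # "viral" is now the least recently used entry
cache.put("d", "segment-d")
print(cache.get("viral"), cache.get("b"))
```

Plain LRU would have evicted "viral" here; the hit count keeps it resident and the colder entry "b" is dropped instead.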

Metadata Service

The Metadata Service owns all non-binary video data: title, description, tags, thumbnails, view counts, like/dislike ratios, upload timestamps, channel information, and video status (processing, live, unlisted, private). It is backed by a sharded PostgreSQL cluster (for example via Citus; Vitess, which YouTube built for the same purpose, shards MySQL rather than PostgreSQL) partitioned by video_id with read replicas for high-throughput queries. View counts use a write-behind pattern: increments are batched in Redis (flushed every 5 seconds) to avoid write amplification on the primary DB. The service exposes both internal gRPC APIs (for other services) and external REST APIs (for the client app). It also publishes change events to Kafka for downstream consumers (search indexing, analytics, recommendation training).
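
The write-behind pattern for view counts can be sketched as a buffer that aggregates increments and writes one delta per video on each flush. Here an in-memory Counter stands in for Redis and `flush_fn` stands in for the database write; both, along with the class name, are assumptions for illustration.

```python
import threading
from collections import Counter


class ViewCountBuffer:
    def __init__(self, flush_fn):
        self._pending = Counter()
        self._lock = threading.Lock()
        self._flush_fn = flush_fn          # e.g. issues one UPDATE per video

    def increment(self, video_id, n=1):
        with self._lock:
            self._pending[video_id] += n

    def flush(self):
        """Swap out the pending batch and write one aggregated delta per video."""
        with self._lock:
            batch, self._pending = self._pending, Counter()
        for video_id, delta in batch.items():
            self._flush_fn(video_id, delta)
        return len(batch)


db_writes = []
buffer = ViewCountBuffer(lambda video_id, delta: db_writes.append((video_id, delta)))
for _ in range(1_000):
    buffer.increment("vid_a")
buffer.increment("vid_b", 2)
written = buffer.flush()               # a timer would call this every 5 seconds
print(written, sorted(db_writes))
```

A thousand view events collapse into a single database write for that video, which is the write amplification the pattern is meant to avoid.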

Search Service

The Search Service provides full-text search across 800M+ videos using Elasticsearch clusters sharded by language/region. It indexes video titles, descriptions, tags, auto-generated captions (from speech-to-text), and channel names. Ranking combines text relevance (BM25) with engagement signals (CTR, watch time, freshness) and personalization (user's watch history, subscriptions). Autocomplete suggestions are served from a separate prefix-tree index with sub-50ms latency. The search pipeline includes spell correction, synonym expansion, and safe-search filtering. Index updates are near-real-time (< 30 seconds from metadata change to searchable) via a Kafka-to-Elasticsearch connector.

Recommendation Engine

The Recommendation Engine generates personalized video suggestions for the home feed, "Up Next" sidebar, and notifications. It combines multiple signals through a two-stage architecture: candidate generation (recall: retrieve 1000s of candidates from collaborative filtering, content-based similarity, and trending signals) and ranking (precision: score each candidate using a deep neural network trained on engagement features — watch time, likes, shares, not-interested signals). The model is trained on billions of watch events using distributed TensorFlow/PyTorch. Recommendations are pre-computed in batch (updated hourly per user) and stored in a low-latency serving layer (Redis/Memcached). Real-time signals (e.g., user just watched a cooking video) trigger lightweight re-ranking of the pre-computed set without waiting for full model inference.
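
The two-stage shape can be shown with a toy example: a co-watch count stands in for collaborative filtering in the recall stage, and a weighted sum stands in for the neural ranker. All names, features, and weights below are hypothetical.

```python
from collections import Counter


def generate_candidates(user, watch_history, limit=1000):
    """Recall stage: unseen videos watched by users who overlap with `user`."""
    seen = watch_history[user]
    scores = Counter()
    for other, videos in watch_history.items():
        overlap = len(seen & videos)
        if other == user or overlap == 0:
            continue
        for video in videos - seen:
            scores[video] += overlap       # more shared history, stronger vote
    return [video for video, _ in scores.most_common(limit)]


def rank(candidates, features, weights):
    """Precision stage: order candidates by a weighted sum of engagement features."""
    def score(video):
        return sum(weights[name] * value for name, value in features[video].items())
    return sorted(candidates, key=score, reverse=True)


history = {
    "yuki": {"v1", "v2"},
    "ana": {"v1", "v3"},
    "ben": {"v1", "v2", "v4"},
    "cho": {"v9"},
}
features = {
    "v3": {"watch_time": 0.9, "likes": 0.5},
    "v4": {"watch_time": 0.2, "likes": 0.1},
}
candidates = generate_candidates("yuki", history)
ranked = rank(candidates, features, {"watch_time": 1.0, "likes": 0.5})
print(candidates, ranked)
```

Recall favors "v4" because its viewer shares more history with Yuki, while ranking promotes "v3" on engagement, which is the division of labor between the two stages.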

Architecture Flow

Upload Path: Diego in Mexico City uploads a 4K cooking video

Diego opens the YouTube Studio app and selects a 4K video (12 minutes, 8.5 GB) of his abuela's mole recipe.

T+0ms — The client requests an upload session from the Upload Service. The service validates Diego's auth token, checks his upload quota, and returns a signed upload URL with a unique upload_id. It creates a session record in Redis: {upload_id, total_size: 8.5GB, chunks_received: [], status: "in_progress"}.

T+50ms — The client splits the file into 1,622 chunks (5 MB each, that is 5,242,880 bytes; the last chunk is smaller) and begins uploading 6 chunks in parallel (browser/app limit). Each chunk is a PUT request with a Content-Range header, for example Content-Range: bytes 0-5242879/8500000000 for the first chunk. The Upload Service writes each chunk directly to S3 using multipart upload and marks it complete in Redis.

T+3min — Diego's train enters a tunnel. Upload pauses at chunk 847/1622. No data is lost. When connectivity returns 45 seconds later, the client queries GET /upload/{upload_id}/status, receives the list of completed chunks, and resumes from chunk 848. No re-upload needed.

T+8min — All 1,622 chunks received. The Upload Service calls S3's CompleteMultipartUpload, which assembles the chunks into a single object. It publishes a message to the Transcode Queue (Kafka): {video_id: "dQw4w9WgXcQ", source: "s3://raw/diego/upload_id.mp4", priority: "standard"}. Diego sees "Processing..." in his dashboard.

T+8min 10s — A transcoder worker picks up the job. It downloads the source file, probes the video (4K, 30fps, H.264, AAC audio), and constructs the transcoding DAG: 8 resolution outputs, 2 codec variants (H.264 for compatibility + VP9 for Chrome/Android), producing 16 total renditions. GPU-accelerated encoding begins in parallel across the resolution ladder.

T+12min — Lower resolutions (144p through 720p) complete first. As each rendition finishes, its segments are uploaded to S3 and the video becomes playable at those qualities. Diego's video shows "Processing — SD quality available" in his dashboard.

T+35min — All 16 renditions complete. The transcoder writes the final manifest files (one .m3u8 master playlist pointing to all quality levels), updates the metadata DB (status: "ready"), triggers thumbnail extraction (ML-selected best frame), and publishes a "video ready" event. Diego receives a push notification: "Your video is ready to share!"

T+36min — The CDN pre-warm system detects Diego has 50K subscribers. It pushes the first 30 seconds of the video (all renditions) to the top 20 PoPs where his audience is concentrated (Mexico, US Southwest, Spain).

Watch Path: Yuki in Tokyo opens the app to watch Diego's video

Yuki sees Diego's video recommended on her home feed (collaborative filtering: she watches cooking content, other cooking viewers watched Diego's video). She taps to play.

T+0ms — Yuki's player sends GET /api/v1/videos/dQw4w9WgXcQ/manifest.m3u8 to the CDN edge PoP in Tokyo. The PoP has the manifest cached (other viewers in the region have already requested it). Cache hit. Response in 8ms.

T+8ms — The player parses the master manifest, which lists 8 quality levels (only the lowest and highest are shown here):

Text Only

#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=256x144
/segments/dQw4w9WgXcQ/144p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=20000000,RESOLUTION=3840x2160
/segments/dQw4w9WgXcQ/4k/playlist.m3u8

The ABR algorithm estimates Yuki's bandwidth (~50 Mbps fiber). It selects 1080p as the initial quality (conservative start — not 4K until buffer is established).

T+15ms — Player requests the first segment: GET /segments/dQw4w9WgXcQ/1080p/seg_0.ts. Tokyo PoP has it cached and starts responding within 40ms. The 2-second segment is about 2 MB, so on Yuki's ~50 Mbps link the full download takes roughly 320ms.

T+~400ms — First frame rendered. Total startup latency: about 400ms (well under the 2-second target). Player background-fetches segments 1, 2, 3 to build a 6-second buffer.

T+2s — Buffer established at 6 seconds. ABR algorithm observes stable 50 Mbps throughput and low RTT. Upgrades to 4K for segment 4 onward. Yuki doesn't notice the switch — the transition happens at a segment boundary during playback.

T+4min — Yuki's roommate starts a large download. Available bandwidth drops to 8 Mbps. The ABR algorithm detects the throughput drop (last segment took 3x longer than expected). Buffer is draining: 4 seconds remaining. Algorithm switches DOWN aggressively to 720p immediately (skip 1080p — when dropping, skip levels to protect buffer). Buffer stabilizes.

T+4min 30s — Roommate's download completes. Bandwidth recovers to 50 Mbps. ABR algorithm waits until buffer exceeds 15 seconds (conservative upswitch threshold), then gradually steps up one level per segment: 720p → 1080p → 1440p → 4K over the next 3 segments. This asymmetric switching (fast down, slow up) prevents oscillation.

T+12min — Video ends. Player reports watch-time analytics event to the Watch History service: {user: yuki, video: dQw4w9WgXcQ, watch_time: 12min, max_quality: 4K, rebuffers: 0}. The recommendation engine updates Yuki's profile in real-time for the next "Up Next" suggestion.

Failure & Recovery Scenarios

CDN PoP Failure

Scenario: The Tokyo PoP loses power. 200K concurrent viewers in Japan are affected.

Detection: Health checks fail within 5 seconds. The routing layer (BGP Anycast, or latency-based DNS such as Route 53) marks the PoP as unresponsive.

Recovery: DNS TTL is 30 seconds. Within 30-60 seconds, all Japanese viewers are re-routed to the Osaka PoP (next nearest, 15ms additional latency). Viewers experience a brief quality downgrade (ABR detects increased latency, steps down one level) but zero interruption. The Osaka PoP's cache has 70% hit rate for Japanese-popular content; the remaining 30% of requests go to origin for a few seconds until the cache warms. Total user-visible impact: a one-level quality dip lasting about 30 seconds.

Prevention: Multi-PoP redundancy in every metro. Tokyo has 3 PoPs across different facilities. Anycast automatically routes around failures at the network layer.

Transcoding Job Stuck / Failed

Scenario: A transcoder worker crashes 20 minutes into encoding a video and stops sending heartbeats.

Detection: The queue system uses visibility timeout — if a worker doesn't send a heartbeat within 5 minutes, the message becomes visible again for another worker to pick up.

Recovery: Another worker claims the job and restarts transcoding from scratch (transcoding is idempotent — same input always produces same output). If the job fails 3 times consecutively, it moves to a dead-letter queue (DLQ) for investigation. The creator sees "Processing failed — our team is investigating" and is notified when resolved.

Prevention: Workers checkpoint progress for long-running encodes (e.g., "720p complete, 1080p in progress"). On retry, the new worker can skip already-completed renditions and resume from the incomplete one. Workers run health-check processes that kill and restart stuck FFmpeg processes.
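
The visibility-timeout and dead-letter behavior described in this scenario can be modeled in a few lines. This is a single-process sketch with an explicit `now` argument so the timing is easy to follow; the class and method names are hypothetical, and managed queue services implement the same idea with their own APIs.

```python
class TranscodeJobQueue:
    def __init__(self, visibility_timeout=300, max_attempts=3):
        self.visibility_timeout = visibility_timeout   # seconds without a heartbeat
        self.max_attempts = max_attempts
        self.pending = []        # job ids waiting for a worker
        self.in_flight = {}      # job id -> deadline for the next heartbeat
        self.attempts = {}       # job id -> number of times it has been claimed
        self.dead_letter = []

    def enqueue(self, job_id):
        self.pending.append(job_id)
        self.attempts[job_id] = 0

    def claim(self, now):
        self.requeue_expired(now)
        if not self.pending:
            return None
        job_id = self.pending.pop(0)
        self.attempts[job_id] += 1
        self.in_flight[job_id] = now + self.visibility_timeout
        return job_id

    def heartbeat(self, job_id, now):
        if job_id in self.in_flight:
            self.in_flight[job_id] = now + self.visibility_timeout

    def complete(self, job_id):
        self.in_flight.pop(job_id, None)

    def requeue_expired(self, now):
        for job_id, deadline in list(self.in_flight.items()):
            if now >= deadline:
                del self.in_flight[job_id]
                if self.attempts[job_id] >= self.max_attempts:
                    self.dead_letter.append(job_id)    # give up, needs a human
                else:
                    self.pending.append(job_id)        # visible to other workers again


q = TranscodeJobQueue()
q.enqueue("job-1")
first = q.claim(now=0)          # worker A claims the job
q.heartbeat("job-1", now=200)   # still alive, deadline moves to 500
second = q.claim(now=500)       # worker A went silent, worker B takes over
third = q.claim(now=800)        # worker B also died, third attempt
fourth = q.claim(now=1100)      # third attempt timed out, job goes to the DLQ
print(first, second, third, fourth, q.dead_letter)
```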

Origin Overloaded (Cache Miss Storm)

Scenario: A viral video is published with 10M simultaneous viewers. CDN hasn't cached it yet — all requests hit origin.

Detection: Origin request rate spikes 100x above baseline. Latency increases from 50ms to 5 seconds. 503 errors begin.

Recovery (immediate): Request coalescing: when hundreds of CDN PoPs simultaneously request the same segment, the origin recognizes the duplicate requests and serves a single read from S3, fanning the response out to all waiting CDN PoPs. This reduces origin storage reads from N (number of PoPs) to 1 per unique segment.

Recovery (proactive): Shield / mid-tier cache — a small number of regional "shield" servers sit between CDN PoPs and origin. All PoPs in a region route misses through the shield first. The shield caches the response after the first fetch, so only 1 request per region reaches origin instead of 1 per PoP.

Prevention: Pre-warming for high-profile releases. When a creator with > 1M subscribers publishes, the system proactively pushes segments to all PoPs before the video goes live.
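
Request coalescing (sometimes called single-flight) can be sketched with a dictionary of in-flight futures: the first caller for a key performs the fetch and every concurrent caller waits on the same result. The class name and the `fetch_fn` callback are assumptions for illustration.

```python
import threading
from concurrent.futures import Future


class CoalescingFetcher:
    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn        # the expensive read, e.g. from object storage
        self._lock = threading.Lock()
        self._in_flight = {}             # segment key -> Future shared by all waiters

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if leader:
            try:
                future.set_result(self._fetch_fn(key))
            except Exception as exc:
                future.set_exception(exc)    # waiters see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
        return future.result()


origin_reads = []


def slow_origin_read(key):
    origin_reads.append(key)
    threading.Event().wait(0.3)              # simulate a slow storage read
    return f"bytes-of-{key}"


fetcher = CoalescingFetcher(slow_origin_read)
results = []
threads = [
    threading.Thread(target=lambda: results.append(fetcher.get("seg_0")))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "responses served from", len(origin_reads), "origin read")
```

A shield cache applies the same idea one tier up: it also remembers the result after the fetch completes, whereas this sketch only merges requests that overlap in time.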

Client Bandwidth Drops Mid-Stream

Scenario: A viewer on cellular data enters an elevator. Bandwidth drops from 10 Mbps to 0.5 Mbps.

Detection: The ABR algorithm monitors two signals: (1) throughput — measured download speed of the last N segments, and (2) buffer level — seconds of video already downloaded but not yet played.

Recovery: When buffer drops below the "panic threshold" (4 seconds), the algorithm switches to the lowest available quality (144p at 400 Kbps) immediately — no gradual stepping. This "emergency downswitch" sacrifices quality to prevent the cardinal sin of rebuffering. If bandwidth is truly zero (elevator), the buffer drains over 4 seconds and playback pauses with a "Poor connection" message. When bandwidth returns, playback resumes from the buffer and quality ramps back up conservatively.

Prevention: Predictive buffering — the player pre-fetches extra segments when it detects the user is on cellular (based on network type API). Some players implement "download ahead" during high-bandwidth periods to build a 30-60 second buffer cushion.

Data Model

Video Metadata (PostgreSQL — sharded by video_id)

SQL

CREATE TABLE videos (
    video_id        BIGINT PRIMARY KEY,          -- Snowflake ID (timestamp + worker + seq)
    channel_id      BIGINT NOT NULL,             -- FK to channels table
    title           VARCHAR(200) NOT NULL,        -- Full-text indexed in Elasticsearch
    description     TEXT,
    tags            TEXT[],                        -- Array of tags for search/filtering
    status          VARCHAR(20) NOT NULL,          -- 'uploading','transcoding','ready','failed','removed'
    privacy         VARCHAR(10) DEFAULT 'public',  -- 'public','unlisted','private'
    duration_ms     INTEGER,
    upload_size_bytes BIGINT,
    thumbnail_url   VARCHAR(500),
    view_count      BIGINT DEFAULT 0,             -- Denormalized, updated via write-behind from Redis
    like_count      INTEGER DEFAULT 0,
    dislike_count   INTEGER DEFAULT 0,
    comment_count   INTEGER DEFAULT 0,
    created_at      TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    published_at    TIMESTAMP WITH TIME ZONE,
    updated_at      TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Index for channel video listing
CREATE INDEX idx_videos_channel_published ON videos(channel_id, published_at DESC);
-- Index for trending/popular queries
CREATE INDEX idx_videos_view_count ON videos(view_count DESC) WHERE status = 'ready';
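
The video_id comment mentions a Snowflake ID (timestamp + worker + seq). A common layout, assumed here, packs 41 bits of milliseconds since a custom epoch, 10 bits of worker ID, and 12 bits of per-millisecond sequence into one 63-bit integer that fits a BIGINT. The class name is hypothetical and the bit widths are an assumption, since the schema does not specify them.

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657   # Twitter's Snowflake epoch; any fixed past instant works


class SnowflakeGenerator:
    def __init__(self, worker_id):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                now = self._last_ms              # clock moved backwards: reuse last timestamp
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF
                if self._seq == 0:               # 4096 IDs used up in this millisecond
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._seq = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self._seq


gen = SnowflakeGenerator(worker_id=7)
ids = [gen.next_id() for _ in range(10_000)]
print(ids[0], ids[-1])
```

Because the timestamp occupies the high bits, IDs from one worker sort by creation time, which keeps recent videos close together in the primary-key index.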

HLS Manifest / Chunk Manifest (stored as files in S3, structure shown here)

Text Only

#EXTM3U
#EXT-X-VERSION:4
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=256x144,CODECS="avc1.42c00c,mp4a.40.2"
144p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=426x240,CODECS="avc1.4d4015,mp4a.40.2"
240p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=640x360,CODECS="avc1.4d401e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=854x480,CODECS="avc1.4d401f,mp4a.40.2"
480p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1280x720,CODECS="avc1.4d4020,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=12000000,RESOLUTION=2560x1440,CODECS="avc1.640032,mp4a.40.2"
1440p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=20000000,RESOLUTION=3840x2160,CODECS="avc1.640033,mp4a.40.2"
4k/playlist.m3u8

Per-resolution playlist (e.g., 720p/playlist.m3u8):

Text Only

#EXTM3U
#EXT-X-VERSION:4
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.000,
seg_000.ts
#EXTINF:4.000,
seg_001.ts
#EXTINF:4.000,
seg_002.ts
...
#EXTINF:2.500,
seg_179.ts
#EXT-X-ENDLIST
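
A per-resolution playlist like the one above can be generated from just the video duration and the segment length. The function below is a sketch (its name is hypothetical) that assumes fixed-length segments with a shorter final one; 718.5 seconds is the duration implied by 179 four-second segments plus a 2.5-second tail.

```python
import math


def media_playlist(duration_sec, segment_sec=4):
    count = math.ceil(duration_sec / segment_sec)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:4",
        f"#EXT-X-TARGETDURATION:{segment_sec}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for i in range(count):
        # Every segment is full length except possibly the last one
        length = min(segment_sec, duration_sec - i * segment_sec)
        lines.append(f"#EXTINF:{length:.3f},")
        lines.append(f"seg_{i:03d}.ts")
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines)


playlist = media_playlist(718.5)
print("\n".join(playlist.splitlines()[:6]))
print("...")
print("\n".join(playlist.splitlines()[-3:]))
```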

User Watch History (Cassandra — partitioned by user_id)

SQL

CREATE TABLE watch_history (
    user_id           bigint,
    watched_at        timestamp,
    video_id          bigint,
    watch_duration_ms int,             -- How long they actually watched
    video_duration_ms int,             -- Total video length (for % calculation)
    max_quality       text,            -- Highest quality reached
    device_type       text,            -- mobile, desktop, tv, tablet
    rebuffer_count    int,
    PRIMARY KEY ((user_id), watched_at)
) WITH CLUSTERING ORDER BY (watched_at DESC)
  AND default_time_to_live = 63072000;  -- 2-year TTL

-- Table for the "continue watching" feature.
-- CQL cannot compare two columns in a WHERE clause, so a materialized view
-- cannot filter on watch_duration_ms < 0.9 * video_duration_ms. The application
-- writes a row here instead whenever a view ends before the 90% mark.
CREATE TABLE incomplete_watches (
    user_id           bigint,
    watched_at        timestamp,
    video_id          bigint,
    watch_duration_ms int,
    video_duration_ms int,
    PRIMARY KEY ((user_id), watched_at)
) WITH CLUSTERING ORDER BY (watched_at DESC);

Algorithms Under the Hood

Adaptive Bitrate Selection (Hybrid: Buffer-Based + Throughput-Based)

Python

def select_next_quality(state):
    """
    Hybrid ABR algorithm combining buffer-level and throughput estimation.
    Used by the client player to decide quality for the next segment.

    Inputs:
        state.buffer_level_sec    - seconds of video in buffer
        state.throughput_history  - list of measured throughputs (last 5 segments)
        state.current_quality    - index into QUALITY_LEVELS
        state.segment_duration   - typical segment length in seconds

    Returns: index into QUALITY_LEVELS for next segment download
    """
    QUALITY_LEVELS = [
        {"name": "144p",  "bitrate": 400_000},
        {"name": "240p",  "bitrate": 800_000},
        {"name": "360p",  "bitrate": 1_500_000},
        {"name": "480p",  "bitrate": 3_000_000},
        {"name": "720p",  "bitrate": 5_000_000},
        {"name": "1080p", "bitrate": 8_000_000},
        {"name": "1440p", "bitrate": 12_000_000},
        {"name": "4K",    "bitrate": 20_000_000},
    ]

    BUFFER_LOW = 4       # seconds - panic zone, switch down aggressively
    BUFFER_SAFE = 10     # seconds - stable, allow current quality
    BUFFER_HIGH = 20     # seconds - surplus, consider switching up
    SAFETY_FACTOR = 0.7  # only use 70% of estimated throughput

    # Step 1: Estimate available throughput (harmonic mean for stability)
    history = state.throughput_history
    if len(history) >= 3 and min(history) > 0:
        # Harmonic mean is more conservative than arithmetic mean
        # (penalizes low samples more), which is what we want for video
        harmonic_mean = len(history) / sum(1.0 / t for t in history)
        estimated_throughput = harmonic_mean * SAFETY_FACTOR
    elif len(history) >= 3:
        # A zero sample means a segment stalled; avoid dividing by zero
        # and assume no usable throughput
        estimated_throughput = 0
    else:
        # Not enough samples yet - be conservative
        estimated_throughput = QUALITY_LEVELS[1]["bitrate"]  # Start at 240p

    # Step 2: Find max quality our throughput can support
    throughput_max_quality = 0
    for i, level in enumerate(QUALITY_LEVELS):
        if level["bitrate"] <= estimated_throughput:
            throughput_max_quality = i

    # Step 3: Buffer-based decision (overrides throughput in extreme cases)
    current = state.current_quality

    if state.buffer_level_sec < BUFFER_LOW:
        # PANIC: Buffer critically low. Drop to lowest immediately.
        # Skip levels — don't step down gradually, that's too slow.
        return 0  # 144p - survival mode

    elif state.buffer_level_sec < BUFFER_SAFE:
        # LOW: Don't go up. Drop straight to whatever throughput supports,
        # which may skip several levels at once.
        return min(current, throughput_max_quality)

    elif state.buffer_level_sec < BUFFER_HIGH:
        # SAFE: Follow throughput estimate, but don't go above current+1
        # (conservative upswitch to avoid oscillation)
        return min(throughput_max_quality, current + 1)

    else:
        # HIGH: Buffer is healthy. Throughput still drives quality, and the
        # same +1 level per segment cap applies to avoid jarring jumps.
        return min(throughput_max_quality, current + 1)

Chunked Upload with Resumability

Python

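import hashlib
import math
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# http_post, http_put, detect_mime_type, compute_sha256, report_progress,
# NetworkError and UploadError are application helpers assumed to exist elsewhere.


def chunked_upload(file_path, upload_url, chunk_size=5 * 1024 * 1024):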
    """
    Client-side chunked upload with resumability.
    Implements a simplified tus-protocol-like flow.

    On network failure: call resume_upload() which queries server
    for completed chunks and restarts from the last gap.
    """
    file_size = os.path.getsize(file_path)
    total_chunks = math.ceil(file_size / chunk_size)

    # Step 1: Create upload session on server
    session = http_post(f"{upload_url}/create", body={
        "filename": os.path.basename(file_path),
        "file_size": file_size,
        "chunk_size": chunk_size,
        "content_type": detect_mime_type(file_path),
        "checksum_algo": "sha256"
    })
    upload_id = session["upload_id"]

    # Step 2: Upload chunks with parallel workers
    MAX_PARALLEL = 6  # Browser/HTTP2 connection limit
    completed_chunks = set()

    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as executor:
        futures = {}

        for chunk_index in range(total_chunks):
            offset = chunk_index * chunk_size
            length = min(chunk_size, file_size - offset)

            future = executor.submit(
                upload_single_chunk,
                upload_id, file_path, chunk_index, offset, length
            )
            futures[future] = chunk_index

        for future in as_completed(futures):
            chunk_idx = futures[future]
            try:
                future.result()
                completed_chunks.add(chunk_idx)
                report_progress(len(completed_chunks) / total_chunks)
            except NetworkError:
                # Don't fail entire upload — we'll retry missing chunks
                pass

    # Step 3: Retry any failed chunks (with exponential backoff)
    missing = set(range(total_chunks)) - completed_chunks
    for attempt in range(3):
        if not missing:
            break
        time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
        for chunk_index in list(missing):
            try:
                offset = chunk_index * chunk_size
                length = min(chunk_size, file_size - offset)
                upload_single_chunk(upload_id, file_path, chunk_index, offset, length)
                missing.discard(chunk_index)
            except NetworkError:
                continue

    if missing:
        raise UploadError(f"Failed to upload chunks: {missing}")

    # Step 4: Finalize — server assembles chunks
    http_post(f"{upload_url}/{upload_id}/complete", body={
        "total_chunks": total_chunks,
        "full_file_sha256": compute_sha256(file_path)
    })

    return upload_id


def upload_single_chunk(upload_id, file_path, chunk_index, offset, length):
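    """Upload a single chunk with integrity verification."""
    # UPLOAD_URL is assumed to be a module-level constant holding the same
    # base URL that chunked_upload() receives as upload_url.
    # The total file size is needed for the Content-Range header below.
    file_size = os.path.getsize(file_path)

    with open(file_path, 'rb') as f:
        f.seek(offset)
        chunk_data = f.read(length)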

    chunk_sha256 = hashlib.sha256(chunk_data).hexdigest()

    response = http_put(
        f"{UPLOAD_URL}/{upload_id}/chunks/{chunk_index}",
        headers={
            "Content-Range": f"bytes {offset}-{offset + length - 1}/{file_size}",
            "X-Chunk-SHA256": chunk_sha256,
        },
        body=chunk_data,
        timeout=30  # Per-chunk timeout, not entire file
    )

    if response.status != 200:
        raise NetworkError(f"Chunk {chunk_index} upload failed: {response.status}")


def resume_upload(upload_id):
    """Query server for upload status and return missing chunk indices."""
    status = http_get(f"{UPLOAD_URL}/{upload_id}/status")
    # Server returns: {"completed_chunks": [0,1,2,5,6,7], "total": 1700}
    all_chunks = set(range(status["total"]))
    completed = set(status["completed_chunks"])
    return all_chunks - completed  # Chunks that need re-uploading
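
The client code above relies on a server that records which chunks have arrived, rejects corrupted ones, and can report the gaps. A minimal in-memory sketch of that bookkeeping follows. `UploadSession` and its method names are hypothetical, and the in-memory dict stands in for whatever the real upload service uses (typically object storage plus a metadata store).

```python
import hashlib
import math


class UploadSession:
    """Hypothetical server-side record of one chunked upload (in-memory only)."""

    def __init__(self, file_size, chunk_size):
        self.total_chunks = math.ceil(file_size / chunk_size)
        self.chunks = {}  # chunk_index -> bytes received so far

    def put_chunk(self, chunk_index, data, claimed_sha256):
        """Store a chunk only if its checksum matches. Re-sending a chunk is harmless."""
        if not 0 <= chunk_index < self.total_chunks:
            raise ValueError(f"chunk index {chunk_index} out of range")
        if hashlib.sha256(data).hexdigest() != claimed_sha256:
            return False  # corrupted in transit; the client should retry
        self.chunks[chunk_index] = data
        return True

    def status(self):
        """Same shape as the status response that resume_upload() reads."""
        return {"completed_chunks": sorted(self.chunks), "total": self.total_chunks}

    def missing(self):
        return sorted(set(range(self.total_chunks)) - set(self.chunks))

    def assemble(self, expected_sha256):
        """Concatenate chunks in index order and verify the whole-file checksum."""
        if self.missing():
            raise ValueError(f"missing chunks: {self.missing()}")
        blob = b"".join(self.chunks[i] for i in range(self.total_chunks))
        if hashlib.sha256(blob).hexdigest() != expected_sha256:
            raise ValueError("full-file checksum mismatch")
        return blob


# Usage: a 30-byte "file" split into 8-byte chunks, with chunk 2 lost in transit.
payload = b"0123456789" * 3
session = UploadSession(file_size=len(payload), chunk_size=8)
for i in (0, 1, 3):
    part = payload[i * 8:(i + 1) * 8]
    session.put_chunk(i, part, hashlib.sha256(part).hexdigest())
gaps = session.missing()  # the indices a resuming client must re-upload
```

Because chunks are keyed by index, a retried chunk simply overwrites itself, which is what makes the client's blind retry loop safe.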

Transcoding DAG (Resolution Ladder with Per-Title Encoding)

Python

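def build_transcode_dag(source_video_path, video_id, complexity_score=0.5):
    """
    Build a transcoding Directed Acyclic Graph (DAG) for a video.

    video_id names the output location in storage. complexity_score (0.0-1.0)
    is a build-time estimate used to size bitrates; a real scheduler would
    substitute the value that analyze_complexity produces at run time.
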
    The DAG optimizes for:
    1. Parallel encoding of independent resolutions
    2. Per-title encoding: adjust bitrate based on content complexity
    3. Fast availability: lower resolutions complete first → early playback
    4. Codec variants: H.264 (compatibility) + VP9 (efficiency) + AV1 (future)

    Returns: DAG of tasks that the scheduler executes on GPU workers.
    """

    RESOLUTION_LADDER = [
        {"name": "144p",  "width": 256,  "height": 144,  "base_bitrate_kbps": 400},
        {"name": "240p",  "width": 426,  "height": 240,  "base_bitrate_kbps": 800},
        {"name": "360p",  "width": 640,  "height": 360,  "base_bitrate_kbps": 1500},
        {"name": "480p",  "width": 854,  "height": 480,  "base_bitrate_kbps": 3000},
        {"name": "720p",  "width": 1280, "height": 720,  "base_bitrate_kbps": 5000},
        {"name": "1080p", "width": 1920, "height": 1080, "base_bitrate_kbps": 8000},
        {"name": "1440p", "width": 2560, "height": 1440, "base_bitrate_kbps": 12000},
        {"name": "4K",    "width": 3840, "height": 2160, "base_bitrate_kbps": 20000},
    ]

    dag = TranscodeDAG()

    # ──── Stage 1: Source Analysis (runs first, all else depends on it) ────
    analyze_task = dag.add_task(
        name="analyze_source",
        command=f"ffprobe -v quiet -print_format json -show_streams {source_video_path}",
        outputs=["source_info.json"]
    )

    # Per-title encoding: analyze content complexity to optimize bitrates
    complexity_task = dag.add_task(
        name="analyze_complexity",
        command=f"""
            ffmpeg -i {source_video_path} -vf "select=not(mod(n\\,30))" 
            -vsync vfr -q:v 2 -f null - 2>&1 | grep 'frame=' 
        """,
        depends_on=[analyze_task],
        outputs=["complexity_score"]
    )
    # complexity_score: 0.0 (static slideshow) to 1.0 (fast action sports)
    # Low complexity → reduce bitrate 30% (saves storage, same quality)
    # High complexity → increase bitrate 20% (maintains quality in motion)

    # ──── Stage 2: Determine output resolutions ────
    # Don't upscale: if source is 720p, only produce 144p-720p
    source_info = probe_video(source_video_path)
    source_height = source_info["height"]

    applicable_resolutions = [
        r for r in RESOLUTION_LADDER if r["height"] <= source_height
    ]

    # ──── Stage 3: Parallel encoding per resolution ────
    segment_tasks = []
    for resolution in applicable_resolutions:
        # Adjust bitrate based on content complexity
        adjusted_bitrate = adjust_bitrate_for_complexity(
            resolution["base_bitrate_kbps"], 
            complexity_score
        )

        # H.264 encode (universal compatibility)
        h264_task = dag.add_task(
            name=f"encode_h264_{resolution['name']}",
            command=f"""
                ffmpeg -i {source_video_path} 
                -vf scale={resolution['width']}:{resolution['height']}
                -c:v libx264 -preset medium -b:v {adjusted_bitrate}k
                -c:a aac -b:a 128k
                -f hls -hls_time 4 -hls_segment_type mpegts
                -hls_playlist_type vod
                output/{resolution['name']}/h264/playlist.m3u8
            """,
            depends_on=[complexity_task],
            gpu_required=True,  # Note: libx264 encodes on CPU; a GPU worker would use a hardware encoder such as h264_nvenc
            priority=resolution["height"],  # Lower res = higher priority (faster availability)
        )
        segment_tasks.append(h264_task)

        # VP9 encode (better compression for Chrome/Android)
        if resolution["height"] >= 360:  # VP9 only for 360p+
            vp9_task = dag.add_task(
                name=f"encode_vp9_{resolution['name']}",
                command=f"""
                    ffmpeg -i {source_video_path}
                    -vf scale={resolution['width']}:{resolution['height']}
                    -c:v libvpx-vp9 -b:v {int(adjusted_bitrate * 0.7)}k
                    -c:a libopus -b:a 128k
                    -f dash
                    output/{resolution['name']}/vp9/manifest.mpd
                """,
                depends_on=[complexity_task],
                gpu_required=True,
                priority=resolution["height"] + 1000,  # Lower priority than H.264
            )
            segment_tasks.append(vp9_task)

    # ──── Stage 4: Generate master manifest (after all encodes complete) ────
    manifest_task = dag.add_task(
        name="generate_master_manifest",
        command="generate_hls_master_playlist(output/)",
        depends_on=segment_tasks,  # Waits for ALL encodes
        outputs=["master.m3u8", "master.mpd"]
    )

    # ──── Stage 5: Upload to storage + notify ────
    upload_task = dag.add_task(
        name="upload_to_s3",
        command=f"aws s3 sync output/ s3://encoded-segments/{video_id}/",
        depends_on=[manifest_task],
    )

    notify_task = dag.add_task(
        name="notify_completion",
        command=f"publish_event('video.ready', {{video_id: '{video_id}'}})",
        depends_on=[upload_task],
    )

    return dag


def adjust_bitrate_for_complexity(base_bitrate_kbps, complexity_score):
    """
    Per-title encoding: adjust bitrate based on content complexity.

    - Static slideshow (score 0.0): 0.7x, a 30% reduction, since there is almost no motion to encode
    - Typical content (score 0.6): 1.0x, the base bitrate
    - Fast action sports (score 1.0): 1.2x, a 20% increase to hold quality in motion
    """
    multiplier = 0.7 + (complexity_score * 0.5)  # Range: 0.7x to 1.2x
    return int(base_bitrate_kbps * multiplier)
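
To make the multiplier concrete, the sketch below applies the same linear formula to three rungs of the ladder at three complexity scores. The scores and the `ladder` subset are picked for illustration only, and the function is restated so the snippet runs on its own.

```python
def adjust_bitrate_for_complexity(base_bitrate_kbps, complexity_score):
    # Same linear mapping as above: score 0.0 -> 0.7x, score 1.0 -> 1.2x.
    multiplier = 0.7 + (complexity_score * 0.5)
    return int(base_bitrate_kbps * multiplier)


# Illustrative subset of the resolution ladder (base bitrates in kbps).
ladder = {"360p": 1500, "720p": 5000, "1080p": 8000}

# Illustrative scores: the two endpoints plus the point where the multiplier is 1.0.
scores = {"static": 0.0, "neutral": 0.6, "action": 1.0}

table = {
    name: {label: adjust_bitrate_for_complexity(base, score)
           for label, score in scores.items()}
    for name, base in ladder.items()
}

for name, row in table.items():
    print(name, row)
```

At 1080p the static title lands near 5600 kbps and the action title near 9600 kbps, so the same rung can differ by roughly 4 Mbps depending on content.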

Scaling Considerations

Dimension	Challenge	Solution
Storage growth	4.2 PB/day new content, 100+ PB total	Tiered storage: hot (SSD, < 7 days), warm (HDD, < 1 year), cold (Glacier, archive). Deduplication for re-uploads.
CDN bandwidth	170 Tbps peak, growing 25% YoY	Multi-CDN strategy (CloudFront + Akamai + Fastly). Build own CDN for top markets (Google Global Cache).
Transcoding throughput	500 hours uploaded/minute, 8 renditions each	Auto-scaling GPU fleet on spot instances. Priority queues (premium creators first). Regional processing to minimize data movement.
Metadata queries	100K+ QPS for video lookups	Vitess (sharded MySQL) + aggressive caching (Redis, 99% hit rate for popular videos). Read replicas per region.
Search indexing	6.2M new videos/day, 800M total docs	Elasticsearch cluster per region, sharded by language. Near-real-time indexing via Kafka connector.
Recommendation serving	500M DAU, each needs personalized feed	Pre-compute top-1000 recommendations per user (batch). Store in Redis. Real-time re-ranking only for engaged users.
Watch history writes	10B events/day	Cassandra with wide rows (partition by user_id, cluster by timestamp). TTL for auto-cleanup.
Global consistency	Creator uploads in Brazil, viewer watches in Japan	Eventual consistency for non-critical (view counts, recommendations). Strong consistency for critical (video availability — wait for CDN pre-warm before showing "ready").
Cost optimization	$10M+/month in compute + storage + bandwidth	Spot instances for transcoding (70% savings). Reserved instances for origin. Content-aware encoding (lower bitrate for simple content). AV1 codec (30% smaller files at same quality).
Cold start (new PoP)	New CDN PoP has empty cache	Pre-warm with top 10K videos for the region. ML model predicts which content will be popular in next 24 hours.
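
The tiering rule in the storage row can be written as a small age-based policy. This is a sketch using the thresholds stated in the table (7 days, 1 year). The function name and the choice to key purely on upload age are assumptions; a production policy would likely also weigh recent view counts.

```python
from datetime import datetime, timedelta, timezone

HOT_MAX_AGE = timedelta(days=7)     # SSD tier
WARM_MAX_AGE = timedelta(days=365)  # HDD tier


def storage_tier(uploaded_at, now=None):
    """Return 'hot', 'warm', or 'cold' based only on how old the upload is."""
    now = now or datetime.now(timezone.utc)
    age = now - uploaded_at
    if age < HOT_MAX_AGE:
        return "hot"
    if age < WARM_MAX_AGE:
        return "warm"
    return "cold"


# Usage with a fixed clock so the result is reproducible.
clock = datetime(2024, 6, 1, tzinfo=timezone.utc)
tiers = [
    storage_tier(clock - timedelta(days=2), now=clock),
    storage_tier(clock - timedelta(days=90), now=clock),
    storage_tier(clock - timedelta(days=800), now=clock),
]
```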

Quick Recall

Question	Answer
Why adaptive bitrate (ABR)?	Network conditions fluctuate. ABR lets the client switch quality per segment (every 2-10s) based on measured throughput + buffer level. Minimizes rebuffering.
HLS vs DASH?	HLS = Apple (HTTP Live Streaming, `.m3u8` manifests). DASH = Open standard (`.mpd` manifests). Functionally equivalent. Most platforms support both.
Why async transcoding?	Encoding a 10-min 4K video takes 30+ min of GPU time. Can't block upload response. Queue decouples ingestion from processing, enables retry, priority, and independent scaling.
Why CDN not origin?	170 Tbps peak. No single data center can serve this. CDN distributes load across 200+ PoPs. 95% cache hit rate. < 20ms to nearest edge. 3-5x cheaper than origin delivery.
Buffer-based vs throughput-based ABR?	Throughput: measures download speed of recent segments. Buffer: tracks seconds of video cached locally. Hybrid (both) is best — throughput for steady-state, buffer for emergency switching.
Why chunked upload?	Files up to 256 GB. A single HTTP request would time out. Chunks (5 MB) enable: resumability, parallel upload, per-chunk retry, progress tracking, low server memory.
What's a manifest file?	The "table of contents" for adaptive streaming. Lists all quality levels and their segment URLs. Client parses manifest to know what qualities exist and where to fetch segments.
Per-title encoding?	Adjust bitrate based on content complexity. A talking-head video needs fewer bits than a fast-action sports clip at the same quality. Saves 20-30% storage.
CDN cache miss storm?	Request coalescing (dedupe concurrent requests for same segment). Shield/mid-tier cache (regional layer between PoPs and origin). Pre-warming for viral content.
Why Cassandra for watch history?	Write-heavy (10B events/day), time-series access pattern, TTL for auto-cleanup, partition by user_id for efficient "my history" queries. AP system — availability over consistency.
How to handle live streaming?	Same architecture but segments are generated in real-time (low-latency HLS/DASH). No transcoding queue — direct encode + push to CDN. Typical latency: 3-10 seconds behind live.
How to reduce storage costs?	Tiered storage (hot/warm/cold), per-title encoding (lower bitrates where possible), AV1 codec (30% smaller), deduplication, delete low-view content from hot tier after 90 days.
What about DRM?	Widevine (Google), FairPlay (Apple), PlayReady (Microsoft). Encryption applied during transcoding. License server validates entitlement before issuing decryption keys to client.
How to measure quality of experience?	Track: video start time (< 2s target), rebuffer rate (< 0.5%), average bitrate delivered, quality switches per session, and abandonment rate. All reported via client-side analytics.
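
Request coalescing, from the cache miss storm row, is the least obvious technique in the table, so here is a minimal single-process sketch. `CoalescingCache` and the `fetch_from_origin` callback are hypothetical names. A real edge server would do this inside its proxy layer and add expiry, size limits, and timeouts.

```python
import threading
import time
from concurrent.futures import Future


class CoalescingCache:
    """Ensures concurrent misses for the same key trigger only one origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._cache = {}     # key -> cached value
        self._inflight = {}  # key -> Future shared by every waiting request

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            future = self._inflight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._inflight[key] = future
        if not leader:
            return future.result()  # followers block until the leader finishes
        try:
            value = self._fetch(key)
        except Exception as exc:
            with self._lock:
                del self._inflight[key]
            future.set_exception(exc)  # followers see the same failure
            raise
        with self._lock:
            self._cache[key] = value
            del self._inflight[key]
        future.set_result(value)
        return value


# Usage: 20 simultaneous requests for one uncached segment.
origin_calls = []

def slow_origin(key):
    origin_calls.append(key)
    time.sleep(0.05)  # simulate origin latency
    return f"bytes-for-{key}"

cache = CoalescingCache(slow_origin)
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get("v1/720p/seg42.ts")))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The first request becomes the leader and fetches from origin; every other request either waits on the shared future or, if it arrives later, reads the cached value.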
