Batch vs Stream Processing
Real Incident: LinkedIn Real-Time Activity Processing
In 2012, LinkedIn's batch-only pipeline processed user activity (profile views, connection requests, content engagement) overnight. Users posted content and waited 8-24 hours to see engagement analytics. Feed recommendations were a day stale. They built Apache Kafka and Samza to create a unified real-time stream processing platform. The result: notifications delivered in seconds, feed personalization updated in real-time, and "Who Viewed Your Profile" went from daily email to instant push. The shift from batch to stream reduced their data freshness from 24 hours to under 10 seconds — a 86,400x improvement.
Why This Comes Up in Interviews
Nearly every large-scale system generates data that needs processing — analytics, recommendations, fraud detection, ETL. Interviewers want to hear:
- When batch is still the right answer (it often is)
- The Lambda vs Kappa architecture debate
- Exactly-once semantics and why it's hard
- Windowing strategies for stream aggregation
- Concrete tool choices and their trade-offs
Lambda Architecture
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph TD
SOURCE[Data Source] --> BATCH[Batch Layer - Spark]
SOURCE --> SPEED[Speed Layer - Flink/Kafka Streams]
BATCH --> BATCHVIEW[(Batch Views - Complete, Accurate)]
SPEED --> RTVIEW[(Real-time Views - Fast, Approximate)]
BATCHVIEW --> SERVING[Serving Layer - Merge Results]
RTVIEW --> SERVING
SERVING --> QUERY[Query]
style BATCH fill:#3498db,color:#fff
style SPEED fill:#e74c3c,color:#fff
style SERVING fill:#27ae60,color:#fff
style SOURCE fill:#f39c12,color:#fff | Layer | Purpose | Tool Examples | Latency | Accuracy |
|---|---|---|---|---|
| Batch | Reprocess all historical data periodically | Spark, Hadoop MapReduce | Hours | Exact |
| Speed | Process real-time data for low-latency results | Flink, Kafka Streams, Storm | Seconds | Approximate |
| Serving | Merge batch + speed results for queries | Druid, Cassandra, Elasticsearch | Milliseconds | Combined |
The promise: Best of both worlds — accurate batch results AND real-time approximations.
The reality: You maintain TWO codebases computing the same thing differently. Bugs diverge. Testing doubles. Operational complexity explodes.
Kappa Architecture
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph TD
SOURCE[Data Source] --> LOG[(Immutable Event Log - Kafka)]
LOG --> STREAM[Stream Processor - Flink]
STREAM --> VIEW[(Materialized Views)]
VIEW --> QUERY[Query]
LOG -->|Replay from offset 0| REPROCESS[Reprocess: Deploy new version, replay all events]
REPROCESS --> NEWVIEW[(New Views - Replace old)]
style LOG fill:#f39c12,color:#fff
style STREAM fill:#3498db,color:#fff
style VIEW fill:#27ae60,color:#fff
style REPROCESS fill:#9b59b6,color:#fff Kappa simplification: One processing path (streaming) handles both real-time AND historical reprocessing by replaying the log.
| Aspect | Lambda | Kappa |
|---|---|---|
| Codebases | Two (batch + stream) | One (stream only) |
| Reprocessing | Batch layer recomputes periodically | Replay stream from beginning |
| Complexity | High (maintain two systems) | Lower (one system, one logic) |
| Correctness | Batch corrects speed layer errors | Must get streaming right (exactly-once) |
| Tool requirement | Any batch + any stream tool | Stream tool must handle full replay (Kafka + Flink) |
| Best for | Existing batch infra, ML training pipelines | New greenfield systems, event-sourced architectures |
Processing Paradigm Comparison
| Dimension | Batch | Micro-batch | Stream |
|---|---|---|---|
| Latency | Minutes to hours | Seconds (1-10s) | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | High | High (with backpressure) |
| Semantics | Exact (reprocessable) | At-least-once (easy), exactly-once (harder) | Exactly-once (hardest) |
| State | Full dataset available | Windowed | Event-at-a-time + managed state |
| Fault tolerance | Restart job from beginning | Restart micro-batch | Checkpoint + replay |
| Example tools | Spark (batch), Hive, Presto | Spark Structured Streaming | Flink, Kafka Streams |
| Use case | Daily reports, ML training, ETL | Near-real-time dashboards | Fraud detection, alerting |
Tool Landscape
| Tool | Type | Strengths | Weaknesses | Used By |
|---|---|---|---|---|
| Apache Spark | Batch + micro-batch | Unified API, SQL support, ML libraries | True streaming is micro-batch (seconds latency) | Netflix, Uber, Airbnb |
| Apache Flink | True stream + batch | Lowest latency, exactly-once, event-time processing | Steeper learning curve, smaller ecosystem | Alibaba, Uber, Netflix |
| Kafka Streams | Lightweight streaming | No separate cluster, library (runs in your app) | Limited to Kafka input/output | LinkedIn, Walmart |
| Apache Beam | Abstraction layer | Write once, run on Spark/Flink/Dataflow | Another abstraction = another thing to debug | Google (Dataflow) |
| AWS Kinesis | Managed streaming | Zero ops, auto-scaling | Vendor lock-in, limited vs Kafka | AWS-native shops |
Exactly-Once Semantics — The Hard Problem
| Guarantee | What It Means | Complexity | Example Failure |
|---|---|---|---|
| At-most-once | Fire and forget. May lose messages. | Lowest | UDP, async fire-and-forget |
| At-least-once | Retry until ack. May duplicate. | Medium | Consumer crashes after processing but before committing offset |
| Exactly-once | Each message processed exactly once in output. | Highest | Consumer processes + commits atomically |
Why exactly-once is hard: Network partitions mean you can never be 100% sure if a message was processed. The receiver might have processed it but the ack was lost.
How Flink achieves exactly-once:
| Mechanism | How |
|---|---|
| Checkpointing | Periodically snapshot operator state + Kafka offsets atomically |
| Barriers | Inject checkpoint barriers into data stream; operators snapshot when barrier arrives |
| Two-phase commit | For sinks (e.g., Kafka producer): pre-commit on checkpoint, commit on checkpoint-complete |
| Recovery | On failure, restore latest checkpoint, replay from checkpointed Kafka offset |
Kafka exactly-once (idempotent producer + transactions):
- Producer ID + sequence number = broker deduplicates retries
- Transactions: atomically write to multiple partitions + commit consumer offsets
- End-to-end: consume → process → produce in a single atomic transaction
Windowing Strategies
| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping (every 5 min) | Hourly aggregates, billing periods |
| Sliding | Fixed-size, overlapping (5 min window, slides every 1 min) | Moving averages, trend detection |
| Session | Dynamic, gap-based (closes after 30 min inactivity) | User session analytics, click streams |
| Global | Single window per key (entire stream) | Running totals, lifetime aggregates |
Example — "Count clicks per user per 5-minute window":
| Event Time | User | Tumbling [0-5) | Tumbling [5-10) | Sliding (5min, slide 1min) |
|---|---|---|---|---|
| 00:01 | Alice | Count: 1 | — | Window [00:00-00:05]: 1 |
| 00:03 | Alice | Count: 2 | — | Window [00:00-00:05]: 2 |
| 00:07 | Alice | — | Count: 1 | Window [00:03-00:08]: 2 |
Watermarks — Handling Late Data
The problem: Events arrive out of order. An event with timestamp 10:01 arrives at 10:05. How do you know when a window is "complete"?
Watermark: A declaration that "no events with timestamp <= W will arrive anymore."
| Strategy | Behavior | Trade-off |
|---|---|---|
| Strict (no late data) | Watermark = max event time seen | Fast results, drops late events |
| Bounded lateness | Watermark = max event time - allowed lateness (e.g., 5 min) | Waits for late data, delays results |
| Heuristic | Track lateness distribution, set watermark at p99 | Adaptive, may still miss outliers |
Late data handling (Flink/Beam):
- Drop: Ignore events arriving after watermark (simplest)
- Side output: Route late events to a separate stream for reprocessing
- Allowed lateness: Keep window state open for N minutes past watermark; update results
- Accumulating + retracting: Emit updated result AND retraction of previous result
Back-of-Envelope: Streaming Pipeline Sizing
Scenario: Processing 1 million events/second (e-commerce clickstream)
| Component | Sizing |
|---|---|
| Kafka brokers | 10 brokers (100K events/s each at 1KB = 100 MB/s per broker) |
| Kafka partitions | 100 partitions (10K events/s per partition) |
| Flink parallelism | 100 task slots (1 per partition for source parallelism) |
| Flink TaskManagers | 25 (4 slots each, 16 GB RAM each) |
| Checkpoint interval | 60 seconds |
| Checkpoint size | ~10 GB (windowed state across all operators) |
| State backend | RocksDB (for large state that exceeds memory) |
| End-to-end latency | <5 seconds (processing + checkpoint overhead) |
When to Use What
| Scenario | Recommendation | Why |
|---|---|---|
| Daily reports / data warehouse ETL | Batch (Spark) | Simplicity, cost efficiency, exact results |
| ML model training | Batch (Spark) | Need full dataset, iterative algorithms |
| Real-time fraud detection | Stream (Flink) | Sub-second latency critical |
| Real-time dashboards | Stream or micro-batch | Depends on freshness SLA (seconds vs minutes) |
| User notifications | Stream (Kafka Streams) | Low-latency, event-driven, lightweight |
| Complex event processing | Stream (Flink) | Pattern detection, temporal joins |
| Log aggregation | Either | Batch for cost, stream for real-time alerting |
| Recommendation engine | Lambda | Batch for model, stream for real-time features |
Interview Framework
When data processing comes up in system design:
Step 1: "First, I'd clarify the latency requirement. If results can be minutes/hours stale, batch (Spark) is simpler, cheaper, and gives exact results. If sub-second freshness is needed, we need stream processing."
Step 2: "For streaming, I'd use Flink with Kafka as the message bus. Flink gives us exactly-once semantics via checkpointing and handles event-time processing with watermarks for out-of-order data."
Step 3: "For windowed aggregations, I'd use [tumbling/sliding/session] windows depending on the use case. I'd set watermarks with 5-minute allowed lateness and route truly late events to a side output for batch correction."
Step 4: "For architecture, I'd start with Kappa (stream-only) for simplicity — one codebase, replay from Kafka for reprocessing. If we later need heavy batch ML training, we might add a batch layer (Lambda) for that specific use case."
Step 5: "Exactly-once guarantee: Kafka idempotent producers + Flink checkpoints + transactional sinks ensure no duplicates or data loss end-to-end."
Quick Recall
| Question | Answer |
|---|---|
| Lambda vs Kappa? | Lambda: batch + stream layers (two codebases). Kappa: stream-only (replay for reprocessing). |
| When is batch still better? | Daily ETL, ML training, exact results needed, cost-sensitive |
| Exactly-once how? | Checkpoint state + Kafka offsets atomically; two-phase commit to sinks |
| Watermark purpose? | Declare "no more events before time W" — triggers window computation |
| Tumbling vs sliding? | Tumbling: non-overlapping fixed windows. Sliding: overlapping (window > slide). |
| Flink vs Kafka Streams? | Flink: full cluster, complex processing. KStreams: library, simpler, Kafka-only. |
| Session window? | Dynamic window that closes after inactivity gap (e.g., 30 min no events) |
| LinkedIn's improvement? | 24-hour batch delay reduced to <10 second streaming with Kafka + Samza |