2 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Apache Kafka

Q: 1. How does Kafka guarantee message ordering?

Ordering is guaranteed only within a single partition. Messages with the same key go to the same partition (via hash). So if you need ordered processing per customer, use customerId as the key. You cannot get global ordering across partitions without using a single partition (which kills parallelism).

Q: 2. What happens when a consumer in a group crashes?

A rebalance is triggered. The partitions assigned to the dead consumer are redistributed among the remaining consumers in the group. During rebalance, consumption pauses briefly. With cooperative sticky assignor, only affected partitions are reassigned (minimizing disruption).

Q: 3. How do you handle duplicate messages?

Make consumers idempotent: (1) Use a deduplication table with message ID. (2) Use database upserts instead of inserts. (3) Design operations to be naturally idempotent (SET balance=100 vs ADD 10). For Kafka-to-Kafka, use exactly-once transactions.

Q: 4. How would you scale a Kafka consumer that's falling behind?

(1) Add more consumers to the group (up to partition count). (2) If at partition limit, increase partitions (requires rebalancing). (3) Optimize processing logic. (4) Increase max.poll.records. (5) Use batch processing. (6) Check for slow external calls and make them async.

Q: 5. Kafka vs SQS — when do you choose which?

Kafka: Need message replay, multiple independent consumers, event sourcing, stream processing, high throughput (100K+ msg/sec), ordering guarantees. SQS: Simple task queue, low ops overhead, per-message pricing works better at low volume, don't need replay or multiple consumers reading same data.

Q: 6. What is consumer lag and why does it matter?

Consumer lag = Log End Offset - Consumer's Committed Offset. It tells you how far behind a consumer is from the latest messages. High lag means consumers can't keep up with producers. Monitor with kafka-consumer-groups.sh --describe or tools like Burrow. Alert if lag grows continuously.

Q: 7. Explain log compaction.

Compacted topics keep the latest value for each key and remove older records. Unlike time-based retention (delete everything after 7 days), compaction retains at least the last message per key forever. Use case: user profiles, configuration, state snapshots. Set cleanup.policy=compact.

Q: 8. How does Kafka achieve high throughput?

(1) Sequential I/O — append-only log avoids random disk seeks. (2) Zero-copy — data goes from disk to network without copying to user space. (3) Batching — messages batched at producer and broker level. (4) Compression — batch-level compression (LZ4/zstd). (5) Page cache — leverages OS file cache. (6) Partitioning — horizontal parallelism.

The de-facto standard for event streaming in distributed systems. If you're interviewing for any backend role at a top tech company, you will be asked about Kafka.

Why Kafka in Interviews

System design questions like "Design a notification system," "Design an order pipeline," or "How would you handle 1M events/sec?" almost always involve Kafka. You need to explain partitions, consumer groups, ordering guarantees, and exactly-once semantics fluently.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Producers
        P1["Order Service"]
        P2["Payment Service"]
        P3["User Service"]
    end

    subgraph Kafka["Kafka Cluster"]
        T1["Topic: orders<br/>3 partitions"]
        T2["Topic: payments<br/>2 partitions"]
    end

    subgraph Consumers
        CG1["Analytics<br/>Consumer Group"]
        CG2["Notification<br/>Consumer Group"]
        CG3["Search Index<br/>Consumer Group"]
    end

    P1 --> T1
    P2 --> T2
    P3 --> T1
    T1 --> CG1
    T1 --> CG2
    T2 --> CG2
    T2 --> CG3

    style Kafka fill:#E3F2FD,stroke:#1565C0,color:#000

Core Concepts

Events (Messages/Records)

The fundamental unit of data in Kafka. An event represents "something happened."

Field	Description
Key	Determines partition assignment (messages with same key → same partition)
Value	The actual payload (JSON, Avro, Protobuf)
Timestamp	Event time or ingestion time
Headers	Optional metadata (tracing IDs, content-type)
Offset	Unique sequential ID within a partition

Topics & Partitions

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Topic["Topic: user-events"]
        subgraph P0["Partition 0"]
            direction LR
            M0(["offset 0"]) --> M1(["offset 1"]) --> M2(["offset 2"]) --> M3(["offset 3"])
        end
        subgraph P1["Partition 1"]
            direction LR
            M4(["offset 0"]) --> M5(["offset 1"]) --> M6(["offset 2"])
        end
        subgraph P2["Partition 2"]
            direction LR
            M7(["offset 0"]) --> M8(["offset 1"]) --> M9(["offset 2"]) --> M10(["offset 3"]) --> M11(["offset 4"])
        end
    end

    style P0 fill:#E8F5E9,stroke:#2E7D32,color:#000
    style P1 fill:#FFF3E0,stroke:#E65100,color:#000
    style P2 fill:#F3E5F5,stroke:#6A1B9A,color:#000

Topic = logical channel for a category of events (like a database table)
Partition = ordered, immutable append-only log
Ordering guarantee: Only within a single partition (not across partitions)
Parallelism: More partitions = more consumers can read in parallel

Partition Count Rule of Thumb

Start with number of partitions = max expected consumer instances. You can increase partitions later but never decrease them.

Brokers & Cluster

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Cluster["Kafka Cluster"]
        B1(["Broker 1<br/>Leader: P0, P1"])
        B2(["Broker 2<br/>Leader: P2, P3"])
        B3(["Broker 3<br/>Leader: P4, P5"])
    end

    subgraph ZK["Controller"]
        C{{"KRaft Controller<br/>(replaces ZooKeeper)"}}
    end

    C --> B1
    C --> B2
    C --> B3

    style Cluster fill:#E3F2FD,stroke:#1565C0,color:#000
    style ZK fill:#FFF3E0,stroke:#E65100,color:#000

Broker = a single Kafka server that stores data and serves clients
Cluster = group of brokers working together
Controller = manages partition leadership, broker membership (KRaft since Kafka 3.3+, replacing ZooKeeper)

Replication

Each partition has one leader and N-1 followers (replicas).

Term	Meaning
Replication Factor	Total copies of each partition (typically 3)
Leader	Handles all reads and writes for a partition
Follower	Replicates from leader, takes over if leader fails
ISR (In-Sync Replicas)	Followers that are caught up with the leader
min.insync.replicas	Minimum ISR count required to accept writes

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Partition0["Partition 0 (RF=3)"]
        L["Broker 1<br/>LEADER"] -->|replicates| F1["Broker 2<br/>FOLLOWER"]
        L -->|replicates| F2["Broker 3<br/>FOLLOWER"]
    end

    Producer["Producer"] -->|writes| L
    Consumer["Consumer"] -->|reads| L

    style L fill:#C8E6C9,stroke:#2E7D32,color:#000
    style F1 fill:#BBDEFB,stroke:#1565C0,color:#000
    style F2 fill:#BBDEFB,stroke:#1565C0,color:#000

Producers

How Producers Work

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
sequenceDiagram
    participant App as Application
    participant P as Producer
    participant S as Serializer
    participant Part as Partitioner
    participant B as Broker (Leader)

    App->>P: send(topic, key, value)
    P->>S: serialize key & value
    S->>Part: determine partition
    Part->>P: partition number
    P->>P: add to batch (per partition)
    P->>B: send batch
    B->>P: ack (based on acks config)

Key Configuration

Config	Values	Trade-off
`acks`	`0` = fire & forget	Fastest, can lose data
	`1` = leader ack	Good balance, can lose if leader dies before replication
	`all` (-1) = all ISR ack	Strongest durability, higher latency
`batch.size`	bytes (default 16KB)	Larger = better throughput, higher latency
`linger.ms`	milliseconds (default 0)	Wait time to fill batch before sending
`compression.type`	none, gzip, snappy, lz4, zstd	zstd gives best ratio; snappy for speed
`retries`	integer (default MAX_INT)	Combined with `delivery.timeout.ms`
`enable.idempotence`	true/false	Prevents duplicates on retry (default true since 3.0)

Partitioning Strategy

Java

// Default: hash(key) % numPartitions
// Same key always goes to same partition → ordering guarantee

producer.send(new ProducerRecord<>("orders", customerId, orderJson));
// All orders for customerId "C123" go to the same partition

Strategy	When
Key-based (default)	Need ordering per entity (customer, order, device)
Round-robin	No key provided, maximize distribution
Custom partitioner	Business logic (VIP customers → dedicated partition)

Consumers & Consumer Groups

Consumer Group Model

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Topic["Topic: orders (4 partitions)"]
        P0(("P0"))
        P1(("P1"))
        P2(("P2"))
        P3(("P3"))
    end

    subgraph CG1["Consumer Group: order-processor"]
        C1[["Consumer 1"]]
        C2[["Consumer 2"]]
    end

    subgraph CG2["Consumer Group: analytics"]
        C3[["Consumer 3"]]
    end

    P0 --> C1
    P1 --> C1
    P2 --> C2
    P3 --> C2

    P0 --> C3
    P1 --> C3
    P2 --> C3
    P3 --> C3

    style CG1 fill:#E8F5E9,stroke:#2E7D32,color:#000
    style CG2 fill:#FFF3E0,stroke:#E65100,color:#000

Rules:

Each partition is consumed by exactly one consumer in a group
A consumer can read from multiple partitions
If consumers > partitions → some consumers sit idle
Different consumer groups get their own independent copy of all messages

Offset Management

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Partition["Partition 0"]
        direction LR
        O0["0"] --> O1["1"] --> O2["2"] --> O3["3"] --> O4["4"] --> O5["5"] --> O6["6"]
    end

    COMMITTED["Committed Offset: 3"]:::green -.-> O3
    CURRENT["Current Position: 5"]:::blue -.-> O5
    LEO["Log End Offset: 6"]:::red -.-> O6

    classDef green fill:#C8E6C9,stroke:#2E7D32
    classDef blue fill:#BBDEFB,stroke:#1565C0
    classDef red fill:#FFCDD2,stroke:#C62828

Term	Meaning
Committed Offset	Last offset confirmed as processed (stored in `__consumer_offsets` topic)
Current Position	Where the consumer is currently reading
Log End Offset (LEO)	Latest message in the partition
Lag	LEO - Committed Offset (how far behind the consumer is)

Consumer Configuration

Config	Description
`group.id`	Consumer group identifier
`auto.offset.reset`	`earliest` (from beginning) or `latest` (new messages only)
`enable.auto.commit`	If true, offsets committed periodically (default 5s)
`max.poll.records`	Max records returned per poll (default 500)
`session.timeout.ms`	Time before consumer considered dead (default 45s)
`max.poll.interval.ms`	Max time between polls before rebalance (default 5min)

Delivery Semantics

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Semantics["Delivery Guarantees"]
        ALO{{"At-Least-Once<br/>✅ No data loss<br/>⚠️ Possible duplicates"}}
        AMO{{"At-Most-Once<br/>⚠️ Possible data loss<br/>✅ No duplicates"}}
        EO{{"Exactly-Once<br/>✅ No data loss<br/>✅ No duplicates"}}
    end

    ALO ---|"commit AFTER processing"| ALO
    AMO ---|"commit BEFORE processing"| AMO
    EO ---|"transactions + idempotence"| EO

    style ALO fill:#FFF3E0,stroke:#E65100,color:#000
    style AMO fill:#FFCDD2,stroke:#C62828,color:#000
    style EO fill:#E8F5E9,stroke:#2E7D32,color:#000

At-Least-Once (Most Common)

Java

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);  // process first
    }
    consumer.commitSync();  // then commit
}
// If crash after process but before commit → message reprocessed (duplicate)

Exactly-Once Semantics (EOS)

Achieved through idempotent producers + transactions:

Java

// Producer side
props.put("enable.idempotence", "true");
props.put("transactional.id", "order-processor-1");

producer.initTransactions();
producer.beginTransaction();
try {
    producer.send(new ProducerRecord<>("output-topic", key, value));
    // Commit consumer offsets as part of the transaction
    producer.sendOffsetsToTransaction(offsets, consumerGroupId);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

EOS Limitations

Exactly-once only works within Kafka (produce + consume in same cluster). For external systems (databases, APIs), you still need idempotent consumers or the Outbox pattern.

Kafka Connect

Pre-built connectors for integrating Kafka with external systems without writing code.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Sources
        DB["PostgreSQL"]
        S3["S3"]
        API["REST API"]
    end

    subgraph KC["Kafka Connect Cluster"]
        SC["Source Connectors"]
        SK["Sink Connectors"]
    end

    subgraph Kafka["Kafka"]
        T["Topics"]
    end

    subgraph Sinks
        ES["Elasticsearch"]
        DW["Snowflake"]
        REDIS["Redis"]
    end

    DB --> SC
    S3 --> SC
    API --> SC
    SC --> T
    T --> SK
    SK --> ES
    SK --> DW
    SK --> REDIS

    style KC fill:#F3E5F5,stroke:#6A1B9A,color:#000

Connector Type	Direction	Examples
Source	External → Kafka	Debezium (CDC), JDBC, S3, MongoDB
Sink	Kafka → External	Elasticsearch, S3, JDBC, Redis

Debezium (Change Data Capture)

Captures row-level changes from databases and streams them to Kafka topics.

JSON

{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.payments",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput"
  }
}

Kafka Streams

Client library for building stream processing applications. No separate cluster needed — runs in your application.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    IT["Input Topic<br/>raw-orders"] --> KS["Kafka Streams App<br/>(your JVM process)"]
    KS --> OT["Output Topic<br/>enriched-orders"]
    KS --> ST["State Store<br/>(RocksDB local)"]

    style KS fill:#E8F5E9,stroke:#2E7D32,color:#000

Java

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("raw-orders");

KTable<String, Customer> customers = builder.table("customers");

// Enrich orders with customer data
KStream<String, EnrichedOrder> enriched = orders.join(
    customers,
    (order, customer) -> new EnrichedOrder(order, customer)
);

// Filter high-value orders
enriched
    .filter((key, order) -> order.getTotal() > 1000)
    .to("high-value-orders");

// Aggregate order counts per customer (windowed)
orders
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .count()
    .toStream()
    .to("order-counts-per-hour");

Performance & Tuning

Why Kafka is Fast

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Performance["Why Kafka Achieves Millions of msg/sec"]
        SEQ(["Sequential I/O<br/>Append-only log, no random seeks"])
        ZC[/"Zero-Copy Transfer<br/>sendfile() syscall, no user-space copy"/]
        BATCH{{"Batching<br/>Amortize network overhead"}}
        COMP[["Compression<br/>Batch-level compression"]]
        PAGE(["Page Cache<br/>OS caches hot data in RAM"])
        PART{{"Partitioning<br/>Parallel processing"}}
    end

    style SEQ fill:#E8F5E9,stroke:#2E7D32,color:#000
    style ZC fill:#E3F2FD,stroke:#1565C0,color:#000
    style BATCH fill:#FFF3E0,stroke:#E65100,color:#000
    style COMP fill:#F3E5F5,stroke:#6A1B9A,color:#000
    style PAGE fill:#FEF3C7,stroke:#D97706,color:#000
    style PART fill:#FFCDD2,stroke:#C62828,color:#000

Key Metrics to Monitor

Metric	Healthy Value	Action if Unhealthy
Consumer Lag	Close to 0	Scale consumers, check processing time
Under-replicated Partitions	0	Check broker health, disk I/O
Request Latency (p99)	< 100ms	Tune batch size, check network
ISR Shrinks/sec	0	Check slow brokers, network partitions
Disk Usage	< 80%	Adjust retention, add brokers

Retention Configuration

Properties

# Time-based retention (default 7 days)
log.retention.hours=168

# Size-based retention (per partition)
log.retention.bytes=1073741824

# Compaction (keep latest value per key — infinite retention)
log.cleanup.policy=compact

# Compaction + deletion (compact, then delete after time)
log.cleanup.policy=compact,delete

Log Compaction

Use compaction for topics that represent current state (user profiles, configuration). Kafka keeps the latest value for each key and removes older entries. Think of it as a distributed key-value store that also gives you a change log.

Common Patterns

Event Sourcing with Kafka

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
sequenceDiagram
    participant Client
    participant Service
    participant Kafka as Kafka (Event Store)
    participant Materialized as Materialized View

    Client->>Service: Place Order
    Service->>Kafka: OrderPlaced event
    Service->>Client: 202 Accepted

    Kafka->>Materialized: Consume & build read model
    Client->>Materialized: Query order status

CQRS with Kafka

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Write["Write Path"]
        CMD["Command"] --> WS["Write Service"] --> DB["PostgreSQL"]
        WS --> K["Kafka Topic"]
    end

    subgraph Read["Read Path"]
        K --> RS["Read Service"]
        RS --> ES["Elasticsearch"]
        RS --> CACHE["Redis Cache"]
    end

    Q["Query"] --> ES
    Q --> CACHE

    style Write fill:#E8F5E9,stroke:#2E7D32,color:#000
    style Read fill:#E3F2FD,stroke:#1565C0,color:#000

Dead Letter Queue (DLQ) Pattern

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    T["Main Topic"] --> C["Consumer"]
    C -->|"Success"| PROCESS["Process"]
    C -->|"Failure (after retries)"| DLQ["DLQ Topic"]
    DLQ --> ALERT["Alert / Manual Review"]
    DLQ -->|"Retry later"| T

    style DLQ fill:#FFCDD2,stroke:#C62828,color:#000

Schema Management

Schema Registry (Confluent)

Centralized schema store that ensures producers and consumers agree on data format.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    P["Producer"] -->|"1. Register/validate schema"| SR["Schema Registry"]
    P -->|"2. Send data (schema ID + payload)"| K["Kafka"]
    K -->|"3. Consume"| C["Consumer"]
    C -->|"4. Fetch schema by ID"| SR

    style SR fill:#FFF3E0,stroke:#E65100,color:#000

Compatibility Modes:

Mode	Rule
BACKWARD	New schema can read old data (can remove fields, add optional fields)
FORWARD	Old schema can read new data (can add fields, remove optional fields)
FULL	Both backward and forward compatible
NONE	No compatibility check

Kafka vs Alternatives

Feature	Kafka	RabbitMQ	AWS SQS	Pulsar
Model	Log (pull)	Queue (push)	Queue (pull)	Log (pull)
Ordering	Per partition	Per queue	FIFO queues only	Per partition
Retention	Configurable (days/forever)	Until consumed	14 days max	Tiered (hot/cold)
Replay	Yes (seek to offset)	No	No	Yes
Throughput	Millions msg/sec	~50K msg/sec	~3K msg/sec (FIFO)	Millions msg/sec
Multi-tenancy	Topics + ACLs	Vhosts	Queues + IAM	Tenants native
Ops Complexity	High (or use managed)	Medium	Zero (managed)	High

When NOT to Use Kafka

Simple task queues with few messages → use SQS or RabbitMQ
Request-reply pattern → use HTTP or gRPC
Small team, low volume → overhead not justified
Need message-level routing/filtering → RabbitMQ exchanges are better

Spring Boot Integration

Producer

Java

@Configuration
public class KafkaProducerConfig {
    @Bean
    public ProducerFactory<String, OrderEvent> producerFactory() {
        Map<String, Object> config = Map.of(
            ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092",
            ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class,
            ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class,
            ProducerConfig.ACKS_CONFIG, "all",
            ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true
        );
        return new DefaultKafkaProducerFactory<>(config);
    }

    @Bean
    public KafkaTemplate<String, OrderEvent> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}

@Service
public class OrderEventPublisher {
    private final KafkaTemplate<String, OrderEvent> kafka;

    public void publishOrderCreated(Order order) {
        OrderEvent event = new OrderEvent("ORDER_CREATED", order);
        kafka.send("order-events", order.getCustomerId(), event)
             .whenComplete((result, ex) -> {
                 if (ex != null) log.error("Failed to publish", ex);
             });
    }
}

Consumer

Java

@Component
public class OrderEventConsumer {

    @KafkaListener(
        topics = "order-events",
        groupId = "notification-service",
        concurrency = "3"
    )
    public void handleOrderEvent(
            @Payload OrderEvent event,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset) {

        log.info("Received {} from partition {} at offset {}",
                 event.getType(), partition, offset);

        switch (event.getType()) {
            case "ORDER_CREATED" -> sendConfirmationEmail(event);
            case "ORDER_SHIPPED" -> sendShippingNotification(event);
            case "ORDER_DELIVERED" -> sendDeliveryNotification(event);
        }
    }

    @KafkaListener(topics = "order-events-dlq", groupId = "dlq-handler")
    public void handleDlq(ConsumerRecord<String, OrderEvent> record) {
        log.warn("DLQ message: key={}, value={}", record.key(), record.value());
        // Alert, store for manual review, or retry
    }
}

Error Handling & Retry

Java

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, OrderEvent> template) {
    // Retry 3 times with backoff, then send to DLQ
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(template);

    BackOff backOff = new ExponentialBackOff(1000L, 2.0);  // 1s, 2s, 4s

    DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
    // Don't retry these — they'll never succeed
    handler.addNotRetryableExceptions(
        DeserializationException.class,
        ValidationException.class
    );
    return handler;
}

Common Interview Questions

1. How does Kafka guarantee message ordering?

Ordering is guaranteed only within a single partition. Messages with the same key go to the same partition (via hash). So if you need ordered processing per customer, use customerId as the key. You cannot get global ordering across partitions without using a single partition (which kills parallelism).

2. What happens when a consumer in a group crashes?

A rebalance is triggered. The partitions assigned to the dead consumer are redistributed among the remaining consumers in the group. During rebalance, consumption pauses briefly. With cooperative sticky assignor, only affected partitions are reassigned (minimizing disruption).

3. How do you handle duplicate messages?

Make consumers idempotent: (1) Use a deduplication table with message ID. (2) Use database upserts instead of inserts. (3) Design operations to be naturally idempotent (SET balance=100 vs ADD 10). For Kafka-to-Kafka, use exactly-once transactions.

4. How would you scale a Kafka consumer that's falling behind?

(1) Add more consumers to the group (up to partition count). (2) If at partition limit, increase partitions (requires rebalancing). (3) Optimize processing logic. (4) Increase max.poll.records. (5) Use batch processing. (6) Check for slow external calls and make them async.

5. Kafka vs SQS — when do you choose which?

Kafka: Need message replay, multiple independent consumers, event sourcing, stream processing, high throughput (100K+ msg/sec), ordering guarantees. SQS: Simple task queue, low ops overhead, per-message pricing works better at low volume, don't need replay or multiple consumers reading same data.

6. What is consumer lag and why does it matter?

Consumer lag = Log End Offset - Consumer's Committed Offset. It tells you how far behind a consumer is from the latest messages. High lag means consumers can't keep up with producers. Monitor with kafka-consumer-groups.sh --describe or tools like Burrow. Alert if lag grows continuously.

7. Explain log compaction.

Compacted topics keep the latest value for each key and remove older records. Unlike time-based retention (delete everything after 7 days), compaction retains at least the last message per key forever. Use case: user profiles, configuration, state snapshots. Set cleanup.policy=compact.

8. How does Kafka achieve high throughput?

(1) Sequential I/O — append-only log avoids random disk seeks. (2) Zero-copy — data goes from disk to network without copying to user space. (3) Batching — messages batched at producer and broker level. (4) Compression — batch-level compression (LZ4/zstd). (5) Page cache — leverages OS file cache. (6) Partitioning — horizontal parallelism.

Gotchas in Production

Things that bite you in production but never appear in tutorials:

Gotcha	What Happens	How to Prevent
Rebalancing storms	A consumer takes >300ms to process a batch → `max.poll.interval.ms` exceeded → consumer evicted → rebalance → more consumers evicted → cascading rebalances	Increase `max.poll.interval.ms`, reduce `max.poll.records`, or move heavy processing to a thread pool and commit offsets manually
Consumer lag snowball	Consumer falls behind by millions of offsets → memory pressure from buffering → OOM or GC pauses → falls further behind	Alert at 10K lag, not 1M. Add consumers BEFORE you're drowning. Consider `auto.offset.reset=latest` for non-critical consumers
Partition skew	One key produces 80% of messages → one partition is 100x larger → one consumer does all the work	Use a sub-key (`customerId + "-" + timestamp.minute`) or repartition hot keys across multiple partitions
Offset commit timing	Commit offset BEFORE processing → message lost on crash. Commit AFTER processing → message reprocessed on crash	Choose based on tolerance: at-least-once (commit after) is usually right. Make consumers idempotent.
Schema evolution breaks	Producer adds a required field → old consumers crash on deserialization	Always use Schema Registry with BACKWARD compatibility. Never add required fields — only optional ones
Coordinator failover	Group coordinator broker dies → all consumers in that group lose assignments → 30-60s of no consumption	Set `session.timeout.ms=10000` (lower than default 45s) and `heartbeat.interval.ms=3000` for faster detection
Compaction tombstones	Producing `null` value for a key = tombstone = key deleted after compaction delay. Accidental null values = data loss	Validate payloads before producing. Use a schema that rejects null values for non-delete events

Key Takeaways

Kafka is a distributed commit log — not a traditional message queue
Ordering only within a partition — use meaningful keys
Consumer groups enable independent, parallel consumption
acks=all + min.insync.replicas=2 + replication factor 3 = strong durability
Exactly-once within Kafka via idempotent producers + transactions
For external systems, make consumers idempotent (don't rely on Kafka EOS)
Monitor consumer lag — it's the #1 operational metric

Apache Kafka

Core Concepts

Events (Messages/Records)

Topics & Partitions

Brokers & Cluster

Replication

Producers

How Producers Work

Key Configuration

Partitioning Strategy

Consumers & Consumer Groups

Consumer Group Model

Offset Management

Consumer Configuration

Delivery Semantics

At-Least-Once (Most Common)

Exactly-Once Semantics (EOS)

Kafka Connect

Debezium (Change Data Capture)

Kafka Streams

Performance & Tuning

Why Kafka is Fast

Key Metrics to Monitor

Retention Configuration

Common Patterns

Event Sourcing with Kafka

CQRS with Kafka

Dead Letter Queue (DLQ) Pattern

Schema Management

Schema Registry (Confluent)

Kafka vs Alternatives

Spring Boot Integration

Producer

Consumer

Error Handling & Retry

Common Interview Questions

Gotchas in Production

5-Minute System Design — Weekly