Skip to content
15 min read

Kubernetes (K8s)

Master Kubernetes from cluster internals to production operations. Covers architecture, scheduling algorithms, networking model, storage, RBAC, Helm, debugging playbooks, and every question that comes up at senior-level DevOps and SRE interviews.


How to Use This Guide

This guide is structured as a progressive deep-dive, from foundational motivation through production-grade operations:

  • Section 1 — Why K8s exists: the problems it solves (self-healing, scaling, zero-downtime, config management).
  • Sections 2-3 — Architecture & core objects: control plane internals, worker nodes, the full kubectl apply flow, Pods, Deployments, Services, Ingress, HPA.
  • Sections 4-6 — Advanced concepts: StatefulSets, DaemonSets, networking model, storage, RBAC.
  • Sections 7-8 — Production: Helm, operations playbook, debugging techniques.
  • Interview Q&A — FAANG-level design and concept questions with concise answers.

Start with Section 1 if you want the mental model for why orchestration exists. If you already understand the motivation, jump to Section 2 for architecture or directly to the Interview Q&A for rapid prep.


1. Why Kubernetes Exists

Kubernetes was born from Borg — Google's internal system that manages billions of containers across millions of machines. In 2014, Google open-sourced the concepts as Kubernetes. Today it runs everything from Spotify to Airbnb to every major bank.

Problem 1: Self-Healing & High Availability

You have 200 microservices running across 50 nodes. At 3 AM, a node's disk fills up and 12 containers crash. Without K8s, you get paged, SSH into machines, manually restart services, and hope nothing else breaks.

K8s answer: The controller-manager detects unhealthy pods within seconds. The scheduler places replacements on healthy nodes. By the time you wake up, everything is already running — on different nodes. You find out from a Slack alert that says "auto-healed."

Problem 2: Scaling Under Load

Your e-commerce site handles 1,000 req/s normally. Black Friday hits — 50,000 req/s. Manually provisioning servers takes 30 minutes. You've already lost millions in revenue.

K8s answer: HPA detects CPU spike within 15 seconds. Scales pods from 10 to 100 in under a minute. Cluster Autoscaler sees unschedulable pods, provisions new nodes from your cloud provider. Traffic spike handled — automatically, in under 2 minutes.

Problem 3: Zero-Downtime Deploys

You deploy 20 times per day. Old approach: stop all instances, deploy new version, start up. That's 30-60 seconds of downtime per deploy — 10+ minutes of downtime PER DAY.

K8s answer: Rolling updates replace one pod at a time. New pod starts → passes health check → gets traffic → old pod drains connections → terminates. If the new version crashes? Automatic rollback. Zero dropped requests.

Problem 4: Configuration & Secrets at Scale

You have 50 services × 4 environments = 200 different configurations. Database URLs, API keys, feature flags — all baked into Docker images. Changing one config means rebuilding and redeploying an image.

K8s answer: ConfigMaps and Secrets are injected at runtime. Change a database password? Update the Secret, rolling-restart the pods. No image rebuild. No code change. Decoupled.


2. Architecture — How The Cluster Works

A Kubernetes cluster has two planes: the Control Plane (the brain — decides what should happen) and the Data Plane (worker nodes — where your containers actually run). In production, you run 3+ control plane nodes across availability zones.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
    USER["kubectl / CI/CD / Dashboard"] -->|"HTTPS"| API

    subgraph CP["Control Plane (Master Nodes)"]
        API["kube-apiserver<br/><small>the ONLY entry point</small>"]
        ETCD[("etcd cluster<br/><small>all cluster state<br/>strongly consistent</small>")]
        SCHED["kube-scheduler<br/><small>pod → node matching</small>"]
        CM["controller-manager<br/><small>reconciliation loops</small>"]
        CCM["cloud-controller-manager<br/><small>LBs, disks, node lifecycle</small>"]

        API <--> ETCD
        SCHED --> API
        CM --> API
        CCM --> API
    end

    subgraph WN1["Worker Node 1"]
        KL1["kubelet<br/><small>pod lifecycle agent</small>"]
        KP1["kube-proxy<br/><small>iptables/IPVS rules</small>"]
        CR1["containerd<br/><small>container runtime</small>"]
        P1["Pod A"] & P2["Pod B"]
    end

    subgraph WN2["Worker Node 2"]
        KL2["kubelet"]
        KP2["kube-proxy"]
        CR2["containerd"]
        P3["Pod C"] & P4["Pod D"] & P5["Pod E"]
    end

    API --> KL1 & KL2
    KL1 --> CR1
    KL2 --> CR2

    style CP fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6
    style WN1 fill:#D1FAE5,stroke:#10B981,color:#065F46
    style WN2 fill:#DBEAFE,stroke:#3B82F6,color:#1E40AF

Control Plane Deep Dive

kube-apiserver

The ONLY component that reads/writes etcd. Everything goes through it — kubectl, the scheduler, controllers, kubelets. It handles authentication, authorization (RBAC), admission control, and validation. Horizontally scalable (run multiple replicas behind a load balancer).

etcd

Distributed key-value store (Raft consensus). Holds ALL cluster state — pod specs, service endpoints, secrets, configmaps. If etcd is lost without backup, your cluster is irrecoverable. Always run 3 or 5 nodes across zones. Always have automated backups.

kube-scheduler

Watches for pods with nodeName: "" (unscheduled). Runs a scoring algorithm considering: resource requests, node affinity/anti-affinity, taints/tolerations, topology spread constraints, and inter-pod affinity. Picks the highest-scoring node.

controller-manager

Runs ~30 control loops simultaneously. Each one watches a resource type and reconciles actual state to desired state. Deployment controller, ReplicaSet controller, Node controller, Job controller, EndpointSlice controller, etc. This is the "self-healing" engine.

Worker Node Deep Dive

kubelet

Agent on every node. Watches the API server for pods assigned to its node. Calls the container runtime (containerd) to start/stop containers. Runs liveness/readiness probes. Reports node status and pod status back to API server every 10s.

kube-proxy

Runs on every node. Watches Service and EndpointSlice objects. Programs iptables or IPVS rules so that Service ClusterIPs route to healthy pod backends. This is how my-service:80 gets DNAT'd to 10.244.1.8:8080.

Container Runtime (containerd)

Pulls images, manages container lifecycle, provides CRI (Container Runtime Interface). Docker was deprecated in K8s 1.24 — containerd and CRI-O are the standards now. Your Docker images still work (they're OCI-compliant).

The Full Request Flow: kubectl apply -f deployment.yaml

1. kubectl → API Server (HTTPS + auth)

kubectl reads your kubeconfig, authenticates (cert/token/OIDC), sends the manifest as a POST/PUT request to the API server.

2. Admission Controllers

API server runs mutating webhooks (inject sidecar? add labels?) then validating webhooks (meets policy? resource quotas OK?). If any reject → error returned to user.

3. Persisted to etcd

Validated object written to etcd. API server returns 201 Created. Object now exists in cluster state.

4. Deployment Controller reacts

Watches Deployment objects. Sees new one → creates a ReplicaSet with the pod template and desired replicas.

5. ReplicaSet Controller reacts

Watches ReplicaSets. Sees desired=3, current=0 → creates 3 Pod objects (with nodeName: "").

6. Scheduler assigns nodes

Watches unscheduled pods. Scores available nodes. Binds each pod to the best node (sets nodeName).

7. kubelet starts containers

Watches pods assigned to its node. Pulls image via containerd, starts container, sets up volumes/network. Reports pod status = Running.

8. kube-proxy updates routing

If a Service selects these pods (via label selector), kube-proxy adds them to iptables/IPVS backends. Traffic can now reach them.


3. Core Objects — Pods, Deployments, Services

K8s Resource Map

Pod
Service
Deployment
HPA
ConfigMap
Secret
Volume
Namespace

Nodes & Pods — The Physical View

A Node is a physical or virtual machine in the cluster. Each node runs multiple Pods, and each pod contains one or more Containers. This is the fundamental deployment unit hierarchy:

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
    subgraph NODE1["Node 1 (Worker)"]
        direction TB
        subgraph P1["Pod: order-api"]
            C1["container 1<br/><small>order-service :8080</small>"]
            C2["container 2<br/><small>envoy-sidecar :15001</small>"]
        end
        subgraph P2["Pod: payment-api"]
            C3["container 3<br/><small>payment-service :8080</small>"]
        end
        subgraph P3["Pod: cart-api"]
            C4["container 4<br/><small>cart-service :8080</small>"]
            C5["container 5<br/><small>redis-cache :6379</small>"]
        end
    end

    subgraph NODE2["Node 2 (Worker)"]
        direction TB
        subgraph P4["Pod: user-api"]
            C6["container 6<br/><small>user-service :8080</small>"]
        end
        subgraph P5["Pod: notification-api"]
            C7["container 7<br/><small>notification-svc :8080</small>"]
            C8["container 8<br/><small>kafka-consumer :9092</small>"]
        end
        subgraph P6["Pod: search-api"]
            C9["container 9<br/><small>search-service :8080</small>"]
        end
    end

    style NODE1 fill:#FFCDD2,stroke:#E53935,color:#B71C1C
    style NODE2 fill:#E8EAF6,stroke:#5C6BC0,color:#1A237E
    style P1 fill:#EF9A9A,stroke:#E53935,color:#B71C1C
    style P2 fill:#EF9A9A,stroke:#E53935,color:#B71C1C
    style P3 fill:#EF9A9A,stroke:#E53935,color:#B71C1C
    style P4 fill:#9FA8DA,stroke:#5C6BC0,color:#1A237E
    style P5 fill:#9FA8DA,stroke:#5C6BC0,color:#1A237E
    style P6 fill:#9FA8DA,stroke:#5C6BC0,color:#1A237E

Key relationships:

Concept What it is Analogy
Cluster Collection of all nodes The entire data center
Node A machine (VM or bare-metal) running kubelet A single server rack
Pod Smallest deployable unit — one or more containers sharing network/storage A sealed apartment (containers inside share walls)
Container A single running process (your Docker image) One room in the apartment

Why Pods wrap Containers (not deploy containers directly):

  • Containers in the same pod share localhost — no network hop between sidecar and main app
  • Shared volumes — log shipper reads files written by the app container
  • Atomic scheduling — "these two containers MUST be on the same node"
  • Lifecycle coupling — if the main container dies, the whole pod restarts

Rule of Thumb

One container per pod is the default. Only add multiple containers when they are tightly coupled (sidecar pattern). If two containers can be deployed/scaled independently, they belong in separate pods.


Pods — The Atomic Unit

Mental model: A Pod is NOT "a container." It's a shared execution environment — one or more containers that share network (same IP, communicate via localhost) and storage (same volumes). Think of it as a "logical host" for tightly-coupled processes.
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
    subgraph POD["Pod (IP: 10.244.1.8)"]
        direction LR
        MAIN["Main Container<br/><small>your-api :8080</small>"]
        SIDE["Sidecar<br/><small>envoy-proxy :15001</small>"]
        INIT["Init Container<br/><small>db-migrator (ran first, then exited)</small>"]
    end

    VOL["Shared Volume<br/>/data"]
    NET["Shared Network Namespace<br/><small>localhost:8080 ↔ localhost:15001</small>"]

    POD --- VOL
    POD --- NET

    style POD fill:#DBEAFE,stroke:#3B82F6,color:#1E40AF
    style VOL fill:#D1FAE5,stroke:#10B981,color:#065F46
    style NET fill:#FEF3C7,stroke:#F59E0B,color:#92400E

Multi-container patterns (these come up in FAANG interviews):

Pattern Example Why
Sidecar Envoy proxy, log shipper (Fluentd), cert rotator Adds functionality without modifying main app
Init Container DB migration, wait-for-dependency, config download Must complete before main containers start
Ambassador Redis proxy, API gateway sidecar Simplifies main app's connection logic
Adapter Log format converter, metrics transformer Standardizes output format for monitoring

Probes — The Self-Healing Mechanism:

Probe Question On Failure Use Case
Liveness "Is this process alive?" Restart container Detect deadlocks, infinite loops
Readiness "Can this handle traffic?" Remove from Service endpoints Warmup, dependency health
Startup "Has this finished starting?" Kill + restart after timeout Slow-starting legacy apps

Production War Story: Liveness Probe Mistake

A team set their liveness probe to check a downstream database. When the DB had a blip, ALL pods restarted simultaneously — cascading failure. Rule: Liveness probes must only check the process itself, never external dependencies. Use readiness probes for dependency checks.


Deployments — The Workload Manager

A Deployment manages ReplicaSets, which manage Pods. You declare desired state, K8s makes it happen.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
    DEP["Deployment<br/><small>api-server</small>"]
    DEP --> RS1["ReplicaSet (current)<br/><small>api-server-7d8f6<br/>replicas: 3</small>"]
    DEP -.-> RS2["ReplicaSet (previous)<br/><small>api-server-5c9a2<br/>replicas: 0 (scaled down)</small>"]
    RS1 --> P1["Pod"] & P2["Pod"] & P3["Pod"]

    style DEP fill:#8B5CF6,stroke:#6D28D9,color:#fff
    style RS1 fill:#D1FAE5,stroke:#10B981,color:#065F46
    style RS2 fill:#FEE2E2,stroke:#EF4444,color:#991B1B

Rolling Update — How Zero-Downtime Works

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph BEFORE["Before (v1 running)"]
        direction TB
        V1A["v1 pod A ✓"]
        V1B["v1 pod B ✓"]
        V1C["v1 pod C ✓"]
    end

    subgraph DURING["Mid-rollout"]
        direction TB
        V1X["v1 pod A ✓"]
        V2Y["v2 pod B ✓<br/><small>readiness passed</small>"]
        V1Z["v1 pod C<br/><small>draining...</small>"]
    end

    subgraph AFTER["After (v2 complete)"]
        direction TB
        V2A["v2 pod A ✓"]
        V2B["v2 pod B ✓"]
        V2C["v2 pod C ✓"]
    end

    BEFORE --> DURING --> AFTER

    style BEFORE fill:#DBEAFE,stroke:#3B82F6,color:#1E40AF
    style DURING fill:#FEF3C7,stroke:#F59E0B,color:#92400E
    style AFTER fill:#D1FAE5,stroke:#10B981,color:#065F46

The rolling update cycle for each pod:

Start new v2 pod Wait for readiness probe Add to Service endpoints Remove old v1 from endpoints Drain connections (grace period) Terminate old pod
Zero-downtime requires ALL of these: maxUnavailable: 0 · Readiness probe that passes only when truly ready · terminationGracePeriodSeconds: 30 · PodDisruptionBudget · SIGTERM handler in your app (finish in-flight requests, stop accepting new ones)
Production-Ready Deployment YAML
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels: { app: api-server }
  template:
    metadata:
      labels: { app: api-server, version: v2.1.0 }
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: registry.example.com/api:v2.1.0@sha256:abc123...
          ports: [{ containerPort: 8080 }]
          resources:
            requests: { memory: "256Mi", cpu: "250m" }
            limits: { memory: "512Mi", cpu: "1000m" }
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 15
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]

Services — Stable Networking Abstraction

Why Services Exist

Pods are ephemeral — they get new IPs every time they restart. A Deployment might scale from 3 to 10 pods and back. Your frontend cannot hardcode 10.244.1.8:8080.

A Service gives you a stable DNS name and virtual IP that never changes. Behind the scenes, kube-proxy programs iptables/IPVS rules to load-balance traffic across all healthy pods matching the label selector.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
    subgraph EXTERNAL["External Traffic"]
        CLIENT["Browser / Mobile"]
    end

    subgraph CLUSTER["Cluster"]
        LB["LoadBalancer Service<br/><small>provisions AWS ALB/NLB</small><br/><small>External IP: 52.x.x.x</small>"]
        NP["NodePort<br/><small>:30080 on every node</small>"]
        CIP["ClusterIP Service<br/><small>api-service.default.svc.cluster.local</small><br/><small>10.96.0.42:80</small>"]

        CIP --> P1["Pod 10.244.1.8"]
        CIP --> P2["Pod 10.244.2.3"]
        CIP --> P3["Pod 10.244.3.7"]
    end

    subgraph INTERNAL["Internal Service-to-Service"]
        APP["order-service pod"]
        CIP2["ClusterIP: payment-service<br/><small>10.96.0.88:80</small>"]
        PAY1["payment pod 1"]
        PAY2["payment pod 2"]
    end

    CLIENT --> LB --> NP --> CIP
    APP -->|"http://payment-service:80"| CIP2
    CIP2 --> PAY1 & PAY2

    style LB fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6
    style CIP fill:#D1FAE5,stroke:#10B981,color:#065F46
    style CIP2 fill:#D1FAE5,stroke:#10B981,color:#065F46

Service types — know when to use each:

Type Scope How it works When to use
ClusterIP Internal only Virtual IP + DNS inside cluster Service-to-service (90% of cases)
NodePort External via node Opens port 30000-32767 on every node On-prem, no cloud LB available
LoadBalancer External via cloud LB Provisions ALB/NLB/GCP LB automatically Production external traffic
ExternalName DNS alias CNAME to external service Migration: gradually move traffic to K8s
Headless Direct pod access clusterIP: None — DNS returns pod IPs StatefulSets (address specific pods)
DNS in K8s: Every service gets <service>.<namespace>.svc.cluster.local. Same namespace? Just use the service name: postgres, redis, api-service. Headless services: each pod gets <pod-name>.<service>.<namespace>.svc.cluster.local.

Ingress — Production HTTP Routing

The Cost Problem

You have 20 microservices that need external access. 20 LoadBalancer Services = 20 cloud load balancers = $$$. Plus you want: path-based routing, host-based routing, TLS termination, rate limiting, all on one domain.

Ingress = one load balancer + smart L7 routing rules. An Ingress Controller (nginx, traefik, Istio) watches Ingress objects and configures itself accordingly.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    INTERNET["Internet"] --> IC["Ingress Controller<br/><small>(nginx / traefik / ALB)</small>"]
    IC -->|"api.myapp.com/v1/*"| API["api-service :8080"]
    IC -->|"api.myapp.com/auth/*"| AUTH["auth-service :8080"]
    IC -->|"docs.myapp.com/*"| DOCS["docs-service :3000"]
    IC -->|"ws.myapp.com"| WS["websocket-service :9090"]

    style IC fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6
    style API fill:#D1FAE5,stroke:#10B981,color:#065F46
    style AUTH fill:#FEF3C7,stroke:#F59E0B,color:#92400E
    style DOCS fill:#DBEAFE,stroke:#3B82F6,color:#1E40AF
    style WS fill:#FEE2E2,stroke:#EF4444,color:#991B1B
Ingress with TLS and path-based routing
YAML
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.myapp.com]
      secretName: api-tls-cert
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service: { name: api-service, port: { number: 80 } }
          - path: /auth
            pathType: Prefix
            backend:
              service: { name: auth-service, port: { number: 80 } }

HPA — Automatic Scaling

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    MS["Metrics Server<br/><small>scrapes kubelet every 15s</small>"] --> HPA["HPA Controller"]
    HPA -->|"desired = ceil(current × actual/target)"| DEP["Deployment<br/><small>replicas: 3 → 8</small>"]
    DEP --> RS["ReplicaSet"]
    RS --> P1["Pod"] & P2["Pod"] & P3["Pod"]
    RS -.->|"scaling up"| P4["Pod 4"] & P5["Pod 5"]

    style HPA fill:#F59E0B,stroke:#92400E,color:#fff
    style P4 fill:#D1FAE5,stroke:#10B981,color:#065F46
    style P5 fill:#D1FAE5,stroke:#10B981,color:#065F46

How HPA calculates replicas:

Text Only
desiredReplicas = ceil( currentReplicas × (currentMetricValue / targetMetricValue) )

Example: 3 replicas at 90% CPU, target is 60% → ceil(3 × 90/60) = ceil(4.5) = 5 replicas

Scaling behaviors:

  • Scale up: Fast (30s stabilization window). React quickly to load spikes.
  • Scale down: Slow (5 min stabilization). Avoid thrashing on bursty traffic.
Autoscaler What it scales Best for
HPA Pod replicas (horizontal) Stateless services under load
VPA Pod resources (vertical) Right-sizing, batch jobs
Cluster Autoscaler Nodes When pods can't be scheduled
KEDA Pods (event-driven) Kafka lag, SQS depth, cron-based

4. Advanced Workloads — StatefulSets, DaemonSets, Jobs

StatefulSets — When Identity Matters

Deployment (stateless)

Pods are interchangeable cattle. Random names (api-7d8f6-x4k2p). Share storage. Scale freely. Kill any pod — who cares? Web servers, APIs, workers.

StatefulSet (stateful)

Pods are named pets. Stable ordinal names (postgres-0, postgres-1). Each gets its OWN persistent disk. Start in order, stop in reverse. Databases, Kafka, ZooKeeper, Elasticsearch.

StatefulSet guarantees:

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    SS["StatefulSet: postgres"] --> P0["postgres-0<br/><small>pvc: data-postgres-0 (100Gi)</small>"]
    SS --> P1["postgres-1<br/><small>pvc: data-postgres-1 (100Gi)</small>"]
    SS --> P2["postgres-2<br/><small>pvc: data-postgres-2 (100Gi)</small>"]
    HS["Headless Service"] --> DNS0["postgres-0.postgres-headless<br/>.default.svc.cluster.local"]
    HS --> DNS1["postgres-1.postgres-headless"]
    HS --> DNS2["postgres-2.postgres-headless"]

    style SS fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6
    style P0 fill:#D1FAE5,stroke:#10B981,color:#065F46
    style P1 fill:#D1FAE5,stroke:#10B981,color:#065F46
    style P2 fill:#D1FAE5,stroke:#10B981,color:#065F46
  • Created in order: 0 → 1 → 2 (each must be Ready before next starts)
  • Terminated in reverse: 2 → 1 → 0
  • Each pod's PVC persists even if the pod is deleted and rescheduled to another node
  • Headless service gives each pod its own DNS name

DaemonSets — One Pod Per Node

Need something running on EVERY node? Monitoring agent, log collector, network plugin, storage daemon.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart TB
    DS["DaemonSet: node-exporter"]
    DS --> N1["Node 1<br/>node-exporter pod"]
    DS --> N2["Node 2<br/>node-exporter pod"]
    DS --> N3["Node 3<br/>node-exporter pod"]
    DS -.->|"new node added"| N4["Node 4<br/>auto-created pod"]

    style DS fill:#F59E0B,stroke:#92400E,color:#fff
    style N4 fill:#D1FAE5,stroke:#10B981,color:#065F46

Common DaemonSets in production: Prometheus Node Exporter, Fluentd/Filebeat, Datadog Agent, Calico/Cilium (CNI), CSI node driver

Jobs & CronJobs

Type Behavior Example
Job Run to completion, retry on failure, never restart after success DB migration, data export, ML training
CronJob Scheduled Job creation Nightly backup (0 2 * * *), hourly report generation

5. Networking — Pod-to-Pod, Services, Network Policies

The K8s Networking Model (Three Rules)

  1. Every pod gets its own IP (no port conflicts between pods)
  2. All pods can reach all other pods without NAT (flat network — any pod on any node talks to any other pod on any other node)
  3. Agents on a node can reach all pods on that node

This flat network is implemented by CNI plugins: Calico (BGP), Cilium (eBPF), Flannel (VXLAN), AWS VPC CNI.

How Pod-to-Pod Works (Cross-Node)

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph NODE1["Node 1 (10.0.1.10)"]
        PA["Pod A<br/>10.244.1.8"]
        CB1["cbr0 bridge"]
    end

    subgraph NODE2["Node 2 (10.0.1.11)"]
        PB["Pod B<br/>10.244.2.3"]
        CB2["cbr0 bridge"]
    end

    PA --> CB1 -->|"VXLAN tunnel / BGP route"| CB2 --> PB

    style NODE1 fill:#DBEAFE,stroke:#3B82F6,color:#1E40AF
    style NODE2 fill:#D1FAE5,stroke:#10B981,color:#065F46

Network Policies — Pod-Level Firewall

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
graph LR
    API["api pods<br/><small>label: app=api</small>"] -->|"ALLOWED on :5432"| DB["postgres pods<br/><small>label: app=postgres</small>"]
    FE["frontend pods"] -->|"ALLOWED on :8080"| API
    FE -.->|"DENIED"| DB
    OTHER["any other pod"] -.->|"DENIED"| DB

    style DB fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6
    style OTHER fill:#FEE2E2,stroke:#EF4444,color:#991B1B

Default = No Restrictions

Without NetworkPolicies, every pod can talk to every other pod. In production, always apply default-deny then explicitly allow needed traffic. This is defense-in-depth.


6. Storage & Persistence

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    POD["Pod"] -->|"volumeMount"| PVC["PVC<br/><small>'give me 100Gi fast SSD'</small>"]
    PVC -->|"bound"| PV["PV<br/><small>actual provisioned disk</small>"]
    SC["StorageClass<br/><small>'fast-ssd' → gp3, 5000 IOPS</small>"] -->|"dynamic provisioning"| PV
    PV -->|"backed by"| DISK["AWS EBS / GCP PD / Azure Disk"]

    style PVC fill:#DBEAFE,stroke:#3B82F6,color:#1E40AF
    style PV fill:#D1FAE5,stroke:#10B981,color:#065F46
    style SC fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6

Access modes (determines how many nodes can mount):

Mode Meaning Storage types
RWO (ReadWriteOnce) Single node read/write EBS, GCP PD, Azure Disk (block storage)
ROX (ReadOnlyMany) Many nodes read-only NFS, object store mounts
RWX (ReadWriteMany) Many nodes read/write EFS, CephFS, NFS, GlusterFS

7. Security — RBAC, Secrets, Pod Security

RBAC — Who Can Do What

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph SCOPE["Namespace: 'production'"]
        ROLE["Role: 'developer'<br/><small>pods: get, list, watch<br/>deployments: get, list, create, update</small>"]
    end

    USER["User: vamsi<br/>Group: dev-team"] -->|"RoleBinding"| ROLE

    subgraph GLOBAL["Cluster-Wide"]
        CR["ClusterRole: 'readonly'<br/><small>all resources: get, list, watch</small>"]
    end

    SRE["Group: sre-team"] -->|"ClusterRoleBinding"| CR

    style ROLE fill:#D1FAE5,stroke:#10B981,color:#065F46
    style CR fill:#EDE9FE,stroke:#8B5CF6,color:#5B21B6

Best practices:

  • Principle of least privilege — start with nothing, grant only what's needed
  • Namespace-scoped Roles whenever possible (not ClusterRoles)
  • Use Groups, not individual Users
  • Audit: kubectl auth can-i --list --as=vamsi -n production

Secrets — The Reality

K8s Secrets are base64-encoded, NOT encrypted. Anyone with kubectl get secret can decode them.

Production secret management:

Level Approach Security
Basic K8s Secret with RBAC Low (base64 in etcd)
Better etcd encryption at rest Medium (encrypted in storage)
Good External Secrets Operator + Vault High (secrets never stored in K8s)
Best Workload Identity + cloud KMS Highest (no secrets in cluster at all)

8. Helm — Templating & Package Management

Helm is to Kubernetes what apt/brew is to Linux/macOS. Instead of writing 15 YAML files for Prometheus + Grafana + AlertManager, you helm install a pre-built chart and override values.
Bash
# Install a complex application in one command
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=secret \
  --set prometheus.retention=30d

# Upgrade with new values
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  -f values-prod.yaml --atomic --wait

# Rollback if something breaks
helm rollback monitoring 1

CI/CD pattern: helm upgrade --install --atomic --wait — idempotent (install or upgrade), auto-rollback on failure, block until healthy.


9. Production Operations Playbook

Resource Requests & Limits on every container

Requests = guaranteed minimum (scheduler uses this for placement). Limits = hard cap (OOM killed if exceeded). Without requests, the scheduler can overpack nodes. Without limits, one pod can starve others.

Pod Disruption Budgets (PDB)

During node drain (maintenance, cluster upgrade), K8s evicts pods. PDB says "you must keep at least 2 running" — prevents accidental service outage during voluntary disruptions.

Topology Spread / Anti-Affinity

Don't put all 3 replicas on the same node — one node failure takes your service down. Spread across nodes and availability zones.

HPA + Cluster Autoscaler

HPA scales pods. But what if nodes are full? Cluster Autoscaler watches for unschedulable pods and provisions new nodes from your cloud provider (2-5 min). Together: handles any traffic spike.

GitOps (ArgoCD / Flux)

Don't kubectl apply from laptops. Cluster state lives in git. PR to deploy. ArgoCD syncs continuously. Drift detection. Audit trail. Automatic rollback on git revert.

Monitoring Stack

Prometheus (metrics) + Grafana (dashboards) + AlertManager (alerts) + Loki (logs). USE method for infra: Utilization, Saturation, Errors. RED method for services: Rate, Errors, Duration.


10. Debugging Production Issues

Pod stuck in Pending

kubectl describe pod → check Events. Usually: insufficient resources (no node has enough CPU/memory), unmatched node selector/affinity, or PVC can't bind. Fix: scale down other workloads, add nodes, or fix the selector.

Pod in CrashLoopBackOff

Container starts, crashes, restarts with exponential backoff. kubectl logs <pod> --previous shows the crash output. Common causes: missing config/secret, wrong command, failed health check, OOM (exit 137).

Pod Running but not receiving traffic

Readiness probe failing → pod removed from Service endpoints. Check: kubectl get endpoints <service>. If empty, verify label selector matches pod labels. Exec into pod and test the health endpoint.

Service DNS not resolving

Check CoreDNS pods in kube-system: kubectl get pods -n kube-system -l k8s-app=kube-dns. Exec into a pod and test: nslookup my-service.default.svc.cluster.local. Check if NetworkPolicy blocks DNS (port 53).

Node NotReady

kubelet stopped heartbeating. SSH into node, check: systemctl status kubelet, journalctl -u kubelet. Common causes: disk pressure, memory pressure, container runtime crashed, network partition from control plane.


11. Interview Q&A (FAANG-Level)

Walk through the complete lifecycle of a request from browser to a pod
  1. Browser resolves DNS → cloud LB's public IP
  2. Cloud LB forwards to NodePort on a worker node
  3. iptables DNAT rule (programmed by kube-proxy) rewrites destination to a pod IP
  4. If pod is on another node: packet sent to that node via overlay network (VXLAN) or direct routing (Calico BGP)
  5. Packet arrives at target node → enters pod's network namespace via veth pair
  6. Pod's container process receives the request on its listening port
  7. Response takes the same path back (SNAT for external traffic)
How does K8s handle a node failure? Walk through the timeline.
  • T+0: Node crashes (hardware failure, network partition, OOM)
  • T+10s: kubelet stops sending heartbeats to API server
  • T+40s: Node controller marks node as Unknown (configurable: --node-monitor-grace-period)
  • T+5m: Node controller begins pod eviction (configurable: --pod-eviction-timeout)
  • T+5m+5s: Pods on dead node get status Terminating. ReplicaSet controller creates replacement pods
  • T+5m+10s: Scheduler places new pods on healthy nodes. kubelet starts containers
  • T+5m+30s: New pods pass readiness probes → added to Service endpoints. Traffic restored.

For faster recovery: Use pod-level terminationGracePeriodSeconds: 30 and configure shorter node-monitor timeouts (but beware of false positives during network blips).

Explain how kube-proxy works in iptables mode vs IPVS mode

iptables mode (default): Creates one iptables rule per service endpoint. For a service with 3 pods, creates a chain with probability-based rules (⅓, ½, 1/1). O(n) rule evaluation — becomes slow at 10,000+ services.

IPVS mode: Uses kernel IPVS (IP Virtual Server) for load balancing. O(1) lookup using hash tables. Supports multiple LB algorithms (round-robin, least-connections, source-hash). Much better for large clusters (5,000+ services). Enable with --proxy-mode=ipvs on kube-proxy.

What are taints and tolerations? Give a real-world example.

Taints on nodes repel pods. Tolerations on pods allow them to schedule onto tainted nodes.

Real-world: You have GPU nodes (expensive). Without taints, any pod could get scheduled there. Solution: - Taint GPU nodes: kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule - Only ML pods tolerate it: tolerations: [{key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule"}]

Other uses: Dedicate nodes per team, keep workloads off control-plane nodes, drain nodes for maintenance (NoExecute effect evicts running pods).

Design a production K8s architecture for a multi-region e-commerce platform
Text Only
Global: Route53 / Cloud DNS (latency-based routing per region)

Per Region:
├── EKS/GKE cluster (3 AZs, managed control plane)
├── Node pools: general (m5.xlarge), memory-optimized (r5.2xlarge), GPU (p3.2xlarge)
├── Ingress: AWS ALB Ingress Controller + WAF + cert-manager
├── Service mesh: Istio (mTLS, observability, traffic management)
├── Stateless services: Deployments + HPA + PDB
├── Stateful: AWS RDS (outside K8s) / ElastiCache (outside K8s)
├── Async: Kafka on dedicated StatefulSet nodes OR MSK
├── Secrets: External Secrets Operator + AWS Secrets Manager
├── Observability: Prometheus + Grafana + Loki + Jaeger
├── GitOps: ArgoCD (app-of-apps pattern)
└── Policy: OPA Gatekeeper (enforce labels, resource limits, image sources)

Key decisions: Databases OUTSIDE K8s (managed services). Stateless in K8s (easy to scale/deploy). Service mesh for zero-trust networking. Multi-AZ for HA, multi-region for DR.

Explain the K8s scheduler's algorithm in detail
  1. Filtering (predicate functions): Eliminate nodes that can't run the pod
    • Insufficient resources (CPU/memory requests > available)
    • Node selector / affinity doesn't match
    • Taints without matching tolerations
    • Pod topology spread constraints violated
  2. Scoring (priority functions): Rank remaining nodes 0-100
    • LeastRequestedPriority — prefer nodes with more free resources
    • BalancedResourceAllocation — balance CPU/memory usage ratio
    • InterPodAffinityPriority — honor pod affinity/anti-affinity weights
    • NodeAffinityPriority — prefer nodes matching preferred affinity
  3. Binding: Highest score wins. Update pod's nodeName in etcd.

If multiple nodes tie, scheduler picks randomly. If NO node passes filtering, pod stays Pending.

How would you implement canary deployments in K8s?

Option 1: Native (basic) - Two Deployments: api-stable (90% replicas) and api-canary (10% replicas) - Same Service selector matches both → ~10% traffic to canary - Manual: watch metrics, scale canary up if healthy

Option 2: Istio/Linkerd (production) - VirtualService with traffic splitting: 95% → stable, 5% → canary - Automated: Flagger watches error rate/latency, gradually shifts traffic - If canary degrades → automatic rollback

Option 3: Argo Rollouts - Custom Rollout resource with canary strategy - Step-based: 10% → wait 5min → 30% → wait 5min → 100% - Integrates with Prometheus for automated analysis