7 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

MLOps & Production AI

Q: 1. What is MLOps and why is it different from DevOps?

MLOps applies DevOps principles to ML systems but adds three unique dimensions: data versioning (code alone doesn't reproduce results), model decay (models degrade over time even without code changes), and experimental nature (you can't unit-test whether a model is "good enough" the same way you test code correctness). Traditional DevOps manages code deployments; MLOps manages code + data + model artifacts + training pipelines + monitoring for statistical correctness.

Q: 2. Explain training-serving skew and how to prevent it.

Training-serving skew occurs when features are computed differently during training vs inference. Example: training uses batch-computed 30-day averages from a data warehouse, but serving computes them in real-time with slightly different logic. Prevention: use a feature store (Feast, Tecton) that serves the same feature definitions to both training and serving. Also implement data validation checks that compare training feature distributions against serving-time distributions.

Q: 3. How would you detect and handle model drift in production?

Detection: Monitor input data distributions using statistical tests (KS test for continuous features, chi-square for categorical). Track PSI (Population Stability Index) — values > 0.2 indicate significant drift. Monitor prediction distribution changes and, when available, actual vs predicted outcomes. Handling: Set up automated alerts at thresholds. For gradual drift, schedule regular retraining (weekly/monthly). For sudden drift (like COVID), trigger immediate retraining with recent data, p

Q: 4. Compare data parallelism vs model parallelism in distributed training.

Data parallelism: Same model replicated on each GPU. Different data batches processed simultaneously. Gradients synchronized via all-reduce. Works when model fits on one GPU. Scales to hundreds of GPUs easily. Model parallelism: Model split across GPUs (different layers on different devices). Required when model is too large for one GPU (70B+ parameters). More complex communication patterns. Higher inter-GPU bandwidth requirements. Often combined: model parallelism within a node, data parallelis

Q: 6. Design a CI/CD pipeline for an ML system. What tests would you include?

Four layers: (1) Data tests — schema validation, null checks, distribution tests, freshness checks. (2) Feature tests — unit tests for transformation logic, integration tests for feature pipelines. (3) Model tests — minimum accuracy thresholds, no-regression tests vs current production model, fairness/bias checks across protected groups, latency benchmarks. (4) Integration tests — end-to-end pipeline runs, serving endpoint health checks, load testing. Trigger on: code changes (standard CI), data

Q: 7. How do you choose between real-time, batch, and streaming inference?

Real-time: User is waiting for a response. Latency < 100ms required. Examples: fraud detection, search ranking, chatbots. Batch: No immediate user waiting. Process large volumes efficiently. Examples: nightly recommendation updates, credit scoring for all customers, email campaigns. Streaming: Near real-time on continuous data. Examples: anomaly detection on IoT sensor streams, real-time personalization on click streams. Decision factors: latency SLA, cost tolerance, data freshness requiremen

Q: 8. Explain quantization tradeoffs for LLM deployment.

Quantization reduces weight precision (FP32 to FP16/INT8/INT4). Tradeoffs: Speed — lower precision = faster matrix ops + less memory bandwidth needed. Memory — 4-bit model uses 8x less VRAM than FP32. Quality — some degradation, especially on reasoning tasks. Methods: GPTQ (post-training, GPU-optimized), AWQ (activation-aware, better quality preservation), GGUF (flexible bit-widths, CPU-friendly). Rule of thumb: FP16 is free lunch (negligible quality loss). INT8 is almost free. INT4 requires car

Q: 9. What monitoring would you set up for a production ML system?

Four layers: (1) Infrastructure — latency p50/p95/p99, throughput, GPU utilization, memory, error rates. (2) Data quality — input feature distributions vs training baseline, missing values, schema violations. (3) Model performance — prediction distribution shifts, confidence score distributions, accuracy/precision/recall when ground truth available (often delayed). (4) Business metrics — conversion rates, revenue impact, user engagement. Alert thresholds at each layer. Use tools like Evidently f

Q: 10. How would you handle a scenario where offline model metrics are great but production performance is poor?

Systematic investigation: (1) Data leakage — check if test set has future information or overlap with training. (2) Distribution mismatch — compare production data distribution to training/test data. (3) Feature engineering bugs — verify features are computed identically in training and serving (training-serving skew). (4) Proxy metric problem — your offline metric (e.g., accuracy) may not correlate with the actual business goal. (5) Feedback loops — model predictions may change user behavior, i

Training a model is like building a race car in a garage. MLOps is everything needed to win actual races — fuel, pit stops, tire changes, telemetry, and keeping the engine alive at 200 mph.

What is MLOps

MLOps = DevOps principles applied to Machine Learning systems. It bridges the gap between "my model works in a notebook" and "my model serves 10M predictions per day reliably."

The Harsh Truth

87% of ML models never make it to production. The model is 5% of the work. The rest is data pipelines, infrastructure, monitoring, and keeping things from silently failing at 3 AM.

Why ML in production is fundamentally different from traditional software:

Traditional Software	ML Systems
Code changes break things	Data changes break things silently
Deterministic outputs	Probabilistic outputs
Unit tests catch bugs	Models can be "correct" but useless
Deploy once, done	Models decay over time (concept drift)
Version code	Version code + data + model + config
Logs tell you what failed	Predictions look fine until revenue drops

The ML Technical Debt Paper

Google's 2015 paper "Hidden Technical Debt in Machine Learning Systems" remains gospel. ML code is the tiny box in the center. Everything around it — data collection, feature extraction, monitoring, serving infrastructure — is where complexity lives.

ML Lifecycle

The full pipeline is a loop, not a line. Models are never "done."

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A[("Data<br/>Collection")] --> B{{"Feature<br/>Engineering"}}
    B --> C(["Model<br/>Training"])
    C --> D{"Evaluation<br/>& Validation"}
    D --> E[["Model<br/>Registry"]]
    E --> F(["Deployment<br/>& Serving"])
    F --> G{{"Monitoring<br/>& Observability"}}
    G --> H{"Drift<br/>Detection"}
    H -->|"Retrain"| A

    style A fill:#E3F2FD,stroke:#1565C0,color:#000
    style B fill:#FEF3C7,stroke:#D97706,color:#000
    style C fill:#EDE9FE,stroke:#7C3AED,color:#000
    style D fill:#FCE7F3,stroke:#DB2777,color:#000
    style E fill:#D1FAE5,stroke:#059669,color:#000
    style F fill:#E3F2FD,stroke:#1565C0,color:#000
    style G fill:#FEF3C7,stroke:#D97706,color:#000
    style H fill:#FCE7F3,stroke:#DB2777,color:#000

Maturity Levels:

Level 0 — Manual. Jupyter notebooks, manual deployment, no monitoring. "It works on my machine."
Level 1 — ML Pipeline Automation. Automated training, basic CI/CD, some monitoring.
Level 2 — CI/CD for ML. Automated testing, model validation gates, continuous training, full observability.

Data Engineering for ML

Your model is only as good as your data. Garbage in, garbage out. But also: slightly stale data in, catastrophically wrong predictions out.

Data Collection & Cleaning

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A[("Raw Data<br/>Sources")] --> B[/"Ingestion<br/>Pipeline"/]
    B --> C{"Validation<br/>& Quality Checks"}
    C --> D{{"Cleaning &<br/>Transformation"}}
    D --> E[("Feature<br/>Store")]
    E --> F(["Training<br/>Data"])
    E --> G(["Serving<br/>Data"])

    style A fill:#E3F2FD,stroke:#1565C0,color:#000
    style B fill:#FEF3C7,stroke:#D97706,color:#000
    style C fill:#FCE7F3,stroke:#DB2777,color:#000
    style D fill:#EDE9FE,stroke:#7C3AED,color:#000
    style E fill:#D1FAE5,stroke:#059669,color:#000
    style F fill:#E3F2FD,stroke:#1565C0,color:#000
    style G fill:#E3F2FD,stroke:#1565C0,color:#000

Key practices:

Schema validation — catch column type changes before they poison training
Null/outlier detection — automated alerts when data distribution shifts
Deduplication — duplicates bias your model toward repeated patterns
Labeling pipelines — tools like Label Studio, Snorkel (programmatic labeling), or human annotation queues

Data Versioning (DVC)

DVC (Data Version Control) works like Git but for datasets and models.

Bash

# Initialize DVC in your repo
dvc init

# Track a large dataset
dvc add data/training_set.parquet

# Push data to remote storage (S3, GCS, etc.)
dvc push

# Reproduce a specific experiment's data state
git checkout v1.2.0
dvc checkout

Why Not Just Use Git LFS?

Git LFS stores files, DVC stores pipelines. DVC tracks how data was produced, not just the final artifact. You can reproduce any historical dataset state with its full lineage.

Feature Stores

A feature store is a centralized repository for ML features. It solves the #1 pain: training-serving skew (features computed differently at training time vs serving time).

Feature Store	Best For	Key Strength
Feast	Open-source, flexible	Works with any infra
Tecton	Enterprise, real-time	Sub-10ms feature serving
Hopsworks	End-to-end platform	Built-in feature monitoring
Vertex AI Feature Store	GCP-native	Tight BigQuery integration
SageMaker Feature Store	AWS-native	Seamless with SageMaker pipelines

Feature engineering pipeline example:

Python

from feast import FeatureStore, Entity, FeatureView, Field
from feast.types import Float32, Int64

# Define entity
customer = Entity(name="customer_id", join_keys=["customer_id"])

# Define feature view
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    schema=[
        Field(name="total_purchases_30d", dtype=Float32),
        Field(name="avg_order_value", dtype=Float32),
        Field(name="days_since_last_purchase", dtype=Int64),
    ],
    source=bigquery_source,  # same source for training AND serving
    ttl=timedelta(days=1),
)

Experiment Tracking

Without experiment tracking, ML research is just guessing with extra steps. You ran 47 experiments last month — which hyperparameters produced that 0.3% accuracy boost?

Why You Need It

Reproducibility — recreate any past result exactly
Comparison — side-by-side metric visualization across runs
Collaboration — your team knows what was tried and what failed
Audit trail — required for regulated industries (finance, healthcare)

Tools Comparison

Tool	Self-Hosted	Cloud	Best For
MLflow	Yes	Yes	Open-source, general purpose
Weights & Biases	No	Yes	Beautiful UI, team collaboration
Neptune	No	Yes	Heavy experimentation teams
CometML	No	Yes	Production monitoring + tracking
TensorBoard	Yes	No	Quick local visualization

MLflow Example

Python

import mlflow

mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="xgboost-tuned"):
    # Log hyperparameters
    mlflow.log_params({
        "max_depth": 6,
        "learning_rate": 0.01,
        "n_estimators": 1000,
        "subsample": 0.8,
    })

    # Train model
    model = train_xgboost(params)

    # Log metrics
    mlflow.log_metrics({
        "auc_roc": 0.943,
        "precision": 0.891,
        "recall": 0.867,
        "f1": 0.879,
    })

    # Log the model artifact
    mlflow.xgboost.log_model(model, "model")

    # Log custom artifacts (confusion matrix, feature importance plots)
    mlflow.log_artifact("confusion_matrix.png")

Track Everything

Log your data hash, git commit SHA, environment specs, and random seeds. Future-you will thank present-you when a stakeholder asks "why did last Tuesday's model perform better?"

Model Training at Scale

Training a ResNet on your laptop is fine for learning. Training a 70B parameter LLM requires a different universe of tooling.

Distributed Training Strategies

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A{{"Distributed Training"}} --> B(["Data Parallelism"])
    A --> C(["Model Parallelism"])
    A --> D(["Pipeline Parallelism"])

    B --> B1[/"Same model on each GPU<br/>Different data batches<br/>Gradient sync"/]
    C --> C1[/"Model split across GPUs<br/>Each GPU holds layers<br/>Needed for huge models"/]
    D --> D1[/"Model split into stages<br/>Micro-batches pipelined<br/>Reduces bubble time"/]

    style A fill:#EDE9FE,stroke:#7C3AED,color:#000
    style B fill:#E3F2FD,stroke:#1565C0,color:#000
    style C fill:#FEF3C7,stroke:#D97706,color:#000
    style D fill:#D1FAE5,stroke:#059669,color:#000
    style B1 fill:#E3F2FD,stroke:#1565C0,color:#000
    style C1 fill:#FEF3C7,stroke:#D97706,color:#000
    style D1 fill:#D1FAE5,stroke:#059669,color:#000

Strategy	When to Use	Tools
Data Parallelism	Model fits on 1 GPU, want faster training	PyTorch DDP, Horovod
Model Parallelism	Model too large for 1 GPU	Megatron-LM, DeepSpeed
Pipeline Parallelism	Very deep models, multiple stages	GPipe, PipeDream
FSDP	Large models, memory-efficient	PyTorch FSDP
ZeRO	Optimizer state sharding	DeepSpeed ZeRO (Stage 1-3)

GPU Utilization Tips

GPU Utilization is Usually Terrible

Most training jobs achieve 30-50% GPU utilization. The GPU is idle waiting for data loading, gradient sync, or CPU preprocessing.

Fixes:

Prefetch data — load next batch while current batch trains
Mixed precision (FP16/BF16) — 2x memory savings, faster math units
Gradient accumulation — simulate larger batches without more memory
Compile models — torch.compile() fuses operations, reduces kernel launches
Profile first — use PyTorch Profiler or NVIDIA Nsight to find bottlenecks

Training Infrastructure

Option	Pros	Cons
Cloud (AWS/GCP/Azure)	Scale on demand, latest GPUs	Expensive at scale, egress costs
On-prem	Cheaper long-term, data stays local	Upfront capex, maintenance burden
Hybrid	Best of both worlds	Complexity of managing two environments
Managed (SageMaker, Vertex)	Handles orchestration	Vendor lock-in, less control

Model Evaluation

Your model has 95% accuracy. Congratulations. Is it useful? Maybe not.

Offline vs Online Evaluation

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A{{"Offline Evaluation"}} --> B[/"Hold-out Test Set"/]
    A --> C[/"Cross-Validation"/]
    A --> D[/"Stratified Splits"/]

    E{{"Online Evaluation"}} --> F(["A/B Testing"])
    E --> G(["Shadow Deployment"])
    E --> H(["Canary Release"])
    E --> I(["Interleaving"])

    style A fill:#E3F2FD,stroke:#1565C0,color:#000
    style E fill:#D1FAE5,stroke:#059669,color:#000
    style B fill:#E3F2FD,stroke:#1565C0,color:#000
    style C fill:#E3F2FD,stroke:#1565C0,color:#000
    style D fill:#E3F2FD,stroke:#1565C0,color:#000
    style F fill:#D1FAE5,stroke:#059669,color:#000
    style G fill:#D1FAE5,stroke:#059669,color:#000
    style H fill:#D1FAE5,stroke:#059669,color:#000
    style I fill:#D1FAE5,stroke:#059669,color:#000

Deployment Strategies for Model Evaluation

Strategy	How It Works	Risk Level
Shadow Deployment	New model runs alongside old, predictions logged but not served	Very Low
Canary Release	1-5% traffic to new model, monitor closely	Low
A/B Test	50/50 split, measure business metrics	Medium
Blue-Green	Instant switch between two full deployments	Medium
Multi-Armed Bandit	Dynamically route traffic to better-performing model	Low

When Offline Metrics Lie

Your model crushed the test set but tanked in production. Common reasons:

Data leakage — test set accidentally contained future information
Distribution shift — production data looks nothing like training data
Proxy metrics — you optimized accuracy, but the business cares about revenue
Feedback loops — your model's predictions change user behavior, which changes future data

Model Registry

A model registry is the "source of truth" for which models exist, which are production-ready, and who approved them.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A(("Experiment<br/>Tracking")) --> B[["Model<br/>Registry"]]
    B --> C[/"Staging"/]
    C --> D(["Production"])
    D --> E{{"Archived"}}

    B --> F[("Metadata:<br/>metrics, lineage,<br/>owner, approvals")]

    style A fill:#EDE9FE,stroke:#7C3AED,color:#000
    style B fill:#E3F2FD,stroke:#1565C0,color:#000
    style C fill:#FEF3C7,stroke:#D97706,color:#000
    style D fill:#D1FAE5,stroke:#059669,color:#000
    style E fill:#FCE7F3,stroke:#DB2777,color:#000
    style F fill:#E3F2FD,stroke:#1565C0,color:#000

Tools

Registry	Ecosystem	Key Feature
MLflow Model Registry	Open-source	Stage transitions, annotations
Vertex AI Model Registry	GCP	Auto-scaling serving integration
SageMaker Model Registry	AWS	Approval workflows, lineage
Neptune	Cloud	Comparison dashboards
Weights & Biases	Cloud	Artifact versioning + linking

Best Practices

Immutable versions — never overwrite a registered model
Promotion gates — automated checks before staging-to-production
Metadata — attach training data hash, evaluation metrics, owner
Rollback plan — always keep the previous production model ready

Model Serving

Getting predictions from a model to users. Sounds simple. It is not.

Serving Patterns

Pattern	Latency	Throughput	Use Case
Real-time (Online)	< 100ms	Medium	Fraud detection, recommendations
Batch	Minutes-hours	Very High	Email campaigns, risk scoring
Streaming	Near real-time	High	Anomaly detection on event streams
Edge	Ultra-low	Low	Mobile apps, IoT devices

Serving Infrastructure

Tool	Strengths	Best For
TF Serving	Mature, gRPC support	TensorFlow models
Triton Inference Server	Multi-framework, GPU optimized	High-throughput GPU serving
vLLM	PagedAttention, continuous batching	LLM serving
BentoML	Pythonic, easy packaging	Rapid prototyping to production
Ray Serve	Scalable, composable	Complex inference graphs
Seldon Core	Kubernetes-native	Enterprise ML deployment
KServe	Serverless, auto-scaling	Kubernetes-first teams

Latency vs Throughput Tradeoffs

The Batching Sweet Spot

Individual requests = lowest latency but worst GPU utilization. Large batches = highest throughput but higher latency. Dynamic batching (collecting requests over a short window) hits the sweet spot.

Python

# BentoML example: packaging and serving a model
import bentoml

@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 30, "concurrency": 32},
)
class FraudDetector:
    def __init__(self):
        self.model = bentoml.xgboost.load_model("fraud_model:latest")

    @bentoml.api(batchable=True, batch_dim=0, max_batch_size=64)
    def predict(self, features: np.ndarray) -> np.ndarray:
        return self.model.predict_proba(features)[:, 1]

LLM Serving Specifically

LLMs are a special beast. They are autoregressive (generate one token at a time), memory-hungry, and expensive. Serving them efficiently is its own discipline.

Key Challenges

KV Cache — each token generation needs attention over all previous tokens. The KV cache stores precomputed key/value pairs. It grows linearly with sequence length.
Memory bandwidth bound — LLM inference is limited by how fast you can read weights from GPU memory, not by compute.
Long sequences — 128K context windows mean massive memory for a single request.

Serving Frameworks

Framework	Key Innovation	Best For
vLLM	PagedAttention (virtual memory for KV cache)	Production LLM serving
TGI (Text Generation Inference)	Continuous batching, tensor parallelism	HuggingFace models
Ollama	Dead-simple local serving	Development, edge
TensorRT-LLM	NVIDIA-optimized kernels	Maximum throughput on NVIDIA
llama.cpp	CPU/Metal inference, GGUF format	Local, resource-constrained

Quantization

Reduce model precision to serve faster with less memory. The tradeoff is (usually small) quality loss.

Method	Bits	Speed Gain	Quality Loss	Format
FP16	16	2x vs FP32	Negligible	Standard
GPTQ	4	~3-4x	Small	GPU-optimized
AWQ	4	~3-4x	Very small	Activation-aware
GGUF	2-8	Varies	Varies	CPU-friendly (llama.cpp)
SmoothQuant	8	~2x	Minimal	W8A8

Batching Strategies

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A{{"Batching Strategies"}} --> B(["Static Batching"])
    A --> C(["Dynamic Batching"])
    A --> D(["Continuous Batching"])

    B --> B1[/"Wait for fixed batch size<br/>Simple but wastes time"/]
    C --> C1[/"Collect requests for N ms<br/>Good balance"/]
    D --> D1[/"New requests join mid-generation<br/>Best GPU utilization<br/>Used by vLLM, TGI"/]

    style A fill:#EDE9FE,stroke:#7C3AED,color:#000
    style B fill:#E3F2FD,stroke:#1565C0,color:#000
    style C fill:#FEF3C7,stroke:#D97706,color:#000
    style D fill:#D1FAE5,stroke:#059669,color:#000
    style B1 fill:#E3F2FD,stroke:#1565C0,color:#000
    style C1 fill:#FEF3C7,stroke:#D97706,color:#000
    style D1 fill:#D1FAE5,stroke:#059669,color:#000

PagedAttention Explained Simply

Traditional serving allocates a contiguous block of GPU memory for each request's maximum possible sequence length. Most of it is wasted. PagedAttention treats KV cache like virtual memory — allocates small pages on demand, allows non-contiguous storage. Result: 2-4x more concurrent requests on the same GPU.

Monitoring & Observability

Traditional software monitoring asks: "Is the server up?" ML monitoring asks: "Are the predictions still correct even though the server looks healthy?"

What to Monitor

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A(("ML Monitoring")) --> B(["System Metrics"])
    A --> C(["Model Metrics"])
    A --> D(["Data Metrics"])
    A --> E(["Business Metrics"])

    B --> B1[/"Latency, throughput,<br/>GPU util, memory"/]
    C --> C1[/"Prediction distribution,<br/>confidence scores"/]
    D --> D1[/"Feature drift,<br/>missing values, schema"/]
    E --> E1[/"Conversion rate,<br/>revenue, user satisfaction"/]

    style A fill:#EDE9FE,stroke:#7C3AED,color:#000
    style B fill:#E3F2FD,stroke:#1565C0,color:#000
    style C fill:#FEF3C7,stroke:#D97706,color:#000
    style D fill:#D1FAE5,stroke:#059669,color:#000
    style E fill:#FCE7F3,stroke:#DB2777,color:#000
    style B1 fill:#E3F2FD,stroke:#1565C0,color:#000
    style C1 fill:#FEF3C7,stroke:#D97706,color:#000
    style D1 fill:#D1FAE5,stroke:#059669,color:#000
    style E1 fill:#FCE7F3,stroke:#DB2777,color:#000

Statistical Tests for Drift Detection

Test	Detects	How It Works
KS Test (Kolmogorov-Smirnov)	Distribution shift in continuous features	Compares CDFs of reference vs current
PSI (Population Stability Index)	Overall population shift	Bins distributions, measures divergence
Chi-Square Test	Shift in categorical features	Compares observed vs expected frequencies
Jensen-Shannon Divergence	Distribution difference	Symmetric version of KL divergence
Page-Hinkley Test	Change point in streaming data	Detects mean shifts in sequential data

Monitoring Tools

Tool	Type	Strength
Evidently AI	Open-source	Beautiful reports, drift dashboards
Arize	Cloud platform	Root cause analysis, embedding drift
WhyLabs	Cloud platform	Real-time profiling, anomaly detection
Fiddler	Enterprise	Explainability + monitoring
Prometheus + Grafana	General infra	Custom ML metrics via exporters
NannyML	Open-source	Estimates performance without ground truth

The Silent Killer

A model can serve predictions with 100% uptime, zero errors, perfect latency — and still be catastrophically wrong. Unlike a crashed server, a drifted model fails silently. You need proactive drift detection, not just reactive alerting.

Data & Model Drift

Drift is when the world changes but your model doesn't know. It is the #1 reason production models degrade over time.

Types of Drift

Type	Definition	Example
Data Drift (Covariate Shift)	Input distribution changes	Users shift from desktop to mobile
Concept Drift	Relationship between input and output changes	"Spam" definition evolves
Label Drift	Target distribution changes	Fraud rate increases during holidays
Feature Drift	Individual features shift	Salary feature inflates over time

Real-World Examples

COVID: The Ultimate Drift Event

In March 2020, almost every production ML model broke simultaneously:

Demand forecasting models predicted normal shopping patterns
Fraud detection flagged legitimate bulk purchases as suspicious
Credit scoring models couldn't handle mass unemployment
Recommendation systems kept suggesting travel and restaurants

Lesson: your model assumes the future looks like the past. Sometimes it doesn't.

Detection and Response

Python

from evidently.metrics import DataDriftTable
from evidently.report import Report

# Compare current production data to training reference
drift_report = Report(metrics=[DataDriftTable()])
drift_report.run(
    reference_data=training_df,
    current_data=production_df_last_24h
)

# Check if drift detected
results = drift_report.as_dict()
drift_detected = results["metrics"][0]["result"]["dataset_drift"]

if drift_detected:
    alert_team()
    trigger_retraining_pipeline()

When to retrain:

PSI > 0.2 on critical features
Model accuracy drops below threshold (if ground truth available)
Business metrics decline (conversion rate, revenue per user)
Scheduled (weekly/monthly) regardless — cheap insurance

CI/CD for ML

Traditional CI/CD tests code. ML CI/CD tests code + data + model quality. Three dimensions of correctness.

Testing Pyramid for ML

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A{{"Integration Tests<br/>Full pipeline runs end-to-end"}} 
    B(["Model Validation Tests<br/>Accuracy thresholds, fairness checks"])
    C{"Data Validation Tests<br/>Schema, distributions, completeness"}
    D[["Unit Tests<br/>Feature transforms, preprocessing functions"]]

    A --> B --> C --> D

    style A fill:#FCE7F3,stroke:#DB2777,color:#000
    style B fill:#EDE9FE,stroke:#7C3AED,color:#000
    style C fill:#FEF3C7,stroke:#D97706,color:#000
    style D fill:#D1FAE5,stroke:#059669,color:#000

What to Test

Data Tests:

Python

def test_no_null_critical_features():
    """Critical features must never be null."""
    df = load_latest_training_data()
    critical = ["user_id", "transaction_amount", "timestamp"]
    for col in critical:
        assert df[col].isnull().sum() == 0, f"{col} has nulls!"

def test_feature_ranges():
    """Features must be within expected bounds."""
    df = load_latest_training_data()
    assert df["age"].between(0, 150).all()
    assert df["transaction_amount"].ge(0).all()

def test_data_freshness():
    """Training data must be recent."""
    latest = load_latest_training_data()["timestamp"].max()
    assert (datetime.now() - latest).days < 7

Model Tests:

Python

def test_model_accuracy_above_threshold():
    """New model must beat minimum accuracy."""
    model = load_candidate_model()
    metrics = evaluate(model, test_set)
    assert metrics["auc_roc"] > 0.90

def test_model_no_regression():
    """New model must not be worse than current production model."""
    candidate = evaluate(load_candidate_model(), test_set)
    production = evaluate(load_production_model(), test_set)
    assert candidate["auc_roc"] >= production["auc_roc"] - 0.01  # allow 1% tolerance

def test_model_fairness():
    """Model must not discriminate across protected groups."""
    predictions = model.predict(test_set)
    for group in ["gender", "race", "age_bucket"]:
        group_metrics = compute_group_metrics(predictions, test_set, group)
        assert max(group_metrics) - min(group_metrics) < 0.05

GitHub Actions for ML

YAML

# .github/workflows/ml-pipeline.yml
name: ML Pipeline CI/CD

on:
  push:
    paths: ['src/models/**', 'src/features/**', 'data/**']

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python tests/test_data_quality.py
      - run: dvc pull  # get latest data
      - run: python src/validate_schema.py

  train-and-evaluate:
    needs: data-validation
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: python src/train.py --config configs/prod.yaml
      - run: python src/evaluate.py --model outputs/model.pkl
      - run: python tests/test_model_quality.py

  deploy:
    needs: train-and-evaluate
    if: github.ref == 'refs/heads/main'
    steps:
      - run: python src/register_model.py --stage production
      - run: kubectl rollout restart deployment/model-server

Cost Optimization

GPUs are expensive. A single A100 costs ~$2/hour. A training run on 64 GPUs for a week = $21,504. Every percentage point of efficiency matters.

GPU Cost Strategies

Strategy	Savings	Risk
Spot/Preemptible Instances	60-90%	Interruption (use checkpointing)
Reserved Instances	30-60%	Commitment required
Right-sizing	20-40%	None if profiled correctly
Mixed Precision Training	30-50% (time)	Negligible quality loss
Gradient Checkpointing	2-4x batch size	20-30% slower

Model Compression Techniques

Technique	How It Works	Typical Compression
Quantization	Reduce precision (FP32 to INT8)	2-4x smaller
Distillation	Train small model to mimic large one	3-10x smaller
Pruning	Remove unimportant weights	2-5x smaller
Low-Rank Factorization	Decompose weight matrices	2-3x smaller
Architecture Search	Find efficient architectures	Varies

Inference Cost Optimization

The Caching Insight

If 30% of your LLM queries are near-duplicates, a semantic cache (embedding similarity lookup) can save 30% of GPU costs. Tools: GPTCache, Redis + vector search.

Python

# Semantic caching example
import hashlib
from redis import Redis
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache = Redis()

def get_or_generate(prompt: str, threshold: float = 0.95):
    embedding = encoder.encode(prompt)

    # Check cache for similar prompts
    cached = vector_search(cache, embedding, threshold)
    if cached:
        return cached["response"]  # cache hit — $0 GPU cost

    # Cache miss — generate and store
    response = llm.generate(prompt)
    store_in_cache(cache, prompt, embedding, response)
    return response

AI Ethics & Safety

Deploying AI without safety guardrails is like shipping a car without brakes. It might go fast, but the crash is inevitable.

Threat Landscape

Threat	What Happens	Mitigation
Hallucination	Model confidently generates false info	RAG grounding, confidence thresholds, citations
Bias	Model discriminates against groups	Fairness metrics, balanced training data, bias audits
PII Leakage	Model memorizes/outputs personal data	Data sanitization, differential privacy, output filtering
Prompt Injection	Attacker hijacks model behavior via input	Input validation, system prompt hardening, output guardrails
Data Poisoning	Attacker corrupts training data	Data provenance, anomaly detection on training data
Model Stealing	Attacker extracts model via API	Rate limiting, watermarking, output perturbation

Practical Mitigations

Hallucination:

Python

# Ground responses in retrieved facts
def safe_generate(query: str) -> str:
    # Retrieve relevant documents
    docs = vector_store.similarity_search(query, k=5)

    # Generate with grounding
    response = llm.generate(
        prompt=f"Answer based ONLY on these documents: {docs}\nQuery: {query}",
        temperature=0.1,  # lower temperature = less creative = fewer hallucinations
    )

    # Verify claims against source documents
    if not verify_claims(response, docs):
        return "I don't have enough information to answer this reliably."
    return response

Prompt Injection Defense:

Python

def sanitize_input(user_input: str) -> str:
    # Detect injection patterns
    injection_patterns = [
        "ignore previous instructions",
        "you are now",
        "system prompt",
        "disregard",
    ]
    for pattern in injection_patterns:
        if pattern.lower() in user_input.lower():
            raise SecurityException("Potential prompt injection detected")

    # Separate user content from system instructions
    return f"[USER_INPUT_START]{user_input}[USER_INPUT_END]"

Alignment is Not Solved

No amount of RLHF makes a model perfectly safe. Defense in depth: input filtering + system prompts + output filtering + human review for high-stakes decisions. Treat safety as layers, not a checkbox.

Infrastructure Patterns

Decision Matrix

Factor	Cloud	On-Prem	Hybrid
Upfront Cost	Low	High	Medium
Long-term Cost (3+ years)	High	Low	Medium
GPU Availability	Variable	Guaranteed	Both
Data Sovereignty	Challenging	Easy	Manageable
Scale Elasticity	Excellent	Limited	Good
Operational Burden	Low	High	Medium

Kubernetes for ML

YAML

# Example: GPU-enabled ML serving pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-a100
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"

Serverless Inference

Platform	Cold Start	Max Timeout	Best For
AWS Lambda	1-10s	15 min	Light models, preprocessing
Google Cloud Run	1-5s	60 min	Container-based, medium models
Azure Container Instances	5-30s	Unlimited	Burst workloads
Modal	~1s (warm GPUs)	Unlimited	GPU inference, ML-native
Replicate	Varies	Unlimited	Model sharing, quick deploy

When to Go Serverless

Serverless works great for bursty, low-traffic models (< 100 req/s). For sustained high traffic, dedicated GPU instances are cheaper. The break-even point is typically around 30-50% utilization.

Full Production Architecture

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    A(("Client<br/>Requests")) --> B{{"API Gateway<br/>+ Rate Limiting"}}
    B --> C[/"Load Balancer"/]
    C --> D[["Model Server<br/>Cluster (GPU)"]]
    D --> E[("Feature Store<br/>(Real-time)")]
    D --> F[("Model Registry")]
    D --> G(["Response Cache"])

    H(("Data Pipeline")) --> I{{"Feature<br/>Engineering"}}
    I --> E
    I --> J(["Training<br/>Pipeline"])
    J --> K(["Experiment<br/>Tracking"])
    K --> F

    D --> L{{"Monitoring<br/>& Alerting"}}
    L --> M{"Drift<br/>Detection"}
    M -->|"Trigger"| J

    style A fill:#E3F2FD,stroke:#1565C0,color:#000
    style B fill:#FEF3C7,stroke:#D97706,color:#000
    style C fill:#FEF3C7,stroke:#D97706,color:#000
    style D fill:#EDE9FE,stroke:#7C3AED,color:#000
    style E fill:#D1FAE5,stroke:#059669,color:#000
    style F fill:#D1FAE5,stroke:#059669,color:#000
    style G fill:#FCE7F3,stroke:#DB2777,color:#000
    style H fill:#E3F2FD,stroke:#1565C0,color:#000
    style I fill:#FEF3C7,stroke:#D97706,color:#000
    style J fill:#EDE9FE,stroke:#7C3AED,color:#000
    style K fill:#EDE9FE,stroke:#7C3AED,color:#000
    style L fill:#FCE7F3,stroke:#DB2777,color:#000
    style M fill:#FCE7F3,stroke:#DB2777,color:#000

Interview Questions

1. What is MLOps and why is it different from DevOps?

MLOps applies DevOps principles to ML systems but adds three unique dimensions: data versioning (code alone doesn't reproduce results), model decay (models degrade over time even without code changes), and experimental nature (you can't unit-test whether a model is "good enough" the same way you test code correctness). Traditional DevOps manages code deployments; MLOps manages code + data + model artifacts + training pipelines + monitoring for statistical correctness.

2. Explain training-serving skew and how to prevent it.

Training-serving skew occurs when features are computed differently during training vs inference. Example: training uses batch-computed 30-day averages from a data warehouse, but serving computes them in real-time with slightly different logic. Prevention: use a feature store (Feast, Tecton) that serves the same feature definitions to both training and serving. Also implement data validation checks that compare training feature distributions against serving-time distributions.

3. How would you detect and handle model drift in production?

Detection: Monitor input data distributions using statistical tests (KS test for continuous features, chi-square for categorical). Track PSI (Population Stability Index) — values > 0.2 indicate significant drift. Monitor prediction distribution changes and, when available, actual vs predicted outcomes. Handling: Set up automated alerts at thresholds. For gradual drift, schedule regular retraining (weekly/monthly). For sudden drift (like COVID), trigger immediate retraining with recent data, potentially with a shorter lookback window.

4. Compare data parallelism vs model parallelism in distributed training.

Data parallelism: Same model replicated on each GPU. Different data batches processed simultaneously. Gradients synchronized via all-reduce. Works when model fits on one GPU. Scales to hundreds of GPUs easily. Model parallelism: Model split across GPUs (different layers on different devices). Required when model is too large for one GPU (70B+ parameters). More complex communication patterns. Higher inter-GPU bandwidth requirements. Often combined: model parallelism within a node, data parallelism across nodes.

5. What is PagedAttention and why does it matter for LLM serving?

PagedAttention (introduced by vLLM) applies virtual memory concepts to KV cache management. Traditional serving pre-allocates contiguous memory for the maximum sequence length per request — most of it wasted. PagedAttention allocates small fixed-size pages on demand and allows non-contiguous storage. Benefits: 2-4x more concurrent requests on the same GPU, near-zero memory waste, efficient memory sharing for beam search and parallel sampling.

6. Design a CI/CD pipeline for an ML system. What tests would you include?

Four layers: (1) Data tests — schema validation, null checks, distribution tests, freshness checks. (2) Feature tests — unit tests for transformation logic, integration tests for feature pipelines. (3) Model tests — minimum accuracy thresholds, no-regression tests vs current production model, fairness/bias checks across protected groups, latency benchmarks. (4) Integration tests — end-to-end pipeline runs, serving endpoint health checks, load testing. Trigger on: code changes (standard CI), data changes (data-aware CI), and scheduled intervals (catch drift).

7. How do you choose between real-time, batch, and streaming inference?

Real-time: User is waiting for a response. Latency < 100ms required. Examples: fraud detection, search ranking, chatbots. Batch: No immediate user waiting. Process large volumes efficiently. Examples: nightly recommendation updates, credit scoring for all customers, email campaigns. Streaming: Near real-time on continuous data. Examples: anomaly detection on IoT sensor streams, real-time personalization on click streams. Decision factors: latency SLA, cost tolerance, data freshness requirements, and request volume patterns.

8. Explain quantization tradeoffs for LLM deployment.

Quantization reduces weight precision (FP32 to FP16/INT8/INT4). Tradeoffs: Speed — lower precision = faster matrix ops + less memory bandwidth needed. Memory — 4-bit model uses 8x less VRAM than FP32. Quality — some degradation, especially on reasoning tasks. Methods: GPTQ (post-training, GPU-optimized), AWQ (activation-aware, better quality preservation), GGUF (flexible bit-widths, CPU-friendly). Rule of thumb: FP16 is free lunch (negligible quality loss). INT8 is almost free. INT4 requires careful evaluation per use case.

9. What monitoring would you set up for a production ML system?

Four layers: (1) Infrastructure — latency p50/p95/p99, throughput, GPU utilization, memory, error rates. (2) Data quality — input feature distributions vs training baseline, missing values, schema violations. (3) Model performance — prediction distribution shifts, confidence score distributions, accuracy/precision/recall when ground truth available (often delayed). (4) Business metrics — conversion rates, revenue impact, user engagement. Alert thresholds at each layer. Use tools like Evidently for drift, Prometheus/Grafana for infra, and custom dashboards for business metrics.

10. How would you handle a scenario where offline model metrics are great but production performance is poor?

Systematic investigation: (1) Data leakage — check if test set has future information or overlap with training. (2) Distribution mismatch — compare production data distribution to training/test data. (3) Feature engineering bugs — verify features are computed identically in training and serving (training-serving skew). (4) Proxy metric problem — your offline metric (e.g., accuracy) may not correlate with the actual business goal. (5) Feedback loops — model predictions may change user behavior, invalidating the test set assumptions. Fix: deploy with shadow mode first, compare predictions against baseline model on real production traffic.

11. Describe strategies to reduce GPU inference costs by 50% or more.

Combined approach: (1) Quantization — INT8 or INT4 reduces memory 2-4x, enabling larger batches. (2) Dynamic batching — increase GPU utilization from 30% to 80%+. (3) Semantic caching — cache responses for similar queries (30%+ hit rate is common). (4) Model distillation — train a smaller model that mimics the large one. (5) Spot instances — 60-90% cheaper for fault-tolerant workloads. (6) Request routing — route simple queries to smaller/cheaper models, complex ones to large models. (7) Auto-scaling — scale down during low-traffic hours instead of paying for idle GPUs.

12. What is a feature store and when would you not need one?

A feature store is centralized infrastructure for computing, storing, and serving ML features consistently across training and inference. You need one when: multiple models share features, training-serving skew is a problem, feature computation is expensive and reusable, or team needs feature discovery/sharing. You do NOT need one when: single model with simple features, batch-only inference (same code runs training and prediction), small team with few models, or features are trivially computed at serving time (no skew risk).

13. How do you ensure fairness and prevent bias in a production ML system?

Multi-stage approach: (1) Data audit — check for representation imbalances, historical bias in labels. (2) Pre-processing — re-sampling, re-weighting underrepresented groups. (3) In-processing — fairness constraints during training (equalized odds, demographic parity). (4) Post-processing — threshold adjustment per group to equalize outcomes. (5) Monitoring — continuous fairness metric tracking across protected attributes in production. (6) Regular audits — scheduled bias audits as data/model changes. Tools: Fairlearn, AI Fairness 360, What-If Tool. Key: define fairness criteria upfront with stakeholders — different definitions (equal opportunity vs demographic parity) lead to different solutions.

14. Explain canary deployments for ML models. How do they differ from traditional canary deployments?

ML canary deployments route a small percentage (1-5%) of traffic to the new model while monitoring both system metrics AND model-specific metrics. Differences from traditional: (1) Longer evaluation period — model quality issues may take days/weeks to surface (delayed labels). (2) Statistical significance — need enough samples to confidently compare model performance. (3) Business metrics — must monitor downstream impact, not just error rates. (4) Rollback criteria — include prediction distribution divergence, confidence score drops, and business KPI degradation. Implementation: use service mesh (Istio) or feature flags to control traffic split. Promote only after statistical significance reached on key metrics.

15. Design a system for continuous model retraining. What triggers retraining and what safeguards would you add?

Triggers: (1) Scheduled — weekly/monthly as baseline. (2) Performance-based — accuracy drops below threshold. (3) Drift-based — PSI > 0.2 on critical features. (4) Data volume — enough new labeled data accumulated. Safeguards: (1) Automated validation — new model must pass all quality gates (accuracy, fairness, latency). (2) Champion-challenger — new model must beat current production model. (3) Gradual rollout — canary deployment, not instant switch. (4) Automatic rollback — if business metrics decline within 24 hours, revert automatically. (5) Human approval — for high-stakes models, require human sign-off before full promotion. (6) Data validation — verify training data quality before training begins. Architecture: orchestrator (Airflow/Kubeflow) triggers pipeline, trains model, runs validation, registers in model registry, triggers canary deployment.

MLOps & Production AI

What is MLOps

ML Lifecycle

Data Engineering for ML

Data Collection & Cleaning

Data Versioning (DVC)

Feature Stores

Experiment Tracking

Why You Need It

Tools Comparison

MLflow Example

Model Training at Scale

Distributed Training Strategies

GPU Utilization Tips

Training Infrastructure

Model Evaluation

Offline vs Online Evaluation

Deployment Strategies for Model Evaluation

Model Registry

Tools

Best Practices

Model Serving

Serving Patterns

Serving Infrastructure

Latency vs Throughput Tradeoffs

LLM Serving Specifically

Key Challenges

Serving Frameworks

Quantization

Batching Strategies

Monitoring & Observability

What to Monitor

Statistical Tests for Drift Detection

Monitoring Tools

Data & Model Drift

Types of Drift

Real-World Examples

Detection and Response

CI/CD for ML

Testing Pyramid for ML

What to Test

GitHub Actions for ML

Cost Optimization

GPU Cost Strategies

Model Compression Techniques

Inference Cost Optimization

AI Ethics & Safety

Threat Landscape

Practical Mitigations

Infrastructure Patterns

Decision Matrix

Kubernetes for ML

Serverless Inference

Full Production Architecture

Interview Questions

5-Minute System Design — Weekly