11 min read By Vamsi Karuturi · Senior Backend Engineer at Salesforce

Transformers & Large Language Models

Q: 1. Explain the self-attention mechanism. Why is it O(n^2)?

Self-attention computes pairwise interactions between all tokens. For each token, we compute dot products with every other token's key vector. With N tokens, that's N x N comparisons, making it O(n^2) in both time and memory. The formula is: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. The QK^T matrix multiplication produces an N x N attention matrix. This is why long context is expensive.

Q: 2. What is the difference between self-attention and cross-attention?

In self-attention, Q, K, and V all come from the same sequence — the input attends to itself. In cross-attention, Q comes from one sequence (decoder) while K and V come from another (encoder output). Cross-attention is how the decoder "looks at" encoder representations. Used in encoder-decoder models like T5 for translation, and in multimodal models where text attends to image features.

Q: 3. Why do we need positional encoding? Compare RoPE and ALiBi.

Transformers process all tokens in parallel with no inherent notion of order. Without positional encoding, "dog bites man" = "man bites dog." RoPE rotates Q and K vectors by position-dependent angles, encoding relative position through the angle difference. It's learned-free and extends well. ALiBi adds a linear bias to attention scores proportional to distance — no parameters, no embedding modification. RoPE is more popular (Llama, Mistral). ALiBi generalizes better to unseen lengths but is les

Q: 4. Explain the training pipeline: pre-training, SFT, RLHF, and DPO.

Pre-training: Next-token prediction on trillions of tokens. Creates a base model that can complete text but won't follow instructions. SFT: Fine-tune on ~100K curated instruction-response pairs. Model learns to be helpful. RLHF: Generate multiple responses, have humans rank them, train a reward model, then optimize the LLM with PPO. DPO: Skip the reward model, directly optimize from preference pairs — simpler and more stable. Each stage builds on the previous one.

Q: 5. What is the KV cache and why does it matter for inference?

During autoregressive generation, each new token needs attention over ALL previous tokens. Without caching, generating token N requires recomputing K and V for all N-1 previous tokens — O(N^2) total work for a sequence. KV cache stores previously computed key and value vectors, so each new token only computes its own Q and looks up cached K/V. Reduces generation from O(N^2) to O(N). The tradeoff: KV cache consumes linear memory per sequence per layer, which is why long-context serving is memory-

Q: 6. Compare encoder-only, decoder-only, and encoder-decoder architectures. Why did decoder-only win?

Encoder-only (BERT): bidirectional attention, great for classification/understanding, can't generate. Decoder-only (GPT): causal/left-to-right attention, next-token generation, handles everything via prompting. Encoder-decoder (T5): encoder sees full context, decoder generates — great for translation. Decoder-only won because: (1) one architecture scales to all tasks, (2) scaling laws favor it, (3) next-token prediction is the universal training objective, (4) simpler to train and serve at scale

Q: 7. Explain Flash Attention. Is it an approximation?

Flash Attention is NOT an approximation — it produces the exact same output as standard attention. It's a pure IO optimization. Standard attention materializes the full N x N attention matrix in slow GPU HBM (high-bandwidth memory). Flash Attention tiles the computation into blocks that fit in fast GPU SRAM, computes softmax incrementally (using the online softmax trick), and never materializes the full matrix. Result: 2-4x speedup, memory usage from O(N^2) to O(N), same mathematical output.

Q: 8. What are the Chinchilla scaling laws and how did they change LLM training?

Chinchilla (DeepMind, 2022) showed that for a fixed compute budget, you should scale model size and training data equally. The optimal ratio is roughly 20 tokens per parameter. Before Chinchilla, the community built huge models with insufficient data (GPT-3: 175B params, 300B tokens). Chinchilla (70B params, 1.4T tokens) matched GPT-3 with 3x fewer parameters. This shifted the field toward smaller, better-trained models. Modern trend: overtrain (Llama 3: 70B on 15T tokens) because inference cost

Q: 9. Explain quantization. How can a 4-bit model nearly match a 16-bit model?

Quantization reduces numerical precision of model weights. 16-bit uses 2 bytes per parameter; 4-bit uses 0.5 bytes — 4x reduction. It works because neural network weights are robust to small perturbations. Methods: GPTQ (post-training, one calibration pass), AWQ (activation-aware, protects salient weights), GGUF (mixed precision per layer). Quality is preserved because: (1) outlier weights are kept at higher precision, (2) calibration data ensures quantization error is minimized on real inputs,

Q: 10. What is speculative decoding and when would you use it?

Speculative decoding uses a small, fast draft model to generate N candidate tokens, then verifies all N with the large model in a single parallel forward pass. If the draft model predicted correctly (which it often does for common continuations), you get N tokens for the cost of one large-model inference. Rejection sampling ensures output quality equals the large model exactly. Use when: latency matters more than throughput, draft model matches large model's distribution well, and you can afford

From "Attention Is All You Need" to GPT-4o and Claude — how the transformer architecture revolutionized AI, explained so you can ace any interview AND build production systems.

Why Transformers

Before 2017, sequence modeling meant RNNs and LSTMs. They processed tokens one-by-one, left to right. Like reading a book by looking at one word at a time through a keyhole.

The RNN bottleneck:

Sequential processing = no parallelism = slow training
Long-range dependencies get "forgotten" (vanishing gradients)
Hidden state is a fixed-size bottleneck — all context must squeeze through it

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph RNN["RNN: Sequential (Slow)"]
        direction LR
        W1(["Word 1"]) --> H1{{"h1"}} --> H2{{"h2"}} --> H3{{"h3"}} --> H4{{"h4"}}
        W2(["Word 2"]) --> H2
        W3(["Word 3"]) --> H3
        W4(["Word 4"]) --> H4
    end

    subgraph TR["Transformer: Parallel (Fast)"]
        direction LR
        T1(("Word 1")) ~~~ T2(("Word 2")) ~~~ T3(("Word 3")) ~~~ T4(("Word 4"))
        T1 <--> T2
        T1 <--> T3
        T1 <--> T4
        T2 <--> T3
        T2 <--> T4
        T3 <--> T4
    end

    style RNN fill:#FEF3C7,stroke:#D97706,color:#000
    style TR fill:#D1FAE5,stroke:#059669,color:#000

In June 2017, Vaswani et al. published "Attention Is All You Need". The key insight: you don't need recurrence at all. Let every token attend to every other token in parallel. Training went from days to hours. Performance leaped.

The One-Liner

Transformers replaced sequential processing with parallel attention. Every word can "look at" every other word simultaneously.

Self-Attention Mechanism

The Party Analogy

Imagine you're at a party (the input sequence). You are one word in a sentence.

Query (Q): "What am I looking for?" — your question to the room
Key (K): "What do I have to offer?" — everyone's name tag
Value (V): "What information do I carry?" — the actual conversation content

You look at everyone's name tag (Q dot K), figure out who's relevant to you (softmax), then take a weighted mix of their conversations (multiply by V). That's attention.

Scaled Dot-Product Attention

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Symbol	Meaning
Q	Query matrix (what am I looking for?)
K	Key matrix (what do I contain?)
V	Value matrix (what info do I carry?)
d_k	Dimension of keys (scaling factor)
QK^T	Dot product = similarity score
softmax	Normalize scores to probabilities

Why divide by sqrt(d_k)?

Without scaling, dot products grow large with high dimensions. Large values push softmax into regions with tiny gradients. Division by sqrt(d_k) keeps gradients healthy. It's numerical stability, not magic.

Walkthrough: "The cat sat on the mat"

For the word "sat":

Compute Q for "sat", K and V for all words
Dot product Q_sat with K_the, K_cat, K_sat, K_on, K_the, K_mat
Highest scores likely: "cat" (subject doing the sitting) and "mat" (location)
Softmax normalizes these scores
Weighted sum of V vectors = context-aware representation of "sat"

Now "sat" knows WHO sat (cat) and WHERE (mat). Without recurrence. In one step.

Multi-Head Attention

One attention head captures one type of relationship. Multiple heads capture different patterns simultaneously.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    Input(["Input Embeddings"]) --> H1{{"Head 1<br/>Syntax"}}
    Input --> H2{{"Head 2<br/>Semantics"}}
    Input --> H3{{"Head 3<br/>Position"}}
    Input --> H4{{"Head 4<br/>Coreference"}}
    H1 --> Concat[/"Concatenate"/]
    H2 --> Concat
    H3 --> Concat
    H4 --> Concat
    Concat --> Linear[["Linear Projection"]]
    Linear --> Output(["Multi-Head Output"])

    style Input fill:#E3F2FD,stroke:#1565C0,color:#000
    style Concat fill:#FEF3C7,stroke:#D97706,color:#000
    style Output fill:#D1FAE5,stroke:#059669,color:#000
    style H1 fill:#EDE9FE,stroke:#7C3AED,color:#000
    style H2 fill:#EDE9FE,stroke:#7C3AED,color:#000
    style H3 fill:#EDE9FE,stroke:#7C3AED,color:#000
    style H4 fill:#EDE9FE,stroke:#7C3AED,color:#000

Each head has its own Q, K, V weight matrices. GPT-3 uses 96 heads. Each head has dimension d_model/n_heads. Outputs are concatenated and projected back.

Multi-Head Formula

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)

Positional Encoding

Transformers process all tokens simultaneously. That's their superpower (parallelism) and their weakness (no sense of word order). "Dog bites man" and "Man bites dog" would look identical without position info.

Sinusoidal Encoding (Original)

The original paper uses sine and cosine functions of different frequencies:

\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]

\[ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]

Why sine/cosine? Because relative positions can be represented as linear transformations. Position 5 relative to position 3 is always the same rotation, regardless of absolute position.

Modern Approaches

Method	How It Works	Used By
Sinusoidal	Fixed sin/cos patterns added to embeddings	Original Transformer
Learned	Position embeddings trained with model	GPT-2, BERT
RoPE (Rotary)	Rotates Q and K vectors by position angle	Llama, Qwen, Mistral
ALiBi	Adds linear bias to attention scores based on distance	BLOOM, MPT
Relative	Encodes distance between tokens, not absolute position	T5, DeBERTa

RoPE is Dominant Now

RoPE (Rotary Position Embedding) multiplies Q and K by rotation matrices. It encodes relative position naturally, generalizes to longer sequences, and works beautifully with linear attention. Almost every modern open model uses it.

ALiBi's Trick

ALiBi doesn't modify embeddings at all. It simply penalizes attention scores by distance: closer tokens get higher scores. No learned parameters. Shockingly effective for length generalization.

Transformer Architecture

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Encoder["Encoder Block"]
        direction LR
        IE(["Input Embedding<br/>+ Positional Encoding"]) --> MHA1{{"Multi-Head<br/>Self-Attention"}}
        MHA1 --> AN1[["Add & Norm"]]
        AN1 --> FFN1{{"Feed-Forward<br/>Network"}}
        FFN1 --> AN2[["Add & Norm"]]
    end

    subgraph Decoder["Decoder Block"]
        direction LR
        OE(["Output Embedding<br/>+ Positional Encoding"]) --> MMHA{{"Masked Multi-Head<br/>Self-Attention"}}
        MMHA --> AN3[["Add & Norm"]]
        AN3 --> CMHA{{"Cross-Attention<br/>(to Encoder)"}}
        CMHA --> AN4[["Add & Norm"]]
        AN4 --> FFN2{{"Feed-Forward<br/>Network"}}
        FFN2 --> AN5[["Add & Norm"]]
    end

    AN2 --> CMHA
    AN5 --> LIN[/"Linear + Softmax"/]
    LIN --> OUT(["Output Probabilities"])

    style Encoder fill:#E3F2FD,stroke:#1565C0,color:#000
    style Decoder fill:#EDE9FE,stroke:#7C3AED,color:#000
    style LIN fill:#FEF3C7,stroke:#D97706,color:#000
    style OUT fill:#D1FAE5,stroke:#059669,color:#000

Key Components

Residual Connections (Add): Output = Layer(x) + x. Helps gradients flow through deep networks. Without them, 96-layer models wouldn't train.

Layer Normalization (Norm): Normalizes across features (not batch). Stabilizes training. Modern models use Pre-LN (norm before attention) rather than Post-LN (norm after).

Feed-Forward Network (FFN): Two linear transformations with activation between them.

Text Only

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

This is where the model stores "knowledge." Attention routes information; FFN processes it. Think: attention is the postal system, FFN is the workers at each office.

Modern FFN Variants

SwiGLU (used in Llama, Mistral): Gated activation. Better than ReLU.
MoE (Mixture of Experts, used in Mixtral, GPT-4): Only activates a subset of FFN parameters per token. 8 experts, pick top-2. Cheap inference, massive capacity.

Encoder vs Decoder Models

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph ENC["Encoder-Only"]
        direction LR
        E1(["BERT, RoBERTa<br/>DeBERTa"])
        E2{{"Bidirectional<br/>Sees full context"}}
        E3[/"Classification,<br/>NER, QA"/]
    end

    subgraph DEC["Decoder-Only"]
        direction LR
        D1(["GPT, Claude<br/>Llama, Mistral"])
        D2{{"Autoregressive<br/>Left-to-right"}}
        D3[/"Text Generation,<br/>Chat, Code"/]
    end

    subgraph ED["Encoder-Decoder"]
        direction LR
        ED1(["T5, BART<br/>mBART"])
        ED2{{"Bidirectional encoder<br/>+ Autoregressive decoder"}}
        ED3[/"Translation,<br/>Summarization"/]
    end

    style ENC fill:#E3F2FD,stroke:#1565C0,color:#000
    style DEC fill:#EDE9FE,stroke:#7C3AED,color:#000
    style ED fill:#D1FAE5,stroke:#059669,color:#000

Aspect	Encoder (BERT)	Decoder (GPT)	Encoder-Decoder (T5)
Attention	Bidirectional (sees all tokens)	Causal (sees only past tokens)	Both
Training	Masked language model (fill in blanks)	Next token prediction	Span corruption + generation
Strengths	Understanding, classification	Generation, reasoning	Translation, summarization
Limitation	Can't generate text naturally	Can't "see the future"	More parameters for same task
Examples	BERT, RoBERTa, DeBERTa	GPT-4, Claude, Llama 3	T5, FLAN-T5, BART

Why Decoder-Only Won

Decoder-only models scale better. One architecture handles everything: generation, classification, reasoning, code. The "next token prediction" objective is universal. That's why GPT-4, Claude, Llama, and Gemini are all decoder-only.

Tokenization

Models don't see words. They see tokens — sub-word units. "unhappiness" might become ["un", "happiness"] or ["un", "happ", "iness"].

Why Tokenization Matters

Determines vocabulary size (affects model size and efficiency)
Affects how well the model handles rare words, code, non-English text
A bad tokenizer wastes context window on redundant tokens

Methods

Method	How It Works	Used By
BPE (Byte-Pair Encoding)	Iteratively merges most frequent character pairs	GPT-2, GPT-4
WordPiece	Similar to BPE but maximizes likelihood	BERT
SentencePiece	Language-agnostic, works on raw text (no pre-tokenization)	Llama, T5, Gemini
Tiktoken	Optimized BPE implementation with byte-level fallback	GPT-4, Claude

Token vs Word Count

Text Only

"I'm learning transformers!" 
Words:   4
Tokens:  5  ("I", "'m", " learning", " transform", "ers", "!")
         Actually 6 with the punctuation

Token Economics

You pay per token, not per word. Code is expensive (many short tokens). English averages ~1.3 tokens/word. Chinese averages ~2 tokens/character for older tokenizers. A 100K context window is roughly 75K words.

The LLM Landscape

Model	Maker	Parameters	Context	Open?	Strengths
GPT-4o	OpenAI	Unknown (~1.8T MoE?)	128K	Closed	Multimodal, speed, tooling ecosystem
Claude 4	Anthropic	Unknown	200K	Closed	Long context, safety, reasoning, code
Gemini 2.5	Google	Unknown	1M+	Closed	Massive context, multimodal, search integration
Llama 3.1	Meta	8B / 70B / 405B	128K	Open weights	Best open model, strong multilingual
Mistral Large	Mistral	Unknown	128K	Closed	Efficient, strong coding and reasoning
Mixtral 8x22B	Mistral	176B (39B active)	64K	Open weights	MoE efficiency, fast inference
DeepSeek-V3	DeepSeek	671B (37B active)	128K	Open weights	MoE, cost-efficient training, strong math
Qwen 2.5	Alibaba	0.5B to 72B	128K	Open weights	Multilingual, strong at code

Open Weights vs Open Source

"Open weights" means you can download and run the model. "Open source" means you also get training code, data, and full reproducibility. Most "open" models are open-weights only. Llama 3 has a custom license (not truly OSI-open).

How LLMs Are Trained

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    PT(["Pre-Training<br/>Next Token Prediction<br/>Trillions of tokens"]) --> SFT{{"Supervised<br/>Fine-Tuning<br/>Instruction following"}}
    SFT --> RLHF{{"RLHF / DPO<br/>Human preference<br/>alignment"}}
    RLHF --> Deploy(["Deployed<br/>Model"])

    style PT fill:#E3F2FD,stroke:#1565C0,color:#000
    style SFT fill:#FEF3C7,stroke:#D97706,color:#000
    style RLHF fill:#EDE9FE,stroke:#7C3AED,color:#000
    style Deploy fill:#D1FAE5,stroke:#059669,color:#000

Stage 1: Pre-Training

Objective: Predict the next token given all previous tokens
Data: Trillions of tokens from the internet, books, code, papers
Compute: Thousands of GPUs for weeks/months. GPT-4 training cost estimated at $100M+
Result: A "base model" — great at completing text, terrible at following instructions

The base model is like a brilliant student who has read everything but has never been told what a "question" is.

Stage 2: Supervised Fine-Tuning (SFT)

Human-written examples of ideal responses to instructions
Typically 10K-100K high-quality examples
Teaches the model to be helpful, follow instructions, format outputs

Stage 3: Alignment

RLHF (Reinforcement Learning from Human Feedback):

Generate multiple responses to a prompt
Humans rank responses (A > B > C)
Train a reward model to predict human preferences
Optimize the LLM policy using PPO to maximize reward

Constitutional AI (Anthropic's approach):

Model critiques its own responses against a set of principles
Self-improvement loop reduces need for human labelers
Principles like "be helpful," "be harmless," "be honest"

DPO (Direct Preference Optimization):

Skip the reward model entirely
Directly optimize the policy from preference pairs
Simpler, more stable, increasingly popular
Used by Llama 3, Zephyr, and many others

Alignment Tax

Alignment makes models safer but slightly less capable on benchmarks. It's a deliberate tradeoff. The base model "knows" harmful things; alignment teaches it when NOT to say them.

Prompt Engineering

Prompting Strategies

Strategy	Description	When to Use
Zero-shot	Just ask the question	Simple tasks the model already knows
Few-shot	Provide 2-5 examples first	When you need a specific format/style
Chain-of-Thought	"Let's think step by step"	Math, logic, multi-step reasoning
ReAct	Thought-Action-Observation loop	Agent tasks with tool use
System Prompt	Set role and constraints	All production deployments
Self-Consistency	Sample N chains, majority vote	When accuracy > latency

Chain-of-Thought Example

Text Only

Without CoT: "What's 17 * 24?" -> "__(often wrong)__"

With CoT: "What's 17 * 24? Think step by step."
-> "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408" (correct!)

Sampling Parameters

Parameter	What It Does	Low Value	High Value
Temperature	Randomness of sampling	Deterministic, focused	Creative, diverse
Top-p (nucleus)	Cumulative probability threshold	Conservative (0.1)	Expansive (0.95)
Top-k	Number of candidates to consider	Very focused (5)	Wide selection (100)
Frequency penalty	Penalize repeated tokens	Allow repetition (0)	Force variety (2.0)

Production Settings

For factual tasks: temperature=0, top_p=1. For creative writing: temperature=0.8, top_p=0.95. Never set both temperature=0 AND top_k=1 — it's redundant. Temperature=0 already makes output deterministic.

Context Windows

The context window is the maximum number of tokens a model can process in a single forward pass. It's like the model's "working memory."

Model	Context Window	Roughly Equivalent To
GPT-3 (original)	2K tokens	1.5 pages
GPT-4	8K / 128K tokens	6 pages / 300 pages
Claude 4	200K tokens	A full novel
Gemini 2.5	1M+ tokens	Multiple textbooks

Why Context Length is Hard

Attention is O(n^2) in sequence length. Doubling context = 4x memory and compute. That's why extending context is an active research area.

Solutions for Long Context

Technique	How It Works
Sliding Window	Each layer only attends to nearby tokens (Mistral uses 4096 window)
RoPE Scaling	Interpolate or extend rotary embeddings to unseen positions
Ring Attention	Distribute long sequences across devices in a ring topology
Sparse Attention	Attend to subset of tokens (local + global tokens)
Recurrent Memory	Compress earlier context into memory tokens

Long Context vs Retrieval

Just because a model HAS 200K context doesn't mean it uses it perfectly. Retrieval (RAG) often beats stuffing everything in context. The "lost in the middle" problem: models attend best to the beginning and end, less to the middle.

Scaling Laws

Chinchilla Scaling (2022)

DeepMind's Chinchilla paper showed: for a given compute budget, you should scale data and model size equally.

Previous models (like GPT-3 175B) were undertrained — too many parameters, not enough data. Chinchilla (70B) matched GPT-3 performance with 4x more training data and 3x fewer parameters.

%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '13px', 'fontFamily': 'Inter, -apple-system, sans-serif'}, 'flowchart': {'nodeSpacing': 30, 'rankSpacing': 50, 'padding': 12, 'curve': 'basis'}, 'sequence': {'actorMargin': 60, 'messageMargin': 40}, 'class': {'padding': 12}}}%%
flowchart LR
    subgraph Before["Before Chinchilla"]
        direction LR
        B1(("Bigger model = better"))
        B2[/"GPT-3: 175B params<br/>300B tokens"/]
    end

    subgraph After["After Chinchilla"]
        direction LR
        A1(("Balance params & data"))
        A2[/"Chinchilla: 70B params<br/>1.4T tokens"/]
    end

    Before --> After

    style Before fill:#FCE7F3,stroke:#DB2777,color:#000
    style After fill:#D1FAE5,stroke:#059669,color:#000

Key Scaling Insights

Compute-optimal ratio: ~20 tokens per parameter (Chinchilla rule)
Loss scales as power law with compute, data, and parameters
Inference cost matters too: Llama 3 70B trained on 15T tokens (overtraining) because a smaller model with more training is cheaper to serve
Emergent abilities: Some capabilities appear suddenly at certain scales (debated)

Why Bigger Isn't Always Better

A 7B model trained on 2T tokens often beats a 70B model trained on 200B tokens — and it's 10x cheaper to run. Modern strategy: overtrain smaller models well beyond Chinchilla-optimal.

Inference Optimization

Training happens once. Inference happens millions of times. Optimization here directly saves money.

KV Cache

During autoregressive generation, recompute Q for only the new token but reuse K and V from all previous tokens. Without KV cache, generating N tokens is O(N^2). With it, O(N).

Problem: KV cache grows linearly with sequence length and batch size. A 70B model with 128K context needs ~40GB just for KV cache.

Speculative Decoding

Use a small "draft" model to generate N candidate tokens quickly. Then verify all N tokens with the large model in a single forward pass. If most are correct, you get N tokens for the cost of 1 large-model call.

Draft model: 1B parameters, very fast
Verify model: 70B parameters, accurate
Typical speedup: 2-3x with no quality loss

Flash Attention

Standard attention materializes the full N x N attention matrix in GPU memory. Flash Attention tiles the computation to keep everything in fast SRAM. Result: 2-4x speedup, less memory, exact same output. Not an approximation — pure IO optimization.

Quantization

Reduce model precision from float16 to int8 or int4. Smaller model = faster inference + less memory.

Format	Bits	Quality Loss	Speedup	Tool
FP16	16	Baseline	1x	-
GPTQ	4	Minimal	2-3x	AutoGPTQ
AWQ	4	Less than GPTQ	2-3x	AutoAWQ
GGUF	2-8	Varies	2-4x	llama.cpp
bitsandbytes	4/8	Minimal	2x	Hugging Face

Serving Frameworks

Framework	Key Feature
vLLM	PagedAttention — manages KV cache like virtual memory. Best throughput.
TGI (Text Generation Inference)	Hugging Face's production server. Tensor parallelism built-in.
TensorRT-LLM	NVIDIA's optimized runtime. Best single-GPU performance.
Ollama	Run models locally with one command. Great for development.
llama.cpp	CPU inference with quantization. Runs on laptops.

Production Stack 2025

vLLM + AWQ quantization + Flash Attention + speculative decoding = state-of-the-art serving. For 70B models: tensor parallel across 2-4 GPUs with continuous batching.

Open vs Closed Models

Dimension	Open Weights (Llama, Mistral)	Closed API (GPT-4, Claude)
Cost	GPU infrastructure + engineering	Pay per token
Control	Full control over model, data privacy	Vendor controls everything
Customization	Fine-tune, merge, quantize freely	Limited fine-tuning options
Quality	Catching up fast (Llama 3 405B competitive)	Generally still leading
Latency	You control the hardware	Network round-trip + queue
Maintenance	You own updates, security, scaling	Vendor handles everything
Data Privacy	Data never leaves your infrastructure	Data sent to third party
Compliance	Easier for regulated industries	May not meet requirements

When to Self-Host

Regulated industries (healthcare, finance): data can't leave your infrastructure
High volume: >$5K/month in API costs often means self-hosting is cheaper
Low latency: need <100ms time-to-first-token
Custom models: heavily fine-tuned for your domain
Offline environments: edge deployment, air-gapped systems

When to Use APIs

Prototyping: get started in minutes, not days
Low volume: don't justify GPU infrastructure
Frontier quality: need the absolute best model available
Multi-modal: vision + audio + code in one API
Small team: no ML ops expertise in-house

The Hidden Cost of Self-Hosting

The model is free. The 8x H100 GPUs ($250K), the engineers to manage them, the redundancy, the monitoring, the model updates — that's where the cost hides. Do the math honestly before committing.

Interview Questions

1. Explain the self-attention mechanism. Why is it O(n^2)?

Self-attention computes pairwise interactions between all tokens. For each token, we compute dot products with every other token's key vector. With N tokens, that's N x N comparisons, making it O(n^2) in both time and memory. The formula is: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. The QK^T matrix multiplication produces an N x N attention matrix. This is why long context is expensive.

2. What is the difference between self-attention and cross-attention?

In self-attention, Q, K, and V all come from the same sequence — the input attends to itself. In cross-attention, Q comes from one sequence (decoder) while K and V come from another (encoder output). Cross-attention is how the decoder "looks at" encoder representations. Used in encoder-decoder models like T5 for translation, and in multimodal models where text attends to image features.

3. Why do we need positional encoding? Compare RoPE and ALiBi.

Transformers process all tokens in parallel with no inherent notion of order. Without positional encoding, "dog bites man" = "man bites dog." RoPE rotates Q and K vectors by position-dependent angles, encoding relative position through the angle difference. It's learned-free and extends well. ALiBi adds a linear bias to attention scores proportional to distance — no parameters, no embedding modification. RoPE is more popular (Llama, Mistral). ALiBi generalizes better to unseen lengths but is less common now.

4. Explain the training pipeline: pre-training, SFT, RLHF, and DPO.

Pre-training: Next-token prediction on trillions of tokens. Creates a base model that can complete text but won't follow instructions. SFT: Fine-tune on ~100K curated instruction-response pairs. Model learns to be helpful. RLHF: Generate multiple responses, have humans rank them, train a reward model, then optimize the LLM with PPO. DPO: Skip the reward model, directly optimize from preference pairs — simpler and more stable. Each stage builds on the previous one.

5. What is the KV cache and why does it matter for inference?

During autoregressive generation, each new token needs attention over ALL previous tokens. Without caching, generating token N requires recomputing K and V for all N-1 previous tokens — O(N^2) total work for a sequence. KV cache stores previously computed key and value vectors, so each new token only computes its own Q and looks up cached K/V. Reduces generation from O(N^2) to O(N). The tradeoff: KV cache consumes linear memory per sequence per layer, which is why long-context serving is memory-bound.

6. Compare encoder-only, decoder-only, and encoder-decoder architectures. Why did decoder-only win?

Encoder-only (BERT): bidirectional attention, great for classification/understanding, can't generate. Decoder-only (GPT): causal/left-to-right attention, next-token generation, handles everything via prompting. Encoder-decoder (T5): encoder sees full context, decoder generates — great for translation. Decoder-only won because: (1) one architecture scales to all tasks, (2) scaling laws favor it, (3) next-token prediction is the universal training objective, (4) simpler to train and serve at scale.

7. Explain Flash Attention. Is it an approximation?

Flash Attention is NOT an approximation — it produces the exact same output as standard attention. It's a pure IO optimization. Standard attention materializes the full N x N attention matrix in slow GPU HBM (high-bandwidth memory). Flash Attention tiles the computation into blocks that fit in fast GPU SRAM, computes softmax incrementally (using the online softmax trick), and never materializes the full matrix. Result: 2-4x speedup, memory usage from O(N^2) to O(N), same mathematical output.

8. What are the Chinchilla scaling laws and how did they change LLM training?

Chinchilla (DeepMind, 2022) showed that for a fixed compute budget, you should scale model size and training data equally. The optimal ratio is roughly 20 tokens per parameter. Before Chinchilla, the community built huge models with insufficient data (GPT-3: 175B params, 300B tokens). Chinchilla (70B params, 1.4T tokens) matched GPT-3 with 3x fewer parameters. This shifted the field toward smaller, better-trained models. Modern trend: overtrain (Llama 3: 70B on 15T tokens) because inference cost matters more than training cost.

9. Explain quantization. How can a 4-bit model nearly match a 16-bit model?

Quantization reduces numerical precision of model weights. 16-bit uses 2 bytes per parameter; 4-bit uses 0.5 bytes — 4x reduction. It works because neural network weights are robust to small perturbations. Methods: GPTQ (post-training, one calibration pass), AWQ (activation-aware, protects salient weights), GGUF (mixed precision per layer). Quality is preserved because: (1) outlier weights are kept at higher precision, (2) calibration data ensures quantization error is minimized on real inputs, (3) some redundancy in weights is harmless to remove.

10. What is speculative decoding and when would you use it?

Speculative decoding uses a small, fast draft model to generate N candidate tokens, then verifies all N with the large model in a single parallel forward pass. If the draft model predicted correctly (which it often does for common continuations), you get N tokens for the cost of one large-model inference. Rejection sampling ensures output quality equals the large model exactly. Use when: latency matters more than throughput, draft model matches large model's distribution well, and you can afford running two models. Typical speedup: 2-3x with zero quality loss.

11. How does temperature affect LLM output? What about top-p and top-k?

Temperature scales logits before softmax: logits/T. T=0: argmax (greedy, deterministic). T=1: natural distribution. T>1: flatter distribution (more random). T<1: sharper distribution (more confident). Top-p (nucleus sampling): only consider tokens whose cumulative probability exceeds p. Top-p=0.9 means ignore the bottom 10% probability mass. Top-k: only consider the k most probable tokens. These are all sampling strategies applied AFTER the forward pass. They don't change what the model "knows" — they change how it chooses among options.

12. Explain Mixture of Experts (MoE). Why is it efficient?

MoE replaces the single FFN layer with multiple "expert" FFN layers plus a router. For each token, the router selects top-k experts (typically 2 out of 8). Only those 2 experts process the token. Total parameters: huge (enables capacity). Active parameters per token: small (enables speed). Mixtral 8x22B has 176B total parameters but only 39B active per token — so it runs at 40B-model speeds with 176B-model quality. Challenge: load balancing (experts should be used equally), routing collapse, training instability.

13. What is the 'lost in the middle' problem? How do you mitigate it?

Research shows LLMs attend most to the beginning and end of their context, with degraded retrieval of information placed in the middle. In 128K-token contexts, facts in the middle may be overlooked. Mitigations: (1) Place critical information at the start or end, (2) Use RAG to retrieve relevant chunks and place them prominently, (3) Structured prompts with clear section headings, (4) Recursive summarization of long documents, (5) Models trained specifically for long-context tasks (like Claude) show less degradation.

14. Compare RLHF and DPO. When would you choose each?

RLHF: Train a reward model on human preferences, then optimize LLM with PPO. Complex pipeline (4 models: reference, policy, reward, value). Unstable training. But: more flexible, reward model can be reused, handles complex preference landscapes. DPO: Directly optimize from preference pairs without a reward model. Treats the policy itself as an implicit reward model. Simpler (2 models: reference, policy). More stable. But: may overfit to preference data format, less flexible for online learning. Choose DPO for simplicity and stability. Choose RLHF when you need online feedback or complex reward signals.

15. You need to deploy a model for a production chatbot serving 1000 requests/minute. Walk through your architecture decisions.

First: open vs closed? If data privacy allows, start with Claude/GPT-4 API for fastest iteration. For self-hosting: choose model size (70B for quality, 7B for speed). Quantize with AWQ to 4-bit. Serve with vLLM (PagedAttention for efficient KV cache, continuous batching for throughput). Hardware: 2x A100 80GB for 70B-4bit with tensor parallelism. Add speculative decoding if latency-critical. Monitor: track tokens/second, time-to-first-token, error rates. Scaling: horizontal with load balancer, prefill and decode on separate pools. Fallbacks: secondary model for overload, graceful degradation. Cache common prompts/responses. Set request timeouts and implement streaming.

Transformers & Large Language Models

Why Transformers

Self-Attention Mechanism

The Party Analogy

Scaled Dot-Product Attention

Walkthrough: "The cat sat on the mat"

Multi-Head Attention

Positional Encoding

Sinusoidal Encoding (Original)

Modern Approaches

Transformer Architecture

Key Components

Encoder vs Decoder Models

Tokenization

Why Tokenization Matters

Methods

Token vs Word Count

The LLM Landscape

How LLMs Are Trained

Stage 1: Pre-Training

Stage 2: Supervised Fine-Tuning (SFT)

Stage 3: Alignment

Prompt Engineering

Prompting Strategies

Chain-of-Thought Example

Sampling Parameters

Context Windows

Why Context Length is Hard

Solutions for Long Context

Scaling Laws

Chinchilla Scaling (2022)

Key Scaling Insights

Inference Optimization

KV Cache

Speculative Decoding

Flash Attention

Quantization

Serving Frameworks

Open vs Closed Models

When to Self-Host

When to Use APIs

Interview Questions

5-Minute System Design — Weekly