Lecture 10 — Inference

7 min read

Overview: The Primacy of the Memory Bottleneck in Inference

This lecture provides a deep, systems-level analysis of Large Language Model (LLM) inference, explaining why it is a fundamentally different and, in many ways, more challenging problem than training. The central, recurring theme is that while training is a throughput-oriented, compute-bound problem, autoregressive inference is a latency-sensitive, memory-bound problem. Every significant optimization in the field of LLM inference is a direct attack on this memory bottleneck.

The lecture dissects the inference workload, quantifies its performance limitations using arithmetic intensity calculations, and then explores a comprehensive suite of solutions ranging from architectural modifications and algorithmic shortcuts to sophisticated systems-level techniques inspired by classical operating systems.


Part 1: Quantifying the Inference Problem

To understand why inference is slow, we must analyze the two distinct phases of generation.

Phase 1: Prefill (Processing the Prompt)

  • The Task: When a prompt of S tokens is received, the model performs a single forward pass to process all S tokens in parallel.
  • The Computation: This phase is dominated by large matrix-matrix multiplications (e.g., in the MLP layers: (B, S, D) @ (D, F)).
  • Arithmetic Intensity Analysis: The number of floating-point operations (FLOPs) scales as O(B*S*D*F), while the memory I/O scales as O(B*S*D + D*F + B*S*F). The ratio of FLOPs to bytes read and written, known as arithmetic intensity, is therefore high. For large S it easily exceeds the hardware's ridge point (peak FLOP/s divided by memory bandwidth), making the operation compute-bound: it is limited by the GPU's peak FLOP/s, not by memory bandwidth. A worked estimate appears after this list.
  • Key Outcome: This phase populates the KV Cache, a large tensor in memory that stores the Key and Value vectors for every token in the prompt, ready for the next phase.
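To make this concrete, here is a back-of-the-envelope sketch in plain Python. The model shape and the peak-FLOP and bandwidth figures are illustrative H100-class assumptions, not exact specifications; the point is only that prefill intensity sits far above the ridge point.

```python
# Arithmetic intensity of a prefill MLP matmul, (B, S, D) @ (D, F), in bf16.
B, S, D, F = 1, 4096, 4096, 16384            # batch, prompt length, model dim, MLP dim
BYTES = 2                                    # bf16

flops = 2 * B * S * D * F                    # each of the B*S*F outputs needs D multiply-adds
io_bytes = BYTES * (B * S * D + D * F + B * S * F)   # read X, read W, write Y

peak_flops = 989e12                          # assumed dense bf16 FLOP/s (H100-class)
hbm_bw = 3.35e12                             # assumed HBM bandwidth in bytes/s
ridge = peak_flops / hbm_bw                  # intensity needed to become compute-bound

print(f"prefill intensity ~ {flops / io_bytes:.0f} FLOPs/byte, ridge ~ {ridge:.0f}")
# ~1800 FLOPs/byte vs. a ridge of ~300: comfortably compute-bound.
```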

Phase 2: Decoding (Generating the Response, Token by Token)

  • The Task: To generate the (S+1)-th token, the model performs a forward pass for a single new token. This process repeats sequentially for every subsequent token.
  • The Computation: This phase is dominated by matrix-vector multiplications. For example, in the MLP, the operation is (B, 1, D) @ (D, F).
  • Arithmetic Intensity Analysis: The FLOPs for one step are low (O(B*D*F)). However, to perform this computation the GPU must still read the entire model's weights and, crucially, the entire KV Cache from HBM (High-Bandwidth Memory, which is slow relative to the chip's compute throughput). The memory I/O is therefore O(B*S*D + D*F). The resulting arithmetic intensity is extremely low, roughly proportional to B for the weight matrices and roughly constant for attention over the KV cache, far below the hardware's ridge point. The sketch after this list works through the numbers.
  • The Bottleneck: This phase is severely memory-bound. The speed is not limited by computation but by memory bandwidth. The latency of every single token generated is directly proportional to the time it takes to read gigabytes of weights and KV cache data from HBM. Increasing batch size (B) helps the MLP layers, but the attention layers remain memory-bound as the KV cache size (S) grows.
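The same style of estimate for a single decode step shows how far below the ridge point it falls. As before, the shapes and hardware figures are illustrative assumptions.

```python
# Arithmetic intensity of one decoding step: a (B, 1, D) @ (D, F) matmul
# plus attention over an S-token KV cache, in bf16.
B, S, D, F = 8, 4096, 4096, 16384      # batch, cached tokens, model dim, MLP dim
BYTES = 2                              # bf16

# MLP-style weight matmul for one new token per sequence.
mlp_flops = 2 * B * D * F
mlp_bytes = BYTES * (D * F + B * D + B * F)        # the weight read dominates
print(f"MLP intensity ~ {mlp_flops / mlp_bytes:.1f} FLOPs/byte (roughly B)")

# Attention over the KV cache: each sequence streams its own S cached K/V vectors.
attn_flops = 2 * 2 * B * S * D                     # one QK^T and one PV contraction
attn_bytes = BYTES * 2 * B * S * D                 # read K and V once
print(f"attention intensity ~ {attn_flops / attn_bytes:.1f} FLOPs/byte (roughly constant)")

# Both values sit far below the ~300 FLOPs/byte ridge assumed in the prefill
# sketch: decoding is limited by HBM bandwidth. Batching raises the MLP term,
# but the per-sequence KV-cache reads stay memory-bound.
```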

Part 2: A Toolbox of Inference Optimizations

The following techniques are all designed to alleviate the memory bandwidth bottleneck, particularly during the decoding phase.

Category 1: "Lossy" Methods (Architectural & Weight Modification)

These methods trade a small, often negligible, amount of model accuracy for significant gains in speed and throughput by reducing the total memory footprint.

  • Architectural Changes to Shrink the KV Cache:

    • Grouped-Query Attention (GQA): This is the most impactful architectural change for inference. In standard Multi-Head Attention (MHA), each of the N query heads has a corresponding key and value head. GQA instead uses K key/value heads, each shared by a group of N/K query heads. This shrinks the KV cache (and the data that must be read from it at every step) by a factor of N/K, directly improving decoding speed; the sizing sketch after this list quantifies the effect.
    • Multi-Head Latent Attention (MLA): An advanced technique that introduces a low-rank bottleneck: the full-sized Key and Value vectors are projected down to a much smaller "latent" dimension (C), and only this compressed representation is stored in the cache. This can offer even greater KV cache reduction than GQA. One wrinkle is that the projection can interfere with Rotary Position Embeddings (RoPE), so a small number of full-dimension RoPE channels are typically kept alongside the latent cache.
    • Cross-Layer Attention (CLA): Extends the sharing idea of GQA across a different dimension: layers. It allows KV heads to be shared not just within a layer, but across multiple layers, further reducing the total number of K/V parameters and the cache size.
  • Quantization (Reducing Weight Precision):

    • Mechanism: The process of converting model weights from their default 16-bit format (bfloat16) to a lower-precision format like 8-bit (int8) or 4-bit (int4). This directly reduces the model's memory footprint and the bandwidth required to load the weights at each decoding step.
    • The Outlier Problem: A critical challenge in quantizing LLMs is the presence of rare "outlier" values with extremely large magnitudes, which appear in the weights and, even more prominently, in certain activation channels. These outliers are disproportionately important for model quality, yet naive quantization that scales everything by the absolute maximum crushes the resolution of all the other, much smaller values.
    • Solution (Mixed-Precision Quantization): Techniques such as LLM.int8() and activation-aware methods like AWQ treat the outliers differently: the small fraction of outlier dimensions is kept in 16-bit precision (bfloat16) while the vast majority of "normal" values is quantized to int8 or int4, preserving accuracy while still capturing most of the memory savings. A toy illustration appears after this list.
  • Model Pruning and Distillation:

    • This is a two-step process to create a smaller, cheaper version of a powerful model.
    • 1. Pruning: An algorithm analyzes a trained LLM to identify and remove the least important components. This can be done at various granularities—removing individual weights, neurons, attention heads, or even entire layers.
    • 2. Distillation: The now-smaller "student" model is trained to match the output probability distributions of the original "teacher" model. This transfers the nuanced knowledge of the large model into the compact student, recovering performance lost during the pruning step. The result can be a model that is, for example, roughly 40x cheaper to train/run while outperforming similarly sized models trained from scratch. A minimal sketch of the distillation loss appears after this list.
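To see how much these architectural changes shrink the cache, the sketch below compares per-token KV-cache sizes for MHA, GQA, and an MLA-style latent cache. The layer count, head counts, and latent width are illustrative assumptions for the arithmetic, not the configuration of any particular model.

```python
# Rough per-token KV-cache footprints for MHA, GQA, and an MLA-style latent cache.
BYTES = 2            # bf16
layers = 32
n_heads = 32         # query heads
head_dim = 128

def kv_bytes_per_token(n_kv_heads: int, dim_per_head: int) -> int:
    # Cache K and V for every layer.
    return BYTES * 2 * layers * n_kv_heads * dim_per_head

mha = kv_bytes_per_token(n_kv_heads=32, dim_per_head=head_dim)
gqa = kv_bytes_per_token(n_kv_heads=8, dim_per_head=head_dim)    # 4 query heads share each KV head

# MLA caches a single shared latent vector per token (plus a few RoPE channels)
# instead of per-head K and V; the latent and RoPE widths are assumed values.
latent_dim, rope_dim = 512, 64
mla = BYTES * layers * (latent_dim + rope_dim)

for name, b in [("MHA", mha), ("GQA (8 KV heads)", gqa), ("MLA latent", mla)]:
    print(f"{name:>17}: {b / 1024:6.1f} KiB/token, {b * 8192 / 2**20:7.1f} MiB for an 8k-token sequence")
```

With these assumptions, MHA needs roughly 4 GiB of cache for a single 8k-token sequence, GQA about a quarter of that, and the latent cache several times smaller still; this is exactly the data that must be streamed from HBM on every decoding step.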
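As an illustration of the mixed-precision idea (this is a toy sketch, not the actual LLM.int8() or AWQ implementation), the code below keeps the largest ~1% of values in 16-bit, absmax-quantizes the rest to int8, and compares the reconstruction error against naive absmax quantization over all values.

```python
import numpy as np

def quantize_with_outliers(w: np.ndarray, outlier_frac: float = 0.01):
    """Keep the largest-magnitude fraction in 16-bit; absmax-quantize the rest to int8."""
    threshold = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outlier_mask = np.abs(w) >= threshold

    normal = np.where(outlier_mask, 0.0, w)
    scale = np.abs(normal).max() / 127.0                    # scale set by non-outliers only
    q = np.clip(np.round(normal / scale), -127, 127).astype(np.int8)
    outliers = np.where(outlier_mask, w, 0.0).astype(np.float16)
    return q, scale, outliers

def dequantize(q, scale, outliers):
    return q.astype(np.float32) * scale + outliers.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w[rng.integers(0, w.size, 8)] *= 50.0                       # inject a few large outliers

q, scale, outliers = quantize_with_outliers(w)
naive_scale = np.abs(w).max() / 127.0                       # naive absmax over everything
naive = np.clip(np.round(w / naive_scale), -127, 127) * naive_scale

print("mixed-precision error:", np.abs(dequantize(q, scale, outliers) - w).mean())
print("naive absmax error:   ", np.abs(naive - w).mean())
```

The handful of injected outliers forces the naive scale to be dramatically coarser, which is exactly the resolution loss that mixed-precision schemes avoid.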
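The distillation step typically minimizes a KL divergence between the teacher's and the student's softened token distributions. Below is a minimal PyTorch sketch; the temperature, shapes, and random logits are placeholders rather than anything specified in the lecture.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then take KL(teacher || student); the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Stand-in logits with shape (batch * sequence, vocab).
student_logits = torch.randn(16, 32000, requires_grad=True)
teacher_logits = torch.randn(16, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```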

Category 2: "Lossless" Methods (Algorithmic Speedups)

  • Speculative Sampling:
    • Core Principle: This technique is built on the observation that for a GPU, verifying K tokens in parallel is much faster than generating K tokens sequentially.
    • Mechanism: It uses two models: a small, fast "draft model" and the large, accurate "target model."
      1. The draft model autoregressively generates a short "draft" of K tokens (e.g., K=4). This is fast.
      2. The target model then takes the original prompt plus the entire K-token draft and scores all of it in a single, parallel forward pass. This is also fast because it behaves like a small prefill: the target model's weights and KV cache are read from HBM once to verify all K positions, rather than once per generated token.
      3. A statistical acceptance procedure (modified rejection sampling) compares the two models' probabilities at each drafted position: a draft token x is accepted with probability min(1, p_target(x) / p_draft(x)).
      4. Tokens are accepted left to right until the first rejection. The rejected token is discarded and replaced by a sample from the renormalized residual distribution max(0, p_target - p_draft), and drafting then resumes from that point.
    • Guarantee: This procedure is mathematically guaranteed to sample from exactly the same distribution as decoding with the target model alone. It typically yields a 2-3x speedup by replacing several slow, memory-bound decoding steps with a single fast verification pass. A minimal sketch of the accept/reject rule follows.
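The sketch below implements the acceptance rule in plain Python/NumPy, in simplified form: the probability arrays are random stand-ins for real model outputs, and the bonus token that full implementations sample when every draft token is accepted is omitted.

```python
import numpy as np

def speculative_accept(draft_tokens, draft_probs, target_probs, rng):
    """draft_probs[i] and target_probs[i] are full-vocabulary distributions at position i."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = draft_probs[i][tok], target_probs[i][tok]
        if rng.random() < min(1.0, q / p):               # accept with probability min(1, q/p)
            accepted.append(tok)
        else:
            # On rejection, resample from the renormalized residual max(0, target - draft);
            # this is what makes the overall output distribution match the target model's.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

rng = np.random.default_rng(0)
vocab, K = 8, 4
draft_probs = rng.dirichlet(np.ones(vocab), size=K)      # stand-ins for model outputs
target_probs = rng.dirichlet(np.ones(vocab), size=K)
draft_tokens = [int(np.argmax(p)) for p in draft_probs]  # greedy draft for the example
print(speculative_accept(draft_tokens, draft_probs, target_probs, rng))
```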

Category 3: Systems Optimizations for Dynamic Workloads

  • Continuous Batching:

    • The Problem (Static Batching): Inefficiently handles real-world traffic where requests arrive at different times and have different lengths. It leads to significant GPU idle time as the system waits for batches to fill or for the longest sequence in a batch to finish.
    • The Solution: The server operates on a fine-grained, iteration-level schedule. After every single token generation step, the scheduler checks for sequences that have completed, removes them from the batch, and immediately fills the freed-up slots with new requests from the waiting queue. This keeps the GPU operating near full capacity and dramatically increases overall throughput; a toy scheduling loop is sketched after this list.
  • PagedAttention (vLLM):

    • The Problem (Memory Fragmentation): The KV cache creates massive memory waste. Allocating a contiguous block of memory for the maximum possible context length for every request leads to:
      • Internal Fragmentation: Most sequences are shorter than the max, leaving the allocated block mostly empty.
      • External Fragmentation: The memory space becomes a patchwork of used and free blocks, making it hard to find a contiguous slot for a new request.
    • The Solution: Inspired by virtual memory and paging from operating systems, PagedAttention divides the entire KV cache space into small, fixed-size blocks ("pages"). A sequence's KV cache is no longer a single contiguous block but a collection of pages scattered across physical memory. A "page table" maps the logical token positions of a sequence to their physical page locations.
    • The Killer Feature (Sharing): Paging enables efficient, zero-copy memory sharing. If multiple requests share a common prefix (e.g., the same long system prompt or the same initial turns in a conversation), their logical page tables can all point to the same physical pages in memory. This avoids redundant computation and saves enormous amounts of memory, boosting system throughput significantly. When a sequence needs to generate a new token that diverges from the shared pages, a copy-on-write mechanism is used; a toy block-table sketch follows this list.
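First, the continuous-batching idea as a toy scheduling loop. The request fields and the `decode_step` stub are invented for the example; a real server would call the model and enforce memory and admission limits.

```python
from collections import deque
import random

MAX_BATCH = 4

def decode_step(seq):
    """Stand-in for generating one token; requests finish after a random length."""
    seq["generated"] += 1
    seq["done"] = seq["generated"] >= seq["target_len"]

waiting = deque({"id": i, "generated": 0, "done": False,
                 "target_len": random.randint(2, 8)} for i in range(10))
running, step = [], 0

while running or waiting:
    # Refill free slots every iteration, not once per batch.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    for seq in running:
        decode_step(seq)                      # one token for every active sequence
    finished = [s["id"] for s in running if s["done"]]
    running = [s for s in running if not s["done"]]
    step += 1
    if finished:
        print(f"step {step}: finished {finished}, running {len(running)}, queued {len(waiting)}")
```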
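Next, a toy version of the PagedAttention bookkeeping: block tables, reference counts, and copy-on-write. The class and method names are invented for illustration and do not mirror vLLM's internals, and the KV data itself is not modeled.

```python
BLOCK_SIZE = 16                            # tokens stored per KV block

class BlockManager:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table: list) -> list:
        """Share an existing sequence's blocks, e.g. a common prompt prefix."""
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

    def append_token(self, block_table: list, seq_len: int) -> None:
        """Make room for one more token, applying copy-on-write when needed."""
        if seq_len % BLOCK_SIZE == 0:                  # current blocks are full
            block_table.append(self.allocate())
        elif self.refcount[block_table[-1]] > 1:       # last block is shared
            self.refcount[block_table[-1]] -= 1
            block_table[-1] = self.allocate()          # real code would also copy the KV data

mgr = BlockManager(num_blocks=64)
prompt_table = [mgr.allocate() for _ in range(3)]      # a 48-token prompt fills 3 blocks
req_a = mgr.fork(prompt_table)                         # two requests reuse the same prefix pages
req_b = mgr.fork(prompt_table)
mgr.append_token(req_a, seq_len=48)                    # request A diverges into a private block
print("prefix:", prompt_table, "| req_a:", req_a, "| req_b:", req_b)
```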
Copyright 2025, Ran Ding