Lecture 7 — Parallelism (Tatsu)

Summary of Key Points

This lecture explains why training massive language models like GPT-3 is impossible on a single GPU and dives into the fundamental techniques used to distribute the workload across thousands of GPUs. The core challenge is managing two limited resources: memory (to hold the model's parameters, gradients, and optimizer states) and compute. The lecture covers the three primary strategies for parallelization: Data Parallelism (including its advanced form, ZeRO), Model Parallelism (split into Pipeline and Tensor Parallelism), and how they are combined into a "3D Parallelism" strategy to train state-of-the-art models efficiently.


Why Parallelism? The Need to Scale

Modern LLMs have grown so large that they exceed the memory capacity of any single GPU.

  • Memory Bottleneck: A 175B-parameter model like GPT-3 needs roughly 700GB just for its weights in fp32; once gradients and Adam optimizer states are added, the total climbs well past 2TB, far more than a single 80GB A100 GPU can hold (see the arithmetic sketch after this list).
  • Compute Bottleneck: Training these models requires on the order of 10^23 to 10^25 FLOPs, which would take decades to centuries on a single chip.
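
To make the memory numbers concrete, here is a back-of-the-envelope sketch. It assumes the common mixed-precision Adam recipe (bf16 weights and gradients plus fp32 master weights, momentum, and variance); the exact byte counts are this sketch's assumption, not figures quoted in the lecture.

```python
# Rough memory accounting for a 175B-parameter model trained with
# mixed-precision Adam (assumed recipe, not from the lecture).
PARAMS = 175e9

bytes_per_param = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}  # 16 bytes per parameter in total

total_bytes = PARAMS * sum(bytes_per_param.values())
print(f"Training state: {total_bytes / 1e12:.1f} TB")                 # ~2.8 TB
print(f"80GB A100s needed just to hold it: {total_bytes / 80e9:.0f}")  # ~35
```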

The solution is to use large clusters of interconnected GPUs, which requires breaking down the model and data in intelligent ways.


1. Data Parallelism (DP) & ZeRO

This is the simplest and most common form of parallelism.

  • How it Works: The model is replicated on every GPU. The global data batch is split, with each GPU processing its own mini-batch. After the forward and backward passes, the gradients from all GPUs are synchronized using an All-Reduce communication operation (see the sketch after this list).
  • Limitation: It doesn't solve the memory problem. Every GPU still needs to store a full copy of the model parameters, gradients, and optimizer states.
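
A minimal sketch of that gradient synchronization in raw torch.distributed (PyTorch's DistributedDataParallel wrapper does this for you, with extra overlap tricks); it assumes the process group has already been initialized, e.g. by launching with torchrun.

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel ranks after backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient over every replica, then divide so each
            # rank ends up with the mean gradient of the global batch.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

# Per-rank training step (sketch):
#   loss = model(local_minibatch).mean()
#   loss.backward()
#   allreduce_gradients(model)   # every rank now holds identical gradients
#   optimizer.step()             # so every replica stays in sync
```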

ZeRO: Solving the Memory Problem

The Zero Redundancy Optimizer (ZeRO) is an enhancement of Data Parallelism that shards (distributes) the model's memory footprint across GPUs.

  • Stage 1: Shard Optimizer States. The optimizer states (which are very large) are partitioned across GPUs. This provides significant memory savings with no extra communication cost.
  • Stage 2: Shard Gradients & Optimizer States. This further reduces memory by also partitioning the gradients.
  • Stage 3 (FSDP): Shard Everything. Known as Fully Sharded Data Parallel (FSDP) in PyTorch, this stage shards the model parameters as well. During computation, each GPU receives the specific parameters it needs for a layer via an All-Gather operation, performs the computation, and then discards them. This offers the maximum memory savings, allowing much larger models to be trained (see the sketch after this list).
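
A minimal Stage 3 sketch using PyTorch's FullyShardedDataParallel wrapper, launched with torchrun; the toy model and hyperparameters are placeholders, not details from the lecture.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; a real setup would wrap a transformer with an
# auto-wrap policy so each block becomes its own shard unit.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks
# (ZeRO Stage 3): parameters are All-Gathered just in time for each
# forward/backward pass and freed immediately afterward.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```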

2. Model Parallelism

Instead of replicating the model, Model Parallelism splits the model itself across multiple GPUs. This is essential when a model is too large to fit in a single GPU's memory, even with ZeRO.

Pipeline Parallelism (PP)

This approach splits the model vertically, by layers.

  • How it Works: Different GPUs (or groups of GPUs) hold different layers of the model. For example, GPUs 0-7 on the first node might hold the first 8 layers, GPUs 8-15 on the next node the following 8, and so on. Activations are passed from one pipeline stage to the next.
  • The "Bubble" Problem: This creates a lot of idle time ("bubbles") as later-stage GPUs wait for earlier ones to finish. This is mitigated by splitting the data batch into smaller "micro-batches" to keep the pipeline constantly working (see the sketch after this list).
  • Best Use Case: Because it has relatively low communication overhead, it's ideal for inter-node parallelism (across different machines).
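
To see how micro-batching shrinks the bubble, the sketch below evaluates the standard idle-time fraction (p - 1) / (m + p - 1) for a p-stage pipeline fed m micro-batches; the specific numbers are illustrative, not from the lecture.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 8 stages and no micro-batching (m = 1), GPUs idle 87.5% of the time;
# splitting the batch into 32 micro-batches cuts the bubble to about 18%.
for m in (1, 4, 8, 32, 128):
    print(f"stages=8, microbatches={m:3d} -> bubble = {bubble_fraction(8, m):.1%}")
```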

Tensor Parallelism (TP)

This approach splits the model horizontally, within each layer.

  • How it Works: Individual matrix multiplications within a transformer layer are split across multiple GPUs. For example, in Y = XA, the weight matrix A can be split column-wise across two GPUs, A = [A1, A2]. Each GPU computes its own slice of the output, and the slices are then combined (see the sketch after this list).
  • High Communication Cost: TP requires very frequent, high-bandwidth communication (All-Reduces in the forward and backward pass of every layer).
  • Best Use Case: It's only efficient on very fast, low-latency interconnects, making it perfect for intra-node parallelism (within a single server, like over NVLink).
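
The column-split matmul from the Y = XA example can be sketched in plain PyTorch on a single device to show the algebra; in a real Megatron-style implementation each shard lives on its own GPU and the concatenation is a collective (All-Gather) over NVLink.

```python
import torch

batch, d_in, d_out = 4, 8, 6
X = torch.randn(batch, d_in)
A = torch.randn(d_in, d_out)

# Column parallelism: split the weight matrix A = [A1, A2] along its
# output dimension, one shard per GPU.
A1, A2 = A.chunk(2, dim=1)

Y1 = X @ A1   # computed on GPU 0 in a real setup
Y2 = X @ A2   # computed on GPU 1

# Concatenating the slices (an All-Gather across GPUs) recovers Y = XA.
Y = torch.cat([Y1, Y2], dim=1)
assert torch.allclose(Y, X @ A, atol=1e-5)
```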

Sequence Parallelism

This is an extension of Tensor Parallelism that also splits activations along the sequence-length dimension. This is crucial for reducing activation memory: some activations (for example, around LayerNorm and dropout) are not sharded by standard Tensor Parallelism, and they become a major memory consumer at long sequence lengths.
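
A toy sketch of the idea on one device: token-wise operations such as LayerNorm give the same result whether they run on the full activation tensor or on sequence shards, which is what lets each tensor-parallel rank keep only its slice of the sequence.

```python
import torch

tp_degree = 2
x = torch.randn(4, 1024, 512)   # (batch, sequence, hidden) activations

# Sequence parallelism: shard activations along the sequence dimension so
# each rank stores only seq_len / tp_degree positions for token-wise ops.
shards = x.chunk(tp_degree, dim=1)

norm = torch.nn.LayerNorm(512)
# LayerNorm acts independently per token, so shard-by-shard application
# matches the unsharded result; shards are re-gathered (All-Gather) only
# where attention needs to see the full sequence.
y = torch.cat([norm(s) for s in shards], dim=1)
assert torch.allclose(y, norm(x), atol=1e-5)
```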


3. Putting It All Together: 3D Parallelism

In practice, training a massive model like Llama 3 (405B) uses all three strategies at once. A common and effective configuration (illustrated by the rank-mapping sketch after this list) is:

  1. Tensor Parallelism: Used within each machine to split individual layers across its 8 GPUs.
  2. Pipeline Parallelism: Used across machines to split the sequence of layers.
  3. Data Parallelism (with ZeRO/FSDP): Used across the entire cluster to process more data simultaneously.
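
One way to see how the three axes compose is to map each global rank to a (data, pipeline, tensor) coordinate. The cluster shape below (2048 GPUs split as 32 x 8 x 8) is purely illustrative, not the configuration used for Llama 3.

```python
# Illustrative 3D decomposition: 2048 GPUs = 32 (data) x 8 (pipeline) x 8 (tensor).
DP, PP, TP = 32, 8, 8
WORLD_SIZE = DP * PP * TP  # 2048

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Tensor parallelism varies fastest so the 8 ranks of a TP group share a
    node (NVLink); pipeline stages span nodes; data parallelism replicates
    the whole pipeline across the rest of the cluster.
    """
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

print(rank_to_coords(0))    # (0, 0, 0)
print(rank_to_coords(7))    # (0, 0, 7)  same node, same pipeline stage
print(rank_to_coords(8))    # (0, 1, 0)  next pipeline stage
print(rank_to_coords(64))   # (1, 0, 0)  next data-parallel replica
```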

This hybrid approach allows developers to balance memory constraints, communication bandwidth, and computational efficiency to train enormous models at an unprecedented scale.

Copyright 2025, Ran Ding