Lecture 8 — Parallelism (Percy)

Overview: The Unifying Theme of Data Movement

This lecture provides a practical, code-centric walkthrough of the distributed parallelism strategies introduced conceptually in the previous lecture. The unifying theme is that all forms of parallelism, from optimizations within a single GPU to strategies spanning thousands, are fundamentally about orchestrating computation to minimize data-movement bottlenecks. The lecture extends the memory hierarchy from the single-GPU context (L1 cache -> HBM) to a multi-GPU, multi-node cluster (HBM -> NVLink -> NVSwitch/Network), where communication costs become even more critical.

The goal is to move from theory to practice, showing how the three primary parallelism strategies are built from a core set of communication primitives in PyTorch.


Part 1: The Building Blocks of Distributed Computation

Before implementing any training strategy, it's essential to understand the tools used for inter-GPU communication.

1. The Hardware Hierarchy and Communication Libraries

  • Hardware: Communication performance is dictated by the physical links.
    • Intra-Node (within a server): GPUs are connected by ultra-fast, low-latency NVLink (e.g., 900 GB/s on an H100), routed through NVSwitch and bypassing the CPU.
    • Inter-Node (between servers): Nodes are connected by a high-speed network fabric such as InfiniBand or high-speed Ethernet. This is an order of magnitude slower than NVLink.
  • NCCL (NVIDIA Collective Communications Library): This is the highly optimized, low-level library that actually moves the data. It automatically detects the hardware topology (which GPUs are connected by NVLink vs. the network) and chooses the most efficient communication algorithm (e.g., ring-based or tree-based all-reduce).
  • torch.distributed: This is the high-level PyTorch API that provides a clean, hardware-agnostic interface for distributed programming. When you call a function like dist.all_reduce(), PyTorch in turn calls the optimized NCCL backend to perform the operation.
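
As a concrete illustration, here is a minimal sketch of this bootstrapping, assuming one process per GPU on a single node; the `setup`/`worker` names, tensor contents, and the localhost rendezvous address are illustrative, not the lecture's exact code.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank: int, world_size: int):
    # Rendezvous point for the process group (single-node, illustrative values).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    # NCCL is the backend of choice for GPU-to-GPU communication.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # one GPU per rank

def worker(rank: int, world_size: int):
    setup(rank, world_size)
    x = torch.ones(4, device="cuda") * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # PyTorch dispatches this to NCCL
    print(f"rank {rank}: {x}")                # every rank prints the same summed tensor
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Launching with `torchrun` instead of `mp.spawn` is equally common; it sets the rendezvous environment variables for you, whereas the spawn route above keeps the example self-contained.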

2. Collective Operations: The Primitives of Parallelism

These are the fundamental communication patterns that form the basis of all distributed algorithms.

  • Terminology: A distributed job consists of a world_size (total number of GPUs), each with a unique rank (ID).
  • Key Operations:
    • dist.broadcast: Sends a tensor from a single rank to all other ranks.
    • dist.all_reduce: The cornerstone of Data Parallelism. It gathers tensors from all ranks, combines them with an operation (typically SUM or AVG), and distributes the final result back to all ranks.
    • dist.all_gather: Gathers tensors from all ranks and concatenates them, giving every rank the complete, combined tensor. This is essential for Tensor Parallelism.
    • dist.reduce_scatter: Reduces tensors from all ranks and scatters the result, so each rank receives a unique chunk of the final tensor.
    • dist.send / dist.recv: Point-to-point operations used to send a tensor from one specific rank to another. This is the foundation of Pipeline Parallelism.
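
The snippet below is a hedged sketch of these four collectives in action; it assumes it is called from a worker like the one sketched earlier (one process per GPU, NCCL process group already initialized) and that the PyTorch/NCCL build is recent enough to support ReduceOp.AVG.

```python
import torch
import torch.distributed as dist

def demo_collectives(rank: int, world_size: int):
    device = torch.device(f"cuda:{rank}")

    # broadcast: rank 0's tensor overwrites every other rank's copy.
    x = torch.full((4,), float(rank), device=device)
    dist.broadcast(x, src=0)                       # x is now all zeros everywhere

    # all_reduce: combine across ranks; every rank sees the same result.
    g = torch.full((4,), float(rank), device=device)
    dist.all_reduce(g, op=dist.ReduceOp.AVG)       # g == mean of all ranks' values

    # all_gather: every rank ends up with the concatenation of all shards.
    shard = torch.full((2,), float(rank), device=device)
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)
    full = torch.cat(gathered)                     # length 2 * world_size

    # reduce_scatter: sum across ranks, then each rank keeps one unique chunk.
    inputs = [torch.full((2,), float(rank), device=device) for _ in range(world_size)]
    out = torch.empty(2, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
    return x, g, full, out
```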

Part 2: Implementing the Three Parallelism Strategies in PyTorch

The lecture provides simplified, self-contained PyTorch code to demonstrate how each strategy works at a fundamental level, using a simple multi-layer perceptron (MLP) as the model.

1. Data Parallelism (DDP)

This is the most straightforward approach to distributing training.

  • Sharding Strategy: The data batch is split evenly across all ranks. The model is fully replicated on every rank.
  • Step-by-Step Implementation:
    1. Setup: Each rank initializes an identical copy of the model and its own optimizer.
    2. Forward/Backward Pass: Each rank performs the forward and backward pass on its local slice of the data. At the end of the backward pass, each rank has gradients that are unique to its data slice.
    3. Communication (The Core Step): A dist.all_reduce call is made on each parameter's gradient tensor (param.grad) with the operation dist.ReduceOp.AVG. This is the only communication step. It efficiently averages the gradients across all GPUs.
    4. Optimizer Step: After the all_reduce, every rank now has the exact same averaged gradients. Each rank then calls optimizer.step() locally, which updates its copy of the model weights in an identical manner.
  • Outcome: The model parameters remain perfectly synchronized across all GPUs after every training step.
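
A minimal sketch of one such data-parallel step is shown below, assuming one process per GPU with the NCCL process group already initialized; the toy model, loss, and batch slicing are illustrative. (In practice you would wrap the model in torch.nn.parallel.DistributedDataParallel, which performs this same gradient averaging but overlaps it with the backward pass.)

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def ddp_step(rank: int, world_size: int):
    torch.manual_seed(0)                        # same seed -> identical initial weights everywhere
    model = nn.Linear(16, 16).cuda(rank)        # full model replicated on every rank
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # Each rank sees a different slice of the global batch.
    global_batch = torch.randn(8 * world_size, 16)
    local_batch = global_batch[rank * 8:(rank + 1) * 8].cuda(rank)

    loss = model(local_batch).square().mean()   # toy loss
    loss.backward()                             # gradients reflect only the local slice

    # The only communication step: average gradients across all ranks.
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.AVG)

    optimizer.step()                            # identical update on every rank
    optimizer.zero_grad(set_to_none=True)
```

Note that global_batch comes out identical on every rank only because of the shared seed; in real training a DistributedSampler would hand each rank its own shard of the dataset.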

2. Tensor Parallelism (TP)

This strategy is used to run a single, very wide layer across multiple GPUs.

  • Sharding Strategy: The model's weight matrices are split (sharded) column-wise across the ranks. The input data is replicated on every rank.
  • Step-by-Step Implementation (Forward Pass):
    1. Setup: Each rank initializes only its slice of the model's weights. For a linear layer Y = XA + B, each rank would hold a vertical slice of A and a corresponding slice of B.
    2. Local Computation: Each rank computes a partial activation by matrix-multiplying the full input X with its local weight slice A_i. The result is a partial output Y_i.
    3. Communication (The Core Step): A dist.all_gather operation is performed on the partial activations Y_i. This collects all partial outputs from all ranks and concatenates them.
    4. Final Output: After the all_gather, every rank has the full, final activation Y, which can then be passed through the non-linear function and serve as the input to the next tensor-parallel layer.
  • Outcome: This approach requires frequent communication within every layer, making it viable only on the ultra-fast intra-node NVLink connections. The backward pass involves a reduce_scatter on the gradients.
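
The sketch below illustrates the forward pass of one column-sharded linear layer under these assumptions (NCCL group initialized, module moved to the rank's GPU, output dimension divisible by the number of ranks). The `TPLinear` name is made up for illustration, and because plain dist.all_gather does not track gradients, a complete implementation would wrap it in a custom autograd function whose backward performs the reduce_scatter mentioned above.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class TPLinear(nn.Module):
    """One linear layer Y = XA + b with A and b sharded column-wise across ranks."""

    def __init__(self, d_in: int, d_out: int, world_size: int):
        super().__init__()
        assert d_out % world_size == 0, "output dim must divide evenly across ranks"
        self.world_size = world_size
        # Each rank owns only a vertical (column) slice of A and the matching slice of b.
        self.A_shard = nn.Parameter(torch.randn(d_in, d_out // world_size) / d_in**0.5)
        self.b_shard = nn.Parameter(torch.zeros(d_out // world_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local computation: the full (replicated) input times the local weight slice.
        y_shard = x @ self.A_shard + self.b_shard
        # Communication: collect every rank's columns so all ranks hold the full activation.
        shards = [torch.empty_like(y_shard) for _ in range(self.world_size)]
        dist.all_gather(shards, y_shard)
        # Re-insert the local shard so at least the local gradient path stays in autograd.
        shards[dist.get_rank()] = y_shard
        return torch.cat(shards, dim=-1)
```

Stacking several such layers with non-linearities in between reproduces the pattern described above: every layer's forward pass triggers an all_gather, which is why this style of tensor parallelism is usually confined to a single NVLink-connected node.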

3. Pipeline Parallelism (PP)

This strategy is used to run a single, very deep model across multiple GPUs.

  • Sharding Strategy: The model's layers are split into sequential stages, and each stage is assigned to a different rank.
  • Step-by-Step Implementation (Forward Pass with Micro-batching):
    1. Setup: Each rank initializes only the layers in its assigned stage. The data batch is split into several smaller micro-batches.
    2. Pipelining:
      • Rank 0 processes the first micro-batch and uses dist.send to pass the resulting activation to Rank 1.
      • As Rank 1 works on the first micro-batch, Rank 0 immediately starts processing the second micro-batch, keeping the hardware busy.
      • This continues, creating a "pipeline" of micro-batches flowing through the stages. Each intermediate rank uses dist.recv to get activations from the previous rank and dist.send to pass them to the next.
    3. The "Bubble": This process isn't perfectly efficient; there is unavoidable idle time at the beginning and end of the pipeline, known as the "pipeline bubble." Using more micro-batches reduces the relative size of this bubble but increases communication overhead.
  • Outcome: This strategy involves the least total communication volume, making it the preferred method for scaling across the slower inter-node network connections.
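
Below is a hedged sketch of this forward pass with blocking send/recv, assuming an initialized NCCL process group where rank i owns stage i; the one-layer stages, shapes, and the `pipeline_forward` name are illustrative. Real implementations use asynchronous isend/irecv and interleaved schedules to further hide communication.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def pipeline_forward(rank: int, world_size: int, batch: torch.Tensor,
                     num_micro_batches: int, micro_batch_size: int, d: int):
    # `batch` is only consumed by rank 0; later ranks receive activations instead.
    # Each rank holds only its own stage (here a single linear layer + ReLU).
    stage = nn.Sequential(nn.Linear(d, d), nn.ReLU()).cuda(rank)
    outputs = []

    for i in range(num_micro_batches):
        if rank == 0:
            # The first stage reads its micro-batch directly from the data.
            x = batch[i * micro_batch_size:(i + 1) * micro_batch_size].cuda(rank)
        else:
            # Later stages receive the previous stage's activations for this micro-batch.
            x = torch.empty(micro_batch_size, d, device=f"cuda:{rank}")
            dist.recv(x, src=rank - 1)

        y = stage(x)

        if rank < world_size - 1:
            # Hand the activation to the next stage, then move on to the next micro-batch.
            dist.send(y, dst=rank + 1)
        else:
            # The last stage accumulates the final activations for the whole batch.
            outputs.append(y)

    return torch.cat(outputs) if outputs else None
```

Even in this simplified form the bubble is visible: the last stage sits idle until the first micro-batch has traversed every earlier stage, which is exactly the warm-up cost that using more micro-batches amortizes.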