Links
- Lecture video: https://youtu.be/LPv1KfUXLCo
- Course materials: lecture 4.pdf
Overview: The Power of Conditional Computation
This lecture introduces Mixture of Experts (MoE), the architectural paradigm enabling the massive parameter counts of today's top-performing language models (e.g., Mixtral, DBRX, DeepSeek-V3, Grok, Llama 4). The core principle of MoE is conditional computation: instead of processing every token with the same large Feed-Forward Network (FFN), an MoE layer contains a large pool of smaller FFNs (the "experts") and activates only a tiny fraction of them for each input token.
This approach breaks the traditional link between model size and computational cost. It allows models to scale to trillions of parameters—storing vast amounts of knowledge—while keeping the number of active parameters and the required FLOPs for training and inference manageably low. This results in models that are both incredibly knowledgeable and remarkably fast.
Why Use MoEs? The Three Pillars of Success
The rapid adoption of MoE architectures by leading AI labs is due to three compelling, empirically verified advantages over dense models.
- Superior Performance for a Fixed Compute Budget: An MoE model consistently achieves a lower training loss and higher scores on downstream benchmarks than a dense model with the same number of active parameters. Adding more experts to the pool (growing total, but not active, parameters) continues to improve performance without increasing the per-token FLOPs. In short, MoEs learn more efficiently per unit of compute.
- Dramatically Faster Training: Because they learn more efficiently, MoE models can reach a target performance level much faster than their dense counterparts. The lecture highlights empirical results showing speedups of 2x to 7x, which translates into massive savings in training time and cost.
- Designed for Scale via Expert Parallelism: MoE is a naturally parallelizable architecture. The most common setup places different experts on different GPUs; an efficient All-to-All communication operation then shuffles tokens across the network to their assigned experts (a minimal sketch of this dispatch follows the list). This expert parallelism is a distinct and powerful scaling dimension alongside data, tensor, and pipeline parallelism.
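To make the All-to-All dispatch concrete, here is a minimal PyTorch sketch, assuming an already-initialized `torch.distributed` process group with one expert per rank; the function name and the two-phase counts-then-tokens exchange are illustrative choices, not a specific framework's API.

```python
# Minimal sketch of expert-parallel token dispatch via All-to-All.
# Assumes dist.init_process_group(...) has already been called and that each
# rank hosts one expert. With the NCCL backend, tensors must live on the GPU.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_for_rank: list[torch.Tensor]) -> list[torch.Tensor]:
    """tokens_for_rank[r]: tokens this rank routed to the expert on rank r.
    Returns the tokens this rank's expert must process, grouped by sender."""
    world_size = dist.get_world_size()
    hidden = tokens_for_rank[0].shape[-1]
    device = tokens_for_rank[0].device

    # Phase 1: exchange token counts so every rank can size its receive buffers.
    send_counts = [torch.tensor([t.shape[0]], device=device) for t in tokens_for_rank]
    recv_counts = [torch.zeros(1, dtype=torch.long, device=device) for _ in range(world_size)]
    dist.all_to_all(recv_counts, send_counts)

    # Phase 2: exchange the token hidden states themselves.
    recv_tokens = [
        torch.empty(int(c.item()), hidden, dtype=tokens_for_rank[0].dtype, device=device)
        for c in recv_counts
    ]
    dist.all_to_all(recv_tokens, tokens_for_rank)
    return recv_tokens
```

After the experts run, a second All-to-All sends the outputs back to the ranks that own the original tokens.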
The Mechanics of an MoE Layer
An MoE layer is a drop-in replacement for the FFN sub-block within a Transformer layer. It comprises two key components:
- A Pool of `N` Experts: Each expert is simply a standard FFN (e.g., a SwiGLU network). `N` can range from 8 (Mixtral) to over 200 (DeepSeek-V3).
- A Router (or Gating Network): A small, learnable linear layer (`W_g`) that determines which experts a token should be sent to (both components are sketched in code below).
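A minimal PyTorch sketch of these two components, assuming SwiGLU experts as in the lecture; the layer sizes and expert count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a standard SwiGLU feed-forward network."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d_model, d_ff, num_experts = 1024, 2816, 8            # illustrative sizes
experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(num_experts))
router = nn.Linear(d_model, num_experts, bias=False)  # W_g: one logit per expert
```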
The Top-K Routing Process
The dominant routing strategy used in virtually all modern MoEs is Top-K Routing:
- Get Router Logits: For an input token with hidden state `x`, the router computes a score for each of the `N` experts: `logits = x @ W_g`.
- Select Top-K Experts: The `k` experts with the highest logit scores are selected. The value of `k` (the number of active experts per token) is a critical hyperparameter:
  - Switch Transformer: pioneered `k=1`.
  - Mixtral, Grok: use `k=2`.
  - DBRX, Qwen: use `k=4`.
  - DeepSeek: uses `k` up to 8.
- Compute Gating Weights: The logits of the selected `k` experts are passed through a `softmax` to determine their weights.
- Process and Combine: The token's hidden state `x` is processed by each of the `k` selected experts, and the final output of the MoE layer is the weighted sum of the individual expert outputs (see the sketch after this list).
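A minimal sketch of this forward pass, reusing the `experts` and `router` defined above; the per-expert loop is written for clarity, not efficiency, and real implementations batch this dispatch.

```python
import torch
import torch.nn.functional as F

def moe_forward(x: torch.Tensor, router, experts, k: int = 2) -> torch.Tensor:
    """x: (num_tokens, d_model) token hidden states."""
    logits = router(x)                              # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)  # keep the k highest-scoring experts
    gates = F.softmax(topk_logits, dim=-1)          # weights over the selected experts only

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # Which tokens chose expert e, and in which of their k slots.
        token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        out[token_idx] += gates[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
    return out

# Usage: y = moe_forward(torch.randn(16, d_model), router, experts, k=2)
```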
Core Challenges and Practical Solutions in Training MoEs
The discrete and dynamic nature of MoE routing introduces significant challenges that require specialized solutions.
1. The Load Balancing Problem
Problem: The router can develop "favorite" experts, sending a disproportionate number of tokens to them. This leads to computational imbalance, inefficient hardware use, and poor training dynamics.
Solution: A heuristic auxiliary load balancing loss is added to the model's main loss function. This loss is calculated from the fraction of tokens dispatched to each expert and the average router probability assigned to that expert across the batch. It penalizes the model if any expert receives too many or too few tokens, forcing the router to learn a more uniform distribution. This loss is a critical, non-negotiable component for successful MoE training.
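A minimal sketch of this auxiliary loss in the style of the Switch Transformer formulation, assuming the `logits` and `topk_idx` tensors from the routing sketch above; the exact form and coefficient vary by paper, so treat this as illustrative.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(logits: torch.Tensor, topk_idx: torch.Tensor) -> torch.Tensor:
    """logits: (num_tokens, num_experts) router scores; topk_idx: (num_tokens, k)."""
    num_tokens, num_experts = logits.shape
    probs = F.softmax(logits, dim=-1)

    # f_i: fraction of routing assignments that went to expert i.
    dispatch = F.one_hot(topk_idx, num_experts).float()   # (num_tokens, k, num_experts)
    f = dispatch.sum(dim=(0, 1)) / topk_idx.numel()

    # P_i: average router probability given to expert i over the batch.
    P = probs.mean(dim=0)

    # Minimized when both f and P are uniform; scaled by a small coefficient
    # (e.g., 0.01) before being added to the language-modeling loss.
    return num_experts * torch.sum(f * P)
```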
2. Training Instability
Problem: The `softmax` in the router is highly sensitive, and small floating-point errors (especially in `bfloat16`) can cause massive changes in the output probabilities, leading to loss spikes and unstable training.
Solutions:
- Router Precision: The router's computations are often performed in `float32` for greater numerical stability, even if the rest of the model uses `bfloat16`.
- Router Z-Loss: An additional auxiliary loss that regularizes the magnitude of the logits entering the `softmax`, further enhancing stability.
- Stochasticity: Some methods add a small amount of random noise to the router logits during training. This "jitter" prevents the router from becoming overly confident and helps with exploration, making the learned routing more robust. (A combined sketch of all three tricks follows this list.)
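A minimal sketch combining the three tricks, assuming the router weight `w_g` is available as a plain tensor; the z-loss coefficient, jitter scale, and function name are illustrative, roughly following ST-MoE-style practice.

```python
import torch

def stabilized_router_logits(x: torch.Tensor, w_g: torch.Tensor, training: bool,
                             jitter_eps: float = 1e-2):
    """x: (num_tokens, d_model); w_g: (num_experts, d_model) router weight."""
    # Router precision: run the router matmul in float32 even in a bfloat16 model.
    logits = x.float() @ w_g.float().t()

    # Stochasticity: add a little noise to the logits during training only
    # (a simple additive variant of the "jitter" trick).
    if training:
        logits = logits + torch.randn_like(logits) * jitter_eps

    # Router z-loss: penalize the squared log-sum-exp of the logits, which keeps
    # their magnitude (and hence the softmax sharpness) under control.
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()
    return logits, z_loss  # add z_loss * small_coefficient (e.g., 1e-3) to the total loss
```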
3. Fine-tuning and Overfitting
Problem: Due to their massive number of parameters, MoE models are prone to overfitting when fine-tuned on smaller datasets.
Solution: A common and effective strategy is to freeze the parameters of all the expert FFNs during fine-tuning. Only the non-expert parts of the model (such as the attention blocks and normalization layers) and the router itself are updated.
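A minimal sketch of this freezing strategy, assuming expert parameters can be identified by an "experts" substring in their names (true for the modules sketched earlier; adjust the filter for a real model).

```python
import torch.nn as nn

def freeze_experts_for_finetuning(model: nn.Module) -> None:
    """Freeze expert FFN weights; keep attention, norms, and the router trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "experts" not in name
```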
Architectural Innovations in Modern MoEs
The latest MoE models have evolved beyond the basic design, incorporating several key ideas:
- Fine-Grained and Shared Experts (DeepSeek & Qwen):
  - Fine-Grained: The trend is towards using a much larger number of smaller, more specialized experts rather than a few large ones.
  - Shared: In addition to the `N` routed experts, these models include 1 or 2 "shared" experts that process every single token. This ensures a consistent baseline of knowledge is applied to all inputs, complementing the specialized knowledge from the routed experts (see the first sketch after this list).
- Upcycling (A Highly Efficient Training Method):
  - Instead of training an MoE from scratch, this technique leverages a pre-trained dense model.
  - The process is simple: take the trained FFN weights from the dense model and replicate them `N` times to initialize each of the `N` experts in a new MoE model. The router is initialized from scratch (see the second sketch after this list).
  - This "upcycled" model is then fine-tuned. It provides a massive head start, drastically reducing the time and data needed to train a high-performing MoE model. This was a key technique for models like Qwen-MoE.
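First, a minimal sketch of adding a shared expert on top of routed experts, reusing the `SwiGLUExpert`, `experts`, `router`, and `moe_forward` names from the sketches above; whether the shared output is simply summed or further gated varies by model, so the plain sum here is an illustrative choice.

```python
# Shared expert: applied to every token, alongside the routed (specialized) experts.
shared_expert = SwiGLUExpert(d_model, d_ff)

def moe_with_shared(x):
    routed_out = moe_forward(x, router, experts, k=2)  # specialized, per-token routing
    shared_out = shared_expert(x)                      # consistent baseline for all tokens
    return routed_out + shared_out
```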
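Second, a minimal sketch of upcycling, assuming a trained dense FFN module with the same architecture as the experts; function and variable names are illustrative.

```python
import copy
import torch.nn as nn

def upcycle_from_dense(dense_ffn: nn.Module, num_experts: int, d_model: int):
    """Initialize an MoE layer from a pre-trained dense FFN."""
    # Replicate the dense FFN's trained weights N times, one copy per expert.
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    # The router has no dense counterpart, so it starts from a fresh initialization.
    router = nn.Linear(d_model, num_experts, bias=False)
    return experts, router
```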