Links
- Lecture video: https://youtu.be/OSYuUqGBQxw
- Course materials: lecture 11.pdf
Introduction: From Theoretical Laws to Practical Recipes
In the preceding lectures, we established the foundational principles of scaling laws—the remarkable discovery that the performance of Transformer language models improves as a predictable power-law function of model size, dataset size, and compute. These laws provide a theoretical roadmap, a beacon of predictability in the otherwise chaotic and expensive world of large-scale AI. However, this lecture pivots from the "what" and "why" of scaling to the far more granular and challenging question of "how." It addresses the formidable engineering task that every major AI lab faces: if you are given a staggering budget of compute—say, ten thousand H100s for a month—how do you actually use it? How do you navigate the infinite space of hyperparameters to configure a multi-million dollar training run with a reasonable chance of success?
This lecture delves into the practical art and emerging science of scaling. It confronts the messy realities that complicate the clean elegance of theoretical scaling laws. We will discover that standard model parametrizations are fundamentally unstable at scale, causing optimal hyperparameters to shift unpredictably and rendering small-scale experiments useless. We will also confront the astronomical cost of the very analysis—the "Chinchilla-style" IsoFLOP curves—needed to guide these decisions.
To navigate these challenges, we will explore the cutting-edge solutions that have defined modern LLM training. We will conduct a deep dive into Maximal Update Parametrization (µP), a principled theoretical framework for achieving stable, transferable hyperparameters. We will then dissect clever techniques, such as the Warmup-Stable-Decay (WSD) learning rate schedule, which drastically reduce the cost of fitting scaling laws. Through a series of detailed case studies of recent, openly documented models like Cerebras-GPT, MiniCPM, and DeepSeek, we will synthesize these concepts into a practical playbook for training language models in the modern era. This lecture is about transforming scaling from a high-stakes gamble into a rigorous, predictable, and efficient engineering discipline.
Part 1: The Hyperparameter Dilemma and the µP Revolution
The most significant barrier to applying scaling laws in practice is the instability of optimal hyperparameters across different scales. A hyperparameter setting that is perfect for a 1-billion-parameter model could be catastrophically bad for a 100-billion-parameter model.
The Core Problem: Shifting Optima in Standard Parametrization (SP)
In a Standard Parametrization (SP)—the default setup in most deep learning frameworks—the relationship between model parameters and their updates is not scale-invariant. As the width of a model (its hidden dimension d_model) increases, the magnitudes of activations, gradients, and parameter updates can change in complex ways.
This has a profoundly negative consequence for hyperparameter tuning, especially for the learning rate. As shown in empirical studies, if you plot the validation loss against a range of learning rates for models of different widths (e.g., 128, 512, 2048, 8192), you will see a series of U-shaped curves. In SP, the minimum point (the optimal learning rate) of this "U" shifts horizontally as the width increases. The optimal learning rate for a small model provides no useful information about the optimal learning rate for a large model. This breaks the fundamental premise of scaling analysis, which is to use cheap, small-scale experiments to inform expensive, large-scale decisions. Without a solution, every new model scale would require its own astronomically expensive hyperparameter search.
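To make the sweep procedure concrete, here is a toy sketch of the measurement loop: it trains a tiny MLP (not a Transformer) on synthetic data at several widths and learning rates, then reports the loss-minimizing learning rate per width. The widths, learning-rate grid, step count, and synthetic task are all invented for illustration; the point is the shape of the experiment, not its numbers.

```python
import torch

def train_once(width: int, lr: float, steps: int = 200) -> float:
    """Train a tiny MLP on a fixed synthetic task and return its final loss."""
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(32, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    x = torch.randn(1024, 32)
    y = torch.randn(1024, 1)                      # synthetic targets
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# One U-shaped loss-vs-LR curve per width; in real LM sweeps under SP,
# the argmin of these curves shifts as the width grows.
for width in [128, 512, 2048]:
    losses = {lr: train_once(width, lr) for lr in [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]}
    best_lr = min(losses, key=losses.get)
    print(f"width={width}: best lr={best_lr:.0e}")
```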
µ-Parametrization (µP): A Principled Solution for Stable Transfer
Maximal Update Parametrization (µP), developed by researchers at Microsoft and OpenAI, is a theoretical framework that provides a direct solution to this problem. It is a set of specific, targeted modifications to a model's initialization, per-layer learning rates, and internal scaling factors, designed to make the training dynamics, and therefore the optimal hyperparameters, invariant to model width.
The Theory Behind µP (An Accessible View)
The theory of µP is mathematically deep, but its core assertions are intuitive. It posits that for training to be stable and transferable across scales, the model's parametrization must satisfy two key properties as its width n goes to infinity:
- Stable Activations: The magnitude of the activations at initialization should remain constant (Θ(1)).
- Stable Updates: The magnitude of the change in activations after the very first gradient step should also be constant (Θ(1)).
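Stated a touch more formally (this is an informal paraphrase of the two conditions, not the full Tensor Programs derivation; the notation h(x) for a hidden activation vector is introduced only for illustration):

```latex
% For a hidden activation vector h(x) \in \mathbb{R}^{n}, as the width n \to \infty:
\begin{align*}
  \text{Stable activations at initialization:}\quad
      & h_i(x)\big|_{t=0} = \Theta(1) \quad \text{for each coordinate } i,\\
  \text{Stable updates after the first step:}\quad
      & \Delta h_i(x) = h_i(x)\big|_{t=1} - h_i(x)\big|_{t=0} = \Theta(1).
\end{align*}
```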
By working backward from these two assertions, the theory derives a set of precise scaling rules for how different types of parameters should be initialized and optimized. A "matrix-like" parameter (one with two dimensions that scale with model width, like an FFN weight matrix) requires different scaling than an "other" parameter (like an embedding layer).
The µP Recipe for Transformers
Applying the theory to the Transformer architecture yields a concrete set of rules that differ from Standard Parametrization:
- Initialization Variance: The variance of the initial weights for matrix-like parameters (QKV projections, FFN layers, output projections) should be scaled by 1 / width_multiplier.
- Learning Rate Scaling: The learning rates for these same matrix-like parameters should be scaled by 1 / width_multiplier. Essentially, wider layers get smaller learning rates.
- Attention Score Scaling: The attention logits (Q @ K^T) are scaled by 1 / d_head, rather than the conventional 1 / sqrt(d_head).
- Output Logit Scaling: The final logits produced by the unembedding layer are scaled down by 1 / width_multiplier.
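Below is a minimal sketch of how these rules might be wired into a training script. The base width, the width-multiplier bookkeeping, and the helper names are illustrative assumptions rather than any particular codebase's API (real implementations, e.g. Microsoft's mup package, handle many more details such as per-layer base shapes):

```python
import math
import torch

d_base = 256                   # width of the small, tuned "proxy" model
d_model = 2048                 # width of the target model
m = d_model / d_base           # width multiplier
base_lr = 3e-3                 # learning rate found on the proxy model
d_head = 64

# Initialization: matrix-like weights use a 1/fan_in variance, i.e. the base
# model's variance scaled down by 1/m as the width multiplier grows.
w_ffn = torch.nn.Parameter(torch.empty(4 * d_model, d_model))
torch.nn.init.normal_(w_ffn, std=1.0 / math.sqrt(d_model))

# Embedding-like ("other") parameters keep a width-independent init and LR.
w_emb = torch.nn.Parameter(torch.randn(50_000, d_model) * 0.02)

# Per-parameter learning rates: matrix-like parameters are scaled by 1/m.
optimizer = torch.optim.AdamW([
    {"params": [w_ffn], "lr": base_lr / m},   # matrix-like: lr shrinks with width
    {"params": [w_emb], "lr": base_lr},       # "other": lr stays fixed
])

def attention_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # µP scales attention logits by 1/d_head instead of 1/sqrt(d_head).
    return (q @ k.transpose(-2, -1)) / d_head

def output_logits(hidden: torch.Tensor, w_unembed: torch.Tensor) -> torch.Tensor:
    # The final logits are scaled down by the width multiplier.
    return (hidden @ w_unembed.T) / m
```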
When these rules are implemented, the magic happens: the optimal learning rate and other key hyperparameters become remarkably stable across width. The U-shaped tuning curves for different model widths line up, with their minima occurring at approximately the same learning rate. The best learning rate found on a tiny 40M-parameter "proxy" model can be applied directly and successfully to a 13B-parameter model.
Empirical Validation: Case Studies on µP's Robustness
- Cerebras-GPT: This project was one of the first to publicly validate µP at scale. They trained a family of Chinchilla-optimal models from 111M to 13B parameters. Their key finding was that the models trained with µP followed the predicted scaling law almost perfectly, exhibiting smooth, predictable performance improvements. In contrast, the models trained with SP were much noisier and deviated significantly from the predicted trend, highlighting the practical stability benefits of µP.
- "A Large-Scale Exploration of µ-Transfer" (Lingle, 2024): This recent, exhaustive study tested the limits of µP, confirming its core claims and identifying its failure points.
- What µP is Robust To: The stable transfer of the optimal learning rate holds true even with modern architectural changes like SwiGLU or Squared ReLU nonlinearities, and across a range of batch sizes.
- What Breaks µP:
- Trainable Gains in RMSNorm: This was the most critical finding. The learnable element-wise gain (gamma) in a standard RMSNorm layer violates the theoretical assumptions of µP, causing the optimal learning rate to shift again. The practical engineering takeaway is crucial: when using µP, use RMSNorm without the learnable element-wise parameters (a minimal gain-free RMSNorm sketch follows this list).
- Exotic Optimizers: Optimizers that are not based on gradient magnitudes, such as Lion (which relies on the sign of the momentum), are incompatible with µP's assumptions and do not transfer well.
- Strong Weight Decay: High values of weight decay (lambda = 0.1) also disrupted the stable transfer, suggesting that regularization strength may need to be tuned separately.
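To make the RMSNorm takeaway concrete, here is a minimal gain-free RMSNorm written from scratch. The class name and eps value are illustrative; this is one simple way to drop the learnable parameters, not the exact module used in any of the cited papers.

```python
import torch

class RMSNormNoGain(torch.nn.Module):
    """RMSNorm with no trainable element-wise gain, per the µ-transfer finding."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps               # no nn.Parameter here: nothing to learn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each vector by its root-mean-square over the last dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms
```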
Part 2: Making Scaling Analysis Affordable and Accurate
The second major practical challenge is efficiently determining the compute-optimal trade-off between model size (N) and dataset size (D) for a given FLOPs budget (C).
The Prohibitive Cost of the Original Chinchilla Analysis
The original Chinchilla paper presented three methods for finding the optimal N and D. The most robust of these, the "IsoFLOP profile" method, is incredibly expensive. It requires performing hundreds of full training runs to populate a grid of (model size, tokens trained) points and then fitting curves to find the optimal frontier. Because a standard cosine learning rate schedule's shape depends on the total number of training steps, you cannot simply stop a long run early to get the loss for a shorter run; you must train a new model from scratch for that specific duration. This makes the cost of the analysis quadratic in the number of data points you wish to measure.
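To see where the quadratic cost comes from, consider a simplified back-of-the-envelope calculation; the even spacing of the token budgets is purely an illustrative assumption:

```latex
% Fix one model size and measure the loss at k token budgets D_i = i \cdot \Delta D.
% With a cosine schedule, each budget needs its own run trained from scratch:
\mathrm{Cost}_{\text{cosine}}
  \;\propto\; \sum_{i=1}^{k} D_i
  \;=\; \sum_{i=1}^{k} i \,\Delta D
  \;=\; \frac{k(k+1)}{2}\,\Delta D
  \;=\; O(k^{2}) \cdot \Delta D .
```

The schedule described next replaces this with a single stable run of length k·ΔD plus k short decay runs, so the total cost grows only linearly in the number of measurement points.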
The Solution: Warmup-Stable-Decay (WSD) and Multi-Step Learning Rate Schedules
Recent work from the teams behind MiniCPM and DeepSeek has introduced a highly efficient solution to this problem by changing the learning rate schedule.
- The WSD Schedule: Instead of a single, monolithic cosine decay, the WSD schedule is broken into three distinct phases:
- A short warmup phase.
- A long stable phase, where the model trains at a constant, high learning rate. This is where the bulk of the learning occurs.
- A short, rapid decay phase at the very end of training.
- The Key Insight and Efficiency Gain: The crucial property of this schedule is that the model's state during the stable phase is a good "generic" starting point. You can take any checkpoint from the stable training phase, "warm-start" a new run from it, and then apply just the short, cheap decay phase. The final loss you achieve will be nearly identical to the loss you would have obtained from a full, expensive cosine schedule trained to that same number of tokens.
This innovation transforms the economics of scaling analysis. Instead of running hundreds of full training runs, an organization can now perform one long run in the stable phase, saving periodic checkpoints. Then, to populate the IsoFLOP curves, they simply launch dozens of very short, cheap "decay runs" from these saved checkpoints. This reduces the cost of fitting the crucial data scaling curve from quadratic to linear, making this essential analysis vastly more accessible.
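A minimal sketch of a WSD-style schedule, and of the cheap decay runs it enables, is shown below. The phase fractions, the linear decay shape, and the helper names are illustrative assumptions, not the exact MiniCPM or DeepSeek recipes.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Learning rate at a given step of a Warmup-Stable-Decay schedule."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                       # 1) short warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:                         # 2) long constant "stable" phase
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps   # 3) short, rapid decay

# To populate a data-scaling curve: run ONE long job in the stable phase,
# save checkpoints at the token counts of interest, then launch a short
# decay run from each checkpoint -- linear rather than quadratic total cost.
def decay_branch_lrs(branch_steps: int, peak_lr: float) -> list[float]:
    """LR values for a short decay run warm-started from a stable checkpoint."""
    return [peak_lr * (branch_steps - s) / branch_steps for s in range(branch_steps)]
```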
Case Studies in Modern Scaling Recipes
- MiniCPM: This project combined both of the lecture's key ideas. They used µP to ensure their hyperparameters were stable across model sizes and then used the WSD scheduler to efficiently perform a full Chinchilla-style joint fit of the scaling law. Their analysis led them to a startling conclusion: the optimal data-to-model ratio for their architecture was approximately 192 tokens per parameter, a nearly 10x increase over the original Chinchilla recommendation of roughly 20 tokens per parameter, highlighting that these "optimal" ratios can be architecture-dependent.
- DeepSeek: This team took a different but equally principled approach. They chose not to use µP and instead performed a direct grid search on small models to fit power laws for the optimal batch size and learning rate as functions of the compute budget, which they then extrapolated to larger scales (a small sketch of this kind of power-law fit follows this list). For the model-data trade-off, they used the IsoFLOP analysis (Chinchilla's second method), made efficient by their own multi-step (WSD-style) learning rate schedule.
- Llama 3: While the technical report is less detailed, it confirms the use of IsoFLOP analysis to guide their scaling decisions. Critically, Llama 3 exemplifies an "inference-optimal" strategy. They trained their models on a colossal 15 trillion tokens, far beyond the compute-optimal Chinchilla point. This "overtraining" is more expensive during the training phase but is done with the explicit goal of maximizing the capabilities of the final model for a given parameter count, making it cheaper and more powerful for its primary purpose: inference.
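The DeepSeek-style hyperparameter extrapolation mentioned in the list above boils down to fitting a power law y = a·C^b to small-scale measurements and reading it off at the target compute budget. Here is a hedged sketch; every number below is invented for illustration and does not come from the DeepSeek report.

```python
import numpy as np

# Invented small-scale results: FLOPs budgets of cheap runs and the best
# learning rate found by grid search at each budget.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
best_lr = np.array([6.0e-3, 4.8e-3, 3.9e-3, 3.1e-3, 2.5e-3])

# A power law lr = a * C**b is a straight line in log-log space:
# log lr = log a + b * log C, so an ordinary linear fit recovers (a, b).
b, log_a = np.polyfit(np.log(compute), np.log(best_lr), deg=1)
a = np.exp(log_a)

target_compute = 3e24          # hypothetical budget of the large run
print(f"extrapolated optimal lr ≈ {a * target_compute ** b:.2e}")
```

The same log-log fit-and-extrapolate pattern applies to the optimal batch size, or to the loss measurements used to fit the model-data trade-off.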
Conclusion: The Emergence of a Scaling Engineering Discipline
The landscape of training large language models is rapidly maturing from a field of heuristic guesswork into a robust engineering discipline. The unpredictable chaos of hyperparameter tuning at scale is being tamed by the principled, theoretically grounded framework of µ-Parametrization. The once-prohibitive cost of determining optimal resource allocation is being made accessible through clever techniques like the Warmup-Stable-Decay scheduler.
These case studies reveal a clear, emerging playbook for scaling. It involves a meticulous process of small-scale experimentation, careful fitting of scaling laws, and principled extrapolation to make multi-million dollar decisions with a high degree of confidence. While the specific recipes may vary, the underlying principles of empirical measurement, stable parametrization, and an unwavering focus on computational and analytical efficiency are now central to pushing the frontiers of artificial intelligence.