Links
- Lecture video: https://youtu.be/6Q-ESEmDf4Q
- Course materials: lecture 9.pdf
Overview: From Guesswork to a Predictive Science
This lecture addresses a central question in modern AI: if you were given a massive compute budget (e.g., 10,000 H100s for a month), how would you decide what model to train? Instead of "cargo culting" hyperparameters from existing models or relying on expensive trial-and-error, scaling laws provide a powerful, predictive framework to make these decisions scientifically.
The core idea is that the performance of a language model is not random but follows a predictable power-law relationship with three key resources: model size (parameters), dataset size (tokens), and compute (FLOPs). By running many cheap, small-scale experiments, we can fit these laws and accurately extrapolate to predict the performance of a model at a massive scale, before we train it. This transforms LLM development from an art into a more rigorous engineering discipline.
Part 1: The History and Theoretical Foundations of Scaling
The idea that performance scales with data is not new.
- Early Roots (1990s-2000s): Early work in machine learning and NLP (e.g., Vapnik, Banko & Brill) showed that classifier error decreased predictably with more training data, often following a power law (a straight line on a log-log plot).
- The Neural Era Precursor (Hestness et al., 2017): This seminal work was the first to rigorously study scaling in deep neural networks across tasks like translation and speech recognition. It identified three distinct phases of a learning curve:
- Small Data Region: Not enough data to learn anything meaningful.
- Power-Law Region: The "sweet spot" where performance predictably improves with more data.
- Irreducible Error Region: The point of diminishing returns where adding more data provides little benefit.
Why Do Power Laws Appear?
At a conceptual level, power-law scaling is a natural property of statistical estimation.
- A Toy Example: Mean Estimation: If you estimate the mean of a distribution from $n$ samples, the squared error of your estimate decreases proportionally to $1/n$. On a log-log plot, $\log(\text{Error})$ vs. $\log(n)$ is a straight line (a minimal simulation follows this list).
- Intrinsic Dimensionality: For more complex, high-dimensional tasks, the exponent of the power law (the slope of the line) is theorized to be related to the "intrinsic dimensionality" of the data manifold. This suggests that the smoother and lower-dimensional the data's underlying structure is, the faster a model can learn from it (a steeper slope).
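To make the toy example concrete, here is a minimal simulation sketch (an illustration, not code from the lecture; the sample sizes, trial count, and unit-variance Gaussian are arbitrary choices) that measures the error of the sample mean and recovers the power-law exponent by linear regression in log-log space.

```python
# Minimal sketch of the mean-estimation toy example (illustrative, not from the lecture).
# The squared error of the sample mean falls as 1/n, i.e. a straight line of slope ~ -1
# on a log-log plot.
import numpy as np

rng = np.random.default_rng(0)
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
errors = []
for n in sample_sizes:
    # Average squared error of the sample mean over 200 trials (true mean is 0).
    trials = rng.normal(loc=0.0, scale=1.0, size=(200, n))
    errors.append(np.mean(trials.mean(axis=1) ** 2))

# Fit log(error) = slope * log(n) + intercept; the slope is the power-law exponent.
slope, _ = np.polyfit(np.log(sample_sizes), np.log(errors), deg=1)
print(f"fitted exponent: {slope:.2f}")  # expect roughly -1
```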
Part 2: Scaling Laws for Large Language Models
Two landmark papers from OpenAI and DeepMind established the modern framework for LLM scaling.
1. The Kaplan et al. (2020) Scaling Laws
This was the first paper to show that Transformer language models exhibit smooth, predictable power-law scaling. The key finding was that the model's test loss (L) can be accurately modeled as a function of Model Size (N), Dataset Size (D), or Compute (C).
$L(N) \propto N^{-\alpha_N}$ $L(D) \propto D^{-\alpha_D}$ $L(C) \propto C^{-\alpha_C}$
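To illustrate how such a law is fit in practice, the sketch below runs a linear regression in log-log space on a handful of hypothetical small-model runs (the model sizes and losses are made-up placeholders, not real measurements) to estimate $\alpha_N$ and extrapolate $L(N)$ to a larger model.

```python
# Sketch (hypothetical numbers) of fitting the L(N) power law from a handful of
# small-model runs: linear regression of log(loss) against log(parameters) gives the
# exponent alpha_N and lets us extrapolate to a larger model.
import numpy as np

model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])       # parameters (assumed runs)
losses      = np.array([4.20, 3.90, 3.60, 3.35, 3.10])  # final test losses (made up)

# log L = log c - alpha_N * log N  =>  a straight line in log-log space.
slope, log_c = np.polyfit(np.log(model_sizes), np.log(losses), deg=1)
alpha_N = -slope

def predicted_loss(n_params: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return float(np.exp(log_c) * n_params ** (-alpha_N))

print(f"alpha_N ≈ {alpha_N:.3f}")
print(f"predicted loss at 10B params: {predicted_loss(1e10):.2f}")
```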
This framework has a profound practical application: hyperparameter transfer. Instead of tuning hyperparameters on a massive model, you can:
- Train many small models with different hyperparameters (e.g., model depth, width, optimizer choice).
- Fit a scaling law for each hyperparameter choice.
- Extrapolate the fitted lines to your target compute budget and simply pick the hyperparameter setting that is predicted to yield the lowest loss (sketched in code below).

This process reliably predicts that Transformers scale better than LSTMs, that Adam is a better optimizer than SGD, and that model performance is surprisingly insensitive to shape (depth vs. width) within a wide range.
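A hedged sketch of this workflow: it fits one loss-versus-compute power law per candidate (the compute values and losses are illustrative placeholders), extrapolates each fit to the target budget, and keeps the candidate with the lowest predicted loss.

```python
# Hyperparameter transfer sketch: fit a power law per candidate on cheap runs,
# extrapolate to the target compute budget, and pick the best predicted candidate.
import numpy as np

small_scale_runs = {
    # candidate -> (training compute in FLOPs, final loss) for several cheap runs
    "adam": ([1e17, 1e18, 1e19], [3.80, 3.45, 3.15]),
    "sgd":  ([1e17, 1e18, 1e19], [4.10, 3.85, 3.65]),
}
target_compute = 1e23  # the large-scale budget we want to predict for

def extrapolate(compute, loss, target):
    # Fit log(loss) = slope * log(compute) + intercept, then evaluate at the target.
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept + slope * np.log(target)))

predictions = {name: extrapolate(c, l, target_compute)
               for name, (c, l) in small_scale_runs.items()}
best = min(predictions, key=predictions.get)
print(predictions)
print(f"pick: {best}")
```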
2. The Chinchilla Scaling Laws (Hoffmann et al., 2022)
This work refined the Kaplan laws to answer the most important question: for a fixed compute budget, what is the optimal allocation between model size (N) and data size (D)?
The Chinchilla authors conducted a more controlled set of experiments and concluded that previous large models like GPT-3 (175B params, 300B tokens) were severely undertrained. To achieve the lowest possible loss for a given amount of compute, the model size and the number of training tokens must be scaled in roughly equal proportion.
The Chinchilla Rule: For every doubling of model size, you should also double the number of training tokens. This leads to the famous rule of thumb: the compute-optimal amount of training data is approximately 20 tokens per model parameter.
This finding implied that a much smaller 70B parameter model (Chinchilla) trained on 1.4T tokens could outperform the much larger 175B GPT-3 trained on only 300B tokens.
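As a quick worked example (a sketch that additionally assumes the widely used $C \approx 6ND$ estimate of training FLOPs, which the notes above do not state explicitly), inverting the 20-tokens-per-parameter rule recovers roughly the Chinchilla configuration from its training budget.

```python
# Chinchilla rule-of-thumb worked example, assuming C ≈ 6 * N * D training FLOPs
# and D ≈ 20 * N tokens. Solving the two equations: N = sqrt(C / 120), D = 20 * N.
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# A Chinchilla-scale budget (~5.9e23 FLOPs) recovers ~70B params and ~1.4T tokens.
n, d = chinchilla_optimal(5.88e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```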
Practical Applications and Important Nuances
Train-Optimal vs. Inference-Optimal
The Chinchilla rule is train-optimal. It tells you how to get the best possible model for a fixed training compute budget. However, in many real-world applications, the one-time training cost is dwarfed by the cumulative cost of running the model for inference millions of times.
In such cases, it is often better to create an inference-optimal model. This means training a smaller model on far more data than the Chinchilla rule suggests (e.g., 200+ tokens per parameter). This "overtraining" is computationally "inefficient" during the training phase but results in the best possible model for a given parameter count, which is cheaper to run for inference. The Llama 3 models are a prime example of this philosophy.
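A rough sketch of this trade-off, under stated assumptions (training cost $\approx 6ND$ FLOPs, inference cost $\approx 2N$ FLOPs per generated token, an assumed lifetime serving volume of 10T tokens, and an illustrative Llama-3-like 8B/15T overtrained configuration):

```python
# Train-vs-inference trade-off sketch. Assumptions (not from the lecture): training
# costs ~6*N*D FLOPs, inference costs ~2*N FLOPs per generated token, and both
# candidate models reach comparable quality.
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * inference_tokens
    return train + inference

served = 1e13  # assumed lifetime inference volume: 10T generated tokens

# Chinchilla-style model vs. a smaller, heavily overtrained model (illustrative sizes).
chinchilla_style = lifetime_flops(70e9, 1.4e12, served)  # 70B params, 20 tokens/param
overtrained      = lifetime_flops(8e9, 15e12, served)    # 8B params, ~1,900 tokens/param

print(f"70B @ 1.4T tokens: {chinchilla_style:.2e} total FLOPs")
print(f" 8B @ 15T tokens:  {overtrained:.2e} total FLOPs")
```

With these illustrative numbers, the smaller overtrained model costs slightly more to train but far less over its serving lifetime, which is the motivation for overtraining.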
Data Scaling: Quality and Finiteness
- Data Repetition: There are diminishing returns to training on the same data for multiple epochs. The first ~4 epochs are nearly as effective as new data, but after that the value drops off rapidly. Scaling laws can be modified to account for this (a toy illustration follows this list).
- Data Quality: The optimal data filtering strategy depends on your compute budget.
- Small Compute: You should be very aggressive and filter for only the highest-quality data.
- Large Compute: It is better to filter less aggressively and include more data, even if some of it is of slightly lower quality.
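For the data-repetition point above, here is a toy sketch. It uses an assumed exponential-decay model of repeated-epoch value, chosen purely for illustration and not the fitted law from the data-constrained scaling-law literature.

```python
# Toy model (an assumption, for illustration only): each additional epoch over the
# same unique tokens is worth a little less than the previous one, so the first few
# epochs are nearly as good as fresh data and later ones add little.
import math

def effective_tokens(unique_tokens: float, epochs: int, decay: float = 0.1) -> float:
    """Epoch k contributes unique_tokens * exp(-decay * k); decay is an arbitrary constant."""
    return unique_tokens * sum(math.exp(-decay * k) for k in range(epochs))

for epochs in (1, 4, 8, 16):
    print(f"{epochs:>2} epochs over 1T unique tokens ≈ "
          f"{effective_tokens(1e12, epochs):.2e} effective tokens")
```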
The Scaling Law Design Procedure
The modern, principled approach to designing a new large model is:
- Run a series of small-scale experiments to test different model architectures, data mixes, and hyperparameters.
- Fit scaling law curves to the results of these experiments.
- Extrapolate these curves to your final compute budget.
- Use these predictions to make final, informed decisions about the model's architecture and training configuration.
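A hedged end-to-end sketch of this procedure, using a Chinchilla-style parametric loss $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ (one common choice of functional form). All runs and coefficients below are synthetic placeholders, so the fit-and-extrapolate pipeline can be shown working end to end.

```python
# End-to-end sketch: generate synthetic "small-scale runs" from a placeholder scaling
# law, fit the parametric form to them, then extrapolate to the final compute budget
# (using C ≈ 6*N*D) to pick a compute-optimal allocation.
import numpy as np
from scipy.optimize import curve_fit

def loss_model(ND, E, A, B, alpha, beta):
    """Parametric scaling law: loss as a function of params N and tokens D."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# 1. "Small-scale experiments": cheap (N, D) runs with synthetic losses.
true_coeffs = (1.7, 400.0, 1100.0, 0.34, 0.34)  # placeholder ground truth
N_obs = np.array([1e8, 1e8, 4e8, 4e8, 1.6e9, 1.6e9, 6.4e9, 6.4e9])
D_obs = np.array([2e9, 8e9, 8e9, 3.2e10, 3.2e10, 1.3e11, 1.3e11, 5e11])
L_obs = loss_model((N_obs, D_obs), *true_coeffs)

# 2. Fit the scaling law to the experimental results.
fitted, _ = curve_fit(loss_model, (N_obs, D_obs), L_obs,
                      p0=[2.0, 350.0, 1000.0, 0.3, 0.3], maxfev=50_000)

# 3. Extrapolate to the final budget: with C ≈ 6*N*D, sweep N and predict the loss.
C_target = 1e24
N_grid = np.logspace(9, 12, 400)
D_grid = C_target / (6 * N_grid)
predicted = loss_model((N_grid, D_grid), *fitted)

# 4. Informed decision: the predicted compute-optimal allocation at that budget.
best = int(np.argmin(predicted))
print(f"compute-optimal at C=1e24: N ≈ {N_grid[best]:.2e}, D ≈ {D_grid[best]:.2e}")
```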