Links
- Lecture video: https://youtu.be/SQ3fZ1sAqXI
- Course materials: lecture_01.py
Introduction and Motivation: Why Build From Scratch?
This introductory lecture establishes the core philosophy of CS336. It addresses the growing disconnect between the researchers who use Large Language Models (LLMs) and the complex engineering required to build them.
The Problem: The Industrialization of Language Models
The field has shifted dramatically in just a few years. Previously, researchers would implement their own models (e.g., BERT). Now, the common practice is to prompt proprietary, closed-source APIs like GPT-4 or Claude. This move up the "abstraction ladder" boosts productivity but creates two issues:
- Leaky Abstractions: Unlike stable abstractions in computer science (like programming languages), the abstractions around LLMs are "leaky." A deep understanding of the underlying mechanics is still required to push the boundaries of research.
- Inaccessibility of Frontier Models: State-of-the-art models are now an "industrial" endeavor, costing hundreds of millions of dollars and requiring massive, dedicated hardware clusters. Companies like OpenAI and xAI do not release technical details about their model architecture, dataset construction, or training methods, citing competitive and safety reasons. This makes it impossible for the academic community to study them directly.
The Course's Solution: Mechanics, Mindset, and Scaling
Since building frontier models is out of reach, this course focuses on what can be learned at smaller scale and what transfers from smaller-scale experiments to massive ones.
- Mechanics: Teaching the fundamental components—how a Transformer works, how model parallelism leverages GPUs, etc. This knowledge is foundational and universal.
- Mindset: Instilling an engineering mindset focused on efficiency and scaling. This includes understanding scaling laws and squeezing the most performance out of available hardware.
- The Bitter Lesson, Re-interpreted: The lecture references Rich Sutton's "The Bitter Lesson," which states that general methods that leverage computation are ultimately the most effective. The lecture refines this: it's not just about scale, but about "algorithms that scale." Efficiency becomes paramount at larger scales where waste is unaffordable. The guiding principle of the course is:
accuracy = efficiency × resources
🗺️ The Landscape and History of LLMs
The lecture provides a timeline of the key developments that led to modern LLMs.
1. Pre-Neural Era (Before the 2010s)
- Dominated by n-gram models, which estimate the probability of a word given the previous n-1 words.
- Pioneering work by Claude Shannon on using language models to measure the entropy of English.
2. Neural Ingredients (2010s)
This decade saw the invention of the core components that underpin all modern models.
- First Neural Language Model (2003): Yoshua Bengio et al. introduced the idea of using neural networks for language modeling.
- Sequence-to-Sequence (Seq2Seq) Modeling (2014): A breakthrough for tasks like machine translation.
- Adam Optimizer (2014): Became the default optimizer for training deep neural networks.
- Attention Mechanism (2014): Allowed models to "focus" on relevant parts of the input, overcoming the limitations of fixed-length context vectors.
- The Transformer (2017): The "Attention Is All You Need" paper introduced an architecture based solely on attention, which was more parallelizable and scalable than previous recurrent models.
3. Early Foundation Models (Late 2010s)
These models established the paradigm of pre-training on a large corpus and then fine-tuning for specific tasks.
- ELMo & BERT (2018): Showed the immense power of pre-training. BERT, using the Transformer encoder, set new state-of-the-art records across NLP benchmarks.
- T5 (11B parameters, 2019): Unified all NLP tasks into a text-to-text format.
4. The Era of Scaling and Openness (2020s)
- GPT-2 (1.5B, 2019): Demonstrated shockingly fluent text generation and early signs of zero-shot task completion.
- Scaling Laws (2020): A pivotal paper from OpenAI showed that model performance improves predictably with increases in model size, dataset size, and compute. This provided a scientific roadmap for building better models.
- GPT-3 (175B, 2020): Showcased powerful in-context learning, where the model could perform tasks it was never explicitly trained for, just by being shown examples in its prompt. It remained closed-source.
- Chinchilla (70B, 2022): A study by DeepMind that refined the scaling laws, arguing that for compute-optimal training, dataset size should grow in proportion to model size (roughly 20 tokens per parameter). This suggested many previous models were "undertrained."
- The Rise of Open Models: A movement to democratize LLMs, led by:
- EleutherAI (The Pile, GPT-J): Provided open datasets and models.
- Meta (OPT, Llama series): Released a series of powerful open-weight models that catalyzed research and development.
- AI2 (OLMo): Created a truly open-source model, releasing the weights, code, and training data.
📚 The Five Components of Building an LLM
The course is structured around five key modules, each representing a critical stage in the LLM development pipeline.
1. The Basics
This module covers the core code needed to get a simple model running.
- Tokenization: The process of converting raw text into a sequence of integers. The course focuses on Byte-Pair Encoding (BPE), an algorithm that starts with individual bytes and iteratively merges the most frequent adjacent pairs to build a vocabulary. This effectively compresses text by representing common strings with a single token (a minimal sketch of the merge loop appears after this list).
- Architecture: The Transformer is the foundational architecture. Key variations to be explored include activation functions (e.g., SwiGLU), normalization layers (e.g., RMSNorm), positional encodings (e.g., RoPE), and attention mechanisms (e.g., sliding window, Grouped-Query Attention).
- Training: Implementing the training loop, which involves a loss function (cross-entropy), an optimizer (AdamW), and a learning rate schedule (e.g., cosine decay); a small schedule sketch also follows the list.
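To make the merge procedure concrete, here is a toy sketch of BPE training in Python. It operates directly on raw bytes and greedily merges the most frequent adjacent pair; real tokenizers (including the one built in the assignment) add pre-tokenization, special tokens, and much faster data structures, so treat this as illustrative only.

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Toy BPE trainer: start from raw bytes, repeatedly merge the most
    frequent adjacent pair into a new token id."""
    ids = list(text.encode("utf-8"))        # initial vocabulary: the 256 byte values
    merges = {}                             # (id_a, id_b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        # replace every occurrence of the pair with the new id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

merges, ids = bpe_train("the cat sat on the mat, the cat sat", num_merges=10)
print(len(ids), "tokens after merging;", len(merges), "merges learned")
```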
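The learning rate schedule from the training bullet is also small enough to write out. The sketch below shows cosine decay with linear warmup; the specific hyperparameter values are placeholders, not the course's settings.

```python
import math

def lr_at_step(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
               warmup_steps: int = 1000, total_steps: int = 100_000) -> float:
    """Cosine decay to min_lr with linear warmup (placeholder hyperparameters)."""
    if step < warmup_steps:                 # linear warmup from 0 to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:                 # hold at the floor after decay finishes
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```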
2. Systems
This module focuses on hardware optimization to train models efficiently.
- Kernels: Writing low-level code (using Triton/CUDA) to fuse operations and minimize data movement between a GPU's slow main memory (DRAM) and its fast on-chip memory (SRAM), maximizing computational throughput (see the fused-kernel sketch after this list).
- Parallelism: Techniques to distribute training across many GPUs when a model is too large for one. This includes data parallelism (replicating the model), pipeline parallelism (splitting layers across GPUs), and tensor parallelism (splitting matrix multiplies within layers).
- Inference: The process of generating text from a trained model. This is broken into the prefill phase (processing the prompt, compute-bound) and the decode phase (generating tokens one-by-one, memory-bound). Optimizations like KV-caching are essential here; a toy decode-step sketch is included below.
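Since the kernels bullet names Triton, here is a minimal sketch of what "fusing operations" means: one kernel that multiplies two tensors and applies a ReLU in a single pass, so the intermediate result never round-trips through DRAM. The kernel and helper names are illustrative, not taken from the course code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_mul_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)    # one read from DRAM
    y = tl.load(y_ptr + offsets, mask=mask)    # one read from DRAM
    z = tl.maximum(x * y, 0.0)                 # multiply + ReLU fused in registers/SRAM
    tl.store(out_ptr + offsets, z, mask=mask)  # one write back to DRAM

def fused_mul_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x and y must be contiguous CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_mul_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```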
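And to illustrate why KV-caching matters in the decode phase, the toy step below (single attention head, batch of one) appends the new token's key and value to a cache instead of recomputing them for the whole prefix; each generated token then costs only one extra attention row.

```python
import torch

def decode_step(q_t, k_t, v_t, kv_cache):
    """One decode step with a KV cache.
    q_t, k_t, v_t: (1, d) projections for the newly generated position.
    kv_cache: dict holding keys/values of all past positions, computed once."""
    kv_cache["k"] = torch.cat([kv_cache["k"], k_t], dim=0)              # (t, d)
    kv_cache["v"] = torch.cat([kv_cache["v"], v_t], dim=0)              # (t, d)
    scores = q_t @ kv_cache["k"].T / kv_cache["k"].shape[-1] ** 0.5     # (1, t)
    weights = torch.softmax(scores, dim=-1)
    return weights @ kv_cache["v"]                                      # (1, d)

d = 64
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(5):  # generate 5 tokens; only the newest position attends
    out = decode_step(torch.randn(1, d), torch.randn(1, d), torch.randn(1, d), cache)
```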
3. Scaling Laws
This module covers the science of scaling. By running many small-scale experiments and plotting the results, one can fit a power-law function to predict the performance (loss) of a model at a much larger scale. This allows for principled hyperparameter tuning and budget allocation, rather than relying on guesswork.
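As a concrete (and entirely made-up) illustration of such a fit: a power law is a straight line in log-log space, so one can fit log-loss against log-compute from small runs and extrapolate to a larger budget. The numbers below are hypothetical, not measurements from the lecture.

```python
import numpy as np

# Hypothetical (compute, loss) pairs from small training runs; not real data.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # training FLOPs
loss    = np.array([3.90, 3.55, 3.24, 2.98, 2.76])

# Fit log(loss) = intercept + slope * log(compute), i.e. loss ≈ a * compute**(-b).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope

extrapolated = a * (1e21) ** (-b)   # predicted loss at a 100x larger budget
print(f"fitted exponent b ≈ {b:.3f}, extrapolated loss at 1e21 FLOPs ≈ {extrapolated:.2f}")
```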
4. Data
Data is the fuel for LLMs. This module covers the end-to-end data pipeline.
- Sourcing: Gathering data from various sources like Common Crawl, books, arXiv, and GitHub.
- Curation: The critical process of cleaning and preparing the data, covering the steps below.
- Transformation: Converting formats like HTML and PDF into clean text.
- Filtering: Using classifiers or heuristics to remove low-quality or harmful content.
- Deduplication: Removing duplicate documents to improve efficiency and prevent the model from simply memorizing the training set.
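A minimal sketch of exact deduplication by content hashing is shown below; production pipelines typically add near-duplicate (fuzzy) matching, e.g. MinHash over n-grams, which this sketch omits.

```python
import hashlib

def dedup_exact(documents):
    """Keep the first copy of each distinct document, keyed by a content hash."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the cat sat.", "A different document."]
print(dedup_exact(docs))  # the near-identical second copy is dropped
```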
5. Alignment
This module focuses on transforming a raw, pre-trained base model—which is just a next-token predictor—into a helpful and safe assistant.
- Supervised Fine-Tuning (SFT): The model is trained on a curated dataset of high-quality instruction-response examples to teach it how to follow instructions.
- Learning from Feedback: Since creating SFT data is expensive, this phase uses cheaper preference data. Humans (or more powerful models) are shown two responses to a prompt and choose the better one. Algorithms like Direct Preference Optimization (DPO) then use this preference data to align the model's behavior with human expectations.
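For concreteness, here is a sketch of the DPO objective computed from sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The tensor names, toy values, and beta setting are illustrative, not the course's configuration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                    - (log pi(y_l|x) - log pi_ref(y_l|x))]).
    Pushes the policy toward the chosen response relative to the reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example: log-probabilities for 4 (chosen, rejected) pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.5, -11.0]),
                torch.tensor([-11.0, -12.5, -10.0, -13.0]),
                torch.tensor([-10.5, -12.0, -9.8, -11.2]),
                torch.tensor([-10.8, -12.2, -10.1, -12.5]))
print(loss)
```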