Lecture 3 — Architectures, hyperparameters

5 min read

Summary of Key Points

This lecture provides a deep dive into the architectural and training variations of modern Large Language Models (LLMs), moving beyond the standard transformer. The central theme is that while many models share a common foundation, specific choices in normalization, activation functions, position embeddings, and hyperparameters can significantly impact performance, stability, and efficiency. The trend is towards architectures similar to LLaMA, which use pre-norm, RMSNorm, SwiGLU activations, and Rotary Position Embeddings (RoPE).

Architectural Variations

While the core transformer block remains, several key modifications have become common practice.

Normalization: Pre-Norm vs. Post-Norm

This is the most agreed-upon architectural change in modern LLMs.

  • Post-Norm: The original transformer applied LayerNorm after the residual addition, so the normalization sits on the main "residual path" at every layer, which can cause training instability.
  • Pre-Norm: Almost all modern LLMs instead apply LayerNorm inside the residual branch, on the input to the attention and FFN sub-blocks (see the sketch below). This keeps the residual path clean, leading to more stable training, better gradient flow, and fewer gradient spikes.

A recent variant is the "double norm," which uses pre-norm for the main computation but adds a second normalization outside the residual stream, as seen in models like Grok and Gemma 2.
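
For concreteness, here is a minimal PyTorch sketch of the two orderings. It assumes a generic `sublayer` module (an attention or FFN block) is passed in; the names are illustrative, not from any particular codebase.

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original transformer ordering: normalize after the residual add."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The residual path itself passes through LayerNorm at every layer.
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Modern ordering: normalize only the input to the sublayer."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The residual path (x + ...) stays an identity, which helps gradient flow.
        return x + self.sublayer(self.norm(x))
```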

Normalization Type: LayerNorm vs. RMSNorm

  • LayerNorm: The original method, which subtracts the mean and divides by the standard deviation. Its formula is:
    $y=\frac{x-E[x]}{\sqrt{Var[x]+\epsilon}}\cdot\gamma+\beta$
  • RMSNorm: A simplified version that rescales by the root mean square of the activations, omitting both the mean subtraction and the bias term ($\beta$). For hidden dimension $d$, its formula is:
    $y=\frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}+\epsilon}}\cdot\gamma$

Although matrix multiplications account for ~99.8% of a model's FLOPs (floating-point operations), normalization can take up to 25.5% of the actual runtime due to memory access costs. RMSNorm is faster because it involves fewer operations and parameters, making data movement more efficient without sacrificing performance. This has led to its widespread adoption in models like LLaMA and PaLM. Following this trend, most modern transformers also drop bias terms from their linear layers to save memory and improve stability.
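
A minimal PyTorch sketch of RMSNorm matching the formula above (recent PyTorch releases also ship a built-in `torch.nn.RMSNorm`):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by the root mean square of the activations; no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learnable scale, no beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the hidden dimension, then reciprocal square root.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.gamma
```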

Activation Functions

The feed-forward network (FFN) inside each transformer block uses a non-linear activation function.

  • ReLU and GeLU: Early models used ReLU ($max(0, x)$) or GeLU (a smoother version of ReLU).
  • Gated Activations (GLU): Most modern models have shifted to gated variants like SwiGLU and GeGLU. These functions introduce an extra learnable "gate" matrix ($V$) that controls the information flow, as shown in the formula for ReGLU (where $\otimes$ denotes element-wise multiplication); SwiGLU simply replaces the ReLU with the Swish (SiLU) function:
    $FF_{ReGLU}(x)=(\max(0,xW_{1})\otimes xV)W_{2}$

Studies show that gated units, particularly SwiGLU, consistently provide a small but significant performance improvement over simpler activations. Because the gate adds a third weight matrix, models using GLU variants typically scale the feedforward dimension ($d_{ff}$) down to $2/3$ of the usual size (roughly $\frac{8}{3}d_{model}$ instead of $4d_{model}$) to keep the parameter count comparable.
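
A minimal PyTorch sketch of a SwiGLU feed-forward block with $d_{ff}$ set to roughly $\frac{8}{3}d_{model}$ as described above (real models usually round this up to a hardware-friendly multiple):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: (SiLU(x W1) * x V) W2."""
    def __init__(self, d_model: int):
        super().__init__()
        d_ff = int(8 * d_model / 3)                       # ~2/3 of the usual 4 * d_model
        self.w1 = nn.Linear(d_model, d_ff, bias=False)    # "value" branch
        self.v = nn.Linear(d_model, d_ff, bias=False)     # learnable gate
        self.w2 = nn.Linear(d_ff, d_model, bias=False)    # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the Swish-activated branch and the gate branch.
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```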

Layer Structure: Serial vs. Parallel

  • Serial (Standard): The attention block and FFN block are computed sequentially.
  • Parallel: The attention and FFN blocks are computed in parallel from the same input, and their outputs are added together. The formula is:
    $y=x+MLP(LayerNorm(x))+Attention(LayerNorm(x))$

This parallel structure can speed up training by about 15% at large scales by fusing matrix multiplications. It has been used in models like GPT-J, PaLM, and Cohere's Command R+.
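
A sketch of the parallel layout in PyTorch, assuming hypothetical `attn` and `mlp` modules are passed in; a single normalized input is shared by both branches, matching the formula above:

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J / PaLM-style block: attention and MLP read the same normalized input."""
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        h = self.norm(x)                       # one shared normalization
        return x + self.attn(h) + self.mlp(h)  # both branches computed from h, then summed
```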

Position Embeddings

Since attention is permutation-invariant, the model needs a way to understand token order.

Absolute, Relative, and Rotary (RoPE)

  • Absolute/Sine: The original transformer added fixed sinusoidal position encodings to the input embeddings, while early models like GPT-3 used learned absolute position vectors. Both approaches can struggle to generalize to sequence lengths not seen during training.
  • Relative: T5 introduced relative position embeddings, where positional information is injected directly into the attention calculation.
  • Rotary Position Embeddings (RoPE): The current standard, used in almost all modern models. The key idea is to represent position information by rotating the query and key vectors in a high-dimensional space. The angle of rotation depends on the token's absolute position. This way, the dot product between a query at position $i$ and a key at position $j$ naturally depends only on their relative distance ($i-j$), providing a robust and flexible way to encode position.
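
A simplified PyTorch sketch of RoPE that rotates adjacent pairs of dimensions by position-dependent angles (the base of 10000 is the common convention; real implementations also cache the sin/cos tables):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate a (..., seq_len, head_dim) query or key tensor; head_dim must be even."""
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    # One frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard 2D rotation applied to each (x1, x2) pair.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# After q_rot, k_rot = apply_rope(q), apply_rope(k), the dot product between
# a query at position i and a key at position j depends only on i - j.
```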

Hyperparameters and Regularization

There is a surprising consensus on many key hyperparameters.

  • Feedforward Ratio: The intermediate dimension of the FFN ($d_{ff}$) is almost always 4x the model's hidden dimension ($d_{model}$) for ReLU/GeLU models and (8/3)x for GLU models.
  • Head Dimension: Most models adhere to the rule $d_{head} \times n_{heads} = d_{model}$ (see the configuration sketch after this list).
  • Aspect Ratio ($d_{model} / n_{layer}$): Models tend to have a "sweet spot" ratio between 100 and 200, balancing model width (parallelism) and depth (sequential processing).
  • Regularization: While early models used dropout, newer models like LLaMA often set dropout to 0 and rely solely on weight decay (typically with a value of 0.1). For LLMs, weight decay acts less as a traditional regularizer to prevent overfitting and more to stabilize optimization dynamics, especially when used with a cosine learning rate schedule.
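
As an illustration, here is a hypothetical LLaMA-7B-like configuration that follows these rules of thumb; the numbers are for orientation only:

```python
# Hypothetical LLaMA-style configuration illustrating the common ratios.
d_model  = 4096
n_layers = 32
n_heads  = 32

head_dim = d_model // n_heads      # 128, so head_dim * n_heads == d_model
d_ff_glu = int(8 * d_model / 3)    # ~10922; real models round up (LLaMA-7B uses 11008)
aspect   = d_model / n_layers      # 128, inside the ~100-200 "sweet spot"

print(head_dim, d_ff_glu, aspect)
```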

Attention Variants for Efficiency

To handle long sequences and speed up inference, several attention variants are used.

Multi-Query & Grouped-Query Attention (MQA/GQA)

In standard multi-head attention (MHA), each head has its own query, key, and value projections. During inference, storing the key/value pairs for every head in the "KV cache" consumes a lot of memory.

  • MQA: Uses multiple query heads but only a single key and value head that is shared across all queries.
  • GQA: An intermediate approach where a smaller number of key/value heads are shared among groups of query heads.

These methods drastically reduce the size of the KV cache, speeding up inference with only a minimal loss in performance.
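
A back-of-the-envelope comparison (hypothetical model sizes, fp16 cache entries at 2 bytes each) showing why fewer KV heads shrink the cache:

```python
# Rough KV-cache size comparison for a single 8k-token sequence.
n_layers, seq_len, head_dim, bytes_per = 32, 8192, 128, 2

def kv_cache_bytes(n_kv_heads: int) -> int:
    # Two tensors (K and V) per layer, each of shape (seq_len, n_kv_heads, head_dim).
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per

print(kv_cache_bytes(32) / 1e9)  # MHA: 32 KV heads -> ~4.3 GB
print(kv_cache_bytes(8) / 1e9)   # GQA:  8 KV heads -> ~1.1 GB
print(kv_cache_bytes(1) / 1e9)   # MQA:  1 KV head  -> ~0.13 GB
```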

Sliding Window Attention (SWA)

Instead of allowing every token to attend to every previous token (which is computationally expensive), SWA restricts attention to a fixed-size local window (e.g., the last 4096 tokens). By stacking layers, the model's "effective context length" can grow much larger than the window size. Models like Command A interleave full attention layers with SWA layers to capture both local and long-range dependencies.
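
A sketch of a causal sliding-window attention mask in PyTorch (hypothetical window size; `True` marks key positions a query may attend to):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask where query i may attend only to keys j with i - window < j <= i."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Example: with window=4, token 10 attends to tokens 7-10; stacking L such layers
# lets information propagate roughly L * window tokens back through the network.
print(sliding_window_mask(8, 4).int())
```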
