Lecture 12 — Evaluation


Overview: The Crisis and Complexity of Measuring "Good"

This lecture addresses one of the most profound and challenging questions in the field of AI: given a trained language model, how do we actually determine how "good" it is? The process of evaluation is far from a simple, mechanical task of running a model on a dataset and reporting a number. It is a rich, multifaceted, and often contentious field that directly shapes the development priorities and perceived progress of AI.

The lecture opens by acknowledging the "evaluation crisis" described by prominent researchers like Andrej Karpathy. As models become more powerful, traditional benchmarks become saturated, and new, more sophisticated methods are required. There is no single "true" evaluation; the right methodology depends entirely on the question being asked. Are we measuring raw knowledge, reasoning ability, safety, or usefulness for a specific real-world task? The lecture provides a structured framework for thinking about this problem and then surveys the landscape of modern evaluation techniques, from foundational perplexity metrics to complex, multi-turn agentic benchmarks.


A Framework for Thinking About Evaluation

Before diving into specific benchmarks, it's crucial to establish a framework for analyzing any evaluation methodology. Every evaluation involves a series of choices that can profoundly affect the outcome (a minimal configuration sketch follows this list):

  1. What are the inputs? (The prompt/task distribution)

    • What use cases are covered (e.g., coding, creative writing, scientific Q&A)?
    • Are there enough difficult, "tail-end" examples to truly challenge the model?
  2. How do you call the language model? (The generation setup)

    • Is it a zero-shot, few-shot, or chain-of-thought prompt?
    • Is the model allowed to use tools (like a code interpreter or web search)?
    • Are we evaluating the raw base model or a complete, agentic system?
  3. How do you evaluate the outputs? (The grading mechanism)

    • Is there a single ground truth answer (e.g., multiple-choice), or is the generation open-ended?
    • Who or what is the judge? Is it a simple string match, a human evaluator, or another powerful LLM (LM-as-a-judge)?
    • How are costs (inference compute) and asymmetric errors (a single hallucination being very harmful) factored in?
  4. How do you interpret the results? (The meta-analysis)

    • What does a score of "91%" actually mean in terms of real-world reliability?
    • How do we account for potential train-test overlap (contamination)?
    • Are we evaluating a specific, frozen model or a general method?
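
To make these choices concrete, here is a minimal sketch (not from the lecture) of how an evaluation harness might record them as an explicit configuration object; all field names and defaults are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Illustrative record of the choices behind a single evaluation run."""
    # 1. Inputs: which prompt/task distribution is being tested.
    task_name: str                       # e.g. "mmlu", "swe_bench"
    num_examples: int                    # size of the sampled test set
    # 2. Generation setup: how the model is called.
    prompting: str = "zero_shot"         # "zero_shot" | "few_shot" | "chain_of_thought"
    tools_enabled: bool = False          # code interpreter, web search, etc.
    # 3. Grading mechanism: how outputs are scored.
    grader: str = "exact_match"          # "exact_match" | "human" | "lm_judge"
    # 4. Interpretation aids: metadata needed to read the score honestly.
    contamination_checked: bool = False  # was train-test overlap audited?
    cost_per_example_usd: float = 0.0    # inference cost, for cost-aware comparison

config = EvalConfig(task_name="mmlu", num_examples=14042, prompting="few_shot")
print(config)
```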

The Landscape of Modern LLM Benchmarks

The lecture categorizes benchmarks by the primary capability they aim to measure.

1. Foundational Capabilities: Perplexity and Cloze Tasks

  • Perplexity: The most fundamental metric. It measures how well the model's predicted probability distribution aligns with a held-out test set of text, and is defined as the exponential of the average per-token negative log-likelihood. A lower perplexity means the model is less "surprised" by the text and has a better internal model of the language. While less popular now for headline results, it remains a crucial, smooth metric for guiding the pre-training process and fitting scaling laws (see the short sketch after this list).
  • Cloze-style Tasks (e.g., HellaSwag): These are spiritually similar to perplexity. The model is given a context and must choose the most likely completion from a set of options. HellaSwag is an "adversarially filtered" dataset, meaning the incorrect options are specifically designed to be tricky for models but easy for humans.
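
A minimal sketch of how perplexity is computed from per-token log-probabilities; the numbers below are stand-ins for what a real model would assign. The same log-likelihood machinery scores cloze tasks like HellaSwag: each candidate completion is scored (often length-normalized) and the highest-scoring option is chosen.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy log-probabilities for the tokens of a held-out sentence (stand-ins for
# values a real language model would assign).
held_out_logprobs = [-2.1, -0.4, -1.3, -0.9, -3.0]
print(f"perplexity = {perplexity(held_out_logprobs):.2f}")

def pick_cloze_option(option_logprob_lists):
    """HellaSwag-style scoring: choose the completion with the highest
    length-normalized log-likelihood under the model."""
    scores = [sum(lp) / len(lp) for lp in option_logprob_lists]
    return max(range(len(scores)), key=scores.__getitem__)
```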

2. Knowledge Benchmarks

These benchmarks test the model's stored factual knowledge, often in a multiple-choice format (a minimal scoring loop for this format is sketched after the list).

  • MMLU (Massive Multitask Language Understanding): The long-standing industry standard. It covers 57 subjects ranging from high school mathematics to professional law. However, as top models now score over 90%, it is becoming saturated.
  • MMLU-Pro: A more difficult version of MMLU with 10 choices instead of 4 and cleaner, more challenging questions to combat saturation.
  • GPQA (Graduate-Level Google-Proof Q&A): A benchmark of expert-level questions designed to be "Google-proof": even skilled non-experts cannot reliably answer them with unrestricted web access. PhD-level domain experts achieve only ~65% accuracy, making it a very challenging test of deep knowledge.
  • Humanity's Last Exam (HLE): A new, highly ambitious benchmark featuring 2,500 extremely difficult, multi-modal questions created through a global competition with a $500K prize pool. The questions were filtered to be unsolvable by current frontier models, making it a forward-looking test of future capabilities.
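
A minimal sketch of MMLU-style multiple-choice scoring under the common setup where the model is asked to emit a single answer letter; `ask_model` is a placeholder for a real model call, and the prompt template is illustrative rather than the official one.

```python
def format_mc_prompt(question, choices):
    """Render a question with lettered options, asking for a single letter."""
    letters = "ABCD"[: len(choices)]
    lines = [question] + [f"{l}. {c}" for l, c in zip(letters, choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score_multiple_choice(examples, ask_model):
    """Accuracy over (question, choices, gold_letter) triples.
    `ask_model` is a placeholder: prompt string in, model text out."""
    correct = 0
    for question, choices, gold in examples:
        reply = ask_model(format_mc_prompt(question, choices))
        predicted = reply.strip()[:1].upper()  # crude parse: first character
        correct += predicted == gold
    return correct / len(examples)

# Toy usage with a fake "model" that always answers "B".
examples = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
print(score_multiple_choice(examples, lambda prompt: "B"))
```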

3. Instruction Following and Open-Ended Generation

Evaluating open-ended generation is a major challenge. Common solutions rely on human preference votes or on stronger LLMs acting as judges.

  • Chatbot Arena: A crowdsourced, blind comparison platform. A user enters a prompt, receives responses from two anonymous models, and votes for the better one. An Elo-style rating system is then used to rank the models based on these pairwise comparisons (see the sketch after this list). It is considered one of the most robust and realistic benchmarks for general chat capabilities.
  • IFEval (Instruction Following Evaluation): A benchmark that tests how well models adhere to explicit constraints in a prompt (e.g., "end your response with a postscript," "use exactly 5 bullet points"). The evaluation is simple and automatic but does not judge the semantic quality of the response.
  • AlpacaEval & WildBench: These are "LM-as-a-judge" benchmarks. They use a powerful model like GPT-4 to score the responses of other models to a set of instructions. WildBench improves on this by providing the judge with a detailed "checklist" of criteria to consider, making its evaluations more structured and reliable. It has shown a very high correlation with Chatbot Arena rankings.
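
A minimal sketch of the pairwise rating idea behind Chatbot Arena, using the classic online Elo update (the live leaderboard fits a closely related Bradley-Terry model); the K-factor and 1000-point starting rating are illustrative defaults, not the leaderboard's settings.

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(ratings, model_a, model_b, outcome, k=32):
    """Update ratings after one vote. outcome: 1.0 if A wins, 0.0 if B wins,
    0.5 for a tie. The K-factor and 1000-point start are illustrative."""
    r_a = ratings.get(model_a, 1000.0)
    r_b = ratings.get(model_b, 1000.0)
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + k * (outcome - e_a)
    ratings[model_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))

ratings = {}
votes = [("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]
for a, b, result in votes:
    elo_update(ratings, a, b, result)
print(ratings)
```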

4. Agent Benchmarks

These benchmarks evaluate the model's ability to act as an autonomous agent, using tools and performing multi-step tasks to achieve a goal.

  • SWE-Bench: A realistic coding benchmark. The model is given a GitHub issue from a real Python repository and must generate a code patch that resolves the issue and passes the repository's actual unit tests (a minimal grading sketch follows the list).
  • CyBench: A cybersecurity benchmark where the model must perform a series of steps (e.g., reading files, running commands in a bash shell) to solve Capture the Flag (CTF) challenges.
  • MLE-Bench: A benchmark of Kaggle machine learning competitions. The agent must perform a full data science workflow: analyze a dataset, train a model, and generate a valid submission file.
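
A minimal sketch of how a SWE-Bench-style grader checks a candidate patch: apply it to a checkout of the repository and run the designated tests. The paths and test command below are hypothetical, and the real harness adds containerized, pinned environments plus fail-to-pass/pass-to-pass test splits.

```python
import subprocess

def grade_patch(repo_dir, patch_file, test_command):
    """Apply a model-generated patch and run the repo's tests.
    Returns True if the tests pass. All arguments are hypothetical examples;
    the real SWE-Bench harness runs inside pinned Docker environments."""
    apply = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir,
        capture_output=True, text=True,
    )
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    tests = subprocess.run(
        test_command, cwd=repo_dir, capture_output=True, text=True,
    )
    return tests.returncode == 0

# Hypothetical usage:
# resolved = grade_patch("checkouts/some_repo", "prediction.patch",
#                        ["pytest", "tests/test_issue.py"])
```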

5. Safety Benchmarks

These benchmarks are designed to measure a model's propensity to generate harmful, unethical, or dangerous content.

  • HarmBench: A curated set of prompts designed to elicit harmful behaviors, from depicting violence to providing instructions for illegal acts; results are typically summarized as an attack success rate (see the sketch after this list).
  • AIR-Bench: A benchmark based on real-world regulatory frameworks and company policies, covering 314 distinct risk categories.
  • Jailbreaking: This isn't a fixed benchmark but an adversarial process. Techniques like GCG (Greedy Coordinate Gradient) automatically search for adversarial prompt suffixes that can bypass a model's safety alignment and trick it into responding to harmful instructions.
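
A minimal sketch of the headline number most safety suites report, the attack success rate: the fraction of harmful prompts that elicit compliance rather than a refusal. The keyword heuristic below is a crude stand-in; HarmBench and similar benchmarks use trained classifiers or LM judges to decide whether a response is actually harmful.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")  # crude heuristic

def is_refusal(response):
    """Very rough stand-in for a real harm classifier or LM judge."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def attack_success_rate(harmful_prompts, ask_model):
    """Fraction of harmful prompts that elicit a non-refusal (i.e., the
    attack 'succeeds'). `ask_model` is a placeholder model call."""
    successes = sum(not is_refusal(ask_model(p)) for p in harmful_prompts)
    return successes / len(harmful_prompts)
```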

Critical Challenges in the Evaluation Landscape

The lecture concludes by highlighting several critical issues that complicate the interpretation of benchmark results.

  • Realism: Most academic benchmarks consist of "quizzing" prompts (where the user already knows the answer) and do not reflect how real users "ask" questions to get information they don't have. Efforts like Anthropic's Clio, which analyzes real-world usage data, and Stanford's MedHELM aim to build more realistic evaluations grounded in how people actually use these systems.
  • Validity and Contamination: The biggest threat to the validity of any evaluation is train-test overlap. Since modern models are pre-trained on a huge snapshot of the internet, it is highly likely that they have seen the questions from popular benchmarks in their training data, so a high score may reflect memorization rather than true capability. Detecting this "contamination" is a major open problem; a simple n-gram overlap check is sketched after this list.
  • Evaluating Models vs. Methods: The current paradigm primarily evaluates final, integrated systems (e.g., "Llama 3 405B"). This is useful for consumers but makes it difficult for researchers to evaluate the impact of a specific method (e.g., a new training algorithm), as it's impossible to control for all the other variables (data mixture, proprietary fine-tuning) that went into the final product. Benchmarks like the NanoGPT speedrun (training a model to a fixed target loss on a fixed dataset and hardware, as fast as possible) are attempts to return to a more controlled, scientific evaluation of methods.
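
A minimal sketch of one common contamination heuristic: flag a benchmark question if any of its n-grams also appears in the training corpus. Real audits run over trillions of tokens with suffix arrays or Bloom filters and fuzzier matching; this in-memory version only illustrates the shape of the check.

```python
def ngrams(text, n=8):
    """Lowercased whitespace n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_questions, training_documents, n=8):
    """Return the questions that share at least one n-gram with training data.
    This simple set intersection only illustrates the idea; production audits
    need scalable data structures and approximate matching."""
    train_grams = set()
    for doc in training_documents:
        train_grams |= ngrams(doc, n)
    return [q for q in benchmark_questions if ngrams(q, n) & train_grams]
```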