Lecture 13 — Data


Overview: The Most Important Ingredient

This lecture argues that data is the single most important, yet most opaque, component in building a Large Language Model. While companies are increasingly open about model architecture and training procedures, the specifics of their pre-training datasets remain a closely guarded secret. This secrecy is driven by two factors: the immense competitive advantage a superior dataset provides, and the significant legal and ethical risks associated with using web-scale data.

The lecture embarks on a journey through the evolution of LLM pre-training data, revealing that "data" is not a monolith that simply "falls from the sky." It is the end product of a massive, complex, and often heuristic engineering pipeline involving sourcing, conversion, filtering, and deduplication. We will see how this pipeline has evolved from simple academic sources to sophisticated, multi-stage recipes involving terabytes of text from the web, books, code, and more.

The Three Stages of Data in an LLM's Lifecycle

The data used to create a modern chatbot can be broken down into three stages, which generally move from vast quantities of low-quality data to small quantities of extremely high-quality data:

  1. Pre-training: The model learns general knowledge, grammar, and reasoning abilities from trillions of tokens of raw text scraped from the internet, books, and code. This creates the base model.
  2. Mid-training (or Continued Pre-training): The base model is further trained on a smaller, higher-quality mixture of data to enhance specific capabilities like coding or mathematics.
  3. Post-training (Alignment): The model is fine-tuned on a small set of instruction-response pairs (SFT) and preference data (RLHF/DPO) to make it a helpful and safe assistant. This creates the final instruct/chat model.

This lecture focuses on the first stage: pre-training.

Part 1: The Evolution of Pre-training Datasets

The history of LLM pre-training is a story of ever-increasing scale and sophistication in data collection and cleaning.

Phase 1: The Academic Era (e.g., BERT, 2018)

  • Sources: The earliest foundation models were trained on clean, well-structured academic corpora. BERT was famously trained on:
    • English Wikipedia: A high-quality, factual encyclopedia written and curated by humans.
    • BooksCorpus: A collection of ~7,000 free, self-published books, chosen for their long-form, narrative text. (Note: this dataset was later taken down due to copyright and terms-of-service issues.)
  • Key Idea: Focus on long, coherent documents rather than shuffled sentences to allow the model to learn long-range dependencies.

Phase 2: Tapping the Web (e.g., GPT-2, T5, 2019)

To get more data, researchers turned to the largest text resource available: the internet.

  • Common Crawl: A non-profit organization that has been crawling and archiving the public web since 2007. It provides petabytes of raw web data in WARC (raw HTML) and WET (extracted text) formats.
  • The Problem: Raw Common Crawl data is incredibly noisy, containing boilerplate, ads, code, and non-prose text. It is largely unusable in its raw form.
  • Early Filtering Efforts:
    • WebText (for GPT-2): A clever heuristic for finding quality. The creators scraped all outbound links from Reddit posts that had at least 3 upvotes, assuming Reddit karma was a good proxy for interesting and well-written content. This resulted in a cleaner 40GB dataset.
    • C4 (Colossal Clean Crawled Corpus, for T5): The T5 team applied a series of aggressive, rule-based heuristics to a snapshot of Common Crawl: they kept only English lines ending in punctuation, removed pages with "bad words" or boilerplate like "Terms of Use," and discarded duplicate paragraphs. This filtered 1.4 trillion tokens down to a much cleaner 800GB dataset.
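
To make these rules concrete, here is a minimal Python sketch of C4-style heuristic filtering. It is a simplification for illustration: the marker phrases and thresholds are assumptions, not the exact T5 rules.

```python
# A minimal sketch of C4-style heuristic page filtering.
# Simplified for illustration; marker phrases and thresholds are assumptions,
# not the exact rules used to build C4.
BAD_PAGE_MARKERS = {"lorem ipsum", "{"}          # discard the whole page
BAD_LINE_MARKERS = {"javascript", "terms of use", "privacy policy", "cookie"}
TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_page(text: str, min_words_per_line: int = 5, min_lines: int = 3):
    """Return cleaned page text, or None if the page should be discarded."""
    if any(marker in text.lower() for marker in BAD_PAGE_MARKERS):
        return None
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCT):
            continue                             # keep only sentence-like lines
        if len(line.split()) < min_words_per_line:
            continue                             # drop very short lines
        if any(marker in line.lower() for marker in BAD_LINE_MARKERS):
            continue                             # drop boilerplate and script warnings
        kept.append(line)
    return "\n".join(kept) if len(kept) >= min_lines else None
```

Rules like these are cheap enough to run over billions of pages, which is one reason heuristic filtering remained a standard first pass even after learned quality classifiers appeared.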

Phase 3: The Era of Massive, Curated Mixtures (e.g., GPT-3, The Pile, Llama)

As models grew, the need for both massive scale and high diversity became clear. This led to the creation of large, multi-source datasets.

  • GPT-3 Dataset (2020): While the exact composition was never fully disclosed, it was a mix of a filtered version of Common Crawl, an expanded WebText2, two large (and mysterious) book corpora, and Wikipedia. A key innovation was the use of a quality classifier: the team trained a model to distinguish between their high-quality reference corpora (books, Wikipedia) and the noisy Common Crawl, then used this classifier to filter the web data (a sketch of this approach appears after this list).

  • The Pile (2021): A landmark open-source effort by EleutherAI to create a diverse, 825GB dataset from 22 different high-quality sources. This went far beyond the web, including:

    • Academic & Technical: PubMed Central, arXiv, USPTO Patents, Stack Exchange, GitHub.
    • Prose: Project Gutenberg (public domain books), Books3 (a controversial shadow library corpus).
    • Dialogue & Social Media: OpenSubtitles, Ubuntu IRC logs, HackerNews.
    • Web: A filtered subset of Common Crawl (Pile-CC).
  • Llama Series Datasets: The Llama models were trained on a mix similar to The Pile, using publicly available sources like Common Crawl, C4, GitHub, Wikipedia, books, and arXiv. This dataset was replicated by the open-source community as RedPajama.
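
The classifier idea behind the GPT-3 dataset can be sketched in a few lines: train a simple linear model to separate "reference" text (books, Wikipedia) from raw Common Crawl text, then keep only web documents that score as reference-like. The toy documents and the plain probability threshold below are illustrative assumptions; the actual GPT-3 pipeline used its own features and a stochastic keep rule rather than a hard cutoff.

```python
# A minimal sketch of classifier-based quality filtering in the spirit of the
# GPT-3 data pipeline (not the actual implementation; documents and threshold are toy).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reference_docs = [  # stand-ins for high-quality corpora (books, Wikipedia)
    "Photosynthesis is the process by which plants convert light into chemical energy.",
    "The novel opens on a quiet street in autumn, where the narrator recalls her childhood.",
]
crawl_docs = [      # stand-ins for noisy Common Crawl text
    "click here buy now best price free shipping limited offer !!!",
    "home | about | contact | terms of use | login | register",
]

X = reference_docs + crawl_docs
y = [1] * len(reference_docs) + [0] * len(crawl_docs)

quality_clf = make_pipeline(HashingVectorizer(n_features=2**18), LogisticRegression())
quality_clf.fit(X, y)

def keep(doc: str, threshold: float = 0.5) -> bool:
    """Keep a web document if it looks more like the reference corpora."""
    return quality_clf.predict_proba([doc])[0][1] >= threshold
```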

Phase 4: Modern Data Pipelines (e.g., RefinedWeb, Dolma, DCLM)

The latest pre-training datasets are the product of highly refined, industrial-scale data processing pipelines.

  • RefinedWeb / FineWeb: An approach arguing that, with sufficient filtering, web data is "all you need." The process starts from raw WARC files, uses better HTML-to-text extractors (such as trafilatura), applies a chain of rule-based quality filters (such as the Gopher rules), and performs aggressive fuzzy deduplication (a sketch of this step follows the list). The resulting FineWeb dataset contains 15 trillion high-quality English tokens.
  • Dolma (AI2): A 3-trillion-token, open-source dataset that combines a filtered Common Crawl with curated sources like The Stack (code), academic papers, Reddit, and books.
  • DataComp-LM (DCLM): A research initiative to standardize the data filtering process. They provide a massive, lightly-filtered pool of 240T tokens from Common Crawl and a baseline filtering method that uses a fastText quality classifier. The classifier is trained to distinguish between high-quality human data (e.g., instruction-following datasets) and a sample of the web, proving to be more effective than purely heuristic filtering.
  • Nemotron-CC: An NVIDIA dataset that combines multiple filtering techniques, including classifier ensembling and using a powerful LLM to rephrase or synthesize new data based on existing high-quality documents, in an effort to get more value out of the available raw data.
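
As an illustration of the fuzzy deduplication step used in pipelines like RefinedWeb/FineWeb, here is a minimal MinHash + LSH sketch built on the third-party datasketch package. The shingle size, number of permutations, and similarity threshold are illustrative assumptions; production pipelines run this over billions of documents on distributed infrastructure.

```python
# A minimal sketch of near-duplicate removal with MinHash + LSH (datasketch).
# Parameters are illustrative, not those of any specific production pipeline.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def minhash_signature(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature over word n-gram shingles of a document."""
    words = text.lower().split()
    sig = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - ngram + 1, 1)):
        sig.update(" ".join(words[i:i + ngram]).encode("utf-8"))
    return sig

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only the first copy of each cluster of near-duplicate documents."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for i, doc in enumerate(docs):
        sig = minhash_signature(doc)
        if lsh.query(sig):            # a near-duplicate was already kept
            continue
        lsh.insert(f"doc-{i}", sig)
        kept.append(doc)
    return kept
```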

Part 2: Legal and Ethical Questions (Copyright)

Training on vast swathes of the internet raises significant legal and ethical questions, primarily centered on copyright.

  • Copyright Basics: In the US, copyright protection is automatic for any "original work of authorship fixed in any tangible medium." This means that almost every blog post, news article, and piece of code on the internet is copyrighted by default, and using such a work without permission is infringement unless an exception, such as fair use, applies.
  • The "Fair Use" Defense: The primary legal argument used by AI companies is fair use. This is a four-factor balancing test in US law that permits limited use of copyrighted material without permission. The key factor in the context of AI training is whether the use is "transformative." Companies argue that training a model is highly transformative—the model learns statistical patterns and ideas, it does not "store and regurgitate" copies of the original works.
  • The Economic Harm Factor: A major counterargument is the fourth factor of fair use: the effect of the use upon the potential market for the original work. Artists, writers, and news organizations argue that generative models directly compete with and harm their market. The outcome of ongoing lawsuits on this issue will shape the future of data collection.
  • Licensing as a Solution: To mitigate legal risk, many companies are now actively licensing data from sources like Reddit and Shutterstock, paying for the right to train on their content.

Part 3: Data for Specific Capabilities (Mid- and Post-Training)

Beyond general pre-training, data is used to imbue models with specific skills.

  • Long Context: To extend a model's context window (e.g., from 4k to 100k), it is typically fine-tuned on a smaller dataset of long documents, often from books (Project Gutenberg) or scientific papers (arXiv).
  • Instruction Following & Chat: This is the crucial post-training alignment phase.
    • The Rise of Synthetic Data: Early instruction models like Alpaca were created by using a powerful proprietary model (like GPT-3.5) to generate 52,000 instruction-response pairs, a technique called self-instruct (a sketch of this idea appears at the end of this section).
    • Real User Data: Models like Vicuna were fine-tuned on conversations scraped from ShareGPT, where users would share their interesting ChatGPT interactions.
    • Modern Alignment Datasets: Today, alignment datasets are a carefully curated mixture of human-annotated data, high-quality public datasets, and synthetically generated data from multiple powerful models. The goal is to create a diverse set of examples covering general instructions, reasoning, coding, and safety.
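
As a final illustration, here is a minimal self-instruct-style sketch: a small set of hand-written seed tasks is used to prompt an existing model to invent and answer new instructions, producing synthetic (instruction, response) pairs. It assumes the openai Python client and an API key in the environment; the seed tasks, model name, prompt wording, and output parsing are placeholder choices, not the actual Alpaca recipe.

```python
# A minimal self-instruct-style sketch (assumes the openai client is installed
# and OPENAI_API_KEY is set; prompt format and model name are placeholders).
from openai import OpenAI

client = OpenAI()

seed_tasks = [  # hypothetical hand-written seed instructions
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the causes of the French Revolution in three sentences.",
]

def generate_pair(seed: str) -> dict:
    """Ask a teacher model to invent a new instruction inspired by a seed task,
    then answer it, yielding one synthetic (instruction, response) pair."""
    prompt = (
        "Here is an example task:\n"
        f"{seed}\n\n"
        "Write ONE new, different instruction a user might ask, then answer it.\n"
        "Format:\nInstruction: <instruction>\nResponse: <response>"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    instruction, _, response = reply.partition("Response:")
    return {
        "instruction": instruction.replace("Instruction:", "").strip(),
        "response": response.strip(),
    }

synthetic_pairs = [generate_pair(seed) for seed in seed_tasks]
```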