Links
- Lecture video: https://youtu.be/9Cd0THLS1t0
- Course materials: lecture_14.py
Overview: The Engineering of a High-Quality Dataset
This lecture dives into the critical, yet often unglamorous, engineering work required to transform raw, noisy web data into a high-quality dataset suitable for pre-training a Large Language Model. Building on the previous lecture's overview of data sources, this session focuses on the mechanics of data curation—the specific algorithms and techniques used for filtering and deduplication.
The central theme is that data does not arrive ready-to-use; it must be rigorously processed. We will explore the algorithmic toolkit that enables this processing at a petabyte scale, including classical NLP models like n-gram language models, efficient text classifiers like fastText, and probabilistic data structures like Bloom filters and MinHash for handling the immense challenge of finding near-duplicates in web-scale corpora.
Part 1: Filtering Algorithms - The Tools of Curation
Filtering is the process of identifying and selecting a high-quality subset of documents from a massive, raw corpus. The core task is to take a small, trusted "target" dataset (T) and use it to find a much larger, similar subset (T') from the raw data (R). The algorithms for this must be both effective and extremely fast to be practical.
1. N-gram Language Models (KenLM)
- What it is: A classical, statistical language model that calculates the probability of a sequence of words based on the frequency of n-grams (sequences of n words) in a training corpus. KenLM is a highly efficient implementation of this model.
- How it's Used for Filtering (CCNet):
- An n-gram model is trained on a high-quality target corpus, such as Wikipedia.
- This model is then used to score every document in the raw corpus (e.g., Common Crawl) by calculating its perplexity.
- Documents that are "surprising" to the Wikipedia-trained model (i.e., have high perplexity) are considered low-quality and are discarded. Documents that look statistically similar to Wikipedia (low perplexity) are kept.
- Pros & Cons: It is extremely fast and simple, but also a very crude measure of quality, as it only captures surface-level statistics.
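A minimal sketch of this kind of perplexity filter, assuming the `kenlm` Python bindings and a hypothetical Wikipedia-trained model file; the fixed threshold is an arbitrary illustration (CCNet actually buckets documents by perplexity percentile rather than a single cutoff):

```python
import kenlm

# Assumes an n-gram model trained on Wikipedia text (hypothetical file path).
model = kenlm.Model("wiki_en.arpa")

def keep_document(text: str, max_perplexity: float = 1000.0) -> bool:
    """Keep a document only if the Wikipedia-trained LM finds it unsurprising."""
    return model.perplexity(text) <= max_perplexity  # per-word perplexity

docs = ["The quick brown fox jumps over the lazy dog.",
        "click here buy now cheap cheap cheap !!!"]
kept = [d for d in docs if keep_document(d)]
```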
2. Text Classifiers (fastText)
- What it is: fastText is a library for training efficient and surprisingly powerful text classifiers. It represents a document as an average of its word embeddings (or hashed n-gram embeddings) and feeds this single vector into a simple linear classifier.
- How it's Used for Filtering (GPT-3, DCLM):
- Create a Labeled Dataset: Positive examples are drawn from a high-quality target corpus (e.g., WebText, Wikipedia, human-written instruction data). Negative examples are drawn from the raw, unfiltered corpus (e.g., Common Crawl).
- Train a Classifier: A fastText model is trained on this labeled data to predict whether a given document is "high-quality" or "low-quality."
- Apply at Scale: The trained classifier is then run on the entire multi-trillion token raw corpus to score every document. Documents with a high "quality" score are kept.
- Pros & Cons: This is more sophisticated than perplexity filtering and has become the standard for modern data pipelines (e.g., DCLM, Nemotron-CC). While any classifier (like BERT) could be used, fastText's speed is essential for processing web-scale data.
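A minimal sketch of this workflow with the `fasttext` Python package; the training file, label names, and score threshold are illustrative assumptions, not the exact setup of any particular pipeline:

```python
import fasttext

# train.txt holds one labeled example per line, e.g.:
#   __label__hq  an encyclopedic passage drawn from the target corpus
#   __label__lq  boilerplate text drawn from raw Common Crawl
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=5, wordNgrams=2)

def quality_score(text: str) -> float:
    """Probability the classifier assigns to the high-quality label."""
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

raw_docs = ["A detailed explanation of photosynthesis in plants.",
            "CLICK HERE!!! free free free best deals online"]
kept = [d for d in raw_docs if quality_score(d) > 0.9]  # threshold is arbitrary
```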
3. Importance Resampling (DSIR)
- What it is: A more theoretically grounded approach that aims to re-weight a raw dataset so that its distribution matches that of a target dataset.
- How it's Used for Filtering:
- Two simple bag-of-n-grams models are trained: one on the target data (p_T) and one on the raw data (p_R).
- For each document x in the raw data, an importance weight is calculated as w(x) = p_T(x) / p_R(x).
- The final dataset is created by resampling from the raw data, where the probability of selecting each document is proportional to its importance weight.
- Pros & Cons: This method is principled and captures notions of diversity, but in practice, its performance is often similar to a well-trained fastText classifier.
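A minimal sketch of the importance-resampling idea using hashed bag-of-n-gram counts; the bucket count, smoothing, and toy corpora are arbitrary choices for illustration, not the reference DSIR implementation:

```python
import numpy as np

BUCKETS = 10_000  # hashed n-gram feature space

target_docs = ["encyclopedic article about the history of science",
               "a well written reference page about european geography"]
raw_docs = ["buy cheap pills now click here free prizes",
            "a short article about the history of european science",
            "winner winner click here to claim your prize now"]

def ngram_buckets(text, n=2):
    """Map a document to hashed unigram + bigram bucket ids."""
    toks = text.lower().split()
    grams = toks + [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return [hash(g) % BUCKETS for g in grams]

def fit_counts(docs):
    """Fit a smoothed bag-of-n-grams distribution over the hash buckets."""
    counts = np.ones(BUCKETS)                # add-one smoothing
    for d in docs:
        for b in ngram_buckets(d):
            counts[b] += 1
    return counts / counts.sum()

def log_prob(text, probs):
    return sum(np.log(probs[b]) for b in ngram_buckets(text))

p_T = fit_counts(target_docs)                # model of the target data
p_R = fit_counts(raw_docs)                   # model of the raw data

# Importance weight w(x) = p_T(x) / p_R(x), computed in log space for stability.
log_w = np.array([log_prob(d, p_T) - log_prob(d, p_R) for d in raw_docs])
weights = np.exp(log_w - log_w.max())

# Resample from the raw data with probability proportional to w(x).
rng = np.random.default_rng(0)
idx = rng.choice(len(raw_docs), size=2, replace=False, p=weights / weights.sum())
selected = [raw_docs[i] for i in idx]
```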
Part 2: Filtering Applications
The same algorithmic tools are applied to solve several distinct curation tasks.
- Language Identification: The first step in any pipeline is to filter for the desired language(s). The off-the-shelf fastText language ID model, trained on 176 languages, is the industry standard. It is used to predict the language of each document, and only documents exceeding a certain probability threshold (e.g., p(English) > 0.5) are kept (see the sketch after this list).
- Quality Filtering: This is the most crucial and subjective step. As described above, labs use either heuristic rules (like in C4 and RefinedWeb) or, more commonly now, model-based filtering with fastText classifiers trained to identify content that is similar to a reference corpus of human-written or high-quality text.
- Toxicity Filtering: To remove harmful content, classifiers are trained on datasets like the Jigsaw Toxic Comments dataset. In the Dolma pipeline, for instance, they trained separate fastText classifiers to detect hate speech and NSFW content, and removed documents that were flagged by either.
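A minimal sketch of the language-identification step, assuming the off-the-shelf `lid.176.bin` fastText model has been downloaded separately; the 0.5 threshold mirrors the example above:

```python
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # pretrained 176-language model

def is_english(text: str, threshold: float = 0.5) -> bool:
    """Keep only documents the model confidently labels as English."""
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] > threshold

docs = ["The weather is nice today.", "Il fait beau aujourd'hui."]
english_docs = [d for d in docs if is_english(d)]
```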
Part 3: Deduplication - The Fight Against Redundancy
Web-scale datasets are rife with both exact and near-duplicate content (e.g., boilerplate text, mirrored websites, copied code). Removing this redundancy is critical to prevent models from overfitting and memorizing content, and to save compute by not training on the same data repeatedly.
The challenge is that a naive pairwise comparison of all documents is computationally impossible (O(N^2)). Scalable deduplication relies on hashing.
1. Exact Deduplication
- Method: Compute a hash (e.g., MurmurHash) for each item (e.g., a 3-sentence span, as in C4). Sort the hashes and remove all but one of each identical hash.
- Limitation: Fails to catch near-duplicates that differ by even a single character.
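A minimal sketch of exact deduplication over hashed 3-sentence spans, loosely in the spirit of C4; the naive period-based sentence split and the drop-whole-document policy are simplifications for illustration (`mmh3` provides MurmurHash):

```python
import mmh3

def span_hashes(doc: str, span: int = 3):
    """MurmurHash of every 3-sentence span (naive period-based splitting)."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    for i in range(max(len(sents) - span + 1, 1)):
        yield mmh3.hash(" ".join(sents[i:i + span]))

corpus = ["First sentence. Second sentence. Third sentence. Fourth sentence.",
          "First sentence. Second sentence. Third sentence.",   # repeats a span
          "Completely different text. About other things. Entirely new."]

seen, deduped = set(), []
for doc in corpus:
    hashes = list(span_hashes(doc))
    # Drop the whole document if any span was seen before; a real pipeline
    # (e.g., C4) would instead remove just the duplicated spans.
    if not any(h in seen for h in hashes):
        deduped.append(doc)
    seen.update(hashes)

print(deduped)  # keeps the first and third documents
```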
2. Approximate Set Membership (Bloom Filters)
- Use Case: Efficiently checking if a document contains paragraphs that have already been seen in a massive corpus.
- How it Works: A Bloom filter is a probabilistic data structure consisting of a large bit array and k hash functions. To add an item, you hash it k times and set the bits at the corresponding k positions to 1. To check if an item is in the set, you hash it k times and check the bits; if all are 1, the item is probably in the set (with a small, tunable false positive rate).
- Advantage: Extremely memory-efficient for tracking membership of billions of items.
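A toy Bloom filter sketch, using double hashing over a SHA-256 digest to simulate the k hash functions; the array size and k here are illustrative, whereas production filters derive them from the desired false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-slot array with k hash positions per item."""

    def __init__(self, m: int = 1 << 20, k: int = 5):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per slot, for simplicity

    def _positions(self, item: str):
        # Derive k positions from two 64-bit halves of a SHA-256 digest
        # (double hashing), a standard trick to simulate k hash functions.
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True means "probably seen before"; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("a paragraph seen earlier in the corpus")
print("a paragraph seen earlier in the corpus" in bf)   # True
print("a paragraph never added" in bf)                  # almost surely False
```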
3. Near-Duplicate Detection (MinHash + LSH)
This is the standard technique for finding documents that are similar but not identical.
- Step 1: Shingling: Each document is represented as a set of its unique n-grams (shingles).
- Step 2: Jaccard Similarity: The similarity between two documents (A and B) is measured by the Jaccard similarity of their shingle sets: J(A, B) = |A ∩ B| / |A ∪ B|.
- Step 3: MinHash: Calculating the exact Jaccard similarity for all pairs is still too slow. MinHash is a clever hashing technique where the probability of two sets having the same MinHash value is exactly equal to their Jaccard similarity: P[minhash(A) == minhash(B)] = J(A, B).
- Step 4: Locality-Sensitive Hashing (LSH): To find pairs with Jaccard similarity above a certain threshold, LSH is used.
- Multiple MinHash signatures are generated for each document.
- These signatures are divided into b bands, each containing r rows.
- Two documents are considered a candidate pair if they match on all r rows in at least one of the b bands.
- This "AND-then-OR" structure creates a sharp probability curve, making it highly likely that documents above the similarity threshold will be caught as candidates, while those below will not. This reduces the problem from N^2 pairwise comparisons to checking only a small number of candidate pairs.
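A minimal sketch of MinHash signatures with b-band / r-row LSH banding; the shingle size, number of permutations, and band shape below are toy choices. For two documents with Jaccard similarity s, the probability of becoming a candidate pair under this scheme is 1 - (1 - s^r)^b, which is what produces the sharp threshold curve.

```python
import hashlib
from collections import defaultdict

NUM_PERM, BANDS = 128, 32          # 32 bands x 4 rows per band
ROWS = NUM_PERM // BANDS

def shingles(text: str, n: int = 3):
    """Represent a document as its set of word 3-grams."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def _hash(value: str, seed: int) -> int:
    digest = hashlib.md5(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def minhash_signature(shingle_set):
    """One minimum per simulated hash function: P[collision] = Jaccard."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(NUM_PERM)]

def lsh_candidates(docs):
    """Bucket documents by banded signatures; same bucket => candidate pair."""
    buckets = defaultdict(list)
    for doc_id, text in enumerate(docs):
        sig = minhash_signature(shingles(text))
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update((ids[i], ids[j])
                     for i in range(len(ids)) for j in range(i + 1, len(ids)))
    return pairs

docs = ["the cat sat on the mat in the sun",
        "the cat sat on the mat in the warm sun",
        "an entirely different document about language models"]
print(lsh_candidates(docs))  # very likely {(0, 1)}
```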
By combining these filtering and deduplication techniques, data engineering teams can systematically transform the chaotic, raw internet into the structured, high-quality fuel that powers today's most capable Large Language Models.