Why High-Dimensional Gaussians Feel Like Soap Bubbles


Concentration of measure pushes Gaussian samples onto a thin shell—here's the intuition, the math, and why typicality matters for generative models.


This post is adapted from a thread I posted on Threads in September 2023, which asked why high-dimensional Gaussians "turn into soap bubbles." The question came from reading Sander Dieleman's excellent primer on typicality.

In one dimension, a standard normal distribution behaves exactly as you'd expect: half the probability sits between -1 and 1, the density peaks at zero, and samples cluster near the center. But scale that picture to thousands of dimensions and something strange happens. The probability mass abandons the center entirely and concentrates on a thin shell, like the surface of a soap bubble. The origin still has the highest pointwise density, but if you actually sample points, you'll almost never find one near zero.

This counterintuitive behavior is called concentration of measure, and it's fundamental to understanding modern machine learning—especially generative models. Here's why it happens and why it matters.

A Toy Example: Hypercubes

Before diving into Gaussians, let's build intuition with something simpler: uniform distributions on hypercubes. Start with a 1×1 square. If you shrink each side by 5%, you're left with a 0.95×0.95 square. The area of that smaller square is 0.95² = 0.9025, which means you've carved out nearly 10% of the original area. So nearly 10% of the probability mass lives in that outer 5% shell.

Now try a 1×1×1 cube. Shrink each dimension by 5% and you get a 0.95×0.95×0.95 cube with volume 0.95³ ≈ 0.857. You've removed 14% of the volume—the shell is getting thicker.

Keep going. In a 100-dimensional unit hypercube, shrinking each side by 5% leaves you with volume 0.95¹⁰⁰ ≈ 0.006. That's less than 1% of the original volume, which means 99.4% of the probability mass lives in that thin 5% shell near the surface.
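The shrinking-volume arithmetic above is a one-liner to verify. A minimal Python sketch (the function name is mine):

```python
# Fraction of a d-dimensional unit hypercube's volume that sits in the
# outer shell left after shrinking every side by 5%.
def shell_fraction(d, shrink=0.05):
    return 1 - (1 - shrink) ** d

for d in (2, 3, 100):
    print(d, round(shell_fraction(d), 4))
# 2 -> 0.0975, 3 -> 0.1426, 100 -> 0.9941
```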

The center of a high-dimensional hypercube is essentially empty. This isn't specific to Gaussians—it's a geometric fact about how volume scales in high dimensions. The surface area explodes exponentially with dimension, so even a thin shell near the boundary contains almost all the available space.

For a more formal proof of this concentration phenomenon, I recommend this lecture on high-dimensional geometry which walks through the mathematics rigorously.

Why the Surface Dominates

In a $d$-dimensional ball, the volume scales like $r^d$. When $d$ is large, shrinking $r$ by even a small fraction multiplies the volume by a factor that is exponentially small in $d$, carving out an enormous chunk. Conversely, the vast majority of volume concentrates near the maximum radius, on the outer shell.

The surface area of concentric shells grows like $r^{d-1}$, so it too climbs steeply with radius. Even though the probability density might be highest at the center (as it is for Gaussians), when you multiply that density by the exploding volume of shells at larger radii, the product peaks away from the origin. Geometry overwhelms the density gradient.

This is hard to visualize because we're limited to three-dimensional intuition. In high dimensions, the rules change. The center is sharp but empty.

Gaussians on the Shell

Now let's apply this geometric insight to Gaussians. Consider a standard multivariate Gaussian $\mathcal{N}(0, I_d)$ in $d$ dimensions. You can think of this as a $d$-dimensional random vector $x = (x_1, x_2, \ldots, x_d)$, where each coordinate $x_i$ is independently drawn from a standard normal distribution with mean 0 and variance 1.

The Euclidean norm $\lVert x \rVert = \sqrt{x_1^2 + x_2^2 + \cdots + x_d^2}$ measures the distance from the origin. Because each coordinate is independent with variance 1, the expected squared norm is just the sum of the variances: $\mathbb{E}[\lVert x \rVert^2] = d$. Taking the square root, the typical radius is around $\sqrt{d}$.

Let's put some numbers to it. Imagine $d = 100$. The typical length of a sample vector is about $\sqrt{100} = 10$. Most samples don't scatter randomly between radius 0 and 10; they concentrate in a narrow band. The shell between radius 8 and radius 12 contains about 99% of all samples. That's a thickness of 4 units around a radius of 10, but in relative terms it's vanishingly thin.

As dimension grows, this shell gets even tighter. By the time you reach 1,000 dimensions, almost every sample lands in a window just a few percent wide relative to the radius. The Gaussian really does look like a soap bubble.
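You can check these numbers directly by sampling. A quick NumPy experiment (the seed and sample count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
x = rng.standard_normal((100_000, d))  # 100k samples from N(0, I_100)
norms = np.linalg.norm(x, axis=1)      # distance of each sample from the origin

print(norms.mean())                         # close to sqrt(100) = 10
print(np.mean((norms > 8) & (norms < 12)))  # about 0.99: nearly all samples in the shell
```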

The Paradox of Typical vs. Likely

Here's the counterintuitive part: for a high-dimensional Gaussian, the density is highest near the origin, so points there are individually the most likely. Yet samples almost never land there, which makes them extremely atypical. Most samples live on the soap bubble shell: each has lower individual density than a point at the origin, but there are so many of them that together they dominate the probability mass.

This paradox—likely samples being atypical—isn't specific to Gaussians. It's a general property of high-dimensional distributions, driven by how surface area scales with dimension.

The shell where most samples live is called the typical set. This is the region whose total probability mass approaches one, even though no individual point inside it has exceptional density. Typicality is a much better mental model than "peak probability" when reasoning about sampling, compression, or anomaly detection in high dimensions.

Why does this happen? The Gaussian density $p(x) \propto \exp(-\lVert x \rVert^2 / 2)$ does peak at zero, but to find probability mass we need to integrate density over volume. The volume of a $d$-dimensional ball scales like $r^d$, which grows exponentially fast in the dimension. Even though density falls off from the origin, the sheer amount of space at radius $\sqrt{d}$ overwhelms the taller density at the center. Getting a sample near the origin requires all $d$ coordinates to be small simultaneously, and that probability drops exponentially fast with $d$.
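The tug-of-war between density and volume can be made concrete through the radial density $p(r) \propto r^{d-1} e^{-r^2/2}$: the $r^{d-1}$ factor is the volume of the shell at radius $r$, and the exponential is the Gaussian density there. Setting its log-derivative $(d-1)/r - r$ to zero puts the peak at $r = \sqrt{d-1}$. A small sketch (the function names are mine):

```python
import math

d = 100

# Unnormalized log of the radial density of N(0, I_d): the r^(d-1)
# shell-volume factor times the exp(-r^2/2) density factor.
def log_radial(r, d):
    return (d - 1) * math.log(r) - r * r / 2

# The log-derivative (d-1)/r - r vanishes at the peak radius.
r_star = math.sqrt(d - 1)
print(r_star)                                      # ~9.95: on the shell, not at the origin
print(log_radial(r_star, d) - log_radial(0.1, d))  # hundreds of nats in favor of the shell
```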

Implications for Machine Learning

As Sander Dieleman writes in his excellent post on typicality: "If we want to learn more about what a high-dimensional distribution looks like, studying the most likely samples is usually a bad idea. If we want to obtain a good quality sample from a distribution, subject to constraints, we should not be trying to find the single most likely one. Yet in machine learning, these are things that we do on a regular basis."

This insight cuts right to the heart of generative modeling. Log-likelihood-based training naturally rewards samples that sit in the typical shell, which is exactly where real data lives. But if you naively chase the mode—the single point with highest density—you end up in the center of the bubble, generating pathological outputs that look nothing like real data.

This explains several puzzling behaviors in generative modeling:

Beam search hunts for the highest-probability trajectory, pulling outputs toward the mode and away from the typical set. That's why it often produces bland, repetitive text.

Likelihood thresholds fail for anomaly detection because outliers can still sit comfortably in the shell. High likelihood doesn't mean "normal"—it just means "typical volume."

Diffusion and autoregressive samplers inject noise during generation precisely to keep samples inside the typical shell, avoiding the trap of collapsing onto the mode.

Practical Takeaways

If you work with generative models or high-dimensional data, these ideas translate directly into design and debugging heuristics:

Track norms, not likelihoods alone. When evaluating samples from a generative model, check whether their norms match the typical shell. Outliers in norm space often signal mode collapse or distributional mismatch, even if likelihood scores look reasonable.

Random projections distort distributions. Most directions in high dimensions are nearly orthogonal. Projecting down to two or three dimensions for visualization can destroy the shell structure and make your distribution look fundamentally different from what it actually is.

Simulation beats intuition. Plot histograms of sample norms as you scale dimension. You'll see the annulus tighten almost immediately, and that direct experience is far more convincing than any amount of algebra.
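If you run that experiment, one number worth tracking is the shell's relative width: the standard deviation of the sample norms divided by their mean. A sketch, assuming NumPy (seed and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
rel_widths = {}
for d in (10, 100, 1000):
    norms = np.linalg.norm(rng.standard_normal((50_000, d)), axis=1)
    # Shell thickness relative to its radius; shrinks roughly like 1/sqrt(2d).
    rel_widths[d] = norms.std() / norms.mean()
    print(d, round(rel_widths[d], 4))
```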

The Geometry Is Talking

High-dimensional Gaussians behave like soap bubbles because concentration of measure overwhelms our low-dimensional instincts. Once you internalize the distinction between pointwise density and typical volume, many "counterintuitive" machine learning behaviors stop being mysterious. Beam search collapses because it chases the mode. Likelihood-based anomaly detection fails because outliers can be typical. Diffusion models need noise because the shell is where real data lives.

The geometry is talking. All we have to do is listen.

Copyright 2025, Ran Ding