Overview
- Motivation: Test-time compute mirrors human dual-process thinking (fast System 1 vs. deliberate System 2), allowing models to use variable computation based on problem difficulty
- Thinking in Tokens: Chain-of-thought prompting and RL training (as in DeepSeek-R1) significantly improve reasoning, with emergent "aha moments" of self-correction (a minimal prompt sketch follows this list)
- Branching and Editing: Parallel sampling (best-of-N, beam search) and sequential revision offer complementary approaches; easier problems benefit from sequential compute alone, while harder ones need a mix of both (see the toy sampling sketch below)
- Thinking Faithfully: Reasoning models show more faithful CoT than non-reasoning models, but applying optimization pressure on CoT can lead to obfuscated reward hacking
- Continuous Space Thinking: Recurrent architectures and thinking tokens provide alternative ways to extend computation without explicit linguistic reasoning (pause-token sketch below)
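A minimal illustration of zero-shot chain-of-thought prompting; the question and trigger phrasing here are illustrative examples, not drawn from the post:

```python
# Zero-shot chain-of-thought prompting: appending a "think step by step"
# trigger makes the model spend tokens on intermediate reasoning before
# committing to an answer. Question and phrasing are hypothetical.
question = "A farmer has 17 sheep; all but 9 run away. How many are left?"
cot_prompt = f"Q: {question}\nA: Let's think step by step."
print(cot_prompt)
```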
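A toy sketch of the two branching strategies, assuming stand-in `generate` and `score` functions; a real setup would call an LLM sampler and a verifier or reward model, so nothing here is an actual API:

```python
import random

def generate(prompt: str) -> str:
    """Stand-in for sampling one completion from an LLM at nonzero
    temperature; here it just returns a tagged random chain."""
    return f"{prompt} -> chain#{random.randint(0, 999)}"

def score(completion: str) -> float:
    """Stand-in for a verifier / reward model scoring a completion;
    here it scores at random."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Parallel compute: draw n independent samples and keep the one
    the verifier ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

def sequential_revise(prompt: str, steps: int = 3) -> str:
    """Sequential compute: repeatedly feed the previous attempt back
    to the model and ask for a revision."""
    answer = generate(prompt)
    for _ in range(steps):
        answer = generate(f"{prompt}\nPrevious attempt: {answer}\nRevise:")
    return answer

print(best_of_n("Solve: 17 * 24", n=4))
print(sequential_revise("Solve: 17 * 24", steps=2))
```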
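And a minimal sketch of the thinking-token idea, assuming a hypothetical `<pause>` special token: the extra positions buy the model additional forward passes of computation without any of them carrying linguistic content.

```python
PAUSE = "<pause>"  # hypothetical learnable no-op token

def with_thinking_tokens(prompt_tokens: list[str], k: int = 16) -> list[str]:
    # Append k pause tokens before decoding begins; the model processes
    # these positions (extra compute) but their outputs are discarded.
    return prompt_tokens + [PAUSE] * k

print(with_thinking_tokens(["What", "is", "7", "*", "8", "?"], k=4))
```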
Takeaways
Lilian Weng authored this comprehensive survey on test-time compute and chain-of-thought reasoning. A key insight is that while RL can dramatically improve reasoning capabilities, applying optimization pressure directly to the CoT risks obfuscated reward hacking, in which models learn to hide their true intent from the visible trace.
Test-time compute can readily close small capability gaps on easy and medium questions, but proves much less effective on hard problems.