Overview
- Motivation: Test-time compute mirrors human dual-process thinking (fast System 1 vs. deliberate System 2), allowing models to use variable computation based on problem difficulty
- Thinking in Tokens: Chain-of-thought prompting and RL training (as in DeepSeek-R1) significantly improve reasoning, with emergent "aha moments" of self-correction (a minimal prompt sketch follows this list)
- Branching and Editing: Parallel sampling (best-of-N, beam search) and sequential revision offer complementary approaches; easier problems benefit from sequential compute alone, while harder ones need a mix of both (see the toy sampling sketch below)
- Thinking Faithfully: Reasoning models show more faithful CoT than non-reasoning models, but applying optimization pressure on CoT can lead to obfuscated reward hacking
- Continuous Space Thinking: Recurrent architectures and thinking tokens provide alternative ways to extend computation without explicit linguistic reasoning (pause-token sketch below)
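A minimal illustration of zero-shot chain-of-thought prompting; the question and trigger phrasing here are illustrative examples, not drawn from the post:

```python
# Zero-shot chain-of-thought prompting: appending a "think step by step"
# trigger makes the model spend tokens on intermediate reasoning before
# committing to an answer. Question and phrasing are hypothetical.
question = "A farmer has 17 sheep; all but 9 run away. How many are left?"
cot_prompt = f"Q: {question}\nA: Let's think step by step."
print(cot_prompt)
```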
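A toy sketch of the two branching strategies, assuming stand-in `generate` and `score` functions; a real setup would call an LLM sampler and a verifier or reward model, so nothing here is an actual API:

```python
import random

def generate(prompt: str) -> str:
    """Stand-in for sampling one completion from an LLM at nonzero
    temperature; here it just returns a tagged random chain."""
    return f"{prompt} -> chain#{random.randint(0, 999)}"

def score(completion: str) -> float:
    """Stand-in for a verifier / reward model scoring a completion;
    here it scores at random."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Parallel compute: draw n independent samples and keep the one
    the verifier ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

def sequential_revise(prompt: str, steps: int = 3) -> str:
    """Sequential compute: repeatedly feed the previous attempt back
    to the model and ask for a revision."""
    answer = generate(prompt)
    for _ in range(steps):
        answer = generate(f"{prompt}\nPrevious attempt: {answer}\nRevise:")
    return answer

print(best_of_n("Solve: 17 * 24", n=4))
print(sequential_revise("Solve: 17 * 24", steps=2))
```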
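And a minimal sketch of the thinking-token idea, assuming a hypothetical `<pause>` special token: the extra positions buy the model additional forward passes of computation without any of them carrying linguistic content.

```python
PAUSE = "<pause>"  # hypothetical learnable no-op token

def with_thinking_tokens(prompt_tokens: list[str], k: int = 16) -> list[str]:
    # Append k pause tokens before decoding begins; the model processes
    # these positions (extra compute) but their outputs are discarded.
    return prompt_tokens + [PAUSE] * k

print(with_thinking_tokens(["What", "is", "7", "*", "8", "?"], k=4))
```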
Takeaways
Lilian Weng authored this comprehensive survey on test-time compute and chain-of-thought reasoning. A key insight is that while RL can dramatically improve reasoning capabilities, applying optimization pressure directly to the CoT risks obfuscated reward hacking, in which models learn to hide their true intent from the visible trace.
Test-time compute can readily close small capability gaps on easy and medium questions, but proves much less effective on hard problems.