Why does "thinking before answering" make AI smarter?
About 10 min read
Nobel laureate Daniel Kahneman popularized dual-system theory in Thinking, Fast and Slow: human cognition operates through System 1 (fast, intuitive) and System 2 (slow, deliberate). The distinction is a useful analogy for the difference between standard LLMs and reasoning models.
System 1 (Standard LLM): Sees a question and immediately outputs an answer. It relies on pattern matching -- having seen countless similar problems during training, it can "blurt out" a response. Ask "What's 2+2?" and it instantly says "4".
System 2 (Reasoning Model): "Thinks" before answering. It generates a chain of thinking tokens, breaking complex problems into small steps and reasoning through each one. Faced with "Prove the Pythagorean theorem," it writes out the proof step by step.
A standard LLM is like a quiz show buzzer -- fastest answer wins. A reasoning model is like a careful exam-taker -- slower, but far more accurate.
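In 2022, Google researchers showed that prompting a model with worked, step-by-step examples dramatically improved LLM accuracy on math and logic tasks. This technique is called Chain-of-Thought (CoT) prompting. A follow-up study the same year found that simply adding "Let's think step by step" to a prompt, with no examples at all, produces a similar effect (zero-shot CoT).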
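As a minimal sketch of what the prompt change looks like in practice, the snippet below builds the same question with and without the "Let's think step by step" trigger. The `build_prompt` helper and the Q/A layout are illustrative assumptions, not part of any particular library or API.

```python
# Hypothetical helper: formats a question with or without the CoT trigger phrase.
COT_TRIGGER = "Let's think step by step."


def build_prompt(question: str, use_cot: bool = False) -> str:
    """Format a question as a Q/A prompt, optionally ending with the CoT trigger."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        # The trigger follows "A:" so the model continues with reasoning steps
        # instead of jumping straight to a final answer.
        prompt += f" {COT_TRIGGER}"
    return prompt


direct_prompt = build_prompt("17 x 24 = ?")
cot_prompt = build_prompt("17 x 24 = ?", use_cot=True)
print(direct_prompt)
print(cot_prompt)
```

The only difference between the two prompts is the trailing sentence; everything else about the request stays the same.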
Without CoT, the model jumps straight to an answer -- and often gets it wrong:
Q: 17 x 24 = ? A: __(often incorrect)
With CoT, the model "shows its work," and accuracy improves significantly:
Q: 17 x 24 = ? Let me think: 17x20=340, 17x4=68, 340+68=408 ✓
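The decomposition in that example can be checked mechanically. The sketch below splits the second factor into tens and ones and records each intermediate step; the function name `multiply_step_by_step` is an assumption made for illustration, and it only handles non-negative integers.

```python
def multiply_step_by_step(a: int, b: int) -> tuple[list[str], int]:
    """Multiply a by b by splitting b into tens and ones, recording each step.

    Illustrative only: assumes b is a non-negative integer.
    """
    tens, ones = divmod(b, 10)
    tens_part = a * tens * 10  # e.g. 17 x 20
    ones_part = a * ones       # e.g. 17 x 4
    total = tens_part + ones_part
    steps = [
        f"{a}x{tens * 10}={tens_part}",
        f"{a}x{ones}={ones_part}",
        f"{tens_part}+{ones_part}={total}",
    ]
    return steps, total


steps, answer = multiply_step_by_step(17, 24)
print(", ".join(steps))  # 17x20=340, 17x4=68, 340+68=408
```

Each intermediate result is small enough to verify on its own, which is the point of showing the work.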
Taking this further, models like o1, R1, and QwQ are trained to automatically generate chain-of-thought. They produce hundreds or even thousands of thinking tokens before giving a final answer -- this is known as Extended Thinking.
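When working with raw output from such a model, the thinking tokens usually need to be separated from the final answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, which is how DeepSeek R1 formats its output; other providers hide or format the reasoning differently, so the tag format and the `split_thinking` helper should be treated as assumptions.

```python
import re

# Assumed format: reasoning appears once, wrapped in <think>...</think>.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw_output: str) -> tuple[str, str]:
    """Return (thinking, answer) from raw model output.

    If no <think> block is found, thinking is empty and the whole
    output is treated as the answer.
    """
    match = THINK_PATTERN.search(raw_output)
    if match is None:
        return "", raw_output.strip()
    thinking = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return thinking, answer


raw = "<think>17x20=340, 17x4=68, 340+68=408</think>\n17 x 24 = 408"
thinking, answer = split_thinking(raw)
print("thinking:", thinking)
print("answer:", answer)
```

Counting the words or tokens in `thinking` gives a rough sense of how much of the response was spent on reasoning rather than on the answer itself.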
Chain-of-thought is like a math exam that requires you to "show your work" -- get the process right, and the answer follows.
Starting in 2024, major AI labs began releasing dedicated reasoning models, sparking an entirely new competition:
OpenAI o1/o3: Popularized "test-time compute scaling" -- the model invests more computation during inference, and longer thinking generally yields better answers. o1 posted strong results on math competitions and coding tasks.
DeepSeek R1: An open-source reasoning model built on a 671B Mixture-of-Experts (MoE) architecture. DeepSeek also distilled R1's reasoning ability into smaller dense models ranging from 1.5B to 70B, showing that small models can possess strong reasoning abilities.
Qwen QwQ: Alibaba's reasoning model, demonstrating strong reasoning performance across multilingual benchmarks.
The core insight shared by all these models: learn to allocate more compute to harder problems.
[Chart: Math, Code, and Reasoning scores per model, normalized from public benchmarks (max 100); higher score = longer bar]
| Model | Provider |
|---|---|
| o3 | OpenAI |
| o1 | OpenAI |
| DeepSeek R1 | DeepSeek |
| QwQ-32B | Alibaba |
| R1-Distill-7B | DeepSeek |
Reasoning models are like students who learned to "spend more time on harder questions" -- not smarter, just better at allocating effort.
The traditional path to smarter AI is training-time compute: more data, bigger models, longer training. OpenAI's Scaling Laws formalize this approach.
But reasoning models open a second path: test-time compute. The same model, when faced with harder problems, thinks longer -- generating more thinking tokens -- and achieves higher accuracy.
This introduces an important trade-off: reasoning models are slower and consume more tokens (thinking tokens cost money too). For complex tasks, the accuracy improvement often outweighs the extra cost; for simple ones, a standard LLM is usually enough. The key caveat is diminishing returns: each additional block of thinking tokens yields a progressively smaller gain.
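To make the trade-off concrete, here is a toy model of it. The saturating accuracy curve, its parameters (`base`, `ceiling`, `scale`), and the price per million tokens are all made-up assumptions chosen for illustration; they are not measurements from any real model or provider.

```python
import math


def toy_accuracy(thinking_tokens: int, base: float = 0.40,
                 ceiling: float = 0.90, scale: float = 2000.0) -> float:
    """Toy curve: accuracy rises from base toward ceiling as the budget grows."""
    return ceiling - (ceiling - base) * math.exp(-thinking_tokens / scale)


def thinking_cost(thinking_tokens: int, price_per_million: float = 10.0) -> float:
    """Cost of the thinking tokens alone, at a hypothetical price per million tokens."""
    return thinking_tokens / 1_000_000 * price_per_million


previous = toy_accuracy(0)
for budget in (2000, 4000, 6000, 8000):
    current = toy_accuracy(budget)
    # Cost grows linearly with the budget while each extra 2000 tokens buys less accuracy.
    print(f"{budget:>5} tokens  cost=${thinking_cost(budget):.3f}  "
          f"accuracy={current:.3f}  gain={current - previous:+.3f}")
    previous = current
```

In this toy curve the cost column grows linearly while the gain column shrinks at every step, which is the diminishing-returns pattern described above.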
Test-time compute is like giving students more time on an open-book exam -- same knowledge, but more time to think means harder problems get solved.
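One simple way to see test-time compute at work is self-consistency: sample several answers and take a majority vote. This is a different mechanism from the single long chain of thought that reasoning models produce, and it is used here only because it is easy to simulate. The `noisy_solver` function is a hypothetical stand-in for a model that answers correctly 60% of the time; no real model is called.

```python
import random
from collections import Counter


def noisy_solver(correct: int, p_correct: float, rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled answer: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct
    # Otherwise return a plausible-looking wrong answer.
    return correct + rng.choice([-10, -1, 1, 10])


def majority_vote(answers: list[int]) -> int:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def vote_accuracy(n_samples: int, p_correct: float = 0.6,
                  trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the majority vote over n_samples is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [noisy_solver(408, p_correct, rng) for _ in range(n_samples)]
        hits += majority_vote(answers) == 408
    return hits / trials


for n in (1, 5, 25):
    print(f"{n:>2} samples -> accuracy {vote_accuracy(n):.3f}")
```

The simulated solver never changes, yet accuracy climbs as more samples are drawn: same knowledge, more compute at answer time.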
For complex problems, step-by-step reasoning beats snap intuition. Reasoning models use "System 2 thinking" to tackle hard tasks.
Making models "show their work" significantly improves accuracy on math, logic, and coding tasks.
Test-time compute is a new scaling dimension: the same model tends to perform better the longer it thinks, up to the point of diminishing returns.
DeepSeek R1 showed that large-model reasoning ability can be distilled into 7B or even 1.5B models.
The essential insight of reasoning models: AI gets smarter not just through bigger models, but through deeper thinking.