Reasoning Models

Why does "thinking before answering" make AI smarter?

Chain-of-Thought Reasoning o1/R1
Chapter 01

Fast Thinking vs Slow Thinking

Nobel laureate Daniel Kahneman popularized dual-system theory in Thinking, Fast and Slow: human cognition operates through System 1 (fast, intuitive) and System 2 (slow, deliberate). The analogy is a useful, if loose, way to describe the difference between standard LLMs and reasoning models.

System 1 (Standard LLM): Sees a question and immediately outputs an answer. It relies on pattern matching -- having seen countless similar problems during training, it can "blurt out" a response. Ask "What's 2+2?" and it instantly says "4".

System 2 (Reasoning Model): "Thinks" before answering. It generates a chain of thinking tokens, breaking complex problems into small steps and reasoning through each one. Faced with "Prove the Pythagorean theorem," it writes out the proof step by step.

// INTERACTIVE: Two Ways of Thinking

Two cards show how the two kinds of model respond to the same problem:

Q: A pool has two inlet pipes. Pipe A fills it in 6 hours, pipe B in 8 hours. Both open at once -- how long to fill?

System 1 (Standard LLM): Fast Intuition. Pattern matching, one-shot answer.

System 2 (Reasoning Model): Deep Reasoning. Step-by-step, methodical derivation.

A standard LLM is like a quiz show buzzer -- fastest answer wins. A reasoning model is like a careful exam-taker -- slower, but more accurate on problems that need several steps.
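As a sketch of what the step-by-step route looks like for the pool question above, the derivation can be written out as follows. This is one plausible derivation, not the output of any particular model; the symbols for the fill rates are introduced here for illustration.

```latex
\begin{align*}
\text{rate}_A &= \tfrac{1}{6}\ \text{pool/hour}, \qquad \text{rate}_B = \tfrac{1}{8}\ \text{pool/hour} \\
\text{rate}_{A+B} &= \tfrac{1}{6} + \tfrac{1}{8} = \tfrac{4}{24} + \tfrac{3}{24} = \tfrac{7}{24}\ \text{pool/hour} \\
t &= \frac{1}{7/24} = \frac{24}{7} \approx 3.43\ \text{hours}
\end{align*}
```

A snap answer such as "7 hours" (averaging 6 and 8) skips the key step of adding rates rather than times, which is exactly the kind of error the slower route avoids.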

Chapter 02

Chain-of-Thought — The Magic of Showing Your Work

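In 2022, Google researchers showed that prompting a model with a few worked examples that spell out intermediate reasoning steps substantially improved LLM accuracy on math and logic tasks. This technique is called Chain-of-Thought (CoT) prompting. A follow-up study the same year found that simply appending "Let's think step by step" to a prompt, with no examples at all, also improved accuracy markedly.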

Without CoT, the model jumps straight to an answer -- and often gets it wrong:

Q: 17 x 24 = ?  A: __(often incorrect)

With CoT, the model "shows its work," and accuracy improves significantly:

Q: 17 x 24 = ? Let me think: 17x20=340, 17x4=68, 340+68=408 ✓
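The contrast above can be sketched in a few lines of Python. This is a minimal illustration, not any library's API: `build_prompt` and `multiply_step_by_step` are hypothetical helpers, the "Q:/A:" layout is an assumed prompt format, and no model is actually called.

```python
COT_TRIGGER = "Let's think step by step."


def build_prompt(question: str, use_cot: bool = False) -> str:
    """Return a plain-text prompt, optionally ending with the CoT trigger phrase."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        prompt += f" {COT_TRIGGER}"
    return prompt


def multiply_step_by_step(a: int, b: int) -> list[str]:
    """Spell out a * b by splitting b into tens and ones (intended for 0 <= b < 100)."""
    tens, ones = (b // 10) * 10, b % 10
    part_tens, part_ones = a * tens, a * ones
    return [
        f"{a}x{tens}={part_tens}",
        f"{a}x{ones}={part_ones}",
        f"{part_tens}+{part_ones}={part_tens + part_ones}",
    ]


if __name__ == "__main__":
    print(build_prompt("17 x 24 = ?", use_cot=True))
    print(", ".join(multiply_step_by_step(17, 24)))
```

The second helper reproduces the "show your work" decomposition from the example: each intermediate product is small enough to get right, and the final sum follows from them.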

Taking this further, models like o1, R1, and QwQ are trained to automatically generate chain-of-thought. They produce hundreds or even thousands of thinking tokens before giving a final answer -- this is known as Extended Thinking.
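When a model emits its reasoning inline, an application usually needs to separate the thinking from the final answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, which is the convention used by the open DeepSeek R1 models; other models expose thinking differently (or not at all), so the tag names are parameters and `split_thinking` is a hypothetical helper.

```python
def split_thinking(
    raw: str, open_tag: str = "<think>", close_tag: str = "</think>"
) -> tuple[str, str]:
    """Split raw model output into (thinking, final_answer).

    If no well-formed thinking block is found, the whole text is
    treated as the answer and the thinking part is empty.
    """
    start = raw.find(open_tag)
    end = raw.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", raw.strip()
    thinking = raw[start + len(open_tag):end].strip()
    answer = raw[end + len(close_tag):].strip()
    return thinking, answer


if __name__ == "__main__":
    raw_output = "<think>17x20=340, 17x4=68, 340+68=408</think>\n408"
    thinking, answer = split_thinking(raw_output)
    print("thinking:", thinking)
    print("answer:", answer)
```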

// INTERACTIVE: Direct Answer vs Chain-of-Thought

Q: A bookstore sold 32 books on Monday, 1.5x that on Tuesday, and 8 fewer than Tuesday on Wednesday. How many total?

Chain-of-thought is like a math exam that requires you to "show your work" -- get the process right, and the answer follows.
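For the bookstore question above, the chain of intermediate steps can be made explicit in code. This is only an illustration of the arithmetic; `bookstore_total` is a hypothetical helper name.

```python
def bookstore_total(monday: int = 32) -> dict:
    """Work through the bookstore problem one step at a time."""
    tuesday = monday * 3 // 2   # 1.5x Monday's sales (exact when monday is even)
    wednesday = tuesday - 8     # 8 fewer than Tuesday
    total = monday + tuesday + wednesday
    return {
        "monday": monday,
        "tuesday": tuesday,
        "wednesday": wednesday,
        "total": total,
    }


if __name__ == "__main__":
    for step, value in bookstore_total().items():
        print(f"{step}: {value}")
```

With Monday at 32, the steps are Tuesday = 48, Wednesday = 40, and a total of 120. Each step depends on the one before it, which is why skipping the intermediate values makes a one-shot answer error-prone.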

Chapter 03

o1, R1, QwQ — The Reasoning Model Race

Starting in 2024, major AI labs began releasing dedicated reasoning models, sparking an entirely new competition:

OpenAI o1/o3: Popularized "test-time compute scaling" -- the model invests more computation during inference, and accuracy tends to rise as it is allowed to think longer. o1 posted strong results on math competitions and coding tasks.

DeepSeek R1: An open-weight reasoning model built on a 671B-parameter Mixture-of-Experts (MoE) architecture. DeepSeek also distilled its reasoning behavior into smaller dense models ranging from 1.5B to 70B parameters, showing that small models can acquire strong reasoning abilities.

Qwen QwQ: Alibaba's reasoning model, demonstrating strong reasoning performance across multilingual benchmarks.

The core insight shared by all these models: learn to allocate more compute to harder problems.
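The distillation idea mentioned for DeepSeek R1 can be sketched at the data level: a large teacher model writes out reasoning traces, and a smaller student is fine-tuned to reproduce them. The code below only shows how such training examples might be packaged. The function names, the `prompt`/`completion` field names, the `<think>` tag layout, and the answer checker are all assumptions for illustration, not DeepSeek's actual pipeline.

```python
from typing import Callable, Iterable


def make_sft_example(question: str, teacher_reasoning: str, teacher_answer: str) -> dict:
    """Package one teacher trace as a prompt/completion pair for supervised fine-tuning."""
    completion = (
        f"<think>\n{teacher_reasoning.strip()}\n</think>\n{teacher_answer.strip()}"
    )
    return {"prompt": question.strip(), "completion": completion}


def build_distillation_set(
    traces: Iterable[tuple[str, str, str]],
    is_correct: Callable[[str, str], bool],
) -> list[dict]:
    """Keep only traces whose final answer passes the (hypothetical) checker."""
    return [
        make_sft_example(question, reasoning, answer)
        for question, reasoning, answer in traces
        if is_correct(question, answer)
    ]


if __name__ == "__main__":
    known_answers = {"17 x 24 = ?": "408"}
    traces = [
        ("17 x 24 = ?", "17x20=340, 17x4=68, 340+68=408", "408"),
        ("17 x 24 = ?", "17x24 is about 400", "400"),  # wrong answer, filtered out
    ]
    dataset = build_distillation_set(traces, lambda q, a: known_answers[q] == a)
    print(len(dataset), "example(s) kept")
    print(dataset[0]["completion"])
```

Filtering on the final answer is one simple way to avoid teaching the student from flawed traces; a real pipeline would need a far more careful notion of correctness.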

// INTERACTIVE: Reasoning Model Benchmark Matrix

Higher score = longer bar (normalized from public benchmarks, max 100):

Model          Provider   Math  Code  Reasoning
o3             OpenAI     96    92    94
o1             OpenAI     90    88    89
DeepSeek R1    DeepSeek   88    85    87
QwQ-32B        Alibaba    83    79    82
R1-Distill-7B  DeepSeek   62    55    60

Reasoning models are like students who learned to "spend more time on harder questions" -- not smarter, just better at allocating effort.

Chapter 04

Test-Time Compute — More Thinking, Better Answers

The traditional path to smarter AI is training-time compute: more data, bigger models, longer training. Scaling laws, such as those OpenAI researchers published in 2020, formalize this approach.

But reasoning models open a second path: test-time compute. The same model, when faced with harder problems, thinks longer -- generating more thinking tokens -- and achieves higher accuracy.

This introduces an important trade-off: reasoning models are slower and consume more tokens (thinking tokens cost money too). But for complex tasks, the accuracy improvement can outweigh the extra cost. The key caveat is diminishing returns: each additional increment of thinking yields progressively smaller gains.
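The trade-off can be made concrete with a toy calculation. Everything numeric below is invented for illustration: the per-million-token prices are placeholders, the assumption that thinking tokens are billed like output tokens should be checked against a provider's pricing, and `toy_accuracy` is a made-up saturating curve, not a fit to any real model.

```python
import math


def request_cost(
    prompt_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    price_in_per_m: float,
    price_out_per_m: float,
) -> float:
    """Dollar cost of one request, billing thinking tokens as output (assumption)."""
    output_tokens = thinking_tokens + answer_tokens
    return (prompt_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def toy_accuracy(
    thinking_tokens: int, floor: float = 0.40, ceiling: float = 0.90, scale: float = 2048
) -> float:
    """Toy saturating curve: rises from `floor` toward `ceiling` as the budget grows."""
    return ceiling - (ceiling - floor) * math.exp(-thinking_tokens / scale)


if __name__ == "__main__":
    for budget in (0, 2048, 4096, 8192):
        cost = request_cost(1000, budget, 200, price_in_per_m=1.0, price_out_per_m=4.0)
        print(f"budget={budget:5d}  cost=${cost:.6f}  toy accuracy={toy_accuracy(budget):.3f}")
```

In this toy model, cost grows linearly with the thinking budget while each doubling of the budget buys a smaller accuracy gain, which is the diminishing-returns pattern described above.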

// INTERACTIVE: Thinking Budget vs Accuracy

A slider adjusts the "thinking budget" (allowed thinking tokens, shown here at 2048) and plots the resulting accuracy (%).

Test-time compute is like giving students more time on an open-book exam -- same knowledge, but more time to think means harder problems get solved.

Chapter 05

Summary

🧠

Slow Thinking > Fast Thinking

For complex problems, step-by-step reasoning beats snap intuition. Reasoning models use "System 2 thinking" to tackle hard tasks.

📝

CoT Boosts Accuracy

Making models "show their work" significantly improves accuracy on math, logic, and coding tasks.

Time Trades for Accuracy

Test-time compute is a new scaling dimension: the same model tends to perform better the longer it thinks, with diminishing returns.

📦

Distillation Enables Small Reasoners

DeepSeek R1 showed that large-model reasoning ability can be distilled into 7B or even 1.5B models.

The essential insight of reasoning models: AI gets smarter not just through bigger models, but through deeper thinking.
