Reasoning models generate intermediate thinking steps before producing a final answer, dramatically improving performance on complex tasks like math, coding, and multi-step logic.
The frontier of LLM research: how reasoning models like o1 and DeepSeek-R1 use test-time compute to solve harder problems, why mixture-of-experts unlocks massive scale at manageable cost, and what scaling laws tell us about where AI is heading.
Prompting models to write out detailed intermediate steps before committing to a final answer improves performance on complex math, logic, and coding tasks, and the gain is largest in large models.
Spending more compute at inference time, rather than at training time, can produce better answers. A model can sample several reasoning paths and then select a final answer by majority vote or with a reward model.
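The majority-vote variant can be sketched in a few lines. This is a minimal illustration, assuming the final answer has already been extracted from each sampled reasoning path; `majority_vote` and the sample answers are hypothetical names and data, not taken from any particular library.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and the share of votes it received."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled reasoning paths.
sampled_answers = ["42", "42", "41", "42", "17"]
best, share = majority_vote(sampled_answers)
print(best, share)  # 42 0.6
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled paths disagree and more samples may be worth the cost.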
MoE routes each token to a small number of specialized sub-networks. Even a model with trillions of parameters activates only billions per token, reaching frontier-scale performance at a fraction of the inference cost.
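The Chinchilla scaling results (2022) showed that many large models of the 2020 era were undertrained: parameter counts had been scaled up much faster than training data. For a fixed compute budget, model size and training tokens should grow roughly in proportion.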
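A back-of-the-envelope version of this trade-off can be computed directly. The sketch below rests on two widely quoted approximations that are assumptions here, not figures from this page: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 training tokens per parameter. The function name `compute_optimal` is illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20       # assumed rule-of-thumb ratio D ≈ 20 * N


def compute_optimal(flops_budget):
    """Split a training FLOP budget into (parameters, tokens) under the two assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


n, d = compute_optimal(5.88e23)
print(f"{n / 1e9:.0f}B parameters, {d / 1e12:.1f}T tokens")  # 70B parameters, 1.4T tokens
```

Under these assumptions, doubling the parameter count without also doubling the token count leaves the model on the undertrained side of the curve.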
Distillation trains a compact 'student' model to reproduce the behavior of a larger 'teacher' model. DeepSeek-R1's reasoning abilities were distilled into 7B and 14B models that outperform larger base models on various benchmarks.
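One classic way to express the student-imitates-teacher objective is a temperature-softened KL divergence between the two models' output distributions. The sketch below shows that textbook form only; it is an assumption for illustration and not necessarily the recipe used for the R1 distilled models, which are reported to have been fine-tuned on teacher-generated reasoning traces instead. All names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, [4.0, 1.0, 0.5]))  # identical logits give zero loss
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # a mismatched student gives a positive loss
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of unlikely tokens.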
Training language models on tasks with verifiable rewards, such as math and code, removes the dependency on human preference labels and significantly improves reasoning accuracy.
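What makes a reward "verifiable" is that a program, not a human rater, decides whether the output is correct. The sketch below is a minimal illustration under stated assumptions: the `Answer:` marker is a hypothetical output format, and `math_reward` and `code_reward` are illustrative names rather than part of any training framework.

```python
def math_reward(model_output, reference):
    """Return 1.0 if the text after the last 'Answer:' marker matches the reference exactly."""
    marker = "Answer:"  # hypothetical output convention
    if marker not in model_output:
        return 0.0
    predicted = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference.strip() else 0.0


def code_reward(candidate, test_cases):
    """Return 1.0 only if the candidate function passes every (args, expected) test case."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed check.
            return 0.0
    return 1.0


print(math_reward("2 + 2 is four. Answer: 4", "4"))                    # 1.0
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))     # 1.0
```

Real pipelines usually normalize answers (for example, equivalent fractions) and sandbox code execution; both are omitted here for brevity.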
A small draft model proposes several tokens ahead, and the larger target model verifies them all in a single forward pass, keeping the longest prefix it agrees with. This typically gives a 2–4× speedup, and under greedy decoding the output is identical to what the large model would have generated on its own.
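The propose-and-verify loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant under stated assumptions: `draft_next` and `target_next` are hypothetical functions mapping a token list to the next token, and the verification loop below checks positions one by one, whereas a real target model would score all of them in one forward pass.

```python
def speculative_step(context, draft_next, target_next, k=3):
    """Draft k tokens, then keep the prefix the target agrees with plus one target token."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: the target adds a bonus token
    return accepted


def generate(prompt, draft_next, target_next, n_tokens, k=3):
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + n_tokens]


# Toy models: the target counts upward modulo 10; the draft is wrong after a 4.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

print(generate([0], draft, target, 12))
```

Every emitted token is either confirmed or produced by the target, which is why the result matches plain greedy decoding with the target alone; the speedup comes from how often the draft's guesses are accepted.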
Key-value caching stores the attention keys and values of earlier tokens so they are not recomputed at every decoding step, making long-document analysis and multi-turn agents feasible at context lengths around 200K tokens.
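The saving is easiest to see by counting projections. The sketch below is a toy single-head attention in plain Python, with assumptions labeled: `project_kv` is a hypothetical stand-in for learned key and value projections, and the token embedding is used directly as the query.

```python
import math

calls = {"kv_projections": 0}


def project_kv(x):
    """Stand-in for learned key/value projections (hypothetical fixed scalings)."""
    calls["kv_projections"] += 1
    return [2.0 * xi for xi in x], [0.5 * xi for xi in x]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(len(values[0]))]


def decode_no_cache(embeddings):
    """Recompute keys and values for the whole prefix at every step."""
    outputs = []
    for t in range(1, len(embeddings) + 1):
        kv = [project_kv(x) for x in embeddings[:t]]
        outputs.append(attend(embeddings[t - 1], [k for k, _ in kv], [v for _, v in kv]))
    return outputs


def decode_with_cache(embeddings):
    """Project each token once and append its key and value to the cache."""
    keys, values, outputs = [], [], []
    for x in embeddings:
        k, v = project_kv(x)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs
```

For a sequence of n tokens, the uncached version performs n(n+1)/2 key/value projections and the cached version performs n, at the price of memory that grows linearly with context length.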
OpenAI's o1 demonstrated that spending more compute at inference time, by generating longer chains of thought, could unlock qualitative capability jumps on tasks that stumped previous models: graduate-level math, competitive coding, and scientific reasoning.
DeepSeek-R1 showed that reinforcement learning with verifiable rewards (RLVR), meaning training on problems with objectively correct answers, can instill sophisticated reasoning without expensive human preference data. Its 7B distilled variant is reported to outperform GPT-4 on math benchmarks.
In a dense transformer every parameter is used for every token, so compute cost grows in step with parameter count. Mixture-of-Experts decouples the two: a routing layer selects a few specialist sub-networks per token (typically 2–8), leaving the rest dormant. GPT-4 is widely reported, though not officially confirmed, to use MoE with ~8 experts, activating ~2 per token.
The trade-offs: MoE models require more memory to store all expert weights, have higher communication overhead in distributed setups, and can suffer from load-imbalance where some experts are consistently over-selected. Auxiliary loss terms during training encourage balanced routing.
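Top-k routing and a balancing penalty can both be sketched in plain Python. The auxiliary loss below follows the commonly cited Switch Transformer-style form (number of experts times the sum, over experts, of dispatch fraction times mean router probability); treating that as the loss in question is an assumption, and the scalar "experts" and all function names are illustrative.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, gate_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](token) for i, w in top_k_route(gate_logits, k))


def load_balance_loss(batch_gate_logits, k=2):
    """Assumed Switch-style penalty: n_experts * sum(dispatch_fraction * mean_router_prob)."""
    n_tokens, n_experts = len(batch_gate_logits), len(batch_gate_logits[0])
    dispatch, mean_prob = [0.0] * n_experts, [0.0] * n_experts
    for logits in batch_gate_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
        for i, _ in top_k_route(logits, k):
            dispatch[i] += 1.0 / (n_tokens * k)
    return n_experts * sum(f * p for f, p in zip(dispatch, mean_prob))


# Toy scalar "experts"; real experts are feed-forward sub-networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
print(moe_layer(3.0, [0.0, 5.0, 5.0, -1.0], experts, k=2))  # 1.5
```

With perfectly even routing this penalty sits at its minimum of 1.0, and it grows as tokens pile onto a few experts, which is the pressure that keeps routing balanced during training.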
| Model | Training Approach | Reasoning Method | Parameters | Strength |
|---|---|---|---|---|
| OpenAI o1 / o3 | RLHF + process reward models | Internal chain-of-thought (hidden) | Undisclosed | Math & Science |
| DeepSeek-R1 | RLVR on verifiable tasks | Explicit long CoT, visible to user | 671B MoE (37B active) | Cost Efficiency |
| DeepSeek-R1 Distilled | Knowledge distillation from R1 | CoT inherited from teacher | 7B / 14B / 32B | Small-scale SOTA |
| QwQ-32B | RLVR + self-improvement | Extended reasoning traces | 32B dense | Open weights |
| Claude (Extended Thinking) | Constitutional AI + RLHF | Visible scratchpad thinking | Undisclosed | Safety + Reasoning |
| Gemini Thinking | Multimodal RLHF | Internal multi-hypothesis reasoning | Undisclosed | Multimodal |
Pivotal moments in the reasoning revolution, from chain-of-thought prompting to scaling compute at test time.
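Wei and colleagues demonstrated that including worked, step-by-step examples in the prompt significantly improved LLM performance on multi-step reasoning tasks, with no fine-tuning required. Kojima and colleagues later showed that simply appending the phrase 'Let's think step by step' produces a similar effect without any examples.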
OpenAI developed process reward models that score intermediate reasoning steps rather than only the final result, so that reinforcement learning can reward each step in a solution.
o1 showed that scaling inference compute (generating longer CoT) could outperform significantly larger dense models in math and coding, marking the beginning of the 'inference scaling law' era.
DeepSeek-R1 was trained with verifiable rewards rather than human preference data, and its 7B distilled version is reported to outperform GPT-4 on MATH benchmarks, making frontier reasoning more accessible.
Research has shown that inference compute, spent through methods such as best-of-N sampling, MCTS, and self-refinement, can be traded off predictably against accuracy, giving practitioners a new lever for improving performance.
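Best-of-N sampling is the simplest of these methods to sketch: draw N candidates and keep the one a scorer rates highest. In the toy below, `generate` and `score` are hypothetical stand-ins for a sampled model output and a reward model; the task (guessing a square root of 2) is chosen only so the score is easy to check.

```python
import random


def best_of_n(generate, score, n, seed=0):
    """Sample n candidates with a seeded RNG and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: a "model" that guesses numbers and a "reward model" that checks them.
guess = lambda rng: rng.uniform(0.0, 3.0)
closeness = lambda x: -abs(x * x - 2.0)

for n in (1, 4, 16, 64):
    best = best_of_n(guess, closeness, n)
    print(n, round(closeness(best), 4))
```

Because the same seed is reused, each larger N includes the smaller N's candidates, so the best score can only stay level or improve as N grows. That monotone relationship between samples and quality is the trade-off described above, with cost rising linearly in N.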
Frontier models such as GPT-5, Claude Opus 4.6, Gemini 3, and Grok 4 now ship with reasoning or thinking modes. The focus has shifted from whether LLMs can reason to how to optimize routing strategies.