LLM Volume 3: Reasoning Models, Scaling Inference, and Architectural Design by DataKnobs

Chain-of-Thought

Prompting models to write out detailed intermediate steps before giving a final answer improves performance on complex mathematical, logical, and coding tasks, especially in large models.
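As a minimal sketch of what asking for intermediate steps looks like in practice, the snippet below builds a direct prompt and a chain-of-thought prompt for the same question. The function names, the prompt layout, and the worked example are illustrative assumptions, not part of any particular API.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer only, with no room for intermediate steps."""
    return f"Q: {question}\nA: The answer is"


def cot_prompt(question: str, worked_example: str = "") -> str:
    """Ask the model to reason before answering.

    With a worked example this is a few-shot chain-of-thought prompt;
    without one it relies on a zero-shot trigger phrase alone.
    """
    parts = []
    if worked_example:
        parts.append(worked_example.strip())
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


# Hypothetical worked example showing the reasoning format we want back.
EXAMPLE = (
    "Q: A shelf holds 3 boxes with 4 books each. How many books are there?\n"
    "A: Each box has 4 books and there are 3 boxes, so 3 * 4 = 12. "
    "The answer is 12."
)

question = "A train travels 60 km in 1.5 hours. What is its average speed?"
print(cot_prompt(question, EXAMPLE))
```

The only difference between the two prompts is whether the model is invited to produce reasoning tokens before it commits to an answer.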

Test-Time Compute

Spending more compute during inference, rather than during training, can produce better answers. A model can explore multiple reasoning paths and select the best one through majority voting or reward models.
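The majority-voting idea can be sketched in a few lines. This is a simplified illustration: `sample_answer` is a hypothetical callable standing in for one sampled reasoning path that has been reduced to its final answer.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Sample n independent reasoning paths and vote on their final answers."""
    answers = [sample_answer() for _ in range(n_samples)]
    return majority_vote(answers)


# Five hypothetical final answers extracted from five sampled chains of thought.
sampled = ["42", "41", "42", "7", "42"]
print(majority_vote(sampled))
```

Voting needs no extra model, but it only works when final answers can be compared for equality; open-ended outputs usually need a reward model to rank them instead.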

Mixture of Experts

MoE directs every token to a limited number of specialized sub-networks. Even models with trillions of parameters only utilize billions per token, achieving frontier-scale performance at a fraction of the inference cost.

Scaling Laws

Chinchilla scaling (2022) showed that many large models of the 2020 era were undertrained: parameter counts had grown much faster than the amount of training data, whereas a compute-optimal model scales the two roughly in proportion.
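A rough back-of-the-envelope version of this result is sketched below. It assumes two widely used approximations that are not stated in the text above: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of about 20 training tokens per parameter. Both are rules of thumb rather than exact laws.

```python
import math
from typing import Tuple

FLOPS_PER_PARAM_TOKEN = 6.0  # approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # approximate Chinchilla rule of thumb


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens each parameter 'sees'."""
    return n_tokens / n_params


def compute_optimal(flops_budget: float) -> Tuple[float, float]:
    """Split a FLOP budget into (parameters, tokens) under the two assumptions.

    Solving C = 6 * N * D with D = 20 * N gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


# GPT-3: about 175B parameters trained on about 300B tokens.
print(round(tokens_per_param(175e9, 300e9), 2))
# Chinchilla: about 70B parameters trained on about 1.4T tokens.
print(round(tokens_per_param(70e9, 1.4e12), 2))
```

Under these assumptions GPT-3 saw fewer than 2 tokens per parameter, while Chinchilla saw about 20, which is the sense in which earlier models were undertrained.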

Model Distillation

Distillation trains a compact 'student' model to replicate the behavior of a larger 'teacher' model. DeepSeek-R1 distilled its reasoning abilities into 7B and 14B models that surpass larger base models on various benchmarks.
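For intuition, here is a sketch of the classic soft-label distillation loss, in which the student is trained to match the teacher's temperature-softened output distribution. This is the textbook formulation and an assumption for illustration; the DeepSeek-R1 distilled models were reportedly produced differently, by fine-tuning smaller models on reasoning traces generated by the teacher.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.0]
print(distillation_loss([2.0, 1.0, 0.5], teacher))  # student close to teacher
print(distillation_loss([0.0, 1.0, 2.0], teacher))  # student far from teacher
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative preferences among wrong answers rather than only from its top choice.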

RLVR Training

Training language models on tasks with verifiable rewards, such as math and code, eliminates the dependency on human preference labels and significantly enhances reasoning accuracy.
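A verifiable reward is simply a program that checks the model's output against a known-correct result. The sketch below shows one for math problems; the `Answer: <value>` output format and the function names are assumptions made for illustration, not a description of any specific training pipeline.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the value on the last 'Answer: ...' line, if there is one."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer earns no reward
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        # Otherwise fall back to an exact string comparison.
        return 1.0 if answer == reference.strip() else 0.0


completion = "6 * 7 = 42, so the product is 42.\nAnswer: 42"
print(math_reward(completion, "42"))
```

For code tasks the same role is typically played by running the generated program against unit tests and rewarding it when they pass.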

Speculative Decoding

A compact draft model proposes several tokens ahead, and a larger target model verifies them all in a single forward pass, keeping the ones it agrees with. This yields a 2–4× speedup over standard autoregressive decoding without changing the output the larger model would have produced.
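The propose-and-verify loop can be sketched with toy models. This is the greedy variant under simplifying assumptions: each "model" is a hypothetical function mapping a token sequence to its single most likely next token, and verification is written as a Python loop where a real system would use one batched forward pass. Sampling-based variants use a probabilistic acceptance rule instead of exact match.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding; output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every draft token was accepted, so the target adds one more.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(ctx: Sequence[int]) -> int:
    return (3 * sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: Sequence[int]) -> int:
    # Agrees with the target except at every fourth position.
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7


print(speculative_decode(toy_target, toy_draft, [1, 2], 10))
print(greedy_decode(toy_target, [1, 2], 10))
```

Each round emits at least one token chosen by the target, so the result never differs from the target's own greedy output; the speedup depends entirely on how often the draft guesses correctly.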

KV Cache & Efficiency

Key-value caching stores the attention keys and values of earlier tokens so they are not recomputed at every generation step, making long-document analysis and multi-turn agents feasible at 200K context.
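The saving can be seen in a toy single-head attention loop. In this sketch `project_kv` is a hypothetical stand-in for a layer's learned key and value projections, and each token vector doubles as its own query; the point is only to count how many projections each strategy performs.

```python
import math
from typing import List, Tuple

Vector = List[float]


def project_kv(token: Vector) -> Tuple[Vector, Vector]:
    """Toy key/value projection (hypothetical fixed weights)."""
    return [2.0 * x for x in token], [x + 1.0 for x in token]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_without_cache(tokens: List[Vector]) -> Tuple[List[Vector], int]:
    outputs, projections = [], 0
    for t in range(len(tokens)):
        keys, values = [], []
        for tok in tokens[: t + 1]:  # re-projects the whole prefix every step
            k, v = project_kv(tok)
            projections += 1
            keys.append(k)
            values.append(v)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections


def decode_with_cache(tokens: List[Vector]) -> Tuple[List[Vector], int]:
    outputs, projections = [], 0
    cached_keys: List[Vector] = []
    cached_values: List[Vector] = []
    for tok in tokens:
        k, v = project_kv(tok)  # only the newest token is projected
        projections += 1
        cached_keys.append(k)
        cached_values.append(v)
        outputs.append(attend(tok, cached_keys, cached_values))
    return outputs, projections


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
print(decode_without_cache(sequence)[1], decode_with_cache(sequence)[1])
```

Both strategies produce the same attention outputs, but without a cache the number of projections grows quadratically with sequence length, while with a cache it grows linearly at the price of keeping every key and value in memory.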

Deep Dive

The Reasoning Revolution

OpenAI's o1 demonstrated that spending more compute at inference time, by generating longer chains of thought, could unlock qualitative capability jumps on tasks that stumped previous models: graduate-level math, competitive coding, and scientific reasoning.

o1 scored in the 89th percentile on competitive programming problems (Codeforces), while its predecessor GPT-4o scored in the 11th percentile: roughly an 8× improvement in percentile rank, achieved through reasoning rather than additional pretraining.

DeepSeek-R1 showed that reinforcement learning with verifiable rewards (RLVR), that is, training on problems with objectively correct answers, can instill sophisticated reasoning without expensive human preference data. Its 7B distilled variant outperforms GPT-4 on math benchmarks.

Architecture

Why Mixture of Experts

In a dense transformer, compute cost per token grows linearly with parameter count. Mixture-of-Experts decouples the two: a routing layer selects 2–8 specialist sub-networks per token, leaving the rest dormant. GPT-4 is widely reported to use MoE with ~8 experts, activating ~2 per token.
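A minimal sketch of top-k routing follows. It makes simplifying assumptions: router logits are passed in directly rather than computed by a learned linear layer, the experts are hypothetical toy functions, and gate weights are renormalized over the selected experts only (one of several conventions in use).

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(xs: List[float]) -> List[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their gate weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))


def moe_layer(token: Vector, router_logits: List[float], experts: List[Expert], k: int = 2) -> Vector:
    """Gate-weighted sum of the selected experts; the other experts never run."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


def make_scaling_expert(factor: float) -> Expert:
    """Toy expert that just scales its input (illustrative only)."""
    return lambda vec: [factor * x for x in vec]


experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
print(moe_layer([1.0, 2.0], [0.1, 2.0, -1.0, 2.0], experts, k=2))
```

With four experts and k = 2, only half of the expert parameters are touched for this token, which is where the compute saving comes from.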

With MoE, a model can have 1 trillion total parameters but activate only ~100B per forward pass, matching the inference cost of a much smaller dense model while retaining the capacity of a much larger one.

The trade-offs: MoE models require more memory to store all expert weights, have higher communication overhead in distributed setups, and can suffer from load-imbalance where some experts are consistently over-selected. Auxiliary loss terms during training encourage balanced routing.
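One common form of the auxiliary balancing term, sketched here for top-1 routing, multiplies the fraction of tokens sent to each expert by the mean router probability for that expert and sums over experts. This follows the formulation popularized by the Switch Transformer; the coefficient `alpha` and the toy inputs are assumptions for illustration, and other MoE models use different variants.

```python
from typing import List


def load_balance_loss(router_probs: List[List[float]], alpha: float = 0.01) -> float:
    """alpha * n_experts * sum_i(fraction_routed_i * mean_router_prob_i).

    router_probs[t][i] is the router's probability of expert i for token t.
    The value is alpha under perfectly uniform routing and grows as routing
    concentrates on fewer experts.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # Top-1 assignment: each token goes to its highest-probability expert.
    assigned = [max(range(n_experts), key=lambda i: probs[i]) for probs in router_probs]
    frac_tokens = [assigned.count(i) / n_tokens for i in range(n_experts)]
    mean_prob = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1]] * 4  # every token prefers expert 0
print(load_balance_loss(balanced), load_balance_loss(collapsed))
```

Adding this term to the training loss penalizes the collapsed case more than the balanced one, nudging the router toward spreading tokens across experts.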

Key challenge: MoE models often perform worse than dense models in few-shot transfer settings, as experts specialize during pretraining and may not generalize as flexibly to new tasks.

Comparison

Reasoning Model Landscape

Model	Training Approach	Reasoning Method	Parameters	Strength
OpenAI o1 / o3	RLHF + process reward models	Internal chain-of-thought (hidden)	Undisclosed	Math & Science
DeepSeek-R1	RLVR on verifiable tasks	Explicit long CoT, visible to user	671B MoE (37B active)	Cost Efficiency
DeepSeek-R1 Distilled	Knowledge distillation from R1	CoT inherited from teacher	7B / 14B / 32B	Small-scale SOTA
QwQ-32B	RLVR + self-improvement	Extended reasoning traces	32B dense	Open weights
Claude (Extended Thinking)	Constitutional AI + RLHF	Visible scratchpad thinking	Undisclosed	Safety + Reasoning
Gemini Thinking	Multimodal RLHF	Internal multi-hypothesis reasoning	Undisclosed	Multimodal

The Reasoning Milestones

The pivotal moments of the reasoning revolution, from chain-of-thought prompting to scaling test-time compute.

2022

Chain-of-Thought Prompting

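Wei and colleagues demonstrated that including worked, step-by-step examples in prompts significantly improved LLM performance on multi-step reasoning tasks, without the need for fine-tuning. Kojima and colleagues showed the same year that the phrase 'Let's think step by step' alone elicits similar behavior with no examples at all.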

2023

Process Reward Models

OpenAI developed reward models that evaluate intermediate reasoning steps, rather than just the end results, allowing reinforcement learning to enhance every stage in a solution sequence.
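To make the contrast with outcome-only scoring concrete, the sketch below aggregates per-step scores into one solution score and uses it to rank candidates. The step scores are hypothetical numbers standing in for a process reward model's outputs, and the two aggregation rules (product and minimum) are common choices assumed here for illustration.

```python
import math
from typing import Dict, List


def solution_score(step_scores: List[float], how: str = "product") -> float:
    """Combine per-step correctness scores into a single solution score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if how == "product":
        return math.prod(step_scores)  # every step must be plausible
    if how == "min":
        return min(step_scores)  # a chain is as strong as its weakest step
    raise ValueError(f"unknown aggregation: {how}")


def pick_best(candidates: Dict[str, List[float]], how: str = "product") -> str:
    """Return the name of the candidate with the highest aggregated score."""
    return max(candidates, key=lambda name: solution_score(candidates[name], how))


# Hypothetical step scores for two candidate solutions to the same problem.
candidates = {
    "solution_a": [0.99, 0.20, 0.99],  # one very doubtful step
    "solution_b": [0.80, 0.80, 0.80],  # consistently reasonable steps
}
print(pick_best(candidates))
```

An outcome-only reward would treat both solutions identically if they reached the same final answer, whereas step-level scoring can flag the single weak step in the first one.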

2024 · Sept

OpenAI o1 Launch

o1 proved that scaling inference computation (creating longer CoT) could outperform significantly larger dense models in math and coding, marking the beginning of the 'inference scaling law' era.

2025 · Jan

DeepSeek-R1 & RLVR

DeepSeek-R1 was trained using verifiable rewards rather than human preference data, and its 7B distilled version outperformed GPT-4 on MATH benchmarks, making frontier reasoning more accessible.

2025

Inference Scaling Laws

Research has shown that additional inference compute, spent through methods such as best-of-N sampling, MCTS, and self-refinement, buys accuracy in a predictable way, giving practitioners a new lever for improving performance.
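A simple idealized model shows why the trade-off is predictable. The sketch assumes each sample is independently correct with probability `p_single` and that a perfect verifier can recognize a correct sample; real reward models are imperfect, so actual best-of-N gains are smaller.

```python
import math


def coverage(p_single: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n_samples


def samples_needed(p_single: float, target: float) -> int:
    """Smallest n with coverage(p_single, n) >= target, for 0 < p, target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


for n in (1, 4, 16):
    print(n, round(coverage(0.2, n), 3))
print(samples_needed(0.2, 0.9))
```

Under these assumptions accuracy rises steeply at first and then flattens, so each additional doubling of samples buys less than the previous one.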

2025–2026

Reasoning as Default

Cutting-edge models such as GPT-5, Claude Opus 4.6, Gemini 3, and Grok 4 now ship with reasoning/thinking modes. The focus has moved from whether LLMs can reason to optimizing routing strategies, that is, deciding when a query warrants extended reasoning.