The DeepSeek-R1 paper makes one big claim: reasoning can emerge from reinforcement learning — without anyone showing the model how to reason. Just give it problems, reward correct answers, and let the policy figure out the rest.
Sixteen interactive widgets, in order: the paradigm, the algorithm, the reward, the emergence, the pipeline, the distillation. Then deeper still — a simulator you can run, a token-level view of one gradient step, the geometry of the trust region, and the failure modes that motivated each design choice.
Companion piece to the cost tutorial and the frontier-model field guide. Read those first if you want context on training compute, the transformer block, and the four post-training stages — this one zooms in on the fourth (reasoning RL) and runs the algorithm in your browser.
The paradigm shift — imitate, or explore.
Same starting model. Different training signal. Wildly different reasoning.
Traditional reasoning models learn by imitating human-written chain-of-thought traces. Every token is a prediction target; the loss is cross-entropy against a fixed sequence. The model can only learn to write what humans wrote.
DeepSeek-R1 takes a different path. There is no fixed target. The model samples its own completions, and the only feedback is whether the final answer was correct. The gradient flows through whatever reasoning path the model actually took.
GRPO — group relative policy optimization.
PPO needs a critic network. GRPO replaces the critic with the mean of a sampled group. Same idea, half the moving parts.
The classical formulation of policy-gradient RL needs a value function — a "critic" that estimates expected reward from a given state. This critic is itself a neural network, trained alongside the policy. It is expensive, brittle, and a frequent source of training instability.
GRPO sidesteps the critic entirely. For each prompt, it samples a group of completions (typically 16). The group's mean reward serves as the baseline. The advantage of any single completion is just (reward − group mean) / std.
The reward — regex, not a neural network.
A learned reward model can be hacked. A regex matcher cannot.
R1-Zero's reward function has two components: accuracy (does the final answer match ground truth?) and format (is the answer wrapped in the expected tags?). Both are rule-based — pure regex extraction and string comparison. No learned model anywhere in the reward path.
This is a deliberate, conservative choice. Process reward models — networks trained to score reasoning quality step-by-step — get hacked by the policy. You'll see how, in chapter 12.
The aha moment — thinking longer, on its own.
Nobody told the model to verify its work. It learned to, because verification correlates with correctness.
This is the paper's most striking finding. As RL training progresses, the model spontaneously starts thinking longer about hard problems — and not just longer, but differently. Verification, alternative methods, self-reflection, even moments of "wait, let me reconsider" all emerge without ever being explicitly trained for.
The mechanism is mundane: longer reasoning empirically yields more correct answers, so the gradient pushes toward longer reasoning. The behaviors are downstream of the only thing being optimized — final-answer accuracy.
The pipeline — four stages, each fixing the last one's failure mode.
R1-Zero proved RL alone could produce reasoning. R1 added the scaffolding to make it usable.
R1-Zero was a research artifact — it reasoned brilliantly but mixed Chinese and English mid-trace, formatted answers oddly, and was bad at non-reasoning tasks like writing or chat. The DeepSeek team built R1 by stacking four training stages on top of V3-Base, each one addressing a specific failure mode of the previous.
Distillation — discover big, deploy small.
A surprising claim: distilling R1 into a small model beats running RL on that small model directly.
The DeepSeek team ran a clean ablation: take the same Qwen-32B base, apply the same RL pipeline that worked for R1, and measure. The result: 47% on AIME. Then they took the same base and SFT-trained it on R1's outputs. Result: 72.6%. A 25-point gap, same starting weights, comparable compute.
What didn't work — three failed approaches.
The paper's negative results are as instructive as the positive ones.
Before settling on the simple recipe — pure GRPO with rule-based rewards — the DeepSeek team tried several more sophisticated approaches that didn't pan out. Each failure exposed a structural flaw in the alternative.
The full system — twelve nodes, twelve clean dependencies.
Click any component to see its role in the larger system.
A working simulator — fifty steps of GRPO.
Enough description. The widget below is GRPO running on a toy four-strategy model with hidden success rates — press play and watch the policy quietly forget the strategies that don't pay.
Up to here, GRPO has been described. Now run it. The model below picks one of four strategies via softmax over logits. Each strategy has a hidden true success probability. At every step, GRPO samples a group of 8, scores them, computes group-relative advantages, and updates the logits.
One gradient step — per-token, on every logit.
A rollout is a few hundred token choices in a row, and after it gets scored the gradient updates the logit at every single position — which is the mechanism that quietly converts a scalar reward into a per-token credit assignment.
The previous chapter showed strategy-level updates. This one shows what happens at the level of individual tokens. A rollout is a sequence of choices: at each position, the model picked one token from a distribution. After the rollout is scored, the gradient updates the logits at every position.
This is how verification behavior gets reinforced — not because verification is rewarded directly, but because it co-occurs with correct answers, and the gradient credits every token in a successful trajectory.
The trust region — geometry of staying close.
The β·KL term defines a region in policy space the gradient cannot escape. Here is that region, made draggable.
Pure RL would push the policy to the highest-reward corner of the simplex. In language modeling, that's a disaster — the model would lose coherence everywhere outside the rewarded distribution. The KL penalty pulls the policy back toward a reference, defining a trust region.
The two forces — reward gradient and KL pull — balance at some point inside the simplex. That balance point is what GRPO actually optimizes.
Reward hacking — five exploits, one principle.
Anything inside a neural network can be optimized against. Including a reward model.
The DeepSeek team's choice to use only rule-based rewards was conservative but principled. Process reward models reward surface features of reasoning rather than reasoning outcomes — and once a learned reward is in the optimization loop, the policy will discover the cheapest way to maximize it.
Below: five exploits, one prompt. Toggle through to see how a learned PRM gets fooled while a rule-based grader stays accurate.
The alignment tax — paying capability for legibility.
Left to its own devices, R1-Zero code-switched mid-trace because some thoughts are cheaper in Chinese and some in English; the published R1 forces monolingual output and pays a small but measurable accuracy bill for it.
R1-Zero spontaneously code-switched: '所以' for "therefore", English for arithmetic, Chinese for some technical terms. Not because it was confused — because some tokens are denser in one language than the other. Forcing monolingual output costs a measurable amount of accuracy.
R1 vs R1-Zero — side by side, same prompt.
The readability gap, made concrete.
The architecture beneath — 671B total, 37B active.
R1 is built on V3-Base, a Mixture-of-Experts model. Most of its parameters sit idle on every token.
A dense 671B model would be infeasible to train with RL — the per-token compute cost of GRPO across tens of thousands of training steps would be astronomical. V3-Base is a Mixture-of-Experts, where each MoE layer has 256 small experts and a router picks 8 of them per token. Plus one shared expert that always activates.
Total parameters: 671B. Activated per token: 37B. About 5.5%.
A new scaling axis — buying accuracy with thinking time.
Same model, same weights. Just give it a bigger thinking budget.
Classical LLMs scale by training-time compute — more parameters, more data. R1 introduces a second axis: inference-time compute. Same weights, longer thinking budget, higher accuracy. This is the qualitatively new property of reasoning models.
The shape is consistent across benchmarks: steep early gains as the model goes from "direct answer" to "step-by-step", a long climbing region as verification and alternative methods kick in, then a plateau where extra budget doesn't help.
Sixteen widgets, one paper. The shape of the argument: a paradigm shift, an algorithm, a reward, an emergence, a pipeline, a distillation. Then deeper — a simulator, a per-token view, a trust region, the failure modes. Finally — the alignment tax, the readability gap, the architecture, the new scaling axis.
The DeepSeek-R1 paper, taken apart and made manipulable.