DeepSeek-R1, from first principles — an interactive walkthrough

Sixteen widgets · self-contained · arxiv.org/abs/2501.12948

The DeepSeek-R1 paper makes one big claim: reasoning can emerge from reinforcement learning — without anyone showing the model how to reason. Just give it problems, reward correct answers, and let the policy figure out the rest.

Sixteen interactive widgets, in order: the paradigm, the algorithm, the reward, the emergence, the pipeline, the distillation. Then deeper still — a simulator you can run, a token-level view of one gradient step, the geometry of the trust region, and the failure modes that motivated each design choice.

Companion piece to the cost tutorial and the frontier-model field guide. Read those first if you want context on training compute, the transformer block, and the four post-training stages — this one zooms in on the fourth (reasoning RL) and runs the algorithm in your browser.

Heads up: this is the only chapter in the book that assumes you've already met policy-gradient RL (PPO is the canonical reference). If you haven't, drag the sliders on the GRPO simulator (§02, §09) and ignore the equations; the intuition lands first, the algebra later.

Chapter 01

The paradigm shift — imitate, or explore.

Same starting model. Different training signal. Wildly different reasoning.

Traditional reasoning models learn by imitating human-written chain-of-thought traces. Every token is a prediction target; the loss is cross-entropy against a fixed sequence. The model can only learn to write what humans wrote.

DeepSeek-R1 takes a different path. There is no fixed target. The model samples its own completions, and the only feedback is whether the final answer was correct. The gradient flows through whatever reasoning path the model actually took.

Prompt

Q: What is 17 × 23?

Training signal — one fixed human-written trace

17×23=17×20+17×3=340+51=391

loss = −Σ_t log p_θ(y_t | y_<t, x)

What the model learns: raise the probability of this exact token sequence. Every other valid solution path — verification, alternate methods, self-correction — is implicitly penalized.
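To make the imitation loss concrete, here is a minimal sketch of the formula above for a single fixed trace. The per-token probabilities are made-up illustration values, not model outputs.

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one fixed target trace.

    token_probs[t] is p_theta(y_t | y_<t, x): the probability the model
    assigns to the t-th token of the human-written trace.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigns to the reference trace.
confident = [0.9, 0.8, 0.95, 0.9]
unsure = [0.3, 0.2, 0.5, 0.4]

print(sft_loss(confident))  # lower loss: the model already writes this trace
print(sft_loss(unsure))     # higher loss: the gradient raises these probabilities
```

Note that the loss only ever looks at the tokens of the reference trace. A different but equally valid solution path never appears in the sum, so it earns no credit.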

Chapter 02

GRPO — group relative policy optimization.

PPO needs a critic network. GRPO replaces the critic with the mean of a sampled group. Same idea, half the moving parts.

The classical formulation of policy-gradient RL needs a value function — a "critic" that estimates expected reward from a given state. This critic is itself a neural network, trained alongside the policy. It is expensive, brittle, and a frequent source of training instability.

GRPO sidesteps the critic entirely. For each prompt, it samples a group of completions (typically 16). The group's mean reward serves as the baseline. The advantage of any single completion is just (reward − group mean) / (group standard deviation).

No critic, no value network — the group is the baseline. When everyone in the group succeeds, advantages collapse to zero. Learning happens at the variance.
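The group baseline fits in a few lines. This is a sketch under two assumptions that the text does not specify: the standard deviation is the population standard deviation over the group, and a small `eps` guards the division when every reward is identical.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """(reward - group mean) / group std for each completion in one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_advantages([1, 0, 0, 1])   # two right, two wrong
solved = group_advantages([1, 1, 1, 1])  # everyone succeeds

print(mixed)   # positive for the correct rollouts, negative for the wrong ones
print(solved)  # all zeros: nothing to learn from this group
```

The second call shows the collapse described above: when the whole group succeeds, there is no variance and every advantage is zero.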

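J(θ) = 𝔼_{q,{o_i}} [ (1/G) Σ_i min(r_i(θ)·A_i, clip(r_i(θ), 1−ε, 1+ε)·A_i) − β·KL(π_θ‖π_ref) ], where r_i(θ) = π_θ(o_i|q) / π_θ_old(o_i|q) is the probability ratio between the current policy and the policy that sampled the group.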
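The objective can be evaluated by hand for a small group. In the sketch below, `ratios` holds r_i(θ), the probability of each completion under the current policy divided by its probability under the policy that sampled it. The KL value is passed in as a plain number for illustration; the defaults ε = 0.20 and β = 0.04 mirror the widget's sliders.

```python
def clipped_term(ratio, advantage, eps=0.20):
    """min(r*A, clip(r, 1-eps, 1+eps)*A) for one completion."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_objective(ratios, advantages, kl, eps=0.20, beta=0.04):
    """Group average of the clipped terms minus the KL penalty."""
    terms = [clipped_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms) - beta * kl

# A good rollout whose probability already rose 50%: the gain is capped at 1.2*A.
print(clipped_term(1.5, 1.0))
# A bad rollout whose probability rose 50%: the full penalty applies, no cap.
print(clipped_term(1.5, -1.0))
```

The asymmetry is the point of the `min`: clipping limits how much credit a single step can claim, but never hides a loss.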

Group of 4 rollouts

KL coefficient β 0.04

Clip range ε 0.20

Group mean

—

Policy gain

—

Total objective

—

Chapter 03

The reward — regex, not a neural network.

A learned reward model can be hacked. A regex matcher cannot.

R1-Zero's reward function has two components: accuracy (does the final answer match ground truth?) and format (is the answer wrapped in the expected tags?). Both are rule-based — pure regex extraction and string comparison. No learned model anywhere in the reward path.

This is a deliberate, conservative choice. Process reward models — networks trained to score reasoning quality step-by-step — get hacked by the policy. You'll see how, in chapter 12.
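A rule-based grader of this kind can be sketched with two regular expressions. The `<think>`/`<answer>` tag names, the exact-string comparison, and the equal weighting of the two components are assumptions made for illustration, not the paper's exact rules.

```python
import re

# Assumed tag layout: reasoning in <think>, final answer in <answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)

def accuracy_reward(response, ground_truth):
    """1.0 if the extracted answer string equals the ground truth, else 0.0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(response):
    """1.0 if the whole response follows the think-then-answer layout."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0

def total_reward(response, ground_truth):
    # Equal weighting is an assumption; the widget just shows a total.
    return accuracy_reward(response, ground_truth) + format_reward(response)

good = "<think>3x = 22 - 7 = 15, so x = 15 / 3 = 5.</think><answer>5</answer>"
print(total_reward(good, "5"))     # correct and well-formed
print(total_reward("x = 5", "5"))  # right idea, but nothing the rules can extract
```

Nothing in this path has weights, so there is nothing for the policy to optimize against except the rules themselves.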

Question

If 3x + 7 = 22, what is x?

Ground truth: 5

Model response — edit to test

Accuracy reward

—

Format reward

—

Total

—

Chapter 04

The aha moment — thinking longer, on its own.

Nobody told the model to verify its work. It learned to, because verification correlates with correctness.

This is the paper's most striking finding. As RL training progresses, the model spontaneously starts thinking longer about hard problems — and not just longer, but differently. Verification, alternative methods, self-reflection, even moments of "wait, let me reconsider" all emerge without ever being explicitly trained for.

The mechanism is mundane: longer reasoning empirically yields more correct answers, so the gradient pushes toward longer reasoning. The behaviors are downstream of the only thing being optimized — final-answer accuracy.

Chapter 05

The pipeline — four stages, each fixing the last one's failure mode.

R1-Zero proved RL alone could produce reasoning. R1 added the scaffolding to make it usable.

R1-Zero was a research artifact — it reasoned brilliantly but mixed Chinese and English mid-trace, formatted answers oddly, and was bad at non-reasoning tasks like writing or chat. The DeepSeek team built R1 by stacking four training stages on top of V3-Base, each one addressing a specific failure mode of the previous.

Chapter 06

Distillation — discover big, deploy small.

A surprising claim: distilling R1 into a small model beats running RL on that small model directly.

The DeepSeek team ran a clean ablation: take the same Qwen-32B base, apply the same RL pipeline that worked for R1, and measure. The result: 47.0% pass@1 on AIME 2024. Then they took the same base and SFT-trained it on R1's outputs. Result: 72.6%. A gap of more than 25 points, same starting weights, comparable compute.

Reasoning discovery requires scale. Reasoning execution does not.

Chapter 07

What didn't work — three failed approaches.

The paper's negative results are as instructive as the positive ones.

Before settling on the simple recipe — pure GRPO with rule-based rewards — the DeepSeek team tried several more sophisticated approaches that didn't pan out. Each failure exposed a structural flaw in the alternative.

Chapter 08

The full system — twelve nodes, twelve clean dependencies.

Click any component to see its role in the larger system.

Inputs

V3-Base

671B MoE pretrained

Cold-start data

a few thousand traces

Prompts

math, code, science

Ground-truth answers

verifiable solutions

Process

GRPO loop

group sampling + loss

Rule-based reward

accuracy + format

Rejection sampling

filter generated traces

SFT pass

cross-entropy training

Outputs

R1-Zero

pure-RL reasoning

R1

final aligned model

800k training set

distillation source

R1-Distill (1.5B–70B)

small open models

click any node to inspect

DeepSeek-R1 system overview

Twelve components, organized into inputs (data + base model), process (RL loop + filters), and outputs (released models). The system's elegance is that the only learned reward signal — used in the very last stage — is preference-based, applied on top of an already-strong reasoner.

Chapter 09

A working simulator — fifty steps of GRPO.

Enough description. The widget below is GRPO running on a toy four-strategy model with hidden success rates — press play and watch the policy quietly forget the strategies that don't pay.

Up to here, GRPO has been described. Now run it. The model below picks one of four strategies via softmax over logits. Each strategy has a hidden true success probability. At every step, GRPO samples a group of 8, scores them, computes group-relative advantages, and updates the logits.
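The loop just described can be sketched offline. The hidden success rates below are hypothetical (chosen so a uniform policy starts at an expected reward of 0.40), and the update applies the softmax policy gradient directly to the logits, averaged over the group. Group size 8 and learning rate 0.20 follow the widget.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grpo_step(logits, success_rates, rng, group_size=8, lr=0.20):
    """Sample a group, score it, and nudge logits by group-relative advantage."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=group_size)
    rewards = [1.0 if rng.random() < success_rates[a] else 0.0 for a in actions]
    mean = sum(rewards) / group_size
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
    if std == 0.0:
        return list(logits)  # identical rewards: zero advantage, no update
    new_logits = list(logits)
    for a, r in zip(actions, rewards):
        adv = (r - mean) / std
        for i in range(len(logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]  # d log pi(a) / d logit_i
            new_logits[i] += lr * adv * grad / group_size
    return new_logits

def expected_reward(logits, success_rates):
    return sum(p * s for p, s in zip(softmax(logits), success_rates))

def run(success_rates, steps=50, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(success_rates)
    for _ in range(steps):
        logits = grpo_step(logits, success_rates, rng)
    return softmax(logits), expected_reward(logits, success_rates)

HIDDEN_RATES = [0.1, 0.3, 0.5, 0.7]  # hypothetical; the widget's are hidden
print(run(HIDDEN_RATES))
```

Changing `seed` or `lr` is the offline equivalent of dragging the sliders: individual runs are noisy, but probability mass tends to drift toward the strategies that pay.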

Step

Last group reward

—

Expected reward

0.40

Strategy probabilities π_θ(a|x)

Learning rate 0.20

Last group of rollouts

Chapter 10

One gradient step — per-token, on every logit.

A rollout is a few hundred token choices in a row, and after it gets scored the gradient updates the logit at every single position — which is the mechanism that quietly converts a scalar reward into a per-token credit assignment.

The previous chapter showed strategy-level updates. This one shows what happens at the level of individual tokens. A rollout is a sequence of choices: at each position, the model picked one token from a distribution. After the rollout is scored, the gradient updates the logits at every position.

This is how verification behavior gets reinforced — not because verification is rewarded directly, but because it co-occurs with correct answers, and the gradient credits every token in a successful trajectory.

Setup

Question: "What is 12 × 8?" — model generates token-by-token. After the full rollout is scored (R = 1 if correct), GRPO computes advantage A and updates each token's logits in proportion to A.

Rollout advantage A —

The update rule

For each chosen token a_t in a rollout with advantage A:

∇_θ log π_θ(a_t|s_t) · A

Chosen logit goes up by A·(1 − p_chosen). Other logits go down by A·p_i. Negative A flips the sign.
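The rule above is the gradient of log-softmax with respect to the logits, scaled by the advantage. A small sketch for one position, treating the logits themselves as the parameters and using a step size `lr` that is an assumption of this example:

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_logit_update(logits, chosen, advantage, lr=1.0):
    """Apply lr * A * d log pi(chosen) / d logits at one token position.

    The chosen logit moves by lr * A * (1 - p_chosen); every other
    logit moves by -lr * A * p_i.
    """
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if i == chosen else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

uniform = [0.0, 0.0, 0.0, 0.0]
print(token_logit_update(uniform, chosen=2, advantage=1.0))   # token 2 goes up
print(token_logit_update(uniform, chosen=2, advantage=-1.0))  # token 2 goes down
```

In a real model the same signal flows back through the network to the weights; the per-logit picture is just the last step of that chain. The updates at one position always sum to zero, so probability is moved between tokens, not created.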

Chapter 11

The trust region — geometry of staying close.

The β·KL term defines a region in policy space the gradient cannot escape. Here is that region, made draggable.

Pure RL would push the policy to the highest-reward corner of the simplex. In language modeling, that's a disaster — the model would lose coherence everywhere outside the rewarded distribution. The KL penalty pulls the policy back toward a reference, defining a trust region.

The two forces — reward gradient and KL pull — balance at some point inside the simplex. That balance point is what GRPO actually optimizes.

Setup — Three actions: A pays 0.9, B pays 0.5, C pays 0.1. The reference policy π_ref (gray dot) is roughly uniform. Drag the new policy (red dot) inside the simplex.
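For a setup this small, the balance point can be computed directly. The sketch below evaluates the objective E[R] − β·KL(π_θ‖π_ref) and uses the standard closed-form maximizer, π*(a) proportional to π_ref(a)·exp(R(a)/β). The exactly uniform reference and β = 0.40 are assumptions chosen to resemble the widget.

```python
import math

def objective(policy, ref, rewards, beta):
    """Expected reward minus beta times KL(policy || ref)."""
    expected = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref) if p > 0)
    return expected - beta * kl

def optimal_policy(ref, rewards, beta):
    """Maximizer of the objective: ref(a) * exp(R(a) / beta), normalized."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

rewards = [0.9, 0.5, 0.1]        # actions A, B, C from the setup
ref = [1 / 3, 1 / 3, 1 / 3]      # assumed exactly uniform reference
beta = 0.40

best = optimal_policy(ref, rewards, beta)
print(best, objective(best, ref, rewards, beta))
print(objective([1.0, 0.0, 0.0], ref, rewards, beta))  # the greedy corner
```

The greedy corner earns the highest raw reward but pays so much KL that its objective falls below even the reference policy's. Lowering β moves the optimum toward the corner; raising it pulls the optimum back toward π_ref.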

drag · gray = π_ref · red = π_θ · contours = objective

π_θ(A, B, C)

—

Expected reward

—

KL(π_θ‖π_ref)

—

Objective

—

= E[R] − β·KL

KL coefficient β 0.40 (slider range 0 to 2)

Chapter 12

Reward hacking — five exploits, one principle.

Anything inside a neural network can be optimized against. Including a reward model.

The DeepSeek team's choice to use only rule-based rewards was conservative but principled. Process reward models reward surface features of reasoning rather than reasoning outcomes — and once a learned reward is in the optimization loop, the policy will discover the cheapest way to maximize it.

Below: five exploits, one prompt. Toggle through to see how a learned PRM gets fooled while a rule-based grader stays accurate.

Setup — Question: "Is 7919 prime?" (answer: yes).

Learned PRM score

—

Rule-based reward

—

Chapter 13

The alignment tax — paying capability for legibility.

Left to its own devices, R1-Zero code-switched mid-trace because some thoughts are cheaper in Chinese and some in English; the published R1 forces monolingual output and pays a small but measurable accuracy bill for it.

R1-Zero spontaneously code-switched: '所以' for "therefore", English for arithmetic, Chinese for some technical terms. Not because it was confused — because some tokens are denser in one language than the other. Forcing monolingual output costs a measurable amount of accuracy.

The discovery

R1-Zero mixed English and Chinese mid-sentence because some tokens are denser or more probable in one language. R1 punishes mixing — but doing so makes the model slightly worse at reasoning.
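One way to picture the penalty is as a second reward term: the share of the trace written in the target language, added to the accuracy reward with weight λ. The sketch below uses a deliberately crude stand-in for language detection (a whitespace-separated word counts as English if it is pure ASCII), and the weighting scheme is an assumption modeled on the widget's λ slider.

```python
def language_consistency(trace, is_target=str.isascii):
    """Fraction of whitespace-separated words that pass the target-language test."""
    words = trace.split()
    if not words:
        return 0.0
    return sum(1 for w in words if is_target(w)) / len(words)

def shaped_reward(accuracy, trace, lam=0.40):
    """Accuracy reward plus lam times the language-consistency score."""
    return accuracy + lam * language_consistency(trace)

mixed = "所以 x = 5"
clean = "so x = 5"
print(shaped_reward(1.0, mixed))  # correct answer, but loses part of the bonus
print(shaped_reward(1.0, clean))  # correct answer, full bonus
```

Both traces reach the same answer, yet the mixed one scores lower. Under that pressure the policy gives up tokens it would otherwise have chosen, which is where the accuracy cost comes from.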

Sample rollout — same prompt, different λ

Language consistency weight λ 0.40

Mixed-language rate

—

AIME pass@1

—

Readability

—

Legend: accuracy · readability

Chapter 14

R1 vs R1-Zero — side by side, same prompt.

The readability gap, made concrete.

R1-Zero (pure RL, no SFT)

R1 (4-stage pipeline)

Property | R1-Zero | R1

Chapter 15

The architecture beneath — 671B total, 37B active.

R1 is built on V3-Base, a Mixture-of-Experts model. Most of its parameters sit idle on every token.

A dense 671B model would be infeasible to train with RL — the per-token compute cost of GRPO across tens of thousands of training steps would be astronomical. V3-Base is a Mixture-of-Experts, where each MoE layer has 256 small experts and a router picks 8 of them per token. Plus one shared expert that always activates.

Total parameters: 671B. Activated per token: 37B. About 5.5%.

The trick — a Mixture-of-Experts layer has many small FFN experts. A small router picks a few per token. V3 has 1 shared + 8 of 256 routed active per MoE layer.
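The routing step can be sketched with scalars standing in for vectors. Softmax gating over the selected experts is an assumption of this sketch, not a statement of V3's exact gating function; the expert functions are hypothetical placeholders.

```python
import math

def route_top_k(scores, k=8):
    """Return {expert_index: gate_weight} for the k highest router scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_output(x, scores, routed_experts, shared_expert, k=8):
    """Shared expert always runs; only the k routed experts are evaluated."""
    out = shared_expert(x)
    for i, gate in route_top_k(scores, k).items():
        out += gate * routed_experts[i](x)
    return out

NUM_EXPERTS = 256
scores = [i / NUM_EXPERTS for i in range(NUM_EXPERTS)]  # hypothetical router scores
experts = [lambda x: x for _ in range(NUM_EXPERTS)]     # placeholder experts

print(sorted(route_top_k(scores)))                       # the 8 experts that fire
print(moe_output(3.0, scores, experts, lambda x: x))
print(f"routed experts active: {8 / NUM_EXPERTS:.1%}")
print(f"parameters active: {37 / 671:.1%}")
```

The other 248 experts are never called for this token, which is why per-token compute tracks the 37B active parameters and not the 671B total.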

Click an input token to route it

Router top-k scores for selected token

MoE layer: 1 shared + 32 (of 256) routed experts drawn; 8 routed experts are active per token

Total params

671B

Active per token

37B

Sparsity

5.5%

Layers

Chapter 16

A new scaling axis — buying accuracy with thinking time.

Same model, same weights. Just give it a bigger thinking budget.

Classical LLMs scale by training-time compute — more parameters, more data. R1 introduces a second axis: inference-time compute. Same weights, longer thinking budget, higher accuracy. This is the qualitatively new property of reasoning models.

The shape is consistent across benchmarks: steep early gains as the model goes from "direct answer" to "step-by-step", a long climbing region as verification and alternative methods kick in, then a plateau where extra budget doesn't help.

Thinking budget 4,000 tk

Accuracy

—

Latency

—

Cost (rel.)

—

Reasoning depth at this budget

Quiz

The paper's central claim
Answer: That reasoning ability can emerge from pure reinforcement learning on verifiable-correctness rewards, with no human-written chain-of-thought traces required. The model figures out the reasoning style itself; the only feedback is whether the final answer was right.
Distractor: That a 671B-parameter dense model trained for 30T tokens beats GPT-4 on every benchmark.
Distractor: That distillation from a frontier closed-source teacher transfers reasoning to small open models.
Distractor: That RLHF with chain-of-thought preference labels is the most efficient path to reasoning.

Why GRPO needs no critic
Answer: Because it estimates the advantage by sampling a group of completions from the same prompt and using their mean reward as the baseline; the within-group spread plays the role the critic plays in PPO. Same idea, half the moving parts.
Distractor: Because the reward function is rule-based and deterministic, so no value estimate is needed.
Distractor: Because GRPO only updates the final-token logits, so a per-step value isn't needed.
Distractor: Because the KL trust region eliminates the variance the critic was meant to reduce.

The "aha moment"
Answer: The point where the model spontaneously starts saying things like "wait, let me reconsider" mid-trace: emergent self-verification that no one trained it to do. It appears purely as a side effect of being rewarded for correct answers when long reasoning helps.
Distractor: The first training step where loss drops below random baseline.
Distractor: The moment the model first uses LaTeX in its outputs.
Distractor: The point where the reward model converges with the policy and KL → 0.

R1-Zero vs R1
Answer: R1-Zero is pure RL and produces high-accuracy but barely-readable, language-mixed reasoning traces. R1 wraps the RL with two SFT stages (one before, one after) for readability and language consistency, buying ~1–3% accuracy loss in exchange for traces a human can actually read.
Distractor: R1 has fewer parameters but a larger context window than R1-Zero.
Distractor: R1 uses on-policy gradients while R1-Zero uses off-policy, so R1 is more sample-efficient.
Distractor: R1 was trained on multilingual data while R1-Zero was English-only.

Inference-time scaling
Answer: Same model weights, longer thinking budget → higher accuracy. Classical LLMs scale with training-time compute (more parameters / data); reasoning models add a second axis where you pay more tokens at inference for measurably better answers, on a curve that climbs steeply then plateaus.
Distractor: Quantization from FP16 to INT4 at inference time, gaining accuracy from sharper logits.
Distractor: Increasing the temperature at inference time to broaden the search.
Distractor: Running the same prompt through multiple models in parallel and majority-voting.

Total vs active parameters
Answer: Total parameters set the model's capacity (how much knowledge it can store); active parameters set the inference cost (how many fire per token via the router). DeepSeek-V3 has 671B total but only 37B active per token: capacity of a frontier dense model at the cost of a 37B dense one.
Distractor: Total parameters are FP32; active parameters are the FP16 quantized subset that actually runs.
Distractor: Total parameters include the embedding table; active parameters exclude it.
Distractor: Total parameters are the trained set; active parameters are only those gradient updates touched in the last epoch.

What the KL penalty prevents
Answer: Reward hacking: the policy drifts arbitrarily far from the SFT initialization to game the reward. Outputs become incoherent or repetitive while still scoring well on the rule-based reward. The KL penalty keeps the policy near the starting distribution so reasoning improves without the language model collapsing.
Distractor: The gradient norms grow unboundedly and training diverges within a few hundred steps.
Distractor: The model forgets all its pre-training and reverts to a uniform distribution over the vocabulary.
Distractor: The reward signal becomes too sparse and the model stops learning entirely.

Cold-start SFT vs reasoning SFT
Answer: Cold-start SFT (stage 1) is a small supervised pass on a few thousand carefully-curated reasoning traces, just enough to give the RL stage a sane starting policy. Reasoning SFT (stage 3) uses ~600k high-quality reasoning traces sampled from the post-RL model itself, plus general-domain SFT data, to make the final model both reason well and stay readable / general-purpose.
Distractor: Stage 1 trains the tokenizer; stage 3 trains the language model.
Distractor: Stage 1 uses LoRA adapters; stage 3 unfreezes the full model.
Distractor: Stage 1 uses preference labels; stage 3 uses correctness labels.

Sixteen widgets, one paper. The shape of the argument: a paradigm shift, an algorithm, a reward, an emergence, a pipeline, a distillation. Then deeper — a simulator, a per-token view, a trust region, the failure modes. Finally — the alignment tax, the readability gap, the architecture, the new scaling axis.

The DeepSeek-R1 paper, taken apart and made manipulable.

· · ·

End of the walkthrough · Set in Fraunces & IBM Plex

"The paper is in here. Now go price the system that runs it."