ajinkya.ai An experiment in learning with AI.
← All entries
09 May 2026 17 min read

DeepSeek-R1, from first principles — an interactive walkthrough

LLM Tutorial Reasoning Rl Deepseek Interactive
Sixteen widgets · self-contained · arxiv.org/abs/2501.12948

The DeepSeek-R1 paper makes one big claim: reasoning can emerge from reinforcement learning — without anyone showing the model how to reason. Just give it problems, reward correct answers, and let the policy figure out the rest.

Sixteen interactive widgets, in order: the paradigm, the algorithm, the reward, the emergence, the pipeline, the distillation. Then deeper still — a simulator you can run, a token-level view of one gradient step, the geometry of the trust region, and the failure modes that motivated each design choice.

Companion piece to the cost tutorial and the frontier-model field guide. Read those first if you want context on training compute, the transformer block, and the four post-training stages — this one zooms in on the fourth (reasoning RL) and runs the algorithm in your browser.

Heads upThis is the only chapter in the book that assumes you've already met policy-gradient RL — PPO is the canonical reference. If you haven't, drag the sliders on the GRPO simulator (§02, §09) and ignore the equations; the intuition lands first, the algebra later.
Chapter 01

The paradigm shift — imitate, or explore.

Same starting model. Different training signal. Wildly different reasoning.

Traditional reasoning models learn by imitating human-written chain-of-thought traces. Every token is a prediction target; the loss is cross-entropy against a fixed sequence. The model can only learn to write what humans wrote.

DeepSeek-R1 takes a different path. There is no fixed target. The model samples its own completions, and the only feedback is whether the final answer was correct. The gradient flows through whatever reasoning path the model actually took.

01 / SFT vs RL
Prompt
Q: What is 17 × 23?
Training signal — one fixed human-written trace
17×23=17×20+17×3=340+51=391
loss = Σt log pθ(yt | y<t, x)

What the model learns: raise the probability of this exact token sequence. Every other valid solution path — verification, alternate methods, self-correction — is implicitly penalized.

NoteThe crucial property of RL: the gradient flows through whatever path the model actually took. If the model invents a new reasoning style and gets the right answer, that style is reinforced.
Chapter 02

GRPO — group relative policy optimization.

PPO needs a critic network. GRPO replaces the critic with the mean of a sampled group. Same idea, half the moving parts.

The classical formulation of policy-gradient RL needs a value function — a "critic" that estimates expected reward from a given state. This critic is itself a neural network, trained alongside the policy. It is expensive, brittle, and a frequent source of training instability.

GRPO sidesteps the critic entirely. For each prompt, it samples a group of completions (typically 16). The group's mean reward serves as the baseline. The advantage of any single completion is just (reward − group mean) / std.

No critic, no value network — the group is the baseline. When everyone in the group succeeds, advantages collapse to zero. Learning happens at the variance.
02 / GRPO loss
J(θ) = 𝔼q,{oi} [ 1G Σi min(ri(θ)·Ai, clip(ri, 1−ε, 1+ε)·Ai) − β · KL(πθ‖πref) ]
Group of 4 rollouts
KL coefficient β 0.04
Clip range ε 0.20
Group mean
Policy gain
Total objective
Try thisSet all four rewards to 1.0 — advantages collapse to zero, no learning. Push rollout 1's ratio to 1.5 with ε=0.2 — the clip activates. Crank β way up — KL dominates, policy refuses to move.
Chapter 03

The reward — regex, not a neural network.

A learned reward model can be hacked. A regex matcher cannot.

R1-Zero's reward function has two components: accuracy (does the final answer match ground truth?) and format (is the answer wrapped in the expected tags?). Both are rule-based — pure regex extraction and string comparison. No learned model anywhere in the reward path.

This is a deliberate, conservative choice. Process reward models — networks trained to score reasoning quality step-by-step — get hacked by the policy. You'll see how, in chapter 12.

03 / Reward checker
Question
If 3x + 7 = 22, what is x?
Ground truth: 5
Model response — edit to test
Accuracy reward
Format reward
Total
DetailR1-Zero used only accuracy + format. R1 added a third "language consistency" rule to penalize mid-trace language switching. The cost of that rule is explored in chapter 13.
Chapter 04

The aha moment — thinking longer, on its own.

Nobody told the model to verify its work. It learned to, because verification correlates with correctness.

This is the paper's most striking finding. As RL training progresses, the model spontaneously starts thinking longer about hard problems — and not just longer, but differently. Verification, alternative methods, self-reflection, even moments of "wait, let me reconsider" all emerge without ever being explicitly trained for.

The mechanism is mundane: longer reasoning empirically yields more correct answers, so the gradient pushes toward longer reasoning. The behaviors are downstream of the only thing being optimized — final-answer accuracy.

04 / Aha timeline
Training step
0
AIME pass@1
15.6%
Avg tokens / response
~200
Step
Sample rollout
Behaviors detected
What you're seeingLength and accuracy co-emerge. The verification, alternative methods, and metacognition are all downstream of the same simple pressure.
Chapter 05

The pipeline — four stages, each fixing the last one's failure mode.

R1-Zero proved RL alone could produce reasoning. R1 added the scaffolding to make it usable.

R1-Zero was a research artifact — it reasoned brilliantly but mixed Chinese and English mid-trace, formatted answers oddly, and was bad at non-reasoning tasks like writing or chat. The DeepSeek team built R1 by stacking four training stages on top of V3-Base, each one addressing a specific failure mode of the previous.

05 / R1 pipeline
Current model state
DeepSeek-V3-Base
The hidden trick in stage 3The team restarts from V3-Base, not from the Stage-2 checkpoint. Stage 2 generates training data; Stage 3 trains a fresh model on that data. This decouples reasoning quality from the rough exploration noise of an RL'd checkpoint.
Chapter 06

Distillation — discover big, deploy small.

A surprising claim: distilling R1 into a small model beats running RL on that small model directly.

The DeepSeek team ran a clean ablation: take the same Qwen-32B base, apply the same RL pipeline that worked for R1, and measure. The result: 47% on AIME. Then they took the same base and SFT-trained it on R1's outputs. Result: 72.6%. A 25-point gap, same starting weights, comparable compute.

Reasoning discovery requires scale. Reasoning execution does not.
06 / Distill ablation
Distilled from R1 RL on base directly Reference
ImplicationFor most labs, the right move is to distill from a larger reasoner rather than try to RL-train a small one. The reasoning patterns are transferable; the capacity to discover them isn't.
Chapter 07

What didn't work — three failed approaches.

The paper's negative results are as instructive as the positive ones.

Before settling on the simple recipe — pure GRPO with rule-based rewards — the DeepSeek team tried several more sophisticated approaches that didn't pan out. Each failure exposed a structural flaw in the alternative.

07 / Negative results
PatternEach failed approach added a learned component (PRM, value function, search heuristic) that became a target for reward hacking or computational explosion. Simple rule-based rewards stayed legible and stable.
Chapter 08

The full system — twelve nodes, twelve clean dependencies.

Click any component to see its role in the larger system.

08 / System map
Inputs
V3-Base
671B MoE pretrained
Cold-start data
~few thousand traces
Prompts
math, code, science
Ground-truth answers
verifiable solutions
Process
GRPO loop
group sampling + loss
Rule-based reward
accuracy + format
Rejection sampling
filter generated traces
SFT pass
cross-entropy training
Outputs
R1-Zero
pure-RL reasoning
R1
final aligned model
800k training set
distillation source
R1-Distill (1.5B–70B)
small open models
click any node to inspect
DeepSeek-R1 system overview
Twelve components, organized into inputs (data + base model), process (RL loop + filters), and outputs (released models). The system's elegance is that the only learned reward signal — used in the very last stage — is preference-based, applied on top of an already-strong reasoner.
Chapter 09

A working simulator — fifty steps of GRPO.

Enough description. The widget below is GRPO running on a toy four-strategy model with hidden success rates — press play and watch the policy quietly forget the strategies that don't pay.

Up to here, GRPO has been described. Now run it. The model below picks one of four strategies via softmax over logits. Each strategy has a hidden true success probability. At every step, GRPO samples a group of 8, scores them, computes group-relative advantages, and updates the logits.

09 / GRPO simulator
Step
0
Last group reward
Expected reward
0.40
Strategy probabilities πθ(a|x)
Learning rate 0.20
Last group of rollouts
Try thisCrank LR to 0.5 and reset. The strategy distribution becomes erratic — some runs collapse onto the wrong strategy because early lucky rewards push it up before enough samples correct the estimate.
Chapter 10

One gradient step — per-token, on every logit.

A rollout is a few hundred token choices in a row, and after it gets scored the gradient updates the logit at every single position — which is the mechanism that quietly converts a scalar reward into a per-token credit assignment.

The previous chapter showed strategy-level updates. This one shows what happens at the level of individual tokens. A rollout is a sequence of choices: at each position, the model picked one token from a distribution. After the rollout is scored, the gradient updates the logits at every position.

This is how verification behavior gets reinforced — not because verification is rewarded directly, but because it co-occurs with correct answers, and the gradient credits every token in a successful trajectory.

10 / Token-level update
Setup
Question: "What is 12 × 8?" — model generates token-by-token. After the full rollout is scored (R = 1 if correct), GRPO computes advantage A and updates each token's logits in proportion to A.
Rollout advantage A
The update rule
For each chosen token at in a rollout with advantage A:
θ log πθ(at|st) · A
Chosen logit goes up by A·(1 − pchosen). Other logits go down by A·pi. Negative A flips the sign.
The mechanicRepeatedly press "New rollout" → "Apply update". Watch how correct trajectories consolidate. Verification tokens like ", let me check" get reinforced because they appear alongside correct answers.
Chapter 11

The trust region — geometry of staying close.

The β·KL term defines a region in policy space the gradient cannot escape. Here is that region, made draggable.

Pure RL would push the policy to the highest-reward corner of the simplex. In language modeling, that's a disaster — the model would lose coherence everywhere outside the rewarded distribution. The KL penalty pulls the policy back toward a reference, defining a trust region.

The two forces — reward gradient and KL pull — balance at some point inside the simplex. That balance point is what GRPO actually optimizes.

11 / KL trust region
Setup — Three actions: A pays 0.9, B pays 0.5, C pays 0.1. The reference policy πref (gray dot) is roughly uniform. Drag the new policy (red dot) inside the simplex.
always A (R=0.9) always B (R=0.5) always C (R=0.1)
drag · gray = πref · red = πθ · contours = objective
πθ(A, B, C)
Expected reward
KL(πθ‖πref)
Objective
= E[R] − β·KL
KL coefficient β
00.402
The geometrySet β=0 and find the optimum — it sits at a corner. Set β=1 — optimum stays near πref. Typical β=0.04 in production RLHF: a balanced point. Too low, the model collapses; too high, it can't actually learn.
Chapter 12

Reward hacking — five exploits, one principle.

Anything inside a neural network can be optimized against. Including a reward model.

The DeepSeek team's choice to use only rule-based rewards was conservative but principled. Process reward models reward surface features of reasoning rather than reasoning outcomes — and once a learned reward is in the optimization loop, the policy will discover the cheapest way to maximize it.

Below: five exploits, one prompt. Toggle through to see how a learned PRM gets fooled while a rule-based grader stays accurate.

12 / Reward hacking
Setup — Question: "Is 7919 prime?" (answer: yes).
Learned PRM score
Rule-based reward
The principleAny reward signal that lives inside a neural network can, in principle, be hacked by another neural network optimizing against it. PRMs are particularly vulnerable because they reward what reasoning looks like, not what it achieves.
Chapter 13

The alignment tax — paying capability for legibility.

Left to its own devices, R1-Zero code-switched mid-trace because some thoughts are cheaper in Chinese and some in English; the published R1 forces monolingual output and pays a small but measurable accuracy bill for it.

R1-Zero spontaneously code-switched: '所以' for "therefore", English for arithmetic, Chinese for some technical terms. Not because it was confused — because some tokens are denser in one language than the other. Forcing monolingual output costs a measurable amount of accuracy.

13 / Language consistency
The discovery
R1-Zero mixed English and Chinese mid-sentence because some tokens are denser or more probable in one language. R1 punishes mixing — but doing so makes the model slightly worse at reasoning.
Sample rollout — same prompt, different λ
Language consistency weight λ 0.40
Mixed-language rate
AIME pass@1
Readability
Accuracy Readability
The published choiceThe DeepSeek team picked λ ≈ 0.4 in Stage 2. Accuracy cost was ~2 points on AIME — judged worth it for human-readable traces. An alignment tax: paying capability for legibility.
Chapter 14

R1 vs R1-Zero — side by side, same prompt.

The readability gap, made concrete.

14 / Side-by-side
R1-Zero (pure RL, no SFT)
R1 (4-stage pipeline)
PropertyR1-ZeroR1
The patternR1-Zero is information-denser but less usable. R1 is more verbose but parseable, monolingual, and structured. The four-stage pipeline trades raw reasoning compression for downstream usability.
Chapter 15

The architecture beneath — 671B total, 37B active.

R1 is built on V3-Base, a Mixture-of-Experts model. Most of its parameters sit idle on every token.

A dense 671B model would be infeasible to train with RL — the per-token compute cost of GRPO across tens of thousands of training steps would be astronomical. V3-Base is a Mixture-of-Experts, where each MoE layer has 256 small experts and a router picks 8 of them per token. Plus one shared expert that always activates.

Total parameters: 671B. Activated per token: 37B. About 5.5%.

15 / MoE architecture
The trick — a Mixture-of-Experts layer has many small FFN experts. A small router picks a few per token. V3 has 1 shared + 8 of 256 routed active per MoE layer.
Click an input token to route it
Router top-k scores for selected token
MoE layer — 1 shared + 32 (of 256) routed experts
Total params
671B
Active per token
37B
Sparsity
5.5%
Layers
61
Why this matters for R1RL training does tens of thousands of forward+backward passes. With MoE, each token's gradient flows through ~37B params, making the GRPO loop tractable. Reasoning gets baked into different experts — math, code, language — which is part of why distilling to dense models works.
Chapter 16

A new scaling axis — buying accuracy with thinking time.

Same model, same weights. Just give it a bigger thinking budget.

Classical LLMs scale by training-time compute — more parameters, more data. R1 introduces a second axis: inference-time compute. Same weights, longer thinking budget, higher accuracy. This is the qualitatively new property of reasoning models.

The shape is consistent across benchmarks: steep early gains as the model goes from "direct answer" to "step-by-step", a long climbing region as verification and alternative methods kick in, then a plateau where extra budget doesn't help.

16 / Inference scaling
Thinking budget 4,000 tk
Accuracy
Latency
Cost (rel.)
Reasoning depth at this budget
The clincherCompare to Qwen-32B without R1 distillation — barely moves with extra budget. Long thinking is useless without the right RL training to make it productive.
That reasoning ability can emerge from pure reinforcement learning on verifiable-correctness rewards — no human-written chain-of-thought traces required. The model figures out the reasoning style itself; the only feedback is whether the final answer was right. That a 671B-parameter dense model trained for 30T tokens beats GPT-4 on every benchmark. That distillation from a frontier closed-source teacher transfers reasoning to small open models. That RLHF with chain-of-thought preference labels is the most efficient path to reasoning. Because it estimates the advantage by sampling a group of completions from the same prompt and using their mean reward as the baseline — the within-group spread plays the role the critic plays in PPO. Same idea, half the moving parts. Because the reward function is rule-based and deterministic, so no value estimate is needed. Because GRPO only updates the final-token logits, so a per-step value isn't needed. Because the KL trust region eliminates the variance the critic was meant to reduce. The point where the model spontaneously starts saying things like "wait, let me reconsider" mid-trace — emergent self-verification that no one trained it to do. It appears purely as a side effect of being rewarded for correct answers when long reasoning helps. The first training step where loss drops below random baseline. The moment the model first uses LaTeX in its outputs. The point where the reward model converges with the policy and KL → 0. R1-Zero is pure RL and produces high-accuracy but barely-readable, language-mixed reasoning traces. R1 wraps the RL with two SFT stages (one before, one after) for readability and language consistency — buying ~1–3% accuracy loss in exchange for traces a human can actually read. R1 has fewer parameters but a larger context window than R1-Zero. R1 uses on-policy gradients while R1-Zero uses off-policy, so R1 is more sample-efficient. R1 was trained on multilingual data while R1-Zero was English-only. Same model weights, longer thinking budget → higher accuracy. Classical LLMs scale with training-time compute (more parameters / data); reasoning models add a second axis where you pay more tokens at inference for measurably better answers, on a curve that climbs steeply then plateaus. Quantization from FP16 to INT4 at inference time, gaining accuracy from sharper logits. Increasing the temperature at inference time to broaden the search. Running the same prompt through multiple models in parallel and majority-voting. Total parameters set the model's capacity (how much knowledge it can store); active parameters set the inference cost (how many fire per token via the router). DeepSeek-V3 has 671B total but only 37B active per token — capacity of a frontier dense model at the cost of a 37B dense one. Total parameters are FP32; active parameters are the FP16 quantized subset that actually runs. Total parameters include the embedding table; active parameters exclude it. Total parameters are the trained set; active parameters are only those gradient updates touched in the last epoch. Reward hacking: the policy drifts arbitrarily far from the SFT initialization to game the reward. Outputs become incoherent or repetitive while still scoring well on the rule-based reward. The KL penalty keeps the policy near the starting distribution so reasoning improves without the language model collapsing. The gradient norms grow unboundedly and training diverges within a few hundred steps. The model forgets all its pre-training and reverts to a uniform distribution over the vocabulary. The reward signal becomes too sparse and the model stops learning entirely. Cold-start SFT (stage 1) is a small supervised pass on a few thousand carefully-curated reasoning traces — just enough to give the RL stage a sane starting policy. Reasoning SFT (stage 3) uses ~600k high-quality reasoning traces sampled from the post-RL model itself, plus general-domain SFT data, to make the final model both reason well and stay readable / general-purpose. Stage 1 trains the tokenizer; stage 3 trains the language model. Stage 1 uses LoRA adapters; stage 3 unfreezes the full model. Stage 1 uses preference labels; stage 3 uses correctness labels.

Sixteen widgets, one paper. The shape of the argument: a paradigm shift, an algorithm, a reward, an emergence, a pipeline, a distillation. Then deeper — a simulator, a per-token view, a trust region, the failure modes. Finally — the alignment tax, the readability gap, the architecture, the new scaling axis.

The DeepSeek-R1 paper, taken apart and made manipulable.

· · ·
End of the walkthrough · Set in Fraunces & IBM Plex
"The paper is in here. Now go price the system that runs it."