ajinkya.ai · An experiment in learning with AI.
08 May 2026 18 min read

The anatomy of a frontier model — an interactive field guide

LLM Tutorial Foundations Interactive
Field guide · Seven chapters · Five live demos

A guided expedition through the inside of a GPT-5-class language model — its tokens, its attention, its scaling, its reasoning. Every concept demonstrated by something you can drag, type, and break.

This is a companion piece to the cost-an-AI-agent tutorial: that one tells you how to price a system; this one tells you what is actually happening inside the system you are pricing. Read it first if you want every "token" and "context window" and "MoE active parameter" in the cost guide to land with conviction. Forty to ninety minutes, depending on how long you stay in each demo.

Chapter I

From letters to numbers.

Before a model can think about your sentence, it has to translate it into something a machine can multiply. That translation is the tokenizer — the most under-appreciated component of the whole stack.

A neural network does not see words. It sees tensors of floating-point numbers, and the only meaningful operation it knows how to perform is multiplication. So before any of the deep magic begins, your text must be cut into pieces and each piece replaced with an integer ID — a vocabulary lookup. Those IDs are then converted into vectors by an embedding table. From that moment on, the model never sees your original characters again.

The cutting is done by a byte-pair encoder, or BPE. Imagine starting with single characters as your vocabulary, then repeatedly looking at your entire training corpus and merging the most common adjacent pair into a new symbol. Do this 100,000 times and you end up with a vocabulary that has single characters for rare cases, and long subword chunks like tion, ing, or  the for common ones.
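The merge loop described above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: it works on characters of a tiny hypothetical word list rather than on the bytes of a real corpus, and the function name `train_bpe` is made up for this example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a toy list of words (illustrative only)."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most common adjacent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one new symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["running", "jumping", "singing", "ring"], num_merges=3)
print(merges)  # the first two merges build up the chunk "ing"
```

Even on four words, the common suffix is discovered first: the pair i+n is merged, then in+g. A real tokenizer runs the same idea for tens of thousands of merges over a much larger corpus.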

This single design choice — that vocabulary is learned from data rather than handed to the model — is why GPT can handle code, emoji, foreign languages, and made-up words from a single training run. It also explains some of the model's strangest weaknesses, like its trouble counting the letters in strawberry.

Tokenize anything.

Type any sentence below. Watch how it gets sliced — common subword chunks become single tokens; unusual strings get fragmented. The number under each piece is its (simulated) vocabulary ID.

[Interactive demo readouts: tokens, characters, compression]
Try this: paste in some code, then some emoji, then some non-English text. Each gets split very differently, and that asymmetry directly affects which languages and tasks the model is good at, because compute cost scales with token count, not character count.
Chapter II

A stack of identical blocks.

The transformer's most surprising feature is its uniformity. The same block, repeated dozens of times. No special layers, no clever hierarchy. Depth alone produces intelligence.

When researchers in 2017 published the paper that started this entire era, the architecture diagram fit on a single page. There are exactly two pieces inside each block: an attention layer that lets tokens look at one another, and a feed-forward network that processes each token in isolation. Both are wrapped in a residual connection — meaning the block's output is added to its input rather than replacing it — and a normalization step that keeps the activations from blowing up.

That's the entire recipe. Everything else is repetition. A small model has 12 of these blocks. A frontier model has 60 to 120 of them stacked in a tower. The hidden state — a vector representing the meaning of the conversation so far — passes upward through the tower, getting refined at each step, until at the very top it's projected back into a probability distribution over the next token.

The interactive below lets you build your own model on paper. The parameter count grows quadratically with the hidden dimension, because both dimensions of every weight matrix inside attention and the MLP scale with it (d × d in attention, d × 4d in the MLP). This is why doubling the hidden size roughly quadruples the parameters, and why scaling these models is so expensive.

Build a model.

Adjust the dials. Watch the parameter count grow. Notice that hidden size matters far more than depth — a hint about where the FLOPs actually go.

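[Interactive demo. Dial settings shown: 4096, 48, 32, 100k (apparently hidden size d, layers, attention heads, vocabulary size). Diagram: Embedding (vocab × d) → 48 transformer blocks, each Attention → MLP → Output projection (d × vocab). Readouts: total params, per layer, embedding, head dim.]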
The 12d² rule: each transformer block costs about 12·d² parameters: 4·d² for attention's four projection matrices, 8·d² for the MLP at standard 4× expansion. Multiply by depth, add the embedding, and you have the model.
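As a rough check of the 12·d² rule, here is a minimal sketch that counts parameters for a given configuration. It ignores biases and normalization weights, and it assumes an untied output projection unless told otherwise; the function name and the `tied_embeddings` flag are illustrative, not taken from any library.

```python
def transformer_params(d_model, n_layers, vocab_size, tied_embeddings=False):
    """Approximate parameter count using the 12*d^2-per-block rule."""
    attention = 4 * d_model ** 2        # Q, K, V and output projections
    mlp = 8 * d_model ** 2              # d -> 4d and 4d -> d at 4x expansion
    per_layer = attention + mlp         # about 12 * d^2
    embedding = vocab_size * d_model    # token embedding table
    # Assumption: the output projection is a separate matrix unless tied.
    output_proj = 0 if tied_embeddings else vocab_size * d_model
    total = n_layers * per_layer + embedding + output_proj
    return {"per_layer": per_layer, "embedding": embedding, "total": total}

result = transformer_params(d_model=4096, n_layers=48, vocab_size=100_000)
print(f"per layer: {result['per_layer']:,}")   # 201,326,592
print(f"embedding: {result['embedding']:,}")   # 409,600,000
print(f"total:     {result['total']:,}")       # 10,482,876,416
```

With a hidden size of 4096, 48 layers and a 100k vocabulary, the blocks account for about 9.7 billion of the roughly 10.5 billion parameters, which is why the hidden size dominates the total.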
Chapter III

The act of looking back.

If the transformer has a heart, this is it. Attention is what allows the word "it" to know which earlier noun it refers to. It's also the only operation in the entire model where tokens talk to each other.

Consider the sentence: "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to: the trophy or the suitcase? You and I know instantly. The challenge is teaching a machine that knows nothing except how to multiply matrices.

The answer is brilliantly mechanical. Each token computes three vectors from its hidden state: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what do I actually contain?). To figure out what "it" means, we take its query and dot-product it with the key of every token up to and including itself. The dot products go through a softmax to become probabilities; these are the attention weights. Then we take a weighted sum of the values. That's attention.

The grid below shows real attention weights between six tokens of a sentence. Each row is a query token; each column is a key it can see. The upper-right triangle is greyed out — a token can never look at the future, only the past. This is called causal masking, and it's what makes the model auto-regressive.

Watch tokens look at each other.

Drag the temperature slider. At low temperature, attention sharpens — each token focuses hard on one or two predecessors. At high temperature, attention softens and spreads. Real models use temperature 1; this control lets you see what the softmax is actually doing.

[Interactive attention grid, temperature 1.00. Tokens: The, trophy, didn't, fit, because, it]
Notice the bottom row: the token "it" attends most strongly to "trophy". The model has learned, purely from text statistics, that pronouns refer to the noun they replace. Multiply this single mechanism by 32 attention heads and 60 layers, and emergent reasoning starts to appear.

The formula, written in full, is shockingly simple:

Attention(Q, K, V)  =  softmax(QKᵀ / √d) · V

That's the whole thing. The square root divisor keeps the dot products from blowing up as the dimension grows; the softmax turns them into probabilities; the product with V mixes the values according to those probabilities. Everything else — multi-head attention, rotary position embeddings, flash attention — is engineering on top of this one equation.
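The equation above, together with the causal mask, can be written out for a single head in plain Python. This is a sketch for reading, not for speed: the tiny Q, K and V matrices are made-up values, and the `temperature` argument is an addition for illustration that mirrors the demo's slider rather than anything a real model exposes.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V, temperature=1.0):
    """Single-head softmax(Q K^T / sqrt(d)) V with a causal mask."""
    d = len(Q[0])
    n = len(Q)
    outputs, weights = [], []
    for t, q in enumerate(Q):
        # Causal mask: token t only scores positions 0..t, never the future.
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[j])) / (math.sqrt(d) * temperature)
            for j in range(t + 1)
        ]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(t + 1))
                        for c in range(len(V[0]))])
        weights.append(w + [0.0] * (n - t - 1))   # masked positions get weight 0
    return outputs, weights

# Hypothetical 3-token, 2-dimensional example.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, w = causal_attention(Q, K, V)
sharp_out, sharp_w = causal_attention(Q, K, V, temperature=0.1)
for row in w:
    print([round(x, 3) for x in row])
```

Each row of `w` sums to one and is zero to the right of the diagonal, which is the greyed-out triangle in the grid. Lowering `temperature` concentrates each row on its highest-scoring key.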

Scaling, more than any single algorithmic insight,
is what made the modern era of AI possible.
Chapter IV

The scaling laws.

If you have a fixed compute budget, how big should your model be? How many tokens should it see? In 2022, a paper from DeepMind answered that question with a remarkably clean formula. The answer reshaped the entire industry.

The Chinchilla paper showed something almost embarrassingly simple: most models trained before 2022 were too big and undertrained. Given a fixed compute budget, the optimal recipe was to train a smaller model on more data. The magic number, roughly, was twenty tokens per parameter.

This rewrote the playbook. The largest model in the world is no longer automatically the best — it depends on how much data you fed it. And there's a second wrinkle: at deployment time, smaller models cost less to run. So labs now over-train deliberately, going past Chinchilla's compute-optimal point. They burn more training compute to get a model that's cheaper to serve. With hundreds of millions of users, this trade is overwhelmingly worth it.

Drag the dials below to see the trade. The compute requirement is the simple Kaplan formula: about 6 · N · D floating-point operations, where N is parameters and D is tokens. The factor of six comes from counting multiply-adds per parameter per token: one for the forward pass and two for the backward pass, at two floating-point operations each.
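The 6 · N · D estimate and the twenty-tokens-per-parameter rule of thumb combine into a short calculator. The hardware numbers here are assumptions chosen only for illustration: a peak of about 1e15 FLOP/s per H100 and 40% sustained utilization. Real runs vary, so treat the GPU-year figure as an order-of-magnitude sketch.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600

def training_flops(n_params, n_tokens):
    """Approximate training compute: 6 * N * D."""
    return 6 * n_params * n_tokens

def h100_years(flops, peak_flops_per_s=1e15, utilization=0.4):
    # Assumed values: ~1e15 FLOP/s peak per GPU, 40% sustained utilization.
    return flops / (peak_flops_per_s * utilization * SECONDS_PER_YEAR)

def compute_optimal(flops, tokens_per_param=20):
    """Solve C = 6*N*D with D = r*N, so N = sqrt(C / (6*r))."""
    n_params = math.sqrt(flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n_params = 500e9                       # 500B parameters
n_tokens = 20 * n_params               # 20 tokens per parameter
flops = training_flops(n_params, n_tokens)
print(f"tokens:     {n_tokens:.2e}")
print(f"FLOPs:      {flops:.2e}")
print(f"H100-years: {h100_years(flops):,.0f}")

# Going the other way: the budget above implies this compute-optimal size.
opt_params, opt_tokens = compute_optimal(flops)
print(f"optimal N:  {opt_params:.2e}, optimal D: {opt_tokens:.2e}")
```

Under these assumptions a 500B-parameter model at 20 tokens per parameter needs about 3e25 FLOPs, which is on the order of a couple of thousand H100-years.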

The compute economy.

Set the parameter count and tokens-per-parameter. The system calculates total training compute and translates it into "H100 GPU-years" — the actual currency of frontier AI.

[Interactive demo. Dial settings shown: 500B parameters, 20 tokens per parameter. Readouts: training tokens, FLOPs, H100 years, regime.]
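Frame of reference: Llama 3 was trained on about 15 trillion tokens at 405 billion parameters, which is 37 tokens per parameter, well past Chinchilla. GPT-4 is rumored to have used a similar over-training factor. DeepSeek-V3 used about 22 tokens per total parameter (14.8 trillion tokens over 671 billion parameters), but since only 37 billion of those parameters are active per token, that is roughly 400 tokens per active parameter. The strategy is now industry standard.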
Chapter V

A thousand machines, one mind.

A frontier model cannot fit on a single GPU. It cannot even fit on a single server. The actual training is done by orchestrating thousands of machines into a kind of synchronized choir — every voice contributing, none allowed to drift out of tune.

The simplest form of distribution is data parallelism: every GPU has a full copy of the model, and each one processes a different slice of the batch. After every step, all GPUs share their gradients and average them. This works wonderfully — until the model itself is too big to fit on one GPU. Then you need to split the model itself.
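Why gradient averaging works can be shown on a deliberately tiny problem. The sketch below uses a one-parameter linear model with a mean-squared-error loss, simulates the workers as list slices, and assumes the batch splits evenly; all names are illustrative.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, n_workers):
    """Each simulated worker computes a gradient on its shard; then average."""
    shard = len(xs) // n_workers          # assumes an even split
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    # Stand-in for the all-reduce: every worker ends up with the mean.
    return sum(local_grads) / n_workers

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]                # data generated by y = 3x
w = 0.5

full = grad_mse(w, xs, ys)
sharded = data_parallel_grad(w, xs, ys, n_workers=4)
print(full, sharded)
assert math.isclose(full, sharded)
```

With equal-sized shards, the average of the per-worker gradients equals the gradient over the whole batch, so every replica can apply the same update and stay in sync.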

There are two ways to do this. Pipeline parallelism puts different layers on different GPUs, and activations flow through the chain like an assembly line. Tensor parallelism splits a single matrix multiplication across GPUs, and they communicate constantly. In practice, frontier labs use all three at once — a technique called 3D parallelism.

The four ways to split.

Click each card to highlight which GPUs do what. Notice the very different communication patterns — that's why these are combined: they each scale along a different axis.

Data parallel

Every GPU has the full model. Each processes a different batch shard. They average gradients after every step. Simple, but breaks down when the model itself is too big.

Pipeline parallel

Different layers live on different GPUs. Activations flow through. The "bubble" — idle time at the start and end of the pipeline — must be amortized over many micro-batches.

Tensor parallel

Split a single matmul across GPUs. Heavy communication after every operation. Kept inside one server because the network can't keep up between machines.

FSDP / ZeRO

Shard the parameters themselves across GPUs. Each GPU only stores its slice, gathering others on demand. Memory savings without pipeline complexity.

A 16-GPU cluster — selected mode highlighted

Pipeline parallelism has a famous quirk: the bubble. In the diagram below, each row is one GPU, each column is a time step. When the pipeline starts up, GPUs at the end of the chain have nothing to do. When it shuts down, GPUs at the start are idle. The fix is to flood the pipeline with many micro-batches, hiding the bubble in the middle.
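The size of the bubble can be estimated with a short calculation. This sketch assumes the simplest fill-and-drain schedule, where every stage takes the same time per micro-batch; under that assumption the idle fraction is (p − 1) / (m + p − 1) for p stages and m micro-batches. The helper that prints the forward pass as a grid is purely illustrative.

```python
def bubble_fraction(n_stages, n_microbatches):
    """Idle fraction for a simple fill-and-drain pipeline schedule."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

def forward_schedule(n_stages, n_microbatches):
    """One row per GPU, one column per time step; '.' marks idle time.

    Micro-batch indices are printed as single digits, so keep
    n_microbatches at 10 or fewer for a readable grid.
    """
    steps = n_stages + n_microbatches - 1
    rows = []
    for stage in range(n_stages):
        row = ""
        for t in range(steps):
            batch = t - stage             # micro-batch reaching this stage at time t
            row += str(batch) if 0 <= batch < n_microbatches else "."
        rows.append(row)
    return rows

for row in forward_schedule(n_stages=3, n_microbatches=4):
    print(row)

for m in (1, 4, 32):
    print(m, round(bubble_fraction(4, m), 3))
```

With four stages, a single micro-batch leaves the pipeline idle 75% of the time; thirty-two micro-batches bring that below 9%.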

The hidden cost: beyond the math, a frontier training run lasts months. GPUs fail. Networks have hiccups. Loss spikes mysteriously. A huge fraction of "training a model" is actually building checkpointing, monitoring, and recovery infrastructure that lets you survive these inevitable failures without losing weeks of progress.
Chapter VI

Teaching it to be helpful.

A pre-trained language model is a strange creature. It can complete any text on the internet — including text full of falsehoods, hostility, and irrelevance. It is not yet an assistant. It is barely yet a conversation partner. The next phase is what makes it usable.

Imagine that pre-training has produced a model that has read most of the internet and can predict what would come next in any document. Ask it a question and it might answer — but it might just as easily produce another question, or a list of advertisements, or a poem in the style of the question. It's a text predictor, not yet an agent.

Turning it into an assistant takes a sequence of carefully designed steps. Each one nudges the model toward the kind of behavior you want. None of it requires more raw scale. All of it requires an enormous amount of human judgment, and in the latest models, an enormous amount of automated judgment too.

  1. Supervised fine-tuning

    Show the model thousands of carefully written conversations between a helpful assistant and a user. The format is specific: a system prompt sets the role, the user asks something, the assistant responds. The model now knows what an assistant looks like.

  2. Reward modeling

    Generate two responses to the same prompt. Show them to a human, who picks the better one. Train a separate small model to predict which response a human would prefer. This reward model is now a stand-in for human judgment, scalable to millions of comparisons.

  3. Reinforcement learning

    The model proposes responses; the reward model scores them; the policy is updated to produce higher-scoring outputs. A KL-divergence penalty keeps the model from drifting too far from its starting point. This is RLHF — Reinforcement Learning from Human Feedback.

  4. Reasoning RL

    The newest stage, and the basis for o1, R1, and GPT-5's thinking mode. Apply RL not for human preference but for verifiable correctness — math problems with known answers, code that must pass tests. The model learns to produce long internal chains of thought before answering.
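The reward signals in stages 2 to 4 can be sketched as three small functions. These are common textbook formulations, offered as assumptions rather than as any lab's actual recipe: a Bradley-Terry style pairwise loss for the reward model, a per-sample KL-style penalty subtracted from the reward, and an exact-match check standing in for a verifier. The function names and the `beta` default are illustrative.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_shaped_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus a penalty for drifting away from the reference model.

    logp_policy and logp_reference are log-probabilities of the same
    response under the current policy and the starting model.
    """
    return reward - beta * (logp_policy - logp_reference)

def verifiable_reward(model_answer, reference_answer):
    """Reasoning RL style reward: 1 if the final answer checks out, else 0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# The loss is small when the reward model already ranks the pair correctly.
print(preference_loss(2.0, 0.0))    # chosen scored higher: low loss
print(preference_loss(0.0, 2.0))    # chosen scored lower: high loss

# A response the policy now likes much more than the reference is penalized.
print(kl_shaped_reward(1.0, logp_policy=-5.0, logp_reference=-9.0))

print(verifiable_reward(" 42", "42"))
```

The contrast between the last function and the first is the contrast between stage 4 and stage 2: one reward is learned from human comparisons, the other is checked against a known answer.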

The fourth stage is what makes a 2025-era model feel different from a 2023-era one. The base architecture is similar; the post-training is dramatically more sophisticated. When you ask GPT-5 a hard math question and it pauses to "think," what you're seeing is the artifact of reasoning RL: a long internal monologue, hidden from view, where the model explores options, catches its own mistakes, and converges on an answer. A growing share of the compute has shifted from train-time to test-time.

Chapter VII

The moment of speech.

After all the training, all the alignment, all the careful scaling — the model finally faces a real user. Each token it produces is a sample from a probability distribution. How that sample is taken decides almost everything about how the model feels.

At the very top of the network, after all those layers of attention and MLPs, the final hidden state is multiplied by the output projection to produce a vector of logits — one number per token in the vocabulary. A softmax turns those into a probability distribution. And then — here is the entire art of inference — we have to pick one.

If we always pick the most likely token, the output is repetitive and boring. If we sample uniformly at random, it's incoherent. The standard approach is a tunable middle: temperature reshapes the distribution (low = sharper, high = flatter), and top-p sampling truncates the tail by keeping only enough tokens to cover probability mass p. Move the controls below to see the distribution change shape in real time.
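Temperature and top-p can be combined in a few lines of standard-library Python. The candidate words and their logits below are invented for illustration, and treating a temperature of zero as greedy decoding is a convention adopted here to avoid dividing by zero.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax of logits / T; T <= 0 is treated as greedy (one-hot argmax)."""
    if temperature <= 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the most likely tokens until their mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    probs = apply_temperature(logits, temperature)
    filtered = top_p_filter(probs, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

# Hypothetical candidates for "The cat sat on the ___" with made-up logits.
words = ["mat", "floor", "sofa", "roof", "moon"]
logits = [3.0, 2.0, 1.5, 0.5, -1.0]

print(words[sample(logits, temperature=0.0)])          # always "mat"
print(words[sample(logits, temperature=1.0, top_p=0.9)])
print(sorted(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.7)))
```

Note the order of operations: temperature reshapes the distribution first, and top-p then decides how much of the reshaped tail survives.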

The final distribution.

For the prompt "The cat sat on the ___" the model considers many candidates. Drag the controls. Greyed bars are filtered out by top-p. Notice how temperature near zero collapses everything to the most likely word, while high temperature spreads probability across the whole tail.

[Interactive demo. Controls shown: temperature 1.00, top-p 1.00.]
Temperature in the wild: for factual tasks, run with temperature 0 to get the model's best guess. For creative writing, push to 1.0 or higher for more variety. The same model, with the same weights, behaves like a different system depending on this single number. That's how much the sampling step matters.

The optimizations that make it fast.

Generating each token from scratch would mean re-running the whole network on the entire conversation every time. That's quadratic in sequence length and would be unusably slow. Instead, frontier inference uses three tricks:

  • KV cache. Store the keys and values of every previous token so each new token attends to a cache instead of recomputing. The attention work per new token becomes linear in sequence length instead of quadratic.
  • Speculative decoding. A small "draft" model proposes K tokens; the big model verifies them in a single forward pass. When the draft is right (often it is), you get up to K tokens for roughly the cost of one big-model pass, plus the cheap draft passes.
  • Continuous batching. Concurrent users get stitched into a single matmul. Rather than waiting for one user's response to finish before starting the next, the server fluidly mixes them token-by-token.
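The saving from the KV cache can be made concrete by counting how many token positions the network has to process during generation. This is a bookkeeping sketch under a simplifying assumption: one unit of work per token position per step, ignoring the attention reads themselves. The function names are illustrative.

```python
def positions_without_cache(n_prompt, n_new):
    """Every step re-runs the network over the whole sequence so far."""
    # Step t (1-indexed) sees the prompt plus the t-1 tokens already generated.
    return sum(n_prompt + t - 1 for t in range(1, n_new + 1))

def positions_with_cache(n_prompt, n_new):
    """Prefill the prompt once, then process only the newest token per step."""
    return n_prompt + (n_new - 1)

n_prompt, n_new = 100, 100
print(positions_without_cache(n_prompt, n_new))  # 14950
print(positions_with_cache(n_prompt, n_new))     # 199
```

For a 100-token prompt and 100 generated tokens, the cache cuts the positions processed from 14,950 to 199. The price is memory: the cached keys and values grow with every token in the conversation.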

This is also where the famous mixture-of-experts trick pays off. In an MoE model, only a small fraction of the parameters fire for each token — DeepSeek-V3 has 671 billion total parameters but only activates 37 billion per token. The total parameter count gives the model capacity; the active count determines the cost. Decoupling them is one of the most important architectural ideas of the last three years.
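How "only a fraction of the parameters fire" works can be sketched with a generic top-k router. This is not DeepSeek-V3's actual routing scheme; it is a minimal illustration in which the router logits are passed in directly (a real model computes them from the token's hidden state with a learned layer) and each "expert" is a toy function.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    z = sum(exps.values())
    return {e: v / z for e, v in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for e, gate in gates.items():
        y = experts[e](x)                 # the other experts are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, gates

# Four toy experts that just scale their input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 1.0, 3.0]      # hypothetical scores for one token

out, gates = moe_layer([1.0, 1.0], experts, router_logits, k=2)
print(gates)
print(out)

# Active versus total parameters for the figures quoted above.
print(f"active fraction: {37 / 671:.1%}")
```

Only two of the four experts run for this token, yet all four contribute capacity across a whole batch. The same decoupling, at scale, is why about 5.5% of DeepSeek-V3's parameters are active for any single token.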

End of the field guide · Set in Fraunces & JetBrains Mono
"Now that you have seen the parts, the next step is to build something."