A guided expedition through the inside of a GPT-5-class language model — its tokens, its attention, its scaling, its reasoning. Every concept demonstrated by something you can drag, type, and break.
This is a companion piece to the cost-an-AI-agent tutorial: that one tells you how to price a system; this one tells you what is actually happening inside the system you are pricing. Read this one first if you want every "token" and "context window" and "MoE active parameter" in the cost guide to land with conviction. Forty to ninety minutes, depending on how long you stay in each demo.
From letters to numbers.
Before a model can think about your sentence, it has to translate it into something a machine can multiply. That translation is the tokenizer — the most under-appreciated component of the whole stack.
A neural network does not see words. It sees tensors of floating-point numbers, and the only meaningful operation it knows how to perform is multiplication. So before any of the deep magic begins, your text must be cut into pieces and each piece replaced with an integer ID — a vocabulary lookup. Those IDs are then converted into vectors by an embedding table. From that moment on, the model never sees your original characters again.
The cutting is done by a byte-pair encoder, or BPE. Imagine starting with single characters as your vocabulary, then repeatedly looking at your entire training corpus and merging the most common adjacent pair into a new symbol. Do this 100,000 times and you end up with a vocabulary that has single characters for rare cases, and long subword chunks like tion, ing, or the for common ones.
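The merge loop described above can be sketched in a few lines. This is a toy version under stated assumptions: it works on characters of whole words rather than bytes, and the names `learn_bpe`, `merge_pair`, and `tokenize` are hypothetical helpers, not any real tokenizer's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy corpus)."""
    # Every word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often its word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused.
        rewritten = Counter()
        for symbols, freq in words.items():
            rewritten[merge_pair(symbols, best)] += freq
        words = rewritten
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["abab", "abab", "abc"], num_merges=2)
print(merges)
print(tokenize("abab", merges), tokenize("abc", merges))
```

A frequent string ends up as one symbol while a rarer one stays in pieces, which is the behavior the paragraph describes at a much larger scale.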
This single design choice — that vocabulary is learned from data rather than handed to the model — is why GPT can handle code, emoji, foreign languages, and made-up words from a single training run. It also explains some of the model's strangest weaknesses, like its trouble counting the letters in strawberry.
Tokenize anything.
Type any sentence below. Watch how it gets sliced — common subword chunks become single tokens; unusual strings get fragmented. The number under each piece is its (simulated) vocabulary ID.
A stack of identical blocks.
The transformer's most surprising feature is its uniformity. The same block, repeated dozens of times. No special layers, no clever hierarchy. Depth alone produces intelligence.
When researchers in 2017 published the paper that started this entire era, the architecture diagram fit on a single page. There are exactly two pieces inside each block: an attention layer that lets tokens look at one another, and a feed-forward network that processes each token in isolation. Both are wrapped in a residual connection — meaning the block's output is added to its input rather than replacing it — and a normalization step that keeps the activations from blowing up.
That's the entire recipe. Everything else is repetition. A small model has 12 of these blocks. A frontier model has 60 to 120 of them stacked in a tower. The hidden state — a vector representing the meaning of the conversation so far — passes upward through the tower, getting refined at each step, until at the very top it's projected back into a probability distribution over the next token.
The interactive below lets you build your own model on paper. The parameter count grows quadratically with the hidden dimension, because both dimensions of the weight matrices inside attention and the MLP scale with the hidden size. This is why doubling the hidden size roughly quadruples the parameters, and part of why scaling these models is so expensive.
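A back-of-the-envelope count makes the quadratic growth concrete. The sketch assumes a common layout that the text does not specify: four hidden-by-hidden attention matrices (query, key, value, output), an MLP that expands to four times the hidden size and back, and no biases or normalization parameters. The function name and the example dimensions are illustrative.

```python
def transformer_params(n_layers, d_model, vocab_size, mlp_ratio=4):
    """Approximate parameter count under the assumptions in the lead-in."""
    attention = 4 * d_model * d_model            # Q, K, V and output projections
    mlp = 2 * d_model * (mlp_ratio * d_model)    # up-projection and down-projection
    embedding = vocab_size * d_model             # token embedding table
    return n_layers * (attention + mlp) + embedding


small = transformer_params(n_layers=12, d_model=768, vocab_size=50257)
print(f"{small:,} parameters")

# Doubling the hidden size quadruples the block parameters (embedding excluded).
base = transformer_params(12, 768, vocab_size=0)
wide = transformer_params(12, 1536, vocab_size=0)
print(wide / base)
```

Depth enters the formula linearly and hidden size enters squared, which is why the hidden-size dial moves the total so much faster.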
Build a model.
Adjust the dials. Watch the parameter count grow. Notice that hidden size matters far more than depth — a hint about where the FLOPs actually go.
The act of looking back.
If the transformer has a heart, this is it. Attention is what allows the word "it" to know which earlier noun it refers to. It's also the only operation in the entire model where tokens talk to each other.
Consider the sentence: The trophy didn't fit in the suitcase because it was too big. What does it refer to — the trophy or the suitcase? You and I know instantly. The challenge is teaching a machine that knows nothing except how to multiply matrices.
The answer is brilliantly mechanical. Each token computes three vectors from its hidden state: a Query (what am I looking for?), a Key (what do I offer?), and a Value (what do I actually contain?). To figure out what "it" means, we take its query and dot-product it with the key of every token up to and including itself. The dot products go through a softmax to become probabilities; these are the attention weights. Then we take a weighted sum of the values. That's attention.
The grid below shows real attention weights between six tokens of a sentence. Each row is a query token; each column is a key it can see. The upper-right triangle is greyed out — a token can never look at the future, only the past. This is called causal masking, and it's what makes the model auto-regressive.
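The query, key, value, softmax, and causal-mask steps can be sketched with NumPy for a single attention head. The weight matrices here are random placeholders and `causal_attention` is a hypothetical name; the scores are divided by the square root of the key dimension, which is the usual scaling.

```python
import numpy as np


def causal_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (tokens, hidden) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Causal mask: row i may only see columns 0..i, never the future.
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax turns scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted sum of the value vectors.
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                      # six tokens, hidden size 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = causal_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))                      # lower-triangular grid of weights
```

The printed grid has the same shape as the one described above: each row sums to one, and everything above the diagonal is zero.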
Watch tokens look at each other.
Drag the temperature slider. At low temperature, attention sharpens — each token focuses hard on one or two predecessors. At high temperature, attention softens and spreads. Real models use temperature 1; this control lets you see what the softmax is actually doing.
[Attention grid: rows are query tokens and columns are key tokens, over "The", "trophy", "didn't", "fit", "because", "it".]
The formula, written in full, is shockingly simple:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
That's the whole thing. The square root divisor keeps the dot products from blowing up as the dimension grows; the softmax turns them into probabilities; the product with V mixes the values according to those probabilities. Everything else — multi-head attention, rotary position embeddings, flash attention — is engineering on top of this one equation.
That one equation is what made the modern era of AI possible.
The scaling laws.
If you have a fixed compute budget, how big should your model be? How many tokens should it see? In 2022, a paper from DeepMind answered that question with a remarkably clean formula. The answer reshaped the entire industry.
The Chinchilla paper showed something almost embarrassingly simple: most models trained before 2022 were too big and undertrained. Given a fixed compute budget, the optimal recipe was to train a smaller model on more data. The magic number, roughly, was twenty tokens per parameter.
This rewrote the playbook. The largest model in the world is no longer automatically the best — it depends on how much data you fed it. And there's a second wrinkle: at deployment time, smaller models cost less to run. So labs now over-train deliberately, going past Chinchilla's compute-optimal point. They burn more training compute to get a model that's cheaper to serve. With hundreds of millions of users, this trade is overwhelmingly worth it.
Drag the dials below to see the trade. The compute requirement follows the simple Kaplan formula: about 6 · N · D floating-point operations, where N is parameters and D is tokens. The factor of six comes from counting each multiply-add as two floating-point operations: one multiply-add per parameter per token in the forward pass, and two in the backward pass.
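The arithmetic is short enough to run by hand. In the sketch below, the 6 · N · D rule and the twenty-tokens-per-parameter ratio come from the text; the hardware figures are assumptions for illustration only: a peak of about 1e15 FLOP/s per H100 and 40% sustained utilization. Both function names are hypothetical.

```python
def training_flops(n_params, tokens_per_param=20):
    """Total training compute from the 6 * N * D rule of thumb."""
    n_tokens = tokens_per_param * n_params
    return 6 * n_params * n_tokens


def h100_gpu_years(flops, peak_flops_per_sec=1e15, utilization=0.4):
    """Convert FLOPs to GPU-years. Peak rate and utilization are assumed values."""
    seconds = flops / (peak_flops_per_sec * utilization)
    return seconds / (365 * 24 * 3600)


flops = training_flops(70e9)            # 70B parameters, 1.4T tokens
print(f"{flops:.2e} FLOPs")             # about 5.88e23
print(f"{h100_gpu_years(flops):.0f} H100 GPU-years under the assumed throughput")
```

Raising `tokens_per_param` past 20 models the deliberate over-training described earlier: compute grows linearly with the ratio while the model being served stays the same size.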
The compute economy.
Set the parameter count and tokens-per-parameter. The system calculates total training compute and translates it into "H100 GPU-years" — the actual currency of frontier AI.
A thousand machines, one mind.
A frontier model cannot fit on a single GPU. It cannot even fit on a single server. The actual training is done by orchestrating thousands of machines into a kind of synchronized choir — every voice contributing, none allowed to drift out of tune.
The simplest form of distribution is data parallelism: every GPU has a full copy of the model, and each one processes a different slice of the batch. After every step, all GPUs share their gradients and average them. This works wonderfully — until the model itself is too big to fit on one GPU. Then you need to split the model itself.
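Why averaging works can be checked numerically: with equal-sized shards, the mean of the per-GPU gradients equals the gradient of the whole batch. The sketch below simulates this on one machine with a linear model and mean squared error, both chosen only for illustration. The list comprehension stands in for the per-GPU work and `np.mean` stands in for the all-reduce.

```python
import numpy as np


def mse_gradient(w, x, y):
    """Gradient of mean squared error for the linear model x @ w."""
    residual = x @ w - y
    return 2 * x.T @ residual / len(y)


def data_parallel_gradient(w, x, y, n_gpus):
    """Simulate data parallelism: shard the batch, then average the gradients."""
    x_shards = np.split(x, n_gpus)      # equal-sized shards, one per simulated GPU
    y_shards = np.split(y, n_gpus)
    local = [mse_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    return np.mean(local, axis=0)       # stands in for the all-reduce average


rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
print(np.allclose(data_parallel_gradient(w, x, y, n_gpus=4), mse_gradient(w, x, y)))
```

Every simulated GPU still needs the full weight vector `w`, which is exactly the limitation the paragraph goes on to describe.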
There are two ways to do this. Pipeline parallelism puts different layers on different GPUs, and activations flow through the chain like an assembly line. Tensor parallelism splits a single matrix multiplication across GPUs, and they communicate constantly. In practice, frontier labs combine data, pipeline, and tensor parallelism at once, a technique called 3D parallelism.
The four ways to split.
Click each card to highlight which GPUs do what. Notice the very different communication patterns — that's why these are combined: they each scale along a different axis.
Data parallel
Every GPU has the full model. Each processes a different batch shard. They average gradients after every step. Simple, but breaks down when the model itself is too big.
Pipeline parallel
Different layers live on different GPUs. Activations flow through. The "bubble" — idle time at the start and end of the pipeline — must be amortized over many micro-batches.
Tensor parallel
Split a single matmul across GPUs. Heavy communication after every operation. Kept inside one server because the network can't keep up between machines.
FSDP / ZeRO
Shard the parameters themselves across GPUs. Each GPU only stores its slice, gathering others on demand. Memory savings without pipeline complexity.
A 16-GPU cluster — selected mode highlighted
Pipeline parallelism has a famous quirk: the bubble. In the diagram below, each row is one GPU, each column is a time step. When the pipeline starts up, GPUs at the end of the chain have nothing to do. When it drains, GPUs at the start are idle. The fix is to flood the pipeline with many micro-batches, so the idle time at the two ends becomes a small fraction of the whole run.
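The size of the bubble can be estimated with a simple schedule model. The sketch assumes the simplest pipeline schedule, in which every stage takes one time step per micro-batch and only the forward pass is drawn; real schedules interleave forward and backward work. Both function names are hypothetical.

```python
def forward_schedule(n_stages, n_microbatches):
    """Grid of rows (GPUs) by columns (time steps); None marks an idle slot."""
    n_steps = n_microbatches + n_stages - 1
    grid = []
    for stage in range(n_stages):
        row = []
        for t in range(n_steps):
            microbatch = t - stage      # stage s receives micro-batch m at step m + s
            row.append(microbatch if 0 <= microbatch < n_microbatches else None)
        grid.append(row)
    return grid


def bubble_fraction(n_stages, n_microbatches):
    """Share of GPU time slots spent idle under the simple schedule above."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


for row in forward_schedule(n_stages=4, n_microbatches=8):
    print(" ".join("." if slot is None else str(slot) for slot in row))

print(bubble_fraction(4, 1))    # 0.75: a single micro-batch leaves most slots idle
print(bubble_fraction(4, 29))   # 0.09375: many micro-batches amortize the bubble
```

The idle slots form a triangle at each end of the grid, and their share shrinks as the number of micro-batches grows relative to the number of stages.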
Teaching it to be helpful.
A pre-trained language model is a strange creature. It can complete any text on the internet — including text full of falsehoods, hostility, and irrelevance. It is not yet an assistant. It is barely yet a conversation partner. The next phase is what makes it usable.
Imagine that pre-training has produced a model that has read most of the internet and can predict what would come next in any document. Ask it a question and it might answer — but it might just as easily produce another question, or a list of advertisements, or a poem in the style of the question. It's a text predictor, not yet an agent.
Turning it into an assistant takes a sequence of carefully designed steps. Each one nudges the model toward the kind of behavior you want. None of it requires more raw scale. All of it requires an enormous amount of human judgment, and in the latest models, an enormous amount of automated judgment too.
- Supervised fine-tuning. Show the model thousands of carefully written conversations between a helpful assistant and a user. The format is specific: a system prompt sets the role, the user asks something, the assistant responds. The model now knows what an assistant looks like.
- Reward modeling. Generate two responses to the same prompt. Show them to a human, who picks the better one. Train a separate small model to predict which response a human would prefer. This reward model is now a stand-in for human judgment, scalable to millions of comparisons.
- Reinforcement learning. The model proposes responses; the reward model scores them; the policy is updated to produce higher-scoring outputs. A KL-divergence penalty keeps the model from drifting too far from its starting point. This is RLHF (Reinforcement Learning from Human Feedback).
- Reasoning RL. The newest stage, and the basis for o1, R1, and GPT-5's thinking mode. Apply RL not for human preference but for verifiable correctness: math problems with known answers, code that must pass tests. The model learns to produce long internal chains of thought before answering.
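Two of the quantities in these stages reduce to one-line formulas. The sketch below assumes a pairwise logistic (Bradley-Terry style) loss for the reward model and a per-sample log-probability gap as the KL estimate. These are common choices, but the text does not say which formulation any particular lab uses, and the function names and `beta` value are illustrative.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model."""
    return reward - beta * (logprob_policy - logprob_reference)


print(preference_loss(2.0, 0.0))    # reward model agrees with the human: low loss
print(preference_loss(0.0, 2.0))    # reward model disagrees: high loss
print(kl_penalized_reward(1.0, logprob_policy=-0.5, logprob_reference=-1.0))
```

For reasoning RL the learned `reward` would be replaced by a verifiable check, such as whether the final answer matches or the tests pass.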
The fourth stage is what makes a 2025-era model feel different from a 2023-era one. The base architecture is similar; the post-training is dramatically more sophisticated. When you ask GPT-5 a hard math question and it pauses to "think," what you're seeing is the artifact of reasoning RL: a long internal monologue, hidden from view, where the model explores options, catches its own mistakes, and converges on an answer. The compute has shifted from train-time to test-time.
The moment of speech.
After all the training, all the alignment, all the careful scaling — the model finally faces a real user. Each token it produces is a sample from a probability distribution. How that sample is taken decides almost everything about how the model feels.
At the very top of the network, after all those layers of attention and MLPs, the final hidden state is multiplied by the output projection to produce a vector of logits — one number per token in the vocabulary. A softmax turns those into a probability distribution. And then — here is the entire art of inference — we have to pick one.
If we always pick the most likely token, the output is repetitive and boring. If we sample uniformly at random, it's incoherent. The standard approach is a tunable middle: temperature reshapes the distribution (low = sharper, high = flatter), and top-p sampling truncates the tail by keeping only enough tokens to cover probability mass p. Move the controls below to see the distribution change shape in real time.
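Temperature and top-p fit in a short NumPy sketch. The logits are made up, the function names are hypothetical, and the code assumes `temperature` is strictly positive. Production samplers usually treat temperature zero as a special case that picks the most likely token.

```python
import numpy as np


def next_token_distribution(logits, temperature=1.0, top_p=1.0):
    """Apply temperature, then keep the smallest set of tokens covering top_p."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()               # stabilize the softmax
    probs = np.exp(scaled)
    probs = probs / probs.sum()

    order = np.argsort(-probs)                   # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalize the survivors


def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Draw one token index from the reshaped distribution."""
    rng = rng or np.random.default_rng()
    probs = next_token_distribution(logits, temperature, top_p)
    return int(rng.choice(len(probs), p=probs))


logits = [2.0, 1.0, 0.0]
print(next_token_distribution(logits, temperature=0.01))   # nearly one-hot
print(next_token_distribution(logits, temperature=100))    # nearly uniform
print(next_token_distribution(logits, top_p=0.9))          # least likely token dropped
```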
The final distribution.
For the prompt "The cat sat on the ___" the model considers many candidates. Drag the controls. Greyed bars are filtered out by top-p. Notice how temperature near zero collapses everything to the most likely word, while high temperature spreads probability across the whole tail.
The optimizations that make it fast.
Generating each token from scratch would mean re-running the whole network on the entire conversation every time. The attention work for every new token would then be quadratic in sequence length, which would be unusably slow. Instead, frontier inference uses three tricks:
- KV cache. Store the keys and values of every previous token so each new token attends to a cache instead of recomputing. Linear instead of quadratic.
- Speculative decoding. A small "draft" model proposes K tokens; the big model verifies them in a single forward pass. When the draft is right (often it is), you get K tokens for roughly the cost of one big-model pass, plus the cheap drafting.
- Continuous batching. Concurrent users get stitched into a single matmul. Rather than waiting for one user's response to finish before starting the next, the server fluidly mixes them token-by-token.
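The first of these tricks is easy to check numerically: attending from one new token to cached keys and values gives the same result as recomputing attention over the whole sequence. The sketch covers a single attention head in a single layer with random placeholder weights, and `KVCache` is a hypothetical class, not a library API.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


class KVCache:
    """Stores one key and one value vector per token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new, w_q, w_k, w_v):
        """Attend from one new token to everything in the cache, itself included."""
        self.keys.append(x_new @ w_k)
        self.values.append(x_new @ w_v)
        q = x_new @ w_q
        k, v = np.stack(self.keys), np.stack(self.values)
        weights = softmax(k @ q / np.sqrt(len(q)))
        return weights @ v


def recompute_last(x_all, w_q, w_k, w_v):
    """The slow path: rebuild every key and value just to get the last output."""
    q = x_all[-1] @ w_q
    k, v = x_all @ w_k, x_all @ w_v
    weights = softmax(k @ q / np.sqrt(len(q)))
    return weights @ v


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
cache = KVCache()
for t in range(len(x)):
    fast = cache.step(x[t], w_q, w_k, w_v)
    print(t, np.allclose(fast, recompute_last(x[: t + 1], w_q, w_k, w_v)))
```

The cached path projects one new key and value per step, while the slow path projects all of them again, which is where the saving comes from.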
This is also where the famous mixture-of-experts trick pays off. In an MoE model, only a small fraction of the parameters fire for each token — DeepSeek-V3 has 671 billion total parameters but only activates 37 billion per token. The total parameter count gives the model capacity; the active count determines the cost. Decoupling them is one of the most important architectural ideas of the last three years.
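The gap between total and active parameters comes from routing. The sketch below is a generic top-k router, not DeepSeek-V3's actual routing scheme. Each "expert" is reduced to a single random matrix for brevity, where real experts are small MLPs, and all names are hypothetical.

```python
import numpy as np


def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    logits = x @ gate_w                           # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights = weights / weights.sum()             # softmax over the chosen experts
    # Only the chosen experts run; the rest cost nothing for this token.
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
    return out, sorted(chosen.tolist())


def active_fraction(active_params, total_params):
    return active_params / total_params


rng = np.random.default_rng(0)
d, n_experts = 4, 8
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

out, used = moe_forward(x, gate_w, experts, top_k=2)
print("experts used for this token:", used)
print(f"{active_fraction(37e9, 671e9):.1%} of parameters active per token")
```

With the figures quoted above, about 5.5% of the parameters do work for any given token, even though all of them must be held in memory.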