Nineteen interactive chapters, written to be read in order or skimmed by chapter. The thread is how language models work, how to use them, and how to ship systems built on them: tokens to attention to retrieval to evals. Each chapter is a live page with at least one demo you can break.
Foundations
-
01
The anatomy of a frontier model
Tokenizer, transformer block, attention, Chinchilla scaling laws, 3D parallel training, post-training (SFT → reward → RL → reasoning RL), inference (KV cache, speculative decoding, MoE). Seven sections, five live demos.
-
02
Mechanistic interpretability — looking inside the model
What "mechanistic" means vs behavioural / representational interpretability; three landmark findings (induction heads, the IOI circuit, sparse autoencoders); the toolkit researchers actually run; Amodei's urgency argument; how to get started; and an honest read of the open problems.
-
03
Working with the grain
Five structural truths about LLMs — they sample rather than recall, generation is a walk not a plan, specification is steering, generation isn't evaluation, memory is just context. Each illustrated with a small interactive.
-
04
DeepSeek-R1, from first principles
Sixteen widgets, one paper. Drag the GRPO simulator, click through the four-stage pipeline, watch reasoning emerge from pure RL on rule-based rewards. The reasoning-RL paradigm that drives modern test-time compute.
Retrieval & context
-
05
Hybrid search — keywords plus vectors
Side-by-side BM25, dense retrieval, and RRF merge on a toy corpus. See where each strategy wins — exact SKUs, paraphrases, error codes — before you wire a production RAG stack.
-
06
Context engineering
Five tabs simulating window budgets, overflow, compaction, lost-in-the-middle, and production mitigation patterns. The cost here is literal: how many of these tokens fit per turn.
Production techniques
-
07
Sampling parameters — temperature, top-p, top-k, seed
What each sampling knob actually does to the next-token distribution, why "high temperature = creative" is a useful lie, why seed alone doesn't reproduce in practice, and the four presets that cover 95% of cases. Interactive distribution visualizer included.
-
08
Structured outputs & JSON mode
Three implementations from prompt-and-parse to grammar-constrained decoding, schema design patterns the model can follow, the strict-mode pitfalls (no-refusal, long-enum regression, truncation-defeats-grammar), and an 8-step parser-state visualizer.
-
09
Streaming, batching & async
The three latencies users care about (TTFT, output rate, total), what streaming actually IS at the wire level (SSE), the batch API's 50%-off / hours-of-latency tradeoff, and a side-by-side latency simulator showing why streaming feels 5× faster.
-
10
Rate limits, retries & backoff
TPM vs RPM vs concurrency limits, the tier ladder, exponential backoff with jitter, idempotency keys, hedged requests. A retry-storm simulator shows why jitter takes success rate from 15% to 95%.
-
11
How prompt caching works
An animated walk-through of Anthropic-style prompt caching: how a reusable prefix is written once and read many times, why reads cost ~10% of the base input price, and where the cache breakpoint should sit.
-
12
Tool use & function calling
How LLMs call functions (structured JSON, not magic), schema design that swings selection accuracy by thirty points, parallel tool calls, MCP, and the production failure modes that loop your agent forever.
-
13
MCP & agent-to-agent protocols
The wire protocols underneath tool use: what MCP is at the JSON-RPC level (three primitives, three transports), why A2A is a different concern, how to build & host a server, scaling realities, OAuth 2.1 security, and the production failure modes. Animated 8-step lifecycle visualizer + MCP-vs-A2A decision tree.
Choosing & measuring
-
14
Choosing a frontier LLM in 2026
Claude, GPT, Gemini, Grok, and the open-weight contenders. By task, by constraint, by compliance. A sortable comparison matrix and a five-question decision tree that lands you on a model.
-
15
Telling if your LLM system actually works
Golden sets, rubrics, LLM-as-judge, Cohen's kappa, and the pairwise trick that breaks scalar-score ties. The discipline that separates "feels better" from "is measurably better."
Inference & serving
-
16
Quantization — fitting a 70B model on one GPU
FP16 / FP8 / INT8 / INT4 mechanics, GPTQ / AWQ / SmoothQuant / GGUF methods, where quality breaks (long-context recall, math, schemas), and the deployment math — interactive memory bar that shows when each quant level fits which GPU.
-
17
KV cache & paged attention
Why serving LLMs is a memory problem, how naive allocation wastes 60–90% of cache memory, vLLM's PagedAttention breakthrough, and cross-request prefix sharing as the mechanism behind hosted prompt caching.
-
18
Speculative decoding & continuous batching
Two paired tricks that raise modern serving throughput 2–3×: a small draft model proposes tokens that the big model verifies in parallel, and the scheduler packs many requests onto the same forward pass instead of waiting.
-
19
LLM serving frameworks — vLLM, TensorRT-LLM, TGI, SGLang
What a serving framework actually is, what each of the major four is good at, the benchmark trap, and the decision tree for picking one. Sortable comparison matrix + interactive 4-question decision tree.
Reading order is recommended but not enforced. The book sidebar on every chapter lets you jump anywhere; the pager at the bottom of each chapter is the "next page" if you're reading straight through. The companion cost-modeling tutorial — also on this site — picks up where Chapter 09 leaves off and walks the procurement-side math.