Understanding LLMs — a field guide

Nineteen interactive chapters, written to be read in order or skimmed by chapter. The thread is how language models work, how to use them, and how to ship systems built on them, from tokens to attention to retrieval to evals. Each chapter is a live page with at least one demo you can break.

Part I

Foundations

01

The anatomy of a frontier model

Tokenizer, transformer block, attention, Chinchilla scaling laws, 3D parallel training, post-training (SFT → reward → RL → reasoning RL), inference (KV cache, speculative decoding, MoE). Seven sections, five live demos.
02

Mechanistic interpretability — looking inside the model

What "mechanistic" means vs behavioural / representational interpretability; three landmark findings (induction heads, the IOI circuit, sparse autoencoders); the toolkit researchers actually run; Amodei's urgency argument; how to get started; and an honest read of the open problems.
03

Working with the grain

Five structural truths about LLMs — they sample rather than recall, generation is a walk not a plan, specification is steering, generation isn't evaluation, memory is just context. Each illustrated with a small interactive.
04

DeepSeek-R1, from first principles

Sixteen widgets, one paper. Drag the GRPO simulator, click through the four-stage pipeline, watch reasoning emerge from pure RL on rule-based rewards. The reasoning-RL paradigm that drives modern test-time compute.

Part II

Retrieval & context

05

Hybrid search — keywords plus vectors

Side-by-side BM25, dense retrieval, and RRF merge on a toy corpus. See where each strategy wins — exact SKUs, paraphrases, error codes — before you wire a production RAG stack.
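
As a taste of the merge step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The document ids and the two hit lists are made up for illustration, and k=60 is only the commonly used default constant; nothing here is taken from the chapter's own demo code.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    k=60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword ranker and a vector ranker.
bm25_hits = ["sku-4417", "err-e1021", "faq-returns"]
dense_hits = ["faq-returns", "sku-4417", "faq-refunds"]

fused = rrf_merge([bm25_hits, dense_hits])
print(fused)
```

RRF only looks at rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.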
06

Context engineering

Five tabs simulating window budgets, overflow, compaction, lost-in-the-middle, and production mitigation patterns. The cost is literally "how many of these tokens fit per turn."

Part III

Production techniques

07

Sampling parameters — temperature, top-p, top-k, seed

What each sampling knob actually does to the next-token distribution, why "high temperature = creative" is a useful lie, why seed alone doesn't reproduce in practice, and the four presets that cover 95% of cases. Interactive distribution visualizer included.
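
A rough sketch of what the knobs do to a toy next-token distribution. The logits are invented, and the order of operations (temperature, then top-k, then top-p) is one common arrangement assumed here; real inference stacks differ in the details.

```python
import math
import random


def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: prob} after temperature scaling, top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (handle greedy decoding separately)")
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda kv: -kv[1])
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:  # keep the smallest prefix whose mass reaches top_p
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}  # renormalize the survivors


def sample(dist, seed=None):
    """Draw one token; a fixed seed makes this toy sampler repeatable."""
    rng = random.Random(seed)
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"the": 2.0, "a": 1.0, "zebra": -1.0}  # hypothetical values
print(next_token_distribution(toy_logits, temperature=0.5))
print(next_token_distribution(toy_logits, temperature=2.0))
```

Lower temperature sharpens the distribution toward the top token and higher temperature flattens it. This local toy is repeatable under a fixed seed; it does not model the serving-side effects behind the chapter's point that a seed alone is not enough in practice.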
08

Structured outputs & JSON mode

Three implementations from prompt-and-parse to grammar-constrained decoding, schema design patterns the model can follow, the strict-mode pitfalls (no-refusal, long-enum regression, truncation-defeats-grammar), and an 8-step parser-state visualizer.
09

Streaming, batching & async

The three latencies users care about (TTFT, output rate, total), what streaming actually IS at the wire level (SSE), the batch API's 50%-off / hours-of-latency tradeoff, and a side-by-side latency simulator showing why streaming feels 5× faster.
10

Rate limits, retries & backoff

TPM vs RPM vs concurrency limits, the tier ladder, exponential backoff with jitter, idempotency keys, hedged requests. Retry-storm simulator that shows why jitter takes success rate from 15% to 95%.
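
For orientation, a minimal sketch of exponential backoff with "full jitter" (a uniform draw between zero and the exponential ceiling). The function names, the base and cap values, and the `is_retryable` predicate are all illustrative assumptions, not any provider's SDK.

```python
import random
import time


def full_jitter_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return rng.uniform(0.0, ceiling)           # jitter spreads clients apart


def call_with_retries(fn, is_retryable, max_attempts=5,
                      sleep=time.sleep, rng=random):
    """Call fn(); on a retryable error, wait a jittered delay and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying won't fix.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            sleep(full_jitter_delay(attempt, rng=rng))
```

Without the random draw, every client that failed together retries together, which is the retry storm the simulator illustrates. Passing `sleep` and `rng` as parameters keeps the helper easy to test.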
11

How prompt caching works

An animated walk-through of Anthropic-style prompt caching — how a reusable prefix is written once and read many times, why reads cost ~10% of writes, and where the cache breakpoint should sit.
12

Tool use & function calling

How LLMs call functions (structured JSON, not magic), schema design that swings selection accuracy by thirty points, parallel tool calls, MCP, and the production failure modes that loop your agent forever.
13

MCP & agent-to-agent protocols

The wire protocols underneath tool use — what MCP is at the JSON-RPC level (three primitives, three transports), why A2A is a different concern, how to build & host a server, scaling realities, OAuth 2.1 security, and the production failure modes. Animated 8-step lifecycle visualizer + MCP-vs-A2A decision tree.

Part IV

Choosing & measuring

14

Choosing a frontier LLM in 2026

Claude, GPT, Gemini, Grok, and the open-weight contenders. By task, by constraint, by compliance. A sortable comparison matrix and a five-question decision tree that lands you on a model.
15

Telling if your LLM system actually works

Golden sets, rubrics, LLM-as-judge, Cohen's kappa, and the pairwise trick that breaks scalar-score ties. The discipline that separates "feels better" from "is measurably better."
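
A small sketch of Cohen's kappa for checking whether an LLM judge agrees with a human grader beyond chance. The two label lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each rater labeled at random with their own base rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades on eight golden-set items.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
print(round(cohens_kappa(human, judge), 3))
```

Here raw agreement is 6 of 8, but because both raters say "pass" most of the time, much of that agreement is expected by chance, and kappa comes out well below 0.75.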

Part V

Inference & serving

16

Quantization — fitting a 70B model on one GPU

FP16 / FP8 / INT8 / INT4 mechanics, GPTQ / AWQ / SmoothQuant / GGUF methods, where quality breaks (long-context recall, math, schemas), and the deployment math — interactive memory bar that shows when each quant level fits which GPU.
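
The deployment math in its simplest form: weight memory is parameter count times bits per weight. The sketch below assumes weights only (no KV cache, activations, or per-group scale overhead) and a made-up 20% headroom rule, so treat the answers as lower bounds.

```python
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_gb(n_params, fmt):
    """Gigabytes (1e9 bytes) needed for the weights alone."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def fits(n_params, fmt, gpu_gb, headroom=0.2):
    """True if the weights fit while leaving `headroom` of the GPU free.

    The 20% default is an illustrative assumption, not a recommendation.
    """
    return weight_gb(n_params, fmt) <= gpu_gb * (1.0 - headroom)


for fmt in BITS_PER_WEIGHT:
    gb = weight_gb(70e9, fmt)
    print(f"{fmt}: {gb:.0f} GB, fits one 80 GB GPU: {fits(70e9, fmt, 80)}")
```

Under these assumptions a 70B model needs about 140 GB at FP16, 70 GB at 8 bits, and 35 GB at INT4, which is why the 4-bit formats are the ones that fit on a single 80 GB card.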
17

KV cache & paged attention

Why serving LLMs is a memory problem, how naive allocation wastes 60–90% of cache memory, vLLM's PagedAttention breakthrough, and cross-request prefix sharing as the mechanism behind hosted prompt caching.
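
To make the memory argument concrete, a back-of-envelope sketch comparing contiguous max-length reservation with block-based allocation. The model shape, sequence lengths, and block size are hypothetical, and the functions are a simplification, not vLLM's actual allocator.

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def naive_waste(seq_lens, max_len):
    """Fraction of reserved slots unused when every request reserves max_len."""
    return 1.0 - sum(seq_lens) / (len(seq_lens) * max_len)


def paged_waste(seq_lens, block_size=16):
    """Fraction unused when memory is handed out in fixed-size blocks."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1.0 - sum(seq_lens) / reserved


# Hypothetical batch: four requests with very different lengths.
seq_lens = [120, 300, 45, 800]
print(kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128))
print(f"naive: {naive_waste(seq_lens, max_len=2048):.1%} wasted")
print(f"paged: {paged_waste(seq_lens):.1%} wasted")
```

With paging, the only slack is the partly filled last block of each request, so waste is bounded by one block per sequence instead of growing with the gap between actual and maximum length.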
18

Speculative decoding & continuous batching

Two paired tricks that raise modern serving throughput 2–3×: a small draft model proposes tokens that the big model verifies in parallel, and the scheduler packs many requests onto the same forward pass instead of waiting.
19

LLM serving frameworks — vLLM, TensorRT-LLM, TGI, SGLang

What a serving framework actually is, what each of the major four is good at, the benchmark trap, and the decision tree for picking one. Sortable comparison matrix + interactive 4-question decision tree.

Reading order is recommended but not enforced. The book sidebar on every chapter lets you jump anywhere; the pager at the bottom of each chapter is the "next page" if you're reading straight through. The companion cost-modeling tutorial — also on this site — picks up where Chapter 09 leaves off and walks the procurement-side math.