ajinkya.ai An experiment in learning with AI.
12 May 2026 · 4 min read · 0% human written

Authorship mix

Honest estimate of who wrote what. The green bar is my human-written share — drafting, editing, structure, voice. The red stripes are AI's share — synthesis, scaffolding, first drafts. 0% human, 100% AI on this one.

Understanding LLMs — a field guide

LLM Book Tutorial Interactive Foundations

Nineteen interactive chapters, written to be read in order or skimmed by chapter. The thread is how language models work, how to use them, and how to ship systems built on them: tokens to attention to retrieval to evals. Each chapter is a live page with at least one demo you can break.

Part I

Foundations

  1. 01

    The anatomy of a frontier model

    Tokenizer, transformer block, attention, Chinchilla scaling laws, 3D parallel training, post-training (SFT → reward → RL → reasoning RL), inference (KV cache, speculative decoding, MoE). Seven chapters, five live demos.

  2. 02

    Mechanistic interpretability — looking inside the model

    What "mechanistic" means vs behavioural / representational interpretability; three landmark findings (induction heads, the IOI circuit, sparse autoencoders); the toolkit researchers actually run; Amodei's urgency argument; how to get started; and an honest read of the open problems.

  3. 03

    Working with the grain

    Five structural truths about LLMs — they sample rather than recall, generation is a walk not a plan, specification is steering, generation isn't evaluation, memory is just context. Each illustrated with a small interactive.

  4. 04

    DeepSeek-R1, from first principles

    Sixteen widgets, one paper. Drag the GRPO simulator, click through the four-stage pipeline, watch reasoning emerge from pure RL on rule-based rewards. The reasoning-RL paradigm that drives modern test-time-compute.
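
As a taste of what the GRPO simulator in Chapter 04 animates, here is a minimal sketch of the group-relative advantage step: several completions are sampled for one prompt, each gets a rule-based reward, and each reward is normalised against its own group. The function name, the example rewards, and the choice of population standard deviation are assumptions made for illustration, not details taken from the chapter.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group.

    Assumption: population standard deviation; `eps` guards against a group
    where every completion received the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rule-based rewards for four sampled answers to one prompt:
# 1.0 if the final answer matched the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage, without any learned value model in the loop.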

Part II

Retrieval & context

  1. 05

    Hybrid search — keywords plus vectors

    Side-by-side BM25, dense retrieval, and RRF merge on a toy corpus. See where each strategy wins — exact SKUs, paraphrases, error codes — before you wire a production RAG stack.

  2. 06

    Context engineering

    Five tabs simulating window budgets, overflow, compaction, lost-in-the-middle, and production mitigation patterns. The cost is literally "how many of these tokens fit per turn."
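
The RRF merge in Chapter 05 can be sketched in a few lines. Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranking it appears in, so it needs only rank positions, not comparable scores. The document IDs below are made up, and k = 60 is a commonly used default rather than a value taken from the chapter.

```python
def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each ranking is a list ordered best-first. A document missing from a
    ranking simply contributes nothing for that ranking.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy results: BM25 nails the exact SKU, dense retrieval
# prefers the paraphrase match.
bm25_ranking = ["sku-123", "doc-a", "doc-b"]
dense_ranking = ["doc-a", "doc-c", "sku-123"]
merged = rrf_merge([bm25_ranking, dense_ranking])
print(merged)
```

A document that both retrievers rank reasonably well (here `doc-a`) can beat one that a single retriever ranks first, which is the behaviour the side-by-side demo is built to show.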

Part III

Production techniques

  1. 07

    Sampling parameters — temperature, top-p, top-k, seed

    What each sampling knob actually does to the next-token distribution, why "high temperature = creative" is a useful lie, why seed alone doesn't reproduce in practice, and the four presets that cover 95% of cases. Interactive distribution visualizer included.

  2. 08

    Structured outputs & JSON mode

    Three implementations from prompt-and-parse to grammar-constrained decoding, schema design patterns the model can follow, the strict-mode pitfalls (no-refusal, long-enum regression, truncation-defeats-grammar), and an 8-step parser-state visualizer.

  3. 09

    Streaming, batching & async

    The three latencies users care about (TTFT, output rate, total), what streaming actually IS at the wire level (SSE), the batch API's 50%-off / hours-of-latency tradeoff, and a side-by-side latency simulator showing why streaming feels 5× faster.

  4. 10

    Rate limits, retries & backoff

    TPM vs RPM vs concurrent, the tier ladder, exponential backoff with jitter, idempotency keys, hedged requests. Retry-storm simulator that shows why jitter takes success rate from 15% to 95%.

  5. 11

    How prompt caching works

    An animated walk-through of Anthropic-style prompt caching — how a reusable prefix is written once and read many times, why reads cost ~10% of writes, and where the cache breakpoint should sit.

  6. 12

    Tool use & function calling

    How LLMs call functions (structured JSON, not magic), schema design that swings selection accuracy by thirty points, parallel tool calls, MCP, and the production failure modes that loop your agent forever.

  7. 13

    MCP & agent-to-agent protocols

    The wire protocols underneath tool use — what MCP is at the JSON-RPC level (three primitives, three transports), why A2A is a different concern, how to build & host a server, scaling realities, OAuth 2.1 security, and the production failure modes. Animated 8-step lifecycle visualiser + MCP-vs-A2A decision tree.
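
For the exponential backoff with jitter named in Chapter 10, a minimal sketch follows. It uses the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing ceiling, so clients that failed together do not retry together. `TransientError`, the base delay, and the cap are hypothetical names and values chosen for illustration; a real client would map them onto its provider's rate-limit errors.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 429."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0.0, ceiling)


def call_with_retries(fn, max_attempts=5, sleep=time.sleep):
    """Call fn(), retrying only on TransientError, with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))


# Usage with a recorded sleep so the example runs instantly.
waits = []
remaining_failures = [2]


def sometimes_rate_limited():
    if remaining_failures[0] > 0:
        remaining_failures[0] -= 1
        raise TransientError("rate limited")
    return "ok"


result = call_with_retries(sometimes_rate_limited, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the retry loop testable; in production the default `time.sleep` applies.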

Part IV

Choosing & measuring

  1. 14

    Choosing a frontier LLM in 2026

    Claude, GPT, Gemini, Grok, and the open-weight contenders. By task, by constraint, by compliance. A sortable comparison matrix and a five-question decision tree that lands you on a model.

  2. 15

    Telling if your LLM system actually works

    Golden sets, rubrics, LLM-as-judge, Cohen's kappa, and the pairwise trick that breaks scalar-score ties. The discipline that separates "feels better" from "is measurably better."
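
Cohen's kappa, mentioned in Chapter 15, measures how much two raters agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's label frequencies. The sketch below compares a human grader with an LLM judge; the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Probability that both raters pick the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight golden-set items.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Here the raters agree on 6 of 8 items (75%), but because both say "pass" most of the time, kappa comes out well below 0.75.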

Part V

Inference & serving

  1. 16

    Quantization — fitting a 70B model on one GPU

    FP16 / FP8 / INT8 / INT4 mechanics, GPTQ / AWQ / SmoothQuant / GGUF methods, where quality breaks (long-context recall, math, schemas), and the deployment math — interactive memory bar that shows when each quant level fits which GPU.

  2. 17

    KV cache & paged attention

    Why serving LLMs is a memory problem, how naive allocation wastes 60–90% of cache memory, vLLM's PagedAttention breakthrough, cross-request prefix sharing as the mechanism behind hosted prompt caching.

  3. 18

    Speculative decoding & continuous batching

    Two paired tricks that raise modern serving throughput 2–3×: a small draft model proposes tokens that the big model verifies in parallel, and the scheduler packs many requests into the same forward pass instead of waiting for the current batch to finish.

  4. 19

    LLM serving frameworks — vLLM, TensorRT-LLM, TGI, SGLang

    What a serving framework actually is, what each of the major four is good at, the benchmark trap, and the decision tree for picking one. Sortable comparison matrix + interactive 4-question decision tree.
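
The deployment math behind Chapter 16's memory bar starts from one multiplication: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below applies it to a 70B model. The 20% overhead for KV cache and activations and the 80 GB GPU size are assumptions picked for illustration; real quantized formats also store scales and zero-points, so effective bits per weight run slightly above the nominal figure.

```python
# Nominal bits per weight for each format named in the chapter blurb.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(n_params, fmt):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def fits_on_gpu(n_params, fmt, gpu_gb, overhead_fraction=0.2):
    """Rough fit check; overhead_fraction is an assumed allowance for
    KV cache and activations, not a measured number."""
    needed = weight_memory_gb(n_params, fmt) * (1.0 + overhead_fraction)
    return needed <= gpu_gb


n_params = 70e9  # a 70B-parameter model
for fmt in BITS_PER_WEIGHT:
    size = weight_memory_gb(n_params, fmt)
    verdict = "fits" if fits_on_gpu(n_params, fmt, gpu_gb=80) else "does not fit"
    print(f"{fmt}: {size:.0f} GB of weights, {verdict} on an assumed 80 GB GPU")
```

Under these assumptions only the INT4 variant of a 70B model clears a single 80 GB card with headroom to spare.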

Reading order is recommended but not enforced. The book sidebar on every chapter lets you jump anywhere; the pager at the bottom of each chapter is the "next page" if you're reading straight through. The companion cost-modeling tutorial — also on this site — picks up where Chapter 09 leaves off and walks the procurement-side math.