Companion to the AI-agent cost tutorial (RAG and multi-step agent parts) and hybrid search. Pricing gets easier once you know what must fit in the window each turn.
Context engineering
An interactive walkthrough: what the context window is, what happens when it overflows, how compaction works, why agents hallucinate, and the strategies that fix it.
The context window is a budget
Every model has a fixed token capacity per turn. Instructions, tool definitions, conversation history, retrieved documents, and the user's current question all have to fit in that single window. Drag the sliders to see how a 200k window gets carved up.
What happens when it overflows
As the agent loops — every retrieval, every tool call, every reasoning step appends to history. Pick a strategy, hit run, and watch what happens at the cliff edge. The window here is dramatized to 80k so the overflow is visible.
Hard truncation
Drop oldest messages until it fits. Fast — but loses early context, possibly the original task itself.
Error and stop
Refuse to continue. Safest, but the agent halts mid-task.
Summarize and continue
Compress old turns into a recap, keep recent turns verbatim. Lossy, but the agent keeps going.
Compaction: room without amnesia
When the window hits ~80% full, a compaction step rewrites old turns as a structured recap. The art is choosing what to keep verbatim (recent turns + critical facts) versus what to compress (verbose tool output, exploratory dead ends).
Why agents hallucinate
Most context-related hallucinations come from two failures: the fact is missing entirely, or it's buried where the model can't see it. Drag the slider to watch recall collapse with length — the famous "lost in the middle" effect.
Lost in the middle
Models attend most strongly to the start and end of long contexts. Facts buried mid-window are often skipped — the model fills the gap from training-data priors instead.
Context poisoning
A wrong fact from an earlier tool call gets summarized into the recap as truth. Every later turn treats it as ground truth and compounds the error.
Context distraction
Too many irrelevant tools, examples, or retrieved chunks. The model picks the loudest matching pattern, not the right one.
Missing context
The fact was never in the window at all. The model produces a plausible-sounding answer from priors instead of admitting it doesn't know.
Five moves teams actually make
When an agent's context bloats — a familiar pattern: every tool retrieval kept verbatim, a multi-kilobyte JSON dump from one query taking up most of a turn — the fix is rarely "compress everything." It's deciding, per tool call, whether the output goes into the context at all.
That single decision is the whole job. The five moves below are what teams reach for in service of it.
The first move is to write stable state out of the window. A large JSON dump doesn't need to live in context; it can live on disk, with a one-line note (query result saved to /work/q-1842.json, 1,842 rows, columns: id, sku, qty, price) that the next turn reads and acts on. The same logic applies to a todo list, a plan, a partial draft, an extracted schema. If the model is going to look at it more than once but doesn't need it right now, get it out of the window.
The second move is to select instead of load. Production RAG doesn't stuff the top-50 chunks into the prompt; it retrieves a wider set, reranks, and only the top 3–5 cross the threshold into context. The same discipline applies to tool definitions: if a turn only needs three of your fifteen tools, don't show the model the other twelve. Anthropic's tool-use docs are explicit that tool definitions count against your window — filtering them aggressively is usually free quality.
The third move — the one most teams reach for first and shouldn't — is to compress. Compaction works, and the stepper in tab 3 shows what survives and what doesn't, but it's a last resort. A summary is lossy in ways you can't predict at write-time. Compress old turns (especially tool retries and exploratory dead ends), keep the original task verbatim, keep the most recent two or three turns verbatim, and accept that you'll occasionally lose a fact the model needed. The other moves give you predictable wins; this one gives you a smaller bill in exchange for occasional weirdness.
The fourth move is to isolate. When a sub-task can be defined narrowly enough to return a structured result — "read these 12 files and tell me which one defines handle_refund" — spawn a sub-agent with its own fresh context. The parent gets the answer; the parent never sees the 12 file reads. This is how Claude Code parallelizes search and review, and it's the highest-leverage move once your agent is doing real work on a real codebase. The trick is that the sub-task interface has to be tight, or you've just moved the bloat one level down.
The fifth move is the cheapest and most neglected: structure. Lean system prompt. Lean tool descriptions. Consistent output format the rest of the system can parse without the model narrating. A typical system prompt for a production agent is 300–600 tokens; the first draft of one is usually three to four times that. Tightening it costs almost nothing and gives you back hundreds of tokens of headroom on every turn forever.