Companion to the AI-agent cost tutorial (RAG and multi-step agent parts) and hybrid search. Pricing gets easier once you know what must fit in the window each turn.

Context engineering

An interactive walkthrough: what the context window is, what happens when it overflows, how compaction works, why agents hallucinate, and the strategies that fix it.

The context window is a budget

Every model has a fixed token capacity per turn. Instructions, tool definitions, conversation history, retrieved documents, and the user's current question all have to fit in that single window. Drag the sliders to see how a 200k window gets carved up.

Default slider values for a 200k window:

System prompt: 2.0k
Tool definitions: 8.0k
History: 20k
RAG docs: 30k
Current task: 2.0k

Over 75% full: quality starts degrading and the model still needs room to respond.

Overflow: the oldest messages get dropped, the request errors, or it silently truncates.

Why it matters: a chatbot user types one message at a time. An agent generates dozens of tool calls per turn, each with verbose output. The window fills fast, and once it fills, answer quality degrades across the board.
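
The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a real API: the `Budget` class and its field names are hypothetical, the component sizes are the default slider values, and the 75% warning level is the threshold quoted above.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Hypothetical helper: tracks how one turn's window is carved up."""
    window: int
    parts: dict[str, int]  # component name -> tokens

    @property
    def used(self) -> int:
        return sum(self.parts.values())

    @property
    def free(self) -> int:
        return self.window - self.used

    def status(self, warn_at: float = 0.75) -> str:
        # "overflow" once the parts exceed the window,
        # "degrading" past the warning fraction, otherwise "ok".
        if self.used > self.window:
            return "overflow"
        if self.used > warn_at * self.window:
            return "degrading"
        return "ok"


budget = Budget(
    window=200_000,
    parts={
        "system_prompt": 2_000,
        "tool_definitions": 8_000,
        "history": 20_000,
        "rag_docs": 30_000,
        "current_task": 2_000,
    },
)
print(budget.used, budget.free, budget.status())  # 62000 138000 ok
```

Growing `history` alone is enough to move the status from "ok" to "degrading" and then "overflow", which is the agent failure pattern described above.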

What happens when it overflows

As the agent loops, every retrieval, tool call, and reasoning step appends to history. Pick a strategy, hit run, and watch what happens at the cliff edge. The window here is dramatized to 80k so the overflow is visible.

Overflow strategy

Hard truncation

Drop oldest messages until it fits. Fast — but loses early context, possibly the original task itself.

Error and stop

Refuse to continue. Safest, but the agent halts mid-task.

Summarize and continue

Compress old turns into a recap, keep recent turns verbatim. Lossy, but the agent keeps going.
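
The first two strategies are simple enough to sketch directly. This is an illustration under stated assumptions: `estimate_tokens` uses a crude four-characters-per-token guess rather than a real tokenizer, and `ContextOverflow`, `hard_truncate`, and `error_and_stop` are hypothetical names.

```python
class ContextOverflow(Exception):
    """Raised when the history no longer fits in the window."""


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    # A real system would use the model's tokenizer.
    return max(1, len(text) // 4)


def total_tokens(messages: list[str]) -> int:
    return sum(estimate_tokens(m) for m in messages)


def hard_truncate(messages: list[str], limit: int) -> list[str]:
    """Drop oldest messages until the rest fits. May drop the original task."""
    kept = list(messages)
    while kept and total_tokens(kept) > limit:
        kept.pop(0)
    return kept


def error_and_stop(messages: list[str], limit: int) -> list[str]:
    """Refuse to continue when the history is over the limit."""
    used = total_tokens(messages)
    if used > limit:
        raise ContextOverflow(f"{used} tokens exceeds limit of {limit}")
    return messages


history = ["TASK: refund order 1842", "x" * 400, "y" * 400, "z" * 400]
trimmed = hard_truncate(history, limit=210)
print(history[0] in trimmed)  # False: the original task was dropped
```

The example shows the failure mode named above: hard truncation keeps the two newest messages and silently discards the task that started the run.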

Compaction: room without amnesia

When the window hits ~80% full, a compaction step rewrites old turns as a structured recap. The art is choosing what to keep verbatim (recent turns + critical facts) versus what to compress (verbose tool output, exploratory dead ends).

What survives compaction: the original task, the most recent turns, files written to disk, and any decision the user explicitly made. What gets compressed: failed attempts, redundant tool calls, raw retrieval payloads, and exploratory reasoning the agent has already concluded.
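
A compaction step along these lines might look like the sketch below. It assumes the first message is the original task, uses a rough four-characters-per-token estimate, and takes the summarizer as a parameter; `compact` and `stub_summarize` are hypothetical names, and the stub stands in for what would normally be a model call.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def compact(
    messages: list[str],
    window: int,
    summarize: Callable[[list[str]], str],
    threshold: float = 0.8,
    keep_recent: int = 3,
) -> list[str]:
    """Keep the task and recent turns verbatim; fold the rest into a recap."""
    used = sum(estimate_tokens(m) for m in messages)
    split = len(messages) - keep_recent
    # Do nothing below the threshold, or when there is nothing old to compress.
    if used < threshold * window or split <= 1:
        return messages
    task, old, recent = messages[0], messages[1:split], messages[split:]
    recap = "RECAP: " + summarize(old)
    return [task, recap, *recent]


def stub_summarize(turns: list[str]) -> str:
    # Stand-in for a model-generated structured recap.
    return f"{len(turns)} earlier turns condensed"


task = "TASK: refund order 1842"
messages = [task] + [f"turn {i}: " + "x" * 600 for i in range(6)]
compacted = compact(messages, window=1_000, summarize=stub_summarize)
print(len(messages), "->", len(compacted))  # 7 -> 5
```

Passing the summarizer in keeps the policy (what survives verbatim) separate from the lossy part (how the recap is written), so the two can be tested independently.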

Why agents hallucinate

Most context-related hallucinations come from two failures: the fact is missing entirely, or it's buried where the model can't see it. Drag the slider to watch recall collapse with length — the famous "lost in the middle" effect.

Figure: recall by document position (start, middle, end) as context length grows. Legend: recalled, lost in middle, forgotten.

Common hallucination triggers

Lost in the middle

Models attend most strongly to the start and end of long contexts. Facts buried mid-window are often skipped — the model fills the gap from training-data priors instead.

Context poisoning

A wrong fact from an earlier tool call gets summarized into the recap as truth. Every later turn treats it as ground truth and compounds the error.

Context distraction

Too many irrelevant tools, examples, or retrieved chunks. The model picks the loudest matching pattern, not the right one.

Missing context

The fact was never in the window at all. The model produces a plausible-sounding answer from priors instead of admitting it doesn't know.
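
One mitigation sometimes used for the lost-in-the-middle trigger is to order retrieved chunks so the strongest ones sit at the edges of the prompt. The text above does not prescribe this; it is offered here as an assumption-laden sketch, and `edge_order` is a hypothetical helper that expects its input already ranked best-first.

```python
def edge_order(ranked: list[str]) -> list[str]:
    """Place best-ranked chunks at the start and end, weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked):
        # Alternate: even ranks fill from the start, odd ranks from the end.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(edge_order(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

With five chunks ranked A (best) to E (worst), A opens the prompt, B closes it, and E ends up in the middle, where a missed fact costs the least.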

Five moves teams actually make

A familiar pattern when an agent's context bloats: every tool retrieval is kept verbatim, and a multi-kilobyte JSON dump from one query takes up most of a turn. The fix is rarely "compress everything." It's deciding, per tool call, whether the output goes into the context at all.

That single decision is the whole job. The five moves below are what teams reach for in service of it.

The first move is to write stable state out of the window. A large JSON dump doesn't need to live in context; it can live on disk, with a one-line note (query result saved to /work/q-1842.json, 1,842 rows, columns: id, sku, qty, price) that the next turn reads and acts on. The same logic applies to a todo list, a plan, a partial draft, an extracted schema. If the model is going to look at it more than once but doesn't need it right now, get it out of the window.
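
The write move can be sketched as a small helper that saves a tool result to disk and returns only the one-line note for the context. `offload_rows` is a hypothetical name, the rows are synthetic, and a temporary directory stands in for the `/work` path mentioned above.

```python
import json
import tempfile
from pathlib import Path


def offload_rows(rows: list[dict], path: Path) -> str:
    """Save rows to disk and return a one-line note for the context."""
    path.write_text(json.dumps(rows))
    columns = ", ".join(rows[0].keys()) if rows else "none"
    return f"query result saved to {path}, {len(rows):,} rows, columns: {columns}"


# Synthetic data for illustration only.
rows = [{"id": i, "sku": f"SKU-{i}", "qty": 1, "price": 9.99} for i in range(1842)]
workdir = Path(tempfile.mkdtemp())
note = offload_rows(rows, workdir / "q-1842.json")
print(note)
```

Only `note` goes into the next turn; the full payload stays on disk until a later step actually needs to read it.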

The second move is to select instead of load. Production RAG doesn't stuff the top-50 chunks into the prompt; it retrieves a wider set, reranks, and only the top 3–5 cross the threshold into context. The same discipline applies to tool definitions: if a turn only needs three of your fifteen tools, don't show the model the other twelve. Anthropic's tool-use docs are explicit that tool definitions count against your window — filtering them aggressively is usually free quality.
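
Both halves of the select move fit in a short sketch. The names `select_chunks`, `select_tools`, and `overlap_score` are hypothetical, and the word-overlap scorer is a toy stand-in for a real reranker; the tool specs are placeholder dictionaries, not any particular provider's schema.

```python
from typing import Callable


def select_chunks(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 5,
    min_score: float = 0.0,
) -> list[str]:
    """Rerank a wide candidate set; only the top few above a threshold pass."""
    scored = [(score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for s, c in scored[:top_k] if s >= min_score]


def select_tools(all_tools: dict[str, dict], needed: set[str]) -> list[dict]:
    """Expose only the tool definitions this turn needs."""
    return [spec for name, spec in all_tools.items() if name in needed]


def overlap_score(query: str, chunk: str) -> float:
    # Toy scorer: fraction of query words that appear in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "shipping rates for international orders",
    "damaged items qualify for a refund within 30 days",
    "office holiday schedule",
    "our refund policy covers unopened goods",
]
chosen = select_chunks(
    "refund policy for damaged items", candidates, overlap_score,
    top_k=2, min_score=0.3,
)

all_tools = {
    "search": {"name": "search"},
    "refund": {"name": "refund"},
    "email": {"name": "email"},
}
turn_tools = select_tools(all_tools, {"search", "refund"})
```

The threshold matters as much as `top_k`: a candidate that ranks second but scores poorly should stay out of the window rather than fill a slot.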

The third move, the one most teams reach for first and shouldn't, is to compress. Compaction works, and the compaction section above shows what survives and what doesn't, but it's a last resort. A summary is lossy in ways you can't predict at write-time. Compress old turns (especially tool retries and exploratory dead ends), keep the original task verbatim, keep the most recent two or three turns verbatim, and accept that you'll occasionally lose a fact the model needed. The other moves give you predictable wins; this one gives you a smaller bill in exchange for occasional weirdness.

The fourth move is to isolate. When a sub-task can be defined narrowly enough to return a structured result — "read these 12 files and tell me which one defines handle_refund" — spawn a sub-agent with its own fresh context. The parent gets the answer; the parent never sees the 12 file reads. This is how Claude Code parallelizes search and review, and it's the highest-leverage move once your agent is doing real work on a real codebase. The trick is that the sub-task interface has to be tight, or you've just moved the bloat one level down.
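
The shape of the isolate move is sketched below with a stubbed model. Everything here is hypothetical: `Model` is an assumed callable that takes a context and returns text, `fake_model` replaces a real model call with a string scan, and the three files are invented. The point is only that the sub-agent's context is built fresh and the parent receives one short line.

```python
from typing import Callable

# Assumed interface: a model is any callable from a context to a reply.
Model = Callable[[list[str]], str]


def run_subagent(model: Model, instruction: str, inputs: list[str]) -> str:
    """Run a narrow sub-task in a fresh context and return only its answer."""
    context = [instruction, *inputs]  # fresh list: nothing from the parent
    return model(context)


def find_definition(model: Model, files: dict[str, str], symbol: str) -> str:
    instruction = f"Reply with only the name of the file that defines {symbol}."
    inputs = [f"# {name}\n{body}" for name, body in files.items()]
    return run_subagent(model, instruction, inputs)


def fake_model(context: list[str]) -> str:
    # Stand-in for a real model call: scan file bodies for the definition.
    for block in context[1:]:
        header, _, body = block.partition("\n")
        if "def handle_refund" in body:
            return header[2:]  # strip the leading "# "
    return "not found"


files = {
    "orders.py": "def create_order(): ...",
    "billing.py": "def handle_refund(order): ...",
    "email.py": "def send(): ...",
}
parent_history = ["TASK: fix refund bug"]
answer = find_definition(fake_model, files, "handle_refund")
parent_history.append(f"handle_refund is defined in {answer}")
print(parent_history)
```

The parent's history grows by one line regardless of how many files the sub-agent read, which is the tight interface the paragraph above asks for.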

The fifth move is the cheapest and most neglected: structure. Lean system prompt. Lean tool descriptions. Consistent output format the rest of the system can parse without the model narrating. A typical system prompt for a production agent is 300–600 tokens; the first draft of one is usually three to four times that. Tightening it costs almost nothing and gives you back hundreds of tokens of headroom on every turn forever.

Context engineering is the highest-leverage knob in any agent that calls more than one tool — and it stays the highest-leverage knob even after you think you're done. The cost-of-an-agent tutorial walks through what context choices look like in actual dollar terms on a real workload.

Every turn has a fixed token cap; the system prompt, tool definitions, history, retrieved chunks, and the current user message all compete for the same finite space, and the model needs headroom to emit its reply.

Truncation drops the oldest messages wholesale, often deleting the original task or critical tool results. Compaction replaces a large stretch of history with a structured recap that preserves decisions and facts while discarding verbose or failed attempts.

Models tend to attend strongly to tokens at the start and end of a long prompt; facts buried in the middle are recalled less reliably, so the model falls back on parametric (training) knowledge.

Write: externalize stable artifacts (files, scratchpads, memory stores) so the agent re-reads them on demand instead of carrying verbatim content in every turn.