ajinkya.ai An experiment in learning with AI.
← All entries
11 May 2026 4 min read

How prompt caching works

LLM Caching Cost Performance Tutorial Interactive Anthropic

Anthropic API · technique deep-dive

How prompt caching works

A reusable prefix written once and read many times — cheaper, faster, and surprisingly easy to break if you put the breakpoint in the wrong place.

Animated prompt caching token flow across two API calls Call 1: cacheable tokens flow from request through the model and up into the cache; user message tokens flow straight through. Call 2: the cache emits the same prefix tokens directly down to the model, while only the new user message is processed fresh — costing roughly 10 percent of the input price. Call 1 · cache cold · paying the write tax Call 2 · cache warm · 10× cheaper read 5-minute cache · tools + system + docs + thousands Request Model Response Call 1 — paid the write tax 20K cached × $3.75/M + 300 input × $3/M = $0.0759 $0.076 Call 2 — cache hit, paid 10% of input price 20K cached × $0.30/M + 300 input × $3/M = $0.0069 · saved 91% $0.007
cacheable prefix user message response
Three API calls sharing a cached prefix Call 1 writes the prefix to cache at 1.25× input cost. Calls 2 and 3 read the same prefix at 0.1× input cost. A dashed line marks the cache_control breakpoint between the cached prefix and the per-request user message. Three API calls within the 5-minute TTL window Call 1 tools · system · documents user msg write · 1.25× Call 2 same prefix · read from cache user msg read · 0.1× Call 3 same prefix · read from cache user msg read · 0.1× cache_control breakpoint Hierarchy: tools → system → messages. Each cache hit also refreshes the TTL for free.
The cached prefix is the same content every call. Only the first request pays the write multiplier; the rest pay the read price, which is one-tenth of the base input price.
Model
Cache TTL
Cached prefix
20,000 tok
Per-call user msg
300 tok
Calls within window
20

Without caching

With caching

Savings on input cost
Cache write (once)
Cache reads (n−1 calls)
Dynamic suffix (all calls)
Total with caching

Caching is prefix matching, not content matching

The cache key is a cumulative hash of everything from the start of the request up to the cache_control block — in the order tools → system → messages. Change a single token anywhere in that prefix and the hash differs. The system looks backward through your blocks trying to find an earlier matching entry, but its lookback window is only 20 blocks; if it doesn't find one in that range, you pay a fresh write.

The cardinal sin: breakpoint on a varying block

If your last block contains a timestamp, a request ID, or the incoming user message, the hash lives there and never matches anything. Put the breakpoint on the last block that stays identical across the calls you want to share a cache. For a typical RAG or chat-with-docs setup, that's the end of your reference corpus, with the user's question coming after the breakpoint.

5-minute vs 1-hour is an arithmetic question

A 5-minute cache pays for itself on the second hit (the 1.25× write is recovered by skipping one full read). A 1-hour cache pays for itself on the third hit (2× write). Each cache hit refreshes the TTL for free, so steady traffic keeps a 5m cache warm indefinitely. Reach for the 1-hour TTL only when the gap between calls is reliably longer than 5 minutes — agent side-tasks that take a while, or a chat where the user goes quiet for ten minutes at a stretch.

Pre-warming for latency-sensitive workloads

Send a max_tokens: 0 request at server startup (and on a schedule, every ~4 minutes for a 5m TTL) with your system prompt and the cache_control breakpoint. The cache is written, no output is generated, no output tokens are billed, and the first real user request lands on a warm cache — eliminating the cold-start latency tax on time-to-first-token. Particularly worth it for interactive products where the first interaction should feel snappy.

What invalidates the cache, in decreasing pain

Changing tool definitions blows away the entire cache. Flipping web-search or citations toggles invalidates system + messages. Adding or removing images anywhere invalidates messages. Changing tool_choice invalidates messages. Switching speed modes invalidates system + messages. The fix is the same in every case: keep these stable across calls that should share a cache, and put any varying knob after the breakpoint, never before it.

Minimum cacheable length

Below the model's minimum prefix length, the API silently skips caching — no error, just cache_creation_input_tokens: 0 and cache_read_input_tokens: 0 in the response. Minimums: 4,096 tokens for Opus 4.7 and Haiku 4.5, 2,048 for Sonnet 4.6, 1,024 for older Sonnet/Opus models. If your prefix sits just under the threshold, expanding it to clear the bar is often net cheaper than running uncached.

Pricing reflects Anthropic API rates as of 2026. Reference docs: Anthropic · OpenAI