Anthropic API · technique deep-dive

How prompt caching works

A reusable prefix written once and read many times — cheaper, faster, and surprisingly easy to break if you put the breakpoint in the wrong place.

cacheable prefix user message response

The cached prefix is the same content every call. Only the first request pays the write multiplier; the rest pay the read price, which is one-tenth of the base input price.

Cost simulator

Model

Cache TTL

Cached prefix

20,000 tok

Per-call user msg

300 tok

Calls within window

Without caching

—

With caching

—

Savings on input cost —

Cache write (once)—

Cache reads (n−1 calls)—

Dynamic suffix (all calls)—

Total with caching—

Caching is prefix matching, not content matching

The cache key is a cumulative hash of everything from the start of the request up to the cache_control block — in the order tools → system → messages. Change a single token anywhere in that prefix and the hash differs. The system looks backward through your blocks trying to find an earlier matching entry, but its lookback window is only 20 blocks; if it doesn't find one in that range, you pay a fresh write.

The cardinal sin: breakpoint on a varying block

If your last block contains a timestamp, a request ID, or the incoming user message, the hash lives there and never matches anything. Put the breakpoint on the last block that stays identical across the calls you want to share a cache. For a typical RAG or chat-with-docs setup, that's the end of your reference corpus, with the user's question coming after the breakpoint.

5-minute vs 1-hour is an arithmetic question

A 5-minute cache pays for itself on the second hit (the 1.25× write is recovered by skipping one full read). A 1-hour cache pays for itself on the third hit (2× write). Each cache hit refreshes the TTL for free, so steady traffic keeps a 5m cache warm indefinitely. Reach for the 1-hour TTL only when the gap between calls is reliably longer than 5 minutes — agent side-tasks that take a while, or a chat where the user goes quiet for ten minutes at a stretch.

Pre-warming for latency-sensitive workloads

Send a max_tokens: 0 request at server startup (and on a schedule, every ~4 minutes for a 5m TTL) with your system prompt and the cache_control breakpoint. The cache is written, no output is generated, no output tokens are billed, and the first real user request lands on a warm cache — eliminating the cold-start latency tax on time-to-first-token. Particularly worth it for interactive products where the first interaction should feel snappy.

What invalidates the cache, in decreasing pain

Changing tool definitions blows away the entire cache. Flipping web-search or citations toggles invalidates system + messages. Adding or removing images anywhere invalidates messages. Changing tool_choice invalidates messages. Switching speed modes invalidates system + messages. The fix is the same in every case: keep these stable across calls that should share a cache, and put any varying knob after the breakpoint, never before it.

Minimum cacheable length

Below the model's minimum prefix length, the API silently skips caching — no error, just cache_creation_input_tokens: 0 and cache_read_input_tokens: 0 in the response. Minimums: 4,096 tokens for Opus 4.7 and Haiku 4.5, 2,048 for Sonnet 4.6, 1,024 for older Sonnet/Opus models. If your prefix sits just under the threshold, expanding it to clear the bar is often net cheaper than running uncached.