Explaining the LLM cost paper

I The problem the paper is reacting to

The naive formula, and why it lies.

There is one calculation that engineers and product managers reach for first when someone asks "how much will this LLM application cost us?" It looks like this: dollars per million tokens, multiplied by total tokens, multiplied by total queries. Three numbers, one spreadsheet cell, done.

The paper exists because this calculation can be wrong by an order of magnitude in either direction. The worked example in Section 5 shows the naive number overshooting by 6× — the cache discount on a long stable system prompt erases most of the input bill, and the naive formula misses it. But the same naive formula undershoots on a different shape of workload: tool-orchestration agents that read a 5,000-token system prompt and emit a 50-token tool call run at 70:1 to 90:1 input-to-output, where the calculator's default 6:1 assumption is off by an order of magnitude in the other direction. The error has structure, and the structure is exactly the things the simple formula leaves out. The whole paper is a careful enumeration of those missing pieces.

Before we can talk about what the paper adds, the building blocks. So let's go slowly through the basics first.

Related reading. If you want a build-along tutorial that walks you through implementing the same cost model in your own calculator (with code), see How to cost an AI agent — a build-along tutorial. If you want background on the LLM concepts referenced here (KV cache, quantization, serving frameworks, structured outputs), the hub post Understanding LLMs indexes the deep-dives.

II Foundations

What a token actually is.

A token is the unit that LLM providers charge by. It is not a word and it is not a character. It is a fragment of text — usually a few characters long. The English word "understanding" is one token. The word "uncharacteristically" might be three tokens. Punctuation is its own token. Spaces matter.

As a rough rule that you can use for back-of-envelope math: one token is about three-quarters of an English word, or about four characters. So a 750-word document is around 1,000 tokens. A 10,000-word document is around 13,000 tokens.

The two prices: input and output

Providers charge two different rates depending on which direction the text is moving.

Input tokens also called prompt tokens

Every token you send to the model. This includes your system prompt (the long instructions you give the model about who it is and what tools it has), the user's question, any documents you have retrieved and pasted in, and the results of any tools the model called earlier in the conversation. Input tokens are cheap.

Output tokens also called completion tokens

Every token the model generates back to you. The actual reply. Output tokens are expensive — typically four to eight times more expensive than input tokens. For GPT-5.2 the gap is $1.75 per million input vs $14 per million output. Output is eight times the price.

Why output costs so much more

This is not pricing arbitrage. It reflects how transformer inference actually works. Input tokens can be processed in parallel — the model reads the whole prompt and computes attention across it in one big tensor operation, all at once. Output tokens have to be generated one at a time, each conditioned on every token before it. Generating 100 output tokens means 100 sequential forward passes through the model. Reading 10,000 input tokens is a single parallel pass.

So output is bottlenecked by sequential latency. The GPU spends time on memory I/O between tokens rather than crunching compute. Per token, output consumes far more GPU-seconds than input — and the price reflects that.

A worked picture

Imagine a customer types: "What was the population of Mumbai in 2011?" Behind the scenes, your application sends the model a 4,000-token system prompt explaining its role, plus the 12-token question, plus a 600-token retrieved Wikipedia paragraph. That is 4,612 input tokens. The model replies: "The population of Mumbai in the 2011 census was approximately 12.4 million." That is about 16 output tokens. Total cost: roughly (4,612 × $1.75 + 16 × $14) / 1,000,000 = $0.0083. Less than a cent. But notice — almost all of that cost is input, not output. We will come back to this asymmetry.

III The first big correction

Prompt caching — the largest correction the paper makes.

If your application sends the same long system prompt on every request — say a 5,000-token block explaining "you are a geospatial assistant, here are your tools, here is how to format your replies, here are some examples..." — the provider eventually notices. The first time they see that block, they compute something called a KV cache (key-value cache, the model's internal representation of the prompt). They store that cache on their side. The next time you send a request that starts with the same bytes, they do not recompute. They reuse the stored work.

And they charge you a discounted rate for those cached input tokens — typically about 10% of the normal input rate. OpenAI's cached input is one-tenth of regular input. Anthropic's is similar. For the mechanics of how the cache key is constructed and invalidated, see the companion post How prompt caching works.

The two providers behave differently

OpenAI

Automatic. Any prompt over 1,024 tokens is eligible. The provider matches the longest shared prefix between your current request and recent past requests. No code change required. Just keep your system prompt stable and the cache will form naturally.

Anthropic

Opt-in, two modes. You enable caching by attaching a cache_control field. There are two ways to do this: an automatic mode that attaches a single top-level cache_control field and lets the API choose the breakpoint, and an explicit mode that marks individual content blocks (system prompt, tool schemas, specific messages). Either way, you pay 1.25× the regular input rate on the first write (the call that creates a 5-minute cache entry; 2× for a 1-hour TTL), and only 0.1× on every subsequent read.

The numerical effect

The worked example in Section 5 of the paper sees the modeled cost come out at $26,000/month while the naive formula predicts $155,000/month. The biggest single reason for that 6× gap is the cache. At an 84% effective cache rate, the application is paying full price on only 16% of its input tokens. That alone reduces input cost by roughly five-sixths.

The symbol the paper uses: r_eff

Throughout the paper, r_eff is the effective cache hit rate — the fraction of input tokens that were served from cache rather than computed fresh. If r_eff = 0.84, then 84% of input tokens hit the cache and got the cheap rate, and 16% missed and paid full price. We will see this symbol again in Equation 1, where it controls the entire input-cost term.

What's actually being cached

Transformers compute two tensors per input token, called Keys and Values (the K and V in "KV cache"). These encode what the token says and where it sits in the sequence. Every subsequent token's attention computation reads from those K and V tensors — and once computed, they never change. They only depend on the prefix up to that point.

If two requests share the same prefix (same system prompt, same earlier turns of a conversation), the K and V tensors for that prefix are identical between them. Providers store those tensors after computing them once and reuse them on subsequent requests, charging only the lookup overhead instead of the full re-computation. That is the cache discount.

Why a 1,024-token threshold on OpenAI? Caching has its own overhead — the provider has to detect that the prefix matches, look up the stored state, and verify it. For a 200-token prompt, that overhead can exceed the savings. Above about 1,024 tokens the math tips: the prefix is large enough that reusing it pays for itself.

IV Traffic shape

Not every query costs the same.

The naive formula assumes there is one "average query." Real traffic is not like that. Some questions trigger every tool in the pipeline; others get refused in two sentences; others just retrieve a document and stop. The paper bins traffic into five canonical shapes:

Shape	Input α	Output β	Cacheable?	What it means
full	1.00	1.00	yes	The complete pipeline runs end-to-end. This is the baseline; every other shape is expressed as a multiple of this.
rag	0.19	0.47	yes	Just retrieve a document and answer from it. No downstream tools. Short.
partial	0.62	0.47	yes	The pipeline runs partway and stops — perhaps a tool returns "no data" and the model replies briefly.
refusal	0.06	0.12	no	User asks something out of scope. Model declines quickly. Very cheap. Cannot cache because the prompt is shorter than the cache threshold.
heavy	1.20	1.17	yes	Long multi-turn conversation. Slightly more than full.

α_s (alpha) is the input multiplier for shape s. β_s (beta) is the output multiplier. You pick a traffic mix — say 40% full, 30% rag, 15% partial, 10% refusal, 5% heavy — and take a weighted average across the five shapes to get the blended per-query cost.

The mix matters

A single "one query" cost number lies because real applications are a mix. If 10% of your queries are cheap refusals, your average is meaningfully lower than the full-pipeline price. Conversely, if you only count refusals and forget the heavy long-context conversations, you under-estimate. The mix is part of the model.

V Segments

Who is on the other side of the API.

Different users behave differently, so the paper splits traffic into two segments and computes each separately:

Authenticated users

Logged-in users, often paying customers or institutional accounts. Long sessions. Real intent. No bot traffic — you know who they are. Sessions amortize the cache better because the same person keeps asking follow-up questions on top of the same system prompt.

Anonymous public visitors

Anyone on the open internet. Shorter sessions. A mix of curious humans, scrapers, AI crawlers, and occasional abuse. You don't know who they are. This is where the bot multiplier lives.

The bot multiplier β_bot

A note on symbols

The paper writes this as plain β, but it also uses β_s (in the shape table above) for the per-shape output multiplier — a completely different quantity. Throughout this walk-through I'll write the bot factor as β_bot to keep the two apart. Where the paper just says β, context tells you which one.

The unauthenticated segment is not all human traffic. Even after the rate limiter at your edge has done its job, a meaningful fraction of remaining requests are non-human user-agents: SEO scrapers, AI training crawlers, security probes, abuse. The paper's default is β_bot = 1.5 — meaning that for every 1.0 unit of human demand, there's another 0.5 of bot traffic on top, for a total of 1.5.

This is set to β_bot = 1 for authenticated traffic (you know there are no bots there). For public endpoints, β_bot should be estimated from the deployment's own CDN/WAF logs; the paper uses 1.5 as a starting point in the worked example, but does not claim that number generalizes.

The acronyms in this paragraph

CDN — Content Delivery Network. The edge layer between the open internet and your servers (Cloudflare, Fastly, AWS CloudFront). It sees every request before your application does. WAF — Web Application Firewall. The filtering layer inside or alongside the CDN that blocks known-bad traffic. Both have logs that classify traffic by user-agent — which is what you use to estimate β_bot honestly.

VI Equation 1, walked through

Per-query cost, one piece at a time.

Now we have everything we need to read the first equation. Here it is, exactly as it appears in the paper:

Equation 1 — per-query cost

C_s = T_m · ( I_s(1 − r_eff) p_in + I_s r_eff p_cached + O_s p_out ) / 10⁶

C_s — cost in dollars for one query of shape s.
I_s — input tokens for this shape, = α_s × baseline input tokens.
O_s — output tokens for this shape, = β_s × baseline output tokens.
r_eff — effective cache hit rate (a number between 0 and 1).
p_in, p_cached, p_out — provider rates per million tokens for normal input, cached input, and output.
T_m — tier multiplier (1.0 standard, 0.5 batch, 2.5 priority).
10⁶ — because the prices are quoted "per million tokens," we divide.

Read this as three additions inside the parentheses, because that is what it is:

I_s × (1 − r_eff) × p_in — the input tokens that missed the cache, charged at full input rate. If r_eff = 0.84, then (1 − 0.84) = 0.16, so 16% of input tokens fall into this term.
I_s × r_eff × p_cached — the input tokens that hit the cache, charged at the discounted cached rate. 84% of input tokens fall into this term.
O_s × p_out — output tokens. These have no cache concept; what the model generates is always fresh. Output is charged at its (higher) output rate.

Add those three numbers, divide by one million (because rates are per million tokens), multiply by the tier multiplier T_m, and you have the cost in dollars for one query of one shape.

A worked numeric example

Take one full-pipeline query on GPT-5.2 standard tier: I_s = 5,000 input tokens, O_s = 500 output tokens, r_eff = 0.84, T_m = 1.0. Provider rates: p_in = $1.75/M, p_cached = $0.175/M, p_out = $14/M.

Plug in:

Fresh input: 5,000 × (1 − 0.84) × $1.75/M = 5,000 × 0.16 × $1.75/M = $0.00140
Cached input: 5,000 × 0.84 × $0.175/M = $0.00074
Output: 500 × $14/M = $0.00700

Total: $0.00914 per query. Almost 77% of that is output, even though output is only 9% of the tokens — the 8× output-vs-input rate gap, compounded by the 84% input-side cache discount. This is the asymmetry the interactive widget below lets you push around.

Tiers, and what T_m does

Providers offer different tiers at different prices:

Standard — the default. T_m = 1.0.
Batch (OpenAI) or Flex — your job runs whenever the provider has spare capacity, usually within 24 hours. Half price. T_m = 0.5. Good for offline analytics.
Priority — guaranteed low latency. More expensive. T_m = 2.5. Good for user-facing applications during peak hours.

For cache-ineligible shapes like refusal, the paper sets r_eff = 0 regardless of session length, because the refusal pipeline never accumulates a cacheable prefix.

Try it · Where does the money go in one query?

Input tokens 5,000 Output tokens 500 Cache hit rate 84%

At default settings, output costs more than all input combined — even though output is only 10% of the tokens. The 8× output rate compounds with the 84% cache discount on input.

VII Equation 2

One cached-input rate is two prices in a trench-coat.

Equation 1 treats p_cached as a single number — the cached-input rate per million tokens. In steady state, on OpenAI, that is roughly correct: OpenAI's cached read price is ~10% of the regular input price, and there is no write surcharge. But Anthropic charges differently: a cache write (the first time a particular prefix is sent) costs 1.25× the regular input rate, while a cache read (every subsequent call that hits that prefix) costs only 0.1× the regular rate.

If a fraction w of cached input tokens hit a cache that had to be created from scratch (writes), and (1 − w) hit a cache that was already warm (reads), the effective per-million rate is a weighted blend:

Equation 2 — effective cached-input rate

p_cached,eff = w · p_write + (1 − w) · p_read

w — cache-write share. The fraction of cached input tokens charged at the write rate rather than the read rate. Higher when caches rotate often (new conversations, prompt changes); lower in steady state with a long-lived cache.
p_write — cache creation rate. For OpenAI this equals the regular input rate p_in (no surcharge); for Anthropic it is 1.25 × p_in for a 5-minute cache (2× for a 1-hour cache).
p_read — cache hit rate. Roughly 0.1 × p_in for both providers.
p_cached,eff — the blended rate that gets plugged into Equation 1 as p_cached.

For OpenAI in steady state, w ≈ 0 and the blend collapses to just p_read — which is why Equation 1 looks like it has only one cached price. For Anthropic deployments the picture is more interesting. The paper's calculator originally exposed w as an operator-supplied slider with a default of w = 0.10, but the actual measurement (May 14, 2026 run of the cached-pipeline-anthropic scenario against Claude Sonnet 4.5, N=3 repeats × 6 turns = 18 LLM calls) found w ≈ 0.20 for a representative multi-turn deployment — roughly twice the calculator's default.

The unexpected component is a small cache-creation surcharge (~215 tokens) that Anthropic applies on each warm call to extend the 5-minute TTL, on top of the cold-write cost on turn 1. So even after the cache is warmed, Anthropic continues to charge a small write penalty on every read. The paper now treats w as deployment-specific (governed by how often the cache rotates relative to the TTL relative to traffic shape) rather than as a calibrated default; operators on Anthropic should anchor closer to w = 0.20 than to w = 0.10 for multi-turn workloads.

Try it · Cache write/read blend by provider

Cache-write share w 5%

pin: $1.75/M
pread: $0.175/M
pwrite: $1.75/M
pcached,eff: $0.175/M

At w=5% on OpenAI, the blend is barely different from the steady-state read rate. OpenAI charges the same for cache writes and uncached input (no surcharge), so the effective cached rate stays near 0.1×p_in.

VIII Equation 3

Cache rate grows with the conversation.

One missing piece in Equation 1 is: where does r_eff come from? It is not a constant. It depends on how long the conversation is. A single-turn cold visit cannot cache much, because nothing has been computed yet. A ten-turn conversation has built up a long shared prefix; almost everything except the latest user message is in the cache.

Equation 3 captures this with a simple linear rule:

Equation 3 — effective cache rate

r_eff(q) = clamp( r_baseline + 0.01·(q − 6), 0.50, 0.94 )

q — average number of questions per session for this segment.
r_baseline — the deployment-specific anchor cache rate, measured at q = 6 turns. The paper uses 0.84.
0.01 — the slope. Each additional turn beyond 6 adds 1 percentage point to the cache rate.
clamp(...,0.50, 0.94) — forces the result to stay between 50% and 94%. No matter how long the session, you can't go below 50% (single-turn floor: only the system prompt is shared) or above 94% (long-session ceiling: each turn still introduces a fresh, uncacheable user message). The 0.94 ceiling is an empirical bound — the paper's harness sees OpenAI's gpt-4o-mini saturate in the 0.94–0.97 range on long multi-turn chats (Section 4, Finding 1); the model takes the lower edge of that range as a conservative upper limit.

Worked numbers

If r_baseline = 0.84:

q = 5 questions/session: 0.84 + 0.01 × (5 − 6) = 0.84 − 0.01 = 0.83. Short sessions amortize slightly worse than the 6-turn anchor.
q = 6: 0.84 exactly (this is the anchor).
q = 10: 0.84 + 0.01 × 4 = 0.88. Longer sessions amortize better.
q = 1: 0.84 − 0.05 = 0.79. Still high — but note the clamp at 0.50, which would kick in for very low r_baseline.

Try it · Cache rate as a function of session length

Questions / session (q) 6 r_baseline at q=6 0.84

effective cache rate 0.840

Move the sliders to see how session length shifts the cache hit rate. The line is r_baseline + 0.01·(q−6), but it can't drop below 50% (single-turn floor) or climb above 94% (every turn still introduces fresh user text).

A weak point the paper admits

The 0.01 slope is described in the paper as "a heuristic, not a measured universal constant." Translation: AJ chose it because it produced plausible numbers, not because he measured it. If you adopt this model for a different application, you would want to actually plot cache rate vs session length on your own logs and fit your own slope. The paper says this honestly. A reviewer might still push back, but the disclosure is the right move.

IX Equation 4

How many queries per month.

This one is just bookkeeping, and it is the most concrete equation in the paper.

Equation 4 — monthly query volume

Q = MAU · (sessions/day) · 30 · (questions/session) · β_bot

Q — total queries per month for this segment.
MAU — Monthly Active Users in this segment. The standard product-analytics number.
sessions/day — average visits per active user per day. Typically a number well below 1 (e.g., 0.2, meaning each user comes back every 5 days).
30 — days in a month, roughly.
questions/session — average length of a session in user turns.
β_bot — bot multiplier (1.0 for authenticated, 1.5+ for public anonymous). Earlier drafts of the paper used plain β. The final paper uses β_bot explicitly to avoid confusion with β_s, the per-shape output multiplier from Section 3 / Table 1.

The paper's two segments, numerically

From the worked example in Section 5:

Authenticated

500 users × 0.2 sessions/day × 30 days × 5 questions × 1.0 = 15,000 queries/month

Anonymous (10K MAU)

10,000 users × 0.2 sessions/day × 30 days × 10 questions × 1.5 = 900,000 queries/month

Total: 915,000 queries per month. Notice that the anonymous segment dominates — 60× more queries than the authenticated segment, despite being only 20× more users. The bot multiplier and the longer sessions both push it up.

X The spend cap

The daily spend cap, and what 429 means.

This section of the paper is one of the most important and the easiest to misread. So I want to be very careful.

If you put an LLM behind a public endpoint without protection, the worst-case scenarios are real and they have happened to production teams: a viral tweet directs millions of curious visitors to your site; a misconfigured client sends the same request in an infinite loop; a competitor pays for a scraping service to crawl every page. Without a safety net, you wake up to a $50,000 surprise bill. Engineers who have lived through this once never deploy without a cap again.

What the cap actually is

A daily spend cap is a hard limit at the API-gateway level. Concretely: a script tracks your daily LLM spend in real time, and when that running total crosses, say, $1,500 in a 24-hour window, the gateway stops forwarding requests to the model. Instead, it returns an HTTP 429 status code to the caller.

HTTP 429 Too Many Requests

A standard HTTP status code meaning "you are being rate-limited; try later." The user receives an error response, not a model answer. The application typically shows them a friendly "service is busy, please try again in a few minutes" message. The cap stays in place until the day rolls over (typically midnight UTC), at which point the spend counter resets and requests start flowing again.

The mechanical consequence

Mechanically, the cap turns the monthly LLM bill into a step function that can never exceed cap × days. If the cap is $1,500/day and the month has 30 days, you cannot pay more than $45,000 for LLM tokens, no matter how much demand shows up at your door. In symbols:

Spend cap as an upper bound on monthly LLM cost

LLM_api ≤ cap_daily × days/month

LLM_api — actual monthly LLM bill from Equation 1, summed across all shapes and segments.
cap_daily — the per-day spend ceiling enforced by the gateway.
days/month — 30, give or take.

The paper also allows operators to model a small number of burst days per month (the daily cap is still enforced, but a few "press-event" days are allowed at the elevated daily cap). The shape doesn't change: a hard cliff. Demand above the cliff doesn't get served — it gets HTTP 429'd.

The crucial implication: refused queries

This is the move that the rest of the paper hinges on. When you say "our LLM bill last month was $45,000", that statement is incomplete without also saying "and we refused 64% of incoming queries with HTTP 429 after hitting our daily cap."

$45,000/month with 2.9 million queries refused is a different procurement decision than $125,000/month with all queries served.

This is the paper's most important insight, and the rest of the paper builds the machinery to make this comparison honestly.

XI Equation 5 — the headline

The seven layers of a real monthly bill.

Equation 5 is where the paper stops talking about just LLM tokens and starts talking about the actual monthly bill that hits your finance team. Tokens are one line. There are six more.

Equation 5 — monthly headline cost

Headline/mo = LLM_api · (1 + 1.5r) + V + E + P + A + F + C

LLM_api — the LLM provider bill (sum of Equation 1 across all shapes, segments, and days).
r — fraction of calls that fail and have to be retried. (1 + 1.5r) is the retry penalty.
V, E, P, A, F, C — six additional monthly cost categories, defined below.

The retry penalty: where the 1.5 comes from

If r = 0.05 (5% of calls fail and need retrying), each retry charges full input tokens again plus a partial output (the model wrote something before the call failed). The 1.5 coefficient comes from a back-of-envelope of where the retry cost actually lands: ~1.0× a fresh input on the re-sent prompt (full input tokens charged again, with the cache discount on the stable prefix) + ~0.5× of an average output (the model typically streamed about half the response before the failure). It's an honest estimate, not a measured constant; the paper flags that production teams should refine it with their own observed retry traces. With r = 0.05, the LLM bill inflates by (1 + 1.5 × 0.05) = 1.075, or 7.5%.

The six additional categories

V — Verification

A second LLM pass that fact-checks the first model's reply. The technique is: split the reply into atomic facts (single factual claims), then for each atomic fact, ask a smaller model to do NLI — Natural Language Inference — checking whether the claim is supported by the retrieved source documents. Catches hallucinations. Optional but standard for high-stakes deployments like medical, legal, or government applications. Costs another full LLM call per response.

E — Embeddings (RAG)

If your application uses RAG (Retrieval-Augmented Generation — retrieving documents from a knowledge base to ground the model's answers), you need to turn every document into a vector and store it. This is embedding. There are two costs: the one-time cost of embedding your entire corpus (amortized monthly), plus the per-query cost of embedding the user's question to search against the corpus.

P — Personnel

FTE-allocated cost of the humans who keep this thing running. FTE means Full-Time Equivalent: one person working full-time on this project is 1.0 FTE; two people working half-time is also 1.0 FTE. The paper counts MLOps engineers, prompt engineers (people who refine the system prompt), and SMEs — Subject Matter Experts who validate domain-specific output.

A — Agent-engineering amortization

The upfront design work that went into making the agent work in the first place — writing the system prompt, defining the tools, testing edge cases, getting the orchestration right. This is a one-time cost, but it's spread (amortized) over the agent's useful life. If design took $36,000 and the agent will run for 36 months, A = $1,000/month.

F — Federal compliance

A bundle of costs that apply when you deploy for the US government. ATO = Authority to Operate, the FedRAMP certification process that gives you permission to handle government data. Audit-log retention (required by regulation). PII redaction = automated removal of Personally Identifiable Information from logs and inputs. Egress fees for moving data across networks. For most commercial deployments F = 0.

C — Fixed infrastructure

The "running the lights" category. Things that exist whether or not anyone uses the application: control plane (the servers that orchestrate everything), RDS (AWS's managed relational database), ElastiCache (managed Redis caching), observability tooling (logs, metrics, traces), KMS (AWS Key Management Service, for encryption), basic networking baseline.

Composable

Any one of these layers can be zero, and most are zero for a small commercial chatbot. But for a serious deployment, ignoring even one can throw off your procurement estimate by tens of thousands of dollars per month. A small commercial chatbot pays only LLM and C. A FedRAMP-bound multi-agent government system pays all seven. The model lets operators turn layers on and off to match what they're actually deploying.

Try it · Where the monthly headline comes from

Retry rate r 5% Verifier coverage 0%

At default settings (5% retry, 0% verifier, FedRAMP Moderate), the headline is dominated by the LLM line. The retry multiplier (1 + 1.5r = 1.075) and the federal premium (×1.15) together inflate the API bill by ~24% before any of the other six categories are added.

XII The alternative

Self-host: renting GPUs instead of paying per token.

The paper's second half compares paying OpenAI/Anthropic per token (the API approach) against running the model yourself on rented GPUs (the self-host approach). Before we read the equations, let's understand what self-hosting actually involves.

What you rent

You rent GPU instances from a cloud provider — typically AWS, but also GCP, Azure, Lambda Labs, CoreWeave. A common configuration is one NVIDIA L4 or A10 or H100 GPU attached to a virtual machine. AWS calls these g6.xlarge, g5.xlarge, p5.48xlarge and similar. You pay an hourly rate whether you're using the GPU or not. Costs range from $0.50/hour (small GPU) to $30+/hour (high-end H100).

What you run

On that GPU, you run an open-weight model — a model whose weights have been released publicly. Llama 3, Qwen, Mistral, DeepSeek. These are usually not as capable as contemporary commercial frontier models, but they are good enough for many tasks and much cheaper per token at scale — if you can keep the GPU busy.

The serving stack: vLLM

vLLM is an open-source inference server. It is not the only one (others include TGI, SGLang, TensorRT-LLM) but it is the most widely cited in the paper. vLLM's contribution is something called PagedAttention — a way of managing GPU memory that allows many requests to be processed at the same time on the same GPU. This is called continuous batching and it is the key to high throughput. For the serving-framework landscape (vLLM vs TGI vs SGLang vs TensorRT-LLM), see the companion post LLM serving frameworks.

Throughput: ρ (rho)

The paper uses the symbol ρ for an instance's aggregate sustained throughput in tokens per second — summed across all GPUs in that instance. The worked-example default is ρ = 1,200 tok/s, which is the published aggregate throughput of a g6e.12xl (a 4-GPU L40S instance) running a 70B-class open-weight model under vLLM-style continuous batching. The paper is careful that this is not a hardware constant — actual throughput depends on:

Prompt length — longer prompts take more compute to process.
Output length — output generation is token-by-token and slower.
Batch size — how many requests are being processed simultaneously.
Quantization — running the model at lower numerical precision (e.g., 8-bit instead of 16-bit) for speed.
KV-cache pressure — how much GPU memory the in-flight requests' caches are using.
Latency SLO — Service Level Objective. If you promise users a fast response, you can't batch as aggressively.

The gap between benchmark and production

Even when ρ is benchmarked correctly, production often sustains less than the benchmark unless traffic is well-packed and the serving stack is tuned. Requests arrive irregularly. The KV-cache fills up and evicts entries. The serving stack adds its own overhead. The paper formalizes this gap with a separate symbol — η (eta), a throughput-derating factor — which appears next to ρ in Equation 6. η = 1.0 means "we assume the benchmark holds." η = 0.75 means "we expect to sustain 75% of it in real serving." This is the same idea older calculators called utilization, but separated from sizing math so the assumption is visible.

A self-host fleet sized for peak demand sits mostly idle off-peak. Whether that idle time is "wasted" depends entirely on how honestly you account for it.

XIII Equation 6

How many GPUs do you need.

Naively you might think: total tokens per month, divided by what one GPU can produce per month, rounded up. The paper doesn't size that way, and the reason is one of those production-experience details that catches first-time operators out.

A real workload isn't flat across the day. Public traffic is roughly four times as heavy during business hours as it is overnight. A fleet sized for the monthly average will queue badly at lunchtime and sit half-idle at 3am. So you size to the peak, not the mean. And then you add safety headroom on top, because the peak you measured last month is not the peak you'll see next month.

Equation 6 — required instance count

n = max( ⌈ T_peak / (ρ · η) ⌉, n_min )

T_peak — peak required throughput = (τ · Q_tot / s) · d · h.
τ · Q_tot / s — mean demand in tokens per second (total monthly tokens divided by seconds in a month).
d — diurnal peak-to-mean ratio. Default 4 for public-facing chat: business-hour load is about 4× the overnight load.
h — static headroom for unexpected spikes. Default 1.5.
ρ — instance aggregate throughput in tokens per second, summed across all GPUs in one instance.
η (eta) — throughput-derating factor for serving inefficiencies not captured in ρ alone (queue contention, KV-cache eviction under load). 1.0 in Optimistic mode, 0.75 in Realistic.
n_min — high-availability replica floor. Default 2, so even tiny workloads run with redundancy.
⌈ ⌉ — the ceiling function. You can't rent 4.7 GPUs; you rent 5.

Walking through a real number

Take the 50,000-MAU stress workload: 4.5 million queries per month, 2,000 tokens per query average. ρ = 1,200 tok/s (a g6e.12xl instance with 4×L40S GPUs running a 70B model). Optimistic mode: η = 1.0.

Mean demand: 2,000 × 4,500,000 / 2,592,000 = 3,484 tok/s.
Peak demand: 3,484 × 4 × 1.5 = 20,903 tok/s.
Instances required: ⌈20,903 / 1,200⌉ = 18 instances.

The paper's Section 5.2 reports exactly this: 18 instances Optimistic for τ = 2,000. Notice the work the d × h factor is doing — without the 6× peak-and-headroom multiplier, the same workload sizes to only 3 instances, which would under-provision badly during peak hours and is not a defensible procurement number.

Why the simple formula misleads

Simpler calculators often use the form n = ⌈τ·Q_tot / C⌉ where C is monthly capacity per instance. That formula treats demand as if it were perfectly flat across 24 hours. For workloads with real diurnal swings it under-sizes the fleet by roughly d × h × η, which on the paper's defaults is a factor of 6. The cost difference is the difference between a self-host quote that survives procurement review and one that doesn't.

Once you know n, total monthly self-host cost is:

Monthly self-host cost = n · c_instance + c_fixed

c_instance — monthly cost of one GPU instance (hourly rate × 730).
c_fixed — control plane, observability, FTE amortization, and one-time setup spread over a chosen horizon.

XIV Operating assumptions

Optimistic vs Realistic.

Self-host break-even calculations in industry blog posts almost always pick one set of operating assumptions and present a single answer. But the answer depends so heavily on those assumptions that any single number misleads. The paper exposes two named presets, each holding three knobs together:

Optimistic mode

η = 1.0 — the benchmark throughput ρ is sustainable in production. No derating for queue contention or KV-cache pressure.

FTE allocation = 0 — no dedicated personnel cost. GPU operations are absorbed into an existing team's workload.

Setup amortization = 36 months — the upfront engineering investment is spread thin over three years.

Brackets a well-packed, multi-workload fleet. Best case for self-host.

Realistic mode

η = 0.75 — a 25% derate. Real serving often sustains less than the benchmark; queues build during peaks, the KV-cache evicts under load.

FTE allocation = 0.5 SRE — half a Site Reliability Engineer keeping the fleet healthy.

Setup amortization = 12 months — one year, honest if you're not sure the project will survive longer.

Brackets a single-tenant fleet sized for its own peaks. Typical case.

What η actually captures

The throughput-derating factor is the paper's way of formalizing the gap between "what the GPU benchmark says" and "what your fleet sustains in production." Benchmarks are run with carefully chosen prompt lengths, batch sizes, and concurrency. Production deployments rarely match those conditions exactly: requests arrive irregularly, the KV-cache fills up and evicts entries, the serving stack adds its own overhead. η = 0.75 is the paper's anchor for "this is what you actually get." η = 1.0 is "we assume best case."

"The procurement decision depends on knowing which side of that line your team sits on — and most teams overestimate their side."

The paper doesn't say Realistic is right. It says: if your team can plausibly hit Optimistic conditions, self-host has a better story. If you can only honestly claim Realistic, API has a better story. Optimistic is the answer for a team running a multi-tenant, well-packed fleet with no incremental headcount. Realistic is the answer for a team standing up its own dedicated cluster, paying for an SRE to keep it alive, on a one-year horizon. Almost every team that quotes a self-host number on LinkedIn is implicitly claiming Optimistic conditions; almost none of them can deliver on that claim in production.

XV The 8× lever

Tool-response architecture: what the LLM actually sees.

Before walking into the equal-budget comparison, one piece of background. The paper's largest operational finding (separate from the API-vs-self-host question) is that two real production deployments running the same agent code on the same provider can have monthly costs that differ by 8×. The variable that moves the cost by 8× is not which provider you use, or how aggressive your cache is, or how many GPUs you've sized for. It's whether your agent shows raw tool returns to the LLM, or whether an intermediate layer summarizes them first. For the mechanics of how tool/function calls are dispatched, see the companion post Tool use & function calling.

Imagine a tool that searches a satellite-imagery catalog. The user asks "find Landsat scenes over New York City for October 2024." The tool runs, comes back with 12 STAC items (a structured-metadata format used by raster archives). Each item is a JSON object with around 2,000 tokens of metadata — geometry polygons, band-by-band statistics, asset URLs, license fields, processing-level annotations, and so on. The agent now has to do something with these 12 items. There are two patterns.

Pattern A · templated tool returns

The 12 items go into the agent's internal state, not into the LLM context. The agent's intermediate layer extracts a short status string — "found 12 Landsat-8 scenes over NYC, Oct 1–31 2024" — and only that string goes back to the LLM. The structured items get rendered to the user through a UI template (a results panel, a map overlay) that knows how to display them deterministically. The LLM is told what happened, not everything that happened. Per-turn LLM input on this workload, measured: 3,342 tokens.

Pattern B · freeform tool returns

The 12 items are serialized and concatenated into the LLM's context window. The LLM reads all of them and writes a natural-language summary citing the specific fields the user might care about. This is the default in chat-style agent frameworks (LangChain ReAct, OpenAI Assistants, single-prompt tool loops) where the LLM also generates the user-facing response. The LLM sees everything that happened. Per-turn LLM input on the same workload, measured: 22,798 tokens.

Same agent topology, same provider, same model, same workload — only the architecture of how tool returns reach the LLM differs. The paper measured both endpoints against live OpenAI gpt-5.2 (the templated run is N=20 sessions / 238 LLM calls; the freeform run is N=5 sessions / 60 calls). The measured per-query blended cost is $0.00178 templated vs $0.01392 freeform, a paired 7.8× spread. At 915K queries/month the templated deployment costs $1,629/mo; the freeform deployment costs $12,737/mo (both pre-FedRAMP — applying the optional Eq. 5 compliance multiplier brings them to $1,873 and $14,648).

Three things to notice:

This is a procurement-scale lever, not a tuning knob. It's an architectural choice the team makes once when designing the agent. Switching from freeform to templated requires building a response-template layer that knows how to display each tool's structured output through a deterministic UI; that's an engineering project, not a flag.
Not every deployment can adopt the templated pattern. If your product is a pure chat interface where the LLM also writes the user-facing prose, you can't strip structured payloads out of the LLM's context — the LLM needs to read the data to write the response. The freeform anchor is the realistic operating point for chat-first products.
Self-host costs are roughly architecture-invariant in the modeled range (more carefully: held fixed at τ=2,000 tokens/query for the equal-budget table). Self-host GPU sizing is driven by total token throughput which is dominated by output tokens (similar in both architectures) and a workload-weighted blend of input. The API rate-card asymmetry that produces the 8× spread on the API side hits self-host as additional GPU-minutes, not as a price multiplier.

This is why the equal-budget comparison table in the next section splits the API rows by tool-response architecture but reports self-host as a single column. The templated/freeform split is where the operational decision lives.

XVI The central move

Equal-budget comparison: the move that breaks open the debate.

Most LinkedIn and blog-post comparisons read like this: "API costs $45K/month. Self-host costs $86K/month. Therefore API is cheaper."

The paper points out that this is a category error. The $45K API figure is achievable only because the spend cap is refusing 64% of incoming queries. The $86K self-host figure serves every query that comes in. They are not the same service. Comparing them is like comparing the cost of a restaurant that serves everyone against the cost of a restaurant that turns 64% of customers away at the door.

The paper's proposed fix:

Fix the budget. Pick a daily spend cap, say $1,500/day.
Apply that cap to both strategies. Cap API spend at the cap. Cap self-host spend at the same cap, which translates to "rent fewer GPUs."
Ask: how many queries does each strategy serve within that budget? Now you're comparing service levels at equal cost, which is the fair comparison.

The strategy table (Table 7 in the paper)

The current paper splits the API rows by tool-response architecture (templated vs freeform — see Chapter XV, just above) and keeps two self-host operating modes. Eight rows total.

At 50,000 anonymous MAU (4.5 million queries/month of demand), the paper computes:

Strategy	Monthly cost	Service level
API, templated tool returns
uncapped	$8,037	all 4.52M served (cap unused)
capped $1,500/day	$8,037	all 4.52M served (cap does not bind)
API, freeform tool returns
uncapped	$62,849	all 4.52M served
capped $1,500/day	$45,000	3.23M served / 1.28M refused (28%)
Self-host (τ=2,000 tokens/q fixed)
full service, Optimistic	$85,553	all served, with quality trade-off
full service, Realistic	$141,269	all served, with quality trade-off
capped $1,500/day, Optimistic	$44,202	2.33M served / 2.18M refused
capped $1,500/day, Realistic	$43,786	0.97M served / 3.54M refused

Reading this table carefully

The headline finding is that API service wins decisively under every evaluated row at this scale: API-templated runs at $8K/month and never touches the cap, API-freeform runs at $63K/month uncapped or saturates the cap and refuses 28% of public traffic at $45K, and both numbers are cheaper than the self-host alternatives at the same daily budget. Within the API column, the choice between templated and freeform tool returns is an 8× cost lever on the same user-facing workload.

The self-host rows hold τ=2,000 tokens/query fixed so that the table isolates the API-side tool-response architecture effect (a different self-host re-sizing would be needed to size GPUs for the freeform 22,798-token-per-turn anchor). Under that fixed-τ assumption, the freeform-capped API row serves 3.23M queries within the $45K budget; the corresponding self-host-capped rows serve 2.33M (Optimistic) and 0.97M (Realistic) at the same monthly spend.

An earlier draft of this paper concluded the comparison "flips on operating regime"

Earlier drafts of the paper anchored the API rows against an 84,490-input-token structural-ceiling per-query cost, which gave $45,000/month for API-capped and made the Optimistic-mode self-host-capped row look more competitive. Once both endpoints were measured empirically (templated and freeform), the comparison stopped flipping: API service wins in both modes, by a wider margin than the structural-ceiling anchor implied. The widget below visualizes the earlier-draft structural-ceiling regime; it's preserved for continuity, but the headline finding now sits in the table above.

Structural-ceiling regime · Optimistic ↔ Realistic

Optimistic mode. Self-host capped to the same $44K/mo budget serves 2.33M queries — 44% more than API capped's 1.62M. If your team can run at high utilization with no FTE allocation, self-host wins the equal-budget comparison.

The widget reproduces the earlier-draft analysis at the structural-ceiling anchor: under those assumptions, capped self-host in Optimistic mode appears to serve more queries than capped API, while Realistic mode flips the result. That was the framing in the original release. With both API endpoints now measured empirically against live OpenAI gpt-5.2, the structural-ceiling anchor no longer drives the comparison.

API-templated: serves all 4.52M queries at $8,037/month. The daily cap is never reached. This is the floor — the cheapest defensible operating point for this workload on this provider.
API-freeform: serves all 4.52M queries at $62,849/month uncapped, or 3.23M served (28% refused) at $45,000/month under the same $1,500/day cap. The 8× spread vs templated comes entirely from how much tool-return content the LLM has to read on each turn.
Self-host capped: 2.33M served (Optimistic) or 0.97M served (Realistic) at the same $45K budget — both lose to API-freeform-capped's 3.23M, and dramatically lose to API-templated, which doesn't even touch the cap.

For this workload, the operating decision that moves cost by 8× is whether tool outputs go to the LLM raw or through a response-template layer. The API-vs-self-host choice runs second.

XVII Empirical calibration

Section 4: measured, not guessed.

So far we have been treating the coefficients in the equations (cache rate, traffic shape multipliers, retry rate) as if they were just numbers. Section 4 of the paper is about where those numbers come from. The short answer: a harness called agent-cost-bench that runs real LLM API calls and records what happened.

What "the harness" means

A benchmark harness is a piece of software that exercises a system in controlled, repeatable ways and collects measurements. The paper's harness is written in Python and uses the same libraries production teams use, on purpose:

LiteLLM

A Python library that wraps different LLM providers (OpenAI, Anthropic, Google, local models) behind a single uniform API. You can change one line of config to switch providers and the rest of your code stays the same. LiteLLM logs the exact request body sent to the provider — which is what the harness uses for reviewer-checkable evidence.

LangGraph

A library for orchestrating multi-agent workflows — making one LLM call the output of another in a defined graph. The paper uses LangGraph because real production agents are usually built on it (or similar), so measurements taken through LangGraph transfer to real applications. The alternative — writing custom orchestration code — produces measurements that nobody can apply to their own system.

OpenTelemetry GenAI semconv

OpenTelemetry is the industry-standard observability framework (it produces traces, logs, metrics). GenAI semconv means "semantic conventions for generative AI" — a recent extension that defines what fields an LLM-related trace should carry (model name, token counts, latency, provider response ID, etc.). Using this standard means anyone who runs the harness produces traces in the same format. A reviewer can cross-reference any measurement against the provider's own audit log using the response.id field.

The scenarios

The harness runs nine scenarios across 174 real LLM calls. The total cost was $0.224 — twenty-two cents — which is itself a useful data point: empirical calibration of cost coefficients is not expensive.

The scenarios cover what the paper calls different topologies — the shapes of how LLM calls relate to each other:

Sequential pipeline — one stage runs, its output becomes the next stage's input, repeat. Measures handoff overhead and cumulative context growth.
Streaming pipeline — output tokens stream back in real time. Measures time-to-first-token and steady-state output rate.
Tool-chain single agent — one agent that calls many tools in sequence. Measures schema overhead and tool-result token costs.
Multi-turn chat with long shared system prompt — measures cache hit rate on both OpenAI and Anthropic.
Parallel orchestrator — one supervisor calls three specialists at the same time, waits for all, combines their answers. Measures fan-out cache behavior.

Four findings the paper reports

Finding 1: cache hit rate depends on provider and topology

Provider / scenario	Median	Cold (turn 0)	Warm (turn 1+)
OpenAI gpt-4o-mini, multi-turn chat	0.91	0.61	0.94–0.97
Anthropic claude-sonnet-4-5	0.77	0.00	0.85–0.92
OpenAI parallel fan-out	0.60	0.00	0.72–0.85

Reading this: on the same workload, OpenAI's median cache rate is 14 percentage points above Anthropic's. Parallel topologies suppress hit rate (because each parallel call's prompt is shorter and less prefix-shareable).

Finding 2: input-to-output ratio varies wildly by topology

The "average" chat workload has roughly 6:1 input-to-output. But tool-orchestration agents — agents that read a long system prompt full of tool schemas and emit short tool-call instructions — can run at 70:1 or even 90:1. The paper observes 73:1 on gpt-4o-mini and 88:1 on gpt-5.2 for the same workload.

Implication: if your calculator assumes 6:1, you understate input cost by roughly 12× on this class of workload.

Finding 3: handoff overhead grows with stage count

Some calculators use a flat "200 tokens per handoff" assumption. The paper measures a 5-stage pipeline with cumulative input growing as 175 → 523 → 1,243 → 2,038 → 2,854 tokens. The flat assumption is off by 3–4× on deep pipelines because each stage's output becomes the next stage's input and accumulates.

Finding 4: tool-response architecture is an 8× cost lever

This is the finding that emerged after the v0.1.0 pilot, and it's the largest operational lever in the model. On the same agent, same tools, same 6-turn session, same gpt-5.2 Standard tier, the paper measured two endpoints of the tool-response architecture choice (see Chapter XV for the full setup):

Mode	sess.	calls	in/turn	cache	$/q
Templated tool returns	20	238	3,342	0.883	$0.00178
Freeform tool returns (heavy)	5	60	22,798	0.744	$0.01392

Paired 7.8× spread on per-query blended cost. The freeform anchor is N=5 sessions versus N=20 for templated; the paper presents this as an order-of-magnitude finding rather than a precise 7.8× constant, but the procurement-scale implication is clear: turning on response templating shifts the cost band by roughly an order of magnitude on the same user-facing workload.

What the harness measures, and what it doesn't

The paper draws a line that's worth quoting here — between coefficients the harness re-measures empirically against real provider APIs, and coefficients that are workload-specific operator inputs the harness can't see. This separation matters for a procurement reviewer: it tells them which numbers are reproducible and which are assumption-laden.

Measured by the harness	Configured by the operator
Cache hit rate (per provider, per topology) Input/output token ratio (per shape) Sequential-handoff overhead Retry behavior under throttling Cache-write share for the chosen cache policy	Personnel cost (FTE × loaded salary) Compliance / ATO / audit cost Non-LLM infrastructure (rate-limiting, observability) Bot multiplier β_bot (depends on traffic mix) Rate-card snapshot (provider prices) Self-host GPU throughput ρ and derate η Quality-gap penalty for self-host vs frontier model

The right way to read this: the empirical column is where the paper claims reviewer-checkable evidence. The configured column is where operators have to defend their own assumptions — and where the paper explicitly says "if your workload differs, re-fit these against your own logged traffic before signing a procurement contract." A common failure mode in older cost calculators is conflating the two: presenting an operator's bot-multiplier guess as if it were a measurement, or running a self-host estimate at η = 1.0 (benchmark throughput) without flagging that as an optimistic assumption.

XVIII Equation 7

Handoff overhead, captured as a sum.

The paper proposes a replacement for the flat handoff assumption:

Equation 7 — handoff overhead for an N-stage pipeline

handoff_overhead(N) ≈ Σ_i<N output_tokens_i

N — total number of stages in the pipeline.
output_tokens_i — the average output (in tokens) of stage i.
Σ_i<N — sum over all stages before the last one.

In words: the handoff cost at any stage is the total of all previous stages' outputs, because that's what gets concatenated into the running context as the pipeline progresses. This gets folded into Equation 1 as additional input tokens charged at the non-cached rate (because new handoffs aren't yet cached).

The paper is appropriately careful: "We measured one pipeline; we do not claim the linear-in-N form is universal."

Try it · Pipeline-depth handoff vs the flat assumption

Pipeline depth N 5 stages

At N=5 stages (the paper's measured pipeline), Eq. 7 predicts roughly 2,854 cumulative input tokens at the final stage vs. ~800 for a flat 200-tokens-per-handoff approximation. The flat number understates input cost by roughly 3.5× on a 5-stage pipeline, and the gap grows with depth.

XIX The worked example

Reading Section 5 slowly.

The worked example anchors the whole paper. Let's walk through it one piece at a time.

What the application is

An anonymized geospatial Q&A service. Users ask natural-language questions about Earth-science data ("How much rainfall did Mumbai get last October?"). The system answers by chaining five tools:

Place resolution — turns "Mumbai" into a bounding polygon (latitude/longitude box).
Time parsing — turns "last October" into an ISO date range (2025-10-01 to 2025-10-31).
Dataset search — finds which Earth-observation collections contain data matching the place and time.
Item lookup — within those collections, finds the specific files.
Statistics — computes the requested aggregate (sum, mean, etc.) over the resolved files.

A representative single session in the paper's worked example processes about 84,000 input tokens (massive — that's the long system prompt plus the tool schemas plus the tool results) and emits about 850 output tokens (small — short tool dispatches and a concise final answer). The input-to-output ratio is about 99:1 — extreme output suppression, classic tool-orchestration.

Three operating points for the same agent

The same agent topology produces three different per-query cost anchors depending on how tool returns flow into the LLM's context. The paper now reports all three:

Templated (measured floor): 3,342 input tokens/turn, 0.883 cache rate, $1,629/mo at 915K queries (pre-FedRAMP). Tool returns are intercepted by a response-template layer; the LLM only sees short status strings. Validated against live gpt-5.2, N=20 session repeats, 238 LLM calls.
Freeform (measured upper): 22,798 input tokens/turn, 0.744 cache rate, $12,737/mo at 915K queries (pre-FedRAMP). The LLM receives full STAC item metadata plus per-band statistics arrays directly into context. Same provider, same agent, same workload — only the tool-return architecture differs. Validated against live gpt-5.2, N=5 session repeats, 60 LLM calls.
Structural ceiling (modeled): 84,490 input tokens/query, 0.84 cache rate, ~$26,000/mo. This was the worked-example anchor in earlier paper drafts and is what an even-heavier-payload deployment (aggregating more metadata than the standard STAC 1.0 + statistics shape) would cost. It remains a reasonable upper bound but is not the measured operating point.

The 7.8× cost spread between templated and freeform is the tool-response architecture cost lever: a procurement-scale operational decision separate from the API-vs-self-host question. Applying the optional FedRAMP Moderate ×1.15 layer brings the two measured anchors to $1,873 and $14,648 respectively. The calc.ajinkya.ai preset defaults to the templated floor; reproducing the freeform anchor or the structural ceiling requires the corresponding payload-mode toggle (see REPRODUCING.md in the paper repo).

The two segments, by the numbers

Authenticated: 500 domain experts × 5 questions/session × 0.2 sessions/day × 30 days × β_bot=1 = 15,000 queries/month.
Anonymous baseline: 10,000 visitors × 10 questions/session × 0.2 sessions/day × 30 days × β_bot=1.5 = 900,000 queries/month.

Total: 915,000 queries/month. Provider: GPT-5.2 Standard tier. Cache anchor: r_baseline = 0.84. Daily cap: $1,500.

The headline numbers

Naive estimate

50,000 tokens/query × 915,000 queries × blended rate ≈ $155,000/month

No cache adjustment. No traffic mix. No spend cap.

Measured templated anchor

Equation 1 at the measured 3,342 tok/turn, cache 0.883 → $1,629/month (pre-FedRAMP)

About 95× below the naive estimate. This is the bundled preset default.

Measured freeform anchor

Equation 1 at the measured 22,798 tok/turn, cache 0.744 → $12,737/month (pre-FedRAMP)

About 12× below the naive estimate, and ~8× above the templated floor.

The naive estimate is 12–95× too high for this workload depending on which architecture you measure against, primarily because of the cache discount and because input dominates output on tool-orchestration agents (the paper's measured I/O ratio is 70–90:1, not the 6:1 of chat benchmarks). The error can also run the other way: if you assume small request shapes against a workload that actually has long context or multi-agent handoffs, the naive estimate will undershoot.

The widget below visualizes the original release's anchor (where the modeled cost was ~$26K against the naive ~$155K — a 6× gap dominated by the cache discount). The shape of the corrections is unchanged in the measured operating points; only the absolute dollars shift.

Try it · Where the 6× gap comes from, one correction at a time

Apply cache (84%) on Apply traffic mix on Apply spend cap on

The cache discount does most of the work. Without it, the modeled number stays close to naive even with traffic mix and spend cap turned on — because the dominant cost on this workload is input tokens at full rate.

The stress test (50K MAU)

Scale the anonymous segment up to 50,000 MAU, hold everything else constant. Total monthly demand becomes 4.515 million queries. This is where Table 7 (the dual-architecture equal-budget comparison) lives, which we already walked through in Chapter XVI. At this scale, the two API anchors scale linearly: $8,037/month templated and $62,849/month freeform — and the freeform-capped row saturates the $1,500/day cap and refuses 28% of public traffic.

The sensitivity analysis (Section 5.2)

The paper makes one more important point: the self-host cost number is hugely sensitive to your assumed tokens-per-query. At the 50K-MAU stress configuration:

Tokens/query	Cost (Optimistic)	Cost (Realistic)	Scenario
600	$30,418	not reported	Output-only, optimistic
2,000	$85,553	not reported	Typical mixed (the paper's headline configuration)
10,000	$407,177	$618,420	Long-prompt, conservative

"Not reported" entries reflect the paper's Section 5.2 table as published; the paper reported Realistic only at the 10,000 tok/q point because Realistic costs are dominated by fixed setup + FTE at low τ — the GPU-rental component barely moves at 600 or 2,000 tok/q, so Realistic adds little new information there. At 10,000 tok/q the GPU bill starts to dominate and the two modes diverge usefully.

A factor of twenty across one slider. The lesson: any self-host estimate that doesn't disclose its tokens-per-query assumption is incomplete.

XX Reference

A glossary, for when you forget a word.

Quick reference. Skim this any time the paper uses a term you've forgotten.

α_s (alpha)

Input-token multiplier for query shape s, relative to the full baseline.

β_s (beta) — in shape table

Output-token multiplier for query shape s.

β_bot — bot factor

Bot-traffic multiplier for the anonymous segment. The paper writes this as bare β, which collides with β_s above; this walk-through uses the subscript throughout to keep them apart.

ρ (rho)

Aggregate sustained inference throughput of one GPU instance, in tokens per second, summed across all GPUs in that instance. Paper default: 1,200 tok/s for a 4-GPU g6e.12xl running a 70B model.

η (eta)

Throughput-derating factor in Equation 6. The fraction of benchmark throughput ρ that you actually sustain in production. 1.0 in Optimistic mode, 0.75 in Realistic. Captures queue contention, KV-cache eviction, and serving-stack overhead.

d, h

Sizing multipliers in Equation 6. d is the diurnal peak-to-mean ratio (default 4 for public-facing chat). h is static headroom for unexpected spikes (default 1.5). Together they convert mean demand into the peak demand you actually need to size for.

n_min

High-availability replica floor in Equation 6. Default 2, so even tiny workloads run with redundancy.

τ (tau)

Average tokens per query across the workload.

r_eff

Effective cache hit rate; fraction of input tokens served from cache.

r_baseline

The deployment-specific anchor cache rate, measured at q=6 turns.

Average questions per session in a segment.

T_m

Tier multiplier: 1.0 standard, 0.5 batch, 2.5 priority.

MAU

Monthly Active Users.

RAG

Retrieval-Augmented Generation. Pulling documents from a knowledge base to ground the model's answers.

FTE

Full-Time Equivalent. One person working full-time = 1.0 FTE.

SRE

Site Reliability Engineer. The person who keeps production infrastructure running.

SME

Subject Matter Expert. The domain expert who validates whether the model's answers are correct.

NLI

Natural Language Inference. The task of checking whether a claim is supported by a piece of source text. Used for hallucination detection.

ATO

Authority to Operate. The FedRAMP certification required to handle US government data.

PII

Personally Identifiable Information. Names, addresses, SSNs, etc. Often redacted from logs.

vLLM

An open-source inference server that runs open-weight LLMs efficiently using continuous batching and PagedAttention memory management.

KV cache

Key-Value cache. The model's internal stored computation of a prompt prefix. Provider prompt caching reuses this so they don't recompute.

Topology

The shape of how multiple LLM calls relate to each other: sequential, parallel, tool-chain, etc.

Handoff overhead

Extra input tokens charged when one stage's output is concatenated into the next stage's input.

HTTP 429

Too Many Requests. The HTTP status code returned when a rate limit or spend cap is exceeded.

CDN

Content Delivery Network. The edge layer between the open internet and your servers (Cloudflare, CloudFront, Fastly).

WAF

Web Application Firewall. The filtering layer that blocks known-bad traffic.

LiteLLM, LangGraph, OpenTelemetry

Python libraries used by the harness: provider abstraction, multi-agent orchestration, and tracing respectively.

Quantization

Running a model at reduced numerical precision (e.g., 8-bit or 4-bit instead of 16-bit) to fit more on a GPU and run faster, at some cost in quality.

Continuous batching

A serving technique where new requests can join an in-flight batch on the GPU as soon as space frees up, rather than waiting for a fixed-size batch to fill.

Amortization

Spreading a one-time cost across many months of operation, so the monthly bill reflects the cost's per-month equivalent.

Refusal-aware comparison

A cost comparison that explicitly reports both monthly dollars and queries served, instead of just monthly dollars. The point: a capped API that turns away 64% of traffic at $45K/mo is not the same service as a self-host fleet that answers all traffic at $86K/mo, so comparing only the dollar figures is a category error. The paper's central methodological move.

Segment

A user population with its own session length, questions per session, and bot-overhead profile. The worked example uses two segments: authenticated (500 domain experts) and anonymous (10K–50K public visitors with bot overhead β_bot=1.5). Different segments hit the rate-limit cap at different times and refuse traffic at different fractions; aggregating them into one "average user" hides 1.5–2× error in monthly volume.

Segment-aware

Said of a cost model that handles each segment separately rather than collapsing them into a single average user. Equation 4 (monthly query volume) is segment-aware; Equation 1 (per-query cost) applies per-segment because cache rates and traffic shapes typically differ across segments.

Templated tool returns

An agent architecture in which tool outputs (STAC items, geometry polygons, statistics arrays) are caught by an intermediate layer that renders them to the user through a deterministic UI template; the LLM is only told a short status string ("found 12 items, computed stats over 3 bands"). Measured input per turn on the paper's worked example: 3,342 tokens. The cost floor on this workload.

Freeform tool returns

An agent architecture in which structured tool outputs are serialized and concatenated into the LLM's context window so the LLM can compose a natural-language summary referencing arbitrary fields. The default in chat-style frameworks (LangChain ReAct, OpenAI Assistants, single-prompt tool loops). Measured input per turn on the paper's worked example: 22,798 tokens — about 7× the templated anchor.

Tool-response architecture

The design decision of how tool outputs reach the LLM (templated vs freeform). The paper's largest operational finding: this choice is an 8× cost lever on the same provider, same model, same user-facing workload. See Chapter XV.

Structural ceiling

An upper-bound per-query cost obtained by token-counting a hypothetical deployment that returns full structured payloads of every kind directly into the LLM. The paper's worked example uses 84,490 input tokens/query for its structural ceiling. Useful as a "what if every tool returns everything" worst case; not the measured operating point.

Cap binds / does not bind

"The cap binds" means monthly demand × per-query cost exceeds the monthly cap, so the cap actually constrains spend and refuses queries. "The cap does not bind" means the workload's natural monthly cost is below the cap; the cap is effectively inactive. In Table 7 the templated rows show the cap not binding (cost stays at $8K/mo, below the $45K monthly cap); the freeform-capped row shows the cap binding (cost saturates at $45K, 28% of traffic refused).

Pre-FedRAMP

A cost figure that excludes the FedRAMP Moderate compliance multiplier (×1.15). The paper reports its measured anchors pre-FedRAMP so that scaling 915K → 4.515M queries reproduces exactly without re-applying multipliers; applying ×1.15 brings the templated anchor to $1,873 and the freeform anchor to $14,648 at 10K MAU.

w (cache-write share)

The fraction of cached input tokens charged at the write rate rather than the read rate (see Equation 2). On OpenAI in steady state w ≈ 0 (no write surcharge). On Anthropic, measured at w ≈ 0.20 for multi-turn deployments — roughly 2× the calculator's earlier default of 0.10. The driver: a small (~215-token) cache-creation surcharge Anthropic applies on each warm call to extend the 5-minute TTL.