The naive formula, and why it lies.
There is one calculation that engineers and product managers reach for first when someone asks "how much will this LLM application cost us?" It looks like this: dollars per million tokens, multiplied by total tokens, multiplied by total queries. Three numbers, one spreadsheet cell, done.
The paper exists because this calculation can be wrong by an order of magnitude in either direction. The worked example in Section 5 shows the naive number overshooting by 6× — the cache discount on a long stable system prompt erases most of the input bill, and the naive formula misses it. But the same naive formula undershoots on a different shape of workload: tool-orchestration agents that read a 5,000-token system prompt and emit a 50-token tool call run at 70:1 to 90:1 input-to-output, where the calculator's default 6:1 assumption is off by an order of magnitude in the other direction. The error has structure, and the structure is exactly the things the simple formula leaves out. The whole paper is a careful enumeration of those missing pieces.
Before we can talk about what the paper adds, the building blocks. So let's go slowly through the basics first.
Related reading. If you want a build-along tutorial that walks you through implementing the same cost model in your own calculator (with code), see How to cost an AI agent — a build-along tutorial. If you want background on the LLM concepts referenced here (KV cache, quantization, serving frameworks, structured outputs), the hub post Understanding LLMs indexes the deep-dives.
What a token actually is.
A token is the unit that LLM providers charge by. It is not a word and it is not a character. It is a fragment of text — usually a few characters long. The English word "understanding" is one token. The word "uncharacteristically" might be three tokens. Punctuation is its own token. Spaces matter.
As a rough rule that you can use for back-of-envelope math: one token is about three-quarters of an English word, or about four characters. So a 750-word document is around 1,000 tokens. A 10,000-word document is around 13,000 tokens.
The two prices: input and output
Providers charge two different rates depending on which direction the text is moving.
This is not pricing arbitrage. It reflects how transformer inference actually works. Input tokens can be processed in parallel — the model reads the whole prompt and computes attention across it in one big tensor operation, all at once. Output tokens have to be generated one at a time, each conditioned on every token before it. Generating 100 output tokens means 100 sequential forward passes through the model. Reading 10,000 input tokens is a single parallel pass.
So output is bottlenecked by sequential latency. The GPU spends time on memory I/O between tokens rather than crunching compute. Per token, output consumes far more GPU-seconds than input — and the price reflects that.
Imagine a customer types: "What was the population of Mumbai in 2011?" Behind the scenes, your application sends the model a 4,000-token system prompt explaining its role, plus the 12-token question, plus a 600-token retrieved Wikipedia paragraph. That is 4,612 input tokens. The model replies: "The population of Mumbai in the 2011 census was approximately 12.4 million." That is about 16 output tokens. Total cost: roughly (4,612 × $1.75 + 16 × $14) / 1,000,000 = $0.0083. Less than a cent. But notice — almost all of that cost is input, not output. We will come back to this asymmetry.
Prompt caching — the largest correction the paper makes.
If your application sends the same long system prompt on every request — say a 5,000-token block explaining "you are a geospatial assistant, here are your tools, here is how to format your replies, here are some examples..." — the provider eventually notices. The first time they see that block, they compute something called a KV cache (key-value cache, the model's internal representation of the prompt). They store that cache on their side. The next time you send a request that starts with the same bytes, they do not recompute. They reuse the stored work.
And they charge you a discounted rate for those cached input tokens — typically about 10% of the normal input rate. OpenAI's cached input is one-tenth of regular input. Anthropic's is similar. For the mechanics of how the cache key is constructed and invalidated, see the companion post How prompt caching works.
The two providers behave differently
OpenAI
Automatic. Any prompt over 1,024 tokens is eligible. The provider matches the longest shared prefix between your current request and recent past requests. No code change required. Just keep your system prompt stable and the cache will form naturally.
Anthropic
Opt-in, two modes. You enable caching by attaching a cache_control field. There are two ways to do this: an automatic mode that attaches a single top-level cache_control field and lets the API choose the breakpoint, and an explicit mode that marks individual content blocks (system prompt, tool schemas, specific messages). Either way, you pay 1.25× the regular input rate on the first write (the call that creates a 5-minute cache entry; 2× for a 1-hour TTL), and only 0.1× on every subsequent read.
The worked example in Section 5 of the paper sees the modeled cost come out at $26,000/month while the naive formula predicts $155,000/month. The biggest single reason for that 6× gap is the cache. At an 84% effective cache rate, the application is paying full price on only 16% of its input tokens. That alone reduces input cost by roughly five-sixths.
The symbol the paper uses: reff
Throughout the paper, r_eff is the effective cache hit rate — the fraction of input tokens that were served from cache rather than computed fresh. If r_eff = 0.84, then 84% of input tokens hit the cache and got the cheap rate, and 16% missed and paid full price. We will see this symbol again in Equation 1, where it controls the entire input-cost term.
Transformers compute two tensors per input token, called Keys and Values (the K and V in "KV cache"). These encode what the token says and where it sits in the sequence. Every subsequent token's attention computation reads from those K and V tensors — and once computed, they never change. They only depend on the prefix up to that point.
If two requests share the same prefix (same system prompt, same earlier turns of a conversation), the K and V tensors for that prefix are identical between them. Providers store those tensors after computing them once and reuse them on subsequent requests, charging only the lookup overhead instead of the full re-computation. That is the cache discount.
Why a 1,024-token threshold on OpenAI? Caching has its own overhead — the provider has to detect that the prefix matches, look up the stored state, and verify it. For a 200-token prompt, that overhead can exceed the savings. Above about 1,024 tokens the math tips: the prefix is large enough that reusing it pays for itself.
Not every query costs the same.
The naive formula assumes there is one "average query." Real traffic is not like that. Some questions trigger every tool in the pipeline; others get refused in two sentences; others just retrieve a document and stop. The paper bins traffic into five canonical shapes:
| Shape | Input α | Output β | Cacheable? | What it means |
|---|---|---|---|---|
| full | 1.00 | 1.00 | yes | The complete pipeline runs end-to-end. This is the baseline; every other shape is expressed as a multiple of this. |
| rag | 0.19 | 0.47 | yes | Just retrieve a document and answer from it. No downstream tools. Short. |
| partial | 0.62 | 0.47 | yes | The pipeline runs partway and stops — perhaps a tool returns "no data" and the model replies briefly. |
| refusal | 0.06 | 0.12 | no | User asks something out of scope. Model declines quickly. Very cheap. Cannot cache because the prompt is shorter than the cache threshold. |
| heavy | 1.20 | 1.17 | yes | Long multi-turn conversation. Slightly more than full. |
αs (alpha) is the input multiplier for shape s. βs (beta) is the output multiplier. You pick a traffic mix — say 40% full, 30% rag, 15% partial, 10% refusal, 5% heavy — and take a weighted average across the five shapes to get the blended per-query cost.
A single "one query" cost number lies because real applications are a mix. If 10% of your queries are cheap refusals, your average is meaningfully lower than the full-pipeline price. Conversely, if you only count refusals and forget the heavy long-context conversations, you under-estimate. The mix is part of the model.
Who is on the other side of the API.
Different users behave differently, so the paper splits traffic into two segments and computes each separately:
The bot multiplier βbot
The paper writes this as plain β, but it also uses βs (in the shape table above) for the per-shape output multiplier — a completely different quantity. Throughout this walk-through I'll write the bot factor as βbot to keep the two apart. Where the paper just says β, context tells you which one.
The unauthenticated segment is not all human traffic. Even after the rate limiter at your edge has done its job, a meaningful fraction of remaining requests are non-human user-agents: SEO scrapers, AI training crawlers, security probes, abuse. The paper's default is βbot = 1.5 — meaning that for every 1.0 unit of human demand, there's another 0.5 of bot traffic on top, for a total of 1.5.
This is set to βbot = 1 for authenticated traffic (you know there are no bots there). For public endpoints, βbot should be estimated from the deployment's own CDN/WAF logs; the paper uses 1.5 as a starting point in the worked example, but does not claim that number generalizes.
CDN — Content Delivery Network. The edge layer between the open internet and your servers (Cloudflare, Fastly, AWS CloudFront). It sees every request before your application does. WAF — Web Application Firewall. The filtering layer inside or alongside the CDN that blocks known-bad traffic. Both have logs that classify traffic by user-agent — which is what you use to estimate βbot honestly.
Per-query cost, one piece at a time.
Now we have everything we need to read the first equation. Here it is, exactly as it appears in the paper:
Is — input tokens for this shape, = αs × baseline input tokens.
Os — output tokens for this shape, = βs × baseline output tokens.
reff — effective cache hit rate (a number between 0 and 1).
pin, pcached, pout — provider rates per million tokens for normal input, cached input, and output.
Tm — tier multiplier (1.0 standard, 0.5 batch, 2.5 priority).
106 — because the prices are quoted "per million tokens," we divide.
Read this as three additions inside the parentheses, because that is what it is:
- Is × (1 − reff) × pin — the input tokens that missed the cache, charged at full input rate. If reff = 0.84, then (1 − 0.84) = 0.16, so 16% of input tokens fall into this term.
- Is × reff × pcached — the input tokens that hit the cache, charged at the discounted cached rate. 84% of input tokens fall into this term.
- Os × pout — output tokens. These have no cache concept; what the model generates is always fresh. Output is charged at its (higher) output rate.
Add those three numbers, divide by one million (because rates are per million tokens), multiply by the tier multiplier Tm, and you have the cost in dollars for one query of one shape.
Take one full-pipeline query on GPT-5.2 standard tier: Is = 5,000 input tokens, Os = 500 output tokens, reff = 0.84, Tm = 1.0. Provider rates: pin = $1.75/M, pcached = $0.175/M, pout = $14/M.
Plug in:
- Fresh input: 5,000 × (1 − 0.84) × $1.75/M = 5,000 × 0.16 × $1.75/M = $0.00140
- Cached input: 5,000 × 0.84 × $0.175/M = $0.00074
- Output: 500 × $14/M = $0.00700
Total: $0.00914 per query. Almost 77% of that is output, even though output is only 9% of the tokens — the 8× output-vs-input rate gap, compounded by the 84% input-side cache discount. This is the asymmetry the interactive widget below lets you push around.
Tiers, and what Tm does
Providers offer different tiers at different prices:
- Standard — the default. Tm = 1.0.
- Batch (OpenAI) or Flex — your job runs whenever the provider has spare capacity, usually within 24 hours. Half price. Tm = 0.5. Good for offline analytics.
- Priority — guaranteed low latency. More expensive. Tm = 2.5. Good for user-facing applications during peak hours.
For cache-ineligible shapes like refusal, the paper sets reff = 0 regardless of session length, because the refusal pipeline never accumulates a cacheable prefix.
One cached-input rate is two prices in a trench-coat.
Equation 1 treats pcached as a single number — the cached-input rate per million tokens. In steady state, on OpenAI, that is roughly correct: OpenAI's cached read price is ~10% of the regular input price, and there is no write surcharge. But Anthropic charges differently: a cache write (the first time a particular prefix is sent) costs 1.25× the regular input rate, while a cache read (every subsequent call that hits that prefix) costs only 0.1× the regular rate.
If a fraction w of cached input tokens hit a cache that had to be created from scratch (writes), and (1 − w) hit a cache that was already warm (reads), the effective per-million rate is a weighted blend:
pwrite — cache creation rate. For OpenAI this equals the regular input rate
pin (no surcharge); for Anthropic it is 1.25 × pin for a 5-minute cache (2× for a 1-hour cache).pread — cache hit rate. Roughly 0.1 ×
pin for both providers.pcached,eff — the blended rate that gets plugged into Equation 1 as
pcached.
For OpenAI in steady state, w ≈ 0 and the blend collapses to just pread — which is why Equation 1 looks like it has only one cached price. For Anthropic deployments the picture is more interesting. The paper's calculator originally exposed w as an operator-supplied slider with a default of w = 0.10, but the actual measurement (May 14, 2026 run of the cached-pipeline-anthropic scenario against Claude Sonnet 4.5, N=3 repeats × 6 turns = 18 LLM calls) found w ≈ 0.20 for a representative multi-turn deployment — roughly twice the calculator's default.
The unexpected component is a small cache-creation surcharge (~215 tokens) that Anthropic applies on each warm call to extend the 5-minute TTL, on top of the cold-write cost on turn 1. So even after the cache is warmed, Anthropic continues to charge a small write penalty on every read. The paper now treats w as deployment-specific (governed by how often the cache rotates relative to the TTL relative to traffic shape) rather than as a calibrated default; operators on Anthropic should anchor closer to w = 0.20 than to w = 0.10 for multi-turn workloads.
Cache rate grows with the conversation.
One missing piece in Equation 1 is: where does reff come from? It is not a constant. It depends on how long the conversation is. A single-turn cold visit cannot cache much, because nothing has been computed yet. A ten-turn conversation has built up a long shared prefix; almost everything except the latest user message is in the cache.
Equation 3 captures this with a simple linear rule:
rbaseline — the deployment-specific anchor cache rate, measured at q = 6 turns. The paper uses 0.84.
0.01 — the slope. Each additional turn beyond 6 adds 1 percentage point to the cache rate.
clamp(...,0.50, 0.94) — forces the result to stay between 50% and 94%. No matter how long the session, you can't go below 50% (single-turn floor: only the system prompt is shared) or above 94% (long-session ceiling: each turn still introduces a fresh, uncacheable user message). The 0.94 ceiling is an empirical bound — the paper's harness sees OpenAI's gpt-4o-mini saturate in the 0.94–0.97 range on long multi-turn chats (Section 4, Finding 1); the model takes the lower edge of that range as a conservative upper limit.
Worked numbers
If rbaseline = 0.84:
- q = 5 questions/session: 0.84 + 0.01 × (5 − 6) = 0.84 − 0.01 = 0.83. Short sessions amortize slightly worse than the 6-turn anchor.
- q = 6: 0.84 exactly (this is the anchor).
- q = 10: 0.84 + 0.01 × 4 = 0.88. Longer sessions amortize better.
- q = 1: 0.84 − 0.05 = 0.79. Still high — but note the clamp at 0.50, which would kick in for very low rbaseline.
The 0.01 slope is described in the paper as "a heuristic, not a measured universal constant." Translation: AJ chose it because it produced plausible numbers, not because he measured it. If you adopt this model for a different application, you would want to actually plot cache rate vs session length on your own logs and fit your own slope. The paper says this honestly. A reviewer might still push back, but the disclosure is the right move.
How many queries per month.
This one is just bookkeeping, and it is the most concrete equation in the paper.
MAU — Monthly Active Users in this segment. The standard product-analytics number.
sessions/day — average visits per active user per day. Typically a number well below 1 (e.g., 0.2, meaning each user comes back every 5 days).
30 — days in a month, roughly.
questions/session — average length of a session in user turns.
βbot — bot multiplier (1.0 for authenticated, 1.5+ for public anonymous). Earlier drafts of the paper used plain β. The final paper uses βbot explicitly to avoid confusion with βs, the per-shape output multiplier from Section 3 / Table 1.
The paper's two segments, numerically
From the worked example in Section 5:
Authenticated
500 users × 0.2 sessions/day × 30 days × 5 questions × 1.0 = 15,000 queries/month
Anonymous (10K MAU)
10,000 users × 0.2 sessions/day × 30 days × 10 questions × 1.5 = 900,000 queries/month
Total: 915,000 queries per month. Notice that the anonymous segment dominates — 60× more queries than the authenticated segment, despite being only 20× more users. The bot multiplier and the longer sessions both push it up.
The daily spend cap, and what 429 means.
This section of the paper is one of the most important and the easiest to misread. So I want to be very careful.
If you put an LLM behind a public endpoint without protection, the worst-case scenarios are real and they have happened to production teams: a viral tweet directs millions of curious visitors to your site; a misconfigured client sends the same request in an infinite loop; a competitor pays for a scraping service to crawl every page. Without a safety net, you wake up to a $50,000 surprise bill. Engineers who have lived through this once never deploy without a cap again.
What the cap actually is
A daily spend cap is a hard limit at the API-gateway level. Concretely: a script tracks your daily LLM spend in real time, and when that running total crosses, say, $1,500 in a 24-hour window, the gateway stops forwarding requests to the model. Instead, it returns an HTTP 429 status code to the caller.
The mechanical consequence
Mechanically, the cap turns the monthly LLM bill into a step function that can never exceed cap × days. If the cap is $1,500/day and the month has 30 days, you cannot pay more than $45,000 for LLM tokens, no matter how much demand shows up at your door. In symbols:
capdaily — the per-day spend ceiling enforced by the gateway.
days/month — 30, give or take.
The paper also allows operators to model a small number of burst days per month (the daily cap is still enforced, but a few "press-event" days are allowed at the elevated daily cap). The shape doesn't change: a hard cliff. Demand above the cliff doesn't get served — it gets HTTP 429'd.
The crucial implication: refused queries
This is the move that the rest of the paper hinges on. When you say "our LLM bill last month was $45,000", that statement is incomplete without also saying "and we refused 64% of incoming queries with HTTP 429 after hitting our daily cap."
This is the paper's most important insight, and the rest of the paper builds the machinery to make this comparison honestly.
The seven layers of a real monthly bill.
Equation 5 is where the paper stops talking about just LLM tokens and starts talking about the actual monthly bill that hits your finance team. Tokens are one line. There are six more.
r — fraction of calls that fail and have to be retried. (1 + 1.5r) is the retry penalty.
V, E, P, A, F, C — six additional monthly cost categories, defined below.
The retry penalty: where the 1.5 comes from
If r = 0.05 (5% of calls fail and need retrying), each retry charges full input tokens again plus a partial output (the model wrote something before the call failed). The 1.5 coefficient comes from a back-of-envelope of where the retry cost actually lands: ~1.0× a fresh input on the re-sent prompt (full input tokens charged again, with the cache discount on the stable prefix) + ~0.5× of an average output (the model typically streamed about half the response before the failure). It's an honest estimate, not a measured constant; the paper flags that production teams should refine it with their own observed retry traces. With r = 0.05, the LLM bill inflates by (1 + 1.5 × 0.05) = 1.075, or 7.5%.
The six additional categories
Any one of these layers can be zero, and most are zero for a small commercial chatbot. But for a serious deployment, ignoring even one can throw off your procurement estimate by tens of thousands of dollars per month. A small commercial chatbot pays only LLM and C. A FedRAMP-bound multi-agent government system pays all seven. The model lets operators turn layers on and off to match what they're actually deploying.
Self-host: renting GPUs instead of paying per token.
The paper's second half compares paying OpenAI/Anthropic per token (the API approach) against running the model yourself on rented GPUs (the self-host approach). Before we read the equations, let's understand what self-hosting actually involves.
What you rent
You rent GPU instances from a cloud provider — typically AWS, but also GCP, Azure, Lambda Labs, CoreWeave. A common configuration is one NVIDIA L4 or A10 or H100 GPU attached to a virtual machine. AWS calls these g6.xlarge, g5.xlarge, p5.48xlarge and similar. You pay an hourly rate whether you're using the GPU or not. Costs range from $0.50/hour (small GPU) to $30+/hour (high-end H100).
What you run
On that GPU, you run an open-weight model — a model whose weights have been released publicly. Llama 3, Qwen, Mistral, DeepSeek. These are usually not as capable as contemporary commercial frontier models, but they are good enough for many tasks and much cheaper per token at scale — if you can keep the GPU busy.
The serving stack: vLLM
vLLM is an open-source inference server. It is not the only one (others include TGI, SGLang, TensorRT-LLM) but it is the most widely cited in the paper. vLLM's contribution is something called PagedAttention — a way of managing GPU memory that allows many requests to be processed at the same time on the same GPU. This is called continuous batching and it is the key to high throughput. For the serving-framework landscape (vLLM vs TGI vs SGLang vs TensorRT-LLM), see the companion post LLM serving frameworks.
Throughput: ρ (rho)
The paper uses the symbol ρ for an instance's aggregate sustained throughput in tokens per second — summed across all GPUs in that instance. The worked-example default is ρ = 1,200 tok/s, which is the published aggregate throughput of a g6e.12xl (a 4-GPU L40S instance) running a 70B-class open-weight model under vLLM-style continuous batching. The paper is careful that this is not a hardware constant — actual throughput depends on:
- Prompt length — longer prompts take more compute to process.
- Output length — output generation is token-by-token and slower.
- Batch size — how many requests are being processed simultaneously.
- Quantization — running the model at lower numerical precision (e.g., 8-bit instead of 16-bit) for speed.
- KV-cache pressure — how much GPU memory the in-flight requests' caches are using.
- Latency SLO — Service Level Objective. If you promise users a fast response, you can't batch as aggressively.
The gap between benchmark and production
Even when ρ is benchmarked correctly, production often sustains less than the benchmark unless traffic is well-packed and the serving stack is tuned. Requests arrive irregularly. The KV-cache fills up and evicts entries. The serving stack adds its own overhead. The paper formalizes this gap with a separate symbol — η (eta), a throughput-derating factor — which appears next to ρ in Equation 6. η = 1.0 means "we assume the benchmark holds." η = 0.75 means "we expect to sustain 75% of it in real serving." This is the same idea older calculators called utilization, but separated from sizing math so the assumption is visible.
How many GPUs do you need.
Naively you might think: total tokens per month, divided by what one GPU can produce per month, rounded up. The paper doesn't size that way, and the reason is one of those production-experience details that catches first-time operators out.
A real workload isn't flat across the day. Public traffic is roughly four times as heavy during business hours as it is overnight. A fleet sized for the monthly average will queue badly at lunchtime and sit half-idle at 3am. So you size to the peak, not the mean. And then you add safety headroom on top, because the peak you measured last month is not the peak you'll see next month.
τ · Qtot / s — mean demand in tokens per second (total monthly tokens divided by seconds in a month).
d — diurnal peak-to-mean ratio. Default 4 for public-facing chat: business-hour load is about 4× the overnight load.
h — static headroom for unexpected spikes. Default 1.5.
ρ — instance aggregate throughput in tokens per second, summed across all GPUs in one instance.
η (eta) — throughput-derating factor for serving inefficiencies not captured in ρ alone (queue contention, KV-cache eviction under load). 1.0 in Optimistic mode, 0.75 in Realistic.
nmin — high-availability replica floor. Default 2, so even tiny workloads run with redundancy.
⌈ ⌉ — the ceiling function. You can't rent 4.7 GPUs; you rent 5.
Walking through a real number
Take the 50,000-MAU stress workload: 4.5 million queries per month, 2,000 tokens per query average. ρ = 1,200 tok/s (a g6e.12xl instance with 4×L40S GPUs running a 70B model). Optimistic mode: η = 1.0.
- Mean demand: 2,000 × 4,500,000 / 2,592,000 = 3,484 tok/s.
- Peak demand: 3,484 × 4 × 1.5 = 20,903 tok/s.
- Instances required: ⌈20,903 / 1,200⌉ = 18 instances.
The paper's Section 5.2 reports exactly this: 18 instances Optimistic for τ = 2,000. Notice the work the d × h factor is doing — without the 6× peak-and-headroom multiplier, the same workload sizes to only 3 instances, which would under-provision badly during peak hours and is not a defensible procurement number.
Simpler calculators often use the form n = ⌈τ·Qtot / C⌉ where C is monthly capacity per instance. That formula treats demand as if it were perfectly flat across 24 hours. For workloads with real diurnal swings it under-sizes the fleet by roughly d × h × η, which on the paper's defaults is a factor of 6. The cost difference is the difference between a self-host quote that survives procurement review and one that doesn't.
Once you know n, total monthly self-host cost is:
cfixed — control plane, observability, FTE amortization, and one-time setup spread over a chosen horizon.
Optimistic vs Realistic.
Self-host break-even calculations in industry blog posts almost always pick one set of operating assumptions and present a single answer. But the answer depends so heavily on those assumptions that any single number misleads. The paper exposes two named presets, each holding three knobs together:
Optimistic mode
η = 1.0 — the benchmark throughput ρ is sustainable in production. No derating for queue contention or KV-cache pressure.
FTE allocation = 0 — no dedicated personnel cost. GPU operations are absorbed into an existing team's workload.
Setup amortization = 36 months — the upfront engineering investment is spread thin over three years.
Brackets a well-packed, multi-workload fleet. Best case for self-host.
Realistic mode
η = 0.75 — a 25% derate. Real serving often sustains less than the benchmark; queues build during peaks, the KV-cache evicts under load.
FTE allocation = 0.5 SRE — half a Site Reliability Engineer keeping the fleet healthy.
Setup amortization = 12 months — one year, honest if you're not sure the project will survive longer.
Brackets a single-tenant fleet sized for its own peaks. Typical case.
What η actually captures
The throughput-derating factor is the paper's way of formalizing the gap between "what the GPU benchmark says" and "what your fleet sustains in production." Benchmarks are run with carefully chosen prompt lengths, batch sizes, and concurrency. Production deployments rarely match those conditions exactly: requests arrive irregularly, the KV-cache fills up and evicts entries, the serving stack adds its own overhead. η = 0.75 is the paper's anchor for "this is what you actually get." η = 1.0 is "we assume best case."
"The procurement decision depends on knowing which side of that line your team sits on — and most teams overestimate their side."
The paper doesn't say Realistic is right. It says: if your team can plausibly hit Optimistic conditions, self-host has a better story. If you can only honestly claim Realistic, API has a better story. Optimistic is the answer for a team running a multi-tenant, well-packed fleet with no incremental headcount. Realistic is the answer for a team standing up its own dedicated cluster, paying for an SRE to keep it alive, on a one-year horizon. Almost every team that quotes a self-host number on LinkedIn is implicitly claiming Optimistic conditions; almost none of them can deliver on that claim in production.
Tool-response architecture: what the LLM actually sees.
Before walking into the equal-budget comparison, one piece of background. The paper's largest operational finding (separate from the API-vs-self-host question) is that two real production deployments running the same agent code on the same provider can have monthly costs that differ by 8×. The variable that moves the cost by 8× is not which provider you use, or how aggressive your cache is, or how many GPUs you've sized for. It's whether your agent shows raw tool returns to the LLM, or whether an intermediate layer summarizes them first. For the mechanics of how tool/function calls are dispatched, see the companion post Tool use & function calling.
Imagine a tool that searches a satellite-imagery catalog. The user asks "find Landsat scenes over New York City for October 2024." The tool runs, comes back with 12 STAC items (a structured-metadata format used by raster archives). Each item is a JSON object with around 2,000 tokens of metadata — geometry polygons, band-by-band statistics, asset URLs, license fields, processing-level annotations, and so on. The agent now has to do something with these 12 items. There are two patterns.
The 12 items go into the agent's internal state, not into the LLM context. The agent's intermediate layer extracts a short status string — "found 12 Landsat-8 scenes over NYC, Oct 1–31 2024" — and only that string goes back to the LLM. The structured items get rendered to the user through a UI template (a results panel, a map overlay) that knows how to display them deterministically. The LLM is told what happened, not everything that happened. Per-turn LLM input on this workload, measured: 3,342 tokens.
The 12 items are serialized and concatenated into the LLM's context window. The LLM reads all of them and writes a natural-language summary citing the specific fields the user might care about. This is the default in chat-style agent frameworks (LangChain ReAct, OpenAI Assistants, single-prompt tool loops) where the LLM also generates the user-facing response. The LLM sees everything that happened. Per-turn LLM input on the same workload, measured: 22,798 tokens.
Same agent topology, same provider, same model, same workload — only the architecture of how tool returns reach the LLM differs. The paper measured both endpoints against live OpenAI gpt-5.2 (the templated run is N=20 sessions / 238 LLM calls; the freeform run is N=5 sessions / 60 calls). The measured per-query blended cost is $0.00178 templated vs $0.01392 freeform, a paired 7.8× spread. At 915K queries/month the templated deployment costs $1,629/mo; the freeform deployment costs $12,737/mo (both pre-FedRAMP — applying the optional Eq. 5 compliance multiplier brings them to $1,873 and $14,648).
Three things to notice:
- This is a procurement-scale lever, not a tuning knob. It's an architectural choice the team makes once when designing the agent. Switching from freeform to templated requires building a response-template layer that knows how to display each tool's structured output through a deterministic UI; that's an engineering project, not a flag.
- Not every deployment can adopt the templated pattern. If your product is a pure chat interface where the LLM also writes the user-facing prose, you can't strip structured payloads out of the LLM's context — the LLM needs to read the data to write the response. The freeform anchor is the realistic operating point for chat-first products.
- Self-host costs are roughly architecture-invariant in the modeled range (more carefully: held fixed at τ=2,000 tokens/query for the equal-budget table). Self-host GPU sizing is driven by total token throughput which is dominated by output tokens (similar in both architectures) and a workload-weighted blend of input. The API rate-card asymmetry that produces the 8× spread on the API side hits self-host as additional GPU-minutes, not as a price multiplier.
This is why the equal-budget comparison table in the next section splits the API rows by tool-response architecture but reports self-host as a single column. The templated/freeform split is where the operational decision lives.
Equal-budget comparison: the move that breaks open the debate.
Most LinkedIn and blog-post comparisons read like this: "API costs $45K/month. Self-host costs $86K/month. Therefore API is cheaper."
The paper points out that this is a category error. The $45K API figure is achievable only because the spend cap is refusing 64% of incoming queries. The $86K self-host figure serves every query that comes in. They are not the same service. Comparing them is like comparing the cost of a restaurant that serves everyone against the cost of a restaurant that turns 64% of customers away at the door.
The paper's proposed fix:
- Fix the budget. Pick a daily spend cap, say $1,500/day.
- Apply that cap to both strategies. Cap API spend at the cap. Cap self-host spend at the same cap, which translates to "rent fewer GPUs."
- Ask: how many queries does each strategy serve within that budget? Now you're comparing service levels at equal cost, which is the fair comparison.
The strategy table (Table 7 in the paper)
The current paper splits the API rows by tool-response architecture (templated vs freeform — see Chapter XV, just above) and keeps two self-host operating modes. Eight rows total.
At 50,000 anonymous MAU (4.5 million queries/month of demand), the paper computes:
| Strategy | Monthly cost | Service level |
|---|---|---|
| API, templated tool returns | ||
| uncapped | $8,037 | all 4.52M served (cap unused) |
| capped $1,500/day | $8,037 | all 4.52M served (cap does not bind) |
| API, freeform tool returns | ||
| uncapped | $62,849 | all 4.52M served |
| capped $1,500/day | $45,000 | 3.23M served / 1.28M refused (28%) |
| Self-host (τ=2,000 tokens/q fixed) | ||
| full service, Optimistic | $85,553 | all served, with quality trade-off |
| full service, Realistic | $141,269 | all served, with quality trade-off |
| capped $1,500/day, Optimistic | $44,202 | 2.33M served / 2.18M refused |
| capped $1,500/day, Realistic | $43,786 | 0.97M served / 3.54M refused |
Reading this table carefully
The headline finding is that API service wins decisively under every evaluated row at this scale: API-templated runs at $8K/month and never touches the cap, API-freeform runs at $63K/month uncapped or saturates the cap and refuses 28% of public traffic at $45K, and both numbers are cheaper than the self-host alternatives at the same daily budget. Within the API column, the choice between templated and freeform tool returns is an 8× cost lever on the same user-facing workload.
The self-host rows hold τ=2,000 tokens/query fixed so that the table isolates the API-side tool-response architecture effect (a different self-host re-sizing would be needed to size GPUs for the freeform 22,798-token-per-turn anchor). Under that fixed-τ assumption, the freeform-capped API row serves 3.23M queries within the $45K budget; the corresponding self-host-capped rows serve 2.33M (Optimistic) and 0.97M (Realistic) at the same monthly spend.
Earlier drafts of the paper anchored the API rows against an 84,490-input-token structural-ceiling per-query cost, which gave $45,000/month for API-capped and made the Optimistic-mode self-host-capped row look more competitive. Once both endpoints were measured empirically (templated and freeform), the comparison stopped flipping: API service wins in both modes, by a wider margin than the structural-ceiling anchor implied. The widget below visualizes the earlier-draft structural-ceiling regime; it's preserved for continuity, but the headline finding now sits in the table above.
The widget reproduces the earlier-draft analysis at the structural-ceiling anchor: under those assumptions, capped self-host in Optimistic mode appears to serve more queries than capped API, while Realistic mode flips the result. That was the framing in the original release. With both API endpoints now measured empirically against live OpenAI gpt-5.2, the structural-ceiling anchor no longer drives the comparison.
- API-templated: serves all 4.52M queries at $8,037/month. The daily cap is never reached. This is the floor — the cheapest defensible operating point for this workload on this provider.
- API-freeform: serves all 4.52M queries at $62,849/month uncapped, or 3.23M served (28% refused) at $45,000/month under the same $1,500/day cap. The 8× spread vs templated comes entirely from how much tool-return content the LLM has to read on each turn.
- Self-host capped: 2.33M served (Optimistic) or 0.97M served (Realistic) at the same $45K budget — both lose to API-freeform-capped's 3.23M, and dramatically lose to API-templated, which doesn't even touch the cap.
Section 4: measured, not guessed.
So far we have been treating the coefficients in the equations (cache rate, traffic shape multipliers, retry rate) as if they were just numbers. Section 4 of the paper is about where those numbers come from. The short answer: a harness called agent-cost-bench that runs real LLM API calls and records what happened.
What "the harness" means
A benchmark harness is a piece of software that exercises a system in controlled, repeatable ways and collects measurements. The paper's harness is written in Python and uses the same libraries production teams use, on purpose:
response.id field.The scenarios
The harness runs nine scenarios across 174 real LLM calls. The total cost was $0.224 — twenty-two cents — which is itself a useful data point: empirical calibration of cost coefficients is not expensive.
The scenarios cover what the paper calls different topologies — the shapes of how LLM calls relate to each other:
- Sequential pipeline — one stage runs, its output becomes the next stage's input, repeat. Measures handoff overhead and cumulative context growth.
- Streaming pipeline — output tokens stream back in real time. Measures time-to-first-token and steady-state output rate.
- Tool-chain single agent — one agent that calls many tools in sequence. Measures schema overhead and tool-result token costs.
- Multi-turn chat with long shared system prompt — measures cache hit rate on both OpenAI and Anthropic.
- Parallel orchestrator — one supervisor calls three specialists at the same time, waits for all, combines their answers. Measures fan-out cache behavior.
Four findings the paper reports
Finding 1: cache hit rate depends on provider and topology
| Provider / scenario | Median | Cold (turn 0) | Warm (turn 1+) |
|---|---|---|---|
| OpenAI gpt-4o-mini, multi-turn chat | 0.91 | 0.61 | 0.94–0.97 |
| Anthropic claude-sonnet-4-5 | 0.77 | 0.00 | 0.85–0.92 |
| OpenAI parallel fan-out | 0.60 | 0.00 | 0.72–0.85 |
Reading this: on the same workload, OpenAI's median cache rate is 14 percentage points above Anthropic's. Parallel topologies suppress hit rate (because each parallel call's prompt is shorter and less prefix-shareable).
Finding 2: input-to-output ratio varies wildly by topology
The "average" chat workload has roughly 6:1 input-to-output. But tool-orchestration agents — agents that read a long system prompt full of tool schemas and emit short tool-call instructions — can run at 70:1 or even 90:1. The paper observes 73:1 on gpt-4o-mini and 88:1 on gpt-5.2 for the same workload.
Implication: if your calculator assumes 6:1, you understate input cost by roughly 12× on this class of workload.
Finding 3: handoff overhead grows with stage count
Some calculators use a flat "200 tokens per handoff" assumption. The paper measures a 5-stage pipeline with cumulative input growing as 175 → 523 → 1,243 → 2,038 → 2,854 tokens. The flat assumption is off by 3–4× on deep pipelines because each stage's output becomes the next stage's input and accumulates.
Finding 4: tool-response architecture is an 8× cost lever
This is the finding that emerged after the v0.1.0 pilot, and it's the largest operational lever in the model. On the same agent, same tools, same 6-turn session, same gpt-5.2 Standard tier, the paper measured two endpoints of the tool-response architecture choice (see Chapter XV for the full setup):
| Mode | sess. | calls | in/turn | cache | $/q |
|---|---|---|---|---|---|
| Templated tool returns | 20 | 238 | 3,342 | 0.883 | $0.00178 |
| Freeform tool returns (heavy) | 5 | 60 | 22,798 | 0.744 | $0.01392 |
Paired 7.8× spread on per-query blended cost. The freeform anchor is N=5 sessions versus N=20 for templated; the paper presents this as an order-of-magnitude finding rather than a precise 7.8× constant, but the procurement-scale implication is clear: turning on response templating shifts the cost band by roughly an order of magnitude on the same user-facing workload.
What the harness measures, and what it doesn't
The paper draws a line that's worth quoting here — between coefficients the harness re-measures empirically against real provider APIs, and coefficients that are workload-specific operator inputs the harness can't see. This separation matters for a procurement reviewer: it tells them which numbers are reproducible and which are assumption-laden.
| Measured by the harness | Configured by the operator |
|---|---|
|
|
The right way to read this: the empirical column is where the paper claims reviewer-checkable evidence. The configured column is where operators have to defend their own assumptions — and where the paper explicitly says "if your workload differs, re-fit these against your own logged traffic before signing a procurement contract." A common failure mode in older cost calculators is conflating the two: presenting an operator's bot-multiplier guess as if it were a measurement, or running a self-host estimate at η = 1.0 (benchmark throughput) without flagging that as an optimistic assumption.
Handoff overhead, captured as a sum.
The paper proposes a replacement for the flat handoff assumption:
output_tokensi — the average output (in tokens) of stage i.
Σi<N — sum over all stages before the last one.
In words: the handoff cost at any stage is the total of all previous stages' outputs, because that's what gets concatenated into the running context as the pipeline progresses. This gets folded into Equation 1 as additional input tokens charged at the non-cached rate (because new handoffs aren't yet cached).
The paper is appropriately careful: "We measured one pipeline; we do not claim the linear-in-N form is universal."
Reading Section 5 slowly.
The worked example anchors the whole paper. Let's walk through it one piece at a time.
What the application is
An anonymized geospatial Q&A service. Users ask natural-language questions about Earth-science data ("How much rainfall did Mumbai get last October?"). The system answers by chaining five tools:
- Place resolution — turns "Mumbai" into a bounding polygon (latitude/longitude box).
- Time parsing — turns "last October" into an ISO date range (2025-10-01 to 2025-10-31).
- Dataset search — finds which Earth-observation collections contain data matching the place and time.
- Item lookup — within those collections, finds the specific files.
- Statistics — computes the requested aggregate (sum, mean, etc.) over the resolved files.
A representative single session in the paper's worked example processes about 84,000 input tokens (massive — that's the long system prompt plus the tool schemas plus the tool results) and emits about 850 output tokens (small — short tool dispatches and a concise final answer). The input-to-output ratio is about 99:1 — extreme output suppression, classic tool-orchestration.
The same agent topology produces three different per-query cost anchors depending on how tool returns flow into the LLM's context. The paper now reports all three:
- Templated (measured floor): 3,342 input tokens/turn, 0.883 cache rate, $1,629/mo at 915K queries (pre-FedRAMP). Tool returns are intercepted by a response-template layer; the LLM only sees short status strings. Validated against live gpt-5.2, N=20 session repeats, 238 LLM calls.
- Freeform (measured upper): 22,798 input tokens/turn, 0.744 cache rate, $12,737/mo at 915K queries (pre-FedRAMP). The LLM receives full STAC item metadata plus per-band statistics arrays directly into context. Same provider, same agent, same workload — only the tool-return architecture differs. Validated against live gpt-5.2, N=5 session repeats, 60 LLM calls.
- Structural ceiling (modeled): 84,490 input tokens/query, 0.84 cache rate, ~$26,000/mo. This was the worked-example anchor in earlier paper drafts and is what an even-heavier-payload deployment (aggregating more metadata than the standard STAC 1.0 + statistics shape) would cost. It remains a reasonable upper bound but is not the measured operating point.
The 7.8× cost spread between templated and freeform is the tool-response architecture cost lever: a procurement-scale operational decision separate from the API-vs-self-host question. Applying the optional FedRAMP Moderate ×1.15 layer brings the two measured anchors to $1,873 and $14,648 respectively. The calc.ajinkya.ai preset defaults to the templated floor; reproducing the freeform anchor or the structural ceiling requires the corresponding payload-mode toggle (see REPRODUCING.md in the paper repo).
The two segments, by the numbers
- Authenticated: 500 domain experts × 5 questions/session × 0.2 sessions/day × 30 days × βbot=1 = 15,000 queries/month.
- Anonymous baseline: 10,000 visitors × 10 questions/session × 0.2 sessions/day × 30 days × βbot=1.5 = 900,000 queries/month.
Total: 915,000 queries/month. Provider: GPT-5.2 Standard tier. Cache anchor: rbaseline = 0.84. Daily cap: $1,500.
The headline numbers
Naive estimate
50,000 tokens/query × 915,000 queries × blended rate ≈ $155,000/month
No cache adjustment. No traffic mix. No spend cap.
Measured templated anchor
Equation 1 at the measured 3,342 tok/turn, cache 0.883 → $1,629/month (pre-FedRAMP)
About 95× below the naive estimate. This is the bundled preset default.
Measured freeform anchor
Equation 1 at the measured 22,798 tok/turn, cache 0.744 → $12,737/month (pre-FedRAMP)
About 12× below the naive estimate, and ~8× above the templated floor.
The naive estimate is 12–95× too high for this workload depending on which architecture you measure against, primarily because of the cache discount and because input dominates output on tool-orchestration agents (the paper's measured I/O ratio is 70–90:1, not the 6:1 of chat benchmarks). The error can also run the other way: if you assume small request shapes against a workload that actually has long context or multi-agent handoffs, the naive estimate will undershoot.
The widget below visualizes the original release's anchor (where the modeled cost was ~$26K against the naive ~$155K — a 6× gap dominated by the cache discount). The shape of the corrections is unchanged in the measured operating points; only the absolute dollars shift.
The stress test (50K MAU)
Scale the anonymous segment up to 50,000 MAU, hold everything else constant. Total monthly demand becomes 4.515 million queries. This is where Table 7 (the dual-architecture equal-budget comparison) lives, which we already walked through in Chapter XVI. At this scale, the two API anchors scale linearly: $8,037/month templated and $62,849/month freeform — and the freeform-capped row saturates the $1,500/day cap and refuses 28% of public traffic.
The sensitivity analysis (Section 5.2)
The paper makes one more important point: the self-host cost number is hugely sensitive to your assumed tokens-per-query. At the 50K-MAU stress configuration:
| Tokens/query | Cost (Optimistic) | Cost (Realistic) | Scenario |
|---|---|---|---|
| 600 | $30,418 | not reported | Output-only, optimistic |
| 2,000 | $85,553 | not reported | Typical mixed (the paper's headline configuration) |
| 10,000 | $407,177 | $618,420 | Long-prompt, conservative |
"Not reported" entries reflect the paper's Section 5.2 table as published; the paper reported Realistic only at the 10,000 tok/q point because Realistic costs are dominated by fixed setup + FTE at low τ — the GPU-rental component barely moves at 600 or 2,000 tok/q, so Realistic adds little new information there. At 10,000 tok/q the GPU bill starts to dominate and the two modes diverge usefully.
A factor of twenty across one slider. The lesson: any self-host estimate that doesn't disclose its tokens-per-query assumption is incomplete.
A glossary, for when you forget a word.
Quick reference. Skim this any time the paper uses a term you've forgotten.
full baseline.