How to cost an AI agent: a progressive tutorial

This tutorial is a build guide. By the time you finish, you should be able to design and ship an AI-agent cost calculator from scratch — for your own organisation's procurement reviews, for a vendor pitch, for a research deployment, for anything where someone is going to ask "what does this cost at scale?" and you need a defensible number.

The sidebar on the left is your map. Each H2 below is one concept that any honest cost model has to handle. Read top to bottom for the full curriculum (it's written to teach progressively), or jump to the module that matches the part of your model you're stuck on.

Four things to know before you start.

No prior AI knowledge is assumed. The first three modules explain what a token is, what an LLM API call is, and why the naive "$X per million tokens × N tokens" math is wrong by roughly an order of magnitude. If you've already shipped a per-token cost estimator and want to skip ahead, jump to Part 5 — Caching.

This is implementation-agnostic. The tutorial tells you what controls your calculator needs, what the formulas behind each control look like, and what the validation loop should do. It doesn't prescribe a stack — you can build the same model in a spreadsheet, in React, in a notebook, or as a CLI. Part 0 gives you the full control checklist upfront so you can scope the build before you start. Part 29 gives you a reference architecture (data model, formula library, validation harness) you can clone.

Every module ends with a "build it" callout. Look for passages that start with "Build it:".

These are the prescriptions — what your UI should expose, what your data model should hold, what your math should compute. The prose around them is the why.

The numbers in this post are illustrative, not authoritative. Where I write "$1,800/month observability" or "73:1 input/output ratio," those are defaults from one measurement campaign. Your own deployment's coefficients will differ; Part 25 shows you how to measure them with a benchmark loop and refresh your defaults.

If you want a concrete reference implementation alongside this tutorial, calc.ajinkya.ai is one (its source is open). It is not required reading — every concept here stands on its own.

Companion tutorials: how the model itself actually works. This guide treats the model as a priced black box: tokens go in, tokens come out, and you pay for both. If you want to know what is happening inside that box before you start pricing it (what a token *is*, how attention works, why a 671B-parameter MoE model costs the same as a 37B dense one to serve, how a reasoning model like R1 trades training compute for inference compute, *and* how retrieval plus context assembly decide which tokens the model sees on each turn), read these interactive walk-throughs first. They share this site's chrome (sidebar TOC, Grill Me, Mood) but their bodies are designed pieces in their own right.

The anatomy of a frontier model — seven chapters, five live demos. Tokenizer, transformer block, attention, the Chinchilla scaling laws, 3D parallel training, post-training (SFT → reward → RL → reasoning RL), inference (KV cache, speculative decoding, MoE). 40–90 minutes. Read this if you want every "token" and "context window" and "active parameter" in this guide to land with conviction.
DeepSeek-R1, from first principles — sixteen widgets, one paper. Drag the GRPO simulator, click through the four-stage pipeline, watch reasoning emerge from pure RL on rule-based rewards. Read this if you want to understand the reasoning-RL paradigm that drives the test-time-compute and verification numbers in Parts 10 and 11.
Hybrid search — keywords plus vectors — side-by-side BM25, dense retrieval, and RRF merge on a toy corpus. Read this before Parts 8–9 (RAG / chunk sizing): it shows *why* production stacks run both lexical and semantic search instead of betting on embeddings alone.
Context engineering — five tabs simulating window budgets, overflow, compaction, lost-in-the-middle, and production mitigation patterns. Read this alongside Parts 8–11 whenever you are costing long agent traces, tool output, or retrieved context — the bill is literally "how many of these tokens fit per turn."

All four are implementation-agnostic and standalone. The cost tutorial below assumes you can come back from them with comfort around the words "token", "MoE active parameter", "KV cache", "thinking budget", "RRF / hybrid retrieval", and "context compaction". You can also skip them and learn-as-you-go — every term gets defined again in context here.

Part 0—What your calculator needs to expose

Before we get into tokens and cache curves, it helps to see the finished shape of the thing you're building. A defensible AI-agent cost calculator exposes roughly 25 controls, organised into seven groups. Below is the full list with typical defaults — treat it as a scope document. Each control is taught in detail in a later module (Part number in brackets).

If a control on this list looks unfamiliar, that's the signal it's worth reading the matching module before you start coding. If it looks obvious, you can probably skim that module.

Group A — Workload shape (what one query looks like)

Input token size — distribution per query, not a single number. Default: median 600 in / 250 out, P90 1,500 in / 800 out. [Part 1, Part 2]
Output token size — same. Output is usually ~1/3 of input in chat workloads, but reverses for code generation and summarisation. [Part 2]
System-prompt overhead — a fixed input-token cost paid on every call. Default: 800–2,000 tokens, growing fast as agents add tools and few-shot examples. [Part 2]
Tool-schema overhead — JSON schema for tool definitions, sent on every call. Default: 200–600 tokens per tool × number of tools available. [Part 2]
Conversation history depth — how many prior turns get re-sent each turn. Default: full history capped at 70% of context window, then summarised. [Part 2]

Group B — Query mix (multiple shapes weighted)

Question types & their factors — multipliers on the baseline token sizes. Default: short factual = 0.4×, complex reasoning = 1.8×, RAG-heavy = 3×. [Part 3]
Traffic mix — what percentage of your queries falls into each shape. Defaults sum to 100%. [Part 3]

Group C — Volume (how often)

Audience segments — MAU + sessions/day + queries/session, per segment (paid users behave differently from free trialists, who behave differently from federal employees). [Part 4]
Bot factor — anonymous and crawler traffic multiplier per segment. Default: 1× authenticated, 1.5–10× public. [Part 4]
Peak/avg ratio — burstiness. Drives reservation sizing and rate-limit defaults. Default: 3–5× typical, 10×+ for consumer-facing. [Part 4]

Group D — Cost-shape multipliers

Cache hit rate — fraction of input tokens served at the cached rate (typically ~10% of full input rate). The single most consequential coefficient in the model. Default: 75% for repeat-user products, 30–40% for one-shot. [Part 5]
Cache write share — how much of your "miss" input has to pay write costs (often 25% extra for prompt-cache write). [Part 5]
Daily spend cap — hard ceiling that shapes the bill once traffic exceeds expectations. [Part 17]
Bot rate-limit strategy — how aggressively unauthenticated traffic is throttled. Caps the bot factor downstream. [Part 18]

Group E — Architecture (what your agent actually does)

Multi-agent topology — single LLM, planner+worker, five-stage research pipeline, etc. Each agent has its own model selection and token budget. [Part 7, Part 8]
RAG configuration — embeddings model, corpus size, re-embed cadence, retrieval depth (top-k), reranker. [Part 9]
Reasoning depth multiplier — for o-series / extended thinking models, output tokens balloon 5–20× before the user-visible answer. [Part 10]
Guardrails / safety filter — pre-call moderation and post-call review. Adds 5–20% per call typically. [Part 10]
Tool-call frequency — average tools invoked per turn, plus the response-token tax of each tool's output going back into context. [Part 10]
Verification tier — none, lightweight (single LLM review), or full FactReasoner. Each tier carries a known token multiplier. [Part 11]

Group F — Hosting and economics

Hosting mode per agent — API / BYOK / hybrid / self-host. Different math under each. [Part 12, Part 13]
Reservation tier — pay-as-you-go vs. committed (PTU / enterprise contract). Drives baseline cost down but adds a floor obligation. [Part 14]
Personnel cost — engineering, ops, ML, compliance FTEs. Often the largest single line item. [Part 15]
Federal-compliance multipliers — FedRAMP Moderate / High, IL2 / IL5, GovCloud premium, separate Vector DB infra. Stack on top of base. [Part 16]
Other infrastructure — observability, CI, error tracking, secret management, gateway. Editable line items. [Part 19]

Group G — Output and validation

Migration timeline — how long to ramp from pilot to production traffic. Cost varies dramatically over the ramp. [Part 20]
Multi-model routing — what % of traffic goes to which model tier (cheap default vs. expensive escalation). [Part 21]
Confidence interval — your calculator's output should be a range, not a point estimate. Lognormal P5–P95 is the defensible shape. [Part 22, Part 23]
Negotiated price overrides — a price-book table the user can edit when they have contract rates that beat list price. [Part 21, Part 25]

That's the menu. The remaining 28 modules teach each one with the math, the pitfalls, and a "build it" prescription you can take straight into your data model. Part 29 at the end gives you a reference architecture (how to wire all 29 inputs into a single state object, where the formulas live, how to validate the model) to clone.

A note on counting: the list has 29 items, not 25. The intro rounds down: once you collapse "input + output token size" into "token distribution" and make similar merges, you get a UI with ~25 visible controls.

Part 1—The atom of cost: a token

What an LLM API call actually is

When your application talks to a hosted Large Language Model — GPT, Claude, Gemini, Llama on Bedrock, anything — it's an HTTP request that sends a chunk of text and gets back a chunk of text. The provider charges by the size of those chunks, not by wall-clock time, not by request count, not by the user's identity. Just the text in and the text out.

There are no other line items at the LLM-call layer. No "compute hours," no "API call fees" beyond the per-token rate. (Tool calls, embeddings, multimodal inputs, and managed search add separate line items that we'll get to.) The atomic unit of cost is one token.

Tokens, in plain language

A token is a sub-word unit. The provider's model doesn't read characters or whole words; it reads tokens, which are roughly the units a learned compression algorithm came up with when it was trained.

Useful intuitions:

English: about 1 token per 0.75 words, or 4 characters per token. So 1,000 tokens ≈ 750 English words, ≈ 4,000 characters.
Code: about 1.3 tokens per word — code has more punctuation and identifier fragments that don't compress as well.
European languages (French, Spanish, German): 1.2–1.5 tokens per word.
CJK (Chinese, Japanese, Korean): 1.8–2.2 tokens per word. Tokenizers were trained on English-heavy corpora; non-Latin scripts compress poorly.
Arabic, Hebrew: 1.4 tokens per word.

These are rules of thumb. The actual count depends on the provider's tokenizer (OpenAI uses cl100k or o200k variants; Anthropic uses its own; Google uses SentencePiece). For back-of-envelope work, 1 token ≈ 4 characters of English is accurate within ±10%. For procurement-grade numbers, you should use the provider's published tokenizer or your production logs.
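A minimal sketch of the back-of-envelope estimate described above, assuming English prose and the 4-characters-per-token and 0.75-words-per-token rules of thumb. The function names are illustrative and not part of any provider SDK; for procurement-grade counts, swap in the provider's tokenizer.

```python
import math

# Rule-of-thumb ratios from the text; real tokenizers will differ.
CHARS_PER_TOKEN_ENGLISH = 4.0
WORDS_PER_TOKEN_ENGLISH = 0.75


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens from character length (English prose only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN_ENGLISH)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count (English prose only)."""
    return math.ceil(word_count / WORDS_PER_TOKEN_ENGLISH)


print(estimate_tokens_from_words(750))         # 1000
print(estimate_tokens_from_chars("x" * 4000))  # 1000
```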

The two halves of a bill: input and output

Every API call has two prices, often very different:

Input rate — what you pay per million tokens sent to the model (system prompt, user question, conversation history, retrieved documents, tool definitions, tool results).
Output rate — what you pay per million tokens the model generates back.

Output is roughly 4–6× more expensive than input for most flagship models. Why: generation is the autoregressive loop — each output token requires a full forward pass through the model. Input tokens are processed in parallel (one big matrix multiplication for the prompt). The provider charges for the work, and decoding is more work per token than prefill.

A few representative rates as of mid-2026 (you'll want a price-book table in your calculator that the user can edit when they have negotiated rates that differ from list, and a daily-refreshed fetcher if you can wire one up — the open-source reference at calc.ajinkya.ai uses a Cloudflare Worker for this):

Model	Input ($/M)	Cached input ($/M)	Output ($/M)	Output ÷ Input
GPT-5.5 (flagship)	$5.00	$0.50	$30.00	6×
GPT-4o-mini	$0.15	$0.075	$0.60	4×
Claude Sonnet 4.6	$3.00	$0.30	$15.00	5×
Claude Haiku 4.5	$0.80	$0.08	$4.00	5×
Gemini 3.1 Pro	$2.00	$0.20	$12.00	6×
Gemini 3.1 Flash-Lite	$0.10	$0.025	$0.40	4×
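One way to hold this table in code, as a sketch: a frozen dataclass per model plus a per-model override map for contract rates. The class, function, and field names are assumptions made for illustration, and the negotiated $2.40 rate is invented; the list prices are the illustrative ones from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelPrice:
    """Prices in USD per million tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# Illustrative list prices copied from the table above.
LIST_PRICES = {
    "GPT-5.5 (flagship)": ModelPrice(5.00, 0.50, 30.00),
    "GPT-4o-mini": ModelPrice(0.15, 0.075, 0.60),
    "Claude Sonnet 4.6": ModelPrice(3.00, 0.30, 15.00),
    "Claude Haiku 4.5": ModelPrice(0.80, 0.08, 4.00),
    "Gemini 3.1 Pro": ModelPrice(2.00, 0.20, 12.00),
    "Gemini 3.1 Flash-Lite": ModelPrice(0.10, 0.025, 0.40),
}


def effective_price(model: str, overrides: dict) -> ModelPrice:
    """Return the list price with any user-entered contract rates applied.

    `overrides` maps model name -> {field name: negotiated rate}.
    """
    return replace(LIST_PRICES[model], **overrides.get(model, {}))


# Hypothetical negotiated rate: only the input price is overridden.
negotiated = {"Claude Sonnet 4.6": {"input_per_m": 2.40}}
sonnet = effective_price("Claude Sonnet 4.6", negotiated)
print(sonnet)
```

Keeping overrides separate from the list prices means a refreshed price book never silently overwrites a contract rate.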

The "cached input" column is critical and we'll spend a whole module (Part 5) on it. For the flagship rows it's about 10% of the list input rate; the two small models in the table discount less (50% and 25% of list). Whether you hit the cached rate on 70% of your input or 90% of it changes the input bill by a factor of roughly two.
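The factor-of-two claim can be checked with a blended input rate. This is a simplified sketch: it assumes every input token bills at either the list rate or the cached rate, and it ignores cache-write surcharges. The function name is illustrative.

```python
def blended_input_rate(list_rate: float, cached_rate: float, hit_rate: float) -> float:
    """Average $/M paid on input when `hit_rate` of tokens bill at the cached rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_rate + (1.0 - hit_rate) * list_rate


# Flagship row from the table: $5.00 list, $0.50 cached.
rate_at_70 = blended_input_rate(5.00, 0.50, 0.70)  # about $1.85/M
rate_at_90 = blended_input_rate(5.00, 0.50, 0.90)  # about $0.95/M
ratio = rate_at_70 / rate_at_90                    # about 1.95
print(round(rate_at_70, 2), round(rate_at_90, 2), round(ratio, 2))
```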

A first naive number

Imagine the simplest possible deployment. One agent, one model, one user, one question per session. The user types ~100 tokens. The model replies with ~400 tokens. You're using GPT-5.5 at $5/M input and $30/M output.

Cost per query:

input cost  = 100 tok × $5/M     = $0.0005
output cost = 400 tok × $30/M    = $0.0120
total       = $0.0125 per query

A million queries a month = $12,500/month.

This is the "back-of-envelope" cost calculator that lives in two dozen browser tabs. It is a real number — it's just the wrong one.
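For reference, the naive arithmetic above as a function. This is deliberately the oversimplified model the rest of the tutorial takes apart; the function name is illustrative.

```python
def naive_cost_per_query(input_tokens: int, output_tokens: int,
                         input_per_m: float, output_per_m: float) -> float:
    """Per-query cost in USD from token counts and $/M list rates only."""
    input_cost = input_tokens * input_per_m / 1_000_000
    output_cost = output_tokens * output_per_m / 1_000_000
    return input_cost + output_cost


# 100 tokens in, 400 tokens out, at $5/M input and $30/M output.
per_query = naive_cost_per_query(100, 400, 5.00, 30.00)  # about $0.0125
per_month = per_query * 1_000_000                        # about $12,500
print(round(per_query, 4), round(per_month))
```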

Why that first number is wrong

In the back-of-envelope above, I treated the input as if it were the user's question alone — 100 tokens. Real production agents send dramatically more than the question on every call. A representative federal-program agent (the public-facing geospatial Q&A example used in the paper, anchored to a measured production session) sends roughly 5,000 tokens of input before the user has typed a word.

I also treated this as one call per user query. Real agents call the LLM multiple times per query (planner, retriever, tool agent, summarizer). And I assumed every query has the same shape. Real production traffic has at least five different query shapes that differ in cost by 15×.

I ignored caching, retries, multi-tenancy of the GPU fleet, verification, the cost of refusing out-of-scope queries, the cost of running on FedRAMP-Moderate hardware versus commercial, the personnel who keep the deployment up, the database, the audit log retention, the FactReasoner verifier, and a dozen other line items.

A useful summary: a generic per-token calculator for an AI agent deployment is wrong by a factor of two on average and an order of magnitude under aggressive assumptions. Most of this post is about the surface area of those errors.

The right cost question is not "$X/M tokens × N tokens." It is "what does the program actually deliver, at the budget the program actually has?" That answer pivots on the modules below.

Part 2—How big is one query, really?

This is the first module where you'll wire your token-size input directly into your formulas. Set this control's defaults wrong and every downstream number in your calculator will be wrong by the same factor — token-size is the multiplicative foundation of everything that follows.

The system prompt (and why it dwarfs the question)

Every modern agent has a system prompt — a chunk of instructions that tells the model who it is, what tools it has, what policies to follow, what format to respond in. The system prompt is sent on every API call, even though the user doesn't see it.

Production system prompts are not short. Real ones I've measured:

A federal science-agency Q&A agent: 1,640 tokens.
A customer-support agent with three tools and a refusal policy: 1,180 tokens.
A research orchestrator with five sub-agents: 2,300 tokens (it includes role descriptions and routing rules).

Compare that to "the user's question" which is typically 50–200 tokens. The system prompt outweighs the question by roughly 10–30×.

The good news: system prompts are cacheable. Once we get to caching (Part 5), most of this cost gets discounted to ~10% of list price. The bad news: only on cache hits. The cold-start turn of any session pays the full system prompt at the input rate.

Tool schemas, every turn

If your agent uses tools — function calling, MCP servers, plugins, retrieval — you have to send the tool definitions to the model on every turn. Each tool definition is roughly:

A name (~5 tokens)
A description (~50–100 tokens)
A parameter schema (typically JSON Schema, ~50–100 tokens)

Total: about 150 tokens per tool, every turn. A typical agent has 3–7 tools, so 450–1,050 tokens of schema overhead per turn just to advertise the tool catalog.

These tokens go in input on every call. They do cache, like the system prompt — provided you keep the tool definitions identical across calls. If you're dynamically constructing schemas per request (don't do this), you'll bust the cache.

Conversation history grows linearly

If a user takes multiple turns in one session — and most real deployments have multi-turn sessions — every turn sends the entire prior conversation as context. Turn 5 sends turns 1–4 plus the new user message. Turn 10 sends turns 1–9.

Conversation history grows linearly with turns. If each turn is ~600 tokens (300 user, 300 assistant), turn 10 carries ~5,400 tokens of history alone, before the system prompt and tool schemas.
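A short sketch of what that growth does to a whole session. History per turn is linear in the turn number, so the total history billed across the session grows with the square of the turn count. The sketch assumes a flat 600 tokens per turn and no summarisation or caching; the function names are illustrative.

```python
def history_tokens_at_turn(turn: int, tokens_per_turn: int = 600) -> int:
    """Tokens of prior conversation re-sent on a given 1-indexed turn."""
    return (turn - 1) * tokens_per_turn


def cumulative_history_tokens(turns: int, tokens_per_turn: int = 600) -> int:
    """Total history tokens sent across a session of `turns` turns."""
    return sum(history_tokens_at_turn(t, tokens_per_turn)
               for t in range(1, turns + 1))


print(history_tokens_at_turn(10))     # 5400
print(cumulative_history_tokens(10))  # 27000
```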

Long-running sessions eventually hit the model's context window — the maximum input size, typically 128K (GPT-4o), 200K (Claude), 256K (GPT-5), 1M+ (Gemini Pro, Claude Opus). Hitting the limit triggers conversation summarisation: the agent makes an extra LLM call to compress old turns into a shorter summary before continuing.

Build it: when context-fill exceeds ~70% of the model's window, charge a summarisation surcharge, applied once per overflow event. A defensible default:

30% of current context tokens at input rate
  + 30% × 30% at output rate

Most simple calculators omit this entirely and underestimate multi-turn sessions by 15–25%.
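A sketch of that surcharge under one reading of the default: 30% of the current context is re-read at the input rate, and the summary written back is 30% of that slice, billed at the output rate. That reading, the parameter names, and the example numbers are assumptions; adjust them to whatever your own summariser does.

```python
def summarisation_surcharge(context_tokens: int, window_tokens: int,
                            input_per_m: float, output_per_m: float,
                            fill_threshold: float = 0.70,
                            summarised_share: float = 0.30,
                            summary_ratio: float = 0.30) -> float:
    """USD charged for one overflow event; 0.0 if the window is not full enough."""
    if context_tokens <= fill_threshold * window_tokens:
        return 0.0
    tokens_in = summarised_share * context_tokens   # old turns re-read
    tokens_out = tokens_in * summary_ratio          # compressed summary written
    return (tokens_in * input_per_m + tokens_out * output_per_m) / 1_000_000


# 100K tokens of context in a 128K window, at $5/M input and $30/M output.
surcharge = summarisation_surcharge(100_000, 128_000, 5.00, 30.00)
print(round(surcharge, 2))  # 0.42
```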

RAG chunks: 5 × 512 + 128 by default

If your agent uses Retrieval-Augmented Generation (RAG) — fetching documents from a vector database to ground the answer — those retrieved chunks become input tokens.

A reasonable default RAG configuration:

5 chunks retrieved per query
512 tokens per chunk
128 tokens to embed the user's question (for vector search)

Total: 5 × 512 + 128 = 2,688 RAG tokens per turn.

If the agent does multi-hop retrieval (asks the vector DB multiple times in one turn), multiply by retrieval calls per turn. A research agent that does 3 retrieval rounds per turn costs 3 × 2,688 ≈ 8,000 RAG tokens.
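The same arithmetic as a function, so the defaults can become editable controls. The function and parameter names are illustrative.

```python
def rag_tokens_per_turn(chunks: int = 5, tokens_per_chunk: int = 512,
                        query_embed_tokens: int = 128,
                        retrieval_rounds: int = 1) -> int:
    """RAG tokens added to one turn, repeated for each retrieval round."""
    return retrieval_rounds * (chunks * tokens_per_chunk + query_embed_tokens)


print(rag_tokens_per_turn())                    # 2688
print(rag_tokens_per_turn(retrieval_rounds=3))  # 8064
```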

We'll cover RAG cost in detail in Part 9, including the ingest cost (embedding the corpus once, amortised) and the vector DB hosting bill.

Multimodal: images, audio, PDFs, code interpreter

If your agent accepts non-text inputs, each modality has its own token cost. Your calculator should let users add multimodal counts as additional input-token surcharges:

Modality	Token cost
Image, 1568 × 1568 px	~1,568 tokens
Image, OpenAI low-detail	~85 tokens
Image, OpenAI high-detail	~85 + 170 × tiles
Audio, 1 minute	~1,500 tokens (Whisper baseline, ~25/sec)
PDF page	~1,500 tokens
Code interpreter stdout/stderr (fed back)	50–2,000 tokens

The numbers compound fast. A user uploading a 10-page PDF adds ~15,000 tokens to every turn that PDF is in context. If the conversation goes 6 turns with the PDF persistently attached, that's 90,000 token-turns just for the PDF. At flagship rates, ~$0.45 per session before the agent has done any reasoning.

A useful production pattern: detach the PDF after the first turn and replace it with a summary or a tool that retrieves relevant pages on demand. This "tool-state bypass" trick (also common for geometry payloads — GeoJSON, vertex lists, where replacing the full payload with a 5-vertex bbox can be a 100× saving) is worth a checkbox in your UI: "Persist large tool outputs in context?" — defaulting to off changes long-session math by an order of magnitude.
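A rough sketch of the persist-versus-detach comparison for the 10-page PDF example. It assumes ~1,500 tokens per page, uncached input at the flagship $5/M rate, and a hypothetical 500-token summary that replaces the document after the first turn; the function names are illustrative.

```python
def attachment_token_turns(pages: int, turns: int, persist: bool,
                           tokens_per_page: int = 1_500,
                           summary_tokens: int = 500) -> int:
    """Input tokens spent on one attachment across a session.

    persist=True : the full document is re-sent on every turn.
    persist=False: full document on turn 1, then a summary of
                   `summary_tokens` (a hypothetical size) on later turns.
    """
    if turns < 1:
        return 0
    full = pages * tokens_per_page
    if persist:
        return full * turns
    return full + summary_tokens * (turns - 1)


def input_cost_usd(tokens: int, input_per_m: float = 5.00) -> float:
    """Uncached input cost in USD at a $/M rate."""
    return tokens * input_per_m / 1_000_000


persisted = attachment_token_turns(pages=10, turns=6, persist=True)
detached = attachment_token_turns(pages=10, turns=6, persist=False)
print(persisted, input_cost_usd(persisted))  # 90000 tokens, about $0.45
print(detached, input_cost_usd(detached))    # 17500 tokens
```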

The few-shot, JSON-schema, citation, memory line items

A handful of smaller categories add up quickly. Group them together in your UI under a "Prompt-engineering overhead" panel:

Few-shot examples (~250 tokens each, often 3–10) — example pairs in the prompt to show the model the desired output format. Cache well; they don't change between calls.
JSON schema enforcement — a JSON Schema in the prompt telling the model what shape to return. ~200–500 tokens.
Citations / grounding — extra markup the model is instructed to add to outputs (e.g., [doc:42 p3] per claim). Adds ~50–150 tokens of output per response.
Persistent memory / long-term agent memory — a chunk of cross-session state injected into every prompt (user preferences, past topics). 200–1,000 tokens depending on the memory architecture.

Individually small. Stacked, they can add 1,000–2,500 tokens of input per turn.

Output is usually smaller than you think

We've spent a lot of words on input. Output is usually a fraction of input — sometimes a small fraction.

Naïve simulators assume a 6:1 input-to-output ratio, calibrated against chat-style agents. Real production tool-orchestration agents look very different. Measured numbers from the open-source agent-cost-bench:

Agent shape	I/O ratio
Chat-style	6:1
Sequential pipeline	2:1
Tool-chain	9:1
Long-sysprompt + output-suppression	73:1 to 88:1

The long-sysprompt + suppression shape is the big surprise. The paper's worked-example geospatial Q&A agent uses a system prompt with rules like:

After receiving stats results, DO NOT print any statistics values, summaries, interpretations, or analysis. Instead, simply say "Statistics retrieved." and STOP.

These output-suppression rules are common in tool-orchestration agents — the agent is supposed to call tools, not narrate them. The result is per-call output that's a tiny fraction of input. On gpt-4o-mini the bench measured 73:1; on gpt-5.2 (the model the deployment actually ships) it's 88:1.

The 6:1 default badly understates input cost for any tool-orchestration agent.
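To see why the ratio matters, here is a rough sketch of how much of the token spend is input at each measured ratio. It assumes output costs 5× input per token (the middle of the 4–6× range quoted earlier) and ignores caching; the function name is illustrative.

```python
def input_share_of_spend(io_ratio: float, output_price_multiple: float = 5.0) -> float:
    """Fraction of token spend that is input.

    `io_ratio` is input tokens per output token; `output_price_multiple`
    is the output rate divided by the input rate.
    """
    return io_ratio / (io_ratio + output_price_multiple)


# Ratios from the table above.
for shape, ratio in [("chat-style", 6), ("sequential pipeline", 2),
                     ("tool-chain", 9), ("long-sysprompt", 73)]:
    print(f"{shape}: {input_share_of_spend(ratio):.0%} of spend is input")
```

At 6:1 input is a little over half the spend; at 73:1 it is well over nine-tenths, which is why the input-side controls deserve most of the modelling effort for tool-orchestration agents.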

Worked example: a 5,650-token "simple" question

Putting Part 2 together. A federal Q&A agent, single turn:

sysprompt           1,640 tok   (cached after turn 0)
tool schemas (3)      450 tok   (cached)
RAG chunks (5×512)  2,560 tok   (sometimes cached if same chunks)
RAG query embed       128 tok
user question         100 tok
few-shot (3 × 250)    750 tok   (cached)
JSON schema          (none)
citations enforced   on output
persistent memory      0 tok   (none)
                  ─────────
total input         5,628 tok

assistant output      400 tok   (suppressed if stats lookup;
                                 longer if explanatory answer)

Per-turn input: ~5,650 tokens. The "user just asked a 100-token question" intuition under-counts by roughly 56×.
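The same tally as a data structure, which is roughly the shape a calculator's per-turn input model could take. The key names are illustrative; the values are the ones from the worked example.

```python
# Per-turn input components for the worked example, in tokens.
INPUT_COMPONENTS = {
    "system_prompt": 1_640,
    "tool_schemas": 3 * 150,
    "rag_chunks": 5 * 512,
    "rag_query_embed": 128,
    "user_question": 100,
    "few_shot": 3 * 250,
    "json_schema": 0,
    "persistent_memory": 0,
}

total_input = sum(INPUT_COMPONENTS.values())
undercount = total_input / INPUT_COMPONENTS["user_question"]
print(total_input)            # 5628
print(round(undercount, 1))   # 56.3
```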

This is before we consider what happens when the user asks a hard question, what happens on a follow-up turn that drags along history, what happens if the agent makes 3 LLM calls internally (planner → retriever → summarizer), or what happens on an out-of-scope query that gets refused.

That is what the next module is about.

Part 3—Five shapes of a query

Real production traffic is not one shape. It's a mix of shapes that differ in cost by an order of magnitude. Your calculator needs two paired controls: a question-types editor (where the user defines each shape and its cost factor) and a traffic-mix editor (where the user sets the weights on those shapes). Treat them as a single panel in your UI; they're useless apart.

The five shapes most production agents see, with anchored token budgets from the worked-example measurement:

Full pipeline

The full thing: decompose the query, retrieve relevant documents, plan tool calls, execute tools, summarize results.

Input: ~84,490 tokens (large because of multi-step tool results plus 5-chunk RAG plus history)
Output: ~854 tokens
Cache eligible: yes
Suggested default factor: 1.0 (treat this as the reference shape and define every other shape relative to it)

A full query is what most calculators quote as "the cost per query," which is why their top-line numbers are alarming. Most production traffic is not full.

RAG-only

The user asks a knowledge question; the agent retrieves documents and answers. No tool calls, no decomposition.

Input: ~16,000 tokens (system prompt + RAG chunks + history)
Output: ~400 tokens
Cache eligible: yes
Default factor: 0.20 (one-fifth the cost of full)

Partial (route to human, clarification)

The pipeline starts but terminates early — the agent asks for clarification, escalates to a human, or fails to find evidence.

Input: ~52,000 tokens (some retrieval happened, but no full tool execution)
Output: ~400 tokens
Cache eligible: yes
Default factor: 0.65

Refusal (out-of-scope)

The user asked something the agent isn't supposed to answer (jailbreak attempt, off-topic question, abusive content). The agent recognises this from the system prompt's policy and returns a refusal.

Input: ~5,000 tokens (system prompt + tools + the question; no RAG, no tool calls)
Output: ~100 tokens
Cache eligible: usually no (the question itself isn't a cacheable prefix)
Default factor: 0.06

A refusal costs about 6% of a full query. That's a spread of more than 15× between the full pipeline and the cheapest shape, and 20× if you measure from the heavy shape (factor 1.20) below.

Heavy (long-context multi-turn)

The pathological case: a long-running session where the user keeps asking follow-ups, history grows, and the agent has to summarise old turns.

Input: ~100,000 tokens
Output: ~1,000 tokens
Cache eligible: partially
Default factor: 1.20

The 15× spread between full and refusal

If your traffic mix is 40% full / 30% rag / 15% partial / 10% refusal / 5% heavy, the weighted-average per-query cost is about 0.62 × cost of full (0.40 + 0.06 + 0.0975 + 0.006 + 0.06). If your traffic mix is 90% full / 10% refusal, it's 0.91 × cost of full.

A generic calculator picks one shape (usually "full") and quotes it. Your bill is a weighted average across shapes, and the weighting can swing the answer by 2–3×.

Worked example: weighting a real production mix

Take a tool-orchestration deployment with output-suppression rules. After three months in production, the team logs the actual mix:

Shape	Weight	Factor	Contribution
Full pipeline	38%	1.0	0.38
RAG-only	22%	0.20	0.044
Partial	18%	0.65	0.117
Refusal	14%	0.06	0.0084
Heavy	8%	1.20	0.096
Weighted	100%	—	0.65

Weighted-average per-query cost is 0.65 × the full-pipeline cost. If the modelled "full" cost is $0.080/query, the weighted is $0.052. At 1M queries/month, that's $52,000/mo vs $80,000/mo. The mix is responsible for a $28K/mo difference; if you size to "full", 35% of that budget is never spent, which is over-budgeting by about 54% relative to the real bill.
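A sketch of the weighting step using the mix from the table. The dictionary layout and function name are assumptions; the weights and factors are the ones logged in the worked example.

```python
# shape -> (traffic weight, cost factor relative to the full pipeline)
MIX = {
    "full": (0.38, 1.00),
    "rag_only": (0.22, 0.20),
    "partial": (0.18, 0.65),
    "refusal": (0.14, 0.06),
    "heavy": (0.08, 1.20),
}


def weighted_factor(mix: dict) -> float:
    """Traffic-weighted cost factor; weights must sum to 100%."""
    total_weight = sum(weight for weight, _ in mix.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("traffic weights must sum to 1.0")
    return sum(weight * factor for weight, factor in mix.values())


factor = weighted_factor(MIX)              # about 0.645, shown as 0.65 in the table
weighted_cost = factor * 0.080             # USD per query, given $0.080 for "full"
monthly_cost = weighted_cost * 1_000_000   # at 1M queries/month
print(round(factor, 4), round(weighted_cost, 4), round(monthly_cost))
```

Validating that the weights sum to 100% is worth doing in the UI as well; a mix that silently sums to 95% understates the bill by the same 5%.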

The point is not to be more optimistic. It's to size against the right number. Procurement reviewers will notice if your math is calibrated to the worst shape; they will also notice if it's calibrated to the cheapest. The defensible number is the measured one, with the mix made explicit so the reviewer can substitute their own weights.

Part 4—Volume: who, when, and how much

We've now sized one query (~5,650 tokens) and weighted across shapes (×0.65 in the worked example). The other half of the math is how many queries per month. This module is short because the formula is short, but the wrong volume number is one of the easiest ways to be off by 5×.

Your calculator needs three coupled controls here: an audience-segments editor, a bot-rate-limiting strategy (which caps anonymous abuse traffic), and a burst/peak multiplier (which drives reservation sizing in Part 14).

Sessions vs queries vs turns

Three units that calculators routinely conflate:

Session — a contiguous interaction with the agent, like a conversation. One session can have many turns.
Turn — one user message + one assistant response.
Query — same as turn for single-shot agents; for some systems "query" means "session" because each session asks one logical question with multiple back-and-forths.

Standardise on queries as your billing unit. A session of 5 turns counts as 5 queries. Expose a "questions per session" multiplier so users who only know their session count can convert. Conflating session-count with query-count is the most common volume bug in the wild.

Why this matters: a chatbot with 1,000 sessions/day and 8 turns/session is doing 8,000 queries/day, not 1,000. The 8× is real and ignoring it is a very common error.

Audience segments (and why you must split them)

Your users are not one cohort. A typical federal deployment has roughly:

Internal staff (authenticated): 500 monthly active users, 5 sessions/day each, 4 questions/session. → ~300,000 queries/month.
Public visitors (anonymous): 10,000 MAU, 0.2 sessions/day each, 1.5 questions/session. → ~90,000 queries/month.

These two segments cost very differently per query because they have different cache hit rates (Part 5) and different shape mixes (internal staff use the full pipeline more; anonymous users hit the refusal path more).

Aggregating "10,500 total users" hides a 10× cost swing between the segments. Your audience-segments editor must allow arbitrary segments (an "Add another segment" affordance, not a fixed two-row layout) — and the totals row should show weighted query volume across all segments, not a sum of MAU.

MAU × sessions/day × queries/session × 30

The volume formula:

monthly queries  =  Σ_segments [ MAU
                                × sessions_per_day
                                × questions_per_session
                                × 30 ]

For the example above:

Internal:  500 × 5 × 4 × 30  =  300,000
Public:  10,000 × 0.2 × 1.5 × 30 =  90,000
                                 ─────────
                                 390,000 / month

Bot factor for public endpoints

Public-facing endpoints get extra traffic that isn't from real users — search-engine crawlers, scrapers, abuse, accidental load tests, malicious traffic.

A defensible default bot factor is 1.5×. Anonymous public segments have their effective query volume multiplied by 1.5× to account for this. So the example's public segment becomes 90,000 × 1.5 = 135,000 queries/month for billing purposes.

The bot factor only applies to anonymous segments; authenticated users are assumed to be real because they had to sign in. Your calculator should make the bot factor editable per segment, and should cap it by the bot-rate-limiting strategy elsewhere — see Part 18.
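A minimal sketch of a segment model with a per-segment bot factor, using the two example segments. The class and field names are illustrative, and the 30-day month is the same simplification the formula above makes.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    mau: int
    sessions_per_day: float
    questions_per_session: float
    bot_factor: float = 1.0  # 1.0 for authenticated segments

    def monthly_queries(self, days: int = 30) -> float:
        """Billable queries per month, including bot traffic."""
        return (self.mau * self.sessions_per_day
                * self.questions_per_session * days * self.bot_factor)


segments = [
    Segment("internal staff", 500, 5, 4),
    Segment("public visitors", 10_000, 0.2, 1.5, bot_factor=1.5),
]

for segment in segments:
    print(segment.name, round(segment.monthly_queries()))

total_queries = sum(s.monthly_queries() for s in segments)
print(round(total_queries))  # 435000
```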

Diurnal: peak/avg ratio and the 5%-per-level surcharge

Steady-state traffic doesn't exist. You have a daytime peak, a nighttime trough, weekday-vs-weekend swings, and occasional event spikes (press release day, news cycle, crisis response).

A peak/avg ratio control is the ratio between the user's peak-hour query rate and their average rate. Default it to 1.0× (meaning: traffic is flat, which it never is — but most users will mentally start there). Realistic values:

2× — gentle daytime/nighttime cycle, typical for most deployments.
3–5× — daytime-only B2B tools (heavy 9–5, almost zero overnight).
5–10× — public news / press / weather endpoints with major event spikes.

Above 2×, add a 5% surcharge per ratio level to base cost — e.g., a 5× peak adds 5% × 3 = 15% to base cost. This represents rate-limit overage charges, queue overflow handling, or premium-tier capacity needed to absorb peaks. (You can refine this surcharge once you have measured data; the linear-in-level shape is a defensible starting point.)
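The surcharge rule as a function. This sketch treats the rule as linear for fractional ratios too, which is an assumption; the function and parameter names are illustrative.

```python
def peak_surcharge(peak_to_avg: float, free_ratio: float = 2.0,
                   per_level: float = 0.05) -> float:
    """Fractional surcharge on base cost for a given peak/avg ratio."""
    return max(0.0, peak_to_avg - free_ratio) * per_level


def cost_with_peak(base_cost: float, peak_to_avg: float) -> float:
    """Base cost plus the peak surcharge."""
    return base_cost * (1.0 + peak_surcharge(peak_to_avg))


print(round(peak_surcharge(5.0), 2))  # 0.15
print(peak_surcharge(2.0))            # 0.0
```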

Growth: compounding monthly

Procurement defends 12-month numbers, not steady-state. The growth per month slider compounds your traffic forward:

month_12_volume  =  current_volume × (1 + growth_rate)^12

Default is 20%/month, which compounds to 8.9× over 12 months. Aggressive but realistic for a launching deployment. After 12 months you're typically at a new plateau and growth flattens — at which point you re-baseline.
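As a quick check of the compounding formula, here is a minimal Python sketch. The function name is hypothetical, and the 390,000 starting volume is simply the pre-bot-factor example total from earlier in this part.

```python
def projected_volume(current_volume: float,
                     growth_rate: float,
                     months: int = 12) -> float:
    """Compound a monthly growth rate forward by `months` months."""
    return current_volume * (1 + growth_rate) ** months


multiplier = projected_volume(1.0, 0.20)      # growth multiple after 12 months
month_12 = projected_volume(390_000, 0.20)    # example volume projected forward
```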

Burst days and the burst factor

Distinct from the diurnal peak: burst days are individual days that go far above the daily cap (Part 17). A hurricane response day at a weather agency. A press-event spike. A breaking news cycle that lands on your endpoint.

Two parameters under your daily-cap control:

Expected burst days/month — typically 7
Burst-day multiplier — typically 3–10×

A 7-burst-day month at 5× multiplier means 7 days at 5× normal cost plus 23 days at normal cost; your output report should break the burst-bill out as a separate line so reviewers can see what they're committing to.
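One way to produce that separate burst line is sketched below in Python. The function name, the dictionary keys, and the $100 daily cost are illustrative assumptions; the 7 days and 5× multiplier are the example values from the paragraph above.

```python
def burst_breakdown(daily_cost: float,
                    burst_days: int = 7,
                    burst_multiplier: float = 5.0,
                    days_in_month: int = 30) -> dict:
    """Split a month's cost into normal-day and burst-day line items."""
    normal = daily_cost * (days_in_month - burst_days)
    burst = daily_cost * burst_multiplier * burst_days
    return {"normal_days": normal, "burst_days": burst, "total": normal + burst}


report = burst_breakdown(100.0)  # hypothetical $100/day normal cost
```

Returning the two components separately, rather than only the total, is what lets the report show the burst bill as its own line.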

Build it: compute your top-line monthly query volume as

sum_over_segments(MAU × sessions × queries
  × 30 × bot_factor) × growth_projection

That's the number the rest of your math multiplies by per-query cost. Show this volume at the top of every report so reviewers can sanity-check it without reading the whole derivation.
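A minimal Python sketch of that volume calculation follows. The `Segment` fields and function name are hypothetical; the numbers are the internal and public segments from the example earlier in this part, with the 1.5× bot factor applied only to the anonymous segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    mau: int                      # users in the segment
    sessions_per_day: float       # sessions per user per day
    queries_per_session: float
    anonymous: bool = False
    bot_factor: float = 1.5       # only applied to anonymous segments


def monthly_queries(segments, growth_projection: float = 1.0, days: int = 30) -> float:
    """Sum per-segment volume, apply bot factor, then the growth projection."""
    total = 0.0
    for s in segments:
        factor = s.bot_factor if s.anonymous else 1.0
        total += s.mau * s.sessions_per_day * s.queries_per_session * days * factor
    return total * growth_projection


segments = [
    Segment("internal", 500, 5, 4),
    Segment("public", 10_000, 0.2, 1.5, anonymous=True),
]
volume = monthly_queries(segments)  # 300,000 internal + 135,000 public
```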

The volume gets multiplied by the weighted-per-query cost from Part 3.

That's the full base equation:

monthly_cost  =  weighted_per_query_cost  ×  monthly_query_volume

Everything else in this tutorial is either making the per-query cost more accurate (caching, reasoning, tools, verification) or adding line items that aren't per-query at all (compute, personnel, federal multipliers, ATO).

Part 5—Caching: the most consequential single number

If you only read one module of this post, read this one. Caching is the single largest swing knob in the per-query math. The difference between a 75% cache hit rate and a 91% cache hit rate at typical input volumes is roughly a factor of two on the input bill — which is where most of the cost lives for tool-orchestration agents.

The cache hit rate is the most important single slider in any calculator you build, so give it top-line visibility. Behind it, expose three finer controls that defenders will eventually want: cache write share, batch async %, and per-agent overrides (different agents in a multi-agent topology cache very differently; see Part 7).

What a prompt cache actually caches

Every modern provider — OpenAI, Anthropic, Google, Bedrock — operates a server-side prompt cache. The cache stores the intermediate computational state ("KV cache") of recent prompt prefixes, keyed by the exact prefix bytes.

When you send a request whose first N tokens match a cached prefix, the provider skips the prefill computation for those tokens and discounts them on your bill. Your input bill is:

input_cost  =  (1 - hit_rate) × input_tokens × rate_in
              +     hit_rate  × input_tokens × rate_in × 0.10

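Cache hits cost ~10% of the list input rate in this formula, but the discount varies by provider and model: OpenAI ranges from about 10% to 50% of list depending on the model (the gpt-4o-mini rates in the worked example later in this part are 50% of list), Anthropic charges 10% on reads, and Bedrock varies by model.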
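The input-cost formula above translates directly into Python. This is a sketch: the function name is made up, the $3/M rate in the example is a placeholder rather than any provider's price, and `read_multiplier` should be set to whatever your provider actually charges for cache reads.

```python
def input_cost(input_tokens: float,
               rate_in_per_m: float,
               hit_rate: float,
               read_multiplier: float = 0.10) -> float:
    """Input bill in dollars for a given cache hit rate."""
    rate = rate_in_per_m / 1_000_000
    uncached = (1 - hit_rate) * input_tokens * rate
    cached = hit_rate * input_tokens * rate * read_multiplier
    return uncached + cached


# 1M input tokens at a placeholder $3/M list rate
cost_at_75 = input_cost(1_000_000, 3.0, 0.75)
cost_at_91 = input_cost(1_000_000, 3.0, 0.91)
ratio = cost_at_75 / cost_at_91  # how much more the 75% case pays
```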

What's cacheable is the prefix — the leading bytes of your prompt, in order. The system prompt typically caches well because it's identical across calls. Few-shot examples cache well. Tool schemas cache well. Conversation history caches turn-by-turn (turn 5's prefix includes turns 1–4 verbatim).

What does NOT cache: the user's current message (varies every turn), retrieved RAG chunks if the chunks change per query (common), tool results (varies per tool call). The "cacheable" column in the query-shape table earlier in this post is whether the bulk of that shape's input has stable prefixes.

The 90% read discount

To be precise, cache reads are typically priced at 0.10× list rate for OpenAI and similar providers. Anthropic's explicit cache_control mechanism prices cache reads at 0.10× list and cache writes at 1.25× list (the first call that primes the cache pays a write premium).

This is why your calculator needs a cache write share control in addition to the hit rate. Without it, you can't model cold-start sessions or low-traffic deployments correctly.

A hot cache (steady-state production) has hit rates around 80–95% and write shares around 5–15%. A cold cache (low traffic, fresh deployment, end-of-TTL) has hit rates around 40–60% and write shares around 30–50%.

A defensible default for the cache write share is 10% (steady state). Document this default explicitly in your UI's help text — for low-volume deployments, deployments where TTL frequently expires before reuse, or fresh launches, users should be told to override to 30–50%.

The 5-minute TTL on Anthropic prompt caching is a real operational constraint: if your traffic is bursty enough that cached prefixes age out between calls, you pay write premiums on every call instead of read discounts. A common mitigation is to send a tiny "keep-alive" no-op call every few minutes to refresh the TTL — only worth doing if your write share is materially hurting.

The hit-rate curve (turn 0 vs warm turns)

A flat cache-hit-rate coefficient is structurally wrong. Cache hit rate is a curve across turns, not a constant. From the bench's cached-pipeline.yml scenario (1700-token shared sysprompt across 18 calls):

cold turn 0:        0.61   (just the system prompt cached)
warm turns 1-5:     0.94 - 0.97   (steady state)
overall median:     0.91

Turn 0 has a 0.61 hit rate because only the sysprompt is cacheable; the full first message hasn't been seen yet. Turn 5 hits 0.97 because the sysprompt + tool schemas + few-shot + turns 1–4 are all in the cache from previous calls.

Most calculators ship a single coefficient (0.84 was the common historical default; recent measurements push it closer to 0.91 for steady-state production). Ship a simple top-line slider for users who only want one number, but expose a curve-aware per-turn correction behind an "Advanced caching" expander for users who care about session-length modeling.

Segment-aware: 75-80% anon vs 85-90% auth

Cache hit rate is also a property of the user segment, not just the system. Two real anchors:

Authenticated users in long sessions (5+ turns): 85–90% hit rate. The sysprompt is amortised across many turns, conversation history caches turn-over-turn, retrieved chunks are often the same across the session (same user, same topic).
Anonymous public visitors in short sessions (1–2 turns): 75–80% hit rate. Most requests are cold-start; conversation history is empty or one turn deep; topic and retrieved chunks differ per visitor.

The cost difference between 75% and 88% at typical input volumes is roughly 1.5× on the input line (0.325 vs 0.208 of list rate when cache reads bill at 10% of list). If you have both audiences in one deployment (and federal deployments often do: internal scientists plus public access), one global cache rate will be wrong for both.

The fix: per-segment cache hit rate. Each audience segment in your segments editor should expose its own cache slider, and your weighted-per-query cost must integrate across segments. A single global cache slider is fine for v1, but list "per-segment cache" in your "v1.5 must-have" backlog.

Provider differences: OpenAI auto vs Anthropic explicit

The two big providers handle caching differently:

OpenAI / Azure OpenAI: automatic prefix-based caching. No code changes; the provider matches identical leading bytes against its own cache. The cache extends as conversation grows.
Anthropic: explicit cache_control markers on the blocks you want cached. You opt in per block. The cache covers the prompt prefix up to and including each marked block; content after the last marker isn't cached unless it is also marked. 5-minute TTL.

For the same workload:

OpenAI gpt-4o-mini (auto):                  median hit rate 0.91
Anthropic claude-sonnet-4.6 (explicit):     median hit rate 0.77

Anthropic loses ~14 percentage points on the same workload because explicit cache_control caches less aggressively than OpenAI's auto-extension. This is a real procurement consideration if your workload is cache-heavy.

Three caching controls, ranked by importance

In rough order of how much each one moves the bill:

Cache hit rate (0–1) — top-line hit rate. The single most important control in your calculator.
Cache write share (default 0.10) — fraction of cached tokens that are first-write (premium-priced). Higher for cold-start, lower for steady-state.
Batch async % (default 0%) — fraction of queries that can tolerate batch-tier latency (hours, not seconds). Batch-tier is ~50% off; tolerable for offline workloads, not for interactive endpoints.
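The three controls can be folded into one effective multiplier on the list input rate. The Python sketch below is one possible combination, not a provider formula: it uses the 0.10× read and 1.25× write multipliers quoted earlier, treats write share as the fraction of cached tokens billed as writes, and assumes the batch discount applies multiplicatively on top (providers may combine these differently).

```python
def effective_input_multiplier(hit_rate: float,
                               write_share: float = 0.10,
                               read_mult: float = 0.10,
                               write_mult: float = 1.25,
                               batch_share: float = 0.0,
                               batch_discount: float = 0.50) -> float:
    """Blended multiplier to apply to the list input rate."""
    # Cached tokens are split between cheap reads and premium first writes.
    cached_mult = (1 - write_share) * read_mult + write_share * write_mult
    blended = (1 - hit_rate) + hit_rate * cached_mult
    # Assumption: batch-tier discount scales the whole blended rate.
    return blended * (1 - batch_share * batch_discount)


hot = effective_input_multiplier(0.90, write_share=0.10)   # steady-state production
cold = effective_input_multiplier(0.50, write_share=0.40)  # fresh or low-traffic deployment
```

Comparing the hot and cold results shows why the write share needs its own control: a cold cache pays most of the list rate even though half its tokens are nominally "cached".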

Worked example: 0.61 → 0.91 across six turns

Take a 6-turn session against a 1640-token sysprompt + 450-token tool schemas + 100-token user message. Output: 300 tokens per turn. GPT-4o-mini rates: $0.15/M input, $0.075/M cached input, $0.60/M output.

Without caching:

input per turn  =  2,190 tok × $0.15/M  =  $0.000329
output per turn =    300 tok × $0.60/M  =  $0.000180
                                          ─────────
per turn        =  $0.000509
6 turns         =  $0.00305

With realistic per-turn caching (turn 0 = 0.61, turns 1–5 = 0.94 average):

turn 0:
  cached portion   =  2,190 × 0.61 × $0.075/M  =  $0.000100
  uncached portion =  2,190 × 0.39 × $0.15/M   =  $0.000128
  output           =  $0.000180
  ─────────────────────────────────
  turn 0 total     =  $0.000408

turns 1-5 (each):
  cached portion   =  2,190 × 0.94 × $0.075/M  =  $0.000154
  uncached portion =  2,190 × 0.06 × $0.15/M   =  $0.000020
  output           =  $0.000180
  ─────────────────────────────────
  turn N total     =  $0.000354

session total    =  $0.000408 + 5 × $0.000354  =  $0.00218

Caching takes the session from $0.00305 to $0.00218, a 28% saving. At 1M sessions/month, that's an $870 monthly delta between the uncached and cached bills.
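The worked example can be reproduced in a few lines of Python, which is a convenient way to re-run it with your own hit-rate curve. The constant and function names are illustrative; the token counts, rates, and hit rates are the ones used above.

```python
# gpt-4o-mini rates from the worked example, converted to dollars per token
RATE_IN, RATE_CACHED, RATE_OUT = 0.15e-6, 0.075e-6, 0.60e-6
INPUT_TOKENS = 1640 + 450 + 100   # sysprompt + tool schemas + user message
OUTPUT_TOKENS = 300


def turn_cost(hit_rate: float) -> float:
    """Cost of one turn when `hit_rate` of the input is served from cache."""
    cached = INPUT_TOKENS * hit_rate * RATE_CACHED
    uncached = INPUT_TOKENS * (1 - hit_rate) * RATE_IN
    return cached + uncached + OUTPUT_TOKENS * RATE_OUT


no_cache = 6 * turn_cost(0.0)
with_cache = turn_cost(0.61) + 5 * turn_cost(0.94)   # cold turn 0, then warm turns
saving = 1 - with_cache / no_cache
```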

If you scale to GPT-5.5 ($5.00 input, $0.50 cached, $30 output) the same percentage saving on a much larger bill becomes material — the curve modeling can swing $5,000–$20,000/mo of defended budget.

Part 6—Small adjustments you must not skip

A few minor coefficients that individually move the bill by 2–5%, but compound to 15–25% if you skip all of them. Worth internalising before you defend a number.

Retry rate × 1.5 × base cost

LLM API calls fail. Rate-limit retries, schema-validation failures, JSON-parse retries on structured-output calls, content-policy false positives. Each failed call that is retried costs approximately 1.5× the successful call (you pay for the failed completion plus the retry's input).

The retry-overhead formula:

retry_overhead  =  retry_rate × 1.5 × base_cost

A defensible default retry rate: 3%. Measured on gpt-4o-mini with strict JSON-schema enforcement on a non-native-function-calling configuration: 8%. On agents that demand strict structured output on a model that doesn't support function calling natively, I've seen 10–12%. Make this overridable in your UI; users with measured rates from logs should be able to plug them straight in.

The 1.5× multiplier (vs the 1.0× older tools used) accounts for the partial output already generated before failure plus the full retry call. Modeling this correctly is the difference between under-counting retry waste by 0% and 33%.
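In code, the retry-overhead formula is a one-liner. The Python sketch below uses a hypothetical function name and a placeholder $10,000 base cost to show how the default 3% rate compares with the measured 8% rate.

```python
def retry_overhead(base_cost: float,
                   retry_rate: float = 0.03,
                   retry_multiplier: float = 1.5) -> float:
    """Extra spend caused by failed-then-retried calls."""
    return retry_rate * retry_multiplier * base_cost


base = 10_000.0                               # placeholder monthly base cost
default_overhead = retry_overhead(base)       # 3% retry rate
strict_json_overhead = retry_overhead(base, retry_rate=0.08)
```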

Language multiplier (English = 1.0, CJK = 1.8-2.2)

Tokenizers were trained on English-heavy corpora. Non-Latin scripts compress poorly, raising your token count for the same content:

Language	Multiplier
English	1.0×
Code	1.3×
French / Spanish / German	1.2–1.5×
Arabic / Hebrew	1.4×
Chinese / Japanese / Korean	1.8–2.2×

A Japanese-language deployment costs roughly 2× the tokens of an equivalent English-language deployment. Expose this as a single per-deployment multiplier in your UI; for production submissions, the user should validate against the provider's tokenizer or their production logs (cross-provider tokenizer differences can themselves be material, especially for code, long context, and CJK).

Conversation summarisation overhead at >70% context fill

When your conversation history grows past 70% of the model's context window, the agent has to summarise old turns into a shorter representation before continuing. This is one extra LLM call per summarisation event, costing approximately:

summarisation_cost  ≈  0.30 × context_tokens × rate_in
                     +  0.30 × 0.30 × context_tokens × rate_out

(Roughly: read 30% of the context to summarise, generate 9% back as the summary, at output rates.)

For a 200K-context model getting 90% full, that's 0.30 × 180K × rate_in plus ~0.30 × 0.30 × 180K × rate_out. At GPT-4o list rates (assuming $2.50/M input and $10/M output), that is roughly $0.30 per summarisation event. If sessions average 30 turns and trigger summarisation every ~20 turns, you're paying this 1–2× per session.

Older calculators ignored this entirely; do not. Model it explicitly as a function of session length and context window, and expose the 70% threshold and the 30% / 9% factors as overridable advanced settings — the right values depend on the model and your prompt strategy.
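A minimal Python sketch of the summarisation model, with the threshold and both factors exposed as parameters. The function names are hypothetical, and the $2.50/M input and $10/M output rates in the example are assumed placeholders; substitute the rates from your own price book.

```python
def needs_summarisation(context_tokens: int,
                        context_window: int,
                        threshold: float = 0.70) -> bool:
    """True once the conversation fills more than `threshold` of the window."""
    return context_tokens > threshold * context_window


def summarisation_cost(context_tokens: int,
                       rate_in_per_m: float,
                       rate_out_per_m: float,
                       read_fraction: float = 0.30,
                       summary_fraction: float = 0.30) -> float:
    """Dollar cost of one summarisation event."""
    read_tokens = read_fraction * context_tokens       # part of context re-read
    summary_tokens = summary_fraction * read_tokens    # 9% of context by default
    return (read_tokens * rate_in_per_m + summary_tokens * rate_out_per_m) / 1_000_000


event_cost = summarisation_cost(180_000, rate_in_per_m=2.50, rate_out_per_m=10.00)
```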

Provider markup: Bedrock 1.05×, Azure 1.0×, OpenRouter 1.05×

Different access paths to the same underlying model have different markups:

Path	Markup vs vendor list	Notes
Direct vendor API (OpenAI, Anthropic, Google)	1.00×	Baseline.
Azure OpenAI	1.00×	Parity with OpenAI direct, plus enterprise compliance.
AWS Bedrock	1.05×	Typical 5% AWS markup; GovCloud available.
OpenRouter	1.05×	Aggregator markup, varies per model.
BYOK	1.00×	User pays vendor directly, no aggregator.
Self-host	0× per-token + fixed monthly	$5K/mo default GPU+ops.

Negotiated enterprise agreements may be 10–20% below list, and high-volume customers (>$1M/mo) often see custom pricing that any default price book won't capture. Your price-book table must be user-editable so contract rates can be plugged in; treat the shipped defaults as starting points, not source of truth.
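A user-editable price book can start as a plain dictionary. The Python sketch below encodes the markup column from the table; the key names, the function name, and the choice to apply a negotiated discount multiplicatively on top of the markup are assumptions for illustration, and the $2.50/M list rate is a placeholder.

```python
# Default markups from the table above; users should be able to overwrite these.
PRICE_BOOK_MARKUP = {
    "direct": 1.00,
    "azure_openai": 1.00,
    "bedrock": 1.05,
    "openrouter": 1.05,
    "byok": 1.00,
}


def effective_rate(list_rate_per_m: float,
                   path: str,
                   negotiated_discount: float = 0.0) -> float:
    """Per-million-token rate after access-path markup and contract discount."""
    return list_rate_per_m * PRICE_BOOK_MARKUP[path] * (1 - negotiated_discount)


bedrock_rate = effective_rate(2.50, "bedrock")
contract_rate = effective_rate(2.50, "direct", negotiated_discount=0.15)
```

Self-hosting is left out of the dictionary on purpose: it is a fixed monthly line item rather than a per-token multiplier.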

For procurement, the markup choice is also a compliance choice. Bedrock and Azure OpenAI have FedRAMP-authorized GovCloud regions; direct OpenAI does not. If you need GovCloud, the 5% markup on Bedrock might be the only path. That decision is downstream of cost modeling, but the price book is where you encode the trade-off.

Part 7—Multi-agent: more than one LLM call per question

So far we've costed one LLM call per query. Real agentic systems make many calls per query. Your calculator needs a multi-agent topology control that switches the per-query math from the single-call (shape × mix) mode to a per-agent sum mode. Most users will start in single-call mode; the topology selector is the affordance that promotes them when their architecture grows.

For richer modeling, expose a multi-agent simulator with per-agent fleet configuration, DAG topology controls, and a live token-flow ledger. This is power-user territory but it's where procurement-grade defenders eventually need to land.

Why agents at all (orchestrator → specialists)

A single-agent deployment has one LLM call per user query: prompt + question → response. Simple.

Real agentic systems split the work across multiple LLM calls because a single call struggles to do all of the following at once:

Plan what to do
Retrieve relevant documents
Call external tools (database, search, code interpreter)
Reason about the results
Compose a final answer

Doing all five in one call requires a very long, very complex prompt and a very capable model. It's expensive and slow. Splitting into specialised agents (planner, retriever, tool-caller, reasoner, summariser) is cheaper per role, allows different models per role, and produces better quality.

In your multi-agent mode, the user should be able to specify each agent's input/output, model, cache behavior, and turn share independently. The total per-query cost is the sum across agents — show this sum prominently with each agent's contribution as a stacked bar so users can see which agent is dominating their bill.

Comm patterns: orchestrator / peer / supervisor

Three common multi-agent topologies:

Orchestrator — one main agent dispatches to specialists, collects their outputs, synthesises the final answer. Specialists don't talk to each other directly. Most common pattern.
Peer — agents communicate directly with each other in a flat structure. More flexible, harder to reason about; less common in production.
Supervisor — a meta-agent reviews specialists' work before it goes to the user. Adds an extra LLM call per turn (the supervisor's review). Common in safety-critical deployments.

A topology dropdown is sufficient: the cost difference between orchestrator and supervisor is one extra LLM call per turn. For cost modeling, the bigger question is the number of agents and what each is doing, not the topology label, so don't over-invest in the dropdown's branching logic.

Single-agent vs multi-agent token math

Single-agent:

session_cost  =  agent_input_cost + agent_output_cost

Multi-agent (sum across agents in the fleet):

session_cost  =  Σ_agents [ agent_input_cost
                         +  agent_output_cost ]

Each agent has its own input_tokens, output_tokens, cache_rate, model, and turn_share. Your per-agent fleet editor must let the user set all of these independently — a single set of "agent" defaults applied uniformly is structurally wrong.
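The per-agent sum might look like the Python sketch below. The `Agent` class is hypothetical, the example fleet and its rates are placeholders, and two modeling choices are assumptions: `turn_share` is read as the fraction of turns in which the agent runs, and cached input is billed at 10% of list.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    input_tokens: int
    output_tokens: int
    rate_in_per_m: float
    rate_out_per_m: float
    cache_rate: float = 0.0
    turn_share: float = 1.0        # assumed: fraction of turns this agent runs in
    cache_read_mult: float = 0.10  # assumed cache-read price vs list

    def cost(self) -> float:
        in_mult = (1 - self.cache_rate) + self.cache_rate * self.cache_read_mult
        input_cost = self.input_tokens * self.rate_in_per_m * in_mult / 1e6
        output_cost = self.output_tokens * self.rate_out_per_m / 1e6
        return self.turn_share * (input_cost + output_cost)


def session_cost(fleet) -> float:
    """Sum of every agent's input and output cost."""
    return sum(agent.cost() for agent in fleet)


fleet = [
    Agent("planner", 1000, 200, 0.15, 0.60, cache_rate=0.9),
    Agent("writer", 3000, 500, 5.00, 30.00, cache_rate=0.5, turn_share=0.5),
]
total = session_cost(fleet)
shares = {a.name: a.cost() / total for a in fleet}  # data for a stacked bar
```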

The thing that makes naïve multi-agent estimates wrong: each agent's input includes context from the previous agents. We'll get to this in the sequential-pipeline section below.

Sequential pipelines: cumulative input growth

In a sequential pipeline (agent 1 → agent 2 → agent 3), every downstream agent's input includes the cumulative output of every upstream agent. Mathematically:

Stage 1 input:  sysprompt(1) + user_query
Stage 2 input:  sysprompt(2) + user_query + stage_1_output
Stage 3 input:  sysprompt(3) + user_query + stage_1_output + stage_2_output
Stage N input:  sysprompt(N) + user_query + Σ(stages 1..N-1 outputs)

A measured 5-stage research pipeline (from the open-source agent-cost-bench reference, multi-stage-research.yml):

Stage 1 (analyst):       175 in /  356 out
Stage 2 (feasibility):   523 in /  700 out  (+ stage 1)
Stage 3 (architect):   1,243 in /  800 out  (+ stages 1-2)
Stage 4 (implementer): 2,038 in /  788 out  (+ stages 1-3)
Stage 5 (reviewer):    2,854 in /  668 out  (+ stages 1-4)

Cumulative input growth: ~700 tokens per stage, dominated by prior stages' outputs. Older simulators assumed a flat ~200-token "handoff" overhead per stage; that's wrong by 3.5×. The right coefficient is approximately:

handoff_input_growth  ≈  avg_upstream_output_tokens  ×  n_upstream_stages

Build it: if you ship a multi-agent mode, apply the cumulative-input-growth math automatically. Don't expose a single "handoff overhead" slider as if it were a flat per-stage tax. It isn't; it's

avg_upstream_output × n_upstream_stages

Stage 5 carries the cost of stages 1–4, and that's where the bill blows up.
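The cumulative-input rule can be applied automatically with a short loop. In this Python sketch the function name is hypothetical, the stage outputs are the measured values from the five-stage pipeline above, and the 150-token system prompt and 25-token query are placeholder guesses (the bench's actual prompt sizes are not given here), so the results only approximate the measured inputs.

```python
def stage_inputs(sysprompt_tokens, user_query_tokens, stage_outputs):
    """Input size of each stage: its sysprompt + the query + all upstream outputs."""
    inputs = []
    upstream = 0
    for sysprompt, out in zip(sysprompt_tokens, stage_outputs):
        inputs.append(sysprompt + user_query_tokens + upstream)
        upstream += out  # this stage's output is carried into every later stage
    return inputs


measured_outputs = [356, 700, 800, 788, 668]   # from the 5-stage pipeline above
inputs = stage_inputs([150] * 5, 25, measured_outputs)
growth_per_stage = (inputs[-1] - inputs[0]) / (len(inputs) - 1)
```

With these placeholders, the computed inputs track the measured 175 / 523 / 1,243 / 2,038 / 2,854 sequence closely, which is the point: upstream outputs, not a flat handoff tax, drive the growth.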

Parallel topology: synthesizer pays for everyone

In a parallel topology (orchestrator dispatches to N specialists in parallel; a synthesizer at the end consolidates their outputs):

Specialist N input:  sysprompt(N) + orchestrator_output
Synthesizer input:   sysprompt + Σ all specialist outputs

Each specialist sees only the orchestrator's output (small). The synthesizer at the end sees every specialist's output concatenated, which is large.

The bench's parallel-fan-out.yml scenario (orchestrator + 3 specialists + synthesizer) measured the synthesizer input at ~1,900 tokens — about 7.6× a single specialist's input.

If your topology is parallel-fan-out, your highest-input agent is the synthesizer, not any specialist. Size the model per agent accordingly: cheaper for the specialists, more capable for the synthesizer.
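The fan-out arithmetic is sketched below in Python. The function name is hypothetical and every token count in the example is an illustrative placeholder chosen to land near the figures quoted above; they are not the bench's per-agent measurements.

```python
def fan_out_inputs(orchestrator_output: int,
                   specialist_sysprompts,
                   specialist_outputs,
                   synth_sysprompt: int):
    """Input sizes for each specialist and for the final synthesizer."""
    # Each specialist sees only its own sysprompt plus the orchestrator's output.
    specialist_inputs = [s + orchestrator_output for s in specialist_sysprompts]
    # The synthesizer sees every specialist's output concatenated.
    synthesizer_input = synth_sysprompt + sum(specialist_outputs)
    return specialist_inputs, synthesizer_input


specialist_inputs, synthesizer_input = fan_out_inputs(
    orchestrator_output=100,              # placeholder
    specialist_sysprompts=[150, 150, 150],
    specialist_outputs=[600, 550, 600],   # placeholder
    synth_sysprompt=150,
)
ratio = synthesizer_input / specialist_inputs[0]
```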

Hybrid DAGs (sequential trunk + parallel sub-branches)

Real research-orchestration agents look like neither pure sequential nor pure parallel. They have a sequential trunk (plan → retrieve → analyse → write) with parallel sub-branches at certain stages (analyse-1, analyse-2, analyse-3 running concurrently). The cost is the same regardless of execution order — total tokens consumed don't change with topology — but concurrent quota usage and rate-limit overage do.

For richer multi-agent calculators, a topology selector (sequential / parallel / hybrid) with a concurrent-quota slider and a rate-limit overage percentage is the right shape. A defensible cost adjustment:

Concurrent overage surcharge — 2% per overflow request
Rate-limit retry penalty — 1.5× the failed request cost

Token math doesn't change with topology — only rate-limit math does. Build the topology selector if your audience cares about sizing reservations; skip it if cost-only is enough.

Concurrent quota and rate-limit overage

Each provider has a max concurrent-request quota:

Anthropic Tier 2: ~50 concurrent
OpenAI Tier 4: ~500 concurrent
Self-hosted: bounded by GPU count and batch size

Exceeding it triggers queueing or 429 responses. A defensible overage model:

concurrent_overage_surcharge  =  0.02 × overflow_request_count
rate_limit_retry_penalty       =  1.5 × failed_request_cost

For a parallel-fan-out workload with 3 specialists, 200 sessions/day, 8 turns/session at ~1 simultaneous request per turn, you need ~200 × 8 / 86400 ≈ 0.02 concurrent on average but 5–10 at peak. Stay under the tier's quota or pay the overage; if you're consistently over, upgrade tier.

Workflow mode: six extra cost components

When your multi-agent system is a pipeline (research orchestration, document analysis, report generation), you'll need a workflow mode distinct from the per-agent fleet mode above. Workflow mode adds six pipeline-only cost components:

Sequential chain handoff
Bulk document ingestion
Partial rerun rate
Fact-check sidecar
Template amortization
HITL pause storage

Each is covered in detail in the next module.

Part 8—Workflow mode: when the agent is a pipeline

For systems that run as a research-workflow pipeline (multiple stages, document ingestion, user-review checkpoints, fact-check sidecars), expose a Workflow DAG mode distinct from the Fleet mode of Part 7. Fleet mode is N agents in parallel; Workflow mode is N stages sequentially with specialised pipeline-only cost components.

The mode should be a single user-facing toggle. The six workflow-only cost components below get added to the base per-agent calculation when Workflow mode is on.

Sequential chain handoff (% output passed forward)

The default coefficient is 80%: that share of stage N's output is passed to stage N+1 as input. Some pipelines pass everything (100%); some compress between stages (50%). Higher pass-through means larger downstream input and higher cost; lower means lossy hand-offs and possibly lower quality.

Slider: Output → input ratio (0–100%). Default 80%.

This is in addition to the cumulative-context growth from Part 7. The chain-handoff is deliberate engineering (the team chose to pass output forward); the cumulative growth is structural (the agent has to see what came before).

Bulk document ingestion per session

Research pipelines often start by ingesting a corpus of PDFs that the user uploaded — not via RAG retrieval (that's chunked, on-demand) but as bulk inputs to early stages.

Cost model:

ingest_tokens_per_session  =  PDFs_per_session
                             × pages_per_PDF
                             × tokens_per_page
                             × % stages reading docs

Defaults: 0 PDFs (off), 15 pages/PDF, 800 tokens/page, 40% of stages read docs. Set PDFs/session to 5–10 for typical research workflows; the cost adds quickly because each PDF is ~12,000 tokens and you have multiple stages reading them.
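A direct Python transcription of the ingestion model, using the defaults listed above. The function and parameter names are hypothetical; the example sets PDFs per session to 5, the low end of the suggested range.

```python
def ingest_tokens_per_session(pdfs_per_session: int = 0,
                              pages_per_pdf: int = 15,
                              tokens_per_page: int = 800,
                              stages_reading_share: float = 0.40) -> float:
    """Bulk-ingestion input tokens added to a session."""
    return pdfs_per_session * pages_per_pdf * tokens_per_page * stages_reading_share


tokens_per_pdf = 15 * 800                                   # one default-sized PDF
typical = ingest_tokens_per_session(pdfs_per_session=5)     # typical research workflow
```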

Partial rerun rate (user-review failures)

In HITL workflows, users review intermediate stage outputs and can reject them. A rejected stage gets re-executed with feedback. The cost model:

partial_rerun_overhead  =  partial_rerun_rate × stage_cost

Default: 15% of stages re-run on user review. For a 5-stage pipeline, that's ~0.75 extra stages of cost per run. Calibrate against your own user-review acceptance rates; some workflows are 5%, others 30%.

Fact-check sidecar (separate verification call per stage)

Distinct from the FactReasoner verifier (Part 11), this is a lightweight per-stage check: a separate (usually cheaper) model reads the stage output and flags issues.

Cost model:

fact_check_cost_per_stage  =  fact_check_rate
                             × FC_input_tokens
                             × FC_model_input_rate

Defaults: fact-check enabled on 70% of stages, FC input tokens 2,000 per check, FC model price $0.80/M (typically a Haiku-class small model). For a 5-stage pipeline, that's 5 × 0.70 × 2,000 × $0.80/M ≈ $0.0056 per session.
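The fact-check sidecar cost, sketched in Python with the defaults above. The function name is hypothetical, and only the sidecar's input tokens are priced, as in the formula.

```python
def fact_check_cost_per_session(n_stages: int,
                                fact_check_rate: float = 0.70,
                                fc_input_tokens: int = 2_000,
                                fc_rate_in_per_m: float = 0.80) -> float:
    """Expected sidecar cost across all stages of one session."""
    per_stage = fact_check_rate * fc_input_tokens * fc_rate_in_per_m / 1_000_000
    return n_stages * per_stage


five_stage = fact_check_cost_per_session(5)
```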

Template amortization (planning over many runs)

Pipelines that run from a template (e.g., "literature review template" applied to many topics) amortise the planning step across runs. The calculator subtracts a small saving as runs/template grows:

amortised_planning_cost  =  planning_cost / runs_per_template

Default: 10 runs per template. For 100 runs/month with 10 runs/template, the planning cost (which might be a 5K-token flagship-model call to design the template) gets divided by 10 — a meaningful saving on a high-cost coefficient.

HITL pause storage (state retention during user review)

When a workflow pauses for user review, the session state needs to be persisted somewhere — Redis, a managed store, or a database. Per-session storage cost is small but accumulates.

Cost model:

HITL_storage_cost_per_session  =  pause_count
                                 × pause_duration_hours
                                 × storage_GB_per_session
                                 × $/GB-hour

Defaults: 3 pauses per session, 2 hours each, 0.001 GB per session (1 MB), $0.0001/GB-hour. So ~$0.0000006 per session. Negligible per session and still small at scale (1M sessions/month = ~$0.60/mo on storage alone); the larger cost is the database the session state lives in.
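The storage formula is easy to check in Python. The function and parameter names are hypothetical; the defaults are the ones listed above, and the monthly figure simply multiplies the per-session result by 1M sessions.

```python
def hitl_storage_cost(pause_count: int = 3,
                      pause_hours: float = 2.0,
                      gb_per_session: float = 0.001,
                      usd_per_gb_hour: float = 0.0001) -> float:
    """Storage cost of holding one session's state across its review pauses."""
    return pause_count * pause_hours * gb_per_session * usd_per_gb_hour


per_session = hitl_storage_cost()
monthly_at_1m_sessions = per_session * 1_000_000
```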

Part 9—RAG: retrieval as a cost

Retrieval-Augmented Generation (RAG) is what turns a base model into a domain-specialised agent without training the model. You embed a corpus of documents into a vector database; at query time, you embed the user's question, do nearest-neighbour search, and prepend the top-K retrieved chunks to the prompt.

RAG has three cost layers, only one of which most calculators model. Your RAG configuration control needs to expose all three: per-query retrieval cost, corpus ingest cost, and vector DB hosting cost. Skipping any layer under-estimates by 10–40%.

Per-query retrieval cost

For each user turn, the agent does:

Embed the user's question into a query vector (~128 input tokens at the embedding rate, ~1ms latency).
Vector-search the corpus (free — vector DB query, not LLM).
Pull back top-K chunks (default 5) of K' tokens each (default 512).
Prepend chunks to the prompt; the LLM consumes them as input tokens at the model's input rate.

The embedding step is cheap. The chunk-as-input step is where the cost lives:

RAG_input_tokens_per_turn  =  K × tokens_per_chunk + query_embed_tokens

Default: 5 × 512 + 128 = 2,688 tokens per turn.

At GPT-4o-mini input rates ($0.15/M, or $0.075/M cached if chunks are stable), that's ~$0.0004/turn. At GPT-5.5 rates ($5/M), it's ~$0.013/turn. RAG is the single biggest reason input dwarfs output in production agents.
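The per-turn RAG arithmetic, sketched in Python. The function names are hypothetical. The `retrieval_calls` parameter is an assumed extension for designs that retrieve more than once per turn; it repeats the whole retrieval round, query embed included.

```python
def rag_input_tokens(top_k: int = 5,
                     tokens_per_chunk: int = 512,
                     query_embed_tokens: int = 128,
                     retrieval_calls: int = 1) -> int:
    """RAG tokens added to one turn."""
    return retrieval_calls * (top_k * tokens_per_chunk + query_embed_tokens)


def rag_cost_per_turn(rate_in_per_m: float, **kwargs) -> float:
    """Dollar cost of those tokens at a given input rate."""
    return rag_input_tokens(**kwargs) * rate_in_per_m / 1_000_000


default_tokens = rag_input_tokens()
mini_cost = rag_cost_per_turn(0.15)        # $0.15/M input rate
flagship_cost = rag_cost_per_turn(5.00)    # $5/M input rate
```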

Chunks × tokens/chunk + query embed tokens

Tunable inputs your RAG configuration control should expose:

Chunks retrieved (top-K) — default 5, common range 3–10.
Tokens per chunk — default 512, common range 256–1024. Larger chunks = more context but more cost; smaller chunks = less context but more chunks needed.
Query embed tokens — default 128. The user's question + the embedding model's preamble.

Trade-off: more retrieved chunks improve answer quality but linearly increase input cost. The right number is workload-specific; benchmark against your retrieval-quality metrics.

Multi-hop queries: retrieval calls/turn

Some agent designs do multi-hop retrieval — the agent retrieves once, reads the chunks, decides what to retrieve next, retrieves again. Each hop is a fresh retrieval round plus another LLM call to decide.

A retrieval-calls-per-turn control (default 1) covers this. Users with multi-hop research agents will set 2–3. Each additional call multiplies the RAG input cost (since you fetch fresh chunks each time) — make this multiplication visible in the per-query breakdown so users see why their bill jumped.

A useful production insight: multi-hop is expensive but often materially improves answer quality on multi-fact questions. Decide per workload, with measured retrieval-quality data.

Embedding corpus ingest, amortised over re-embed cycle

Before you can retrieve chunks, you have to embed the entire corpus once. This is a one-time cost, but you typically re-embed periodically (when documents update, when you upgrade the embedding model, when you add new documents).

ingest_cost  =  corpus_tokens × embedding_rate
re-embed_cycle  =  every N months
amortised_monthly_cost  =  ingest_cost / N

Worked example: 100M-token corpus, OpenAI text-embedding-3-small at $0.02/M, re-embed every 6 months:

ingest_cost              =  100M × $0.02/M           =  $2.00
amortised monthly cost   =  $2.00 / 6                =  $0.33/mo

Effectively negligible at this scale. For a much larger 5B-token corpus (a substantial federal document library), it's $100 ingest, ~$17/month amortised — still small compared to the per-query retrieval cost.

The bigger embeddings cost line is per-query embedding (the 128-token query embed at retrieval time), not corpus ingest. Over 1M queries/month at 128 tokens × $0.02/M, that's ~$2.56/mo of query-embed cost. Larger embedding models (text-embedding-3-large at $0.13/M, ~6.5× more) push this to ~$17/mo.
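Both embedding cost lines fit in two small Python functions. The names are hypothetical; the corpus sizes, rates, and re-embed cycle are the ones used in the worked example above.

```python
def amortised_ingest(corpus_tokens: float,
                     embed_rate_per_m: float,
                     reembed_months: int):
    """Return (one-off ingest cost, amortised monthly cost)."""
    ingest = corpus_tokens * embed_rate_per_m / 1_000_000
    return ingest, ingest / reembed_months


def query_embed_monthly(queries_per_month: int,
                        query_tokens: int = 128,
                        embed_rate_per_m: float = 0.02) -> float:
    """Monthly cost of embedding user queries at retrieval time."""
    return queries_per_month * query_tokens * embed_rate_per_m / 1_000_000


ingest, monthly = amortised_ingest(100_000_000, 0.02, 6)
big_ingest, big_monthly = amortised_ingest(5_000_000_000, 0.02, 6)
query_small = query_embed_monthly(1_000_000)
query_large = query_embed_monthly(1_000_000, embed_rate_per_m=0.13)
```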

Your RAG configuration must capture corpus size, re-embed cycle, and embedding model. A defensible default is OpenAI's text-embedding-3-small ($0.02/M). Other models worth offering: text-embedding-3-large ($0.13/M, higher quality), Cohere embed-v3 ($0.10/M), Voyage AI v3 ($0.06/M), or self-hosted BGE (effectively free per-query, but you run the GPU). Don't hard-code the model; let the user pick.

Vector DB hosting

The vector database itself is a separate hosted line item. Common options and rough monthly costs:

Vector DB	Cost shape	Typical monthly
Pinecone	Per-pod, hourly	$70–$700/mo
Weaviate Cloud	Per-tenant, monthly	$25–$500/mo
pgvector on RDS	Bundled with DB	$0 add-on
Qdrant Cloud	Per-cluster	$30–$300/mo
Self-hosted Milvus / Qdrant	EC2 + storage	$200–$2K/mo

A vector-DB hosting slider — a manual monthly $/mo input — is the simplest implementation. Auto-fetching vector-DB pricing is hard (provider sites change frequently); a user-overridable default is fine for v1.

Part 10—Reasoning, guardrails, and tools

Modern agents do more than read a prompt and write a reply. They think (extended chain-of-thought), they get moderated (guardrails), and they call tools (function calling, web search, code interpreter). Each of these has a distinct cost shape, and your calculator should group them into three named control sections: Reasoning, Guardrails, and Tools. Combining them under one "advanced" panel hides which lever is moving the bill.

Extended thinking tokens (5-10K per query)

Some models support an extended thinking or reasoning mode where the model generates internal reasoning tokens before producing the visible output. These reasoning tokens are billed at the output rate but are not shown to the user. Examples: OpenAI's o1/o3 series, Anthropic's extended-thinking mode on Sonnet 4.6+, Gemini's thinking mode.

The cost shape:

reasoning_cost  =  thinking_tokens × output_rate
                  × reasoning_pct_of_turns

The two controls your Reasoning section needs:

Thinking tokens — default 0. Set to 5,000–10,000 for reasoning-heavy tasks (math, multi-hop reasoning, complex planning). Some models cap this; o3 has a per-call cap.
Reasoning % of turns — default 0%. The fraction of turns that use extended thinking. Flagship reasoning agents might run 20–50%; chat agents run 0–5%.

A 10K-token thinking budget at GPT-5.5 output rate ($30/M) adds $0.30 per turn that uses it. At 30% of turns across 1M turns/month, that's $90K/mo of reasoning overhead on top of the visible-output cost.
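The worked numbers above can be reproduced with a short sketch. The function name and signature are illustrative assumptions; the inputs are the example figures from the text.

```python
def reasoning_cost_per_turn(thinking_tokens: int,
                            output_rate_per_million: float,
                            reasoning_pct_of_turns: float) -> float:
    """Expected reasoning overhead per turn, averaged over all turns.

    reasoning_pct_of_turns is a fraction in [0, 1], not a percentage.
    """
    if not 0.0 <= reasoning_pct_of_turns <= 1.0:
        raise ValueError("reasoning_pct_of_turns must be between 0 and 1")
    return thinking_tokens / 1_000_000 * output_rate_per_million * reasoning_pct_of_turns


# Cost of a single turn that actually uses the 10K thinking budget at $30/M.
per_thinking_turn = reasoning_cost_per_turn(10_000, 30.0, 1.0)

# Monthly overhead when 30% of 1M turns use extended thinking.
monthly_overhead = reasoning_cost_per_turn(10_000, 30.0, 0.30) * 1_000_000

print(f"per thinking turn: ${per_thinking_turn:.2f}")
print(f"monthly overhead:  ${monthly_overhead:,.0f}")
```

Because the two controls multiply, halving either one halves the overhead, which is why it helps to expose them as separate sliders.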

Reasoning is a quality-vs-cost trade-off. On hard tasks it materially improves correctness; on easy tasks it adds latency and cost without measurable quality gain. Modeling the per-turn reasoning budget and the percentage of turns separately lets the user see the sensitivity; a single "reasoning is on" checkbox hides the lever that matters most.

Reasoning % of turns

Same slider as above, separated into its own concept because it's the dominant cost lever for reasoning models. If you set thinking tokens to 10K but reasoning % of turns to 5%, the overhead is small (~$3 per session). If you set it to 80%, the overhead dwarfs the rest of the cost.

Production systems typically route reasoning selectively: a classifier (or just a heuristic) decides which queries are hard enough to need thinking, and only those pay the cost. This is the most cost-effective use of extended thinking.

Fact-check passes and CoT chain length

Two adjacent reasoning-related controls worth exposing:

Fact-check passes — number of "verification re-reads" the model does over its own output before finalising. Adds ~30% input + 15% output per pass.
CoT chain length — number of explicit chain-of-thought steps in the prompt. Each step is ~150 tokens. A 6-step chain adds ~900 input tokens.

Both are heuristic; they only matter if your prompt template explicitly asks for them. For most agents, leave at 0.

Guardrail tokens (input, output, PII, content policy)

Federal and enterprise deployments routinely run guardrails before and after the LLM call. Each guardrail is a separate classifier or LLM call that screens for unsafe content.

Common guardrail categories:

Input guardrail — scans user input for jailbreak attempts, prompt injections, PII the user shouldn't be sending. ~100–300 tokens of overhead per call.
Output guardrail — scans model output for PII leakage, hallucinated citations, policy violations. ~100–300 tokens.
PII scan — dedicated PII detector (regex, NER, or a small classifier). ~50–150 tokens.
Content policy — content moderation (toxicity, CSAM, election misinformation). ~100–300 tokens.

These can stack: a fully-guarded turn might have input + PII + output + content policy passes, adding 400–1,000 tokens of overhead per turn before the agent has done its main work.

Federal deployments typically require all four. Commercial deployments often skip output scans for performance, accepting the residual risk.

Guard block rate (blocked = wasted)

When a guardrail blocks a turn, the agent's main LLM call may have already partially run before the block triggered. Those tokens are wasted — paid for, never shown to the user.

The waste formula:

guard_block_waste  =  guard_block_rate × main_LLM_cost

Default to 0% (assumes guards block before the main call, the most efficient architecture). Realistic values: 1–3% for well-tuned guards, 5–10% for aggressive safety-first deployments. Encourage users to override when their post-output guards regularly block answers that already cost full output generation.
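As a rough sketch of the waste formula, here is how the block-rate setting turns into a monthly figure. The $40,000/mo main-LLM spend is a hypothetical input chosen only for illustration, and the function name is an assumption.

```python
def guard_block_waste(guard_block_rate: float, main_llm_cost: float) -> float:
    """Main-LLM spend on turns that a guardrail later blocks.

    guard_block_rate is a fraction in [0, 1]; main_llm_cost is in dollars
    over whatever period you are modeling (per turn, per month, ...).
    """
    if not 0.0 <= guard_block_rate <= 1.0:
        raise ValueError("guard_block_rate must be between 0 and 1")
    return guard_block_rate * main_llm_cost


# Hypothetical deployment with $40,000/mo of main-LLM spend.
MAIN_LLM_MONTHLY = 40_000.0
for rate in (0.0, 0.02, 0.10):
    waste = guard_block_waste(rate, MAIN_LLM_MONTHLY)
    print(f"block rate {rate:.0%}: ${waste:,.0f}/mo wasted")
```

This treats every blocked turn as having paid the full main-call cost, which is the pessimistic case; a block that fires mid-generation wastes somewhat less.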

Separate guard model

Two architectures for running guardrails:

Same model as the agent — the main LLM also does guardrail evaluation, just on a different prompt. Simpler; adds full main-rate cost per guard call.
Separate cheaper guard model — a small specialised model (Llama Guard, IBM Guardian 3.2.5B Multi-Harm, Anthropic's Constitutional Classifier) does the guarding. Cheaper per call, separate latency.

A Separate guard model $/1M control captures this. Default $0 (off, using main model). Set to $0.20–$0.50/M for a small dedicated guard model. For high-volume deployments, separating the guard saves 40–60% of the guardrail cost line — make this saving visible in the per-query breakdown so users see the value of the optimisation.

Tool calls: schemas every turn, results per call, IA overhead

Tools are how agents do anything that isn't pure text generation: query a database, search the web, call an external API, run code, retrieve a document. Tool use has four cost components:

tool_round_trip_cost  =  tool_schema_tokens × n_tools_in_prompt   ← every turn
                       + tool_call_arguments × n_calls            ← per call
                       + tool_result_tokens × n_calls             ← per call
                       + (extra LLM call to incorporate results)  ← +1 call

The first line is every turn, even if the agent doesn't call any tool. Schemas have to be advertised so the model knows what's available.

The bench's tool-chain.yml scenario invokes 3 real local tools (search, fetch_doc, query_db) over 9 user turns. Across 21 LLM calls (turns × tool-loop iterations):

total input tokens   =  54,000   (~80% from tool results)
total output tokens  =   5,700
input/output ratio   =  9:1

Tool result tokens dominate the cost, not the agent's reasoning. A naïve "agent + tools" cost model that assumes tools are cheap "function calls" undershoots the bill by 3–10× depending on tool result size.

The +1 LLM call per tool round-trip

Every tool round-trip is at least two LLM calls:

The model decides to call the tool and emits a tool-call block.
Your runtime executes the tool (no LLM cost; the tool itself might cost separately — see below).
The model is invoked again with the tool result added to the prompt, so it can incorporate the result.

If the agent calls 3 tools in sequence within one turn, you're looking at 4 LLM calls (the initial call plus one after each result). The right count is LLM calls per turn = 1 + tool calls per turn, and every follow-up call re-reads the prompt, the tool schemas, and the tool results accumulated so far. Skip the +1 and you'll under-count by 30–50% on tool-heavy agents.
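To make the call counting concrete, here is a minimal sketch of the input-token bill for one turn. It assumes the tools are called one after another, that each follow-up call re-sends the full context, and that no prompt caching applies; all names and token counts are illustrative.

```python
def tool_turn_tokens(base_prompt_tokens: int,
                     schema_tokens_per_tool: int,
                     n_tools_in_prompt: int,
                     n_calls: int,
                     args_tokens_per_call: int,
                     result_tokens_per_call: int) -> tuple[int, int]:
    """Return (llm_calls, total_input_tokens) for one turn with sequential tool calls."""
    # Schemas are advertised on every LLM call, whether or not a tool is used.
    fixed_prefix = base_prompt_tokens + schema_tokens_per_tool * n_tools_in_prompt
    per_exchange = args_tokens_per_call + result_tokens_per_call

    llm_calls = n_calls + 1  # the initial call plus one call after each tool result
    total_input = 0
    for completed_exchanges in range(llm_calls):
        # Each call sees the fixed prefix plus every tool exchange completed so far.
        total_input += fixed_prefix + completed_exchanges * per_exchange
    return llm_calls, total_input


# Hypothetical turn: 1,000-token prompt, 3 tools at 200 schema tokens each,
# 3 sequential calls with 50-token arguments and 2,000-token results.
calls, input_tokens = tool_turn_tokens(1_000, 200, 3, 3, 50, 2_000)
print(calls, input_tokens)
```

In this example the tool exchanges account for most of the input tokens, which matches the observation that results, not reasoning, dominate the bill.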

Web search ($10/1k), file search ($2.50/1k), container sessions

Beyond the LLM cost, several "tool" services have their own billing:

Web search via OpenAI Assistants / Anthropic search — ~$10 per 1,000 search calls.
File search via OpenAI Assistants — ~$2.50 per 1,000 search calls.
Container session (code interpreter) — ~$0.03 per session for OpenAI's Python container.
Google Vertex Search Grounding — ~$35 per 1,000 calls.
Bedrock-passthrough tools — usually no separate fee (the tool cost is in the LLM call's input/output).

Your price book should carry separate per-provider tool-fee columns, and your output report should surface a "Tool fees per session" running total — these fees are easy to miss because they don't look like LLM costs.

If your agent does 2 web searches per turn over 1M turns/month, that's 2M search calls × $10/1k = $20,000/mo of search fees alone, on top of the LLM cost.
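A small sketch of a tool-fee lookup, using the approximate per-1,000-call fees listed above as defaults. The dictionary keys and function name are illustrative assumptions; your price book should keep these values editable.

```python
# Approximate fees per 1,000 calls, as quoted in this section.
TOOL_FEES_PER_1K = {
    "web_search": 10.00,
    "file_search": 2.50,
    "vertex_search_grounding": 35.00,
}


def monthly_tool_fee(tool: str,
                     calls_per_turn: float,
                     turns_per_month: float,
                     fees_per_1k: dict[str, float] = TOOL_FEES_PER_1K) -> float:
    """Monthly fee for one billed tool, separate from any LLM token cost."""
    total_calls = calls_per_turn * turns_per_month
    return total_calls / 1_000 * fees_per_1k[tool]


# The example from the text: 2 web searches per turn over 1M turns/month.
search_fees = monthly_tool_fee("web_search", 2, 1_000_000)
print(f"web search fees: ${search_fees:,.0f}/mo")
```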

Part 11—Verification: the FactReasoner tax

Federal AI deployments increasingly run a probabilistic factuality verifier behind the agent. The agent generates an answer; the verifier decomposes the answer into atomic claims; each claim is checked against retrieved supporting passages via Natural Language Inference (NLI). Claims that fail verification are flagged or removed before the response goes to the user.

A verification tier control covers this. It's complex enough to deserve its own panel — typically a parent toggle ("Enable verification") plus 5–7 child controls for variant choice, sampling coverage, and where the NLI step runs.

Why federal deployments are adding it

Federal AI services have a higher hallucination penalty than commercial chat. A factually wrong answer from a federal science Q&A service about methane levels has different consequences than a factually wrong recipe suggestion from a cooking bot.

FactReasoner-style verification is the current best-practice approach for catching hallucinations probabilistically — not by retraining the model, not by adding more guardrails, but by checking generated claims against authoritative sources at response time.

The original FactReasoner paper is a good background read; the cost calculator implements its variants as separate cost lines.

The pipeline: atomize → retrieve → NLI

A FactReasoner pass does three things:

Atomize — an LLM call decomposes the agent's output into atomic factual claims ("CH4 levels in 2024 were 1923 ppb at Mauna Loa"). ~500–1500 input tokens, ~200–500 output tokens, one flagship-LLM call.
Retrieve — for each atomic claim, retrieve supporting passages from a corpus. Wikipedia (free), Serper API ($5/1k calls), or your own retrieval system.
NLI — for each (claim, passage) pair, classify as entails, contradicts, or neutral. This is where the bulk of the cost lives.

A claim with low entailment evidence and high contradiction evidence gets flagged. Severity-of-flag drives whether the agent revises, suppresses, or asterisks the claim.

Variants: FR1 (24), FR2 (160), FR3 (350) NLI calls per atom

The number of NLI calls per atomic claim varies by variant:

Variant	NLI calls per atom	Use case
FR1 (lean)	24	Lightweight, ~70% recall
FR2 (high-recall, default)	160	Balanced, ~85% recall
FR3 (exhaustive)	350	Comprehensive, ~92% recall

Each NLI call costs roughly:

~500 input tokens (the claim + a passage) at NLI input rate
~10 output tokens (just the label)

Sampling coverage (10% common compromise)

If you verify every response, you double your inference cost. The common compromise is sampling — verify a random fraction of responses, flag systematic issues, fix the upstream agent.

A Sampling coverage control (0–1) handles this. Default 0.0 (off). Common production value: 0.10 (verify 10% of responses; the 90% unverified responses still get the model's intrinsic accuracy, which is where most claims live anyway).

100% coverage is reserved for safety-critical deployments (medical, financial, classified intel) where every hallucination has consequences.


Where NLI runs: API vs g6.xlarge ($588) vs g5.xlarge ($735)

This is the place self-hosting genuinely wins. NLI is a narrow specialised task. A small specialised model (DeBERTa-v3-large, ~440M parameters) performs as well as a flagship LLM on (claim, passage) classification, at a tiny fraction of the per-call cost.

Three hosting options worth offering for the NLI step:

API (pay per token) — route NLI through a flagship LLM API. ~500 input + 10 output per call × $5/M input + $20/M output ≈ $0.0027 per NLI call. At FR2 (160 calls per atom × 8 atoms per response × 10% sampling × 1M responses/month) ≈ 128M NLI calls/month × $0.0027 ≈ $345K/month.
EC2 g6.xlarge ($588/mo flat) — single L4 GPU running DeBERTa-v3-large. Throughput ~500 NLI calls/second ≈ ~1.3B calls/month at full utilization. Same 128M calls/month fits in <10% utilization. $588/mo flat.
EC2 g5.xlarge ($735/mo flat) — A10G GPU alternative. Slightly more headroom; otherwise similar.

The math is aggressively in favor of self-hosting NLI. You go from $345K/mo to $588/mo on the same workload. The saving is ~$5,000/mo at FR2 with 15M NLI calls/month (the baseline FactReasoner load) and ~$345K/mo at the higher-coverage workload above.
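The API-versus-self-host comparison can be sketched in a few lines. The token counts, rates, and the ~500 calls/second throughput are the example figures from this section; the function names and the 30-day month are assumptions.

```python
def nli_calls_per_month(responses: float,
                        atoms_per_response: float,
                        nli_calls_per_atom: float,
                        sampling_coverage: float) -> float:
    """Verification volume = responses × atoms × NLI calls per atom × coverage."""
    return responses * atoms_per_response * nli_calls_per_atom * sampling_coverage


def api_cost_per_nli_call(input_tokens: int = 500,
                          output_tokens: int = 10,
                          input_rate_per_million: float = 5.0,
                          output_rate_per_million: float = 20.0) -> float:
    """Per-call cost when the NLI step is routed through a flagship LLM API."""
    return (input_tokens * input_rate_per_million
            + output_tokens * output_rate_per_million) / 1_000_000


SECONDS_PER_MONTH = 86_400 * 30  # assumes a 30-day month

calls = nli_calls_per_month(1_000_000, 8, 160, 0.10)   # FR2, 10% sampling
api_monthly = calls * api_cost_per_nli_call()
self_host_monthly = 588.0                               # g6.xlarge flat rate from the text
utilization = calls / (500 * SECONDS_PER_MONTH)         # at ~500 NLI calls/second

print(f"NLI calls/month: {calls:,.0f}")
print(f"API: ${api_monthly:,.0f}/mo vs self-host: ${self_host_monthly:,.0f}/mo")
print(f"self-host utilization: {utilization:.1%}")
```

Swapping the variant (24, 160, or 350 calls per atom) or the coverage fraction shows how quickly the API line grows while the self-host line stays flat until the instance saturates.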

Retrieval source: Wikipedia (free) vs Serper ($5/1k)

The retrieval step (step 2 of the pipeline) needs a corpus to look up evidence for each atomic claim. Three common options:

Wikipedia (free) — curated dump, good for general knowledge; insufficient for federal domain queries (e.g., scientific data, classified content).
Serper API (~$5 per 1,000 search calls) — Google Search proxy, web-scale.
Your own corpus — federated retrieval against your vector DB. Same cost as your normal RAG.

For a federal science-agency deployment, the right answer is usually "your own corpus" — the same one the agent retrieves from at generation time. The verifier and the generator share the retrieval infrastructure.

Why DeBERTa-v3-large is the one self-host that wins

This is the rare case in AI cost analysis where the math points cleanly to self-hosting:

Narrow task — NLI is a binary/ternary classification. Small models match flagship performance.
Smaller model — 440M parameters fits comfortably on a commodity GPU.
Predictable load — verification volume is roughly queries × atoms × NLI-per-atom × coverage; bursty in the same shape as your traffic.
Low utilization is fine — even 5–10% utilization on a $588/mo instance beats per-token API pricing by 100×.

Almost everywhere else in this tutorial, self-hosting loses to commercial APIs at federal scale (Part 13 has the full story). NLI for verification is the exception. If your deployment includes a FactReasoner verifier, self-host the NLI step. The flagship-LLM call for atomization stays on API (it's a narrow but capable task; flagship is right).

Part 12—Hosting: API, BYOK, hybrid, or self-host

The hosting decision is downstream of cost modeling but it shapes the cost model. Same workload looks very different on API vs self-host vs BYOK. Your calculator needs a top-line hosting mode selector for the deployment as a whole, plus per-agent hosting in the multi-agent fleet editor — agents in the same pipeline can sit on different hosting modes.

API tiers: Standard / Flex / Batch / Priority

Inside "API" there are sub-tiers with different latency vs cost trade-offs:

Tier	Multiplier	Latency	Use case
Standard	1.0×	seconds	Default. Best for interactive agents.
Flex	0.5×	minutes	Latency-tolerant async. ~50% cheaper.
Batch	0.5×	hours	True async (overnight runs). ~50% off.
Priority	2.5×	guaranteed seconds	Production SLA. ~2.5× Standard.

Expose this as a single service-tier dropdown on each agent. Make the multipliers themselves editable in a Service-tier multipliers advanced panel: some providers discount batch by 60% and some discount flex by 40%, so the defaults shouldn't be load-bearing.

For most public-facing agents, Standard is the right choice. Flex/Batch are useful for offline analysis or non-interactive workloads. Priority is for SLA-bound agents (financial-services chat, healthcare triage) where the 2.5× cost is justified by the latency guarantee.

The tier multipliers: 1× / 0.5× / 0.5× / 2.5×

These multipliers apply to the per-token rate. If your base rate is $5/M input on Standard:

Flex: $2.50/M
Batch: $2.50/M
Priority: $12.50/M

The service-tier multipliers panel lets the user override these per deployment (negotiated rates, vendor-specific discounts).

BYOK: when the user pays

Bring Your Own Key is a hosting mode where the user provides their own API key and pays the LLM bill directly. The operator pays $0 in LLM costs.

When BYOK is appropriate:

Calculator-style tools where each user has their own account.
Self-service developer tools.
Federal evaluation environments where each agency uses its own contract.

When BYOK is not:

Public-facing consumer apps (users won't sign up for an OpenAI account just to chat).
Federal procurement bound by a single vendor contract.

When the user picks BYOK in your hosting selector, your cost engine should zero out the LLM line for that agent and only count non-LLM costs (infrastructure, personnel, compliance). This is non-obvious — many calculators silently still charge LLM cost in BYOK mode and miss the architectural advantage that BYOK provides.

Hybrid hosting splits traffic between API and self-host. Common splits:

70% API / 30% self-host — API for general traffic, self-host for steady high-volume narrow paths.
API for high-quality / self-host for cheap classification — flagship API for the main agent, self-hosted small model for fallback or routing.
Migration timeline — Year 1 100% API, Year 2 50/50, Year 3 100% self-host. (Part 21 covers phased migration.)

Expose this as an API share % control (0–100). Your cost engine should apply API rates to that share and self-host amortisation to the remainder. Both modes' configurations need to be live — don't let the user lose their self-host parameters when they slide back to 100% API.
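A minimal sketch of the API share control follows. It assumes a single blended per-token rate and takes the self-host side as a caller-supplied function, because self-host cost is not linear in volume; `flat_fleet` and the dollar figures are hypothetical stand-ins.

```python
from typing import Callable


def hybrid_monthly_cost(total_tokens: float,
                        api_share_pct: float,
                        api_rate_per_million: float,
                        self_host_monthly_fn: Callable[[float], float]) -> float:
    """Apply API rates to the API share and a self-host model to the remainder."""
    if not 0 <= api_share_pct <= 100:
        raise ValueError("api_share_pct must be between 0 and 100")
    api_tokens = total_tokens * api_share_pct / 100
    self_host_tokens = total_tokens - api_tokens
    api_cost = api_tokens / 1_000_000 * api_rate_per_million
    # No self-host fleet is billed when the slider sits at 100% API.
    self_host_cost = self_host_monthly_fn(self_host_tokens) if self_host_tokens > 0 else 0.0
    return api_cost + self_host_cost


def flat_fleet(_tokens: float) -> float:
    """Illustrative stand-in: a fixed $12,000/mo fleet regardless of volume."""
    return 12_000.0


all_api = hybrid_monthly_cost(10_000_000_000, 100, 5.0, flat_fleet)
split_70_30 = hybrid_monthly_cost(10_000_000_000, 70, 5.0, flat_fleet)
print(f"100% API: ${all_api:,.0f}/mo, 70/30 split: ${split_70_30:,.0f}/mo")
```

Passing the self-host model in as a function keeps its parameters alive while the slider sits at 100% API, which is the behaviour the paragraph above asks for.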

Self-host: capex + opex

Self-hosting is a different cost model entirely. Per-token rates go to zero (you generate tokens on your own GPUs); a fixed monthly cost for the GPU instances + ops + personnel takes their place.

self_host_monthly  =  GPU_instances × hourly_rate × 730
                     × headroom_multiplier
                     × (1 - reserved_discount)
                     + ops_monthly
                     + FTE_monthly
                     + setup_amortised_monthly
                     + observability_monthly

We'll spend the next module (Part 13) unpacking the self-host realism layer in detail.
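The formula above translates directly into a function. This is a sketch with illustrative inputs: two g6e.12xl instances at the list price quoted later in this tutorial, a 33% reserved discount, and round-number operating costs that you should replace with your own.

```python
def self_host_monthly(gpu_instances: int,
                      hourly_rate: float,
                      headroom_multiplier: float,
                      reserved_discount: float,
                      ops_monthly: float,
                      fte_monthly: float,
                      setup_amortised_monthly: float,
                      observability_monthly: float,
                      hours_per_month: int = 730) -> float:
    """Fixed monthly cost of a self-hosted fleet; per-token cost is zero."""
    gpu_monthly = (gpu_instances * hourly_rate * hours_per_month
                   * headroom_multiplier * (1 - reserved_discount))
    return (gpu_monthly + ops_monthly + fte_monthly
            + setup_amortised_monthly + observability_monthly)


total = self_host_monthly(
    gpu_instances=2,
    hourly_rate=10.49,            # g6e.12xl list price, illustrative
    headroom_multiplier=1.0,      # instance count assumed to be sized already
    reserved_discount=0.33,
    ops_monthly=1_800,
    fte_monthly=8_000,
    setup_amortised_monthly=8_333,
    observability_monthly=1_800,
)
print(f"self-host total: ${total:,.0f}/mo")
```

Note that in this example the GPU rent is only about a third of the total; the fixed people-and-operations lines are the rest.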

Provider markup table

Already covered in Part 6, but the consolidated table:

Path	Markup
Direct vendor (OpenAI, Anthropic, Google)	1.00×
Azure OpenAI	1.00×
AWS Bedrock	1.05×
OpenRouter	1.05×
BYOK	1.00× (user pays)
Self-host	0× per-token + fixed

For procurement, the markup choice is also a compliance choice — Bedrock and Azure have FedRAMP-authorized GovCloud regions; direct OpenAI does not. The 5% markup might be the only path to compliance for high-tier deployments.

Per-agent hosting in multi-agent fleets

In a multi-agent fleet, each agent can have its own hosting:

Orchestrator: cheap API model (gpt-4o-mini or Haiku)
Specialists: flagship API (Sonnet 4.6 or GPT-5.5)
Verifier (NLI): self-host (DeBERTa-v3-large on g6.xlarge)
Embedding model: self-host (BGE on the same g6.xlarge as NLI)

Per-agent hosting must be a per-row dropdown in your fleet editor. Your cost engine then sums per-agent costs with the right rates applied per hosting mode — agents on API use per-token math, agents on self-host use the amortised-fixed math.

The reason this matters: a uniform-fleet pricing assumption materially overstates real heterogeneous-system costs for some configurations and understates them for others (notably when expensive specialised models like Opus are used for high-context analyst roles, alongside cheap Haiku orchestrators). Modeling per-agent is the only way to get the real number.

Part 13—Self-host realism

If you're considering self-host as a path to cost savings, this module is the warning label. The intuition "we buy the hardware once, queries are free" is wrong at federal scale, and the reasons compound. Your calculator should have an Optimistic vs Realistic toggle on self-host estimates — optimistic mode shows the marketing brochure number, realistic mode applies the multipliers below. Always show both side-by-side; the gap is the story.

Utilization: 60-80% commercial vs 10-20% single-tenant

Commercial inference providers run shared GPU fleets across thousands of tenants. Aggregate utilization sits in the 60–80% range because the peaks and troughs cancel — when one tenant's traffic dips, another's rises.

A single-tenant federal fleet sized for peak runs at 10–20% utilization on average. The hardware bill is the same; the per-query amortisation is several times worse.

If you provision 4 GPUs to absorb a 10× peak, three of them sit idle most of the time. Their hourly cost still bills.

Continuous batching: batch sizes of 1 vs 100s

Modern inference servers (vLLM, TensorRT-LLM, TGI) pack many concurrent requests onto the same GPU forward pass. Decode is memory-bandwidth-bound on GPUs, so batch size translates almost directly into per-GPU throughput.

Commercial providers sustain batch sizes well into the hundreds — they have many tenants creating concurrent requests.
A single-tenant federal fleet at moderate volume sees batch sizes in the single digits during off-peak hours.

A GPU running batch=2 is delivering ~2× the throughput of one running batch=1 — but a GPU running batch=200 is delivering roughly 30–50× the throughput of batch=1 (memory bandwidth saturates eventually). Commercial providers are at the high end; single-tenant deployments are not.

This is invisible in the GPU spec sheet (the chip is the same) but huge in the actual cost-per-query.

Cross-tenant prompt caching

Self-hosted servers cache prefixes within an instance, but each instance caches independently by default. Multi-node distributed prefix caches (LMCache, Anyscale) exist — they require non-trivial engineering investment that nobody budgets in a procurement memo.

A commercial provider operates the equivalent at fleet scale as a default property of the system. Cross-tenant prefix caching means a system prompt your agent sends shares the cached state with thousands of other tenants who have similar prefixes — no engineering, no extra cost, included.

Self-hosted single-tenant cache hit rates are typically lower than the same workload on a commercial provider, all else equal.

Custom silicon and proprietary kernels

Commercial providers invest in:

Custom kernels tuned per model
Speculative decoding with draft models
FP8/INT8 quantization tuned per layer
Parallelism matched to the fleet's specific topology
Proprietary attention optimizations

Open-source serving stacks (vLLM, TGI, llama.cpp) lag commercial providers by some margin per GPU on every one of these. The lag varies by model and workload but is rarely zero.

The step-function cost shape

API pricing is continuous in query volume. Self-hosted pricing is a step function tied to integer instance counts.

If you need 1.3 GPUs of capacity:

API: pay for 1.3 GPUs of usage.
Self-host: pay for 2 GPUs (you can't rent 1.3 of them separately).

The step-function adds 10–50% headroom cost that the continuous API model doesn't pay. A commercial provider doesn't pay for dead capacity — when one tenant's traffic drops, another's rises. A single-tenant federal deployment pays the GPU rent whether the GPUs are doing useful work that hour or not.

GPU options: g6e.12xl, g5.48xl, p5.48xl

A reasonable starter set of GPU options to ship in your self-host capacity control:

Instance	Hourly	Throughput	Capable of
g6e.12xl (4× L40S 48GB)	$10.49	1,200 t/s	70B int8
g5.48xl (8× A10G 24GB)	$16.29	900 t/s	70B int4
p5.48xl (8× H100 80GB)	$98.32	4,500 t/s	400B fp8

These are AWS list prices; reserved-instance discounts apply (see Part 14 — reservations).

Make this list user-extensible. An "Add another GPU instance" affordance plus an editable price-book table covers the long tail (custom Azure ND-series, Lambda Labs, Crusoe, Coreweave, on-prem 8×H100 boxes — there's no canonical list).

Diurnal peak factor and headroom multiplier

To size self-hosted capacity you need:

required_GPUs  =  peak_TPS / per_GPU_TPS
                 × headroom_multiplier
                 / throughput_derate

Peak TPS — your worst-hour token rate. Sized as average × diurnal peak factor. Default diurnal peak factor: 1.5×.
Headroom multiplier — how much over peak you provision to absorb spikes. Default 1.3×.
Throughput derate — how much less your fleet actually delivers vs spec. Default 0.75 in realistic mode (paper TFLOPS overstate real throughput by 25–35%).
Min replicas (HA floor) — minimum instance count regardless of demand, for high availability. Default 2.

For a deployment averaging 100 TPS:

peak_TPS         =  100 × 1.5  =  150 TPS
provisioned_TPS  =  150 × 1.3 / 0.75  =  260 TPS
required_GPUs    =  260 / 1200  ≈  0.22 GPUs

with HA floor of 2:  ceil(max(0.22, 2))  =  2 instances
                                          =  2 × g6e.12xl
                                          =  $15,300/mo at on-demand
                                          =  ~$10,000/mo at 1-yr reserved

The HA floor is what kills small-deployment self-host economics. You're paying for 2 instances when 0.22 would do. That's a structural cost the API path doesn't have.
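The sizing walk-through above can be sketched as a single function. The defaults mirror the ones listed in this section; the function name is an assumption, and the throughput figure is treated as per instance.

```python
import math


def required_instances(avg_tps: float,
                       per_instance_tps: float,
                       diurnal_peak_factor: float = 1.5,
                       headroom_multiplier: float = 1.3,
                       throughput_derate: float = 0.75,
                       min_replicas: int = 2) -> int:
    """Instance count after peak sizing, headroom, derate, and the HA floor."""
    peak_tps = avg_tps * diurnal_peak_factor
    provisioned_tps = peak_tps * headroom_multiplier / throughput_derate
    raw_instances = provisioned_tps / per_instance_tps
    # Instances are integers, and the HA floor applies regardless of demand.
    return max(math.ceil(raw_instances), min_replicas)


HOURS_PER_MONTH = 730
G6E_12XL_HOURLY = 10.49  # on-demand list price quoted in the table above

n = required_instances(avg_tps=100, per_instance_tps=1_200)
on_demand_monthly = n * G6E_12XL_HOURLY * HOURS_PER_MONTH
print(f"{n} instances, ${on_demand_monthly:,.0f}/mo on-demand")
```

Setting `min_replicas=1` in the same call shows what the HA floor costs on its own: the instance count halves while the served traffic is unchanged.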

HA replicas (floor) and duty cycle / scale-to-zero

Two more sliders:

Min replicas (HA floor) — typically 2 (one is a single point of failure). Federal deployments often require 3 for cross-AZ resilience.
Duty cycle / scale-to-zero — fraction of the month the fleet is running. Default 100% (always on); some workloads can scale to zero overnight or weekends and pay for less of the month.

The duty cycle slider changes the math materially for bursty workloads. A NOAA storm-tracking explainer that runs ~10% of the month (active during storm seasons only) cuts GPU hours 10× — changing the API-vs-self-host trade-off dramatically.

But scale-to-zero has its own costs: cold-start latency (GPUs take 30–90s to warm up), API-side rate-limit overage when traffic resumes, and operational complexity (the auto-scaler has to actually work). Don't model scale-to-zero unless you've tested it.

Compute platform: Fargate / EKS / self-managed K8s

Where the GPU instances run matters for ops cost:

Fargate — managed container service. No K8s expertise needed; AWS handles the orchestration. Most expensive per GPU-hour. Best for small teams without dedicated MLOps.
EKS — managed Kubernetes. Cheaper than Fargate; you manage the K8s control plane. Standard MLOps team can run it.
Self-managed Kubernetes — you run K8s yourself. Cheapest in raw GPU-hour terms; +$5,333/mo hidden FTE cost is a defensible default for the K8s expertise needed. Don't choose this without the team.

A compute platform dropdown handles this. The hidden-FTE add-on for self-managed K8s is what surfaces the tax that most procurement memos omit — make sure your cost engine wires the dropdown selection into the FTE line, not just the GPU-hour line.

Throughput derate (paper TFLOPS overstate by 25-35%)

Vendor benchmarks (TFLOPS, paper TPS numbers) are measured under ideal conditions: hot model, full batch, no other workloads, perfect networking. Real production fleets deliver 75% of paper throughput in steady state, sometimes less.

A throughput derate control captures this:

Optimistic mode: 1.0 (no derate; vendor numbers).
Realistic mode: 0.75 (25% derate; typical production).

Without the derate, your sizing is undersized by 25–35% — you provision for the spec sheet and run out of capacity at the peak hour.

This is the single biggest reason "the same workload that costs $30K on API costs $86K on self-host" — once you derate honestly, the GPU count goes up, the bill goes up.

Build it: ship two preset bundles of self-host operating assumptions behind your Optimistic / Realistic toggle:

Field	Optimistic	Realistic
Ops monthly ($)	$350	$1,800
FTE monthly ($)	$2,500	$8,000
Setup amortised ($/mo)	$0	$8,333
Throughput derate	1.0	0.75
1-yr RI discount	40%	33%
3-yr RI discount	60%	55%

Optimistic is what GPU vendors quote. Realistic is what production teams report. The gap is roughly 4× on total self-host cost for typical federal deployments. Always show both side-by-side; the gap is the deliverable.
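One possible shape for the two preset bundles is a pair of immutable records that the toggle switches between. The class and field names are illustrative; the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfHostAssumptions:
    ops_monthly: float
    fte_monthly: float
    setup_amortised_monthly: float
    throughput_derate: float
    ri_discount_1yr: float
    ri_discount_3yr: float

    def fixed_overhead(self) -> float:
        """Monthly non-GPU cost: ops + FTE + amortised setup."""
        return self.ops_monthly + self.fte_monthly + self.setup_amortised_monthly


OPTIMISTIC = SelfHostAssumptions(
    ops_monthly=350, fte_monthly=2_500, setup_amortised_monthly=0,
    throughput_derate=1.0, ri_discount_1yr=0.40, ri_discount_3yr=0.60,
)
REALISTIC = SelfHostAssumptions(
    ops_monthly=1_800, fte_monthly=8_000, setup_amortised_monthly=8_333,
    throughput_derate=0.75, ri_discount_1yr=0.33, ri_discount_3yr=0.55,
)

for name, preset in (("optimistic", OPTIMISTIC), ("realistic", REALISTIC)):
    print(f"{name}: ${preset.fixed_overhead():,.0f}/mo fixed overhead")
```

Freezing the presets means a user edit has to create a copy (for example with `dataclasses.replace`), so the shipped defaults stay available for the side-by-side view.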

Part 14—Reservations and committed spend

Above some volume, committing to reserved capacity beats per-token pay-as-you-go pricing. Your calculator needs an API reservations control with the structure below — the math here is non-trivial, and most off-the-shelf calculators omit reservations entirely (which makes them unable to model any deployment over ~$50K/mo honestly).

What a reservation actually buys

A reservation is a commitment to spend in exchange for a discount on the rate (or a fixed-capacity guarantee in exchange for paying flat for it). Two structurally different flavors:

Volume commit — you promise to spend $X/month for 12 months; the provider gives you ~10–20% off list. You still pay per token, just at a lower rate.
Capacity commit (PTU) — you buy N units of capacity for the period; you pay flat regardless of usage but use as much as you want within the unit's throughput limit.

Volume commits are simpler. PTUs are more complex but can be cheaper at high steady-state utilization.

Azure PTU: monthly vs annual (~30% off)

Azure OpenAI's Provisioned Throughput Units are the classic capacity-commit model.

Monthly commit (PTU hourly): ~$1,875/month per unit. Each unit buys ~50 TPS for gpt-4o-class models (varies per model — see TPS-per-PTU section).
Annual commit (PTU annual): ~30% off the monthly rate. ~$1,310/month per unit when paid annually.

PTUs are a separate billing line from per-token API spend. You can mix: spillover above PTU capacity goes to per-token rates.

AWS Bedrock provisioned: 1-mo vs 6-mo (~20% off)

AWS Bedrock's equivalent: Provisioned Throughput.

1-month commit: published per-model rate per unit.
6-month commit: ~20% off the 1-month rate.

Sized per model. Each model has its own published per-unit TPS. Anthropic models on Bedrock have separate tiers; Llama models are different again.

OpenAI Enterprise: $100K (10%) and $1M (20%) commits

OpenAI's volume-commit pricing tiers:

≥ $100K/month commit: ~10% off list across products.
≥ $1M/month commit: ~20% off list.
Custom enterprise contracts: negotiated; can include data-residency commits, fine-tuning credits, support SLAs.

These are commit-and-spend deals: you commit to spending at least the threshold amount. Below the threshold, you pay list rate. Above it, you pay the discounted rate.

PTU sizing math: peak TPS / TPS per PTU

For PTUs (Azure) or Bedrock provisioned, the sizing formula:

PTUs_needed   =  ceil( peak_TPS / TPS_per_PTU_for_your_model )
monthly_cost  =  PTUs_needed × $/PTU/month

Peak TPS is what you must serve in the worst hour of your busiest day, not your average. Sizing to average is the most common PTU mistake: your peak hour hits the ceiling, and queries queue or fail with HTTP 429 (rate-limit) errors.

An Auto-size PTU units from peak TPS checkbox is the right affordance. When on, your engine derives PTU count from the modeled peak TPS automatically; when off, the user overrides manually.

TPS per PTU varies per model (50/200/30)

Per-PTU throughput depends on the model:

Model class	TPS per PTU
gpt-4o (~200B params)	~50
gpt-4o-mini (~8B params)	~200
gpt-5-class (larger)	~30
gpt-5.5 (flagship)	~25

Smaller models get more TPS per PTU because they fit more copies on the same GPU. Your price-book table should expose per-model TPS-per-PTU as an editable column; for procurement, the user must be able to verify against Azure's published tables on the day of submission.
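Here is a rough sketch of the sizing formula combined with a per-model lookup. The TPS-per-PTU values are the approximate figures from the table above and the prices are the approximate Azure figures quoted earlier; treat both as editable defaults to verify against the provider's published tables.

```python
import math

# Approximate TPS per PTU by model class, from the table above.
TPS_PER_PTU = {
    "gpt-4o": 50,
    "gpt-4o-mini": 200,
    "gpt-5-class": 30,
    "gpt-5.5": 25,
}

PTU_MONTHLY_PRICE = 1_875.0  # approximate monthly-commit price per unit
PTU_ANNUAL_PRICE = 1_310.0   # approximate per-month price on an annual commit


def ptus_needed(peak_tps: float, model: str,
                tps_table: dict[str, float] = TPS_PER_PTU) -> int:
    """Whole PTUs required to serve peak (not average) TPS."""
    return math.ceil(peak_tps / tps_table[model])


def ptu_monthly_cost(peak_tps: float, model: str, price_per_ptu_month: float) -> float:
    return ptus_needed(peak_tps, model) * price_per_ptu_month


units = ptus_needed(150, "gpt-4o")
print(f"{units} PTUs, ${ptu_monthly_cost(150, 'gpt-4o', PTU_MONTHLY_PRICE):,.0f}/mo monthly commit")
print(f"{units} PTUs, ${ptu_monthly_cost(150, 'gpt-4o', PTU_ANNUAL_PRICE):,.0f}/mo annual commit")
```

The `ceil` is where the step-function cost shape shows up again: 160 TPS on the same model needs a fourth unit even though it is only slightly over the 150 TPS that three units serve.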

When PTUs make sense; when they're a trap

Two conditions both need to hold for PTUs to win:

Steady or predictable traffic — PTU rent is paid whether you use the capacity or not. Spiky workloads waste capacity.
On-demand spend exceeds PTU rent — for Azure, that's roughly when monthly on-demand spend hits $5–15K (hourly PTU) or $3–10K (annual PTU). Below that, on-demand wins on flexibility.


When PTUs are a trap:

Public-facing federal endpoints with bursty traffic (NOAA storm tracker, IRS Direct File during tax season). The burst is the whole game; you're paying PTU rent through the quiet 90% of the time. Stay on-demand and use a daily $ cap (Part 17).
Early-stage deployments where traffic shape is still unknown. Don't lock in capacity until you have 3+ months of production data.
Multi-model architectures where you might want to switch the workhorse model. PTUs are model-specific.

A break-even crossover display is worth building: show the user, on a single chart, where PTU breaks even (~$5–15K/mo on-demand spend) and where self-host breaks even (similar range, with realistic ops). The real comparison is API + PTU vs self-host with realistic ops, not API on-demand vs self-host — and the chart makes that obvious.

Auto-size from peak TPS

The auto-size affordance is the easiest way to get a defensible PTU count:

Set your modeled peak TPS (your engine computes this from the user's traffic + diurnal peak factor + burst settings).
Pick the model.
Check "Auto-size PTU units from peak TPS."
The engine looks up TPS-per-PTU for the model, computes ceil(peak / per-PTU), and shows the monthly PTU cost.

You can override the count if you want headroom or are phasing in capacity. The auto-size is a starting point, not a procurement commitment.
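The auto-size steps can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the `TPS_PER_PTU` values are the illustrative figures from the table above, `ptu_monthly_rate` is a hypothetical price-book input, and the sketch ignores any minimum purchase size or increment a provider may impose.

```python
import math

# Illustrative TPS-per-PTU defaults copied from the table above; verify
# against the provider's published figures before relying on them.
TPS_PER_PTU = {
    "gpt-4o": 50,
    "gpt-4o-mini": 200,
    "gpt-5-class": 30,
    "gpt-5.5": 25,
}


def auto_size_ptus(peak_tps: float, model: str, ptu_monthly_rate: float) -> tuple[int, float]:
    """Return (PTU count, monthly PTU cost) for a modeled peak TPS."""
    per_ptu = TPS_PER_PTU[model]
    count = math.ceil(peak_tps / per_ptu)  # round up: partial PTUs cannot be bought
    return count, count * ptu_monthly_rate


# Example: 420 peak TPS on gpt-4o at a hypothetical $2,000 per PTU-month.
count, cost = auto_size_ptus(420, "gpt-4o", 2000.0)
print(count, cost)  # 9 18000.0
```

A manual override is then just a user-supplied `count` that replaces the computed one before the cost multiplication.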

Part 15—Personnel: the hidden majority of the bill

The compute bill is a third of the real total. Below about $50K/mo of compute spend, personnel dominates. Your calculator needs a personnel / staffing control with per-role FTE allocations and loaded-cost multipliers — and it should be on by default.

If you're costing AI as if compute is the only line item, you're under-counting by roughly 3×. The single most common "missing line item" in homemade AI cost estimates is people.

Roles: MLOps, prompt eng, eval, security, ATO assessor, SRE

Roles that show up in real federal AI deployments:

Role	Loaded annual ($, US, 1.30× base)	When you need them
MLOps engineer	$234,000	Always at >$5K/mo compute
Prompt engineer	$195,000	0.3 FTE for any production agent
Eval engineer	$221,000	Required for federal deployments
Security reviewer	$228,000	Required for federal deployments
ATO assessor (federal ISSO)	$215,000	Required for FedRAMP
SRE / on-call engineer	$254,000	Required for any production endpoint

These are loaded costs — base salary × 1.30 (benefits + overhead), in US dollars, for federal-contractor rates. Commercial rates are similar (range varies by region and seniority).

Loaded annual cost ($195K-254K)

Why "loaded"? Because the FTE costs more than the salary:

Base salary: 100%
Benefits (healthcare, 401k, etc.): +20–25%
Overhead (office, equipment, software): +5–10%
Federal contractor uplift (clearance, indirect rate): can add another 10–15%

A $180K base-salary MLOps engineer is a $234K loaded cost to the program. RFPs require fully-loaded labor in the cost basis.

A typical deployment: 0.85 FTE ≈ $16K/mo, +federal ≈ $21K/mo

A small federal deployment doesn't need full-time anyone in each role. Typical allocation:

0.5 FTE MLOps engineer: 0.5 × $234K / 12 = $9,750/mo
0.25 FTE prompt engineer: 0.25 × $195K / 12 = $4,063/mo
0.1 FTE SRE oncall: 0.1 × $254K / 12 = $2,117/mo
0.0 FTE eval / security / ATO (lean): $0

Total: 0.85 FTE = ~$15,930/month.

For federal deployments, add 0.2 FTE security reviewer ($3,800/mo) + 0.1 FTE ATO assessor ($1,792/mo) = ~$21K/mo all-in.
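The allocation arithmetic above can be reproduced with a small helper. This is a sketch under the tutorial's own defaults: the loaded annual figures come from the roles table, and the role keys (`mlops`, `prompt`, and so on) are hypothetical names chosen for illustration.

```python
# Loaded annual costs (USD) from the roles table above.
LOADED_ANNUAL = {
    "mlops": 234_000,
    "prompt": 195_000,
    "eval": 221_000,
    "security": 228_000,
    "ato": 215_000,
    "sre": 254_000,
}


def monthly_personnel(allocations: dict[str, float]) -> float:
    """Sum FTE x loaded annual cost / 12 over the allocated roles."""
    return sum(fte * LOADED_ANNUAL[role] / 12 for role, fte in allocations.items())


lean = {"mlops": 0.5, "prompt": 0.25, "sre": 0.1}
federal = {**lean, "security": 0.2, "ato": 0.1}

print(round(monthly_personnel(lean)))     # 15929
print(round(monthly_personnel(federal)))  # 21521
```

The one-dollar difference from the $15,930 quoted above comes from rounding each row before summing rather than after.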

Why compute is only 1/3 of the bill below $50K compute

A "$10K/month compute" deployment is realistically a "$50–80K/month all-in" deployment once you load personnel honestly. The compute share is ~15–20% of total.

Above $50K/month of compute, the share flips: compute dominates personnel. But most program-scale AI deployments are below that threshold — and that's exactly where the "compute is everything" intuition under-counts the bill by 3–5×.

Ship realistic FTE allocations + loaded costs as defaults. Users can argue them down for a smaller team or a deployment where MLOps is shared across multiple programs, but they can't argue them away — make the override visible, but the default honest.

Build it: a personnel panel with Include personnel costs in monthly total (parent checkbox, on by default), + Add a role for custom roles beyond the defaults, and per-role rows computed as:

FTE allocation × annual base × total-comp multiplier
  (1.30 = +30% benefits/overhead) ÷ 12

Salaries belong in your price-book / prices table; let the user edit them if their team's loaded cost differs. If a user unchecks the parent (BYOK or pure-API deployment with no ops layer), zero out the line in the total, but still show what the personnel line would have been in the report, for procurement transparency.

Part 16—Federal multipliers

Federal cloud isn't commercial cloud. Multipliers compound, and three additional cost lines (ATO, audit retention, PII redaction) appear that don't exist in commercial deployments. Build a federal-compliance & hosting panel that gates the whole layer behind a tier selector — when the user picks "None / commercial" the panel collapses; when they pick FedRAMP Moderate, the multipliers and extra lines all light up.

If you're modeling a commercial deployment, set FedRAMP tier to None / commercial and skip this module. If you're federal, this is where 30–80% of the difference between your calculator number and the commercial-headline number lives.

FedRAMP tiers: None / Low / Moderate / High / IL4 / IL5

Six tiers, each with progressively more security controls and progressively more cost overhead:

Tier	Multiplier	Use case
None / commercial	1.00×	Public-facing commercial
FedRAMP Low Impact SaaS	1.00×	Public data only
FedRAMP Low	1.10×	Internal but unclassified
FedRAMP Moderate	1.15×	Most common federal tier (CUI-Basic)
FedRAMP High	1.30×	High-impact data; GovCloud
DoD IL4	1.50×	Controlled Unclassified Information
DoD IL5	1.80×	National Security Systems

The multiplier applies to the LLM cost line and the GPU cost line. Other line items (vector DB, storage, NAT) get a separate GovCloud premium applied via the "Other monthly infrastructure" line items.

Multipliers: 1.0× / 1.10× / 1.15× / 1.30× / 1.50× / 1.80×

The multipliers reflect:

GovCloud regional premium (~30% across-the-board).
Operational overhead (separate ATO, separate compliance monitoring, restricted personnel).
Limited model availability (some flagship models aren't in GovCloud yet, forcing you to a more expensive alternative).

Negotiated enterprise contracts can shave 5–10% off these multipliers. Default to list-rate multipliers; the user can override in the price-book table when they have negotiated agency rates.

Multi-region: single / active-passive / active-active

Disaster-recovery posture stacks on top of FedRAMP:

DR mode	Multiplier	What it means
Single region	1.00×	One region, no DR
Active-passive	1.40–1.50×	Warm standby in a 2nd region
Active-active	1.80–2.00×	Full duplicate, both live

Active-passive doubles your storage cost and adds a smaller compute cost (the passive side runs minimal capacity until failover). Active-active doubles compute as well.

Multipliers compound (not stack)

Two FedRAMP-tier × multi-region scenarios:

FedRAMP Moderate + active-passive: 1.15 × 1.45 = 1.67× baseline.
FedRAMP High + active-active: 1.30 × 2.00 = 2.60× baseline.

Procurement teams who price the commercial number and add "~10% for compliance" are off by a factor of 2.5–3 on a high-tier active-active deployment. The calculator surfaces both multipliers separately so the reviewer can see what each one buys.
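A minimal sketch of the compounding, assuming the tier multipliers from the table above and a single representative value for each DR range (1.45 for active-passive, 2.00 for active-active, as in the two scenarios). The dictionary keys are hypothetical identifiers.

```python
# Tier multipliers from the FedRAMP table above (subset).
FEDRAMP_MULT = {"none": 1.00, "moderate": 1.15, "high": 1.30, "il4": 1.50, "il5": 1.80}

# One representative value per DR range; the table gives ranges.
DR_MULT = {"single": 1.00, "active_passive": 1.45, "active_active": 2.00}


def combined_multiplier(tier: str, dr_mode: str) -> float:
    """Multipliers compound: multiply them, do not add the percentages."""
    return FEDRAMP_MULT[tier] * DR_MULT[dr_mode]


print(round(combined_multiplier("moderate", "active_passive"), 2))  # 1.67
print(round(combined_multiplier("high", "active_active"), 2))       # 2.6
```

Keeping the two lookups separate, rather than storing one pre-multiplied table, is what lets the report show each multiplier on its own line.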

ATO: setup + monthly continuous monitoring

Authority to Operate (ATO) is the formal authorization for a system to handle data at a given classification. It has two costs:

Upfront — assessment, documentation, controls implementation, security testing.
Monthly continuous monitoring — ongoing security reviews, vulnerability scans, incident response readiness.

Typical numbers:

Tier	Upfront	Annual continuous monitoring
FedRAMP Low	$75,000	$25,000
FedRAMP Low Impact SaaS	$100,000	$30,000
FedRAMP Moderate	$250,000	$50,000
FedRAMP High	$500,000	$100,000
DoD IL4	$350,000	$75,000
DoD IL5	$600,000	$120,000

These are additive line items — they're paid regardless of LLM cost. For a small deployment, ATO can dwarf the AI cost. The upfront cost is typically amortised over a 36-month procurement window; the continuous-monitoring figure recurs annually.

ATO amortisation across N months

The upfront ATO cost amortises over the deployment lifetime. Expose this as an Upfront amortisation (months) control. A defensible default: 36 months (3-year deployment cycle).

For FedRAMP Moderate: $250K / 36 = ~$6,944/month for the upfront, plus $50K / 12 = ~$4,167/month for ongoing monitoring, for ~$11.1K/mo of compliance line item before AI compute.

Edit the panel's defaults if your ATO costs differ (negotiated 3PAO rates, agency-specific overhead).
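As a sketch, the monthly ATO line is the amortised upfront cost plus one-twelfth of the monitoring figure, taking that figure as annual, which is how the table above labels it. The function and parameter names are hypothetical.

```python
def ato_monthly(upfront: float, annual_monitoring: float,
                amortisation_months: int = 36) -> float:
    """Monthly ATO line: amortised upfront plus one-twelfth of annual monitoring."""
    if amortisation_months <= 0:
        raise ValueError("amortisation_months must be positive")
    return upfront / amortisation_months + annual_monitoring / 12


# FedRAMP Moderate defaults from the table: $250K upfront, $50K/year monitoring.
print(round(ato_monthly(250_000, 50_000)))  # 11111
```

Shortening the amortisation window raises the monthly line sharply, which is why the control is worth exposing rather than hardcoding.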

Vector DB hosting on GovCloud

If your deployment uses RAG (and most federal deployments do), the vector database needs to live in GovCloud too.

Pinecone GovCloud: ~30% premium over commercial Pinecone.
pgvector on RDS GovCloud: bundled with the DB cost (typically ~$300–1,500/month).
Self-hosted Milvus / Qdrant on EC2 GovCloud: cheaper but requires operational expertise.

A Vector DB / retrieval infra ($/mo) input under federal hosting handles this. A user-overridable manual value is fine for v1; auto-fetching vector-DB GovCloud pricing is brittle and not worth the engineering for the marginal accuracy gain.

Egress per query × $/GB

Federal cloud charges egress separately. Your queries generate response data that ships out of the region:

egress_monthly  =  egress_GB_per_query
                  × queries_per_month
                  × $/GB

Defaults: 0.001 GB/query, $0.09/GB. For 1M queries/month: 1,000 GB × $0.09 = $90/month. Small, but for high-throughput, data-heavy responses (image generation, PDF outputs, video summaries) it can rise to $1,000+/month.

GovCloud egress is sometimes higher than commercial; verify on the day of pricing.

Audit log retention: KB/query × queries × years × storage rate

Federal deployments retain LLM logs for 5–7 years. The audit log captures every prompt, every completion, every tool call, plus metadata.

audit_log_size  =  audit_KB_per_query
                  × queries_per_month
                  × retention_months
storage_cost   =  audit_log_size × $/GB-month

Defaults: 5 KB/query, 84 months (7 years), $0.004/GB-month (Glacier-class compressed storage).

For 1M queries/month after 7 years of accumulation: 1M × 5 KB × 84 = 420 GB, and 420 GB × $0.004 = $1.68/month of storage at year 7. That is cheap on Glacier-class storage; S3 Standard ($0.023/GB-month) costs about six times as much, $9.66/month at year 7.

Expose four controls under Audit log retention: KB/query, queries/month (linked to your volume control), retention months, and storage rate $/GB-month. Default to a Glacier-class storage rate; only switch to S3 Standard if the user needs sub-second retrieval (rare for audit logs).
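The four controls map directly onto a single function. A minimal sketch, assuming decimal units (1 GB = 1,000,000 KB) as in the worked example, and modelling only the steady-state cost once the retention window is full; the names are hypothetical.

```python
def audit_storage_monthly(kb_per_query: float, queries_per_month: float,
                          retention_months: int, rate_per_gb_month: float) -> float:
    """Steady-state monthly storage cost once the retention window is full.

    Uses decimal units (1 GB = 1,000,000 KB), matching the worked example.
    """
    size_gb = kb_per_query * queries_per_month * retention_months / 1_000_000
    return size_gb * rate_per_gb_month


# Defaults above: 5 KB/query, 1M queries/month, 84 months.
print(round(audit_storage_monthly(5, 1_000_000, 84, 0.004), 2))  # 1.68 (Glacier-class)
print(round(audit_storage_monthly(5, 1_000_000, 84, 0.023), 2))  # 9.66 (S3 Standard)
```

Before the window fills, the stored volume (and so the cost) grows linearly month by month; a ramp-aware version would pass the months elapsed instead of the full retention period.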

PII redaction service ($0.30/M tokens or self-hosted classifier)

Federal deployments handling PII need a redaction layer that strips PII from prompts and completions before logging or egress. Two options:

Managed PII redaction service (AWS Macie, Azure Purview, Google DLP): ~$0.30 per million tokens scanned. At 100M tokens/month → $30/month.
Self-hosted classifier (Presidio, custom NER): ~$200/mo fixed for the EC2 instance + ops.

For most federal deployments, the managed service is simpler and cheaper at moderate volume. At these defaults the crossover is roughly 670M tokens/month scanned ($200 ÷ $0.30 per million); above that, self-hosting wins on cost.

What's hard to model (sole-source, GSA, GovCloud model availability)

A few federal-specific costs are too situational to ship as defaults — leave them as user-addable line items rather than hardcoded controls:

Sole-source procurement overhead — agency-specific uplift for non-competed contracts.
GSA Schedule discounts — negotiated GSA rates are typically 5–10% below list.
Model availability constraints — some Anthropic Claude variants and OpenAI GPT-5.5 aren't in GovCloud yet. If your deployment must run on FedRAMP High and you need flagship reasoning, you may be stuck on Bedrock-Claude or self-hosted Llama 3.3 70B; the cost model differs.
ATO assessment labor — separate from FedRAMP authorization fees; the assessor's billable hours.

Tell users to add these as fixed-cost line items in your Other monthly infrastructure editor when they apply.

Part 17—Daily caps: the safety brake

A daily LLM-spend cap is the single best protection against runaway cost on a public-facing endpoint. Build a daily spending cap control with both a $/day input and an "Allow bursts" toggle (covered below). The calculator that ships without a cap will blow up the user's wallet on the first viral hit.

What 429 actually does

The cap is a hard ceiling at the API-key (or proxy-gateway) level: when daily spend hits the cap, the API gateway returns HTTP 429 Too Many Requests to subsequent requests for the rest of the day.

The cap doesn't make queries cheaper. It makes some of them not happen. Queries above the cap are refused; the user sees an error or a "service temporarily unavailable" page.

This is a deliberate trade-off: bounded budget vs degraded service for the cap-day. Most procurement reviewers prefer bounded budget — surprise overages are a worse story than degraded service days.

Surfacing refused-query count

What the typical cost calculator doesn't model: how many queries get refused because of the cap.

Without the cap, your monthly bill might be $125,000 and you serve all 4.5M queries. With a $1,500/day cap, your monthly bill is $45,000 and you serve 1.6M queries while refusing 2.9M.

These are not the same procurement decision:

The first one ($125K, 4.5M served) is a budget question.
The second one ($45K, 1.6M served, 2.9M refused) is a service-level question.

You need both numbers visible to make either decision honestly. Surface the refused-query count explicitly in your report. Most generic calculators drop it on the floor — and reviewers can't make a service-level decision they can't see.

Burst days and burst-day multiplier

Within the cap mechanism, two child controls give the user realistic burst behaviour:

Expected burst days/month (default 7) — days you allow to exceed the cap.
Burst-day multiplier (default 3×) — how much higher the cap goes on burst days.

So a $1,500/day baseline cap with 7 burst days at 3× becomes "7 days at $4,500 + 23 days at $1,500" = $66,000/month maximum. Procurement defends both numbers — the baseline ceiling and the burst-allowed ceiling.
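The burst-allowed ceiling is simple enough to express as one function. A sketch assuming a 30-day month, as in the example above; the parameter names are hypothetical.

```python
def monthly_cap_ceiling(daily_cap: float, burst_days: int = 7,
                        burst_multiplier: float = 3.0,
                        days_in_month: int = 30) -> float:
    """Maximum monthly spend when burst days raise the daily cap."""
    if not 0 <= burst_days <= days_in_month:
        raise ValueError("burst_days must be between 0 and days_in_month")
    normal_days = days_in_month - burst_days
    return burst_days * daily_cap * burst_multiplier + normal_days * daily_cap


print(monthly_cap_ceiling(1500))                # 66000.0 (burst-allowed ceiling)
print(monthly_cap_ceiling(1500, burst_days=0))  # 45000.0 (baseline ceiling)
```

Reporting both return values side by side gives procurement the two numbers it has to defend.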

Rolling vs fixed window

Two enforcement modes:

Fixed monthly window — counter resets at midnight on the 1st. A bad day at the end of the month doesn't help you.
Rolling 30-day window — counter looks at the last 30 days. A quiet week recovers budget for next week.

For federal procurement, fixed monthly is the simpler defense — "we will never spend more than $X in any calendar month." For operations, rolling is more humane. Pick one consciously when you build; modeling rolling is strictly more work, and "fixed monthly" is the right v1 default since it's also what procurement defends.

How to defend a cap in budget review

A reviewer's question: "What if there's a legitimate spike on day 15 — a press event, a crisis response — that the cap refuses?"

Three answers, in increasing cost:

Manual override — the cap is editable on the day of the spike. Operator monitors, lifts the cap, accepts bigger bill that month. Downside: requires 24/7 ops awareness.
Rolling window — let a quiet week earlier in the month build up headroom. Downside: irregular procurement defense.
Spike provisions — a separate emergency fund at the agency level that can be drawn down on declared emergencies. The cap stays in place; the agency pre-approves overage.

Most federal deployments use option 1 + a clear decision tree for when to override. Your $/day cap control plus burst settings let users model both scenarios.

Part 18—Bot rate limiting

Public endpoints get hammered by bots — search-engine crawlers, scrapers, abuse, accidental load tests. Without rate limiting, the bot factor from your audience-segments control (Part 4) can blow up to 5–10× and your bill follows. Build a bot-rate-limiting control that caps the bot multiplier and adds the strategy's monthly cost as a separate line item — this is the "spending money to save money" lever that procurement reviewers always want to see explicit.

Strategies: None / Edge WAF / Session+IP / Bot Control + CAPTCHA

Four strategies, increasing cost and increasing protection:

Strategy	Monthly cost	Bot ceiling
None	$0	unbounded (1.5×–10×)
Edge WAF (basic)	$20	2.0×
Session + IP throttling	$50	1.7×
Bot Control + CAPTCHA (full)	$300	1.3×

The strategy cost is the monthly hosting cost of the rate-limiting infrastructure (CloudFront WAF rules, AWS Bot Control add-on, third-party services like Cloudflare or DataDome).

Strategy monthly cost

Default to list rates for AWS-native options. Make the monthly-cost field user-editable so Cloudflare ($20/mo Workers + Bot Management at $200/mo for Pro plan) or other provider rates can be plugged in.

For a high-volume public endpoint (federal science Q&A services, public benefits-eligibility chatbots, etc.), the $300/month Bot Control + CAPTCHA is usually the right choice — the bot factor reduction (from 1.5× to 1.3×) saves more in LLM cost than the rate-limiter costs.

Worked example: 90,000 monthly anonymous queries × per-query cost $0.05 × bot factor:

Without rate limiter: 90K × 0.05 × 1.5 = $6,750 + $0
With Bot Control: 90K × 0.05 × 1.3 = $5,850 + $300 = $6,150

A net $600/month saving by spending $300/month on rate-limiting. At higher anonymous volume, the saving grows linearly.

Bot-factor ceiling

The strategy choice also caps the bot factor that the audience-segments control can multiply by. With no strategy, default to 1.5×; with full Bot Control, cap at 1.3×.

This is a ceiling, not a target. If a user's real bot factor is 1.1× even without rate limiting (low-profile, low-abuse endpoint), they should set it to 1.1× and skip the strategy — make sure the per-segment bot factor input remains the source of truth, with the strategy ceiling as an upper bound applied on top.
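One way to wire the ceiling and the strategy cost together is sketched below. The strategy table mirrors the defaults above; the keys and function names are hypothetical, and "unbounded" is represented as `None`.

```python
STRATEGIES = {
    # name: (monthly cost in USD, bot-factor ceiling or None for unbounded)
    "none": (0.0, None),
    "edge_waf": (20.0, 2.0),
    "session_ip": (50.0, 1.7),
    "bot_control": (300.0, 1.3),
}


def effective_bot_factor(segment_factor: float, strategy: str) -> float:
    """The per-segment factor is the source of truth; the strategy only caps it."""
    _, ceiling = STRATEGIES[strategy]
    return segment_factor if ceiling is None else min(segment_factor, ceiling)


def anonymous_monthly_cost(queries: float, cost_per_query: float,
                           segment_factor: float, strategy: str) -> float:
    """LLM cost for anonymous traffic plus the rate limiter's own monthly cost."""
    strategy_cost, _ = STRATEGIES[strategy]
    factor = effective_bot_factor(segment_factor, strategy)
    return queries * cost_per_query * factor + strategy_cost


# Worked example above: 90K queries, $0.05 per query, bot factor 1.5.
print(round(anonymous_monthly_cost(90_000, 0.05, 1.5, "none")))         # 6750
print(round(anonymous_monthly_cost(90_000, 0.05, 1.5, "bot_control")))  # 6150
```

Because the ceiling is applied with `min`, a user who enters a 1.1× factor keeps 1.1× regardless of strategy, and the calculator can then show that the strategy's cost is not paying for itself.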

Part 19—Other infrastructure (the floor)

Even if the LLM cost is zero, every production deployment has a floor of infrastructure costs that are paid regardless of query volume. Build an Other monthly costs editor — an editable line-item table — so users can add and adjust infrastructure lines beyond the canonical defaults below.

For a federal-science-agency deployment running on AWS Fargate, the floor is around $1,200/month of pure infrastructure before a single LLM token is consumed.

Compute platform (Fargate / EKS / SMK8s)

The application server hosting the agent (separate from the LLM provider). Three common options:

Fargate — managed serverless containers. Easiest to run; ~$180–500/mo for typical agent traffic.
EKS — managed Kubernetes. Cheaper at scale; ~$150–800/mo including node fleet.
Self-managed K8s — bare EC2 + you run K8s. Cheapest in raw $/node terms but needs MLOps expertise (~$5K/mo hidden FTE).

Be careful: your self-host capacity control's compute platform (which provisions GPU instances for self-hosted LLMs) is a different line from this app-server compute platform. Both deserve their own dropdown. Wiring the two together is a common bug — they have different scaling behaviour, different price classes, and different ops implications.

For typical federal agents, Fargate or EKS at ~$300–800/mo is right. Self-managed K8s makes sense only if you already run K8s elsewhere.

Database (RDS/Aurora ~$300-1500)

Your application needs a database for session state, user metadata, audit logs (separate from the long-retention audit-log storage covered in Part 16), and conversation history.

Typical sizing:

DB	Monthly
RDS Postgres db.t3.medium	$80
RDS Postgres db.r5.large	$300
Aurora Postgres r5.large + 50GB	$400
Aurora Multi-AZ + read replicas	$1,500

Federal deployments typically need at least Multi-AZ ($600+); high-availability deployments with read replicas run $1,500.

Load balancer (ALB ~$35)

AWS Application Load Balancer: ~$35/month for the LB itself plus per-LCU charges for traffic. For typical agent traffic, $35–60/month is right.

Cache (ElastiCache ~$50)

A small Redis instance for session state, cached responses, rate-limit counters. ElastiCache cache.t4g.small ≈ $50/month.

Storage (S3 ~$100)

S3 buckets for static content, audit log staging, model artifacts. Standard-class S3 at $0.023/GB-month, ~$50–200/mo typical.

Observability (CloudWatch ~$300)

Application logs, metrics, traces, alarms. CloudWatch Logs at $0.50/GB ingested + $0.03/GB stored. For typical agent traffic, $200–500/mo.

For richer observability (Datadog, New Relic, Honeycomb), budget $500–1,500/mo depending on host count and trace volume.

Secrets management (~$15)

AWS Secrets Manager for API keys, DB credentials. $0.40/secret/month + $0.05/10K API calls. ~$15/mo for typical setups.

Edge rate limiting (~$15)

CloudFront + WAF rules for the public endpoint. ~$15/mo for basic rules; ~$300/mo for full Bot Control (covered in Part 18).

Ingest jobs (~$10-50)

Periodic jobs to refresh embeddings, sync data, run migrations. Lambda or Step Functions; typically $10–50/mo.

NAT / egress / Route53 (~$50)

NAT Gateway: $32/mo + per-GB processing. Route53: ~$5/mo per hosted zone. Egress charges are covered in Part 16's federal layer, where applicable.

Part 20—Migration timeline

Real federal procurement decisions span phases. "API in Year 1, hybrid in Year 2, self-host in Year 3." Build a migration timeline control that models phased deployments and a 3-year TCO view. Without this, your calculator can only defend snapshot costs; reviewers want the curve.

Phased deployment: API → Hybrid → Self-host

A common federal-deployment arc:

Year 1 — API only. Lowest upfront cost, fastest time-to-production, learn what the workload actually looks like.
Year 2 — Hybrid. Production logs reveal which paths are high-volume narrow tasks; move those to self-host (or reserved capacity), keep flagship reasoning on API.
Year 3 — Mostly self-host or reserved. With 18+ months of production data, sizing is precise; commit to capacity or self-host where it wins.

This arc minimises upfront capital, defers self-host decisions until you have the data to make them well, and gives procurement a defensible 3-year cost trajectory.

Per-phase hosting and reservation

Each phase needs its own values for:

Hosting (API / Hybrid / Self-host)
Reservation type (none / 1-yr / 3-yr)
Phase duration (months)
Phase-specific assumption overrides (different cache rate, mix, volume — workload changes across phases)

Your cost engine computes monthly cost per phase, sums to a 3-year total, and surfaces the migration cost (any one-time setup amortised across the timeline).

3-year TCO with risk uncertainty bands

The "Migration timeline (3-year phased)" chart visualises cost-over-time so users can see where transitions break even. Procurement reviewers like to see this — it turns a single number into a defensible curve.

The chart should also surface risk bands: the gap between optimistic and realistic mode at each phase, plus a ±20% uncertainty band on the overall total. The 3-year sum with risk bands is the "TCO" number agencies report in budget submissions.
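A minimal data model for the phased total might look like the following. Everything here is an assumption for illustration: the `Phase` fields, the example monthly costs, and the setup amounts are hypothetical, and the ±20% band is applied to the overall total as described above.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    months: int
    monthly_cost: float          # output of the cost engine for this phase
    one_time_setup: float = 0.0  # migration cost incurred entering the phase


def three_year_tco(phases: list[Phase], band: float = 0.20) -> dict[str, float]:
    """Sum phase costs and attach a symmetric uncertainty band."""
    total = sum(p.months * p.monthly_cost + p.one_time_setup for p in phases)
    return {"low": total * (1 - band), "expected": total, "high": total * (1 + band)}


# Hypothetical numbers, purely to exercise the function.
phases = [
    Phase("Year 1: API", 12, 40_000.0),
    Phase("Year 2: Hybrid", 12, 35_000.0, one_time_setup=50_000.0),
    Phase("Year 3: Self-host", 12, 30_000.0, one_time_setup=100_000.0),
]
tco = three_year_tco(phases)
print(tco["expected"])  # 1410000.0
```

Per-phase assumption overrides would feed into `monthly_cost` upstream; this sketch only covers the roll-up.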

Part 21—Multi-model routing

Production agentic systems route queries to different models based on complexity. Easy queries go to a cheap model (Haiku, gpt-4o-mini, Gemini Flash); hard queries go to a flagship (Sonnet, GPT-5.5, Gemini Pro).

Build a routing control with percentage sliders per model tier. The simplest version is three rows (cheap / balanced / flagship) with weights summing to 100%. Surface the weighted-blended per-query cost prominently, and a "vs single-model baseline" comparison so users see the saving.

Why route at all (Triage 70/25/5)

A common production split, called Triage:

70% Haiku ($0.80/M input) — easy queries, factual lookups, simple chat.
25% Sonnet ($3/M input) — moderate-complexity tasks.
5% Opus ($15/M input) — hard reasoning, multi-step research.

Compared to running 100% Sonnet for everything: the cost drops by ~50% on the input line (since 70% of queries shift from Sonnet to Haiku), with negligible quality impact on the easy queries.

The bet: easy queries don't need flagship; routing them elsewhere saves money without users noticing.

Production presets: 90/10

For workloads dominated by easy traffic (FAQ bots, deterministic lookup agents):

90% Haiku / 10% Sonnet escalation.
Reserve the 10% Sonnet for queries the Haiku classifier flags as hard.

Saves 60–70% vs all-Sonnet baseline.

Hybrid 50/30/20 across providers

For multi-provider risk diversification:

50% Gemini 3 Flash — cheapest flagship-tier.
30% Sonnet — balanced.
20% Opus — hardest queries.

Mixing providers reduces vendor lock-in and gives you a hedge against per-provider rate hikes. Operational complexity is higher (multiple SDK integrations, multiple rate-limit pools).

Cost-optimized vs Quality-first presets

A useful set of named presets to ship:

Cost-optimized — Gemini Lite 65 / Flash-Lite 20 / Llama 10 / Sonnet 5. Maximally cheap; quality varies.
Quality-first — Sonnet 55 / Opus 30 / GPT-5.5 15. Flagship-heavy; expensive.
Triage 70/25/5 — Haiku → Sonnet → Opus.
Production 90/10 — Haiku → Sonnet escalation.
All-Sonnet 100 — single-model baseline.

The user picks a preset, overrides percentages with the sliders, and your cost engine recomputes the weighted-blended per-query cost. Presets save users from the blank-canvas problem.

The 50-60% saving when 70% of queries are easy

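The math: if 70% of queries can be served at 1/15 the cost of Sonnet (an illustrative $0.20/M cheap-tier rate, lower than the $0.80/M Haiku rate quoted in the Triage split) without quality loss, the weighted-blended cost is:

0.70 × $0.20  +  0.25 × $3.00  +  0.05 × $15.00
=  $0.14 + $0.75 + $0.75
=  $1.64 per million tokens

vs all-Sonnet at $3.00/M. A 45% saving on input cost. At the $0.80/M Haiku rate the same mix blends to $2.06/M, a saving of about 31%.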

The trick is correctly routing. A poorly-routed system that sends 70% of queries to Haiku but the wrong 70% (the hard ones, not the easy ones) gets bad answers on hard queries and wastes flagship capacity on easy ones. Build a classifier that you trust before you save 45%.
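The weighted blend can be sketched as below, using the rates from the formula above. Two simplifications are assumed: weights are query shares, and every tier sees the same tokens per query, so query share equals token share. The tier keys are hypothetical.

```python
def blended_rate(mix: dict[str, float], rate_per_m: dict[str, float]) -> float:
    """Weighted input rate ($ per million tokens) for a routing mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("routing weights must sum to 100%")
    return sum(share * rate_per_m[model] for model, share in mix.items())


# Rates as used in the formula above (illustrative, not a live price list).
RATES = {"cheap": 0.20, "sonnet": 3.00, "opus": 15.00}
TRIAGE = {"cheap": 0.70, "sonnet": 0.25, "opus": 0.05}

rate = blended_rate(TRIAGE, RATES)
saving = 1 - rate / RATES["sonnet"]
print(round(rate, 2), round(saving * 100))  # 1.64 45
```

Swapping in a different preset is a matter of passing another `mix` dictionary, which is also how the "vs single-model baseline" comparison falls out: the baseline is just a mix with one weight at 1.0.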

Part 22—Confidence intervals: lognormal CIs

Output token length is right-skewed — most responses are short, a long tail is much longer. A flat "average" output length under-estimates p99. Build a confidence-interval control that models token length as a lognormal distribution and surfaces p50, p90, p99 separately. A single point estimate is structurally wrong for any non-trivial agent.

Why output length is right-skewed

Two reasons:

A long-form response is structurally longer than a refusal. The cost difference between "I don't know" and "Here is a 2,000-word analysis" is two orders of magnitude.
Output cap dynamics. Most responses come in well below the max-output cap; a few hit it. The distribution is bounded above but not normal.

A normal-distribution model (mean ± stddev) under-counts the tail. A lognormal model captures the skew correctly.

p50 / p90 / p99

The calculator surfaces three percentiles:

p50 — median; half the queries cost less, half cost more.
p90 — 90th percentile; only 10% of queries are more expensive.
p99 — 99th percentile; the long tail.

The lognormal formulas:

p50  =  exp(μ)
p90  =  exp(μ + 1.282 × σ)
p99  =  exp(μ + 2.326 × σ)

For a CV (coefficient of variation) of 1.0, p99 ≈ 7× p50. For CV = 2.0, p99 ≈ 19× p50. The calculator computes σ from CV weighted by task mix.

σ from CV (per task type)

σ  =  √( ln(1 + CV²) )

CV is the standard deviation divided by the mean — a unitless "how variable is this?" measure.
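Putting the percentile formulas and the σ-from-CV conversion together gives a small helper. One assumption goes beyond the text: the sketch takes the mean output length as input and recovers μ from the standard lognormal identity mean = exp(μ + σ²/2). The function names are hypothetical.

```python
import math

Z90, Z99 = 1.282, 2.326  # standard-normal quantiles used in the formulas above


def sigma_from_cv(cv: float) -> float:
    """Lognormal shape parameter from the coefficient of variation."""
    return math.sqrt(math.log(1 + cv ** 2))


def percentiles_from_mean(mean_tokens: float, cv: float) -> dict[str, float]:
    """p50/p90/p99 of a lognormal with the given mean and CV."""
    sigma = sigma_from_cv(cv)
    # Lognormal mean = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_tokens) - sigma ** 2 / 2
    return {
        "p50": math.exp(mu),
        "p90": math.exp(mu + Z90 * sigma),
        "p99": math.exp(mu + Z99 * sigma),
    }


# Example: a RAG-style task with a hypothetical 400-token mean and CV = 0.85.
for name, value in percentiles_from_mean(400, 0.85).items():
    print(f"{name}: {value:.0f} tokens")
```

Note that the median comes out below the mean, which is the practical consequence of the right skew: budgeting on the mean already overstates the typical query but understates the tail.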

CV per task type

A defensible default CV table to ship:

Task type	CV
Classify	0.30
Summary	0.65
RAG	0.85
Code	2.80
Longform	3.60
Agentic (tool-orchestration)	4.30

These are heuristic estimates from informal observations — not validated against HELM, LMSYS, or other public benchmarks. Surface this caveat in your UI's help text so users don't treat them as ground truth. Treat them as planning estimates with their own ±30% uncertainty.

For high-stakes budgeting, derive CV empirically from your own production logs.

Why p99 matters for budget defense

A procurement question that comes up: "What's the worst month look like?"

The answer is some multiple of the average — driven by:

Long queries (the lognormal output tail)
Burst days (the burst-day multiplier from Part 17)
Retry overhead (rate-limit retries during peaks)

Surface a "p99 monthly bill" column on your model-comparison table that multiplies all three (lognormal tail × burst-day multiplier × retry overhead) — what reviewers see as the worst month out of 100. Procurement reviewers want this distinct from the average; if the monthly cap is below p99, the user will hit it occasionally and need to explain why.

Cost is bounded below by zero and has a long upper tail (a few queries are enormous), which is exactly what a lognormal captures. A normal distribution would assign positive probability to negative cost and would massively under-estimate the p99 tail.

Part 23—Sensitivity analysis

A defensible cost number isn't one number. It's one number with sensitivity bands. Build a tornado chart that shows which knobs move the bill most. This is the single most useful output any cost calculator produces — it tells the user where to invest engineering effort, and tells reviewers what assumptions deserve scrutiny.

The tornado chart (±50% per parameter)

The tornado chart varies each parameter by ±50% holding all others constant, and plots the resulting min/max bill as a horizontal bar. Bars are sorted longest-first; the longest bar is the parameter your bill is most sensitive to.

Typical findings for a federal Q&A deployment:

Cache hit rate — sensitivity ~$15K (longest bar). A 10pp change in cache hit rate moves the bill by ~$15K.
Reservations applied — sensitivity ~$10K.
Self-host vs API — sensitivity ~$8K.
Verifier coverage — sensitivity ~$5K.
Personnel allocation — sensitivity ~$4K.
Daily cap — sensitivity ~$3K (caps the upside).

The tornado tells you where to invest engineering effort: optimising cache hit rate is more leveraged than optimising language multiplier.
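The one-at-a-time procedure behind the chart can be sketched generically: pass in any function that maps a parameter dictionary to a monthly bill. The `toy_bill` model and its base values below are hypothetical and deliberately tiny; a real engine would also clamp rates to the 0–1 range after the swing.

```python
from typing import Callable


def tornado(model: Callable[[dict[str, float]], float],
            base: dict[str, float],
            swing: float = 0.5) -> list[tuple[str, float, float]]:
    """One-at-a-time sensitivity: vary each parameter by +/- swing, hold the rest.

    Returns (name, low_bill, high_bill) sorted by bar length, longest first.
    """
    bars = []
    for name, value in base.items():
        bills = [model({**base, name: value * (1 + s)}) for s in (-swing, swing)]
        bars.append((name, min(bills), max(bills)))
    return sorted(bars, key=lambda b: b[2] - b[1], reverse=True)


# Toy bill, purely illustrative: uncached LLM spend plus fixed personnel.
def toy_bill(p: dict[str, float]) -> float:
    llm = p["queries"] * p["cost_per_query"] * (1 - p["cache_hit_rate"])
    return llm + p["personnel"]


base = {"queries": 1_000_000, "cost_per_query": 0.03,
        "cache_hit_rate": 0.6, "personnel": 16_000}
for name, low, high in tornado(toy_bill, base):
    print(f"{name:15s} {low:10,.0f} {high:10,.0f}")
```

In this toy setup the cache hit rate produces the longest bar, which matches the typical finding listed above, but the ordering depends entirely on the base values fed in.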

Which knobs move the bill most

Three robust observations across deployments:

Cache hit rate is the single most leveraged knob. Improving from 75% to 90% saves ~30% on input cost.
Hosting choice is the second most leveraged, though at federal scale (Part 24, the same-budget question) the API-vs-self-host gap is smaller than headlines suggest.
Reservation type matters above ~$5K/mo of LLM spend. Below that, on-demand wins on flexibility; above that, PTU or commit can save 30–50%.

Knobs that move the bill less than people think:

Language multiplier (unless you're heavily CJK).
Output rate (output is small relative to input for tool-orchestration agents).
Tier multipliers (most workloads use Standard).

What-if scenarios

A "What-If" mode lets the user override one parameter and see the new total without committing the override to the main state. Useful for live-stress-testing during a budget review: "what if we double the volume?" "what if the cache hit rate drops to 60%?" Build it as a transient overlay so exploration doesn't damage the user's saved configuration.

Why sensitivity beats "the number"

Procurement reviewers are paid to ask "what if your assumption is wrong?" The tornado chart is the answer: "here's what happens if every assumption is wrong by 50% in either direction; the worst-case bill is 1.8× the expected, the best case is 0.6×."

That's a defensible bound. "Our estimate is $X" is not.

Part 24—The same-budget question

The honest question for the API-vs-self-host decision isn't "what does each cost at full service?" — it's "what does each deliver at the same monthly budget the program actually has?"

If your calculator only outputs "what does each cost at full service," you're leaving the most important comparison on the table. Build a same-budget mode that takes a $/month budget and reports queries-served-per-strategy.

Match the cap on self-host to the cap on API

Most analyses pit:

API path with a cap against
Self-host path without a cap

They report API as "$45K/month" next to self-host as "$86K/month" and conclude self-host is more expensive. It is. It's also serving every query, while the API path is refusing two-thirds of them at the cap.

The honest comparison is what each strategy delivers at the same monthly budget. Match the cap on self-host to the cap on API. Compute how many queries each strategy serves at that budget under realistic mode.

Compute queries served at equal $

Take a federal Q&A deployment with 4.5M queries/month arriving (heavy traffic). Per-query cost on a typical API configuration is ~$0.028 weighted across the shape mix.

API path at full service:

$0.028 × 4.5M = $126K/month, all 4.5M queries served.

But the program's budget is $45K/month, not $126K. So the API path runs with a daily cap (Part 17). At $45K:

$45K / $0.028 per query = 1.6M queries served.
The other 2.9M arriving queries get HTTP 429.

Self-host path at $45K budget, realistic mode (Part 13's overhead):

realistic self-host overhead per month
  personnel  (1.15 FTE federal blend) =  $21,000
  ops monthly                          =   $1,800
  setup amortised (over 12 months)     =   $8,333
  observability                        =     $300
  infra floor (DB, ALB, S3, etc.)      =   $1,200
  ────────────────────────────────────────────────
  fixed overhead                       =  $32,633

That's $32.6K of fixed cost before a single GPU runs. The remaining $12.4K is what's left for GPU rent — but a single g6e.12xl on-demand is ~$7,650/month ($10.49/hr × 730 hr), so $12.4K only buys ~1.6 instances. (A Year-1 pilot runs on-demand — you can't yet justify the multi-year commitment a reserved-instance discount needs.)

The HA floor (Part 13) is the catch: federal deployments typically require 2 instances minimum for cross-AZ resilience. Two g6e.12xl = ~$15.3K of GPU rent — which exceeds the $12.4K the budget left for compute.

You have three choices, none of them clean:

Violate the HA floor — run 1 instance. Single point of failure. Audit will flag it.
Exceed the budget — provision 2 instances at $15.3K, total monthly $47.9K. Goes over the $45K cap by $2.9K.
Cut the overhead — defer the 0.1 FTE eval engineer, skip continuous monitoring, run lean. Risky for federal compliance.

Take option 1 (single instance, no HA — the most generous case for self-host):

1 × g6e.12xl × 1,200 t/s × 0.75 derate × 730 hours/mo × 3,600 s/hr = ~2.37B tokens/month capacity, at full peak utilization.
But single-tenant utilization is typically 10–20%, not 100% (Part 13). Even granting a generous 50% effective utilization across the month, that leaves ~1.18B usable tokens.
At 2,000 tokens/query weighted average (a defensible sizing coefficient for federal Q&A workloads): **590K queries/month**.

Self-host serves ~2/3 the queries (realistic mode)

Side-by-side at $45K budget, realistic mode:

Strategy	Queries served	Queries refused
API + daily cap	1,600,000	2,900,000
Self-host (HA-violating, 1 inst.)	590,000	3,910,000
Self-host (HA-respecting, 2 inst.)	over budget by $2.9K	n/a

Self-host serves about 37% of API's queries under these assumptions. With more optimistic utilization assumptions (say batch size 4–6 instead of single-digit; more aggressive prompt cache reuse) self-host can climb to ~60–70% of API's serving rate. Either way it's behind, not ahead.

The headline framing: at equal budget, self-host serves roughly two-thirds the queries that the API path does. The math here lands in the same neighbourhood (37–70% depending on utilization assumptions); the sensitivity is to utilization and to whether you're willing to relax the HA floor.

Build an "API vs self-host" report row in your output that shows queries-served at equal budget, side-by-side, with an optimistic-vs-realistic toggle so the gap is visible. Without this view, your calculator can't answer the most important procurement question; with it, the answer is unambiguous. The numbers above are illustrative — when your calculator does the arithmetic honestly with realistic-mode assumptions, the same conclusion lands almost regardless of inputs.
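The same-budget arithmetic from this part can be sketched as two small functions, one per strategy. The numbers passed in below are the illustrative ones from the text ($45K budget, $0.028 per query, $32,633 fixed overhead, $10.49/hr on-demand, 1,200 t/s, 0.75 de-rate, 50% utilization, 2,000 tokens per query). The function and field names are assumptions made for this sketch.

```typescript
interface SelfHostAssumptions {
  fixedOverhead: number;   // $/month spent before any GPU runs
  gpuMonthly: number;      // $/month per instance, on-demand
  instances: number;       // instances actually provisioned
  tokensPerSecond: number; // paper throughput per instance
  derate: number;          // realistic-mode throughput de-rate
  utilization: number;     // effective utilization across the month
  tokensPerQuery: number;  // weighted average tokens per query
}

const HOURS_PER_MONTH = 730;

// API path: the cap converts budget straight into queries served.
function apiQueriesServed(budget: number, costPerQuery: number, arriving: number): number {
  return Math.min(arriving, Math.floor(budget / costPerQuery));
}

// Self-host path: returns null when the configuration does not fit the budget.
function selfHostQueriesServed(
  budget: number, a: SelfHostAssumptions, arriving: number,
): number | null {
  const spend = a.fixedOverhead + a.instances * a.gpuMonthly;
  if (spend > budget) return null;
  const tokens =
    a.instances * a.tokensPerSecond * a.derate * HOURS_PER_MONTH * 3600 * a.utilization;
  return Math.min(arriving, Math.floor(tokens / a.tokensPerQuery));
}

const budget = 45_000;
const arriving = 4_500_000;
const assumptions: SelfHostAssumptions = {
  fixedOverhead: 32_633,
  gpuMonthly: 10.49 * HOURS_PER_MONTH,
  instances: 1,
  tokensPerSecond: 1_200,
  derate: 0.75,
  utilization: 0.5,
  tokensPerQuery: 2_000,
};

const apiServed = apiQueriesServed(budget, 0.028, arriving);
const oneInstance = selfHostQueriesServed(budget, assumptions, arriving);
const twoInstances = selfHostQueriesServed(budget, { ...assumptions, instances: 2 }, arriving);
```

Without intermediate rounding the single-instance case comes out at 591,300 queries, which the text rounds to 590K, and the two-instance case returns null because it does not fit the budget. Returning null for an over-budget configuration, instead of a clamped number, keeps the report row honest about which strategies are actually available.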

Why "self-host saves money" disappears

Three structural reasons (already covered piecewise):

Utilization — single-tenant 10–20%; commercial 60–80%. The hardware cost is the same; the per-query amortisation is several times worse.
Continuous batching — single-tenant batch sizes in the single digits; commercial in the hundreds. Per-GPU throughput differs by 10–50×.
Personnel + ops — the ~1.15 FTE federal staffing blend (~$21K/mo loaded) eats most of a federal-program-scale budget before any GPUs do useful work.

The decision pivots to quality, operational burden, vendor risk, and compliance posture — not cost.

What the decision is actually about

If cost isn't the decision driver, what is?

Quality — does your fine-tuned open-weight model answer as well as a flagship API model?
Operational burden — how comfortable is your team running a GPU fleet 24/7?
Vendor risk — what's your exposure if a provider's pricing changes, or if they exit the market mid-contract?
Compliance posture — FedRAMP High, data sovereignty, classification level requirements that limit which APIs are available.

These are the right questions to debate. The cost number is the gating factor (you must fit a budget); the right answer within that budget is determined by the four above.

Part 25—Validation: a measurement loop

Every coefficient in your calculator will start as an honest default — either hand-tuned by someone who has priced these systems before, or copied from this tutorial's tables. The defaults are not measurements. Without a validation loop, your calculator stays calibrated to "best guess as of last Tuesday," and drifts as models, providers, and pricing change.

The piece that converts hand-tuning into measurement is a benchmark that runs production-shape multi-agent scenarios against real LLM provider APIs and emits variance reports. This is the most important infrastructure to build after v1 ships — without it your calculator is uncalibrated. One open-source reference harness lives at ai-cost-calculator-studio/bench/; the section below walks the architecture so you can build your own (or fork the reference).

LiteLLM + LangGraph + OpenTelemetry GenAI semconv

A defensible bench stack to build on:

Concern	Library	Rationale
Provider abstraction	LiteLLM	One client, every major provider, consistent `usage` capture (incl. `cached_tokens`).
Multi-agent orchestration	LangGraph	The de-facto state-machine framework for agent flows.
Tracing	OpenTelemetry GenAI semconv	Emits standard `gen_ai.*` spans; works with Langfuse / Arize / Phoenix / Datadog out of the box.
Tool execution	Real local function calling + MCP	No mocks; tool-result token counts reflect production payloads.

This isn't a toy benchmark. Traces produced by this stack are auditable down to the request_id, and your coefficients file should ship with provenance for each measured value (source scenario, sample size, provider, measurement date) — without provenance, calibrated values are indistinguishable from hand-tuned ones.

The nine reference scenarios

A useful starter set of nine scenarios, each targeting a specific coefficient:

Scenario	What it validates
`smoke-test`	the plumbing works
`multi-stage-research`	sequential handoff overhead, cumulative-context growth
`streaming-pipeline`	TTFT, output rate (latency coefficients)
`tool-chain`	tool schema overhead, tool result tokens
`data-discovery`	confirmation-gated tool orchestration (long-sysprompt pattern)
`data-discovery-gpt52`	same as above on a frontier model
`cached-pipeline`	OpenAI auto-cache hit rate
`cached-pipeline-anthropic`	Anthropic explicit `cache_control` hit rate
`parallel-fan-out`	orchestrator + 3 parallel specialists + synthesizer

Run a scenario and you get a JSON trace artifact with every LLM call's input/output token counts, cache stats, latency, and cost.
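As an illustration of how a trace artifact turns into coefficients, the sketch below aggregates per-call token counts into a cache hit rate and an input/output ratio. The record shape and field names are assumptions for this example, not the reference bench's schema, and the hit rate here is token-weighted (cached input tokens over all input tokens), which is one of several reasonable definitions.

```typescript
// Hypothetical per-call record; field names are assumptions.
interface TracedCall {
  request_id: string;
  input_tokens: number;  // total prompt tokens, including cached ones
  cached_tokens: number; // subset of input_tokens served from cache
  output_tokens: number;
}

interface MeasuredCoefficients {
  sample_size: number;
  cache_hit_rate: number;     // cached input tokens / all input tokens
  input_output_ratio: number; // input tokens per output token
}

function summarise(calls: TracedCall[]): MeasuredCoefficients {
  if (calls.length === 0) throw new Error("trace has no calls");
  let input = 0;
  let cached = 0;
  let output = 0;
  for (const c of calls) {
    input += c.input_tokens;
    cached += c.cached_tokens;
    output += c.output_tokens;
  }
  return {
    sample_size: calls.length,
    cache_hit_rate: input > 0 ? cached / input : 0,
    input_output_ratio: output > 0 ? input / output : Infinity,
  };
}

// Two made-up calls: a cold first call, then a mostly cached second call.
const measured = summarise([
  { request_id: "req-1", input_tokens: 4000, cached_tokens: 0, output_tokens: 50 },
  { request_id: "req-2", input_tokens: 4000, cached_tokens: 3600, output_tokens: 60 },
]);
```

Keeping `request_id` on every record costs nothing at aggregation time and is what lets a reviewer trace a coefficient back to individual calls.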

How a coefficient gets calibrated

The calibration loop your bench should implement:

Run the scenario against a real provider API.
Emit a variance report comparing measured values to your calculator's predicted defaults.
Flag any coefficient off by more than ±15%.
Write calibrated values into a versioned coefficients.json (or whatever your equivalent is).
Your calculator fetches it at page load.
A "✓ MEASURED" UI badge confirms when measured values are in use vs. defaults.

Sample variance report:

| Coefficient                 | Predicted | Actual  | Δ%     |
|-----------------------------|-----------|---------|--------|
| cache_hit_rate              | 0.84      | 0.91    | +8%    |
| sequential_handoff_overhead | 200 tok   | 700 tok | +250%  |
| input_output_ratio          | 6:1       | 73:1    | +1117% |

The 1117% delta on I/O ratio is what motivated the long-sysprompt calibration; older calculators were systematically wrong on tool-orchestration agents.
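The variance step of the loop is a relative-difference check against a threshold. A minimal sketch, using the three rows from the sample report and the ±15% flag from the loop description; the function name and row shape are assumptions, and ratios such as 6:1 are represented as plain numbers.

```typescript
interface VarianceRow {
  coefficient: string;
  predicted: number;
  actual: number;
  deltaPct: number; // signed relative difference, in percent
  flagged: boolean; // true when |deltaPct| exceeds the threshold
}

function varianceReport(
  predicted: Record<string, number>,
  actual: Record<string, number>,
  thresholdPct = 15,
): VarianceRow[] {
  const rows: VarianceRow[] = [];
  for (const [coefficient, p] of Object.entries(predicted)) {
    const a = actual[coefficient];
    // Skip coefficients that were not measured, or where a relative delta is undefined.
    if (a === undefined || p === 0) continue;
    const deltaPct = ((a - p) / p) * 100;
    rows.push({ coefficient, predicted: p, actual: a, deltaPct, flagged: Math.abs(deltaPct) > thresholdPct });
  }
  return rows;
}

const report = varianceReport(
  { cache_hit_rate: 0.84, sequential_handoff_overhead: 200, input_output_ratio: 6 },
  { cache_hit_rate: 0.91, sequential_handoff_overhead: 700, input_output_ratio: 73 },
);
const flaggedNames = report.filter((r) => r.flagged).map((r) => r.coefficient);
```

On the sample values, the cache hit rate stays under the threshold while the handoff overhead and the I/O ratio are both flagged.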

Coefficient provenance and the ✓ MEASURED badge

Your coefficients file should be the single source of truth for measured values. Each entry should carry:

Value — the calibrated number.
Source scenario — which bench run produced it.
Sample size — how many calls were measured.
Provider — which API.
Measurement date — when the bench last ran.

A ✓ MEASURED badge in your UI cues users that they're seeing measured values, not hand-tuned defaults. Make it clickable to surface the per-coefficient provenance — when a procurement reviewer asks "where does the 0.91 cache hit rate come from," the answer should be one click away. (For a worked example, see calc.ajinkya.ai/coefficients.json.)

Provenance down to request_id

Build your bench so trace artifacts are audit-grade. The provenance chain a reviewer should be able to walk:

Click the ✓ MEASURED badge in your UI.
See the source scenario name (e.g. cached-pipeline.yml).
Open the trace JSON in your bench repo.
See every measured call with its request_id.
(Optionally, with provider's permission) verify the request actually happened against the provider's logs.

This is what turns a cost estimate into a cost measurement, and a spreadsheet into a procurement artifact.

When to re-run the bench

The bench is a continuous-validation harness, not a one-off. Re-run it when:

A provider releases a new model (test the cache hit rate on it, the I/O ratio, the throughput).
A provider changes pricing (verify the rate against the scraper).
Your workload changes shape (longer system prompts, different tool count, new modality).
You're submitting a federal procurement and want fresh measurements (bench takes ~10 minutes per scenario).

The bench costs roughly $0.50–$5 per scenario in API spend, depending on the model and scenario. Cheap insurance against defending a stale coefficient.

Part 26—A worked example end-to-end

Let me put the whole tutorial together with a single end-to-end worked example. The deployment: a federal science agency Q&A agent serving 500 internal researchers (heavy users) and 10,000 anonymous public visitors (light users). FedRAMP Moderate. RAG-augmented; no FactReasoner verifier.

Federal-science-agency deployment specs

Concrete settings to plug into your own calculator:

Hosting: API (managed)
Cost mode: Realistic
Model: gpt-5.5
Tier: Standard
Cache hit rate: 0.91 measured
Verification coverage: 0%
GPU instance: n/a (API path)
FedRAMP tier: Moderate (×1.15)
Multi-region: Single region (×1.00)
Audience segment 1 (internal staff): 500 MAU, 5 sessions/day, 4 questions/session
Audience segment 2 (anonymous public): 10,000 MAU, 0.2 sessions/day, 1.5 questions/session, bot factor 1.5×
Mix: 38% full / 22% rag / 18% partial / 14% refusal / 8% heavy
Per-query input: 5,650 tokens (the worked-example Query A baseline)
Per-query output: 854 tokens (full shape)
RAG: 5 chunks × 512 tok/chunk + 128 query embed
Multi-agent: 3 agents (orchestrator + retriever + summariser)
Personnel: 0.5 MLOps + 0.25 prompt eng + 0.1 SRE + 0.2 security + 0.1 ATO assessor
ATO: FedRAMP Moderate ($250K + $50K/mo)
Audit retention: 5 KB/query × 84 months × Glacier
Egress: 0.001 GB/query × $0.09/GB
Daily cap: $1,500/day, 7 burst days × 3×

Module-by-module contributions

Module	Line	Monthly contribution
Volume (Part 4)	390K queries × 1.5× bot factor on public	405K queries
Per-query cost weighted (Parts 2-3)	$0.022 × 0.65 weighted	$0.014/query
Caching (Part 5)	0.91 hit rate × 0.10 discount	applied above
Multi-agent (Part 7)	3 agents × handoff growth	applied above
RAG (Part 9)	2,688 tok/turn at input rate	applied above
LLM total	405K × $0.014	$5,670
FedRAMP Moderate × 1.15	LLM tier multiplier	+$850 → $6,520
Reservations (Part 14)	none for this deployment	$0
Personnel (Part 15)	1.15 FTE × $25K/mo	$28,750
ATO (Part 16)	$250K/36mo + $50K/mo	$56,944
Audit logs (Part 16)	5 KB × 405K × 7 yrs × Glacier	$5 (cheap at Glacier)
Egress (Part 16)	405 GB × $0.09	$36
PII redaction (Part 16)	100M tok × $0.30/M	$30
Daily cap (Part 17)	bounded; refusal accounting separate	n/a
Bot rate limiting (Part 18)	Edge WAF basic	$20
Infrastructure (Part 19)	Fargate + Aurora + ALB + ElastiCache + S3 + CloudWatch + secrets + edge + ingest + NAT/Route53	$1,200
Total monthly		~$93,500
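As a cross-check, the table's total can be recomputed from its own rows. The sketch below takes the 405K volume, the $0.014 weighted per-query cost, and the small line items as given in the table, and only redoes the multiplications and the sum. The variable names are assumptions.

```typescript
// Inputs copied from the table above.
const queries = 405_000;
const perQuery = 0.014;              // weighted $/query after caching, multi-agent, RAG
const fedrampMultiplier = 1.15;      // FedRAMP Moderate

const llm = queries * perQuery;                  // about $5,670
const llmWithFedramp = llm * fedrampMultiplier;  // about $6,520
const personnel = 1.15 * 25_000;                 // 1.15 FTE at $25K/mo loaded
const ato = 250_000 / 36 + 50_000;               // upfront amortised over 36 months + monitoring
const misc = 5 + 36 + 30 + 20;                   // audit logs + egress + PII redaction + edge WAF
const infrastructure = 1_200;

const totalMonthly = llmWithFedramp + personnel + ato + misc + infrastructure;
const atoShare = ato / totalMonthly;             // fraction of the bill that is ATO
```

The sum lands within a few dollars of the ~$93,500 in the table, and the ATO line alone accounts for roughly 60% of it, which is worth knowing before spending effort on LLM-side optimisation for this deployment.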

Final number with sensitivity bands

Running the tornado (Part 23):

Cache hit rate ±10pp: $84K → $103K (±10%)
Personnel ±0.3 FTE: $86K → $101K (±8%)
ATO amortisation ±12 months: $89K → $98K (±5%)
LLM tier (Standard → Priority): $93K → $103K
LLM model (Sonnet → GPT-5.5 → GPT-5.5 Pro): $80K → $93K → $115K

3-year total: $93,500 × 36 = $3.37M, with a ±20% uncertainty band giving roughly $2.7M – $4.0M.

What the procurement defense looks like

Procurement reviewer: "Why $93K/month?"

Answer:

$5,700 LLM compute (×1.15 FedRAMP Moderate = $6,520). Anchored to measured per-query cost from the geospatial Q&A reference workload, weighted across the 5-shape mix.
$28,750 personnel at 1.15 loaded FTE. Defended by your loaded-rate table (Part 15).
$56,944 ATO ($250K upfront / 36mo amortisation + $50K/mo continuous monitoring). FedRAMP Moderate is the modeled tier.
$1,200 infrastructure floor (Fargate + Aurora + etc.). Itemised under "Other monthly infrastructure".
~$100 misc (audit, egress, PII redaction, edge WAF).

Reviewer: "What's the worst month?"

Answer: p90 monthly bill at $103K. The cap mechanism bounds the worst case; we estimate burst-day overruns at $4,500/day × 7 days = $31,500 incremental in a peak month, totaling $125K worst-case.

Reviewer: "Why not self-host?"

Answer: At equal $93,500/mo budget under realistic mode, self-host serves ~95% of API's queries (Part 24). The 5% gap is structural (single-tenant utilization, integer-step GPU counts). Self-host wins on data sovereignty if needed, but doesn't save money at this scale.

Reviewer: "What if the workload grows 5×?"

Answer (rerun your calculator at 5× volume): the new total is $215K/mo. Personnel and ATO are fixed; LLM and infrastructure scale. The new tornado shows cache hit rate becomes the dominant lever (since LLM is now a bigger share of the bill); investing in cache optimisation pays back.

That's the defense. Every assumption is a control; the reviewer can sub in their own numbers and rerun. This is the test for any calculator you build: can a procurement reviewer sit down with the deployment owner and walk through every line item, with a single screen between them? If yes, you've built a procurement instrument. If no, you've built a price quote.

Part 27—Limitations and what your calculator won't model

An honest list of what this tutorial (and the reference implementations it points at) don't capture. Build your own guardrails into your calculator's UI so users don't mistake omissions for zero costs.

Empirical anchoring is thin

The full-pipeline shape factor in this tutorial's defaults is anchored to one measured production session — Query A from the worked-example geospatial Q&A deployment. The other four shape factors (rag, partial, refusal, heavy) are team estimates, not yet validated against logged production traffic at scale.

Implication: shape factors might be wrong by ±50% for deployments that don't look like that one. Tell your users explicitly to validate against their own production traffic before submitting a procurement — a small disclaimer near your shape-factor controls is worth more than a 1000-word methodology PDF.

Throughput numbers are rough

The 1,200 / 900 / 4,500 tokens-per-second per-instance figures (g6e.12xl / g5.48xl / p5.48xl) are order-of-magnitude estimates from vendor and community benchmarks. Tell users to benchmark in-house before sizing.

The throughput de-rate (75% in realistic mode) is a coarse correction. Real fleets vary 60–85% of paper throughput depending on workload, model, and tuning.

Cost-mode parameters are educated guesses

The 33%/55% reserved-instance discounts, 0.3 FTE allocation, $1,800/mo observability, and $8,333/mo amortised setup are reasonable defaults. They are not derived from a methodologically rigorous study. Their primary purpose is to surface the optimistic-vs-realistic gap, not to be authoritative.

For a real procurement, the user should replace each with their own numbers (their contract's RI discount, their team's actual loaded cost, their AWS bill). Make all of them overridable.

No empirical validation of self-host

In most reference implementations, the API path is anchored to measured production data; the self-host path is prospective, not postmortem. There aren't many apples-to-apples federal self-hosted-LLM deployments to compare against.

The same-budget finding (Part 24) is a model output, not a validated measurement. If your deployment has a real self-hosted-LLM run with public numbers, contributing them back to a public bench is the highest-leverage thing you can do for the field.

Single provider per agent

Most calculators (including the reference) pick one model per agent. They don't model multi-provider failover (Anthropic primary with OpenAI fallback) or graduated fallback (flagship for hard queries, mini for cheap lookups within the same agent). If your deployment depends on multi-provider failover, you'll need to extend the per-agent model selection to a primary/fallback pair with a fallback trigger rate.
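One possible shape for that extension is a primary/fallback pair whose expected cost is a probability-weighted blend of the two paths. This is a sketch under stated assumptions: the type and function names are hypothetical, the primary's rates are placeholders, the fallback's rates reuse the illustrative 5.00/30.00 per-million figures from this tutorial's price-book example, and a failed primary attempt is assumed not to be billed.

```typescript
interface ModelRate {
  input_per_M: number;  // $ per million input tokens
  output_per_M: number; // $ per million output tokens
}

interface AgentModels {
  primary: ModelRate;
  fallback: ModelRate;
  fallback_trigger_rate: number; // share of calls routed to the fallback, 0..1
}

function callCost(rate: ModelRate, inputTokens: number, outputTokens: number): number {
  return (inputTokens * rate.input_per_M + outputTokens * rate.output_per_M) / 1_000_000;
}

// Expected cost per call: a probability-weighted blend of the two paths.
// Assumes a failed primary attempt is not billed; add a retry term if it is.
function expectedCallCost(m: AgentModels, inputTokens: number, outputTokens: number): number {
  const t = m.fallback_trigger_rate;
  if (t < 0 || t > 1) throw new RangeError("fallback_trigger_rate must be within [0, 1]");
  return (
    (1 - t) * callCost(m.primary, inputTokens, outputTokens) +
    t * callCost(m.fallback, inputTokens, outputTokens)
  );
}

const agent: AgentModels = {
  primary: { input_per_M: 3.0, output_per_M: 15.0 },  // placeholder rates
  fallback: { input_per_M: 5.0, output_per_M: 30.0 }, // illustrative rates
  fallback_trigger_rate: 0.05,
};

// Token counts from the worked example: 5,650 in, 854 out.
const blended = expectedCallCost(agent, 5_650, 854);
```

The trigger rate is itself a coefficient worth measuring. A hand-tuned 5% that turns out to be 20% in production shifts the blend noticeably when the fallback is the more expensive model.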

For most cases, multi-model routing (Part 21) handles the provider-mix question well enough at a higher level.

What your calculator probably can't model

To be honest about what's structurally hard:

Fine-tuning training cost amortisation — a separate calculator entirely; don't try to fold it in.
Infrastructure / hosting costs for self-hosted deployments beyond the modeled GPU + ops layer (network fabric, BMS, OCP frames if on-prem) — separate calculator.
Human-in-the-loop reviewer time costs — separate calculator; the labor model differs.
Volume discount tiers above 100M tok/mo — negotiate directly with vendors and override the price book.
Network egress / storage / vector DB operational costs beyond the optional file-search/container placeholders.
Compliance overhead (FedRAMP, HIPAA, SOC2 audit costs) — partial coverage; the user must check their specific deployment's audit requirements.
Latency SLA penalty costs — if the user's contract has SLA clauses with penalties, those are separate.

Build your Other monthly costs editor as the catch-all for the long tail of these.

Part 28—Reference implementations and where to go next

You now have the full menu of controls (Part 0), the math behind each one (Parts 1–27), and the validation loop that keeps the math honest (Part 25). The next module (Part 29) gives you a reference architecture you can clone if you'd rather start from a scaffold than a blank file.

A few concrete things to do before, or alongside, building.

Look at one or two existing implementations

Before you start coding, spend an hour with an existing calculator to see how the controls feel in practice. The goal isn't to copy the UI — it's to pressure-test your mental model of the controls against a working artefact.

calc.ajinkya.ai — the reference implementation that this tutorial co-evolved with. Open-source (repo); most of the patterns in Part 29 below are implemented there in TypeScript + React + Cloudflare Workers. Best for: federal-deployment shape, multi-agent topology, reservation modeling.
Vendor pricing pages — OpenAI, Anthropic, Google, Bedrock all publish per-token rates. They are not calculators, but they are the upstream truth your price-book has to track. Bookmark the URLs and put a refresh schedule on them (Part 29 covers this).
Vendor TCO calculators — most of these are sales tools that under-count personnel and over-count reservation savings. Useful as counter-examples for what not to build.

Run a benchmark against your workload before you ship

If you're going to defend a number that anyone else is going to spend money against, run a measurement loop (Part 25) against your real workload before you commit. The reference ai-cost-calculator-studio/bench/ is a good starting point — clone it, point it at your provider, and see whether your hand-tuned defaults survive contact with reality. Most of them won't, by 15–40%.

A full bench run costs $0.50–$5 per scenario in API spend. Cheap insurance against defending a stale coefficient.

Pin contract rates in your price book

Once you ship, the single most common operational task is keeping the price book in sync with the user's actual contracts. Make sure your price-book table is editable per cell, persists in the URL hash (so users can share specific scenarios), and supports a "screenshot for procurement" affordance — a clean export of "here are the rates we used, on this date, for this submission."

Contribute back

If you build something — a calculator, a measurement loop, a new scenario, a calibrated coefficient — and you can publish the numbers, do. The state of public AI cost data is bad, and a single shared coefficient measured at scale is worth months of hand-tuning. Issues at ai-cost-calculator-studio/issues, or wherever your fork lives.

Reach out

If you find a number that's wrong, tell me. If you build something on top of this, tell me too. If you run a calculator on your federal program's actual workload and the output surprises you, tell me — the surprises are the most useful signal for what the model gets wrong.

Email: hello@ajinkya.ai.

— Ajinkya

Part 29—A reference architecture you can clone

Everything above is implementation-agnostic. This module gives you a concrete scaffold — what files to create, what goes in each, how they wire together. Use it as a starting point, not a prescription. The reference at calc.ajinkya.ai implements this in TypeScript + React + Cloudflare Workers; the same shape works in Python + Streamlit, Go + HTMX, a Jupyter notebook, or a spreadsheet.

The five layers

A defensible cost calculator separates into five layers. Conflating them is the source of most maintenance pain.

┌──────────────────────────────────────────────────────────┐
│  5. UI — controls, report, charts, presets               │
├──────────────────────────────────────────────────────────┤
│  4. Cost engine — pure functions: state → cost           │
├──────────────────────────────────────────────────────────┤
│  3. Formula library — atomic math primitives             │
├──────────────────────────────────────────────────────────┤
│  2. Data model — typed state + defaults + persistence    │
├──────────────────────────────────────────────────────────┤
│  1. Pricing data + coefficients — JSON, refresh schedule │
└──────────────────────────────────────────────────────────┘

Layer 4 (the cost engine) should be a pure function: same input → same output, no I/O, no DOM, no network. This is the layer your bench (Part 25) measures against.

Layer 1 — Pricing data and coefficients

Two JSON files, versioned, with provenance:

data/
  prices.json          # per-model rates, PTU rates, GPU hourly
  coefficients.json    # cache hit rate, I/O ratio, handoff overhead, ...

A row in prices.json:

{
  "model_id": "gpt-5.5",
  "provider": "openai",
  "input_per_M": 5.00,
  "cached_input_per_M": 0.50,
  "output_per_M": 30.00,
  "tps_per_PTU": 25,
  "ptu_monthly": 1875,
  "ptu_annual_monthly": 1310,
  "as_of_date": "2026-05-01",
  "source": "openai.com/pricing scrape"
}

A row in coefficients.json:

{
  "name": "cache_hit_rate_steady_state",
  "value": 0.91,
  "scenario": "cached-pipeline.yml",
  "sample_size": 18,
  "provider": "openai",
  "model": "gpt-4o-mini",
  "measured_at": "2026-04-21T14:32Z",
  "trace_artifact": "bench/runs/2026-04-21_14-32_cached.json"
}

Treat these files as the source of truth. Refresh on a schedule:

Prices — daily, automated. A small worker that scrapes vendor pricing pages and PRs the diff against prices.json. Vendors don't notify you when they change rates; you have to poll.
Coefficients — when the bench runs (manually before major submissions; weekly in CI if you can afford it).
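The "PRs the diff" step for prices comes down to a field-by-field comparison between the committed rows and freshly scraped ones. A minimal sketch, assuming the rate fields shown in the prices.json row above; the scraping itself is out of scope here, and `diffPrices` and the change-record shape are hypothetical names.

```typescript
interface PriceRow {
  model_id: string;
  input_per_M: number;
  cached_input_per_M: number;
  output_per_M: number;
}

type RateField = "input_per_M" | "cached_input_per_M" | "output_per_M";

interface PriceChange {
  model_id: string;
  field: RateField;
  from: number;
  to: number;
}

const RATE_FIELDS: RateField[] = ["input_per_M", "cached_input_per_M", "output_per_M"];

function diffPrices(current: PriceRow[], scraped: PriceRow[]): PriceChange[] {
  const byId = new Map<string, PriceRow>();
  for (const row of current) byId.set(row.model_id, row);

  const changes: PriceChange[] = [];
  for (const next of scraped) {
    const prev = byId.get(next.model_id);
    if (!prev) continue; // new models need a human decision, not an automatic diff
    for (const field of RATE_FIELDS) {
      if (prev[field] !== next[field]) {
        changes.push({ model_id: next.model_id, field, from: prev[field], to: next[field] });
      }
    }
  }
  return changes;
}

// Made-up scrape result in which only the output rate moved.
const committed: PriceRow[] = [
  { model_id: "gpt-5.5", input_per_M: 5.0, cached_input_per_M: 0.5, output_per_M: 30.0 },
];
const fresh: PriceRow[] = [
  { model_id: "gpt-5.5", input_per_M: 5.0, cached_input_per_M: 0.5, output_per_M: 28.0 },
];
const priceChanges = diffPrices(committed, fresh);
```

An empty result means there is nothing to open a PR for, which should be the outcome on most days.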

Layer 2 — Data model

A single typed state object holds every input. One object, not 17 separate variables — this is what your URL hash and your localStorage and your "Save scenario" feature all serialise.

A sketch (TypeScript syntax for clarity; use whatever your stack speaks):

type CalculatorState = {
  // Group A — Workload shape (Parts 1–2)
  tokens: {
    input_median: number;
    input_p90: number;
    output_median: number;
    output_p90: number;
    sysprompt_overhead: number;
    tool_schema_overhead: number;
    history_depth_turns: number;
  };

  // Group B — Query mix (Part 3)
  shapes: { name: string; factor: number }[];
  mix:    { shape_name: string; weight: number }[];

  // Group C — Volume (Part 4)
  segments: AudienceSegment[];
  peak_avg_ratio: number;
  growth_per_month: number;

  // Group D — Cost-shape multipliers (Parts 5, 17, 18)
  cache_hit_rate: number;
  cache_write_share: number;
  batch_async_pct: number;
  daily_spend_cap: number;
  bot_rate_limit_strategy: BotStrategy;

  // Group E — Architecture (Parts 7–11)
  multi_agent: { topology: Topology; agents: Agent[] } | null;
  rag: RAGConfig | null;
  reasoning: { thinking_tokens: number; pct_of_turns: number };
  guardrails: GuardConfig;
  tools: ToolConfig;
  verification: VerificationConfig | null;

  // Group F — Hosting and economics (Parts 12–16)
  hosting_mode: HostingMode;
  reservation: ReservationConfig | null;
  personnel: { include: boolean; roles: Role[] };
  federal: FederalConfig | null;
  other_infra: LineItem[];

  // Group G — Output and validation (Parts 20–23)
  migration_timeline: Phase[] | null;
  routing: RoutingConfig | null;
  ci: { enabled: boolean; cv_per_task_type: Record<string, number> };
  price_overrides: Record<string, Partial<PriceRow>>;
};

Persistence: serialise the whole object to URL hash (for shareable links) and localStorage (for the user's last session). On load, rehydrate. Have a defaults() function that returns a CalculatorState with sensible defaults so "start fresh" is always one click away.
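The persistence round trip can be sketched with JSON plus URL encoding, merging whatever the link contains over `defaults()` so that old or partial links still load. `MiniState` is a hypothetical three-field slice standing in for the full CalculatorState, and its default values are illustrative. The sketch does not validate field types, which a production version should.

```typescript
// Hypothetical slice of the state; the same pattern applies to the full object.
interface MiniState {
  cache_hit_rate: number;
  daily_spend_cap: number;
  hosting_mode: "api" | "self_host";
}

function defaults(): MiniState {
  return { cache_hit_rate: 0.84, daily_spend_cap: 1500, hosting_mode: "api" };
}

// Serialise the whole object into a URL-hash-safe string.
function toHash(state: MiniState): string {
  return "#" + encodeURIComponent(JSON.stringify(state));
}

// Rehydrate from a hash; missing fields fall back to defaults.
function fromHash(hash: string): MiniState {
  const base = defaults();
  if (!hash || hash === "#") return base;
  try {
    const parsed = JSON.parse(decodeURIComponent(hash.replace(/^#/, ""))) as Partial<MiniState>;
    return { ...base, ...parsed };
  } catch {
    return base; // malformed link: fall back to a fresh start
  }
}

const shared = toHash({ cache_hit_rate: 0.91, daily_spend_cap: 1500, hosting_mode: "api" });
const restored = fromHash(shared);
const partial = fromHash("#" + encodeURIComponent('{"cache_hit_rate":0.6}'));
const broken = fromHash("#not-json");
```

Merging over defaults also gives a cheap migration path: when a new field is added to the state, links shared before the change still open and pick up the new field's default.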

Layer 3 — Formula library

Atomic, pure, testable functions. Each one corresponds to a concept from the tutorial. Keep them small enough that a unit test fits in 5 lines.

// Part 5
// Rates are in $ per million tokens (as in prices.json); the result is in dollars.
function inputCostWithCache(
  tokens: number, hit_rate: number, write_share: number,
  rate_in: number, rate_cached: number
): number {
  const hit  = tokens * hit_rate * rate_cached;
  const miss = tokens * (1 - hit_rate) * rate_in;
  // Cache writes are billed at a 25% premium over the uncached input rate.
  const write_premium = tokens * hit_rate * write_share * (rate_in * 0.25);
  return (hit + miss + write_premium) / 1_000_000;
}

// Part 7
function sequentialHandoffInput(
  stage_index: number, avg_upstream_output: number, sysprompt: number,
  user_query: number
): number {
  return sysprompt + user_query + (stage_index * avg_upstream_output);
}

// Part 22
function lognormalP99(mu: number, sigma: number): number {
  return Math.exp(mu + 2.326 * sigma);
}

Test each one against the tutorial's worked examples. If your inputCostWithCache(2190, 0.91, 0.10, 0.15, 0.075) doesn't match the $0.000354 figure from Part 5, your implementation is wrong — fix it before you wire it into the engine.
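To show how small such tests can be, here is a sketch with a throwaway `expectClose` helper in place of a test framework. It repeats two of the formula functions so the block runs on its own, and its inputs are hand-picked boundary cases chosen for this illustration, not the tutorial's worked examples.

```typescript
function sequentialHandoffInput(
  stage_index: number, avg_upstream_output: number, sysprompt: number,
  user_query: number
): number {
  return sysprompt + user_query + (stage_index * avg_upstream_output);
}

function lognormalP99(mu: number, sigma: number): number {
  return Math.exp(mu + 2.326 * sigma);
}

// Hypothetical helper; swap in your test framework's matcher.
function expectClose(actual: number, expected: number, tol = 1e-9): void {
  if (Math.abs(actual - expected) > tol) {
    throw new Error(`expected ${expected}, got ${actual}`);
  }
}

// Stage 0 sees only the system prompt and the user query.
expectClose(sequentialHandoffInput(0, 700, 1500, 100), 1600);

// Each later stage adds one upstream output's worth of context.
expectClose(sequentialHandoffInput(2, 700, 1500, 100), 3000);

// With sigma = 0 the distribution collapses to a point mass at exp(mu).
expectClose(lognormalP99(Math.log(1000), 0), 1000, 1e-6);

// A positive sigma must push p99 above the median exp(mu).
if (!(lognormalP99(Math.log(1000), 0.5) > 1000)) throw new Error("p99 should exceed the median");
```

Boundary cases like stage 0 and sigma = 0 are useful first tests because the expected value can be read off the formula without a calculator.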

Layer 4 — Cost engine

A single pure function that takes a CalculatorState and returns a CostReport. No side effects.

function computeMonthlyCost(state: CalculatorState): CostReport {
  const volume      = computeVolume(state.segments, state.peak_avg_ratio, state.growth_per_month);
  const per_query   = computePerQueryCost(state);  // composes formulas
  const llm_cost    = volume * per_query;
  const reservation = applyReservation(llm_cost, state.reservation);
  const federal_x   = applyFederalMultipliers(reservation, state.federal);
  const personnel   = computePersonnel(state.personnel);
  const ato         = computeATO(state.federal);
  const infra       = sumLineItems(state.other_infra);
  const total       = federal_x + personnel + ato + infra;

  return {
    total_monthly: total,
    breakdown: { llm: federal_x, personnel, ato, infra, /* ... */ },
    queries_served: volume,
    queries_refused: computeRefused(state, total),
    p50: applyCI(total, state.ci, 0.5),
    p90: applyCI(total, state.ci, 0.9),
    p99: applyCI(total, state.ci, 0.99),
    sensitivity: computeTornado(state),
  };
}

Because it's pure, you can:

Run it on every keystroke (the user sees live updates).
Snapshot it for the URL hash ((state, output) pair is reproducible).
Run it with perturbed inputs to compute the sensitivity tornado (Part 23) — just call it 50 times with each input ±50%.
Run it from your bench harness to validate measured coefficients against predicted ones (Part 25).

Layer 5 — UI

The UI is the largest layer by line count and the least load-bearing for correctness. Keep the engine pure; let the UI be opinionated.

Recommended grouping (matches Part 0):

┌─────────────────────────────────────────────────────────┐
│  Top bar: preset picker, "✓ MEASURED" badge, copy URL   │
├─────────────────────────────────────────────────────────┤
│  Sidebar: control sections (collapsible)                │
│   - Workload shape                                      │
│   - Query mix                                           │
│   - Volume                                              │
│   - Cost-shape multipliers (cache, daily cap, bots)     │
│   - Architecture (multi-agent, RAG, reasoning, ...)     │
│   - Hosting and economics                               │
│   - Federal compliance (collapses if "None")            │
│   - Output & validation (CI, sensitivity)               │
├─────────────────────────────────────────────────────────┤
│  Main: monthly total (huge number)                       │
│         queries served / refused                         │
│         breakdown table by line                          │
│         tornado chart                                    │
│         3-year migration chart (if enabled)              │
│         "How we got there" — every formula trace         │
└─────────────────────────────────────────────────────────┘

A few non-obvious UX rules from a year of iterating on the reference:

One huge number, top of the page. The user always wants to know the total. Don't bury it in a tab.
"How we got there" is a feature, not a debug view. Procurement reviewers click into it more than the sliders. Build it as a first-class artefact.
Show p50 / p90 / p99 next to the total. A single number invites "is that right?"; three numbers invite "what assumption drives the spread?"
Federal-compliance tier collapses when "None". Most commercial users don't need to scroll past 600 px of FedRAMP controls they'll never touch.
Save / Load / Reset are top-bar primitives. Burying them under a menu means users start from scratch every time.

Wiring the bench (Layer 0, sort of)

The bench (Part 25) is technically a separate program, but it belongs in the same repo. It depends on the formula library (Layer 3) and writes to the coefficients file (Layer 1):

your-calculator/
  src/
    data-model/    (Layer 2)
    formulas/      (Layer 3)
    engine/        (Layer 4)
    ui/            (Layer 5)
  data/
    prices.json
    coefficients.json
  bench/
    scenarios/         (yaml configs per scenario)
    runner.ts          (calls formulas + a real provider, diffs)
    reports/           (variance reports, traces)

Run the bench in CI on a weekly cron. Open a PR with the diff against coefficients.json. Review and merge — the reviewer is sanity-checking the measurement, not just the diff. Don't auto-merge.
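
The diff step that feeds that PR can be sketched roughly as follows. This is a minimal sketch, assuming coefficients.json is a flat name-to-number map; `diffCoefficients`, the `CoefficientDrift` shape, and the 10% review threshold are all hypothetical names and values, not part of the reference.

```typescript
// Hypothetical shape: coefficients.json as a flat name -> number map.
type Coefficients = Record<string, number>;

interface CoefficientDrift {
  name: string;
  current: number;
  measured: number;
  relativeChange: number; // (measured - current) / current
  flagged: boolean;       // true when |relativeChange| exceeds the threshold
}

// Compare freshly measured values against the checked-in defaults.
// Only coefficients present in both maps are compared.
function diffCoefficients(
  current: Coefficients,
  measured: Coefficients,
  threshold = 0.1, // assumed review threshold, not a value from this tutorial
): CoefficientDrift[] {
  return Object.keys(measured)
    .filter((name) => name in current)
    .map((name) => {
      const relativeChange =
        current[name] === 0
          ? (measured[name] === 0 ? 0 : Infinity)
          : (measured[name] - current[name]) / current[name];
      return {
        name,
        current: current[name],
        measured: measured[name],
        relativeChange,
        flagged: Math.abs(relativeChange) > threshold,
      };
    });
}

// Example: the cache hit rate moved a little, the retry rate moved a lot.
const drift = diffCoefficients(
  { cache_hit_rate: 0.85, retry_rate: 0.03 },
  { cache_hit_rate: 0.91, retry_rate: 0.08 },
);
```

Flagged entries are the ones the human reviewer should look at first; unflagged entries still appear in the report so the reviewer sees the whole measurement.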

A 2-week build plan

If you're starting from scratch, a defensible v1 takes about two weeks of focused engineering work:

Days 1–2 — scaffold the data model (Layer 2). Stub defaults for every field in Part 0. No UI yet.
Days 3–5 — implement the formula library (Layer 3) with unit tests against this tutorial's worked examples.
Days 6–7 — wire the cost engine (Layer 4). Get computeMonthlyCost(defaults()) returning a number that matches Part 26's $93,500.
Days 8–10 — minimal UI (Layer 5). Sliders for the top-10 most-impactful controls; everything else as numeric inputs in collapsed expanders. URL-hash persistence.
Day 11 — price-book + coefficients files (Layer 1). Manual values; no scraper yet.
Day 12 — output report: total, breakdown, sensitivity tornado.
Days 13–14 — first bench scenario (Part 25). Aim for the cache-hit-rate scenario; it's the most leveraged coefficient.
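
The URL-hash persistence from Days 8–10 is small enough to sketch here. This is one possible approach, assuming the state is plain JSON; `CalculatorState` below is a hypothetical three-field subset, and `encodeState` / `decodeState` are illustrative names.

```typescript
// Hypothetical subset of the calculator state; a real model has one field per Part 0 control.
interface CalculatorState {
  mau: number;
  cacheHitRate: number;
  fedrampTier: string;
}

// Serialise state into a URL-safe fragment (the part after "#").
function encodeState(state: CalculatorState): string {
  return encodeURIComponent(JSON.stringify(state));
}

// Restore state from a fragment. Missing fields fall back to defaults;
// a malformed fragment returns the defaults unchanged.
function decodeState(hash: string, defaults: CalculatorState): CalculatorState {
  try {
    const parsed = JSON.parse(decodeURIComponent(hash.replace(/^#/, "")));
    return { ...defaults, ...parsed };
  } catch {
    return defaults;
  }
}
```

Merging over defaults means an old shared link keeps working after you add new controls: fields the link doesn't know about simply take their default value.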

A few weeks after v1, add: presets, multi-agent UI, migration timeline, "How we got there" trace, more sensitivity analyses, more bench scenarios. The prioritisation order tracks the tornado: build the things that move the bill most, first.

What you're optimising for

Not "lowest possible monthly bill." Not "most realistic estimate." Both of those make a calculator that nobody uses.

You're optimising for a number a procurement reviewer will accept. That means: the assumptions are visible, each line item is editable, the sensitivity is shown, and the provenance of measured coefficients is one click away. The reviewer's job is to ask "what if you're wrong?" — your calculator's job is to answer that, fast, with a different number that's also defensible.

Build the calculator that supports that conversation, and the math takes care of itself.

Part 30—Glossary

A working reference for every term used in this tutorial and every slider, dropdown, and concept in a typical AI-cost calculator. Read top-to-bottom for an overview, or skim for the term you opened this tutorial to look up. Entries are alphabetical within each category.

Tokens, throughput, and rate

Token — the unit a model processes. About 0.75 English words on average. Pricing is quoted per million tokens, with separate rates for input, cached input, and output.

Input tokens / output tokens — input is what you send (prompt + history); output is what the model generates. Output costs ~4–6× more per token than input.

Cached tokens — input tokens served from the provider's prompt cache. ~10% of normal input rate. Driven by repeated system prompts and stable prefixes.
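
The three token entries above combine into one per-call formula. Here is a minimal sketch; the `Rates` values are placeholders chosen to mirror the ratios in these entries (cached at 10% of input, output at 4× input), not any provider's real price list.

```typescript
// Per-million-token rates in USD. The values used below are placeholders.
interface Rates {
  inputPerM: number;
  cachedPerM: number;
  outputPerM: number;
}

interface CallTokens {
  input: number;  // total input tokens sent (prompt + history)
  cached: number; // portion of `input` served from the prompt cache
  output: number; // tokens generated
}

// Cost of one API call: uncached input, cached input, and output each at their own rate.
function callCostUsd(t: CallTokens, r: Rates): number {
  const uncached = t.input - t.cached;
  return (uncached * r.inputPerM + t.cached * r.cachedPerM + t.output * r.outputPerM) / 1e6;
}

// 2,000 uncached + 4,000 cached input tokens, 500 output tokens.
const exampleRates: Rates = { inputPerM: 1.0, cachedPerM: 0.1, outputPerM: 4.0 };
const exampleCost = callCostUsd({ input: 6000, cached: 4000, output: 500 }, exampleRates);
```

With these placeholder rates the 500 output tokens cost as much as the 2,000 uncached input tokens, which is why output length and cache hit rate both deserve their own controls.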

Context window — the max input the model can read in one call. GPT-4o = 128K, GPT-5 = 256K, Claude Opus = 200K. Bigger isn't free — long inputs are more expensive and slower.

Input/output ratio — tokens input : tokens output. Chat defaults around 6:1, but tool-orchestration agents with output-suppression rules measure as high as 73:1 (gpt-4o-mini) or 88:1 (gpt-5.2). The 6:1 default badly understates input cost for agentic systems.

Language multiplier — tokens-per-word varies by language. English = 1.0×, code = 1.3×, European languages = 1.2–1.5×, Japanese/Chinese/Korean = 1.8–2.2×. OpenAI's tokenizer is biased toward English.

TPM — tokens per minute. Usage rate (input + output combined). Used in API rate-limit caps.

TPS — tokens per second. Throughput / capacity. Used in provisioning discussions (PTU sizing, GPU instance specs).

RPM — requests per minute. Separate from TPM (a single request can use many tokens). Used in rate-limit caps.

TTFT — time to first token. Latency until the user sees any output. The "how snappy is it?" metric.

ITL / TPOT — inter-token latency / time per output token. How fast tokens stream after the first one. Streaming UX depends on this; batch jobs don't care.

Capacity, batching, and serving

Continuous batching — modern inference servers (vLLM, TGI, TensorRT-LLM) pack many in-flight requests onto the same GPU forward pass. Larger batch = higher GPU throughput. Single-tenant deployments rarely fill the batch; commercial providers do.

KV cache — memory holding intermediate state during generation. Big for long contexts. Constrains how many concurrent requests fit on a GPU.

Speculative decoding — generating 2–3 tokens per pass instead of one, with a small "draft model" guessing ahead. Speeds up inference 1.5–2× without changing output. Commercial providers use it; many self-hosted stacks don't.

Prompt caching / prefix caching — reusing computed state for identical leading text. Repeating system prompts hit cache; user text doesn't. Hit rates 75–90% in practice.

Tensor parallelism — splitting a single big model's weights across multiple GPUs (each holds 1/N of every layer).

Pipeline parallelism — splitting model layers across GPUs (GPU 1 has layers 1–20, GPU 2 has layers 21–40).

Throughput vs latency — total tokens/sec a fleet serves vs. time per request. They trade off — bigger batches improve throughput, hurt latency.

Cold start — time before the first request can be served when scaling up from zero. Self-hosting + scale-to-zero pays this every time traffic appears.

Quantization and precision

FP32 / FP16 / BF16 — floating-point formats. Training weights are usually FP16/BF16. FP32 is rarely used for inference.

FP8 — newer 8-bit float, 2× the throughput of FP16, minor quality loss on most models.

INT8 / INT4 — integer quantization, roughly 2× and 4× the throughput of FP16 respectively. Slight quality loss, sometimes noticeable on long-form generation. Common for self-hosted 70B+ models.

Quantization-aware training vs post-training quantization — the first builds in quantization during training (better quality); the second converts a trained model after the fact (more common, occasional regressions).

Caching and cost-shape multipliers

Cache hit rate — fraction of input tokens served from prompt cache at ~10% of list price. Single most consequential coefficient in the model.

Provider/scenario	Hit rate
OpenAI gpt-4o-mini, multi-turn chat	0.91
Anthropic claude-sonnet-4.6, explicit `cache_control`	0.77
Parallel fan-out topology	0.60

Cold-start dynamics matter: turn 0 of any session is only partly cached. Even with a warm cache its hit rate is about 0.61, because only the sysprompt is shareable until conversation history accumulates.

Cache write share — of cached tokens, what fraction was written this turn (premium-priced) vs read from a previous turn (discounted). Most providers charge ~25% premium on writes; reads get the 90% discount.
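
Hit rate and write share together give an effective price per input token. The sketch below assumes the figures from these entries (writes at 1.25× list, reads at 0.10× list) and reads "write share" as a fraction of cached tokens; `effectiveInputMultiplier` is an illustrative name.

```typescript
// Effective input price as a multiple of the list input price.
//   hitRate:    fraction of input tokens that touch the cache (read or write)
//   writeShare: fraction of those cached tokens written this turn
function effectiveInputMultiplier(
  hitRate: number,
  writeShare: number,
  writePremium = 1.25, // assumed ~25% premium on cache writes
  readRate = 0.1,      // assumed 90% discount on cache reads
): number {
  const uncached = 1 - hitRate;
  const cachedBlend = writeShare * writePremium + (1 - writeShare) * readRate;
  return uncached + hitRate * cachedBlend;
}

// Warm multi-turn chat: 0.91 hit rate, 10% of cached tokens are fresh writes.
const warmMultiplier = effectiveInputMultiplier(0.91, 0.1); // about 0.286 of list price
```

Under these assumptions a high write share erodes most of the benefit, which is why cold-start-heavy traffic (many short sessions) prices worse than the headline hit rate suggests.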

Retry rate / retry penalty — fraction of API calls that fail and retry (default 3%). Each retry costs ~1.5× a successful call. Higher for agents with strict JSON-schema enforcement on non-native-function-calling models. Measured as high as 8% in tool-chain scenarios.
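
As a formula, the retry adjustment is a single multiplier on base cost. A minimal sketch, using the 1.5× per-retry figure from this entry; the $10,000 base is a placeholder.

```typescript
// Expected cost multiplier from retries: each retried call adds ~1.5x a successful call.
function retryMultiplier(retryRate: number, retryCostFactor = 1.5): number {
  return 1 + retryRate * retryCostFactor;
}

const baseMonthlyUsd = 10000; // placeholder base LLM spend
const withDefaultRetries = baseMonthlyUsd * retryMultiplier(0.03); // default 3% retry rate
const withToolChainRetries = baseMonthlyUsd * retryMultiplier(0.08); // measured 8% in tool chains
```

At the default rate the adjustment is about 4.5% of base cost; at the measured tool-chain rate it is 12%.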

Handoff overhead — multi-agent context-forwarding cost. Older defaults assumed flat 200 tok/stage; measured ~700 tok/stage on the multi-stage-research scenario. Linear in upstream stage count, not constant.

Reservations and provisioning

PTU (Provisioned Throughput Unit) — Azure's reserved capacity unit. ~50 TPS per PTU for gpt-4o-class. Pay flat, use as much as you want within the unit. Effectively 30–50% discount if you fully utilize. Below 50% utilization, PTU costs more than pay-as-you-go.

Reserved Instance (RI) — AWS commitment discount on EC2/RDS/etc. 1-year or 3-year terms; 30–60% discount vs on-demand. Lose money if your workload changes.

Provisioned throughput — generic term for "reserved capacity for inference". Bedrock uses it; OpenAI Priority is a softer version.

On-demand — pay-per-use, no commitment. Highest unit cost but zero risk. The right starting posture for unknown workloads.

Tier — quality-of-service level. Standard / Flex / Batch / Priority. Standard is default. Flex/Batch are slower (minutes-hours to respond) and ~50% cheaper. Priority is faster with SLA, ~2.5× Standard.

Committed-spend discount — flat % off list when you promise N $/month for 12+ months. OpenAI Enterprise: 10% at $100K, 20% at $1M.

Hosting modes

API (managed) — call a provider's hosted endpoint. They run the GPUs; you pay per token.

Self-host — run the model on GPUs you control (EC2 / on-prem / bare metal). Capex + opex vs API's pure opex. Break-even depends on utilization; typically wins above ~40% sustained utilization.

BYOK (Bring Your Own Key) — calculator-style tools where the end user provides their own API key; the operator pays nothing.

Hybrid — split traffic between API and self-host (e.g., 70% API, 30% self-host). Common in migration scenarios or for risk diversification.

Edge inference — running models on the user's device or a CDN edge node. Free at runtime; quality limited by the small models that fit. Not yet competitive with cloud inference for most agents.

Cost modes (optimistic / realistic) — for self-host pricing. Two preset bundles of operating assumptions:

Field	Optimistic	Realistic
ops_monthly	$350	$1,800
fte_monthly	$2,500	$8,000
setup_amortized	$0	$8,333
throughput_derate	1.0	0.75
1-year discount	40%	33%
3-year discount	60%	55%

Optimistic is what GPU vendors quote; realistic is what production teams report. The difference is roughly 4× on total self-host cost.

Self-host GPU options — three preset instance shapes:

Instance	Hourly	Throughput	Capable of
g6e.12xl (4× L40S 48GB)	$10.49	1,200 t/s	70B int8
g5.48xl (8× A10G 24GB)	$16.29	900 t/s	70B int4
p5.48xl (8× H100 80GB)	$98.32	4,500 t/s	400B fp8

Apply throughput_derate (default 0.75 in realistic mode) for queueing, idle time, downtime. Paper TFLOPS overstate real throughput by 25–35%.
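
To turn an instance row into a number comparable with API pricing, convert it to dollars per million tokens. This is a rough sketch under stated assumptions: a 730-hour month, on-demand hourly rates from the table above, and a `utilization` parameter that I've added for illustration.

```typescript
interface GpuInstance {
  name: string;
  hourlyUsd: number;
  tokensPerSec: number; // paper throughput, before derate
}

const HOURS_PER_MONTH = 730; // assumed average month length

// On-demand compute cost for one always-on instance.
function computeMonthlyUsd(g: GpuInstance): number {
  return g.hourlyUsd * HOURS_PER_MONTH;
}

// Dollars per million tokens actually served.
// derate covers queueing, idle time, downtime; utilization is the share of capacity with real traffic.
function usdPerMillionTokens(g: GpuInstance, derate: number, utilization = 1): number {
  const tokensPerHour = g.tokensPerSec * derate * utilization * 3600;
  return g.hourlyUsd / (tokensPerHour / 1e6);
}

const g6e: GpuInstance = { name: "g6e.12xl", hourlyUsd: 10.49, tokensPerSec: 1200 };
const fullyUsed = usdPerMillionTokens(g6e, 0.75);       // roughly $3.24 per million tokens
const fortyPercent = usdPerMillionTokens(g6e, 0.75, 0.4); // roughly $8.09 per million tokens
```

The second figure is the one to compare against an API rate card: an idle GPU still bills by the hour, so unit cost scales inversely with utilization.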

Training, fine-tuning, RAG

Pretraining — training a model from scratch on a large corpus. $1M+ for a small model, $10M–$1B for frontier models. Almost never the right answer for a deployment-tier organisation.

Continued pretraining — extending a base model's training on domain-specific data. $100K–$1M. Sometimes worth it for classified or highly specialized work.

Fine-tuning (SFT / DPO) — adjusting a base model's behaviour on task-specific examples. $5K–$50K per version. Worth it for narrow high-volume tasks where the base model is consistently wrong.

LoRA / PEFT — Low-Rank Adaptation / Parameter-Efficient Fine-Tuning. Updates only ~1% of weights. Cheap ($1K–$5K), composable, can swap adaptations at inference time.

RAG (Retrieval-Augmented Generation) — instead of fine-tuning, retrieve relevant documents at query time and stuff them into the prompt. The default first stop for most domain-specialized agents. Per-turn token cost with defaults: 5 chunks × 512 tokens + 128 query embed = 2,688 tokens.

Embedding — converting text into a vector for similarity search. Done at corpus-ingest time AND per query (RAG retrieval).

Embedding corpus — for RAG agents: size of document corpus embedded into a vector store + how often re-embedded. A 50M-token corpus re-embedded every 6 months at OpenAI's text-embedding-3-small costs ~$0.17/month amortised.
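
Both RAG numbers above are one-line formulas. A minimal sketch; the function names are illustrative, and the $0.02 per million tokens embedding rate is an assumption chosen because it reproduces the ~$0.17/month figure.

```typescript
// Per-turn RAG token overhead: retrieved chunks plus the query embedding.
function ragTokensPerTurn(chunks = 5, tokensPerChunk = 512, queryEmbedTokens = 128): number {
  return chunks * tokensPerChunk + queryEmbedTokens;
}

// Corpus embedding cost, spread evenly over the months between re-embeds.
function embeddingCorpusMonthlyUsd(
  corpusTokens: number,
  usdPerMillionTokens: number,
  reembedEveryMonths: number,
): number {
  return ((corpusTokens / 1e6) * usdPerMillionTokens) / reembedEveryMonths;
}

const perTurn = ragTokensPerTurn(); // 2688 tokens with the defaults
const corpusMonthly = embeddingCorpusMonthlyUsd(50e6, 0.02, 6); // about $0.17/month
```

The asymmetry is the point: corpus embedding is a rounding error, while the per-turn chunk tokens are paid as LLM input on every retrieval turn.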

Vector DB — database optimised for similarity search over embeddings. Pinecone, Weaviate, pgvector, Qdrant.

Verification and quality

NLI (Natural Language Inference) — classifying whether one text entails, contradicts, or is neutral with respect to another. The atomic operation in factuality verification.

Atomization / atom decomposition — breaking a generated response into independent factual claims that can be checked.

FactReasoner — a probabilistic factuality verification pipeline (atomize → retrieve → NLI → reason). Variants trade recall for cost:

Variant	NLI calls per atom	Use case
FR1	24	Lightweight
FR2	160	Balanced (default)
FR3	350	Comprehensive

At FR2 with 10% coverage and 8 atoms/response: ~12K tokens of verification overhead per verified query.

Verification coverage — fraction of production responses that go through the verifier. 100% is expensive; 10% sampling is the common compromise.

Hallucination — model confidently outputs something false. The thing the verifier is trying to catch.

Guardrails (input / output / PII / content policy) — pre/post-LLM moderation passes. Each ~100–800 tokens; stacks. Mandatory for federal; commercial often skips output scan.

Workload shape and volume

Agents (count of) — number of distinct LLM-calling roles in your workflow. Single chat agent: 1. Research pipeline (analyst → feasibility → spec → impl → review): 5.

AOI (area of interest) — bounding box for geospatial agents. Production implementations replace full GeoJSON geometry (~10K tokens) with a 5-vertex bbox polygon (~100 tokens), a factor-of-100 saving via tool-state bypass.

Audience segments — user cohorts with distinct usage patterns. Cost multiplies per segment, not by aggregate. Aggregating "10K total users" hides a 10× cost swing between a chatty 1K and a passive 9K.

MAU (monthly active users) — per-segment user count. Multiplied by sessions_per_day × 30 × questions_per_session for monthly query volume.

Sessions per day — per-segment activity rate. Federal scientists: ~5 sessions/day. Public anonymous: ~0.2/day.

Questions per session — distinct queries per session. Chatbots: 5–10/session. Research workflows: 1–3.

Bot factor — anonymous public endpoints get extra traffic from crawlers. Typical multiplier 1.5×.

Peak/avg ratio — burstiness factor: peak QPS / average QPS. Default 1×. Above 2×, add ~5%/level surcharge for headroom.

Burst days / burst factor — within the daily-cap mechanism: days/month allowed to exceed the cap (burst_days, typically 7) at a multiplier (burst_factor).

Growth per month — compound monthly traffic growth. Project 12-month volume as current × (1 + growth)^12. Default 20%.
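
The volume entries above chain into a single per-segment formula. Here is a minimal sketch; the `Segment` shape and the two example segments are hypothetical, with sessions-per-day and bot-factor values borrowed from these entries.

```typescript
interface Segment {
  name: string;
  mau: number;
  sessionsPerDay: number;
  questionsPerSession: number;
  botFactor?: number; // crawler multiplier for anonymous public endpoints
}

// MAU x sessions/day x 30 x questions/session, scaled by bot traffic if any.
function monthlyQueries(s: Segment): number {
  return s.mau * s.sessionsPerDay * 30 * s.questionsPerSession * (s.botFactor ?? 1);
}

// Compound monthly growth: current x (1 + growth)^months.
function projectVolume(current: number, growthPerMonth: number, months = 12): number {
  return current * Math.pow(1 + growthPerMonth, months);
}

const segments: Segment[] = [
  { name: "federal scientists", mau: 1000, sessionsPerDay: 5, questionsPerSession: 2 },
  { name: "public anonymous", mau: 9000, sessionsPerDay: 0.2, questionsPerSession: 5, botFactor: 1.5 },
];
const totalNow = segments.reduce((sum, s) => sum + monthlyQueries(s), 0); // about 705,000
const totalInAYear = projectVolume(totalNow, 0.2); // about 8.9x today's volume
```

Summing per segment rather than averaging first is what keeps the chatty minority visible in the total.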

Query mix — distribution of query "shapes" — full-pipeline, RAG-only, refusal, etc. Each has different token cost; weighted-average the result.

Concurrent users — peak simultaneous active users. Used for internal simulation; billing volume comes from MAU × sessions/day.

Multi-agent topology

Comm pattern (orchestrator / peer / supervisor) — multi-agent topology. Orchestrator: dispatches to specialists, synthesizes. Peer: agents talk directly. Supervisor: meta-agent reviews specialists' work; adds an extra LLM call per turn.

Fleet vs Workflow mode — Fleet: agents in parallel as peers. Workflow: agents sequentially as a pipeline. Workflow mode adds extra cost components (sequential chain overhead, doc ingestion, partial-rerun, fact-check sidecar, template amortisation, HITL pause storage).

Sysprompt (system prompt) — role/instructions/policy text injected at the start of every turn. Production agents have sysprompts of 500–2,000 tokens.

Tools / tool calls / tool schemas — three sub-costs of agent tool use: schema tokens (~150 each, every turn), call arguments (~50–200 per call), result tokens (varies wildly: 100 for a bbox, 5,000 for a literature search, 15,000 for a long PDF). Each tool round-trip also costs an extra LLM call.
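
The three tool sub-costs can be tallied per turn like this. A minimal sketch using the approximate figures from this entry; `ToolCall` and `toolTokensPerTurn` are illustrative names, and the example turn is made up.

```typescript
interface ToolCall {
  argTokens: number;    // tokens in the call arguments
  resultTokens: number; // tokens in the tool result fed back to the model
}

// Schema tokens are paid every turn for every registered tool, used or not.
// Argument and result tokens are paid only for the calls actually made.
function toolTokensPerTurn(schemaCount: number, calls: ToolCall[], schemaTokensEach = 150) {
  const schemaTokens = schemaCount * schemaTokensEach;
  const callTokens = calls.reduce((sum, c) => sum + c.argTokens + c.resultTokens, 0);
  return {
    schemaTokens,
    callTokens,
    extraLlmCalls: calls.length, // one extra LLM round-trip per tool call
    total: schemaTokens + callTokens,
  };
}

// 8 registered tools; this turn makes a bbox lookup and a literature search.
const turn = toolTokensPerTurn(8, [
  { argTokens: 50, resultTokens: 100 },
  { argTokens: 200, resultTokens: 5000 },
]);
// turn.total is 6550 tokens, and the turn needs 2 extra LLM calls
```

Keeping `extraLlmCalls` separate from the token total matters because each extra call re-sends the sysprompt and history, which the engine prices elsewhere.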

Multimodal inputs — images, audio, PDFs, code interpreter output:

Modality	Token cost
Image (1568×1568 px)	~1,568 tok
Audio (1 minute)	~1,500 tok
PDF page	~1,500 tok
Code interp output	50–2,000 tok

A user uploading a 10-page PDF adds 15,000 tokens to every turn that PDF is in context.

Extended thinking / reasoning tokens — provider-specific thinking-token budget separate from visible output. Reasoning-heavy tasks: 5–10K tokens per query.

Multi-model routing — cheap model for easy queries, expensive for hard. 70/30 routing on a workload where 70% of queries are factual lookups can cut total LLM cost by 50–60% with negligible quality impact.
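
The routing saving is a weighted average. A minimal sketch, assuming the cheap model costs 15% of the expensive one per query (a placeholder ratio) and ignoring the cost of the router itself.

```typescript
// Blended cost per query, as a fraction of sending everything to the expensive model.
//   cheapShare:        fraction of queries routed to the cheap model
//   cheapRelativeCost: cheap model's per-query cost divided by the expensive model's
function routedCostRatio(cheapShare: number, cheapRelativeCost: number): number {
  return cheapShare * cheapRelativeCost + (1 - cheapShare);
}

const ratio = routedCostRatio(0.7, 0.15); // about 0.405 of the all-expensive cost
const saving = 1 - ratio;                 // about 0.595, inside the 50-60% range
```

The saving is capped by the share you keep on the expensive model: at 70/30 routing it can never exceed 70%, however cheap the small model gets.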

Operational and safety controls

Daily spending cap — hard ceiling on daily LLM spend. When daily cost crosses the limit, new requests get refused (HTTP 429). Procurement safety brake, not a token-math input. The cap doesn't make queries free; it makes some of them refused — track both the dollar number and the refused-query count.
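
Tracking both numbers can be as simple as the sketch below. It assumes a uniform per-query cost within the day, which is a simplification; `applyDailyCap` is an illustrative name.

```typescript
interface CapResult {
  served: number;
  refused: number;
  spendUsd: number;
}

// Serve queries until the next one would push daily spend past the cap; refuse the rest.
function applyDailyCap(dailyQueries: number, costPerQueryUsd: number, dailyCapUsd: number): CapResult {
  const maxServed = Math.floor(dailyCapUsd / costPerQueryUsd);
  const served = Math.min(dailyQueries, maxServed);
  return { served, refused: dailyQueries - served, spendUsd: served * costPerQueryUsd };
}

// 3,000 queries at $0.25 each against a $500 cap.
const capped = applyDailyCap(3000, 0.25, 500);
// capped.served is 2000, capped.refused is 1000, capped.spendUsd is 500
```

Reporting `refused` alongside `spendUsd` is what stops a tight cap from looking like a cost saving.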

Rate cards — published per-million-token rates per model. Override any rate in your calculator's price-book table.

Reservations — volume commitments for discounts. 1-yr API reservation: ~33–40% off. 3-yr: ~55–60% off. Distinct from PTU (capacity reservation, different mechanism).

Batch async tier — providers offer batch tier at ~50% of standard pricing; latency is hours instead of seconds. Useful for offline workloads.

Compliance and federal

FedRAMP — federal cloud-security authorization framework. Tier multipliers on LLM cost:

Tier	Multiplier
None	1.00×
Low	1.10×
Moderate	1.15×
High	1.30×
IL4 (CUI)	1.50×
IL5 (NSS)	1.80×

IL4 / IL5 — DoD impact levels for Controlled Unclassified Information (IL4) and National Security Systems (IL5). Both carry higher multipliers than FedRAMP High (1.50× and 1.80× in the table above).

ATO (Authority to Operate) — formal authorization for a system to handle data at a given classification. $250K–$600K upfront + ongoing continuous monitoring. Amortised over 36 months typically.

GovCloud — AWS / Azure regions physically isolated for federal workloads. ~30% pricing premium over commercial regions.

Audit log retention — federal deployments retain LLM logs for 5–7 years. audit_log_kb_per_query (typically 5 KB) × queries × retention months × storage rate. At Glacier-class pricing ($0.004/GB-month), $50–500/month for typical federal workloads.

Egress — outbound bandwidth. egress_gb_per_query (typically 0.001 GB) × queries × $0.09/GB.

Multi-region — Single (1.0×), active-passive (1.4–1.5×), active-active (1.8–2.0×). This multiplier compounds with the FedRAMP tier multiplier: the two are multiplied together, not added.
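
The compliance tier and region multipliers can be combined in a small lookup. A minimal sketch; the FedRAMP values come from the tier table above, while the region values are the lower bound of each range in the Multi-region entry (an assumption; pick your own point in the range).

```typescript
const FEDRAMP_MULTIPLIER = {
  None: 1.0,
  Low: 1.1,
  Moderate: 1.15,
  High: 1.3,
  IL4: 1.5,
  IL5: 1.8,
} as const;
type FedrampTier = keyof typeof FEDRAMP_MULTIPLIER;

// Lower bound of each published range; an assumption, not a measured value.
const REGION_MULTIPLIER = {
  single: 1.0,
  "active-passive": 1.4,
  "active-active": 1.8,
} as const;
type RegionMode = keyof typeof REGION_MULTIPLIER;

// The two multipliers are multiplied together, not added.
function complianceMultiplier(tier: FedrampTier, region: RegionMode): number {
  return FEDRAMP_MULTIPLIER[tier] * REGION_MULTIPLIER[region];
}

const moderatePassive = complianceMultiplier("Moderate", "active-passive"); // about 1.61
const il5Active = complianceMultiplier("IL5", "active-active");             // about 3.24
```

Typing the tier as a key of the table means adding a new tier is a one-line change and a typo in a preset fails at compile time.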

Migration phases — Year 1 / Year 2 / Year 3 deployment plan. Each phase has its own hosting (API → committed-spend → self-host) and reservation type. Weighted-average across phases, amortising one-time setup over upfront_amortization_months.

Personnel

Personnel — per-FTE monthly loaded cost (roughly $16K–$21K/mo depending on role, US, 1.30× base; see the table below). Lean commercial deployment: 0.5 MLOps + 0.25 prompt engineer + 0.1 SRE oncall ≈ 0.85 FTE ≈ $16K/month. Federal adds 0.2 security reviewer + 0.1 ATO assessor → ≈ 1.15 FTE ≈ $21K/month all-in.

Role	Loaded annual cost (US, 1.30× base)
MLOps engineer	$234K
Prompt engineer	$195K
Eval engineer	$221K
Security reviewer	$228K
ATO assessor (federal ISSO)	$215K
SRE on-call	$254K

Personnel cost dominates compute cost below about $50K/mo spend.
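
The FTE arithmetic can be checked with a few lines. A minimal sketch using the loaded annual costs from the table above; the role keys and `monthlyPersonnelUsd` are illustrative names.

```typescript
type Role = "mlops" | "prompt" | "eval" | "security" | "ato" | "sre";

// Loaded annual cost per role (US, 1.30x base), from the table above.
const LOADED_ANNUAL_USD: Record<Role, number> = {
  mlops: 234000,
  prompt: 195000,
  eval: 221000,
  security: 228000,
  ato: 215000,
  sre: 254000,
};

// Monthly cost of a staffing mix expressed as FTE fractions per role.
function monthlyPersonnelUsd(mix: Partial<Record<Role, number>>): number {
  return (Object.keys(mix) as Role[]).reduce(
    (sum, role) => sum + ((mix[role] ?? 0) * LOADED_ANNUAL_USD[role]) / 12,
    0,
  );
}

const lean = monthlyPersonnelUsd({ mlops: 0.5, prompt: 0.25, sre: 0.1 }); // about $15.9K/month
const federal = monthlyPersonnelUsd({
  mlops: 0.5, prompt: 0.25, sre: 0.1, security: 0.2, ato: 0.1,
}); // about $21.5K/month
```

Expressing staffing as fractions per role, rather than one blended FTE rate, lets the reviewer change a single allocation and see the bill move.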

Cost-decision shorthand

Per-token cost — useful for back-of-envelope, useless for procurement. The headline number that hides the real complexity.

Per-query cost — what you actually budget against. Includes shape mix, caching, verification, multi-agent overhead.

Same-budget comparison — what each strategy delivers at equal monthly $. The honest comparison. At federal scale, API and self-host serve roughly the same number of queries at equal budget; the "self-host saves money" intuition disappears once you're fair about budget.

Break-even volume — monthly query count at which self-host beats API on pure inference $. Above this, self-host wins on cost (before personnel).
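
As a formula, break-even volume is fixed self-host cost divided by the per-query saving. A minimal sketch; the $10,000 fixed cost and $0.02 API cost per query are placeholders, and the function name is illustrative.

```typescript
// Monthly queries at which self-host inference spend equals API spend.
// Returns Infinity when self-host never catches up (no per-query saving).
function breakEvenQueriesPerMonth(
  selfHostFixedMonthlyUsd: number,
  apiUsdPerQuery: number,
  selfHostMarginalUsdPerQuery = 0,
): number {
  const savingPerQuery = apiUsdPerQuery - selfHostMarginalUsdPerQuery;
  return savingPerQuery > 0 ? selfHostFixedMonthlyUsd / savingPerQuery : Infinity;
}

// Placeholder inputs: $10,000/month fixed self-host cost, $0.02 per query on the API.
const breakEven = breakEvenQueriesPerMonth(10000, 0.02); // about 500,000 queries/month
```

Adding personnel to the fixed-cost term moves the break-even point up proportionally, which is why the "before personnel" caveat in this entry matters.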

TCO (Total Cost of Ownership) — multi-year cost including hardware, software, personnel, ATO, training. The number budget reviews actually look at.

When this glossary and your calculator disagree on a number, trust the calculator's measured coefficients — those are empirical; this glossary can go stale.
