Tool use — how LLMs call functions, and why your agent loops forever
Function calling is not the model running your code. It is the model emitting a structured JSON object that asks your code to run something, on its behalf, and tell it what came back. Once you internalize that, "tool use" stops being magic and starts being a serializer with quirks.
1 · What a tool call actually is
Before any agent framework, before any MCP server, before any of the abstractions — here is the wire format. A tool call is a structured output the model produces instead of (or alongside) plain text.
You give the model a list of tools as part of the request. The model decides whether to use one. If it does, it emits a JSON block that names the tool and supplies its arguments. Your runtime parses that block, runs the actual function, and feeds the result back in the next message.
This is what Anthropic (first block) and OpenAI (second block) put on the wire for the same request, "what's the weather in Tokyo?":
{
"role": "assistant",
"stop_reason": "tool_use",
"content": [
{
"type": "tool_use",
"id": "toolu_01A7...",
"name": "get_current_weather",
"input": {
"location": "Tokyo, Japan",
"unit": "celsius"
}
}
]
}
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Tokyo, Japan\",\"unit\":\"celsius\"}"
}
}
]
}
Identical semantics. Different envelopes. Two things to notice:
- OpenAI's arguments is a string, not an object: it's serialized JSON inside a JSON field, and you have to JSON.parse it yourself. The string can be malformed if the model truncates or hallucinates. Anthropic gives you a parsed object.
- The id is load-bearing. When you send the result back, you have to refer to this exact id so the model knows which call you're answering. Lose it and you'll get confused responses or hard errors.
You then run getCurrentWeather("Tokyo, Japan") in your own code, get back something like { temp: 17, condition: "cloudy" }, and append a tool_result message to the conversation. The model takes that result and produces the final user-facing answer: "It's 17°C and cloudy in Tokyo."
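To make the envelope difference concrete, here is a minimal sketch that normalizes both response shapes shown above into one list of calls. The helper name `extract_tool_calls`, the placeholder ids, and the output layout are assumptions for illustration, not part of either provider's SDK.

```python
import json

def extract_tool_calls(message: dict) -> list:
    """Normalize an assistant message from either envelope into
    [{"id", "name", "args", "error"}] entries (hypothetical layout)."""
    calls = []

    # Anthropic shape: content is a list of blocks; tool calls are "tool_use" blocks
    # whose "input" is already a parsed object.
    content = message.get("content")
    if isinstance(content, list):
        for block in content:
            if block.get("type") == "tool_use":
                calls.append({"id": block["id"], "name": block["name"],
                              "args": block["input"], "error": None})

    # OpenAI shape: a tool_calls array whose "arguments" is a JSON *string*.
    for tc in message.get("tool_calls") or []:
        fn = tc["function"]
        args, error = None, None
        try:
            parsed = json.loads(fn["arguments"])
            if isinstance(parsed, dict):
                args = parsed
            else:
                error = "arguments must be a JSON object"
        except json.JSONDecodeError as exc:
            # Truncated or hallucinated JSON: keep the id so we can still answer the call.
            error = f"arguments were not valid JSON: {exc.msg}"
        calls.append({"id": tc["id"], "name": fn["name"], "args": args, "error": error})
    return calls

# Placeholder ids; real ones come from the provider.
anthropic_msg = {
    "role": "assistant",
    "stop_reason": "tool_use",
    "content": [{"type": "tool_use", "id": "toolu_example", "name": "get_current_weather",
                 "input": {"location": "Tokyo, Japan", "unit": "celsius"}}],
}
openai_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{"id": "call_abc123", "type": "function",
                    "function": {"name": "get_current_weather",
                                 "arguments": "{\"location\":\"Tokyo, Japan\",\"unit\":\"celsius\"}"}}],
}
truncated_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{"id": "call_cut", "type": "function",
                    "function": {"name": "get_current_weather",
                                 "arguments": "{\"location\":\"Tok"}}],
}

for msg in (anthropic_msg, openai_msg, truncated_msg):
    print(extract_tool_calls(msg))
```

The truncated case keeps its id and carries an error string, so the runtime can still answer that exact call with a useful message instead of crashing on the parse.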
2 · The function-calling API surface
Three major providers, three slightly different shapes for declaring tools. The semantics are the same; the field names differ.
You declare tools in the request. Each tool gets a name, a description, and a JSON Schema for its arguments. Here's the Anthropic shape:
{
"model": "claude-opus-4-7",
"max_tokens": 1024,
"tools": [
{
"name": "get_current_weather",
"description": "Return the current weather conditions for a given location. Use this whenever the user asks about temperature, conditions, or forecasts for a specific city or region.",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country, e.g. 'Tokyo, Japan' or 'Paris, France'."
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit. Default celsius."
}
},
"required": ["location"]
}
}
],
"tool_choice": { "type": "auto" },
"messages": [
{ "role": "user", "content": "What's the weather in Tokyo?" }
]
}
OpenAI nests the same fields under a function object and renames input_schema to parameters; Gemini calls the array function_declarations. Apart from that, the shape is identical.
| Field | Anthropic | OpenAI | Gemini |
|---|---|---|---|
| tools list | tools | tools (with type:"function") | function_declarations |
| name | name | function.name | name |
| description | description | function.description | description |
| schema | input_schema | function.parameters | parameters |
| call output | tool_use block | tool_calls[] array | functionCall |
| result back | tool_result block | message with role:"tool" | functionResponse |
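The table translates fairly directly into code. The sketch below converts one Anthropic-style declaration into the OpenAI and Gemini shapes and builds the "result back" message for the first two providers. The function names are hypothetical, and the field names (`tool_use_id`, `tool_call_id`, `function_declarations`) reflect my reading of the provider docs, so check them against the current API reference before relying on them.

```python
import json

def to_openai_tool(tool: dict) -> dict:
    """Nest the same fields under "function" and rename input_schema to parameters."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }

def to_gemini_tools(tools: list) -> list:
    """Gemini groups declarations under a function_declarations array."""
    return [{
        "function_declarations": [
            {"name": t["name"], "description": t["description"],
             "parameters": t["input_schema"]}
            for t in tools
        ]
    }]

def anthropic_result(call_id: str, result: dict) -> dict:
    """tool_result block inside a user message, keyed by the tool_use id."""
    return {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": call_id, "content": json.dumps(result)}
    ]}

def openai_result(call_id: str, result: dict) -> dict:
    """Standalone message with role "tool", keyed by the tool call id."""
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

weather_tool = {
    "name": "get_current_weather",
    "description": "Return the current weather conditions for a given location.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

print(json.dumps(to_openai_tool(weather_tool), indent=2))
print(json.dumps(openai_result("call_abc123", {"temp": 17, "condition": "cloudy"})))
```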
The tool_choice parameter
Every provider lets you steer the model's decision about whether to call a tool:
- auto: model decides. Default. What you want 95% of the time.
- required / any: model must call some tool. Useful when you've prompted the model into a state where free-form text would be wrong (e.g. a structured-extraction step).
- { name: "specific_tool" }: model must call exactly this tool. Useful for forcing a particular function, such as schema-guided extraction.
- none: model cannot call tools this turn. Useful when you want pure text after a tool round-trip is finished.
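The four modes are spelled differently per provider. As a rough sketch, the hypothetical helper below maps an abstract mode onto the Anthropic and OpenAI (Chat Completions) spellings as I understand them; treat the exact values as assumptions to verify against each provider's reference.

```python
from typing import Optional, Union

def tool_choice(provider: str, mode: str, name: Optional[str] = None) -> Union[dict, str]:
    """Translate an abstract mode ("auto", "any", "tool", "none") into a provider value."""
    if mode == "tool" and not name:
        raise ValueError("mode 'tool' needs the name of the tool to force")

    if provider == "anthropic":
        if mode == "tool":
            return {"type": "tool", "name": name}
        if mode in ("auto", "any", "none"):
            return {"type": mode}
    elif provider == "openai":
        if mode == "tool":
            return {"type": "function", "function": {"name": name}}
        # OpenAI uses plain strings, and calls "any" "required".
        spelled = {"auto": "auto", "any": "required", "none": "none"}
        if mode in spelled:
            return spelled[mode]

    raise ValueError(f"unsupported provider/mode: {provider!r}/{mode!r}")

# Force a structured-extraction step through one specific tool.
print(tool_choice("anthropic", "tool", "get_current_weather"))
print(tool_choice("openai", "any"))
```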
3 · The tool loop
Almost every "agent" is this loop wrapped in error handling and logging. There's no other secret.
# messages starts as the user's request and the system prompt.
# tools is the list you declared.
def run_agent(messages, tools):
    for _ in range(MAX_ITERATIONS):
        response = llm.create(messages=messages, tools=tools)
        # Append the assistant turn first so the model sees its own tool_use blocks.
        messages.append(response.assistant_message)
        if response.stop_reason == "end_turn":
            return response.text  # final answer, we're done
        if response.stop_reason == "tool_use":
            for call in response.tool_calls:
                try:
                    result = run_tool(call.name, call.input)
                except Exception as e:
                    result = {"error": str(e)}  # let the model see it
                messages.append(tool_result(call.id, result))
            continue  # loop back, model will produce another response
    raise AgentExceededIterations("agent looped too many times")
That's it. The four things to notice:
- Every iteration is one full LLM call — same context as last time plus the new tool result. A 6-step agent does six LLM calls. Costs scale linearly with steps, and prompt caching matters a lot for the repeated prefix.
- You always append the assistant message back to messages before running the tool. The model sees its own previous tool_use blocks in context; if you skip the append, it loses the thread of what it asked for.
- The MAX_ITERATIONS cap is non-negotiable. Without it, a confused model will retry the same broken tool until you hit the rate limit or run out of context window. Set it to 10–25 for typical agents and make sure the failure path is graceful, not "agent went silent."
- Tool exceptions go back to the model as the result. Don't crash the loop — hand the model the error message and let it try to recover. Half the time it will pick a different tool, adjust arguments, or apologize to the user. The other half it will keep doing the same thing; that's why you have a cap.
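To watch the loop behave without an API key, here is a self-contained sketch that swaps the real model for a scripted stand-in. Everything here is hypothetical scaffolding (`ScriptedModel`, `Response`, `Call`, the simplified message dicts); only the control flow mirrors the loop above: append the assistant turn, return on end_turn, feed exceptions back as results, and stop at the cap.

```python
from dataclasses import dataclass, field

MAX_ITERATIONS = 5

class AgentExceededIterations(RuntimeError):
    pass

@dataclass
class Call:
    id: str
    name: str
    input: dict

@dataclass
class Response:
    stop_reason: str
    text: str = ""
    tool_calls: list = field(default_factory=list)

def get_current_weather(location, unit="celsius"):
    # Toy tool: only knows one location, raises on anything else.
    if location != "Tokyo, Japan":
        raise ValueError(f"unknown location {location!r}")
    return {"temp": 17, "condition": "cloudy"}

TOOLS = {"get_current_weather": get_current_weather}

class ScriptedModel:
    """Stand-in for an LLM client: replays canned responses in order."""
    def __init__(self, script):
        self.script = list(script)
        self.calls = 0

    def create(self, messages):
        self.calls += 1
        return self.script.pop(0)

def run_agent(llm, messages):
    for _ in range(MAX_ITERATIONS):
        response = llm.create(messages)
        messages.append({"role": "assistant", "response": response})
        if response.stop_reason == "end_turn":
            return response.text
        for call in response.tool_calls:
            try:
                result = TOOLS[call.name](**call.input)
            except Exception as exc:
                result = {"error": str(exc)}  # the model sees the failure
            messages.append({"role": "tool", "id": call.id, "result": result})
    raise AgentExceededIterations("agent looped too many times")

# First attempt uses a bad location, the "model" corrects itself, then answers.
script = [
    Response("tool_use", tool_calls=[Call("c1", "get_current_weather", {"location": "Tokyo"})]),
    Response("tool_use", tool_calls=[Call("c2", "get_current_weather", {"location": "Tokyo, Japan"})]),
    Response("end_turn", text="It's 17°C and cloudy in Tokyo."),
]
model = ScriptedModel(script)
history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
answer = run_agent(model, history)
print(answer, "after", model.calls, "model calls")
```

Three model calls for one question is the cost point from the first bullet: every recovery step is another full call over a longer context.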
4 · Schema design — names matter, descriptions matter more
The number-one cause of "the agent picked the wrong tool" is a bad schema. Vague names, missing descriptions, and untyped arguments make the model guess. Clear names, imperative descriptions, and typed arguments swing selection accuracy from sixty-something to high-nineties on routine queries.
Here's a worked comparison. Same underlying functions, two ways of describing them: the first two definitions are the vague version, the third is the clear one.
{
"name": "weather_thing",
"description": "",
"input_schema": {
"type": "object",
"properties": {
"q": { "type": "string" }
}
}
}
{
"name": "data_lookup",
"description": "Gets data.",
"input_schema": {
"type": "object",
"properties": {
"input": { "type": "string" }
}
}
}
{
"name": "get_current_weather",
"description": "Return current temperature and conditions for a city. Use whenever the user asks about weather, temperature, rain, or conditions for a specific location.",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country, e.g. 'Tokyo, Japan'."
}
},
"required": ["location"]
}
}
The bad version isn't broken — the model might still pick weather_thing when asked about weather. But "might" becomes "definitely" with the good version, and "might" becomes "calls data_lookup half the time because that name is broad enough to match anything" once you add a second tool.
Rules of thumb for schema authors
- Verb-first, snake_case names. get_current_weather, create_calendar_event, search_documents. Avoid camelCase (most models tokenize it worse), avoid nouns alone (weather is ambiguous between "look up" and "predict"), avoid generic suffixes like _thing, _data, _helper.
- 1–3 imperative sentences in the description. Start with "Return…" or "Create…": tell the model what the tool does, then when to use it, then any non-obvious constraint. The description is what the model reads when deciding between two superficially similar tools.
- Typed, named, described arguments. Every property gets a type, a name that says what it is (location, not q), and a one-line description with at least one example. Use enum for fixed sets. Use required.
- Don't lie in the description. If the tool is rate-limited, slow, or only works for certain inputs, say so. The model will weight that against using it. Pretending the tool is omnipotent leads to the model calling it for things it can't actually do, then watching errors come back, then trying again.
- Worked counter-example. If you have two near-twin tools (search_documents vs search_emails), include a "do not use this for X" sentence in each. The model is good at exclusion when you give it explicit boundaries.
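Some of these rules are mechanical enough to check in CI. The sketch below is a hypothetical linter, not an established tool: the 20-character description threshold and the suffix list are arbitrary assumptions, and it cannot judge whether a name is truly verb-first or whether a description is honest.

```python
import re

GENERIC_SUFFIXES = ("_thing", "_data", "_helper")
# Lowercase words joined by underscores, at least two words (verb + object).
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")

def lint_tool(tool: dict) -> list:
    """Return a list of human-readable problems with a tool declaration."""
    problems = []
    name = tool.get("name", "")
    if not SNAKE_CASE.match(name):
        problems.append(f"name {name!r} is not multi-word snake_case")
    if name.endswith(GENERIC_SUFFIXES):
        problems.append(f"name {name!r} ends in a generic suffix")
    if len(tool.get("description", "").strip()) < 20:
        problems.append("description is missing or too short to say when to use the tool")

    schema = tool.get("input_schema", {})
    props = schema.get("properties", {})
    if props and not schema.get("required"):
        problems.append("schema has properties but no 'required' list")
    for arg, spec in props.items():
        if "type" not in spec:
            problems.append(f"argument {arg!r} has no type")
        if not spec.get("description"):
            problems.append(f"argument {arg!r} has no description")
    return problems

bad_tool = {
    "name": "weather_thing",
    "description": "",
    "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}},
}
good_tool = {
    "name": "get_current_weather",
    "description": "Return current temperature and conditions for a city. Use whenever "
                   "the user asks about weather, temperature, rain, or conditions.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string",
                                    "description": "City and country, e.g. 'Tokyo, Japan'."}},
        "required": ["location"],
    },
}

for tool in (bad_tool, good_tool):
    print(tool["name"], lint_tool(tool))
```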
5 · Error handling — the model can recover if you let it
Tools fail. Networks flap. Arguments come back malformed. The naive instinct is to crash the agent. The right instinct is to give the model the error message and let it try again — usually it succeeds on the second try, and the savings on user-visible failures are large.
Four error classes you'll hit, in roughly decreasing frequency:
| Failure | What it looks like | What to do |
|---|---|---|
| malformed arguments | JSON parses but a required field is missing, or a value is out of enum (unit: "kelvin") | Return {"error": "unit must be 'celsius' or 'fahrenheit'"} as the tool result. Model almost always fixes it next turn. |
| tool exception | API 500, timeout, downstream service down | Catch, return error message + brief retry guidance as tool result. Model decides whether to retry or give up. |
| hallucinated tool name | Model asks for get_weather_v2 when you only declared get_current_weather | Rare on frontier models but not zero. Return {"error": "no such tool 'get_weather_v2'; available tools: …"}. Model picks the right one next iteration. |
| infinite retry loop | Same tool, same broken args, same error, ad infinitum | Your MAX_ITERATIONS cap catches it. On cap-hit, return a structured failure to the caller — don't fake a final answer. |
It's tempting to return a bare {"status": "error"} and let the model figure it out. Don't. Return the actual error message, for example {"error": "Database query failed: column 'departure_date' does not exist. Available columns: [depart, return, passenger_count]."}. The model will read that, realize it used the wrong column name, and self-correct on the next call. Treat the model like a competent junior dev reading your stack trace.
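The first and third rows of the table can be handled by one pre-flight check that runs before the tool does. This is a minimal hand-rolled sketch (the name `check_call` and the error wording are assumptions); it covers only required fields, basic types, and enums, and a production agent would more likely lean on a full JSON Schema validator.

```python
from typing import Optional

# JSON Schema type names mapped to Python types (partial, for illustration).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
              "boolean": bool, "object": dict, "array": list}

def check_call(name: str, args: dict, schemas: dict) -> Optional[dict]:
    """Return an error payload to send back as the tool result, or None if the call looks valid."""
    if name not in schemas:
        available = ", ".join(sorted(schemas))
        return {"error": f"no such tool {name!r}; available tools: {available}"}

    schema = schemas[name]
    props = schema.get("properties", {})
    for field_name in schema.get("required", []):
        if field_name not in args:
            return {"error": f"missing required argument {field_name!r}"}

    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            return {"error": f"unknown argument {key!r}; expected one of: {', '.join(props)}"}
        expected = JSON_TYPES.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            return {"error": f"{key} must be of type {spec['type']}"}
        if "enum" in spec and value not in spec["enum"]:
            allowed = " or ".join(repr(v) for v in spec["enum"])
            return {"error": f"{key} must be {allowed}"}
    return None

SCHEMAS = {
    "get_current_weather": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    }
}

print(check_call("get_current_weather", {"location": "Tokyo, Japan", "unit": "kelvin"}, SCHEMAS))
print(check_call("get_weather_v2", {"location": "Tokyo, Japan"}, SCHEMAS))
print(check_call("get_current_weather", {"location": "Tokyo, Japan"}, SCHEMAS))
```

Each error names what was wrong and what would be right, which is the part the model actually uses to correct itself.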
6 · Parallel tool calls
Modern frontier models can emit multiple tool_use blocks in a single response. Used right, this collapses a four-step serial loop into two iterations. Used wrong, it causes the "false parallelism" trap.
When the user asks "weather in Tokyo and Paris and the time difference?" a parallel-capable model returns:
{
"stop_reason": "tool_use",
"content": [
{ "type": "tool_use", "id": "toolu_1", "name": "get_current_weather", "input": {"location": "Tokyo"}},
{ "type": "tool_use", "id": "toolu_2", "name": "get_current_weather", "input": {"location": "Paris"}},
{ "type": "tool_use", "id": "toolu_3", "name": "get_timezone_offset", "input": {"city_a": "Tokyo", "city_b": "Paris"}}
]
}
Your runtime fans these out in parallel (Promise.all, asyncio.gather, whatever), waits for all three, then sends back all three results in one batched tool_result message. The next LLM call has every answer and produces the final response in one shot. Two iterations instead of four.
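A sketch of that fan-out with `asyncio.gather`, assuming Anthropic-style blocks and stub tools that just echo their inputs (no real weather or timezone data). The helper names `run_one` and `run_batch` are hypothetical. Each call catches its own exception, so one failing tool does not sink the other results.

```python
import asyncio
import json

async def get_current_weather(location):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return {"tool": "get_current_weather", "location": location}

async def get_timezone_offset(city_a, city_b):
    await asyncio.sleep(0.01)
    return {"tool": "get_timezone_offset", "cities": [city_a, city_b]}

TOOLS = {
    "get_current_weather": get_current_weather,
    "get_timezone_offset": get_timezone_offset,
}

async def run_one(call: dict) -> dict:
    """Run a single tool call and wrap the outcome in a tool_result block."""
    try:
        result = await TOOLS[call["name"]](**call["input"])
    except Exception as exc:
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"type": "tool_result", "tool_use_id": call["id"], "content": json.dumps(result)}

async def run_batch(calls: list) -> dict:
    """Run all calls concurrently; gather returns results in input order."""
    blocks = await asyncio.gather(*(run_one(c) for c in calls))
    return {"role": "user", "content": list(blocks)}

calls = [
    {"id": "toolu_1", "name": "get_current_weather", "input": {"location": "Tokyo"}},
    {"id": "toolu_2", "name": "get_current_weather", "input": {"location": "Paris"}},
    {"id": "toolu_3", "name": "get_timezone_offset", "input": {"city_a": "Tokyo", "city_b": "Paris"}},
]
message = asyncio.run(run_batch(calls))
print([block["tool_use_id"] for block in message["content"]])
```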
Where parallel helps
- Independent lookups. "Compare AAPL and MSFT": two independent get_stock_quote calls.
- Fan-out queries. "Search for X in Confluence, Slack, and email": three independent search calls.
- Multi-resource enrichment. Get user, get user's projects, get user's recent commits, all in parallel before composing the answer.
Where it hurts: the "false parallelism" trap
The model sometimes emits two tool calls in parallel that should have been sequential, because the second depends on the result of the first. Classic case: "send a calendar invite to whoever I met with last Tuesday." The model emits in parallel:
- search_calendar(date: "last Tuesday")
- send_calendar_invite(attendee: "???") ← doesn't know yet
The second call has to guess an attendee because the first hasn't returned. You get a confidently wrong invite sent to someone fabricated. The fix is in the schema and the prompt: spell out that send_calendar_invite requires a concrete attendee email obtained from a prior step, and discourage parallel calls when the second tool's args reference the first's output. Providers also expose a switch for cases where you can't trust the model to know better: Anthropic has a disable_parallel_tool_use flag, and OpenAI has parallel_tool_calls.
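For requests that involve side-effecting tools, turning parallelism off is a one-field change. The sketch below shows where I understand each provider expects the switch (inside `tool_choice` for Anthropic, top-level `parallel_tool_calls` for OpenAI); the helper name `serial_only` is hypothetical and the placement should be confirmed against current API docs.

```python
def serial_only(request: dict, provider: str) -> dict:
    """Return a copy of the request body with parallel tool calls turned off."""
    out = dict(request)
    if provider == "anthropic":
        # The flag rides along with tool_choice; keep whatever mode was already set.
        choice = dict(out.get("tool_choice") or {"type": "auto"})
        choice["disable_parallel_tool_use"] = True
        out["tool_choice"] = choice
    elif provider == "openai":
        out["parallel_tool_calls"] = False
    else:
        raise ValueError(f"unknown provider {provider!r}")
    return out

request = {
    "tools": [{"name": "search_calendar"}, {"name": "send_calendar_invite"}],
    "messages": [{"role": "user",
                  "content": "Send a calendar invite to whoever I met with last Tuesday."}],
}
print(serial_only(request, "anthropic")["tool_choice"])
print(serial_only(request, "openai")["parallel_tool_calls"])
```

With the flag set, the model has to see the search_calendar result before it can emit send_calendar_invite, at the cost of one extra round-trip.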
7 · MCP — what it is, what it's not
Model Context Protocol is an open standard, originally from Anthropic, for tool and resource servers that any client can connect to. It does not change how function calling works. It changes how function calls reach the model.
Before MCP, every framework (LangChain, LlamaIndex, AutoGen, the model providers' own SDKs, your bespoke agent) had its own plugin format. If you wrote a "Notion integration" you wrote it three or five times. MCP standardizes the wire format between a client (your agent, a coding tool like Claude Code or Cursor, a chat app) and a server (a process that exposes tools and resources).
The architecture, briefly:
- The server is a small program that exposes a set of tools (functions to call), resources (read-only content like files or DB rows), and prompts (parameterized templates). It speaks JSON-RPC over either stdio (for local processes) or HTTP (for remote services; current versions of the spec use Streamable HTTP, which replaced the earlier HTTP+SSE transport).
- The client connects, calls tools/list to discover what's available, and forwards those tool definitions into its next LLM call as if they were declared natively. When the model emits a tool_use, the client routes it back to the right MCP server, gets the result, and feeds it to the model.
- The wire format is provider-agnostic. Anthropic, OpenAI, Gemini, your local Llama: the same MCP servers work with all of them, because the client does the format translation.
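The client's translation job is small. The sketch below builds the JSON-RPC requests for tools/list and tools/call and maps an MCP tool entry (which uses `inputSchema`) onto the Anthropic declaration shape from section 2. It skips the initialization handshake and the transport entirely, and the helper names and the sample server reply are made up for illustration.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids

def jsonrpc(method: str, params: dict = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

def mcp_tool_to_anthropic(tool: dict) -> dict:
    """MCP lists tools with name / description / inputSchema."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "input_schema": tool["inputSchema"],
    }

def tool_use_to_mcp_call(block: dict) -> dict:
    """Route a model's tool_use block back to the server as tools/call."""
    return jsonrpc("tools/call", {"name": block["name"], "arguments": block["input"]})

list_request = jsonrpc("tools/list")

# Made-up example of what a server might return for tools/list.
list_result = {"tools": [{
    "name": "get_current_weather",
    "description": "Return current temperature and conditions for a city.",
    "inputSchema": {"type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"]},
}]}

declared = [mcp_tool_to_anthropic(t) for t in list_result["tools"]]
call_request = tool_use_to_mcp_call(
    {"type": "tool_use", "id": "toolu_1", "name": "get_current_weather",
     "input": {"location": "Tokyo, Japan"}}
)
print(json.dumps(list_request))
print(json.dumps(call_request))
```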
What MCP actually buys you, in practical terms: one Notion integration works in Claude Code, Cursor, Continue, your custom agent, and anyone else who supports MCP. You write the server once. You stop reimplementing the same five tools across every framework.
What MCP does not do: it does not make tool use safer, it does not improve the model's selection accuracy, it does not magically solve any of the failure modes in the next section. The model still gets a list of name + description + schema entries; everything in section 4 still applies. MCP is plumbing.
8 · Production failure modes
If you ship a tool-using agent, you will see at least three of these. The mitigations are mundane — they don't require new tech, just discipline about schemas, caps, and how many tools you wire up.
- Schema drift. You change a tool's behavior (rename a parameter, change a default, add a required field) and don't update the schema description or the prompt that depends on it. The agent keeps calling it the old way, gets errors, retries, fails. The fix is to treat tool schemas like API contracts: versioned, reviewed in PR, with a smoke test that verifies a representative request still works after any change.
- The "wrong tool, plausibly correct args" mistake. Two tools have overlapping descriptions. The model picks the wrong one, but with arguments that look right, so the function runs, returns plausible-looking data, and the final answer is confidently incorrect. This is the hardest failure to debug because nothing throws. Mitigate by writing exclusionary descriptions ("do not use for X; use Y instead") and by logging which tool fired against the user's intent so you can audit selection accuracy offline.
- Retry storms. A downstream service flaps. Your tool returns an error. The model retries. The tool still fails. The model retries again. You hit your MAX_ITERATIONS cap, but you forgot to put a delay between iterations, so you sent fifteen requests to the flapping service in two seconds. Mitigate with exponential backoff inside the tool implementation (not the loop), and rate-limit per-tool independently of the loop cap.
- Context bloat. Every tool result is appended to the conversation. A single big tool call (a 200KB database dump, a long file read, a verbose error stack) inflates context for every subsequent iteration. Costs balloon and the model starts losing track of the original request. Mitigate by summarizing tool results above a threshold: return the first N rows, or a count + a sample, and let the model ask for more if needed.
- The 8–12 tool overflow. Selection accuracy degrades as you add tools. Below ~6 tools the model picks correctly nearly always; somewhere between 8 and 12 it starts conflating similar ones; past 20 it's noticeably worse than a random forest. The fix is hierarchical routing: instead of giving the model 30 tools, give it 4 "router" tools (search_anything, act_on_calendar, act_on_files, act_on_messages) that internally dispatch to the specific function. Or build a small classifier-style "router agent" that picks a sub-agent which has its own narrower tool list.
- Silent cache invalidation. Tool schemas live in your prompt prefix. Change a single description and you invalidate the cached prefix, so the next call is billed at the full uncached input rate (how much more that costs depends on the provider's cache discount). If you ship daily edits to tool descriptions, you'll pay that price on every deploy. Mitigate by deploying schema changes during off-peak hours, or by warming the cache with a no-op request after each deploy.
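Two of these mitigations fit in a few lines each. The sketch below shows exponential backoff with jitter inside the tool implementation (for retry storms) and a count-plus-sample summary of oversized results (for context bloat). The function names, default delays, and row limits are assumptions to tune, not recommendations.

```python
import random
import time

def call_with_backoff(fn, *, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fn() with exponentially growing, jittered delays.
    On final failure, return an error payload the model can read."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                return {"error": f"gave up after {attempts} attempts: {exc}"}
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

def shrink_result(rows: list, max_rows: int = 20) -> dict:
    """Return small results whole; summarize large ones as count + sample."""
    if len(rows) <= max_rows:
        return {"rows": rows}
    return {
        "row_count": len(rows),
        "sample": rows[:max_rows],
        "note": f"showing first {max_rows} of {len(rows)} rows; narrow the query to see more",
    }

# Demo: a service that fails twice, then recovers. Delays are recorded, not slept.
delays = []
state = {"failures_left": 2}

def flaky_service():
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise ConnectionError("503 from downstream")
    return list(range(50))

rows = call_with_backoff(flaky_service, sleep=delays.append)
summary = shrink_result(rows, max_rows=5)
print(len(delays), "retries;", summary["row_count"], "rows summarized to", len(summary["sample"]))
```

Keeping the backoff inside the tool means the agent loop sees one slow call rather than several fast failures, so the iteration cap is spent on decisions instead of retries.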
What to take away
Function calling is a serializer. The model writes a JSON request, you run the function, you hand back the result, repeat until done. The interesting work is on either side of that exchange — schema design on the way in, error shaping on the way out, and a cap on the loop so a confused model doesn't run you into the ground.
If you remember three things:
- The model never runs code. It writes JSON that asks you to run code.
- Schemas are prompts. Names and descriptions do most of the routing.
- Cap the iterations, return errors as tool results, and stop adding tools past ten.
Everything else — MCP, agent frameworks, multi-agent topologies — is structure built on top of those three facts.