Tool use — how LLMs call functions, and why your agent loops forever

Function calling is not the model running your code. It is the model emitting a structured JSON object that asks your code to run something, on its behalf, and tell it what came back. Once you internalize that, "tool use" stops being magic and starts being a serializer with quirks.

The model never runs your code. It writes a JSON request describing the function it wants run. Your runtime executes the function, sends the output back as a tool_result, and the model continues the conversation with that result in its context.

1 · What a tool call actually is

Before any agent framework, before any MCP server, before any of the abstractions — here is the wire format. A tool call is a structured output the model produces instead of (or alongside) plain text.

You give the model a list of tools as part of the request. The model decides whether to use one. If it does, it emits a JSON block that names the tool and supplies its arguments. Your runtime parses that block, runs the actual function, and feeds the result back in the next message.

Here is what the Anthropic and OpenAI responses look like on the wire for the same request, "what's the weather in Tokyo?":

model response · one content block · Anthropic

{
  "role": "assistant",
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01A7...",
      "name": "get_current_weather",
      "input": {
        "location": "Tokyo, Japan",
        "unit": "celsius"
      }
    }
  ]
}

model response · tool_calls array · OpenAI

{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "arguments": "{\"location\":\"Tokyo, Japan\",\"unit\":\"celsius\"}"
      }
    }
  ]
}

Identical semantics. Different envelopes. Two things to notice:

OpenAI's arguments is a string, not an object — it's serialized JSON inside a JSON field, and you have to JSON.parse it yourself. The string can be malformed if the model truncates or hallucinates. Anthropic gives you a parsed object.
The id is load-bearing. When you send the result back, you have to refer to this exact id so the model knows which call you're answering. Lose it and you'll get confused responses or hard errors.
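
Both differences can be absorbed in a thin adapter. The following is a minimal sketch, assuming you work with the raw response dictionaries shown above rather than an SDK's typed objects; `ToolCall`, `from_anthropic`, and `from_openai` are hypothetical names made up for this illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    id: str                            # provider-issued id, echoed back with the result
    name: str
    args: dict
    parse_error: Optional[str] = None  # set when OpenAI-style arguments fail to parse


def from_anthropic(message: dict) -> list:
    """Collect tool_use blocks; 'input' is already a parsed object."""
    return [
        ToolCall(id=block["id"], name=block["name"], args=block["input"])
        for block in message.get("content", [])
        if block.get("type") == "tool_use"
    ]


def from_openai(message: dict) -> list:
    """Collect tool_calls; 'arguments' is a JSON string that may be malformed."""
    calls = []
    for raw in message.get("tool_calls") or []:
        fn = raw["function"]
        try:
            args: Any = json.loads(fn["arguments"])
            error = None
        except json.JSONDecodeError as exc:
            # Keep the call (and its id) so the error can be reported back to the model.
            args, error = {}, f"arguments were not valid JSON: {exc.msg}"
        calls.append(ToolCall(raw["id"], fn["name"], args, error))
    return calls
```

Keeping the call around even when parsing fails matters: you still need its id to send an error result back.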

You then run getCurrentWeather("Tokyo, Japan") in your own code, get back something like { temp: 17, condition: "cloudy" }, and append a tool_result message to the conversation. The model takes that result and produces the final user-facing answer: "It's 17°C and cloudy in Tokyo."
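
For reference, here is roughly what that appended message looks like in each envelope. This is a sketch of the message dictionaries only; the helper names are hypothetical, and serializing the result with `json.dumps` is one reasonable choice rather than a requirement.

```python
import json


def anthropic_tool_result(tool_use_id: str, result: object, is_error: bool = False) -> dict:
    """Anthropic: a user-role message carrying one tool_result content block."""
    return {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": tool_use_id,   # must match the id of the tool_use block
                "content": json.dumps(result),
                "is_error": is_error,
            }
        ],
    }


def openai_tool_result(tool_call_id: str, result: object) -> dict:
    """OpenAI Chat Completions: a tool-role message whose content is a string."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,         # must match tool_calls[].id
        "content": json.dumps(result),
    }


weather = {"temp": 17, "condition": "cloudy"}
print(anthropic_tool_result("toolu_01A7...", weather))
print(openai_tool_result("call_abc123", weather))
```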

The single most important thing to internalize: the "function" in "function calling" is a function you wrote, running in your process. The model is a JSON-emitting oracle that sits behind an API and asks you to call things. Everything else — agents, MCP, tool loops — is plumbing around that one fact.

2 · The function-calling API surface

Three major providers, three slightly different shapes for declaring tools. The semantics are the same; the field names differ.

You declare tools in the request. Each tool gets a name, a description, and a JSON Schema for its arguments. Here's the Anthropic shape:

request · tools array passed alongside messages · Anthropic

{
  "model": "claude-opus-4-7",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "get_current_weather",
      "description": "Return the current weather conditions for a given location. Use this whenever the user asks about temperature, conditions, or forecasts for a specific city or region.",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City and country, e.g. 'Tokyo, Japan' or 'Paris, France'."
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature unit. Default celsius."
          }
        },
        "required": ["location"]
      }
    }
  ],
  "tool_choice": { "type": "auto" },
  "messages": [
    { "role": "user", "content": "What's the weather in Tokyo?" }
  ]
}

OpenAI nests the same fields under a function object and renames input_schema to parameters; Gemini calls the array function_declarations. Apart from that, the shape is identical.

Field	Anthropic	OpenAI	Gemini
tools list	`tools`	`tools` (with `type:"function"`)	`function_declarations`
name	`name`	`function.name`	`name`
description	`description`	`function.description`	`description`
schema	`input_schema`	`function.parameters`	`parameters`
call output	`tool_use` block	`tool_calls[]` array	`functionCall`
result back	`tool_result` block	message with `role:"tool"`	`functionResponse`
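
Because the fields map one to one, a declaration written in the Anthropic shape can be translated mechanically. A minimal sketch, with hypothetical helper names; the Gemini wrapper follows the field naming in the table above, and since Gemini accepts only a subset of the schema vocabulary, copying the schema across unchanged is an assumption to verify for your own tools.

```python
def to_openai_tool(tool: dict) -> dict:
    """Nest the fields under 'function' and rename input_schema to parameters."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }


def to_gemini_tools(tools: list) -> list:
    """Wrap all declarations in a single function_declarations array."""
    return [
        {
            "function_declarations": [
                {
                    "name": t["name"],
                    "description": t.get("description", ""),
                    "parameters": t["input_schema"],  # assumed compatible as-is
                }
                for t in tools
            ]
        }
    ]


weather_tool = {
    "name": "get_current_weather",
    "description": "Return the current weather conditions for a given location.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}
print(to_openai_tool(weather_tool))
print(to_gemini_tools([weather_tool]))
```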

The `tool_choice` parameter

Every provider lets you steer the model's decision about whether to call a tool:

auto — model decides. Default. What you want 95% of the time.
required / any — model must call some tool. Useful when you've prompted the model into a state where free-form text would be wrong (e.g. a structured-extraction step).
{ name: "specific_tool" } — model must call exactly this tool. Useful for forcing a particular function — schema-guided extraction, for instance.
none — model cannot call tools this turn. Useful when you want pure text after a tool round-trip is finished.
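
The four modes are spelled differently per provider. The sketch below maps them for the Anthropic Messages API and OpenAI Chat Completions; `tool_choice_for` and its mode names ("auto", "required", "specific", "none") are hypothetical labels chosen for this example, not part of either API.

```python
from typing import Optional, Union


def tool_choice_for(provider: str, mode: str, tool_name: Optional[str] = None) -> Union[str, dict]:
    """Return the tool_choice value for one provider and one of four modes."""
    if mode == "specific" and not tool_name:
        raise ValueError("mode 'specific' needs a tool_name")
    if provider == "anthropic":
        table = {
            "auto": {"type": "auto"},
            "required": {"type": "any"},                    # Anthropic calls this 'any'
            "specific": {"type": "tool", "name": tool_name},
            "none": {"type": "none"},
        }
    elif provider == "openai":
        table = {
            "auto": "auto",
            "required": "required",
            "specific": {"type": "function", "function": {"name": tool_name}},
            "none": "none",
        }
    else:
        raise ValueError(f"unknown provider: {provider}")
    if mode not in table:
        raise ValueError(f"unknown mode: {mode}")
    return table[mode]


print(tool_choice_for("anthropic", "specific", "get_current_weather"))
print(tool_choice_for("openai", "required"))
```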

Streaming wrinkle. When you stream a response with tools enabled, the tool_use payload arrives in fragments: first the tool name and id, then the argument JSON as a series of partial strings that are not individually valid JSON. You have to accumulate until the block is complete before parsing. Most SDKs handle this; if you write a raw HTTP client, don't try to JSON-parse partial arguments mid-stream.
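
To make the accumulation concrete, here is a minimal sketch that consumes already-decoded Anthropic-style stream events (content_block_start, content_block_delta with an input_json_delta, content_block_stop) and parses only once a block has closed. The sample events are hand-written for illustration, and `collect_tool_calls` is a hypothetical helper, not an SDK function.

```python
import json


def collect_tool_calls(events) -> list:
    """Buffer argument fragments per content block; parse only at block stop."""
    pending = {}    # block index -> {"id", "name", "chunks"}
    finished = []
    for event in events:
        kind = event["type"]
        if kind == "content_block_start":
            block = event["content_block"]
            if block["type"] == "tool_use":
                pending[event["index"]] = {"id": block["id"], "name": block["name"], "chunks": []}
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "input_json_delta" and event["index"] in pending:
                pending[event["index"]]["chunks"].append(delta["partial_json"])
        elif kind == "content_block_stop" and event["index"] in pending:
            call = pending.pop(event["index"])
            raw = "".join(call["chunks"])
            # An empty buffer means the tool was called with no arguments.
            finished.append({"id": call["id"], "name": call["name"],
                             "input": json.loads(raw) if raw else {}})
    return finished


sample_events = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "tool_use", "id": "toolu_demo", "name": "get_current_weather", "input": {}}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": "{\"location\": \"Tok"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": "yo, Japan\"}"}},
    {"type": "content_block_stop", "index": 0},
]
print(collect_tool_calls(sample_events))
```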

3 · The tool loop

Almost every "agent" is this loop wrapped in error handling and logging. There's no other secret.

the agent loop, in 18 lines · pseudocode

# messages starts as the user's request and the system prompt.
# tools is the list you declared.

for i in range(MAX_ITERATIONS):
    response = llm.create(messages=messages, tools=tools)
    messages.append(response.assistant_message)

    if response.stop_reason == "end_turn":
        return response.text   # final answer, we're done

    if response.stop_reason == "tool_use":
        for call in response.tool_calls:
            try:
                result = run_tool(call.name, call.input)
            except Exception as e:
                result = {"error": str(e)}   # let the model see it
            messages.append(tool_result(call.id, result))
        continue   # loop back, model will produce another response

raise AgentExceededIterations("agent looped too many times")

That's it. The four things to notice:

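Every iteration is one full LLM call: the same context as last time plus the new tool result. A 6-step agent does six LLM calls. The number of calls scales linearly with steps, but each call re-reads the whole growing context, so total input tokens grow faster than linearly, and prompt caching matters a lot for the repeated prefix.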
You always append the assistant message back to messages before running the tool. The model sees its own previous tool_use blocks in context; if you skip the append, it loses the thread of what it asked for.
The MAX_ITERATIONS cap is non-negotiable. Without it, a confused model will retry the same broken tool until you hit the rate limit or run out of context window. Set it to 10–25 for typical agents and make sure the failure path is graceful, not "agent went silent."
Tool exceptions go back to the model as the result. Don't crash the loop — hand the model the error message and let it try to recover. Half the time it will pick a different tool, adjust arguments, or apologize to the user. The other half it will keep doing the same thing; that's why you have a cap.
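
The pseudocode can be exercised end to end without an API key. Below is a runnable sketch that uses Anthropic-shaped message dictionaries and swaps the real API call for a scripted stand-in; `scripted_llm`, `run_agent`, and the canned weather data are all invented for this demo. One choice made here that the pseudocode leaves open: any stop reason other than tool_use is treated as the end of the run.

```python
import json

MAX_ITERATIONS = 10


class AgentExceededIterations(RuntimeError):
    pass


def get_current_weather(location: str, unit: str = "celsius") -> dict:
    return {"temp": 17, "condition": "cloudy", "unit": unit}   # canned demo data


TOOLS = {"get_current_weather": get_current_weather}


def scripted_llm(messages: list) -> dict:
    """Stand-in for a real API call: asks for the weather once, then answers."""
    last = messages[-1]
    if isinstance(last["content"], str):          # the plain user question
        return {"stop_reason": "tool_use", "content": [
            {"type": "tool_use", "id": "toolu_demo_1", "name": "get_current_weather",
             "input": {"location": "Tokyo, Japan"}}]}
    result = json.loads(last["content"][0]["content"])
    text = f"It's {result['temp']}°C and {result['condition']} in Tokyo."
    return {"stop_reason": "end_turn", "content": [{"type": "text", "text": text}]}


def run_agent(user_text: str, llm=scripted_llm) -> str:
    messages = [{"role": "user", "content": user_text}]
    for _ in range(MAX_ITERATIONS):
        response = llm(messages)
        # Append the assistant turn before running tools so the model keeps its thread.
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            return "".join(b["text"] for b in response["content"] if b["type"] == "text")
        results = []
        for block in response["content"]:
            if block["type"] != "tool_use":
                continue
            try:
                output, is_error = TOOLS[block["name"]](**block["input"]), False
            except Exception as exc:              # hand the error to the model
                output, is_error = {"error": f"{type(exc).__name__}: {exc}"}, True
            results.append({"type": "tool_result", "tool_use_id": block["id"],
                            "content": json.dumps(output), "is_error": is_error})
        messages.append({"role": "user", "content": results})
    raise AgentExceededIterations("agent looped too many times")


print(run_agent("What's the weather in Tokyo?"))
```

Swapping `scripted_llm` for a function that calls a real provider is the only change needed to turn this into a working loop; the cap and the error path stay the same.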

4 · Schema design — names matter, descriptions matter more

The number-one cause of "the agent picked the wrong tool" is a bad schema. Vague names, missing descriptions, and untyped arguments make the model guess. Clear names, imperative descriptions, and typed arguments swing selection accuracy from sixty-something to high-nineties on routine queries.

Here's a worked comparison. Same underlying functions, two ways of describing them:

BAD schema · vague, undocumented · don't

{
  "name": "weather_thing",
  "description": "",
  "input_schema": {
    "type": "object",
    "properties": {
      "q": { "type": "string" }
    }
  }
}

{
  "name": "data_lookup",
  "description": "Gets data.",
  "input_schema": {
    "type": "object",
    "properties": {
      "input": { "type": "string" }
    }
  }
}

GOOD schema · named, described, typed · do

{
  "name": "get_current_weather",
  "description": "Return current temperature and conditions for a city. Use whenever the user asks about weather, temperature, rain, or conditions for a specific location.",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country, e.g. 'Tokyo, Japan'."
      }
    },
    "required": ["location"]
  }
}

The bad version isn't broken — the model might still pick weather_thing when asked about weather. But "might" becomes "definitely" with the good version, and "might" becomes "calls data_lookup half the time because that name is broad enough to match anything" once you add a second tool.

Rules of thumb for schema authors

Verb-first, snake_case names. get_current_weather, create_calendar_event, search_documents. Avoid camelCase (most models tokenize it worse), avoid nouns alone (weather is ambiguous between "look up" and "predict"), avoid generic suffixes like _thing, _data, _helper.
1–3 imperative sentences in the description. Start with "Return…" or "Create…" — tell the model what the tool does, then when to use it, then any non-obvious constraint. The description is what the model reads when deciding between two superficially similar tools.
Typed, named, described arguments. Every property gets a type, a name that says what it is (location, not q), and a one-line description with at least one example. Use enum for fixed sets. Use required.
Don't lie in the description. If the tool is rate-limited, slow, or only works for certain inputs, say so. The model will weight that against using it. Pretending the tool is omnipotent leads to the model calling it for things it can't actually do, then watching errors come back, then trying again.
Worked counter-example. If you have two near-twin tools (search_documents vs search_emails), include a "do not use this for X" sentence in each. The model is good at exclusion when you give it explicit boundaries.
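
Several of these rules are mechanical enough to check in CI. The following is a rough sketch of such a linter for Anthropic-shaped tool declarations; `lint_tool` is a hypothetical helper, and the eight-word minimum for descriptions is an arbitrary threshold picked for the example, not a number from any provider.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")   # at least two words
GENERIC_SUFFIXES = ("_thing", "_data", "_helper")
MIN_DESCRIPTION_WORDS = 8                                    # arbitrary, tune to taste


def lint_tool(tool: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    name = tool.get("name", "")
    if not SNAKE_CASE.match(name):
        problems.append(f"name '{name}' is not multi-word snake_case")
    if name.endswith(GENERIC_SUFFIXES):
        problems.append(f"name '{name}' ends in a generic suffix")
    if len(tool.get("description", "").split()) < MIN_DESCRIPTION_WORDS:
        problems.append("description is missing or too short to say what and when")
    schema = tool.get("input_schema", {})
    properties = schema.get("properties", {})
    for arg, spec in properties.items():
        if "type" not in spec:
            problems.append(f"argument '{arg}' has no type")
        if not spec.get("description"):
            problems.append(f"argument '{arg}' has no description")
    if properties and not schema.get("required"):
        problems.append("no 'required' list")
    return problems


bad_tool = {"name": "weather_thing", "description": "",
            "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}}}
good_tool = {
    "name": "get_current_weather",
    "description": "Return current temperature and conditions for a city. Use whenever "
                   "the user asks about weather, temperature, rain, or conditions.",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string",
                                    "description": "City and country, e.g. 'Tokyo, Japan'."}},
        "required": ["location"],
    },
}
for problem in lint_tool(bad_tool):
    print("bad_tool:", problem)
print("good_tool problems:", lint_tool(good_tool))
```

A linter like this catches the mechanical omissions; it cannot tell you whether two descriptions overlap in meaning, which still needs a human read or an offline selection test.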

5 · Error handling — the model can recover if you let it

Tools fail. Networks flap. Arguments come back malformed. The naive instinct is to crash the agent. The right instinct is to give the model the error message and let it try again — usually it succeeds on the second try, and the savings on user-visible failures are large.

Four error classes you'll hit, in roughly decreasing frequency:

Failure	What it looks like	What to do
malformed arguments	JSON parses but a required field is missing, or a value is out of enum (`unit: "kelvin"`)	Return `{"error": "unit must be 'celsius' or 'fahrenheit'"}` as the tool result. Model almost always fixes it next turn.
tool exception	API 500, timeout, downstream service down	Catch, return error message + brief retry guidance as tool result. Model decides whether to retry or give up.
hallucinated tool name	Model asks for `get_weather_v2` when you only declared `get_current_weather`	Rare on frontier models but not zero. Return `{"error": "no such tool 'get_weather_v2'; available tools: …"}`. Model picks the right one next iteration.
infinite retry loop	Same tool, same broken args, same error, ad infinitum	Your MAX_ITERATIONS cap catches it. On cap-hit, return a structured failure to the caller — don't fake a final answer.
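
The first three rows of the table can be handled in one dispatcher that always returns something the model can read. A minimal sketch, assuming a hand-rolled registry; `REGISTRY`, `check_args`, `run_tool_safely`, and `flaky_lookup` are hypothetical names, and the argument check covers only required fields and enums rather than full JSON Schema validation.

```python
def get_current_weather(location: str, unit: str = "celsius") -> dict:
    return {"temp": 17, "condition": "cloudy", "unit": unit}   # canned demo data


def flaky_lookup(key: str) -> dict:
    raise TimeoutError("upstream timed out after 5s")           # simulated outage


REGISTRY = {
    "get_current_weather": {
        "fn": get_current_weather,
        "schema": {
            "properties": {"location": {"type": "string"},
                           "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
            "required": ["location"],
        },
    },
    "flaky_lookup": {
        "fn": flaky_lookup,
        "schema": {"properties": {"key": {"type": "string"}}, "required": ["key"]},
    },
}


def check_args(schema: dict, args: dict):
    """Return an error string for the first problem found, or None if args look fine."""
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            return f"missing required argument '{field}'"
    for field, value in args.items():
        if field not in properties:
            return f"unknown argument '{field}'; expected one of {sorted(properties)}"
        allowed = properties[field].get("enum")
        if allowed and value not in allowed:
            return f"{field} must be one of {allowed}, got {value!r}"
    return None


def run_tool_safely(name: str, args: dict) -> dict:
    entry = REGISTRY.get(name)
    if entry is None:                                   # hallucinated tool name
        return {"error": f"no such tool '{name}'; available tools: {sorted(REGISTRY)}"}
    problem = check_args(entry["schema"], args)
    if problem:                                         # malformed arguments
        return {"error": problem}
    try:
        return entry["fn"](**args)
    except Exception as exc:                            # tool exception
        return {"error": f"{type(exc).__name__}: {exc}. This may be transient; "
                         "retry once or tell the user the service is unavailable."}


print(run_tool_safely("get_weather_v2", {"location": "Tokyo, Japan"}))
print(run_tool_safely("get_current_weather", {"location": "Tokyo, Japan", "unit": "kelvin"}))
print(run_tool_safely("flaky_lookup", {"key": "abc"}))
```

The fourth row is deliberately not handled here: the loop cap, not the dispatcher, is what stops an infinite retry.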

The "stubborn re-call" pattern. Sometimes a model latches onto the wrong tool and keeps calling it with slight variations even after you return errors. This is usually a schema problem in disguise — the wrong tool's description is overlapping with the right one, and the model genuinely thinks it's the closest match. Fix it in the schema, not in retry logic. If you find yourself adding heuristics like "if the model called X three times, force Y," your tools are mis-described.

The "show the error" trick. When a tool fails, the temptation is to return {"status": "error"} and let the model figure it out. Don't. Return the actual error message, for example:

{"error": "Database query failed: column 'departure_date' does not exist. Available columns: [depart, return, passenger_count]."}

The model will read that, realize it used the wrong column name, and self-correct on the next call. Treat the model like a competent junior dev reading your stack trace.

6 · Parallel tool calls

Modern frontier models can emit multiple tool_use blocks in a single response. Used right, this collapses a four-step serial loop into two iterations. Used wrong, it causes the "false parallelism" trap.

When the user asks "weather in Tokyo and Paris and the time difference?" a parallel-capable model returns:

three tool_use blocks in one assistant turn · Anthropic

{
  "stop_reason": "tool_use",
  "content": [
    { "type": "tool_use", "id": "toolu_1", "name": "get_current_weather", "input": {"location": "Tokyo"}},
    { "type": "tool_use", "id": "toolu_2", "name": "get_current_weather", "input": {"location": "Paris"}},
    { "type": "tool_use", "id": "toolu_3", "name": "get_timezone_offset", "input": {"city_a": "Tokyo", "city_b": "Paris"}}
  ]
}

Your runtime fans these out in parallel (Promise.all, asyncio.gather, whatever), waits for all three, then sends back all three results together: on Anthropic, one user message containing three tool_result blocks, each carrying its matching tool_use_id. The next LLM call has every answer and produces the final response in one shot. Two iterations instead of four.
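
A sketch of that fan-out with asyncio, using Anthropic-shaped blocks. The two tool functions return canned values and only simulate latency; `run_one` and `run_batch` are hypothetical helper names. Errors are caught per call so that one failing tool does not discard the other two results.

```python
import asyncio
import json


async def get_current_weather(location: str) -> dict:
    await asyncio.sleep(0.01)                      # stands in for network latency
    return {"location": location, "temp": 17}      # canned demo data


async def get_timezone_offset(city_a: str, city_b: str) -> dict:
    await asyncio.sleep(0.01)
    return {"offset_hours": 8}                     # canned demo value


TOOLS = {"get_current_weather": get_current_weather,
         "get_timezone_offset": get_timezone_offset}


async def run_one(block: dict) -> dict:
    """Run a single tool_use block and wrap the outcome as a tool_result block."""
    try:
        output, is_error = await TOOLS[block["name"]](**block["input"]), False
    except Exception as exc:
        output, is_error = {"error": f"{type(exc).__name__}: {exc}"}, True
    return {"type": "tool_result", "tool_use_id": block["id"],
            "content": json.dumps(output), "is_error": is_error}


async def run_batch(content: list) -> dict:
    """Run every tool_use block concurrently; return one user message with all results."""
    calls = [b for b in content if b["type"] == "tool_use"]
    results = await asyncio.gather(*(run_one(b) for b in calls))   # keeps input order
    return {"role": "user", "content": list(results)}


assistant_content = [
    {"type": "tool_use", "id": "toolu_1", "name": "get_current_weather", "input": {"location": "Tokyo"}},
    {"type": "tool_use", "id": "toolu_2", "name": "get_current_weather", "input": {"location": "Paris"}},
    {"type": "tool_use", "id": "toolu_3", "name": "get_timezone_offset",
     "input": {"city_a": "Tokyo", "city_b": "Paris"}},
]
print(asyncio.run(run_batch(assistant_content)))
```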

Where parallel helps

Independent lookups. "Compare AAPL and MSFT" — two independent get_stock_quote calls.
Fan-out queries. "Search for X in Confluence, Slack, and email" — three independent search calls.
Multi-resource enrichment. Get user, get user's projects, get user's recent commits, all in parallel before composing the answer.

Where it hurts: the "false parallelism" trap

The model sometimes emits two tool calls in parallel that should have been sequential, because the second depends on the result of the first. Classic case: "send a calendar invite to whoever I met with last Tuesday." The model emits in parallel:

search_calendar(date: "last Tuesday")
send_calendar_invite(attendee: "???") ← doesn't know yet

The second call has to guess an attendee because the first hasn't returned. You get a confidently wrong invite sent to someone fabricated. The fix is in the schema and the prompt: spell out that send_calendar_invite requires a concrete attendee email obtained from a prior step, and discourage parallel calls when the second tool's args reference the first's output. Some providers let you turn parallel calls off for cases where you can't trust the model to know better: Anthropic accepts disable_parallel_tool_use inside tool_choice, and OpenAI accepts parallel_tool_calls: false on the request.

Counting tool turns ≠ counting LLM calls. Parallel tool use changes the ratio. A three-tool task takes two LLM calls if all three tools fire in parallel (one to emit the calls, one to compose the answer), or four if they're forced sequential. For cost modeling, count LLM iterations, not tool invocations: the provider bills you for tokens processed, not for the number of functions you ran.
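
A back-of-the-envelope version of that cost model, counting the final answer as its own LLM call. The token figures are hypothetical round numbers; the sketch ignores output tokens and prompt caching and only adds up how much input each call has to read.

```python
def total_input_tokens(prefix: int, appended_per_round: list) -> int:
    """Sum input tokens over all LLM calls in one agent run.

    prefix: tokens in the system prompt, tool schemas, and user message.
    appended_per_round: tokens added to the context after each tool round
        (the assistant's tool_use blocks plus the tool results).
    There is one LLM call per round plus a final call that writes the answer.
    """
    total, context = 0, prefix
    for added in appended_per_round:
        total += context       # this call reads everything accumulated so far
        context += added       # then its tool round grows the context
    return total + context     # the final call reads the full context


PREFIX = 2000                  # hypothetical prefix size
PER_TOOL = 300                 # hypothetical tokens per tool call plus result

sequential = total_input_tokens(PREFIX, [PER_TOOL, PER_TOOL, PER_TOOL])   # 4 LLM calls
parallel = total_input_tokens(PREFIX, [3 * PER_TOOL])                     # 2 LLM calls
print(sequential, parallel)    # 9800 4900
```

Same three tool invocations, same 900 tokens of results, half the input tokens, because the prefix is read twice instead of four times.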

7 · MCP — what it is, what it's not

Model Context Protocol is an open standard, originally from Anthropic, for tool and resource servers that any client can connect to. It does not change how function calling works. It changes how function calls reach the model.

Before MCP, every framework (LangChain, LlamaIndex, AutoGen, the model providers' own SDKs, your bespoke agent) had its own plugin format. If you wrote a "Notion integration" you wrote it three or five times. MCP standardizes the wire format between a client (your agent, a coding tool like Claude Code or Cursor, a chat app) and a server (a process that exposes tools and resources).

The architecture, briefly:

The server is a small program that exposes a set of tools (functions to call), resources (read-only content like files or DB rows), and prompts (parameterized templates). It speaks JSON-RPC 2.0 over either stdio (for local processes) or HTTP (for remote services; newer spec revisions use Streamable HTTP, which supersedes the original HTTP+SSE transport).
The client connects, calls tools/list to discover what's available, and forwards those tool definitions into its next LLM call as if they were declared natively. When the model emits a tool_use, the client routes it back to the right MCP server, gets the result, and feeds it to the model.
The wire format is provider-agnostic. Anthropic, OpenAI, Gemini, your local Llama — same MCP servers work with all of them, because the client does the format translation.
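
The translation the client does is small. Here is a sketch of the JSON-RPC messages involved and the mapping into Anthropic's shapes, with no transport or real server attached; the helper names are hypothetical and the sample server responses are hand-written for illustration.

```python
def list_tools_request(request_id: int) -> dict:
    """JSON-RPC request asking the server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


def mcp_tool_to_anthropic(tool: dict) -> dict:
    """MCP spells the schema field inputSchema; Anthropic spells it input_schema."""
    return {"name": tool["name"],
            "description": tool.get("description", ""),
            "input_schema": tool["inputSchema"]}


def call_tool_request(request_id: int, tool_use_block: dict) -> dict:
    """Turn the model's tool_use block into a tools/call request for the server."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/call",
            "params": {"name": tool_use_block["name"],
                       "arguments": tool_use_block["input"]}}


def mcp_result_to_tool_result(tool_use_id: str, result: dict) -> dict:
    """Flatten the text items of a tools/call result into one tool_result block."""
    text = "\n".join(item["text"] for item in result.get("content", [])
                     if item.get("type") == "text")
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": text, "is_error": bool(result.get("isError", False))}


# Hand-written examples of what a server might send back.
sample_list_result = {"tools": [{
    "name": "get_current_weather",
    "description": "Return current temperature and conditions for a city.",
    "inputSchema": {"type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"]},
}]}
sample_call_result = {"content": [{"type": "text", "text": "17°C, cloudy"}], "isError": False}

tools_for_llm = [mcp_tool_to_anthropic(t) for t in sample_list_result["tools"]]
block = {"type": "tool_use", "id": "toolu_1", "name": "get_current_weather",
         "input": {"location": "Tokyo, Japan"}}
print(call_tool_request(2, block))
print(mcp_result_to_tool_result(block["id"], sample_call_result))
```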

What MCP actually buys you, in practical terms: one Notion integration works in Claude Code, Cursor, Continue, your custom agent, and anyone else who supports MCP. You write the server once. You stop reimplementing the same five tools across every framework.

What MCP does not do: it does not make tool use safer, it does not improve the model's selection accuracy, it does not magically solve any of the failure modes in the next section. The model still gets a list of name + description + schema entries; everything in section 4 still applies. MCP is plumbing.

8 · Production failure modes

If you ship a tool-using agent, you will see at least three of these. The mitigations are mundane — they don't require new tech, just discipline about schemas, caps, and how many tools you wire up.

Schema drift. You change a tool's behavior (rename a parameter, change a default, add a required field) and don't update the schema description or the prompt that depends on it. The agent keeps calling it the old way, gets errors, retries, fails. The fix is to treat tool schemas like API contracts: versioned, reviewed in PR, with a smoke test that verifies a representative request still works after any change.
The "wrong tool, plausibly correct args" mistake. Two tools have overlapping descriptions. The model picks the wrong one, but with arguments that look right, so the function runs, returns plausible-looking data, and the final answer is confidently incorrect. This is the hardest failure to debug because nothing throws. Mitigate by writing exclusionary descriptions ("do not use for X; use Y instead") and by logging which tool fired against the user's intent so you can audit selection accuracy offline.
Retry storms. A downstream service flaps. Your tool returns an error. The model retries. The tool still fails. The model retries again. You hit your MAX_ITERATIONS cap, but you forgot to put a delay between iterations, so you sent fifteen requests to the flapping service in two seconds. Mitigate with exponential backoff inside the tool implementation (not the loop), and rate-limit per-tool independently of the loop cap.
Context bloat. Every tool result is appended to the conversation. A single big tool call (a 200KB database dump, a long file read, a verbose error stack) inflates context for every subsequent iteration. Costs balloon and the model starts losing track of the original request. Mitigate by summarizing tool results above a threshold: return the first N rows, or a count + a sample, and let the model ask for more if needed.
The 8–12 tool overflow. Selection accuracy degrades as you add tools. Below ~6 tools the model picks correctly nearly always; somewhere between 8 and 12 it starts conflating similar ones; past 20 the drop is hard to miss. Treat these thresholds as rules of thumb, since they shift with the model and with how distinct the tools are. The fix is hierarchical routing: instead of giving the model 30 tools, give it 4 "router" tools (search_anything, act_on_calendar, act_on_files, act_on_messages) that internally dispatch to the specific function. Or build a small classifier-style "router agent" that picks a sub-agent which has its own narrower tool list.
Silent cache invalidation. Tool schemas live in your prompt prefix. Change a single description and you invalidate the prompt cache, so the next call pays the full uncached input price for the whole prefix. If you ship daily edits to tool descriptions, you take that hit on every deploy. Mitigate by deploying schema changes during off-peak hours, or by warming the cache with a no-op request after each deploy.
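
Two of these mitigations, backoff inside the tool and capping oversized results, fit in a few lines each. The sketch below is one way to do it under stated assumptions: `with_backoff` and `shrink_rows` are hypothetical helpers, the delays and the 20-row limit are arbitrary defaults, and a real tool should retry only on errors it knows to be transient rather than on every exception.

```python
import random
import time


def with_backoff(fn, attempts: int = 4, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2^attempt plus jitter, then retry.

    Lives inside the tool implementation, so the agent loop sees one slow call
    instead of firing a burst of fast failing ones at a flapping service.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                                   # out of retries: surface the error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def shrink_rows(rows: list, limit: int = 20) -> dict:
    """Return at most `limit` rows plus a count, so one big result can't flood context."""
    if len(rows) <= limit:
        return {"rows": rows, "total": len(rows), "truncated": False}
    return {"rows": rows[:limit], "total": len(rows), "truncated": True,
            "hint": f"showing first {limit} rows; narrow the query to see others"}


# Demo: a call that fails twice, then succeeds. Delays are recorded instead of slept.
outcomes = iter([TimeoutError("flap"), TimeoutError("flap"), "ok"])
delays = []


def flaky():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(with_backoff(flaky, sleep=delays.append), len(delays))
print(shrink_rows(list(range(500)))["total"])
```

Passing `sleep` as a parameter is only there to keep the demo and any tests fast; in production the default `time.sleep` applies.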

The hardest part of building a tool-using agent isn't the model — it's the schemas. Get the names and descriptions right and most of the failure modes disappear. Get them wrong and no amount of agent-framework cleverness will save you.

What to take away

Function calling is a serializer. The model writes a JSON request, you run the function, you hand back the result, repeat until done. The interesting work is on either side of that exchange — schema design on the way in, error shaping on the way out, and a cap on the loop so a confused model doesn't run you into the ground.

If you remember three things:

The model never runs code. It writes JSON that asks you to run code.
Schemas are prompts. Names and descriptions do most of the routing.
Cap the iterations, return errors as tool results, and stop adding tools past ten.

Everything else — MCP, agent frameworks, multi-agent topologies — is structure built on top of those three facts.

Wire formats: Anthropic Messages API, OpenAI Chat Completions, Gemini, MCP 1.0. Accurate as of 2026.