structured outputs · the wire mechanics

Structured outputs & JSON mode — how the model is forced to produce valid data

Every time you ask a model for JSON, you are picking a point on a spectrum: ask politely and hope, ask politely and have the API guarantee parseable JSON, or mask the decoder at every token so the output cannot escape your schema. The three points give three different guarantees, and the cost of failure differs widely between them. This chapter walks through the mechanics.

1 Why you can't just ask for JSON

"Return the result as JSON" is the first thing everyone tries. It works most of the time on a frontier model. Most of the time isn't good enough when you're parsing the output programmatically — a one-in-fifty failure rate becomes an oncall page.

The naïve prompt is: "Extract the sentiment of this review and return it as JSON with keys 'sentiment' and 'topics'." Here is what comes back, in order of how often I see each failure in real logs:

failure 1 — markdown fences

```json
{"sentiment": "positive", "topics": ["delivery", "packaging"]}
```

failure 2 — prose preamble

Sure! Here is the extracted JSON:

{"sentiment": "positive", "topics": ["delivery"]}

failure 3 — javascript-style comments

{
  "sentiment": "positive",
  // user mentioned shipping twice
  "topics": ["delivery", "packaging"]
}

failure 4 — hallucinated fields

{
  "sentiment": "positive",
  "topics": ["delivery"],
  "confidence": 0.9,
  "language": "en"
}

Plus the more subtle ones: trailing commas, single quotes instead of double, smart-quotes that look right and parse wrong, numeric strings ("5" when you wanted 5), and the all-time classic, an output that is almost JSON but truncated mid-array because you set max_tokens too low.

Most of those break JSON.parse. Some break it in ways your retry logic catches; the hallucinated-field and numeric-string cases don't break the parse at all but quietly corrupt your downstream pipeline. This is the problem the rest of the chapter solves, in three strictly increasing levels of guarantee.
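To make the failure modes concrete, here is a minimal sketch of a diagnostic helper that tries to recover the object from a raw reply and reports what it had to work around. The function name `diagnose` and the `EXPECTED_KEYS` set are illustrative assumptions, not part of any SDK, and the recovery rules are deliberately crude.

```python
import json
import re

EXPECTED_KEYS = {"sentiment", "topics"}  # assumption: the keys the prompt asked for
FENCE = "`" * 3  # a markdown code fence, built this way to keep the source readable


def diagnose(raw: str) -> dict:
    """Try to recover a JSON object from a raw reply and list what went wrong."""
    problems = []
    text = raw.strip()

    # Failure 1: markdown fences around the payload.
    fenced = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, re.DOTALL)
    if fenced:
        problems.append("markdown fence")
        text = fenced.group(1)

    # Failure 2: prose before or after the object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return {"data": None, "problems": problems + ["no JSON object found"]}
    if start > 0 or end < len(text) - 1:
        problems.append("prose around the object")
        text = text[start:end + 1]

    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # Failure 3 and friends: comments, trailing commas, smart quotes, truncation.
        return {"data": None, "problems": problems + [f"invalid JSON: {exc.msg}"]}

    # Failure 4: parses fine, but carries keys nobody asked for.
    extra = sorted(set(data) - EXPECTED_KEYS)
    if extra:
        problems.append(f"unexpected keys: {extra}")
    return {"data": data, "problems": problems}
```

Note that the last branch is the only one a plain try/except around the parser would never notice, which is the point of the paragraph above.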

2 Three implementations, ranked by strictness

Each level eliminates a class of failure that the level below it leaves on the table. None of them eliminates semantic errors — that the model wrote valid JSON containing the wrong answer. Pick the highest level your provider supports for the surface you care about.

Approach	What it guarantees	What it doesn't	Where
prompt-and-parse	nothing — you wrap `JSON.parse` in a try/catch and retry on failure	markdown fences, prose preambles, comments, smart quotes, all of it	anywhere; works on every model
JSON mode	output is syntactically valid JSON — parses cleanly	shape: keys can still be wrong, missing, or hallucinated; enums can be ignored	OpenAI `response_format: {type:"json_object"}`; Anthropic via tool use
strict / grammar-constrained	output conforms to the schema by construction — types, enums, required fields all hold	contents of free strings, numeric values, semantic correctness, model's right to refuse	OpenAI Structured Outputs (`strict: true`), Anthropic strict tool use, Gemini, llama.cpp grammars

Prompt-and-parse

You write the JSON Schema into the prompt, you call the model, you parse the response, and if it fails you retry with the error message attached. On a frontier model with a clear schema and a clean example, this is ~95% reliable. That last 5% is what burns you. Cheap to implement, no API support required, works on any model — but it's the only one of the three where you own every recovery path.

JSON mode

OpenAI introduced response_format: {"type": "json_object"} in 2023; it tells the sampler to refuse to emit any token that wouldn't keep the output a parseable JSON value. The API guarantees the response is parseable. It does not guarantee the response matches your schema — the model can still emit {"sentment": "good", "extra_field": null} with a typo and a hallucinated key, and JSON mode will happily accept it because the JSON is valid.
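A short sketch of that gap, using only the standard library: the reply from the paragraph above parses without complaint, and it takes a separate shape check to see that it is useless. The `REQUIRED` set and the `shape_errors` name are assumptions for illustration.

```python
import json

REQUIRED = {"sentiment", "topics"}  # assumption: the keys our prompt asked for


def shape_errors(raw: str) -> list[str]:
    """Return shape problems in a reply that is already known to be valid JSON."""
    data = json.loads(raw)  # JSON mode makes this step safe, and nothing more
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = [f"missing key: {k}" for k in sorted(REQUIRED - set(data))]
    errors += [f"unexpected key: {k}" for k in sorted(set(data) - REQUIRED)]
    return errors


print(shape_errors('{"sentment": "good", "extra_field": null}'))
# ['missing key: sentiment', 'missing key: topics',
#  'unexpected key: extra_field', 'unexpected key: sentment']
```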

basic JSON mode · no schema enforcement (OpenAI)

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "Return JSON with keys sentiment, topics." },
    { "role": "user", "content": "Loved the packaging. Slow shipping." }
  ],
  "response_format": { "type": "json_object" }
}

tool use as JSON mode · Anthropic idiom (Anthropic)

{
  "model": "claude-opus-4-7",
  "tools": [{
    "name": "extract_review",
    "description": "Capture sentiment and topics from a product review.",
    "input_schema": { /* JSON Schema */ }
  }],
  "tool_choice": { "type": "tool", "name": "extract_review" }
}

One quirk worth noting: OpenAI's bare JSON mode requires the literal word "JSON" to appear in the system or user message — otherwise the call errors. It's a safety belt to keep the model from generating a hung "infinite whitespace" output when it has no clue what JSON shape you want.

Strict / grammar-constrained

This is the level where the API takes your schema, compiles it into a state machine, and at every decoding step masks out every token that would make the partial output unparseable against the schema. The strongest guarantee available short of generating the JSON yourself: the output is a valid parse against the schema by construction, before validation runs.

strict mode · openai structured outputs (OpenAI)

{
  "model": "gpt-4o-2024-08-06",
  "messages": [/* … */],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "review_extract",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"]
          },
          "topics": {
            "type": "array",
            "items": { "type": "string" }
          }
        },
        "required": ["sentiment", "topics"],
        "additionalProperties": false
      }
    }
  }
}

strict tool use · anthropic (Anthropic)

{
  "model": "claude-opus-4-7",
  "tools": [{
    "name": "extract_review",
    "strict": true,
    "input_schema": {
      "type": "object",
      "properties": {
        "sentiment": {
          "type": "string",
          "enum": ["positive", "neutral", "negative"]
        },
        "topics": {
          "type": "array",
          "items": { "type": "string" }
        }
      },
      "required": ["sentiment", "topics"]
    }
  }],
  "tool_choice": { "type": "tool", "name": "extract_review" }
}

Surface differences worth knowing. OpenAI strict mode requires additionalProperties: false on every nested object and demands every property be in required (you express "optional" by allowing null). Anthropic is more forgiving — optional means optional, and unknown fields are silently dropped rather than rejected. Both providers cache compiled grammars, so reusing the same schema across calls is cheap; rotating schemas per request is not.
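If you keep one ordinary JSON Schema and want to send it to OpenAI strict mode, the two rules above can be applied mechanically. The following is a rough sketch under stated assumptions: the helper names are hypothetical, it only handles schemas that use a plain `type` keyword (no `anyOf`, no `$ref`), and it follows the convention described above of turning "optional" into "nullable".

```python
import copy


def to_openai_strict(schema: dict) -> dict:
    """Return a copy of `schema` reshaped for the strict-mode rules described above."""
    out = copy.deepcopy(schema)  # never mutate the caller's schema
    _tighten(out)
    return out


def _tighten(node: dict) -> None:
    if node.get("type") == "object":
        props = node.get("properties", {})
        originally_required = set(node.get("required", []))
        for name, sub in props.items():
            _tighten(sub)
            if name not in originally_required:
                _make_nullable(sub)  # "optional" becomes "may be null"
        node["required"] = list(props)        # every property is listed
        node["additionalProperties"] = False  # on every nested object too
    elif node.get("type") == "array" and isinstance(node.get("items"), dict):
        _tighten(node["items"])


def _make_nullable(sub: dict) -> None:
    t = sub.get("type")
    if isinstance(t, str) and t != "null":
        sub["type"] = [t, "null"]
    elif isinstance(t, list) and "null" not in t:
        sub["type"] = t + ["null"]
    # An enum must also list null, or null would still be rejected.
    if "enum" in sub and None not in sub["enum"]:
        sub["enum"] = sub["enum"] + [None]
```

Because the transformed schema is deterministic for a given input, it also plays well with the grammar caching mentioned above: the same source schema always produces the same compiled grammar.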

3 How grammar-constrained decoding actually works

Strict mode is not the model "deciding" to follow the schema. It is the decoder being denied the option to emit anything else. The mechanism is mechanical and worth walking through.

At every decoding step a model produces a probability distribution over its full vocabulary — for current models, ~50K–200K possible next tokens. Normally the sampler picks one. Under grammar-constrained decoding the API runs the schema through a small parser that tracks where in the schema we are and produces a mask: a boolean array, one bit per vocabulary token, set to 1 iff that token would keep the partial output a valid parse. The sampler then samples only from the unmasked subset.

Walk through a tiny schema, {"name": string, "age": number}:

Interactive demo (not reproduced here) · grammar-constrained step-through

The demo advances the parser one step at a time across seven steps, showing the emitted JSON stream alongside the parser state and the tokens it allows. At step 0 the decoder has emitted nothing yet. The parser is in the start state: it knows we're producing an object, so the only legal first token is {.

Notice what the parser is doing: at every step it represents the schema as a small state machine ("we just emitted {, the next legal thing is a string literal that starts with one of the known keys") and translates that into a token mask. In the walkthrough, the token "Robert" is blocked at step 2 because it doesn't start with ". The token true is blocked at step 5 because we're inside a numeric value. The model's preferences aren't ignored; they're applied to a smaller set. If the model's top-1 choice happens to be legal, that's what greedy decoding emits. If it isn't, the highest-logit legal token is picked instead. With sampling, the draw comes from the legal tokens' renormalized probabilities.
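The masking step can be sketched in a few lines. This is a toy, not how any provider implements it: the ten-entry `VOCAB`, the fixed key order, and the one-token-per-lexeme assumption are all simplifications of mine (real tokenizers split strings across tokens, so real masks work on prefixes). The fake "model" here prefers the illegal token Robert at every step.

```python
import math
import re

# Toy vocabulary: each entry is a whole lexeme.
VOCAB = ["{", "}", ":", ",", '"name"', '"age"', '"Robert"', "Robert", "42", "true"]


def is_string(tok: str) -> bool:
    return re.fullmatch(r'"[^"]*"', tok) is not None


def is_number(tok: str) -> bool:
    return re.fullmatch(r"-?\d+(\.\d+)?", tok) is not None


def literal(expected: str):
    return lambda tok: tok == expected


# One predicate per parser state for the schema {"name": string, "age": number}.
STATES = [literal("{"), literal('"name"'), literal(":"), is_string,
          literal(","), literal('"age"'), literal(":"), is_number, literal("}")]


def mask_for(state: int) -> list[bool]:
    """One boolean per vocabulary token: True if the token is legal in this state."""
    return [STATES[state](tok) for tok in VOCAB]


def constrained_greedy(logits_per_step: list[list[float]]) -> str:
    """Greedy decoding where illegal tokens are masked to -inf before the argmax."""
    out = []
    for state, logits in enumerate(logits_per_step):
        masked = [l if ok else -math.inf for l, ok in zip(logits, mask_for(state))]
        out.append(VOCAB[masked.index(max(masked))])
    return "".join(out)


# Fake model: the same preferences at every step, with unquoted Robert on top.
PREFS = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 5.0, 2.0, 4.0]
print(constrained_greedy([PREFS] * len(STATES)))
# {"name":"Robert","age":42}
```

Unconstrained, this fake model would emit Robert nine times in a row. Constrained, the same logits produce a valid object, because at each step only the legal subset competes.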

One subtle consequence: grammar-constrained decoding can produce outputs that the unconstrained model would never produce. If "the model's actual answer" is "I don't know" but your schema is {"category": enum[…]}, the model cannot emit "I don't know" — every legal continuation forces it to pick a category. This is the central pitfall of strict mode, and the whole of section 5.

4 Schema design — what the model gets right vs wrong

A strict schema is still a prompt. Names, descriptions, enums, and field organization swing the quality of the answer by a wider margin than most people expect — even when the output is guaranteed to conform.

The interactive demo that accompanies this section (not reproduced here) runs the same input through a deliberately bad schema and a deliberately good schema for the same task. Both produce conformant JSON. Only one of them produces useful JSON, and the difference is where schema design earns its keep.

Patterns the model gets right when you give it the chance

Enums beat free strings, always. "sentiment": "positive" | "neutral" | "negative" is dramatically more reliable than "sentiment": string. Without the enum, you get "slightly positive", "mixed", "POSITIVE", and the user's actual review text as a sentiment value. With the enum, strict mode literally cannot produce any of those.
Bounded numbers beat unbounded ones. {"type": "integer", "minimum": 0, "maximum": 100} for a confidence score beats a free-form float. The grammar doesn't enforce the bounds at decode time (you still validate after), but giving the model the bound steers the distribution toward sensible values rather than 0.0–1.0 mixed with 0–100 mixed with "high".
Descriptions are prompts. JSON Schema description fields are fed to the model as part of the schema prompt. Use them. {"category": {"enum": […], "description": "Pick 'refund' only if the user explicitly asks for money back; use 'cancellation' for stop-service requests."} changes selection accuracy more than you'd guess.
Required vs optional matters more than people think. If a field is genuinely optional, mark it optional (on OpenAI strict mode, where every property must be listed in required, make it nullable instead) and the model can leave it out or emit null when uncertain. Mark every field required and non-nullable and you'll get a hallucinated value where a missing field would have been the honest answer. Strict mode preserves this distinction faithfully: required fields are always emitted, including with made-up content.
Nesting depth tax. Each additional level of nesting ({a: {b: {c: …}}}) raises the error rate on the deepest field. Three levels is fine. Five is iffy. If you find yourself reaching for deeper nesting, flatten with prefixed keys ({a_b_c: …}) or split into multiple calls.

The "make the schema match the way humans describe the problem" rule. If you would describe the output to a colleague as "sentiment, plus the top three topics, plus a confidence," your schema should be {sentiment, topics[3], confidence}. The closer the field names and structure are to natural English description, the better the model fills it in. The further it drifts — abbreviations, internal IDs, optimization-driven structure — the more you're forcing the model to translate.
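For the nesting-depth point in the list above, flattening can be done mechanically. A minimal sketch, assuming a schema that nests only through `type: "object"` properties; the function name is hypothetical, and it ignores `required` lists and possible key collisions, both of which you would need to handle in real use.

```python
def flatten_properties(schema: dict, prefix: str = "") -> dict:
    """Collapse nested object properties into one level of prefixed keys."""
    flat = {}
    for name, sub in schema.get("properties", {}).items():
        key = f"{prefix}{name}"
        if sub.get("type") == "object":
            # Recurse, carrying the path so far as the prefix: a -> a_b -> a_b_c.
            flat.update(flatten_properties(sub, prefix=f"{key}_"))
        else:
            flat[key] = sub
    return flat


nested = {
    "type": "object",
    "properties": {
        "a": {"type": "object", "properties": {
            "b": {"type": "object", "properties": {
                "c": {"type": "string", "description": "the deepest field"},
            }},
        }},
        "score": {"type": "integer"},
    },
}

flat = flatten_properties(nested)
flat_schema = {
    "type": "object",
    "properties": flat,
    "required": list(flat),
    "additionalProperties": False,
}
print(list(flat))  # ['a_b_c', 'score']
```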

5 The strict-mode pitfalls

Strict mode's strength — the model cannot escape the schema — is also its central weakness. The model can no longer say "I don't know," no longer refuse, no longer signal uncertainty unless you've built a field for it. Stories about content-moderation systems classifying things that should have been refusals trace back to exactly this.

Consider a content-classification system using strict mode with this schema:

naive moderation schema · no escape hatch (don't)

{
  "type": "object",
  "properties": {
    "category": {
      "type": "string",
      "enum": ["safe", "harassment", "sexual", "violence", "hate"]
    }
  },
  "required": ["category"]
}

The problem: an ambiguous input arrives, say a heated political argument that is borderline harassment, and the model is genuinely unsure. The unconstrained model would emit something like "this is hard to call from one message alone." Under strict mode, that string isn't a legal continuation. The model has to pick one of the five enum values. It picks "harassment", not because it's confident, but because it has to pick something and "harassment" has the highest logit among the five values the mask allows.

The result is a downstream system that bans a user with no signal that the call was 50/50. The fix is to always build the escape hatch into the schema:

moderation schema · with confidence + uncertain enum (do)

{
  "type": "object",
  "properties": {
    "category": {
      "type": "string",
      "enum": ["safe", "harassment", "sexual", "violence", "hate", "uncertain"],
      "description": "Use 'uncertain' when the message is ambiguous or you would rather a human review it."
    },
    "confidence": {
      "type": "integer",
      "minimum": 0, "maximum": 100,
      "description": "0 = pure guess, 100 = obvious. Anything below 70 routes to human review."
    },
    "reasoning": {
      "type": "string",
      "description": "One sentence: why this category. Helps with audits."
    }
  },
  "required": ["category", "confidence", "reasoning"],
  "additionalProperties": false
}

Two things changed. There's now an uncertain enum value with an explicit rubric, and there's a numeric confidence field that downstream code can threshold against. The model still can't refuse to output JSON — but it now has legitimate JSON it can output when it doesn't know, instead of being forced into one of the wrong answers.

Universal escape-hatch checklist. Every strict-mode schema you ship should have: (a) an uncertain or none value on every classification enum, (b) a confidence field that downstream code reads, and (c) a free-text reasoning or notes field so the model has somewhere to put information that doesn't fit your buckets. Without all three, you've built a system that's confidently wrong by design.
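The escape hatch only helps if downstream code actually reads it. A minimal routing sketch, assuming the parsed object follows the "do" schema above; the threshold of 70 comes from that schema's description, while the function name and the three route labels are made up for illustration.

```python
REVIEW_THRESHOLD = 70  # from the confidence field's description in the schema


def route(result: dict) -> str:
    """Decide what downstream code does with one parsed moderation result."""
    # Either escape hatch sends the item to a person instead of an automatic action.
    if result["category"] == "uncertain" or result["confidence"] < REVIEW_THRESHOLD:
        return "human_review"
    if result["category"] == "safe":
        return "allow"
    return "auto_action"


print(route({"category": "harassment", "confidence": 55,
             "reasoning": "Heated, but no clear target."}))
# human_review
```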

6 When strict mode breaks anyway

Grammar-constrained decoding eliminates a class of failures completely. It introduces a few new ones, and quietly fails to help in other cases that look like it should.

Long enums (~50+ items). The mask keeps the model inside the enum, but the signal for which category is actually right gets noisier as the list grows. The masked sampler picks the highest-logit allowed token, but the gap between #1 and #2 narrows on long lists, and you end up with the most plausible-looking category rather than the correct one. Mitigation: hierarchical classification. Pick a coarse category first, then the sub-category in a second call.
Recursive schemas (tree structures). A schema like {type: "node", children: [{type: "node", …}]} is technically legal in JSON Schema and most strict implementations support it. In practice the model gets lost at depth 3+. It either flattens the structure, terminates early, or produces a tree that's syntactically perfect but semantically wrong. If you need a tree, generate it in passes — one level at a time, with the previous level as input.
Mixed natural language input. Strict mode constrains the output, not the input. If your task is "extract structured data from a 2000-word legal document," the schema does nothing for the hard part — finding the entities. Garbage in, conformant garbage out.
Token-budget truncation. This one bites teams who haven't thought about it. The grammar guarantees that the output, when complete, will be a valid parse. If max_tokens cuts the generation off mid-emission — say, halfway through the third array element — the model stops, the grammar never gets to close the structures, and you receive an invalid-JSON output despite the strict guarantee. The fix is to budget tokens generously and to check for finish_reason: "length" before parsing.
Schema features the implementation doesn't support. OpenAI strict mode supports a subset of JSON Schema — no oneOf, no $ref across the schema boundary, no format: "date" validation (you get a string and have to validate later). Anthropic supports a different subset. Read the docs for the feature you want before assuming it'll be enforced.
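The hierarchical mitigation for long enums, from the first item in the list above, looks roughly like this. It is a sketch under assumptions: `classify` stands in for one strict-mode call whose enum is the `options` list, and the `TAXONOMY` contents are invented for the example.

```python
from typing import Callable, Optional

# Hypothetical: one strict-mode call that returns exactly one of `options`.
Classifier = Callable[[str, list[str]], str]

# Invented two-level taxonomy; each call sees a short enum instead of one long one.
TAXONOMY = {
    "billing": ["refund", "invoice", "payment_failed"],
    "account": ["login", "cancellation", "profile"],
    "uncertain": [],
}


def classify_hierarchical(text: str, classify: Classifier) -> tuple[str, Optional[str]]:
    """Pick a coarse category first, then a sub-category in a second call."""
    coarse = classify(text, list(TAXONOMY))
    if coarse not in TAXONOMY:
        raise ValueError(f"unexpected coarse category: {coarse!r}")
    fine_options = TAXONOMY[coarse]
    if not fine_options:
        return coarse, None  # nothing to refine, e.g. "uncertain"
    # Keep the escape hatch at the second level as well.
    allowed = fine_options + ["uncertain"]
    fine = classify(text, allowed)
    if fine not in allowed:
        raise ValueError(f"unexpected sub-category: {fine!r}")
    return coarse, fine
```

The cost is one extra call per item; the benefit is that neither call has to separate more than a handful of options.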

The "strict-mode truncation" failure is the most common one in production. A user sends a message that triggers a 4000-token JSON response, but you set max_tokens: 2000 for cost reasons. The model emits perfect JSON for 2000 tokens, gets cut off, and your client crashes with a parse error. Always check the stop reason before parsing — and either raise max_tokens or design schemas that don't blow past your budget.
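Checking the stop reason before parsing takes a handful of lines. This sketch assumes the response has already been turned into a plain dict in the Chat Completions shape (`choices[0].finish_reason`, `choices[0].message.content`); other providers name the field differently, so treat the key names as assumptions to adapt. The exception class and function name are illustrative.

```python
import json


class TruncatedOutput(Exception):
    """The generation hit the token limit before the JSON was closed."""


def parse_if_complete(response: dict) -> dict:
    """Parse the model's JSON only if the generation finished on its own."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        # The grammar never got to close the open structures; parsing would fail.
        raise TruncatedOutput("raise max_tokens or shrink the schema, then retry")
    return json.loads(choice["message"]["content"])
```

Raising a dedicated exception keeps truncation out of the generic parse-error path, so the retry logic can raise the budget instead of re-sending the same request.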

7 Production patterns

Strict mode is necessary but never sufficient. Three patterns make the difference between "the JSON parses" and "the JSON is right."

Validate after parsing

Strict mode validates structure: JSON syntax plus the shape your schema declares. Your business logic validates content. Always run a second-stage validator (Pydantic, Zod, or a JSON Schema runtime validator) over the parsed object before you use it. The strict-mode schema and the runtime schema can be the same document, and modern SDKs make this trivial:

pydantic + openai sdk · one source of truth (Python)

from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment

review_text = "Loved the packaging. Slow shipping."  # sample input

class ReviewExtract(BaseModel):
    sentiment: Literal["positive", "neutral", "negative"]
    topics: list[str] = Field(max_length=5)
    confidence: int = Field(ge=0, le=100)

# The SDK derives the JSON Schema from the Pydantic class
# and passes it to OpenAI as a strict schema. The response
# is parsed back into a typed object.
resp = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": review_text}],
    response_format=ReviewExtract,
)

message = resp.choices[0].message
if message.parsed is None:
    # A refusal leaves `parsed` empty and fills `refusal` instead.
    raise RuntimeError(f"no structured output: {message.refusal}")

extract: ReviewExtract = message.parsed
# extract.sentiment is typed Literal["positive","neutral","negative"]
# extract.topics is a list[str] of length 0–5
# extract.confidence is an int in 0–100
# The SDK ran the Pydantic validators while parsing, so there is
# no extra parse step here; business-rule checks still come after.

One declaration, two enforcement points: the strict schema at the API boundary, the Pydantic validator on the parsed object. The Instructor library does the same for Anthropic. This is the right baseline for any new structured-output code in 2026.

Two-pass extraction for high-stakes data

For data where wrong-but-conformant is expensive (medical, financial, legal), don't trust a single call. Run pass 1 to produce the JSON, then pass 2 — a smaller LLM call, no schema — that reads the original input and the extracted JSON and answers: "is this the right answer for the input?" The validator pass catches semantic errors that strict mode literally can't see, like the model confidently extracting a date that isn't anywhere in the source.
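The shape of the two-pass pattern, with both model calls injected as plain callables so the control flow is visible. `extract` and `review` are hypothetical stand-ins for your own pass-1 and pass-2 calls; nothing here is tied to a particular SDK.

```python
from typing import Callable

Extractor = Callable[[str], dict]       # pass 1: strict-mode extraction (hypothetical)
Reviewer = Callable[[str, dict], bool]  # pass 2: schema-free check (hypothetical)


def extract_with_review(source: str, extract: Extractor, review: Reviewer) -> dict:
    """Run the extraction, then ask a second pass whether it matches the source."""
    candidate = extract(source)
    # The reviewer sees both the original input and the extracted JSON.
    if not review(source, candidate):
        raise ValueError(f"reviewer rejected extraction: {candidate!r}")
    return candidate
```

In the test double below the reviewer is a simple substring check, which is also a cheap deterministic guard worth running before paying for the second model call: a date that never appears in the source cannot have been extracted from it.

```python
def date_is_grounded(source: str, candidate: dict) -> bool:
    return candidate["date"] in source
```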

Retry with error feedback

When validation fails — semantic or syntactic — don't just retry blind. Feed the error back into the next prompt: "Your previous response failed validation: topics must contain at least one element. The original input was: …. Try again." This is one of those patterns that sounds too simple to work and resolves ~80% of validation failures on the first retry. The model reads its own broken output and corrects it.

The validate-then-retry loop, in one rule: any error your application can detect (missing field, out-of-range value, semantic mismatch) gets fed back to the model as text. The model is dramatically better at correcting its own broken output than at avoiding the mistake in the first place. This is the same insight as the "show the error" pattern in tool use — the model is good at reading its own stack traces.
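The loop itself is short. A sketch, assuming a `call_model` callable that takes a prompt string and returns the raw reply (a hypothetical wrapper around whichever API you use) and a `validate` function that encodes your own rules; the two rules shown are examples only.

```python
import json
from typing import Callable

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def validate(data: dict) -> list[str]:
    """Application-level checks; returns human-readable error strings."""
    errors = []
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        errors.append("sentiment must be one of positive, neutral, negative")
    topics = data.get("topics")
    if not isinstance(topics, list) or not topics:
        errors.append("topics must contain at least one element")
    return errors


def extract_with_retries(prompt: str, call_model: Callable[[str], str],
                         max_attempts: int = 3) -> dict:
    """Call the model, validate, and feed any errors back as text on retry."""
    attempt_prompt = prompt
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            errors = (validate(data) if isinstance(data, dict)
                      else ["top-level value must be an object"])
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Show the model its own output, the errors, and the original input.
        attempt_prompt = (
            f"Your previous response failed validation: {'; '.join(errors)}.\n"
            f"Previous response: {raw}\n"
            f"The original input was: {prompt}\n"
            "Try again."
        )
    raise RuntimeError(f"still invalid after {max_attempts} attempts: {errors}")
```

Capping the attempts matters: a model that keeps failing the same check is telling you the schema or the prompt is the problem, not the sample.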

8 Putting it together

Three implementations, in the order you should try them: prompt-and-parse for prototyping, JSON mode for production prototypes, strict / grammar-constrained for anything where a malformed response is an oncall page. None of them gives you semantic correctness. They give you structural guarantees of increasing strength, and the failure modes shift accordingly.

If you remember three things:

Strict mode guarantees structure (syntax and schema shape), never semantics. Run a second-stage validator over the parsed object. Pydantic or Zod or your favorite, always.
Every strict schema needs an escape hatch. An uncertain enum value, a confidence score, and a reasoning text field. Without them you've built a system that's confidently wrong by design.
Schema descriptions are prompts. Field names, enum values, and JSON Schema description strings are fed to the model. Treat them with the same care as your system prompt.

The model can write valid JSON to spec. It can't always write the right JSON. The work that's left — validation, escape hatches, retries, second-pass review — is yours.

OpenAI Structured Outputs (Aug 2024), Anthropic strict tool use, Gemini structured output, llama.cpp GBNF grammars. Schemas current to 2026.