ajinkya.ai An experiment in learning with AI.
11 May 2026

Choosing a frontier LLM in 2026 — Claude, GPT, Gemini, and the open-weight contenders


An interactive decision rubric — 11 May 2026


The model menu in 2026 is not what it was even a year ago. Four frontier families, half a dozen credible open-weight contenders, and a clear specialisation pattern have emerged. The hard part is no longer "which is best" — it is matching workload, cost ceiling, and compliance to the model that wins on those axes. This tutorial gives you the matrix and the decision tree.

§01 The 2026 landscape

Three things have happened since 2024. The frontier consolidated around four families (Anthropic, OpenAI, Google, xAI). Open-weight quality crossed the threshold where, for narrow tasks at sustained utilisation, self-hosting actually pays. And each family settled into a clear specialty rather than chasing a generic "best on every benchmark" position.

Anthropic — Claude 4.7

Opus / Sonnet / Haiku. The default for coding agents and long-running tool use. Extended-thinking variants for reasoning. 1M-token context on Sonnet. Strong refusal calibration — some find it cautious; in regulated workloads that is the feature, not the bug.

OpenAI — GPT-5 / 5.5

Pro / Standard / Mini, plus the o-series reasoning models. Broadest ecosystem (Assistants, Realtime, structured output, fine-tuning). GPT-5.5 Pro with thinking pulls ahead on hard math and competition coding. The default if you want one vendor for everything.

Google — Gemini 3

Ultra / Pro / Flash. Owns long-context (Gemini 3 Pro: 2M tokens, with attention that genuinely uses the back of the window). Best multimodal in production — video frames, audio, PDF layout. Flash is the unbeatable cost-per-token tier.

xAI — Grok 4

Real-time-web-grounded answering, fewer refusal guardrails. Strong on tasks where freshness or unfiltered reasoning matters. Smaller ecosystem and limited compliance tooling — mostly relevant if you specifically need its disposition or its X-platform data access.

The open-weight tier

Four open-weight families are genuinely competitive on the right slice of workloads: Llama 4 (Meta — broadest deployment story, best Bedrock/Azure availability), Qwen 3 (Alibaba — strong multilingual, leading the open coding charts), DeepSeek V4 (MoE architecture, extraordinary inference economics at scale, R-series reasoning variants), and Mistral Large (European data-sovereignty story, mid-tier sizing for on-prem deploys).

Price caveat: The per-token prices in the matrix below are approximate, illustrative, mid-2026 list prices. They move in both directions every few months and depend on region, commitment tier, batch vs. realtime, and whether you route through Bedrock / Azure / Vertex (which often mark up or mark down). Verify on the provider's pricing page before you build your cost model. The companion tutorial on How to cost an AI agent explains the full envelope.

§02 The benchmark trap

If you are picking a model on MMLU, HumanEval, or HellaSwag scores: stop. Three problems.

Contamination. Public benchmarks are in the training data. Every frontier lab has scrubbed for them and every one of them has missed some. A 2-point HumanEval gap between vendors is inside the noise floor of "did they accidentally train on it."

Distribution mismatch. MMLU asks four-option multiple-choice questions about US history and elementary biology. Your workload is, presumably, not that. A model that's 4 points better on MMLU can be measurably worse at your retrieval-grounded customer-service queries, because the failure modes you care about — hallucination on out-of-distribution chunks, format-following on JSON output, refusal calibration on edge cases — aren't what MMLU measures.

Saturation. The frontier is at 88-92% on most of these benchmarks. Differences in the noise zone don't predict workload performance. The benchmarks that do still discriminate (ARC-AGI-2, FrontierMath, SWE-Bench Verified, GPQA Diamond, OSWorld) are useful only if your workload looks like those benchmarks.
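To put a rough number on the "noise zone", the sketch below treats a benchmark score as a binomial proportion and asks whether a gap between two models clears about two standard errors. It covers sampling noise only, before contamination is even considered. The independent-samples approximation and the 1.96 threshold are simplifying assumptions (both models answer the same items, so a paired test would be tighter), and the function names are illustrative.

```python
import math


def score_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def gap_is_distinguishable(acc_a: float, acc_b: float, n_items: int, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors of the difference.

    Assumption: the two scores are treated as independent samples.
    """
    se_diff = math.sqrt(
        score_standard_error(acc_a, n_items) ** 2
        + score_standard_error(acc_b, n_items) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


# HumanEval has 164 problems; the 10,000-item benchmark is hypothetical.
print(gap_is_distinguishable(0.90, 0.92, n_items=164))
print(gap_is_distinguishable(0.90, 0.92, n_items=10_000))
```

On a 164-problem benchmark a 90% versus 92% result does not clear the bar; the same gap on a much larger item set does, which is why small benchmarks near saturation say so little.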

Opinion: The only benchmark that matters is your own. Capture 50–200 real queries from your workload, write golden answers, and score model outputs with an LLM-as-judge calibrated against a human-graded subset. If you cannot afford to do that, you cannot afford to pick a model on benchmarks at all: you are guessing, and you should at least guess on price.
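A minimal sketch of the calibration step, assuming verdicts are simple labels such as "pass" and "fail". Before trusting the judge on the full query set, compare it with human grades on the subset using Cohen's kappa, which corrects raw agreement for chance. The function names and the 0.6 acceptance threshold are assumptions for illustration, not a standard.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human grades and judge verdicts."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if each grader labelled at random with their own base rates.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:  # both graders used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def judge_is_trustworthy(human: list[str], judge: list[str], min_kappa: float = 0.6) -> bool:
    """Gate: only use the judge at scale if it agrees well enough with humans."""
    return cohen_kappa(human, judge) >= min_kappa


# Toy human-graded subset; real subsets should be far larger.
human_grades = ["pass", "pass", "fail", "fail"]
judge_verdicts = ["pass", "pass", "fail", "pass"]
print(cohen_kappa(human_grades, judge_verdicts))
print(judge_is_trustworthy(human_grades, judge_verdicts))
```

If the gate fails, the usual next step is to revise the judge prompt or rubric and re-check against the same human-graded subset, rather than to grade more queries with a judge you cannot trust.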

§03 The comparison matrix

Twenty rows: frontier closed, frontier open-weight, plus a handful of specialised models (embeddings, rerankers, NLI). Sort by any column. The filter buttons restrict to a tier. Prices are per million tokens, approximate, May 2026.

[Interactive comparison table not reproduced here. Columns: Model, Provider, Input $/M, Output $/M, Context, Strengths, Weaknesses, Best for.]
Approx. list prices, May 2026. Verify with provider. Context shown in tokens (K=thousand, M=million).

§04 By task

Where each family genuinely wins. None of this is a soft preference — these are observable in workload-specific evals at margins wider than benchmark noise.

Coding agents & long tool sessions

Claude 4.7 Sonnet, with extended thinking enabled. The gap with GPT-5.5 on multi-turn code tasks (SWE-Bench Verified, real-world repo edits, sustained tool use across 50+ turns) is widest here. Codex CLI and Claude Code both consistently route to Claude for the hardest sub-tasks. If cost is the constraint, drop to Haiku 4.7 for routing and Sonnet for execution.

Heavy reasoning — math, science, planning

GPT-5.5 Pro with thinking, or o4 / o4-mini. Anthropic's extended-thinking Sonnet and Gemini 3 Pro thinking are competitive on most reasoning tasks but lose on competition-style math and FrontierMath-class problems. DeepSeek-R2 is the open-weight pick — surprisingly close to o-series on math at a tenth the price.

Vision — PDF, screenshots, video

Gemini 3 Pro for PDF-heavy workloads (best layout understanding, native page-coordinate citations) and GPT-5.5 for screenshot/GUI tasks (Operator and the OSWorld benchmark). Claude is closing the gap but still trails on dense diagrams. For video: Gemini 3, the only one that natively ingests video at scale.

Long-context — >200K tokens in one prompt

Gemini 3 Pro at 2M tokens, and the attention quality holds at the back of the window in a way the others' don't. Claude 4.7 Sonnet at 1M is the close second. Past 500K tokens, GPT-5.5 degrades. Important: "needle in a haystack" is solved on all of them; useful long-context (synthesise across the corpus, not retrieve from it) still favours Gemini.

Tool use & structured output

Claude 4.7 — lowest rate of malformed tool calls and the only one with native fine-grained tool-result caching. GPT-5.5 structured outputs (JSON schema) are excellent for one-shot extraction. For agentic flows with 10+ tools and sustained context, Claude wins on first try; OpenAI's tool-use traces have improved but are still measurably noisier.

Latency-critical — chat UX, voice, autocomplete

Gemini 3 Flash or Haiku 4.7 or GPT-5.5 Mini. All three are sub-200ms TTFT in their primary region, with output rates above 200 tok/s. Pick on price and on whether you have other workloads on the same provider — the per-token differences are small but the cache-hit benefits of staying in-family are real.

§05 By constraint

The same model can be the right answer for one workload and wrong for another — not because of capability, but because of the surrounding box: latency budget, cost ceiling, context need, compliance regime. Walk these in order. The first hard constraint usually narrows you to one or two families before capability even enters the picture.

Latency budget: TTFT < 200 ms

If you have a strict time-to-first-token requirement — chat UI, voice synthesis pipeline, IDE autocomplete — you cannot use any "Pro / Opus / Ultra" tier. Period. Those are 0.5–3 second TTFT models. You need a Flash / Mini / Haiku tier, deployed in the closest region to your traffic, with prompt caching enabled to skip prefix recompute. If your eval still demands a frontier-class model at this latency, the answer is: route Flash for the first chunk while Pro generates in parallel, then stream-substitute — an architecture pattern, not a model.
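One possible reading of the route-then-substitute pattern, sketched with asyncio and stand-in models: both requests start at once, the fast model supplies the opening chunk, and the stream then hands over to the slower model. `fake_model` and `hedged_stream` are hypothetical names and the delays are simulated. A real implementation would also need the two outputs to read coherently together, for example by keeping the opening to a generic acknowledgement.

```python
import asyncio
from typing import AsyncGenerator


async def fake_model(name: str, first_delay: float, chunks: list[str]) -> AsyncGenerator[str, None]:
    """Stand-in for a streaming API call; first_delay simulates TTFT."""
    await asyncio.sleep(first_delay)
    for chunk in chunks:
        yield f"{name}:{chunk}"


async def hedged_stream(
    fast: AsyncGenerator[str, None], slow: AsyncGenerator[str, None]
) -> AsyncGenerator[str, None]:
    """Yield the fast model's first chunk, then hand over to the slow model."""
    # Start the slow model now so it generates while the fast one answers.
    slow_first = asyncio.ensure_future(slow.__anext__())
    opening = await fast.__anext__()
    await fast.aclose()  # the fast model is only needed for the opening
    yield opening
    yield await slow_first  # waits only if the slow model is not ready yet
    async for chunk in slow:
        yield chunk


async def demo() -> list[str]:
    fast = fake_model("flash", 0.01, ["Sure, ", "unused"])
    slow = fake_model("pro", 0.05, ["here ", "is ", "the answer"])
    return [chunk async for chunk in hedged_stream(fast, slow)]


print(asyncio.run(demo()))
```

The user sees the first chunk at the fast model's TTFT, while the total wait for the frontier-class answer is no longer than calling the slow model alone.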

Pick: Gemini 3 Flash, Claude 4.7 Haiku, or GPT-5.5 Mini. Bench all three on your region; differences are within 30 ms but consistent per-region.
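A small harness for that per-region bench might look like the sketch below. It assumes each candidate can be wrapped in a zero-argument function that issues the request and returns an iterator of chunks; `fake_stream` is a stand-in with a simulated delay, and no vendor SDK calls are shown.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> float:
    """Seconds from issuing the request until the first chunk arrives."""
    started = time.perf_counter()
    for _chunk in start_stream():
        return time.perf_counter() - started
    raise RuntimeError("stream ended without producing a chunk")


def median_ttft_ms(start_stream: Callable[[], Iterable[str]], runs: int = 5) -> float:
    """Median over several runs, to damp one-off network jitter."""
    return statistics.median(measure_ttft(start_stream) for _ in range(runs)) * 1000.0


def fake_stream(first_delay_s: float) -> Callable[[], Iterable[str]]:
    """Hypothetical stand-in for a streaming client; replace with a real call."""
    def start():
        time.sleep(first_delay_s)
        yield "first"
        yield "second"
    return start


candidates = {"model-a": fake_stream(0.02), "model-b": fake_stream(0.05)}
results = {name: median_ttft_ms(stream, runs=3) for name, stream in candidates.items()}
fastest = min(results, key=results.get)
print(fastest, {name: round(ms) for name, ms in results.items()})
```

Run it from the same region and network path as production traffic; a median over a handful of runs is a starting point, and tail percentiles matter more once the shortlist is down to two.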

Cost budget: < $0.50 per M input

The cheap-tier prices reshuffled in 2025. Gemini 3 Flash at roughly $0.30/M input is the floor of any frontier-lab offering. Open-weight inference (DeepSeek V4 hosted, Llama 4 70B on Together) goes lower — $0.10–0.20/M — but you trade capability and reliability. For most workloads under $0.50/M, the answer is Flash unless you need very specific capability or compliance posture.

Pick: Gemini 3 Flash if you want frontier-lab support; DeepSeek V4 via Fireworks/Together if you want to spend half that and have an eval that confirms quality is acceptable.
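As a worked example of what that spread means on an invoice, the sketch below prices input tokens only. The 50M tokens/day workload is hypothetical, the $0.15 figure is simply the midpoint of the $0.10–0.20 range quoted above, and output tokens are deliberately left out because no output prices are given here.

```python
def monthly_input_cost_usd(input_tokens_per_day: float, price_per_m_usd: float, days: int = 30) -> float:
    """Input-token spend per month at a flat per-million price."""
    return input_tokens_per_day / 1_000_000 * price_per_m_usd * days


def within_ceiling(price_per_m_usd: float, ceiling_per_m_usd: float = 0.50) -> bool:
    """Does this tier fit the cost budget from this section?"""
    return price_per_m_usd < ceiling_per_m_usd


DAILY_INPUT_TOKENS = 50_000_000  # hypothetical workload
PRICES_PER_M = {
    "Gemini 3 Flash": 0.30,            # approximate list price quoted above
    "open-weight, hosted": 0.15,       # midpoint of the quoted $0.10-0.20 range
}

for name, price in PRICES_PER_M.items():
    cost = monthly_input_cost_usd(DAILY_INPUT_TOKENS, price)
    print(f"{name}: ${cost:,.2f}/month, within ceiling: {within_ceiling(price)}")
```

At this volume the gap is a few hundred dollars a month, which is why the choice usually turns on the eval result and support posture rather than on the price difference alone.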

Context budget: > 200K tokens

Past 200K, the field collapses to a short list of credible options: Gemini 3 Pro (2M), Claude 4.7 Sonnet (1M), and a couple of open-weight long-context variants. Gemini wins decisively in the >500K range: its long-context attention holds quality where Claude's degrades around 700K and GPT-5.5 falls off a cliff at 500K. If you are anywhere near 1M tokens of useful context, Gemini 3 Pro is not a preference, it is the answer.

Pick: Gemini 3 Pro for >500K useful tokens. Claude 4.7 Sonnet (1M) for 200K–500K where you also need top-tier tool use.

Compliance: FedRAMP / IL5 / EU AI Act

This is where capability gets overruled. FedRAMP Moderate: Bedrock Claude, Azure OpenAI, Vertex Gemini all have authorisations. FedRAMP High: Bedrock Claude (Gov regions) and Azure OpenAI (Gov) are the production-ready options. IL5: Bedrock Claude in AWS GovCloud. EU AI Act high-risk classification or data-residency mandate: Mistral Large on Scaleway/OVH, or Azure OpenAI in EU regions with the data-processing addendum, or self-host Llama 4 on EU infrastructure. Self-host as a compliance answer is real and increasingly common — particularly with Llama 4 and Mistral Large for EU sovereignty cases.

Pick: follow the authorisation, not the benchmark. The capability gap between Bedrock Claude in GovCloud and Claude in commercial is zero; the gap between "compliant" and "not compliant" is your entire deployment.

§06 When does open-weight self-host actually win?

The honest answer is: narrower than the open-source enthusiast crowd will tell you, but real where it does win. Three patterns.

Pattern 1: A small specialised model at sustained high utilisation. The canonical example is NLI for hallucination-checking — a fine-tuned DeBERTa-v3-large at 400M parameters is essentially free to run on a single GPU and beats a frontier LLM at deciding "does this answer follow from this context." If you are doing fact-checking at any volume (millions of decisions per day), self-hosting that one model saves five figures monthly versus calling a frontier API for the same task. The win here is task specialisation, not size.

Pattern 2: Data sovereignty as a hard requirement. European public sector, certain healthcare workloads, anything where the data physically cannot leave a particular jurisdiction or your tenancy. Self-hosted Llama 4 70B on EU infrastructure, or Mistral Large via Scaleway, is genuinely the only option. The cost is higher than calling Gemini Flash; the alternative is "the project does not exist."

Pattern 3: Sustained, predictable, very-high-volume workload. At >10M tokens/day with predictable shape, a dedicated DeepSeek V4 or Llama 4 405B deployment on H200s starts to amortise against the per-token premium that frontier APIs charge. The breakeven volume keeps creeping up as the frontier labs drop prices: in 2024 it was about 1M tokens/day; in 2026 it is closer to 20M tokens/day before cost alone makes the case. Cost alone is rarely the winning argument anymore.
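The breakeven logic can be written down in a few lines. This is a simplified sketch: it assumes a fixed daily cost for the dedicated deployment and a flat blended API price, and the dollar figures are placeholders picked to keep the arithmetic easy to follow, not measured prices.

```python
def breakeven_tokens_per_day(
    selfhost_cost_per_day_usd: float,
    api_price_per_m_usd: float,
    selfhost_marginal_per_m_usd: float = 0.0,
) -> float:
    """Daily token volume at which a fixed-cost deployment matches API spend."""
    saving_per_m = api_price_per_m_usd - selfhost_marginal_per_m_usd
    if saving_per_m <= 0:
        raise ValueError("no breakeven: self-host marginal cost is not below the API price")
    return selfhost_cost_per_day_usd / saving_per_m * 1_000_000


# Hypothetical inputs: $240/day for the deployment, $12/M blended API price.
print(f"{breakeven_tokens_per_day(240, 12):,.0f} tokens/day")
# If the API price halves, the breakeven volume doubles.
print(f"{breakeven_tokens_per_day(240, 6):,.0f} tokens/day")
```

The second call shows the mechanism behind the rising breakeven: every API price cut pushes the volume needed to justify dedicated hardware further out. Engineering time, idle capacity, and redundancy are not modelled and only move the number higher.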

Where open-weight does NOT win: general-purpose chat assistants, agentic coding, anything tool-use heavy, anything where output quality at the margin matters more than per-token price. The capability gap between Llama 4 405B and Claude 4.7 Sonnet on agent tasks is roughly what it was between GPT-3.5 and GPT-4: smaller than people think, but unmistakable in a real eval. Self-host because you have to, or because you have a narrow task, not because it is fashionable.

§07 Specialised models: what you also need

A real production stack uses an LLM plus three or four specialised models. Treating "the LLM" as the whole answer is one of the more expensive mistakes in this domain.

Embeddings

Voyage 3-large for retrieval quality. OpenAI text-embedding-3-large for breadth and ecosystem. Cohere Embed v4 for multilingual.

Choose on retrieval-recall against your evals, not on MTEB scores — MTEB is contaminated and the leaderboard is gamed.

Rerankers

Cohere Rerank 3 and Voyage rerank-2 are the two production options. ~25–40% recall@5 lift over embedding-only retrieval in most workloads. Cheap enough you should default to using one.
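To check the lift on your own data rather than take the range on trust, recall@k is easy to compute. The sketch below assumes the lift is meant as a relative improvement and uses toy document IDs; the rankings are made up to show the mechanics, not results from any reranker.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("need at least one relevant document")
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(rankings: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Average recall@k over a set of queries."""
    scores = [recall_at_k(r, rel, k) for r, rel in zip(rankings, relevant)]
    return sum(scores) / len(scores)


def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Two toy queries with hand-labelled relevant documents.
relevant = [{"d1", "d7"}, {"d2"}]
embedding_only = [["d3", "d1", "d4", "d5", "d6", "d7"], ["d2", "d9", "d8", "d5", "d6"]]
reranked = [["d1", "d7", "d3", "d4", "d5", "d6"], ["d2", "d9", "d8", "d5", "d6"]]

before = mean_recall_at_k(embedding_only, relevant)
after = mean_recall_at_k(reranked, relevant)
print(f"recall@5 {before:.2f} -> {after:.2f}, lift {relative_lift(before, after):.0%}")
```

A common setup is to retrieve a wider candidate set with embeddings, rerank it, and compare recall@5 before and after on the same labelled queries.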

Guardrails

Llama Guard 3 is the open default. OpenAI (Moderation) and the major hyperscalers all offer hosted classifiers. You will want both an input filter and an output filter; either is cheaper than running a frontier model as a moderator.

NLI / fact-check

DeBERTa-v3-large fine-tuned on MNLI is still the best price/quality answer for "does this answer follow from this context." 400M params, runs on a CPU at a push. Two orders of magnitude cheaper than frontier-LLM-as-judge.

Transcription

Whisper-large-v3 (open) for batch. Deepgram Nova-3 for real-time. AssemblyAI if you also want diarisation. Frontier multimodal LLMs do transcription but cost 5–10x what dedicated ASR does and aren't measurably better.

Vision (open)

LLaVA-1.6, Idefics3, Qwen2.5-VL: useful when you need on-prem vision and the frontier vision models are off the table. For everything else, Gemini Flash multimodal beats them on cost-quality at the API.

§08 The decision tree

Five questions. Answer in order. Each answer either narrows the field or lands you on a recommendation. This codifies the rubric above into a path you can defend in a procurement meeting.
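The interactive tree itself is not reproduced in this text, so the sketch below is a reconstruction under assumptions: the five questions and their order are inferred from the constraint walk in §05 (latency, cost, context, compliance) followed by the task picks in §04. The `Workload` fields and the fallback strings are hypothetical, and the first binding constraint wins; a production version would intersect constraints rather than stop at the first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    needs_ttft_under_200ms: bool = False
    max_input_price_per_m: Optional[float] = None  # USD; None means no hard ceiling
    useful_context_tokens: int = 0
    compliance: Optional[str] = None               # e.g. "fedramp_high", "il5"
    task: str = "general"


COMPLIANCE_PICKS = {
    "fedramp_moderate": "Bedrock Claude, Azure OpenAI, or Vertex Gemini",
    "fedramp_high": "Bedrock Claude (Gov regions) or Azure OpenAI (Gov)",
    "il5": "Bedrock Claude in AWS GovCloud",
    "eu_residency": "Mistral Large on Scaleway/OVH, Azure OpenAI in EU regions, or self-hosted Llama 4",
}
TASK_PICKS = {
    "coding_agent": "Claude 4.7 Sonnet with extended thinking",
    "heavy_reasoning": "GPT-5.5 Pro with thinking",
    "pdf_vision": "Gemini 3 Pro",
    "tool_use": "Claude 4.7",
}


def recommend(w: Workload) -> str:
    """Walk the questions in order; return at the first one that binds."""
    if w.needs_ttft_under_200ms:                                   # Q1: latency
        return "Gemini 3 Flash, Claude 4.7 Haiku, or GPT-5.5 Mini"
    if w.max_input_price_per_m is not None and w.max_input_price_per_m < 0.50:  # Q2: cost
        return "Gemini 3 Flash, or hosted DeepSeek V4 if your eval confirms quality"
    if w.useful_context_tokens > 500_000:                          # Q3: context
        return "Gemini 3 Pro"
    if w.useful_context_tokens > 200_000:
        # Assumption: Gemini is the default in this band unless tool use dominates.
        return "Claude 4.7 Sonnet" if w.task in ("coding_agent", "tool_use") else "Gemini 3 Pro"
    if w.compliance is not None:                                   # Q4: compliance
        return COMPLIANCE_PICKS.get(w.compliance, "check which vendors hold the authorisation")
    return TASK_PICKS.get(w.task, "cheapest tier that passes your eval")  # Q5: task


print(recommend(Workload(useful_context_tokens=800_000)))
print(recommend(Workload(compliance="il5")))
print(recommend(Workload(task="coding_agent")))
```

Writing the tree as a function has one practical benefit for a procurement discussion: the inputs that drove the recommendation are explicit and can be recorded alongside the decision.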

§09 The wrong-default trap

The most expensive mistake in 2026 is not picking the wrong frontier model. It is defaulting to a frontier model when a tier-2 model would do.

"We use Claude Sonnet for everything" or "we just call GPT-5.5" are statements I hear from engineering teams whose actual workload — classification, extraction, simple Q&A, retrieval-grounded chat — would run identically well on Haiku, Flash, or Mini at one-tenth the cost. The per-call savings look small. Multiplied by your monthly volume, it is the difference between a $40K and a $4K invoice for the same outcome.

The corollary: a real eval suite, including a cost-tier comparison, is the single highest-ROI engineering investment in this space. You will discover that 60-80% of your traffic can be routed to a cheaper tier without quality regression, and you will spend the difference on the 20-40% that genuinely needs the frontier model.
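The arithmetic behind that claim is short enough to run. The sketch below assumes every query costs about the same on a given tier and that the cheap tier runs at one-tenth of the frontier price; both are simplifications, and the $40K starting invoice is the illustrative figure from earlier in this section.

```python
def blended_monthly_cost(
    all_frontier_cost_usd: float, cheap_share: float, cheap_cost_ratio: float = 0.1
) -> float:
    """Monthly cost after routing `cheap_share` of traffic to the cheaper tier."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return all_frontier_cost_usd * ((1.0 - cheap_share) + cheap_share * cheap_cost_ratio)


for share in (0.6, 0.8, 1.0):
    cost = blended_monthly_cost(40_000, share)
    print(f"{share:.0%} routed to the cheap tier: ${cost:,.0f}/month")
```

Because hard queries tend to be longer than easy ones, the real saving depends on the token mix as well as the query count, which is one more reason to measure it on your own traffic.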

Opinion: Routing is the new prompt engineering. A small classifier (or Haiku/Flash itself) picks the cheap tier for easy queries and escalates only the hard ones. The eval discipline to know which is which is what separates "we spent $40K on Claude" from "we got the same answer for $6K." The 10x cost spread between Haiku and Opus only matters if you actually use both.
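A minimal sketch of such a router, assuming each model can be wrapped as a plain function from prompt to answer. The keyword-and-length check in `looks_hard` is a deliberately crude placeholder for the classifier described above, and the lambda "models" are stubs so the example runs without any API.

```python
from typing import Callable

Model = Callable[[str], str]


def looks_hard(query: str) -> bool:
    """Placeholder difficulty check; swap in a trained classifier or a cheap LLM call."""
    hard_markers = ("refactor", "prove", "multi-step", "debug")
    return len(query.split()) > 40 or any(marker in query.lower() for marker in hard_markers)


def make_router(
    cheap: Model, frontier: Model, is_hard: Callable[[str], bool] = looks_hard
) -> Callable[[str], tuple[str, str]]:
    """Build a route() function that returns (tier_used, answer)."""
    def route(query: str) -> tuple[str, str]:
        tier = "frontier" if is_hard(query) else "cheap"
        model = frontier if tier == "frontier" else cheap
        return tier, model(query)
    return route


# Stub models standing in for real API clients.
route = make_router(
    cheap=lambda q: f"[cheap] {q}",
    frontier=lambda q: f"[frontier] {q}",
)
print(route("What are your opening hours?"))
print(route("Debug this failing migration and refactor the schema"))
```

Returning the tier alongside the answer makes it straightforward to log routing decisions and replay them against the eval suite, which is how misrouted queries get found.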

The takeaway

There is no best model. There is a best model for this workload at this cost ceiling under this compliance regime, and the field has settled into clear specialties: Claude for code and tool-use, Gemini for long-context and vision, GPT for reasoning and ecosystem breadth, open-weight for sovereignty and narrow specialisation at scale.

Build the eval before you pick the model. Pick the cheapest tier that passes the eval. Route, don't default.