An interactive serving rubric — 13 May 2026
LLM serving frameworks in 2026
vllm · tensorrt-llm · tgi · sglang
You decided to self-host. You typed "best LLM serving framework" into a search engine. You now have twelve open browser tabs, each promising a different winner, and the benchmark numbers in each one are configured to make the publisher win. This is the working rubric — what each framework actually optimises, what it gives up, and a sortable matrix plus decision tree to land on one.
§01 What a serving framework actually is
Four jobs glued together. Every framework on the market is some opinionated mix of these. Knowing which job each one optimises is most of the picking decision.
JOB 1
Model loader & quantiser
Read the weights off disk, apply quantisation (FP8, INT8, INT4, GPTQ, AWQ), lay them out for the target hardware. Determines what models you can actually run on what hardware.
JOB 2
KV cache manager
Allocate, page, share, and evict the per-request key/value blocks. PagedAttention lives here. Prefix sharing across requests lives here. This is where 80% of throughput gains come from.
JOB 3
Scheduler
Decide whose token gets computed next. Continuous (in-flight) batching, priority queues, preemption, branching agentic flow. The difference between 30 and 130 QPS on the same GPU.
JOB 4
API surface
OpenAI-compatible REST, gRPC, streaming SSE, structured-output / JSON-schema enforcement, tool-call grammars. Determines how painful it is to swap your client code from a hosted API.
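The KV cache manager (Job 2) is easier to picture with a toy. The sketch below is not any framework's implementation: `PagedKVCache`, `BLOCK_SIZE`, and the method names are all made up for illustration. It assumes a block is identified by the full token prefix it ends, so two requests that start with the same system prompt point at the same physical blocks, and only the unshared tail gets a private block.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative value only)


class PagedKVCache:
    """Toy block table with reference-counted prefix sharing (hypothetical)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.refcount = {}                   # block id -> number of requests using it
        self.prefix_index = {}               # token-prefix tuple -> shared block id
        self.key_of = {}                     # block id -> prefix tuple (None if private)

    def _new_block(self, key):
        if not self.free:
            # A real scheduler would preempt or evict instead of failing.
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 0
        self.key_of[block] = key
        if key is not None:
            self.prefix_index[key] = block
        return block

    def allocate(self, tokens):
        """Return block ids backing `tokens`, reusing blocks for shared prefixes."""
        blocks = []
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # identity of a full block = whole prefix up to it
            block = self.prefix_index.get(key)
            if block is None:
                block = self._new_block(key)
            self.refcount[block] += 1
            blocks.append(block)
        if full < len(tokens):
            # The partially filled tail block is private to this request.
            block = self._new_block(None)
            self.refcount[block] += 1
            blocks.append(block)
        return blocks

    def release(self, blocks):
        """Drop one reference per block; free blocks nobody uses any more."""
        for block in blocks:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                key = self.key_of.pop(block)
                if key is not None:
                    del self.prefix_index[key]
                del self.refcount[block]
                self.free.append(block)

    def used(self):
        return len(self.refcount)
```

With an 8-token shared system prompt and two requests that differ only in their tails, the two requests occupy four blocks rather than six. Real engines add eviction policies, copy-on-write, and GPU memory layout on top of this bookkeeping.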
§02 The four contenders
There are perhaps a dozen serving frameworks in 2026; four of them are credible production options for new self-hosts. I will describe each one honestly, including what it loses on.
vLLM · The throughput default
Born from the PagedAttention paper. The framework that made "page the KV cache like virtual memory" a standard rather than a research idea. Continuous batching is its native scheduler; GPTQ, AWQ, and FP8 quantisation are all first-class; the OpenAI-compatible server is the most battle-tested in the open-source world.
Wins on: commodity-hardware throughput (any A100/H100, even older Ampere cards), model coverage (new architectures land here first or close to first), reliability of the OpenAI-compatible server, a permissive Apache-2.0 license, vibrant community.
Loses on: structured-output ergonomics (you can bolt on Outlines or xgrammar, but it's not the smoothest), peak performance on the latest NVIDIA hardware (TensorRT-LLM hand-tuned kernels still edge it at scale), TTFT under pathological workload mixes (the scheduler is throughput-biased and a latency-critical workload can starve).
Pick this when: you are self-hosting for the first time on GPUs you already own and want a predictable, well-understood floor. The conservative default is correct here.
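To see why continuous batching matters, here is a minimal simulation that counts decode steps only. It is a sketch under strong assumptions, not vLLM's scheduler: prefill cost, memory limits, and preemption are ignored, and both function names are hypothetical. `lengths` is the number of output tokens each queued request needs, and `batch_size` is the number of slots the GPU runs per step.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled before the next step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots from the queue before the step runs.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every running request emits one token; those on their last token leave.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 20
print(continuous_batching_steps(lengths, batch_size=4))  # 12
```

In the static case, the short requests wait for the long one in their batch and their slots sit idle. In the continuous case, those slots are refilled as soon as they empty, which is where the gain in this toy comes from.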
TensorRT-LLM · The NVIDIA silicon specialist
NVIDIA's framework, built on top of TensorRT, the company's general-purpose deep-learning inference compiler. Per-model kernel tuning, native FP8 with H100/H200 hardware support, sparsity support on Hopper and Blackwell. The Triton-Inference-Server-plus-TensorRT-LLM-backend combo is what NVIDIA wants you to deploy.
Wins on: peak throughput and lowest latency on H100, H200, B200 with the latest FP8 kernels — typically 1.3–1.8x vs vLLM on the same hardware for popular dense models. FP8 quality is best-in-class. NVIDIA enterprise support is real and useful at scale.
Loses on: model coverage (new architectures take weeks to land), portability (lock-in to NVIDIA and to a specific minor version of CUDA / TensorRT), iteration speed (the engine must be rebuilt per model variant, which is slow), and a steep learning curve compared to "pip install vllm". The Apache-2 license is genuine but the surrounding tooling is NVIDIA-flavoured.
Pick this when: you are operating at scale (think > 100 sustained QPS), you are committed to NVIDIA hardware on the latest generation, and the hardware bill is large enough that a 30–50% per-GPU efficiency gain pays back the engineering investment many times over.
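The payback claim can be checked with back-of-envelope arithmetic. The sketch below assumes that a per-GPU efficiency gain of g means the same traffic needs 1 / (1 + g) of the original fleet, and that migration is a one-off cost. The function name, the 100,000 monthly bill, and the 150,000 migration cost are placeholder assumptions, not figures from any vendor.

```python
def payback_months(monthly_gpu_bill, efficiency_gain, migration_cost):
    """Months until GPU savings cover a one-off migration cost.

    efficiency_gain is extra throughput per GPU (0.30 means 1.30x), so the
    same traffic needs 1 / (1 + efficiency_gain) of the original fleet.
    """
    monthly_saving = monthly_gpu_bill * (1 - 1 / (1 + efficiency_gain))
    return migration_cost / monthly_saving


# Hypothetical inputs: 100k/month GPU bill, 150k one-off engineering cost.
for gain in (0.30, 0.50):
    months = payback_months(100_000, gain, 150_000)
    print(f"{gain:.0%} gain -> payback in {months:.1f} months")
```

Under these made-up inputs the payback lands at roughly 6.5 months for a 30% gain and 4.5 months for a 50% gain. Plug in your own bill and an honest estimate of engineering time before drawing conclusions.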
TGI · The HuggingFace path of least resistance
Text Generation Inference. Sensible defaults, OpenAI-compatible API in the box, deeply integrated with the HuggingFace Hub (you can deploy a model with a model ID alone). Rust core for the request gateway with a Python model server. Comfortable if you already live in the HuggingFace ecosystem; almost frictionless if you also use HF's Inference Endpoints.
Wins on: setup time (the "ship it this afternoon" tier), mixed-model deployments (multi-LoRA serving from a base model is mature here), HuggingFace Hub integration, ergonomics for smaller and mid-sized models. Strong AMD ROCm support (one of the few production-ready non-NVIDIA stories).
Loses on: peak throughput vs vLLM (typically 20–40% behind at the same hardware), advanced scheduler features (prefix-sharing across requests is less aggressive), exotic-architecture support lag, and a smaller open-source contributor base than vLLM.
Pick this when: you want to ship today, your team is HuggingFace-native, you want multi-LoRA on one base model, or you are on AMD MI300X and need a production-ready answer.
SGLang · The structured / agentic flow specialist
The newest of the four and the most architecturally distinctive. Its scheduler ingests the control flow of an agentic program — branches, parallel calls, structured-output constraints — and shares the KV prefix across the conversational tree rather than just within one request. The result is dramatic throughput wins on agentic workloads (multi-turn function-calling, tree-of-thought, branching reasoning) that vLLM and TGI treat as independent requests.
Wins on: branching agentic workloads (often 2–5x throughput vs vLLM on workloads with heavy prefix sharing), constrained / JSON-schema decoding (xgrammar integration is first-class, not bolted on), tool-call grammar enforcement, prefix caching across speculative paths.
Loses on: docs maturity (catching up — but the framework moved faster than the docs in 2024–2025), edge-case stability under heavy load, fewer production references than vLLM, ecosystem of plugins / extensions is still small.
Pick this when: your workload is "agent makes many branching calls with the same long system prompt" rather than "user asks one long question." If you do not know the difference, your workload is probably not yet agentic enough to need SGLang — but if you are building tool-using agents, this is increasingly the answer.
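A rough way to see the effect of sharing the KV prefix across a conversational tree is to count how many tokens need prefill at all. The sketch below uses a plain token trie as a stand-in for whatever structure the framework uses internally. The function names and the token encoding are hypothetical, and prefill tokens saved is only a proxy for throughput.

```python
def prefill_tokens_independent(requests):
    """Every request recomputes its whole prompt."""
    return sum(len(tokens) for tokens in requests)


def prefill_tokens_with_prefix_tree(requests):
    """Insert each request into a token trie; only newly created nodes cost prefill."""
    root = {}
    computed = 0
    for tokens in requests:
        node = root
        for tok in tokens:
            if tok not in node:
                node[tok] = {}  # first time this prefix is seen: KV must be computed
                computed += 1
            node = node[tok]    # prefix already cached: walk down for free
    return computed


# Illustrative workload: one 1000-token system prompt, five branches of 20 tokens each.
system_prompt = [("sys", i) for i in range(1000)]
branches = [system_prompt + [("branch", b, i) for i in range(20)] for b in range(5)]

print(prefill_tokens_independent(branches))       # 5100
print(prefill_tokens_with_prefix_tree(branches))  # 1100
```

The more calls hang off the same long prefix, the larger the gap. A workload of unrelated single questions shows almost no difference between the two counts.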
§03 The benchmark trap
Every framework above publishes benchmark numbers. Every set of numbers is honest in isolation and misleading in aggregate. The mechanism is consistent: a benchmark configuration optimises some axis (large batch sizes for throughput claims, single-stream for latency claims, a specific model and quantisation for absolute-numbers claims). The publishing vendor configures the axis on which they win.
This is not malice. It is the only way to publish numbers that fit on a chart. The problem is that none of those axes is necessarily yours.
Real-world performance is a function of: your model architecture (a Mixtral MoE behaves nothing like a Llama dense model), your quantisation (FP8 on H100 looks nothing like INT4 on A100), your sequence-length distribution (a workload of 100K-input/100-output prompts behaves nothing like 200-input/2000-output), your batch shape (steady-state vs spiky), and your latency budget (do you care about TTFT, ITL, or both).
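Two of the variables above are cheap to measure on your own traffic. The sketch below assumes you can log a send timestamp and per-token arrival timestamps for a streamed response, which most streaming clients allow. The function names are hypothetical, and mean inter-token latency is used here for brevity where a production harness would report percentiles.

```python
from statistics import mean


def latency_metrics(request_sent, token_times):
    """Return (TTFT, mean ITL) in seconds for one streamed response."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0  # a single-token response has no gaps
    return ttft, itl


def prefill_share(input_tokens, output_tokens):
    """Fraction of a request's tokens that are prompt (prefill) rather than generated."""
    return input_tokens / (input_tokens + output_tokens)


ttft, itl = latency_metrics(0.0, [0.50, 0.60, 0.70, 0.80])
print(f"TTFT={ttft:.2f}s ITL={itl:.2f}s")

# The two sequence-length shapes named above sit at opposite ends of the scale.
print(f"{prefill_share(100_000, 100):.3f}")  # almost entirely prefill
print(f"{prefill_share(200, 2000):.3f}")     # almost entirely decode
```

A benchmark run on one of these shapes says very little about the other, which is why the same framework can look like a winner and a loser on two charts.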
§04 The comparison matrix
Six rows: the four serious contenders, plus llama.cpp (the right answer for the consumer-GPU / Apple-Silicon tier) and Triton Inference Server (NVIDIA's older, model-agnostic server — useful context). Scores 1–5, hand-graded. Sort by any column. Filter by intended hardware tier.
| Framework | License | Sweet spot | Throughput | Latency | Model coverage | Structured output | Docs | When to pick |
|---|---|---|---|---|---|---|---|---|
§05 The decision tree
Four questions. The framework lands at the leaf. The "why" sentence is what to defend in a design review when the next engineer asks you why you didn't pick the one their last team used.
§06 What is NOT in any of these
The serving framework is one layer in a four-layer stack. Even with the right framework picked perfectly, you are not done.
§07 The wrong-default trap
"We picked vLLM because everyone uses vLLM" is the most common framework decision I encounter, and it is right most of the time — which is why it is dangerous. It is the right answer when the workload is conventional self-hosting of a dense or popular MoE model on H100-class hardware. It is the wrong answer when:
You're at scale on the latest NVIDIA hardware. If you have committed to H100/H200 or are moving to B200, and you are running >100 sustained QPS on a stable set of models, TensorRT-LLM's 30–50% efficiency advantage on those specific kernels pays back the engineering tax. The "we don't have time to learn it" excuse stops working when the GPU bill crosses six figures monthly.
Your workload is agentic. If your average request involves 5+ tool calls with the same system prompt, SGLang's cross-branch prefix sharing is an architectural advantage, not a tuning detail. The same model will run on vLLM, but at as little as one-quarter of the throughput. Most teams discover this six months in, after having shipped on vLLM.
You're on consumer GPUs or Apple Silicon. vLLM does not target this tier. llama.cpp does — and is excellent at it. The decision in that tier is not "vLLM vs llama.cpp," it is "do you have a real data-centre GPU or not."
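The default and its three exceptions, plus the TGI case from §02, can be written down as a small function. This is a sketch of the rules as stated in the text, not a reproduction of the interactive decision tree. The parameter names and the order in which the questions are asked are assumptions made for illustration.

```python
def pick_framework(datacenter_gpu, agentic_branching,
                   latest_nvidia_at_scale, hf_native_or_amd):
    """Map four yes/no answers to a framework name (question order is an assumption)."""
    if not datacenter_gpu:
        # Consumer GPUs or Apple Silicon: the text points to llama.cpp.
        return "llama.cpp"
    if agentic_branching:
        # Many branching calls over the same long system prompt.
        return "SGLang"
    if latest_nvidia_at_scale:
        # More than 100 sustained QPS on H100/H200/B200 with a stable model set.
        return "TensorRT-LLM"
    if hf_native_or_amd:
        # HuggingFace-native team, multi-LoRA, or AMD MI300X.
        return "TGI"
    # The conservative default for a conventional or unknown workload.
    return "vLLM"


print(pick_framework(datacenter_gpu=True, agentic_branching=False,
                     latest_nvidia_at_scale=False, hf_native_or_amd=False))
```

Writing the choice down this way forces the team to answer each question explicitly, instead of defaulting to whatever the last team used.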
The takeaway, in four sentences
A serving framework is four jobs glued together — loader + KV manager + scheduler + API — and every benchmark fight is really a fight about KV management and scheduling. vLLM is the right default for the unknown workload, TensorRT-LLM for scale on the latest NVIDIA, TGI for the HuggingFace-native fast-shipping path, SGLang for branching agentic flow, and llama.cpp for everything below a real data-centre GPU.
Then build the rest of the stack — gateway, observability, failover, caching policy — because the framework is one quarter of a self-host. The benchmarks the vendors publish are an upper bound for marketing, not a prediction for your traffic; the only honest benchmark is yesterday's production replay against the candidate.