ajinkya.ai An experiment in learning with AI.
← All entries
13 May 2026 10 min read

LLM serving frameworks — vLLM, TensorRT-LLM, TGI, SGLang

LLM Serving Vllm Tensorrt Llm Tgi Sglang Infrastructure Tutorial Interactive

An interactive serving rubric — 13 May 2026

LLM serving frameworks in 2026
vllm · tensorrt-llm · tgi · sglang

You decided to self-host. You typed "best LLM serving framework" into a search engine. You now have twelve open browser tabs, each promising a different winner, and the benchmark numbers in each one are configured to make the publisher win. This is the working rubric — what each framework actually optimises, what it gives up, and a sortable matrix plus decision tree to land on one.

§01What a serving framework actually is

Four jobs glued together. Every framework on the market is some opinionated mix of these. Knowing which job each one optimises is most of the picking decision.

JOB 1

Model loader & quantiser

Read the weights off disk, apply quantisation (FP8, INT8, INT4, GPTQ, AWQ), lay them out for the target hardware. Determines what models you can actually run on what hardware.

JOB 2

KV cache manager

Allocate, page, share, and evict the per-request key/value blocks. PagedAttention lives here. Prefix sharing across requests lives here. This is where 80% of throughput gains come from.

JOB 3

Scheduler

Decide whose token gets computed next. Continuous (in-flight) batching, priority queues, preemption, branching agentic flow. The difference between 30 and 130 QPS on the same GPU.

JOB 4

API surface

OpenAI-compatible REST, gRPC, streaming SSE, structured-output / JSON-schema enforcement, tool-call grammars. Determines how painful it is to swap your client code from a hosted API.

opinion Every benchmark fight you have ever read is a fight about job 2 and job 3. Throughput claims are KV-management and scheduler claims. Latency claims are scheduler claims. The model loader matters mostly for what hardware you can target; the API surface matters mostly for how fast you can ship. Pick the framework whose KV+scheduler choices match your workload shape, then live with whatever it gives you for jobs 1 and 4.

§02The four contenders

There are perhaps a dozen serving frameworks in 2026; four of them are credible production options for new self-hosts. I will describe each one honestly, including what it loses on.

vLLMThe throughput default

Born from the PagedAttention paper. The framework that made "page the KV cache like virtual memory" a standard rather than a research idea. Continuous batching is its native scheduler; GPTQ, AWQ, and FP8 quantisation are all first-class; the OpenAI-compatible server is the most battle-tested in the open-source world.

Wins on: commodity-hardware throughput (any A100/H100, even older Ampere cards), model coverage (new architectures land here first or close to first), reliability of the OpenAI-compatible server, BSD-3 permissiveness, vibrant community.

Loses on: structured-output ergonomics (you can bolt on Outlines or xgrammar, but it's not the smoothest), peak performance on the latest NVIDIA hardware (TensorRT-LLM hand-tuned kernels still edge it at scale), TTFT under pathological workload mixes (the scheduler is throughput-biased and a latency-critical workload can starve).

Pick this when: you are self-hosting for the first time on GPUs you already own and want a predictable, well-understood floor. The conservative default is correct here.

TensorRT-LLMThe NVIDIA silicon specialist

NVIDIA's framework, built on top of TensorRT, the company's general-purpose deep-learning inference compiler. Per-model kernel tuning, native FP8 with H100/H200 hardware support, sparsity support on Hopper and Blackwell. The Triton-Inference-Server-plus-TensorRT-LLM-backend combo is what NVIDIA wants you to deploy.

Wins on: peak throughput and lowest latency on H100, H200, B200 with the latest FP8 kernels — typically 1.3–1.8x vs vLLM on the same hardware for popular dense models. FP8 quality is best-in-class. NVIDIA enterprise support is real and useful at scale.

Loses on: model coverage (new architectures take weeks to land), portability (lock to NVIDIA and to a specific minor version of CUDA / TensorRT), iteration speed (rebuild the engine per model variant — slow), and a steep learning curve compared to "pip install vllm". The Apache-2 license is genuine but the surrounding tooling is NVIDIA-flavoured.

Pick this when: you are operating at scale (think > 100 sustained QPS), you are committed to NVIDIA hardware on the latest generation, and the hardware bill is large enough that a 30–50% per-GPU efficiency gain pays back the engineering investment many times over.

TGIThe HuggingFace path of least resistance

Text Generation Inference. Sensible defaults, OpenAI-compatible API in the box, deeply integrated with the HuggingFace Hub (you can deploy a model with a model ID alone). Rust core for the request gateway with a Python model server. Comfortable if you already live in the HuggingFace ecosystem; almost frictionless if you also use HF's Inference Endpoints.

Wins on: setup time (the "ship it this afternoon" tier), mixed-model deployments (multi-LoRA serving from a base model is mature here), HuggingFace Hub integration, ergonomics for smaller and mid-sized models. Strong AMD ROCm support (one of the few production-ready non-NVIDIA stories).

Loses on: peak throughput vs vLLM (typically 20–40% behind at the same hardware), advanced scheduler features (prefix-sharing across requests is less aggressive), exotic-architecture support lag, and a smaller open-source contributor base than vLLM.

Pick this when: you want to ship today, your team is HuggingFace-native, you want multi-LoRA on one base model, or you are on AMD MI300X and need a production-ready answer.

SGLangThe structured / agentic flow specialist

The newest of the four and the most architecturally distinctive. Its scheduler ingests the control flow of an agentic program — branches, parallel calls, structured-output constraints — and shares the KV prefix across the conversational tree rather than just within one request. The result is dramatic throughput wins on agentic workloads (multi-turn function-calling, tree-of-thought, branching reasoning) that vLLM and TGI treat as independent requests.

Wins on: branching agentic workloads (often 2–5x throughput vs vLLM on workloads with heavy prefix sharing), constrained / JSON-schema decoding (xgrammar integration is first-class, not bolted on), tool-call grammar enforcement, prefix caching across speculative paths.

Loses on: docs maturity (catching up — but the framework moved faster than the docs in 2024–2025), edge-case stability under heavy load, fewer production references than vLLM, ecosystem of plugins / extensions is still small.

Pick this when: your workload is "agent makes many branching calls with the same long system prompt" rather than "user asks one long question." If you do not know the difference, your workload is probably not yet agentic enough to need SGLang — but if you are building tool-using agents, this is increasingly the answer.

§03The benchmark trap

Every framework above publishes benchmark numbers. Every set of numbers is honest in isolation and misleading in aggregate. The mechanism is consistent: a benchmark configuration optimises some axis (large batch sizes for throughput claims, single-stream for latency claims, a specific model and quantisation for absolute-numbers claims). The publishing vendor configures the axis on which they win.

This is not malice. It is the only way to publish numbers that fit on a chart. The problem is that none of those axes is necessarily yours.

Real-world performance is a function of: your model architecture (a Mixtral MoE behaves nothing like a Llama dense model), your quantisation (FP8 on H100 looks nothing like INT4 on A100), your sequence-length distribution (a workload of 100K-input/100-output prompts behaves nothing like 200-input/2000-output), your batch shape (steady-state vs spiky), and your latency budget (do you care about TTFT, ITL, or both).

the only honest benchmark Is your workload, on your model, on your hardware, against your real query mix. Set up a 24-hour replay of yesterday's production traffic against each candidate framework, with the same model and quantisation, and measure: P50 and P99 TTFT, P50 and P99 ITL, sustained QPS at the latency SLO, and cost per million output tokens. Any framework choice not grounded in this measurement is folklore. The benchmarks the vendors publish are an upper bound for advertising, not a prediction for your traffic.

§04The comparison matrix

Six rows: the four serious contenders, plus llama.cpp (the right answer for the consumer-GPU / Apple-Silicon tier) and Triton Inference Server (NVIDIA's older, model-agnostic server — useful context). Scores 1–5, hand-graded. Sort by any column. Filter by intended hardware tier.

Hardware click column to sort
Framework License Sweet spot Through-put Latency Model cov. Struct. out. Docs When to pick
Scores 1–5, hand-graded May 2026. Throughput / latency are workload-dependent; these are typical-case directionals, not absolutes. Verify on your traffic before committing.

§05The decision tree

Four questions. The framework lands at the leaf. The "why" sentence is what to defend in a design review when the next engineer asks you why you didn't pick the one their last team used.

§06What is NOT in any of these

The serving framework is one layer in a four-layer stack. Even with the right framework picked perfectly, you are not done.

LAYER 1Gateway
Request gateway / API edge. Rate limiting, authentication, multi-tenant routing, request shaping. Envoy, Kong, an API-gateway product, or a Cloudflare Worker in front of the serving framework. The serving framework's "OpenAI-compatible server" is not an API gateway — it has no rate-limit policy, no auth model beyond a token, no per-tenant quotas.
LAYER 2Serving
The framework you just picked. Loader + KV manager + scheduler + raw API. This is the one piece this tutorial is about. It is roughly one quarter of the production stack.
LAYER 3Observability
Telemetry, tracing, evals in production. OpenTelemetry GenAI spans (now standardised in 2025–26), per-request token accounting, cost-per-tenant attribution, prompt and completion logging with PII redaction, drift detection. The framework emits metrics; it does not store, dashboard, or alert on them.
LAYER 4Failover
Resilience & fallback. A second region, or a fallback to a hosted API when the self-host is down for maintenance, or a degradation strategy when the GPU pool is saturated. Self-hosting one framework in one region is a single-region single-point-of-failure architecture. Hosted APIs hide this from you; self-hosting forces you to design it explicitly.
also missing Prompt caching policy sits across layers 2 and 3 — the framework can cache the KV of a long prefix, but which prefixes are worth caching, the eviction policy under memory pressure, and how to bill it back to the right tenant are operational decisions the framework does not make for you. Capacity planning sits across layers 1 and 2 — how many H100s, what utilisation target, how to handle the 2 a.m. spike. Self-hosting trades the hosted-API price premium for the obligation to run all of this yourself.

§07The wrong-default trap

"We picked vLLM because everyone uses vLLM" is the most common framework decision I encounter, and it is right most of the time — which is why it is dangerous. It is the right answer when the workload is conventional self-hosting of a dense or popular MoE model on H100-class hardware. It is the wrong answer when:

You're at scale on the latest NVIDIA hardware. If you have committed to H100/H200 or moving to B200, and you are running >100 sustained QPS on a stable set of models, TensorRT-LLM's 30–60% efficiency advantage on those specific kernels pays back the engineering tax. The "we don't have time to learn it" excuse stops working when the GPU bill crosses six figures monthly.

Your workload is agentic. If your average request involves 5+ tool calls with the same system prompt, SGLang's cross-branch prefix sharing is genuinely architectural — not a tuning detail. You can run the same model on vLLM at one-quarter the throughput. Most teams discover this six months in, after having shipped on vLLM.

You're on consumer GPUs or Apple Silicon. vLLM does not target this tier. llama.cpp does — and is excellent at it. The decision in that tier is not "vLLM vs llama.cpp," it is "do you have a real data-centre GPU or not."

the right default vLLM until your evidence says otherwise. Then re-evaluate. The defaults are vLLM because vLLM is the lowest-risk first move for the unknown workload — broad model support, predictable behaviour, OpenAI-compatible drop-in, BSD licence. The mistake is treating vLLM as the permanent answer when your workload, after six months of growth, has shape that another framework was built for.

The takeaway — in three sentences

A serving framework is four jobs glued together — loader + KV manager + scheduler + API — and every benchmark fight is really a fight about KV management and scheduling. vLLM is the right default for the unknown workload, TensorRT-LLM for scale on the latest NVIDIA, TGI for the HuggingFace-native fast-shipping path, SGLang for branching agentic flow, and llama.cpp for everything below a real data-centre GPU.

Then build the rest of the stack — gateway, observability, failover, caching policy — because the framework is one quarter of a self-host. The benchmarks the vendors publish are an upper bound for marketing, not a prediction for your traffic; the only honest benchmark is yesterday's production replay against the candidate.