Self-Hosted vs API LLMs: The Operational Tradeoffs

The self-hosted-versus-API debate is almost always argued on the wrong axis. Someone builds a spreadsheet comparing the API’s price per million tokens against the hourly rate of a rented GPU, finds a crossover point, and declares a winner. That spreadsheet is real but incomplete, because it prices the compute and ignores the operations — and operations is where self-hosting actually gets expensive or actually pays off. This is a guide to the tradeoffs that the token-cost comparison leaves out.

What the token-cost spreadsheet gets right

There genuinely is a volume crossover. At low volume, paying per token to a frontier API is cheaper than standing up and operating GPUs, because you pay only for what you use and nothing for idle capacity or operational overhead. At high, predictable volume, owning or reserving inference capacity can be cheaper per token, because you amortize fixed hardware against a steady load instead of paying retail per-call pricing.

The crossover exists, but its location is dominated by a term the naive spreadsheet omits entirely: a competent ML infrastructure engineer’s fully-loaded cost. Below a meaningful monthly API spend, self-hosting is hard to justify on cost alone — the engineering time to build and operate the serving stack frequently exceeds the API bill you were trying to escape. The industry analyses converge on the same shape: self-hosting becomes financially rational only once your API spend is large enough that the per-token savings clearly outrun the loaded cost of the people required to run the infrastructure. Treat published break-even figures as order-of-magnitude guidance, not precise thresholds — they swing widely with model size, hardware, utilization, and salary assumptions — but the direction is robust: the engineer is usually the biggest line item, not the GPU.
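To make the effect of the omitted term concrete, here is a minimal break-even sketch. Every figure in it (GPU hourly rate, GPU count, loaded engineer cost, API price) is a hypothetical placeholder rather than a quoted price, and the model assumes the self-hosted fleet can actually serve the break-even volume, which it does not check.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_self_host_cost(gpu_hourly_usd, gpu_count,
                           engineer_monthly_usd, engineer_fraction):
    """Fixed monthly cost of self-hosting: rented GPUs plus loaded people cost."""
    gpu_cost = gpu_hourly_usd * gpu_count * HOURS_PER_MONTH
    people_cost = engineer_monthly_usd * engineer_fraction
    return gpu_cost + people_cost


def break_even_tokens(api_usd_per_million, self_host_monthly_usd):
    """Monthly token volume at which the API bill equals the self-host cost."""
    return self_host_monthly_usd / api_usd_per_million * 1_000_000


# Hypothetical inputs, chosen only to illustrate the shape of the result.
API_PRICE = 5.0  # USD per million tokens (assumed)

gpu_only = monthly_self_host_cost(2.0, 2, 20_000.0, 0.0)
with_engineer = monthly_self_host_cost(2.0, 2, 20_000.0, 1.0)

naive_break_even = break_even_tokens(API_PRICE, gpu_only)
loaded_break_even = break_even_tokens(API_PRICE, with_engineer)

print(f"GPU-only break-even:      {naive_break_even:,.0f} tokens/month")
print(f"With one engineer loaded: {loaded_break_even:,.0f} tokens/month")
```

Under these assumed inputs, adding one full-time engineer moves the break-even volume from 584 million to roughly 4.6 billion tokens per month. The exact numbers are meaningless outside the assumptions; the ratio between them is the point.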

The GPU memory math nobody budgets for

If you do self-host, the single most common sizing error is budgeting GPU memory for model weights only and ignoring the KV cache. The KV cache stores the attention keys and values for every token in every in-flight request, and it grows with sequence length and concurrency. For long-context or high-concurrency workloads, the KV cache can demand as much memory as the weights or more, and a deployment sized only for weights will fall over under real concurrent load. As the GPU-memory analysis at TianPan puts it bluntly, GPU memory planning for self-hosted LLMs is almost always wrong precisely because teams size for weights and forget the cache.
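The arithmetic behind that claim is short enough to sketch. The function below uses the standard per-token estimate for a transformer KV cache (one key and one value vector per layer per KV head). The model shape is a hypothetical one, roughly the size of an 8B-parameter model with grouped-query attention, and the workload figures are assumptions for illustration.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value tensor.
    bytes_per_elem=2 corresponds to FP16/BF16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, concurrent_requests, bytes_per_elem=2):
    """Worst-case cache size: every in-flight request at full sequence length."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len * concurrent_requests


# Hypothetical model shape and workload (assumptions, not a specific model card).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
PARAMS = 8_000_000_000

per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM,
                       seq_len=8192, concurrent_requests=32)
weights = PARAMS * 2  # FP16 weights, 2 bytes per parameter

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache at load:   {cache / GIB:.1f} GiB")
print(f"Model weights:      {weights / GIB:.1f} GiB")
```

With these assumed numbers, the worst-case cache (32 GiB) is about twice the size of the weights (about 14.9 GiB). This is an upper bound: requests rarely all sit at full context at once, but a deployment with no headroom for the cache has no margin when they do.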

This is also the operational reason serving-engine choice matters so much. vLLM built its reputation on PagedAttention (managing the KV cache in non-contiguous pages the way an OS manages virtual memory) plus continuous batching, which packs new requests into the batch as slots free up instead of waiting for the whole batch to finish. Together these dramatically raise the throughput you extract from a fixed GPU, which directly moves your cost-per-token. The vLLM-versus-TGI comparison from Alongside lands where most production guidance does in 2026: vLLM for throughput-sensitive production serving, with Hugging Face TGI favored in HF-centric workflows and tools like Ollama reserved for local development rather than production. KV-cache quantization (e.g., FP8 KV cache, available in vLLM) further changes the math for long-context workloads by shrinking the cache’s footprint. The point: when you self-host, your effective cost is set by how well your serving engine uses the hardware, not by the GPU’s sticker price.
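To see why continuous batching changes throughput, consider a toy scheduler simulation. This is not vLLM's implementation, only a sketch of the scheduling idea: each request needs some number of decode steps, the GPU has a fixed number of batch slots, and one step advances every active request by one token. The request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; short ones idle."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before the next decode step."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that finish.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: mostly short generations mixed with a few long ones.
# Assumes every request needs at least one decode step.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]

static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)

print(f"static batching:     {static_steps} steps")
print(f"continuous batching: {continuous_steps} steps")
```

On this workload the static scheduler needs 20 steps and the continuous one needs 12, because slots vacated by short requests are reused immediately instead of idling behind the longest request in the batch. Real engines add prefill cost, memory limits, and preemption, none of which this sketch models.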

The tradeoffs the spreadsheet can’t price

Beyond compute, several operational dimensions decide the choice and resist tidy dollar figures.

Who owns reliability. With an API, the provider owns uptime, scaling, and on-call for the model serving layer; you own your application. Self-hosting moves model-serving reliability onto your team — GPU node failures, OOM-under-load, autoscaling, capacity planning, and the 2 a.m. page when a traffic spike exhausts the KV cache. That on-call burden is a recurring cost the per-token comparison never shows.

Provider-side model drift, inverted. A self-hosted model is a frozen artifact: it changes only when you change it. An API model can shift underneath you when the provider updates a non-pinned alias, which is the silent quality decay problem. Self-hosting eliminates this class of drift entirely, which for some regulated or high-stakes systems is the deciding factor regardless of cost. The flip side: you also own every upgrade, including the security patches and capability improvements you’d otherwise get for free.

Data residency and privacy. For workloads that cannot send data to a third party — regulated industries, sensitive internal data — self-hosting may be a hard requirement, not an optimization. Here the decision is made on compliance grounds and the cost comparison is moot.

Capability ceiling. The largest frontier models are typically only available via API. Self-hosting usually means open-weight models, which for many tasks are entirely sufficient and for some are not. The honest question is task-specific: does an open-weight model you can actually run meet your quality bar? If only a frontier API model clears it, the decision is made for you, at least for that task.

The pattern most mature teams land on

Framing the choice as a binary (self-host or API) is itself part of the mistake. The setup that wins in practice is usually a hybrid routed at a gateway, which is the same control-plane idea our LLMOps best-practices guide recommends for cost governance generally:

Predictable high-volume traffic routes to self-hosted open-weight models on vLLM, where steady load amortizes the fixed cost and you control the artifact.
Spiky overflow traffic routes to APIs, so you do not pay to provision for peaks you rarely hit.
Frontier-capability requests route to the API model that clears the quality bar.
Privacy-sensitive requests route to the self-hosted path, by policy, regardless of cost.

A gateway makes this tractable: routing logic lives in one control plane, the application code is agnostic to where a request lands, and you can shift traffic between self-hosted and API capacity without redeploying the app. It also means the build/buy decision stops being all-or-nothing — you can self-host the slice where it clearly pays and keep the API for everything else, then move the line as your volume and capabilities evolve.
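The four routing rules above can be sketched as a single policy function. The field names, the route labels, and the in-flight capacity check are hypothetical choices made for illustration; a real gateway would pull these signals from request metadata and live load metrics.

```python
from dataclasses import dataclass

SELF_HOSTED = "self_hosted"
API = "api"


@dataclass
class Request:
    # Hypothetical request attributes a gateway might attach upstream.
    privacy_sensitive: bool = False
    needs_frontier: bool = False


def route(req, self_hosted_in_flight, self_hosted_capacity):
    """Pick a backend for one request. Rules are checked in priority order."""
    # Policy first: privacy-sensitive data never leaves the self-hosted path,
    # even if the request also asks for frontier capability or the fleet is busy.
    if req.privacy_sensitive:
        return SELF_HOSTED
    # Only the API model clears the quality bar for these requests.
    if req.needs_frontier:
        return API
    # Steady traffic stays on owned capacity while there is room.
    if self_hosted_in_flight < self_hosted_capacity:
        return SELF_HOSTED
    # Spiky overflow spills to the API instead of provisioning for peaks.
    return API


print(route(Request(), self_hosted_in_flight=3, self_hosted_capacity=10))
print(route(Request(), self_hosted_in_flight=10, self_hosted_capacity=10))
print(route(Request(needs_frontier=True), 0, 10))
print(route(Request(privacy_sensitive=True), 10, 10))
```

Putting the privacy check first is a deliberate design choice in this sketch: compliance rules should not be overridable by capacity or capability conditions further down. A saturated self-hosted fleet then has to queue or reject privacy-sensitive requests rather than leak them to a third party.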

How to actually decide

Skip straight to the questions that dominate the answer:

What’s your monthly API spend, and how steady is the load? Low or spiky volume strongly favors API. High and predictable volume opens the self-host case.
Do you have the operations capacity? Self-hosting needs real ML-infra ownership: KV-cache-aware sizing, serving-engine tuning, GPU on-call. If that capacity does not exist, or costs more than the API bill, the spreadsheet’s “savings” are imaginary.
Is there a hard requirement — data residency, frozen-model compliance, a frontier-only capability — that decides it independent of cost? If so, that requirement wins; optimize within it.
Can a hybrid capture most of the upside at a fraction of the operational risk? Usually yes, and it is the right default for teams large enough to consider self-hosting at all.

The compute cost is the part of this decision that is easiest to compute and least likely to be decisive. The operational ownership, the GPU-memory realities, the drift and compliance constraints, and the engineering headcount are what actually determine whether self-hosting is a saving or an expensive distraction. Price those first.

For tooling and platform comparisons across the serving stack, mlopsplatforms.com surveys serving frameworks and managed options, and sentryml.com covers how to instrument cost and reliability consistently across hybrid self-hosted-and-API deployments.

Sources

vLLM documentation. PagedAttention KV-cache management and continuous batching, the throughput techniques that set self-hosted cost-per-token.
Self-Hosted LLMs in Production: The GPU Memory Math (TianPan.co). Why KV-cache memory, not just model weights, governs self-hosted GPU sizing.
Why vLLM Is Better Than Hugging Face TGI (Alongside). Serving-engine comparison and where vLLM, TGI, and Ollama fit in 2026 production deployments.
