What is the cheapest LLM in 2026 by total cost of ownership?

It depends on workload shape. For stable-system-prompt classification at >500k calls/month, Gemini 2.5 Flash ($0.15/$0.60 per M tokens) is usually cheapest. For schema-strict structured output, GPT-5 wins on TCO despite higher sticker because of lower retry rates. For long-context retrieval, Gemini 2.5 Pro wins.

How much does prompt caching actually save in 2026?

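Cache reads are priced at roughly 10-25% of standard input across Anthropic, OpenAI, and Google. The saving on the cached portion is about hit rate × (1 − cache-read price ÷ input price): a 4k-token stable system prompt at a 75% hit rate and a 10% read price cuts that portion's cost by roughly two-thirds, and reaching 80-90% takes hit rates of about 90% or more.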

What is a realistic LLM retry rate in production?

8-15% across well-instrumented systems. Schema-strict workloads on the wrong model can spike to 20%+; GPT-5 with structured output mode runs 3-5%.

What is the LLM migration tax?

20-80 engineering hours to switch production models (eval + shadow + cutover + rollback). At fully-loaded $150/hr that is $3,000-12,000 per migration. Workloads below ~$1,500/month rarely pay back a migration; pick once and leave it for 18-24 months.

Should I include the OpenAI batch API in my TCO?

Only if your workload tolerates 24-hour latency. The batch APIs from OpenAI, Anthropic, and Google all give a flat 50% discount on non-real-time jobs in 2026.

Why is GPT-5 mini sometimes cheaper than Gemini Flash?

When output tokens dominate and cache is in use. Gemini Flash's $0.60 output rate beats GPT-5 mini's $1.60 by about 2.7×, but Flash's quality is lower on structured-output tasks, where its retry rate can spike to 15%+. Net TCO can flip.

Where does Claude Haiku 4 fit?

Haiku at $0.80 / $4.00 sits between the flagships and the cheap-tier; for classifier / router workloads it usually beats Sonnet on TCO.

What cache-hit rate is realistic?

For stable system-prompt + RAG workloads, 70-90%. For freeform agent loops with per-turn tool calls, 30-50%. Measure with each provider's reported cache_read counter.

Does this include fine-tuning?

No. Fine-tuning rarely pays back below ~5M monthly calls; use the fine-tune-vs-rag calculator if you are above that threshold.

How current is the pricing?

April 2026. Verified against provider public pricing pages. Re-checked monthly — sign up for the cheat sheet to get pricing-move alerts.

LLM vs LLM TCO Calculator 2026 — Full Total Cost of Ownership

The 2026 TCO comparison that actually picks the right model

Sticker-price comparisons between Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro miss somewhere between 30% and 60% of the real bill once you account for prompt cache hit rates, retry tax, eval and observability spend, and the ops hours needed to keep the integration healthy across provider updates. The calculator above models all five layers. This article explains why each one matters and how to estimate the inputs honestly.

The five layers of LLM TCO

1. Base API spend

The headline number every comparison starts with: (input_tokens × input_rate + output_tokens × output_rate) / 1M. As of April 2026 the canonical rates per million tokens are:

Model	Input $/M	Output $/M	Cache read $/M	Context
Claude Sonnet 4.5	$3.00	$15.00	$0.30	200k
Claude Opus 4.7	$15.00	$75.00	$1.50	200k
Claude Haiku 4	$0.80	$4.00	$0.08	200k
GPT-5	$5.00	$20.00	$1.25 (75% off)	400k
GPT-5 mini	$0.40	$1.60	$0.10	400k
Gemini 2.5 Pro	$1.25	$10.00	$0.3125	1M
Gemini 2.5 Flash	$0.15	$0.60	$0.0375	1M

Two structural facts from the table: output costs 4-5× input on every model except Gemini 2.5 Pro (8×), and cache reads are 4-10× cheaper than uncached input on the providers that publish them.
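The base-spend formula can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's implementation: the function names and the three-model subset are assumptions, and the rates are copied from the table above.

```python
# Rates in USD per million tokens (input, output), copied from the table above.
RATES = {
    "Claude Sonnet 4.5": (3.00, 15.00),
    "GPT-5": (5.00, 20.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
}


def base_cost(model, input_tokens, output_tokens):
    """Sticker-price cost in USD for one call: no cache, no retries."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def monthly_base_spend(model, calls, input_tokens, output_tokens):
    """Sticker-price monthly spend in USD for a fixed call shape."""
    return calls * base_cost(model, input_tokens, output_tokens)


if __name__ == "__main__":
    for name in RATES:
        spend = monthly_base_spend(name, 1_000_000, 1500, 500)
        print(f"{name}: ${spend:,.0f}/month at sticker price")
```

At 1M calls/month with 1,500 input and 500 output tokens, this puts Sonnet 4.5 at about $12,000 and Flash at about $525 before any of the other layers are applied.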

2. Prompt cache savings

Anthropic, OpenAI, and Google all support some form of prompt caching in 2026. The mechanics differ slightly (Anthropic's is contiguous-prefix; OpenAI's is automatic on stable system prompts; Google's is explicit context caching) but the economic effect is similar: cache reads cost 10-25% of standard input at the rates in the table above. The saving on the cached prefix is roughly hit rate × (1 − cache-read rate ÷ input rate). A stable 4k-token system prompt at a 70% hit rate and a 10% read price saves about 63% on that prefix; reaching 80-90% takes hit rates of about 90% or more.

The realistic cache-hit rates by workload:

Workload	Realistic cache-hit rate
Stable system prompt + variable user query	75-95%
RAG with rotating context	30-55%
Agent loop with per-step tool calls	25-45%
Conversational chatbot with growing history	55-80%
Pure batch classification on a stable prompt	85-95%

Production teams routinely undershoot their realistic cache-hit rate by 15-25 percentage points because they did not structure the prompt for contiguous-prefix caching. Putting stable content first (system prompt, then schema, then guardrails, then user data) is usually a 1-day refactor that pays for itself in a week.
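To see how hit rate and cache-read price interact, here is a small sketch. It assumes the prompt splits into a cacheable prefix and a variable tail, and it ignores any cache-write surcharge a provider may apply; the function and parameter names are illustrative only.

```python
def effective_input_cost(prefix_tokens, variable_tokens,
                         input_rate, cache_read_rate, hit_rate):
    """Input cost in USD for one call, with rates in USD per million tokens.

    Assumption: only the stable prefix is cacheable, and a cache miss is
    billed at the normal input rate (no write surcharge modelled).
    """
    cached = prefix_tokens * hit_rate
    uncached = prefix_tokens * (1 - hit_rate) + variable_tokens
    return (cached * cache_read_rate + uncached * input_rate) / 1_000_000


def prefix_saving(input_rate, cache_read_rate, hit_rate):
    """Fraction of the prefix's input cost removed by caching."""
    return hit_rate * (1 - cache_read_rate / input_rate)


if __name__ == "__main__":
    # Sonnet 4.5 rates from the table: $3.00 input, $0.30 cache read.
    for hit in (0.50, 0.75, 0.95):
        print(f"hit rate {hit:.0%}: prefix saving {prefix_saving(3.00, 0.30, hit):.1%}")
```

With the Sonnet 4.5 rates, a 75% hit rate removes about two-thirds of the prefix cost and a 95% hit rate about 86%, which is why the prompt-ordering refactor described above tends to be worth the day it takes.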

3. Retry tax

Production retry rates of 8-15% are common across well-instrumented LLM systems. The sources of retries are usually some mix of schema-validation failures (the JSON output missed a required field), guardrail violations (the response triggered a safety filter), and rate-limit transients. Every retry is paid for. The effective per-call cost is roughly base × (1 + retry_rate).
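Taking the first-order formula above at face value, the retry tax and the price premium a lower-retry model can absorb look like this. It is a sketch under the assumption that each retry costs the same as the original call and that at most one retry layer matters; the function names are hypothetical.

```python
def cost_with_retries(base_cost, retry_rate):
    """Effective per-call cost: base x (1 + retry_rate)."""
    return base_cost * (1 + retry_rate)


def breakeven_price_ratio(retry_low, retry_high):
    """How many times more the lower-retry model can cost per call
    and still tie on API spend with the higher-retry model."""
    return (1 + retry_high) / (1 + retry_low)


if __name__ == "__main__":
    print(f"{cost_with_retries(0.010, 0.10):.4f} USD per call at a 10% retry rate")
    print(f"{breakeven_price_ratio(0.03, 0.12):.3f}x break-even price ratio (3% vs 12%)")
```

On API spend alone, a 3% versus 12% retry gap covers a price premium of only about 9%. Larger ranking flips depend on what a failed call costs beyond the repeated request.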

4. Eval and observability

Production LLM systems require some form of continuous eval — a held-out test set re-run against the current model to detect regressions, plus tracing for debugging. The major tools in 2026 are LangSmith, Helicone, Braintrust, and Arize Phoenix; pricing typically ranges $400-2,500/month for serious deployments. This is provider-neutral spend (it does not change when you switch from Claude to GPT-5) but it has to appear in the TCO or the comparison gives an unrealistic absolute number.

5. Ops hours

Every production LLM integration needs ongoing engineering time: rate-limit tuning, monitoring alarms, model-version pinning, the occasional emergency rollback, and the eval-and-cutover work whenever the provider ships a new flagship. 20-40 hours per month per provider is typical for a mature integration; new integrations spike to 60-100 hours in their first quarter. At a fully-loaded $120-200/hour for senior engineers, this layer is real money.
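Layers 4 and 5 are provider-neutral, so they can be folded into one fixed monthly figure. The sketch below assumes a flat tooling bill and a single blended hourly rate; the names are illustrative.

```python
def fixed_overhead(eval_monthly, ops_hours, hourly_rate):
    """Monthly eval/observability spend plus ops engineering time, in USD."""
    return eval_monthly + ops_hours * hourly_rate


if __name__ == "__main__":
    # Inputs used by the support-chatbot example in this article:
    # $800/month eval tooling, 24 ops hours at $120/hour.
    print(fixed_overhead(800, 24, 120))
```

With $800 of eval tooling and 24 hours at $120/hour, the fixed layer comes to $3,680/month regardless of which model is chosen.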

Where TCO ranking flips vs. sticker-price ranking

Three patterns consistently flip the model ranking once you move from sticker to TCO:

Long stable system prompts. Claude Sonnet 4.5's $0.30 cache read beats GPT-5's $1.25 cache read by about 4×. Any workload with >3k tokens of stable system prompt and >70% cache hit usually ranks Sonnet 4.5 cheapest on TCO even though GPT-5 mini wins on sticker.
Schema-strict workloads. GPT-5 has the best structured-output mode in 2026; if your workload depends on strict JSON, GPT-5's lower retry rate (often 3-5% vs 10-15% on competitors) more than makes up for its higher headline rate.
Long-context retrieval. Gemini 2.5 Pro's 1M context plus aggressive context caching often wins TCO on document-QA workloads that need 100k+ input tokens per call.

Worked example — a 1M-call/month support chatbot

Concrete inputs: 1M calls/month, 1,500 input tokens (system prompt 1k + retrieved 500), 500 output tokens, 75% cache hit on the system prompt portion, 10% retry rate, $800/month eval (LangSmith), 24 ops hours/month at $120/hr.

Model	API $/mo (net of cache, with retries)	Eval+ops $/mo	Total TCO
Claude Sonnet 4.5	$10,725	$3,680	$14,405
GPT-5	$15,400	$3,680	$19,080
GPT-5 mini	$1,232	$3,680	$4,912
Gemini 2.5 Pro	$6,531	$3,680	$10,211
Gemini 2.5 Flash	$396	$3,680	$4,076

Headline: Flash and GPT-5 mini land about 20% apart on total TCO at this volume ($4,076 vs $4,912), because the fixed $3,680 of eval and ops spend dominates both totals. The decision pivots not on price but on quality match for the workload. For a customer-facing support chatbot most teams pick Sonnet 4.5 or GPT-5 because retry rates and answer quality matter more than the absolute price difference. For an internal classifier workload, Flash wins by 80%.

The migration tax — the hidden 6th layer

Switching production LLMs every 6-12 months when a better model ships costs real engineering time: 20-80 hours of eval set re-run, shadow traffic comparison, canary rollout, monitoring, and rollback plan. At $150/hour fully-loaded that is $3,000-12,000 per migration. A 15% spend cut on a $10k/month workload returns $18k over 12 months — pays back the migration even at the high end. A 15% spend cut on a $1,500/month workload returns $2,700 over 12 months and may not pay back the migration at all. Smaller workloads should pick the right-enough model and leave it for 18-24 months.
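The payback arithmetic above can be checked with a short sketch. It assumes the spend cut is constant from the first month after cutover and uses the $150/hour fully-loaded rate from the text; the function names are hypothetical.

```python
def migration_cost(hours, hourly_rate=150.0):
    """One-off engineering cost of a model switch, in USD."""
    return hours * hourly_rate


def payback_months(monthly_spend, saving_fraction, hours, hourly_rate=150.0):
    """Months of savings needed to recover the migration cost."""
    monthly_saving = monthly_spend * saving_fraction
    if monthly_saving <= 0:
        return float("inf")  # no saving means the migration never pays back
    return migration_cost(hours, hourly_rate) / monthly_saving


if __name__ == "__main__":
    for spend in (10_000, 1_500):
        for hours in (20, 80):
            months = payback_months(spend, 0.15, hours)
            print(f"${spend:,}/month, {hours}h migration: {months:.1f} months to pay back")
```

A 15% cut on $10k/month recovers an 80-hour migration in 8 months. The same cut on $1,500/month needs more than 13 months even for a 20-hour migration, and more than four years at 80 hours.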

The five operator rules for 2026 LLM TCO

Never quote sticker price above 1M calls/month. Cache + retries + ops easily move 30-60% of TCO at that volume.
Measure your actual cache-hit rate. Every provider's API now reports cache_read in the response. Use it; don't guess.
Tag retry sources. Schema-validation retries are fixable; rate-limit retries are throughput tuning. Different fixes; same effect on TCO.
Pin the model version. Provider auto-upgrades cost more in regression debugging than they save in price drops.
Re-run TCO quarterly. Prices fall ~10× per 18 months. Last quarter's winner is often this quarter's middle of the pack.
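Rules 2 and 3 amount to a small aggregation over request logs. The sketch below assumes each log record has already been normalized into a plain dict with hypothetical fields input_tokens, cached_tokens, and retry_reason. Providers name their usage fields differently and differ on whether the input count includes cached tokens, so that normalization step is left to the reader.

```python
from collections import Counter


def cache_hit_rate(records):
    """Token-weighted cache-hit rate.

    Assumption: input_tokens is the total prompt size including cached tokens.
    """
    total = sum(r["input_tokens"] for r in records)
    cached = sum(r["cached_tokens"] for r in records)
    return cached / total if total else 0.0


def retry_breakdown(records):
    """Count retries by tagged cause, e.g. 'schema', 'guardrail', 'rate_limit'."""
    return Counter(r["retry_reason"] for r in records if r.get("retry_reason"))


if __name__ == "__main__":
    # Hypothetical sample log, for illustration only.
    sample = [
        {"input_tokens": 1500, "cached_tokens": 1000, "retry_reason": None},
        {"input_tokens": 1500, "cached_tokens": 0, "retry_reason": "schema"},
        {"input_tokens": 1500, "cached_tokens": 1000, "retry_reason": "rate_limit"},
        {"input_tokens": 1500, "cached_tokens": 1000, "retry_reason": "schema"},
    ]
    print(f"cache-hit rate: {cache_hit_rate(sample):.0%}")
    print(dict(retry_breakdown(sample)))
```

Weighting by tokens rather than by calls keeps the measured rate aligned with what is actually billed.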

FAQ

Why is GPT-5 sometimes cheaper TCO than GPT-5 mini?

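When retry rates differ and each failed call costs more than the repeated request. GPT-5's stronger structured-output mode can run a 3% retry rate against GPT-5 mini's 12% on the same JSON-strict workload. On API spend alone, base × (1 + retry_rate) narrows the gap by only about 9%, far short of the 12.5× difference between the $20 and $1.60 output rates; GPT-5 comes out ahead on TCO only when failures also add latency, human review, or engineering time.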

Should I include batch-API discounts?

Yes if your workload tolerates 24-hour latency. OpenAI, Anthropic, and Google all offer flat 50% discounts on batch jobs in 2026. For bulk classification, evals, and overnight summarization, that 50% is real and the calc above does not model it (because it cannot know whether your workload is batch-tolerant). Layer it on top.

Where does fine-tuning fit?

Below ~5M calls/month, fine-tuning rarely beats a cached system prompt on TCO. Above that volume, use the fine-tune-vs-rag calculator. Custom fine-tuned models also raise the migration tax because the per-call cost gain has to amortize the training cost.

Why do you exclude egress fees?

They are usually a rounding error against the API spend for LLM workloads. Image, embedding, and video generation are different — there egress is real money.

How current are these prices?

April 2026, verified against provider public pricing pages. We re-check monthly. Sign up for the 2026 AI Pricing Cheat Sheet below to get a notification when a price moves.

The numbers in this article reflect April 2026 provider pricing. Re-run TCO quarterly; the model rankings continue to move as 2026 prices fall and new tiers ship.
