The 2026 TCO comparison that actually picks the right model
Sticker-price comparisons between Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro miss somewhere between 30% and 60% of the real bill once you account for prompt cache hit rates, retry tax, eval and observability spend, and the ops hours needed to keep the integration healthy across provider updates. The calculator above models all five layers. This article explains why each one matters and how to estimate the inputs honestly.
The five layers of LLM TCO
1. Base API spend
The headline number every comparison starts with: (input_tokens × input_rate + output_tokens × output_rate) / 1M. As of April 2026 the canonical rates per million tokens are:
| Model | Input $/M | Output $/M | Cache read $/M | Context |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | $0.30 | 200k |
| Claude Opus 4.7 | $15.00 | $75.00 | $1.50 | 200k |
| Claude Haiku 4 | $0.80 | $4.00 | $0.08 | 200k |
| GPT-5 | $5.00 | $20.00 | $1.25 (75% off) | 400k |
| GPT-5 mini | $0.40 | $1.60 | $0.10 | 400k |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.3125 | 1M |
| Gemini 2.5 Flash | $0.15 | $0.60 | $0.0375 | 1M |
Two structural facts stand out in this table: output costs 4-8× input (4-5× for every model except Gemini 2.5 Pro, which sits at 8×), and cache reads cost 4-10× less than uncached input (10× on the Claude models, 4× on GPT-5 and Gemini).
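The base-spend formula can be sketched in a few lines of Python. The function name and the 1,500-in / 500-out call shape are illustrative assumptions; the rates are the Claude Sonnet 4.5 row from the table.

```python
def base_api_spend(input_tokens: int, output_tokens: int,
                   input_rate: float, output_rate: float) -> float:
    """Dollar cost of one call; rates are in $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Claude Sonnet 4.5 rates from the table: $3.00 input, $15.00 output.
per_call = base_api_spend(1_500, 500, input_rate=3.00, output_rate=15.00)
monthly = per_call * 1_000_000  # assumed volume: 1M calls/month

print(f"${per_call:.4f} per call, ${monthly:,.0f} per month before caching and retries")
```

This is the sticker-price layer only; the layers below adjust it.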
2. Prompt cache savings
Anthropic, OpenAI, and Google all support some form of prompt caching in 2026. The mechanics differ slightly (Anthropic's is contiguous-prefix; OpenAI's is automatic on stable system prompts; Google's is explicit context caching) but the economic effect is similar: cache reads cost 10-25% of standard input at the rates listed above. The saving applies only to the cached prefix and scales with the hit rate. For a stable 4k-token system prompt read at 10% of the input price, a 70% hit rate cuts the cost of those prefix tokens by about 63%, and reaching an 80%+ reduction takes a hit rate of roughly 90% or higher.
The realistic cache-hit rates by workload:
| Workload | Realistic cache-hit rate |
|---|---|
| Stable system prompt + variable user query | 75-95% |
| RAG with rotating context | 30-55% |
| Agent loop with per-step tool calls | 25-45% |
| Conversational chatbot with growing history | 55-80% |
| Pure batch classification on a stable prompt | 85-95% |
Production teams routinely undershoot their realistic cache-hit rate by 15-25 percentage points because they did not structure the prompt for contiguous-prefix caching. Putting stable content first (system prompt, then schema, then guardrails, then user data) is usually a 1-day refactor that pays for itself in a week.
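To see how the hit rate drives the saving, here is a minimal sketch of the blended input rate on the cacheable prefix. It assumes cache misses are billed at the normal input rate and ignores any cache-write surcharge a provider may apply; the function names are illustrative, not part of the calculator.

```python
def blended_input_rate(input_rate: float, cache_read_rate: float,
                       hit_rate: float) -> float:
    """Average $/M paid for tokens in the cacheable prefix."""
    return hit_rate * cache_read_rate + (1 - hit_rate) * input_rate


def prefix_saving(input_rate: float, cache_read_rate: float,
                  hit_rate: float) -> float:
    """Fraction saved on the cacheable prefix versus full input price."""
    return 1 - blended_input_rate(input_rate, cache_read_rate, hit_rate) / input_rate


# Claude Sonnet 4.5 rates from the table: $3.00 input, $0.30 cache read.
for hit_rate in (0.50, 0.70, 0.95):
    rate = blended_input_rate(3.00, 0.30, hit_rate)
    saving = prefix_saving(3.00, 0.30, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${rate:.3f}/M on the prefix, saving {saving:.1%}")
```

Tokens outside the cached prefix (the user query, rotating RAG context) still pay the full input rate, so the saving on the whole input bill is smaller than the prefix figure.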
3. Retry tax
Production retry rates of 8-15% are common across well-instrumented LLM systems. The sources of retries are usually some mix of schema-validation failures (the JSON output missed a required field), guardrail violations (the response triggered a safety filter), and rate-limit transients. Every retry is paid for. The effective per-call cost is roughly base × (1 + retry_rate).
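A short sketch of the retry multiplier follows. The first function is the approximation stated above. The second is an extension not in the article: it assumes every attempt, including each retry, fails independently with the same probability, in which case the expected number of attempts is 1 / (1 - p).

```python
def cost_with_retries(base_cost: float, retry_rate: float) -> float:
    """Approximation from the text: each failed call is retried once."""
    return base_cost * (1 + retry_rate)


def cost_if_retries_can_fail(base_cost: float, failure_rate: float) -> float:
    """Assumed model: retry until success, every attempt fails with the same probability."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return base_cost / (1 - failure_rate)


base = 10_000.0  # illustrative monthly API spend before retries
for rate in (0.03, 0.10, 0.15):
    simple = cost_with_retries(base, rate)
    repeated = cost_if_retries_can_fail(base, rate)
    print(f"retry rate {rate:.0%}: ${simple:,.0f} (single retry) vs ${repeated:,.0f} (retry until success)")
```

At retry rates under about 15% the two models differ by a few percent at most, which is why the simpler multiplier is usually good enough.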
4. Eval and observability
Production LLM systems require some form of continuous eval — a held-out test set re-run against the current model to detect regressions, plus tracing for debugging. The major tools in 2026 are LangSmith, Helicone, Braintrust, and Arize Phoenix; pricing typically ranges $400-2,500/month for serious deployments. This is provider-neutral spend (it does not change when you switch from Claude to GPT-5) but it has to appear in the TCO or the comparison gives an unrealistic absolute number.
5. Ops hours
Every production LLM integration needs ongoing engineering time: rate-limit tuning, monitoring alarms, model-version pinning, the occasional emergency rollback, and the eval-and-cutover work whenever the provider ships a new flagship. 20-40 hours per month per provider is typical for a mature integration; new integrations spike to 60-100 hours in their first quarter. At a fully-loaded $120-200/hour for senior engineers, this layer is real money.
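The two fixed layers (eval tooling and ops hours) reduce to simple arithmetic. The sketch below plugs in the low and high ends of the ranges quoted above; the function name is illustrative.

```python
def fixed_overhead(eval_monthly: float, ops_hours: float, hourly_rate: float) -> float:
    """Monthly spend that does not scale with call volume."""
    return eval_monthly + ops_hours * hourly_rate


# Low and high ends of the ranges quoted in the text, for a mature integration.
low = fixed_overhead(eval_monthly=400, ops_hours=20, hourly_rate=120)
high = fixed_overhead(eval_monthly=2_500, ops_hours=40, hourly_rate=200)

print(f"fixed overhead: ${low:,.0f} to ${high:,.0f} per month")
```

Because this overhead is flat, it dominates the bill for small workloads and fades for large ones.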
Where TCO ranking flips vs. sticker-price ranking
Three patterns consistently flip the model ranking once you move from sticker to TCO:
- Long stable system prompts. Claude Sonnet 4.5's $0.30 cache read is roughly 4× cheaper than GPT-5's $1.25 cache read. Any workload with >3k tokens of stable system prompt and >70% cache hit usually ranks Sonnet 4.5 cheapest on TCO even though GPT-5 mini wins on sticker.
- Schema-strict workloads. GPT-5 has the best structured-output mode in 2026; if your workload depends on strict JSON, GPT-5's lower retry rate (often 3-5% vs 10-15% on competitors) more than makes up for its higher headline rate.
- Long-context retrieval. Gemini 2.5 Pro's 1M context plus aggressive context caching often wins TCO on document-QA workloads that need 100k+ input tokens per call.
Worked example — a 1M-call/month support chatbot
Concrete inputs: 1M calls/month, 1,500 input tokens (system prompt 1k + retrieved 500), 500 output tokens, 75% cache hit on the system prompt portion, 10% retry rate, $800/month eval (LangSmith), 24 ops hours/month at $120/hr.
| Model | API $/mo (net of cache, with retries) | Eval+ops $/mo | Total TCO |
|---|---|---|---|
| Claude Sonnet 4.5 | $10,973 | $3,680 | $14,653 |
| GPT-5 | $16,156 | $3,680 | $19,836 |
| GPT-5 mini | $1,293 | $3,680 | $4,973 |
| Gemini 2.5 Pro | $6,789 | $3,680 | $10,469 |
| Gemini 2.5 Flash | $485 | $3,680 | $4,165 |
Headline: Flash and GPT-5 mini land within about 20% of each other on total TCO at this volume, because the fixed $3,680 of eval and ops spend dominates both bills. Between those two, the decision pivots on quality match for the workload rather than on price. For a customer-facing support chatbot, most teams pick Sonnet 4.5 or GPT-5 because retry rates and answer quality matter more than the absolute price difference. For an internal classifier workload, Flash comes in roughly 80% below GPT-5 on total TCO.
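The worked example can be recomputed from the stated inputs with a short script. This is a sketch under explicit assumptions: the 75% hit rate applies only to the 1k-token system prompt, cache hits are billed at the cache-read rate, and the 10% retry rate multiplies the whole API bill. The calculator may treat caching or retries differently, so read the output as an independent estimate.

```python
# $/M tokens as (input, output, cache read), taken from the rate table.
RATES = {
    "Claude Sonnet 4.5": (3.00, 15.00, 0.30),
    "GPT-5": (5.00, 20.00, 1.25),
    "GPT-5 mini": (0.40, 1.60, 0.10),
    "Gemini 2.5 Pro": (1.25, 10.00, 0.3125),
    "Gemini 2.5 Flash": (0.15, 0.60, 0.0375),
}

CALLS = 1_000_000
SYSTEM_TOKENS, RETRIEVED_TOKENS, OUTPUT_TOKENS = 1_000, 500, 500
HIT_RATE, RETRY_RATE = 0.75, 0.10
EVAL_MONTHLY, OPS_HOURS, OPS_RATE = 800, 24, 120


def monthly_api_spend(input_rate: float, output_rate: float, cache_rate: float) -> float:
    """API $/month net of cache savings, including the retry multiplier."""
    system = SYSTEM_TOKENS * (HIT_RATE * cache_rate + (1 - HIT_RATE) * input_rate)
    retrieved = RETRIEVED_TOKENS * input_rate  # rotating context, assumed uncached
    output = OUTPUT_TOKENS * output_rate
    per_call = (system + retrieved + output) / 1_000_000
    return per_call * CALLS * (1 + RETRY_RATE)


overhead = EVAL_MONTHLY + OPS_HOURS * OPS_RATE
tco = {model: monthly_api_spend(*rates) + overhead for model, rates in RATES.items()}

for model, total in sorted(tco.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} API ${total - overhead:>9,.0f}   total ${total:>9,.0f}")
```

Changing `HIT_RATE`, `RETRY_RATE`, or the token counts is a quick way to see which layer your own ranking is most sensitive to.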
The migration tax — the hidden 6th layer
Switching production LLMs every 6-12 months when a better model ships costs real engineering time: 20-80 hours of eval set re-run, shadow traffic comparison, canary rollout, monitoring, and rollback plan. At $150/hour fully-loaded that is $3,000-12,000 per migration. A 15% spend cut on a $10k/month workload returns $18k over 12 months — pays back the migration even at the high end. A 15% spend cut on a $1,500/month workload returns $2,700 over 12 months and may not pay back the migration at all. Smaller workloads should pick the right-enough model and leave it for 18-24 months.
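The payback arithmetic above can be expressed as a small helper. The function name is illustrative; the $150/hour rate and the 15% saving are the figures used in the text.

```python
def payback_months(monthly_spend: float, saving_fraction: float,
                   migration_hours: float, hourly_rate: float = 150.0) -> float:
    """Months of savings needed to recover the one-off migration cost."""
    monthly_saving = monthly_spend * saving_fraction
    if monthly_saving <= 0:
        raise ValueError("migration never pays back without a positive saving")
    return migration_hours * hourly_rate / monthly_saving


for spend in (10_000, 1_500):
    best = payback_months(spend, 0.15, migration_hours=20)
    worst = payback_months(spend, 0.15, migration_hours=80)
    print(f"${spend:,}/month workload: payback in {best:.1f} to {worst:.1f} months")
```

If the payback period is longer than the interval at which you expect to switch models again, the migration does not earn back its cost.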
The five operator rules for 2026 LLM TCO
- Never quote sticker price above 1M calls/month. Cache + retries + ops easily move 30-60% of TCO at that volume.
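- Measure your actual cache-hit rate. Every major provider's API now reports cached-token counts in the response's usage data (for example, cache_read_input_tokens on Anthropic and cached_tokens on OpenAI). Use them; don't guess.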
- Tag retry sources. Schema-validation retries are fixable; rate-limit retries are throughput tuning. Different fixes; same effect on TCO.
- Pin the model version. Provider auto-upgrades cost more in regression debugging than they save in price drops.
- Re-run TCO quarterly. Prices fall ~10× per 18 months. Last quarter's winner is often this quarter's middle of the pack.
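The second and third rules can be sketched against your own request logs. The record shape below is hypothetical: `input_tokens`, `cached_tokens`, and `retry_reason` are assumed names for a normalized log entry, and `input_tokens` is assumed to include the cached tokens. Providers name and count these fields differently, so map each provider's usage data into this shape first.

```python
from collections import Counter


def cache_hit_rate(calls: list[dict]) -> float:
    """Token-weighted hit rate: cached input tokens / all input tokens."""
    cached = sum(c["cached_tokens"] for c in calls)
    total = sum(c["input_tokens"] for c in calls)
    return cached / total if total else 0.0


def retry_sources(calls: list[dict]) -> Counter:
    """Count retries by tagged cause, skipping calls that were not retries."""
    return Counter(c["retry_reason"] for c in calls if c["retry_reason"])


# Hypothetical normalized log records.
calls = [
    {"input_tokens": 1500, "cached_tokens": 1000, "retry_reason": None},
    {"input_tokens": 1500, "cached_tokens": 0, "retry_reason": "schema_validation"},
    {"input_tokens": 1500, "cached_tokens": 1000, "retry_reason": "rate_limit"},
    {"input_tokens": 1500, "cached_tokens": 1000, "retry_reason": None},
]

print(f"cache-hit rate: {cache_hit_rate(calls):.0%}")
print(f"retry sources: {dict(retry_sources(calls))}")
```

Note that this is a token-weighted rate over all input tokens; a per-call hit rate on the system prompt alone, as used in the worked example, will be higher for the same traffic.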
FAQ
Why is GPT-5 sometimes cheaper TCO than GPT-5 mini?
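When retry rates differ and failures are expensive to handle. GPT-5's stronger structured-output mode can run a 3% retry rate against GPT-5 mini's 12% on the same JSON-strict workload. Token cost alone does not close the gap ($20 × 1.03 is still far above $1.60 × 1.12 per million output tokens), so GPT-5 only comes out ahead once the ops hours spent triaging failed calls are counted in the TCO.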
Should I include batch-API discounts?
Yes, if your workload tolerates 24-hour latency. OpenAI, Anthropic, and Google all offer flat 50% discounts on batch jobs in 2026. For bulk classification, evals, and overnight summarization, that 50% is real, but the calculator above does not model it because it cannot know whether your workload is batch-tolerant. Layer it on top.
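Layering the batch discount on top is a one-line adjustment. This sketch assumes you can estimate the fraction of API spend that is batch-tolerant; the function name and the 40% figure are illustrative.

```python
def spend_with_batch(monthly_api_spend: float, batch_fraction: float,
                     batch_discount: float = 0.50) -> float:
    """API spend after moving a fraction of traffic to a discounted batch API."""
    if not 0 <= batch_fraction <= 1:
        raise ValueError("batch_fraction must be between 0 and 1")
    return monthly_api_spend * (1 - batch_fraction * batch_discount)


# Illustrative: $10,000/month API spend with 40% of it batch-tolerant.
print(f"${spend_with_batch(10_000, 0.40):,.0f} per month after batch discount")
```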
Where does fine-tuning fit?
Below ~5M calls/month, fine-tuning rarely beats a cached system prompt on TCO. Above that volume, use the fine-tune-vs-rag calculator. Custom fine-tuned models also raise the migration tax because the per-call cost gain has to amortize the training cost.
Why do you exclude egress fees?
They are usually a rounding error against the API spend for LLM workloads. Image, embedding, and video generation are different — there egress is real money.
How current are these prices?
April 2026, verified against provider public pricing pages. We re-check monthly.
The numbers in this article reflect April 2026 provider pricing. Re-run TCO quarterly; the model rankings continue to move as 2026 prices fall and new tiers ship.