Why "cheapest per million tokens" is usually the wrong question
Price arbitrage between Claude, GPT, and Gemini is the oldest game in the LLM economy, and it is mostly played badly. Teams pick Gemini 2.5 Flash at $0.15/M input because it is 20× cheaper than Claude Sonnet 4.5, and then discover three weeks later that (a) it needed 2× the prompt tokens to hit the same accuracy, (b) structured output errors doubled, and (c) the eval harness had to be rebuilt when they switched. Sticker price is the least important input to a real model-selection decision.
The honest comparison framework has five axes, in descending order of importance:
- Quality on your task — measured by pass rate on a private eval set of 50–200 inputs, not MMLU or Arena.
- Total cost per successful task — includes retries, schema failures, and the prompt-token premium a weaker model demands.
- Latency at P95 — the number users actually feel.
- Operational stability — rate limits, regional availability, provider outage history.
- Lock-in cost — tool-use schema portability, system-prompt dialect, streaming semantics.
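The second axis — total cost per successful task — is the one spreadsheets always miss, and it is a one-function calculation. A minimal sketch; the success rates and the 2× prompt-token premium below are illustrative assumptions, not measurements:

```python
def cost_per_success(price_in, price_out, tokens_in, tokens_out, success_rate):
    """Effective cost per *successful* task, assuming failed calls are
    retried until one succeeds (expected attempts = 1 / success_rate)."""
    per_call = (tokens_in * price_in + tokens_out * price_out) / 1e6
    return per_call / success_rate

# Hypothetical numbers: the premium model succeeds 98% of the time on a
# 1,500-token prompt; the cheap model needs a 2x longer prompt to reach
# 80% on the same task.
premium = cost_per_success(3.00, 15.00, 1_500, 400, 0.98)
cheap   = cost_per_success(0.15, 0.60, 3_000, 400, 0.80)
print(f"premium: ${premium:.5f}/success, cheap: ${cheap:.5f}/success")
print(f"effective gap: {premium / cheap:.0f}x (vs 20x on sticker price)")
```

Under these assumptions the 20× sticker-price gap shrinks to about 12× per successful task — still a real win for the cheap model here, but far from the headline number, and the gap keeps closing as retry and prompt premiums grow.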
April 2026 snapshot: who wins what
| Task class | Current best | Runner-up | Why |
|---|---|---|---|
| Agentic tool-use / multi-step | Claude Sonnet 4.5 | GPT-5 | Better planning + lower tool-call hallucination rate. |
| Long-context Q&A (100k+) | Gemini 2.5 Pro | Claude Sonnet 4.5 | Gemini's 1M context + cheap input rate. |
| Structured output / JSON | GPT-5 | Gemini 2.5 Pro | Strict schema mode is mature. |
| Code generation | Claude Sonnet 4.5 | GPT-5 | Leads HumanEval+ and SWE-bench Verified. |
| Bulk classification / extraction | Gemini 2.5 Flash | GPT-5 mini | Throughput + $0.15/M input. |
| Creative / marketing copy | Claude Opus 4.1 | GPT-5 | Voice diversity; less purple prose. |
| Vision + document parsing | Gemini 2.5 Pro | GPT-5 | PDF layout handling is noticeably better. |
Where the calculator is most useful
Use the tool above when you have two or three candidates already filtered on quality and want to see the dollar delta at your volume. A typical finding: GPT-5 at $5/$20 and Claude Sonnet 4.5 at $3/$15 land within about 50% of each other on a workload with 1,500 input / 400 output tokens — real money at scale, but small enough that your own eval variance can swamp it. At that point, pick on quality and reliability, not price.
Where the delta is large and real is on output-heavy tasks: a summarizer generating 1,500 output tokens per call pays nearly 4× more per output token on Claude Opus 4.1 ($75/M) than on GPT-5 ($20/M). If quality converges, the $55/M output-token gap matters.
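The arithmetic behind those comparisons is one line. A sketch using the per-million-token rates quoted in this article (GPT-5 $5/$20, Sonnet 4.5 $3/$15, Opus 4.1 $15/$75):

```python
def monthly_cost(price_in, price_out, tokens_in, tokens_out, requests):
    """Monthly spend in dollars, prices in $ per million tokens."""
    return requests * (tokens_in * price_in + tokens_out * price_out) / 1e6

wl = dict(tokens_in=1_500, tokens_out=400, requests=1_000_000)
print(monthly_cost(5, 20, **wl))   # GPT-5:      15500.0
print(monthly_cost(3, 15, **wl))   # Sonnet 4.5: 10500.0
print(monthly_cost(15, 75, **wl))  # Opus 4.1:   52500.0
```

Swap in your own token counts before trusting any conclusion — the input/output ratio moves the ranking more than the rate card does.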
Provider-specific traps worth knowing
- Anthropic prompt caching charges 25% extra to write the cache and drops reads to 10% of the normal input price. At those rates a single hit already recoups the write premium, but entries that expire unread are pure loss — short-TTL, low-traffic prompts can end up worse off caching.
- OpenAI structured output with response_format: json_schema is strict but adds 10–20% latency to first-token generation (schema compilation).
- Gemini context caching is billed per minute of storage, not per read. For workloads under 1 request/min on the cached content, it loses money.
- Vertex AI vs Gemini API pricing differs — Vertex adds a modest premium but buys regional SLAs, audit logs, and VPC-SC. If you need SOC 2, use Vertex.
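The Gemini storage-fee trap reduces to a break-even formula: each cache read saves the difference between the full and discounted input rate on the cached tokens, and that saving must outrun the per-minute storage meter. A sketch — the $0.001/min storage fee is a placeholder assumption, not a published rate:

```python
def breakeven_qpm(cached_tokens, input_price_m, read_price_m, storage_fee_per_min):
    """Requests-per-minute at which per-minute cache storage pays for
    itself. Prices are $ per million tokens; fee is $ per minute."""
    savings_per_read = cached_tokens * (input_price_m - read_price_m) / 1e6
    return storage_fee_per_min / savings_per_read

# 1,000 cached tokens at Gemini 2.5 Pro rates ($1.25/M input, 75% read
# discount -> $0.3125/M) and an assumed $0.001/min storage fee:
print(f"break-even: {breakeven_qpm(1_000, 1.25, 0.3125, 0.001):.2f} req/min")
```

The structure is what matters: bigger cached prompts lower the break-even QPM, and cheaper input rates raise it — which is why storage-billed caching punishes small prompts on cheap models hardest.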
Worked example: the same task across three providers
Take a typical production synthesis workload — 1,500 input tokens, 400 output tokens, 1M requests per month. Here is what the invoice looks like on each major option, before and after the easy optimizations:
| Model | Raw input | Raw output | Total/mo | With caching |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $4,500 | $6,000 | $10,500 | $7,350 (30% off) |
| Claude Opus 4.1 | $22,500 | $30,000 | $52,500 | $39,750 (24% off) |
| Claude Haiku 4 | $1,200 | $1,600 | $2,800 | $2,080 (26% off) |
| GPT-5 | $7,500 | $8,000 | $15,500 | $11,750 (24% off, 50% cache) |
| GPT-5 mini | $600 | $640 | $1,240 | $970 (22% off) |
| Gemini 2.5 Pro | $1,875 | $4,000 | $5,875 | $4,940 (16% off, 75% cache) |
| Gemini 2.5 Flash | $225 | $240 | $465 | $408 (12% off) |
The caching column assumes a 1,000-token shared system prompt (67% of input is cacheable) with a realistic 85% hit rate at this volume. Note three things that blow up spreadsheet comparisons: Anthropic's 90% discount on cache reads gives it the biggest absolute savings, OpenAI's 50% cache discount is more modest but comes free with no cache-write tax, and Gemini's 75% read discount is eroded by the per-minute storage fee if your QPM drops below ~60 on the cached content.
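The raw columns of the table fall out of the rate card directly. A sketch that reproduces them from the per-million rates used above; the caching column is omitted because it depends on hit-rate assumptions not modeled here:

```python
# ($/M input, $/M output) as used in the table above.
RATES = {
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Claude Opus 4.1":   (15.00, 75.00),
    "Claude Haiku 4":    (0.80, 4.00),
    "GPT-5":             (5.00, 20.00),
    "GPT-5 mini":        (0.40, 1.60),
    "Gemini 2.5 Pro":    (1.25, 10.00),
    "Gemini 2.5 Flash":  (0.15, 0.60),
}

REQS, TOK_IN, TOK_OUT = 1_000_000, 1_500, 400
for model, (p_in, p_out) in RATES.items():
    raw_in = REQS * TOK_IN * p_in / 1e6
    raw_out = REQS * TOK_OUT * p_out / 1e6
    print(f"{model:18s} in ${raw_in:>8,.0f}  out ${raw_out:>8,.0f}"
          f"  total ${raw_in + raw_out:>8,.0f}")
```

Keeping this as a script rather than a spreadsheet makes it trivial to re-run against your real token distribution instead of the round numbers used here.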
Cache math on a 1,000-token system prompt at 200,000 QPM
A concrete example of how the three cache models diverge at real production volume. Imagine a routing service that runs 200,000 requests per minute — roughly 8.6B requests per month — each with a 1,000-token system prompt plus 500 tokens of variable user content and 200 tokens of output.
- Anthropic (Claude Sonnet 4.5, 90% cache-read discount): system prompt tokens per month = 8.64B × 1,000 = 8.64T. Cache miss rate at this QPM is ~3%; write cost = 8.64T × 0.03 × $3.75/M = $972K. Cache reads = 8.64T × 0.97 × $0.30/M = $2.51M. User content uncached = 8.64B × 500 × $3/M = $12.96M. Output = 8.64B × 200 × $15/M = $25.92M. Monthly total: $42.36M. Without caching, input alone would be $38.88M — caching saves $22.4M/month.
- OpenAI (GPT-5, 50% cache-read discount): same volume. Cache reads = 8.64T × 0.97 × $2.50/M = $20.95M. Cache misses at full price = 8.64T × 0.03 × $5/M = $1.30M. User content = 8.64B × 500 × $5/M = $21.6M. Output = 8.64B × 200 × $20/M = $34.56M. Monthly: $78.41M. Savings vs. uncached: $21M/mo.
- Gemini 2.5 Pro (75% discount): cache reads = 8.64T × 0.97 × $0.3125/M = $2.62M. Cache misses = 8.64T × 0.03 × $1.25/M = $0.32M. User content = 8.64B × 500 × $1.25/M = $5.4M. Output = 8.64B × 200 × $10/M = $17.28M. Monthly: $25.62M, plus a per-minute storage fee that amounts to a few hundred dollars a month at this QPM.
None of these volumes are realistic for most products, but the structure generalizes: at any serious volume, cache-read discounts dwarf every other optimization, and Anthropic's aggressive 90% discount closes most of the headline-price gap with Gemini. If you are a real 200k-QPM consumer, you should also be evaluating self-hosted inference — but among managed options, the picture is much flatter than rate-card comparisons suggest.
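The Anthropic line above can be reproduced in a few lines, which also makes it easy to re-run with your own miss rate and token mix (totals differ from the prose only by rounding):

```python
QPM = 200_000
REQS = QPM * 60 * 24 * 30              # ~8.64B requests/month
SYS_TOK, USER_TOK, OUT_TOK = 1_000, 500, 200
MISS_RATE = 0.03                       # assumed cache-miss rate at this QPM

def anthropic_monthly(p_in, p_out, write_mult=1.25, read_mult=0.10):
    """Dollar breakdown (cache writes, cache reads, uncached user input,
    output) for the Claude Sonnet 4.5 caching scenario above."""
    sys_tokens = REQS * SYS_TOK
    writes = sys_tokens * MISS_RATE * p_in * write_mult / 1e6
    reads  = sys_tokens * (1 - MISS_RATE) * p_in * read_mult / 1e6
    user   = REQS * USER_TOK * p_in / 1e6
    out    = REQS * OUT_TOK * p_out / 1e6
    return writes, reads, user, out

w, r, u, o = anthropic_monthly(3.00, 15.00)
print(f"writes ${w/1e6:.2f}M  reads ${r/1e6:.2f}M  user ${u/1e6:.2f}M  "
      f"output ${o/1e6:.2f}M  total ${(w + r + u + o)/1e6:.2f}M")
```

Swapping write_mult, read_mult, and the miss rate reproduces the OpenAI and Gemini variants (OpenAI: no write premium, 50% read discount; Gemini: 75% read discount plus the storage fee modeled separately).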
When Haiku 4 actually beats Sonnet 4.5
Four patterns where we route to Haiku 4 every time and do not even A/B test anymore:
- Single-step extractions under 500 output tokens: extract the customer's name, company, and intent from an inbound email. Pull the amount and due date from an invoice. Identify the three most relevant tags from a fixed list. Haiku hits 95%+ of Sonnet's accuracy on these at a quarter of the price.
- Intent classification / routing: routing a user message to one of 12 downstream handlers. This is literally what Haiku was built for; it is faster, cheaper, and well within accuracy tolerance.
- Cheap first-pass summarization: a 200-token TL;DR of a 2,000-token message for a notification. Nobody is going to notice Sonnet's prose being marginally better.
- PII scrubbing / redaction: replace names, emails, and credit card numbers with placeholders. Deterministic schema, no reasoning needed.
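In practice these four patterns live in a static routing table rather than a per-request decision. A minimal sketch — the task labels and model-name strings are illustrative, not a real API:

```python
# Hypothetical task-class -> model routing table. Classes where the
# cheap model is known-good skip A/B testing entirely.
ROUTES = {
    "extract_fields":  "claude-haiku-4",
    "classify_intent": "claude-haiku-4",
    "tldr_summary":    "claude-haiku-4",
    "pii_redaction":   "claude-haiku-4",
    "code_review":     "claude-sonnet-4.5",
}

def pick_model(task_class: str) -> str:
    # Unknown task classes fall through to the stronger default,
    # so new features fail expensive rather than fail wrong.
    return ROUTES.get(task_class, "claude-sonnet-4.5")
```

The fail-expensive default is the important design choice: a new task class costs you Sonnet money until someone proves Haiku is good enough, never the reverse.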
When Sonnet 4.5 beats Opus 4.1
The honest answer: 99% of general coding tasks. SWE-bench Verified differences between Sonnet 4.5 and Opus 4.1 are inside the noise band on the kinds of small refactors, bug-fixes, and feature additions that make up most developer work. The quality gap shows up only on genuinely hard reasoning problems — architecture debates, concurrency analysis, tricky algorithmic work. For the average pull-request review, diff summarization, or test generation, Sonnet is the right default and Opus is a 5× cost increase for a quality improvement nobody notices.
The specific places we still reach for Opus: legal clause analysis where a mistake has compliance consequences, multi-hop research agents that need to reason over conflicting evidence, and complex planning tasks where the plan itself is the deliverable.
Latency at P95 — the number users actually feel
Streaming TTFT (time to first token) and per-token latency matter more than raw throughput for most user-facing products. April 2026 steady-state on warm connections:
| Model | TTFT P50 | Per-token P50 | Per-token P95 |
|---|---|---|---|
| Haiku 4 | ~280ms | ~35ms | ~70ms |
| Sonnet 4.5 | ~550ms | ~80ms | ~160ms |
| Opus 4.1 | ~900ms | ~60ms | ~140ms |
| GPT-5 | ~400ms | ~55ms | ~130ms |
| GPT-5 mini | ~220ms | ~30ms | ~80ms |
| Gemini 2.5 Pro | ~320ms | ~70ms | ~170ms |
| Gemini 2.5 Flash | ~180ms | ~150ms | ~380ms |
Counterintuitively, Opus emits tokens slightly faster than Sonnet once it gets going, but carries a larger setup penalty — a 300-token response on Opus actually finishes sooner at P50 (~18.9s vs ~24.6s), yet feels slower, because users judge responsiveness by the wait before the first token appears. Flash has the fastest TTFT of any frontier model but the slowest per-token rate in the group: great for short replies, painful for 2,000-token responses.
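The Opus-vs-Sonnet tension is easiest to see by computing both numbers users could care about — the wait for the first token and the total completion time — from the P50 figures in the table:

```python
def felt_latency(ttft_ms, per_token_ms, n_tokens):
    """Returns (wait for first token, total completion time) in ms."""
    return ttft_ms, ttft_ms + per_token_ms * n_tokens

# P50 figures from the table above, for a 300-token response:
print(felt_latency(900, 60, 300))  # Opus 4.1   -> (900, 18900)
print(felt_latency(550, 80, 300))  # Sonnet 4.5 -> (550, 24550)
```

Opus completes the response more than five seconds sooner, yet the user stares at a blank stream 64% longer before anything appears. By these numbers Opus's total overtakes Sonnet's after only (900 − 550) / (80 − 60) ≈ 18 tokens — which is exactly why TTFT, not throughput, is the number to optimize for chat-style products.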
Frequently asked questions
Which provider wins overall? None of them. Multi-provider is the correct answer for any serious production system. Anthropic and OpenAI leapfrog each other every few months on capability; Gemini owns the long-context and cheap-bulk ends; the right architecture routes within and across them.
How much quality does a private eval actually add? On almost every real client engagement we have run, the private eval swings the recommendation by at least one tier. MMLU-Pro and Arena rankings have almost no predictive power for specific workloads.
Should I avoid Opus entirely for consumer apps? Not entirely, but default no. The 5× price for a rarely-visible quality gain is bad math for most B2C products. Exceptions exist — a premium tier where users explicitly pay for better quality, or research-style workflows where the user tolerates a wait.
What about Chinese labs — DeepSeek, Qwen, Kimi? Price/performance is competitive and in some cases superior, especially on coding and reasoning. For commercial deployment in the US and EU, enterprise data-handling and compliance posture is the blocker, not capability. If you can run them on your own hardware or via a US-based inference host, they are a real option.
How do I compare without committing to a provider? Use a gateway like OpenRouter, Vercel AI Gateway, or LiteLLM to test all candidates behind a single API. Instrument token usage, latency, and failure rate; swap the underlying model with one environment variable. Total setup time is an afternoon.
Does prompt caching work on every model? On Anthropic, yes — explicit cache_control. On OpenAI, automatic for prefixes ≥1024 tokens. On Gemini, explicit cachedContents API. On Bedrock and Vertex, it is model-by-model. Check before you commit a deployment to the caching discount.
Will prices drop further? Yes, slowly. The rate of decline has shifted from the 10× per year of 2023 to about 2× per year in 2025–2026 as frontier models have gotten more capable and compute constraints have tightened. Build assuming today's prices; pleasant surprises are a bonus.
Is GPT-5 mini really competitive with Haiku 4? On single-shot extraction, yes — both are well above the accuracy threshold for most routing tasks. GPT-5 mini has slightly better instruction following; Haiku 4 has the edge on calibrated refusal and longer-context input. For bulk workloads, run the private eval.
Related tools
- LLM API cost calculator — compute monthly spend for one chosen model.
- Prompt cache savings — add the 10× cached-input discount to this comparison.
- GPU inference cost — compare hosted APIs to self-hosted Llama / Qwen.
- Fine-tune vs RAG — when a cheaper model + fine-tune beats a premium model at prompting.