AI Economy Hub

LLM API cost calculator

Estimate monthly LLM spend from prompt and completion token counts and per-token input and output rates.

[Interactive cost calculator]


Frequently asked questions

1. Which providers does this match?

Any per-token provider: OpenAI, Anthropic, Google, Mistral, and most OpenRouter models all price per million tokens with separate input and output rates.

2. Where do I find token counts?

Most providers return token usage in the response. OpenAI's tiktoken and Anthropic's SDK include counters to measure average prompt size before launch.

3. Does this include cache discounts?

No. Use the Prompt Cache Savings tool to layer caching economics on top of this base estimate.

What an LLM API bill actually looks like in production

Per-token pricing is the deceptively simple surface of a pricing model that punishes every architectural shortcut. A single chatbot request that looks like "<1k tokens" on paper turns into 3,000+ tokens once you add system prompt, retrieved context, tool schemas, chain-of-thought, and a verbose response. Multiply by 50,000 daily users and you are casually signing up for a five-figure monthly invoice before anybody reviews a spec.

As of April 2026, the headline input/output rates per million tokens on the three major providers look like this:

| Model | Input $/M | Output $/M | Context | Notes |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | 200k | Default workhorse. Prompt cache drops input to $0.30. |
| Claude Opus 4.1 | $15.00 | $75.00 | 200k | Reasoning-heavy. Use sparingly. |
| Claude Haiku 4 | $0.80 | $4.00 | 200k | Router/classifier tier. |
| GPT-5 | $5.00 | $20.00 | 400k | Strong on structured output + tool use. |
| GPT-5 mini | $0.40 | $1.60 | 400k | Real competition to Haiku. |
| GPT-4o | $2.50 | $10.00 | 128k | Still viable; cheaper than 5 for many tasks. |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | Cheap long-context; quality varies by task. |
| Gemini 2.5 Flash | $0.15 | $0.60 | 1M | Throughput king for bulk jobs. |

The numbers most teams get wrong are (a) output tokens and (b) retries. Output generation costs 4–5× the input rate on every major provider because decoding is the expensive half of inference. A feature that returns a 1,200-token response is four times more expensive than one returning 300 tokens at the same accuracy. Retries, whether from a schema-validation failure, a "please try again" wrapper, or an agent loop, silently 2–3× your effective cost per successful task. If your observability only logs one number per request, your finance team is flying blind.
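The retry multiplier is easy to sanity-check in a few lines. A minimal sketch (the rates are illustrative $/M figures and the 10% retry rate is an assumption, not a provider default):

```python
# Effective cost per *successful* task, including retries. Swap in your own
# provider's rate card; these are illustrative $/M-token figures.

def cost_per_call(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of one API call at per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def cost_per_success(input_tokens, output_tokens, in_rate, out_rate,
                     retry_rate=0.10):
    """Average cost per successful task when a fraction of calls retry once."""
    return (1 + retry_rate) * cost_per_call(
        input_tokens, output_tokens, in_rate, out_rate)

# The "1k-token" chat request that is really 3,000 in / 1,200 out at $3/$15:
base = cost_per_call(3_000, 1_200, 3.00, 15.00)      # $0.027 per call
real = cost_per_success(3_000, 1_200, 3.00, 15.00)   # ~$0.0297 per success
```

Note that output tokens drive two thirds of the per-call cost here even though they are less than a third of the token count.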

Estimating before you have traffic

When you have not launched yet, the only honest estimate comes from benchmarking on the actual prompts you will send. A 20-minute ritual that beats any spreadsheet:

  1. Collect 50 representative inputs from your design doc, a user-research transcript, or a staging dataset.
  2. Fix the exact prompt template β€” system, user, any tool schemas, retrieved chunks.
  3. Run each through tiktoken (OpenAI) or the Anthropic SDK counter. Record input tokens.
  4. Actually call the API. Record output tokens from the usage block.
  5. Compute P50 and P95. P95 matters more than the mean when you model worst-case bills.

Plug P50 input, P50 output, your call rate, and the headline price into the calculator above. Then overlay P95 inputs with a 10% retry rate: that is your realistic ceiling. Teams that budget against P50 land at 2–3× overspend inside the first quarter.

The four levers that actually move the bill

In ten production deployments where we cut LLM cost by 40–70%, the winning moves were always the same four:

  1. Prompt caching. Anthropic, OpenAI, and Google all now cache static prefixes (system prompts, long tool schemas, retrieved documents). Cached input drops from $3/M to $0.30/M on Claude Sonnet 4.5, a 10× reduction on whatever fraction of your input is identical across calls. For a RAG bot where the system prompt is 4,000 tokens, this alone saves 60–80% of input cost.
  2. Response length caps. Setting max_tokens is free. A feature that defaults to 4,096 output tokens because that is the SDK default, when 300 would realistically do, is bleeding money on every call. Pair the cap with a "respond in ≤3 sentences" instruction.
  3. Model routing. Route trivial classifications (sentiment, intent, extraction) to Haiku 4 or GPT-5 mini at ~$0.80/M. Escalate to Sonnet or GPT-5 only when confidence is low or the task requires reasoning. A small router saves 60–85% on typical support workloads.
  4. Context trimming. Retrieval that returns 20 chunks to be "safe" when 4 would suffice is a 5× overspend on input tokens for the life of the product. Measure the relevance cutoff by F1, not feelings.
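The routing lever reduces to a few lines of dispatch. A hypothetical sketch: the model id strings, the cheap-task list, and the 0.8 confidence cutoff are all illustrative assumptions, not recommendations:

```python
# Send crisp, narrow tasks to the cheap tier; escalate everything else to the
# default workhorse. Real routers usually get `confidence` from a first-pass
# classifier or the cheap model itself.
CHEAP_TASKS = {"sentiment", "intent", "extraction", "pii_scrub"}

def pick_model(task_type: str, needs_reasoning: bool, confidence: float) -> str:
    if task_type in CHEAP_TASKS and not needs_reasoning and confidence >= 0.8:
        return "claude-haiku-4"       # ~$0.80/M input tier
    return "claude-sonnet-4-5"        # $3/M default
```

A split like this, at a 65/35 ratio, is where the 60–85% savings figure in the bullet comes from.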

Three worked scenarios with full token math

Abstract rate cards are useless without plugging them into a real workload. Here are three deployments we have actually run or priced in the last six months, with the line-item arithmetic that determines the monthly invoice.

Scenario 1: Support chatbot, 250,000 requests/month

B2C SaaS with a knowledge-base-backed chatbot. Traffic is ~8,300 requests/day, concentrated 9am–9pm in the customer's timezone. Per request: 800-token system prompt (product FAQ seed + style guide + tool schemas), 1,400 tokens of retrieved context (4 chunks @ 350 tok), 150 tokens of user message, 280-token average response. Total: 2,350 input + 280 output.

  • Monthly input tokens: 250,000 × 2,350 = 587,500,000 → 587.5M input
  • Monthly output tokens: 250,000 × 280 = 70,000,000 → 70M output
  • Uncached Sonnet 4.5: 587.5 × $3 + 70 × $15 = $1,762 + $1,050 = $2,812/mo
  • With Anthropic cache (system prompt, schemas, and hot retrieved chunks cacheable; 73% of all input tokens hit the cache): cached input 427.8M × $0.30 + uncached 159.7M × $3 = $128 + $479 = $607 input → $1,657/mo total (41% savings)
  • Routed version: 65% of intents are FAQ lookups, so Haiku 4 at $0.80/$4 handles those. 65% × 250k = 162,500 Haiku calls; 87,500 Sonnet calls. Haiku: 382M in × $0.80 + 45.5M out × $4 = $306 + $182 = $488. Sonnet (cached): 205.7M in × avg $1.00 + 24.5M out × $15 = $206 + $368 = $574. Total: $1,062/mo (62% off the naive bill).
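The Scenario 1 headline math as plain arithmetic, so you can swap in your own traffic numbers (rates are the Sonnet 4.5 figures from the table above):

```python
REQS = 250_000                       # requests/month
IN_TOK, OUT_TOK = 2_350, 280         # tokens per request
IN_RATE, OUT_RATE = 3.00, 15.00      # $/M tokens, Sonnet 4.5

in_m = REQS * IN_TOK / 1e6           # 587.5M input tokens/month
out_m = REQS * OUT_TOK / 1e6         # 70.0M output tokens/month
uncached = in_m * IN_RATE + out_m * OUT_RATE   # $2,812.50/mo naive bill
```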

Scenario 2: RAG pipeline, 50,000 queries/month

Internal policy-docs assistant for a 1,200-person org. Fewer queries, much larger context. Per query: 3,200-token system prompt (compliance disclaimers + citation format + 6 few-shot examples), 6 retrieved chunks averaging 650 tok each = 3,900 tok, 120-token user question, 550-token response. Total: 7,220 input + 550 output.

  • Monthly input: 50,000 × 7,220 = 361M. Monthly output: 50,000 × 550 = 27.5M.
  • Uncached Sonnet 4.5: 361 × $3 + 27.5 × $15 = $1,083 + $413 = $1,496/mo
  • The system prompt (3,200 tok) is identical across every call. With a 5-minute Anthropic cache TTL and steady traffic of ~70 qpm during business hours, the cache hit rate is ~92%. Cached prefix contribution: 50,000 × 3,200 × 0.92 = 147.2M tokens at $0.30/M = $44. Write penalty: 50,000 × 3,200 × 0.08 × $3.75 (1.25× write) = $48. Uncached retrieved chunks: 50,000 × 3,900 × $3/M = $585. Uncached user questions: 50,000 × 120 × $3/M = $18. Output: $413. Total: $1,108/mo (26% savings).
  • Push rerank + k=4 retrieval (down from 6): input drops to 5,920 tok, and the reranker adds $50/mo via Cohere Rerank 3.5 at $1/1k. New total: ~$920/mo, with answer quality typically up, not down.
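The cache arithmetic above generalizes into one function. A sketch assuming Anthropic-style cache pricing (reads at 0.1× the input rate, writes at 1.25×):

```python
def cached_input_cost(calls, prefix_tok, other_tok, hit_rate, in_rate=3.00):
    """Monthly input cost ($) when a `prefix_tok`-token prefix is cacheable."""
    read = calls * prefix_tok * hit_rate / 1e6 * (in_rate * 0.10)
    write = calls * prefix_tok * (1 - hit_rate) / 1e6 * (in_rate * 1.25)
    uncached = calls * other_tok / 1e6 * in_rate
    return read + write + uncached

# Scenario 2: 3,200-token prefix, 3,900 + 120 uncached tokens, 92% hit rate
input_cost = cached_input_cost(50_000, 3_200, 3_900 + 120, 0.92)  # ≈ $695
```

Add the $413 of output and you land at the $1,108/mo figure from the bullet.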

Scenario 3: Code assistant, 10 devs × 40 queries/day

In-IDE code-review + refactor assistant used by a 10-person engineering team. 40 queries/dev/day × 22 workdays = 8,800 queries/month. Per query: 1,100-token system prompt (codebase conventions + language-specific instructions), 4,500-token context window (diff + related files), 900-token response for a non-trivial refactor. Total: 5,600 input + 900 output.

  • Monthly input: 8,800 × 5,600 = 49.3M. Monthly output: 8,800 × 900 = 7.9M.
  • Sonnet 4.5 uncached: $148 + $119 = $267/mo. Cheap, but per-dev it is $27/mo. For a 10-person team this is noise. For 500 devs (Cursor-style) it would be $13,350/mo.
  • Opus 4.1 for the same workload: $740 + $593 = $1,333/mo. 5× the cost for maybe 2–3pp quality lift on typical coding tasks. Not worth it outside specific hard-reasoning loops (architecture review, tricky concurrency bugs).
  • The right answer here is Sonnet 4.5 as the default, with an explicit "hard mode" button that routes to Opus when the dev wants deeper analysis, typically < 5% of queries. That blended cost lands around $320/mo.
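The blended figure falls out of a one-line weighted average of the two monthly costs above (the 5% Opus share is the scenario's own estimate):

```python
sonnet_mo, opus_mo = 267.0, 1_333.0   # Scenario 3 monthly costs ($)
opus_share = 0.05                     # "hard mode" fraction of queries

blended = (1 - opus_share) * sonnet_mo + opus_share * opus_mo  # ≈ $320/mo
```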

Production patterns that actually matter

Cost control in production is not really about the rate card. It is about the things that go wrong at 3am when a customer's bot loops 200 times on a malformed prompt. The patterns that matter:

  • Retry budgets. Give every agent call a hard retry limit (3–5) and a total token budget. A runaway tool-use loop that retries 50 times on a schema-validation failure can eat $100 in a minute. Budget enforcement at the SDK wrapper level is non-negotiable.
  • Circuit breakers on upstream providers. When Anthropic or OpenAI has a regional outage (and they will, 1–3 times a quarter), your app should fail over to a secondary provider, not retry the primary for 90 seconds. Use a library like Opossum or roll your own with a simple error-rate window.
  • Fallback chains. A typical production chain is: Sonnet 4.5 → GPT-5 → Haiku 4 + simplified prompt → static "I'm having trouble, email us". Each layer has lower cost and lower capability. The goal is never "break in the user's face", even if occasionally the response is worse than ideal.
  • Per-tenant spend caps. If you are B2B SaaS, one customer's LLM behavior can bankrupt your unit economics. Enforce a monthly token cap per tenant and expose usage via an API, both to protect yourself and as a compliance feature customers will pay for.
  • Observability with token breakdown. Log input tokens, output tokens, cache-hit tokens, and latency on every call. Langfuse, Helicone, and OpenLLMetry are all adequate; rolling your own into Datadog is fine too. What is not fine is one number per request: you cannot debug what you cannot see.
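The retry-budget pattern from the first bullet can be sketched as an SDK-level wrapper. `call_model` here is a hypothetical stand-in for your real SDK call, assumed to return `(ok, text, tokens_used)`:

```python
# Enforce a hard attempt cap AND a total token budget per task, so a runaway
# loop cannot quietly burn $100 retrying a schema-validation failure.

class TokenBudgetExceeded(RuntimeError):
    pass

def call_with_budget(call_model, prompt, max_retries=3, token_budget=20_000):
    """call_model(prompt) -> (ok: bool, text: str, tokens_used: int)."""
    spent = 0
    for attempt in range(max_retries + 1):
        ok, text, tokens = call_model(prompt)
        spent += tokens
        if ok:
            return text, spent
        if spent >= token_budget:
            raise TokenBudgetExceeded(
                f"{spent} tokens burned over {attempt + 1} attempts")
    raise RuntimeError(f"no valid response after {max_retries + 1} attempts")
```

Putting the budget in the wrapper rather than in each agent means no individual loop can opt out of it.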

Model selection cheat sheet

The vendor-agnostic default we reach for in 2026 looks like this:

  • Claude Haiku 4 ($0.80/$4): single-step extractions under 500 output tokens, intent routing, cheap summarization, PII scrubbing, first-pass classification. If the task is crisp and has a narrow answer, Haiku is usually enough and 3–4× cheaper than Sonnet.
  • Claude Sonnet 4.5 ($3/$15): 95% of production general-purpose use. Agent loops, RAG synthesis, coding (including non-trivial refactors), customer-facing chatbots, long-form generation. This is the default; start here and only escalate when you have a measured reason.
  • Claude Opus 4.1 ($15/$75): hard reasoning, multi-step planning where a mistake compounds (legal, financial analysis), architecture-level code review, research agents that need to reason over conflicting evidence. Under 5% of our traffic, but where we need it, we need it.
  • GPT-5 ($5/$20): strict JSON schema outputs, workflows already built on OpenAI tooling, vision tasks with tables/receipts. Parity with Sonnet on most tasks; the decision is usually about ecosystem, not capability.
  • Gemini 2.5 Pro ($1.25/$10): long context > 200k tokens, cheap bulk work, Google ecosystem (Vertex, BigQuery integration). Quality varies more by task than the others.

Latency vs quality tradeoffs

Sticker-price comparisons ignore the other expensive dimension: how long your user waits. On streaming output at April 2026 production averages: Sonnet 4.5 emits tokens at ~80ms per token (~12.5 tok/sec perceived), Opus 4.1 at ~60ms per token but with a much larger first-token latency (~900ms), Haiku 4 at ~35ms per token (~28 tok/sec), GPT-5 around 55ms per token with ~400ms first-token latency, and Gemini 2.5 Flash around 150ms per token but with the fastest first-byte time of the group. For a chat UX, first-token latency usually matters more than steady-state throughput, because users start reading the moment characters appear. For an agent generating a full 1,500-token plan before acting, steady-state throughput dominates and Haiku wins decisively even when Sonnet would produce a slightly better plan.
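The agent case is a one-line wait model: total wall time is first-token latency plus tokens divided by throughput. The per-token figures are from the paragraph above; the Sonnet and Haiku first-token numbers (500ms / 300ms) are placeholder assumptions, since the paragraph only states them for Opus and GPT-5:

```python
def total_wait_ms(first_token_ms, ms_per_token, n_tokens):
    """Wall time to stream a full n-token response."""
    return first_token_ms + ms_per_token * n_tokens

# A full 1,500-token agent plan, generated before acting:
sonnet_ms = total_wait_ms(500, 80, 1_500)   # 120,500 ms, roughly 2 minutes
haiku_ms = total_wait_ms(300, 35, 1_500)    # 52,800 ms, under a minute
```

At this response length, the first-token term is noise; per-token throughput is the whole game.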

Frequently asked questions

How do I forecast cost before I have users? Run 50 representative prompts through the actual API, record P50/P95 input and output tokens from the usage block, multiply by your projected traffic, multiply again by 1.3 for retries and 1.2 for conversation length drift. That is your realistic ceiling.

Do I pay for tool-schema tokens? Yes. Tool schemas count as input tokens on every request they are attached to. A tool definition with 6 tools and rich descriptions is often 1,200–2,500 tokens, and it is cacheable, which is usually the answer.

What about vision input? Anthropic bills images at a tile-based rate equivalent to ~1,200–1,600 tokens for a typical 1024×1024 image. OpenAI uses a similar scheme. Budget 1,500 input tokens per image as a safe default.

Are batch APIs worth it? Anthropic's Message Batches API gives a flat 50% discount on both input and output at the cost of up to 24 hours of latency. For offline enrichment (tagging, classification, summarization of backlogs), it is an obvious yes. For anything a user waits for, obviously no.

What is a realistic cache hit rate? For a stable system prompt + tool schemas, expect an 85–95% hit rate during business hours and 30–60% overnight. If your cache hit rate is under 50% during normal traffic, something is rewriting the prefix, usually a non-deterministic serialization of tools or a timestamp in the system message.
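The "something is rewriting the prefix" failure is easy to reproduce. A minimal sketch (the tool names are made up): serializing the same tool definitions in a different dict order changes the prefix byte-for-byte, and the cache keys on exact bytes:

```python
import json

# Two logically identical tool definitions, built in different dict orders:
tools_a = {"search_docs": {"top_k": 4}, "create_ticket": {"priority": "p2"}}
tools_b = {"create_ticket": {"priority": "p2"}, "search_docs": {"top_k": 4}}

# Naive dumps preserves insertion order, so the serialized prefixes differ
# and every call writes a fresh cache entry instead of hitting one:
naive_differs = json.dumps(tools_a) != json.dumps(tools_b)

# Sorting keys (and keeping timestamps out of the system message) makes the
# prefix byte-identical across calls:
def stable(tools):
    return json.dumps(tools, sort_keys=True, separators=(",", ":"))

prefix_matches = stable(tools_a) == stable(tools_b)
```

The same discipline applies to retrieved-chunk ordering and any injected metadata: deterministic bytes in, cache hits out.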

Should I switch off Sonnet 4.5 when Sonnet 5 ships? Plan for a 30-day shadow-eval period before any model migration in production. Every point release changes token counts for the same prompts by 2–8%, shifts latency, and breaks at least one edge case. Migrations have cost teams more than the new model saved.

Is it ever right to default to Opus? For a high-stakes domain (legal, medical, financial) with a small query volume where being wrong is catastrophic and being slow is fine, yes. For any consumer product, no. The quality delta rarely survives contact with actual product metrics.

How much does observability cost me? Langfuse cloud is ~$0.0002 per trace on their standard tier, so 1M traces/month is $200. Rolling your own into existing Datadog or Honeycomb is essentially free but takes a week of eng time. Either way, it pays for itself the first time you debug a cost spike.
