AI Economy Hub

Prompt-to-cost estimator

Turn a prompt length in words into per-call and monthly API cost instantly.


Frequently asked questions

1. How accurate is 1.33 tokens/word?

Close for most English prose. Code and structured data can be 2×; measure with tiktoken if cost is tight.

2. Why does response length matter so much?

Output tokens are priced 3–5× higher than input tokens, so a 500-word answer can cost more than a 2,000-word prompt.

Turning a prompt you sketched on a whiteboard into a realistic bill

Before you write any code, you can get within 20% of your real LLM cost from just two numbers: your prompt length in words and your typical call volume. The translation from words to tokens is remarkably stable: English prose runs ~1.3 tokens per word in OpenAI's cl100k-class BPE, and Claude's tokenizer lands close enough for estimation. A 600-word prompt is ~780 input tokens. A 300-word response is ~390 output tokens. Plug those into the current rate card and you have a pre-launch budget that is honest enough to show a CFO.
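That arithmetic fits in a few lines of Python. A minimal sketch, with the rates as assumptions (the $3 in / $15 out per million tokens implied by the Sonnet figures later in this piece); substitute your provider's current rate card:

```python
def estimate_call_cost(prompt_words, response_words,
                       in_rate_per_mtok=3.00, out_rate_per_mtok=15.00,
                       tokens_per_word=1.3):
    """Back-of-envelope cost of one call using the ~1.3 tokens/word rule.

    Default rates are illustrative ($3 in / $15 out per million tokens);
    plug in your provider's current rate card.
    """
    in_tok = prompt_words * tokens_per_word
    out_tok = response_words * tokens_per_word
    return (in_tok * in_rate_per_mtok + out_tok * out_rate_per_mtok) / 1_000_000

# 600-word prompt (~780 input tokens), 300-word response (~390 output tokens)
per_call = estimate_call_cost(600, 300)   # ≈ $0.0082 at the default rates
monthly = per_call * 30_000               # ≈ $246 at 30k calls/month
```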

This is the math anyone building an LLM-backed feature should be able to do in their head, at least roughly. The feature owner who walks into a planning meeting and says "this will cost us somewhere between $800 and $1,600 a month at 30k users" wins the argument; the one who says "AI is kind of expensive I think" doesn't. The calculator above does the arithmetic; the value is in knowing which numbers to plug in.

Why people underestimate by 3×

The two things engineers leave out when eyeballing a prompt:

  • System prompt + tool schemas. A serious agent with 6 tool definitions, output-format instructions, and 3 few-shot examples is easily 2,000–4,000 tokens on the input side, often larger than the user message itself. And it is sent on every call.
  • Retrieved context. A RAG pipeline that returns 8 chunks of 250 tokens each adds 2,000 tokens per call on top of everything else. If retrieval is "generous" at 12 chunks, you are at 3,000.

A reasonable back-of-envelope for a production assistant is: 500 (system) + 2,000 (context) + 300 (user turn) = 2,800 input tokens per call. That is the number to multiply by your input rate, not the 80 tokens in the user message you are thinking about.
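The same back-of-envelope as code, with the $3-per-million input rate as an assumption:

```python
# Input-side budget for the production assistant sketched above
SYSTEM_TOK, CONTEXT_TOK, USER_TOK = 500, 2_000, 300
input_tokens = SYSTEM_TOK + CONTEXT_TOK + USER_TOK   # 2,800 tokens per call

IN_RATE = 3.00 / 1_000_000   # assumed $3 per million input tokens
input_cost = input_tokens * IN_RATE                  # $0.0084 per call, input only
```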

Feature archetype                  Typical input   Typical output   Sonnet 4.5 cost / call
Simple Q&A, no RAG                 400 tok         200 tok          $0.0042
RAG chatbot (8 chunks)             2,800 tok       400 tok          $0.0144
Tool-use agent (3 turns)           6,000 tok       800 tok          $0.030
Document summary (20-page PDF)     18,000 tok      600 tok          $0.063
Code review on a PR                8,000 tok       1,200 tok        $0.042
Long-form draft generation         1,200 tok       2,000 tok        $0.0336

Three feature designs and their hidden cost differences

The same feature description often has wildly different cost profiles depending on architectural choices. Three examples from features we have built or audited in the past six months:

  • "Summarize this thread" button in a messaging app. Naive design: pass all 40 messages to Sonnet 4.5. Input of 3,200 tokens Γ— output of 400 = $0.015/call. At 50k uses/month: $750. Optimized design: Haiku 4 for first-pass extraction of key points, then conditional Sonnet 4.5 only if the output is ambiguous. Blended cost per call: $0.004. Monthly: $200. Same feature, 3.75Γ— cheaper.
  • "Draft a reply" in a CRM. Naive: feed 20 previous emails + 3 CRM records + 5 style examples = 8,000 input tokens, 500 output, per call $0.032. At 200k uses/month: $6,400. Optimized: cache the 5 style examples (static), use Haiku 4 for thread summarization, pass summary + top 2 emails to Sonnet. Per call: $0.009. Monthly: $1,800.
  • "Extract line items from this receipt" vision flow. Naive: GPT-5 with full image and verbose schema = $0.021/call. At 80k receipts/month: $1,680. Optimized: Haiku 4 with vision input and strict JSON mode, plus fallback to GPT-5 when confidence is low. Per call blended: $0.006. Monthly: $480.

Rule-of-thumb conversions from English to tokens

Memorizing these saves whiteboard time and makes estimates in real meetings feel effortless:

  • 1 word of English prose ≈ 1.3 tokens.
  • 1 sentence ≈ 20 tokens (prose), 30–40 (code or technical).
  • 1 paragraph (75–100 words) ≈ 100–130 tokens.
  • 1 page of double-spaced 12pt prose ≈ 300–400 tokens.
  • 1 page of single-spaced 10pt prose ≈ 600–800 tokens.
  • 1 PDF page of a typical report ≈ 500–700 tokens.
  • 1 line of Python ≈ 10 tokens average.
  • 1 JSON line ≈ 5–8 tokens (punctuation-heavy).
  • 1 image (Claude/GPT-5 vision) ≈ 1,200–1,600 tokens at 1024×1024.
  • 1 minute of English speech transcript ≈ 180–220 tokens.
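These rules of thumb fold into a tiny estimator. The unit names are ad hoc, and the values are midpoints of the ranges listed above:

```python
# Tokens per unit, midpoints of the ranges listed above
TOKENS_PER = {
    "word": 1.3,
    "sentence": 20,
    "paragraph": 115,
    "pdf_page": 600,
    "python_line": 10,
    "image_1024": 1_400,
    "speech_minute": 200,
}

def estimate_tokens(**units):
    """estimate_tokens(word=600, image_1024=1) -> rough token count."""
    return sum(TOKENS_PER[unit] * count for unit, count in units.items())

estimate_tokens(word=600)      # ≈ 780 for a 600-word prompt
estimate_tokens(pdf_page=20)   # ≈ 12,000 for a 20-page report
```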

Cost vs. latency tradeoffs in feature design

A feature that has to respond in under 500ms cannot afford to chain models. A feature the user submits and walks away from (async background job) can afford to chain three models and still be cheaper on aggregate. Design the UX with the cost architecture in mind:

  • Synchronous chat: single-model call, tight prompt, aggressive max_tokens. Streaming on.
  • Background automation: multi-step pipeline, small cheap models routing to expensive ones only when needed, batch API for 50% discount where latency permits.
  • Progressive UI: fast Haiku response first, optional "ask for deeper analysis" button that invokes Sonnet or Opus. Users get the fast answer by default and opt in to the expensive one.

Frequently asked questions

How accurate is word-count estimation? Within ±5% for English prose. Wider for code, JSON, and non-English. Use it for pre-launch budgeting; measure once you have live traffic.

Do I count the system prompt every call? Yes, unless cached. System prompts are billed as input on every request.

What about tool definitions? Billed as input tokens. A 6-tool schema is often 1,200–2,500 tokens, larger than your user message. Cache them.

How do streaming and non-streaming differ in cost? They do not. Streaming affects latency only. You pay for every token emitted either way.

What if my users paste large blobs? Cap ingest size. A single malicious user pasting a 200KB document can generate a $10+ call. Enforce byte limits on input.

How do I price a multi-turn conversation? The conversation history accrues on each turn. At turn 10 you are paying for all 9 previous turns in input. Budget average turn depth × per-turn cost.
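A sketch of that accrual, with the per-turn sizes and $/Mtok rates as illustrative assumptions:

```python
def conversation_cost(turns, user_tok=150, assistant_tok=300,
                      system_tok=500, rate_in=3.00, rate_out=15.00):
    """Total cost of an N-turn chat where every turn resends the system
    prompt plus the full history. All sizes and $/Mtok rates here are
    illustrative assumptions."""
    total, history = 0.0, 0
    for _ in range(turns):
        input_tok = system_tok + history + user_tok
        total += (input_tok * rate_in + assistant_tok * rate_out) / 1_000_000
        history += user_tok + assistant_tok
    return total

conversation_cost(1)    # cost of a single-turn exchange
conversation_cost(10)   # far more than 10x turn 1: history accrues every turn
```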

What is the single best way to cut cost during design? Tight max_tokens. Models pad responses by default; a clear cap plus a style instruction typically cuts output 30–40%.

Should I use vision when text would do? Only if the task actually needs the visual information. A 1-image call is 1,500 tokens; the same information extracted with OCR and passed as text is often 200 tokens.

When the word count itself is wrong

Code, JSON, and non-English text tokenize worse than prose. 1,000 lines of Python ≈ 10,000 tokens, not the 4,000 you'd guess from word count. A Chinese or Japanese prompt uses roughly 2× the tokens of equivalent English content. A long JSON blob with lots of punctuation and keys can hit 2 tokens per word. If your feature handles non-prose, measure directly; do not estimate.
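For exact counts, use a real tokenizer (tiktoken for the GPT family). For whiteboard estimates of non-prose, the per-character heuristics from the FAQ below (0.4–0.6 tokens/char for code, 0.3–0.4 for JSON) beat word count. A sketch; the function name and midpoint rates are ad hoc:

```python
def rough_tokens(text, kind="prose"):
    """Rough token count. Prose uses the 1.3 tokens/word rule; code and
    JSON use per-character rates (midpoints of 0.4-0.6 and 0.3-0.4
    tok/char), since word count badly undercounts punctuation-heavy text."""
    per_char = {"code": 0.5, "json": 0.35}
    if kind in per_char:
        return int(len(text) * per_char[kind])
    return int(len(text.split()) * 1.3)

snippet = 'def f(x):\n    return {"total": x * 1.3}\n'
rough_tokens(snippet, "prose")   # word count badly undercounts this
rough_tokens(snippet, "code")    # character-based estimate, roughly 2x higher
```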

Three prompts we have actually priced

To make the abstraction concrete, here are three real prompt structures we have built for clients, with their measured token counts on Claude Sonnet 4.5 in April 2026:

  • Customer support triage agent: 620-token system prompt, 4 tool definitions totaling 1,080 tokens, 3 few-shot examples at 220 tokens each = 660, user message averaging 140 tokens, RAG context of 3 chunks at 320 tokens = 960. Total input: 3,460 tokens. Output: 310 tokens. Cost per call uncached: $0.0151. With cache on the static 2,360-token prefix at 85% hit rate: $0.0088, a 42% drop.
  • Code-review assistant: 480-token system prompt, 1 tool definition (520 tok) for linter integration, no few-shots, diff of 3,200 tokens, related-file context of 2,400 tokens, user comment of 80 tokens. Total input: 6,680 tokens. Output: 820 tokens. Cost per call uncached: $0.0324. With cache on 1,000-token prefix: $0.0280, a 14% drop; because most of the input is dynamic diff content, caching does less.
  • Meeting-notes summarizer: 340-token system prompt (format + style), 0 tools, 2 few-shots at 180 tokens = 360, transcript of 8,200 tokens, user instruction of 60 tokens. Total input: 8,960. Output: 540 (structured JSON with summary + action items). Cost per call uncached: $0.0350. Caching is marginal β€” transcript changes every call, so only the 700-token prefix caches, saving $0.0019 per call.

The pattern: caching helps most when the static prefix is a large share of input. Summary workloads where the bulk of input is the thing being summarized get limited benefit. Agent and chatbot workloads with shared context get large benefit.
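That pattern can be made explicit as an expected-value model. A sketch with illustrative rates ($3/Mtok input, $15/Mtok output, $0.30/Mtok cache reads) that ignores cache-write premiums, which is why it only approximates the measured figures above:

```python
def cached_call_cost(static_tok, dynamic_tok, out_tok, hit_rate,
                     rate_in=3.00, rate_out=15.00, cache_read_rate=0.30):
    """Expected per-call cost with prompt caching on the static prefix.
    Illustrative $/Mtok rates; cache-write premiums ignored."""
    miss_in = (static_tok + dynamic_tok) * rate_in
    hit_in = static_tok * cache_read_rate + dynamic_tok * rate_in
    expected_in = hit_rate * hit_in + (1 - hit_rate) * miss_in
    return (expected_in + out_tok * rate_out) / 1_000_000

# Agent-style workload: large shared prefix, big saving
cached_call_cost(2_360, 1_100, 310, hit_rate=0.85)
# Summarizer: small prefix, huge dynamic transcript, marginal saving
cached_call_cost(700, 8_260, 540, hit_rate=0.85)
```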

Tokenizer-specific surprises

Not all tokenizers are equal, and the differences compound at scale. Claude's tokenizer (a BPE variant trained on a mix heavy in code and multilingual text) tokenizes English at roughly the same rate as GPT-4's cl100k, but tokenizes code 5–10% more efficiently and non-English text 10–25% more efficiently. Gemini's SentencePiece tokenizer is roughly on par for English, worse on code, better on East Asian languages.

The practical effect: the same 1,000-word user message might be 1,320 tokens on Claude, 1,300 on GPT-5, and 1,280 on Gemini. For pure English prose the difference does not move a decision. For a Japanese-language chatbot or a code-review agent running on a 2MB diff, it can swing monthly cost by 15–20% and is worth benchmarking explicitly before committing a provider.

Common forecasting mistakes

  • Forgetting the conversation history. A chatbot at turn 8 carries all 7 previous turns in input. By turn 10, you are easily paying 3× what you paid at turn 1 per call. Budget for average turn depth, not turn 1.
  • Underestimating tool-use traffic. A tool-use agent that sends back tool_result content for the LLM to reason over is paying for that content as input tokens. For a search-heavy agent, tool results can be 2–3× the size of the user message.
  • Missing the retry multiplier. At 10% schema-failure rate, effective cost per successful call is 1.1× headline. At 25% (common for agent-style tool use), it is 1.33×. Measure retry rate; bake it into the forecast.
  • Ignoring output padding. Models trained to be "helpful" pad responses with "Here is the information you requested..." preambles and closing summaries. A max_tokens=300 cap plus a terse style instruction routinely cuts output cost 30–40%.
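The retry multiplier is a geometric series: if a fraction p of calls fail validation and are retried until one succeeds, the expected number of attempts per successful call is 1/(1-p). A sketch:

```python
def effective_cost(headline_cost, failure_rate):
    """Expected cost per *successful* call when failures are retried until
    success: headline / (1 - failure_rate), from the geometric series."""
    return headline_cost / (1 - failure_rate)

effective_cost(0.015, 0.10)   # ≈ 1.11x headline
effective_cost(0.015, 0.25)   # ≈ 1.33x headline
```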

Frequently asked questions

What tokenizer should I use to count? For GPT family, tiktoken with the cl100k_base or o200k_base encoding. For Claude, the official SDK exposes a countTokens helper that matches server-side counting. For Gemini, the CountTokens endpoint. Do not estimate β€” an hour spent wiring exact counts into your telemetry pays back the first week.

Is 1.3 tokens per word reliable for prose? Yes, within ±5% for English prose across all major tokenizers. For code, 0.4–0.6 tokens per character is a better heuristic. For JSON, 0.3–0.4 tokens per character due to punctuation overhead.

How do I price a streaming response? Same as non-streaming. Streaming affects latency and UX, not billing. You pay for every token emitted.

Do system prompts count every call? Yes. The system prompt ships on every call, and you pay for it on every call (unless cached). This is why engineering a tight system prompt is worth 2–3 hours of your time.

What is a "safe" headroom multiplier? We use 1.5Γ— over the best estimate for the first 90 days in production, tapering to 1.2Γ— as telemetry lands. Teams that budget at 1.0Γ— are the ones filing emergency spend approvals six weeks in.

Should I model worst-case user behavior? For consumer products, yes. Power users will send 10× the token volume of median users, and the top 1% can reach 100× the median. Budget the tail explicitly.

How do I sanity-check vendor invoices? Log input/output tokens on your side and reconcile against the provider's usage dashboard weekly. Both Anthropic and OpenAI expose usage APIs. Discrepancies over 5% are worth a support ticket.

What if my prompt includes retrieved documents from a user-controlled source? Enforce hard byte limits on ingested content before it hits the model. A 200KB PDF uploaded by a malicious user can otherwise send a single request into the $10+ range. Truncate, summarize, or reject at ingest.
