# Turning a prompt you sketched on a whiteboard into a realistic bill
Before you write any code, you can get within 20% of your real LLM cost from just two numbers: your prompt length in words and your typical call volume. The translation from words to tokens is remarkably stable: English prose is ~1.3 tokens per word in both OpenAI's cl100k-class BPE (GPT-4/5) and Claude's tokenizer. A 600-word prompt is ~780 input tokens. A 300-word response is ~390 output tokens. Plug those into the current rate card and you have a pre-launch budget that is honest enough to show a CFO.
This is the math anyone building an LLM-backed feature should be able to do in their head, at least roughly. The feature owner who walks into a planning meeting and says "this will cost us somewhere between $800 and $1,600 a month at 30k users" wins the argument; the one who says "AI is kind of expensive I think" doesn't. The calculator above does the arithmetic; the value is in knowing which numbers to plug in.
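The head math above can be written down once and reused. A minimal sketch, assuming Sonnet-class rates of $3 per million input tokens and $15 per million output tokens (the rates implied by the archetype table in this piece; substitute your provider's current rate card):

```python
# Back-of-envelope: words -> tokens -> dollars.
# Rates are assumptions ($3 / $15 per million tokens, Sonnet-class).
TOKENS_PER_WORD = 1.3  # stable for English prose across major tokenizers

def estimate_call_cost(prompt_words, response_words,
                       in_rate=3.00, out_rate=15.00):
    """Rough per-call cost in dollars; rates in $ per million tokens."""
    in_tok = prompt_words * TOKENS_PER_WORD
    out_tok = response_words * TOKENS_PER_WORD
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

# 600-word prompt, 300-word response, 30k calls/month:
monthly = estimate_call_cost(600, 300) * 30_000
```

The point is not the specific rates, which change; it is that the whole model fits in four lines you can re-derive on a whiteboard.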
## Why people underestimate by 3×
The two things engineers leave out when eyeballing a prompt:
- System prompt + tool schemas. A serious agent with 6 tool definitions, output-format instructions, and 3 few-shot examples is easily 2,000–4,000 tokens on the input side, often larger than the user message itself. And it is sent on every call.
- Retrieved context. A RAG pipeline that returns 8 chunks of 250 tokens each adds 2,000 tokens per call on top of everything else. If retrieval is "generous" at 12 chunks, you are at 3,000.
A reasonable back-of-envelope for a production assistant is: 500 (system) + 2,000 (context) + 300 (user turn) = 2,800 input tokens per call. That is the number to multiply by your input rate, not the 80 tokens in the user message you are thinking about.
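The itemization is worth encoding so no term gets forgotten in a meeting. A sketch whose defaults mirror the illustrative numbers above:

```python
# Itemize the real input side; the user message is the smallest term.
# Defaults mirror the back-of-envelope above and are illustrative.
def input_tokens_per_call(system=500, rag_context=2_000, user_turn=300,
                          tool_schemas=0, few_shot=0):
    return system + rag_context + user_turn + tool_schemas + few_shot

tokens = input_tokens_per_call()  # 2,800, as in the text
```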
| Feature archetype | Typical input | Typical output | Sonnet 4.5 cost / call |
|---|---|---|---|
| Simple Q&A, no RAG | 400 tok | 200 tok | $0.0042 |
| RAG chatbot (8 chunks) | 2,800 tok | 400 tok | $0.0144 |
| Tool-use agent (3 turns) | 6,000 tok | 800 tok | $0.030 |
| Document summary (20-page PDF) | 18,000 tok | 600 tok | $0.063 |
| Code review on a PR | 8,000 tok | 1,200 tok | $0.042 |
| Long-form draft generation | 1,200 tok | 2,000 tok | $0.0336 |
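Every row in the table is consistent with flat $3 / $15 per-million rates, so the whole thing reduces to one function (the rates are an inference from the table, not a quoted price list; verify against the current rate card):

```python
# Reproduce the archetype table at assumed $3 / $15 per-Mtok rates.
def call_cost(in_tok, out_tok, in_rate=3.00, out_rate=15.00):
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

archetypes = {
    "simple_qa":      (400, 200),      # $0.0042
    "rag_chatbot":    (2_800, 400),    # $0.0144
    "tool_agent":     (6_000, 800),    # $0.0300
    "doc_summary":    (18_000, 600),   # $0.0630
    "code_review":    (8_000, 1_200),  # $0.0420
    "longform_draft": (1_200, 2_000),  # $0.0336
}
costs = {name: call_cost(i, o) for name, (i, o) in archetypes.items()}
```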
## Three feature designs and their hidden cost differences
The same feature description often has wildly different cost profiles depending on architectural choices. Three examples from features we have built or audited in the past six months:
- "Summarize this thread" button in a messaging app. Naive design: pass all 40 messages to Sonnet 4.5. Input of 3,200 tokens Γ output of 400 = $0.015/call. At 50k uses/month: $750. Optimized design: Haiku 4 for first-pass extraction of key points, then conditional Sonnet 4.5 only if the output is ambiguous. Blended cost per call: $0.004. Monthly: $200. Same feature, 3.75Γ cheaper.
- "Draft a reply" in a CRM. Naive: feed 20 previous emails + 3 CRM records + 5 style examples = 8,000 input tokens, 500 output, per call $0.032. At 200k uses/month: $6,400. Optimized: cache the 5 style examples (static), use Haiku 4 for thread summarization, pass summary + top 2 emails to Sonnet. Per call: $0.009. Monthly: $1,800.
- "Extract line items from this receipt" vision flow. Naive: GPT-5 with full image and verbose schema = $0.021/call. At 80k receipts/month: $1,680. Optimized: Haiku 4 with vision input and strict JSON mode, plus fallback to GPT-5 when confidence is low. Per call blended: $0.006. Monthly: $480.
## Rule-of-thumb conversions from English to tokens
Memorizing these saves whiteboard time and makes estimates in real meetings feel effortless:
- 1 word of English prose ≈ 1.3 tokens.
- 1 sentence ≈ 20 tokens (prose), 30–40 (code or technical).
- 1 paragraph (75–100 words) ≈ 100–130 tokens.
- 1 page of double-spaced 12pt prose ≈ 300–400 tokens.
- 1 page of single-spaced 10pt prose ≈ 600–800 tokens.
- 1 PDF page of a typical report ≈ 500–700 tokens.
- 1 line of Python ≈ 10 tokens average.
- 1 JSON line ≈ 5–8 tokens (punctuation-heavy).
- 1 image (Claude/GPT-5 vision) ≈ 1,200–1,600 tokens at 1024×1024.
- 1 minute of English speech transcript ≈ 180–220 tokens.
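The conversions collapse into a lookup table. The values below are midpoints of the quoted ranges; treat the output as ±20% at best:

```python
# Midpoints of the rule-of-thumb conversions above (illustrative).
RULES = {
    "word": 1.3, "sentence": 20, "paragraph": 115,
    "pdf_page": 600, "python_line": 10, "json_line": 6.5,
    "image_1024": 1_400, "speech_minute": 200,
}

def rough_tokens(unit, count):
    return RULES[unit] * count

tokens = rough_tokens("pdf_page", 20)  # a 20-page report ~ 12,000 tokens
```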
## Cost vs. latency tradeoffs in feature design
A feature that has to respond in under 500ms cannot afford to chain models. A feature the user submits and walks away from (async background job) can afford to chain three models and still be cheaper on aggregate. Design the UX with the cost architecture in mind:
- Synchronous chat: single-model call, tight prompt, aggressive max_tokens. Streaming on.
- Background automation: multi-step pipeline, small cheap models routing to expensive ones only when needed, batch API for 50% discount where latency permits.
- Progressive UI: fast Haiku response first, optional "ask for deeper analysis" button that invokes Sonnet or Opus. Users get the fast answer by default and opt in to the expensive one.
## Frequently asked questions
How accurate is word-count estimation? Within ±5% for English prose. Wider for code, JSON, and non-English. Use it for pre-launch budgeting; measure once you have live traffic.
Do I count the system prompt every call? Yes, unless cached. System prompts are billed as input on every request.
What about tool definitions? Billed as input tokens. A 6-tool schema is often 1,200–2,500 tokens, larger than your user message. Cache them.
How do streaming and non-streaming differ in cost? They do not. Streaming affects latency only. You pay for every token emitted either way.
What if my users paste large blobs? Cap ingest size. A single malicious user pasting a 200KB document can generate a $10+ call. Enforce byte limits on input.
How do I price a multi-turn conversation? The conversation history accrues on each turn. At turn 10 you are paying for all 9 previous turns in input. Budget average turn depth Γ per-turn cost.
What is the single best way to cut cost during design? Tight `max_tokens`. Models pad responses by default; a clear cap plus a style instruction typically cuts output 30–40%.
Should I use vision when text would do? Only if the task actually needs the visual information. A 1-image call is 1,500 tokens; the same information extracted with OCR and passed as text is often 200 tokens.
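The multi-turn pricing answer deserves a worked sketch, because the accrual surprises people. Per-turn sizes and the $3 / $15 per-Mtok rates below are assumptions:

```python
# Each turn re-sends the whole history as input; cost grows ~quadratically.
IN_RATE, OUT_RATE = 3.00 / 1e6, 15.00 / 1e6  # assumed $/token

def conversation_cost(turns, user_tok=150, assistant_tok=300, system_tok=500):
    total, history = 0.0, system_tok
    for _ in range(turns):
        history += user_tok                # new user message joins history
        total += history * IN_RATE         # full history billed as input
        total += assistant_tok * OUT_RATE  # plus the new reply
        history += assistant_tok           # reply joins history for next turn
    return total
```

With these sizes, turn 10 alone costs roughly 3× turn 1, which is why budgeting at turn-1 cost undershoots so badly.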
## When the word count itself is wrong
Code, JSON, and non-English text tokenize worse than prose. 1,000 lines of Python ≈ 10,000 tokens, not the 4,000 you'd guess from word count. A Chinese or Japanese prompt uses roughly 2× the tokens of equivalent English content. A long JSON blob with lots of punctuation and keys can hit 2 tokens per word. If your feature handles non-prose, measure directly; do not estimate.
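When the content is not prose, switch heuristics. A content-aware estimator using the per-character midpoints quoted in the FAQ answers of this piece (rough assumptions, not tokenizer output; measure with a real tokenizer before committing budget numbers):

```python
# Rough token estimate by content type.
def estimate_tokens(text, kind="prose"):
    if kind == "prose":
        return int(len(text.split()) * 1.3)  # words -> tokens
    # code ~0.4-0.6 tok/char, JSON ~0.3-0.4 tok/char; midpoints assumed
    per_char = {"code": 0.5, "json": 0.35}[kind]
    return int(len(text) * per_char)
```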
## Three prompts we have actually priced
To make the abstraction concrete, here are three real prompt structures we have built for clients, with their measured token counts on Claude Sonnet 4.5 in April 2026:
- Customer support triage agent: 620-token system prompt, 4 tool definitions totaling 1,080 tokens, 3 few-shot examples at 220 tokens each = 660, user message averaging 140 tokens, RAG context of 3 chunks at 320 tokens = 960. Total input: 3,460 tokens. Output: 310 tokens. Cost per call uncached: $0.0151. With cache on the static 2,360-token prefix at 85% hit rate: $0.0088, a 42% drop.
- Code-review assistant: 480-token system prompt, 1 tool definition (520 tok) for linter integration, no few-shots, diff of 3,200 tokens, related-file context of 2,400 tokens, user comment of 80 tokens. Total input: 6,680 tokens. Output: 820 tokens. Cost per call uncached: $0.0324. With cache on 1,000-token prefix: $0.0280, a 14% drop; because most of the input is dynamic diff content, caching does less.
- Meeting-notes summarizer: 340-token system prompt (format + style), 0 tools, 2 few-shots at 180 tokens = 360, transcript of 8,200 tokens, user instruction of 60 tokens. Total input: 8,960. Output: 540 (structured JSON with summary + action items). Cost per call uncached: $0.0350. Caching is marginal; the transcript changes every call, so only the 700-token prefix caches, saving $0.0019 per call.
The pattern: caching helps most when the static prefix is a large share of input. Summary workloads where the bulk of input is the thing being summarized get limited benefit. Agent and chatbot workloads with shared context get large benefit.
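The pattern can be made quantitative. A sketch assuming cache reads bill at 0.1× the base input rate (Anthropic-style pricing) and ignoring the cache-write premium for brevity:

```python
# Expected input cost per call with a cached static prefix.
# read_discount=0.10 assumes cache reads at 0.1x the base input rate.
def cached_input_cost(total_in, static_prefix, hit_rate,
                      in_rate=3.00 / 1e6, read_discount=0.10):
    dynamic = total_in - static_prefix
    hit = static_prefix * in_rate * read_discount + dynamic * in_rate
    miss = total_in * in_rate
    return hit_rate * hit + (1 - hit_rate) * miss
```

Savings scale with the ratio static_prefix / total_in, which is exactly why the triage agent benefits and the summarizer barely does.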
## Tokenizer-specific surprises
Not all tokenizers are equal, and the differences compound at scale. Claude's tokenizer (a BPE variant trained on a mix heavy in code and multilingual text) tokenizes English at roughly the same rate as GPT-4's cl100k, but tokenizes code 5–10% more efficiently and non-English text 10–25% more efficiently. Gemini's SentencePiece tokenizer is roughly on par for English, worse on code, better on East Asian languages.
The practical effect: the same 1,000-word user message might be 1,320 tokens on Claude, 1,300 on GPT-5, and 1,280 on Gemini. For pure English prose the difference does not move a decision. For a Japanese-language chatbot or a code-review agent running on a 2MB diff, it can swing monthly cost by 15–20% and is worth benchmarking explicitly before committing to a provider.
## Common forecasting mistakes
- Forgetting the conversation history. A chatbot at turn 8 carries all 7 previous turns in input. By turn 10, you are easily paying 3× what you paid at turn 1 per call. Budget for average turn depth, not turn 1.
- Underestimating tool-use traffic. A tool-use agent that sends back tool_result content for the LLM to reason over is paying for that content as input tokens. For a search-heavy agent, tool results can be 2–3× the size of the user message.
- Missing the retry multiplier. At 10% schema-failure rate, effective cost per successful call is 1.1× headline. At 25% (common for agent-style tool use), it is 1.33×. Measure retry rate; bake it into the forecast.
- Ignoring output padding. Models trained to be "helpful" pad responses with "Here is the information you requested..." preambles and closing summaries. A `max_tokens=300` cap plus a terse style instruction routinely cuts output cost 30–40%.
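The retry multiplier is just the expected number of attempts per success, 1 / (1 − failure_rate), assuming independent retries:

```python
# Effective cost per *successful* call under independent retries.
def effective_cost(headline_cost, failure_rate):
    return headline_cost / (1.0 - failure_rate)
```

A 10% failure rate gives a 1.11× multiplier and 25% gives 1.33×, matching the figures above.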
## More frequently asked questions
What tokenizer should I use to count? For GPT family, tiktoken with the cl100k_base or o200k_base encoding. For Claude, the official SDK exposes a countTokens helper that matches server-side counting. For Gemini, the CountTokens endpoint. Do not estimate: an hour spent wiring exact counts into your telemetry pays back the first week.
Is 1.3 tokens per word reliable for prose? Yes, within ±5% for English prose across all major tokenizers. For code, 0.4–0.6 tokens per character is a better heuristic. For JSON, 0.3–0.4 tokens per character due to punctuation overhead.
How do I price a streaming response? Same as non-streaming. Streaming affects latency and UX, not billing. You pay for every token emitted.
Do system prompts count every call? Yes. The system prompt ships on every call, and you pay for it on every call (unless cached). This is why engineering a tight system prompt is worth 2β3 hours of your time.
What is a "safe" headroom multiplier? We use 1.5Γ over the best estimate for the first 90 days in production, tapering to 1.2Γ as telemetry lands. Teams that budget at 1.0Γ are the ones filing emergency spend approvals six weeks in.
Should I model worst-case user behavior? For consumer products, yes. Power users will send 10× the token volume of median users, and the top 1% will send 100× the median. Budget the tail explicitly.
How do I sanity-check vendor invoices? Log input/output tokens on your side and reconcile against the provider's usage dashboard weekly. Both Anthropic and OpenAI publish per-minute usage APIs. Discrepancies over 5% are worth a support ticket.
What if my prompt includes retrieved documents from a user-controlled source? Enforce hard byte limits on ingested content before it hits the model. A 200KB PDF uploaded by a malicious user can otherwise send a single request into the $10+ range. Truncate, summarize, or reject at ingest.
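A minimal ingest guard along those lines. The 50KB cap is an arbitrary example; size it to your own cost ceiling:

```python
# Cap user-supplied content before it reaches the model.
MAX_INGEST_BYTES = 50_000  # arbitrary example cap

def guard_ingest(content: str) -> str:
    raw = content.encode("utf-8")
    if len(raw) > MAX_INGEST_BYTES:
        raise ValueError(
            f"input is {len(raw)} bytes, over the {MAX_INGEST_BYTES}-byte cap")
    return content
```

Whether you reject (as here), truncate, or summarize over-sized input is a product decision; the point is that the check runs before any tokens are billed.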
## Related tools

- LLM API cost calculator – project monthly spend once you have realistic token counts.
- Prompt cache savings – recover 60–80% of input cost when prefixes repeat.
- RAG pipeline cost – full per-query cost including embeddings + vector DB.
- AI content cost per piece – if you're generating marketing output at scale.