Prompt caching is the single highest-leverage LLM cost lever in 2026
If your input prompt contains any static prefix — a system message, tool definitions, retrieved documents that appear in multiple turns, style guides, or few-shot examples — prompt caching should be on. Anthropic, OpenAI, and Google all now support some form of it, and on workloads that fit the pattern it cuts input-token cost by 60–90%. Cached reads on Claude Sonnet 4.5 are $0.30/M versus $3.00/M uncached, a 10× difference.
Why caching is higher-leverage than model selection
Ten production audits in, the single change with the largest impact on our clients' monthly bill is always caching, not model choice. Moving from Sonnet 4.5 to Haiku 4 cuts cost by ~4× on the subset of traffic you can route to the cheaper model. Turning on caching for the static prefix cuts cost by 60–85% on all traffic. The caching win is available on workloads where routing is not, because even your hardest queries carry the same system prompt. Engineering work required: an afternoon to add cache_control markers and verify the hit rate in telemetry.
How each provider does it
| Provider | Mechanism | Cached read | Cache write | TTL |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit cache_control block | 10% of input | 125% of input (one-time) | 5 min default, 1 hr tier |
| OpenAI (GPT-5, GPT-4o) | Automatic for prefixes ≥1024 tok | 50% of input | no surcharge | ~5–10 min, opaque |
| Google (Gemini) | Explicit cachedContents API | 25% of input | per-minute storage fee | user-defined |
Anthropic's model rewards heavy hitters: you pay 1.25× for the first request that writes the cache, then 0.10× for every subsequent read inside the TTL. A workload that sees the same 4,000-token system prompt 50 times in five minutes pays about 12% of what it would uncached. The math only collapses if the prefix expires before it is ever read — then you pay the 25% write surcharge for nothing.
Break-even for Claude prompt caching
Using Anthropic's rules (1.25× write, 0.10× read), break-even is a single read: with N reads after the write, total cost is 1.25 + 0.10 × N versus N + 1 uncached. These are equal at N ≈ 0.28, so any prefix that is read even once after being written saves money. The real question is whether your traffic pattern hits the same prefix inside the 5-minute window.
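The arithmetic can be sanity-checked in a few lines (a sketch; the 1.25×/0.10× multipliers are Anthropic's published ones from the table above):

```python
def cached_cost(n_reads, write_mult=1.25, read_mult=0.10):
    """Relative cost of one cache write plus n_reads cached reads,
    in units of one full-price pass over the prefix."""
    return write_mult + read_mult * n_reads

def uncached_cost(n_reads):
    """Same traffic with no caching: every call pays full price for the prefix."""
    return 1 + n_reads

# Break-even: 1.25 + 0.10*N = 1 + N  =>  N = 0.25 / 0.90 ≈ 0.28
print(cached_cost(0), uncached_cost(0))  # 1.25 vs 1 — a never-read cache loses money
print(cached_cost(1), uncached_cost(1))  # 1.35 vs 2 — a single read already wins
```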
Competitive examples from public deployments
Companies that have publicly discussed their LLM cost engineering consistently flag caching as the primary lever. Notion AI's engineering posts highlight caching the shared "document assistant" system prompt as a major cost reduction. Cursor has discussed caching the codebase-aware system context as critical to keeping per-dev costs tractable at their scale. Glean's enterprise deployment uses aggressive prefix caching for the long compliance and citation-format instructions embedded in every query. In each case the architecture is the same: identify the static prefix, mark it cacheable, put the dynamic content after the breakpoint.
What to cache, in priority order
- Tool schemas + system prompt. Almost always static, almost always large, sent on every call. Free money.
- Few-shot examples. 5 worked examples can run 1,500 tokens. Cache them.
- Retrieved documents in multi-turn conversations. If a user session asks 4 questions against the same 10 retrieved chunks, the last 3 calls hit the cache.
- Long static documents users reference repeatedly — a policy document in a compliance assistant, a codebase map in a coding agent.
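This priority order maps directly onto where the breakpoint goes in the request body. A minimal sketch of an Anthropic Messages API payload (no network call is made; the model name, tool schema, and prompt text are illustrative placeholders):

```python
# Static, cacheable prefix: tool schemas + system prompt, marked with cache_control.
# Anthropic orders the prefix tools -> system -> messages, so marking the system
# block caches everything up to and including it, tools included.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_docs",  # placeholder tool schema
            "description": "Search the knowledge base.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. <style guide, few-shot examples...>",
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache ends here
        }
    ],
    "messages": [
        # Dynamic content goes after the breakpoint and is never cached
        {"role": "user", "content": "How do I rotate my API key?"}
    ],
}
```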
What not to cache
- One-shot inputs that never repeat — you pay the 1.25× write penalty for nothing.
- Retrieved content that changes every call (fresh search results) — the cache key changes and you never get a read.
- User PII you shouldn't be persisting in provider caches in the first place.
Dollar impact at production volume
A concrete benchmark we use for comparison: a 1,000-token static system prompt at 1,000,000 queries per minute. Each provider gets the full benefit of its cache model. Over a month (43.2B queries — obviously more than any real deployment, but the ratio scales linearly):
- Anthropic Claude Sonnet 4.5: 1,000 input tokens × 43.2B calls = 43.2T tokens of system prompt traffic. Without caching at $3/M = $129.6M of input cost from the prefix alone. With caching at 97% hit rate (steady traffic, 5-min TTL): write portion 43.2T × 0.03 × $3.75/M = $4.86M, read portion 43.2T × 0.97 × $0.30/M = $12.57M, total $17.43M. Savings: $112.2M/month on the prefix alone, an 87% cut.
- OpenAI GPT-5: 43.2T × $5/M uncached = $216M. Cached at 50% discount: $108M. Savings of $108M, or 50%.
- Google Gemini 2.5 Pro: 43.2T × $1.25/M uncached = $54M. Cached at 75% discount: $13.5M in read cost plus per-minute storage of negligible size. Savings: about $40.5M, a 75% cut.
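The Sonnet 4.5 line works out as follows (a small sketch; the rates are the $3.00/$3.75/$0.30 per-million figures used above, and the hit rate is the assumed 97%):

```python
def monthly_prefix_cost(tokens, hit_rate, uncached_per_m, write_per_m, read_per_m):
    """Dollar cost of prefix traffic, split into cache writes (misses) and reads (hits)."""
    write = tokens * (1 - hit_rate) * write_per_m / 1e6
    read = tokens * hit_rate * read_per_m / 1e6
    return {"write": write, "read": read, "total": write + read,
            "uncached": tokens * uncached_per_m / 1e6}

# 43.2T prefix tokens/month on Sonnet 4.5
c = monthly_prefix_cost(43.2e12, 0.97, 3.00, 3.75, 0.30)
print(f"write ${c['write']/1e6:.2f}M, read ${c['read']/1e6:.2f}M, "
      f"total ${c['total']/1e6:.2f}M vs ${c['uncached']/1e6:.1f}M uncached")
# → write $4.86M, read $12.57M, total $17.43M vs $129.6M uncached (an 87% cut)
```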
Anthropic's 90% read discount is the single most aggressive cache tier in the market, which is why the absolute savings are so large. OpenAI's 50% discount is more modest but automatic — you get it without any code changes as long as your prefix is ≥1024 tokens. Gemini's 75% discount is strong but requires explicit cache management.
Worked example: a smaller, more realistic workload
Most readers do not run 200k QPM. Here is the same math for a mid-sized SaaS chatbot at 150 QPM (~6.5M queries/month), 2,500-token system prompt (detailed role + style + tool schemas) on Sonnet 4.5:
- Monthly prefix tokens: 6.5M × 2,500 = 16.25B tokens.
- Uncached: 16.25B × $3/M = $48,750 per month of system-prompt input cost.
- Cached at 90% hit (steady business-hours traffic): write = 16.25B × 0.10 × $3.75/M = $6,094; read = 16.25B × 0.90 × $0.30/M = $4,388. Total: $10,482/mo.
- Net savings: $38,268/month, on a single flag change in the request body. This is why caching is the first thing we turn on in any engagement.
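Hit rate is the one input in that calculation you do not fully control, so it is worth sweeping (same Sonnet 4.5 rates; the 16.25B monthly prefix tokens are from the example above):

```python
def prefix_cost(hit_rate, tokens=16.25e9):
    """Monthly prefix cost in dollars at a given cache hit rate (Sonnet 4.5 rates)."""
    write_per_m, read_per_m = 3.75, 0.30
    return (tokens * (1 - hit_rate) * write_per_m
            + tokens * hit_rate * read_per_m) / 1e6

uncached = 16.25e9 * 3.00 / 1e6  # $48,750/mo with no caching
for hit in (0.0, 0.5, 0.9, 0.97):
    print(f"hit {hit:.0%}: ${prefix_cost(hit):,.0f}/mo vs ${uncached:,.0f} uncached")
# At 0% hit every call pays the write surcharge and you lose 25%;
# at 90% the cost matches the ~$10,482/mo figure above.
```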
Hit rate is the variable that matters most
All of the math above assumes 85–97% cache hit rates, which are realistic for stable prompts during normal traffic. The ways we have seen hit rate collapse in production:
- Tool schemas serialized non-deterministically. If your SDK serializes JSON with unordered keys, every request has a different prefix and nothing caches. Serialize with sorted keys.
- Timestamps or request IDs in the system prompt. A well-meaning "current time is..." at the top of the prompt invalidates the cache on every call. Move time-varying content below the cache breakpoint.
- User-specific content at the top. Putting the user's name or tenant ID before the static instructions cache-busts. Put shared content first, user-specific content after the `cache_control` breakpoint.
- Sparse traffic outside the TTL. At under 12 calls/hour on the same prefix on the 5-minute Anthropic tier, the cache expires between calls and every call pays the write tax. Switch to the 1-hour tier (2× write, same read price) for low-QPS prefixes.
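The serialization failure in the first bullet is easy to demonstrate (a Python sketch; the dicts are stand-ins for a tool schema):

```python
import json

# Two schemas with identical content but different key insertion order
a = {"name": "search_docs", "description": "Search the knowledge base"}
b = {"description": "Search the knowledge base", "name": "search_docs"}

# Default serialization preserves insertion order -> different bytes ->
# different prefix -> guaranteed cache miss on every request
assert json.dumps(a) != json.dumps(b)

# sort_keys makes the serialized prefix byte-identical, so the cache can hit
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```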
Production patterns for cache-heavy architectures
Patterns we see working well in deployed systems:
- Explicit cache breakpoint at the end of the system message. Keep the first N tokens stable and mark the cache point right there. Cursor, Notion AI, and Glean all follow this pattern; their system prompts are effectively giant constants.
- Pre-warm the cache on deploy. After a system-prompt change, send one synthetic request to populate the cache before real traffic hits. Otherwise the first user eats the 1.25× write tax with no read benefit for themselves.
- Cache retrieved documents for multi-turn sessions. In a RAG assistant where a user may ask 4–6 follow-up questions about the same retrieved context, caching the retrieval block saves 60–80% on subsequent turns.
- Monitor cache-read token counts per call. Every Anthropic response includes `usage.cache_read_input_tokens` and `usage.cache_creation_input_tokens`. Log them; alert if the ratio drops below your expected hit rate, because something just broke your prefix.
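A telemetry hook over those two fields might look like this (a sketch; `usage` is a plain dict standing in for the response's usage object, and the 0.85 floor is an illustrative threshold):

```python
def cache_hit_ratio(usage):
    """Fraction of prefix tokens served from cache on one call."""
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    total = read + write
    return read / total if total else 0.0

def check_cache_health(usages, expected=0.85):
    """Return False (alert) if the average hit ratio drops below the expected floor."""
    avg = sum(cache_hit_ratio(u) for u in usages) / len(usages)
    return avg >= expected

# 9 cache hits and 1 write in the last window: average ratio 0.9, above the floor
calls = [{"cache_read_input_tokens": 4000, "cache_creation_input_tokens": 0}] * 9 \
      + [{"cache_read_input_tokens": 0, "cache_creation_input_tokens": 4000}]
print(check_cache_health(calls))  # → True
```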
Latency side-effect: caching is faster
A rarely-quoted benefit: cached prefixes skip the prefill phase, which is the dominant component of TTFT for long prompts. A 4,000-token cached prompt returns its first token 150–300ms sooner than the uncached version. For real-time UX, this is a user-perceptible improvement on top of the cost savings. For agent loops with many sequential calls, it compounds to multiple seconds saved per task.
Frequently asked questions
Do caches persist across requests from different users? Within the same account/API key, yes — the cache is keyed on prefix content, not user. Across accounts, no. This is why a multi-tenant SaaS sharing a single API key sees higher hit rates than one provisioning per-tenant keys.
Does streaming affect caching? No. Cache reads and writes apply the same way whether you stream or not.
What happens at the 5-minute TTL boundary? The cache is evicted. The first request after expiry pays the 1.25× write tax again. This is why low-QPS workloads should use the 1-hour tier; the write tax doubles but amortizes over far more reads.
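The tier trade-off is quick to quantify (a sketch using the multipliers cited elsewhere in this piece: 1.25× write on the 5-minute tier, 2× on the 1-hour tier, 0.10× reads on both):

```python
def lifetime_cost(reads_served, write_mult):
    """Relative cost of one cache write plus the reads it serves before expiry,
    in units of one full-price pass over the prefix."""
    return write_mult + 0.10 * reads_served

# A prefix called 4 times/hour, evenly spaced (15 min apart):
# on the 5-min tier every call misses, so each one is a fresh write;
# on the 1-hour tier one write serves the other three calls as reads.
five_min_tier = 4 * lifetime_cost(0, 1.25)
one_hour_tier = lifetime_cost(3, 2.00)
print(f"{five_min_tier:.2f} vs {one_hour_tier:.2f}")  # 5.00 vs 2.30 (uncached: 4.00)
```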
Is there a max prefix size? Anthropic allows up to 4 cache breakpoints per request, each covering content up to the model's full context window. In practice, you will hit the token cost wall before the structural limit.
What about OpenAI — do I need to do anything to enable caching? No. It is automatic for any prefix ≥1024 tokens. The response returns usage.prompt_tokens_details.cached_tokens so you can measure hit rate.
Can I cache across models? No. Caches are per-model. Moving from Sonnet 4.5 to Opus 4.1 invalidates the cache. This matters for fallback chains.
Does Gemini caching make sense at low volume? Usually no. The per-minute storage fee is small but present; if your cache sits idle between calls, you pay for storage you are not using. Gemini caching wins at steady high-frequency access.
What is the observability signal that caching is broken? A sudden 10× jump in input-cost-per-request without a traffic change. Alert on input cost per call and investigate whenever it moves more than 25% week-over-week.
Can I mix cached and uncached content in one call? Yes — in fact, this is the common case. Static prefix cached, user message uncached. Anthropic lets you mark up to 4 cache breakpoints; content before each breakpoint is cached, content after is not.
- LLM API cost calculator — base case before caching is applied.
- RAG pipeline cost — RAG benefits most from caching retrieved chunks.
- Fine-tune vs RAG — when caching makes RAG cheaper than fine-tuning.
- Token price compare — compare providers before committing to a caching strategy.