AI Economy Hub

RAG pipeline cost

End-to-end RAG cost — embeddings, vector DB, retrieval, and generation per query.

Frequently asked questions

1. Is RAG always cheaper than fine-tuning?

At low query volume, yes. At very high volume with narrow domain, fine-tuning can win. Use the Fine-Tune vs RAG calculator for head-to-head numbers.

2. What about hybrid retrieval?

BM25 + vector hybrid retrieval often needs fewer top-k chunks for the same quality, reducing LLM input cost. Adds minimal vector-DB cost.

3. Do I need to re-embed when I change models?

Yes — embeddings are model-specific. Budget a one-time re-index cost when switching from e.g. OpenAI text-embedding-3 to Voyage.

4. Does context caching help RAG?

Anthropic's prompt cache cuts repeated system-prompt cost, but the retrieved chunks change per query and can't be cached directly. System prompt caching still saves 20–40% when deployed right.

5. What about agentic RAG with multiple calls?

Multi-hop RAG multiplies LLM input cost by the number of hops. Budget 2–3× a single-pass estimate for agentic pipelines.

What a RAG query actually costs in production

A lot of "RAG is cheap" claims are based on one line: "$3 per million input tokens." The reality is a per-query stack of five cost components, none of which are negligible at scale. A realistic production RAG query in April 2026 costs between $0.004 and $0.03, depending on how much work went into optimizing it.

| Component | Typical cost / query | Notes |
|---|---|---|
| Query embedding | $0.000001 | 50 tokens × $0.02/M on text-embedding-3-small |
| Vector DB query | $0.000050 | 1 read on Pinecone Serverless |
| Reranker (Cohere Rerank 3.5) | $0.001 | $1/1k queries, 100 candidates |
| LLM generation (Sonnet 4.5, 3k in / 400 out) | $0.015 | Uncached |
| LLM generation (Sonnet 4.5, 3k in / 400 out, cached prefix) | $0.007 | With prompt caching |
| Observability + logging | $0.0002 | Langfuse / Datadog tokens |
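The component figures above can be summed into the per-query totals the rest of this piece uses. A minimal sketch, with prices hard-coded from the table (point-in-time illustrative figures, not live rates):

```python
# Per-query RAG cost stack, using the component figures from the table above.
COMPONENTS_UNCACHED = {
    "query_embedding": 50 / 1e6 * 0.02,  # 50 tokens at $0.02/M tokens
    "vector_db_read": 0.000050,          # one serverless read
    "reranker": 0.001,                   # $1 per 1k queries
    "llm_generation": 0.015,             # 3k in / 400 out, uncached
    "observability": 0.0002,
}

def per_query_cost(components: dict) -> float:
    return sum(components.values())

uncached = per_query_cost(COMPONENTS_UNCACHED)
cached = per_query_cost({**COMPONENTS_UNCACHED, "llm_generation": 0.007})
print(f"uncached: ${uncached:.4f}/query, cached: ${cached:.4f}/query")
```

LLM generation dominates the stack, which is why the caching and model-tier decisions below matter far more than embedding or vector-DB pricing.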

Why the spread is 7×

The factor-of-seven spread between $0.004 and $0.03 per query reflects architectural decisions, not model pricing. A pipeline that retrieves 12 chunks, skips rerank, runs Opus 4.1 without caching, and generates 1,500-token responses is the expensive end. A pipeline that retrieves 4 chunks after rerank, runs cached Sonnet 4.5 with tight max_tokens, and routes easy queries to Haiku is the cheap end. Neither extreme is wrong for every workload; both should be deliberate choices against a measured quality bar.
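The two ends of the spread can be sketched from token counts alone. The per-million-token rates and the 10% cache discount below are assumptions standing in for current list prices, so treat the outputs as order-of-magnitude, not quotes:

```python
# Two ends of the spread, computed from token counts. Rates ($/M tokens)
# and the 10% cached-prefix discount are assumed, not live prices.
def llm_cost(in_tokens, out_tokens, in_rate, out_rate, cached_tokens=0):
    fresh = in_tokens - cached_tokens
    return (fresh * in_rate
            + cached_tokens * in_rate * 0.1  # cached prefix at ~10% of input rate
            + out_tokens * out_rate) / 1e6

# Expensive end: 12 x 400-token chunks + 1k prompt, no cache, 1,500-token answer.
expensive = llm_cost(12 * 400 + 1000, 1500, in_rate=3.0, out_rate=15.0)
# Cheap end: 4 chunks after rerank, 1k cached prefix, small-model rates, short answer.
cheap = llm_cost(4 * 400 + 1000, 400, in_rate=1.0, out_rate=5.0, cached_tokens=1000)
print(f"expensive: ${expensive:.4f}  cheap: ${cheap:.4f}")
```

Even with these rough assumptions, the architectural choices alone produce a spread wider than 7× before model pricing changes at all.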

The architecture decisions that dominate cost

  1. How many chunks you retrieve. Retrieving k=12 chunks instead of k=4 triples input tokens and often hurts answer quality (context dilution). Measure your relevance cutoff; most production setups settle on k=4–6 after a real eval.
  2. Whether you rerank. Cohere Rerank 3.5 at $1/1k queries cuts retrieval noise materially and lets you drop k from 12 to 4. Net effect is usually cheaper AND more accurate.
  3. Prompt caching on the static prefix. System prompt + retrieval instructions + tool definitions cache nicely; that typically accounts for 1,500–3,000 tokens of input that drops to 10% of its normal cost.
  4. Model tier. Summary/extract/classify: Haiku 4 or GPT-5 mini cuts LLM cost 5×. Full synthesis requires Sonnet 4.5 or GPT-5.

Where the cost surprises come from in production

Most RAG deployments launch at a reasonable per-query cost and drift upward over three to six months. The drift sources are predictable: corpus growth increasing retrieval latency (which invites "just retrieve more chunks to be safe"), feature creep adding multi-hop behavior, telemetry gaps letting a bug route easy queries to Opus, and the slow accretion of prompt-bloat as engineers add "one more instruction" to fix a specific regression. A monthly cost audit with explicit per-component tracking catches all of these before they compound into a budget emergency.

One-time + monthly fixed costs

  • Initial corpus embedding: 10M tokens × $0.02/M = $0.20 on 3-small (one-time).
  • Incremental re-embed (new + changed content): usually 2–5% of corpus per month.
  • Vector DB: $70–$500/mo for 1–10M vectors on a managed service.
  • Evals infra: if you're serious, budget for a weekly eval run on 200 queries. $5–$20/week.
  • Content ingestion pipeline: Unstructured.io, Reducto, or self-built — $0.02–$0.10 per PDF typically.
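A back-of-envelope for the fixed costs above; the vector-DB and evals figures are assumed midpoints of the quoted ranges, and the 3% monthly churn sits inside the 2–5% band:

```python
# Fixed-cost estimate from the bullets above. Vector DB and evals are
# assumed midpoints; 3% monthly corpus churn is an assumed figure.
corpus_tokens = 10_000_000
embed_rate_per_token = 0.02 / 1e6               # text-embedding-3-small, $/token
initial_embed = corpus_tokens * embed_rate_per_token           # one-time
monthly_reembed = 0.03 * corpus_tokens * embed_rate_per_token  # ~3% churn/mo
monthly_fixed = 200.0 + 4 * 10.0 + monthly_reembed  # vector DB + weekly evals + re-embed
print(f"one-time: ${initial_embed:.2f}  monthly fixed: ${monthly_fixed:.2f}")
```

Note how embedding costs are a rounding error next to the managed vector-DB bill; the per-query LLM spend dwarfs both.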

The hidden expensive case

Agentic RAG — where the model issues 2–5 follow-up searches per query — looks great in demos and costs 3–5× what simple RAG costs. Every sub-query re-runs retrieval + LLM. If your product needs it, price accordingly; if it doesn't, a single well-retrieved pass with rerank is usually better.

Common architectural mistakes that explode cost

The three most expensive RAG mistakes we see in production audits: passing too many chunks to the LLM because retrieval quality is not trusted (pay 3× input cost), skipping prompt caching because "we'll add it later" (pay 5× what you need to on every call), and running Opus for synthesis when Sonnet would suffice (pay 5× on every token). Any one of these individually is fixable in a day. All three together can easily triple or quadruple a RAG bill.

Three scaled deployments and their line items

  • Internal knowledge assistant, 2k employees, ~15k queries/day = 450k/mo: Per-query at cached Sonnet 4.5 with rerank = $0.0075. Monthly LLM: $3,375. Vector DB (5M vectors on Qdrant Cloud): $350. Embeddings: $20. Reranker: $450. Evals + obs: $80. Total: ~$4,275/mo — cheap for 2k employees, ~$2.14/employee/mo.
  • Consumer support bot, 250k queries/day = 7.5M/mo: Aggressive cost engineering needed. Haiku 4 for 70% of queries @ $0.002 = $10,500. Sonnet 4.5 for 30% hard queries with caching @ $0.008 = $18,000. Vector DB (15M vectors on Pinecone): $800. Embeddings: $250. Reranker (only on Sonnet queries): $2,250. Total: ~$31,800/mo. Per query: $0.0042 blended.
  • Agentic RAG research assistant, 5k users × 8 queries/day × 3 sub-queries each = 3.6M sub-queries/mo: Per sub-query same as base RAG = $0.0075 × 3.6M = $27k LLM. Add vector DB + embeddings + rerank + obs = $3.5k. Total ~$30.5k/mo. This is why agentic features justify premium pricing at the product layer.
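The three deployments' totals can be reproduced directly from their stated line items, which is a useful sanity check to run against your own numbers:

```python
# Reproducing the three deployments' totals from their line items.
internal = 450_000 * 0.0075 + 350 + 20 + 450 + 80          # knowledge assistant
# Consumer bot: 7.5M queries/mo, 70% routed to Haiku, 30% to cached Sonnet.
consumer = 7_500_000 * (0.7 * 0.002 + 0.3 * 0.008) + 800 + 250 + 2_250
blended = consumer / 7_500_000
# Agentic assistant: users x queries/day x sub-queries x 30 days.
sub_queries = 5_000 * 8 * 3 * 30
agentic = sub_queries * 0.0075 + 3_500
print(f"${internal:,.0f}  ${consumer:,.0f} (${blended:.4f}/query)  ${agentic:,.0f}")
```

Putting each scenario into a five-line script like this makes the monthly cost audit trivial: update the line items, re-run, diff.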

Quality is not independent of cost — optimize together

The biggest mental-model error in RAG cost discussions is treating quality and cost as separate dials. Cheaper retrieval with better rerank often beats expensive retrieval. Cached Sonnet 4.5 often beats uncached Opus 4.1 on both cost and quality. A tighter prompt with clearer instructions often beats a verbose prompt that pushes the model harder. Every serious RAG optimization we have run surfaced at least one change that improved both cost and quality — usually caching, retrieval tuning, or prompt hygiene.

The four decisions that determine RAG economics

  1. Chunk size + overlap. 400-token chunks with 50-token overlap is the default we start with. 800-token chunks with 100-token overlap for long-form content. Going to 1,500+ tokens doubles input cost at retrieval time and rarely helps quality.
  2. Retrieval k. k=4 with good rerank beats k=12 without rerank on every benchmark we have run. And it costs 3× less at inference.
  3. Rerank or not. For high-quality retrieval, rerank wins. For a throwaway demo, skip it. Cohere Rerank 3.5 at $1/1k queries is usually a no-brainer.
  4. Model tier for synthesis. Sonnet 4.5 is the default. Drop to Haiku 4 for simple factual lookups. Escalate to Opus 4.1 only for high-stakes synthesis where the answer is the product.
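Decision #4 is usually implemented as a small router in front of the synthesis call. A hedged sketch: the model names and the keyword heuristic are placeholders (a real deployment would use a cheap classifier model instead of string matching):

```python
# Hypothetical model-tier router for decision #4. Model names and the
# keyword heuristic are illustrative placeholders, not a real API.
def pick_model(query: str, high_stakes: bool = False) -> str:
    if high_stakes:
        return "opus-4.1"          # synthesis where the answer is the product
    simple_markers = ("what is", "when did", "who is", "define")
    if query.lower().startswith(simple_markers):
        return "haiku-4"           # simple factual lookup: 5x cheaper
    return "sonnet-4.5"            # default synthesis tier

print(pick_model("What is our refund policy?"))
print(pick_model("Compare Q3 churn drivers across regions"))
```

The consumer-bot scenario above gets its $0.0042 blended cost from exactly this kind of routing, with ~70% of traffic landing on the cheap tier.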

What "good enough" looks like in each layer

If you are building a new RAG pipeline and want sane defaults without overthinking: OpenAI text-embedding-3-small for embeddings, pgvector or Qdrant for vector storage, Cohere Rerank 3.5 for reranking, cached Sonnet 4.5 for synthesis, Langfuse for observability, k=5 retrieval with a 50-candidate rerank pool, 400-token chunks with 50-token overlap. That stack gets you 80% of the way to frontier quality at roughly $0.008 per query. Optimize from there based on where your evals show weakness, not based on abstract "we should try the fancier thing" arguments.
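The default stack above condenses to a single config block. The key names are an assumed convention for illustration, not any particular framework's schema:

```python
# "Good enough" defaults from the paragraph above, as one config sketch.
# Key names are an assumed convention, not a specific framework's schema.
DEFAULT_RAG_CONFIG = {
    "embedding_model": "text-embedding-3-small",
    "vector_store": "qdrant",           # or pgvector
    "reranker": "cohere-rerank-3.5",
    "synthesis_model": "sonnet-4.5",    # with prompt caching on the static prefix
    "observability": "langfuse",
    "retrieval_k": 5,                   # chunks passed to the LLM after rerank
    "rerank_pool": 50,                  # candidates fetched before rerank
    "chunk_tokens": 400,
    "chunk_overlap_tokens": 50,
}
print(DEFAULT_RAG_CONFIG["synthesis_model"])
```

Keeping these knobs in one place makes the "optimize from evals" loop concrete: every change is a config diff you can cost and measure.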

Caching strategy specific to RAG

RAG is caching-friendly in interesting ways. Three layers to cache:

  • Query embedding cache. Identical queries should hit Redis. Typical hit rate for consumer products: 20–40%.
  • Retrieval result cache. Same query + same corpus version = same results. Cache with corpus-version + query-hash key.
  • LLM prefix cache. System prompt + retrieval instructions cache cleanly. For multi-turn sessions on the same retrieved chunks, cache the retrieval block too.

Stacking all three typically cuts per-query cost 30–50% over a naive pipeline.
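The retrieval-result layer is the easiest to get right: same query plus same corpus version means the same results, so the cache key is just a hash of both. A minimal sketch, with a plain dict standing in for Redis and a stub search function:

```python
import hashlib

# Retrieval-result cache keyed by corpus version + query hash, as described
# above. A dict stands in for Redis; search() is a stub for real retrieval.
retrieval_cache: dict[str, list[str]] = {}

def cache_key(corpus_version: str, query: str) -> str:
    return hashlib.sha256(f"{corpus_version}:{query}".encode()).hexdigest()

def retrieve(query: str, corpus_version: str, search) -> list[str]:
    key = cache_key(corpus_version, query)
    if key not in retrieval_cache:
        retrieval_cache[key] = search(query)   # miss: run real retrieval
    return retrieval_cache[key]

hits = retrieve("refund policy", "v42", lambda q: ["chunk-1", "chunk-7"])
again = retrieve("refund policy", "v42", lambda q: ["should-not-run"])
```

Because the corpus version is part of the key, publishing new content invalidates stale entries implicitly: old keys simply stop being requested.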

Scaling RAG past single-tenant

Multi-tenant RAG adds a layer of considerations above the single-tenant case: per-tenant cost caps, per-tenant retrieval isolation (to prevent cross-tenant leakage), per-tenant eval sets, and usage-based billing that actually reflects the expensive components. In production, we see teams fail on this by sharing a single index with metadata filtering and then paying for the performance degradation at scale, or by over-isolating and paying for 500 idle per-tenant indexes. The middle ground — namespace-per-tenant within a shared index — is usually correct until one tenant is large enough to warrant its own infrastructure.
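The namespace-per-tenant middle ground can be sketched in a few lines. The `upsert`/`query` shape loosely mirrors managed vector-DB SDKs that expose a namespace parameter, but this is an illustrative in-memory stand-in, not a real client:

```python
# Namespace-per-tenant within a shared index: an illustrative in-memory
# stand-in, not a real vector-DB client.
class SharedIndex:
    def __init__(self) -> None:
        self._namespaces: dict[str, dict[str, list[float]]] = {}

    def upsert(self, tenant: str, doc_id: str, vector: list[float]) -> None:
        self._namespaces.setdefault(tenant, {})[doc_id] = vector

    def query(self, tenant: str) -> list[str]:
        # retrieval never crosses the tenant's namespace boundary
        return list(self._namespaces.get(tenant, {}))

idx = SharedIndex()
idx.upsert("acme", "doc-1", [0.1, 0.2])
idx.upsert("globex", "doc-9", [0.3, 0.4])
```

The design point: isolation is enforced at the key level inside one piece of shared infrastructure, so you pay for one index while keeping tenants unable to see each other's chunks.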

Production patterns that matter

  • Corpus versioning. Tag every retrieval with the corpus version used. When content updates, you know exactly which cached answers to invalidate.
  • Query classification before retrieval. A cheap Haiku 4 classifier decides: is this a keyword lookup (use BM25), semantic (use vector), or hybrid? Saves retrieval cost and improves quality.
  • Fallback for empty retrieval. When retrieval returns nothing relevant, do not ship that to the LLM with a vague "say you don't know" fallback. Handle it explicitly with a canned response or an escalation path.
  • Citation enforcement. Use the Anthropic Citations API or constrained decoding to force the model to cite retrieved chunks. Hallucinations drop significantly and users trust the output more.
  • Evals on a schedule. Run a 200-query eval weekly. Alert on quality regression > 2pp. This catches retrieval drift before users complain.
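The empty-retrieval pattern above amounts to a short-circuit in front of the synthesis call: when nothing relevant comes back, spend zero LLM tokens. A minimal sketch, where `generate` is a placeholder for the real synthesis step:

```python
# Explicit empty-retrieval handling: short-circuit before any LLM call
# instead of asking the model to improvise "I don't know".
def answer(query: str, chunks: list[str], generate) -> str:
    if not chunks:
        # canned response / escalation path; no tokens spent
        return "No relevant documentation found. Routing to a human agent."
    return generate(query, chunks)

resp = answer("obscure question", [], None)
```

Besides the quality benefit, this is also a cost control: empty-retrieval queries are exactly the ones most likely to produce long, hallucinated answers.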

Frequently asked questions

What is a realistic target cost per query? $0.005–$0.010 for a well-optimized production pipeline with cached Sonnet 4.5. $0.020+ means you have room to optimize.

Is rerank always worth it? For quality-sensitive workloads, yes. For a cheap throwaway, skip it. The $1/1k queries price is small.

Can I skip the vector DB and just use BM25? For narrow keyword-matching workloads, yes — BM25 (via Elasticsearch or Typesense) is cheaper and sometimes better. For semantic workloads, no.

How often should I re-embed? Only on content changes. Daily incremental updates for active corpora; weekly for mostly-static corpora. Full re-embed only on model change.

What about agentic RAG (search-while-answering)? 3–5× more expensive than single-pass RAG. Use it for research-class queries where multi-hop is required; avoid it for chatbot workloads.

Do I need a separate evals pipeline? Yes. Gold set of 200+ queries with expected answers. Langfuse, Braintrust, or Ragas can orchestrate. Budget $50–200/month for eval infrastructure.

How do I reduce hallucination rate? Better retrieval (rerank, more relevant chunks), citation enforcement, constrained decoding, and per-provider factuality settings. No single fix; stack them.

Is hybrid search (vector + BM25) worth the complexity? Yes, usually. Weaviate, Qdrant, Elasticsearch, and OpenSearch all support it. Typical recall lift of 5–12pp on messy text.
