Embedding costs are small until they are not
Embeddings feel free in a prototype. At 8,192 tokens per document and OpenAI text-embedding-3-small at $0.02 per million tokens, indexing a 10,000-document corpus runs you about $1.60. Then you go to production, rebuild the index every week on 2 million docs, serve 200,000 queries a day, and the embedding line item quietly becomes real money.
The difference between prototype math and production math
Embedding cost at the prototype stage is one of those line items where the spreadsheet shows $1.50 and feels irrelevant; six months later the same line reads $3,200. The growth path is typical: you index 10k docs once ($1.50), add incremental ingestion ($5/mo), re-embed on a model change (one-time $200), add a second corpus ($15/mo), traffic grows 10×, you move retrieval-critical flows to a better embedding model, and suddenly the line item is material. For budgeting purposes, treat embedding cost as a permanent operational expense that compounds with document growth and query growth independently.
| Provider / Model | Price / 1M tokens | Dimensions | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.020 | 1536 | Default for most teams; great quality/$ |
| OpenAI text-embedding-3-large | $0.130 | 3072 | Use for retrieval-critical workloads |
| Voyage voyage-3 | $0.060 | 1024 | Strong on domain-specific retrieval |
| Voyage voyage-3-large | $0.180 | 1024 | Highest-quality general retrieval |
| Cohere embed-v4 | $0.100 | 1536 | Multimodal + multilingual |
| Google text-embedding-005 | $0.025 | 768 | Well-integrated with Vertex + BigQuery |
| OSS (bge-large-en, self-host) | ~$0.001 effective | 1024 | GPU cost only; needs ops |
The line items in a real embedding invoice
An embedding bill has more moving parts than an LLM bill. A typical monthly statement breaks into: initial corpus indexing (fixed, one-time), incremental indexing of new or changed content (monthly), query-time embedding (per call), batch re-embeds on model migrations (annual or rare), and in some cases a hosted embedding-as-a-service fee for pipeline management. The first three account for 95%+ of the cost in most real deployments.
Where the cost actually sits
Embedding cost has two components: indexing (embed the corpus once, re-embed on changes) and query-time (embed the user query on every call). Most teams dramatically overpay on indexing by re-embedding the entire corpus when they could be doing incremental updates.
- Indexing: corpus_tokens × price. A corpus of 10M tokens costs $0.20 on text-embedding-3-small and $1.30 on text-embedding-3-large.
- Query: queries × avg_query_tokens × price. At 50 tokens per query × 200k queries/day × 30 days = 300M tokens/month, which is $6/month on 3-small or $39 on 3-large.
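The two formulas above can be checked with a small calculator. The prices are the per-1M-token rates quoted in the table; the corpus size and traffic figures are the worked examples from the bullets.

```python
# Reproduces the indexing-side and query-side math above.

PRICES = {"3-small": 0.020, "3-large": 0.130}  # $ per 1M tokens

def indexing_cost(corpus_tokens, model):
    """One-time cost to embed an entire corpus."""
    return corpus_tokens / 1e6 * PRICES[model]

def query_cost(queries_per_day, tokens_per_query, model, days=30):
    """Monthly cost of embedding every incoming query."""
    total_tokens = queries_per_day * tokens_per_query * days
    return total_tokens / 1e6 * PRICES[model]

print(round(indexing_cost(10_000_000, "3-small"), 2))  # 0.2
print(round(indexing_cost(10_000_000, "3-large"), 2))  # 1.3
print(round(query_cost(200_000, 50, "3-small"), 2))    # 6.0
print(round(query_cost(200_000, 50, "3-large"), 2))    # 39.0
```

Note that the query side dominates once traffic is high: here it is 30× the one-time indexing cost per month on the same model.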
Evaluating your actual retrieval quality
The single best investment you can make on embedding quality is not a model swap — it is a labeled evaluation set. 100 queries with human-verified relevant documents lets you measure recall@5, recall@10, and MRR on every candidate configuration. Without that dataset, model-selection arguments are about vibes; with it, the right answer is obvious within an afternoon of benchmarking. Most embedding-model overpayment is teams buying "premium" models because they cannot prove the cheaper one is worse.
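The metrics involved are small enough to implement directly. A minimal sketch: for each labeled query you have the ids of the human-verified relevant documents and the ranked ids a candidate configuration returned; recall@k and MRR over that set are enough to compare configurations. The example data is illustrative.

```python
# recall@k: fraction of relevant docs that appear in the top k results.
# MRR: reciprocal rank of the first relevant doc, averaged over queries.

def recall_at_k(relevant, ranked, k):
    hits = len(set(relevant) & set(ranked[:k]))
    return hits / len(relevant)

def mrr(relevant, ranked):
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# (relevant_ids, ranked_ids_from_retriever) per labeled query
labeled = [
    ({"d1", "d7"}, ["d3", "d1", "d9", "d7", "d2"]),
    ({"d4"},       ["d4", "d8", "d5", "d6", "d0"]),
]
r5 = sum(recall_at_k(rel, rk, 5) for rel, rk in labeled) / len(labeled)
m  = sum(mrr(rel, rk) for rel, rk in labeled) / len(labeled)
print(r5, m)  # 1.0 0.75
```

Run the same loop once per candidate embedding model and the comparison stops being about vibes.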
Choosing a model: don't default
The recall lift from going from text-embedding-3-small to text-embedding-3-large on domain-specific corpora is usually 2–5 percentage points. On a medical or legal corpus that your business depends on, that is worth the 6.5× price. On a marketing-content RAG where users tolerate mediocre retrieval, it is pure overspend.
Voyage-3-large consistently places at or near the top of public retrieval benchmarks but is 9× the price of 3-small. Run a retrieval eval on 100 representative queries with known-correct documents, measure recall@5, and let the delta decide.
Indexing cadence matters more than model choice
Teams that launch a RAG product with a weekly full re-embed typically pay 5–10× what teams with incremental-only updates pay, for indistinguishable quality. Build change detection at the document level from day one: hash each chunk, re-embed only on hash change, and keep a ledger so you can verify what got updated. This alone tends to be the largest single embedding cost optimization.
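The hash-and-ledger pattern fits in a few lines. This is a sketch: `embed()` is a placeholder for your provider call, and the in-memory dict stands in for whatever store backs your real ledger.

```python
import hashlib

ledger = {}  # chunk_id -> (content_hash, vector)

def embed(text):
    # placeholder: call your embedding API here
    return [0.0] * 4

def upsert_chunks(chunks):
    """chunks: dict of chunk_id -> text. Re-embeds only changed chunks
    and returns the ids that were actually re-embedded."""
    touched = []
    for cid, text in chunks.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if ledger.get(cid, (None, None))[0] != h:
            ledger[cid] = (h, embed(text))
            touched.append(cid)
    return touched

print(upsert_chunks({"a": "hello", "b": "world"}))   # ['a', 'b']
print(upsert_chunks({"a": "hello", "b": "world!"}))  # ['b']
```

On the second pass only the changed chunk is re-embedded; on a weekly run over a large corpus that is the difference between paying for 2% of the tokens and paying for all of them.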
Self-hosting open embeddings
At scale (>500M tokens/month indexing), self-hosting a good open model like bge-large-en-v1.5 or gte-large-en-v1.5 on an L4 GPU at ~$0.70/hr undercuts OpenAI by 20–50×. The tradeoff is ops: model serving, batching, monitoring, retries. For most teams, that time is worth more than the savings. For vector-heavy platforms (search engines, code search, large doc stores), it stops being optional.
Three production corpora and their real bills
Abstract pricing rarely lines up with what you actually spend. Three real (anonymized) workloads we have priced or built in the past year:
- Customer support KB, 180k articles, 50M tokens, 300k daily queries: initial indexing on OpenAI 3-small = $1. Monthly re-embed of 4% drift = $0.04. Query side: 300k × 30 × 60 tokens/query × $0.02/M = $10.80/mo. Total embedding cost: $11/mo. The vector DB on pgvector is ~$200/mo. The LLM reading results is $4,200/mo. Embeddings are rounding error here.
- Code search across 40k repos, 1.8B tokens of source + docstrings: initial indexing on Voyage code-3 at $0.12/M = $216. Monthly re-embed on changed files (~2% daily) = 1.8B × 0.02 × 30 × $0.12/M = $130/mo. Query side is negligible. Total: ~$135/mo steady-state — the Voyage premium over OpenAI 3-small is worth it here because code retrieval quality is the product.
- Enterprise RAG on 2M internal documents, 6B tokens: self-hosted bge-large on a pair of L4 GPUs at $0.50/hr reserved each = $720/mo for always-on infrastructure, but that infrastructure also serves query-time embedding for 50M queries a month. The equivalent on OpenAI 3-large would be ~$780/mo on indexing + query alone, and self-hosting also keeps proprietary docs away from a third party. Self-hosting pays off at this scale and meets the compliance bar.
Recall vs. cost: the numbers that actually matter
On BEIR-style public benchmarks in 2026, the gap between OpenAI 3-small and Voyage 3-large is about 4pp recall@10. On domain-specific corpora (medical, legal, code), the gap can widen to 8–12pp. The question is not "which is objectively better" but "how much does 4–8pp recall@10 matter in my downstream task?" For a chatbot where users tolerate retrieval noise, it does not. For a medical decision-support tool, it is the entire product.
Dimensionality math
Vector dimensions translate directly into storage and memory. A 1,536-dim float32 vector is 6.1KB; a 3,072-dim vector is 12.3KB. At 10M vectors, that is the difference between 61GB and 123GB of index. Pinecone Serverless bills partly on storage, so the high-dim model doubles your vector-DB line item. Most retrieval evals show single-digit recall deltas from going 1,536 → 3,072; rarely worth the 2× storage cost.
Some embedding models support Matryoshka truncation — you can take a 3,072-dim vector and truncate to 768 dims with a ~1.5pp recall loss. If your embedding model supports it (OpenAI 3-large does, Voyage 3-large does), this is a free 4× storage reduction.
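Truncation itself is trivial: keep the first d dimensions and re-normalize so cosine similarity still behaves. The sketch below shows the mechanics; it is only valid for models trained with a Matryoshka-style objective, as plain truncation of an ordinary embedding destroys quality.

```python
import math

def truncate(vec, d):
    """Keep the first d dims and re-normalize to unit length."""
    head = vec[:d]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

v = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]   # toy 6-dim "embedding"
t = truncate(v, 3)
print(len(t))                          # 3
print(round(sum(x * x for x in t), 6)) # 1.0 (unit length preserved)
```

Storage falls linearly with d: a 3,072-dim float32 vector is 12.3KB, the 768-dim truncation is 3.1KB.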
Chunking strategy beats model selection
A common mistake: teams evaluate embedding models on poorly-chunked data, conclude the models are similar, and never chase chunking quality. In practice, chunking strategy (window size, overlap, sentence-aware boundaries, header preservation) swings recall@10 by 10–20pp — more than any embedding-model swap. Before you spend $0.18/M on Voyage 3-large, verify that your chunker is not dumping 8,000-token walls of text through a window of 1,500.
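A minimal sentence-aware chunker makes the point concrete: pack whole sentences into windows of at most max_tokens, carrying a fixed sentence overlap between windows. Token counting here is naive whitespace splitting; swap in your real tokenizer.

```python
import re

def chunk(text, max_tokens=200, overlap_sentences=1):
    """Split text into sentence-aligned chunks with sentence overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur, cur_tokens = [], [], 0
    for s in sentences:
        n = len(s.split())
        if cur and cur_tokens + n > max_tokens:
            chunks.append(" ".join(cur))
            cur = cur[-overlap_sentences:]  # carry overlap into next chunk
            cur_tokens = sum(len(x.split()) for x in cur)
        cur.append(s)
        cur_tokens += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks

doc = "One sentence here. Another follows. A third one. And a fourth."
print(chunk(doc, max_tokens=6))
```

Boundaries never land mid-sentence, and the overlap keeps context that straddles a boundary retrievable from both sides.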
Reranker as a force multiplier
Cohere Rerank 3.5 at $1 per 1,000 queries (100-document candidate pool) consistently improves downstream answer quality by 5–15pp. It takes a larger retrieval set (say, the top 50 by cosine similarity), rescores each candidate with a cross-encoder, and keeps the top 10. This is almost always cheaper than upgrading to a better base embedder, because the reranker runs only at query time, not at indexing.
Voyage rerank-2.5 is a competitive alternative at similar pricing. Self-hosted bge-reranker or mxbai-rerank-large on an L4 costs ~$0.0001 per query at reasonable QPS.
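The two-stage shape is the same whichever reranker you pick. In this sketch `cross_encoder_score()` is a stand-in (here, crude word overlap) for a call to a hosted or self-hosted cross-encoder; only the wiring is the point.

```python
def cross_encoder_score(query, doc):
    # placeholder: a real cross-encoder scores the (query, doc) pair jointly
    return len(set(query.split()) & set(doc.split()))

def rerank(query, candidates, top_n=10):
    """Rescore a wide candidate pool and keep the best top_n."""
    scored = [(cross_encoder_score(query, d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]

# stage 1 (dense retrieval) would produce this pool; stage 2 reranks it
pool = ["reset your password", "billing faq", "password reset steps"]
print(rerank("how do I reset my password", pool, top_n=2))
```

The cost asymmetry is the whole argument: the embedder touches every corpus token at indexing time, while the reranker only ever sees the small per-query candidate pool.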
Frequently asked questions
Should I use separate embeddings for queries vs documents? If your model supports it (OpenAI's do not; Voyage's and Cohere's do), yes: asymmetric embedding lifts recall by 2–4pp. Asymmetric mode treats the query as short and sparse and the document as long and dense, and handles the mismatch better.
Do I need to re-embed when my model ages? You never need to — old vectors still work — but providers periodically retire embedding models with 6–12 months notice. Budget a re-embed migration roughly once per 18 months.
Can I mix embedding models in the same index? No. Vector spaces are not compatible across models. You must re-embed the entire corpus when changing.
What is the cheapest way to embed 1B tokens? Self-hosted bge-large on a single L4 GPU can embed roughly 4–8M tokens per hour at good batch sizes. 1B tokens is ~150 GPU-hours at $0.50/hr = $75. The caveat is that you need the pipeline plumbing.
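The arithmetic is easy to parameterize for your own corpus; the throughput and hourly rate below are the assumptions from the answer above.

```python
def self_host_cost(tokens, tokens_per_gpu_hour, dollars_per_hour):
    """GPU-hours and dollars to embed a corpus at a given throughput."""
    hours = tokens / tokens_per_gpu_hour
    return hours, hours * dollars_per_hour

# 1B tokens at ~6.7M tokens/GPU-hour on a $0.50/hr L4
hours, cost = self_host_cost(1_000_000_000, 6_700_000, 0.50)
print(round(hours), round(cost))  # ~149 GPU-hours, ~$75
```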
Is binary quantization viable? Yes, for storage-constrained workloads. Binary-quantized embeddings lose 3–8pp recall but cut storage 32×. Combined with a small rerank pass, net quality is often on par with full-precision.
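The mechanics of binary quantization are simple enough to show inline: keep only the sign of each dimension (float32 to 1 bit is the 32× reduction) and compare vectors with Hamming distance. A toy sketch:

```python
def binarize(vec):
    """1 bit per dimension: the sign of each component."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of differing bits; lower means more similar."""
    return sum(x != y for x, y in zip(a, b))

q  = binarize([0.2, -0.7, 0.1, 0.9])
d1 = binarize([0.3, -0.5, 0.2, 0.8])    # points the same way as q
d2 = binarize([-0.4, 0.6, -0.1, -0.9])  # points the opposite way
print(hamming(q, d1), hamming(q, d2))   # 0 4
```

The usual deployment keeps the binary index for the cheap first pass and reranks the survivors against full-precision vectors or a cross-encoder.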
How do I detect if my embedding model is underperforming? Build a gold set of 100 query/relevant-doc pairs. Measure recall@10. Re-measure every 3 months and whenever you change the chunker or retrieval parameters.
Does query-time caching work for embeddings? Yes, trivially — identical queries should return cached vectors. Redis with a 24-hour TTL cuts query-side embedding cost 30–70% on a typical consumer workload.
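The cache is a hash of the normalized query mapped to its vector. Sketched here with a plain dict and a placeholder `embed()`; in production the same logic sits in front of Redis with a TTL.

```python
import hashlib

cache = {}
calls = {"n": 0}  # counts how often the (billed) provider is hit

def embed(text):
    # placeholder for the provider call
    calls["n"] += 1
    return [float(len(text))]

def cached_embed(query):
    """Normalize, hash, and memoize the query embedding."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = embed(query)
    return cache[key]

cached_embed("How do I reset my password?")
cached_embed("how do i reset my password?")  # hits the cache
print(calls["n"])  # 1
```

Even light normalization (strip, lowercase) folds many near-duplicate consumer queries into one cache entry.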
How does multilingual embedding pricing compare? Cohere embed-v4 and Voyage multilingual variants cost about 1.5–2× the English-only prices. For multi-language products, the premium is worth it over hacking English-trained models.
- RAG pipeline cost — full picture: embeddings + vector DB + LLM.
- Vector database cost — Pinecone vs. pgvector vs. Weaviate hosting.
- GPU inference cost — if you are evaluating self-hosted embeddings.
- Fine-tune vs RAG — when a fine-tune beats scaling RAG infra.