Embedding costs are small until they are not
Embeddings feel free in a prototype. At 8,192 tokens per document and OpenAI text-embedding-3-small at $0.02 per million tokens, indexing a 10,000-document corpus runs you about $1.60. Then you go to production, rebuild the index every week on 2 million docs, serve 200,000 queries a day, and the embedding line item quietly becomes real money.
The difference between prototype math and production math
Embedding cost at the prototype stage is one of those line items where the spreadsheet shows $1.50 and feels irrelevant; six months later the same line reads $3,200. The growth path is typical: you index 10k docs once ($1.50), add incremental ingestion ($5/mo), re-embed on a model change (one-time $200), add a second corpus ($15/mo), traffic grows 10×, you move retrieval-critical flows to a better embedding model, and suddenly the line item is material. For budgeting purposes, treat embedding cost as a permanent operational expense that compounds with document growth and query growth independently.
| Provider / Model | Price / 1M tokens | Dimensions | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.020 | 1536 | Default for most teams; great quality/$ |
| OpenAI text-embedding-3-large | $0.130 | 3072 | Use for retrieval-critical workloads |
| Voyage voyage-3 | $0.060 | 1024 | Strong on domain-specific retrieval |
| Voyage voyage-3-large | $0.180 | 1024 | Highest-quality general retrieval |
| Cohere embed-v4 | $0.100 | 1536 | Multimodal + multilingual |
| Google text-embedding-005 | $0.025 | 768 | Well-integrated with Vertex + BigQuery |
| OSS (bge-large-en, self-host) | ~$0.001 effective | 1024 | GPU cost only; needs ops |
The line items in a real embedding invoice
An embedding bill has more moving parts than an LLM bill. A typical monthly statement breaks into: initial corpus indexing (fixed, one-time), incremental indexing of new or changed content (monthly), query-time embedding (per call), batch re-embeds on model migrations (annual or rare), and in some cases a hosted embedding-as-a-service fee for pipeline management. The first three account for 95%+ of the cost in most real deployments.
Where the cost actually sits
Embedding cost has two components: indexing (embed the corpus once, re-embed on changes) and query-time (embed the user query on every call). Most teams dramatically overpay on indexing by re-embedding the entire corpus when they could be doing incremental updates.
- Indexing: corpus_tokens × price. A corpus of 10M tokens costs $0.20 on text-embedding-3-small and $1.30 on text-embedding-3-large.
- Query: queries × avg_query_tokens × price. At 50 tokens per query × 200k queries/day × 30 days = 300M tokens/month, which is $6/month on 3-small or $39 on 3-large.
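The two formulas above can be checked with a small calculator. The prices are the per-1M-token rates quoted in the table; the corpus size and traffic figures are the worked examples from the bullets.

```python
# Reproduces the indexing-side and query-side math above.

PRICES = {"3-small": 0.020, "3-large": 0.130}  # $ per 1M tokens

def indexing_cost(corpus_tokens, model):
    """One-time cost to embed an entire corpus."""
    return corpus_tokens / 1e6 * PRICES[model]

def query_cost(queries_per_day, tokens_per_query, model, days=30):
    """Monthly cost of embedding every incoming query."""
    total_tokens = queries_per_day * tokens_per_query * days
    return total_tokens / 1e6 * PRICES[model]

print(round(indexing_cost(10_000_000, "3-small"), 2))  # 0.2
print(round(indexing_cost(10_000_000, "3-large"), 2))  # 1.3
print(round(query_cost(200_000, 50, "3-small"), 2))    # 6.0
print(round(query_cost(200_000, 50, "3-large"), 2))    # 39.0
```

Note that the query side dominates once traffic is high: here it is 30× the one-time indexing cost per month on the same model.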
Evaluating your actual retrieval quality
The single best investment you can make on embedding quality is not a model swap — it is a labeled evaluation set. 100 queries with human-verified relevant documents lets you measure recall@5, recall@10, and MRR on every candidate configuration. Without that dataset, model-selection arguments are about vibes; with it, the right answer is obvious within an afternoon of benchmarking. Most embedding-model overpayment is teams buying "premium" models because they cannot prove the cheaper one is worse.
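The metrics involved are small enough to implement directly. A minimal sketch: for each labeled query you have the ids of the human-verified relevant documents and the ranked ids a candidate configuration returned; recall@k and MRR over that set are enough to compare configurations. The example data is illustrative.

```python
# recall@k: fraction of relevant docs that appear in the top k results.
# MRR: reciprocal rank of the first relevant doc, averaged over queries.

def recall_at_k(relevant, ranked, k):
    hits = len(set(relevant) & set(ranked[:k]))
    return hits / len(relevant)

def mrr(relevant, ranked):
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# (relevant_ids, ranked_ids_from_retriever) per labeled query
labeled = [
    ({"d1", "d7"}, ["d3", "d1", "d9", "d7", "d2"]),
    ({"d4"},       ["d4", "d8", "d5", "d6", "d0"]),
]
r5 = sum(recall_at_k(rel, rk, 5) for rel, rk in labeled) / len(labeled)
m  = sum(mrr(rel, rk) for rel, rk in labeled) / len(labeled)
print(r5, m)  # 1.0 0.75
```

Run the same loop once per candidate embedding model and the comparison stops being about vibes.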
Choosing a model: don't default
The recall lift from going from text-embedding-3-small to text-embedding-3-large on domain-specific corpora is usually 2–5 percentage points. On a medical or legal corpus that your business depends on, that is worth the 6.5× price. On a marketing-content RAG where users tolerate mediocre retrieval, it is pure overspend.
Voyage-3-large consistently places at or near the top of public retrieval benchmarks but is 9× the price of 3-small. Run a retrieval eval on 100 representative queries with known-correct documents, measure recall@5, and let the delta decide.
Indexing cadence matters more than model choice
Teams that launch a RAG product with a weekly full re-embed typically pay 5–10× what teams with incremental-only updates pay, for indistinguishable quality. Build change detection at the document level from day one: hash each chunk, re-embed only on hash change, and keep a ledger so you can verify what got updated. This alone tends to be the largest single embedding cost optimization.
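The hash-and-ledger pattern fits in a few lines. This is a sketch: `embed()` is a placeholder for your provider call, and the in-memory dict stands in for whatever store backs your real ledger.

```python
import hashlib

ledger = {}  # chunk_id -> (content_hash, vector)

def embed(text):
    # placeholder: call your embedding API here
    return [0.0] * 4

def upsert_chunks(chunks):
    """chunks: dict of chunk_id -> text. Re-embeds only changed chunks
    and returns the ids that were actually re-embedded."""
    touched = []
    for cid, text in chunks.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if ledger.get(cid, (None, None))[0] != h:
            ledger[cid] = (h, embed(text))
            touched.append(cid)
    return touched

print(upsert_chunks({"a": "hello", "b": "world"}))   # ['a', 'b']
print(upsert_chunks({"a": "hello", "b": "world!"}))  # ['b']
```

On the second pass only the changed chunk is re-embedded; on a weekly run over a large corpus that is the difference between paying for 2% of the tokens and paying for all of them.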
Self-hosting open embeddings
At scale (>500M tokens/month indexing), self-hosting a good open model like bge-large-en-v1.5 or gte-large-en-v1.5 on an L4 GPU at ~$0.70/hr undercuts OpenAI by 20–50×. The tradeoff is ops: model serving, batching, monitoring, retries. For most teams, that time is worth more than the savings. For vector-heavy platforms (search engines, code search, large doc stores), it stops being optional.
Three production corpora and their real bills
Abstract pricing rarely lines up with what you actually spend. Three real (anonymized) workloads we have priced or built in the past year:
- Customer support KB, 180k articles, 50M tokens, 300k daily queries: initial indexing on OpenAI 3-small = $1. Monthly re-embed of 4% drift = $0.04. Query side: 300k × 30 × 60 tokens/query × $0.02/M = $10.80/mo. Total embedding cost: $11/mo. The vector DB on pgvector is ~$200/mo. The LLM reading results is $4,200/mo. Embeddings are rounding error here.
- Code search across 40k repos, 1.8B tokens of source + docstrings: initial indexing on Voyage code-3 at $0.12/M = $216. Monthly re-embed on changed files (~2% daily) = 1.8B × 0.02 × 30 × $0.12/M = $130/mo. Query side is negligible. Total: ~$135/mo steady-state — the Voyage premium over OpenAI 3-small is worth it here because code retrieval quality is the product.
- Enterprise RAG on 2M internal documents, 6B tokens: self-hosted bge-large on a pair of L4 GPUs at $0.50/hr reserved each = $720/mo for always-on infrastructure, but that infrastructure also serves query-time embedding for 50M queries a month. The equivalent on OpenAI 3-large would be ~$780/mo on indexing + query alone, and self-hosting also keeps proprietary docs away from a third party. Self-hosting pays off at this scale and meets the compliance bar.
Recall vs. cost: the numbers that actually matter
On BEIR-style public benchmarks in 2026, the gap between OpenAI 3-small and Voyage 3-large is about 4pp recall@10. On domain-specific corpora (medical, legal, code), the gap can widen to 8–12pp. The question is not "which is objectively better" but "how much does 4–8pp recall@10 matter in my downstream task?" For a chatbot where users tolerate retrieval noise, it does not. For a medical decision-support tool, it is the entire product.
Dimensionality math
Vector dimensions translate directly into storage and memory. A 1,536-dim float32 vector is 6.1KB; a 3,072-dim vector is 12.3KB. At 10M vectors, that is the difference between 61GB and 123GB of index. Pinecone Serverless bills partly on storage, so the high-dim model doubles your vector-DB line item. Most retrieval evals show single-digit recall deltas from going 1,536 → 3,072; rarely worth the 2× storage cost.
Some embedding models support Matryoshka truncation — you can take a 3,072-dim vector and truncate to 768 dims with a ~1.5pp recall loss. If your embedding model supports it (OpenAI 3-large does, Voyage 3-large does), this is a free 4× storage reduction.
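Truncation itself is trivial: keep the first d dimensions and re-normalize so cosine similarity still behaves. The sketch below shows the mechanics; it is only valid for models trained with a Matryoshka-style objective, as plain truncation of an ordinary embedding destroys quality.

```python
import math

def truncate(vec, d):
    """Keep the first d dims and re-normalize to unit length."""
    head = vec[:d]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

v = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]   # toy 6-dim "embedding"
t = truncate(v, 3)
print(len(t))                          # 3
print(round(sum(x * x for x in t), 6)) # 1.0 (unit length preserved)
```

Storage falls linearly with d: a 3,072-dim float32 vector is 12.3KB, the 768-dim truncation is 3.1KB.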
Chunking strategy beats model selection
A common mistake: teams evaluate embedding models on poorly-chunked data, conclude the models are similar, and never chase chunking quality. In practice, chunking strategy (window size, overlap, sentence-aware boundaries, header preservation) swings recall@10 by 10–20pp — more than any embedding-model swap. Before you spend $0.18/M on Voyage 3-large, verify that your chunker is not dumping 8,000-token walls of text through a window of 1,500.
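A minimal sentence-aware chunker makes the point concrete: pack whole sentences into windows of at most max_tokens, carrying a fixed sentence overlap between windows. Token counting here is naive whitespace splitting; swap in your real tokenizer.

```python
import re

def chunk(text, max_tokens=200, overlap_sentences=1):
    """Split text into sentence-aligned chunks with sentence overlap."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, cur, cur_tokens = [], [], 0
    for s in sentences:
        n = len(s.split())
        if cur and cur_tokens + n > max_tokens:
            chunks.append(" ".join(cur))
            cur = cur[-overlap_sentences:]  # carry overlap into next chunk
            cur_tokens = sum(len(x.split()) for x in cur)
        cur.append(s)
        cur_tokens += n
    if cur:
        chunks.append(" ".join(cur))
    return chunks

doc = "One sentence here. Another follows. A third one. And a fourth."
print(chunk(doc, max_tokens=6))
```

Boundaries never land mid-sentence, and the overlap keeps context that straddles a boundary retrievable from both sides.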
Reranker as a force multiplier
Cohere Rerank 3.5 at $1 per 1,000 queries (100-document candidate pool) consistently improves downstream answer quality by 5–15pp. It takes a larger retrieval set (say, the top 50 by cosine similarity), rescores each candidate with a cross-encoder, and keeps the top 10. This is almost always cheaper than upgrading to a better base embedder, because the reranker runs only at query time, not at indexing.
Voyage rerank-2.5 is a competitive alternative at similar pricing. Self-hosted bge-reranker or mxbai-rerank-large on an L4 costs ~$0.0001 per query at reasonable QPS.
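The two-stage shape is the same whichever reranker you pick. In this sketch `cross_encoder_score()` is a stand-in (here, crude word overlap) for a call to a hosted or self-hosted cross-encoder; only the wiring is the point.

```python
def cross_encoder_score(query, doc):
    # placeholder: a real cross-encoder scores the (query, doc) pair jointly
    return len(set(query.split()) & set(doc.split()))

def rerank(query, candidates, top_n=10):
    """Rescore a wide candidate pool and keep the best top_n."""
    scored = [(cross_encoder_score(query, d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]

# stage 1 (dense retrieval) would produce this pool; stage 2 reranks it
pool = ["reset your password", "billing faq", "password reset steps"]
print(rerank("how do I reset my password", pool, top_n=2))
```

The cost asymmetry is the whole argument: the embedder touches every corpus token at indexing time, while the reranker only ever sees the small per-query candidate pool.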
Frequently asked questions
Should I use separate embeddings for queries vs documents? If your model supports it (OpenAI's do not; Voyage's and Cohere's do), yes: asymmetric embedding lifts recall by 2–4pp. Asymmetric mode treats the query as short and sparse and the document as long and dense, and handles the mismatch better.
Do I need to re-embed when my model ages? You never need to — old vectors still work — but providers periodically retire embedding models with 6–12 months notice. Budget a re-embed migration roughly once per 18 months.
Can I mix embedding models in the same index? No. Vector spaces are not compatible across models. You must re-embed the entire corpus when changing.
What is the cheapest way to embed 1B tokens? Self-hosted bge-large on a single L4 GPU can embed roughly 4–8M tokens per hour at good batch sizes. 1B tokens is ~150 GPU-hours at $0.50/hr = $75. The caveat is that you need the pipeline plumbing.
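The arithmetic is easy to parameterize for your own corpus; the throughput and hourly rate below are the assumptions from the answer above.

```python
def self_host_cost(tokens, tokens_per_gpu_hour, dollars_per_hour):
    """GPU-hours and dollars to embed a corpus at a given throughput."""
    hours = tokens / tokens_per_gpu_hour
    return hours, hours * dollars_per_hour

# 1B tokens at ~6.7M tokens/GPU-hour on a $0.50/hr L4
hours, cost = self_host_cost(1_000_000_000, 6_700_000, 0.50)
print(round(hours), round(cost))  # ~149 GPU-hours, ~$75
```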
Is binary quantization viable? Yes, for storage-constrained workloads. Binary-quantized embeddings lose 3–8pp recall but cut storage 32×. Combined with a small rerank pass, net quality is often on par with full-precision.
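The mechanics of binary quantization are simple enough to show inline: keep only the sign of each dimension (float32 to 1 bit is the 32× reduction) and compare vectors with Hamming distance. A toy sketch:

```python
def binarize(vec):
    """1 bit per dimension: the sign of each component."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of differing bits; lower means more similar."""
    return sum(x != y for x, y in zip(a, b))

q  = binarize([0.2, -0.7, 0.1, 0.9])
d1 = binarize([0.3, -0.5, 0.2, 0.8])    # points the same way as q
d2 = binarize([-0.4, 0.6, -0.1, -0.9])  # points the opposite way
print(hamming(q, d1), hamming(q, d2))   # 0 4
```

The usual deployment keeps the binary index for the cheap first pass and reranks the survivors against full-precision vectors or a cross-encoder.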
How do I detect if my embedding model is underperforming? Build a gold set of 100 query/relevant-doc pairs. Measure recall@10. Re-measure every 3 months and whenever you change the chunker or retrieval parameters.
Does query-time caching work for embeddings? Yes, trivially — identical queries should return cached vectors. Redis with a 24-hour TTL cuts query-side embedding cost 30–70% on a typical consumer workload.
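The cache is a hash of the normalized query mapped to its vector. Sketched here with a plain dict and a placeholder `embed()`; in production the same logic sits in front of Redis with a TTL.

```python
import hashlib

cache = {}
calls = {"n": 0}  # counts how often the (billed) provider is hit

def embed(text):
    # placeholder for the provider call
    calls["n"] += 1
    return [float(len(text))]

def cached_embed(query):
    """Normalize, hash, and memoize the query embedding."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = embed(query)
    return cache[key]

cached_embed("How do I reset my password?")
cached_embed("how do i reset my password?")  # hits the cache
print(calls["n"])  # 1
```

Even light normalization (strip, lowercase) folds many near-duplicate consumer queries into one cache entry.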
How does multilingual embedding pricing compare? Cohere embed-v4 and Voyage multilingual variants cost about 1.5–2× the English-only prices. For multi-language products, the premium is worth it over hacking English-trained models.
- RAG pipeline cost — full picture: embeddings + vector DB + LLM.
- Vector database cost — Pinecone vs. pgvector vs. Weaviate hosting.
- GPU inference cost — if you are evaluating self-hosted embeddings.
- Fine-tune vs RAG — when a fine-tune beats scaling RAG infra.