
AI voice cost

Cost of ElevenLabs, PlayHT, or OpenAI TTS for podcasts, audiobooks, and voiceovers.

Results

  • Monthly voice cost: $94.00
  • Subscription: $22.00
  • Overage cost: $72.00
  • Approx. minutes of audio: 667
Insight: A 30-minute podcast is ~27k characters. An hour of audiobook is ~50k. Model your volume in audio-minutes first, then convert.
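The minutes-first conversion can be sketched in a few lines. Assumptions (not from any provider's docs): ~150 spoken words per minute and ~6 characters per word including spaces, which reproduces the ~27k-character figure for a 30-minute episode; the rate passed in is illustrative.

```python
WORDS_PER_MIN = 150   # assumed typical narration pace
CHARS_PER_WORD = 6    # assumed average, including spaces and punctuation

def minutes_to_chars(minutes: float) -> int:
    """Convert audio-minutes to an estimated character count."""
    return int(minutes * WORDS_PER_MIN * CHARS_PER_WORD)

def tts_cost(minutes: float, rate_per_1k_chars: float) -> float:
    """Estimated TTS spend for a duration at a per-1k-character rate."""
    return minutes_to_chars(minutes) / 1000 * rate_per_1k_chars

print(minutes_to_chars(30))          # 27000 chars for a 30-minute episode
print(round(tts_cost(30, 0.18), 2))  # 4.86 dollars at a $0.18/1k rate
```

Model volume this way first, then plug in each provider's actual rate.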


Frequently asked questions

1. ElevenLabs vs. OpenAI vs. PlayHT?

ElevenLabs v3 is the best overall but the priciest. OpenAI TTS-HD is somewhat cheaper with solid quality, and OpenAI's standard tier is ~10× cheaper again. PlayHT sits between them on both price and quality.

2. What about open-source TTS?

Coqui XTTS-v2 and MetaVoice run self-hosted for near-zero marginal cost. Quality is behind commercial options on expressiveness; fine for informational content.

3. Can I monetize AI-voiced podcasts?

Yes on most platforms — but Spotify/Apple require you own commercial rights to the voice. Both ElevenLabs and PlayHT paid plans grant this.

4. What counts as a character?

Usually all text including spaces and punctuation. SSML markup tags don't count, but the text inside them does.

5. How does this compare to human VO?

Human voiceover runs $100–$500 per finished audio-minute for professional talent. AI is ~$0.10 per audio-minute raw — about 1,000× cheaper.

TTS pricing in 2026: three tiers, very different unit economics

Voice generation collapsed into three quality tiers. At the top, ElevenLabs Turbo v3 and OpenAI TTS-HD v2 deliver human-indistinguishable voices at ~$0.18/1k characters. Mid-tier (Play.ht, Resemble) hits ~$0.04/1k. Open models (XTTS, Orpheus-3) run self-hosted at effective rates around $0.002/1k. Picking the tier is about use case, not budget.

| Product | Pricing (per 1k chars) | Best for | Notes |
| --- | --- | --- | --- |
| ElevenLabs Turbo v3 | ~$0.18 | Premium podcasts, audiobooks | Voice cloning, emotion control |
| ElevenLabs Multilingual v2 | ~$0.30 | Long-form narration | Most natural prosody on the market |
| OpenAI TTS-HD v2 | ~$0.15 | Default premium for SaaS UX | Fast, reliable, 9 voices |
| OpenAI TTS (standard) | ~$0.015 | High-volume IVR, chatbots | Good enough quality at 1/10 price |
| Play.ht 2.0 | ~$0.04 | Ad copy, mid-quality podcasts | Thousands of voices |
| Resemble.ai | ~$0.05 | Voice cloning + on-prem | Enterprise + custom voices |
| Cartesia Sonic-2 | ~$0.08 | Lowest-latency realtime | ~90ms time-to-first-audio |
| Self-host XTTS-v3 (L4 GPU) | ~$0.002 | Bulk transformation | OSS, quality below frontier |

How to think about tier selection

Unlike text LLMs, where the right answer is usually a routed mix, voice products tend to pick one tier and stay there. The reason: a voice change mid-conversation is jarring. You do not want a support agent whose voice shifts halfway through a call because you fell off Cartesia onto a cheaper fallback. Pick the tier your listener expects — premium for podcasts and audiobooks, mid-tier for ads, cheap for IVR — and stay consistent.

Typical workload costs

  • 10-minute podcast episode (~1,500 words × 5 chars = 7,500 chars): ElevenLabs $1.35; OpenAI HD $1.13; Play.ht $0.30.
  • 8-hour audiobook (~60k words = 300k chars): ElevenLabs $54; OpenAI standard $4.50 (quality noticeably worse).
  • Support IVR, 100k calls/mo @ 300 chars each = 30M chars/mo: OpenAI standard $450; ElevenLabs would be $5,400.
  • Voice agent, 50k turns/day, 200 chars each = 3M chars/day: Cartesia Sonic-2 ~$240/day for low-latency realtime.
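The bullet arithmetic above is just characters × rate. A minimal sketch, using the approximate per-1k-character prices from the comparison table:

```python
def cost(chars: int, rate_per_1k: float) -> float:
    """Dollar spend for a character volume at a per-1k-char rate."""
    return round(chars / 1000 * rate_per_1k, 2)

podcast_11labs = cost(7_500, 0.18)     # 10-min episode on ElevenLabs
audiobook_std = cost(300_000, 0.015)   # 8-hour book on OpenAI standard
ivr_monthly = cost(30_000_000, 0.015)  # 30M IVR chars/mo on OpenAI standard

print(podcast_11labs, audiobook_std, ivr_monthly)  # 1.35 4.5 450.0
```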

Deciding where to spend the voice budget

Voice is one of those line items where spending more does not always improve the product. A customer-support IVR where callers are already mildly annoyed does not benefit from premium narration. A meditation app where voice quality is the entire product benefits enormously. Allocate budget to the places where listener attention is high and alternative evidence (app store reviews, churn feedback) suggests voice quality moves the metric. Underinvesting on a hero feature and overinvesting on plumbing is a pattern we see repeatedly in voice product audits.

Latency matters more than price in realtime

For conversational voice agents (Retell, Vapi, custom stacks), time-to-first-audio is the product. Cartesia Sonic-2 hits ~90ms. OpenAI Realtime is ~300ms. ElevenLabs Turbo v3 is ~250ms. Above 500ms, the conversation feels broken regardless of voice quality. Choose latency-first, then tune quality and price.

Audio-processing infra around the TTS call

Raw TTS output is not always what you want. Production pipelines typically include loudness normalization (LUFS targeting), silence trimming on the ends, optional lightweight compression, and format conversion to match consumer expectations (MP3 or Opus for web, PCM for telephony). Open-source tools (ffmpeg, sox) handle all of this cheaply, but the engineering cost of building a robust pipeline is real. Budget a week of work for the audio-postprocessing layer in any serious voice deployment.
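A sketch of that postprocessing chain as a single ffmpeg invocation, built from Python. The filter settings (-16 LUFS target, -50 dB silence threshold, 64k Opus) are illustrative defaults, not recommendations from any provider:

```python
def postprocess_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argv: loudness-normalize, trim edge silence,
    and transcode to Opus for web delivery."""
    audio_filters = ",".join([
        "loudnorm=I=-16:TP=-1.5:LRA=11",                        # LUFS targeting
        "silenceremove=start_periods=1:start_threshold=-50dB",  # leading silence
        "areverse",                                             # flip...
        "silenceremove=start_periods=1:start_threshold=-50dB",  # ...trim trailing
        "areverse",                                             # ...flip back
    ])
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filters,
            "-c:a", "libopus", "-b:a", "64k", dst]

# Run it with: subprocess.run(postprocess_cmd("raw.wav", "final.opus"), check=True)
print(" ".join(postprocess_cmd("raw.wav", "final.opus")))
```

For telephony, swap the output args for PCM (`-c:a pcm_s16le -ar 8000`).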

Voice cloning legal/ethical reality

All major providers now require explicit voice-owner consent via a signed statement or live-video verification. ElevenLabs' professional clone needs 30+ minutes of audio plus consent. Using an unauthorized clone is a contract violation with all of them, and you can and will be cut off. Plan the consent workflow before you promise a feature.

The voice-UX decisions that come before pricing

Before comparing voice-provider pricing, decide what voice personality your product needs, whether that voice is consistent across surfaces, and how the voice behaves when upstream TTS fails. These decisions drive provider choice more than per-1k pricing. A voice product that uses one provider on the marketing site, another in-app, and a third on phone calls feels incoherent regardless of how well each individual component is engineered.

Three production deployments with full cost math

  • Audiobook publisher, 40 titles/month averaging 80k words each: 40 × 80k × 5 chars/word = 16M chars/mo. ElevenLabs Multilingual v2 at $0.30/1k chars = $4,800/mo. Compare: human narrator $200–400/hr, 8 hours per title, 40 titles = $64k–128k/mo. AI wins 13–26× on cost; debate is artistic preference, not economics.
  • Voice-first agent, 40k calls/day averaging 45s, of which ~13s per call is agent speech (~190 chars at 14 chars/s) = ~7.5M chars/day = 225M chars/mo: Cartesia Sonic-2 at $0.08/1k = $18k/mo. ElevenLabs Turbo v3 would be $40.5k/mo at similar quality but 2.5× the latency. Pick Cartesia; sub-100ms latency is the product.
  • IVR replacement for a call center, 500k calls/day × ~90s of prompt content each (~1,300 chars): with repeated prompts cached, roughly 600M chars/mo are actually synthesized. OpenAI TTS standard at $0.015/1k = $9k/mo. Quality is more than good enough for "Your call will be answered in approximately..." patterns. ElevenLabs at $108k/mo is indefensible here.

Evaluating voice quality empirically

Public voice demos on provider websites are curated. They are rarely representative of output on the specific domain and tone your product needs. Commission a 100-line evaluation script using text from your actual product (not marketing copy), run it through each candidate at the settings you plan to use, and have 10 target-audience listeners rate the output. This 2-day investment catches voice-quality problems that would otherwise surface in production via user complaints.

Streaming vs. batch audio

For real-time applications, streaming TTS delivers audio chunks as the text arrives, cutting perceived latency from "wait for the whole sentence" to "start hearing it immediately." Cartesia, ElevenLabs Turbo, and OpenAI Realtime all support streaming; Play.ht and non-realtime ElevenLabs modes do not. For a voice agent, streaming is non-negotiable: time-to-first-audio under 300ms determines whether the conversation feels natural.
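The difference is easy to see in a toy harness. `fake_tts_stream` below is a hypothetical stand-in for a provider's streaming SDK (real APIs differ); the timing logic is the point — you measure from request to first chunk, not to last:

```python
import time
from typing import Iterator

def fake_tts_stream(text: str, chunk_chars: int = 40) -> Iterator[bytes]:
    """Pretend synthesizer: yields audio chunks as text is consumed."""
    for i in range(0, len(text), chunk_chars):
        time.sleep(0.01)  # simulated per-chunk synthesis time
        yield text[i:i + chunk_chars].encode()

def time_to_first_audio(stream: Iterator[bytes]) -> float:
    """Seconds until the first chunk arrives: the metric that matters."""
    start = time.monotonic()
    next(stream)
    return time.monotonic() - start

ttfa = time_to_first_audio(fake_tts_stream("Hello, how can I help you?" * 10))
print(f"{ttfa * 1000:.0f} ms")  # one chunk's worth, not the whole utterance
```

With batch synthesis, the listener would wait for every chunk before hearing anything.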

Voice quality varies by language and register

A subtle trap: voice quality benchmarks are usually reported on standard American English at a neutral register. If your product ships voice in Spanish, Mandarin, Hindi, or any regional variant, quality ranking among providers changes. ElevenLabs Multilingual v2 leads most language rankings; Google Cloud TTS is strong on Spanish and Portuguese; Microsoft Azure Speech is competitive for enterprise multi-language deployments. Do not extrapolate from English benchmarks to non-English performance without verification.

Self-hosting open TTS

XTTS-v3 and Orpheus-3 running on an L4 GPU at $0.50/hr hit roughly 15–25× realtime — meaning 1 hour of GPU time produces 15–25 hours of audio. Effective cost: ~$0.02–0.03 per hour of speech, which makes the ~$0.002/1k chars quoted above a conservative ceiling. Quality is noticeably below ElevenLabs but good enough for IVR, drafts, and internal tools. The tradeoff is the usual self-hosting ops work plus voice-cloning legal discipline.
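The self-hosting arithmetic, made explicit. GPU price and realtime factor are this section's figures; the ~14 chars/sec speech rate is an assumption:

```python
GPU_PER_HOUR = 0.50     # L4 on-demand, $/hr
REALTIME_FACTOR = 20    # midpoint of the 15-25x realtime range
CHARS_PER_SEC = 14      # assumed synthesized speech rate

# One GPU-hour produces REALTIME_FACTOR hours of audio.
cost_per_audio_hour = GPU_PER_HOUR / REALTIME_FACTOR
cost_per_1k_chars = cost_per_audio_hour / (CHARS_PER_SEC * 3600) * 1000

print(round(cost_per_audio_hour, 3))  # 0.025 -- about $0.02/hour of speech
print(round(cost_per_1k_chars, 4))    # well under the table's ~$0.002/1k
```

The per-1k figure moves with the speech-rate assumption, but stays far below any commercial tier.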

Reliability vs. price tradeoffs

Voice providers have uneven reliability records. ElevenLabs has had outages lasting multiple hours in the past year that broke any customer dependent on them as a single provider. Cartesia and OpenAI Realtime have been more stable but both have had incidents. A real voice product needs multi-provider fallback, which in turn affects how you think about voice-identity consistency — customers will eventually hear the fallback voice if the primary is down.

Quality dimensions that matter

  • Prosody. ElevenLabs and Cartesia lead on natural rhythm and emphasis. Matters for narration, podcasts.
  • Emotion control. ElevenLabs v3 supports explicit emotion tags; OpenAI does not. Matters for character-driven content.
  • Multilingual. ElevenLabs Multilingual v2 handles 30+ languages with native-quality prosody. For global products, critical.
  • Pronunciation of domain terms. All models mispronounce proper nouns and jargon. Use SSML or phoneme overrides; budget a pronunciation dictionary maintenance step.
  • Consistency across sessions. ElevenLabs and Cartesia maintain voice identity across generations; cheaper providers drift.
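One way to implement the pronunciation-dictionary step from the list above: substitute known-problem terms with SSML `<sub>` (or `<phoneme>`) overrides before synthesis. The dictionary entries here are illustrative:

```python
import re

PRONUNCIATIONS = {
    # term -> SSML substitution; <phoneme> with IPA also works where supported
    "PostgreSQL": '<sub alias="post-gres-cue-ell">PostgreSQL</sub>',
    "nginx": '<sub alias="engine x">nginx</sub>',
}

def apply_pronunciations(text: str) -> str:
    """Wrap known terms in SSML substitutions, word-boundary aware."""
    for term, ssml in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(term)}\b", ssml, text)
    return text

print(apply_pronunciations("Deploy nginx in front."))
# Deploy <sub alias="engine x">nginx</sub> in front.
```

Keep the dictionary in version control and grow it from listener reports; this is the "maintenance step" the bullet budgets for.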

Production patterns

  • Cache repeated phrases. An IVR that says "Please hold" 500k times should synthesize it once and serve from CDN. Obvious but frequently missed.
  • Pre-synthesize common responses. For a chatbot with a finite menu of confirmations, batch-generate the 50 most common responses ahead of time.
  • Fallback chain. Primary on Cartesia for low latency, fallback to ElevenLabs when Cartesia is slow, fallback to OpenAI standard when both fail. Three-tier fallback keeps voice UX working during provider incidents.
  • Monitor WER on the output. Transcribe back a sample of generated audio with Whisper; compare to input text. Word error rate creeping up signals a voice-quality regression before users complain.
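The WER monitoring in the last bullet needs only a word-level edit distance between the input text and a transcript of the generated audio (e.g. from Whisper). A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("please hold the line", "please hold the line"))  # 0.0
print(wer("please hold the line", "please old the lime"))   # 0.5
```

Alert when a rolling sample's WER drifts above its baseline; some irreducible error comes from the transcription model itself.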

Frequently asked questions

Which TTS for podcasting? ElevenLabs Multilingual v2 is the clear choice: human-indistinguishable output that justifies its price tier. Play.ht is a reasonable mid-budget alternative.

Which for customer support voice bots? Cartesia Sonic-2 for low-latency realtime, OpenAI Realtime if you are already on OpenAI. Avoid non-streaming providers.

Can I clone a celebrity voice? No — all major providers prohibit it without explicit consent. Doing so also exposes you to right-of-publicity claims.

Does SSML matter? Yes. Proper SSML for pauses, emphasis, and phonemes lifts quality noticeably. Invest 2 engineer-days in SSML templating and it compounds.
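A sketch of the SSML templating that answer recommends: join script segments with explicit pauses and emphasis. The tags shown are standard SSML, but providers vary in which tags they honor, so check each provider's docs:

```python
def ssml_paragraphs(paragraphs: list[str], pause_ms: int = 600) -> str:
    """Join paragraphs inside <speak> with explicit <break> pauses."""
    brk = f'<break time="{pause_ms}ms"/>'
    body = brk.join(f"<p>{p}</p>" for p in paragraphs)
    return f"<speak>{body}</speak>"

def emphasize(text: str, level: str = "moderate") -> str:
    """Wrap a phrase in an SSML emphasis tag."""
    return f'<emphasis level="{level}">{text}</emphasis>'

print(ssml_paragraphs(["Welcome back.", emphasize("Big news today.")]))
```

Templating like this is what makes the 2-engineer-day investment compound: every script generated afterward inherits the pacing.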

How do I evaluate voice quality objectively? Pay 10 listeners to rate 50 samples on a 5-point scale across naturalness, clarity, and emotion. Repeat monthly. Human ratings correlate with product outcomes better than MOS-style automated metrics.

Do I need an SSML dictionary for product names? Yes, if your product has a distinctive brand name. Most models mispronounce common tech terms by default.

Can I mix TTS with real narration? Yes, and many podcast workflows do — AI narrator for synthesized content, human for interviews. Match the voice identity carefully with a cloned AI variant of the human voice.

Is realtime voice latency solved? Below 300ms TTFT is solved by Cartesia. Below 150ms requires careful engineering of streaming chunk boundaries. Below 100ms is only possible in constrained setups.
