TTS pricing in 2026: three tiers, very different unit economics
Voice generation collapsed into three quality tiers. At the top, ElevenLabs Turbo v3 and OpenAI TTS-HD v2 deliver human-indistinguishable voices at $0.15–0.18 per 1k characters. Mid-tier (Play.ht, Resemble) hits ~$0.04/1k. Open models (XTTS, Orpheus-3) run self-hosted at effective rates around $0.002/1k. Picking the tier is about use case, not budget.
| Product | Pricing (per 1k chars) | Best for | Notes |
|---|---|---|---|
| ElevenLabs Turbo v3 | ~$0.18 | Premium podcasts, audiobooks | Voice cloning, emotion control |
| ElevenLabs Multilingual v2 | ~$0.30 | Long-form narration | Most natural prosody on the market |
| OpenAI TTS-HD v2 | ~$0.15 | Default premium for SaaS UX | Fast, reliable, 9 voices |
| OpenAI TTS (standard) | ~$0.015 | High-volume IVR, chatbots | Good enough quality at 1/10 price |
| Play.ht 2.0 | ~$0.04 | Ad copy, mid-quality podcasts | Thousands of voices |
| Resemble.ai | ~$0.05 | Voice cloning + on-prem | Enterprise + custom voices |
| Cartesia Sonic-2 | ~$0.08 | Lowest-latency realtime | ~90ms time-to-first-audio |
| Self-host XTTS-v3 (L4 GPU) | ~$0.002 | Bulk transformation | OSS, quality below frontier |
How to think about tier selection
Unlike text LLMs where the right answer is usually a routed mix, voice tends to pick one tier and stay there. The reason: voice quality is jarring to switch mid-conversation. You do not want a support agent whose voice changes halfway through the call because you fell off Cartesia onto a cheaper fallback. Pick the tier your listener expects — premium for podcasts and audiobooks, mid-tier for ads, cheap for IVR — and stay consistent.
Typical workload costs
- 10-minute podcast episode (~1,500 words × 5 chars = 7,500 chars): ElevenLabs Turbo v3 $1.35; OpenAI HD $1.13; Play.ht $0.30.
- 8-hour audiobook (~60k words = 300k chars): ElevenLabs Turbo v3 $54 (Multilingual v2 $90); OpenAI standard $4.50 (quality noticeably worse).
- Support IVR, 100k calls/mo @ 300 chars each = 30M chars/mo: OpenAI standard $450; ElevenLabs would be $5,400.
- Voice agent, 50k turns/day, 200 chars each = 3M chars/day: Cartesia Sonic-2 ~$240/day for low-latency realtime.
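The bullet math above is easy to reproduce. A minimal sketch, treating the per-1k-character prices from the comparison table as flat rates (real bills add subscription minimums and overage tiers; the provider keys are illustrative):

```python
# Flat per-1k-char rates from the comparison table (illustrative keys;
# real billing adds subscription tiers and minimums).
RATES_PER_1K = {
    "elevenlabs_turbo_v3": 0.18,
    "openai_tts_hd_v2": 0.15,
    "openai_tts_standard": 0.015,
    "playht_2": 0.04,
    "cartesia_sonic2": 0.08,
}

def tts_cost(chars: int, provider: str) -> float:
    """Dollar cost of synthesizing `chars` characters with `provider`."""
    return chars / 1000 * RATES_PER_1K[provider]

# 10-minute podcast episode: ~1,500 words x 5 chars/word
episode_chars = 1500 * 5
print(round(tts_cost(episode_chars, "elevenlabs_turbo_v3"), 2))   # 1.35

# IVR workload: 100k calls/mo x 300 chars = 30M chars/mo
print(round(tts_cost(30_000_000, "openai_tts_standard"), 2))      # 450.0
```

The same helper scales to any of the workloads in this list by swapping the character count and provider key.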
Deciding where to spend the voice budget
Voice is one of those line items where spending more does not always improve the product. A customer-support IVR where callers are already mildly annoyed does not benefit from premium narration. A meditation app where voice quality is the entire product benefits enormously. Allocate budget to the places where listener attention is high and alternative evidence (app store reviews, churn feedback) suggests voice quality moves the metric. Underinvesting on a hero feature and overinvesting on plumbing is a pattern we see repeatedly in voice product audits.
Latency matters more than price in realtime
For conversational voice agents (Retell, Vapi, custom stacks), time-to-first-audio is the product. Cartesia Sonic-2 hits ~90ms. OpenAI Realtime is ~300ms. ElevenLabs Turbo v3 is ~250ms. Above 500ms, the conversation feels broken regardless of voice quality. Choose latency-first, then tune quality and price.
Audio-processing infra around the TTS call
Raw TTS output is not always what you want. Production pipelines typically include loudness normalization (LUFS targeting), silence trimming on the ends, optional lightweight compression, and format conversion to match consumer expectations (MP3 or Opus for web, PCM for telephony). Open-source tools (ffmpeg, sox) handle all of this cheaply, but the engineering cost of building a robust pipeline is real. Budget a week of work for the audio-postprocessing layer in any serious voice deployment.
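A sketch of that pipeline as a single ffmpeg invocation driven from Python. The loudness target (-16 LUFS), silence threshold, and Opus bitrate are illustrative defaults, not provider recommendations:

```python
import subprocess

def build_postprocess_cmd(src: str, dst: str, lufs: float = -16.0) -> list[str]:
    """Build an ffmpeg command that trims edge silence, normalizes
    loudness to `lufs`, and encodes to Opus for web delivery."""
    # Trim leading silence, reverse, trim again, reverse back:
    # the standard ffmpeg trick for trimming both ends.
    trim = ("silenceremove=start_periods=1:start_threshold=-50dB,"
            "areverse,"
            "silenceremove=start_periods=1:start_threshold=-50dB,"
            "areverse")
    loudnorm = f"loudnorm=I={lufs}:TP=-1.5:LRA=11"
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"{trim},{loudnorm}",
        "-c:a", "libopus", "-b:a", "48k",
        dst,
    ]

def postprocess(src: str, dst: str) -> None:
    subprocess.run(build_postprocess_cmd(src, dst), check=True)
```

For telephony output, swap the Opus arguments for `-ar 8000 -f mulaw` or whatever your PBX expects; the normalization and trim stages stay the same.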
Voice cloning legal/ethical reality
All major providers now require explicit voice-owner consent via a signed statement or live-video verification. ElevenLabs' professional clone needs 30+ minutes of audio plus consent. Using an unauthorized clone is a contract violation with all of them, and you can and will be cut off. Plan the consent workflow before you promise a feature.
The voice-UX decisions that come before pricing
Before comparing voice-provider pricing, decide what voice personality your product needs, whether that voice is consistent across surfaces, and how the voice behaves when upstream TTS fails. These decisions drive provider choice more than per-1k pricing. A voice product that uses one provider on the marketing site, another in-app, and a third on phone calls feels incoherent regardless of how well each individual component is engineered.
Three production deployments with full cost math
- Audiobook publisher, 40 titles/month averaging 80k words each: 40 × 80k × 5 chars/word = 16M chars/mo. ElevenLabs Multilingual v2 at $0.30/1k chars = $4,800/mo. Compare: human narrator $200–400/hr, 8 hours per title, 40 titles = $64k–128k/mo. AI wins 13–26× on cost; debate is artistic preference, not economics.
- Voice-first agent, 40k calls/day averaging 45s, of which roughly 13s per call is agent speech (@ ~14 chars/s ≈ 190 chars/call) ≈ 7.5M chars/day = 225M chars/mo: Cartesia Sonic-2 at $0.08/1k = $18k/mo. ElevenLabs Turbo v3 = $40.5k/mo at similar quality but 2.5× the latency. Pick Cartesia: sub-100ms latency is the product.
- IVR replacement for a call center, 500k calls/day; with the static hold prompts cached, only ~40 chars of dynamic text per call need synthesis ≈ 600M chars/mo: OpenAI TTS standard at $0.015/1k = $9k/mo. Quality is more than good enough for "Your call will be answered in approximately..." patterns. ElevenLabs at $108k/mo is indefensible here.
Evaluating voice quality empirically
Public voice demos on provider websites are curated. They are rarely representative of output on the specific domain and tone your product needs. Commission a 100-line evaluation script using text from your actual product (not marketing copy), run it through each candidate at the settings you plan to use, and have 10 target-audience listeners rate the output. This 2-day investment catches voice-quality problems that would otherwise surface in production via user complaints.
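Aggregating the listener panel takes only a few lines. A sketch; the provider names and scores are placeholders:

```python
import statistics

def summarize_ratings(ratings: dict[str, list[int]]) -> dict[str, tuple[float, float]]:
    """Mean rating and approximate 95% confidence half-width per provider.
    Scores are 1-5 ratings from target-audience listeners."""
    out = {}
    for provider, scores in ratings.items():
        mean = statistics.fmean(scores)
        # Normal-approximation CI; adequate for panels of ~30+ samples.
        half = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
        out[provider] = (round(mean, 2), round(half, 2))
    return out

# Placeholder panel: 10 listeners x 2 candidate providers.
panel = {
    "provider_a": [5, 4, 5, 4, 4, 5, 3, 4, 5, 4],
    "provider_b": [3, 4, 3, 3, 4, 2, 3, 4, 3, 3],
}
print(summarize_ratings(panel))
```

If the confidence intervals of two providers overlap heavily, the quality difference probably will not move your product metric, and price or latency should decide.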
Streaming vs. batch audio
For real-time applications, streaming TTS delivers audio chunks as the text arrives, cutting perceived latency from "wait for the whole sentence" to "start hearing it immediately." Cartesia, ElevenLabs Turbo, and OpenAI Realtime all support streaming. Play.ht and non-realtime ElevenLabs modes do not. For a voice agent, streaming is non-negotiable; time-to-first-audio under 300ms determines whether the conversation feels natural.
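Measuring time-to-first-audio is provider-agnostic if you treat the SDK response as an iterator of audio chunks. A sketch (the chunk iterator stands in for whatever your provider's streaming SDK returns):

```python
import time
from collections.abc import Iterable

def measure_ttfa(chunks: Iterable[bytes]) -> tuple[float, bytes]:
    """Consume a streaming-TTS chunk iterator and return
    (seconds until the first audio chunk arrived, full audio bytes)."""
    start = time.monotonic()
    ttfa = None
    audio = bytearray()
    for chunk in chunks:
        if ttfa is None:
            ttfa = time.monotonic() - start  # first chunk: the number that matters
        audio.extend(chunk)
    # No chunks at all means the stream failed; report infinite latency.
    return (ttfa if ttfa is not None else float("inf")), bytes(audio)
```

Log the first element of the tuple per request; a p95 time-to-first-audio dashboard catches provider slowdowns long before users describe the agent as "laggy."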
Voice quality varies by language and register
A subtle trap: voice quality benchmarks are usually reported on standard American English at a neutral register. If your product ships voice in Spanish, Mandarin, Hindi, or any regional variant, quality ranking among providers changes. ElevenLabs Multilingual v2 leads most language rankings; Google Cloud TTS is strong on Spanish and Portuguese; Microsoft Azure Speech is competitive for enterprise multi-language deployments. Do not extrapolate from English benchmarks to non-English performance without verification.
Self-hosting open TTS
XTTS-v3 and Orpheus-3 running on an L4 GPU at $0.50/hr hit roughly 15–25× realtime — meaning 1 hour of GPU time produces 15–25 hours of audio. Effective cost: ~$0.02 per hour of speech, or $0.002/1k chars. Quality is noticeably below ElevenLabs but good enough for IVR, drafts, and internal tools. Tradeoff is the usual self-hosting ops work plus voice-cloning legal discipline.
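The effective-rate arithmetic, as a sketch. The ~14 chars/s speech rate and the utilization factor are assumptions: at full utilization the ideal rate comes out well under $0.002/1k, and the commonly quoted figure is what you land on once the GPU sits partly idle between jobs:

```python
def selfhost_cost_per_1k(gpu_usd_per_hr: float = 0.50,
                         realtime_factor: float = 25.0,
                         chars_per_sec: float = 14.0,
                         utilization: float = 1.0) -> float:
    """Effective $/1k chars for self-hosted TTS on a rented GPU.
    realtime_factor: audio-hours produced per GPU-hour at full load.
    utilization: fraction of paid GPU time spent on useful synthesis."""
    usd_per_speech_hr = gpu_usd_per_hr / (realtime_factor * utilization)
    chars_per_hr = chars_per_sec * 3600
    return usd_per_speech_hr / chars_per_hr * 1000

print(round(selfhost_cost_per_1k(), 5))                 # 0.0004 (ideal, 100% busy)
print(round(selfhost_cost_per_1k(utilization=0.2), 5))  # 0.00198 (~the quoted $0.002/1k)
```

The gap between the two numbers is the real self-hosting lesson: batch your synthesis jobs to keep the GPU busy, or the per-character advantage shrinks fast.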
Reliability vs. price tradeoffs
Voice providers have uneven reliability records. ElevenLabs has had outages lasting multiple hours in the past year that broke any customer dependent on them as a single provider. Cartesia and OpenAI Realtime have been more stable but both have had incidents. A real voice product needs multi-provider fallback, which in turn affects how you think about voice-identity consistency — customers will eventually hear the fallback voice if the primary is down.
Quality dimensions that matter
- Prosody. ElevenLabs and Cartesia lead on natural rhythm and emphasis. Matters for narration, podcasts.
- Emotion control. ElevenLabs v3 supports explicit emotion tags; OpenAI does not. Matters for character-driven content.
- Multilingual. ElevenLabs Multilingual v2 handles 30+ languages with native-quality prosody. For global products, critical.
- Pronunciation of domain terms. All models mispronounce proper nouns and jargon. Use SSML or phoneme overrides; budget a pronunciation dictionary maintenance step.
- Consistency across sessions. ElevenLabs and Cartesia maintain voice identity across generations; cheaper providers drift.
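The pronunciation-dictionary step above can be a simple substitution pass before synthesis. A sketch; the dictionary entries and IPA strings are illustrative, and whether `<phoneme>` is honored at all depends on the provider:

```python
# Hypothetical pronunciation dictionary; terms and IPA strings are
# illustrative examples, not vetted transcriptions.
PHONEMES = {
    "Kubernetes": "ˌkubɚˈnɛtiz",
    "nginx": "ˈɛndʒɪnˌɛks",
}

def to_ssml(text: str) -> str:
    """Wrap known domain terms in SSML <phoneme> tags so the TTS
    engine uses the dictionary pronunciation instead of guessing."""
    for term, ipa in PHONEMES.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{term}</phoneme>'
        text = text.replace(term, tag)
    return f"<speak>{text}</speak>"

print(to_ssml("Deploy it on Kubernetes behind nginx."))
```

Keep the dictionary in version control and add an entry every time support flags a mispronunciation; that is the "maintenance step" the bullet warns about.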
Production patterns
- Cache repeated phrases. An IVR that says "Please hold" 500k times should synthesize it once and serve from CDN. Obvious but frequently missed.
- Pre-synthesize common responses. For a chatbot with a finite menu of confirmations, batch-generate the 50 most common responses ahead of time.
- Fallback chain. Primary on Cartesia for low latency, fallback to ElevenLabs when Cartesia is slow, fallback to OpenAI standard when both fail. Three-tier fallback keeps voice UX working during provider incidents.
- Monitor WER on the output. Transcribe back a sample of generated audio with Whisper; compare to input text. Word error rate creeping up signals a voice-quality regression before users complain.
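The fallback chain can be sketched as a preference-ordered loop with a latency budget; the `synth` callables stand in for the real provider SDK calls (Cartesia, ElevenLabs, OpenAI):

```python
import concurrent.futures as cf

def synthesize_with_fallback(text: str, providers, timeout_s: float = 1.0) -> bytes:
    """Try each (name, synth) pair in preference order; move on if one
    errors or blows the latency budget. Raises only if every tier fails."""
    pool = cf.ThreadPoolExecutor(max_workers=len(providers))
    try:
        last_err = None
        for name, synth in providers:
            future = pool.submit(synth, text)
            try:
                return future.result(timeout=timeout_s)
            except Exception as err:  # provider error or TimeoutError
                future.cancel()
                last_err = err
        raise RuntimeError(f"all TTS providers failed: {last_err}")
    finally:
        # Don't block on a hung provider thread while returning.
        pool.shutdown(wait=False)
```

Per the voice-identity caveat above, log every fallback event: each one is a moment a customer heard a different voice, and the rate of those events is a product metric, not just an ops metric.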
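For the WER monitor, a self-contained word-error-rate function; the transcribe-back step (Whisper on a sample of generated audio) is assumed to happen upstream and hand you the hypothesis string:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length.
    Compare the TTS input text (reference) against a transcript of the
    generated audio (hypothesis)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard Wagner-Fischer edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("please hold the line", "please hold the line"))  # 0.0
print(wer("please hold the line", "please hold my line"))   # 0.25
```

Alert on the trend, not the absolute value: Whisper itself contributes a baseline error rate, so what matters is WER drifting upward against that baseline.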
Frequently asked questions
Which TTS for podcasting? ElevenLabs Multilingual v2 is the clear choice: the most natural prosody on the market, at a price that still undercuts human narration by an order of magnitude. Play.ht is a reasonable mid-budget alternative.
Which for customer support voice bots? Cartesia Sonic-2 for low-latency realtime, OpenAI Realtime if you are already on OpenAI. Avoid non-streaming providers.
Can I clone a celebrity voice? No — all major providers prohibit it without explicit consent. Doing so also exposes you to right-of-publicity claims.
Does SSML matter? Yes. Proper SSML for pauses, emphasis, and phonemes lifts quality noticeably. Invest 2 engineer-days in SSML templating and it compounds.
How do I evaluate voice quality objectively? Pay 10 listeners to rate 50 samples on a 5-point scale across naturalness, clarity, and emotion. Repeat monthly. Human ratings correlate with product outcomes better than MOS-style automated metrics.
Do I need an SSML dictionary for product names? Yes, if your product has a distinctive brand name. Most models mispronounce common tech terms by default.
Can I mix TTS with real narration? Yes, and many podcast workflows do — AI narrator for synthesized content, human for interviews. Match the voice identity carefully with a cloned AI variant of the human voice.
Is realtime voice latency solved? Time-to-first-audio below 300ms is solved by Cartesia. Below 150ms requires careful engineering of streaming chunk boundaries. Below 100ms is only possible in constrained setups.
Related reading
- AI transcription ROI: the inverse problem, speech to text.
- AI content cost per piece: voice as part of full-stack content.
- AI video cost: voice typically ships alongside video.
- Compute break-even: self-hosted XTTS at volume.