TTS pricing in 2026: three tiers, very different unit economics
Voice generation collapsed into three quality tiers. At the top, ElevenLabs Turbo v3 and OpenAI TTS-HD v2 deliver human-indistinguishable voices at $0.15–0.18 per 1k characters. Mid-tier (Play.ht, Resemble) hits ~$0.04/1k. Open models (XTTS, Orpheus-3) run self-hosted at effective rates around $0.002/1k. Picking the tier is about use case, not budget.
| Product | Pricing (per 1k chars) | Best for | Notes |
|---|---|---|---|
| ElevenLabs Turbo v3 | ~$0.18 | Premium podcasts, audiobooks | Voice cloning, emotion control |
| ElevenLabs Multilingual v2 | ~$0.30 | Long-form narration | Most natural prosody on the market |
| OpenAI TTS-HD v2 | ~$0.15 | Default premium for SaaS UX | Fast, reliable, 9 voices |
| OpenAI TTS (standard) | ~$0.015 | High-volume IVR, chatbots | Good enough quality at 1/10 price |
| Play.ht 2.0 | ~$0.04 | Ad copy, mid-quality podcasts | Thousands of voices |
| Resemble.ai | ~$0.05 | Voice cloning + on-prem | Enterprise + custom voices |
| Cartesia Sonic-2 | ~$0.08 | Lowest-latency realtime | ~90ms time-to-first-audio |
| Self-host XTTS-v3 (L4 GPU) | ~$0.002 | Bulk transformation | OSS, quality below frontier |
How to think about tier selection
Unlike text LLMs where the right answer is usually a routed mix, voice tends to pick one tier and stay there. The reason: voice quality is jarring to switch mid-conversation. You do not want a support agent whose voice changes halfway through the call because you fell off Cartesia onto a cheaper fallback. Pick the tier your listener expects — premium for podcasts and audiobooks, mid-tier for ads, cheap for IVR — and stay consistent.
Typical workload costs
- 10-minute podcast episode (~1,500 words × 5 chars = 7,500 chars): ElevenLabs Turbo v3 $1.35; OpenAI HD $1.13; Play.ht $0.30.
- 8-hour audiobook (~60k words = 300k chars): ElevenLabs Turbo v3 $54 (Multilingual v2 $90); OpenAI standard $4.50 (quality noticeably worse).
- Support IVR, 100k calls/mo @ 300 chars each = 30M chars/mo: OpenAI standard $450; ElevenLabs would be $5,400.
- Voice agent, 50k turns/day, 200 chars each = 3M chars/day: Cartesia Sonic-2 ~$240/day for low-latency realtime.
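The bullet math above is easy to reproduce. A minimal sketch, treating the per-1k-character prices from the comparison table as flat rates (real bills add subscription minimums and overage tiers; the provider keys are illustrative):

```python
# Flat per-1k-char rates from the comparison table (illustrative keys;
# real billing adds subscription tiers and minimums).
RATES_PER_1K = {
    "elevenlabs_turbo_v3": 0.18,
    "openai_tts_hd_v2": 0.15,
    "openai_tts_standard": 0.015,
    "playht_2": 0.04,
    "cartesia_sonic2": 0.08,
}

def tts_cost(chars: int, provider: str) -> float:
    """Dollar cost of synthesizing `chars` characters with `provider`."""
    return chars / 1000 * RATES_PER_1K[provider]

# 10-minute podcast episode: ~1,500 words x 5 chars/word
episode_chars = 1500 * 5
print(round(tts_cost(episode_chars, "elevenlabs_turbo_v3"), 2))   # 1.35

# IVR workload: 100k calls/mo x 300 chars = 30M chars/mo
print(round(tts_cost(30_000_000, "openai_tts_standard"), 2))      # 450.0
```

The same helper scales to any of the workloads in this list by swapping the character count and provider key.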
Deciding where to spend the voice budget
Voice is one of those line items where spending more does not always improve the product. A customer-support IVR where callers are already mildly annoyed does not benefit from premium narration. A meditation app where voice quality is the entire product benefits enormously. Allocate budget to the places where listener attention is high and alternative evidence (app store reviews, churn feedback) suggests voice quality moves the metric. Underinvesting on a hero feature and overinvesting on plumbing is a pattern we see repeatedly in voice product audits.
Latency matters more than price in realtime
For conversational voice agents (Retell, Vapi, custom stacks), time-to-first-audio is the product. Cartesia Sonic-2 hits ~90ms. OpenAI Realtime is ~300ms. ElevenLabs Turbo v3 is ~250ms. Above 500ms, the conversation feels broken regardless of voice quality. Choose latency-first, then tune quality and price.
Audio-processing infra around the TTS call
Raw TTS output is not always what you want. Production pipelines typically include loudness normalization (LUFS targeting), silence trimming on the ends, optional lightweight compression, and format conversion to match consumer expectations (MP3 or Opus for web, PCM for telephony). Open-source tools (ffmpeg, sox) handle all of this cheaply, but the engineering cost of building a robust pipeline is real. Budget a week of work for the audio-postprocessing layer in any serious voice deployment.
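A sketch of that pipeline as a single ffmpeg invocation driven from Python. The loudness target (-16 LUFS), silence threshold, and Opus bitrate are illustrative defaults, not provider recommendations:

```python
import subprocess

def build_postprocess_cmd(src: str, dst: str, lufs: float = -16.0) -> list[str]:
    """Build an ffmpeg command that trims edge silence, normalizes
    loudness to `lufs`, and encodes to Opus for web delivery."""
    # Trim leading silence, reverse, trim again, reverse back:
    # the standard ffmpeg trick for trimming both ends.
    trim = ("silenceremove=start_periods=1:start_threshold=-50dB,"
            "areverse,"
            "silenceremove=start_periods=1:start_threshold=-50dB,"
            "areverse")
    loudnorm = f"loudnorm=I={lufs}:TP=-1.5:LRA=11"
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"{trim},{loudnorm}",
        "-c:a", "libopus", "-b:a", "48k",
        dst,
    ]

def postprocess(src: str, dst: str) -> None:
    subprocess.run(build_postprocess_cmd(src, dst), check=True)
```

For telephony output, swap the Opus arguments for `-ar 8000 -f mulaw` or whatever your PBX expects; the normalization and trim stages stay the same.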
Voice cloning legal/ethical reality
All major providers now require explicit voice-owner consent via a signed statement or live-video verification. ElevenLabs' professional clone needs 30+ minutes of audio plus consent. Using an unauthorized clone is a contract violation with all of them, and you can and will be cut off. Plan the consent workflow before you promise a feature.
The voice-UX decisions that come before pricing
Before comparing voice-provider pricing, decide what voice personality your product needs, whether that voice is consistent across surfaces, and how the voice behaves when upstream TTS fails. These decisions drive provider choice more than per-1k pricing. A voice product that uses one provider on the marketing site, another in-app, and a third on phone calls feels incoherent regardless of how well each individual component is engineered.
Three production deployments with full cost math
- Audiobook publisher, 40 titles/month averaging 80k words each: 40 × 80k × 5 chars/word = 16M chars/mo. ElevenLabs Multilingual v2 at $0.30/1k chars = $4,800/mo. Compare: human narrator $200–400/hr, 8 hours per title, 40 titles = $64k–128k/mo. AI wins 13–26× on cost; debate is artistic preference, not economics.
- Voice-first agent, 40k calls/day averaging 45s, of which roughly 13s per call is agent speech (@ ~14 chars/s ≈ 190 chars/call) ≈ 7.5M chars/day = 225M chars/mo: Cartesia Sonic-2 at $0.08/1k = $18k/mo. ElevenLabs Turbo v3 = $40.5k/mo at similar quality but 2.5× the latency. Pick Cartesia: sub-100ms latency is the product.
- IVR replacement for a call center, 500k calls/day; with the static hold prompts cached, only ~40 chars of dynamic text per call need synthesis ≈ 600M chars/mo: OpenAI TTS standard at $0.015/1k = $9k/mo. Quality is more than good enough for "Your call will be answered in approximately..." patterns. ElevenLabs at $108k/mo is indefensible here.
Evaluating voice quality empirically
Public voice demos on provider websites are curated. They are rarely representative of output on the specific domain and tone your product needs. Commission a 100-line evaluation script using text from your actual product (not marketing copy), run it through each candidate at the settings you plan to use, and have 10 target-audience listeners rate the output. This 2-day investment catches voice-quality problems that would otherwise surface in production via user complaints.
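Aggregating the listener panel takes only a few lines. A sketch; the provider names and scores are placeholders:

```python
import statistics

def summarize_ratings(ratings: dict[str, list[int]]) -> dict[str, tuple[float, float]]:
    """Mean rating and approximate 95% confidence half-width per provider.
    Scores are 1-5 ratings from target-audience listeners."""
    out = {}
    for provider, scores in ratings.items():
        mean = statistics.fmean(scores)
        # Normal-approximation CI; adequate for panels of ~30+ samples.
        half = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
        out[provider] = (round(mean, 2), round(half, 2))
    return out

# Placeholder panel: 10 listeners x 2 candidate providers.
panel = {
    "provider_a": [5, 4, 5, 4, 4, 5, 3, 4, 5, 4],
    "provider_b": [3, 4, 3, 3, 4, 2, 3, 4, 3, 3],
}
print(summarize_ratings(panel))
```

If the confidence intervals of two providers overlap heavily, the quality difference probably will not move your product metric, and price or latency should decide.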
Streaming vs. batch audio
For real-time applications, streaming TTS delivers audio chunks as the text arrives, cutting perceived latency from "wait for the whole sentence" to "start hearing it immediately." Cartesia, ElevenLabs Turbo, and OpenAI Realtime all support streaming. Play.ht and non-realtime ElevenLabs modes do not. For a voice agent, streaming is non-negotiable; time-to-first-audio under 300ms determines whether the conversation feels natural.
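Measuring time-to-first-audio is provider-agnostic if you treat the SDK response as an iterator of audio chunks. A sketch (the chunk iterator stands in for whatever your provider's streaming SDK returns):

```python
import time
from collections.abc import Iterable

def measure_ttfa(chunks: Iterable[bytes]) -> tuple[float, bytes]:
    """Consume a streaming-TTS chunk iterator and return
    (seconds until the first audio chunk arrived, full audio bytes)."""
    start = time.monotonic()
    ttfa = None
    audio = bytearray()
    for chunk in chunks:
        if ttfa is None:
            ttfa = time.monotonic() - start  # first chunk: the number that matters
        audio.extend(chunk)
    # No chunks at all means the stream failed; report infinite latency.
    return (ttfa if ttfa is not None else float("inf")), bytes(audio)
```

Log the first element of the tuple per request; a p95 time-to-first-audio dashboard catches provider slowdowns long before users describe the agent as "laggy."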
Voice quality varies by language and register
A subtle trap: voice quality benchmarks are usually reported on standard American English at a neutral register. If your product ships voice in Spanish, Mandarin, Hindi, or any regional variant, quality ranking among providers changes. ElevenLabs Multilingual v2 leads most language rankings; Google Cloud TTS is strong on Spanish and Portuguese; Microsoft Azure Speech is competitive for enterprise multi-language deployments. Do not extrapolate from English benchmarks to non-English performance without verification.
Self-hosting open TTS
XTTS-v3 and Orpheus-3 running on an L4 GPU at $0.50/hr hit roughly 15–25× realtime — meaning 1 hour of GPU time produces 15–25 hours of audio. Effective cost: ~$0.02 per hour of speech, or $0.002/1k chars. Quality is noticeably below ElevenLabs but good enough for IVR, drafts, and internal tools. Tradeoff is the usual self-hosting ops work plus voice-cloning legal discipline.
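The effective-rate arithmetic, as a sketch. The ~14 chars/s speech rate and the utilization factor are assumptions: at full utilization the ideal rate comes out well under $0.002/1k, and the commonly quoted figure is what you land on once the GPU sits partly idle between jobs:

```python
def selfhost_cost_per_1k(gpu_usd_per_hr: float = 0.50,
                         realtime_factor: float = 25.0,
                         chars_per_sec: float = 14.0,
                         utilization: float = 1.0) -> float:
    """Effective $/1k chars for self-hosted TTS on a rented GPU.
    realtime_factor: audio-hours produced per GPU-hour at full load.
    utilization: fraction of paid GPU time spent on useful synthesis."""
    usd_per_speech_hr = gpu_usd_per_hr / (realtime_factor * utilization)
    chars_per_hr = chars_per_sec * 3600
    return usd_per_speech_hr / chars_per_hr * 1000

print(round(selfhost_cost_per_1k(), 5))                 # 0.0004 (ideal, 100% busy)
print(round(selfhost_cost_per_1k(utilization=0.2), 5))  # 0.00198 (~the quoted $0.002/1k)
```

The gap between the two numbers is the real self-hosting lesson: batch your synthesis jobs to keep the GPU busy, or the per-character advantage shrinks fast.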
Reliability vs. price tradeoffs
Voice providers have uneven reliability records. ElevenLabs has had outages lasting multiple hours in the past year that broke any customer dependent on them as a single provider. Cartesia and OpenAI Realtime have been more stable but both have had incidents. A real voice product needs multi-provider fallback, which in turn affects how you think about voice-identity consistency — customers will eventually hear the fallback voice if the primary is down.
Quality dimensions that matter
- Prosody. ElevenLabs and Cartesia lead on natural rhythm and emphasis. Matters for narration, podcasts.
- Emotion control. ElevenLabs v3 supports explicit emotion tags; OpenAI does not. Matters for character-driven content.
- Multilingual. ElevenLabs Multilingual v2 handles 30+ languages with native-quality prosody. For global products, critical.
- Pronunciation of domain terms. All models mispronounce proper nouns and jargon. Use SSML or phoneme overrides; budget a pronunciation dictionary maintenance step.
- Consistency across sessions. ElevenLabs and Cartesia maintain voice identity across generations; cheaper providers drift.
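The pronunciation-dictionary step above can be a simple substitution pass before synthesis. A sketch; the dictionary entries and IPA strings are illustrative, and whether `<phoneme>` is honored at all depends on the provider:

```python
# Hypothetical pronunciation dictionary; terms and IPA strings are
# illustrative examples, not vetted transcriptions.
PHONEMES = {
    "Kubernetes": "ˌkubɚˈnɛtiz",
    "nginx": "ˈɛndʒɪnˌɛks",
}

def to_ssml(text: str) -> str:
    """Wrap known domain terms in SSML <phoneme> tags so the TTS
    engine uses the dictionary pronunciation instead of guessing."""
    for term, ipa in PHONEMES.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{term}</phoneme>'
        text = text.replace(term, tag)
    return f"<speak>{text}</speak>"

print(to_ssml("Deploy it on Kubernetes behind nginx."))
```

Keep the dictionary in version control and add an entry every time support flags a mispronunciation; that is the "maintenance step" the bullet warns about.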
Production patterns
- Cache repeated phrases. An IVR that says "Please hold" 500k times should synthesize it once and serve from CDN. Obvious but frequently missed.
- Pre-synthesize common responses. For a chatbot with a finite menu of confirmations, batch-generate the 50 most common responses ahead of time.
- Fallback chain. Primary on Cartesia for low latency, fallback to ElevenLabs when Cartesia is slow, fallback to OpenAI standard when both fail. Three-tier fallback keeps voice UX working during provider incidents.
- Monitor WER on the output. Transcribe back a sample of generated audio with Whisper; compare to input text. Word error rate creeping up signals a voice-quality regression before users complain.
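The fallback chain can be sketched as a preference-ordered loop with a latency budget; the `synth` callables stand in for the real provider SDK calls (Cartesia, ElevenLabs, OpenAI):

```python
import concurrent.futures as cf

def synthesize_with_fallback(text: str, providers, timeout_s: float = 1.0) -> bytes:
    """Try each (name, synth) pair in preference order; move on if one
    errors or blows the latency budget. Raises only if every tier fails."""
    pool = cf.ThreadPoolExecutor(max_workers=len(providers))
    try:
        last_err = None
        for name, synth in providers:
            future = pool.submit(synth, text)
            try:
                return future.result(timeout=timeout_s)
            except Exception as err:  # provider error or TimeoutError
                future.cancel()
                last_err = err
        raise RuntimeError(f"all TTS providers failed: {last_err}")
    finally:
        # Don't block on a hung provider thread while returning.
        pool.shutdown(wait=False)
```

Per the voice-identity caveat above, log every fallback event: each one is a moment a customer heard a different voice, and the rate of those events is a product metric, not just an ops metric.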
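For the WER monitor, a self-contained word-error-rate function; the transcribe-back step (Whisper on a sample of generated audio) is assumed to happen upstream and hand you the hypothesis string:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length.
    Compare the TTS input text (reference) against a transcript of the
    generated audio (hypothesis)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard Wagner-Fischer edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("please hold the line", "please hold the line"))  # 0.0
print(wer("please hold the line", "please hold my line"))   # 0.25
```

Alert on the trend, not the absolute value: Whisper itself contributes a baseline error rate, so what matters is WER drifting upward against that baseline.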
Frequently asked questions
Which TTS for podcasting? ElevenLabs Multilingual v2 is the clear choice: the most natural prosody on the market, at a price that still undercuts human narration by an order of magnitude. Play.ht is a reasonable mid-budget alternative.
Which for customer support voice bots? Cartesia Sonic-2 for low-latency realtime, OpenAI Realtime if you are already on OpenAI. Avoid non-streaming providers.
Can I clone a celebrity voice? No — all major providers prohibit it without explicit consent. Doing so also exposes you to right-of-publicity claims.
Does SSML matter? Yes. Proper SSML for pauses, emphasis, and phonemes lifts quality noticeably. Invest 2 engineer-days in SSML templating and it compounds.
How do I evaluate voice quality objectively? Pay 10 listeners to rate 50 samples on a 5-point scale across naturalness, clarity, and emotion. Repeat monthly. Human ratings correlate with product outcomes better than MOS-style automated metrics.
Do I need an SSML dictionary for product names? Yes, if your product has a distinctive brand name. Most models mispronounce common tech terms by default.
Can I mix TTS with real narration? Yes, and many podcast workflows do — AI narrator for synthesized content, human for interviews. Match the voice identity carefully with a cloned AI variant of the human voice.
Is realtime voice latency solved? Time-to-first-audio below 300ms is solved by Cartesia. Below 150ms requires careful engineering of streaming chunk boundaries. Below 100ms is only possible in constrained setups.
Related reading
- AI transcription ROI: the inverse problem, speech to text.
- AI content cost per piece: voice as part of full-stack content.
- AI video cost: voice typically ships alongside video.
- Compute break-even: self-hosted XTTS at volume.