Voice Cloning & Text-to-Speech: Production Guide 2026

A bilingual, production-focused guide to modern voice cloning and TTS: ElevenLabs v3, OpenAI gpt-4o-mini-tts and gpt-realtime, Google Gemini TTS, open-source models (XTTS-v2, F5-TTS, OpenVoice v2, Sesame CSM, Spark-TTS), real-time providers (Cartesia Sonic, PlayHT, Hume EVI 2), self-hosting on RTX 4090/5090, pricing, and code examples.

By Jose Nobile | Updated 2026-06-11 | 26 min read

2026 Landscape & Use Cases
ElevenLabs v3
OpenAI & Google Gemini TTS
Open-Source Models
Real-Time Providers
Self-Hosting (RTX 4090/5090)
Pricing Comparison
Code Examples

1. 2026 Landscape & Use Cases

Voice cloning and TTS in 2026 split into three tiers: hyper-realistic commercial APIs (ElevenLabs, OpenAI, Google), low-latency real-time engines (Cartesia, PlayHT, Hume), and open-source models you self-host (XTTS-v2, F5-TTS, OpenVoice v2, Sesame CSM, Spark-TTS). Latency, voice fidelity, language coverage, and licensing each pull you toward a different stack.

2. ElevenLabs v3 (Voice Design, IVC, Multilingual)

ElevenLabs v3 (GA since March 14, 2026) is the most expressive TTS model on the market, with audio tags, 70+ languages, multi-speaker dialogue, and instant voice cloning (IVC) from 1 minute of audio. Professional voice cloning (PVC) requires 30+ minutes of audio for studio fidelity. v3 is not built for real time; ElevenLabs recommends Flash v2.5 (~75ms latency) for conversational agents.
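The v3-versus-Flash choice can be sketched as a request builder against the ElevenLabs HTTP streaming endpoint. This is a minimal sketch, not the official SDK: the model IDs `eleven_v3` and `eleven_flash_v2_5`, the helper names, and the voice ID are assumptions to check against your account, and only the standard library is used so the request can be inspected without sending it.

```python
import json
import urllib.request

# Assumed base URL of the ElevenLabs text-to-speech REST API.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def pick_model(realtime: bool) -> str:
    """Flash v2.5 for conversational latency, v3 for expressive offline audio."""
    # Model IDs are assumptions; confirm them in your ElevenLabs dashboard.
    return "eleven_flash_v2_5" if realtime else "eleven_v3"


def build_tts_request(text: str, voice_id: str, api_key: str,
                      realtime: bool = False) -> urllib.request.Request:
    """Build (but do not send) a streaming TTS request."""
    payload = {"text": text, "model_id": pick_model(realtime)}
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def stream_to_file(request: urllib.request.Request, path: str,
                   chunk_size: int = 4096) -> int:
    """Send the request and write audio chunks to disk as they arrive."""
    written = 0
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
            written += len(chunk)
    return written
```

Calling `stream_to_file(build_tts_request("Hello", "YOUR_VOICE_ID", api_key), "out.mp3")` would perform the network call; `YOUR_VOICE_ID` is a placeholder, not a real voice.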

3. OpenAI & Google Gemini TTS

OpenAI gpt-4o-mini-tts accepts steerable instructions (for example, "speak like a calm therapist") and is priced at $0.60 per 1M input text tokens. gpt-realtime is a unified speech-in/speech-out model with sub-300ms latency. Google Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS support controllable multi-speaker dialogue at low cost.
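As a rough illustration of steerable instructions, the sketch below builds a request for OpenAI's `/v1/audio/speech` endpoint and estimates the text-input cost from the $0.60 per 1M token figure quoted above. The helper names and the `alloy` voice choice are assumptions for illustration, and the cost helper covers input tokens only.

```python
import json
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"
# Input-token price quoted in this guide; check the current price list.
USD_PER_MILLION_INPUT_TOKENS = 0.60


def build_speech_request(text: str, instructions: str, api_key: str,
                         voice: str = "alloy") -> urllib.request.Request:
    """Build (but do not send) a gpt-4o-mini-tts request with style instructions."""
    payload = {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": instructions,  # e.g. "speak like a calm therapist"
    }
    return urllib.request.Request(
        url=SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def input_cost_usd(input_tokens: int) -> float:
    """Text-input cost only; audio output is not included in this estimate."""
    return input_tokens / 1_000_000 * USD_PER_MILLION_INPUT_TOKENS
```

Sending the request with `urllib.request.urlopen` returns the audio bytes in the response body; keeping the builder separate makes the payload easy to log and unit-test.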

4. Open-Source Models

Five open-source families dominate self-hosted TTS in 2026: Coqui XTTS-v2 (multilingual cloning, CPML license), F5-TTS (flow-matching, MIT), OpenVoice v2 (cross-lingual cloning, MIT), Sesame CSM-1B (conversational, Apache 2.0), and Spark-TTS (BiCodec, Apache 2.0).
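Licensing is often the first filter when shortlisting these models. The sketch below encodes the five families exactly as listed above and keeps only those under licenses commonly treated as permissive. The `PERMISSIVE` set is an assumption for illustration, and the license strings come from this guide; released weights can carry different terms than the code, so treat the result as a starting point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    name: str
    trait: str
    license: str


# Values copied from the paragraph above, not independently verified.
MODELS = [
    TTSModel("XTTS-v2", "multilingual cloning", "CPML"),
    TTSModel("F5-TTS", "flow-matching", "MIT"),
    TTSModel("OpenVoice v2", "cross-lingual cloning", "MIT"),
    TTSModel("Sesame CSM-1B", "conversational", "Apache 2.0"),
    TTSModel("Spark-TTS", "BiCodec", "Apache 2.0"),
]

# Assumption: MIT and Apache 2.0 are acceptable for commercial deployment.
PERMISSIVE = {"MIT", "Apache 2.0"}


def permissive_models(models: list[TTSModel]) -> list[str]:
    """Return the names of models whose listed license is in PERMISSIVE."""
    return [m.name for m in models if m.license in PERMISSIVE]


if __name__ == "__main__":
    print(permissive_models(MODELS))
```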

5. Real-Time Providers

For voice agents and IVR, sub-200ms time-to-first-audio matters more than absolute fidelity. Cartesia Sonic-2 hits ~40ms TTFB, PlayHT Play 3.0 mini ~143ms, and Hume EVI 2 adds emotional prosody and empathic turn-taking on top of an LLM.
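Vendor latency figures are worth re-measuring from your own region. A minimal, provider-agnostic sketch: wrap whatever iterator of audio chunks your client returns and record the delay until the first non-empty chunk. The `fake_stream` generator is a hypothetical stand-in for a real provider stream, and the measurement assumes the network request starts when iteration begins.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def time_to_first_audio(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Consume a chunk stream; return (ms until first non-empty chunk, total bytes)."""
    start = time.perf_counter()
    first_ms: Optional[float] = None
    total = 0
    for chunk in chunks:
        if first_ms is None and chunk:
            first_ms = (time.perf_counter() - start) * 1000.0
        total += len(chunk)
    return first_ms, total


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider stream: waits, then yields silence."""
    time.sleep(delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320


if __name__ == "__main__":
    ms, size = time_to_first_audio(fake_stream())
    print(f"first audio after {ms:.1f} ms, {size} bytes total")
```

Run it several times and look at the median and the 95th percentile rather than a single sample, since the first call often includes connection setup.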

6. Self-Hosting on RTX 4090 / 5090

A single RTX 4090 (24 GB) or RTX 5090 (32 GB) handles every open-source TTS model in real time with room to spare. The 5090 raises memory bandwidth by roughly 78% (1.79 TB/s versus about 1.01 TB/s on the 4090) and adds FP4 support, cutting XTTS-v2 latency by ~35% versus the 4090.
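"In real time" is usually checked with the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean the GPU generates speech faster than it plays back. The sketch below is illustrative; the sample counts and timings in the usage line are hypothetical, not benchmark results.

```python
def audio_seconds(num_samples: int, sample_rate_hz: int) -> float:
    """Duration of a mono waveform in seconds."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return num_samples / sample_rate_hz


def real_time_factor(synthesis_seconds: float, audio_duration_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_duration_seconds <= 0:
        raise ValueError("audio_duration_seconds must be positive")
    return synthesis_seconds / audio_duration_seconds


if __name__ == "__main__":
    # Hypothetical run: 240,000 samples at 24 kHz generated in 2.5 s.
    duration = audio_seconds(240_000, 24_000)
    print(f"RTF = {real_time_factor(2.5, duration):.2f}")
```

When benchmarking a card, time only the model call (after a warm-up run) and divide by the length of the returned waveform.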

7. Pricing Comparison ($/1M chars)

Commercial pricing varies by as much as 200x between providers. This section compares April 2026 list prices per 1 million characters for the standard tier of each major provider, normalized for direct comparison.
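Providers quote prices per 1K characters, per 1M characters, or per bundle, so normalization is simple arithmetic. The sketch below converts any quote to dollars per 1M characters and computes the spread between the cheapest and most expensive entries. The provider names and prices are placeholders invented for illustration, not real list prices.

```python
def usd_per_million_chars(price_usd: float, chars_in_unit: int) -> float:
    """Normalize a quote (price for `chars_in_unit` characters) to $/1M chars."""
    if chars_in_unit <= 0:
        raise ValueError("chars_in_unit must be positive")
    return price_usd * 1_000_000 / chars_in_unit


def price_spread(normalized: dict[str, float]) -> float:
    """Ratio between the most and least expensive normalized prices."""
    return max(normalized.values()) / min(normalized.values())


if __name__ == "__main__":
    # Placeholder quotes: (price in USD, characters covered by that price).
    quotes = {
        "provider_a": (0.30, 1_000),
        "provider_b": (15.00, 1_000_000),
    }
    normalized = {name: usd_per_million_chars(p, n) for name, (p, n) in quotes.items()}
    print(normalized, f"spread: {price_spread(normalized):.0f}x")
```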

8. Code Examples (ElevenLabs SDK + F5-TTS Local)

Two minimal but production-ready snippets: a Python ElevenLabs streaming TTS call, and a local F5-TTS inference using the official CLI on an RTX 4090.
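For the local F5-TTS path, one hedged way to drive the CLI from a pipeline is a thin Python wrapper around `f5-tts_infer-cli`. The flags below follow the project's README as I understand it, and the default model name `F5TTS_v1_Base` is an assumption that may differ between releases; the file paths in the usage note are placeholders.

```python
import shutil
import subprocess

F5_CLI = "f5-tts_infer-cli"


def build_f5_command(ref_audio: str, ref_text: str, gen_text: str,
                     model: str = "F5TTS_v1_Base") -> list[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    return [
        F5_CLI,
        "--model", model,
        "--ref_audio", ref_audio,  # short clip of the voice to clone
        "--ref_text", ref_text,    # transcript of that clip
        "--gen_text", gen_text,    # text to synthesize in the cloned voice
    ]


def run_f5(command: list[str]) -> subprocess.CompletedProcess:
    """Run the CLI, failing early with a clear message if it is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH; install F5-TTS first")
    return subprocess.run(command, check=True)
```

A call such as `run_f5(build_f5_command("ref.wav", "Reference transcript.", "Hello from a local GPU."))` would launch inference. Passing an argument list rather than a shell string avoids quoting problems with punctuation in the text.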
