ChatTTS: Dialogue-First TTS Hobbled by Its Own Ethics
ChatTTS was trained on 100K hours of audio specifically for conversational AI, not audiobook narration. Its developers injected noise and compressed the output to deter deepfake abuse. The result: a dialogue-optimized TTS model with fine-grained prosody control, stability issues, and a non-commercial license that blocks most real-world use cases.

Most text-to-speech models optimize for reading text naturally: audiobook narration, podcast voiceovers. ChatTTS was trained on 100,000+ hours of dialogue to solve a different problem: making LLM assistants sound like they're having a conversation. Then the developers intentionally degraded the audio quality to prevent criminal misuse.
The repository hit 38,000+ GitHub stars in months, with deployments at PaperCast for academic podcasts and on Modal's serverless infrastructure. The central tension: technically interesting work hobbled by its creators' ethical choices and a license that blocks most commercial use.
The dialogue optimization problem
ElevenLabs and commercial TTS handle natural language reading well. ChatTTS targets the conversational patterns LLM assistants actually need: mid-sentence pauses, interjections, code-switching between languages, prosody control for dialogue flow. The model provides fine-grained control over laughter, pauses, and interjections without complex workarounds—capabilities that matter when your AI assistant needs to sound conversational rather than reading a script.
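As a sketch of what that control surface looks like: the project's README documents inline tokens such as [laugh], [uv_break], and [lbreak] that are embedded directly in the input text before synthesis. The helper below is hypothetical (not part of the ChatTTS API); only the token names come from the project's documentation.

```python
# Hypothetical pre-processor illustrating ChatTTS-style inline prosody tokens.
# Token names ([laugh], [uv_break], [lbreak]) follow the public README;
# the function itself is illustrative, not a ChatTTS library call.

def add_prosody_tokens(sentence, laugh_after=frozenset(), pause_after=frozenset()):
    """Insert a laugh or pause token after each listed word."""
    out = []
    for word in sentence.split():
        out.append(word)
        bare = word.strip(".,!?").lower()
        if bare in laugh_after:
            out.append("[laugh]")   # audible laughter at this point
        if bare in pause_after:
            out.append("[uv_break]")  # unvoiced mid-sentence pause
    return " ".join(out) + " [lbreak]"  # sentence-final break token

text = add_prosody_tokens(
    "That is hilarious, honestly I did not expect it",
    laugh_after={"hilarious"},
    pause_after={"honestly"},
)
# text == "That is hilarious, [laugh] honestly [uv_break] I did not expect it [lbreak]"
```

The annotated string would then be passed to the model's inference call in place of the raw sentence, which is what makes the control "fine-grained": prosody is specified per word position rather than per utterance.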
The open-source release was trained on 40,000 hours of data, focused on Chinese and English dialogue. PaperCast's application: converting arXiv research papers into listenable podcasts, where conversational delivery matters more than pristine audio fidelity.
The deliberate audio degradation decision
Developers intentionally injected high-frequency noise during training and compressed outputs to MP3 format to prevent deepfake and criminal applications. This isn't a technical limitation—it's a deliberate ethical choice that directly undermines the product's core value proposition.
The strategy raises questions about whether hobbling your own work actually prevents misuse or just ensures bad actors use something else while legitimate users suffer degraded quality.
Stability issues and hallucination with long text
Beyond the intentional degradation, architectural limitations cause further trouble. GitHub issues document recurring stability problems: the model becomes increasingly unreliable as sentences grow longer, producing unwanted vocal insertions and dropped characters. Multi-speaker scenarios often require several generation attempts to get usable results.
These aren't ChatTTS-specific bugs—they're inherent to autoregressive architectures shared with Bark and VALL-E. The tradeoff for fine-grained prosody control is unpredictable behavior at scale.
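A common workaround, standard practice for autoregressive TTS rather than anything from the ChatTTS docs, is to split long input into short chunks and synthesize each chunk separately so the model never sees a sentence long enough to trigger hallucinations. The 80-character budget below is an assumption, not a documented ChatTTS limit.

```python
import re

def chunk_for_tts(text, max_chars=80):
    """Split text at sentence boundaries into chunks of at most max_chars.

    Hypothetical pre-processing step: short chunks keep an autoregressive
    model like ChatTTS away from the long-sentence regime where insertions
    and dropped characters appear.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_chars:
            current += " " + sentence  # sentence still fits in this chunk
        else:
            chunks.append(current)     # chunk full; start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be synthesized individually and the audio
# concatenated, trading a little prosodic continuity for reliability.
```

This is also why the multi-attempt workflow the GitHub issues describe is tolerable in practice: regenerating one short failed chunk is far cheaper than regenerating a whole passage.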
The CC-BY-NC-ND licensing problem
Despite the GitHub stars, ChatTTS is distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 license, which is not open source. The NonCommercial and NoDerivatives restrictions eliminate most integration use cases: you can experiment and do research, but production deployment requires negotiating separate terms.
Piper TTS offers MIT licensing and 30+ language support. XTTS-v2 provides multilingual synthesis and voice cloning. ChatTTS's dialogue optimization is genuine, but the licensing makes it a research artifact rather than a production option for most teams.
When the tradeoffs actually make sense
ChatTTS has differentiators: dialogue-specific optimization, code-switching within sentences, and real-time performance for conversational applications. The Modal serverless deployment proves the architecture works at scale when infrastructure is handled properly.
The use case is narrow: non-commercial dialogue prototyping, English/Chinese conversation research, academic applications where the CC-BY-NC-ND license isn't a blocker. For production voice integration, multilingual requirements, or commercial products, the combination of intentional quality degradation, stability issues, and restrictive licensing makes this a poor fit.
The May 2024 Hacker News discussion compared it to ElevenLabs—commercial alternatives don't artificially degrade their own output. ChatTTS demonstrates what dialogue-first architecture looks like. Whether that justifies the compromises depends on whether you're building a research prototype or shipping a product.
Repository: 2noise/ChatTTS, "a generative speech model for daily dialogue."