XTTS v1.1 and 🐸TTS v0.19 updates

machine-learning, TTS, coqui.ai, open-source, XTTS

🎮 XTTS Demo
👨‍💻 XTTS Code
💬 Dicord

The XTTS v1.1 version now includes Japanese, along with some model enhancements and a fresh vocoder. As a result, this iteration of XTTS can now speak in 14 languages with improved quality,

One of the common complains with XTTS v1 revolved around the recurrence of prompts in the output audio. The principal reason for this was that the model was trained by conditioning on a random snippet of the actual speech. During inference, this induced the model to repeat the prompt, particularly around the midpoint of the generated speech.

In order to address this issue, we have introduced a new conditioning method that is based on masking. We take a segment of the speech as the prompt but mask that segment while computing the loss. This way, the model is not able to cheat by copying the prompt to the output.

Furthermore, masking aids the model in more effectively adopting the style of the prompt. My hypothesis is that by masking the prompt, it necessitates the model to sustain the prompt’s style throughout the entirety of the speech generation.

XTTS v1.1 is already out with 🐸TTS v0.19.0. You can try it out by simply updating 🐸TTS.

🐸TTS v0.19.0 also comes with the fine-tuning code for XTTS. You can fine-tune XTTS on your own dataset and create voices that carries target speaker’s style. I plan to write down a tutorial for fine-tuning if I don’t get lazy.

I actually wrote this post while I am training XTTS v2.0 which will be a major update with more langauges, better cloning and voice quality. We are looking for help. You can join our discord to help us find data for new langauges, evaluate the model for your language and contribute to the code.

(This is written using 🐸ME_AI)