XTTS-v1 technical notes

machine-learning, TTS, coqui.ai, open-source, XTTS


🎮 XTTS Demo
👨‍💻 XTTS Code
💬 Discord

XTTS is a versatile text-to-speech model that offers natural-sounding voices in 13 different languages. One of its unique features is the ability to clone voices across languages using just a 3-second audio sample.

Currently, XTTS-v1 supports the following 13 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, and Chinese (Simplified).

XTTS introduces techniques that simplify cross-language voice cloning and multilingual speech generation. These techniques remove the need for a parallel dataset and for the very large amounts of training data that cross-language voice cloning would otherwise require.

XTTS Architecture #

XTTS builds upon recent advancements in autoregressive models, such as Tortoise, Vall-E, and Soundstorm, which are based on language models trained on discrete audio representations. XTTS uses a VQ-VAE model to discretize the audio into audio tokens. It then employs a GPT model to predict these audio tokens from the input text and speaker latents. The speaker latents are computed by a stack of self-attention layers. The output of the GPT model is passed to a decoder model that outputs the audio signal. For XTTS-v1 we follow the Tortoise methodology, which combines a diffusion model with the UnivNet vocoder: the diffusion model transforms the GPT outputs into spectrogram frames, and UnivNet then generates the final audio signal.
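
To illustrate what "discretizing audio into audio tokens" means, here is a minimal sketch of the quantization step at the heart of a VQ-VAE: each continuous frame vector is replaced by the index of its nearest codebook entry. The 2-D codebook, the frame values, and the function name `quantize` are toy assumptions for illustration only and are not taken from the XTTS code.

```python
import math

def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Euclidean distance from this frame to every codebook entry.
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

# Toy 2-D codebook with 4 entries (hypothetical values, not from XTTS).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Toy "encoder outputs", one vector per audio frame.
frames = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]

audio_tokens = quantize(frames, codebook)
print(audio_tokens)  # [0, 1, 2, 3]
```

In a real VQ-VAE the codebook and the encoder are learned jointly; the sketch only shows the lookup that turns continuous frames into the discrete tokens the GPT model predicts.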

XTTS training #

To convert characters to input IDs, the input text is passed through a BPE tokenizer. Additionally, a special token representing the target language is placed at the beginning of the text. As a result, the final input is structured as follows: [bots], [lang], t1, t2, t3 ... tn, [eots]. [bots] marks the beginning of the text sequence and [eots] marks its end.
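
The layout of the text input can be sketched as follows. The special-token IDs, the language codes, and the helper `build_text_sequence` are hypothetical; the real tokenizer's vocabulary and IDs will differ. The BPE IDs are assumed to come from a tokenizer that has already run.

```python
# Hypothetical special-token IDs; the real XTTS vocabulary may differ.
SPECIAL = {"[bots]": 0, "[eots]": 1, "[en]": 2, "[fr]": 3}

def build_text_sequence(bpe_ids, lang):
    """Wrap BPE token IDs as [bots], [lang], t1 ... tn, [eots]."""
    lang_token = f"[{lang}]"
    if lang_token not in SPECIAL:
        raise ValueError(f"unsupported language: {lang}")
    return [SPECIAL["[bots]"], SPECIAL[lang_token], *bpe_ids, SPECIAL["[eots]"]]

# Three made-up BPE IDs for a French sentence.
text_ids = build_text_sequence([10, 11, 12], "fr")
print(text_ids)  # [0, 3, 10, 11, 12, 1]
```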

The speaker latents are obtained by applying a series of attention layers to an input mel-spectrogram. The speaker encoder processes the mel-spectrogram and generates a speaker latent vector for each frame of the spectrogram s1, s2, ..., sk. These latent vectors are then used to condition the model on the speaker. Instead of averaging or pooling these vectors, we directly pass them to the model as a sequence. Consequently, the length of the input sequence is proportional to the duration of the speaker’s audio sample. As a result, the final conditioning input is composed of s1, s2, ..., sk, [bots], [lang], t1, t2, t3, ..., tn, [eots].

We append the audio tokens to the input sequence above to create each of the training samples. So each sample becomes s1, s2, ..., sk, [bots], [lang], t1, t2, t3, ..., tn, [eots], [boas], a1, a2, ..., a_l, [eoas]. [boas] marks the beginning of the audio sequence and [eoas] marks its end.

XTTS-v1 can be trained in 3 stages. First, we train the VQVAE model, then the GPT model, and finally an audio decoder.

XTTS is trained with 16k hours of data mostly consisting of public datasets. We use all the datasets from the beginning and balance the data batches by language. In order to compute speaker latents, we used audio segments ranging from 3 to 6 seconds in length.
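
The notes do not say exactly how batches are balanced by language. One simple way to do it, shown here purely as an assumption, is to cycle through the languages when filling a batch so that a language with little data appears as often as one with a lot. The function `balanced_batch` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def balanced_batch(samples, batch_size, rng):
    """Fill a batch by cycling through languages, picking a random utterance from each."""
    by_lang = defaultdict(list)
    for lang, utterance in samples:
        by_lang[lang].append(utterance)
    langs = sorted(by_lang)
    batch = []
    for i in range(batch_size):
        lang = langs[i % len(langs)]  # round-robin over languages
        batch.append((lang, rng.choice(by_lang[lang])))
    return batch

# Toy corpus: 100 English utterances but only 2 French ones.
samples = [("en", f"en_{i}") for i in range(100)] + [("fr", "fr_0"), ("fr", "fr_1")]
batch = balanced_batch(samples, batch_size=4, rng=random.Random(0))
print([lang for lang, _ in batch])  # ['en', 'fr', 'en', 'fr']
```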

We train with the AdamW optimizer and a learning rate of 1e-4, using a step-wise schedule that lowers the learning rate to 1e-5 after 750k steps. The entire training run takes approximately 1 million steps. The final model has approximately 750 million parameters.
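
The schedule described above amounts to a single step function of the training step. The sketch below assumes the drop happens exactly at step 750,000; the function name and the boundary convention are illustrative choices, not details from the training code.

```python
def learning_rate(step, base_lr=1e-4, final_lr=1e-5, drop_step=750_000):
    """Step-wise schedule: base_lr before drop_step, final_lr from drop_step on."""
    return base_lr if step < drop_step else final_lr

for step in (0, 749_999, 750_000, 1_000_000):
    print(step, learning_rate(step))
```

In PyTorch, an equivalent setup could combine `torch.optim.AdamW` with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[750_000], gamma=0.1)`, stepping the scheduler once per training step.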

XTTS v1 #

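XTTS-v1 relies on several techniques for speaker conditioning, consistent voice cloning, cross-language cloning without a parallel dataset, and learning new languages from limited data.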

In XTTS-v1, the speaker latents are learned from a portion of the ground-truth audio. However, when the segment is used directly, the model tends to cheat by simply copying it to the output. To avoid this, we divide the segment into smaller chunks and shuffle them before feeding them to the speaker encoder. The encoder then calculates a speaker latent for each frame of the spectrogram, which we use to condition the GPT model on the speaker information.
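
The chunk-and-shuffle step can be sketched as below. Integers stand in for spectrogram frames, and the chunk size of 4 frames is an arbitrary assumption; the notes do not state the chunk length used in training.

```python
import random

def shuffle_chunks(frames, chunk_size, rng):
    """Split frames into consecutive chunks, shuffle the chunk order, and re-join."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    rng.shuffle(chunks)  # frame order inside each chunk is preserved
    return [frame for chunk in chunks for frame in chunk]

frames = list(range(12))  # stand-in for 12 mel-spectrogram frames
shuffled = shuffle_chunks(frames, chunk_size=4, rng=random.Random(0))
print(shuffled)
```

Local structure within a chunk survives, so timbre cues remain available to the speaker encoder, while the global ordering that would let the model copy the segment to the output is broken.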

XTTS combines ideas from Tortoise and Vall-E to clone a voice and keep it consistent between runs. Tortoise is very good at cloning; however, you are likely to get a different voice on each run since it relies on only a single vector. XTTS instead uses a sequence of latent vectors, which carries more information about the speaker and leads to better cloning and consistency between runs.

Vall-E has exceptional cloning abilities. However, achieving cross-language voice cloning with Vall-E-X requires both the reference audio transcript and a parallel dataset. Our method for conditioning XTTS, on the other hand, does not rely on transcripts, and despite not using a parallel dataset, the model is capable of seamlessly transferring voices across different languages.

Language tokens are essential in reducing the data and training time required to learn a language. They act as a strong signal at the beginning of the generation process and help condition the model on the specific language. Language tokens enabled XTTS to learn a language from a dataset of just 150 hours, which is considerably smaller than what other autoregressive TTS models require.

Performance Notes #

Future plans #

Using XTTS #

To start using XTTS, all you need to do is pip install TTS.

Here is a sample Python snippet. Please see the docs for more details.

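from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1", gpu=True)

# generate speech by cloning a voice using default settings
# (file_path, speaker_wav, and language below are placeholder values; replace them with your own)
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
                file_path="output.wav",
                speaker_wav="/path/to/target/speaker.wav",
                language="en")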

Acknowledgements #

Big thanks to the Coqui team and the community for their work and support; to James Betker for Tortoise, which showed the right way to do TTS; to the HuggingFace team for their amazing work and support with the XTTS release; and to all the people and organizations behind public voice datasets, especially Common Voice.
