The recent surge of end-to-end deep learning models has enabled new and exciting Text-to-Speech (TTS) use-cases with impressively natural-sounding results. However, most of these models are trained on massive datasets (20-40 hours) recorded with a single speaker in a professional environment. In this setting, expanding your solution to multiple languages and speakers is not feasible for everyone. It is particularly tough for low-resource languages that are not commonly targeted by mainstream research. To overcome these limitations and bring zero-shot TTS to low-resource languages, we built YourTTS, which can synthesize voices in multiple languages and significantly reduce data requirements by transferring knowledge among the languages in the training set. For instance, we can easily introduce Brazilian Portuguese to the model with a single-speaker dataset by co-training with a larger English dataset. This makes the model speak Brazilian Portuguese with voices from the English dataset, and we can even introduce new speakers on the fly by zero-shot learning.
In “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone” we introduce YourTTS, which targets:
Multi-Lingual TTS. Synthesizing speech in multiple languages with a single model.
Multi-Speaker TTS. Synthesizing speech with different voices with a single model.
Zero-Shot learning. Adapting the model to synthesize the speech of a novel speaker without re-training the model.
Speaker/language adaptation. Fine-tuning a pre-trained model to learn a new speaker or language. (Learn Turkish from a relatively smaller dataset by transferring knowledge from learned languages)
Cross-language voice transfer. Transferring a voice from its original language to a different language. (Using the voice of an English speaker in French)
Zero-shot voice conversion. Changing the voice of a given speech clip.
YourTTS is an extension of our previous work SC-GlowTTS. It uses the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model as the backbone architecture and builds on top of it. We use a larger text encoder than the original model. Also, YourTTS employs a separately trained speaker encoder model to compute the speaker embedding vectors (d-vectors) to pass speaker information to the rest of the model. We use the H/ASP model as the speaker encoder architecture. See the figure below for the overall model architecture in training (right) and inference (left).
VITS is a peculiar TTS model as it employs different deep-learning techniques together (adversarial learning, normalizing flows, variational auto-encoders, transformers) to achieve high-quality natural-sounding output. It is mainly built on the GlowTTS model. The GlowTTS is light, robust to long sentences, converges rapidly, and is backed up by theory since it directly maximizes the log-likelihood of speech with the alignment. However, its biggest weakness is the lack of naturalness and expressivity of the output.
VITS improves on it by introducing specific updates. First, it replaces the duration predictor with a stochastic duration predictor that better models the variability in speech. Then, it connects a HifiGAN vocoder to the decoder’s output and joins the two with a variational autoencoder (VAE). That allows the model to train in an end2end fashion and find a better intermediate representation than traditionally used mel-spectrograms. This results in high fidelity and more precise prosody, achieving better MOS values reported in the paper.
Note that both GlowTTS and VITS implementations are available on 🐸TTS.
We combined multiple datasets for different languages. We used VCTK and LibriTTS for English (multispeaker datasets), TTS-Portuguese Corpus (TPC) for Brazilian Portuguese, and the French subset of the M-AILABS dataset (FMAI).
We resample the audio clips to 16 kHz, apply voice activity detection to remove silences, and apply RMS volume normalization before passing them to the speaker encoder.
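As a rough illustration of the last preprocessing step, RMS volume normalization can be sketched in plain Python. The function name `rms_normalize` and the target level of -27 dB are assumptions made for this example and are not taken from the post. Resampling and voice activity detection are left out.

```python
import math


def rms_normalize(samples, target_db=-27.0):
    """Scale samples so that their RMS level equals target_db (dB relative to full scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silent clip: nothing to scale
    gain = 10.0 ** (target_db / 20.0) / rms
    return [s * gain for s in samples]


# One second of a 220 Hz tone at 16 kHz as stand-in audio.
sample_rate = 16000
tone = [0.5 * math.sin(2.0 * math.pi * 220.0 * n / sample_rate) for n in range(sample_rate)]
normalized = rms_normalize(tone)

# Measure the level of the normalized clip in dB.
level_db = 20.0 * math.log10(math.sqrt(sum(s * s for s in normalized) / len(normalized)))
print(round(level_db, 2))
```

In practice the same scaling would be applied to an array loaded from disk. The list-based version only keeps the sketch dependency-free.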
We train YourTTS incrementally, starting from a single speaker English dataset and adding more speakers and languages along the way. We start from a pre-trained model on the LJSpeech dataset for 1M steps and continue with the VCTK dataset for 200K steps. Next, we randomly initialize the new layers introduced by the YourTTS model on the VITS model. Then we add the other datasets one by one and train for ~120K steps with each new dataset.
Before we report results on each dataset, we also fine-tune the final model with speaker encoder loss (SCL) on that particular dataset. SCL compares output speech embeddings with the ground truth embeddings computed by the speaker encoder with cosine similarity loss.
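The idea behind SCL can be sketched with a plain cosine similarity between two embedding vectors. This is a minimal sketch: the function names are hypothetical, and the sign convention (negative similarity as the loss) and the absence of any loss weighting are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(generated_emb, reference_emb):
    """Assumed convention: negative similarity, so a lower loss means closer voices."""
    return -cosine_similarity(generated_emb, reference_emb)


reference = [0.2, 0.9, -0.4]          # embedding of the ground-truth clip
same_direction = [0.4, 1.8, -0.8]     # same voice direction, different scale
orthogonal = [0.9, -0.2, 0.0]         # unrelated voice direction

print(speaker_consistency_loss(same_direction, reference))
print(speaker_consistency_loss(orthogonal, reference))
```

In training, both embeddings would come from the frozen speaker encoder: one from the generated audio and one from the ground-truth audio.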
We used a single V100 GPU and used a batch size of 64. We used the AdamW optimizer with beta values 0.8 and 0.99 and a learning rate of 0.0002 decaying exponentially with gamma 0.999875 per iteration. We also employed a weight decay of 0.01.
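To make the decay concrete, the learning rate at a given iteration can be computed directly from the values above. The helper name `learning_rate` is made up for this sketch, which only reproduces the exponential schedule as described in the text and not the optimizer itself.

```python
def learning_rate(step, base_lr=0.0002, gamma=0.999875):
    """Learning rate after `step` iterations of exponential decay."""
    return base_lr * gamma ** step


for step in (0, 1_000, 10_000):
    print(step, learning_rate(step))
```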
We run “mean opinion score” (MOS) and similarity MOS tests to evaluate the model performance. Also, we use the speaker encoder cosine similarity (SECS) to measure the similarity between the predicted outputs and the actual audio clips of a target speaker. We used a 3rd party library for SECS to be compatible with the previous work. We avoid details of our experiments for the sake of brevity. Please refer to the paper to see the details.
Table (1) above shows our results on different datasets. Exp1 is trained with only VCTK. Exp2 is trained with VCTK and TPC. Then, we add FMAI and LibriTTS for Exp3 and Exp4, respectively. The ground truth row reports the values for the real speaker clips in the respective datasets. Finally, we compare our results with AttentronZS and SC-GlowTTS. Note that SC-GlowTTS is our previous work that led the way to YourTTS (you can find its implementation under 🐸TTS). We achieve significantly better results than the compared work in our experiments. MOS values are on par with or even better than the ground truth in some cases, which surprised even us.
Table (2) depicts the zero-shot voice conversion (ZSVC) results between languages and genders using the speaker embeddings. For ZSVC, we pass the given speech clip through the posterior encoder to compute the hidden representation and re-run the model in inference mode, conditioned on the target speaker’s embedding. The table shows the model’s performance between languages and genders. For instance, “en-pt” shows the results for converting the voice of a Portuguese speaker by conditioning on an English speaker, and “M-F” shows the conversion of a male speaker to a female speaker.
Table (3) shows the results for the speaker adaptation experiments, where we fine-tune the final YourTTS model with SCL on clips of different lengths from a particular novel speaker. For instance, the top row shows the results for a model trained on a male English speaker with a 61-second audio clip. GT is the ground truth, ZS is zero-shot with only the speaker embeddings, and FT is fine-tuning. These results show that our model can achieve high similarity when fine-tuned with only 20 seconds of audio from a speaker, in case the mere use of speaker embeddings is not enough to produce high-quality results.
Due to the time and space constraints in the paper, we could not expand the experiments to all the possible use-cases of YourTTS. We plan to include those in our future study and add new capabilities to YourTTS that would give more control over the model.
Try out YourTTS
Visit our demo page accompanying this blog post and give YourTTS a try right on your browser.
YourTTS is also available in 🐸TTS with a training recipe and a pre-trained model. You can train your own model, synthesize voice with the pre-trained model or finetune it with your dataset.
We are well aware that the expansion of TTS technology enables various kinds of malign uses. Therefore, we also actively study different approaches to prevent, or at the very least put more fences in the way of, the misuse of TTS technology.
To exemplify this, on our demo page, we add background music to avert the unintended use of the voice clips on different platforms.
If you also want to contribute to our research & discussion in this field, join us here.
YourTTS can achieve competitive results on multi-lingual TTS, multi-speaker TTS, and zero-shot learning. It also allows cross-language voice transfer and learning new speakers and languages from relatively smaller datasets than traditional TTS models need.
We are excited to present YourTTS and see all the different use-cases that 🐸 Community will apply. As always, feel free to reach out for any feedback.
Despite the success of the latest attention-based end2end text2speech (TTS) models, they suffer from attention alignment problems at inference time. These occur especially with long-text inputs or out-of-domain character sequences. Here I would like to propose a novel technique to fight these alignment problems, which I call Double Decoder Consistency (DDC) (with limited creativity). DDC consists of two decoders that learn synchronously with different reduction factors. We use the level of consistency between these decoders to attain better attention performance.
End-to-End TTS Models with Attention
Good examples of attention-based TTS models are Tacotron and Tacotron2. Tacotron2 is also the main architecture used in this work. These models comprise a sequence-to-sequence architecture with an encoder, an attention module, a decoder and an additional stack of layers called Postnet. The encoder takes an input text and computes a hidden representation, from which the decoder computes predictions of the target acoustic feature frames. A context-based attention mechanism is used to align the input text with the predictions. Finally, decoder predictions are passed through the Postnet, which predicts residual information to improve the reconstruction performance of the model. In general, mel-spectrograms are used as acoustic features to represent audio signals in a lower temporal resolution and in a perceptually meaningful way.
Tacotron proposes to compute multiple non-overlapping output frames per decoder step. You can set the number of output frames per decoder step, which is called the ‘reduction rate’ (r). The larger the reduction rate, the fewer decoder steps the model needs to produce an output of the same length. Thereby, the model achieves faster training convergence and easier attention alignment, as explained in the Tacotron paper. However, larger r values also produce smoother output frames and therefore reduce the frame-level details.
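The effect of the reduction rate on the number of decoder steps is simple arithmetic. In the sketch below, the helper name and the frame count of 700 are illustrative values chosen for this example.

```python
import math


def decoder_steps(num_frames, r):
    """Decoder steps needed to emit num_frames when r frames are produced per step."""
    return math.ceil(num_frames / r)


num_frames = 700  # illustrative utterance length in spectrogram frames
for r in (1, 2, 7):
    print(f"r={r}: {decoder_steps(num_frames, r)} decoder steps")
```

With r=7 the decoder, and therefore the attention module, only has to take a seventh of the steps it takes with r=1, which is one way to see why alignment becomes easier.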
Although these models are used in TTS systems for more natural-sounding speech, they frequently suffer from attention alignment problems, especially at inference time, because of out-of-domain words, long input texts, or intricacies of the target language. One solution is to use a larger r for better alignment. However, as noted above, this reduces the quality of the predicted frames. DDC tries to mitigate these attention problems by acting on these observations and finding an architecture that sits in the middle ground.
The bare-bones model used in this work can be described as follows:
The target of the model is a sequence of acoustic feature frames. Its input is a sequence of characters or phonemes, from which we compute a sequence of encoder outputs. The reduction factor r defines the number of output frames per decoder step. At each decoder step, the model computes the attention alignments, a query vector, and an output. Each output is a set of frames whose size is determined by r.
Note that teacher forcing is applied at training. Therefore, the number of decoder steps is fixed by the length of the target sequence at training time. However, at inference the decoder is instructed to stop by a separate network (Stopnet), which predicts a value in the range [0, 1]. If its prediction is larger than a defined threshold, the decoder stops.
Double Decoder Consistency
DDC is based on two decoders working simultaneously with different reduction factors (r). One decoder (coarse) works with a large reduction factor, and the other decoder (fine) works with a small one.
DDC is designed to settle the trade-off between the attention alignment and the predicted frame quality, which is tuned by the reduction factor. In general, standard models have more robust attention performance with a larger r, but due to the smoothing effect of predicting multiple frames per iteration, the final acoustic features are coarser compared to models with a lower reduction factor.
DDC combines these two properties at training time as it uses the coarse decoder to guide the fine decoder to preserve the attention performance without a loss of precision in acoustic features. DDC achieves this by introducing an additional loss function comparing the attention vectors of these two decoders.
For each training step, both decoders compute their respective attention vectors and outputs. Due to the differences in their r values, their attention vectors have different lengths. The coarse decoder produces a shorter vector than the fine decoder. To compensate for this, we interpolate the coarse attention vector to match the length of the fine attention vector. Once they have the same length, we use a loss function to penalize the difference in the alignments. This loss synchronizes the two decoders with respect to their alignments.
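The interpolate-then-compare step can be sketched in plain Python on attention matrices stored as nested lists with shape [decoder steps][input positions]. The linear interpolation and the mean absolute difference (L1) used here are assumptions, since the text only says that the coarse alignment is interpolated and that a loss penalizes the difference. The function names are hypothetical.

```python
def interpolate_steps(attention, target_steps):
    """Linearly interpolate an attention matrix along the decoder-step axis."""
    source_steps = len(attention)
    num_inputs = len(attention[0])
    if source_steps == 1:
        return [list(attention[0]) for _ in range(target_steps)]
    result = []
    for t in range(target_steps):
        # Position of target step t on the (shorter) source axis.
        pos = t * (source_steps - 1) / max(target_steps - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, source_steps - 1)
        frac = pos - lo
        result.append([(1.0 - frac) * attention[lo][i] + frac * attention[hi][i]
                       for i in range(num_inputs)])
    return result


def attention_consistency_loss(fine, coarse):
    """Mean absolute difference between the fine and the upsampled coarse alignments."""
    coarse_up = interpolate_steps(coarse, len(fine))
    total = sum(abs(f - c)
                for fine_row, coarse_row in zip(fine, coarse_up)
                for f, c in zip(fine_row, coarse_row))
    return total / (len(fine) * len(fine[0]))


coarse = [[1.0, 0.0], [0.0, 1.0]]                       # 2 decoder steps (large r)
fine_agree = [[1.0, 0.0], [2 / 3, 1 / 3], [1 / 3, 2 / 3], [0.0, 1.0]]
fine_stuck = [[0.0, 1.0]] * 4                           # attention stuck on the last input

print(attention_consistency_loss(fine_agree, coarse))
print(attention_consistency_loss(fine_stuck, coarse))
```

The first call yields a loss close to zero because the fine alignment follows the same path as the coarse one, while the stuck alignment is penalized.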
The two decoders take the same input from the encoder. They also compute the outputs in the same way except they use different reduction factors. The coarse decoder uses a larger reduction factor compared to the fine decoder. These two decoders are trained with separate loss functions comparing their respective outputs with the real feature frames. The only interaction between these two decoders is the attention loss applied to compare their respective attention alignments.
Other Model Updates
Batch Norm Prenet
Prenet is an important part of Tacotron-like auto-regressive models. It projects model output frames before passing them to the decoder. Essentially, it computes an embedding space of the feature (spectrogram) frames by which the model de-factors the distribution of upcoming frames.
I replace the original Prenet (PrenetDropout) with one that uses Batch Normalization (Ioffe & Szegedy) after each dense layer (PrenetBN), and I remove the Dropout layers. Dropout is necessary for learning attention, especially when the data quality is low. However, it causes problems at inference due to distributional differences between training and inference time. Using Batch Normalization is a good alternative. It avoids the issues of Dropout and also provides a certain level of regularization due to the noise of batch-level statistics. It also normalizes the computed embedding vectors and generates a well-shaped embedding space.
I use a gradual training scheme for the model training. I introduced gradual training in a previous blog post. In short, we start the model training with a larger reduction factor and gradually reduce it as the model saturates.
Gradual Training shortens the total training time significantly and yields better attention performance due to its progression from coarse to fine information levels.
Recurrent PostNet at inference
The Postnet is the part of the network applied after the Decoder to improve the Decoder predictions before the vocoder. Its output is summed with the Decoder’s output to form the final output of the model. Therefore, it predicts a residual which improves the Decoder output. So we can also apply the Postnet more than once, assuming it computes useful residual information each time. I applied this trick only at inference and observed that, up to a certain number of iterations, it improves the performance. For my experiments, I set the number of iterations to 2.
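The iteration can be sketched as repeatedly adding the predicted residual to the current estimate. This is a minimal sketch that assumes each pass is fed the previously refined output. `toy_postnet` is a made-up stand-in for a trained Postnet and only exists to make the example runnable.

```python
def apply_postnet_iteratively(decoder_output, postnet, num_iterations=2):
    """Refine the decoder output by adding the Postnet residual num_iterations times."""
    output = list(decoder_output)
    for _ in range(num_iterations):
        residual = postnet(output)
        output = [o + r for o, r in zip(output, residual)]
    return output


def toy_postnet(frames):
    """Stand-in residual predictor: moves each value halfway toward a target of 1.0."""
    return [0.5 * (1.0 - f) for f in frames]


decoder_output = [0.0, 2.0]
once = apply_postnet_iteratively(decoder_output, toy_postnet, num_iterations=1)
twice = apply_postnet_iteratively(decoder_output, toy_postnet, num_iterations=2)
print(once, twice)
```

With the toy residual, each extra pass moves the frames closer to the target, which mirrors the behaviour described above for a Postnet that keeps producing useful residuals.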
MB-Melgan Vocoder with Multiple Random Window Discriminator
As a vocoder, I use the Multi-Band Melgan (Yang et al., 2020) generator. It is trained with the Multiple Random Window Discriminator (RWD) (Bińkowski et al., 2019), unlike the original work, where they used the Multi-Scale Melgan Discriminator (MSMD).
The main difference between these two is that RWD uses audio-level information and MSMD uses spectrogram-level information. More specifically, RWD comprises multiple convolutional networks, each of which takes audio segments of a different length with a different sampling rate and performs classification, whereas MSMD uses convolutional networks to perform the same classification on the STFT output of the target voice signal.
In my experiments, I observed that RWD yields better results, with a more natural and less distorted voice.
Guided attention (Tachibana et al., 2017) uses a soft diagonal mask to force the attention alignment to be diagonal. As we do, it applies an additional loss term at training time, using this constant mask to penalize the model. However, due to its constant nature, it dictates a constant prior to the model, which does not always hold, especially for long sentences with various pauses. It also causes skipping in my experiments, which their work tries to solve with a windowing approach at inference time.
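For reference, the soft diagonal mask of guided attention can be sketched as below, following the formula in the guided attention paper, where the penalty for attending from input position n to output step t grows with the distance from the diagonal. The width g=0.2 is the value suggested in that paper. The function names are hypothetical.

```python
import math


def guided_attention_mask(num_inputs, num_steps, g=0.2):
    """Penalty matrix [input position][decoder step]: 0 on the diagonal, near 1 far from it."""
    return [[1.0 - math.exp(-((n / num_inputs - t / num_steps) ** 2) / (2.0 * g * g))
             for t in range(num_steps)]
            for n in range(num_inputs)]


def guided_attention_loss(attention, mask):
    """Mean of the attention weights scaled by the constant penalty mask."""
    total = sum(a * m
                for att_row, mask_row in zip(attention, mask)
                for a, m in zip(att_row, mask_row))
    return total / (len(attention) * len(attention[0]))


size = 5
mask = guided_attention_mask(size, size)
diagonal = [[1.0 if n == t else 0.0 for t in range(size)] for n in range(size)]
anti_diagonal = [[1.0 if n == size - 1 - t else 0.0 for t in range(size)] for n in range(size)]

print(guided_attention_loss(diagonal, mask))
print(guided_attention_loss(anti_diagonal, mask))
```

The mask is the same for every sentence of a given shape, which is the constant prior mentioned above: a legitimate pause that bends the alignment away from the diagonal is penalized just like a real alignment error.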
Using multiple decoders was initially introduced by Zheng et al. (2019). They use two decoders that run in forward and backward directions through the encoder output. The main problem with this approach is that, because it uses two decoders with identical reduction factors, it is almost 2 times slower to train than a vanilla model. We solve the problem by using the second decoder with a higher reduction rate. It accelerates the training significantly and also gives the user the opportunity to choose between the two decoders depending on run-time requirements. DDC also does not use any complex scheduling or multiple loss signals that complicate the model training.
Lately, new TTS models such as FastSpeech, Glow-TTS and Flow-TTS predict output duration directly from the input characters. These models train a duration predictor or use approximation algorithms to find the duration of each input character. However, as you listen to their samples, you can observe that these models lead to degraded timbre and naturalness. This is because of the indirect hard alignment produced by these models. In contrast, models with soft-attention modules can adaptively emphasize different parts of the speech, producing more natural speech.
Results and Experiments
All the experiments are performed using the LJspeech dataset (Ito, 2017). I use a sampling rate of 22050 Hz and mel-scale spectrograms as the acoustic feature. Mel-spectrograms are computed with hop-length 256 and window-length 1024. Mel-spectrograms are normalized into [-4, 4]. You can see the audio parameters used below in Coqui TTS config format.
// AUDIO PARAMETERS
// stft parameters
"num_freq": 513, // number of stft frequency levels. Size of the linear spectrogram frame.
"win_length": 1024, // stft window length in samples.
"hop_length": 256, // stft window hop-length in samples.
"frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-length in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
"ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true, // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for trimming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined, otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
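To show how the normalization parameters interact, here is a minimal sketch of range normalization for a single spectrogram value in dB. It assumes that values between `min_level_db` and 0 dB are mapped linearly onto the target range, and it leaves out the `ref_level_db` offset and mean-variance normalization. The function name is hypothetical and this is not presented as the library's own code.

```python
def normalize_db(value_db, min_level_db=-100.0, max_norm=4.0,
                 symmetric_norm=True, clip_norm=True):
    """Map a dB value from [min_level_db, 0] into the normalized range."""
    scaled = (value_db - min_level_db) / -min_level_db  # 0.0 at min_level_db, 1.0 at 0 dB
    if symmetric_norm:
        normalized = 2.0 * max_norm * scaled - max_norm
        low, high = -max_norm, max_norm
    else:
        normalized = max_norm * scaled
        low, high = 0.0, max_norm
    if clip_norm:
        normalized = max(low, min(high, normalized))
    return normalized


for value_db in (-150.0, -100.0, -50.0, 0.0):
    print(value_db, normalize_db(value_db))
```

With the settings from the config (symmetric normalization, `max_norm` of 4.0 and clipping), all values end up in [-4, 4], which matches the range stated for the mel-spectrograms.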
I used Tacotron2 as the base architecture with location-sensitive attention and applied all the model updates described above. The model was trained for 330k iterations, which took 5 days on a single GPU, although the model seems to produce satisfying quality after only 2 days of training with DDC. I used the gradual training schedule shown below. The model starts with r=7 and batch-size 64 and gradually reduces to r=1 and batch-size 32. The coarse decoder is set to r=7 for the whole training.
I trained the MB-Melgan vocoder using real spectrograms up to 1.5M steps, which took 10 days on a single GPU machine. For the first 600K iterations, it is pre-trained with only the supervised loss, as in the Multi-band MelGAN paper, and then the discriminator is enabled for the rest of the training. I do not apply any learning rate schedule and I used 1e-4 for the whole training.
DDC Attention Performance
Fig3. shows the validation alignments of the fine and the coarse decoders which have r=1 and r=7 respectively. We observe that two decoders show almost identical attention alignments with a slight roughness with the coarse decoder due to the interpolation.
DDC significantly shortens the time required to learn the attention alignment. In my experiments, the model is able to align after just 1k steps, as opposed to ~8k steps with normal location-sensitive attention.
At inference time, we ignore the coarse decoder and use only the fine decoder. Fig. 4 below depicts the model outputs and attention alignments at inference time for 4 different sentences that are not seen at training time. This shows us that the fine decoder is able to generalize successfully to novel sentences.
I used the 50 hard sentences introduced by the FastSpeech paper to check the attention quality of the DDC model. As you can see in the notebook below (open it on Colab to listen to Griffin-Lim based voice samples), the DDC model performs without any alignment problems. It is the first model, to my knowledge, which performs flawlessly on these sentences.
In Fig5. we see the average L1 difference between the real mel-spectrogram and the model prediction for each Postnet iteration. The results improve until the 3rd iteration. We also observe that some of the artifacts after the first iteration are removed by the second iteration, which yields a better L1 value. Therefore, we see here how effective the iterative application of the Postnet is at improving the final model predictions.
First of all, I hope this section will not be a “here are the things we’ve not tried and will not try” section.
There are three specific aspects of DDC which I would like to investigate more. The first is sharing the weights between the fine and the coarse decoders to reduce the total number of model parameters and observing how the shared weights benefit from different resolutions.
The second is to measure the level of complexity required by the coarse decoder. That is, how much simpler the coarse architecture can get without performance loss.
Finally, I would like to try DDC with different model architectures.
Here I tried to summarize a new method that significantly accelerates model training, provides steadfast attention alignment and provides a choice in a spectrum of quality and speed switching between the fine and the coarse decoders at inference. The user can choose depending on run-time requirements.
You can replicate all this work using Coqui TTS. You can also see voice samples and Colab Notebooks from the links above. Let me know how it goes if you try DDC in your project.
If you would like to cite this work:
Gölge E. (2020) Solving Attention Problems of TTS models with Double Decoder Consistency. erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency/
 Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. 1–10. https://doi.org/10.21437/Interspeech.2017-1452
 Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2–6. http://arxiv.org/abs/1712.05884
 Ioffe, S., & Szegedy, C. (n.d.). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
 Tachibana, H., Uenoyama, K., & Aihara, S. (2017). Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. http://arxiv.org/abs/1710.08969
 Zheng, Y., Wang, X., He, L., Pan, S., Soong, F. K., Wen, Z., & Tao, J. (2019). Forward-Backward Decoding for Regularizing End-to-End TTS. http://arxiv.org/abs/1907.09006
 Keith Ito, The LJ Speech Dataset (2017) https://keithito.com/LJ-Speech-Dataset/
 Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. http://arxiv.org/abs/1905.09263
 Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. http://arxiv.org/abs/2005.11129
 Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., & Xiao, J. (2020). Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow. 7209–7213. https://doi.org/10.1109/icassp40776.2020.9054484
 Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech. http://arxiv.org/abs/2005.05106
 Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., & Simonyan, K. (2019). High Fidelity Speech Synthesis with Adversarial Networks. 1–17. http://arxiv.org/abs/1909.11646
In this post, I would like to introduce two methods that, in my experience, worked well for better attention alignment in Tacotron models. If you would like to try them yourself, you can visit Coqui TTS. The first method is Bidirectional Decoder and the second is Graves Attention (Gaussian Attention) with small tweaks.
Bidirectional decoding uses an extra decoder which takes the encoder outputs in reverse order. An extra loss function then compares the output states of the forward decoder with those of the backward one. With this additional loss, the forward decoder models what it needs to expect in the next iterations. In this regard, the backward decoder punishes bad decisions of the forward decoder and vice versa.
Intuitively, if the forward decoder fails to align the attention, that would cause a big loss, and ultimately it would learn to go monotonically through the alignment process with a correction induced by the backward decoder. Therefore, this method is able to prevent “catastrophic failure”, where the attention falls apart in the middle of a sentence and never aligns again.
At inference time, the paper suggests using only the forward decoder and dropping the backward decoder. However, it is possible to think of more elaborate ways to combine these two models.
There are 2 main pitfalls of this method. First, due to the additional parameters of the backward decoder, it is slower to train this model (almost 2x), and this makes a huge difference especially when the reduction rate (the number of frames the model generates per iteration) is low. Second, if the backward decoder penalizes the forward one too harshly, it degrades the overall prosody. For this reason, the paper suggests activating the additional loss only for fine-tuning.
My experience is that Bidirectional training is quite robust against alignment problems and it is especially useful if your dataset is hard. It also aligns almost right after the first epoch. Yes, at inference time, it sometimes causes pronunciation problems, but I solved this by doing the opposite of the paper’s suggestion. I fine-tuned the network without the additional loss for just an epoch and everything started to work well.
Tacotron uses Bahdanau Attention, which is a content-based attention method. However, it does not consider location information. Therefore, it needs to learn the monotonicity of the alignment just by looking at the content, which is hard. Tacotron2 uses Location Sensitive Attention, which takes the previous attention weights into account. By doing so, it learns the monotonic constraint. But it does not solve all of the problems, and you can still experience failures with long or out-of-domain sentences.
Graves Attention is an alternative that uses content information to decide how far it needs to go on the alignment per iteration. It does this by using a mixture of Gaussian distributions.
Graves Attention takes the context vector of time t-1, passes it through a couple of fully connected layers ([FC > ReLU > FC] in our model), and estimates the step size, variance and distribution weights for time t. Then the estimated step size is used to update the means of the Gaussian modes. By analogy, the mean is the point of interest on the alignment path, the variance is the attention window over this point of interest, and the distribution weight is the importance of each distribution head.
Above, I tried to formulate how I compute the alignment in my implementation. The raw outputs of the network are intermediate values, from which the step size, the variance, and the distribution weight of each GMM node k are computed. (You can also check the code.)
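One decoder step of a Graves-style GMM attention can be sketched in plain Python as follows. This is a generic illustration of the mechanism described above and is not necessarily the exact formulation of my implementation: the softplus activations for the step size and the variance, the softmax over the weights, and all function names are assumptions made for this sketch.

```python
import math


def softplus(x):
    """Smooth positive activation: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def softmax(values):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(raw_step, raw_var, raw_weight, prev_means, num_inputs):
    """Update the Gaussian means and compute the alignment over input positions.

    raw_step, raw_var and raw_weight are the per-node outputs of the small FC network.
    """
    steps = [softplus(k) for k in raw_step]            # positive, so the means only move forward
    variances = [softplus(b) + 1e-5 for b in raw_var]  # positive attention window width
    weights = softmax(raw_weight)                      # importance of each Gaussian node
    means = [m + d for m, d in zip(prev_means, steps)]
    alignment = [
        sum(w * math.exp(-0.5 * (mu - j) ** 2 / var)
            for w, mu, var in zip(weights, means, variances))
        for j in range(num_inputs)
    ]
    return means, alignment


means = [0.0, 0.0]  # two GMM nodes, both starting at the first input position
history = []
for _ in range(3):
    means, alignment = gmm_attention_step([0.5, 1.0], [0.0, 0.0], [0.0, 0.0],
                                          means, num_inputs=10)
    history.append(list(means))
print(history)
```

Because the step size is always positive, the means can only advance along the input, which is what makes this kind of attention monotonic by construction.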
Some other versions are explained here, but so far I have found that the formulation from my implementation works best for me, without any NaNs in training. I also realized that, with the method claimed to be the best in this paper, one of the distribution nodes overruns the others in the middle of training and, basically, attention starts to run on a single Gaussian head.
The benefit of using GMM is to have more robust attention. It is also computationally light-weight compared to both bidirectional decoding and normal location attention. Therefore, you can increase your batch size and possibly converge faster.
The downside is that, although my experiments are not complete, GMM attention provided slightly worse prosody and naturalness compared to the other methods.
Here I compare Graves Attention, Bidirectional Decoding and Location Sensitive Attention trained on the LJSpeech dataset. For the comparison, I used the set of sentences provided by this work. There are 50 sentences in total.
Bidirectional Decoding has 1, Graves attention has 6, Location Sensitive Attention has 18, Location Sensitive Attention with inference time windowing has 11 failures out of these 50 sentences.
In terms of prosodic quality, in my opinion, Location Sensitive Attention > Bidirectional Decoding > Graves Attention > Location Sensitive Attention with Windowing. However, I should say the quality difference is hardly observable in LJSpeech dataset. I also need to point out that, it is a hard dataset.
If you would like to try these methods, they are all implemented in Coqui TTS, so give it a try.
Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative over traditional solutions. It only requires text and corresponding voice clips to train the model. It avoids the toil of fine-grained annotation of the data. However, Tacotron might also be very time demanding to train, especially if you don’t know the right hyperparameters, to begin with. Here, I like to share a gradual training scheme to ease the training difficulty. In my experiments, it provides faster training, tolerance for hyperparameters and more time with your family.
In summary, Tacotron is an Encoder-Decoder architecture with Attention. It takes a sentence as a sequence of characters (or phonemes) and outputs a sequence of spectrogram frames, which are ultimately converted to speech with an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is a more complicated architecture but it has fewer model parameters than Tacotron2. Tacotron2 is much simpler but it is ~4x larger (~7m vs ~24m parameters). To be clear, so far I have mostly used the gradual training method with Tacotron, and I am about to begin experimenting with Tacotron2.
Here is the trick. Tacotron has a parameter called ‘r’ which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter to reduce the number of computations, since the larger ‘r’ is, the fewer decoder iterations are needed. But setting the value too high might reduce the performance as well. Another benefit of a higher ‘r’ value is that the alignment module stabilizes much faster. If you talk to someone who has used Tacotron, they would probably know what a struggle the attention is. So finding the right trade-off for ‘r’ is a big deal. In the original Tacotron paper, the authors used ‘r’ as 2 for the best-reported model. They also emphasize the challenge of training the model with r=1.
Gradual training comes to the rescue at this point. What it means is that we set ‘r’ initially large, such as 7. Then, as the training continues, we reduce it until convergence. This simple trick helps quite magically to solve two main problems. First, it helps the network learn the monotonic attention almost right after the first epoch. Second, it speeds up convergence considerably. As a result, the final model happens to have more stable and resilient attention without any degradation of performance. You can even eventually let the network train with r=1, which was not even reported in the original paper.
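The mechanics of such a schedule can be sketched as a lookup from the current training step to the active settings. The step boundaries and the intermediate values in `EXAMPLE_SCHEDULE` are purely illustrative and are not the schedule used in my experiments. The function name is hypothetical, and the batch size column is optional.

```python
# Each entry: (first training step, reduction rate r, batch size). Illustrative values only.
EXAMPLE_SCHEDULE = [
    (0, 7, 64),
    (10_000, 5, 64),
    (50_000, 3, 32),
    (130_000, 2, 32),
    (290_000, 1, 32),
]


def gradual_settings(step, schedule):
    """Return (r, batch_size) of the latest schedule entry whose start step has been reached."""
    current = schedule[0]
    for entry in schedule:
        if step >= entry[0]:
            current = entry
    return current[1], current[2]


for step in (0, 60_000, 300_000):
    r, batch_size = gradual_settings(step, EXAMPLE_SCHEDULE)
    print(f"step {step}: r={r}, batch_size={batch_size}")
```

In a training loop, the decoder's reduction rate (and the data loader, if the batch size changes) would be updated whenever the returned settings differ from the ones currently in use.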
Here, I like to share some results to prove the effectiveness. I used LJspeech dataset for all the results. The training schedule can be summarized as follows. (You see I also change the batch_size but it is not necessary if you have enough GPU memory.)
Below you can see the attention at validation time after just 1K iterations with the training schedule above.
Next, let’s check the model training curve and convergence.
You can listen to voice examples generated with the final model using GriffinLim vocoder. I’d say the quality of these examples is quite good to my ear.
It was a short post, but if you would like to replicate the results here, you can visit our repo Coqui TTS and just run the training with the provided config.json file. I hope the imperfect documentation on the repo will help you. Otherwise, you can always ask for help by creating an issue or on the Github Discussions page. There are some other cool things in the repo that I will also write about in the future. Until next time..!
Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This work might be a more legit research work in the future.