Solving Attention Problems of TTS models with Double Decoder Consistency

Model Samples:

Colab Notebook (PyTorch): link

Colab Notebook (Tensorflow): link

Despite the success of the latest attention-based end-to-end text-to-speech (TTS) models, they suffer from attention alignment problems at inference time. These problems occur especially with long text inputs or out-of-domain character sequences. Here I would like to propose a novel technique to fight these alignment problems, which I call Double Decoder Consistency (DDC) (with limited creativity). DDC consists of two decoders that learn synchronously with different reduction factors. We use the level of consistency between these decoders to attain better attention performance.

One of Shakespeare’s works read by the DDC model.

End-to-End TTS Models with Attention

Good examples of attention-based TTS models are Tacotron and Tacotron2 [1][2]. Tacotron2 is also the main architecture used in this work. These models comprise a sequence-to-sequence architecture with an encoder, an attention module, a decoder and an additional stack of layers called the Postnet. The encoder takes an input text and computes a hidden representation, from which the decoder computes predictions of the target acoustic feature frames. A context-based attention mechanism is used to align the input text with the predictions. Finally, the decoder predictions are passed through the Postnet, which predicts residual information to improve the reconstruction performance of the model. In general, mel-spectrograms are used as acoustic features to represent audio signals at a lower temporal resolution and in a perceptually meaningful way.

Tacotron proposes computing multiple non-overlapping output frames per decoder step. The number of output frames per decoder step is configurable and is called the ‘reduction rate’ (r). The larger the reduction rate, the fewer decoder steps the model needs to produce an output of the same length. Thereby, the model achieves faster training convergence and easier attention alignment, as explained in [1]. However, larger r values also produce smoother output frames and therefore reduce frame-level detail.
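To make the effect of r concrete, here is a minimal sketch that counts decoder steps for a given number of target frames. The function name and the 800-frame utterance are illustrative assumptions, not something taken from the Tacotron paper or from Mozilla TTS.

```python
import math


def decoder_steps(num_frames: int, r: int) -> int:
    """Decoder steps needed to emit `num_frames` frames, r frames per step."""
    if r < 1:
        raise ValueError("reduction rate must be >= 1")
    # The last step may be only partially used, hence the ceiling.
    return math.ceil(num_frames / r)


# Hypothetical utterance with 800 mel frames.
for r in (1, 2, 7):
    print(f"r={r}: {decoder_steps(800, r)} decoder steps")
```

With fewer steps, the attention module has fewer decisions to make along the same input text, which is one way to read the "easier alignment" claim.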

Although these models are used in TTS systems for more natural-sounding speech, they frequently suffer from attention alignment problems, especially at inference time, because of out-of-domain words, long input texts, or intricacies of the target language. One solution is to use a larger r for better alignment; however, as noted above, it reduces the quality of the predicted frames. DDC tries to mitigate these attention problems by building on these observations to find an architecture that reaches a middle ground.

Fig1. This is an overview of the model used in this work. (Excuse my artwork).

The bare-bone model used in this work is formalized as follows:

    \[\{h_l\}^L_{l=1} = Encoder(\{x_l\}^L_{l=1})\]

    \[p_t = Prenet(o_{t-1})\]

    \[q_t = concat(p_t, c_{t-1})\]

    \[a_t = Attention(q_t, \{h_l\}^L_{l=1})\]

    \[c_t = \sum_{l}a_{t,l}h_l\]

    \[o_t = RNNs(c_t), \quad   o_t = \{f_{(t-1) r + 1}, ..., f_{t r}\}\]

    \[\{o_t\}^T_{t=1}, \{a_t\}^{T}_{t=1} = Decoder(\{h_i\}^L_{i=1}; r)\]

    \[\{f^D_k\}^K_{k=1} = reshape(\{o_t\}^T_{t=1})\]

    \[\{f^P_k\}^K_{k=1} = Postnet(\{f^D_k\}^K_{k=1})\]

    \[L = ||f^P - y || + ||f^D - y||\quad(loss)\]

\(\{y_k\}^K_{k=1}\) is a sequence of acoustic feature frames. \(\{x_l\}^L_{l=1}\) is a sequence of characters or phonemes, from which we compute the sequence of encoder outputs \(\{h_l\}^L_{l=1}\). \(r\) is the reduction factor, which defines the number of output frames per decoder step. The attention alignments, query vector and decoder output at decoder step \(t\) are denoted by \(a_t\), \(q_t\) and \(o_t\) respectively. Also, \(o_t\) defines a set of output frames whose size is set by \(r\). The total number of decoder steps is denoted by \(T\).

Note that teacher forcing is applied at training. Therefore, K=T*r at training time. At inference, however, the decoder is instructed to stop by a separate network (Stopnet), which predicts a value in the range [0, 1]. If its prediction is larger than a defined threshold, the decoder stops.
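The stopping rule can be sketched as a simple loop. This is only an illustration: `decode_until_stop`, `toy_step`, the 0.5 threshold and the step cap are hypothetical stand-ins for the real decoder and Stopnet, which are neural networks.

```python
from typing import Callable, List, Tuple


def decode_until_stop(step_fn: Callable[[int], Tuple[List[float], float]],
                      threshold: float = 0.5,
                      max_steps: int = 1000) -> List[List[float]]:
    """Run decoder steps until the stop prediction exceeds `threshold`."""
    outputs = []
    for t in range(max_steps):  # the cap guards against a Stopnet that never fires
        frames, stop_prob = step_fn(t)
        outputs.append(frames)
        if stop_prob > threshold:
            break
    return outputs


def toy_step(t: int) -> Tuple[List[float], float]:
    """Stand-in for one decoder step: emits r=2 dummy frames and a rising stop value."""
    r = 2
    frames = [float(t * r + i) for i in range(r)]
    stop_prob = min(1.0, t / 10)
    return frames, stop_prob


outputs = decode_until_stop(toy_step)
print(len(outputs), "decoder steps were run")
```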

Double Decoder Consistency

DDC is based on two decoders working simultaneously with different reduction factors (r). One decoder (coarse) works with a large reduction factor, and the other decoder (fine) works with a small one.

DDC is designed to settle the trade-off between attention alignment and predicted frame quality, which is tuned by the reduction factor. In general, standard models have more robust attention performance with a larger r, but due to the smoothing effect of predicting multiple frames per iteration, the final acoustic features are coarser compared to models with a lower reduction factor.

DDC combines these two properties at training time as it uses the coarse decoder to guide the fine decoder to preserve the attention performance without a loss of precision in acoustic features. DDC achieves this by introducing an additional loss function comparing the attention vectors of these two decoders.

For each training step, both decoders compute their respective attention vectors and outputs. Due to the difference in their r values, their attention vectors have different lengths: the coarse decoder produces a shorter vector than the fine decoder. To compensate for this, we interpolate the coarse attention vector to match the length of the fine attention vector. Once they have the same length, we use a loss function to penalize the difference between the alignments. This loss synchronizes the two decoders with respect to their alignments.
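The interpolation and comparison steps can be sketched in plain Python as follows. The choice of nearest-neighbour interpolation and of a mean absolute (L1) difference are assumptions made for illustration, and the helper names are hypothetical; the post itself only says that the coarse alignment is interpolated and the difference is penalized.

```python
from typing import List

Matrix = List[List[float]]  # [decoder_steps][encoder_steps]


def interpolate_nearest(coarse: Matrix, target_len: int) -> Matrix:
    """Stretch the coarse alignment along the decoder-step axis to `target_len` rows."""
    src_len = len(coarse)
    return [coarse[min(src_len - 1, (t * src_len) // target_len)]
            for t in range(target_len)]


def attention_consistency_loss(fine: Matrix, coarse: Matrix) -> float:
    """Mean absolute difference between the fine and the interpolated coarse alignment."""
    coarse_interp = interpolate_nearest(coarse, len(fine))
    total, count = 0.0, 0
    for row_f, row_c in zip(fine, coarse_interp):
        for a_f, a_c in zip(row_f, row_c):
            total += abs(a_f - a_c)
            count += 1
    return total / count


# Toy example: 3 input characters, 4 fine steps, 2 coarse steps.
fine = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
coarse = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(attention_consistency_loss(fine, coarse))
```

In a real model the same computation would run on batched tensors, with gradients flowing into both decoders.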

Fig2. DDC model architecture.

The two decoders take the same input from the encoder. They also compute the outputs in the same way except they use different reduction factors. The coarse decoder uses a larger reduction factor compared to the fine decoder. These two decoders are trained with separate loss functions comparing their respective outputs with the real feature frames. The only interaction between these two decoders is the attention loss applied to compare their respective attention alignments.

    \[\{{f^{D_f}}_k\}^K_{k=1}, \{a^f_t\}^{T_f}_{t=1} = Decoder_F(\{h_i\}^L_{i=1}; r_f)\]

    \[\{{f^{D_c}}_k\}^K_{k=1}, \{a^c_t\}^{T_c}_{t=1} = Decoder_C(\{h_i\}^L_{i=1}; r_c)\]

    \[\{a'^{c}_t\}^{T_f}_{t=1} = interpolate(\{a^c_t\}^{T_c}_{t=1})\]

    \[L_{DDC}= ||a^f - a'^c||\]

    \[L_{model} = ||f^P - y || + ||f^{D_f} - y||+ ||f^{D_c} - y|| + ||a^f - a'^c||\]

Other Model Updates

Batch Norm Prenet

Prenet is an important part of Tacotron-like auto-regressive models. It projects the model's output frames before passing them back to the decoder. Essentially, it computes an embedding space of the feature (spectrogram) frames by which the model de-factors the distribution of upcoming frames.

I replace the original Prenet (PrenetDropout) with the one using Batch Normalization [3] (PrenetBN) after each dense layer and I remove Dropout layers. Dropout is necessary for learning attention, especially when the data quality is low. However, it causes problems at inference due to distributional differences between training and inference time. Using Batch Normalization is a good alternative. It avoids the issues of Dropout and also provides a certain level of regularization due to the noise of batch-level statistics. It also normalizes computed embedding vectors and generates a well-shaped embedding space.

Gradual Training

I use a gradual training scheme for model training. I introduced gradual training in a previous blog post. In short, we start training with a larger reduction factor and gradually reduce it as the model saturates.

Gradual Training shortens the total training time significantly and yields better attention performance due to its progression from coarse to fine information levels.

Recurrent PostNet at inference

The Postnet is the part of the network applied after the Decoder to improve the Decoder predictions before the vocoder. Its output is summed with the Decoder's output to form the final output of the model. In other words, it predicts a residual that improves the Decoder output. We can therefore apply the Postnet more than once, assuming it computes useful residual information each time. I applied this trick only at inference and observed that, up to a certain number of iterations, it improves performance. For my experiments, I set the number of iterations to 2.
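A minimal sketch of the iterative application is given below. It assumes that each iteration feeds the previous summed output back into the Postnet, which is one plausible reading of the description; `apply_postnet` and the toy Postnet are hypothetical.

```python
from typing import Callable, List

Frames = List[float]


def apply_postnet(decoder_frames: Frames,
                  postnet: Callable[[Frames], Frames],
                  num_iters: int = 2) -> Frames:
    """Repeatedly add the residual predicted by `postnet` to the current output."""
    out = list(decoder_frames)
    for _ in range(num_iters):
        residual = postnet(out)
        out = [x + res for x, res in zip(out, residual)]
    return out


def make_toy_postnet(target: Frames) -> Callable[[Frames], Frames]:
    """Toy Postnet whose residual moves the input halfway towards a fixed target."""
    def postnet(frames: Frames) -> Frames:
        return [0.5 * (t - x) for x, t in zip(frames, target)]
    return postnet


toy_postnet = make_toy_postnet([1.0, 2.0])
print(apply_postnet([0.0, 0.0], toy_postnet, num_iters=1))
print(apply_postnet([0.0, 0.0], toy_postnet, num_iters=2))
```

The toy residual always helps, so more iterations always get closer to the target; a trained Postnet gives no such guarantee, which fits the observation that the benefit stops after a few iterations.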

MB-Melgan Vocoder with Multiple Random Window Discriminator

As a vocoder, I use the Multi-Band Melgan [11] generator. It is trained with a Multiple Random Window Discriminator (RWD) [13], unlike the original work [11], which used the Multi-Scale Melgan Discriminator (MSMD) [12].

The main difference between these two is that RWD uses audio-level information and MSMD uses spectrogram-level information. More specifically, RWD comprises multiple convolutional networks, each of which takes audio segments of a different length at a different sampling rate and performs classification, whereas MSMD uses convolutional networks to perform the same classification on the STFT output of the target voice signal.

In my experiments, I observed that RWD yields better results, with a more natural and less aberrated voice.

Related Work

Guided attention [4] uses a soft diagonal mask to force the attention alignment to be diagonal. Like DDC, it adds a loss term at training time; in its case, the loss penalizes the model using this constant mask. However, because the mask is constant, it imposes a fixed prior on the model that does not always hold, especially for long sentences with various pauses. It also causes skipping in my experiments, which their work tries to solve by using a windowing approach at inference time.
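For reference, the soft diagonal mask of guided attention can be sketched as below. The formula W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) with g = 0.2 follows the guided attention paper [4]; the zero-based indexing and the function names are choices made for this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]  # [input_position][decoder_step]


def guided_attention_mask(num_chars: int, num_frames: int, g: float = 0.2) -> Matrix:
    """Penalty weights that are near 0 on the diagonal and approach 1 far from it."""
    return [[1.0 - math.exp(-((n / num_chars - t / num_frames) ** 2) / (2.0 * g * g))
             for t in range(num_frames)]
            for n in range(num_chars)]


def guided_attention_loss(attention: Matrix, mask: Matrix) -> float:
    """Mean of the element-wise product of the attention and the mask."""
    total, count = 0.0, 0
    for a_row, w_row in zip(attention, mask):
        for a, w in zip(a_row, w_row):
            total += a * w
            count += 1
    return total / count


mask = guided_attention_mask(4, 4)
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_diagonal = [row[::-1] for row in diagonal]
print(guided_attention_loss(diagonal, mask))           # no penalty on the diagonal
print(guided_attention_loss(reversed_diagonal, mask))  # penalized
```

The mask depends only on the two sequence lengths, never on the content, which is the constant prior the paragraph above refers to.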

Using multiple decoders was first introduced by [5]. They use two decoders that run in forward and backward directions through the encoder output. The main problem with this approach is that, because the two decoders have identical reduction factors, it is almost 2 times slower to train compared to a vanilla model. We solve the problem by using a second decoder with a higher reduction rate. This accelerates training significantly and also gives the user the opportunity to choose between the two decoders depending on run-time requirements. DDC also does not use any complex scheduling or multiple loss signals that complicate model training.

Lately, new TTS models introduced by [7][8][9][10] predict output durations directly from the input characters. These models train a duration predictor or use approximation algorithms to find the duration of each input character. However, when listening to their samples, I find that these models lead to degraded timbre and naturalness. This is because of the indirect hard alignment produced by these models. In contrast, models with soft-attention modules can adaptively emphasize different parts of the speech, producing more natural speech.

Results and Experiments

Experiment Setup

All the experiments are performed using the LJSpeech dataset [6]. I use a sampling rate of 22050 Hz and mel-scale spectrograms as the acoustic feature. Mel-spectrograms are computed with hop-length 256 and window-length 1024, and are normalized into [-4, 4]. The audio parameters used are shown below in Mozilla TTS config format.

        // stft parameters
        "num_freq": 513,         // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024,      // stft window length in samples.
        "hop_length": 256,       // stft window hop-length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 0.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.

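To show how the normalization parameters above interact, here is a rough sketch of range normalization for a single decibel value. It assumes that the spectrogram is already in dB, that `ref_level_db` is subtracted before scaling, and that no `stats_path` is set; the function name is hypothetical, and the exact order of operations should be checked against the audio processor you actually use.

```python
def normalize_db(s_db: float,
                 min_level_db: float = -100.0,
                 ref_level_db: float = 20.0,
                 max_norm: float = 4.0,
                 symmetric_norm: bool = True,
                 clip_norm: bool = True) -> float:
    """Map a dB value into [-max_norm, max_norm] (symmetric) or [0, max_norm]."""
    s = s_db - ref_level_db                          # shift by the reference level (assumed)
    s_norm = (s - min_level_db) / (-min_level_db)    # roughly [0, 1]
    if symmetric_norm:
        s_norm = 2.0 * max_norm * s_norm - max_norm  # [-max_norm, max_norm]
        low, high = -max_norm, max_norm
    else:
        s_norm = max_norm * s_norm                   # [0, max_norm]
        low, high = 0.0, max_norm
    if clip_norm:
        s_norm = max(low, min(high, s_norm))
    return s_norm


for value in (20.0, -30.0, -80.0, -200.0):
    print(value, "->", normalize_db(value))
```

With the settings shown in the config, values at the reference level land at +4 and values 100 dB below it land at -4, which matches the [-4, 4] range stated above.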
I used Tacotron2 [2] as the base architecture with location-sensitive attention and applied all the model updates described above. The model is trained for 330k iterations, which took 5 days on a single GPU, although the model seems to produce satisfying quality after only 2 days of training with DDC. I used the gradual training schedule shown below. The model starts with r=7 and batch-size 64 and gradually reduces to r=1 and batch-size 32. The coarse decoder is set to r=7 for the whole training.

"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], // [first_step, r, batch_size]

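A schedule in this format can be read with a small lookup like the one below. This is a sketch with a hypothetical function name; it assumes that the entry with the largest `first_step` not exceeding the current step is the active one.

```python
from typing import List, Tuple

# [first_step, r, batch_size], copied from the config line above.
GRADUAL_TRAINING = [[0, 7, 64], [1, 5, 64], [50000, 3, 32],
                    [130000, 2, 32], [290000, 1, 32]]


def lookup_schedule(global_step: int, schedule: List[List[int]]) -> Tuple[int, int]:
    """Return (r, batch_size) of the latest entry that has already started."""
    current = schedule[0]
    for entry in schedule:
        if global_step >= entry[0]:
            current = entry
    return current[1], current[2]


for step in (0, 1, 49999, 50000, 300000):
    r, batch_size = lookup_schedule(step, GRADUAL_TRAINING)
    print(f"step {step}: r={r}, batch_size={batch_size}")
```

Under this reading, r=7 is used only for step 0 and r=5 takes over from step 1.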
I trained the MB-Melgan vocoder using real spectrograms for up to 1.5M steps, which took 10 days on a single GPU machine. For the first 600K iterations, it is pre-trained with only the supervised loss as in [11], and then the discriminator is enabled for the rest of the training. I do not apply any learning rate schedule; I used 1e-4 for the whole training.

DDC Attention Performance

Fig3. shows the validation alignments of the fine and the coarse decoders, which have r=1 and r=7 respectively. We observe that the two decoders show almost identical attention alignments, with a slight roughness in the coarse decoder's alignment due to the interpolation.

DDC significantly shortens the time required to learn the attention alignment. In my experiments, the model is able to align after just 1k steps, as opposed to ~8k steps with normal location-sensitive attention.

Fig3. Attention alignments of the fine decoder (left) and the interpolated coarse decoder (right)

At inference time, we ignore the coarse decoder and use only the fine decoder. Fig4. depicts the model outputs and attention alignments at inference time for 4 different sentences that were not seen at training time. This shows that the fine decoder is able to generalize successfully to novel sentences.

Fig4. DDC model outputs and attention alignments at test time.

I used the 50 hard sentences introduced by [7] to check the attention quality of the DDC model. As you can see in the notebook below (open it on Colab to listen to Griffin-Lim based voice samples), the DDC model performs without any alignment problems. It is the first model, to my knowledge, that performs flawlessly on these sentences.

Recurrent Postnet

In Fig5. we see the average L1 difference between the real mel-spectrogram and the model prediction for each Postnet iteration. The results improve until the 3rd iteration. We also observe that some of the artifacts present after the first iteration are removed by the second iteration, which yields a better L1 value. This shows how effective the iterative application of the Postnet is at improving the final model predictions.

Fig5. Difference between the real mel-spectrogram and the Postnet prediction for each iteration. We see that the results improve until the 3rd iteration and that some of the artifacts are smoothed out at the second iteration. Please pay attention to the scale differences among the figures.

Future Work

First of all, I hope this section will not turn out to be a “here are the things we’ve not tried and will not try” section.

There are three aspects of DDC in particular that I would like to investigate further. The first is sharing the weights between the fine and the coarse decoders to reduce the total number of model parameters, and observing how the shared weights benefit from different resolutions.

The second is to measure the level of complexity required by the coarse decoder. That is, how much simpler the coarse architecture can get without performance loss.

Finally, I would like to try DDC with different model architectures.


Here I tried to summarize a new method that significantly accelerates model training, provides steadfast attention alignment, and offers a choice along the quality-speed spectrum by switching between the fine and the coarse decoders at inference. The user can choose depending on run-time requirements.

You can replicate all this work using Mozilla TTS. You can also see voice samples and Colab Notebooks from the links above. Let me know how it goes if you try DDC in your project.

If you would like to cite this work:

Gölge E. (2020) Solving Attention Problems of TTS models with Double Decoder Consistency.


[1] Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. 1–10.

[2] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2–6.

[3] Ioffe, S., & Szegedy, C. (n.d.). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

[4] Tachibana, H., Uenoyama, K., & Aihara, S. (2017). Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.

[5] Zheng, Y., Wang, X., He, L., Pan, S., Soong, F. K., Wen, Z., & Tao, J. (2019). Forward-Backward Decoding for Regularizing End-to-End TTS.

[6] Keith Ito, The LJ Speech Dataset (2017)

[7] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech.

[8] Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search.

[9] Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2020). FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. 1–11.

[10] Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., & Xiao, J. (2020). Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow. 7209–7213.

[11] Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech.

[12] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., & Simonyan, K. (2019). High Fidelity Speech Synthesis with Adversarial Networks. 1–17.

[13] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., & Simonyan, K. (2019). High Fidelity Speech Synthesis with Adversarial Networks. 1–17.


Two Attention Methods for Better Alignment with Tacotron

In this post, I would like to introduce two methods that, in my experience, worked well for better attention alignment in Tacotron models. If you would like to try them yourself, you can visit Mozilla TTS. The first method is the Bidirectional Decoder and the second is Graves Attention (Gaussian Attention) with small tweaks.

Bidirectional Decoder

from the paper

Bidirectional decoding uses an extra decoder which takes the encoder outputs in the reverse order and then, there is an extra loss function that compares the output states of the forward decoder with the backward one. With this additional loss, the forward decoder models what it needs to expect for the next iterations. In this regard, the backward decoder punishes bad decisions of the forward decoder and vice versa.
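The extra loss can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: it assumes the backward decoder's outputs are flipped in time and compared with the forward outputs using a mean absolute difference, and the function name is hypothetical.

```python
from typing import List

Frames = List[List[float]]  # [time][feature]


def bidirectional_consistency_loss(forward: Frames, backward: Frames) -> float:
    """Mean absolute difference between forward outputs and time-flipped backward outputs."""
    flipped = backward[::-1]  # the backward decoder emits the last frame first
    total, count = 0.0, 0
    for row_f, row_b in zip(forward, flipped):
        for x_f, x_b in zip(row_f, row_b):
            total += abs(x_f - x_b)
            count += 1
    return total / count


forward_out = [[1.0, 2.0], [3.0, 4.0]]
agreeing_backward = [[3.0, 4.0], [1.0, 2.0]]     # same frames, reversed order
disagreeing_backward = [[1.0, 2.0], [3.0, 4.0]]  # reversed content after flipping
print(bidirectional_consistency_loss(forward_out, agreeing_backward))
print(bidirectional_consistency_loss(forward_out, disagreeing_backward))
```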

Intuitively, if the forward decoder fails to align the attention, that causes a large loss, and ultimately it learns to move monotonically through the alignment process with a correction induced by the backward decoder. Therefore, this method is able to prevent “catastrophic failure”, where the attention falls apart in the middle of a sentence and never aligns again.

At inference time, the paper suggests using only the forward decoder and discarding the backward decoder. However, it is possible to think of more elaborate ways to combine these two models.

Example attention figures of both of the decoders.

There are 2 main pitfalls of this method. First, due to the additional parameters of the backward decoder, this model is slower to train (almost 2x), and this makes a huge difference especially when the reduction rate (the number of frames the model generates per iteration) is low. Second, if the backward decoder penalizes the forward one too harshly, it causes overall prosody degradation. For this reason, the paper suggests activating the additional loss only for fine-tuning.

My experience is that Bidirectional training is quite robust against alignment problems, and it is especially useful if your dataset is hard. It also aligns almost right after the first epoch. Yes, at inference time it sometimes causes pronunciation problems, but I solved this by doing the opposite of the paper’s suggestion: I fine-tuned the network without the additional loss for just one epoch and everything started to work well.

Graves Attention

Tacotron uses Bahdanau Attention, which is a content-based attention method. However, it does not consider location information; therefore, it needs to learn the monotonicity of the alignment just by looking at the content, which is hard. Tacotron2 uses Location Sensitive Attention, which takes the previous attention weights into account. By doing so, it learns the monotonic constraint. But it does not solve all of the problems, and you can still experience failures with long or out-of-domain sentences.

Graves Attention is an alternative that uses content information to decide how far it needs to move along the alignment per iteration. It does this by using a mixture of Gaussian distributions.

Graves Attention takes the context vector of time t-1, passes it through a couple of fully connected layers ([FC > ReLU > FC] in our model) and estimates the step size, variance and distribution weights for time t. Then the estimated step size is used to update the means of the Gaussian modes. By analogy, the mean is the point of interest on the alignment path, the variance is the attention window around this point of interest, and the distribution weight is the importance of each distribution head.

    \[\delta = softplus(k)\]

    \[\sigma = exp(-b)\]

    \[w = softmax(g)\]

    \[\mu_{t} = \mu_{t-1} + \delta\]

    \[\alpha_{i,j} = \sum_{k} w_{k} exp\left(-\frac{(j-\mu_{i,k})^2}{2\sigma_{i,k}}\right )\]

Above, I try to formulate how I compute the alignment in my implementation. \(g, b, k\) are intermediate values, \(\delta\) is the step size, \(\sigma\) is the variance, and \(w_k\) is the distribution weight for the GMM node k. (You can also check the code).
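The equations above translate into a short plain-Python step function. This is a sketch under assumptions: the intermediate values g, b, k are passed in directly instead of being produced by the [FC > ReLU > FC] layers, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(mu_prev: List[float], g: List[float], b: List[float],
                       k: List[float], num_inputs: int
                       ) -> Tuple[List[float], List[float]]:
    """One attention step: returns (alpha over input positions, updated means)."""
    delta = [softplus(x) for x in k]            # step size, so the means only move forward
    sigma = [math.exp(-x) for x in b]           # variance
    w = softmax(g)                              # mixture weights
    mu = [m + d for m, d in zip(mu_prev, delta)]
    alpha = [sum(w_k * math.exp(-((j - mu_k) ** 2) / (2.0 * s_k))
                 for w_k, mu_k, s_k in zip(w, mu, sigma))
             for j in range(num_inputs)]
    return alpha, mu


# Single Gaussian head starting at position 0 with all intermediate values at 0.
alpha, mu = gmm_attention_step([0.0], g=[0.0], b=[0.0], k=[0.0], num_inputs=5)
print([round(a, 3) for a in alpha], mu)
```

Because the step size goes through softplus, the means can never move backwards, which is where the monotonic behaviour comes from. Note that the weights in `alpha` are not normalized to sum to 1 in this formulation.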

Some other versions are explained here, but so far I have found that the above formulation works best for me, without any NaNs in training. I also noticed that with the method claimed to be the best in this paper, one of the distribution nodes overruns the others in the middle of training and, basically, attention starts to run on a single Gaussian head.

Test time attention plots with Graves Attention. The attention looks softer due to the scale differences of the values. In the first step, the attention weight is large since the distributions have almost uniform weights. They diverge as the attention runs forward.

The benefit of using GMM is to have more robust attention. It is also computationally light-weight compared to both bidirectional decoding and normal location attention. Therefore, you can increase your batch size and possibly converge faster.

The downside is that, although my experiments are not complete, GMM attention provided slightly worse prosody and naturalness compared to the other methods.


Here I compare Graves Attention, Bidirectional Decoding and Location Sensitive Attention trained on the LJSpeech dataset. For the comparison, I used the set of sentences provided by this work. There are 50 sentences in total.

Bidirectional Decoding has 1, Graves attention has 6, Location Sensitive Attention has 18, Location Sensitive Attention with inference time windowing has 11 failures out of these 50 sentences.

In terms of prosodic quality, in my opinion, Location Sensitive Attention > Bidirectional Decoding > Graves Attention > Location Sensitive Attention with Windowing. However, I should say the quality difference is hardly observable on the LJSpeech dataset. I also need to point out that it is a hard dataset.

If you would like to try these methods, they are all implemented in Mozilla TTS, so give it a try.