Using IPython (ipdb) for debugging

This is one of the things I always need but I forget. So here is a piece of mind to check back.

pip install ipython 
pip install ipdb
export PYTHONBREAKPOINT=ipdb.set_trace  # this is to use ipdb by default

When you run your code with a breakpoint, you get an IPython shell for debugging. So you can use all the perks like autocomplete, magic functions, etc.

print("this is an example")
print("this is the end")

Notes on GPT-3

Original paper :

  • It uses the same architecture as GPT-2.
  • The largest model uses 170B parameters and trained with a batch size of 3.2 million. (Wow!).
  • Training cost exceeds $12M.
  • “Taking all these runs into account, the researchers estimated that building this model generated over 78,000 pounds of CO2 emissions in total—more than the average American adult will produce in two years.”[link]
  • They used a system with more than 285K CPU cores 10K GPUs and 400 Gigabits network connectivity per machine. (Too much pollution).
  • The model is trained on the whole Wikipedia, 2 different Book datasets, and Common Crawl.
  • It learns different tasks with a task description, example(s), and a prompt.
    • Task description is a definition of target action, like “Translate from English to France…”
    • Example(s) is a sample or a set of samples used in one-shot or few-shot learning settings.
    • Prompt is an input on which the target action is performed.
  • The larget the model, the better the results.
  • They perform zero-shot, one-shot, and few-shot learning with the pre-trained language model for specific tasks.
  • At the Question Answering task, it outperforms SOTA models trained with the source documents.
  • At the Translation task, it performs close to SOTA. It is better at translating a language to English than otherwise, given it is trained on an English corpus.
  • Winograd task is determining which word a pronoun refers to in a sentence.
  • Physical Q&A is asking questions about grounded knowledge about the physical world. Outperforms SOTA in all the learning settings.
  • Reading Comprehension is asking questions about a given document. Performs poorly in relation to SOTA.
  • Causal Reasoning is giving a sentence and asking the most possible outcome.
  • Natural Language Interference is a task to determine if the 2nd sentence is matching or conflicting with the 1st sentence. It performs well here.
  • at arithmetic operations, small models perform poorly and large models perform good, especially at summation. They discuss that multiplication is a harder operation.
  • At SAT, it performs better than an average student.
  • Human accuracy to detect the articles written by the model is close to the random guess with the largest model.

I believe GPT-3 is not capable of “reasoning” in contrary to the common belief. GPT-3 rather constitutes an efficient storage mechanism for data it is trained with. At inference time, the model determines the output by finding the samples that are most relevant to the given task and interpolating them.

This is more apparent at the arithmetics task. The summation task is much easier since it is easier to memorize the whole table of information from the training data. And as it gets harder with multiplication the model struggles to fetch the relevant information and the performance drops.

You can also observe that when you take sentences from the generated articles in the paper and google them. Although they do not exactly match any article on the Web, you see very similar content and sometimes sentences that are different by only a couple of words.


Solving Attention Problems of TTS models with Double Decoder Consistency

Model Samples:

Colab Notebook (PyTorch): link

Colab Notebook (Tensorflow): link

Despite the success of the latest attention based end2end text2speech (TTS) models, they suffer from attention alignment problems at inference time. They occur especially with long-text inputs or out-of-domain character sequences. Here I like to propose a novel technique to fight against these alignment problems which I call Double Decoder Consistency (DDC) (with a limited creativity). DDC consists of two decoders that learn synchronously with different reduction factors. We use the level of consistency of these decoders to attain better attention performance.

One of Shakespeare’s works read by the DDC model.

End-to-End TTS Models with Attention

Good examples of attention based TTS models are Tacotron and Tacotron2 [1][2]. Tacotron2 is also the main architecture used in this work. These models comprise a sequence-to-sequence architecture with an encoder, an attention-module, a decoder and an additional stack of layers called Postnet. The encoder takes an input text and computes a hidden representation from which the decoder computes predictions of the target acoustic feature frames. A context-based attention mechanism is used to align the input text with the predictions. Finally, decoder predictions are passed over the Postnet which predicts residual information to improve the reconstruction performance of the model. In general, mel-spectrograms are used as acoustic features to represent audio signals in a lower temporal resolution and perceptually meaningful way.

Tacotron proposes to compute multiple non-overlapping output frames by the decoder. You are able to set the number of output frames per decoder step which is called ‘reduction rate’ (r). Larger the reduction rate, fewer the number of decoder steps required for the model to produce the same length output. Thereby, the model achieves faster training convergence and easier attention alignment, as explained in [1]. However, larger r values also produce smoother output frames and therefore, reduce the frame-level details.

Although these models are used in TTS systems for more natural-sounding speech, they frequently suffer from attention alignment problems, especially at inference time, because of out-of-the-domain words, long input texts, or intricacies of the target language. One solution is to use larger r for a better alignment however, as note above, it reduces the quality of the predicted frames. DDC tries to mitigate these attention problems by acting on these observations to find a suitable architecture finding the middle ground.

Fig1. This is an overview of the model used in this work. (Excuse my artwork).

The bare-bone model used in this work is formalized as follows:

    \[\{h_l\}^L_{l=1} = Encoder(\{x_l\}^L_{l=1})\]

    \[p_t = Prenet(o_{t-1})\]

    \[q_t = concat(p_t, c_{t-1})\]

    \[a_t = Attention(q_t, \{h_l\}^L_{l=1})\]

    \[c_t = \sum_{l}a_{t,l}h_l\]

    \[o_t = RNNs(c_t), \quad   o_t = \{f_{t.r}, ..., f_{t.r + r}\}\]

    \[\{o_t\}^T_{t=1}, \{a_t\}^{T}_{t=1} = Decoder(\{h_i\}^L_{i=1}; r)\]

    \[\{f^D_k\}^K_{k=1} = reshape(\{o_t\}^T_{t=1})\]

    \[\{f^P_k\}^K_{k=1} = Postnet((\{f^D_k\}^K_{k=1})\]

    \[L = ||f^P - y || + ||f^D - y||\quad(loss)\]

{y_k}<em>{k=1}^K is a sequence of acoustic feature frames. {x_l}</em>{l=1}^L is a sequence of characters or phonemes, from which we compute sequence of encoder outputs {h_l}_{l=1}^L. r is the reduction factor which defines the number of output frames per decoder step. Attention alignments, query vector and encoder output at decoder step t are donated by a_t, o_t, q_t, o_t respectively. Also, o_t defines a set of output frames whose size changed by r. Total number of decoder steps is donated by T.

Note that teacher forcing is applied at training. Therefore, K=T*r at training time. However, the decoder is instructed to stop at inference by a separate network (Stopnet) which predicts a value in a range [0, 1]. If its prediction is larger than a defined threshold, the decoder stops inference.

Double Decoder Consistency

DDC bases on two decoders working simultaneously with different reduction factors (r). One decoder (coarse) works with a large, and the other decoder (fine) works with a small reduction factor.

DDC is designed to settle the trade-off between the attention alignment and the predicted frame quality tunned by the reduction factor. In general, standard models have more robust attention performance with a larger r but due to the smoothing effect of multiple-frames prediction per iteration, final acoustic features are coarser compared to lower reduction factor models.

DDC combines these two properties at training time as it uses the coarse decoder to guide the fine decoder to preserve the attention performance without a loss of precision in acoustic features. DDC achieves this by introducing an additional loss function comparing the attention vectors of these two decoders.

For each training step, both decoders compute their relative attention vectors and the outputs. Due to the differences in their respective r values, their attention vectors are in different lengths. The coarse decoder produces a shorter vector compared to the fine decoder. In order to mitigate this, we interpolate the coarse attention vector to match the length of the fine attention vector. After having them in the same length we use a loss function to penalize the difference in the alignments. This loss is able to synchronize two decoders with respect to their alignments.

Fig2. DDC model architecture.

The two decoders take the same input from the encoder. They also compute the outputs in the same way except they use different reduction factors. The coarse decoder uses a larger reduction factor compared to the fine decoder. These two decoders are trained with separate loss functions comparing their respective outputs with the real feature frames. The only interaction between these two decoders is the attention loss applied to compare their respective attention alignments.

    \[\{{f^{D_f}}_k\}^K_{k=1}, \{a^f_t\}^{T_f}_{t=1} = Decoder_F(\{h_i\}^L_{i=1}; r_f)\]

    \[\{{f^{D_c}}_k\}^K_{k=1}, \{a^c_t\}^{T_c}_{t=1} = Decoder_C(\{h_i\}^L_{i=1}; r_c)\]

    \[{\{a^\prime^c_t\}^{T_f}_{t=1}} = interpolate(\{a^c_t\}^{T_c}_{t=1})\]

    \[L_{DDC}= ||a^F - a^C||\]

    \[L_{model} = ||f^P - y || + ||f^{D_f} - y||+ ||f^{D_c} - y|| + ||a^F - a^C||\]

Other Model Updates

Batch Norm Prenet

Prenet is an important part of Tacotron like auto-regressive models. It projects model output frames before passing to the decoder. Essentially, it computes an embedding space of the feature (spectrogram) frames by which the model de-factors the distribution of upcoming frames.

I replace the original Prenet (PrenetDropout) with the one using Batch Normalization [3] (PrenetBN) after each dense layer and I remove Dropout layers. Dropout is necessary for learning attention, especially when the data quality is low. However, it causes problems at inference due to distributional differences between training and inference time. Using Batch Normalization is a good alternative. It avoids the issues of Dropout and also provides a certain level of regularization due to the noise of batch-level statistics. It also normalizes computed embedding vectors and generates a well-shaped embedding space.

Gradual Training

I use gradual training scheme for the model training. I’ve introduced the gradual training in a previous blog post. In short, we start the model training with a larger reduction factor and gradually reduce it as the model saturates.

Gradual Training shortens the total training time significantly and yields better attention performance due to its progression from coarse to fine information levels.

Recurrent PostNet at inference

The Postnet is the part of the network applied after the Decoder to improve the Decoder predictions before the vocoder. Its output is summed with the Decoder’s to be the final output of the model. Therefore, it predicts a residual which improves the Decoder output. So we can also apply Postnet more than one time assuming, it computes useful residual information for each time. I applied this trick only at inference and observe that, up to a certain number of iterations, it improves the performance. For my experiments, I set the number of iterations to 2.

MB-Melgan Vocoder with Multiple Random Window Discriminator

As a vocoder, I use Multi-Band Melgan [11] generator. It is trained with Multiple Random Window Discriminator (RWD)[13] different than the original work [11] where they used Multi-Scale Melgan Discriminator (MSMD)[12].

The main difference between these two is that RWD uses audio level information and MSMD uses spectrogram level information. More specifically, RWD comprises multiple convolutional networks each takes different length audio segments with different sampling rates and performs classification whereas MSMD uses convolutional networks to perform the same classification on STFT output of the target voice signal.

In my experiments, I observed better RWD yields better results with more natural and less abberated voice.

Related Work

Guided attention [4] uses a soft diagonal mask to force the attention alignment to be diagonal. As we do, it uses this constant mask at training time to penalize the model with an additional loss term. However, due to its constant nature, it dictates a constant prior to the model which does not always to be true, especially long sentences with various pauses. It also causes skipping in my experiments which are tried to be solved by using a windowing approach at inference time in their work.

Using multiple decoders is initially introduced by [5]. They use two decoders that run in forward and backward directions through the encoder output. The main problem with this approach is that because of the use of two decoders with identical reduction factors, it is almost 2 times slower to train compared to a vanilla model. We solve the problem by using the second decoder with a higher reduction rate. It accelerates the training significantly and also gives the user the opportunity to choose between the two decoders depending on run-time requirements. DDC also does not use any complex scheduling or multiple loss signals that aggravates the model training.

Lately, new TTS models introduced by [7][8][9][10] predicting output duration directly from the input characters. These models train a duration-predictor or use approximation algorithms to find the duration of each input character. However, as you listen to their samples, it is observed that these models lead to degraded timbre and naturalness. This is because of the indirect hard alignment produced by these models. However, models with soft-attention modules can adaptively emphasize different parts of the speech producing a more natural speech.

Results and Experiments

Experiment Setup

All the experiments are performed using LJspeech dataset [6] . I use a sampling-rate of 22050 Hz and mel-scale spectrograms as the acoustic feature. Mel-spectrograms are computed with hop-length 256, window-length 1024. Mel-spectrograms are normalized into [-4, 4]. You can see the used audio parameters below in Coqui TTS config format.

        // stft parameters
        "num_freq": 513,         // number of stft frequency levels. Size of the linear spectogram frame.
        "win_length": 1024,      // stft window length in ms.
        "hop_length": 256,       // stft window hop-lengh in ms.
        "frame_length_ms": null, // stft window length in ms.If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-lengh in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
        "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
        "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": true,// enable trimming of slience of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for timming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 0.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 8000.0,     // maximum freq level for mel-spec. Tune for dataset!!

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.

I used Tacotron2[2] as the base architecture with location-sensitive attention and applied all the model updates expressed above. The model is trained for 330k iterations and it took 5 days with a single GPU although the model seems to produce satisfying quality after only 2 days of training with DDC. I used a gradual training schedule shown below. The model starts with r=7 and batch-size 64 and gradually reduces to r=1 and batch-size 32. The coarse decoder is set r=7 for the whole training.

"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], // [first_step, r, batch_size]

I trained MB-Melgan vocoder using real spectrograms up to 1.5M steps, which took 10 days on a single GPU machine. For the first 600K iterations, it is pre-trained with only the supervised loss as in [11] and than the discriminator is enabled for the rest of the training. I do not apply any learning rate schedule and I used 1e-4 for the whole training.

DDC Attention Performance

Fig3. shows the validation alignments of the fine and the coarse decoders which have r=1 and r=7 respectively. We observe that two decoders show almost identical attention alignments with a slight roughness with the coarse decoder due to the interpolation.

DDC significantly shortens the time required to learn the attention alignmet. In my experiments, the model is able to align just after 1k steps as opposed to ~8k steps with normal location-sensitive attention.

Fig3. Attention Alignments of the fine decoder (left) and interpolated the coarse (right)

At the inference time, we ignore the coarse decoder and use only the fine decoder. Below (Fig.4) depicts the model outputs and attention alignments at inference time with 4 different sentences that are not seen at training time. This shows us that the fine decoder is able to generalize successfully on novel sentences.

Fig4. DDC model outputs and attention alignments at test time.

I used 50 hard-sentences introduced by [7] to check the attention quality of the DDC model. As you see in the notebook below (Open it on Colab to listen to Griffin-Lim based voice samples), the DDC model performs without any alignment problems. It is the first model, to my knowledge, which performs flawlessly on these sentences.

Recurrent Postnet

In Fig5. we see the average L1 difference between the real mel-spectrogram and the model prediction for each Postnet iteration. The results improve until the 3rd iteration. We also observe that some of the artifacts after the first iteration are removed by the second iteration that yields a better L1 value. Therefore, we see here how effective the iterative application of the Posnet to improve the final model predictions.

Fig5. (Click on the figure to see larger) Difference between real mel-spectrogram and the Postnet prediction for each iteration. We see that the results improve until the 3rd iteration and some of the artifacts are smoothen at the second iteration. Please pay attention to the scale differences among the figures.

Future Work

First of all I hope this section would not be “here are the things we’ve not tried and will not try” section.

There are specifically three aspects of DDC which I like to investigate more. The first is sharing the weights between the fine and the coarse decoders to reduce the total number of model parameters and observing how the shared weights benefit from different resolutions.

The second is to measure the level of complexity required by the coarse decoder. That is, how much simpler the coarse architecture can get without performance loss.

Finally, I like to try DDC with the different model architectures.


Here I tried to summarize a new method that significantly accelerates model training, provides steadfast attention alignment and provides a choice in a spectrum of quality and speed switching between the fine and the coarse decoders at inference. The user can choose depending on run-time requirements.

You can replicate all this work using Coqui TTS. You can also see voice samples and Colab Notebooks from the links above. Let me know how it goes if you try DDC in your project.

If you like to cite this work:

Gölge E. (2020) Solving Attention Problems of TTS models with Double Decoder Consistency.


[1] Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards End-to-End Speech Synthesis. 1–10.

[2] Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R. A., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. 2–6.

[3] Ioffe, S., & Szegedy, C. (n.d.). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

[4] Tachibana, H., Uenoyama, K., & Aihara, S. (2017). Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.

[5] Zheng, Y., Wang, X., He, L., Pan, S., Soong, F. K., Wen, Z., & Tao, J. (2019). Forward-Backward Decoding for Regularizing End-to-End TTS.

[6] Keith Ito, The LJ Speech Dataset (2017)

[7] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech.

[8] Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search.

[9] Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2020). FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. 1–11.

[10] Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., & Xiao, J. (2020). Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow. 7209–7213.

[11] Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech.

[12] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., & Simonyan, K. (2019). High Fidelity Speech Synthesis with Adversarial Networks. 1–17.

[13] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., & Simonyan, K. (2019). High Fidelity Speech Synthesis with Adversarial Networks. 1–17.


Two Attention Methods for Better Alignment with Tacotron

In this post, I like to introduce two methods that worked well in my experience for better attention alignment in Tacotron models. If you like to try your own you can visit Coqui TTS. The first method is Bidirectional Decoder and the second is Graves Attention (Gaussian Attention) with small tweaks.

Bidirectional Decoder

Bidirectional decoding uses an extra decoder which takes the encoder outputs in the reverse order and then, there is an extra loss function that compares the output states of the forward decoder with the backward one. With this additional loss, the forward decoder models what it needs to expect for the next iterations. In this regard, the backward decoder punishes bad decisions of the forward decoder and vice versa.

Intuitionally, if the forward decoder fails to align the attention, that would cause a big loss and ultimately it would learn to go monotonically through the alignment process with a correction induced by the backward decoder. Therefore, this method is able to prevent “catastrophic failure” where the attention falls apart in the middle of a sentence and it never aligns again.

At the inference time, the paper suggests to us only the forward decoder and demote the backward decoder. However, it is possible to think more elaborate ways to combine these two models.

Example attention figures of both of the decoders.

There are 2 main pitfalls of this method. The first, due to additional parameters of the backward decoder, it is slower to train this model (almost 2x) and this makes a huge difference especially when the reduction rate is low (number of frames the model generates per iteration). The second, if the backward decoder penalizes the forward one too harshly, that causes prosody degradation in overall. The paper suggests activating the additional loss just for fine-tuning, due to this.

My experience is that Bidirectional training is quite robust against alignment problems and it is especially useful if your dataset is hard. It also aligns almost after the first epoch. Yes, at inference time, it sometimes causes pronunciation problems but I solved this by doing the opposite of the paper’s suggestion. I finetune the network without the additional loss for just an epoch and everything started to work well.

Graves Attention

Tacotron uses Bahdenau Attention which is a content-based attention method. However, it does not consider location information, therefore, it needs to learn the monotonicity of the alignment just looking into the content which is a hard deal. Tacotron2 uses Location Sensitive Attention which takes account of the previous attention weights. By doing so, it learns the monotonic constraint. But it does not solve all of the problems and you can still experience failures with long or out of domain sentences.

Graves Attention is an alternative that uses content information to decide how far it needs to go on the alignment per iteration. It does this by using a mixture of Gaussian distribution.

Graves Attention takes the context vector of time t-1 and passes it through couple of fully connected layers ([FC > ReLU > FC] in our model) and estimates step-size, variance and distribution weights for time t. Then the estimated step-size is used to update the mean of Gaussian modes. Analogously, mean is the point of interest t the alignment path, variance is attention window over this point of interest and distribution weight is the importance of each distribution head.

    \[\delta = softplus(k)\]

    \[\sigma = exp(-b)\]

    \[w = softmax(g)\]

    \[\mu_{t} = \mu_{t-1} + \delta\]

    \[\alpha_{i,j} = \sum_{k} w_{k} exp\left(-\frac{(j-\mu_{i,k})^2}{2\sigma_{i,k})}\right )\]

I try to formulate above how I compute the alignment in my implementation.

    \[g, b, k\]

are intermediate values.


is the step size,


is the variance,


is the distribution weight for the GMM node k. (You can also check the code).

Some other versions are explained here but so far I found the above formulation works for me the best, without any NaNs in training. I also realized that with the best-claimed method in this paper, one of the distribution nodes overruns the others in the middle of the training and basically, attention starts to run on a single Gaussian head.

Test time attention plots with Graves Attention. The attention looks softer due to the scale differences of the values. In the first step, the attention is weight is big since distributions have almost uniform weights. And they differ as the attention runs forward.

The benefit of using GMM is to have more robust attention. It is also computationally light-weight compared to both bidirectional decoding and normal location attention. Therefore, you can increase your batch size and possibly converge faster.

The downside is that, although my experiments are not complete, GMM’s not provided slightly worse prosody and naturalness compared to the other methods.


Here I compare Graves Attention, Bidirectional Decoding and Location Sensitive Attention trained on LJSpeech dataset. For the comparison, I used the set of sentences provided by this work. There are in total of 50 sentences.

Bidirectional Decoding has 1, Graves attention has 6, Location Sensitive Attention has 18, Location Sensitive Attention with inference time windowing has 11 failures out of these 50 sentences.

In terms of prosodic quality, in my opinion, Location Sensitive Attention > Bidirectional Decoding > Graves Attention > Location Sensitive Attention with Windowing. However, I should say the quality difference is hardly observable in LJSpeech dataset. I also need to point out that, it is a hard dataset.

If you like to try these methods, all these are implemented on Coqui TTS and give it a try.


Gradual Training with Tacotron for Faster Convergence

Tacotron is a commonly used Text-to-Speech architecture. It is a very flexible alternative over traditional solutions. It only requires text and corresponding voice clips to train the model. It avoids the toil of fine-grained annotation of the data. However, Tacotron might also be very time demanding to train, especially if you don’t know the right hyperparameters, to begin with. Here, I like to share a gradual training scheme to ease the training difficulty. In my experiments, it provides faster training, tolerance for hyperparameters and more time with your family.

In summary, Tacotron is an Encoder-Decoder architecture with Attention. it takes a sentence as a sequence of characters (or phonemes) and it outputs sequence of spectrogram frames to be ultimately converted to speech with an additional vocoder algorithm (e.g. Griffin-Lim or WaveRNN). There are two versions of Tacotron. Tacotron is a more complicated architecture but it has fewer model parameters as opposed to Tacotron2. Tacotron2 is much simpler but it is ~4x larger (~7m vs ~24m parameters). To be clear, so far, I mostly use gradual training method with Tacotron and about to begin to experiment with Tacotron2 soon.

Tacotron architecture (Thx @yweweler for the figure)

Here is the trick. Tacotron has a parameter called ‘r’ which defines the number of spectrogram frames predicted per decoder iteration. It is a useful parameter to reduce the number of computations since the larger ‘r’, the fewer the decoder iterations. But setting the value to high might reduce the performance as well. Another benefit of higher r value is that the alignment module stabilizes much faster. If you talk someone who used Tacotron, he’d probably know what struggle the attention means. So finding the right trade-off for ‘r’ is a great deal. In the original Tacotron paper, authors used ‘r’ as 2 for the best-reported model. They also emphasize the challenge of training the model with r=1.

Gradual training comes to the rescue at this point. What it means is that we set ‘r’ initially large, such as 7. Then, as the training continues, we reduce it until the convergence. This simple trick helps quite magically to solve two main problems. The first, it helps the network to learn the monotonic attention after almost the first epoch. The second, it expedites convergence quite much. As a result, the final model happens to have more stable and resilient attention without any degrigation of performance. You can even eventually let the network to train with r=1 which was not even reported in the original paper.

Here, I like to share some results to prove the effectiveness. I used LJspeech dataset for all the results. The training schedule can be summarized as follows. (You see I also change the batch_size but it is not necessary if you have enough GPU memory.)

“gradual_training”: [[0, 7, 32], [10000, 5, 32], [50000, 3, 32], [130000, 2, 16], [290000, 1, 8]] # [start_step, r, batch_size]

Below you can see the attention at validation time after just 1K iterations with the training schedule above.

Tacotron after 950 steps on LJSpeech. Don’t worry about the last part, it is just because the model does not know where to stop initially.

Next, let’s check the model training curve and convergence.

(Ignore the plot in the middle.) You see here the model jumping from r=7 to r=5. There is obvious easy gain after the jump.
Test time model results after 300K. r=1 after 290K steps.
Here is the training plot until ~300K iterations.
(For some reason I could not move the first plot to the end)

You can listen to voice examples generated with the final model using GriffinLim vocoder. I’d say the quality of these examples is quite good to my ear.

It was a short post but if you like to replicate the results here, you can visit our repo Coqui TTS and just run the training with the provided config.json file. Hope, imperfect documentation on the repo would help you. Otherwise, you can always ask for help creating an issue or on Github Discussions page. There are some other cool things in the repo that I also write about in the future. Until next time..!

Disclaimer: In this post, I just wanted to briefly share a trick that I find quite useful in my TTS work. Please feel free to share your comments. This work might be a more legit research work in the future.


Recovering Lost Tmux Session

After a while of using tmux, you might see that you cannot reconnect it from another terminal windows with the error message

error connecting to /tmp/tmux-1000/default (No such file or directory

The solution is easy but hard to find. Here is the magical command worked for me. Hope it works for you too!

pkill -USR1 tmux

Using WSL Linux on Windows 10 for Deep Learning Development.

(WSL now officially supports GPUs:

To explain briefly, WSL enables you to run Linux on Win10 and you can use your favorite Linux tools (bash, zsh, vim) for your development cycle and you can enjoy Win10 for the rest. It obviates the need for dual-boot configuration which might be a nightmare sometimes.

Why I do this? Basically, if you have an Optimus Laptop, it is an onerous job to set up a Linux distro. You need to find the right Nvidia driver to enable GPU. Then you need to install nvidia-prime or if you are lucky, you make bumblebee work. Let’s say you’ve done everything. After some time, you update something on your system by mistake and the next thing you see a black screen for the next reboot. It is time to search what is wrong on your phone and try to fix it. It is horrendous!

As far as my experience goes, WSL Linux gives all the necessary features for your development with a vital exception of reaching to GPU. You can apt-get software, run it. Even you can run a software with UI if you set things right. However, due to the GPU limitation, you are able to compile CUDA codes but cannot run on Linux. Here I just like to explain, how you can deal with limitation with a small trick using the ability of WSL Linux running Win binaries.

The first thing to do is to install your preferred Linux distro from Windows Store. Just go to the store, search for the distro and install. If installation is not available, you might need to update your Windows.

1. Install Linux and activate WSL

Before launching Linux, follow the documnetation here to activate WSL on Win10.

After you installed the distro and activated  WSL, you can either open the command-line and type “`bash“`  or directly use the Linux launcher to get into the linux terminal.

2. Install Hyper Terminal for Linux like experience.

One problem I’ve experienced with Windows command-line is the differences of shortcuts (Copy-Paste) and inability to open multiple tabs. These are quite important features for Linux custom. I solved this by switching to hyper terminal. It simulates the best possible Linux like experience.

3. Install Conda in Windows and add its binaries to `path`

Now you have Linux and a cool terminal. It is time to install the rest. Note that, if you don’t bother to use GPU, you can install everything you like on Linux right away and use. For this example, we install miniconda to Windows and use the python.exe from Linux to run our codes on GPU.

Another cool thing about Linux on WSL is that it enables you to run Windows binaries on Linux environment. Also, Windows’ PATH environment variable is exposed to Linux too. As we install python with miniconda, it asks you to add python.exe to the PATH variable. Just do it. Then we set an alias on the Linux side to run python.exe, when we type python so that we can develop things on Linux but run the code on Windows by using the GPU.

Now install miniconda. Say next until you see the screen below and set the ticks for the all options.

After the installation, if you run python on command-line you should see python session running as shown below.

You should also be able to run python form Linux. Open the terminal, switch Linux, type python.exe and you get it working.

4. Create aliases on Linux

The last step is to creating an alias on Linux bash that runs python.exe when you call python. These are the aliases I set.

alias conda="conda.exe"
alias ipython="ipython.exe"
alias nosetests="nosetests.exe"
alias pip="pip.exe"
alias nvidia-smi="/mnt/c/Program\ Files/NVIDIA\ Corporation/NVSMI/nvidia-smi.exe"

After all, you should be able to run your code on GPU. One important note is that since we use python on windows, you need to set folder paths in relation to Windows. Don’t forget the escape character for separating folder.

Right now, I created a folder /users/erogol/projects and I keep my development craft in it. So it is actually different from the home folder set for your Linux installation. But it does not matter since we use windows file paths. Now, you can install your favorite editor and enjoy training new models.

Please let me know if I skip something here. It is very likely since I wrote this after I set everything.

It is good too see that Microsoft changed direction and start to embrance Linux into their ecosystem by listening the needs of their users. It was a meaningless fight from the start.


Edit: You need run the terminal with Run as Administrator’ to install things with conda to windows.