Text to Speech Deep Learning Architectures

Small Intro. and Background

Up until now, I have worked on a variety of data types and ML problems, but never on audio. Now it is time to learn it, and the first thing to do is a comprehensive literature review (like a boss). Here I would like to share the top-notch DL architectures dealing with TTS (Text to Speech). I also invite you to check out Coqui TTS, which hosts a PyTorch implementation of the first version. (We switched to PyTorch for obvious reasons.) It is a work in progress, so please feel free to comment and contribute.

Below I share my pinpoint summaries of the well-known TTS papers. They are by no means complete, but they are useful for highlighting the important aspects of these papers. Let’s start.


  • Prosody: the rhythm, stress, and intonation of speech.
  • Phonemes: the units of sound we pronounce as we speak. They are necessary since words with very similar spelling might be pronounced very differently (e.g. “Rough” vs. “Though”).
  • Vocoder: the part of the system that decodes features into audio signals. WaveNet is used at that stage in Deep Voice.
  • Fundamental Frequency – F0: lowest frequency of a periodic waveform describing the pitch of the sound.
  • Autoregressive Model: A model whose output at each step depends (linearly, in the classical definition) on its own previous outputs and on a set of parameters which can be estimated from data.
  • Query, Key, Value: The query vector is the hidden state of the decoder. Keys are compared with the query by the attention module to compute the attention weights. Values are the vectors weighted by the attention weights to compute the module output.
  • Grapheme: Cool way to say character.
  • Error Modes: Sub-optimal states of the attention block from which it is not able to escape.
  • Monotonic Attention: Uses only a limited scope of nodes close in time to the output step. It improves performance for TTS since there is a clear relation between the output at time t and the input at time t. However, it is less reasonable for translation problems, since the word order might not be the same in both languages.
  • MOS: Mean Opinion Score. Crowd-source the evaluation process with native speakers. It is not easy to measure, especially for a layman.
  • Context vector: Output of an attention module which summarizes multiple time-step outputs of the encoder.
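  • Hann Window Function: a smooth, raised-cosine window that tapers each audio frame to zero at its edges, used to reduce spectral leakage before a Fourier transform.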
  • Teacher Forcing: Providing the expected (ground-truth) output at time t, rather than the model’s own prediction, as the input at time t+1 during training. It is controlled ground-truth feedback, like a teacher correcting a student.
  • Causal convolution: Convolution which does not see the future units beyond the reference time step T, whose next value we like to predict. In practice, it is implemented by padding only the left (past) side of the input to a normal convolution layer.
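
To make the causal convolution entry above more concrete, here is a minimal pure-Python sketch. It is only an illustration, not how WaveNet or Deep Voice implement it: the function name `causal_conv1d` and the toy signal and kernel are my own assumptions, and real models use a deep learning framework's convolution layers with left padding instead of an explicit loop.

```python
def causal_conv1d(signal, kernel):
    """1-D causal convolution: output[t] depends only on signal[t] and earlier steps.

    Hypothetical helper for illustration; kernel[-1] weights the current step
    and kernel[0] weights the oldest step inside the window.
    """
    k = len(kernel)
    # Pad k-1 zeros on the left (the past) only, so the output keeps the
    # input length and no window ever reaches into the future.
    padded = [0.0] * (k - 1) + list(signal)
    output = []
    for t in range(len(signal)):
        window = padded[t : t + k]  # covers original steps t-k+1 .. t
        output.append(sum(w * x for w, x in zip(kernel, window)))
    return output


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.5]  # average of the previous and the current step
smoothed = causal_conv1d(signal, kernel)
print(smoothed)  # [0.5, 1.5, 2.5, 3.5]
```

Changing the last sample of `signal` changes only the last output value, which is exactly the "does not see the future" property.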

Deep Voice (25 Feb 2017)

  • Text to phonemes. Deterministically computed with a dictionary, or with a Seq2Seq model to deal with unseen words.
  • The same phoneme might have different durations in different words, so we need to predict the duration. It is sequence dependent.
  • Fundamental frequency for the pitch of each phoneme. It is sequence dependent.
  • Frequency + Phonemes + Duration = Voice synthesis. It is done via Google’s WaveNet.
  • Models
    • Segmentation Model
      • Segment audio signal to phonemes.
      • CTC loss
      • Predicts phoneme pairs instead of single phonemes, since the probability mass of a pair peaks near the boundary between the two phonemes
      • Inputs:
        • Audio clip of “It was early spring”
        • Phonemes (label)
          • [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
      • Outputs:
        • Pairs of Phonemes with their start time
          • [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
    • Fundamental Freq & Duration Models
      • Segmentation model predictions are the labels for these models.
      • Inputs:
        • Phonemes
      • Outputs:
        • Duration, Probability, F0 for each phoneme; [H, 0.1, 25hz], …
    • Audio Synthesizer Model
      • Simplified WaveNet
      • Inputs:
        • Duration and F0 for phonemes + audio signals (labels)
      • Outputs:
        • Synthesized audio signal
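
The text-to-phonemes step at the top of the list can be sketched as a dictionary lookup with a fallback for unseen words. This is only a toy under stated assumptions: `TOY_LEXICON`, `WORD_BOUNDARY`, `spell_out`, and `text_to_phonemes` are hypothetical names, the lexicon holds just the four words from the "It was early spring" example, and the fallback simply spells the word out where Deep Voice would use a trained Seq2Seq model.

```python
# Toy pronunciation dictionary. The entries are copied from the
# "It was early spring" example above, not from a real lexicon.
TOY_LEXICON = {
    "it": ["IH1", "T"],
    "was": ["W", "AA1", "Z"],
    "early": ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
}
WORD_BOUNDARY = "."  # separator symbol used in the example label sequence


def spell_out(word):
    """Hypothetical fallback for unseen words: return the graphemes.

    A trained Seq2Seq grapheme-to-phoneme model would take this place.
    """
    return list(word.upper())


def text_to_phonemes(text, lexicon=TOY_LEXICON, fallback=spell_out):
    """Convert whitespace-separated words into a flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        # Deterministic lookup first; fall back only for unknown words.
        phonemes.extend(lexicon.get(word) or fallback(word))
        phonemes.append(WORD_BOUNDARY)
    return phonemes


print(text_to_phonemes("It was early spring"))
```

For the example sentence this reproduces the label sequence shown in the Segmentation Model inputs. Punctuation and stress handling are left out on purpose to keep the sketch short.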
