Small Intro. and Background
Up until now, I have worked on a variety of data types and ML problems, except audio. Now it is time to learn it, and the first thing to do is a comprehensive literature review (like a boss). Here I'd like to share the top-notch DL architectures dealing with TTS (Text to Speech). I also invite you to Coqui TTS, which hosts a PyTorch implementation of the first version. (We switched to PyTorch for obvious reasons.) It is a work in progress, so please feel free to comment and contribute.
Below I'd like to share my pinpoint summaries of the well-known TTS papers. They are by no means complete, but they are useful for highlighting the important aspects of these papers. Let's start.
Glossary
- Prosody: https://en.wikipedia.org/wiki/Prosody_(linguistics)
- Phonemes: units of sound that we pronounce as we speak. They are necessary since words that look very similar in writing might be pronounced very differently (e.g. "Rough" vs. "Though").
- Vocoder: the part of the system that decodes features into audio signals. WaveNet is used in Deep Voice at that stage.
- Fundamental Frequency – F0: the lowest frequency of a periodic waveform; it describes the pitch of the sound.
- Autoregressive Model: a model that depends linearly on its own previous outputs and on a set of parameters that can be estimated.
- Query, Key, Value: the keys are compared with the query by the attention module to compute attention weights. The values are the vectors weighted by those attention weights to compute the module output. The query vector is typically the hidden state of the decoder. (A minimal sketch is given right after this glossary.)
- Grapheme: Cool way to say character.
- Error Modes: sub-optimal states that the attention block falls into and is not able to escape.
- Monotonic Attention: uses only a limited scope of nodes close in time to the output step. It improves performance for TTS, since there is a close relation between the output at time t and the input around time t. However, it is not as reasonable for translation, since word order might differ between the two languages. https://arxiv.org/pdf/1704.00784.pdf
- MOS: Mean Opinion Score. Crowd-sources the evaluation process with native speakers. Quality is not easy to judge, especially for a layman.
- Context vector: Output of an attention module which summarizes multiple time-step outputs of the encoder.
- Hann Window Function: https://en.wikipedia.org/wiki/Window_function#Hann_window
- Teacher Forcing: providing the model's expected (ground-truth) output at time t as its input at time t+1. It is controlled ground-truth feedback, as a teacher gives to a student.
- Causal convolution: a convolution that does not see future units beyond the reference time step t that we want to predict next. In practice, it is implemented by padding only the left side of a normal convolution layer. (A minimal sketch is given below.)
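To make the query/key/value and context-vector definitions above concrete, here is a minimal sketch of dot-product attention in PyTorch. The names (dot_product_attention, decoder_state, encoder_outputs) are my own, not from any of the papers below:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_outputs):
    """decoder_state: (batch, dim) query; encoder_outputs: (batch, time, dim) keys/values."""
    # Attention weights: compare the query against every encoder step (the keys).
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, time)
    weights = F.softmax(scores, dim=-1)
    # Context vector: weighted sum of the values, summarizing the encoder outputs.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)         # (batch, dim)
    return context, weights

# Toy usage
enc = torch.randn(2, 50, 256)   # 50 encoder time steps
dec = torch.randn(2, 256)       # current decoder hidden state (the query)
context, weights = dot_product_attention(dec, enc)
```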
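And here is a minimal sketch of the left-padding trick for causal convolutions. The CausalConv1d wrapper is my own; WaveNet-style synthesizers stack many dilated layers like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution whose output at time t only depends on inputs up to time t."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        # Pad only on the left so the kernel never looks into the future.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

y = CausalConv1d(channels=16, kernel_size=3, dilation=2)(torch.randn(1, 16, 100))
print(y.shape)  # (1, 16, 100): same length, no future leakage
```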
Deep Voice (25 Feb 2017)
- Text to phonemes, deterministically computed with a dictionary, or with a Seq2Seq model to deal with unseen words. (A toy lookup example is given at the end of this section.)
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict CMU dictionary
- The same phoneme might have different durations in different words, so we need to predict the duration. It is sequence dependent.
- Fundamental frequency is needed for the pitch of each phoneme. It is also sequence dependent.
- Frequency + Phonemes + Duration = Voice synthesis. It is done via Google’s WaveNet.
- Models
- Segmentation Model
- Segment audio signal to phonemes.
- CTC loss (a minimal usage sketch is given at the end of this section)
- Predicts phoneme pairs rather than single phonemes, so that the probability mass concentrates around the boundary between them
- Inputs:
- Audio clip of “It was early spring”
- Phonemes (label)
- [IH1, T, ., W, AA1, Z, ., ER1, L, IY0, ., S, P, R, IH1, NG, .]
- Outputs:
- Pairs of Phonemes with their start time
- [(IH1, T, 0:00), (T, ., 0:01), (., W, 0:02), (W, AA1, 0:025), (NG, ., 0:035)]
- Fundamental Freq & Duration Models
- Segmentation model predictions are the labels for these models.
- Inputs:
- Phonemes
- Outputs:
- Duration, probability (that the phoneme is voiced), and F0 for each phoneme; e.g. [H, 0.1, 25 Hz], … (a toy sketch of such a model is given at the end of this section)
- Audio Synthesizer Model
- Simplified WaveNet
- Inputs:
- Duration and F0 for phonemes + audio signals (labels)
- Outputs:
- Synthesized audio signal
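As a toy illustration of the dictionary-based text-to-phonemes step, here is a sketch with a tiny hand-made CMU-style lookup table. The mini dictionary and the fallback behaviour are my own simplification; the real CMU dictionary linked above covers far more words, and unseen words need the Seq2Seq model mentioned earlier:

```python
# Toy grapheme-to-phoneme lookup in the spirit of the CMU dictionary.
CMU_SUBSET = {
    "it":     ["IH1", "T"],
    "was":    ["W", "AA1", "Z"],
    "early":  ["ER1", "L", "IY0"],
    "spring": ["S", "P", "R", "IH1", "NG"],
}

def text_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        if word in CMU_SUBSET:
            phonemes.extend(CMU_SUBSET[word] + ["."])   # "." marks a word boundary, as above
        else:
            raise KeyError(f"'{word}' not in dictionary, fall back to a Seq2Seq G2P model")
    return phonemes

print(text_to_phonemes("It was early spring"))
# ['IH1', 'T', '.', 'W', 'AA1', 'Z', '.', 'ER1', 'L', 'IY0', '.', 'S', 'P', 'R', 'IH1', 'NG', '.']
```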
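The segmentation model is trained with CTC. Below is a minimal sketch of how CTC loss is wired up in PyTorch; the shapes and the phoneme-pair vocabulary size are illustrative, and the real model stacks convolutional and recurrent layers to produce the per-frame log-probabilities:

```python
import torch
import torch.nn as nn

num_classes = 40 * 40 + 1          # phoneme-pair vocabulary + CTC blank (illustrative size)
ctc = nn.CTCLoss(blank=0)

# Per-frame class log-probabilities from the acoustic network: (time, batch, classes)
log_probs = torch.randn(200, 4, num_classes, requires_grad=True).log_softmax(dim=-1)

# Target phoneme-pair indices per utterance: (batch, max_label_len)
targets = torch.randint(1, num_classes, (4, 30), dtype=torch.long)
input_lengths = torch.full((4,), 200, dtype=torch.long)   # frames per utterance
target_lengths = torch.full((4,), 30, dtype=torch.long)   # labels per utterance

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the acoustic network
```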
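Finally, a toy sketch of a duration/F0-style model that reads phoneme IDs and emits a duration, a voiced probability, and an F0 value for each phoneme. The architecture here (embedding + bidirectional GRU + linear head) is my own stand-in, not the one from the paper:

```python
import torch
import torch.nn as nn

class DurationF0Model(nn.Module):
    """Toy per-phoneme regressor: duration, voiced probability and F0 for each input phoneme."""
    def __init__(self, num_phonemes=70, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 3)    # [duration, voiced logit, F0] per phoneme

    def forward(self, phoneme_ids):             # (batch, phonemes_in_utterance)
        x, _ = self.rnn(self.embed(phoneme_ids))
        out = self.head(x)
        duration = torch.relu(out[..., 0])       # seconds, non-negative
        voiced_p = torch.sigmoid(out[..., 1])    # probability the phoneme is voiced
        f0 = torch.relu(out[..., 2])             # Hz, non-negative
        return duration, voiced_p, f0

model = DurationF0Model()
dur, p_voiced, f0 = model(torch.randint(0, 70, (2, 17)))   # 17 phonemes per utterance
```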