Tech Beats-6

machine-learning, TTS, open-source, XTTS

👉 Subscribe to my Substack to get the latest news and articles.

Dear friends,

My main highlight for this week is the release of our streaming XTTS model. Our most impressive TTS model became the fastest, too. This update allows XTTS to produce speech with a latency of just 0.2 seconds. The release generated huge interest, resulting in a significant increase in our GitHub stars and keeping us on GitHub's trending list for three consecutive days.

Link to Discord channel

I have recently published a comprehensive review of alternative models to Transformers. If you find it intriguing, give it a look and share your thoughts with me.

Let’s dive in…

Bookmarks #

Raspberry Pi 5 is out 🔗 Blog

Global Internet freedom is in decline with the use of AI by governments 🔗 Report

Novo Nordisk, Europe’s most valuable company, and its fight against obesity 🔗 Video

Stable LM 3B introduced by Stability AI for running on smart devices 🔗 Blog

Evaluating LLMs is a minefield, a talk about LLM evaluation 🔗 Slides

BTLM-3B-8K: 7B Performance in a 3 Billion Parameter Model 🔗 Blog

Linux Foundation to Fork HashiCorp Terraform into ‘OpenTofu’ 🔗 Blog

Papers #

Flamingo: a Visual Language Model for Few-Shot Learning #

📎 paper 👩‍💻 code


Flamingo is a visual language model that takes an image (or video) and text pair as input and outputs text. You can prompt the model with an image and then ask questions about it, and the model answers accordingly.

Flamingo bridges a pre-trained vision-only model and a pre-trained language-only model using the Perceiver Resampler module. The Perceiver Resampler receives features from the Vision Encoder and outputs a fixed number of visual tokens. These visual tokens are then used to condition the frozen LM through freshly initialized cross-attention layers that are interleaved between the pre-trained LM layers. These new layers offer an expressive way for the LM to incorporate visual information for the next-token prediction task.

The Perceiver Resampler learns a pre-defined number of latent query tokens. Compressing the visual features into this fixed set of tokens helps the model discard redundant information that might otherwise hurt its performance. The resampler is based on the same group’s earlier Perceiver paper.
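To make the "fixed number of visual tokens" idea concrete, here is a minimal pure-Python sketch of a single cross-attention step in which a fixed set of latent queries attends over a variable number of visual features. It is not Flamingo's implementation: the dimensions, the random initialization, and the `resample` helper are assumptions for illustration, and a real resampler has learned projections, multiple heads, and several layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(latents, features):
    """Single-head cross-attention: each latent query attends over all features.

    Returns one output vector per latent, so the output length depends only
    on the number of latents, not on the number of input features.
    """
    dim = len(latents[0])
    outputs = []
    for query in latents:
        scores = [
            sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
dim, num_latents = 8, 4  # illustrative sizes, not the paper's values

# Stand-ins for the learned latent queries and for vision-encoder features.
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]
short_clip = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
long_clip = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

tokens_short = resample(latents, short_clip)
tokens_long = resample(latents, long_clip)
print(len(tokens_short), len(tokens_long))  # 4 4
```

Whether the vision encoder emits 10 or 50 feature vectors, the language model always sees the same number of visual tokens.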

Flamingo models can rapidly adapt to various image and video tasks, including open-ended tasks such as visual question-answering, captioning tasks, and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.

My 2 cents: Finding the optimal approach to pass conditional information to a model is difficult. There is a delicate balance: providing too much information can lead to model instability and overfitting, while providing too little diminishes performance. There are different methods to help with this issue, such as bottleneck layers, discretization, etc. The Perceiver combines discretization with attention in a unique way.

Vision Transformers Need Registers #

🔗 Paper


This paper studies the impact of redundant tokens in vision transformers, which can reduce model performance and create artifacts in the feature maps. To address this problem, the authors propose a set of register tokens that are appended to the input sequence and then discarded at the model output.

In this scenario, the issue of redundant tokens arises from the softmax in the attention layers. Softmax forces the attention weights to sum to 1, so the model has to place attention mass somewhere even when no token is informative. Consequently, the model assigns increasingly high values to these extraneous tokens as training progresses. To tackle this problem, the paper introduces randomly initialized learnable tokens, called registers, that can absorb these residual values.
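The softmax argument above can be illustrated with a few made-up numbers. The sketch below is not the paper's code: the attention scores and the register score are assumed values chosen only to show how an extra slot can absorb attention mass that would otherwise land on patch tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores of one query over four patch tokens,
# none of which is particularly relevant to the query.
patch_scores = [0.1, 0.0, -0.1, 0.05]

without_register = softmax(patch_scores)
# The weights must sum to 1, so all of the mass lands on patch tokens.
mass_on_patches_before = sum(without_register)

# Append one register slot with an illustrative (assumed) higher score.
register_score = 3.0
with_register = softmax(patch_scores + [register_score])
# The register is discarded at the output, so only the patch share matters.
mass_on_patches_after = sum(with_register[:-1])

print(round(mass_on_patches_before, 3), round(mass_on_patches_after, 3))
```

With the register present, most of the attention mass moves off the patch tokens, which is the role the text describes for registers.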

Streaming Language Models with Attention Sinks #

🔗 Paper 👩‍💻 Code

This paper shows that pre-trained language models with finite attention windows can generate up to 4 million tokens using the proposed attention sink tokens. The authors observe that language models consistently attend heavily to the first few tokens in a sequence. When window attention is used, the absence of these tokens from the attention window significantly degrades performance.


They propose a set of learnable attention sink tokens, similar to the registers. By consistently attending to these tokens during windowed attention, the model avoids this instability. This allows it to benefit from windowed attention efficiently and handle up to 4 million tokens with up to a 22.2x speedup during inference.

They experiment with Llama 2, Pythia, and Falcon. The authors report that plain window attention increases model perplexity. However, once attention sinks are introduced, perplexity drops significantly, which makes window attention effective. They suggest that the models work best with 4 attention sinks.
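As a rough sketch of the idea, the cache policy can be written as "always keep the sink positions, plus a rolling window of recent tokens". The function names and the window size below are hypothetical, and the sketch assumes the sinks occupy the first positions of the sequence. It only tracks which positions stay in the cache and does not model the attention computation itself.

```python
def window_only_positions(seq_len, window=8):
    """Plain window attention: keep only the most recent `window` positions."""
    return list(range(max(0, seq_len - window), seq_len))


def sink_positions(seq_len, num_sinks=4, window=8):
    """Keep the first `num_sinks` positions plus the most recent `window` ones."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))  # everything still fits in the cache
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# After 20 tokens, plain windowing has evicted the initial tokens...
print(window_only_positions(20))  # [12, 13, 14, 15, 16, 17, 18, 19]
# ...while the sink policy keeps them alongside the recent window.
print(sink_positions(20))  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache size stays bounded at `num_sinks + window` entries no matter how long generation runs, which is what makes very long streams feasible.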

There are more experiments done by Tom Aarsen. You can check his repo, which is more in sync with the pre-trained Hugging Face models.

UniAudio #

🔗 Paper 👩‍💻 Code


UniAudio is an audio foundation model that is trained on multiple audio tasks such as TTS, voice conversion (VC), singing voice synthesis, speech enhancement, speech extraction, text-to-sound, text-to-music, speech edit, audio edit, instructed TTS, and speech dereverberation.

They use different models to tokenize audio and text inputs. They employ a model similar to MegaByte, which I covered in an earlier post. Then, they introduce task ID tokens that pre-condition the model to perform a specific task. The input format changes with the target task. For instance, for TTS the input sequence looks like <task_id>, <phoneme_sequence>, <speech_prompt>.
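The TTS input layout described above can be sketched as simple list concatenation. Everything concrete here is hypothetical: the `TASK_IDS` table, the token values, and the `build_tts_sequence` helper are made up for illustration. In the real model, each segment comes from its own tokenizer.

```python
# Hypothetical task ID table; the real model defines its own task tokens.
TASK_IDS = {"tts": 0, "vc": 1}


def build_tts_sequence(task_id, phoneme_tokens, speech_prompt_tokens):
    """Lay out <task_id>, <phoneme_sequence>, <speech_prompt> as one flat sequence."""
    return [task_id] + list(phoneme_tokens) + list(speech_prompt_tokens)


# Made-up token IDs standing in for tokenizer outputs.
phonemes = [101, 102, 103]
speech_prompt = [900, 901]

sequence = build_tts_sequence(TASK_IDS["tts"], phonemes, speech_prompt)
print(sequence)  # [0, 101, 102, 103, 900, 901]
```

Other tasks would swap in a different task ID and a different set of conditioning segments, while the model itself stays the same.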

The experiments suggest that training the model with multiple tasks helps the model transfer knowledge between tasks and improves performance in each task.

My 2 cents: UniAudio leaves something to be desired in terms of quality, possibly due to the limited dataset size. Currently, the development of audio foundation models, comparable to what Llama 2 is for text, is an area that is still wide open. However, creating such models is challenging because audio data is more license-restricted and harder to come by.

Miipher #

🔗 Paper 👩‍💻 Code

Miipher is a speech restoration model that increases the amount of high-quality training data for speech generation tasks. Miipher differs from other speech restoration models because it uses a robust parametric re-synthesis framework and integrates self-supervised speech and text representations to improve the quality of restored speech samples. Additionally, Miipher is designed to be robust against various kinds of audio degradation, including phoneme masking and deletion, which are difficult to handle with traditional speech restoration models.


To restore speech samples in the wild, Miipher uses a speech representation extracted from w2v-BERT as the input feature and a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. These features are passed through a Conformer model that predicts clean features. A WaveFit model then converts the features to a waveform.

The ablation studies show that transcripts are as necessary as the audio representations. The authors also report improved quality for TTS systems trained on datasets de-noised by Miipher.

Open-Source #

TorchMultimodal #

👩‍💻 Code

Meta released a repository for training multimodal models. It comes with building blocks, fusion layers, loss functions, datasets and utilities. It is currently in beta.

AutoGen #

👩‍💻 Code

AutoGen is a framework from Microsoft that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.