MiMo-Audio - Scaling Speech Pre-Training to 100 Million Hours Unlocks Few-Shot Learning

machine-learning, speech-synthesis, audio, xiaomi, ai, research, text-to-speech, transformer, scaling

MiMo-Audio is Xiaomi’s 7B-parameter model that processes speech and text through a unified architecture. Trained on over 100 million hours of audio data (10x more than existing open-source models), it exhibits emergent capabilities such as voice conversion, speech translation, and cross-modal reasoning through few-shot learning, demonstrating speech scaling laws similar to those of text language models.

TL;DR:

Emergent Abilities at Scale #

Evidence for “Phase Transition”:

Emergent Capabilities (not explicitly trained):

Core Contribution: Proving that text scaling paradigms also work for speech.

Architecture Deep Dive #

MiMo-Audio-Tokenizer (1.2B Parameters) #

Core Specs:

Two-Stage Training:

Approach: From-scratch training at massive scale vs. building on existing semantic models

Solving Semantic vs Acoustic Tokens Conflict #

The Core Trade-off:

MiMo-Audio’s Approach:

No ablation in the paper; the improvements might simply be the result of the larger-scale training.
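
As background, residual vector quantization (RVQ) is the standard way a single tokenizer can serve both roles: each quantizer layer encodes the residual left by the previous one, so early layers capture coarse, more semantic structure while later layers capture fine acoustic detail. The sketch below is illustrative only; the codebook sizes and frame dimensions are assumptions, not MiMo-Audio's actual configuration.

import torch

def rvq_encode(frames, codebooks):
    # Each layer quantizes what the previous layers left unexplained.
    residual = frames                      # (T, dim) encoder frames
    codes = []
    for cb in codebooks:                   # cb: (num_codes, dim)
        dists = torch.cdist(residual, cb)  # (T, num_codes) distances
        idx = dists.argmin(dim=-1)         # nearest code id per frame
        codes.append(idx)
        residual = residual - cb[idx]      # pass leftover detail onward
    return torch.stack(codes)              # (num_layers, T) token ids

# Assumed setup: 8 RVQ layers (matching the 8 RVQ loss weights in the
# training config below), 1024 codes each, 512-dim frames at 25 Hz.
codebooks = [torch.randn(1024, 512) for _ in range(8)]
frames = torch.randn(50, 512)              # 2 seconds at 25 Hz
tokens = rvq_encode(frames, codebooks)     # (8, 50): 200 tokens/sec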

Audio Language Model #

The Challenge: Audio tokenization yields ~200 tokens/sec (25Hz frames × 8 RVQ layers) versus roughly 4 words/sec for text, so raw audio sequences are far longer

Solution: Patching audio tokens (grouping 4 consecutive frames, 25Hz → 6.25Hz) before the LLM, as sketched below
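
A minimal sketch of the patching step, assuming the grouping factor of 4 implied by 25Hz → 6.25Hz; the real model presumably uses a learned patch encoder, and plain concatenation is used here just to show the shapes:

import torch

def patch_audio_tokens(embeddings, patch_size=4):
    # Group consecutive frame embeddings so the LLM sees 6.25 Hz
    # patches instead of 25 Hz frames. Shapes are illustrative.
    batch, seq_len, dim = embeddings.shape
    assert seq_len % patch_size == 0, "pad to a multiple of patch_size"
    # (B, T, D) -> (B, T/4, 4*D): each patch concatenates 4 frames
    return embeddings.reshape(batch, seq_len // patch_size, patch_size * dim)

frames = torch.randn(1, 100, 512)      # 4 s of audio frames at 25 Hz
patches = patch_audio_tokens(frames)   # (1, 25, 2048): 6.25 Hz input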

Three-Component Architecture:

Result: Efficient cross-modal transfer while maintaining fine-grained audio generation
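
The three components are named in the training configuration below (patch_encoder, llm, patch_decoder). Here is a hedged sketch of how they could be wired together; all dimensions are illustrative, the tiny Transformer stands in for the 7B LLM, and this is not the paper's exact implementation:

import torch.nn as nn

class AudioLMSketch(nn.Module):
    # patch_encoder compresses 25 Hz frames into 6.25 Hz patches, the
    # LLM models the patch sequence, and patch_decoder expands each
    # hidden state back into per-frame RVQ logits (no causal mask shown).
    def __init__(self, dim=256, llm_dim=1024, patch=4, n_rvq=8, codes=1024):
        super().__init__()
        self.patch_encoder = nn.Linear(patch * dim, llm_dim)
        self.llm = nn.TransformerEncoder(  # stand-in for the 7B LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.patch_decoder = nn.Linear(llm_dim, patch * n_rvq * codes)
        self.patch, self.n_rvq, self.codes = patch, n_rvq, codes

    def forward(self, frames):             # (B, T, dim), T divisible by 4
        B, T, D = frames.shape
        patches = frames.reshape(B, T // self.patch, self.patch * D)
        hidden = self.llm(self.patch_encoder(patches))  # 6.25 Hz inside
        logits = self.patch_decoder(hidden)             # back to 25 Hz
        return logits.reshape(B, T, self.n_rvq, self.codes)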

Implementation Insights #

Training Strategy #

Two-Stage Progressive Approach:

# Stage 1: Understanding Only
loss_weights = [1, 0, 0, 0, 0, 0, 0, 0, 0]  # Text channel only; 8 RVQ channels off
learning_rates = {'patch_encoder': 2e-4, 'llm': 3e-5}

# Stage 2: Understanding + Generation
loss_weights = [100, 12, 8, 6, 4, 2, 2, 1, 1]  # Text + 8 RVQ layers, coarse to fine
learning_rates = {'patch_encoder': 2e-4, 'llm': 3e-5, 'patch_decoder': 2e-4}
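
A hedged reading of how these per-channel weights could enter the loss: weighted cross-entropy summed over the text channel plus the eight RVQ channels, with the text loss dominating (weight 100) and RVQ weights decaying from coarse to fine layers. The normalization here is an assumption; the paper's exact formulation may differ:

import torch
import torch.nn.functional as F

def multichannel_loss(logits, targets, weights=(100, 12, 8, 6, 4, 2, 2, 1, 1)):
    # logits: list of 9 tensors (N, vocab_c); targets: list of 9 (N,)
    # tensors -- one text channel plus 8 RVQ channels per audio frame.
    w = torch.tensor(weights, dtype=torch.float)
    per_channel = torch.stack(
        [F.cross_entropy(l, t) for l, t in zip(logits, targets)])
    # Stage 1 uses weights (1, 0, ..., 0), reducing this to text-only loss.
    return (w * per_channel).sum() / w.sum()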

Key Architecture Specs:

Shared embedding tables between encoder/decoder for efficiency
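
Assuming "shared" means standard weight tying, one table per RVQ codebook can serve both the encoder-side lookup and the decoder-side output projection. The exact sharing scheme is not specified, so this is a minimal sketch:

import torch.nn as nn

class SharedCodebookEmbedding(nn.Module):
    # One table per RVQ layer, reused on both sides via weight tying:
    # halves the embedding parameters and keeps input/output spaces aligned.
    def __init__(self, n_codes=1024, dim=512):
        super().__init__()
        self.table = nn.Embedding(n_codes, dim)

    def embed(self, token_ids):        # encoder side: ids -> vectors
        return self.table(token_ids)

    def logits(self, hidden):          # decoder side: vectors -> scores
        return hidden @ self.table.weight.T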

Training at Unprecedented Scale #

Scale Specifications #

Two-Stage Progressive Training #

Stage 1 - Understanding (2.6T tokens):

Stage 2 - Understanding + Generation (5T tokens):

They use an internal TTS model to generate training data for spoken dialogue

Performance Analysis #

Speech Intelligence (SpeechMMLU) #

Key Finding: Consistent reasoning across text/speech modalities

Audio Understanding (MMAU) #

Few-Shot Learning Evidence #

Limitation: Heavy reliance on automatic metrics; perceptual quality gaps unclear

Resources #