Higgs Audio V2 - Unified Audio Language Modeling at Scale

machine-learning, research, audio, text-to-speech, language-models, multimodal

Speech synthesis has evolved rapidly, but most systems still struggle with natural expressiveness and multi-modal understanding. Higgs Audio V2 addresses these limitations through a unified architecture that treats audio as a language, enabling emergent capabilities like multi-speaker dialogues and prosody adaptation without explicit training.

The Challenge #

Traditional text-to-speech systems split acoustic modeling and vocoding into separate pipeline stages. Neural approaches improved quality but remained constrained by fixed architectures and limited expressiveness. Recent large language models showed promise for audio generation but faced two critical issues:

  1. Representation bottleneck - Existing audio tokenizers fail to capture both semantic content and acoustic nuance efficiently
  2. Architectural mismatch - LLMs optimized for text struggle with the dense, multi-dimensional nature of audio tokens

Higgs Audio V2 tackles both challenges through principled design choices backed by extensive evaluation.

Architecture Overview #

The model builds on Llama-3.2-3B with three key innovations:

1. Efficient Audio Tokenization #

The tokenizer is designed to be more efficient than conventional audio codecs:

Low frame rate: At 25 frames per second, the tokenizer runs at roughly half the token rate of comparable codecs while preserving quality. This comes from heavy downsampling (320x, with stride ratios [8,5,4,2]) combined with mixed semantic-acoustic features.

Unified training: Unlike specialized tokenizers, this one handles speech, music, and environmental sounds in a single 24kHz system. Training combines reconstruction loss with semantic knowledge from HuBERT.
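
Schematically, the tokenizer's objective can be written as a weighted sum of these two signals; the weighting term λ is an assumption here, since the exact loss formulation is not spelled out:

L_total = L_reconstruction + λ × L_semantic

where L_reconstruction scores how closely the decoded waveform matches the input and L_semantic pulls the tokenizer's latent features toward those of the HuBERT teacher.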

Smart quantization: Residual vector quantization with 12 codebooks of 1,024 entries each yields 2^120 possible code combinations per frame while keeping the bitrate around 3 kbps.

2. DualFFN Architecture #

Unlike typical multimodal models that bolt on separate towers or adapters, DualFFN modifies selected layers in place rather than widening the model:

Parallel processing: Text and audio tokens share the attention layers but pass through separate feed-forward networks. The model routes each token by its ID range (text: below 128,000; audio: 128,000 and above).
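
For concreteness, here is a minimal sketch of how such ID-based routing could be derived from the input IDs; the boundary constant and function name are illustrative rather than taken from the released code:

import torch

AUDIO_TOKEN_OFFSET = 128000  # assumed boundary between text and audio token IDs

def build_audio_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # input_ids: (batch, seq_len) token IDs; True marks audio positions
    return input_ids >= AUDIO_TOKEN_OFFSET

The resulting boolean mask is what the DualFFN layer (shown later) uses to pick the audio or text feed-forward path for each position.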

Efficiency: Because the dual path replaces the FFN rather than stacking extra layers, training throughput stays at 91% of the original speed. Only selected layers get the dual-path upgrade.

Mixed training: The model keeps text weights frozen while training new audio weights. This preserves text performance while adding audio capabilities.

3. Streaming Generation #

The delay pattern solves a key problem: how to generate multiple codebook tokens at once while keeping the model causal.

Time offsets: Instead of emitting all 12 codebooks for the same frame at once, the model delays codebook k by k steps, so at decoding step t it emits the token for frame t-k of codebook k. This preserves causality across codebooks while still letting every codebook be generated in parallel at each step.

Real-time streaming: The offset structure lets applications start playing audio before the full sequence is done, which is important for low-latency use cases.
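
The sketch below illustrates the standard delay-pattern construction on a grid of codebook tokens; the function name and padding token are illustrative, and the released implementation may differ in details:

import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    # codes: (num_codebooks, T) grid of RVQ tokens for T frames
    num_codebooks, T = codes.shape
    delayed = torch.full((num_codebooks, T + num_codebooks - 1), pad_id, dtype=codes.dtype)
    for k in range(num_codebooks):
        # Codebook k is shifted right by k steps, so decoding step t carries frame t-k
        delayed[k, k:k + T] = codes[k]
    return delayed

Undoing the shift after generation recovers the aligned (num_codebooks, T) grid that the audio decoder consumes.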

4. Advanced Special Token System #

Higgs Audio V2 introduces a rich vocabulary of special tokens for fine-grained audio control:

Environmental Context:

<|scene_desc_start|>Audio is recorded from a quiet room.<|scene_desc_end|>

Sound Effects and Audio Events:

Multi-speaker Support:

This token system enables contextual audio generation where the model adapts prosody, background audio, and speaker characteristics based on semantic cues.
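
As a concrete illustration, the snippet below combines a scene description with a speaker-tagged transcript. The scene tokens match the example above; the [SPEAKER0]/[SPEAKER1] tags follow the repository's multi-speaker examples, but treat the exact transcript format here as an assumption:

# Hedged example: scene description plus a speaker-tagged dialogue transcript
system_prompt = (
    "Generate audio following instruction.\n\n"
    "<|scene_desc_start|>\n"
    "Audio is recorded from a quiet room.\n"
    "<|scene_desc_end|>"
)
transcript = (
    "[SPEAKER0] Did you hear the announcement this morning?\n"
    "[SPEAKER1] I did, and I still can't quite believe it."
)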

Technical Deep Dive #

Audio Tokenization Process #

The tokenizer follows this pipeline:

  1. Audio preprocessing - 24 kHz input → mel spectrograms
  2. Encoder - CNNs with ratios [8,5,4,2] → 320x downsampling
  3. Quantization - RVQ with 12 codebooks → discrete tokens
  4. Decoder - Transpose convolutions → reconstructed audio

Mathematical formulation:

Given an audio signal x sampled at fs = 24,000 Hz:
Frame rate: fr = fs / M = 24000 / 960 = 25 frames per second
Bitrate: fr × Nq × log2(Ncb) = 25 × 12 × 10 = 3000 bits per second

where M is the hop size (960 samples), Nq is the number of codebooks (12), and Ncb is the codebook size (1024).
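
The arithmetic can be checked in a couple of lines (a quick sanity check, not part of the released code):

import math

fs, M = 24000, 960       # sample rate (Hz) and hop size (samples)
Nq, Ncb = 12, 1024       # number of codebooks and entries per codebook
fr = fs / M              # 25.0 frames per second
bitrate = fr * Nq * math.log2(Ncb)
print(fr, bitrate)       # 25.0 3000.0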

DualFFN Implementation Details #

The DualFFN layer replaces standard FFN in selected transformer layers:

import torch
import torch.nn as nn

class DualFFN(nn.Module):
    def __init__(self, text_ffn: nn.Module, audio_ffn: nn.Module):
        super().__init__()
        self.text_ffn = text_ffn    # frozen original text FFN
        self.audio_ffn = audio_ffn  # newly initialized audio FFN (e.g. a LlamaMLP-style module)

    def forward(self, hidden_states: torch.Tensor, audio_mask: torch.Tensor) -> torch.Tensor:
        # audio_mask: (batch, seq_len) boolean tensor marking audio-token positions
        text_output = self.text_ffn(hidden_states)
        audio_output = self.audio_ffn(hidden_states)
        # Broadcast the mask over the hidden dimension and pick the matching path per token
        return torch.where(audio_mask.unsqueeze(-1), audio_output, text_output)
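
Continuing the snippet above, a toy forward pass shows how the routing behaves; the hidden size, FFN shapes, and vocabulary boundary are illustrative, not the released configuration:

# Toy usage of the DualFFN layer defined above
hidden = 64
text_ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
audio_ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden))
for p in text_ffn.parameters():
    p.requires_grad = False                    # the text path stays frozen

layer = DualFFN(text_ffn=text_ffn, audio_ffn=audio_ffn)

hidden_states = torch.randn(2, 10, hidden)     # (batch, seq_len, hidden)
input_ids = torch.randint(0, 130000, (2, 10))  # mixed text/audio token IDs
output = layer(hidden_states, audio_mask=input_ids >= 128000)
print(output.shape)                            # torch.Size([2, 10, 64])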

Training Methodology #

Data: The AudioVerse dataset, 10 million hours of automatically annotated audio, processed through a multi-stage annotation and filtering pipeline.

Shared Audio Architecture #

Inspecting the released code shows that the audio understanding model (used for data filtering) and the semantic encoder (used for voice cloning) share the same architecture:

Shared Architecture (32-layer Whisper encoder):

Divergent Applications:

This is pragmatic engineering: the same proven encoder does two jobs:

Training: Raw Audio → Understanding Model → Quality Score → AudioVerse Dataset
Inference: Reference Audio → Semantic Encoder → Speaker Embedding → Voice Cloning

Instead of building separate encoders, they reuse the same 32-layer Whisper base for both understanding and voice cloning.
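
A minimal sketch of how a single Whisper-style encoder can serve both roles; the checkpoint, pooling, and usage below are illustrative assumptions, not the released weights or pipeline:

import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder  # 32 encoder layers

def encode(waveform: np.ndarray, sampling_rate: int = 16000) -> torch.Tensor:
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_features).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1)  # pooled embedding

embedding = encode(np.zeros(16000, dtype=np.float32))  # 1 second of silence as a placeholder
# Training-time role: feed embeddings like this into a quality/annotation head for data filtering.
# Inference-time role: use the embedding of a reference clip as speaker/semantic conditioning.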

Training stages:

  1. Tokenizer pretraining - Reconstruction loss on diverse audio
  2. LLM adaptation - Frozen text weights, trainable audio components
  3. Joint fine-tuning - End-to-end optimization with delay pattern

Partial freezing strategy: The original text embeddings and output projections remain frozen for vocabulary indices below 128,000, while the new audio tokens (128,000 and above) are fully trainable. This preserves text capabilities while enabling audio generation.
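
One way to realize this partial freezing is a gradient hook on the embedding weights; the sketch below uses illustrative sizes, with only the boundary index taken from the description above:

import torch
import torch.nn as nn

AUDIO_TOKEN_OFFSET = 128000   # rows below this stay frozen; rows at or above it are trainable
embed = nn.Embedding(num_embeddings=130000, embedding_dim=64)  # illustrative sizes

def zero_text_grads(grad: torch.Tensor) -> torch.Tensor:
    grad = grad.clone()
    grad[:AUDIO_TOKEN_OFFSET] = 0  # no updates for the original text vocabulary
    return grad

embed.weight.register_hook(zero_text_grads)  # only the audio rows receive gradient updates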

Evaluation Framework and Results #

Evaluation Results #

The evaluation covers both standard TTS metrics and new capabilities:

Standard TTS Metrics:

New Capabilities:

DualFFN Validation #

Ablation tests of DualFFN on a smaller Llama-3.2-1B backbone show:

The improvements work across languages, suggesting DualFFN learns general audio-text patterns rather than language-specific tricks.

Emergent Capabilities #

Beyond traditional TTS metrics, Higgs Audio V2 demonstrates several emergent behaviors:

1. Zero-Shot Voice Cloning #

The model performs voice cloning through audio understanding and cross-modal attention:

Technical Implementation:

# Voice cloning example (imports follow the boson_multimodal package used by the repo)
from boson_multimodal.data_types import Message, AudioContent

reference_audio_path = "voice_prompt.wav"  # short clip of the voice to clone
system_prompt = "Generate audio following instruction."

messages = [
    Message(role="system", content=system_prompt),
    Message(role="user", content=[
        AudioContent(audio=reference_audio_path),  # reference voice
        "Generate this text in the reference voice: Hello world",
    ]),
]

2. Automatic Voice Assignment #

Given multi-speaker dialogue without voice specifications, the model assigns appropriate, consistent voices based on context and content.

3. Prosody Adaptation #

During narrative passages, the model automatically adjusts rhythm, pace, and emphasis based on content semantics.

4. Cross-modal Generation #

Simultaneous speech and background music generation, with appropriate volume balancing and harmonic compatibility. The model uses contextual cues from scene descriptions to generate appropriate background audio:

# Example: Automatic background music generation
system_prompt = (
    "Generate audio following instruction.\n\n"
    "<|scene_desc_start|>\n"
    "Audio is recorded in a cozy café with soft jazz playing.\n"
    "<|scene_desc_end|>"
)
# Model automatically generates speech with café ambiance

5. Multilingual Consistency #

Voice characteristics maintained across language switches within the same speaker.

These capabilities emerge from the unified training on diverse audio-text pairs rather than explicit programming.

Implementation Considerations #

Memory and Compute #

Performance Optimizations #

CUDA Graph Acceleration:

Repetition-Aware Sampling (RAS):

Integration Patterns #

Three usage modes:

  1. Direct API - Python interface for batch processing
  2. Streaming server - HTTP/WebSocket for real-time applications
  3. vLLM integration - High-throughput serving with OpenAI-compatible API

vLLM Production Serving:

# Launch vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model bosonai/higgs-audio-v2-generation-3B-base \
    --dtype bfloat16 \
    --api-key your-api-key

# Basic usage via the Python serve engine
from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine
from boson_multimodal.data_types import ChatMLSample

serve_engine = HiggsAudioServeEngine(
    model_path="bosonai/higgs-audio-v2-generation-3B-base",
    tokenizer_path="bosonai/higgs-audio-v2-tokenizer"
)

output = serve_engine.generate(
    chat_ml_sample=ChatMLSample(messages=messages),
    max_new_tokens=1024,
    temperature=0.3
)
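
Assuming the returned object exposes the raw samples and sampling rate the way the repository's examples suggest, the result can be written straight to disk; treat the attribute names as assumptions:

import torch
import torchaudio

# Assumes output.audio is a NumPy array of samples and output.sampling_rate is an int
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)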

Limitations and Future Work #

Current Limitations #

Research Directions #

Conclusion #

Higgs Audio V2 demonstrates that treating audio as a first-class language modality enables remarkable expressiveness and emergent capabilities. The combination of efficient tokenization, architectural adaptation, and large-scale pretraining creates a foundation for next-generation audio applications.

The unified approach breaks down traditional boundaries between speech synthesis, audio generation, and language modeling. As the field moves toward more integrated AI systems, this architectural pattern provides a template for incorporating additional modalities while preserving the powerful generative capabilities of large language models.

The open-source release enables broader research and application development, accelerating progress toward more natural and expressive AI communication systems.


Code and models available at: https://github.com/boson-ai/higgs-audio