A Simple Way to Switch Audio Codecs in Audio Language Models

machine-learning, research, audio, text-to-speech

When working with language models for audio tasks, switching between different audio representations can be challenging. Let’s say you have a text-to-speech model using a particular audio codec, and a new codec comes along with better quality. The obvious approach would be to retrain your model with the tokens from this new codec. However, language models are typically large and training them from scratch takes a lot of time and resources. So we need a smarter way to transfer our existing model’s knowledge to work with the new codec’s vocabulary.

Here’s a simple approach that has worked well for me:

First, take both your old and new audio codecs. For each audio sample in your dataset, generate tokens using both codecs. Then count how often tokens from each codec appear together in the same files. This gives us a co-occurrence matrix that shows which tokens from the old and new codecs tend to show up together. The assumption here is that tokens that frequently co-occur likely represent similar audio content.

Using this information, we can create a new embedding layer for our model in one of two ways:

  1. Simple approach: For each token in the new codec, use the embedding vector of its most frequently co-occurring token from the old codec
  2. Advanced approach: Normalize the co-occurrence matrix so each row (representing a new codec token) sums to 1. Then use these values as weights to create a weighted average of the old embedding vectors for each new token.

Once you have this new embedding layer, simply swap it in place of your old one and update your model to use the new audio codec. You can then continue training with the new codec’s tokens.

I’ve tried this for audio models but I believe it can also be applied to the other domains. I think the idea holds.

Below you can find a simple implementation of the proposed algo.

# First let's add some code to demonstrate the concept
import numpy as np
from collections import defaultdict

def build_codec_mapping(dataset, old_codec, new_codec):
    """
    Build co-occurrence matrix between old and new codec tokens
    """
    # Initialize co-occurrence matrix
    cooccurrence = defaultdict(lambda: defaultdict(int))

    # Process each audio sample
    for audio in dataset:
        # Get tokens from both codecs
        old_tokens = old_codec.encode(audio)
        new_tokens = new_codec.encode(audio)

        # Count co-occurrences
        for old_tok in old_tokens:
            for new_tok in new_tokens:
                cooccurrence[new_tok][old_tok] += 1

    return cooccurrence

def create_new_embeddings(cooccurrence, old_embedding_layer):
    """
    Create new embedding layer using co-occurrence statistics
    """
    new_vocab_size = len(cooccurrence)
    embedding_dim = old_embedding_layer.weight.shape[1]

    # Initialize new embedding matrix
    new_embeddings = np.zeros((new_vocab_size, embedding_dim))

    # For each token in new vocabulary
    for new_tok_idx, old_tok_counts in cooccurrence.items():
        # Normalize counts to get weights
        total = sum(old_tok_counts.values())
        weights = {k: v/total for k,v in old_tok_counts.items()}

        # Compute weighted average of old embeddings
        weighted_sum = np.zeros(embedding_dim)
        for old_tok_idx, weight in weights.items():
            weighted_sum += weight * old_embedding_layer.weight[old_tok_idx]

        new_embeddings[new_tok_idx] = weighted_sum

    return new_embeddings

def transfer_model_to_new_codec(model, dataset, old_codec, new_codec):
    """
    Main function to transfer model to new codec
    """
    # Build mapping between old and new codec tokens
    cooccurrence = build_codec_mapping(dataset, old_codec, new_codec)

    # Create new embedding layer
    new_embeddings = create_new_embeddings(cooccurrence, model.embedding_layer)

    # Replace old embedding layer
    model.embedding_layer.weight.data = new_embeddings

    # Update model to use new codec
    model.codec = new_codec

    return model

# Example usage:
"""
old_codec = AudioCodec(...)  # Your original codec
new_codec = AudioCodec(...)  # New codec you want to use
dataset = [...]  # Your audio dataset

model = transfer_model_to_new_codec(
    model=your_model,
    dataset=dataset,
    old_codec=old_codec,
    new_codec=new_codec
)

# Now you can continue training with the new codec
"""