AudioLDM & MusicGen for Voice-to-Music Generation

Welcome! This post summarizes our work on generating instrumental music from vocal input, a natural and intuitive interface for musical expression. We extend two prominent models, AudioLDM and MusicGen (AudioCraft), to perform voice-to-music generation with optional multi-modal conditioning.

Introduction

Audio synthesis has recently seen breakthroughs via latent diffusion models and autoregressive transformers. Among these, AudioLDM and MusicGen are state-of-the-art frameworks in the realm of controllable sound generation.

However, despite these advances, using the human voice as a control modality remains underexplored. Voice—through singing or humming—provides a natural way for non-musicians to guide music generation.

In this project:

  • We extend AudioLDM to support audio-based conditioning (voice input).
  • We fine-tune MusicGen to generate music from melody and genre prompts.
  • We build a custom vocal-instrumental dataset with 416 paired tracks across 42 artists and multiple languages.
  • We evaluate both models and show that multi-modal conditioning leads to better generation quality.

Background

AudioLDM

AudioLDM is a latent diffusion model designed for text-to-audio generation. It performs diffusion in a compressed latent space of Mel spectrograms and includes:

  • A VAE for encoding and decoding spectrograms
  • The CLAP encoder for extracting semantic information from text/audio
  • A UNet-based diffusion model to generate latent audio

AudioLDM Pipeline

  1. Encode text using CLAP → embedding
  2. Encode input audio into latent Mel via VAE
  3. Generate latent audio with diffusion model (conditioned on embedding)
  4. Decode Mel → waveform using HiFi-GAN
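
As a concrete reference point, here is a minimal text-to-audio sketch using the public AudioLDM checkpoint through the diffusers library. This is the stock text-only pipeline (before our audio-conditioning extension); the prompt and parameters are illustrative.

```python
import scipy.io.wavfile
import torch
from diffusers import AudioLDMPipeline

# Load the small AudioLDM checkpoint; CLAP, the UNet, the VAE, and the
# HiFi-GAN vocoder are all bundled inside the pipeline.
pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

# The prompt is encoded by CLAP, the UNet denoises a latent Mel
# spectrogram, and HiFi-GAN vocodes the decoded Mel into a waveform.
audio = pipe(
    "upbeat disco instrumental with a funky bassline",
    num_inference_steps=100,
    audio_length_in_s=10.0,
).audios[0]

# AudioLDM generates 16 kHz audio.
scipy.io.wavfile.write("sample.wav", rate=16000, data=audio)
```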

MusicGen (AudioCraft)

MusicGen, from Meta’s AudioCraft, is an autoregressive Transformer for high-fidelity music generation using:

  • Discrete audio tokens from EnCodec
  • Melody conditioning using chroma pitch features
  • Optional text conditioning for genre/style control

MusicGen Pipeline

  1. Extract chroma features from melody (e.g., vocal input)
  2. Encode text prompt (T5 encoder)
  3. Transformer generates EnCodec tokens
  4. Decode tokens into waveform (32kHz audio)
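
For comparison, a minimal melody-conditioned generation sketch with the audiocraft API (file paths and the prompt are placeholders):

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=30)  # seconds of audio to generate

melody, sr = torchaudio.load("vocals.wav")  # e.g. an isolated vocal stem

# Chroma features are extracted from `melody` internally; the T5-encoded
# text prompt steers genre/style while the chroma steers the melody.
wav = model.generate_with_chroma(
    descriptions=["Disco music for input vocals"],
    melody_wavs=melody[None],  # add a batch dimension: [1, channels, samples]
    melody_sample_rate=sr,
)
audio_write("output", wav[0].cpu(), model.sample_rate, strategy="loudness")
```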

Applications

  • Voice-to-instrument synthesis
  • Prompt-based music composition
  • Genre transfer
  • Melody harmonization

Dataset

To train both models, we created a dataset of:

  • 416 tracks across 42 artists
  • Paired vocals and instrumentals
  • Metadata: genre, BPM, key, mood, artist name
  • Languages: English (69%), Arabic (19%), French (12%)

Tracks were segmented for training:

  • 30s chunks for MusicGen
  • 20s chunks for AudioLDM
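
A simple way to produce those chunks, as a hypothetical sketch using torchaudio (file names and directory layout are illustrative):

```python
import torchaudio

def segment(path: str, chunk_seconds: int, out_prefix: str) -> int:
    """Slice a stem into fixed-length chunks, dropping the trailing remainder."""
    wav, sr = torchaudio.load(path)
    chunk = chunk_seconds * sr
    n = wav.shape[-1] // chunk
    for i in range(n):
        torchaudio.save(
            f"{out_prefix}_{i:03d}.wav", wav[:, i * chunk : (i + 1) * chunk], sr
        )
    return n

# Vocal and instrumental stems are segmented identically so pairs stay aligned.
segment("track01_vocals.wav", 30, "musicgen/track01_vocals")
segment("track01_inst.wav", 30, "musicgen/track01_inst")
```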

Methodology

Extending AudioLDM

The original AudioLDM supports only text conditioning. We extended it to multi-modal conditioning by:

  • Using both CLAP text and audio encoders
  • Concatenating embeddings for unified context
  • Injecting context via FiLM layers or direct input to UNet

This allows generation from:

  • Voice only
  • Voice + text prompt (e.g., genre)
  • Text only
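
To make the conditioning path concrete, here is a hypothetical sketch of FiLM-style injection. Module names and dimensions are illustrative and do not mirror AudioLDM's internal code.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Predicts per-channel scale and shift from the conditioning context."""
    def __init__(self, context_dim: int, num_channels: int):
        super().__init__()
        self.proj = nn.Linear(context_dim, 2 * num_channels)

    def forward(self, h: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        scale, shift = self.proj(context).chunk(2, dim=-1)
        # Broadcast over the spatial dims of the UNet feature map h: [B, C, H, W]
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

text_emb = torch.randn(1, 512)   # CLAP text embedding
audio_emb = torch.randn(1, 512)  # CLAP audio embedding of the input voice
context = torch.cat([text_emb, audio_emb], dim=-1)  # unified context, dim 1024

film = FiLM(context_dim=1024, num_channels=256)
h = torch.randn(1, 256, 32, 32)  # a UNet feature map
h = film(h, context)
```

FiLM keeps the UNet backbone unchanged while the concatenated context rescales and shifts each channel, which is what makes dropping either modality (voice-only or text-only generation) straightforward.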

Fine-tuning MusicGen

We fine-tuned the 1.5B parameter MusicGen-Melody model in two stages:

  1. Voice-to-music: using isolated vocals to guide melody
  2. Voice + text: adding genre-specific text prompts (e.g., “Disco music for input vocals”)
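
The stage-two text prompts follow the format shown above; assuming a simple template over our genre metadata, a one-line builder might look like:

```python
def build_prompt(genre: str) -> str:
    """Turn a genre tag from the dataset metadata into a text prompt."""
    return f"{genre} music for input vocals"

print(build_prompt("Disco"))  # -> "Disco music for input vocals"
```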

Experimental Setup

Training was done on an NVIDIA A40 (48 GB VRAM) GPU.

MusicGen Fine-Tuning

  • Model: MusicGen-Melody (1.5B)
  • Batch size: 2
  • Epochs: 48
  • Duration: ~48 hours

AudioLDM Fine-Tuning

  • Model: AudioLDM-s (330M)
  • Trained for 10k steps / 12 hrs per variant
  • Variants:
    • Voice only
    • Voice + text
    • Refined text prompts (to match MusicGen prompt format)

Results

MusicGen Output

MusicGen showed:

  • Strong alignment to both melody and genre prompts
  • Stylistic consistency across genres
  • Clearer, more coherent outputs than the pretrained baseline

AudioLDM Output

  • Voice-only: poor output, often noisy or incoherent
  • Voice + text: major improvement
  • Refined prompts: better genre alignment and clarity

While improved, AudioLDM still underperforms MusicGen, especially in musical coherence and fidelity.

🎧 Voice-to-Music Generation Results

Each listening example below pairs an input vocal and its ground-truth instrumental with the generated outputs (audio examples in the post):

  • Example 1, genre prompt Pop: MusicGen pretrained, MusicGen fine-tuned, and AudioLDM fine-tuned outputs
  • Example 2, genre prompt Pop: MusicGen fine-tuned and AudioLDM fine-tuned outputs
  • Example 3, genre prompt Disco: MusicGen fine-tuned and AudioLDM fine-tuned outputs
  • Example 4, genre prompt Disco: MusicGen pretrained, MusicGen fine-tuned, and AudioLDM fine-tuned outputs
  • Example 5, genre prompt Rock: MusicGen pretrained, MusicGen fine-tuned, and AudioLDM fine-tuned outputs

🌍 Cultural Generalizability: Unclean Vocal Inputs

To evaluate how well the models generalize across languages, accents, and noisy vocal inputs, we tested on vocal tracks from different cultural backgrounds using well-known songs in Arabic, French, Egyptian Arabic, and English.

Each case uses the original unclean vocals and generates instrumental output via the fine-tuned models.

  • Arabic: "Kefak Enta" (كفّك إنتَ)
  • French: "La Vie en Rose" (Édith Piaf)
  • Egyptian Arabic: Cairokee, "James Dean"
  • English: "Sweet Caroline"

For each song, the input vocal and the generated instrumental are provided as audio examples.

🔍 Evaluation Results

We conducted a comprehensive evaluation of our models using both qualitative and quantitative methods. The goal was to compare the performance of our fine-tuned models—MusicGen Fine-tuned and AudioLDM Fine-tuned—against the pretrained MusicGen baseline in generating instrumental music from vocal input.


Qualitative Evaluation: User Listening Survey

To assess perceptual quality and alignment, we conducted a user study with 10 participants, each comparing outputs from all three models across 4 different songs. Participants answered 12 questions covering:

  • Vocal alignment
  • Genre fit
  • Overall audio quality

Win Rate Comparison (Per Question Basis)

Model               | Win Rate (%)
--------------------|-------------
MusicGen Fine-tuned | 58.57
MusicGen Pretrained | 43.33
AudioLDM Fine-tuned | 40.00

ℹ️ While MusicGen Fine-tuned led in preference, the relatively close win rates highlight the complexity of modeling user expectations and stylistic alignment in music generation.

📊 Genre Preference Bar Chart (Simplified)

Across four test tracks, participants selected preferred outputs based on genre fit:

Track   | MusicGen Fine-tuned | AudioLDM Fine-tuned
--------|----------------------|---------------------
Test 1  | 7 votes              | 3 votes
Test 2  | 5 votes              | 5 votes
Test 3  | 5 votes              | 5 votes
Test 4  | 5 votes              | 5 votes

🎵 Interpretation: While MusicGen was generally preferred for melodic coherence, both models showed similar strength in capturing genre cues.


Quantitative Evaluation

We used two established metrics to evaluate realism and prompt consistency:

CLAP Score (Text-Audio Alignment)

CLAP (Contrastive Language-Audio Pretraining) measures similarity between the generated audio and the text prompt.
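
In practice the score can be computed as the cosine similarity between CLAP text and audio embeddings; a sketch using the transformers CLAP checkpoint (model name and file path are assumptions, not necessarily the exact setup we used):

```python
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# This CLAP checkpoint expects 48 kHz audio.
audio, _ = librosa.load("generated.wav", sr=48000)
inputs_a = processor(audios=audio, sampling_rate=48000, return_tensors="pt")
inputs_t = processor(text=["Disco music for input vocals"], return_tensors="pt")

with torch.no_grad():
    a = model.get_audio_features(**inputs_a)
    t = model.get_text_features(**inputs_t)

clap_score = torch.nn.functional.cosine_similarity(a, t).item()
print(f"CLAP score: {clap_score:.3f}")
```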

Model               | CLAP Score ↑
--------------------|-------------
MusicGen Fine-tuned | 0.180
MusicGen Pretrained | 0.180
AudioLDM Fine-tuned | 0.117

Higher is better. Both MusicGen variants maintain strong alignment with the semantic prompt, while AudioLDM showed weaker consistency despite its gains from multi-modal fine-tuning.

Fréchet Audio Distance (FAD)

FAD assesses the realism of generated audio by comparing the statistical distribution of embeddings against real instrumentals.
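
Given embedding matrices for real and generated clips (e.g. VGGish embeddings, as in the standard FAD setup), the metric is the Fréchet distance between two Gaussians fit to those embeddings. A minimal sketch:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(real: np.ndarray, fake: np.ndarray) -> float:
    """FAD = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Usage: rows are per-clip embeddings, columns are embedding dimensions.
real_emb = np.random.randn(200, 128)
fake_emb = np.random.randn(200, 128)
print(frechet_audio_distance(real_emb, fake_emb))
```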

Model               | FAD Score ↓
--------------------|------------
MusicGen Pretrained | 10.64
MusicGen Fine-tuned | 10.70
AudioLDM Fine-tuned | 9.48

Lower is better. Interestingly, AudioLDM Fine-tuned achieved the lowest FAD score, indicating that it generates more acoustically realistic audio—even if semantically weaker. This suggests it captures low-level audio features well.


Summary of Findings

  • MusicGen Fine-tuned was preferred in qualitative tests and matched baselines in CLAP score.
  • AudioLDM Fine-tuned produced more realistic audio per FAD but fell short on semantic alignment.
  • Combining voice + text conditioning yields stronger results than using audio-only inputs.
  • While MusicGen appears better suited for structured, genre-aware music generation, AudioLDM benefits from its latent-domain realism and could be enhanced further with architectural tuning.

Conclusion

This work proposes a voice-guided music generation framework by extending two powerful audio generation models. Key contributions:

  • A custom dataset with paired vocals/instrumentals and rich metadata
  • Multi-modal conditioning for both AudioLDM and MusicGen
  • Evidence that Transformer-based generation (MusicGen) currently outperforms diffusion-based generation (AudioLDM) in output quality

Takeaways

  • Voice + text prompts offer the best control and realism
  • MusicGen is better suited for voice-to-instrument tasks today
  • AudioLDM can improve with further architecture tuning

Future Work

  • Larger datasets with studio-quality separation
  • Conditioning on chord progressions, lyrics, or emotions
  • Real-time applications in music apps or web tools
