<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://selimbin.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://selimbin.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2025-07-13T21:40:58+00:00</updated><id>https://selimbin.github.io/feed.xml</id><title type="html">blank</title><subtitle>A website showcasing some of my work and achievements. </subtitle><entry><title type="html">AudioLDM &amp;amp; MusicGen for Voice-to-Music Generation</title><link href="https://selimbin.github.io/blog/2025/MusicGen-blog/" rel="alternate" type="text/html" title="AudioLDM &amp;amp; MusicGen for Voice-to-Music Generation"/><published>2025-07-05T14:00:00+00:00</published><updated>2025-07-05T14:00:00+00:00</updated><id>https://selimbin.github.io/blog/2025/MusicGen-blog</id><content type="html" xml:base="https://selimbin.github.io/blog/2025/MusicGen-blog/"><![CDATA[<p>Welcome to this blog post on <strong>generating instrumental music from vocal input</strong>, a novel and intuitive interface for musical expression. This post presents and summarizes work done on extending two prominent models—<strong>AudioLDM</strong> and <strong>MusicGen (AudioCraft)</strong>—to perform <strong>voice-to-music generation</strong> with optional multi-modal conditioning.</p> <h2 id="introduction"><strong>Introduction</strong></h2> <p>Audio synthesis has recently seen breakthroughs via <strong>latent diffusion models</strong> and <strong>autoregressive transformers</strong>. Among these, <strong>AudioLDM</strong> and <strong>MusicGen</strong> are state-of-the-art frameworks in the realm of controllable sound generation.</p> <p>However, despite these advances, using the <strong>human voice</strong> as a control modality remains underexplored. 
Voice—through singing or humming—provides a natural way for non-musicians to guide music generation.</p> <p>In this project:</p> <ul> <li>We extend <strong>AudioLDM</strong> to support <strong>audio-based conditioning</strong> (voice input).</li> <li>We fine-tune <strong>MusicGen</strong> to generate music from <strong>melody and genre prompts</strong>.</li> <li>We build a custom <strong>vocal-instrumental dataset</strong> with 416 paired tracks across 42 artists and multiple languages.</li> <li>We evaluate both models and show that <strong>multi-modal conditioning</strong> leads to better generation quality.</li> </ul> <h2 id="background"><strong>Background</strong></h2> <h3 id="audioldm">AudioLDM</h3> <p>AudioLDM is a <strong>latent diffusion model</strong> designed for text-to-audio generation. It works in Mel spectrogram space and includes:</p> <ul> <li>A <strong>VAE</strong> for encoding and decoding spectrograms</li> <li>The <strong>CLAP</strong> encoder for extracting semantic information from text/audio</li> <li>A <strong>UNet</strong>-based diffusion model to generate latent audio</li> </ul> <h4 id="audioldm-pipeline">AudioLDM Pipeline</h4> <ol> <li>Encode text using CLAP → embedding</li> <li>Encode input audio into latent Mel via VAE</li> <li>Generate latent audio with diffusion model (conditioned on embedding)</li> <li>Decode Mel → waveform using HiFi-GAN</li> </ol> <h3 id="musicgen-audiocraft">MusicGen (AudioCraft)</h3> <p>MusicGen, from Meta’s <strong>AudioCraft</strong>, is an <strong>autoregressive Transformer</strong> for high-fidelity music generation using:</p> <ul> <li>Discrete audio tokens from <strong>EnCodec</strong></li> <li>Melody conditioning using <strong>chroma pitch features</strong></li> <li>Optional <strong>text conditioning</strong> for genre/style control</li> </ul> <h4 id="musicgen-pipeline">MusicGen Pipeline</h4> <ol> <li>Extract chroma features from melody (e.g., vocal input)</li> <li>Encode text prompt (T5 encoder)</li> 
<li>Transformer generates EnCodec tokens</li> <li>Decode tokens into waveform (32kHz audio)</li> </ol> <h4 id="applications">Applications</h4> <ul> <li><strong>Voice-to-instrument</strong> synthesis</li> <li><strong>Prompt-based music composition</strong></li> <li><strong>Genre transfer</strong></li> <li><strong>Melody harmonization</strong></li> </ul> <h2 id="dataset"><strong>Dataset</strong></h2> <p>To train both models, we created a dataset of:</p> <ul> <li><strong>416 tracks</strong> across <strong>42 artists</strong></li> <li>Paired <strong>vocals</strong> and <strong>instrumentals</strong></li> <li>Metadata: genre, BPM, key, mood, artist name</li> <li><strong>Languages</strong>: English (69%), Arabic (19%), French (12%)</li> </ul> <p>Tracks were segmented for training:</p> <ul> <li>30s chunks for <strong>MusicGen</strong></li> <li>20s chunks for <strong>AudioLDM</strong></li> </ul> <h2 id="methodology"><strong>Methodology</strong></h2> <h3 id="extending-audioldm">Extending AudioLDM</h3> <p>Original AudioLDM only supported <strong>text conditioning</strong>. 
We extended it to support <strong>multi-modal conditioning</strong> by:</p> <ul> <li>Using both <strong>CLAP text and audio encoders</strong></li> <li>Concatenating embeddings for unified context</li> <li>Injecting context via FiLM layers or direct input to UNet</li> </ul> <p>This allows generation from:</p> <ul> <li>Voice only</li> <li>Voice + text prompt (e.g., genre)</li> <li>Text only</li> </ul> <h3 id="fine-tuning-musicgen">Fine-tuning MusicGen</h3> <p>We fine-tuned the <strong>1.5B parameter MusicGen-Melody model</strong> in two stages:</p> <ol> <li><strong>Voice-to-music</strong>: using isolated vocals to guide melody</li> <li><strong>Voice + text</strong>: adding genre-specific text prompts (e.g., “Disco music for input vocals”)</li> </ol> <h2 id="experimental-setup"><strong>Experimental Setup</strong></h2> <p>Training was done on an <strong>NVIDIA A40 (48 GB VRAM)</strong> GPU.</p> <h3 id="musicgen-fine-tuning">MusicGen Fine-Tuning</h3> <ul> <li>Model: MusicGen-Melody (1.5B)</li> <li>Batch size: 2</li> <li>Epochs: 48</li> <li>Duration: ~48 hours</li> </ul> <h3 id="audioldm-fine-tuning">AudioLDM Fine-Tuning</h3> <ul> <li>Model: AudioLDM-s (330M)</li> <li>Trained for 10k steps / 12 hrs per variant</li> <li>Variants: <ul> <li>Voice only</li> <li>Voice + text</li> <li>Refined text prompts (to match MusicGen prompt format)</li> </ul> </li> </ul> <h2 id="results"><strong>Results</strong></h2> <h3 id="musicgen-output">MusicGen Output</h3> <p>MusicGen showed:</p> <ul> <li>Strong alignment to both <strong>melody and genre prompts</strong></li> <li>Stylistic consistency across genres</li> <li>Clearer, more coherent outputs than the pretrained baseline</li> </ul> <h3 id="audioldm-output">AudioLDM Output</h3> <ul> <li><strong>Voice-only</strong>: poor output, often noisy or incoherent</li> <li><strong>Voice + text</strong>: major improvement</li> <li><strong>Refined prompts</strong>: better genre alignment and clarity</li> </ul> <p>While improved, AudioLDM still 
underperforms compared to MusicGen, especially in musical coherence and fidelity.</p> <h3 id="-voice-to-music-generation-results">🎧 Voice-to-Music Generation Results</h3> <ul id="test-cases" class="tab" data-tab="bfedf07e-14f2-41e9-b205-6e3cf4908c76" data-name="test-cases"> <li class="active" id="test-cases-test-1--pop"> <a href="#">Test 1: Pop </a> </li> <li id="test-cases-test-2a--pop-prompt"> <a href="#">Test 2a: Pop Prompt </a> </li> <li id="test-cases-test-2b--disco-prompt"> <a href="#">Test 2b: Disco Prompt </a> </li> <li id="test-cases-test-3--disco"> <a href="#">Test 3: Disco </a> </li> <li id="test-cases-test-4--rock"> <a href="#">Test 4: Rock </a> </li> </ul> <ul class="tab-content" id="bfedf07e-14f2-41e9-b205-6e3cf4908c76" data-name="test-cases"> <li class="active"> <p><strong>Input Vocal</strong></p> <audio controls="" src="/assets/audio/test1_input_2.wav"></audio> <p><strong>Ground Truth Instrumental</strong></p> <audio controls="" src="/assets/audio/test1_gt_2.wav"></audio> <p><strong>Genre Prompt:</strong> <code class="language-plaintext highlighter-rouge">Pop</code></p> <p><strong>MusicGen Pretrained Output</strong></p> <audio controls="" src="/assets/audio/test1_pretrained_2.wav"></audio> <p><strong>MusicGen Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test1_musicgen_ft_2.wav"></audio> <p><strong>AudioLDM Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test1_audioldm_ft_2.wav"></audio> </li> <li> <p><strong>Input Vocal</strong></p> <audio controls="" src="/assets/audio/test2_input.wav"></audio> <p><strong>Ground Truth Instrumental</strong></p> <audio controls="" src="/assets/audio/test2_gt.wav"></audio> <p><strong>Genre Prompt:</strong> <code class="language-plaintext highlighter-rouge">Pop</code></p> <p><strong>MusicGen Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test2_musicgen_pop.wav"></audio> <p><strong>AudioLDM Fine-tuned Output</strong></p> <audio controls="" 
src="/assets/audio/test2_audioldm_pop.wav"></audio> </li> <li> <p><strong>Input Vocal</strong></p> <audio controls="" src="/assets/audio/test2_input.wav"></audio> <p><strong>Ground Truth Instrumental</strong></p> <audio controls="" src="/assets/audio/test2_gt.wav"></audio> <p><strong>Genre Prompt:</strong> <code class="language-plaintext highlighter-rouge">Disco</code></p> <p><strong>MusicGen Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test2_musicgen_disco.wav"></audio> <p><strong>AudioLDM Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test2_audioldm_disco.wav"></audio> </li> <li> <p><strong>Input Vocal</strong></p> <audio controls="" src="/assets/audio/test3_input.wav"></audio> <p><strong>Ground Truth Instrumental</strong></p> <audio controls="" src="/assets/audio/test3_gt.wav"></audio> <p><strong>Genre Prompt:</strong> <code class="language-plaintext highlighter-rouge">Disco</code></p> <p><strong>MusicGen Pretrained Output</strong></p> <audio controls="" src="/assets/audio/test3_pretrained.wav"></audio> <p><strong>MusicGen Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test3_musicgen_ft.wav"></audio> <p><strong>AudioLDM Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test3_audioldm_ft.wav"></audio> </li> <li> <p><strong>Input Vocal</strong></p> <audio controls="" src="/assets/audio/test4_input.wav"></audio> <p><strong>Ground Truth Instrumental</strong></p> <audio controls="" src="/assets/audio/test4_gt.wav"></audio> <p><strong>Genre Prompt:</strong> <code class="language-plaintext highlighter-rouge">Rock</code></p> <p><strong>MusicGen Pretrained Output</strong></p> <audio controls="" src="/assets/audio/test4_pretrained.wav"></audio> <p><strong>MusicGen Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test4_musicgen_ft.wav"></audio> <p><strong>AudioLDM Fine-tuned Output</strong></p> <audio controls="" src="/assets/audio/test4_audioldm_ft.wav"></audio> 
</li> </ul> <h3 id="-cultural-generalizability-unclean-vocal-inputs">🌍 Cultural Generalizability: Unclean Vocal Inputs</h3> <p>To evaluate how well the models generalize across <strong>languages</strong>, <strong>accents</strong>, and <strong>noisy vocal inputs</strong>, we tested on vocal tracks from different cultural backgrounds using known songs in <strong>Arabic</strong>, <strong>French</strong>, <strong>Egyptian Arabic</strong>, and <strong>English</strong>.</p> <p>Each case uses the original unclean vocals and generates instrumental output via the fine-tuned models.</p> <ul id="cultural-generalizability" class="tab" data-tab="4d99e0e3-a175-4f48-b566-087f78edd039" data-name="cultural-generalizability"> <li class="active" id="cultural-generalizability-arabic------------"> <a href="#">Arabic — كفّك إنتَ </a> </li> <li id="cultural-generalizability-french---la-vie-en-rose"> <a href="#">French — La Vie en Rose </a> </li> <li id="cultural-generalizability-egyptian---cairokee"> <a href="#">Egyptian — CairoKee </a> </li> <li id="cultural-generalizability-english---sweet-caroline"> <a href="#">English — Sweet Caroline </a> </li> </ul> <ul class="tab-content" id="4d99e0e3-a175-4f48-b566-087f78edd039" data-name="cultural-generalizability"> <li class="active"> <p><strong>Culture:</strong> Arabic<br/> <strong>Song:</strong> <em>Kefak enta (كفّك إنتَ)</em></p> <p><strong>Input Vocal:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_arabic_input.wav" type="audio/wav"/> </audio> <p><strong>Generated Instrumental Output:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_arabic_output.wav" type="audio/wav"/> </audio> </li> <li> <p><strong>Culture:</strong> French<br/> <strong>Song:</strong> <em>La Vie en Rose — Édith Piaf</em></p> <p><strong>Input Vocal:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_french_input.wav" type="audio/wav"/> </audio> 
<p><strong>Generated Instrumental Output:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_french_output.wav" type="audio/wav"/> </audio> </li> <li> <p><strong>Culture:</strong> Egyptian Arabic<br/> <strong>Song:</strong> <em>Cairokee – James Dean</em></p> <p><strong>Input Vocal:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_egypt_input.wav" type="audio/wav"/> </audio> <p><strong>Generated Instrumental Output:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_egypt_output.wav" type="audio/wav"/> </audio> </li> <li> <p><strong>Culture:</strong> English<br/> <strong>Song:</strong> <em>Sweet Caroline</em></p> <p><strong>Input Vocal:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_english_input.wav" type="audio/wav"/> </audio> <p><strong>Generated Instrumental Output:</strong></p> <audio controls="" style="width: 100%;"> <source src="/assets/audio/culture_english_output.wav" type="audio/wav"/> </audio> </li> </ul> <h2 id="-evaluation-results">🔍 Evaluation Results</h2> <p>We conducted a comprehensive evaluation of our models using both <strong>qualitative</strong> and <strong>quantitative</strong> methods. The goal was to compare the performance of our fine-tuned models—<strong>MusicGen Fine-tuned</strong> and <strong>AudioLDM Fine-tuned</strong>—against the <strong>pretrained MusicGen</strong> baseline in generating instrumental music from vocal input.</p> <hr/> <h3 id="qualitative-evaluation-user-listening-survey">Qualitative Evaluation: User Listening Survey</h3> <p>To assess perceptual quality and alignment, we conducted a user study with <strong>10 participants</strong>, each comparing outputs from all three models across <strong>4 different songs</strong>. 
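</p> <p>As a side note on how such preferences can be aggregated: when ties or multiple picks are allowed, per-question win rates need not sum to 100%. A minimal tallying sketch (the ballots and scheme below are hypothetical, not the study's actual procedure):</p>

```python
from collections import Counter

def win_rates(ballots):
    """Per-question win rate: the share of ballots on which each model was
    (possibly jointly) preferred. Ties credit every tied model, so the
    rates can sum to more than 100%."""
    wins = Counter()
    for preferred in ballots:          # one set of preferred model(s) per ballot
        for model in preferred:
            wins[model] += 1
    return {m: 100.0 * n / len(ballots) for m, n in wins.items()}

# Hypothetical ballots: each entry is the set of models a participant preferred.
ballots = [{"musicgen_ft"}, {"musicgen_ft", "audioldm_ft"}, {"pretrained"}]
print(win_rates(ballots))
```
<p>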
Participants answered <strong>12 questions</strong> covering:</p> <ul> <li>Vocal alignment</li> <li>Genre fit</li> <li>Overall audio quality</li> </ul> <h4 id="win-rate-comparison-per-question-basis">Win Rate Comparison (Per Question Basis)</h4> <table> <thead> <tr> <th>Model</th> <th>Win Rate (%)</th> </tr> </thead> <tbody> <tr> <td>MusicGen Fine-tuned</td> <td><strong>58.57%</strong></td> </tr> <tr> <td>MusicGen Pretrained</td> <td>43.33%</td> </tr> <tr> <td>AudioLDM Fine-tuned</td> <td>40.00%</td> </tr> </tbody> </table> <blockquote> <p>ℹ️ While MusicGen Fine-tuned led in preference, the relatively close win rates highlight the complexity of modeling user expectations and stylistic alignment in music generation.</p> </blockquote> <h4 id="-genre-preference-bar-chart-simplified">📊 Genre Preference Bar Chart (Simplified)</h4> <p>Across four test tracks, participants selected preferred outputs based on genre fit:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Track   | MusicGen Fine-tuned | AudioLDM Fine-tuned
--------|----------------------|---------------------
Test 1  | 7 votes              | 3 votes
Test 2  | 5 votes              | 5 votes
Test 3  | 5 votes              | 5 votes
Test 4  | 5 votes              | 5 votes
</code></pre></div></div> <p>🎵 Interpretation: While MusicGen was generally preferred for melodic coherence, both models showed similar strength in capturing genre cues.</p> <hr/> <h3 id="quantitative-evaluation">Quantitative Evaluation</h3> <p>We used two established metrics to evaluate realism and prompt consistency:</p> <h4 id="clap-score-text-audio-alignment">CLAP Score (Text-Audio Alignment)</h4> <p>CLAP (Contrastive Language-Audio Pretraining) measures similarity between the generated audio and the text prompt.</p> <table> <thead> <tr> <th>Model</th> <th>CLAP Score ↑</th> </tr> </thead> <tbody> <tr> <td>MusicGen Fine-tuned</td> <td><strong>0.180</strong></td> </tr> <tr> <td>MusicGen Pretrained</td> <td><strong>0.180</strong></td> </tr> <tr> <td>AudioLDM Fine-tuned</td> <td>0.117</td> </tr> </tbody> </table> <p><strong>Higher is better.</strong> MusicGen clearly excels at maintaining alignment with semantic prompts, while AudioLDM showed weaker consistency, despite its improvements from multi-modal fine-tuning.</p> <h4 id="fréchet-audio-distance-fad">Fréchet Audio Distance (FAD)</h4> <p>FAD assesses the realism of generated audio by comparing the statistical distribution of embeddings against real instrumentals.</p> <table> <thead> <tr> <th>Model</th> <th>FAD Score ↓</th> </tr> </thead> <tbody> <tr> <td>MusicGen Pretrained</td> <td><strong>10.64</strong></td> </tr> <tr> <td>MusicGen Fine-tuned</td> <td>10.70</td> </tr> <tr> <td>AudioLDM Fine-tuned</td> <td><strong>9.48</strong></td> </tr> </tbody> </table> <p><strong>Lower is better.</strong> Interestingly, AudioLDM Fine-tuned achieved the lowest FAD score, indicating that it generates more acoustically realistic audio—even if semantically weaker. 
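</p> <p>For reference, FAD reduces to the Fréchet distance between Gaussians fitted to embeddings of real and generated audio (the original FAD formulation uses VGGish embeddings). A minimal numpy sketch of that distance, assuming the embedding matrices are already computed:</p>

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)        # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(real_emb, gen_emb):
    """Fréchet distance between Gaussians fitted to two embedding sets
    (rows = audio clips, columns = embedding dimensions)."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    s = _sqrtm_psd(cov_r)
    covmean = _sqrtm_psd(s @ cov_g @ s)    # its trace equals Tr((C_r C_g)^(1/2))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

<p>Identical embedding distributions give a score of zero; shifting the generated embeddings away from the real ones increases the distance quadratically.</p> <p>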
This suggests that AudioLDM captures low-level audio features well.</p> <hr/> <h3 id="summary-of-findings">Summary of Findings</h3> <ul> <li><strong>MusicGen Fine-tuned</strong> was preferred in qualitative tests and matched baselines in CLAP score.</li> <li><strong>AudioLDM Fine-tuned</strong> produced more realistic audio per FAD but lagged in semantic alignment.</li> <li>Combining <strong>voice + text conditioning</strong> yields stronger results than using audio-only inputs.</li> <li>While MusicGen appears better suited for structured, genre-aware music generation, AudioLDM benefits from its latent-domain realism and could be enhanced further with architectural tuning.</li> </ul> <h2 id="conclusion"><strong>Conclusion</strong></h2> <p>This work proposes a <strong>voice-guided music generation</strong> framework by extending two powerful audio generation models. Key contributions:</p> <ul> <li>A <strong>custom dataset</strong> with paired vocals/instrumentals and rich metadata</li> <li><strong>Multi-modal conditioning</strong> for both AudioLDM and MusicGen</li> <li><strong>Transformer-based generation</strong> (MusicGen) outperforms diffusion-based generation (AudioLDM) in quality</li> </ul> <h3 id="takeaways">Takeaways</h3> <ul> <li><strong>Voice + text prompts</strong> offer the best control and realism</li> <li><strong>MusicGen</strong> is better suited for voice-to-instrument tasks today</li> <li><strong>AudioLDM</strong> can improve with further architecture tuning</li> </ul> <h3 id="future-work">Future Work</h3> <ul> <li>Larger datasets with studio-quality separation</li> <li>Conditioning on <strong>chord progressions</strong>, <strong>lyrics</strong>, or <strong>emotions</strong></li> <li>Real-time applications in music apps or web tools</li> </ul> <h2 id="references"><strong>References</strong></h2> <ul> <li><a href="https://arxiv.org/abs/2301.12503">AudioLDM</a></li> <li><a href="https://arxiv.org/abs/2306.05284">MusicGen (AudioCraft)</a></li> <li><a 
href="https://arxiv.org/abs/2301.12661">CLAP: Contrastive Language-Audio Pretraining</a></li> <li><a href="https://arxiv.org/abs/2210.13438">EnCodec</a></li> <li><a href="https://github.com/LAION-AI/CLAP">HTSAT-CLAP Implementation</a></li> </ul> <hr/>]]></content><author><name></name></author><category term="blog-posts"/><category term="deep-learning"/><category term="music-generation"/><category term="diffusion-models"/><category term="transformers"/><summary type="html"><![CDATA[A deep learning blog post on extending AudioLDM and MusicGen for voice-guided multi-modal music generation]]></summary></entry><entry><title type="html">CasCast</title><link href="https://selimbin.github.io/blog/2024/CasCast-blog/" rel="alternate" type="text/html" title="CasCast"/><published>2024-08-05T12:30:13+00:00</published><updated>2024-08-05T12:30:13+00:00</updated><id>https://selimbin.github.io/blog/2024/CasCast-blog</id><content type="html" xml:base="https://selimbin.github.io/blog/2024/CasCast-blog/"><![CDATA[<p>Welcome to my blog post summarizing and presenting the paper “<a href="https://arxiv.org/abs/2402.04290">CasCast</a>: Skillful High-resolution Precipitation Nowcasting via Cascaded Modelling”. The paper will be presented in this blog post as it showcases a novel approach with great promise.</p> <h2 id="introduction"><strong>Introduction</strong></h2> <p>The CasCast paper represents an advancement in the field of meteorological forecasting, particularly in the accurate prediction of precipitation using high-resolution radar data. The paper specifically addresses the challenges faced in nowcasting, which is the prediction of weather conditions for a short period, usually up to two hours ahead. Accurate weather forecasts for the immediate future are of critical importance for disaster management and for various social sectors. 
This paper aims to provide a robust solution to improve prediction accuracy, especially for extreme weather events.</p> <h2 id="motivation"><strong>Motivation</strong></h2> <p>Around the world, extreme weather events cause significant damage every year. One of the most destructive consequences of these events is flooding, which results from high amounts of precipitation. Weather forecasting is essential for handling and planning around disasters, and it affects many sectors (transportation, event planning, etc.). Precipitation events involve multiple scales of atmospheric systems, making accurate predictions challenging. Furthermore, most current methods struggle with short-term forecasting (nowcasting), defined here as forecasting events that will occur within the next two hours. These predictions are essential for emergency management and disaster mitigation, and the forecast precipitation data can be used to give real-time warnings to impacted communities.</p> <h2 id="some-information-to-know-beforehand"><strong>Some information to know beforehand</strong></h2> <p>Previous research in this field has faced multiple problems. First of all, precipitation events involve multiple scales of atmospheric systems, making accurate predictions challenging. These systems are influenced by mesoscale precipitation systems as well as small-scale systems.</p> <p>Previous research also faced challenges in predicting extreme precipitation events, which occur at small scales. 
This matters because, over the past 50 years, extreme-precipitation events have caused more than 1 million deaths and economic losses beyond US$ 2.8 trillion.</p> <h3 id="precipitation-systems">Precipitation Systems</h3> <p>Mesoscale precipitation systems evolve over spatial ranges of tens to hundreds of kilometers and time scales of several hours, driven and constrained by relatively stable large-scale circulation.</p> <p>Small-scale systems evolve within a range of a few kilometers and operate on time scales of minutes; they are influenced by local processes such as heating, surface features, and other physical factors, which introduce stochasticity and unpredictability into the system's behavior.</p> <h3 id="models-used">Models Used</h3> <p>Another problem is that each type of short-term forecasting model, whether deterministic or probabilistic, has its limitations. Deterministic models are unable to capture the fine-grained detail of precipitation patterns, while probabilistic models are unable to capture large-scale movements.</p> <p>Deterministic models aim to predict the overall motion of mid-scale precipitation systems with a single-value forecast, but they often lack detail and appear blurry because they average out the randomness of small-scale systems.</p> <p>Probabilistic models, on the other hand, sample from various latent variables to represent the randomness of future weather, capturing small-scale phenomena better. However, they struggle with accurately forecasting the large-scale, predictable distribution of precipitation.</p> <p>In summary, current models still face challenges in simultaneously predicting both mesoscale and small-scale systems.</p> <h2 id="deep-learning-approach"><strong>Deep Learning Approach</strong></h2> <h3 id="mapping-the-problem">Mapping The Problem</h3> <p>In order to approach this problem, its structure must be well-defined. 
Using multiple inputs, experts are attempting to predict weather conditions with deep learning models.</p> <p>The inputs used are generally radar data (high-resolution radar echo images) as well as a variety of atmospheric variables, such as temperature, humidity, or wind patterns. This is data from the past, from time 0 to T, where T is the current time step (the present); the data covers T time steps.</p> <p>Below is an example of a high-resolution radar echo image that can be used as input.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/radar_echo_input-480.webp 480w,/assets/img/radar_echo_input-800.webp 800w,/assets/img/radar_echo_input-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/radar_echo_input.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Example of a High Resolution Radar Echo Image </div> <p>The desired outputs are accurate precipitation maps of the affected areas; below is an example of a desired precipitation map.</p> <p>In this image, areas of precipitation are colored from lightest to heaviest precipitation in the following progression: green, yellow, orange, red, and pink. 
The darker the shade within each color, the higher the precipitation.</p> <p>It is also important to note that the pink areas are areas of extreme precipitation.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/output_sample_1-480.webp 480w,/assets/img/output_sample_1-800.webp 800w,/assets/img/output_sample_1-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/output_sample_1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Output Precipitation Map from CasCast </div> <h3 id="loss-function">Loss Function</h3> <p>Multiple loss functions can be used to train the models, including:</p> <ul> <li><strong>Mean Squared Error (MSE)</strong>: Can be used for training deterministic models. It measures the average squared difference between the estimated values and what is observed. This loss helps minimize the forecast error in terms of the general precipitation distribution.</li> <li><strong>Noise Prediction Loss</strong>: Can be used in probabilistic models where a diffusion process is involved. 
This loss function helps in refining the generation of local weather phenomena, focusing on the specifics that the deterministic model might miss.</li> <li><strong>Hybrid Loss</strong>: Different loss functions can be combined to train complex, multi-part models.</li> </ul> <h2 id="what-loss-functions-and-scoring-functions-are-used-"><strong>What loss functions and scoring functions are used?</strong></h2> <h3 id="loss-functions">Loss Functions</h3> <h4 id="mean-squared-error-loss">Mean Squared Error Loss</h4> <p>In the deterministic component, the mean squared error loss is used.</p> \[L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2\] <p>where \(y_i\) is the observed value and \(\hat{y}_i\) is the predicted value. MSE is used to minimize the average squared differences between the predicted and observed values; this ensures that the model captures the general precipitation patterns accurately.</p> <h4 id="noise-prediction-loss">Noise Prediction Loss</h4> <p>One of the loss functions used is the Noise Prediction Loss, which was presented by <a href="https://arxiv.org/pdf/2006.11239">Ho et al.</a> in 2020:</p> \[L_{\theta_p} = \mathbb{E}_{\epsilon, k} \left[ \| \epsilon - \epsilon_{\theta_p}(z_k, k, z_{\text{cond}}) \|_2^2 \right]\] <p>where:</p> <ul> <li>\(\epsilon\) is the true noise added to the data during the forward diffusion process.</li> <li>\(k\) is the current time step in the diffusion process.</li> <li>\(z_k\) is the latent variable at time step \(k\).</li> <li>\(\epsilon_{\theta_p}(z_k, k, z_{\text{cond}})\) is the noise predicted by the model.</li> <li>\(z_{\text{cond}}\) represents the conditional information, including the latent representations of the initial radar observations and deterministic model outputs.</li> </ul> <p>The noise prediction loss is utilized in the probabilistic component, particularly within the diffusion model framework. 
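</p> <p>To make the objective concrete, here is a toy numpy sketch of one step of the forward diffusion process and the resulting loss; the network \(\epsilon_{\theta_p}\) is replaced by a random stub, and the latent size and schedule value are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_prediction_loss(eps_true, eps_pred):
    """Monte-Carlo estimate of E[||eps - eps_theta(z_k, k, z_cond)||^2]."""
    return float(np.mean((eps_true - eps_pred) ** 2))

z0 = rng.standard_normal((4, 16))          # clean latent (toy batch of 4)
alpha_bar_k = 0.7                          # cumulative noise-schedule value at step k
eps = rng.standard_normal(z0.shape)        # true noise added at this step
z_k = np.sqrt(alpha_bar_k) * z0 + np.sqrt(1.0 - alpha_bar_k) * eps  # noised latent

eps_pred = rng.standard_normal(z0.shape)   # stand-in for eps_theta(z_k, k, z_cond)
loss = noise_prediction_loss(eps, eps_pred)
```

<p>A real implementation averages this loss over randomly sampled steps \(k\) and conditions the predictor on \(z_{\text{cond}}\).</p> <p>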
The objective of the noise prediction loss is to train the model to accurately predict the noise ϵ added at each step of the diffusion process. By minimizing this loss, the model learns to reverse the noise addition, effectively denoising the data to retrieve the original high-resolution precipitation patterns.</p> <h4 id="hybrid-loss">Hybrid Loss</h4> <p>By integrating both losses, the hybrid loss ensures that the model captures both broad precipitation patterns (mesoscale) and fine-grained details (small-scale), balancing deterministic accuracy and probabilistic realism.</p> <h3 id="scoring-formulas">Scoring Formulas</h3> <p>In the paper, four different scoring methods are used to compare the results from different models: Critical Success Index (CSI), Heidke Skill Score (HSS), Continuous Ranked Probability Score (CRPS), and Structural Similarity Index Measure (SSIM).</p> <p>Some of these scoring methods can be viewed in detail below.</p> <details><summary>Critical Success Index (CSI)</summary> <p><br/> It measures the accuracy of binary event forecasts, particularly useful for precipitation where the event is the occurrence of rainfall above a certain threshold (precipitation above a particular intensity).</p> \[\text{CSI} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}\] <p>Here, TP, FP, and FN denote the true positives, false positives, and false negatives of the thresholded event; correct negatives do not enter the score.</p> </details> <p><br/> </p> <details><summary>Heidke Skill Score (HSS)</summary> <p><br/> Evaluates the accuracy of forecasts relative to random chance, considering the number of correct predictions.</p> \[\text{HSS} = \frac{2(\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN})}{(\text{TP} + \text{FN})(\text{FN} + \text{TN}) + (\text{TP} + \text{FP})(\text{FP} + \text{TN})}\] <p>HSS is used to evaluate the overall skill of the model compared to random chance, providing a balanced measure of 
accuracy.</p> </details> <p><br/> </p> <details><summary>Continuous Ranked Probability Score (CRPS)</summary> <p><br/> Measures the accuracy of probabilistic forecasts by comparing the predicted cumulative distribution function (CDF) to the observed outcome.</p> \[\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[ F(x) - \mathbf{1}(x \geq y) \right]^2 dx\] <p>where \(F(x)\) is the CDF of the forecasted distribution at value \(x\), and \(\mathbf{1}(x \geq y)\) is the indicator function that equals 1 if \(x \geq y\) and 0 otherwise.</p> <p>CRPS evaluates the probabilistic predictions, ensuring that the predicted distributions align well with the actual observed values.</p> </details> <p><br/></p> <details><summary>Structural Similarity Index Measure (SSIM)</summary> <p><br/> Measures the perceived quality of the predictions compared to the ground truth, considering luminance, contrast, and structure.</p> <p>SSIM is calculated using a sliding window approach over the images, typically involving:</p> \[\text{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}\] <p>where \(\mu_x\) and \(\mu_y\) are the means of the two images \(x\) and \(y\), \(\sigma_x^2\) and \(\sigma_y^2\) are the variances, \(\sigma_{xy}\) is the covariance, and \(c_1\) and \(c_2\) are constants to stabilize the division.</p> <p>SSIM is used to assess the visual similarity between predicted precipitation patterns and the actual observations, providing a measure of structural accuracy.</p> </details> <p><br/></p> <h2 id="cascast-model"><strong>CasCast Model</strong></h2> <p>CasCast is a deep learning model designed for high-resolution precipitation nowcasting, which tackles the challenge of accurately predicting precipitation in the short term using radar data. 
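</p> <p>Before detailing the model, the two categorical scores from the previous section can be sketched in code: both are simple functions of a contingency table obtained by thresholding the precipitation fields (the threshold and fields below are illustrative):</p>

```python
import numpy as np

def contingency(pred, obs, threshold):
    """Binarize precipitation fields at an intensity threshold and count
    true positives, false positives, false negatives, true negatives."""
    p, o = pred >= threshold, obs >= threshold
    return (int(np.sum(p & o)), int(np.sum(p & ~o)),
            int(np.sum(~p & o)), int(np.sum(~p & ~o)))

def csi(tp, fp, fn):
    """Critical Success Index: hits over hits + false alarms + misses."""
    return tp / (tp + fp + fn)

def hss(tp, fp, fn, tn):
    """Heidke Skill Score: forecast skill relative to random chance."""
    num = 2.0 * (tp * tn - fp * fn)
    den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return num / den
```

<p>A perfect forecast scores 1 on both; CSI ignores correct negatives, while HSS uses the full table.</p> <p>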
This is a novel model that incorporates a cascaded architecture.</p> <h3 id="cascaded-architecture">Cascaded Architecture</h3> <p>It is structured into two main components: a deterministic model and a probabilistic model, which work in tandem. This cascaded approach allows the model to effectively handle the complexities of precipitation systems operating at different scales, which was a challenge for previous models.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/cascaded_model1-480.webp 480w,/assets/img/cascaded_model1-800.webp 800w,/assets/img/cascaded_model1-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/cascaded_model1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> CasCast Model Design </div> <h3 id="deterministic-model">Deterministic Model</h3> <p>This first part of the model can incorporate conventional neural network architectures such as CNNs, RNNs, or Transformers, trained to minimize mean squared error (MSE) loss.</p> <p>This part of the model is responsible for predicting the mesoscale aspects of precipitation (larger, more predictable patterns).</p> <p>This allows the model to capture the broad movements in weather patterns.
The output of this model provides a solid foundation for the second component of the architecture: the probabilistic model.</p> <p>Seen here is the deterministic part of the model architecture.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/deterministic-480.webp 480w,/assets/img/deterministic-800.webp 800w,/assets/img/deterministic-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/deterministic.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> CasCast Deterministic Component Architecture </div> <h3 id="probabilistic-model">Probabilistic Model</h3> <p>Using the output of the deterministic model, the probabilistic model generates the fine-grained details and local variations within the precipitation pattern. It aims to model the stochasticity inherent in meteorological systems, which is particularly useful for capturing the nuances of extreme weather events.</p> <p>In CasCast the probabilistic model is a frame-wise-guided diffusion transformer. This is a generative model that simulates the process of adding and removing noise to generate detailed predictions.
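<p>(As a rough illustration of this add-noise / remove-noise training objective, here is a minimal NumPy sketch of the closed-form forward noising step and the ε-prediction MSE loss described earlier; the function names and toy shapes are illustrative, not the paper's implementation:)</p>

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    # q(x_t | x_0): jump straight to diffusion step t using the closed form
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    # MSE between the network's predicted noise and the true noise
    return float(np.mean((eps_pred - eps) ** 2))
```

<p>A denoising network trained to minimize this loss can then be run in reverse, step by step, to turn pure noise into a detailed precipitation field.</p>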
This part is crucial for enhancing the resolution and accuracy of predictions at a localized scale.</p> <p>Seen below is the probabilistic part of the model architecture.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/probabilistic-480.webp 480w,/assets/img/probabilistic-800.webp 800w,/assets/img/probabilistic-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/probabilistic.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> CasCast Probabilistic Component Architecture </div> <h3 id="overall-architecture">Overall Architecture</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/overall_architecture-480.webp 480w,/assets/img/overall_architecture-800.webp 800w,/assets/img/overall_architecture-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/overall_architecture.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> CasCast Architecture </div> <p>As shown here, the model has two parts, a deterministic and a probabilistic one (shown on the left side).</p> <p>The deterministic model takes the input and generates mesoscale precipitation. All this data is then combined and given to the CasFormer.</p> <p>The diffusion denoising process generates the fine-grained details using the input data and denoising the image.</p> <p>Diffusion models learn the reverse of the process that gradually noises data \(x_0\) into Gaussian noise.</p> <p>CasFormer has a frame-wise encoding stage and a sequence-wise decoding stage.
The frame-wise encoding provides better-matched conditions for each frame-wise latent vector, reducing the complexity of the denoising conditioned by a sequence of blurry predictions.</p> <p>Sequence-wise decoding utilizes the sequence features from the sequence aggregator to ensure the spatiotemporal consistency of precipitation nowcasting.</p> <p>This frame-wise guidance in the diffusion transformer ensures a frame-to-frame correspondence between blurry predictions and latent vectors, resulting in better optimization for the generation of small-scale patterns.</p> <h2 id="results"><strong>Results</strong></h2> <p>The results obtained from the CasCast model show that it has strong capabilities in high-resolution precipitation nowcasting, particularly in handling complex weather events.</p> <h3 id="datasets-used">Datasets Used</h3> <p>The CasCast model was trained and tested on three radar precipitation datasets to evaluate its performance and robustness in different geographic and climatic conditions. These datasets were:</p> <ul> <li>The <strong>SEVIR</strong> dataset, comprising weather radar observations mostly from the United States. It features a spatial resolution of 1 km and covers a large geographic area.</li> <li>The <strong>HKO-7</strong> dataset, which comes from the Hong Kong Observatory and is primarily used for studying the regional weather conditions around Hong Kong. It contains radar CAPPI (Constant Altitude Plan Position Indicator) reflectivity images, which are critical for analyzing rainfall and storm patterns at an altitude of 2 km. The dataset has a resolution of 480x480 pixels, covering a 512 km x 512 km area around Hong Kong.</li> <li>The <strong>MeteoNet</strong> dataset from Meteo France. This dataset includes comprehensive weather data from different regions of France. The data is recorded with a spatial resolution of approximately 0.01 degrees.
For practical applications, a portion of 400x400 pixels from the top-left corner is often used to ensure quality and consistency.</li> </ul> <h3 id="comparative-results">Comparative Results</h3> <p>CasCast has demonstrated substantial improvements over existing models, especially in predicting extreme weather events. The model surpasses baseline models by up to 91.8% in regional extreme-precipitation nowcasting. This leap in performance showcases the efficacy of the cascaded modeling approach and the strength of CasCast.</p> <p>Measured with metrics like the Critical Success Index (<strong>CSI</strong>), Heidke Skill Score (<strong>HSS</strong>), and Continuous Ranked Probability Score (<strong>CRPS</strong>), the model shows high performance.</p> <p>The model had the highest <strong>CSI</strong> scores. Improvements in <strong>CSI</strong>, particularly at finer scales, indicate that the model can more accurately detect where and how intense precipitation events will occur, matching the actual observed data more closely than previous models.</p> <p>Furthermore, CasCast had the highest <strong>HSS</strong> score when compared to other models.
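<p>(For concreteness, the CSI and HSS formulas shown earlier reduce to a few lines of code over a contingency table built by thresholding the predicted and observed radar fields; a minimal sketch, with illustrative function names:)</p>

```python
import numpy as np

def contingency(pred, obs, thresh):
    # Binarize precipitation fields at an intensity threshold,
    # then count hits, false alarms, misses, and correct negatives.
    p, o = pred >= thresh, obs >= thresh
    tp = int(np.sum(p & o))    # event forecast, event observed
    fp = int(np.sum(p & ~o))   # false alarm
    fn = int(np.sum(~p & o))   # miss
    tn = int(np.sum(~p & ~o))  # correct rejection
    return tp, fp, fn, tn

def csi(tp, fp, fn):
    # Critical Success Index: TP / (TP + FP + FN)
    return tp / (tp + fp + fn)

def hss(tp, fp, fn, tn):
    # Heidke Skill Score: skill relative to random chance
    num = 2 * (tp * tn - fp * fn)
    den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return num / den
```

<p>A perfect forecast yields CSI = 1 and HSS = 1; random forecasts drive HSS toward 0.</p>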
Improved <strong>HSS</strong> scores suggest that CasCast is better at distinguishing between events and non-events, which is crucial for preventing false alarms—a common issue in weather prediction.</p> <p>Finally, CasCast has the lowest <strong>CRPS</strong> scores among the models tested; lower <strong>CRPS</strong> scores indicate that the probabilistic forecasts of CasCast closely resemble the actual outcomes, suggesting high predictive reliability and better uncertainty estimation in forecasts.</p> <p>The results of these comparative tests can be seen below.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/DataComSamples-480.webp 480w,/assets/img/DataComSamples-800.webp 800w,/assets/img/DataComSamples-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/DataComSamples.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Comparative Results between multiple models using multiple datasets </div> <h3 id="deterministic-model-selection">Deterministic Model Selection</h3> <p>Multiple deterministic models can be used as the first component of the CasCast architecture. To assess how well the cascaded strategy in CasCast works, it was tested on the SEVIR dataset: the probabilistic generation model alone was compared against versions paired with different deterministic models, as shown in the table and figure just below.</p> <p>The results were that using either the probabilistic or deterministic model alone led to problems like pixel mismatches or poor predictions of small-scale patterns and extreme regional values. However, results improved significantly when ConvLSTM, SimVP, and EarthFormer were used as the deterministic parts of CasCast.
The CSI-219-POOL16 metric improved by 49.54%, 57.69%, and 91.83% with these models, respectively. This shows that the cascaded approach improves regional extreme-precipitation nowcasting.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/DeterministicPartSelection-480.webp 480w,/assets/img/DeterministicPartSelection-800.webp 800w,/assets/img/DeterministicPartSelection-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/DeterministicPartSelection.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Replacement of the deterministic part of CasCast on SEVIR dataset </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/DetermModelComparFigure-480.webp 480w,/assets/img/DetermModelComparFigure-800.webp 800w,/assets/img/DetermModelComparFigure-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/DetermModelComparFigure.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Frame-wise CSI-M (left) and CSI-219 (right) results when using different Deterministic Models </div> <p>In short, a better deterministic model leads to better overall performance in cascaded modeling.
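<p>(The cascade itself can be summarized in a few lines of code. The sketch below uses deliberately toy stand-ins, a temporal mean for the deterministic part and a trivial drift toward the coarse field for the diffusion part, purely to show the data flow; these are not the actual CasCast components:)</p>

```python
import numpy as np

def deterministic_part(radar_seq):
    # Stand-in mesoscale predictor: a temporal mean of the input frames.
    # (In CasCast this would be a trained model such as ConvLSTM,
    # SimVP, or EarthFormer, optimized with MSE loss.)
    return radar_seq.mean(axis=0)

def probabilistic_part(coarse, steps, rng):
    # Stand-in for the diffusion transformer: start from noise and
    # iteratively "denoise" toward the coarse prediction. A real
    # diffusion model would run a learned reverse process that adds
    # plausible small-scale detail instead of merely converging.
    x = rng.standard_normal(coarse.shape)
    for _ in range(steps):
        x = x + 0.5 * (coarse - x)
    return x

def cascast_nowcast(radar_seq, steps=50, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    coarse = deterministic_part(radar_seq)         # mesoscale pass
    return probabilistic_part(coarse, steps, rng)  # small-scale refinement
```

<p>The key design point survives even in this toy form: the probabilistic stage never predicts from scratch; it is always conditioned on the deterministic stage's coarse output.</p>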
This highlights the value of combining both probabilistic and deterministic models for more accurate predictions.</p> <h3 id="visual-comparisons">Visual Comparisons</h3> <p>Comparisons with outputs from other models highlight CasCast’s superior performance in capturing both macro- and micro-precipitation dynamics without issues like blurring or oversimplification that can be seen in other forecasting models. The results demonstrate that the model can distinguish between different precipitation intensities and spatial distributions with high fidelity.</p> <p>Some output image comparisons can be seen below.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/cascast_results_2-480.webp 480w,/assets/img/cascast_results_2-800.webp 800w,/assets/img/cascast_results_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/cascast_results_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Sample Model Output Image </div> <p>The animation below shows an important distinction between CasCast and other models. As can be seen below, CasCast is the model best able to match the ground truth.</p> <p>The clear advantage of CasCast is that it is the only model that is able to detect the areas of extreme precipitation (pink areas).</p> <div class="col-sm mt-3 mt-md-0"> <figure> <video src="/assets/img/high_precip.mp4" class="img-fluid rounded z-depth-1" width="auto" height="auto" autoplay="" controls=""/> </figure> </div> <div class="caption"> Video Comparing Results using Different Models </div> <h3 id="computational-efficiency">Computational Efficiency</h3> <p>Despite the high-resolution outputs, CasCast efficiently manages computational resources, making it suitable for real-time operational use.
This efficiency is very important for practical deployment in meteorological stations and emergency management systems, where timely predictions are essential.</p> <h2 id="analysis"><strong>Analysis</strong></h2> <p>The CasCast model presents several strengths and limitations. Below is an overview based on its described capabilities and performance:</p> <h3 id="pros">Pros</h3> <ul> <li> <p><strong>Enhanced Prediction Accuracy for Extreme Events:</strong> CasCast excels in predicting extreme precipitation events. The model surpasses baseline models in accuracy for regional extreme-precipitation nowcasting, making it a valuable tool for sectors that rely heavily on accurate weather predictions, such as agriculture, transportation, and public safety, as well as disaster management and emergency services.</p> </li> <li> <p><strong>Efficient Computational Performance:</strong> CasCast also uses computational resources efficiently. This efficiency makes it feasible for real-time applications.</p> </li> </ul> <h3 id="cons">Cons</h3> <ul> <li> <p><strong>Complexity in Training and Implementation:</strong> The dual-model structure of CasCast adds complexity in training and model tuning. This complexity could pose challenges in terms of the time and resources needed for model training and optimization.</p> </li> <li> <p><strong>Generalizability Across Different Regions:</strong> One deficiency is that CasCast needs to be retrained for different geographic regions and conditions to maintain its accuracy and effectiveness. This requirement limits its applicability in global settings without retraining.</p> </li> </ul> <h2 id="conclusion"><strong>Conclusion</strong></h2> <h3 id="future-work">Future Work</h3> <p>The authors of this paper propose to further explore incorporating additional data sources, like wind and temperature data, to train the model.</p> <p>Another aspect that needs further research is multi-region training.
The goal would be to train the model to work on any region, or to design a unified model that can work on multiple datasets.</p> <h3 id="what-should-you-remember">What should you remember?</h3> <p>So, what should you remember from this paper?</p> <ul> <li>First, it proposes a dual-model approach: CasCast combines deterministic and probabilistic models to achieve high-resolution forecasts.</li> <li>CasCast can enhance disaster preparedness by providing more detailed and reliable forecasts.</li> <li>Finally, CasCast is more efficient. It manages computational resources efficiently, making it suitable for real-time applications.</li> </ul> <h2 id="references"><strong>References</strong></h2> <ul> <li><a href="https://arxiv.org/abs/2402.04290">CasCast: Skillful High-resolution Precipitation Nowcasting via Cascaded Modelling</a></li> <li><a href="https://arxiv.org/pdf/2006.11239">Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.</a></li> <li>Chen, Zhang Chunze, Liu Jingtian, and Zeng (2019). Generative Adversarial Networks Capabilities for Super-Resolution Reconstruction of Weather Radar Echo Images. Atmosphere, 10, 555. <a href="https://www.researchgate.net/publication/335862244_Generative_Adversarial_Networks_Capabilities_for_Super-Resolution_Reconstruction_of_Weather_Radar_Echo_Images">10.3390/atmos10090555</a></li> <li>All images, tables, and figures in this post can be found in the references above.</li> </ul> ]]></content><author><name></name></author><category term="blog-posts"/><category term="deep-learning"/><category term="weather-forecasting"/><summary type="html"><![CDATA[A blog describing skillful high-resolution precipitation nowcasting via cascaded modeling (CasCast)]]></summary></entry></feed>