📚 Overview

Audio Flamingo 3 is a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces:

(i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities: speech, sound, and music;

(ii) flexible, on-demand thinking, allowing the model to do chain-of-thought reasoning before answering;

(iii) multi-turn, multi-audio chat;

(iv) long audio understanding and reasoning (including speech) up to 10 minutes; and

(v) voice-to-voice interaction.

To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

Key Features:

💡 Audio Flamingo 3 has strong sound, music, and speech understanding capabilities.

💡 Audio Flamingo 3 supports on-demand thinking for chain-of-thought reasoning.

💡 Audio Flamingo 3 supports long audio and speech understanding for audio clips up to 10 minutes long.

💡 Audio Flamingo 3 can hold multi-turn, multi-audio chats with users in complex contexts.

💡 Audio Flamingo 3 has voice-to-voice conversation abilities.
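
Since long audio understanding is stated to cover clips up to 10 minutes, it can help to check a file's duration before uploading it. The following is a minimal sketch using only the Python standard library. It assumes WAV input, and the helper names (`wav_duration_seconds`, `fits_long_audio_limit`, `make_silent_wav`) are hypothetical and not part of AF3 or this demo. How the demo itself handles longer clips is not specified here.

```python
import io
import wave

# Assumption: the 10-minute figure from the overview, expressed in seconds.
MAX_SECONDS = 10 * 60


def wav_duration_seconds(source) -> float:
    """Return the duration in seconds of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_long_audio_limit(source, max_seconds: float = MAX_SECONDS) -> bool:
    """Return True if the clip is no longer than max_seconds."""
    return wav_duration_seconds(source) <= max_seconds


def make_silent_wav(seconds: float, sample_rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit silent WAV, used only to try the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer
```

For a file on disk, `fits_long_audio_limit("clip.wav")` (with `clip.wav` as a placeholder path) would report whether the clip is within the stated limit. Other formats such as MP3 or FLAC would need a different reader.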