Multimodal AI - Audio-Text-To-Text Modality (Resources, Notes)
Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.
Overview
This repository catalogs and analyzes a small but significant subcategory of multimodal models: those with native audio support, i.e., models that accept audio together with text and produce text output (audio-text-to-text). A second category potentially in scope is any-to-any: models which, as the name suggests, are built to handle any input and output pairing.
As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.
While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.
Hugging Face Task Classification Mapping
The focus of this resource list maps to these two tasks in Hugging Face's (current) classification system for AI tasks:
audio-text-to-text — Models that accept audio + text input and produce text output · Task Models
any-to-any — Omni-modal models handling any input/output pairing (subsumes audio) · Task Models
Hugging Face Resources
Audio Text To Text
Task Overview — audio-text-to-text
Models (Trending) — Browse models
Datasets — Browse datasets
Omni / All-Modality Multimodal
Task Overview — any-to-any
Models (Trending) — Browse models
Datasets — Browse datasets
Repository Index
Core Documentation
models/index.md — Complete index of all audio multimodal models
models.md — Featured open-source audio multimodal models with detailed profiles
companies.md — Companies developing audio multimodal models (open source focus)
providers.md — Organizations developing audio multimodal (open & closed source)
benchmarks.md — Evaluation frameworks and leaderboards
scope.md — Definition of what "audio multimodal" means in this context
Notes & Research
notes/ — Personal notes on nomenclature, parameters, and reference links
notes/nomenclature.md — Terminology and naming conventions
notes/parameters.md — Model parameter sizes for deployment planning
notes/ref.md — Quick reference links (HuggingFace task pages)
AI-Generated Analysis
The ask-ai/ directory contains AI-assisted research outputs:
ask-ai/prompt.md — The prompt used to generate the analysis
ask-ai/outputs/models.md — Comprehensive model list beyond featured models
ask-ai/outputs/nomenclature.md — Terminology analysis across vendors and research
ask-ai/outputs/benchmarks.md — Extended benchmark coverage by workflow type
ask-ai/outputs/pros-cons.md — Comparison of STT vs pipeline vs multimodal approaches
ask-ai/outputs/redundancy-analysis.md — Will multimodal ASR make traditional STT redundant?
ask-ai/outputs/ecosystem.md — Ecosystem overview and emerging trends
Data
data/ — Raw exports from Hugging Face API (CSV/JSON)
Resources & Links
resource-lists.md — Curated awesome-lists for multimodal AI
models-hf.md — GitHub repositories for audio multimodal models
papers.md — Research papers and academic resources
tooling.md — Data pipeline and processing tools
eval-tools.md — Evaluation frameworks and test prompts
inference-tools.md — Tools for running inference at scale
demos-and-starters.md — Example implementations and starter projects
github-tags.md — GitHub topic pages for discovery
Evaluations & Benchmarking
A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.
evaluations/README.md — Evaluation framework overview and methodology
evaluations/test-prompts/ — Complete test prompt library
Test Prompt Categories
Human-Authored Prompts (by-daniel/):
accent-identification.md — Regional accent detection with grounded examples
guess-my-mood.md — Emotional analysis, fatigue detection, word-tone dissonance
non-verbal-context.md — Multi-speaker interpersonal dynamics, pauses as communication
parameters.md — Vocal frequency analysis for audio engineering (EQ recommendations)
who-is-this.md — Speaker identification/recognition
AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.
Why Audio Multimodal Matters
Classic STT vs. Audio Multimodal
The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.
Audio multimodal models are fundamentally different:
Native audio understanding: Process audio tokens directly alongside text prompts
Unified inference: Single API call handles transcription, formatting, and summarization
Prompt-guided processing: Can be instructed to analyze accents, describe voices, or format output
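The prompt-guided pattern above can be sketched as a single multimodal chat turn, where the audio clip and the instruction travel together in one message. The structure below follows the Qwen2-Audio-style conversation format used with Hugging Face `transformers`; the file name and prompt are illustrative.

```python
# One multimodal turn: audio and instruction in the same user message.
# The same call can transcribe, analyze an accent, or summarize, depending
# only on the text prompt. (Qwen2-Audio-style chat format; illustrative.)
conversation = [
    {"role": "system", "content": "You are a meeting assistant. Output a bulleted summary."},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "meeting.wav"},  # raw audio, not a transcript
            {"type": "text", "text": "Summarize the decisions and note the speaker's tone."},
        ],
    },
]

# With a loaded processor and model, this becomes a single generate() call, roughly:
#   text = processor.apply_chat_template(conversation, add_generation_prompt=True)
#   inputs = processor(text=text, audios=[waveform], return_tensors="pt")
#   output_ids = model.generate(**inputs, max_new_tokens=256)
```

Because the instruction is just text, swapping "summarize" for "identify the accent" changes the task without touching the audio front-end.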
Practical Advantages
Instead of chaining: Whisper → GPT-4 → Formatting
Audio multimodal enables: Single API call with system prompt → Formatted output
Use cases:
Voice journals with structured formatting
Conference call summarization
Accent/voice analysis
Long-form audio processing (tested with 1-hour recordings)
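The contrast between the two workflows can be sketched as code. The helper functions below are hypothetical stand-ins for real services (an ASR model, a punctuation model, a text LLM, and an audio multimodal model); they return canned strings so the comparison is runnable.

```python
# Hypothetical stand-ins for real services; each returns canned text so the
# two workflows below can run side by side.
def asr_transcribe(audio_path):      # stage 1 of the classic chain (e.g. Whisper)
    return "um so we decided to ship on friday"

def restore_punctuation(text):       # stage 2: separate punctuation model
    return text.capitalize() + "."

def llm_summarize(text):             # stage 3: text-only LLM for formatting
    return f"- Decision: {text}"

def audio_llm(audio, prompt):        # single multimodal call (hypothetical API)
    return "- Decision: Ship on Friday."

def pipeline_approach(audio_path):
    """Classic chain: three models, three hops, three failure points."""
    return llm_summarize(restore_punctuation(asr_transcribe(audio_path)))

def multimodal_approach(audio_path):
    """One inference: the prompt steers transcription, cleanup, and formatting."""
    return audio_llm(audio=audio_path, prompt="Transcribe and summarize decisions as bullets.")
```

The multimodal version also keeps paralinguistic signal (tone, hesitation, accent) available to the model, which the text-only chain discards at the transcription step.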
Featured Models
See [models/](models/) for detailed profiles:
Any-to-Any (Omni-Modal)
Qwen Omni — Alibaba · 7B-35B · Apache 2.0
Gemma 3n — Google · 2B-4B effective · Gemma
Macaw-LLM — Chenyang Lyu et al. · 7B-13B · Apache 2.0
Audio-Text-to-Text
Audio Flamingo 3 — NVIDIA · 8B · Non-commercial
BuboGPT — ByteDance · 7B-13B · BSD 3-Clause
Kimi-Audio — Moonshot AI · 10B · MIT/Apache 2.0
OmniAudio — NexaAI · 2.6B · Apache 2.0
Phi-4-Multimodal — Microsoft · 5.6B · MIT
Qwen2-Audio — Alibaba · 8B · Apache 2.0
SALMONN — ByteDance/Tsinghua · 7B-13B · Apache 2.0
Soundwave — FreedomIntelligence · 9B · Apache 2.0
Step-Audio-Chat — StepFun · 130B · Apache 2.0
Step-Audio-R1 — StepFun · 33B · Apache 2.0
Ultravox — Fixie.ai · 8B-70B · MIT
Voxtral — Mistral AI · 5B-24B · Apache 2.0
Providers
See [providers.md](providers.md) for the full list, or [companies.md](companies.md) for a company-to-models mapping:
Open Source: Alibaba, ByteDance, Fixie.ai, FreedomIntelligence, Google DeepMind, Microsoft, Mistral AI, Moonshot AI, NexaAI, NVIDIA, StepFun
Closed Source: Google (Gemini), OpenAI (GPT-4o), Anthropic (Claude), Reka AI
Benchmarks
See [benchmarks.md](benchmarks.md) for full coverage of evaluation frameworks and leaderboards.
MSEB — Google Research · Sound embedding evaluation · GitHub: google-research/mseb · Blog
UltraEval-Audio — OpenBMB · Speech understanding & generation · GitHub: OpenBMB/UltraEval-Audio
lmms-eval — EvolvingLMMs Lab · 100+ multimodal tasks · GitHub: EvolvingLMMs-Lab/lmms-eval
VERSA — WavLab Speech · 90+ speech/audio metrics · GitHub: wavlab-speech/versa
AudioBench — AudioLLMs · Comprehensive audio LLM benchmark · GitHub: AudioLLMs/AudioBench · Leaderboard
Leaderboards: AudioBench · Open ASR
External Resources
Awesome-Audio-LLM — Curated list of audio LLM research · GitHub: AudioLLMs/Awesome-Audio-LLM
Hugging Face audio-text-to-text models — Browse latest models
Hugging Face ASR models - Traditional ASR for comparison
Future of Voice AI
Audio multimodal models may well be the successors to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference pass, without the complexity of VAD, punctuation restoration, and post-processing chains, makes this an elegant and powerful approach to voice AI.
Updates
This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.
Created: December 7, 2025 | Updated: December 8, 2025