Multimodal AI - Audio-Text-To-Text Modality (Resources, Notes)

Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.

Last updated: 06/04/2026

Overview

This repository catalogs and analyzes a relatively small but significant subclassification of multimodal models: those with native audio support, taking audio (alongside text) as input and producing text output. Another category potentially in scope is any-to-any: models which, as the name suggests, are built to handle any input and output pairing.

As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.

While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.

Hugging Face Task Classification Mapping

The focus of this resource list maps to these two tasks in Hugging Face's current classification system for AI tasks:

- Audio Text To Text
- Omni / All-Modality Multimodal

Hugging Face Resources

Repository Index

Core Documentation

Notes & Research

AI-Generated Analysis

The ask-ai/ directory contains AI-assisted research outputs:

Data

Resources & Links

Evaluations & Benchmarking

A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.

Test Prompt Categories

Human-Authored Prompts (by-daniel/):

AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.
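To make the evaluation idea concrete, here is one plausible way a test prompt entry could be represented in code. This is an illustrative sketch only: the class name, field names, file paths, and category labels below are assumptions, not the repository's actual schema.

```python
# Hypothetical sketch of an evaluation prompt entry; all names are illustrative.
from dataclasses import dataclass


@dataclass
class AudioEvalPrompt:
    category: str       # e.g. "formatting", "beyond-transcription"
    instruction: str    # text prompt sent alongside the audio
    audio_file: str     # path to the test clip (placeholder path)
    checks: list[str]   # markers the model's response should contain


PROMPTS = [
    AudioEvalPrompt(
        category="formatting",
        instruction="Transcribe this clip and return it as a bulleted list.",
        audio_file="clips/meeting.wav",
        checks=["- "],
    ),
    AudioEvalPrompt(
        category="beyond-transcription",
        instruction="Describe the speaker's tone, not the words.",
        audio_file="clips/announcement.wav",
        checks=["tone"],
    ),
]


def passes(response: str, prompt: AudioEvalPrompt) -> bool:
    """A response passes if every expected marker appears in it."""
    return all(marker in response for marker in prompt.checks)
```

A harness built this way can score a model per category, separating models that merely transcribe from those that follow audio-grounded instructions.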

Why Audio Multimodal Matters

Classic STT vs. Audio Multimodal

The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.

Audio multimodal models are fundamentally different:

Practical Advantages

Instead of chaining: Whisper → GPT-4 → Formatting

Audio multimodal enables: Single API call with system prompt → Formatted output
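The contrast between the two approaches can be sketched with stub functions. Everything below is hypothetical: the function bodies simulate model calls rather than invoking any real ASR or multimodal API, purely to show where the stages live in each design.

```python
# Illustrative stubs only -- no real model or API is called here.

def whisper_transcribe(audio: bytes) -> str:
    """Stub for a classic STT step (e.g. Whisper): audio -> raw transcript."""
    return "um so the meeting is uh moved to friday"


def llm_cleanup(transcript: str) -> str:
    """Stub for a text-only LLM pass that strips disfluencies."""
    for filler in ("um ", "uh "):
        transcript = transcript.replace(filler, "")
    return transcript


def format_output(text: str) -> str:
    """Stub for a final formatting step."""
    return text.capitalize() + "."


def classic_stt_pipeline(audio: bytes) -> str:
    # Three chained stages, each a separate model/API call in practice.
    return format_output(llm_cleanup(whisper_transcribe(audio)))


def audio_multimodal_call(audio: bytes, system_prompt: str) -> str:
    """Stub for a single audio-text-to-text call: the system prompt steers
    transcription, cleanup, and formatting inside one inference."""
    _ = system_prompt  # a real model would condition generation on this
    return "So the meeting is moved to friday."
```

The end result is the same, but the multimodal path has one call site, one failure mode, and one place to put instructions, instead of three stages glued together.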

Use cases:

Featured Models

See [models/](models/) for detailed profiles:

Any-to-Any (Omni-Modal)

Audio-Text-to-Text

Providers

See [providers.md](providers.md) for the full list, or [companies.md](companies.md) for a company-to-models mapping:

Benchmarks

See [benchmarks.md](benchmarks.md) for full coverage of evaluation frameworks and leaderboards.

Leaderboards: AudioBench · Open ASR

External Resources

Future of Voice AI

Audio multimodal models may well be the successor to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference, without the complexity of VAD, punctuation restoration, and post-processing chains, makes this an elegant and powerful approach to voice AI.

Updates

This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.

Created: December 7, 2025 | Updated: December 8, 2025