Multimodal AI - Audio-Text-To-Text Modality (Resources, Notes)

Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.

Last updated: 06/04/2026

Overview

This repository catalogs and analyzes a relatively small but significant subclassification of multimodal models: those with native audio support, taking audio (alongside text) as input and producing text output. Another category potentially in scope is any-to-any: models which, as the name suggests, are built to handle any input and output pairing.

As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.

While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.

Hugging Face Task Classification Mapping

The focus of this resource list maps to these two tasks in Hugging Face's current classification system for AI tasks:

- Audio Text To Text
- Omni / All-Modality Multimodal

Hugging Face Resources

Repository Index

Core Documentation

Notes & Research

AI-Generated Analysis

The ask-ai/ directory contains AI-assisted research outputs:

Data

Resources & Links

Evaluations & Benchmarking

A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.

Test Prompt Categories

Human-Authored Prompts (by-daniel/):

AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.
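To make the evaluation idea concrete, here is one plausible way a test prompt entry could be represented in code. This is an illustrative sketch only: the class name, field names, file paths, and category labels below are assumptions, not the repository's actual schema.

```python
# Hypothetical sketch of an evaluation prompt entry; all names are illustrative.
from dataclasses import dataclass


@dataclass
class AudioEvalPrompt:
    category: str       # e.g. "formatting", "beyond-transcription"
    instruction: str    # text prompt sent alongside the audio
    audio_file: str     # path to the test clip (placeholder path)
    checks: list[str]   # markers the model's response should contain


PROMPTS = [
    AudioEvalPrompt(
        category="formatting",
        instruction="Transcribe this clip and return it as a bulleted list.",
        audio_file="clips/meeting.wav",
        checks=["- "],
    ),
    AudioEvalPrompt(
        category="beyond-transcription",
        instruction="Describe the speaker's tone, not the words.",
        audio_file="clips/announcement.wav",
        checks=["tone"],
    ),
]


def passes(response: str, prompt: AudioEvalPrompt) -> bool:
    """A response passes if every expected marker appears in it."""
    return all(marker in response for marker in prompt.checks)
```

A harness built this way can score a model per category, separating models that merely transcribe from those that follow audio-grounded instructions.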

Why Audio Multimodal Matters

Classic STT vs. Audio Multimodal

The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.

Audio multimodal models are fundamentally different:

Practical Advantages

Instead of chaining: Whisper → GPT-4 → Formatting

Audio multimodal enables: Single API call with system prompt → Formatted output
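The contrast between the two approaches can be sketched with stub functions. Everything below is hypothetical: the function bodies simulate model calls rather than invoking any real ASR or multimodal API, purely to show where the stages live in each design.

```python
# Illustrative stubs only -- no real model or API is called here.

def whisper_transcribe(audio: bytes) -> str:
    """Stub for a classic STT step (e.g. Whisper): audio -> raw transcript."""
    return "um so the meeting is uh moved to friday"


def llm_cleanup(transcript: str) -> str:
    """Stub for a text-only LLM pass that strips disfluencies."""
    for filler in ("um ", "uh "):
        transcript = transcript.replace(filler, "")
    return transcript


def format_output(text: str) -> str:
    """Stub for a final formatting step."""
    return text.capitalize() + "."


def classic_stt_pipeline(audio: bytes) -> str:
    # Three chained stages, each a separate model/API call in practice.
    return format_output(llm_cleanup(whisper_transcribe(audio)))


def audio_multimodal_call(audio: bytes, system_prompt: str) -> str:
    """Stub for a single audio-text-to-text call: the system prompt steers
    transcription, cleanup, and formatting inside one inference."""
    _ = system_prompt  # a real model would condition generation on this
    return "So the meeting is moved to friday."
```

The end result is the same, but the multimodal path has one call site, one failure mode, and one place to put instructions, instead of three stages glued together.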

Use cases:

Featured Models

See [models/](models/) for detailed profiles:

Any-to-Any (Omni-Modal)

Audio-Text-to-Text

Providers

See [providers.md](providers.md) for the full list, or [companies.md](companies.md) for a company-to-models mapping:

Benchmarks

See [benchmarks.md](benchmarks.md) for full coverage of evaluation frameworks and leaderboards.

Leaderboards: AudioBench · Open ASR

External Resources

Future of Voice AI

Audio multimodal models may well be the successor to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference, without the complexity of VAD, punctuation restoration, and post-processing chains, makes this an elegant and powerful approach to voice AI.

Updates

This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.

Created: December 7, 2025 | Updated: December 8, 2025