Speech & ASR Evaluations Index
Index of evaluations and experiments assessing Automatic Speech Recognition (ASR) and Speech-to-Text (STT) system performance under various conditions.
Author: Daniel Rosehill
Related index: danielrosehill/Experiments-And-Evaluations-Index (GitHub)
Evaluations
Oct 2025 — Local Whisper Model Comparison
Question: What is the difference in accuracy between various Whisper models on local inference, and how do derivative engines compare against original Whisper?
Links: GitHub (danielrosehill/Local-ASR-STT-Benchmark)
Summary: Benchmarked multiple Whisper model variants running locally on AMD GPU hardware (ROCm) through Speech Note on Ubuntu. Compared stock Whisper models of different sizes against derivative engines to identify the best-performing local STT option for the hardware.
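A comparison like this implies a small scoring harness. The sketch below is illustrative only: the model names and hypothesis transcripts are placeholders standing in for real outputs (e.g. from Speech Note's engines), not code from the repo.

```python
# Toy benchmark harness: rank candidate STT models by mean word error rate
# against reference transcripts. The hypotheses here are placeholders
# standing in for real model outputs.

def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # edit distances for empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def rank_models(references, hypotheses_by_model):
    """Return (model, mean WER) pairs, best model first."""
    scores = {
        model: sum(wer(r, h) for r, h in zip(references, hyps)) / len(references)
        for model, hyps in hypotheses_by_model.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])

refs = ["the quick brown fox", "speech note runs on ubuntu"]
outputs = {
    "whisper-small": ["the quick brown fox", "speech note runs on ubuntu"],
    "whisper-tiny":  ["the quick brown box", "speech node runs on ubuntu"],
}
print(rank_models(refs, outputs))  # whisper-small ranks first with WER 0.0
```

Swapping the placeholder dict for real per-model transcripts is all that is needed to reproduce this kind of leaderboard.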
Nov 2025 — Long-Form Audio Transcription
Question: How do ASR systems perform on extended audio recordings compared to shorter clips?
Links: GitHub (danielrosehill/Long-Form-Audio-Eval)
Summary: Single-shot STT benchmark focused on long-form audio. Evaluated how transcription quality degrades or holds up as recording length increases beyond typical short-clip evaluation sets.
Nov 2025 — Podcast ASR Evaluation
Question: How accurately do ASR systems transcribe podcast-style audio?
Links: Dataset
Summary: Evaluation of ASR performance on podcast recordings, testing how conversational speech patterns, multiple speakers, and natural audio quality affect transcription accuracy.
Nov 2025 — STT System Comparison
Question: How do different STT systems compare head-to-head on the same audio inputs?
Links: Space
Summary: Side-by-side comparison of multiple STT systems on identical test audio, providing a direct performance comparison across providers.
Nov 2025 — English-Hebrew Code-Switched Speech
Question: How well do standard STT systems handle code-switched English-Hebrew speech patterns common among English-speaking immigrants in Israel?
Links: Dataset
Summary: Created 516 audio-text pairs of English sentences with naturally interspersed Hebrew words across domains like government, healthcare, and documents. Tests how well ASR handles the kind of mixed-language speech typical of Anglophones living in Israel.
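For illustration of what a code-switched text pair looks like, a template-style sketch is below. The dataset itself was human-recorded; these template sentences and Hebrew terms are my own invented examples, not entries from it.

```python
# Hypothetical sketch: English sentences with interspersed Hebrew terms,
# of the kind an Anglophone in Israel might produce. The vocabulary and
# templates are illustrative, not taken from the dataset.

HEBREW_TERMS = {
    "appointment": "תור",            # tor
    "identity card": "תעודת זהות",   # teudat zehut
    "health fund": "קופת חולים",     # kupat cholim
}

def code_switch(template: str) -> str:
    """Replace {slot} placeholders with their Hebrew equivalents."""
    out = template
    for english, hebrew in HEBREW_TERMS.items():
        out = out.replace("{" + english + "}", hebrew)
    return out

print(code_switch("I need to book a {appointment} at the {health fund}."))
```

The resulting reference text mixes scripts mid-sentence, which is exactly the condition that trips up monolingual ASR decoding.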
Nov 2025 — Whisper Fine-Tune vs Commercial APIs
Question: Can fine-tuning Whisper achieve measurable WER reductions, even when comparing local inference against cloud-based commercial models?
Summary: Fine-tuned Whisper Large Turbo running locally achieved 5.84% WER, beating the best commercial API tested via Eden AI (AssemblyAI). Demonstrates that even a quick fine-tune on personal voice data can outperform paid cloud ASR services.
Nov 2025 — Tech Vocabulary ASR Training Data
Question: Can a specialized speech dataset improve ASR performance on technical and developer vocabulary?
Links: Dataset
Summary: Work-in-progress dataset of 205 human-recorded samples (38 min, 10K words) targeting developer and technical vocabulary for Whisper fine-tuning. Covers software engineering terms, GitHub references, and programming jargon that stock models commonly misrecognize.
Nov 2025 — Voice-to-Vector RAG Pipeline Test
Question: Can voice data be reliably transcribed, structured, and upserted into a vector database for accurate retrieval?
Links: Dataset
Summary: Synthetic dataset simulating a job seeker narrating career trajectory, used to test a voice-to-vector-database RAG pipeline: MP3 → transcription → structured context → Pinecone/Ragie upsert → retrieval accuracy evaluation.
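The retrieval end of that pipeline can be sketched with an in-memory store. Everything below is a stand-in: bag-of-words counts replace a real embedding model, and a toy class replaces Pinecone/Ragie; it only illustrates the upsert → query → match flow.

```python
# Toy stand-in for the vector-retrieval stage: transcribed voice chunks are
# "embedded" as bag-of-words counts and retrieved by cosine similarity.
# A real pipeline would use a proper embedding model and a managed store.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    def __init__(self):
        self.rows = []  # (chunk_text, vector)

    def upsert(self, chunk: str):
        self.rows.append((chunk, embed(chunk)))

    def query(self, question: str) -> str:
        qv = embed(question)
        return max(self.rows, key=lambda row: cosine(qv, row[1]))[0]

store = ToyVectorStore()
store.upsert("I spent four years as a backend engineer working mostly in Go")
store.upsert("Before that I managed a small marketing team")
print(store.query("backend engineer experience in Go"))  # first chunk wins
```

Retrieval accuracy in the real evaluation then reduces to checking whether the chunk returned for each question is the one that actually answers it.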
Dec 2025 — Microphone Selection Impact
Question: To what extent does microphone selection affect ASR transcription accuracy?
Links: GitHub (danielrosehill/Microphone-Audio-Samples)
Summary: Collected test samples across various microphones and evaluated STT accuracy differences. Tests whether hardware choice meaningfully impacts transcription quality for the same speaker and content.
Dec 2025 — WPM & Background Noise Impact
Question: To what extent do background noise and variations in speaking pace (WPM) affect ASR transcription accuracy as measured by Word Error Rate (WER)?
Links: GitHub (danielrosehill/Whisper-WPM-Background-Noise-Eval)
Summary: Controlled evaluation testing Whisper across multiple variables: speaking pace (fast, normal, slow, whispered, loud), background noise types (cafe, music, conversations, transit, traffic, sirens, dogs, baby sounds), and microphone distance (close, normal, far). Annotated audio recordings with WER measurements for each condition.
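Per-condition results from an eval like this reduce to a groupby over annotation rows. The sketch below uses invented WER figures, not the repo's actual measurements.

```python
# Hypothetical annotation rows: (pace, noise, mic_distance, wer).
# The WER values are invented for illustration.
from collections import defaultdict

rows = [
    ("normal",    "quiet", "close", 0.04),
    ("normal",    "cafe",  "close", 0.09),
    ("fast",      "quiet", "close", 0.07),
    ("fast",      "cafe",  "far",   0.18),
    ("whispered", "quiet", "close", 0.12),
]

def mean_wer_by(rows, key_index):
    """Average WER grouped by one condition column (0=pace, 1=noise, 2=distance)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[3])
    return {k: sum(v) / len(v) for k, v in groups.items()}

print(mean_wer_by(rows, 1))  # mean WER per background-noise condition
```

The same helper, called with a different column index, isolates the pace or distance effect from the same annotated rows.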
Dec 2025 — Transcription Cleanup Evaluation
Question: How do various cloud audio understanding models perform on the transcribe-and-cleanup workflow?
Links: GitHub (danielrosehill/Transcription-Cleanup-Eval-1225)
Summary: Evaluated multiple cloud-based audio understanding models on their ability not just to transcribe but also to clean up and format transcriptions. Compared end-to-end quality of the combined transcription-plus-post-processing pipeline.
Dec 2025 — Fine-Tuned vs Stock Whisper Models
Question: How much accuracy improvement can be achieved through fine-tuning Whisper models compared to stock models on local inference?
Links: GitHub (danielrosehill/Fine-Tune-Accuracy-Evaluation) · GitHub (danielrosehill/Whisper-Fine-Tune-Accuracy-Eval) · Dataset
Summary: Compared fine-tuned Whisper against stock Whisper on local inference using a 92-sample evaluation dataset covering technical vocabulary, English-Hebrew code-switching, and various speaking styles. Ground-truth transcriptions provided for WER measurement.
Mar 2026 — Gemini 3.1 Lite Audio Understanding
Question: How well does Gemini 3.1 Lite handle audio understanding tasks beyond simple transcription?
Links: GitHub (danielrosehill/Gemini-31-Lite-Audio-Understanding-Eval) · Dataset · Space
Summary: Tested Gemini 3.1 Flash Lite on 137 prompts across 22 categories, paired with a 20-minute voice sample. Categories include speaker analysis, emotion detection, audio engineering, voice cloning, and forensic audio. 49 completed model outputs demonstrate the model's capabilities and limitations across diverse audio understanding tasks.
Mar 2026 — Single-Shot ASR Evaluation
Links: Space
Summary: Single-shot evaluation interface for quick ASR benchmarking against individual audio samples.
Datasets
ASR-WPM-And-Background-Noise-Eval — Controlled audio samples testing pace, noise, and distance variables · HF
English-Hebrew-Mixed-Sentences — 516 code-switched English-Hebrew speech evaluation pairs · HF
Audio-Understanding-Test-Set — 137 multimodal audio understanding test prompts across 22 categories · HF
Small-STT-Eval-Audio-Dataset — 92 samples covering technical vocabulary and code-switching for STT evaluation · HF
Sample-Voice-Context-Data — Voice-to-vector-database RAG pipeline testing · HF
Tech-Sentences-For-ASR-Training — 205 technical/developer-vocabulary samples for ASR fine-tuning · HF
Whisper-Fine-Tune-One-Shot-Eval — Fine-tuned Whisper vs. commercial ASR API comparison · HF
Podcast-ASR-Evaluation — Podcast transcription ASR evaluation · HF
Spaces
Single-Shot-ASR-Eval — Quick single-shot ASR benchmarking · HF
Audio-Understanding-Experiment — Audio understanding experiment results · HF
Whisper-Fine-Tune-Eval — Whisper fine-tune vs API benchmark results · HF
STT-Comparison — Side-by-side STT system comparison · HF
License
MIT