Speech & ASR Evaluations Index
Index of evaluations and experiments assessing Automatic Speech Recognition (ASR) and Speech-to-Text (STT) system performance under various conditions.
Author: Daniel Rosehill
Related index: danielrosehill/Experiments-And-Evaluations-Index (GitHub)
Evaluations
Oct 2025 — Local Whisper Model Comparison
Question: What is the difference in accuracy between various Whisper models on local inference, and how do derivative engines compare against original Whisper?
Links: GitHub (danielrosehill/Local-ASR-STT-Benchmark)
Summary: Benchmarked multiple Whisper model variants running locally on AMD GPU hardware (ROCm) through Speech Note on Ubuntu. Compared stock Whisper models of different sizes against derivative engines to identify the best-performing local STT option for the hardware.
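A comparison like this implies a small scoring harness. The sketch below is illustrative only: the model names and hypothesis transcripts are placeholders standing in for real outputs (e.g. from Speech Note's engines), not code from the repo.

```python
# Toy benchmark harness: rank candidate STT models by mean word error rate
# against reference transcripts. The hypotheses here are placeholders
# standing in for real model outputs.

def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # edit distances for empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def rank_models(references, hypotheses_by_model):
    """Return (model, mean WER) pairs, best model first."""
    scores = {
        model: sum(wer(r, h) for r, h in zip(references, hyps)) / len(references)
        for model, hyps in hypotheses_by_model.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])

refs = ["the quick brown fox", "speech note runs on ubuntu"]
outputs = {
    "whisper-small": ["the quick brown fox", "speech note runs on ubuntu"],
    "whisper-tiny":  ["the quick brown box", "speech node runs on ubuntu"],
}
print(rank_models(refs, outputs))  # whisper-small ranks first with WER 0.0
```

Swapping the placeholder dict for real per-model transcripts is all that is needed to reproduce this kind of leaderboard.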
Nov 2025 — Long-Form Audio Transcription
Question: How do ASR systems perform on extended audio recordings compared to shorter clips?
Links: GitHub (danielrosehill/Long-Form-Audio-Eval)
Summary: Single-shot STT benchmark focused on long-form audio. Evaluated how transcription quality degrades or holds up as recording length increases beyond typical short-clip evaluation sets.
Nov 2025 — Podcast ASR Evaluation
Question: How accurately do ASR systems transcribe podcast-style audio?
Links: Dataset
Summary: Evaluation of ASR performance on podcast recordings, testing how conversational speech patterns, multiple speakers, and natural audio quality affect transcription accuracy.
Nov 2025 — STT System Comparison
Question: How do different STT systems compare head-to-head on the same audio inputs?
Links: Space
Summary: Side-by-side comparison of multiple STT systems on identical test audio, providing a direct performance comparison across providers.
Nov 2025 — English-Hebrew Code-Switched Speech
Question: How well do standard STT systems handle code-switched English-Hebrew speech patterns common among English-speaking immigrants in Israel?
Links: Dataset
Summary: Created 516 audio-text pairs of English sentences with naturally interspersed Hebrew words across domains like government, healthcare, and documents. Tests how well ASR handles the kind of mixed-language speech typical of Anglophones living in Israel.
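For illustration of what a code-switched text pair looks like, a template-style sketch is below. The dataset itself was human-recorded; these template sentences and Hebrew terms are my own invented examples, not entries from it.

```python
# Hypothetical sketch: English sentences with interspersed Hebrew terms,
# of the kind an Anglophone in Israel might produce. The vocabulary and
# templates are illustrative, not taken from the dataset.

HEBREW_TERMS = {
    "appointment": "תור",            # tor
    "identity card": "תעודת זהות",   # teudat zehut
    "health fund": "קופת חולים",     # kupat cholim
}

def code_switch(template: str) -> str:
    """Replace {slot} placeholders with their Hebrew equivalents."""
    out = template
    for english, hebrew in HEBREW_TERMS.items():
        out = out.replace("{" + english + "}", hebrew)
    return out

print(code_switch("I need to book a {appointment} at the {health fund}."))
```

The resulting reference text mixes scripts mid-sentence, which is exactly the condition that trips up monolingual ASR decoding.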
Nov 2025 — Whisper Fine-Tune vs Commercial APIs
Question: Can fine-tuning Whisper achieve measurable WER reductions, even when comparing local inference against cloud-based commercial models?
Summary: Fine-tuned Whisper Large Turbo running locally achieved 5.84% WER, beating the best commercial API tested via Eden AI (AssemblyAI). Demonstrates that even a quick fine-tune on personal voice data can outperform paid cloud ASR services.
Nov 2025 — Tech Vocabulary ASR Training Data
Question: Can a specialized speech dataset improve ASR performance on technical and developer vocabulary?
Links: Dataset
Summary: Work-in-progress dataset of 205 human-recorded samples (38 min, 10K words) targeting developer and technical vocabulary for Whisper fine-tuning. Covers software engineering terms, GitHub references, and programming jargon that stock models commonly misrecognize.
Nov 2025 — Voice-to-Vector RAG Pipeline Test
Question: Can voice data be reliably transcribed, structured, and upserted into a vector database for accurate retrieval?
Links: Dataset
Summary: Synthetic dataset simulating a job seeker narrating career trajectory, used to test a voice-to-vector-database RAG pipeline: MP3 → transcription → structured context → Pinecone/Ragie upsert → retrieval accuracy evaluation.
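The retrieval end of that pipeline can be sketched with an in-memory store. Everything below is a stand-in: bag-of-words counts replace a real embedding model, and a toy class replaces Pinecone/Ragie; it only illustrates the upsert → query → match flow.

```python
# Toy stand-in for the vector-retrieval stage: transcribed voice chunks are
# "embedded" as bag-of-words counts and retrieved by cosine similarity.
# A real pipeline would use a proper embedding model and a managed store.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    def __init__(self):
        self.rows = []  # (chunk_text, vector)

    def upsert(self, chunk: str):
        self.rows.append((chunk, embed(chunk)))

    def query(self, question: str) -> str:
        qv = embed(question)
        return max(self.rows, key=lambda row: cosine(qv, row[1]))[0]

store = ToyVectorStore()
store.upsert("I spent four years as a backend engineer working mostly in Go")
store.upsert("Before that I managed a small marketing team")
print(store.query("backend engineer experience in Go"))  # first chunk wins
```

Retrieval accuracy in the real evaluation then reduces to checking whether the chunk returned for each question is the one that actually answers it.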
Dec 2025 — Microphone Selection Impact
Question: To what extent does microphone selection affect ASR transcription accuracy?
Links: GitHub (danielrosehill/Microphone-Audio-Samples)
Summary: Collected test samples across various microphones and evaluated STT accuracy differences. Tests whether hardware choice meaningfully impacts transcription quality for the same speaker and content.
Dec 2025 — WPM & Background Noise Impact
Question: To what extent do background noise and variations in speaking pace (WPM) affect ASR transcription accuracy as measured by Word Error Rate (WER)?
Links: GitHub (danielrosehill/Whisper-WPM-Background-Noise-Eval)
Summary: Controlled evaluation testing Whisper across multiple variables: speaking pace (fast, normal, slow, whispered, loud), background noise types (cafe, music, conversations, transit, traffic, sirens, dogs, baby sounds), and microphone distance (close, normal, far). Annotated audio recordings with WER measurements for each condition.
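Per-condition results from an eval like this reduce to a groupby over annotation rows. The sketch below uses invented WER figures, not the repo's actual measurements.

```python
# Hypothetical annotation rows: (pace, noise, mic_distance, wer).
# The WER values are invented for illustration.
from collections import defaultdict

rows = [
    ("normal",    "quiet", "close", 0.04),
    ("normal",    "cafe",  "close", 0.09),
    ("fast",      "quiet", "close", 0.07),
    ("fast",      "cafe",  "far",   0.18),
    ("whispered", "quiet", "close", 0.12),
]

def mean_wer_by(rows, key_index):
    """Average WER grouped by one condition column (0=pace, 1=noise, 2=distance)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[3])
    return {k: sum(v) / len(v) for k, v in groups.items()}

print(mean_wer_by(rows, 1))  # mean WER per background-noise condition
```

The same helper, called with a different column index, isolates the pace or distance effect from the same annotated rows.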
Dec 2025 — Transcription Cleanup Evaluation
Question: How do various cloud audio understanding models perform on the transcribe-and-cleanup workflow?
Links: GitHub (danielrosehill/Transcription-Cleanup-Eval-1225)
Summary: Evaluated multiple cloud-based audio understanding models on their ability not just to transcribe but also to clean up and format transcriptions. Compared end-to-end quality of the combined transcription-plus-post-processing pipeline.
Dec 2025 — Fine-Tuned vs Stock Whisper Models
Question: How much accuracy improvement can be achieved through fine-tuning Whisper models compared to stock models on local inference?
Links: GitHub (danielrosehill/Fine-Tune-Accuracy-Evaluation) · GitHub (danielrosehill/Whisper-Fine-Tune-Accuracy-Eval) · Dataset
Summary: Compared fine-tuned Whisper against stock Whisper on local inference using a 92-sample evaluation dataset covering technical vocabulary, English-Hebrew code-switching, and various speaking styles. Ground-truth transcriptions provided for WER measurement.
Mar 2026 — Gemini 3.1 Lite Audio Understanding
Question: How well does Gemini 3.1 Lite handle audio understanding tasks beyond simple transcription?
Links: GitHub (danielrosehill/Gemini-31-Lite-Audio-Understanding-Eval) · Dataset · Space
Summary: Tested Gemini 3.1 Flash Lite on 137 prompts across 22 categories, paired with a 20-minute voice sample. Categories include speaker analysis, emotion detection, audio engineering, voice cloning, and forensic audio. 49 completed model outputs demonstrate the model's capabilities and limitations across diverse audio understanding tasks.
Mar 2026 — Single-Shot ASR Evaluation
Links: Space
Summary: Single-shot evaluation interface for quick ASR benchmarking against individual audio samples.
Datasets
ASR-WPM-And-Background-Noise-Eval — Controlled audio samples testing pace, noise, and distance variables · HF
English-Hebrew-Mixed-Sentences — 516 code-switched English-Hebrew speech evaluation pairs · HF
Audio-Understanding-Test-Set — 137 multimodal audio understanding test prompts across 22 categories · HF
Small-STT-Eval-Audio-Dataset — 92 samples covering technical vocabulary and code-switching for STT evaluation · HF
Sample-Voice-Context-Data — Voice-to-vector-database RAG pipeline testing · HF
Tech-Sentences-For-ASR-Training — 205 technical/developer-vocabulary samples for ASR fine-tuning · HF
Whisper-Fine-Tune-One-Shot-Eval — Fine-tuned Whisper vs. commercial ASR API comparison · HF
Podcast-ASR-Evaluation — Podcast transcription ASR evaluation · HF
Spaces
Single-Shot-ASR-Eval — Quick single-shot ASR benchmarking · HF
Audio-Understanding-Experiment — Audio understanding experiment results · HF
Whisper-Fine-Tune-Eval — Whisper fine-tune vs API benchmark results · HF
STT-Comparison — Side-by-side STT system comparison · HF
License
MIT