
ConnexAI’s AI Research Lab conducted a comprehensive evaluation of the accuracy of leading streaming Automatic Speech Recognition (ASR) models in the customer service industry. This benchmarking study assessed performance on real-world English-language telephony audio, capturing a wide range of accents, acoustic conditions, and conversational styles.
Providers evaluated: ConnexAI, Google, Amazon, OpenAI, Deepgram, AssemblyAI, Speechmatics

Languages evaluated: English (US, UK, Australian, and South African accents, including non-native speakers)

Evaluation datasets: Twenty-five hours of real-world contact centre audio spanning insurance, healthcare, and finance domains. The dataset included both clean and noisy conditions, natural conversational speech, and a diverse set of speakers varying in age, gender, and accent.

Ground truth transcriptions: All recordings were transcribed by ConnexAI's internal data annotation specialists to ensure high-quality, consistent references. Transcriptions were normalised according to provider-specific rules and manually reviewed to maintain fairness across benchmarks.

Processing modes evaluated: Streaming telephony (8 kHz), with upsampling applied only when required by a provider's model.
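To illustrate the upsampling step mentioned above: production pipelines typically use a proper polyphase resampler, but a minimal sketch of doubling an 8 kHz telephony signal to 16 kHz by linear interpolation (function name and approach are illustrative, not ConnexAI's actual implementation) looks like this:

```python
def upsample_2x(samples):
    """Naive 2x upsample (e.g. 8 kHz -> 16 kHz) by linear interpolation.

    Inserts the midpoint between each pair of adjacent samples. Real
    resamplers apply an anti-imaging low-pass filter; this sketch only
    shows the rate conversion itself.
    """
    if not samples:
        return []
    out = []
    for i in range(len(samples) - 1):
        out.append(samples[i])
        out.append((samples[i] + samples[i + 1]) / 2)
    out.append(samples[-1])
    return out


print(upsample_2x([0, 2, 4]))  # [0, 1.0, 2, 3.0, 4]
```

In practice a library routine such as `scipy.signal.resample_poly` would be used, since naive interpolation leaves spectral images that can degrade ASR accuracy.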
Evaluation Process
Dataset selection: Twenty-five hours of real-world contact centre audio were randomly selected to represent a broad mix of speakers, accents, ages, genders, and noise conditions.

Transcription and ground truth creation: Human experts transcribed and validated all recordings. Provider-specific normalisation rules were applied to ensure a fair comparison.

Model integration: Each ASR model was integrated according to its official documentation to ensure accurate and consistent testing.

Evaluation metrics: WER (Word Error Rate) measures transcription errors at the word level, computed as the number of word substitutions, insertions, and deletions divided by the number of words in the reference transcript.

Benchmarking: 16,311 recordings were streamed to all providers in their original telephony format, and WER was calculated for each.
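The WER metric described above can be sketched in a few lines: a word-level Levenshtein (edit) distance between reference and hypothesis, divided by the reference length. This is a minimal illustration (the function name and whitespace-based normalisation are assumptions; the actual benchmark applied provider-specific normalisation rules):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub_cost,  # match / substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution ("a" for "the") over a six-word reference: WER ~ 16.7%
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))
```

Note that WER can exceed 100% when the hypothesis contains many spurious insertions, which is why percentile views (as in the results below) are informative alongside the median.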
Results
ConnexAI's ASR model consistently outperformed other models across most categories, demonstrating best-in-class recognition accuracy under challenging telephony conditions. The only exception was numeric sequences (e.g., phone numbers), where Amazon Transcribe achieved slightly better results.
Key Findings:
Many providers' models exhibited a 75th percentile WER above 20%, highlighting the difficulty of real-world telephony transcription.
ConnexAI’s ASR model achieved the lowest median WER (7.7%) across diverse accents and noisy environments, compared to 10.5% for the next-best provider.
ConnexAI's advantage was most pronounced on short utterances (1-5 words) and alphanumeric sequences, where contextual information is limited.
Best-in-class ASR performance ensures ConnexAI's Agentic AI solutions maintain accurate and reliable speech understanding in live contact centre conversations.
[Chart: Word Error Rate (%) by provider; lower is better]



