Humans vs Machines—Who’s Better at Recognizing Speech?

Are humans or machines better at recognizing speech? A new study shows that in noisy conditions, current automatic speech recognition (ASR) systems achieve remarkable accuracy and sometimes even surpass human performance. However, the systems need to be trained on an incredible amount of data, while humans acquire comparable skills in less time.

Automatic speech recognition (ASR) has made remarkable advances in the past few years, especially for widely spoken languages such as English. Prior to 2020, it was typically assumed that human speech recognition abilities far exceeded those of automatic systems, yet some current systems have started to match human performance. The goal in developing ASR systems has always been to lower the error rate, regardless of how people perform on the same material. After all, even humans fall short of 100% accuracy in noisy conditions.
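
The error rate in question is standardly the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn a system’s transcript into the reference transcript, divided by the number of reference words. Below is a minimal Python sketch of the computation; the example sentences are invented for illustration.

```python
# Word error rate (WER): word-level edit distance between a reference
# transcript and a hypothesis, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitute, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Invented example: one substitution ("the" -> "a") in six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```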

In a new study, UZH computational linguistics specialist Eleanor Chodroff and Chloe Patman, a fellow researcher at the University of Cambridge, compared two popular ASR systems – Meta’s wav2vec 2.0 and OpenAI’s Whisper – against native British English listeners. They tested how well the systems recognized speech in speech-shaped noise (steady noise matched to the average spectrum of speech) or pub noise, and speech produced with or without a cotton face mask.
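
For a feel of how such systems are run in practice, the sketch below transcribes an audio file with publicly released checkpoints from both model families, using the Hugging Face transformers pipeline. This is illustrative only, not the authors’ evaluation code; “sample.wav” is a placeholder path, and which wav2vec 2.0 checkpoint the study used is not specified here.

```python
# Minimal transcription sketch with Hugging Face transformers
# (pip install transformers torch; ffmpeg is needed to decode audio files).
from transformers import pipeline

# Publicly released checkpoints; placeholders for whatever the study used.
wav2vec = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
whisper = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

audio = "sample.wav"  # placeholder path to a speech recording
print("wav2vec 2.0:", wav2vec(audio)["text"])
print("Whisper:", whisper(audio)["text"])
```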

Latest OpenAI system better – with one exception

The researchers found that humans still maintained the edge over both ASR systems in their standard versions. However, OpenAI’s most recent large ASR system, Whisper large-v3, significantly outperformed human listeners in all tested conditions except naturalistic pub noise, where it was merely on par with humans. Whisper large-v3 thus demonstrated its ability to process the acoustic properties of speech and successfully map them to the intended message (i.e., the sentence). “This was impressive as the tested sentences were presented out of context, and it was difficult to predict any one word from the preceding words,” Eleanor Chodroff says.

Vast training data

A closer look at the ASR systems and how they’ve been trained shows that humans are nevertheless doing something remarkable. Both tested systems involve deep learning, but the most competitive system, Whisper, required an incredible amount of training data. Meta’s wav2vec 2.0 was trained on 960 hours (or 40 days) of English audio data, while the default Whisper system was trained on over 75 years of speech data. The version that actually outperformed human listeners, Whisper large-v3, was trained on over 500 years of nonstop speech. “Humans are capable of matching this performance in just a handful of years,” says Chodroff. “Considerable challenges also remain for automatic speech recognition in almost all other languages.”
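
Those durations are simple unit conversions from reported training-set sizes. The 680,000-hour and 5-million-hour figures below are the publicly reported totals for the default Whisper model and Whisper large-v3 – background knowledge consistent with the years quoted above, not numbers stated in this article.

```python
# Converting training-set sizes (hours of audio) into the durations
# quoted in the text. The Whisper figures are publicly reported totals,
# assumed here rather than taken from this article.
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

for name, hours in [("wav2vec 2.0", 960),
                    ("Whisper (default)", 680_000),
                    ("Whisper large-v3", 5_000_000)]:
    print(f"{name}: {hours / HOURS_PER_DAY:,.0f} days "
          f"= {hours / HOURS_PER_YEAR:,.1f} years")
```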

Different types of errors

The paper also reveals that humans and ASR systems make different types of errors. English listeners almost always produced grammatical responses, but were more likely to write sentence fragments rather than attempt a written word for every part of the spoken sentence. In contrast, wav2vec 2.0 frequently produced gibberish in the most difficult conditions. Whisper also tended to produce full grammatical sentences, but was more likely to “fill in the gaps” with completely wrong information.

Expert Contact
Prof. Dr. Eleanor Chodroff
Department of Computational Linguistics
University of Zurich
Phone: +41 76 426 27 07
Email: eleanor.chodroff@uzh.ch

Original Source: https://www.news.uzh.ch/en/articles/media/2025/Spracherkennung.html

Original Publication
Authors: Chloe Patman, Eleanor Chodroff
Journal: JASA Express Letters
Article Title: Speech recognition in adverse conditions by humans and machines
Article Publication Date: 12 November 2024
DOI: https://doi.org/10.1121/10.0032473

Media Contact
Melanie Nyfeler
Media representative
Phone: +41 634 44 78
Email: melanie.nyfeler@kommunikation.uzh.ch

Source: IDW
