Why We Make Lip-Reading Errors

Summary: Lip-reading is a highly demanding cognitive feat that forces the brain to decode speech by translating physical mouth movements instead of acoustic waveforms. While psychologists have long tracked overall accuracy rates, the underlying structure of why lip-readers make specific mistakes has historically been poorly understood. Traditionally, speech scientists analyzed these errors through an auditory lens, focusing on spoken sounds (phonemes) rather than raw visual features.

A new study has broken away from this acoustic bias, using network science to build a massive visual map of approximately 20,000 English words. The research switches the analytical focus to visemes, the distinct visual mouth, jaw, and lip shapes that correspond to spoken language.

By mapping words by visual similarity rather than by sound, the team found that the visual word landscape stretches and compresses unevenly, and that lip-reading errors are not random but depend on a word’s position within this visual network.

Key Facts

The Viseme Perspective: Unlike phonemes (units of sound), visemes are the fundamental units of visual speech. The study prioritized visual cues over auditory signals, mapping words solely by how they look when articulated by the lips, jaw, and mouth.
The Look-Alike Bottleneck: The network map showed that roughly one-third of English words look like at least one other word when spoken, giving lip-readers persistent perceptual competitors.
The Compression Phenomenon: The visual word landscape is not evenly distributed; it stretches and compresses. In compressed regions, visually similar words crowd together, so each word has more look-alike competitors and is lip-read less accurately.
Predictable Error Paths: Lip-reading mistakes are not random. Errors tend to fall in the same region of the visual network as the target word, and people are more likely to mistake a word for one that is used more commonly.
The One-Viseme Miss: People are less accurate at lip-reading than they tend to think, yet most errors are near misses that differ from the target word by just one or two visemes.
Dual-Track Applications: The KU team hopes to apply these network maps to lip-reading training, tracking whether a learner’s errors move closer to the target word over time. The findings could also inform automatic transcription services (e.g., Zoom) that combine visual information from a speaker’s face with audio.

Source: University of Kansas

New research from the University of Kansas uses network science to determine why people make mistakes when lip reading.

Michael Vitevitch, professor of speech-language-hearing at KU, and his co-authors created a visual map of around 20,000 words in English, hoping to better grasp why some words are more difficult to lip-read than others.

The results appear in the Journal of the Acoustical Society of America. Findings could improve training for lip readers and boost the capacity for artificial intelligence to read lips and provide transcription and other digital services.

“What we looked at in this study is how people basically read lips, how accurate they are and, more specifically, what kinds of mistakes they make,” Vitevitch said. “A lot of previous work looked at how accurate people were and didn’t necessarily look at the characteristics of the errors themselves. There’s a lot to be learned from the mistakes you make, and that was the approach we took.”

While previous work on lip reading examined errors, much of that research was done by spoken-language researchers who focused on phonemes — the sounds in a language — and on how close participants were to the word as it sounds.

Vitevitch took a different approach.

“We focused on the visual characteristics,” he said. “Instead of looking at how many sounds of the word people got, we looked at how many of the visual characteristics, which we call ‘visemes’ (the visual equivalent of a phoneme), they got. We focused on what you’re getting from the lips, jaw and mouth without using auditory sound. You’re just trying to get the information from what you’re seeing.”

“How does that sound look when it’s spoken? We don’t care what it sounds like; we care about how it looks when it’s spoken,” he said. “Sometimes words sound similar and look similar, such as ‘kit,’ ‘cat’ and ‘cut.’ Other times words don’t sound alike but still look similar like ‘vet,’ ‘fit’ and ‘fuzz.’ In both cases if you’re just looking at my face, you couldn’t tell one word from the other.”
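The look-alike idea can be sketched in a few lines of Python. This is only an illustration: the `VISEME_CLASS` table, the rough phoneme spellings in `LEXICON`, and the function names are assumptions made for this example, not the viseme scheme or data used in the study. The table is deliberately coarse (all vowels share one class) so that it reproduces the examples in the quote above.

```python
from collections import defaultdict

# Hypothetical, simplified phoneme-to-viseme classes (NOT the study's scheme).
# Sounds that look alike on the face share one class label.
VISEME_CLASS = {
    "p": "P", "b": "P", "m": "P",
    "f": "F", "v": "F",
    "t": "T", "d": "T", "s": "T", "z": "T", "n": "T",
    "k": "K", "g": "K",
    "ae": "V", "ih": "V", "ah": "V", "eh": "V",  # all vowels collapsed
}

# Toy lexicon: word -> rough phoneme sequence (illustrative only).
LEXICON = {
    "kit": ["k", "ih", "t"],
    "cat": ["k", "ae", "t"],
    "cut": ["k", "ah", "t"],
    "vet": ["v", "eh", "t"],
    "fit": ["f", "ih", "t"],
    "fuzz": ["f", "ah", "z"],
    "map": ["m", "ae", "p"],
}

def to_visemes(phonemes):
    """Translate a phoneme sequence into a tuple of viseme class labels."""
    return tuple(VISEME_CLASS[p] for p in phonemes)

def viseme_twins(lexicon):
    """Group words whose viseme sequences are identical (look-alikes)."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes)].append(word)
    # Keep only sequences shared by more than one word.
    return {seq: sorted(words) for seq, words in groups.items() if len(words) > 1}

twins = viseme_twins(LEXICON)
for seq, words in twins.items():
    print("-".join(seq), words)
```

Under this toy mapping, "kit," "cat" and "cut" fall into one group and "vet," "fit" and "fuzz" into another, while "map" has no look-alike in the lexicon.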

Through analysis of the word map, researchers determined:

People are more likely to mistake a word for another word used more commonly.
When spoken, about a third of words in English look like at least one other word.
If a word has many visual look-alikes, it’s consistently harder to lip-read.
Lip-reading mistakes don’t happen randomly — they’re more likely when visually similar words occupy the same region in the visual network.
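The paper’s abstract defines a viseme neighbor as a word that differs from another by the addition, deletion or substitution of a single viseme. A minimal sketch of counting such neighbors follows. The viseme sequences in `VISOME_FORMS` and the function names are hypothetical and chosen only for illustration; the study’s actual network was built from about 20,000 words with its own viseme coding.

```python
from itertools import combinations

def is_neighbor(a, b):
    """True if two viseme sequences differ by exactly one
    addition, deletion, or substitution."""
    if a == b:
        return False  # identical sequences are look-alike twins, not neighbors
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Lengths differ by one: removing one element must yield the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

# Hypothetical viseme sequences (illustrative labels, not the study's coding).
VISOME_FORMS = {
    "cat": ("K", "V", "T"),
    "pat": ("P", "V", "T"),
    "fat": ("F", "V", "T"),
    "at": ("V", "T"),
    "cats": ("K", "V", "T", "T"),
    "map": ("P", "V", "P"),
}

def neighbor_counts(forms):
    """Count, for each word, how many other words are one viseme away."""
    counts = {word: 0 for word in forms}
    for (w1, s1), (w2, s2) in combinations(forms.items(), 2):
        if is_neighbor(s1, s2):
            counts[w1] += 1
            counts[w2] += 1
    return counts

counts = neighbor_counts(VISOME_FORMS)
for word, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(word, n)
```

In this toy lexicon, "cat" sits in a crowded region with four neighbors, while "map" and "cats" have one each. The study’s finding is that words in crowded regions like the former are lip-read less accurately.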

“One surprise was that people aren’t that good at this,” Vitevitch said. “We think we are, but we’re really not. Most of the errors show that you’re one or two visual characteristics — one or two visemes — off. You’re getting a good amount of it, but perhaps not enough to get by.”

The researchers’ visual map allowed them to understand how words are distributed throughout the landscape, according to Vitevitch. In the map, words were close when they looked similar and farther apart when the words appeared visually unalike.

“Certain areas become more compressed than you might expect,” he said. “The landscape stretches and compresses in ways we hadn’t anticipated. That stretching and compression has implications for how accurate you’re going to be when trying to lip-read. Does it give you more competitors than you would otherwise have? Or does it move things farther apart and make them more perceptually distinct?”

The KU researcher said his group hopes to move into lip-reading training.

“The idea is that if you track people’s errors over time, those errors should start shrinking toward the target word,” Vitevitch said. “Instead of being far away, people begin picking up the information they need and making more accurate guesses.”
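One way to picture this kind of tracking is to measure how many visemes separate each response from its target and average that over a session. The sketch below uses a standard edit distance for that purpose; the article does not say how the researchers score error distance, so the metric, the function names and the made-up session data are all assumptions for illustration.

```python
def viseme_distance(target, response):
    """Edit distance between two viseme sequences: the minimum number
    of additions, deletions, or substitutions separating them."""
    prev = list(range(len(response) + 1))
    for i, t in enumerate(target, start=1):
        curr = [i]
        for j, r in enumerate(response, start=1):
            cost = 0 if t == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # addition
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def mean_error_distance(trials):
    """Average viseme distance over (target, response) pairs."""
    return sum(viseme_distance(t, r) for t, r in trials) / len(trials)

# Made-up sessions, for illustration only (hypothetical viseme labels).
early_session = [
    (("K", "V", "T"), ("P", "V", "P")),
    (("F", "V", "T"), ("K", "V", "T", "T")),
]
late_session = [
    (("K", "V", "T"), ("P", "V", "T")),
    (("F", "V", "T"), ("F", "V", "T")),
]

print(mean_error_distance(early_session))
print(mean_error_distance(late_session))
```

If training works as hoped, the average distance should fall from session to session, as it does in this invented data.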

An additional application of the research is in training automatic transcription.

“Systems such as Zoom already do a reasonable job transcribing speech,” Vitevitch said. “Could they do better if they used not only audio but also visual information from a speaker’s face? Computers are very good at finding patterns, and sometimes they’re the same patterns humans use. We may be able to train computers to do things in a more humanlike way.”

Vitevitch said his group will continue to follow up on this work in different ways.

“We’re continuing to explore how people do this, potentially moving toward machine-learning applications and finding ways to help people who need assistance understanding speech,” he said.

Vitevitch’s co-authors were KU graduate students Maia Flynn and Reid Kelly, along with Lorin Lachs of California State University, Fresno.

Key Questions Answered:

Q: What is a “viseme,” and why is it more important for lip-reading than a phoneme?

A: A phoneme is the smallest unit of sound in a language, and it is what spoken-language researchers have typically used to analyze speech. But when you are lip-reading, your ears aren’t doing the work; your eyes are. Vitevitch shifted the focus to the “viseme,” the visual equivalent of a phoneme: how a sound looks on the face when spoken, as conveyed by the lips, jaw, and mouth. For example, the words “vet,” “fit,” and “fuzz” don’t sound alike (different phonemes), but they look similar on the face, so a lip-reader cannot tell them apart by sight alone.

Q: What does it mean that the visual map of English words “stretches and compresses”?

A: The researchers used network science to plot about 20,000 words on a map, placing words close together when they look similar when spoken and farther apart when they look distinct. They found that English words are not spread evenly across this map. Instead, certain regions are more “compressed” than expected. In these crowded neighborhoods, a word has many look-alike competitors, which makes it harder to lip-read. In “stretched” regions, words sit farther apart and are more perceptually distinct, which should make them easier to read.

Q: How can this visual word map be used to improve Artificial Intelligence and daily video calls?

A: Systems such as Zoom already do a reasonable job of transcribing speech from audio, but audio can be degraded by poor microphones or background noise. The researchers suggest that transcription might improve if systems used visual information from a speaker’s face alongside the audio. Knowing which words look alike, and which errors people tend to make, could help train machine-learning systems to resolve ambiguous words in a more humanlike way.

Editorial Notes:

This article was edited by a Neuroscience News editor.
Journal paper reviewed in full.
Additional context added by our staff.

About this visual neuroscience research news

Author: Brendan Lynch
Source: University of Kansas
Contact: Brendan Lynch – University of Kansas
Image: The image is credited to Neuroscience News

Original Research: Open access.
“The visome: Using cognitive networks to examine lip-reading errors in English words” by Michael S. Vitevitch, Lorin Lachs, Maia B. Flynn, Reid Kelly. Journal of the Acoustical Society of America
DOI:10.1121/10.0044182

Abstract

The visome: Using cognitive networks to examine lip-reading errors in English words

Network science was used to examine how English words look rather than sound when spoken. Measures of the visome (network of visual word representations) were compared to a phonological network at the macro- (whole network), meso- (subsets of nodes), and micro-levels (individual nodes) to determine how the structure of the visome influences lipreading performance.

Conventional psycholinguistic measures and network structure measures were further examined in two databases of lipreading errors. Lipreading errors were higher in frequency of occurrence than the target words.

Target words had uniqueness points that occurred after the end of the word (indicating that they are embedded in other words in the visome). Words varied in the number of viseme twins they have (i.e., words that look the same when spoken), and words with many twins are lipread less accurately than words with fewer twins.

Words with many viseme neighbors (the word is related to another word by the addition, deletion, or substitution of a viseme) were also lipread less accurately than words with fewer viseme neighbors.

Errors tended to reside in the same community as the target word instead of in a different community. Network analysis may be useful for reviving and advancing research on lipreading.
