To Help Computers Detect Who's Talking, These Scientists Figured Out How Humans Do It

Humans can easily pick out one voice from many. (Credit: Aaron Ama/Shutterstock)

(Inside Science) — If your phone rings and you answer it without looking at the caller ID, you may well know it is your mother before the person on the other end finishes saying “hello.” You could also tell within a second whether she was happy, sad, angry or concerned.

Humans can naturally recognize and identify other humans by their voices. A new study published in The Journal of the Acoustical Society of America explored how exactly humans are able to do this. The results may help researchers design more efficient voice recognition software in the future.

The Complexity of Speech

“It’s a crazy problem for our auditory system to solve — to figure out how many sounds there are, what they are and where they are,” said Tyler Perrachione, a neuroscientist and linguist from Boston University not involved in the study.

Nowadays, Facebook has little trouble identifying faces in photos, even when a face is presented from different angles or under different lighting. Today’s voice recognition software is much more limited by comparison, according to Perrachione, and that may be related to our lack of understanding of how humans identify voices.

“We humans have different speaker models for different individuals,” said Neeraj Sharma, a psychologist from Carnegie Mellon University in Pittsburgh and the lead author of the recent study. “When you listen to a conversation, you switch between different models in your brain, so you can understand each speaker better.”

People develop speaker models in their brains as they are exposed to different voices, taking into account subtle differences in features such as cadence and timbre. By naturally switching and adapting between different speaker models based on who’s talking, people learn to identify and understand different speakers.
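As a toy illustration of switching between speaker models, the Python sketch below gives each speaker a one-dimensional Gaussian model of pitch and labels every audio frame with whichever model fits it best. This is not the study's method, nor a claim about how the brain works. The names `SPEAKER_MODELS`, `best_speaker` and `find_speaker_changes`, and all of the pitch values, are hypothetical. Real voices differ in many more features, such as the cadence and timbre mentioned above.

```python
import math

# Hypothetical per-speaker models: (mean, standard deviation) of pitch in Hz.
SPEAKER_MODELS = {
    "speaker_a": (120.0, 10.0),
    "speaker_b": (210.0, 15.0),
}


def log_likelihood(pitch, mean, std):
    """Log density of a 1-D Gaussian evaluated at the given pitch."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (pitch - mean) ** 2 / (2 * std ** 2)


def best_speaker(pitch):
    """Pick the speaker model that explains this pitch value best."""
    return max(
        SPEAKER_MODELS,
        key=lambda name: log_likelihood(pitch, *SPEAKER_MODELS[name]),
    )


def find_speaker_changes(pitches):
    """Return (frame index, new speaker) pairs where the best-fitting model switches."""
    changes = []
    previous = None
    for i, pitch in enumerate(pitches):
        current = best_speaker(pitch)
        if previous is not None and current != previous:
            changes.append((i, current))
        previous = current
    return changes


# Made-up pitch track: three frames of one voice, three of another, then back.
pitches = [118.0, 125.0, 122.0, 205.0, 215.0, 208.0, 119.0]
changes = find_speaker_changes(pitches)
print(changes)  # [(3, 'speaker_b'), (6, 'speaker_a')]
```

In this sketch a "speaker change" is simply the frame where a different model starts to fit best. A system that used one shared model for everyone would have no such signal to work with.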

“Right now, voice recognition systems don’t focus on the speaker aspect — they basically use the same speaker model to analyze everything,” said Sharma. “For example, when you talk to Alexa, she uses the same speaker model to analyze my speech versus your speech.”

So if you have a thick Alabama accent, Alexa may hear “cane” when you are trying to say “can’t.”

“If we can understand how humans use speaker-dependent models, then maybe we can teach a machine system to do it,” said Sharma.

Listen and Say ‘When’

In the new study, Sharma and his colleagues designed an experiment in which a group of human volunteers listened to audio clips of two similar voices speaking in turn, and were asked to identify the exact moment one speaker took over from the previous one.

This allowed the researchers to explore the relationship between certain audio features and the reaction time and false alarm rate of the human volunteers. They then began to decipher which cues humans listen for to detect a speaker change.
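To make the two measures concrete, here is a minimal Python sketch of how reaction time and false alarm rate could be scored in a task like this. The scoring rules are assumptions for illustration, not the study's actual definitions: a response before the speaker change counts as a false alarm, a response within `max_delay` seconds after the change counts as a hit, and anything else counts as a miss. The `Trial` class, the `score_trials` function, the `max_delay` parameter and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    change_time: float           # seconds into the clip where the second talker starts
    press_time: Optional[float]  # seconds at which the listener responded, None if never


def score_trials(trials, max_delay=2.0):
    """Return (mean reaction time over hits, false alarm rate, miss rate)."""
    if not trials:
        raise ValueError("need at least one trial")
    reaction_times = []
    false_alarms = 0
    misses = 0
    for t in trials:
        if t.press_time is None or t.press_time - t.change_time > max_delay:
            misses += 1          # no response, or a response that came too late
        elif t.press_time < t.change_time:
            false_alarms += 1    # responded before the speaker actually changed
        else:
            reaction_times.append(t.press_time - t.change_time)
    n = len(trials)
    mean_rt = sum(reaction_times) / len(reaction_times) if reaction_times else None
    return mean_rt, false_alarms / n, misses / n


# Made-up responses from one listener.
trials = [
    Trial(change_time=4.0, press_time=4.5),   # hit, 0.5 s after the change
    Trial(change_time=6.0, press_time=5.0),   # false alarm
    Trial(change_time=3.0, press_time=None),  # miss
    Trial(change_time=5.0, press_time=6.0),   # hit, 1.0 s after the change
]
mean_rt, fa_rate, miss_rate = score_trials(trials)
print(mean_rt, fa_rate, miss_rate)  # 0.75 0.25 0.25
```

With scores like these in hand, one could compare them against properties of each audio clip to see which cues go along with faster or more accurate detection.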

“Currently, we don’t have a lot of different experiments that allow us to study talker identification or voice recognition, so this experiment design is actually quite clever,” said Perrachione.

When the researchers ran the same test on several state-of-the-art voice recognition systems, including one commercially available system developed by IBM, they found that the human volunteers consistently outperformed all of them, as expected.

Sharma said the team plans to look at the brain activity of people listening to different voices using electroencephalography, or EEG, a noninvasive method for monitoring brain activity. “That may help us to further analyze how the brain responds when there is a speaker change,” he said.

[This story was originally published on Inside Science.]