You hear one thing, but the computer hears another. What’s going on here?

Two researchers from the University of California, Berkeley have exploited the technique computers use to decode human speech to hide messages inside snippets of audio. When the audio is fed to a speech recognition program like Mozilla’s DeepSpeech, the program transcribes the hidden message instead of the sounds we hear.

Do You Hear What I Hear?

The method involves hiding a quiet version of the message you actually want transcribed inside a different piece of audio. The “secret message” registers to humans as nothing more than a bit of background noise, but because of the way computers process audio, they pick up on the hidden message clearly. In a paper posted to the preprint server arXiv, the researchers report that they were able to fool DeepSpeech every time they hid a message inside an audio sample.


DeepSpeech transcription: “That day the merchant gave the boy permission to build the display”


DeepSpeech transcription: “Everyone seemed very excited”

The trick has to do with how machine learning algorithms recognize speech. Considering the full range of letter combinations that each audio sample could contain is prohibitively difficult, so algorithms calculate what amounts to an educated guess. An algorithm maps each short slice of audio it samples to a probability distribution over possible letters and characters, and picks the most likely one. Training the algorithm on many different audio samples is what lets it get good at guessing correctly.
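To make the “educated guess” concrete, here is a minimal Python sketch of the pick-the-most-likely step. The three-symbol alphabet and the probabilities are invented for illustration, and the rule of merging repeated letters and dropping a “no letter” symbol is an assumption borrowed from a common style of speech decoder, not something the article spells out.

```python
# Toy illustration only: the alphabet and the probabilities are made up.
ALPHABET = ["-", "h", "i"]  # "-" is a hypothetical "no letter" (blank) symbol


def greedy_decode(frame_probs):
    """Pick the most likely symbol per slice, then merge repeats and drop blanks."""
    # Step 1: for every slice of audio, take the symbol with the highest probability.
    best = [ALPHABET[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]

    # Step 2 (assumed decoding rule): collapse runs of the same symbol, skip blanks.
    out = []
    prev = None
    for sym in best:
        if sym != prev and sym != "-":
            out.append(sym)
        prev = sym
    return "".join(out)


# One probability distribution per slice of audio, in ALPHABET order.
frames = [
    [0.10, 0.80, 0.10],  # most likely "h"
    [0.20, 0.70, 0.10],  # most likely "h" again (same letter, still being spoken)
    [0.90, 0.05, 0.05],  # most likely "no letter"
    [0.10, 0.10, 0.80],  # most likely "i"
]
transcript = greedy_decode(frames)
print(transcript)  # prints "hi"
```

Because the transcript depends only on which symbol wins in each slice, anything that nudges those probabilities can change what the computer writes down.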

Computer Vs. Human

The researchers exploit this system of educated guesses by creating audio that tips the computer’s decision in favor of the words they want transcribed, instead of the words in the clip the message is hidden inside. And, in a tactic similar to how algorithms are trained, the researchers’ program tries out many different variations of the same audio sample until the altered clip sounds nearly the same as the original to us, even though the words the computer transcribes are completely different.
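The idea of repeatedly adjusting a clip until the computer’s guess tips over can be sketched with a toy model. Everything below is hypothetical: the “recognizer” is a tiny table of weights rather than DeepSpeech, the input is three numbers rather than audio, and the names `craft`, `eps` and `step` are made up for this example. It shows only the general shape of such a search, not the researchers’ actual program.

```python
# Toy stand-in for a recognizer: three made-up "letters", each scored by a
# weighted sum of three input values. None of this is DeepSpeech itself.
LETTERS = ["a", "b", "c"]
WEIGHTS = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, 0.0],
]


def scores(x):
    """Score each letter for input x (higher means more likely)."""
    return [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]


def predict(x):
    """Index of the highest-scoring letter."""
    s = scores(x)
    return max(range(len(s)), key=s.__getitem__)


def craft(x, target, eps=0.5, step=0.05, max_iters=100):
    """Search for a bounded change to x that makes the toy model pick `target`."""
    delta = [0.0] * len(x)
    for _ in range(max_iters):
        adv = [v + d for v, d in zip(x, delta)]
        if predict(adv) == target:
            break  # the model's guess has tipped over to the target
        s = scores(adv)
        # Strongest competing letter at the moment.
        rival = max((i for i in range(len(s)) if i != target), key=s.__getitem__)
        # Gradient of (target score - rival score) with respect to the input.
        grad = [t - r for t, r in zip(WEIGHTS[target], WEIGHTS[rival])]
        for i, g in enumerate(grad):
            direction = (g > 0) - (g < 0)  # sign of the gradient: -1, 0 or 1
            # Take a small step, never letting the change exceed eps in size.
            delta[i] = max(-eps, min(eps, delta[i] + step * direction))
    return delta


original = [1.0, 0.2, 0.1]
delta = craft(original, target=LETTERS.index("b"))
adversarial = [v + d for v, d in zip(original, delta)]
print(LETTERS[predict(original)], "->", LETTERS[predict(adversarial)])
```

In this toy the change needed is large relative to the input. With real audio there are thousands of samples per second to adjust, which is part of why the total change can stay quiet enough to pass as background noise.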

The researchers tested their work on 100 snippets of audio from Mozilla’s Common Voice dataset, and they say it worked every time. They were even able to hide text inside audio with no speech at all, such as a snippet of classical music. And because DeepSpeech samples audio many times a second, the hidden text can be much longer than what’s actually heard, up to a limit of 50 characters per second of audio.
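As a quick back-of-the-envelope check on that limit, the short Python sketch below works out how much text a clip of a given length could carry. The helper names `max_hidden_chars` and `fits` are invented for this example, and the only figure taken from the article is the 50-characters-per-second ceiling.

```python
import math

MAX_CHARS_PER_SECOND = 50  # the limit quoted above for DeepSpeech


def max_hidden_chars(duration_seconds):
    """Largest number of characters a clip of this length could carry."""
    return math.floor(MAX_CHARS_PER_SECOND * duration_seconds)


def fits(message, duration_seconds):
    """True if the message is short enough to hide in a clip of this length."""
    return len(message) <= max_hidden_chars(duration_seconds)


print(max_hidden_chars(2.0))                       # prints 100
print(fits("everyone seemed very excited", 1.0))   # prints True (28 characters)
```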

Hidden audio could be used to sneak messages past human listeners, or to fool computer transcription programs. But it might not be so easy to hack speech recognition programs in general. Because they used DeepSpeech, which has its code openly available, the researchers took what’s called a “white box” approach, meaning they knew everything about how the program works. A speech recognition program whose inner workings are unknown would be much harder to hack. In addition, these examples are targeted specifically at DeepSpeech, so a different speech recognition program wouldn’t pick up on the hidden audio.