In 2018, fears of fake news will pale in comparison to new technology that can fake the human voice. This could create security nightmares. Worse still, it could strip away from each of us a part of our uniqueness. Yet companies, universities, and governments are already working furiously to decode the human voice for many applications, ranging from better integration of our internet-of-things devices to more natural interactions between humans and machines. Technologically adept nation-states (the United States, China, and Estonia) have waded into this space, and tech giants such as Google, Amazon, Apple, and Facebook have special projects on voice.
It's not hard to develop an artificial voice that can model and reproduce spoken words and phrases. I remember being amazed when my original Apple Macintosh informed me of the date and time in a dry, digital tone. Making a natural-sounding voice involves algorithms that are far more complex and computationally expensive. But that technology is available now.
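To get a feel for how low the bar is for that dry, digital register, a few lines of Python suffice. This is a minimal sketch using the open-source pyttsx3 library (my choice of tool, not one the article names), which simply drives the speech engine already built into the operating system:

```python
# A minimal sketch of flat, machine-style speech synthesis using the
# pyttsx3 library, which wraps the operating system's built-in speech
# engine. The library choice is illustrative, not from the article.
import pyttsx3

engine = pyttsx3.init()            # attach to the OS speech engine
engine.setProperty("rate", 150)    # words per minute: steady, flat pacing
engine.say("The time is ten forty-five.")
engine.runAndWait()                # block until the speech finishes
```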
As any speech pathologist will attest, the human voice is far more than vocal-cord vibrations. These vibrations are caused by air escaping our lungs and forcing open our vocal folds, a process that produces tones as unique as a fingerprint because of the thousands of waveforms that are conjured simultaneously and in chorus. But a voice's uniqueness is also tied to qualities we rarely consider, such as intonation, inflection, and pacing. These aspects of our speech are situational and often subconscious, and they make all the difference to the listener. They tell us whether a phrase such as “Wow, that outfit is something!” should be interpreted as mean-spirited, sarcastic, loving, or indifferent. This challenge explains the early use of emoji in text messages: they were needed to clarify the intent of a written message, because it is extremely difficult to interpret the true meaning of conversational speech that's written rather than spoken.
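These qualities can be measured. As a rough illustration of what intonation and pacing look like as data, the sketch below pulls a pitch contour and a crude speaking-rate estimate out of a recording. The librosa library is an assumed toolkit and the file name is hypothetical; the article names no software:

```python
# A rough sketch of extracting prosodic features -- intonation (the
# pitch contour) and pacing -- from a recording. librosa is an assumed
# toolkit; "speech_sample.wav" is a hypothetical file.
import librosa
import numpy as np

y, sr = librosa.load("speech_sample.wav", sr=None)

# Intonation: fundamental frequency over time via probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz, a low speaking voice
    fmax=librosa.note_to_hz("C6"),   # ~1 kHz, well above speech pitch
    sr=sr,
)

# Pacing: acoustic onsets per second as a crude speaking-rate proxy.
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
duration = len(y) / sr

print(f"median pitch: {np.nanmedian(f0):.1f} Hz")
print(f"onsets per second: {len(onset_times) / duration:.2f}")
```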
Details such as intonation, inflection, and pacing are particularly difficult to model, but we are getting there. With Project Voco, Adobe is developing what is essentially a Photoshop for soundwaves: it works on waveforms the way an image editor works on pixels, producing results that sound natural. The company is betting that, if enough of a person's speech can be recorded (or data-mined), altering a recording of their voice will require little more than a cut-and-paste. Adobe's initial results from Voco are eerie as well as awe-inspiring, and the prowess of the prototype indicates how soon ordinary citizens will be unable to distinguish real voices from spoofed ones. With enough samples stored in a data library, you can make anyone appear to say almost anything.
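The cut-and-paste framing is apt. As a generic illustration of waveform splicing (not Adobe's actual algorithm, which it has not published), the sketch below pastes a word lifted from one recording into another, fading each seam so the edit doesn't click; the file names, insertion point, and fade length are all hypothetical:

```python
# An illustrative sketch of cut-and-paste waveform editing: insert a
# word taken from one recording into another, fading each seam to
# avoid audible clicks. Generic splicing only -- not Adobe's Voco
# algorithm. File names, insertion point, and timings are hypothetical.
import numpy as np
import soundfile as sf

def mono(x):
    """Collapse a stereo signal to mono so 1-D fades apply cleanly."""
    return x.mean(axis=1) if x.ndim > 1 else x

target, sr = sf.read("sentence.wav")     # recording to be altered
donor, sr2 = sf.read("donor_word.wav")   # word lifted from other speech
assert sr == sr2, "sample rates must match before splicing"
target, donor = mono(target), mono(donor).copy()

cut_at = int(1.5 * sr)        # insertion point: 1.5 s into the target
fade = int(0.01 * sr)         # 10 ms fade at each seam
ramp = np.linspace(0.0, 1.0, fade)

head, tail = target[:cut_at].copy(), target[cut_at:].copy()
head[-fade:] *= ramp[::-1]    # fade out before the paste
donor[:fade] *= ramp          # fade the donor word in...
donor[-fade:] *= ramp[::-1]   # ...and out again
tail[:fade] *= ramp           # fade back into the original speech

sf.write("altered_sentence.wav", np.concatenate([head, donor, tail]), sr)
```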
Technology companies and investors are betting that these systems will eventually have tremendous commercial value. Even before that value is realised, though, this technology will present big risks. By 2018, a nefarious actor may easily be able to create a vocal impersonation good enough to trick, confuse, enrage, or mobilise the public. Most citizens around the world will simply be unable to discern the difference between a fake Trump or Putin soundbite and the real thing.
When you consider the widespread distrust of the media, institutions, and expert gatekeepers, audio fakery could be more than disruptive. It could start wars. Imagine the consequences of manufactured audio of a world leader making bellicose remarks, supported by doctored video. In 2018, will citizens—or military generals—be able to determine that it's fake?
William Welser IV is the director of the Engineering and Applied Sciences (EAS) Research Department at the RAND Corporation, a professor at the Pardee RAND Graduate School, and co-director of RAND's Impact Lab.
This article was first published in The Wired World in 2018, © The Condé Nast Publications Ltd.