Stephen Fry spoke at the CogX Festival in London last week. During that talk, he played a video clip from a WWII documentary narrated by his voice. The voice heard in the clip is in fact not his — it’s a clone of his voice, trained on the seven Harry Potter audiobooks that he patiently and painstakingly narrated in the past. The cloned voice was then used to narrate the documentary without Fry’s knowledge or permission.
The voice is so good that you can’t tell the difference; even German words like Hauptsturmführer and Dutch place names are pronounced flawlessly. In his talk, Fry explains:
“This is not the result of a mash-up; there are plenty of those and they are completely obvious. This is from a flexible, artificial voice where the words are modulated to fit the meaning of each sentence uniquely. It could therefore have me read anything from a call to storm parliament, to hard porn, to product endorsements. All without my knowledge.”
He is, of course, right, and all of this is made possible by major leaps in the field of speech synthesis: new technology being developed and commercialized by companies like ElevenLabs, Speechify, and Resemble.ai, which let you clone a voice with sometimes as little as a few minutes’ worth of voice data.
It’s a development that is as profound as it is terrifying. Who owns our voice, if anyone can create a copy of it for a few dollars and make it say whatever they want? And what else do we stand to lose if we can no longer distinguish a real voice from a fake one?
How did they get so good, so fast?
Computer-generated voices have been around for a while, but they were always pretty shit. The technology worked by splicing together pre-recorded words and phrases to match the desired output.
Deep learning turned that on its head. Machine learning algorithms learn from pre-recorded audio and pick up on all the speech patterns that make a voice unique, such as rhythm, pace, intonation, and pronunciation. Today’s synthetic voices have achieved unprecedented sophistication: they can drop a pause at just the right moment, imitate ums and ahs, and have even mastered nonverbal sounds like yawning, sighing, and chuckling.
These are lifelike voices, indistinguishable from the real thing to the human ear. The technology is fast, readily available, and relatively cheap — and it’ll only get cheaper in the future, as the cost of compute keeps falling year after year, in line with Moore’s law.
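To get a feel for how low the barrier has become, here is a minimal sketch of zero-shot voice cloning using the open-source Coqui TTS library and its XTTS v2 model. To be clear, this is an illustration under my own assumptions: the file names and sample text are hypothetical, and this is not necessarily the tooling that was used on Fry’s voice.

```python
# pip install TTS  (the open-source Coqui TTS package)
from TTS.api import TTS

# Load a pre-trained multilingual model that supports zero-shot voice cloning.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone a voice from a short reference recording and make it say something new.
# "reference_clip.wav" is a hypothetical file containing a few seconds of the
# target speaker's voice; no training or fine-tuning step is required.
tts.tts_to_file(
    text="This sentence was never actually spoken by the person you hear.",
    speaker_wav="reference_clip.wav",
    language="en",
    file_path="cloned_output.wav",
)
```

That’s the whole workflow: a short clip in, a convincing synthetic voice out, in seconds to minutes on ordinary hardware.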
Of course, there’s a great deal of mundane utility to be extracted. Think audiobooks, voice dubbing, voice assistants, social media content, podcasts, and video games. Last week, I created a YouTube Short (as a little experiment and promotion for the newsletter) and used a pre-made synthetic voice to read out a script that I wrote; the voiceover was generated in mere seconds.
However, there’s a dark side to all of this that cannot be separated from its usefulness. As is often the case, powerful tools can be a force for good in the world and make heavy work light, but in the hands of the ill-intentioned the same tools can be turned into weapons and destroy more than we bargained for.
In many ways, we are our voice
Let me start by saying that losing one’s voice can feel like losing a limb. We perceive it as an inalienable part of our personal identity and it holds a great deal of power.
Evidence of that can be found in myths and folklore, from The Little Mermaid, where Ariel trades her voice away to the evil Ursula, to Greek mythology, where Hera, upon learning that the nymph Echo has been covering for Zeus’ infidelities, punishes her by taking away what’s most precious to her: her voice. From that moment on, Echo can only repeat what others around her are saying.
A voice can charm, seduce, and evoke a sense of authority. We use our voice to communicate ideas, ask for what we want, and express how we feel. Thus, being able to steal someone’s voice and manipulate it as we see fit gives us a uniquely intimate power over one another that we previously didn’t have. And the real-world effects are already showing.
In March, the FTC issued a warning urging people to watch out for scammers using voice clones of loved ones. A short audio clip is all they need. Now say you get a call from a family member who’s in trouble: how would you tell whether it’s your grandson on the other end of the line or a scammer who’s out to get your money?
Reputations are being tarnished, too. Martin Lewis, a respected finance journalist from the UK, was the subject of a recent deepfake investment scam that used his face and voice to promote a non-existent venture called Quantum AI. The scam banked on Lewis’ authority and profile to dupe viewers into placing trust in the made-up scheme, to the point where he had friends coming up to him saying: “Hey, I’ve just put some money in that investment scheme you’re advertising”.
In June, I wrote a post about how influencers are creating AI doppelgängers so fans can talk to them 24/7, taking advantage of young and vulnerable audience members who are willing to spend a lot of money to get some face time with their idol.
This last example doesn’t involve someone’s voice being taken from them, but it does show the power synthetic voices have over those who are exposed to them.
As if these examples weren’t serious enough already, it’s easy to imagine even more nefarious uses, like political deepfakes, coordinated misinformation campaigns, and extortion schemes, where voice clones can be leveraged as a supremely effective tool to mislead and divert on a massive scale.
Even the law can’t save us
As it turns out, the law isn’t going to save us. Stephen Fry might have a chance, since the documentary makers cloned and used his voice without consent. Most likely, such a case can be tried under copyright law or something known as ‘passing off’ law, which prevents people from “misrepresenting goods or services as being the goods and services of another”. But legally, there’s no such thing as voice ownership. You don’t own your voice. Nobody does.
European law defines biometric data in GDPR Article 4 as:
“… personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person, such as facial images or dactyloscopic [fingerprint] data.”
As of right now, this definition does not expressly cover our voice.
Recently, the United Voice Artists (UVA) and the National Association of Voice Actors (NAVA) submitted an amendment proposal for the European Union Artificial Intelligence Act (AI Act). The press release reads:
United Voice Artists and The National Association of Voice Actors call upon policymakers and stakeholders within the European Union to carefully consider the proposed amendment, recognizing its potential to elevate the voice acting community while setting a global standard for AI legislation that respects artistic integrity and the rights of creators.
Overall, the provisions are aimed at making sure that when AI is used, the people whose work it learns from receive proper credit and compensation, something I wholeheartedly support.
Yet, it feels inconsequential to the broader issue at hand. Yes, it’ll give creators, actors, and other public figures a fair day in court, but it does nothing to prevent their voice from being stolen and misused in the first place — or any voice, for that matter.
The coming wave
After putting some thought into it, I feel there isn’t much we can do to forestall this, let alone avert it. That may sound dark and cynical, but I’m afraid we have to prepare for the wave to come.
Synthetic voices will become ubiquitous. The technology itself will only get better and cheaper, because that has been its trajectory so far. We will need less data to produce more convincing voices. And soon, indistinguishable voice clones will be supercharged with video that is just as convincing as the audio, at a cost that will tumble to free or as good as free, one tap away on our smartphones and personal devices.
We will be able to create alternative histories. Nothing is holding people back from altering current or past events to fit their narrative or ideology, whether it’s to protect or dismantle nation states, or to boost or tarnish reputations. And it’ll be available to all of us: high-school kids, journalists, politicians, organized crime groups, and authoritarian regimes.
I cannot help but think of Mustafa Suleyman’s book The Coming Wave, which I referred to in my last post about AGI, in which he attempts to shift the debate around AI towards proliferation and containment.
Ultimately, it is a containment problem, and I’m fairly certain that cheap and readily available voice cloning technology is something we cannot contain. It’s an upending reality that we’ll have to learn to live with and come to terms with, whether we like it or not.
Join the conversation
Leave a comment with your thoughts. Do you feel voice clones are going to be uniquely disruptive, or are you more optimistic about how this technology will shape society? 💬
Very interesting read. Fascinating how a person's voice is not explicitly covered as biometric data under GDPR, yet it is such a central part of anyone's identity.
Thanks for your thoughtful reflection on this important topic. Well done. I just recommended your blog to my visitors.
To me, the lessons here seem clear. If we're going to create technologies of vast scale such as AI, then we are no longer in the driver's seat. We are along for the ride, wherever it may take us, like it or not.
Imagine that you got on a bus in Peru, and the bus began its journey down twisting and turning mountain roads overlooking deep cliffs on every side. The bus may take you to beautiful locations filled with beautiful people, or it may go over a cliff. Once you're on that bus speeding down the mountain road, you can't get off, and you just have to accept whatever fate awaits you, for better or worse.
It's only a matter of time until average users can easily create fake videos that can't be distinguished from real people and events. All media will lose credibility, as there will be no way to know what is real and what is not.
The whole thing has been a bad idea right from the start. But it's going to happen anyway. Like everyone else, I'm a passenger on the bus and can't get off. And so while we travel down the mountain road together, I'm going to have some fun with it, as there is no other rational choice available.
Maybe we'll get lucky. Maybe we won't. It's out of our hands at this point.