Text-to-speech is a technology that imitates the human voice, allowing any written text to be converted into audio. Whether it is used to address accessibility needs, to make announcements on public transport, or to add voice to videos and video games, voice AI has made huge progress. Since 2020, the use of synthesized voices has exploded, especially on social networks.
Let’s take a look at its evolution, its ethical implications and its place today in our society.
History and evolution of synthetic voices
What is the history of the synthetic voice, and when did it come into being? How has it evolved since its creation, and what place does it hold today in our society and uses? Computer-generated voice is now widespread and no longer reserved for the visually impaired or blind, for whom text reading and voice interaction are essential. It can be found on social networks, in smart speakers and voice assistants (Google Assistant, Alexa, Siri), in movies, GPS navigation, the media and even in space!
Speech synthesis dates back to 1791, when the Hungarian inventor Wolfgang von Kempelen imagined a “Speaking Machine” composed of different parts reproducing and imitating the organs involved in speech: lungs, chest, mouth, nostrils. Von Kempelen’s work continued to inspire scientists over the following centuries; in 1939, a German university reproduced his Speaking Machine.
During the 1940s, Bell Laboratories in the United States designed the first electronic voice synthesizer, a revolution in computing and voice-synthesis technology. Named the “Vocoder” (a contraction of “Voice Encoder”), it has undergone several evolutions over the years; its initial purpose was to facilitate the transmission of telephone calls and reduce their cost. The process consisted of encoding the sound signal on the transmitting side of the call and decoding it on the receiver’s side, reducing the amount of information carried by each call and thus saving bandwidth.
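The bandwidth-saving principle described above can be illustrated with a toy, single-band sketch. This is a deliberate simplification of Bell Labs’ actual multi-band design, with made-up frame sizes and frequencies: the transmitter sends only a coarse amplitude envelope (one value per frame) instead of every audio sample, and the receiver re-excites a locally generated carrier with that envelope.

```python
import math

SAMPLE_RATE = 8000   # samples per second (toy value)
FRAME = 80           # 10 ms frames at 8 kHz

def encode(samples):
    """Transmitter side: keep only one envelope value per frame."""
    return [max(abs(s) for s in samples[i:i + FRAME])
            for i in range(0, len(samples), FRAME)]

def decode(envelope, freq=110.0):
    """Receiver side: modulate a locally generated carrier
    (a simple sine 'buzz') with the received envelope."""
    out = []
    for n, amp in enumerate(envelope):
        for k in range(FRAME):
            t = (n * FRAME + k) / SAMPLE_RATE
            out.append(amp * math.sin(2 * math.pi * freq * t))
    return out

# One second of a toy 'voiced' signal: a 220 Hz tone that fades out.
signal = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) * (1 - n / SAMPLE_RATE)
          for n in range(SAMPLE_RATE)]

envelope = encode(signal)          # what actually crosses the wire
reconstructed = decode(envelope)   # intelligible-ish approximation

# The envelope is 80x smaller than the raw signal.
print(len(signal), len(envelope))  # 8000 100
```

Here only 100 envelope values are transmitted instead of 8,000 raw samples, which is the essence of the rate reduction; the real Vocoder did this per frequency band and with voiced/unvoiced excitation decisions.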
The Bell Labs technology was used in various sectors, including by the U.S. military to communicate in encrypted form during World War II. The Vocoder’s robotic audio effects have also served as the musical basis for many world-famous hits, such as those of the electronic music duo Daft Punk.
A rapid evolution of quality and uses
The transformation of written content into audio dictated by synthetic voices offers many advantages:
- Time savings, since audio generation takes only a few seconds;
- Quick adaptation into several languages;
- Resource savings, since producing audio with human voices is far more expensive and subject to specific rights.
Synthetic voices have reached a quality very close to that of the human voice, and progress in this field has been stupendous, especially over the last twenty years. Today, text-to-speech can be found everywhere, whether at work, on social networks or during our journeys. In France, we cannot talk about text-to-speech without mentioning the SNCF. Since the 1980s, the French national railway company has been working on sound and audio, which are an integral part of its brand identity; its sound design and signature are now known to all. A few years ago, the SNCF called on Voxygen, a company that develops personalized, AI-based voice-synthesis solutions, to create “e-Mone” and to digitize and immortalize the voice of Simone Hérault, the SNCF’s leading voice talent for forty years.
Today, synthetic voices are widely used in mobile situations: in trains, in subways, at crosswalks (audio guidance for the visually impaired), and so on. Transport companies were pioneers in introducing audio on board trains and in stations. As early as the 1990s, the SNCF launched its audio design using synthetic voices, a unique solution in that it allowed messages and thousands of station names to be converted to audio in record time, a challenge impossible to meet with human voices.
Synthetic voices take over social networks
The introduction of text-to-speech on social networks has revolutionized their use: it is now almost systematic in short videos.
TikTok and Instagram introduce the synthetic voice, a very popular option
In December 2020, TikTok introduced a new “text-to-speech” feature that lets users have on-screen text (subtitles) read aloud by synthetic voices. The effect was an instant hit, notably for its humorous side, as the voice would regularly misread the text. Two years later, text-to-speech is still widely used in videos, and the mispronunciations have become part of the charm of the app and of this feature.
Almost a year after TikTok, Instagram launched its own text-to-speech and voice-effect options in Reels, its new flagship short-video format, to compete with the Chinese player. For both social-network behemoths, this also allowed more introverted internet users to unleash their creativity without using their own voice, and to go viral with robotic, helium or autotuned voice effects. In 2021, TikTok teamed up with Disney and the Marvel universe, adding the voices of beloved characters such as Chewbacca or Stitch on the occasion of the first anniversary of the Disney+ streaming platform.
Cinema, connected speakers: cloning the human voice through synthesis
“Alexa tells a story with grandma’s voice”
At its “re:MARS” conference held in Las Vegas in June 2022, Amazon unveiled a new artificial-intelligence feature for Alexa (its flagship connected speaker) that will allow it to mimic anyone’s voice after simply listening to it. According to Rohit Prasad, Amazon senior vice president and head scientist for Alexa, this new technology is a big step forward and a way to help ease the pain of losing a loved one by allowing that person’s voice to be immortalized and listened to at will. During the demonstration, a child asked Alexa to read a story in his “grandmother’s” voice, a challenge the smart speaker took up after a few seconds of processing.
“Top Gun – Maverick” recreates Val Kilmer’s voice using AI and CGI
Actor Val Kilmer, who starred in the first installment of the Top Gun saga, lost his voice in 2014 as a result of throat cancer. To recreate it, he collaborated with Sonantic, a company specializing in artificial-intelligence voices for movies and video games, which was acquired by Spotify in June 2022.
In “Top Gun: Maverick,” the character played by the actor also has cancer and types to express himself. His single, brief line of dialogue was made possible by Sonantic’s technology. While such voice models are usually built from scripts read specifically for the purpose, here the company’s learning models recreated a voice close to Val Kilmer’s from several existing recordings, once again giving him a way to communicate by voice in new projects.
Several companies, such as Sonantic, Respeecher, Voxygen, Acapela, IRCAM and WellSaid Labs, have specialized in creating personalized voices and cloning human voices to provide voice-overs in movies and video games.
Ethical issues and regulations to be established
The stakes for production companies are enormous: reducing content-production costs (no need to bring actors into the studio), simplifying and cutting the cost of rights management, and even keeping productions alive after a performer’s death.
By tying itself directly and intimately to the human voice, as in the cases of Alexa and Top Gun, the synthetic voice raises new ethical and moral questions. All over the world, the synthetic reproduction of faces in video and of voices through speech synthesis is now commonplace. While some people see the immortalization of a deceased person’s voice as positive and therapeutic, a legal framework is still needed for these technological innovations, which have grown exponentially in recent years. For the moment, legal texts exist concerning “digital death,” i.e. the right of a deceased person’s family to keep or erase the data linked to their digital presence. Moreover, the risk of deepfakes (faked video or audio) and identity theft (of image as well as voice) will pose great challenges in terms of security and authentication. It will therefore be interesting to see how the protection of voice data evolves under the GDPR (RGPD in France), now that the various synthesis technologies have reached the private sphere, notably through connected speakers.
At ETX Studio, we follow these developments closely through our RevoluSOUND Observatory and regularly study the various players in the sector. Within our ETX platform, we have selected the best voices on the market and enriched them with our pronunciation lexicon, enabling companies to create audio and video content from text in just a few clicks. The uses are many: website accessibility, newsletters, media monitoring, and internal communications in audio and mobile formats. We use text-to-speech to transform the websites and content of media outlets, brands and institutions such as Le Figaro, La Tribune and the French Senate. Contact us to integrate digital audio into your editorial strategy and attract, retain or monetize new mobile or voice-first audiences.