Major AI trends in digital audio

Artificial intelligence continues to gain ground in many fields, radically transforming the jobs and practices we have known until now. AI is being deployed in sectors as diverse as medicine, finance and the automotive industry, and digital audio is no exception. Advances in sound processing and data analysis are reshaping the audio experience, redefining the way audiences and creators consume and imagine content. In this article, we explore the major recent AI trends in digital audio, through the innovations and implications of this sound revolution.

Artificial intelligence in music

Advances in artificial intelligence are multiplying and now make it possible to compose music. Several tools have been developed for this purpose. Among the most important is Jukebox, launched in 2020 by OpenAI, the company behind ChatGPT. Jukebox is a neural network that generates music, including basic melodies and songs, as raw audio in a variety of genres and artistic styles.

In January 2023, Google unveiled a similar tool called MusicLM, a music ChatGPT of sorts. From a text prompt (a short description), it can generate an audio clip anywhere from 30 seconds to 5 minutes long. The model was trained on a huge dataset of 280,000 hours of music. MusicLM can create sound content in a specific musical genre, and can even generate an audio sequence by analyzing a painting and its caption. In May 2023, Google made MusicLM available to the public: users can ask the tool to design a soundtrack simply by providing a description. To date, Google’s music AI does not let users generate lyrics to accompany a song. Many AI-powered music tools already exist around the world, including India’s Beatoven, Amper Music (acquired by the American company Shutterstock) and Aiva, a Luxembourg-based solution, among others.
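To give a concrete idea of how this kind of text-to-music prompting works: MusicLM itself is not exposed as a developer API, so the minimal sketch below uses a comparable open model, Meta’s MusicGen, via the Hugging Face transformers library. The model choice and prompt are our own illustrative assumptions, not Google’s tooling.

```python
# Hedged sketch of text-to-music generation, using Meta's open MusicGen
# model as a stand-in for MusicLM (which has no public API).
# pip install transformers torch scipy
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# The text prompt plays the role of the "description" mentioned above
inputs = processor(
    text=["relaxing jazz with a slow saxophone melody"],
    padding=True,
    return_tensors="pt",
)

# ~256 new tokens corresponds to roughly five seconds of audio
audio_values = model.generate(**inputs, max_new_tokens=256)

# The model returns raw audio, which can be saved as a WAV file
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("generated.wav", rate=sampling_rate,
                       data=audio_values[0, 0].numpy())
```

The principle is the same as with MusicLM: a plain-language description in, raw audio out.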

Among the greatest innovations in musical AI, it’s impossible not to mention the South Korean label Hybe, responsible for the success of the band BTS. In May 2023, the label launched a new artist, “Midnatt”, the artificial alter ego of Lee Hyun, one of its biggest stars, whose song Masquerade was released on the same day in six languages: Korean, English, Japanese, Chinese, Spanish and Vietnamese. This project was made possible by Supertone, an AI-based voice synthesis technology acquired by Hybe in 2022 for $36 million. K-Pop has enjoyed global success for years, and thanks to voice AI, artists now have a way of reaching their international audience in a whole new way: in their native language.

At the same time, a fictitious collaboration was created between two musical heavyweights: Drake and The Weeknd. An Internet user going by the name Ghostwriter created a track called Heart on My Sleeve using AI-generated versions of the Canadian singers’ voices. The track went viral in just a few days, impressing fans with its similarity to the originals. For copyright reasons, however, the streaming platforms that had hosted the song for several days were quick to remove it.

As far as music is concerned, AI still has a long way to go before it can compete with humans in terms of quality and creativity, but given the speed at which it is evolving, many more surprises are surely in store.

AI in virtual assistants and voice chatbots

The market for virtual assistants and voice chatbots has grown significantly in recent years. In 2022, an estimated 142 million people in the US, or 42.1% of the population, used a voice assistant (Insider Intelligence). Technological advances, particularly in artificial intelligence and natural language processing, have paved the way for many increasingly innovative solutions. Virtual assistants and voice chatbots are used in sectors such as customer service, e-commerce, healthcare and financial services, among others. Their main benefits are numerous: automation of repetitive tasks, round-the-clock availability, reduced operational costs and improved customer experience.

Companies such as Google, Amazon and Apple have all developed their own virtual assistants. Google offers Google Assistant, Amazon offers Alexa and Apple offers Siri. In addition to the big players, many startups and specialized entities have also entered the field of virtual assistants and voice chatbots, to meet more specific needs.

Leading AIs including Siri, Alexa and Google Assistant are on a constant quest for improvement. In June 2022, Amazon unveiled a new Alexa AI feature that would allow it to imitate anyone’s voice simply by listening to a recording. According to Rohit Prasad, Senior Vice President of Amazon and Chief Scientist of Alexa, this new technology makes memories last and is “a great step forward and a way to heal the pain of losing a loved one by allowing their voice to be immortalized”. During a demonstration of the feature, a child asked Alexa to read a story in her “grandma’s” voice, a challenge the smart speaker completed after a few seconds’ thought. This naturally raises ethical and psychological questions about the use and misuse of such technological advances.
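Amazon has not published how this Alexa feature works. As a rough illustration of the underlying technique, zero-shot voice cloning from a short reference recording, here is a minimal sketch using the open-source Coqui TTS library; the model name, text and file paths are assumptions for the example, not Amazon’s system.

```python
# Hedged sketch: cloning a voice from a short reference clip with the
# open-source Coqui TTS library (not Amazon's actual technology).
# pip install TTS
from TTS.api import TTS

# XTTS v2 is a multilingual model that can mimic a voice
# from just a few seconds of reference audio
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Once upon a time, there was a little bear...",
    speaker_wav="grandma_recording.wav",  # hypothetical reference recording
    language="en",
    file_path="story_in_cloned_voice.wav",
)
```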

More on the subject: Amazon is discontinuing Alexa’s celebrity voices

AI, both the origin of and the solution to deepfakes and voice fraud

The use of AI in the field of deepfakes has raised major concerns. Deepfakes are synthetic multimedia content (visual or audio) created using AI to trick users with authentic-looking videos or audio tracks.

Deepfakes (deep learning + fake) are created using machine learning technologies. These algorithms are trained on large datasets, such as videos or audio recordings, to generate synthetic content that is virtually indistinguishable from reality. The aim is to make people appear to say words they never said, or perform actions they never did. This technology first spread to video, with manipulated images featuring politicians, celebrities such as soccer players, or even ordinary people. Fortunately, every problem has its solution: in the case of deepfakes, AI is also being used to detect and counter these vocal and visual frauds, with the aim of protecting users, especially the most vulnerable.

Advanced algorithms based on machine learning can analyze voice characteristics, emotion and tone, speech patterns and other parameters to assess the authenticity of vocal content. Using these techniques, AI is now powerful enough to detect signs of manipulation and identify recordings that are not authentic.
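As a simplified illustration of what such a detector looks like in practice, here is a minimal sketch of a feature-based classifier built with the librosa and scikit-learn Python libraries. The labeled audio files are hypothetical placeholders, and production anti-spoofing systems are considerably more sophisticated; the sketch only shows the general shape of the pipeline.

```python
# Minimal sketch of feature-based voice anti-spoofing, assuming you already
# have labeled genuine and synthetic recordings.
# pip install librosa scikit-learn numpy
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def voice_features(path: str) -> np.ndarray:
    """Summarize a recording as MFCC statistics plus spectral centroid."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           centroid.mean(axis=1)])

# Hypothetical labeled clips: 0 = genuine, 1 = synthetic
genuine_files = ["real_01.wav", "real_02.wav"]
spoofed_files = ["fake_01.wav", "fake_02.wav"]

X = np.array([voice_features(p) for p in genuine_files + spoofed_files])
y = np.array([0] * len(genuine_files) + [1] * len(spoofed_files))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Score a new recording: estimated probability that it is synthetic
print(clf.predict_proba(voice_features("suspect.wav").reshape(1, -1))[0, 1])
```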

The use of AI in the creation of deepfakes today raises many concerns from an ethical and legal point of view. However, AI is also a driving force in the detection and prevention of these voice frauds. As technologies evolve, it is essential to continue developing tools and strategies to combat potential threats and protect the integrity of information and citizens around the world. 

AI and text-to-speech

Omnipresent in our daily lives, on our social networks, in our train stations, our voice assistants and our media outlets, text-to-speech has evolved radically in recent years, moving from robotic-sounding output to a powerful, practical solution. Its greatest advantage is that it enables text content to be read aloud without the need for a human voice, saving companies time and resources.
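For readers curious about the mechanics, here is a minimal sketch of text-to-speech in a few lines of Python, using the open-source gTTS library as a stand-in for the commercial neural voices discussed below; the sample text and file name are illustrative.

```python
# Minimal text-to-speech sketch with the open-source gTTS library
# (commercial solutions use far more natural neural voices).
# pip install gTTS
from gtts import gTTS

article_text = "Artificial intelligence continues to gain ground in digital audio."
tts = gTTS(article_text, lang="en")
tts.save("article_audio.mp3")  # ready to embed in a web page or podcast feed
```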

More on the subject: Social media, instant messaging, dating apps: social audio is booming

One example is Apple, which at the end of 2022 launched audiobooks read by digital voices. This feat was made possible by generative AI, which enables content to be produced without human intervention. Madison, a “digital voice inspired by human narration”, currently works only for English-language content. To access these titles, users simply type “AI narration” into the app’s search engine. By choosing to have some of its books read using text-to-speech technology, Apple has not only taken a risk but also cut one of the most significant costs in audiobook creation: voice actors and production.

Computer-generated voices have also found their way into the media sector with automatic audio reading of articles. Many media outlets now offer this mode of consumption to their audiences, including Le Figaro, La Tribune and Le Point, for which we provide our technology.

At ETX Studio, we’re constantly examining key players in the digital audio sector. We put our technology at the service of companies to facilitate the transformation of their text content into audio and video, in just a few clicks and in several languages and accents. The possibilities for synthesized-voice audio content are many: making websites more accessible, creating podcasts, audio newsletters, training resources, news monitoring or internal communications, to be consumed anywhere, anytime, especially on the move.

We already transform websites, content and apps for leading media outlets, brands and institutions, thanks to our expertise in artificial and organic audio. Would you like to find out more about our technology? Contact us here.
