The Inner Detail

Google DeepMind’s new AI model can generate Audio from Video: Check how it works

[Image: Google AI tool for Audio]

Google has built yet another AI model, and this one could help the film industry and content creators profoundly. It’s a video-to-audio AI tool that can create quality soundtracks from videos and text prompts.

While the surge of AI technology has produced many text-to-video tools, most of them either generate audio that doesn’t match the video or output silent videos altogether. Google DeepMind wants to fill this gap with an AI model that can create rich soundtracks from videos and text prompts.

In case you don’t know, DeepMind is a subsidiary of Google, which acquired it back in 2014. DeepMind has produced many interesting pieces of research, including this one.

Video to Audio (V2A) Technology

V2A technology by DeepMind aims to bring AI-generated videos to life, meaning that one could soon create a complete movie using AI alone. V2A combines video pixels with natural-language text prompts to generate rich soundscapes for the on-screen action.

Google has already unveiled its video-generation model “Veo”, and V2A will help create realistic sound effects, or even dialogue, that matches the characters and tone of a video.

How it works

The V2A tool starts by encoding the video into a compressed representation, then uses a diffusion-model approach to iteratively refine the audio from random noise. Each refinement step is guided by the visual input and the text prompt provided, producing synchronized, realistic audio that closely fits both the video and the prompt. Finally, the audio output is decoded, turned into an audio waveform and combined with the video data.
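
To make that data flow concrete, here is a minimal, runnable Python sketch of those four stages. DeepMind has not released V2A’s code, so every component below (the mean-pooling “encoder”, the blending “denoise step”, the identity “decoder”) is a toy stand-in that only mirrors the pipeline’s shape, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    """Toy stand-in encoder: compress frames to one vector per frame."""
    return frames.mean(axis=(2, 3))                # (T, C) from (T, C, H, W)

def denoise_step(audio, video_latent, prompt):
    """Toy stand-in for one conditioned refinement step. A real diffusion
    model would predict and remove noise using the video latent and the
    text prompt; this stand-in just nudges the noisy audio toward a target
    derived from the video latent (the prompt is accepted but unused)."""
    target = np.resize(video_latent.ravel(), audio.shape)
    return audio + 0.15 * (target - audio)

def decode_audio(latent):
    """Toy stand-in decoder: treat the latent itself as the waveform."""
    return latent

def video_to_audio(frames, prompt="rain on a tin roof", steps=30):
    video_latent = encode_video(frames)            # 1. compress the video
    audio = rng.standard_normal(16_000)            # 2. start from pure noise
    for _ in range(steps):                         # 3. iterative refinement
        audio = denoise_step(audio, video_latent, prompt)
    return decode_audio(audio)                     # 4. decode to a waveform

frames = rng.standard_normal((24, 3, 8, 8))        # 24 tiny fake video frames
waveform = video_to_audio(frames)
print(waveform.shape)                              # (16000,): ~1 s at 16 kHz
```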

Many generative AI models use the diffusion approach, including Google’s text-to-image model, because of its ability to create ultra-realistic output. To put it briefly, a diffusion model takes photos, videos or audio during training and progressively converts them into noise. When put into action, the model reverses that process: it starts from random noise and refines it step by step until a clear picture of what is required emerges. This denoising loop runs in the background for however many steps it takes to produce the result.
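
Below is a small, runnable NumPy illustration of that idea, following the standard DDPM-style noising schedule and reverse update. The noise predictor here is an analytic “oracle” that peeks at the clean signal; in a real model like V2A it would be a trained neural network conditioned on video and text.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200                                     # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)          # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

clean = np.sin(np.linspace(0, 4 * np.pi, 64))   # "clean" signal (audio stand-in)

def predict_noise(x_t, t):
    """Oracle stand-in for the learned noise predictor: it inverts the
    forward process x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps exactly."""
    return (x_t - np.sqrt(alpha_bars[t]) * clean) / np.sqrt(1 - alpha_bars[t])

x = rng.standard_normal(clean.shape)        # start from pure noise
for t in reversed(range(T)):                # reverse (denoising) process
    eps_hat = predict_noise(x, t)
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                               # re-inject a little noise (DDPM)
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print("mean reconstruction error:", float(np.abs(x - clean).mean()))
```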



Salient features of V2A AI model

To generate higher quality audio and add the ability to guide the model towards generating specific sounds, DeepMind added more information to the training process, including AI-generated annotations with detailed descriptions of sound and transcripts of spoken dialogue.
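
As a rough picture of what such an enriched training example might look like, here is a hedged sketch; the field names and file names are assumptions, since DeepMind has not published its data schema.

```python
# Hypothetical structure of one training example; all names are assumptions.
training_example = {
    "video": "clip_0042.mp4",          # raw footage
    "audio": "clip_0042.wav",          # ground-truth soundtrack
    "sound_annotation": "heavy rain, distant thunder, footsteps on gravel",
    "dialogue_transcript": "We should head back before the storm hits.",
}
```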

One of the notable features is that the AI model doesn’t require you to sync the audio and video; it does this automatically. “The system doesn’t need manual alignment of the generated sound with the video, which involves tediously adjusting different elements of sounds, visuals and timings,” says DeepMind.

The model can generate audio from video alone, without requiring a text prompt, since it understands videos at the level of raw pixels; this sets it apart from other, similar models. V2A can also create several different soundtracks for the same video input, as the sketch below illustrates.
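
Continuing with the toy `video_to_audio` sketch from the “How it works” section (this snippet assumes that code has already been run), each call starts from a fresh random-noise draw, so the same footage yields a different candidate soundtrack every time, mirroring the behaviour described for V2A.

```python
# Generate three alternative soundtracks for the same (fake) footage.
# Each call to video_to_audio draws fresh starting noise, so the
# candidates differ even though the video and prompt stay the same.
candidates = [video_to_audio(frames) for _ in range(3)]
print("difference between takes:",
      float(np.abs(candidates[0] - candidates[1]).max()))
```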

It can also generate soundtracks for a range of traditional footage, including archival material, silent films and more — opening a wider range of creative opportunities.


(For more such interesting information on technology and innovation, keep reading The Inner Detail.)

Kindly add ‘The Inner Detail’ to your Google News Feed by following us!
