DeepMind's New AI Brings Videos to Life with Sound


AI is levelling up yet again: we now have models capable of generating sounds for videos. DeepMind, Google's AI research lab, is developing technology that understands a video down to its pixels, generates sound for it, and syncs that sound with the footage.


Video-generation models have been around for a while now, especially since OpenAI announced Sora, its video-generation model, but one thing is obviously missing from the videos: sound. Most current video-generation models produce mute videos, and matching sounds then have to be created separately to turn the output into something useful, like this "Air Head" short film made with Sora.

DeepMind says that the V2A (video-to-audio) technology it is developing will be a big leap toward making generated movies feel closer to the real thing.

"Video generation models are advancing at an incredible pace, but many current systems can only generate silent output. One of the next major steps toward bringing generated movies to life is creating soundtracks for these silent videos." DeepMind said in a blog.

DeepMind's V2A technology works by first encoding the video's pixels into a compressed representation (simplified data for the model to work with), then generating audio starting from random noise. It repeatedly refines that audio, conditioned on the video and any text prompts provided, until it produces realistic sound that is synchronised with the footage.
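For intuition, here's a toy sketch in Python of that generate-then-refine loop. Everything in it is invented for illustration (the stand-in encoder, the "denoise" update, the shapes and step counts); the real V2A is a trained diffusion model whose internals DeepMind hasn't published.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    # frames: (num_frames, height, width); collapse the pixels into one
    # number per frame as a stand-in "compressed representation"
    return frames.reshape(frames.shape[0], -1).mean(axis=1)

def denoise_step(audio, video_code, prompt_code, step_size=0.1):
    # stand-in for one learned refinement step: nudge the noisy audio
    # toward a target derived from the conditioning signals
    target = video_code.mean() + prompt_code.mean()
    return audio + step_size * (target - audio)

frames = rng.random((16, 8, 8))        # fake 16-frame greyscale clip
video_code = encode_video(frames)      # "pixels -> reduced representation"
prompt_code = rng.random(4)            # fake embedding of a text prompt
audio = rng.standard_normal(1024)      # the audio starts as pure noise

for _ in range(50):                    # iterative refinement loop
    audio = denoise_step(audio, video_code, prompt_code)

print(audio[:5])                       # first few refined samples (toy)
```

The shape of the loop is the point: the audio begins as meaningless noise, and each pass pulls it a little closer to something consistent with the video and the prompt.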


Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete


Prompt for audio: A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd


It's fascinating what AI is now able to do. This technology was going to arrive sooner or later anyway, and it is a definite leap in innovation today.

“By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” DeepMind says.

The ethics of developing and using such a powerful tool surely have to be considered, with safeguards and identification measures to keep the model's audio generations responsible and to protect it from misuse. In that light, DeepMind is putting the V2A technology through rigorous safety assessments and testing. It will also watermark V2A's creations with its SynthID technology. The watermarks are imperceptible to humans but can be detected with the right tools.

The SynthID technology essentially allows DeepMind to detect AI-generated content and to tell whether a generation was made with its own tools. It will help address the problems of misinformation and deception that come with AI. Many other AI companies have implemented similar technology in their models.
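To make the "imperceptible but detectable" idea concrete, here's a classic spread-spectrum sketch: hide a low-amplitude pseudorandom pattern in the audio, then detect it later by correlating against the same pattern. This is a textbook illustration, not SynthID's actual method, which DeepMind hasn't detailed here.

```python
import numpy as np

rng = np.random.default_rng(42)
key = rng.choice([-1.0, 1.0], size=48_000)   # secret pseudorandom pattern

def embed(audio, strength=0.05):
    # add the pattern at low amplitude, buried under the audio itself
    return audio + strength * key

def detect(audio, threshold=0.02):
    # correlate with the secret key; marked audio scores far above chance
    return float(np.dot(audio, key)) / audio.size > threshold

audio = rng.standard_normal(48_000)          # one second of fake 48 kHz audio
print(detect(embed(audio)), detect(audio))   # expected: True False
```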

AI-generated soundtracks aren't exactly new. Stability AI released theirs recently, and ElevenLabs launched one sometime in May. What's different about DeepMind's V2A technology is the creative control it gives users: they can shape the kind of audio generated with a text prompt. Prompts are optional, but they let users guide the model toward a desired output or away from one.
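As a hypothetical illustration of what guiding a model toward or away from an output can look like under the hood, here's a minimal sketch in the style of classifier-free guidance, a common technique in diffusion models. The function, the guidance scale, and the assumption that V2A uses this exact scheme are all mine, not DeepMind's.

```python
import numpy as np

def guided_prediction(pred_neg, pred_pos, scale=3.0):
    # push the output toward the prediction conditioned on the desired
    # prompt and away from the one conditioned on what to avoid
    return pred_neg + scale * (pred_pos - pred_neg)

rng = np.random.default_rng(1)
pred_pos = rng.standard_normal(8)  # output conditioned on the wanted prompt
pred_neg = rng.standard_normal(8)  # output conditioned on the unwanted one
print(guided_prediction(pred_neg, pred_pos))
```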

Take a listen to the examples below.


Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi

Prompt for audio: Ethereal cello atmosphere


DeepMind knows that V2A isn't there yet. The model wasn't trained on many videos containing artifacts or distortions, the kinds of flaws that make videos low quality. And since the generated audio depends on the quality of the visual input, the model may not produce great audio from poor footage.

Lip synchronisation is another concern for DeepMind, and something the team wants to work on more. The model isn't great at this yet, so lip movements may not match the audio well.


Prompt for audio: Music, Transcript: “this turkey looks amazing, I’m so hungry”


There's still a lot of work to be done, so DeepMind isn't releasing the technology anytime soon.

“To make sure our V2A technology can have a positive impact on the creative community, we’re gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development,” DeepMind says. “Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing.”

V2A could be a really useful tool, but it is bound to cause some level of disruption in the film and videography industries, as is usual with these AI technologies. People will soon be able to sit at home, or anywhere else, and simply generate the videos they'd like to use for basically anything, audio included. What's going to happen to the people who currently do these jobs? I'm not sure, but I guess there's still a bit of time to sort out legal protections and the like.

Earn from your content on Hive via InLeo while truly owning your account. If you're new, sign up in a few minutes by clicking here! And here's a guide on navigating the platform.



Posted Using InLeo Alpha



1 comment

This is another insane development. The whole scope of AI keeps getting wider. I now doubt there will be a limit to the things AI can be used for.

It just hit me that many actors and actresses could be without jobs in the next few years if this AI can replicate exactly what they do. A random person could become a filmmaker or producer as long as he or she can access the AI, prompt it with a good script, and market the movie.
