Text-to-video AIs like Sora

Sora (OpenAI)

Sora is the most recently announced of these tools but has caused the biggest stir, partly because it comes from OpenAI, the developer of ChatGPT, but mainly because of the quality of the videos it creates from text commands alone.

The company's experience with ChatGPT also gives Sora a deep understanding of language. Demonstration clips show characters moving and expressing themselves as realistically as in a film shot by people.

A "surreal" video created by Sora from text commands

But Sora is not yet available to the public, for safety reasons. OpenAI will take careful measures before a general release, especially given the growing use of AI for nefarious purposes, such as impersonating people or committing crimes.

Lumiere (Google)

Lumiere is a Google model that also generates videos from text input, using a diffusion model built on the STUNet (Space-Time U-Net) architecture. Rather than stitching still frames together, Lumiere identifies where the details are in the video (the spatial part) and tracks how they move and change over time (the temporal part) together, which helps the process run smoothly.
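As a rough illustration of handling space and time together, the toy sketch below shrinks a tiny grayscale clip along time, height, and width at once by averaging. This is not Lumiere's code: the function name `downsample_space_time`, the factor of 2, and plain averaging in place of learned layers are all assumptions made for illustration.

```python
def downsample_space_time(video, factor=2):
    """Average-pool a clip given as video[t][y][x] along time, height and width.

    Hypothetical helper: real models use learned layers, not plain averaging.
    Trailing frames, rows, or columns that do not fill a whole block are dropped.
    """
    frames = len(video) // factor
    height = len(video[0]) // factor
    width = len(video[0][0]) // factor
    pooled = []
    for t in range(frames):
        frame = []
        for y in range(height):
            row = []
            for x in range(width):
                # Collect one factor x factor x factor block spanning time and space.
                block = [
                    video[t * factor + dt][y * factor + dy][x * factor + dx]
                    for dt in range(factor)
                    for dy in range(factor)
                    for dx in range(factor)
                ]
                row.append(sum(block) / len(block))
            frame.append(row)
        pooled.append(frame)
    return pooled


# A 4-frame, 4x4 clip in which every pixel's value equals its frame index.
clip = [[[float(t) for _ in range(4)] for _ in range(4)] for t in range(4)]
small = downsample_space_time(clip)
print(len(small), len(small[0]), len(small[0][0]))  # 2 2 2
```

Each output value summarizes a small block that spans neighboring frames as well as neighboring pixels, which is the sense in which space and time are processed together rather than frame by frame.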

Like Sora, Lumiere has not been released to the public. Google introduced the model in late January 2024, after the launch of Gemini, the large language model that had just been integrated into Bard.

VideoPoet (Google)

This large language model (LLM), developed by Google Research in 2023, is trained on a huge repository of videos, photos, audio, and text. VideoPoet can take inputs such as text, photos, and videos and use them to create videos, highlight content, generate audio for videos, turn still images into animations, and more.

The original idea for VideoPoet stemmed from the need to turn any autoregressive language model into a video generation system. Current autoregressive language models can process text and programming code much as humans do, but they struggle with video. VideoPoet solves this by using tokenization to translate input from any format into a language the model can understand.
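The tokenization idea can be sketched with a toy example: each modality gets its own slice of one shared integer vocabulary, so text and pixels end up in a single sequence that a next-token model could read. Everything here is an assumption for illustration (byte-level text tokens, 16 brightness levels for pixels, and the names `TEXT_OFFSET`, `PIXEL_OFFSET`, `encode_text`, `encode_frame`, `build_sequence`); VideoPoet's real tokenizers are learned models and are not shown.

```python
TEXT_OFFSET = 0        # token ids 0..255 are reserved for text bytes
PIXEL_OFFSET = 256     # token ids 256..271 are reserved for pixel levels
PIXEL_LEVELS = 16      # assumed number of brightness buckets


def encode_text(text):
    """Map a string to token ids, one per UTF-8 byte."""
    return [TEXT_OFFSET + b for b in text.encode("utf-8")]


def encode_frame(frame):
    """Map a grayscale frame (rows of 0-255 ints) to quantized pixel tokens."""
    return [
        PIXEL_OFFSET + (value * PIXEL_LEVELS) // 256
        for row in frame
        for value in row
    ]


def build_sequence(prompt, frames):
    """Concatenate prompt tokens and frame tokens into one flat sequence."""
    tokens = encode_text(prompt)
    for frame in frames:
        tokens.extend(encode_frame(frame))
    return tokens


sequence = build_sequence("cat", [[[0, 255], [128, 64]]])
print(sequence)  # [99, 97, 116, 256, 271, 264, 260]
```

Because the two id ranges never overlap, a model reading the sequence can tell a text token from a pixel token by its value alone, and the same trick extends to further modalities by reserving more ranges.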

Most text-to-video tools are still in limited testing

Emu Video (Meta)

Besides Google and OpenAI, Meta is another Big Tech company active in AI development. The owner of Facebook has built a video-generation AI called Emu Video, which first converts text into an image and then uses that image, together with the text, as input to create clips.

Emu Video is receiving positive reviews from beta testers, with 81% preferring it over Imagen Video (Google). Over 90% chose Meta's model over PYOCO (Nvidia), and 96% preferred it over Meta's own earlier Make-A-Video.

CogVideo (Tsinghua University, China)

Unlike the models above, which all come from the world's leading technology companies, CogVideo was developed by a research team at Tsinghua University, one of the most prestigious universities in China and in Asia. The program is based on CogView2, a pre-trained text-to-image model.

Computer art expert Glenn Marshall, who tested CogVideo, said that "directors could lose their jobs." His clip The Crow, created with the help of CogVideo, received high praise and was nominated for a British Academy Film Award (BAFTA).
