This image of an astronaut riding a horse was created using two types of generative AI models. Photo: MIT News
When speed and quality are no longer trade-offs
In the field of AI image generation, there are currently two main approaches:
Diffusion models produce sharp, detailed images, but they are slow and computationally intensive: generating an image takes dozens of iterations, each of which predicts and removes noise across every pixel.
Autoregressive models are much faster: they generate an image as a sequence of small patches, predicted one after another. But the images they produce tend to have less detail and are prone to errors.
HART (hybrid autoregressive transformer) combines both, providing the “best of both worlds”. It first uses an autoregressive model to lay down the broad structure of the image as discrete tokens. Then a lightweight diffusion model takes over to predict the residual tokens – the detailed information the discrete encoding loses.
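As a rough illustration of that division of labor, here is a minimal, runnable sketch of the two-stage loop. The module names, sizes, and update rules are toy placeholders, not HART's actual architecture or API:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two components. Real HART pairs a 700M-parameter
# autoregressive transformer with a 37M-parameter diffusion model; these
# tiny modules only mimic the shape of the computation.
class TinyARModel(nn.Module):
    def __init__(self, vocab=256, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab + 1, dim)  # +1 for a start-of-image token
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # Crude summary of the tokens generated so far (a real model
        # would run a causal transformer here).
        h = self.emb(tokens).mean(dim=0)
        return self.head(h)

class TinyResidualDiffusion(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def denoise(self, x, step):
        return x - 0.1 * self.net(x)  # one toy denoising update

@torch.no_grad()
def generate(ar_model, diff_model, num_tokens=16, steps=8, dim=32, vocab=256):
    # Stage 1: the autoregressive model predicts discrete tokens one at a
    # time, laying down the broad structure of the image.
    tokens = torch.tensor([vocab])  # start-of-image token
    for _ in range(num_tokens):
        logits = ar_model(tokens)
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt])

    # Stage 2: the lightweight diffusion model refines only the residual
    # detail, so a handful of denoising steps suffice instead of 30+.
    residual = torch.randn(num_tokens, dim)
    for step in range(steps):
        residual = diff_model.denoise(residual, step)

    # A decoder (omitted here) would turn coarse tokens + residual into pixels.
    return tokens[1:], residual

coarse, detail = generate(TinyARModel(), TinyResidualDiffusion())
print(coarse.shape, detail.shape)  # torch.Size([16]) torch.Size([16, 32])
```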
The resulting images match (or exceed) the quality of state-of-the-art diffusion models, yet are generated about nine times faster and use 31% less computation.
A new approach to creating high-quality images at high speed
One of HART's notable innovations is how it addresses the information loss inherent in autoregressive models. Converting images into discrete tokens speeds up generation, but it also discards fine details such as object edges and facial features like hair, eyes, and mouths.
HART's solution is to have the diffusion model focus solely on “patching up” these details via residual tokens. Because the autoregressive model has already done most of the work, the diffusion model needs only 8 processing steps instead of the 30 or more a standard diffusion model requires. A concrete sketch of what a residual token is follows below.
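The self-contained sketch below makes the idea concrete by quantizing continuous latents against a discrete codebook, in the spirit of VQ-style tokenizers; the residual is exactly the detail the discrete tokens cannot express. The sizes and random codebook are illustrative, not HART's real configuration:

```python
import torch

torch.manual_seed(0)
codebook = torch.randn(8192, 32)  # discrete token vocabulary (illustrative)
latents = torch.randn(1024, 32)   # continuous image latents, e.g. a 32x32 grid

# Nearest-neighbour quantization: each latent is replaced by its closest
# codebook entry, yielding the discrete tokens the autoregressive model predicts.
dists = torch.cdist(latents, codebook)  # (1024, 8192) pairwise distances
token_ids = dists.argmin(dim=-1)        # discrete token ids
quantized = codebook[token_ids]         # what the discrete tokens can express

# Residual tokens: the detail quantization throws away (edges, hair, eyes...).
# In HART, a small diffusion model learns to predict this residual in ~8 steps.
residual = latents - quantized
print(residual.abs().mean())  # nonzero: quantization lost information
```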
“The diffusion model is easier to implement, leading to higher efficiency,” explains co-author Haotian Tang.
Specifically, the combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters gives HART the same performance as a diffusion model with up to 2 billion parameters, but nine times faster.
Initially, the team also tried integrating the diffusion model into the early stages of the image generation process, but this led to an accumulation of errors. The most effective approach was to let the diffusion model handle the final step and focus only on the “missing” parts of the image.
Unlocking the future of multimodal AI
The team’s next step is to build next-generation vision-language models on top of the HART architecture. Because HART is scalable and generalizes across modalities, they expect to apply it to video generation, audio prediction, and other tasks.
This research was funded by several organizations including the MIT-IBM Watson AI Lab, the MIT-Amazon Science Center, the MIT AI Hardware Program, and the US National Science Foundation. NVIDIA also donated GPU infrastructure to train the model.
(According to MIT News)
Source: https://vietnamnet.vn/cong-cu-ai-moi-tao-anh-chat-luong-cao-nhanh-gap-9-lan-2384719.html