MVDream builds on image diffusion models pre-trained on extensive web datasets, together with a multi-view dataset rendered from 3D assets.
A team of researchers from ByteDance has recently introduced a brand-new AI model capable of generating 3D models from nothing but a text description. Meet MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt.
As stated by the team, this achievement is made possible by harnessing image diffusion models pre-trained on extensive web datasets and a multi-view dataset derived from 3D assets. The resulting multi-view diffusion model combines the adaptability of 2D diffusion models with the coherence of 3D data. As a result, it serves as a valuable multi-view reference for 3D content generation through Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem.
Moreover, the team notes that the multi-view diffusion model can also be fine-tuned with just a few input examples, making it suitable for personalized 3D generation, such as in the DreamBooth3D application. In such cases, the model can learn the subject's identity while preserving consistency.
"We present the first multi-view diffusion model that is able to generate a set of multi-view images of an object/scene from any given text," reads the paper. "By fine-tuning a pre-trained text-to-image diffusion model on a mixture of 3D rendered datasets and large-scale text-to-image datasets, our model is able to maintain the generalizability of the base model while achieving multi-view consistent generation.
"With a comprehensive design exploration, we found that using a 3D self-attention with camera matrix embedding is sufficient to learn the multi-view consistency from training data. We show that the multi-view diffusion model can serve as a good 3D prior and can be applied to 3D generation via SDS, which leads to better stability and quality than current open-sourced 2D lifting methods. Finally, the multi-view diffusion model can also be trained under a few shot setting for personalized 3D generation."
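To make the "3D self-attention with camera matrix embedding" idea more concrete, here is a minimal NumPy sketch. It is not the team's implementation. It only illustrates the general pattern of flattening the tokens of all views into one sequence so that every token can attend across views. The function names, the linear projection of the flattened 4×4 camera matrix, and the choice to add the camera embedding directly to the tokens are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def embed_camera(extrinsics, w_cam):
    # Hypothetical embedding: flatten each 4x4 camera matrix and project it.
    # extrinsics: (views, 4, 4), w_cam: (16, dim) -> (views, dim)
    return extrinsics.reshape(extrinsics.shape[0], 16) @ w_cam

def multiview_self_attention(tokens, camera_emb, w_q, w_k, w_v):
    # tokens: (views, n_tokens, dim), camera_emb: (views, dim)
    views, n_tokens, dim = tokens.shape
    # Assumption: the camera embedding is simply added to every token of its view.
    x = tokens + camera_emb[:, None, :]
    # Key step: merge the view axis into the sequence axis, so attention
    # runs over the tokens of all views at once instead of per image.
    x = x.reshape(views * n_tokens, dim)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(dim))
    out = weights @ v
    return out.reshape(views, n_tokens, dim), weights

rng = np.random.default_rng(0)
views, n_tokens, dim = 4, 8, 16
tokens = rng.normal(size=(views, n_tokens, dim))
extrinsics = np.tile(np.eye(4), (views, 1, 1))  # placeholder camera poses
w_cam = rng.normal(size=(16, dim)) * 0.1
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

cam = embed_camera(extrinsics, w_cam)
out, weights = multiview_self_attention(tokens, cam, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (4, 8, 16) (32, 32)
```

Because the attention matrix is 32×32 rather than four separate 8×8 matrices, a change to the tokens of one view alters the output of every other view, which is the mechanism that lets such a layer learn cross-view consistency.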
"We observe the following limitations of our current multi-view diffusion models. For the moment, the model can only generate images at a resolution of 256×256, which is smaller than the 512×512 of the original Stable Diffusion," commented the team on the limitations of MVDream. "Also, the generalizability of the current model seems to be limited to the base model itself. For both aforementioned problems, we expect them to be solved by increasing the dataset size and replacing the base model with a larger diffusion model, such as SDXL.
"Furthermore, we do observe that the generated styles (lighting, texture) of our model would tend to be similar to the rendered dataset, although it can be alleviated by adding more style text prompts, it also indicates that a more diverse and realistic rendering is necessary to learn a better multi-view diffusion model, which could be costly."