Contributed by Wang Xia (Tsinghua University) | QbitAI
An AI model that synthesizes 3D content from text has a new SOTA!
Recently, the research group of Professor Liu Yongjin at Tsinghua University proposed a new diffusion-based text-to-3D method.
It clearly outperforms previous methods in both cross-view consistency and alignment with the text prompt.
Text-to-3D generation is a hot topic in 3D AIGC research and has received extensive attention from both academia and industry.
The new model from Prof. Liu Yongjin's group is called TICD (Text-Image Conditioned Diffusion), and it reaches SOTA level on the T3Bench dataset.
The paper has already been released, and the code will be open-sourced soon.
Evaluation scores reach SOTA
To evaluate the effectiveness of the TICD method, the research team first conducted qualitative experiments comparing it with several earlier methods.
The results show that the 3D content generated by TICD is of higher quality, with clearer geometry and better alignment with the text prompt.
To further evaluate performance, the team quantitatively compared TICD against these methods on the T3Bench dataset.
The results show that TICD achieves the best scores on all three prompt sets: single object, single object with background, and multiple objects, demonstrating an overall advantage in both generation quality and text alignment.
In addition, to further assess text alignment, the team measured the CLIP cosine similarity between rendered views of the generated 3D objects and the original prompts; here, too, TICD performed best.
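For reference, this kind of metric can be computed with an off-the-shelf CLIP model. The sketch below averages the cosine similarity between the CLIP embeddings of several rendered views and the prompt; the checkpoint name, prompt, and image file names are placeholders for illustration, not details from the paper.

```python
# Minimal sketch of a CLIP cosine-similarity metric between rendered views
# and the text prompt, using the Hugging Face transformers CLIP implementation.
# The checkpoint, prompt, and render file names below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a wooden rocking chair"                           # original text prompt
views = [Image.open(f"render_{i}.png") for i in range(4)]   # rendered views of the 3D object

inputs = processor(text=[prompt], images=views, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between each rendered view and the prompt, averaged over views
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
score = (image_emb @ text_emb.T).mean().item()
print(f"CLIP cosine similarity: {score:.4f}")
```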
So, how does the TICD method achieve these results?
Incorporating multi-view consistency priors into NeRF supervision
Currently, most mainstream text-to-3D methods use a pre-trained 2D diffusion model to optimize a neural radiance field (NeRF) via score distillation sampling (SDS) in order to generate a new 3D model.
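For readers unfamiliar with SDS, the sketch below shows the basic idea: render a view of the current NeRF, perturb it with noise, and use the frozen text-conditioned diffusion model's noise prediction as a gradient signal on the rendering. The interfaces (nerf.render, diffusion.encode, diffusion.add_noise, diffusion.predict_noise) are assumed names for illustration, not a specific library's API.

```python
# Minimal sketch of score distillation sampling (SDS).
# The NeRF renderer and the 2D diffusion model are stand-in interfaces,
# not the authors' implementation.
import torch

def sds_loss(nerf, diffusion, text_emb, camera, num_train_timesteps=1000):
    # 1. Render an image of the current NeRF from a sampled camera pose
    image = nerf.render(camera)                 # (1, 3, H, W), differentiable w.r.t. NeRF params
    latents = diffusion.encode(image)           # e.g. a VAE latent for a latent diffusion model

    # 2. Perturb the rendering with noise at a random diffusion timestep
    t = torch.randint(20, num_train_timesteps, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = diffusion.add_noise(latents, noise, t)

    # 3. Ask the frozen, text-conditioned diffusion model to predict that noise
    with torch.no_grad():
        noise_pred = diffusion.predict_noise(noisy, t, text_emb)

    # 4. SDS surrogate loss: its gradient w.r.t. the latents is (noise_pred - noise),
    #    pushing the rendering toward images the diffusion model finds likely for the prompt
    grad = noise_pred - noise
    return (grad.detach() * latents).sum()
```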
However, the supervision provided by such a pre-trained diffusion model is limited to the input text itself and does not constrain consistency across multiple views, which can lead to problems such as poor geometry.
To introduce multi-view consistency into the diffusion prior, some recent studies fine-tune the 2D diffusion model on multi-view data, but the results still lack fine-grained continuity between views.
To address this challenge, TICD incorporates both text-conditioned and image-conditioned multi-view supervision signals into NeRF optimization. These respectively ensure that the 3D content aligns with the prompt and that different views of the 3D object remain strongly consistent, effectively improving the quality of the generated 3D model.
In terms of workflow, TICD first samples several sets of orthogonal reference camera poses, uses the NeRF to render the corresponding reference views, and then applies a text-conditioned diffusion model to these reference views to constrain the overall consistency between the content and the text.
Building on this, an additional novel view is rendered for each set of reference camera poses. Then, taking the two views and the pose relationship between their cameras as conditions, an image-conditioned diffusion model is used to constrain the detail consistency between different views.
By combining the supervision signals of the two diffusion models, TICD updates the NeRF parameters and repeats this optimization until a final NeRF model is obtained, rendering high-quality, geometrically clear, and text-consistent 3D content.
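Based on that description, one TICD optimization step might look roughly like the following minimal sketch. All interfaces here (nerf.render, the two diffusion models' sds_loss methods, the camera samplers, relative_pose) are assumptions for illustration only, since the official code has not yet been released.

```python
# Minimal sketch of one TICD optimization step, as described in the text.
# All interfaces are hypothetical stand-ins, not the released implementation.
import torch

def ticd_step(nerf, text_diffusion, image_diffusion, prompt_emb, optimizer,
              sample_reference_cameras, sample_novel_camera, relative_pose):
    # 1. Sample a set of orthogonal reference camera poses and render them with the NeRF
    ref_cams = sample_reference_cameras()
    ref_views = [nerf.render(c) for c in ref_cams]

    # 2. Text-conditioned supervision: an SDS-style loss that keeps the
    #    reference views consistent with the text prompt
    loss_text = sum(text_diffusion.sds_loss(v, prompt_emb) for v in ref_views)

    # 3. For each reference view, render an extra novel view and supervise it with the
    #    image-conditioned diffusion model, conditioned on the reference view and the
    #    relative pose between the two cameras
    loss_image = 0.0
    for ref_cam, ref_view in zip(ref_cams, ref_views):
        novel_cam = sample_novel_camera()
        novel_view = nerf.render(novel_cam)
        loss_image = loss_image + image_diffusion.sds_loss(
            novel_view,
            cond_image=ref_view,
            cond_pose=relative_pose(ref_cam, novel_cam),
        )

    # 4. Combine both supervision signals and update the NeRF parameters
    loss = loss_text + loss_image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```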
In addition, the TICD method effectively eliminates failure modes that existing methods may exhibit on certain text inputs, such as vanishing geometry, excessive generation of incorrect geometry, and color confusion.
Paper address: