The Large Multi-View Gaussian Model (LGM) produces high-quality 3D objects in 5 seconds


Heart of the Machine column.

Heart of the Machine Editorial Department

To meet the growing demand for 3D creative tools in the metaverse, 3D content generation (3D AIGC) has received considerable attention recently, and 3D content creation has made significant strides in both quality and speed.

While current feedforward generative models can produce 3D objects in seconds, their resolution is limited by the intensive computation required during training, resulting in low-quality content. This begs the question: can a high-resolution, high-quality 3D object be produced in just 5 seconds?

In this paper, researchers from Peking University, Nanyang Technological University's S-Lab, and the Shanghai Artificial Intelligence Laboratory propose a new framework, LGM (Large Multi-View Gaussian Model), that generates high-resolution, high-quality 3D objects from a single-view image or text input in as little as 5 seconds.

Currently, both the code and the model weights have been open-sourced. The researchers also provide a demo for everyone to try out.

* Paper title: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

* Project home:

* Demo:

To achieve this goal, the researchers faced two challenges:

Efficient 3D representation under limited computation: existing 3D generation work uses a triplane-based NeRF as its 3D representation and rendering pipeline. Its dense modeling of the scene and ray-marched volumetric rendering severely limit the training resolution (128 × 128), leaving the final content blurry and of poor quality.

A 3D generation backbone that scales to high resolution: existing 3D generation work uses dense transformers as the backbone network to ensure enough parameters to model general objects, but this sacrifices training resolution to some extent, so the quality of the final 3D objects is not high.

To this end, the researchers propose a novel method that synthesizes a high-resolution 3D representation from four input views, and then supports high-quality text-to-3D and image-to-3D tasks by pairing it with existing text-to-multi-view or single-image-to-multi-view diffusion models.

Technically, the core module of LGM is the Large Multi-View Gaussian Model. Inspired by Gaussian splatting, the method uses an efficient, lightweight asymmetric U-Net as the backbone network to directly predict high-resolution Gaussians from four input views, which can then be rendered from any viewpoint.

Specifically, the U-Net backbone takes images from four views together with their corresponding Plücker ray coordinates, and outputs a fixed number of Gaussian features per view. These multi-view Gaussian features are directly fused into the final set of Gaussians, which can be differentiably rendered into images at arbitrary viewpoints.
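As a rough illustration of how the camera information can be fed to the network, the sketch below computes per-pixel Plücker coordinates (the unit ray direction plus the moment o × d) and concatenates them with the RGB input. The function name and tensor shapes are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def plucker_embedding(rays_o: torch.Tensor, rays_d: torch.Tensor) -> torch.Tensor:
    """Per-pixel Plücker ray coordinates (direction, moment).

    rays_o, rays_d: (V, H, W, 3) per-view camera ray origins and directions.
    Returns a (V, H, W, 6) embedding that encodes each view's camera.
    """
    d = F.normalize(rays_d, dim=-1)        # unit ray direction
    m = torch.cross(rays_o, d, dim=-1)     # moment: origin x direction
    return torch.cat([d, m], dim=-1)

# The U-Net then sees 9 channels per view: 3 RGB + 6 Plücker channels
# (the embedding is permuted to (V, 6, H, W) before concatenation).
```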

In this process, a cross-view self-attention mechanism models the correlations between different views on the low-resolution feature maps while keeping computational overhead low.
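A minimal sketch of what such a layer could look like in PyTorch, assuming multi-view features of shape (B, V, C, H, W): tokens from all views are flattened into a single sequence so that attention can relate pixels across viewpoints, which is why it is only applied at low resolutions where the quadratic cost stays manageable.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Self-attention over the concatenated tokens of all views (a sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, C, H, W) feature maps for V views.
        B, V, C, H, W = x.shape
        tokens = x.permute(0, 1, 3, 4, 2).reshape(B, V * H * W, C)
        h = self.norm(tokens)
        h, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + h  # residual connection
        return tokens.reshape(B, V, H, W, C).permute(0, 1, 4, 2, 3)
```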

It is worth noting that efficiently training such a model at high resolution is not easy. To achieve robust training, the researchers had to address the following two problems.

First, the training phase uses 3D-consistent multi-view images rendered from the Objaverse dataset, while at inference the multiple views are synthesized from text or a single image by existing diffusion models, which always exhibit some degree of multi-view inconsistency. To bridge this domain gap, the paper proposes a data augmentation strategy based on grid distortion: random distortions are applied in image space to three of the input views to simulate multi-view inconsistency.
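The sketch below shows one way such grid-distortion augmentation could be implemented: smooth random offsets are sampled on a coarse grid, upsampled to a per-pixel displacement field, and used to warp the images. The coarse-grid resolution and distortion strength are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def random_grid_distortion(images: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    """Warp images with a smooth random grid to mimic the multi-view
    inconsistency of diffusion-synthesized views (a sketch, not the
    authors' exact augmentation).

    images: (N, 3, H, W) -- applied to three of the four input views,
    keeping the reference view unchanged.
    """
    N, _, H, W = images.shape
    # Coarse random offsets, upsampled to a smooth per-pixel displacement.
    coarse = (torch.rand(N, 2, 8, 8, device=images.device) * 2 - 1) * strength
    flow = F.interpolate(coarse, size=(H, W), mode='bilinear', align_corners=True)
    # Identity sampling grid in [-1, 1], x before y as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=images.device),
        torch.linspace(-1, 1, W, device=images.device),
        indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(N, -1, -1, -1)
    grid = grid + flow.permute(0, 2, 3, 1)
    return F.grid_sample(images, grid, align_corners=True)
```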

Second, because the multi-view images generated at inference do not strictly guarantee geometrically consistent camera poses in 3D, the paper also randomly perturbs the camera poses of three of the views during training to simulate this phenomenon, making the model more robust at inference time.
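A hedged sketch of such pose perturbation: a small random rotation (via Rodrigues' formula) and translation are applied to each camera-to-world matrix. The magnitudes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def perturb_camera_poses(c2w: torch.Tensor, rot_deg: float = 3.0,
                         trans: float = 0.02) -> torch.Tensor:
    """Apply small random rotations/translations to camera-to-world poses
    to simulate geometric inconsistency of synthesized views (a sketch).

    c2w: (N, 4, 4) camera-to-world matrices.
    """
    N = c2w.shape[0]
    # Random axis-angle rotation with angle ~ U(-rot_deg, rot_deg) degrees.
    axis = F.normalize(torch.randn(N, 3), dim=-1)
    angle = (torch.rand(N, 1) * 2 - 1) * rot_deg * torch.pi / 180.0
    K = torch.zeros(N, 3, 3)                 # skew-symmetric cross matrix
    K[:, 0, 1], K[:, 0, 2] = -axis[:, 2], axis[:, 1]
    K[:, 1, 0], K[:, 1, 2] = axis[:, 2], -axis[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -axis[:, 1], axis[:, 0]
    I = torch.eye(3).expand(N, 3, 3)
    s = torch.sin(angle).unsqueeze(-1)
    c = torch.cos(angle).unsqueeze(-1)
    R = I + s * K + (1 - c) * (K @ K)        # Rodrigues' rotation formula
    out = c2w.clone()
    out[:, :3, :3] = R @ c2w[:, :3, :3]
    out[:, :3, 3] += (torch.rand(N, 3) * 2 - 1) * trans
    return out
```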

Finally, the generated Gaussians are rendered into the corresponding images through differentiable rendering, and the whole model is trained end-to-end with direct supervision on 2D images.
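In code, one training step could look like the following sketch, where `model`, `renderer`, and `lpips_fn` are placeholders for the U-Net, a differentiable Gaussian splatting rasterizer, and a perceptual loss; combining a pixel loss with LPIPS is common practice for this kind of 2D render supervision.

```python
import torch.nn.functional as F

def training_step(model, renderer, lpips_fn, input_views, sup_poses, gt_images):
    """One supervised step, sketched with placeholder components."""
    gaussians = model(input_views)            # fused Gaussians from 4 views
    pred = renderer(gaussians, sup_poses)     # (S, 3, H, W) renders
    loss = F.mse_loss(pred, gt_images) + lpips_fn(pred, gt_images).mean()
    return loss
```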

Once training is complete, LGM can perform high-quality text-to-3D and image-to-3D generation by pairing with existing text-to-multi-view or image-to-multi-view diffusion models.
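The full inference pipeline then reduces to a few steps, sketched below with hypothetical components: `mv_model` is an off-the-shelf single-image-to-multi-view diffusion model, `lgm` the trained multi-view Gaussian model, `renderer` a Gaussian splatting rasterizer, and `orbit_poses` a list of camera poses to render.

```python
def image_to_3d(input_image, mv_model, lgm, renderer, orbit_poses):
    """End-to-end inference sketch (all components are hypothetical handles)."""
    views = mv_model(input_image)       # 4 roughly consistent views
    gaussians = lgm(views)              # one feed-forward pass, ~seconds
    return [renderer(gaussians, pose) for pose in orbit_poses]
```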

Given the same input text or image, the method can generate a variety of high-quality 3D models.

To further support downstream graphics tasks, the researchers also propose an efficient method for converting the generated Gaussian representation into smooth, textured meshes.
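The paper's exact conversion pipeline is more involved than this, but as a generic illustration of the idea, the sketch below evaluates an approximate opacity field from the Gaussians on a dense grid and extracts a surface with Marching Cubes (scikit-image); the grid resolution, iso-level, and chunk size are illustrative assumptions, and the naive field evaluation is O(grid points × Gaussians).

```python
import torch
from skimage.measure import marching_cubes

@torch.no_grad()
def gaussians_to_mesh(centers, scales, opacities, res=64, level=0.5):
    """Extract a surface from Gaussians via a dense opacity field (a sketch).

    centers: (N, 3), scales: (N, 3), opacities: (N, 1), in a [-1, 1]^3 box.
    """
    lin = torch.linspace(-1, 1, res)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing='ij'), dim=-1)
    pts = grid.reshape(-1, 3)                          # (res^3, 3) query points
    field = torch.zeros(pts.shape[0])
    for i in range(0, pts.shape[0], 4096):             # chunk to bound memory
        p = pts[i:i + 4096]                            # (P, 3)
        d2 = ((p[:, None] - centers[None]) / scales[None]).pow(2).sum(-1)
        field[i:i + 4096] = (opacities[None, :, 0] * torch.exp(-0.5 * d2)).sum(-1)
    verts, faces, _, _ = marching_cubes(field.reshape(res, res, res).numpy(), level)
    verts = verts / (res - 1) * 2 - 1                  # voxel indices -> [-1, 1]^3
    return verts, faces
```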

Please refer to the original paper for more details.
