Folks, let's take a look at a video clip first. Can you spot anything unusual in it?
If we told you that one of the objects in this video is fake, synthesized into the footage, could you find it?
We won't keep you guessing; let's reveal the answer now.
The "fake" object is precisely this traffic light that appears where it shouldn't.
Let's play "spot the difference" again. Here is the next one:
The answer: this device on a shelf.
These examples come from Xpeng's latest research, Anything in Any Scene.
Its main job is to "stuff" any object into a real-world video without it looking out of place.
And the research team's own assessment of this general framework is:
Its applications go far beyond data augmentation, showing great potential in virtual reality, video editing, and a variety of other video-centric applications.
Some netizens even exclaimed after seeing the results:
Goodbye, video evidence. Objects inserted by this technique can look just as realistic as the original footage.
So how effective is this AI technique? Let's read on and see.
Anything in Any Scene
Let's take a look at the outdoor scene first.
When an object is composited into a video, the reasons it often fails to look realistic can be summarized as: misplaced position, missing shadows, no HDR lighting, and no style transfer.
As shown in the following failure cases:
And the result from the Xpeng team looks like this:
Compared with the versions missing these factors, it is clearly more realistic.
Compared with other existing algorithms and frameworks such as DoveNet, StyTr2, and PHDiffusion, whose results for compositing objects into an outdoor scene look like this:
Xpeng's Anything in Any Scene is still noticeably more realistic.
Similarly, in indoor environments, whether the inserted object is a bag or a shoe, the results generated by Xpeng's new AI technique are hard to tell apart from the real thing.
More results are shown in the figure below:
In addition to the visual results, the Xpeng team compared a YOLOX model trained only on raw images from the CODA dataset with one trained on a combination of raw images and images augmented by the Anything in Any Scene framework.
In terms of overall accuracy, the latter shows a considerable improvement.
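To make the comparison setup concrete, here is a minimal sketch, with hypothetical directory names and file layout, of how a raw-only training set and a raw-plus-augmented training set might be assembled before training the same detector on each:

```python
import random
from pathlib import Path

# Hypothetical directory layout; the real dataset paths and format will differ
raw_images = sorted(Path("coda/raw").glob("*.jpg"))
augmented_images = sorted(Path("coda/anything_in_any_scene").glob("*.jpg"))

# Train one detector on raw images only, another on raw + augmented images,
# then evaluate both on the same held-out set to measure the accuracy gain
baseline_train = list(raw_images)
augmented_train = list(raw_images) + list(augmented_images)
random.shuffle(augmented_train)

print(f"baseline: {len(baseline_train)} images, augmented: {len(augmented_train)} images")
```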
How is this achieved?
Looking at the proposed framework, Anything in Any Scene consists of three key parts.
The first is the process of object placement and stabilization.
The team first determined the camera's position in the scene's world coordinate system and used it as the reference point for object insertion. The camera's intrinsic matrix and pose (rotation matrix and translation vector) are then used to project points from the world coordinate system into the pixel coordinate system, which determines where the object is placed in each video frame.
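As a rough illustration of that projection step, here is the standard pinhole camera model, with made-up intrinsics and pose rather than the paper's actual values:

```python
import numpy as np

def project_to_pixel(X_world, K, R, t):
    """Project a 3D point in world coordinates into pixel coordinates
    using the camera intrinsics K and pose (R, t)."""
    X_cam = R @ X_world + t              # world -> camera coordinates
    u, v, w = K @ X_cam                  # camera -> homogeneous pixel coords
    return np.array([u / w, v / w])      # perspective divide

# Hypothetical values for illustration only
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])          # focal lengths and principal point
R = np.eye(3)                            # camera rotation (world -> camera)
t = np.array([0.0, 0.0, 5.0])            # camera translation
print(project_to_pixel(np.array([1.0, 0.5, 10.0]), K, R, t))
```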
To avoid occluding other objects in the scene, the team also used a semantic segmentation model to estimate a segmentation mask for each frame and ensure that objects are placed in unoccluded areas.
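A simple way to use such a mask, sketched here with hypothetical class IDs rather than the team's actual label set, is to accept a candidate placement only if its footprint lies entirely on "placeable" pixels:

```python
import numpy as np

# Hypothetical label IDs for classes considered safe to place objects on
PLACEABLE_CLASSES = {7, 8}   # e.g. road and sidewalk in a Cityscapes-style label map

def placement_is_valid(seg_mask, bbox):
    """Return True if the candidate bounding box (x0, y0, x1, y1) lies
    entirely on placeable, unoccluded pixels of the segmentation mask."""
    x0, y0, x1, y1 = bbox
    region = seg_mask[y0:y1, x0:x1]
    return np.isin(region, list(PLACEABLE_CLASSES)).all()

# Toy mask: everything is "road" except a vertical strip of class 11 (e.g. a pole)
mask = np.full((720, 1280), 7, dtype=np.int32)
mask[:, 600:620] = 11
print(placement_is_valid(mask, (100, 400, 300, 600)))   # True
print(placement_is_valid(mask, (590, 400, 700, 600)))   # False, overlaps the pole
```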
For object stabilization, the team estimated optical flow between successive frames to track the object's trajectory, and optimized the camera pose (rotation matrix and translation vector) to minimize the object's 3D-to-2D reprojection error across consecutive frames, keeping the object's motion in the video stable.
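A minimal sketch of that optimization, using synthetic points and SciPy's least-squares solver rather than the team's actual pipeline, looks roughly like this:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(pose, points_3d, points_2d, K):
    """Residuals between observed 2D tracks and projections of the 3D points.
    pose = [rx, ry, rz, tx, ty, tz] (rotation vector + translation)."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    t = pose[3:]
    proj = (K @ (points_3d @ R.T + t).T).T        # project all points
    proj = proj[:, :2] / proj[:, 2:3]             # perspective divide
    return (proj - points_2d).ravel()

# Hypothetical data: 3D anchor points on the inserted object and their
# optical-flow-tracked 2D positions in the next frame
K = np.array([[1000., 0., 640.], [0., 1000., 360.], [0., 0., 1.]])
pts3d = np.array([[0., 0., 10.], [1., 0., 10.], [0., 1., 10.], [1., 1., 11.]])
true_pose = np.array([0.01, -0.02, 0.0, 0.1, 0.0, 5.0])
pts2d = reprojection_residuals(true_pose, pts3d, np.zeros((4, 2)), K).reshape(-1, 2)

result = least_squares(reprojection_residuals, x0=np.zeros(6),
                       args=(pts3d, pts2d, K))
print(result.x)   # recovered pose; should be close to true_pose
```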
The second part is lighting estimation and shadow generation.
For HDR panoramic image reconstruction, the team used an image inpainting network to infer the light distribution of the panoramic view, then converted the panorama into an HDR image through a sky HDR reconstruction network, using a GAN-trained encoder-decoder network to model the brightness distribution of the sun and sky.
For environmental HDR image reconstruction, the researchers collected multi-view LDR images of the scene and restored them to HDR images with an existing model, learning a continuous representation of the scene's radiance values.
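As a simplified illustration (not the paper's learned model), a primary light direction can be read off an equirectangular HDR panorama by locating its brightest pixel and mapping it back to a 3D direction:

```python
import numpy as np

def dominant_light_direction(hdr_pano):
    """Estimate the primary light direction from an equirectangular HDR panorama
    by taking the pixel with the highest luminance and mapping it to a 3D vector."""
    lum = 0.2126 * hdr_pano[..., 0] + 0.7152 * hdr_pano[..., 1] + 0.0722 * hdr_pano[..., 2]
    h, w = lum.shape
    row, col = np.unravel_index(np.argmax(lum), lum.shape)
    theta = np.pi * (row + 0.5) / h            # polar angle from the zenith
    phi = 2.0 * np.pi * (col + 0.5) / w        # azimuth
    return np.array([np.sin(theta) * np.cos(phi),
                     np.cos(theta),
                     np.sin(theta) * np.sin(phi)])

# Toy panorama with a bright "sun" patch near the top
pano = np.ones((256, 512, 3), dtype=np.float32)
pano[20:24, 100:104] = 50.0
print(dominant_light_direction(pano))
```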
For object shadow generation, the team used a 3D graphics API (such as Vulkan) with ray tracing to render shadows for the inserted objects based on the estimated position of the primary light source.
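The ray-traced renderer itself is out of scope here, but the underlying geometry can be sketched with a simple planar shadow projection: each object point is pushed along the light direction until it hits the ground plane (the light direction and points below are hypothetical):

```python
import numpy as np

def project_shadow(points, light_dir, ground_y=0.0):
    """Project 3D points of an object onto the ground plane y = ground_y
    along the (normalized) light direction, giving hard shadow points."""
    d = light_dir / np.linalg.norm(light_dir)
    # Solve (p + s*d).y == ground_y for each point p
    s = (ground_y - points[:, 1]) / d[1]
    return points + s[:, None] * d

# Hypothetical object corner points and a light shining downward at an angle
obj_points = np.array([[0.0, 1.0, 5.0], [0.5, 1.0, 5.0], [0.25, 1.5, 5.0]])
light_dir = np.array([0.3, -1.0, 0.1])
print(project_shadow(obj_points, light_dir))
```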
The final step is style transfer.
The framework fine-tunes the appearance of the inserted object so that its style blends seamlessly with the background, further enhancing realism.
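The paper's style network isn't reproduced here, but a crude stand-in for the same idea, matching the inserted object's color statistics to the background, can be sketched as follows:

```python
import numpy as np

def match_color_stats(object_rgb, background_rgb):
    """Shift and scale the inserted object's per-channel color statistics so its
    mean and standard deviation match the background, a crude form of style blending."""
    out = object_rgb.astype(np.float32)
    for c in range(3):
        o_mean, o_std = out[..., c].mean(), out[..., c].std() + 1e-6
        b_mean, b_std = background_rgb[..., c].mean(), background_rgb[..., c].std() + 1e-6
        out[..., c] = (out[..., c] - o_mean) / o_std * b_std + b_mean
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy example: a bright object patch adjusted toward a darker background
obj = np.random.randint(150, 255, (64, 64, 3), dtype=np.uint8)
bg = np.random.randint(30, 120, (256, 256, 3), dtype=np.uint8)
blended = match_color_stats(obj, bg)
print(blended.mean(axis=(0, 1)))
```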
This is why Xpeng's Anything in Any Scene can insert objects so realistically into real-world footage.
Work similar to Xpeng's research has actually been done before.
For example, GAIA-1, a multimodal generative world model, can create realistic autonomous driving videos from scratch:
Every frame here is generated by AI, and even different road conditions and weather look realistic.
Even LeCun was amazed when he saw it:
However, realistic as these AI results are, some netizens have raised a concern: fake, generated content on the internet is becoming harder and harder to tell from the real thing, so we will need to be more vigilant about verifying information in the future.
At present, this project has been open-sourced on GitHub; interested readers can check it out.