(Image source: Visual China)
Text | Light Cone Intelligence. Author: Hao Xin. Edited by Wang Yisu and Liu Yuqi.
At the beginning of 2024, OpenAI dropped another AI bomb on the world: the video generation model Sora.
Just like ChatGPT a year ago, Sora is seen as another milestone moment on the road to AGI (Artificial General Intelligence).
"Sora means the timeline for realizing AGI will be shortened from 10 years to 1 year," predicted Zhou Hongyi, chairman of 360.
But the model is so sensational not just because its AI-generated videos are longer and higher-definition, but because OpenAI has surpassed all previous AIGC capabilities and generated content tied to the real physical world.
Fanciful cyberpunk is cool, but AI that can recreate everything in the real world is even more meaningful.
To this end, OpenAI has put forward a brand-new concept: the world simulator.
In OpenAI's official technical report, Sora is positioned as a "video generation model as a world simulator": "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world."
(Image source: OpenAI's official website)
OpenAI believes that Sora lays the foundation for models that can understand and simulate the real world, which will be an important milestone on the way to AGI. With it, OpenAI has pulled a full level ahead of companies such as Runway and Pika in the AI video track.
From text (ChatGPT) to images (DALL·E) to video (Sora), OpenAI seems to be collecting puzzle pieces one by one, trying to completely break the boundary between the virtual and the real through visual media and become the "Ready Player One" of the movies.
If Apple's Vision Pro is the hardware embodiment of Ready Player One, then an AI system that can automatically build the virtual world is its soul.
"The language model approximates the human brain, and the video model approximates the physical world," said Yao Fu, a PhD student at the University of Edinburgh.
"OpenAI's ambitions are beyond everyone's imagination, but it seems to be the only one that can pull this off," a number of AI entrepreneurs lamented to Light Cone Intelligence.
OpenAI's newly released Sora model kicked open the door to the AI video track in 2024, drawing a clear line between it and the old world before 2023.
In the 48 demos released in one go, Light Cone Intelligence found that most of the problems AI video was criticized for in the past have been solved: clearer frames, more realistic output, more accurate prompt comprehension, smoother logical coherence, more stable and consistent results, and so on.
But all this is just the tip of the iceberg OpenAI has revealed, because OpenAI was never aiming only at video, but at all imagery that exists.
Imagery is the larger concept, and video is just one subset of it, alongside things like a scrolling screen on the street or the virtual set of a game world. What OpenAI wants to do is use video as the entry point, cover all imagery, and simulate and understand the real world: the "world simulator" concept it keeps emphasizing.
As Chen Kun of Xingxian Culture, producer of the AI film "Wonderland of Mountains and Seas," told Light Cone Intelligence: "OpenAI is showing us its capabilities in terms of quality, but its real purpose is to obtain people's feedback data and explore what people want to generate. Just like large-model training, once the tool is opened up, it is as if people all over the world are working for it; through continuous labeling and input, the world model grows smarter and smarter."
And so we see that AI video is only the first stage of understanding the physical world, mainly highlighting its attributes as a "generative model"; only in the second stage does it deliver value as a "world simulator."
The key to grasping Sora's "video generation" attribute is to find the difference: where exactly does Sora differ from Runway and Pika? This question is crucial, because it goes some way toward explaining why Sora is able to crush the competition.
First of all, OpenAI follows the same idea as training large language models: use large-scale visual data to train a generative model with general-purpose capabilities.
This is completely different from the "one model for one task" logic common in the text-to-video field. Last year Runway had a similar plan, which it called the "General World Model," but there was no follow-up, and this time Sora was the first to fulfill Runway's dream.
According to Saining Xie, an assistant professor at New York University, Sora has roughly 3 billion parameters. That is modest compared with GPT models, but it far exceeds what companies such as Runway and Pika are working with, which can fairly be called an attack from a higher dimension.
Qi Boquan, general manager of Wondershare Technology's AI Innovation Center, commented that Sora's success once again verifies that "brute force produces miracles": "Sora still follows OpenAI's scaling law, relying on brute force: massive data, large models and massive computing power. At the bottom layer, Sora adopts the world-model approach already verified in games, autonomous driving and robotics to build a text-to-video model capable of simulating the world."
Secondly, Sora demonstrated for the first time the seamless integration of the diffusion model's capabilities with those of the large model.
AI video is like a blockbuster movie: it depends on two key elements, the script and the special effects. The script corresponds to the "logic" of the AI video generation process, and the special effects correspond to the "visual quality." To achieve both, two technical paths emerged: diffusion models and large (transformer-based) models.
At the end of last year, Light Cone Intelligence predicted that in order to satisfy both effect and logic, the diffusion and large-model routes would eventually converge. Unexpectedly, OpenAI solved the problem this quickly.
(Image source: OpenAI's official website)
OpenAI's technical report highlights: "We turn visual data of all types into a unified representation that enables large-scale training of generative models."
Specifically, OpenAI encodes each video frame into visual patches. Each patch is analogous to a token in GPT, becoming the smallest unit of the image, and can be broken apart and reassembled at will. Finding a way to unify the data, a common "system of weights and measures," is what bridges the diffusion model and the large model.
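To make the "visual patch" idea concrete, here is a minimal illustrative sketch (not OpenAI's actual code; the tensor shapes and patch sizes are assumptions): it chops a video array into fixed-size spacetime patches and flattens each one into a token-like vector, much as text is chopped into tokens for GPT.

```python
import numpy as np

def video_to_patches(video, patch_t=2, patch_h=16, patch_w=16):
    """Split a video array (T, H, W, C) into flattened spacetime patches.

    Each patch plays a role analogous to a token in GPT: a small, uniform
    unit that can be shuffled, masked, or reassembled. All sizes here are
    illustrative assumptions, not Sora's real settings.
    """
    T, H, W, C = video.shape
    # Trim so the clip divides evenly into patches (a simplification).
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    video = video[:T, :H, :W]
    patches = (
        video.reshape(T // patch_t, patch_t,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch dims together
             .reshape(-1, patch_t * patch_h * patch_w * C)
    )
    return patches  # (num_patches, patch_dim): one row per "visual token"

# Example: a 16-frame, 64x64 RGB clip becomes a sequence of patch tokens.
clip = np.random.rand(16, 64, 64, 3).astype(np.float32)
tokens = video_to_patches(clip)
print(tokens.shape)  # (128, 1536) with these patch sizes
```

Once every clip is reduced to the same kind of flat token sequence, videos of different lengths, resolutions and aspect ratios can all feed a single training pipeline, which is the "unified representation" the report describes.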
In the overall generation process, the diffusion model is still responsible for producing the visual effect, and adding the transformer's attention mechanism brings stronger sequence prediction and reasoning. This explains why Sora can generate video from an existing static image, extend an existing video, or fill in missing frames.
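The paragraph above describes a diffusion process whose denoiser is attention-based. The toy sketch below is a simplification under stated assumptions (single-head attention, a sinusoidal timestep term, random untrained weights), not Sora's architecture; it only shows the shape of the loop: noise is added to the patch tokens, and an attention model iteratively predicts and removes it.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of patch tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def denoise_step(noisy_tokens, t, params):
    """One illustrative denoising step: the attention 'denoiser' predicts
    the noise present in the patch tokens at diffusion time t."""
    # Condition on the timestep with a simple additive embedding (illustrative).
    conditioned = noisy_tokens + np.sin(t) * params["t_embed"]
    predicted_noise = self_attention(conditioned,
                                     params["w_q"], params["w_k"], params["w_v"])
    return noisy_tokens - predicted_noise  # move one step toward clean tokens

# Toy setup: 128 patch tokens of dimension 64 (shapes are assumptions).
dim = 64
params = {k: rng.normal(scale=0.05, size=(dim, dim)) for k in ("w_q", "w_k", "w_v")}
params["t_embed"] = rng.normal(scale=0.05, size=(1, dim))

tokens = rng.normal(size=(128, dim))            # clean patch tokens
noisy = tokens + rng.normal(size=tokens.shape)  # diffusion adds Gaussian noise
for t in reversed(range(10)):                   # iteratively denoise
    noisy = denoise_step(noisy, t, params)
```

In a real system the denoiser would be a deep, trained transformer conditioned on text embeddings; the point here is only that the visual "effect" comes from the diffusion loop while the sequence-level reasoning comes from attention over patch tokens.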
So far, video models have shown a trend toward convergence: as models move toward integration, the underlying technologies are also becoming composites.
Applying its previously accumulated technologies to visual models has also become OpenAI's advantage. In training Sora for text-to-video, OpenAI brought in the language-understanding capabilities of DALL·E 3 and GPT. According to OpenAI, building on DALL·E 3 (recaptioning training videos with detailed descriptions) and GPT (expanding short user prompts) enables Sora to accurately generate high-quality videos that follow user prompts.
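As a hypothetical illustration of the prompt-side half of this (the GPT part), the sketch below expands a terse user prompt into a detailed caption before generation. `expand_with_llm` is an imaginary stand-in for an LLM call, not a real OpenAI API.

```python
from typing import Callable

def prepare_prompt(user_prompt: str, expand_with_llm: Callable[[str], str]) -> str:
    """Turn a terse user prompt into a detailed caption for a video model.

    Mirrors the idea described above: short prompts are rewritten into rich,
    caption-like descriptions before being fed to the generator.
    `expand_with_llm` is a hypothetical callable, not a real API.
    """
    instruction = (
        "Rewrite the following video idea as a detailed caption describing "
        "subjects, setting, camera motion and lighting: "
    )
    return expand_with_llm(instruction + user_prompt)

# Usage with a dummy expander (a real system would call an LLM here).
detailed = prepare_prompt("a dog surfing", lambda p: p.upper())
print(detailed)
```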
This combination of punches results in emergent simulation abilities, which form the basis of the "world simulator."
"We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, and so on; they are purely phenomena of scale," says OpenAI.
The fundamental reason the simulation ability is so explosive is that people are used to large models creating things that do not exist, yet Sora can correctly grasp the logic of the physical world: how forces interact, how friction arises, how a basketball traces a parabola. No previous model could do this, and it is the fundamental sense in which Sora goes beyond mere video generation.
However, going from demo to finished product can bring either surprises or shocks. Yann LeCun, Meta's chief AI scientist, directly questioned Sora: "Merely being able to generate realistic videos from prompts does not mean a system truly understands the physical world. Unlike a world model that makes predictions based on cause and effect, a generative model only needs to find a plausible sample in the space of possibilities, without understanding or simulating real-world causality."
Qi Boquan also said that although OpenAI has verified that a text-to-video large model built on a world model is feasible, difficulties remain: in the accuracy of physical interactions, where Sora can simulate some basic interactions but may struggle with more complex physical phenomena; in handling long-range dependencies, that is, maintaining consistency and logic over time; and in the accuracy of spatial detail, where imprecise handling may undermine the accuracy and credibility of the content.
It may still be a long time before Sora becomes a true world simulator, but in terms of video generation, it is already having an impact on the world.
The first type of impact is solving problems that previous technology could not break through, pushing some industries to a new stage.
The most typical example is film and television production. Sora's most revolutionary capability this time is that its maximum generated video length reaches one minute. For reference, the popular Pika generates clips of about 3 seconds, and Runway's Gen-2 generates about 18 seconds. This means that with Sora, AI video can become a real productivity tool, reducing costs and raising efficiency.
Chen Kun told Light Cone Intelligence that even before Sora, using AI tools had already cut the cost of making science-fiction films in half, and now that Sora has arrived, there is even more to look forward to.
After Sora's release, what impressed him most was a demo of a dolphin riding a bicycle. In that video, the upper body is a dolphin and the lower body is a pair of human legs, wearing shoes.
"For us it was simply amazing! The image creates a sense of absurdity that leaves room for imagination yet conforms to the laws of physics; it is both reasonable and unexpected, and that is the kind of film and television work audiences marvel at," said Chen Kun.
Chen Kun believes Sora will be like the smartphone and Douyin in their day, lowering the threshold for content creation by a big step and expanding the pool of content creators by an order of magnitude.
"In the future, content creators may not need to shoot at all; they will only need to speak or type a short passage to express the unique ideas in their heads and be seen by more people. At that point, I think there may be a new platform even bigger than Douyin. Perhaps the next step is for Sora to understand everyone's subconscious thoughts and automatically generate content without users even needing to actively express themselves," said Chen Kun.
An industry in the same situation is gaming. OpenAI's technical report ends with a Minecraft video, captioned: "Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot simply by prompting Sora with captions mentioning 'Minecraft.'"
Chen Xi, an AI game entrepreneur, told us: "Any game practitioner who reads that sentence breaks into a cold sweat! OpenAI has shown its ambition without reservation." In Chen Xi's reading, that short sentence conveys two things: Sora controls the game character while simultaneously rendering the game environment. "As OpenAI says, Sora is a simulator, a game engine, an interface between imagination and the real world. In the games of the future, as long as you can say it, the picture can be rendered. Sora has already learned to build a one-minute world and generate stable characters; paired with GPT-5, a purely AI-generated map of thousands of square kilometers full of living creatures no longer sounds whimsical. Of course, whether frames can be generated in real time and whether multiplayer is supported are very real problems. But in any case, a new game mode is on the horizon; at the very least, generating another 'I'm Surrounded by Beautiful Girls' with Sora becomes no problem at all," Chen Xi said.
The second category of impact builds on the ability to simulate the world, creating new things in more fields.
Yao Fu, a PhD student at the University of Edinburgh, said: "A generative model learns the algorithm that generates the data, rather than memorizing the data itself. Just as a language model encodes the algorithm that generates language (in your brain), a video model encodes the physics engine that generates the video stream. A language model can be seen as an approximation of the human brain, while a video model approximates the physical world."
Having learned the general laws of the physical world, embodied intelligence also moves a step closer to human intelligence.
For example, in robotics, the previous chain of command was to first give the robot's "brain" a handshake instruction and then pass it down to the hand; but because the robot could not truly understand what "handshake" means, the instruction had to be translated into something like "close the hand's grip by how many centimeters." If the world simulator becomes reality, the robot can skip this instruction-translation process and understand the human's need in one step.
Jia Kui, founder of Cross-Dimensional Intelligence and a professor at South China University of Technology, told Light Cone Intelligence that explicit physical simulation may be what gets applied to robotics in the future: "Sora's physical simulation is implicit; the effects it shows can only come from its internal understanding and simulation of the physical world."
"Sora's capabilities are still achieved through massive data plus recaptioning technology; there is no explicit 3D modeling, let alone physical simulation, even though the generated results come close to what physical simulation would achieve. But what a physics engine can do goes beyond generation; many other elements are required to train a robot," says Jia Kui.
While Sora still has many limitations, a link has been established between the virtual and the real, making both a Ready Player One-style virtual world and more human-like robots possible.