Why is Sora impacting not only Douyin, but also a bunch of robotics companies?

Mondo Education Updated on 2024-02-22

Key points

According to OpenAI, Sora is not just a video generator, but also a foundation for understanding and simulating the real world;

The videos Sora generates can be up to 1 minute long;

Sora is not a pure diffusion model; it uses a Transformer-based diffusion model, and GPT is likewise built on the Transformer architecture;

Converting visual data into a unified patch format is the other key to the Sora model's stunning results;

Yann LeCun is a major proponent of the concept of the world model; he has proposed that within the next 10 years, AI should be able to build a cognitive model of the external world the way humans do, and predict the next state of the world based on that model.

LeCun clearly prefers that his own V-JEPA, rather than Sora, be the first to have a world model;

The world model determines whether an AI can move from the 2D screen into the 3D physical world, a necessary step toward becoming an AGI.

"The race is on," Runway's CEO wrote on social media.

It's OpenAI again. Following GPT's upending of natural language processing, on February 16 OpenAI launched the text-to-video model Sora. This time it has upended the field of video generation, along with the visual content industry, including movies, short videos, and games.

What's more, in the blog post introducing Sora, OpenAI said that Sora is not just a video generator, but also a foundation for understanding and simulating the real world. In short, Sora has a world model inside it. This is the first time OpenAI has emphasized this when releasing a model; it did not say so when releasing the GPT series, not even GPT-4.

On the same day Sora was released, Meta, another AI giant, also released a model that it claims is built on a world model: V-JEPA. Unlike the generative Sora, V-JEPA is non-generative; it guesses occluded information, the missing parts of the picture, from the surrounding context.

Will Sora create more value than GPT? What does its arrival mean for video production companies and video-sharing platforms such as Douyin, YouTube, and Paramount? Why do both OpenAI and Meta emphasize a world-model foundation when releasing their video models? What is a world model? What does the world model mean for artificial intelligence? And why was it OpenAI, and not someone else, that made Sora? Here are the answers to these basic questions:

Compared with Runway and its peers, how much stronger is Sora?

As early as 2022, the first open-source text-to-video model, jointly launched by Tsinghua University and KLCII, could generate surreal videos such as a lion drinking water. Since then, Runway, Stability AI, and other companies have successively launched models to enter this track, as has Pika, which was a smash hit in AI circles not long ago.

Although OpenAI leads the field of text generation with ChatGPT, it is a newcomer to video generation. Yet anyone who has seen the example videos Sora shows on the blog has to admit that Sora has left the earlier models behind in both the length and the quality of the generated videos.

Video (01:00): A woman walks through the streets of a Japanese city; the camera cuts several times over the one-minute clip.

Sora's most obvious advantage over other text-to-video models is that its generated videos can be as long as 1 minute. Previously, generated videos usually lasted only a few seconds: Pika could only generate 3 seconds, and even the most polished, Runway, could only reach a maximum of 18 seconds.

According to research statistics, the average shot length of Hollywood movies from the 1930s to the 1940s was about 10 seconds, and this figure dropped to under 4 seconds after 2000. But that statistic only reflects the average shot duration; high-quality visual storytelling still relies on alternating long and short shots, so Sora, with durations of up to 1 minute, will obviously be far more broadly applicable.

Video (00:17): A couple strolls through the streets of Japan, with the camera following them from far and near.

In addition, Sora has more surprising capabilities that other text-to-video models lack, which OpenAI calls 3D consistency, long-range coherence, and object permanence. 3D consistency and long-range coherence mean that as the camera moves, objects and scenes change accordingly in three-dimensional space; object permanence means that objects in the shot persist even when they are temporarily occluded or leave the frame.

These are shots that appear all the time when we film in daily life, but they are genuinely difficult for AI-generated video. In the real world, 3D consistency and object permanence go without saying, because they are basic laws of the physical world. The AI simulates approximately the same effects without being taught those laws, which seems to imply that Sora, like the GPT models, can learn such regularities emergently.

Video (00:17): A seaside castle is seamlessly connected to a Christmas village.

Sora is also more flexible in how it generates video. Besides generating from text prompts, Sora supports image-to-video generation and video editing. Given a static image, Sora can directly set it in motion. Sora can also extend a video forward or backward in time, and can seamlessly connect videos of different styles. In addition, users can edit existing videos with text commands, for example replacing the background of a video of a car driving down a road with dense jungle.

After Sora's release, not only was Runway's CEO pushed into the response quoted at the beginning of this article, but Pika's founder also responded that the company was preparing a charge and would benchmark directly against Sora. An employee of Aishi Technology, another Chinese text-to-video company, told Neocortex that Sora's route was very inspiring and that the company had organized a technical team to try to reproduce it as soon as possible, but there were no results yet.

Does Sora's success once again prove that generative AI works wonders through brute-force scale?

OpenAI has not disclosed Sora's technical details, but according to the technical report it released, the core of Sora comes down to two points: first, the use of a Transformer-based diffusion model; second, converting different types of visual data into a unified format called patches, so that more data can be used to train the model.

First of all, Sora is not a pure diffusion model; pure diffusion is the algorithm used by image and video model developers such as Runway, Pika, and Midjourney. As early as 2021, the Google Brain team launched a model called Vision Transformer (ViT), which recognizes images by computing the dependencies between parts of the same image. Before that, language and vision were treated as different things: language is linear and sequential, while vision is parallel data with spatial structure. But the Transformer proved that images can also be handled as a sequence problem, an image being a sentence composed of pixels. Not only that, most problems can be recast as sequence problems; protein structure prediction, for example, also relies on learning amino acid sequences. And a video is just a sequence of consecutive images.
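To make the idea concrete, here is a minimal sketch, in PyTorch, of what a Transformer-based diffusion model looks like: the denoiser that learns to remove noise from a sequence of latent patch tokens is a Transformer rather than the U-Net used in most image diffusion models. All names, sizes, and the noise schedule are illustrative assumptions, not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn

class PatchDenoiser(nn.Module):
    """Toy diffusion-transformer denoiser: a Transformer encoder that
    predicts the noise added to a sequence of latent patch tokens.
    Shapes and hyperparameters are illustrative only."""

    def __init__(self, patch_dim=64, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(patch_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)        # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, patch_dim)

    def forward(self, noisy_patches, t):
        # noisy_patches: (batch, num_patches, patch_dim); t: (batch,) integer timesteps
        h = self.in_proj(noisy_patches) + self.time_emb(t)[:, None, :]
        h = self.blocks(h)                                  # every patch attends to every other patch
        return self.out_proj(h)                             # predicted noise, same shape as input

# One training step of standard denoising diffusion on patch tokens.
model = PatchDenoiser()
patches = torch.randn(2, 128, 64)                           # clean latent patches (toy data)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(patches)
alpha = 0.9 ** t.float()[:, None, None]                     # toy noise schedule
noisy = alpha.sqrt() * patches + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t), noise)
loss.backward()
```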

Converting visual data into a unified patch format is the other key to the Sora model's stunning results, because it determines whether training data can be obtained in large quantity, at high quality, and at an affordable compute cost.

The compressed video is cut into many small square patches, which act like the basic data unit, tokens, in large language models; they are the raw material for training. This approach greatly improves the efficiency of data preprocessing. Previously, before feeding video data into a model for training, a great deal of preprocessing was needed, such as making sure the resolution, aspect ratio, and other formats of the training material were uniform. Once a video is cut into patches, preprocessing becomes much easier: video of any format ends up cut into patches of the same format, just as all Lego parts are uniform small bricks. Finally, each patch is upgraded to a spacetime patch by adding the dimension of time.
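A rough sketch of the patch-cutting step, assuming a toy video tensor and hypothetical patch sizes (OpenAI has not published its actual code): the clip is sliced along time, height, and width, and each small block is flattened into one uniform "token".

```python
import torch

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Cut a video tensor into uniform spacetime patches.

    video: (frames, channels, height, width); frame count and resolution
    must be divisible by the patch sizes in this toy version.
    Returns (num_patches, pt*ph*pw*channels): one flat token per patch.
    """
    f, c, h, w = video.shape
    video = video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
    # Order axes so each patch's time/height/width/channel values are contiguous.
    video = video.permute(0, 3, 5, 1, 4, 6, 2)
    return video.reshape(-1, pt * ph * pw * c)

clip = torch.randn(16, 3, 256, 256)        # a 16-frame toy clip
tokens = to_spacetime_patches(clip)
print(tokens.shape)                        # (4 * 16 * 16, 3072) = (1024, 3072)
```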

Sora also builds on a great deal of work from OpenAI's image generation model DALL·E 3 and its natural language model GPT. For example, the detailed descriptions of each video, covering characters, environment, style, shots, and so on, rely on the highly descriptive captions DALL·E 3 generates for the visual training data. In addition, OpenAI uses GPT to turn short user prompts into longer, detailed captions, which are then sent to the video model. According to OpenAI, this allows Sora to generate high-quality videos that accurately follow users' needs.
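The prompt-expansion step can be imagined as something like the following sketch, which uses the public OpenAI chat API to turn a terse prompt into a detailed caption; the system prompt and model name are guesses for illustration, not OpenAI's internal pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(short_prompt: str) -> str:
    """Turn a terse user prompt into a detailed, shot-by-shot caption.
    The system prompt and model name here are illustrative guesses."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's idea as a detailed video caption: "
                        "describe subjects, environment, lighting, style, and camera movement."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a woman walking through Tokyo at night"))
```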

By packaging the spacetime patches with the descriptive text provided by DALL·E 3 and GPT and feeding them into the model for training, Sora can finally map text descriptions onto spacetime patches. OpenAI said that, much like large language models, the Sora model shows the same pattern: the larger the scale of training, the better the model performs.

Why is it said that Sora not only generates video, but also carries a world model?

By OpenAI's definition, Sora is not just a video generator, but also a foundation on which AI can understand and simulate the real world; in short, Sora has a world model inside it.

Video (00:15): Two pirate ships wage a naval battle inside a coffee cup.

OpenAI's conclusion rests on Sora's ability to simulate the real world, especially its rendering of the world's physical properties, and the series of videos OpenAI generated with Sora is meant to show this. The foam that forms in the boiling coffee in and around the ships is so realistic that Sora seems to have mastered fluid dynamics; as the camera moves, the objects and scenes Sora generates change accordingly with the 3D space, as if the model understood 3D perspective...

After seeing Sora's work, Nvidia senior scientist Jim Fan also said on social media that Sora is not just a creative toy, but "a data-driven physics engine" and a simulation of the real world.

But Turing Award winner Yann LeCun isn't buying it. He said on social media that modeling the world by generating pixels is too expensive and doomed to fail. In his view, producing a realistic-looking video from a text prompt does not mean the model truly understands the physical world; the process of text-to-video generation is completely different from causal prediction based on a world model. Gary Marcus, who often quarrels with LeCun, took his old rival's side this time.

Whether a model has mastered a world model is a question that has sparked debate in the industry ever since ChatGPT's release. Emily M. Bender, a linguist at the University of Washington, argues that large language models (LLMs) are nothing more than "stochastic parrots" that do not understand the real world; they merely count the probability of a word appearing and then, like parrots, string together words and phrases that look plausible. LeCun holds the same position.

The opposing camp argues that a world model already exists inside large language models, especially at the scale of GPT. According to a Harvard-MIT study, large language models learn linear representations of space and time across multiple scales, and these representations are robust to different prompt variations and unified across different entity types, such as cities and landmarks. Andrew Ng later wrote in his newsletter: "I believe that LLMs have built a complex enough model of the world that I can safely say, to some extent, they do understand the world." Geoffrey Hinton, who won the Turing Award alongside LeCun, holds the same view as Ng.
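The "linear representation" claim can be illustrated with a toy probe like the one below: fit a purely linear readout from hidden activations to geographic coordinates and check how well it generalizes. The data here is random and hypothetical; it only shows the shape of the experiment, not the study's actual code or results.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Toy stand-ins: one hidden-state vector per city-name prompt, plus the
# city's true (latitude, longitude). Real probes use activations extracted
# from an LLM; here the data is random and purely illustrative.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 1024))        # (num_cities, hidden_dim)
coordinates = rng.uniform(-90, 90, size=(500, 2))   # (num_cities, lat/lon)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, coordinates, test_size=0.2, random_state=0)

probe = Ridge(alpha=1.0).fit(X_train, y_train)      # a purely linear readout
print("probe R^2 on held-out cities:", probe.score(X_test, y_test))
# A high R^2 from a *linear* probe is the kind of evidence cited for the
# claim that the model encodes space in a linear representation.
```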

The same argument now seems to be replaying around Sora. The difference is that this is the first time OpenAI has claimed, at a model's release, that the model has the makings of a world model; it did not say so when releasing the GPT series, not even GPT-4.

What exactly is a world model?

As the name suggests, a world model is a model of the real physical world, and Yann LeCun is a major proponent of the concept. He once proposed that within the next 10 years, AI should be able to build a world model: a system that, like a human, constructs a cognitive model of the external world and predicts the world's next state on the basis of that model.

Since 2022, LeCun has been trying to build such a world model for AI. He has even proposed an architecture for an autonomous agent consisting of six core modules (a code sketch follows below):

- Configurator: the coordination command center, responsible for configuring the other modules and orchestrating the instructions they issue;
- Perception: perceives the state of the world and extracts task-relevant information, called on by the configurator for specific tasks;
- World Model: estimates the information about the world's state that perception does not provide, and predicts plausible future states of the world, including the future states that would result from a sequence of actions proposed by the actor module;
- Actor: responsible for finding the best course of action;
- Cost: responsible for computing the agent's "discomfort" value, with the goal of minimizing intrinsic cost over the future;
- Short-term memory: responsible for tracking the current and predicted states of the world and their associated costs.
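Here is a toy, fully runnable sketch of how these six modules might fit together in code; the class names follow the description above, but the implementations are trivial stand-ins, not Meta's or LeCun's actual software.

```python
import random

class Perception:
    def observe(self, raw_obs):
        return {"visible": raw_obs}                       # extract task-relevant state

class WorldModel:
    def fill_in(self, state):
        return {**state, "hidden": 0.0}                   # guess what perception can't see
    def predict(self, state, action):
        return {**state, "hidden": state["hidden"] + action}   # next world state

class Cost:
    def intrinsic(self, state):
        return abs(state["hidden"] - 1.0)                 # "discomfort": distance from a goal

class Actor:
    def propose(self, state):
        return [random.uniform(-1, 1) for _ in range(8)]  # candidate actions

class ShortTermMemory:
    def __init__(self):
        self.trajectory = []                              # current/predicted states and costs

class Configurator:
    """Coordination hub: wires the modules together and runs one decision step."""
    def __init__(self):
        self.perception, self.world_model = Perception(), WorldModel()
        self.actor, self.cost, self.memory = Actor(), Cost(), ShortTermMemory()

    def step(self, raw_obs):
        state = self.world_model.fill_in(self.perception.observe(raw_obs))
        # Choose the action whose predicted future state has the lowest intrinsic cost.
        best = min(self.actor.propose(state),
                   key=lambda a: self.cost.intrinsic(self.world_model.predict(state, a)))
        self.memory.trajectory.append((state, best))
        return best

print(Configurator().step(raw_obs="camera frame"))
```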

In this agent system, the world model is only one module: it fills in the information about the world's state that perception does not provide, so that the decision-making parts of the architecture can use that information to make decisions and plan a path. LeCun believes that only an AI capable of planning deserves to be called artificial general intelligence (AGI), and current LLMs, including GPT, lack this planning ability because they lack common sense about how the world works. That common sense covers not only human relationships but also physical intuitions such as gravity and inertia, which is what the world model captures, so that a machine seeing an apple leave its branch knows it will fall to the ground below rather than drift to the left, the right, or any other direction. No matter how it is described, this kind of knowledge is far less abundant in language data than it is in visual data.

On February 16, the same day Sora was released, Meta also released a video model called V-JEPA (Video Joint Embedding Predictive Architecture). Unlike Sora, which generates the next patch in full, V-JEPA is a non-generative model. It learns by predicting abstract representations of the hidden or missing parts of a video. Meta does not say whether this abstract representation is text, but it is certainly not pixels; it is a representation of the data more abstract than pixels.
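The difference from a generative model can be shown with a small sketch of the JEPA-style training objective: encode the visible patches, predict the embedding of the hidden patches, and compute the loss in representation space, so no pixels are ever reconstructed. The architectures and sizes below are illustrative assumptions, not Meta's actual V-JEPA.

```python
import torch
import torch.nn as nn

embed_dim = 128
context_encoder = nn.Sequential(nn.Linear(768, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
target_encoder = nn.Sequential(nn.Linear(768, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
predictor = nn.Linear(embed_dim, embed_dim)

visible_patches = torch.randn(32, 768)   # patches the model is allowed to see
masked_patches = torch.randn(32, 768)    # patches hidden from the context encoder

with torch.no_grad():                    # target encoder gives the "answer" embedding
    target = target_encoder(masked_patches)

prediction = predictor(context_encoder(visible_patches))
loss = nn.functional.l1_loss(prediction, target)   # loss lives in embedding space, not pixel space
loss.backward()
print(float(loss))
```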

In this way, Meta tries to get the model to understand the picture conceptually, at the level of content, without worrying about details that are usually irrelevant to the task, such as the rich foam in Sora's pirate-ships-in-coffee video, which may not be something V-JEPA would care about.

V-JEPA is a step toward a more grounded understanding of the world, so that machines can carry out more general reasoning and planning. After V-JEPA's release, LeCun said the tool can serve as an early model of the physical world: you don't have to see everything happening out of sight, because the model can tell you conceptually what is going on there.

As Meta's vice president and chief AI scientist, and the leader of the JEPA series of models, LeCun obviously prefers that his V-JEPA, rather than Sora, be the first to have a world model. As a next step, Meta will probably use V-JEPA as a module of an agent, to experiment with planning and sequential decision-making.

Why is it important to have a world model?

Pursuing a world model, and claiming to have one, is not just about sounding good. It determines whether an AI can move from the 2D screen into the 3D physical world, a necessary step toward becoming an AGI.

After ChatGPT's release, robotics companies around the world rushed to put GPT into robot brains. But they all know that understanding language is not enough: to walk in the real physical world, a robot's system must understand the physical events happening around it in order to survive: a falling apple will hit it on the head; a thrown glass will shatter when it hits something; and a person approaching from the other side will take a certain amount of time to arrive...

So in the second half of 2023, a major trend in robotics was to have robots "travel ten thousand miles" after "reading ten thousand books" (loading GPT), that is, to train robots in physical space. In July 2023, the Google DeepMind team launched a robot called RT-2 (Robotic Transformer 2), which lets operators instruct it to complete tasks through natural language, even tasks it was never trained on. It does this with a composite model that blends a language model with physical training data.

Asked to grab the extinct animal, RT-2 grabbed the dinosaur.

Google first had 13 robots train for 17 months in an office kitchen environment, then loaded the resulting data into a vision-language model (VLM) built on a large language model, producing a vision-language-action (VLA) model, RT-2.
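One widely reported ingredient of this VLA recipe is treating robot actions as if they were text: each continuous control dimension is discretized into an integer bin so the vision-language model can emit an action the same way it emits words. The sketch below illustrates that idea; the bin count, value range, and 7-D action layout are assumptions, not Google's published specification.

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0          # assumed normalized range for each action dimension

def action_to_tokens(action: np.ndarray) -> str:
    """Discretize a continuous action into integer bins, serialized as text."""
    bins = np.round((action - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)
    return " ".join(str(b) for b in np.clip(bins, 0, N_BINS - 1))

def tokens_to_action(tokens: str) -> np.ndarray:
    """Recover (approximately) the continuous action from its token string."""
    bins = np.array([int(t) for t in tokens.split()])
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

# Hypothetical 7-D arm command: 3 translation, 3 rotation, 1 gripper.
action = np.array([0.10, -0.25, 0.40, 0.0, 0.05, -0.10, 1.0])
tokens = action_to_tokens(action)
print(tokens)                          # "140 96 178 128 134 115 255"
print(tokens_to_action(tokens))        # approximately recovers the original command
```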

If Sora could be loaded into RT-2, it might not need 17 months of training in a physical office environment. Sora's video generation can produce the next frames from the current state of a scene, that is, what is likely to happen next, letting the agent prepare in advance.

Of course, neither Sora nor V-JEPA is yet a sufficiently stable generator or predictor. V-JEPA has not shown any videos it generated, and OpenAI also admits that the videos Sora generates are not perfect: they still contain images that violate the laws of physics, for example a person biting a cookie without leaving a bite mark, a person running the wrong way on a treadmill, or a knocked-over cup whose orientation never changes even as the liquid spills out first... Still, in the successfully generated videos, objects and scenes in 3D space already change as the camera moves, something neither Runway nor Pika has managed to do.
