On February 16, 2024, OpenAI released Sora, a large AI model for video generation. As soon as the news broke, the industry was shocked once again.
According to the description on OpenAI's official website, Sora is an AI model that creates realistic and imaginative scenes from text instructions. It can generate high-definition videos up to one minute long, including complex scenes with multiple characters and specific types of motion; in other words, it can understand and simulate the physical world in motion.
Over the past year, riding the boom of ChatGPT and GPTs, text-to-image, text-to-video, and image-to-video products have emerged one after another. Why, then, did Sora make waves on release the way ChatGPT once did?
1. Performance
Compared with other text-to-video products, Sora can generate videos up to 60 seconds long while keeping them coherent and keeping characters and scenes consistent over time. This is its most significant advantage.
For context, on January 24 and February 15, Google researchers announced the video generation model Lumiere and a demo video of Gemini 1.5, respectively. The former can generate highly realistic high-definition video, change a subject's outfit with one click, and generate motion according to a supplied input and prompt words; the latter shows astonishing ability in image recognition and multi-turn dialogue. Yet only a short time later, Sora's quiet debut stole the limelight from both Lumiere and Gemini 1.5. The main reason is the overall performance of the product.
Although Lumiere and Gemini 1.5 are impressive, they do not break new ground in the length or consistency of generated video (generation is limited to 5 seconds). Other comparable products, such as Runway and Pika, are likewise still struggling to stay coherent beyond a few seconds, and coherence strongly affects how realistic a video looks. Sora, by contrast, can directly generate videos up to 60 seconds long at up to 30 frames per second, far ahead of comparable models in both duration and consistency. Sora can also generate a wide range of resolutions, including 1920x1080 (widescreen), 1080x1920 (vertical), and everything in between, up to 2048x2048. This lets Sora create content adapted to different devices. See Table 1 below.
Table 1 Comparison of the duration and resolution of various AI models.
Sora has other advantages over competing models as well: it represents details accurately, understands how objects exist in the physical world, and generates characters with rich emotions. It can also generate video from text prompts or still images, and can even fill in missing frames in existing videos.
2. Implementation
In the past, the main approaches to video generation were Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), autoregressive Transformers, and diffusion models. In general, generation models built on these approaches have obvious shortcomings: they cover few categories of visual data, produce only short clips, and work at a fixed size.
Sora is trained on a diffusion transformer architecture, which combines the generative strengths of the Transformer with the denoising process of the diffusion model. Thanks to its self-attention mechanism, the Transformer can capture long-range dependencies in a sequence, which gives it an advantage on data with complex spatiotemporal dependencies. Self-attention can also be computed as matrix operations and parallelized efficiently, so a Transformer-based model can process large-scale data in parallel and generate video faster. The diffusion component lets the model retain more detail and texture, which raises output quality. Because it uses a diffusion transformer, Sora can generate a wide variety of videos and overcome the limits of earlier methods on duration, variety, and fixed size. See Table 2 below.
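To make the self-attention idea above concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not Sora's implementation: the function names, the toy token values, and the identity projection matrices are all assumptions chosen to keep the example small.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # every token is scored against every other token
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

# Four hypothetical patch tokens with two features each (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection matrices
output, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` covers the whole sequence, so the first token can draw on the last one directly no matter how far apart they are. The work is expressed as a handful of matrix products, which is the property that makes parallel execution straightforward.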
Table 2 Comparison of the implementation methods of various generation methods.
3. The principle of SORA's generation
Sora's generation process can be broadly divided into three steps. First, a video compression network compresses the video into a compact form (that is, dimensionality reduction). Second, the compressed video is decomposed into spacetime latent patches: small units that each contain part of the video's spatial and temporal information, so that later steps can process them individually. Finally comes video generation: the input text or video is encoded, and the Transformer model (the same basic architecture that underlies ChatGPT) decides how to transform or combine these units to form a complete video.
Step 1: Video compression network.
As shown in Figure 1 below, Sora compresses the input video into a low-dimensional representation through a compression network. The process amounts to "standardizing" videos of different sizes and resolutions so that they are easier to process and store.
Sora then breaks the compressed video data down further into so-called "spacetime patches", each of which carries a portion of the spatial and temporal information and serves as a basic building block of the visual content. In this way, Sora retains the richness of the original visual information while converting source videos of different lengths, resolutions, and styles into a consistent format.
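The patch decomposition described above can be sketched as follows. This is a toy illustration under stated assumptions: the video is a small single-channel nested list rather than a real compressed latent, and `make_video`, `to_spacetime_patches`, and the 2x2x2 patch size are hypothetical names and values, not details from the technical report.

```python
def make_video(frames, height, width):
    """Build a toy single-channel video as nested lists indexed [t][y][x]."""
    return [[[t * 10000 + y * 100 + x for x in range(width)]
             for y in range(height)]
            for t in range(frames)]

def to_spacetime_patches(video, pt, ph, pw):
    """Cut a video into patches spanning pt frames, ph rows, and pw columns,
    and flatten each patch into one vector."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                patches.append([video[t][y][x]
                                for t in range(t0, t0 + pt)
                                for y in range(y0, y0 + ph)
                                for x in range(x0, x0 + pw)])
    return patches

# Videos of different shapes and lengths all become lists of same-sized vectors.
wide = to_spacetime_patches(make_video(4, 4, 8), pt=2, ph=2, pw=2)
tall = to_spacetime_patches(make_video(4, 8, 4), pt=2, ph=2, pw=2)
short = to_spacetime_patches(make_video(2, 4, 4), pt=2, ph=2, pw=2)
```

The three inputs differ in shape and duration, yet every patch has the same length, so one model can consume all of them as a sequence. Only the number of patches changes with the size of the source.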
Step 2: Extract spacetime latent patches.
A pre-trained Transformer model then extracts the information in the spacetime latent patches produced in step 1 to form a large "list" of patches. This list records the correspondence between visual representations and their semantics, and serves as the knowledge base for subsequent generation.
Step 3: Generate video with the Transformer model.
During generation, the Transformer model receives spacetime latent patches that cover the same duration as the target video but whose content is pure random noise. Sora then repeatedly modifies these patches according to the given text prompt, using what it learned from large amounts of training data to decide how to remove the noise step by step. The noise is gradually transformed into something close to the text description, and the resulting patches are then transformed or combined to produce the final video.
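The iterative noise removal described above can be illustrated with a deliberately simplified loop. This is not the diffusion algorithm Sora uses. In a real model a trained network predicts the noise from the noisy patches and the prompt; here a hypothetical lookup table, `PROMPT_TO_LATENT`, stands in for everything the network has learned, and the step size and step count are arbitrary.

```python
import random

# Hypothetical stand-in for learned knowledge: the "clean" latent for a prompt.
PROMPT_TO_LATENT = {"a red ball bouncing": [1.0, 0.0, 0.5, -0.5]}

def toy_denoise(prompt, steps=50, step_size=0.2, seed=0):
    """Start from random noise and nudge it toward the prompt's latent."""
    target = PROMPT_TO_LATENT[prompt]
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with
    errors = []
    for _ in range(steps):
        # A real model would predict this noise with a trained network.
        predicted_noise = [x - t for x, t in zip(latent, target)]
        latent = [x - step_size * n for x, n in zip(latent, predicted_noise)]
        errors.append(sum((x - t) ** 2 for x, t in zip(latent, target)) ** 0.5)
    return latent, errors

latent, errors = toy_denoise("a red ball bouncing")
```

Each pass removes a fraction of the remaining noise, so the distance to the target shrinks at every step. That mirrors the gradual refinement in the text, with the large caveat that the target is given here, whereas a real model has to infer it.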
4. SORA's technological innovation
Judging from the published technical report, Sora resembles ChatGPT in that it shows little originality at the level of underlying technology and instead makes full use of existing advanced techniques. In terms of application experience, however, it introduces innovations that set it apart from similar products.
In Sora's three-step process, the compression step borrows from "High-Resolution Image Synthesis with Latent Diffusion Models". The concept of spacetime latent patches (visual patches) comes from "ViViT: A Video Vision Transformer" (Google, 2021), which builds on the Vision Transformer (ViT). The diffusion transformer structure was originally proposed in "Scalable Diffusion Models with Transformers" (William Peebles and Saining Xie, 2022).
However, the SORA model is unique in terms of size selection, language comprehension, multimodal input, and diversified generation.
Earlier models cropped training videos to a standard size and duration, such as 256x256 for 4 seconds. Sora, on the other hand, can directly generate videos in different sizes, for example 1920x1080 for landscape screens and 1080x1920 for portrait screens, so it can produce a resolution that matches the screen of the target device. This is mainly thanks to the "standardization" into a low-dimensional space performed by the video compression network (see above).
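A rough calculation shows why a patch-based representation handles these sizes uniformly. The compression factors and patch size below are hypothetical, since the technical report does not publish them, and the 30 fps frame rate for the 4-second clip is also an assumption. Only the resolutions and durations come from the text.

```python
import math

def patch_count(width, height, frames, spatial_down=8, temporal_down=4, patch=2):
    """Estimate how many spacetime patches a video yields.

    spatial_down, temporal_down, and patch are assumed values for illustration.
    Ceiling division stands in for padding when a size does not divide evenly.
    """
    latent_w = math.ceil(width / spatial_down)
    latent_h = math.ceil(height / spatial_down)
    latent_t = math.ceil(frames / temporal_down)
    return latent_t * math.ceil(latent_h / patch) * math.ceil(latent_w / patch)

legacy = patch_count(256, 256, frames=4 * 30)          # 256x256 for 4 seconds
landscape = patch_count(1920, 1080, frames=60 * 30)    # 60 seconds at 30 fps
portrait = patch_count(1080, 1920, frames=60 * 30)
```

Under these assumptions, the landscape and portrait videos produce the same number of patches, and the cropped legacy clip simply produces far fewer. The model sees a longer or shorter sequence rather than a different input format.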
According to Sora's technical report, the team drew on the re-captioning technique used in DALL·E 3 to regenerate more detailed and accurate text descriptions for all videos in the training set. A GPT model is also used to expand short user prompts into longer, more detailed captions. These data enhancements improve Sora's language comprehension.
On the input side, Sora accepts not only text prompts but also images and videos, which makes it a typical multimodal model. On the output side, Sora can edit, fill in, and splice videos, and can extend a video forward or backward in time.
5. Future expectations and inspiration
Of course, judging by this first generation, Sora still has many shortcomings: it struggles to simulate the physics of complex scenes, understand specific cause-and-effect relationships, handle spatial details, and accurately depict events that unfold over time. With more training data and iterative upgrades to the model, these shortcomings are expected to improve gradually.
There is no doubt that Sora and its successors will accelerate the development and application of AIGC, with far-reaching impact on film and television, live streaming, advertising, animation, art and design, and other industries. Now that short videos are everywhere, Sora can already take on the work of shooting, directing, and editing them.
For OpenAI, however, Sora is more than a video generation tool on the road to artificial general intelligence. As the Sora technical report puts it: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world." What OpenAI ultimately wants to create, then, is a universal "physical world simulator". In this sense, Sora is positioned as a world model: a model of the real world.
A digital twin, by comparison, is more about forming a "mirror" of the physical world by digitizing it, in order to better grasp the operating state of the physical world and its rules, and then to adjust, intervene in, and optimize the physical world by issuing instructions in the digital one. A "world model" aims to make humanity's world of ideas and mental world fully concrete, compare it with the state and operation of the real physical world, and finally form expectations and strategies for changing that state and operation. Sora is therefore not only a first-class video generation model but also a simulator of the objective world, one that opens the road to a simulated world.