How exactly does Sora work?


Last week, the OpenAI team unveiled Sora, a large-scale video generation model that showcases emerging capabilities for simulating fundamental aspects of the physical world. I've been following the field of text-to-video generation for a long time, and I think this model represents a leap forward in quality.

I've seen a lot of speculation on Reddit and Twitter about how this model works, including some fringe claims (does Sora run inside a game engine like Unreal?). When a groundbreaking AI tool like this is released, many people want to appear as if they know how it works, or perhaps even trick themselves into thinking they can glean subtle clues from a few published samples. The worst example I've found is Dr. Jim Fan's post claiming that "Sora is a data-driven physics engine", which was viewed about 4 million times on Twitter (Sora is not a data-driven physics engine at all).

Luckily, OpenAI has published a research article explaining the model's architecture, so there is actually no need to guess. Below, I'll walk through the technical details the OpenAI team has shared so that we can understand how Sora actually works.

Since the advent of the field of artificial intelligence, creating AI capable of modeling, understanding, and simulating the inherent complexity of the real world has been a formidable challenge. Unlike static images, video inherently involves change over time, 3D space, physical interactions, and object continuity, among other things. Past generative video models struggled to handle varying durations, resolutions, and camera angles. More importantly, those systems lacked the intrinsic "understanding" of physics, causality, and object permanence that high-fidelity simulation of reality requires.

The videos released by OpenAI showcase a model that is better at all of these things than anything we've seen before. Frankly, these videos look very real. For example, a person's head briefly occludes a sign and then moves past it, and the text on the sign remains unchanged. Animals flap their wings realistically even when they are "idle". Petals sway with the wind. Most video models are powerless in the face of these challenges, producing flickering, shaky footage that is hard for the viewer to parse, but Sora does not have this problem. How does it do it?

My first major takeaway from reading the research article and existing posts is that this work builds on previous work on language models such as OpenAI's GPT.

Video representation

One of the key innovations the researchers introduce is how Sora represents videos during training. Each frame is divided into many small patches, similar to how words are broken down into tokens in large language models such as GPT-4. This patch-based approach allows Sora to train on videos of varying lengths, resolutions, orientations, and aspect ratios: regardless of the original shape of the source video, the patches extracted from its frames are processed in exactly the same way.

Figure 1. From OpenAI's research article: "At a high level, we first compress videos into a lower-dimensional latent space, and then decompose that representation into spacetime patches, thereby turning videos into patches."
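To make the patch idea concrete, here is a minimal sketch in Python/NumPy of how a latent video tensor could be cut into a flat sequence of spacetime patch tokens. The patch sizes, tensor layout, and function name are all hypothetical, since OpenAI has not published these details:

```python
import numpy as np

def to_spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Split a compressed (latent) video into a flat sequence of patch tokens.

    latent: array of shape (T, H, W, C), a video already encoded into a
            lower-dimensional latent space.
    pt, ph, pw: hypothetical patch sizes along time, height, and width.
    Returns an array of shape (num_patches, pt*ph*pw*C), one row per token.
    """
    T, H, W, C = latent.shape
    # Trim so every dimension divides evenly into whole patches.
    latent = latent[: T - T % pt, : H - H % ph, : W - W % pw]
    t, h, w = latent.shape[0] // pt, latent.shape[1] // ph, latent.shape[2] // pw
    # Group into a (t, h, w) grid of (pt, ph, pw, C) blocks, then flatten
    # each block into one token vector.
    blocks = latent.reshape(t, pt, h, ph, w, pw, C)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return blocks.reshape(t * h * w, pt * ph * pw * C)

# A 16-frame, 32x32, 4-channel latent video becomes 2048 tokens of size 32:
tokens = to_spacetime_patches(np.random.randn(16, 32, 32, 4))
print(tokens.shape)  # (2048, 32)
```

The payoff of this layout is that a short vertical clip and a long widescreen one both become plain token sequences of different lengths, which a transformer can consume uniformly.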

Model architecture

Sora uses a transformer architecture closely related to that of the GPT models to process these long sequences of patch tokens. Transformers contain spatiotemporal self-attention layers, which are well suited to modeling long-range dependencies in sequences such as text, audio, and video.

During training, Sora's transformer takes as input patch tokens that have been noised at some step of the diffusion process, and is trained to predict the original "denoised" tokens. By training on millions of diverse videos, Sora gradually learns the patterns and semantics of natural video.

Figure 2. Diagram of the denoising process, from the OpenAI research article.
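As a rough illustration, a single training step of such a denoising diffusion model might look like the sketch below. This is not OpenAI's code: the model interface, the linear noise schedule, and the noise-prediction loss are all simplifying assumptions:

```python
import torch

def diffusion_training_step(model, tokens, text_emb, num_steps=1000):
    """One simplified denoising-diffusion training step.

    model:    a transformer that predicts the noise added to patch tokens
              (hypothetical signature: model(tokens, timestep, text_emb))
    tokens:   clean spacetime patch tokens, shape (batch, seq_len, dim)
    text_emb: text-prompt embeddings used as conditioning context
    """
    batch = tokens.shape[0]
    # Pick a random diffusion timestep for each video in the batch.
    t = torch.randint(0, num_steps, (batch,))
    # Simple linear noise schedule (a stand-in for the real schedule).
    alpha = (1.0 - t.float() / num_steps).view(batch, 1, 1)
    noise = torch.randn_like(tokens)
    # Corrupt the clean tokens with timestep-dependent noise.
    noisy_tokens = alpha.sqrt() * tokens + (1 - alpha).sqrt() * noise
    # The transformer sees noisy tokens plus conditioning, predicts the noise.
    pred_noise = model(noisy_tokens, t, text_emb)
    loss = torch.nn.functional.mse_loss(pred_noise, noise)
    loss.backward()
    return loss.item()
```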

Text conditioning

Sora is also text-conditioned, meaning that generation can be steered by text prompts. The prompt is embedded and provided to the model as additional context alongside the patch tokens.

To better connect text descriptions with actual video content, the researchers generated highly descriptive captions for each training video using a separate captioning model. This re-captioning technique helps Sora follow text prompts more closely.
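One common way to wire in this kind of conditioning, shown in the PyTorch sketch below, is to let the patch tokens cross-attend to the text embeddings. This illustrates the general technique, not Sora's confirmed design:

```python
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """A transformer block where patch tokens attend to text embeddings."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, patch_tokens, text_emb):
        # Spatiotemporal self-attention over the patch-token sequence.
        patch_tokens = patch_tokens + self.self_attn(
            patch_tokens, patch_tokens, patch_tokens)[0]
        # Cross-attention injects the text prompt as conditioning context.
        patch_tokens = patch_tokens + self.cross_attn(
            patch_tokens, text_emb, text_emb)[0]
        return patch_tokens + self.mlp(patch_tokens)
```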

Inference

At inference time, Sora starts from patches of pure noise and iteratively denoises them over 50 or more diffusion steps until a coherent, smooth video emerges. By supplying different text prompts, Sora generates different videos that appropriately match their captions.

The patch-based representation lets Sora handle any resolution, duration, and orientation at inference time: the patches are simply arranged into the desired shape before the diffusion process begins.
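A bare-bones version of that sampling loop might look like the following. Again, this is a sketch: the model interface and the simplified update rule are assumptions, and a real sampler would use a proper noise schedule plus a decoder to map the final latent patches back to pixels:

```python
import torch

@torch.no_grad()
def sample_video(model, text_emb, seq_len, dim, num_steps=50):
    """Generate patch tokens by iterative denoising, starting from pure noise."""
    tokens = torch.randn(1, seq_len, dim)  # pure-noise patches
    for step in reversed(range(num_steps)):
        t = torch.full((1,), step)
        # Predict the noise present at this step, conditioned on the prompt.
        pred_noise = model(tokens, t, text_emb)
        # Simplified update: remove a fraction of the predicted noise.
        tokens = tokens - pred_noise / num_steps
    return tokens  # decode these latent patches back into video frames
```

Because `seq_len` is just the number of patches, the same loop produces a square clip, a vertical clip, or a longer video simply by laying out a different patch grid before sampling.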

By scaling the training data to millions of video clips and applying enormous computational resources, the OpenAI team observed some very interesting emergent behaviors:

Sora is not limited to text-to-video generation; it can also generate videos from input images or from other videos.

Sora appears to have a strong 3D "understanding" of the scene, with characters and objects moving realistically and consistently. This emerges purely from the scale of the data, without any explicit 3D modeling or graphics engine.

The model exhibits object permanence, often keeping track of entities and objects even when they are temporarily out of frame or occluded.

Sora demonstrates the ability to simulate some basic real-world interactions. For example, a digital painter's brushstrokes persist accurately on the canvas over time.

It can also convincingly generate complex virtual worlds and games such as Minecraft, controlling an agent moving through the generated environment while simultaneously rendering the scene.

With additional computing power and data, quality, consistency, and prompt adherence improve markedly, suggesting the approach benefits further from scale.

However, Sora still shows obvious flaws and limitations:

It often struggles to accurately model more complex real-world physical interactions, dynamics, and causal relationships. Even simple physics and object properties remain challenging: in one example, a glass is knocked over and spills, but the glass appears to melt into the table and the liquid flows down its side with no shattering at all.

The model tends to spontaneously generate unexpected objects or entities, especially in crowded or chaotic scenes.

It easily confuses left and right, and when many actions unfold over a period of time, the precise sequence of events can get scrambled.

It still struggles to realistically simulate natural interactions between multiple characters and their environment. For example, it generates a person walking the wrong way on a treadmill.

Despite these persistent shortcomings, Sora hints at what is to come as researchers continue to scale up generative models. With enough data and compute, video transformers may begin to develop a deeper grasp of real-world physics, causality, and object permanence. Combined with language understanding, this could open up new approaches to training AI systems through video-based simulations of the real world.

Sora is taking the first few steps towards this goal. While much remains to be done to overcome its many weaknesses, the emergent capabilities it demonstrates highlight the promise of this research direction. Giant transformers trained on vast, diverse datasets may eventually yield AI systems capable of intelligently interacting with and understanding the inherent complexity, richness, and depth of our physical environment.

So, contrary to the unfounded claims, Sora does not run on a game engine or a "data-driven physics engine"; it is a transformer architecture operating on spacetime patches, just as GPT-4 operates on text tokens. It excels at creating videos that exhibit an understanding of depth, object persistence, and natural dynamics.

The model's key innovation is treating video frames as sequences of patches, analogous to word tokens in a language model, which allows it to handle videos of varying durations, resolutions, and aspect ratios. Combined with text-conditioned generation, this enables Sora to produce contextually relevant, visually coherent videos from text prompts.

Despite its groundbreaking capabilities, Sora has limitations, such as maintaining coherence when modeling complex interactions and dynamic scenes. These limitations point to the need for further research, but they do not diminish the significant step it represents in advancing video generation technology.

I hope Sora is released for people to try soon; I've already thought of many new applications for this technology. Let's wait and see.
