Editor: Editorial Department.
[New Zhiyuan Guide] Jason Wei, the OpenAI researcher who once revealed his 996 work schedule, says that Sora represents the GPT-2 moment for video generation. The keys to competition are computing power and data. Can Sora be successfully reproduced in China? This 37-page technical report by a Chinese team may offer some inspiration.
Today, this chart is popular in the AI community.
It lists the release date, architecture, and authors of a number of video generation models.
Not surprisingly, Google is still the author of the pioneering work among these models. But the spotlight in AI video has now been snatched away by Sora.
At the same time, Jason Wei, the OpenAI researcher who revealed his 996 work schedule, said:
"Sora is a milestone that represents the GPT-2 moment for video generation."
For the field of text generation, GPT-2 was undoubtedly a watershed. The launch of GPT-2 in 2019 marked a new era in which models could generate coherent, grammatically correct paragraphs of text.
Of course, GPT-2 struggled to produce a complete, error-free article, and its output contained logical inconsistencies and fabricated facts. However, it laid the groundwork for subsequent model development.
In less than five years, GPT-4 has become able to perform complex tasks such as chain-of-thought reasoning, or to write a long article without fabricating facts.
And today, Sora marks just such a moment.
It creates short videos that are both artistic and realistic. Although it cannot yet create a 40-minute TV episode, the character consistency and storytelling are already very engaging!
Jason Wei believes that the ability to maintain long-term consistency, achieve near-perfect realism, and create deep storylines will gradually take shape in Sora and future video generation models.
Will Sora disrupt Hollywood? How far is it from a blockbuster movie?
Tyler Perry, a well-known Hollywood director, was so shocked by Sora's videos that he decided to cancel his Atlanta studio's $800 million expansion plan.
After all, blockbusters shot in the future may no longer need location scouting or physical sets.
So, will Sora disrupt the film industry? Jason Wei said that, like GPT-4 today, it can serve as an auxiliary tool for improving the quality of a work, so it is still some way from professional filmmaking.
Right now, the biggest difference between video and text is that video has lower information density, so learning skills such as video reasoning will require a lot of computing power and data.
As a result, competition for high-quality video data will be fierce, just as everyone is now competing for high-quality text datasets.
In addition, it will be crucial to combine video with other modalities as an aid to the learning process.
And in the future, AI researchers with video processing experience will become very sought-after! However, they will also need to adapt to new technological trends, just as traditional natural language processing researchers did.
OpenAI's TikTok account is still releasing new works from Sora.
How far is Sora from Hollywood blockbusters? Let's take a look at a scene often seen in movies: a car speeding through a city street in the pouring rain.
a super car driving through city streets at night with heavy rain everywhere, shot from behind the car as it drives
As another example, the construction site generated by Sora is also very realistic, with forklifts, excavators, scaffolding, and construction workers.
It also captures the effect of tilt-shift photography, making everything look like a miniature.
Of course, if you look closely, the footage has some problems.
For example, one person suddenly becomes several people.
Or one person suddenly turns into another.
Swyx, the founder of an AI company, concludes that the root cause is that Sora does not have an intermediate physical model, which is the exact opposite of LeCun's world model.
Still, it created a quantum leap forward in the filmmaking process and significantly reduced costs.
While Runway can do something similar, Sora takes everything to the next level.
Here's a comparison of Sora with Pika, Runway Gen-2, AnimateDiff, and Leonardo AI.
In the near future, perhaps each of us will be able to generate our own movies in a matter of minutes.
For example, we can use ChatGPT to help write the script, and then use Sora to turn the text into video. In the future, Sora will surely break through the 60-second limit.
Imagine being able to watch a movie that has only ever existed in your head.
Alternatively, we can use DALL-E or Midjourney to generate an image, and then use Sora to turn it into video.
D-ID can keep the character's mouth movements, body movements, and lines in sync.
The "Harry Potter" Balenciaga fashion video that swept the internet earlier is one such work.
ElevenLabs can voice the characters in a video, enhancing its emotional impact and creating a seamless blend of visual and auditory narrative.
It's that easy to make your own blockbuster!
Unfortunately, the cost of training Sora is estimated at about 10 million dollars.
After the release of ChatGPT last year, a "war of a thousand models" quickly broke out. This time, however, half a month after Sora's debut, companies are still silent.
How can Chinese companies replicate Sora?
Recently, a Chinese team released a very detailed analysis of Sora, which may offer some inspiration on this question.
A Chinese team reverse-engineers Sora
Recently, a Chinese team from Lehigh University, together with Dr. Jianfeng Gao, Vice President at Microsoft, published a 37-page analysis paper.
Based on public technical reports and reverse engineering of the model, the paper comprehensively examines Sora's development background, the technologies it relies on, its application prospects in various industries, the current challenges, and future trends in text-to-video technology.
The research focuses mainly on Sora's development process and the key technologies for building this virtual world simulator, and explores in depth Sora's application potential and possible impact in film production, education, marketing, and other fields.
As shown in Figure 2, Sora can accurately understand and execute complex human instructions.
Sora has also made great progress in generating long videos that show movement and interaction in detail, breaking through the limits of earlier video generation technology in length and visual quality. This capability marks a major leap forward for AI creative tools, allowing users to turn textual narratives into vivid visual stories.
According to the researchers, Sora reaches this level not only because it can process the user's text input, but also because it understands the complex interrelationships among the elements in a scene.
As Figure 3 illustrates, the development paths of generative computer vision (CV) technologies have been diverse over the past decade, especially after the successful application of Transformer architectures to natural language processing (NLP).
Researchers have extended Transformers to vision tasks by combining the architecture with vision components, as in the groundbreaking Vision Transformer (ViT) and Swin Transformer.
At the same time, diffusion models have also made breakthroughs in image and video generation, demonstrating a mathematically sound approach that converts noise into images with a U-Net.
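The noise-to-image idea can be illustrated with the standard DDPM-style noising formula. The article does not spell this out, so the linear schedule, its values, and the function names below are all assumptions. This is a minimal sketch in which the U-Net is replaced by an oracle that already knows the noise, just to show what the network is trained to predict.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def estimate_x0(x_t, t, eps_pred, alpha_bar):
    """Invert the noising formula, given a prediction of the added noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
image = rng.uniform(-1.0, 1.0, size=(8, 8, 3))   # toy "image"

# Forward process: corrupt the image at an intermediate time step.
noisy, true_eps = add_noise(image, 500, alpha_bar, rng)

# A trained U-Net would predict the noise from (noisy, t); here the true
# noise stands in for a perfect prediction.
recovered = estimate_x0(noisy, 500, true_eps, alpha_bar)
```

In a real model the prediction is imperfect, so sampling starts from pure noise and applies many small denoising steps rather than a single inversion.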
Since 2021, the focus of research in the field of AI has shifted to those language and visual generative models that can understand human instructions, that is, multimodal models.
Following the release of ChatGPT, 2023 saw the emergence of commercial text-to-image products such as Stable Diffusion, Midjourney, and DALL-E 3.
However, due to the inherent temporal complexity of video, most current video generation tools can only produce short clips a few seconds long.
In this context, the emergence of Sora marks a major breakthrough: it is the first model capable of generating videos up to a minute long from human instructions, and its significance is comparable to the impact of ChatGPT in NLP.
As shown in Figure 4, the core of Sora is a diffusion Transformer that can flexibly handle data of different dimensions, and it consists of three main parts:
1. First, a space-time compressor maps the original video into latent space.
2. Next, a Vision Transformer (ViT) processes the tokenized latent representation and outputs the denoised latent representation.
3. Finally, a CLIP-like conditioning mechanism guides the diffusion model to generate video with a specific style or theme, based on the user's instructions (augmented by a large language model) and latent visual prompts. After several denoising steps, a latent representation of the generated video is obtained, which the corresponding decoder then maps back to pixel space.
As shown in Figure 5, one of Sora's hallmarks is its ability to process, understand, and generate videos and images in a wide range of sizes, from widescreen 1920x1080 to portrait 1080x1920.
As shown in Figure 6, compared with models trained only on uniformly cropped squares, Sora produces better framing, ensuring that the subject of the scene is fully captured and avoiding the truncated views that square cropping sometimes causes.
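A quick calculation shows why square cropping tends to truncate subjects. This is a minimal sketch, assuming a simple crop to the largest square that fits in the frame; the report does not say how other models crop, and `square_crop_retained` is a hypothetical helper.

```python
def square_crop_retained(width, height):
    """Fraction of the frame area kept by the largest square crop."""
    side = min(width, height)
    return (side * side) / (width * height)

widescreen = square_crop_retained(1920, 1080)  # 1080 / 1920 = 0.5625
portrait = square_crop_retained(1080, 1920)    # same fraction, cut vertically
square = square_crop_retained(1080, 1080)      # nothing is lost
```

Under this assumption, a square crop discards almost half of a 1920x1080 frame, so anything near the left or right edge is lost before the model ever sees it.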
Sora's granular understanding and retention of features is a significant step forward in the field of generative models.
It not only demonstrates the potential to generate more realistic and engaging videos, but also highlights the importance of diversity in training data for achieving high-quality results in generative AI.
To handle a wide variety of visual inputs effectively, such as videos and images of different lengths, resolutions, and aspect ratios, an important approach is to convert these visual data into a unified representation. Doing so also facilitates large-scale training of generative models.
Specifically, Sora first compresses video into a low-dimensional latent space, and then decomposes the representation into spatiotemporal patches.
As shown in Figure 7, the goal of Sora's video compression network (or visual encoder) is to reduce the dimensionality of the input data and output a latent representation that is compressed both temporally and spatially.
References in the technical report suggest that this compression network is built on a VAE or Vector Quantised-VAE (VQ-VAE). However, according to the report, it is difficult for a VAE to map visual data of arbitrary sizes into a unified, fixed-size latent space without resizing and cropping.
In response to this problem, the researchers have identified two possible technical implementations:
1. Spatial-patch compression.
This process converts video frames into fixed-size patches, similar to the approach used in ViT and MAE (as shown in Figure 8), and then encodes them into latent space.
In this way, the model can efficiently handle different resolutions and aspect ratios, because it can analyze these patches to understand the content of the entire frame. Next, these spatial tokens are arranged in temporal order to form a spatiotemporal latent representation.
2. Spatial-temporal-patch compression.
This technique covers both the spatial and temporal dimensions of the video data, considering not only the static details of each frame but also the movement and changes between frames, so as to fully capture the dynamic characteristics of the video. Using 3D convolution is a straightforward and effective way to achieve this integration.
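The two options above can be sketched with one NumPy helper. This is only an illustration under stated assumptions: the patch sizes, the toy clip shapes, and the random linear projection standing in for a learned encoder are all made up, and nothing here is confirmed to match Sora's actual compression network. With a temporal patch size of 1 the helper yields per-frame spatial patches in temporal order (option 1); with a larger temporal size it yields space-time patches (option 2). A 3D convolution whose stride equals its kernel size applies the same kind of per-patch linear projection.

```python
import numpy as np

def spacetime_patches(video, pt, ph, pw):
    """Cut a (T, H, W, C) video into non-overlapping pt x ph x pw patches.

    Returns an array of shape (num_patches, pt * ph * pw * C), ordered by
    time first, then row, then column.
    """
    t, h, w, c = video.shape
    if t % pt or h % ph or w % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    gt, gh, gw = t // pt, h // ph, w // pw
    x = video.reshape(gt, pt, gh, ph, gw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (gt, gh, gw, pt, ph, pw, C)
    return x.reshape(gt * gh * gw, pt * ph * pw * c)

rng = np.random.default_rng(0)
wide = rng.standard_normal((8, 32, 64, 3))  # toy "widescreen" clip
tall = rng.standard_normal((8, 64, 32, 3))  # toy "portrait" clip

# Option 1: spatial patches (pt=1), one frame at a time, in temporal order.
spatial_wide = spacetime_patches(wide, 1, 16, 16)
spatial_tall = spacetime_patches(tall, 1, 16, 16)

# Option 2: space-time patches that also span 2 consecutive frames.
tubes_wide = spacetime_patches(wide, 2, 16, 16)

# Stand-in for the learned encoder: one shared linear projection per patch.
projection = rng.standard_normal((tubes_wide.shape[1], 32)) * 0.02
tokens = tubes_wide @ projection
```

Both aspect ratios produce patches of the same size, so one model can consume them without cropping. Only the number of patches differs from clip to clip.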
There is another key issue in the compression network: how to handle variation in the latent space dimensions (that is, the number of latent feature chunks or patches from different video types) before feeding patches into the input layer of the diffusion Transformer.
According to Sora's technical report and the corresponding references, Patch n' Pack (PNP) is the most likely solution.
As shown in Figure 10, PNP packs multiple patches from different images into a single sequence.
Here, the patchification and token embedding steps need to be completed in the compression network, but Sora may further patchify the latent into Transformer tokens, as the Diffusion Transformer does.
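The packing idea can be sketched in plain Python. This is a minimal illustration, assuming a greedy first-fit strategy and integer stand-ins for patch tokens; `pack_sequences` and `attention_mask` are hypothetical names, and the report does not confirm how (or whether) Sora packs its patches.

```python
def pack_sequences(token_lists, max_len, pad_token=0):
    """Greedily pack variable-length token lists into rows of length max_len.

    Returns the packed rows and, for each row, segment ids that record which
    input each position came from (0 marks padding).
    """
    rows, seg_rows = [], []
    for seg_id, tokens in enumerate(token_lists, start=1):
        if len(tokens) > max_len:
            raise ValueError("a single input is longer than max_len")
        for row, segs in zip(rows, seg_rows):
            if len(row) + len(tokens) <= max_len:  # first row with enough room
                row.extend(tokens)
                segs.extend([seg_id] * len(tokens))
                break
        else:  # no existing row had room, so start a new one
            rows.append(list(tokens))
            seg_rows.append([seg_id] * len(tokens))
    for row, segs in zip(rows, seg_rows):
        pad = max_len - len(row)
        row.extend([pad_token] * pad)
        segs.extend([0] * pad)
    return rows, seg_rows

def attention_mask(segs):
    """mask[i][j] is True when positions i and j belong to the same input."""
    return [[a == b and a != 0 for b in segs] for a in segs]

# Four inputs with 3, 2, 4, and 1 patches, packed into rows of 6 tokens.
inputs = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
rows, seg_rows = pack_sequences(inputs, max_len=6)
mask = attention_mask(seg_rows[0])
```

The mask keeps tokens from different inputs from attending to one another, so packing changes how the batch is laid out without mixing the content of separate images or videos.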
DiT and U-ViT were among the first to use vision Transformers for latent diffusion models. Like ViT, DiT employs multi-head self-attention layers and a pointwise feed-forward network, interleaved with layer normalization and scaling layers.
In addition, DiT incorporates conditioning through adaptive layer normalization (AdaLN), with an additional MLP layer used for zero initialization, so that each residual block starts out as an identity function, which greatly stabilizes training.
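The identity-at-initialization property can be checked with a small NumPy sketch. It is a simplified illustration, not DiT's actual code: the class name, the single linear layer used as the conditioning MLP, and the `tanh` branch standing in for attention and feed-forward layers are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class AdaLNZeroBlock:
    """Residual block whose shift, scale, and gate are regressed from a condition."""

    def __init__(self, dim, cond_dim, rng):
        # Conditioning layer, zero-initialized so that shift = scale = gate = 0.
        self.w_cond = np.zeros((cond_dim, 3 * dim))
        self.b_cond = np.zeros(3 * dim)
        # Stand-in for the attention / feed-forward branch of a real block.
        self.w_branch = rng.standard_normal((dim, dim)) * 0.02

    def __call__(self, x, cond):
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        params = cond @ self.w_cond + self.b_cond
        shift, scale, gate = np.split(params, 3, axis=-1)
        h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
        h = np.tanh(h @ self.w_branch)
        return x + gate[:, None, :] * h  # a zero gate leaves x unchanged

rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, cond_dim=4, rng=rng)
x = rng.standard_normal((2, 5, 8))
cond = rng.standard_normal((2, 4))

out_at_init = block(x, cond)  # identical to x, because the gate is zero

# Pretend training has moved the conditioning weights away from zero.
block.w_cond = rng.standard_normal((4, 24)) * 0.1
out_after = block(x, cond)    # the condition now modulates the branch
```

Because every block initially passes its input through untouched, a deep stack behaves like the identity at the start of training, which is the stabilizing effect described above.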
U-ViT treats all inputs, including time, condition, and noisy image patches, as tokens, and proposes long skip connections between shallow and deep Transformer layers. The results show that U-ViT achieves record-breaking FID scores in image and text-to-image generation.
Similar to the masked autoencoder (MAE) approach, the Masked Diffusion Transformer (MDT) adds masked latent modeling to the diffusion process, which effectively improves the learning of contextual relationships among the parts of objects in an image.
As shown in Figure 12, MDT uses a side interpolator for an additional masked-token reconstruction task during training, to improve training efficiency and learn strong context-aware positional embeddings for inference. MDT achieves better performance and faster learning than DiT.
In another innovative work, Diffusion Vision Transformers (DiffiT) employs a time-dependent self-attention (TMSA) module to model dynamic denoising behavior over the sampling time steps.
In addition, DiffiT uses two hybrid hierarchical architectures for efficient denoising in pixel space and latent space, respectively, and achieves new state-of-the-art results on various generation tasks.
Due to the spatiotemporal nature of video, the main challenges in applying DiT to this domain are:
1) how to compress video into latent space, both spatially and temporally, for efficient denoising;
2) how to convert the compressed latent space into patches and feed them into the Transformer;
3) how to handle long-range temporal and spatial dependencies and ensure content consistency.
Imagen Video is a text-to-video generation system developed by Google Research that uses a cascade of diffusion models (7 sub-models that perform text-conditional video generation, spatial super-resolution, and temporal super-resolution) to turn text prompts into high-definition video.
As shown in Figure 13, the frozen T5 text encoder first generates a contextual embedding from the input text prompt. The embedding is then injected into the base model to generate low-resolution video, which the cascaded diffusion models then refine to increase the resolution.
Blattmann et al. proposed an innovative method to convert a 2D latent diffusion model (LDM) into a video latent diffusion model (video LDM).
Model instruction tuning aims to enhance the ability of AI models to accurately follow prompts.
To improve the text-to-video model's ability to follow text instructions, Sora uses an approach similar to that of DALL-E 3.
The method involves training a descriptive captioning model and using the data it generates for further fine-tuning.
Through this instruction tuning, Sora can accommodate a wide range of user requests, paying close attention to the details of the instruction so that the resulting video meets the user's needs.
Text prompts are essential for guiding text-to-video models such as Sora to generate videos that are both visually striking and precisely matched to the user's creative needs.
This requires the creation of detailed instructions to guide the model in order to bridge the gap between human creativity and the ability of AI to execute.
Sora's prompts cover a wide range of scenarios.
Recent work such as VoP, Make-A-Video, and Tune-A-Video has shown how prompt engineering can leverage a model's NLP capabilities to decode complex instructions and render them as coherent, vivid, and high-quality video narratives.
As shown in Figure 15, a classic Sora demonstration shows a stylish woman walking down a neon-lit Tokyo street.
The prompt specifies the character's actions, the setting, the character's appearance, and even the desired mood and atmosphere of the scene.
It is this kind of well-crafted text prompt that ensures the video generated by Sora matches the expected visuals closely.
The quality of prompt engineering depends on the careful choice of words, the specificity of the details provided, and the understanding of their impact on the model's output.
Image prompts provide a visual anchor for the content of the generated video and for other elements such as characters, settings, and mood.
In addition, text prompts can instruct the model to animate these elements, for example, by adding layers of action, interaction, and narrative progression that bring static images to life.
By using image prompts, Sora can draw on both visual and textual information to turn static images into dynamic, narrative-driven videos.
Figure 16 shows AI-generated videos of a Shiba Inu wearing a beret and turtleneck, a unique family of monsters, a cloud forming the word "SORA", and a surfer riding a giant wave inside a historic hall.
These examples demonstrate what can be achieved by prompting Sora with images generated by DALL-E.
Video prompts can also be used for video generation.
Recent studies such as Fast-Vid2Vid have shown that good video prompts need to be both specific and flexible.
This ensures that the model receives clear guidance on specific goals, such as the depiction of particular objects and visual themes, while still allowing imaginative variation in the final output.
For example, in a video extension task, a prompt can specify the direction of the extension (forward or backward in time) and the context or theme.
In Figure 17(a), the prompt instructs Sora to extend a video backward in time to explore the events leading up to the original starting point.
Figure 17(b) shows that when performing video-to-video editing with prompts, the model needs a clear understanding of the desired transformation, such as changing the style, setting, or atmosphere of the video, or altering subtle aspects such as lighting or mood.
In Figure 17(c), the prompt instructs Sora to connect videos while ensuring smooth transitions between objects in the different scenes.
Sora's impact on various industries
Finally, the research team also looked at the impact that Sora could have in film, education, gaming, healthcare, and robotics.
As the video diffusion models represented by Sora have become a cutting-edge technology, their application across research fields and industries is rapidly accelerating.
The impact of this technology goes far beyond video creation alone, offering transformative potential for tasks ranging from automated content generation to complex decision-making processes.
The advent of video generation technology heralds a new era in filmmaking, where the dream of making your own movies from simple text is becoming a reality.
Researchers have ventured into the field of film generation, extending video generation models to film creation.
For example, MovieFactory uses diffusion models to generate film-style videos from scripts written by ChatGPT, covering the entire workflow.
MobileVidFactory automatically generates vertical mobile videos from nothing more than simple text provided by the user.
And Sora's ability to let users effortlessly generate captivating movie clips marks the moment when anyone can make movies.
This will dramatically lower the barrier to entry for the film industry and introduce a new dimension to filmmaking, blending traditional storytelling with AI-driven creativity.
The impact of this AI is not just to make filmmaking simple, but it has the potential to reshape the filmmaking landscape to become more accessible and versatile in the face of changing audience preferences and distribution channels.
People say that 2024 is the first year of robots.
It is precisely because of the explosion of large models, coupled with the iterative upgrading of video models, that robots have entered a new era, one in which they can generate and interpret complex video sequences with enhanced perception and decision-making capabilities.
In particular, diffusion models unlock new capabilities for robots, allowing them to interact with their environment and perform tasks with unprecedented complexity and precision.
The introduction of web-scale diffusion models into robotics demonstrates the potential of large-scale models to enhance robot vision and understanding.
For example, a robot equipped with DALL-E can arrange dinner plates accurately.
Another new technique is the latent diffusion model.
It can be guided by language, so that a robot can understand and perform tasks by predicting the outcome of its actions in video form.
In addition, robotics research's reliance on simulated environments can be addressed by video diffusion models, which can create highly realistic video sequences.
In this way, diverse training scenarios can be generated for robots, overcoming the limitations caused by the scarcity of real-world data.
Researchers believe that integrating technologies such as Sora into robotics promises to lead to breakthroughs.
Harnessing the power of Sora, robotics will advance like never before, with robots seamlessly navigating and interacting with their surroundings.
In addition, AI video models will also bring profound changes to industries such as gaming, education, and healthcare.
Finally, the good news is that although Sora is not yet publicly available, you can apply to join the red-team testing.
As the application form shows, OpenAI is looking for experts in fields such as cognitive science, chemistry, biology, physics, computer science, and economics.
Anyone who qualifies can apply!