February 15, 2024 is undoubtedly a day the whole world will remember: OpenAI released one-minute videos generated by its large AI model Sora, stunning audiences everywhere, breaking people out in a cold sweat, and letting the virtual shine into reality. So what is Sora? Why does it hold such fascination that even Elon Musk of SpaceX could not help replying: humanity is finished!!
In what follows, let the author take you through Sora's most striking results, unveil its mystery, and explain what makes it so impressive.
Sora is an artificial intelligence model that generates video from text descriptions, developed by OpenAI, a cutting-edge American AI research company.
Sora first appeared in the public eye on February 15, 2024, announced in the form of videos: the release showed the public multiple HD videos generated by the Sora model.
Let's start with a short clip; the full version is at the end of the article:
As the clip shows, the highly detailed scenes, the complex shooting angles, the intricate camera movements, and the rich emotions of the characters in the frame made this video an instant hit. It displays both the great charm of cutting-edge technology and the challenge it poses to everyone it touches.
Let's start with the name: "sora" comes from the Japanese word for "sky", chosen to signal its unlimited creative potential. OpenAI then gave the word a new soul: a model that turns text into video.
The Sora model is currently developed on the basis of DALL·E and can create videos of up to 60 seconds, or even longer (typically passing through concept, development, refinement, and mature stages of generation).
The Sora model is based on DALL·E, an artificial intelligence model developed by OpenAI, which is one reason it can generate highly creative images from text input. DALL·E in turn builds on the GAN (generative adversarial network) and the Transformer; the significance of these two is that they make it possible to create novel images never seen before, going from "zero" to "one", from "nothing" to "something".
Overall, judging from the official technical report, the biggest difference between Sora and previous models is its integration and coordinated use of many models, such as the Diffusion Transformer, the Transformer, Stable Diffusion, and DALL·E 2. Sora's videos are thus born from continuous iteration on the previous generation of models.
The advantages this brings are self-evident, and the resulting capabilities are astonishing. Note that Sora is not the first video generation model: as early as 2023 there were models such as Runway, Stability AI, and Pika Labs, but they could only generate clips of roughly 3 to 19 seconds, far inferior to Sora in both speed and duration. In addition, Sora can generate video content adapted to different devices: it supports a wide range of aspect ratios and resolutions, including widescreen 1920x1080, portrait 1080x1920, and everything in between.
However, the Sora model is not yet perfect. Because it is based on the diffusion model (introduced later), it has certain drawbacks: when handling patch noise and the corresponding external input (the only interaction mode at present is text), flaws in visual realism and lapses in physical logic still appear. For example, a chair may ignore gravity and move along with a character, or when one object occludes another, the occluded object may vanish from the next frame.
Normally, we all know that video is made frame by frame, but coherence between frames, and the single continuous shot presented in the end, is another matter. From ordinary physical filming we know that the "one-take" shot is often used by movies as a publicity gimmick, precisely because a "one-take" has always been recognized as a difficult feat.
To achieve a "one-take", the model needs a clear understanding of the real world, physical laws, cause and effect, how to express the attributes of each element, and the interactions between elements, so that when the camera switches and rotates, each object retains its previous attributes and its actions remain coherent from one moment to the next.
None of the above is difficult for us humans, because we can feel the scene and predict the next frame from the current one. But for a model, understanding the association between two frames is not easy and requires a great deal of training: the relationship between frames is not a single chain but "chain upon chain", growing geometrically. Here are some examples:
1. For example, interactions with different cola cans should leave different traces.
2. For example, a vehicle moving too fast should show damage and flying debris.
Since Sora is a new type of AI model, it is worth looking at the technical level to see what is really going on.
First of all, from the official introduction we know that the Sora model is born from the combined application of many models, so let me introduce some of their principles.
1. Diffusion model
The diffusion model draws its inspiration from non-equilibrium thermodynamics, like the diffusion seen when a drop of ink falls into water. Reversing that whole process of dissolution gives us the generative direction of the diffusion model, refined through thousands of rounds of training. We also need to account for its defining trait, randomness: across countless random steps, the technique removes unwanted noise from a noisy image little by little until the desired content emerges.
The training of the entire diffusion model divides into two familiar processes: forward propagation and backpropagation.
a) Forward propagation: forward propagation is the step in neural network training that passes input data from the input layer through each hidden layer to the output layer to obtain the model's output. During forward propagation, the input is computed layer by layer through each layer's weights and activation functions and passed on to the next layer until it reaches the output layer. This can be seen as information flowing from input to output. Example: random noise is added to a real image until it becomes pure noise.
b) Backpropagation: backpropagation is the critical step in neural network training. It computes the loss function by comparing the model's output with the actual labels, then propagates this error backwards from the output layer through each hidden layer to update the model's weights and biases. Backpropagation uses gradient descent to compute the gradient of the parameters in each layer and adjusts them along the gradient direction to minimize the loss. This can be seen as error flowing, and parameters updating, from output back to input. Example: noise is removed from the pure-noise image obtained above until the picture is clearly visible.
Through the interaction and information flow between nodes, and the alternating iteration of forward propagation and backpropagation, the neural network gradually adjusts its parameters so that the model's output gets closer to the actual labels, improving the model's accuracy and completing the propagation of information through the network. In plain language: it learns to generate "a cat lying in the scene" faster and more effectively.
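The "add noise, then remove it step by step" idea above can be sketched in a few lines. This is an illustrative toy, not Sora's or OpenAI's actual code: the toy denoiser below simply nudges the sample toward a known clean image, whereas a real diffusion model learns to predict the noise from data.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_STEPS = 100

def forward_noise(x, t):
    """Forward process: blend the clean image toward pure Gaussian noise.
    At t close to NUM_STEPS, almost no signal survives (ink fully dissolved)."""
    alpha = 1.0 - t / NUM_STEPS          # fraction of signal surviving at step t
    noise = rng.normal(size=x.shape)
    return np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * noise

def toy_denoise_step(x_noisy, x_clean):
    """Stand-in for a trained denoiser: nudge the sample slightly toward the
    clean target. A real model predicts the noise without seeing x_clean;
    this cheat only exists to make the iterative removal visible."""
    return x_noisy + 0.1 * (x_clean - x_noisy)

x_clean = np.ones((8, 8))                  # pretend this is a real picture
x = forward_noise(x_clean, NUM_STEPS - 1)  # forward: almost pure noise
for _ in range(NUM_STEPS):                 # reverse: peel noise off step by step
    x = toy_denoise_step(x, x_clean)
mean_err = float(np.abs(x - x_clean).mean())  # tiny: the image is recovered
```

Each reverse step shrinks the remaining error by a constant factor, which is why even this crude rule converges; real diffusion models replace the cheat with a learned noise predictor.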
2. Transformer model
The Transformer model is the cornerstone of today's mainstream models: ChatGPT, and to some degree Baidu's Wenxin Yiyan, are in the final analysis still Transformers. Its core is the pipeline from encoder to decoder.
To put it simply, take "I am Chinese" as the text input. First, the Transformer splits the sentence into basic units, "I", "am", "China", "country", "person" (character by character in the original Chinese), which we call "tokens". The encoder then turns each token into an abstract vector, expressed as a unit patch; this encoded information fully records the lexical content, grammatical features, and word order of the text.
The decoder uses the encoder's "abstract vector" output to generate the target sequence according to the task. Besides the abstract vectors, it also takes the preceding text, i.e. what it has generated so far, as input, guaranteeing coherence between its own output and the external input. In layman's terms, it understands the preceding question: "If you are from Chongqing, then where are you from?"
This is the key use of the encoder and decoder. Sora, to a certain extent, tracks key objects through their movement in both time and space, reconciling the two: it accounts for each object's temporal position in the generated video as well as its spatial position. There are of course many other models involved; interested readers can look them up. Next, let's see how Sora builds on these models to truly turn the mundane into magic.
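The tokenise-then-encode step described above can be illustrated with a minimal sketch. The vocabulary, embedding table, and single attention head below are made-up stand-ins for what a real Transformer learns; the point is only to show tokens becoming abstract, context-mixed vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table, standing in for a real tokenizer and
# the learned embedding matrix of an actual Transformer.
vocab = {"I": 0, "am": 1, "Chinese": 2}
d_model = 4
embed = rng.normal(size=(len(vocab), d_model))

def encode(sentence):
    """Split text into tokens, turn each into an abstract vector, and mix
    context between tokens with one head of self-attention."""
    tokens = sentence.split()                 # crude tokenizer: whitespace split
    ids = [vocab[t] for t in tokens]
    x = embed[ids]                            # (seq_len, d_model) token vectors
    # Self-attention: every token attends to every other token, which is how
    # word order and context get folded into each vector.
    scores = x @ x.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ x                        # contextualised abstract vectors

encoded = encode("I am Chinese")              # one abstract vector per token
```

A real encoder stacks many such attention layers (plus projections and feed-forward blocks) and the decoder runs the mirror-image process while also attending to its own previous outputs.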
In the author's view, the whole process can be divided into four parts.
1. **Text-to-video generation**: given a precise text description, Sora generates high-quality video that matches it; this is the basic mode of interaction between users and Sora.
2. **Data encoding and compression**: Sora compresses the input video or images into a low-dimensional spacetime representation through a video compression network. This step efficiently encodes and stores the original information and is the foundation for the steps that follow.
3. **Video generation**: on top of the compressed information, Sora decomposes it into basic building blocks via spacetime patches, then recombines and generates within the trained latent space, finally forming new video content. The technologies involved include computer vision, deep learning, and natural language processing.
4. **Compositing**: Sora relies on a powerful compositing capability that combines multiple different elements into one, creating new and striking results. The realistic Sora videos mentioned above owe this ability to its deep understanding and fine control of video content.
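Steps 2 and 3 hinge on cutting video into spacetime patches. Here is a minimal sketch of that chopping step with made-up patch sizes; Sora's real patching scheme and compression network are not public, so this only shows the general idea of turning a video tensor into a sequence of patch tokens.

```python
import numpy as np

def spacetime_patches(video, pt=2, ph=4, pw=4):
    """Cut a (T, H, W, C) video into spacetime patches -- the 'basic building
    blocks' described above -- and flatten each patch into one token vector.
    pt/ph/pw are illustrative patch sizes along time, height, and width."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)       # gather patch-grid indices first
    return v.reshape(-1, pt * ph * pw * C)     # (num_patches, patch_dim)

video = np.zeros((8, 16, 16, 3))               # 8 frames of 16x16 RGB
tokens = spacetime_patches(video)              # each row is one spacetime token
```

Because every patch spans both a time window and a spatial region, a Transformer operating on these tokens can attend across time and space at once, which matches the article's point that Sora tracks objects in both dimensions.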
Artificial intelligence has driven the industrial revolution often referred to as Industry 4.0. Riding the trend of the times, AI is widely sought after and has attracted enormous capital, manpower, and resources worldwide. It has developed rapidly and begun to affect every aspect of human life, to a degree no less than any previous revolution in human history.
In education, students can explore all kinds of material through Sora, a novel learning tool that enhances comprehension and contextual learning.
In business, the high-quality videos Sora generates can be used in marketing campaigns, enriching a company's brand image and achieving better marketing results.
In the military field, Sora's realistic videos could confuse opponents, passing the fake off as real.
In entertainment, Sora can create all kinds of spectacular fictional scenes, bringing users more fun and novelty, and so on.
While excellent progress may come in many areas, concerns remain about the accuracy and authenticity of the generated content, that is, the unknown fear of the virtual reflecting into reality, and about the impact on user privacy and data security.
It inevitably prompts a great deal of reflection: Sora will transform film and television advertising, animation, games, and other industries, and could easily wipe out stock-footage companies and advertising firms that live off copyrighted material. These worries are among the hottest discussions right now, but the author believes more changes are coming. Some will ask: does the advent of Sora mean filmmaking is becoming a sunset industry? Have movies and other entertainment production entered a new era? Will mass unemployment cause social problems? And so on.
Although the emergence of Sora has widened the gap between China and the world's leading AI industry, and in one stroke shown how early-stage China's AI remains, the author believes that unless OpenAI continues to open-source its work, China still has a long way to go to break this technical barrier. Yet every open-source release also means the other side holds even more cards, which is hard to imagine.
The stronger the AI, the more complex and large-scale the data training required. This demands not only advanced expertise and massive computing resources, but also carefully prepared training data and training methods. For example, red-blue adversarial training is among the most critical: the blue team plays the positive side and the red team the negative side (experts in misinformation, hateful content, and bias), and the two teams work closely to run "adversarial testing" on the model. China still lacks a large pool of such talent.
Moreover, the core of artificial intelligence is a new stage of technology products driven by data and the Internet. Beyond AI itself, robotics, the Internet of Things, big data, and cloud computing all play important roles. "Sora" or similar technologies developed by OpenAI may be just one part of this revolution, and their technical barriers may also include challenges in data processing, algorithm optimization, and hardware requirements.
The Sora model alone requires, at a minimum, deep understanding and application of natural language processing (NLP) and computer vision, covering the complex pipeline from understanding large-scale text data to generating realistic video, which demands advanced model design and strong computing power.
It is no bad thing when others shine; pressure breeds motivation. It is precisely the great engineering and technological progress of cutting-edge companies such as OpenAI that lifts an unprecedented veil of mystery from human development.
The author believes that with the country's increased investment in and attention to artificial intelligence, domestic technology companies will catch up. Perhaps the next "Sora" will be born in China; the author firmly believes that one day success will belong to the great stage of China and shine with its own light.
Hopefully, those "technology companies" still scheming to pour more money into games, pile into pre-made meals, and profit from people's food, clothing, housing, and transport can instead incubate companies like OpenAI.
Do not let the lead grow too large. The biggest illusion the West gives the East is making you think it is good for you; do not develop to the end only to hear, as the line in The Three-Body Problem goes: "I destroy you, but what does it have to do with you?"
Finally, here is the full version of the official videos generated by Sora (each from an English prompt):
A special gift: a poem the author has long pondered:
Untitled (I)
When first I heard the name, I feared I fell short; yet the pace of striving stands apart.
Like clear clouds soaring to the summit, it will surely shine upon the world.
If you found this good, please like, comment, and follow! Thank you!