Over the past week, one word has flooded everyone's social feeds: Sora.
In the early hours of February 16th, without any warning or prior announcement, OpenAI suddenly released its first text-to-video model, Sora. It reset multiple industry benchmarks at a stroke, extending video generation length roughly fifteenfold, and upended the global landscape of generative AI in this field.
The buzz around OpenAI's new hit Sora continues to build, and it has seized the center of conversation on every major platform as a dark horse: Musk lamented that humans have "gambled and lost"; Yann LeCun countered that "Sora is not a world model, and its parameter count may be only 3 billion"; Zhou Hongyi predicted that "the arrival of AGI will be compressed from 10 years to 1 year"...
Seemingly overnight, people are back in the anxious "ChatGPT moment" of a year ago. Setting the controversy aside, what exactly makes Sora so impressive? And for entrepreneurs and the industry, what kind of upheaval will it bring?
Dr. Ding Lei has more than 20 years of research and work experience in AI; he built the artificial intelligence platform of Silicon Valley company PayPal and is the author of the book Generative Artificial Intelligence. CITIC College invited Dr. Ding to help us think clearly about Sora from a veteran practitioner's perspective.
Source | CITIC College (ID: CITICBOOK)   Author | Ding Lei   Editor | Samadhi
From text to video: just how "scary" is Sora?
OpenAI released a 60-second video generated by Sora from a single piece of text. Bear in mind that not long ago, Google had released its latest video-generation model, VideoPoet, whose clips ran only 10 seconds.
Of course, Sora's breakthrough is not only in duration. Across this 60-second video, the fluency and stability, the handling of details such as light reflections and motion, and above all the apparent grasp of the physical world all reach a remarkably high level.
So how exactly does Sora generate such an astonishing video from a piece of text?
We know that a video is a series of frames joined one after another, so to understand "text-to-video" we should first understand "text-to-image".
Over the past two years, text-to-image tools such as Midjourney, Stable Diffusion, and DALL-E have emerged and steadily improved, and together with ChatGPT they have pushed public attention to generative AI to an unprecedented height. These tools can create images in strikingly varied styles with rich content from a single prompt, and their handling of detail is already quite good (see Figure 1).
Figure 1: Images produced by text-to-image generation tools.
Behind these tools lies a key technology called the diffusion model, which starts from a mosaic-like noisy image and, guided by the information in the prompt, restores it step by step into a complete, clear picture.
The full diffusion process consists of two stages, forward diffusion and reverse diffusion (Figure 2). In forward diffusion, an image is progressively blurred by adding Gaussian noise; in reverse diffusion, the model is trained to undo that noising. Learning both processes together yields the final diffusion model.
Figure 2: The forward and reverse stages of the diffusion process.
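To make the two stages concrete, here is a minimal sketch of the forward (noising) stage in Python. The linear beta schedule, step count, and closed-form sampling formula are standard textbook choices for diffusion models, not anything specific to Sora or DALL-E:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Sample a noised image x_t from a clean image x0 using the closed form
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta)."""
    rng = rng or np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# A tiny 4x4 "image" and a 1000-step linear noise schedule (illustrative values)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.ones((4, 4))
x_mid = forward_diffuse(x0, t=100, betas=betas)   # partially blurred
x_end = forward_diffuse(x0, t=999, betas=betas)   # nearly pure Gaussian noise
```

The reverse stage is the learned part: a neural network is trained to predict the noise added at each step, so that running the steps backwards turns pure noise into a clean image matching the prompt.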
Now that we have images, getting to a final video means finding a way to make those images move, and that is where the famous Transformer model comes in.
The Transformer is a powerful model for all kinds of sequence problems. One class is text generation: ChatGPT uses a Transformer to produce continuous text from the user's prompt. The other is video generation: since a video is essentially a run of consecutive frames, it can be treated as a sequence of image data, so video generation is a natural extension of the Transformer into the visual domain.
In practice, Sora decomposes videos into smaller data units called spacetime patches; each patch plays the role of a token in a text sequence model. This is one of Sora's most important concepts.
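The patch idea can be sketched in a few lines. The patch sizes and tensor layout below are illustrative choices of mine, not Sora's actual configuration, which OpenAI has not fully disclosed:

```python
import numpy as np

def to_spacetime_patches(video, pt=2, ph=4, pw=4):
    """Split a video tensor of shape (T, H, W, C) into a flat sequence of
    space-time patches, each flattened to a vector -- the analogue of
    tokenizing text for a Transformer."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)        # group the three patch axes
    return v.reshape(-1, pt * ph * pw * C)      # (num_patches, patch_dim)

video = np.zeros((8, 16, 16, 3))                # a hypothetical tiny clip
tokens = to_spacetime_patches(video)
print(tokens.shape)                             # (64, 96): 64 "tokens" of dim 96
```

Each such row is then embedded and given positional information before entering the Transformer, just as word tokens are in a language model.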
From the videos Sora generates, it seems to have acquired a common-sense understanding of the world: it can closely simulate how things behave in reality, from basic coherence of motion and the way fluids flow, to how light and shadow shift with moving subjects and the relative proportions of objects, all rendered as if filmed in real life. It is breathtaking.
For example, given the prompt "The camera follows a white vintage SUV with a black luggage rack on top as it accelerates up a steep dirt road surrounded by pine trees, its wheels kicking up dust," Sora generates a video of the car racing through the mountains (Figure 3). Doing so demands exactly what a "world model" must capture: the tire tracks formed where the tires meet the road, the dust thrown up as the car speeds along, and a whole series of light-and-shadow changes.
Figure 3: Frames from a video generated by Sora. Source: OpenAI official website.
Sora's arrival not only delivers a new experience in video generation; its capacity to generate content and to understand the world is also prompting a fresh understanding of, and fresh thinking about, generative AI.
Three dimensions of the shock set off by Sora
What is truly remarkable about Sora is that the model appears to understand how objects exist and move in the physical world: it can learn physical laws and simulate the real world with accuracy. As this capability deepens, the leap in AI that Sora drives will greatly shorten the distance between us and a more broadly capable intelligent future.
Igniting investment enthusiasm in the AI industry
Capital markets have always been highly sensitive, and Sora has ignited investors' enthusiasm across the entire AI-concept track, with more people now seeing the promise of generative AI. The technology giants remain at the forefront, with major players at home and abroad stepping up sustained investment in AI technology.
Subsequently, more companies across the internet, information, finance, retail, and other industries have announced active investment in large-model R&D and AI-related initiatives. More investors are also realizing that broader, deeper application of AI in specific industries will take further investment and patience, and that this matters profoundly for productivity gains and for the adjustment and development of the industrial structure.
A "huge earthquake" for the video industry
The first to feel the impact are, unsurprisingly, film and television, short video, advertising, interactive entertainment, and similar fields. Sora can rapidly generate high-quality video content, slashing the production cost of special effects and high-risk shots and raising the efficiency of content production. With Sora, advertising agencies can quickly produce ads tailored to market demand, shortening the cycle from idea to finished product.
But this will be a double-edged sword. As the cost of and barriers to video production fall sharply, competition in the industry will intensify, setting a higher bar for creators: they must keep innovating to preserve the appeal and market share of their work.
How far are we from being out of a job?
And it is not only video: generative AI has driven rapid advances in text, image, and audio generation and the fast evolution of application scenarios, with effects that will reach every industry. This has sharpened people's anxieties, with some exclaiming that "silicon-based life will eventually replace carbon-based life" and "AI's takeover of human society is accelerating." Others may simply give up, reasoning that if AI evolves fast enough to learn even the physical world, unemployment cannot be far off.
At present, the various generative AI models are still in development and have yet to see widespread application, so it is too early to say whether they can replace human jobs. But that does not diminish AI's influence: the changes it brings are working their way into every industry and every corner of our lives.
The rapid advance of AI will greatly improve production efficiency and ways of working, and redefine people's place in work. New professions and roles keep emerging, such as AI product managers, prompt engineers, AI creators, and AI fine-tuners, and demand for them will steadily grow. AI, in other words, will also reshape the structure of careers.
It is less that AI will replace practitioners than that AI will replace tedious and burdensome work; AI does not eliminate human beings, it eliminates backward productivity. We should see AI not as a competitor but as a partner to work with, to train, and to use. As Xunzi put it, "a gentleman is not different by nature; he is simply good at making use of things."
From Sora to world models: the future of generative AI is already here
Reactions to the shock of Sora have been mixed. On the one hand, we have witnessed another "miracle" of generative AI; on the other, we may find that large language models are still far from solving practical problems, and "taming" them will take time.
Some studies claim that as more people use them, large models seem to grow dumber and even "hallucinate." The main reason is that today's mainstream generative models still lack an understanding of the physical world, so questions that any ordinary person can answer may not get a correct response from a large model.
Sora's emergence has made this problem clearer and also points the way forward for generative AI: let large models understand and learn the physical world, and build a connection between large models and physical reality. That will inevitably bring new AI applications and breakthroughs, and some believe Sora means the timeline to artificial general intelligence has been greatly shortened.
The way the human brain perceives things resembles a model. Epistemologically, in the course of cognition the brain gradually forms a "model of the world." Our subjective knowledge does not necessarily match the laws of reality at first, but through continual practice we compare the model's expected outcomes with actual results and revise the model to shrink the gap between the two. This adjustment mechanism brings the brain's model of the world ever closer to the truth.
Sport offers a vivid illustration of how humans perceive and learn the physical world. Take table tennis. At first, a player masters the simplest push and attack strokes, which handle routine incoming balls well, and returns go roughly where expected. As the speed and spin of incoming balls change, the old receiving technique no longer suffices: returns sometimes hit the net and sometimes fly long. The player gradually learns to cope with different balls by adjusting the force and angle of the racket. As incoming balls grow more varied, the "world model" the brain builds grows ever more sophisticated, until any situation on the table can be handled with ease. That is the process by which a human "world model" is formed and refined.
The "world model" is also an important concept in psychology and in engineering. For example, the well-known AI scientist Yann LeCun, discussing machine intelligence, stressed the importance of the world model: the world-model module is the most complex part of the architecture, and its roles include estimating missing information about the state of the world and predicting future world states (Figure 4).
Figure 4: System architecture for autonomous intelligence (simplified from the original diagram). Source: Yann LeCun, "A Path Towards Autonomous Machine Intelligence."
A world model can be seen as a "simulator" of the relevant aspects of the world. By modeling the real physical world, it lets a machine form a comprehensive, accurate understanding of the world, as humans do, and predict both the world's natural evolution and the future states that specific actions would produce.
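As a toy illustration of this "simulator" role (my own sketch, not LeCun's architecture): a world model can be viewed as a function that maps the current state of the world to a predicted next state. Here the "world" is just a single ball falling under gravity:

```python
from dataclasses import dataclass

GRAVITY = -9.8  # m/s^2; a one-dimensional toy physics

@dataclass
class BallState:
    height: float    # metres above the ground
    velocity: float  # m/s, upward positive

def world_model(state: BallState, dt: float = 0.1) -> BallState:
    """Predict the next state -- the 'simulator' job of a world model:
    given the present, forecast how the world evolves one step ahead."""
    new_v = state.velocity + GRAVITY * dt
    new_h = max(0.0, state.height + state.velocity * dt)
    return BallState(height=new_h, velocity=new_v)

# Roll the model forward: a dropped ball speeds up as it falls
state = BallState(height=10.0, velocity=0.0)
for _ in range(5):
    state = world_model(state)
```

A real world model is of course learned rather than hand-written, but the interface is the same: state in, predicted future state out, which is what lets an agent evaluate actions before taking them.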
Returning to Sora: the shock it delivers is that it seems to keep building a "knowledge system" of physical scenes through learning, and by integrating that knowledge it generates high-quality content that gives humans a convincingly real visual experience. Of course, judged against the standard of a true "world model," Sora's current output still falls well short.
On the one hand, Sora still struggles with complex scenes and physical effects. When a scene involves interactions among multiple objects or complex physical motion, Sora can make mistakes or produce deviations.
On the other hand, Sora relies mainly on large volumes of training data to learn the patterns of video generation, an approach that works but that limits, to some extent, its ability to generalize to new scenarios.
Once AI establishes a connection with the physical world and learns a "world model," its reasoning and higher-order capabilities will break through, with promise across many application scenarios and professional fields. Such an AI could perform complex tasks and operations, even fully mimic intelligent human behavior, and ultimately reach artificial general intelligence.
Leading the technological revolution
Why is it the United States again this time?
I built PayPal's data-science platform serving global users, and I have more than 20 years of research and work experience in AI. Having worked in Silicon Valley for many years, I understand very well why OpenAI and Sam Altman could only have been born there: it is Silicon Valley's "engineer culture" that made them.
OpenAI grew out of the Silicon Valley cultural soil that prizes the status of engineers, and it carries a strong "engineer-culture gene": put simply, engineers lead R&D, enjoy great autonomy, and have ample room to create.
At the same time, OpenAI is resolutely product-driven and free of intellectual aloofness. Whether the Transformer or techniques such as instruction tuning, it does not shun an algorithm because someone else invented it; it takes what works and pours the effort into its own large models. For a company, the greatest value is always created in the products users actually use.
So, why is it difficult for big companies such as Google to surpass OpenAI in R&D in the field of artificial intelligence?
A key factor is that these large companies still develop new AI technology the way they develop conventional software: tasks are split into fine-grained pieces, with staff across many departments each responsible for a sliver of the work, a "chicken-farming model."
But R&D on emerging AI centered on large-model training is, at its core, hard to decompose. It demands core leaders with end-to-end vision and management ability spanning technology, product, and business. This is more like a "child-raising model": parents must take the whole view and teach and train the child personally. A child's education does not need that many teachers, and the core figures are few. According to the Sora technical report released by OpenAI, the Sora team numbered only 13 people.
It is worth noting that OpenAI's CEO Sam Altman is exceptionally capable in his own right, fluent in both technology and business operations. Even after the world-famous boardroom "palace drama" at the end of last year, he quickly returned to his post, a testament to his influence. It is with such a leader steering the company's overall mode of operation, and avoiding excessive constraint by shareholders, that OpenAI stays at the forefront of AI innovation.
As for the development of the AI industry, China's talent is in no way inferior to America's. To seize a favorable position quickly in the AI race, we would do well to accelerate our positioning, fully respect the intrinsic laws of model training, and meet the new round of challenges with objective, comprehensive AI thinking.
In my new book, Generative Artificial Intelligence: The Logic and Applications of AIGC, I examine in detail the future trends of AIGC and their implications for individuals.
[Recommended reading] Ding Lei's Generative Artificial Intelligence maps out the future of artificial intelligence: one book to understand the logic and applications of AIGC.
This article is original content; when reprinting, please credit the source: CITIC College.