Author: Hu Yanping.
Edit: So sleepy Peaches.
In the past few days, I have repeatedly read SORA's technical report and the technical analysis of SORA by all parties.
There are basically three perspectives: marveling at the powerful features, analyzing how SORA is implemented, and evaluating its huge impact.
In terms of impact, the main focus is on film and television, short video, entertainment, and other fields.
But SORA changes AI's mode of cognition and embarks on an epic journey toward the World Simulator, and that is the real eye of the future storm. The world simulator is an intelligent future far more explosive than AGI, embodied intelligence, or the metaverse.
One of the most valuable, vaguest, and most contested sentences in the SORA technical report says that by scaling up generative models we can expect to build general-purpose simulators of the physical world, and that this is undoubtedly a promising path of development.
The world simulator described in this article may not be the same as Sora's current self-report and industry understanding.
Evidently, either SORA is exaggerating, or OpenAI is holding something back, or the gap is due to technical limitations at this stage.
1.SORA is just a compression, diffusion, and spatiotemporal representation of two-dimensional vision, not a physics engine, nor a world model
SORA is not what Jim Fan, a senior scientist at NVIDIA, calls a data-driven physics engine, a learnable simulator, or a world model, nor will it bring AGI about within a year or two, as Zhou Hongyi suggests.
The core change in principle is the move from the text-token vector representation of LLMs to a patch representation that points toward simulators but is not yet one.
The technical report is highly reserved and extremely brief in its principles section, but one of its diagrams is especially important. Sora is built on the Transformer, but it is a powerful evolution that combines the Transformer with Diffusion, and patches are the key.
However, no matter how you look at it, SORA is still only a two-dimensional visual representation of time and space, compressed throughout. Patches still carry information about image content and its relationships, and they bear a textual imprint; they are not a multi-dimensional representation of the laws of the physical world. It may be more accurate to put a qualifier in front of "world simulator": two-dimensional visual world simulator.
Three-dimensional images are the spatial construction of the digital world, while two-dimensional vision is actually a combination of motion changes of pixels. Three-dimensional and two-dimensional ** can look like physics, but the essence can only be that the motion changes fit the physical laws, rather than the digital construction of physical rules and internal properties like particle rendering and industry**.
The reason is that we know the outputs of large models and the principles by which they compute, yet even Geoffrey Hinton, the father of deep learning, and Ilya Sutskever, the former chief scientist of OpenAI, do not actually know what GPT is thinking.
After large-scale training, the pixels, positions, and spatio-temporal information carried by patches, together with their changes, motions, and relationships with surrounding patches, pass through the attention mechanism of the Transformer and the forward and reverse noise processes of diffusion, and the model gains the ability to deconstruct and reconstruct any two-dimensional visual content. To users this appears as emergent generation that seems full of creativity and consistent with the laws of physics. Behind it, however, Sora understands the changes in patches' pixels, motion, and position, and their relationships in time and space, only in the sense of mathematical and algorithmic representations, and these changes and representations happen to fit some of the regularities of the physical world. (Patches are not pixels.)
Understanding is the algorithm, thinking is the model.
It's awkward, it's abstract, it's tiring, but maybe that's what it is.
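To make the idea of a patch as a unit of visual information more concrete, here is a minimal sketch in Python. It is not Sora's implementation: the function name `patchify`, the patch sizes, and the use of raw pixel values (rather than a compressed latent) are all assumptions chosen for illustration. It only shows how a small video can be cut into spacetime patches, each tagged with its (t, y, x) position in the patch grid.

```python
def patchify(video, pt, ph, pw):
    """Cut a video into spacetime patches.

    video[t][y][x] is a single pixel value (hypothetical toy format).
    Returns a list of ((t_idx, y_idx, x_idx), values) pairs, where the
    index is the patch's position in the patch grid and values is the
    flattened content of that patch.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                values = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(((t0 // pt, y0 // ph, x0 // pw), values))
    return patches


# Toy 4-frame, 4x4 "video" whose pixel value encodes its own coordinates.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)] for t in range(4)]
patches = patchify(video, pt=2, ph=2, pw=2)
print(len(patches))  # 8
```

Each patch mixes several frames and a small spatial window, which is why a patch is neither a pixel nor a frame: it is a small block of space and time that the model treats as one token.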
For example, SORA's engineers may have fed the model tens or even hundreds of millions of video clips to learn from, but they may not have written a single line of rules about physical properties.
For example, Sora may have learned some of the assets generated by the 3D engine, and introduced a 3D engine to correct the physical performance of the model generation in the visual sense, just as it was when the model was improved through Dota 2 game battles, but it is 100% certain that Sora does not currently have a built-in 3D engine.
Sora makes users think that it understands the physical world and physical laws, just like when users think they see the physical world when they wear Apple Vision Pro, but in fact, they only see various variations of monocular 3648x3144 pixels carrying ever-changing RGB color information on the screen.
Even the image is not continuous, but constantly refreshed at a frame rate of 90-96 times per second, fitting the principle of human vision, allowing users to have the illusion that it is continuous. Once you shake your head quickly, the picture will produce motion blur. Heavy gamers can even experience screen tearing.
Video that conforms to the laws of physics does not mean that its generation is based on the laws of physics, nor that the large model generating it is a data-driven physics engine. The so-called physics can only be pixel-level regularities of change and representational relationships between the whole and the part, and between the preceding and following frames of the picture.
2.Even so, SORA is still the epic milestone that opened the door to a new horizon of AI, and the cognition of large models has been restarted
Among the various speculations about SORA's principles, the analysis by Chinese AI scholar Xie Saining comes closest. However, breaking down the framework of technical principles and emphasizing flexibility and scalability does not reveal the essence of SORA's mutation: the cognitive restart of large models.
In addition, Xie Saining's intuition that SORA currently has only about 3 billion parameters is too conservative.
SORA is thought to have adopted a hybrid diffusion model, DiT, with a Transformer backbone, where DiT = VAE encoder + ViT + DDPM + VAE decoder.
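Of the components named above, the DDPM part is the easiest to illustrate in isolation. The sketch below shows only the standard forward (noising) step of a DDPM, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise, applied to a short list of numbers standing in for a latent patch. The linear beta schedule and its endpoint values are assumptions for illustration; nothing here is taken from Sora, whose training details are not public.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are illustrative assumptions; num_steps must be >= 2.
    """
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward diffusion: blend the clean signal x0 with Gaussian noise at step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


alpha_bars = alpha_bar_schedule(1000)
rng = random.Random(0)
latent = [1.0, -1.0, 0.5]                                 # stand-in for one latent patch
slightly_noisy, _ = add_noise(latent, 0, alpha_bars, rng)   # still close to the original
almost_noise, _ = add_noise(latent, 999, alpha_bars, rng)   # dominated by noise
```

A denoising model is then trained to predict the added noise from the noisy input, and generation runs that process in reverse, starting from pure noise.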
In addition, SORA may use a technique similar to Google's Patch n' Pack (NaViT) to accommodate different resolutions, durations, and aspect ratios.
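The packing idea can be sketched as follows. Inputs of different resolutions or durations yield patch sequences of different lengths, and several such sequences can be packed into one fixed-length sequence, with a per-token example id so that attention can later be masked to stay within each example. This is a generic greedy first-fit sketch under my own assumptions (the function names and the `-1` padding marker are hypothetical); it does not reproduce NaViT's actual procedure, and whether Sora uses anything like it is speculation.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit packing of variable-length patch sequences.

    lengths[i] is the number of patches in example i.
    Returns a list of packs; each pack is a list of (example_id, length).
    """
    bins = []  # each bin tracks its remaining free slots and its items
    for example_id, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {example_id} has {n} patches, more than {max_len}")
        for b in bins:
            if b["free"] >= n:
                b["items"].append((example_id, n))
                b["free"] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            bins.append({"free": max_len - n, "items": [(example_id, n)]})
    return [b["items"] for b in bins]


def segment_ids(items, max_len):
    """Per-token example ids for one pack; -1 marks padding positions."""
    ids = []
    for example_id, n in items:
        ids.extend([example_id] * n)
    ids.extend([-1] * (max_len - len(ids)))
    return ids


packs = pack_sequences([6, 3, 5, 2, 4], max_len=8)
for items in packs:
    print(segment_ids(items, 8))
```

Two tokens may attend to each other only when their segment ids match and are not `-1`, so examples of different shapes can share a batch without being resized to a common resolution.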
Although text plays a detailed and prominent role in SORA, both in video captioning and in expanding prompts into detailed descriptions, SORA is essentially a large model that is built entirely on vision, is oriented toward vision, and understands the world through images.
This is very different from GPT's tokens, which are elements of text data (not neurons). Patches are fragments: basic units of visual information (again, not neurons). Text in SORA is only a translator and a channel for instructions between people and machines, and between machines and video.
The amount of information in images is actually much greater than that of text, and the real world presented in the visual is even more so. The large number of samples has allowed SORA to establish a basic dynamic relationship between macro and micro spatio-temporal changes in the visual world.
If SORA is connected to robots, smart cars, MR headsets, smartphones, and other devices around the world, with the help of the Digital Eye, large models will be able to:
See, learn, and understand the world with their own eyes, rather than rely solely on the limited text data that humans feed into the system. This opens the door to a new horizon of vast knowledge and information.
Smart devices backed by SORA+GPT, perceiving reality in real time, are a powerful boost for embodied intelligence: robots and other intelligent devices are expected to gain a human-like ability to see and judge reality, where to see is to learn and to judge is to make meaning. Although a big gap with human perception will remain at first, this is far beyond what traditional computer vision can do.
In addition, the input and output of a SORA-style large model can be fully expressed as text, so there is no need to worry about problems arising between the machine's visual cognition of the world and humans' natural-language and somatosensory interaction. It is a process of understanding in which machines know what it means to see.
The greatest significance of SORA's appearance is not that it can generate 60 seconds of multi-shot video with a consistent subject, but that it means large models can open their eyes and see the world. This is the first cognitive restart of AI, no less significant than a cognitive restart for humans, and that is not all.
3.Cognitive reboot leads to the World Simulator, which means The Force Awakens: the mother model of the big model, the future root technology
Sora is not yet a world simulator, but it shows such potential. It did not produce an ultimate answer, but it told the industry that the vaguely feasible direction is in **.
Although Sora is nowhere near a general-purpose world simulator, it proves that after token(1.0) and patch(2.0), an x(3.0) representation is feasible. From text semantics to vision to physics: these are the three leaps in the principle of large models, and also the path toward a truly universal world simulator (in practice, first built on professional simulators for multiple domains).
Sora is not yet a physics engine, but it can be generalized to a physics engine in the future.
Patches are only visual information elements in the sense of ((x,y,z), t), relational, color, and content information, not neurons, but they can evolve into digital neurons in the future. In any case, the transformer model cannot have the same quantum capabilities as the human brain, but the high-dimensional global attention mechanism has the potential to fit the quantum state.
Because the underlying logic of AI's representation of intelligence is mathematical, compared with the intuition, fuzziness, randomness, subconscious and other characteristics of the human brain, the mechanics of the large model are in trance together. However, just as the judgment accuracy rate has gone all the way from above %, the principle has been continuously upgraded and the time and space have been continuously transformed, and the evolution towards AGI is manifested as a gradual process of approaching high availability and approaching or even surpassing human intelligence.
But AGI is not the end, nor is it the Holy Grail, the World Simulator is.
SORA helps to achieve AGI, but the destination of SORA's long journey is not AGI but the World Simulator. There are many definitions of AGI. In the classical sense, AGI is what GPT-like models become once data, computing power, and algorithms have evolved far enough: an overall ability that locally surpasses human intelligence in knowledge, content, programming, and other kinds of work and creation.
AGI is still a tool that supports embodied intelligence, but it is not embodied intelligence. AGI is not really endogenous and autonomous, and more often than not, it is just a tool for human use.
At this point a clarification is necessary: only by distinguishing the different forms and stages of intelligent development can we see where GPT4, SORA, AGI, and the World Simulator sit in the ecosystem and in time.
When we talk about intelligence, there are actually three kinds in play at the same time: functional intelligence in the sense of "smart"; the computational and perceptual intelligence of past AI, that is, weak intelligence (AI1.0); and the strong intelligence (AI2.0) that has emerged since 2020 (with 2023 regarded as its official beginning).
At present, the intelligence of autonomous driving, robots, and the like sits, strictly speaking, at AI1.0, in the category of weak intelligence. The secondary empowerment of smart cars, robots, and other intelligent devices by strong intelligence (AI2.0) is a coming trend.
This is also why, although domestic artificial intelligence is developing in full swing, there is in essence a generation gap. Some people cannot tell strong from weak, shout that "we are not bad," and think this wave of GPT is a manufactured threat theory. In fact, there is no need to be stubborn or to fight for face in everything. Just seek truth from facts, see the landscape clearly, grasp the key points, and catch up.
How to look at SORA GPT, there is another essence: have you seen the strategic highland, technology leader, smart holy grail, change engine, and eye of the storm in the **. Strong AI is the strategic highland, AI for Science is the leader of science and technology, AGI is the holy grail of intelligence, general and professional models in various fields are the engine of change, and the world simulator is the eye of the storm in the future.
The three types of intelligence mentioned above are distinctions of form only, not of stages in the development of intelligence. I loosely divide the development of intelligence into five stages: computational functional intelligence, computational perceptual intelligence, cognitive intelligence, endogenous intelligence (EI), and autonomous intelligence (II).
Note that one day the term AI will be marginalized because intelligence is no longer artificial. Artificial AGI is naturally not the end, and intelligence will go further than we expect from AGI. I have analyzed this point in detail in "Towards the Second Curve", and I will not repeat it here.
The core of intelligent transformation is superintelligence, and the embodiment of superintelligence is AGI. AGI is the advanced form of cognitive intelligence within AI2.0 (though it is mainly artificially fed, artificially enhanced intelligence). It is the advanced form of AI at this stage, but it is not endogenous intelligence (EI) or autonomous intelligence (II). AGI will not be achieved in a year or two, as some people say; my estimate is that it will arrive around GPT6. What comes after belongs to endogenous intelligence (EI) and autonomous intelligence (II), and to the world simulator. The World Simulator is the cornerstone of EI and the benchmark of II.
Superintelligence is the brain of the world, and the parent body of superintelligence is the world simulator. The world simulator is the mother model of the big model and the root technology of the future technology.
Looking at the strong start of large models in the fields of industry, environment and climate, materials, protein analysis, molecular drugs, genetic research, etc., you will know that SORA and them are on the same path: the future of the world simulator is not mainly for play, not the speculation of the concept of the metaverse, but the flashpoint of scientific and technological productivity, and the real explosion point of the intelligent future.
The world simulator, the mother technology in science and technology, the core grasp of AI for science, the sympathy, understanding, reproduction, and CAE of the future world in each field are just one of its basic characteristics. The World Simulator is the closest to the existence of the intelligent matrix.
World Simulator means The Force Awakens, a source of innovation, technology-driven, and strategically highlanded, with no room for error.
4.What are the stages of the long journey to the world simulator?
Of all the videos released for Sora, the one with the deepest value is actually the clip of the glass of water being tipped over.
How does SORA fit reality, whether it is a physics engine or not, how can it become a physics engine, and how can it become a world simulator in the future. The answer looms from it.
In the early days of CV, all a computer could do was extract and reproduce the contour features of the cup's edges (e.g., the neocognitron). Then it could identify the object as a water cup (e.g., early ImageNet), and then it could understand the relationship between the water and the cup (CNN and RNN). Now it can begin to learn and reproduce the process of the cup being tipped over (Transformer, Sora). What happens next perhaps only large-model experts know; maybe they are still exploring, and the jury is still out.
I'm just doing black-box analysis from the user's point of view: can the superintelligence take these steps next?
Can the flow of the water from the tipped cup fully conform to physical behavior, without the current obvious defects? This corresponds to fluid mechanics and related fields.
Can the ice in the video gradually melt in the water after the cup is tipped over? This corresponds to thermodynamics and related fields.
After the cup is tipped over, can we see the changes in light, shadow, and color from water stains and water vapor as the tabletop and tablecloth get wet (which is why I am more interested in the canvas strokes)? This corresponds to optics and related physics.
Can the tipping of the cup produce sound that matches the real scene, rather than a simple sound effect? This corresponds to acoustics and related physics.
Can the angle and force applied to the cup be varied freely to produce different phenomena such as shattering, splashing, and evaporation? This combines the above with condensed matter physics.
If there is a power supply or hazardous chemicals near the tipped cup, can the outcome of the scene be predicted? This corresponds to electromagnetism, physical chemistry, and related fields.
All of the above is only a simple extension from the perspective of physics. The scientific fields and real-world complex phenomena that a world simulator must cover span dozens of major disciplines not yet enumerated here. So whether measured by process or by field, it is a long journey. But this is the sea of stars.
A couple of corresponding step-by-step questions are:
Can Sora be trained on 3D images rather than on 2D video generated by a 3D engine? Can SORA learn the intrinsic properties of three-dimensional objects across a unified micro-to-macro scale? Can SORA, at the level of model principle, neural network, and node, represent the physical world's 3D spatio-temporal motion in the x(3.0) sense? And on the basis of the world simulator's four elements (sympathy, understanding, reproduction, and **), does x evolve into a neuron? The evolution toward a world simulator involves far more than these questions, and far more than these dimensions...
In general, SORA partially fits the laws of vision but does not yet really understand the physical world. At present, SORA still essentially lives in the world of visual content and is more related to video, games, entertainment, and the like. However, this does not prevent SORA-style large models from entering major smart devices such as robots and smart cars, or from eventually becoming a world simulator.
AI for Science is a key landing scenario for the World Simulator, and x(3.0) in the AI for Science sense is the bifurcation point between the visual and physical worlds, just as patch(2.0) is the bifurcation point between the text world of token(1.0) and the visual world.
Data, learning, generation, and expectation are the four elements of AGI, and the sense of information content is stronger. Sympathy, understanding, reproduction, and ** are the four elements of the world simulator, and the mother perceives reality with a stronger sense of embodiment. The input and output of the world simulator are essentially mainly completed by the machine intelligence system autonomously, and it is an intelligence with self-reinforcing and autonomous behavior capabilities. The World Simulator has a long journey and will surely lead to EI and II.
5.What's next? 12 scenarios are estimated
Situation 1: The SORA model is not irreproducible.
If OpenAI does not officially launch SORA to global users in the short term, other competitors will also release their own similar products one after another.
Between OpenAI and Google or Meta there is only a time lag. However, small and medium-sized teams' competitive disadvantage in data, resources, and computing power can only be offset by moving up a level in principle. If PIKA and RUNWAY cannot get ahead at the level of principle, then even if they barely catch up with SORA, their future is worrying. In addition, similar principles do not guarantee similar results: a small difference at the start can end up a thousand miles apart.
Situation 2: Principle is the key to the leap in ability, but computing power remains indispensable and demand for it keeps rising sharply.
The computing power SORA consumes for a single prompt-to-output response must be much higher than GPT4.0's, but that is not the point. SORA proves once again that principle matters far more than computing power, and that what is computed (not how much computing power is available) is what confers the advantage.
When a principle overturns the landscape, it often does so in an instant, and this will happen many more times. Still, overall demand for computing power is exploding, because it is no longer only text tokens that need to be computed: visual patches will sharply increase the demand.
In the future, connecting and processing the various sensors that physics engines and world simulators require will make computing power even tighter. Even on a simple linear view, high-quality massive data is always better than small data, large parameter counts are always better than small ones, deep, multi-stage, iterative model reasoning is always better than single-stage reasoning, and high resolution and high precision are always significantly better than low, so demand for computing power is still growing exponentially. But overall, computing power is only sustenance.
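A rough back-of-the-envelope sketch shows why visual patches raise computing demand so sharply compared with text tokens. Every number below (clip size, patch size, prompt length) is a hypothetical assumption for illustration, not a figure from Sora or GPT. The only general fact relied on is that full self-attention compares every element with every other, so the number of pairwise comparisons grows with the square of the sequence length.

```python
def num_patches(frames, height, width, pt, ph, pw):
    """Number of spacetime patches for a clip, given an assumed patch size."""
    return (frames // pt) * (height // ph) * (width // pw)


def attention_pairs(sequence_length):
    """Pairwise comparisons made by one full self-attention layer."""
    return sequence_length * sequence_length


# Hypothetical text prompt: 100 tokens.
text_tokens = 100

# Hypothetical clip: 240 frames at 256x256, cut into 4x16x16 patches.
video_patches = num_patches(frames=240, height=256, width=256, pt=4, ph=16, pw=16)

print(video_patches)                                                # 15360
print(attention_pairs(video_patches) // attention_pairs(text_tokens))  # 23592
```

Under these assumptions the clip yields about 150 times more elements than the prompt, but more than 23,000 times as many attention comparisons per layer, which is the sense in which patches multiply the demand for computing power.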
Situation 3: Transformer-based large models are still the main evolution direction and have great potential.
The self-attention mechanism resembles a quantum state at the level of electronic computing (a resemblance in spirit only): it removes the distance limit between information elements and dissolves the receptive-field barrier of CNNs.
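The claim that self-attention removes the distance limit between information elements can be seen in a minimal sketch of scaled dot-product self-attention. This is a simplification under stated assumptions: there are no learned query, key, or value projections (the inputs serve as all three), no positional encoding, and a single head. It is meant only to show that the weight between two elements depends on their content, not on how far apart they sit in the sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        # Every query is compared with every key, near or far.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# The first and last elements are identical; the two in between are different.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

The first element attends to the last element, three positions away, exactly as strongly as it attends to itself, and more strongly than to its immediate neighbor. A convolution with a small kernel would not connect those two positions in a single layer.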
Situation 4: Light and heavy, large and small, single and mixed, are always two parallel logics.
In the long journey of the computer vision model to the large model, and then to the world simulator, the seemingly reasonable SORA is taking a lighter shortcut, and the sense of control, three-dimensionality, and front and back expansion are naturally not ideal.
3D modeling, particle rendering, and ray tracing are clunky and heavy in terms of computing power, equipment, and labor investment, but they are closer to the essence and have a stronger sense of control. Just like the two computer vision routes of autonomous driving, one relies on CMOS image data to calculate, and the other relies on radar to model point clouds in physical space.
At present, one can only say that the film industry has one more option; it is not as dramatic as sweeping the old order away like dead wood. Micro-films and short videos, though, gain infinite possibilities.
Situation 5: Functional defects are not a problem, and the further SORA goes toward the World Simulator, the less these small generation problems matter.
Timeline expansion, subject fusion transitions, scene replacement, continuity, 3D camera movement, multi-shot, hamburger bite marks, these are just some of the current capabilities, and the availability of SORA will exceed expectations in the future.
At present, there are many bugs, such as left and right legs swapping, extra fingers and toes, characters disappearing, motion deformation, and people passing through fences, but the flaws do not obscure the merits, and these problems will inevitably be solved as training scales up and the model is continually fine-tuned and optimized.
Situation 6: SORA and Vision Pro are indeed a pair of imaginations, but those who think they can read words with a helmet are likely to be disappointed.
In addition, VR is advancing toward MR and AR is retreating toward MR. VR will be only one function of MR in the future. MR is the meeting point that industrial technology can reach today, while AR, the hardest to break through, is the main form of the future.
Situation 7: 4 possible versus 6 impossible for OpenAI itself.
Possible aspects: Become a mainstream AI developer platform, become the largest store, form an ecosystem of billions of users, and some embodied intelligence capabilities.
Impossible: 7 trillion US dollars for chip manufacturing; continuing to lead in model principles; open source; vertical and horizontal integration of the industrial chain; becoming embodied, endogenous, or autonomous intelligence; holding unwaveringly to its founding ideals.
In particular, the story of 7 trillion US dollars for AI chips, which fooled many people, is self-contradictory. The WSJ quoted so-called sources, not Sam Altman himself. Saudi Arabia, which invested in RAIN's equity, is being pressed by the United States to withdraw; would Middle East sovereign capital then put trillions of dollars into large-scale chip manufacturing in the United States? And if green money does not take part, raising a sum equivalent to more than ten years of total US dollar venture capital plus IPOs to make AI chips means that the idea is crazy, or common sense is lacking, or the arithmetic has not been done. What's more, manufacturing is not where the breakthrough in AI computing lies.
Situation 8: The transformation of the whole ecosystem has begun, and AI is the main driver but not the whole chemical reaction.
6 elements: perception (interaction), computing (data), intelligence (AI), connection (network), agreement (relationship), energy (energy), etc.
Situation 9: Change is non-linear.
Deep players are not only focusing on the improvement of computing power, but also brewing changes in computing architecture, and the changes will not be linear, and it is possible that the future discussed by the industry is actually the present, rather than the future after the upgrade. In the next step, the model principle, the computing architecture, including the chip, will continue to change significantly.
Situation 10: The AI force is at the bottom, and the application is only the demand traction.
It is true that domestic teams are suited to starting from applications, but it is not impossible to wake up one morning and find that the building has collapsed. Someone still needs to focus on changes at the bottom layer, including hardware, and someone has to fight the tough battles, or at least keep up.
Situation 11: It must be a cloud-edge-large, medium, and small-PPP hybrid AI, so that the battlefield can unfold; However, we should not only focus on AI, but also the interweaving of dimensions such as sensing, numeracy, intelligence, software and hardware collaboration, and form innovation is a complete perspective and the key to value development.
If it is only narrowed to AI in the sense of computing power algorithms, and lightweight is applied in the sense of scene requirements, it is tantamount to Internet thinking, which can only be rolled on the first day, and can only be a GTPS and APP in the store, just like the Internet era once lived into a very powerful APP; This is a three-dimensional battle won by the Force, and the most important thing to fade away is Internet thinking; stealing everything is light, embarrassing and embarrassing; Seeking simplicity everywhere, it is difficult to be multifaceted; The transformation of the whole ecosystem and the whole system is not enough to catalyze the application alone, and AI in the sense of computing power algorithm data model alone is not enough to drive it.
Situation 12: Pressure increases sharply.
Back to the old problems: the contest between China and the United States over AI, Joseph Needham's question, and Qian Xuesen's question. To be honest, when GPT3.5 and GPT4.0 were released the pressure was not so great, and I always felt we could still catch up; after all, they were still in the dimension of text and **. But as soon as SORA came out, the pressure increased sharply. The move up a dimension is faster than you think. Competition and development are not two-dimensional or linear. A real simulator of the physical world can already be faintly sensed, and its principle is faintly visible. This is the flashpoint of future AI competition and the decisive battle for large models.
A friend put it well: after AlphaGo Zero crushed humans at Go, people thought that was the end of it, but a year later AlphaFold burst onto the scene and reshaped human understanding of protein structure and **, a truly great undertaking. Sora is the same. If you think it is only a 60-second video generator, or if it is dismissed by Internet trolls as a useless foreign trick, that is a bit like how outsiders once saw the early AlphaGo as a chess toy.
But if you look at it from the perspective of large models opening their eyes, the restart of AI cognition, and the potential development toward the world simulator, this is obviously a force awakening. A company that ignores the trend and falls behind on this epic long journey will be hit by a dimensionality-reduction strike so hard that it becomes unrecognizable.
AI cognition has been restarted, super intelligence has lit up the hearts of hundreds of millions of machines, and the world virtualizer has become the root technology of the parent model, not science fiction, which is the prologue of an era.
So AI cognition has been restarted. Has human cognition been restarted too?
About the Author:
Yanping Hu is the founder of DCCI Future Thinktank, the chief expert of FutureLabs Future Lab, and a member of the Information Society 50 Forum. He is the co-author and producer of the science and technology bestseller Black Technology (2017).
He has successively served as the editor-in-chief of Internet Weekly, the director of the Exchange and Development Center of the Internet Society of China, and other first-class and NGO positions, continuing to focus on cutting-edge scientific and technological innovation and exploration, focusing on products from the perspective of technology, industry from products, and ecology from industry.
Since 1997, he has published a number of scientific and technological monographs. He is the author of The Pentium Era (Silicon Valley) (1997), the author of The Digital Blue Book (2000), Crossing the Digital Divide, The Second Modernization, and The Fourth Force (2002), and one of the translators of What Google Will Bring (2009).