Image credit: Visual China.
Text | AI Blue Media · Author | Tao Ran · Editor | Wei Xiao

The release of the Sora model is almost a replay of the grand scene in the AI circle when GPT-3 first debuted a year and a half ago:
It suddenly appeared, causing heated discussions and widespread shock.
On February 16, Beijing time, without any leaks or advance notice, OpenAI posted on the social platform X (formerly Twitter) to announce, for the first time, a text-to-video AI model named Sora.
The sentence "introducing sora, our text-to-video model" is short to the point, more like a notice than publicity:Yes, we pulled out the big ones again.
This is followed by an introduction to Sora's capabilities: Sora can create videos of up to 60 seconds, with detailed scenes, complex camera movements, and vivid, emotionally expressive characters.
The prompt for the demonstration video is attached: "Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes."
When it comes to Sora, industry opinion is not uniform:
Some endorse it without reservation, while others dismiss it outright.
Zhou Hongyi, the founder of 360, said that Sora means the timeline for realizing artificial general intelligence may shrink from ten years to one, and that the model demonstrates not only video production capability but also that, once a large model can understand and simulate the real world, new results and breakthroughs will follow.
Jim Fan, principal research scientist at NVIDIA's AI research lab, called Sora the GPT-3 moment of generative video: Sora is a "data-driven physics engine," a learnable simulator, or "world model."
Musk, a heavy user of the platform who has never minced words, simply replied "gg humans" (humans lose).
Without delving into whether the downstream impact is positive or negative, it is once again OpenAI that has delivered a disruptive, epoch-making jolt to AI, film and television, social media, and other industries.
It is as if a group of engineers were still discussing how to further improve the moon-landing program when the OpenAI team sent back a batch of videos from Mars. They are always one version ahead. Why?
Jim Fan's comment approaches Sora from a technical point of view: he defines Sora as a physics engine and a world model. In the traditional sense, a video is a two-dimensional picture, while the physical world people live in is three-dimensional.
This is the conceptual divergence at the very start of designing a video model: in generating a video, should the AI's role be to split and recombine many two-dimensional clips, or should it act as an agent that builds a virtual space and then records it?
OpenAI's choice is the latter.
In the Sora technical report published on its website, one sentence is worth noting: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world."
To put it plainly, Sora is not a video editor: it models a space before generating, and then becomes a lens that records this three-dimensional virtual space.
Three-dimensional modeling can present far more information than a floor plan. In terms of design thinking, OpenAI is one dimension ahead, or one version ahead of schedule.
Of course, more information means a larger data stream. Producing better results within limited computing power, and saving as much compute as possible while guaranteeing quality, are essentially the same problem: AI computing efficiency.
But for OpenAI, these are problems with precedents to learn from. The technology accumulated from ChatGPT to GPT-4 and other projects has become a solid foundation on which OpenAI built Sora.
Inspired by the success of its large language models, OpenAI asked "how to reap similar benefits" when exploring the video model. In a large language model, the token (lexical unit), as the smallest text unit in natural language processing, carries the input information and helps the model process and understand text. ChatGPT splits code, mathematics, and various natural languages into tokens, hands them to the model to process and understand, and can extract richer semantic information by learning the relationships between tokens.
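To make the token idea concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (the sample text is our own; this illustrates tokenization in general, not Sora's internals):

```python
# Minimal sketch: splitting text into tokens with OpenAI's open-source
# tiktoken library (pip install tiktoken). cl100k_base is the encoding
# used by GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "introducing Sora, our text-to-video model"
token_ids = enc.encode(text)   # text -> a list of integer token IDs
print(len(token_ids), token_ids)

# Decode each ID individually to see how the text was split.
print([enc.decode([t]) for t in token_ids])
```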
In the same way, for the video generation model OpenAI created a data unit corresponding to the token: the "patch" (visual unit). Visual material is converted into patches of a uniform format for computation, which greatly improves computing efficiency per unit of compute while preserving the model's scalability.
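The patch idea can be illustrated the same way. The sketch below (shapes and patch sizes are illustrative assumptions, not Sora's actual configuration) cuts a video tensor into non-overlapping spacetime patches, the visual analogue of a token sequence:

```python
# Minimal sketch: splitting a video into spacetime patches.
# All shapes and patch sizes here are illustrative assumptions.
import numpy as np

T, H, W, C = 16, 64, 64, 3          # frames, height, width, channels
video = np.random.rand(T, H, W, C)  # stand-in for an encoded video

pt, ph, pw = 4, 16, 16              # patch size along time, height, width

# Carve the video into non-overlapping (pt, ph, pw) blocks...
patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)

# ...then flatten each block into one vector: a sequence of "visual tokens".
patches = patches.reshape(-1, pt * ph * pw * C)
print(patches.shape)                # (64, 3072): 64 patches, 3072-dim each
```

Uniform patches mean the same transformer machinery that scales on text tokens can scale on visual data.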
At the front end of the model, OpenAI also drew on its achievements in the GPT series of models:
As with text dialogue, training a text-to-video model requires not only the video material itself but also a large number of corresponding text descriptions. OpenAI adopted the "re-captioning" technique first proposed in DALL·E 3, using a highly descriptive caption generator to produce text descriptions for the videos in the training set. The results also proved that training on these extra captions improves overall video quality, including the fidelity of the output to the text.
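A minimal sketch of what re-captioning a training set could look like; describe_video here is a hypothetical stand-in for the descriptive captioner, since OpenAI has not published its implementation:

```python
# Sketch of the re-captioning idea from DALL·E 3, applied to a video set.
# describe_video() is a HYPOTHETICAL stand-in: OpenAI has not published
# its captioner model or the interface it uses.
import json
from pathlib import Path

def describe_video(video_path: Path) -> str:
    # Hypothetical placeholder; in practice a large captioner model
    # would watch the clip and write a dense, detailed description.
    return f"A highly descriptive caption for {video_path.name}"

def recaption_dataset(video_dir: Path, out_file: Path) -> None:
    """Pair every training video with a generated descriptive caption."""
    records = [
        {"video": str(p), "caption": describe_video(p)}
        for p in sorted(video_dir.glob("*.mp4"))
    ]
    out_file.write_text(json.dumps(records, indent=2))

# Example usage:
# recaption_dataset(Path("train_videos"), Path("captions.json"))
```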
In addition, following the practice of DALL·E 3, OpenAI also uses GPT to expand the short prompts entered by users, making them easier for the AI to understand: the user's input is expanded into a longer, more detailed description, which is then handed to the video generation model.
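In code, that expansion step might look roughly like this sketch using the official openai Python SDK; the model name and system instruction are assumptions for illustration, since OpenAI has not disclosed the exact prompt it uses:

```python
# Sketch: expanding a short user prompt into a detailed description with
# GPT, in the spirit of DALL·E 3. The model name and system instruction
# are ASSUMPTIONS for illustration, not OpenAI's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's prompt as a long, richly "
                        "detailed scene description for a video model."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a snowy day in Tokyo"))
```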
For a technology-driven company like OpenAI, accumulated experience and technology are an accelerant. Proven approaches to follow, combined with the team's own leading understanding of AI concepts, allow OpenAI to keep standing on its own shoulders, pushing itself to accelerate forward.
More daunting than the technological lead itself, and more worth competitors' concern, is that this kind of lead tends to become momentum, compounding step by step. Hoping to keep up with OpenAI through accelerated catch-up and benchmarking, at a stage when the supporting infrastructure is increasingly mature, may only get harder. The real increment still lies in innovation at the level of top-level design. So it is not so much that AI has squeezed out people's room to innovate as that AI has raised the threshold for effective innovation: designing AI, or producing designs that surpass AI's creativity, is the effective increment in the era of large models.