On February 16, OpenAI released multiple AI-generated videos of up to 60 seconds, showing the results of its text-to-video tool Sora to the world for the first time. The tool is named after the Japanese word for "sky," a nod to its "unlimited creative potential."
Text-to-video AI tools aren't completely new: Runway's Gen-1 and Gen-2, Google's Imagen Video and Phenaki, Meta's Make-A-Video, and similar text-to-video generators are not uncommon.
Most previous tools generated a video frame by frame and then stitched the frames together. The drawback of this approach is that even though every frame shares the same prompt, the individual generations can differ considerably, so the length of the output video is strictly limited; once it runs too long, characters may become distorted or the video otherwise loses coherence.
Sora's primary advantage over these tools is a major breakthrough in length and consistency. According to the technical report published by OpenAI and interpretations offered by outside experts, the "spacetime patch" technique Sora adopts lets it, after reading the text prompt, cut the video to be generated into many small pieces that each carry spatial and temporal information, and generate them separately.
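A minimal sketch of the patching idea described above (this is not OpenAI's code; the patch sizes and the NumPy layout are illustrative assumptions): a video tensor is sliced into small blocks that span both space and time, so each resulting "token" carries local spatial and temporal information.

```python
import numpy as np

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into blocks of shape
    (pt, ph, pw, C) and flatten each block into one token row.
    Patch sizes here are hypothetical, chosen for illustration."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch-grid axes first
    return v.reshape(-1, pt * ph * pw * C)  # one row per spacetime patch

video = np.zeros((16, 64, 64, 3))           # 16 frames of 64x64 RGB
patches = to_spacetime_patches(video)
print(patches.shape)                        # (64, 3072): 4*4*4 patches
```

Because the slicing adapts to whatever duration and resolution the input has, videos of different lengths or aspect ratios simply yield token sequences of different lengths, which is consistent with the flexibility described below.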
Schematic diagram of the "spacetime patch" technique from OpenAI's technical report.
This allows Sora to maintain consistency at a much finer granularity and greatly enriches the details of the output. In the demo videos OpenAI released, the benefits of this coherence include better simulation of simple interactions between characters and their environment, the ability to extend a video forward or backward in time, and the ability to blend two videos into one coherent clip.
Beyond this, Sora performs better in physical modeling and composition. Unlike previous tools, which crop their videos to a fixed format, Sora can generate video directly at its original aspect ratio and resolution, which means it can better keep the main subject in frame and simulate the same object's motion from different angles.
A screenshot of one of the demos released by OpenAI, generated from the prompt "Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes."
Yet even as outside observers marvel at its capabilities, much about Sora remains unknown. It is unclear, for example, whether it will support languages other than English, or when it will open to more users; for now, only a small group of "visual artists, designers, and filmmakers," along with designated security testers, have been granted access.
The technical report on OpenAI's website only briefly explains the general principle behind the technology, mentioning the use of earlier technologies such as GPT and DALL·E 3 for text analysis, but it does not disclose the training set or model architecture in a paper, as OpenAI once did for GPT-3.
New York University professor Saining Xie has pointed out that Sora may have built on a model architecture he developed with another researcher, and some speculate that Sora used Unreal Engine 5 to create part of its training data. OpenAI has consistently declined to say how much material the system was trained on, indicating only that the training data includes publicly available content and content licensed from copyright holders.
Such secrecy seems to have become standard practice when large companies release new versions of large models. On the same day OpenAI unveiled Sora, Google launched its upgraded Gemini 1.5, likewise in limited preview for a small group of developers and enterprise customers. An analysis of ten major AI models by Stanford University's Center for Research on Foundation Models found that no major foundation-model developer offered sufficient transparency.
OpenAI's explanation for not yet releasing the tool or further details is that it still needs to reduce misinformation, hateful content, and bias in the generated videos, and that all generated videos carry a watermark, though the watermark can be removed. Given that short videos can already exert significant influence on politics, regulatory pressure on the AI sector will be higher than ever. (Intern: Shang Yi)