Recently, the text-to-video model Sora has become popular all over the world, and as more technical details have been disclosed, many of the physical paradoxes in its generated videos have also attracted attention. This article analyzes the shortcomings of the current Sora technical route from the perspective of modern mathematics, especially the field of global differential geometry.
Written by | Gu Xianfeng (Professor, State University of New York at Stony Brook).
At the beginning of the Year of the Dragon, Sora was born, shocking the world. Sora claims to be "the first generative model for world simulation", a bold claim indeed. There are pessimistic predictions that many traditional fields may be disrupted, the most precarious of which may be computer graphics, short video, and film and television entertainment. Following OpenAI's disclosure of more technical details, many physically paradoxical videos generated by Sora have circulated on the Internet. Here, the author explains the shortcomings of the current Sora technical route based on some viewpoints of modern mathematics, especially the field of global differential geometry, hoping to broaden the thinking of AI researchers and engineers and jointly promote improvement. The explanation mainly draws on the regularity theory of manifold embeddings, catastrophe theory (the theory of critical states), the theory of characteristic classes of fiber bundles, the heat diffusion equation, and the optimal transport equation (the Monge-Ampère equation).
The manifold distribution law.
In the field of deep learning, a natural dataset is treated as a probability distribution on a manifold; this is known as the manifold distribution law. We regard each observed sample as a point in the raw data space; a large number of samples constitutes a dense point cloud in the raw data space, and this point cloud lies near a low-dimensional manifold called the data manifold. The distribution of the point cloud on the data manifold is not uniform but satisfies a specific law, represented as the data probability distribution.
We then naturally have the following questions: 1. Why is the data point cloud low-dimensional rather than filling the entire raw data space? 2. Why does the point cloud gather into a manifold, i.e., why is it locally continuous and smooth?
The answer to the first question is that natural phenomena satisfy a large number of natural laws, and the constraints imposed by these laws reduce the dimensionality of the data point cloud, so it cannot occupy the entire space. For example, consider the dataset of all natural face images: each image is a sample, and the number of pixels multiplied by 3 (the color channels) is the dimension of the raw image space. Any point in the raw image space is an image, but very few images are face images, and only those fall on the face-image manifold, so the face-image manifold cannot occupy the entire raw image space. A face must satisfy many physiological laws, each of which reduces the dimensionality of the data manifold: left-right symmetry cuts the number of free pixels roughly in half; the facial features occupy definite geometric and textural regions, and the shapes of the organs are similar and describable by few parameters, further reducing the dimension. The genes that ultimately control the face are very limited, so the dimension of the face-image manifold is far lower than the pixel count.
For another example, consider the steady-state temperature distribution in a planar region, which is determined by the physical law of heat diffusion: the steady-state temperature satisfies the classical Laplace equation and is uniquely determined by its boundary values. If we have n² sampling points inside the region and n sampling points on its boundary, then each observed temperature function is represented as a vector of dimension n², i.e., the raw data space has dimension n², but the actual manifold dimension is the dimension n of the boundary function. It can be seen that the dimension of a data manifold composed of observed samples satisfying physical laws is much lower than that of the raw data space.
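In symbols, the dimension count above comes from the Dirichlet problem for the Laplace equation:

$$\Delta u = 0 \ \text{in } \Omega, \qquad u|_{\partial \Omega} = g.$$

The temperature $u$, sampled at the $n^2$ interior points, is uniquely determined by the boundary function $g$, sampled at the $n$ boundary points; the map $g \mapsto u$ therefore parameterizes the data manifold by roughly $n$ coordinates inside the $n^2$-dimensional raw data space.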
The answer to the second question is that in most cases the physical system is deterministic, but in critical states the system undergoes abrupt changes (as described by catastrophe theory, or the theory of critical states). The laws of physics are mostly described by systems of partial differential equations, and the solution of a differential equation is controlled by its initial and boundary values. The system being deterministic means that, due to conservation of energy, conservation of mass, energy propagating no faster than the speed of light, and so on, when the initial and boundary values change gradually, the solution also changes gradually. In the regularity theory of partial differential equations, this means that the Sobolev norm of the boundary value controls the Sobolev norm of the solution, and so on. We treat the solution as a point on the data manifold and the boundary value as its local coordinate (i.e., the corresponding latent feature vector in the latent space). The mapping from the data manifold to the latent space is called the encoding map, and the mapping from the latent space to the data manifold is called the decoding map. Regularity theory ensures that the encoding and decoding maps are continuous or even smooth, and the uniqueness of solutions ensures that these maps are homeomorphisms or diffeomorphisms. The boundary value can be perturbed arbitrarily, i.e., each latent variable has a neighborhood that is a Euclidean disk. This means that the observed samples satisfying specific physical laws constitute a data manifold.
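The norm-control statement can be made concrete with a standard elliptic estimate (a textbook form, stated here for the Laplace example above): for the Dirichlet problem with boundary data $g$,

$$\| u \|_{H^{s+1/2}(\Omega)} \le C \, \| g \|_{H^{s}(\partial \Omega)},$$

so a small perturbation of the boundary value $g$ yields a small perturbation of the solution $u$: the decoding map $g \mapsto u$ is continuous, and by uniqueness the encoding map $u \mapsto u|_{\partial\Omega}$ inverts it.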
Figure 1. Sora encodes videos into the latent space, then cuts the latent representation into spacetime patches, called spacetime tokens. (openai.com)
As shown in Figure 1, Sora's training set is a collection of short videos; each sample is a short video, and videos of the same kind constitute a data manifold. Sora encodes each video into the latent space for dimensionality reduction, then cuts the latent feature vector into patches and adds temporal order to form spacetime patches, also called spacetime tokens. The notion of space-time here is critical: each token records its frame index (time) in the short video and its position (space) within the current frame.
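The patching step can be sketched in a few lines of code (a minimal illustration; the tensor shapes, patch sizes, and function name are assumptions for exposition, not Sora's actual configuration):

```python
import numpy as np

def to_spacetime_tokens(latent, pt=2, ph=4, pw=4):
    """Cut a latent video tensor into spacetime tokens.

    latent: array of shape (T, H, W, C) -- an encoded (latent) video.
    Each token is a (pt x ph x pw x C) block, flattened to a vector,
    tagged with its (time, row, col) index in the latent grid.
    """
    T, H, W, C = latent.shape
    tokens, positions = [], []
    for t in range(0, T - pt + 1, pt):
        for i in range(0, H - ph + 1, ph):
            for j in range(0, W - pw + 1, pw):
                block = latent[t:t+pt, i:i+ph, j:j+pw, :]
                tokens.append(block.reshape(-1))   # flatten the patch
                positions.append((t, i, j))        # spacetime position
    return np.stack(tokens), positions

# Example: a toy latent video of 8 frames, 16x16 latent pixels, 4 channels.
latent = np.random.randn(8, 16, 16, 4)
tokens, pos = to_spacetime_tokens(latent)
print(tokens.shape)  # (64, 128): 64 tokens, each of dimension 2*4*4*4
```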
Probability distribution transformations.
We can further ask the following question: 3. How is the probability distribution on the data manifold represented?
The answer to the third question is to use a transport map to turn the data's probability distribution into a Gaussian distribution that a computer can generate. This transport map can act in the raw data space or in the latent space. Commonly used transport maps include the optimal transport map and heat diffusion. We explain this from the viewpoint of fluid mechanics. Suppose the entire latent space is a water tank containing some solute, whose density is the probability density. We stir the tank, making the liquid flow and the solute density change. We compute the flow direction and velocity of each water molecule so that the entropy of the probability density keeps increasing until the density finally becomes a Gaussian distribution. For example, consider the distribution of face data, where each water molecule is a face image. We keep adding noise to a face image and obtain a series of images, until it becomes white noise. This series is the trajectory of a water molecule. In the end every face image becomes white noise, and all these white-noise images together satisfy the Gaussian distribution. This process is known as Langevin dynamics. Conversely, given white noise, we trace the water molecule's trajectory backwards to its origin and obtain a face image. This is how the diffusion model works. Of course, one can also directly use optimal transport theory to compute a homeomorphism from the latent space to itself that turns the data distribution into a Gaussian distribution, which requires solving the Monge-Ampère equation. It follows that all the information about the data distribution is contained in the transport map, which is represented by a deep network.
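The forward noising process described above can be sketched as follows (a minimal illustration assuming the standard variance-preserving discretization; the noise schedule and step count are arbitrary choices, not Sora's):

```python
import numpy as np

def forward_diffusion(x0, steps=1000, beta_min=1e-4, beta_max=0.02):
    """Gradually add Gaussian noise to a data point x0.

    Standard variance-preserving step:
        x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    After enough steps, x_t is approximately white Gaussian noise,
    regardless of the starting distribution of x0.
    """
    betas = np.linspace(beta_min, beta_max, steps)
    x = x0.copy()
    trajectory = [x]                       # the "water molecule" trajectory
    for beta in betas:
        noise = np.random.randn(*x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

# Example: a toy latent "face" vector drifting to white noise.
x0 = np.random.rand(128)                    # a toy latent feature vector
traj = forward_diffusion(x0)
print(np.mean(traj[-1]), np.std(traj[-1]))  # approx 0 and 1: Gaussian
```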
Figure 2. Sora uses a diffusion model to generate data spacetime tokens from white-noise spacetime tokens. (openai.com)
As shown in Figure 2, Sora transforms the probability distribution of the data tokens into a Gaussian distribution through a diffusion process in the latent space (a Langevin dynamical system: noise is gradually added to each token), and then transforms white-noise tokens in the latent space into latent data tokens through the inverse of this transport map.
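The inverse direction can be sketched as iterative denoising; here `predict_noise` is a hypothetical placeholder for a trained network (this is standard DDPM-style ancestral sampling, not Sora's actual sampler):

```python
import numpy as np

def reverse_diffusion(x_T, predict_noise, steps=1000,
                      beta_min=1e-4, beta_max=0.02):
    """Trace a white-noise token back to a data token.

    predict_noise(x, t) is a trained network estimating the noise in x
    at step t; here it is an assumed placeholder. Each iteration inverts
    one forward noising step (simplified DDPM ancestral sampling).
    """
    betas = np.linspace(beta_min, beta_max, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = x_T.copy()
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)
        # Remove the predicted noise component, then re-inject a little
        # fresh noise (except at the final step).
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * np.random.randn(*x.shape)
    return x

# Usage with a dummy predictor that always guesses zero noise:
x_T = np.random.randn(128)
x0 = reverse_diffusion(x_T, lambda x, t: np.zeros_like(x))
```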
The blessing of large language models.
Sora incorporates the large language model ChatGPT, which greatly improves the system's performance. First of all, Sora's training samples are (text, video) pairs; for some videos the corresponding caption is too short or missing altogether, so Sora uses the re-captioning technique from DALL·E 3.
Sora's training set contains many high-quality samples (short videos with highly descriptive captions), from which the video data manifolds (including the spacetime-token manifolds) are learned, and each manifold is identified by its caption. For videos with missing or ambiguous captions, Sora encodes them into the latent space, searches for the latent feature vectors of nearby high-quality videos, and copies their high-quality captions to the low-quality samples. In this way, Sora can attach highly descriptive captions to all training data, improving the quality of the training set and further improving system performance.
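The nearest-neighbor re-captioning idea described above (the author's reading of the pipeline) might look like this in code; the function names and the use of cosine similarity are illustrative assumptions:

```python
import numpy as np

def recaption(latent_poor, latents_good, captions_good):
    """Assign each poorly captioned sample the caption of its nearest
    high-quality neighbor in latent space (cosine similarity).

    latent_poor:   (m, d) latent vectors of samples lacking captions.
    latents_good:  (k, d) latent vectors of well-captioned samples.
    captions_good: list of k caption strings.
    """
    # Normalize rows so the dot product is cosine similarity.
    p = latent_poor / np.linalg.norm(latent_poor, axis=1, keepdims=True)
    g = latents_good / np.linalg.norm(latents_good, axis=1, keepdims=True)
    nearest = (p @ g.T).argmax(axis=1)     # index of most similar neighbor
    return [captions_good[i] for i in nearest]

# Toy usage: 3 uncaptioned samples borrow captions from 2 good ones.
good = np.random.randn(2, 16)
poor = good[[0, 1, 1]] + 0.01 * np.random.randn(3, 16)
print(recaption(poor, good, ["a dog runs", "a cat sleeps"]))
```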
At the same time, the large language model can expand the prompts entered by the user, making them more accurate and descriptive, so that the generated videos better fit the user's needs. This makes Sora even more powerful. However, Sora still has many flaws, which can be analyzed through the following examples.
The contradiction between correlation and causality.
ChatGPT breaks statements down into tokens and then uses the Transformer to learn the probability distribution of connections between tokens in context. Similarly, Sora decomposes videos into spacetime tokens, learns the probability distribution of connections between tokens in context, generates tokens from white noise according to this distribution, connects them, and decodes them into a short video. Each token represents a local region of an image or video, and the stitching between different local regions becomes the crux of the problem. Sora learns each token relatively independently and expresses the spatial relations between tokens by the probabilities embodied in the training set, so the causal relations between tokens cannot be accurately expressed.
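For concreteness, these "connection probabilities" live in the attention weights of the Transformer; a minimal sketch of standard scaled dot-product self-attention (not Sora's actual architecture) makes the point that the weights are learned pairwise affinities, not physical laws:

```python
import numpy as np

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention over spacetime tokens.

    Each output token is a weighted average of all value vectors; the
    weights are learned statistical affinities between token pairs,
    not physical causal laws.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: correlations
    return weights @ V

# Toy usage: 64 tokens of dimension 128, random projection matrices.
d = 128
tokens = np.random.randn(64, d)
out = self_attention(tokens, *(np.random.randn(d, d) for _ in range(3)))
print(out.shape)  # (64, 128)
```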
Video 1. Sora-generated video of a grandma blowing out birthday candles. (openai.com)
As shown in Video 1, in the video generated by Sora every frame is extremely realistic, but when the grandma blows out the birthday candles, the candle flames do not move at all. If we narrow the field of view to the area of a single token, we see a beautiful, realistic picture, and the connections between adjacent tokens are also smooth and natural; but when there is a causal connection between tokens far apart from each other, namely when the blown air should affect the flickering of the flames, the physical causality between the two tokens is not reflected. This means that Transformers express statistical correlations between tokens and cannot accurately express the laws of physical causality. Although Transformers can manipulate natural language to a certain extent, natural language cannot accurately express the laws of physics, which can only be precisely expressed by partial differential equations. This reflects a limitation of probability-based world models.
The contradiction between local rationality and global absurdity.
At present, the stitching between adjacent tokens in Sora is very reasonable, but the overall stitched video may exhibit various paradoxes. This reveals a gap between local stitching and global extension.
*2.The "ghost chair" generated by Sora **openaicom) [Go to "Back to Park"**
We observe that the "ghost chair" is very reasonable if we limit the field of view to a localized area in the middle of the screen. Carefully inspect the direct connections between different token intervals, and it is also very continuous and smooth. But the whole chair hangs like a ghost, which is contrary to everyday experience. This kind of "local reasonable, overall absurdity" generation** means that transformers have learned the local connection probability between tokens, but lack the large-scale overall concept of space-time context. In this **, the overall concept comes from the gravitational field in physics, although it cannot be seen locally, but the whole is everywhere at all times.
Video 3. The four-legged ant generated by Sora. (openai.com)
Another example is the "four-legged ant" video generated by Sora: the ant's movements are lifelike and fluid. Locally it is smooth and natural, and one cannot help imagining that such a four-legged ant might exist on some planet. But globally, there are no four-legged ants in Earth's natural world. Local rationality here cannot guarantee global rationality, and the global concept comes from the facts of biology.
Video 4. The backwards treadmill generated by Sora. (openai.com)
Another example is the "opposite treadmill" generated by SOAR**If we look at each local area, what we see is reasonable, and the connection between the tokens is also natural, but the overall ** is absurd, and the treadmill is in the opposite direction of the runner. This global view is contrary to the facts from ergonomics.
These examples show that current Transformers can learn local contexts but cannot learn more global contexts, which may be the gravitational field in physics, ergonomics, or species classification in biology. This global view is precisely the "dark matter" of the AI world proposed by Professor Zhu Songchun. Although each training video implicitly expresses global concepts about the world, the tokenization process severs the global concepts and retains, to a limited extent, only the connection probabilities between adjacent tokens, resulting in locally reasonable but globally absurd output.
Modern global differential geometry attaches great importance to the contradiction between the global and the local, and a variety of theoretical tools have been invented for this purpose. For example, we can construct a smooth frame field locally on a topological manifold, but may be unable to extend it globally; the obstruction to global extension is the characteristic class of the fiber bundle. On a complex manifold, we can construct a meromorphic function locally, but may be unable to splice the local functions into a global meromorphic function; the gap between local construction and global extension is accurately described by sheaf cohomology theory. Many physical theories are expressed as characteristic-class theories of specific fiber bundles, such as the theory of topological insulators. This kind of mathematics, easy to construct locally but substantially difficult to extend globally, is the crystallization of humanity's wisdom in exploring nature at a deep level. This holistic view of topology and geometry has not yet been generalized to the AI domain; if Transformers could learn this kind of global obstruction in context on their own, AI would be able to explore the natural world more effectively.
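A classical instance of such an obstruction (a standard textbook example, added here for concreteness): a smooth nowhere-zero tangent vector field can be built on any small patch of the sphere $S^2$, but the Poincaré-Hopf theorem forces any global field $v$ to vanish somewhere, since

$$\sum_{p} \operatorname{ind}_p(v) = \chi(S^2) = 2 \neq 0,$$

where the sum of the indices of the zeros of $v$ equals the Euler characteristic. The Euler class of the tangent bundle is exactly the characteristic class measuring this local-to-global obstruction.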
The absence of critical states.
The vast majority of physical processes in nature alternate between steady states and critical states. In a steady state, the system parameters change slowly, and observation data are easy to obtain. In a critical state, the system changes abruptly, catching observers off guard, and observation data are hard to capture. As a result, data samples of critical states are very sparse, with almost zero measure in the training set, and most of the data manifolds learned by the Sora system are composed of steady-state samples. The critical-state samples of a physical process are mostly distributed near the boundary of the data manifold. Therefore, during generation, Sora very easily produces steady-state video clips, but the critical states are often skipped. Yet in human cognition, the most crucial observations are precisely the critical states, whose probability is almost zero.
Video 5. The juice spill generated by Sora. (openai.com)
The juice-spill video generated by Sora shows two stable states: the state in which the cup stands upright, and the state in which the juice has already spilled; but the most critical state, the process of the juice spilling out of the cup, is not generated. Although it lasts only a few frames, it is crucial for humans to perceive the whole process. There may be several reasons why Sora cannot generate images of the critical state:
Different steady states of a physical process generate different connected components of the data manifold, and critical-state samples lie near the boundaries of the steady-state manifolds, between the boundaries of two steady-state manifolds. The heat-diffusion process blurs the boundaries of the manifolds, confusing them and producing mode mixture during generation. In other words, critical states correspond to the boundary of the data manifold, and the boundary should be preserved during learning, without mixing the modes.
Figure 3. Mode mixture.
As shown in Figure 3, we trained an autoencoder on MNIST and plotted the latent-space distribution of the dataset: the 10 handwritten digits correspond to 10 clusters, and each cluster is a mode, i.e., a connected component of the data manifold. The boundaries of the clusters are the boundaries of the connected components in latent space. We generated 100 sample points in the latent space and decoded them into 100 handwritten-digit images. If a sample point falls inside a cluster, the resulting image is very clear; if it falls outside the cluster boundaries, the resulting image is very blurry, often a fusion of two handwritten digits. Therefore, identifying the boundaries of the data manifold is essential for identifying critical states.
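A minimal sketch of this experiment, assuming an already-trained autoencoder exposed as a `decode` function (all names and shapes here are placeholders, not the original code):

```python
import numpy as np

def sample_and_decode(decode, latents, labels, n=100, seed=0):
    """Sample latent points uniformly over the data's bounding box and
    decode them; points landing between digit clusters decode to blurry
    fusions of two digits (mode mixture).

    decode:  trained decoder, maps a (d,) latent vector -> image.
    latents: (N, d) encoded training set, one cluster per digit.
    labels:  (N,) digit labels, used only to locate cluster centers.
    """
    rng = np.random.default_rng(seed)
    lo, hi = latents.min(axis=0), latents.max(axis=0)
    samples = rng.uniform(lo, hi, size=(n, latents.shape[1]))
    centers = np.stack([latents[labels == k].mean(axis=0)
                        for k in range(10)])
    images = []
    for z in samples:
        img = decode(z)
        # Distance to the nearest cluster center: large values flag points
        # on or past a manifold boundary, hence likely blurry decodes.
        d = np.linalg.norm(centers - z, axis=1).min()
        images.append((img, d))
    return images

# Toy usage with stand-ins (a real run would use trained networks):
latents = np.random.randn(1000, 2)
labels = np.random.randint(0, 10, 1000)
imgs = sample_and_decode(lambda z: np.zeros((28, 28)), latents, labels)
```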
The diffusion model used by Sora, currently the most popular choice, inevitably smooths the boundaries of the data manifold when computing the transport map, thus confusing different modes and skipping the generation of critical-state images outright. Hence the video appears to jump suddenly from one state to another, with the crucial pouring process in the middle missing, producing physical absurdity.
Video 6. The puppies generated by Sora. (openai.com)
Video 6 shows another case where an error arises from crossing a manifold boundary. Sora generates puppies playing and wrestling, sometimes occluding each other and sometimes separating. At a certain moment, the 3 puppies on screen suddenly become 4. We explain it this way: the videos of 4 puppies form one manifold (or connected component), and the videos of 3 puppies form another; on the boundary of the 4-puppy manifold there is a critical event: the four puppies occlude one another so that only 3 are visible. Sora's diffusion model does not recognize the manifold boundary but breaks through it, crossing between the 3-puppy manifold and the 4-puppy manifold. The correct approach is to identify the manifold boundary first, and at the boundary fold back into the original manifold whenever the crossing is physically impossible (e.g., 3 becoming 4).
Figure 4. The optimal transport map based on the geometric method can accurately detect the boundary of the data manifold and capture critical states precisely.
The drawbacks of the diffusion model can be overcome by the optimal transport model based on the geometric method. As shown in Figure 4, suppose we compute the optimal transport map from the uniform distribution on a disk to the uniform distribution on the seahorse-shaped region on the right. By Brenier's theorem, this map is given by the gradient map of a convex potential function. The potential function satisfies the Monge-Ampère equation; the set where it is continuous but non-differentiable projects onto a singular set in the disk region (the black curves). Regular points are mapped to regular points of the target region, and the singular set is mapped to the boundary of the target region (each singular point is mapped simultaneously to boundary points on both sides). Crossing the singular set means crossing between two steady equilibrium states, where a critical (catastrophic) event must occur, i.e., a physical event in which the steady state is broken. Thus, accurately finding the singular set of the transport map and detecting the critical (catastrophic) states is of fundamental importance for modeling the physical world.
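In formulas, this is the standard Brenier/Monge-Ampère setup: the optimal transport map from a source density $f$ on $\Omega$ to a target density $g$ on $\Omega^*$ is $T = \nabla u$ for a convex potential $u$, which satisfies

$$\det D^2 u(x) = \frac{f(x)}{g(\nabla u(x))},$$

and where $u$ is convex but not differentiable, $\nabla u$ is multivalued: the singular set of $u$ is exactly where the transport map tears, and its image lands on the boundary of the non-convex target region $\Omega^*$.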
Brief summary. It can be seen that although Sora claims to be "the first generative model for world simulation", the current technical route cannot accurately simulate the physical laws of the world. First, the correlations of probability and statistics cannot accurately express the causality of physical laws, and the contextual correlations of natural language cannot reach the precision of partial differential equations. Second, although the Transformer can learn the connection probabilities between adjacent spacetime tokens, it cannot judge global rationality; global rationality requires higher-level mathematical viewpoints, or deeper hidden background from the natural sciences and humanities, which current Transformers cannot truly understand. In addition, Sora ignores the most crucial critical (catastrophic) states in physical processes, on the one hand because critical-state samples are scarce, and on the other hand because the diffusion model blurs the boundaries of the steady-state data manifolds, erasing the existence of critical states, so the generated videos jump between different steady states. The optimal transport framework based on the geometric method can accurately detect the boundaries of steady-state data manifolds, which emphasizes the generation of critical-state events, avoids abrupt jumps between steady states, and is closer to physical reality.
At present, the data-driven world simulation model represented by Sora and the world simulation models built from first principles, on physical laws and partial differential equations, have entered a state of fierce competition. This is perhaps a great turning point in human history. I hope that young readers can enthusiastically join the torrent of the times and use their wisdom to promote the development of science, technology, and society!
This article is reproduced with permission from the WeChat public account "Lao Gu Talks Geometry".