Not long after Sora was released, Stability AI announced Stable Diffusion 3. For people who use artificial intelligence in their design work, this is undoubtedly a big year. This article is written for those users: it explains the two major features of Stable Diffusion 3, the diffusion transformer architecture and flow matching, in plain language, so that you can make better use of the model for your own creations once it is released.
First, Diffusion Transformers, which we will refer to as DiT. As the name suggests, this is a latent diffusion model for images built on the Transformer architecture. If you have already read the article "Demystifying Sora: Understanding with Large Language Models and Realizing the 'Emergence' of the Physical World", then you are already at "class representative" level for what follows. Like Sora, DiT uses the concept of "patches", but since DiT generates still images, it does not need to maintain logical consistency across different video frames the way Sora does, so it does not have to produce spacetime patches.
The process is similar to the Vision Transformer (ViT) that swept the computer vision field a few years ago: the image is split into multiple patches, each of which is embedded into a continuous vector space, forming a sequence the Transformer can process. One thing to note is that for conditional image generation tasks, DiT also needs to receive and fuse external conditioning information, such as class labels or text descriptions. This is usually done either by appending extra conditioning tokens to the input or through a cross-attention mechanism, which lets the model steer the generation process according to the given condition.
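To make the patch idea concrete, here is a minimal sketch of ViT/DiT-style patch embedding; the module name, patch size, and dimensions are illustrative assumptions, not Stability AI's actual code. A VAE latent is cut into non-overlapping patches and each patch is projected to a token.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_size=2, in_channels=4, embed_dim=768):
        super().__init__()
        # A strided convolution is the usual trick: one kernel application per patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, latent):                    # latent: (B, C, H, W)
        tokens = self.proj(latent)                # (B, D, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

x = torch.randn(1, 4, 32, 32)                     # e.g. a 32x32x4 VAE latent
print(PatchEmbed()(x).shape)                      # torch.Size([1, 256, 768])
```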
Once this token sequence enters DiT, it is processed into the required output by the DiT blocks inside the model. The DiT block is the core component of DiT: a Transformer structure designed specifically for diffusion models, capable of processing both image tokens and conditioning information. (In the Chinese original, "block" and "patch" would translate to the same word, so to keep them distinct I will simply use the English term "block".)
The DiT block comes in three variants: cross-attention, adaLN, and adaLN-Zero. The cross-attention variant adds an extra multi-head cross-attention layer after the multi-head self-attention layer; it uses the conditioning information to guide image generation, so the generated image matches the prompt more closely, at the cost of roughly 15% more computation.
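For illustration, here is a rough sketch of what such a cross-attention block might look like; the layer layout and dimensions are assumptions for the example, not the actual SD3 implementation.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x, cond):                    # x: image tokens, cond: text tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]         # image tokens attend to each other
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond)[0]  # image tokens attend to the condition
        return x + self.mlp(self.norm3(x))

x = torch.randn(1, 256, 768)                       # image tokens from the patch embedding above
cond = torch.randn(1, 77, 768)                     # e.g. projected text-encoder tokens
print(CrossAttnDiTBlock()(x, cond).shape)          # torch.Size([1, 256, 768])
```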
The "LN" in adaLN stands for layer normalization, which reduces internal covariate shift by normalizing the outputs of the units within each layer of the network, improving the convergence speed and stability of training. adaLN (adaptive layer normalization) extends standard layer normalization by letting its scale and shift parameters be adjusted dynamically based on the input data or additional conditioning information. Think of it like a car's suspension: it increases the stability and adaptability of the model.
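Here is a tiny sketch of the adaLN idea, assuming the conditioning signal is a single vector per sample (for example a timestep-plus-class embedding); the names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim=768, cond_dim=768):
        super().__init__()
        # No fixed affine parameters: scale and shift come from the condition instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):                  # x: (B, N, D), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```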
Going a step further, the adaLN DiT block is improved by regressing dimension-wise scaling parameters in addition to the scale and shift parameters, and applying them immediately before every residual connection within the DiT block. This variant is adaLN-Zero; it is designed to mimic the beneficial initialization strategy used in residual networks, where each block starts out as an identity mapping, which promotes effective training and optimization of the model.
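A small sketch of the adaLN-Zero gating, with the zero-initialized projection that makes each block start as the identity; again the module is an illustrative assumption, not the official code.

```python
import torch
import torch.nn as nn

class AdaLNZeroGate(nn.Module):
    def __init__(self, dim=768, cond_dim=768):
        super().__init__()
        self.to_alpha = nn.Linear(cond_dim, dim)
        nn.init.zeros_(self.to_alpha.weight)       # zero init => alpha starts at 0
        nn.init.zeros_(self.to_alpha.bias)

    def forward(self, residual, branch_out, cond):
        alpha = self.to_alpha(cond).unsqueeze(1)   # (B, 1, D) dimension-wise gate
        return residual + alpha * branch_out       # the block is the identity at initialization
```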
After the final DiT block, the token sequence is decoded into a predicted noise and a predicted diagonal covariance. A standard linear decoder is used, and both outputs have the same spatial dimensions as the input image. Finally, the decoded tokens are rearranged back into their original spatial layout to obtain the noise and covariance values.
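Here is a toy sketch of this decoding step, assuming a 2×2 patch size and the 32×32×4 latent from the earlier example; the helper function is made up for illustration (in a real model the linear head would be a trained layer of the network).

```python
import torch
import torch.nn as nn

def decode_tokens(tokens, patch=2, channels=4, grid=16):
    B, N, D = tokens.shape
    head = nn.Linear(D, patch * patch * 2 * channels)      # predicts noise + covariance per patch
    out = head(tokens)                                      # (B, N, p*p*2C)
    out = out.view(B, grid, grid, patch, patch, 2 * channels)
    out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, 2 * channels,
                                                grid * patch, grid * patch)
    noise, covariance = out.chunk(2, dim=1)                 # each (B, C, H, W)
    return noise, covariance

noise, cov = decode_tokens(torch.randn(1, 256, 768))
print(noise.shape, cov.shape)                               # (1, 4, 32, 32) twice
```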
The second feature is Flow Matching (FM). According to Stability AI, it is an efficient, simulation-free method for training CNF models, in which the CNF training process is supervised along a general family of probability paths. Most importantly, FM removes the barrier that previously kept scalable CNF training confined to the diffusion setting: it operates directly on probability paths without requiring a deep understanding of the diffusion process, thereby sidestepping the difficulties of traditional training.
CNF stands for continuous normalizing flows, a probabilistic modeling and generative technique in deep learning. In a CNF, a simple probability distribution is transformed into the distribution of complex, high-dimensional data through a series of reversible, continuous transformations. These transformations are usually parameterized by a neural network, so that the original random variable can be continuously reshaped to match the target data distribution. In plain terms, a CNF generates data a bit like rolling dice: you start from pure randomness and gradually mold it into samples from the target distribution.
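To get a feel for what a CNF does at sampling time, here is a toy sketch that integrates a velocity field with a simple Euler solver; `velocity_net` is an untrained stand-in for whatever network you would actually train, and the shapes are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in velocity field: takes (x, t) and returns a 2-D velocity.
velocity_net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))

def sample(n=1000, steps=100):
    x = torch.randn(n, 2)                      # start from a simple base distribution
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        with torch.no_grad():
            x = x + velocity_net(torch.cat([x, t], dim=1)) / steps  # Euler step
    return x                                   # after training, approximates the data distribution
```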
In practice, however, training a CNF requires a great deal of computation and time, so the natural question is: can there be a method that produces nearly the same result as CNF training, but with a more stable process and a lower computational cost? The answer is FM. It is essentially a technique for training CNF models to fit and simulate the evolution of a given data distribution, even when we do not know in advance the distribution's specific mathematical form or the corresponding generative vector field. By optimizing the FM objective function, the model learns a vector field whose generated probability distribution approximates the real data distribution.
Compared with the CNF itself, FM is better thought of as a training objective: its goal is to make the vector field produced by the CNF model match, as closely as possible, the vector field of the ideal target probability path.
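As a concrete, deliberately generic example, here is a sketch of the conditional flow matching loss under the common linear interpolation path; this is the textbook FM recipe, not necessarily the exact formulation used in Stable Diffusion 3.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net, x1):
    x0 = torch.randn_like(x1)                        # sample from the base distribution
    t = torch.rand(x1.shape[0], 1)                   # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the linear probability path
    target = x1 - x0                                 # velocity of that path (constant in t)
    pred = velocity_net(torch.cat([xt, t], dim=1))
    return nn.functional.mse_loss(pred, target)      # regress predicted velocity onto the target

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))
loss = flow_matching_loss(net, torch.randn(32, 2))   # loss for one training step on toy 2-D data
loss.backward()
```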
After reading about these two core technical features of Stable Diffusion 3, you will notice that it is actually very close to Sora. Both are Transformer-based models (Stable Diffusion previously used a U-Net), both work with patches, both represent a leap in stability and optimization, and their release dates are so close together that I don't think it is a stretch to call them relatives.
However, there is one fundamental difference between these "brothers": Sora is closed source, while Stable Diffusion 3 is open source. In fact, Midjourney and DALL·E are both closed source as well; only Stable Diffusion is open. If you follow open-source artificial intelligence, you will have noticed that the open-source community has been stuck for quite a while without any obvious breakthrough, and many people have lost confidence in it. Stable Diffusion 2 and Stable Diffusion XL mostly just improved the aesthetics of the generated images, something Stable Diffusion 1.5 could already do. Seeing the revolutionary improvements in Stable Diffusion 3 could rekindle many developers' confidence in the open-source community.
Even more exciting, Stability AI's CEO Emad Mostaque said in a tweet that although Stability AI has a hundred times fewer resources than some other companies in the AI field, the Stable Diffusion 3 architecture can already accept content beyond text and images, though he cannot reveal much more at the moment.
Text and images I can understand, but what counts as content "beyond" them? The only thing I can think of is audio: generating from a piece of sound? It is puzzling, but once Stability AI publishes its latest research results, we will be among the first to interpret them.