Stable Diffusion Analysis: Exploring the Technology Behind AI Painting

Mondo Technology Updated on 2024-02-27

Before talking about Stable Diffusion, it is necessary to understand the evolution of AI painting.

Back in 2012, a team led by Chinese-American scientist Andrew Ng trained what was then the world's largest deep learning network. In just three days, the network learned on its own to recognize objects such as cats, and produced a blurry but recognizable image of a cat. Crude as that image was, it demonstrated the potential of deep learning for image recognition.

In 2014, Ian Goodfellow, then at the University of Montreal in Canada (and later a Google scientist), proposed generative adversarial networks (GANs), which became the mainstream direction of AI-generated imagery. A GAN trains two deep neural networks, a generator and a discriminator: the generator learns to produce new samples that resemble the real data, while the discriminator learns to tell the generator's fake samples apart from real data. The core idea is that the generator tries to fool the discriminator and the discriminator tries to tell real from fake; through this adversarial interplay the two push each other toward high-quality data generation.
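The adversarial loop described above can be sketched on a toy problem. The following is a minimal illustration of the idea, not any production GAN: the "generator" is a one-parameter affine map on noise, the "discriminator" is logistic regression, and both are updated with hand-derived gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data: samples from N(4, 1).
# Generator g(z) = w*z + b; discriminator d(x) = sigmoid(a*x + c).
w, b = 1.0, 0.0   # generator parameters
a, c = 0.1, 0.0   # discriminator parameters
lr = 0.01

for step in range(2000):
    # Discriminator update: descend the binary cross-entropy loss
    # -log d(real) - log(1 - d(fake)).
    real = rng.normal(4.0, 1.0, 64)
    fake = w * rng.normal(0.0, 1.0, 64) + b
    d_real, d_fake = sigmoid(a * real + c), sigmoid(a * fake + c)
    a -= lr * (np.mean((d_real - 1.0) * real) + np.mean(d_fake * fake))
    c -= lr * (np.mean(d_real - 1.0) + np.mean(d_fake))

    # Generator update (non-saturating loss): descend -log d(fake),
    # i.e. push the discriminator to label fakes as real.
    z = rng.normal(0.0, 1.0, 64)
    fake = w * z + b
    grad_logit = (sigmoid(a * fake + c) - 1.0) * a
    w -= lr * np.mean(grad_logit * z)
    b -= lr * np.mean(grad_logit)

# The generator's offset b drifts from 0 toward the real mean of 4.
```

The two updates alternate, which is exactly the "fight each other" dynamic: each player's loss depends on the other's current parameters.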

In 2016, the first text-to-image GAN model, GAN-INT-CLS, was published, demonstrating the feasibility of generating images from text with GANs and opening the door to a wave of GAN-based conditional image generation models. However, GAN training is prone to instability and collapse, making it difficult to apply at scale.

In October 2017, NVIDIA proposed ProgressiveGAN, which grows the neural network step by step during training in order to generate high-resolution images. This made models easier to train and improved generation quality, paving the way for the later rise of StyleGAN.

In 2017, Google published the famous paper "Attention Is All You Need", which proposed the Transformer architecture; it went on to shine in natural language processing. Although the Transformer was designed for natural language problems, it also showed great potential for image generation. In 2020, Google proposed ViT (Vision Transformer), which attempts to replace the traditional convolutional neural network (CNN) with a Transformer architecture in computer vision.

A turning point came in 2020, when researchers at the University of California, Berkeley proposed the now well-known denoising diffusion probabilistic model (DDPM). DDPM simplifies the loss function of earlier diffusion models by reframing the training target as predicting the noise added at the current step, which greatly reduces training difficulty, and it swaps the network backbone from a plain fully convolutional network to a U-Net, improving the model's expressive power.
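The simplified objective can be shown compactly. Below is a hedged sketch of the idea, with toy arrays standing in for images and a zero function standing in for the U-Net noise predictor; the closed-form forward noising step and the noise-prediction loss are the parts that mirror DDPM:

```python
import numpy as np

rng = np.random.default_rng(0)

# DDPM forward process in closed form:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)   # decreasing signal fraction

def noisy_sample(x0, t, eps):
    """Jump straight to step t of the noising process."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.normal(size=(8, 16))          # toy "images"
t = int(rng.integers(0, T))            # random timestep
eps = rng.normal(size=x0.shape)        # the noise actually added
x_t = noisy_sample(x0, t, eps)

# Simplified DDPM training target: the network just predicts eps.
eps_pred = np.zeros_like(eps)          # stand-in for a U-Net prediction
loss = np.mean((eps - eps_pred) ** 2)  # L_simple = E||eps - eps_theta(x_t, t)||^2
```

Training reduces to regressing the added noise at a random timestep, which is why the objective is so much more stable than adversarial training.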

In January 2021, OpenAI released DALL·E, built on a VQ-VAE-style discrete autoencoder, and CLIP (Contrastive Language-Image Pre-training), used for text-to-image generation and text-image contrastive learning, respectively. For the first time, AI seemed to truly "understand" human descriptions and create from them, sparking unprecedented enthusiasm for AI painting. In October 2021, the first version of the Disco Diffusion model appeared, kicking off the era of diffusion models with its striking image generation.

In February 2022, an updated Disco Diffusion, an AI image generator based on diffusion models developed by engineers from the open-source community, was launched. Since then, AI painting has entered a fast track of development; Pandora's box had been opened. Disco Diffusion is easier to use than earlier AI models, and as researchers built up thorough documentation and a community around it, more and more people took notice. In March of the same year, Midjourney, an AI image generator developed with the participation of Disco Diffusion's core developers, was officially released. Midjourney chose to run on the Discord platform, using chat-style human-computer interaction: there are no complex parameters to tune, and simply typing text into the chat window generates images.

What's more, Midjourney's results are so stunning that the average person can barely tell whether a given artwork was drawn by an AI. Five months after Midjourney's release, an art competition at the Colorado State Fair in the United States announced its results, and a painting titled "Théâtre D'opéra Spatial" (Space Opera Theater) won first place. It was not the work of a human artist; it was created with Midjourney.

When the contestant revealed that the work had been painted by AI, it sparked anger and anxiety among many human painters.

In April 2022, OpenAI's previously mentioned DALL·E 2 was released. Images from Disco Diffusion or Midjourney are still recognizably AI-generated, but images from DALL·E 2 can be hard to distinguish from human work.

On July 29, 2022, Stable Diffusion, an AI image generator developed by the company Stability AI, entered closed beta. Testers found that the quality of its paintings was comparable to DALL·E 2, with fewer content restrictions. The closed beta ran in four waves and invited 15,000 users; after just ten days, 1.7 million images had been generated with it. Crucially, Stability AI adheres to an open-source philosophy, "AI by the people, for the people", which means anyone can deploy their own AI painting generator locally, truly realizing the idea that anyone who can describe a picture can create one. The open-source community Hugging Face quickly adapted the model, making individual deployment easier, and the open-source tool stable-diffusion-webui integrates a variety of image generation tools and can even fine-tune models and train personal models from the browser. It has been well received, earning some 34,000 stars, and with it the diffusion generation model moved decisively out of large-scale services and into individual deployments.

In November 2022, Stable Diffusion 2.0 was released; the new version generates images at four times the resolution and is faster.

Based on Latent Diffusion Models, Stable Diffusion moves the most time-consuming diffusion process into a low-dimensional latent space, which greatly reduces the computing power required and lowers the threshold for personal deployment. It uses a latent-space downsampling factor of 8; in other words, the height and width of the image are each reduced to one-eighth of the original, so a 512×512 image becomes 64×64 in the latent space, saving a factor of 64 in memory. On top of this, Stable Diffusion further reduces performance requirements: it can produce a detailed 512×512 image in seconds on a single NVIDIA RTX 2060 graphics card with 8 GB of VRAM. Without this spatial compression, by the article's estimate the same computation would call for a graphics card with 512 GB of video memory; going by the pace of graphics hardware evolution, it would take at least 8 to 10 years before consumers could enjoy such applications. This key algorithmic iteration brought AI painting into everyone's life ahead of schedule.
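The memory arithmetic above is easy to verify. With a downsampling factor of 8 (the factor used by Stable Diffusion's autoencoder), each spatial dimension shrinks by 8, so the number of spatial positions shrinks by 8 × 8 = 64:

```python
f = 8                      # latent-space downsampling factor
h, w = 512, 512            # pixel-space image size
lh, lw = h // f, w // f    # latent grid: 64 x 64
spatial_saving = (h * w) // (lh * lw)
print(lh, lw, spatial_saving)  # prints: 64 64 64
```

This is the "64 times" figure in the text: a per-position memory saving, before accounting for the number of channels in the latent representation.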

In this article, we took a look at how Stable Diffusion came to be and how it has evolved. If you're also a fan of AI painting, feel free to talk to me**. In the future, I will continue to update this series and share Stable Diffusion tutorials and teaching content for other AI painting software. If you like this content, please *** thank you for reading, and see you in the next issue!
