New SOTA for object detection: YOLOv9 is released, and its new architecture revitalizes traditional convolution

Mondo Technology Updated on 2024-02-23

Reported by Heart of the Machine.

Heart of the Machine Editorial Department

In object detection, YOLOv9 again improves on its predecessor, using a new architecture and new methods that let conventional convolution surpass depthwise convolution in parameter utilization.
More than a year after the official release of YOLOv8 in January 2023, YOLOv9 is finally here!

As we know, YOLO is an object detection system that works from global image information. Since Joseph Redmon, Ali Farhadi, and colleagues proposed the first version in 2015, researchers have updated and iterated on YOLO many times, and the model has grown steadily more capable.

This time, YOLOv9 was jointly developed by Academia Sinica in Taiwan, National Taipei University of Technology, and other institutions, and the accompanying paper, "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information", has been released.

Today's deep learning methods focus on designing the most appropriate objective function so that the model's predictions come as close as possible to the ground truth. At the same time, an appropriate architecture must be designed so that enough information is available for prediction. However, existing methods overlook the fact that a large amount of information is lost as the input data passes through layer-by-layer feature extraction and spatial transformation.

Therefore, YOLOv9 examines the important problem of information loss as data is transmitted through deep networks, specifically the information bottleneck and reversible functions.
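In symbols, the two ideas can be sketched as follows. The notation is assumed here for illustration: $I$ denotes mutual information, $f_\theta$ and $g_\phi$ are two consecutive transformations in the network, and $r_\psi$ is a reversible function with inverse $v_\zeta$.

```latex
% Information bottleneck: each additional transformation can only keep or lose
% information about the input X, never add to it.
I(X, X) \;\ge\; I\bigl(X, f_\theta(X)\bigr) \;\ge\; I\bigl(X, g_\phi(f_\theta(X))\bigr)

% Reversible function: the input can be recovered from the output,
% so no information about X is lost along the way.
X = v_\zeta\bigl(r_\psi(X)\bigr), \qquad
I(X, X) = I\bigl(X, r_\psi(X)\bigr) = I\bigl(X, v_\zeta(r_\psi(X))\bigr)
```

The first chain says that the deeper the network, the more of the original data may already be gone by the time the loss is computed. The second describes the ideal case in which every layer preserves the input completely.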

The researchers proposed Programmable Gradient Information (PGI) to cope with the various changes that deep networks require in order to achieve multiple objectives. PGI provides complete input information to the target task for computing the objective function, so that reliable gradient information is obtained for updating the network weights.

In addition, the researchers designed a new lightweight network architecture based on gradient path planning: the Generalized Efficient Layer Aggregation Network (GELAN). This architecture confirms that PGI can achieve excellent results on lightweight models.

The researchers validated the proposed GELAN and PGI on object detection using the MS COCO dataset. The results show that, compared with SOTA methods built on depthwise convolution, GELAN achieves better parameter utilization using only conventional convolution operators.
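For context, the comparison rests on the difference between a conventional convolution and a depthwise separable one, which is normally chosen precisely because it needs far fewer weights. The following is a minimal sketch of the standard weight-count formulas (bias terms ignored); the function names and the layer sizes are illustrative assumptions and are not taken from YOLOv9.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a conventional k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes only, not a layer from any YOLO model.
k, c_in, c_out = 3, 64, 128
standard = conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(standard, separable)
```

Per layer, the depthwise design is much cheaper, so the reported result is notable: it concerns how much accuracy a whole network obtains per parameter, not the cost of a single operator.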

PGI itself is highly versatile and can be used in models ranging from lightweight to large. It can be used to obtain complete information, which enables models trained from scratch to achieve better results than SOTA models pre-trained on large datasets. Figure 1 below illustrates some of the comparisons.

Alexey Bochkovskiy, who took part in the development of YOLOv7, YOLOv4, Scaled-YOLOv4, and DPT, spoke highly of the newly released YOLOv9, saying that it is superior to any convolutional or transformer-based object detector.

One user said that YOLOv9 looks like the new SOTA real-time object detector and that a custom training tutorial of their own is on the way.

Other, more industrious users have already added pip support for the YOLOv9 model.

Let's take a closer look at YOLOv9.

Problem statement

People often attribute the difficulty of convergence in deep neural networks to factors such as vanishing gradients or gradient saturation, and these phenomena do exist in traditional deep neural networks. Modern deep neural networks, however, have fundamentally solved these problems through various normalization and activation functions. Even so, deep neural networks still suffer from slow convergence or poor convergence results. So what is the essence of the problem?

Through an in-depth analysis of the information bottleneck, the researchers deduced the root cause: the initial gradient coming from a very deep network loses much of the information needed to achieve the goal soon after it is transmitted. To test this inference, the researchers ran feedforward passes through deep networks of different architectures with initial weights. Figure 2 visualizes the result. PlainNet clearly loses a great deal of the information needed for object detection in its deep layers. As for ResNet, CSPNet, and GELAN, the proportion of important information each retains is indeed positively correlated with the accuracy obtained after training. The researchers further designed a reversible-network-based method to address these problems.
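What "reversible" means can be illustrated with an additive coupling layer, a construction borrowed here from the reversible-network literature. This is only a sketch of the general idea and not the branch design used in YOLOv9; `make_layer` and the other names are hypothetical helpers.

```python
import math
import random


def make_layer(rng, dim):
    """Return a small random nonlinear map on length-dim vectors (hypothetical helper)."""
    weights = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return layer


def coupling_forward(x1, x2, f, g):
    """Additive coupling: the outputs keep all information about (x1, x2)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def coupling_inverse(y1, y2, f, g):
    """Exact inverse of coupling_forward, obtained by undoing each step in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


rng = random.Random(0)
dim = 8
f, g = make_layer(rng, dim), make_layer(rng, dim)
x1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

y1, y2 = coupling_forward(x1, x2, f, g)
r1, r2 = coupling_inverse(y1, y2, f, g)

# Largest reconstruction error across both halves; it should be at floating-point noise level.
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_error)
```

Although `f` and `g` are themselves lossy nonlinear maps, the layer as a whole can be undone, so the input is recoverable from the output. A plain stack of layers offers no such guarantee.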

Methodology

Programmable Gradient Information (PGI)

This study proposes a new auxiliary supervision framework, Programmable Gradient Information (PGI), as shown in Figure 3(d).

PGI consists of three main components: (1) the main branch, (2) the auxiliary reversible branch, and (3) multi-level auxiliary information.

Inference with PGI uses only the main branch, so there is no additional inference cost.

The auxiliary reversible branch addresses a problem caused by deepening the network: deeper networks create an information bottleneck that prevents the loss function from generating reliable gradients.

Multi-level auxiliary information addresses the error accumulation caused by deep supervision, especially in architectures with multiple prediction branches and in lightweight models.
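The split between training and inference can be sketched with a toy model. Everything here is an assumption made for illustration: the scalar "weights", the class name `ToyPGI`, and the `aux_scale` value are hypothetical and do not reflect the actual YOLOv9 architecture. The sketch shows only two things: the auxiliary branch reads the shared features, so its loss also sends gradient into the shared part of the network, and it is skipped entirely at inference.

```python
class ToyPGI:
    """Toy stand-in: shared features, a main head, and a training-only auxiliary head."""

    def __init__(self, stem=0.5, main_head=2.0, aux_head=1.5):
        self.stem = stem            # shared parameters, updated by both losses
        self.main_head = main_head  # used in training and inference
        self.aux_head = aux_head    # used only during training

    def features(self, x):
        return [self.stem * v for v in x]

    def main_branch(self, x):
        return [self.main_head * f for f in self.features(x)]

    def auxiliary_branch(self, x):
        # Sees the shared features plus the raw input, so its loss depends on self.stem.
        return [self.aux_head * (f + v) for f, v in zip(self.features(x), x)]

    def forward_train(self, x):
        """Training: both branches produce predictions that receive supervision."""
        return self.main_branch(x), self.auxiliary_branch(x)

    def forward_inference(self, x):
        """Inference: only the main branch runs, so the auxiliary branch adds no cost."""
        return self.main_branch(x)


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(model, x, target, aux_scale=0.25):
    """Main loss plus a scaled auxiliary loss; aux_scale is an assumed value."""
    main_out, aux_out = model.forward_train(x)
    return mse(main_out, target) + aux_scale * mse(aux_out, target)


model = ToyPGI()
x, target = [1.0, 2.0], [1.0, 2.0]
print(training_loss(model, x, target))
print(model.forward_inference(x))
```

Dropping the auxiliary head after training leaves the inference path unchanged, which is why the extra supervision costs nothing at inference time.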

GELAN network

In addition, the study proposes a new network architecture, GELAN (shown in the figure below). The researchers combined two neural network architectures, CSPNet and ELAN, to design a Generalized Efficient Layer Aggregation Network (GELAN) that balances light weight, inference speed, and accuracy. They generalized ELAN, which originally stacked only convolutional layers, into a new architecture that can use any computational block.
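The "any computational block" idea can be sketched as a layer that takes its blocks as parameters. This is a loose illustration under stated assumptions and not the official implementation: feature maps are reduced to one number per channel, and `gelan_like_layer`, `double`, `relu`, and `identity` are hypothetical names.

```python
from typing import Callable, List, Sequence

Features = List[float]  # one value per channel, a stand-in for a real feature map
Block = Callable[[Features], Features]


def gelan_like_layer(x: Features, blocks: Sequence[Block], transition: Block) -> Features:
    """Split the channels, run a chain of arbitrary blocks, aggregate every stage, then fuse."""
    half = len(x) // 2
    stages = [x[:half], x[half:]]       # channel split, in the spirit of CSP-style designs
    for block in blocks:                # each block consumes the previous stage's output
        stages.append(block(stages[-1]))
    aggregated = [v for stage in stages for v in stage]  # concatenate along channels
    return transition(aggregated)       # final fusion step


# Interchangeable toy blocks; a real network would plug in convolutional blocks here.
def double(f: Features) -> Features:
    return [2.0 * v for v in f]


def relu(f: Features) -> Features:
    return [max(0.0, v) for v in f]


def identity(f: Features) -> Features:
    return list(f)


out = gelan_like_layer([1.0, -2.0, 3.0, -4.0], [double, relu], identity)
print(out)
```

Because the layer only assumes that each block maps features to features, swapping `double` and `relu` for any other block leaves the aggregation logic untouched.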

Experimental results

To evaluate YOLOv9, the study first carried out a comprehensive comparison between YOLOv9 and other real-time object detectors trained from scratch; the results are shown in Table 1 below.

The study also included ImageNet pre-trained models in the comparison; the results are shown in Figure 5 below. Notably, YOLOv9 with conventional convolution achieves even better parameter utilization than YOLO MS with depthwise convolution.

Ablation experiments

To examine the role of the individual components in YOLOv9, the study carried out a series of ablation experiments.

The study begins with an ablation experiment on GELAN's computational blocks. As shown in Table 2 below, the system maintained good performance when the convolutional layers in ELAN were replaced with different computational blocks.

The study then ran ablation experiments on ELAN block depth and CSP block depth for GELAN models of different sizes; the results are shown in Table 3 below.

For PGI, the researchers ran ablation studies of the auxiliary reversible branch on the backbone and of multi-level auxiliary information on the neck. Table 4 lists the results of all experiments. As Table 4 shows, PFH is effective only for deep models, while the PGI proposed in the paper improves accuracy under different combinations.

The researchers further applied PGI and deep supervision to models of different sizes and compared the results, which are shown in Table 5.

Figure 6 shows the results of adding components step by step from the baseline YOLOv7 to YOLOv9-E.

Visualization

The researchers also visualized the information bottleneck problem. Figure 6 shows feature maps obtained by running feedforward passes through different architectures with random initial weights.

Figure 7 illustrates whether PGI provides more reliable gradients during training, so that the parameters being updated effectively capture the relationship between the input data and the target.

For more technical details, please read the original article.
