Super detailed practical guide! This article explains the 3 object recognition methods commonly used in machine vision

Mondo Technology Updated on 2024-02-26

With the rapid development of machine vision technology, many traditional tasks that once required manual operation are gradually being taken over by machines.

Most traditional target recognition methods are hand-engineered: shape, color, length, width, aspect ratio, and similar measurements are used to judge whether the identified target meets the standard, and a series of rules is then defined for recognition. This approach works well in some simple cases. Its drawback is that all the rules and algorithms have to be redesigned and redeveloped whenever the identified object changes; even for the same product, variation between batches can make the rules impossible to reuse.

With the development of machine learning and deep learning, many features that are difficult to quantify directly by eye can now be learned automatically by deep learning; this is its great advantage and unprecedented appeal.

Many features that traditional algorithms cannot quantify, or can quantify only with difficulty, deep learning handles well. The improvements are especially significant in image classification and object detection.

Three object recognition methods are commonly used in machine vision: blob analysis, template matching, and deep learning. The following compares these three methods.

Blob analysis


In computer vision, a blob is a connected region of an image whose pixels share similar color, texture, or other characteristics. Blob analysis analyzes such connected domains of similar pixels in an image (each connected domain is called a blob). The process is to binarize the image, segmenting foreground from background, and then run connected-region detection to obtain the blob regions. Put simply, blob analysis finds small areas of "grayscale mutation" within an otherwise "smooth" region.

For example, suppose a piece of glass has just been produced and its surface is very smooth and flat. If the glass is flawless, no "grayscale mutation" will be detected. If, on the other hand, the production line leaves a raised bubble, a dark spot, or a crack on the glass, then that texture can be detected, and the spot it produces in the binary-thresholded image can be treated as a blob. These regions are exactly the defects introduced by the production process, and finding them is blob analysis.

A blob analysis tool can isolate objects from the background and calculate the number, location, shape, orientation, and size of the targets, as well as provide the topology between related blobs. Instead of analyzing individual pixels one by one, the process operates on image rows: each row is stored as run-length encoded (RLE) spans representing adjacent target ranges. Compared with pixel-based algorithms, this greatly improves processing speed.
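As a minimal sketch of this pipeline, the following OpenCV snippet binarizes an image and extracts per-blob statistics with connected-component labeling. The file name and the threshold value are assumptions chosen for illustration:

```python
import cv2

# Load the inspection image as grayscale; "glass.png" is a hypothetical file name.
img = cv2.imread("glass.png", cv2.IMREAD_GRAYSCALE)

# Binarize to separate foreground (potential defects) from the smooth background.
# The threshold 60 is an assumed value; in practice it is tuned per application
# or replaced by Otsu's method (cv2.THRESH_OTSU).
_, binary = cv2.threshold(img, 60, 255, cv2.THRESH_BINARY_INV)

# Connected-component labeling: each connected foreground region is one blob.
# stats holds x, y, width, height, and area per blob; centroids holds centers.
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

# Label 0 is the background; every remaining label is a detected blob.
for i in range(1, num_labels):
    x, y, w, h, area = stats[i]
    print(f"blob {i}: pos=({x},{y}) size={w}x{h} area={area}")
```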

On the other hand, blob analysis is not suitable for the following images:

1. Low-contrast images;

2. Images whose necessary features cannot be described with two gray levels;

3. Detection that must follow a given pattern (template detection required).

In general, blob analysis detects spots or speckles in an image. It suits scenarios with a uniform background where foreground defects need not be classified into categories, and its recognition accuracy is comparatively low.

Template matching method


Template matching is the most primitive and basic pattern recognition method: it asks where in an image the pattern of a particular object is located, and then identifies that object, which makes it a matching problem. It is the most basic and most commonly used matching method in image processing. In other words, given a small image (the template), we search for a target within a larger image, knowing that the target exists in that image and has the same size, orientation, and image elements as the template. By statistically comparing features of the image such as the mean, gradient, distance, and variance, the target can be found and its coordinate position determined.

This shows that the template we search for must appear in the image exactly as it does in the standard. Once the image or template changes, for example by rotation, modification of certain pixels, or flipping, the match fails; this is the main disadvantage of the algorithm.

Concretely, the matching algorithm slides the template across the image under inspection from left to right and from top to bottom, comparing the template against each local patch, as the sketch below shows.
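Here is a minimal OpenCV sketch of this sliding comparison, assuming a grayscale scene and template; the file names and the 0.8 acceptance threshold are illustrative assumptions:

```python
import cv2

# Hypothetical file names; any grayscale scene image and template will do.
scene = cv2.imread("board.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("part_template.png", cv2.IMREAD_GRAYSCALE)
th, tw = template.shape

# Slide the template over the scene left-to-right, top-to-bottom, scoring
# each position with normalized cross-correlation.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)

# The best-scoring position is the most likely target location.
_, max_val, _, max_loc = cv2.minMaxLoc(scores)
x, y = max_loc
print(f"best match at ({x},{y})-({x + tw},{y + th}), score={max_val:.3f}")

# An acceptance threshold (0.8 is an assumed value) decides whether to trust
# the match at all; rotation or scale changes push the score down sharply.
if max_val < 0.8:
    print("no reliable match: target pose or scale may differ from the template")
```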

This method achieves better detection accuracy than blob analysis and can also distinguish between defect categories. It is essentially a search algorithm: for each ROI on the image under inspection, it searches and matches against all images in a template library using the specified matching method. Because it requires the shape, size, and orientation of defects to be highly consistent, a fairly complete template library must be built before usable detection accuracy is reached.

Deep learning methods


In 2014, R-CNN was proposed, and CNN-based object detection algorithms gradually became mainstream. The application of deep learning improved both detection accuracy and detection speed.

Convolutional neural networks not only extract higher-level, more expressive features, but also perform feature extraction, feature selection, and classification within the same model.

In this regard, mainstream algorithms fall into two main types:

The first type is the two-stage object detection algorithms, the R-CNN series, which are classification-based and combine a CNN with a region proposal network (RPN).

The other type is the one-stage (single-stage) object detection algorithms, which turn object detection into a regression problem.

The task of object detection is to find the objects of interest in an image or video, detecting their position and size at the same time; it is one of the core problems in the field of machine vision.

Object detection faces many sources of uncertainty, such as the unknown number of objects in an image, variation in object appearance, shape, and pose, and interference from lighting, occlusion, and other factors during imaging, all of which make the detection task difficult.

Since the deep learning era began, object detection has developed mainly in two directions: two-stage algorithms such as the R-CNN series, and one-stage algorithms such as YOLO and SSD. The main difference between them is that a two-stage algorithm first generates proposals (candidate boxes that may contain the object to be detected) and then performs fine-grained object detection, while a one-stage algorithm extracts features directly in the network to classify and localize objects.

In two-stage algorithms, the core of region extraction is the convolutional neural network: a CNN backbone first extracts features, candidate regions are then found, and finally the candidate windows are classified and regressed to determine the target category and position.

R-CNN first extracts about 2,000 regions of interest with the selective search (SS) algorithm and then performs feature extraction on each region separately. Its defects: computation cannot be shared between overlapping regions of interest, so there is much duplicated work; intermediate data must be stored separately, occupying resources; and the forced scaling of inputs hurts detection accuracy.

SPP-Net inserts processing between the last convolutional layer and the first fully connected layer to guarantee that the input to the fully connected layer has a fixed size, which removes the restriction on input image size. In SPP-Net the candidate regions are taken from a feature map of the whole image, so the features of the whole image and of all candidate regions can be obtained in a single pass of the convolutional network.
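A minimal PyTorch sketch of the idea, assuming a three-level pyramid of 4x4, 2x2, and 1x1 pooled grids (the exact levels and the 256-channel feature maps are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SPPLayer(nn.Module):
    """Spatial pyramid pooling sketch: pool an arbitrary-size feature map into
    fixed 4x4, 2x2, and 1x1 grids and concatenate, so the following fully
    connected layer always receives a vector of the same length."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(s) for s in levels])

    def forward(self, x):  # x: (N, C, H, W) with any H and W
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)

spp = SPPLayer()
for size in (14, 19):  # two different feature-map sizes
    feat = torch.randn(1, 256, size, size)  # assumed backbone output
    print(spp(feat).shape)  # always (1, 256 * (16 + 4 + 1)) = (1, 5376)
```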

Drawing on SPP-Net's feature pyramid, Fast R-CNN proposes RoI pooling, which maps the feature maps of candidate regions of various sizes into feature vectors of a unified scale: each candidate region is divided into an M × N grid of blocks, and max pooling is applied within each block to obtain one value. In this way the feature maps of all candidate regions are unified into M × N-dimensional feature vectors. However, using the SS algorithm to generate candidate boxes is still very time-consuming.
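The same operation ships in torchvision; the sketch below pools two candidate regions of very different sizes into identical 7 × 7 feature blocks (the feature-map size, stride, and box coordinates are illustrative assumptions):

```python
import torch
from torchvision.ops import roi_pool

# Assumed setup: one backbone feature map with stride 16 relative to the image.
feature_map = torch.randn(1, 256, 50, 50)

# Candidate boxes in image coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],     # a small region
                     [0, 300.0, 40.0, 760.0, 560.0]])   # a large region

# RoI pooling divides each region into a fixed 7x7 grid and max-pools each
# cell, so both regions come out as identical 256x7x7 feature blocks.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```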

Faster R-CNN addresses this by generating proposals with a region proposal network (RPN). After the feature map enters the RPN, 9 anchor boxes of different sizes and shapes are preset at each feature point. The intersection-over-union and offset between each anchor box and the ground-truth box are computed to decide whether a target is present at that location, the predefined anchors are divided into foreground and background, and the RPN is trained on the deviation loss to regress and correct the ROI positions. The corrected ROIs are finally passed to the subsequent network. During detection, however, the RPN must first perform one round of regression and screening to separate foreground from background targets, and the subsequent detection network then performs fine classification and position regression on the ROIs output by the RPN, which leads to a large number of model parameters.
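A sketch of how such a 9-anchor set can be generated, using the 3 scales × 3 aspect ratios and base stride commonly associated with Faster R-CNN (the exact values are assumptions, not fixed by the text above):

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate the 9 reference anchors (3 aspect ratios x 3 scales) preset
    at every feature-map position; sizes follow common Faster R-CNN defaults,
    which are an assumption here."""
    anchors = []
    for r in ratios:          # r is the height/width ratio
        for s in scales:
            area = (base_size * s) ** 2
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])  # centered at (0, 0)
    return np.array(anchors)

# At a feature point mapped to image location (cx, cy), the concrete anchors
# are these reference boxes shifted by (cx, cy, cx, cy).
print(make_anchors().round(1))  # 9 boxes, roughly 181x91 up to 362x724 pixels
```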

Mask R-CNN adds a parallel mask branch to Faster R-CNN to generate a pixel-level binary mask for each ROI. In Faster R-CNN, RoI pooling generates the uniform-scale feature maps by quantizing coordinates, so when mapping back to the original image the positions are misaligned and pixels cannot be accurately aligned. The impact on box-level object detection is relatively small, but for pixel-level segmentation tasks the error cannot be ignored. Mask R-CNN uses bilinear interpolation (RoIAlign) to solve the problem of inaccurate pixel alignment. However, because it inherits the two-stage design, its real-time performance is still not ideal.
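Both operators are available in torchvision, which makes the difference easy to see; the fractional box below is an illustrative assumption:

```python
import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.randn(1, 256, 50, 50)
rois = torch.tensor([[0, 13.7, 27.2, 251.9, 303.4]])  # arbitrary sub-pixel box

# roi_pool snaps the box to integer feature-map cells (quantization), causing
# the misalignment described above; roi_align instead samples the feature map
# with bilinear interpolation at exact fractional positions.
coarse = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(coarse.shape, aligned.shape)  # both (1, 256, 7, 7)
```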

A one-stage algorithm performs feature extraction, target classification, and position regression within a single convolutional network, obtaining target positions and categories in one pass. With recognition accuracy only slightly weaker than that of two-stage object detection algorithms, it greatly improves speed.

YOLOv1 uniformly scales the input image to 448 × 448 × 3 and divides it into a 7 × 7 grid; each grid cell predicts the positions and confidences of two bounding boxes (bboxes). These two bboxes share the same category, one intended for a large target and one for a small target. The bbox positions need no manual initialization: they are computed by the YOLO model after weight initialization, and the model adjusts them as the network weights are updated during training. However, the algorithm does not detect small targets well, and each grid cell can predict only one category.
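The resulting output tensor is easy to reconstruct, assuming the 20 PASCAL VOC classes the original model was trained on:

```python
# YOLOv1 output layout: S x S grid, B boxes per cell, C shared class scores.
S, B, C = 7, 2, 20  # 20 classes assumes the PASCAL VOC setting

# Each box carries x, y, w, h, confidence (5 values); the C class
# probabilities are shared by both boxes of a cell, which is why a cell
# can predict only one category.
per_cell = B * 5 + C
print(per_cell)          # 30
print((S, S, per_cell))  # the network's output tensor shape: (7, 7, 30)
```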

YOLOv2 divides the image into a 13 × 13 grid and, through cluster analysis of the training annotations, sets 5 anchor boxes for each grid cell; each anchor box predicts one category, and the target position is regressed through the offset between the anchor box and the grid cell.
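A sketch of that cluster analysis: k-means over ground-truth box widths and heights with a 1 − IoU distance (the synthetic data here stands in for real annotations):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, compared as if all boxes shared one corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
        + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    """Cluster ground-truth (w, h) pairs with the 1 - IoU distance to pick
    k anchor shapes; boxes would normally come from the training set."""
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # max IoU = min distance
        centroids = np.array([boxes[assign == j].mean(axis=0)
                              if np.any(assign == j) else centroids[j]
                              for j in range(k)])
    return centroids

boxes = np.abs(np.random.randn(1000, 2)) * 50 + 20  # synthetic (w, h) pairs
print(kmeans_anchors(boxes).round(1))                # 5 learned anchor shapes
```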

SSD retains the grid-division approach but extracts features from several different convolutional layers of the backbone network. As the convolutional layers deepen, the preset anchor box sizes are set from small to large, which improves SSD's detection accuracy for multi-scale targets.
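This small-to-large schedule follows a simple linear rule in the SSD paper; a few lines reproduce it (0.2 and 0.9 are the paper's default minimum and maximum scales):

```python
# SSD anchor scales grow linearly from s_min to s_max across the m feature
# maps used for detection; deeper (coarser) maps get the larger anchors.
s_min, s_max, m = 0.2, 0.9, 6
scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```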

YOLOv3 uses cluster analysis to preset 3 anchor boxes per grid cell at each scale, uses only the first 52 layers of Darknet-53 as its backbone, and makes heavy use of residual layers, relying on convolutional downsampling to avoid the negative effect pooling has on gradient descent. YOLOv3 upsamples the deep features so that their spatial dimensions match the shallower features to be fused while the channel counts differ, and achieves feature fusion by concatenating along the channel dimension, producing the fused 26 × 26 × 255 and 52 × 52 × 255 feature maps; the corresponding detection heads also adopt a fully convolutional structure.
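A minimal PyTorch sketch of this upsample-and-concatenate fusion (channel counts and sizes are illustrative; the real network also applies 1 × 1 convolutions around the fusion):

```python
import torch
import torch.nn as nn

# Deep features: semantically strong but spatially coarse.
deep = torch.randn(1, 256, 13, 13)
# Shallower features: higher resolution, different channel count.
shallow = torch.randn(1, 512, 26, 26)

# Upsample the deep map to match the shallow map's spatial size, then
# concatenate along the channel dimension to fuse the two.
upsampled = nn.Upsample(scale_factor=2, mode="nearest")(deep)  # (1, 256, 26, 26)
fused = torch.cat([upsampled, shallow], dim=1)                 # (1, 768, 26, 26)
print(fused.shape)
```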

On the basis of the original YOLO object detection architecture, YOLOv4 adopts the best optimization strategies from the CNN field in recent years, optimizing data processing, the backbone network, network training, activation functions, loss functions, and other aspects to varying degrees. Since then, many high-accuracy object detection algorithms have been proposed, and recent Transformer research in the vision field continues to push the accuracy of object detection algorithms higher.

In summary, the choice of representation can have a huge impact on the performance of machine learning algorithms, and a feedforward network trained with supervised learning can be viewed as a form of representation learning. Seen this way, traditional algorithms such as blob analysis and template matching rely on manually designed feature representations, while neural networks learn appropriate feature representations of objects automatically. This is more efficient and faster than manual feature design, requires much less specialized feature-engineering knowledge, and can recognize objects of different shapes, sizes, and textures across different scenes, with detection accuracy improving further as the dataset grows.

Article source: Machine Vision Salon.

Disclaimer: This article is intended to convey more information and is only for learning and exchange among readers. The copyright belongs to the original author; in case of infringement, please contact us for deletion.
