Hi, hello. I'm your teacher.
In the last lesson, we learned the principle of convolution in detail, and along the way we covered two more important concepts: the input channel and the output channel.
Of course, there is no need to implement the principle of convolution directly; PyTorch, TensorFlow, and the like have already implemented it. But if we use PyTorch's convolution operation now, it will have an input channel and an output channel.
The operation is called conv2d, which indicates that the filter applied here is a 2D one. So how does it deal with a 3D input that has multiple channels? It convolves each channel and adds the results of each layer together.
There is also a counterpart to this, conv3d. There, the filter itself is quite different: it is three-dimensional.
When we use convolution, we generally apply conv2d. It might feel like conv3d would be better, since each layer could have different values. But the filters are no longer written by hand: to extract the enormous variety of features found in nature, we let the machine automatically generate and initialize a large batch of filters, and their values start out random.
That is to say, in deep learning the filter values are initially random, and through training and backpropagation the filters automatically learn their values: they converge to something that knows how to extract features by convolution in that setting. Using 2D also means we have fewer parameters to fit.
If the input in front of the filters has many, deep channels, a 3D filter has to fit parameters across all of that depth, while a 2D filter only has to fit its single 2D layer. This is the difference between 2D and 3D, and why 2D is generally used instead of 3D.
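As a quick sketch of this difference (my own example, not from the lecture): in PyTorch a Conv2d kernel is 2D per channel, while a Conv3d kernel gains an extra depth dimension, so there are more values to fit.

```python
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
conv3d = nn.Conv3d(in_channels=3, out_channels=10, kernel_size=3)

print(conv2d.weight.shape)   # torch.Size([10, 3, 3, 3]): 3x3 kernels
print(conv3d.weight.shape)   # torch.Size([10, 3, 3, 3, 3]): 3x3x3 kernels
print(conv2d.weight.numel(), conv3d.weight.numel())  # 270 810
```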
If you want to use convolution now, the first argument is the input channel count and the second is the output channel count. The output channel is really just how many filters there are.
The kernel size is the size of the filter used in the convolution. For example, for a 3*3 filter, the kernel size is 3, which can also be written as (3, 3).
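Putting those three arguments together, a minimal construction might look like this (the numbers are hypothetical):

```python
import torch
import torch.nn as nn

# 3 input channels, 10 output channels (i.e. 10 filters), 3x3 kernels
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
# kernel_size=(3, 3) would be exactly equivalent
x = torch.rand(1, 3, 32, 32)   # a batch with one 32x32 RGB image
print(conv(x).shape)           # torch.Size([1, 10, 30, 30])
```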
There is also a parameter called stride, which is the step size. What is this stride for? When we slide a filter over the image, it normally moves one matrix unit at a time, from left to right and from top to bottom; the stride speeds up this movement by setting the interval. Take a single row as an example, just to get the idea: if the filter is 3 columns wide, it would normally cover columns 1 to 3, then columns 2 to 4, and so on. But if I set a larger stride, it skips steps: with a stride of 2, after columns 1 to 3 it jumps directly to columns 3 to 5. stride defaults to 1.
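A small sketch of the effect on output size (my own numbers):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 7, 7)
print(nn.Conv2d(1, 1, kernel_size=3, stride=1)(x).shape)  # [1, 1, 5, 5]
print(nn.Conv2d(1, 1, kernel_size=3, stride=2)(x).shape)  # [1, 1, 3, 3]
```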
The next parameter is padding. If there is a 6*6 image matrix and a 3*3 filter, then convolving the 6*6 image first yields a 4*4, and then a 2*2. The result keeps getting smaller, which means the level of abstraction keeps getting higher. That causes a few problems.
First, because the size is constantly changing, if you want to insert some operation in the middle of the network, the dimensions change, and you have to recalculate them every time you connect something in the middle.
In other words, the constantly changing dimensions make things more complicated to calculate when writing code.
Second, we don't want the image to shrink too quickly. At the extreme, a 10,000 * 10,000 image can very quickly become a 2 * 2 one. If the level of abstraction is too high, too little information remains.
The third reason is more subtle. Think about a filter moving from left to right across the image: when the convolution proceeds in order, the leftmost column is only included in the calculation once, while positions in the middle are swept into the calculation many times. What we want is for the data on the edges to be computed multiple times too, that is, to be extracted repeatedly.
There is a very simple way to solve all three problems: padding. It means adding one or two rings of zeros around the outside of the image. To add one ring of zeros, set padding=1; if it equals 2, two rings of zeros are added.
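Using the 6*6 example from above, a quick check (my own sketch) that padding preserves the size:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 6, 6)
print(nn.Conv2d(1, 1, 3, padding=0)(x).shape)  # [1, 1, 4, 4]: shrinks
print(nn.Conv2d(1, 1, 3, padding=1)(x).shape)  # [1, 1, 6, 6]: size kept
```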
The next parameter is dilation. This is something we used when doing image segmentation; it comes up in pattern recognition, so you don't have to learn it now.
When we repeatedly convolve an image, it gets smaller and smaller; this is called downsampling. After downsampling, if we want to do image segmentation, we need to blacken the main parts of the image and paint everything else white, and then expand this small result step by step back to the original size; this is called upsampling. dilation is sometimes used when upsampling.
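For reference, this is roughly what the dilation argument looks like in PyTorch (a sketch of my own; with dilation=2 a 3*3 kernel covers a 5*5 area without adding parameters):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 9, 9)
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2)
print(conv(x).shape)        # [1, 1, 5, 5]: the kernel spans a 5x5 region
print(conv.weight.numel())  # still only 9 weights
```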
Those are the important parameters. Once they are explained, the basic characteristics of convolution are essentially covered.
In addition to convolution, there is one more important operation: pooling.
The pooling operation is actually very simple. Given an image, the convolution operation selects a window w and a filter f, multiplies them elementwise, and adds the products: sum(f*w). Pooling is a more direct operation: it simply averages all the values in the window, or possibly takes the maximum. If it takes the maximum, that maximum represents the point in the window with the most influence.
So each time I apply this pooling operation, the image gets smaller, but it basically stays the same: the images before and after pooling are similar, and pooling keeps the most important information.
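A minimal pooling sketch (my own example): max pooling shrinks the image and, unlike convolution, has nothing to train.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 6, 6)
pool = nn.MaxPool2d(kernel_size=2)  # nn.AvgPool2d(2) would average instead
print(pool(x).shape)                # [1, 1, 3, 3]: the image shrinks
print(list(pool.parameters()))      # []: nothing to train
```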
Now let's think: given an image, both convolution and pooling shrink it. If both operations cause the image to be scaled down, why are there two operations?
The most troublesome thing in machine learning is overfitting: the model performs very well during training, but the results are poor in practice.
The main way to control overfitting is to reduce the number of parameters. When fewer parameters can achieve the same effect, the fewer the better: with the same amount of data, fewer parameters means less overfitting.
In the past, the values in a convolution kernel were set by humans; now we expect the machine to solve for them automatically, which means those parameters have to be trained. Pooling, on the other hand, has no parameters to set at all, which reduces the parameter count.
After using pooling, not only are the parameters reduced, but the dimension of the next input x is reduced as well. So the core point is that we have reduced the parameters that need to be trained.
Pooling is used because it reduces parameters and therefore reduces the problem of overfitting. But if you have a lot of data, or the model itself is easy to train and converges well, you can actually do without the pooling operation.
Now I want to talk about two more important concepts: weight sharing and location invariance, also known as parameter sharing and location invariance. Location invariance is also called shift invariance. The first of the two most important features of a CNN is weight sharing.
Take an image, say my avatar from before. If there is a 3*3 filter, then the same 3*3 grid of values is applied to every window of the image.
Now think about a 1,000 * 1,000 image. To process it with a linear layer, we have to write wx+b for the whole thing.
If x is 1 million dimensions, then w is also 1 million dimensions.
So to train it, we would have to train 1 million weights, and that is just fitting one linear transformation. What if instead we fit a convolution with an output channel of 10? How many parameters do we need?
Each convolution kernel is 3*3 and there are 10 of them, so that's 9 * 10 = 90 parameters. It doesn't matter whether the image is 1,000 * 1,000 or 10,000 * 10,000; the parameters we fit are the ones in the convolution kernels.
One layer of convolution, even with 10 kernels, is only 90 parameters; the linear transformation needs 1 million. The difference between the two is enormous.
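A rough check of this count in code (my own sketch; the linear layer maps the flattened image to a single output just to show the scale):

```python
import torch.nn as nn

linear = nn.Linear(1_000_000, 1, bias=False)        # wx over a flattened 1000x1000 image
conv = nn.Conv2d(1, 10, kernel_size=3, bias=False)  # 10 filters of 3x3

print(sum(p.numel() for p in linear.parameters()))  # 1000000
print(sum(p.numel() for p in conv.parameters()))    # 90
```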
Why is there such a big difference? Because the filter values used at different positions are the same: the filter parameters are shared across the image. This is the weight sharing of convolutional neural networks.
So let's think now: what does this parameter sharing do for us?
Reducing the number of parameters prevents overfitting, and the ultimate payoff of preventing overfitting is good performance on our various computer vision tasks. Beyond that, there is another benefit: it greatly improves computation speed.
Originally, with that many parameters, every backpropagation pass would have to update 1 million weights. Now we only have to update ninety.
So weight sharing is actually the reason why convolutional neural networks work so well.
In 2012, computer vision test results improved by leaps and bounds, precisely because convolutional neural networks were used.
Before that, models worked well on the training set in the lab environment, but performed very poorly on the test set. After convolutional neural networks were adopted, the error rate dropped sharply.
The weight sharing feature brings with it another feature, called location invariance. The filter is local, and by sharing it across positions we connect those local pieces together.
Take two images: for example, one is a drawing I made before, and in the other I shifted the composition.
The two images are very different as raw data, but after convolving the position of the eye, as long as the same convolution kernel is used, the results are similar. So location invariance means we can extract the eye no matter where it is.
Suppose we are training: no matter where the eye is, as long as the filter is trained, even if the eye's position changes on the test dataset, we can still extract its features and compute a similar value. This is called location invariance.
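A small sketch of this idea (toy data I made up): the same filter gives the same peak response no matter where the pattern sits.

```python
import torch
import torch.nn.functional as F

pattern = torch.rand(3, 3)         # a stand-in for the "eye" pattern
kernel = pattern.view(1, 1, 3, 3)  # use the pattern itself as the filter

img_a = torch.zeros(1, 1, 10, 10)
img_b = torch.zeros(1, 1, 10, 10)
img_a[0, 0, 1:4, 1:4] = pattern    # eye near the top-left
img_b[0, 0, 6:9, 5:8] = pattern    # same eye, moved elsewhere

out_a = F.conv2d(img_a, kernel)
out_b = F.conv2d(img_b, kernel)
print(out_a.max() == out_b.max())  # tensor(True): same response either way
```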
These two characteristics are extremely important.
To be honest, the most important thing in building a convolutional neural network is experience. In the next lesson, we will look at several classic neural network structures.