An In-Depth Look at the Design and Accuracy Tuning of Quantization Training Tools


Reading guide:

On November 22nd, the latest session of the Horizon "Hello, Developers" toolchain technology series was successfully held as a live broadcast on the Wisdom Orangutan enterprise channel. The session was delivered by Qu Shuqian, head of R&D for quantization training tools in the Horizon toolchain, under the theme "Experience in the Design and Accuracy Tuning of Horizon Quantization Training Tools".

This article is a transcript of the main lecture. For the live replay and the Q&A session, please see the original article.

Qu Shuqian: Hello everyone, I am Qu Shuqian from the Horizon deep learning platform, and I am mainly responsible for the quantization training tools in the Horizon toolchain. Today we are going to discuss "Experience in the Design and Accuracy Tuning of Horizon Quantization Training Tools". I will cover it from these angles: 1. Introduction to quantization training techniques; 2. Challenges of quantization training tools; 3. Experience in the design and accuracy tuning of Horizon quantization tools; 4. Future exploration directions.

Introduction to Quantization Training Techniques

As we all know, for chip or device-side deployment, integer computation is much faster than floating-point computation. Here is a simple illustration.

As you can see from the diagram, an 8-bit add and a floating-point operation at the same level differ by orders of magnitude. So from the perspective of power consumption and chip area, integer computing is better suited to device-side deployment. Our Journey series chips, from Journey 2, Journey 3, and Journey 5 to the recently launched Journey 6, all support INT8 or INT16 computation. The job of our tool is to deploy a floating-point model onto these chips.

Challenges of quantization training tools

As we all know, the intelligent driving scenario poses several quantization problems:

1. There are many types and large numbers of sensors in intelligent driving scenarios, such as cameras and radars, and the data ranges of different data types differ greatly. Take images as an example: everyone knows they are uint8 (0-255). Radar carries a lot of position information, and even velocity information, so its numerical range is very large, which is a big challenge for quantization. I also saw some questions today, for example whether the input layer can be quantized with a lower bit width; at present that is actually quite difficult.

2. Intelligent driving models keep getting more complex, especially now that everyone is pursuing end-to-end models. The end-to-end pipeline from perception to planning and control is very long, and the model structure at each stage differs, so the quantization challenge keeps growing. Within our company we have run into many quantization problems while building end-to-end models.

3. FP32 offers very fine numerical precision, while INT8 has only 256 representable values (a resolution of 1/256), so quantization-unfriendly cases show up from time to time. People tend to assume the value distributions are well behaved, so quantization should be easy; but that is not always the case, especially since some Transformer models are not very quantization-friendly. Quantization training can further reduce the accuracy loss of quantized deployment relative to floating-point precision.

Let me briefly introduce the basics of quantization training; most of you probably know them well, but some may not be familiar. Quantization is the mapping from floating point to fixed point. The formula here is uniform, asymmetric quantization with a zero point: q = clamp(round(x / s) + z, q_min, q_max), and dequantization is the reverse process: x_hat = s * (q - z). Our current quantization training tools mainly use symmetric quantization, where the zero point is 0. Take conv as an example: the input and the weight are floating point. In the quantized computation, both inputs pass through quantization nodes and become int8, while the output is int32, so the conv itself is an int8 computation. In quantization training, we instead quantize and then immediately dequantize; this is the pseudo-quantization (fake-quant) node. This node is central to quantization training: it simulates, during floating-point training, the quantization that will happen when the model is deployed on device. Its output is still a float, but it is a float that has passed through quantize-dequantize, which differs from the original float.
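As a minimal sketch of such a pseudo-quantization (fake-quant) node, assuming symmetric per-tensor int8 (illustrative only, not Horizon's implementation):

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, bits: int = 8) -> torch.Tensor:
    """Simulate symmetric (zero-point = 0) quantization during float training.

    quantize:   q = clamp(round(x / scale), qmin, qmax)
    dequantize: x_hat = q * scale
    The result is still a float tensor, but it only takes values on the
    integer grid -- which is exactly what a fake-quant node produces.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -128, 127 for int8
    q = torch.clamp(torch.round(x / scale), qmin, qmax)
    return q * scale

# Example: derive a simple max-based scale from an activation tensor
x = torch.randn(4, 8) * 3.0
scale = x.abs().max().item() / 127
x_hat = fake_quantize(x, scale)
print((x - x_hat).abs().max())  # error stays within about scale / 2
```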

Experience in the design and accuracy tuning of Horizon quantization tools

Next, we will focus on Horizon's current quantization strategy.

1. Finetune based on an existing floating-point model, with no need for users to retrain from scratch. Of course there are pros and cons; this is the most common approach in the community: train the floating-point model, then do a finetune. Its cost is relatively low, and the quantization loss of current perception models is within about 1/10 relative to the floating-point baseline. There can also be problems: in the mass-production process of intelligent driving, the floating-point model itself is constantly being finetuned, and QAT finetune is added on top of that; in theory there could be issues, but we have not encountered them yet. On the other hand, we also allow users to do quantization training from scratch directly on the quantized model, that is, the fake-quant model just mentioned. This applies to some of our own scenario models and may be somewhat more stable for model iteration.

2. For gradient computation we use the common industry method, STE (the straight-through estimator). Because the quantization function is a staircase function, it has no useful derivative, so the common approach is to pass the output gradient straight through. The only difference among variants is whether the two ends saturate, i.e., whether gradients outside the clamping range are passed through or zeroed.

3. The scale update strategy is statistics-based; min and max are generally not used directly. One approach is a moving average, computing statistics over the current batch together with previous batches. There is also a learning-based approach: the scale itself is learnable and can be updated by gradient, as in LSQ. Horizon supports both, and also supports setting a fixed scale based on the real data range. This method may not sound sophisticated, but it is useful in many scenarios: the data seen during QAT or calibration may not cover the real numerical range, so you need to compute a fixed scale from the real range, dividing the floating-point min/max by the int8 range, e.g. the -128 and 127 just mentioned (a minimal sketch follows at the end of this section).

4. We also support deploying floating-point (FP16 & FP32) models. Journey 5 adopts a heterogeneous approach, with part of the computation running on the CPU; Journey 6 will support it directly.

Since the second half of last year we have been exploring a method: calibration + QAT. In the early days we did not use this in our QAT quantization training programs. You might think QAT is already very stable and the accuracy will come up no matter how you do it. In fact that is not so, especially after Transformer models appeared: we discovered many quantization problems, and sometimes directly applying QAT cannot solve them. So our strategy is to do calibration first and then QAT. The advantage is that the accuracy is more stable; some cases that do not improve with direct QAT can be solved this way. At present roughly 1/3, or even more, of our reference models do not need QAT at all and directly meet the accuracy target; by scenario, these are mostly classification and segmentation models. And when the calibration scales are used as the starting point for QAT, the number of QAT steps can be reduced and the accuracy target is reached faster, so the overall quantization cost is lower.
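As a minimal sketch of the fixed-scale idea from point 3 above (symmetric int8 assumed; the radar range below is a made-up example, not a real configuration):

```python
def fixed_symmetric_scale(real_min: float, real_max: float, bits: int = 8) -> float:
    """Derive a fixed scale from a known real-world value range instead of
    from statistics over (possibly unrepresentative) calibration batches.
    Symmetric int8 uses the grid [-128, 127], so scale = max(|min|, |max|) / 127."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(real_min), abs(real_max)) / qmax

# Hypothetical example: a radar coordinate channel known to lie in [-51.2, 51.2]
scale = fixed_symmetric_scale(-51.2, 51.2)  # ~0.403 per int8 step
```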
In fact, our users care a lot about how much cost QAT, or the whole solution, brings: besides manpower there is also machine cost. For example, if I spent a week training a model and QAT takes several more days, that may not be acceptable. So this strategy reaches the final deployment accuracy much faster: what previously took a day can now be done in half a day. Given these benefits, our calibration strategies also keep improving, and we support the methods commonly used in the industry, including KL, percentile, MSE, max, AdaRound, and so on. Recently we also added an auto-search hybrid calibration strategy that selects the optimal calibration method for each layer. This also came from customer needs: they don't want to hand-match a strategy at every stage, which costs a lot of manpower, and users often don't know which strategy to choose, because it is hard to know which is best in a given case. So we provide automatic search; of course it consumes more machine resources, effectively trading machines for manpower, which is friendlier overall. Listed below are accuracy results for some of the main CNN models.
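As an aside before those results: here is a toy, per-tensor illustration of what such an auto-search does, trying a few calibration strategies and keeping the one with the lowest quantization error (not Horizon's implementation):

```python
import torch

def quant_mse(x: torch.Tensor, scale: float, bits: int = 8) -> float:
    """Mean squared error after symmetric fake quantization with a given scale."""
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return torch.mean((x - q * scale) ** 2).item()

def search_scale(x: torch.Tensor, bits: int = 8):
    """Try a few common calibration strategies on one tensor and keep the one
    with the lowest quantization MSE -- a per-tensor analogue of picking the
    best calibration method per layer."""
    qmax = 2 ** (bits - 1) - 1
    candidates = {
        "max": x.abs().max().item() / qmax,
        "percentile_99.99": torch.quantile(x.abs().flatten(), 0.9999).item() / qmax,
    }
    # brute-force "mse" candidate: shrink the max-based scale and keep the best
    best = candidates["max"]
    for ratio in torch.linspace(0.5, 1.0, 21):
        s = candidates["max"] * ratio.item()
        if quant_mse(x, s, bits) < quant_mse(x, best, bits):
            best = s
    candidates["mse"] = best
    return min(candidates.items(), key=lambda kv: quant_mse(x, kv[1], bits))

name, scale = search_scale(torch.randn(10000) * 2.0)  # e.g. ("mse", ...)
```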

From these results, classification and segmentation models can reach the target by calibration alone, while the models below them need QAT. Let's talk about some of the work we've done on Transformer quantization. When we first explored Transformers, they gave us a lot of headaches. Previously we mostly dealt with CNN models, because Transformer solutions for real scenarios were not yet mature at the time. Now that everyone is building intelligent driving solutions that use attention-based Transformer components, which they hadn't before, a lot of things were tried at the beginning. At present we provide a user-friendly accuracy solution for these complex operators, such as layernorm, softmax, and gridsample (gridsample cannot be counted as fully solved here). Users can use the community's operators directly without any extra effort; we transform them into the QAT model, mainly as a mix of INT8 and INT16, using INT16 to solve certain accuracy problems, which reduces the quantization difficulty of Transformer models. Among these operators, gelu, for example, has not shown much of a problem. For some common attention structures, such as multi-scale deformable attention, we provide both accuracy and deployment performance optimization. I have selected a few models here; you can see models for various scenarios such as classification, detection, and lane markings.

The final accuracy of these models looks OK. At present, from the QAT perspective, we can solve the quantization problems of Transformers, using not only INT8 but also INT16. I just covered part of the quantization training strategy, which involves a fair amount of work; how to choose a PyTorch quantization training interface is also a big headache. What should you do if you want to try a PyTorch quantization tool? The community has provided some solutions, but the community interfaces stay in prototype or beta status for a very long time. The quantization training problem is essentially inserting pseudo-quantization nodes at suitable positions; there may also be some node replacement, but the key is inserting the fake-quant nodes. At the same time, the training characteristics must stay unchanged, and the user experience of quantization should not differ too much from the original floating-point experience. This diagram shows the state from the end of 2022 to the beginning of 2023; there are many ways to capture a graph in PyTorch.

There are indeed many capture methods, from script to trace, to FX and Dynamo, with different solutions at different levels. We care most about getting a complete graph; without one, some things are hard to do. So which solution should we choose? One option is eager mode without a graph, or FX, or Dynamo. For Dynamo, PyTorch 2.1, released a little while ago, includes the Dynamo export program. None of these schemes are perfect; each has its own limitations and brings something less friendly for algorithm engineers, and you can't have everything at once. In the early days, with frameworks such as MXNet or TF, we directly defined the nodes of the graph. That makes things simple at the framework level, but it may not be user-friendly. Horizon now supports both Eager Mode and FX Mode.
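As a small illustration of what FX-style graph capture provides, using the community's standard torch.fx (not Horizon's wrapper):

```python
import torch
import torch.fx as fx

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

traced = fx.symbolic_trace(TinyNet())  # inputs become Proxy objects, not real tensors
print(traced.graph)                    # placeholder -> conv -> relu -> output

# With the whole graph in hand, a tool can insert fake-quant nodes or fuse
# conv+bn+relu automatically. The price: data-dependent Python control flow
# (e.g. `if x.sum() > 0:`) cannot be traced, which is the syntax-compatibility
# issue of FX mode mentioned below.
```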

The overall scheme of eager mode has no graph. When users actually do QAT, the experience is the same as floating-point training, and single-step debugging works. However, because there is no graph, users need to do operator fusion manually; for example, fusing the conv, bn, and relu operators is friendlier to quantization and to deployment, and the same applies to operations like bias optimization. In eager mode, some operators also need to be replaced manually, such as addition, subtraction, multiplication, and division. In the floating-point phase we can call these operators freely, but in QAT it is different: adding quantization means inserting quantization nodes, and quantization nodes carry the scale, zero-point, and so on that we just saw. They are stateful parameters that must become part of a torch module, and replacing a function with a module has to be done manually by the user. So the problem with this scheme is friendliness, but after configuring qconfig and prepare, it is still workable. FX Mode does not have that bother, because it has a graph on which we can do the work we want. But FX mode has some syntax issues: it uses a Proxy instead of a real tensor, and symbolic tracing with non-real tensors leads to compatibility problems between symbolic trace and Python syntax. Overall, this is our current approach. We are also working hard to improve the usability of the quantization interfaces, and I believe that on top of these two modes, friendlier versions will be built, for example tracing an fx-graph via the Torch 2.1 export method. Next up are hardware-related optimizations.
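Before moving on to hardware, here is roughly what those manual eager-mode steps look like using the community API (PyTorch's torch.ao.quantization; Horizon's own interface differs in its details):

```python
import torch
from torch.ao.quantization import fuse_modules
from torch.ao.nn.quantized import FloatFunctional

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(8, 8, 3, padding=1)
        self.bn = torch.nn.BatchNorm2d(8)
        self.relu = torch.nn.ReLU()
        # In eager mode "out + x" must be replaced by a module, because the add
        # needs to own stateful fake-quant parameters (scale, zero-point).
        self.skip_add = FloatFunctional()

    def forward(self, x):
        out = self.relu(self.bn(self.conv(x)))
        return self.skip_add.add(out, x)  # instead of `out + x`

m = Block().eval()
# Manual fusion: in eager mode the user has to list the module names to fuse.
m = fuse_modules(m, [["conv", "bn", "relu"]])
```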

1. The current QAT is still bound to the hardware and not completely decoupled from it. Its goal is therefore closer to simulating the hardware's quantization, which further reduces the loss from the QAT model to the quantized model. Intuitively, if deployment were floating point you probably would not have this problem; but when deployment is fixed point, the gap remains: the front end is simulated quantization and the back end is real fixed point, so there is some accuracy gap. As you saw in the table just now, the gap between our quantized models and their QAT models is relatively small.

2. While ensuring accuracy, quantize as many operators as possible to exploit the hardware's potential. As I said on the earlier slide, integer computation is more economical.

3. Provide hardware-efficient operators for some special scenarios. These operators are optimized for our hardware and embedded in the tool so users can use them on demand.

4. Support the quantized model itself. Based on high-performance GPU and x86 implementations, we have also optimized several versions this year.

5. Support sparsity training (2:4). 2:4 sparsity is probably the most widely recognized hardware-friendly sparsity strategy: the hardware implementation cost is relatively low, and the accuracy loss is acceptable. I don't have an accuracy case here; in our current experiments, the whole sparsity-training-plus-quantization flow loses about 2% accuracy compared to before. The flow starts from the floating-point model: after the floating-point model comes in, it is first sparsified, then quantized, then deployed (a minimal sketch of the 2:4 pattern appears at the end of this section).

I also talked about our experience in quantization training for complex models across many scenarios. We have had some complex models internally, and I think they will only get more complex. Our experience so far:

The first concerns the input of certain scenario models, like radar data. For radar, the numerical ranges of different channels differ a lot, which is not suitable for quantization. So we recommend that users normalize inside the network and then quantize; the quantization loss will then be relatively small.

The second is vision Transformers, such as BEVFormer, which carry many reference points and other position-related information. For example, the grid computation of gridsample is also position-related. Whether it is a relative position or not, the quantization sensitivity is relatively high and higher precision is required. We recommend int16 quantization, or a fixed scale based on the true data range. I mentioned this earlier; why do this kind of work? Sometimes it is simply necessary for quantization. In experimental settings, QAT runs over the dataset multiple times, but when you build models for real scenarios you count by step, not by epoch. With steps, I might run 1000 steps of QAT without knowing how much data that actually covers. When you select data by some criterion, the selection may not be very reasonable, so such cases likely need a fixed scale; later we may also let users set some floating-point ranges directly.

The third is detection and similar models, whose network outputs strongly affect the algorithm metrics. We recommend setting high-precision outputs, or making high precision the default. The following picture, borrowed from elsewhere, illustrates the scale and outlier issue.
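Picking up point 5 above, here is a minimal sketch of what the 2:4 pattern means, using a simple magnitude rule (not Horizon's sparsity training flow):

```python
import torch

def prune_2_of_4(w: torch.Tensor) -> torch.Tensor:
    """In every group of 4 consecutive weights (last dim assumed to be a
    multiple of 4), zero out the 2 entries with the smallest magnitude.
    This is only the pattern itself; real sparsity training also fine-tunes
    the remaining weights."""
    groups = w.reshape(-1, 4)
    drop_idx = groups.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop_idx, 0.0)
    return (groups * mask).reshape(w.shape)

w_sparse = prune_2_of_4(torch.randn(8, 16))
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2  # at most 2 of every 4 survive
```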

As that picture shows, the learned scale, min, and max are all obtained from statistics, and some outliers get clipped away. But some of those points are quite important, such as the position points just mentioned; saturating them blindly causes problems. That is our experience tuning complex models. At present, many internal teams and external mass-production customers use this set of tools, and what follows is the rich set of precision-tuning tools. I have listed many tools here, including model-structure inspection tools for troubleshooting shared ops, operator fusion, quantization configuration, and so on.

Shared ops may be unfamiliar to floating-point users but are common, especially in PyTorch; in some detection heads, for example, a certain conv may be called several times. Shared ops, however, are not necessarily quantization-friendly. The problem is that what was a per-tensor scale, once the op is shared, effectively becomes one scale managing multiple tensors. As long as the numerical ranges of the branches differ only slightly, there is no problem; once they diverge, accuracy suffers (a small numerical illustration appears at the end of this section). So we do not recommend using shared ops in the QAT stage; of course, there are tools that can check for them.

Next are tools that provide similarity and statistics to help you find quantization-sensitive layers. There are also tools that support single-step debugging; the single-step debugging experience is the same as floating point, which is the biggest advantage of PyTorch: everything stays an ordinary Python program and does not turn into something else. The last one is a tool for aligning deployment consistency.

Debugging comes at a cost and is not free. At least we do not have a single metric that can handle all cases, so you need to analyze across multiple indicators jointly to find where quantization problems may come from. We are also continuing to explore internally how to do this better. For example, we try to have people find bad cases first; running these tools directly on ordinary data may not reveal anything, but running them on a bad case often does. It is the same as debugging: we are trying to improve with similar strategies and will keep improving. This is the flowchart of the current precision-tuning process; I won't go into details.
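Coming back to the shared-op issue for a moment: a tiny numerical illustration (with made-up magnitudes) of why one shared per-tensor scale hurts when branch ranges differ:

```python
import torch

conv = torch.nn.Conv2d(8, 8, 1)           # one module reused by two branches
small = torch.randn(1, 8, 4, 4) * 0.1     # branch A: small-magnitude features
large = torch.randn(1, 8, 4, 4) * 10.0    # branch B: large-magnitude features

out_a, out_b = conv(small), conv(large)

# A shared op has a single output observer, so one per-tensor scale must
# cover both branches. The scale ends up dominated by branch B ...
scale = max(out_a.abs().max(), out_b.abs().max()).item() / 127
# ... and branch A then uses only a handful of the 256 int8 levels:
levels_used_by_a = out_a.abs().max().item() / scale  # roughly 1-2 out of 127
```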

Quantization problems can appear at the floating-point, calibration, QAT, and quantized stages. Take calibration: I don't expect calibration alone to reach a loss below 1%, and often it doesn't; at that point you need to move on to QAT, or there may be problems with the network structure, the calibration strategy, or the quantization configuration. The problem in converting the QAT model to the quantized model is that accuracy may drop, and there are many possible causes. A typical example: a nonlinear activation such as exp looks odd when quantized because we turn it into a lookup table, since a lookup table suits the hardware best and computes very fast. But how to build the table and how to set the table interval are both problems. We find these automatically internally, but there have been one or two cases where the interval we found was not very reasonable, and fine-tuning was needed.
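As an illustration of the lookup-table idea (a generic sketch; the scales and the table interval below are arbitrary examples, not values our tool would pick):

```python
import torch

def build_exp_lut(in_scale: float, out_scale: float) -> torch.Tensor:
    """Precompute exp() for all 256 int8 input codes as an int8 output table.
    The table interval is fixed by the two scales; if they are chosen badly
    the table saturates or wastes most of its resolution."""
    q_in = torch.arange(-128, 128)                       # every int8 code
    y = torch.exp(q_in.float() * in_scale)               # dequantize, apply exp
    return torch.clamp(torch.round(y / out_scale), -128, 127).to(torch.int8)

# At inference, exp(x) becomes one table lookup per element.
lut = build_exp_lut(in_scale=0.05, out_scale=0.01)       # arbitrary example scales
x_q = torch.tensor([-60, 0, 60], dtype=torch.int8)
y_q = lut[x_q.long() + 128]                              # shift code into index [0, 255]
```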

A recommendation for the next issue of the Horizon "Hello, Developer" toolchain technology session: on December 6th, He Daoyuan, head of Horizon's model conversion and quantization tools, will give a lecture on "Horizon Model Conversion and Post-Training Quantization Deployment".

Future exploration directions.

Finally, let me talk about the directions we plan to explore in the future.

We will continue to explore the quantization deployment of complex scenario models, because models are indeed becoming more and more complex. The workflows for quantization deployment, quantization tuning, and precision tuning also need continuous improvement. Some of the models shown in the slides did take a lot of time to debug, and if you do not understand quantization algorithms at all, you may not be able to complete the quantization. Some cases are quantization-unfriendly because the inputs and outputs are bound to physical meanings, and as a maker of quantization tools you sometimes cannot perceive those meanings. So both sides have to understand each other: people doing quantization must understand some of the algorithm, and algorithm people must understand some quantization. For now we think that is unavoidable; it is only a question of how much. We keep improving, hoping that people need to understand as little as possible and can get things done in one or a few passes.

Improve the efficiency of algorithm iteration. Everyone is racing toward mass production; after spending a long time tuning the floating-point model's accuracy, nobody wants to spend just as long on quantization. We hope to make this process smoother and improve iteration efficiency.

Explore lower-bit and hybrid quantization. Probably not just lower bits: this includes int8, int16, fp16, and even lower. The hardware will support a wide variety of data types, and under what circumstances these data types give the best hardware performance is, for QAT, a space with a lot left to explore. QAT is somewhat more tolerant of quantization problems, so it is more likely to produce a model that deploys accurately.

Explore the quantization deployment of large language models (LLMs). The problem with large models is that you may not be able to do a complete finetune over all the data again; that would be unacceptable to most people. At present everyone runs only a little data, similar to the complex models just mentioned: everyone runs by step, not by epoch, and running fewer steps versus more steps gives different accuracy. These are some of the follow-up directions we want to explore. That's all from me today, thank you.
