When it comes to the intelligence of large language models (LLMs), we often hear the claim that LLMs trained with the Next Token Prediction (NTP) objective are effectively performing lossless compression of their training data. The idea is that because an LLM can predict the next token accurately, it compresses the data without losing any information. However, this view is not entirely accurate.
First, let's review the concept of lossless compression. In data compression, lossless compression means that the compressed data can be restored to its original form exactly, with no information lost. Applied to LLMs, this would mean that the original text can be reconstructed exactly from the model's next-token predictions.
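As a quick illustration of what "lossless" means in the ordinary sense, a general-purpose compressor such as zlib always reproduces its input byte-for-byte. The snippet below is just a sanity check of that round trip; the sample text is arbitrary.

```python
import zlib

# Arbitrary, repetitive sample text so the compressor has something to exploit.
text = ("the quick brown fox jumps over the lazy dog. " * 20).encode("utf-8")

compressed = zlib.compress(text)        # compressed representation
restored = zlib.decompress(compressed)  # exact reconstruction

assert restored == text                 # lossless: nothing was lost
print(len(text), "->", len(compressed), "bytes")
```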
However, when we dig deeper, we find that this "lossless" label may be a little too optimistic. Although an LLM does learn the deep structure of text through the NTP objective during training, its predictions are not always correct: the token it ranks highest is often not the token that actually appears next. Each such prediction error represents a loss of information, because part of the original text cannot be recovered from the prediction alone.
So how do we account for this loss of information? In fact, the "lossy" part of the LLM's prediction process is compensated for by arithmetic coding. Arithmetic coding is an efficient compression technique that represents an entire message as a subinterval of [0, 1): each symbol narrows the interval in proportion to the probability the model assigns to it, so high-probability symbols cost few bits and low-probability symbols cost more. When the LLM predicts poorly, the arithmetic coder simply spends extra bits to pin down the actual next token, so the overall pipeline still looks like lossless compression.
So, when we say that LLMs are capable of "lossless compression", we are really saying that the combined system "LLM + arithmetic coding" is capable of lossless compression. The two components work together: the LLM supplies next-token probabilities, and the arithmetic coder turns those probabilities into a bit stream from which the original text can be reconstructed exactly.
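To make the "LLM + arithmetic coding" picture concrete, here is a minimal sketch of an arithmetic coder driven by a predictive model. The `toy_model` function is a hypothetical stand-in for an LLM's next-token distribution (a real system would call the model at every step), and exact `Fraction` arithmetic is used for clarity rather than the scaled integer arithmetic a practical coder would use.

```python
from fractions import Fraction

def toy_model(prefix):
    # Hypothetical stand-in for an LLM: returns P(next symbol | prefix).
    # A real pipeline would query the language model here.
    return {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def encode(symbols, model):
    # Narrow the interval [low, high) once per symbol, in proportion to
    # the probability the model assigns to that symbol.
    low, high = Fraction(0), Fraction(1)
    for i, sym in enumerate(symbols):
        width = high - low
        cum = Fraction(0)
        for s, p in model(symbols[:i]).items():
            if s == sym:
                low, high = low + width * cum, low + width * (cum + p)
                break
            cum += p
    # Any number inside the final interval identifies the whole sequence.
    return (low + high) / 2

def decode(code, length, model):
    # Replays the same model, so decoding is exact: lossless by construction.
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(length):
        width = high - low
        cum = Fraction(0)
        for s, p in model(out).items():
            lo, hi = low + width * cum, low + width * (cum + p)
            if lo <= code < hi:
                out.append(s)
                low, high = lo, hi
                break
            cum += p
    return out

message = list("abacba")
code = encode(message, toy_model)
assert decode(code, len(message), toy_model) == message  # perfect round trip
```

The number of bits needed to write down a point in the final interval is roughly -log2(high - low), which equals the sum of -log2 P(token | context) over the sequence. The better the model predicts, the shorter the code; a poor prediction simply costs more bits, never a loss of information.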
Overall, data compression ability is an important aspect of an LLM's intelligence, but we need to be precise about what is "lossless" and what is "lossy" in this process. With the help of arithmetic coding, the information loss caused by prediction errors is paid for in extra bits, and the data is still compressed effectively.