The Transformer Doesn't Read Dream of the Red Chamber


In the Transformer's self-attention mechanism, every token attends to every other token. So if we have n tokens, the computational complexity of self-attention is O(n²). As the sequence length n grows, the required computation and memory grow quadratically, which leads to very large computational and storage overhead.
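To make the scaling concrete, here is a minimal NumPy sketch of single-head self-attention (an illustration, not any particular model's implementation). The (n, n) score matrix it builds is where the quadratic cost lives, and the token counts mirror the 200-word versus 20,000-word comparison below.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over n tokens (a minimal illustration)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each has shape (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) score matrix: the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

d = 64
for n in (200, 20_000):
    print(f"n = {n}: the score matrix alone has {n * n:,} entries")
# n = 200 -> 40,000 entries; n = 20,000 -> 400,000,000 entries: a 10,000x increase

X, W = np.random.randn(200, d), np.random.randn(d, d)
out = self_attention(X, W, W, W)   # fine at n = 200; memory blows up long before n = 20,000
```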

This means that when you are no longer content to feed the large model a 200-word paragraph and instead throw it a 20,000-word text, its computation increases by 10,000 times.

Source: Harm de Vries blog

The input and output window is like AI's gateway to the real world. From ChatGPT's first jump to 4,096 tokens, to GPT-4 expanding the context input length to 32K, to the MegaByte method for millions of tokens proposed by Meta AI, and the 200,000- and 350,000-Chinese-character race between domestic large-model makers Moonshot AI and Baichuan Intelligence, the appetite of the input window is becoming an important prerequisite for large models to solve practical problems.

In other words, when it can read Dream of the Red Chamber as carefully as it reads a brain teaser, things will get much easier.

The breakthrough point falls on the number of tokens, and research on it has never stopped.

Advancing context length is necessary, a research team from Fudan University and the University of Hong Kong argues in a paper. They built the L-Eval benchmark, in which Claude-100K was still weaker than GPT-4-32K at reasoning on closed-ended tasks, but on open-ended tasks the longer model, which usually means more input tokens, outperformed GPT-4-32K.

Source: "L-Eval: Instituting Standardized Evaluation for Long Context Language Models" The conclusion is very positive, which means that the story of hard work can make up for clumsiness, if you have a bad brain, you can give a few more instructions, and a stupid student can also achieve something.

Before that, Google Brain had already run a similar experiment. Li Wei, an engineer involved in Bard's R&D and training, told Silicon Star that last year the Google team observed the model's output performance while controlling the length of the training input context, and the results did show a positive correlation between context length and model performance. This finding also helped the later development of Bard.

At least that's a very firm direction in the industry. Theoretically, wouldn't it be nice to keep expanding the context length?

The problem is that it can't simply be scaled up, and the obstacle still lies with the Transformer.

Building a large model on the Transformer architecture also means accepting the capabilities and limitations that come with the self-attention mechanism. Self-attention is good at comprehension, but it is inherently at odds with long text input. The longer the text, the harder the training, and the worst outcome is gradient explosion or vanishing.

The gradient is the direction and size of a parameter update. Ideally, after each round of deep learning, the gap between what the large model generates and the response humans want should be smaller than before. If you think of the model as trying to learn a linear relation y = wx + b, then w is the weight the model is looking for, and the gradient expresses how w changes.

A steady learning process is gradual, and gradual means the fine adjustments are traceable. In other words, the weights can neither stay unchanged nor change abruptly.

No change or too little change, i.e., a vanishing gradient, means the training time of the deep learning network stretches out indefinitely; an abrupt change is a gradient explosion, where the weights are updated too much, making the network unstable. This can cause the weights to become very large or the model to diverge, making training impossible.
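As a toy illustration of these regimes, the sketch below fits y = wx + b by gradient descent. The learning rates are arbitrary and only stand in for how a healthy, a near-vanishing, and an exploding update step behave; they are not a literal model of deep-network gradient pathologies.

```python
import numpy as np

# Fit y = w*x + b by gradient descent; the gradient tells us how to nudge w and b.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5                      # the "true" relation the model should recover

def train(lr, steps=100):
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = (w * x + b) - y
        grad_w = 2 * np.mean(err * x)  # d(loss)/dw for mean squared error
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w               # the update is learning rate times gradient
        b -= lr * grad_b
        if not np.isfinite(w):         # runaway updates: training has broken down
            return None
    return w, b

print(train(lr=0.1))     # healthy, gradual updates: w converges toward 3.0
print(train(lr=1e-6))    # near-zero effective steps: w barely moves, like a vanishing gradient
print(train(lr=10.0))    # oversized steps: w blows up, like an exploding gradient
```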

At its core, short texts often cannot fully describe complex problems, while under the constraints of the attention mechanism, processing long texts demands a great deal of computing power, which means higher costs. The length of the context determines how well the problem can be described, the self-attention mechanism determines the large model's ability to understand and break down the problem, and computing power supports it all.

Source: arXiv

The problem still lies with the Transformer: because of the computational complexity of the self-attention mechanism, expanding the context length is caught inside this triangle.

The impossible triangle sits there, and so does the solution. To increase the context length, the ideal would obviously be to push computing power (money and GPUs) toward infinity. Since that is clearly unrealistic for now, the only option is to work on the attention mechanism itself and bring the computational complexity down from O(n²).

The effort to extend context input has, to a large extent, driven a wave of Transformer modifications.

In July this year, a Google DeepMind research team proposed a new Focused Transformer (FoT) framework, which uses a training process inspired by contrastive learning to improve the structure of the (key, value) space and allows some attention layers to access (key, value) pairs in an external memory through the k-nearest-neighbor (kNN) algorithm. This mechanism effectively extends the total context length.

It is a bit like an extension of Google's 2022 Memorizing Transformer, whose adjustment to the Transformer works like this: while encoding a long text, it saves the tokens it has already seen into a database as it reads on; when reading the current segment, it looks up similar content in that database via kNN.
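A rough sketch of that idea, assuming a flat brute-force index (no real kNN library, toy dimensions), might look like this: keys and values from segments already read are stored externally, and the current segment retrieves its nearest neighbors instead of attending over the entire history.

```python
import numpy as np

class ExternalKVMemory:
    """Toy external memory of (key, value) pairs with brute-force kNN lookup.

    A stand-in for the mechanism described above: keys/values from segments
    already read are stored as the model goes, and attention layers can fetch
    the k nearest neighbors of the current queries instead of attending over
    the full history."""

    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def add(self, k, v):                        # store pairs from a finished segment
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

    def knn(self, queries, k=4):                # k most similar stored keys per query
        sims = queries @ self.keys.T            # (n_q, n_stored) dot-product similarity
        idx = np.argsort(-sims, axis=-1)[:, :k]
        return self.keys[idx], self.values[idx]  # each (n_q, k, d)

d = 16
mem = ExternalKVMemory(d)
mem.add(np.random.randn(128, d), np.random.randn(128, d))   # a previously read segment
q = np.random.randn(8, d)                                    # queries from the current segment
k_ret, v_ret = mem.knn(q, k=4)
# The current segment then attends over its own (k, v) plus these retrieved pairs,
# so the effective context grows without an n^2 blow-up over the whole document.
```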

The Focused Transformer was combined with the open-source LLaMA-3B model to become LongLLaMA-3B, which was then compared with LLaMA-3B on context input length. The result: once the context input length exceeded 100K, LongLLaMA-3B's answer accuracy began to drop rapidly from 94.5%, while LLaMA-3B's accuracy fell straight to 0 once the input exceeded 2K.

Source: "Focused Transformer: Contrastive Training for Context Scaling"

This line of work goes back to Transformer-XL in 2019, which introduced a segment-level recurrence mechanism to provide the model with additional context. A year later, Longformer and BigBird introduced the sparse attention mechanism and extended the length to 4,096 tokens for the first time. Recurrent attention and sparse attention began to become the main directions for modifying the Transformer.
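For a sense of what sparse attention buys, here is a small sketch of a Longformer-style sliding-window mask; the window size is illustrative, and the global and random connections that Longformer and BigBird also use are left out.

```python
import numpy as np

def sliding_window_mask(n, window=4):
    """Boolean mask where token i may only attend to tokens within `window` positions.

    The basic idea behind sparse attention in Longformer/BigBird: instead of a full
    (n, n) score matrix, each token keeps O(window) connections, so the cost grows
    roughly linearly in n (global/random tokens are omitted in this sketch)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(n=4096, window=256)
full = 4096 * 4096
kept = int(mask.sum())
print(f"full attention: {full:,} pairs; sparse window keeps {kept:,} ({kept / full:.1%})")
```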

The Focused Transformer also implements a form of sparse attention through the kNN algorithm, while a newer architectural adjustment is Ring Attention.

In October, a Berkeley team proposed Ring Attention, which rethinks the Transformer framework from a memory perspective. By computing the self-attention mechanism and the feedforward network block by block, long context sequences can be distributed across multiple devices and analyzed more efficiently. This theoretically removes the memory constraint of a single device, allowing training and inference to handle sequences far longer than previous memory-efficient Transformers.
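The computational core of this can be sketched on a single machine: attention is computed block by block with a streaming softmax, so the full (n, n) score matrix never has to exist at once. This is only the blockwise part; actual Ring Attention additionally rotates key/value blocks around a ring of devices, which is not shown here.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=256):
    """Blockwise (memory-efficient) attention: scores are only ever formed for one
    (query block, key block) pair at a time, using a streaming softmax, so the full
    (n, n) matrix is never materialized. The device-to-device ring is omitted."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    for qs in range(0, n, block):
        q = Q[qs:qs + block] / np.sqrt(d)
        m = np.full(q.shape[0], -np.inf)          # running max of scores per query
        denom = np.zeros(q.shape[0])              # running softmax denominator
        acc = np.zeros((q.shape[0], d))           # running weighted sum of values
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T            # (block, block) scores only
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)             # rescale previous partial sums
            p = np.exp(s - m_new[:, None])
            denom = denom * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / denom[:, None]
    return out

# Matches full attention on a small example:
n, d = 512, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
full = np.exp((Q / np.sqrt(d)) @ K.T)
full = (full / full.sum(-1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), full)
```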

That is, Ring Attention implements "near-infinite context".

But this is not absolute. A Stanford study published in September showed that if the context is too long, the model will skip over the middle.

They experimented with several open-source models (MPT-30B-Instruct, LongChat-13B (16K)) and closed-source models (OpenAI's GPT-3.5-Turbo and Anthropic's Claude). The researchers increased the length of the input context by adding more documents to it (similar to retrieving more documents in a retrieval-augmented generation task), and changed the position of the relevant information within the context by reordering the documents in the input so that the relevant information appeared at the beginning, middle, or end.

Source: "Lost in the Middle: How Language Models Use Long Contexts" shows a clear U-shaped trend in model performance as the location of relevant information changes. That is, language models perform best when relevant information appears at the beginning or end of the input context;When the information that the model must acquire and use is in the middle of the input context, the model performance is significantly reduced. For example, when relevant information is placed in the middle of its input context, gpt35-Turbo performs worse on multi-document issue tasks than when there are no documents (i.e., closed-book settings;56.1%)。In addition, the researchers found that model performance steadily decreased when the context was longer;And a model with context extensions doesn't necessarily make it better at using its own context.

A smack on the head. But maybe that is a good thing: when an article is too long, people also tend to remember only the beginning and the end. How much the large model resembles the human brain.

At present, the context length of GPT-4 and Llama 2 Long has reached 32K, and Claude 2.1 reaches 200K. Among domestic large models, ChatGLM-6B's context length has reached 32K, and Moonshot AI (Dark Side of the Moon), the newest star company on the scene, went straight to a 200K Kimi Chat.

Moonshot AI's approach is also to modify the Transformer, but Yang Zhilin also brings up a contrast:

The tadpole model and the bee model.

To put it bluntly, there are many ways to make a model look as if it supports longer context input, and the easiest is to sacrifice the model's own parameters.

Fewer model parameters mean lower memory consumption and simpler computation; under the same computing resources, the freed-up memory and compute can be translated into longer context input. Looking back, the Focused Transformer was validated on a small 3B-parameter model, and the Ring Attention tests also focused on verifying the effectiveness of the method rather than on a large model with tens of billions of parameters.

Source: "Focused Transformer: Contrastive Training for Context Scaling" But 1 billion level model parameters do not have too good intelligence emergence capabilities. Yang describes it as a tadpole model because it can't do anything more.

Besides this, another way to increase the length is to introduce a retrieval mechanism from outside the model.

This is also how quite a few large models quickly extend their context length: if you do not want to go through the self-attention mechanism, the alternative is to turn a long text into a collection of short texts.

Take the common retrieval-augmented generation (RAG), for example. It is a deep learning framework that combines retrieval and generation, aiming to bring information retrieval into sequence generation tasks so that the model can draw on an external knowledge base when generating responses or content.

When such a model processes a long text, the retrieval mechanism it introduces fetches short texts from a database, and the response to the long text is assembled from multiple short-text responses. Only the short snippets it needs are loaded at any one time, which sidesteps the problem that the model cannot read the entire long text at once.

Yang Zhilin refers to this type of model as the bee model.

Traditional Transformer architectures such as GPT have a limited input length, usually a few hundred to a few thousand tokens. This means the model may struggle when dealing with large amounts of information or long documents. Through its retrieval mechanism, RAG can search a large external knowledge base and then feed only the most relevant fragments, together with the original input, into the generative model. This allows the model to handle far more material than its raw input length allows.
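Schematically, the flow looks something like the sketch below. The embed function and the similarity ranking are placeholders (a real system would use a sentence encoder and a vector index, and the assembled prompt would be sent to an LLM), so this is only the shape of retrieve-then-generate, not any particular library's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: deterministic random vector per text (not a real encoder)."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(question, chunks, top_k=3):
    """Rank short text chunks by similarity to the question and keep the top_k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)
    return ranked[:top_k]

def rag_prompt(question, chunks, top_k=3):
    """Build the prompt actually sent to the generator: only the retrieved fragments
    plus the question, instead of the full long document."""
    context = "\n\n".join(retrieve(question, chunks, top_k))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# chunks would come from splitting a long document (e.g. a whole novel) into passages
chunks = ["passage 1 ...", "passage 2 ...", "passage 3 ...", "passage 4 ..."]
print(rag_prompt("Who buried the flowers?", chunks))
```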

The logic behind this is interesting. If recurrent attention and sparse attention are both improvements to the attention mechanism inside the model, i.e., optimizations at the level of the model structure, then retrieving relevant fragments from an external database and feeding the most relevant ones into the generative model together with the original input, so as to avoid processing the full long text directly, is only an optimization at the input level.

The main purpose of this approach is to quickly pick out, from a large number of documents, the pieces of information most relevant to the question at hand; these may represent only part of the original context or only certain aspects of it. In other words, the approach emphasizes local information and may fail to grasp the overall meaning of a long text.

It's like modules separated from each other in a hive.

The "bee model" label carries a teasing jab at the large model makers on the market that use plug-in retrieval to expand their window capacity, and it is hard not to think of Baichuan Intelligence, which has grown rapidly in recent months and keeps emphasizing its "search genes".

On October 30, Baichuan Intelligence released the new Baichuan2-192K large model, stretching the context window to 192K, equivalent to about 350,000 Chinese characters. Moonshot's Kimi Chat had expanded the context length to 2.5 times that of Claude-100K, and Baichuan2-192K nearly doubled Kimi Chat's upper limit again.

On the LongEval long-window text comprehension benchmark, Baichuan2-192K outperformed Claude-100K, maintaining strong performance even beyond a 100K window, while the latter's performance dropped off sharply after 70K.

But there is another angle from which to look at the question "why not try an even longer context".

CommonCrawl, the publicly available web-scraped dataset, is the main data source for LLaMA's training set; together with C4, another major dataset derived from CommonCrawl, it accounts for 82% of the LLaMA training data, and the corpus files in CommonCrawl are very short. Harm de Vries, a researcher at ServiceNow, noted in an analysis that more than 95% of the corpus files in this huge dataset contain fewer than 2K tokens, and the vast majority actually fall under 1K.

Source: Harm de Vries blog

"If you want to find long texts with clear contextual logic, there are even fewer," says Li Wei.

The hope for longer training falls on books, Wikipedia, and the like. Harm de Vries's research shows that more than 50% of Wikipedia articles exceed 4K tokens, and more than 75% of books exceed 16K tokens. However, judging from the distribution of LLaMA's main training data sources, Wikipedia and books each account for only 4.5%, and papers (arXiv) for only 2.5%. The three together amount to only about 270GB of data, less than 10% of CommonCrawl.

Perhaps there simply is not that much long-text corpus available for training right now. A Transformer that does not read Dream of the Red Chamber, and does not have that many Dream of the Red Chambers to read, is a real problem for everyone chasing context length.
