RNN model challenges Transformer hegemony! At 1% of the cost, performance comparable to Mistral 7B

Mondo Digital Updated on 2024-02-20

Edited by alan

As competition among large models intensifies, the Transformer's dominant position is also being challenged again and again.

Recently, RWKV released the Eagle 7B model, which is based on the latest RWKV-V5 architecture.

Eagle 7B beats all models in its class on multilingual benchmarks, and in separate English tests it is nearly on par with the best performers.

At the same time, Eagle 7B uses an RNN architecture, which cuts inference cost by a factor of 10-100 compared with Transformer models of the same size, making it arguably the most environmentally friendly 7B model in the world.

Since the RWKV-V5 paper may not be released until next month, let us first look at the RWKV paper, the first non-Transformer architecture to scale to tens of billions of parameters.

*Paper address: the work has been accepted by EMNLP 2023, and the authors come from top universities, research institutes, and technology companies in several countries.

Below is the official image for Eagle 7B, which shows an eagle flying above the Transformers.

Eagle 7B

Eagle 7B was trained on 1.1T (trillion) tokens of data spanning more than 100 languages, and it ranks first on average in the multilingual benchmarks shown in the figure below.

The benchmarks include xLAMBADA, xStoryCloze, xWinograd, and xCopa, covering 23 languages and testing common-sense reasoning in each of those languages.

Eagle 7B took first place in three of them; in the remaining one it lost to Mistral-7B and came in second, but its opponent was trained on far more data than Eagle.

The English evaluation below covers 12 independent benchmarks spanning common-sense reasoning and world knowledge.

In the English performance tests, Eagle 7B approaches the level of Falcon (1.5T), LLaMA2 (2T), and Mistral (>2T), and is on par with MPT-7B, which was also trained on about 1T tokens.

Moreover, in both sets of tests, the new V5 architecture represents a huge overall leap over the previous V4.

Eagle 7B is currently hosted by the Linux Foundation under the Apache 2.0 license and can be used for personal or commercial purposes without restriction.

As mentioned earlier, Eagle 7B's training data covers more than 100 languages, while the four multilingual benchmarks used above include only 23 of them.

Despite the first-place result, Eagle 7B is at a disadvantage overall; after all, these benchmarks cannot directly evaluate the model's performance in the other 70-plus languages.

The extra training spent on those languages does nothing to improve the rankings, and if the team had focused on English alone, the scores would likely be better than they are now.

So why is RWKV doing this? The official explanation:

Building inclusive AI for everyone in this world, not just for English speakers.

Among the many pieces of feedback on the RWKV model, the most common are:

The multilingual approach hurts the model's English evaluation scores and slows down the development of linear Transformers.

It is unfair for a multilingual model to have its multilingual performance compared against English-only models.

The official response: "For the most part, we agree with these opinions.

But we have no plans to change course, because we are building AI for the world, and the world is not only English-speaking."

In 2023, only 17% of the world's population speaks English (about 1.3 billion people). By supporting the world's top 25 languages, however, the model can reach about 4 billion people, or 50% of the global population.

The team hopes that AI in the future can help everyone, for example by letting models run cheaply on low-end hardware and by supporting more languages.

The team will gradually expand the multilingual dataset to support a wider range of languages, slowly extending coverage to the entire world and ensuring that no language is left behind.

A noteworthy phenomenon emerged during the model's training:

As the amount of training data increases, the model's performance improves steadily; at roughly 300B tokens it matches the performance of Pythia-6.9B, whose training data volume is likewise about 300B tokens.

This mirrors what was observed in an earlier experiment on the RWKV-V4 architecture: given the same amount of training data, a linear Transformer such as RWKV performs about the same as a standard Transformer.

So we have to ask: if that is the case, does the data matter more for model performance than the exact architecture?

We know that the compute and memory costs of Transformer-class models grow quadratically with sequence length, whereas the computational cost of the RWKV architecture in the figure above grows only linearly with the number of tokens.
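To make the difference concrete, here is a small back-of-the-envelope sketch in Python. The constants are purely illustrative placeholders rather than measurements of any real model; the point is only how the token-mixing cost scales with sequence length under the two approaches.

```python
# Rough illustration only: compares the asymptotic growth of the token-mixing
# step for quadratic self-attention vs. a linear recurrence such as RWKV's.
# The constants below are arbitrary placeholders, not measurements.

D = 4096  # hidden dimension (assumed for illustration)

def attention_mixing_flops(T: int, d: int = D) -> int:
    # Attention scores plus the weighted sum over values: O(T^2 * d)
    return 2 * T * T * d

def rwkv_mixing_flops(T: int, d: int = D) -> int:
    # Linear recurrence over the sequence: O(T * d)
    return 2 * T * d

for T in (1_024, 8_192, 65_536):
    ratio = attention_mixing_flops(T) / rwkv_mixing_flops(T)
    print(f"T={T:>6}: attention / RWKV mixing cost ratio ~ {ratio:,.0f}x")
```

The gap widens in direct proportion to sequence length, which is why linear architectures become increasingly attractive for long contexts.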

Perhaps we should look for more efficient and scalable architectures to improve accessibility, reduce AI costs for everyone, and reduce environmental impact.

RWKV

The RWKV architecture is an RNN that achieves GPT-level LLM performance while its training can be parallelized like a Transformer's.

RWKV combines the best of RNNs and Transformers: excellent performance, fast inference, fast training, low VRAM usage, unlimited context length, and free sentence embeddings, all without using an attention mechanism.

The following figure shows a comparison of the computational cost of RWKV and Transformer models.

To address the time and space complexity of Transformers, researchers have proposed a number of alternative architectures.

The RWKV architecture consists of a series of stacked residual blocks, each made up of a time-mixing sub-block with a recurrent structure and a channel-mixing sub-block.

In the figure below, the elements of an RWKV block are shown on the left, and the full RWKV residual block, together with the final head for language modeling, is shown on the right.
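To make this structure more concrete, here is a minimal NumPy sketch of a single residual block, loosely following the RWKV-4 formulation from the paper. The channel-mixing sub-block (a squared-ReLU value path gated by a sigmoid "receptance") is written out; the time-mixing sub-block is left as a parameter and sketched after the next two paragraphs. All names, shapes, and parameter conventions here are illustrative, not the official implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token layer normalization (learned gain/bias omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x, mu):
    # Linear interpolation between each token and its predecessor, with a
    # learned per-channel mixing vector `mu` (the diagonals in the figure).
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])
    return mu * x + (1.0 - mu) * prev

def channel_mixing(x, p):
    # Channel-mixing sub-block (RWKV-4 style sketch):
    #   r = sigmoid(W_r . shift(x)),  k = W_k . shift(x),  out = r * (W_v . relu(k)^2)
    r = sigmoid(token_shift(x, p["mu_r"]) @ p["W_r"])
    k = token_shift(x, p["mu_k"]) @ p["W_k"]
    return r * (np.maximum(k, 0.0) ** 2 @ p["W_v"])

def rwkv_block(x, time_mixing, cm_params):
    # One residual block: LN -> time-mixing -> add, then LN -> channel-mixing -> add.
    x = x + time_mixing(layer_norm(x))
    x = x + channel_mixing(layer_norm(x), cm_params)
    return x
```

In the full model, a stack of such blocks sits between the input embedding and the language-modeling head shown on the right of the figure.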

The recurrence can be expressed as a linear interpolation between the current input and the input at the previous time step (shown by the diagonal lines in the figure below), and this interpolation can be adjusted independently for each linear projection of the input embedding.

A vector that attends separately to the current token is also introduced here to compensate for potential degradation.
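Putting the last two points together, here is a minimal NumPy sketch of the time-mixing sub-block, again loosely following the RWKV-4 formulation and omitting the numerical-stability tricks used in the real code. The receptance/key/value projections of the token-shifted inputs are ordinary matrix multiplications over the whole sequence, and the WKV term combines an exponentially decayed history with the separate bonus vector u for the current token:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x, mu):
    # mu * x_t + (1 - mu) * x_{t-1}, per channel (as in the previous sketch).
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])
    return mu * x + (1.0 - mu) * prev

def time_mixing(x, p):
    # x: (T, d). The projections over the whole sequence are batched matmuls,
    # which is the part that parallelizes like a Transformer during training.
    r = sigmoid(token_shift(x, p["mu_r"]) @ p["W_r"])   # receptance
    k = token_shift(x, p["mu_k"]) @ p["W_k"]             # key
    v = token_shift(x, p["mu_v"]) @ p["W_v"]             # value
    T, d = k.shape
    wkv = np.zeros_like(v)
    a = np.zeros(d)   # decayed sum of e^{k_i} * v_i over past tokens
    b = np.zeros(d)   # decayed sum of e^{k_i} over past tokens
    for t in range(T):
        e_now = np.exp(p["u"] + k[t])             # current token uses the bonus u
        wkv[t] = (a + e_now * v[t]) / (b + e_now)
        a = np.exp(-p["w"]) * a + np.exp(k[t]) * v[t]   # w: per-channel decay
        b = np.exp(-p["w"]) * b + np.exp(k[t])
    return (r * wkv) @ p["W_o"]
```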

RWKV can be efficiently parallelized (as matrix multiplications) in what is called the time-parallel mode.

In a recurrent network, the output of the previous time step is typically used as the input at the current time step. This is especially evident in the autoregressive decoding of language models, where each token must be computed before it can be fed in as the next input, and it allows RWKV to take advantage of its RNN-like structure in what is called the time-sequential mode.

In this case, RWKV can be conveniently formulated recursively for decoding during inference, exploiting the fact that each output token depends only on the latest state, whose size is constant and independent of sequence length.

It then acts as an RNN decoder, keeping speed and memory footprint constant with respect to sequence length and allowing longer sequences to be processed more efficiently.

In contrast, the self-attention KV cache grows with sequence length, so efficiency drops while memory footprint and latency increase as the sequence gets longer.
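For illustration, a single decoding step in this time-sequential mode might look like the sketch below (again a loose NumPy rendering of the RWKV-4 formulation, stability tricks omitted). The only state carried from token to token is the previous input plus the WKV numerator and denominator accumulators, so its size never grows with the sequence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_mixing_step(x_t, state, p):
    # One RNN-mode decoding step of the time-mixing sub-block (RWKV-4 sketch).
    # `state` = (x_prev, a, b): the previous token's input plus the WKV
    # numerator/denominator accumulators. Its size is constant no matter how
    # long the sequence gets, unlike a Transformer's growing KV cache.
    x_prev, a, b = state
    r = sigmoid((p["mu_r"] * x_t + (1 - p["mu_r"]) * x_prev) @ p["W_r"])
    k = (p["mu_k"] * x_t + (1 - p["mu_k"]) * x_prev) @ p["W_k"]
    v = (p["mu_v"] * x_t + (1 - p["mu_v"]) * x_prev) @ p["W_v"]
    e_now = np.exp(p["u"] + k)               # separate bonus u for the current token
    wkv = (a + e_now * v) / (b + e_now)
    a = np.exp(-p["w"]) * a + np.exp(k) * v  # fold the current token into the history
    b = np.exp(-p["w"]) * b + np.exp(k)
    return (r * wkv) @ p["W_o"], (x_t, a, b)

# Decoding loops `out, state = time_mixing_step(next_embedding, state, params)`,
# starting from an all-zeros state; per-token compute and memory stay flat.
```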
