Fengse, from Aofeisi | QbitAI
The suspense is finally over, and not in a good way:

Mamba, hailed as a challenger to the Transformer, has officially been rejected by ICLR.

Its earlier "initial rejection" had already caused an uproar in the academic community, after which its status was changed to "decision pending".

But has the popularity of this "top-tier" architecture taken a hit?

Not at all. A new, popular explainer of Mamba (by Jack Cook, a researcher at the Oxford Internet Institute who has previously worked at MIT, NVIDIA, and Microsoft) has just been published and keeps collecting likes and bookmarks from netizens.

Some even call it:

the best explainer of the year so far.

Naturally, we couldn't miss it either.

What follows is the gist of the original post.
Background: S4 architecture.
Mamba's architecture is based primarily on S4, a state-space model (SSM) architecture.

The main idea is as follows:

At a high level, S4 learns how to map an input x(t) to an output y(t) through an intermediate state h(t).

Because SSMs are designed to handle continuous data well, such as audio, sensor readings, and images, x, y, and h are all functions of t.

S4 relates them through three continuous parameter matrices A, B, and C, in the form of the following two equations (1a and 1b in the Mamba paper):

h'(t) = A h(t) + B x(t)    (1a)
y(t) = C h(t)    (1b)

Since in practice we generally deal with discrete data such as text, we need to discretize the SSM, converting the continuous parameters A, B, and C into their discrete counterparts by means of a special fourth parameter Δ (A and B become Ā and B̄, while C is used as-is).

After discretization, the SSM can be expressed by these two equations (2a and 2b in the Mamba paper):

h_t = Ā h_{t-1} + B̄ x_t    (2a)
y_t = C h_t    (2b)
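For completeness (this detail comes from the Mamba paper rather than the excerpt above): the concrete discretization rule used there is the zero-order hold (ZOH), which defines the discrete parameters as

```latex
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
```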
Equations 2a and 2b form a recurrence, similar to what we see in recurrent neural networks (RNNs): at each step t, we combine the hidden state of the previous time step, h_{t-1}, with the current input x_t to produce a new hidden state h_t.

The diagram below shows how this works when predicting the next word in a sentence (here, predicting "and" after "My name is Jack").

On this basis, we can essentially use S4 like a recurrent neural network (RNN) to generate one token at a time.
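To make the recurrence concrete, here is a minimal NumPy sketch of RNN-mode computation, with toy shapes and randomly chosen Ā, B̄, and C (not the actual S4/Mamba parameterization), stepping through a sequence one input at a time:

```python
import numpy as np

# Toy dimensions: hidden state of size N, one scalar input/output per step.
N = 4
rng = np.random.default_rng(0)

# Assume the discretized parameters are already given (they are constants in S4).
A_bar = rng.normal(size=(N, N)) * 0.1   # discrete state matrix (A-bar)
B_bar = rng.normal(size=(N, 1))         # discrete input matrix (B-bar)
C = rng.normal(size=(1, N))             # output matrix

def ssm_rnn_step(h_prev, x_t):
    """One recurrent step: h_t = A_bar @ h_{t-1} + B_bar * x_t, then y_t = C @ h_t."""
    h_t = A_bar @ h_prev + B_bar * x_t
    y_t = C @ h_t
    return h_t, y_t

# Process a short input sequence one step at a time, like an RNN.
xs = [1.0, 0.5, -0.3, 2.0]
h = np.zeros((N, 1))
for x_t in xs:
    h, y = ssm_rnn_step(h, x_t)
    print(y.item())
```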
However, what's really cool about S4 is that it can also be used as a convolutional neural network (CNN).

To see how, consider what happens in the example above when we expand the discrete equations to try to compute h_3.

For simplicity, let's assume x_{-1} = 0.

After computing h_3, we can substitute it into the equation for y_3 to predict the next word:

Now, notice that y_3 can actually be computed as a dot product, in which the right-hand vector is our input x:

Since the parameters Ā, B̄, and C are constants, we can precompute the left-hand vector and save it as a convolution kernel K̄. This gives us an easy way to compute y with a convolution, as shown in the following two equations (3a and 3b in the Mamba paper):

K̄ = (C B̄, C Ā B̄, …, C Ā^k B̄, …)    (3a)
y = x * K̄    (3b)
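Here is a small NumPy sketch (again with toy, randomly initialized Ā, B̄, and C, not the real S4 parameterization) that builds this kernel and checks that the convolutional computation matches the recurrence from before:

```python
import numpy as np

N, L = 4, 6                      # hidden-state size, sequence length
rng = np.random.default_rng(0)
A_bar = rng.normal(size=(N, N)) * 0.1
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)           # toy scalar inputs

# RNN mode: sequential hidden-state updates.
h = np.zeros((N, 1))
y_rnn = []
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rnn.append((C @ h).item())

# CNN mode: precompute the kernel K = [C B, C A B, C A^2 B, ...].
K, M = [], B_bar
for _ in range(L):
    K.append((C @ M).item())
    M = A_bar @ M
K = np.array(K)

# Each y_t is a dot product between the (reversed) kernel prefix and the input prefix.
y_cnn = [float(np.dot(K[:t + 1][::-1], x[:t + 1])) for t in range(L)]

print(np.allclose(y_rnn, y_cnn))  # True: both modes give the same outputs
```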
Important: these recurrent and convolutional forms (which the author calls "RNN mode" and "CNN mode") are mathematically equivalent.

So S4 can be morphed into whichever form suits what you want to do, with no difference at all in the output.

Naturally, CNN mode is better suited to training, and RNN mode is better suited to inference.
The first main idea: selectivity.

In this section we discuss the first major idea introduced by Mamba: selectivity. Recall the two equations that define the discrete form of S4:

h_t = Ā h_{t-1} + B̄ x_t    (2a)
y_t = C h_t    (2b)

Note that in S4, the discrete parameters Ā, B̄, and C are constant. Mamba, however, makes these parameters depend on the input, so we end up with something like this:

h_t = Ā_t h_{t-1} + B̄_t x_t
y_t = C_t h_t
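As a rough illustration of the idea (a simplified sketch; the actual Mamba block uses specific projection shapes, a structured A matrix, and its own discretization), input-dependent parameters might be produced like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Toy sketch: derive input-dependent B, C and step size delta from each token.

    This only illustrates the *idea* of selectivity; see the Mamba paper for
    the exact formulation.
    """
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_delta = nn.Linear(d_model, 1)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        B = self.to_B(x)                      # (batch, seq_len, d_state), varies per token
        C = self.to_C(x)                      # (batch, seq_len, d_state), varies per token
        delta = F.softplus(self.to_delta(x))  # (batch, seq_len, 1), positive step size
        return B, C, delta

params = SelectiveParams(d_model=16, d_state=4)
B, C, delta = params(torch.randn(2, 10, 16))
print(B.shape, C.shape, delta.shape)
```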
The Mamba authors (Albert Gu and Tri Dao) argue that selectivity, i.e. input dependence, is important for many tasks.

The author of the explainer puts it this way: because S4 lacks selectivity, it is forced to treat every part of the input in exactly the same way.

But when we look at a sentence, some words inevitably matter more than others.

Take the sentence "I want to order a hamburger."

Without selectivity, S4 spends the same amount of "effort" on every word:

But a model trying to classify the intent of this sentence would want to "focus" more on "order" and "hamburger" and less on "want" and "to".

As shown in the diagram below, by making the model's parameters a function of the input, Mamba can "focus" on the parts of the input that matter more for the task at hand.
However, selectivity creates a problem for us. Recall the convolution kernel we computed earlier.

In S4, we could precompute that kernel, save it, and multiply it with the input x.

That was fine because the discrete parameters Ā, B̄, and C were constant. But in Mamba, these matrices change depending on the input! So we can no longer precompute the kernel, and we can't use CNN mode to train the model. If we want selectivity, we have to train in RNN mode. The way to do this is to drop Equation 3b, "to dramatic effect".
But this created a problem for the Mamba authors: training in RNN mode is very slow.

Imagine training the model on a sequence of 1,000 tokens:

A CNN essentially computes dot products between its kernel and the input vectors, and these computations can all be performed in parallel. An RNN, by contrast, has to update its hidden state 1,000 times in sequence.

This slowness is what led the Mamba authors to their second great idea.
The second main idea: fast training without convolution.
Mamba can be trained very, very quickly in RNN mode.

At some point, the authors realized that their recurrence looks a lot like a scan algorithm (also known as a prefix sum).

To compute a prefix sum, we take an input array [x1, x2, ..., xn] and return an output array in which each element is the sum of that item and all the items before it.

In other words, the first element of the output is x1, the second is x1 + x2, and so on. An example: [1, 2, 3, 4] → [1, 3, 6, 10].
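In code, the sequential version is just a running total; a minimal Python sketch:

```python
def prefix_sum(xs):
    """Return [x1, x1+x2, x1+x2+x3, ...] by keeping a running total."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

print(prefix_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```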
Now let's write out the process of updating Mamba's hidden state in RNN mode:

h_0 = B̄_0 x_0
h_1 = Ā_1 h_0 + B̄_1 x_1
h_2 = Ā_2 h_1 + B̄_2 x_2

...and so on. If we formalize the prefix sum as an equation, it looks like this:

y_t = y_{t-1} + x_t

This equation forms a recurrence: at each step, we compute the new value by adding the current input to the previously stored value. Now let's take another look at Mamba's hidden-state update:

h_t = Ā_t h_{t-1} + B̄_t x_t

These two equations are really, really similar!

And here's the coolest part: although computing a prefix sum looks inherently sequential, efficient parallel algorithms exist for exactly this task!

In the image below, you can see a parallel prefix-sum algorithm in action, where each vertical line represents one item of the array.
Take a moment to convince yourself that this algorithm works:

Pick any vertical line, start at the top, and work your way down, tracing each addition back to the earlier items of the array. By the time you reach the bottom, you should have the sum of all the items to the left of (and including) that line.

For example, the first element is added into the second element at the start, and then the second element's running value is added into the third element at the end. So when the parallel scan finishes, the third element contains the sum of the first, second, and third elements.

If we ran this algorithm on a single thread with no parallelism, it would take longer than simply adding the values up in sequence. But GPUs have huge numbers of processors and excel at highly parallel computation, so we can compute this prefix sum (scan) in roughly O(log n) time!
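Mamba's hidden-state update fits this scan framework too: pairs (Ā_t, B̄_t x_t) compose with an associative operator, which is exactly what parallel scan algorithms need. Below is a conceptual NumPy sketch of that composition (not Mamba's actual fused CUDA kernel):

```python
import numpy as np

# Conceptual sketch: the recurrence
#   h_t = A_t @ h_{t-1} + b_t      (here b_t stands for B_t * x_t)
# can be computed with a scan because pairs (A, b) compose associatively:
#   (A2, b2) o (A1, b1) = (A2 @ A1, A2 @ b1 + b2)
# and an associative operator is exactly what parallel scans require.

def combine(right, left):
    A2, b2 = right
    A1, b1 = left
    return A2 @ A1, A2 @ b1 + b2

def scan_recurrence(As, bs):
    """Running composition of (A, b) pairs; element t gives h_t (with h_{-1} = 0)."""
    acc = (As[0], bs[0])
    hs = [acc[1]]
    for A_t, b_t in zip(As[1:], bs[1:]):
        # A parallel implementation arranges these compositions as a tree;
        # here we apply them sequentially just to show the operator is correct.
        acc = combine((A_t, b_t), acc)
        hs.append(acc[1])
    return hs

# Check against the plain sequential recurrence on random data.
rng = np.random.default_rng(0)
N, L = 3, 8
As = [rng.normal(size=(N, N)) * 0.2 for _ in range(L)]
bs = [rng.normal(size=(N, 1)) for _ in range(L)]

h = np.zeros((N, 1))
for t in range(L):
    h = As[t] @ h + bs[t]

print(np.allclose(h, scan_recurrence(As, bs)[-1]))  # True
```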
So the Mamba authors realized: to train efficiently in RNN mode, they could probably use a parallel scan.

But since PyTorch does not currently provide such a scan, the Mamba authors wrote one themselves, and the first results were not good.

In the chart above, you can see that their PyTorch-based scan implementation (green) is always slower than FlashAttention-2 (blue), the fastest available implementation of exact attention.

Although the scan almost catches up in runtime at a sequence length of 128,000 tokens, it runs out of memory there.

For Mamba to be practical, it needed to be faster. That brought the Mamba authors to Tri Dao's earlier work on FlashAttention, which is what solved the problem.

Due to space constraints, we omit the original post's explanation of how FlashAttention works; interested readers can refer to the original blog post, the FlashAttention paper, or one of our earlier explainer articles.
Back to Mamba

Again, consider the comparison chart from before.

It turns out that if you apply the same memory-aware tiling approach to the scan computation, you can speed it up considerably.

With this optimization, Mamba (red) is now faster than FlashAttention-2 (blue) at every sequence length.
These results suggest that Mamba is practical in terms of speed, running even faster than the fastest Transformer implementation. But is it actually any good at language modeling?

The Mamba authors evaluated it on a range of sequence-modeling tasks spanning language, genomics, and audio.

The results look impressive: Mamba achieves state-of-the-art performance when modeling DNA from the Human Genome Project and audio from a piano-music dataset.

But what has many people most excited are the language results. Much of the discussion around Mamba centers on the chart below:

In this chart, model size increases to the right, and language-modeling performance improves as you move down.

That means the best models should sit toward the lower left: small (and therefore fast) and very good at language modeling.

Since the Mamba authors are academics who can't afford thousands of GPUs to train a GPT-4-sized model, they made the comparison by training a set of smaller models of roughly 125M to 1.3B parameters.

As the chart above shows, the results look very promising: compared with other models of similar size, Mamba appears to be the best at language modeling.
Why was it "rejected" a second time?

At the end of the post, the author once again expressed regret over Mamba's rejection:
I really think Mamba innovates in language modeling in a very unique and interesting way. Unfortunately, some reviewers disagreed.
Judging from the latest decision, the reviewers' reasons for rejection centered on two major benchmark-evaluation concerns.

First, the paper lacks an evaluation on LRA (Long Range Arena), a widely recognized benchmark for long-sequence modeling.

Second, using perplexity as the main evaluation metric is not sufficient, because low perplexity does not necessarily correlate with better generation quality.
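(For reference, perplexity is the exponentiated average negative log-likelihood a model assigns to held-out text; lower is better:)

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{t=1}^{N} \log p_\theta\big(x_t \mid x_{<t}\big)\right)
```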
The overall suggestion, in short, was to add more experiments.
Commenting on this outcome, some netizens remarked:

This just shows that whether a paper gets accepted at a conference is not tied to the value it contributes to the community, since acceptance can easily hinge on the judgment of a very small number of people.

In fact, when it comes to widely acknowledged good papers being rejected, Mamba is hardly the first.

About a decade ago, word2vec was also flatly rejected by ICLR, yet last year it won the NeurIPS Test of Time Award.

Do you think time will "vindicate" Mamba too?
Original explainer post: see reference link [1].