Mengchen | QbitAI
A major pain point of ChatGPT and other large models today: processing long text consumes enormous compute.
The reason behind this is the quadratic complexity of the attention mechanism in the Transformer architecture.
A new architecture proposed by Tri Dao, the author of FlashAttention, has emerged as a strong challenger and attracted a lot of attention:
Mamba (named after a snake), which matches Transformer performance on language tasks while offering linear complexity and 5x the inference throughput.
Specifically, Mamba achieves SOTA results across language, audio, and DNA sequence modalities.
On the most-watched language tasks, Mamba-3B surpasses Transformers of the same size and rivals Transformers twice its size.
The related code and pre-trained model checkpoints have been open sourced.
Both authors' explanation posts have received a large number of likes.
Some netizens even noticed that the odds on the question "Will Transformer still be SOTA in 2027?" on a prediction-market platform dropped significantly that day.
Selective information processing + hardware-aware algorithm.
Mamba is a state space model (SSM).
It builds on the more modern structured state space model (S4) used in deep learning and has similarities with classical RNNs.
Compared with previous research, Mamba has three main innovations: selective processing of input information, a hardware-aware algorithm, and a simpler architecture.
Selective state space model.
According to the authors, one of the fundamental problems of sequence modeling is to compress the context into a smaller state.
From this point of view, the attention mechanism is effective but inefficient: it must explicitly store the entire context (i.e., the KV cache), which directly leads to the high compute cost of training and inference.
RNN-style recurrent networks have a finite state and are efficient, but their performance is limited by how well that state compresses the context.
Mamba's solution is to let the model be selective about information, focusing on or ignoring incoming content, so that the context can still be compressed even though the state size is fixed.
An intuitive analogy:
A Transformer is like a person who rereads all previously written words plus the input before writing each new word, so it writes slowly.
An RNN only refers to a fixed number of preceding words each time; it writes quickly but easily forgets earlier content.
Mamba refers to a summary of all previous content each time; the further it writes, the more aggressively it condenses what came before, losing details but keeping the gist.
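To make the tradeoff concrete, here is a toy NumPy sketch (illustrative only, with made-up dimensions and no real attention or SSM math) contrasting the linearly growing KV cache of attention with the fixed-size state of a recurrent or SSM-style model:

```python
# Toy illustration (not the paper's code) of the context-compression tradeoff:
# attention keeps every past token around (a growing KV cache), while a
# recurrent/SSM-style model folds the whole history into one fixed-size state.
import numpy as np

d_model, seq_len = 64, 1024
rng = np.random.default_rng(0)

# Attention-style: the "state" is the full KV cache, growing with t.
kv_cache = []                          # one (key, value) pair per past token
for t in range(seq_len):
    x_t = rng.standard_normal(d_model)
    kv_cache.append((x_t, x_t))        # stand-in for projected keys/values
kv_floats = seq_len * 2 * d_model      # memory grows linearly with length

# RNN/SSM-style: the state has a fixed size, independent of t.
h = np.zeros(d_model)                  # single fixed-size hidden state
for t in range(seq_len):
    x_t = rng.standard_normal(d_model)
    h = 0.9 * h + 0.1 * x_t            # every new token is folded into h
state_floats = d_model

print(f"attention cache: {kv_floats} floats, recurrent state: {state_floats} floats")
```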
In its predecessor, the structured state space model (S4), the four parameters Δ, A, B, and C are fixed and do not change with the input.
In Mamba, the authors make Δ, B, and C functions of the input, allowing the model to adaptively adjust its behavior based on what it reads.
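Based on that description, here is a minimal NumPy sketch of a selective SSM recurrence. It assumes a diagonal A, a simplified zero-order-hold discretization, and illustrative projection weights (W_delta, W_B, W_C are placeholder names, not the paper's parameterization):

```python
# Minimal sketch of a *selective* SSM: delta, B, C are computed from the input
# x, so the recurrence can decide per token how much to remember or ignore.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 8, 16, 32

A = -np.exp(rng.standard_normal((d_model, d_state)))   # fixed, negative (stable)
W_delta = rng.standard_normal((d_model, d_model)) * 0.1
W_B = rng.standard_normal((d_model, d_state)) * 0.1
W_C = rng.standard_normal((d_model, d_state)) * 0.1

def selective_ssm(x):
    """x: (seq_len, d_model) -> y: (seq_len, d_model), sequential reference."""
    h = np.zeros((d_model, d_state))
    ys = []
    for t in range(x.shape[0]):
        x_t = x[t]                                     # (d_model,)
        # Input-dependent parameters: this is the selection mechanism.
        delta = np.log1p(np.exp(x_t @ W_delta))        # softplus, (d_model,)
        B_t = x_t @ W_B                                # (d_state,)
        C_t = x_t @ W_C                                # (d_state,)
        # Discretize (simplified): A_bar = exp(delta*A), B_bar ~= delta*B.
        A_bar = np.exp(delta[:, None] * A)             # (d_model, d_state)
        B_bar = delta[:, None] * B_t[None, :]          # (d_model, d_state)
        # Recurrence: update the state, then read it out through C_t.
        h = A_bar * h + B_bar * x_t[:, None]
        ys.append(h @ C_t)                             # (d_model,)
    return np.stack(ys)

y = selective_ssm(rng.standard_normal((seq_len, d_model)))
print(y.shape)  # (32, 8)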
Hardware-aware state expansion.
To make the older SSM designs run efficiently on modern GPUs, Mamba uses techniques in the same spirit as FlashAttention.
The core idea is to exploit the GPU memory hierarchy when handling the SSM state, reducing the bottleneck of repeated reads and writes to the high-bandwidth but comparatively slow HBM. Specifically:
Discretization and the recurrence are performed in the faster SRAM, and only the output is written back to HBM; parallelism comes from a parallel scan algorithm; and once an input has been loaded from HBM into SRAM, intermediate states are not stored but are recomputed during backpropagation.
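The fused kernel and recomputation are hardware-specific, but the parallel-scan part can be illustrated in plain NumPy: the recurrence h_t = a_t·h_{t-1} + b_t is an associative operation on (a, b) pairs, so it can be evaluated in log-many vectorized steps. A small sketch (Hillis-Steele style, not the paper's implementation), checked against the sequential loop:

```python
# Illustrative sketch of the parallel-scan idea behind the SSM recurrence.
import numpy as np

def sequential_recurrence(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_0 = 0; a, b: (T, d)."""
    h = np.zeros_like(b[0])
    out = []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return np.stack(out)

def scan_recurrence(a, b):
    """Same result via an associative scan over (a, b) pairs.
    Combining an earlier element (a_e, b_e) with a later one (a_l, b_l)
    composes the affine maps: (a_e * a_l, a_l * b_e + b_l)."""
    a, b = a.copy(), b.copy()
    T = a.shape[0]
    offset = 1
    while offset < T:
        # Shifted copies; rows with no earlier partner get the identity (1, 0).
        a_prev = np.ones_like(a)
        b_prev = np.zeros_like(b)
        a_prev[offset:] = a[:-offset]
        b_prev[offset:] = b[:-offset]
        a, b = a_prev * a, a * b_prev + b   # RHS still uses the old a, b
        offset *= 2
    return b                                # accumulated b equals h_t (h_0 = 0)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(37, 4))     # decay factors
b = rng.standard_normal((37, 4))            # inputs
assert np.allclose(sequential_recurrence(a, b), scan_recurrence(a, b))
print("parallel scan matches the sequential recurrence")
```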
Simplified SSM architecture.
The basic block of earlier SSM architectures is combined with the gated MLP that is prevalent in modern neural networks to form the new Mamba block.
Repeating this block, interleaved with normalization and residual connections, yields the Mamba architecture.
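For a sense of how these pieces fit together, here is a structural NumPy sketch of a Mamba-style block, with a toy fixed-decay recurrence standing in for the selective SSM; the names, sizes, and omission of details such as the block's causal convolution are my simplifications, not the paper's code:

```python
# Structural sketch (forward pass only, random weights) of a Mamba-style block:
# normalization, a gated branch whose sequence mixing is done by an SSM,
# and a residual connection, repeated to form the full architecture.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_inner, seq_len = 8, 16, 32

W_in   = rng.standard_normal((d_model, d_inner)) * 0.1   # main-branch projection
W_gate = rng.standard_normal((d_model, d_inner)) * 0.1   # gate-branch projection
W_out  = rng.standard_normal((d_inner, d_model)) * 0.1   # project back to d_model

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def toy_ssm(x, decay=0.9):
    """Stand-in for the selective SSM sketched earlier: a fixed-decay
    recurrence, just so the block runs end to end."""
    h = np.zeros(x.shape[1])
    out = []
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out.append(h)
    return np.stack(out)

def mamba_block(x):
    """x: (seq_len, d_model). Norm -> SSM branch * gate -> project -> residual."""
    u = rms_norm(x)
    branch = toy_ssm(silu(u @ W_in))      # sequence mixing on the main path
    gate = silu(u @ W_gate)               # gating, as in a gated MLP
    return x + (branch * gate) @ W_out    # elementwise gate, then residual add

def model(x, n_layers=4):
    """Repeating the block (with norm and residuals) forms the architecture."""
    for _ in range(n_layers):
        x = mamba_block(x)
    return rms_norm(x)

print(model(rng.standard_normal((seq_len, d_model))).shape)   # (32, 8)
```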
Experimental results.
When pre-trained according to the Chinchilla scaling laws, Mamba outperforms comparable open-source models on language modeling.
The Transformer++ used as the baseline is the standard GPT-3 architecture plus the improvements from Google PaLM and Meta LLaMA, i.e., the strongest known Transformer recipe.
On downstream tasks, Mamba is best-in-class at every model scale and typically matches baselines twice its size.
In particular, when the sequence length grows to 512k, it is orders of magnitude faster than a Transformer using FlashAttention-2 and does not run out of memory.
What's next for Transformer?
Ultimately, Mamba is the first linear-time sequence model that truly matches Transformer performance, both in pre-training perplexity and in downstream task evaluations.
It also beats previous SOTA models on audio and DNA sequence modeling, showing a degree of generality.
In their conclusion, the authors propose that Mamba is a strong candidate for the backbone of a general sequence model.
The founder of Stability AI took notice immediately.
Nvidia scientist Jim Fan is also excited to see a challenger to the Transformer emerge.
The two authors, Albert Gu and Tri Dao, both earned their Ph.D.s at Stanford University under the supervision of Christopher Ré.
Albert Gu is now an assistant professor at CMU and has been driving the development of SSM architectures for years.
He previously worked at DeepMind and is currently co-founder and chief scientist of Cartesia AI.
Tri Dao, best known for the FlashAttention and FlashDecoding line of work, is now an assistant professor at Princeton and chief scientist at Together AI, and also serves as an advisor to Cartesia AI.
Cartesia AI's company profile mentions a commitment to building next-generation foundation models on new architectures, which now appears to mean primarily the new SSM architecture.
Co-founder and CEO Karan Goel also holds a Ph.D. from Stanford and is one of the authors of the S4 paper, the predecessor of Mamba.
As for Mamba's next steps, the paper mentions exploring whether the new architecture can plug into the rich large-model ecosystem that the Transformer has already established.
That includes fine-tuning, adaptation, prompt learning, in-context learning, instruction tuning, RLHF, quantization, and so on; in other words, developing the base model into an assistant model along the lines of GPT-3.5 or LLaMA.
However, the authors also note that the current experiments are small in scale, and that validation at 7B scale or above is needed to fully assess whether SSMs can compete with Transformers and other architectures such as RWKV and Microsoft's RetNet.
Scaling up SSMs will also bring new engineering challenges and model adjustments that are not covered in the paper.
Finally, Albert Gu shared why the new architecture is named after a venomous snake:
It's fast, it's deadly to sequence modeling problems, and its predecessor S4 is "SSSS" (hiss, hiss).