Sequence modeling has long been a central research topic in artificial intelligence. With the rapid development of deep learning, the Transformer architecture and its core attention mechanism have become the mainstream approach for processing sequential data. However, the computational cost of attention on long sequences has remained a bottleneck limiting the Transformer's range of applications. To address this, researchers have proposed a variety of subquadratic-time architectures, such as linear attention, gated convolutions, and recurrent models, but on important modalities such as language their performance has not matched that of attention.
In this paper, the authors propose a new class of selective state space models (selective SSMs) and integrate them into a simplified end-to-end neural network architecture, called Mamba, that uses no attention and not even MLP blocks. Mamba delivers about 5x higher inference throughput than Transformers and scales linearly in sequence length, and its performance on real data continues to improve up to million-length sequences. As a general-purpose sequence-model backbone, Mamba achieves state-of-the-art results across several modalities, including language, audio, and genomics.
The core improvement of Mamba lies in its selection mechanism. First, making the SSM parameters functions of the input addresses the weakness of prior SSMs on discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, although this change prevents the use of efficient convolutions, the authors design a hardware-aware parallel algorithm that computes the model in recurrent mode. This implementation is faster than previous methods both in theory (scaling linearly in sequence length, versus pseudo-linearly for convolution-based SSMs) and on modern hardware (up to 3x faster on A100 GPUs).
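To make the selection mechanism concrete, the following is a minimal PyTorch sketch of an input-dependent SSM recurrence in the spirit of the paper. The layer names, shapes, and initialization are illustrative assumptions rather than the authors' reference code, and the slow Python loop stands in for the fused, hardware-aware parallel scan used in the actual implementation.

```python
# Minimal sketch of a selective SSM recurrence (illustrative assumptions, not the
# official implementation; the real model replaces the loop with a fused parallel scan).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # A is an input-independent parameter; Delta, B, C are computed from the
        # input, which is the "selection" step described above.
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.delta_proj = nn.Linear(d_model, d_model)
        self.B_proj = nn.Linear(d_model, d_state)
        self.C_proj = nn.Linear(d_model, d_state)

    def forward(self, x):                        # x: (batch, length, d_model)
        A = -torch.exp(self.A_log)               # (d_model, d_state), kept negative for stability
        delta = F.softplus(self.delta_proj(x))   # input-dependent step size, (batch, length, d_model)
        B = self.B_proj(x)                       # input-dependent, (batch, length, d_state)
        C = self.C_proj(x)                       # input-dependent, (batch, length, d_state)

        # Discretization: A_bar = exp(delta * A), B_bar * x = delta * B * x
        A_bar = torch.exp(delta.unsqueeze(-1) * A)                    # (batch, length, d_model, d_state)
        Bx = delta.unsqueeze(-1) * B.unsqueeze(2) * x.unsqueeze(-1)   # (batch, length, d_model, d_state)

        # Recurrent mode: h_t = A_bar_t * h_{t-1} + B_bar_t x_t,  y_t = C_t . h_t
        h = torch.zeros(x.shape[0], x.shape[2], A.shape[1], device=x.device)
        ys = []
        for t in range(x.shape[1]):
            h = A_bar[:, t] * h + Bx[:, t]
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))             # contract over the state dimension
        return torch.stack(ys, dim=1)                                 # (batch, length, d_model)
```

Because Delta, B, and C vary per token, the convolutional shortcut of earlier SSMs no longer applies, which is exactly why the paper resorts to a hardware-aware scan rather than the loop shown here.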
The Mamba architecture simplifies prior deep sequence model designs by merging the SSM with the Transformer's MLP block into a single block, giving the model a simpler and more homogeneous structure. Key properties of selective SSMs and the Mamba architecture include: (i) high quality: selectivity brings strong performance on dense modalities such as language and genomics; (ii) fast training and inference: computation and memory scale linearly in sequence length during training, and autoregressive inference requires only constant time per step since no cache of previous elements is needed; (iii) long context: quality and efficiency together yield performance that keeps improving on real data up to sequence lengths of 1M.
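The sketch below illustrates the single-block design described above: one block that combines an SSM path with a gated, MLP-like path, instead of alternating attention and MLP blocks. The expansion factor, convolution width, normalization choice, and layer names are assumptions for illustration, and `SelectiveSSM` refers to the sketch in the previous snippet.

```python
# Rough sketch of a Mamba-style block (illustrative assumptions, not the reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaStyleBlock(nn.Module):
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4, d_state: int = 16):
        super().__init__()
        d_inner = expand * d_model
        self.norm = nn.LayerNorm(d_model)               # LayerNorm keeps the sketch self-contained
        self.in_proj = nn.Linear(d_model, 2 * d_inner)  # produces the SSM branch and the gate branch
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner, padding=d_conv - 1)
        self.ssm = SelectiveSSM(d_inner, d_state)       # from the previous snippet
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                               # x: (batch, length, d_model)
        residual = x
        u, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        # Short causal depthwise convolution before the SSM (trim the right-side padding).
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        u = self.ssm(F.silu(u))
        y = u * F.silu(gate)                            # multiplicative gating replaces a separate MLP block
        return residual + self.out_proj(y)
```

Stacking such blocks (with an embedding layer and output head) gives an attention-free sequence model whose per-step inference state is constant in size, which is what underlies properties (ii) and (iii) above.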
Through empirical validation across multiple task types and settings, including synthetic tasks, audio and genomics, and language modeling, Mamba demonstrates its potential as the backbone of general-purpose sequence foundation models. In language modeling, the Mamba-3B model outperforms Transformers of the same size in both pretraining and downstream evaluation, and matches Transformers twice its size in both pretraining perplexity and downstream evaluation.
Mamba's code and pretrained checkpoints have been open-sourced and are available on GitHub. This work not only demonstrates a new approach to sequence modeling, but also provides an efficient and powerful tool for working with long-sequence data. As artificial intelligence technology continues to advance, we can expect Mamba to find a role in more application scenarios and to push sequence modeling technology forward.
Conclusion: Mamba brings new perspectives and possibilities to the field of sequence modeling. It not only introduces a new selection mechanism in theory, but also demonstrates efficiency and strong performance on long-sequence data in practice. As an AI expert with 20 years of experience, I am excited about Mamba's potential and look forward to seeing more breakthroughs in future research and applications.