Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao

Mamba introduces a linear-time sequence model using selective state spaces that improves efficiency and performance across modalities, outperforming traditional Transformers especially on long sequences.
The paper proposes a novel selective state space model that enhances content-based reasoning and achieves linear scaling, outperforming Transformers in speed and accuracy.
5× higher inference throughput than Transformers
Effective on sequences up to a million tokens
Outperforms Transformers of the same size in language modeling
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even…
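The selection mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a simplified discretization (`A_bar = exp(delta * A)`, `B_bar = delta * B`) and a plain sequential loop; the names `selective_scan`, `W_B`, `W_C`, `W_delta`, and `b_delta` are hypothetical and chosen only for illustration. It is not the authors' implementation and is meant only to show how input-dependent parameters let the state keep or discard information per token.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return np.logaddexp(0.0, z)

def selective_scan(x, A, W_B, W_C, W_delta, b_delta):
    """Sequential selective SSM over one sequence (illustrative sketch).

    x: (L, D) inputs; A: (D, N) state matrix (negative entries for stability);
    W_B, W_C: (D, N) hypothetical projections producing input-dependent B_t, C_t;
    W_delta: (D, D), b_delta: (D,) hypothetical projection for the step size.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))               # one N-dimensional state per channel
    y = np.empty((L, D))
    for t in range(L):
        delta = softplus(x[t] @ W_delta + b_delta)   # (D,) input-dependent step
        B_t = x[t] @ W_B                             # (N,) input-dependent B
        C_t = x[t] @ W_C                             # (N,) input-dependent C
        A_bar = np.exp(delta[:, None] * A)           # (D, N) decay factor
        B_bar = delta[:, None] * B_t[None, :]        # (D, N) simplified discretization
        h = A_bar * h + B_bar * x[t][:, None]        # small delta: keep state, ignore token
        y[t] = h @ C_t                               # (D,) readout
    return y

# Small usage example with random (assumed) parameters.
rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))
y = selective_scan(x, A,
                   rng.standard_normal((D, N)),
                   rng.standard_normal((D, N)),
                   rng.standard_normal((D, D)),
                   np.zeros(D))
print(y.shape)
```

The loop performs a constant amount of work per token, so its cost grows linearly with the sequence length `L`. When `delta` is close to zero the state passes through almost unchanged and the current token is effectively ignored; a large `delta` lets the current token overwrite the state.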
Decision·Submitted to ICLR 2024
* The proposed Mamba method includes a simple modification to the conventional SSM model: add additional models to make SSM models dependent on the inputs. SSMs are known for their computational difficulties, and the authors address this issue by several performance optimization techniques.
* The authors pre-train several variants of Mamba, ranging from 130M parameters to 1.4B parameters. These pre-trained models show performance improvements compared with the baselines in the paper.
Concerns about model design:
* The motivation of Mamba is to address the drawbacks of recurrent models while improving the efficiency of attention-based models. There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some
+ A key limitation of prior SSMs is the inability to efficiently select data in an input-dependent manner. The paper introduces a key mechanism by parameterizing the SSM parameters based on the input, allowing the model to filter out irrelevant information and remember relevant information indefinitely.
+ The results compared with Pythia and Transformers on many benchmarks are impressive.
- The model still has a quadratic memory requirement during training like Transformers.
The paper is written in a clear and understandable manner, with a well-defined approach and simple yet effective improvement strategies.
The paper lacks references to some relevant works, such as [1], [2], [3], and [4], which discuss Linear Attention methods, and [5], which is also a LongConv method. I suggest that the authors add these citations to provide a more comprehensive review of related work.
[1] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in atten
- 🤗 tiiuae/falcon-mamba-7b-instruct · model · 47k dl · ♡ 72
- 🤗 Schmadge/mamba-slim-orca · model · 6 dl · ♡ 4
- 🤗 state-spaces/mamba-2.8b-slimpj · model · 182 dl · ♡ 129
- 🤗 Q-bert/Mamba-130M · model · 44 dl · ♡ 12
- 🤗 Q-bert/Mamba-370M · model · 26 dl · ♡ 4
- 🤗 Q-bert/Mamba-790M · model · 24 dl · ♡ 2
- 🤗 Q-bert/Mamba-1B · model · 27 dl · ♡ 27
- 🤗 Q-bert/Mamba-3B · model · 27 dl · ♡ 17
- 🤗 Q-bert/Mamba-3B-slimpj · model · 17 dl · ♡ 2
- 🤗 Q-bert/MambaHermes-3B · model · 18 dl · ♡ 10
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)· youtube
MAMBA and State Space Models explained | SSM explained· youtube
AutoGrad Changed Everything (Not Transformers) [Dr. Jeff Beck]· youtube
AI On An Exponential? Data, Mamba, and More· youtube
Topics: Neural Networks and Applications · Topic Modeling · Machine Learning and Algorithms
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Dense Connections · Dropout · Byte Pair Encoding · Softmax · 1x1 Convolution · Layer Normalization
