Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu; Tri Dao

arXiv:2312.00752 · cs.LG · June 3, 2024 · 983 cites



TL;DR

Mamba introduces a linear-time sequence model using selective state spaces that improves efficiency and performance across modalities, outperforming traditional Transformers especially on long sequences.

Contribution

The paper proposes a selective state space model that makes content-based reasoning possible in an SSM while scaling linearly in sequence length, and it reports outperforming Transformers in both speed and accuracy.

Findings

01 · 5× higher throughput than Transformers

02 · Effective on sequences up to a million tokens

03 · Outperforms Transformers of the same size in language modeling

Abstract

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even…
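The selection idea in the abstract, letting the SSM parameters depend on the current input, can be illustrated with a small sequential scan. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the projection matrices `W_delta`, `W_B`, and `W_C`, the diagonal state matrix `A`, the softplus step size, and the simplified discretization of `B` are all illustrative choices, and the loop is a plain reference recurrence rather than an optimized kernel.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus; keeps the step size positive.
    return np.logaddexp(0.0, x)

def selective_scan(x, A, W_delta, W_B, W_C):
    """Toy selective SSM evaluated as a sequential recurrence.

    x:       (L, D) input sequence
    A:       (D, N) diagonal state matrix per channel (negative for stability)
    W_delta: (D, D) hypothetical projection giving a per-channel step size
    W_B:     (D, N) hypothetical projection giving an input-dependent B
    W_C:     (D, N) hypothetical projection giving an input-dependent C
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))           # hidden state, fixed size regardless of L
    y = np.empty((L, D))
    for t in range(L):             # one pass over the sequence: O(L) time
        # Parameters are recomputed from the current token (the "selection").
        delta = softplus(x[t] @ W_delta)       # (D,)
        B = x[t] @ W_B                         # (N,)
        C = x[t] @ W_C                         # (N,)
        # Assumed discretization: exponential for A, first-order for B.
        A_bar = np.exp(delta[:, None] * A)     # (D, N), entries in (0, 1)
        B_bar = delta[:, None] * B[None, :]    # (D, N)
        # A small delta keeps the old state; a large delta overwrites it.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C                           # (D,)
    return y

# Usage with arbitrary illustrative sizes.
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.normal(size=(L, D))
A = -np.exp(rng.normal(size=(D, N)))
W_delta = 0.1 * rng.normal(size=(D, D))
W_B = 0.1 * rng.normal(size=(D, N))
W_C = 0.1 * rng.normal(size=(D, N))
y = selective_scan(x, A, W_delta, W_B, W_C)
```

Because `delta`, `B`, and `C` are recomputed at every step, how much of the previous state survives depends on the token being read, which is the behavior the abstract describes as selectively propagating or forgetting information. The state `h` has a fixed size, so the cost of the loop grows linearly with the sequence length `L`.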

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

* The proposed Mamba method includes a simple modification to the conventional SSM model: add additional models to make SSM models dependent on the inputs. SSMs are known for their computational difficulties, and the authors address this issue by several performance optimization techniques.
* The authors pre-train several variants of Mamba, ranging from 130M parameters to 1.4B parameters. These pre-trained models show performance improvements compared with the baselines in the paper.

Weaknesses

Concerns about model design:
* The motivation of Mamba is to address the drawbacks of recurrent models while improving the efficiency of attention-based models. There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

+ A key limitation of prior SSMs is the inability to efficiently select data in an input-dependent manner. The paper introduces a key mechanism by parameterizing the SSM parameters based on the input, allowing the model to filter out irrelevant information and remember relevant information indefinitely.
+ The results compared with Pythia and Transformers on many benchmarks are impressive.

Weaknesses

- The model still has a quadratic memory requirement during training like Transformers.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 5

Strengths

The paper is written in a clear and understandable manner, with a well-defined approach and simple yet effective improvement strategies.

Weaknesses

The paper lacks references to some relevant works, such as [1], [2], [3], and [4], which discuss linear attention methods, and [5], which is also a LongConv method. These references are completely absent from the paper. I suggest that the authors consider adding these citations to provide a more comprehensive review of related work.
[1] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in atten

Code & Models

Videos

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained) · YouTube

MAMBA and State Space Models explained | SSM explained · YouTube

AutoGrad Changed Everything (Not Transformers) [Dr. Jeff Beck] · YouTube

AI On An Exponential? Data, Mamba, and More · YouTube

Taxonomy

Topics: Neural Networks and Applications · Topic Modeling · Machine Learning and Algorithms

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Dense Connections · Dropout · Byte Pair Encoding · Softmax · 1x1 Convolution · Layer Normalization