Scaling Transformer to 1M tokens and beyond with RMT
Scaling Transformer to 1M tokens and beyond with RMT is a research paper on the application of a recurrent memory mechanism to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing.[1] The paper was written by Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev and was published in April 2023.
Introduction
The Transformer model has been widely adopted and used in various research areas and industrial applications. However, a major limitation of the model is the quadratic complexity of its attention operation with respect to input length, which makes it increasingly difficult to apply large models to longer inputs.
In this paper, the authors show that a simple token-based memory mechanism, introduced in earlier work by Bulatov et al. (2022), can be combined with pretrained Transformer models such as BERT while retaining full attention and full-precision operations, allowing the model to process sequences longer than 1 million tokens on a single Nvidia GTX 1080Ti GPU.
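To make the segment-level recurrence concrete, below is a minimal sketch in PyTorch of the general idea: trainable memory tokens are prepended to each segment, the segment is processed with full attention, and the resulting memory states are carried over to the next segment. A plain TransformerEncoder stands in for BERT, and the class name, dimensions, and memory-token count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    """Illustrative sketch of RMT-style segment recurrence (not the authors' code)."""
    def __init__(self, vocab_size=30522, d_model=256, n_memory=10, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Trainable memory tokens prepended to every segment (count is illustrative).
        self.memory = nn.Parameter(torch.randn(1, n_memory, d_model))

    def forward(self, token_ids, segment_len=512):
        batch = token_ids.size(0)
        mem = self.memory.expand(batch, -1, -1)   # initial memory state
        n_mem = mem.size(1)
        outputs = []
        # Process the long input segment by segment, carrying memory forward.
        for start in range(0, token_ids.size(1), segment_len):
            seg = self.embed(token_ids[:, start:start + segment_len])
            hidden = self.encoder(torch.cat([mem, seg], dim=1))
            mem = hidden[:, :n_mem]               # updated memory feeds the next segment
            outputs.append(hidden[:, n_mem:])     # token representations for this segment
        return torch.cat(outputs, dim=1), mem

model = RecurrentMemorySketch()
ids = torch.randint(0, 30522, (1, 2048))          # toy 2048-token input (4 segments of 512)
token_states, final_memory = model(ids)
print(token_states.shape, final_memory.shape)     # (1, 2048, 256) and (1, 10, 256)
```

Because attention is computed only within each segment (plus the small memory block), the per-segment cost stays constant no matter how long the full input is; only the memory tokens pass information between segments.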
Contributions
The authors make the following contributions in their work:
- They enhance BERT by incorporating token-based memory storage and segment-level recurrence with the Recurrent Memory Transformer (RMT).
- They demonstrate that the memory-augmented BERT can be trained to tackle tasks on sequences up to seven times longer than its originally designed input length (512 tokens).
- They show that the trained RMT can successfully extrapolate to tasks of varying lengths, including those exceeding 1 million tokens, with linear scaling of the required computation.
- Through analysis of attention patterns, they identify the memory operations RMT employs, which enable its success in handling exceptionally long sequences.
Discussion
The problem of long inputs in Transformers has been extensively researched since the popularization of this architecture. In this work, the authors demonstrate that applying Transformers to long texts does not necessarily require large amounts of memory. By employing a recurrent approach and memory, the quadratic complexity can be reduced to linear. Furthermore, models trained on sufficiently large inputs can extrapolate their abilities to texts orders of magnitude longer.
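To make the reduction from quadratic to linear cost concrete, let $N$ be the total input length, $L$ the fixed segment length, and $m$ the number of memory tokens (symbols introduced here for illustration). Since full attention is computed only within each segment, the total cost is roughly

$$\frac{N}{L}\cdot\mathcal{O}\big((L+m)^2\big)=\mathcal{O}\!\left(N\cdot\frac{(L+m)^2}{L}\right)=\mathcal{O}(N)\quad\text{for fixed } L \text{ and } m,$$

compared with $\mathcal{O}(N^2)$ for full attention over the entire input.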
The synthetic tasks explored in this study serve as a first milestone toward enabling RMT to generalize to tasks with unseen properties, including language modelling. In future work, the authors aim to tailor the recurrent memory approach to the most commonly used Transformers to improve their effective context size.
Overall, the paper presents a promising approach to extending the context length of Transformer-based models, which holds significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as to enable large-scale context processing for memory-intensive applications.
References
- ↑ Bulatov, Aydar; Kuratov, Yuri; Burtsev, Mikhail S. "Scaling Transformer to 1M tokens and beyond with RMT." 2023. arXiv:2304.11062