Big Bird: Transformers for Longer Sequences

TLDR

Instead of using attention on all tokens in an input sequence, for a token t in the input sequence, only attend to some local tokens, some special global tokens {looks like just the first few tokens in the sequence}, and some random tokens as well. This allows for very long input sequences.

Introduction

The paper "Big Bird: Transformers for Longer Sequences" by Manzil Zaheer et al. was published by Google Research in 2020. It introduces a new transformer architecture called BigBird, which can handle much longer sequences than the typical transformer models such as BERT or GPT. This new architecture alleviates the quadratic dependency on input length in Transformer attention mechanisms, which has previously limited their ability to process long sequences.

Motivation

Standard Transformers suffer from a quadratic complexity issue due to the full self-attention mechanism. Specifically, for a sequence of length (n), the self-attention mechanism computes a dot product between each pair of tokens in the sequence, leading to a complexity of (O(n^2)). This makes it computationally inefficient for longer sequences.

BigBird Model

BigBird modifies the standard Transformer model by using a sparse attention mechanism. Instead of having each token attend to every other token in the sequence, each token in BigBird attends to a fixed number of tokens. This results in a significantly reduced complexity of (O(n)), making it feasible to process longer sequences.

The sparsity pattern of BigBird's attention mechanism is designed in such a way that it includes the following types of attention:

Local Attention: Each token attends to its neighboring tokens. This helps the model capture local dependencies in the sequence.
Global Attention: Each token also attends to a fixed set of global tokens that are common for all tokens. This helps the model capture important global context that is relevant to all tokens.
Random Attention: Each token attends to a set of random tokens. This helps the model capture long-range dependencies in the sequence that cannot be captured by local or global attention.

This pattern can be visualized as follows: Big Bird Attention Pattern

Mathematical Formulation

The standard Transformer computes the self-attention as follows:

[ A = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V ]

where (Q), (K), and (V) are the query, key, and value matrices, and (d) is the dimension of the query and key vectors.

In contrast, BigBird computes the self-attention as follows:

[ A = \text{softmax}\left(\frac{QK_{\text{BB}}^T}{\sqrt{d}}\right)V_{\text{BB}} ]

where (K_{\text{BB}}) and (V_{\text{BB}}) are the key and value matrices for the BigBird attention pattern. They are computed by concatenating the key and value vectors for the local, global, and random tokens.

Results

The authors of the paper tested BigBird on a variety of tasks and found that it outperformed other models on long-sequence tasks. They also found that it was competitive with other models on standard benchmark tasks, demonstrating its versatility.

Implications

The BigBird model has significant implications for the field of natural language processing. It enables the processing of longer sequences, which can be beneficial for many applications such as document summarization, long-form question answering, and other tasks that require understanding of long sequences of text.

It also demonstrates that sparse attention mechanisms can be as effective as full self-attention mechanisms, which challenges the conventional wisdom in the field. This opens up new possibilities for designing more efficient Transformer models.

How does the model choose which global tokens to attend to?

In the BigBird model, the global tokens are predefined and fixed throughout the sequence. They do not change dynamically based on the input sequence.

The authors of the BigBird paper propose to select the first few tokens of the sequence as the global tokens. The intuition behind this is that the beginning of the sequence often contains important information that should be attended to by all other tokens. For example, in a document, the title and introductory paragraph often summarize the main points of the entire document.

By having each token attend to these global tokens, the model can effectively propagate this important global context throughout the sequence. This allows each token to not only understand its local context, but also relate it to the global context of the sequence.

The number of global tokens is a hyperparameter that needs to be chosen based on the specific task and dataset. The authors found that having a small fraction of the sequence length as global tokens (e.g., 2-10%) worked well in their experiments.

NoiseDive

Big Bird: Transformers for Longer Sequences

TLDR

Introduction

Motivation

BigBird Model

Mathematical Formulation

Results

Implications

How does the model choose which global tokens to attend to?

👁️ 963

hills

23:18

06.06.23