‘Attention is All You Need’ by Vaswani et al., 2017, is a seminal paper in the field of natural language processing (NLP) that introduces the Transformer model, a novel architecture for sequence transduction (or sequence-to-sequence) tasks such as machine translation. It has since become a fundamental building block for many state-of-the-art models in NLP, including BERT, GPT, and others.
Before this paper, most sequence transduction models were based on Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), or a combination of both. These models performed well but had some limitations. For instance, RNNs have difficulty dealing with long-range dependencies due to the vanishing gradient problem, and their inherently sequential computation is hard to parallelize, which slows down training. CNNs mitigate some of these problems, but each layer has a fixed context window, so many stacked layers are needed to relate distant positions.
The authors propose the Transformer model, which dispenses with recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for much higher parallelization and can theoretically capture dependencies of any length in the input sequence.
The Transformer follows the general encoder-decoder structure, using stacked self-attention and point-wise, fully connected layers in both the encoder and the decoder.
The core components of the Transformer are:
Self-Attention (Scaled Dot-Product Attention): This is the fundamental operation that replaces recurrence in the model. Given a sequence of input tokens, for each token, a weighted sum of all tokens' representations is computed, where the weights are determined by the compatibility (or attention) of each token with the token of interest. This compatibility is computed using a dot product between the query and key (both derived from input tokens), followed by a softmax operation to obtain the weights. The weights are then used to compute a weighted sum of values (also derived from input tokens). The scaling factor in the dot-product attention is the square root of the dimension of the key vectors, which is used for stability.
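Putting this together in the paper's notation: Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V, where Q, K, and V are the matrices of queries, keys, and values for all tokens in the sequence.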
The queries, keys, and values in the Transformer model are derived from the input embeddings.
The input embeddings are the vector representations of the input tokens: dense, high-dimensional, real-valued vectors. In the original Transformer these embeddings are learned from scratch along with the rest of the model (and shared with the pre-softmax output projection), although in principle pre-trained word embeddings such as Word2Vec or GloVe could be used instead.
In the context of the Transformer model, for each token in the input sequence, we create a Query vector (Q), a Key vector (K), and a Value vector (V). These vectors are obtained by applying different learned linear transformations (i.e., multiplication by a learned weight matrix) to the input embeddings. In other words, we have weight matrices WQ, WK, and WV for the queries, keys, and values, respectively. If we denote the input embedding for a token by x, then:
Q = WQ · x
K = WK · x
V = WV · x
These learned linear transformations (the weights WQ, WK, and WV) are parameters of the model and are learned during training through backpropagation and gradient descent.
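As a minimal sketch of these projections (the random weights below merely stand in for parameters that would be learned; the dimensions d_model = 512 and d_k = 64 are the ones quoted from the paper):

```python
# Illustrative sketch: projecting one token's embedding x into query, key,
# and value vectors. The random weights stand in for learned parameters;
# d_model = 512 and d_k = 64 follow the paper.
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

W_Q = rng.standard_normal((d_k, d_model)) * 0.02   # learned during training
W_K = rng.standard_normal((d_k, d_model)) * 0.02
W_V = rng.standard_normal((d_k, d_model)) * 0.02

x = rng.standard_normal(d_model)   # embedding of a single token, shape (512,)

q = W_Q @ x   # query vector, shape (64,)
k = W_K @ x   # key vector,   shape (64,)
v = W_V @ x   # value vector, shape (64,)
```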
In terms of connections, the Query (Q), Key (K), and Value (V) vectors are used differently in the attention mechanism.
The scoring is done by taking the dot product of the Query vector with each Key vector, which yields a set of scores that are then normalized via a softmax function. The softmax-normalized scores are then used to take a weighted sum of the Value vectors.
In terms of shape, Q, K, and V typically have the same dimension within a single attention head. However, the model parameters (the weight matrices WQ, WK, and WV) determine the actual dimensions. Specifically, these matrices transform the input embeddings (which have a dimension of d_model in the original 'Attention is All You Need' paper) to the Q, K, and V vectors (which have a dimension of d_k in the paper). In the paper, they use d_model = 512 and d_k = 64, so the transformation reduces the dimensionality of the embeddings.
In the multi-head attention mechanism of the Transformer model, these transformations are applied independently for each head, so the total output dimension of the multi-head attention mechanism is d_model = num_heads * d_k. The outputs of the different heads are concatenated and linearly transformed to match the desired output dimension.
So, while Q, K, and V have the same shape within a single head, the model can learn different transformations for different heads, allowing it to capture different types of relationships in the data.
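A quick worked check of this dimension bookkeeping, using the paper's numbers:

```python
# Dimension bookkeeping for multi-head attention, using the paper's numbers.
d_model = 512                  # model (embedding) dimension
num_heads = 8                  # parallel attention heads
d_k = d_model // num_heads     # 64: per-head query/key/value dimension

# Each head produces a (t, d_k) output; concatenating all heads restores
# the model dimension: 8 * 64 == 512.
assert num_heads * d_k == d_model
```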
After the Q (query), K (key), and V (value) matrices are calculated, they are used to compute the attention scores and subsequently the output of the attention mechanism.
Here's a step-by-step breakdown of the process:
Compute dot products: The first step is to compute the dot product of the query with all keys. This is done for each query, for every position in the input sequence. The result is a matrix of shape (t, t), where t is the number of tokens in the sequence.
Scale: The dot product scores are then scaled down by a factor of the square root of the dimension of the key vectors (d_k). This prevents the dot products from growing large in magnitude, which would push the softmax in the next step into regions with extremely small gradients and hinder learning.
Apply softmax: Next, a softmax function is applied to the scaled scores. This has the effect of making the scores sum up to 1 (making them probabilities). The softmax function also amplifies the differences between the largest and other elements.
Multiply by V: The softmax scores are then used to weight the value vectors. This is done by multiplying the softmax output (a t × t matrix of attention weights) with the V (value) matrix. This step essentially takes a weighted sum of the value vectors, where the weights are the attention scores.
Summation: The matrix multiplication in the previous step already performs this weighted sum: each row of the result is the attention output for one query position. This output is then used as input to the next sub-layer in the Transformer model.
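The steps above condense into a few lines. The following is a minimal NumPy sketch (the function names and toy dimensions are illustrative, not taken from the paper's reference implementation):

```python
# Minimal NumPy sketch of scaled dot-product attention for one sequence.
# Q, K, V have shape (t, d_k): one row per token.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1-2: dot products, scaled; shape (t, t)
    weights = softmax(scores, axis=-1)   # step 3: each row sums to 1
    return weights @ V                   # steps 4-5: weighted sum of value vectors

t, d_k = 5, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((t, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)
```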
In the multi-head attention mechanism, the model uses multiple sets of these transformations, allowing it to learn different types of attention (i.e., different ways of weighting the relevance of other tokens when processing a given token) simultaneously. Each set of transformations constitutes an ‘attention head’, and the outputs of all heads are concatenated and linearly transformed to result in the final output of the multi-head attention mechanism.
Multi-Head Attention: Instead of performing a single attention function, the model uses multiple attention functions, called heads. For each of these heads, the model projects the queries, keys, and values to different learned linear projections, then applies the attention function on these projected versions. This allows the model to jointly attend to information from different representation subspaces at different positions.
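A rough sketch of how the heads fit together, again with random matrices standing in for the learned projections (the paper uses h = 8 heads with d_k = d_v = d_model / h = 64):

```python
# Illustrative multi-head attention in NumPy. Random matrices stand in for the
# learned per-head projections W_Q, W_K, W_V and the output projection W_O.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, num_heads=8):
    t, d_model = X.shape
    d_k = d_model // num_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(num_heads):
        W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # (t, d_k) per head
    W_O = rng.standard_normal((d_model, d_model)) * 0.02      # final linear projection
    return np.concatenate(heads, axis=-1) @ W_O               # (t, d_model)

X = np.random.default_rng(1).standard_normal((5, 512))   # 5 tokens, d_model = 512
out = multi_head_attention(X)                             # shape (5, 512)
```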
Position-Wise Feed-Forward Networks: In addition to attention, the model uses a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
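A small illustrative sketch of this feed-forward sub-layer (the inner dimension d_ff = 2048 is the value used in the paper; the weights here are placeholders for learned parameters):

```python
# Illustrative position-wise feed-forward network: two linear transformations
# with a ReLU in between, applied to every position independently.
import numpy as np

def position_wise_ffn(X, d_ff=2048):
    d_model = X.shape[-1]
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
    W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
    hidden = np.maximum(0, X @ W1 + b1)   # ReLU activation
    return hidden @ W2 + b2               # back to shape (t, d_model)

X = np.random.default_rng(1).standard_normal((5, 512))
out = position_wise_ffn(X)   # shape (5, 512)
```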
Positional Encoding: Since the model doesn't have any recurrence or convolution, positional encodings are added to the input embeddings to give the model some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension as the embeddings so that they can be summed. A specific function based on sine and cosine functions of different frequencies is used.
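The sine and cosine encoding from the paper can be computed directly; the sketch below follows the formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
# Sinusoidal positional encoding, following the paper's formulas:
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]               # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = positional_encoding(50)   # encodings for the first 50 positions, shape (50, 512)
# These are added element-wise to the input embeddings before the first layer.
```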
The authors trained the Transformer on English-to-German and English-to-French translation tasks. It achieved new state-of-the-art results on both tasks while requiring significantly less training cost (measured in training time or FLOPs) than the best previously published models.
The Transformer's success in these tasks demonstrates its ability to handle long-range dependencies, given that translating a sentence often involves understanding the sentence as a whole.