Self-Attentive Sequential Recommendation


SASRec stands for "Self-Attentive Sequential Recommendation". This approach focuses on sequence-aware recommendations (e.g. something a company like TikTok might use, or any website that wants to be able to shift recommendations rapidly).

My interpretation of the approach is this:

NOTE: As with most recommendation approaches, all of this is tricky because of the self-fulfilling prophecy issue: if you recommend an item to a user, they're more likely to click on it. If you have multiple ways of recommending items to users at once, each with a different amount of pull (think Netflix), this becomes very muddled. But you can still learn valuable signal from just the $P(\text{interacted} \mid \text{was served})$ estimate. In practice, you can help deal with this by sometimes serving random content (or, basically, sampling content at a high temperature) to get better probability estimates, but it will always be a bit of a tricky issue.
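To make the "random content or high temperature" idea concrete, here is a minimal sketch in plain Python. It is not from the paper; the function names, the `epsilon` parameter, and the example scores are all hypothetical. It shows two common ways to add exploration when choosing what to serve: occasionally picking a uniformly random item, and sampling from a softmax whose temperature flattens the model's preferences.

```python
import math
import random

def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_item_to_serve(scores, temperature=1.0, epsilon=0.0, rng=random):
    """Return the index of the item to serve.

    With probability epsilon, serve a uniformly random item (pure exploration);
    otherwise sample from the temperature-scaled softmax over the scores.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

scores = [2.0, 1.0, 0.1]  # hypothetical model scores for three items
sharp = softmax_with_temperature(scores, temperature=0.5)  # close to greedy
flat = softmax_with_temperature(scores, temperature=5.0)   # close to uniform
```

With a low temperature almost all of the probability mass sits on the top item, while a high temperature gives lower-ranked items a real chance of being served, which is what produces the less biased interaction data.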


The "Self-Attentive Sequential Recommendation" paper introduces SASRec, a model for sequence-aware recommendation. It is a Transformer-based architecture that uses self-attention to capture the sequential patterns in a user's actions over time.

Architecture and Model

The model is fully attentive and does not rely on any recurrent or convolutional components. It takes a sequence of items as input and uses self-attention to capture dependencies between the items in that sequence. Self-attention on its own ignores order, so the stacked, Transformer-based architecture adds positional embeddings to the item embeddings to retain the order of the interactions.
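As a rough illustration of what "takes sequences of items as inputs" can look like in practice, the sketch below turns one user's history into a fixed-length input and a shifted target sequence for next-item prediction. The function name, the `max_len` parameter, and the use of `0` as a padding id are assumptions made for this example rather than details taken from the text above.

```python
def make_training_pair(history, max_len, pad_id=0):
    """Build (inputs, targets) for next-item prediction from one user's history.

    inputs is the history without its last item; targets is the same sequence
    shifted one step ahead, so targets[t] is the item that followed inputs[t].
    Both are truncated to the most recent max_len items and left-padded.
    """
    inputs = history[:-1][-max_len:]
    targets = history[1:][-max_len:]
    pad = [pad_id] * (max_len - len(inputs))
    return pad + inputs, pad + targets

inputs, targets = make_training_pair([5, 9, 2, 7], max_len=5)
# inputs  -> [0, 0, 5, 9, 2]
# targets -> [0, 0, 9, 2, 7]
```

Padding on the left keeps the most recent interaction at the last position, which is the position whose output is used to recommend the next item.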

Specifically, the model consists of an embedding layer, multiple self-attention layers, and a final prediction layer. The embedding layer maps each item in the sequence to a dense vector. The self-attention layers then capture dependencies between these items, and the prediction layer scores candidate items for the next position in the sequence; the highest-scoring items become the recommendations.
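The first and last of these stages are simple enough to sketch in plain Python. The example below is only an illustration under stated assumptions: the tiny two-dimensional embedding tables are made up, position embeddings are added element-wise to item embeddings, and the prediction step is taken to be a dot product between the final hidden state and each item embedding. The attention layers in between are skipped here, so `hidden_state` is a hypothetical stand-in for their output.

```python
def embed_sequence(item_ids, item_embeddings, position_embeddings):
    """Look up each item's vector and add the embedding of its position."""
    return [
        [e + p for e, p in zip(item_embeddings[item], position_embeddings[t])]
        for t, item in enumerate(item_ids)
    ]

def score_items(hidden_state, item_embeddings):
    """Score every candidate item by its dot product with the hidden state."""
    return [sum(h * e for h, e in zip(hidden_state, emb)) for emb in item_embeddings]

def top_k(scores, k):
    """Return the indices of the k highest-scoring items, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical 2-dimensional embeddings for three items and two positions.
item_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
position_embeddings = [[0.1, 0.0], [0.0, 0.1]]

embedded = embed_sequence([0, 2], item_embeddings, position_embeddings)
hidden_state = [0.9, 0.2]  # stand-in for the output of the attention layers
recommended = top_k(score_items(hidden_state, item_embeddings), k=2)
# recommended -> [0, 2]
```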

The self-attention mechanism is particularly important because it lets the model assign different weights to different items in the sequence based on their relevance to the current prediction. The Transformer block it builds on supports multi-head attention, which lets the model focus on different parts of the sequence at the same time (although, as far as I recall, the paper's ablations found that a single head was already enough on its datasets).

The attention weight of item $j$ for item $i$ is computed as follows:

$$ A_{ij} = \frac{\exp(\text{score}(E_i, E_j))}{\sum_{k=1}^{n} \exp(\text{score}(E_i, E_k))} $$

where $E_i$ and $E_j$ are the embeddings of items $i$ and $j$, and $\text{score}$ is a function that calculates the relevance of $E_j$ to $E_i$. In the case of the transformer model, the score function is a scaled dot product:

$$ \text{score}(E_i, E_j) = \frac{E_i \cdot E_j}{\sqrt{d}} $$

where $d$ is the dimension of the embeddings.
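The two formulas above translate almost directly into code. The sketch below computes the full matrix of attention weights in plain Python. Two things in it are my own simplifications or assumptions: it scores the raw embeddings directly, leaving out the learned query and key projections a Transformer layer would apply first, and it has a `causal` flag that restricts position $i$ to positions $k \le i$, since a next-item model should not peek at future items. With `causal=False` it matches the formulas exactly as written.

```python
import math

def attention_weights(embeddings, causal=True):
    """Softmax over scaled dot-product scores, one row per query position."""
    n = len(embeddings)
    d = len(embeddings[0])
    weights = []
    for i in range(n):
        last = i if causal else n - 1  # last position that i may attend to
        scores = [
            sum(a * b for a, b in zip(embeddings[i], embeddings[k])) / math.sqrt(d)
            for k in range(last + 1)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - last - 1))
    return weights

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical item embeddings
A = attention_weights(E)
# A[0] -> [1.0, 0.0, 0.0]  (the first item can only attend to itself)
```

Each row of `A` sums to 1, so the output for position $i$ is a weighted average of the embeddings it is allowed to see.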


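For training the model, a pairwise ranking loss such as BPR (Bayesian Personalized Ranking) can be used. Note that the paper itself, as far as I can tell, trains with a binary cross-entropy loss over observed items and sampled negatives, which rests on the same idea of pushing a positive item's score above a negative one's. Given a triplet of user, positive item, and negative item $(u, i, j)$, the BPR loss is defined as follows:

$$ \text{BPR}(u, i, j) = -\ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}) $$

where $\hat{y}_{ui}$ and $\hat{y}_{uj}$ are the model's predicted scores for the positive and negative items, and $\sigma$ is the sigmoid function. The loss is small when the positive item is scored well above the negative one and grows as that ordering is violated.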
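For reference, the BPR loss for a single triplet can be written in a few lines of Python. The function name and example scores are hypothetical. The only non-obvious step is the identity $-\ln \sigma(x) = \ln(1 + e^{-x})$, which lets the loss be computed without overflow for large score differences.

```python
import math

def bpr_loss(score_pos, score_neg):
    """BPR loss for one (user, positive, negative) triplet: -ln(sigmoid(x)),
    where x = score_pos - score_neg."""
    x = score_pos - score_neg
    # -ln(sigmoid(x)) = ln(1 + exp(-x)); branch so exp() never overflows.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

well_ranked = bpr_loss(3.0, 0.0)   # positive scored above the negative
mis_ranked = bpr_loss(0.0, 3.0)    # positive scored below the negative
tied = bpr_loss(1.0, 1.0)          # equal scores give ln(2)
```

In training, this value would be averaged over many sampled triplets, with the negative item usually drawn at random from items the user has not interacted with.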


The proposed SASRec model moves away from the RNN- and CNN-based sequence models that were common in sequential recommendation, and shows that self-attention on its own can drive sequence-aware recommendations.

Furthermore, because attention connects any two positions in the sequence directly, the model can capture long-range dependencies in the data, which can be critical for making accurate recommendations. This matters in areas like e-commerce, where understanding a user's entire interaction history can lead to more personalized and accurate recommendations.

Tags: recommendation