GPT-2: Language Models are Unsupervised Multitask Learners

"Language Models are Unsupervised Multitask Learners" by Radford et al. was published by OpenAI in 2019. This paper introduced GPT-2, an improved version of GPT (Generative Pre-trained Transformer), a highly influential model in the field of natural language processing (NLP).

Model Architecture

GPT-2 uses the transformer, an architecture built around self-attention. The transformer was first introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017).

The key innovation of the transformer architecture is the self-attention mechanism, implemented as scaled dot-product attention. Given a sequence of inputs \(x_1, x_2, \ldots, x_n\), the self-attention mechanism computes, for each position, a weighted sum over the sequence, where the weight assigned to each input reflects its compatibility with the input at that position.

The self-attention mechanism is formally described as follows:

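Given query, key, and value matrices \(Q\), \(K\), and \(V\) (each row holds the query, key, or value vector for one position in the sequence), the output is computed as:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

where \(d_k\) is the dimensionality of the query and key vectors. Dividing by \(\sqrt{d_k}\) keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.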

In the context of the transformer model, the query, key, and value vectors are all derived from the input to the self-attention layer. They are computed by multiplying the input by learned weight matrices \(W_Q\), \(W_K\), and \(W_V\), respectively.
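To make the formula concrete, here is a minimal single-head sketch in plain Python. It uses nested lists instead of a tensor library so it runs without dependencies; the helper names (`matmul`, `softmax`, `self_attention`) and the toy identity weights are assumptions made for illustration, not part of GPT-2's actual implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is an n x d_model input; w_q and w_k are d_model x d_k; w_v is d_model x d_v.
    Returns (output, weights), where weights is the n x n attention matrix.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]                      # K^T
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]                # one distribution per position
    return matmul(weights, v), weights

# Toy input: three positions, d_model = 2, identity projections (illustrative only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each row of `out` is the corresponding weighted sum of the value vectors. A real model would use learned projection matrices and several attention heads in parallel.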

The transformer model stacks many of these self-attention layers (along with feed-forward neural networks) to form the final model. GPT-2 specifically uses a decoder-only transformer: a stack of decoder blocks with masked (causal) self-attention, with no encoder and no encoder-decoder cross-attention from the original transformer model.
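In a decoder-style model, self-attention is causally masked so that each position can attend only to itself and earlier positions. The sketch below shows one way to apply that restriction to a matrix of raw attention scores; the function name `causal_softmax` and the all-zero toy scores are illustrative assumptions.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax in which position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # drop scores for future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# With equal scores everywhere, each position spreads its attention
# uniformly over the positions it is allowed to see.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

Tensor-based implementations usually get the same effect by setting the masked scores to negative infinity before a standard softmax. Slicing is used here only to keep the example short.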


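Training

GPT-2 was trained on WebText, a large corpus of internet text (about 8 million documents, roughly 40 GB) collected from outbound links shared on Reddit. It uses a language modeling objective: it is trained to predict the next token in a sequence given the previous tokens. This is an unsupervised learning task, because the targets come from the text itself and no manually labeled data is required.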

The training objective for a language model is typically the cross-entropy loss:

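\[ \mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i) \]

where \(y_i\) is the true label for vocabulary entry \(i\) (1 for the token that actually comes next in the text, 0 otherwise) and \(\hat{y}_i\) is the predicted probability of that entry. Because only one \(y_i\) is nonzero, the loss at each position reduces to the negative log-probability that the model assigns to the actual next token.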
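As a small worked example, the sketch below averages this cross-entropy over two positions of a toy three-word vocabulary. Because the true label is one-hot, the sum over the vocabulary collapses to the negative log of the probability given to the correct next token. The function name, vocabulary, and probabilities are made up for illustration.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average cross-entropy over positions.

    predicted_probs[t] is a probability distribution over the vocabulary
    for position t; target_ids[t] is the index of the token that actually
    came next at that position.
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])   # one-hot label selects a single term
    return total / len(target_ids)

vocab = ["the", "cat", "sat"]               # hypothetical toy vocabulary
predicted_probs = [
    [0.1, 0.8, 0.1],                        # after "the": model favors "cat"
    [0.2, 0.2, 0.6],                        # after "the cat": model favors "sat"
]
target_ids = [1, 2]                         # true next tokens: "cat", then "sat"
loss = next_token_loss(predicted_probs, target_ids)
print(round(loss, 4))
```

A model that guessed uniformly over this vocabulary would score `log(3)`, about 1.0986, so lower values indicate better predictions.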

Results and Implications

The key finding of the GPT-2 paper was that a language model trained on a diverse range of internet text could generate coherent and diverse paragraphs of text. When given a short prompt (such as "In a shocking turn of events,"), GPT-2 could generate a continuation of the text that was both contextually relevant and linguistically sophisticated.

This finding has significant implications for the field of NLP. It suggests that a single, large-scale language model can perform a variety of specific tasks without task-specific fine-tuning, effectively serving as a general-purpose "text understanding" model.

However, the release also drew attention to the potential risks of such powerful language models. For example, they could be used to generate misleading news articles or spam at scale. As a result, OpenAI initially chose not to release the full model, citing concerns about malicious use, and instead released progressively larger versions in stages over the course of 2019.

GPT-2 vs. BERT

There are some key differences between the GPT-2 architecture and other transformer-based models like BERT.

Architecture Differences

  1. Directionality:

    • GPT-2 is a transformer decoder, meaning it operates in a left-to-right context or auto-regressive manner. During training, it uses all the previous words in the input to predict the next word.
    • BERT, on the other hand, is a transformer encoder and is bidirectional: it uses both the left and right context of a word during training. This is achieved by masking a percentage of the input tokens at random (15% in the original BERT paper) and then predicting those masked tokens.
  2. Training Objective:

    • GPT-2 is trained with a language modeling objective, where the aim is to predict the next word in a sequence based on the previous words.
    • BERT uses a different training objective called the masked language model (MLM) objective, where it randomly masks some of the tokens in the input and the model must predict the original vocabulary id of the masked word based only on its context. Additionally, BERT is trained on a next sentence prediction task, which involves predicting whether the second of two given sentences actually follows the first in the original text.
  3. Use Case:

    • GPT-2 is used as a standalone model for a variety of tasks such as text generation, translation, and summarization without any task-specific layers or training.
    • BERT, by contrast, is typically used as a base model for downstream tasks, such as question answering and sentiment analysis, and requires task-specific layers and fine-tuning.
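The difference between the two training objectives shows up in how inputs and targets are built from the same sentence. The sketch below is a simplified illustration: the function names and the `[MASK]` handling are assumptions, word-level tokens stand in for subword tokens, and the extra replacement rules BERT applies to selected tokens are omitted.

```python
import random

def causal_lm_example(tokens):
    """GPT-style: predict each token from the ones before it.

    Inputs are all tokens but the last; targets are the same sequence
    shifted left by one position.
    """
    return tokens[:-1], tokens[1:]

def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """BERT-style (simplified): hide random tokens and predict only those.

    Targets are None at positions that were not masked, meaning no loss
    is computed there.
    """
    rng = rng or random.Random()
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
clm_inputs, clm_targets = causal_lm_example(tokens)
# A high mask_prob is used here only so the toy sentence is likely to show some masks.
mlm_inputs, mlm_targets = masked_lm_example(tokens, mask_prob=0.5, rng=random.Random(0))
print(clm_inputs, clm_targets)
print(mlm_inputs, mlm_targets)
```

In the causal case every position contributes a prediction but sees only its left context. In the masked case only the hidden positions contribute, but each prediction can use the words on both sides.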

What Makes GPT-2 Special?

What makes GPT-2 special is its scale and its ability to perform a range of tasks without any task-specific training. This is often referred to as "zero-shot" learning. Given a prompt, GPT-2 generates a continuation of the text that aligns with the intended task, demonstrating a surprising amount of "understanding" despite never being explicitly trained on that task.
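In practice, "zero-shot" means the task is specified entirely through the text the model is asked to continue. The sketch below builds two such prompts in the style described in the GPT-2 paper: appending "TL;DR:" to an article to induce a summary, and listing "english sentence = french sentence" pairs to induce a translation. The helper names are hypothetical, and no model is called; the code only shows how the conditioning text is assembled.

```python
def summarization_prompt(article):
    """Append a 'TL;DR:' cue so that a continuation reads as a summary."""
    return article.strip() + "\nTL;DR:"

def translation_prompt(example_pairs, sentence):
    """Show example pairs as 'english = french', then leave the last one open."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{sentence} =")
    return "\n".join(lines)

article = "The city council voted on Tuesday to expand the bike lane network."
pairs = [("good morning", "bonjour"), ("thank you", "merci")]

print(summarization_prompt(article))
print(translation_prompt(pairs, "see you tomorrow"))
```

The model's weights are the same in both cases. Only the prompt changes, which is why small differences in wording can noticeably affect the output.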

Additionally, the scale of GPT-2 (1.5 billion parameters) was impressive at the time of its release and contributed to its strong performance. The model was trained on a diverse range of internet text, and because that training is unsupervised, it doesn't require any task-specific training data.

However, it's worth noting that while GPT-2 is a powerful model, it's not without its shortcomings. The model can sometimes generate text that is plausible-sounding but factually incorrect, and it can be sensitive to the exact wording and phrasing of the input prompt.

Model Size

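In addition to the full 1.5-billion-parameter model, there are also smaller versions with fewer parameters: the released checkpoints have roughly 124 million, 355 million, and 774 million parameters.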
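Model size follows almost entirely from the number of layers and the hidden width. The back-of-the-envelope sketch below uses the layer counts and widths reported in the GPT-2 paper (not stated in this article), together with GPT-2's 50,257-token vocabulary and 1,024-token context. It ignores biases and layer-norm parameters, so the results are slight underestimates, and the function name is an assumption.

```python
def approx_params(n_layer, d_model, vocab_size=50257, n_ctx=1024):
    """Rough transformer parameter count, ignoring biases and layer norms.

    Per block: 4 * d^2 for the attention projections (Q, K, V, output)
    plus 8 * d^2 for the feed-forward network (d -> 4d -> d).
    Embeddings: token table plus learned position table; the output
    layer reuses the token table, so it adds nothing.
    """
    per_block = 12 * d_model ** 2
    embeddings = (vocab_size + n_ctx) * d_model
    return n_layer * per_block + embeddings

# (layers, hidden width) for the four GPT-2 sizes, as reported in the paper.
configs = [(12, 768), (24, 1024), (36, 1280), (48, 1600)]
for n_layer, d_model in configs:
    millions = approx_params(n_layer, d_model) / 1e6
    print(f"{n_layer} layers, d_model={d_model}: ~{millions:.0f}M parameters")
```

The estimates come out near 124M, 355M, 773M, and 1,557M, which lines up with the commonly quoted sizes and shows that most of the parameters sit in the stacked blocks rather than in the embeddings.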
