"Language Models are Unsupervised Multitask Learners" by Radford et al. was published by OpenAI in 2019. This paper introduced GPT-2, an improved version of GPT {Generative Pretraining Transformer}, which was a highly influential model in the field of Natural Language Processing {NLP}.
GPT-2 uses a transformer model, which is an architecture that relies heavily on self-attention mechanisms. The transformer was first introduced in the paper "Attention is All You Need" by Vaswani et al. (2017).
The key innovation of the transformer architecture is the self-attention mechanism, also known as scaled dot-product attention. Given a sequence of inputs (x_1, x_2, \ldots, x_n), the self-attention mechanism computes a weighted sum of the inputs, where the weight assigned to each input is determined by the input's compatibility with all other inputs.
The self-attention mechanism is formally described as follows:
Given a query matrix \(Q\), key matrix \(K\), and value matrix \(V\) (whose rows are the query, key, and value vectors for each position), the output is computed as:
\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
where \(d_k\) is the dimensionality of the query and key vectors.
In the context of the transformer model, the query, key, and value vectors are all derived from the input to the self-attention layer. They are computed by multiplying the input by learned weight matrices \(W_Q\), \(W_K\), and \(W_V\), respectively.
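As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product self-attention with learned projections. The shapes, random weights, and toy inputs are assumptions for illustration only, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    X:        (n, d_model) sequence of n input vectors.
    W_Q, W_K: (d_model, d_k) learned projection matrices.
    W_V:      (d_model, d_v) learned projection matrix.
    Returns an (n, d_v) matrix of attention outputs.
    """
    Q = X @ W_Q                               # queries
    K = X @ W_K                               # keys
    V = X @ W_V                               # values
    d_k = Q.shape[-1]
    # Compatibility of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of the values

# Toy usage with random weights (illustration only).
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)   # (5, 8)
```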
The transformer model stacks multiple such self-attention layers (interleaved with feed-forward neural networks) to form the final model. GPT-2 specifically uses a decoder-only transformer: it keeps only the decoder-style blocks of the original architecture, with masked self-attention so that each position can attend only to itself and earlier positions, and drops the encoder and the encoder-decoder cross-attention.
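A common way to express this masking is to add a causal mask to the attention scores before the softmax. A short sketch, continuing the toy NumPy example above (the mask value of -1e9 is just a conventional stand-in for "negative infinity"):

```python
# Recompute Q, K, V with the same toy projections as before.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Causal mask: position i may attend only to positions j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(mask, -1e9, scores)   # masked scores get ~zero weight after softmax
weights = softmax(scores, axis=-1)
causal_out = weights @ V                # each output row depends only on earlier positions
```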
GPT-2 was trained on WebText, a large corpus of internet text (roughly 40 GB scraped from outbound Reddit links). It uses a language modeling objective: it is trained to predict the next token in a sequence given the previous tokens. This is an unsupervised learning task, as it doesn't require any labeled data.
The training objective for a language model is typically the cross-entropy loss:
\[ \mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i) \]
where \(y_i\) are the true labels (a one-hot encoding of the actual next token in the text), and \(\hat{y}_i\) are the model's predicted probabilities over the vocabulary.
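To make the objective concrete, here is a small sketch of how next-token targets are formed by shifting the sequence and how the cross-entropy loss is computed. The toy vocabulary, token ids, and random logits are assumptions standing in for GPT-2's actual tokenizer and model outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: a sequence of 6 token ids over a vocabulary of size 10.
vocab_size = 10
tokens = np.array([3, 1, 4, 1, 5, 9])

# Inputs are all tokens but the last; targets are the same sequence shifted by one.
inputs, targets = tokens[:-1], tokens[1:]

# Stand-in for the model: random logits of shape (len(inputs), vocab_size).
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))
probs = softmax(logits, axis=-1)

# Cross-entropy: negative log-probability assigned to each true next token,
# averaged over positions in the sequence.
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print(loss)
```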
The key finding of the GPT-2 paper was that a language model trained on a diverse range of internet text could generate coherent and diverse paragraphs of text. When given a short prompt (such as "In a shocking turn of events,"), GPT-2 could generate a continuation of the text that was both contextually relevant and linguistically sophisticated.
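For readers who want to try this themselves, here is a short sketch using the Hugging Face `transformers` library (not part of the original paper); the sampling settings are arbitrary choices for illustration.

```python
# Requires: pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "In a shocking turn of events,"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation; top-k sampling parameters here are illustrative.
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```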
This finding has significant implications for the field of NLP. It suggests that a single, large-scale language model can be fine-tuned for a variety of specific tasks, effectively serving as a general-purpose "text understanding" model.
However, the paper also highlighted the potential risks of such powerful language models. For example, they could be used to generate misleading news articles or spam at scale. As a result, OpenAI initially chose not to release the full model, citing concerns about malicious use.
Yes, there are some key differences between the GPT-2 architecture and other transformer-based models like BERT (a toy sketch of the two training objectives follows this list):
Directionality: GPT-2 is unidirectional (left-to-right); each position can attend only to earlier positions. BERT is bidirectional, attending to context on both sides of each token.
Training Objective: GPT-2 is trained with a causal language modeling objective (predict the next token). BERT is trained with a masked language modeling objective (predict randomly masked tokens), along with next-sentence prediction.
Use Case: GPT-2 is well suited to text generation, since it naturally produces text left to right. BERT is typically fine-tuned for understanding tasks such as classification, named entity recognition, and question answering.
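As a rough, word-level sketch of how the two objectives differ in practice (this is an illustration only, not the actual BPE or WordPiece tokenization either model uses):

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-2 (causal LM): predict each token from everything to its left.
causal_examples = [
    (sentence[:i], sentence[i]) for i in range(1, len(sentence))
]
# e.g. (["the", "cat"], "sat")

# BERT (masked LM): hide some tokens and predict them using context on both sides.
masked_input = ["the", "cat", "[MASK]", "on", "the", "[MASK]"]
masked_targets = {2: "sat", 5: "mat"}
```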
What makes GPT-2 special is its scale and its ability to perform a range of tasks without any task-specific training. This is often referred to as "zero-shot" learning. Given a prompt, GPT-2 generates a continuation of the text that aligns with the intended task, demonstrating a surprising amount of "understanding" despite never being explicitly trained on that task.
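For instance, the paper induces summarization zero-shot by appending "TL;DR:" to an article and letting the model continue. A hedged sketch, reusing the `tokenizer` and `model` from the earlier `transformers` example (the article text here is invented for illustration):

```python
article = "Researchers announced a new battery design that charges in minutes ..."
prompt = article + "\nTL;DR:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

summary_ids = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(summary_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```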
Additionally, the scale of GPT-2 (1.5 billion parameters in its largest version) was impressive at the time of its release and contributed to its strong performance. Because its training objective is unsupervised, it requires no task-specific labeled data.
However, it's worth noting that while GPT-2 is a powerful model, it's not without its shortcomings. The model can sometimes generate text that is plausible-sounding but factually incorrect, and it can be sensitive to the exact wording and phrasing of the input prompt.
GPT-2: GPT-2 has 1.5 billion parameters in its largest configuration. Transformer models are usually described by their parameter count rather than a "neuron" count; the parameters are the entries of the learned weight matrices (attention projections, feed-forward layers, and embeddings).
GPT-3: GPT-3 significantly scales up the architecture to 175 billion parameters, more than a hundred times the size of GPT-2.
GPT-4: OpenAI has not disclosed GPT-4's parameter count; figures such as "around a trillion parameters" are unconfirmed public estimates.
GPT-2 was also released in several smaller configurations (with roughly 120M, 350M, and 760M parameters) alongside the full 1.5B-parameter model.