BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" is a seminal paper in the field of Natural Language Processing (NLP), published by Devlin et al. from Google AI Language in 2018. BERT stands for Bidirectional Encoder Representations from Transformers.

This paper introduced BERT, a method for pre-training language representations with a deep, bidirectional Transformer. BERT's main technical innovation is conditioning on both left and right context in every layer during language-model pre-training. This is in contrast to previous efforts, which either read a text sequence from left to right (as in OpenAI GPT) or shallowly combined independently trained left-to-right and right-to-left models (as in ELMo).

The main contributions of the paper are:

  1. Introduction of BERT: A method for pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus, and then use that model for downstream NLP tasks.

  2. Novel Training Strategies: Two novel pre-training strategies are proposed: Masked Language Model (MLM) and Next Sentence Prediction (NSP).

Let's dive into the details.

BERT Model Architecture

BERT's model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al., 2017. The paper primarily reports results for two model sizes:

  1. BERT-BASE: Chosen to match the model size of OpenAI GPT for comparison purposes, it uses 12 layers (Transformer blocks), a hidden size of 768, 12 attention heads, and 110M parameters.
  2. BERT-LARGE: A significantly larger model with 24 layers, a hidden size of 1024, 16 attention heads, and 340M parameters.
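
As a rough sanity check on the parameter counts above, the sketch below adds up the weights of a standard Transformer encoder. It assumes details that the text does not state: a 30,522-entry vocabulary (the size of the released English uncased vocabulary), 512 positions, 2 segment types, a feed-forward size of 4x the hidden size, and a pooler layer. The function name `bert_param_count` is made up for illustration. Note that the number of attention heads does not affect the count, because the heads split the same projection matrices.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT encoder with pooler (no pre-training heads)."""
    layer_norm = 2 * hidden  # scale and bias vectors
    # Token, position and segment embedding tables, followed by one layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    ffn_hidden = ffn_mult * hidden
    # Two dense layers: hidden -> ffn_hidden -> hidden.
    feed_forward = (hidden * ffn_hidden + ffn_hidden) + (ffn_hidden * hidden + hidden)
    # Each block applies a layer norm after attention and after the feed-forward net.
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-BASE:  {base / 1e6:.1f}M parameters")
print(f"BERT-LARGE: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals come out close to the 110M and 340M figures quoted in the paper; the small gap for the larger model depends on what exactly is counted.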

Training Strategies

BERT uses two training strategies:

Masked Language Model (MLM)

In this strategy, some of the input tokens are masked at random and the model predicts the original tokens at those positions. Specifically, 15% of the WordPiece tokens in each sequence are selected for prediction; of these, 80% are replaced with a special [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model then predicts the original token at each selected position based on the context on both sides.

The objective of the MLM training is:

\[ L_{\text{MLM}} = -\log P(\text{Word} \mid \text{Context}) \]

where Context refers to the non-masked words, and Word refers to the original word at a masked position.
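
The masking step can be sketched in a few lines of Python. This is a simplified illustration, not the original preprocessing code: in the paper's procedure, 15% of tokens are selected for prediction, and a selected token becomes [MASK] 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time. The sketch selects each token independently with probability 0.15 (an assumption made for brevity), and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels).

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select special tokens; select other tokens with prob. mask_prob.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


toy_vocab = ["my", "dog", "is", "cute", "he", "likes", "play", "##ing"]
sentence = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
masked_inputs, mlm_labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(masked_inputs)
print(mlm_labels)
```

The MLM loss is then computed only at positions where the label is not None.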

Next Sentence Prediction (NSP)

In addition to the masked language model, BERT is also trained on a next sentence prediction task. For each training example, the model gets two sentences A and B, and must predict if B is the next sentence that follows A in the original document.

The objective of the NSP training is:

\[ L_{\text{NSP}} = -\log P(\text{IsNext} \mid A, B) \]

where IsNext is a binary label indicating whether sentence B is the next sentence that follows sentence A.
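
A minimal sketch of how such sentence pairs could be built from a corpus follows. It assumes the corpus is a list of documents, each a list of sentences, and that negative examples are drawn from a different document; the function name `make_nsp_examples` and the toy corpus are hypothetical, and the real preprocessing pipeline is more involved.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples.

    documents: list of documents, each a list of sentences.
    About half of the pairs use the true next sentence; the other half use
    a random sentence from a different document.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample negatives")
    rng = rng or random.Random()
    examples = []
    for d, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                examples.append((doc[i], doc[i + 1], True))
            else:
                other = rng.choice([k for k in range(len(documents)) if k != d])
                examples.append((doc[i], rng.choice(documents[other]), False))
    return examples


corpus = [
    ["The man went to the store.", "He bought a gallon of milk."],
    ["Penguins are flightless birds.", "They live in the Southern Hemisphere."],
]
for a, b, is_next in make_nsp_examples(corpus, rng=random.Random(0)):
    print(is_next, "|", a, "|", b)
```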

The final loss function used to pre-train BERT is the sum of the MLM loss (averaged over the masked positions) and the NSP loss.
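
To make the combined objective concrete, here is a small numeric sketch. It assumes the model produces one vector of logits over the vocabulary for each masked position and one vector of two logits for the NSP decision; both losses are softmax cross-entropies, and the total is their sum. The function names and the toy logits are illustrative only.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_target):
    """Mean MLM loss over masked positions plus the NSP loss."""
    mlm = sum(cross_entropy(l, t) for l, t in zip(mlm_logits, mlm_targets))
    mlm /= len(mlm_targets)
    nsp = cross_entropy(nsp_logits, nsp_target)
    return mlm + nsp


# Two masked positions over a toy 4-token vocabulary, plus one NSP decision.
toy_mlm_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
toy_mlm_targets = [0, 2]
toy_nsp_logits = [1.5, -0.5]  # index 0 = "IsNext", index 1 = "NotNext" (assumed order)
print(pretraining_loss(toy_mlm_logits, toy_mlm_targets, toy_nsp_logits, 0))
```

With uniform logits the model is guessing, so the loss reduces to log(vocabulary size) + log(2), which is a handy baseline when reading training curves.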


BERT has had a significant impact on the field of NLP. By pre-training a deep, bidirectional Transformer, BERT is able to capture intricate context dependencies on both sides of each token. At the time of publication, this led to state-of-the-art results on a variety of NLP tasks, including question answering and named entity recognition.

One of the key advantages of BERT is that it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, without substantial task-specific architecture modifications.

However, it's worth noting that BERT is resource-intensive to pre-train, in terms of both computation and data: the paper pre-trains on BooksCorpus (800M words) and English Wikipedia (2,500M words).

BERT architecture structure:

  1. Input Embeddings: BERT uses WordPiece embeddings with a 30,000 token vocabulary. Each token's input representation is the sum of its token, segment, and position embeddings, so a single token sequence can represent either one sentence or a pair of sentences (e.g., Question, Answer). During pre-training, the model is fed two sentences at a time; 50% of the time the second sentence is the actual next sentence, and 50% of the time it is a random sentence.

  2. Transformer Blocks: These are the heart of BERT, which uses the Transformer encoder as its core. BERT-BASE consists of 12 Transformer blocks, and BERT-LARGE consists of 24 Transformer blocks. Each block combines multi-head self-attention with a position-wise feed-forward network and processes all positions of the input in parallel, rather than sequentially as in an RNN or LSTM.

  3. Pooler: The pooler takes as input the final hidden state corresponding to the first token in the input (the [CLS] token), applies a dense layer and tanh activation, and outputs a vector. This output vector serves as the aggregate sequence representation for classification tasks.

  4. Output Layer: For different downstream tasks, there will be different types of output layers. For instance, in text classification tasks, a softmax layer is commonly used as the output layer to output probabilities of different classes.
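
The input packing from item 1 can be sketched as follows. The sketch assumes the layout used in the paper: a [CLS] token at the start, a [SEP] token after each sentence, segment id 0 for the first sentence and 1 for the second, and position ids counting from 0. The function name `pack_pair` is hypothetical, and tokenization into WordPieces is assumed to have happened already.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into a single BERT-style input.

    Returns (tokens, segment_ids, position_ids), all of equal length.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # +1 for the trailing [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:2d}  segment={seg}  {tok}")
```

Each of the three id sequences indexes its own embedding table, and the three looked-up vectors are added to form the input to the first Transformer block.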

The BERT model is then fine-tuned on specific tasks with additional output layers, which is one of the reasons for its effectiveness on a wide range of NLP tasks.
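
For a classification task, the pooler and output layer described above amount to two small dense layers on top of the [CLS] hidden state. The following pure-Python sketch uses a toy hidden size of 2 and hand-written weights; the function names and all numbers are illustrative assumptions, and in a real model the weights are learned during fine-tuning.

```python
import math


def dense(x, weights, bias):
    """Compute y = W x + b, with `weights` given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def classify(hidden_states, pooler_w, pooler_b, cls_w, cls_b):
    """Pool the [CLS] hidden state and map it to class probabilities."""
    cls_vector = hidden_states[0]  # final hidden state of the first token, [CLS]
    pooled = [math.tanh(v) for v in dense(cls_vector, pooler_w, pooler_b)]
    return softmax(dense(pooled, cls_w, cls_b))


# Toy final-layer hidden states for a 3-token input, hidden size 2.
toy_hidden = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.3]]
toy_pooler_w, toy_pooler_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
toy_cls_w, toy_cls_b = [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]  # 2 output classes
print(classify(toy_hidden, toy_pooler_w, toy_pooler_b, toy_cls_w, toy_cls_b))
```

Only the classifier weights are new for each task; everything below them starts from the pre-trained model and is updated during fine-tuning.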

Tags: BERT, Transformers