BERT (Bidirectional Encoder Representations from Transformers)
1 BERT Intuition
BERT is a language representation model that is pre-trained on a large corpus of text and can be fine-tuned for various NLP tasks. It stands for Bidirectional Encoder Representations from Transformers.
- Encoder Representations: a language-modelling system that is pre-trained on unlabeled data and then fine-tuned
- From Transformers: based on the transformer architecture, which is a neural network architecture that uses self-attention mechanisms to process input data.
- Bidirectional: BERT is designed to understand the context of a word based on all of its surroundings (left and right of the word). This is in contrast to traditional language models that read text input sequentially (left-to-right or right-to-left).
2 Embedding
Embedding is a technique used in natural language processing (NLP) to represent words or phrases as vectors in a continuous vector space.
2.1 One Hot Encoding
An early approach to word representation is one-hot encoding.
It conveys information about which words we are dealing with, but nothing about the relations between those words.
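As a minimal sketch of why this is limited (a tiny made-up vocabulary, plain NumPy): one-hot vectors are all orthogonal, so the dot product between any two different words is 0 and carries no similarity information.

```python
import numpy as np

# Toy vocabulary for illustration only
vocab = ["king", "queen", "man", "woman", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a 1 x vocab_size one-hot vector for the given word."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

king, queen = one_hot("king"), one_hot("queen")
print(king)          # [1. 0. 0. 0. 0.]
print(king @ queen)  # 0.0 -> no notion of similarity between different words
```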
2.2 Word Embedding
Instead of using a sparse vector of size vocab_size, we can use a dense vector of size embedding_dim. In word embedding, each vector is much smaller and encodes relations between words.
It forces the system to create relations, links, or even meaning in the process.
- We get a mathematical representation of the word, e.g., king − man + woman ≈ queen
- In a 2D projection of the embedding space, words with similar meanings appear close together
The embedding process works as follows:
Input Layer: One-Hot Encoding
- Process: Each word is represented as a one-hot encoded vector.
- Dimensions: [1 × vocabulary_size] = [1 × 10,000]
- Example: The word “dog” (217th word) has a vector of 10,000 elements with a 1 at position 217 and 0s elsewhere.
Embedding Matrix (W)
- Process: The one-hot vector multiplies with the embedding matrix W.
- Dimensions: [vocabulary_size × embedding_dimension] = [10,000 × 64]
- Computation: Effectively selects one row from W.
- Example: Multiplying: [1×10,000] × [10,000×64] = [1×64]
Hidden Layer (Word Embedding)
- Process: The selected row becomes the word’s dense embedding vector.
- Dimensions: [1 × embedding_dimension] = [1 × 64]
- Example: “dog” is now represented as a 64-dimensional vector like [0.2, -0.4, 0.1, …, 0.7]
Context Matrix (W’)
- Process: The embedding vector multiplies with the context matrix W’.
- Dimensions: [embedding_dimension × vocabulary_size] = [64 × 10,000]
- Computation: Multiplying: [1×64] × [64×10,000] = [1×10,000]
- Example: The 64-dimensional vector for “dog” transforms into a 10,000-dimensional score vector.
Output Layer (Softmax)
- Process: The score vector goes through a softmax function.
- Input Dimensions: [1 × vocabulary_size] = [1 × 10,000]
- Output Dimensions: Same [1 × 10,000], but values now sum to 1 (probabilities)
- Example: Values in this vector represent probabilities of each word appearing in context with “dog”.
Training Process Example with Dimensions
- Sentence: “The friendly dog plays in the park”
- Input (center) word: “dog” (one-hot vector of dimension [1×10,000]); the surrounding context words are the targets
- Training pairs:
- (“dog”, “the”) where “the” is the target label (position in 10,000-dim vector)
- (“dog”, “friendly”)
- (“dog”, “plays”)
- (“dog”, “in”)
- Full computation path:
- Input: [1×10,000] one-hot vector for “dog”
- Matrix W: [10,000×64]
- Embedding: [1×64] dense vector
- Matrix W’: [64×10,000]
- Output scores: [1×10,000]
- After softmax: [1×10,000] probability distribution
Concrete Numeric Example
- Input for “dog”: [0,0,…,1 (at position 217),…,0] with dimension [1×10,000]
- Embedding matrix W: 10,000 rows × 64 columns
- Extracted embedding: Row 217 of W, with dimension [1×64]
- Context matrix W’: 64 rows × 10,000 columns
- Output vector before softmax: [1×10,000]
- Probability vector after softmax: [1×10,000]
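A minimal NumPy sketch of this forward pass, using the dimensions from the example above (the matrices are random placeholders, not trained values):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 64
rng = np.random.default_rng(0)

W = rng.normal(size=(vocab_size, embed_dim))      # embedding matrix  [10,000 x 64]
W_ctx = rng.normal(size=(embed_dim, vocab_size))  # context matrix    [64 x 10,000]

# One-hot input for "dog" (position 217)
x = np.zeros((1, vocab_size))
x[0, 217] = 1.0

embedding = x @ W              # [1 x 64]     -- effectively selects row 217 of W
scores = embedding @ W_ctx     # [1 x 10,000] -- score for every vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()           # softmax: probabilities of context words, sums to 1

print(embedding.shape, probs.shape, probs.sum())  # (1, 64) (1, 10000) 1.0
```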
3 BERT’s General Idea
We can fine-tune BERT using transfer learning.
In the semi-supervised stage, we do not have a specific task because the data comes without labels. We simply use a huge amount of unlabeled data to let the model learn our language. This huge corpus includes Wikipedia, books, and other text data. The model learns the language by predicting words from their context (for BERT, masked words rather than the next word).
In the supervised stage, we have a specific task with labeled data. We can then fine-tune the model for tasks like sentiment analysis, named entity recognition, etc.
Earlier transfer-learning approaches in NLP include ELMo, ULMFiT, and GPT. They are all unidirectional (or at best pseudo-bidirectional). BERT is the first deeply bidirectional model: it can look at the left and right context of a word at the same time.
ELMo is pseudo-bidirectional and uses LSTMs (RNNs). To simulate bidirectionality, it uses two LSTMs: one reads left-to-right and predicts a word from the words to its left, and the other reads right-to-left and predicts a word from the words to its right. The two representations are then concatenated into a single representation. The weaknesses of LSTMs are that they may fail to capture long-term dependencies and that they tend to give the same importance/weight to each word.
GPT uses the transformer architecture but is unidirectional: it only looks at the left context of a word. It is trained with standard left-to-right language modeling, predicting the next word in a sentence. Its weakness is that it can only use the left context of a word.
BERT uses the transformer architecture and is bidirectional: it can look at the left and right context of a word at the same time / in the same neuron. It is trained with masked language modeling (MLM): we randomly mask some words in the input sentence, and the model has to predict the masked words based on the context of the other words. For example, in the sentence “The cat sat on the [MASK]”, the model has to predict the word “mat” from the context of the other words. The model is trained this way on a large corpus of text data.
3.1 BERT in Pre-Training
In its pre-training stage, BERT acquires two core capabilities:
Pre-trained Objectives
BERT was pre-trained on two specific tasks:
- Masked Language Modeling (MLM)
- BERT can predict masked words in a sentence
- Example: “The [MASK] is bright today” → BERT predicts “sun” with high probability
- This demonstrates understanding of context and semantics
- Next Sentence Prediction (NSP)
- BERT can predict if two sentences naturally follow each other
- Example: “I went to the store. [SEP] They were closed.” → BERT predicts these sentences are related
- This shows understanding of discourse and coherence
Contextual Embeddings
Without fine-tuning, BERT can still generate high-quality contextual embeddings:
- You can extract token-level or sentence-level embeddings from any layer
- These embeddings capture semantic and syntactic information
- Unlike static embeddings (Word2Vec, GloVe), BERT’s embeddings are context-dependent
- Example: “bank” has different embeddings in “river bank” vs. “bank account”
Practical uses without fine-tuning:
Feature Extraction
- Extract BERT embeddings for use in other models
- Use these embeddings as input features for traditional ML models
Similarity Comparison
- Calculate cosine similarity between embeddings
- Useful for semantic search or finding similar documents
Clustering
- Group similar sentences or documents based on embedding similarity
- Useful for exploratory text analysis
Zero-shot Word Prediction
- Predict missing words in sentences using MLM objective
- Limited to vocabulary seen during pre-training
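A minimal sketch of feature extraction and similarity comparison without fine-tuning, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the sentences and the mean-pooling choice are illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pooled token embeddings from the last layer (one simple pooling choice)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, 768]
    # hidden[:, 0] would be the [CLS] representation; here we average all tokens.
    return hidden.mean(dim=1).squeeze(0)            # [768]

a = sentence_embedding("He deposited money at the bank.")
b = sentence_embedding("She sat on the river bank.")
print(torch.cosine_similarity(a, b, dim=0).item())  # cosine similarity of the two sentences
```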
3.2 BERT in Fine-Tuning
Through fine-tuning, BERT can be adapted to specific tasks.
3.2.1 Sentence-Level Tasks (Using [CLS] Token)
For all sentence-level tasks, we add a task-specific layer on top of the [CLS] token representation (a minimal code sketch follows the examples below):
Sentiment Analysis Mode
- Add a simple feed-forward layer to the [CLS] token representation
- Use a softmax or sigmoid activation based on whether it’s multi-class or binary
- Fine-tune with labeled sentiment data (positive/negative/neutral)
- Loss function: Cross-entropy loss
Topic Classification Mode
- Add a feed-forward layer with neurons equal to the number of topic categories
- Apply softmax activation for multi-class classification
- Fine-tune with topic-labeled data
- Loss function: Cross-entropy loss
Natural Language Inference Mode
- Add a feed-forward layer with 3 outputs (entailment, contradiction, neutral)
- Input: [CLS] premise [SEP] hypothesis [SEP]
- Fine-tune with labeled NLI data (like SNLI or MNLI)
- Loss function: Cross-entropy loss
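As a minimal sketch, assuming the Hugging Face transformers library, the bert-base-uncased checkpoint, and a hypothetical 3-class sentiment task, a classification head on the [CLS] representation could look like this:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertSentenceClassifier(nn.Module):
    """BERT encoder plus a feed-forward classification head on the [CLS] token."""

    def __init__(self, num_labels: int = 3, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]  # [CLS] is always the first token
        return self.classifier(cls_repr)            # raw logits; train with cross-entropy

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertSentenceClassifier(num_labels=3)
batch = tokenizer(["I loved this movie!"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # shape: [1, 3]
```

During fine-tuning, both the new head and the BERT weights are updated with the cross-entropy loss described above.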
3.2.2 Token-Level Tasks
For token-level tasks, we use all token representations except [CLS] and [SEP]:
Named Entity Recognition Mode
- Add a token classification layer on top of each token representation
- Typically a fully connected layer with softmax over entity classes (PERSON, LOCATION, etc.)
- Either use token-by-token classification or CRF layer for sequence labeling
- Fine-tune with NER-labeled data
- Loss function: Token-wise cross-entropy loss
Part-of-Speech Tagging Mode
- Add a token classification layer for POS tags
- Each token gets its own softmax over possible POS tags
- Fine-tune with POS-tagged data
- Loss function: Token-wise cross-entropy loss
Question Answering Mode (see the sketch after this list)
- Add two separate feed-forward layers to each token representation:
- Start-position scorer: Predicts probability of token being answer start
- End-position scorer: Predicts probability of token being answer end
- Input format: [CLS] question [SEP] context [SEP]
- Fine-tune with QA data (question-context-answer triplets)
- Loss function: Sum of cross-entropy losses for start and end positions
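As a minimal sketch of the question-answering head, assuming the Hugging Face transformers library and bert-base-uncased (the question and context strings are made-up examples):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertSpanQA(nn.Module):
    """BERT encoder plus start/end position scorers for extractive QA."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # One linear layer producing two scores (start, end) for every token.
        self.qa_head = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        start_logits, end_logits = self.qa_head(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertSpanQA()
# The tokenizer builds the [CLS] question [SEP] context [SEP] format from the pair below.
batch = tokenizer("Where does the dog play?",
                  "The friendly dog plays in the park.",
                  return_tensors="pt")
start_logits, end_logits = model(batch["input_ids"], batch["attention_mask"])
answer_start = start_logits.argmax(dim=-1)  # most likely start token
answer_end = end_logits.argmax(dim=-1)      # most likely end token
```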
BERT itself is a language-modelling system.
- It is not a task-specific model; it is a general-purpose language model that can be fine-tuned for various NLP tasks.
- BERT is not a generative model; it is a discriminative model.
- It is not designed to generate text; it is designed to understand text.
- It only has the encoder part of the transformer architecture, not the decoder part. The encoder is used to understand text, while the decoder is used to generate text.
4 BERT’s History: from RNN to Transformer
4.1 RNN
Previously, tasks such as text summarization or translation were handled by RNNs in a Seq2Seq model. However, in a long sequence the information from the early words fades, and the decoder loses their context.
The attention mechanism solves this long-term dependency problem in RNNs. It allows the model to focus on specific parts of the input sequence when making predictions: the attention weights, used to build a context vector, assign an importance to each part of the input sequence, so the model can focus on the most relevant parts.
Take the example of predicting \(g_2\): in a plain RNN we would use \(g_1\) and its output \(she\) for the prediction. With attention, we also add a context vector, a weighted sum of all the encoder hidden states, to the computation of \(g_2\).
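Concretely, writing the encoder hidden states as \(h_1, \dots, h_n\) and the attention weights as \(\alpha_{t,i}\) (notation assumed here for illustration, not taken from the example above), the context vector for decoder step \(t\) is:

\[
c_t = \sum_{i=1}^{n} \alpha_{t,i} \, h_i,
\qquad
\alpha_{t,i} = \frac{\exp\big(\mathrm{score}(g_{t-1}, h_i)\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{score}(g_{t-1}, h_j)\big)}
\]

where \(\mathrm{score}\) is a learned compatibility function between the previous decoder state and each encoder state.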
4.2 Transformer
The transformer works as follows:
Input Processing: The transformer encoder takes the input sentence “How are you doing?” and converts each token into embeddings. These embeddings are combined with positional encodings to retain sequence order information.
Encoder Processing: The encoder processes these embeddings through multiple layers of self-attention and feed-forward neural networks. Self-attention allows each token to attend to all other tokens in the input, creating contextual representations. Multi-head attention is based on the attention mechanism but expanded to multiple instances, so that different attention heads can focus on different parts of the input sequence. For example, one head might focus on the subject of the sentence, while another head might focus on the verb.
Decoder Initialization: During inference, the decoder starts with a special start-of-sequence token (often written <SOS> or <start>).
Decoder Processing: The decoder has three main components:
- Masked self-attention (to prevent seeing future tokens)
- Cross-attention with the encoder’s output (connecting the encoder and decoder)
- Feed-forward neural networks
Autoregressive Generation: The decoder generates one token at a time. After generating “I”, it takes both the encoder output AND the growing sequence “<SOS> I” to predict “am”. This continues until an end token is generated.
In one example sentence, “it” attends to “animal”; in the other, “it” attends to “street”.
4.3 Attention Mechanism
The attention mechanism is a technique that allows models to selectively focus on the most relevant parts of an input sequence by assigning importance weights when making predictions. It is what powers the transformer architecture.
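As a minimal sketch in plain NumPy, using the standard scaled dot-product formulation (an assumption here, since the text above does not fix a particular scoring function):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: [seq_q, d], K: [seq_k, d], V: [seq_k, d_v] -> output [seq_q, d_v]."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # importance weights, each row sums to 1
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional representations (made-up numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V = X
print(attn.round(2))  # each row: how strongly one token attends to every other token
```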
5 BERT’s Architecture
BERT is a stack of transformer encoder layers. Each encoder layer has 2 sub-layers:
- Multi-head self-attention
- Feed-forward neural network
BERT has two variants:
1. BERT Base: 12 encoder layers, 768 hidden size (embedding dimension), 12 self-attention heads, 110M parameters
2. BERT Large: 24 encoder layers, 1024 hidden size (embedding dimension), 16 self-attention heads, 340M parameters
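As a minimal sketch, assuming the Hugging Face transformers library (BertConfig and BertModel are its classes, not something defined in these notes), the BERT Base hyperparameters map directly onto a model configuration:

```python
from transformers import BertConfig, BertModel

# Hyperparameters of BERT Base expressed as a configuration object.
config = BertConfig(
    vocab_size=30522,        # WordPiece vocabulary size
    hidden_size=768,         # embedding / hidden dimension
    num_hidden_layers=12,    # number of stacked encoder layers
    num_attention_heads=12,  # self-attention heads per layer
    intermediate_size=3072,  # feed-forward sub-layer size
)

model = BertModel(config)    # a randomly initialised, BERT-Base-sized encoder
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters
```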
We use [CLS] and [SEP] tokens to separate sentences. The [CLS] token is used for classification tasks, and the [SEP] token is used to separate sentences in a pair.
BERT uses a WordPiece tokenizer.
Its vocabulary covers 30,522 tokens and consists of whole words, subwords, and special tokens, without any explicit categorization between them.
It deals with a new word by combining known pieces:
- For example, “unhappiness” might be split into “un”, “##happi”, and “##ness” (the “##” prefix marks a subword continuation; the exact split depends on the learned vocabulary)
- This allows BERT to handle out-of-vocabulary words and rare words more effectively
Trade-off between vocabulary size and out-of-vocabulary words:
- A smaller vocabulary means words are split into more subword pieces, so rare and unseen words can still be represented, at the cost of longer token sequences
- A larger vocabulary keeps more whole words and produces shorter sequences, but its rare tokens are seen less often during training and the embedding table grows
Each token corresponds to an ID, so from strings we get numbers, usable by computers:
- For example, the pieces “un”, “##happi”, and “##ness” are converted to their corresponding IDs in the vocabulary
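As a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the exact subword split it produces for “unhappiness” may differ from the illustration above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("unhappiness")     # WordPiece subword split
ids = tokenizer.convert_tokens_to_ids(tokens)  # vocabulary ID for each piece
print(tokens, ids)

# A full encoding of a sentence pair adds [CLS] and [SEP] automatically.
encoded = tokenizer("my dog is cute", "he likes playing")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])               # 0 = first sentence, 1 = second sentence
```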
Assume the tokenized input is: [CLS] my dog is cute [SEP] he likes play ##ing [SEP]. The encoder needs the following as its inputs:
Embedded words:
Each token is converted to its corresponding vector representation from BERT’s embedding table:
- “[CLS]” → Mapped to its embedding vector (dimension 768 for BERT-base)
- “my” → Mapped to its embedding vector
- “dog” → Mapped to its embedding vector
- “is” → Mapped to its embedding vector
- “cute” → Mapped to its embedding vector
- “[SEP]” → Mapped to its embedding vector
- “he” → Mapped to its embedding vector
- “likes” → Mapped to its embedding vector
- “play” → Mapped to its embedding vector
- “##ing” → Mapped to its embedding vector (the “##” prefix indicates this is a subword continuation)
- “[SEP]” → Mapped to its embedding vector
Indication about 1st and 2nd sentence
BERT adds segment embeddings to distinguish between sentences:
- First sentence tokens: “[CLS]”, “my”, “dog”, “is”, “cute”, first “[SEP]” → All receive Segment Embedding A (typically encoded as 0)
- Second sentence tokens: “he”, “likes”, “play”, “##ing”, second “[SEP]” → All receive Segment Embedding B (typically encoded as 1)
Positional embedding (like for the Transformer)
Each token position gets a unique positional embedding vector:
- “[CLS]” → Position 0 embedding
- “my” → Position 1 embedding
- “dog” → Position 2 embedding
- “is” → Position 3 embedding
- “cute” → Position 4 embedding
- “[SEP]” → Position 5 embedding
- “he” → Position 6 embedding
- “likes” → Position 7 embedding
- “play” → Position 8 embedding
- “##ing” → Position 9 embedding
- “[SEP]” → Position 10 embedding
The final input representation for each token is the element-wise sum of these three embeddings: Final Input = Token Embedding + Segment Embedding + Positional Embedding
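A minimal PyTorch sketch of this sum, with sizes matching BERT Base (the token IDs below are random placeholders for the 11 tokens of the example input):

```python
import torch
import torch.nn as nn

vocab_size, max_positions, hidden = 30522, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)         # one vector per vocabulary entry
segment_emb = nn.Embedding(2, hidden)                # segment A (0) or segment B (1)
position_emb = nn.Embedding(max_positions, hidden)   # learned positional embeddings

input_ids = torch.randint(0, vocab_size, (1, 11))    # placeholder IDs for the 11 tokens
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])  # sentence A vs. B
position_ids = torch.arange(11).unsqueeze(0)         # positions 0..10

final_input = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(final_input.shape)  # torch.Size([1, 11, 768])
```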
BERT produces contextualized representations for every token in the input sequence. These vectors capture both semantic meaning and contextual relationships. The output consists of:
[CLS] Token Representation
- A special vector (768 dimensions in BERT-base) corresponding to the [CLS] token
- Primary purpose: Sentence-level classification tasks (sentiment analysis, topic classification, etc.)
- This representation is trained during pre-training via Next Sentence Prediction (NSP)
- Acts as an aggregate representation of the entire sequence
Token-level Representations
- Individual vectors for each token in the input sequence
- Used for token-level tasks such as:
- Named Entity Recognition (NER)
- Part-of-speech tagging
- Question-answering (extracting answer spans)
- Token classification
- These representations are trained during Masked Language Modeling (MLM)
- Capture contextual meaning of words based on surrounding context
The power of BERT lies in how these representations can be fine-tuned for specific downstream tasks with minimal additional architecture.
6 BERT’s Pre-Training
During pre-training, BERT only has two objectives:
6.1 Masked Language Model (MLM)
15% of the tokens are selected for prediction. Each selected token is:
- replaced by the [MASK] token 80% of the time
- replaced by a random token 10% of the time
- left unchanged 10% of the time
Example: “[CLS] The [MASK] is bright today” → BERT predicts “sun” with high probability
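A minimal sketch of querying this MLM capability directly, assuming the Hugging Face transformers fill-mask pipeline and the bert-base-uncased checkpoint:

```python
from transformers import pipeline

# Zero-shot masked-word prediction with a pre-trained (not fine-tuned) BERT.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The [MASK] is bright today."):
    print(prediction["token_str"], round(prediction["score"], 3))
```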
6.2 Next Sentence Prediction (NSP)
The goal is to:
- Get a higher level understanding, from words to sentences
- Get access to more tasks, like question answering
Example: “I went to the store. [SEP] They were closed.” → BERT predicts these sentences are related
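A minimal sketch of running the NSP head, assuming the Hugging Face transformers BertForNextSentencePrediction class and the bert-base-uncased checkpoint:

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# The pair is encoded as: [CLS] sentence A [SEP] sentence B [SEP]
encoding = tokenizer("I went to the store.", "They were closed.", return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # index 0 = "B follows A", index 1 = "B is random"

print(torch.softmax(logits, dim=-1))   # probabilities for the two NSP outcomes
```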
7 CNN for NLP Architecture
It is also possible to use CNNs for NLP. However, we need to pay attention to the following:
- Each filter / feature detector must span the full embedding dimension; splitting the embedding dimension doesn’t make sense. E.g., the embedding for “like” is [1, 0, 0, 1, 0], and a filter covering only 3 of the 5 dimensions would see [1, 0, 0], which has a different meaning. A filter therefore covers N whole words times the full embedding dimension, so every feature map is a column vector of size (sentence_length − N + 1) × 1.
- We take one max-pooling value per feature map (max-over-time pooling): the exact position of a feature in the sentence matters less than whether it occurs.
- Multiple filter sizes (3 in this example) capture correlations between words at different scales, like different n-gram sizes (see the sketch after this list).
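A minimal PyTorch sketch of such a CNN for sentence classification (a TextCNN-style model; the vocabulary size, embedding dimension, filter sizes, and class count are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Embeddings -> parallel convolutions of different widths -> max-over-time pooling."""

    def __init__(self, vocab_size=10000, embed_dim=64, num_classes=2,
                 filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Each filter spans k words (height) and the FULL embedding dim (width),
        # so a word's embedding is never split.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(k, embed_dim)) for k in filter_sizes
        ])
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                   # token_ids: [batch, seq_len]
        x = self.embedding(token_ids).unsqueeze(1)  # [batch, 1, seq_len, embed_dim]
        pooled = []
        for conv in self.convs:
            fmap = torch.relu(conv(x)).squeeze(3)   # [batch, num_filters, seq_len - k + 1]
            pooled.append(fmap.max(dim=2).values)   # one max per feature map
        return self.fc(torch.cat(pooled, dim=1))    # [batch, num_classes]

model = TextCNN()
logits = model(torch.randint(0, 10000, (4, 20)))    # 4 sentences of 20 token IDs each
print(logits.shape)                                 # torch.Size([4, 2])
```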