1 An Introduction to Large Language Models

1.1 Representing Language as a Bag-of-Words

Split the words by whitespace

Each sentence is split into words (tokens) by splitting on whitespace.

Then convert into vocabulary

A vocabulary is created by retaining all unique words across both sentences.

With that vocabulary, we can represent each sentence as a vector that counts how often each vocabulary word occurs in it. This is called a bag-of-words representation; the resulting count vectors are also referred to as vectors or vector representations.

A bag-of-words is created by counting individual words. These values are referred to as vector representations.
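
The steps above (split on whitespace, build a vocabulary, count words) can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the two example sentences and the helper names `tokenize`, `build_vocabulary`, and `bag_of_words` are assumptions made for this sketch, and lowercasing is an extra choice the notes do not prescribe.

```python
from collections import Counter

def tokenize(sentence):
    # Lowercase (an assumption of this sketch) and split on whitespace
    return sentence.lower().split()

def build_vocabulary(sentences):
    # Keep every unique token across all sentences, in a stable sorted order
    return sorted({token for s in sentences for token in tokenize(s)})

def bag_of_words(sentence, vocabulary):
    # Count how often each vocabulary word occurs in the sentence
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

# Hypothetical example sentences
sentences = ["that is a cute dog", "my cat is cute"]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocabulary) for s in sentences]

print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(vector, "<-", sentence)
```

Each vector has one entry per vocabulary word, so both sentences end up with vectors of the same length even though the sentences differ in length.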

1.2 Better Representations with Dense Vector Embeddings

Bag-of-words has a flaw: it treats language as nothing more than an almost literal bag of words and ignores the semantic nature, or meaning, of text.

word2vec generates word embeddings by looking at which other words they tend to appear next to in a given sentence. We start by assigning every word in our vocabulary a vector embedding, say 50 values per word, initialized randomly. Then, in every training step, we take pairs of words from the training data, and a model attempts to predict whether or not they are likely to be neighbors in a sentence.

A neural network is trained to predict if two words are neighbors. During this process, the embeddings are updated to be in line with the ground truth.

For instance, the word “baby” might score high on the properties “newborn” and “human” while the word “apple” scores low on these properties.

The values of embeddings represent properties that are used to represent words. We may oversimplify by imagining that dimensions represent concepts (which they don’t), but it helps express the idea.

If we compress the embedding into two-dimensional space, we have

Embeddings of words that are similar will be close to each other in dimensional space.
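
One common way to measure how "close" two embeddings are is cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration (real embeddings are learned and much larger); the function name `cosine_similarity` and the values are assumptions of this example.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand so that "cat" and "dog" point in similar directions
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_apple = cosine_similarity(embeddings["cat"], embeddings["apple"])

print(f"cat vs dog:   {sim_cat_dog:.3f}")
print(f"cat vs apple: {sim_cat_apple:.3f}")
```

With these made-up vectors, the similarity between "cat" and "dog" comes out much higher than between "cat" and "apple", which is the behavior we hope trained embeddings exhibit.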

1.3 Types of Embeddings

Embeddings can be created at several levels of granularity:

Token embedding: represents individual tokens (e.g., word2vec, GloVe, fastText).
Word embedding: represents individual words (e.g., word2vec, GloVe, fastText). Trained by looking at the neighboring words in a sentence and updating the embedding values to be in line with the ground truth.
Sentence embedding: represents entire sentences or paragraphs (e.g., Universal Sentence Encoder, BERT-based models). A simple baseline is averaging the word embeddings of the words in the sentence.
Document embedding: represents entire documents (e.g., Doc2Vec, bag-of-words). A simple baseline is averaging the word embeddings of the words in the document.

Embeddings can be created for different types of input.
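
The averaging approach mentioned for sentence and document embeddings can be sketched as follows. The word vectors here are hypothetical four-dimensional values invented for the example, and `sentence_embedding` is an illustrative helper, not an API from any library. Dedicated sentence-embedding models generally do more than plain averaging.

```python
import numpy as np

# Hypothetical 4-dimensional word embeddings, for illustration only
word_vectors = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.7, 0.3, 0.0, 0.5]),
    "sleeps": np.array([0.1, 0.9, 0.4, 0.0]),
}

def sentence_embedding(sentence, word_vectors):
    # Look up each known word and average the vectors (mean pooling)
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in sentence")
    return np.mean(vectors, axis=0)

embedding = sentence_embedding("The cat sleeps", word_vectors)
print(embedding)
```

The result has the same dimensionality as a single word vector, regardless of how many words the sentence contains.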

1.4 Encoding and Decoding Context with Attention

“Bank” can mean a financial institution or a riverbank. To disambiguate the meaning of a word, we need to look at the context in which it appears. Recurrent neural networks (RNNs) are used to encode (represent) an input sentence and to decode (generate) an output sentence. Each step in this architecture is autoregressive, meaning that it consumes all previously generated words.

Two recurrent neural networks (decoder and encoder) translating an input sequence from English to Dutch

1.4.1 Encoder

The encoding step aims to represent the input as well as possible, generating the context in the form of an embedding, which serves as the input for the decoder. To generate this representation, it takes embeddings as its inputs for words, which means we can use word2vec for the initial representations.

Using word2vec embeddings, a context embedding is generated that represents the entire sequence.

A single context embedding struggles with longer sentences, since it is merely one embedding representing the entire input. Attention was introduced in 2014 to allow the model to focus on the parts of the input sequence that are relevant to one another and amplify their signal.

Attention allows a model to “attend” to certain parts of sequences that might relate more or less to one another.

1.4.2 Decoder

By adding attention to the decoder step, the RNN can generate signals for each input word in the sequence related to the potential output. Instead of passing only a context embedding to the decoder, the hidden states of all input words are passed.

After generating the words “Ik,” “hou,” and “van,” the attention mechanism of the decoder enables it to focus on the word “llamas” before it generates the Dutch translation (“lama’s”).

1.5 Attention Is All You Need

The paper proposed a network architecture, the Transformer, which was based solely on the attention mechanism and removed the recurrent network we saw before. Compared to the recurrent network, the Transformer could be trained in parallel, which tremendously sped up training. In the Transformer, encoder and decoder components are stacked on top of each other. The architecture is still autoregressive.

The Transformer is a combination of stacked encoder and decoder blocks where the input flows through each encoder and decoder.

Both the encoder and decoder use attention instead of an RNN with attention features.

1.5.1 Encoder

An encoder block consists of two parts:

Self-attention, and
Feedforward neural network

An encoder block revolves around self-attention to generate intermediate representations.

Self-attention can attend to different positions within a single sequence, thereby more easily and accurately representing the input sequence. Self-attention attends to all parts of the input sequence so that it can “look” both forward and back in a single sequence.

1.5.2 Decoder

A decoder block consists of three parts:

Masked self-attention
Encoder attention
Feedforward neural network

The decoder has an additional attention layer that attends to the output of the encoder.

Masked self-attention masks future positions so it only attends to earlier positions to prevent leaking information when generating the output.

Only attend to previous tokens to prevent “looking into the future.”
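
Masking future positions can be illustrated with a lower-triangular mask applied before the softmax. This is a small NumPy sketch under simplifying assumptions: the attention scores are placeholder zeros, and `causal_mask` and `masked_softmax` are names made up for this example.

```python
import numpy as np

def causal_mask(seq_len):
    # True where a position may attend: itself and earlier positions only
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf so they receive zero weight after softmax
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder attention scores (all equal)
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

Row `i` of `weights` only has nonzero entries in columns `0..i`, so the first token attends only to itself while the last token can attend to every earlier position.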

1.6 Representation Models: Encoder-Only Models

Representation models mainly focus on representing language, for instance, by creating embeddings, and typically do not generate text. In contrast, generative models focus primarily on generating text and typically are not trained to generate embeddings.

The original Transformer model is an encoder-decoder architecture that serves translation tasks well but cannot easily be used for other tasks, like text classification.

In 2018, Bidirectional Encoder Representations from Transformers (BERT) was introduced as an encoder-only architecture with a focus on representing language.

The architecture of a BERT base model with 12 encoders

The encoder is still the same self-attention and feedforward neural networks. The input contains an additional token, the [CLS] or classification token, which is used as the representation for the entire input. Often, we use this [CLS] token as the input embedding for fine-tuning the model on specific tasks, like classification.

BERT can be trained with two objectives:

Masked language modeling (MLM): randomly mask some tokens in the input and train the model to predict the masked tokens based on the context.

from transformers import pipeline

# Load a fill-mask pipeline backed by BERT
nlp = pipeline("fill-mask", model='bert-base-cased')
preds = nlp(f"If you don’t {nlp.tokenizer.mask_token} at the sign, you will get a ticket.")

# Print each candidate token with its score as a percentage
for pred in preds:
    print(f"Token:{pred['token_str']}. Score: {100*pred['score']:,.2f}%")
    
# Output
# Token:look. Score: 48.00%
# Token:stop. Score: 42.63%
# Token:glance. Score: 1.39%
# Token:arrive. Score: 0.88%
# Token:turn. Score: 0.62%
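
To make the masking step itself more concrete, here is a simplified sketch of how masked training examples might be constructed. It only replaces randomly chosen tokens with a mask token and records the original token as the label; BERT's actual masking recipe is more involved. The function name `mask_tokens`, the `mask_prob` value, and the example sentence are all assumptions of this sketch.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    # Replace each token with the mask token with probability mask_prob.
    # labels holds the original token at masked positions and None elsewhere.
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "if you do not stop at the sign you will get a ticket".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)

print(masked)
print(labels)
```

During training, the model would see `masked` as input and be scored only on the positions where `labels` is not `None`.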

Next sentence prediction (NSP): train the model to predict whether two sentences are consecutive in the original text.

from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_nsp = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

text = "Deliver huge improvements to your machine learning pipelines without spending hours fine-tuning parameters!"
text2 = "This book’s practical case-studies reveal feature engineering techniques that upgrade your data wrangling—and your ML results."

inputs = tokenizer(text, text2, return_tensors='pt')

# 0 == "isNextSentence" and 1 == "notNextSentence"
outputs = bert_nsp(**inputs)
print(outputs[0][0])
# Output
# tensor([ 6.0295, -5.5733], grad_fn=<SelectBackward0>)

print(f"isNextSentence: {outputs[0][0][0]}")
# Output
# isNextSentence: 6.029475212097168

print(f"notNextSentence: {outputs[0][0][1]}")
# Output
# notNextSentence: -5.573266983032227

Train a BERT model by using masked language modeling.

BERT-like models are commonly used for transfer learning, which involves first pretraining a model for language modeling and then fine-tuning it for a specific task. For instance, BERT can be pretrained on Wikipedia and then fine-tuned for text classification.

After pretraining BERT on masked language modeling, we fine-tune it for specific tasks.

1.7 Generative Models: Decoder-Only Models

In 2018, the Generative Pre-trained Transformer (GPT), now known as GPT-1, was introduced as a decoder-only architecture. It has 117 million parameters, where each parameter is a numerical value that represents part of the model’s understanding of language.

The architecture of a GPT-1. It uses a decoder-only architecture and removes the encoder-attention block.

The larger generative decoder-only models are commonly referred to as large language models (LLMs), although the term can also be applied to representation (encoder-only) models.

Generative models receive some input and try to complete it. By fine-tuning them into instruct or chat models, we get models that instead attempt to answer a question.

Generative LLMs take in some input and try to complete it. With instruct models, this is more than just autocomplete and attempts to answer the question.

The context length, or context window, determines the maximum number of tokens the model can process. Because generation is autoregressive, the current context grows as new tokens are generated.

The context length is the maximum context an LLM can handle.
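
The interaction between autoregressive generation and the context length can be sketched with a toy loop. No real model is involved here: `toy_next_token` is a hypothetical stand-in that simply returns the previous token ID plus one, and `generate` is an illustrative helper rather than a library function.

```python
def generate(prompt_ids, next_token_fn, max_new_tokens, context_length):
    # Each new token is appended to the context and fed back in on the next step
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        if len(ids) >= context_length:
            break  # the context window is full, so generation must stop
        ids.append(next_token_fn(ids))
    return ids

def toy_next_token(ids):
    # Hypothetical stand-in for a language model
    return ids[-1] + 1

output_ids = generate([1, 2, 3], toy_next_token, max_new_tokens=10, context_length=8)
print(output_ids)
print(len(output_ids))
```

Although ten new tokens were requested, the loop stops once the prompt plus the generated tokens reach the context length of eight.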

1.8 The Training Paradigm of Large Language Models

Traditional machine learning involves a single step: training a model for a specific target task, like classification or regression.

LLM training, by contrast, involves at least two steps:

Language modeling (pretraining): the model is trained on a vast corpus to learn grammar, context, and language patterns. The result is a next-word prediction model called a foundation model or base model. It does not follow instructions.
Fine-tuning (post-training): the base model is further trained for a narrower task, such as classification or following instructions.

Compared to traditional machine learning, LLM training takes a multistep approach.

1.9 Generating Your First Text

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

 Why did the chicken join the band? Because it had the drumsticks!

2 Tokens and Embeddings

2.1 LLM Tokenization

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(input_ids=input_ids, max_new_tokens=20)

# Print the output
print(tokenizer.decode(generation_output[0]))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Sincere Apologies for the Gardening Mishap


Dear

We can see that the model does not in fact receive the text prompt. Instead, the tokenizer processed the input prompt and returned the information the model needed in the variable input_ids, which the model used as its input.

for token_id in input_ids[0]:
    print(token_id, tokenizer.decode(token_id))

tensor(14350, device='cuda:0') Write
tensor(385, device='cuda:0') an
tensor(4876, device='cuda:0') email
tensor(27746, device='cuda:0') apolog
tensor(5281, device='cuda:0') izing
tensor(304, device='cuda:0') to
tensor(19235, device='cuda:0') Sarah
tensor(363, device='cuda:0') for
tensor(278, device='cuda:0') the
tensor(25305, device='cuda:0') trag
tensor(293, device='cuda:0') ic
tensor(16423, device='cuda:0') garden
tensor(292, device='cuda:0') ing
tensor(286, device='cuda:0') m
tensor(728, device='cuda:0') ish
tensor(481, device='cuda:0') ap
tensor(29889, device='cuda:0') .
tensor(12027, device='cuda:0') Exp
tensor(7420, device='cuda:0') lain
tensor(920, device='cuda:0') how
tensor(372, device='cuda:0') it
tensor(9559, device='cuda:0') happened
tensor(29889, device='cuda:0') .
tensor(32001, device='cuda:0') <|assistant|>

The IDs reference a table inside the tokenizer containing all the tokens it knows. The reference table is called a vocabulary.

We can inspect the size of this model's vocabulary and embeddings:

# Number of entries in the model's vocabulary
print(f"Vocab size is: {model.config.vocab_size}")

Vocab size is: 32064

# Hidden size (the length of each embedding vector)
print(f"Hidden size is: {model.config.hidden_size}")

Hidden size is: 3072

# The embedding matrix therefore has shape [vocab_size, hidden_size]
print(f"The model embedding size is: {model.get_input_embeddings().weight.shape}")

The model embedding size is: torch.Size([32064, 3072])

# Look up the learned vector representation of a single token, here `Write` (ID 14350)
write_vector = model.get_input_embeddings().weight[14350]
print(
    f"The vector representation of the word is: {write_vector} with size of: {write_vector.shape}"
)

The vector representation of the word is: tensor([ 0.0640, -0.0210,  0.0095,  ..., -0.0693,  0.0008,  0.0608],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<SelectBackward0>) with size of: torch.Size([3072])

# To check the vocabulary based on ID
for i in range(31997, 32011):
    print(tokenizer.decode(i))

收
弘
给
<|endoftext|>
<|assistant|>
<|placeholder1|>
<|placeholder2|>
<|placeholder3|>
<|placeholder4|>
<|system|>
<|end|>
<|placeholder5|>
<|placeholder6|>
<|user|>

We can also observe the generation_output

print(f"Generation output: {generation_output}")

Generation output: tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901,   317,  3742,   406,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799]], device='cuda:0')

print(tokenizer.decode(3323))

Sub

print(tokenizer.decode(622))

ject

print(tokenizer.decode([3323, 622]))

Subject

2.1.1 How the Tokenizer Breaks Down Text

Three major factors determine how a tokenizer breaks down text:

Tokenization method: at model design time, the creator chooses a tokenization method. Popular methods include byte pair encoding (BPE), used in GPT models, and WordPiece, used in BERT.
Tokenizer parameters: after choosing the method, the designer decides on the vocabulary size (the number of unique tokens the model can recognize) and the special tokens. A larger vocabulary can capture more nuances but may require more computational resources.
Training dataset: the tokenizer is trained on a specific dataset to build the best vocabulary for that dataset. A tokenizer trained on English text ends up with a different vocabulary than one trained on multilingual text.

The tokenizer is used both to encode the input and to decode the output.

2.1.2 Word, Subwords, Character, Byte Tokens

There are multiple methods of tokenization that break down the text to different sizes of components (words, subwords, characters, and bytes).

Word tokens: each token corresponds to a whole word. However, the tokenizer may struggle with out-of-vocabulary words, and the vocabulary fills up with words that differ only minimally (e.g., apology, apologize, apologetic). Subword tokenization addresses this by splitting such words into a shared token apolog plus suffix tokens (e.g., -y, -ize, -etic, -ist).
Subword tokens: break down words into smaller units, allowing the model to handle out-of-vocabulary words and capture morphological information.
Character tokens: break down text into individual characters, which lets the model handle any word but requires longer sequences and more computational resources. For example, “play” is one token under subword tokenization but becomes “p-l-a-y” under character tokenization. In a Transformer model with a context length of 1,024 tokens, subword tokenization fits about three times as much text as character tokenization (assuming an average of three characters per subword token).
Byte tokens: break down text into bytes, allowing the model to handle any text but may require even longer sequences and more computational resources.

Subword tokenizers may also include bytes as tokens in their vocabulary, as a fallback for text they cannot otherwise represent.
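
The difference in sequence length between the granularities can be seen with plain Python string operations. This sketch covers only word, character, and byte splitting (subword tokenization needs a trained tokenizer); the example text is an assumption chosen because it contains a non-ASCII character.

```python
text = "na\u00efve play"  # "naïve play", written with an escape to make the character explicit

word_tokens = text.split()                # split on whitespace
char_tokens = list(text)                  # one token per character (including the space)
byte_tokens = list(text.encode("utf-8"))  # one token per UTF-8 byte

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens)
print(len(byte_tokens), byte_tokens)
```

The byte sequence is longer than the character sequence because the character "ï" takes two bytes in UTF-8, which illustrates why byte-level tokens tend to produce the longest sequences.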

2.2 Token Embeddings

After training the tokenizer, we have a vocabulary of tokens. Each token is associated with a unique ID, and the model learns an embedding for each token during training. The embedding is a dense vector representation that captures the semantic meaning of the token in the context of the training data.

2.2.1 A Language Model Holds Embeddings for the Vocabulary of Its Tokenizer

Each tokenizer is associated with a specific model, and the model learns an embedding for each token in the tokenizer’s vocabulary during training. The embeddings, like the model’s other weights, are initialized randomly.

A language model holds an embedding vector associated with each token in its tokenizer.

“The” -> [0.1, 0.3, -0.2, …, 0.5] (768 dimensions)
“cat” -> [0.4, -0.1, 0.6, …, -0.3]
“runs” -> [-0.2, 0.8, 0.1, …, 0.4]
“fast” -> [0.3, 0.5, -0.4, …, 0.7]

2.2.2 Creating Contextualized Word Embeddings with Language Models

The difference between the token embeddings and the contextualized embeddings

A language model operates on raw, static embeddings as its input and produces contextual text embeddings.

Token Embeddings (Static)

Created by the embedding layer inside a model
Acts like a lookup table: each token ID maps to a fixed vector
Always the same regardless of context
These are the INPUT to the language model

Example:

# Token ID 245 always maps to:
"bank" -> [0.2, 0.5, -0.3, 0.8, ...]  (always identical)

Contextualized Embeddings (Dynamic)

Created by the language model after processing
Changes based on surrounding words
These are the OUTPUT of the language model
Much more useful for NLP tasks

Example:

# Same word, different contexts:
"The bank is closed" 
"bank" -> [0.7, -0.2, 0.9, -0.4, ...]  (financial meaning)

"The river bank"
"bank" -> [-0.3, 0.8, 0.1, 0.6, ...]  (geographical meaning)

2.2.3 From Text to Contextualized Embeddings

Text -> Tokenizer -> Token IDs -> Embedding Layer (static) -> Language Model (processes) -> Contextualized Embeddings (context-aware)

Layers in the Transformer model (in order):

Embedding Layer - converts token IDs to vectors (static embeddings)
Transformer Layer 1 - self-attention + feed-forward
Transformer Layer 2 - self-attention + feed-forward
Transformer Layer 3 - self-attention + feed-forward
…
Transformer Layer N (e.g., Layer 12) - self-attention + feed-forward
LM Head (optional, only for text generation) - projects to vocabulary size

Input text: “Hello world”

2.2.3.1 Step 1: Tokenization

“Hello world” -> [CLS] Hello world [SEP] -> [1, 245, 892, 2] (token IDs)

[CLS] (ID: 1) = classification token (marks the start)
Hello (ID: 245) = actual word
world (ID: 892) = actual word
[SEP] (ID: 2) = separator token (marks the end)

2.2.3.2 Step 2: Static Embeddings (Embedding Layer)

The embedding layer converts token IDs into vectors (lookup table):

Token ID 1   ([CLS])  → [0.12, -0.34, ..., 0.56]  (384 dims)
Token ID 245 (Hello)  → [0.45,  0.23, ..., -0.12] (384 dims)
Token ID 892 (world)  → [-0.23, 0.67, ..., 0.89]  (384 dims)
Token ID 2   ([SEP])  → [0.34, -0.12, ..., 0.23]  (384 dims)

Shape: [1, 4, 384] (1 batch, 4 tokens, 384 dimensions)

These are static - always the same for these token IDs, no context yet.
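
The lookup-table behavior of the embedding layer can be mimicked with NumPy indexing. In this sketch the embedding matrix is filled with random numbers rather than learned values, and the vocabulary size of 1,000 is a hypothetical choice; only the hidden size of 384 and the token IDs come from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 1000, 384  # vocab_size is hypothetical
embedding_matrix = rng.normal(size=(vocab_size, hidden_size))  # random stand-in for learned weights

token_ids = np.array([[1, 245, 892, 2]])  # [CLS] Hello world [SEP], batch of 1

# Indexing with the ID array selects one row of the matrix per token
static_embeddings = embedding_matrix[token_ids]
print(static_embeddings.shape)
```

Looking up the same token ID always returns the same row, which is exactly why these embeddings are called static.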

2.2.3.3 Step 3: Language Model Processing (Transformer Layers)

The model passes these static embeddings through multiple transformer layers (e.g., 12 layers). Each layer has two components:

2.2.3.3.1 Layer 1 - Self-Attention

Tokens “talk to each other” and exchange information.

Step 3.1: Create Query, Key, Value matrices

Each embedding is transformed into three vectors using learned weight matrices:

# Weight matrices (learned during training)
W_query = [[...]]  # Shape: [384, 384]
W_key = [[...]]    # Shape: [384, 384]
W_value = [[...]]  # Shape: [384, 384]

# For "Hello" token with embedding [0.45, 0.23, ..., -0.12]
Query_Hello = [0.45, 0.23, ..., -0.12] @ W_query
            = [0.52, -0.18, ..., 0.34]  (384 dims)

Key_Hello = [0.45, 0.23, ..., -0.12] @ W_key
          = [0.31, 0.45, ..., -0.22]  (384 dims)

Value_Hello = [0.45, 0.23, ..., -0.12] @ W_value
            = [0.67, 0.12, ..., 0.45]  (384 dims)

Similarly for all tokens:

Query_[CLS] = [0.23, -0.45, ..., 0.12]  (384 dims)
Query_Hello = [0.52, -0.18, ..., 0.34]  (384 dims)
Query_world = [-0.34, 0.78, ..., 0.56]  (384 dims)
Query_[SEP] = [0.45, -0.23, ..., 0.18]  (384 dims)

Key_[CLS] = [0.34, 0.12, ..., -0.23]    (384 dims)
Key_Hello = [0.31, 0.45, ..., -0.22]    (384 dims)
Key_world = [0.56, -0.12, ..., 0.67]    (384 dims)
Key_[SEP] = [0.23, 0.34, ..., -0.12]    (384 dims)

Value_[CLS] = [0.45, -0.12, ..., 0.34]  (384 dims)
Value_Hello = [0.67, 0.12, ..., 0.45]   (384 dims)
Value_world = [-0.23, 0.89, ..., 0.78]  (384 dims)
Value_[SEP] = [0.34, -0.23, ..., 0.12]  (384 dims)

Step 3.2: Calculate attention scores

For the token “Hello”, calculate how much attention it should pay to each token:

# Dot product of Query_Hello with all Keys
score_with_[CLS] = Query_Hello · Key_[CLS]
                 = (0.52 × 0.34) + (-0.18 × 0.12) + ... + (0.34 × -0.23)
                 = 12.45  (sum of 384 multiplications)

score_with_Hello = Query_Hello · Key_Hello
                 = (0.52 × 0.31) + (-0.18 × 0.45) + ... + (0.34 × -0.22)
                 = 18.32  (sum of 384 multiplications)

score_with_world = Query_Hello · Key_world
                 = (0.52 × 0.56) + (-0.18 × -0.12) + ... + (0.34 × 0.67)
                 = 35.67  (sum of 384 multiplications) ← Highest!

score_with_[SEP] = Query_Hello · Key_[SEP]
                 = (0.52 × 0.23) + (-0.18 × 0.34) + ... + (0.34 × -0.12)
                 = 8.91   (sum of 384 multiplications)

Step 3.3: Scale and normalize (softmax)

# Scale: Divide by sqrt(384) ≈ 19.6 to prevent large values
scaled_scores = [12.45/19.6, 18.32/19.6, 35.67/19.6, 8.91/19.6]
              = [0.635, 0.935, 1.820, 0.455]

# Apply softmax to get probabilities (sum to 1)
attention_weights = softmax([0.635, 0.935, 1.820, 0.455])
                  ≈ [0.15, 0.20, 0.50, 0.15]  (rounded for illustration; the exact values are about [0.155, 0.209, 0.507, 0.129])

Interpretation:

15% attention to [CLS]
20% attention to Hello (itself)
50% attention to world ← Most important!
15% attention to [SEP]
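
The scaling and softmax step can be checked numerically. The sketch below takes the four raw scores from the example and applies the same two operations; the variable names are assumptions of this example.

```python
import numpy as np

raw_scores = np.array([12.45, 18.32, 35.67, 8.91])  # Query_Hello · Key_* from the example
d_k = 384

scaled = raw_scores / np.sqrt(d_k)      # divide by sqrt(384) ≈ 19.6
exp = np.exp(scaled - scaled.max())     # subtract the max for numerical stability
attention_weights = exp / exp.sum()     # softmax: normalize so the weights sum to 1

print(scaled.round(3))
print(attention_weights.round(3))
```

The computed weights come out at roughly 0.15, 0.21, 0.51, and 0.13, close to the round numbers used in the walkthrough, with "world" again receiving the most attention.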

Step 3.4: Mix the Value vectors

“Hello” creates its new representation by mixing information from all tokens:

new_Hello = 0.15 × Value_[CLS] + 0.20 × Value_Hello + 0.50 × Value_world + 0.15 × Value_[SEP]

# Calculate element by element:
new_Hello[0] = 0.15×0.45 + 0.20×0.67 + 0.50×(-0.23) + 0.15×0.34
             = 0.0675 + 0.134 - 0.115 + 0.051
             = 0.14

new_Hello[1] = 0.15×(-0.12) + 0.20×0.12 + 0.50×0.89 + 0.15×(-0.23)
             = -0.018 + 0.024 + 0.445 - 0.0345
             = 0.42

...

new_Hello[383] = 0.15×0.34 + 0.20×0.45 + 0.50×0.78 + 0.15×0.12
               = 0.051 + 0.09 + 0.39 + 0.018
               = 0.55

# Result:
new_Hello = [0.14, 0.42, ..., 0.55]  (384 dims)

After attention for all tokens:

[CLS] → [0.18, -0.25, ..., 0.47]  (384 dims)
Hello → [0.14,  0.42, ..., 0.55]  (384 dims) ← Influenced by "world"
world → [-0.16, 0.58, ..., 0.72]  (384 dims) ← Influenced by "Hello"
[SEP] → [0.26, -0.09, ..., 0.31]  (384 dims)
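
Steps 3.1 to 3.4 can be put together as a single-head self-attention function. This is a minimal NumPy sketch: the input embeddings and the weight matrices are random stand-ins rather than trained values, and real Transformer layers add multiple heads, residual connections, and normalization on top of this.

```python
import numpy as np

def self_attention(x, w_query, w_key, w_value):
    # Step 3.1: project every token embedding into query, key, and value vectors
    q, k, v = x @ w_query, x @ w_key, x @ w_value
    # Step 3.2: dot product of every query with every key
    scores = q @ k.T
    # Step 3.3: scale by sqrt of the key dimension, then softmax over each row
    scores = scores / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 3.4: each token's new vector is a weighted mix of all value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 384
x = rng.normal(size=(4, d_model))  # stand-ins for [CLS], Hello, world, [SEP]
w_query, w_key, w_value = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
)

new_x, weights = self_attention(x, w_query, w_key, w_value)
print(new_x.shape)
print(weights.round(2))
```

Each row of `weights` plays the role of the attention percentages shown for "Hello" above, computed here for all four tokens at once.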

2.2.3.3.2 Layer 1 - Feed-Forward Network

After attention, each token goes through a 2-layer neural network:

# For "Hello" token
input_to_ffn = [0.14, 0.42, ..., 0.55]  (384 dims)

# First layer: expand to 1536 dims (384 × 4 is common)
W1 = [[...]]  # Shape: [384, 1536]
b1 = [...]    # Shape: [1536]

hidden = ReLU(input_to_ffn @ W1 + b1)
       = [0.67, 0.0, 0.89, ..., 0.34]  (1536 dims)

# Second layer: project back to 384 dims
W2 = [[...]]  # Shape: [1536, 384]
b2 = [...]    # Shape: [384]

output_Hello_L1 = hidden @ W2 + b2
                = [0.38, 0.51, ..., 0.63]  (384 dims)

After Layer 1 (attention + feed-forward):

[CLS] → [0.22, -0.31, ..., 0.54]  (384 dims)
Hello → [0.38,  0.51, ..., 0.63]  (384 dims)
world → [-0.19, 0.72, ..., 0.81]  (384 dims)
[SEP] → [0.31, -0.15, ..., 0.39]  (384 dims)

End of Layer 1
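
The feed-forward step can be sketched in NumPy as two matrix multiplications with a ReLU in between. The weights below are random stand-ins for learned parameters, and `feed_forward` is an illustrative helper; the sizes 384 and 1536 follow the example above.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # First layer expands to the larger hidden size and applies ReLU
    hidden = np.maximum(0.0, x @ w1 + b1)
    # Second layer projects back to the model dimension
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 384, 1536
w1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # one row per token, e.g. the attention outputs
out = feed_forward(x, w1, b1, w2, b2)
print(out.shape)
```

The same weights are applied to every token independently, so unlike attention this step does not mix information between positions.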

2.2.3.3.3 Layer 2 - Self-Attention (More Refinement)

Now the tokens continue exchanging information, but they already have some context from Layer 1. The same process repeats:

Calculate Query, Key, Value
Compute attention scores
Mix Value vectors
Feed-forward network

After Layer 2:

[CLS] → [-0.45, 0.12, ..., 0.67]  (384 dims)
Hello → [ 0.56, 0.73, ..., 0.84]  (384 dims)
world → [ 0.12, 0.95, ..., 1.08]  (384 dims)
[SEP] → [-0.52, 0.08, ..., 0.59]  (384 dims)

End of Layer 2

2.2.3.3.4 Layers 3-12 (Progressive Refinement)

Each layer continues to refine the embeddings through the same attention + feed-forward process:

Layer 3:  More context awareness
Layer 4:  Even more refined
Layer 5:  ...
Layer 6:  Halfway point
Layer 7:  ...
...
Layer 12: Highly contextualized (final output)

After Layer 6 (halfway):

[CLS] → [-1.23,  0.45, ..., 0.78]  (384 dims)
Hello → [ 0.67,  0.84, ..., 0.92]  (384 dims)
world → [ 0.23,  1.12, ..., 1.34]  (384 dims)
[SEP] → [-1.45,  0.12, ..., 0.67]  (384 dims)

After all 12 layers:

[CLS] → [-3.31, -0.05, ..., 0.69]  (384 dims) - represents entire sentence
Hello → [ 0.89,  0.07, ..., 0.08]  (384 dims) - fully aware of context
world → [ 0.09,  0.64, ..., 1.02]  (384 dims) - fully aware of context
[SEP] → [-3.16, -0.14, ..., 0.80]  (384 dims) - represents sentence ending

2.2.3.4 Step 4: Contextualized Output

The final output contains highly contextualized embeddings:

output = tensor([[
    [-3.31, -0.05, ..., 0.69],  # [CLS] (384 dims) - understands full sentence
    [ 0.89,  0.07, ..., 0.08],  # Hello (384 dims) - influenced by "world"
    [ 0.09,  0.64, ..., 1.02],  # world (384 dims) - influenced by "Hello"
    [-3.16, -0.14, ..., 0.80]   # [SEP] (384 dims) - context-aware
]])

Shape: [1, 4, 384] (same shape as input, but vectors are now contextualized)

These vectors are now context-aware and can be used for downstream tasks like:

Named Entity Recognition (NER)
Text Classification
Semantic Search
Question Answering

2.3 Text Embeddings (for Sentences and Whole Documents)

While token embeddings represent individual tokens, we can also create embeddings for entire sentences or documents. These are called text embeddings or sentence embeddings.

We use an embedding model to extract features and convert the input text into embeddings.

2.4 Word Embeddings Beyond LLMs

2.4.1 Using Pretrained Word Embeddings

We can use pretrained word2vec or GloVe embeddings:

import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")

# See the nearest neighbors of a specific word "king"
model.most_similar([model['king']], topn=11)

# The output
#[('king', 1.0000001192092896),
# ('prince', 0.8236179351806641),
# ('queen', 0.7839043140411377),
# ('ii', 0.7746230363845825),
# ('emperor', 0.7736247777938843),
# ('son', 0.766719400882721),
# ('uncle', 0.7627150416374207),
# ('kingdom', 0.7542161345481873),
# ('throne', 0.7539914846420288),
# ('brother', 0.7492411136627197),
# ('ruler', 0.7434253692626953)]

2.4.2 The Word2vec Algorithm and Contrastive Training

Word2vec uses a sliding-window approach (skip-gram) to create training samples from a text corpus. For each word in the center of the window, it creates positive samples with neighboring words and negative samples with random words from the vocabulary.

If, however, we have a dataset of only a target value of 1, then a model can cheat and ace it by outputting 1 all the time. To get around this, we need to enrich our training dataset with examples of words that are not typically neighbors. These are called negative examples.

So far we have seen two main concepts of word2vec: skip-gram, the method of selecting neighboring words, and negative sampling, adding negative examples by random sampling from the dataset.
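
Both ideas can be sketched as a small data-preparation routine: slide a window over the tokens to collect positive pairs (label 1), then add randomly drawn words as negative pairs (label 0). This is a simplified illustration, not the word2vec implementation: the helper names, the window size, the number of negatives, and the example sentence are assumptions, and a randomly drawn negative can occasionally be a true neighbor.

```python
import random

def skipgram_pairs(tokens, window=2):
    # For each center word, pair it with every neighbor inside the window (label 1)
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j], 1))
    return pairs

def add_negative_samples(pairs, vocabulary, num_negatives=2, seed=0):
    # For each positive pair, add num_negatives pairs with a random word (label 0)
    rng = random.Random(seed)
    examples = list(pairs)
    for center, _, _ in pairs:
        for _ in range(num_negatives):
            examples.append((center, rng.choice(vocabulary), 0))
    return examples

tokens = "thou shalt not make a machine".split()
positives = skipgram_pairs(tokens, window=2)
examples = add_negative_samples(positives, vocabulary=sorted(set(tokens)))

print(len(positives), "positive pairs")
print(len(examples), "training examples in total")
print(examples[:3])
```

A model trained on `examples` can no longer succeed by always predicting 1, because two thirds of the examples now carry the label 0.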

3 Looking Inside Large Language Models

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    dtype="auto",
)

3.1 An Overview of Transformer Models

A complete Transformer-based large language model (LLM) consists of three main components: the tokenizer, the stack of Transformer blocks, and the language modeling head (LM head).

1. Embedding Layer
   └─ Converts token IDs to vectors (384 dims)

2. Transformer Layer 1
   ├─ Multi-Head Self-Attention
   │  ├─ Head 1 (does attention calculation)
   │  ├─ Head 2 (does attention calculation)
   │  ├─ ...
   │  ├─ Head 8 (does attention calculation)
   │  └─ Combine all heads
   └─ Feed-Forward Network
      ├─ Layer 1: expand 384 → 1536 dims
      └─ Layer 2: project 1536 → 384 dims

3. Transformer Layer 2
   ├─ Multi-Head Self-Attention (8 heads)
   └─ Feed-Forward Network

...

12. Transformer Layer 12
    ├─ Multi-Head Self-Attention (8 heads)
    └─ Feed-Forward Network

13. LM Head
    └─ Projects 384 dims → 50,000 dims (vocab size)

print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLUActivation()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

This shows us the various nested layers of the model. The majority of the model is labeled model, followed by lm_head.
Inside the Phi3Model model, we see the embeddings matrix embed_tokens and its dimensions. It has 32,064 tokens each with a vector size of 3,072.
Skipping the dropout layer for now, we can see the next major component is the stack of Transformer decoder layers. It contains 32 blocks of type Phi3DecoderLayer.
Each of these Transformer blocks includes an attention layer and a feedforward neural network (also known as an mlp or multilayer perceptron). We’ll cover these in more detail later in the chapter.
Finally, we see the lm_head taking a vector of size 3,072 and outputting a vector with one entry per token the model knows (32,064). Each entry is a raw score (logit) for that token; after softmax, these scores become the probabilities that help us select the output token.
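As a rough sanity check, the shapes in the printed architecture are enough to count the model's weights. The sketch below assumes that every Linear layer is bias-free (as printed), that each RMSNorm holds one weight per dimension, and that the rotary embedding has no learnable parameters; the helper name `linear_params` is ours, not part of the library.

```python
# Count parameters from the shapes shown in print(model).
# Assumptions: RMSNorm has one learnable weight per dimension and
# the rotary embedding contributes no learnable parameters.

def linear_params(in_features, out_features):
    """Number of weights in a bias-free Linear layer."""
    return in_features * out_features

d_model, vocab, n_layers = 3072, 32064, 32

embed = vocab * d_model
per_layer = (
    linear_params(3072, 9216)     # qkv_proj
    + linear_params(3072, 3072)   # o_proj
    + linear_params(3072, 16384)  # gate_up_proj
    + linear_params(8192, 3072)   # down_proj
    + 2 * d_model                 # input_layernorm + post_attention_layernorm
)
final_norm = d_model
lm_head = linear_params(d_model, vocab)

total = embed + n_layers * per_layer + final_norm + lm_head
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the total is about 3.8 billion parameters, and the 32 decoder layers account for almost all of it.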

3.1.1 Reading the Model Architecture

Input Setup

Input text: “The cat sat”
After tokenization: token IDs [245, 1834, 5847]
Batch size: 1, Sequence length: 3
Model dimension (d_model): 3072
Number of attention heads: 32
Head dimension (d_head): 96 (= 3072 ÷ 32)

3.1.1.1 Step 1: Token Embedding

(embed_tokens): Embedding(32064, 3072, padding_idx=32000)

Input: Token IDs [245, 1834, 5847]
Output: [1, 3, 3072]
- 3 tokens, each becomes a 3072-dimensional vector
- Each token is now represented in the model’s embedding space

3.1.1.2 Step 2-33: Process Through 32 Transformer Blocks

Each block (0-31): 32 x Phi3DecoderLayer processes the sequence. Below is the detailed flow for one block:

3.1.1.2.1 2a. Pre-Attention Layer Normalization

(input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)

Input: [1, 3, 3072]
Output: [1, 3, 3072] (normalized)
Purpose: Stabilize training by normalizing activations before attention
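A minimal NumPy sketch of what an RMSNorm layer computes, assuming the usual formulation (divide each vector by its root-mean-square, then multiply by a learned gain). The gain is set to ones here as a stand-in for trained weights.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """Scale each vector so its root-mean-square is about 1, then apply a gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

rng = np.random.default_rng(0)
# [batch, seq_len, d_model] with deliberately large activations
x = rng.standard_normal((1, 3, 3072)).astype(np.float32) * 5.0
weight = np.ones(3072, dtype=np.float32)  # assumed gain of 1 for the example

y = rms_norm(x, weight)
print(y.shape)                            # shape is unchanged
print(np.sqrt(np.mean(y * y, axis=-1)))   # each value is close to 1.0
```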

3.1.1.2.2 2b. Self-Attention

3.1.1.2.2.1 2b.1: QKV Projection

(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)

Input: [1, 3, 3072]
Output: [1, 3, 9216]
What happens:
- This single linear layer produces ALL queries, keys, and values at once
- More efficient than having 3 separate linear layers
Split into Q, K, V:

  Q = output[:, :, 0:3072]      # [1, 3, 3072]
  K = output[:, :, 3072:6144]   # [1, 3, 3072]
  V = output[:, :, 6144:9216]   # [1, 3, 3072]

Why 9216? 3072 (Q) + 3072 (K) + 3072 (V) = 9216

Each token now has:

A Query vector (3072-dim): “What am I looking for?”
A Key vector (3072-dim): “What information do I contain?”
A Value vector (3072-dim): “What information do I provide?”

3.1.1.2.2.2 2b.2: Reshape for Multi-Head

Starting point: Q, K, V are each [1, 3, 3072]

Goal: Split into 32 attention heads, each with 96 dimensions

Step-by-step reshaping for Q (same for K and V):

# Step 1: Reshape to separate heads
Q: [1, 3, 3072] → [1, 3, 32, 96]
# batch_size=1, seq_len=3, num_heads=32, head_dim=96
# We're dividing the 3072 dimensions into 32 chunks of 96

# Step 2: Transpose for efficient computation
Q: [1, 3, 32, 96] → [1, 32, 3, 96]
# batch_size=1, num_heads=32, seq_len=3, head_dim=96
# Now each head processes all tokens independently

Why reshape?

Multi-head attention processes the same input from different “representation subspaces”
Each of the 32 heads looks at different aspects of the relationships between tokens
head_dim = d_model / num_heads = 3072 / 32 = 96

Visualization for one token:

Original Q vector (3072-dim):
[q₁, q₂, q₃, ..., q₃₀₇₂]

After reshape into 32 heads:
Head 1:  [q₁, q₂, ..., q₉₆]
Head 2:  [q₉₇, q₉₈, ..., q₁₉₂]
Head 3:  [q₁₉₃, q₁₉₄, ..., q₂₈₈]
...
Head 32: [q₂₉₇₇, q₂₉₇₈, ..., q₃₀₇₂]

After reshaping all three:

Q: [1, 32, 3, 96] - 32 heads, each sees 3 tokens with 96-dim queries
K: [1, 32, 3, 96] - 32 heads, each sees 3 tokens with 96-dim keys
V: [1, 32, 3, 96] - 32 heads, each sees 3 tokens with 96-dim values

Now each of the 32 heads can compute attention independently and in parallel!
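The split and reshape steps above can be sketched in NumPy. The array below is only a stand-in for the qkv_proj output (filled with running indices so we can track where each element lands); `split_heads` is a name chosen for this example.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 1, 3, 3072, 32
head_dim = d_model // num_heads  # 96

# Stand-in for the qkv_proj output: [1, 3, 9216]
qkv = np.arange(batch * seq_len * 3 * d_model).reshape(batch, seq_len, 3 * d_model)

# Split the last axis into Q, K, V, each [1, 3, 3072]
Q, K, V = np.split(qkv, 3, axis=-1)

def split_heads(x):
    x = x.reshape(batch, seq_len, num_heads, head_dim)  # [1, 3, 32, 96]
    return x.transpose(0, 2, 1, 3)                      # [1, 32, 3, 96]

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
print(Qh.shape)

# Head 2 (index 1) of the first token holds q97..q192, i.e. indices 96..191
print(Qh[0, 1, 0, 0] == Q[0, 0, 96], Qh[0, 1, 0, -1] == Q[0, 0, 191])
```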

3.1.1.2.2.3 2b.3: Apply Rotary Position Embeddings

(rotary_emb): Phi3RotaryEmbedding()

Apply to: Q and K only (not V)
Purpose: Encode relative position information
Output: Q and K remain [1, 32, 3, 96] but with position info
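To make the rotation concrete, here is a minimal NumPy sketch of rotary embeddings. It assumes a base of 10000 and pairs dimension i with dimension i + head_dim/2; the function name `rope` is ours, and Phi-3's actual implementation may differ in details such as long-context scaling.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles.

    x: [..., seq_len, head_dim]; positions: [seq_len] integer positions.
    """
    half = x.shape[-1] // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))   # [half]
    angles = positions[:, None] * inv_freq[None, :]       # [seq_len, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 32, 3, 96))
k = rng.standard_normal((1, 32, 3, 96))
pos = np.arange(3)
q_rot, k_rot = rope(q, pos), rope(k, pos)
print(q_rot.shape)  # shape is unchanged

# The score between a query and a key depends only on their relative offset
qv, kv = q[0, 0, :1], k[0, 0, :1]   # one query vector and one key vector
s_a = rope(qv, np.array([2])) @ rope(kv, np.array([0])).T
s_b = rope(qv, np.array([7])) @ rope(kv, np.array([5])).T
print(np.allclose(s_a, s_b))
```

Because the operation is a pure rotation, vector lengths are preserved; only the angle between a query and a key changes with their distance.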

3.1.1.2.2.4 2b.4: Attention Computation

Compute Attention Scores:

Scores = (Q @ K^T) / √d_head

Q @ K^T: [1, 32, 3, 96] @ [1, 32, 96, 3] = [1, 32, 3, 3]
Scale by √96 ≈ 9.8 to prevent large values
Result: [1, 32, 3, 3] - attention scores for each head

Apply Causal Mask:

Mask ensures token i can only attend to tokens ≤ i
[1, 0, 0]
[1, 1, 0]
[1, 1, 1]

Softmax to Get Attention Weights:

Input: [1, 32, 3, 3] (masked scores)
Output: [1, 32, 3, 3] (attention probabilities, each row sums to 1)

Compute Context Vectors:

Context = Attention_weights @ V

[1, 32, 3, 3] @ [1, 32, 3, 96] = [1, 32, 3, 96]
Each token now has a context-aware representation from each head
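The four steps above (scores, causal mask, softmax, context) can be sketched in NumPy with random stand-ins for Q, K, and V. This is an illustration of the math under those assumptions, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 32, 3, 96))
K = rng.standard_normal((1, 32, 3, 96))
V = rng.standard_normal((1, 32, 3, 96))

d_head = Q.shape[-1]
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)    # [1, 32, 3, 3]

# Causal mask: position i may only attend to positions <= i
seq_len = scores.shape[-1]
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(allowed, scores, -np.inf)

# Softmax over the last axis (row max subtracted for numerical stability)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)   # rows sum to 1

context = weights @ V                                     # [1, 32, 3, 96]
print(weights[0, 0].round(2))  # lower-triangular weights for head 1
print(context.shape)
```

Note that the first token can only attend to itself, so its context vector equals its own value vector.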

3.1.1.2.2.5 2b.5: Concatenate Heads

Input: [1, 32, 3, 96]
Transpose: [1, 32, 3, 96] → [1, 3, 32, 96]
Reshape: [1, 3, 32, 96] → [1, 3, 3072]
Result: All 32 heads’ outputs concatenated back together

3.1.1.2.2.6 2b.6: Output Projection

(o_proj): Linear(in_features=3072, out_features=3072, bias=False)

Input: [1, 3, 3072]
Output: [1, 3, 3072]
Purpose: Mix information across heads and project back to model dimension

3.1.1.2.3 2c. Attention Residual Connection + Dropout

(resid_attn_dropout): Dropout(p=0.0, inplace=False)

Computation: output = attention_output + input_to_block
Output: [1, 3, 3072]
Purpose: Residual connection helps gradient flow during training

3.1.1.2.4 2d. Pre-FFN Layer Normalization

(post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)

Input: [1, 3, 3072]
Output: [1, 3, 3072] (normalized)
Purpose: Normalize before feed-forward network

3.1.1.2.5 2e. Feed-Forward Network (MLP)

3.1.1.2.5.1 2e.1: Gate-Up Projection (Expansion)

(gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)

Input: [1, 3, 3072]
Output: [1, 3, 16384]
Split into two parts:
- Gate: [1, 3, 8192] - controls information flow
- Up: [1, 3, 8192] - actual content transformation
Why 16384? 8192 (Gate) + 8192 (Up) = 16384

3.1.1.2.5.2 2e.2: Activation (SwiGLU)

(activation_fn): SiLUActivation()

Computation: SiLU(Gate) ⊙ Up (element-wise multiply)
Input: Gate [1, 3, 8192], Up [1, 3, 8192]
Output: [1, 3, 8192]
Purpose: Non-linear gating mechanism for selective information flow

3.1.1.2.5.3 2e.3: Down Projection (Compression)

(down_proj): Linear(in_features=8192, out_features=3072, bias=False)

Input: [1, 3, 8192]
Output: [1, 3, 3072]
Purpose: Project back to model dimension
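Steps 2e.1 to 2e.3 can be sketched as a small NumPy function. The sizes are scaled down for the example (Phi-3 uses 3072 and 8192), the weights are random stand-ins, and the gate is assumed to be the first half of the fused projection, as in the split described above.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

# Scaled-down sizes, assumed for illustration only
d_model, d_ff = 64, 256
rng = np.random.default_rng(0)
W_gate_up = rng.standard_normal((d_model, 2 * d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02

def mlp(x):
    gate_up = x @ W_gate_up                    # [..., 2 * d_ff]
    gate, up = np.split(gate_up, 2, axis=-1)   # each [..., d_ff]
    hidden = silu(gate) * up                   # gated activation
    return hidden @ W_down                     # back to [..., d_model]

x = rng.standard_normal((1, 3, d_model))
print(mlp(x).shape)
```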

3.1.1.2.6 2f. FFN Residual Connection + Dropout

(resid_mlp_dropout): Dropout(p=0.0, inplace=False)

Computation: output = mlp_output + input_to_mlp
Output: [1, 3, 3072]
Purpose: Another residual connection for training stability

The above steps (2a-2f) repeat for all 32 decoder layers

3.1.1.3 Step 34: Final Layer Normalization

(norm): Phi3RMSNorm((3072,), eps=1e-05)

Input: [1, 3, 3072] (output from last decoder layer)
Output: [1, 3, 3072] (normalized)
Purpose: Final normalization before prediction head

3.1.1.4 Step 35: Language Model Head (Logits)

(lm_head): Linear(in_features=3072, out_features=32064, bias=False)

Input: [1, 3, 3072]
Output: [1, 3, 32064]
Interpretation: Logits (raw scores) for each of 32064 vocabulary tokens
- Each of the 3 input tokens gets a 32064-dim vector of logits

3.1.1.5 Step 36: Next Token Prediction

Take last token’s logits: [1, 32064] (from position 2, “sat”)
Apply softmax: Convert logits to probabilities
Sample or argmax: Select next token ID
- Example: Token ID 278 → decode to “ on”
Result: Predicted next token in the sequence

3.1.1.6 Summary Visualization

Input tokens [245, 1834, 5847] ("The cat sat")
    ↓
[embed_tokens] → [1, 3, 3072]
    ↓
┌─────────────────────────────────┐
│ Layer 0 (Phi3DecoderLayer)      │
│   input_layernorm               │
│   self_attn                     │
│     qkv_proj → Q, K, V          │
│     reshape to 32 heads         │
│     rotary_emb (RoPE)           │
│     attention computation       │
│     o_proj                      │
│   resid_attn_dropout + residual │
│   post_attention_layernorm      │
│   mlp                           │
│     gate_up_proj → Gate, Up     │
│     activation_fn (SwiGLU)      │
│     down_proj                   │
│   resid_mlp_dropout + residual  │
└─────────────────────────────────┘
    ↓
[Repeat for layers 1-31]
    ↓
[norm] → [1, 3, 3072]
    ↓
[lm_head] → [1, 3, 32064]
    ↓
Next token prediction: " on"

3.1.1.7 Key Architecture Features

Model Type: Decoder-only transformer (MHA - Multi-Head Attention)
Layers: 32 transformer blocks
Model Dimension: 3072
Attention Heads: 32 (each 96-dim)
FFN Hidden Size: 8192 (effective intermediate dimension)
Vocabulary Size: 32064 tokens
Activation: SwiGLU (SiLU with gating)
Position Encoding: RoPE (Rotary Position Embedding)
Normalization: RMSNorm

3.1.2 The Components of the Forward Pass

The tokenizer is followed by the neural network: a stack of Transformer blocks that do all of the processing. That stack is then followed by the LM head, which translates the output of the stack into probability scores for what the most likely next token is.

A Transformer LLM is made up of a tokenizer, a stack of Transformer blocks, and a language modeling head.

The tokenizer has a vocabulary of 50,000 tokens. The model has a token embedding associated with each of those tokens.

At the end of the forward pass, the model predicts a probability score for each token in the vocabulary.

The LM head is a simple neural network layer itself. It is one of multiple possible “heads” to attach to a stack of Transformer blocks to build different kinds of systems. Other kinds of Transformer heads include sequence classification heads and token classification heads.

3.1.3 Adding the LM Head for Text Generation

Continuing from the earlier text-to-contextualized-embeddings walkthrough:

For text generation (like GPT), we add an extra component called the Language Modeling Head (LM Head) that predicts the next token.

3.1.3.1 Step 5: LM Head (Language Modeling Head)

Step 5.1: Extract last token’s embedding

For “Hello world”, we take the last meaningful token (“world”) to predict what comes next:

last_token_embedding = output[0, 2, :]  # Index 2 is "world"
                     = [0.09, 0.64, ..., 1.02]  (384 dims)

Step 5.2: LM Head projection

The LM Head is a linear layer that projects from 384 dimensions to vocabulary size (50,000):

# Weight matrix for LM Head
W_lm_head = [[...]]  # Shape: [384, 50000]
b_lm_head = [...]     # Shape: [50000]

# Matrix multiplication
logits = last_token_embedding @ W_lm_head + b_lm_head

# Calculation for each token in vocabulary:
logits[0] = (0.09×W[0,0] + 0.64×W[1,0] + ... + 1.02×W[383,0]) + b[0]
          = -2.34  (token 0: "the")

logits[1] = (0.09×W[0,1] + 0.64×W[1,1] + ... + 1.02×W[383,1]) + b[1]
          = -1.78  (token 1: "a")

logits[245] = ... = 1.23   (token 245: "Hello")
logits[892] = ... = 0.87   (token 892: "world")
logits[1543] = ... = 3.45  (token 1543: "!") ← Highest score!
logits[2891] = ... = 2.12  (token 2891: "?")
...
logits[49999] = ... = -4.56  (token 49999: some rare word)

Result - logits (raw scores) for all 50,000 tokens:

logits = [-2.34, -1.78, ..., 1.23, ..., 0.87, ..., 3.45, ..., 2.12, ..., -4.56]
# Shape: [50000]

Step 5.3: Apply softmax to get probabilities

Convert raw scores (logits) to probabilities that sum to 1.0:

probabilities = softmax(logits)

# Softmax formula: exp(x_i) / sum(exp(x_j))
# Here sum_of_all_exponentials ≈ 90, the value implied by 35% for "!"
probabilities[0] = exp(-2.34) / sum_of_all_exponentials
                 = 0.0011  (0.11% - "the")

probabilities[1] = exp(-1.78) / sum_of_all_exponentials
                 = 0.0019  (0.19% - "a")

probabilities[245] = exp(1.23) / sum_of_all_exponentials
                   = 0.038   (3.8% - "Hello")

probabilities[892] = exp(0.87) / sum_of_all_exponentials
                   = 0.027   (2.7% - "world")

probabilities[1543] = exp(3.45) / sum_of_all_exponentials
                    = 0.350   (35% - "!") ← Highest probability!

probabilities[2891] = exp(2.12) / sum_of_all_exponentials
                    = 0.093   (9.3% - "?")
...
probabilities[49999] = exp(-4.56) / sum_of_all_exponentials
                     = 0.0001  (0.01% - rare word)

Step 5.4: Select the next token

Pick the token with the highest probability:

# Pick the token with highest probability
next_token_id = argmax(probabilities)
              = 1543  (corresponds to "!")

# Decode to text
next_token = tokenizer.decode(1543)
           = "!"

# Full sequence now becomes:
"Hello world!"
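Steps 5.2 to 5.4 are written as pseudocode above; the sketch below runs the same three operations in NumPy. The weights and the hidden state are random stand-ins (assumptions for the example), so the selected token ID is arbitrary and is not decoded.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 384, 50_000

# Random stand-ins for the trained LM head and the last token's hidden state
W_lm_head = rng.standard_normal((d_model, vocab_size), dtype=np.float32) * 0.05
b_lm_head = np.zeros(vocab_size, dtype=np.float32)
last_token_embedding = rng.standard_normal(d_model, dtype=np.float32)

# Step 5.2: project to one logit per vocabulary token
logits = last_token_embedding @ W_lm_head + b_lm_head    # [50000]

# Step 5.3: softmax, shifted by the max logit for numerical stability
exp_shifted = np.exp(logits - logits.max())
probabilities = exp_shifted / exp_shifted.sum()

# Step 5.4: greedy choice of the next token
next_token_id = int(np.argmax(probabilities))
print(logits.shape, float(probabilities.sum()), next_token_id)
```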

3.1.4 Choosing a Single Token from the Probability Distribution (Sampling/Decoding)

From this image, we know that “Dear” has the highest probability. The strategy for choosing a single token from the probability distribution is called the decoding strategy. The simplest one is greedy decoding, which always picks the token with the highest probability. However, this can lead to repetitive or generic outputs.
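To contrast greedy decoding with sampling, here is a small NumPy sketch. The five-token vocabulary and its logits are hypothetical, and the temperature parameter is one common way to rescale logits before sampling; it is an addition for illustration, not something taken from the text above.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a five-token vocabulary
vocab = ["Dear", "Hello", "Hi", "To", "Greetings"]
logits = np.array([4.0, 3.2, 2.5, 1.0, 0.5])

# Greedy decoding: always take the most likely token
greedy_token = vocab[int(np.argmax(logits))]

# Sampling: draw a token index from the probability distribution
def sample(logits, temperature, rng):
    probs = softmax(logits / temperature)
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(42)
samples = [vocab[sample(logits, temperature=1.0, rng=rng)] for _ in range(10)]
print("greedy: ", greedy_token)
print("sampled:", samples)
```

Greedy decoding returns the same token every time, while sampling occasionally picks lower-ranked tokens. Lower temperatures concentrate probability on the top token, and higher temperatures flatten the distribution.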

prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the input IDs to the GPU
input_ids = input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

Debugging the model_output

# This is the contextualized embedding for the input
model_output[0][0]

tensor([[-0.3320,  1.1797,  0.2949,  ..., -0.3125,  0.7383,  0.1562],
        [-0.1377,  0.3145,  0.3750,  ...,  0.5547, -0.1445, -0.6055],
        [-0.5781,  1.0859,  1.5391,  ..., -0.4121,  0.2773,  0.4043],
        [-0.3984,  0.6836,  0.2480,  ..., -0.0762,  0.8164, -0.6836],
        [-0.6562,  0.6914,  0.5430,  ...,  0.2422,  0.1875, -0.2930]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<SelectBackward0>)

Debugging the lm_head_output

# 1. Basic shape and type
# print(f"\n1. Type: {type(lm_head_output)}")
# print(f"   Shape: {lm_head_output.shape}")
# print(f"   Dtype: {lm_head_output.dtype}")
# print(f"   Device: {lm_head_output.device}")
# Output: 1. Type: <class 'torch.Tensor'>
#            Shape: torch.Size([1, 5, 32064])
#            Dtype: torch.bfloat16
#            Device: cuda:0

# 2. Understand dimensions
batch_size, seq_length, vocab_size = lm_head_output.shape
# print(f"   - Batch size: {batch_size}")
# print(f"   - Sequence length: {seq_length}")
# print(f"   - Vocabulary size: {vocab_size}")
# Output: 2. Dimensions breakdown:
#            - Batch size: 1
#            - Sequence length: 5
#            - Vocabulary size: 32064

import torch
# 3. Show the actual tokens in the input
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
for i, token in enumerate(tokens):
    print(f"   Position {i}: '{token}'")

   Position 0: '▁The'
   Position 1: '▁capital'
   Position 2: '▁of'
   Position 3: '▁France'
   Position 4: '▁is'

# 4. For each position, show top 3 predictions
for pos in range(seq_length):
    print(f"\n   After position {pos} ('{tokens[pos]}'):")
    top_3_values, top_3_indices = lm_head_output[0, pos].topk(3)
    for j in range(3):
        token_id = top_3_indices[j].item()
        score = top_3_values[j].item()
        token_text = tokenizer.decode(token_id)
        print(f"      {j + 1}. '{token_text}' (ID: {token_id}, score: {score:.2f})")


   After position 0 ('▁The'):
      1. 'code' (ID: 775, score: 38.00)
      2. '`' (ID: 421, score: 37.00)
      3. 'function' (ID: 740, score: 36.50)

   After position 1 ('▁capital'):
      1. 'of' (ID: 310, score: 50.00)
      2. 'city' (ID: 4272, score: 49.75)
      3. 'ist' (ID: 391, score: 48.50)

   After position 2 ('▁of'):
      1. 'France' (ID: 3444, score: 48.50)
      2. 'the' (ID: 278, score: 48.50)
      3. 'Australia' (ID: 8314, score: 47.25)

   After position 3 ('▁France'):
      1. 'is' (ID: 338, score: 54.00)
      2. ',' (ID: 29892, score: 51.50)
      3. '
' (ID: 13, score: 50.50)

   After position 4 ('▁is'):
      1. 'Paris' (ID: 3681, score: 44.50)
      2. '_' (ID: 903, score: 41.00)
      3. 'not' (ID: 451, score: 40.25)

The model generates predictions at all input positions because transformers are trained to predict the next token at each position simultaneously. For text generation, we only need the last position’s logits; earlier positions are byproducts of the architecture.

import torch

# 5. Focus on the last position (next token prediction)
# print(f"   Input prompt: '{prompt}'")
# Output:
# Input prompt: 'The capital of France is'

last_pos_logits = lm_head_output[0, -1]
# print(f"   Logits shape at last position: {last_pos_logits.shape}")
# print(f"   Min logit: {last_pos_logits.min().item():.2f}")
# print(f"   Max logit: {last_pos_logits.max().item():.2f}")
# print(f"   Mean logit: {last_pos_logits.mean().item():.2f}")

# Output
# Logits shape at last position: torch.Size([32064])
# Min logit: 12.62
# Max logit: 44.50
# Mean logit: 24.50

print(f"\n   Top 5 next token predictions:")


   Top 5 next token predictions:

top_5_values, top_5_indices = last_pos_logits.topk(5)
for i in range(5):
    token_id = top_5_indices[i].item()
    score = top_5_values[i].item()
    token_text = tokenizer.decode(token_id)
    print(f"      {i + 1:2d}. '{token_text}' (ID: {token_id:5d}, score: {score:8.2f})")

       1. 'Paris' (ID:  3681, score:    44.50)
       2. '_' (ID:   903, score:    41.00)
       3. 'not' (ID:   451, score:    40.25)
       4. '...' (ID:   856, score:    39.75)
       5. '
' (ID:    13, score:    39.50)

In this case, the score is a logit rather than a probability. A logit is the raw output from the LM head before softmax is applied. It can be positive or negative, and higher logits correspond to higher probabilities after softmax. To convert logits to probabilities, apply softmax:

# Get the logit for token 3681 (Paris)
paris_logit = lm_head_output[0, -1, 3681]
# print(f"Paris logit: {paris_logit.item():.4f}")
# Output:
# Paris logit: 44.5000

# Convert ALL logits to probabilities using softmax
all_logits = lm_head_output[0, -1]
all_probs = torch.softmax(all_logits, dim=-1)

# Get the probability for token 3681
paris_prob = all_probs[3681]
# print(f"Paris probability: {paris_prob.item():.4f} ({paris_prob.item() * 100:.2f}%)")
# Output:
# Paris probability: 0.8750 (87.50%)

# Verify probabilities sum to 1
# print(f"Sum of all probabilities: {all_probs.sum().item():.4f}")
# Output:
# Sum of all probabilities: 1.0000

Example of logit to probability conversion

import torch

# Example logits for 3 tokens
logits = torch.tensor([44.5, 40.0, 38.0])

print("Step 1: Original logits")
print(f"Token A (Paris):  {logits[0]:.2f}")
print(f"Token B (London): {logits[1]:.2f}")
print(f"Token C (Berlin): {logits[2]:.2f}")
# Output: 
# Token A (Paris):  44.50
# Token B (London): 40.00
# Token C (Berlin): 38.00

print("\nStep 2: Calculate exp(logit) for each")
exp_logits = torch.exp(logits)
print(f"exp(44.5) = {exp_logits[0]:.2e}")
print(f"exp(40.0) = {exp_logits[1]:.2e}")
print(f"exp(38.0) = {exp_logits[2]:.2e}")
# Output: 
# exp(44.5) = 2.12e+19
# exp(40.0) = 2.35e+17
# exp(38.0) = 3.19e+16

print("\nStep 3: Sum all exp values")
sum_exp = exp_logits.sum()
print(f"Sum = {sum_exp:.2e}")
# Output: 
# Sum = 2.15e+19

print("\nStep 4: Divide each exp by sum (this is softmax)")
probabilities = exp_logits / sum_exp
print(f"Token A (Paris):  {probabilities[0]:.4f} = {probabilities[0] * 100:.2f}%")
print(f"Token B (London): {probabilities[1]:.4f} = {probabilities[1] * 100:.2f}%")
print(f"Token C (Berlin): {probabilities[2]:.4f} = {probabilities[2] * 100:.2f}%")
# Output: 
# Token A (Paris):  0.9875 = 98.75%
# Token B (London): 0.0110 = 1.10%
# Token C (Berlin): 0.0015 = 0.15%

print(f"\nVerify: Sum of probabilities = {probabilities.sum():.4f}")
# Output:
# Verify: Sum of probabilities = 1.0000

print("\n" + "=" * 60)
print("Compare with torch.softmax:")
probs_torch = torch.softmax(logits, dim=-1)
print(f"Token A: {probs_torch[0]:.4f} = {probs_torch[0] * 100:.2f}%")
print(f"Token B: {probs_torch[1]:.4f} = {probs_torch[1] * 100:.2f}%")
print(f"Token C: {probs_torch[2]:.4f} = {probs_torch[2] * 100:.2f}%")
# Output: 
# Token A: 0.9875 = 98.75%
# Token B: 0.0110 = 1.10%
# Token C: 0.0015 = 0.15%
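One practical caveat: exponentiating logits directly can overflow when the logits are large or the number format is narrow (float16, for example, cannot represent exp(44.5)). A common remedy, sketched below in plain Python, is to subtract the largest logit first, which leaves the result unchanged. The function names are ours.

```python
import math

def softmax_naive(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

logits = [44.5, 40.0, 38.0]
print([round(p, 4) for p in softmax_naive(logits)])
print([round(p, 4) for p in softmax_stable(logits)])

# Same gaps between logits, but values large enough to overflow math.exp;
# the shifted version still works and gives the same probabilities
big = [1000.0, 995.5, 993.5]
print([round(p, 4) for p in softmax_stable(big)])
```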

3.1.5 Parallel Token Processing and Context Size

The model processes all tokens in the input sequence simultaneously, which is a key advantage of the transformer architecture. This allows it to capture complex relationships between tokens regardless of their position in the sequence. The maximum number of tokens the model can process at once is determined by its context size (e.g., 4,096 tokens). If the input exceeds this limit, it must be truncated or processed in chunks.

For text generation, only the output vector at the last position is used to predict the next token.

Each processing stream takes a vector as input and produces a final resulting vector of the same size (often referred to as the model dimension).

3.1.6 Speeding Up Generation by Caching Keys and Values

In autoregressive generation, each newly generated token is appended to the input before the next token is generated. To avoid recomputing the entire sequence each time, models cache the Key and Value matrices from previous tokens. When generating the next token, only the newest token’s Query, Key, and Value need to be computed; its Query is compared against the cached Keys (plus its own), and the new Key and Value are added to the cache.
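The equivalence between cached and uncached attention can be checked with a small single-head NumPy sketch. The projection matrices and token vectors are random stand-ins, and the head size of 16 is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # assumed small head size
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention of one query over all available keys and values."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

tokens = rng.standard_normal((5, d))      # stand-ins for 5 token vectors

# With a cache: each step projects only the newest token
K_cache, V_cache, cached_out = [], [], []
for x in tokens:
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    cached_out.append(attend(x @ Wq, np.array(K_cache), np.array(V_cache)))

# Without a cache: recompute K and V for the whole prefix at every step
full_out = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    full_out.append(attend(prefix[-1] @ Wq, prefix @ Wk, prefix @ Wv))

print(np.allclose(cached_out, full_out))
```

Both loops produce the same outputs, but the cached version performs one Key and one Value projection per step instead of one per token in the prefix.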

3.1.7 Inside the Transformer Block

The attention layer is mainly concerned with incorporating relevant information from other input tokens and positions
The feedforward layer houses the majority of the model’s processing capacity

3.1.7.1 Feedforward Neural Network

When given “The Shawshank”, the language model needs to predict the next token, “Redemption”. The feedforward network is trained on massive text corpora to learn patterns and relationships between words rather than merely memorizing them, because it should also be able to interpolate to unseen data.

3.1.7.2 Self-Attention Mechanism

Memorization and interpolation are not enough; the model also needs to understand the context of words in relation to each other. This is where the self-attention mechanism comes in. It allows the model to weigh the importance of different words in the input sequence when generating each token.

In “The dog chased the squirrel because it”, does “it” refer to “dog” or “squirrel”? The model decides based on the patterns it has seen and learned from the training dataset. Previous sentences may also give more clues; for example, if they refer to the dog as “she”, it becomes clear that “it” refers to the squirrel.

3.1.7.3 Attention Is All You Need

Attention involves two main steps:

Score relevance
- Calculate how important each token is (attention scores)
- This is what we did: Hello’s attention to [CLS]=0.15, Hello=0.20, world=0.50, [SEP]=0.15
Combine information
- Mix the information based on scores
- This is what we did: new_Hello = 0.15×[CLS] + 0.20×Hello + 0.50×world + 0.15×[SEP]
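The combine step is just a weighted sum. In the sketch below the attention weights are the ones quoted above, while the 4-dimensional token vectors are made up for the example.

```python
import numpy as np

# Hypothetical 4-dimensional vectors for the four tokens
values = {
    "[CLS]": np.array([0.1, 0.0, 0.2, 0.1]),
    "Hello": np.array([0.5, 0.3, 0.1, 0.9]),
    "world": np.array([0.2, 0.8, 0.4, 0.3]),
    "[SEP]": np.array([0.0, 0.1, 0.0, 0.2]),
}
# Attention weights for "Hello" from the text; they sum to 1
weights = {"[CLS]": 0.15, "Hello": 0.20, "world": 0.50, "[SEP]": 0.15}

# new_Hello = 0.15×[CLS] + 0.20×Hello + 0.50×world + 0.15×[SEP]
new_hello = sum(weights[t] * values[t] for t in values)
print(new_hello)
```

Because “world” carries half of the weight, the new vector for “Hello” is pulled most strongly toward the vector for “world”.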

The self-attention layer can have multiple heads (for example, 8), where each head learns to focus on different aspects of the input. For example, one head might focus on syntactic relationships while another focuses on semantic relationships.

3.2 Recent Improvements to the Transformer Architecture

3.2.1 More Efficient Attention

The attention layer gets the most research focus because it is the most computationally expensive part of the process.

3.2.1.1 Local/Sparse Attention

Instead of attending to all tokens, the model attends only to a local window of tokens around the current token.
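One way to picture local attention is as a mask over the score matrix. The sketch below assumes a causal sliding window (each position sees itself and the most recent previous positions); the function name and the window size are chosen for illustration.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where query position i may attend to key position j.

    Causal variant (assumed here): j must be among the `window` most
    recent positions up to and including i.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Each row has at most `window` ones, so the number of scores per token stays constant instead of growing with the sequence length.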

3.2.1.2 Multi-query and Grouped-query Attention

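These methods improve the inference scalability of larger models by sharing key and value projections across attention heads, which reduces the size of the key and value matrices involved. Multi-query attention shares a single set of keys and values across all heads, while grouped-query attention shares one set per group of heads.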
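A minimal NumPy sketch of grouped-query attention, assuming 32 query heads that share 8 key/value heads. These numbers are chosen for illustration and are not taken from a specific model; the causal mask is omitted to keep the example short.

```python
import numpy as np

batch, seq_len, head_dim = 1, 3, 96
n_q_heads, n_kv_heads = 32, 8             # assumed: 4 query heads per KV head
group = n_q_heads // n_kv_heads

rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, n_q_heads, seq_len, head_dim))
K = rng.standard_normal((batch, n_kv_heads, seq_len, head_dim))
V = rng.standard_normal((batch, n_kv_heads, seq_len, head_dim))

# Each KV head is shared by `group` consecutive query heads
K_shared = np.repeat(K, group, axis=1)    # [1, 32, 3, 96]
V_shared = np.repeat(V, group, axis=1)

scores = Q @ K_shared.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V_shared                  # [1, 32, 3, 96]
print(out.shape)

# Cached K/V values per token: multi-head vs grouped-query vs multi-query
for name, kv_heads in [("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)]:
    print(f"{name:14s} {2 * kv_heads * head_dim:5d} values per token per layer")
```

The output shape matches regular multi-head attention, but only the smaller K and V tensors need to be stored in the cache.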

3.2.1.3 Flash Attention

It speeds up the attention calculation by optimizing what values are loaded and moved between a GPU’s shared memory (SRAM) and high bandwidth memory (HBM).
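The real speedup comes from a fused GPU kernel, which cannot be reproduced here. What can be sketched is the idea that makes the tiling possible: softmax attention can be computed one block of keys and values at a time while keeping only a running maximum, a running sum, and a running output. The NumPy code below illustrates that idea for a single query; the function names and the block size are assumptions for the example.

```python
import numpy as np

def attention_full(q, K, V):
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def attention_blockwise(q, K, V, block=4):
    """Same result, but K and V are visited one block at a time."""
    d = q.shape[0]
    running_max = -np.inf
    running_sum = 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)
        new_max = max(running_max, s.max())
        rescale = np.exp(running_max - new_max)  # correct earlier partial results
        p = np.exp(s - new_max)
        running_sum = running_sum * rescale + p.sum()
        acc = acc * rescale + p @ Vb
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q = rng.standard_normal(16)
K = rng.standard_normal((10, 16))
V = rng.standard_normal((10, 16))
print(np.allclose(attention_full(q, K, V), attention_blockwise(q, K, V)))
```

Since each block only needs a small working set, a kernel built on this idea can keep that working set in fast on-chip memory instead of writing the full score matrix to HBM.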

3.2.1.4 The Transformer Block

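Recent models also modify the Transformer block relative to the original design. As the Phi-3 walkthrough above shows, normalization is applied before the attention and feedforward layers rather than after them, RMSNorm is used as the normalization, and the feedforward network uses a SwiGLU activation.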

3.2.1.5 Positional Embeddings (RoPE)

RoPE encodes relative position information by applying a rotation to the query and key vectors based on their positions. This allows the model to capture the relative distances between tokens, which is crucial for understanding context in language.

4 Text Classification

Text classification is an important task in NLP. We will use both representation models and generative models. Although both can be used for classification, their approaches differ.

4.1 Text Classification with Representation Models

Classification with pretrained representation models can be done with either a task-specific model or an embedding model. We will not train these models in this chapter; we will only use them.

4.1.1 Representation Model - Task-Specific Model

We use twitter-roberta-base-sentiment-latest on the rotten_tomatoes movie review dataset.

from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

# Output
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
# Return only the top label per input, since the loop below reads output["label"]
pipe = pipeline(
    model=model_path, tokenizer=model_path, device="cuda:0"
)

import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    if output["label"] == "positive":
        assignment = 1
    else:
        assignment = 0
    y_pred.append(assignment)
    
from sklearn.metrics import classification_report


def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred, target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

evaluate_performance(data["test"]["label"], y_pred)

# Output
                 precision    recall  f1-score   support

Negative Review       0.68      0.94      0.79       533
Positive Review       0.91      0.56      0.69       533

       accuracy                           0.75      1066
      macro avg       0.79      0.75      0.74      1066
   weighted avg       0.79      0.75      0.74      1066

4.1.2 Representation Model - Embedding Model

What if we could not find a pretrained task-specific model for our task? We could use an embedding model.

4.1.2.1 Supervised Classification

We will use an embedding model to generate features. Those features can then be used to train a classifier and run inference with it, creating a two-step approach in which the feature extraction and classification steps are separated.

Convert textual input to embedding using an embedding model
Train a classifier (e.g., logistic regression, SVM, random forest) on the embeddings with a training label to predict the target labels

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

from sklearn.linear_model import LogisticRegression

# Train a logistic regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

# Output
              precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066

4.1.2.2 Unsupervised Classification

Use this approach when there is no labeled data but we have candidate labels to choose from.

Convert textual input to embedding using an embedding model
Convert the candidate labels to embeddings using the same embedding model
Compare the cosine similarity between the textual input and the candidate labels

label_embeddings = model.encode(["A negative review",  "A positive review"])

from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

evaluate_performance(data["test"]["label"], y_pred)

# Output
                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066
   
# or
cosine_similarity(
    model.encode(data["train"][8500]["text"]).reshape(1, -1),
    model.encode("A good movie").reshape(1, -1),
)
# Output
array([[0.13109298]], dtype=float32)

# or
cosine_similarity(
    model.encode(data["train"][8500]["text"]).reshape(1, -1),
    model.encode("A bad movie").reshape(1, -1),
)
# Output
array([[0.22194049]], dtype=float32)

# The review is more similar to "A bad movie", so it is classified as negative.

4.2 Text Classification with Generative Models

Generative models work slightly differently, since they are not trained to produce a fixed set of labels but to generate text.

A task-specific model generates numerical values from sequences of tokens while a generative model generates sequences of tokens from sequences of tokens.

However, we can still use a generative model for classification by prompting it to generate the label as its output. If we send only the movie review without any instruction, the model might not know what to do.

Prompt engineering allows prompts to be updated to improve the output generated by the model.

4.2.1 Generative Model - Text-to-Text Transfer Transformer (T5)

The original Transformer has both an encoder and a decoder. From it came encoder-only models such as BERT, decoder-only models such as GPT, and encoder-decoder models such as T5. Like the decoder-only models, these encoder-decoder models are sequence-to-sequence models and generally fall in the category of generative models.

T5 is trained in two steps. In the first step, pretraining, T5 uses a variant of masked language modeling that masks sets of tokens (token spans) instead of individual tokens, so the model must predict masks that can each contain multiple tokens.

In the second step, fine-tuning, each task is converted to a sequence-to-sequence task and the tasks are trained simultaneously.
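To illustrate span masking, here is a small plain-Python sketch that replaces chosen spans with sentinel tokens and builds the matching target sequence. The sentinel names follow T5's `<extra_id_N>` convention; the function name and the hand-picked spans are assumptions for the example (real pretraining selects spans at random).

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns the corrupted input and the target the model should generate.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week".split()
source, target = corrupt_spans(tokens, spans=[(2, 4), (8, 9)])
print(source)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The first masked span covers two tokens (“for inviting”), which is what distinguishes this objective from masking individual tokens.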

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def add_prompt(text):
    prompt = f"Is the following sentence positive or negative? \n{text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False)
    outputs = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

y_pred = []
for item in tqdm(data["test"], total=len(data["test"])):
    output = add_prompt(item["text"])
    y_pred.append(0 if output == "negative" else 1)
    
evaluate_performance(data["test"]["label"], y_pred)

# Output
                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066

4.2.2 Generative Model - ChatGPT for Classification

ChatGPT is trained with preference tuning, in which human annotators rank the outputs of the model based on quality. This manually ranked preference data was used to produce the final model, allowing it to generate more human-like responses and follow instructions better than previous models.

5 Text Clustering and Topic Modeling

Text clustering and topic modeling are unsupervised learning techniques used to group similar documents together based on their content. They are used to cluster unstructured textual data.

5.1 Text Clustering

The process is: 1. Convert the input documents to embeddings with an embedding model. 2. Reduce the dimensionality of embeddings with a dimensionality reduction model. 3. Find groups of semantically similar documents with a cluster model.

5.1.1 Text Clustering Step 1: Convert to Embeddings

Step 1: We convert documents to embeddings using an embedding model.

from datasets import load_dataset

dataset = load_dataset("maartengr/arxiv_nlp")["train"]

# Extract metadata
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]

from sentence_transformers import SentenceTransformer

# Create an embedding for each abstract
embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

embeddings.shape
# (44949, 384)

5.1.2 Text Clustering Step 2: Dimensionality Reduction

Step 2: The embeddings are reduced to a lower-dimensional space using dimensionality reduction. We need to reduce the dimensionality of the embeddings from 384 to a smaller number, for example with Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP).

from umap import UMAP

# We reduce the input embeddings from 384 dimensions to 5 dimensions
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
) 
reduced_embeddings = umap_model.fit_transform(embeddings)

`n_components` is the number of dimensions to reduce to
`min_dist` controls how tightly UMAP clusters points together (lower values create tighter clusters)

5.1.3 Text Clustering Step 3: Clustering

Step 3: We cluster the documents using the embeddings with reduced dimensionality.

A common choice is a centroid-based algorithm like k-means, but it requires the number of clusters to be specified, and we do not know the number of clusters beforehand. Instead, a density-based algorithm such as HDBSCAN determines the number of clusters itself and does not force all data points to be part of a cluster.

Centroid-based vs. density-based clustering.

from hdbscan import HDBSCAN

# We fit the model and extract the clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric="euclidean", cluster_selection_method="eom"
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_

# How many clusters did we generate?
len(set(clusters))
# 156

5.1.4 Text Clustering Inspection

import numpy as np

# Print first three documents in cluster 0
cluster = 0
for index in np.where(clusters==cluster)[0][:3]:
    print(abstracts[index][:300] + "... \n")

import pandas as pd

# Reduce 384-dimensional embeddings to two dimensions for easier visualization
reduced_embeddings = UMAP(
    n_components=2, min_dist=0.0, metric="cosine", random_state=42
).fit_transform(embeddings)

# Create dataframe
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]

# Select outliers and non-outliers (clusters)
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]

import matplotlib.pyplot as plt

# Plot outliers and non-outliers separately
plt.scatter(outliers_df.x, outliers_df.y, alpha=0.05, s=2, c="grey")
plt.scatter(
    clusters_df.x, clusters_df.y, c=clusters_df.cluster.astype(int),
    alpha=0.6, s=2, cmap="tab20b"
)
plt.axis("off")

The generated clusters (colored) and outliers (gray) are represented as a 2D visualization.

Although the result is appealing, it does not yet allow us to see what is happening inside the clusters. To do that, we need to extend text clustering to topic modeling.

5.2 Topic Modeling

Topic modeling results in a set of keywords or phrases that best represent the topic. We will implement it with BERTopic.

Traditionally, topics are represented by a number of keywords but can take other forms.

Classic approaches such as latent Dirichlet allocation (LDA) assume that each topic is characterized by a probability distribution over words and that each document is a mixture of topics. However, this bag-of-words approach does not take the context or meaning of words and phrases into account, whereas Transformer-based models do.

Example of latent Dirichlet allocation.

The process with BERTopic:

Embed documents
Reduce dimensionality
Cluster compressed embeddings
Create a class-based bag-of-words (c-TF-IDF)
Weigh terms

The first three steps are the same as in text clustering.

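To create c-TF-IDF, first create a bag-of-words for the documents in each cluster, treating all documents in a cluster as a single document.

Calculate the class-based term frequency (c-TF): how often each term appears in each cluster.

Calculate the inverse document frequency (IDF) for each term, which down-weights terms that are frequent across all clusters.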

The c-TF-IDF score is calculated by multiplying the term frequency (c-TF) by the inverse document frequency (IDF). This gives us a score that reflects how important a term is to a cluster relative to the entire corpus.
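
The calculation can be sketched in plain Python. This is a minimal illustration rather than BERTopic's own code. It assumes the weighting described in the BERTopic paper, W = tf × log(1 + A / f), where A is the average number of words per cluster and f is the frequency of the term across all clusters. The function name `c_tf_idf` and the toy clusters are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of documents (strings)."""
    # Bag-of-words per cluster: all documents in a cluster are treated as one document
    bows = {c: Counter(" ".join(docs).lower().split()) for c, docs in clusters.items()}

    # Frequency of each term across all clusters
    total = Counter()
    for bow in bows.values():
        total.update(bow)

    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(bows)

    scores = {}
    for c, bow in bows.items():
        n_words = sum(bow.values())
        scores[c] = {
            # c-TF (normalized count in the cluster) times the IDF-like term
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in bow.items()
        }
    return scores

# Hypothetical toy data: two tiny clusters
toy_clusters = {
    0: ["speech recognition with the asr model", "the asr speech system"],
    1: ["clinical notes of the patient", "the patient clinical record"],
}
scores = c_tf_idf(toy_clusters)
top_terms = {c: sorted(s, key=s.get, reverse=True)[:2] for c, s in scores.items()}
print(top_terms)
```

In this toy data, "the" is frequent in both clusters, so its weight is pushed below cluster-specific terms such as "speech" or "clinical".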

from bertopic import BERTopic

# Train our model with our previously defined models
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)

topic_model.get_topic_info()

# Output
|Topic|Count|Name|
|-1|14520|-1_the_of_and_to|
|0|2290|0_speech_asr_recognition_end|

# The `-1` topic is the outlier topic. HDBSCAN does not force all documents to belong in a cluster

# Visualize topics and documents
fig = topic_model.visualize_documents(
    titles, 
    reduced_embeddings=reduced_embeddings, 
    width=1200, 
    hide_annotations=True
)

# Update fonts of legend for easier visualization
fig.update_layout(font=dict(size=16))

The output when we visualize documents and topics.

5.2.1 Topic Modelling + Reranker

5.2.1.1 KeyBERTInspired

Since BERTopic relies on c-TF-IDF to determine the most representative words for each topic, it may not always capture the most semantically relevant terms. To address this, we can add a reranker that uses a more sophisticated method to identify the most representative words for each topic.

After applying the c-TF-IDF weighting, topics can be fine-tuned with a wide variety of representation models, many of which are large language models.

We will use KeyBERTInspired, which extracts keywords from texts by comparing word and document embeddings through cosine similarity. KeyBERTInspired uses c-TF-IDF to extract the most representative documents per topic by calculating the similarity between a document’s c-TF-IDF values and those of the topic they correspond to. The average document embedding per topic is calculated and compared to the embeddings of candidate keywords to rerank the keywords.

KeyBERTInspired representation model procedure.

from bertopic.representation import KeyBERTInspired

# Update our topic representations using KeyBERTInspired
representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)

# Show topic differences
topic_differences(topic_model, original_topics)

# Output
|Topic|Original|Updated|
|0|speech, asr, recognition, end|speech, encoder, phonetic, language|
|1|medical, clinic, biomedical, patient|nlp, ehr, clinical|

Although improved, there is still some redundancy, such as “summaries” and “summary” appearing in the same topic.

5.2.1.2 Maximal Marginal Relevance (MMR)

Maximal Marginal Relevance (MMR) diversifies our topic representations. The algorithm attempts to find a set of keywords that are diverse from one another but still relate to the documents they are compared to. It filters out redundant words and only keeps words that contribute something new to the topic representation.
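
To make the idea of MMR concrete, here is a minimal NumPy sketch of the greedy selection. It is not BERTopic's implementation. The function name `mmr`, the two-dimensional toy embeddings and the keyword list are made up for illustration, and the `diversity` weight is assumed to trade relevance against redundancy linearly.

```python
import numpy as np

def mmr(doc_embedding, word_embeddings, words, top_n=3, diversity=0.2):
    """Select top_n words that are relevant to the document but dissimilar to each other."""
    # Normalize so that dot products are cosine similarities
    doc = doc_embedding / np.linalg.norm(doc_embedding)
    cand = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)

    word_doc_sim = cand @ doc        # relevance of each word to the document
    word_word_sim = cand @ cand.T    # redundancy between words

    # Start with the single most relevant word
    selected = [int(np.argmax(word_doc_sim))]
    remaining = [i for i in range(len(words)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        # Highest similarity of each remaining candidate to anything already selected
        redundancy = word_word_sim[np.ix_(remaining, selected)].max(axis=1)
        mmr_score = (1 - diversity) * word_doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(mmr_score))]
        selected.append(best)
        remaining.remove(best)

    return [words[i] for i in selected]

# Hypothetical 2D embeddings: "summary" and "summaries" are nearly identical
words = ["summary", "summaries", "news"]
word_embeddings = np.array([[1.0, 0.2], [1.0, 0.25], [0.5, 0.9]])
doc_embedding = np.array([1.0, 0.2])

low_diversity = mmr(doc_embedding, word_embeddings, words, top_n=2, diversity=0.2)
high_diversity = mmr(doc_embedding, word_embeddings, words, top_n=2, diversity=0.7)
print(low_diversity, high_diversity)
```

With a low diversity value the near-duplicate keyword is still selected, while a higher value replaces it with a less redundant one.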

from bertopic.representation import MaximalMarginalRelevance

# Update our topic representations to MaximalMarginalRelevance
representation_model = MaximalMarginalRelevance(diversity=0.2)
topic_model.update_topics(abstracts, representation_model=representation_model)

# Show topic differences
topic_differences(topic_model, original_topics)

# Output
|Topic|Original|Updated|
|0|speech, asr, recognition, end, acoustic|speech, asr, error, model, training|
|1|medical, clinical, biomedical, patient, ...|clinical, biomedical, patient|

The resulting topics demonstrate more diversity in their representations.

5.3 Topic Labelling

Topic labelling is the process of assigning a human-readable label to a topic based on its most representative keywords.

Use text generative LLMs and prompt engineering to create labels for topics from keywords and documents related to each topic.

Using an LLM, we receive the following labels for the topics:

# Output
|Topic|Original|Updated|
|0|speech, asr, recognition, end, acoustic|Leveraging External Data ...|
|1|medical, clinical, biomedical, patient|Improved Representation Learning ...|

6 Prompt Engineering

6.1 Text Generation Models

We can control the model output with several parameters:

temperature: A higher temperature increases the likelihood that less probable tokens are generated and vice versa.
top_p: Also known as nucleus sampling, a sampling technique that controls which subset of tokens (the nucleus) the LLM can consider. Tokens are added in order of decreasing probability until their cumulative probability reaches top_p. If we set top_p to 0.1, the LLM only considers the most probable tokens whose cumulative probability reaches 0.1.
top_k: Controls exactly how many tokens the LLM can consider. If you change its value to 100, the LLM will only consider the top 100 most probable tokens.
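
The interaction of these three parameters can be illustrated with a small NumPy sketch. This is an assumption-laden toy, not how any particular library implements sampling: the function name `filter_and_sample` and the example logits are hypothetical.

```python
import numpy as np

def filter_and_sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Apply temperature, top_k and top_p to raw logits, then sample one token id."""
    rng = rng or np.random.default_rng(0)

    # Temperature rescales logits before the softmax: <1 sharpens, >1 flattens
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # token ids, most probable first
    keep = np.ones(len(probs), dtype=bool)

    if top_k is not None:
        # Drop everything outside the k most probable tokens
        keep[order[top_k:]] = False
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative probability reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    # Renormalize the surviving tokens and sample from them
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()
    return int(rng.choice(len(probs), p=filtered)), filtered

# Hypothetical logits for a four-token vocabulary
logits = [2.0, 1.0, 0.5, -1.0]
_, nucleus = filter_and_sample(logits, top_p=0.1)
_, top_two = filter_and_sample(logits, top_k=2)
_, flat = filter_and_sample(logits, temperature=2.0)
_, sharp = filter_and_sample(logits, temperature=0.5)
print(nucleus, top_two, flat.max(), sharp.max())
```

With top_p=0.1 only the single most probable token survives here, because its probability alone already exceeds 0.1.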

6.2 Intro to Prompt Engineering

Prompt examples of common use cases. Notice how within a use case, the structure and location of the instruction can be changed.

6.3 Advanced Prompt Engineering

6.3.1 Advanced Prompt Component

Persona: Describe what role the LLM should take on. For example, use “You are an expert in astrophysics” if you want to ask a question about astrophysics.
Instruction: The task itself. Make sure this is as specific as possible. We do not want to leave much room for interpretation.
Context: Additional information describing the context of the problem or task. It answers questions like “What is the reason for the instruction?”
Format: The format the LLM should use to output the generated text. Without it, the LLM will come up with a format itself, which is troublesome in automated systems.
Audience: The target of the generated text. This also describes the level of the generated output. For education purposes, it is often helpful to use ELI5 (“Explain it like I’m 5”).
Tone: The tone of voice the LLM should use in the generated text. If you are writing a formal email to your boss, you might not want to use an informal tone of voice.
Data: The main data related to the task itself.
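
One way to work with these components is to keep them as separate strings and assemble the prompt from them, so that individual parts can be added or removed while experimenting. The sketch below is only an illustration. The component texts, the `build_prompt` helper and the `<PAPER TEXT>` placeholder are hypothetical.

```python
# Hypothetical component texts for a summarization task
components = {
    "persona": "You are an expert in large language models.",
    "instruction": "Summarize the key findings of the paper provided.",
    "context": "The summary should extract the most crucial points so researchers can quickly understand the paper.",
    "format": "Create a bullet-point summary followed by a one-paragraph conclusion.",
    "audience": "The summary is for busy researchers who need to grasp the newest trends quickly.",
    "tone": "The tone should be professional and clear.",
    "data": "Text to summarize: <PAPER TEXT>",
}

def build_prompt(components, skip=()):
    """Join the components in a fixed order, leaving out any listed in skip."""
    order = ["persona", "instruction", "context", "format", "audience", "tone", "data"]
    parts = [components[k] for k in order if k in components and k not in skip]
    return "\n".join(parts)

prompt = build_prompt(components)
messages = [{"role": "user", "content": prompt}]

# Dropping a component makes it easy to test its effect on the output
prompt_without_tone = build_prompt(components, skip=("tone",))
print(prompt)
```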

6.3.2 In-Context Learning

6.3.3 Chain Prompting

6.4 Reasoning with Generative Models

System 1 thinking represents an automatic, intuitive, and near-instantaneous process. It shares similarities with generative models that automatically generate tokens without any self-reflective behavior. In contrast, system 2 thinking is a conscious, slow, and logical process, akin to brainstorming and self-reflection.

6.4.1 Chain-of-Thought: Think Before Answering

# Answering with chain-of-thought
cot_prompt = [
    {"role": "user", "content": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"},
    {"role": "assistant", "content": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."},
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?"}
]

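The chain-of-thought prompt above requires an example of the reasoning. We can also elicit reasoning without any example, which is called zero-shot chain-of-thought.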

# Zero-shot chain-of-thought
zeroshot_cot_prompt = [
    {"role": "user", "content": "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have? Let's think step-by-step."}
]

6.4.2 Self-Consistency: Sampling Outputs

Because sampling parameters such as temperature and top_p make the output vary between runs, we can sample multiple outputs and select the answer that appears most often. This is called self-consistency. Each sampled output can also use chain-of-thought.
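
A minimal sketch of the voting step is shown below. It assumes that the final answer is the last number in each generated text, which is a simplification. The `generate` callable stands in for a sampled LLM call, and the canned outputs are made up for illustration.

```python
import re
from collections import Counter

def extract_answer(text):
    """Return the last number mentioned in a generated answer, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistency(generate, prompt, n_samples=5):
    """Sample n_samples outputs and return the majority answer with its vote share."""
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None, 0.0
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

# Stand-in for a sampled LLM call (hypothetical): cycles through canned outputs
canned = iter([
    "23 - 20 = 3. 3 + 6 = 9. The answer is 9.",
    "They used 20, so 3 remain. Adding 6 gives 9.",
    "23 + 6 = 29. The answer is 29.",
    "23 - 20 + 6 = 9.",
    "The answer is 9.",
])
answer, share = self_consistency(lambda p: next(canned), "How many apples?", n_samples=5)
print(answer, share)
```

The vote share can serve as a rough confidence signal, at the cost of generating several outputs per question.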

6.4.3 Tree-of-Thought: Exploring Intermediate Steps

Tree-of-Thought (ToT) prompting is a method that allows the model to explore multiple reasoning paths by generating intermediate steps in a tree-like structure. This approach can help the model find better solutions by considering various possibilities and their consequences.

By leveraging a tree-based structure, generative models can generate intermediate thoughts to be rated. The most promising thoughts are kept and the lowest are pruned.

# Zero-shot tree-of-thought
zeroshot_tot_prompt = [
    {"role": "user", "content": "Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they're wrong at any point then they leave. The question is 'The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?' Make sure to discuss the results."}
]

6.5 Output Verification

We can use three approaches to control the output:

Examples: provide a number of examples of the expected output
Grammar: control the token selection process
Fine-tuning: tune a model on data that contains the expected output (discussed in later chapter)

6.5.1 Examples

# One-shot learning: Providing an example of the output structure
one_shot_template = """Create a short character profile for an RPG game. Make sure to only use this format:

{
  "description": "A SHORT DESCRIPTION",
  "name": "THE CHARACTER'S NAME",
  "armor": "ONE PIECE OF ARMOR",
  "weapon": "ONE OR MORE WEAPONS"
}
"""
one_shot_prompt = [
    {"role": "user", "content": one_shot_template}
]

6.5.2 Grammar

We can use packages such as Guidance, Guardrails, and LMQL to control the token selection process.

Another option is to use an LLM to check whether the output correctly follows our rules.

As an example, we can use llama_cpp to generate structured JSON output:

from llama_cpp.llama import Llama

# Load Phi-3
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

# Generate output
output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Create a warrior for an RPG in JSON format."},
    ],
    response_format={"type": "json_object"},
    temperature=0,
)['choices'][0]['message']["content"]
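
Even with constrained generation, it is reasonable to verify the result before using it downstream. The following sketch checks that a string parses as JSON and contains the keys we expect. The helper `parse_character`, the required keys and the sample strings are hypothetical and are not actual model output.

```python
import json

def parse_character(raw, required_keys=("name", "weapon")):
    """Return the parsed dict if raw is valid JSON with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not all(k in data for k in required_keys):
        return None
    return data

# Hypothetical examples of a well-formed and a malformed response
sample_output = '{"name": "Eldrin", "class": "Warrior", "weapon": "Sword"}'
valid = parse_character(sample_output)
invalid = parse_character("Sure! Here is your warrior: {name: Eldrin}")
print(valid, invalid)
```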

7 Advanced Text Generation Techniques and Tools

Methods and concepts for improving the quality of generated text

Model I/O: Loading and working with LLMs
Memory: Helping LLMs to remember
Agents: Combining complex behavior with external tools
Chains: Connecting methods and modules

7.1 Model I/O

A GGUF model represents a compressed version of its original counterpart through quantization.

7.2 Chains

An example of a single chain using Phi-3’s template

from langchain import PromptTemplate

# Create a prompt template with the "input_prompt" variable
template = """<s><|user|>
{input_prompt}<|end|>
<|assistant|>"""
prompt = PromptTemplate(
    template=template,
    input_variables=["input_prompt"]
)

basic_chain = prompt | llm

# Use the chain
basic_chain.invoke(
    {
        "input_prompt": "Hi! My name is Maarten. What is 1 + 1?",
    }
)

7.3 Memory

Three types of memory:

ConversationBufferMemory: remembers everything
ConversationBufferWindowMemory: remembers only the last n interactions
ConversationSummaryMemory: remembers a summary of the conversation
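
The difference between the first two memory types can be sketched without any framework. The classes below are a plain-Python illustration, not LangChain's implementation. The class names and the `save` and `load` methods are hypothetical, and the window size `k` is assumed to be at least 1. A summary memory is left out because it needs an LLM call to produce the summary.

```python
class BufferMemory:
    """Keeps the full conversation history."""

    def __init__(self):
        self.turns = []

    def save(self, user, assistant):
        self.turns.append((user, assistant))

    def load(self):
        # The whole history is rendered as text to prepend to the next prompt
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)


class WindowMemory(BufferMemory):
    """Keeps only the last k interactions (k >= 1)."""

    def __init__(self, k=2):
        super().__init__()
        self.k = k

    def load(self):
        recent = self.turns[-self.k:]
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in recent)


buffer_memory, window_memory = BufferMemory(), WindowMemory(k=2)
conversation = [
    ("Hi! My name is Maarten.", "Hello Maarten!"),
    ("What is 1 + 1?", "1 + 1 = 2."),
    ("What is 3 + 3?", "3 + 3 = 6."),
]
for user, assistant in conversation:
    buffer_memory.save(user, assistant)
    window_memory.save(user, assistant)

print(window_memory.load())
```

After three interactions, the window memory with k=2 has already forgotten the name given in the first turn, while the buffer memory still contains it.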

7.4 Agents

For this example, we will use the ReAct framework.

The framework consists of three steps:

Thought: create a thought about the input prompt
Action: based on the thought, an action is triggered. The action could be an external tool like a calculator or a search engine.
Observation: the output of the tool is sent to the LLM and the LLM observes the output, which is often a summary of whatever result it retrieved.

8 Semantic Search and Retrieval-Augmented Generation

We look at three broad categories of these systems:

Dense retrieval: one of the key types of semantic search, relying on the similarity of text embeddings to retrieve relevant results.
Reranking: the second key type of semantic search. It takes a search query and a collection of results and reorders them by relevance, often resulting in vastly improved results.
RAG: a RAG system formulates an answer to a question and (preferably) cites its information sources.

8.1 Semantic Search

8.1.1 Dense Retrieval

Dense retrieval relies on the similarity of text embeddings to retrieve relevant results.

8.1.1.1 Chunking

We have options for:

One vector per document: results in a highly compressed vector that loses a lot of the information in the document.
Multiple vectors per document: results in vectors that capture individual concepts inside the text, leading to a more expressive search index.

8.1.1.2 Nearest Neighbor Search vs. Vector Database

Once the query is embedded, we need to find the vectors nearest to it among the embeddings of our texts. We can either use:

Nearest neighbor search: a method that calculates the distance between the query embedding and all the text embeddings in the index to find the closest matches. This can be computationally expensive, especially for large datasets. It can be done with NumPy.
Vector database: a specialized database designed to store and query high-dimensional vectors efficiently. It uses indexing techniques to speed up the retrieval process, making it more scalable for large datasets. It can be done with FAISS, Weaviate, or Pinecone.
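
The exhaustive variant can be sketched in a few lines of NumPy. The example uses random vectors instead of real text embeddings, and the function name `nearest_neighbors` is hypothetical. Cosine similarity is assumed as the similarity measure.

```python
import numpy as np

def nearest_neighbors(query_embedding, doc_embeddings, top_k=3):
    """Exhaustive cosine-similarity search over all document embeddings."""
    q = query_embedding / np.linalg.norm(query_embedding)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    similarities = d @ q                         # one score per document
    top = np.argsort(similarities)[::-1][:top_k]  # indices of the best matches
    return top, similarities[top]

# Random stand-ins for real embeddings (1,000 documents, 384 dimensions)
rng = np.random.default_rng(42)
doc_embeddings = rng.normal(size=(1000, 384))

# Query built as a slightly noisy copy of document 7, so document 7 should rank first
query_embedding = doc_embeddings[7] + 0.01 * rng.normal(size=384)
indices, scores = nearest_neighbors(query_embedding, doc_embeddings)
print(indices, scores)
```

Every query is compared with every document, so the cost grows linearly with the size of the archive. Vector databases avoid this with approximate indexes.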

8.1.2 Reranking

Reranking takes a search query and a collection of results, and reorders them by relevance, often resulting in vastly improved results.

LLM rerankers operate as part of a search pipeline with the goal of reordering a number of shortlisted search results by relevance.

A reranker assigns a relevance score to each document by looking at the document and the query at the same time.

8.1.3 Retrieval Evaluation Metrics

We can evaluate retrieval with mean average precision (MAP). For this we need a text archive, a set of queries, and relevance judgments indicating which documents are relevant for each query.

Looking at the relevance judgements from our test suite, we can see that system 1 did a better job than system 2.

But what if our results are as follows?

We need a scoring system that rewards system 1 for assigning a high position to a relevant result—even though both systems retrieved only one relevant result in their top three results.

To calculate MAP, we need to first calculate average precision (AP):

Then we calculate MAP

The mean average precision takes into consideration the average precision score of a system for every query in the test suite. By averaging them, it produces a single metric that we can use to compare a search system against another.
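
As a worked example, the sketch below computes AP and MAP for two hypothetical systems. It assumes binary relevance judgments and normalizes AP by the number of relevant documents for the query, which is one common convention. The document ids, queries and function names are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(results, judgments):
    """Average the AP over every query in the test suite."""
    return sum(average_precision(results[q], judgments[q]) for q in judgments) / len(judgments)

# Hypothetical relevance judgments and top-three results of two systems
judgments = {"q1": {"d1"}, "q2": {"d2", "d3"}}
system_1 = {"q1": ["d1", "d4", "d5"], "q2": ["d2", "d9", "d3"]}
system_2 = {"q1": ["d4", "d5", "d1"], "q2": ["d9", "d2", "d3"]}

map_1 = mean_average_precision(system_1, judgments)
map_2 = mean_average_precision(system_2, judgments)
print(round(map_1, 3), round(map_2, 3))
```

For query q1 both systems retrieve the same single relevant document, but system 1 ranks it first (AP = 1.0) while system 2 ranks it third (AP = 1/3), which is the behavior the metric is meant to reward.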

Aside from MAP, we can also use normalized discounted cumulative gain (nDCG) which is more nuanced in that the relevance of documents is not binary (relevant versus not relevant) and one document can be labeled as more relevant than another in the test suite and scoring mechanism.
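
A small sketch of nDCG with graded relevance follows. It assumes the common log2 discount and, as a simplification, computes the ideal ordering from the retrieved list only. The relevance grades are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: higher ranks contribute more."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the top-three results (3 = highly relevant, 0 = not relevant)
perfect_order = ndcg([3, 2, 0])
reversed_order = ndcg([0, 2, 3])
print(perfect_order, round(reversed_order, 3))
```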

8.2 Retrieval-Augmented Generation (RAG)

8.2.1 Advanced RAG Techniques

8.2.1.1 Query Rewriting

Query rewriting is a technique that reformulates the original query to improve retrieval performance. It can be used to clarify ambiguous queries, expand queries with synonyms, or add context to the query.

Assuming:

Original query: “We have an essay due tomorrow. We have to write about some animal. I love penguins. I could write about them. But I could also write about dolphins. Are they animals? Maybe. Let’s do dolphins. Where do they live for example?”
Rewritten query: “Where do dolphins live”

8.2.1.2 Multi-query RAG

Multi-query RAG is a technique that generates multiple queries from the original query to retrieve a more diverse set of relevant documents. This can be done by using different query reformulation strategies or by generating queries that focus on different aspects of the original query. We then present the top results of all queries to the model for grounded generation.

Assuming:

Original query: “Compare the financial results of Nvidia in 2020 vs. 2023”
Rewritten queries:
1. “Nvidia 2020 financial results”
2. “Nvidia 2023 financial results”

8.2.1.3 Multi-hop RAG

Multi-hop RAG is a technique that allows the model to retrieve information from multiple sources or documents to answer a complex query. The model can perform multiple retrieval steps, where the output of one retrieval step is used as input for the next retrieval step. This allows the model to gather information from different sources and combine it to generate a more comprehensive answer.

Assuming:

Original query: “Who are the largest car manufacturers in 2023? Do they each make EVs or not?”

The system must first search for:

Step 1, Query 1: “largest car manufacturers 2023”
Assuming the answer is Toyota, Volkswagen, and Hyundai, it should then ask follow-up questions:
Step 2, Query 1: “Toyota Motor Corporation electric vehicle”
Step 2, Query 2: “Volkswagen AG electric vehicle”
Step 2, Query 3: “Hyundai Motor Company electric vehicle”

8.2.1.4 Query routing

Query routing is a technique that directs different types of queries to different retrieval systems or models based on the characteristics of the query. For example, we can specify that if the model gets a question about HR, it should search the company’s HR information system, but if the question is about customer data, it should search the customer relationship management (CRM) system.
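
The routing decision can be sketched with a simple keyword match. In practice the decision would more likely be made by an LLM or a classifier. The function `route_query`, the route names and the keyword sets below are hypothetical.

```python
def route_query(query, routes, default="general"):
    """Pick the data source whose keywords overlap most with the query."""
    tokens = set(query.lower().replace("?", " ").split())
    scores = {name: len(tokens & keywords) for name, keywords in routes.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default source when no keyword matches at all
    return best if scores[best] > 0 else default

# Hypothetical data sources and the keywords that point to them
routes = {
    "hr_system": {"vacation", "leave", "salary", "payroll", "benefits"},
    "crm": {"customer", "customers", "account", "order", "orders"},
}

print(route_query("How many vacation days do I have left?", routes))
print(route_query("Show the latest orders for this customer", routes))
print(route_query("What is the capital of France?", routes))
```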

8.2.1.5 Agentic RAG

Agentic RAG is a technique that combines retrieval-augmented generation with agent-based systems. In this approach, the model can interact with external tools or agents to retrieve information or perform actions that are necessary to answer the query.

8.2.2 RAG Evaluation

Although there are still ongoing developments in how to evaluate RAG models, we can refer to the paper “Evaluating Verifiable Generative Search Engines” for some of the evaluation methods, covering:

Fluency: whether the generated text is fluent and cohesive
Perceived utility: whether the generated answer is helpful and informative
Citation recall: the proportion of generated statements about the external world that are fully supported by their citations
Citation precision: the proportion of generated citations that support their associated statements
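
Given human judgments, the two citation metrics reduce to simple ratios. The sketch below assumes that each generated statement has already been annotated. The data structure, its field names and the annotations are hypothetical.

```python
def citation_metrics(statements):
    """statements: list of dicts with
    'fully_supported' (bool): is the statement fully supported by its citations?
    'citations' (list of bool): does each individual citation support the statement?"""
    recall = sum(s["fully_supported"] for s in statements) / len(statements)
    citations = [c for s in statements for c in s["citations"]]
    precision = sum(citations) / len(citations) if citations else 0.0
    return recall, precision

# Hypothetical annotations for four generated statements
annotated = [
    {"fully_supported": True, "citations": [True, True]},
    {"fully_supported": True, "citations": [True, False]},
    {"fully_supported": False, "citations": [False]},
    {"fully_supported": False, "citations": []},
]
recall, precision = citation_metrics(annotated)
print(recall, precision)
```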

Moreover, we can use Ragas, which adds:

Faithfulness: whether the answer is consistent with the provided context
Answer relevance: how relevant the answer is to the question

9 Multimodal Large Language Models

Discusses:

The Vision Transformer (ViT) model
The embedding for a vision model
Complementing a text generation model with a vision model

9.1 Transformers for Vision

In the original Transformer, the input is tokenized into subword tokens and then embedded into a vector space. In vision transformers:

The input image is divided into patches. In this example we are using 3x3 patches, but the original implementation used 16x16 patches.
Each patch is flattened into a 1xn vector.
Unlike tokens, we can’t just assign each patch an ID, since these patches will rarely be found in other images, unlike the vocabulary of a text.
Instead, the flattened patches are linearly projected (embedded) into a vector space.

The main algorithm behind ViT. After patching the images and linearly projecting them, the patch embeddings are passed to the encoder and treated as if they were textual tokens.
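
The patching and projection steps can be sketched with NumPy. This is a shape-level illustration only: the projection matrix is random here, whereas it is learned in a real model, and the image size (224x224), embedding size (768) and function name are assumptions chosen for the example.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, projection):
    """Split an (H, W, C) image into patches, flatten them and project them linearly."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Split into a grid of non-overlapping patches
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into a single vector of length patch_size * patch_size * c
    flat = patches.reshape(-1, patch_size * patch_size * c)
    # Linear projection into the embedding space
    return flat @ projection

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                 # stand-in for a real RGB image
projection = rng.normal(size=(16 * 16 * 3, 768))  # learned in a real model
patch_embeddings = image_to_patch_embeddings(image, 16, projection)
print(patch_embeddings.shape)  # (196, 768)
```

A 224x224 image with 16x16 patches yields 14 x 14 = 196 patch embeddings, which the encoder then processes like a sequence of 196 tokens.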

9.2 Multimodal Embedding Models

Multimodal embedding models are designed to handle and integrate information from multiple modalities, such as text, images, and audio.

Multimodal embedding models can create embeddings for multiple modalities in the same vector space

Similar to text embeddings, multimodal embeddings allow for comparing multimodal representations since the resulting embeddings lie in the same vector space.

Despite coming from different modalities, embeddings with similar meaning will be close to each other in vector space.

9.2.1 Multimodal Embedding - Contrastive Language-Image Pre-Training (CLIP)

CLIP is an embedding model that learns to associate images and text by training on a large dataset of image-text pairs. It uses a contrastive learning approach, where the model is trained to maximize the similarity between the embeddings of matching image-text pairs and minimize the similarity between non-matching pairs. The resulting embeddings lie in the same vector space, which means that the embeddings of images can be compared with the embeddings of text.

Tasks for CLIP:

Zero-shot classification: we can compare the embedding of an image with that of the description of its possible classes to find which class is most similar
Clustering: cluster both images and a collection of keywords to find which keywords belong to which sets of images.
Search: across billions of texts or images, we can quickly find what relates to an input text or image.
Generation: use multimodal embeddings to drive the generation of images (e.g. stable diffusion)

9.2.2 CLIP Generate Multimodal Embedding

CLIP generates multimodal embeddings by using two separate encoders: one for images and one for text (both matching and non-matching captions). The image encoder is either a ResNet (a convolutional neural network) or a Vision Transformer that processes the input image and produces an embedding vector. The text encoder is a Transformer-based model that processes the input text and produces another embedding vector.

The type of data that is needed to train a multimodal embedding model.

During training, CLIP learns to align these two embedding spaces by maximizing the cosine similarity between the embeddings of matching image-text pairs and minimizing it for non-matching pairs. When training starts, the similarity between matching image and text embeddings will be low, but as training progresses, the model learns to bring the embeddings of matching pairs closer together in the vector space, while pushing non-matching pairs further apart.

In the first step of training CLIP, both images and text are embedded using an image and text encoder.

From the second step onward, the text and image encoders are updated to match what the intended similarity should be. This updates the embeddings such that they are closer in vector space if the inputs are similar.
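
The training objective can be sketched as a symmetric cross-entropy over the image-text similarity matrix. This is a NumPy illustration of the idea and not CLIP's actual training code. The function names are hypothetical, the embeddings are random stand-ins for encoder outputs, and the fixed temperature of 0.07 is an assumption (in CLIP it is a learned parameter).

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of image_emb matches row i of text_emb."""
    # Normalize so the dot product equals cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (n_images, n_texts)
    diag = np.arange(len(logits))
    # Matching pairs sit on the diagonal; each row and each column is a classification task
    loss_images = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_texts = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_images + loss_texts) / 2

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))  # stand-in for image encoder outputs

# Text embeddings that nearly match their images vs. text embeddings paired with the wrong images
aligned_loss = clip_style_loss(image_emb, image_emb + 0.01 * rng.normal(size=(4, 8)))
misaligned_loss = clip_style_loss(image_emb, image_emb[::-1])
print(aligned_loss, misaligned_loss)
```

The loss is low when matching pairs are the most similar entries in their row and column, and high when the captions are paired with the wrong images. Minimizing it pulls matching pairs together and pushes non-matching pairs apart.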

9.2.3 OpenCLIP

OpenCLIP is an open-source implementation of the CLIP model.

from urllib.request import urlopen
from PIL import Image

# Load an AI-generated image of a puppy playing in the snow
puppy_path = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/puppy.png"
image = Image.open(urlopen(puppy_path)).convert("RGB")
caption = "a puppy playing in the snow"

from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"

# Load a tokenizer to preprocess the text
clip_tokenizer = CLIPTokenizerFast.from_pretrained(model_id)

# Load a processor to preprocess the images
clip_processor = CLIPProcessor.from_pretrained(model_id)

# Main model for generating text and image embeddings
model = CLIPModel.from_pretrained(model_id)

## Handling the caption
inputs = clip_tokenizer(caption, return_tensors="pt")
text_embedding = model.get_text_features(**inputs)

## Handling the image
processed_image = clip_processor(
    text=None, images=image, return_tensors='pt'
)['pixel_values']
image_embedding = model.get_image_features(processed_image)

## Combine the caption and image
# Normalize the embeddings
text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
image_embedding /= image_embedding.norm(dim=-1, keepdim=True)

# Calculate their similarity
text_embedding = text_embedding.detach().cpu().numpy()
image_embedding = image_embedding.detach().cpu().numpy()
score = text_embedding @ image_embedding.T
score

# Output
array([[0.33149636]], dtype=float32)

We can compare the caption to other images in the same way.

The similarity matrix between three images and three captions.

9.3 Making Text Generation Models Multimodal

To bridge the gap between multimodal embedding models and text generation models, we can use BLIP-2: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation 2. BLIP-2 is a modular technique to introduce vision capabilities to existing language models.

Instead of creating a multimodal language model from scratch, BLIP-2 bridges the vision-language gap with a component named the Querying Transformer (Q-Former), which connects a pretrained image encoder and a pretrained LLM.

By leveraging pretrained models, BLIP-2 only needs to train the bridge without needing to train the image encoder and LLM from scratch.

The Querying Transformer is the bridge between vision (ViT) and text (LLM), and it is the only trainable component of the pipeline.

To connect the two pretrained models, Q-Former has two modules that share their attention layers:

An Image Transformer to interact with the frozen Vision Transformer for feature extraction
A Text Transformer that can interact with the LLM

BLIP-2 training steps:

The image is processed by the frozen ViT to extract the vision embedding.
The vision embedding is passed to the Q-Former, which generates a set of query tokens that capture the relevant information from the image.
The query tokens are then passed to the LLM, which generates a text description of the image based on the information captured by the query tokens.
The generated text is compared to the ground truth caption, and the loss is backpropagated through the Q-Former to update its parameters.
The ViT and LLM remain frozen during this process, allowing the Q-Former to learn how to effectively bridge the gap between the two modalities without needing to retrain the entire model.
After training, the Q-Former can effectively translate the visual information from the ViT into a format that the LLM can understand, enabling the generation of accurate and coherent text descriptions based on the input images.

The Q-Former is trained on three tasks:

Image-Text Contrastive Learning: the model learns to align the image and text embeddings by maximizing the cosine similarity between the embeddings of matching image-text pairs and minimizing it for non-matching pairs.
Image-Text Matching: the model learns to predict whether a given image and text pair match or not, which helps the model to understand the relationship between images and text.
Image Captioning: the model learns to generate a caption for a given image, which helps the model to learn how to generate coherent and relevant text based on visual input.

Alternatives to BLIP-2 include LLaVA and Idefics 2.

9.3.1 Use Case 1: Image Captioning

By default, BLIP-2 is trained for image captioning, which means that it can generate a text description of an image.

from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

# Load processor and main model
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16
)

# Send the model to GPU to speed up inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load Rorschach image
url = "https://upload.wikimedia.org/wikipedia/commons/7/70/Rorschach_blot_01.jpg"
image = Image.open(urlopen(url)).convert("RGB")

# Generate caption
inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = blip_processor.batch_decode(
    generated_ids, skip_special_tokens=True
)
generated_text = generated_text[0].strip()
generated_text

9.3.2 Use Case 2: Multimodal Chat-Based Prompting

Since the default behavior of BLIP-2 is image captioning, to chat about the image we need to provide a prompt to the blip_processor.

from urllib.request import urlopen

import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

# Load processor and main model
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16
)

# Send the model to GPU to speed up inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an AI-generated image of a supercar
# (car_path is assumed to hold the image URL; it is not defined in these notes)
image = Image.open(urlopen(car_path)).convert("RGB")

# Chat-like prompting
prompt = "Question: Write down what you see in this picture. Answer: A sports car driving on the road at sunset. Question: What would it cost me to drive that car? Answer:"

# Generate output
inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
generated_text = blip_processor.batch_decode(
    generated_ids, skip_special_tokens=True
)
generated_text = generated_text[0].strip()
generated_text

10 Creating Text Embedding Models

10.1 Contrastive Learning

There are many ways in which we can train, fine-tune, and guide embedding models, but one of the strongest and most widely used techniques is called contrastive learning. Contrastive learning trains a model to differentiate between similar and dissimilar pairs of data points. The model learns to bring similar data points closer together in the embedding space while pushing dissimilar data points further apart.

For example, you could teach a model to understand what a dog is by letting it find features such as “tail,” “nose,” “four legs,” etc. This learning process can be quite difficult since features are often not well-defined and can be interpreted in a number of ways. A being with a “tail,” “nose,” and “four legs” can also be a cat. To help the model steer toward what we are interested in, we essentially ask it, “Why is this a dog and not a cat?” By providing the contrast between two concepts, it starts to learn the features that define the concept but also the features that are not related. We get more information when we frame a question as a contrast.

When we feed an embedding model different contrasts (degrees of similarity), it starts to learn what makes things different from one another and thereby the distinctive characteristics of concepts.

To apply contrastive learning to create a text embedding model, we can use sentence-transformers.

10.2 sentence-BERT (SBERT)

SBERT (available through sentence-transformers) is a modification of the BERT architecture that addresses BERT's computational overhead when comparing sentences. Before SBERT, sentence similarity with BERT was often computed with an architectural structure called a cross-encoder.

A cross-encoder allows two sentences to be passed to the Transformer network simultaneously to predict the extent to which the two sentences are similar. However, every pair needs its own forward pass: with a collection of n = 10,000 sentences we would need n x (n-1) / 2 = 49,995,000 computations.

The architecture of a cross-encoder. Both sentences are concatenated, separated with a `<SEP>` token, and fed to the model simultaneously.
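The pair count quoted above can be checked with a short calculation. This is only an illustrative sketch; the function names are hypothetical helpers, not part of any library. It also shows the contrast with embedding each sentence separately, which needs one forward pass per sentence.

```python
import math

def cross_encoder_comparisons(n: int) -> int:
    """Number of unordered sentence pairs a cross-encoder must score."""
    return math.comb(n, 2)

def separate_encoding_passes(n: int) -> int:
    """Embedding each sentence on its own needs one forward pass per sentence."""
    return n

n_sentences = 10_000
print(cross_encoder_comparisons(n_sentences))  # 49995000
print(separate_encoding_passes(n_sentences))   # 10000
```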

sentence-transformers dropped the classification head and used mean pooling on the final output layer to generate an embedding. The pooling layer averages the word embeddings and gives back a fixed dimensional output vector. Then, models are optimized through the similarity of the sentence embeddings. Since the weights are identical for both BERT models, we can use a single model and feed it the sentences one after the other.

The architecture of the original `sentence-transformers` model, which leverages a Siamese network, also called a bi-encoder.
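As a rough illustration of the pooling step, the following sketch averages token vectors while skipping padding positions. It uses plain Python lists and made-up numbers; `mean_pool` and the toy values are assumptions for illustration, not the sentence-transformers implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input
    return [s / max(count, 1) for s in summed]

# Toy final-layer output: three real tokens and one padding token
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [3.0, 4.0]
```

Whatever the sentence length, the result has the same dimensionality as a single token vector, which is what makes the output fixed-size.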

To perform contrastive learning, we need two things. First, we need data that constitutes similar/dissimilar pairs. Second, we will need to define how the model defines and optimizes similarity.

10.3 Creating an Embedding Model

Steps:

Prepare the data
Train the model
Evaluate the model
Loss function for the model

10.3.1 Generating Contrastive Examples

To train an embedding model, we can use natural language inference (NLI) datasets. NLI datasets consist of premise/hypothesis sentence pairs with a label indicating whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral).

We can leverage the structure of NLI datasets to generate negative examples (contradictions) and positive examples (entailments) for contrastive learning.

Dataset used: Multi-Genre Natural Language Inference (MNLI) corpus, one of nine tasks of the General Language Understanding Evaluation (GLUE) benchmark.

from datasets import load_dataset

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")
train_dataset[2]

# Output
{'premise': 'One of our number will carry out your instructions minutely.',
 'hypothesis': 'A member of my team will execute your orders with immense precision.',
 'label': 0}

10.3.2 Train Model

In this example:

Model: bert-base-uncased
Train split: MNLI (Multi-Genre Natural Language Inference, a dataset of premise/hypothesis pairs labeled as entailment, contradiction, or neutral)
Val split: STSB (Semantic Textual Similarity Benchmark, a dataset that contains pairs of sentences and a similarity score between 0 and 5)
Test split: MTEB (Massive Text Embedding Benchmark, a benchmark that evaluates embedding models across a range of tasks and datasets)
Baseline accuracy with Softmax is 0.59

10.3.3 Loss Functions

We trained using the softmax loss function, which is suboptimal because it is a classification loss. Two other well-known loss functions:

Cosine similarity
Multiple negatives ranking (MNR) loss

10.3.3.1 Cosine Similarity

An intuitive loss function in semantic textual similarity tasks. It calculates the cosine similarity between the two embeddings of the two texts and compares that to the labelled similarity scores. Cosine similarity loss intuitively works best using data where you have pairs of sentences and labels that indicate their similarity between 0 and 1.
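A minimal sketch of this idea, assuming the comparison between predicted and labelled similarity is a mean squared error. The function names and toy vectors are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between predicted cosine similarity and the label."""
    errors = [
        (cosine_similarity(a, b) - label) ** 2
        for (a, b), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)

# Toy embeddings: an identical pair and an orthogonal pair
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

print(cosine_similarity_loss(pairs, [1.0, 0.0]))  # 0.0 (labels agree with the embeddings)
print(cosine_similarity_loss(pairs, [0.0, 1.0]))  # 1.0 (labels disagree)
```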

To use it with NLI data, convert the labels to values between 0 and 1. Entailment should be 1; neutral and contradiction should be 0.

from datasets import Dataset, load_dataset

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset(
    "glue", "mnli", split="train"
).select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

# (neutral/contradiction)=0 and (entailment)=1
mapping = {2: 0, 1: 0, 0:1}
train_dataset = Dataset.from_dict({
    "sentence1": train_dataset["premise"],
    "sentence2": train_dataset["hypothesis"],
    "label": [float(mapping[label]) for label in train_dataset["label"]]
})

Accuracy with cosine similarity is 0.72

10.3.3.2 Multiple Negatives Ranking (MNR) Loss

Also referred to as InfoNCE or NTXentLoss. It uses either positive pairs or triplets containing a positive pair and a negative example.

For example, you might have pairs of question/answer, image/image caption, paper title/paper abstract, etc. The great thing about these pairs is that we can be confident they are hard positive pairs. In MNR loss, negative pairs are constructed by mixing a positive pair with another positive pair. In the example of a paper title and abstract, you would generate a negative pair by combining the title of a paper with a completely different abstract. These negatives are called in-batch negatives and can also be used to generate the triplets

Multiple negatives ranking loss aims to minimize the distance between related pairs of text, such as questions and answers, and maximize the distance between unrelated pairs, such as questions and unrelated answers.
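The in-batch negatives idea can be sketched as a cross-entropy over similarity scores: for each anchor, its own positive is the correct "class" and every other positive in the batch acts as a negative. This is a simplified sketch in plain Python; the `scale` value of 20 and the toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy where positives[i] is the target for anchors[i]
    and all other positives in the batch serve as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine_similarity(anchor, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(s) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned = mnr_loss(anchors, positives)           # each anchor matches its own positive
mismatched = mnr_loss(anchors, positives[::-1])  # positives swapped between anchors
print(aligned < mismatched)  # True
```

Larger batches provide more in-batch negatives per anchor, which makes the ranking task harder.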

In the MNLI dataset:

The positive is selected from “entailment” pairs
The negative is obtained by randomly shuffling the “hypothesis” sentences

import random
from tqdm import tqdm
from datasets import Dataset, load_dataset

# Load MNLI dataset from GLUE and keep only entailment pairs (label 0)
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data and add a soft negative
train_dataset = {"anchor": [], "positive": [], "negative": []}
# Copy the column into a plain list so it can be shuffled in place
soft_negatives = list(mnli["hypothesis"])
random.shuffle(soft_negatives)
for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)

Accuracy with MNR loss is 0.80

However, the soft negatives in this example were easy. To make the task harder, we should use negatives that are closely related to the question but are not the right answer. Negatives can be grouped by difficulty:

Easy negatives: through randomly sampling documents as the previous example
Semi-hard negatives: using a pretrained embedding model, we can apply cosine similarity on all sentence embeddings to find those that are highly related. Generally, this does not lead to hard negatives since this method merely finds similar sentences, not question/answer pairs.
Hard negatives: These often need to be either manually labeled (for instance, by generating semi-hard negatives) or you can use a generative model to either judge or generate sentence pairs.
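The semi-hard case can be sketched as a nearest-neighbour search that skips the known positive. The embeddings below are made-up toy vectors standing in for the output of a pretrained embedding model, and `most_similar_negative` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_negative(query_index, positive_index, embeddings):
    """Index of the most similar sentence that is neither the query nor its positive."""
    best_index, best_score = None, -2.0
    for i, candidate in enumerate(embeddings):
        if i in (query_index, positive_index):
            continue
        score = cosine_similarity(embeddings[query_index], candidate)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

embeddings = [
    [1.0, 0.0],  # 0: query
    [0.9, 0.1],  # 1: known positive
    [0.8, 0.3],  # 2: related sentence, good semi-hard negative
    [0.0, 1.0],  # 3: unrelated sentence, easy negative
]
print(most_similar_negative(0, 1, embeddings))  # 2
```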

10.4 Fine-Tuning an Embedding Model

sentence-transformers also allows fine-tuning. Two methods:

Supervised fine-tuning
Augmented SBERT

10.4.1 Supervised Fine-Tuning

Supervised fine-tuning is the process of taking a pretrained embedding model and further training it on a specific task or dataset with labeled data. Here we replace bert-base-uncased with the pretrained sentence-transformers model all-MiniLM-L6-v2.

Accuracy of all-MiniLM-L6-v2 with MNR loss is 0.85

10.4.2 Augmented SBERT

Training needs a large amount of high-quality labeled data. When such data is scarce, we can:

Fine-tune a cross-encoder (BERT) using a small, annotated dataset (gold dataset).
Create new sentence pairs.
Label new sentence pairs with the fine-tuned cross-encoder (silver dataset).
Train a bi-encoder (SBERT) on the extended dataset (gold + silver dataset).

Augmented SBERT works through training a cross-encoder on a small gold dataset, then using that to label an unlabeled dataset to generate a larger silver dataset. Finally, both the gold and silver datasets are used to train the bi-encoder.

In this example:

We train the cross-encoder on the first 10,000 labeled examples (the gold dataset).
We then take only the premise and hypothesis of further examples as unlabeled sentence pairs for the cross-encoder to label. Note that randomly formed pairs are likely to contain more dissimilar pairs than similar ones. Instead, we can use a pretrained embedding model to embed all candidate sentences and retrieve the top-k sentences for each input sentence using semantic search.

10.4.3 Unsupervised Fine-Tuning

There are many approaches for unsupervised fine-tuning:

Simple Contrastive Learning of Sentence Embeddings (SimCSE)
Contrastive Tension (CT)
Transformer-based Sequential Denoising Auto-Encoder (TSDAE)
Generative Pseudo-Labeling (GPL)

10.4.3.1 Transformer-based Sequential Denoising Auto-Encoder (TSDAE)

TSDAE assumes we have no labeled data at all and does not require us to artificially create labels. It works by:

Start with a raw unlabelled sentence: “the dog is playing in the park”.
Add noise by randomly deleting a percentage of words from the sentence: “the __ is playing in the __”.
Encode the damaged sentence by passing it through the Transformer encoder; a pooling layer then compresses the output into a single fixed-size embedding vector.
The decoder tries to reconstruct the original sentence from the damaged sentence’s embedding vector, which forces the model to learn a meaningful representation of the sentence.
Compute the loss against the original sentence.

TSDAE randomly removes words from an input sentence that is passed through an encoder to generate a sentence embedding. From this sentence embedding, the original sentence is reconstructed.
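The noising step can be sketched in a few lines. This is an illustrative helper, not the library's implementation; the deletion ratio of 0.6 and the rule of always keeping at least one word are assumptions made for this sketch.

```python
import random

def delete_words(sentence, deletion_ratio=0.6, seed=None):
    """Randomly drop words from a sentence to create the damaged TSDAE input."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [word for word in words if rng.random() >= deletion_ratio]
    if not kept:
        # Keep at least one word so the encoder never receives an empty input
        kept = [rng.choice(words)]
    return " ".join(kept)

original = "the dog is playing in the park"
damaged = delete_words(original, deletion_ratio=0.6, seed=42)

# The (damaged, original) pair is one training example:
# the encoder sees `damaged`, the decoder must reproduce `original`.
print(damaged)
print(original)
```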

10.4.3.2 Using TSDAE for Domain Adaptation

Unsupervised methods are generally outperformed by supervised ones and have difficulty learning domain-specific concepts. Domain adaptation addresses this: the goal is to update an existing embedding model to a specific textual domain that contains different subjects from the source domain.

In domain adaptation, the aim is to create and generalize an embedding model from one domain to another.

One method of domain adaptation is adaptive pretraining. It starts by pretraining on a target-domain corpus using an unsupervised technique such as TSDAE. The resulting model is then fine-tuned on a supervised dataset, which can come from outside the target domain.

Domain adaptation can be performed with adaptive pretraining and adaptive fine-tuning.

11 Fine-Tuning Representation Models

Methods and application for BERT fine-tuning:

Supervised Classification: general process of fine-tuning a classification model
Few-Shot Classification: using SetFit, which is a method for efficiently fine-tuning a high-performing model using a small number of training examples
Continued Pretraining with Masked Language Modeling: continue training a pretrained model
Named-Entity Recognition: classification on a token level

11.1 Supervised Classification

Extending from Chapter 4 where we use an embedding model and a classifier separately, this time we will train a BERT where it has both an encoder and feedforward neural network.

Compared to the “frozen” architecture, we instead train both the pretrained BERT model and the classification head. A backward pass will start at the classification head and go through BERT.

The architecture of a task-specific model. It contains a pretrained representation model (e.g., BERT) with an additional classification head for the specific task.

In the book, training all layers of BERT and the classification head resulted in an F1 of 0.85.

Freezing some layers gives different results. For example, we can freeze all embedding layers and encoder layers while training only the classifier layers.

We fully freeze all encoder blocks and embedding layers such that the BERT model does not learn new representations during fine-tuning.

# Print layer names
for name, param in model.named_parameters():
    print(name)

# Output
bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
...
bert.encoder.layer.11.output.LayerNorm.weight
bert.encoder.layer.11.output.LayerNorm.bias
bert.pooler.dense.weight
bert.pooler.dense.bias
classifier.weight
classifier.bias

for name, param in model.named_parameters():

    # Trainable classification head
    if name.startswith("classifier"):
        param.requires_grad = True

    # Freeze everything else
    else:
        param.requires_grad = False

Resulted in F1 of 0.63.

Moreover, we can experiment with freezing some layers

The effect of freezing certain encoder blocks on the performance of the model. Training more blocks leads to improved performance but stabilizes early on.
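One way to run such an experiment is to decide, per parameter name, whether it should be frozen. The helper below is a hypothetical sketch that relies only on the parameter names printed earlier; freezing the embeddings together with the first encoder blocks is an assumption made here, not a requirement.

```python
import re

def should_freeze(param_name, n_frozen_blocks):
    """Freeze the embeddings and the first `n_frozen_blocks` encoder blocks."""
    if param_name.startswith("bert.embeddings"):
        return True
    match = re.match(r"bert\.encoder\.layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) < n_frozen_blocks
    # Pooler and classification head stay trainable
    return False

names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.11.output.LayerNorm.bias",
    "bert.pooler.dense.weight",
    "classifier.weight",
]
for name in names:
    print(name, "frozen" if should_freeze(name, n_frozen_blocks=10) else "trainable")

# With a loaded model, the same rule could be applied as:
# for name, param in model.named_parameters():
#     param.requires_grad = not should_freeze(name, n_frozen_blocks=10)
```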

11.2 Few-Shot Classification with SetFit

Use few-shot classification when there are few labelled data points through the framework of SetFit (Sentence Transformers for Few-Shot Learning).

In few-shot classification, we only use a few labeled data points to learn from.

SetFit consists of three steps:

Sampling training data: based on in-class and out-class selection of labeled data it generates positive (similar) and negative (dissimilar) pairs of sentences
Fine-tuning embeddings: fine-tuning a pretrained embedding model based on the previously generated training data
Training a classifier: create a classification head on top of the embedding model and train it using the previously generated data

11.2.1 SetFit Step 1

SetFit assumes training data to be samples of positive (similar) and negative (dissimilar) pairs of sentences. However, classification task data is generally not labeled as such.

Assume we have the following texts and classes, then we create pairs of sentences based on the class labels. For instance, we can create a positive pair by taking two sentences from the same class and a negative pair by taking two sentences from different classes.

Data in two classes: text about programming languages and text about pets.

Step 1: Sampling the training data. We assume sentences within a class are similar and create positive pairs while sentences in different classes become negative pairs.
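The pair generation can be sketched as follows. The example sentences are made up, and the function enumerates every possible pair for clarity; in practice SetFit samples pairs rather than listing them all, so treat this as an illustration of the labeling rule only.

```python
from itertools import combinations

def generate_pairs(texts, labels):
    """Label a pair 1.0 if both texts share a class, otherwise 0.0."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs

texts = [
    "Python is a popular programming language",
    "Rust focuses on memory safety",
    "My cat sleeps all day",
    "Dogs love going for walks",
]
labels = ["programming", "programming", "pets", "pets"]

for text_a, text_b, similarity in generate_pairs(texts, labels):
    print(similarity, "|", text_a, "|", text_b)
```

Four labeled sentences already yield six training pairs, which is why this scheme works with very little labeled data.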

11.2.2 SetFit Step 2

Use the generated sentence pairs to fine-tune the embedding model through contrastive learning.

Step 2: Fine-tuning a `SentenceTransformers` model. Using contrastive learning, embeddings are learned from positive and negative sentence pairs.

11.2.3 SetFit Step 3

After fine-tuning, generate embeddings for all sentences using the fine-tuned SentenceTransformers model and use those as the input of a classifier.

Step 3: Training a classifier. The classifier can be any scikit-learn model or a classification head.

11.3 Continued Pretraining with Masked Language Modeling

So far, we have always used a pretrained model to fine-tune on our specific tasks.

To fine-tune the model on a target task—for example, classification—we either start with pretraining a BERT model or use a pretrained one.

The pretrained model is trained on general data, and we need to adapt it to our specific domain. Instead of going straight to fine-tuning, we can squeeze another step in between: continued pretraining of an already pretrained BERT model. We can simply continue training the BERT model using masked language modeling (MLM) but instead use data from our domain. It is like going from a general BERT model to a BioBERT model specialized for the medical domain, to a fine-tuned BioBERT model to classify medication. Continued pretraining has been shown to improve the performance of models in classification tasks.

Instead of a two-step approach, we can add another step that continues to pretrain the pretrained model before fine-tuning it on the target task. Notice how the masks were filled with abstract concepts in 1 while they were filled with movie-specific concepts in 2.

The three-step approach illustrated for specific use cases.

Furthermore, we can customize the masking procedure to use either token masking (faster) or whole-word masking (slower).

Different methods for randomly masking tokens.
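The difference between the two procedures can be sketched with WordPiece-style tokens, where a `##` prefix marks the continuation of a word. This is a simplified illustration: the 15% masking probability is an assumed default, and real MLM collators do more than this (for example, they sometimes substitute a random token instead of `[MASK]`).

```python
import random

def token_masking(tokens, mask_prob=0.15, seed=None):
    """Mask each token independently, so a word can end up partially masked."""
    rng = random.Random(seed)
    return ["[MASK]" if rng.random() < mask_prob else token for token in tokens]

def whole_word_masking(tokens, mask_prob=0.15, seed=None):
    """Mask all pieces of a word together."""
    rng = random.Random(seed)
    # Group tokens into words: "##" continues the previous word
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1].append(token)
        else:
            words.append([token])
    masked = []
    for word in words:
        if rng.random() < mask_prob:
            masked.extend(["[MASK]"] * len(word))
        else:
            masked.extend(word)
    return masked

tokens = ["the", "un", "##believ", "##able", "movie"]
print(token_masking(tokens, mask_prob=0.5, seed=0))
print(whole_word_masking(tokens, mask_prob=0.5, seed=0))
```

Whole-word masking needs the extra grouping pass over the tokens, which is one reason it is the slower of the two.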

11.4 Named-Entity Recognition

Named-entity recognition (NER) is a task that involves identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, etc. NER is a token-level classification task, which means that the model needs to classify each token in the input text as belonging to a specific entity type or not.

Fine-tuning a BERT model for NER allows for the detection of named entities, such as people or locations.

During the fine-tuning process of a BERT model, individual tokens are classified instead of words or entire documents.
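Because labels are usually given per word while the model predicts per token, the labels have to be aligned with the subword tokens. The sketch below shows one common convention, assumed here: label the first token of each word and give special tokens and remaining subword pieces the index -100, which PyTorch's cross-entropy loss ignores by default. The tokenization, label ids, and `word_ids` list are made up; `word_ids` mimics what a Hugging Face fast tokenizer returns from `word_ids()`.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Spread word-level labels over subword tokens."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP]
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First token of a word carries the word's label
            aligned.append(word_labels[word_id])
        else:
            # Remaining pieces of the same word are ignored in the loss
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned

# Hypothetical example: 0 = O, 1 = B-PER, 2 = I-PER, 3 = B-LOC
words = ["Dean", "Palmer", "lives", "in", "Maastricht"]
word_labels = [1, 2, 0, 0, 3]
# Assumed tokenization: [CLS] dean palmer lives in maas ##tric ##ht [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 4, 4, None]

print(align_labels(word_labels, word_ids))
# [-100, 1, 2, 0, 0, 3, -100, -100, -100]
```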

12 Fine-Tuning Generation Models

12.1 LLM Training Steps

The three common steps to create an LLM are:

Language modelling (self-supervised)

During language modeling, the LLM aims to predict the next token based on an input. This is a process without labels.
- Method: The first step is to pretrain it on massive text datasets. The training process attempts to predict the next token to learn linguistic and semantic representations.
- Output: base model / pretrained model / foundation model
Fine-tuning 1 (supervised fine-tuning)
- Method: The second step is to fine-tune the pretrained model on a specific task or dataset with labeled data. This process involves updating the model’s parameters to optimize its performance on the target task, such as text classification, question answering, or named entity recognition.
- Output: task-specific model
Fine-tuning 2 (preference tuning)