Course: Large Language Models
Chapter: Tokens and Embeddings
Total Points: 100 points
Due Date: September 11, 2025
This homework assignment tests your understanding of tokenization and embeddings, which are foundational concepts for Large Language Models. Answer all questions completely and show your work where applicable.
Explain the role of a tokenizer in the LLM pipeline. Why is it necessary to convert text into token IDs before the model can process it? Describe the difference between the input the user provides and the input the model actually receives.
A tokenizer is responsible for breaking raw text into tokens and mapping them to numerical IDs that a large language model can process. This conversion is essential because neural networks cannot operate directly on raw strings; they require numerical input. The user provides natural language sentences, such as words and punctuation, but the model receives only token IDs, which are then transformed into dense vector embeddings. This allows the model to learn patterns and meaning in a mathematically structured way.
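To make this concrete, here is a minimal sketch using the Hugging Face transformers library (which also appears later in this assignment); the exact split and IDs depend on the chosen checkpoint, and gpt2 is only an illustrative choice:

from transformers import AutoTokenizer

# Load a pretrained tokenizer; "gpt2" is just an example checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The user types plain text."
token_ids = tokenizer.encode(text)                    # what the model actually receives
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # human-readable view of the split

print("User input:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
print("Decoded back:", tokenizer.decode(token_ids))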
Compare and contrast the four main tokenization methods discussed in the chapter: word, subword, character, and byte tokens. What are the primary advantages and disadvantages of each, especially concerning vocabulary size and handling of unknown words?
Word tokens treat each unique word as a unit. Their advantage is human interpretability, but the vocabulary becomes very large and unknown words (out-of-vocabulary items) cannot be represented.
Subword tokens (like Byte-Pair Encoding) break words into smaller, frequent units. They balance vocabulary size and coverage well, handling rare words by combining smaller known pieces.
Character tokens operate at the level of individual characters. They ensure full coverage of any text, but sequences are very long, making training less efficient.
Byte tokens represent raw byte values (0–255). This guarantees coverage for any text, including multilingual and special symbols, and avoids out-of-vocabulary issues, though sequences are longer than with subword approaches.
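As a small, self-contained illustration of the granularity differences (a sketch in plain Python; subword splitting is omitted here because it requires a trained merge table such as BPE):

text = "Tokenization!"

word_tokens = text.split()                 # word level: one unit per whitespace-separated word
char_tokens = list(text)                   # character level: full coverage, but long sequences
byte_tokens = list(text.encode("utf-8"))   # byte level: values 0-255, covers any string

print("Word tokens:", word_tokens)
print("Character tokens:", char_tokens)
print("Byte tokens:", byte_tokens)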
What is the difference between a static token embedding (like those from word2vec) and a contextualized word embedding produced by a modern LLM like DeBERTa? Use an example to illustrate why context is important for word representation.
Static embeddings, such as those from word2vec, assign each word a single vector regardless of context. This means the word “bank” has the same representation whether it refers to a riverbank or a financial institution.
Contextual embeddings, as produced by modern models like DeBERTa, generate different vectors for the same word depending on its surrounding text. For example, in “I deposited cash at the bank,” the vector for “bank” reflects a financial meaning, whereas in “He sat on the river bank,” the vector reflects a geographical meaning. Context ensures that the model captures word sense accurately and avoids ambiguity.
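This difference can be observed directly. Below is a minimal sketch that extracts the contextual vector of "bank" from both sentences and compares them; it assumes bert-base-uncased (used elsewhere in this assignment) rather than DeBERTa, but any contextual encoder illustrates the same point, whereas a static embedding would give identical vectors (cosine similarity 1.0) by construction:

import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is an assumption here; any contextual encoder would do
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def bank_vector(sentence):
    # Return the contextual hidden state of the token "bank" in the sentence
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # shape: (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_financial = bank_vector("I deposited cash at the bank.")
v_river = bank_vector("He sat on the river bank.")
print("Cosine similarity:", torch.cosine_similarity(v_financial, v_river, dim=0).item())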
Describe the core principle behind the word2vec algorithm. Explain the roles of “positive examples” (neighboring words) and “negative examples” (non-neighboring words) in the contrastive training process.
The core idea of word2vec is to learn word embeddings by predicting which words are likely to appear near each other in text. The algorithm is trained with a contrastive objective:
Positive examples are neighboring words within a given context window, which reinforce associations between words that co-occur.
Negative examples are randomly sampled words from outside the context window, which teach the model to push apart unrelated words.
Through this contrastive training, word2vec embeddings capture semantic similarity, so that words with similar contexts (e.g., “king” and “queen”) end up close in vector space.
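A minimal sketch of how such training pairs can be constructed (labels only; the real algorithm samples negatives from a frequency-smoothed distribution and trains a small classifier to separate the two sets, which is omitted here):

import random

corpus = ["the", "king", "sat", "on", "the", "throne"]
window = 2
vocab = sorted(set(corpus))

# Positive examples: (center word, neighbor) pairs inside the sliding window
positive_pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            positive_pairs.append((center, corpus[j]))

# Negative examples: (center word, randomly sampled word) pairs labeled as non-neighbors
negative_pairs = [(center, random.choice(vocab)) for center, _ in positive_pairs]

print("Positive (label 1):", positive_pairs[:5])
print("Negative (label 0):", negative_pairs[:5])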
What is the typical output of an LLM tokenizer that is fed to the language model?
Which tokenization method is most commonly used in modern LLMs like GPT-4 and StarCoder2?
What is a primary advantage of subword tokenization over word tokenization?
What is the purpose of a text embedding model like sentence-transformers/all-mpnet-base-v2?
In the word2vec algorithm, what is the purpose of the “sliding window”?
When using embeddings for a recommendation system (e.g., for music), what do the “words” and “sentences” correspond to?
What does the shape torch.Size([1, 4, 384]) represent for the output of a contextualized embedding model?
Why are negative examples crucial for training word2vec?
Which tokenizer discussed in the chapter is specifically optimized for code and represents individual digits as separate tokens?
What is the typical dimensionality of a text embedding from the all-mpnet-base-v2 model?
Tokenizer Comparison
Complete the following Python function to tokenize a given text using two different tokenizers (bert-base-uncased and gpt2) and compare their outputs.
from transformers import AutoTokenizer

def compare_tokenizers(text):
    """
    Tokenizes a text with two different tokenizers and prints the results.

    Args:
        text: The string to tokenize.
    """
    tokenizer_names = ["bert-base-uncased", "gpt2"]
    for name in tokenizer_names:
        print(f"--- Tokenizer: {name} ---")

        # TODO: Load the tokenizer
        tokenizer = # Your code here

        # TODO: Tokenize the text and get the tokens
        tokens = # Your code here

        # TODO: Convert tokens to IDs
        token_ids = # Your code here

        print(f"Number of tokens: {len(tokens)}")
        print(f"Tokens: {tokens}")
        print(f"Token IDs (first 10): {token_ids[:10]}")
        print("\n")

# Test your implementation
text_to_tokenize = "Tokenization is a foundational concept in NLP."
compare_tokenizers(text_to_tokenize)
#revised code
from transformers import AutoTokenizer

def compare_tokenizers(text):
    """
    Tokenizes a text with two different tokenizers and prints the results.

    Args:
        text: The string to tokenize.
    """
    tokenizer_names = ["bert-base-uncased", "gpt2"]
    for name in tokenizer_names:
        print(f"--- Tokenizer: {name} ---")

        # Load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(name)

        # Tokenize the text
        tokens = tokenizer.tokenize(text)

        # Convert tokens to IDs
        token_ids = tokenizer.convert_tokens_to_ids(tokens)

        print(f"Number of tokens: {len(tokens)}")
        print(f"Tokens: {tokens}")
        print(f"Token IDs (first 10): {token_ids[:10]}")
        print("\n")

# Test
text_to_tokenize = "Tokenization is a foundational concept in NLP."
compare_tokenizers(text_to_tokenize)
Here is the screenshot of the tokenizer results:
Using Pretrained Word2Vec Embeddings
Complete the following Python script to load a pretrained word2vec model from gensim and perform similarity operations.
import gensim.downloader as api

def explore_word_embeddings():
    """
    Loads a pretrained word2vec model and explores word similarities.
    """
    # TODO: Load the "glove-wiki-gigaword-50" model
    model = # Your code here

    # TODO: Find the 5 most similar words to "woman"
    similar_to_woman = # Your code here
    print("Most similar to 'woman':", similar_to_woman)

    # TODO: Find the 5 most similar words to "car"
    similar_to_car = # Your code here
    print("Most similar to 'car':", similar_to_car)

    # TODO: Solve the analogy: king - man + woman = ?
    # Find the top 1 result for this analogy.
    analogy_result = # Your code here
    print("Analogy 'king - man + woman':", analogy_result)

# Run the exploration
explore_word_embeddings()
#revised code
!pip install gensim

import gensim.downloader as api

def explore_word_embeddings():
    """
    Loads a pretrained word2vec model and explores word similarities.
    """
    # Load the "glove-wiki-gigaword-50" model
    model = api.load("glove-wiki-gigaword-50")

    # Find the most similar words to "woman"
    print("Most similar to 'woman':")
    for word, score in model.most_similar("woman", topn=5):
        print(f" - {word} ({score:.4f})")

    # Find the most similar words to "car"
    print("\nMost similar to 'car':")
    for word, score in model.most_similar("car", topn=5):
        print(f" - {word} ({score:.4f})")

    # Solve the analogy: king - man + woman = ?
    print("\nAnalogy 'king - man + woman':")
    for word, score in model.most_similar(positive=["king", "woman"], negative=["man"], topn=1):
        print(f" - {word} ({score:.4f})")

# Run
explore_word_embeddings()
Here is the screenshot of the Word2Vec Embedding results:
Create your own tokenizer
from collections import Counter
import re

def create_simple_tokenizer(texts, vocab_size=1000):
    """
    Create a simple BPE-style tokenizer from scratch.

    Args:
        texts: List of strings to train the tokenizer on
        vocab_size: Maximum vocabulary size

    Returns:
        A dictionary containing the tokenizer vocabulary and encode/decode functions
    """
    # TODO: Implement a basic character-level tokenizer that can:
    # 1. Split text into characters initially
    # 2. Count character frequencies
    # 3. Build a vocabulary of the most common characters/subwords
    # 4. Provide encode() and decode() methods

    def preprocess_text(text):
        # TODO: Clean and normalize the input text
        # Hint: Convert to lowercase, handle punctuation
        pass

    def build_vocab(processed_texts):
        # TODO: Build vocabulary from processed texts
        # Start with character-level tokens, then optionally merge frequent pairs
        pass

    def encode(text):
        # TODO: Convert text to token IDs using your vocabulary
        pass

    def decode(token_ids):
        # TODO: Convert token IDs back to text
        pass

    # Your implementation here

    return {
        'vocab': vocab,
        'encode': encode,
        'decode': decode
    }

# Test your tokenizer
sample_texts = [
    "Hello world! This is a test.",
    "Natural language processing is fascinating.",
    "Tokenization helps models understand text."
]

tokenizer = create_simple_tokenizer(sample_texts, vocab_size=50)
test_text = "Hello! This is new text."
encoded = tokenizer['encode'](test_text)
decoded = tokenizer['decode'](encoded)

print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Vocabulary size: {len(tokenizer['vocab'])}")
#revised code
from collections import Counter
import re

def create_simple_tokenizer(texts, vocab_size=1000):
    """
    Minimal BPE-style tokenizer with a robust base alphabet:
    - collects ALL characters from the training texts
    - keeps both bare chars (e.g., 'w') and word-start forms (e.g., '▁w')
    - trains merges up to vocab_size
    - provides encode/decode
    """
    def preprocess_text(s: str) -> str:
        s = s.lower()
        s = re.sub(r"\s+", " ", s).strip()
        return s

    def words_to_initial_tokens(s: str):
        words = s.split(" ")
        out = []
        for w in words:
            if not w:
                continue
            out.append(["▁" + w[0]] + list(w[1:]))
        return out  # list[list[str]]

    def get_pair_counts(word_tokens_list):
        pair_counts = Counter()
        for toks in word_tokens_list:
            for a, b in zip(toks, toks[1:]):
                pair_counts[(a, b)] += 1
        return pair_counts

    def merge_pair_in_words(word_tokens_list, pair):
        a, b = pair
        ab = a + b
        merged = []
        for toks in word_tokens_list:
            i, new = 0, []
            while i < len(toks):
                if i < len(toks) - 1 and toks[i] == a and toks[i + 1] == b:
                    new.append(ab); i += 2
                else:
                    new.append(toks[i]); i += 1
            merged.append(new)
        return merged

    # ---------- train ----------
    processed = [preprocess_text(t) for t in texts]

    # Collect ALL characters seen anywhere and drop spaces
    all_chars = set()
    for s in processed:
        all_chars.update(list(s.replace(" ", "")))

    # Initial tokenization
    words_tokens = []
    for s in processed:
        words_tokens.extend(words_to_initial_tokens(s))

    # Base symbols = BOTH bare chars and their word-start forms
    base_symbols = set()
    for ch in all_chars:
        base_symbols.add(ch)
        base_symbols.add("▁" + ch)

    # The vocabulary starts from the base symbols; merged tokens are added below
    vocab = set(base_symbols)
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = get_pair_counts(words_tokens)
        if not pair_counts:
            break
        (best_a, best_b), freq = pair_counts.most_common(1)[0]
        if freq < 2:
            break
        words_tokens = merge_pair_in_words(words_tokens, (best_a, best_b))
        new_tok = best_a + best_b
        if new_tok not in vocab:
            vocab.add(new_tok)
            merges.append((best_a, best_b, new_tok))

    # Merge ranks (for encoding) and token <-> ID mappings
    merge_ranks = {(a, b): i for i, (a, b, _) in enumerate(merges)}
    id2tok = sorted(vocab)
    tok2id = {t: i for i, t in enumerate(id2tok)}
    unk = "<unk>"
    if unk not in tok2id:
        tok2id[unk] = len(tok2id); id2tok.append(unk)

    def bpe_encode_word(w: str):
        if not w:
            return []
        tokens = ["▁" + w[0]] + list(w[1:])
        if len(tokens) <= 1:
            return tokens
        while True:
            pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
            ranked = [(merge_ranks[p], p) for p in pairs if p in merge_ranks]
            if not ranked:
                break
            _, best = min(ranked, key=lambda x: x[0])
            a, b = best
            i, new = 0, []
            while i < len(tokens):
                if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                    new.append(a + b); i += 2
                else:
                    new.append(tokens[i]); i += 1
            tokens = new
            if len(tokens) == 1:
                break
        return tokens

    def encode(text: str):
        s = preprocess_text(text)
        ids = []
        for w in s.split(" "):
            if not w:
                continue
            for t in bpe_encode_word(w):
                ids.append(tok2id.get(t, tok2id[unk]))
        return ids

    def decode(token_ids):
        toks = [id2tok[i] if 0 <= i < len(id2tok) else unk for i in token_ids]
        return "".join(toks).replace("▁", " ").strip()

    return {"vocab": tok2id, "encode": encode, "decode": decode, "merges": merges}

# Sample Test
sample_texts = [
    "Hello world! This is a test.",
    "Natural language processing is fascinating.",
    "Tokenization helps models understand text."
]

tokenizer = create_simple_tokenizer(sample_texts, vocab_size=50)
test_text = "Hello! This is new text."
encoded = tokenizer["encode"](test_text)
decoded = tokenizer["decode"](encoded)

print(f"Original: {test_text}")
print(f"Encoded IDs: {encoded}")
print(f"Decoded: {decoded}")
print(f"Vocabulary size: {len(tokenizer['vocab'])}")
Here is the screenshot of my Tokenizer results:
Extend the vocabulary of an existing tokenizer
from transformers import AutoTokenizer
import torch

def extend_tokenizer_vocabulary(base_tokenizer_name, new_tokens):
    """
    Extend an existing tokenizer's vocabulary with new tokens.

    Args:
        base_tokenizer_name: Name of the base tokenizer (e.g., "bert-base-uncased")
        new_tokens: List of new tokens to add to the vocabulary

    Returns:
        Extended tokenizer and demonstration of the new tokens
    """
    # TODO: Load the base tokenizer
    tokenizer = # Your code here
    print(f"Original vocabulary size: {len(tokenizer)}")

    # TODO: Add new tokens to the tokenizer
    # Hint: Use tokenizer.add_tokens() method
    num_added = # Your code here
    print(f"Added {num_added} new tokens")
    print(f"New vocabulary size: {len(tokenizer)}")

    # TODO: Test the extended tokenizer with text containing new tokens
    test_text = "The AI model uses <SPECIAL_TOKEN> for classification."

    # Tokenize before and after adding special tokens
    tokens_before = # Your code here (you'll need to reload original tokenizer)
    tokens_after = # Your code here

    print(f"\nTest text: {test_text}")
    print(f"Tokens with original tokenizer: {tokens_before}")
    print(f"Tokens with extended tokenizer: {tokens_after}")

    # TODO: Show token IDs for the new tokens
    for token in new_tokens:
        if token in tokenizer.vocab:
            token_id = # Your code here
            print(f"Token '{token}' has ID: {token_id}")

    return tokenizer

# Test the function
new_special_tokens = ["<SPECIAL_TOKEN>", "<DOMAIN_TERM>", "<CUSTOM_ENTITY>"]
extended_tokenizer = extend_tokenizer_vocabulary("bert-base-uncased", new_special_tokens)

# Additional test: Show how this affects model input
sample_text = "Process this <SPECIAL_TOKEN> carefully with <DOMAIN_TERM>."
input_ids = extended_tokenizer.encode(sample_text, return_tensors="pt")
print(f"\nInput IDs shape: {input_ids.shape}")
print(f"Input IDs: {input_ids}")
#revised code
from transformers import AutoTokenizer, AutoModel

# === Config ===
BASE = "bert-base-uncased"
NEW_TOKENS = ["<SPECIAL_TOKEN>", "<DOMAIN_TERM>", "<CUSTOM_ENTITY>"]

# 1) Load the original tokenizer for the before/after comparison
orig_tok = AutoTokenizer.from_pretrained(BASE)

# 2) Extend a second copy of the tokenizer with the *special* tokens
tok = AutoTokenizer.from_pretrained(BASE)
added = tok.add_special_tokens({"additional_special_tokens": NEW_TOKENS})

print(f"Original vocabulary size: {len(orig_tok)}")
print(f"Added {added} new tokens")
print(f"New vocabulary size: {len(tok)}")

# 3) To use a model with these tokens, expand its embedding matrix
model = AutoModel.from_pretrained(BASE)
model.resize_token_embeddings(len(tok))

# 4) Compare tokenization before vs. after
test_text = "The AI model uses <SPECIAL_TOKEN> for classification."
before = orig_tok.tokenize(test_text)
after = tok.tokenize(test_text)

# Two-column comparison table
w1 = max((len(x) for x in before), default=0)
w2 = max((len(x) for x in after), default=0)
print("\nTokenization comparison:")
print(f"{'Original tokenizer'.ljust(w1)} | {'Extended tokenizer'.ljust(w2)}")
print("-" * (w1 + w2 + 5))
for i in range(max(len(before), len(after))):
    c1 = before[i] if i < len(before) else ""
    c2 = after[i] if i < len(after) else ""
    print(f"{c1.ljust(w1)} | {c2}")

# 5) Show the IDs assigned to the new tokens
print("\nAssigned IDs for new tokens:")
for t in NEW_TOKENS:
    print(f"{t} -> {tok.convert_tokens_to_ids(t)}")

# 6) Encode a sample sentence with the extended tokenizer
sample_text = "Process this <SPECIAL_TOKEN> carefully with <DOMAIN_TERM>."
ids = tok.encode(sample_text, return_tensors="pt")
print(f"\nEncoded sample text: {sample_text}")
print(f"Input IDs shape: {ids.shape}")
print(f"Input IDs: {ids}")
Here is the screenshot of the extended-vocabulary tokenizer results:
chapter2_homework_[your_name].md