Course: Large Language Models
Chapter: Tokens and Embeddings
Total Points: 100 points
Due Date: September 11, 2025
This homework assignment tests your understanding of tokenization and embeddings, which are foundational concepts for Large Language Models. Answer all questions completely and show your work where applicable.
Explain the role of a tokenizer in the LLM pipeline. Why is it necessary to convert text into token IDs before the model can process it? Describe the difference between the input the user provides and the input the model actually receives.
A tokenizer is responsible for breaking raw text into tokens and mapping them to numerical IDs that a large language model can process. This conversion is essential because neural networks cannot operate directly on raw strings; they require numerical input. The user provides natural language sentences, such as words and punctuation, but the model receives only token IDs, which are then transformed into dense vector embeddings. This allows the model to learn patterns and meaning in a mathematically structured way.
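To make this concrete, here is a minimal sketch using the Hugging Face transformers library (which also appears later in this assignment); the exact split and IDs depend on the chosen checkpoint, and gpt2 is only an illustrative choice:

from transformers import AutoTokenizer

# Load a pretrained tokenizer; "gpt2" is just an example checkpoint
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The user types plain text."
token_ids = tokenizer.encode(text)                    # what the model actually receives
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # human-readable view of the split

print("User input:", text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
print("Decoded back:", tokenizer.decode(token_ids))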
Compare and contrast the four main tokenization methods discussed in the chapter: word, subword, character, and byte tokens. What are the primary advantages and disadvantages of each, especially concerning vocabulary size and handling of unknown words?
Word tokens treat each unique word as a unit. Their advantage is human interpretability, but the vocabulary becomes very large and unknown words (out-of-vocabulary items) cannot be represented.
Subword tokens (like Byte-Pair Encoding) break words into smaller, frequent units. They balance vocabulary size and coverage well, handling rare words by combining smaller known pieces.
Character tokens operate at the level of individual characters. They ensure full coverage of any text, but sequences are very long, making training less efficient.
Byte tokens represent raw byte values (0–255). This guarantees coverage for any text, including multilingual and special symbols, and avoids out-of-vocabulary issues, though sequences are longer than with subword approaches.
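As a small, self-contained illustration of the granularity differences (a sketch in plain Python; subword splitting is omitted here because it requires a trained merge table such as BPE):

text = "Tokenization!"

word_tokens = text.split()                 # word level: one unit per whitespace-separated word
char_tokens = list(text)                   # character level: full coverage, but long sequences
byte_tokens = list(text.encode("utf-8"))   # byte level: values 0-255, covers any string

print("Word tokens:", word_tokens)
print("Character tokens:", char_tokens)
print("Byte tokens:", byte_tokens)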
What is the difference between a static token embedding (like those from word2vec) and a contextualized word embedding produced by a modern LLM like DeBERTa? Use an example to illustrate why context is important for word representation.
Static embeddings, such as those from word2vec, assign each word a single vector regardless of context. This means the word “bank” has the same representation whether it refers to a riverbank or a financial institution.
Contextual embeddings, as produced by modern models like DeBERTa, generate different vectors for the same word depending on its surrounding text. For example, in “I deposited cash at the bank,” the vector for “bank” reflects a financial meaning, whereas in “He sat on the river bank,” the vector reflects a geographical meaning. Context ensures that the model captures word sense accurately and avoids ambiguity.
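This difference can be observed directly. Below is a minimal sketch that extracts the contextual vector of "bank" from both sentences and compares them; it assumes bert-base-uncased (used elsewhere in this assignment) rather than DeBERTa, but any contextual encoder illustrates the same point, whereas a static embedding would give identical vectors (cosine similarity 1.0) by construction:

import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is an assumption here; any contextual encoder would do
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def bank_vector(sentence):
    # Return the contextual hidden state of the token "bank" in the sentence
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # shape: (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_financial = bank_vector("I deposited cash at the bank.")
v_river = bank_vector("He sat on the river bank.")
print("Cosine similarity:", torch.cosine_similarity(v_financial, v_river, dim=0).item())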
Describe the core principle behind the word2vec algorithm. Explain the roles of “positive examples” (neighboring words) and “negative examples” (non-neighboring words) in the contrastive training process.
The core idea of word2vec is to learn word embeddings by predicting which words are likely to appear near each other in text. The algorithm is trained with a contrastive objective:
Positive examples are neighboring words within a given context window, which reinforce associations between words that co-occur.
Negative examples are randomly sampled words from outside the context window, which teach the model to push apart unrelated words.
Through this contrastive training, word2vec embeddings capture semantic similarity, so that words with similar contexts (e.g., “king” and “queen”) end up close in vector space.
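A minimal sketch of how such training pairs can be constructed (labels only; the real algorithm samples negatives from a frequency-smoothed distribution and trains a small classifier to separate the two sets, which is omitted here):

import random

corpus = ["the", "king", "sat", "on", "the", "throne"]
window = 2
vocab = sorted(set(corpus))

# Positive examples: (center word, neighbor) pairs inside the sliding window
positive_pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            positive_pairs.append((center, corpus[j]))

# Negative examples: (center word, randomly sampled word) pairs labeled as non-neighbors
negative_pairs = [(center, random.choice(vocab)) for center, _ in positive_pairs]

print("Positive (label 1):", positive_pairs[:5])
print("Negative (label 0):", negative_pairs[:5])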
What is the typical output of an LLM tokenizer that is fed to the language model?
Which tokenization method is most commonly used in modern LLMs like GPT-4 and StarCoder2?
What is a primary advantage of subword tokenization over word tokenization?
What is the purpose of a text embedding model like sentence-transformers/all-mpnet-base-v2?
In the word2vec algorithm, what is the purpose of the “sliding window”?
When using embeddings for a recommendation system (e.g., for music), what do the “words” and “sentences” correspond to?
What does the shape torch.Size([1, 4, 384]) represent for the output of a contextualized embedding model?
Why are negative examples crucial for training word2vec?
Which tokenizer discussed in the chapter is specifically optimized for code and represents individual digits as separate tokens?
What is the typical dimensionality of a text embedding from the all-mpnet-base-v2 model?
Tokenizer Comparison
Complete the following Python function to tokenize a given text using two different tokenizers (bert-base-uncased and gpt2) and compare their outputs.
from transformers import AutoTokenizer

def compare_tokenizers(text):
    """
    Tokenizes a text with two different tokenizers and prints the results.

    Args:
        text: The string to tokenize.
    """
    tokenizer_names = ["bert-base-uncased", "gpt2"]
    for name in tokenizer_names:
        print(f"--- Tokenizer: {name} ---")

        # TODO: Load the tokenizer
        tokenizer = # Your code here

        # TODO: Tokenize the text and get the tokens
        tokens = # Your code here

        # TODO: Convert tokens to IDs
        token_ids = # Your code here

        print(f"Number of tokens: {len(tokens)}")
        print(f"Tokens: {tokens}")
        print(f"Token IDs (first 10): {token_ids[:10]}")
        print("\n")

# Test your implementation
text_to_tokenize = "Tokenization is a foundational concept in NLP."
compare_tokenizers(text_to_tokenize)
#revised code
from transformers import AutoTokenizer

def compare_tokenizers(text):
    """
    Tokenizes a text with two different tokenizers and prints the results.

    Args:
        text: The string to tokenize.
    """
    tokenizer_names = ["bert-base-uncased", "gpt2"]
    for name in tokenizer_names:
        print(f"--- Tokenizer: {name} ---")

        # Load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(name)

        # Tokenize the text
        tokens = tokenizer.tokenize(text)

        # Convert tokens to IDs
        token_ids = tokenizer.convert_tokens_to_ids(tokens)

        print(f"Number of tokens: {len(tokens)}")
        print(f"Tokens: {tokens}")
        print(f"Token IDs (first 10): {token_ids[:10]}")
        print("\n")

# Test
text_to_tokenize = "Tokenization is a foundational concept in NLP."
compare_tokenizers(text_to_tokenize)
Here is the screenshot of the tokenizer results:
Using Pretrained Word2Vec Embeddings
Complete the following Python script to load a pretrained word2vec model from gensim and perform similarity operations.
import gensim.downloader as api

def explore_word_embeddings():
    """
    Loads a pretrained word2vec model and explores word similarities.
    """
    # TODO: Load the "glove-wiki-gigaword-50" model
    model = # Your code here

    # TODO: Find the 5 most similar words to "woman"
    similar_to_woman = # Your code here
    print("Most similar to 'woman':", similar_to_woman)

    # TODO: Find the 5 most similar words to "car"
    similar_to_car = # Your code here
    print("Most similar to 'car':", similar_to_car)

    # TODO: Solve the analogy: king - man + woman = ?
    # Find the top 1 result for this analogy.
    analogy_result = # Your code here
    print("Analogy 'king - man + woman':", analogy_result)

# Run the exploration
explore_word_embeddings()
#revised code
!pip install gensim

import gensim.downloader as api

def explore_word_embeddings():
    """
    Loads a pretrained word2vec model and explores word similarities.
    """
    # Load the "glove-wiki-gigaword-50" model
    model = api.load("glove-wiki-gigaword-50")

    # Find the most similar words to "woman"
    print("Most similar to 'woman':")
    for word, score in model.most_similar("woman", topn=5):
        print(f" - {word} ({score:.4f})")

    # Find the most similar words to "car"
    print("\nMost similar to 'car':")
    for word, score in model.most_similar("car", topn=5):
        print(f" - {word} ({score:.4f})")

    # Solve the analogy: king - man + woman = ?
    print("\nAnalogy 'king - man + woman':")
    for word, score in model.most_similar(positive=["king", "woman"], negative=["man"], topn=1):
        print(f" - {word} ({score:.4f})")

# Run
explore_word_embeddings()
Here is the screenshot of the Word2Vec Embedding results:
Create your own tokenizer
from collections import Counter
import re

def create_simple_tokenizer(texts, vocab_size=1000):
    """
    Create a simple BPE-style tokenizer from scratch.

    Args:
        texts: List of strings to train the tokenizer on
        vocab_size: Maximum vocabulary size

    Returns:
        A dictionary containing the tokenizer vocabulary and encode/decode functions
    """
    # TODO: Implement a basic character-level tokenizer that can:
    # 1. Split text into characters initially
    # 2. Count character frequencies
    # 3. Build a vocabulary of the most common characters/subwords
    # 4. Provide encode() and decode() methods

    def preprocess_text(text):
        # TODO: Clean and normalize the input text
        # Hint: Convert to lowercase, handle punctuation
        pass

    def build_vocab(processed_texts):
        # TODO: Build vocabulary from processed texts
        # Start with character-level tokens, then optionally merge frequent pairs
        pass

    def encode(text):
        # TODO: Convert text to token IDs using your vocabulary
        pass

    def decode(token_ids):
        # TODO: Convert token IDs back to text
        pass

    # Your implementation here

    return {
        'vocab': vocab,
        'encode': encode,
        'decode': decode
    }

# Test your tokenizer
sample_texts = [
    "Hello world! This is a test.",
    "Natural language processing is fascinating.",
    "Tokenization helps models understand text."
]

tokenizer = create_simple_tokenizer(sample_texts, vocab_size=50)
test_text = "Hello! This is new text."
encoded = tokenizer['encode'](test_text)
decoded = tokenizer['decode'](encoded)

print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Vocabulary size: {len(tokenizer['vocab'])}")
#revised code
from collections import Counter
import re

def create_simple_tokenizer(texts, vocab_size=1000):
    """
    Minimal BPE-style tokenizer with a robust base alphabet:
    - collects ALL characters from the training texts
    - keeps both bare chars (e.g., 'w') and word-start forms (e.g., '▁w')
    - trains merges up to vocab_size
    - provides encode/decode
    """
    def preprocess_text(s: str) -> str:
        s = s.lower()
        s = re.sub(r"\s+", " ", s).strip()
        return s

    def words_to_initial_tokens(s: str):
        words = s.split(" ")
        out = []
        for w in words:
            if not w:
                continue
            out.append(["▁" + w[0]] + list(w[1:]))
        return out  # list[list[str]]

    def get_pair_counts(word_tokens_list):
        pair_counts = Counter()
        for toks in word_tokens_list:
            for a, b in zip(toks, toks[1:]):
                pair_counts[(a, b)] += 1
        return pair_counts

    def merge_pair_in_words(word_tokens_list, pair):
        a, b = pair
        ab = a + b
        merged = []
        for toks in word_tokens_list:
            i, new = 0, []
            while i < len(toks):
                if i < len(toks) - 1 and toks[i] == a and toks[i + 1] == b:
                    new.append(ab); i += 2
                else:
                    new.append(toks[i]); i += 1
            merged.append(new)
        return merged

    # ---------- train ----------
    processed = [preprocess_text(t) for t in texts]

    # Collect ALL characters seen anywhere and drop spaces
    all_chars = set()
    for s in processed:
        all_chars.update(list(s.replace(" ", "")))

    # Initial tokenization
    words_tokens = []
    for s in processed:
        words_tokens.extend(words_to_initial_tokens(s))

    # Base symbols = BOTH bare chars and their word-start forms
    base_symbols = set()
    for ch in all_chars:
        base_symbols.add(ch)
        base_symbols.add("▁" + ch)

    # The vocabulary starts from the base symbols; merged tokens are added below
    vocab = set(base_symbols)
    merges = []
    while len(vocab) < vocab_size:
        pair_counts = get_pair_counts(words_tokens)
        if not pair_counts:
            break
        (best_a, best_b), freq = pair_counts.most_common(1)[0]
        if freq < 2:
            break
        words_tokens = merge_pair_in_words(words_tokens, (best_a, best_b))
        new_tok = best_a + best_b
        if new_tok not in vocab:
            vocab.add(new_tok)
            merges.append((best_a, best_b, new_tok))

    # Merge ranks (for encoding) and token <-> ID mappings
    merge_ranks = {(a, b): i for i, (a, b, _) in enumerate(merges)}
    id2tok = sorted(vocab)
    tok2id = {t: i for i, t in enumerate(id2tok)}
    unk = "<unk>"
    if unk not in tok2id:
        tok2id[unk] = len(tok2id); id2tok.append(unk)

    def bpe_encode_word(w: str):
        if not w:
            return []
        tokens = ["▁" + w[0]] + list(w[1:])
        if len(tokens) <= 1:
            return tokens
        while True:
            pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
            ranked = [(merge_ranks[p], p) for p in pairs if p in merge_ranks]
            if not ranked:
                break
            _, best = min(ranked, key=lambda x: x[0])
            a, b = best
            i, new = 0, []
            while i < len(tokens):
                if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                    new.append(a + b); i += 2
                else:
                    new.append(tokens[i]); i += 1
            tokens = new
            if len(tokens) == 1:
                break
        return tokens

    def encode(text: str):
        s = preprocess_text(text)
        ids = []
        for w in s.split(" "):
            if not w:
                continue
            for t in bpe_encode_word(w):
                ids.append(tok2id.get(t, tok2id[unk]))
        return ids

    def decode(token_ids):
        toks = [id2tok[i] if 0 <= i < len(id2tok) else unk for i in token_ids]
        return "".join(toks).replace("▁", " ").strip()

    return {"vocab": tok2id, "encode": encode, "decode": decode, "merges": merges}

# Sample Test
sample_texts = [
    "Hello world! This is a test.",
    "Natural language processing is fascinating.",
    "Tokenization helps models understand text."
]

tokenizer = create_simple_tokenizer(sample_texts, vocab_size=50)
test_text = "Hello! This is new text."
encoded = tokenizer["encode"](test_text)
decoded = tokenizer["decode"](encoded)

print(f"Original: {test_text}")
print(f"Encoded IDs: {encoded}")
print(f"Decoded: {decoded}")
print(f"Vocabulary size: {len(tokenizer['vocab'])}")
Here is the screenshot of my Tokenizer results:
Extend the vocabulary of an existing tokenizer
from transformers import AutoTokenizer
import torch

def extend_tokenizer_vocabulary(base_tokenizer_name, new_tokens):
    """
    Extend an existing tokenizer's vocabulary with new tokens.

    Args:
        base_tokenizer_name: Name of the base tokenizer (e.g., "bert-base-uncased")
        new_tokens: List of new tokens to add to the vocabulary

    Returns:
        Extended tokenizer and demonstration of the new tokens
    """
    # TODO: Load the base tokenizer
    tokenizer = # Your code here
    print(f"Original vocabulary size: {len(tokenizer)}")

    # TODO: Add new tokens to the tokenizer
    # Hint: Use tokenizer.add_tokens() method
    num_added = # Your code here
    print(f"Added {num_added} new tokens")
    print(f"New vocabulary size: {len(tokenizer)}")

    # TODO: Test the extended tokenizer with text containing new tokens
    test_text = "The AI model uses <SPECIAL_TOKEN> for classification."

    # Tokenize before and after adding special tokens
    tokens_before = # Your code here (you'll need to reload original tokenizer)
    tokens_after = # Your code here

    print(f"\nTest text: {test_text}")
    print(f"Tokens with original tokenizer: {tokens_before}")
    print(f"Tokens with extended tokenizer: {tokens_after}")

    # TODO: Show token IDs for the new tokens
    for token in new_tokens:
        if token in tokenizer.vocab:
            token_id = # Your code here
            print(f"Token '{token}' has ID: {token_id}")

    return tokenizer

# Test the function
new_special_tokens = ["<SPECIAL_TOKEN>", "<DOMAIN_TERM>", "<CUSTOM_ENTITY>"]
extended_tokenizer = extend_tokenizer_vocabulary("bert-base-uncased", new_special_tokens)

# Additional test: Show how this affects model input
sample_text = "Process this <SPECIAL_TOKEN> carefully with <DOMAIN_TERM>."
input_ids = extended_tokenizer.encode(sample_text, return_tensors="pt")
print(f"\nInput IDs shape: {input_ids.shape}")
print(f"Input IDs: {input_ids}")
#revised code
from transformers import AutoTokenizer, AutoModel

# === Config ===
BASE = "bert-base-uncased"
NEW_TOKENS = ["<SPECIAL_TOKEN>", "<DOMAIN_TERM>", "<CUSTOM_ENTITY>"]

# 1) Load the original tokenizer for the before/after comparison
orig_tok = AutoTokenizer.from_pretrained(BASE)

# 2) Extend a second copy of the tokenizer with the *special* tokens
tok = AutoTokenizer.from_pretrained(BASE)
added = tok.add_special_tokens({"additional_special_tokens": NEW_TOKENS})

print(f"Original vocabulary size: {len(orig_tok)}")
print(f"Added {added} new tokens")
print(f"New vocabulary size: {len(tok)}")

# 3) To use a model with these tokens, expand its embedding matrix
model = AutoModel.from_pretrained(BASE)
model.resize_token_embeddings(len(tok))

# 4) Compare tokenization before vs. after
test_text = "The AI model uses <SPECIAL_TOKEN> for classification."
before = orig_tok.tokenize(test_text)
after = tok.tokenize(test_text)

# Two-column comparison table
w1 = max((len(x) for x in before), default=0)
w2 = max((len(x) for x in after), default=0)
print("\nTokenization comparison:")
print(f"{'Original tokenizer'.ljust(w1)} | {'Extended tokenizer'.ljust(w2)}")
print("-" * (w1 + w2 + 5))
for i in range(max(len(before), len(after))):
    c1 = before[i] if i < len(before) else ""
    c2 = after[i] if i < len(after) else ""
    print(f"{c1.ljust(w1)} | {c2}")

# 5) Show the IDs assigned to the new tokens
print("\nAssigned IDs for new tokens:")
for t in NEW_TOKENS:
    print(f"{t} -> {tok.convert_tokens_to_ids(t)}")

# 6) Encode a sample sentence with the extended tokenizer
sample_text = "Process this <SPECIAL_TOKEN> carefully with <DOMAIN_TERM>."
ids = tok.encode(sample_text, return_tensors="pt")
print(f"\nEncoded sample text: {sample_text}")
print(f"Input IDs shape: {ids.shape}")
print(f"Input IDs: {ids}")
Here is the screenshot of the extended-vocabulary tokenizer results:
chapter2_homework_[your_name].md