hllinas2023

1 Prerequisites and Software Setup

The examples developed in this document are based on pretrained word embeddings and vector-based distance computations. To ensure that all code runs correctly, a small set of Python libraries must be available before executing the examples.

The commands shown below are provided for reference only. They are shell commands and should be run where the target Python environment is active, such as a system terminal, a Conda prompt, or a shell chunk in an R Markdown document configured with the reticulate package.

1.0.1 Required Python packages

### Core libraries for word embeddings
pip install gensim

### Optional dependency for Word Mover’s Distance
pip install pot

The pot package (Python Optimal Transport) is required only for computing Word Mover’s Distance (WMD). If it is not installed, all other examples in this document will still run correctly.
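
Whether the optional dependency is present can be checked before running the WMD examples. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical, and the check relies on the POT distribution being imported under the module name `ot`.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# POT is installed with `pip install pot` but imported as `ot`.
wmd_available = has_module("gensim") and has_module("ot")
print("WMD examples can run:", wmd_available)
```

If the flag is False, the WMD-related examples can simply be skipped while the remaining examples are run as usual.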

1.0.2 Python imports used throughout this document

import numpy as np
import gensim
from gensim.models import KeyedVectors, Word2Vec
import gensim.downloader as api

1.0.3 Role of the libraries

The purpose of each library used in this chapter is summarized below.

  • gensim: Provides tools for loading, training, and querying word embedding models. In this document, it is used to:

    • load pretrained embeddings (e.g., GloVe, a word embedding model based on global co-occurrence statistics),

    • compute similarity queries,

    • perform vector arithmetic,

    • and calculate Word Mover’s Distance when the required dependency is available.

  • gensim.downloader: Offers a convenient interface for downloading lightweight pretrained embedding models, avoiding manual file handling.

  • Word2Vec (gensim.models): Included to illustrate how embedding models can also be trained from scratch on custom corpora. This neural-based approach learns vector representations from local context, in contrast to global co-occurrence methods such as GloVe.

  • numpy: Supports low-level numerical operations and vector computations required for similarity and distance calculations.

  • pot (Python Optimal Transport): Implements optimal transport algorithms used internally by gensim to compute Word Mover’s Distance. This dependency is only needed for WMD-related examples.

2 Trained and pretrained models

2.0.1 Preliminaries

Definition.

  • Word embedding models are models trained on textual data with the objective of learning continuous vector representations of words that capture semantic and syntactic relationships.

  • When such models are trained in advance on large, general-purpose corpora and later reused without further training, they are referred to as pretrained models.

Training of pretrained models.

  • Pretrained word embedding models are typically learned from very large text collections, such as Wikipedia or news archives.

  • Rather than training embeddings from scratch for each task, these models can be reused to explore semantic relationships, compute similarities, and perform analogy-based queries.

Use of pretrained models.

In this document, pretrained models are used to:

  • Illustrate how words are represented as vectors,

  • Explore semantic similarity and distance measures, and

  • Analyze relationships between words and short texts.

Download of pretrained models.

  • To ensure reproducibility and ease of setup, all examples rely on compact pretrained models that can be downloaded automatically.

  • Large external embedding files are intentionally avoided to guarantee consistent execution across different systems.

2.0.2 Comparison of GloVe and Word2Vec

Two of the most widely used approaches for learning word embeddings are GloVe and Word2Vec. Although both produce vector representations of words, they differ in how these representations are learned.

| Aspect | GloVe | Word2Vec |
|---|---|---|
| Learning approach | Global co-occurrence statistics | Local context windows |
| Model type | Matrix factorization-based | Neural network-based |
| Context used | Global (entire corpus) | Local (neighboring words) |
| Training objective | Factorization of co-occurrence matrix | Prediction (CBOW / Skip-gram) |
| Interpretability | More interpretable | Less interpretable |
| Typical use | Pretrained embeddings | Training custom embeddings |

2.0.3 Example tokens from a pretrained embedding model

Pretrained embedding models typically contain vector representations for tens or hundreds of thousands of tokens. Throughout the document, we will work with a pretrained GloVe model trained on Wikipedia and Gigaword news data, using \(100\)-dimensional word vectors. The following code illustrates how such tokens can be accessed once a pretrained model has been loaded. The chunk is shown for reference only and is not explained at this stage.

import gensim.downloader as api

# Download and load a compact pretrained model 
model = api.load("glove-wiki-gigaword-100")

# Inspect a small sample of tokens
list(model.key_to_index.keys())[:50]
## ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as', 'it', 'by', 'at', '(', ')', 'from', 'his', "''", '``', 'an', 'be', 'has', 'are', 'have', 'but', 'were', 'not', 'this', 'who', 'they', 'had', 'i', 'which', 'will', 'their', ':', 'or', 'its', 'one', 'after']

The output is a list of the first 50 tokens stored in the model. For this model these are very frequent tokens, mostly function words and punctuation, for which vector representations are available. In practice, pretrained models contain many more tokens than those shown here. Models accessed through gensim.downloader are downloaded automatically on first use and cached locally for future sessions.

Notice that the vocabulary includes not only words, but also punctuation symbols and contractions, reflecting the distribution of tokens in the original training corpus.

Each token in the vocabulary is associated with a numeric vector representation that encodes its semantic properties.

# Inspect the vector representation of a token
model["data"][:10]  # show first 10 dimensions for readability
## array([-0.47099,  0.61577,  0.68969, -0.18149,  0.30778, -0.8415 ,
##        -0.41873, -0.20013,  0.28184, -0.34005], dtype=float32)

The output is a numeric vector corresponding to the token data. In practice, each token is represented by a vector of length \(100\) in this model.

2.0.4 Other pretrained embedding models available in gensim

In addition to the GloVe model used throughout this document, gensim.downloader provides access to several other pretrained word embedding models. These models differ in training corpus, dimensionality, and intended use. The following code lists all pretrained models available via gensim.downloader.

import gensim.downloader as api

# List all available pretrained models
api.info()["models"].keys()
## dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

Common alternatives include:

  • glove-wiki-gigaword-50: A lower-dimensional version of the Wikipedia-based GloVe model, suitable for fast experimentation and visualization.

  • glove-wiki-gigaword-200: A higher-dimensional variant that may capture finer semantic distinctions, at the cost of increased memory usage.

  • glove-twitter-25: Trained on Twitter data; useful for informal language, abbreviations, and social media text.

  • word2vec-google-news-300: A Word2Vec model trained on a large news corpus. Due to its size, it is not recommended for lightweight or instructional settings.

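  • fasttext-wiki-news-subwords-300: A FastText model trained on Wikipedia and news data using subword information, which is the mechanism FastText relies on to handle out-of-vocabulary words. Note that gensim.downloader appears to distribute it as plain word vectors, so looking up an unseen word may still raise a KeyError.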

3 Introduction

The development of this document follows a progressive transformation of text representations, from lexical units to semantic vector spaces, as summarized in Figure 3.1.

Figure 3.1: Conceptual route of the NLP materials developed in Sections 1.1 to 1.3. Source: Created by the author with ChatGPT (OpenAI)

3.0.1 Preliminaries

In previous materials, we first introduced the construction of vocabularies and tokenization (see “Lexical foundations and vocabulary construction in NLP”), followed by frequency-based representations such as bag-of-words and TF-IDF (see “Transforming Text into Data Structure”).

These methods encode text numerically by focusing on the presence and frequency of words within documents and across a corpus.

Although effective in many applications, frequency-based representations largely ignore the contextual surroundings of words. They do not account for which terms tend to appear before or after a given word, even though this local neighborhood plays a crucial role in shaping meaning. The semantic role of a word is strongly influenced by the context in which it appears.

In this document, we build on this idea by introducing word embeddings, which represent words as vectors learned from their contextual usage in text. These representations aim to capture semantic relationships between words rather than relying solely on surface-level frequency information.

The following topics are covered in this document:

  • Understanding word embeddings,

  • Demystifying the Word2Vec model,

  • Training a Word2Vec model from text data, and

  • Introducing Word Mover’s Distance as a measure of semantic similarity between texts.

Throughout this document, small helper functions are introduced only when they become necessary, in order to simplify repetitive tasks and improve the robustness of the examples.

3.0.2 Motivation: Embeddings in the Transformer architecture

Unlike frequency-based representations such as bag-of-words and TF-IDF, embeddings provide dense representations of words learned from their contexts, which are essential for modern architectures such as Transformers.

The representations introduced in this chapter correspond directly to the first computational stage of modern neural language models. In the Transformer architecture (Vaswani et al., 2017), each input token is mapped to a dense vector through an embedding layer, as illustrated in Figure 3.2.

Figure 3.2: General architecture of the Transformer model. Source: [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)

These embeddings serve as the initial numerical representation of language within the model. Unlike frequency-based approaches, they are designed to capture semantic relationships by encoding contextual usage patterns. As a result, words with similar meanings tend to occupy nearby regions in the vector space.

In addition to word embeddings, the Transformer incorporates positional encoding to preserve information about word order, which is not inherently captured by vector representations alone. The combination of embeddings and positional encoding forms the input to the encoder and decoder stacks.

Crucially, these vector representations are not static in modern architectures. Within the Transformer, embeddings are progressively refined through attention mechanisms, which dynamically adjust representations based on the relationships between tokens in a sequence. This allows the model to capture context-dependent meaning rather than relying solely on fixed representations.

Understanding embeddings is therefore essential for interpreting how Transformers operate: they define the space in which attention mechanisms, feed-forward networks, and all subsequent transformations take place.

In this sense, embeddings are not merely a representation technique. They constitute the foundational layer upon which the entire architecture is built.

4 Learning semantic word representations

4.0.1 From Distributional Hypothesis to Vector Geometry

Word embeddings represent words as numerical vectors in an \(n\)-dimensional space. Words that appear in similar contexts tend to have similar meanings, and therefore are located near each other in this space.

Models such as Word2Vec (Mikolov et al., 2013), developed at Google, learn these representations by analyzing patterns of word co-occurrence in large text corpora. The intuition behind Word2Vec follows the distributional hypothesis:

You shall know a word by the contexts in which it appears

Figure 4.1 illustrates this idea. On the left, meaning is derived from contextual usage within a sentence. On the right, words are represented as points in a vector space, where semantic similarity corresponds to spatial proximity.

Figure 4.1: Word embeddings. Source: Created by the author with ChatGPT (OpenAI)

Beyond simple similarity, embeddings also capture relationships between words.

4.0.2 Inferring meaning from context: an intuitive example

To better understand how meaning can be derived from context, consider a word that is unfamiliar to us:

"Students meet in the lernraum to prepare for exams."

"The lernraum provides access to shared materials and quiet work areas."

"Collaborative activities are often organized in the lernraum."

Even without knowing the exact meaning of lernraum, we can infer from its surrounding words (such as students, exams, materials, and collaborative activities) that it likely refers to a space dedicated to studying or learning.

Now consider other expressions that appear in similar contexts:

"Students often meet in *study rooms* to prepare for exams."

"The *classroom* offers access to learning resources and work areas."

"Collaborative activities take place in shared *learning environments*."

Because lernraum, study room, classroom, and learning environment appear in similar linguistic settings, we infer that they convey related meanings.

This intuition lies at the core of the distributional hypothesis: words that occur in similar contexts tend to have similar meanings, and therefore are represented by nearby vectors in the embedding space.
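
The idea of "similar contexts" can be made concrete with a small counting exercise. The sketch below is an illustration only: the corpus is a shortened adaptation of the sentences above plus one unrelated sentence, and the helper names `context_words` and `jaccard` as well as the stop-word list are hypothetical choices. It collects the words that co-occur with each target and measures how much these context sets overlap.

```python
import re

# Small, hand-picked stop-word list (an assumption made for this illustration).
STOPWORDS = {"the", "in", "to", "for", "and", "are", "often", "of", "is", "a"}


def context_words(sentences, target):
    """Collect the content words that share a sentence with `target`."""
    context = set()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if target in tokens:
            context.update(
                t for t in tokens if t != target and t not in STOPWORDS
            )
    return context


def jaccard(a, b):
    """Share of context words two targets have in common (0 = none, 1 = all)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


corpus = [
    "Students meet in the lernraum to prepare for exams.",
    "The lernraum provides access to shared materials and quiet work areas.",
    "Students often meet in the classroom to prepare for exams.",
    "The classroom offers access to learning resources and work areas.",
    "The river flows through the valley to the sea.",
]

lernraum = context_words(corpus, "lernraum")
classroom = context_words(corpus, "classroom")
river = context_words(corpus, "river")

print(jaccard(lernraum, classroom))  # high overlap: similar contexts
print(jaccard(lernraum, river))      # no overlap: unrelated contexts
```

Embedding models do not compute set overlaps like this one, but they exploit the same signal: targets that share many context words end up with similar vectors.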

This idea can also be visualized geometrically. Figure 4.2 shows a three-dimensional projection of word embeddings, where terms related to university form distinct semantic clusters. Words associated with academic institutions, roles, and activities appear close together, reflecting their shared contextual usage.

Such visualizations can be explored interactively using tools like the TensorFlow Embedding Projector, which allows users to inspect pretrained embeddings and observe their geometric structure.

Figure 4.2: Three-dimensional (PCA) visualization of word embeddings centered on the term university. Words with similar meanings and contexts appear close together in the embedding space. Visualization adapted from the TensorFlow Embedding Projector.

Although the original embeddings exist in a high-dimensional space, dimensionality reduction techniques such as principal component analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) allow us to visualize their structure in two or three dimensions.
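
As a rough illustration of such a projection, the sketch below applies PCA (computed through a singular value decomposition) to random stand-in vectors. The function name `pca_project` and the toy data are assumptions made for this example; with a real model, the rows would be the embedding vectors of the selected words.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their first principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Stand-in for real embeddings: 6 made-up "words" in a 100-dimensional space.
toy_vectors = rng.normal(size=(6, 100))

coords = pca_project(toy_vectors, n_components=2)
print(coords.shape)  # one 2-D point per word
```

The resulting coordinates can be passed to any plotting library. PCA preserves global directions of variation, whereas t-SNE focuses on local neighborhoods, so the two methods can produce visibly different pictures of the same vectors.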

4.0.3 Example: education and professional roles

Consider terms related to education and professional roles. If a Word2Vec model has been trained appropriately and the relevant terms are present in its vocabulary, one may observe vector relationships such as:

\[\overrightarrow{\text{teacher}} \quad-\quad \overrightarrow{\text{school}} \quad+\quad \overrightarrow{\text{university}} \quad\approx\quad \overrightarrow{\text{professor}}\]

Graphically, this relationship can be interpreted as a sequence of vector operations in the embedding space. The arrow above each word denotes its vector representation, that is, the numerical vector learned by the Word2Vec model for that word.

In this analogy, the role of a teacher in a school is related to the role of a professor in a university. Rearranging the expression highlights the symmetry of the relationship:

\[\overrightarrow{\text{teacher}} \quad+\quad \overrightarrow{\text{university}} \quad\approx\quad \overrightarrow{\text{school}} \quad+\quad \overrightarrow{\text{professor}}\]

In other words, Word2Vec captures regularities in language by encoding comparable relationships as similar geometric transformations in the embedding space.

4.0.4 Example: geographical relationships

As a second example, consider geographical relationships that do not rely on country–capital pairs. Instead, we examine relationships between countries and their corresponding demonyms or nationalities. A typical analogy captured by word embeddings may take the form:

\[\overrightarrow{\text{Japan}} \quad-\quad \overrightarrow{\text{Japanese}} \quad+\quad \overrightarrow{\text{Italian}} \quad\approx\quad \overrightarrow{\text{Italy}}\]

Conceptually, this corresponds to transferring a relationship learned in one geographical context to another. Here, the association between a country and its demonym is shifted across contexts. This example illustrates the ability of embedding models to generalize relational patterns beyond individual word pairs.

These patterns suggest that semantic relationships can be interpreted as geometric transformations in the embedding space. We now turn to the learning mechanism that makes such representations possible.

5 Making sense of Word2Vec

5.0.1 How Word2Vec learns word representations

Word2Vec learns word representations by solving a prediction task based on local context.

  • In the Skip-gram model, a word is used to predict its surrounding context.

  • In the CBOW model, the context is used to predict the target word.

Through this process, words that appear in similar contexts are mapped to nearby points in the embedding space, giving rise to the semantic structures illustrated earlier.
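
The difference between the two prediction tasks is easiest to see in the training examples they generate. The following sketch only shows how such examples could be formed from a tokenized sentence; it does not train anything, and it omits refinements used by actual Word2Vec implementations (such as subsampling and negative sampling). The function names `skipgram_pairs` and `cbow_pairs` are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_pairs(tokens, window=2):
    """Yield (context, target) pairs: the neighbors jointly predict the target."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, target


sentence = "students meet in the library".split()
print(list(skipgram_pairs(sentence, window=1)))
print(list(cbow_pairs(sentence, window=1)))
```

Skip-gram produces one example per (word, neighbor) combination, while CBOW produces one example per position with the whole window as input.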

5.0.2 Word2Vec: pretrained model

Training Word2Vec from scratch typically requires large corpora and significant computational resources. In practice, pretrained models are often used.

A Word2Vec model can be represented as a matrix of size \(|W| \times K\), where \(|W|\) is the vocabulary size and \(K\) is the embedding dimension (typically between 50 and 300). Each row corresponds to the vector representation of a word.
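
The matrix view can be illustrated with a toy example. The vocabulary, the values, and the helper `lookup` below are made up for illustration; they only mimic the combination of a word-to-row mapping and a \(|W| \times K\) array.

```python
import numpy as np

# Made-up vocabulary and values; a real model has a far larger |W|.
key_to_index = {"city": 0, "town": 1, "river": 2}
K = 4
embedding_matrix = np.array(
    [
        [0.9, 0.1, 0.0, 0.3],  # row 0 -> "city"
        [0.8, 0.2, 0.1, 0.3],  # row 1 -> "town"
        [0.1, 0.9, 0.7, 0.0],  # row 2 -> "river"
    ],
    dtype=np.float32,
)


def lookup(word):
    """Return the row of the matrix that stores the vector for `word`."""
    return embedding_matrix[key_to_index[word]]


print(embedding_matrix.shape)  # (|W|, K)
print(lookup("river"))
```

Looking up a word is therefore just a row selection: the mapping gives the row number, and the matrix gives the vector.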

Pretrained models are available through libraries such as gensim and can be directly applied or fine-tuned for specific tasks. Large-scale models trained on news corpora (e.g., approximately 3 million words and phrases with 300-dimensional vectors) are widely used in research and applications.

Due to their size (around 1.5 GB), these models are primarily suited for large-scale or research-oriented settings and are available through public repositories such as the Google Code Archive.

Figure 5.1 shows an example of the original Word2Vec repository, which provides implementations of the Skip-gram and CBOW architectures. This repository illustrates how theoretical concepts in word embeddings are implemented in real-world systems used in natural language processing.

Figure 5.1: Original Word2Vec project repository hosted in the Google Code Archive, including implementations of the Skip-gram and CBOW models. Source: Google Code Archive.

6 Exploring a pretrained word embedding model with gensim

6.0.1 gensim: loading a pretrained embedding model

As discussed in the introductory sections of this document, pretrained embedding models provide ready-to-use vector representations that capture semantic relationships between words. We now move from conceptual understanding to practical implementation.

For instructional purposes, it is often preferable to work with compact pretrained embedding models that can be downloaded automatically. The gensim library provides convenient access to several such models through its internal data repository. To begin, we load a lightweight pretrained model based on GloVe embeddings trained on Wikipedia data.

import gensim.downloader as api

# Load a compact pretrained embedding model
model = api.load("glove-wiki-gigaword-100")

# List all available pretrained models
api.info()["models"].keys()
## dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

The output above shows a variety of pretrained embedding models available through gensim, each trained on different corpora and with different characteristics (e.g., dimensionality, domain, and language).

Once a model has been selected and loaded, the next step is to ensure that the words we intend to analyze are actually present in its vocabulary.

6.0.2 gensim: verifying vocabulary coverage

Before performing similarity queries or vector arithmetic, it is important to verify that the words we plan to use are actually present in the model’s vocabulary. Pretrained embedding models only contain vectors for words observed during training. If a word is missing, any attempt to query it will result in an error.

Defining candidate words.

To illustrate this, we define a small list of general-purpose words, including some terms that may not appear in the model’s vocabulary:

candidates = [
    "city", "country", "river", "music", "science",
    "computer", "economy", "school", "university",
    "government", "health",
    "lernraum", "data_science_lab", "nonexistentword123"
]

Checking vocabulary membership.

We then check which of these words are included in the model’s vocabulary:

[w for w in candidates if w in model.key_to_index]

The above expression is a list comprehension that:

  • Iterates over each word w in the list candidates.

  • Checks whether that word exists in model.key_to_index.

  • Keeps only those words that are present in the vocabulary.

Here, model.key_to_index is a dictionary that maps each word in the pretrained model to its internal index. Therefore, the condition

w in model.key_to_index

verifies whether the embedding model contains a vector representation for the word w.

Output.

The output confirms that only the words observed during training are retained:

## ['city', 'country', 'river', 'music', 'science', 'computer', 'economy', 'school', 'university', 'government', 'health']

Notice that terms such as lernraum, data_science_lab, or nonexistentword123 are not included in the output, indicating that they are not present in the model’s vocabulary.

6.0.3 gensim: handling out-of-vocabulary words

Example: an out-of-vocabulary error.

# Attempt to access a word not in the vocabulary
model["lernraum"]

The output is:

KeyError: "Key 'lernraum' not present in vocabulary"

This error occurs because the word lernraum was not observed during the training of the model and therefore does not have a corresponding vector representation.

How to avoid this error.

word = "lernraum"

if word in model.key_to_index:
    print(model[word][:10])
else:
    print(f"The word '{word}' is not in the vocabulary.")
## The word 'lernraum' is not in the vocabulary.

Such words are commonly referred to as out-of-vocabulary (OOV) terms. Handling OOV words is an important practical consideration when working with pretrained embedding models.
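
Some OOV cases are avoidable, because the token exists in the vocabulary under a slightly different spelling. The sample tokens shown earlier are all lowercase, which suggests that lowercasing a query is a reasonable first attempt for this model. The sketch below is one possible approach, not a gensim feature: `resolve_token` is a hypothetical helper, and a plain dictionary stands in for model.key_to_index.

```python
def resolve_token(word, vocabulary):
    """Return the first spelling variant of `word` found in `vocabulary`, else None."""
    variants = [word, word.lower(), word.strip().lower()]
    for candidate in variants:
        if candidate in vocabulary:
            return candidate
    return None


# Stand-in for model.key_to_index (hypothetical three-word vocabulary).
toy_vocabulary = {"city": 0, "university": 1, "school": 2}

for raw in ["City", " university ", "lernraum"]:
    print(repr(raw), "->", resolve_token(raw, toy_vocabulary))
```

Tokens that still resolve to None are genuinely out of vocabulary and need to be replaced or dropped before querying the model.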

6.0.4 gensim: why is this step useful?

This verification step serves two main purposes:

  • Technical safety: It prevents runtime errors when querying embeddings for out-of-vocabulary words.

  • Conceptual clarity: It reinforces the idea that embedding models operate over a fixed vocabulary learned during training.

Only after confirming vocabulary coverage does it make sense to proceed with similarity queries, analogy tasks, or vector arithmetic.

6.0.5 gensim: inspecting the internal vocabulary structure

To better understand how tokens are stored internally, we can display a small sample of words from the model’s vocabulary:

# View a small sample of tokens
list(model.key_to_index.keys())[:20]
## ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as']

This command extracts the first 20 tokens from the vocabulary dictionary. It provides a quick glimpse into how words are indexed and stored within the pretrained model.

6.0.6 gensim: vocabulary size and embedding dimensionality

We can inspect the size of the vocabulary included in the pretrained model as follows:

len(model.key_to_index)
## 400000

This value indicates the total number of unique tokens for which vector representations are available. Next, we verify the dimensionality of the word vectors:

model.vector_size
## 100

In this case, each word is represented as a vector in a 100-dimensional embedding space. This dimensionality determines the number of numerical features used to encode semantic and syntactic information.
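
These two numbers also give a rough idea of the memory needed to hold the vectors. The short calculation below assumes that all values are stored as 32-bit floats, as suggested by the dtype shown in the earlier vector output, and ignores any additional overhead of the model object.

```python
vocab_size = 400_000   # len(model.key_to_index), as reported above
vector_size = 100      # model.vector_size, as reported above
bytes_per_value = 4    # float32

total_bytes = vocab_size * vector_size * bytes_per_value
print(f"{total_bytes / 1e6:.0f} MB")  # 160 MB for the raw vector matrix
```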

6.0.7 gensim: accessing the complete vocabulary mapping

For completeness, the full vocabulary-to-index mapping can be accessed via:

model.key_to_index

This object is a dictionary where:

  • Keys correspond to vocabulary words.

  • Values correspond to their internal integer indices.

The output is intentionally suppressed here, as it is too large to display meaningfully in the rendered document.

7 Nearest Neighbors with most_similar()

One of the most direct ways to investigate what a word embedding model has learned is to examine nearest neighbors in the embedding space. The method most_similar() retrieves the words whose vectors are closest to a given query word, where closeness is measured using cosine similarity.

The output consists of pairs (word, similarity score), where higher values indicate stronger semantic proximity to the query word.

model.most_similar("city", topn=10)
## [('town', 0.8263899087905884), ('cities', 0.7764331698417664), ('where', 0.754779040813446), ('area', 0.7458435297012329), ('downtown', 0.7437540292739868), ('capital', 0.713450014591217), ('southern', 0.7070139646530151), ('near', 0.7027733325958252), ('neighborhood', 0.6954501271247864), ('suburb', 0.6925175189971924)]

\[ \begin{array}{ll} \textbf{Word} & \textbf{Cosine Similarity} \\ \hline \texttt{town} & 0.8264 \\ \texttt{cities} & 0.7764 \\ \texttt{where} & 0.7548 \\ \texttt{area} & 0.7458 \\ \texttt{downtown} & 0.7438 \\ \texttt{capital} & 0.7135 \\ \texttt{southern} & 0.7070 \\ \texttt{near} & 0.7028 \\ \texttt{neighborhood} & 0.6955 \\ \texttt{suburb} & 0.6925 \\ \end{array} \]

From a practical perspective, the table should be read from top to bottom: the first entries correspond to the words that are most similar to the query term. For example, town and cities appear at the top of the list because they share strong contextual and semantic associations with city.

Lower-ranked terms, such as neighborhood or suburb, are still related but reflect weaker or more specific connections. This ranking illustrates how similarity is gradual rather than binary.

These results illustrate how semantic similarity emerges geometrically: words that tend to appear in related contexts occupy nearby positions in the vector space. It is important to note that this operation requires the query token to exist exactly as stored in the model vocabulary; otherwise, gensim raises a KeyError.
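
To make the ranking criterion explicit, the sketch below computes cosine similarity by hand and ranks neighbors on made-up three-dimensional vectors. The names `cosine_similarity` and `nearest` and the toy values are assumptions for illustration; the actual most_similar() method works on the full vocabulary and is implemented differently for efficiency.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up 3-dimensional vectors, for illustration only.
toy = {
    "city": np.array([0.9, 0.8, 0.1]),
    "town": np.array([0.8, 0.9, 0.2]),
    "river": np.array([0.1, 0.3, 0.9]),
}


def nearest(query, vectors, topn=2):
    """Rank all other words by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(nearest("city", toy))
```

Because cosine similarity depends only on direction, two vectors of very different lengths can still be maximally similar.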

7.0.1 gensim: remarks

These nearest-neighbor queries provide intuition about the local structure of the embedding space.

  • Pretrained embedding models follow specific tokenization conventions. Proper nouns, rare words, inflected forms, or multiword expressions may be absent or represented differently.

  • Cosine similarity does not imply strict synonymy; rather, it measures distributional proximity, which may reflect topical association rather than identical meaning.

  • For instructional purposes, selecting common nouns and broadly used terms typically yields more stable and reproducible examples.

We now extend this idea from proximity to relational structure by introducing analogy-based vector arithmetic.

8 Analogy queries with pretrained embeddings

8.0.1 Preliminaries

Beyond nearest-neighbor exploration, word embeddings can also be examined through analogy-style queries. Unlike simple similarity queries (which measure proximity to a single word), analogy queries explore relational structure in the embedding space.

The central idea is that semantic relationships can often be expressed as differences between vectors. For example, if two pairs of words share a comparable relationship, the difference between their vectors tends to be similar.

Formally, analogy queries attempt to solve expressions of the form:

\[\mathbf{v}_b \;-\; \mathbf{v}_a \;+\; \mathbf{v}_c \;\approx\; \mathbf{v}_d,\]

where \(a\) is to \(b\) as \(c\) is to \(d\), and the model searches for the word \(d\) whose vector is closest (in cosine similarity) to the vector on the left-hand side.

8.0.2 Helper utilities for safer, reproducible analogy queries

Before running such queries, it is useful to define a set of helper functions. These utilities allow us to verify vocabulary coverage and execute analogy queries safely.

def in_vocab(w, m):
    return w in m.key_to_index

def require_tokens(m, tokens):
    missing = [t for t in tokens if not in_vocab(t, m)]
    return missing

def analogy(m, positive, negative, topn=1):
    return m.most_similar(
        positive=positive,
        negative=negative,
        topn=topn
    )

In the previous code block:

  • in_vocab() verifies whether a given token is present in the embedding vocabulary, returning True or False.

  • require_tokens() checks a collection of tokens and reports any that are missing before a query is executed.

  • analogy() acts as a lightweight interface to perform analogy-based queries using the underlying most_similar() method.

The helper functions above do not define the analogy itself; rather, they ensure that the required tokens are available and provide a clean interface for executing these queries.

Within the gensim framework, analogy queries are implemented by combining vectors through addition and subtraction:

  • positive=[...] specifies the words whose vectors are added,

  • negative=[...] specifies the words whose vectors are subtracted,

  • topn=k retrieves the \(k\) closest candidate words.
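
The effect of these arguments can be sketched on toy data. The function `toy_analogy` below is a simplified illustration that roughly mirrors the idea (normalize the input vectors, add and subtract them, rank the remaining words by cosine similarity, and leave the query words out of the answer); it is not gensim's actual implementation, and the three-dimensional vectors are made up.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def toy_analogy(vectors, positive, negative, topn=1):
    """Add unit vectors of `positive`, subtract those of `negative`, rank the rest."""
    target = sum(unit(vectors[w]) for w in positive) - sum(
        unit(vectors[w]) for w in negative
    )
    target = unit(target)
    used = set(positive) | set(negative)
    scores = [
        (word, float(np.dot(unit(vec), target)))
        for word, vec in vectors.items()
        if word not in used  # query words are not returned as answers
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up axes: [teaching role, institution, higher education].
toy = {
    "teacher": np.array([1.0, 0.0, 0.0]),
    "school": np.array([0.0, 1.0, 0.0]),
    "university": np.array([0.0, 1.0, 1.0]),
    "professor": np.array([1.0, 0.0, 1.0]),
    "student": np.array([0.5, 0.5, 0.5]),
    "river": np.array([-0.2, 0.3, -0.1]),
}

print(toy_analogy(toy, positive=["teacher", "university"], negative=["school"], topn=2))
```

Excluding the query words matters in practice: without that step, the nearest word to the combined vector is frequently one of the inputs themselves.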

8.0.3 Quick checks using the helper functions

A single token.

Before running full analogy queries, we can use the helper utilities to verify vocabulary coverage. First, we can check whether a single word is present in the model:

# Check if a word is in the vocabulary
in_vocab("city", model)
## True

This returns True if the word exists in the vocabulary, and False otherwise.

Multiple tokens.

For multiple tokens, we use require_tokens(), which reports any missing words:

# Check multiple tokens at once
require_tokens(model, ["city", "university"])
## []

If the output is an empty list, all tokens are available. Otherwise, the function returns the missing tokens:

# Check multiple tokens at once
require_tokens(model, ["city", "university", "lernraum"])
## ['lernraum']

These missing terms should be adjusted before running analogy queries.

8.0.4 A first relationship (running an analogy query safely)

The relationship (as text).

We now combine the helper functions to perform a complete analogy query. The goal is to identify words that are semantically related to teacher after shifting the context from school to university.

\[ \mathbf{teacher} \quad + \quad \mathbf{university} \quad - \quad \mathbf{school}. \]

The relationship vectorially.

In vector terms, this corresponds to finding words whose embeddings are closest to:

\[\mathbf{v}_{\text{teacher}} \quad +\quad \mathbf{v}_{\text{university}} \quad - \quad \mathbf{v}_{\text{school}} \]

Checking for missing tokens and running the analogy.

tokens = ["teacher", "school", "university"]

# Step 1: check for missing tokens
missing = require_tokens(model, tokens)

if not missing:
    # Step 2: run the analogy
    analogy(
        model,
        positive=["teacher", "university"],
        negative=["school"],
        topn=5
    )
else:
    print("Missing tokens:", missing)

The resulting top candidates are:

## [('professor', 0.8101112842559814), ('lecturer', 0.7625928521156311), ('scientist', 0.7011223435401917), ('faculty', 0.6967645883560181), ('researcher', 0.6935580372810364)]

\[ \begin{array}{lc} \textbf{Word} & \textbf{Cosine Similarity} \\ \hline \texttt{professor} & 0.8101 \\ \texttt{lecturer} & 0.7626 \\ \texttt{scientist} & 0.7011 \\ \texttt{faculty} & 0.6968 \\ \texttt{researcher} & 0.6936 \\ \end{array} \]

If all tokens are available, the query returns the top candidate words whose embeddings are most similar to this semantic combination.

For instance, results such as professor, lecturer, or scientist indicate that the model has successfully shifted from a school-level teaching context to a broader university-level academic environment.

This workflow ensures that all required tokens are present before executing the analogy query, avoiding runtime errors and improving reproducibility.

To further reinforce this interpretation, we now consider a second example from a different semantic domain.

8.0.5 A second relationship (medical context)

The relationship (as text).

The same logic can be applied to a different semantic domain. Here, we examine how the meaning of doctor changes when the context shifts from school to hospital:

\[ \mathbf{doctor} \quad +\quad \mathbf{hospital} \quad -\quad \mathbf{school}. \]

The relationship vectorially.

In vector terms, this corresponds to finding words whose embeddings are closest to:

\[\mathbf{v}_{\text{doctor}} \quad +\quad \mathbf{v}_{\text{hospital}} \quad - \quad \mathbf{v}_{\text{school}} \]

Checking for missing tokens and running the analogy.

tokens = ["doctor", "school", "hospital"]

# Step 1: check for missing tokens
missing = require_tokens(model, tokens)

if not missing:
    # Step 2: run the analogy
    analogy(
        model,
        positive=["doctor", "hospital"],
        negative=["school"],
        topn=5
    )
else:
    print("Missing tokens:", missing)

The resulting top candidates are:

## [('patient', 0.6921682357788086), ('doctors', 0.6606871485710144), ('hospitalized', 0.6557579636573792), ('surgeon', 0.6467913389205933), ('nurse', 0.6424804329872131)]

\[ \begin{array}{lc} \textbf{Word} & \textbf{Cosine Similarity} \\ \hline \texttt{patient} & 0.6922 \\ \texttt{doctors} & 0.6607 \\ \texttt{hospitalized} & 0.6558 \\ \texttt{surgeon} & 0.6468 \\ \texttt{nurse} & 0.6425 \\ \end{array} \]

Interpretation.

To interpret these results, consider the vector:

\[ \mathbf{v}_{\text{doctor}} \quad + \quad \mathbf{v}_{\text{hospital}} \quad -\quad \mathbf{v}_{\text{school}} \]

This vector does not correspond to a single exact word. Instead, it lies in a region of the embedding space associated with the hospital context.

The closest word vectors to this transformation are:

  • \(\mathbf{v}_{\text{patient}}\) (highest similarity),

  • followed by \(\mathbf{v}_{\text{doctors}}\), \(\mathbf{v}_{\text{hospitalized}}\), \(\mathbf{v}_{\text{surgeon}}\), and \(\mathbf{v}_{\text{nurse}}\).

Here, patient is the closest match in terms of cosine similarity, while the remaining terms represent semantically related concepts within the same domain.

These results reflect a broader medical or clinical context rather than a simple role substitution. For example, terms such as patient, nurse, and surgeon indicate that the model captures not only professional roles but also functional relationships and entities associated with the hospital environment.

This suggests that analogy queries can recover entire semantic fields, not just one-to-one role correspondences.

Remarks.

Together, these examples illustrate that word embeddings encode relational structure, not just similarity. Unlike the previous example, which yielded a direct role correspondence (teacher → professor), this case reveals a richer set of associations within the medical domain.

Having illustrated how analogy queries work in practice, we now formalize the underlying relationship more explicitly.

8.0.6 A third relationship (a semantic shift within everyday concepts)

One relationship (as text).

Consider the following relationship:

\[\text{teacher} \quad - \quad \text{school} \quad + \quad \text{university} \quad \quad \approx \quad \text{professor}\]

Conceptually, this analogy asks:

If a teacher is associated with a school, what is the corresponding role associated with a university?

The relationship vectorially.

In vector terms, we compute:

\[\mathbf{v}_{\text{teacher}} \quad - \quad \mathbf{v}_{\text{school}} \quad +\quad \mathbf{v}_{\text{university}}\]

and search for the word vector that is closest (under cosine similarity) to this resulting vector.

Step 1: define the query and check token availability.

In this step, we specify the analogy by separating the words into two groups: pos contains the words whose vectors are added, while neg contains the word to be subtracted.

We then verify that all required tokens are present in the model’s vocabulary before executing the query. If the output is an empty list ([]), this indicates that no tokens are missing and the query can proceed safely.

pos = ["teacher", "university"]
neg = ["school"]

missing = require_tokens(model, pos + neg)
missing
## []

Step 2: run the analogy (if all tokens are available).

In this step, we execute the analogy only if no tokens are missing. The function analogy() computes the vector combination defined earlier and returns the word whose embedding is closest to the resulting vector (according to cosine similarity).

# Run the analogy only if all tokens exist
if len(missing) == 0:
    analogy(model, positive=pos, negative=neg, topn=1)
else:
    "Some tokens are missing from the vocabulary: " + ", ".join(missing)

The result (top-1 candidate) is:

## [('professor', 0.8101112842559814)]

The model returns the closest word to the transformed vector:

\[ \begin{array}{lc} \textbf{Top-1 word} & \textbf{Cosine Similarity} \\ \hline \texttt{professor} & 0.8101 \\ \end{array} \]

Interpretation.

The model correctly identifies professor as the closest match. This suggests that the difference vector

\[\mathbf{v}_{\text{teacher}} \quad - \quad \mathbf{v}_{\text{school}}\]

encodes a professional-role-in-institution relationship, which can be transferred to a new institutional context.

This illustrates an important property of embeddings:

Relationships between words can be represented as approximately linear transformations in vector space.

8.0.7 A fourth relationship (inspecting multiple candidates)

Example: inspect more than one candidate.

In practice, it is often informative to inspect more than one candidate, as the model may return closely related terms or near-synonyms.

In this step, we request more than one candidate by setting topn=2. This allows us to inspect not only the best match, but also other words that are close to the transformed vector in the embedding space.

if len(missing) == 0:
    analogy(model, positive=pos, negative=neg, topn=2)
else:
    "Some tokens are missing from the vocabulary: " + ", ".join(missing)

The top-2 most similar words are:

## [('professor', 0.8101112842559814), ('lecturer', 0.7625928521156311)]

The two closest words to the transformed vector are:

\[ \begin{array}{lc} \textbf{Top-2 words} & \textbf{Cosine Similarity} \\ \hline \texttt{professor} & 0.8101 \\ \texttt{lecturer} & 0.7626 \\ \end{array} \]

In many cases, the top-ranked result reflects the intended relationship, while subsequent candidates provide semantically close alternatives within the same context.

The second result, lecturer, is semantically coherent with the analogy. This highlights an important point:

  • Embedding models do not return a single “true” answer.

  • Instead, they provide a ranked list of candidates based on geometric proximity.

  • Several words may satisfy the relational constraint to varying degrees.

8.0.8 A fifth relationship (a geography-oriented analogy without capitals)

Another relationship.

So far, the previous examples have shown analogy queries that can be executed successfully. We now consider a case that highlights an important practical limitation of pretrained embeddings.

Country-capital analogies can be unstable across models due to tokenization and vocabulary coverage issues. As an alternative, we consider a country–nationality style analogy:

\[ \text{Japan} \quad - \quad \text{Japanese} \quad + \quad \text{Italian} \quad \approx \quad \text{Italy} \]

Conceptually, this expression attempts to transfer the relationship

\[ \text{Country} \quad \Longleftrightarrow \quad \text{Nationality} \]

from one pair to another.

As in previous examples, we first verify whether all required tokens are available in the model’s vocabulary. If any token is missing, the analogy cannot be executed safely.

Example 1 (capitalized tokens).

We first define the analogy and check whether all required tokens are available:

pos2 = ["Japan", "Italian"]
neg2 = ["Japanese"]

missing2 = require_tokens(model, pos2 + neg2)
missing2
## ['Japan', 'Italian', 'Japanese']

Since the output is not an empty list, some tokens are missing from the model’s vocabulary, and the analogy cannot be executed.

We attempt to run the query:

if len(missing2) == 0:
    analogy(model, positive=pos2, negative=neg2, topn=1)
else:
    "Some tokens are missing from the vocabulary: " + ", ".join(missing2)
## 'Some tokens are missing from the vocabulary: Japan, Italian, Japanese'

This confirms that the analogy cannot be performed because the required tokens are missing from the model’s vocabulary.

Example 2 (lowercase tokens).

We now repeat the same query using lowercase tokens:

pos2 = ["japan", "italian"]
neg2 = ["japanese"]

missing2 = require_tokens(model, pos2 + neg2)
missing2
## []

In this case, the output is an empty list ([]), indicating that all tokens are available and the analogy can be executed:

if len(missing2) == 0:
    analogy(model, positive=pos2, negative=neg2, topn=1)
else:
    "Some tokens are missing from the vocabulary: " + ", ".join(missing2)

The model returns:

## [('italy', 0.9241219162940979)]

\[ \begin{array}{lc} \textbf{Top-1 word} & \textbf{Cosine Similarity} \\ \hline \texttt{italy} & 0.9241 \\ \end{array} \]

When using lowercase tokens, the query can be successfully executed, and the model correctly returns italy as the closest match.

This contrast highlights an important practical issue: pretrained embedding models are often case-sensitive, meaning that Italy and italy may be treated as distinct tokens.

As a result, careful preprocessing (e.g., lowercasing) is often necessary to ensure reliable analogy queries.
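
The lowercasing step can be sketched as follows. The set toy_vocab is a hypothetical stand-in for a lowercase-only vocabulary, and missing_tokens() and lowercase_tokens() are illustrative names that are separate from the helper functions defined earlier in this document.

```python
# Hypothetical stand-in for a lowercase-only vocabulary.
toy_vocab = {"japan", "japanese", "italy", "italian"}

def missing_tokens(vocab, tokens):
    """Return the tokens that are absent from the vocabulary."""
    return [t for t in tokens if t not in vocab]

def lowercase_tokens(tokens):
    """Normalize tokens to lowercase before querying."""
    return [t.lower() for t in tokens]

raw = ["Japan", "Italian", "Japanese"]
print(missing_tokens(toy_vocab, raw))                    # capitalized forms
print(missing_tokens(toy_vocab, lowercase_tokens(raw)))  # normalized forms
```

Lowercasing is only appropriate when the model's vocabulary is itself lowercased. For a cased model, the same normalization would remove information the model relies on.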

More broadly, this example shows that successful analogy queries depend not only on semantic structure, but also on how words are represented in the model’s vocabulary.

Why does this happen?

Pretrained embedding models:

  • Contain only words observed during training.

  • May tokenize proper nouns differently (e.g., lowercase vs uppercase).

  • May omit infrequent named entities.

These characteristics explain why some analogy queries fail even when the underlying relationship is conceptually valid.

This reinforces a crucial lesson:

Embedding models operate over a fixed vocabulary determined at training time.

8.0.9 Important remarks on analogy queries

Analogy queries are powerful exploratory tools, but they come with important limitations:

  • They do not guarantee a unique or universally correct answer.

  • Results depend on the training corpus.

  • Tokenization conventions directly affect outcomes.

  • Named entities are often less stable than common nouns.

For instructional purposes, it is advisable to:

  • Prefer common terms.

  • Avoid rare proper nouns.

  • Use broadly shared conceptual relationships.

Ultimately, analogy queries reveal not just similarity, but the relational geometry encoded in the embedding space.

9 The Word2Vec architecture

9.0.1 Preliminaries

In the previous section, we worked with pretrained Word2Vec-style embeddings and examined how they can be queried to reveal semantic relationships. We now turn our attention to the learning process itself and describe how Word2Vec models are trained.

Word2Vec can be trained using two closely related modeling strategies:

  • Skip-gram, where the model predicts surrounding context words given a target word.

  • Continuous Bag-of-Words (CBOW), where the model predicts a target word given its surrounding context.

Figure 9.1: Word2Vec architectures. Source: author’s own elaboration

9.0.2 A simple illustrative example

To build intuition, consider the simple sentence:

the dog barks loudly

Assume a small context window of size 1, meaning that each word is associated with its immediate neighbors.

Skip-gram (predict context from a target word).

If the target word is:

dog

the model learns to predict its surrounding context words:

Input  →  Output  
dog    →  the  
dog    →  barks  

CBOW (predict target from context words).

Using the same sentence, the context words around dog are:

the, barks

The model now predicts the target word:

   Input         →  Output  
(the, barks)     →  dog  

Remarks.

This simple example illustrates the key difference:

  • Skip-gram starts from one word and predicts its neighbors.

  • CBOW starts from neighboring words and predicts the central word.

Both approaches rely on the same underlying principles and differ mainly in the direction of prediction. In this document, we focus on the Skip-gram architecture, as its intuition is often easier to visualize. The same ideas can be transferred directly to the CBOW formulation.
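
The two directions of prediction can be sketched in a few lines of plain Python. The function names skipgram_pairs() and cbow_examples() are hypothetical. The code only builds the training examples and does not train a model.

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: one pair per neighbor inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context tuple, target) examples: all neighbors predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tuple(tokens[j] for j in range(lo, hi) if j != i)
        examples.append((context, target))
    return examples

sentence = "the dog barks loudly".split()

# Keep only the examples in which "dog" is the center word.
sg = [p for p in skipgram_pairs(sentence, window=1) if p[0] == "dog"]
cb = [e for e in cbow_examples(sentence, window=1) if e[1] == "dog"]
print(sg)
print(cb)
```

For the target dog, Skip-gram yields two separate pairs, whereas CBOW yields a single example whose input is the whole context.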

In practice, this process is repeated across large text corpora, allowing the model to learn meaningful vector representations from word co-occurrence patterns.

10 The Skip-gram approach

The Skip-gram model learns word representations by predicting context words from a given target word. Words that frequently appear near one another in text contribute to each other’s representations.

To formalize this process, we introduce two key concepts:

  • The target word is the central word currently being processed by the model.

  • The context words are the words that appear within a fixed neighborhood around the target word.

This neighborhood is controlled by a parameter known as the window size, which determines how many words to the left and right of the target are considered as context.

Each observed (target, context) pair provides a training signal that helps refine the embedding vectors. Over time, this process leads to word vectors that encode meaningful semantic structure.

10.0.1 Skip-gram: defining target and context words

A simple illustration.

Consider the following sentence:

Students develop skills by practicing data analysis techniques

Suppose we select data as the target word. As defined in the Skip-gram formulation, the context words correspond to those within the specified window around the target.

In this example, we use a window that spans \(5\) positions: the target word itself, up to \(2\) words to the left, and up to \(2\) words to the right.

Under this configuration, the following (target, context) training pairs are generated:

Target  →  Context  
data    →  by  
data    →  practicing  
data    →  analysis  
data    →  techniques  

During training, the model learns to associate the target word with each of its surrounding context words. Each (target, context) pair contributes to refining the vector representation of the target word.

More generally, the window slides across the sentence, producing multiple target–context pairs as different words take the role of the target.

Sliding window across the whole sentence.

To better visualize how this process operates across an entire sentence, we now consider the same sentence and apply the sliding-window mechanism.

In this case, we use the same window configuration as before (two words to the left and two to the right of the target).

Students develop skills by practicing data analysis techniques

Figure 10.1: Illustration of sliding window context generation.

In Figure 10.1:

  • The blue cell represents the target word.

  • The grey cells represent the context words within the window.

  • Each row corresponds to a different position of the target word as the window moves across the sentence.

For example, when the word practicing is selected as the target, the words skills, by, data, and analysis form its context.

This sliding-window mechanism allows the Skip-gram model to generate a large number of meaningful training examples from a single sentence, efficiently capturing local co-occurrence patterns in text.
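
The rows of the sliding-window illustration can be reproduced with a short sketch. The helper context_of() is a hypothetical name. It assumes the window is simply clipped at the sentence boundaries, so words near the edges have fewer context words.

```python
def context_of(tokens, i, window=2):
    """Context words for the target at position i (window clipped at the edges)."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

sentence = "Students develop skills by practicing data analysis techniques".split()

# One row per target position, mirroring the rows of the figure.
rows = [(target, context_of(sentence, i)) for i, target in enumerate(sentence)]
for target, context in rows:
    print(f"{target:>10} -> {context}")

n_pairs = sum(len(context) for _, context in rows)
print("Total (target, context) pairs:", n_pairs)
```

With eight words and two context words on each side, this single sentence produces 26 training pairs.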

10.0.2 Skip-gram: one-to-many prediction problem

We now examine the fundamental components involved in training a Skip-gram model (Word2Vec).

The Skip-gram architecture learns word representations by predicting surrounding context words given a target (center) word:

\[ \text{Target} \;\longrightarrow\; \text{Context words} \]

Mathematical formulation.

Conceptually, Skip-gram can be interpreted as a one-to-many prediction problem:

  • Given a single center word \(w_t\),

  • predict multiple surrounding context words within a window of size \(c\).

Formally, for a given position \(t\) in the corpus, the objective is to maximize the probability of observing the context words given the center word:

\[ P(\text{context words} \mid w_t) \;=\; P(w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c} \mid w_t) \;=\; \prod_{\substack{-c \le j \le c \\ j \ne 0}} P\big(w_{t+j} \mid w_t\big) \]

Here, context words refers to all words within a fixed window around the target word. This formulation assumes conditional independence between context words given the target word.

That is, the probability of observing all context words around \(w_t\) is expressed as the product of individual conditional probabilities.

This objective encourages the model to assign high probability to each word that appears within the context window of the target word, while ensuring that the entire context is well predicted.

In practice, this objective is optimized in logarithmic form, which transforms the product into a sum and allows each \((w_t, w_{t+j})\) pair to contribute additively to the learning process.
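
Written out for a single position \(t\), taking logarithms of the product gives:

\[ \log P(\text{context words} \mid w_t) \;=\; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P\big(w_{t+j} \mid w_t\big) \]

Averaging over all positions of the corpus then yields the usual form of the Skip-gram objective, where \(T\) (a symbol introduced here for illustration) denotes the number of positions in the corpus:

\[ \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P\big(w_{t+j} \mid w_t\big) \]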

Example

For example, if the context window contains \(4\) words, the joint probability of observing these context words given the center word can be written as:

\[ P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} \mid w_t) \;=\; P(w_{t-2} \mid w_t) \cdot P(w_{t-1} \mid w_t) \cdot P(w_{t+1} \mid w_t) \cdot P(w_{t+2} \mid w_t) \]

This expression shows how the joint probability of the context is decomposed into individual conditional probabilities under the independence assumption.

However, directly computing these probabilities using a full Softmax function is computationally expensive for large vocabularies. As a result, this objective is typically approximated in practice.

Negative sampling.

Word2Vec is therefore typically trained with an approximation known as negative sampling, which replaces the full conditional distribution with a binary classification task.

Instead of computing probabilities over the entire vocabulary, the model learns to distinguish between:

  • Positive pairs \((w, c_{pos})\): real context words.

  • Negative pairs \((w, c_{neg})\): randomly sampled words.

Negative sampling: training objective.

The training objective becomes:

\[ \mathcal{L}_{\text{obj}}(w,c_{pos},c_{neg}^{*}) \; =\; \log \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c_{pos}}) \;+\; \sum_{i=1}^{k} \log \sigma(-\mathbf{v}_w \cdot \mathbf{v}_{c_{neg_i}}) \]

This objective can be formally derived by modeling the joint probability of correctly classifying one positive pair and \(k\) negative samples, followed by a logarithmic transformation. A concise intuition is provided below, while a detailed derivation is presented in later sections.

This objective encourages the model to:

  • Increase the dot product between the target word and true context words.

  • Decrease the dot product between the target word and randomly sampled (noise) words.

In this way, the model learns meaningful vector representations by contrasting real and artificial contexts.

During training, this mechanism generates several \((\text{target}, \text{context})\) pairs from a single sentence, significantly increasing the number of training examples and improving statistical efficiency.

Negative sampling: intuition behind the objective.

The objective function arises from modeling the joint probability of correctly classifying one positive pair and several negative pairs:

\[ P\bigl(\text{labels}\mid w,c_{pos},c_{neg}^{*}\bigr) \; = \; P(+\mid w,c_{pos}) \,\prod_{i=1}^{k} P(-\mid w,c_{neg_i}) \]

Taking logarithms transforms this product into a sum, which directly leads to the training objective shown above.
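
The intermediate step can be sketched as follows, assuming (as is standard for negative sampling) that each binary decision is modeled by the sigmoid of the dot product between the two vectors:

\[ P(+\mid w,c) \;=\; \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c}), \qquad P(-\mid w,c) \;=\; 1 - \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c}) \;=\; \sigma(-\mathbf{v}_w \cdot \mathbf{v}_{c}) \]

Substituting these expressions into the product and taking logarithms gives:

\[ \log P\bigl(\text{labels}\mid w,c_{pos},c_{neg}^{*}\bigr) \;=\; \log \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c_{pos}}) \;+\; \sum_{i=1}^{k} \log \sigma(-\mathbf{v}_w \cdot \mathbf{v}_{c_{neg_i}}) \]

which matches \(\mathcal{L}_{\text{obj}}\) term by term.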

Negative sampling: example.

To illustrate this idea, consider the target word data and one of its true context words, such as analysis.

  • The pair \((\text{data}, \text{analysis})\) is treated as a positive example.

  • The model then samples a few unrelated words (e.g., tree, music, city) to form negative examples: \((\text{data}, \text{tree})\), \((\text{data}, \text{music})\), \((\text{data}, \text{city})\).

The objective is to assign a high score to the positive pair and low scores to the negative pairs.

This transforms the original problem into a series of binary classification decisions: distinguishing real context words from noise.

In practice, several negative samples are generated for each positive pair, making training both efficient and scalable.
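
A minimal sketch of how such labeled examples could be assembled is shown below. The vocabulary toy_vocab and the function sample_negatives() are hypothetical. For simplicity the noise words are drawn uniformly and without repetition, whereas the original Word2Vec implementation draws them from a smoothed word-frequency distribution.

```python
import random

def sample_negatives(vocab, target, positive, k, rng):
    """Draw k noise words uniformly, skipping the target and the true context word."""
    candidates = [w for w in vocab if w not in (target, positive)]
    return rng.sample(candidates, k)

# Hypothetical toy vocabulary for illustration.
toy_vocab = ["data", "analysis", "tree", "music", "city", "river", "chair"]

rng = random.Random(0)  # fixed seed so the draw is repeatable
negatives = sample_negatives(toy_vocab, target="data", positive="analysis", k=3, rng=rng)

# Label 1 marks the real context word, label 0 marks sampled noise words.
training_examples = [("data", "analysis", 1)] + [("data", w, 0) for w in negatives]
print(training_examples)
```

Each positive pair thus contributes \(k + 1\) binary decisions, regardless of the vocabulary size.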

Negative sampling: simple numerical illustration

To make this idea more concrete, consider a simplified example where word embeddings are represented as low-dimensional vectors.

To map similarity scores into probabilities, we use the sigmoid function, defined as

\[ \sigma(x) = \frac{1}{1 + e^{-x}}. \]

This function transforms any real-valued input into a value in the interval \((0,1)\), making it suitable for interpreting scores as probabilities. Intuitively, large positive dot products lead to values close to 1, large negative dot products produce values close to 0, and a dot product of 0 maps to exactly 0.5.

import numpy as np

# Target word embedding (data)
v_w = np.array([0.8, 0.2])

# Positive context (analysis)
v_pos = np.array([0.7, 0.3])

# Negative contexts
v_neg1 = np.array([-0.4, 0.6])   # tree
v_neg2 = np.array([0.1, -0.7])   # music

# Sigmoid function
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Scores
score_pos = sigmoid(np.dot(v_w, v_pos))
score_neg1 = sigmoid(np.dot(v_w, v_neg1))
score_neg2 = sigmoid(np.dot(v_w, v_neg2))

# Objective function (log-likelihood)
objective = (
    np.log(score_pos) +
    np.log(1 - score_neg1) +
    np.log(1 - score_neg2)
)

print("Objective value:", objective)
print("Positive pair score:", score_pos)
print("Negative pair scores:", score_neg1, score_neg2)
## Objective value: -1.692182726487229
## Positive pair score: 0.650218548573827
## Negative pair scores: 0.45016600268752216 0.4850044983805899

The positive pair receives a higher score (approximately 0.65) than the negative pairs (approximately 0.45 and 0.49), indicating stronger compatibility with the true context word.

The objective value is negative because it is computed as a sum of logarithms of probabilities in the interval \((0,1)\), and therefore \(\log(p) < 0\) for any such value. During training, the goal is to maximize this objective, making it progressively less negative over time.

In other words, the model improves when:

  • The positive pair score increases (closer to 1), and

  • The negative pair scores decrease (closer to 0).

This behavior directly reflects the optimization objective derived earlier.
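
To see the objective become less negative, the sketch below applies a single gradient ascent step to the target vector, reusing the same toy vectors. It is a simplified illustration: only the target vector is updated, and the learning rate is an arbitrary value chosen for the example. In actual training the context vectors are updated as well.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(v_w, v_pos, v_negs):
    """Negative-sampling objective for one positive pair and several negatives."""
    return np.log(sigmoid(v_w @ v_pos)) + sum(
        np.log(sigmoid(-(v_w @ v_n))) for v_n in v_negs
    )

def grad_wrt_target(v_w, v_pos, v_negs):
    """Gradient of the objective with respect to the target vector v_w."""
    g = (1.0 - sigmoid(v_w @ v_pos)) * v_pos
    for v_n in v_negs:
        g = g - sigmoid(v_w @ v_n) * v_n
    return g

v_w = np.array([0.8, 0.2])                                # data
v_pos = np.array([0.7, 0.3])                              # analysis
v_negs = [np.array([-0.4, 0.6]), np.array([0.1, -0.7])]   # tree, music

before = objective(v_w, v_pos, v_negs)
learning_rate = 0.5  # illustrative value, not a recommended setting
v_w_new = v_w + learning_rate * grad_wrt_target(v_w, v_pos, v_negs)  # ascent step
after = objective(v_w_new, v_pos, v_negs)

print("Objective before:", before)
print("Objective after: ", after)
```

The step moves the target vector toward the positive context vector and away from the negative ones, so the objective after the update is larger than before.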

10.0.3 Skip-gram: core components of this model

From a structural perspective, the Skip-gram model consists of the following key components:

  1. Input representation (one-hot encoding of the target word).

  2. Embedding matrix (\(|W| \times K\)). Here, \(|W|\) denotes the size of the vocabulary (i.e., the number of unique words), and \(K\) is the dimensionality of the embedding space. In practice, \(K\) is much smaller than \(|W|\) (\(K \ll |W|\)), allowing words to be represented in a compact, dense form.

  3. Context (output) embedding matrix.

  4. Output score vector.

  5. Softmax normalization.

  6. Loss computation and backpropagation.

Each component plays a specific role in transforming a discrete input word into a dense vector representation and updating the model parameters during training.

These components are illustrated in Figure 10.2, which provides a step-by-step view of how a target word is transformed into predictions over context words.

We now examine how each component operates in detail, starting with the input representation.

Figure 10.2: Skip-gram: core Components. Source: Created by the author with ChatGPT (OpenAI)

10.0.4 Skip-gram: input representation

In the Skip-gram architecture, the input word \(w_t\) is encoded as a one-hot vector of size \(|W| \times 1\), where \(|W|\) denotes the size of the vocabulary. Formally,

\[ \mathbf{x}_t \in \mathbb{R}^{|W|}, \quad x_{t,j} = \begin{cases} 1 & \text{if } j = \text{index}(w_t) \\ 0 & \text{otherwise} \end{cases}, \quad \|\mathbf{x}_t\|_1 = 1 \]

That is, exactly one component equals 1 (the position of \(w_t\) in the vocabulary), while all remaining components equal 0. This sparse representation constitutes the starting point of the forward pass through the embedding matrix.

Example.

To illustrate this idea, consider a vocabulary composed of four tokens:

data, models, learn, patterns

Then the corresponding one-hot encodings are:

  • data → 1 0 0 0

  • models → 0 1 0 0

  • learn → 0 0 1 0

  • patterns → 0 0 0 1

Each vector has length \(|W| = 4\), and only a single position is active in each case. Formally, we denote the vocabulary as:

\[\color{brown}{W=\{\texttt{data},\ \texttt{models},\ \texttt{learn},\ \texttt{patterns}\}}\]

Hence, the input space of one-hot representations is \(\mathbb{R}^{|W|}=\mathbb{R}^{4}\). A convenient way to visualize all possible one-hot inputs is through the following matrix:

\[\color{green}{\mathbf{X}^{(0)} = \left( \begin{array}{c|cccc} \text{Word} & \texttt{data} & \texttt{models} & \texttt{learn} & \texttt{patterns} \\ \hline \texttt{data} & 1 & 0 & 0 & 0 \\ \texttt{models} & 0 & 1 & 0 & 0 \\ \texttt{learn} & 0 & 0 & 1 & 0 \\ \texttt{patterns} & 0 & 0 & 0 & 1 \end{array} \right)}\]

Interpretation.

  • Each row represents a valid input configuration for the model.

  • Selecting the word learn as the target corresponds to activating the vector:

\[\color{blue}{\mathbf{x}_{\texttt{learn}} = \begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix}}\]

Although the Skip-gram model processes one target word at a time and never uses the full matrix simultaneously during training, this representation is pedagogically useful because it:

  • Makes the structure of the input space explicit,

  • Provides a clear connection to Bag-of-Words (BoW) representations, and

  • Prepares the transition to the embedding matrix, where these sparse vectors are mapped into dense semantic representations.
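
The one-hot encodings of this four-word vocabulary can be built with a short sketch. The helper one_hot() and the variable X0 are illustrative names.

```python
import numpy as np

vocab = ["data", "models", "learn", "patterns"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector of length |W| with a 1 at the word's index."""
    x = np.zeros(len(vocab), dtype=int)
    x[index[word]] = 1
    return x

X0 = np.stack([one_hot(w) for w in vocab])  # rows are all possible inputs
print(one_hot("learn"))
print(X0)
```

Stacking the four vectors gives the identity matrix, which is the matrix of all possible one-hot inputs shown above.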

10.0.5 Skip-gram: embedding matrix

The next component of the Skip-gram architecture is the embedding matrix, denoted as \(\mathbf{E} \in \mathbb{R}^{|W| \times K}\), where:

  • \(|W|\) is the size of the vocabulary, and

  • \(K\) is the embedding dimension, that is, the number of latent features used to represent each word.

This matrix is not directly constructed from the data. Instead, it is treated as a set of model parameters that are learned during training. It is typically initialized with small random values or with structured initialization schemes designed to improve numerical stability and convergence.

During training, the entries of \(\mathbf{E}\) are iteratively updated through backpropagation so that words appearing in similar contexts acquire similar vector representations.

When a one-hot input vector corresponding to a target word is multiplied by the embedding matrix, the operation does not involve a full matrix multiplication in practice. Instead, it effectively selects the row of the embedding matrix associated with the active position in the one-hot vector.

The selected row constitutes the intermediate embedding vector, a dense vector of length \(K\) that encodes the semantic representation of the target word in the embedding space (see Figure 10.3).
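
The equivalence between the matrix product and a row lookup can be checked numerically. In this sketch the embedding dimension \(K = 3\) and the random initialization are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "models", "learn", "patterns"]
K = 3                                            # illustrative embedding dimension
E = rng.normal(scale=0.1, size=(len(vocab), K))  # small random initialization

idx = vocab.index("learn")
x = np.zeros(len(vocab))
x[idx] = 1.0                                     # one-hot vector for "learn"

v_matmul = x @ E   # full product with the one-hot vector
v_lookup = E[idx]  # direct row lookup

print(np.allclose(v_matmul, v_lookup))
```

Because both computations give the same vector, implementations can skip the multiplication and index the row directly.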

Figure 10.3: Embedding Matrix. Source: author’s own elaboration

Importantly, the embedding matrix starts as a random lookup table and gradually evolves into a meaningful geometric representation of the vocabulary through training.

10.0.6 Skip-gram: context (or prediction) matrix

A second trainable parameter matrix, commonly referred to as the context matrix (or prediction matrix), is introduced in the Skip-gram architecture. This matrix is denoted by \(\mathbf{C} \in \mathbb{R}^{|W| \times K}\), where:

  • \(|W|\) is the vocabulary size, and

  • \(K\) is the embedding dimension.

The intermediate embedding vector \(\mathbf{v} \in \mathbb{R}^{1 \times K}\) obtained from the embedding matrix is then combined with the context matrix to produce a score for each word in the vocabulary. This operation is given by:

\[ \mathbf{z} = \mathbf{v}\,\mathbf{C}^\top \]

where \(\mathbf{z} \in \mathbb{R}^{1 \times |W|}\) is a vector of unnormalized scores.

Operationally, this step computes the dot product between the embedding vector \(\mathbf{v}\) and every row of the context matrix. Each entry \(z_j\) reflects how compatible the target word is with the \(j\)-th word in the vocabulary.

Conceptually, this operation measures the alignment between the semantic representation of the target word and each possible context word. Words that are more semantically or syntactically compatible with the target receive higher scores, indicating stronger contextual association (see Figure 10.4).
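
The score computation can be sketched with random toy matrices. The sizes (\(|W| = 4\), \(K = 3\)) and the values are arbitrary and serve only to check shapes and the row-by-row interpretation.

```python
import numpy as np

rng = np.random.default_rng(1)
W_size, K = 4, 3                  # toy vocabulary size and embedding dimension
v = rng.normal(size=(1, K))       # embedding of the target word (row vector)
C = rng.normal(size=(W_size, K))  # context matrix, one row per vocabulary word

z = v @ C.T                       # unnormalized scores, shape (1, |W|)

# Each z_j is the dot product between v and the j-th row of C.
z_by_rows = np.array([[np.dot(v[0], C[j]) for j in range(W_size)]])
print(z.shape, np.allclose(z, z_by_rows))
```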

Figure 10.4: Context (or prediction) Matrix. Source: author’s own elaboration

10.0.7 Skip-gram: output vector and softmax normalization

Overview.

The computation performed in the previous step yields an output vector \(\mathbf{z} \in \mathbb{R}^{1 \times |W|}\), where \(|W|\) denotes the vocabulary size.

Each component of this vector corresponds to an unnormalized score that reflects how strongly the model associates a given vocabulary word with the current target word as a potential context word. At this stage, these scores are real-valued and do not yet constitute probabilities.

The vector \(\mathbf{z}\) is obtained from a linear transformation and must be normalized to form a valid probability distribution.

Softmax function.

To convert these raw scores into a probabilistic interpretation, the softmax function is applied.

Formally, given a score vector \[\mathbf{z} = (z_1, z_2, \dots, z_{|W|}), \]

the softmax transformation is defined componentwise as

\[\text{softmax}(\mathbf{z})_i \;=\; \frac{\exp(z_i)}{\sum\limits_{j=1}^{|W|} \exp(z_j)}\]

A detailed discussion of this function and its probabilistic interpretation can be found in my notes on logistic regression. The result of this operation is a normalized output vector whose components sum to one and can be interpreted as probabilities over the vocabulary. This vector satisfies

\[\text{softmax}(\mathbf{z}) \in \mathbb{R}^{1 \times |W|},\]

and therefore has the same dimensionality as the input score vector. Here, the vector \(\mathbf{z}\) represents the output score vector produced by the model prior to normalization (see Figure 10.5).

Figure 10.5: Softmax Matrix. Source: author’s own elaboration

Each component \(z_i\) corresponds to the model’s score for the \(i^{\text{th}}\) word in the vocabulary being the correct context word.

Applying the softmax function guarantees that:

  • All resulting values lie in the interval \([0, 1]\),

  • The values sum to 1 across the vocabulary,

  • Each value can be interpreted as a probability.

10.0.8 Skip-gram: example (output vector and softmax normalization)

Suppose the model produces the following score vector for a given target word:

\[\color{blue}{\mathbf{z} =\left(\begin{array}{c} 1.5 \\ 0.5 \\ 2.5 \\ 1.0 \\ 0.2 \end{array}\right) \in \mathbb{R}^{5}}\]

Each entry corresponds to a different vocabulary word. To convert these scores into probabilities, we apply the softmax function. The denominator (normalizing constant) is:

\[\color{green}{\sum_{j=1}^{5} \exp(z_j) = \exp(1.5) + \exp(0.5) + \exp(2.5) + \exp(1.0) + \exp(0.2)}.\]

Therefore, the softmax vector can be written compactly as:

\[\color{orange}{\text{softmax}(\mathbf{z}) = \frac{1} {\exp(1.5)+\exp(0.5)+\exp(2.5)+\exp(1.0)+\exp(0.2)} \begin{pmatrix} \exp(1.5) \\ \exp(0.5) \\ \exp(2.5) \\ \exp(1.0) \\ \exp(0.2) \end{pmatrix}= \begin{pmatrix} 0.20140079 \\ 0.07409121 \\ 0.54746412 \\ 0.12215576 \\ 0.05488812 \end{pmatrix}}.\]

The resulting vector belongs to \(\mathbb{R}^{5}\), its components lie in the interval \([0,1]\), and they sum to one, forming a valid probability distribution over the vocabulary. Each value indicates the likelihood that the corresponding word is the correct context word for the given target. In Python:

import numpy as np

z = np.array([1.5, 0.5, 2.5, 1.0, 0.2])
np.exp(z) / np.sum(np.exp(z))
## array([0.20140079, 0.07409121, 0.54746412, 0.12215576, 0.05488812])

Words with larger scores in the original vector \(\mathbf{z}\) receive higher probabilities after normalization, while smaller scores are comparatively suppressed.

This transformation converts raw compatibility scores into a probability distribution over all possible context words.
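The direct formula works for small scores, but `np.exp` overflows when scores are large. A common variant, sketched here as an illustration rather than as the implementation used by any particular library, subtracts the maximum score before exponentiating. Softmax is unchanged by adding a constant to every score, so the probabilities are the same.

```python
import numpy as np

def softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    shifted = z - np.max(z)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

z = np.array([1.5, 0.5, 2.5, 1.0, 0.2])
p = softmax(z)

print(p)                       # same probabilities as the direct formula
print(p.sum())
print(softmax(z + 1000.0))     # still finite, although exp(1000) alone would overflow
```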

10.0.9 Skip-gram: loss computation and backpropagation

Intuition: from probabilities to error

Once the model produces a probability distribution over the vocabulary, this predicted vector is compared against the true context word, which is encoded as a one-hot vector.

The difference between the predicted probabilities and the true target representation quantifies the training error, commonly referred to as the loss. This value reflects how accurately the model identifies the correct context word given the target word.

Error signal

The discrepancy between the predicted and target vectors can be visualized as:

\[ \mathbf{e} = \color{blue}{ \begin{array}{c} \mathbf{x}_{\text{target}}\\ \left(\begin{array}{c} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{array}\right) \end{array}} \quad - \quad \color{green}{ \begin{array}{c} \mathbf{y}_{\text{predicted}}\\ \left(\begin{array}{c} 0.20 \\ 0.10 \\ 0.50 \\ 0.04 \\ 0.00 \\ \vdots \end{array}\right) \end{array}} \quad = \quad \color{red}{ \begin{array}{c} \text{Error}\\ \left(\begin{array}{r} 0.80 \\ -0.10 \\ -0.50 \\ -0.04 \\ 0.00 \\ \vdots \end{array}\right) \end{array}} \]

This error vector \(\mathbf{e}\) indicates how much each predicted probability deviates from the true target.

For the correct word, the error is positive, indicating that its probability should be increased. For all other words, the error is negative (or zero when the predicted probability is already zero), indicating that their probabilities should be reduced.

Backpropagation

Each parameter is updated in proportion to its contribution to the prediction error. As a result:

  • The embedding of the target word is adjusted.

  • The embeddings of context words are updated.

  • The model gradually improves its predictions.

The gradients behind these updates are computed by backpropagation, and the updates are applied iteratively over the training pairs.
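A single update can be sketched as follows. This is an illustration under stated assumptions: the loss is taken to be the cross-entropy \(-\log\) of the probability assigned to the true context word (the text does not name the loss explicitly), the update is plain gradient descent, and the sizes, indices, and `learning_rate` are hypothetical. Under these assumptions the gradient of the loss with respect to the scores is the negative of the error vector \(\mathbf{e}\).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, K = 5, 3                              # toy sizes (hypothetical)
E = rng.normal(scale=0.1, size=(vocab_size, K))   # embedding matrix
C = rng.normal(scale=0.1, size=(vocab_size, K))   # context matrix
target_idx, context_idx = 2, 4                    # hypothetical (target, context) pair
learning_rate = 0.1                               # assumed step size

def context_probabilities(E, C, target_idx):
    z = E[target_idx] @ C.T
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

p = context_probabilities(E, C, target_idx)
loss_before = -np.log(p[context_idx])             # cross-entropy loss (assumed)

y = np.zeros(vocab_size)
y[context_idx] = 1.0                              # one-hot vector of the true context word
e = y - p                                         # error vector, as in the text

v = E[target_idx].copy()
grad_v = -(e @ C)                                 # gradient of the loss w.r.t. the target embedding
grad_C = -np.outer(e, v)                          # gradient of the loss w.r.t. the context matrix

E[target_idx] -= learning_rate * grad_v           # adjust the target embedding
C -= learning_rate * grad_C                       # adjust the context vectors

loss_after = -np.log(context_probabilities(E, C, target_idx)[context_idx])
print(loss_before, loss_after)
```

With a small step size the loss after the update is lower than before, which is the sense in which the model "gradually improves its predictions".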

10.0.10 Skip-gram: inference and learned embeddings

The diagram in Figure 10.6 summarizes the main components and interactions involved in training a Word2Vec model using the Skip-gram approach.

Figure 10.6: Forward and backward propagation in the Skip-gram Word2Vec model. Source: author’s own elaboration

Training is performed over multiple passes through the corpus, commonly referred to as epochs. As training progresses, the embedding matrix stabilizes and converges to a set of meaningful vector representations.

After training is complete, each row of the embedding matrix corresponds to the learned vector for a specific word in the vocabulary. These vectors constitute the final word embeddings and can be extracted for use in downstream tasks, such as:

  • Measuring semantic similarity,

  • Performing analogy queries, or

  • Serving as input features for other machine learning models.

Figure 10.7: Applications of Learned Word Embeddings. Source: Created by the author with ChatGPT (OpenAI)
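As a small illustration of the first downstream task, the sketch below reads word vectors from the rows of an embedding matrix and compares them with cosine similarity. The vocabulary and the numeric values in `E` are hypothetical and chosen by hand; they do not come from a trained model.

```python
import numpy as np

# Hypothetical trained embedding matrix: one row per vocabulary word
vocab = ["store", "shop", "plaza", "opened"]
E = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.4],
    [0.1, 0.9, 0.2],
    [0.2, 0.3, 0.9],
])
word_to_index = {w: i for i, w in enumerate(vocab)}

def embedding(word):
    """Return the row of E that holds the vector of the given word."""
    return E[word_to_index[word]]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(embedding("store"), embedding("shop"))
sim_far = cosine_similarity(embedding("store"), embedding("plaza"))
print(sim_close, sim_far)
```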

Having walked through training with the full softmax, we now turn to its main computational limitation and to the approximation that is used in practice.

10.0.11 From softmax to negative sampling: motivation

Limitations of softmax and the need for negative sampling

The softmax formulation requires computing a full probability distribution over the entire vocabulary. However, this becomes computationally expensive when the vocabulary size is large.

To address this limitation, Word2Vec is typically trained using an approximation known as negative sampling, which reformulates the problem as a binary classification task.

Negative sampling objective: derivation

In this formulation, the probability of a positive pair is modeled using a sigmoid function applied to the dot product between the corresponding word embeddings.

We start from the joint probability of correctly classifying one positive pair and \(k\) negative pairs:

\[ L(w,c_{pos},c_{neg}^{*}) \; =\; P\bigl(\text{labels}\mid w,c_{pos},c_{neg}^{*}\bigr) \; = \; P(+\mid w,c_{pos}) \, \prod_{i=1}^{k} P(-\mid w,c_{neg_i}) \]

Taking the negative logarithm, the loss becomes:

\[ \begin{aligned} \mathcal{L}(w,c_{pos},c_{neg}^{*}) &= -\log\left[ P(+\mid w,c_{pos}) \prod_{i=1}^{k} P(-\mid w,c_{neg_i}) \right] \\[4pt] &= -\left[ \log P(+\mid w,c_{pos}) + \sum_{i=1}^{k}\log P(-\mid w,c_{neg_i}) \right]. \end{aligned} \]

Using the identity \(P(-\mid w,c)=1-P(+\mid w,c)\), we obtain:

\[ \mathcal{L}(w,c_{pos},c_{neg}^{*}) = -\left[ \log P(+\mid w,c_{pos}) + \sum_{i=1}^{k}\log\bigl(1-P(+\mid w,c_{neg_i})\bigr) \right]. \]

Assuming a sigmoid parameterization:

\[ P(+\mid w,c)=\sigma(\mathbf{v}_w\cdot \mathbf{v}_c), \]

we have:

\[ P(-\mid w,c)=1-\sigma(\mathbf{v}_w\cdot \mathbf{v}_c) = \sigma(-\mathbf{v}_w\cdot \mathbf{v}_c), \]

and therefore:

\[ \mathcal{L}(w,c_{pos},c_{neg}^{*}) \; =\; -\; \underbrace{\left[ \log \sigma(\mathbf{v}_w\cdot \mathbf{v}_{c_{pos}}) + \sum_{i=1}^{k}\log \sigma(-\mathbf{v}_w\cdot \mathbf{v}_{c_{neg_i}}) \right]}_{\mathcal{L}_{\text{obj}}(w,c_{pos},c_{neg}^{*})} \]

Equivalently, minimizing this loss is the same as maximizing the log-likelihood objective:

\[ \mathcal{L}_{\text{obj}}(w,c_{pos},c_{neg}^{*}) \; =\; \log \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c_{pos}}) \;+\; \sum_{i=1}^{k} \log \sigma(-\mathbf{v}_w \cdot \mathbf{v}_{c_{neg_i}}) \]

Interpretation.

This objective encourages the model to:

  • Increase the similarity between the target word and true context words,

  • Decrease the similarity with randomly sampled words.

As a result, the model learns embeddings that capture meaningful semantic relationships through contrastive learning, while avoiding the computational cost of a full softmax over the vocabulary.

Important remark: no fixed threshold (relative ranking).

In practical terms, the model improves when:

  • The positive score increases (moves closer to 1), and

  • The negative scores decrease (move closer to 0).

This behavior reflects the optimization objective, which pushes true context words closer to the target word while pushing randomly sampled words farther away.

It is important to note, however, that there is no fixed threshold separating positive and negative pairs.

Instead, the model relies on relative comparisons:

\[ \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c_{pos}}) \;>\; \sigma(\mathbf{v}_w \cdot \mathbf{v}_{c_{neg}}) \]

That is, the score assigned to a true context word should be higher than the scores assigned to negative samples.

In practice, learning is driven by ranking rather than absolute values. The model does not aim to produce perfectly calibrated probabilities, but rather to ensure that true context words consistently receive higher scores than randomly sampled words.
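The effect of the objective on the scores can be illustrated with one gradient-ascent step. The sketch below uses hypothetical 2-dimensional vectors and an assumed `learning_rate`. The gradients follow from \(\frac{d}{dx}\log\sigma(x) = 1 - \sigma(x)\) and \(\frac{d}{dx}\log\sigma(-x) = -\sigma(x)\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(v_w, v_pos, v_negs):
    """Negative sampling objective for one positive pair and k negative pairs."""
    return np.log(sigmoid(v_w @ v_pos)) + sum(np.log(sigmoid(-(v_w @ v_n))) for v_n in v_negs)

# Hypothetical embeddings
v_w = np.array([0.8, 0.2])                                 # target word
v_pos = np.array([0.7, 0.3])                               # positive context
v_negs = [np.array([-0.4, 0.6]), np.array([0.1, -0.7])]    # negative samples
learning_rate = 0.1                                        # assumed step size

before = objective(v_w, v_pos, v_negs)

pos_weight = 1.0 - sigmoid(v_w @ v_pos)                    # how far the positive score is from 1
neg_weights = [sigmoid(v_w @ v_n) for v_n in v_negs]       # how far each negative score is from 0

# Gradients of the objective with respect to each vector
grad_w = pos_weight * v_pos - sum(w * v_n for w, v_n in zip(neg_weights, v_negs))
grad_pos = pos_weight * v_w
grad_negs = [-w * v_w for w in neg_weights]

# Gradient ascent: move every vector in the direction that increases the objective
v_w_new = v_w + learning_rate * grad_w
v_pos_new = v_pos + learning_rate * grad_pos
v_negs_new = [v_n + learning_rate * g for v_n, g in zip(v_negs, grad_negs)]

after = objective(v_w_new, v_pos_new, v_negs_new)
print(before, after)
print(sigmoid(v_w @ v_pos), sigmoid(v_w_new @ v_pos_new))
```

After the step the objective is larger, the positive score has moved toward 1, and the negative scores have moved toward 0.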

10.0.12 Negative sampling: verifying the objective function

To better understand the behavior of the objective function and to check the consistency of the theoretical formulation, we now compute it numerically for a simple example, using the expression based on the sigmoid function.

Step 1. Numerical computation.

import numpy as np

# Embeddings
v_w = np.array([0.8, 0.2])        # target word
v_pos = np.array([0.7, 0.3])      # positive context
v_neg1 = np.array([-0.4, 0.6])    # negative context 1
v_neg2 = np.array([0.1, -0.7])    # negative context 2

# Sigmoid
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Dot products
pos_dot = np.dot(v_w, v_pos)
neg1_dot = np.dot(v_w, v_neg1)
neg2_dot = np.dot(v_w, v_neg2)

# Objective function (log-likelihood)
objective = (
    np.log(sigmoid(pos_dot)) +
    np.log(sigmoid(-neg1_dot)) +
    np.log(sigmoid(-neg2_dot))
)

print("Positive pair score:", sigmoid(pos_dot))
print("Negative pair scores:", sigmoid(neg1_dot), sigmoid(neg2_dot))
print("Objective function value: ", objective)
## Positive pair score: 0.650218548573827
## Negative pair scores: 0.45016600268752216 0.4850044983805899
## Objective function value:  -1.692182726487229

Step 2. Interpretation of the outputs.

The outputs of the computation can be interpreted as follows:

  • The positive pair score is approximately \(0.65\). This indicates a relatively strong association between the target word and its true context word.

  • The negative pair scores are approximately \(0.45\) and \(0.49\). These values are lower than the positive score, indicating weaker compatibility with the target word.

This behavior is consistent with the goal of the model:

  • Assign higher scores to true context words, and

  • Assign lower scores to randomly sampled (noise) words.

The objective function value is approximately \(-1.69\).

Since the objective is defined as a sum of logarithms of probabilities (which lie in \((0,1)\)), its value is negative. During training, the goal is to maximize this objective, making it less negative over time.

Step 3. Connection with the objective function.

This computation confirms that the objective function takes larger values when:

  • The dot product between the target and the positive context is large, and

  • The dot products with negative samples are small (or negative).

This pushes true context words closer to the target word in the embedding space, while pushing unrelated words farther away.

This result is consistent with the goal of maximizing the likelihood of observing true context words while minimizing the likelihood of noise samples.

This formulation admits an equivalent expression based on \(\log(1 - \sigma(x))\), as discussed earlier. The form based on \(\sigma(-x)\) is more compact and numerically more stable, and it is commonly adopted in implementations of Word2Vec.

To make this equivalence explicit, we verify it numerically below.

10.0.13 Negative sampling: equivalence of formulations

An important identity used in the derivation is the equivalence:

\[ \log(1 - \sigma(x)) = \log(\sigma(-x)). \]

This follows from the property:

\[ 1 - \sigma(x) = \sigma(-x). \]

We can verify this numerically:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = 0.5

lhs = np.log(1 - sigmoid(x))
rhs = np.log(sigmoid(-x))

print("log(1 - sigmoid(x)):", lhs)
print("log(sigmoid(-x)):", rhs)
## log(1 - sigmoid(x)): -0.9740769841801068
## log(sigmoid(-x)): -0.9740769841801068

Both expressions produce the same value (up to numerical precision), confirming the identity \(\log(1 - \sigma(x)) = \log(\sigma(-x))\).

This identity explains why the objective can be written either as:

  • \(\log(1 - \sigma(x))\), or

  • \(\log(\sigma(-x))\),

with the latter often preferred for notational simplicity and numerical stability.

The same identity appears in implementations of Word2Vec and of logistic models more generally.
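The stability argument can be made concrete with a large score. In the sketch below, the value `x = 40.0` is chosen only for illustration. In double precision \(\sigma(40)\) rounds to exactly 1, so \(\log(1 - \sigma(x))\) evaluates to \(-\infty\), whereas \(\log \sigma(-x)\) stays finite. The third line uses `np.logaddexp`, which computes \(\log(e^{a} + e^{b})\) and gives \(-\log(1 + e^{x})\), the same quantity, without forming \(e^{x}\) directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 40.0  # a large score, chosen for illustration

with np.errstate(divide="ignore"):
    naive = np.log(1.0 - sigmoid(x))     # 1 - sigmoid(x) rounds to 0.0, so the log is -inf

compact = np.log(sigmoid(-x))            # stays finite, close to -x
stable = -np.logaddexp(0.0, x)           # -log(1 + exp(x)), the same quantity

print(naive, compact, stable)
```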

11 The CBOW approach: overview

11.0.1 CBOW: conceptual overview

The Continuous Bag-of-Words (CBOW) model is closely related to the Skip-gram architecture, but it reverses the direction of prediction. While Skip-gram predicts surrounding words from a target word, CBOW predicts the target word from its surrounding context.

\[\text{Context words} \;\longrightarrow\; \text{Target word}\]

In this formulation, multiple context words are aggregated into a single representation, which is then used to infer the missing center word.

This idea is illustrated in Figure 11.1, where the CBOW model is shown predicting a missing target word from its surrounding context, while the Skip-gram model operates in the opposite direction.

Figure 11.1: CBOW approach. Source: Created by the author with ChatGPT (OpenAI)

Although the network structure differs slightly from Skip-gram, both models rely on the same distributional principle:

Words that appear in similar contexts tend to have similar representations.

As a result, CBOW and Skip-gram typically produce embeddings of comparable quality.

11.0.2 CBOW: a many-to-one prediction problem

CBOW can be interpreted as a many-to-one prediction task. Formally, for a given position \(t\) in a corpus and a context window of size \(s\), the model seeks to maximize:

\[ P\big(w_t \mid w_{t-s}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+s}\big) \]

In words, the objective is to maximize the probability of observing the center word \(w_t\) given its surrounding context words. That is, among all words in the vocabulary, the model aims to assign the highest probability to the actual word that appears in the middle of the context window.

Equivalently, the CBOW model attempts to answer the following question:

Given these neighboring words, which word is most likely to occupy the center position?

Conceptually:

  • The input consists of multiple neighboring words.

  • These context words are combined into a single representation.

  • The output is a single predicted target word.

This contrasts with Skip-gram, which solves a one-to-many problem by predicting several context words from a single center word.
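The many-to-one structure can be made concrete by listing the (context, target) pairs that a corpus produces. The sketch below is a minimal illustration: the helper `cbow_pairs` is a hypothetical name, the example sentence is tokenized by whitespace, and windows are simply truncated at the sentence boundaries (an assumption, since the text does not specify boundary handling).

```python
def cbow_pairs(tokens, s):
    """Return (context_words, target_word) pairs for a window of size s."""
    pairs = []
    for t, target in enumerate(tokens):
        left = tokens[max(0, t - s):t]          # up to s words to the left
        right = tokens[t + 1:t + 1 + s]         # up to s words to the right
        pairs.append((left + right, target))
    return pairs

tokens = "The new store opened near the plaza".split()
pairs = cbow_pairs(tokens, s=2)

context, target = pairs[2]
print(context, "->", target)
```

Each pair feeds several context words into the model and asks for a single target word. Skip-gram would instead expand the same window into several (target, context) pairs.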

11.0.3 CBOW: input and hidden representation

In the CBOW architecture, the model receives multiple context words as input and combines their representations into a single vector.

Let \(s\) denote the context window size, and consider a target position \(t\) in the corpus. The context consists of the words:

\[ \{w_{t-s}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+s}\}. \]

Before introducing the compact mathematical expression, it is helpful to describe the computation as a sequence of steps:

\[ \underbrace{\text{Context words}}_{(1)} \;\longrightarrow\; \underbrace{\text{One-hot vectors}}_{(2)} \;\longrightarrow\; \underbrace{\text{Embeddings}}_{(3)} \;\longrightarrow\; \underbrace{\text{Average} \;\longrightarrow\; \mathbf{h}}_{(4)} \]

That is, the model first identifies the surrounding words, maps each of them to its embedding vector, and then averages those vectors to obtain a single contextual representation.

Step 1. Context words.

The input consists of the neighboring words around the target position:

\[ \{w_{t-s}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+s}\}. \]

These are the words used to predict the center word \(w_t\).

Step 2. One-hot representation.

As in the Skip-gram model, each context word is initially represented as a one-hot vector:

\[ \mathbf{x}_{t+j} \in \mathbb{R}^{|W|}, \quad x_{t+j,\,i} = \begin{cases} 1 & \text{if } i = \text{index}(w_{t+j}) \\ 0 & \text{otherwise} \end{cases}, \quad \|\mathbf{x}_{t+j}\|_1 = 1 \]

Step 3. Embedding lookup.

Each one-hot vector is mapped through the embedding matrix to its corresponding dense vector:

\[ \mathbf{x}_{t+j} \;\xrightarrow{\text{embedding lookup}}\; \mathbf{v}_{w_{t+j}} \]

This mapping is implemented through the embedding matrix:

\[ \mathbf{v}_{w_{t+j}} = \mathbf{x}_{t+j} \mathbf{E}, \qquad \mathbf{v}_{w_{t+j}} \; \in \; \mathbb{R}^{K}, \]

where \(\mathbf{x}_{t+j} \in \mathbb{R}^{|W|}\) is a one-hot vector and \(\mathbf{E} \in \mathbb{R}^{|W| \times K}\) is the embedding matrix.

Step 4. Aggregation.

The CBOW model then combines all context embeddings into a single hidden representation:

\[ \mathbf{h} \;=\; \frac{1}{c} \sum_{\substack{-s \le j \le s \\ j \ne 0}} \mathbf{v}_{w_{t+j}}, \]

where:

  • \(c\) is the number of context words used,

  • typically \(c = 2s\) for a symmetric window,

  • \(\mathbf{h} \in \mathbb{R}^K\) is the context representation vector.

Here, the index \(j\) indicates the relative position of a context word with respect to the center word \(w_t\):

  • Negative values of \(j\) correspond to words to the left,

  • Positive values of \(j\) correspond to words to the right,

  • The condition \(j \ne 0\) excludes the center word itself.

Therefore, the sum simply means:

Add the embedding vectors of all surrounding words, but do not include the target word.

The factor \(\frac{1}{c}\) converts this sum into an average, ensuring that the magnitude of the resulting vector remains stable regardless of the number of context words.

Thus, instead of using a single word (as in Skip-gram), CBOW compresses the entire context into a single dense vector.

This averaging operation assumes that each context word contributes equally to the prediction.
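Steps 2 to 4 can be sketched with NumPy. This is an illustration with toy sizes, a randomly initialized embedding matrix, and hypothetical context indices. It also shows that multiplying a one-hot vector by \(\mathbf{E}\) is the same as selecting a row of \(\mathbf{E}\).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, K = 6, 4                       # toy sizes (hypothetical)
E = rng.normal(size=(vocab_size, K))       # embedding matrix

context_indices = [0, 1, 3, 4]             # hypothetical indices of the c = 4 context words

X = np.eye(vocab_size)[context_indices]    # one-hot rows, shape (c, vocab_size)
V_context = X @ E                          # embedding lookup, shape (c, K)
h = V_context.mean(axis=0)                 # average over the context, shape (K,)

# The product with a one-hot vector simply selects a row of E
print(np.allclose(V_context, E[context_indices]))
print(h.shape)
```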

Having constructed the aggregated context representation \(\mathbf{h}\), we now describe how it is transformed into a probability distribution over the vocabulary.

11.0.4 CBOW: from hidden representation to prediction

The next steps describe how the context representation \(\mathbf{h}\) is transformed into a probability distribution over the vocabulary.

Step 5. Scores.

Once the aggregated vector \(\mathbf{h}\in \mathbb{R}^K\) is obtained, the model computes a score for each word in the vocabulary.

This is done using a linear transformation:

\[ \mathbf{z} \;=\; \mathbf{h}\mathbf{C}^\top, \quad \mathbf{z} \in \mathbb{R}^{1 \times |W|}, \]

where:

  • \(\mathbf{C} \in \mathbb{R}^{|W| \times K}\) is the output (context) matrix,

  • each component \(z_i\) represents the score associated with the \(i^{\text{th}}\) word in the vocabulary.

Thus, the vector \(\mathbf{z}\) assigns a real-valued compatibility score to every possible candidate word.

Step 6. Softmax.

These scores are then transformed into probabilities using the softmax function:

\[ P(w_i \mid \text{context}) \; =\; \frac{\exp(z_i)}{\sum\limits_{j=1}^{|W|} \exp(z_j)} \]

This operation converts the raw scores into a valid probability distribution over the vocabulary, where all values lie in \([0,1]\) and sum to one.

Step 7. Target word.

The model then selects (or assigns highest probability to) the word corresponding to the largest value in the softmax output:

\[ \hat{w}_t = \arg\max_{w_i} \; P(w_i \mid \text{context}). \]

Here, \(\arg\max\) returns the word index (or token) that achieves the highest probability, not the probability value itself.

Ideally, the correct center word \(w_t\) should receive the highest probability.

Thus, the model predicts the word that best fits the given context.

This completes the forward pass of the CBOW model.

11.0.5 CBOW: full computational flow

The CBOW model can be summarized as the following pipeline:

\[ \underbrace{\text{Context words}}_{(1)} \;\longrightarrow\; \underbrace{\text{One-hot vectors}}_{(2)} \;\longrightarrow\; \underbrace{\text{Embeddings}}_{(3)} \;\longrightarrow\; \underbrace{\text{Average}\ ( \mathbf{h})}_{(4)} \;\longrightarrow\; \underbrace{\text{Scores } (\mathbf{z})}_{(5)} \;\longrightarrow\; \underbrace{\text{Softmax}}_{(6)} \;\longrightarrow\; \underbrace{\text{Target word}}_{(7)} \]

This formulation highlights that CBOW is a many-to-one model:

multiple context words are combined to predict a single target word.
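The pipeline can be sketched end to end as a single function. This is an illustration only: `cbow_forward` is a hypothetical helper, the vocabulary is a toy one, and the matrices are randomly initialized, so the predicted word is arbitrary until the model is trained.

```python
import numpy as np

def cbow_forward(context_indices, E, C):
    """Forward pass of CBOW: returns the probability vector and the predicted index."""
    h = E[context_indices].mean(axis=0)    # steps 2-4: embedding lookup and average
    z = h @ C.T                            # step 5: one score per vocabulary word
    exp_z = np.exp(z - z.max())            # step 6: softmax (shifted for stability)
    p = exp_z / exp_z.sum()
    return p, int(np.argmax(p))            # step 7: index of the most probable word

vocab = ["The", "new", "store", "opened", "near", "the", "plaza"]   # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
K = 4                                                   # toy embedding dimension
E = rng.normal(scale=0.1, size=(len(vocab), K))         # embedding matrix (untrained)
C = rng.normal(scale=0.1, size=(len(vocab), K))         # output (context) matrix (untrained)

context = ["The", "new", "opened", "near"]              # step 1: context words
p, predicted_index = cbow_forward([word_to_index[w] for w in context], E, C)

print(p.sum())
print(vocab[predicted_index])
```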

11.0.6 CBOW: relation to Skip-gram

The CBOW architecture closely mirrors the Skip-gram model, but the direction of prediction is reversed. While Skip-gram uses a single target word to predict its surrounding context, CBOW aggregates multiple context words to predict a single target word.

Table 11.1 summarizes the main differences between the two architectures.

Table 11.1: Comparison between the CBOW and Skip-gram architectures.

| Model | Input | Output | Prediction type | Core idea |
|---|---|---|---|---|
| CBOW | Context words | Target word \(w_t\) | Many-to-one | Predict the center word from surrounding context |
| Skip-gram | Target word \(w_t\) | Context words | One-to-many | Predict surrounding words from a single word |

Although the direction of prediction differs, both models rely on the same distributional principle:

Words that appear in similar contexts tend to have similar representations.

Consequently, both architectures often produce embeddings of comparable semantic quality, even though they differ in how contextual information is processed.

11.0.7 CBOW: illustrative example

Sentence and target word.

To build intuition, consider the following sentence:

"The new store opened near the plaza"

Suppose we want to predict the missing center word in the fragment:

"The new ___ opened near the plaza"

Step 1. Context window.

Using a context window of size \(s = 2\), the model observes the surrounding words:

\[ \{\texttt{The},\ \texttt{new},\ \texttt{opened},\ \texttt{near}\} \]

Assume the following dimensions:

  • Vocabulary size: \(|W| = 10,000\).

  • Embedding dimension: \(K = 100\).

Step 2. One-hot representation.

Each context word \(w\) is first represented as a one-hot vector:

\[ \mathbf{x}_{w} \in \mathbb{R}^{10,000}, \quad x_{{w},j} = \begin{cases} 1 & \text{if } j = \text{index}(w) \\ 0 & \text{otherwise} \end{cases}, \quad \|\mathbf{x}_{w}\|_1 = 1 \]

where all entries are zero except for a single 1 indicating the position of the word in the vocabulary.

For example:

\[ \mathbf{x}_{\texttt{new}}, \quad \mathbf{x}_{\texttt{opened}}, \quad \mathbf{x}_{\texttt{near}} \]

Each of these vectors is high-dimensional and sparse.

Step 3. Embedding representation.

Each one-hot vector is mapped to a dense embedding vector through the embedding matrix:

\[ \mathbf{v}_{w} = \mathbf{x}_{w} \mathbf{E}, \qquad \mathbf{v}_{w} \; \in \; \mathbb{R}^{100}, \]

where \(\mathbf{x}_{w} \in \mathbb{R}^{10,000}\) is a one-hot vector and \(\mathbf{E} \in \mathbb{R}^{10,000 \times 100}\) is the embedding matrix.

Each context word is mapped to its embedding vector:

\[ \mathbf{v}_{\texttt{The}}, \quad \mathbf{v}_{\texttt{new}}, \quad \mathbf{v}_{\texttt{opened}}, \quad \mathbf{v}_{\texttt{near}} \]

Step 4. Aggregation (context vector).

The CBOW model computes the average:

\[ \mathbf{h} = \frac{1}{4} \left( \mathbf{v}_{\texttt{The}} + \mathbf{v}_{\texttt{new}} + \mathbf{v}_{\texttt{opened}} + \mathbf{v}_{\texttt{near}} \right) \]

This vector \(\mathbf{h} \in \mathbb{R}^{100}\) summarizes the contextual information.

Step 5. Score computation

The model computes scores over the entire vocabulary:

\[ \mathbf{z} = \mathbf{h}\mathbf{C}^\top, \qquad \mathbf{z} \in \mathbb{R}^{1 \times 10,000} \]

Each component \(z_i\) represents how compatible the context is with word \(w_i\).

Step 6a. Softmax probabilities

These scores are transformed into probabilities using the softmax function:

\[ P(w \mid \text{context}) = \frac{\exp(z_w)}{\sum_{j=1}^{|W|} \exp(z_j)} \]

Step 6b. Candidate words

The model assigns probabilities to all words in the vocabulary, including:

  • \(\texttt{store}\)

  • \(\texttt{hotel}\)

  • \(\texttt{restaurant}\)

These words appear because they are all possible candidates the model can select as the center word.

Words like \(\texttt{hotel}\) and \(\texttt{restaurant}\) receive relatively high probabilities because they are semantically compatible with the context:

"The new ___ opened near the plaza"

Step 7. Prediction

Ideally, the correct word \(\texttt{store}\) receives the highest probability:

\[ P(\texttt{store} \mid \text{context}) \quad \text{is maximal} \]

Thus, the CBOW model answers the question:

Given these surrounding words, which word best fits in the center?

This example illustrates how CBOW combines multiple context signals into a single prediction, capturing semantic coherence in the sentence.

In practice, computing probabilities over all \(|W|\) words is computationally expensive, which motivates the use of additional optimization techniques.

11.0.8 CBOW: computational challenges

In their basic form, both CBOW and Skip-gram require updating a large number of parameters for each training example. Because the vocabulary size \(|W|\) can be very large, computing full softmax probabilities and updating all associated weights becomes computationally expensive.

To address this challenge, the original Word2Vec framework introduced two key optimization strategies:

  • Subsampling of frequent words

  • Negative sampling

These two techniques significantly reduce computational cost while preserving embedding quality and are examined in detail in the following sections.

Figure 11.2: Computational challenges of CBOW approach. Source: Created by the author with ChatGPT (OpenAI)

12 The CBOW approach: subsampling of frequent words

12.0.1 Subsampling (overview)

We now examine the first of these optimization strategies in more detail.

Highly frequent function words (such as and, of, or to) often carry limited semantic information but appear extremely often in text. To prevent these words from dominating the learning process, Word2Vec applies subsampling, which probabilistically discards some occurrences of frequent words during training.

As a consequence:

  • Frequent words are less likely to be selected as target words.

  • They appear less often as context words.

  • The effective training corpus becomes more informative and computationally manageable.

12.0.2 Subsampling (intuitive example)

To better understand the effect of subsampling, consider the following simple sentence:

"the cat sits on the mat"

In this sentence, the word the appears twice and is a highly frequent function word, while words such as cat, sits, and mat carry more semantic information.

During subsampling, some occurrences of the may be randomly discarded. For example, the sentence might be reduced to:

"cat sits on mat"

As a result:

  • Informative words such as cat, sits, and mat are preserved.

  • Less informative words such as the are partially removed.

  • The model focuses more on meaningful relationships between words.

This illustrates how subsampling reduces the dominance of frequent words and improves the quality of the training data.

12.0.3 Subsampling (important clarification: partial removal of frequent words)

It is important to emphasize that subsampling does not completely remove highly frequent words such as the, of, or and.

Instead, each occurrence of a frequent word is independently retained or discarded with a certain probability. As a result:

  • Very frequent words are removed most of the time.

  • But they are still occasionally retained in the training data.

For example, if a word such as the appears very frequently, only a small fraction of its occurrences may be kept, while the majority are discarded.

This probabilistic mechanism ensures that:

  • Frequent words do not dominate the learning process.

  • Enough occurrences are preserved to maintain grammatical and contextual structure.

Thus, subsampling performs a controlled reduction, rather than a complete elimination, of high-frequency words.

12.0.4 Subsampling (mathematical formulation)

The following probability function was introduced in the original Word2Vec framework to control the retention of frequent words during training.

Rather than being derived from first principles, this formulation is heuristic in nature. It was designed to aggressively downsample very frequent words while preserving less frequent ones. It has been shown empirically to improve both training efficiency and embedding quality (Mikolov et al., 2013).

The retention of a word \(w_i\) is controlled by the following function:

\[\begin{equation} R(w_i) \quad =\quad \left( \sqrt{\frac{f(w_i)}{\tau}} \; +\; 1 \right) \cdot \frac{\tau}{f(w_i)} \tag{12.1} \end{equation}\]

In Equation (12.1):

  • \(f(w_i)\) denotes the relative frequency of the word \(w_i\) in the corpus, and

  • \(\tau\) is a small threshold constant (commonly set around \(10^{-3}\)) that controls the aggressiveness of subsampling.

Words with very high frequencies are therefore more likely to be discarded.

12.0.5 Subsampling (important clarification: probability vs. retention function)

Strictly speaking, the expression in Equation (12.1) is not always a valid probability, since it may take values greater than 1.

For this reason, it is interpreted as a retention score, which is converted into a valid probability through truncation:

\[ P_{\text{keep}}(w_i) \;=\; \min\big(1,\; R(w_i)\big). \]

This ensures that the final value lies in the interval \((0,1]\).

As a consequence:

  • Words with low frequency often satisfy \(R(w_i) > 1\), and are therefore always retained.

  • Words with high frequency yield \(R(w_i) < 1\), and are retained only with a certain probability.

This function assigns lower retention probabilities to highly frequent words and values close to 1 for less frequent words, implementing the intuition described earlier.
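The retention score and its truncation can be sketched as two small functions (the names `retention_score` and `keep_probability` are hypothetical). Writing \(x = f(w_i)/\tau\), the condition \(R(w_i) \ge 1\) is equivalent to \(x - \sqrt{x} - 1 \le 0\), that is, \(f(w_i) \le \tfrac{3+\sqrt{5}}{2}\,\tau \approx 2.618\,\tau\). Words below this frequency are always kept.

```python
import numpy as np

def retention_score(f, tau=1e-3):
    """R(w) from Equation (12.1) for a word with relative frequency f."""
    return (np.sqrt(f / tau) + 1) * (tau / f)

def keep_probability(f, tau=1e-3):
    """Truncate the retention score to obtain a valid probability."""
    return np.minimum(1.0, retention_score(f, tau))

# Relative frequencies chosen for illustration
for f in [1e-4, 1e-3, 2.6e-3, 1e-2, 5e-2]:
    print(f"f = {f:.4f}  R = {retention_score(f):.4f}  P_keep = {keep_probability(f):.4f}")
```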

Subsampling (additional insight: why the square root?)

An important component of the subsampling function is the square root term:

\[ \sqrt{\frac{f(w_i)}{\tau}} \]

This term plays a key role in controlling how aggressively frequent words are downsampled.

If we only used a term such as \(\frac{\tau}{f(w_i)}\), the retention probability would decrease too rapidly for frequent words. As a result, very common words could be almost entirely removed from the training data, potentially harming the grammatical structure of the corpus.

The square root introduces a smoothing effect:

  • It reduces the rate at which the probability decreases as frequency increases.

  • It ensures that very frequent words are downsampled aggressively but not completely.

  • It preserves a balance between removing noise and maintaining useful contextual information.

In this sense, the square root acts as a controlled attenuation mechanism, preventing extreme probabilities while still favoring informative words.
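To make the role of the square root concrete, note that expanding the product in Equation (12.1) gives \(R(w_i) = \sqrt{\tau / f(w_i)} + \tau / f(w_i)\), so for frequent words the square-root term dominates. The sketch below compares Equation (12.1) with a hypothetical variant that keeps only \(\tau / f(w_i)\). The function names are illustrative and are not part of any library.

```python
import math

def retention_score(f, tau=1e-3):
    """Retention score R(w) from Equation (12.1)."""
    return (math.sqrt(f / tau) + 1) * (tau / f)

def ratio_only_score(f, tau=1e-3):
    """Hypothetical variant without the square-root term (for comparison only)."""
    return tau / f

for f in [0.002, 0.01, 0.05]:
    full = min(1.0, retention_score(f))
    ratio = min(1.0, ratio_only_score(f))
    print(f"f = {f:<6} ratio only = {ratio:.4f}   Equation (12.1) = {full:.4f}")
```

For a word with \(f(w_i) = 0.05\), the ratio-only variant would keep about 2% of occurrences, whereas Equation (12.1) keeps about 16%.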

12.0.6 Subsampling pipeline

The subsampling procedure can be summarized as the following pipeline:

\[ \text{Corpus} \;\longrightarrow\; f(w_i) \;\longrightarrow\; R(w_i) \;\longrightarrow\; P_{\text{keep}}(w_i) \;\longrightarrow\; \text{Random filtering of occurrences} \;\longrightarrow\; \text{Subsampled corpus} \]

This pipeline highlights two key stages:

  • A global stage, where retention probabilities are computed for each word, and

  • A local stage, where each occurrence is independently retained or discarded.
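The two stages can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any library: the function name `subsample`, the toy corpus, and the value of `tau` are hypothetical, and `tau` is set far above \(10^{-3}\) only because the corpus is tiny.

```python
import math
import random
from collections import Counter

def subsample(tokens, tau, seed=0):
    """Apply frequency-based subsampling to a list of tokens."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)

    # Global stage: one retention probability per distinct word.
    p_keep = {}
    for word, count in counts.items():
        f = count / total                          # relative frequency f(w)
        r = (math.sqrt(f / tau) + 1) * (tau / f)   # retention score R(w)
        p_keep[word] = min(1.0, r)                 # truncated to a probability

    # Local stage: an independent random draw for every occurrence.
    kept = [w for w in tokens if rng.random() < p_keep[w]]
    return kept, p_keep

# Toy corpus (hypothetical): 13 tokens repeated 50 times.
tokens = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
kept, p_keep = subsample(tokens, tau=0.05)

print(len(tokens), "->", len(kept), "tokens after subsampling")
print({w: round(p, 3) for w, p in p_keep.items()})
```

In this toy setting the most frequent word, the, receives the lowest retention probability, while words that occur once per repetition, such as cat, are always kept.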

12.0.7 Subsampling (intuitive example: one frequent word, the)

To understand how subsampling works in practice, consider a simple hypothetical corpus. Suppose the word:

\[w_i =``\text{the"}\]

appears extremely often, with relative frequency:

\[f(w_i) = 0.05\]

That means the word the accounts for 5% of all tokens in the corpus. Assume the threshold parameter is:

\[\tau = 10^{-3} = 0.001\]

We compute the retention score of the word the using Equation (12.1). Substituting values:

\[R(w_i) \quad = \quad \left(\sqrt{\frac{0.05}{0.001}} \;+\; 1 \right) \cdot \frac{0.001}{0.05} \quad = \quad \left(\sqrt{50} + 1 \right) \cdot 0.02 \quad = \quad (7.07 + 1)\cdot 0.02 \quad \approx \quad 0.1614\]

From this value, we obtain the effective probability:

\[ P_{\text{keep}}(w_i) \; = \; \min(1, R(w_i)) \; = \; 0.1614. \]

import numpy as np

# Given values
f = 0.05
tau = 0.001

# Step-by-step computation
step1 = f / tau
step2 = np.sqrt(step1)
step3 = step2 + 1
step4 = tau / f
R = step3 * step4

# Convert to probability
P_keep = min(1, R)

# Simulate retention decision
random_value = np.random.rand()
keep = random_value < P_keep

print("f/tau =", step1)
print("sqrt(f/tau) =", step2)
print("sqrt(f/tau) + 1 =", step3)
print("tau/f =", step4)

print("Retention score R(w_i) =", R)
print("Retention probability P_keep =", P_keep)

print("Random draw =", random_value)

if keep:
    print("Decision: The word is RETAINED")
else:
    print("Decision: The word is DISCARDED")
## Retention score R(w_i) = 0.16142135623730952
## Retention probability P_keep = 0.16142135623730952
## Random draw = 0.314889535873188
## Decision: The word is DISCARDED

This means that each occurrence of the word the is retained with probability approximately \(0.16\).

Equivalently:

  • About 16% of occurrences are kept.

  • About 84% are discarded during training.

Importantly, this decision is applied independently to each occurrence, not to the word as a whole.

From a practical perspective, this implies that very frequent words are strongly downsampled, reducing their dominance in the training process while still preserving occasional occurrences.

12.0.8 Subsampling (remarks)

1. About the code.

In the code above, the following step:

print("Random draw =", random_value)

if keep:
    print("Decision: The word is RETAINED")
else:
    print("Decision: The word is DISCARDED")

simulates how subsampling is applied in practice. For each occurrence of a word:

  • A random value is drawn uniformly between 0 and 1.

  • The word is retained only if this value is smaller than \(P_{\text{keep}}\).

This means that each occurrence is treated independently, resulting in a probabilistic filtering process.

2. Distinction between two stages of the process.

It is important to distinguish between two stages of the process:

  • The retention probability \(P_{\text{keep}}(w_i)\) is computed once for each word based on its frequency in the corpus.

  • This probability is then applied independently to each occurrence of that word.

In other words, the model first determines how likely a word should be kept, and then uses that probability repeatedly while scanning the corpus.

3. Important consequences.

This mechanism has several important consequences:

  • It reduces computational cost, since fewer training examples are processed.

  • It prevents frequent function words from dominating the learning process.

  • It allows the model to focus on informative content words, which carry more semantic meaning.

  • It improves the quality of the learned embeddings, as relationships between meaningful words become more prominent.

Thus, the exact numerical value (e.g., 16%) is not important by itself. What matters is the qualitative effect:

Frequent words are strongly downsampled, but not completely removed.

Subsampling (example: comparison with a less frequent word, cat).

Now consider a less frequent word, such as:

\[w_j = ``\text{cat"}\]

Suppose:

\[f(w_j) = 0.0005\]

We compute the retention score of the word cat using Equation (12.1):

\[R(w_j) \quad = \quad \left( \sqrt{\frac{0.0005}{0.001}} \;+\; 1\right) \cdot \frac{0.001}{0.0005} \quad =\quad \left(\sqrt{0.5} + 1\right) \cdot 2 \quad = \quad (0.707 + 1)\cdot 2 \quad \approx \quad 3.414\]

The retention probability is therefore:

\[ P_{\text{keep}}(w_j) \; = \; \min(1, R(w_j)) \; = \; 1. \]

import numpy as np

# Given values
f = 0.0005
tau = 0.001

# Step-by-step computation
step1 = f / tau
step2 = np.sqrt(step1)
step3 = step2 + 1
step4 = tau / f
R = step3 * step4

# Convert to probability
P_keep = min(1, R)

# Simulate retention decision
random_value = np.random.rand()
keep = random_value < P_keep

print("f/tau =", step1)
print("sqrt(f/tau) =", step2)
print("sqrt(f/tau) + 1 =", step3)
print("tau/f =", step4)

print("Retention score R(w_i) =", R)
print("Retention probability P_keep =", P_keep)

print("Random draw =", random_value)

if keep:
    print("Decision: The word is RETAINED")
else:
    print("Decision: The word is DISCARDED")
## Retention score R(w_i) = 3.414213562373095
## Retention probability P_keep = 1
## Random draw = 0.4065489634375622
## Decision: The word is RETAINED

This means that every occurrence of the word cat is retained during training.

In contrast to very frequent words, rare words are never downsampled, since they already provide valuable information and do not dominate the corpus.

This illustrates a key property of the subsampling mechanism:

Rare words are fully preserved, while frequent words are selectively reduced.
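The boundary between "always retained" and "downsampled" can be located explicitly. Setting \(R(w) = 1\) in Equation (12.1) and writing \(x = \sqrt{\tau / f(w)}\) gives \(x^2 + x = 1\), whose positive solution leads to \(f^{*} = \frac{3 + \sqrt{5}}{2}\,\tau \approx 2.618\,\tau\). This derivation is our own reading of the formula rather than a result stated in the source; the short check below verifies it numerically for \(\tau = 10^{-3}\).

```python
import math

tau = 1e-3

def retention_score(f, tau):
    """Retention score R(w) from Equation (12.1)."""
    return (math.sqrt(f / tau) + 1) * (tau / f)

# Frequency at which R(w) = 1: f* = (3 + sqrt(5)) / 2 * tau
f_star = (3 + math.sqrt(5)) / 2 * tau

print("break-even frequency f* =", f_star)
print("R at f*                 =", retention_score(f_star, tau))
print("R for cat (f = 0.0005)  =", retention_score(0.0005, tau))
print("R for the (f = 0.05)    =", retention_score(0.05, tau))
```

Words with relative frequency below roughly \(2.6\,\tau\) are therefore never discarded, which is the case for cat but not for the.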

12.0.9 Subsampling (example: effect of subsampling across words with different frequencies)

The following table summarizes the effect of subsampling across words with different frequencies:

| Word | Frequency \(f(w_i)\) | \(R(w_i)\) | \(P_{\text{keep}}(w_i)\) | Interpretation |
|---|---|---|---|---|
| the | 0.05 | 0.16 | 0.16 | Highly frequent → strongly downsampled |
| and | 0.02 | 0.27 | 0.27 | Frequent → moderately downsampled |
| cat | 0.0005 | 3.41 | 1.00 | Rare → always retained |
| innovation | 0.0001 | 13.16 | 1.00 | Very rare → always retained |

This table illustrates how the subsampling function behaves across words with different frequencies.

  • Highly frequent words (e.g., the) are assigned low retention probabilities.

  • Moderately frequent words are partially retained.

  • Rare words typically produce scores \(R(w_i) > 1\), which are truncated to 1, meaning they are always retained.

This confirms that subsampling selectively reduces the influence of frequent words while preserving informative ones.

Taken together, these examples illustrate how subsampling adapts dynamically to word frequency.
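As a cross-check, the following sketch recomputes \(R(w_i)\) and \(P_{\text{keep}}(w_i)\) directly from Equation (12.1) for the four frequencies listed in the table, with \(\tau = 10^{-3}\). The variable names are illustrative.

```python
import numpy as np

tau = 1e-3
words = ["the", "and", "cat", "innovation"]
f = np.array([0.05, 0.02, 0.0005, 0.0001])   # relative frequencies from the table

R = (np.sqrt(f / tau) + 1) * (tau / f)        # Equation (12.1), vectorized
P_keep = np.minimum(1.0, R)                   # truncation to a valid probability

for w, fi, ri, pi in zip(words, f, R, P_keep):
    print(f"{w:<12} f = {fi:<7g} R = {ri:6.2f}   P_keep = {pi:.2f}")
```

Vectorizing the computation in this way is convenient when retention probabilities are needed for an entire vocabulary at once.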

12.0.10 Subsampling (interpretation)

Subsampling therefore:

  • Aggressively removes extremely frequent words.

  • Keeps informative content words.

  • Reduces computational cost.

  • Improves embedding quality by focusing on meaningful context.

Frequent function words contribute less semantic information, so discarding many of their occurrences does not harm learning. Instead, it allows the model to concentrate on more informative patterns in the data.

12.0.11 Subsampling (example: repeated occurrences under subsampling)

The previous examples considered a single occurrence of a word. To better understand how subsampling behaves in practice, we now simulate many independent occurrences of a frequent word such as the.

Suppose:

\[ f(w_i) = 0.05, \qquad \tau = 0.001 \]

Then the retention score is:

\[ R(w_i) = \left( \sqrt{\frac{f(w_i)}{\tau}} + 1 \right)\frac{\tau}{f(w_i)}, \]

and the effective retention probability is:

\[ P_{\text{keep}}(w_i) = \min(1, R(w_i)). \]

The following simulation applies this probability independently to many occurrences of the same word.

import numpy as np
import matplotlib.pyplot as plt

# Reproducibility
np.random.seed(42)

# Parameters
f = 0.05
tau = 0.001
n_occurrences = 1000

# Retention score and probability
R = (np.sqrt(f / tau) + 1) * (tau / f)
P_keep = min(1, R)

# Simulate independent retention decisions
random_draws = np.random.rand(n_occurrences)
kept = random_draws < P_keep

# Summary
n_kept = kept.sum()
n_discarded = n_occurrences - n_kept

print("Retention score R(w_i) =", R)
print("Retention probability P_keep =", P_keep)
print("Number of occurrences =", n_occurrences)
print("Retained =", n_kept)
print("Discarded =", n_discarded)
print("Observed retention proportion =", n_kept / n_occurrences)

# Simple bar plot
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(["Retained", "Discarded"], [n_kept, n_discarded]);
ax.set_ylabel("Number of occurrences");
ax.set_title("Subsampling simulation for a frequent word");
plt.show()
## Retention score R(w_i) = 0.16142135623730952
## Retention probability P_keep = 0.16142135623730952
## Number of occurrences = 1000
## Retained = 181
## Discarded = 819
## Observed retention proportion = 0.181

In this simulation, each occurrence of the word is treated independently. As expected, only a small fraction of occurrences are retained, while the majority are discarded.

Because the decision is probabilistic, the exact number retained will vary from one simulation to another. However, over many repetitions, the observed retention proportion will tend to be close to \(P_{\text{keep}}(w_i)\).
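Under the independence assumption, the number of retained occurrences among \(n\) follows a binomial distribution with success probability \(P_{\text{keep}}\), so the observed proportion has standard deviation \(\sqrt{P_{\text{keep}}(1 - P_{\text{keep}})/n}\). For \(n = 1000\) this corresponds to an expected count of about 161 with a standard deviation of about 12. The sketch below, which uses an arbitrarily chosen seed, shows how the observed proportion tightens around \(P_{\text{keep}}\) as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility

f, tau = 0.05, 0.001
P_keep = min(1.0, (np.sqrt(f / tau) + 1) * (tau / f))

for n in [100, 10_000, 1_000_000]:
    kept = rng.random(n) < P_keep
    # Standard deviation of the retained proportion under a binomial model
    std = np.sqrt(P_keep * (1 - P_keep) / n)
    print(f"n = {n:>9,}  observed = {kept.mean():.4f}  "
          f"expected = {P_keep:.4f}  std = {std:.4f}")
```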

This illustrates the practical meaning of subsampling:

  • Very frequent words are not removed completely.

  • They are retained only in a small proportion of their occurrences.

  • This reduces their dominance in the training corpus.

13 The CBOW approach: negative sampling

13.0.1 Negative sampling (overview)

We now turn to the second optimization strategy, namely negative sampling. Instead of updating model parameters for every word in the vocabulary at each step, negative sampling updates parameters for:

  • The true target word, and

  • A small set of randomly selected negative words.

Negative words are randomly sampled noise words that are treated as incorrect targets for the current context.

This reduces the number of output-vector updates per training example from \(|W|\) to \(k + 1\) (the true target plus \(k\) negative words), where \(k\) is a small constant, making training feasible even for very large vocabularies.

13.0.2 Negative sampling (how negative samples are selected)

Negative words are not sampled uniformly. Instead, they are drawn from a modified frequency distribution that balances the influence of common and rare words.

Let \(F(w_i)\) denote the absolute frequency (i.e., raw count) of word \(w_i\) in the corpus. A commonly used sampling distribution is defined as:

\[ P(w_i) \;=\; \frac{F(w_i)^{3/4}}{\sum\limits_{j=1}^{|W|} F(w_j)^{3/4}} \]

This formulation was introduced in the original Word2Vec framework and is heuristic in nature, meaning that it was designed based on empirical performance rather than derived from first principles (see Mikolov et al., 2013).

The exponent \(3/4\) plays an important role:

  • If raw frequencies \(F(w_i)\) were used, very frequent words (e.g., the, and) would dominate the sampling process.

  • If a uniform distribution were used, important frequency information would be lost.

  • The exponent \(3/4\) provides a balance, reducing the dominance of very frequent words while still favoring more common words over rare ones.

Thus, negative samples are more likely to be common words, but not overwhelmingly so.

Before illustrating this process with an example, we emphasize that this distribution is used exclusively for selecting negative samples, not for modeling the actual language probabilities.
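The sampling distribution can be sketched with NumPy as follows. The vocabulary and counts are hypothetical, and excluding the true target before drawing is a simplification adopted here for clarity; actual implementations may handle a collision with the target differently.

```python
import numpy as np

# Hypothetical raw counts F(w) for a toy vocabulary
vocab = ["the", "and", "cat", "mat", "ocean"]
F = np.array([1000, 600, 10, 8, 2], dtype=float)

raw = F / F.sum()                  # sampling proportional to raw counts
P = F**0.75 / (F**0.75).sum()      # smoothed distribution with exponent 3/4

for w, r, p in zip(vocab, raw, P):
    print(f"{w:<6} raw = {r:.4f}   smoothed = {p:.4f}")

# Draw k = 3 negative samples, excluding the true target word
rng = np.random.default_rng(0)
target = "cat"
mask = np.array([w != target for w in vocab])
P_neg = P[mask] / P[mask].sum()    # renormalize after removing the target
negatives = rng.choice(np.array(vocab)[mask], size=3, p=P_neg)
print("negative samples:", list(negatives))
```

Compared with raw counts, the smoothed distribution assigns less probability to the and more to ocean, while common words remain the most likely negatives.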

13.0.3 Negative sampling (visual intuition of the \(3/4\) exponent)

To better understand the effect of the \(3/4\) exponent, we compare the original frequency distribution with its smoothed version.

The transformation:

\[ F(w_i) \;\longrightarrow\; F(w_i)^{3/4} \]

reduces the dominance of very frequent words while preserving the relative importance of less frequent ones.

import numpy as np
import matplotlib.pyplot as plt

# Simulated frequencies (Zipf-like behavior)
ranks = np.arange(1, 51)
F = 1 / ranks           # original frequencies
F_smooth = F**(3/4)     # smoothed frequencies

# Normalize for comparison
F = F / F.sum()
F_smooth = F_smooth / F_smooth.sum()

# Plot
fig, ax = plt.subplots(figsize=(7,4))
ax.plot(ranks, F, label="Original frequency $F(w)$");
ax.plot(ranks, F_smooth, label="Smoothed $F(w)^{3/4}$");

ax.set_xlabel("Word rank");
ax.set_ylabel("Probability");
ax.set_title("Effect of the $3/4$ exponent in negative sampling");
ax.legend();

plt.show()

The figure shows that:

  • Very frequent words (low ranks) are downweighted, meaning that their relative importance is reduced compared to the original frequency distribution.

  • Less frequent words become relatively more important, as their probability of being sampled increases in comparison to highly frequent words.

In other words, the transformation compresses the range of frequencies, reducing the dominance of very common words while giving more opportunity for less frequent words to be selected as negative samples.

13.0.4 Negative sampling (numerical illustration of the \(3/4\) exponent)

To illustrate this effect, consider a simple example:

  • Suppose a very frequent word such as the appears \(F(\text{the}) =1000\) times in the corpus, while a less frequent word such as cat appears \(F(\text{cat})=10\) times.

Applying the transformation \(F(w)^{3/4}\):

\[ F(\text{the})^{3/4}\; =\; 1000^{3/4}\; \approx\; 178, \qquad F(\text{cat})^{3/4}\; =\; 10^{3/4} \; \approx\; 5.6 \]

Before the transformation, the ratio between the two words is:

\[ \frac{F(\text{the})}{F(\text{cat})} \; =\; \frac{1000}{10} \; =\; 100 \]

After the transformation, the ratio becomes:

\[ \frac{F(\text{the})^{3/4}}{F(\text{cat})^{3/4}} \; =\; \frac{178}{5.6} \; \approx\; 32 \]

The following table summarizes the effect of the transformation:

| Word | Original frequency \(F(w)\) | Transformed \(F(w)^{3/4}\) |
|---|---|---|
| the | 1000 | 178 |
| cat | 10 | 5.6 |

This comparison highlights how the transformation reduces the gap between very frequent and less frequent words: the ratio between the and cat falls from 100 to about 32.

This smoothing effect creates a more balanced distribution for selecting negative samples.

Importantly, \(P(w_i)\) is a proper probability distribution, since it is normalized to sum to 1 over the vocabulary.

However, it does not represent the true probability of words in language. Instead, it is used solely as a sampling mechanism for selecting negative examples during training.

13.0.5 Negative sampling (illustrative example and intuition)

Retained words

To make the idea concrete, consider the following short sentence:

"The cat sits on the mat"

Suppose the center word is \(w_t = ``\text{sits}"\) and we use a context window of size \(s = 2\). The context words are therefore:

\[ \{``\text{the}",\; ``\text{cat}", \;``\text{on}", \; ``\text{the}"\} \]

In CBOW, the model aggregates these context words and tries to predict:

\[P(``\text{sits}"|\; \text{context})\]

Note that repeated words (such as the) are retained, since the context window is taken directly from the sentence.
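The extraction of the context window can be sketched in a few lines. The helper name `context_window` is hypothetical; the sentence is lowercased and split on whitespace for simplicity.

```python
def context_window(tokens, center, s):
    """Return the tokens within s positions of index `center`, excluding the center itself."""
    left = tokens[max(0, center - s):center]
    right = tokens[center + 1:center + 1 + s]
    return left + right

tokens = "the cat sits on the mat".split()
center = tokens.index("sits")

print(context_window(tokens, center, s=2))
# ['the', 'cat', 'on', 'the']
```

Because the window is sliced directly from the token list, the repeated word the appears twice in the result.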

Positive and negative pairs

In the negative sampling framework, the model constructs:

  • A positive pair:

\[ (``\texttt{sits}",\; ``\texttt{cat}") \]

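(or any other true context word). The pairs here are written as (center word, context word), which is the Skip-gram convention; in CBOW the same idea applies with the aggregated context paired with the target word sits, as in the objective presented in Section 13.0.6.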

  • Several negative pairs, for example:

\[ (``\texttt{sits}",\; ``\texttt{banana}"), \quad (``\texttt{sits}",\; ``\texttt{government}"), \quad (``\texttt{sits}",\; ``\texttt{ocean}") \]

These negative words are sampled from the distribution defined earlier.

Learning objective.

The model is trained to:

  • Assign a high score to positive pairs (true context words), and

  • Assign a low score to negative pairs (randomly sampled words).

In other words, the model learns to distinguish between:

Real context vs. random noise

This formulation avoids computing probabilities over the entire vocabulary, replacing it with a set of binary classification tasks.

13.0.6 Negative sampling (comparison with full softmax)

Without negative sampling

If we use the full softmax formulation, the model must:

  • Compute scores for every word in the vocabulary.

  • Normalize across all \(|W|\) words.

  • Update parameters associated with all of them.

If the vocabulary contains 100,000 words, this implies approximately 100,000 updates for a single training example, which is computationally expensive.

With negative sampling

Instead of updating all words, we update only:

  • The true target word: \(\texttt{sits}\).

  • A small number \(k\) of negative words.

Suppose \(k = 3\). The model might randomly select:

banana, government, ocean

These words are unlikely to be correct targets for the given context. The model then learns to:

  • Increase the score of \(\texttt{sits}\) given the context.

  • Decrease the scores of the negative samples.

Thus, instead of updating 100,000 output weights, we update only:

\[1 + k = 4\]

This is a drastic reduction in computational cost compared to updating all \(|W|\) words in the full softmax formulation.

Mathematical interpretation.

With negative sampling, the objective for one training example becomes:

\[\log \sigma\big(\mathbf{v}_{\text{sits}}^{\top} \mathbf{h}\big) \quad +\quad \sum_{i=1}^{k} \log \sigma\big(-\mathbf{v}_{w_i^-}^{\top} \mathbf{h}\big)\]

where:

  • \(\mathbf{h}\) is the aggregated context vector.

  • \(w_i^-\) are negative samples.

  • \(\sigma(\cdot)\) is the logistic sigmoid function.

Instead of normalizing across the entire vocabulary, the model solves several small binary classification problems:

Is this word the correct target? Yes or No?

Why this works.

Although the model no longer computes a full probability distribution over the vocabulary, it still learns meaningful embeddings because:

  • True target words are pushed closer to their contexts.

  • Random negative words are pushed farther away.

Over many training examples, this process shapes the embedding space so that semantically related words cluster together.

This approximation is one of the key reasons why Word2Vec can be efficiently trained on very large corpora.
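The objective and a single update can be sketched with NumPy. This is a minimal illustration under stated assumptions: the vectors are random toy values, the learning rate is arbitrary, and the context vector \(\mathbf{h}\) is held fixed so that only the \(1 + k\) output vectors change (in the full model, the context embeddings are updated as well).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_objective(h, v_target, v_neg):
    """Negative sampling objective for one training example (to be maximized)."""
    return np.log(sigmoid(v_target @ h)) + np.sum(np.log(sigmoid(-(v_neg @ h))))

rng = np.random.default_rng(0)
d, k = 8, 3                                   # toy embedding size, number of negatives
h = rng.normal(scale=0.5, size=d)             # aggregated context vector (held fixed)
v_target = rng.normal(scale=0.5, size=d)      # output vector of the true target
v_neg = rng.normal(scale=0.5, size=(k, d))    # output vectors of the k negatives

before = ns_objective(h, v_target, v_neg)

# One gradient-ascent step on the 1 + k output vectors only
lr = 0.1
g_target = (1 - sigmoid(v_target @ h)) * h           # gradient w.r.t. v_target
g_neg = -sigmoid(v_neg @ h)[:, None] * h[None, :]    # gradient w.r.t. each negative
v_target_new = v_target + lr * g_target
v_neg_new = v_neg + lr * g_neg

after = ns_objective(h, v_target_new, v_neg_new)
print("objective before:", before)
print("objective after: ", after)
```

After the step, the score of the true target increases and the scores of the negative samples decrease, which is the "push closer, push farther" behavior described above.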

14 Summary: Skip-gram and CBOW

The Word2Vec framework combines two complementary architectures with efficient training strategies that enable learning word representations at scale. The main ideas can be summarized as follows.

Model comparison

| Model | Input | Output | Prediction type | Key idea |
|---|---|---|---|---|
| Skip-gram | Target word \(w_t\) | Context words | One-to-many | Predict surrounding words from a single word |
| CBOW | Context words | Target word \(w_t\) | Many-to-one | Predict the center word from surrounding context |

Computational challenge

Both architectures rely on modeling conditional probabilities over the entire vocabulary:

\[ P(w \mid \text{context}) = \frac{\exp(z_w)}{\sum\limits_{j=1}^{|W|} \exp(z_j)} \]

When \(|W|\) is large, this requires computing and updating a large number of parameters for each training example, making the process computationally expensive.

Optimization strategies

To address this limitation, Word2Vec introduces two key techniques:

  • Subsampling of frequent words. Reduces the influence of very common words (e.g., the, of) by probabilistically discarding some of their occurrences.

  • Negative sampling. Replaces the full softmax with a set of binary classification tasks involving:

    • The true target word, and

    • A small number \(k\) of negative samples.

Negative sampling objective

Instead of computing a full probability distribution, the model is reformulated as a set of binary classification problems.

The resulting objective function for a single training example is:

\[ \log \sigma\big(\mathbf{v}_{w_t}^\top \mathbf{h}\big) \;+\; \sum_{i=1}^{k} \log \sigma\big(-\mathbf{v}_{w_i^-}^\top \mathbf{h}\big) \]

where:

  • \(\mathbf{h} \in \mathbb{R}^d\) is the context representation vector.

    • In CBOW, it is typically obtained by averaging or summing the embeddings of the context words.

    • In Skip-gram, it corresponds to the embedding of the input (center) word.

  • \(\mathbf{v}_{w} \in \mathbb{R}^d\) is the output embedding vector associated with word \(w\). Each word in the vocabulary has one such vector in the output embedding matrix.

  • \(w_t\) is the true target word, and \(\mathbf{v}_{w_t}\) is its corresponding embedding.

  • \(w_i^-\) are the negative samples, i.e., words drawn from a noise distribution.

  • \(k\) is the number of negative samples used per training example (typically a small integer, e.g., \(k=5\) to \(k=20\)).

  • \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function.

Matrix perspective (for dimensional clarity).

Let \(|W|\) be the vocabulary size and \(d\) the embedding dimension.

  • Input embedding matrix: \[ \mathbf{W} \in \mathbb{R}^{|W| \times d} \]

  • Output embedding matrix: \[ \mathbf{V} \in \mathbb{R}^{|W| \times d} \]

  • Therefore:

    • \(\mathbf{h} \in \mathbb{R}^d\)

    • \(\mathbf{v}_{w} \in \mathbb{R}^d\)

    • \(\mathbf{v}_{w}^\top \mathbf{h} \in \mathbb{R}\) (scalar score)

This confirms that the sigmoid is applied to a scalar compatibility score between vectors.
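The dimensions can be checked with a small NumPy sketch. The matrices are filled with random values and the word indices are hypothetical; the names `W_in` and `V_out` stand for the input and output embedding matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                      # toy |W| and embedding dimension

W_in = rng.normal(size=(vocab_size, d))    # input embedding matrix
V_out = rng.normal(size=(vocab_size, d))   # output embedding matrix

context_ids = [0, 1, 3, 0]                 # hypothetical indices of the context words
target_id = 2                              # hypothetical index of the target word

h = W_in[context_ids].mean(axis=0)         # CBOW: average the context embeddings
score = V_out[target_id] @ h               # scalar compatibility score
prob = 1.0 / (1.0 + np.exp(-score))        # sigmoid applied to the scalar

print(h.shape, V_out[target_id].shape, np.ndim(score))
# (4,) (4,) 0
```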

Interpretation.

  • The first term increases the similarity between the context and the true word.

  • The second term decreases the similarity between the context and randomly sampled words.

In this way, the model learns to distinguish meaningful word-context pairs from noise.

Key intuition.

  • Words that occur in similar contexts tend to develop similar vector representations.

  • Skip-gram and CBOW differ in prediction direction but rely on the same distributional principle.

  • Negative sampling improves efficiency by avoiding full normalization over the vocabulary.

  • Subsampling further enhances training by reducing the dominance of very frequent words.

Together, these components enable Word2Vec to learn high-quality embeddings efficiently, even for very large vocabularies.

We now move from theory to practice by training a Word2Vec model from scratch.

15 Training a Word2Vec model

After examining how pretrained Word2Vec embeddings can be used and understanding the underlying architecture of the model, we now turn to the task of training a Word2Vec model from scratch.

Although it is possible to implement the algorithm manually, most practical applications rely on established libraries.
In this chapter, we use the gensim library, which provides a clear and efficient interface for training Word2Vec models.

We begin with a minimal configuration to illustrate the core ideas and then gradually introduce additional parameters.

15.0.1 Building a simple Word2Vec model

We start by defining a small collection of tokenized sentences and training a basic model.

from gensim.models import Word2Vec

sentences = [
    ["data", "science", "relies", "on", "statistical", "models"],
    ["machine", "learning", "models", "improve", "predictions"],
    ["statistical", "methods", "support", "data", "analysis"]
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

In this example, the model is trained using a short list of tokenized sentences. Each sentence is represented as a list of tokens, and the full collection is passed to the Word2Vec constructor.

  • vector_size specifies the dimensionality of the embedding space.

  • window defines the maximum distance between the current word and its context words.

  • min_count controls vocabulary construction by specifying the minimum number of times a word must appear in the corpus to be included.

For example, if min_count = 4, only words that occur at least four times are retained, while less frequent words are discarded.
Consider the following hypothetical word frequencies:

  • data: 4
  • learning: 2
  • model: 1

If min_count = 4, only the word data is kept, while learning and model are removed from the vocabulary.

This helps reduce noise and improves training efficiency by focusing on more informative words.

In this example, setting min_count = 1 ensures that all tokens in the dataset are retained.
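The effect of min_count can be mimicked in plain Python by counting tokens and filtering. This sketch only illustrates the filtering rule described above; it is not gensim's implementation, and the helper name `build_vocab` is hypothetical.

```python
from collections import Counter

sentences = [
    ["data", "science", "relies", "on", "statistical", "models"],
    ["machine", "learning", "models", "improve", "predictions"],
    ["statistical", "methods", "support", "data", "analysis"],
]

def build_vocab(sentences, min_count):
    """Keep the words whose corpus count is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}

print(len(build_vocab(sentences, min_count=1)))
# 13
print(sorted(build_vocab(sentences, min_count=2)))
# ['data', 'models', 'statistical']
```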

In real-world applications, the input typically consists of thousands or millions of sentences drawn from a large corpus.

Inspecting the dimensionality of the learned word vectors

To inspect the dimensionality of the learned word vectors, we use:

print(model.vector_size)
## 100

This matches the value vector_size=100 passed to the constructor; 100 is also the default used by gensim when vector_size is omitted.

15.0.2 Size of the vocabulary

The size of the vocabulary learned from the data can be obtained as follows:

print(len(model.wv.key_to_index))
## 13

This returns the size of the vocabulary \(|W|\), i.e., the number of unique words retained by the model.

15.0.3 Adjusting the min_count parameter

The min_count parameter can be used to filter out infrequent words, which are often noisy or uninformative.

model = Word2Vec(sentences, min_count=2)

With this configuration, only words appearing at least twice in the corpus are retained.

We can verify the resulting vocabulary size:

print(len(model.wv.key_to_index))
## 3

And inspect the retained tokens:

print(model.wv.key_to_index)
## {'models': 0, 'statistical': 1, 'data': 2}

Although the vocabulary shrinks, the dimensionality of the embeddings remains unchanged:

print(model.vector_size)
## 100

Filtering rare words can improve training efficiency and reduce overfitting when working with large corpora.

15.0.4 Playing with the vector size

Higher-dimensional vectors can encode more information, which is most useful when the corpus and vocabulary are large and the data is highly varied.

Let’s try to build a model where each vector is 300-dimensional using the following code block:

model = Word2Vec(sentences, min_count=2, vector_size=300)

Let’s now find out the vector size for the model we just built using the following line of code:

model.vector_size
## 300

As we can see, each of the three words that occur at least twice (models, statistical, data) is now represented using 300 dimensions.

15.0.5 Exploring the effect of vector dimensionality

The dimensionality of word embeddings influences how much semantic information can be encoded. Larger values allow for richer representations but require more data and computational resources.

To train a model with higher-dimensional embeddings, we can specify the vector_size parameter:

model = Word2Vec(sentences, min_count=2, vector_size=300)

We confirm the new dimensionality with:

model.vector_size
## 300

Each retained word is now represented as a vector in a 300-dimensional space.

15.0.6 Additional configuration parameters

Word2Vec provides several other parameters that control training behavior:

  • sg: selects the training architecture (1 for Skip-gram, 0 for CBOW),

  • negative: specifies the number of negative samples used during training,

  • workers: defines the number of parallel threads.

An example configuration is shown below:

model = Word2Vec(
    sentences,
    min_count=1,
    vector_size=200,
    sg=1,
    negative=5,
    workers=2
)

We can again inspect the vocabulary:

len(model.wv.key_to_index)
## 13
model.wv.key_to_index
## {'models': 0, 'statistical': 1, 'data': 2, 'analysis': 3, 'support': 4, 'methods': 5, 'predictions': 6, 'improve': 7, 'learning': 8, 'machine': 9, 'on': 10, 'relies': 11, 'science': 12}

Trained Word2Vec models can be saved to disk for later use using the save() method.

15.0.7 Limitations of Word2vec

Despite its effectiveness, Word2Vec has several well-known limitations.

Figure 15.1: Applications of Word2vec. Source: Created by the author with ChatGPT (OpenAI)

First limitation.

Each word is assigned a single static vector, regardless of context. Consider the following sentences:

The researcher examined the cell samples.
The prisoner was locked in a cell overnight.

In both cases, the word cell would receive the same vector representation, even though its meaning differs across contexts.

Second limitation.

Word2Vec can reflect statistical biases present in the training corpus. If certain associations are overrepresented in the data, the learned embeddings may encode and reproduce these patterns. These issues highlight an important principle:

The quality and fairness of embeddings depend strongly on the data used for training.

15.0.8 Applications of Word2vec

Word2Vec embeddings are widely used in a variety of natural language processing tasks, including:

  • Semantic similarity measurement.

  • Document clustering.

  • Text classification.

  • Information retrieval.

By representing words as dense numerical vectors, Word2Vec enables text data to be integrated into traditional machine learning pipelines and more advanced neural architectures.

Figure 15.2: Applications of Word2vec. Source: Created by the author with ChatGPT (OpenAI)

16 Word Mover’s Distance (WMD)

16.0.1 WMD: Overview

In earlier sections, we discussed how word embeddings can be used to represent documents and measure their similarity. One practical scenario where this becomes relevant is document matching, such as ranking short texts according to their relevance to a reference description.

For example, consider a system designed to compare short professional profiles against a project description. In such cases, we require a distance measure that reflects semantic similarity, not just surface-level word overlap. Documents that are semantically closer should receive smaller distance values.

In the document Transforming Text into Data Structure, we introduced cosine similarity as a common measure for comparing vector-based text representations. While effective in many settings, cosine similarity treats documents as aggregated vectors and may overlook fine-grained word-level alignments.

To address this limitation, we now introduce Word Mover’s Distance (WMD), a distance metric specifically designed for comparing documents represented through word embeddings.

Figure 16.1: Cosine similarity vs Word Mover’s Distance. Source: Created by the author with ChatGPT (OpenAI)

16.0.2 WMD: intuition behind this measure

Word Mover’s Distance (WMD), introduced by Kusner et al. (2015), is grounded in ideas from optimal transport theory. The central intuition is to measure how much “effort” is required to transform one document into another by moving words through the embedding space.

More precisely, WMD defines the dissimilarity between two documents as the minimum cumulative distance that the embedded words of one document must travel to align with the embedded words of the other document.

Instead of comparing documents as single aggregated vectors (as in cosine similarity), WMD explicitly accounts for word-level alignments.

17 WMD: example

17.0.1 Reference sentences

Consider the following sentences:

Sentence A: "The analyst explained results during the workshop in Medellín"
Sentence B: "A specialist discussed findings at a seminar in the city"

Many words in these sentences occupy nearby positions in the embedding space. For example:

  • analyst and specialist are semantically related.

  • workshop and seminar describe similar events.

  • explained and discussed reflect related communicative actions.

Now compare these with a third sentence:

Sentence C: "My bicycle needs maintenance before the weekend trip"

Sentence C shares little semantic content with Sentence A. Therefore, we expect the distance between A and C to be substantially larger than the distance between A and B.

Word Mover’s Distance (WMD) formalizes this intuition by computing pairwise distances between word embeddings and solving an optimal transport problem that minimizes the total movement cost required to transform one sentence into another.
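Written out, the optimal transport problem takes the following form in the formulation of Kusner et al. (2015). The symbols are introduced here for illustration and are not used elsewhere in this document: \(d_i\) and \(d'_j\) are the normalized frequencies of word \(i\) in the first document and word \(j\) in the second, \(\mathbf{x}_i\) is the embedding of word \(i\), \(n\) is the vocabulary size, and \(T_{ij} \ge 0\) is the amount of mass moved from word \(i\) to word \(j\).

\[ \mathrm{WMD}(d, d') \;=\; \min_{\mathbf{T} \,\ge\, 0} \; \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, \|\mathbf{x}_i - \mathbf{x}_j\|_2 \quad \text{subject to} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j. \]

The two constraints state that every word of the first document sends out exactly its own weight, and every word of the second document receives exactly its own weight.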

17.0.2 Guiding question and working hypothesis

Guiding question.

Given the three sentences introduced above, we now formulate a concrete analytical objective. Our goal is to determine whether Word Mover’s Distance (WMD) aligns with our semantic intuition.

The guiding question is:

Given these three sentences, can we formally measure which pair is semantically closer using Word Mover’s Distance?

More specifically:

1. Is the distance between sentences A and B smaller than the distance between sentences A and C?
  
2. How does WMD operationalize our intuitive notion of semantic similarity?  

Intuitively, Sentence A and Sentence B are semantically related, whereas Sentence C describes a completely different topic.

We therefore propose the following working hypothesis:

\[ \mathrm{WMD}(A, B) \;<\; \mathrm{WMD}(A, C) \]

That is, the semantic distance between sentences A and B should be smaller than the distance between sentences A and C.

Interpretation of the hypothesis.

This hypothesis operationalizes a natural semantic expectation:

  • If two sentences share related concepts,

  • and those concepts occupy nearby regions in embedding space,

  • then the optimal transport cost required to align them should be relatively small.

Conversely, if two sentences describe unrelated topics, the cumulative transport cost should be substantially larger.

Thus, WMD allows us to move from qualitative intuition (“these sentences are similar”) to a quantitative comparison based on geometric structure in the embedding space.
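The expectation above can be checked on a toy case before turning to a real model. The sketch below is not the gensim implementation: the 2-D vectors are invented by hand, and the helper `toy_wmd` is a hypothetical name. It relies on a special case: when both documents have the same number of words and every word carries the same weight, an optimal transport plan is a one-to-one matching, so trying every permutation gives the exact cost.

```python
import math
from itertools import permutations

# Hypothetical 2-D "embeddings", chosen by hand for illustration only.
toy_vectors = {
    "analyst": (1.0, 1.0), "specialist": (1.2, 0.9),
    "workshop": (4.0, 1.0), "seminar": (4.1, 1.3),
    "bicycle": (0.0, 6.0), "weekend": (3.0, 7.0),
}

def toy_wmd(doc_a, doc_b, vectors):
    """Transport cost for two equal-length documents with uniform word weights.

    Only this special case is handled; general WMD needs a linear-programming
    solver because mass from one word may be split across several words.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles documents of equal length")
    best_total = min(
        sum(math.dist(vectors[a], vectors[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    return best_total / len(doc_a)  # each word carries weight 1/n

doc_A = ["analyst", "workshop"]
doc_B = ["specialist", "seminar"]
doc_C = ["bicycle", "weekend"]

d_ab = toy_wmd(doc_A, doc_B, toy_vectors)
d_ac = toy_wmd(doc_A, doc_C, toy_vectors)
print(f"toy WMD(A, B) = {d_ab:.2f}")
print(f"toy WMD(A, C) = {d_ac:.2f}")
print("A is closer to B than to C:", d_ab < d_ac)
```

Because the toy vectors place related words next to each other by construction, the related pair receives the smaller cost. Whether a pretrained model behaves the same way is an empirical question.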

In the next section, we compute these distances explicitly using gensim and evaluate whether the numerical results confirm our hypothesis.

18 WMD: example (implementing with gensim)

We now demonstrate how to compute Word Mover’s Distance using the gensim library.

18.0.1 WMD with gensim: importing required modules

We begin by importing the necessary modules:

import gensim
from gensim.models import KeyedVectors

18.0.2 WMD with gensim: loading a pretrained embedding model

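Next, we load a compact pretrained embedding model based on GloVe vectors trained on Wikipedia and the Gigaword news corpus. Its vocabulary is lowercase, which is one reason casing matters in the steps below: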

import gensim.downloader as api

# Load a compact pretrained model
model = api.load("glove-wiki-gigaword-100")

This model provides 100-dimensional word embeddings suitable for instructional demonstrations.

18.0.3 WMD with gensim: defining the example sentences

We now define three sentences for comparison:

sentence_1 = "The analyst explained results during the workshop in Medellín."
sentence_2 = "A specialist discussed findings at a seminar in the city."
sentence_3 = "My bicycle needs maintenance before the weekend trip."

18.0.4 WMD with gensim: computing distances

The distances.

We begin by computing the pairwise Word Mover’s Distance between the sentences:

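# Note: wmdistance is documented to take each document as a list of tokens.
# The sentences are passed here as raw strings, exactly as in the run reported below.
d12 = model.wmdistance(sentence_1, sentence_2)
d13 = model.wmdistance(sentence_1, sentence_3)

print(f"Distance (sentences 1 and 2) = {d12}")
print(f"Distance (sentences 1 and 3) = {d13}")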
## Distance (sentences 1 and 2) = 0.34086718145944445
## Distance (sentences 1 and 3) = 0.3008181961168074

We observe that:

  • \(d_{12}=\mathrm{WMD}(1,2)=0.3409\)

  • \(d_{13}=\mathrm{WMD}(1,3)=0.3008\)

This implies:

\[\mathrm{WMD}(1,2) \;>\; \mathrm{WMD}(1,3).\]

In this run, Sentence 1 is numerically closer to Sentence 3 than to Sentence 2 under WMD (which is the opposite of our initial semantic expectation).

Interpreting the unexpected ordering.

Our working hypothesis was:

\[ \mathrm{WMD}(A,B) \;<\; \mathrm{WMD}(A,C), \]

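meaning that the semantically related pair (1,2) should yield a smaller distance than the unrelated pair (1,3). However, the value returned by wmdistance is highly sensitive to:

  • The input format: gensim documents each argument of wmdistance as a list of tokens. A raw string, as passed above, is iterated character by character, so the values reported here compare distributions of characters rather than of words.

  • Out-of-vocabulary (OOV) tokens.

  • Tokenization decisions.

  • Casing and punctuation.

  • Accented characters, and

  • The specific pretrained embedding model.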

For example, tokens such as Medellín may become OOV depending on preprocessing. When certain semantically important words are removed, the transport structure changes, potentially altering the distance ordering. Therefore, the correct empirical procedure is:

Do not assume the hypothesis holds: compare distances (1,2) and (1,3) directly.
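One input-format detail is worth checking before interpreting any WMD value. gensim documents the arguments of wmdistance as lists of tokens. The sketch below does not call gensim: it mimics, with a hypothetical toy vocabulary and a hypothetical helper `keep_in_vocab`, what happens when any routine filters its input item by item. A raw string then contributes characters, while a split sentence contributes words.

```python
# Hypothetical toy vocabulary standing in for the keys of an embedding model.
# Single letters are included because large pretrained vocabularies usually contain them.
toy_vocab = {"the", "analyst", "explained", "results", "a", "e", "t"}

def keep_in_vocab(document, vocab):
    """Keep the items of `document` that are in `vocab`, iterating item by item."""
    return [token for token in document if token in vocab]

sentence = "the analyst explained results"

as_string = keep_in_vocab(sentence, toy_vocab)          # iterates over characters
as_tokens = keep_in_vocab(sentence.split(), toy_vocab)  # iterates over words

print(as_string[:5])  # ['t', 'e', 'a', 'a', 't']
print(as_tokens)      # ['the', 'analyst', 'explained', 'results']
```

With the model loaded in this chapter, a word-level call would therefore pass token lists, for example `model.wmdistance(sentence_1.lower().split(), sentence_2.lower().split())`. The resulting values are not reproduced here because they were not part of the original run.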

Checking preprocessing and vocabulary coverage.

Before interpreting semantic similarity results, it is methodologically necessary to verify two aspects:

  1. That the text has been properly normalized.

  2. That all tokens are present in the embedding model vocabulary.

If a word is out-of-vocabulary (OOV), the model cannot assign a vector representation to it, which may distort similarity computations.

Text normalization function.

import re
import unicodedata

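def normalize_text(s):
    # 1. Lowercase
    s = s.lower()
    # 2. Decompose accented characters, then drop everything that is not ASCII
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("utf-8")
    # 3. Replace every run of characters other than a-z and whitespace with a space
    s = re.sub(r"[^a-z\s]+", " ", s)
    # 4. Collapse repeated whitespace and trim both ends
    s = re.sub(r"\s+", " ", s).strip()
    return s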

This function applies four preprocessing steps:

  1. Lowercasing: Converts all characters to lowercase to avoid case-sensitive mismatches.

  2. Unicode normalization: Removes diacritics (e.g., canción → cancion), ensuring compatibility with the embedding vocabulary.

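  3. Removal of non-alphabetic characters: Replaces punctuation, digits, and any other character outside a–z with a space. As a side effect, tokens that contain such characters are split (e.g., don't → don t).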

  4. Whitespace standardization: Replaces multiple spaces with a single space and trims leading/trailing spaces.

This guarantees consistent token formatting before checking vocabulary coverage.
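A quick check of the four steps on the chapter's first sentence and on two small made-up strings. The function is repeated verbatim so that the snippet runs on its own.

```python
import re
import unicodedata

def normalize_text(s):
    s = s.lower()
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("utf-8")
    s = re.sub(r"[^a-z\s]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

print(normalize_text("The analyst explained results during the workshop in Medellín."))
# the analyst explained results during the workshop in medellin

print(normalize_text("  Canción   No. 5!  "))
# cancion no

print(normalize_text("don't"))
# don t
```

The last call shows the token-splitting side effect of step 3: the apostrophe becomes a space, so one token turns into two.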

OOV check function.

def oov_tokens(m, s):
    toks = normalize_text(s).split()
    return [t for t in toks if t not in m.key_to_index]

This function:

  1. Normalizes the sentence.

  2. Splits it into tokens.

  3. Checks whether each token exists in the model vocabulary (m.key_to_index).

  4. Returns a list of tokens not found in the model.
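To see why normalization comes first in these steps, the sketch below runs the same membership test without it. Nothing here touches gensim: `toy_model` is a hypothetical stand-in that only mimics the `key_to_index` attribute, and `raw_oov_tokens` is a hypothetical variant written for this comparison.

```python
from types import SimpleNamespace

# Hypothetical stand-in for an embedding model: only `key_to_index` is mimicked.
toy_model = SimpleNamespace(
    key_to_index={"the": 0, "workshop": 1, "in": 2, "medellin": 3}
)

def raw_oov_tokens(m, s):
    """Same membership check as oov_tokens, but on the raw, unnormalized sentence."""
    return [t for t in s.split() if t not in m.key_to_index]

print(raw_oov_tokens(toy_model, "The workshop in Medellín."))
# ['The', 'Medellín.']

print(raw_oov_tokens(toy_model, "the workshop in medellin"))
# []
```

Casing, the trailing period, and the accent each turn an otherwise known word into an apparent OOV token when the vocabulary is lowercase and unaccented.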

Checking each sentence.

print("OOV sentence_1:", oov_tokens(model, sentence_1))
print("OOV sentence_2:", oov_tokens(model, sentence_2))
print("OOV sentence_3:", oov_tokens(model, sentence_3))
## OOV sentence_1: []
## OOV sentence_2: []
## OOV sentence_3: []

The empty lists indicate that:

  • All tokens in each sentence are present in the embedding vocabulary.

  • No information is lost due to missing vector representations.

  • Similarity computations (e.g., cosine similarity or WMD) can be considered reliable with respect to vocabulary coverage.

If the output had included tokens, for example:

OOV sentence_1: ['blockchain', 'cryptomonedas']

this would indicate that those words have no vector representation in the model, potentially affecting semantic distance calculations.

Methodological Note.

Before computing semantic similarity measures, vocabulary coverage should always be verified. Ignoring OOV tokens may introduce silent distortions in embedding-based analyses.

18.0.5 WMD with gensim: recomputing WMD after normalization

The code.

s1n = normalize_text(sentence_1)
s2n = normalize_text(sentence_2)
s3n = normalize_text(sentence_3)

d12 = model.wmdistance(s1n, s2n)
d13 = model.wmdistance(s1n, s3n)

print(f"Distance (1,2) = {d12:.4f}")
print(f"Distance (1,3) = {d13:.4f}")

This block performs two main operations:

  1. Text normalization:

    • Each sentence is cleaned using the previously defined normalize_text() function.

    • This ensures consistent casing, removal of punctuation, and standardized tokens before computing distances.

  2. Recomputation of Word Mover’s Distance (WMD):

    • model.wmdistance() calculates the semantic distance between two sentences.

    • d12 measures the distance between Sentence 1 and Sentence 2.

    • d13 measures the distance between Sentence 1 and Sentence 3.

The output.

The values are printed with four decimal places for clarity.

## Distance (1,2) = 0.3093
## Distance (1,3) = 0.2870

Since WMD is a distance metric, smaller values indicate greater semantic similarity. We compare:

  • If \(d_{12} < d_{13}\), the result aligns with semantic intuition.

  • If \(d_{12} > d_{13}\), the embedding geometry (under this model and preprocessing) places Sentence 1 closer to Sentence 3.

Because \(d_{13} < d_{12}\), the embedding geometry places Sentence 1 closer to Sentence 3 than to Sentence 2.

Methodological insight.

Normalization changed both distances (from 0.3409 to 0.3093 for the pair (1,2), and from 0.3008 to 0.2870 for the pair (1,3)) even though the OOV check found no missing tokens. The ordering, however, stayed the same. This illustrates an important principle in NLP:

Distance-based semantic comparisons depend not only on the metric itself, but also on preprocessing decisions and vocabulary coverage.

19 WMD: contrast with cosine similarity

19.0.1 Mean sentence embedding

Cosine similarity compares aggregated sentence vectors (e.g., mean embeddings), whereas WMD aligns words via optimal transport.

import numpy as np

def sent_vector_mean(m, s):
    toks = [t.strip(".,!?;:()[]\"'").lower() for t in s.split()]
    toks = [t for t in toks if t in m.key_to_index]
    if len(toks) == 0:
        return None
    return np.mean([m[t] for t in toks], axis=0)

This function computes a mean sentence embedding:

  1. The sentence is tokenized and lightly cleaned.

  2. Only tokens present in the embedding vocabulary are retained.

  3. Each token is mapped to its vector representation.

  4. If \(\mathbf{w}_i\) is the embedding of token \(i\), the sentence vector is computed as the arithmetic mean:

\[\mathbf{v}_s \quad = \quad \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i\]
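A small worked instance of this formula, using invented 2-D vectors instead of the pretrained model. The helper `mean_vector` is a hypothetical plain-Python counterpart of `sent_vector_mean`, written so the arithmetic can be followed by hand.

```python
# Hypothetical 2-D vectors, chosen so the mean is easy to verify by hand.
toy_vectors = {
    "analyst": [1.0, 3.0],
    "explained": [3.0, 1.0],
    "results": [2.0, 2.0],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    n = len(known)
    return [sum(component) / n for component in zip(*known)]

# "zzz" is out of vocabulary, so n = 3 and the mean is ((1+3+2)/3, (3+1+2)/3).
print(mean_vector(["analyst", "explained", "results", "zzz"], toy_vectors))
# [2.0, 2.0]

print(mean_vector(["zzz"], toy_vectors))
# None
```

Note that \(n\) counts only the tokens that survive the vocabulary filter, so a sentence with many OOV words is averaged over fewer vectors.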

19.0.2 Cosine similarity

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

This function computes cosine similarity between two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^d\):

\[\cos(\mathbf{u}, \mathbf{v}) \quad =\quad \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \quad \in \quad [-1, 1]\]

Here, \(\mathbf{u} \cdot \mathbf{v}\) denotes the Euclidean inner product, and the \(L_2\) norm (Euclidean norm) of a vector \(\mathbf{v} \in \mathbb{R}^d\) is defined as:

\[\|\mathbf{v}\|_2 \quad =\quad \sqrt{\sum_{i=1}^{d} v_i^2}.\]

Cosine similarity measures angular similarity, not Euclidean distance. It evaluates the angle between vectors rather than their magnitude. It takes values in the continuous interval \([-1,1]\). The extreme cases correspond to:

  • Identical direction (maximum similarity): \(1\)

  • Orthogonal vectors (no linear association): \(0\)

  • Opposite direction: \(-1\)

Intermediate values (e.g., 0.82, 0.34, −0.15) reflect varying angular proximity between vectors. In embedding spaces trained on natural language data, strongly negative cosine values are rare: semantically unrelated words tend to have cosines near zero rather than near \(-1\).
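The three extreme cases can be verified directly. The sketch below uses a plain-Python version of the cosine formula and hand-picked vectors whose norms are whole numbers, so the results are exact.

```python
import math

def cosine(u, v):
    """Cosine similarity: inner product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([3.0, 4.0], [6.0, 8.0]))    # 1.0  (same direction, different length)
print(cosine([1.0, 0.0], [0.0, 1.0]))    # 0.0  (orthogonal)
print(cosine([3.0, 4.0], [-3.0, -4.0]))  # -1.0 (opposite direction)
```

The first case makes the "angle, not magnitude" point concrete: doubling a vector's length leaves its cosine with any other vector unchanged.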

19.0.3 Output interpretation

v1 = sent_vector_mean(model, sentence_1)
v2 = sent_vector_mean(model, sentence_2)
v3 = sent_vector_mean(model, sentence_3)

cos12 = cosine(v1, v2)
cos13 = cosine(v1, v3)

print("| Pair | Cosine similarity |")
print(f"| (1,2) | {cos12:.4f} |")
print(f"| (1,3) | {cos13:.4f} |")
## | Pair | Cosine similarity |
## | (1,2) | 0.9319 |
## | (1,3) | 0.8143 |

Since cosine similarity is a similarity measure, larger values indicate greater semantic relatedness. Here we observe:

\[ \cos(1,2) \;=\; 0.9319 \quad > \quad \cos(1,3) \;=\; 0.8143\]

Thus, Sentence 1 is closer to Sentence 2 than to Sentence 3 under mean-embedding cosine similarity.

19.0.4 Conceptual contrast with WMD

  • Cosine similarity compares aggregated sentence vectors.

  • WMD aligns individual words via optimal transport.

Cosine relies on averaging, which may smooth or blur fine-grained word-level structure. WMD, by contrast, computes the minimal cumulative transport cost between word distributions. In general, if two sentences are more semantically related, we expect:

\[ \text{WMD}(A,B) < \text{WMD}(A,C)\]

However, as shown in this example, the empirical embedding geometry may yield a different ordering depending on preprocessing and model characteristics.

This discrepancy highlights the inherently model-dependent and preprocessing-sensitive nature of semantic distance in embedding spaces.
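The blurring effect of averaging can be made concrete with a toy case. In the sketch below, two invented "documents" of two word vectors each have exactly the same mean, so mean-embedding cosine similarity cannot tell them apart, while a word-level transport cost is clearly positive. The vectors and the helper `matching_cost` are hypothetical; the permutation search is exact only because both documents have the same length and uniform weights.

```python
import math
from itertools import permutations

# Two hypothetical documents, each given directly as a list of 2-D word vectors.
doc_x = [(2.0, 0.0), (0.0, 2.0)]
doc_y = [(1.5, 0.5), (0.5, 1.5)]

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))

def matching_cost(a, b):
    """Minimal average distance over all one-to-one matchings of the words."""
    return min(
        sum(math.dist(p, q) for p, q in zip(a, perm)) / len(a)
        for perm in permutations(b)
    )

print(mean(doc_x), mean(doc_y))          # (1.0, 1.0) (1.0, 1.0)
print(mean(doc_x) == mean(doc_y))        # True: identical direction, cosine is 1
print(round(matching_cost(doc_x, doc_y), 4))  # 0.7071
```

The averaged representation reports the two documents as identical, whereas the transport view still registers that every word has to move.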

19.0.5 Key methodological insight

Cosine similarity and WMD operate on fundamentally different geometric principles:

  • Cosine similarity compares global vector direction.

  • WMD performs distributional alignment in embedding space.

Different metrics may yield different rankings depending on preprocessing, token overlap, and embedding geometry.

20 WMD: cosine similarity with mean embeddings

We now repeat the cosine similarity computation from the previous section with a more defensive implementation. As before, each sentence is represented by the mean of its word embeddings, which produces a single dense vector per sentence that can be compared using cosine similarity.

The following code:

  • Tokenizes each sentence.

  • Removes out-of-vocabulary (OOV) tokens.

  • Computes the mean embedding.

  • And evaluates cosine similarity for the pairs (1,2) and (1,3).

import numpy as np

def sent_vector_mean(m, s):
    # Simple tokenization + OOV filtering
    toks = [t.strip(".,!?;:()[]\"'").lower() for t in s.split()]
    toks = [t for t in toks if t in m.key_to_index]
    if len(toks) == 0:
        return None
    return np.mean([m[t] for t in toks], axis=0)

def cosine(u, v):
    # Safe cosine computation
    if u is None or v is None:
        return None
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return None
    return float(np.dot(u, v) / (norm_u * norm_v))

v1 = sent_vector_mean(model, sentence_1)
v2 = sent_vector_mean(model, sentence_2)
v3 = sent_vector_mean(model, sentence_3)

cos12 = cosine(v1, v2)
cos13 = cosine(v1, v3)

print("| Pair | Cosine similarity |")

if cos12 is not None:
    print(f"| (1,2) | {cos12:.4f} |")
else:
    print("| (1,2) | NA (vector not available) |")

if cos13 is not None:
    print(f"| (1,3) | {cos13:.4f} |")
else:
    print("| (1,3) | NA (vector not available) |")
## | Pair | Cosine similarity |
## | (1,2) | 0.9319 |
## | (1,3) | 0.8143 |

In contrast to WMD, cosine similarity in this example behaves as expected, assigning a higher similarity score to the semantically related pair (1,2).

20.0.1 Interpretation

From the computed results, we observe:

\[\cos(1,2)\; =\; 0.9319 \quad > \quad \cos(1,3)\; =\; 0.8143\]

In general, cosine similarity increases with semantic relatedness. Likewise, because Word Mover’s Distance (WMD) is a distance measure, semantic similarity should correspond to smaller values:

\[\mathrm{WMD}(1,2) < \mathrm{WMD}(1,3).\]

However, in our computed example we obtained:

\[\mathrm{WMD}(1,2) \;= \; 0.3093 \quad >\quad \mathrm{WMD}(1,3) \;= \; 0.2870.\]

Under this specific embedding model and preprocessing pipeline, Sentence 1 is therefore placed closer to Sentence 3 than to Sentence 2. This does not contradict the theory of WMD; rather, it illustrates its sensitivity to practical implementation details.

This example illustrates that similarity metrics are grounded in distinct geometric principles, which may produce divergent empirical rankings even for the same sentences and the same embedding model.

20.0.2 Methodological Note

Discrepancies between intuitive semantic similarity and computed WMD values typically arise from:

  • Out-of-vocabulary (OOV) tokens.

  • Preprocessing inconsistencies.

  • Accented or rare words.

  • Differences in token coverage across sentences.

  • The geometry induced by the specific pretrained embedding model.

Therefore, embedding-based semantic comparisons depend not only on the metric itself, but also on preprocessing decisions and vocabulary coverage. WMD is theoretically well-founded, yet empirically sensitive to preprocessing and embedding geometry.

21 WMD: normalizing embeddings (optional)

Although WMD does not require explicit vector normalization, some practitioners precompute vector norms for computational efficiency.

model.fill_norms()

This operation precomputes and stores the \(L_2\) norms of the embedding vectors for efficient similarity computations. It does not modify the underlying vectors themselves. Recall that the \(L_2\) norm (Euclidean norm) of a vector \(\mathbf{v} \in \mathbb{R}^d\) is defined as:

\[\|\mathbf{v}\|_2 \quad =\quad \sqrt{\sum_{i=1}^{d} v_i^2}.\]

The \(L_2\) norm measures the magnitude (length) of a vector in Euclidean space. Precomputing these norms allows faster evaluation of similarity metrics such as cosine similarity, which depends on vector magnitudes:

\[ \cos(\mathbf{u}, \mathbf{v}) \quad = \quad \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \quad \in \quad [-1, 1].\]

After precomputing norms, distances can be recomputed:

d12_norm = model.wmdistance(sentence_1, sentence_2)
d13_norm = model.wmdistance(sentence_1, sentence_3)

d12_norm, d13_norm
## (0.34086718145944445, 0.3008181961168074)

Here the recomputed values are identical to those obtained in Section 18.0.4. This is expected: fill_norms() only caches the norms and leaves the vectors unchanged, so it can affect speed but not the distances or their ordering. If text normalization and OOV handling were already applied carefully, this step is optional.
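The idea of precomputing norms can be sketched without gensim. The code below is an illustration of the caching principle only, not of gensim's internals: the vectors, the `cached_norms` dictionary, and the helper `cosine_cached` are all hypothetical.

```python
import math

# Hypothetical 2-D vectors with whole-number norms (5 and 10).
toy_vectors = {"city": (3.0, 4.0), "seminar": (6.0, 8.0)}
original = dict(toy_vectors)  # copy kept only to show that nothing is modified

# Compute each L2 norm once and store it next to the vectors.
cached_norms = {
    word: math.sqrt(sum(x * x for x in vec)) for word, vec in toy_vectors.items()
}

def cosine_cached(w1, w2):
    """Cosine similarity that reuses the stored norms instead of recomputing them."""
    dot = sum(a * b for a, b in zip(toy_vectors[w1], toy_vectors[w2]))
    return dot / (cached_norms[w1] * cached_norms[w2])

print(cached_norms)                      # {'city': 5.0, 'seminar': 10.0}
print(cosine_cached("city", "seminar"))  # 1.0
print(toy_vectors == original)           # True: the vectors are untouched
```

Caching changes how often the square roots are evaluated, not what they evaluate to, which is why the distances in the output above are unchanged.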

22 WMD: interpreting the results

Word Mover’s Distance (WMD) provides a principled way to compare documents based on their word-level semantic structure.

  • Smaller WMD values indicate greater semantic similarity.

  • Larger WMD values indicate greater semantic divergence.

Unlike cosine similarity, which compares aggregated sentence vectors, WMD explicitly aligns words across documents using an optimal-transport formulation. This allows WMD to capture fine-grained semantic structure that simpler similarity measures may overlook. However, WMD is sensitive to several practical factors:

  • Tokenization choices.

  • Lowercasing and accent handling.

  • Out-of-vocabulary (OOV) words.

  • Coverage of the pretrained embedding model.

Therefore, interpretation must always consider preprocessing and vocabulary coverage.

Embedding-based semantic comparisons depend not only on the metric, but also on preprocessing decisions and model coverage.

In summary, WMD is a powerful distance metric, but its empirical behavior depends critically on the interaction between:

  1. The embedding geometry.

  2. The preprocessing pipeline.

  3. And the vocabulary represented in the model.

This makes WMD both theoretically principled and empirically sensitive.

23 Summary

In this document, we extended the discussion initiated in Transforming Text into Data Structures by shifting the focus from purely syntactic representations to semantic modeling of text.

Rather than treating words as isolated symbolic units, we examined how distributional information (especially word co-occurrence patterns) can be leveraged to approximate semantic structure.

We examined the geometric intuition behind word embeddings, analyzed how semantic regularities emerge through vector arithmetic, and studied the internal architecture of Word2Vec (specifically the Skip-gram and CBOW paradigms) along with practical considerations for training and deploying pretrained models.

Building on this foundation, we trained custom Word2Vec models from scratch, investigated the role of key hyperparameters, and reflected on known limitations of static embeddings (such as contextual ambiguity and the amplification of biases present in the training corpus). Several real-world applications were also highlighted, illustrating how word embeddings can be leveraged for similarity, clustering, and information retrieval tasks.

Finally, we introduced Word Mover’s Distance (WMD) as an optimal-transport-based framework for comparing documents in embedding space, illustrating how semantic distances can be quantified beyond simple aggregation-based similarity measures.

24 Applied activity: from word embeddings to semantic similarity

This activity extends the previous representations (Bag-of-Words and TF-IDF) toward embedding-based representations, allowing for a more meaningful comparison of words and texts based on semantic information.

While previous activities focused on frequency-based representations, this task introduces pretrained word embeddings (e.g., GloVe or Word2Vec), where similarity reflects contextual meaning rather than surface form.

24.0.1 Objective

To construct a reproducible workflow that:

  • explores pretrained word embeddings,

  • analyzes semantic similarity between words,

  • performs analogy-style reasoning, and

  • compares short texts using embedding-based representations.

24.0.2 Instructions

  1. Select a small set of words and short sentences (you may reuse or adapt the corpus from documents developed in previous activities).

  2. Use a pretrained embedding model (e.g., via gensim).

  3. Create an R Markdown (.Rmd) document that compiles successfully to HTML (or PDF).

  4. The document must include:

    • the code, and

    • the resulting output (tables, printed objects, or numerical results).

24.0.3 Required sections

1. From frequency-based to embedding-based representations

Briefly explain:

  • how embeddings differ from Bag-of-Words and TF-IDF,

  • why embeddings capture semantic similarity more effectively.

Keep this section concise (3–5 lines).

2. Embedding model description

Describe:

  • the pretrained model used (e.g., GloVe via gensim),

  • the training corpus (if available),

  • the dimensionality of the vectors.

3. Vocabulary exploration

Select 10–15 words and:

  • verify whether they are in the model vocabulary,

  • report missing words (if any).

Briefly explain why some words may not appear (OOV).

4. Nearest neighbors

Select three words and:

  • compute their top 5 most similar words using cosine similarity,

  • present results in a table.

Interpret the relationships observed.

5. Analogy reasoning

Construct at least two analogies of the form: \(A - B + C \approx D\). Report:

  • input words,

  • predicted result(s),

  • a brief interpretation.
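As a reminder of the arithmetic behind \(A - B + C \approx D\), here is a minimal sketch on invented 2-D vectors. It is not a template for the required output: the vectors are hand-picked so that the analogy works, and the helper `analogy` is hypothetical. With a pretrained model, gensim's `most_similar(positive=[...], negative=[...])` plays this role.

```python
import math

# Hypothetical 2-D vectors: the offset man -> woman equals the offset king -> queen.
toy_vectors = {
    "man": (1.0, 0.0), "woman": (1.0, 1.0),
    "king": (3.0, 0.0), "queen": (3.0, 1.0),
    "apple": (0.0, 3.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors):
    """Return the word whose vector is most similar (by cosine) to a - b + c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    # The three input words are excluded, as is customary for analogy queries.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

On real embeddings the nearest candidate is not guaranteed to be the intuitive answer, which is why the activity asks for an interpretation of the predicted results.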

6. Word similarity matrix

Select 4–6 words and:

  • compute pairwise cosine similarity,

  • present results as a matrix.

Interpret which words are most/least similar.

7. From words to text similarity

Reuse three short sentences (from documents developed in previous activities, if possible).

Compute similarity using:

  • averaged word embeddings (cosine similarity).

Identify:

  • most similar pair,

  • least similar pair.

Compare briefly with TF-IDF results from documents developed in previous activities.

8. Conceptual reflection

Write a short reflection (6–10 lines) discussing:

  • what embeddings capture that TF-IDF does not,

  • one limitation of Word2Vec (e.g., lack of context sensitivity),

  • when embeddings are preferable.

24.0.4 Reproducibility Requirement

  • The R Markdown document must be fully reproducible.

  • All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.

  • All random seeds (if applicable) must be set to ensure deterministic results.

  • All library versions used should be clearly reported.

  • All code must be self-contained within the document. External dependencies are not allowed unless explicitly included.

  • The use of external functions not included in the document prevents reproducibility, as the analysis cannot be executed independently.
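One possible way to satisfy the seed and version requirements from a Python chunk is sketched below. The helper name `package_version`, the seed value, and the list of packages are assumptions chosen for illustration; adapt them to the libraries actually used.

```python
import random
import sys
from importlib import metadata

# Fix the seed of Python's built-in generator (the value itself is arbitrary).
random.seed(2023)

def package_version(name):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

print("python", sys.version.split()[0])
for pkg in ("gensim", "numpy", "POT"):
    print(pkg, package_version(pkg))
```

Libraries with their own random number generators need their seeds set separately.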

24.0.5 Submission Guidelines

To facilitate the review process and maintain the academic quality of submissions, the following requirements must be met:

  • The assignment must be submitted in both formats:

    • PDF

    • HTML

  • The document must be fully reproducible (see Reproducibility Requirement).

  • If additional files are used (e.g., modules, datasets, or auxiliary functions), they must:

    • Be included in the submission,

    • Be clearly organized, and

    • Be properly documented.

  • Multiple versions of the same assignment must not be submitted unless the final version is clearly identified. Submitting multiple unclear versions may hinder the evaluation process.

24.0.6 Submission Timing and Late Policy

Timely submission of assignments is strictly required.

  • Assignments must be submitted by the deadlines specified for each activity.

  • A grace period of 30 minutes is allowed after the deadline.

  • Submissions made after the grace period will incur penalties.

  • Penalty: −1.0 point for each hour or fraction of an hour of delay.

  • Submissions significantly past the deadline may receive a substantially reduced grade or may not be accepted.

References