
1 Prerequisites and Software Setup

The examples developed in this document are based on pretrained word embeddings and vector-based distance computations. To ensure that all code runs correctly, a small set of Python libraries must be available before executing the examples.

The commands shown below are provided for reference only. The installation commands should be run in a terminal (for example, a system shell or a Conda prompt), while the Python code can be executed in any Python environment, including an R Markdown document configured with the reticulate package.

1.0.1 Required Python packages

### Core libraries for word embeddings
pip install gensim

### Optional dependency for Word Mover’s Distance
pip install pot

The pot package (Python Optimal Transport) is required only for computing Word Mover’s Distance (WMD). If it is not installed, all other examples in this document will still run correctly.
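As a quick check that this optional dependency is working, the following sketch computes the WMD between two short tokenized texts. It is shown for reference only: it assumes the pot package is installed and uses the same pretrained GloVe model introduced later in this document, and the two example documents are arbitrary.

import gensim.downloader as api

# Load the pretrained model used throughout this document
model = api.load("glove-wiki-gigaword-100")

# Two short tokenized documents (arbitrary examples)
doc1 = ["students", "attend", "the", "university"]
doc2 = ["pupils", "go", "to", "school"]

# Word Mover's Distance: lower values indicate more similar documents.
# This call requires the optional pot package.
model.wmdistance(doc1, doc2)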

1.0.2 Python imports used throughout this document

import numpy as np
import gensim
from gensim.models import KeyedVectors, Word2Vec
import gensim.downloader as api

1.0.3 Role of the libraries

The purpose of each library used in this chapter is summarized below.

  • gensim: Provides tools for loading, training, and querying word embedding models. In this document, it is used to:

    • load pretrained embeddings (e.g., GloVe),

    • compute similarity queries,

    • perform vector arithmetic,

    • and calculate Word Mover’s Distance when the required dependency is available.

  • gensim.downloader: Offers a convenient interface for downloading lightweight pretrained embedding models, avoiding manual file handling.

  • Word2Vec (gensim.models): Included to illustrate how embedding models can also be trained from scratch on custom corpora, even though the main focus of this chapter is on pretrained representations.

  • numpy: Supports low-level numerical operations and vector computations required for similarity and distance calculations.

  • pot (Python Optimal Transport): Implements optimal transport algorithms used internally by gensim to compute Word Mover’s Distance. This dependency is only needed for WMD-related examples.

2 Trained and pretrained models

Definition.

  • Word embedding models are models trained on textual data with the objective of learning continuous vector representations of words that capture semantic and syntactic relationships.

  • When such models are trained in advance on large, general-purpose corpora and later reused without further training, they are referred to as pretrained models.

Training of pretrained models.

  • Pretrained word embedding models are typically learned from very large text collections, such as Wikipedia or news archives.

  • Rather than training embeddings from scratch for each task, these models can be reused to explore semantic relationships, compute similarities, and perform analogy-based queries.

Use of pretrained models.

In this document, pretrained models are used to:

  • illustrate how words are represented as vectors,

  • explore semantic similarity and distance measures, and

  • analyze relationships between words and short texts.

Download of pretrained models.

To ensure reproducibility and ease of setup, all examples rely on compact pretrained models that can be downloaded automatically.
Large external embedding files are intentionally avoided to guarantee consistent execution across different systems.

2.1 Example tokens from a pretrained embedding model

Pretrained embedding models typically contain vector representations for tens or hundreds of thousands of tokens. Throughout the document, we will work with a pretrained GloVe model trained on Wikipedia and Gigaword news data, using 100-dimensional word vectors. The following code illustrates how such tokens can be accessed once a pretrained model has been loaded. The chunk is shown for reference only and is not explained at this stage.

import gensim.downloader as api

# Download and load a compact pretrained model 
model = api.load("glove-wiki-gigaword-100")

# Inspect a small sample of tokens
list(model.key_to_index.keys())[:50]
## ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as', 'it', 'by', 'at', '(', ')', 'from', 'his', "''", '``', 'an', 'be', 'has', 'are', 'have', 'but', 'were', 'not', 'this', 'who', 'they', 'had', 'i', 'which', 'will', 'their', ':', 'or', 'its', 'one', 'after']

The output is a list of common tokens (mostly frequent function words and punctuation marks, followed by common nouns and verbs) for which vector representations are available. In practice, pretrained models contain many more tokens than those shown here. Models accessed through gensim.downloader are downloaded automatically on first use and cached locally for future sessions.
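Once the model is loaded, the vector associated with any in-vocabulary token can be retrieved directly. The snippet below is a small sketch using the token city; any other token present in the vocabulary works the same way.

# Retrieve the embedding vector for a single token
vec = model["city"]

vec.shape      # (100,) for this 100-dimensional model
vec[:5]        # first five components of the vector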

2.1.1 Other pretrained embedding models available in gensim

In addition to the GloVe model used throughout this document, gensim.downloader provides access to several other pretrained word embedding models. These models differ in training corpus, dimensionality, and intended use. The following code lists all pretrained models available via gensim.downloader.

import gensim.downloader as api

# List all available pretrained models
api.info()["models"].keys()
## dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

Common alternatives include:

  • glove-wiki-gigaword-50: A lower-dimensional version of the Wikipedia-based GloVe model, suitable for fast experimentation and visualization.

  • glove-wiki-gigaword-200: A higher-dimensional variant that may capture finer semantic distinctions, at the cost of increased memory usage.

  • glove-twitter-25: Trained on Twitter data; useful for informal language, abbreviations, and social media text.

  • word2vec-google-news-300: A Word2Vec model trained on a large news corpus. Due to its size, it is not recommended for lightweight or instructional settings.

  • fasttext-wiki-news-subwords-300: A FastText model trained on Wikipedia and news data that incorporates subword information and can handle out-of-vocabulary words more effectively.
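Any of these identifiers can be passed to api.load(). As a sketch, the lower-dimensional Wikipedia-based model can be loaded and inspected as follows; the first call downloads the file and caches it locally.

import gensim.downloader as api

# Load a smaller, faster model for quick experimentation
small_model = api.load("glove-wiki-gigaword-50")

small_model.vector_size        # expected to be 50
len(small_model.key_to_index)  # size of its vocabulary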

3 Preliminaries

In the document Transforming Text into Data Structure, we introduced frequency-based text representations such as bag-of-words and term frequency–inverse document frequency (TF–IDF). These methods encode text numerically by focusing on the presence and frequency of words within documents and across a corpus.

Although effective in many applications, frequency-based representations largely ignore the contextual surroundings of words. They do not account for which terms tend to appear before or after a given word, even though this local neighborhood plays a crucial role in shaping meaning. The semantic role of a word is strongly influenced by the context in which it appears.

In this document, we build on this idea by introducing word embeddings, which represent words as vectors learned from their contextual usage in text. These representations aim to capture semantic relationships between words rather than relying solely on surface-level frequency information.

The following topics are covered in this document:

  • Understanding word embeddings,

  • Demystifying the Word2Vec model,

  • Training a Word2Vec model from text data, and

  • Introducing Word Mover’s Distance as a measure of semantic similarity between texts.

Throughout this document, small helper functions are introduced only when they become necessary, in order to simplify repetitive tasks and improve the robustness of the examples.

4 Learning semantic word representations

4.0.1 From Distributional Hypothesis to Vector Geometry

Word embeddings are learned representations in which words are encoded as numerical vectors in an n-dimensional space. The central intuition is that words with related meanings tend to occupy nearby positions in this space. As a result, embeddings make it possible to quantify semantic similarity, discover related terms, and analyze meaningful relationships between words.

Although embeddings are often introduced at the word level, the same principle can be extended to larger linguistic units, such as sentences or documents. In this document, however, the focus is on word-level embeddings and on how they reflect semantic information derived from textual context.

Models such as Word2Vec learn these representations by analyzing patterns of word co-occurrence in large text corpora. Words that appear in similar contexts tend to acquire similar vector representations. To build intuition about the type of semantic structure captured by such models, it is useful to examine a few illustrative examples.


Figure 4.1: Word embeddings. Source: Created by the author with ChatGPT (OpenAI)

4.0.2 Example: education and professional roles

Consider terms related to education and professional roles. If a Word2Vec model has been trained appropriately and the relevant terms are present in its vocabulary, one may observe vector relationships such as:

\[\overrightarrow{\text{teacher}} \quad-\quad \overrightarrow{\text{school}} \quad+\quad \overrightarrow{\text{university}} \quad\approx\quad \overrightarrow{\text{professor}}\]

Graphically, this relationship can be interpreted as a sequence of vector operations in the embedding space. The arrow above each word denotes its vector representation, that is, the numerical vector learned by the Word2Vec model for that word.

In this analogy, the role of a teacher in a school is related to the role of a professor in a university. Rearranging the expression highlights the symmetry of the relationship:

\[\overrightarrow{\text{teacher}} \quad+\quad \overrightarrow{\text{university}} \quad\approx\quad \overrightarrow{\text{school}} \quad+\quad \overrightarrow{\text{professor}}\]

In other words, Word2Vec captures regularities in language by encoding comparable relationships as similar geometric transformations in the embedding space.
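These relationships can be checked directly with a pretrained model. The sketch below composes the vectors with numpy and measures the cosine similarity between the result and the vector for professor; it assumes the GloVe model from Section 2.1 is loaded as model (whose tokens are stored in lowercase).

import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compose teacher - school + university and compare with professor
v = model["teacher"] - model["school"] + model["university"]
cosine(v, model["professor"])   # a relatively high value is expected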

4.0.3 Example: geographical relationships

As a second example, consider geographical relationships that do not rely on country–capital pairs. Instead, we examine relationships between countries and their corresponding demonyms or nationalities. A typical analogy captured by word embeddings may take the form:

\[\overrightarrow{\text{Japan}} \quad-\quad \overrightarrow{\text{Japanese}} \quad+\quad \overrightarrow{\text{Italian}} \quad\approx\quad \overrightarrow{\text{Italy}}\]

Conceptually, this corresponds to transferring a relationship learned in one geographical context to another. Here, the association between a country and its demonym is shifted across contexts. This example illustrates the ability of embedding models to generalize relational patterns beyond individual word pairs.

4.0.4 Interpreting vector offsets and relational structure

At first glance, such results may appear surprising. However, they arise naturally from the distributional learning process underlying Word2Vec: words that share similar contextual roles tend to exhibit consistent vector offsets.

To understand how this semantic structure emerges from raw text, we now turn to the learning mechanism behind these representations. This leads us to a closer examination of the Word2Vec algorithm, discussed in detail in the next section.

It is important to note that the vectors produced by these algebraic operations are not exactly equal to the true word representations learned by the model. Rather, they are sufficiently close to demonstrate that meaningful semantic relationships are encoded within the embedding space.


Figure 4.2: Interpreting vector offsets and relational structure. Source: Created by the author with ChatGPT (OpenAI)

5 Making sense of Word2Vec

5.0.1 Word2Vec: intuition behind this model

The intuition behind this model is closely aligned with a classic idea from linguistics:

You shall know a word by the company it keeps (J. R. Firth).

This statement captures the distributional hypothesis, which asserts that words occurring in similar contexts tend to have similar meanings.

Word2Vec operationalizes this idea by learning vector representations of words directly from their surrounding textual environments. Rather than treating words as isolated symbolic tokens, the model assigns each word a point in a continuous vector space. The geometric position of each word reflects the statistical patterns of the contexts in which it appears.

At its foundation, Word2Vec relies on a shallow neural architecture that transforms discrete co-occurrence information into dense numerical embeddings. Through iterative optimization, the model encodes regularities of language usage into the structure of the embedding space. As a result, both semantic and syntactic relationships emerge implicitly in the geometry of that space: words with similar meanings tend to cluster together, while relational patterns can often be expressed as vector operations.

The Word2Vec framework was introduced by Mikolov et al. (2013) at Google, marking a major milestone in the development of distributional representations in Natural Language Processing (NLP).

Its combination of conceptual simplicity, computational efficiency, and empirical effectiveness played a pivotal role in the widespread adoption of embedding-based methods in text analysis and downstream machine learning applications.


Figure 5.1: Intuition behind Word2Vec. Source: Created by the author with ChatGPT (OpenAI)

Before examining the internal structure of Word2Vec in detail, it is useful to briefly clarify the distinction between supervised and unsupervised learning. This distinction helps situate Word2Vec within the broader landscape of machine learning models.

5.0.2 Word2Vec: overview of supervised and unsupervised learning

A detailed discussion of supervised and unsupervised learning appears later in Identifying Patterns in Text Using Machine Learning. Here, we provide a concise overview to establish the necessary background.

  • Supervised learning refers to learning scenarios in which each observation is associated with a known outcome or label. For example, in an email classification task, messages may be labeled as spam or not spam, and the model learns to predict these labels based on textual features.

  • Unsupervised learning involves data for which no explicit labels are available. A typical example is grouping documents or users based on similarity in behavior or content, without predefined categories. The goal is to uncover latent structure or patterns directly from the data.

With this distinction in mind, we can now examine how Word2Vec fits into this taxonomy.


Figure 5.2: Supervised and Unsupervised learning. Source: Created by the author with ChatGPT (OpenAI)

5.0.3 Word2Vec: supervised or unsupervised?

Word2Vec is commonly categorized as an unsupervised method for learning word embeddings. Within its training procedure, the model is designed to perform one of two related tasks:

  • Predict a word based on its surrounding context, or

  • Predict surrounding context words given a target word.

Although these tasks involve prediction, the signals used during training are derived entirely from the text itself. There is no externally provided target label, as would be required in a supervised learning setting.

Because all learning signals originate from raw, unlabeled text, Word2Vec learns word representations in an unsupervised manner. The resulting embeddings reflect statistical regularities present in natural language, extracted without manual annotation.

5.0.4 Word2Vec: pretrained model

As discussed earlier, the goal of Word2Vec is to capture semantic and syntactic relationships between words within a text corpus. In practice, training such models from scratch can be computationally demanding and typically requires very large datasets.

To address this limitation, several pretrained Word2Vec models are publicly available and widely used. These models are trained on large-scale text collections and can be directly applied to downstream tasks or further adapted to domain-specific corpora through fine-tuning.

The output of a Word2Vec model can be represented as a matrix of size \(|W| \times K\), where \(|W|\) denotes the vocabulary size and \(K\) corresponds to the dimensionality of the embedding space. Each row of this matrix contains the vector representation of a single word. In practical applications, the dimensionality \(K\) typically ranges from 50 to 300, depending on the size of the training corpus and the level of semantic granularity required.
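The same structure can be inspected for the GloVe model loaded in Section 2.1: gensim stores the full embedding matrix in the vectors attribute of the loaded KeyedVectors object, as in this brief sketch.

# Shape of the full embedding matrix: |W| rows and K columns
model.vectors.shape   # (400000, 100) for glove-wiki-gigaword-100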

A number of pretrained Word2Vec implementations can be accessed through standard NLP libraries, such as gensim. In addition, large-scale pretrained models have been released by industry and research groups. One widely referenced example is a Word2Vec model trained on a large news corpus, which includes approximately three million words and phrases, with each word represented by a 300-dimensional vector. Due to its size (around 1.5 GB), this model is primarily suited for large-scale or research-oriented applications and is available through public repositories such as the Google Code Archive.

6 Exploring a pretrained word embedding model with gensim

6.0.1 gensim: loading a pretrained embedding model

For instructional purposes, it is often preferable to work with compact pretrained embedding models that can be downloaded automatically. The gensim library provides convenient access to several such models through its internal data repository. To begin, we load a lightweight pretrained model based on GloVe embeddings trained on Wikipedia data.

import gensim.downloader as api

# Load a compact pretrained embedding model
model = api.load("glove-wiki-gigaword-100")

6.0.2 gensim: verifying vocabulary coverage

This approach avoids manual file handling and helps ensure that the examples run consistently across different systems. Before performing similarity queries or vector arithmetic, it is important to verify that the words we plan to use are actually present in the model’s vocabulary. Pretrained embedding models only contain vectors for words observed during training. If a word is missing, any attempt to query it will result in an error.

To illustrate this, we define a small list of general-purpose words:

candidates = [
    "city", "country", "river", "music", "science",
    "computer", "economy", "school", "university",
    "government", "health"
]

We then check which of these words are included in the model’s vocabulary:

[w for w in candidates if w in model.key_to_index]

The above expression is a list comprehension. It performs the following steps:

  • Iterates over each word w in the list candidates.

  • Checks whether that word exists in model.key_to_index.

  • Keeps only those words that are present in the vocabulary.

Here, model.key_to_index is a dictionary that maps each word in the pretrained model to its internal index. Therefore, the condition

w in model.key_to_index

verifies whether the embedding model contains a vector representation for the word w. The output confirms that all selected candidate words are covered by the model:

## ['city', 'country', 'river', 'music', 'science', 'computer', 'economy', 'school', 'university', 'government', 'health']

6.0.3 gensim: why is this step useful?

This verification step serves two purposes:

  • Technical safety: It prevents runtime errors when querying embeddings for out-of-vocabulary words.

  • Conceptual clarity: It reinforces the idea that embedding models operate only over a fixed vocabulary learned during training.

Only after confirming vocabulary coverage does it make sense to proceed with similarity queries, analogy tasks, or vector arithmetic.

6.0.4 gensim: inspecting the internal vocabulary structure

To better understand how tokens are stored internally, we can display a small sample of words from the model’s vocabulary:

# View a small sample of tokens
list(model.key_to_index.keys())[:20]
## ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as']

This command extracts the first 20 tokens from the vocabulary dictionary. It provides a quick glimpse into how words are indexed and stored within the pretrained model.

6.0.5 gensim: vocabulary size and embedding dimensionality

We can inspect the size of the vocabulary included in the pretrained model as follows:

len(model.key_to_index)
## 400000

This value indicates the total number of unique tokens for which vector representations are available. Next, we verify the dimensionality of the word vectors:

model.vector_size
## 100

In this case, each word is represented as a vector in a 100-dimensional embedding space. This dimensionality determines the number of numerical features used to encode semantic and syntactic information.

6.0.6 gensim: accessing the complete vocabulary mapping

For completeness, the full vocabulary-to-index mapping can be accessed via:

model.key_to_index

This object is a dictionary where:

  • Keys correspond to vocabulary words.

  • Values correspond to their internal integer indices.

The output is intentionally suppressed here, as it is too large to display meaningfully in the rendered document.

6.0.7 gensim: nearest neighbors with most_similar()

One of the most direct ways to investigate what a word embedding model has learned is to examine nearest neighbors in the embedding space. The method most_similar() retrieves the words whose vectors are closest to a given query word, where closeness is measured using cosine similarity.

The output consists of pairs (word, similarity score), where higher values indicate stronger semantic proximity to the query word.

model.most_similar("city", topn=10)
## [('town', 0.8263899087905884), ('cities', 0.7764331698417664), ('where', 0.754779040813446), ('area', 0.7458435297012329), ('downtown', 0.7437540292739868), ('capital', 0.713450014591217), ('southern', 0.7070139646530151), ('near', 0.7027733325958252), ('neighborhood', 0.6954501271247864), ('suburb', 0.6925175189971924)]

\[ \begin{array}{ll} \textbf{Word} & \textbf{Cosine Similarity} \\ \hline \texttt{town} & 0.8264 \\ \texttt{cities} & 0.7764 \\ \texttt{where} & 0.7548 \\ \texttt{area} & 0.7458 \\ \texttt{downtown} & 0.7438 \\ \texttt{capital} & 0.7135 \\ \texttt{southern} & 0.7070 \\ \texttt{near} & 0.7028 \\ \texttt{neighborhood} & 0.6955 \\ \texttt{suburb} & 0.6925 \\ \end{array} \]

These results illustrate how semantic similarity emerges geometrically: words that tend to appear in related contexts occupy nearby positions in the vector space. It is important to note that this operation requires the query token to exist exactly as stored in the model vocabulary. If the token is absent, gensim raises a KeyError.
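The similarity scores returned by most_similar() are ordinary cosine similarities, and they can be reproduced either with the similarity() method or with a direct numpy computation, as in the following sketch.

import numpy as np

# Cosine similarity via gensim; should match the value for 'town' above
model.similarity("city", "town")

# The same value computed manually from the raw vectors
a, b = model["city"], model["town"]
float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))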

6.0.8 gensim: remarks

  • Pretrained embedding models follow specific tokenization conventions. Proper nouns, rare words, inflected forms, or multiword expressions may be absent or represented differently.

  • Cosine similarity does not imply strict synonymy. It measures distributional proximity, which may reflect topical association rather than identical meaning.

  • For instructional purposes, selecting common nouns and broadly used terms typically yields more stable and reproducible examples.

These nearest-neighbor queries provide intuition about local structure in the embedding space. We now extend this idea to analogy-based vector arithmetic, which explores relational structure rather than simple proximity.

7 Analogy queries with pretrained embeddings

Beyond nearest-neighbor exploration, word embeddings can also be examined through analogy-style queries. Unlike simple similarity queries (which measure proximity to a single word), analogy queries explore relational structure in the embedding space.

The central idea is that semantic relationships can often be expressed as vector offsets. For example, if two words share a comparable relationship, the difference between their vectors tends to be similar.

Formally, analogy queries attempt to solve expressions of the form:

\[\mathbf{v}_b \quad -\quad \mathbf{v}_a \quad + \quad \mathbf{v}_c \quad \approx \quad \mathbf{v}_d,\]

where the model searches for the word \(d\) whose vector is closest (in cosine similarity) to the resulting vector. Before running such queries, we define helper utilities to ensure that all required tokens exist in the vocabulary. This avoids runtime errors and keeps examples reproducible.

# Helper utilities for safer, reproducible analogy queries

def in_vocab(w, m):
    return w in m.key_to_index

def require_tokens(m, tokens):
    missing = [t for t in tokens if not in_vocab(t, m)]
    return missing

def analogy(m, positive, negative, topn=1):
    return m.most_similar(
        positive=positive,
        negative=negative,
        topn=topn
    )

In the previous code block:

  • in_vocab() is used to verify whether a given token is present in the embedding vocabulary.

  • require_tokens() checks a collection of tokens and reports any that are missing before a query is executed.

  • analogy() acts as a lightweight interface to perform analogy-based queries using the underlying most_similar() method.

One effective way to investigate the semantic structure learned by an embedding model is through analogy-style queries. These queries operate by combining word vectors through addition and subtraction, and then identifying the nearest vectors in the embedding space.

Within the gensim framework, this behavior is specified using:

  • positive=[...] to incorporate semantic components,

  • negative=[...] to subtract semantic components,

  • topn = k to retrieve the \(k\) closest candidate words.

7.0.1 Queries: example (a semantic shift within everyday concepts)

One relationship (as text).

Consider the following relationship:

\[\text{teacher} \quad - \quad \text{school} \quad + \quad \text{university} \quad \quad \approx \quad \text{professor}\]

Conceptually, this analogy asks:

If a teacher is associated with a school, what is the corresponding role associated with a university?

The relationship vectorially.

In vector terms, we compute:

\[\mathbf{v}_{\text{teacher}} \quad - \quad \mathbf{v}_{\text{school}} \quad +\quad \mathbf{v}_{\text{university}}\]

and search for the word vector that is closest (under cosine similarity) to this resulting vector.

pos = ["teacher", "university"]
neg = ["school"]

missing = require_tokens(model, pos + neg)
missing
## []
# Run the analogy only if all tokens exist
if len(missing) == 0:
    analogy(model, positive=pos, negative=neg, topn=1)
else:
    "Some tokens are missing from the vocabulary: " + ", ".join(missing)
## [('professor', 0.8101112842559814)]

Example output.

[('professor', 0.8101)]

Interpretation.

The model correctly identifies professor as the closest match. This suggests that the offset

\[\mathbf{v}_{\text{teacher}} \quad - \quad \mathbf{v}_{\text{school}}\]

encodes a “professional-role-in-institution” relationship, which can be transferred to a new institutional context.

This illustrates an important property of embeddings:

Relationships between words can be represented as approximately linear transformations in vector space.

7.0.2 Queries: inspecting multiple candidates

Example: inspect more than one candidate.

In practice, it is often informative to inspect more than one candidate, as the model may return closely related terms or near-synonyms.

if len(missing) == 0:
    analogy(model, positive=pos, negative=neg, topn=2)
## [('professor', 0.8101112842559814), ('lecturer', 0.7625928521156311)]

In many cases, the top-ranked result reflects the intended relationship, while subsequent results provide reasonable semantic alternatives.

Example output.

[('professor', 0.8101), ('lecturer', 0.7626)]

The second result, lecturer, is semantically coherent with the analogy. This highlights an important point:

  • Embedding models do not return a single “true” answer.

  • Instead, they provide a ranked list of candidates based on geometric proximity.

  • Several words may satisfy the relational constraint to varying degrees.

7.0.3 Queries: example (a geography-oriented analogy without capitals)

Another relationship.

Country-capital relationships can be unstable across models due to tokenization and coverage issues. As a more robust alternative, we consider a country–nationality style analogy:

\[ \text{Japan} \quad - \quad \text{Japanese} \quad + \quad \text{Italian} \quad \approx \quad \text{Italy} \]

This expression attempts to transfer the relationship:

Country ↔ Nationality

from one pair to another.

pos2 = ["Japan", "Italian"]
neg2 = ["Japanese"]

missing2 = require_tokens(model, pos2 + neg2)
missing2
## ['Japan', 'Italian', 'Japanese']

If tokens are missing:

if len(missing2) == 0:
    analogy(model, positive=pos2, negative=neg2, topn=1)
else:
    "Some tokens are missing from the vocabulary: " + ", ".join(missing2)
## 'Some tokens are missing from the vocabulary: Japan, Italian, Japanese'

In this case, the model reports that the tokens are not available.

Why does this happen?

Pretrained embedding models:

  • Contain only words observed during training.

  • May tokenize proper nouns differently (e.g., lowercase vs uppercase).

  • May omit infrequent named entities.

This reinforces a crucial lesson:

Embedding models operate over a fixed vocabulary determined at training time.
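A practical workaround is sketched below: since the GloVe vocabularies distributed through gensim.downloader are lowercased, retrying the query with lowercase tokens will often succeed (an assumption worth verifying on the model you are using).

# Retry the analogy with lowercase tokens
pos2_lower = [w.lower() for w in pos2]   # ['japan', 'italian']
neg2_lower = [w.lower() for w in neg2]   # ['japanese']

missing2_lower = require_tokens(model, pos2_lower + neg2_lower)
if len(missing2_lower) == 0:
    analogy(model, positive=pos2_lower, negative=neg2_lower, topn=1)
else:
    "Some tokens are still missing: " + ", ".join(missing2_lower)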

7.0.4 Queries: important remarks on analogy queries

Analogy queries are powerful exploratory tools, but:

  • They do not guarantee a unique or universally correct answer.

  • Results depend on the training corpus.

  • Tokenization conventions matter.

  • Named entities are often less stable than common nouns.

For instructional purposes, it is advisable to:

  • Prefer common terms,

  • Avoid rare proper nouns,

  • Use broadly shared conceptual relationships.

Ultimately, analogy queries reveal not just similarity, but the relational geometry encoded in the embedding space.

8 The Word2Vec architecture

In the previous section, we worked with pretrained Word2Vec-style embeddings and examined how they can be queried to reveal semantic relationships. We now turn our attention to the learning process itself and describe how Word2Vec models are trained.

Word2Vec can be trained using two closely related modeling strategies:

  • Skip-gram, where the model predicts surrounding context words given a target word.

  • Continuous Bag-of-Words (CBOW), where the model predicts a target word given its surrounding context.


Figure 8.1: Word2Vec architectures. Source: author’s own elaboration

Both approaches rely on the same underlying principles and differ mainly in the direction of prediction. In this document, we focus on the Skip-gram architecture, as its intuition is often easier to visualize. The same ideas can be transferred directly to the CBOW formulation.

9 The Skip-gram approach

The Skip-gram model learns word representations by predicting context words from a given target word. Words that frequently appear near one another in text contribute to each other’s representations.

Each observed (target, context) pair provides a training signal that helps refine the embedding vectors. Over time, this process leads to word vectors that encode meaningful semantic structure.

9.0.1 Skip-gram: defining target and context words

A simple illustration.

Consider the following sentence:

Learning models improve through repeated exposure

Suppose we select improve as the target word. The context words are those that appear within a fixed neighborhood around the target.

This neighborhood is controlled by a parameter known as the window size. For instance, if the window size is set to 2, the model considers up to two words to the left and two words to the right of the target word.

Under this configuration, the following (target, context) training pairs are generated for the word improve:

Target word     Context word
improve         learning
improve         models
improve         through
improve         repeated

During training, the model learns to associate the target word with each of its surrounding context words. Each (target, context) pair contributes to refining the vector representation of the target word.

More generally, the window slides across the sentence, producing multiple target–context pairs as different words take the role of the target.

Sliding window across a longer sentence.

To better visualize how this process operates across an entire sentence, we now consider a longer example:

Students develop skills by practicing data analysis techniques

Figure 9.1: Illustration of sliding window context generation.

In Figure 9.1:

  • The highlighted word represents the target word.

  • The surrounding highlighted words correspond to the context words within the sliding window.

  • Each row illustrates a different position of the window as it moves across the sentence.

For example, when the word practicing is selected as the target, the words skills, by, data, and analysis constitute its context.

This sliding-window mechanism allows the Skip-gram model to generate a large number of meaningful training examples from a single sentence, efficiently capturing local co-occurrence patterns in text.
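The following sketch makes the sliding-window mechanism concrete. It is a simplified illustration rather than gensim's internal implementation; the function name skipgram_pairs and the window parameter are introduced here only for demonstration.

def skipgram_pairs(tokens, window=2):
    # Generate (target, context) pairs with a symmetric window
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["students", "develop", "skills", "by",
            "practicing", "data", "analysis", "techniques"]

# First few pairs generated from the example sentence
skipgram_pairs(sentence, window=2)[:6]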

9.0.2 Skip-gram: core components of this model

Skip-gram as a one-to-many prediction problem.

We now examine the fundamental components involved in training a Skip-gram Word2Vec model.

The Skip-gram architecture learns word representations by predicting surrounding context words given a target (center) word:

\[ \text{Target} \;\longrightarrow\; \text{Context words} \]

Formally, for a given position \(t\) in the corpus, the training objective is to maximize the conditional probability of the words that lie within a context window of size \(c\):

\[ \prod_{\substack{-c \le j \le c \\ j \ne 0}} P\big(w_{t+j} \mid w_t\big) \]

This expression states that, for each center word \(w_t\), the model attempts to predict every neighboring word \(w_{t+j}\) within the window.

Conceptually, Skip-gram can be interpreted as a one-to-many prediction problem:

  • Given a single center word \(w_t\),

  • Predict multiple surrounding context words within a window of size \(c\).

During training, this mechanism generates several \((\text{target}, \text{context})\) pairs from a single sentence, significantly increasing the number of training examples and improving statistical efficiency.

Core components.

From a structural perspective, the Skip-gram model consists of the following key components:

  1. Input representation (one-hot encoding of the target word),

  2. Embedding matrix (\(|W| \times K\)),

  3. Context (prediction) matrix,

  4. Output score vector,

  5. Softmax normalization,

  6. Loss computation and backpropagation.

Each component plays a specific role in transforming a discrete input word into a dense vector representation and updating the model parameters during training. We now analyze each of these components in detail.


Figure 9.2: Skip-gram: core Components. Source: Created by the author with ChatGPT (OpenAI)

9.0.3 Skip-gram: input representation

In the Skip-gram architecture, the input word \(w_t\) is encoded as a one-hot vector of size \(|W| \times 1\), where \(|W|\) denotes the size of the vocabulary. Formally,

\[\mathbf{x}_t \in \mathbb{R}^{|W|}\] where exactly one component equals 1 (corresponding to the position of \(w_t\) in the vocabulary) and all remaining components equal 0.

Thus, each one-hot vector contains exactly one active entry, identifying the target word, while all other positions indicate absence. This sparse representation constitutes the starting point of the forward pass through the embedding matrix.

Example.

To illustrate this idea, consider a vocabulary composed of four tokens:

data, models, learn, patterns

Then the corresponding one-hot encodings are:

  • data → (1, 0, 0, 0)

  • models → (0, 1, 0, 0)

  • learn → (0, 0, 1, 0)

  • patterns → (0, 0, 0, 1)

Each vector has length \(|W| = 4\), and only a single position is active in each case. Formally, we denote the vocabulary as:

\[\color{brown}{W=\{\texttt{data},\ \texttt{models},\ \texttt{learn},\ \texttt{patterns}\}}\]

Hence, the input space of one-hot representations is \(\mathbb{R}^{|W|}=\mathbb{R}^{4}\). A convenient way to visualize all possible one-hot inputs is through the following matrix:

\[\color{green}{\mathbf{X}^{(0)} = \left( \begin{array}{c|cccc} \text{Word} & \texttt{data} & \texttt{models} & \texttt{learn} & \texttt{patterns} \\ \hline \texttt{data} & 1 & 0 & 0 & 0 \\ \texttt{models} & 0 & 1 & 0 & 0 \\ \texttt{learn} & 0 & 0 & 1 & 0 \\ \texttt{patterns} & 0 & 0 & 0 & 1 \end{array} \right)}\]

Interpretation.

  • Each row represents a valid input configuration for the model.

  • Selecting the word learn as the target corresponds to activating the vector:

\[\color{blue}{\mathbf{x}_{\texttt{learn}} = \begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix}}\]

Although the Skip-gram model processes one target word at a time and never uses the full matrix simultaneously during training, this representation is pedagogically useful because it:

  • Makes the structure of the input space explicit,

  • Provides a clear connection to Bag-of-Words (BoW) representations, and

  • Prepares the transition to the embedding matrix, where these sparse vectors are mapped into dense semantic representations.
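The same construction can be reproduced with a few lines of numpy, as in the sketch below; the identity matrix provides all one-hot vectors at once, and vocab and word_to_index are helper names introduced only for this illustration.

import numpy as np

vocab = ["data", "models", "learn", "patterns"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# Each row of the identity matrix is a one-hot vector
X = np.eye(len(vocab))

# One-hot vector for the word "learn"
X[word_to_index["learn"]]   # array([0., 0., 1., 0.])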

9.0.4 Skip-gram: embedding matrix

The next component of the Skip-gram architecture is the embedding matrix, denoted as \(\mathbf{E} \in \mathbb{R}^{|W| \times K}\), where:

  • \(|W|\) is the size of the vocabulary, and

  • \(K\) is the embedding dimension, that is, the number of latent features used to represent each word.

This matrix is typically initialized with small random values or with structured initialization schemes designed to improve numerical stability and convergence during training.

When a one-hot input vector corresponding to a target word is multiplied by the embedding matrix, the operation does not involve a full matrix multiplication in practice. Instead, it effectively selects the row of the embedding matrix associated with the active position in the one-hot vector.

The selected row constitutes the intermediate embedding vector, a dense vector of length \(K\) that encodes the semantic representation of the target word in the embedding space (see Figure 9.3).


Figure 9.3: Embedding Matrix. Source: author’s own elaboration
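The equivalence between the one-hot multiplication and a simple row lookup can be verified with toy values, as in the following sketch; the matrix sizes and random values are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
W_size, K = 4, 3                    # tiny vocabulary and embedding dimension
E = rng.normal(size=(W_size, K))    # embedding matrix of shape |W| x K

x = np.array([0., 0., 1., 0.])      # one-hot vector for the third word

h_matmul = x @ E                    # full matrix-vector multiplication
h_lookup = E[2]                     # direct selection of the third row

np.allclose(h_matmul, h_lookup)     # True: both give the same embedding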

9.0.5 Skip-gram: context (or prediction) matrix

A second trainable parameter matrix, commonly referred to as the context matrix (or prediction matrix), is introduced in the Skip-gram architecture. This matrix is denoted by \(\mathbf{C} \in \mathbb{R}^{|W| \times K}\), where:

  • \(|W|\) is the vocabulary size, and

  • \(K\) is the embedding dimension.

The intermediate embedding vector obtained from the embedding matrix is combined with the context matrix to generate a real-valued score for each word in the vocabulary.

Operationally, this step computes a vector of unnormalized scores by taking the dot product between the intermediate embedding vector and every row of the context matrix. Each score reflects how compatible the target word is with a candidate context word.

Conceptually, this operation measures the alignment between the semantic representation of the target word and each possible context word. Words that are more semantically or syntactically compatible with the target receive higher scores, indicating stronger contextual association (see Figure 9.4).


Figure 9.4: Context (or prediction) Matrix. Source: author’s own elaboration
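The scoring step can also be sketched with toy values: each unnormalized score is the dot product between the intermediate embedding vector and one row of the context matrix. The sizes and random values below are illustrative only.

import numpy as np

rng = np.random.default_rng(1)
W_size, K = 4, 3
C = rng.normal(size=(W_size, K))   # context (prediction) matrix |W| x K
h = rng.normal(size=K)             # intermediate embedding vector

z = C @ h                          # one unnormalized score per vocabulary word
z.shape                            # (4,)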

9.0.6 Skip-gram: output vector and softmax normalization

Overview.

The computation performed in the previous step yields an output vector of size \(|W| \times 1\), where \(|W|\) denotes the size of the vocabulary.

Each component of this vector corresponds to an unnormalized score that reflects how strongly the model associates a given vocabulary word with the current target word as a potential context word. At this stage, these scores are real-valued and do not yet constitute probabilities.

Softmax function.

To convert these raw scores into a probabilistic interpretation, the softmax function is applied.

Formally, given a score vector \[\mathbf{z} = (z_1, z_2, \dots, z_{|W|})^\top, \]

the softmax transformation is defined componentwise as

\[\text{softmax}(\mathbf{z})_i \;=\; \frac{\exp(z_i)}{\sum\limits_{j=1}^{|W|} \exp(z_j)}\]

A detailed discussion of this function and its probabilistic interpretation can be found in my notes on logistic regression. The result of this operation is a normalized output vector, whose components sum to one and can be interpreted as probabilities over the vocabulary:

\[\text{softmax}(\mathbf{z}) \in \mathbb{R}^{|W|},\]

so it has the same dimensionality as the input score vector. Here, the vector \(\mathbf{z}\) represents the output score vector produced by the model prior to normalization (see Figure 9.5).


Figure 9.5: Softmax Matrix. Source: author’s own elaboration

Each component \(z_i\) corresponds to the model’s score for the \(i^{\text{th}}\) word in the vocabulary being the correct context word.

Applying the softmax function guarantees that:

  • All resulting values lie in the interval \([0, 1]\),

  • The values sum to 1 across the vocabulary,

  • Each value can be interpreted as a probability.

9.0.7 Skip-gram: example (output vector and softmax normalization)

Suppose the model produces the following score vector for a given target word:

\[\color{blue}{\mathbf{x} =\left(\begin{array}{c} 1.5 \\ 0.5 \\ 2.5 \\ 1.0 \\ 0.2 \end{array}\right) \in \mathbb{R}^{5}}\]

Each entry corresponds to a different vocabulary word. To convert these scores into probabilities, we apply the softmax function. The denominator (normalizing constant) is:

\[\color{green}{\sum_{j=1}^{5} \exp(x_j) = \exp(1.5) + \exp(0.5) + \exp(2.5) + \exp(1.0) + \exp(0.2)}.\]

Therefore, the softmax vector can be written compactly as:

\[\color{orange}{\text{softmax}(\mathbf{x}) = \frac{1} {\exp(1.5)+\exp(0.5)+\exp(2.5)+\exp(1.0)+\exp(0.2)} \begin{pmatrix} \exp(1.5) \\ \exp(0.5) \\ \exp(2.5) \\ \exp(1.0) \\ \exp(0.2) \end{pmatrix}= \begin{pmatrix} 0.20140079 \\ 0.07409121 \\ 0.54746412 \\ 0.12215576 \\ 0.05488812 \end{pmatrix}}.\]

The resulting vector belongs to \(\mathbb{R}^{5}\), its components lie in the interval \([0,1]\), and they sum to one, forming a valid probability distribution over the vocabulary. Each value indicates the likelihood that the corresponding word is the correct context word for the given target. In python:

import numpy as np

x = np.array([1.5, 0.5, 2.5, 1.0, 0.2])
np.exp(x) / np.sum(np.exp(x))
## array([0.20140079, 0.07409121, 0.54746412, 0.12215576, 0.05488812])

Words with larger scores in the original vector \(\mathbf{x}\) receive higher probabilities after normalization, while smaller scores are comparatively suppressed.

9.0.8 Skip-gram: loss computation and backpropagation

Once the model produces a probability distribution over the vocabulary, this predicted vector is compared against the true context word, which is encoded as a one-hot vector.

The difference between the predicted probabilities and the true target representation quantifies the training error, commonly referred to as the loss. This value reflects how accurately the model was able to identify the correct context word given the target word.

The loss serves as a feedback signal that is propagated backward through the network. During this process, the parameters of the model are updated, including:

  • The entries of the embedding matrix, and

  • The entries of the context matrix.

Each parameter update is proportional to its contribution to the prediction error.
This iterative process of error propagation and parameter adjustment is known as backpropagation, and it enables the model to progressively improve its predictions as training proceeds.

\[\mathbf{x} = \color{blue}{\begin{array}{c} \text{Target}\\ \left(\begin{array}{c} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \end{array}\right) \end{array}} \quad - \quad \color{green}{\begin{array}{c} \text{Predicted}\\ \left(\begin{array}{c} 0.20 \\ 0.10 \\ 0.50 \\ 0.04 \\ 0.00 \\ \vdots \end{array}\right) \end{array}} \quad = \quad \color{red}{\begin{array}{c} \text{Error}\\ \left(\begin{array}{r} 0.80 \\ -0.10 \\ -0.50 \\ -0.04 \\ 0.00 \\ \vdots \end{array}\right) \end{array}} \]

As training progresses, successive updates to the model parameters gradually decrease the loss, indicating an increasingly accurate alignment between the predicted probabilities and the true context words. Further discussion of backpropagation and widely used loss functions can be found in the Stanford Deep Learning for Computer Vision course materials, specifically in Class 1 and Class 2.
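The sketch below illustrates, with toy numbers, how the error vector and a cross-entropy loss can be computed for a single (target, context) pair; the predicted probabilities are illustrative values only, not the output of a trained model.

import numpy as np

# One-hot vector for the true context word
target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# Illustrative softmax output (probabilities sum to one)
predicted = np.array([0.20, 0.10, 0.50, 0.04, 0.16])

error = target - predicted                    # error signal, as in the display above
loss = -np.log(predicted[np.argmax(target)])  # cross-entropy loss for this example

error, loss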

9.0.9 Skip-gram: inference and learned embeddings

This diagram summarizes the main components and interactions involved in training a Word2Vec model using the Skip-gram approach.


Figure 9.6: Forward and backward propagation in the Skip-gram Word2Vec model. Source: author’s own elaboration

Training is performed over multiple passes through the corpus, commonly referred to as epochs. As training progresses, the embedding matrix stabilizes and converges to a set of meaningful vector representations.

After training is complete, each row of the embedding matrix corresponds to the learned vector for a specific word in the vocabulary. These vectors constitute the final word embeddings and can be extracted for use in downstream tasks, such as:

  • Measuring semantic similarity,

  • Performing analogy queries, or

  • Serving as input features for other machine learning models.


Figure 9.7: Applications of Learned Word Embeddings. Source: Created by the author with ChatGPT (OpenAI)

10 The CBOW approach: overview

10.0.1 CBOW: conceptual overview

The Continuous Bag-of-Words (CBOW) model is closely related to the Skip-gram architecture, but it reverses the direction of prediction. While Skip-gram predicts surrounding words from a target word, CBOW predicts the target word from its surrounding context.

\[\text{Context words} \;\longrightarrow\; \text{Target word}\]

In this formulation, multiple context words are aggregated into a single representation, which is then used to infer the missing center word.


Figure 10.1: CBOW approach. Source: Created by the author with ChatGPT (OpenAI)

Although the network structure differs slightly from Skip-gram, both models rely on the same distributional principle:

Words that appear in similar contexts tend to have similar representations.

As a result, CBOW and Skip-gram typically produce embeddings of comparable quality.

10.0.2 CBOW: a many-to-one prediction problem

CBOW can be interpreted as a many-to-one prediction task. Formally, for a given position \(t\) in a corpus and a context window of size \(s\), the model seeks to maximize:

\[ P\big(w_t \mid w_{t-s}, \dots, w_{t+s}\big) \]

In words, the objective is to maximize the probability of observing the center word \(w_t\) given its surrounding context words. That is, among all words in the vocabulary, the model aims to assign the highest probability to the actual word that appears in the middle of the context window.

Equivalently, the CBOW model attempts to answer the following question:

Given these neighboring words, which word is most likely to occupy the center position?

Conceptually:

  • The input consists of multiple neighboring words.

  • These context words are combined into a single representation.

  • The output is a single predicted target word.

This contrasts with Skip-gram, which solves a one-to-many problem by predicting several context words from a single center word.

10.0.3 CBOW: input and output structure

In the CBOW architecture, the input representation is constructed by combining the embeddings of the surrounding context words. Let \(s\) denote the context window size, that is, the number of positions considered to the left and right of the target word. The hidden representation is typically computed as the average (or sum) of the embeddings of those context words:

\[\mathbf{h} \quad =\quad \frac{1}{c} \sum_{\substack{-s \le j \le s \\ j \ne 0}} \mathbf{v}_{w_{t+j}}\]

where:

  • \(s\) is the window size parameter, determining how many neighboring positions are considered on each side of the target word.

  • \(c\) is the total number of context words actually included in the sum.

  • \(\mathbf{v}_{w_{t+j}}\) denotes the embedding vector of word \(w_{t+j}\).

If the window is symmetric and no boundary effects occur, then typically:

\[c = 2s\]

since there are \(s\) words to the left and \(s\) words to the right of the target word. The resulting vector \(\mathbf{h} \in \mathbb{R}^K\) is a dense representation summarizing the contextual information.

This vector is then passed through a linear transformation followed by a softmax layer, producing a probability distribution over the vocabulary:

\[\text{Softmax}(\mathbf{h}) \quad \rightarrow \quad \hat{w}_t\]

Thus, CBOW aggregates multiple contextual signals into a single vector before making one prediction (which is why it is considered a many-to-one model).
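The complete CBOW forward pass can be sketched with toy values: average the embeddings of the context words, score the result against an output matrix, and normalize with softmax. All sizes, indices, and random values below are illustrative only.

import numpy as np

rng = np.random.default_rng(2)
W_size, K = 6, 4
E = rng.normal(size=(W_size, K))       # input embedding matrix
C = rng.normal(size=(W_size, K))       # output (prediction) matrix

context_indices = [0, 1, 3, 4]         # positions of the surrounding words
h = E[context_indices].mean(axis=0)    # averaged context representation

z = C @ h                              # one score per vocabulary word
p = np.exp(z) / np.sum(np.exp(z))      # softmax over the vocabulary

p.argmax()                             # index of the predicted center word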

10.0.4 CBOW: computational challenges

In their basic form, both CBOW and Skip-gram require updating a large number of parameters for each training example. Because the vocabulary size \(|W|\) can be very large, computing full softmax probabilities and updating all associated weights becomes computationally expensive.

To address this challenge, the original Word2Vec framework introduced two key optimization strategies:

  • Subsampling of frequent words.

  • Negative sampling.

These techniques significantly reduce computational cost while preserving embedding quality.


Figure 10.2: Computational challenges of the CBOW approach. Source: Created by the author with ChatGPT (OpenAI)

11 The CBOW approach: subsampling of frequent words

11.0.1 CBOW: subsampling (overview)

Highly frequent function words (such as and, of, or to) often carry limited semantic information but appear extremely often in text. To prevent these words from dominating the learning process, Word2Vec applies subsampling, which probabilistically discards some occurrences of frequent words during training.

As a consequence:

  • Frequent words are less likely to be selected as target words.

  • They appear less often as context words.

  • The effective training corpus becomes more informative and computationally manageable.

11.0.2 CBOW: subsampling (mathematical formulation)

The probability of retaining a word \(w_i\) is typically defined as:

\[\begin{equation} P(w_i) \quad =\quad \left( \sqrt{\frac{f(w_i)}{\tau}} \; +\; 1 \right) \cdot \frac{\tau}{f(w_i)} \tag{11.1} \end{equation}\]

In Equation (11.1):

  • \(f(w_i)\) denotes the relative frequency of the word \(w_i\) in the corpus, and

  • \(\tau\) is a small threshold constant (commonly set around \(10^{-3}\)) that controls the aggressiveness of subsampling.

Words with very high frequencies are therefore more likely to be discarded.

11.0.3 CBOW: subsampling (intuitive example)

Example: one frequent word (the).

To understand how subsampling works in practice, consider a simple hypothetical corpus. Suppose the word:

\[w_i = \text{"the"}\]

appears extremely often, with relative frequency:

\[f(w_i) = 0.05\]

That means the word the accounts for 5% of all tokens in the corpus. Assume the threshold parameter is:

\[\tau = 10^{-3} = 0.001\]

We compute the probability of retaining the word using Equation (11.1). Substituting values:

\[P(w_i) \quad = \quad \left(\sqrt{\frac{0.05}{0.001}} \;+\; 1 \right) \cdot \frac{0.001}{0.05} \quad = \quad \left(\sqrt{50} + 1 \right) \cdot 0.02 \quad = \quad (7.07 + 1)\cdot 0.02 \quad \approx \quad 0.1614\]

import numpy as np

# Given values
f = 0.05
tau = 0.001

# Step-by-step computation
step1 = f / tau
step2 = np.sqrt(step1)
step3 = step2 + 1
step4 = tau / f
P = step3 * step4

print("f/tau =", step1)
print("sqrt(f/tau) =", step2)
print("sqrt(f/tau) + 1 =", step3)
print("tau/f =", step4)
print("Final probability P(w_i) =", P)
## Final probability P(w_i) = 0.16142135623730952

This means that only about 16% of occurrences of the are retained, while approximately 84% are discarded during training.

Example: comparison with a less frequent word (cat).

Now consider a less frequent word, such as:

\[w_j = \text{"cat"}\]

Suppose:

\[f(w_j) = 0.0005\]

Then:

\[P(w_j) \quad = \quad \left( \sqrt{\frac{0.0005}{0.001}} \;+\; 1\right) \cdot \frac{0.001}{0.0005} \quad =\quad \left(\sqrt{0.5} + 1\right) \cdot 2 \quad = \quad (0.707 + 1)\cdot 2 \quad \approx \quad 3.414\]

import numpy as np

# Given values
f = 0.0005
tau = 0.001

# Step-by-step computation
step1 = f / tau
step2 = np.sqrt(step1)
step3 = step2 + 1
step4 = tau / f
P = step3 * step4

print("f/tau =", step1)
print("sqrt(f/tau) =", step2)
print("sqrt(f/tau) + 1 =", step3)
print("tau/f =", step4)
print("Final probability P(w_i) =", P)
## Final probability P(w_i) = 3.414213562373095

Since probabilities cannot exceed 1, the word cat is effectively always retained.

11.0.4 CBOW: subsampling (interpretation)

Subsampling therefore:

  • Aggressively removes extremely frequent words.

  • Keeps informative content words.

  • Reduces computational cost.

  • Improves embedding quality by focusing on meaningful context.

Frequent function words contribute less semantic information, so discarding many of their occurrences does not harm learning. Instead, it allows the model to concentrate on more informative patterns in the data.

12 The CBOW approach: negative sampling

12.0.1 CBOW: negative sampling (overview)

A second major efficiency improvement is negative sampling. Instead of updating model parameters for every word in the vocabulary at each step, negative sampling updates parameters for:

  • The true target word, and

  • A small set of randomly selected negative words.

Negative words are tokens that are unlikely to appear in the current context.

This reduces the number of weight updates from \(|W|\) to a small constant \(k\), making training feasible even for very large vocabularies.

12.0.2 CBOW: selecting negative samples

Negative words are sampled according to a modified frequency distribution. Rather than sampling strictly proportional to raw frequency, a smoothed distribution is typically used to balance common and rare words.

A simplified representation of the sampling probability is:

\[P(w_i) = \frac{F(w_i)}{\sum\limits_{j=1}^{|W|} F(w_j)}\]

where:

  • \(F(w_i)\) is the number of times word \(w_i\) appears in the corpus, and

  • \(|W|\) denotes the vocabulary size.

In practice, the original Word2Vec implementation uses a frequency distribution raised to the power \(3/4\), which empirically improves performance.
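The effect of the 3/4 exponent can be seen with a small numerical sketch: compared with sampling proportionally to raw counts, the smoothed distribution shifts some probability mass from very frequent words toward rarer ones. The counts below are toy values.

import numpy as np

# Toy corpus counts F(w_i) for four words
counts = np.array([1000.0, 200.0, 50.0, 5.0])

p_raw = counts / counts.sum()                     # proportional to raw frequency
p_smoothed = counts**0.75 / np.sum(counts**0.75)  # frequency raised to the power 3/4

p_raw, p_smoothed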

12.0.3 CBOW: negative sampling (example)

To make the idea concrete, consider the following short sentence:

"The cat sits on the mat"

Suppose the center word is \(w_t = \text{"sits"}\):

w = "sits"

and we use a context window of size \(s = 2\). The context words are therefore:

{"the", "cat", "on", "the"}

In CBOW, the model aggregates these context words and tries to predict:

P("sits" | context)

12.0.4 CBOW: what happens with and without negative sampling?

Without negative sampling?

If we used the full softmax formulation, the model would need to:

  • Compute scores for every word in the vocabulary.

  • Normalize across all \(|W|\) words.

  • Update parameters associated with all of them.

If the vocabulary contains 100,000 words, that means 100,000 updates for a single training example (which is computationally expensive).

With negative sampling?

Instead of updating all words, we update only:

  • The true target word: sits.

  • A small number \(k\) of negative words.

Suppose \(k = 3\). The model might randomly select:

{"banana", "government", "ocean"}

These words are unlikely to appear in the given context. The model then learns to:

  • Increase the probability of sits given the context.

  • Decrease the probability of the negative samples.

Thus, instead of updating 100,000 output weights, we update only:

\[1 + k = 4\]

This dramatically reduces computational cost.

Mathematical interpretation.

With negative sampling, the objective for one training example becomes approximately:

\[\log \sigma\big(\mathbf{v}_{\text{sits}}^{\top} \mathbf{h}\big) \quad +\quad \sum_{i=1}^{k} \log \sigma\big(-\mathbf{v}_{w_i^-}^{\top} \mathbf{h}\big)\]

where:

  • \(\mathbf{h}\) is the aggregated context vector.

  • \(w_i^-\) are the negative samples.

  • \(\sigma(\cdot)\) is the logistic sigmoid function.

Instead of normalizing across the entire vocabulary, we solve several small binary classification problems:

Is this word the correct target? Yes or No?
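To make the objective tangible, the following sketch evaluates it for a single training example using made-up three-dimensional vectors; the names h, v_target, and v_neg are illustrative and do not correspond to any library object:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative low-dimensional vectors
h = np.array([0.2, -0.1, 0.4])           # aggregated context vector
v_target = np.array([0.3, -0.2, 0.5])    # output vector of the true target ("sits")
v_neg = np.array([[-0.4, 0.1, -0.3],     # output vectors of k = 3 negative samples
                  [0.0, 0.5, -0.2],
                  [-0.1, -0.3, 0.2]])

# log sigma(v_target . h) + sum_i log sigma(-v_neg_i . h)
objective = np.log(sigmoid(v_target @ h)) + np.sum(np.log(sigmoid(-(v_neg @ h))))
print("Negative-sampling objective for this example:", round(float(objective), 4))

Only the target vector and the \(k\) negative vectors appear in this expression, which is precisely why the number of updated output vectors drops from \(|W|\) to \(1 + k\).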

Why this works.

Although the model no longer computes a full probability distribution over the vocabulary, it still learns meaningful embeddings because:

  • True target words are pushed closer to their contexts.

  • Random negative words are pushed farther away.

Over many training examples, this process shapes the embedding space so that semantically related words cluster together.

13 Training a Word2Vec model

After examining how pretrained Word2Vec embeddings can be used and understanding the underlying architecture of the model, we now turn to the task of training a Word2Vec model from scratch.

Although it is possible to implement the algorithm manually, most practical applications rely on established libraries.
In this chapter, we use the gensim library, which provides a clear and efficient interface for training Word2Vec models.

We begin with a minimal configuration to illustrate the core ideas and then gradually introduce additional parameters.

13.0.1 Building a simple Word2Vec model

We start by defining a small collection of tokenized sentences and training a basic model.

from gensim.models import Word2Vec

sentences = [
    ["data", "science", "relies", "on", "statistical", "models"],
    ["machine", "learning", "models", "improve", "predictions"],
    ["statistical", "methods", "support", "data", "analysis"]
]

model = Word2Vec(sentences, min_count=1)

In this example, the model is trained using a short list of tokenized sentences. Each sentence is represented as a list of tokens, and the full collection is passed to the Word2Vec constructor.

The parameter min_count controls vocabulary construction by specifying the minimum number of occurrences required for a word to be included. Here, setting min_count = 1 ensures that all tokens in the dataset are retained.

In real-world applications, the input typically consists of thousands or millions of sentences drawn from a large corpus.

13.0.2 Inspecting the dimensionality of the learned word vectors

To inspect the dimensionality of the learned word vectors, we use:

model.vector_size
## 100

By default, Word2Vec constructs embeddings with 100 dimensions.

13.0.3 Size of the vocabulary

The size of the vocabulary learned from the data can be obtained as follows:

len(model.wv.key_to_index)
## 13

This value corresponds to the number of distinct tokens appearing in the training corpus.
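Individual word vectors can also be retrieved directly from the trained model. As a quick check (using the token data, which is present in the toy corpus), the following expression should report a 100-dimensional array:

model.wv["data"].shape
## (100,)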

13.0.4 Adjusting the min_count parameter

The min_count parameter can be used to filter out infrequent words, which are often noisy or uninformative.

model = Word2Vec(sentences, min_count=2)

With this configuration, only words appearing at least twice in the corpus are retained.

We can verify the resulting vocabulary size:

len(model.wv.key_to_index)
## 3

And inspect the retained tokens:

model.wv.key_to_index
## {'models': 0, 'statistical': 1, 'data': 2}

Although the vocabulary shrinks, the dimensionality of the embeddings remains unchanged:

model.vector_size
## 100

Filtering rare words can improve training efficiency and reduce overfitting when working with large corpora.

13.0.5 Playing with the vector size

Higher-dimensional vectors can encode more information, which is especially useful when the corpus is large and the vocabulary is varied.

Let us now build a model in which each word vector has 300 dimensions:

model = Word2Vec(sentences, min_count=2, vector_size=300)

We can confirm the vector size of the model we just built:

model.vector_size
## 300

As we can see, each of the three words that occur more than once (models, statistical, and data) is now represented using 300 dimensions.

13.0.6 Exploring the effect of vector dimensionality

The dimensionality of word embeddings influences how much semantic information can be encoded. Larger values allow for richer representations but require more data and computational resources.

This is controlled by the vector_size parameter used in the previous block: each retained word is then represented as a vector in a 300-dimensional space, while the vocabulary itself is still determined by min_count.

13.0.7 Additional configuration parameters

Word2Vec provides several other parameters that control training behavior:

  • sg: selects the training architecture (1 for Skip-gram, 0 for CBOW),

  • negative: specifies the number of negative samples used during training,

  • workers: defines the number of parallel threads.

An example configuration is shown below:

model = Word2Vec(
    sentences,
    min_count=1,
    vector_size=200,
    sg=1,
    negative=5,
    workers=2
)

We can again inspect the vocabulary:

len(model.wv.key_to_index)
## 13
model.wv.key_to_index
## {'models': 0, 'statistical': 1, 'data': 2, 'analysis': 3, 'support': 4, 'methods': 5, 'predictions': 6, 'improve': 7, 'learning': 8, 'machine': 9, 'on': 10, 'relies': 11, 'science': 12}

Trained Word2Vec models can be saved to disk for later use using the save() method.
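A minimal sketch of this workflow is shown below; the file name word2vec_demo.model is arbitrary and chosen only for illustration:

from gensim.models import Word2Vec

# Persist the trained model to disk
model.save("word2vec_demo.model")

# Reload it later and verify that the vocabulary is intact
loaded = Word2Vec.load("word2vec_demo.model")
len(loaded.wv.key_to_index)
## 13

If only the word vectors are needed (without the training state), the lighter model.wv object can be saved and reloaded separately as a KeyedVectors instance.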

13.0.8 Limitations of Word2vec

Despite its effectiveness, Word2Vec has several well-known limitations.

Figure 13.1: Applications of Word2vec. Source: Created by the author with ChatGPT (OpenAI)

First limitation.

Each word is assigned a single static vector, regardless of context. Consider the following sentences:

The researcher examined the cell samples.
The prisoner was locked in a cell overnight.

In both cases, the word cell would receive the same vector representation, even though its meaning differs across contexts.
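This can be verified directly. The sketch below assumes the compact pretrained GloVe model used later in this document (glove-wiki-gigaword-100) is available through gensim.downloader; because the lookup is context-free, both sentences receive exactly the same vector:

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloaded automatically on first use

vec_biology = glove["cell"]   # "... examined the cell samples"
vec_prison = glove["cell"]    # "... locked in a cell overnight"

(vec_biology == vec_prison).all()
## True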

Second limitation.

Word2Vec can reflect statistical biases present in the training corpus. If certain associations are overrepresented in the data, the learned embeddings may encode and reproduce these patterns. These issues highlight an important principle:

The quality and fairness of embeddings depend strongly on the data used for training.

13.0.9 Applications of Word2vec

Word2Vec embeddings are widely used in a variety of natural language processing tasks, including:

  • Semantic similarity measurement.

  • Document clustering.

  • Text classification.

  • Information retrieval.

By representing words as dense numerical vectors, Word2Vec enables text data to be integrated into traditional machine learning pipelines and more advanced neural architectures.

Figure 13.2: Applications of Word2vec. Source: Created by the author with ChatGPT (OpenAI)

14 Word Mover’s Distance (WMD)

14.0.1 WMD: Overview

In earlier sections, we discussed how word embeddings can be used to represent documents and measure their similarity. One practical scenario where this becomes relevant is document matching, such as ranking short texts according to their relevance to a reference description.

For example, consider a system designed to compare short professional profiles against a project description. In such cases, we require a distance measure that reflects semantic similarity, not just surface-level word overlap. Documents that are semantically closer should receive smaller distance values.

In the document Transforming Text into Data Structures, we introduced cosine similarity as a common measure for comparing vector-based text representations. While effective in many settings, cosine similarity treats documents as aggregated vectors and may overlook fine-grained word-level alignments.

To address this limitation, we now introduce Word Mover’s Distance (WMD), a distance metric specifically designed for comparing documents represented through word embeddings.

Figure 14.1: Cosine similarity vs Word Mover’s Distance. Source: Created by the author with ChatGPT (OpenAI)

14.0.2 WMD: intuition behind this measure

Word Mover’s Distance (WMD), introduced by Kusner et al. (2015), is grounded in ideas from optimal transport theory. The central intuition is to measure how much “effort” is required to transform one document into another by moving words through the embedding space.

More precisely, WMD defines the dissimilarity between two documents as the minimum cumulative distance that the embedded words of one document must travel to align with the embedded words of the other document.

Instead of comparing documents as single aggregated vectors (as in cosine similarity), WMD explicitly accounts for word-level alignments.

15 WMD: example

15.0.1 Reference sentences

Consider the following sentences:

Sentence A: "The analyst explained results during the workshop in Medellín"
Sentence B: "A specialist discussed findings at a seminar in the city"

Many words in these sentences occupy nearby positions in the embedding space. For example:

  • analyst and specialist are semantically related.

  • workshop and seminar describe similar events.

  • explained and discussed reflect related communicative actions.

Now compare these with a third sentence:

Sentence C: "My bicycle needs maintenance before the weekend trip"

Sentence C shares little semantic content with Sentence A. Therefore, we expect the distance between A and C to be substantially larger than the distance between A and B.

Word Mover’s Distance (WMD) formalizes this intuition by computing pairwise distances between word embeddings and solving an optimal transport problem that minimizes the total movement cost required to transform one sentence into another.

15.0.2 Guiding question and working hypothesis

Guiding question.

Given the three sentences introduced above, we now formulate a concrete analytical objective. Our goal is to determine whether Word Mover’s Distance (WMD) aligns with our semantic intuition.

The guiding question is:

Given these three sentences, can we formally measure which pair is semantically closer using Word Mover’s Distance?

More specifically:

1. Is the distance between sentences A and B smaller than the distance between sentences A and C?
  
2. How does WMD operationalize our intuitive notion of semantic similarity?  

Intuitively, Sentence A and Sentence B are semantically related, whereas Sentence C describes a completely different topic.

We therefore propose the following working hypothesis:

\[ \mathrm{WMD}(A, B) \;<\; \mathrm{WMD}(A, C) \]

That is, the semantic distance between sentences A and B should be smaller than the distance between sentences A and C.

Interpretation of the hypothesis.

This hypothesis operationalizes a natural semantic expectation:

  • If two sentences share related concepts,

  • and those concepts occupy nearby regions in embedding space,

  • then the optimal transport cost required to align them should be relatively small.

Conversely, if two sentences describe unrelated topics, the cumulative transport cost should be substantially larger.

Thus, WMD allows us to move from qualitative intuition (“these sentences are similar”) to a quantitative comparison based on geometric structure in the embedding space.

In the next section, we compute these distances explicitly using gensim and evaluate whether the numerical results confirm our hypothesis.

16 WMD: example (implementing with gensim)

We now demonstrate how to compute Word Mover’s Distance using the gensim library.

16.0.1 WMD with gensim: importing required modules

We begin by importing the necessary modules:

import gensim
from gensim.models import KeyedVectors

16.0.2 WMD with gensim: loading a pretrained embedding model

Next, we load a compact pretrained embedding model based on GloVe vectors trained on Wikipedia:

import gensim.downloader as api

# Load a compact pretrained model
model = api.load("glove-wiki-gigaword-100")

This model provides 100-dimensional word embeddings suitable for instructional demonstrations.
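A quick sanity check of the loaded model confirms its dimensionality and the size of its vocabulary (all tokens in this model are lowercase):

model.vector_size
## 100
len(model.key_to_index)
## 400000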

16.0.3 WMD with gensim: defining the example sentences

We now define three sentences for comparison:

sentence_1 = "The analyst explained results during the workshop in Medellín."
sentence_2 = "A specialist discussed findings at a seminar in the city."
sentence_3 = "My bicycle needs maintenance before the weekend trip."

16.0.4 WMD with gensim: computing distances

The distances.

We begin by computing the pairwise Word Mover’s Distance between the sentences:

d12 = model.wmdistance(sentence_1, sentence_2)
d13 = model.wmdistance(sentence_1, sentence_3)

print("Distance (sentences 1 and 2) =", d12)
print("Distance (sentences 1 and 3) =", d13)
## Distance (sentences 1 and 2) = 0.34086718145944445
## Distance (sentences 1 and 3) = 0.3008181961168074

We observe that:

  • \(d_{12}=\mathrm{WMD}(1,2)=0.3409\)

  • \(d_{13}=\mathrm{WMD}(1,3)=0.3008\)

This implies:

\[\mathrm{WMD}(1,2) \;>\; \mathrm{WMD}(1,3).\]

In this run, Sentence 1 is numerically closer to Sentence 3 than to Sentence 2 under WMD (which is the opposite of our initial semantic expectation).

Interpreting the unexpected ordering.

Our working hypothesis was:

\[ \mathrm{WMD}(A,B) \;<\; \mathrm{WMD}(A,C), \]

meaning that the semantically related pair (1,2) should yield a smaller distance than the unrelated pair (1,3). However, WMD is highly sensitive to:

  • Out-of-vocabulary (OOV) tokens.

  • Tokenization decisions.

  • Casing and punctuation.

  • Accented characters, and

  • The specific pretrained embedding model.

For example, tokens such as Medellín may become OOV depending on preprocessing. When certain semantically important words are removed, the transport structure changes, potentially altering the distance ordering. Therefore, the correct empirical procedure is:

Do not assume the hypothesis holds: compare distances (1,2) and (1,3) directly.

Checking preprocessing and vocabulary coverage.

Before interpreting semantic similarity results, it is methodologically necessary to verify two aspects:

  1. That the text has been properly normalized.

  2. That all tokens are present in the embedding model vocabulary.

If a word is out of vocabulary (OOV), the model cannot assign a vector representation to it, which may distort similarity computations.

Text normalization function.

import re
import unicodedata

def normalize_text(s):
    s = s.lower()
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("utf-8")
    s = re.sub(r"[^a-z\s]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

This function applies four preprocessing steps:

  1. Lowercasing: Converts all characters to lowercase to avoid case-sensitive mismatches.

  2. Unicode normalization: Removes diacritics (e.g., canción → cancion), ensuring compatibility with the embedding vocabulary.

  3. Removal of non-alphabetic characters: Eliminates punctuation and numbers.

  4. Whitespace standardization: Replaces multiple spaces with a single space and trims leading/trailing spaces.

This guarantees consistent token formatting before checking vocabulary coverage.
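For example, applying the function to the first sentence lowercases the text, strips the accent in Medellín, and removes the final period:

print(normalize_text(sentence_1))
## the analyst explained results during the workshop in medellin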

OOV check function.

def oov_tokens(m, s):
    toks = normalize_text(s).split()
    return [t for t in toks if t not in m.key_to_index]

This function:

  1. Normalizes the sentence.

  2. Splits it into tokens.

  3. Checks whether each token exists in the model vocabulary (m.key_to_index).

  4. Returns a list of tokens not found in the model.

Checking each sentence.

print("OOV sentence_1:", oov_tokens(model, sentence_1))
print("OOV sentence_2:", oov_tokens(model, sentence_2))
print("OOV sentence_3:", oov_tokens(model, sentence_3))
## OOV sentence_1: []
## OOV sentence_2: []
## OOV sentence_3: []

The empty lists indicate that:

  • All tokens in each sentence are present in the embedding vocabulary.

  • No information is lost due to missing vector representations.

  • Similarity computations (e.g., cosine similarity or WMD) can be considered reliable with respect to vocabulary coverage.

If the output had included tokens, for example:

OOV sentence_1: ['blockchain', 'cryptomonedas']

this would indicate that those words have no vector representation in the model, potentially affecting semantic distance calculations.

Methodological Note.

Before computing semantic similarity measures, vocabulary coverage should always be verified. Ignoring OOV tokens may introduce silent distortions in embedding-based analyses.

16.0.5 WMD with gensim: recomputing WMD after normalization

The code.

s1n = normalize_text(sentence_1)
s2n = normalize_text(sentence_2)
s3n = normalize_text(sentence_3)

d12 = model.wmdistance(s1n, s2n)
d13 = model.wmdistance(s1n, s3n)

print(f"Distance (1,2) = {d12:.4f}")
print(f"Distance (1,3) = {d13:.4f}")

This block performs two main operations:

  1. Text normalization:

    • Each sentence is cleaned using the previously defined normalize_text() function.

    • This ensures consistent casing, removal of punctuation, and standardized tokens before computing distances.

  2. Recomputation of Word Mover’s Distance (WMD):

    • model.wmdistance() calculates the semantic distance between two sentences.

    • d12 measures the distance between Sentence 1 and Sentence 2.

    • d13 measures the distance between Sentence 1 and Sentence 3.

The output.

The values are printed with four decimal places for clarity.

## Distance (1,2) = 0.3093
## Distance (1,3) = 0.2870

Since WMD is a distance metric, smaller values indicate greater semantic similarity. We compare:

  • If \(d_{12} < d_{13}\), the result aligns with semantic intuition.

  • If \(d_{12} > d_{13}\), the embedding geometry (under this model and preprocessing) places Sentence 1 closer to Sentence 3.

Because \(d_{13} < d_{12}\), the embedding geometry places Sentence 1 closer to Sentence 3 than to Sentence 2.

Methodological insight.

Even after confirming that there are no OOV tokens, normalization can slightly modify token structure and therefore affect the computed distances. This illustrates an important principle in NLP:

Distance-based semantic comparisons depend not only on the metric itself, but also on preprocessing decisions and vocabulary coverage.

17 WMD: contrast with cosine similarity

17.0.1 Mean sentence embedding

Cosine similarity compares aggregated sentence vectors (e.g., mean embeddings), whereas WMD aligns words via optimal transport.

import numpy as np

def sent_vector_mean(m, s):
    toks = [t.strip(".,!?;:()[]\"'").lower() for t in s.split()]
    toks = [t for t in toks if t in m.key_to_index]
    if len(toks) == 0:
        return None
    return np.mean([m[t] for t in toks], axis=0)

This function computes a mean sentence embedding:

  1. The sentence is tokenized and lightly cleaned.

  2. Only tokens present in the embedding vocabulary are retained.

  3. Each token is mapped to its vector representation.

  4. If \(\mathbf{w}_i\) is the embedding of token \(i\), the sentence vector is computed as the arithmetic mean:

\[\mathbf{v}_s \quad = \quad \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i\]

17.0.2 Cosine similarity

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

This function computes cosine similarity between two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^d\):

\[\cos(\mathbf{u}, \mathbf{v}) \quad =\quad \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \quad \in \quad [-1, 1]\] Here, \(\mathbf{u} \cdot \mathbf{v}\) denotes the Euclidean inner product, and the \(L_2\) norm (Euclidean norm) of a vector \(\mathbf{v} \in \mathbb{R}^d\) is defined as:

\[\|\mathbf{v}\|_2 \quad =\quad \sqrt{\sum_{i=1}^{d} v_i^2}.\]

Cosine similarity measures angular similarity, not Euclidean distance: it evaluates the angle between vectors rather than their magnitude. It takes values in the continuous interval \([-1,1]\). The extreme cases correspond to:

  • Identical direction (maximum similarity): \(1\)

  • Orthogonal vectors (no linear association): \(0\)

  • Opposite direction: \(-1\)

Intermediate values (e.g., 0.82, 0.34, −0.15) reflect varying angular proximity between vectors. In embedding spaces trained on natural language data, cosine values are typically non-negative, since semantically unrelated words rarely exhibit strong opposite orientations.
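The three extreme cases listed above can be checked with small made-up vectors, re-using a cosine() helper identical to the one defined earlier:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0])

print(round(cosine(u, np.array([2.0, 4.0])), 4))    # same direction
print(round(cosine(u, np.array([-2.0, 1.0])), 4))   # orthogonal
print(round(cosine(u, np.array([-1.0, -2.0])), 4))  # opposite direction
## 1.0
## 0.0
## -1.0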

17.0.3 Output interpretation

v1 = sent_vector_mean(model, sentence_1)
v2 = sent_vector_mean(model, sentence_2)
v3 = sent_vector_mean(model, sentence_3)

cos12 = cosine(v1, v2)
cos13 = cosine(v1, v3)

print("| Pair | Cosine similarity |")
print(f"| (1,2) | {cos12:.4f} |")
print(f"| (1,3) | {cos13:.4f} |")
## | Pair | Cosine similarity |
## | (1,2) | 0.9319 |
## | (1,3) | 0.8143 |

Since cosine similarity is a similarity measure, larger values indicate greater semantic relatedness. Here we observe:

\[ \cos(1,2) \;=\; 0.9319 \quad > \quad \cos(1,3) \;=\; 0.8143\]

Thus, Sentence 1 is closer to Sentence 2 than to Sentence 3 under mean-embedding cosine similarity.

17.0.4 Conceptual contrast with WMD

  • Cosine similarity compares aggregated sentence vectors.

  • WMD aligns individual words via optimal transport.

Cosine relies on averaging, which may smooth or blur fine-grained word-level structure. WMD, by contrast, computes the minimal cumulative transport cost between word distributions. In general, if two sentences are more semantically related, we expect:

\[ \text{WMD}(A,B) < \text{WMD}(A,C)\]

However, as shown in this example, the empirical embedding geometry may yield a different ordering depending on preprocessing and model characteristics.

This discrepancy highlights the inherently model-dependent and preprocessing-sensitive nature of semantic distance in embedding spaces.

17.0.5 Key methodological insight

Cosine similarity and WMD operate on fundamentally different geometric principles:

  • Cosine similarity operates on global vector direction.

  • WMD relies on distributional alignment in embedding space.

Different metrics may yield different rankings depending on preprocessing, token overlap, and embedding geometry.

18 WMD: cosine similarity with mean embeddings

In addition to Word Mover’s Distance, we now compute cosine similarity between sentence representations. Here, each sentence is represented by the mean of its word embeddings. This produces a single dense vector per sentence, allowing us to compare them using cosine similarity.

The following code:

  • Tokenizes each sentence.

  • Removes out-of-vocabulary (OOV) tokens.

  • Computes the mean embedding.

  • And evaluates cosine similarity for the pairs (1,2) and (1,3).

import numpy as np

def sent_vector_mean(m, s):
    # Simple tokenization + OOV filtering
    toks = [t.strip(".,!?;:()[]\"'").lower() for t in s.split()]
    toks = [t for t in toks if t in m.key_to_index]
    if len(toks) == 0:
        return None
    return np.mean([m[t] for t in toks], axis=0)

def cosine(u, v):
    # Safe cosine computation
    if u is None or v is None:
        return None
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return None
    return float(np.dot(u, v) / (norm_u * norm_v))

v1 = sent_vector_mean(model, sentence_1)
v2 = sent_vector_mean(model, sentence_2)
v3 = sent_vector_mean(model, sentence_3)

cos12 = cosine(v1, v2)
cos13 = cosine(v1, v3)

print("| Pair | Cosine similarity |")

if cos12 is not None:
    print(f"| (1,2) | {cos12:.4f} |")
else:
    print("| (1,2) | NA (vector not available) |")

if cos13 is not None:
    print(f"| (1,3) | {cos13:.4f} |")
else:
    print("| (1,3) | NA (vector not available) |")
## | Pair | Cosine similarity |
## | (1,2) | 0.9319 |
## | (1,3) | 0.8143 |

In contrast to WMD, cosine similarity in this example behaves as expected, assigning a higher similarity score to the semantically related pair (1,2).

18.0.1 Interpretation

From the computed results, we observe:

\[\cos(1,2)\; =\; 0.9319 \quad > \quad \cos(1,3)\; =\; 0.8143\]

In general, cosine similarity increases with semantic relatedness. Likewise, because Word Mover’s Distance (WMD) is a distance measure, semantic similarity should correspond to smaller values:

\[\mathrm{WMD}(1,2) < \mathrm{WMD}(1,3).\]

However, in our computed example we obtained:

\[\mathrm{WMD}(1,2) \;= \; 0.3093 \quad >\quad \mathrm{WMD}(1,3) \;= \; 0.2870.\]

Under this specific embedding model and preprocessing pipeline, Sentence 1 is therefore placed closer to Sentence 3 than to Sentence 2. This does not contradict the theory of WMD; rather, it illustrates its sensitivity to practical implementation details.

This example illustrates that similarity metrics are grounded in distinct geometric principles, which may produce divergent empirical rankings even under identical preprocessing pipelines.

18.0.2 Methodological Note

Discrepancies between intuitive semantic similarity and computed WMD values typically arise from:

  • Out-of-vocabulary (OOV) tokens.

  • Preprocessing inconsistencies.

  • Accented or rare words.

  • Differences in token coverage across sentences.

  • The geometry induced by the specific pretrained embedding model.

Therefore, embedding-based semantic comparisons depend not only on the metric itself, but also on preprocessing decisions and vocabulary coverage. WMD is theoretically well-founded, yet empirically sensitive to preprocessing and embedding geometry.

19 WMD: normalizing embeddings (optional)

Although WMD does not require explicit vector normalization, some practitioners precompute vector norms for computational efficiency.

model.fill_norms()

This operation precomputes and stores the \(L_2\) norms of the embedding vectors for efficient similarity computations. It does not modify the underlying vectors themselves. Recall that the \(L_2\) norm (Euclidean norm) of a vector \(\mathbf{v} \in \mathbb{R}^d\) is defined as:

\[\|\mathbf{v}\|_2 \quad =\quad \sqrt{\sum_{i=1}^{d} v_i^2}.\]

The \(L_2\) norm measures the magnitude (length) of a vector in Euclidean space. Precomputing these norms allows faster evaluation of similarity metrics such as cosine similarity, which depends on vector magnitudes:

\[ \cos(\mathbf{u}, \mathbf{v}) \quad = \quad \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \quad \in \quad [-1, 1].\]
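The effect of the precomputed norms can be checked directly: a unit-length copy of any stored vector can be retrieved with get_vector(..., norm=True). The sketch below uses the token analyst, which the OOV check above confirmed is in the vocabulary:

import numpy as np

raw = model.get_vector("analyst")              # stored vector (arbitrary length)
unit = model.get_vector("analyst", norm=True)  # rescaled copy with L2 norm equal to 1

round(float(np.linalg.norm(unit)), 4)
## 1.0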

After precomputing norms, distances can be recomputed:

d12_norm = model.wmdistance(sentence_1, sentence_2)
d13_norm = model.wmdistance(sentence_1, sentence_3)

d12_norm, d13_norm
## (0.34086718145944445, 0.3008181961168074)

In practice, this step typically does not change the relative ordering of distances. It mainly improves numerical consistency in the underlying vector operations. If text normalization and OOV handling were already applied carefully, this embedding normalization step is optional.

20 WMD: interpreting the results

Word Mover’s Distance (WMD) provides a principled way to compare documents based on their word-level semantic structure.

  • Smaller WMD values indicate greater semantic similarity.

  • Larger WMD values indicate greater semantic divergence.

Unlike cosine similarity, which compares aggregated sentence vectors, WMD explicitly aligns words across documents using an optimal-transport formulation. This allows WMD to capture fine-grained semantic structure that simpler similarity measures may overlook. However, WMD is sensitive to several practical factors:

  • Tokenization choices.

  • Lowercasing and accent handling.

  • Out-of-vocabulary (OOV) words.

  • Coverage of the pretrained embedding model.

Therefore, interpretation must always consider preprocessing and vocabulary coverage.

Embedding-based semantic comparisons depend not only on the metric, but also on preprocessing decisions and model coverage.

In summary, WMD is a powerful distance metric, but its empirical behavior depends critically on the interaction between:

  1. The embedding geometry.

  2. The preprocessing pipeline.

  3. And the vocabulary represented in the model.

This makes WMD both theoretically principled and empirically sensitive.

21 Summary

In this document, we extended the discussion initiated in Transforming Text into Data Structures by shifting the focus from purely syntactic representations to semantic modeling of text.

Rather than treating words as isolated symbolic units, we examined how distributional information (especially word co-occurrence patterns) can be leveraged to approximate semantic structure.

We examined the geometric intuition behind word embeddings, analyzed how semantic regularities emerge through vector arithmetic, and studied the internal architecture of Word2Vec (specifically the Skip-gram and CBOW paradigms) along with practical considerations for training and deploying pretrained models.

Building on this foundation, we trained custom Word2Vec models from scratch, investigated the role of key hyperparameters, and reflected on known limitations of static embeddings (such as contextual ambiguity and the amplification of biases present in the training corpus). Several real-world applications were also highlighted, illustrating how word embeddings can be leveraged for similarity, clustering, and information retrieval tasks.

Finally, we introduced Word Mover’s Distance (WMD) as an optimal-transport-based framework for comparing documents in embedding space, illustrating how semantic distances can be quantified beyond simple aggregation-based similarity measures.

22 Applied activity: from word embeddings to semantic similarity

This activity is designed to integrate and apply the concepts introduced in this chapter related to word embeddings, Word2Vec, and semantic similarity. The reader will explore how contextual word representations are structured, queried, and used to compare words and short texts in a numerical vector space.

22.0.1 Objective

To build a fully reproducible workflow that:

  • explores pretrained word embeddings,

  • analyzes semantic similarity between words,

  • performs analogy-style queries, and

  • compares short texts using embedding-based distances.

22.0.2 Instructions

  1. Select one pretrained word embedding model available through a standard NLP library (e.g., gensim).

  2. Work with:

    • a small set of common words, and

    • a small collection of short sentences (2–4 sentences).

  3. Create an R Markdown (.Rmd) document that compiles successfully to HTML (or PDF).

  4. The document must include:

    • the code, and

    • the generated output (tables, printed objects, or numerical results).

22.0.3 Required sections

1. Embedding Model Description

Briefly describe:

  • the selected pretrained embedding model,

  • the source of the training corpus, and

  • the dimensionality of the word vectors.

Explain why a pretrained model is appropriate for this activity.

2. Vocabulary exploration

Select 10–15 common tokens (e.g., nouns or verbs).

For each token:

  • verify its presence in the embedding vocabulary, and

  • report the dimensionality of its vector representation.

Briefly comment on why some tokens may be missing.

3. Nearest neighbor analysis

Choose three query words and:

  • retrieve their top 5 most similar words using cosine similarity,

  • present the results in a clear table.

Interpret the semantic relationships observed in the results.

4. Analogy queries

Construct at least two analogy-style queries using the form:

\[ \text{word}_A - \text{word}_B + \text{word}_C \approx \text{word}_D \]

For each analogy:

  • specify the positive and negative sets,

  • report the top predicted result(s), and

  • discuss whether the analogy is semantically reasonable.

5. Semantic distance between words

Select four words and compute pairwise cosine similarity scores between their vectors.

Present the results as:

  • a similarity table, or

  • a similarity matrix.

Optionally, convert similarity to distance using \(1-\cos(\theta)\). Explain how numerical distance reflects semantic proximity.

6. Embedding-based text similarity

Define three short sentences (one or two lines each).

Using an embedding-based similarity or distance measure (e.g., Word Mover’s Distance or cosine similarity applied to averaged word vectors):

  • compute the distance between each pair of sentences,

  • identify the most similar and most dissimilar sentence pairs.

If both WMD and cosine similarity are computed, compare their rankings and comment on any discrepancies. Interpret the results in terms of semantic content.

7. Optional: preprocessing and OOV check

Before computing similarities, verify that all selected tokens are present in the embedding vocabulary. Briefly report any out-of-vocabulary (OOV) words and explain their potential impact.

8. Conceptual reflection

Write a concise reflection (6–10 lines) discussing:

  • how word embeddings differ from Bag-of-Words and TF-IDF representations,

  • what semantic information embeddings capture, and

  • one limitation of static word embeddings such as Word2Vec.

22.0.4 Reproducibility requirement

  • The R Markdown document must be fully reproducible.

  • All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.

  • All random seeds (if applicable) must be set to ensure deterministic results.

  • All library versions used should be clearly reported.

References