hllinas2023

1 Prerequisites and software setup

The examples and exercises presented in this chapter rely on a small set of widely used Python libraries for text preprocessing, vectorization, and numerical computation. To ensure that all code runs correctly, the required packages and language resources should be installed before executing the examples in this document.

The commands below are provided for reference only and should be executed in a Python environment (for example, a terminal, Anaconda Prompt, or a Python-enabled R Markdown setup using reticulate).

# Core machine learning and NLP libraries
pip install scikit-learn
pip install nltk
pip install pandas
pip install numpy
pip install seaborn
pip install tabulate

# Download required NLTK resources
python -c "import nltk; nltk.download('wordnet')"
python -c "import nltk; nltk.download('omw-1.4')"
python -c "import nltk; nltk.download('stopwords')"

Once the packages are installed, the following Python modules are imported throughout this document:

import numpy as np
import pandas as pd
import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Role of the libraries.

The purpose of each library used in this document is summarized below:

  • NLTK provides basic natural language preprocessing tools, including stopword removal and lemmatization. Only lightweight linguistic processing is used in this document.

  • scikit-learn (sklearn) supplies the vectorization and similarity machinery, including CountVectorizer, TfidfVectorizer, and cosine similarity computation.

  • pandas is used to manage text corpora as structured objects (e.g., Series) and to apply preprocessing functions consistently across documents.

  • numpy supports numerical operations and vector-based computations required for similarity calculations.

  • seaborn is used for making statistical graphics.

  • tabulate is used to pretty-print tabular data in a human-readable format.

These tools are sufficient to illustrate the fundamental ideas behind frequency-based text representations, without introducing unnecessary dependencies.

2 Introduction

2.0.1 Preliminaries

Textual data poses a distinctive challenge for computational analysis: unlike numerical or categorical variables, natural language does not come with an inherent mathematical representation. While computers operate exclusively on numbers, language is expressed through symbols, words, and structures whose meaning is not natively encoded in numeric form.

Transforming text into numbers is therefore unavoidable, but it is also an opportunity. The specific choices made during this transformation determine which aspects of language are preserved, which are simplified or ignored, and how effectively learning algorithms can operate on linguistic data. In this sense, representation choices are not neutral: they directly influence model behavior, interpretability, and performance.

In the previous document (see vocabulary construction), we focused on defining the symbolic units of language processing, including tokenization strategies, normalization procedures, and vocabulary design. These steps establish what constitutes a unit of analysis. In this chapter, we move to the next stage of the pipeline and examine how those symbolic units are transformed into numerical objects.

Our approach is deliberately incremental. We begin with simple and transparent representations that emphasize observable structure rather than deep semantic meaning. By relying on frequency counts and distributional information, we can construct representations that are easy to interpret and that provide a solid mathematical foundation for more advanced techniques.

Throughout this chapter, we introduce classical methods for numerical text representation, including Bag-of-Words and term frequency–inverse document frequency (TF-IDF). Although conceptually straightforward, these methods remain widely used in practice, particularly for baseline models, exploratory analysis, and instructional settings.

Before introducing these techniques, it is useful to clarify a fundamental distinction that underlies all language modeling: syntax versus semantics. Syntax concerns the structural organization of words and their observable patterns of occurrence, whereas semantics relates to meaning and interpretation. A sentence may be syntactically well-formed without conveying meaningful information.

In this chapter, the emphasis is intentionally placed on the syntactic dimension of language. We focus on representations derived from word occurrence patterns (such as counts and relative frequencies) while postponing semantic representations (e.g., embeddings and neural encodings) to later chapters.

By the end of this chapter, you will be able to represent text using vectors and matrices, compute similarities between documents, and build simple language-based applications. These ideas also serve as a conceptual bridge toward the representation learning techniques employed in modern deep learning architectures, including Transformer-based models.

2.0.2 Motivation: From vectors to Transformer inputs

The numerical representations introduced in this chapter (such as Bag-of-Words and TF-IDF) illustrate a fundamental principle: language must be encoded as vectors before it can be processed by any computational model. Although these representations are relatively simple and primarily capture syntactic structure, they establish the mathematical foundation required for more advanced methods.

In modern architectures such as the Transformer (Vaswani et al., 2017), textual inputs are likewise mapped into vector spaces and processed through multiple layers of transformation. However, instead of relying on sparse, high-dimensional frequency-based vectors, Transformers employ dense vector representations (embeddings) that capture richer linguistic information.

Figure 2.1 shows that the model begins with an Input Embedding stage, where each token is mapped into a continuous vector space. The representations developed in this chapter can be interpreted as a conceptual precursor to that stage: they demonstrate how text can be embedded into vector spaces, even though they do not yet capture semantic relationships or contextual dependencies.

Figure 2.1: General architecture of the Transformer model. Source: [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)

This highlights an important transition: while frequency-based methods focus on observable structure, modern neural models require representations that also capture meaning and context. This transition (from syntactic representations to semantic vector spaces) is developed in the next document (see word embeddings).

Chapter roadmap.

The main topics covered in this chapter are:

  • Understanding vectors and matrices as mathematical data structures

  • Exploring the Bag-of-Words (BoW) representation

  • Constructing TF–IDF vectors

  • Measuring distance and similarity between document vectors

  • One-hot vectorization

  • Building a basic chatbot

3 Understanding vectors and matrices

A central challenge in NLP is expressing language in mathematical form. Two data structures play a fundamental role in this transformation: vectors and matrices. Together, they allow collections of text documents to be analyzed using the tools of linear algebra.

3.0.1 Vectors

Definition and notation.

A vector is a one-dimensional array of numerical values, where each position corresponds to a specific feature. Vectors are commonly represented as column arrays:

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \mathbf{v} =\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} \]

In this expression, the vector \(\mathbf{x}\) contains three components and belongs to \(\mathbb{R}^3\), while \(\mathbf{v}\) contains four components and belongs to \(\mathbb{R}^4\). Each coordinate represents the contribution of the vector along a particular axis. Once an object is represented as a vector, operations such as distance computation, similarity measurement, and projection become well defined.

Geometric intuition.

To develop geometric intuition, consider representing entities using measurable attributes. Suppose we describe two cities using their average annual temperature and annual rainfall:

\[ \begin{array}{c|cc} \text{City} & \text{Temperature (°C)} & \text{Rainfall (mm)} \\ \hline \text{A} & 18 & 500 \\ \text{B} & 25 & 1100 \end{array} \]

Each city can be interpreted as a point in a two-dimensional space, or equivalently, as a vector.

  • City A corresponds to the vector \(\mathbf{x}_A= (18, 500)\).

  • City B corresponds to the vector \(\mathbf{x}_B= (25, 1100)\).

From a mathematical perspective, both vectors belong to \(\mathbb{R}^2\). Here is the corresponding visualization:

Each vector originates at the coordinate system’s origin and points toward a location determined by the corresponding attributes. Adding a new attribute (such as altitude or population density) increases the dimensionality of the representation, moving the vectors from \(\mathbb{R}^2\) to \(\mathbb{R}^3\) or higher.

While such spaces quickly become difficult to visualize, the algebraic interpretation of vectors remains valid in any dimension.
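As a minimal sketch of these ideas, the two city vectors can be written as NumPy arrays and compared numerically. The third attribute (an altitude of 300 m) is a hypothetical value added only to show the move from \(\mathbb{R}^2\) to \(\mathbb{R}^3\); it does not come from the table above.

```python
import numpy as np

# City vectors from the table: (temperature in °C, rainfall in mm)
x_A = np.array([18.0, 500.0])
x_B = np.array([25.0, 1100.0])

# Euclidean distance between the two points in R^2
distance = np.linalg.norm(x_B - x_A)

# Cosine similarity compares directions and ignores vector length
cosine = (x_A @ x_B) / (np.linalg.norm(x_A) * np.linalg.norm(x_B))

# Hypothetical third attribute (altitude in m) moves the vector to R^3
x_A_3d = np.append(x_A, 300.0)

print(x_A.shape, x_A_3d.shape)  # (2,) (3,)
print(distance)
print(cosine)
```

Because rainfall is measured on a much larger scale than temperature, it dominates the Euclidean distance here. This is one reason why the choice of features and their scale matters once objects are treated as vectors.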

From geometric vectors to text representation.

The same idea extends naturally to textual data.

After the tokenization step introduced in the previous document (see Vocabulary and text normalization), a document can be represented as a vector in which each dimension corresponds to a unique token in the vocabulary. The value along each dimension reflects how frequently that token appears in the document.

In this way, text is mapped into a high-dimensional vector space, where each document corresponds to a point. This representation enables the application of vector-based operations such as similarity measurement, distance computation, and clustering, forming the basis of many methods in natural language processing.
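The mapping from a document to a point in this space can be sketched in a few lines of plain Python. The vocabulary and the document below are hypothetical toy values chosen for illustration; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical toy vocabulary; each term is one dimension of the space
vocabulary = ["data", "model", "text", "vector"]

# Hypothetical document, tokenized by simple whitespace splitting
document = "text data and more text"
counts = Counter(document.split())

# One coordinate per vocabulary term; terms absent from the document get 0
doc_vector = [counts[term] for term in vocabulary]
print(doc_vector)  # [1, 0, 2, 0]
```

Tokens that are not part of the vocabulary ("and", "more") simply have no coordinate and are therefore ignored.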

3.0.2 Matrices

Matrices extend vectors by organizing multiple vectors into rows and columns. A matrix can be written as:

\[ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} \]

This matrix belongs to \(\mathbb{R}^{3 \times 2}\), indicating three rows and two columns. In text analysis, matrices are commonly used to represent collections of documents. Each row corresponds to a document, each column corresponds to a token in the vocabulary, and each entry stores the frequency of that token in the document.
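A small NumPy sketch of the same layout is shown below. The counts are hypothetical and serve only to illustrate how rows (documents) and columns (tokens) are accessed.

```python
import numpy as np

# Three hypothetical documents described over a two-term vocabulary
X = np.array([
    [2, 0],   # document 1
    [1, 1],   # document 2
    [0, 3],   # document 3
])

print(X.shape)   # (3, 2): three rows (documents), two columns (terms)
print(X[1])      # row vector of document 2
print(X[:, 0])   # counts of the first term across all documents
```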

3.0.3 Matrix representation: a simple example

From text to matrix representation.

To illustrate how text can be organized into matrix form, consider the following small collection of documents:

from sklearn.feature_extraction.text import CountVectorizer # 1

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

vectorizer = CountVectorizer(stop_words="english")  # 2
vectorizer

X = vectorizer.fit_transform(documents)             # 3
X

# Inspect the learned vocabulary and document-term matrix
print(vectorizer.vocabulary_)  # 4 --> first output
print(X.todense())             # 5a --> second output
print(X.toarray())             # 5b --> second output

Explanation of the code.

Code 1.

The code begins by importing CountVectorizer from sklearn.feature_extraction.text, a tool designed to transform a collection of text documents into a document–term matrix. Next, a small corpus of three short documents is defined.

Code 2.

The instruction CountVectorizer(stop_words="english") creates a vectorizer that automatically tokenizes the text, extracts a set of unique tokens (the vocabulary), and removes common English stopwords such as and, are, or of.

CountVectorizer(stop_words='english')

The output displayed after evaluating vectorizer does not correspond to data, but to the internal representation of the CountVectorizer object. It simply confirms that the object has been successfully created with the specified parameters.

At this stage, the vectorizer has not yet been fitted to the data, meaning that no vocabulary has been learned and no matrix has been constructed. The expandable Parameters section reflects the configuration of the object (such as stopword removal), rather than any learned information.

In other words, we are not yet seeing the data (we are only defining the tool that will process it).

Only after applying fit_transform(documents) does the vectorizer learn the vocabulary and generate the document-term matrix.
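To make the difference between an unfitted and a fitted vectorizer concrete, the following sketch mimics the two phases with a small hand-written class. It is an illustration only and not scikit-learn's implementation: the class name ToyCountVectorizer, the reduced stopword set, and the dense list-of-lists output are all assumptions made for this example.

```python
import re

# Hypothetical subset of English stopwords, just enough for this corpus
STOP = {"on", "and", "are", "of", "can", "be"}


class ToyCountVectorizer:
    """Illustrative two-phase vectorizer (not scikit-learn's implementation)."""

    def _tokenize(self, text):
        # Lowercase, keep tokens of two or more word characters, drop stopwords
        return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP]

    def fit(self, docs):
        # Phase 1: learn the vocabulary (term -> column index, alphabetical order)
        terms = sorted({t for d in docs for t in self._tokenize(d)})
        self.vocabulary_ = {t: i for i, t in enumerate(terms)}
        return self

    def transform(self, docs):
        # Phase 2: count known terms; terms unseen during fit are ignored
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for t in self._tokenize(d):
                if t in self.vocabulary_:
                    row[self.vocabulary_[t]] += 1
            rows.append(row)
        return rows


documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently",
)

toy = ToyCountVectorizer()          # configured, but nothing learned yet
toy.fit(documents)                  # vocabulary is learned here
rows = toy.transform(documents)     # counts are produced here

# A new document is encoded with the already learned vocabulary
new_rows = toy.transform(["Text vectors and embeddings"])

print(len(toy.vocabulary_))
print(rows[0])
print(new_rows[0])
```

Before fit is called, the object has no vocabulary_ attribute at all. In the last call, the token embeddings was never seen during fitting, so it has no column and is silently dropped.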

Code 3.

The line fit_transform(documents) then performs two tasks in a single call:

  • It first learns the vocabulary from the corpus (fit).

  • It then converts the documents into a numerical matrix representation (transform).

The resulting object X is a sparse document-term matrix, where rows correspond to documents and columns correspond to vocabulary terms.

## <Compressed Sparse Row sparse matrix of dtype 'int64'
##  with 15 stored elements and shape (3, 14)>

The object X is not displayed as a full matrix because it is stored in a sparse format. Instead, Python displays a summary of its structure:

  • The shape (3, 14) indicates that the matrix has 3 rows (documents) and 14 columns (unique tokens in the vocabulary).

  • The expression “15 stored elements” means that only 15 entries in the matrix are nonzero. This reflects the sparsity of textual data, where most tokens do not appear in most documents.

  • The term Compressed Sparse Row (CSR) refers to the internal representation used to efficiently store and manipulate sparse matrices by keeping track only of nonzero entries.

This compact representation is essential for handling large text corpora, where the document–term matrix can have thousands or even millions of columns.
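The idea behind the CSR layout can be sketched with NumPy alone. The code below starts from the dense form of X (the same values that X.toarray() prints later in this section) and builds the three arrays that a CSR structure conceptually relies on. The variable names data, indices, and indptr follow common usage, but the construction is a simplified illustration and not SciPy's actual code.

```python
import numpy as np

# Dense form of the document-term matrix X (3 documents, 14 tokens)
dense = np.array([
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
])

data, indices, indptr = [], [], [0]
for row in dense:
    cols = np.nonzero(row)[0]        # column positions of the nonzero entries
    indices.extend(cols.tolist())    # which columns hold a value
    data.extend(row[cols].tolist())  # the values themselves
    indptr.append(len(indices))      # offset at which the next row starts

# Share of entries that are actually stored
density = len(data) / dense.size

print(indptr)     # [0, 5, 10, 15]
print(len(data))  # 15
print(density)
```

Only 15 of the 3 × 14 = 42 entries are stored. For a realistic corpus with tens of thousands of columns, the stored share is typically far smaller.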

In addition to printing the matrix, several attributes can be used to better understand its structure:

  • X.shape returns the dimensions of the matrix (number of documents × vocabulary size).

  • X.nnz gives the number of nonzero entries, indicating how sparse the matrix is.

  • vectorizer.get_feature_names_out() returns the ordered list of tokens corresponding to the columns of the matrix.

X.shape
X.nnz
vectorizer.get_feature_names_out()
## (3, 14)
## 15
## array(['analysis', 'collections', 'core', 'efficiently', 'large',
##        'mathematical', 'matrices', 'numerical', 'processed', 'relies',
##        'representations', 'text', 'tools', 'vectors'], dtype=object)

These tools allow us to interpret the matrix more precisely without converting it into a dense representation. The sparse representation hides most of the matrix entries (which are zero), but it preserves all the information needed to reconstruct the full document–term matrix when required.

Codes 4 and 5.

Finally, the code produces two explicit outputs:

  • The learned vocabulary (vectorizer.vocabulary_), which maps each token to a column index.

  • The dense version of the matrix (X.todense() or X.toarray()), which makes the full structure easier to inspect in small examples.

Example: interpreting the first output (code 4).

The printed dictionary (vocabulary_) maps each unique token to a column index in the document-term matrix:

## {'text': 11, 'analysis': 0, 'relies': 9, 'numerical': 7, 'representations': 10, 'vectors': 13, 'matrices': 6, 'core': 2, 'mathematical': 5, 'tools': 12, 'large': 4, 'collections': 1, 'processed': 8, 'efficiently': 3}

Each key in this dictionary is a token extracted from the corpus after preprocessing (tokenization and stopword removal). The associated number is not a frequency and does not indicate importance or order of appearance in the text. Instead, it specifies the column position assigned to that token in the document-term matrix.

To make this concrete:

  • 'analysis': 0 means that the token analysis corresponds to column 0 of the matrix.

  • 'collections': 1 corresponds to column 1.

  • 'core': 2 corresponds to column 2.

  • 'text': 11 corresponds to column 11.

  • 'vectors': 13 corresponds to column 13.

In other words, the numbers 0, 1, 2, …, 13 are indices, not counts. They simply label the columns of the matrix, starting from zero, following Python’s indexing convention.

Once this mapping is defined, the document-term matrix uses it consistently. For example:

  • The value located at row \(i\) and column 0 represents the frequency of the token analysis in document \(i\).

  • Similarly, the value at column 11 represents the frequency of the token text in that same document.

More generally, the entry in row \(i\) and column \(j\) records how many times token \(j\) appears in document \(i\).

This separation of roles is crucial:

  • The vocabulary dictionary defines where each token lives in the matrix.

  • The matrix entries define how often each token appears in each document.

Understanding this distinction helps explain why a document vector has a fixed length equal to the size of the vocabulary, and why most entries are zero when a token does not appear in a document.

Example: interpreting the second output (code 5).

The second output (X.todense() or X.toarray()) is the document-term matrix itself:

## [[1 0 0 0 0 0 0 1 0 1 1 1 0 0]
##  [0 0 1 0 0 1 1 0 0 0 0 0 1 1]
##  [0 1 0 1 1 0 0 0 1 0 0 1 0 0]]

Mathematically:

\[ \mathbf{X} = \left( \begin{array}{c|cccccccccccccc} \text{Text} & \text{anal} & \text{coll} & \text{core} & \text{eff} & \text{lar} & \text{math} & \text{mat} & \text{num} & \text{proc} & \text{relies} & \text{repr} & \text{text} & \text{tools} & \text{vec} \\ \hline \text{#1} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ \text{#2} & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \text{#3} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \end{array} \right) \in \mathbb{R}^{3 \times 14} \]

Legend (tokens):

anal = analysis; coll = collections; core = core; eff = efficiently; lar = large; math = mathematical; mat = matrices; num = numerical; proc = processed; relies = relies; repr = representations; text = text; tools = tools; vec = vectors.

This matrix should be interpreted as follows:

  • Rows of the matrix \(\mathbf{X}\) correspond to documents: Text 1, Text 2, Text 3 (in the same order as the input text).

  • Columns correspond to tokens in the vocabulary.

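Each row of \(\mathbf{X}\) is the vector of one document,

\[ \mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{i14}), \]

and its entry \(x_{ij}\) represents the number of times token \(j\) appears in document \(i\). Note that the mathematical notation counts rows and columns from 1, whereas Python indexes them from 0, so \(x_{1,8}\) corresponds to X[0, 7] in code. For example: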

  • \(x_{1,1} = 1\) indicates that the token analysis appears once in the first document.

  • \(x_{1,8} = 1\) indicates that the token numerical appears once in the first document.

  • \(x_{2,14} = 1\) indicates that the token vectors appears once in the second document.

  • Zeros indicate that the corresponding token does not appear in that document.
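The correspondence between the mathematical entries and the Python objects can be checked with a short sketch. The dictionary and the row below are copied from the outputs shown in this section; the helper name column_to_token is an assumption introduced for illustration.

```python
# Mapping copied from the printed vectorizer.vocabulary_ output
vocabulary = {
    "text": 11, "analysis": 0, "relies": 9, "numerical": 7,
    "representations": 10, "vectors": 13, "matrices": 6, "core": 2,
    "mathematical": 5, "tools": 12, "large": 4, "collections": 1,
    "processed": 8, "efficiently": 3,
}

# First row of the dense matrix (document 1)
row_1 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# Invert the mapping: column index -> token
column_to_token = {idx: tok for tok, idx in vocabulary.items()}

# The entry x_{1,8} uses 1-based indices; Python positions are 0-based
j_math = 8
token = column_to_token[j_math - 1]
count = row_1[j_math - 1]
print(token, count)  # numerical 1

# Read the whole row back as {token: count}, keeping nonzero entries only
doc_1_counts = {column_to_token[j]: c for j, c in enumerate(row_1) if c > 0}
print(doc_1_counts)
```

Reading a row back through the inverted mapping recovers which tokens a document contains, but not the order in which they appeared.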

Because each document is short and no token is repeated within a document, the matrix contains only the values 0 and 1. A value of 1 indicates that the corresponding token appears once in that document, while 0 indicates that it does not appear at all.

The length of each row vector equals the size of the vocabulary. In this example, the vocabulary contains 14 unique tokens after stopword removal, which explains why each document vector has 14 components.

Once text data has been converted into matrix form, it becomes amenable to standard linear algebra operations such as similarity computation, projection, and matrix transformations, enabling quantitative analysis of documents.
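As one example of such an operation, the sketch below computes the cosine similarity between all pairs of documents using only NumPy. The matrix is the dense document-term matrix printed above. This is a hand-written illustration of the formula, offered as an assumption about how one might compute it rather than as the method used later in the chapter.

```python
import numpy as np

# Dense document-term matrix from the example above
X = np.array([
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
], dtype=float)

# Scale every row to unit length
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

# Dot products of unit vectors are cosine similarities
similarity = X_unit @ X_unit.T
print(np.round(similarity, 2))
```

Documents 1 and 3 share exactly one token (text) and each contains five tokens, so their similarity is 1 / (√5 · √5) = 0.2. Document 2 shares no token with the others, which gives a similarity of 0.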

This type of matrix-based encoding is commonly associated with the Bag-of-Words (BoW) model, in which each document is represented relative to a fixed vocabulary, typically by recording the frequency of its tokens while ignoring word order.

4 Bag-of-Words (BoW)

4.0.1 Exploring the bag-of-words representation

Basic idea.

One of the simplest ways to represent text numerically is to count how often terms appear in a document. This idea forms the basis of the Bag-of-Words (BoW) representation.

The BoW model deliberately ignores word order and grammatical structure, focusing instead on which terms appear and how often they occur. Although this abstraction discards syntactic information such as word sequence, it provides a simple and effective baseline for many text analysis tasks.

In the previous chapter on vocabulary construction, we introduced the process of identifying and standardizing the basic units of text. That step is essential for BoW representations: before counting terms, we must first decide which terms belong to the vocabulary.

Vector interpretation.

Once the vocabulary is fixed, each document can be represented as a vector whose length equals the size of the vocabulary. Each position in the vector corresponds to a specific term, and the value stored in that position indicates how many times the term appears in the document.

If a term from the vocabulary does not appear in a given document, the corresponding entry in the vector is zero.

Sparsity and component values.

A natural question arises at this point:

What is the maximum possible value of an entry (or term count) in a Bag-of-Words vector?

Take a moment to think about it.

At first glance, one might expect a fixed upper bound. However, this is not the case.

In a Bag-of-Words representation, each component of the vector records the number of times a given term appears in a document. Therefore, the value of a component depends entirely on the frequency of that term within the document.

In principle, there is no fixed upper bound: a term could appear many times, especially in long documents or highly repetitive texts. As a result, some components may take relatively large values, while many others remain equal to zero.

This imbalance leads to a key property of Bag-of-Words representations: sparsity. Most entries in the vector are zero because most terms in the vocabulary do not appear in a given document.
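Both properties can be observed in a tiny example. The vocabulary and the deliberately repetitive document below are hypothetical and exist only to illustrate the point.

```python
from collections import Counter

# Hypothetical vocabulary and a deliberately repetitive document
vocabulary = ["analysis", "data", "model", "pattern", "science", "tool"]
document = "data data data model data data"

counts = Counter(document.split())
bow_vector = [counts[term] for term in vocabulary]

# The largest entry grows with repetition; there is no fixed ceiling
max_count = max(bow_vector)

# Fraction of vocabulary terms that do not occur in the document
zero_fraction = bow_vector.count(0) / len(bow_vector)

print(bow_vector)  # [0, 5, 1, 0, 0, 0]
print(max_count)   # 5
print(zero_fraction)
```

Appending more occurrences of data to the document would raise the corresponding entry further, while the four zero entries would stay unchanged.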

4.0.2 From text to Bag-of-Words: a step-by-step construction

To make the idea concrete, we now construct a Bag-of-Words representation manually, starting from a small collection of sentences.

Before examining each step in detail, Figure 4.1 provides an overview of the main stages involved in building a Bag-of-Words representation. These stages will be explained progressively in the following sections.

Figure 4.1: Step-by-step construction of a Bag-of-Words representation. Source: Created by the author with ChatGPT (OpenAI)

Step 1: Define a small corpus.

We begin by defining a small collection of sentences, which will serve as our corpus. Each sentence is treated as a separate document:

sentences = [
    "Data science connects statistics and computation",
    "Statistical models learn patterns from data",
    "Modern data analysis relies on computational tools"
]

This corpus represents the raw textual input. At this stage, the data is still unstructured and cannot yet be processed mathematically.

Step 2: Store the corpus in a structured form.

The code.

The corpus is stored as a pandas.Series, where each element represents one document. This structured format facilitates systematic preprocessing and later vectorization steps.

import pandas as pd

corpus = pd.Series(sentences)
corpus

The code begins by importing the pandas library, which provides convenient data structures for handling and organizing data. The list of sentences defined in the previous step is then converted into a Series object.

A pandas.Series can be understood as a one-dimensional labeled array. In this context, each entry of the Series corresponds to a document, and the index (0, 1, 2, …) uniquely identifies each one.

The output.

In this case, the printed output is:

## 0     Data science connects statistics and computation
## 1          Statistical models learn patterns from data
## 2    Modern data analysis relies on computational t...
## dtype: object

This output shows:

  • The index on the left (0, 1, 2), which labels each document.

  • The text content of each document.

  • The data type (dtype: object), indicating that the entries are stored as text.

This representation does not yet transform the text into numbers, but it organizes the corpus into a structured format that can be easily processed in subsequent steps.

Step 3: Apply basic preprocessing.

The code.

The preprocessing step standardizes the text by lowercasing, removing punctuation and stopwords, and reducing words to their lemma. These operations ensure that different surface forms of a word are treated consistently.

As discussed in the previous document on vocabulary construction (see Vocabulary and text normalization), preprocessing is a crucial step that directly influences how tokens are defined and how the vocabulary is built.

The following code implements a basic preprocessing pipeline using the nltk library:

# Step 3a: Required packages
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import re
import numpy as np

# Step 3b: Defining the preprocessing function
def clean_and_lemmatize(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
    
    return " ".join(tokens)

# Step 3c: Applying the preprocessing function to the corpus
processed_corpus = corpus.apply(clean_and_lemmatize)
processed_corpus

This code defines a preprocessing function and applies it to each document in the corpus. For clarity, each component of this pipeline is explained in detail in the following subsections.

This preprocessing stage prepares the corpus for vocabulary construction, which is the next step in building the Bag-of-Words representation.

Step 3a: Required packages.

These packages provide tools for tokenization, stopword filtering, and lemmatization, which are standard preprocessing steps in natural language processing.

Step 3b: Defining the preprocessing function.

This function performs several preprocessing operations in sequence.

  • The re.sub(r"[^a-zA-Z]", " ", text) instruction replaces every character that is not a letter (punctuation, digits, and other symbols) with a space.

  • The .lower() method converts all text to lowercase to ensure consistency.

  • The .split() operation tokenizes the text into individual words.

  • Stopwords (common words such as the, and, is) are removed using the stopwords list from nltk.

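  • The WordNetLemmatizer reduces each token to its base form (lemma), so that different grammatical forms are treated as the same term (e.g., models → model). Because lemmatize() is called without a part-of-speech argument, every token is treated as a noun, which is why verb forms such as connects and relies remain unchanged in the output.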

  • The output of the function is a cleaned and normalized string, ready for vectorization.

This step is essential because the quality of the vocabulary and the resulting Bag-of-Words representation depend directly on how the text is preprocessed.
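The effect of the first three operations can be inspected in isolation with the standard library alone. The input strings below are hypothetical examples. Stopword removal and lemmatization are deliberately left out, so this sketch covers only the regular-expression, lowercasing, and splitting steps.

```python
import re

# Hypothetical input containing punctuation and digits
text = "Statistical models, in 2023, learn patterns!"

# Replace every character that is not a letter with a space
letters_only = re.sub(r"[^a-zA-Z]", " ", text)

# Lowercase, then split on whitespace to obtain tokens
tokens = letters_only.lower().split()
print(tokens)  # ['statistical', 'models', 'in', 'learn', 'patterns']

# Side effect: apostrophes are also replaced, which splits contractions
contraction_tokens = re.sub(r"[^a-zA-Z]", " ", "don't").lower().split()
print(contraction_tokens)  # ['don', 't']
```

The digits in 2023 disappear entirely, and the contraction is split into two fragments. Whether this behavior is acceptable depends on the corpus, which is one reason why preprocessing choices shape the resulting vocabulary.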

Step 3c: Applying the preprocessing function to the corpus.

The function clean_and_lemmatize is applied to each document in the corpus using the .apply() method from pandas. This method iterates over all elements of the Series and transforms each document individually.

## 0       data science connects statistic computation
## 1              statistical model learn pattern data
## 2    modern data analysis relies computational tool
## dtype: object

The output shows the preprocessed version of each document in the corpus, where stopwords have been removed and the remaining words have been lemmatized. Each row corresponds to one document, and the original document order is preserved.

len(processed_corpus)
## 3

This confirms that the corpus contains three documents, each of which has been transformed into its cleaned textual representation.

Step 4: Build the vocabulary.

The code.

This code constructs the vocabulary by extracting all unique tokens from the preprocessed corpus and sorting them alphabetically. Each token appears only once, regardless of how many times it occurs in the documents.

vocabulary = sorted(set(
    word for sentence in processed_corpus for word in sentence.split()
))
vocabulary

The expression inside the code performs three main operations:

  • The generator expression word for sentence in processed_corpus for word in sentence.split() iterates over each document and extracts all of its tokens.

  • The function set(...) removes duplicate tokens, ensuring that each term appears only once.

  • The function sorted(...) orders the vocabulary alphabetically, guaranteeing a consistent and reproducible structure.

The output.

The output is shown below:

## ['analysis', 'computation', 'computational', 'connects', 'data', 'learn', 'model', 'modern', 'pattern', 'relies', 'science', 'statistic', 'statistical', 'tool']

It is a list of 14 unique terms.

len(vocabulary)
## 14

Each term defines one dimension of the Bag-of-Words vector space. Therefore, every document will be represented as a vector of length 14, where each position corresponds to one vocabulary term.

In this way, the vocabulary defines the coordinate system of the vector space in which all documents will be represented.

Step 5: Assign indices to vocabulary terms.

The code.

This code creates a dictionary that maps each vocabulary term to a unique integer index. These indices define the column positions that each token will occupy in the Bag-of-Words matrix, ensuring a consistent numerical representation across all documents.

token_index = {token: idx for idx, token in enumerate(vocabulary)}
token_index

The function enumerate(vocabulary) pairs each term with an integer index:

(token₀, 0), (token₁, 1), (token₂, 2), ...

The dictionary comprehension then converts these pairs into a mapping of the form:

token → index

This mapping is essential because it determines the exact position of each term in the vector representation of every document.

In this way, the vocabulary is transformed into a coordinate system, where each dimension corresponds to a specific term.

The output.

In this case, the output is:

## {'analysis': 0, 'computation': 1, 'computational': 2, 'connects': 3, 'data': 4, 'learn': 5, 'model': 6, 'modern': 7, 'pattern': 8, 'relies': 9, 'science': 10, 'statistic': 11, 'statistical': 12, 'tool': 13}

Without this indexing step, it would not be possible to construct a consistent document–term matrix across multiple documents.
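A short sketch shows how the mapping is used in practice. The vocabulary is copied from the output of Step 4. The token embedding is a hypothetical out-of-vocabulary example.

```python
# Vocabulary copied from the output of Step 4
vocabulary = [
    "analysis", "computation", "computational", "connects", "data",
    "learn", "model", "modern", "pattern", "relies", "science",
    "statistic", "statistical", "tool",
]
token_index = {token: idx for idx, token in enumerate(vocabulary)}

# Column position of a known token
print(token_index["data"])  # 4

# A token outside the vocabulary has no column; .get() returns None
print(token_index.get("embedding"))  # None

# Column positions touched by the first preprocessed document
first_document = "data science connects statistic computation"
positions = [token_index[tok] for tok in first_document.split()]
print(positions)  # [4, 10, 3, 11, 1]
```

The positions follow the order of the tokens in the document, whereas the columns of the matrix follow the alphabetical order of the vocabulary. The mapping is what reconciles the two.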

The next step is to initialize the document-term matrix using these indices.

Step 6: Initialize the Bag-of-Words matrix.

The code.

The document–term matrix will store the frequency of each vocabulary term in each document. It is first created with every entry set to zero.

bow_matrix = np.zeros((len(processed_corpus), len(vocabulary)))
bow_matrix

The function np.zeros(...) creates a matrix filled with zeros. The shape of the matrix is determined by:

  • len(processed_corpus): the number of documents (rows).

  • len(vocabulary): the number of unique terms (columns).

Thus, the resulting matrix has:

number of documents × vocabulary size

This abstract description can be made more explicit by inspecting the matrix directly in Python.

The output (with python).

The size of the matrix can be verified as follows:

(len(processed_corpus), len(vocabulary))
## (3, 14)

The matrix itself is given by:

## array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
##        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
##        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

While this output shows the numerical structure, it is also useful to interpret the matrix from a mathematical perspective.

The output (matrix representation).

To better understand the structure of the matrix, it is useful to make explicit the correspondence between tokens and vocabulary terms:

\[ \text{Token map:}\qquad \begin{array}{llllllll} \texttt{ana}=\texttt{analysis}, & \texttt{cmp}=\texttt{computation}, & \texttt{cmpl}=\texttt{computational}, \\ \texttt{cnt}=\texttt{connects}, & \texttt{dat}=\texttt{data}, & \texttt{lear}=\texttt{learn}, \\ \texttt{mod}=\texttt{model}, & \texttt{mdrn}=\texttt{modern}, & \texttt{pat}=\texttt{pattern}, \\ \texttt{rel}=\texttt{relies}, & \texttt{sci}=\texttt{science}, & \texttt{st}=\texttt{statistic}, \\ \texttt{stl}=\texttt{statistical}, & \texttt{tool}=\texttt{tool}. \end{array} \]

Given this mapping, the matrix can be interpreted as the initial state:

\[ \mathbf{B}^{(0)} =\left( \begin{array}{c|cccccccccccccccc} \texttt{Text} & \texttt{ana} & \texttt{cmp} & \texttt{cmpl} & \texttt{cnt} & \texttt{dat} & \texttt{lear} & \texttt{mod} & \texttt{mdrn} & \texttt{pat} & \texttt{rel} & \texttt{sci} & \texttt{st} & \texttt{stl} & \texttt{tool} \\ \hline \text{#1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{#2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{#3} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{array} \right) \]

This representation makes explicit the relationship between vocabulary terms (columns) and documents (rows).

The output (interpretation).

At this stage, all entries are zero because no word counts have been recorded yet. The matrix only defines the structure of the representation.

  • Each row corresponds to a document, and each column corresponds to a vocabulary term (as defined in Step 5).

  • The value at position \((i, j)\) will later store how many times term \(j\) appears in document \(i\).

This matrix defines the vector space in which documents will be represented, but it does not yet contain any information about word frequencies.
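One detail worth noting is the data type. The sketch below assumes only the shape reported above (3 documents, 14 terms) and compares the default behaviour of np.zeros with an explicit integer type; the variable names are illustrative.

```python
import numpy as np

n_docs, n_terms = 3, 14   # shape reported above for the running example

# Default: floating-point zeros, which is why the output shows "0."
bow_float = np.zeros((n_docs, n_terms))

# Optional alternative: integer zeros, closer to the idea of a count
bow_int = np.zeros((n_docs, n_terms), dtype=int)

print(bow_float.dtype)   # float64
print(bow_int.shape)     # (3, 14)
```

Either choice works for counting; an integer matrix simply displays counts without the trailing decimal point.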

We are now ready to populate the matrix with actual word counts.

Step 7: Populate the matrix with word counts.

This code fills the Bag-of-Words matrix by counting word occurrences. For each document (i), every token in the preprocessed sentence is located in the vocabulary using token_index, and the corresponding matrix entry is increased by one.

for i, sentence in enumerate(processed_corpus):
    for token in sentence.split():
        bow_matrix[i, token_index[token]] += 1

The code operates as follows:

  • The loop enumerate(processed_corpus) iterates over each document, where i is the document index and sentence is the corresponding text.

  • Each document is split into tokens using .split().

  • For each token, the dictionary token_index provides the column index associated with that term.

  • The value in the matrix at position (i, j) is incremented by 1, where:

    • i = document index

    • j = token index

In this way, the matrix is gradually populated with term frequencies.

After this step, each row of bow_matrix represents a document as a vector of word counts. Nonzero values indicate that a term appears in the document, while zeros indicate absence.

This step transforms the empty matrix into a numerical representation of the corpus, where each document is encoded as a vector in the vocabulary space.
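The running corpus contains no repeated word within a document, so every nonzero entry equals 1. The following sketch uses a hypothetical two-document corpus (toy_corpus is not part of the chapter's data) in which one word repeats, to show that the same loop produces counts larger than 1. Plain nested lists are used instead of NumPy only to keep the sketch dependency-free.

```python
from collections import Counter

# Hypothetical preprocessed corpus; "data" occurs twice in the first document
toy_corpus = ["data science data", "science model"]

# Sorted vocabulary and term -> column index mapping
toy_vocabulary = sorted({tok for doc in toy_corpus for tok in doc.split()})
toy_index = {tok: j for j, tok in enumerate(toy_vocabulary)}

# Zero matrix as nested lists: one row per document, one column per term
toy_bow = [[0] * len(toy_vocabulary) for _ in toy_corpus]

# Same counting loop as in Step 7
for i, doc in enumerate(toy_corpus):
    for tok in doc.split():
        toy_bow[i][toy_index[tok]] += 1

# Cross-check of the first row with collections.Counter
counts_doc0 = Counter(toy_corpus[0].split())

print(toy_vocabulary)   # ['data', 'model', 'science']
print(toy_bow)          # [[2, 0, 1], [0, 1, 1]]
```

Note that token_index[token] raises a KeyError for a token that is missing from the vocabulary, so this loop assumes the vocabulary was built from the same corpus.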

Step 8: Inspect the final representation.

The output (with python).

The resulting Bag-of-Words matrix is shown below:

bow_matrix
## array([[0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
##        [0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0.],
##        [1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 1.]])

This output provides the numerical representation of the corpus, where each row corresponds to a document and each column corresponds to a vocabulary term.

The output (matrix representation).

To better understand the structure of the final representation, we can express the matrix in mathematical form:

\[ \text{Token map:}\qquad \begin{array}{llllllll} \texttt{ana}=\texttt{analysis}, & \texttt{cmp}=\texttt{computation}, & \texttt{cmpl}=\texttt{computational}, \\ \texttt{cnt}=\texttt{connects}, & \texttt{dat}=\texttt{data}, & \texttt{lear}=\texttt{learn}, \\ \texttt{mod}=\texttt{model}, & \texttt{mdrn}=\texttt{modern}, & \texttt{pat}=\texttt{pattern}, \\ \texttt{rel}=\texttt{relies}, & \texttt{sci}=\texttt{science}, & \texttt{st}=\texttt{statistic}, \\ \texttt{stl}=\texttt{statistical}, & \texttt{tool}=\texttt{tool}. \end{array} \]

\[ \mathbf{B} =\left( \begin{array}{c|cccccccccccccccc} \texttt{Text} & \texttt{ana} & \texttt{cmp} & \texttt{cmpl} & \texttt{cnt} & \texttt{dat} & \texttt{lear} & \texttt{mod} & \texttt{mdrn} & \texttt{pat} & \texttt{rel} & \texttt{sci} & \texttt{st} & \texttt{stl} & \texttt{tool} \\ \hline \text{#1} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ \text{#2} & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ \text{#3} & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \end{array} \right) \]

This matrix corresponds to the fully populated Bag-of-Words representation, where each entry reflects the frequency of a term in a given document.

The output (interpretation).

In this matrix:

  • Each row represents a document.

  • Each column represents a vocabulary term.

  • The entries indicate term frequencies, with many zeros reflecting the sparse nature of Bag-of-Words representations.

For example:

  • The entry in row #2 and column dat indicates how many times the word data appears in the second document (here, once).

  • More generally, each value \((i, j)\) captures the frequency of term \(j\) in document \(i\).

Final remark.

Most entries remain zero, illustrating the sparsity typical of Bag-of-Words representations. This sparsity arises because each document contains only a small subset of the full vocabulary.

This final matrix provides a complete numerical representation of the corpus, enabling the application of linear algebra operations and machine learning algorithms to textual data.

While this representation is simple and effective, it treats each word independently and ignores local context. We now explore extensions that capture richer structures in text.

4.0.3 Beyond unigrams

So far, we have considered only unigrams, meaning individual words. The same idea can be extended to:

  • Bigrams (pairs of consecutive words),

  • Trigrams, and

  • Higher-order n-grams.

Including n-grams allows the model to capture local contextual information, meaning that it can recognize short sequences of words rather than treating each word independently.

For example, the bigram data science carries a more specific meaning than the individual words data and science considered separately.

However, this comes at a cost: each additional n-gram increases the size of the vocabulary, leading to a higher-dimensional representation.
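To make the growth in vocabulary concrete, the sketch below extracts n-grams from a single tokenized sentence with a small helper. The function name ngrams and the example sentence are assumptions made for illustration; this is not how any particular library implements n-gram extraction.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

# Illustrative sentence, already lowercased and tokenized by whitespace
tokens = "data science relies on numerical methods".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# ['data science', 'science relies', 'relies on', 'on numerical', 'numerical methods']
print(len(unigrams), len(bigrams), len(trigrams))   # 6 5 4
```

A sentence of six words therefore contributes up to 6 + 5 + 4 = 15 features when unigrams, bigrams, and trigrams are all included, compared with 6 for unigrams alone.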

This raises a natural question:

Do we really need to implement all of this manually?
Fortunately, no.  

In practice, modern NLP libraries provide efficient and well-tested implementations of Bag-of-Words models.

In the next section, we introduce one such tool that automates this entire process.

5 Implementing Bag-of-Words with CountVectorizer

5.0.1 Understanding the BoW procedure with CountVectorizer

Manually building a Bag-of-Words (BoW) matrix helps develop intuition, but it is rarely necessary in practice. As discussed in the previous section, extending representations (e.g., to n-grams) quickly increases complexity.

The scikit-learn library provides CountVectorizer, one of the most widely used tools for automating this entire process.

CountVectorizer transforms a collection of text documents into a document-term matrix, where:

  • Each row represents a document.

  • Each column corresponds to a token in the learned vocabulary.

  • Each cell contains the frequency of that token in the document.

This procedure mirrors the manual construction developed earlier, but in a fully automated and optimized way.

Figure 5.1 illustrates the transformation pipeline implemented by CountVectorizer, from raw text to the document-term matrix.

Figure 5.1: Bag-of-Words representation using CountVectorizer. Source: Created by the author with ChatGPT (OpenAI)

We now illustrate this process with a simple, concrete example.

5.0.2 BoW with CountVectorizer: example

The code.

Let us illustrate this with a small, self-contained example.

First, a CountVectorizer object is initialized using its default settings. Then, the method fit_transform() is applied to the corpus. This method simultaneously learns the vocabulary from the input documents and constructs the corresponding Bag-of-Words matrix.

Before implementing the model, note that the Bag-of-Words representation ignores word order and focuses only on term frequencies.

from sklearn.feature_extraction.text import CountVectorizer  # 1

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling"
]

vectorizer = CountVectorizer()                    # 2
vectorizer

bow_matrix = vectorizer.fit_transform(documents) # 3
bow_matrix

# Inspect the learned vocabulary and document-term matrix
print(vectorizer.get_feature_names_out()) # 4 --> first output
print(bow_matrix.toarray())               # 5a --> second output
print(bow_matrix.todense())               # 5b --> second output

Explanation.

  1. The CountVectorizer class is imported from scikit-learn. It is used to convert a collection of text documents into a numerical representation based on word counts.

  2. A vectorizer object is created with default parameters. By default, it:

    • converts all text to lowercase,

    • tokenizes the text automatically,

    • builds the vocabulary from the corpus,

    • does not remove stopwords.

  3. The method fit_transform() performs two operations in a single step:

    • fit: learns the vocabulary from the documents,

    • transform: converts each document into a vector of term frequencies.

The result is a sparse matrix, where most entries are zero.

  4. The method get_feature_names_out() returns the learned vocabulary. The order of these terms defines the column ordering of the matrix.

  5. The method toarray() (line 5a) converts the sparse matrix into a dense numerical array for inspection. The call to todense() (line 5b) displays the same counts.

The resulting output contains the learned vocabulary and the associated document-term matrix, which corresponds directly to the conceptual BoW construction discussed earlier.

First output (code 4): learned vocabulary.

The first output displays the learned vocabulary, that is, the set of unique tokens extracted from the corpus.

Each term corresponds to a column in the document-term matrix, and the order shown determines the column positions in the representation.

By default, CountVectorizer sorts the terms alphabetically.

## ['analysis' 'and' 'data' 'mathematical' 'matrices' 'methods' 'modeling'
##  'numerical' 'on' 'relies' 'representations' 'science' 'support' 'text'
##  'uses' 'vectors']

Second output (code 5): Bag-of-Words matrix.

This corresponds directly to the document-term matrix introduced in the manual construction.

## [[0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0]
##  [1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1]
##  [0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 0]]

The output shows the Bag-of-Words matrix in dense form.

  • Each row corresponds to one document.

  • Each column corresponds to one of the vocabulary terms listed above.

  • Each entry indicates how many times a term appears in a document.

A value of:

  • 0 means the term does not appear,

  • a positive integer indicates its frequency.

This representation ignores word order and syntactic structure, retaining only term frequencies.

Matrix representation of the BoW.

To keep the notation compact, we label each token with a short abbreviation and report the corresponding document-term matrix below.

\[ \text{Token map:}\qquad \begin{array}{llllllll} \texttt{ana}=\texttt{analysis}, & \texttt{and}=\texttt{and}, & \texttt{dat}=\texttt{data}, & \texttt{math}=\texttt{mathematical}, \\ \texttt{mtx}=\texttt{matrices}, & \texttt{meth}=\texttt{methods}, & \texttt{model}=\texttt{modeling}, & \texttt{num}=\texttt{numerical}, \\ \texttt{on}=\texttt{on}, & \texttt{rel}=\texttt{relies}, & \texttt{repr}=\texttt{representations}, & \texttt{sci}=\texttt{science}, \\ \texttt{sup}=\texttt{support}, & \texttt{txt}=\texttt{text}, & \texttt{use}=\texttt{uses}, & \texttt{vec}=\texttt{vectors}. \end{array} \]

To make the structure more explicit, we present the matrix in mathematical form.

Each entry \((i,j)\) represents the frequency of term \(j\) in document \(i\).

This type of representation is typically sparse, meaning that most entries are zero, especially as the vocabulary size increases.

\[ \mathbf{B}= \left( \begin{array}{c|cccccccccccccccc} \texttt{Doc} & \texttt{ana} & \texttt{and} & \texttt{dat} & \texttt{math} & \texttt{mtx} & \texttt{meth} & \texttt{model} & \texttt{num} & \texttt{on} & \texttt{rel} & \texttt{repr} & \texttt{sci} & \texttt{sup} & \texttt{txt} & \texttt{use} & \texttt{vec} \\ \hline \text{#1} & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{#2} & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ \text{#3} & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ \end{array}\right) \]

For example, the token data appears once in the first and third documents, and does not appear in the second document. Similarly, the token analysis appears only in the second document, while numerical appears only in the first document. This sparsity pattern is typical of Bag-of-Words representations, especially as the vocabulary size grows.

More generally, each entry \((i,j)\) represents the frequency of term \(j\) in document \(i\).
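The degree of sparsity can be quantified as the share of zero entries. The sketch below copies the dense matrix printed above into a nested list (the names B, n_entries, n_nonzero, and sparsity are illustrative) and computes that share without additional libraries.

```python
# Dense Bag-of-Words matrix copied from the output above (3 documents, 16 terms)
B = [
    [0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
]

n_entries = sum(len(row) for row in B)
n_nonzero = sum(1 for row in B for value in row if value != 0)

# Sparsity: proportion of entries equal to zero
sparsity = 1 - n_nonzero / n_entries

print(n_nonzero, n_entries, round(sparsity, 3))   # 17 48 0.646
```

Even in this tiny corpus, roughly two thirds of the entries are zero; with realistic vocabularies the proportion is typically far higher, which is why the vectorizer returns a sparse matrix.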

Heatmap of the BoW.

The same matrix can be visualized as a heatmap, where darker cells indicate higher token counts.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

terms = vectorizer.get_feature_names_out()
X = bow_matrix.toarray()
df_bow = pd.DataFrame(X, columns=terms)

plt.figure(figsize=(14,5));
ax = sns.heatmap(df_bow, cmap="Blues", cbar=True)

# --- Title and axis labels ---
ax.set_title("Bag-of-Words representation", fontsize=18, pad=10);
ax.set_xlabel("Vocabulary terms", fontsize=18);
ax.set_ylabel("Documents", fontsize=18);

# --- Tick labels ---
ax.tick_params(axis="x", labelsize=14, rotation=45)
ax.tick_params(axis="y", labelsize=14)

# --- Colorbar font size ---
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=14)

plt.tight_layout() # Prevents cropping
plt.show()

Each cell represents the frequency of a term in a document.

Since this is a small corpus, most values are either 0 or 1, so the heatmap primarily highlights the presence or absence of terms rather than strong frequency differences.

Note that common words such as and and on are included in the vocabulary because no stopword filtering was applied in this example.

Final remarks.

This example illustrates how CountVectorizer automates:

  • tokenization,

  • vocabulary construction,

  • and word counting.

The resulting Bag-of-Words representation provides a simple yet powerful way to transform text into numerical features suitable for machine learning models.

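However, this representation ignores word order and context, and it weights every term by its raw count. These limitations motivate weighting schemes such as TF-IDF, which downweight terms that occur in many documents, and representations such as word embeddings, which encode information about context.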

6 CountVectorizer: additional arguments

6.0.1 Understanding how CountVectorizer can be customized

Importantly, CountVectorizer provides several arguments that allow the basic Bag-of-Words representation to be refined and controlled, such as vocabulary size limits and document-frequency thresholds. These arguments, illustrated in Figure 6.1, will be introduced conceptually here and implemented in detail in the following sections.

Figure 6.1: Bag-of-words - CountVectorizer arguments. Source: Created by the author with ChatGPT (OpenAI)

Overview of customization options

To better organize these options, it is useful to distinguish between built-in processing features and explicit vocabulary control parameters.

The behavior of CountVectorizer can be adjusted through several arguments, which can be grouped into two main categories:

1. Built-in processing features (out-of-the-box behavior):

  • Automatic vocabulary learning.

  • Tokenization.

  • Support for n-grams (ngram_range).

  • Optional stopword removal (stop_words).

2. Vocabulary control parameters:

  • max_features: limits the vocabulary to the top N most frequent terms.

  • min_df: removes terms that appear in too few documents.

  • max_df: removes terms that appear in too many documents.

These options allow the user to balance expressiveness and dimensionality, tailoring the representation to the specific task and dataset.

6.0.2 CountVectorizer: out-of-the-box features

Beyond basic word counts, CountVectorizer includes several built-in options that make it flexible and practical for real-world applications.

We now explore some of the most commonly used features.

Automatic vocabulary learning and n-gram generation.

By default, CountVectorizer learns its vocabulary directly from the data. In addition, it can:

  • Apply tokenization internally,

  • Remove stopwords automatically when the stop_words argument is set, and

  • Generate n-grams without additional code.

Example.

In the following example, the argument ngram_range = (1, 3) instructs the vectorizer to include unigrams, bigrams, and trigrams, that is, single words, pairs of consecutive words, and sequences of three consecutive words.

First, a CountVectorizer object is created with the specified n-gram range. The method fit_transform() then learns the vocabulary from the corpus and constructs the corresponding Bag-of-Words matrix, where each column represents an n-gram and each row represents a document.

vectorizer_ngram = CountVectorizer(ngram_range=(1, 3))
bow_ngram = vectorizer_ngram.fit_transform(documents)

print(vectorizer_ngram.get_feature_names_out()) # Output 1
print(bow_ngram.toarray())                      # Output 2

First output.

The first output displays the learned n-gram vocabulary. As a result, terms such as analysis (unigram), text analysis (bigram), and text analysis uses (trigram) coexist as distinct features in the representation. The order shown here defines the column ordering of the Bag-of-Words matrix.

## ['analysis' 'analysis uses' 'analysis uses vectors' 'and' 'and matrices'
##  'data' 'data modeling' 'data science' 'data science relies'
##  'mathematical' 'mathematical representations'
##  'mathematical representations support' 'matrices' 'methods' 'modeling'
##  'numerical' 'numerical methods' 'on' 'on numerical'
##  'on numerical methods' 'relies' 'relies on' 'relies on numerical'
##  'representations' 'representations support'
##  'representations support data' 'science' 'science relies'
##  'science relies on' 'support' 'support data' 'support data modeling'
##  'text' 'text analysis' 'text analysis uses' 'uses' 'uses vectors'
##  'uses vectors and' 'vectors' 'vectors and' 'vectors and matrices']

To facilitate later reference and discussion, the learned vocabulary is listed below as an indexed sequence. This enumeration will be used in subsequent sections to illustrate how n-gram features are filtered, selected, or weighted when applying additional arguments of CountVectorizer.

library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)

tokens <- c(
  "analysis",
  "analysis uses",
  "analysis uses vectors",
  "and",
  "and matrices",
  "data",
  "data modeling",
  "data science",
  "data science relies",
  "mathematical",
  "mathematical representations",
  "mathematical representations support",
  "matrices",
  "methods",
  "modeling",
  "numerical",
  "numerical methods",
  "on",
  "on numerical",
  "on numerical methods",
  "relies",
  "relies on",
  "relies on numerical",
  "representations",
  "representations support",
  "representations support data",
  "science",
  "science relies",
  "science relies on",
  "support",
  "support data",
  "support data modeling",
  "text",
  "text analysis",
  "text analysis uses",
  "uses",
  "uses vectors",
  "uses vectors and",
  "vectors",
  "vectors and",
  "vectors and matrices"
)

tok_tbl <- tibble(
  ID = seq_along(tokens),
  Token = tokens
) %>%
  mutate(
    n_words = str_count(Token, "\\S+") # count "words" separated by whitespace
  ) %>%
  mutate(
    Unigram = ifelse(n_words == 1, "✓", ""),
    Bigram  = ifelse(n_words == 2, "✓", ""),
    Trigram = ifelse(n_words == 3, "✓", "")
  ) %>%
  select(ID, Token, Unigram, Bigram, Trigram)

# Display as a formatted table
kable(tok_tbl, align = "clccc",
      col.names = c("Token ID", "Token (as learned)", "Unigram", "Bigram", "Trigram"),
      caption = "Indexed n-gram vocabulary and token type (based on word count).",
       format = "html",
       booktabs = TRUE) %>%
kable_styling() %>%
kable_classic_2(full_width = FALSE)
Table 6.1: Indexed n-gram vocabulary and token type (based on word count).
Token ID  Token (as learned)                    Unigram  Bigram  Trigram
1         analysis                              ✓
2         analysis uses                                  ✓
3         analysis uses vectors                                  ✓
4         and                                   ✓
5         and matrices                                   ✓
6         data                                  ✓
7         data modeling                                  ✓
8         data science                                   ✓
9         data science relies                                    ✓
10        mathematical                          ✓
11        mathematical representations                   ✓
12        mathematical representations support                   ✓
13        matrices                              ✓
14        methods                               ✓
15        modeling                              ✓
16        numerical                             ✓
17        numerical methods                              ✓
18        on                                    ✓
19        on numerical                                   ✓
20        on numerical methods                                   ✓
21        relies                                ✓
22        relies on                                      ✓
23        relies on numerical                                    ✓
24        representations                       ✓
25        representations support                        ✓
26        representations support data                           ✓
27        science                               ✓
28        science relies                                 ✓
29        science relies on                                      ✓
30        support                               ✓
31        support data                                   ✓
32        support data modeling                                  ✓
33        text                                  ✓
34        text analysis                                  ✓
35        text analysis uses                                     ✓
36        uses                                  ✓
37        uses vectors                                   ✓
38        uses vectors and                                       ✓
39        vectors                               ✓
40        vectors and                                    ✓
41        vectors and matrices                                   ✓

This example shows that the learned vocabulary contains tokens of different lengths. For instance:

  • Tokens 1, 6, 10, 16, and 33 correspond to unigrams.

  • Tokens 2, 7, 8, 17, and 34 correspond to bigrams.

  • Tokens 3, 12, 20, 35, and 38 correspond to trigrams.

These differences arise solely from the chosen ngram_range. The counting principle of the Bag-of-Words representation is unchanged: every column still records how many times its token occurs in each document.

Second output.

The second output shows the Bag-of-Words matrix constructed using the n-gram vocabulary. Each row corresponds to a document, each column corresponds to a specific n-gram, and the value in each cell indicates how many times that n-gram appears in the document.

## [[0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0
##   0 0 0 0 0]
##  [1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
##   1 1 1 1 1]
##  [0 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0
##   0 0 0 0 0]]

For readability, the Bag-of-Words matrix is presented in two blocks, corresponding to tokens 1–21 and 22–41:

library(knitr)
library(kableExtra)

# --- Original matrix ---
B <- matrix(
  c(
    # Doc 1
    0,0,0,0,0,1,0,1,1,0, 0,0,0,1,0,1,1,1,1,1,
    1,1,1,0,0,0,1,1,1,0, 0,0,0,0,0,0,0,0,0,0,0,
    # Doc 2
    1,1,1,1,1,0,0,0,0,0, 0,0,1,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0, 0,0,1,1,1,1,1,1,1,1,1,
    # Doc 3
    0,0,0,0,0,1,1,0,0,1, 1,1,0,0,1,0,0,0,0,0,
    0,0,0,1,1,1,0,0,0,1, 1,1,0,0,0,0,0,0,0,0,0
  ),
  nrow = 3,
  byrow = TRUE
)

colnames(B) <- paste0("T", 1:41)   # or just 1:41 if preferred
rownames(B) <- paste0("Doc.", 1:3)

# --- Subtable 1: Tokens 1–21 ---
B_1_21 <- B[, 1:21]

kable(
  B_1_21,
  align = "c",
  caption = "(a) Bag-of-Words matrix (tokens T1 - T21)",
  format = "html",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()

# --- Subtable 2: Tokens 22–41 ---
B_22_41 <- B[, 22:41]

kable(
  B_22_41,
  align = "c",
  caption = "(b) Bag-of-Words matrix (tokens T22 - T41)",
  format = "html",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()
Table 6.2: (a) Bag-of-Words matrix (tokens T1 - T21)
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 T19 T20 T21
Doc.1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1
Doc.2 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Doc.3 0 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0
Table 6.2: (b) Bag-of-Words matrix (tokens T22 - T41)
T22 T23 T24 T25 T26 T27 T28 T29 T30 T31 T32 T33 T34 T35 T36 T37 T38 T39 T40 T41
Doc.1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
Doc.2 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
Doc.3 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0

A value of 1 at position (Doc. i, Token Tj) indicates that token j appears once in document i; a value of 0 indicates it does not appear. For example,

Token 8 → "data science" → value 1 in document 1
Token 8 → "data science" → value 0 in document 2
Token 8 → "data science" → value 0 in document 3

In this case, a value of 1 in the column associated with data science means that this bigram appears once in the corresponding document, whereas a value of 0 indicates that it does not appear.
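Rather than counting columns by hand, the position of a given n-gram can be looked up through the vocabulary_ attribute of the fitted vectorizer, which maps each learned token to its column index. The sketch below rebuilds the n-gram vectorizer so that it is self-contained; the names col and counts are illustrative. Note that Python indices start at 0, whereas the token IDs in the table above start at 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

vectorizer_ngram = CountVectorizer(ngram_range=(1, 3))
bow_ngram = vectorizer_ngram.fit_transform(documents)

# vocabulary_ maps each learned n-gram to its 0-based column index
col = vectorizer_ngram.vocabulary_["data science"]

# Counts of "data science" in documents 1, 2, and 3
counts = bow_ngram[:, col].toarray().ravel().tolist()

print(col + 1)   # 8  (token ID in the 1-based numbering of the table)
print(counts)    # [1, 0, 0]
```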

While these built-in features define the default behavior of the vectorizer, additional arguments allow further control over the vocabulary and document-term representation.

6.0.3 CountVectorizer: Controlling the vocabulary

Several arguments allow the vocabulary to be restricted or filtered:

  • max_features: limits vocabulary size.

  • min_df: removes rare terms.

  • max_df: removes overly frequent terms.

6.0.4 CountVectorizer: limiting vocabulary size with max_features

As the vocabulary grows, the dimensionality of document vectors increases accordingly. Very high-dimensional representations may reduce computational efficiency and harm generalization, a phenomenon commonly referred to as the curse of dimensionality.

To address this issue, CountVectorizer provides the max_features argument, which restricts the vocabulary to the most frequent tokens observed in the corpus.

Example.

In the following example, the vocabulary is limited to the five most frequent unigrams or bigrams in the corpus.

First, a CountVectorizer object is created with ngram_range = (1, 2) to extract both unigrams and bigrams. The argument max_features = 5 restricts the vocabulary to the five tokens with the highest total term frequency across the corpus. The method fit_transform() then learns this reduced vocabulary and constructs the corresponding Bag-of-Words matrix.

vectorizer_limited = CountVectorizer(
    ngram_range=(1, 2),
    max_features=5
)

bow_limited = vectorizer_limited.fit_transform(documents)

print(vectorizer_limited.get_feature_names_out()) # Output 1
print(bow_limited.toarray())                      # Output 2

First output.

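The first output displays the reduced vocabulary retained after applying the max_features constraint. In this small corpus only data occurs more than once; every other unigram and bigram occurs exactly once, so the remaining four positions are filled from tied candidates and should not be read as a meaningful ranking. The order shown here defines the column ordering of the Bag-of-Words matrix.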

## ['analysis' 'analysis uses' 'and' 'and matrices' 'data']

Second output.

The resulting vocabulary (Output 1) defines the columns of the Bag-of-Words matrix (Output 2), in the exact order shown above.

## [[0 0 0 0 1]
##  [1 1 1 1 0]
##  [0 0 0 0 1]]

Each row corresponds to a document and each column corresponds to one of the selected n-grams. The entries represent term frequencies. Formally, the matrix can be written as \[ \mathbf{B} = (b_{ij}), \qquad b_{ij} = \text{frequency of n-gram } j \text{ in document } i, \] where the columns correspond to \[ (\texttt{analysis},\ \texttt{analysis uses},\ \texttt{and},\ \texttt{and matrices},\ \texttt{data}). \]

That is, the Bag-of-Words matrix can be written explicitly as \[ \mathbf{B} = \begin{array}{c|ccccc} & \texttt{analysis} & \texttt{analysis uses} & \texttt{and} & \texttt{and matrices} & \texttt{data} \\ \hline \text{Document 1} & 0 & 0 & 0 & 0 & 1 \\ \text{Document 2} & 1 & 1 & 1 & 1 & 0 \\ \text{Document 3} & 0 & 0 & 0 & 0 & 1 \end{array} \]

The value \(b_{25} = 0\) indicates that the token data (column 5) does not appear in the second document, while \(b_{15} = 1\) indicates that it appears once in the first document.
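To see what a frequency-based cut-off has to work with in this corpus, the sketch below counts unigrams and bigrams across all three documents with collections.Counter. It is an approximation under stated assumptions: lowercasing followed by whitespace splitting stands in for the vectorizer's tokenizer (adequate here because the sentences contain no punctuation or one-letter words), and the helper ngrams and the name corpus_tf are illustrative. According to the scikit-learn documentation, max_features orders tokens by term frequency across the corpus, which is the quantity tallied here.

```python
from collections import Counter

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

# Total number of occurrences of each unigram and bigram in the corpus
corpus_tf = Counter()
for doc in documents:
    tokens = doc.lower().split()   # simplified stand-in for the vectorizer's tokenizer
    corpus_tf.update(ngrams(tokens, 1) + ngrams(tokens, 2))

n_singletons = sum(1 for count in corpus_tf.values() if count == 1)

print(corpus_tf.most_common(1))       # [('data', 2)]
print(len(corpus_tf), n_singletons)   # 30 29
```

Of the 30 candidate tokens, 29 occur exactly once, so with max_features = 5 only data is selected on the basis of frequency; the other four retained tokens are chosen among ties.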

6.0.5 CountVectorizer: filtering tokens with min_df and max_df thresholds

Not all tokens contribute equally to the representation. Some appear in almost every document (low discrimination), while others appear only once (often too specific or noisy).

CountVectorizer supports filtering tokens using document frequency:

  • min_df keeps only terms that appear in at least min_df documents (as a count or proportion).

  • max_df keeps only terms that appear in at most max_df documents (as a count or proportion).

A useful workflow is:

  1. fit a vectorizer,

  2. compute df for each token,

  3. inspect which tokens would survive a chosen min_df/max_df rule.

6.0.6 CountVectorizer: example 1 (inspecting which tokens survive a min_df rule)

In this example, we let the vectorizer learn the full vocabulary with min_df = 1, and then compute each token’s document frequency (df) manually. Based on this information, we mark which tokens would be retained if a stricter rule such as min_df = 2 were applied.

This approach makes the effect of min_df explicit and easy to interpret.

First, a CountVectorizer object is created with ngram_range = (1, 3) to extract unigrams, bigrams, and trigrams. The argument min_df = 1 ensures that no tokens are filtered at this stage, allowing the full vocabulary to be inspected. The method fit_transform() then learns the vocabulary and constructs the corresponding Bag-of-Words matrix.

Specifically:

  • fit_transform(documents) scans the corpus, builds the vocabulary of observed tokens, and returns the document–term matrix in sparse format.

  • get_feature_names_out() extracts the list of tokens learned by the vectorizer, in the order in which they appear as columns in the matrix.

  • toarray() converts the sparse Bag-of-Words matrix into a dense numerical array, which facilitates inspection and manual computation of document frequencies.

Interpretation of the matrix structure.

The resulting Bag-of-Words matrix has:

  • rows corresponding to documents, and

  • columns corresponding to vocabulary tokens (unigrams, bigrams, or trigrams).

Each entry represents the number of times a given token appears in a given document.

Mathematical notation.

Let \(D\) denote the number of documents and \(V\) the size of the vocabulary. The Bag-of-Words representation can be written as a matrix

\[\mathbf{B} = (b_{ij}) \in \mathbb{N}^{D \times V},\]

where

\[b_{ij} = \text{number of occurrences of token } j \text{ in document } i.\]

This representation provides the basis for computing document frequencies and for applying frequency-based filtering rules such as min_df and max_df.
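The document frequency itself can be written in the same notation. The following is a sketch consistent with the definitions above; the symbols \(\mathrm{df}_j\), \(m_{\min}\), and \(m_{\max}\) and the indicator \(\mathbf{1}[\cdot]\) are introduced here for illustration only. The document frequency of token \(j\) counts the documents in which the token occurs at least once:

\[\mathrm{df}_j = \sum_{i=1}^{D} \mathbf{1}[\,b_{ij} > 0\,], \qquad j = 1, \dots, V.\]

If \(m_{\min}\) and \(m_{\max}\) denote the values of min_df and max_df given as integer counts, token \(j\) is retained when

\[m_{\min} \le \mathrm{df}_j \le m_{\max}.\]

When a threshold is given as a proportion, it is compared with \(\mathrm{df}_j / D\) instead. For the token data in the running example, \(b_{1j} = 1\), \(b_{2j} = 0\), and \(b_{3j} = 1\), so \(\mathrm{df}_j = 1 + 0 + 1 = 2\).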

Application.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer (same construction as before)
vectorizer_limited = CountVectorizer(
    ngram_range=(1, 3),
    min_df=1  # important: no filtering here; df is computed manually below
)

bow = vectorizer_limited.fit_transform(documents)

# 1) Learned tokens
tokens = vectorizer_limited.get_feature_names_out()

# 2) Bag-of-Words matrix
B = bow.toarray()

# 3) Document frequency (df)
df = (B > 0).sum(axis=0)

# 4) Number of words per token
n_words = np.array([len(t.split()) for t in tokens])

# 5) Build the final table
table_df = pd.DataFrame({
    "Token ID": np.arange(1, len(tokens) + 1),
    "Token (as learned)": tokens,
    "Unigram": (n_words == 1).astype(int),
    "Bigram":  (n_words == 2).astype(int),
    "Trigram": (n_words == 3).astype(int),
    "df": df,
    "Kept (min_df = 2)": np.where(df >= 2, "✓", "X")
})

# 6) Replace 1/0 with ✓ / empty string (more readable)
for col in ["Unigram", "Bigram", "Trigram"]:
    table_df[col] = table_df[col].replace({1: "✓", 0: ""})

table_df
##     Token ID                    Token (as learned)  ... df Kept (min_df = 2)
## 0          1                              analysis  ...  1                 X
## 1          2                         analysis uses  ...  1                 X
## 2          3                 analysis uses vectors  ...  1                 X
## 3          4                                   and  ...  1                 X
## 4          5                          and matrices  ...  1                 X
## 5          6                                  data  ...  2                 ✓
## 6          7                         data modeling  ...  1                 X
## 7          8                          data science  ...  1                 X
## 8          9                   data science relies  ...  1                 X
## 9         10                          mathematical  ...  1                 X
## 10        11          mathematical representations  ...  1                 X
## 11        12  mathematical representations support  ...  1                 X
## 12        13                              matrices  ...  1                 X
## 13        14                               methods  ...  1                 X
## 14        15                              modeling  ...  1                 X
## 15        16                             numerical  ...  1                 X
## 16        17                     numerical methods  ...  1                 X
## 17        18                                    on  ...  1                 X
## 18        19                          on numerical  ...  1                 X
## 19        20                  on numerical methods  ...  1                 X
## 20        21                                relies  ...  1                 X
## 21        22                             relies on  ...  1                 X
## 22        23                   relies on numerical  ...  1                 X
## 23        24                       representations  ...  1                 X
## 24        25               representations support  ...  1                 X
## 25        26          representations support data  ...  1                 X
## 26        27                               science  ...  1                 X
## 27        28                        science relies  ...  1                 X
## 28        29                     science relies on  ...  1                 X
## 29        30                               support  ...  1                 X
## 30        31                          support data  ...  1                 X
## 31        32                 support data modeling  ...  1                 X
## 32        33                                  text  ...  1                 X
## 33        34                         text analysis  ...  1                 X
## 34        35                    text analysis uses  ...  1                 X
## 35        36                                  uses  ...  1                 X
## 36        37                          uses vectors  ...  1                 X
## 37        38                      uses vectors and  ...  1                 X
## 38        39                               vectors  ...  1                 X
## 39        40                           vectors and  ...  1                 X
## 40        41                  vectors and matrices  ...  1                 X
## 
## [41 rows x 7 columns]

Interpreting the output.

The resulting table contains 41 tokens, including unigrams, bigrams, and trigrams (up to length 3).
The most relevant columns are:

  • df: the number of documents in which a token appears at least once.

  • Kept (min_df = 2): indicates whether the token would be retained if we required it to appear in at least two documents.

From the output, only one token is retained:

  • data has df = 2, meaning it appears in two documents and therefore satisfies the condition min_df = 2 (✓).

All remaining tokens have:

  • df = 1, indicating that they appear in only one document. As a result, they would be removed (X) under the min_df = 2 rule.

Why does this happen?

This behavior is a direct consequence of the very small corpus size (only three documents). Most multi-word expressions (such as text analysis uses or vectors and matrices) occur in only one document.

By setting min_df = 2, we are effectively enforcing the rule:

Keep only the terms that appear across multiple documents.

When the corpus contains only three documents, this becomes a very strict filtering criterion, causing nearly all tokens to be discarded.

A more direct approach: applying min_df inside the vectorizer (optional).

In the previous example, token filtering was illustrated by manually computing document frequencies and marking which tokens would be retained under a given min_df rule.

An alternative (and more typical) approach is to apply the frequency constraint directly inside the vectorizer. In this case, tokens that do not satisfy the condition are never included in the learned vocabulary.

vectorizer_df = CountVectorizer(
    ngram_range=(1, 3),
    min_df=2
)

bow_df = vectorizer_df.fit_transform(documents)
print(vectorizer_df.get_feature_names_out())
## ['data']

Because the corpus is very small, only the token data appears in at least two documents and is therefore retained. All other unigrams, bigrams, and trigrams are discarded automatically during vocabulary construction.

This confirms the behavior observed earlier using the manual inspection table.

6.0.7 CountVectorizer: example 2 (joint filtering with min_df and max_df)

The previous example applied a lower bound on document frequency using min_df. We now extend this idea by introducing an upper bound through the parameter max_df.

In this example, only unigrams and bigrams that:

  • appear in at least two documents, and

  • appear in no more than 80% of the corpus

are retained.

First, a CountVectorizer object is created with ngram_range = (1, 2) to extract unigrams and bigrams only. The arguments min_df = 2 and max_df = 0.8 jointly filter tokens based on document frequency. As before, the method fit_transform() learns the filtered vocabulary and constructs the corresponding Bag-of-Words matrix.

vectorizer_df = CountVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.8
)

bow_df = vectorizer_df.fit_transform(documents)

print(vectorizer_df.get_feature_names_out()) # Output 1
print(bow_df.toarray())                      # Output 2

First output.

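The first output lists the tokens retained in the filtered vocabulary, that is, those satisfying both the min_df and max_df constraints. In this case, only the token data qualifies: it appears in two of the three documents, which meets min_df = 2 and, at roughly 67% of the corpus, stays below the max_df = 0.8 ceiling.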

## ['data']

Second output.

The second output displays the resulting Bag-of-Words matrix, whose rows represent documents and whose columns correspond to the retained tokens, with entries giving token counts. Since only one token is retained, the matrix has a single column: data occurs once in the first and third documents and not at all in the second.

## [[1]
##  [0]
##  [1]]

Together, min_df and max_df provide a simple yet powerful mechanism to control which tokens enter the representation. They are especially useful for reducing noise and dimensionality in high-dimensional text data, while preserving terms that carry cross-document relevance.
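The joint filter can also be reproduced by hand, which makes the role of each bound visible. The sketch below uses only the standard library; the helper name ngrams and the simplified tokenizer (lowercase words of two or more characters) are assumptions made for illustration, and the corpus is the three-document collection that the outputs in these examples are based on.

```python
import re

# Three-document corpus behind the outputs shown in these examples.
documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

def ngrams(text, n_min=1, n_max=2):
    """Lowercase, keep words of 2+ characters, and return all n-grams."""
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

doc_tokens = [ngrams(doc) for doc in documents]
n_docs = len(documents)

# Document frequency: number of documents containing each token at least once.
doc_freq = {}
for tokens in doc_tokens:
    for tok in set(tokens):
        doc_freq[tok] = doc_freq.get(tok, 0) + 1

min_df, max_df = 2, 0.8  # integer = absolute count, float = proportion of documents
kept = sorted(
    tok for tok, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)
counts = [[tokens.count(tok) for tok in kept] for tokens in doc_tokens]

print(kept)    # ['data']
print(counts)  # [[1], [0], [1]]
```

With three documents, max_df = 0.8 corresponds to at most 2.4 documents, so a token with df = 2 passes while a token present in all three documents would be removed.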

6.0.8 Limitations of the Bag-of-Words representation

Despite its simplicity and interpretability, the Bag-of-Words model has important limitations. See Figure 6.2.

Figure 6.2: Limitations of the Bag-of-Words representation. Source: Created by the author with ChatGPT (OpenAI)

Limitations.

  • First, it relies exclusively on token counts, ignoring word order and syntactic structure. As a result, sentences with very different meanings may receive similar representations.

  • Second, BoW does not capture semantic relationships. Words with related meanings are treated as entirely independent dimensions.

  • Third, large vocabularies can lead to extremely high-dimensional vectors, which may degrade performance and increase computational cost.
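The first two limitations can be seen in a few lines of code. This is a minimal sketch using only the standard library; the helper bag_of_words and the example sentences are hypothetical and chosen purely for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Map a sentence to its token counts, discarding word order."""
    return Counter(text.lower().split())

# Word order is lost: two sentences with different meanings.
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")
print(a == b)  # True: identical Bag-of-Words representations

# Related words occupy unrelated dimensions: no shared keys at all.
c = bag_of_words("big car")
d = bag_of_words("large automobile")
print(set(c) & set(d))  # set()
```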

These limitations motivate more refined representations that adjust token importance and incorporate contextual information. One such approach (TF-IDF weighting) is introduced in the next section.

7 TF-IDF vectors

In the previous section, documents were represented using raw word counts through the Bag-of-Words model. While this approach is intuitive, it treats all tokens equally and relies solely on their frequency within each document.

As a result, terms that appear very often across the corpus may dominate the representation, while less frequent but potentially informative terms receive little weight or are discarded altogether. This can lead to a loss of relevant patterns, especially when rare terms are crucial for distinguishing documents.

The Term Frequency-Inverse Document Frequency (TF-IDF) scheme addresses this limitation by re-weighting tokens according to both their local importance within a document and their global distribution across the corpus.

TF-IDF is widely used in information retrieval, search engines, and text mining applications. Like BoW, it is a frequency-based representation, but it incorporates an additional normalization mechanism that balances common and rare terms.

7.0.1 Term Frequency (TF)

The term frequency component measures how often a word appears in a specific document. However, since documents may vary in length, raw counts are typically normalized. A common normalized definition of term frequency is:

\[TF(w) = \frac{\text{Number of times the word w occurs in a document}}{\text{Total number of words in the document}}\]

This normalization prevents longer documents from automatically assigning higher importance to all their terms.

To build intuition, the next figure shows normalized TF for a few example tokens inside a single document (so longer documents do not automatically inflate importance).

import numpy as np
import matplotlib.pyplot as plt

# --- Simulated document term counts (Document d1) ---
terms = ["data", "analysis", "model", "the", "and", "python"]
counts_d1 = np.array([6, 3, 2, 10, 8, 1])  # raw counts in document d1
tf_d1 = counts_d1 / counts_d1.sum()        # normalized TF

plt.figure()
plt.bar(terms, tf_d1)
plt.title("Term Frequency (TF) in a single document")
plt.xlabel("Token")
plt.ylabel("TF (normalized frequency)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()

# Same figure in R (ggplot2)
library(ggplot2)

tf_df <- data.frame(
  token = c("data", "analysis", "model", "the", "and", "python"),
  count = c(6, 3, 2, 10, 8, 1)
)

tf_df$TF <- tf_df$count / sum(tf_df$count)

ggplot(tf_df, aes(x = token, y = TF)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Term Frequency (TF) in a single document",
    x = "Token",
    y = "TF (normalized frequency)"
  ) +
  theme_minimal()

TF measures local importance: tokens that occur more often within the document receive larger TF values, but normalization keeps TF comparable across documents of different lengths.

7.0.2 Inverse Document Frequency (IDF)

While TF captures local relevance, it does not account for how informative a word is across the entire corpus. Words that appear in almost every document (such as general or domain-wide terms) may not be useful for discrimination.

The inverse document frequency component down-weights such ubiquitous terms and amplifies words that occur in fewer documents:

\[ IDF(w) = \log\left(\frac{N}{df(w)}\right) \]

where:

  • \(N\) is the total number of documents, and

  • \(df(w)\) is the number of documents containing word \(w\).

TF alone does not capture how informative a token is across the corpus. The next figure shows how IDF decreases as a token appears in more documents.

import numpy as np
import matplotlib.pyplot as plt

# --- Simulated corpus size and document frequencies ---
N = 10  # total documents
df = np.arange(1, N+1)                 # df(w) = 1..N
idf = np.log(N / df)                   # classic IDF definition (as defined above)

plt.figure()
plt.plot(df, idf, marker="o")
plt.title("Inverse Document Frequency (IDF) vs. document frequency")
plt.xlabel("Document frequency  df(w)")
plt.ylabel("IDF(w) = log(N / df(w))")
plt.xticks(df)
plt.tight_layout()
plt.show()

# Same figure in R (ggplot2)
idf_df <- data.frame(df = 1:10)

N <- 10
idf_df$IDF <- log(N / idf_df$df)

ggplot(idf_df, aes(x = df, y = IDF)) +
  geom_line(color = "steelblue", size=1) +
  geom_point(color = "steelblue", size=2.5) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title = "Inverse Document Frequency (IDF)",
    x = "Document frequency df(w)",
    y = "IDF(w) = log(N / df(w))"
  ) +
  theme_minimal()

Tokens that occur in many documents (high df) have low IDF, because they help less to distinguish documents. Tokens that occur in few documents have higher IDF.

7.0.3 TF-IDF weighting

The final TF-IDF weight of a word \(w\) in document \(d\) is obtained by combining two components:

\[ \text{weight}(w,d) = TF(w,d) \times IDF(w) \]

This formulation assigns higher weights to terms that are frequent within a document but relatively rare across the corpus.

A term's TF-IDF weight can differ across documents even when its raw count is the same in each of them. With the length-normalized TF defined above, the same count yields a smaller TF in a longer document. Note that under the classic IDF definition, a term that appears in every document has \(IDF(w) = \log(1) = 0\), so its weight is zero everywhere. By default, scikit-learn's TfidfVectorizer avoids this by using a smoothed IDF, \(\ln\left(\frac{1 + N}{1 + df(w)}\right) + 1\), which keeps such terms at a small positive weight.

In addition, TfidfVectorizer normalizes each document vector by default using the \(L_2\) norm. Each document vector is rescaled to have unit length, which further modifies the final weights. As a result, a term with the same raw count in two documents can still receive different TF-IDF weights, because the rescaling depends on the other terms in each document.

The next plot illustrates the combined effect: TF–IDF becomes large when a token is frequent in a document (high TF) and rare in the corpus (high IDF).

import numpy as np
import matplotlib.pyplot as plt

# --- Simulated TF (from a document) and IDF (from the corpus) for several tokens ---
tokens = ["data", "analysis", "model", "the", "and"]
tf = np.array([0.18, 0.12, 0.08, 0.30, 0.20])  # local frequencies (normalized)
idf = np.array([1.0, 1.4, 1.8, 0.1, 0.2])      # global rarity (higher = rarer)
tfidf = tf * idf

plt.figure()
plt.bar(tokens, tfidf)
plt.title("TF-IDF weights in a document (simulated)")
plt.xlabel("Token")
plt.ylabel("TF-IDF = TF × IDF")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()

# Same figure in R (ggplot2)
tfidf_df <- data.frame(
  token = c("data", "analysis", "model", "the", "and"),
  TF = c(0.18, 0.12, 0.08, 0.30, 0.20),
  IDF = c(1.0, 1.4, 1.8, 0.1, 0.2)
)

tfidf_df$TFIDF <- tfidf_df$TF * tfidf_df$IDF

ggplot(tfidf_df, aes(x = token, y = TFIDF)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "TF-IDF weights in a document (simulated)",
    x = "Token",
    y = "TF-IDF = TF × IDF"
  ) +
  theme_minimal()

A token can have a high TF but still receive a small TF–IDF weight if its IDF is low (e.g., very common words). TF–IDF emphasizes tokens that are both locally frequent and globally informative.
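The weighting formula can also be evaluated directly on a small tokenized corpus. The sketch below follows the classic definitions given in this section (length-normalized TF and \(\log(N / df(w))\) for IDF); the corpus and the helper names tf, idf, and tfidf are hypothetical and serve only as an illustration.

```python
import math

# Hypothetical tokenized corpus, chosen only for illustration.
corpus = [
    ["data", "model", "data", "analysis"],
    ["data", "science"],
    ["data", "model", "python", "python", "code", "tests"],
]
n_docs = len(corpus)

def tf(word, doc):
    """Length-normalized term frequency of word in one document."""
    return doc.count(word) / len(doc)

def idf(word):
    """Classic IDF, log(N / df(w)); assumes the word occurs in the corpus."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(n_docs / df)

def tfidf(word, doc):
    """TF-IDF weight of word in doc."""
    return tf(word, doc) * idf(word)

for i, doc in enumerate(corpus, start=1):
    weights = {w: round(tfidf(w, doc), 3) for w in sorted(set(doc))}
    print(f"d{i}: {weights}")
```

In this toy corpus, "data" occurs in every document and therefore receives weight zero everywhere, while "model" occurs once in both the first and third documents but gets a smaller weight in the longer third document.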

7.0.4 How TF, IDF, and TF-IDF relate (combined view)

Finally, the figure below visualizes TF and IDF jointly; TF-IDF is shown by the point size (larger = higher TF-IDF).

import numpy as np
import matplotlib.pyplot as plt

tokens = np.array(["data", "analysis", "model", "the", "and", "python", "science"])
tf = np.array([0.18, 0.12, 0.08, 0.30, 0.20, 0.05, 0.07])
idf = np.array([1.0, 1.4, 1.8, 0.1, 0.2, 2.0, 1.6])
tfidf = tf * idf

plt.figure()
plt.scatter(tf, idf, s=2500*tfidf)  # point size proportional to TF–IDF
for x, y, t in zip(tf, idf, tokens):
    plt.text(x, y, f"  {t}", va="center")

plt.title("TF–IDF as an interaction of TF and IDF (size = TF–IDF)")
plt.xlabel("TF (within-document frequency)")
plt.ylabel("IDF (corpus rarity)")
plt.tight_layout()
plt.show()

# Same figure in R (ggplot2)
rel_df <- data.frame(
  token = c("data", "analysis", "model", "the", "and", "python", "science"),
  TF = c(0.18, 0.12, 0.08, 0.30, 0.20, 0.05, 0.07),
  IDF = c(1.0, 1.4, 1.8, 0.1, 0.2, 2.0, 1.6)
)

rel_df$TFIDF <- rel_df$TF * rel_df$IDF

ggplot(rel_df, aes(x = TF, y = IDF, size = TFIDF)) +
  geom_point(color = "steelblue", alpha = 0.7) +
  geom_text(aes(label = token), hjust = -0.1, vjust = 0.5) +
  labs(
    title = "TF–IDF as an interaction of TF and IDF",
    x = "TF (within-document frequency)",
    y = "IDF (corpus rarity)",
    size = "TF–IDF"
  ) +
  theme_minimal()

The largest points appear where TF and IDF are simultaneously high. This makes TF–IDF easy to interpret as an interaction: a token is most important when it is frequent in the document but uncommon in the corpus.

7.0.5 Building a basic TF–IDF vectorizer

In practice, TF–IDF representations are computed efficiently using the TfidfVectorizer class from scikit-learn, which combines term counting, inverse document frequency weighting, and vector normalization in a single step.

Example.

To keep the example simple and self-contained, consider the following small collection of documents.

First, a TfidfVectorizer object is created using the default settings, which include \(L_2\) normalization. The method fit_transform() learns the vocabulary from the corpus and computes the TF–IDF matrix simultaneously, producing a numerical representation of the documents.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Statistical models rely on numerical features",
    "Text representations are built using vectors",
    "Feature weighting improves document comparison"
]

vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(documents)

The learned vocabulary and the resulting TF–IDF matrix can be inspected as follows:

print(vectorizer.get_feature_names_out())    # Output 1
print(tf_idf_matrix.toarray())               # Output 2
print("Matrix shape:", tf_idf_matrix.shape)  # Output 3

The three outputs correspond, respectively, to the learned vocabulary, the TF–IDF matrix expressed in dense form for inspection, and the dimensions of the resulting representation.

First output.

The first output displays the learned vocabulary, that is, the set of unique terms extracted from the corpus after preprocessing. Each element in this array corresponds to a column of the TF–IDF matrix, and the order shown here defines the column ordering used in the matrix representation.

## ['are' 'built' 'comparison' 'document' 'feature' 'features' 'improves'
##  'models' 'numerical' 'on' 'rely' 'representations' 'statistical' 'text'
##  'using' 'vectors' 'weighting']

Second output.

The second output shows the TF–IDF matrix itself, expressed in dense form for inspection. Each row corresponds to a document, each column corresponds to a term in the learned vocabulary, and each entry represents the TF–IDF weight assigned to that term in the corresponding document.

## [[0.         0.         0.         0.         0.         0.40824829
##   0.         0.40824829 0.40824829 0.40824829 0.40824829 0.
##   0.40824829 0.         0.         0.         0.        ]
##  [0.40824829 0.40824829 0.         0.         0.         0.
##   0.         0.         0.         0.         0.         0.40824829
##   0.         0.40824829 0.40824829 0.40824829 0.        ]
##  [0.         0.         0.4472136  0.4472136  0.4472136  0.
##   0.4472136  0.         0.         0.         0.         0.
##   0.         0.         0.         0.         0.4472136 ]]

Third output.

The final output reports the dimensions of the matrix. In this example, the matrix has three rows (one per document) and seventeen columns (one per vocabulary term), confirming the correspondence between the corpus size and the learned vocabulary.

## Matrix shape: (3, 17)
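As a cross-check, the values above can be reproduced by hand. The sketch below assumes the scikit-learn defaults as documented (raw term counts, smoothed IDF \(\ln\frac{1+N}{1+df} + 1\), and \(L_2\) normalization) and uses a simplified tokenizer; the helper names smoothed_idf and tfidf_row are hypothetical.

```python
import math
import re

documents = [
    "Statistical models rely on numerical features",
    "Text representations are built using vectors",
    "Feature weighting improves document comparison",
]

# Simplified tokenizer: lowercase words of two or more characters.
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in documents]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalized."""
    raw = [doc.count(w) * smoothed_idf(w) for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in docs]
print(len(matrix), len(vocab))                        # 3 17
print(round(matrix[0][vocab.index("features")], 8))   # 0.40824829
print(round(matrix[2][vocab.index("feature")], 8))    # 0.4472136
```

Because every term occurs in exactly one document, all IDF values are equal and cancel under normalization: a document with six terms gets weights of \(1/\sqrt{6}\), and one with five terms gets \(1/\sqrt{5}\).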

The vocabulary is the same one that CountVectorizer would learn with the same settings, but the entries now represent TF-IDF weights rather than raw frequencies.

7.0.6 Normalization of TF–IDF vectors

Normalization ensures that document vectors are comparable in magnitude, which is particularly important for similarity measures.

\(L_2\) norm.

By default, each TF-IDF document vector \(\mathbf{x} = (x_1, x_2, \dots, x_d)\) is normalized to have unit length using the \(L_2\) norm, defined as \[\|\mathbf{x}\|_2 \; =\; \sqrt{\sum_{j=1}^{d} x_j^2}\]

Under this normalization, the vector is rescaled so that \(\|\mathbf{x}\|_2 = 1\), emphasizing relative term contributions rather than document length.

\(L_1\) norm.

Alternatively, the \(L_1\) norm can be used, which is defined as

\[\|\mathbf{x}\|_1 \; =\; \sum_{j=1}^{d} |x_j|\]

In this case, the vector is rescaled so that \(\|\mathbf{x}\|_1 = 1\), allowing the TF-IDF weights to be interpreted as relative proportions within each document.

Example.

The following example illustrates TF-IDF computation using \(L_1\) normalization.

First, a TfidfVectorizer object is created with the argument norm="l1", which specifies that each document vector will be normalized so that the sum of the absolute TF–IDF weights equals one. The method fit_transform() then learns the vocabulary from the corpus and computes the corresponding TF–IDF matrix in a single step.

The three outputs display, respectively, the learned vocabulary, the TF–IDF matrix with \(L_1\) normalization applied, and the dimensions of the resulting representation.

vectorizer_l1 = TfidfVectorizer(norm="l1")
tfidf_l1 = vectorizer_l1.fit_transform(documents)

print(vectorizer_l1.get_feature_names_out()) # Output 1
print(tfidf_l1.toarray())                    # Output 2
print("Matrix shape:", tfidf_l1.shape)       # Output 3

First output.

The first output displays the learned vocabulary. As before, each term corresponds to a column of the TF–IDF matrix, and the order shown here defines the column ordering used in the matrix representation. The vocabulary itself is unchanged by the choice of normalization.

## ['are' 'built' 'comparison' 'document' 'feature' 'features' 'improves'
##  'models' 'numerical' 'on' 'rely' 'representations' 'statistical' 'text'
##  'using' 'vectors' 'weighting']

Second output.

The second output shows the TF–IDF matrix with \(L_1\) normalization applied. Each row corresponds to a document and each column to a term in the vocabulary. Under \(L_1\) normalization, the values in each row sum to one, so the entries can be interpreted as relative weights of terms within the document. For example, in the first document, the nonzero entries are all equal, indicating that the retained terms contribute equally to the total TF–IDF weight of that document.

## [[0.         0.         0.         0.         0.         0.16666667
##   0.         0.16666667 0.16666667 0.16666667 0.16666667 0.
##   0.16666667 0.         0.         0.         0.        ]
##  [0.16666667 0.16666667 0.         0.         0.         0.
##   0.         0.         0.         0.         0.         0.16666667
##   0.         0.16666667 0.16666667 0.16666667 0.        ]
##  [0.         0.         0.2        0.2        0.2        0.
##   0.2        0.         0.         0.         0.         0.
##   0.         0.         0.         0.         0.2       ]]

Third output.

The final output reports the dimensions of the matrix. In this example, the matrix has three rows (one per document) and seventeen columns (one per vocabulary term), confirming that normalization affects the scale of the weights, but not the structure of the representation.

## Matrix shape: (3, 17)
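The difference between the two normalizations can be checked on a single vector. The following is a minimal sketch with a hypothetical helper named normalize; the raw weight 1.7 is arbitrary, since only the proportions between entries matter.

```python
import math

def normalize(vec, norm="l2"):
    """Rescale vec to unit L1 or L2 norm; a zero vector is returned unchanged."""
    if norm == "l1":
        size = sum(abs(x) for x in vec)
    elif norm == "l2":
        size = math.sqrt(sum(x * x for x in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return list(vec) if size == 0 else [x / size for x in vec]

# Unnormalized weights for a document with five equally weighted terms.
raw = [1.7, 1.7, 1.7, 1.7, 1.7]

l1 = normalize(raw, "l1")
l2 = normalize(raw, "l2")
print([round(x, 4) for x in l1])  # [0.2, 0.2, 0.2, 0.2, 0.2]
print([round(x, 4) for x in l2])  # [0.4472, 0.4472, 0.4472, 0.4472, 0.4472]
```

These match the nonzero entries of the third document under each normalization: the \(L_1\) weights sum to one, while the squared \(L_2\) weights sum to one.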

7.0.7 N-grams and vocabulary size in TF–IDF

As with Bag-of-Words representations, the TF–IDF vectorizer supports the use of n-grams as well as constraints on vocabulary size. This allows short phrases to be incorporated into the representation while keeping dimensionality under control.

Example.

In the following example, the vocabulary is restricted to six features drawn from unigrams, bigrams, and trigrams. The argument ngram_range = (1, 3) enables the extraction of n-grams up to length three, while max_features = 6 keeps only the six features with the highest term frequency across the corpus. The default \(L_2\) normalization is applied.

vectorizer_ngram = TfidfVectorizer(
    ngram_range=(1, 3),
    max_features=6,
    norm="l2"
)

tfidf_ngram = vectorizer_ngram.fit_transform(documents)

print(vectorizer_ngram.get_feature_names_out())
print(tfidf_ngram.toarray())
print("Matrix shape:", tfidf_ngram.shape)

First output.

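The first output displays the learned n-gram vocabulary, restricted to six features. In this case, all retained features are unigrams, bigrams, and trigrams derived from the phrase "are built using vectors" in the second document. Because every n-gram in this small corpus occurs exactly once, the frequency ranking is tied, so the six features shown are no more informative than those that were dropped. Each element in this list defines a column of the TF-IDF matrix, and the order shown here determines the column ordering.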

## ['are' 'are built' 'are built using' 'built' 'built using'
##  'built using vectors']

Second output.

The second output shows the TF–IDF matrix constructed using the restricted n-gram vocabulary. Each row corresponds to a document and each column corresponds to one of the selected n-grams.

## [[0.         0.         0.         0.         0.         0.        ]
##  [0.40824829 0.40824829 0.40824829 0.40824829 0.40824829 0.40824829]
##  [0.         0.         0.         0.         0.         0.        ]]

In this example, only the second document contains the retained n-grams, which explains why its row has nonzero TF–IDF values, while the first and third documents are represented by zero vectors.

Because \(L_2\) normalization is applied, the nonzero row has unit Euclidean norm, and the TF-IDF weights are evenly distributed across the six retained features.

Third output.

The final output reports the dimensions of the matrix. Here, the matrix has three rows (one per document) and six columns (one per retained n-gram), confirming that max_features directly controls the dimensionality of the TF-IDF representation.

## Matrix shape: (3, 6)

The parameters min_df and max_df are also available for TF-IDF vectorizers and behave identically to those in CountVectorizer, allowing extremely rare or overly common terms to be excluded based on document frequency.
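To see how much max_features prunes in this example, the candidate n-grams can be enumerated by hand. The sketch below is standard-library only; the helper extract_ngrams and the simplified tokenizer are assumptions, and breaking frequency ties alphabetically is an assumption that happens to reproduce the vocabulary shown above, not a statement about how the library resolves ties.

```python
import re
from collections import Counter

documents = [
    "Statistical models rely on numerical features",
    "Text representations are built using vectors",
    "Feature weighting improves document comparison",
]

def extract_ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max from a lowercased text."""
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

# Corpus-wide frequency of every candidate feature.
freq = Counter(g for doc in documents for g in extract_ngrams(doc))
print(len(freq))            # 42 candidate features
print(set(freq.values()))   # {1}: every n-gram occurs exactly once

# Keep the 6 most frequent features; ties broken alphabetically (assumption).
kept = sorted(sorted(freq, key=lambda g: (-freq[g], g))[:6])
print(kept)
```

Of the 42 candidate features, 36 are discarded, which is why two of the three documents end up with all-zero rows.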

7.0.8 Limitations of the TF-IDF representation

TF–IDF improves upon raw word counts by adjusting token importance using corpus-level statistics. It remains computationally efficient and highly interpretable.

However, TF–IDF still operates purely at the lexical level and therefore does not capture:

  • Semantic similarity between words,

  • Contextual meaning,

  • Word order or co-occurrence structure, or

  • Positional information within documents.

Figure 7.1 summarizes these four limitations with simple examples.

Figure 7.1: Limitations of the TF-IDF representation. Source: Created by the author with ChatGPT (OpenAI)

Like BoW, TF–IDF representations also scale with vocabulary size, which can become problematic for very large corpora.

These limitations motivate the use of similarity measures (such as cosine similarity) and more expressive representation learning techniques, which are explored in the following sections.

8 Distance/similarity calculation between document vectors

Once documents have been represented as vectors, a natural question arises:

How can we quantify how similar or dissimilar two text documents are?

If two documents use similar words with comparable distributions, it is reasonable to expect that they convey related information. In this section, we introduce cosine similarity, a geometric measure widely used to compare document vectors derived from Bag-of-Words and TF-IDF representations.

8.0.1 Cosine similarity

Cosine similarity measures the orientation of two vectors in a vector space by computing the cosine of the angle between them. Unlike distance-based measures, it is insensitive to vector magnitude and instead focuses on direction.

Two vectors are considered similar when they point in nearly the same direction, even if their lengths differ. This property is especially useful in text analysis, where vector magnitude is often influenced by document length.

For two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^d\), cosine similarity is defined as:

\[\cos(\mathbf{u}, \mathbf{v}) \quad =\quad \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \quad = \quad \frac{\sum_{i=1}^{d} u_{i} \, v_{i}} {\sqrt{\sum_{i=1}^{d} u_{i}^2} \; \sqrt{\sum_{i=1}^{d} v_{i}^2}} \quad \in \quad [-1, 1]\]

Here, \(\mathbf{u} \cdot \mathbf{v}\) denotes the Euclidean inner product, and the \(L_2\) norm (Euclidean norm) of a vector \(\mathbf{v} \in \mathbb{R}^d\) is defined as:

\[\|\mathbf{v}\|_2 \quad =\quad \sqrt{\sum_{i=1}^{d} v_i^2}.\]

Cosine similarity measures angular similarity, not Euclidean distance. It evaluates the angle between vectors rather than their magnitude. It takes values in the continuous interval \([-1,1]\). The extreme cases correspond to:

  • Identical direction (maximum similarity): \(1\)

  • Orthogonal vectors (no linear association): \(0\)

  • Opposite direction: \(-1\)
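The three extreme cases can be verified numerically. This is a minimal sketch with a hypothetical helper named cosine, written with the standard library only; the vectors are arbitrary examples.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1, 2], [3, 6]), 6))    # 1.0  (same direction, different length)
print(round(cosine([1, 0], [0, 5]), 6))    # 0.0  (orthogonal)
print(round(cosine([1, 2], [-1, -2]), 6))  # -1.0 (opposite direction)
```

The first case shows the insensitivity to magnitude: the second vector is three times as long as the first, yet the similarity is still 1.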

Intermediate values (e.g., 0.82, 0.34, −0.15) reflect varying angular proximity between vectors. For Bag-of-Words and TF-IDF vectors, all entries are non-negative, so cosine similarity always lies in \([0, 1]\). In embedding spaces trained on natural language data, cosine values are typically non-negative, since semantically unrelated words rarely exhibit strong opposite orientations.

8.0.2 Cosine similarity: Solving step by step

Consider two documents represented by count-based vectors:

\[\mathbf{u} = (4, 1, 2, 0, 3, 0, 1, 0) \quad \text{and} \quad \mathbf{v} = (2, 0, 1, 1, 2, 1, 0, 0)\]

The cosine similarity between them is:

\[\cos(\mathbf{u}, \mathbf{v}) \quad = \quad \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \; \|\mathbf{v}\|_2}\]

First, compute the dot product:

\[\mathbf{u} \cdot \mathbf{v} \quad = \quad 4\cdot2 + 1\cdot0 + 2\cdot1 + 0\cdot1 + 3\cdot2 + 0\cdot1 + 1\cdot0 + 0\cdot0 \quad = \quad 16\]

Next, compute the vector norms:

\[\|\mathbf{u}\|_2 \quad =\quad \sqrt{4^2 + 1^2 + 2^2 + 3^2 + 1^2} \quad = \quad \sqrt{31} \quad \approx \quad 5.57\]

\[\|\mathbf{v}\|_2 \quad = \quad \sqrt{2^2 + 1^2 + 1^2 + 2^2 + 1^2} \quad = \quad \sqrt{11} \quad \approx \quad 3.32\]

Finally:

\[\cos(\mathbf{u}, \mathbf{v}) \quad = \quad \frac{16}{(5.57)(3.32)} \quad \approx \quad 0.87\]

A cosine similarity of \(0.87\) indicates a strong similarity between the two documents.
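The same steps can be checked numerically. The sketch below mirrors the hand calculation with the standard library only; the variable names are chosen for illustration.

```python
import math

u = [4, 1, 2, 0, 3, 0, 1, 0]
v = [2, 0, 1, 1, 2, 1, 0, 0]

# Step 1: dot product.
dot = sum(a * b for a, b in zip(u, v))

# Step 2: Euclidean (L2) norms.
norm_u = math.sqrt(sum(a * a for a in u))
norm_v = math.sqrt(sum(b * b for b in v))

# Step 3: cosine similarity.
cos_uv = dot / (norm_u * norm_v)

print(dot)               # 16
print(round(norm_u, 2))  # 5.57
print(round(norm_v, 2))  # 3.32
print(round(cos_uv, 2))  # 0.87
```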

8.0.3 Cosine similarity: Implementing in Python

The following function computes cosine similarity between two numeric vectors:

import numpy as np

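def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two 1-D numeric vectors."""
    # Note: this name shadows sklearn.metrics.pairwise.cosine_similarity if it was imported.
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    # Division by zero (result nan) when either vector is all zeros.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))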

This function can be applied directly to document vectors produced by different vectorization techniques.

8.0.4 Cosine similarity using CountVectorizer outputs

CountVectorizer: the sparse document–term matrix (code).

Recall that CountVectorizer builds a document–term matrix (Bag-of-Words). Each document is represented by a sparse vector of raw token counts, where:

  • Rows = documents.

  • Columns = vocabulary terms.

  • Entries = how many times each term appears in each document.

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Data science relies on numerical methods",          # Document 1
    "Text analysis uses vectors and matrices",           # Document 2
    "Mathematical representations support data modeling" # Document 3
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out()) # Output 1
print(bow_matrix.toarray())               # Output 2

CountVectorizer: the sparse and document-term matrices (outputs).

Output 1 shows the learned vocabulary (feature names) extracted by CountVectorizer, lowercased and listed in alphabetical order. These terms define the columns of the document-term matrix:

## ['analysis' 'and' 'data' 'mathematical' 'matrices' 'methods' 'modeling'
##  'numerical' 'on' 'relies' 'representations' 'science' 'support' 'text'
##  'uses' 'vectors']

Output 2 displays the corresponding document-term matrix in dense form, where rows represent documents, columns represent vocabulary terms (tokens), and entries indicate raw term frequencies.

## [[0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0]
##  [1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1]
##  [0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 0]]

CountVectorizer: cosine similarity (code).

Using the document–term matrix bow_matrix, cosine similarity can be computed for every document pair:

dense_bow = bow_matrix.toarray()  # densify once instead of on every iteration
n_docs = dense_bow.shape[0]

for i in range(n_docs):
    for j in range(i + 1, n_docs):
        sim = cosine_similarity(dense_bow[i], dense_bow[j])
        print(f"Cosine similarity between documents {i+1} and {j+1}: {sim:.3f}")

CountVectorizer: cosine similarity (outputs).

## Cosine similarity between documents 1 and 2: 0.000
## Cosine similarity between documents 1 and 3: 0.183
## Cosine similarity between documents 2 and 3: 0.000

CountVectorizer: cosine similarity (interpretation of the outputs).

  • A value of 0.000 means the two documents share no vocabulary terms after preprocessing, so their BoW vectors are orthogonal.

  • A small positive value (e.g., 0.183) typically indicates limited lexical overlap (for instance, a single shared token such as data), but not necessarily strong semantic similarity.

This highlights an important point: cosine similarity on Bag-of-Words is driven by shared tokens, not by meaning.
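The same pairwise values can also be obtained without an explicit loop. The sketch below uses scikit-learn's sklearn.metrics.pairwise.cosine_similarity, which accepts the sparse matrix directly and returns the full document-by-document similarity matrix. It is imported under the alias sk_cosine (an illustrative name) to avoid a clash with the hand-written function defined earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

bow_matrix = CountVectorizer().fit_transform(documents)

# (3, 3) symmetric matrix; entry [i, j] compares documents i+1 and j+1
sim_matrix = sk_cosine(bow_matrix)
print(sim_matrix.round(3))
```

The off-diagonal entries match the loop output above, and the diagonal equals 1 because every document is identical to itself.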

8.0.5 Cosine similarity using TF-IDF representations

TF-IDF representations: cosine similarity (code).

TF-IDF builds the same type of document vectors, but reweights terms:

  • Words that appear in many documents receive lower weight.

  • Words that are more document-specific receive higher weight.

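# tf_idf_matrix: TF-IDF document-term matrix, assumed to have been built earlier in the chapter
dense_tfidf = tf_idf_matrix.toarray()  # densify once instead of on every iteration
n_docs = dense_tfidf.shape[0]

for i in range(n_docs):
    for j in range(i + 1, n_docs):
        sim = cosine_similarity(dense_tfidf[i], dense_tfidf[j])
        print(f"Cosine similarity between documents {i+1} and {j+1}: {sim:.3f}")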

TF-IDF representations: cosine similarity (outputs).

## Cosine similarity between documents 1 and 2: 0.000
## Cosine similarity between documents 1 and 3: 0.000
## Cosine similarity between documents 2 and 3: 0.000

TF-IDF representations: cosine similarity (interpretation of outputs).

  • A similarity of 0.000 typically means that the two documents share no terms (or only terms whose weight is negligible after TF-IDF reweighting), so their vectors are orthogonal or nearly so.

  • TF-IDF reduces the influence of very common tokens, so even when two documents share a word, the similarity becomes smaller if that word appears in many documents and is therefore not informative.
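To see the reweighting effect on a concrete case, the following sketch compares BoW and TF-IDF cosine similarity on the three example sentences from the CountVectorizer section. This is an illustration under stated assumptions: it builds a fresh matrix with TfidfVectorizer at its default settings (smoothed IDF, L2-normalized rows) rather than reusing tf_idf_matrix, so its values need not match the outputs shown above. The alias sk_cosine is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

bow_sim = sk_cosine(CountVectorizer().fit_transform(documents))
tfidf_sim = sk_cosine(TfidfVectorizer().fit_transform(documents))

# Documents 1 and 3 share only the token "data", which occurs in two of three documents
print(f"BoW:    {bow_sim[0, 2]:.3f}")
print(f"TF-IDF: {tfidf_sim[0, 2]:.3f}")
```

Because "data" appears in two of the three documents, its IDF weight is lower than that of the document-specific terms, and the TF-IDF similarity between documents 1 and 3 comes out smaller than the BoW value while remaining positive.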

8.0.6 Overall: BoW + cosine vs TF-IDF + cosine

  • BoW + cosine measures overlap in raw counts.

  • TF-IDF + cosine measures overlap in weighted importance.

In the next section, we introduce one-hot vectorization, a foundational representation for neural and embedding-based models.

9 One-hot vectorization

One-hot encoding is a simple and widely used technique for representing categorical information in numerical form. In this representation, each possible category is associated with a unique coordinate in a vector. Exactly one entry takes the value 1, while all remaining entries are set to 0.

For a vocabulary of size \(|V|\), each one-hot vector lies in \(\mathbb{R}^{|V|}\) and contains exactly one non-zero component.

9.0.1 A simple intuition

Consider a categorical variable describing traffic conditions with three possible values:

  • Low.

  • Medium.

  • High.

A one-hot representation can be defined as:

\[\overrightarrow{\mathbf{\text{low}}} = (1,0,0),\quad \overrightarrow{\mathbf{\text{medium}}} = (0,1,0),\quad \overrightarrow{\mathbf{\text{high}}} = (0,0,1)\]

Each vector has length 3 because there are three possible categories, and exactly one position is active at a time. Geometrically, these vectors correspond to the canonical basis of \(\mathbb{R}^3\). They are mutually orthogonal and equidistant.
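The geometric claims above can be checked directly. The sketch below builds the three one-hot vectors from the rows of an identity matrix and verifies that distinct categories have dot product 0 and Euclidean distance equal to the square root of 2. The dictionary name one_hot is an illustrative choice.

```python
import numpy as np

categories = ["low", "medium", "high"]

# Row i of the identity matrix is the one-hot vector of category i
one_hot = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}

print(one_hot["medium"])   # [0. 1. 0.]

# Distinct categories: dot product 0 (orthogonal) and distance sqrt(2) (equidistant)
for a, b in [("low", "medium"), ("low", "high"), ("medium", "high")]:
    dot = np.dot(one_hot[a], one_hot[b])
    dist = np.linalg.norm(one_hot[a] - one_hot[b])
    print(a, b, dot, round(dist, 4))
```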

9.0.2 One-hot encoding in NLP

In natural language processing, the same idea applies to tokens. Once a vocabulary has been constructed, each word is treated as a category.

A token is represented by a vector in \(\mathbb{R}^{|V|}\), where only the coordinate corresponding to its position in the vocabulary equals 1.

Thus, one-hot encoding transforms discrete symbolic tokens into numerical vectors without introducing any semantic structure.

These representations serve as an intermediate step toward more advanced distributed representations such as word embeddings.

9.0.3 Constructing one-hot vectors step by step

To illustrate the process, we work with a short sentence.

Step 1: Define the input text.

sentence = ["Students study machine learning methods"]
corpus = pd.Series(sentence)
corpus
## 0    Students study machine learning methods
## dtype: object

The corpus contains a single document.

Step 2: Apply basic preprocessing.

We apply cleaning, stopword removal, and lemmatization:

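def clean_and_lemmatize(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    # Replace non-letters with spaces, lowercase, and split on whitespace
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    # Drop stopwords and lemmatize the remaining tokens (as nouns, the default)
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]

    return " ".join(tokens)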

preprocessed_corpus = corpus.apply(clean_and_lemmatize)
preprocessed_corpus

The output is:

## 0    student study machine learning method
## dtype: object

The sentence has been reduced to its core lexical components:

  • Students → student.

  • methods → method.

  • Stopwords would be removed, although this sentence contains none.

This ensures a clean and compact vocabulary.

Step 3: Build the vocabulary.

vocab = list(set(preprocessed_corpus[0].split()))
print(vocab)

The output is:

## ['machine', 'learning', 'student', 'study', 'method']

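Each unique token now corresponds to one dimension in the vector space. Since there are 5 distinct tokens, the one-hot vectors live in \(\mathbb{R}^5\). Note that a Python set has no guaranteed order, so the order of the vocabulary printed above may differ from run to run.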
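Since the order of a set is arbitrary, the column order of the vocabulary is not reproducible by default. A minimal sketch of two deterministic alternatives follows; the variable names are illustrative, and the preprocessed sentence is hard-coded so that the block stands alone.

```python
tokens = "student study machine learning method".split()

# Option 1: alphabetical order (the convention CountVectorizer also follows)
vocab_sorted = sorted(set(tokens))

# Option 2: order of first appearance (dict keys keep insertion order in Python 3.7+)
vocab_first_seen = list(dict.fromkeys(tokens))

print(vocab_sorted)       # ['learning', 'machine', 'method', 'student', 'study']
print(vocab_first_seen)   # ['student', 'study', 'machine', 'learning', 'method']
```

Either choice yields a different column order from the set-based output, so the index mapping and the one-hot matrix in the following steps would change accordingly.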

Step 4: Assign indices to vocabulary terms.

position = {token: idx for idx, token in enumerate(vocab)}
print(position)

The output is:

## {'machine': 0, 'learning': 1, 'student': 2, 'study': 3, 'method': 4}

This dictionary defines the coordinate system of the space: each token is assigned a fixed index. This mapping specifies which coordinate in the vector corresponds to each token.

Step 5: Initialize the one-hot matrix.

one_hot_matrix = np.zeros((len(preprocessed_corpus[0].split()), len(vocab)))
one_hot_matrix.shape

The output is:

## (5, 5)

In this case, the matrix has 5 rows (one row per token in the sentence) and 5 columns (one column per vocabulary term).

Step 6: Populate the one-hot vectors.

for i, token in enumerate(preprocessed_corpus[0].split()):
    one_hot_matrix[i][position[token]] = 1

For each token:

  • Identify its index in the vocabulary.

  • Set the corresponding column to 1.

Each row now becomes a canonical basis vector.

Step 7: Inspect the result.

one_hot_matrix

The output is:

## array([[0., 0., 1., 0., 0.],
##        [0., 0., 0., 1., 0.],
##        [1., 0., 0., 0., 0.],
##        [0., 1., 0., 0., 0.],
##        [0., 0., 0., 0., 1.]])

In this case:

  • Each row corresponds to one token.

  • Each row contains exactly one 1.

  • All vectors are orthogonal:

\[ \overrightarrow{w_i} \cdot \overrightarrow{w_j} = 0 \quad \text{for } i \neq j. \]

The representation captures only token identity: it encodes no information about frequency, relative importance, or semantic similarity.
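Steps 5 and 6 can also be written without a loop. The sketch below, which is one possible alternative rather than the method used in the steps above, selects rows of an identity matrix by vocabulary index and then decodes each row back to its token. The vocabulary order is copied from the output shown in Step 3; the names indices, decoded, and gram are illustrative.

```python
import numpy as np

vocab = ['machine', 'learning', 'student', 'study', 'method']
position = {token: idx for idx, token in enumerate(vocab)}
tokens = "student study machine learning method".split()

# Row k of the identity matrix is the one-hot vector for vocabulary index k
indices = [position[tok] for tok in tokens]
one_hot_matrix = np.eye(len(vocab))[indices]

# Decoding: the position of the single 1 in each row recovers the token
decoded = [vocab[k] for k in one_hot_matrix.argmax(axis=1)]

# Gram matrix of the rows: the identity here, because all tokens are distinct
gram = one_hot_matrix @ one_hot_matrix.T

print(decoded)
print(np.array_equal(gram, np.eye(len(tokens))))
```

The Gram matrix being the identity restates the orthogonality property: every row has unit norm and a zero dot product with every other row.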

Key observation.

While one-hot vectors provide a clear and unambiguous numerical encoding, they exhibit two major limitations:

  1. High dimensionality. The vector length grows linearly with vocabulary size.

  2. No semantic structure. All distinct words are equidistant:

    \[ \|\overrightarrow{w_i} \;-\; \overrightarrow{w_j}\|_2 \quad =\quad \sqrt{2} \quad \text{for}\, i\, \neq \,j. \]

These limitations motivate the transition to distributed representations, where meaning emerges from geometry rather than position alone.
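To make the first limitation concrete, the sketch below compares the storage needed for a dense one-hot matrix with that of the equivalent array of integer indices. The vocabulary size and token count are hypothetical values chosen for illustration, and the dense matrix is never actually allocated.

```python
import numpy as np

vocab_size = 50_000   # hypothetical vocabulary size
n_tokens = 1_000      # hypothetical number of tokens to encode

# Random token indices stand in for a real tokenized text
rng = np.random.default_rng(0)
indices = rng.integers(0, vocab_size, size=n_tokens)

# One float64 per (token, vocabulary term) cell versus one int64 per token
dense_bytes = n_tokens * vocab_size * np.dtype(np.float64).itemsize
index_bytes = indices.nbytes

print(f"dense one-hot matrix: {dense_bytes / 1e6:.0f} MB")
print(f"index array:          {index_bytes / 1e3:.0f} kB")
```

Both objects carry the same information, since each one-hot row is fully determined by the position of its single 1.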

10 Summary

In this chapter, we introduced the fundamental mathematical ideas behind representing text as numerical objects. Starting from simple heuristics, we explored how textual data can be mapped into vectors and matrices, enabling the use of linear algebra techniques for analysis.

We first examined the Bag-of-Words (BoW) representation and implemented it using the CountVectorizer API. While this approach provides an intuitive and effective way to encode text based on term frequencies, we also identified its main limitations—most notably, its tendency to overemphasize very frequent terms and ignore the relative importance of rarer but potentially informative words.

To address these issues, we introduced TF–IDF vectorization, which reweights term frequencies by incorporating global information about term distribution across the corpus. This adjustment helps balance local relevance within documents against global prevalence in the dataset. Despite this improvement, both BoW and TF–IDF remain fundamentally lexical methods: they rely on surface-level word occurrences and do not account for semantic meaning, word order, or contextual relationships.

Building on these vector representations, we then explored how document similarity can be quantified using cosine similarity, interpreting documents as points in a high-dimensional space and measuring the angles between their corresponding vectors. This provided a practical mechanism for comparing documents and served as the foundation for simple applications such as retrieval-based chatbots.

Finally, we discussed one-hot vectorization, a sparse encoding scheme commonly used to represent individual tokens as categorical variables. Although simple, this representation plays an important role as a conceptual building block for more advanced models.

Overall, the methods covered in this chapter are most effective in settings where the vocabulary size is moderate and lexical overlap between documents is meaningful. As vocabularies grow larger or semantic relationships become more important, these representations become less adequate.

With this syntactic foundation in place, the next chapter moves beyond word counts and lexical weighting. We will explore approaches that explicitly model semantic relationships between words, beginning with distributed representations such as Word2Vec.

11 Applied activity: from text to vector-based similarity

This activity is designed to integrate and apply the numerical text representation techniques introduced in this chapter.
The reader will transform a small text corpus into vector representations and analyze document similarity using linear algebra concepts.

11.0.1 Objective

To build a fully reproducible pipeline that converts raw text into numerical vectors using Bag-of-Words and TF–IDF representations, and to analyze document similarity using cosine similarity.

11.0.2 Instructions

  1. Select a small corpus of text, such as:

    • Short paragraphs from news articles,

    • Abstracts of scientific papers, or

    • Brief descriptions of products, movies, or books.

  2. The corpus must contain at least three documents, each consisting of one or two sentences.

  3. Create an R Markdown (.Rmd) document that compiles successfully to HTML (or PDF).

  4. The document must include both:

    • The code, and

    • The resulting output (printed matrices, tables, or numerical values).

11.0.3 Required Sections

1. Corpus Description.

Briefly describe the selected corpus and its context. List the documents explicitly and explain why this corpus is appropriate for similarity analysis.

2. Preprocessing.

Apply basic preprocessing steps, including:

  • Lowercasing,

  • Removal of punctuation,

  • Stopword removal, and

  • Lemmatization or stemming.

Show the processed version of each document.

3. Bag-of-Words Representation.

Construct a Bag-of-Words representation of the corpus using:

  • Either a manual implementation, or

  • A vectorization tool such as CountVectorizer.

Report:

  • The learned vocabulary, and

  • The document–term matrix.

Briefly interpret the sparsity and dimensionality of the resulting matrix.
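One possible way to quantify sparsity for this step is the share of zero entries in the document-term matrix. The sketch below illustrates the computation on the three example sentences from the CountVectorizer section, used here only as a placeholder corpus; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus: replace with the documents selected for the activity
documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

bow_matrix = CountVectorizer().fit_transform(documents)

n_docs, n_terms = bow_matrix.shape   # dimensionality: documents x vocabulary terms
nonzero = bow_matrix.nnz             # number of stored non-zero entries
sparsity = 1 - nonzero / (n_docs * n_terms)

print(f"shape: {n_docs} x {n_terms}, non-zero entries: {nonzero}, sparsity: {sparsity:.1%}")
```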

4. TF–IDF Representation.

Using the same corpus, compute TF-IDF vectors.

Compare the TF–IDF matrix with the BoW matrix by discussing:

  • Differences in numerical values, and

  • How TF–IDF reweights frequent and rare terms.

5. Cosine Similarity Analysis.

Compute pairwise cosine similarity between all documents using:

  • BoW vectors, and

  • TF–IDF vectors.

Present the results clearly and identify:

  • The most similar document pair, and

  • The least similar document pair.

Explain any differences observed between the two representations.
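Identifying the extreme pairs can be automated once a similarity matrix is available. The following sketch uses a small hypothetical similarity matrix (the numbers are made up for illustration) and reads the most and least similar pairs from its strict upper triangle, so that each unordered pair is considered once and the diagonal is ignored.

```python
import numpy as np

# Hypothetical similarity matrix for four documents (symmetric, ones on the diagonal)
sim = np.array([
    [1.00, 0.10, 0.45, 0.00],
    [0.10, 1.00, 0.30, 0.25],
    [0.45, 0.30, 1.00, 0.05],
    [0.00, 0.25, 0.05, 1.00],
])

# Indices of the strict upper triangle: each unordered pair appears exactly once
rows, cols = np.triu_indices_from(sim, k=1)
values = sim[rows, cols]

best = values.argmax()
worst = values.argmin()

print(f"most similar:  documents {rows[best] + 1} and {cols[best] + 1} ({values[best]:.2f})")
print(f"least similar: documents {rows[worst] + 1} and {cols[worst] + 1} ({values[worst]:.2f})")
```

With ties, argmax and argmin return the first matching pair, so tied pairs should be reported explicitly if they occur.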

6. One-hot Representation (Conceptual).

Select three tokens from the vocabulary and:

  • Construct their one-hot vectors, and

  • Explain why one-hot representations are unsuitable for measuring semantic similarity directly.

This section may be presented conceptually or with a small numerical example.

7. Summary and Reflection.

Write a concise reflection (6–10 lines) addressing:

  • How vectorization enables mathematical comparison of text,

  • The role of weighting schemes such as TF–IDF, and

  • The limitations of purely lexical representations.

11.0.4 Reproducibility Requirement

  • The R Markdown document must be fully reproducible.

  • All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.

  • All random seeds (if applicable) must be set to ensure deterministic results.

  • All library versions used should be clearly reported.
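For the Python side of the document, one way to meet the last two requirements is sketched below. It fixes the random seeds and prints the installed version of each package through the standard-library importlib.metadata module. The seed value 0 and the list of packages are assumptions for illustration and should be adapted to what the document actually uses.

```python
import random
import sys
from importlib.metadata import PackageNotFoundError, version

import numpy as np

# Fix seeds for any step that relies on randomness (the value 0 is arbitrary)
random.seed(0)
np.random.seed(0)

# Distribution names as published on PyPI
packages = ["numpy", "pandas", "scikit-learn", "nltk"]

report = {"python": sys.version.split()[0]}
for name in packages:
    try:
        report[name] = version(name)
    except PackageNotFoundError:
        report[name] = "not installed"

for name, ver in report.items():
    print(f"{name}: {ver}")
```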

If you found any ERRORS or have SUGGESTIONS, please report them to my email. Thanks.  
---
title: "MATHEMATICS BEHIND LANGUAGE REPRESENTATION"
subtitle: <h1>**Transforming Text into Data Structures**</h1>

author: 
  - name          : "Dr. rer. nat. Humberto LLinás Solano"
    affiliation   : "Department of Mathematics and Statistics, Universidad del Norte, Barranquilla, Colombia"
     #corresponding : yes    # Define only one corresponding author
     #address       : "Departamento de Matemáticas y Estadística"
    email         : |
      hllinas@uninorte.edu.co
      
      [Biographical sketch](https://rpubs.com/hllinas/Bio_Sketch)
      
      `r format(Sys.time(), "%d/%m/%y")` 
      
     #role:         # Contributorship roles (e.g., CRediT, https://casrai.org/credit/)
  #    - Conceptualization
  #    - Writing - Original Draft Preparation
  #    - Writing - Review & Editing
 # - name          : "Author number 2"
 #   affiliation   : "1,2"
 #   role:
 #     - Writing - Review & Editing
     #affiliation:
  #- id            : "1"
  #  institution   : "Universidad del Norte (Barranquilla, Colombia)"
  #![](hllinas.jpg){width=1in} 
  
#date: '`r format(Sys.time(), "%d/%m/%y")`'  # see https://bookdown.org/yihui/rmarkdown-cookbook/update-date.html
output: 
    bookdown::html_document2: 
          #NOTE: chapters, sections, and theorems are rendered
    #bookdown::html_book:
          #NOTE (ERROR): theorems are rendered, but chapters are not
    #html_document:
          toc: true      # include a table of contents
          toc_depth: 4   # up to four levels of headings (specified by #, ##, ### and ####)
          toc_float: true # With true, the TOC floats at the left margin of the page; otherwise it appears at the top
          collapsed: false
          smooth_scroll: false
          number_sections: true   # if you want number sections at each table header
          #theme: sandstone
          #theme: united  # many options for theme, this one is my favorite.
          #theme: flatly  # 
          #theme: cerulean  # 
          #highlight: tango  # specifies the syntax highlighting style
          #css: Scripts accesorios/estiloboton.css
          #css: my.css   # you can add your custom css, should be in same folder
          code_download: true
          #highlight: tango  # change the color of library to blue
    # bookdown::gitbook:
    #      includes:
    #        in_header: header.html
    # bookdown::pdf_book:
    #       keep_tex: yes
    # bookdown::html_book:
    #       css: toc.css
    # bookdown::html_book:
    #         includes:
    #           in_header: style.css
    #bookdown::html_document2: default
    # bookdown::pdf_document2:
    #      keep_tex: true
    #bibliography: references.bib
    mathjax: "http://example.com/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"
header-includes:
    \usepackage[x11names]{xcolor} 
    \usepackage{graphicx}
    \usepackage{array}
    
csl: science.csl
#Note: YAML syntax is used here

abstract: |
 **Other related documents can be found at [Rpubs:: toc](https://rpubs.com/hllinas/toc).**
  
---
  
 
```{r setup, include=FALSE}
library(reticulate)
# To use a specific Python version:
#use_python("/usr/bin/python")
# or use a virtualenv or conda environment:
# use_virtualenv("~/miniconda3/envs/torch") 
# 
# 
# conda create -n torch python=3.11 -y
#conda activate torch
#pip install nltk

use_condaenv("torch")


knitr::opts_chunk$set(echo = TRUE, fig.align="center",  message=FALSE, warning=FALSE#,
                    #style = "color:darkblue"
                    # class.source="bg-danger", class.output="bg-warning"   #Colores dentro del chunk
                     )
library(rgl)
knitr::knit_hooks$set(webgl = hook_webgl)
```



<!-- markdownlint-disable-next-line MD033 -->
<style type="text/css">

body{ /* Normal  */
      font-size: 14px;
  }

/* td { font-size: 8px; }  Commented out so as not to affect the size of kableExtra tables */

h1.title {
  font-size: 38px;
  color: DarkBlue;
}
h1  { /* Header 1 */
  font-size: 22px;
  font-weight: bold;  /* or use 700 if you prefer */
  color: Black;
}
h2 { /* Header 2 */
    font-size: 22px;
    font-weight: bold;  /* or use 700 if you prefer */
  color: DarkBlue;
}
h3 { /* Header 3 */
  font-size: 18px;
   font-weight: bold;  /* or use 700 if you prefer */
  /* font-family: "Times New Roman", Times, serif; */
  color: DarkGreen;
}
h4 {
  font-size: 18px;
  color: Green;
  font-weight: 900;  /* or use 700 if you prefer */
  font-family: "Times New Roman", Times, serif;
}

code.r{ /* Code block */
    font-size: 12px;
}
pre { /* Code block - determines code spacing between lines */
    font-size: 14px;
}
</style>



```{r, echo=FALSE, eval=FALSE}
https://bookdown.org/yihui/rmarkdown/language-engines.html

https://bookdown.org/yihui/bookdown/markdown-syntax.html

https://bookdown.org/yihui/bookdown/a-single-document.html

https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html

https://bookdown.org/yihui/rmarkdown/bookdown-markdown.html  # Theorems and proofs

https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html#theorems

https://bookdown.org/yihui/bookdown/html.html

https://www.data-to-viz.com/
  
[Rpubs](link)
  
(\#eq:ec-),  Equation \@ref(eq:ec-), Figure \@ref(fig:Fig-), Table \@ref(tab:mtcars), Theorem \@ref(thm:boring)

# Title {#TituloSeccion}   \@ref(TituloSeccion)

# See Figure \@ref(fig:Fig1-Transf).  
    
# For HTML, we can set color with CSS, e.g., <span style="color: red;">text</span>
  
# https://radiant-rstats.github.io/docs/model/logistic.html Shiny logit  
  
#### The code.  {.unlisted .unnumbered}  

# conda activate torch
```


```{r, eval=FALSE, echo=FALSE}
# ID-card-sized photo

htmltools::img(src = knitr::image_uri(file.path(R.home("doc"), "html", "logo.jpg")), 
               alt = 'hllinas', 
               style = 'position:absolute; top:0; right:0; padding:10px;',
               width = "200px")  # Set the desired width here, in pixels or as a percentage
```



```{r, echo=FALSE, }
# The large photo

htmltools::img(src = knitr::image_uri("hllinas2023.jpg"), 
               alt = 'hllinas2023', 
               style = 'position:absolute; top:0; right:0; padding:1px;',
               width="15%")
```



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

```{css, echo=FALSE}
.columns {display: flex;}
h1 {color: DarkBlue;}
h3 {color: DarkGreen;}
h4 {color: DarkGreen;}


.error-block {
  margin-left: 2em;
}

.error-block strong {
  margin-left: -1em;
}

.sangria3 {
  margin-left: 3em;
}

.sangria4 {
  margin-left: 4em;
}

.sangria5 {
  margin-left: 5em;
}

.sangria6 {
  margin-left: 6em;
}

.sangria7 {
  margin-left: 7em;
}

.sangria8 {
  margin-left: 8em;
}

.sangria9 {
  margin-left: 9em;
}

```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Chapter 1 -->

```{r, echo=FALSE, eval=FALSE}
#Multiple authors and subtitles in Rmarkdown yaml: 
#https://stackoverflow.com/questions/26043807/multiple-authors-and-subtitles-in-rmarkdown-yaml

#Insert a logo in upper right corner of R markdown html document:
#https://stackoverflow.com/questions/43009788/insert-a-logo-in-upper-right-corner-of-r-markdown-html-document/43010632

```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

# Prerequisites and software setup 

The examples and exercises presented in this chapter rely on a small set of widely used Python libraries for text preprocessing, vectorization, and numerical computation. To ensure that all code runs correctly, the required packages and language resources should be installed *before* executing the examples in this document.

The commands below are provided *for reference only* and should be executed in a Python environment (for example, a terminal, Anaconda Prompt, or a Python-enabled R Markdown setup using `reticulate`).



```{bash, eval=FALSE}
# Core machine learning and NLP libraries
pip install scikit-learn
pip install nltk
pip install pandas
pip install numpy
pip install seaborn
pip install tabulate

# Download required NLTK resources
python -c "import nltk; nltk.download('wordnet')"
python -c "import nltk; nltk.download('omw-1.4')"
python -c "import nltk; nltk.download('stopwords')"
```

Once the packages are installed, the following Python modules are imported throughout this document:


```{python, eval=FALSE}
import numpy as np
import pandas as pd
import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```


#### Role of the libraries.  {.unlisted .unnumbered}  

The purpose of each library used in this document is summarized below:

- `NLTK` provides basic natural language preprocessing tools, including stopword removal and lemmatization. Only lightweight linguistic processing is used in this document.

- scikit-learn (`sklearn`) supplies the vectorization and similarity machinery, including `CountVectorizer`, `TfidfVectorizer`, and cosine similarity computation.

- `pandas` is used to manage text corpora as structured objects (e.g., Series) and to apply preprocessing functions consistently across documents.

- `numpy` supports numerical operations and vector-based computations required for similarity calculations.

- `seaborn` is used for making statistical graphics.

- `tabulate` is used to pretty-print tabular data in a human-readable format.

These tools are sufficient to illustrate the fundamental ideas behind frequency-based text representations, without introducing unnecessary dependencies.



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

# Introduction

### Preliminaries

Textual data poses a distinctive challenge for computational analysis: unlike numerical or categorical variables, natural language does not come with an inherent mathematical representation. While computers operate exclusively on numbers, language is expressed through symbols, words, and structures whose meaning is not natively encoded in numeric form.

Transforming text into numbers is therefore unavoidable (but it is also an opportunity). The specific choices made during this transformation determine which aspects of language are preserved, which are simplified or ignored, and how effectively learning algorithms can operate on linguistic data. In this sense, representation choices are not neutral: they directly influence model behavior, interpretability, and performance.

In the previous document ([see vocabulary construction](https://rpubs.com/hllinas/R_NLP_vocabulary)), we focused on defining the *symbolic units* of language processing, including tokenization strategies, normalization procedures, and vocabulary design. These steps establish *what constitutes a unit of analysis*. In this chapter, we move to the next stage of the pipeline and examine *how those symbolic units are transformed into numerical objects*.

Our approach is deliberately incremental. We begin with simple and transparent representations that emphasize *observable structure* rather than deep semantic meaning. By relying on frequency counts and distributional information, we can construct representations that are easy to interpret and that provide a solid mathematical foundation for more advanced techniques.

Throughout this chapter, we introduce classical methods for numerical text representation, including *Bag-of-Words* and *term frequency–inverse document frequency (TF-IDF)*. Although conceptually straightforward, these methods remain widely used in practice (for baseline models, exploratory analysis, and instructional settings).

Before introducing these techniques, it is useful to clarify a fundamental distinction that underlies all language modeling: *syntax versus semantics*. Syntax concerns the structural organization of words and their observable patterns of occurrence, whereas semantics relates to meaning and interpretation. A sentence may be syntactically well-formed without conveying meaningful information.

In this chapter, the emphasis is intentionally placed on the *syntactic dimension* of language. We focus on representations derived from word occurrence patterns (such as counts and relative frequencies) while postponing semantic representations (e.g., embeddings and neural encodings) to later chapters.

By the end of this chapter, you will be able to represent text using vectors and matrices, compute similarities between documents, and build simple language-based applications. These ideas also serve as a conceptual bridge toward the representation learning techniques employed in modern deep learning architectures, including Transformer-based models.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

### Motivation: From vectors to Transformer inputs

The numerical representations introduced in this chapter (such as Bag-of-Words and TF-IDF) illustrate a fundamental principle: language must be encoded as vectors before it can be processed by any computational model. Although these representations are relatively simple and primarily capture *syntactic structure*, they establish the mathematical foundation required for more advanced methods.

In modern architectures such as the Transformer [(Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), textual inputs are ultimately mapped into vector spaces and processed through multiple layers of transformation. However, instead of relying on sparse, high-dimensional frequency-based vectors, Transformers employ *dense vector representations* (embeddings) that capture richer linguistic information.

Figure \@ref(fig:Fig-Transf) shows that the model begins with an **Input Embedding** stage, where each token is mapped into a continuous vector space. The representations developed in this chapter can be interpreted as a conceptual precursor to that stage: they demonstrate how text can be embedded into vector spaces, even though they do not yet capture semantic relationships or contextual dependencies.


<center>
```{r Fig-Transf, echo=FALSE, fig.cap = "General architecture of the Transformer model. Source: [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)", out.width = "55%"}
# fig.width = 20 # This option does not work in the chunk

#http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/

knitr::include_graphics("Fig-Transf.png")

#Another way, but the caption is not rendered:
#<center>
#![(#fig:Fig-caption) My figure](Nombre.png){width=400px}
#</center>
```
</center>


This highlights an important transition: while frequency-based methods focus on observable structure, modern neural models require representations that also capture meaning and context. This transition (from syntactic representations to semantic vector spaces) is developed in the next document (see [word embeddings](https://rpubs.com/hllinas/R_NLP_embedding1)).

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

#### Chapter roadmap. {.unlisted .unnumbered}

The main topics covered in this chapter are:

- Understanding vectors and matrices as mathematical data structures  

- Exploring the Bag-of-Words (BoW) representation  

- Constructing TF–IDF vectors  

- Measuring distance and similarity between document vectors  

- One-hot vectorization  

- Building a basic chatbot  
  

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

# Understanding vectors and matrices

A central challenge in NLP is expressing language in mathematical form. Two data structures play a fundamental role in this transformation: **vectors** and **matrices**. Together, they allow collections of text documents to be analyzed using the tools of linear algebra.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separator -->

### Vectors

#### Definition and notation. {.unlisted .unnumbered}  

A vector is a one-dimensional array of numerical values, where each position corresponds to a specific feature. Vectors are commonly represented as column arrays:

$$
\mathbf{x} =
\begin{bmatrix}
x_1 \\
x_2 \\
x_3 
\end{bmatrix}, \quad \mathbf{v} =\begin{bmatrix}
v_1 \\
v_2 \\
v_3 \\
v_4
\end{bmatrix}
$$

In this expression, the vector $\mathbf{x}$ contains three components and belongs to $\mathbb{R}^3$, while $\mathbf{v}$ contains four components and belongs to $\mathbb{R}^4$. Each coordinate represents the contribution of the vector along a particular axis. Once an object is represented as a vector, operations such as distance computation, similarity measurement, and projection become well defined.
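As a small illustration of the operations just mentioned, the following sketch computes a distance, a cosine similarity, and a projection for two vectors in $\mathbb{R}^3$ using NumPy. The vectors `x` and `y` and their values are arbitrary choices made for this example, not quantities taken from the chapter.

```python
import numpy as np

# Two illustrative vectors in R^3 (values chosen arbitrarily)
x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, 1.0])

# Euclidean distance between x and y
distance = np.linalg.norm(x - y)

# Cosine similarity: dot product divided by the product of the norms
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Orthogonal projection of x onto the direction of y
projection = (np.dot(x, y) / np.dot(y, y)) * y

print(distance, cosine, projection)
```

Each of these quantities depends only on the coordinates, which is why any object that can be encoded as a vector, including a text document, can be compared in the same way.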

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Geometric intuition. {.unlisted .unnumbered}  

To develop geometric intuition, consider representing entities using measurable attributes. Suppose we describe two cities using their average annual temperature and annual rainfall:

\[
\begin{array}{c|cc}
\text{City} & \text{Temperature (°C)} & \text{Rainfall (mm)} \\
\hline
\text{A} & 18 & 500 \\
\text{B} & 25 & 1100
\end{array}
\]


Each city can be interpreted as a point in a two-dimensional space, or equivalently, as a vector.

- City A corresponds to the vector $\mathbf{x}_A= (18, 500)$.

- City B corresponds to the vector $\mathbf{x}_B= (25, 1100)$.



From a mathematical perspective, both vectors belong to $\mathbb{R}^2$. Here is the corresponding visualization:  

```{r, echo=FALSE}
library(ggplot2)

# Data: cities as vectors
df <- data.frame(
  City = c("City A", "City B"),
  Temperature = c(18, 25),
  Rainfall = c(500, 1100)
)

ggplot(df, aes(x = Temperature, y = Rainfall)) +
  
  # Vectors from the origin
  geom_segment(
    aes(x = 0, y = 0, xend = Temperature, yend = Rainfall),
    arrow = arrow(length = unit(0.25, "cm")),
    linewidth = 1,
    color = "steelblue"
  ) +
  
  # Points at vector tips
  geom_point(size = 3, color = "steelblue") +
  
  # Labels
  geom_text(
    aes(label = City),
    hjust = -0.15,
    vjust = -0.4,
    size = 4
  ) +
  
  labs(
    x = "Average Temperature (°C)",
    y = "Annual Rainfall (mm)",
    title = "Cities represented as vectors in a two-dimensional space"
  ) +
  
  scale_x_continuous(expand = expansion(mult = c(0, 0.2))) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  
  expand_limits(x = 0, y = 0) +
  theme_minimal(base_size = 11)
```


Each vector originates at the coordinate system’s origin and points toward a location determined by the corresponding attributes. Adding a new attribute (such as altitude or population density) increases the dimensionality of the representation, moving the vectors from
$\mathbb{R}^2$ to $\mathbb{R}^3$ or higher.

While such spaces quickly become difficult to visualize, the algebraic interpretation of vectors remains valid in any dimension.
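
As a rough sketch of what these operations look like in practice, the two city vectors can be stored as NumPy arrays and compared directly. The altitude values in the last lines are invented purely for illustration and are not part of the example above.

```python
import numpy as np

# City vectors from the table: (temperature in °C, rainfall in mm)
x_a = np.array([18.0, 500.0])
x_b = np.array([25.0, 1100.0])

# Euclidean distance between the two points
distance = np.linalg.norm(x_a - x_b)

# Cosine of the angle between the two vectors
cosine = (x_a @ x_b) / (np.linalg.norm(x_a) * np.linalg.norm(x_b))

print(distance, cosine)

# Adding a third attribute moves the vectors from R^2 to R^3.
# The altitude values (in metres) are hypothetical.
x_a3 = np.append(x_a, 1200.0)
x_b3 = np.append(x_b, 30.0)
print(x_a3.shape, x_b3.shape)
```

Note that the distance is dominated by rainfall, simply because its numerical scale is much larger than that of temperature.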


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### From geometric vectors to text representation. {.unlisted .unnumbered}  

The same idea extends naturally to textual data.

After tokenization, introduced in the previous document (see [Vocabulary and text normalization](https://rpubs.com/hllinas/R_NLP_vocabulary)), a document can be represented as a vector in which *each dimension corresponds to a unique token* in the vocabulary. The value along each dimension reflects how frequently that token appears in the document.

In this way, text is mapped into a high-dimensional vector space, where each document corresponds to a point. This representation enables the application of vector-based operations such as similarity measurement, distance computation, and clustering, forming the basis of many methods in natural language processing.
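
A minimal sketch of this mapping, assuming a toy vocabulary of four tokens and a naive whitespace tokenizer (both chosen only for illustration):

```python
from collections import Counter

# Hypothetical vocabulary: one dimension per unique token
vocabulary = ["data", "model", "text", "vector"]

document = "text data text"
counts = Counter(document.split())   # naive whitespace tokenization

# One component per vocabulary token; absent tokens get 0
doc_vector = [counts[token] for token in vocabulary]
print(doc_vector)
```

The resulting list has as many components as the vocabulary has tokens, regardless of how long the document is.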


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Matrices

Matrices extend vectors by organizing multiple vectors into rows and columns. A matrix can be written as:

$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} \\
x_{21} & x_{22} \\
x_{31} & x_{32}
\end{bmatrix}
$$

This matrix belongs to $\mathbb{R}^{3 \times 2}$, indicating three rows and two columns. In text analysis, matrices are commonly used to represent **collections of documents**. Each row corresponds to a document, each column corresponds to a token in the vocabulary, and each entry stores the frequency of that token in the document.
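
The row-per-document layout can be sketched with NumPy. The three vectors below are hypothetical counts over a two-token vocabulary, chosen only to reproduce the $3 \times 2$ shape of the matrix above.

```python
import numpy as np

# Three hypothetical document vectors over a two-token vocabulary
d1 = np.array([2, 0])
d2 = np.array([1, 1])
d3 = np.array([0, 3])

# Stack the vectors as rows: one row per document, one column per token
X = np.vstack([d1, d2, d3])

print(X.shape)   # (rows, columns) = (documents, tokens)
print(X[2, 1])   # count of the second token in the third document
```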


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Matrix representation: a simple example


#### From text to matrix representation.  {.unlisted .unnumbered}  


To illustrate how text can be organized into matrix form, consider the following small collection of documents:

```{python, eval=FALSE}
from sklearn.feature_extraction.text import CountVectorizer # 1

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

vectorizer = CountVectorizer(stop_words="english")  # 2
vectorizer

X = vectorizer.fit_transform(documents)             # 3
X

# Inspect the learned vocabulary and document-term matrix
print(vectorizer.vocabulary_)  # 4 --> first output
print(X.todense())             # 5a --> second output
print(X.toarray())             # 5b --> second output
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Explanation of the code.  {.unlisted .unnumbered}  


**Code 1.** 

:::sangria3
The code begins by importing `CountVectorizer` from `sklearn.feature_extraction.text`, a tool designed to transform a collection of text documents into a document–term matrix. Next, a small corpus of three short documents is defined. 
:::


**Code 2.** 

:::sangria3
The instruction `CountVectorizer(stop_words="english")` creates a vectorizer that automatically tokenizes the text, extracts a set of unique tokens (the vocabulary), and removes common English stopwords such as *and*, *are*, or *of*. 


```{python, echo=FALSE}
from sklearn.feature_extraction.text import CountVectorizer

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

vectorizer = CountVectorizer(stop_words="english")
vectorizer
##X = vectorizer.fit_transform(documents)             # 3
#X


# Inspect the learned vocabulary and document-term matrix
# print(vectorizer.vocabulary_)  # 4 --> first output
# print(X.todense())             # 5 --> second output
```


The output displayed after evaluating `vectorizer` does not correspond to data, but to the internal representation of the `CountVectorizer` object. It simply confirms that the object has been successfully created with the specified parameters.

At this stage, the vectorizer has not yet been fitted to the data, meaning that no vocabulary has been learned and no matrix has been constructed. The expandable *Parameters* section reflects the configuration of the object (such as stopword removal), rather than any learned information.

In other words, we are not yet seeing the data (we are only defining the tool that will process it).

Only after applying `fit_transform(documents)` does the vectorizer learn the vocabulary and generate the document-term matrix.
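
One way to check this is sketched below. It relies on the scikit-learn convention that learned attributes carry a trailing underscore and only exist after fitting; the variable names `fitted_before` and `fitted_after` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")

# Before fitting: only configuration, no learned vocabulary
fitted_before = hasattr(vectorizer, "vocabulary_")

vectorizer.fit(["Text analysis relies on numerical representations"])

# After fitting: the learned vocabulary is stored in vocabulary_
fitted_after = hasattr(vectorizer, "vocabulary_")

print(fitted_before, fitted_after)
```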
:::


**Code 3.** 

:::sangria3
The line `fit_transform(documents)` then performs two tasks in a single call: 

- It first learns the vocabulary from the corpus (`fit`).

- It then converts the documents into a numerical matrix representation (`transform`).

The resulting object `X` is a sparse document-term matrix, where rows correspond to documents and columns correspond to vocabulary terms.
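
The following sketch separates the two tasks to suggest that calling `fit` and then `transform` yields the same matrix as `fit_transform`. The extra sentence passed to `transform` at the end is a made-up example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

# One call: learn the vocabulary and build the matrix
X_one_call = CountVectorizer(stop_words="english").fit_transform(documents)

# Two calls: learn the vocabulary first, then build the matrix
vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(documents)
X_two_calls = vectorizer.transform(documents)

same = np.array_equal(X_one_call.toarray(), X_two_calls.toarray())
print(same)

# transform() can also encode a new document with the learned vocabulary
X_new = vectorizer.transform(["vectors of text"])
print(X_new.shape)
```

Keeping the two steps separate matters when new documents must be encoded with a vocabulary learned earlier.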


```{python, echo=FALSE}
from sklearn.feature_extraction.text import CountVectorizer

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

vectorizer = CountVectorizer(stop_words="english")
#vectorizer
X = vectorizer.fit_transform(documents)             # 3
X


# Inspect the learned vocabulary and document-term matrix
# print(vectorizer.vocabulary_)  # 4 --> first output
# print(X.todense())             # 5 --> second output
```


The object `X` is not displayed as a full matrix because it is stored in a *sparse format*. Instead, Python displays a summary of its structure:

- The `shape (3, 14)` indicates that the matrix has 3 rows (documents) and 14 columns (unique tokens in the vocabulary).

- The expression `15 stored elements` means that only 15 entries in the matrix are nonzero. This reflects the sparsity of textual data, where most tokens do not appear in most documents.

- The term `Compressed Sparse Row (CSR)` refers to the internal representation used to efficiently store and manipulate sparse matrices by keeping track only of nonzero entries.

This compact representation is essential for handling large text corpora, where the document–term matrix can have thousands or even millions of columns.
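
To give a feel for what CSR stores, here is a small sketch using SciPy (installed as a dependency of scikit-learn). The dense array is a made-up example, not the document-term matrix above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 matrix with mostly zero entries
dense = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 4, 0, 0],
])
X_small = csr_matrix(dense)

print(X_small.nnz)       # number of stored (nonzero) entries
print(X_small.data)      # the nonzero values, row by row
print(X_small.indices)   # the column index of each stored value
print(X_small.indptr)    # where each row starts and ends inside data/indices

# Fraction of entries that are actually stored
density = X_small.nnz / (X_small.shape[0] * X_small.shape[1])
print(density)
```

For the document-term matrix above, the same calculation gives 15 / (3 × 14), roughly 0.36; in realistic corpora the density is usually far lower.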


In addition to printing the matrix, several attributes can be used to better understand its structure:

- `X.shape` returns the dimensions of the matrix (number of documents × vocabulary size).

- `X.nnz` gives the number of nonzero entries, indicating how sparse the matrix is.

- `vectorizer.get_feature_names_out()` returns the ordered list of tokens corresponding to the columns of the matrix.


```{python, eval=FALSE}
X.shape
X.nnz
vectorizer.get_feature_names_out()
```


```{python, echo=FALSE}
X.shape
X.nnz
vectorizer.get_feature_names_out()
```

These tools allow us to interpret the matrix more precisely without converting it into a dense representation. The sparse representation hides most of the matrix entries (which are zero), but it preserves all the information needed to reconstruct the full document–term matrix when required.
:::

**Codes 4 and 5**. 

:::sangria3
Finally, the code produces two explicit outputs:

- The learned vocabulary (`vectorizer.vocabulary_`), which maps each token to a column index. 

- The dense version of the matrix (`X.todense()` or `X.toarray()`), which makes the full structure easier to inspect in small examples.
:::


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Example: interpreting the first output (code 4).  {.unlisted .unnumbered}  


The printed dictionary (`vocabulary_`) maps each unique token to a column index in the document-term matrix: 

```{python, echo=FALSE}
from sklearn.feature_extraction.text import CountVectorizer

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

vectorizer = CountVectorizer(stop_words="english")
#vectorizer
X = vectorizer.fit_transform(documents)             # 3
#X

# Inspect the learned vocabulary and document–term matrix
print(vectorizer.vocabulary_)
```



Each key in this dictionary is a **token** extracted from the corpus after preprocessing (tokenization and stopword removal). The associated number is *not a frequency* and does *not* indicate importance or order of appearance in the text. Instead, it specifies the *column position* assigned to that token in the document-term matrix.

To make this concrete:

- `'analysis': 0` means that the token *analysis* corresponds to *column 0* of the matrix.

- `'collections': 1` corresponds to *column 1*.

- `'core': 2` corresponds to *column 2*.
- ...
- `'text': 11` corresponds to *column 11*.

- `'vectors': 13` corresponds to *column 13*.

In other words, the numbers `0`, `1`, `2`, …, `13` are *indices*, not counts. They simply label the columns of the matrix, starting from zero, following Python’s indexing convention.

Once this mapping is defined, the document-term matrix uses it consistently. For example:

- The value located at row $i$ and column `0` represents the frequency of the token *analysis* in document $i$. 

- Similarly, the value at column `11` represents the frequency of the token *text* in that same document.  


More generally, the entry in row $i$ and column $j$ records how many times token $j$ appears in document $i$. 

This separation of roles is crucial:

- The *vocabulary dictionary* defines *where* each token lives in the matrix.

- The *matrix entries* define *how often* each token appears in each document.

Understanding this distinction helps explain why a document vector has a fixed length equal to the size of the vocabulary, and why most entries are zero when a token does not appear in a document.
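
The two roles can be seen side by side in a short sketch that rebuilds the example and reads one column of the matrix. The variable names `col` and `counts` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# The vocabulary says WHERE the token lives (its column index)
col = vectorizer.vocabulary_["text"]

# The matrix says HOW OFTEN the token appears in each document
counts = X[:, col].toarray().ravel()

print(col, counts)
```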

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Example: interpreting the second output (code 5).  {.unlisted .unnumbered}  

The second output (`X.todense()` or `X.toarray()`) is the document-term matrix itself: 

```{python, echo=FALSE}
print(X.todense())  
```


```{r, eval=FALSE, echo=FALSE}
\[
\mathbf{X} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\in \mathbb{R}^{3 \times 14}
\]

\[
\mathbf{X} = \left(
\begin{array}{ccccccccccccccc}
        & \text{analysis} & \text{collections} & \text{core} & \text{efficiently} & \text{large} & \text{mathematical} & \text{matrices} & \text{numerical} & \text{processed} & \text{relies} & \text{representations} & \text{text} & \text{tools} & \text{vectors} \\
\text{Text 1} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\
\text{Text 2} & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
\text{Text 3} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0
\end{array} \right)
\in \mathbb{R}^{3 \times 14}
\]

```

Mathematically: 

\[
\mathbf{X} = \left(
\begin{array}{c|cccccccccccccc}
   \text{Text}     & \text{anal} & \text{coll} & \text{core} & \text{eff} & \text{lar} & \text{math} & \text{mat} & \text{num} & \text{proc} & \text{relies} & \text{repr} & \text{text} & \text{tools} & \text{vec} \\
        \hline
\text{#1} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\
\text{#2} & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
\text{#3} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0
\end{array} \right)
\in \mathbb{R}^{3 \times 14}
\]

**Legend (tokens):**  

`anal` = analysis; `coll` = collections; `core` = core; `eff` = efficiently; `lar` = large;  `math` = mathematical; `mat` = matrices; `num` = numerical; `proc` = processed; `relies` = relies;  `repr` = representations; `text` = text; `tools` = tools; `vec` = vectors.

This matrix should be interpreted as follows:

- Rows of the matrix $\mathbf{X}$  correspond to documents: `Text 1`, `Text 2`, `Text 3` (in the same order as the input text).

- Columns correspond to tokens in the vocabulary.

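Each row of $\mathbf{X}$ is the vector of one document,

$$
\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{i14}),
$$

and its entry $x_{ij}$ represents the number of times token $j$ appears in document $i$. The subscripts follow the mathematical convention of counting from 1, so column $j$ here corresponds to index $j-1$ in Python. For example: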

 - $x_{1,1} = 1$ indicates that the token *analysis* appears once in the first document.
  
 - $x_{1,8} = 1$ indicates that the token *numerical* appears once in the first document.
  
 - $x_{2,14} = 1$ indicates that the token *vectors* appears once in the second document.

 - Zeros indicate that the corresponding token does not appear in that document.

Because each document is short and no token is repeated within a document, the matrix contains only the values 0 and 1. A value of 1 indicates that the corresponding token appears once in that document, while 0 indicates that it does not appear at all.

The length of each row vector equals the size of the vocabulary. In this example, the vocabulary contains 14 unique tokens after stopword removal, which explains why each document vector has 14 components.

Once text data has been converted into matrix form, it becomes amenable to standard linear algebra operations such as similarity computation, projection, and matrix transformations, enabling quantitative analysis of documents.

This type of matrix-based encoding is commonly associated with the **Bag-of-Words (BoW) model**, in which each document is represented relative to a fixed vocabulary, typically by recording the frequency of its tokens while ignoring word order.
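
As one illustration of the linear algebra operations mentioned above, the sketch below computes pairwise cosine similarities between the three documents using `cosine_similarity` from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = (
    "Text analysis relies on numerical representations",
    "Vectors and matrices are core mathematical tools",
    "Large collections of text can be processed efficiently"
)

X = CountVectorizer(stop_words="english").fit_transform(documents)

# 3 x 3 matrix: entry (i, j) is the cosine similarity of documents i and j
S = cosine_similarity(X)
print(S.round(2))
```

Documents 1 and 3 share only the token *text*, so their similarity is $1/(\sqrt{5}\cdot\sqrt{5}) = 0.2$, while document 2 shares no tokens with the other two and has similarity 0 with both.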



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# Bag-of-Words (BoW)

### Exploring the bag-of-words representation

#### Basic idea. {.unlisted .unnumbered}


One of the simplest ways to represent text numerically is to count how often terms appear in a document. This idea forms the basis of the **Bag-of-Words (BoW)** representation.

The BoW model deliberately ignores word order and grammatical structure, focusing instead on *which terms appear* and *how often they occur*. Although this abstraction discards syntactic information such as word sequence, it provides a simple and effective baseline for many text analysis tasks.
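
A small sketch of what ignoring word order means in practice: the two made-up sentences below have opposite meanings but contain the same words, so their count vectors coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order (hypothetical example sentences)
pair = ["dog bites man", "man bites dog"]

X_pair = CountVectorizer().fit_transform(pair).toarray()

# Both rows contain exactly the same counts
identical = np.array_equal(X_pair[0], X_pair[1])
print(X_pair)
print(identical)
```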

In the previous chapter on vocabulary construction, we introduced the process of identifying and standardizing the basic units of text. That step is essential for BoW representations: before counting terms, we must first decide *which terms belong to the vocabulary*.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Vector interpretation. {.unlisted .unnumbered}


Once the vocabulary is fixed, each document can be represented as a vector whose length equals the size of the vocabulary. Each position in the vector corresponds to a specific term, and the value stored in that position indicates how many times the term appears in the document.

If a term from the vocabulary does not appear in a given document, the corresponding entry in the vector is zero.




<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Sparsity and component values. {.unlisted .unnumbered}

A natural question arises at this point:

```text
What is the maximum possible value of an entry (or term count) in a Bag-of-Words vector?
```


Take a moment to think about it.

At first glance, one might expect a fixed upper bound. However, this is not the case.

In a Bag-of-Words representation, each component of the vector records the number of times a given term appears in a document. Therefore, the value of a component depends entirely on the frequency of that term within the document.

In principle, there is no fixed upper bound: a term could appear many times, especially in long documents or highly repetitive texts. As a result, some components may take relatively large values, while many others remain equal to zero.

This imbalance leads to a key property of Bag-of-Words representations: sparsity. Most entries in the vector are zero because most terms in the vocabulary do not appear in a given document.
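
The following sketch illustrates both points with a deliberately repetitive, made-up document and a hypothetical five-term vocabulary: one component grows with the amount of repetition, while most of the others stay at zero.

```python
from collections import Counter

# Hypothetical vocabulary and a highly repetitive document
vocabulary = ["analysis", "data", "model", "science", "text"]
document = " ".join(["data"] * 50 + ["science"])

counts = Counter(document.split())
bow_vector = [counts[term] for term in vocabulary]

largest = max(bow_vector)                      # grows with repetition
n_zeros = sum(1 for v in bow_vector if v == 0)  # terms that never appear

print(bow_vector, largest, n_zeros)
```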


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


### From text to Bag-of-Words: a step-by-step construction

To make the idea concrete, we now construct a Bag-of-Words representation manually, starting from a small collection of sentences.

Before examining each step in detail, Figure \@ref(fig:Fig-BoW-steps) provides an overview of the main stages involved in building a Bag-of-Words representation. These stages will be explained progressively in the following sections.




<center>
```{r Fig-BoW-steps, echo=FALSE, fig.cap = "Step-by-step construction of a Bag-of-Words representation. Source: Created by the author with ChatGPT (OpenAI)", out.width = "140%"}
# fig.width = 20 # This option does not work in the chunk

#http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/

knitr::include_graphics("BoW_steps.png")

# Another way, but the caption is not rendered:
#<center>
#![(#fig:Fig-caption) My figure](Nombre.png){width=400px}
#</center>
```
</center>

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 1: Define a small corpus.  {.unlisted .unnumbered}  

We begin by defining a small collection of sentences, which will serve as our corpus. Each sentence is treated as a separate document:



```{python}
sentences = [
    "Data science connects statistics and computation",
    "Statistical models learn patterns from data",
    "Modern data analysis relies on computational tools"
]
```

This corpus represents the raw textual input. At this stage, the data is still unstructured and cannot yet be processed mathematically.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 2: Store the corpus in a structured form.  {.unlisted .unnumbered}  


:::sangria3
#### The code.  {.unlisted .unnumbered} 


The corpus is stored as a `pandas.Series`, where each element represents one document. This structured format facilitates systematic preprocessing and later vectorization steps.


```{python, eval=FALSE}
import pandas as pd

corpus = pd.Series(sentences)
corpus
```

The code begins by importing the `pandas` library, which provides convenient data structures for handling and organizing data. The list of sentences defined in the previous step is then converted into a Series object.

A `pandas.Series` can be understood as a one-dimensional labeled array. In this context, each entry of the `Series` corresponds to a document, and the index (`0`, `1`, `2`, ...) uniquely identifies each one.
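
A brief sketch of how the labels and entries of such a `Series` can be used; the calls shown (`index`, label-based access, and the `.str` accessor) are standard pandas features.

```python
import pandas as pd

sentences = [
    "Data science connects statistics and computation",
    "Statistical models learn patterns from data",
    "Modern data analysis relies on computational tools"
]

corpus = pd.Series(sentences)

labels = corpus.index.tolist()   # default integer labels
second_doc = corpus[1]           # access a document by its label
lengths = corpus.str.len()       # vectorized string operation: characters per document

print(labels)
print(second_doc)
print(lengths.tolist())
```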



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### The output.  {.unlisted .unnumbered} 


In this case, the printed output is: 

```{python, echo=FALSE}
import pandas as pd

corpus = pd.Series(sentences)
corpus
```


This output shows:

- The index on the left (`0`, `1`, `2`), which labels each document.

- The text content of each document.

- The data type (`dtype: object`), indicating that the entries are stored as generic Python objects, which here are text strings.

This representation does not yet transform the text into numbers, but it organizes the corpus into a structured format that can be easily processed in subsequent steps.
:::


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 3: Apply basic preprocessing.  {.unlisted .unnumbered}  

:::sangria3
#### The code. {.unlisted .unnumbered} 

The preprocessing step standardizes the text by lowercasing, removing punctuation and stopwords, and reducing words to their lemmas. These operations ensure that different surface forms of a word are treated consistently.

As discussed in the previous document on vocabulary construction (see [Vocabulary and text normalization](https://rpubs.com/hllinas/R_NLP_vocabulary)), preprocessing is a crucial step that directly influences how tokens are defined and how the vocabulary is built.

The following code implements a basic preprocessing pipeline using the `nltk` library:



```{python, eval=FALSE}
# Step 3a: Required packages
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import re
import numpy as np

# Step 3b: Defining the preprocessing function
def clean_and_lemmatize(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
    
    return " ".join(tokens)

# Step 3c: Applying the preprocessing function to the corpus
processed_corpus = corpus.apply(clean_and_lemmatize)
processed_corpus
```

This code defines a preprocessing function and applies it to each document in the corpus. For clarity, each component of this pipeline is explained in detail in the following subsections.

This preprocessing stage prepares the corpus for vocabulary construction, which is the next step in building the Bag-of-Words representation.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 3a: Required packages. {.unlisted .unnumbered} 

These packages provide tools for tokenization, stopword filtering, and lemmatization, which are standard preprocessing steps in natural language processing.

```{python, echo=FALSE, warning=FALSE, message=FALSE}
import nltk
import re
import numpy as np

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Download required NLTK resources (silent)
_ =nltk.download('stopwords')
_ =nltk.download('wordnet')
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 3b: Defining the preprocessing function. {.unlisted .unnumbered}  

This function performs several preprocessing operations in sequence.

```{python, echo=FALSE, warning=FALSE, message=FALSE}
# My function
def clean_and_lemmatize(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
    
    return " ".join(tokens)
```

- The `re.sub(r"[^a-zA-Z]", " ", text)` instruction replaces every character that is not a letter (punctuation, digits, symbols) with a space.

- The `.lower()` method converts all text to lowercase to ensure consistency.

- The `.split()` operation tokenizes the text into individual words by splitting on whitespace.

- Stopwords (common words such as *the*, *and*, *is*) are removed using the `stopwords` list from `nltk`.

- The `WordNetLemmatizer` reduces each token to its base form (lemma), so that different grammatical forms are treated as the same term (e.g., `models` --> `model`). Because `lemmatize` treats every token as a noun unless a part-of-speech tag is supplied, verb forms such as `connects` and `relies` are left unchanged.

- The output of the function is a cleaned and normalized string, ready for vectorization.

This step is essential because the quality of the vocabulary and the resulting Bag-of-Words representation depend directly on how the text is preprocessed.
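
To make the sequence of operations visible, the sketch below traces a single sentence through the first stages of the pipeline. It avoids the NLTK downloads by using a tiny, hypothetical stopword set, and it omits lemmatization, which in the full function would additionally map `models` to `model` and `patterns` to `pattern`.

```python
import re

text = "Statistical models learn patterns from data!"

# Hypothetical mini stopword list; NLTK's English list is much longer
stop = {"from", "and", "on"}

letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # "!" becomes a space
tokens = letters_only.lower().split()            # lowercase, split on whitespace
kept = [tok for tok in tokens if tok not in stop]

print(letters_only)
print(tokens)
print(kept)
```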


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 3c: Applying the preprocessing function to the corpus. {.unlisted .unnumbered}  

The function `clean_and_lemmatize` is applied to each document in the corpus using the `.apply()` method from `pandas`. This method iterates over all elements of the `Series` and transforms each document individually.

```{python, echo=FALSE, warning=FALSE, message=FALSE}
processed_corpus = corpus.apply(clean_and_lemmatize)
processed_corpus
```

The output shows the preprocessed version of each document in the corpus, where stopwords have been removed and the remaining words have been lemmatized. Each row corresponds to one document, and the original document order is preserved.

```{python}
len(processed_corpus)
```


This confirms that the corpus contains three documents, each of which has been transformed into its cleaned textual representation.
:::

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 4: Build the vocabulary.  {.unlisted .unnumbered}  


:::sangria3
#### The code.  {.unlisted .unnumbered} 


This code constructs the vocabulary by extracting all unique tokens from the preprocessed corpus and sorting them alphabetically. Each token appears only once, regardless of how many times it occurs in the documents. 


```{python, eval=FALSE}
vocabulary = sorted(set(
    word for sentence in processed_corpus for word in sentence.split()
))
vocabulary
```

The expression inside the code performs three main operations:

- The comprehension below iterates over each document and extracts all tokens.

:::sangria3
```{python, eval=FALSE}
word for sentence in processed_corpus for word in sentence.split()
```
:::

- The function `set(...)` removes duplicate tokens, ensuring that each term appears only once.

- The function `sorted(...)` orders the vocabulary alphabetically, guaranteeing a consistent and reproducible structure.
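
The three operations can also be written out one at a time. The sketch below assumes that the preprocessed sentences look like the list `processed`, which is typed in by hand here rather than produced by `clean_and_lemmatize`.

```python
# Assumed output of the preprocessing step for the three sentences
processed = [
    "data science connects statistic computation",
    "statistical model learn pattern data",
    "modern data analysis relies computational tool",
]

# 1. Extract all tokens from all documents
all_tokens = [word for sentence in processed for word in sentence.split()]

# 2. Remove duplicates (for example, "data" occurs in every document)
unique_tokens = set(all_tokens)

# 3. Sort alphabetically for a reproducible order
vocabulary = sorted(unique_tokens)

print(len(all_tokens), len(unique_tokens))
print(vocabulary)
```

Sorting matters because a `set` has no guaranteed order; without it, the column order of the matrix could change from one run to the next.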


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output.  {.unlisted .unnumbered} 

The output is shown below:


```{python, echo=FALSE}
vocabulary = sorted(set(
    word for sentence in processed_corpus for word in sentence.split()
))
vocabulary
```


It is a list of 14 unique terms. 
 
```{python}
len(vocabulary)
```



Each term defines one dimension of the Bag-of-Words vector space. Therefore, every document will be represented as a vector of length 14, where each position corresponds to one vocabulary term.

In this way, the vocabulary defines the coordinate system of the vector space in which all documents will be represented.
:::

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 5: Assign indices to vocabulary terms.  {.unlisted .unnumbered}  

:::sangria3
#### The code.  {.unlisted .unnumbered} 


This code creates a dictionary that maps each vocabulary term to a unique integer index. These indices define the column positions that each token will occupy in the Bag-of-Words matrix, ensuring a consistent numerical representation across all documents.



```{python, eval=FALSE}
token_index = {token: idx for idx, token in enumerate(vocabulary)}
token_index
```

The function `enumerate(vocabulary)` pairs each term with an integer index:

```text
(token₀, 0), (token₁, 1), (token₂, 2), ...
```

The dictionary comprehension then converts these pairs into a mapping of the form:

```text
token → index
```

This mapping is essential because it determines the exact position of each term in the vector representation of every document.

In this way, the vocabulary is transformed into a coordinate system, where each dimension corresponds to a specific term.
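
A short sketch of how the mapping is used, with the vocabulary typed in by hand under the assumption that it matches the sorted list built in Step 4. The inverse dictionary `index_token` is an addition made here for illustration.

```python
# Assumed vocabulary from Step 4 (alphabetical order)
vocabulary = [
    "analysis", "computation", "computational", "connects", "data",
    "learn", "model", "modern", "pattern", "relies",
    "science", "statistic", "statistical", "tool",
]

token_index = {token: idx for idx, token in enumerate(vocabulary)}

# Inverse mapping: from column index back to token
index_token = {idx: token for token, idx in token_index.items()}

print(token_index["data"])   # column assigned to the token "data"
print(index_token[4])        # token stored in column 4
```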

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output.  {.unlisted .unnumbered} 

In this case, the output is: 

```{python, echo=FALSE}
token_index = {token: idx for idx, token in enumerate(vocabulary)}
token_index
```

Without this indexing step, it would not be possible to construct a consistent document–term matrix across multiple documents.

The next step is to initialize the document-term matrix using these indices.
:::

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 6: Initialize the Bag-of-Words matrix.  {.unlisted .unnumbered}  


:::sangria3
#### The code.  {.unlisted .unnumbered} 


The next step is to initialize the document–term matrix, which will store the frequency of each vocabulary term in each document.



```{python, eval=FALSE}
bow_matrix = np.zeros((len(processed_corpus), len(vocabulary)))
bow_matrix
```

The function `np.zeros(...)` creates a matrix filled with zeros. The shape of the matrix is determined by:

- `len(processed_corpus)`: the number of documents (rows).

- `len(vocabulary)`: the number of unique terms (columns).
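
A minimal sketch of how `np.zeros` behaves, assuming 3 documents and 14 terms as in the running example. The integer variant is an optional alternative shown for comparison, not the choice made in this document.

```python
import numpy as np

n_documents, n_terms = 3, 14   # sizes taken from the running example

# The shape is passed as a single tuple: (rows, columns)
bow_float = np.zeros((n_documents, n_terms))            # default dtype: float64
bow_int = np.zeros((n_documents, n_terms), dtype=int)   # integer counts

print(bow_float.shape, bow_float.dtype)
print(bow_int.dtype)
```

The default floating-point type explains why the zeros are printed as `0.` rather than `0`.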



Thus, the resulting matrix has shape:

```text
number of documents × vocabulary size
```

In this example, that is $3 \times 14$. This abstract description can be made more explicit by inspecting the matrix directly in Python.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output (with python).  {.unlisted .unnumbered} 



The size of the matrix can be verified as follows:

```{python}
(len(processed_corpus), len(vocabulary))
```

The matrix itself is given by:


```{python, echo=FALSE}
bow_matrix = np.zeros((len(processed_corpus), len(vocabulary)))
bow_matrix
```

While this output shows the numerical structure, it is also useful to interpret the matrix from a mathematical perspective.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output (matrix representation).  {.unlisted .unnumbered} 

To better understand the structure of the matrix, it is useful to make explicit the correspondence between tokens and vocabulary terms:

$$
\text{Token map:}\qquad
\begin{array}{llllllll}
\texttt{ana}=\texttt{analysis}, &
\texttt{cmp}=\texttt{computation}, &
\texttt{cmpl}=\texttt{computational}, \\
\texttt{cnt}=\texttt{connects}, &
\texttt{dat}=\texttt{data}, &
\texttt{lear}=\texttt{learn}, \\
\texttt{mod}=\texttt{model}, &
\texttt{mdrn}=\texttt{modern}, &
\texttt{pat}=\texttt{pattern}, \\
\texttt{rel}=\texttt{relies}, &
\texttt{sci}=\texttt{science}, &
\texttt{st}=\texttt{statistic}, \\
\texttt{stl}=\texttt{statistical}, &
\texttt{tool}=\texttt{tool}.
\end{array}
$$

Given this mapping, the matrix can be interpreted as the initial state:

$$
\mathbf{B}^{(0)} =\left(
\begin{array}{c|cccccccccccccccc}
\texttt{Text} & \texttt{ana} & \texttt{cmp} & \texttt{cmpl} & \texttt{cnt} & \texttt{dat} & \texttt{lear} & \texttt{mod} & \texttt{mdrn} & \texttt{pat} & \texttt{rel} & \texttt{sci} & \texttt{st} & \texttt{stl} & \texttt{tool} \\
\hline
\text{#1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\text{#2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\text{#3} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \right)
$$

This representation makes explicit the relationship between vocabulary terms (columns) and documents (rows).

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output (interpretation).  {.unlisted .unnumbered} 

At this stage, all entries are zero because no word counts have been recorded yet. The matrix only defines the structure of the representation.

- Each row corresponds to a document, and each column corresponds to a vocabulary term (as defined in Step 5). 

- The value at position $(i, j)$ will later store how many times term $j$ appears in document $i$.

This matrix defines the vector space in which documents will be represented, but it does not yet contain any information about word frequencies.
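The following sketch, based on hypothetical sizes (`n_documents` and `n_terms` are illustrative names, not variables from this chapter), shows how the shape and data type of such a matrix can be checked. It also shows that passing `dtype=int` is one possible way to store counts as integers; the chapter itself keeps the default floating-point type.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_documents, n_terms = 3, 4

# Default behavior: floating-point zeros.
float_matrix = np.zeros((n_documents, n_terms))

# Optional alternative: integer zeros, so counts display as 0, 1, 2, ...
int_matrix = np.zeros((n_documents, n_terms), dtype=int)

print(float_matrix.shape, float_matrix.dtype)
print(int_matrix.shape, int_matrix.dtype)
```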

We are now ready to populate the matrix with actual word counts.
:::

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 7: Populate the matrix with word counts.  {.unlisted .unnumbered} 


:::sangria3
This code fills the Bag-of-Words matrix by counting word occurrences. For each document (`i`), every token in the preprocessed sentence is located in the vocabulary using `token_index`, and the corresponding matrix entry is increased by one.


```{python}
for i, sentence in enumerate(processed_corpus):
    for token in sentence.split():
        bow_matrix[i, token_index[token]] += 1
```

The code operates as follows:

- The loop `enumerate(processed_corpus)` iterates over each document, where `i` is the document index and `sentence` is the corresponding text.

- Each document is split into tokens using `.split()`.

- For each token, the dictionary `token_index` provides the column index associated with that term.

- The value in the matrix at position `(i, j)` is incremented by 1, where:

  - `i` = document index
  
  - `j` = token index

In this way, the matrix is gradually populated with term frequencies.

After this step, each row of `bow_matrix` represents a document as a vector of word counts. Nonzero values indicate that a term appears in the document, while zeros indicate absence.

This step transforms the empty matrix into a numerical representation of the corpus, where each document is encoded as a vector in the vocabulary space.
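A simple sanity check for this step is that each row must sum to the number of tokens in the corresponding document. The sketch below illustrates the idea on a hypothetical two-document corpus (`toy_corpus`), using `collections.Counter` as an alternative way of counting; it is not the implementation used in this chapter.

```python
from collections import Counter

# Hypothetical preprocessed corpus (not the one used in this chapter).
toy_corpus = ["data science data", "science model"]

# Sorted vocabulary and token -> column index mapping.
toy_vocabulary = sorted({tok for doc in toy_corpus for tok in doc.split()})
toy_index = {tok: j for j, tok in enumerate(toy_vocabulary)}

# Populate a list-of-lists count matrix, one Counter per document.
toy_bow = [[0] * len(toy_vocabulary) for _ in toy_corpus]
for i, doc in enumerate(toy_corpus):
    for tok, count in Counter(doc.split()).items():
        toy_bow[i][toy_index[tok]] = count

# Each row sum must equal the number of tokens in that document.
for row, doc in zip(toy_bow, toy_corpus):
    assert sum(row) == len(doc.split())

print(toy_vocabulary)
print(toy_bow)
```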
:::

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 8: Inspect the final representation.  {.unlisted .unnumbered}  

:::sangria3
#### The output (with python).  {.unlisted .unnumbered} 

The resulting Bag-of-Words matrix is shown below:

```{python}
bow_matrix
```


This output provides the numerical representation of the corpus, where each row corresponds to a document and each column corresponds to a vocabulary term.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output (matrix representation).  {.unlisted .unnumbered} 

To better understand the structure of the final representation, we can express the matrix in mathematical form:


$$
\text{Token map:}\qquad
\begin{array}{llllllll}
\texttt{ana}=\texttt{analysis}, &
\texttt{cmp}=\texttt{computation}, &
\texttt{cmpl}=\texttt{computational}, \\
\texttt{cnt}=\texttt{connects}, &
\texttt{dat}=\texttt{data}, &
\texttt{lear}=\texttt{learn}, \\
\texttt{mod}=\texttt{model}, &
\texttt{mdrn}=\texttt{modern}, &
\texttt{pat}=\texttt{pattern}, \\
\texttt{rel}=\texttt{relies}, &
\texttt{sci}=\texttt{science}, &
\texttt{st}=\texttt{statistic}, \\
\texttt{stl}=\texttt{statistical}, &
\texttt{tool}=\texttt{tool}.
\end{array}
$$

$$
\mathbf{B} =\left(
\begin{array}{c|cccccccccccccccc}
\texttt{Text} & \texttt{ana} & \texttt{cmp} & \texttt{cmpl} & \texttt{cnt} & \texttt{dat} & \texttt{lear} & \texttt{mod} & \texttt{mdrn} & \texttt{pat} & \texttt{rel} & \texttt{sci} & \texttt{st} & \texttt{stl} & \texttt{tool} \\
\hline
\text{#1} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
\text{#2} & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
\text{#3} & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{array} \right)
$$
This matrix corresponds to the fully populated Bag-of-Words representation, where each entry reflects the frequency of a term in a given document.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### The output (interpretation).  {.unlisted .unnumbered} 

In this matrix: 

- Each row represents a document.

- Each column represents a vocabulary term.

- The entries indicate term frequencies, with many zeros reflecting the sparse nature of Bag-of-Words representations.

For example:

- The entry in row #2 and column `dat` is 1, indicating that the word `data` appears once in the second document.

- More generally, each value $(i, j)$ captures the frequency of term $j$ in document $i$.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### Final remark.  {.unlisted .unnumbered} 

Most entries remain zero, illustrating the sparsity typical of Bag-of-Words representations. This sparsity arises because each document contains only a small subset of the full vocabulary.


This final matrix provides a complete numerical representation of the corpus, enabling the application of linear algebra operations and machine learning algorithms to textual data.

While this representation is simple and effective, it treats each word independently and ignores local context. We now explore extensions that capture richer structures in text.
:::

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Beyond unigrams

So far, we have considered only *unigrams*, meaning individual words. The same idea can be extended to:

- Bigrams (pairs of consecutive words),

- Trigrams, and

- Higher-order n-grams.

Including n-grams allows the model to capture *local contextual information*, meaning that it can recognize short sequences of words rather than treating each word independently.

For example, the bigram `data science` carries a more specific meaning than the individual words `data` and `science` considered separately.

However, this comes at a cost: every distinct n-gram becomes its own column, so the vocabulary grows quickly as higher orders are included, leading to a higher-dimensional and sparser representation.
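To make this growth concrete, the sketch below extracts n-grams by hand from a single example sentence and tracks the cumulative number of distinct features. The helper `ngrams` is a hypothetical function written for this illustration, not part of any library.

```python
def ngrams(tokens, n):
    """Return the n-grams (joined by spaces) found in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example sentence, already lowercased and split into tokens.
tokens = "data science relies on numerical methods".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))

# Cumulative number of distinct features as higher orders are added.
vocab_sizes = {}
features = set()
for n in (1, 2, 3):
    features.update(ngrams(tokens, n))
    vocab_sizes[n] = len(features)

print(vocab_sizes)
```

For this six-word sentence, the feature set grows from 6 unigrams to 11 features with bigrams and 15 with trigrams.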


This raises a natural question:

```text
Do we really need to implement all of this manually?
Fortunately, no.  
```


In practice, modern NLP libraries provide efficient and well-tested implementations of Bag-of-Words models.

In the next section, we introduce one such tool that automates this entire process.




<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# Implementing Bag-of-Words with `CountVectorizer`

### Understanding the BoW procedure with `CountVectorizer`

Manually building a Bag-of-Words (BoW) matrix helps develop intuition, but it is rarely necessary in practice. As discussed in the previous section, extending representations (e.g., to n-grams) quickly increases complexity.

In practice, Python provides efficient tools that automate this entire process. One of the most widely used is `CountVectorizer` from the `scikit-learn` library.

`CountVectorizer` transforms a collection of text documents into a *document-term matrix*, where:

- Each row represents a document.

- Each column corresponds to a token in the learned vocabulary.

- Each cell contains the frequency of that token in the document.  


This procedure mirrors the manual construction developed earlier, but in a fully automated and optimized way.

Figure \@ref(fig:Fig-CountVectorizer1) illustrates the transformation pipeline implemented by `CountVectorizer`, from raw text to the document-term matrix.

<center>
```{r Fig-CountVectorizer1, echo=FALSE, fig.cap = "Bag-of-Words representation using `CountVectorizer`. Source: Created by the author with ChatGPT (OpenAI)", out.width = "90%"}
# fig.width = 20 # No funciona esta opcion en el chunk

#http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/

knitr::include_graphics("CountVectorizer1.png")

#Otra manera, pero no sale el caption:
#<center>
#![(#fig:Fig-caption) Mi figura](Nombre.png){width=400px}
#</center>
```
</center>

We now illustrate this process with a simple, concrete example.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### BoW with `CountVectorizer`: example

#### The code. {.unlisted .unnumbered}


First, a `CountVectorizer` object is initialized using its default settings. Then, the method `fit_transform()` is applied to the corpus. This method simultaneously learns the vocabulary from the input documents and constructs the corresponding Bag-of-Words matrix.

Before implementing the model, note that the Bag-of-Words representation ignores word order and focuses only on term frequencies.


```{python, eval=FALSE}
from sklearn.feature_extraction.text import CountVectorizer  # 1

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling"
]

vectorizer = CountVectorizer()                    # 2
vectorizer

bow_matrix = vectorizer.fit_transform(documents) # 3
bow_matrix

# Inspect the learned vocabulary and document-term matrix
print(vectorizer.get_feature_names_out()) # 4 --> first output
print(bow_matrix.toarray())               # 5a --> second output
print(bow_matrix.todense())               # 5b --> second output
```

**Explanation.**

1. The `CountVectorizer` class is imported from `scikit-learn`. It is used to convert a collection of text documents into a numerical representation based on word counts.

2. A vectorizer object is created with default parameters. By default, it:
    
   - converts all text to lowercase,
   
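   - tokenizes the text automatically (with the default token pattern, tokens are sequences of two or more alphanumeric characters, so punctuation and single-character words are dropped),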
   
   - builds the vocabulary from the corpus,
   
   - does not remove stopwords.

3. The method `fit_transform()` performs two operations in a single step:
    
   - *fit*: learns the vocabulary from the documents,
    
   - *transform*: converts each document into a vector of term frequencies.

:::sangria3
The result is a sparse matrix, where most entries are zero.
:::

4. The method `get_feature_names_out()` returns the learned vocabulary. The order of these terms defines the column ordering of the matrix.

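5. The methods `toarray()` and `todense()` convert the sparse matrix into dense form for inspection: `toarray()` returns a NumPy array, while `todense()` returns a `numpy.matrix`. Both display the same values.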


The resulting output contains the learned vocabulary and the associated document-term matrix, which corresponds directly to the conceptual BoW construction discussed earlier.
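As a complementary sketch, the snippet below inspects the object returned by `fit_transform()` before any dense conversion. It relies on the fitted attribute `vocabulary_`, which plays the same role as the `token_index` dictionary in the manual construction; the corpus is repeated so that the block is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling",
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# fit_transform returns a SciPy sparse matrix, not a NumPy array.
print(type(bow_matrix))
print(bow_matrix.shape)  # (number of documents, vocabulary size)
print(bow_matrix.nnz)    # number of stored nonzero entries

# vocabulary_ maps each term to its column index.
print(vectorizer.vocabulary_["data"])
```

Only the nonzero entries are stored, which is what keeps the representation manageable when the vocabulary becomes large.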


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output (code 4): learned vocabulary. {.unlisted .unnumbered}

The first output displays the learned vocabulary, that is, the set of unique tokens extracted from the corpus.

Each term corresponds to a column in the document-term matrix, and the order shown determines the column positions in the representation.

By default, `CountVectorizer` sorts the terms alphabetically.

```{python,echo=FALSE}
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Data science relies on numerical methods",
    "Text analysis uses vectors and matrices",
    "Mathematical representations support data modeling"
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out()) # Output 1
```




<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output (code 5): Bag-of-Words matrix. {.unlisted .unnumbered}

This corresponds directly to the document-term matrix introduced in the manual construction.


```{python,echo=FALSE}
print(bow_matrix.toarray())               # Output 2
```

The output shows the *Bag-of-Words matrix* in dense form.

- Each row corresponds to one document.

- Each column corresponds to one of the vocabulary terms listed above. 

- Each entry indicates how many times a term appears in a document.

A value of:

- 0 means the term does not appear,

- a positive integer indicates its frequency.

This representation ignores word order and syntactic structure, retaining only term frequencies.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Matrix representation of the BoW. {.unlisted .unnumbered}

To keep the notation compact, we label each token with a short abbreviation and report the corresponding document-term matrix below. 

$$
\text{Token map:}\qquad
\begin{array}{llllllll}
\texttt{ana}=\texttt{analysis}, &
\texttt{and}=\texttt{and}, &
\texttt{dat}=\texttt{data}, &
\texttt{math}=\texttt{mathematical}, \\
\texttt{mtx}=\texttt{matrices}, &
\texttt{meth}=\texttt{methods}, &
\texttt{model}=\texttt{modeling}, &
\texttt{num}=\texttt{numerical}, \\
\texttt{on}=\texttt{on}, &
\texttt{rel}=\texttt{relies}, &
\texttt{repr}=\texttt{representations}, &
\texttt{sci}=\texttt{science}, \\
\texttt{sup}=\texttt{support}, &
\texttt{txt}=\texttt{text}, &
\texttt{use}=\texttt{uses}, &
\texttt{vec}=\texttt{vectors}.
\end{array}
$$

To make the structure more explicit, we present the matrix in mathematical form.

Each entry $(i,j)$ represents the frequency of term $j$ in document $i$.

This type of representation is typically sparse, meaning that most entries are zero, especially as the vocabulary size increases.

$$
\mathbf{B}= \left(
\begin{array}{c|cccccccccccccccc}
\texttt{Doc} & \texttt{ana} & \texttt{and} & \texttt{dat} & \texttt{math} & \texttt{mtx} & \texttt{meth} & \texttt{model} & \texttt{num} & \texttt{on} & \texttt{rel} & \texttt{repr} & \texttt{sci} & \texttt{sup} & \texttt{txt} & \texttt{use} & \texttt{vec} \\
\hline
\text{#1} & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
\text{#2} & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
\text{#3} & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0
\end{array}\right)
$$

```{r, eval=FALSE, echo=FALSE}
\[
\mathbf{B}=
\begin{array}{l|cccccccccccccccc}
 & \texttt{analysis} & \texttt{and} & \texttt{data} & \texttt{mathematical} & \texttt{matrices} & \texttt{methods} & \texttt{modeling} & \texttt{numerical} & \texttt{on} & \texttt{relies} & \texttt{representations} & \texttt{science} & \texttt{support} & \texttt{text} & \texttt{uses} & \texttt{vectors} \\
\hline
\text{Document 1} & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
\text{Document 2} & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
\text{Document 3} & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
\end{array}
\]
```



For example, the token `data` appears once in the first and third documents and not at all in the second. Similarly, `analysis` appears only in the second document, while `numerical` appears only in the first. Because each document uses only a few of the 16 vocabulary terms, most entries are zero.
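The degree of sparsity can be quantified as the proportion of zero entries. The sketch below computes it in plain Python for the matrix $\mathbf{B}$ displayed above; the variable names are illustrative.

```python
# Document-term matrix B from the example
# (rows: documents, columns: alphabetically sorted vocabulary).
B = [
    [0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
]

total = sum(len(row) for row in B)
nonzero = sum(1 for row in B for value in row if value != 0)
sparsity = 1 - nonzero / total

print(nonzero, total, round(sparsity, 3))
```

Here 17 of the 48 entries are nonzero, so roughly 65% of the matrix consists of zeros even for this very small corpus.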


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Heatmap of the BoW. {.unlisted .unnumbered}

The same matrix can be visualized as a heatmap, where darker cells indicate higher token counts.


```{python, eval=FALSE, message=FALSE, warning=FALSE}
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

terms = vectorizer.get_feature_names_out()
X = bow_matrix.toarray()
df_bow = pd.DataFrame(X, columns=terms)

plt.figure(figsize=(14,5));
ax = sns.heatmap(df_bow, cmap="Blues", cbar=True)

# --- Title and axis labels ---
ax.set_title("Bag-of-Words representation", fontsize=18, pad=10);
ax.set_xlabel("Vocabulary terms", fontsize=18);
ax.set_ylabel("Documents", fontsize=18);

# --- Tick labels ---
ax.tick_params(axis="x", labelsize=14, rotation=45)
ax.tick_params(axis="y", labelsize=14)

# --- Colorbar font size ---
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=14)

plt.tight_layout() # Prevents cropping
plt.show()
```



```{python, echo=FALSE, message=FALSE, warning=FALSE}
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

terms = vectorizer.get_feature_names_out()
X = bow_matrix.toarray() if hasattr(bow_matrix, "toarray") else bow_matrix
df_bow = pd.DataFrame(X, columns=terms)

plt.figure(figsize=(14,5));
ax = sns.heatmap(df_bow, cmap="Blues", cbar=True)

# --- Title and axis labels ---
ax.set_title("Bag-of-Words representation", fontsize=18, pad=10);
ax.set_xlabel("Vocabulary terms", fontsize=18);
ax.set_ylabel("Documents", fontsize=18);

# --- Tick labels ---
ax.tick_params(axis="x", labelsize=14, rotation=45)
ax.tick_params(axis="y", labelsize=14)

# --- Colorbar font size ---
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=14)

plt.tight_layout() # Prevents cropping
plt.show()
```


Each cell represents the frequency of a term in a document.

Since this is a small corpus, most values are either 0 or 1, so the heatmap primarily highlights the presence or absence of terms rather than strong frequency differences.

Note that common words such as *and* and *on* are included in the vocabulary because no stopword filtering was applied in this example.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Final remarks. {.unlisted .unnumbered}

This example illustrates how `CountVectorizer` automates:

- tokenization,

- vocabulary construction,

- and word counting.

The resulting Bag-of-Words representation provides a simple yet powerful way to transform text into numerical features suitable for machine learning models.

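However, this representation ignores word order and context, and it weights every term by its raw count alone. These limitations motivate refinements such as TF-IDF weighting, which rescales counts according to how informative a term is, and word embeddings, which capture semantic context.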

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# `CountVectorizer`: additional arguments

### Understanding how `CountVectorizer` can be customized

Importantly, `CountVectorizer` provides several arguments that allow the basic Bag-of-Words representation to be refined and controlled, such as vocabulary size limits and document-frequency thresholds. These arguments, illustrated in Figure \@ref(fig:Fig-CountVectorizer2), will be introduced conceptually here and implemented in detail in the following sections. 


<center>
```{r Fig-CountVectorizer2, echo=FALSE, fig.cap = "Bag-of-words - `CountVectorizer` arguments. Source: Created by the author with ChatGPT (OpenAI)", out.width = "90%"}
# fig.width = 20 # No funciona esta opcion en el chunk

#http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/

knitr::include_graphics("CountVectorizer2.png")

#Otra manera, pero no sale el caption:
#<center>
#![(#fig:Fig-caption) Mi figura](Nombre.png){width=400px}
#</center>
```
</center>


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Overview of customization options {.unlisted .unnumbered}

The behavior of `CountVectorizer` can be adjusted through several arguments, which fall into two main categories: built-in processing features and explicit vocabulary control parameters.


**1. Built-in processing features (out-of-the-box behavior):**

- Automatic vocabulary learning.

- Tokenization.

- Support for n-grams (`ngram_range`).

- Optional stopword removal (`stop_words`).  

**2. Vocabulary control parameters:**

- `max_features`: limits the vocabulary to the top N most frequent terms.

- `min_df`: removes terms that appear in too few documents.

- `max_df`: removes terms that appear in too many documents.  

These options allow the user to balance expressiveness and dimensionality, tailoring the representation to the specific task and dataset.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### `CountVectorizer`: out-of-the-box features 

Beyond basic word counts, `CountVectorizer` includes several built-in options that make it flexible and practical for real-world applications.

We now explore some of the most commonly used features.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Automatic vocabulary learning and n-gram generation. {.unlisted .unnumbered}

By default, `CountVectorizer` learns its vocabulary directly from the data. In addition, it can:

- Apply tokenization internally,

- Remove stopwords when the `stop_words` argument is set, and

- Generate n-grams without additional code.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

:::sangria3
#### Example. {.unlisted .unnumbered}

In the following example, the argument `ngram_range = (1, 3)` instructs the vectorizer to include *unigrams*, *bigrams*, and *trigrams*, that is, single words, pairs of consecutive words, and sequences of three consecutive words.

First, a `CountVectorizer` object is created with the specified n-gram range. The method `fit_transform()` then learns the vocabulary from the corpus and constructs the corresponding Bag-of-Words matrix, where each column represents an n-gram and each row represents a document.



```{python, eval =FALSE}
vectorizer_ngram = CountVectorizer(ngram_range=(1, 3))
bow_ngram = vectorizer_ngram.fit_transform(documents)

print(vectorizer_ngram.get_feature_names_out()) # Output 1
print(bow_ngram.toarray())                      # Output 2
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output. {.unlisted .unnumbered}

The first output displays the learned n-gram vocabulary. As a result, terms such as `analysis` (unigram), `text analysis` (bigram), and `text analysis uses` (trigram) coexist as distinct features in the representation. The order shown here defines the column ordering of the Bag-of-Words matrix.

```{python, echo =FALSE}
vectorizer_ngram = CountVectorizer(ngram_range=(1, 3))
bow_ngram = vectorizer_ngram.fit_transform(documents)

print(vectorizer_ngram.get_feature_names_out())
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

To facilitate later reference and discussion, the learned vocabulary is listed below as an indexed sequence. This enumeration will be used in subsequent sections to illustrate how n-gram features are filtered, selected, or weighted when applying additional arguments of `CountVectorizer`.


```{r, echo=FALSE, eval=FALSE}
1. analysis  
2. analysis uses  
3. analysis uses vectors  
4. and  
5. and matrices  
6. data  
7. data modeling  
8. data science  
9. data science relies  
10. mathematical  
11. mathematical representations  
12. mathematical representations support  
13. matrices  
14. methods  
15. modeling  
16. numerical  
17. numerical methods  
18. on  
19. on numerical  
20. on numerical methods  
21. relies  
22. relies on  
23. relies on numerical  
24. representations  
25. representations support  
26. representations support data  
27. science  
28. science relies  
29. science relies on  
30. support  
31. support data  
32. support data modeling  
33. text  
34. text analysis  
35. text analysis uses  
36. uses  
37. uses vectors  
38. uses vectors and  
39. vectors  
40. vectors and  
41. vectors and matrices
```


```{r,  results='asis'}
library(dplyr)
library(stringr)
library(knitr)
library(kableExtra)

tokens <- c(
  "analysis",
  "analysis uses",
  "analysis uses vectors",
  "and",
  "and matrices",
  "data",
  "data modeling",
  "data science",
  "data science relies",
  "mathematical",
  "mathematical representations",
  "mathematical representations support",
  "matrices",
  "methods",
  "modeling",
  "numerical",
  "numerical methods",
  "on",
  "on numerical",
  "on numerical methods",
  "relies",
  "relies on",
  "relies on numerical",
  "representations",
  "representations support",
  "representations support data",
  "science",
  "science relies",
  "science relies on",
  "support",
  "support data",
  "support data modeling",
  "text",
  "text analysis",
  "text analysis uses",
  "uses",
  "uses vectors",
  "uses vectors and",
  "vectors",
  "vectors and",
  "vectors and matrices"
)

tok_tbl <- tibble(
  ID = seq_along(tokens),
  Token = tokens
) %>%
  mutate(
    n_words = str_count(Token, "\\S+") # counts whitespace-separated "words"
  ) %>%
  mutate(
    Unigram = ifelse(n_words == 1, "✓", ""),
    Bigram  = ifelse(n_words == 2, "✓", ""),
    Trigram = ifelse(n_words == 3, "✓", "")
  ) %>%
  select(ID, Token, Unigram, Bigram, Trigram)

# Display as a formatted table
kable(tok_tbl, align = "clccc",
      col.names = c("Token ID", "Token (as learned)", "Unigram", "Bigram", "Trigram"),
      caption = "Indexed n-gram vocabulary and token type (based on word count).",
       format = "html",
       booktabs = TRUE) %>%
kable_styling() %>%
kable_classic_2(full_width = FALSE)

```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

This example shows that the learned vocabulary contains tokens of different lengths. For instance:

- Tokens 1, 6, 10, 16, and 33 correspond to *unigrams*.

- Tokens 2, 7, 8, 17, and 34 correspond to *bigrams*.

- Tokens 3, 12, 20, 35, and 38 correspond to *trigrams*.

These differences arise solely from the chosen `ngram_range`. The counting logic of the Bag-of-Words representation is unchanged: each n-gram is simply treated as one more feature (column).



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output. {.unlisted .unnumbered}

The second output shows the Bag-of-Words matrix constructed using the n-gram vocabulary. Each row corresponds to a document, each column corresponds to a specific n-gram, and the value in each cell indicates how many times that n-gram appears in the document.

```{python, echo =FALSE}
print(bow_ngram.toarray())
```


For readability, the Bag-of-Words matrix is presented in two blocks, corresponding to tokens 1–21 and 22–41:


```{r, echo=FALSE, eval=FALSE}
library(knitr)
library(kableExtra)

B <- matrix(
  c(
    # Doc 1 (41 valores)
    0,0,0,0,0,1,0,1,1,0, 0,0,0,1,0,1,1,1,1,1, 1,1,1,0,0,0,1,1,1,0, 0,0,0,0,0,0,0,0,0,0,0,
    # Doc 2 (41 valores)
    1,1,1,1,1,0,0,0,0,0, 0,0,1,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0, 0,0,1,1,1,1,1,1,1,1,1,
    # Doc 3 (41 valores)
    0,0,0,0,0,1,1,0,0,1, 1,1,0,0,1,0,0,0,0,0, 0,0,0,1,1,1,0,0,0,1, 1,1,0,0,0,0,0,0,0,0,0
  ),
  nrow = 3,
  byrow = TRUE
)

colnames(B) <- 1:41# paste0("T", 1:41)   # o solo 1:41 si prefieres
rownames(B) <- paste0("D", 1:3)

kable(B,
      align = "c",
      caption = "Bag-of-Words matrix (indexed n-gram features). Columns correspond to tokens 1-41.") %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()

```


```{r, eval= FALSE, results='asis'}
library(knitr)
library(kableExtra)

# --- Original matrix ---
B <- matrix(
  c(
    # Doc 1
    0,0,0,0,0,1,0,1,1,0, 0,0,0,1,0,1,1,1,1,1,
    1,1,1,0,0,0,1,1,1,0, 0,0,0,0,0,0,0,0,0,0,0,
    # Doc 2
    1,1,1,1,1,0,0,0,0,0, 0,0,1,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0, 0,0,1,1,1,1,1,1,1,1,1,
    # Doc 3
    0,0,0,0,0,1,1,0,0,1, 1,1,0,0,1,0,0,0,0,0,
    0,0,0,1,1,1,0,0,0,1, 1,1,0,0,0,0,0,0,0,0,0
  ),
  nrow = 3,
  byrow = TRUE
)

colnames(B) <- paste0("T", 1:41)   # or simply 1:41 if preferred
rownames(B) <- paste0("Doc.", 1:3)

# --- Subtable 1: tokens 1–21 ---
B_1_21 <- B[, 1:21]

kable(
  B_1_21,
  align = "c",
  caption = "(a) Bag-of-Words matrix (tokens T1 - T21)",
  format = "html",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()

# --- Subtable 2: tokens 22–41 ---
B_22_41 <- B[, 22:41]

kable(
  B_22_41,
  align = "c",
  caption = "(b) Bag-of-Words matrix (tokens T22 - T41)",
  format = "html",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()

```


```{r, echo= FALSE, results='asis'}
library(knitr)
library(kableExtra)

# --- Matriz original ---
B <- matrix(
  c(
    # Doc 1
    0,0,0,0,0,1,0,1,1,0, 0,0,0,1,0,1,1,1,1,1,
    1,1,1,0,0,0,1,1,1,0, 0,0,0,0,0,0,0,0,0,0,0,
    # Doc 2
    1,1,1,1,1,0,0,0,0,0, 0,0,1,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0, 0,0,1,1,1,1,1,1,1,1,1,
    # Doc 3
    0,0,0,0,0,1,1,0,0,1, 1,1,0,0,1,0,0,0,0,0,
    0,0,0,1,1,1,0,0,0,1, 1,1,0,0,0,0,0,0,0,0,0
  ),
  nrow = 3,
  byrow = TRUE
)

colnames(B) <- paste0("T", 1:41)   # o solo 1:41 si prefieres
rownames(B) <- paste0("Doc.", 1:3)

# --- Subtabla 1: Tokens 1–21 ---
B_1_21 <- B[, 1:21]

kable(
  B_1_21,
  align = "c",
  caption = "(a) Bag-of-Words matrix (tokens T1 - T21)",
  format = "html",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()

# --- Subtable 2: Tokens 22–41 ---
B_22_41 <- B[, 22:41]

kable(
  B_22_41,
  align = "c",
  caption = "(b) Bag-of-Words matrix (tokens T22 - T41)",
  format = "html",
  booktabs = TRUE
) %>%
  kable_styling(full_width = FALSE) %>%
  kable_classic_2()

```


A value of 1 at position (Doc. `i`, Token `Tj`) indicates that token `j` appears once in document `i`; a value of 0 indicates it does not appear. For example, 

```{r, eval=FALSE}
Token 8 → "data science" → value 1 in document 1
Token 8 → "data science" → value 0 in document 2
Token 8 → "data science" → value 0 in document 3
```


In this case, a value of 1 in the column associated with `data science` means that this bigram appears once in the corresponding document, whereas a value of 0 indicates that it does not appear.

While these built-in features define the default behavior of the vectorizer, additional arguments allow further control over the vocabulary and document-term representation.
:::


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### `CountVectorizer`: controlling the vocabulary

Several arguments allow the vocabulary to be restricted or filtered:

- `max_features`: limits vocabulary size.

- `min_df`: removes rare terms.

- `max_df`: removes overly frequent terms. 


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### `CountVectorizer`: limiting vocabulary size with `max_features`  

As the vocabulary grows, the dimensionality of document vectors increases accordingly. Very high-dimensional representations may reduce computational efficiency and harm generalization, a phenomenon commonly referred to as the *curse of dimensionality*.

To address this issue, `CountVectorizer` provides the `max_features` argument, which restricts the vocabulary to the most frequent tokens observed in the corpus. 


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Example. {.unlisted .unnumbered}

In the following example, the vocabulary is limited to the five most frequent unigrams or bigrams in the corpus.

First, a `CountVectorizer` object is created with `ngram_range = (1, 2)` to extract both unigrams and bigrams. The argument `max_features = 5` restricts the vocabulary to the five most frequent tokens, ranked by their total count across the corpus (not by document frequency). The method `fit_transform()` then learns this reduced vocabulary and constructs the corresponding Bag-of-Words matrix.



```{python, eval=FALSE}
vectorizer_limited = CountVectorizer(
    ngram_range=(1, 2),
    max_features=5
)

bow_limited = vectorizer_limited.fit_transform(documents)

print(vectorizer_limited.get_feature_names_out()) # Output 1
print(bow_limited.toarray())                      # Output 2
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output. {.unlisted .unnumbered}

The first output displays the reduced vocabulary, consisting of the five most frequent unigrams or bigrams retained after applying the `max_features` constraint. The order shown here defines the column ordering of the Bag-of-Words matrix.

```{python, echo=FALSE}
vectorizer_limited = CountVectorizer(
    ngram_range=(1, 2),
    max_features=5
)

bow_limited = vectorizer_limited.fit_transform(documents)

print(vectorizer_limited.get_feature_names_out())
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output. {.unlisted .unnumbered}

The resulting vocabulary (Output 1) defines the columns of the Bag-of-Words matrix (Output 2), in the exact order shown above.

```{python, echo=FALSE}
print(bow_limited.toarray())
```

Each row corresponds to a document and each column corresponds to one of the selected n-grams. The entries represent term frequencies. Formally, the matrix can be written as
\[
\mathbf{B} = (b_{ij}), \qquad 
b_{ij} = \text{frequency of n-gram } j \text{ in document } i,
\]
where the columns correspond to
\[
(\texttt{analysis},\ \texttt{analysis uses},\ \texttt{and},\ \texttt{and matrices},\ \texttt{data}).
\]

That is, the Bag-of-Words matrix can be written explicitly as
\[
\mathbf{B} =
\begin{array}{c|ccccc}
        & \texttt{analysis} & \texttt{analysis uses} & \texttt{and} & \texttt{and matrices} & \texttt{data} \\ \hline
\text{Document 1} & 0 & 0 & 0 & 0 & 1 \\
\text{Document 2} & 1 & 1 & 1 & 1 & 0 \\
\text{Document 3} & 0 & 0 & 0 & 0 & 1
\end{array}
\]

The value \( b_{25} = 0 \) indicates that the token `data` does not appear in the second document, while \( b_{15} = 1 \) indicates that it appears once in the first document.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### `CountVectorizer`: filtering tokens with `min_df` and `max_df` thresholds

Not all tokens contribute equally to the representation. Some appear in almost every document (low discrimination), while others appear only once (often too specific or noisy).

`CountVectorizer` supports filtering tokens using *document frequency*:

- `min_df` keeps only terms that appear in at least `min_df` documents (as a count or proportion).

- `max_df` keeps only terms that appear in at most `max_df` documents (as a count or proportion).

A useful workflow is:

1. fit a vectorizer,

2. compute df for each token,

3. inspect which tokens would survive a chosen `min_df`/`max_df` rule.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### `CountVectorizer`:  example 1 (inspecting which tokens survive a `min_df` rule)

In this example, we let the vectorizer learn the *full vocabulary* with `min_df = 1`, and then compute each token’s document frequency (`df`) manually.  Based on this information, we mark which tokens would be retained if a stricter rule such as `min_df = 2` were applied.

This approach makes the effect of `min_df` explicit and easy to interpret. 

First, a `CountVectorizer` object is created with `ngram_range = (1, 3)` to extract *unigrams, bigrams, and trigrams*.  The argument `min_df = 1` ensures that *no tokens are filtered at this stage*, allowing the full vocabulary to be inspected. The method `fit_transform()` then learns the vocabulary and constructs the corresponding Bag-of-Words matrix.


Specifically:

- `fit_transform(documents)` scans the corpus, builds the vocabulary of observed tokens, and returns the *document–term matrix* in sparse format.

- `get_feature_names_out()` extracts the list of tokens learned by the vectorizer, in the order in which they appear as columns in the matrix.

- `toarray()` converts the sparse Bag-of-Words matrix into a dense numerical array, which facilitates inspection and manual computation of document frequencies.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Interpretation of the matrix structure. {.unlisted .unnumbered}


The resulting Bag-of-Words matrix has:

- *rows* corresponding to documents, and  

- *columns* corresponding to vocabulary tokens (unigrams, bigrams, or trigrams).

Each entry represents the number of times a given token appears in a given document.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Mathematical notation. {.unlisted .unnumbered}


Let \( D \) denote the number of documents and \( V \) the size of the vocabulary.  The Bag-of-Words representation can be written as a matrix

$$\mathbf{B} = (b_{ij}) \in \mathbb{N}^{D \times V},$$

where

$$b_{ij} = \text{number of occurrences of token } j \text{ in document } i.$$

This representation provides the basis for computing document frequencies and for applying frequency-based filtering rules such as `min_df` and `max_df`.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Application. {.unlisted .unnumbered}

```{python}
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer: learn the full vocabulary (unigrams to trigrams)
vectorizer_limited = CountVectorizer(
    ngram_range=(1, 3),
    min_df=1  # important: df is computed manually below
)

bow = vectorizer_limited.fit_transform(documents)

# 1) Learned tokens
tokens = vectorizer_limited.get_feature_names_out()

# 2) Bag-of-Words matrix
B = bow.toarray()

# 3) Document frequency (df)
df = (B > 0).sum(axis=0)

# 4) Number of words per token
n_words = np.array([len(t.split()) for t in tokens])

# 5) Build the final table
table_df = pd.DataFrame({
    "Token ID": np.arange(1, len(tokens) + 1),
    "Token (as learned)": tokens,
    "Unigram": (n_words == 1).astype(int),
    "Bigram":  (n_words == 2).astype(int),
    "Trigram": (n_words == 3).astype(int),
    "df": df,
    "Kept (min_df = 2)": np.where(df >= 2, "✓", "X")
})

# 6) Replace 1/0 with ✓ / empty string (more readable)
for col in ["Unigram", "Bigram", "Trigram"]:
    table_df[col] = table_df[col].replace({1: "✓", 0: ""})

table_df
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Interpreting the output. {.unlisted .unnumbered}



The resulting table contains *41 tokens*, including *unigrams, bigrams, and trigrams* (up to length 3).  
The most relevant columns are:

- *`df`*: the number of documents in which a token appears at least once.

- *`Kept (min_df = 2)`*: indicates whether the token would be retained if we required it to appear in *at least two documents*.

From the output, only one token is retained:

- *`data`* has `df = 2`, meaning it appears in two documents and therefore satisfies the condition `min_df = 2` (✓).

All remaining tokens have:

- `df = 1`, indicating that they appear in only one document. As a result, they would be removed (X) under the `min_df = 2` rule.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

####  Why does this happen? {.unlisted .unnumbered}

This behavior is a direct consequence of the *very small corpus size* (only three documents). Most multi-word expressions (such as `text analysis uses` or `vectors and matrices`) occur in only one document.

By setting `min_df = 2`, we are effectively enforcing the rule:


```{r, eval=FALSE}
Keep only the terms that appear across multiple documents.
```



When the corpus contains only three documents, this becomes a *very strict filtering criterion*, causing nearly all tokens to be discarded.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### A more direct approach: applying `min_df` inside the vectorizer (optional). {.unlisted .unnumbered}
 

In the previous example, token filtering was illustrated by manually computing document frequencies and marking which tokens would be retained under a given `min_df` rule.  

An alternative (and more typical) approach is to apply the frequency constraint *directly inside the vectorizer*. In this case, tokens that do not satisfy the condition are never included in the learned vocabulary.


```{python}
vectorizer_df = CountVectorizer(
    ngram_range=(1, 3),
    min_df=2
)

bow_df = vectorizer_df.fit_transform(documents)
print(vectorizer_df.get_feature_names_out())
```

Because the corpus is very small, only the token `data` appears in at least two documents and is therefore retained. All other unigrams, bigrams, and trigrams are discarded automatically during vocabulary construction.

This confirms the behavior observed earlier using the manual inspection table.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### `CountVectorizer`: example 2 (joint filtering with `min_df` and `max_df`)

The previous example applied a lower bound on document frequency using `min_df`.  We now extend this idea by introducing an upper bound through the parameter `max_df`.

In this example, only *unigrams and bigrams* that:

- appear in *at least two documents*, and  

- appear in *no more than 80% of the corpus*

are retained.

First, a `CountVectorizer` object is created with `ngram_range = (1, 2)` to extract unigrams and bigrams only. The arguments `min_df = 2` and `max_df = 0.8` jointly filter tokens based on document frequency.  As before, the method `fit_transform()` learns the filtered vocabulary and constructs the corresponding Bag-of-Words matrix.



```{python, eval=FALSE}
vectorizer_df = CountVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.8
)

bow_df = vectorizer_df.fit_transform(documents)

print(vectorizer_df.get_feature_names_out()) # Output 1
print(bow_df.toarray())                      # Output 2
```



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output. {.unlisted .unnumbered}

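The first output lists the tokens retained in the filtered vocabulary, that is, those satisfying both the `min_df` and `max_df` constraints. In this case, only the token `data` satisfies both: it appears in two of the three documents, which meets `min_df = 2` and stays below the `max_df` limit of 80% of the corpus (0.8 × 3 = 2.4 documents).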

```{python, echo=FALSE}
vectorizer_df = CountVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.8
)

bow_df = vectorizer_df.fit_transform(documents)

print(vectorizer_df.get_feature_names_out())
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output. {.unlisted .unnumbered}

The second output displays the resulting Bag-of-Words matrix, whose columns correspond to the retained tokens and whose rows represent documents, with entries indicating token counts. Since only one token is retained, the matrix has a single column, and each row gives the number of times the token `data` appears in the corresponding document.

```{python, echo=FALSE}
print(bow_df.toarray())
```

Together, `min_df` and `max_df` provide a simple yet powerful mechanism to control which tokens enter the representation. They are especially useful for reducing noise and dimensionality in high-dimensional text data, while preserving terms that carry cross-document relevance.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Limitations of the Bag-of-Words representation

Despite its simplicity and interpretability, the Bag-of-Words model has important limitations. See Figure \@ref(fig:Fig-BoW-steps-limita).


<center>
```{r Fig-BoW-steps-limita, echo=FALSE, fig.cap = "Limitations of the Bag-of-Words representation. Source: Created by the author with ChatGPT (OpenAI)", out.width = "80%"}
# fig.width = 20 # This option does not work in the chunk

#http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/

knitr::include_graphics("BoW_steps_limita.png")

# Another way, but the caption is not rendered:
#<center>
#![(#fig:Fig-caption) Mi figura](Nombre.png){width=400px}
#</center>
```
</center>

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Limitations. {.unlisted .unnumbered}


- First, it relies exclusively on token counts, ignoring word order and syntactic structure. As a result, sentences with very different meanings may receive similar representations.

- Second, BoW does not capture semantic relationships. Words with related meanings are treated as entirely independent dimensions.

- Third, large vocabularies can lead to extremely high-dimensional vectors, which may degrade performance and increase computational cost.
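
The first two limitations can be checked directly. The sketch below uses two hypothetical pairs of sentences (all names and sentences are illustrative assumptions): one pair differs only in word order, the other uses synonyms with no shared tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1) Word order is ignored: opposite meanings, identical count vectors
order_docs = ["the dog bites the man", "the man bites the dog"]
order_bow = CountVectorizer().fit_transform(order_docs).toarray()
same_vector = bool((order_bow[0] == order_bow[1]).all())
print(order_bow)
print("Identical vectors:", same_vector)

# 2) Semantics is ignored: near-synonymous phrases share no dimension
synonym_docs = ["fast car", "quick automobile"]
synonym_bow = CountVectorizer().fit_transform(synonym_docs)
similarity = cosine_similarity(synonym_bow)[0, 1]
print("Cosine similarity:", similarity)
```

The two sentences in the first pair receive exactly the same vector, and the two phrases in the second pair have cosine similarity 0 even though a human reader would consider them close in meaning.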

These limitations motivate more refined representations that adjust token importance and incorporate contextual information. One such approach (TF-IDF weighting) is introduced in the next section.



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# TF-IDF vectors

In the previous section, documents were represented using raw word counts through the Bag-of-Words model. While this approach is intuitive, it treats all tokens equally and relies solely on their frequency within each document.

As a result, terms that appear very often across the corpus may dominate the representation, while less frequent but potentially informative terms receive little weight or are discarded altogether. This can lead to a loss of relevant patterns, especially when rare terms are crucial for distinguishing documents.

The Term Frequency-Inverse Document Frequency (TF-IDF) scheme addresses this limitation by re-weighting tokens according to both their local importance within a document and their global distribution across the corpus.

TF-IDF is widely used in information retrieval, search engines, and text mining applications. Like BoW, it is a frequency-based representation, but it incorporates an additional normalization mechanism that balances common and rare terms.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Term Frequency (TF)

The term frequency component measures how often a word appears in a specific document. However, since documents may vary in length, raw counts are typically normalized. A common normalized definition of term frequency is:


$$TF(w, d) = \frac{\text{Number of times the word } w \text{ occurs in document } d}{\text{Total number of words in document } d}$$

This normalization prevents longer documents from automatically assigning higher importance to all their terms.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->



To build intuition, the next figure shows **normalized TF** for a few example tokens inside a single document (so longer documents do not automatically inflate importance).

```{python, eval=FALSE, message=FALSE, warning=FALSE}
import numpy as np
import matplotlib.pyplot as plt

# --- Simulated document term counts (Document d1) ---
terms = ["data", "analysis", "model", "the", "and", "python"]
counts_d1 = np.array([6, 3, 2, 10, 8, 1])  # raw counts in document d1
tf_d1 = counts_d1 / counts_d1.sum()        # normalized TF

plt.figure()
plt.bar(terms, tf_d1)
plt.title("Term Frequency (TF) in a single document")
plt.xlabel("Token")
plt.ylabel("TF (normalized frequency)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```

```{r}
library(ggplot2)

tf_df <- data.frame(
  token = c("data", "analysis", "model", "the", "and", "python"),
  count = c(6, 3, 2, 10, 8, 1)
)

tf_df$TF <- tf_df$count / sum(tf_df$count)

ggplot(tf_df, aes(x = token, y = TF)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Term Frequency (TF) in a single document",
    x = "Token",
    y = "TF (normalized frequency)"
  ) +
  theme_minimal()
```


TF measures local importance: tokens that occur more often within the document receive larger TF values, but normalization keeps TF comparable across documents of different lengths.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


### Inverse Document Frequency (IDF)

While TF captures local relevance, it does not account for how informative a word is across the entire corpus. Words that appear in almost every document (such as general or domain-wide terms) may not be useful for discrimination.

The inverse document frequency component down-weights such ubiquitous terms and amplifies words that occur in fewer documents:

$$ IDF(w) = \log\left(\frac{N}{df(w)}\right) $$

where:

- $N$ is the total number of documents, and

- $df(w)$ is the number of documents containing word $w$.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


The next figure shows how IDF decreases as a token appears in more documents.


```{python, eval=FALSE}
import numpy as np
import matplotlib.pyplot as plt

# --- Simulated corpus size and document frequencies ---
N = 10  # total documents
df = np.arange(1, N+1)                 # df(w) = 1..N
idf = np.log(N / df)                   # classic IDF definition: log(N / df(w))

plt.figure()
plt.plot(df, idf, marker="o")
plt.title("Inverse Document Frequency (IDF) vs. document frequency")
plt.xlabel("Document frequency  df(w)")
plt.ylabel("IDF(w) = log(N / df(w))")
plt.xticks(df)
plt.tight_layout()
plt.show()
```


```{r}
idf_df <- data.frame(df = 1:10)

N <- 10
idf_df$IDF <- log(N / idf_df$df)

ggplot(idf_df, aes(x = df, y = IDF)) +
  geom_line(color = "steelblue", size=1) +
  geom_point(color = "steelblue", size=2.5) +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title = "Inverse Document Frequency (IDF)",
    x = "Document frequency df(w)",
    y = "IDF(w) = log(N / df(w))"
  ) +
  theme_minimal()
```



Tokens that occur in many documents (high df) have low IDF, because they help less to distinguish documents. Tokens that occur in few documents have higher IDF.
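
Under the classic definition, a token that appears in every document receives an IDF of exactly zero and therefore vanishes from the representation. scikit-learn's `TfidfVectorizer` uses, by default (`smooth_idf=True`), a smoothed variant documented as \( \ln\big((1+N)/(1+df(w))\big) + 1 \), which keeps such tokens with a small positive weight. The sketch below compares the two formulas numerically; the variable names are illustrative assumptions.

```python
import numpy as np

n_docs = 10
doc_freq = np.arange(1, n_docs + 1)    # df(w) = 1, ..., N

# Classic definition used in this section
idf_classic = np.log(n_docs / doc_freq)

# Smoothed variant (scikit-learn default, smooth_idf=True)
idf_smooth = np.log((1 + n_docs) / (1 + doc_freq)) + 1

for d, c, s in zip(doc_freq, idf_classic, idf_smooth):
    print(f"df = {d:2d}   classic = {c:.3f}   smoothed = {s:.3f}")
```

Both versions decrease as the document frequency grows; they differ mainly at the extreme `df = N`, where the classic value is 0 and the smoothed value is 1.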


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### TF-IDF weighting

The final TF-IDF weight of a word \( w \) in document \( d \) is obtained by combining two components:

\[
\text{weight}(w,d) = TF(w,d) \times IDF(w)
\]

This formulation assigns higher weights to terms that are frequent within a document but relatively rare across the corpus.

Even when a term appears exactly once in every document, its TF-IDF weight is not necessarily identical across documents.  
Under the normalized definition of *term frequency (TF)* given above, the count is divided by the total number of tokens in the document. Consequently, documents of different lengths assign different relative importance to the same term.

In addition, scikit-learn's `TfidfVectorizer` *normalizes TF-IDF vectors by default* using the \( L_2 \) norm (it uses raw counts as TF and relies on this final rescaling to account for document length). This means that each document vector is rescaled to have unit length, further modifying the final weights. As a result, a term can have the same raw count in two documents and still receive different TF-IDF weights, because the rescaling depends on the other terms each document contains.
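
A minimal sketch of this effect, assuming a hypothetical two-document corpus and the default settings of `TfidfVectorizer`: the token `data` occurs exactly once in each document, yet its final weight differs because each vector is rescaled to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "data" occurs once in each document
toy_docs = ["data", "data science models"]

vec = TfidfVectorizer()                  # default: L2 normalization
weights = vec.fit_transform(toy_docs).toarray()

col = vec.vocabulary_["data"]            # column index of the token "data"
w_short = weights[0, col]                # weight in the one-word document
w_long = weights[1, col]                 # weight in the three-word document

print("Weight of 'data' in short document:", w_short)
print("Weight of 'data' in long document: ", w_long)
```

In the one-word document the whole unit length is carried by `data`, so its weight is 1. In the longer document the same count shares the unit length with two other terms, so the weight is smaller.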

The next plot illustrates the combined effect: TF–IDF becomes large when a token is frequent in a document (high TF) and rare in the corpus (high IDF).

```{python, eval=FALSE}
import numpy as np
import matplotlib.pyplot as plt

# --- Simulated TF (from a document) and IDF (from the corpus) for several tokens ---
tokens = ["data", "analysis", "model", "the", "and"]
tf = np.array([0.18, 0.12, 0.08, 0.30, 0.20])  # local frequencies (normalized)
idf = np.array([1.0, 1.4, 1.8, 0.1, 0.2])      # global rarity (higher = rarer)
tfidf = tf * idf

plt.figure()
plt.bar(tokens, tfidf)
plt.title("TF-IDF weights in a document (simulated)")
plt.xlabel("Token")
plt.ylabel("TF-IDF = TF × IDF")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()
```

```{r}
tfidf_df <- data.frame(
  token = c("data", "analysis", "model", "the", "and"),
  TF = c(0.18, 0.12, 0.08, 0.30, 0.20),
  IDF = c(1.0, 1.4, 1.8, 0.1, 0.2)
)

tfidf_df$TFIDF <- tfidf_df$TF * tfidf_df$IDF

ggplot(tfidf_df, aes(x = token, y = TFIDF)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "TF-IDF weights in a document (simulated)",
    x = "Token",
    y = "TF-IDF = TF × IDF"
  ) +
  theme_minimal()
```



A token can have a high TF but still receive a small TF–IDF weight if its IDF is low (e.g., very common words). TF–IDF emphasizes tokens that are both locally frequent and globally informative.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### How TF, IDF, and TF-IDF relate (combined view)

Finally, the figure below visualizes TF and IDF jointly; TF-IDF is shown by the point size (larger = higher TF-IDF).

```{python, eval=FALSE}
import numpy as np
import matplotlib.pyplot as plt

tokens = np.array(["data", "analysis", "model", "the", "and", "python", "science"])
tf = np.array([0.18, 0.12, 0.08, 0.30, 0.20, 0.05, 0.07])
idf = np.array([1.0, 1.4, 1.8, 0.1, 0.2, 2.0, 1.6])
tfidf = tf * idf

plt.figure()
plt.scatter(tf, idf, s=2500*tfidf)  # point size proportional to TF–IDF
for x, y, t in zip(tf, idf, tokens):
    plt.text(x, y, f"  {t}", va="center")

plt.title("TF–IDF as an interaction of TF and IDF (size = TF–IDF)")
plt.xlabel("TF (within-document frequency)")
plt.ylabel("IDF (corpus rarity)")
plt.tight_layout()
plt.show()
```


```{r}
rel_df <- data.frame(
  token = c("data", "analysis", "model", "the", "and", "python", "science"),
  TF = c(0.18, 0.12, 0.08, 0.30, 0.20, 0.05, 0.07),
  IDF = c(1.0, 1.4, 1.8, 0.1, 0.2, 2.0, 1.6)
)

rel_df$TFIDF <- rel_df$TF * rel_df$IDF

ggplot(rel_df, aes(x = TF, y = IDF, size = TFIDF)) +
  geom_point(color = "steelblue", alpha = 0.7) +
  geom_text(aes(label = token), hjust = -0.1, vjust = 0.5) +
  labs(
    title = "TF–IDF as an interaction of TF and IDF",
    x = "TF (within-document frequency)",
    y = "IDF (corpus rarity)",
    size = "TF–IDF"
  ) +
  theme_minimal()
```


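The largest points appear where the product of TF and IDF is largest. In the simulated data, `data` (TF = 0.18, IDF = 1.0) receives the largest weight (0.18), whereas `the` has the highest TF (0.30) but a weight of only 0.03 because its IDF is 0.1. This makes TF–IDF easy to interpret as an interaction: a token is most important when it is frequent in the document but uncommon in the corpus.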

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Building a basic TF–IDF vectorizer

In practice, TF–IDF representations are computed efficiently using the `TfidfVectorizer` class from `scikit-learn`, which combines term counting, inverse document frequency weighting, and vector normalization in a single step.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Example. {.unlisted .unnumbered}

To keep the example simple and self-contained, consider the following small collection of documents.

First, a `TfidfVectorizer` object is created using the default settings, which include \(L_2\) normalization. The method `fit_transform()` learns the vocabulary from the corpus and computes the TF–IDF matrix simultaneously, producing a numerical representation of the documents.



```{python}
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Statistical models rely on numerical features",
    "Text representations are built using vectors",
    "Feature weighting improves document comparison"
]

vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(documents)
```

The learned vocabulary and the resulting TF–IDF matrix can be inspected as follows:


```{python, eval=FALSE}
print(vectorizer.get_feature_names_out())    # Output 1
print(tf_idf_matrix.toarray())               # Output 2
print("Matrix shape:", tf_idf_matrix.shape)  # Output 3
```

The three outputs correspond, respectively, to the learned vocabulary, the TF–IDF matrix expressed in dense form for inspection, and the dimensions of the resulting representation.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output. {.unlisted .unnumbered}


The first output displays the *learned vocabulary*, that is, the set of unique terms extracted from the corpus after preprocessing. Each element in this array corresponds to a column of the TF–IDF matrix, and the order shown here defines the column ordering used in the matrix representation.

```{python, echo=FALSE}
print(vectorizer.get_feature_names_out())
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output. {.unlisted .unnumbered}

The second output shows the TF–IDF matrix itself, expressed in dense form for inspection. Each row corresponds to a document, each column corresponds to a term in the learned vocabulary, and each entry represents the TF–IDF weight assigned to that term in the corresponding document.

```{python, echo=FALSE}
print(tf_idf_matrix.toarray())
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Third output. {.unlisted .unnumbered}

The final output reports the dimensions of the matrix. In this example, the matrix has three rows (one per document) and seventeen columns (one per vocabulary term), confirming the correspondence between the corpus size and the learned vocabulary.


```{python, echo=FALSE}
print("Matrix shape:", tf_idf_matrix.shape)
```

The vocabulary is the same one that `CountVectorizer` would learn with the same settings, but the entries now represent TF-IDF weights rather than raw frequencies.
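
The entries of the matrix can also be reproduced by hand. The sketch below is one way to do so under the default settings of `TfidfVectorizer`, which, according to the scikit-learn documentation, use raw counts as TF, the smoothed IDF \( \ln\big((1+N)/(1+df(w))\big) + 1 \), and \(L_2\) normalization. The corpus is copied into a separate variable (`docs_check`, an illustrative name) so that the example is self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Copy of the three example documents, kept under a separate name
docs_check = [
    "Statistical models rely on numerical features",
    "Text representations are built using vectors",
    "Feature weighting improves document comparison",
]

# Step 1: raw counts (TF) and document frequencies
counts = CountVectorizer().fit_transform(docs_check).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Step 2: smoothed IDF (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: TF x IDF, then rescale each row to unit L2 norm
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Compare with the library result
reference = TfidfVectorizer().fit_transform(docs_check).toarray()
matches = np.allclose(manual, reference)
print("Manual computation matches TfidfVectorizer:", matches)
```

The manual computation agrees with the library output, which makes explicit that the weights reported by `TfidfVectorizer` follow scikit-learn's smoothed formula rather than the textbook definition \( \log(N/df(w)) \).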

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Normalization of TF–IDF vectors

Normalization ensures that document vectors are comparable in magnitude, which is particularly important for similarity measures.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### \(L_2\) norm. {.unlisted .unnumbered}



By default, each TF-IDF document vector \(\mathbf{x} = (x_1, x_2, \dots, x_d)\) is normalized to have unit length using the \(L_2\) norm, defined as
$$\|\mathbf{x}\|_2  \; =\;   \sqrt{\sum_{j=1}^{d} x_j^2}$$

Under this normalization, the vector is rescaled so that \(\|\mathbf{x}\|_2 = 1\), emphasizing relative term contributions rather than document length.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### \(L_1\) norm. {.unlisted .unnumbered}



Alternatively, the \(L_1\) norm can be used, which is defined as

$$\|\mathbf{x}\|_1 \; =\;  \sum_{j=1}^{d} |x_j|$$

In this case, the vector is rescaled so that \(\|\mathbf{x}\|_1 = 1\), allowing the TF-IDF weights to be interpreted as relative proportions within each document.
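
As a small worked example of both norms, consider the hypothetical vector \(\mathbf{x} = (3, 4)\), chosen only because its norms are easy to compute by hand: \(\|\mathbf{x}\|_2 = 5\) and \(\|\mathbf{x}\|_1 = 7\).

```python
import numpy as np

x = np.array([3.0, 4.0])           # hypothetical weight vector

l2_norm = np.sqrt(np.sum(x ** 2))  # sqrt(9 + 16) = 5
l1_norm = np.sum(np.abs(x))        # 3 + 4 = 7

x_l2 = x / l2_norm                 # (0.6, 0.8): unit Euclidean length
x_l1 = x / l1_norm                 # (3/7, 4/7): entries sum to one

print("L2-normalized:", x_l2, "length:", np.linalg.norm(x_l2))
print("L1-normalized:", x_l1, "sum:", x_l1.sum())
```

Both rescalings preserve the ratio between the two entries (3 to 4); they differ only in which quantity is fixed to one.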



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Example. {.unlisted .unnumbered}

The following example illustrates TF-IDF computation using \(L_1\) normalization. 

First, a `TfidfVectorizer` object is created with the argument `norm="l1"`, which specifies that each document vector will be normalized so that the sum of the absolute TF–IDF weights equals one. The method `fit_transform()` then learns the vocabulary from the corpus and computes the corresponding TF–IDF matrix in a single step.

The three outputs display, respectively, the learned vocabulary, the TF–IDF matrix with $L_1$ normalization applied, and the dimensions of the resulting representation.

```{python, eval=FALSE}
vectorizer_l1 = TfidfVectorizer(norm="l1")
tfidf_l1 = vectorizer_l1.fit_transform(documents)

print(vectorizer_l1.get_feature_names_out()) # Output 1
print(tfidf_l1.toarray())                    # Output 2
print("Matrix shape:", tfidf_l1.shape)       # Output 3
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output. {.unlisted .unnumbered}


The first output displays the *learned vocabulary*. As before, each term corresponds to a column of the TF–IDF matrix, and the order shown here defines the column ordering used in the matrix representation. The vocabulary itself is unchanged by the choice of normalization.

```{python, echo=FALSE}
vectorizer_l1 = TfidfVectorizer(norm="l1")
tfidf_l1 = vectorizer_l1.fit_transform(documents)

print(vectorizer_l1.get_feature_names_out())
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output. {.unlisted .unnumbered}

The second output shows the TF–IDF matrix with $L_1$ normalization applied. Each row corresponds to a document and each column to a term in the vocabulary. Under $L_1$ normalization, the values in each row sum to one, so the entries can be interpreted as relative weights of terms within the document. For example, in the first document, the nonzero entries are all equal, indicating that the retained terms contribute equally to the total TF–IDF weight of that document.
 
```{python, echo=FALSE}
print(tfidf_l1.toarray())
```
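The row-sum property can be checked numerically. The sketch below uses a small hypothetical corpus (`toy_docs`) rather than the chapter's `documents`, so the printed values are not those of the matrix shown above; it only illustrates that every nonempty row sums to one under `norm="l1"`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for this check
toy_docs = [
    "vectors represent documents",
    "documents contain many terms",
    "terms receive numerical weights",
]

tfidf_l1_toy = TfidfVectorizer(norm="l1").fit_transform(toy_docs)

# TF-IDF weights are non-negative, so the row sum equals the L1 norm
row_sums = np.asarray(tfidf_l1_toy.sum(axis=1)).ravel()
print(row_sums)
```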


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Third output. {.unlisted .unnumbered}

The final output reports the dimensions of the matrix. In this example, the matrix has three rows (one per document) and seventeen columns (one per vocabulary term), confirming that normalization affects the scale of the weights, but not the structure of the representation.


```{python, echo=FALSE}
print("Matrix shape:", tfidf_l1.shape)
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### N-grams and vocabulary size in TF–IDF

As with Bag-of-Words representations, the TF–IDF vectorizer supports the use of n-grams as well as constraints on vocabulary size. This allows short phrases to be incorporated into the representation while keeping dimensionality under control.



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Example. {.unlisted .unnumbered}

In the following example, the representation is restricted to the six most frequent features among unigrams, bigrams, and trigrams. The argument `ngram_range = (1, 3)` enables the extraction of n-grams up to length three, while `max_features = 6` limits the vocabulary size. The default \(L_2\) normalization is applied.


```{python, eval= FALSE}
vectorizer_ngram = TfidfVectorizer(
    ngram_range=(1, 3),
    max_features=6,
    norm="l2"
)

tfidf_ngram = vectorizer_ngram.fit_transform(documents)

print(vectorizer_ngram.get_feature_names_out())
print(tfidf_ngram.toarray())
print("Matrix shape:", tfidf_ngram.shape)
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### First output. {.unlisted .unnumbered}

The first output displays the *learned n-gram vocabulary*, restricted to six features. In this case, all retained features correspond to unigrams, bigrams, and trigrams derived from the phrase `are built using`. Each element in this list defines a column of the TF-IDF matrix, and the order shown here determines the column ordering.


```{python, echo= FALSE}
vectorizer_ngram = TfidfVectorizer(
    ngram_range=(1, 3),
    max_features=6,
    norm="l2"
)

tfidf_ngram = vectorizer_ngram.fit_transform(documents)

print(vectorizer_ngram.get_feature_names_out())
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Second output. {.unlisted .unnumbered}

The second output shows the TF–IDF matrix constructed using the restricted n-gram vocabulary. Each row corresponds to a document and each column corresponds to one of the selected n-grams.


```{python, echo= FALSE}
print(tfidf_ngram.toarray())
```

In this example, only the second document contains the retained n-grams, which explains why its row has nonzero TF–IDF values, while the first and third documents are represented by zero vectors.

Because $L_2$ normalization is applied, the nonzero row has unit Euclidean norm, and the TF-IDF weights are evenly distributed across the six retained features.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Third output. {.unlisted .unnumbered}

The final output reports the dimensions of the matrix. Here, the matrix has three rows (one per document) and six columns (one per retained n-gram), confirming that `max_features` directly controls the dimensionality of the TF-IDF representation.

```{python, echo= FALSE}
print("Matrix shape:", tfidf_ngram.shape)
```

The parameters `min_df` and `max_df` are also available for TF-IDF vectorizers and behave identically to those in `CountVectorizer`, allowing extremely rare or overly common terms to be excluded based on document frequency.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% --> 
<!-- Separador -->


### Limitations of the TF-IDF representation

TF–IDF improves upon raw word counts by adjusting token importance using corpus-level statistics. It remains computationally efficient and highly interpretable.

However, TF–IDF still operates purely at the lexical level and therefore does not capture:

- Semantic similarity between words,

- Contextual meaning,

- Word order or co-occurrence structure, or

- Positional information within documents.

Figure \@ref(fig:Fig-Limita1) summarizes these four limitations with simple examples.

<center>
```{r Fig-Limita1, echo=FALSE, fig.cap = "Limitations of the TF-IDF representation. Source: Created by the author with ChatGPT (OpenAI)", out.width = "80%"}
# fig.width = 20 # This option does not work in this chunk

#http://zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/

knitr::include_graphics("Limita1.png")

# Another way, but the caption is not rendered:
#<center>
#![(#fig:Fig-caption) My figure](Name.png){width=400px}
#</center>
```
</center>
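The word-order limitation can be seen in a minimal sketch. The two sentences below are hypothetical and were chosen because they contain exactly the same words in a different order, with clearly different meanings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, different order, different meaning (hypothetical example)
pair = ["the dog chased the cat", "the cat chased the dog"]

X = TfidfVectorizer().fit_transform(pair).toarray()

# Both sentences have identical term counts, hence identical TF-IDF vectors
same_vector = bool(np.allclose(X[0], X[1]))
print(same_vector)  # True
```

Adding n-grams would separate these two sentences, but only by enlarging the vocabulary; the representation still has no notion of meaning.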


Like BoW, TF–IDF representations also scale with vocabulary size, which can become problematic for very large corpora.

These limitations motivate the use of similarity measures (such as cosine similarity) and more expressive representation learning techniques, which are explored in the following sections.



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# Distance/similarity calculation between document vectors

Once documents have been represented as vectors, a natural question arises:

```{r, eval=FALSE}
How can we quantify how similar or dissimilar two text documents are?
```


If two documents use similar words with comparable distributions, it is reasonable to expect that they convey related information. In this section, we introduce **cosine similarity**, a geometric measure widely used to compare document vectors derived from Bag-of-Words and TF-IDF representations.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Cosine similarity

Cosine similarity measures the *orientation* of two vectors in a vector space by computing the cosine of the angle between them. Unlike distance-based measures, it is insensitive to vector magnitude and instead focuses on direction.

Two vectors are considered similar when they point in nearly the same direction, even if their lengths differ. This property is especially useful in text analysis, where vector magnitude is often influenced by document length.

For two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^d \), cosine similarity is defined as:

$$\cos(\mathbf{u}, \mathbf{v}) \quad =\quad \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} \quad = \quad 
\frac{\sum\limits_{i=1}^{d} u_{i} \, v_{i}}  {\sqrt{\sum\limits_{i=1}^{d} u_{i}^2} \;  \sqrt{\sum\limits_{i=1}^{d} v_{i}^2}} \quad \in \quad [-1, 1]$$


Here, \( \mathbf{u} \cdot \mathbf{v} \) denotes the Euclidean inner product, and the $L_2$ norm (Euclidean norm) of a vector \( \mathbf{v} \in \mathbb{R}^d \) is defined as:


$$\|\mathbf{v}\|_2 \quad =\quad  \sqrt{\sum_{i=1}^{d} v_i^2}.$$

Cosine similarity measures **angular similarity**, not Euclidean distance. It evaluates the angle between vectors rather than their magnitude. It takes values in the continuous interval \([-1,1]\). The extreme cases correspond to:

- Identical direction (maximum similarity): \( 1 \) 

- Orthogonal vectors (no linear association): \( 0 \) 

- Opposite direction: \( -1 \) 

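Intermediate values (e.g., 0.82, 0.34, −0.15) reflect varying angular proximity between vectors. For Bag-of-Words and TF-IDF vectors, whose entries are all non-negative, the dot product cannot be negative, so cosine similarity lies in \([0,1]\). In embedding spaces trained on natural language data, cosine values are also typically non-negative, since semantically unrelated words rarely exhibit strong opposite orientations.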
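The three extreme cases, together with the insensitivity to vector length, can be sketched with a few hand-picked vectors. The helper `cos_sim` and the vector `a` are hypothetical names used only in this illustration.

```python
import numpy as np

def cos_sim(u, v):
    """Cosine of the angle between two nonzero 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])

same_direction = cos_sim(a, 3 * a)            # rescaled copy: approximately 1
orthogonal = cos_sim([1, 0, 0], [0, 1, 0])    # no shared coordinates: 0
opposite = cos_sim(a, -a)                     # reversed direction: approximately -1

print(same_direction, orthogonal, opposite)
```

The first case is the one that matters most for text: multiplying a document vector by a constant, as happens when a document is concatenated with itself, leaves its cosine similarity to every other vector unchanged.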


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Cosine similarity: Solving step by step 

Consider two documents represented by count-based vectors:

$$\mathbf{u} = (4, 1, 2, 0, 3, 0, 1, 0) \quad \text{and} \quad \mathbf{v} = (2, 0, 1, 1, 2, 1, 0, 0)$$

The cosine similarity between them is:

$$\cos(\mathbf{u}, \mathbf{v}) \quad = \quad  \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \; \|\mathbf{v}\|_2}$$

First, compute the dot product:

$$\mathbf{u} \cdot \mathbf{v} \quad = \quad  4\cdot2 + 1\cdot0 + 2\cdot1 + 0\cdot1 + 3\cdot2 + 0\cdot1 + 1\cdot0 + 0\cdot0 \quad = \quad 16$$

Next, compute the vector norms:

$$\|\mathbf{u}\|_2 \quad =\quad  \sqrt{4^2 + 1^2 + 2^2 + 3^2 + 1^2} \quad = \quad  \sqrt{31} \quad \approx \quad 5.57$$

$$\|\mathbf{v}\|_2 \quad = \quad \sqrt{2^2 + 1^2 + 1^2 + 2^2 + 1^2} \quad = \quad  \sqrt{11} \quad \approx \quad 3.32$$

Finally:

$$\cos(\mathbf{u}, \mathbf{v}) \quad = \quad  \frac{16}{(5.57)(3.32)} \quad \approx \quad 0.87$$

A cosine similarity of \(0.87\) indicates a strong similarity between the two documents.
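The hand computation can be verified with NumPy. The sketch below reuses the vectors \(\mathbf{u}\) and \(\mathbf{v}\) from this example; the intermediate variable names are chosen here for readability.

```python
import numpy as np

u = np.array([4, 1, 2, 0, 3, 0, 1, 0])
v = np.array([2, 0, 1, 1, 2, 1, 0, 0])

dot = int(u @ v)          # 16
norm_u_sq = int(u @ u)    # 31, the squared L2 norm of u
norm_v_sq = int(v @ v)    # 11, the squared L2 norm of v

cosine = dot / np.sqrt(norm_u_sq * norm_v_sq)
print(dot, norm_u_sq, norm_v_sq, round(cosine, 2))  # 16 31 11 0.87
```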

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Cosine similarity: Implementing in Python  

The following function computes cosine similarity between two numeric vectors:

```{python}
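import numpy as np

# Note: this definition shadows sklearn's `cosine_similarity` if that
# function was imported earlier under the same name.
def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two 1-D numeric vectors."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    # Dot product divided by the product of the Euclidean norms
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))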
```

This function can be applied directly to document vectors produced by different vectorization techniques.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Cosine similarity using `CountVectorizer` outputs 

#### `CountVectorizer`: the sparse and document-term matrices (code).  {.unlisted .unnumbered}  


Recall that `CountVectorizer` builds a *document–term matrix* (Bag-of-Words). Each document is represented by a sparse vector of *raw token counts*, where:

- *Rows* = documents.  

- *Columns* = vocabulary terms.  

- *Entries* = how many times each term appears in each document.

```{python, eval=FALSE}
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Data science relies on numerical methods",          # Document 1
    "Text analysis uses vectors and matrices",           # Document 2
    "Mathematical representations support data modeling" # Document 3
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out()) # Output 1
print(bow_matrix.toarray())               # Output 2
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### `CountVectorizer`: the sparse and document-term matrices (outputs).  {.unlisted .unnumbered} 

```{python, echo=FALSE}
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Data science relies on numerical methods",          # Document 1
    "Text analysis uses vectors and matrices",           # Document 2
    "Mathematical representations support data modeling" # Document 3
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
```


Output 1 shows the vocabulary (feature names) learned by `CountVectorizer` from the corpus, which defines the columns of the document-term matrix:


```{python, echo=FALSE}
print(vectorizer.get_feature_names_out())
```

Output 2 displays the corresponding document-term matrix in dense form, where rows represent documents, columns represent vocabulary terms (tokens), and entries indicate raw term frequencies.

```{python, echo=FALSE}
print(bow_matrix.toarray())               
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### `CountVectorizer`: cosine similarity (code).  {.unlisted .unnumbered}  

Using the document–term matrix `bow_matrix`, cosine similarity can be computed for every document pair:


```{python, eval=FALSE}
bow_dense = bow_matrix.toarray()  # convert to a dense array once
n_docs = bow_dense.shape[0]

for i in range(n_docs):
    for j in range(i + 1, n_docs):
        sim = cosine_similarity(bow_dense[i], bow_dense[j])
        print(f"Cosine similarity between documents {i+1} and {j+1}: {sim:.3f}")
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### `CountVectorizer`: cosine similarity (outputs).  {.unlisted .unnumbered} 

```{python, echo=FALSE}
for i in range(bow_matrix.shape[0]):
    for j in range(i + 1, bow_matrix.shape[0]):
        sim = cosine_similarity(
            bow_matrix.toarray()[i],
            bow_matrix.toarray()[j]
        )
        print(f"Cosine similarity between documents {i+1} and {j+1}: {sim:.3f}")
```



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### `CountVectorizer`: cosine similarity (interpretation of the outputs).  {.unlisted .unnumbered}  


- A value of *0.000* means the two documents share *no vocabulary terms* after preprocessing, so their BoW vectors are orthogonal.

- A small positive value (e.g., *0.183*) typically indicates *limited lexical overlap* (for instance, a single shared token such as *data*), but not necessarily strong semantic similarity.

This highlights an important point: *cosine similarity on Bag-of-Words is driven by shared tokens*, not by meaning.
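As an alternative to the explicit double loop, scikit-learn's own `cosine_similarity` (in `sklearn.metrics.pairwise`) returns the full pairwise matrix in one call and accepts sparse input. In the sketch below it is imported under the alias `sk_cosine_similarity`, a name chosen here to avoid a clash with the function defined earlier in this section; `docs` repeats the three documents used above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

docs = [
    "Data science relies on numerical methods",           # Document 1
    "Text analysis uses vectors and matrices",            # Document 2
    "Mathematical representations support data modeling"  # Document 3
]

bow = CountVectorizer().fit_transform(docs)

# Entry (i, j) is the cosine similarity between documents i+1 and j+1
sim_matrix = sk_cosine_similarity(bow)
print(np.round(sim_matrix, 3))
```

Documents 1 and 3 share only the token `data` and contain six and five distinct tokens, respectively, so their similarity is \(1/\sqrt{6 \cdot 5} \approx 0.183\).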


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Cosine similarity using TF-IDF representations 


#### TF-IDF representations: cosine similarity (code).  {.unlisted .unnumbered}  


TF-IDF builds the same type of document vectors, but *reweights* terms:

- Words that appear in many documents receive *lower weight*.

- Words that are more document-specific receive *higher weight*.

```{python, eval=FALSE}
tfidf_dense = tf_idf_matrix.toarray()  # convert to a dense array once
n_docs = tfidf_dense.shape[0]

for i in range(n_docs):
    for j in range(i + 1, n_docs):
        sim = cosine_similarity(tfidf_dense[i], tfidf_dense[j])
        print(f"Cosine similarity between documents {i+1} and {j+1}: {sim:.3f}")
```


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### TF-IDF representations: cosine similarity (outputs).  {.unlisted .unnumbered} 

```{python, echo=FALSE}
for i in range(tf_idf_matrix.shape[0]):
    for j in range(i + 1, tf_idf_matrix.shape[0]):
        sim = cosine_similarity(
            tf_idf_matrix.toarray()[i],
            tf_idf_matrix.toarray()[j]
        )
        print(f"Cosine similarity between documents {i+1} and {j+1}: {sim:.3f}")
```

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


#### TF-IDF representations: cosine similarity (interpretation of outputs).  {.unlisted .unnumbered}  


- If similarities decrease (or become *0.000*), it usually means that the documents share *few or no important terms* after TF-IDF reweighting.

- TF-IDF can reduce the influence of very common tokens, so even when two documents share a word, the similarity may become smaller if that word is not informative.
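The effect of reweighting can be sketched by comparing both representations on the same pair of documents. The example assumes the three documents from the `CountVectorizer` example and the default `TfidfVectorizer` settings, which may differ from the settings used to build `tf_idf_matrix`; `cos_sim` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "Data science relies on numerical methods",           # Document 1
    "Text analysis uses vectors and matrices",            # Document 2
    "Mathematical representations support data modeling"  # Document 3
]

def cos_sim(u, v):
    """Cosine of the angle between two nonzero 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

bow = CountVectorizer().fit_transform(docs).toarray().astype(float)
tfidf = TfidfVectorizer().fit_transform(docs).toarray()  # default settings (assumption)

# Documents 1 and 3 share only the token "data"
bow_13 = cos_sim(bow[0], bow[2])
tfidf_13 = cos_sim(tfidf[0], tfidf[2])

# Rows are L2-normalized by default, so the dot product is already the cosine
dot_13 = float(tfidf[0] @ tfidf[2])

print(f"BoW: {bow_13:.3f}  TF-IDF: {tfidf_13:.3f}  dot product: {dot_13:.3f}")
```

Because `data` appears in two of the three documents, it receives a lower IDF weight than the terms unique to each document, and the TF-IDF similarity for this pair falls below the BoW value.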


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


### Overall: BoW vs Cosine similarity vs TF-IDF


- *BoW + cosine* measures overlap in *raw counts*.

- *TF-IDF + cosine* measures overlap in *weighted importance*.

In the next section, we introduce **one-hot vectorization**, a foundational representation for neural and embedding-based models.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# One-hot vectorization

One-hot encoding is a simple and widely used technique for representing categorical information in numerical form. In this representation, each possible category is associated with a unique coordinate in a vector. Exactly one entry takes the value 1, while all remaining entries are set to 0.  

For a vocabulary of size \( |V| \), each one-hot vector lies in \( \mathbb{R}^{|V|} \) and contains exactly one non-zero component.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


### A simple intuition

Consider a categorical variable describing traffic conditions with three possible values:

- Low.  

- Medium.  

- High.  

A one-hot representation can be defined as:



$$\overrightarrow{\mathbf{\text{low}}} = (1,0,0),\quad \overrightarrow{\mathbf{\text{medium}}} = (0,1,0),\quad \overrightarrow{\mathbf{\text{high}}} = (0,0,1)$$


Each vector has length 3 because there are three possible categories, and exactly one position is active at a time. Geometrically, these vectors correspond to the canonical basis of \( \mathbb{R}^3 \). They are mutually orthogonal and equidistant.
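A minimal sketch of this encoding uses the rows of an identity matrix as the canonical basis. The names `categories`, `basis`, and `one_hot` are hypothetical and serve only this illustration.

```python
import numpy as np

categories = ["low", "medium", "high"]

# Row i of the identity matrix is the i-th canonical basis vector
basis = np.eye(len(categories), dtype=int)
one_hot = {name: basis[i] for i, name in enumerate(categories)}

for name, vector in one_hot.items():
    print(name, vector)

# Matrix of pairwise dot products: the identity, so distinct vectors are orthogonal
gram = basis @ basis.T
print(gram)
```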


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### One-hot encoding in NLP

In natural language processing, the same idea applies to tokens. Once a vocabulary has been constructed, each word is treated as a category.  

A token is represented by a vector in \( \mathbb{R}^{|V|} \), where only the coordinate corresponding to its position in the vocabulary equals 1.

Thus, one-hot encoding transforms discrete symbolic tokens into numerical vectors without introducing any semantic structure.

These representations serve as an intermediate step toward more advanced distributed representations such as word embeddings.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Constructing one-hot vectors step by step

To illustrate the process, we work with a short sentence.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 1: Define the input text.  {.unlisted .unnumbered}  

```{python}
sentence = ["Students study machine learning methods"]
corpus = pd.Series(sentence)
corpus
```

The corpus contains a single document.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 2: Apply basic preprocessing.  {.unlisted .unnumbered}  

We apply cleaning, stopword removal, and lemmatization:

```{python, eval=FALSE}
def clean_and_lemmatize(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
    
    return " ".join(tokens)

preprocessed_corpus = corpus.apply(clean_and_lemmatize)
preprocessed_corpus
```

The output is: 

```{python, echo=FALSE}
def clean_and_lemmatize(text):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
    
    return " ".join(tokens)

preprocessed_corpus = corpus.apply(clean_and_lemmatize)
preprocessed_corpus
```



The sentence has been reduced to its core lexical components:

- `Students` → `student`.

- `methods` → `method`. 

- Stopwords removed.  

This ensures a clean and compact vocabulary.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 3: Build the vocabulary.  {.unlisted .unnumbered}  


```{python, eval=FALSE}
# Sorting makes the vocabulary order reproducible across Python sessions
vocab = sorted(set(preprocessed_corpus[0].split()))
print(vocab)
```

The output is: 

```{python, echo=FALSE}
vocab = sorted(set(preprocessed_corpus[0].split()))
print(vocab)
```


Each unique token now corresponds to one dimension of the vector space. Since there are 5 distinct tokens, the one-hot vectors live in \( \mathbb{R}^5 \).

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 4: Assign indices to vocabulary terms.  {.unlisted .unnumbered}  

```{python, eval=FALSE}
position = {token: idx for idx, token in enumerate(vocab)}
print(position)
```

The output is: 

```{python, echo=FALSE}
position = {token: idx for idx, token in enumerate(vocab)}
print(position)
```


This dictionary defines the coordinate system of the space: *each token is assigned a fixed index*. This mapping specifies which coordinate in the vector corresponds to each token.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 5: Initialize the one-hot matrix.  {.unlisted .unnumbered}  


```{python, eval=FALSE}
one_hot_matrix = np.zeros((len(preprocessed_corpus[0].split()), len(vocab)))
one_hot_matrix.shape
```

The output is: 



```{python, echo=FALSE}
one_hot_matrix = np.zeros((len(preprocessed_corpus[0].split()), len(vocab)))
one_hot_matrix.shape
```


In this case, the matrix has  5 rows (one row per token in the sentence) and 5 columns (one column per vocabulary term). 


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 6: Populate the one-hot vectors.  {.unlisted .unnumbered}  


```{python}
for i, token in enumerate(preprocessed_corpus[0].split()):
    one_hot_matrix[i][position[token]] = 1
```


For each token:

- Identify its index in the vocabulary.  

- Set the corresponding column to 1.  

Each row now becomes a canonical basis vector.




<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Step 7: Inspect the result.  {.unlisted .unnumbered}  



```{python, eval=FALSE}
one_hot_matrix
```


The output is: 

```{python, echo=FALSE}
one_hot_matrix
```

In this case: 

- Each row corresponds to one token.

- Each row contains exactly one 1.

- All vectors are orthogonal:

$$
\overrightarrow{w_i} \cdot \overrightarrow{w_j} = 0 \quad \text{for } i \neq j.
$$

The representation captures only token identity: it encodes no information about *frequency*, *relative importance*, or *semantic similarity*.
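In the example above, the number of tokens happens to equal the vocabulary size, so the matrix is square. The sketch below uses a hypothetical token sequence with one repeated token to show that rows index tokens while columns index vocabulary terms; it also builds the matrix in a single step by indexing an identity matrix, as an alternative to the explicit loop.

```python
import numpy as np

# Hypothetical preprocessed token sequence with one repeated token
tokens = ["student", "study", "method", "study"]

vocab = sorted(set(tokens))                         # ['method', 'student', 'study']
position = {tok: idx for idx, tok in enumerate(vocab)}

# Select one identity-matrix row per token
indices = [position[tok] for tok in tokens]         # [1, 2, 0, 2]
one_hot = np.eye(len(vocab), dtype=int)[indices]

print(one_hot.shape)  # (4, 3): four tokens, three vocabulary terms
print(one_hot)
```

The two occurrences of `study` produce identical rows, which confirms that the encoding records which token occurs at each position and nothing else.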



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### Key observation.  {.unlisted .unnumbered}  


While one-hot vectors provide a clear and unambiguous numerical encoding, they exhibit two major limitations:

1. **High dimensionality**.    The vector length grows linearly with vocabulary size.

2. **No semantic structure**.     All distinct words are equidistant:
   
   $$   \|\overrightarrow{w_i} \;-\; \overrightarrow{w_j}\|_2 \quad =\quad \sqrt{2}   \quad \text{for}\, i\, \neq \,j.   $$

These limitations motivate the transition to distributed representations, where meaning emerges from geometry rather than position alone.
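The equidistance property can be checked numerically for any vocabulary size. The sketch below assumes a hypothetical vocabulary of five terms and computes all pairwise Euclidean distances between the corresponding one-hot vectors.

```python
import numpy as np

vocab_size = 5  # hypothetical vocabulary size
one_hot = np.eye(vocab_size)

# Pairwise differences via broadcasting, then one norm per pair: shape (V, V)
diffs = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.linalg.norm(diffs, axis=-1)

# Every off-diagonal entry (distinct words) equals sqrt(2)
off_diagonal = distances[~np.eye(vocab_size, dtype=bool)]
print(np.allclose(off_diagonal, np.sqrt(2)))  # True
```

Since every pair of distinct words is separated by the same distance, no word is closer to a related word than to an unrelated one.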


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->



# Summary

In this chapter, we introduced the fundamental mathematical ideas behind representing text as numerical objects. Starting from simple heuristics, we explored how textual data can be mapped into vectors and matrices, enabling the use of linear algebra techniques for analysis.

We first examined the *Bag-of-Words (BoW)* representation and implemented it using the `CountVectorizer` API. While this approach provides an intuitive and effective way to encode text based on term frequencies, we also identified its main limitations—most notably, its tendency to overemphasize very frequent terms and ignore the relative importance of rarer but potentially informative words.

To address these issues, we introduced *TF–IDF vectorization*, which reweights term frequencies by incorporating global information about term distribution across the corpus. This adjustment helps balance local relevance within documents against global prevalence in the dataset. Despite this improvement, both BoW and TF–IDF remain fundamentally *lexical* methods: they rely on surface-level word occurrences and do not account for semantic meaning, word order, or contextual relationships.

Building on these vector representations, we then explored how document similarity can be quantified using *cosine similarity*, interpreting documents as points in a high-dimensional space and measuring the angles between their corresponding vectors. This provided a practical mechanism for comparing documents and served as the foundation for simple applications such as retrieval-based chatbots.

Finally, we discussed *one-hot vectorization*, a sparse encoding scheme commonly used to represent individual tokens as categorical variables. Although simple, this representation plays an important role as a conceptual building block for more advanced models.

Overall, the methods covered in this chapter are most effective in settings where the vocabulary size is moderate and lexical overlap between documents is meaningful. As vocabularies grow larger or semantic relationships become more important, these representations become less adequate.

With this lexical foundation in place, the next chapter moves beyond word counts and lexical weighting. We will explore approaches that explicitly model *semantic relationships between words*, beginning with distributed representations such as *Word2Vec*.


<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

# Applied activity: from text to vector-based similarity

This activity is designed to integrate and apply the numerical text representation techniques introduced in this chapter.  
The reader will transform a small text corpus into vector representations and analyze document similarity using linear algebra concepts.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Objective

To build a fully reproducible pipeline that converts raw text into numerical vectors using Bag-of-Words and TF–IDF representations, and to analyze document similarity using cosine similarity.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Instructions

1. Select a *small corpus of text*, such as:

   - Short paragraphs from news articles,
   
   - Abstracts of scientific papers, or  
   
   - Brief descriptions of products, movies, or books.

2. The corpus must contain *at least three documents*, each consisting of one or two sentences.

3. Create an *R Markdown (`.Rmd`)* document that compiles successfully to *HTML* (or PDF).

4. The document must include both:

   - The *code*, and  
   
   - The *resulting output* (printed matrices, tables, or numerical values).

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

### Required Sections

:::sangria3
#### 1. Corpus Description.  {.unlisted .unnumbered}  

Briefly describe the selected corpus and its context.  List the documents explicitly and explain why this corpus is appropriate for similarity analysis.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### 2. Preprocessing.  {.unlisted .unnumbered}  

Apply basic preprocessing steps, including:

- Lowercasing, 

- Removal of punctuation,

- Stopword removal, and 

- Lemmatization *or* stemming.

Show the processed version of each document.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### 3. Bag-of-Words Representation.  {.unlisted .unnumbered}  

Construct a Bag-of-Words representation of the corpus using:

- Either a manual implementation, *or*  
- A vectorization tool such as `CountVectorizer`.

Report:

- The learned vocabulary, and  

- The document–term matrix.

Briefly interpret the sparsity and dimensionality of the resulting matrix.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### 4. TF–IDF Representation.  {.unlisted .unnumbered}  

Using the same corpus, compute TF-IDF vectors.

Compare the TF–IDF matrix with the BoW matrix by discussing:

- Differences in numerical values, and  

- How TF–IDF reweights frequent and rare terms.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### 5. Cosine Similarity Analysis.  {.unlisted .unnumbered}  

Compute pairwise cosine similarity between all documents using:

- BoW vectors, and 

- TF–IDF vectors.

Present the results clearly and identify:

- The most similar document pair, and  

- The least similar document pair.

Explain any differences observed between the two representations.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### 6. One-hot Representation (Conceptual).  {.unlisted .unnumbered}  

Select **three tokens** from the vocabulary and:

- Construct their one-hot vectors, and

- Explain why one-hot representations are unsuitable for measuring semantic similarity directly.

This section may be presented conceptually or with a small numerical example.

<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

#### 7. Summary and Reflection.  {.unlisted .unnumbered}  

Write a concise reflection (6–10 lines) addressing:

- How vectorization enables mathematical comparison of text,  

- The role of weighting schemes such as TF–IDF, and 

- The limitations of purely lexical representations.
:::
<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->


### Reproducibility Requirement

- The R Markdown document must be fully reproducible. 

- All code chunks must execute without errors and regenerate the reported outputs when the document is compiled. 

- All random seeds (if applicable) must be set to ensure deterministic results.

- All library versions used should be clearly reported.
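One possible way to report library versions from within the document is sketched below using the standard-library module `importlib.metadata`. The package list and the helper name `report_versions` are assumptions and should be adapted to the libraries actually used.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as installed with pip (adjust as needed)
PACKAGES = ["scikit-learn", "nltk", "pandas", "numpy"]

def report_versions(packages):
    """Map each package name to its installed version string."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
    return report

versions = report_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver}")
```

Placing such a chunk at the end of the document records the environment in which the reported outputs were produced.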



<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->

<!-- Capítulo Bibliografía-->


# References {.unlisted .unnumbered}
  




<!-- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->
<!-- Separador -->

&nbsp;


&nbsp;
<center>
~~~
If you found any ERRORS or have SUGGESTIONS, please report them to my email. Thanks.  
~~~
</center>


