05/02/26
Abstract
Other related documents can be found at RPubs: toc.
The examples and exercises presented in this chapter rely on a small set of widely used Python libraries for text preprocessing, vectorization, and numerical computation. To ensure that all code runs correctly, the required packages and language resources should be installed before executing the examples in this document.
The commands below are provided for reference only and should be executed in a Python environment (for example, a terminal, Anaconda Prompt, or a Python-enabled R Markdown setup using reticulate).
# Core machine learning and NLP libraries
pip install scikit-learn
pip install nltk
pip install pandas
pip install numpy
pip install seaborn
pip install tabulate
# Download required NLTK resources
python -c "import nltk; nltk.download('wordnet')"
python -c "import nltk; nltk.download('omw-1.4')"
python -c "import nltk; nltk.download('stopwords')"
Once the packages are installed, the following Python modules are imported throughout this document:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
The purpose of each library used in this document is summarized below:
NLTK provides basic natural language preprocessing tools, including stopword removal and lemmatization. Only lightweight linguistic processing is used in this document.
scikit-learn (sklearn) supplies the vectorization and similarity machinery, including CountVectorizer, TfidfVectorizer, and cosine similarity computation.
pandas is used to manage text corpora as structured objects (e.g., Series) and to apply preprocessing functions consistently across documents.
numpy supports numerical operations and vector-based computations required for similarity calculations.
seaborn is used for making statistical graphics.
tabulate is used to pretty-print tabular data in a human-readable format.
These tools are sufficient to illustrate the fundamental ideas behind frequency-based text representations, without introducing unnecessary dependencies.
Textual data poses a distinctive challenge for computational analysis: unlike numerical or categorical variables, natural language does not come with an inherent mathematical representation. While computers operate exclusively on numbers, language is expressed through symbols, words, and structures whose meaning is not natively encoded in numeric form.
Transforming text into numbers is therefore unavoidable—but it is also an opportunity. The specific choices made during this transformation determine which aspects of language are preserved, which are simplified or ignored, and how effectively learning algorithms can operate on linguistic data. In this sense, representation choices are not neutral: they directly influence model behavior, interpretability, and performance.
In the previous document (see vocabulary construction), we focused on defining the symbolic units of language processing, including tokenization strategies, normalization procedures, and vocabulary design. These steps establish what constitutes a unit of analysis. In this chapter, we move to the next stage of the pipeline and examine how those symbolic units are transformed into numerical objects.
Our approach is deliberately incremental. We begin with simple and transparent representations that emphasize observable structure rather than deep semantic meaning. By relying on frequency counts and distributional information, we can construct representations that are easy to interpret and that provide a solid mathematical foundation for more advanced techniques.
Throughout this chapter, we introduce classical methods for numerical text representation, including Bag-of-Words and term frequency–inverse document frequency (TF–IDF). Although conceptually straightforward, these methods remain widely used in practice—for baseline models, exploratory analysis, and instructional settings.
Before introducing these techniques, it is useful to clarify a fundamental distinction that underlies all language modeling: syntax versus semantics. Syntax concerns the structural organization of words and their observable patterns of occurrence, whereas semantics relates to meaning and interpretation. A sentence may be syntactically well-formed without conveying meaningful information.
In this chapter, the emphasis is intentionally placed on the syntactic dimension of language. We focus on representations derived from word occurrence patterns—such as counts and relative frequencies—while postponing semantic representations (e.g., embeddings and neural encodings) to later chapters.
By the end of this chapter, you will be able to represent text using vectors and matrices, compute similarities between documents, and build simple language-based applications. These ideas also serve as a conceptual bridge toward the representation learning techniques employed in modern deep learning architectures, including Transformer-based models.
The main topics covered in this chapter are:
Understanding vectors and matrices as mathematical data structures
Exploring the Bag-of-Words (BoW) representation
Constructing TF–IDF vectors
Measuring distance and similarity between document vectors
One-hot vectorization
Building a basic chatbot
A central challenge in NLP is expressing language in mathematical form. Two data structures play a fundamental role in this transformation: vectors and matrices. Together, they allow collections of text documents to be analyzed using the tools of linear algebra.
A vector is a one-dimensional array of numerical values, where each position corresponds to a specific feature. Vectors are commonly represented as column arrays:
\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \mathbf{v} =\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} \]
In this expression, the vector \(\mathbf{x}\) contains three components and belongs to \(\mathbb{R}^3\), while \(\mathbf{v}\) contains four components and belongs to \(\mathbb{R}^4\). Each coordinate represents the contribution of the vector along a particular axis. Once an object is represented as a vector, operations such as distance computation, similarity measurement, and projection become well defined.
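As a minimal sketch (using NumPy, with arbitrary example values), such vectors are simply one-dimensional arrays:
import numpy as np
x = np.array([1.0, 2.0, 3.0])        # a vector in R^3 (example values)
v = np.array([0.5, 1.5, 2.5, 3.5])   # a vector in R^4 (example values)
print(x.shape)  # (3,)
print(v.shape)  # (4,)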
To develop geometric intuition, consider representing entities using measurable attributes. Suppose we describe two cities using their average annual temperature and annual rainfall:
\[ \begin{array}{c|cc} \text{City} & \text{Temperature (°C)} & \text{Rainfall (mm)} \\ \hline \text{A} & 18 & 720 \\ \text{B} & 25 & 1100 \end{array} \]
Each city can be interpreted as a point in a two-dimensional space, or equivalently, as a vector.
City A corresponds to the vector \(X_A= (18, 720)\).
City B corresponds to the vector \(X_B= (25, 1100)\).
From a mathematical perspective, both vectors belong to \(\mathbb{R}^2\). Here is the corresponding visualization:
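A minimal sketch of this plot, assuming matplotlib is available (it is used elsewhere in this document), draws both city vectors as arrows from the origin:
import matplotlib.pyplot as plt
# City vectors: (temperature in °C, rainfall in mm)
x_a = (18, 720)
x_b = (25, 1100)
plt.figure()
plt.quiver([0, 0], [0, 0], [x_a[0], x_b[0]], [x_a[1], x_b[1]],
           angles="xy", scale_units="xy", scale=1,
           color=["steelblue", "darkorange"])
plt.text(x_a[0], x_a[1], "  City A (18, 720)")
plt.text(x_b[0], x_b[1], "  City B (25, 1100)")
plt.xlim(0, 30)
plt.ylim(0, 1250)
plt.xlabel("Temperature (°C)")
plt.ylabel("Rainfall (mm)")
plt.title("Cities represented as vectors in R^2")
plt.tight_layout()
plt.show()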
Each vector originates at the coordinate system’s origin and points toward a location determined by the corresponding attributes. Adding a new attribute (such as altitude or population density) increases the dimensionality of the representation, moving the vectors from \(\mathbb{R}^2\) to \(\mathbb{R}^3\) or higher.
While such spaces quickly become impossible to visualize, the algebraic interpretation of vectors remains valid in any dimension.
The same idea applies directly to text.
After tokenization (introduced in the previous chapter), a document can be represented as a vector in which each dimension corresponds to a unique token in the vocabulary. The value along each dimension reflects how frequently that token appears in the document. In this way, textual data can be embedded into a numerical space, enabling the use of vector-based methods for comparison, similarity, and analysis.
Matrices extend vectors by organizing multiple vectors into rows and columns. A matrix can be written as:
\[ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} \]
This matrix belongs to \(\mathbb{R}^{3 \times 2}\), indicating three rows and two columns. In text analysis, matrices are commonly used to represent collections of documents. Each row corresponds to a document, each column corresponds to a token in the vocabulary, and each entry stores the frequency of that token in the document.
To illustrate this idea, consider the following small collection of documents:
from sklearn.feature_extraction.text import CountVectorizer
documents = (
"Text analysis relies on numerical representations",
"Vectors and matrices are core mathematical tools",
"Large collections of text can be processed efficiently"
)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
# Inspect the learned vocabulary and document–term matrix
print(vectorizer.vocabulary_) # First output
print(X.todense()) # Second output
The printed dictionary (vocabulary_) maps each unique token to a column index in the document-term matrix:
## {'text': 11, 'analysis': 0, 'relies': 9, 'numerical': 7, 'representations': 10, 'vectors': 13, 'matrices': 6, 'core': 2, 'mathematical': 5, 'tools': 12, 'large': 4, 'collections': 1, 'processed': 8, 'efficiently': 3}
Each key in this dictionary is a token extracted from the corpus after preprocessing (tokenization and stopword removal). The associated number is not a frequency and does not indicate importance or order of appearance in the text. Instead, it specifies the column position assigned to that token in the document–term matrix.
To make this concrete:
'analysis': 0 means that the token analysis corresponds to column 0 of the matrix.
'collections': 1 corresponds to column 1.
'core': 2 corresponds to column 2.
…
'text': 11 corresponds to column 11.
'vectors': 13 corresponds to column 13.
In other words, the numbers 0, 1, 2, …, 13 are indices, not counts. They simply label the columns of the matrix, starting from zero, following Python’s indexing convention.
Once this mapping is defined, the document–term matrix uses it consistently. For example, the value located at row \(i\) and column 0 represents the frequency of the token analysis in document \(i\). Similarly, the value at column 11 represents the frequency of the token text in that same document.
This separation of roles is crucial:
The vocabulary dictionary defines where each token lives in the matrix.
The matrix entries define how often each token appears in each document.
Understanding this distinction helps explain why a document vector has a fixed length equal to the size of the vocabulary, and why most entries are zero when a token does not appear in a document.
The second output (X.todense()) is the document–term matrix itself:
## [[1 0 0 0 0 0 0 1 0 1 1 1 0 0]
## [0 0 1 0 0 1 1 0 0 0 0 0 1 1]
## [0 1 0 1 1 0 0 0 1 0 0 1 0 0]]
Mathematically:
\[ \mathbf{X} = \left( \begin{array}{c|cccccccccccccc} \text{Text} & \text{anal} & \text{coll} & \text{core} & \text{eff} & \text{lar} & \text{math} & \text{mat} & \text{num} & \text{proc} & \text{relies} & \text{repr} & \text{text} & \text{tools} & \text{vec} \\ \hline \text{#1} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ \text{#2} & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \text{#3} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \end{array} \right) \in \mathbb{R}^{3 \times 14} \]
Legend (tokens):
anal = analysis; coll = collections; core = core; eff = efficiently; lar = large; math = mathematical; mat = matrices; num = numerical; proc = processed; relies = relies; repr = representations; text = text; tools = tools; vec = vectors.
This matrix should be interpreted as follows:
Rows of the matrix \(\mathbf{X}\) correspond to documents: Text 1, Text 2, Text 3 (in the same order as the input text).
Columns correspond to tokens in the vocabulary.
The entry \(x_{ij}\) of \[ \mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{i14}) \in \mathbb{R}^{14}, \] represents the number of times token \(j\) appears in document \(i\). For example:
\(x_{1,1} = 1\) indicates that the token analysis appears once in the first document.
\(x_{1,8} = 1\) indicates that the token numerical appears once in the first document.
\(x_{2,14} = 1\) indicates that the token vectors appears once in the second document.
Zeros indicate that the corresponding token does not appear in that document.
Because each document is short and most words appear at most once, the matrix mainly contains values of 0 and 1. A value of 1 indicates that the corresponding token appears once in that document, while 0 indicates that it does not appear at all.
The length of each row vector equals the size of the vocabulary. In this example, the vocabulary contains 14 unique tokens after stopword removal, which explains why each document vector has 14 components.
Once text data has been converted into matrix form, standard linear algebra operations (such as similarity computation, projection, or matrix transformation) can be applied to compare and analyze documents quantitatively.
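For instance, cosine similarity (discussed in detail at the end of this chapter) can be applied directly to the rows of the document–term matrix computed above. The sketch below assumes X is the sparse matrix produced by fit_transform():
from sklearn.metrics.pairwise import cosine_similarity
# Pairwise similarities between the three documents (rows of X)
similarities = cosine_similarity(X)
print(similarities.shape)  # (3, 3); diagonal entries equal 1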
One of the most straightforward ways to represent text numerically is to count how often words appear in a document. This simple idea underlies the Bag-of-Words (BoW) representation.
The BoW model deliberately ignores word order and grammatical structure, focusing instead on which words appear and how frequently they occur. Although this abstraction discards syntactic information such as word sequence, it provides a powerful and intuitive baseline for many text analysis tasks.
In the previous chapter on vocabulary construction, we introduced the process of identifying and standardizing the basic units of text. That step is essential for BoW representations: before counting words, we must first decide which words belong to the vocabulary.
Once a vocabulary is fixed, each document can be represented as a vector whose length equals the size of the vocabulary. Each position in the vector corresponds to a specific term, and the value stored at that position indicates how many times the term appears in the document.
If a word from the vocabulary does not appear in a given document, the corresponding entry in the vector is simply zero.
What is the maximum possible value of a component in a Bag-of-Words vector?
Take a moment to think about it.
In principle, there is no fixed upper bound: the value is determined by how many times a word appears within a document. Extremely frequent words could dominate the representation, while many entries remain zero. This characteristic leads to sparse vectors—a defining property of BoW models.
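A quick sketch (with a made-up sentence) confirms that repeated words simply accumulate counts, so individual entries can exceed 1:
from sklearn.feature_extraction.text import CountVectorizer
demo = ["data models need data and more data"]   # hypothetical document
demo_vec = CountVectorizer()
demo_bow = demo_vec.fit_transform(demo)
print(demo_vec.get_feature_names_out())
print(demo_bow.toarray())   # the column for "data" holds the value 3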
To make the idea concrete, let us construct a Bag-of-Words representation manually, starting from a small collection of sentences.
Before detailing each step, Figure 4.1 provides an overview of the main stages involved in constructing a Bag-of-Words representation, which are explained in detail in the following steps.
Figure 4.1: Step-by-step construction of a Bag-of-Words representation. Source: Created by the author with ChatGPT (OpenAI)
We begin by defining a small corpus composed of three short sentences. Each sentence will be treated as an individual document for illustration purposes.
sentences = [
"Data science connects statistics and computation",
"Statistical models learn patterns from data",
"Modern data analysis relies on computational tools"
]
The corpus is stored as a pandas.Series, where each element represents one document. This structured format facilitates systematic preprocessing and later vectorization steps.
import pandas as pd
corpus = pd.Series(sentences)
corpus
## 0 Data science connects statistics and computation
## 1 Statistical models learn patterns from data
## 2 Modern data analysis relies on computational t...
## dtype: str
The preprocessing step standardizes the text by lowercasing, removing punctuation and stopwords, and reducing words to their lemma.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import re
import numpy as np
def clean_and_lemmatize(text):
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
return " ".join(tokens)
processed_corpus = corpus.apply(clean_and_lemmatize)
processed_corpus
The code above implements a basic text preprocessing pipeline using the nltk library, including stop-word removal and lemmatization.
These packages provide tools for tokenization, stop-word filtering, and lemmatization, which are standard preprocessing steps in natural language processing.
This function performs several preprocessing operations in sequence.
First, non-alphabetic characters are removed using a regular expression, and all text is converted to lowercase to ensure consistency. The cleaned text is then split into individual tokens (words).
Next, common English stop words (such as and, on, the) are removed, since they typically carry little semantic information.
Finally, each remaining token is lemmatized, reducing words to their base form (e.g., models → model), which helps group related word forms under a single representation.
The output of the function is a cleaned and normalized string, ready for vectorization.
## 0 data science connects statistic computation
## 1 statistical model learn pattern data
## 2 modern data analysis relies computational tool
## dtype: str
The output shows the preprocessed version of each document in the corpus, where stop words have been removed and remaining words have been lemmatized. Each row corresponds to one document, and the result preserves the original document order.
len(processed_corpus)
## 3
This confirms that the corpus contains three documents, each of which has been transformed into its cleaned textual representation.
This code constructs the vocabulary by extracting all unique tokens from the preprocessed corpus and sorting them alphabetically. Each token appears only once, regardless of how many times it occurs in the documents.
vocabulary = sorted(set(
word for sentence in processed_corpus for word in sentence.split()
))
vocabulary
## ['analysis', 'computation', 'computational', 'connects', 'data', 'learn', 'model', 'modern', 'pattern', 'relies', 'science', 'statistic', 'statistical', 'tool']
The output is a list of 14 unique terms.
len(vocabulary)
## 14
Each term defines one dimension of the Bag-of-Words vector space, meaning that every document will be represented as a vector of length 14, with each position corresponding to one vocabulary term.
This code creates a dictionary that maps each vocabulary term to a unique integer index. These indices define the column positions that each token will occupy in the Bag-of-Words matrix, ensuring a consistent numerical representation across all documents.
token_index = {token: idx for idx, token in enumerate(vocabulary)}
token_index
## {'analysis': 0, 'computation': 1, 'computational': 2, 'connects': 3, 'data': 4, 'learn': 5, 'model': 6, 'modern': 7, 'pattern': 8, 'relies': 9, 'science': 10, 'statistic': 11, 'statistical': 12, 'tool': 13}
This code initializes the Bag-of-Words matrix using zeros. The number of rows equals the number of documents in the corpus, and the number of columns equals the size of the vocabulary.
bow_matrix = np.zeros((len(processed_corpus), len(vocabulary)))
bow_matrix
## array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
## [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
## [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
In matrix form:
\[ \text{Token map:}\qquad \begin{array}{llllllll} \texttt{ana}=\texttt{analysis}, & \texttt{comp}=\texttt{computation}, & \texttt{compal}=\texttt{computational}, & \texttt{conec}=\texttt{connects}, \\ \texttt{dat}=\texttt{data}, & \texttt{lear}=\texttt{learn}, & \texttt{mod}=\texttt{model}, & \texttt{mdrn}=\texttt{modern}, \\ \texttt{pat}=\texttt{pattern}, & \texttt{rel}=\texttt{relies}, & \texttt{sci}=\texttt{science}, & \texttt{sta}=\texttt{statistic}, \\ \texttt{stal}=\texttt{statistical}, & \texttt{tool}=\texttt{tool}. \end{array} \]
\[ \mathbf{B}^{(0)} =\left( \begin{array}{c|cccccccccccccccc} \texttt{Text} & \texttt{ana} & \texttt{comp} & \texttt{compal} & \texttt{conec} & \texttt{dat} & \texttt{lear} & \texttt{mod} & \texttt{mdrn} & \texttt{pat} & \texttt{rel} & \texttt{sci} & \texttt{sta} & \texttt{stal} & \texttt{tool} \\ \hline \text{#1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{#2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \text{#3} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{array} \right) \]
The resulting matrix is filled with zeros because no word counts have been recorded yet. At this stage, the matrix only defines the structure of the representation. Each row corresponds to a document, and each column corresponds to a vocabulary term. The matrix will be populated with word frequencies in the next step.
This code fills the Bag-of-Words matrix by counting word occurrences. For each document (i), every token in the preprocessed sentence is located in the vocabulary using token_index, and the corresponding matrix entry is increased by one.
for i, sentence in enumerate(processed_corpus):
for token in sentence.split():
bow_matrix[i, token_index[token]] += 1
After this step, each row of bow_matrix contains the frequency of vocabulary terms in a document. Nonzero values indicate that a word appears in the document, while zeros indicate absence.
The resulting Bag-of-Words matrix is shown below:
bow_matrix
## array([[0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0.],
## [0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0.],
## [1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 1.]])
In matrix form:
\[ \mathbf{B} =\left( \begin{array}{c|cccccccccccccccc} \texttt{Text} & \texttt{ana} & \texttt{comp} & \texttt{compal} & \texttt{conec} & \texttt{dat} & \texttt{lear} & \texttt{mod} & \texttt{mdrn} & \texttt{pat} & \texttt{rel} & \texttt{sci} & \texttt{sta} & \texttt{stal} & \texttt{tool} \\ \hline \text{#1} & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ \text{#2} & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ \text{#3} & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \end{array} \right) \]
Each row represents a document and each column a vocabulary term; the entries indicate term frequencies. In this corpus the word data appears once in every document, so the dat column contains a 1 in each row; had it appeared twice in a document, the corresponding entry would be 2. Most entries remain zero, illustrating the sparsity typical of Bag-of-Words representations.
So far, we have considered only unigrams, meaning individual words. The same idea can be extended to:
Bigrams (pairs of consecutive words),
Trigrams, and
Higher-order n-grams.
Including n-grams allows the model to capture limited local contextual information, at the cost of increasing the dimensionality of the representation.
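As a small sketch of what this means in practice, consecutive word pairs can be generated with a simple comprehension (the sentence below is just an example):
sentence = "modern data analysis relies on computational tools"
words = sentence.split()
# Bigrams: pairs of consecutive words
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
print(bigrams)
# ['modern data', 'data analysis', 'analysis relies', 'relies on',
#  'on computational', 'computational tools']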
Do we really need to implement all of this manually?
Fortunately, no.
Modern NLP libraries provide efficient, well-tested implementations of Bag-of-Words models. In the next section, we introduce one such tool that automates this entire process.
CountVectorizer: understanding a basic procedure
Manually building a Bag-of-Words (BoW) matrix helps develop intuition, but it is rarely necessary in practice. Python provides efficient tools that automate this entire process, one of the most widely used being CountVectorizer from the scikit-learn library.
CountVectorizer transforms a collection of text documents into a document-term matrix, where each row represents a document, each column corresponds to a token in the learned vocabulary, and each cell contains the frequency of that token in the document. See Figure 4.2.
Figure 4.2: Bag-of-words - CountVectorizer. Source: Created by the author with ChatGPT (OpenAI)
Let us illustrate this with a small, self-contained example.
First, a CountVectorizer object is created using the default settings. The method fit_transform() then learns the vocabulary from the corpus and constructs the corresponding Bag-of-Words matrix in a single step.
from sklearn.feature_extraction.text import CountVectorizer
documents = [
"Data science relies on numerical methods",
"Text analysis uses vectors and matrices",
"Mathematical representations support data modeling"
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out()) # Output 1
print(bow_matrix.toarray()) # Output 2
The resulting output contains the learned vocabulary and the associated document–term matrix, which corresponds directly to the conceptual BoW construction discussed earlier.
The first output displays the learned vocabulary, that is, the set of unique tokens extracted from the corpus after preprocessing. Each term in this list corresponds to a column of the Bag-of-Words matrix, and the order shown here defines the column ordering used in the matrix representation.
## ['analysis' 'and' 'data' 'mathematical' 'matrices' 'methods' 'modeling'
## 'numerical' 'on' 'relies' 'representations' 'science' 'support' 'text'
## 'uses' 'vectors']
## [[0 0 1 0 0 1 0 1 1 1 0 1 0 0 0 0]
## [1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1]
## [0 0 1 1 0 0 1 0 0 0 1 0 1 0 0 0]]
The second output shows the Bag-of-Words matrix in dense form. Each row corresponds to one document and each column corresponds to one of the vocabulary terms listed above. The entries indicate how many times each token appears in each document.
To keep the notation compact, we label each token with a short abbreviation and report the corresponding document-term matrix below.
\[ \text{Token map:}\qquad \begin{array}{llllllll} \texttt{ana}=\texttt{analysis}, & \texttt{and}=\texttt{and}, & \texttt{dat}=\texttt{data}, & \texttt{math}=\texttt{mathematical}, \\ \texttt{mtx}=\texttt{matrices}, & \texttt{meth}=\texttt{methods}, & \texttt{model}=\texttt{modeling}, & \texttt{num}=\texttt{numerical}, \\ \texttt{on}=\texttt{on}, & \texttt{rel}=\texttt{relies}, & \texttt{repr}=\texttt{representations}, & \texttt{sci}=\texttt{science}, \\ \texttt{sup}=\texttt{support}, & \texttt{txt}=\texttt{text}, & \texttt{use}=\texttt{uses}, & \texttt{vec}=\texttt{vectors}. \end{array} \]
\[ \mathbf{B}= \left( \begin{array}{c|cccccccccccccccc} \texttt{Text} & \texttt{ana} & \texttt{and} & \texttt{dat} & \texttt{math} & \texttt{mtx} & \texttt{meth} & \texttt{model} & \texttt{num} & \texttt{on} & \texttt{rel} & \texttt{repr} & \texttt{sci} & \texttt{sup} & \texttt{txt} & \texttt{use} & \texttt{vec} \\ \hline \text{#1} & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ \text{#2} & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ \text{#3} & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ \end{array}\right) \]
For example, the token data appears once in the first and third documents, and does not appear in the second document. Similarly, the token analysis appears only in the second document, while numerical appears only in the first document. This sparsity pattern is typical of Bag-of-Words representations, especially as the vocabulary size grows.
The same matrix can be visualized as a heatmap, where darker cells indicate higher token counts.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
terms = vectorizer.get_feature_names_out()
X = bow_matrix.toarray() if hasattr(bow_matrix, "toarray") else bow_matrix
df_bow = pd.DataFrame(X, columns=terms)
plt.figure(figsize=(14,5));
ax = sns.heatmap(df_bow, cmap="Blues", cbar=True)
# --- Title and axis labels ---
ax.set_title("Bag-of-Words representation", fontsize=18, pad=10);
ax.set_xlabel("Vocabulary terms", fontsize=18);
ax.set_ylabel("Documents", fontsize=18);
# --- Tick labels ---
ax.tick_params(axis="x", labelsize=14, rotation=45)
ax.tick_params(axis="y", labelsize=14)
# --- Colorbar font size ---
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=14)
plt.tight_layout() # Prevents cropping
plt.show()
This example reproduces the Bag-of-Words representation introduced earlier, but now using CountVectorizer to automatically perform tokenization, vocabulary construction, and word counting. This allows the reader to focus on interpreting the document–term matrix itself, rather than on the low-level implementation details of the Bag-of-Words process.
Importantly, CountVectorizer provides several arguments that allow this basic representation to be refined and controlled, such as vocabulary size limits and document-frequency thresholds. These arguments, illustrated in Figure 4.3, will be introduced conceptually here and implemented in detail in the following sections.
Figure 4.3: Bag-of-words - CountVectorizer arguments. Source: Created by the author with ChatGPT (OpenAI)
CountVectorizer: out-of-the-box features
Beyond basic word counts, CountVectorizer includes several built-in options that make it flexible and practical for real-world applications.
We now explore some of the most commonly used features.
By default, CountVectorizer learns its vocabulary directly from the data. However, it can also:
Apply tokenization internally,
Remove stopwords automatically, and
Generate n-grams without additional code.
In the following example, the argument ngram_range = (1, 3) instructs the vectorizer to include unigrams, bigrams, and trigrams, that is, single words, pairs of consecutive words, and sequences of three consecutive words.
First, a CountVectorizer object is created with the specified n-gram range. The method fit_transform() then learns the vocabulary from the corpus and constructs the corresponding Bag-of-Words matrix, where each column represents an n-gram and each row represents a document.
vectorizer_ngram = CountVectorizer(ngram_range=(1, 3))
bow_ngram = vectorizer_ngram.fit_transform(documents)
print(vectorizer_ngram.get_feature_names_out()) # Output 1
print(bow_ngram.toarray()) # Output 2
The first output displays the learned n-gram vocabulary. As a result, terms such as analysis (unigram), text analysis (bigram), and text analysis uses (trigram) coexist as distinct features in the representation. The order shown here defines the column ordering of the Bag-of-Words matrix.
## ['analysis' 'analysis uses' 'analysis uses vectors' 'and' 'and matrices'
## 'data' 'data modeling' 'data science' 'data science relies'
## 'mathematical' 'mathematical representations'
## 'mathematical representations support' 'matrices' 'methods' 'modeling'
## 'numerical' 'numerical methods' 'on' 'on numerical'
## 'on numerical methods' 'relies' 'relies on' 'relies on numerical'
## 'representations' 'representations support'
## 'representations support data' 'science' 'science relies'
## 'science relies on' 'support' 'support data' 'support data modeling'
## 'text' 'text analysis' 'text analysis uses' 'uses' 'uses vectors'
## 'uses vectors and' 'vectors' 'vectors and' 'vectors and matrices']
To facilitate later reference and discussion, the learned vocabulary is listed below as an indexed sequence. This enumeration will be used in subsequent sections to illustrate how n-gram features are filtered, selected, or weighted when applying additional arguments of CountVectorizer.
| Token ID | Token (as learned) | Unigram | Bigram | Trigram |
|---|---|---|---|---|
| 1 | analysis | ✓ | ||
| 2 | analysis uses | ✓ | ||
| 3 | analysis uses vectors | ✓ | ||
| 4 | and | ✓ | ||
| 5 | and matrices | ✓ | ||
| 6 | data | ✓ | ||
| 7 | data modeling | ✓ | ||
| 8 | data science | ✓ | ||
| 9 | data science relies | ✓ | ||
| 10 | mathematical | ✓ | ||
| 11 | mathematical representations | ✓ | ||
| 12 | mathematical representations support | ✓ | ||
| 13 | matrices | ✓ | ||
| 14 | methods | ✓ | ||
| 15 | modeling | ✓ | ||
| 16 | numerical | ✓ | ||
| 17 | numerical methods | ✓ | ||
| 18 | on | ✓ | ||
| 19 | on numerical | ✓ | ||
| 20 | on numerical methods | ✓ | ||
| 21 | relies | ✓ | ||
| 22 | relies on | ✓ | ||
| 23 | relies on numerical | ✓ | ||
| 24 | representations | ✓ | ||
| 25 | representations support | ✓ | ||
| 26 | representations support data | ✓ | ||
| 27 | science | ✓ | ||
| 28 | science relies | ✓ | ||
| 29 | science relies on | ✓ | ||
| 30 | support | ✓ | ||
| 31 | support data | ✓ | ||
| 32 | support data modeling | ✓ | ||
| 33 | text | ✓ | ||
| 34 | text analysis | ✓ | ||
| 35 | text analysis uses | ✓ | ||
| 36 | uses | ✓ | ||
| 37 | uses vectors | ✓ | ||
| 38 | uses vectors and | ✓ | ||
| 39 | vectors | ✓ | ||
| 40 | vectors and | ✓ | ||
| 41 | vectors and matrices | ✓ |
This example shows that the learned vocabulary contains tokens of different lengths. For instance:
Tokens 1, 6, 10, 16, and 33 correspond to unigrams.
Tokens 2, 7, 8, 17, and 34 correspond to bigrams.
Tokens 3, 12, 20, 35, and 38 correspond to trigrams.
These differences arise solely from the chosen ngram_range; the counting logic of the Bag-of-Words representation remains unchanged, only the set of features grows.
The second output shows the Bag-of-Words matrix constructed using the n-gram vocabulary. Each row corresponds to a document, each column corresponds to a specific n-gram, and the value in each cell indicates how many times that n-gram appears in the document.
## [[0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0
## 0 0 0 0 0]
## [1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## 1 1 1 1 1]
## [0 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 0
## 0 0 0 0 0]]
For readability, the Bag-of-Words matrix is presented in two blocks, corresponding to tokens 1–21 and 22–41:
| T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15 | T16 | T17 | T18 | T19 | T20 | T21 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc.1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| Doc.2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc.3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| T22 | T23 | T24 | T25 | T26 | T27 | T28 | T29 | T30 | T31 | T32 | T33 | T34 | T35 | T36 | T37 | T38 | T39 | T40 | T41 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doc.1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Doc.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Doc.3 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
A value of 1 at position (Doc. i, Token Tj) indicates that token j appears once in document i; a value of 0 indicates it does not appear. For example,
Token 8 → "data science" → value 1 in document 1
Token 8 → "data science" → value 0 in document 2
Token 8 → "data science" → value 0 in document 3
In this case, a value of 1 in the column associated with data science means that this bigram appears once in the corresponding document, whereas a value of 0 indicates that it does not appear.
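This can be verified programmatically: the learned vocabulary_ dictionary gives the column index of any n-gram, and that column of the matrix holds its per-document counts. A small sketch using the objects defined above:
# Column index assigned to the bigram "data science"
col = vectorizer_ngram.vocabulary_["data science"]
print(col)   # 7 (token 8 in the table above, 0-based indexing)
# Its counts across the three documents
print(bow_ngram.toarray()[:, col])   # expected: [1 0 0]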
CountVectorizer: limiting vocabulary size with max_features
As the vocabulary grows, the dimensionality of document vectors increases accordingly. Very high-dimensional representations may reduce computational efficiency and harm generalization, a phenomenon commonly referred to as the curse of dimensionality.
To address this issue, CountVectorizer provides the max_features argument, which restricts the vocabulary to the most frequent tokens observed in the corpus.
In the following example, the vocabulary is limited to the five most frequent unigrams or bigrams in the corpus.
First, a CountVectorizer object is created with ngram_range = (1, 2) to extract both unigrams and bigrams. The argument max_features = 5 restricts the vocabulary to the five most frequent tokens (ranked by total term frequency across the corpus). The method fit_transform() then learns this reduced vocabulary and constructs the corresponding Bag-of-Words matrix.
vectorizer_limited = CountVectorizer(
ngram_range=(1, 2),
max_features=5
)
bow_limited = vectorizer_limited.fit_transform(documents)
print(vectorizer_limited.get_feature_names_out()) # Output 1
print(bow_limited.toarray()) # Output 2
The first output displays the reduced vocabulary, consisting of the five most frequent unigrams or bigrams retained after applying the max_features constraint. The order shown here defines the column ordering of the Bag-of-Words matrix.
## ['analysis' 'analysis uses' 'and' 'and matrices' 'data']
The resulting vocabulary (Output 1) defines the columns of the Bag-of-Words matrix (Output 2), in the exact order shown above.
## [[0 0 0 0 1]
## [1 1 1 1 0]
## [0 0 0 0 1]]
Each row corresponds to a document and each column corresponds to one of the selected n-grams. The entries represent term frequencies. Formally, the matrix can be written as \[ \mathbf{B} = (b_{ij}), \qquad b_{ij} = \text{frequency of n-gram } j \text{ in document } i, \] where the columns correspond to \[ (\texttt{analysis},\ \texttt{analysis uses},\ \texttt{and},\ \texttt{and matrices},\ \texttt{data}). \]
That is, the Bag-of-Words matrix can be written explicitly as \[ \mathbf{B} = \begin{array}{c|ccccc} & \texttt{analysis} & \texttt{analysis uses} & \texttt{and} & \texttt{and matrices} & \texttt{data} \\ \hline \text{Document 1} & 0 & 0 & 0 & 0 & 1 \\ \text{Document 2} & 1 & 1 & 1 & 1 & 0 \\ \text{Document 3} & 0 & 0 & 0 & 0 & 1 \end{array} \]
The value \(b_{25} = 0\) indicates that the token data does not appear in the second document, while \(b_{15} = 1\) indicates that it appears once in the first document.
CountVectorizer: filtering tokens with min_df and max_df thresholds
Not all tokens contribute equally to the representation. Some terms appear in almost every document and therefore provide little discriminatory power, while others appear only once and may be overly specific or noisy.
The parameters min_df and max_df allow filtering tokens based on document frequency:
min_df removes terms that appear in fewer than a specified number (or proportion) of documents.
max_df removes terms that appear in more than a specified proportion of documents.
To see how these thresholds operate, we first compute the document frequency (df) of every token manually. The vectorizer below uses min_df=1 so that nothing is filtered out, and the resulting table indicates which tokens would be kept under min_df = 2.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Vectorizer with all n-grams kept (no filtering yet)
vectorizer_limited = CountVectorizer(
    ngram_range=(1, 3),
    min_df=1  # important: document frequency is computed manually below
)
bow = vectorizer_limited.fit_transform(documents)
# 1) Learned tokens
tokens = vectorizer_limited.get_feature_names_out()
# 2) Bag-of-Words matrix
B = bow.toarray()
# 3) Document frequency (df): number of documents containing each token
df = (B > 0).sum(axis=0)
# 4) Number of words in each token
n_words = np.array([len(t.split()) for t in tokens])
# 5) Build the summary table
table_df = pd.DataFrame({
    "Token ID": np.arange(1, len(tokens) + 1),
    "Token (as learned)": tokens,
    "Unigram": (n_words == 1).astype(int),
    "Bigram": (n_words == 2).astype(int),
    "Trigram": (n_words == 3).astype(int),
    "df": df,
    "Kept (min_df = 2)": np.where(df >= 2, "✓", "X")
})
# 6) Replace 1/0 with ✓ / blank (more readable)
for col in ["Unigram", "Bigram", "Trigram"]:
    table_df[col] = table_df[col].replace({1: "✓", 0: ""})
table_df
#print(table_df.to_markdown(index=False))
## Token ID Token (as learned) ... df Kept (min_df = 2)
## 0 1 analysis ... 1 X
## 1 2 analysis uses ... 1 X
## 2 3 analysis uses vectors ... 1 X
## 3 4 and ... 1 X
## 4 5 and matrices ... 1 X
## 5 6 data ... 2 ✓
## 6 7 data modeling ... 1 X
## 7 8 data science ... 1 X
## 8 9 data science relies ... 1 X
## 9 10 mathematical ... 1 X
## 10 11 mathematical representations ... 1 X
## 11 12 mathematical representations support ... 1 X
## 12 13 matrices ... 1 X
## 13 14 methods ... 1 X
## 14 15 modeling ... 1 X
## 15 16 numerical ... 1 X
## 16 17 numerical methods ... 1 X
## 17 18 on ... 1 X
## 18 19 on numerical ... 1 X
## 19 20 on numerical methods ... 1 X
## 20 21 relies ... 1 X
## 21 22 relies on ... 1 X
## 22 23 relies on numerical ... 1 X
## 23 24 representations ... 1 X
## 24 25 representations support ... 1 X
## 25 26 representations support data ... 1 X
## 26 27 science ... 1 X
## 27 28 science relies ... 1 X
## 28 29 science relies on ... 1 X
## 29 30 support ... 1 X
## 30 31 support data ... 1 X
## 31 32 support data modeling ... 1 X
## 32 33 text ... 1 X
## 33 34 text analysis ... 1 X
## 34 35 text analysis uses ... 1 X
## 35 36 uses ... 1 X
## 36 37 uses vectors ... 1 X
## 37 38 uses vectors and ... 1 X
## 38 39 vectors ... 1 X
## 39 40 vectors and ... 1 X
## 40 41 vectors and matrices ... 1 X
##
## [41 rows x 7 columns]
In the following example, only n-grams that appear in at least two documents and in no more than 80% of the corpus are retained.
First, a CountVectorizer object is created with ngram_range = (1, 3) to extract unigrams, bigrams, and trigrams. The arguments min_df = 2 and max_df = 0.8 filter tokens based on document frequency: terms must appear in at least two documents, but not in more than 80% of the corpus.
The method fit_transform() then learns the filtered vocabulary and constructs the corresponding Bag-of-Words matrix.
vectorizer_df = CountVectorizer(
ngram_range=(1, 3),
min_df=2,
max_df=0.8
)
bow_df = vectorizer_df.fit_transform(documents)
print(vectorizer_df.get_feature_names_out()) # Output 1
print(bow_df.toarray()) # Output 2
The first output shows the filtered vocabulary. In this case, only the token data satisfies both frequency constraints.
## ['data']
The second output displays the resulting Bag-of-Words matrix. Since only one token is retained, the matrix has a single column, and each row indicates whether the token appears in the corresponding document.
## [[1]
## [0]
## [1]]
These thresholds provide a simple yet effective way to control which tokens enter the representation and to reduce noise in high-dimensional text data.
Despite its simplicity and interpretability, the Bag-of-Words model has important limitations.
First, it relies exclusively on token counts, ignoring word order and syntactic structure. As a result, sentences with very different meanings may receive similar representations.
Second, BoW does not capture semantic relationships. Words with related meanings are treated as entirely independent dimensions.
Third, large vocabularies can lead to extremely high-dimensional vectors, which may degrade performance and increase computational cost.
These limitations motivate more refined representations that adjust token importance and incorporate contextual information. One such approach (TF-IDF weighting) is introduced in the next section.
In the previous section, documents were represented using raw word counts through the Bag-of-Words model. While this approach is intuitive, it treats all tokens equally and relies solely on their frequency within each document.
As a result, terms that appear very often across the corpus may dominate the representation, while less frequent but potentially informative terms receive little weight or are discarded altogether. This can lead to a loss of relevant patterns, especially when rare terms are crucial for distinguishing documents.
The Term Frequency–Inverse Document Frequency (TF–IDF) scheme addresses this limitation by re-weighting tokens according to both their local importance within a document and their global distribution across the corpus.
TF–IDF is widely used in information retrieval, search engines, and text mining applications. Like BoW, it is a frequency-based representation, but it incorporates an additional normalization mechanism that balances common and rare terms.
The term frequency component measures how often a word appears in a specific document. However, since documents may vary in length, raw counts are typically normalized. A common normalized definition of term frequency is:
\[ TF(w) = \frac{\text{Number of times the word } w \text{ occurs in a document}}{\text{Total number of words in the document}} \]
This normalization prevents longer documents from automatically assigning higher importance to all their terms.
To build intuition, the next figure shows normalized TF for a few example tokens inside a single document (so longer documents do not automatically inflate importance).
import numpy as np
import matplotlib.pyplot as plt
# --- Simulated document term counts (Document d1) ---
terms = ["data", "analysis", "model", "the", "and", "python"]
counts_d1 = np.array([6, 3, 2, 10, 8, 1]) # raw counts in document d1
tf_d1 = counts_d1 / counts_d1.sum() # normalized TF
plt.figure()
plt.bar(terms, tf_d1)
plt.title("Term Frequency (TF) in a single document")
plt.xlabel("Token")
plt.ylabel("TF (normalized frequency)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
library(ggplot2)
tf_df <- data.frame(
token = c("data", "analysis", "model", "the", "and", "python"),
count = c(6, 3, 2, 10, 8, 1)
)
tf_df$TF <- tf_df$count / sum(tf_df$count)
ggplot(tf_df, aes(x = token, y = TF)) +
geom_col(fill = "steelblue") +
labs(
title = "Term Frequency (TF) in a single document",
x = "Token",
y = "TF (normalized frequency)"
) +
theme_minimal()
TF measures local importance: tokens that occur more often within the document receive larger TF values, but normalization keeps TF comparable across documents of different lengths.
While TF captures local relevance, it does not account for how informative a word is across the entire corpus. Words that appear in almost every document (such as general or domain-wide terms) may not be useful for discrimination.
The inverse document frequency component down-weights such ubiquitous terms and amplifies words that occur in fewer documents:
\[ IDF(w) = \log\left(\frac{N}{df(w)}\right) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word } w}\right) \]
where:
\(N\) is the total number of documents, and
\(df(w)\) is the number of documents containing word \(w\).
TF alone does not capture how informative a token is across the corpus. The next figure shows how IDF decreases as a token appears in more documents.
import numpy as np
import matplotlib.pyplot as plt
# --- Simulated corpus size and document frequencies ---
N = 10 # total documents
df = np.arange(1, N+1) # df(w) = 1..N
idf = np.log(N / df)          # classic IDF definition introduced above
plt.figure()
plt.plot(df, idf, marker="o")
plt.title("Inverse Document Frequency (IDF) vs. document frequency")
plt.xlabel("Document frequency df(w)")
plt.ylabel("IDF(w) = log(N / df(w))")
plt.xticks(df)
plt.tight_layout()
plt.show()
idf_df <- data.frame(df = 1:10)
N <- 10
idf_df$IDF <- log(N / idf_df$df)
ggplot(idf_df, aes(x = df, y = IDF)) +
geom_line(color = "steelblue", size=1) +
geom_point(color = "steelblue", size=2.5) +
scale_x_continuous(breaks = 1:10) +
labs(
title = "Inverse Document Frequency (IDF)",
x = "Document frequency df(w)",
y = "IDF(w) = log(N / df(w))"
) +
theme_minimal()
Tokens that occur in many documents (high df) have low IDF, because they help less to distinguish documents. Tokens that occur in few documents have higher IDF.
The final TF-IDF weight of a word \(w\) in document \(d\) is obtained by combining two components:
\[ \text{weight}(w,d) = TF(w,d) \times IDF(w) \]
This formulation assigns higher weights to terms that are frequent within a document but relatively rare across the corpus.
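As a worked example (with made-up numbers): suppose a word occurs 3 times in a 100-word document and appears in 2 of the 10 documents in the corpus. Then
\[ TF = \frac{3}{100} = 0.03, \qquad IDF = \log\left(\frac{10}{2}\right) \approx 1.609, \qquad \text{weight} = 0.03 \times 1.609 \approx 0.048. \]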
Even when a term appears exactly once in every document, its TF-IDF weight is not necessarily identical across documents.
This occurs because the term frequency (TF) component is normalized by the total number of tokens in each document. Consequently, documents of different lengths assign different relative importance to the same term.
In addition, TF-IDF vectors are normalized by default using the \(L_2\) norm. This means that each document vector is rescaled to have unit length, further modifying the final weights. As a result, two documents may share the same vocabulary and identical raw term counts, yet still differ in their TF-IDF representations.
The next plot illustrates the combined effect: TF–IDF becomes large when a token is frequent in a document (high TF) and rare in the corpus (high IDF).
import numpy as np
import matplotlib.pyplot as plt
# --- Simulated TF (from a document) and IDF (from the corpus) for several tokens ---
tokens = ["data", "analysis", "model", "the", "and"]
tf = np.array([0.18, 0.12, 0.08, 0.30, 0.20]) # local frequencies (normalized)
idf = np.array([1.0, 1.4, 1.8, 0.1, 0.2]) # global rarity (higher = rarer)
tfidf = tf * idf
plt.figure()
plt.bar(tokens, tfidf)
plt.title("TF-IDF weights in a document (simulated)")
plt.xlabel("Token")
plt.ylabel("TF-IDF = TF × IDF")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()
tfidf_df <- data.frame(
token = c("data", "analysis", "model", "the", "and"),
TF = c(0.18, 0.12, 0.08, 0.30, 0.20),
IDF = c(1.0, 1.4, 1.8, 0.1, 0.2)
)
tfidf_df$TFIDF <- tfidf_df$TF * tfidf_df$IDF
ggplot(tfidf_df, aes(x = token, y = TFIDF)) +
geom_col(fill = "steelblue") +
labs(
title = "TF-IDF weights in a document (simulated)",
x = "Token",
y = "TF-IDF = TF × IDF"
) +
theme_minimal()
A token can have a high TF but still receive a small TF–IDF weight if its IDF is low (e.g., very common words). TF–IDF emphasizes tokens that are both locally frequent and globally informative.
Finally, the figure below visualizes TF and IDF jointly; TF-IDF is shown by the point size (larger = higher TF-IDF).
import numpy as np
import matplotlib.pyplot as plt
tokens = np.array(["data", "analysis", "model", "the", "and", "python", "science"])
tf = np.array([0.18, 0.12, 0.08, 0.30, 0.20, 0.05, 0.07])
idf = np.array([1.0, 1.4, 1.8, 0.1, 0.2, 2.0, 1.6])
tfidf = tf * idf
plt.figure()
plt.scatter(tf, idf, s=2500*tfidf) # point size proportional to TF–IDF
for x, y, t in zip(tf, idf, tokens):
plt.text(x, y, f" {t}", va="center")
plt.title("TF–IDF as an interaction of TF and IDF (size = TF–IDF)")
plt.xlabel("TF (within-document frequency)")
plt.ylabel("IDF (corpus rarity)")
plt.tight_layout()
plt.show()
rel_df <- data.frame(
token = c("data", "analysis", "model", "the", "and", "python", "science"),
TF = c(0.18, 0.12, 0.08, 0.30, 0.20, 0.05, 0.07),
IDF = c(1.0, 1.4, 1.8, 0.1, 0.2, 2.0, 1.6)
)
rel_df$TFIDF <- rel_df$TF * rel_df$IDF
ggplot(rel_df, aes(x = TF, y = IDF, size = TFIDF)) +
geom_point(color = "steelblue", alpha = 0.7) +
geom_text(aes(label = token), hjust = -0.1, vjust = 0.5) +
labs(
title = "TF–IDF as an interaction of TF and IDF",
x = "TF (within-document frequency)",
y = "IDF (corpus rarity)",
size = "TF–IDF"
) +
theme_minimal()
The largest points appear where TF and IDF are simultaneously high. This makes TF–IDF easy to interpret as an interaction: a token is most important when it is frequent in the document but uncommon in the corpus.
In practice, TF–IDF representations are computed efficiently using the TfidfVectorizer class from scikit-learn, which combines term frequency normalization and inverse document frequency weighting in a single step.
To keep the example simple and self-contained, consider the following small collection of documents.
First, a TfidfVectorizer object is created using the default settings, which include \(L_2\) normalization. The method fit_transform() learns the vocabulary from the corpus and computes the TF–IDF matrix simultaneously, producing a numerical representation of the documents.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"Statistical models rely on numerical features",
"Text representations are built using vectors",
"Feature weighting improves document comparison"
]
vectorizer = TfidfVectorizer()
tf_idf_matrix = vectorizer.fit_transform(documents)
The learned vocabulary and the resulting TF–IDF matrix can be inspected as follows:
print(vectorizer.get_feature_names_out()) # Output 1
print(tf_idf_matrix.toarray()) # Output 2
print("Matrix shape:", tf_idf_matrix.shape) # Output 3
The three outputs correspond, respectively, to the learned vocabulary, the TF–IDF matrix expressed in dense form for inspection, and the dimensions of the resulting representation.
The first output displays the learned vocabulary, that is, the set of unique terms extracted from the corpus after preprocessing. Each element in this array corresponds to a column of the TF–IDF matrix, and the order shown here defines the column ordering used in the matrix representation.
## ['are' 'built' 'comparison' 'document' 'feature' 'features' 'improves'
## 'models' 'numerical' 'on' 'rely' 'representations' 'statistical' 'text'
## 'using' 'vectors' 'weighting']
The second output shows the TF–IDF matrix itself, expressed in dense form for inspection. Each row corresponds to a document, each column corresponds to a term in the learned vocabulary, and each entry represents the TF–IDF weight assigned to that term in the corresponding document.
## [[0. 0. 0. 0. 0. 0.40824829
## 0. 0.40824829 0.40824829 0.40824829 0.40824829 0.
## 0.40824829 0. 0. 0. 0. ]
## [0.40824829 0.40824829 0. 0. 0. 0.
## 0. 0. 0. 0. 0. 0.40824829
## 0. 0.40824829 0.40824829 0.40824829 0. ]
## [0. 0. 0.4472136 0.4472136 0.4472136 0.
## 0.4472136 0. 0. 0. 0. 0.
## 0. 0. 0. 0. 0.4472136 ]]
The final output reports the dimensions of the matrix. In this example, the matrix has three rows (one per document) and seventeen columns (one per vocabulary term), confirming the correspondence between the corpus size and the learned vocabulary.
## Matrix shape: (3, 17)
The vocabulary remains comparable to that of CountVectorizer, but the entries now represent TF-IDF weights rather than raw frequencies.
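The per-term IDF weights learned by the vectorizer can be inspected through its idf_ attribute. Note that, by default, scikit-learn uses a smoothed variant of the formula introduced earlier, \(\log\frac{1+N}{1+df(w)} + 1\), so the values differ slightly from the plain \(\log(N/df(w))\) definition. A brief sketch using the fitted vectorizer above:
import pandas as pd
idf_weights = pd.Series(
    vectorizer.idf_,
    index=vectorizer.get_feature_names_out()
).sort_values()
print(idf_weights.head())   # here every term occurs in a single document, so all IDF values coincide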
Normalization ensures that document vectors are comparable in magnitude, which is particularly important for similarity measures.
By default, each TF-IDF document vector \(\mathbf{x} = (x_1, x_2, \dots, x_p)\) is normalized to have unit length using the \(L_2\) norm, defined as \[ \|\mathbf{x}\|_2 = \sqrt{\sum_{j=1}^{p} x_j^2} \]
Under this normalization, the vector is rescaled so that \(\|\mathbf{x}\|_2 = 1\), emphasizing relative term contributions rather than document length.
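This can be checked directly on the matrix computed above: each row of the default TF-IDF representation has Euclidean norm equal to one (a quick sketch):
import numpy as np
row_norms = np.linalg.norm(tf_idf_matrix.toarray(), axis=1)
print(row_norms)   # approximately [1. 1. 1.]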
Alternatively, the \(L_1\) norm can be used, which is defined as \[ \|\mathbf{x}\|_1 = \sum_{j=1}^{p} |x_j| \]
In this case, the vector is rescaled so that \(\|\mathbf{x}\|_1 = 1\), allowing the TF-IDF weights to be interpreted as relative proportions within each document.
The following example illustrates TF–IDF computation using \(L_1\) normalization.
First, a TfidfVectorizer object is created with the argument norm="l1", which specifies that each document vector will be normalized so that the sum of the absolute TF–IDF weights equals one. The method fit_transform() then learns the vocabulary from the corpus and computes the corresponding TF–IDF matrix in a single step.
The three outputs display, respectively, the learned vocabulary, the TF–IDF matrix with \(L_1\) normalization applied, and the dimensions of the resulting representation.
vectorizer_l1 = TfidfVectorizer(norm="l1")
tfidf_l1 = vectorizer_l1.fit_transform(documents)
print(vectorizer_l1.get_feature_names_out()) # Output 1
print(tfidf_l1.toarray()) # Output 2
print("Matrix shape:", tfidf_l1.shape) # Output 3
The first output displays the learned vocabulary. As before, each term corresponds to a column of the TF–IDF matrix, and the order shown here defines the column ordering used in the matrix representation. The vocabulary itself is unchanged by the choice of normalization.
## ['are' 'built' 'comparison' 'document' 'feature' 'features' 'improves'
## 'models' 'numerical' 'on' 'rely' 'representations' 'statistical' 'text'
## 'using' 'vectors' 'weighting']
The second output shows the TF–IDF matrix with \(L_1\) normalization applied. Each row corresponds to a document and each column to a term in the vocabulary. Under \(L_1\) normalization, the values in each row sum to one, so the entries can be interpreted as relative weights of terms within the document. For example, in the first document, the nonzero entries are all equal, indicating that the retained terms contribute equally to the total TF–IDF weight of that document.
## [[0. 0. 0. 0. 0. 0.16666667
## 0. 0.16666667 0.16666667 0.16666667 0.16666667 0.
## 0.16666667 0. 0. 0. 0. ]
## [0.16666667 0.16666667 0. 0. 0. 0.
## 0. 0. 0. 0. 0. 0.16666667
## 0. 0.16666667 0.16666667 0.16666667 0. ]
## [0. 0. 0.2 0.2 0.2 0.
## 0.2 0. 0. 0. 0. 0.
## 0. 0. 0. 0. 0.2 ]]
The final output reports the dimensions of the matrix. In this example, the matrix has three rows (one per document) and seventeen columns (one per vocabulary term), confirming that normalization affects the scale of the weights, but not the structure of the representation.
## Matrix shape: (3, 17)
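Analogously, under \(L_1\) normalization the entries of each row sum to one, which can be verified directly on the matrix computed above (a quick check):
row_sums = tfidf_l1.toarray().sum(axis=1)
print(row_sums)   # approximately [1. 1. 1.]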
As with Bag-of-Words representations, the TF–IDF vectorizer supports the use of n-grams as well as constraints on vocabulary size. This allows short phrases to be incorporated into the representation while keeping dimensionality under control.
In the following example, the representation is restricted to the six most frequent features among unigrams, bigrams, and trigrams. The argument ngram_range = (1, 3) enables the extraction of n-grams up to length three, while max_features = 6 limits the vocabulary size. The default \(L_2\) normalization is applied.
vectorizer_ngram = TfidfVectorizer(
    ngram_range=(1, 3),
    max_features=6,
    norm="l2"
)
tfidf_ngram = vectorizer_ngram.fit_transform(documents)
print(vectorizer_ngram.get_feature_names_out())
print(tfidf_ngram.toarray())
print("Matrix shape:", tfidf_ngram.shape)
The first output displays the learned n-gram vocabulary, restricted to six features. In this case, all retained features are unigrams, bigrams, and trigrams derived from the phrase “are built using vectors”. Each element in this list defines a column of the TF–IDF matrix, and the order shown here determines the column ordering.
## ['are' 'are built' 'are built using' 'built' 'built using'
## 'built using vectors']
The second output shows the TF–IDF matrix constructed using the restricted n-gram vocabulary. Each row corresponds to a document and each column corresponds to one of the selected n-grams.
## [[0. 0. 0. 0. 0. 0. ]
## [0.40824829 0.40824829 0.40824829 0.40824829 0.40824829 0.40824829]
## [0. 0. 0. 0. 0. 0. ]]
In this example, only the second document contains the retained n-grams, which explains why its row has nonzero TF–IDF values, while the first and third documents are represented by zero vectors.
Because \(L_2\) normalization is applied, the nonzero row has unit Euclidean norm, and the TF-IDF weights are evenly distributed across the six retained features.
The final output reports the dimensions of the matrix. Here, the matrix has three rows (one per document) and six columns (one per retained n-gram), confirming that max_features directly controls the dimensionality of the TF-IDF representation.
## Matrix shape: (3, 6)
The parameters min_df and max_df are also available for TF–IDF vectorizers and behave identically to those in CountVectorizer, allowing extremely rare or overly common terms to be excluded based on document frequency.
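The sketch below illustrates these parameters. Because no term in the three-document corpus above occurs in more than one document, a threshold such as min_df=2 would prune its entire vocabulary; the example therefore uses a small hypothetical corpus (docs_demo) purely to illustrate the interface.
# Small hypothetical corpus, used only to illustrate document-frequency filtering
docs_demo = [
    "text vectors support text comparison",
    "text vectors are sparse",
    "sparse vectors are efficient",
]

# min_df=2 keeps terms appearing in at least two documents;
# max_df=0.9 drops terms appearing in more than 90% of documents
vectorizer_df = TfidfVectorizer(min_df=2, max_df=0.9)
tfidf_df = vectorizer_df.fit_transform(docs_demo)
print(vectorizer_df.get_feature_names_out())
print("Matrix shape:", tfidf_df.shape)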
TF–IDF improves upon raw word counts by adjusting token importance using corpus-level statistics. It remains computationally efficient and highly interpretable.
However, TF–IDF still operates purely at the lexical level and therefore does not capture:
Semantic similarity between words,
Contextual meaning,
Word order or co-occurrence structure, or
Positional information within documents.
Figure 5.1 summarizes these four limitations with simple examples.
Figure 5.1: Limitations of the TF-IDF representation. Source: Created by the author with ChatGPT (OpenAI)
Like BoW, TF–IDF representations also scale with vocabulary size, which can become problematic for very large corpora.
These limitations motivate the use of similarity measures (such as cosine similarity) and more expressive representation learning techniques, which are explored in the following sections.
Once documents have been represented as vectors, a natural question arises:
How can we quantify how similar or dissimilar two text documents are?
If two documents use similar words with comparable distributions, it is reasonable to expect that they convey related information. In this section, we introduce cosine similarity, a geometric measure widely used to compare document vectors derived from Bag-of-Words and TF-IDF representations.
Cosine similarity measures the orientation of two vectors in a vector space by computing the cosine of the angle between them. Unlike distance-based measures, it is insensitive to vector magnitude and instead focuses on direction.
Two vectors are considered similar when they point in nearly the same direction, even if their lengths differ. This property is especially useful in text analysis, where vector magnitude is often influenced by document length.
For two vectors \(\mathbf{A}\) and \(\mathbf{B}\), cosine similarity is defined as:
\[ \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} \]
where \(\mathbf{A} \cdot \mathbf{B}\) denotes the dot product, and \(\|\mathbf{A}\|\), \(\|\mathbf{B}\|\) represent the Euclidean norms of the vectors.
Expanding this expression yields:
\[ \cos(\theta) = \frac{\sum_{i=1}^{N} w_{iA} \, w_{iB}} {\sqrt{\sum_{i=1}^{N} w_{iA}^2} \, \sqrt{\sum_{i=1}^{N} w_{iB}^2}} \]
where \(w_{iA}\) and \(w_{iB}\) denote the weights of vectors \(A\) and \(B\) along the \(i\)-th dimension in an \(N\)-dimensional space.
In theory, cosine similarity ranges from \(-1\) to \(+1\). However, in most NLP applications using BoW or TF–IDF (where vector components are non-negative), cosine similarity lies in the interval \([0, 1]\).
Consider two documents represented by count-based vectors:
\[ \mathbf{d}_1 = (4, 1, 2, 0, 3, 0, 1, 0) \quad \text{and} \quad \mathbf{d}_2 = (2, 0, 1, 1, 2, 1, 0, 0) \]
The cosine similarity between them is:
\[ \cos(\mathbf{d}_1, \mathbf{d}_2) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\|\mathbf{d}_1\| \, \|\mathbf{d}_2\|} \]
First, compute the dot product:
\[ \mathbf{d}_1 \cdot \mathbf{d}_2 = 4\cdot2 + 1\cdot0 + 2\cdot1 + 0\cdot1 + 3\cdot2 + 0\cdot1 + 1\cdot0 + 0\cdot0 = 16 \]
Next, compute the vector norms:
\[ \|\mathbf{d}_1\| = \sqrt{4^2 + 1^2 + 2^2 + 3^2 + 1^2} = \sqrt{31} \approx 5.57 \]
\[ \|\mathbf{d}_2\| = \sqrt{2^2 + 1^2 + 1^2 + 2^2 + 1^2} = \sqrt{11} \approx 3.32 \]
Finally:
\[ \cos(\mathbf{d}_1, \mathbf{d}_2) = \frac{16}{5.57 \times 3.32} \approx 0.87 \]
A cosine similarity of \(0.87\) indicates a strong similarity between the two documents.
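This calculation can be checked numerically with NumPy, using the two vectors from the worked example above:
d1 = np.array([4, 1, 2, 0, 3, 0, 1, 0])
d2 = np.array([2, 0, 1, 1, 2, 1, 0, 0])

# dot product = 16, norms = sqrt(31) and sqrt(11)
cos_d1_d2 = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_d1_d2, 2))   # approximately 0.87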
The following function computes cosine similarity between two numeric vectors:
import numpy as np
def cosine_similarity(vec1, vec2):
    # Convert the inputs to NumPy arrays and return the cosine of the angle between them
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
This function can be applied directly to document vectors produced by different vectorization techniques.
Using the document–term matrix generated by CountVectorizer, cosine similarity can be computed for every document pair:
for i in range(bow_matrix.shape[0]):
    for j in range(i + 1, bow_matrix.shape[0]):
        sim = cosine_similarity(
            bow_matrix.toarray()[i],
            bow_matrix.toarray()[j]
        )
        print(f"Cosine similarity between documents {i} and {j}: {sim:.3f}")
## Cosine similarity between documents 0 and 1: 0.000
## Cosine similarity between documents 0 and 2: 0.183
## Cosine similarity between documents 1 and 2: 0.000
The resulting values indicate which document pairs are more closely related based on raw term counts.
The same procedure can be applied to TF–IDF vectors:
for i in range(tf_idf_matrix.shape[0]):
    for j in range(i + 1, tf_idf_matrix.shape[0]):
        sim = cosine_similarity(
            tf_idf_matrix.toarray()[i],
            tf_idf_matrix.toarray()[j]
        )
        print(f"Cosine similarity between documents {i} and {j}: {sim:.3f}")
## Cosine similarity between documents 0 and 1: 0.000
## Cosine similarity between documents 0 and 2: 0.000
## Cosine similarity between documents 1 and 2: 0.000
In this example, the reported TF–IDF similarities are all zero to three decimal places, whereas the count-based representation produced a small nonzero value for documents 0 and 2. Absolute similarity values therefore depend on the chosen weighting, although in larger corpora the relative ranking of document similarity obtained from BoW and TF–IDF vectors often remains consistent.
TF–IDF reweights terms, but it does not introduce any lexical structure beyond what the documents already share. Cosine similarity thus provides a geometric measure of overlap between document representations.
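As a side note, the full pairwise similarity matrix can also be obtained in a single call with scikit-learn's cosine_similarity. Because the function defined above reuses that name, the sketch below imports it under an alias; it assumes the matrices bow_matrix and tf_idf_matrix from the earlier examples are still in scope.
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Each call returns a square matrix whose (i, j) entry is the cosine similarity
# between documents i and j; sparse inputs are accepted directly
print(sk_cosine_similarity(bow_matrix))
print(sk_cosine_similarity(tf_idf_matrix))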
In the next section, we introduce one-hot vectorization, a representation that plays a foundational role in neural and embedding-based models.
One-hot encoding is a simple and widely used technique for representing categorical information in numerical form. In this representation, each possible category is associated with a unique position in a vector, and only one entry takes the value 1, while all remaining entries are set to 0. For this reason, one-hot vectors are also known as binary vectors. For a vocabulary of size \(|V|\), each one-hot vector contains exactly one non-zero entry.
Consider a categorical variable describing traffic conditions with three possible values: low, medium, and high. A one-hot representation could be defined as:
vec(low) = <1, 0, 0>
vec(medium) = <0, 1, 0>
vec(high) = <0, 0, 1>
Each vector has length 3 because there are three possible categories, and exactly one position is active at a time.
In natural language processing, the same idea applies to tokens.
Once a vocabulary has been constructed, each word in the vocabulary is treated as a category. A token can then be represented by a one-hot vector whose length equals the vocabulary size.
Only the coordinate associated with the token’s position in the vocabulary is set to 1, while all other coordinates are 0. Formally, each token is represented by a vector in:
\[ \mathbb{R}^{|V|} \]
where \(|V|\) denotes the size of the vocabulary.
One-hot representations are especially important as an intermediate step in the construction of more advanced representations, such as word embeddings, which will be discussed in the next chapter.
To illustrate the process, we will work with a single short sentence.
sentence = ["Students study machine learning methods"]
corpus = pd.Series(sentence)
corpus
## 0 Students study machine learning methods
## dtype: str
We reuse the same preprocessing strategy introduced earlier (cleaning, stopword removal, and lemmatization):
def clean_and_lemmatize(text):
    # Remove non-alphabetic characters, lowercase, drop stopwords, and lemmatize
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop]
    return " ".join(tokens)
preprocessed_corpus = corpus.apply(clean_and_lemmatize)
preprocessed_corpus
## 0 student study machine learning method
## dtype: str
After preprocessing, the sentence is reduced to its core lexical components.
vocab = list(set(preprocessed_corpus[0].split()))
print(vocab)
## ['machine', 'study', 'method', 'student', 'learning']
Each unique token now corresponds to one dimension in the one-hot representation.
position = {token: idx for idx, token in enumerate(vocab)}
print(position)
## {'machine': 0, 'study': 1, 'method': 2, 'student': 3, 'learning': 4}
This mapping specifies which coordinate in the vector corresponds to each token.
one_hot_matrix = np.zeros((len(preprocessed_corpus[0].split()), len(vocab)))
one_hot_matrix.shape
## (5, 5)
The matrix has one row per token in the sentence and one column per vocabulary term.
for i, token in enumerate(preprocessed_corpus[0].split()):
    one_hot_matrix[i][position[token]] = 1
Each row now contains exactly one 1, marking the presence of the corresponding token.
one_hot_matrix
## array([[0., 0., 0., 1., 0.],
## [0., 1., 0., 0., 0.],
## [1., 0., 0., 0., 0.],
## [0., 0., 0., 0., 1.],
## [0., 0., 1., 0., 0.]])
Each row represents the one-hot encoding of a single token. All vectors are orthogonal, and no information about frequency, relative importance, or semantic similarity is captured.
While one-hot vectors provide a clear and unambiguous way to represent tokens numerically, they suffer from two major limitations:
High dimensionality, proportional to the vocabulary size.
No notion of similarity between tokens: every word is equally distant from every other word, as the short check below illustrates.
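The second limitation can be checked directly on the matrix constructed above: every pair of distinct one-hot rows has a dot product (and hence a cosine similarity) of exactly zero.
# Pairwise dot products between one-hot rows: the result is the 5 x 5 identity
# matrix, so each token has norm 1 and zero similarity with every other token
print(one_hot_matrix @ one_hot_matrix.T)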
These limitations motivate the development of distributed representations, which we will explore next.
Chatbots are among the most visible real-world applications of natural language processing. At this point, we have already introduced the essential tools required to construct a simple yet functional chatbot: text preprocessing, vector representations, and similarity measures.
In this section, we will combine these components to design a retrieval-based chatbot. Rather than generating new text, this type of chatbot searches a predefined knowledge base and returns the most appropriate response based on vector similarity.
The effectiveness of a chatbot depends primarily on the quality and relevance of its training corpus. A chatbot intended to answer questions about technical products should be trained on technical documentation or user queries—not on unrelated text such as news articles or literary works.
In addition, a practical chatbot should satisfy several basic requirements:
The corpus must be domain-specific and sufficiently rich.
Response time should be fast enough for interactive use.
Answers should be accurate and consistent.
The interaction should feel reasonably natural to the user.
For illustration purposes, we will use a question–answer corpus related to electronic products, extracted from a large public dataset of product-related Q&A pairs. Such a corpus could realistically support an automated help desk for an online electronics store.
The chatbot follows a simple yet effective pipeline:
Extract all questions from the corpus into a list.
Extract the corresponding answers into a parallel list.
Preprocess and vectorize the questions using TF–IDF.
Vectorize the user’s query using the same transformation.
Compute cosine similarity between the query and all stored questions.
Return the answer associated with the most similar question.
Each entry in the dataset is stored as a dictionary containing a question–answer pair. The following code reads the file line by line and extracts the relevant fields:
import ast
questions = []
answers = []
with open("./Scripts/Dataset/qa_Electronics.json", "r") as f:
    for line in f:
        record = ast.literal_eval(line)
        questions.append(record["question"].lower())
        answers.append(record["answer"].lower())
At this stage, the questions and answers are stored in two aligned lists. Lowercasing is applied to reduce variability during vectorization.
Next, we convert the questions into numerical representations. We first compute term counts and then apply TF–IDF weighting:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
vectorizer = CountVectorizer(stop_words="english")
X_counts = vectorizer.fit_transform(questions)
tfidf = TfidfTransformer(norm="l2")
X_tfidf = tfidf.fit_transform(X_counts)
The matrix X_tfidf serves as the reference space against which all user queries will be compared. Here we explicitly separate term counting and TF–IDF weighting to make each step transparent.
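As an aside, the same matrix can be produced in one step with TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer; the sketch below shows the one-step alternative (the variable names are illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer

# One-step alternative: counting and TF-IDF weighting in a single object
vectorizer_onestep = TfidfVectorizer(stop_words="english", norm="l2")
X_tfidf_onestep = vectorizer_onestep.fit_transform(questions)
print(X_tfidf_onestep.shape)   # same dimensions as X_tfidf above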
To generate a response, we compare the vector representation of the user’s query to every question vector in the corpus using cosine similarity. This approach ensures that both stored questions and incoming queries are embedded in the same vector space.
def get_response(user_input, threshold=60):
    # Vectorize the query with the same transformations used for the stored questions
    query_vec = vectorizer.transform([user_input.lower()])
    query_tfidf = tfidf.transform(query_vec)
    similarities = cosine_similarity(query_tfidf, X_tfidf)[0]
    best_match = np.argmax(similarities)
    # np.clip guards against floating-point values slightly outside [-1, 1] before arccos
    angle = np.rad2deg(np.arccos(np.clip(similarities[best_match], -1.0, 1.0)))
    if angle > threshold:
        return "Sorry, I am not confident enough to answer that."
    else:
        return answers[best_match]
Here, cosine similarity is converted into an angle-based interpretation. The threshold is heuristic and helps avoid returning answers when the query is weakly related to the corpus.
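To make the threshold concrete, note that the similarity itself is the cosine of the angle, so a threshold of 60 degrees corresponds to requiring a cosine similarity of at least \(\cos(60^\circ) = 0.5\). A quick check of this correspondence:
# Cosine similarities of 1.0, 0.5, and 0.0 correspond to angles of 0, 60, and 90 degrees
print(np.rad2deg(np.arccos(np.array([1.0, 0.5, 0.0]))))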
The final step is to implement a simple interaction loop:
def run_chatbot():
    username = input("Enter your name: ")
    print("SupportBot: Hello! How can I assist you today?")
    while True:
        user_input = input(f"{username}: ")
        if user_input.lower() == "bye":
            print("SupportBot: Goodbye!")
            break
        else:
            print("SupportBot:", get_response(user_input))

run_chatbot()
This chatbot demonstrates how vector representations and cosine similarity can be used to build a functional question–answering system with minimal complexity.
While effective for small to medium-sized corpora, this approach has several limitations:
It relies purely on lexical overlap.
It does not capture semantic relationships or context.
Performance degrades as the corpus grows large.
Despite these limitations, similarity-based chatbots played a foundational role in early NLP systems. Modern conversational agents extend these ideas using distributed representations and deep learning, which we will explore in later chapters.
In this chapter, we introduced the fundamental mathematical ideas behind representing text as numerical objects. Starting from simple heuristics, we explored how textual data can be mapped into vectors and matrices, enabling the use of linear algebra techniques for analysis.
We first examined the Bag-of-Words (BoW) representation and implemented it using the CountVectorizer API. While this approach provides an intuitive and effective way to encode text based on term frequencies, we also identified its main limitations—most notably, its tendency to overemphasize very frequent terms and ignore the relative importance of rarer but potentially informative words.
To address these issues, we introduced TF–IDF vectorization, which reweights term frequencies by incorporating global information about term distribution across the corpus. This adjustment helps balance local relevance within documents against global prevalence in the dataset. Despite this improvement, both BoW and TF–IDF remain fundamentally lexical methods: they rely on surface-level word occurrences and do not account for semantic meaning, word order, or contextual relationships.
Building on these vector representations, we then explored how document similarity can be quantified using cosine similarity, interpreting documents as points in a high-dimensional space and measuring the angles between their corresponding vectors. This provided a practical mechanism for comparing documents and served as the foundation for simple applications such as retrieval-based chatbots.
Finally, we discussed one-hot vectorization, a sparse encoding scheme commonly used to represent individual tokens as categorical variables. Although simple, this representation plays an important role as a conceptual building block for more advanced models.
Overall, the methods covered in this chapter are most effective in settings where the vocabulary size is moderate and lexical overlap between documents is meaningful. As vocabularies grow larger or semantic relationships become more important, these representations become less adequate.
With this syntactic foundation in place, the next chapter moves beyond word counts and lexical weighting. We will explore approaches that explicitly model semantic relationships between words, beginning with distributed representations such as Word2Vec.
This activity is designed to integrate and apply the numerical text representation techniques introduced in this chapter.
The reader will transform a small text corpus into vector representations and analyze document similarity using linear algebra concepts.
To build a fully reproducible pipeline that converts raw text into numerical vectors using Bag-of-Words and TF–IDF representations, and to analyze document similarity using cosine similarity.
Select a small corpus of text, such as:
Short paragraphs from news articles,
Abstracts of scientific papers, or
Brief descriptions of products, movies, or books.
The corpus must contain at least three documents, each consisting of one or two sentences.
Create an R Markdown (.Rmd) document that compiles successfully to HTML (or PDF).
The document must include both:
The code, and
The resulting output (printed matrices, tables, or numerical values).
Briefly describe the selected corpus and its context. List the documents explicitly and explain why this corpus is appropriate for similarity analysis.
Apply basic preprocessing steps, including:
Lowercasing,
Removal of punctuation,
Stopword removal, and
Lemmatization or stemming.
Show the processed version of each document.
Construct a Bag-of-Words representation of the corpus using CountVectorizer. Report:
The learned vocabulary, and
The document–term matrix.
Briefly interpret the sparsity and dimensionality of the resulting matrix.
Using the same corpus, compute TF–IDF vectors.
Compare the TF–IDF matrix with the BoW matrix by discussing:
Differences in numerical values, and
How TF–IDF reweights frequent and rare terms.
Compute pairwise cosine similarity between all documents using:
BoW vectors, and
TF–IDF vectors.
Present the results clearly and identify:
The most similar document pair, and
The least similar document pair.
Explain any differences observed between the two representations.
Select three tokens from the vocabulary and represent each of them as a one-hot vector, commenting on the dimensionality and mutual orthogonality of the resulting vectors.
This section may be presented conceptually or with a small numerical example.
Write a concise reflection (6–10 lines) addressing:
How vectorization enables mathematical comparison of text,
The role of weighting schemes such as TF–IDF, and
The limitations of purely lexical representations.
The R Markdown document must be fully reproducible. All code chunks must execute without errors and regenerate the reported outputs when the document is compiled.
If you found any ERRORS or have SUGGESTIONS, please report them to my email. Thanks.