20/01/26
Abstract
See Rpubs:: toc for other documents of possible interest.
Before applying any model to a text dataset, it is essential to convert language into a format that machines can understand. In this chapter, we’ll explore the core elements of that transformation (from raw text to structured tokens).
To understand that initial step, we’ll study how vocabularies are built and how language is represented symbolically. This includes lexicons, phonemes, graphemes, morphemes, tokenization strategies, and word normalization techniques.
Figure 1.1 illustrates the progression from raw text to structured tokens, highlighting key linguistic units such as lexicons, phonemes, graphemes, morphemes, and tokenization strategies used in natural language processing.
Figure 1.1: Core Components of Linguistic Representation. Source: Created by the author with ChatGPT (OpenAI)
As a motivating example, consider the Transformer architecture (Vaswani et al., 2017), widely used in modern NLP. See Figure 1.2.
Figure 1.2: General architecture of the Transformer model. Source: Vaswani et al. (2017)
The process begins with tokenization (Inputs, shown at the lower left of the figure), where input sentences are broken down into basic units (usually subwords or word-pieces). These units then flow through the entire model.
To understand this initial stage, it is necessary to examine how vocabularies are constructed and how language is represented in symbolic form. This leads us to fundamental concepts such as lexicons, phonemes, graphemes, morphemes, tokenization methods, and normalization procedures, which together define how raw text is transformed into model-ready inputs.
All code examples used in this section are available in the following GitHub repository:
https://github.com/PacktPublishing/Hands-On-Python-Natural-Language-Processing/tree/master/Chapter03.
A lexicon is the set of words used in a language or within a specific domain.
In practice, it functions like a dictionary that defines which terms are meaningful in a given context.
The following examples illustrate how lexicons vary depending on the context or domain:
General language: house, run, happy
Medicine: diagnosis, dosage, symptom
Technology: algorithm, dataset, model
In natural language processing, the lexicon defines the vocabulary a model can recognize and process.
Words not included in the lexicon are typically ignored, replaced, or decomposed into smaller units.
For this reason, building or learning a lexicon is a foundational step before tokenization and modeling.
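As a minimal illustration (using a hypothetical mini-lexicon), the sketch below replaces out-of-lexicon words with a special <UNK> placeholder, one of the strategies mentioned above:
lexicon = {"diagnosis", "dosage", "symptom", "patient", "reports", "a", "mild"}
sentence = "patient reports a mild symptom after vaccination"
# Words outside the lexicon are mapped to <UNK> instead of being kept as-is
[word if word in lexicon else "<UNK>" for word in sentence.split()]
## ['patient', 'reports', 'a', 'mild', 'symptom', '<UNK>', '<UNK>']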
Figure 2.1 clarifies the conceptual differences between closely related terms that are often used interchangeably in NLP and linguistics. It distinguishes between lexicon, vocabulary, and word, highlighting their scope, usage, and role in language representation.
Figure 2.1: Lexicon vs vocabulary vs word. Source: Created by the author with ChatGPT (OpenAI)
Before building a vocabulary, it is useful to understand three basic linguistic units: phonemes, graphemes, and morphemes. These concepts provide the linguistic foundation for the tokenization and representation methods used in NLP systems.
Phonemes are the smallest sound units that distinguish meaning in spoken language. Examples:
English: the sounds /k/, /æ/, and /t/ form the word cat.
English: the sounds /f/, /uː/, and /d/ form the word food.
French: the sounds /ʃ/, /a/, and /t/ form the word chat.
Spanish: the sounds /g/, /a/, /t/, and /o/ form the word gato.
Figure 3.1: Phonemes. Source: Charge Mommy Books
Graphemes are letters or letter groups that represent phonemes in written language. Examples:
In spoon, the graphemes s, p, oo, and n represent four phonemes.
In sheet, the graphemes sh, ee, and t represent three phonemes.
In ship, the graphemes sh, i, and p represent three phonemes.
In food, the grapheme oo represents a single phoneme (within a word that contains multiple phonemes).
In chat, the grapheme ch represents a single phoneme.
In Spanish, the word gato is composed of the following graphemes: g, a, t, and o. Each grapheme corresponds to a written unit that represents a phoneme in the word.
Figure 3.2: Graphemes. Source: ReadingDoctor
Morphemes are the smallest units that carry meaning. Examples:
The word unhappiness can be decomposed into three morphemes:
un- (a bound morpheme signifying not).
happy (the root morpheme).
-ness (a bound morpheme signifying state or quality).
The word teacher consists of two morphemes:
teach (root).
-er (a person who performs an action).
The word reusable can be decomposed into:
re- (again).
use (root).
-able (can be done).
The word imported can be decomposed into:
im- (prefix).
port (root).
-ed (suffix).
Figure 3.3: Morphemes. Source: Literacy Learn
To construct a vocabulary, text must first be divided into smaller units called tokens.
This process, known as tokenization, consists of segmenting sentences or documents into meaningful elements that can be processed by a model.
In most cases, tokens correspond to words or numbers, although punctuation symbols and other textual elements may also be treated as tokens depending on the application. Figure 4.1 shows a visual representation of tokens generated by gpt-4o on Tiktokenizer:
Figure 4.1: Visual representation of tokens generated by gpt-4o on Tiktokenizer
A simple example illustrates this idea:
sentence = "Machine learning improves decision making"
sentence.split()
## ['Machine', 'learning', 'improves', 'decision', 'making']
This basic splitting operation separates the sentence into individual word tokens. While this approach is intuitive, real-world tokenization is often more complex and requires more advanced strategies.
Real-world text presents multiple challenges that cannot be adequately handled by simple tokenization rules such as whitespace splitting.
Figure 4.2 provides a high-level overview of some common tokenization issues encountered in natural language processing, including apostrophes, contractions, multi-word expressions, punctuation, non-lexical tokens, and social media artifacts.
These issues are introduced here for conceptual orientation. Each case will be examined in detail in the following subsections, together with illustrative examples and discussion of why more sophisticated tokenization strategies are required.
Figure 4.2: Issues with tokenization. Source: Created by the author with ChatGPT (OpenAI)
Simple tokenization methods often struggle with common language patterns. Consider the following sentence:
sentence = "Machine learning's impact is growing"
sentence.split()
## ['Machine', "learning's", 'impact', 'is', 'growing']
Here, the tokenizer cannot determine whether the correct token should be learning, learnings, or learning's. Apostrophes introduce ambiguity that basic splitting rules cannot resolve.
Contractions present a similar challenge. For example:
sentence = "We'll apply machine learning tomorrow"
sentence.split()
## ["We'll", 'apply', 'machine', 'learning', 'tomorrow']
The contraction we'll actually represents we will, but a simple split does not capture this distinction. Ideally, a tokenizer should convert it into two tokens: we and will.
A related case appears with pronoun contractions:
sentence = "I'm studying machine learning"
sentence.split()
## ["I'm", 'studying', 'machine', 'learning']
Here, I'm should be interpreted as I am, which again requires linguistic awareness beyond simple string splitting.
Multi-word expressions also raise important questions. Consider:
sentence = "Deep learning is a branch of machine learning"
sentence.split()
## ['Deep', 'learning', 'is', 'a', 'branch', 'of', 'machine', 'learning']
Should machine learning be treated as two separate tokens or as a single meaningful unit?
In many contexts, it functions as one concept rather than two independent words.
Punctuation introduces additional complexity. In the following example, the period does not mark the end of a sentence:
sentence = "She holds a Ph.D. in machine learning"
sentence.split()
## ['She', 'holds', 'a', 'Ph.D.', 'in', 'machine', 'learning']
Finally, not all tokens are standard words. Some elements may appear meaningless but still carry contextual value:
sentence = "I was umm thinking about this problem"
sentence.split()
## ['I', 'was', 'umm', 'thinking', 'about', 'this', 'problem']
Although umm is not part of formal vocabulary, it may be relevant in applications such as speech analysis.
So far, we have seen that simple splitting rules are often insufficient for real text. To address different linguistic patterns and use cases, several types of tokenizers have been developed. In this section, we introduce:
Rule-based tokenizers (such as regular expression–based tokenizers),
Linguistically motivated tokenizers (such as the Treebank tokenizer), and
Tokenizers designed for informal and social media text (such as TweetTokenizer).
Each type is presented in the following subsections, along with simple examples that illustrate when and why it should be used in NLP applications.
Figure 5.1 illustrates the main categories of tokenizers, highlighting their underlying principles and typical use cases, including rule-based, linguistically motivated, and social media–oriented approaches.
Figure 5.1: Different types of tokenizers. Source: Created by the author with ChatGPT (OpenAI)
Regular expressions (regex) are formal patterns used to identify, match, and extract specific structures in text. They constitute one of the earliest and most widely used tools for text processing and remain fundamental in many NLP pipelines.
Figure 5.2 provides a visual overview of several commonly used regular expression patterns and illustrates how they operate on concrete text examples.
Figure 5.2: Regular expressions. Source: Created by the author with ChatGPT (OpenAI)
Regex-based approaches are especially useful when the target patterns are well defined and follow recognizable formats, such as dates, prices, email addresses, numerical values, or identifiers. In such cases, rule-based methods are often simpler, more transparent, and computationally more efficient than machine learning alternatives.
Because these elements exhibit fixed structural regularities, regular expressions are particularly well suited for rule-based extraction during tokenization and text preprocessing.
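As a brief illustration (a sketch using Python's built-in re module; the example text and patterns are hypothetical), the snippet below extracts prices and ISO-style dates from a sentence:
import re
text = "The invoice dated 2026-01-20 lists two items: $120.50 and $350.00."
# Prices: a literal dollar sign followed by digits, with an optional two-digit decimal part
re.findall(r"\$\d+(?:\.\d{2})?", text)
## ['$120.50', '$350.00']
# ISO-style dates: four digits, dash, two digits, dash, two digits
re.findall(r"\d{4}-\d{2}-\d{2}", text)
## ['2026-01-20']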
For further reading and interactive practice, see: MDN Web Docs - Regular Expressions and regex101. MDN Web Docs is an authoritative, developer-oriented documentation resource that provides clear and practical explanations of programming concepts, including regular expressions. In contrast, regex101 is an interactive platform designed for testing, visualizing, and debugging regular expression patterns in real time.
Regex metacharacters (quick reference). The table below provides a quick reference to some commonly used regular expression metacharacters. It is not exhaustive (many additional symbols and constructs exist depending on the regex engine), but it covers the most frequently encountered elements in introductory NLP tasks.
| Metacharacter | Name | Meaning |
|---|---|---|
| [ ] | Square brackets | Character class: matches one character from a specified set/range (e.g., [abc], [0-9]). |
| \ | Backslash (escape) | Escapes a metacharacter (or introduces special sequences depending on the regex engine). |
| (concatenation) | Sequence (AND) | Logical AND expressed by sequence: patterns must appear consecutively (e.g., catdog means cat AND dog). |
| + | Plus | Quantifier: repeats the previous token one or more times. |
| ^ | Caret | Anchors the match at the start of the string/line (depending on flags). |
| * | Asterisk | Quantifier: repeats the previous token zero or more times. |
| . | Dot | Wildcard: matches (almost) any single character except newline (by default). |
| $ | Dollar sign | Anchors the match at the end of the string/line (depending on flags). |
| ? | Question mark | Quantifier: makes the previous token optional (zero or one time). |
| { } | Curly braces | Quantifier: repeats the previous token a specified number of times (e.g., {3}, {2,5}). |
| ( ) | Parentheses | Grouping: groups tokens; also used to capture subpatterns for later reference. |
| ! | Exclamation mark | Negation (engine-dependent): often used for “NOT”/negative constructs (not universal across all regex flavors). |
The following examples illustrate how some of the metacharacters above behave in practice. Each example focuses on pattern intuition, not on exhaustive matching rules.
Alternation (|) and repetition (*): matches one option or repeated occurrences of another. Example: a|b* matches either the symbol a or zero or more repetitions of b: the empty string, a, b, bb, bbb (but not ab, ba, aa).
Grouping with repetition: generates strings using only the specified symbols. Example: (a|b)* matches any sequence formed by the symbols a and b, including the empty string: a, b, ab, ba, aab (but not c, abc, aabx).
Optional element (?): allows a symbol to appear zero or one time. Example ab*(c)? matches strings that start with a, followed by zero or more b’s, and optionally end with c: a, ab, abb, ac, abc (but not b, bc, abbcde).
Wildcard (.): matches any single character in a fixed position. Example: .ing matches one arbitrary character followed by the substring ing, and the match does not need to start at the beginning of the word: sing → sing, flying → ying, going → oing. It does not match ing on its own, because .ing requires exactly one character before ing.
Character class ([]): matches one character from a defined set. Example [mh]ouse matches one character from the specified set, followed by a fixed suffix: mouse, house. Does not match: louse, Mouse, because only m or h are allowed, and matching is case-sensitive.
Negated character class ([^ ]): matches any character not listed. Example: [^h]ouse matches any single character except h, followed by ouse. Among mouse, cheese, and house, only mouse matches: house is rejected because h is explicitly excluded, and cheese does not contain the sequence ouse.
Anchors (^, $): restrict matches to the start and end of a string. Example: ^[mh]ouse$ matches only complete strings that start and end exactly with the specified pattern: mouse, house (but not warehouse, mousepad, my house).
Greedy matching (.*): matches a symbol followed by any number of characters. Example f.* matches the character f followed by zero or more characters of any kind: f, fly, foot data, fast text processing.
One or more repetitions (+): requires at least one occurrence. Example [mp]+ouse matches: mouse, pmouse, mmouse, ppouse (but not ouse, house). Requires at least one occurrence of the specified characters before the remaining pattern.
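The behavior of character classes, negation, and anchors described above can be checked directly with Python's built-in re module. The word list below is a small, hypothetical example:
import re
words = ["mouse", "house", "louse", "warehouse", "mousepad"]
# Substring search: one character from {m, h} followed by 'ouse' anywhere in the word
[w for w in words if re.search(r"[mh]ouse", w)]
## ['mouse', 'house', 'warehouse', 'mousepad']
# Anchored match: the whole string must be exactly [mh]ouse
[w for w in words if re.fullmatch(r"[mh]ouse", w)]
## ['mouse', 'house']
# Negated class: any single character except h, followed by 'ouse'
[w for w in words if re.fullmatch(r"[^h]ouse", w)]
## ['mouse', 'louse']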
RegexpTokenizer. The nltk library provides a tokenizer based on regular expressions, known as RegexpTokenizer. This tokenizer allows us to define explicit rules that control how text is split into tokens.
Instead of relying on spaces or punctuation alone, RegexpTokenizer uses a regex pattern to specify which character sequences should be treated as tokens.
Figure 5.3 illustrates the main variants built on top of RegexpTokenizer. These tokenizers are introduced here for orientation purposes and will be examined in detail in subsequent sections.
Figure 5.3: RegexpTokenizer and its variants. Source: Created by the author with ChatGPT (OpenAI)
Example (RegexpTokenizer). Consider the following sentence:
sentence = "The price ranges from $120.50 to $350.00 today."
sentence
## 'The price ranges from $120.50 to $350.00 today.'
We define a tokenizer that recognizes words and prices:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")
tokenizer.tokenize(sentence)
## ['The', 'price', 'ranges', 'from', '$120.50', 'to', '$350.00', 'today', '.']
In this pattern:
\w+ matches words and numbers (\w is equivalent to [a-zA-Z0-9_]).
\$[\d\.]+ matches prices starting with $: \$ matches a literal dollar sign, \d matches a digit between 0 and 9, \. matches a literal period, and + acts as a quantifier matching one or more times.
\S+ matches any remaining sequence of non-whitespace characters; + acts the same way as in the preceding two alternatives.
This approach is useful when we want precise control over which elements are preserved as tokens.
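As a side note (a small sketch, not part of the original example), RegexpTokenizer also accepts a gaps=True argument, in which case the regex describes the separators rather than the tokens themselves:
from nltk.tokenize import RegexpTokenizer
sentence = "The price ranges from $120.50 to $350.00 today."
# gaps=True: the pattern matches the gaps (runs of whitespace); everything in between is returned as a token
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokenizer.tokenize(sentence)
## ['The', 'price', 'ranges', 'from', '$120.50', 'to', '$350.00', 'today.']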
Variants of RegexpTokenizer (RegexpTokenizer family). Several tokenizers in NLTK are implemented as wrappers or variants of RegexpTokenizer, including:
WordPunctTokenizer, which separates alphabetic and non-alphabetic characters,
BlanklineTokenizer, which uses empty lines as delimiters.
Figure 5.4: Variants of RegexpTokenizer. Source: Created by the author with ChatGPT (OpenAI)
Example (WordPunctTokenizer). This tokenizer separates alphabetic tokens from punctuation and symbols, making each punctuation mark an individual token.
from nltk.tokenize import WordPunctTokenizer
sentence = "Price: $120.50, available today!"
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(sentence)
## ['Price', ':', '$', '120', '.', '50', ',', 'available', 'today', '!']
Example (BlanklineTokenizer). This tokenizer splits text into chunks based on blank lines, which is useful when processing documents structured into paragraphs.
from nltk.tokenize import BlanklineTokenizer
sentence = "This is the first paragraph.\n\nThis is the second paragraph."
tokenizer = BlanklineTokenizer()
tokenizer.tokenize(sentence)
## ['This is the first paragraph.', 'This is the second paragraph.']
Write a regular expression to extract email addresses from a text and test it at regex101.
The Treebank tokenizer applies a set of linguistically motivated rules inspired by the Penn Treebank annotation guidelines.
Although it relies internally on regular expressions, its main goal is not purely pattern matching, but rather linguistically informed tokenization.
In particular, this tokenizer is designed to handle contractions, punctuation, and other common syntactic phenomena in a way that better reflects the structure of natural language. As a result, it is especially effective for preprocessing English text in downstream NLP tasks.
Figure 5.5: Treebank tokenizer. Source: Created by the author with ChatGPT (OpenAI)
As illustrated in Figure 5.5, the Treebank tokenizer systematically splits contractions (e.g., we'll → we + will) and separates punctuation marks from words, producing tokens that are more suitable for syntactic and semantic analysis. These design choices make it a standard baseline tokenizer in many NLP pipelines.
Example (Treebank). Consider the following sentence:
sentence = "I'm sure this model doesn't perform perfectly."
sentence
## "I'm sure this model doesn't perform perfectly."
Tokenizing with the Treebank tokenizer:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)
## ['I', "'m", 'sure', 'this', 'model', 'does', "n't", 'perform', 'perfectly', '.']
Here, contractions such as I'm and doesn't are split into meaningful components:
I'm → I and 'm
doesn't → does and n't
This decomposition helps isolate grammatical and semantic elements such as negation, which would be harder to analyze if treated as a single token.
Text from social media often differs substantially from standard written language. It typically includes user mentions, hashtags, emojis, URLs, elongated words, repeated characters, all of which pose challenges for simple whitespace- or punctuation-based tokenizers.
To address these characteristics, the nltk library provides the TweetTokenizer, a tokenizer specifically designed to handle the informal and highly variable nature of social media text. Rather than discarding these elements, TweetTokenizer preserves them as meaningful tokens, allowing downstream NLP models to capture emotional cues, emphasis, and topical information.
Figure 5.6: Tweet tokenizer. Source: Created by the author with ChatGPT (OpenAI)
As illustrated in Figure 5.6, TweetTokenizer is able to correctly identify and separate mentions (e.g., @user), hashtags (e.g., #AI), emojis, URLs, and expressive punctuation, producing a token sequence that better reflects the structure and semantics of social media communication.
Example (TweetTokenizer). Consider the following example:
from nltk.tokenize import TweetTokenizer
sentence = "@datafan NLP is sooo exciting!!! 😄🚀 #TextMining #AI"
sentence
## '@datafan NLP is sooo exciting!!! 😄🚀 #TextMining #AI'
Using TweetTokenizer:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize(sentence)
## ['@datafan', 'NLP', 'is', 'sooo', 'exciting', '!', '!', '!', '😄', '🚀', '#TextMining', '#AI']
This tokenizer preserves important elements such as:
User mentions (@datafan),
Emojis (😄, 🚀),
Hashtags (#TextMining, #AI), and
Repeated punctuation or characters.
The TweetTokenizer also offers useful configuration options:
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)
tokenizer.tokenize(sentence)
## ['nlp', 'is', 'sooo', 'exciting', '!', '!', '!', '😄', '🚀', '#textmining', '#ai']
In this example, each parameter modifies the tokenization behavior as follows:
strip_handles=True removes user mentions from the tweet. For example, the mention @datafan is removed from the output. This is useful when user identifiers are not relevant for the analysis.
reduce_len=True shortens exaggerated character repetitions to at most three occurrences (e.g., soooooo → sooo) rather than removing them completely. In this example, sooo already contains only three repeated characters, so it appears unchanged in the output.
preserve_case=False converts text to lowercase for vocabulary normalization (NLP → nlp, #AI → #ai), which helps reduce vocabulary size. The default value of this parameter is True.
Together, these options allow TweetTokenizer to retain meaningful social media elements (emojis, hashtags, emphasis) while reducing noise and variability in informal text.
In many NLP tasks, keeping every possible word form in the vocabulary is unnecessary (and often undesirable). A common solution is word normalization, where different surface forms are mapped to a more consistent representation.
For instance, verb forms such as am, are, and is can be mapped to the same base concept be. Likewise, variants such as car, cars, and car's may be treated as the same underlying word, depending on the goal of the analysis.
Normalization is mainly used to control vocabulary size and reduce noise in text data. However, the choice of technique is task-dependent: words that are often removed in general NLP pipelines (e.g., when, why, where) may be uninformative for some classification tasks, but essential for applications such as question answering.
Figure 6.1 summarizes the main normalization steps covered in this section and how they typically fit into a preprocessing workflow.
Figure 6.1: Word normalization. Source: Created by the author with ChatGPT (OpenAI)
Stemming is a technique used to reduce words to a simplified base form, known as the stem, by removing prefixes or suffixes.
For example, words such as compute, computer, and computing may all be reduced to the stem comput.
compute, computer, computing → comput
Importantly, the resulting stem is not guaranteed to be a valid dictionary word.
Stemming relies on heuristic rules rather than linguistic analysis, which makes it fast but sometimes imprecise. Two widely used stemming algorithms are:
Porter stemmer (PorterStemmer), designed for English.
Snowball stemmer (SnowballStemmer), an extension of the PorterStemmer that supports multiple languages.
The SnowballStemmer supports the following languages:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages
## ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
PorterStemmer. Let us first apply the PorterStemmer to a small set of words:
from nltk.stem.porter import PorterStemmer
words = ["running", "runs", "runner", "easily", "fairly"]
stemmer = PorterStemmer()
[stemmer.stem(word) for word in words]
## ['run', 'run', 'runner', 'easili', 'fairli']
The output shows that the PorterStemmer aggressively removes suffixes:
running and runs are correctly reduced to run.
runner remains unchanged, since the algorithm does not treat it as an inflected form.
easily and fairly are reduced to easili and fairli, which are not valid English words.
This illustrates a key property of stemming: the resulting stem is not required to be linguistically correct, only consistent.
SnowballStemmer. Now, applying the SnowballStemmer to the same words:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
[stemmer.stem(word) for word in words]
## ['run', 'run', 'runner', 'easili', 'fair']
The SnowballStemmer produces results that are very similar to those of the PorterStemmer, but with small refinements:
As before, running and runs are reduced to run.
runner remains unchanged.
easily is still reduced to easili.
However, fairly is reduced to fair, which is a more readable and meaningful stem than fairli.
This difference illustrates how Snowball refines some of the original Porter rules, leading to slightly more interpretable stems in certain cases.
Overall, both stemmers behave similarly, but Snowball often provides modest improvements while preserving the speed and simplicity of rule-based stemming.
Stemming can introduce two common types of errors:
Over-stemming occurs when different words are reduced to the same stem even though they have different meanings. Example: university and universe may be incorrectly mapped to the same stem.
Under-stemming occurs when related words are not reduced to the same stem. Example: analysis and analyst may remain separate despite being conceptually related.
These limitations highlight that stemming is a crude normalization technique and should be applied with care, depending on the NLP task.
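Both error types can be observed directly with the PorterStemmer (a minimal sketch; the word pairs are illustrative):
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# Over-stemming: unrelated words collapse to the same stem
[stemmer.stem(word) for word in ["university", "universe"]]
## ['univers', 'univers']
# Under-stemming: related words keep different stems
[stemmer.stem(word) for word in ["analysis", "analyst"]]
## ['analysi', 'analyst']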
For a detailed discussion on stemming algorithms, see this paper: A Comparative Study of Stemming Algorithms (Jivani et al., 2011)
Unlike stemming, which removes characters using heuristic rules, lemmatization aims to convert a word into its meaningful base form, known as the lemma. The lemma usually corresponds to a valid dictionary word.
Lemmatization groups together different word forms that share the same base meaning. For example, am, are, and is can all be mapped to the lemma be.
This process relies on linguistic information such as:
The part of speech (POS) of a word,
Its contextual usage, and,
In some cases, semantic knowledge.
Because the same word can have different lemmas depending on context, lemmatization is generally more accurate (but also more computationally expensive) than stemming.
In this section, we illustrate lemmatization using the WordNet lemmatizer (WordNetLemmatizer) and the spaCy lemmatizer (spacy).
WordNet lemmatizer. WordNet is a large lexical database of English in which words are grouped into sets of synonyms (called synsets) that represent distinct concepts. The nltk library provides an interface to WordNet that can be used for lemmatization.
Consider the following sentence:
sentence = "We are putting in efforts to improve our understanding of lemmatization"
sentence
## 'We are putting in efforts to improve our understanding of lemmatization'
Applying the WordNet lemmatizer (WordNetLemmatizer) without additional information:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tokens = sentence.split()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
lemmatized
## ['We', 'are', 'putting', 'in', 'effort', 'to', 'improve', 'our', 'understanding', 'of', 'lemmatization']
Most words remain unchanged. This occurs because, by default, the WordNet lemmatizer assumes that all words are nouns. As a result, verbs such as are or putting are not reduced to their correct base forms.
Lemmatization is inherently context-dependent: the correct lemma of a word depends on its grammatical role in the sentence. For this reason, lemmatizers typically rely on part-of-speech (POS) tags.
The nltk library provides a pretrained POS tagger that assigns grammatical labels to tokens:
pos_tags = nltk.pos_tag(tokens)
pos_tags
## [('We', 'PRP'), ('are', 'VBP'), ('putting', 'VBG'), ('in', 'IN'), ('efforts', 'NNS'), ('to', 'TO'), ('improve', 'VB'), ('our', 'PRP$'), ('understanding', 'NN'), ('of', 'IN'), ('lemmatization', 'NN')]
Some common POS tags appearing in this example are:
PRP: personal pronoun
PRP$: possessive pronoun
VB: verb (base form)
VBP: verb (present tense)
VBG: verb (gerund or present participle)
NN: noun (singular)
NNS: noun (plural)
IN: preposition or subordinating conjunction
TO: infinitive marker
A complete description of the Penn Treebank POS tagset can be found at: Alphabetical list of part-of-speech tags used in the Penn Treebank Project.
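The code below calls a helper function get_wordnet_pos, which is not shown in the original listing. A minimal sketch (an assumed implementation) tags each token in isolation and maps the first letter of its Penn Treebank tag to the corresponding WordNet category:
import nltk
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    # Tag the word in isolation and keep the first letter of its Penn Treebank tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    # Map it to a WordNet POS constant; default to noun, as WordNet itself does
    tag_map = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_map.get(tag, wordnet.NOUN)
Note that tagging tokens in isolation may occasionally assign a different tag than sentence-level tagging.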
When POS information is incorporated, lemmatization becomes more accurate:
lemmatized_with_pos = [
lemmatizer.lemmatize(token, get_wordnet_pos(token))
for token in tokens
]
" ".join(lemmatized_with_pos)
## 'We be put in effort to improve our understand of lemmatization'
This time, the lemmatizer correctly identifies meaningful base forms, for example:
are → be
putting → put
understanding → understand
This example illustrates that lemmatization is linguistically informed and context-aware, in contrast to stemming, which applies purely rule-based truncation without regard to grammatical role or semantic validity.
While modern Transformer-based models do not explicitly perform lemmatization, understanding these normalization steps helps clarify how linguistic structure is simplified before tokenization and embedding.
For comparison, consider the output of a stemmer applied to the same tokens:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
stemmed = [stemmer.stem(token) for token in tokens]
" ".join(stemmed)
## 'we are put in effort to improv our understand of lemmat'
Several words are truncated, in some cases to forms that do not correspond to valid dictionary entries, for example:
improve → improv
understanding → understand
lemmatization → lemmat
Unlike lemmatization, stemming applies purely rule-based suffix removal without considering grammatical role or meaning. As a result, it may generate incomplete or non-standard word forms. Lemmatization, by contrast, aims to preserve linguistic validity and interpretability.
spaCy lemmatizer. The spaCy lemmatizer relies on pretrained language models that perform tokenization, POS tagging, and lemmatization as part of a single integrated pipeline.
After installing spaCy and downloading a language model, lemmatization can be applied directly:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are putting in efforts to improve our understanding of lemmatization")
[token.lemma_ for token in doc]
## ['we', 'be', 'put', 'in', 'effort', 'to', 'improve', 'our', 'understanding', 'of', 'lemmatization']
In this output, spaCy automatically infers the grammatical role of each token and assigns an appropriate lemma:
are → be (verb normalization).
putting → put (verb base form).
efforts → effort (singular noun).
Function words such as in, to, and of remain unchanged.
Content words like understanding and lemmatization already appear in their base form and therefore do not change.
Unlike the WordNet-based approach, spaCy does not require POS tags to be supplied explicitly, as grammatical information is inferred internally by the model.
In some spaCy language models, pronouns are represented using the placeholder -PRON-. This is a design choice intended to abstract away surface forms of pronouns rather than a lemmatization error. Depending on the application, this behavior may be useful (e.g., for normalization) or undesirable (e.g., for interpretability).
spaCy supports multiple languages through pretrained language models; a list of available models can be found at spaCy: Models & Languages. These models are not installed by default and must be downloaded separately from the command line before use. For example:
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
In this course, the following examples are presented for illustrative purposes only, in order to highlight spaCy’s multilingual capabilities. The code below shows how different language models would be loaded if they were installed.
import spacy
# English
nlp_en = spacy.load("en_core_web_sm")
# Spanish
nlp_es = spacy.load("es_core_news_sm")
# French
nlp_fr = spacy.load("fr_core_news_sm")
The code above is provided for illustrative purposes only and is not intended to be executed in this course. This avoids installation issues while preserving the conceptual understanding of spaCy’s multilingual model structure.
In previous sections, we briefly mentioned stopword removal as a common preprocessing step in NLP. We now examine this technique in more detail.
Stopwords are words such as a, an, the, in, at, and to that occur very frequently in text corpora but usually carry limited semantic information on their own. Although these words are essential for grammatical correctness, they often contribute little to tasks focused on content or meaning.
As a result, stopword removal is commonly used to:
Reduce vocabulary size,
Simplify text representations, and
Improve efficiency in certain NLP tasks.
It is important to note that there is no universal stopword list. Stopwords depend on:
The language,
The application, and
The specific task being addressed.
Example (nltk). The nltk library provides predefined stopword lists for several languages. The following example illustrates how stopwords can be retrieved for English:
#nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)
## "aren't, wouldn, there, other, our, with, wouldn't, them, o, is, he's, few, aren, off, that'll, shan't, didn't, are, over, wasn't, needn, had, of, then, i've, their, ve, don, you, we, we'd, haven't, from, hasn, while, me, where, ours, each, itself, shouldn't, m, mustn't, t, theirs, you'd, just, such, about, down, all, shouldn, under, between, couldn't, he, up, before, hers, it'd, having, it, for, a, through, as, here, he'll, weren't, what, no, doesn, an, they've, i'll, i, we've, do, during, she'll, was, it's, hadn't, until, needn't, that, how, ma, now, she'd, re, myself, they'll, be, mightn't, on, does, by, s, were, any, his, they're, because, have, only, shan, hasn't, i'm, been, most, both, i'd, doesn't, won, but, her, did, couldn, am, so, being, these, or, more, below, we're, which, wasn, some, him, your, weren, again, above, themselves, mightn, when, haven, they'd, y, himself, in, isn't, yourself, you've, don't, to, very, if, who, further, why, hadn, herself, the, than, can, nor, he'd, those, she, didn, should've, ain, we'll, and, isn, will, ourselves, yourselves, yours, mustn, you'll, should, once, into, own, whom, out, it'll, this, my, won't, ll, they, has, not, she's, you're, same, at, against, its, doing, d, after, too"
The Italian stopword list includes frequent function words such as di, e, il, la, che, per, which may be removed depending on the goals of the analysis.
from nltk.corpus import stopwords
stop_it = set(stopwords.words("italian"))
", ".join(stop_it)
## 'stemmo, ho, facessimo, farò, sulle, quello, ebbe, o, starebbero, sugli, se, dagl, avevamo, avuto, sta, avresti, da, si, essendo, c, avessi, stando, l, agli, facevo, ero, starei, nei, nel, foste, stia, noi, abbiate, abbiano, dai, faccia, avrei, fareste, ne, sullo, facendo, fosse, quella, sull, loro, fui, stanno, alle, quale, e, farai, uno, faceva, facevate, fummo, abbiamo, avete, faceste, tra, sono, stettero, ti, dalla, facesse, nostro, del, cui, stiate, negl, saresti, facciano, della, starete, saremmo, all, il, avevano, miei, quelle, stava, fossimo, staresti, stai, vostre, anche, stette, quante, gli, avrò, stavate, stesti, sareste, quanti, stetti, agl, mio, per, tutto, faccio, faremmo, lei, facessero, dello, a, stessero, avemmo, di, sarebbe, sarei, fu, ebbero, quanto, nello, queste, contro, avevi, con, avrebbe, stavo, fai, una, dallo, tue, ad, dov, i, avendo, le, faresti, facevano, stiamo, lo, sto, starà, stavamo, coi, avesti, farebbero, un, ma, avrete, sarò, avrebbero, avuta, non, nostra, dagli, avuti, fosti, stesse, stessimo, siate, staremo, avrai, vostri, starai, col, eravate, sarebbero, questi, vostra, come, abbia, avremo, dei, perché, faremo, negli, saremo, facevi, siamo, era, eravamo, aveva, farei, eri, dal, avrà, avremmo, stavi, nostre, alla, degli, chi, facciate, farà, farete, saranno, tuo, tua, al, vostro, avreste, mie, sugl, avevate, in, steste, quanta, hanno, su, voi, hai, sarete, furono, facciamo, staranno, stavano, stessi, ha, fossero, erano, avessimo, sei, io, quelli, degl, stareste, delle, staremmo, dell, avevo, dalle, ed, starebbe, è, nell, allo, questa, avute, fece, sue, tu, sarà, ai, dall, più, tutti, sul, fossi, nella, facemmo, faranno, starò, nelle, lui, vi, che, facesti, ebbi, sulla, sua, farebbe, suoi, tuoi, sarai, avessero, avranno, facevamo, stiano, fanno, questo, ci, li, sui, suo, mi, aveste, mia, siete, siano, dove, feci, facessi, avesse, nostri, fecero, la, sia'
The French stopword list contains common grammatical words such as le, la, les, de, et, à, which are frequently removed in preprocessing steps depending on the task.
from nltk.corpus import stopwords
stop_fr = set(stopwords.words("french"))
", ".join(stop_fr)
## 'qu, étées, ou, auras, eurent, eussiez, moi, étantes, aux, se, étiez, eues, c, leur, eusses, ce, sommes, eu, en, qui, aura, une, ton, ayons, ne, au, avions, la, étant, me, serais, été, aie, m, sa, je, aurait, t, fussent, votre, les, fûmes, pas, il, ta, aies, elle, aient, ayants, eût, as, soyons, mes, es, à, n, avons, étaient, aviez, serons, fussiez, tes, le, mais, que, ayantes, avait, des, ma, un, étée, êtes, sont, son, aurions, on, s, même, eûmes, te, fus, auriez, seraient, ils, étés, avec, vos, aurez, eut, serai, sur, serait, soyez, avaient, serions, fut, étais, fût, eûtes, seras, ait, eus, soit, eue, mon, nos, eux, eussent, vous, avais, y, furent, suis, notre, étions, sera, toi, fussions, fusses, fusse, aurons, aurais, ont, ces, tu, ayez, et, pour, auront, ai, de, est, seront, auraient, lui, nous, fûtes, aurai, eusse, ses, eussions, dans, du, étante, j, avez, sois, était, soient, ayant, par, d, seriez, étants, ayante, l, serez'
Although stopword lists are language-specific, the underlying principle remains the same: stopword removal is a task-dependent preprocessing decision rather than a universal rule. These lists typically consist of highly frequent function words; however, applying them blindly can lead to the removal of linguistically or semantically relevant terms.
As discussed earlier, stopword lists should not be applied blindly. In particular, wh- words such as who, what, when, why, how, which, where, and whom often play a crucial role in tasks involving questions or information-seeking behavior.
While removing these words may be acceptable in some contexts, it can be harmful in applications such as:
Question answering,
Question classification,
Information retrieval.
The following example illustrates how a stopword list can be adapted to preserve wh- words when they are relevant for interpretation.
from nltk.corpus import stopwords
wh_words = ["who", "what", "when", "why", "how", "which", "where", "whom"]
stop = set(stopwords.words("english"))
for word in wh_words:
    stop.remove(word)
sentence = "how do students analyze text data in applied statistics courses"
filtered_sentence = [token for token in sentence.split() if token not in stop]
" ".join(filtered_sentence)
## 'how students analyze text data applied statistics courses'
The original sentence:
how do students analyze text data in applied statistics courses
is transformed into:
how students analyze text data applied statistics courses
In this process, common function words such as do and in are removed, while the wh- word how is preserved due to its importance for interpretation. This example highlights that stopword removal must be adapted to the specific goals of the analysis rather than applied mechanically.
Another common normalization strategy in NLP is case folding, which consists of converting all characters in a text corpus to lowercase. Under case folding, tokens such as The and the are treated as identical, whereas they would be considered distinct in a case-sensitive representation.
Case folding is particularly useful in applications such as information retrieval and text matching, where differences in capitalization are usually not meaningful. For example, whether a user types Statistics or statistics should not affect the retrieval of relevant documents.
However, case folding can introduce limitations in certain contexts. Proper nouns may lose important distinctions when converted to lowercase. For instance, acronyms such as NASA or UN may be transformed into common nouns. Similarly, named entities composed of common words can become ambiguous after case folding.
Although more sophisticated approaches attempt to preserve capitalization selectively using contextual information, such methods are not always reliable—especially when users predominantly write in lowercase. As a result, fully lowercasing text remains a widely used and practical solution.
It is also important to note that the relevance of capitalization varies across languages. In languages such as English, capitalization often conveys syntactic or semantic information, whereas in other languages it may play a less significant role.
The following example illustrates a simple case-folding operation in Python using the lower() method:
sentence = "Graduate Students Apply Statistical Models to Text Analysis"
sentence = sentence.lower()
sentence
## 'graduate students apply statistical models to text analysis'
In this output, all uppercase letters are converted to lowercase. As a result, words such as Graduate, Students, and Statistical lose their capitalization and become indistinguishable from their lowercase counterparts. This transformation reduces variability in the text representation, which can be beneficial for tasks such as text matching and information retrieval, but may also remove useful signals when capitalization carries semantic or syntactic meaning.
In traditional NLP pipelines, case folding is often applied explicitly as a preprocessing step. In contrast, modern neural language models may handle capitalization differently depending on their architecture and training data.
For example, uncased models rely on fully lowercased text, whereas cased models preserve capitalization and may use it as a signal for meaning or named-entity recognition. Consequently, the decision to apply case folding should be aligned with the representation model being used.
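The difference can be seen directly in the tokenizers of uncased and cased models. The sketch below assumes the Hugging Face transformers library is installed and that the bert-base-uncased and bert-base-cased tokenizers can be downloaded; the resulting token lists are not shown here.
#pip install transformers
from transformers import AutoTokenizer
uncased = AutoTokenizer.from_pretrained("bert-base-uncased")
cased = AutoTokenizer.from_pretrained("bert-base-cased")
sentence = "NASA researchers apply Statistics to text"
# The uncased tokenizer lowercases the text before splitting it into subwords,
# while the cased tokenizer preserves capitalization in its tokens
uncased.tokenize(sentence)
cased.tokenize(sentence)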
Stopword removal and case folding are common text normalization techniques, but neither should be applied blindly.
Both techniques involve modeling decisions that depend on the task, language, and downstream application.
Removing stopwords may simplify representations, but can be harmful in tasks such as question answering or information retrieval.
Case folding reduces sparsity but may eliminate meaningful distinctions, particularly for proper nouns and acronyms.
In modern NLP systems, including Transformer-based models, some normalization steps may be handled implicitly rather than explicitly.
This concludes the discussion on lexical normalization. We now turn to tokenization strategies that capture not only individual words, but also short sequences of words that convey meaning jointly.
So far, we have implicitly worked with unigrams, that is, individual words treated as independent tokens. Unigrams represent the simplest level of text representation and are often used to model word frequency and basic lexical information. However, many expressions in natural language convey meaning only when multiple words are considered together. Examples include compound terms, named entities, and fixed expressions. To capture such local context, NLP relies on n-grams, which are contiguous sequences of n tokens.
Unigrams (n = 1): single words.
Bigrams (n = 2): pairs of words.
Trigrams (n = 3): sequences of three words.
In practice, most NLP applications use unigrams, bigrams, and trigrams, as larger n-grams tend to be sparse and less informative.
Consider the following sentence:
sentence = "Applied statistics supports data-driven decision making. Applied statistics supports better decision making in practice, and applied statistics supports all decisions"
sentence
## 'Applied statistics supports data-driven decision making. Applied statistics supports better decision making in practice, and applied statistics supports all decisions'
The phrase data-driven decision making carries a specific meaning that would be partially lost if each word were analyzed independently. N-grams allow us to preserve such local context.
In this case, unigrams correspond to the individual words in the sentence. They capture basic lexical information but ignore word order and local dependencies.
tokens = sentence.lower().split()
tokens
## ['applied', 'statistics', 'supports', 'data-driven', 'decision', 'making.', 'applied', 'statistics', 'supports', 'better', 'decision', 'making', 'in', 'practice,', 'and', 'applied', 'statistics', 'supports', 'all', 'decisions']
To summarize the distribution of individual words in the text, unigram frequencies are computed and organized into a table. Presenting frequencies in tabular form facilitates inspection and comparison, making it easier to identify which terms dominate the text.
from collections import Counter
unigram_freq = Counter(tokens)
unigram_freq
## Counter({'applied': 3, 'statistics': 3, 'supports': 3, 'decision': 2, 'data-driven': 1, 'making.': 1, 'better': 1, 'making': 1, 'in': 1, 'practice,': 1, 'and': 1, 'all': 1, 'decisions': 1})
For improved readability, the frequency information is displayed as a structured table with one row per unigram:
#pip install pandas
import pandas as pd
unigram_table = (
pd.DataFrame(unigram_freq.items(), columns=["Unigram", "Frequency"])
.sort_values("Frequency", ascending=False)
.reset_index(drop=True)
)
unigram_table
## Unigram Frequency
## 0 applied 3
## 1 statistics 3
## 2 supports 3
## 3 decision 2
## 4 data-driven 1
## 5 making. 1
## 6 better 1
## 7 making 1
## 8 in 1
## 9 practice, 1
## 10 and 1
## 11 all 1
## 12 decisions 1
The unigram frequency table shows how often each individual word appears in the text. Words such as applied, statistics, supports, and decision occur multiple times, indicating their central role in the sentence. However, because unigrams treat words independently, this representation does not preserve word order or capture multi-word expressions, limiting the contextual information available.
While unigrams focus on individual words, bigrams capture pairs of adjacent words. This allows the representation to preserve short-range dependencies and common two-word expressions.
from nltk.util import ngrams
tokens = sentence.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]
## ['Applied statistics', 'statistics supports', 'supports data-driven', 'data-driven decision', 'decision making.', 'making. Applied', 'Applied statistics', 'statistics supports', 'supports better', 'better decision', 'decision making', 'making in', 'in practice,', 'practice, and', 'and applied', 'applied statistics', 'statistics supports', 'supports all', 'all decisions']
Bigrams capture pairs of adjacent words. This allows the model to preserve short-range dependencies and common phrases such as data-driven decision or decision making, which would lose meaning if analyzed word by word.
As with unigrams, bigram frequencies can be summarized in a table to facilitate interpretation.
bigram_freq = Counter(bigrams)
bigram_freq
## Counter({('statistics', 'supports'): 3, ('Applied', 'statistics'): 2, ('supports', 'data-driven'): 1, ('data-driven', 'decision'): 1, ('decision', 'making.'): 1, ('making.', 'Applied'): 1, ('supports', 'better'): 1, ('better', 'decision'): 1, ('decision', 'making'): 1, ('making', 'in'): 1, ('in', 'practice,'): 1, ('practice,', 'and'): 1, ('and', 'applied'): 1, ('applied', 'statistics'): 1, ('supports', 'all'): 1, ('all', 'decisions'): 1})
For greater clarity, the bigram counts are organized into a structured table, where each row represents a two-word sequence and its frequency:
bigram_table = (
pd.DataFrame(
[(" ".join(k), v) for k, v in bigram_freq.items()],
columns=["Bigram", "Frequency"]
)
.sort_values("Frequency", ascending=False)
.reset_index(drop=True)
)
bigram_table
## Bigram Frequency
## 0 statistics supports 3
## 1 Applied statistics 2
## 2 supports data-driven 1
## 3 data-driven decision 1
## 4 decision making. 1
## 5 making. Applied 1
## 6 supports better 1
## 7 better decision 1
## 8 decision making 1
## 9 making in 1
## 10 in practice, 1
## 11 practice, and 1
## 12 and applied 1
## 13 applied statistics 1
## 14 supports all 1
## 15 all decisions 1
The bigram frequency table highlights short expressions that recur in the text. For example, statistics supports appears more than once, suggesting a meaningful local dependency between these words. Compared to unigrams, bigrams preserve word order and provide richer contextual information, although the context remains limited to two-word windows.
Trigrams extend this idea by capturing sequences of three consecutive words. They are particularly useful for representing compound concepts and fixed expressions, at the cost of increased sparsity.
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]
## ['Applied statistics supports', 'statistics supports data-driven', 'supports data-driven decision', 'data-driven decision making.', 'decision making. Applied', 'making. Applied statistics', 'Applied statistics supports', 'statistics supports better', 'supports better decision', 'better decision making', 'decision making in', 'making in practice,', 'in practice, and', 'practice, and applied', 'and applied statistics', 'applied statistics supports', 'statistics supports all', 'supports all decisions']
Trigram frequencies summarize sequences of three consecutive tokens extracted from the text. By extending the context window beyond individual words and word pairs, trigrams are able to represent longer expressions and more specific semantic patterns.
trigram_freq = Counter(trigrams)
trigram_freq
## Counter({('Applied', 'statistics', 'supports'): 2, ('statistics', 'supports', 'data-driven'): 1, ('supports', 'data-driven', 'decision'): 1, ('data-driven', 'decision', 'making.'): 1, ('decision', 'making.', 'Applied'): 1, ('making.', 'Applied', 'statistics'): 1, ('statistics', 'supports', 'better'): 1, ('supports', 'better', 'decision'): 1, ('better', 'decision', 'making'): 1, ('decision', 'making', 'in'): 1, ('making', 'in', 'practice,'): 1, ('in', 'practice,', 'and'): 1, ('practice,', 'and', 'applied'): 1, ('and', 'applied', 'statistics'): 1, ('applied', 'statistics', 'supports'): 1, ('statistics', 'supports', 'all'): 1, ('supports', 'all', 'decisions'): 1})
For improved readability, the trigram counts can be arranged in a table format, where each row corresponds to a three-word sequence and its observed frequency:
trigram_table = (
pd.DataFrame(
[(" ".join(k), v) for k, v in trigram_freq.items()],
columns=["Trigram", "Frequency"]
)
.sort_values("Frequency", ascending=False)
.reset_index(drop=True)
)
trigram_table
## Trigram Frequency
## 0 Applied statistics supports 2
## 1 statistics supports data-driven 1
## 2 supports data-driven decision 1
## 3 data-driven decision making. 1
## 4 decision making. Applied 1
## 5 making. Applied statistics 1
## 6 statistics supports better 1
## 7 supports better decision 1
## 8 better decision making 1
## 9 decision making in 1
## 10 making in practice, 1
## 11 in practice, and 1
## 12 practice, and applied 1
## 13 and applied statistics 1
## 14 applied statistics supports 1
## 15 statistics supports all 1
## 16 supports all decisions 1
In this example, the trigram Applied statistics supports appears twice and its lowercase variant applied statistics supports appears once (three occurrences in total if case is ignored), while all other trigrams occur only once. This indicates the presence of a repeated local pattern in the text, whereas the remaining trigrams correspond to unique contextual sequences.
Nevertheless, even when frequencies are equal, trigrams provide valuable information by preserving local syntactic and semantic context. For instance, expressions such as applied statistics supports or data-driven decision making capture relationships between words that are not visible when using unigrams or bigrams alone.
This illustrates a key trade-off in n-gram modeling: as the value of n increases, n-grams tend to become more informative in terms of contextual richness, but also more sparse. In larger corpora, repeated trigrams typically emerge, and their frequency distributions become more meaningful for statistical modeling and feature extraction.
Frequency tables are the primary analytical output in n-gram analysis. However, simple visualizations can be useful for exploratory and pedagogical purposes, especially when introducing text-based features for the first time.
In this section, we define a lightweight visualization utility that produces:
A bar plot of the most frequent n-grams.
A word cloud summarizing relative frequency patterns.
These visualizations are intended to support interpretation and intuition. They do not replace frequency tables, which remain the authoritative analytical representation.
#pip install matplotlib
#pip install wordcloud
#pip install numpy
# collections is part of the Python standard library (no installation needed)
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
from matplotlib.gridspec import GridSpec
def to_freq_dict(x):
"""
Convert input into a clean frequency dictionary {str: int}.
"""
# Case 1: already a Counter or dict
if isinstance(x, (Counter, dict)):
items = x.items()
else:
# if it's a list of tokens
items = Counter(x).items()
clean = {}
for k, v in items:
# Ensure value is numeric
clean[str(k)] = int(v)
return clean
def plot_ngram(
freq_like,
title,
top_n=15,
max_font_size=150,
min_font_size=18,
scale=4
):
words = to_freq_dict(freq_like)
top = dict(
sorted(words.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
)
fig = plt.figure(figsize=(18,6))
    gs = GridSpec(1, 2, width_ratios=[1, 1.5])  # give more space to the word cloud
# Bar plot
ax1 = fig.add_subplot(gs[0])
#ax1.bar(top.keys(), top.values())
# Bar plot (horizontal)
labels = list(top.keys())
values = list(top.values())
# sort for nicer horizontal plotting
pairs = sorted(zip(labels, values), key=lambda x: x[1])
labels, values = zip(*pairs)
ax1.barh(labels, values)
ax1.set_title(f"{title} – Top {top_n} Frequencies")
    ax1.tick_params(axis="y", labelsize=16)  # control label font size
#ax1.tick_params(axis="x", rotation=45)
# Word cloud
ax2 = fig.add_subplot(gs[1])
wc = WordCloud(
width=1300,
height=600,
background_color="white",
max_font_size=max_font_size,
min_font_size=min_font_size,
scale=scale,
prefer_horizontal=1.0,
collocations=False
).generate_from_frequencies(top)
ax2.imshow(wc)
ax2.axis("off")
ax2.set_title(f"{title} – Word Cloud", fontsize=14)
plt.tight_layout()
plt.show()
The code above defines two helper functions that work together to prepare and visualize n-gram frequency information.
Function to_freq_dict()
This function converts different types of input into a standardized frequency dictionary of the form:
n-gram → frequency
Its purpose is to ensure that the visualization function receives data in a consistent and error-free format, regardless of whether the input is:
A Counter object.
A regular dictionary.
Or a list of tokens or n-grams.
In practical terms, this function:
Extracts the frequency counts.
Converts all keys to strings.
Ensures that all frequencies are numeric.
This preprocessing step avoids errors and makes the plotting function more robust.
Function plot_ngram()
This function generates two complementary visual summaries from the frequency information:
Bar plot: Displays the most frequent n-grams and their counts, allowing for direct quantitative comparison.
Word cloud: Provides a qualitative visualization where more frequent n-grams appear more prominently, offering an intuitive overview of relative importance.
For readability, only the top n most frequent n-grams are displayed (controlled by the top_n argument).
Overall, this function plays an exploratory role: it helps us visually inspect patterns in the data, while the frequency tables remain the primary analytical reference.
We now apply the visualization utility to the unigram, bigram, and trigram frequency objects computed earlier. This illustrates how the same function can be reused to explore different levels of textual context.
The resulting plots facilitate comparison across n-gram types and help highlight how contextual information increases as n grows. As emphasized throughout this section, frequency tables remain the primary analytical reference.
# Example usage (these can be Counter/dict or lists)
plot_ngram(unigram_freq, "Unigrams")
plot_ngram(bigram_freq, "Bigrams", top_n=10)
plot_ngram(trigram_freq, "Trigrams", top_n=5, max_font_size=200, min_font_size=25, scale=4)
The preprocessing steps discussed so far—lexical normalization, stopword handling, case folding, n-grams, and noise removal—are typically applied before building statistical or machine learning models.
Which steps are applied, and in what order, depends entirely on the use case. For this reason, preprocessing choices should be viewed as modeling decisions, not fixed or universal rules.
After preprocessing, tokens can be aggregated to form a vocabulary, which defines the set of units used to represent text numerically. The vocabulary serves as the interface between raw text and quantitative representations.
sentence = "Applied statistics supports data-driven decision making"
tokens = set(sentence.lower().split())
vocabulary = sorted(tokens)
vocabulary
## ['applied', 'data-driven', 'decision', 'making', 'statistics', 'supports']
In this example, the vocabulary consists of the unique tokens obtained after basic preprocessing (lowercasing and tokenization). Each element represents a distinct unit that can later be mapped to numerical features.
This vocabulary forms the foundation for representing text in a structured and consistent way, serving as the interface between raw language data and quantitative analysis.
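As a minimal continuation of this example (the index assignment below is illustrative; any consistent mapping would work), each vocabulary entry can be paired with an integer index, which is the usual first step toward numerical representations such as count vectors or embeddings.
# Assign an integer index to each vocabulary entry (uses `vocabulary` from above)
token_to_index = {token: idx for idx, token in enumerate(vocabulary)}
token_to_index
## {'applied': 0, 'data-driven': 1, 'decision': 2, 'making': 3, 'statistics': 4, 'supports': 5}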
While the tokenization methods discussed so far are widely used in classical NLP pipelines, modern large language models (LLMs) rely on more sophisticated subword-based tokenization schemes designed to balance linguistic coverage, efficiency, and scalability.
Most contemporary LLMs do not operate directly on words or characters. Instead, they tokenize text into subword units, which may correspond to full words, word fragments, or even individual characters, depending on frequency and context.
In practice, current models adopt variations of data-driven subword tokenization, including:
Byte Pair Encoding (BPE) and its extensions.
Unigram language model tokenization.
Byte-level tokenization, which operates directly on raw bytes rather than characters.
These approaches allow models to handle rare words, multilingual text, and previously unseen strings while keeping vocabulary sizes manageable.
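To build intuition for how these data-driven schemes work, the sketch below implements a few BPE-style merge rounds in plain Python. It is a toy illustration of the merge idea only (the tiny word-frequency table is an assumption), not the implementation used by any production tokenizer.
from collections import Counter
def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words (each word is a tuple of symbols)
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]
def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged
# Toy corpus: word frequencies, with each word split into characters
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("new"): 3}
for _ in range(3):                      # three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
print(list(words.keys()))
Each round fuses the most frequent adjacent pair of symbols, so frequent substrings (here, for example, "wer") gradually become single vocabulary units, while rare words remain decomposable into smaller pieces.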
Although exact implementations are often proprietary, the following high-level patterns are well established:
ChatGPT / GPT-family models (OpenAI). Use byte-level BPE–style tokenization, where tokens may represent characters, subwords, or frequent word sequences.
Claude (Anthropic). Relies on subword tokenization with strong emphasis on robustness to rare and out-of-vocabulary strings.
Gemini models (Google). Build upon SentencePiece-style tokenization, supporting multilingual and byte-aware representations.
DeepSeek models. Explore advanced compression-aware and context-sensitive tokenization strategies, particularly for long-context and multimodal inputs.
Rather than reflecting linguistic units directly, these tokenizers are optimized for statistical efficiency and model performance.
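As a concrete illustration of byte-level BPE in practice, the short sketch below uses OpenAI's open-source tiktoken package. Assumptions: the package is installed, and the cl100k_base encoding is chosen for the example; exact token boundaries and ids vary across encodings and model families.
#pip install tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")         # a byte-level BPE encoding
ids = enc.encode("Tokenization is not merely a preprocessing step.")
print(ids)                                         # integer token ids
print([enc.decode([i]) for i in ids])              # the subword piece behind each id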
Tokenization remains an active research area, particularly as large language models continue to scale in model size, context length, and modality. Recent studies have revisited the role of tokenization, exploring alternatives to traditional subword-based schemes and highlighting its impact on efficiency, representation, and learning dynamics.
Several recent contributions illustrate these trends:
Google (arXiv, 17 Dec. 2025): Prompt Repetition Improves Non-Reasoning LLMs. This work emphasizes that tokenization-related design choices can significantly affect model efficiency and representational capacity, especially in large-context settings.
Hong Kong and Huazhong Universities (arXiv, 24 Oct. 2025): UniTok: A Unified Tokenizer for Visual Generation and Understanding. This work investigates how tokenization choices interact with model architecture and training dynamics in large language models. The authors show that tokenization affects not only sequence length and efficiency, but also optimization behavior and generalization, further supporting the view of tokenization as a core modeling decision.
DeepSeek (arXiv, 21 Oct. 2025): DeepSeek-OCR: Contexts Optical Compression. Introduces compression-oriented tokenization strategies aimed at improving efficiency and scalability in long-context language models.
Gunther et al. (arXiv, 7 Jul. 2025): Rethinking Tokenization for Large Language Models. Examines limitations of conventional subword tokenizers and proposes alternative formulations that better align with modern LLM architectures.
Pagnoni et al. (arXiv, 13 Dec. 2024): Byte Latent Transformer: Patches Scale Better Than Tokens. Shows that tokenization schemes induce structural biases in LLMs, affecting learned representations and downstream behavior. Supports the view of tokenization as a core architectural design choice.
Schmidt et al. (arXiv, 7 Oct. 2024): Tokenization Is More Than Compression. Analyzes how tokenization choices influence representation learning and downstream task performance beyond simple compression efficiency.
Together, these studies highlight that tokenization is not merely a preprocessing step, but a core design component that directly shapes model capacity, efficiency, and generalization.
This perspective provides a natural bridge between classical NLP preprocessing techniques and the representation learning methods employed in modern deep learning–based language models. Tokenization has evolved from a preprocessing heuristic into a central research topic in large-scale language modeling.
In this document, we examined the main steps involved in constructing a vocabulary for natural language processing tasks. These steps form the foundation of text preprocessing and play a central role in how linguistic data is prepared for analysis.
Text preprocessing is a critical component of any machine learning workflow, and this is especially true in NLP. Thoughtful preprocessing helps reduce noise, control variability, and shape the structure of the data in ways that facilitate effective modeling. When these steps are carefully designed and aligned with the task at hand, they often lead to more stable and interpretable results than approaches that rely on raw text alone.
As discussed in the final sections of this document, many preprocessing decisions (particularly those related to tokenization) also play a fundamental role in modern large language models, where they directly influence efficiency, representation, and overall model behavior.
In other documents (click here), we build on these concepts by applying the preprocessing techniques discussed here to construct mathematical representations of text that can be used directly by machine learning algorithms.
This activity is designed to integrate and apply all the concepts introduced in this document. The reader is asked to work with a short song fragment of their choice and perform a complete lexical analysis using R.
To construct a reproducible lexical analysis pipeline that moves from raw text to tokenization, vocabulary construction, and normalization, illustrating key NLP preprocessing concepts.
Select a song of your choice and work with:
A short fragment (e.g., 6–8 lines), or
A song with public-domain or open licensing.
Create an R Markdown (.Rmd) document that compiles successfully to HTML (or PDF).
The document must include both:
The R code, and
The resulting output (tables, printed objects, or visual summaries).
Briefly describe the chosen song and the reason for selecting it. Include the text fragment used for the analysis.
Define a small lexicon (at least 15 entries) derived from the text, including:
The lexical item.
A conceptual category (e.g., emotion, action, place).
A short description or interpretation.
Explain, in conceptual terms, the distinction between phonemes, graphemes, and morphemes. Illustrate these concepts using a small set of words from the selected text.
Apply and compare different tokenization strategies, including:
Sentence tokenization.
Word tokenization.
Character-level tokenization.
Report:
The total number of tokens.
The most frequent tokens.
A short interpretation of the results.
Using rule-based tokenization, design at least two regular expressions to extract specific entities from the text (e.g., numbers, dates, prices, hashtags).
Show the matched results and explain what each pattern captures.
Apply common normalization techniques, such as:
Lowercasing.
Punctuation removal.
Stopword removal.
Stemming or lemmatization.
Compare the vocabulary before and after normalization and discuss the observed changes.
Provide a concise reflection (5–8 lines) on how lexical choices, tokenization, and normalization affect vocabulary construction and textual representation in NLP.
The R Markdown document must be fully reproducible, meaning that all code chunks execute without errors and generate the reported outputs when the document is compiled.
If you find any ERRORS or have SUGGESTIONS, please report them to my email. Thanks.