Lemma (a set of lexical forms having the same stem) \(\longleftrightarrow\) word forms (its inflected variants)
Run (lemma)
Runs (third person singular present)
Ran (simple past)
Running (present participle)
Basic terms in text-as-data/NLP
Tokens and types
Token: the total number N of running words
Type: the number of distinct words in a corpus (the vocabulary size, denoted |V|)
E.g., “they picnicked by the pool, then lay back on the grass and looked at the stars”
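A minimal sketch of this count in Python, using a simple regular-expression split rather than a full tokenizer:

```python
# Count tokens (running words) and types (distinct words) in the example sentence.
import re

text = "they picnicked by the pool, then lay back on the grass and looked at the stars"
tokens = re.findall(r"\w+", text.lower())  # word characters only; punctuation is dropped

print("Tokens (N):", len(tokens))          # 16 running words
print("Types (|V|):", len(set(tokens)))    # 14 distinct words ("the" occurs three times)
```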
Basic terms in text-as-data/NLP
Tokens and types
Heaps’ Law: vocabulary size grows sublinearly with corpus size, \(|V| = kN^{\beta}\) with \(0 < \beta < 1\)
Basic terms in text-as-data/NLP
Strings
A sequence of characters
“file upload complete”
“I got a new job today”
“100%”
“?action=edit”
Basic terms in text-as-data/NLP
Reg(ular) ex(pression)
Notation for characterizing a set of strings
Powerful way to search text based on certain patterns
E.g., cellphone numbers in South Korea 010-\d{4}-\d{4}
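A minimal sketch with Python’s built-in re module; the sample text is made up for illustration:

```python
# Find South Korean cellphone numbers matching the pattern 010-XXXX-XXXX.
import re

pattern = re.compile(r"010-\d{4}-\d{4}")
text = "Call me at 010-1234-5678, or use the office line."

print(pattern.findall(text))  # ['010-1234-5678']
```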
Unit of Analysis
The main element that is being analyzed in a study
“What” or “who” that is being studied
Depends on the research question
Unit of Analysis
Typically, information about one unit is recorded as one row
Think of a survey data set
Unit of Analysis
“How have the dominant themes in a corpus of 19th-century British literature changed?”
Data: literary works (novels, poems, etc.)
Unit of analysis: whole pieces, chapters, paragraphs, etc.
Unit of Analysis
“What are the key scientific topics debated within the scientific community between 2000–2020?”
Data: scientific publications (e.g., Dimensions, Web of Science, etc.)
Unit of analysis: titles, abstracts, introductions, full texts, etc.
Unit of Analysis
The key consideration is our research question
Unit of Analysis
E.g., Barbera et al. (2019)
Investigate whether politicians respond to the public’s policy interests, focusing on Twitter (2013–2014)
Run topic models (Latent Dirichlet Allocation) on tweets from ordinary users and 500+ legislators in the U.S.
Examine whether the topics in the former at time t predict those in the latter at t+1
Unit of Analysis
E.g., Barbera et al. (2019)
“Our definition of ‘document’ is the aggregated total of tweets sent by members of Congress each day”
“Our conceptualization of each day’s tweets as the political agenda that each party within each legislative chamber is trying to push for that specific day”
“Conducting an analysis at the tweet level is complex, given its very limited length”
Unit of Analysis
E.g., Hammer et al. (2019)
Unit of Analysis
E.g., Hammer et al. (2019)
Use supervised learning to detect threatening speech in YouTube comments (i.e., text classification)
Comments on YouTube videos are split into individual sentences
Tokenization
Breaking up a text into discrete components
Tokenization is a form of segmentation (= word segmentation)
Token: each individual component in the document
Possibly including numbers, punctuation, or other symbols
E.g., if the model has vi (as in virus) and rologist (as in neurologist, urologist, etc.), it can handle virologist
Reduced vocabulary / more efficiency (again, consider “tokenization”)
Common approaches include Byte Pair Encoding (BPE) for GPT, WordPiece for BERT
Character-level: individual characters carry no meaning (although computationally very efficient)
E.g., what would the vocabulary size be?
Sentence-level: too many types
Tokenization
Subword tokenization in GPTs
Tokenization
Tokenization at the word level
In English (and many other languages, including Korean), we can rely heavily on white space
Many algorithms build not only on white space but also on various patterns
E.g., apostrophes (“don’t”)
E.g., punctuation (“vehicles?”)
Tools include NLTK, spaCy, Keras, etc. (see the sketch below)
In some languages, words cannot be separated as easily as in English, and specialized models are required (e.g., Chinese, Japanese, etc.).
E.g., “我喜欢吃苹果” (I like eating apples)
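A minimal sketch of word-level tokenization with NLTK, assuming the package is installed and the punkt models can be downloaded:

```python
# Word-level tokenization that goes beyond white space:
# apostrophes and punctuation are split into their own tokens.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (one-time download)
from nltk.tokenize import word_tokenize

print(word_tokenize("Don't stop the vehicles?"))  # toy string
# NLTK splits "Don't" into "Do" + "n't" and treats "?" as a separate token.
```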
Tokenization
Tokenization at the sub-word level
Very common in LLMs
Tokenizers are built (trained) as a separate process before model training
After a tokenizer is initialized and trained, it is used in the training process of its associated (L)LM \(\rightarrow\) the model is “locked” to its tokenizer \(\rightarrow\) each (L)LM is linked to its own tokenizer
Tokenization
Tokenization at the sub-word level (cont’d)
Methods: while there are various methods, they all aim to find an efficient set of tokens to represent a text data set
E.g., Byte Pair Encoding or WordPiece
Special tokens: used to indicate specific roles or structures
E.g., [CLS] (BERT) or <s> (GPT) to mark the beginning of input/output
E.g., [SEP] (BERT) or </s> (GPT) to separate sentences or mark the end of input/output
Vocabulary size: one must decide how many tokens to keep in the tokenizer’s vocabulary
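A minimal sketch of sub-word tokenization via Hugging Face’s transformers library, assuming it is installed and the bert-base-uncased tokenizer can be downloaded:

```python
# Sub-word (WordPiece) tokenization with a pretrained, "locked" tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word is typically broken into smaller pieces (continuation pieces start with "##"),
# while frequent words are kept whole.
print(tokenizer.tokenize("The virologist ran the experiment."))

print(tokenizer.vocab_size)                      # the fixed vocabulary size
print(tokenizer.cls_token, tokenizer.sep_token)  # special tokens: [CLS], [SEP]
```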
Tokenization
n-grams
A sequence of n adjacent tokens
Unigrams, bigrams, trigrams, etc.
Why would we need multi-grams?
E.g., “White House”, “look after”, “take care of”, etc.
Tokenization
n-grams
Be aware of the computational cost
Consider the number of all consecutive sets of two words in the corpus
Alternatively, we can compile a list of particular bigrams or trigrams
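A minimal sketch of generating bigrams from a token list in plain Python (NLTK provides equivalent helpers such as nltk.bigrams):

```python
# Build bigrams by pairing each token with its successor.
tokens = ["the", "white", "house", "issued", "a", "statement"]

bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('the', 'white'), ('white', 'house'), ('house', 'issued'), ('issued', 'a'), ('a', 'statement')]
# A document with N tokens yields N-1 bigrams, so the feature space grows quickly.
```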
Tokenization
n-grams
With modern (L)LMs, we no longer need to explicitly generate n-grams most of the time
However, the idea behind n-grams remains relevant
The purpose of n-grams is often achieved implicitly through subword tokenization and attention mechanisms in modern transformer-based models
This includes capturing local word patterns, collocations, and context
E.g., subword tokenization and attention capture common patterns such as “artificial intelligence” or “New York”
Segmenting Sentences/paragraphs
Sentence segmentation
Useful cues: periods, question marks, or exclamation marks
Prone to errors (e.g., the period “.” is ambiguous)
Abbreviations and initials: “Ph.D.”, “J.K. Rowling”, etc.
Decimal numbers: “3.14”
Websites (e.g., “www.kaist.ac.kr”) and email addresses
Quotations within a sentence: “He said, ‘Stop.’ Then he left.”
Rule-based/deterministic or ML-based approaches (part of NLTK and spaCy)
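A minimal sketch of sentence segmentation with NLTK’s punkt model, assuming nltk is installed and the model data can be downloaded:

```python
# Sentence segmentation that is robust to (most) abbreviations and decimals.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

text = "Dr. Smith finished a Ph.D. in 2020. The value of pi is about 3.14. Then she left."
for sentence in sent_tokenize(text):
    print(sentence)
# "Dr.", "Ph.D.", and "3.14" should not trigger sentence breaks, though edge cases
# (e.g., quotations within a sentence) can still be segmented imperfectly.
```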
Segmenting Sentences/paragraphs
Paragraph segmentation
Not as commonly addressed
There are useful cues
Newline characters (\n) or double newline characters (\n\n)
Indentations (e.g., \t)
With HTML documents, we could potentially use tags (e.g., <p>) to parse different parts of the document (not necessarily paragraphs though)
Fewer specialized libraries or algorithms in Python
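A minimal sketch of paragraph segmentation using only the standard library, splitting on blank lines:

```python
# Split a document into paragraphs on one or more blank lines.
import re

document = "First paragraph, sentence one. Sentence two.\n\nSecond paragraph.\n\n\nThird paragraph."

paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
print(paragraphs)
# ['First paragraph, sentence one. Sentence two.', 'Second paragraph.', 'Third paragraph.']
```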
Text Normalization
A set of approaches to reducing complexity in text (a.k.a. pre-processing)
The output from tokenization will contain too many words
With normalization, vocabulary size can be reduced (computationally more efficient)
Exclamation mark (!!!), hashtags (#metoo), emojis (:)), etc.
Punctuation itself can be of interest (studying writing styles)
Text Normalization
Removing stop words
Common words used across documents that do not give much information
E.g., “and”, “the”, or “that”
Text Normalization
Removing stop words can spare much computational power
Cf. Heaps’ Law
However, under what circumstances are these words not stop words? For instance, consider the case of “the”.
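A minimal sketch of the removal step with NLTK’s built-in English stop-word list (assuming nltk is installed and the stopwords corpus downloaded); note that “the” is dropped here regardless of context:

```python
# Remove common function words using NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = ["the", "band", "released", "a", "new", "album", "and", "a", "tour"]

print([t for t in tokens if t not in stops])
# ['band', 'released', 'new', 'album', 'tour']
```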
Text Normalization
Lemmatization
Lemma: the base form
E.g., “run”
Wordform: the various forms derived from the lemma
E.g., “runs”, “ran”, “running”
Lemmatization is the process of mapping words to their lemma
Text Normalization
Lemmatization
Not always straightforward
Irregular variations, e.g., “see-saw-seen”
Same token but different lemmas
E.g., he is “writing” an email vs. a nice piece of “writing”
Necessitates a dictionary and POS (part of speech) tagging
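A minimal sketch of lemmatization with NLTK’s WordNet lemmatizer; note how the POS tag disambiguates “writing”, as discussed above (assumes nltk is installed and the wordnet data can be downloaded):

```python
# Map wordforms to their lemma; a POS tag is needed to resolve ambiguous tokens.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("running", pos="v"))  # 'run'
print(wnl.lemmatize("ran", pos="v"))      # 'run'      (irregular variation)
print(wnl.lemmatize("writing", pos="v"))  # 'write'    (he is "writing" an email)
print(wnl.lemmatize("writing", pos="n"))  # 'writing'  (a nice piece of "writing")
```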
Text Normalization
Stemming is a popular approximation to lemmatization
Simply discards the end of a word
E.g., “family” \(\rightarrow\) “famili”
Errors
E.g., “leav” for both “leaves” (as in “He leaves the room”) and “leaves” (as in parts of a plant)
Various algorithms: Porter, Lancaster, etc.
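A minimal sketch of stemming with NLTK’s Porter stemmer, reproducing the examples above:

```python
# Porter stemming simply strips word endings, which is fast but lossy.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("family"))   # 'famili'
print(stemmer.stem("leaves"))   # 'leav' (whether the verb or the plant part)
print(stemmer.stem("running"))  # 'run'
```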
Text Normalization
Filtering by frequency
Too (in)frequent words across documents
E.g., stop words
Can be filtered by the minimum/maximum document frequencies
Words that appear in fewer/more than n% of documents
The rationale
Discriminatory power
Computational savings
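A minimal sketch of frequency-based filtering with scikit-learn’s CountVectorizer, where min_df and max_df set the document-frequency bounds (the toy documents are made up):

```python
# Drop terms that appear in too few (min_df) or too many (max_df) documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the economy is growing",
    "the economy is slowing",
    "the weather is nice today",
    "one rare typo appears here",
]

# keep terms occurring in at least 2 documents but in no more than 50% of them
vectorizer = CountVectorizer(min_df=2, max_df=0.5)
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# only 'economy' survives: 'the' and 'is' are too frequent, the rest too rare
```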
Text Normalization
(How) Should we normalize?
Difficult to know its consequences a priori
Before analysis: carefully think about the pros and cons of each step
After analysis: conduct robustness checks
Text Normalization
Normalization with LLMs
With LLMs, traditional normalization steps are less relevant, especially due to the widespread adoption of subword-level tokenization (BPE, WordPiece, etc.)
However, they are still relevant depending on the context
Text Normalization
Normalization with LLMs (cont’d)
Lower-casing: different models deal with lower-casing differently
BERT-base-uncased vs. BERT-base-cased, GPT models, etc.
Punctuation and stopwords: LLMs can handle them effectively by treating them as meaningful tokens
Lemmatization/stemming: subword tokenization already handles word variations
Cosine similarity between vectors \(\vec{A}\) and \(\vec{B}\) is given by:
\[\text{Cosine Similarity}(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \, \|\vec{B}\|}\]
where \(\vec{A} \cdot \vec{B} = \sum_{i} A_i B_i\) is the inner product, and \(\|\vec{A}\| = \sqrt{\sum_{i} A_i^2}\) and \(\|\vec{B}\| = \sqrt{\sum_{i} B_i^2}\) are the Euclidean norms
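A minimal sketch of this computation with NumPy, using two made-up count vectors:

```python
# Cosine similarity: the inner product of A and B divided by the product of their norms.
import numpy as np

A = np.array([2.0, 1.0, 0.0, 3.0])
B = np.array([1.0, 0.0, 0.0, 2.0])

cosine = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(cosine, 3))  # approaches 1 as the vectors point in more similar directions
```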
However, some words are too frequent across different documents
E.g., the, a, an, etc.
We want to weight how unique a word is to a document
TF-IDF Weighting
TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus.
TF-IDF Weighting
The TF-IDF value is obtained by multiplying TF (Term Frequency) and IDF (Inverse Document Frequency) for a term in a document, highlighting the importance of rare terms
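In one standard formulation (library implementations differ in smoothing and log scaling), the weight of term \(t\) in document \(d\), over a corpus of \(N\) documents, is
\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t), \qquad \text{IDF}(t) = \log \frac{N}{\text{DF}(t)}\]
where \(\text{TF}(t, d)\) is the frequency of \(t\) in \(d\) and \(\text{DF}(t)\) is the number of documents containing \(t\); a term that appears in every document gets \(\text{IDF} = \log 1 = 0\) and thus zero weight.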