Text-as-Data/NLP I

HSS 611: Programming for HSS

Taegyoon Kim

Nov 11, 2025

Agenda

Things to be covered

  • Basic terms in text-as-data/NLP
  • Unit of analysis
  • Tokenization (segmentation)
  • Text normalization (= cleaning)
  • BoW / vector space models
  • Cosine similarity
  • TF-IDF weighting

Basic terms in text-as-data/NLP

Corpus

  • Corpus: a computer-readable collection of text or speech
  • Size of corpus: the number of texts (documents) or words

Basic terms in text-as-data/NLP

Sentence and punctuation

  • A set of words that is complete in itself
  • Utterance is the spoken version of a sentence
    • Disfluencies: fragments, fillers/filled pauses
    • E.g., “I do uh main- mainly business data processing”
  • Punctuation: period, comma, apostrophe, quotation, question, exclamation, brackets, parenthesis, dash (—), hyphen (-), ellipsis (…), colon, semicolon

Basic terms in text-as-data/NLP

Lemma

  • A set of lexical forms having the same stem \(\longleftrightarrow\) word forms
  • Run (lemma)
    • Runs (third person singular present)
    • Ran (simple past)
    • Running (present participle)

Basic terms in text-as-data/NLP

Tokens and types

  • Token: an instance of a word; the total number of running words is N
  • Type: a distinct word in a corpus; the number of types (the vocabulary size) is |V|
  • E.g., “they picnicked by the pool, then lay back on the grass and looked at the stars” (16 tokens, 14 types, ignoring punctuation)

Basic terms in text-as-data/NLP

Tokens and types

  • Heaps’ Law: vocabulary size grows sublinearly with corpus size, roughly \(|V| = kN^\beta\) with \(\beta < 1\)
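A minimal sketch counting tokens and types for the example sentence above:

    # Naive whitespace tokenization after stripping the comma
    text = "they picnicked by the pool, then lay back on the grass and looked at the stars"
    tokens = text.replace(",", "").split()
    print(len(tokens), len(set(tokens)))   # N = 16 tokens, |V| = 14 types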

Basic terms in text-as-data/NLP

Strings

  • A sequence of characters

    • “file upload complete”
    • “I got a new job today”
    • “100%”
    • “?action=edit”

Basic terms in text-as-data/NLP

Reg(ular) ex(pression)

  • Notation for characterizing a set of strings
  • Powerful way to search text based on certain patterns
  • E.g., cellphone numbers in South Korea 010-\d{4}-\d{4}
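A minimal sketch with Python’s re module (the phone number below is made up for illustration):

    import re

    text = "Call me at 010-1234-5678 or email me."
    print(re.findall(r"010-\d{4}-\d{4}", text))   # ['010-1234-5678']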

Unit of Analysis

The main element that is being analyzed in a study

  • “What” or “who” that is being studied
  • Depends on the research question

Unit of Analysis

Typically, information about one unit is recorded as one row

  • Think of a survey data set

Unit of Analysis

“How have the dominant themes in a corpus of 19th-century British literature changed?”

  • Data: literary works (novels, poems, etc.)
  • Unit of analysis: whole pieces, chapters, paragraphs, etc.

Unit of Analysis

“What are the key scientific topics debated within the scientific community between 2000–2020?”

  • Data: scientific publications (e.g., Dimensions, Web of Science, etc.)
  • Unit of analysis: titles, abstracts, introductions, full texts, etc.

Unit of Analysis



The key consideration is our research question

Unit of Analysis

E.g., Barbera et al. (2019)

  • Investigate whether politicians respond to people’s policy interests, focused on Twitter (2013–2014)
  • Run topic models (Latent Dirichlet Allocation) on tweets from ordinary users and 500+ legislators in the U.S.
  • Examine whether the topics in the former at time t predict those in the latter at t+1

Unit of Analysis

E.g., Barbera et al. (2019)

  • “Our definition of “document” is the aggregated total of tweets sent by members of Congress each day”
  • “Our conceptualization of each day’s tweets as the political agenda that each party within each legislative chamber is trying to push for that specific day”
  • “Conducting an analysis at the tweet level is complex, given its very limited length”

Unit of Analysis

E.g., Hammer et al. (2019)

Unit of Analysis

E.g., Hammer et al. (2019)

  • Use supervised learning to detect threatening speech on YouTube comments (i.e., text classification)
  • Comments on YouTube videos are split into individual sentences

Tokenization

Breaking up a text into discrete components

  • Tokenization is a form of segmentation (= word segmentation)
  • Token: each individual component in the document
    • Possibly including numbers, punctuation, or other symbols

Tokenization

“To be or not to be, that is the question”

⟶ “To”, “be”, “or”, “not”, “to”, “be”, “that”, “is”, “the”, “question”

Tokenization

Types

  • Each token is of a particular “type”
  • The set of types is the vocabulary V (its size is often denoted |V|)
  • “To be or not to be, that is the question”

    ⟶ “to”, “be”, “or”, “not”, “that”, “is”, “the”, “question” (|V| = 8)

Tokenization

“Let us explore tokenization.”

  • Word-level: [“Let”, “us”, “explore”, “tokenization.”]
  • Subword-level: [“Let”, “us”, “explore”, “token”, “ization.”]
  • Character-level: [“L”, “e”, “t”, “u”, “s”, “e”, “x”, “p”, “l”, “o”, “r”, “e”, “t”, “o”, “k”, “e”, “n”, “i”, “z”, “a”, “t”, “i”, “o”, “n”, “.”]

Tokenization

Levels of tokenization

  • Words: most common pre-LLM
  • Subwords: now prevalent in neural NLP / LLMs
    • Handling of OOV (out-of-vocabulary) words
      • E.g., if the model has “vi” (as in virus) and “rologist” (as in neurologist, urologist, etc.), it can handle “virologist”
    • Reduced vocabulary / more efficiency (again, consider “tokenization”)
    • Common approaches include Byte Pair Encoding (BPE, used in GPT models) and WordPiece (used in BERT)
  • Characters: individual characters carry no meaning on their own (although computationally efficient thanks to a tiny vocabulary)
    • E.g., what would be the vocabulary size?
  • Sentences: too many types

Tokenization

Subword tokenization in GPTs
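A minimal sketch, assuming the Hugging Face transformers library and the pretrained gpt2 and bert-base-uncased tokenizers are available:

    from transformers import AutoTokenizer

    # BPE tokenizer (GPT-2) vs. WordPiece tokenizer (BERT)
    gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    print(gpt2_tok.tokenize("Let us explore tokenization."))
    print(bert_tok.tokenize("Let us explore tokenization."))
    # Rare words such as "tokenization" are split into subword pieces
    # (e.g., "token" + "ization"), while common words stay whole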

Tokenization

Tokenization at the word level

  • In English (and many other languages, including Korean), we can rely heavily on white space
  • Many algorithms build not only on white space but also on various patterns
    • E.g., apostrophes (“don’t”)
    • E.g., punctuation (“vehicles?”)
  • Tools include NLTK, spaCy, Keras, etc. (see the sketch below)
  • In some languages (e.g., Chinese, Japanese), words cannot be separated as easily as in English, and specialized segmentation models are required
    • E.g., “我喜欢吃苹果” (I like eating apples)
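A minimal sketch with NLTK, as mentioned above (the punkt tokenizer data may need to be downloaded first):

    import nltk
    # nltk.download("punkt")   # one-time download of the tokenizer data
    from nltk.tokenize import word_tokenize

    print(word_tokenize("Don't stop the vehicles?"))
    # ['Do', "n't", 'stop', 'the', 'vehicles', '?'] -- apostrophes and
    # punctuation are handled beyond simple whitespace splitting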

Tokenization

Tokenization at the sub-word level

  • Very common in LLMs
  • Tokenizers are built (trained) as a separate process before model training
  • After a tokenizer is initialized and trained, it is then used in the training process of its associated (L)LMs
    \(\rightarrow\) The model is “locked” to its tokenizer
    \(\rightarrow\) (L)LMs are linked with their tokenizers

Tokenization

Tokenization at the sub-word level (cont’d)

  • Methods: while there are various methods, they all aim to find an efficient set of tokens for representing a text data set
    • E.g., Byte Pair Encoding or WordPiece
  • Special tokens: used to indicate specific roles or structures
    • E.g., [CLS] (BERT) or <s> (GPT) to mark the beginning of input/output
    • E.g., [SEP] (BERT) or </s> (GPT) to separate sentences or mark the end of input/output
  • Vocabulary size: one must decide how many tokens to keep in the tokenizer’s vocabulary

Tokenization

n-grams

  • A sequence of n adjacent tokens
  • Unigrams, bigrams, trigrams, etc.
  • Why would we need multi-grams?
    • E.g., “White House”, “look after”, “take care of”, etc.
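For instance, bigrams can be generated with plain Python (a minimal sketch):

    # Pair each token with the one that follows it
    tokens = ["take", "care", "of", "the", "White", "House"]
    bigrams = list(zip(tokens, tokens[1:]))
    print(bigrams)
    # [('take', 'care'), ('care', 'of'), ('of', 'the'), ('the', 'White'), ('White', 'House')]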

Tokenization

n-grams

  • Be aware of the computational cost
    • Consider the number of all consecutive sets of two words in the corpus
  • Alternatively, we can compile a list of particular bi-grams or tri-grams

Tokenization

n-grams

  • With modern (L)LMs, we no longer need to explicitly generate n-grams most of the time
  • However, the idea behind n-grams remains relevant
  • The purpose of n-grams is often achieved implicitly through subword tokenization and attention mechanisms in modern transformer-based models
    • This includes capturing local word patterns, collocations, and context
    • E.g., subword tokenization reflects common patterns (e.g., “artificial intelligence” or “New York”)

Segmenting Sentences/Paragraphs

Sentence segmentation

  • Useful cues: periods, question marks, or exclamation marks
  • Prone to errors (the example of ".")
    • Abbreviations and initials: “Ph.D.”, “J.K. Rowling”, etc.
    • Decimal numbers: “3.14”
    • Websites and email addresses: e.g., “www.kaist.ac.kr”
    • Quotations within a sentence: “He said, ‘Stop.’ Then he left.”
  • Rule-based/deterministic or ML-based approaches (available in NLTK and spaCy; see the sketch below)
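A minimal sketch with NLTK’s sentence tokenizer on a made-up example (spaCy offers similar functionality):

    from nltk.tokenize import sent_tokenize   # may require nltk.download("punkt")

    text = "She earned a Ph.D. at KAIST. Pi is roughly 3.14. He said, 'Stop.' Then he left."
    print(sent_tokenize(text))
    # Abbreviations and decimals are usually handled, but quotations and
    # initials can still cause segmentation errors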

Segmenting Sentences/Paragraphs

Paragraph segmentation

  • Not as commonly addressed
  • There are useful cues
    • Newline characters (\n) or double newline characters (\n\n)
    • Indentations (e.g., \t)
    • With HTML documents, we could potentially use tags (e.g., <p>) to parse different parts of the document (not necessarily paragraphs though)
  • Fewer specialized libraries or algorithms in Python (a simple sketch follows)
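A minimal sketch splitting on blank lines (double newlines), one common heuristic:

    import re

    text = "First paragraph, first sentence. Second sentence.\n\nSecond paragraph."
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    print(paragraphs)
    # ['First paragraph, first sentence. Second sentence.', 'Second paragraph.']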

Text Normalization

A set of approaches to reducing complexity in text (a.k.a. pre-processing)

  • The output from tokenization will typically contain a very large number of word types
  • With normalization, vocabulary size can be reduced (computationally more efficient)
  • It can enhance many downstream tasks
    • E.g., topic modeling (player, players, playing, etc.)
    • E.g., information retrieval: finding a pattern in a corpus (e.g., Penny, Pennies, penny, pennies, etc.)

Text Normalization

We will discuss five approaches

  • Lowercasing
  • Removing punctuation
  • Removing stop words
  • Lemmatization/stemming
  • Filtering by frequency

Text Normalization

Lowercasing

  • We often replace all capital letters with lowercase letters
  • It is assumed that there is no (semantic) difference
  • Is it?

Text Normalization

Lowercasing

  • Compare “NOW” and “now” in terms of sentiment
  • Capital letters also signal the start of a sentence
  • Proper nouns (May vs. may, US vs. us)

Text Normalization

Removing punctuation

  • Period (.), comma (,), apostrophe ('), quotation (""), question (?), exclamation (!), dash (-), ellipsis (...), colon (:), semicolon (;), etc.
  • In many cases, these are (considered) unimportant
  • Are they?

Text Normalization

Removing punctuation

  • Punctuation carries important information
    • Exclamation mark (!!!), hashtags (#metoo), emojis (:)), etc.
  • Punctuation itself can be of interest (studying writing styles)

Text Normalization

Text Normalization

Removing stop words

  • Common words used across documents that do not give much information
  • E.g., “and”, “the”, or “that”
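A minimal sketch with NLTK’s English stop-word list (assuming the stopwords corpus has been downloaded):

    from nltk.corpus import stopwords   # may require nltk.download("stopwords")

    stops = set(stopwords.words("english"))
    tokens = ["the", "fox", "and", "the", "dog", "looked", "at", "the", "stars"]
    print([t for t in tokens if t not in stops])
    # ['fox', 'dog', 'looked', 'stars']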

Text Normalization

Removing stop words can save substantial computational effort

  • Cf. Heaps’ Law
  • However, under what circumstances are these words not stop words? For instance, consider the case of “the”.

Text Normalization

Lemmatization

  • Lemma: the base form
    • E.g., “run”
  • Wordform: various forms derived from the lemma
    • E.g., “runs”, “ran”, “running”
  • Lemmatization is the process of mapping words to their lemma

Text Normalization

Lemmatization

  • Not always straightforward
    • Irregular variations, e.g., “see-saw-seen”
    • Same token but different lemmas
      • E.g., he is “writing” an email vs. a nice piece of “writing”
  • Necessitates a dictionary and POS (part-of-speech) tagging (see the sketch below)
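A minimal sketch with NLTK’s WordNet lemmatizer, showing why POS information matters (assuming the wordnet corpus has been downloaded):

    from nltk.stem import WordNetLemmatizer   # may require nltk.download("wordnet")

    lem = WordNetLemmatizer()
    print(lem.lemmatize("running"))            # 'running' (treated as a noun by default)
    print(lem.lemmatize("running", pos="v"))   # 'run' (lemmatized as a verb)
    print(lem.lemmatize("ran", pos="v"))       # 'run' (irregular form handled via WordNet)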

Text Normalization

Stemming is a popular approximation to lemmatization

  • Simply discards the end of a word
    • E.g., “family” ⟶ “famili”
  • Errors
    • E.g., “leav” for both “leaves” (as in “He leaves the room”) and “leaves” (as in parts of a plant)
  • Various algorithms: Porter, Lancaster, etc. (see the sketch below)
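A minimal sketch with NLTK’s Porter stemmer:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["family", "leaves", "running", "runs"]])
    # ['famili', 'leav', 'run', 'run'] -- crude but fast, with occasional errors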

Text Normalization

Filtering by frequency

  • Too (in)frequent words across documents
    • E.g., stop words
  • Can be filtered by minimum/maximum document frequencies (see the sketch below)
    • Words that appear in fewer/more than n% of documents
  • The rationale
    • Discriminatory power
    • Computational savings
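A minimal sketch with scikit-learn’s CountVectorizer, which supports frequency-based filtering directly (the documents and thresholds below are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the cat ran", "the dog ran", "a bird sang"]

    # Keep terms that appear in at least 2 documents and in at most 70% of them
    vectorizer = CountVectorizer(min_df=2, max_df=0.7)
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())   # ['cat' 'ran'] -- 'the' is too frequent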

Text Normalization

(How) Should we normalize?

  • Difficult to know its consequences a priori
  • Before analysis: carefully think about the pros and cons of each step
  • After analysis: conduct robustness checks

Text Normalization

Normalization with LLMs

  • With LLMs, traditional normalization steps are less relevant, especially due to the widespread adoption of subword-level tokenization (BPE, WordPiece, etc.)
  • However, they are still relevant depending on the context

Text Normalization

Normalization with LLMs (cont’d)

  • Lower-casing: different models deal with lower-casing differently
    • BERT-base-uncased vs. BERT-base-cased, GPT models, etc.
  • Punctuation and stopwords: LLMs can handle them effectively by treating them as meaningful tokens
  • Lemmatization/stemming: subword tokenization already handles word variations
    • See this book (pp. 47–55) for various approaches
  • White spaces: white space handling can matter where it represents structure
    • E.g., four white spaces as a single token representing an indentation should work better for code (see CodeBERT)

Text Representation

Text representation as a model

  • Text representation abstracts away from reality (actual texts)
  • It serves as a simplified model of language and meaning
  • This applies to LLMs too
  • “All models are wrong, some are useful” (George Box, 1976)

Text Representation

Levels of text representation

  • Word representations

    • Static embeddings: Word2Vec, GloVe, FastText (Week 6)
      • The same embedding for “bank” in “river bank” and “bank account”
    • Contextual embeddings: neural embeddings from transformer models (BERT, GPT) (Week 12)
      • The embedding for “bank” differs across contexts

Text Representation

Levels of text representation (cont’d)

  • Representations beyond the word-level (sentences, paragraphs, or documents)

    • Lexical approaches
      • Bag of Words (BoW): basic word frequency representation
      • TF-IDF (Term Frequency-Inverse Document Frequency): weighting of words based on their importance to documents
    • Transformer-based neural embeddings can be used for representing entire sentences, paragraphs, or documents
      • There are models specifically tailored for sentence-level representations too (e.g., S-BERT, Universal Sentence Encoder)

The Bag of Words (BoW) model

The most basic text representation model

  • A text is represented as the collection (“bag”) of words that appear in it, disregarding word order

The Bag of Words (BoW) model

Document-Feature Matrix (or Document-Term Matrix)

  • Columns record features/terms (all types, i.e., |V| columns)
  • Rows record documents
  • Cells contain binary indicators or counts (so each row is a binary or count vector)

The Bag of Words (BoW) model

An example corpus

  • Doc 1: “The clever fox cleverly jumps over the lazy dog, showcasing its cleverness.”
  • Doc 2: “Magic and mysteries mingle in the wizard’s daily musings, revealing mysteries unknown.”
  • Doc 3: “Sunny days bring sunshine and sunsets, making sunny parks the best for sunny strolls.”

The Bag of Words (BoW) model

An example DFM

Document   clever  jumps  lazy  dog  magic  mysteries
Doc 1           3      1     1    1      0          0
Doc 2           0      0     0    0      1          2
Doc 3           0      0     0    0      0          0
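A minimal sketch building a DFM from this corpus with scikit-learn (without stemming, “clever”, “cleverly”, and “cleverness” remain distinct features, so the raw output is wider than the simplified table above):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The clever fox cleverly jumps over the lazy dog, showcasing its cleverness.",
        "Magic and mysteries mingle in the wizard's daily musings, revealing mysteries unknown.",
        "Sunny days bring sunshine and sunsets, making sunny parks the best for sunny strolls.",
    ]

    vectorizer = CountVectorizer()          # lowercases and ignores punctuation by default
    dfm = vectorizer.fit_transform(docs)    # documents x features (sparse count matrix)

    print(dfm.shape)                        # (3, |V|)
    print(vectorizer.get_feature_names_out())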


The Vector Space Model

What is the vector space model?

  • Each row (representing a text) in a DFM is a vector (an array of numbers) in a high-dimensional space
  • The dimensionality of the space (the number of columns) is |V|
  • Originates from IR (Information Retrieval)
    • See Turney and Pantel (2010) for details

Comparing Texts

With some form of DFM, we are ready to compare different documents

  • “Similar” can mean different things
    • Sentiments, stances, themes, etc.
  • There is no “correct” notion of similarity
  • Yet there are metrics that are more or less effective across contexts

Cosine Similarity

We have two vectors (representing two documents), \(\vec{A}\) and \(\vec{B}\):

\[\vec{A} = [a_1, a_2, \ldots, a_n]\]
\[\vec{B} = [b_1, b_2, \ldots, b_n]\]
The inner product:

\[\vec{A} \cdot \vec{B} = (a_1 \times b_1) + (a_2 \times b_2) + \ldots + (a_n \times b_n)\]

Cosine Similarity

Cosine similarity between vectors \(\vec{A}\) and \(\vec{B}\) is given by:

\[\text{Cosine Similarity} (\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|}\] \(\vec{A} \cdot \vec{B}\) is the inner product, and \(\|\vec{A}\|\) and \(\|\vec{B}\|\) are defined as

\[ \|\vec{A}\| = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}, \qquad \|\vec{B}\| = \sqrt{b_1^2 + b_2^2 + \ldots + b_n^2} \]
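A minimal sketch with NumPy, using two toy count vectors (the vectors below are made up for illustration):

    import numpy as np

    # Two toy count vectors representing two short documents
    A = np.array([1, 1, 1, 0])
    B = np.array([1, 1, 0, 1])

    cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
    print(round(cosine, 3))   # 0.667: the documents share two of their three terms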

Cosine Similarity

Cosine Similarity

TF-IDF Weighting

TF (Term Frequency) - IDF (Inverse Document Frequency)

  • Count vectors consider the frequencies of words
  • However, some words are too frequent across different documents
    • E.g., the, a, an, etc.
  • We want to weight words by how unique they are to a document

TF-IDF Weighting

TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus.

TF-IDF Weighting

The TF-IDF value is obtained by multiplying TF (Term Frequency) and IDF (Inverse Document Frequency) for a term in a document, highlighting terms that are frequent within a document but rare across the corpus

\[ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \]

TF-IDF Weighting

Term Frequency

  • Reflects how frequently a term occurs in a document, normalized by the document length

\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]

TF-IDF Weighting

Inverse Document Frequency

  • Scales down terms that occur very frequently across the corpus and are less informative

\[ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in } D}{\text{Number of documents containing term } t + 1}\right) \]
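A minimal sketch implementing exactly these formulas in plain Python, on a tiny made-up corpus (note that libraries such as scikit-learn use slightly different variants):

    import math

    def tf(term, doc):
        # Proportion of the document's tokens that are `term`
        return doc.count(term) / len(doc)

    def idf(term, corpus):
        # log of (number of documents) / (number of documents containing the term + 1)
        n_containing = sum(1 for doc in corpus if term in doc)
        return math.log(len(corpus) / (n_containing + 1))

    def tf_idf(term, doc, corpus):
        return tf(term, doc) * idf(term, corpus)

    corpus = [["can", "you", "fly"],
              ["can", "you", "sleep"],
              ["you", "can", "dance"]]
    print(tf_idf("fly", corpus[0], corpus))   # ~0.14: "fly" is distinctive to this document
    print(tf_idf("can", corpus[0], corpus))   # ~-0.10: "can" appears in every document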

TF-IDF Weighting



Many versions of TF-IDF exist (with different TF and IDF weighting schemes)

Count Vectors Vs. TF-IDF Vectors

Count Vectors

Document          can  you  fly  sleep
‘can you fly’       1    1    1      0
‘can you sleep’     1    1    0      1

TF-IDF Vectors

Document          can  you  fly  sleep
‘can you fly’     0.5  0.5  0.7      0
‘can you sleep’   0.5  0.5    0    0.7
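The TF-IDF values above are consistent with scikit-learn’s default TfidfVectorizer (smoothed IDF plus L2 normalization), a sketch of which is:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["can you fly", "can you sleep"]
    vectorizer = TfidfVectorizer()        # smooth_idf=True, norm="l2" by default
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())   # ['can' 'fly' 'sleep' 'you']
    print(X.toarray().round(2))                 # rows match the table above (~0.5 and ~0.7)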