Text-as-Data/NLP II

HSS 611: Programming for HSS

Taegyoon Kim

Nov 18, 2025

Agenda

Things to be covered

  • Overview of text classification
  • Evolution of text classification
  • Manual labeling
  • Evaluating performance
  • Generative LLMs for text classification

Overview of Text Classification

What is text classification?

  • Goal
    • To classify documents into pre-defined categories
    • A wide range of use cases: sentiment of comments (e.g., positive or negative), stance on issues (e.g., for vs. against vs. neutral), topic (e.g., related to immigration or not), events (e.g., related to an election).
  • The possible classification targets are virtually limitless
  • Compared to topic models or other unsupervised approaches, text classification allows for pinpointed measurement that specifically targets predefined concepts

Overview of Text Classification

What is text classification?

  • We need
    • Labeled data set (for training and/or testing)
    • Model/approach that maps texts to labels (dictionary, traditional supervised learning, fine-tuning pre-trained models, generative models)
    • Evaluation approaches: performance metrics, cross-validation, etc.

Evolution of Text Classification

Evolution of text classification

  • Dictionary methods
    • Based on counting/weighting of relevant keywords
    • Readily available and fast \(\leftrightarrow\) sub-optimal performance
    • However, for a quick, preliminary analysis of concepts covered by an existing dictionary, they can be helpful (see the sketch below)
    • E.g., LIWC, VADER, Moral Foundations Dictionaries, etc.
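A minimal sketch of the dictionary idea in Python, using tiny hypothetical keyword lists rather than a published dictionary such as LIWC or VADER:

```python
# Minimal dictionary classifier (hypothetical keyword lists, not a
# published dictionary such as LIWC or VADER)
POSITIVE = {"good", "great", "excellent", "love", "loved"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def dictionary_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(dictionary_sentiment("I loved this great phone"))       # positive
print(dictionary_sentiment("terrible battery and awful UI"))  # negative
```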

Evolution of Text Classification

Evolution of text classification (cont’d)

  • Traditional ML algorithms
    • This approach provides the groundwork for many foundational concepts in text classification
    • Classifiers (models) are trained to learn the relationships between texts and labels (i.e., classes)
    • So this requires training (labeled) data (pairs of a text and a label)
    • (On average) the more training data, the higher the performance
    • E.g., logistic regression, random forest, SVM, deep neural networks
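A minimal sketch of this workflow with scikit-learn; the toy texts, labels, and the new document to classify are all illustrative:

```python
# TF-IDF features + logistic regression with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; in a real project these come from manual labeling
texts = ["great movie, loved it", "awful, boring plot",
         "excellent acting and story", "terrible and dull"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)                           # learn the text-label mapping
print(clf.predict(["a boring, terrible film"]))  # e.g., ['neg']
```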

Evolution of Text Classification

Evolution of text classification (cont’d)

  • Fine-tuning representation models (e.g., BERT family)
    • These models are pre-trained with massive amounts of text data
    • Given a text, representation models encode it into a vector (an array of numbers) that captures its meaning
    • We can fine-tune such a model for classification
    • They tend to achieve higher performance than the traditional ML approaches
    • They may still require (potentially large) labeled training data for high performance
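A sketch of the fine-tuning workflow with Hugging Face transformers and datasets; the model name, toy data, and hyperparameters are assumptions for illustration, not the course tutorial's settings:

```python
# Sketch: fine-tuning a BERT-family model for binary classification
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["great movie", "awful plot", "loved it", "boring and bad"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=64),
            batched=True)

args = TrainingArguments(output_dir="bert-clf", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=ds).train()
# Evaluate on a held-out labeled set afterwards, e.g., with Trainer.predict()
```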

Evolution of Text Classification

Evolution of text classification (cont’d)

  • Prompting generative models (e.g., GPT family)
    • Like representation models, these models are pre-trained on vast amounts of text data, often even more extensively
    • Generative models are designed to generate text outputs given input text prompts
    • We can prompt such a model with unlabeled texts to generate labels
    • Still requires labeled data: not for training but for testing performance
    • (It is also possible to “fine-tune” these models)
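A sketch of zero-shot labeling with the OpenAI Python client; the model name and prompt wording are illustrative, and an API key (OPENAI_API_KEY) is assumed:

```python
# Sketch: zero-shot stance labeling by prompting a generative model
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_stance(text: str) -> str:
    prompt = (
        "Classify the stance of the following text toward abortion rights "
        "as 'for', 'against', or 'neutral'. Answer with one word only.\n\n"
        f"Text: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(label_stance("Everyone should have the right to choose."))
```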

Manual Labeling

How do we obtain a labeled data set?

  • Expert labeling
    • In many projects, a few domain experts carry out the labeling (after training)
    • Annotators are trained to learn the concept and related guidelines
    • E.g., a researcher + two RAs from the department
  • Crowd-sourced labeling
    • “Wisdom of the crowd”: aggregated judgments of (online) non-experts converge to the judgments of experts at a much lower cost (Benoit et al., 2016)
    • Difficult to educate annotators on sophisticated tasks
    • Inductive measurement based on loose conceptualization

Manual Labeling

Expert labeling vs. Crowd Sourcing

  • Deductive vs. inductive
  • Degree of training
  • Scalability (cost)

Manual Labeling

Selected texts for manual labeling

  • Should reflect the entire corpus
  • Mismatch between the labeled sample and the corpus leads to low performance: shift/drift
  • E.g., drift in anti-vaccine discourse throughout 2020

Manual Labeling

Iterative process

  • Definition/operationalization often does not take place all at once but through an iterative process
  • In many cases, it is difficult to specify complete annotation guidelines ex ante
  • Preliminary labeling rules are written and applied to an initial set of docs \(\to\) annotators identify ambiguities in the rules \(\to\) the rules are revised \(\to\) the cycle repeats

Manual Labeling

Dealing with subjectivity

  • Many concepts in humanities and social sciences are not straightforward
  • They can involve high levels of subjectivity
  • This is why, from the outset, 1) careful conceptualization, 2) writing a clear labeling rule, and 3) training coders are extremely important
  • Inter-annotator agreement metrics: Krippendorff’s α, Cohen’s κ (alternatives include Pearson’s r, Spearman’s ρ) (recommended R package: irr; see the example below)
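A quick Python check of agreement for the two-annotator case (the slide recommends the R package irr; scikit-learn's cohen_kappa_score is used here as an alternative, with made-up labels):

```python
# Cohen's kappa for two annotators on toy labels
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["pos", "neg", "pos", "pos", "neg", "neutral"]
annotator_2 = ["pos", "neg", "neg", "pos", "neg", "neutral"]

# 1 = perfect agreement, 0 = agreement expected by chance
print(cohen_kappa_score(annotator_1, annotator_2))
```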

Manual Labeling

Who are the annotators?

Evaluating Performance

Performance metrics

  • Accuracy: the proportion of all predictions (both positive and negative) that the model got right
  • Precision: the proportion of positive predictions that were actually correct
  • Recall: the proportion of actual positives that were correctly predicted
  • F-1: the harmonic (as opposed to arithmetic) mean of precision and recall

Evaluating Performance

Confusion matrix: predictions against true labels
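A toy example of building the matrix with scikit-learn (made-up predictions):

```python
# Confusion matrix from toy predictions
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = positive class
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[1, 0], rows are true classes, columns are predictions:
# [[TP, FN],
#  [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
```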

Evaluating Performance

Accuracy: \(\frac{TP+TN}{TP+TN+FP+FN}\)

Evaluating Performance

Precision: \(\frac{TP}{TP+FP}\)

Evaluating Performance

Recall: \(\frac{TP}{TP+FN}\)

Evaluating Performance

F-1: \(\frac{2 \times precision \times recall}{precision + recall}\)

  • Why not arithmetic mean (\((precision + recall) / 2\))?
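A quick illustration with made-up numbers: the harmonic mean is pulled toward the weaker of the two scores, so a classifier cannot hide poor recall behind high precision:

```python
# Harmonic vs. arithmetic mean of precision and recall (illustrative values)
precision, recall = 0.9, 0.1

arithmetic = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(round(arithmetic, 2))  # 0.5  -- looks acceptable
print(round(f1, 2))          # 0.18 -- exposes the poor recall
```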

Evaluating Performance


Precision/recall/F-1 & accuracy


  • 100 positives
  • 80 predicted positives
  • 60 true positives

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Precision: \(\frac{60}{60+20} = 0.75\)

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Recall: \(\frac{60}{60+40} = 0.6\)

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Precision: \(\frac{60}{60+20} = 0.75\)
  • Recall: \(\frac{60}{60+40} = 0.6\)
  • Accuracy: \(\frac{60+50}{60+20+40+50} = 0.65\)

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Precision: \(\frac{60}{60+20} = 0.75\)
  • Recall: \(\frac{60}{60+40} = 0.6\)
  • Accuracy: \(\frac{60+150}{60+20+40+150} = 0.78\)

Evaluating Performance


An extremely imbalanced case


  • Accuracy: ??
  • Precision: ??
  • Recall: ??
  • F-1: ??

Evaluating Performance


An extremely imbalanced case


  • Accuracy: 0.991
  • Precision: 0.66
  • Recall: 0.2
  • F-1: 0.31
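One hypothetical confusion matrix that roughly reproduces these figures (a reconstruction for illustration, not the slide's actual counts): because negatives dominate the corpus, accuracy looks excellent even though most positives are missed:

```python
# Hypothetical counts: 2 TP, 1 FP, 8 FN, 989 TN out of 1,000 documents
tp, fp, fn, tn = 2, 1, 8, 989

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.991 0.67 0.2 0.31
```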

Generative LLMs for Text Classification

Promises

  • No need for labeled training data (zero-shot/few-shot prompting)

  • High performance in various tasks (classification, sentiment, ideology)

  • Little programming or machine learning expertise required

  • Multi-lingual and adaptable across domains

Generative LLMs for Text Classification

Pitfalls

  • Transparency and replicability issues with closed models (e.g., GPT-4)

  • Computational power (open-source models) or API usage fees (closed-source, proprietary models)

  • Lack of a validation framework: traditional ML tools (e.g., train-test splits, CV) do not directly apply

  • Small prompt changes can produce drastically different results

Generative LLMs for Text Classification

Recommended pipeline (Törnberg et al. 2024)

Tutorial Materials

  • Stance detection on abortion with BERT fine-tuning: here

  • Introduction to text classification with generative models: link