A wide range of use cases: sentiment of comments (e.g., positive or negative), stance on issues (e.g., for vs. against vs. neutral), topic (e.g., related to immigration or not), events (e.g., related to an election).
The possible classification targets are virtually limitless
Compared to topic models or other unsupervised approaches, text classification allows for pinpointed measurement that specifically targets predefined concepts
Overview of Text Classification
What is text classification?
We need
Labeled data set (for training and/or testing)
Model/approach that maps texts to labels (dictionary, traditional supervised learning, fine-tuning pre-trained models, generative models)
Evaluation approaches: performance metrics, cross-validation, etc.
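For a concrete (if toy-sized) illustration of these three ingredients, here is a minimal sketch assuming Python with scikit-learn; the data, features, and model choices are illustrative assumptions, not prescriptions:

```python
# Minimal sketch: labeled data + a model mapping texts to labels + cross-validated evaluation.
# Assumes scikit-learn is installed; the toy data and model choice are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 1) Labeled data set (hypothetical): texts with binary sentiment labels
texts = ["great product", "terrible service", "loved it",
         "awful experience", "really helpful", "would not recommend"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# 2) Model/approach: TF-IDF features + logistic regression (traditional supervised learning)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 3) Evaluation: k-fold cross-validation with a performance metric (accuracy here; F1 is also common)
scores = cross_val_score(clf, texts, labels, cv=3, scoring="accuracy")
print(scores.mean())
```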
Evolution of Text Classification
Evolution of text classification
Dictionary methods
Based on counting/weighting of relevant keywords
Readily available and fast \(\leftrightarrow\) sub-optimal performance
However, if you want a quick, preliminary analysis of concepts covered by an existing dictionary, it can be helpful
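A minimal sketch of the keyword-counting idea above, with hypothetical word lists standing in for a real, validated dictionary:

```python
# Dictionary-based sentiment classification: count keyword hits per category and compare.
# The keyword sets below are hypothetical; a real analysis would use an existing, validated dictionary.
import re

POSITIVE = {"good", "great", "excellent", "love", "helpful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "useless"}

def dictionary_label(text: str) -> str:
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(dictionary_label("I love it, the support was helpful"))  # positive
print(dictionary_label("What a terrible, useless product"))    # negative
```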
Fine-tuning pre-trained representation models
These models are pre-trained on massive amounts of text data
Given a text, representation models encode it into a vector (an array of numbers) that captures its meaning
We can fine-tune such a model for classification
They tend to achieve higher performance than traditional ML approaches
Fine-tuning might still require (potentially large) labeled training data for high performance
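A minimal fine-tuning sketch, assuming the Hugging Face transformers and datasets libraries; the checkpoint name, hyperparameters, and toy data are illustrative assumptions rather than choices from the slides:

```python
# Fine-tuning a pre-trained representation model for classification (sketch).
# Assumes `transformers` and `datasets` are installed; checkpoint and toy data are hypothetical.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

checkpoint = "distilbert-base-uncased"  # assumption: a generic English checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Encode texts into the fixed-length inputs the representation model expects
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True,
                                          padding="max_length", max_length=32))

# Fine-tune: update the pre-trained weights plus a small classification head
args = TrainingArguments(output_dir="clf_out", num_train_epochs=1,
                         per_device_train_batch_size=4)
trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()

# Predict labels (here reusing the toy set for brevity; use a held-out test set in practice)
preds = trainer.predict(dataset).predictions.argmax(axis=-1)
print(preds)
```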
Evolution of Text Classification
Evolution of text classification (cont’d)
Prompting generative models (e.g., GPT family)
Like representation models, these models are pre-trained on vast amounts of text data, often even more extensively
Generative models are designed to generate text outputs given input text prompts
We can prompt such a model with unlabeled texts to generate labels
Still requires labeled data: not for training but for testing performance
(It is also possible to “fine-tune” these models)
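A sketch of prompting for labels, assuming the openai Python SDK (v1-style client) and an API key in the environment; the model name, label set, and prompt wording are illustrative assumptions:

```python
# Prompting a generative model to label an unlabeled text (sketch).
# Assumes the `openai` package (v1 client) and OPENAI_API_KEY set in the environment;
# the model name and prompt are illustrative, not prescribed by the slides.
from openai import OpenAI

client = OpenAI()

def prompt_label(text: str) -> str:
    prompt = (
        "Classify the sentiment of the following comment as exactly one word: "
        "positive, negative, or neutral.\n\n"
        f"Comment: {text}\n"
        "Label:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # reduce randomness for labeling
    )
    return resp.choices[0].message.content.strip().lower()

predicted = prompt_label("The new policy is a disaster for small businesses.")
# Compare predictions like this against human labels on a held-out test set to estimate performance.
print(predicted)
```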
Manual Labeling
How do we obtain a labeled data set?
Expert labeling
In many projects, a few domain experts carry out the labeling (after training)
Annotators are trained to learn the concept and related guidelines
E.g., a researcher + two RAs from the department
Crowd-sourced labeling
“Wisdom of crowds”: aggregated judgments of (online) non-experts converge to those of experts at much lower cost (Benoit et al., 2016); see the aggregation sketch below
Difficult to educate annotators on sophisticated tasks
Inductive measurement based on loose conceptualization
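A toy sketch of that aggregation step, combining several crowd workers’ judgments per document by majority vote (labels are hypothetical):

```python
# Aggregate crowd labels by majority vote (toy, hypothetical data).
from collections import Counter

# Each inner list holds the labels several crowd workers gave to one document
crowd_labels = [
    ["for", "for", "against"],
    ["against", "against", "against"],
    ["neutral", "for", "for"],
]

aggregated = [Counter(votes).most_common(1)[0][0] for votes in crowd_labels]
print(aggregated)  # ['for', 'against', 'for']
```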
Manual Labeling
Expert labeling vs. crowd-sourcing
Deductive vs. inductive
Degree of training
Scalability (cost)
Manual Labeling
Selected texts for manual labeling
Should reflect the entire corpus
A mismatch can lead to low performance: shift/drift
E.g., drift in anti-vaccine discourse throughout 2020
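One simple safeguard is to sample the texts for annotation across strata such as time periods, so the labeled set tracks the corpus even as the discourse drifts; a sketch with a hypothetical corpus and annotation budget:

```python
# Stratified sampling of documents for annotation (hypothetical corpus and budget),
# e.g., one stratum per month so the labeled set reflects the whole time range.
import random

corpus = [
    {"id": 1, "month": "2020-01", "text": "..."},
    {"id": 2, "month": "2020-01", "text": "..."},
    {"id": 3, "month": "2020-06", "text": "..."},
    {"id": 4, "month": "2020-12", "text": "..."},
]
per_month = 1  # annotation budget per stratum

random.seed(42)
by_month = {}
for doc in corpus:
    by_month.setdefault(doc["month"], []).append(doc)

sample = []
for docs in by_month.values():
    sample.extend(random.sample(docs, min(per_month, len(docs))))

print([d["id"] for d in sample])  # one document drawn from each month
```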
Manual Labeling
Iterative process
Definition/operationalization often does not happen all at once but through an iterative process
In many cases, it is difficult to specify the complete annotation guidelines ex ante
A preliminary labeling rule is written and applied to an initial set of documents \(\to\) annotators identify ambiguities in the rule \(\to\) the rule is revised \(\to\) …
Manual Labeling
Dealing with subjectivity
Many concepts in humanities and social sciences are not straightforward
They can involve high levels of subjectivity
This is why, from the very beginning, 1) careful conceptualization, 2) writing an excellent labeling rule, and 3) training coders are extremely important
Evaluation metrics: Krippendorff’s α, Cohen’s κ (alternatives include Pearson’s r, Spearman’s ρ) (recommended R package: irr)
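The slide’s recommended route is the R package irr (e.g., its kappa2 and kripp.alpha functions); to stay consistent with the other Python sketches here, a quick agreement check between two coders could instead use scikit-learn’s cohen_kappa_score:

```python
# Inter-coder agreement for two annotators (sketch with hypothetical labels).
# For Krippendorff's alpha or more than two coders, the R package irr (per the slide) is one option.
from sklearn.metrics import cohen_kappa_score

coder_a = ["pos", "neg", "neg", "pos", "neutral", "pos"]
coder_b = ["pos", "neg", "pos", "pos", "neutral", "neg"]

# Cohen's kappa: 1 = perfect agreement, 0 = agreement expected by chance
print(cohen_kappa_score(coder_a, coder_b))
```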