Text-as-Data/NLP II

HSS 611: Programming for HSS

Taegyoon Kim

Nov 18, 2025

Agenda

Things to be covered

  • Overview of text classification
  • Evolution of text classification
  • Manual labeling
  • Evaluating performance
  • Generative LLMs for text classification

Overview of Text Classification

What is text classification?

  • Goal
    • To classify documents into pre-defined categories
    • A wide range of use cases: sentiment of comments (e.g., positive or negative), stance on issues (e.g., for vs. against vs. neutral), topic (e.g., related to immigration or not), events (e.g., related to an election).
  • The possible classification targets are virtually limitless
  • Compared to topic models or other unsupervised approaches, text classification allows for pinpointed measurement that specifically targets predefined concepts

Overview of Text Classification

What is text classification?

  • We need
    • Labeled data set (for training and/or testing)
    • Model/approach that maps texts to labels (dictionary, traditional supervised learning, fine-tuning pre-trained models, generative models)
    • Evaluation approaches: performance metrics, cross-validation, etc.

Evolution of Text Classification

Evolution of text classification

  • Dictionary methods
    • Based on counting/weighting of relevant keywords
    • Readily available and fast \(\leftrightarrow\) sub-optimal performance
    • However, for a quick, preliminary analysis of concepts covered by an existing dictionary, they can be helpful (see the sketch below)
    • E.g., LIWC, VADER, Moral Foundations Dictionaries, etc.
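A minimal sketch of the dictionary idea in Python, using tiny hypothetical keyword lists rather than a published dictionary such as LIWC or VADER:

```python
# Minimal dictionary classifier (hypothetical keyword lists, not a
# published dictionary such as LIWC or VADER)
POSITIVE = {"good", "great", "excellent", "love", "loved"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def dictionary_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(dictionary_sentiment("I loved this great phone"))       # positive
print(dictionary_sentiment("terrible battery and awful UI"))  # negative
```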

Evolution of Text Classification

Evolution of text classification (cont’d)

  • Traditional ML algorithms
    • This approach provides the groundwork for many foundational concepts in text classification
    • Classifiers (models) are trained to learn the relationships between texts and labels (i.e., classes)
    • So this requires training (labeled) data (pairs of a text and a label)
    • (On average) the more training data, the higher the performance
    • E.g., logistic regression, random forest, SVM, deep neural networks
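A minimal sketch of this workflow with scikit-learn; the toy texts, labels, and the new document to classify are all illustrative:

```python
# TF-IDF features + logistic regression with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; in a real project these come from manual labeling
texts = ["great movie, loved it", "awful, boring plot",
         "excellent acting and story", "terrible and dull"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)                           # learn the text-label mapping
print(clf.predict(["a boring, terrible film"]))  # e.g., ['neg']
```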

Evolution of Text Classification

Evolution of text classification (cont’d)

  • Fine-tuning representation models (e.g., BERT family)
    • These models are pre-trained with massive amounts of text data
    • Given a text, representation models encode it into a vector (an array of numbers) that captures its meaning
    • We can fine-tune such a model for classification
    • They tend to achieve higher performance than the traditional ML approaches
    • They may still require (potentially large) labeled training data for high performance
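A sketch of the fine-tuning workflow with Hugging Face transformers and datasets; the model name, toy data, and hyperparameters are assumptions for illustration, not the course tutorial's settings:

```python
# Sketch: fine-tuning a BERT-family model for binary classification
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["great movie", "awful plot", "loved it", "boring and bad"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length", max_length=64),
            batched=True)

args = TrainingArguments(output_dir="bert-clf", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=ds).train()
# Evaluate on a held-out labeled set afterwards, e.g., with Trainer.predict()
```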

Evolution of Text Classification

Evolution of text classification (cont’d)

  • Prompting generative models (e.g., GPT family)
    • Like representation models, these models are pre-trained on vast amounts of text data, often even more extensively
    • Generative models are designed to generate text outputs given input text prompts
    • We can prompt such a model with unlabeled texts to generate labels
    • Still requires labeled data: not for training but for testing performance
    • (It is also possible to “fine-tune” these models)
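A sketch of zero-shot labeling with the OpenAI Python client; the model name and prompt wording are illustrative, and an API key (OPENAI_API_KEY) is assumed:

```python
# Sketch: zero-shot stance labeling by prompting a generative model
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_stance(text: str) -> str:
    prompt = (
        "Classify the stance of the following text toward abortion rights "
        "as 'for', 'against', or 'neutral'. Answer with one word only.\n\n"
        f"Text: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(label_stance("Everyone should have the right to choose."))
```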

Manual Labeling

How do we obtain a labeled data set?

  • Expert labeling
    • In many projects, a few domain experts carry out the labeling (after training)
    • Annotators are trained to learn the concept and related guidelines
    • E.g., a researcher + two RAs from the department
  • Crowd-sourced labeling
    • “Wisdom of the crowd”: aggregated judgments of (online) non-experts converge to the judgments of experts at a much lower cost (Benoit et al., 2016)
    • Difficult to educate annotators on sophisticated tasks
    • Inductive measurement based on loose conceptualization

Manual Labeling

Expert labeling vs. Crowd Sourcing

  • Deductive vs. inductive
  • Degree of training
  • Scalability (cost)

Manual Labeling

Selected texts for manual labeling

  • Should reflect the entire corpus
  • Mismatch between the labeled sample and the corpus leads to low performance: shift/drift
  • E.g., drift in anti-vaccine discourse throughout 2020

Manual Labeling

Iterative process

  • Definition/operationalization often does not take place all at once but through an iterative process
  • In many cases, it is difficult to specify complete annotation guidelines ex ante
  • Preliminary labeling rules are written and applied to an initial set of docs \(\to\) annotators identify ambiguities in the rules \(\to\) the rules are revised \(\to\) the cycle repeats

Manual Labeling

Dealing with subjectivity

  • Many concepts in humanities and social sciences are not straightforward
  • They can involve high levels of subjectivity
  • This is why, from the outset, 1) careful conceptualization, 2) writing a clear labeling rule, and 3) training coders are extremely important
  • Inter-annotator agreement metrics: Krippendorff’s α, Cohen’s κ (alternatives include Pearson’s r, Spearman’s ρ) (recommended R package: irr; see the example below)
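A quick Python check of agreement for the two-annotator case (the slide recommends the R package irr; scikit-learn's cohen_kappa_score is used here as an alternative, with made-up labels):

```python
# Cohen's kappa for two annotators on toy labels
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["pos", "neg", "pos", "pos", "neg", "neutral"]
annotator_2 = ["pos", "neg", "neg", "pos", "neg", "neutral"]

# 1 = perfect agreement, 0 = agreement expected by chance
print(cohen_kappa_score(annotator_1, annotator_2))
```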

Manual Labeling

Who are the annotators?

Evaluating Performance

Performance metrics

  • Accuracy: the proportion of all predictions (both positive and negative) that the model got right
  • Precision: the proportion of positive predictions that were actually correct
  • Recall: the proportion of actual positives that were correctly predicted
  • F-1: the harmonic (as opposed to arithmetic) mean of precision and recall

Evaluating Performance

Confusion matrix: predictions against true labels
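A toy example of building the matrix with scikit-learn (made-up predictions):

```python
# Confusion matrix from toy predictions
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = positive class
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[1, 0], rows are true classes, columns are predictions:
# [[TP, FN],
#  [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
```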

Evaluating Performance

Accuracy: \(\frac{TP+TN}{TP+TN+FP+FN}\)

Evaluating Performance

Precision: \(\frac{TP}{TP+FP}\)

Evaluating Performance

Recall: \(\frac{TP}{TP+FN}\)

Evaluating Performance

F-1: \(\frac{2 \times precision \times recall}{precision + recall}\)

  • Why not arithmetic mean (\((precision + recall) / 2\))?
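A quick illustration with made-up numbers: the harmonic mean is pulled toward the weaker of the two scores, so a classifier cannot hide poor recall behind high precision:

```python
# Harmonic vs. arithmetic mean of precision and recall (illustrative values)
precision, recall = 0.9, 0.1

arithmetic = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

print(round(arithmetic, 2))  # 0.5  -- looks acceptable
print(round(f1, 2))          # 0.18 -- exposes the poor recall
```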

Evaluating Performance


Precision/recall/F-1 & accuracy


  • 100 positives
  • 80 predicted positives
  • 60 true positives

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Precision: \(\frac{60}{60+20} = 0.75\)

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Recall: \(\frac{60}{60+40} = 0.6\)

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Precision: \(\frac{60}{60+20} = 0.75\)
  • Recall: \(\frac{60}{60+40} = 0.6\)
  • Accuracy: \(\frac{60+50}{60+20+40+50} = 0.65\)

Evaluating Performance


Precision/recall/F-1 & accuracy


  • Precision: \(\frac{60}{60+20} = 0.75\)
  • Recall: \(\frac{60}{60+40} = 0.6\)
  • Accuracy: \(\frac{60+150}{60+20+40+150} = 0.78\)

Evaluating Performance


An extremely imbalanced case


  • Accuracy: ??
  • Precision: ??
  • Recall: ??
  • F-1: ??

Evaluating Performance


An extremely imbalanced case


  • Accuracy: 0.991
  • Precision: 0.66
  • Recall: 0.2
  • F-1: 0.31
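One hypothetical confusion matrix that roughly reproduces these figures (a reconstruction for illustration, not the slide's actual counts): because negatives dominate the corpus, accuracy looks excellent even though most positives are missed:

```python
# Hypothetical counts: 2 TP, 1 FP, 8 FN, 989 TN out of 1,000 documents
tp, fp, fn, tn = 2, 1, 8, 989

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.991 0.67 0.2 0.31
```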

Generative LLMs for Text Classification

Promises

  • No need for labeled training data (zero-shot/few-shot prompting)

  • High performance in various tasks (classification, sentiment, ideology)

  • Little programming or machine learning expertise required

  • Multi-lingual and adaptable across domains

Generative LLMs for Text Classification

Pitfalls

  • Transparency and replicability issues with closed models (e.g., GPT-4)

  • Computational power (open-source models) or API usage fees (closed-source, proprietary models)

  • Lack of a validation framework: traditional ML tools (e.g., train-test splits, CV) do not directly apply

  • Small prompt changes can produce drastically different results

Generative LLMs for Text Classification

Recommended pipeline (Törnberg et al. 2024)

Tutorial Materials

  • Stance detection on abortion with BERT fine-tuning: here

  • Introduction to text classification with generative models: link