Introduction to Machine Learning in the Social Sciences

Regularization, Classification & Interpretable ML

Adrian Stanciu & Erik Paessler

Faculty of Humanities, Education and Social Sciences (FHSE), University of Luxembourg

Classification Tasks

Assigning instances to discrete categories.

What is a Classification Task?

A classification task consists of assigning class labels to instances. This takes two forms:

Hard classification: directly assign an instance to a class.
Soft classification: derive class probabilities \(\pi_i\), then assign instances using a threshold.
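
The step from soft to hard classification can be sketched in a few lines of base R. The probabilities below are made up for illustration, and the helper name `classify` and the label values "yes"/"no" are assumptions, not part of the course material.

```r
# Hypothetical predicted class probabilities pi_i for five instances
pi_hat <- c(0.10, 0.45, 0.50, 0.72, 0.95)

# Soft-to-hard conversion: label an instance "yes" when pi_i >= threshold
classify <- function(p, threshold = 0.5) {
  factor(ifelse(p >= threshold, "yes", "no"), levels = c("no", "yes"))
}

labels_default <- classify(pi_hat)        # conventional threshold of 0.5
labels_strict  <- classify(pi_hat, 0.8)   # stricter threshold: fewer "yes" labels

print(data.frame(pi_hat, labels_default, labels_strict))
```

Raising the threshold trades false positives for false negatives, so the right value depends on the relative cost of the two errors in the application at hand.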

Classes are extremely common in the social sciences. Examples include:

Turnout and vote choice
Democratic breakdown vs. stability
War vs. peace
Employment status; marital status
Clinical diagnosis (disorder vs. no disorder)

Two Opposing Approaches

Abandon models entirely

\(k\)-Nearest Neighbors (kNN)

No model is imposed. A new instance is classified by majority vote among the labels of its \(k\) most similar training instances.

Non-parametric
No assumptions about data distribution
Lazy learner: stores training data, computes at prediction time
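
To make the "lazy learner" idea concrete, here is a minimal base-R sketch of kNN written by hand. The use of the built-in `iris` data, Euclidean distance, \(k = 5\), and the helper name `knn_predict` are all illustrative assumptions; in practice the `kknn` engine would do this work.

```r
set.seed(42)
idx  <- sample(nrow(iris), 100)            # 100 training rows, 50 test rows
X_tr <- scale(iris[idx, 1:4])              # standardize features on the training set
y_tr <- iris$Species[idx]
X_te <- scale(iris[-idx, 1:4],             # reuse training means and SDs for the test set
              center = attr(X_tr, "scaled:center"),
              scale  = attr(X_tr, "scaled:scale"))
y_te <- iris$Species[-idx]

# "Training" is just storing X_tr and y_tr; all computation happens at prediction time
knn_predict <- function(x_new, train_x, train_y, k = 5) {
  d <- sqrt(rowSums(sweep(train_x, 2, x_new)^2))  # Euclidean distance to every training row
  nearest <- order(d)[seq_len(k)]                 # indices of the k closest rows
  votes <- table(train_y[nearest])                # count labels among the neighbors
  names(votes)[which.max(votes)]                  # majority label (ties: first level wins)
}

pred <- vapply(seq_len(nrow(X_te)),
               function(i) knn_predict(X_te[i, ], X_tr, y_tr, k = 5),
               character(1))
accuracy <- mean(pred == as.character(y_te))
print(accuracy)
```

Standardizing the features first matters because distances are otherwise dominated by whichever variable has the largest scale.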

Impose a strong model

Logistic Regression

A parametric model is imposed. The probability of class membership is modeled as a function of the features.

Parametric
Strong assumptions about functional form (log-odds linear in the features)
Estimable and interpretable coefficients
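
As a contrast with kNN, the following sketch fits a logistic regression with base R's `glm()`. The data are simulated: the variable names (`turnout`, `age`, `educ`) and the true coefficients are hypothetical choices made only to have something to estimate.

```r
set.seed(1)
n    <- 500
age  <- runif(n, 18, 80)                   # hypothetical age in years
educ <- rnorm(n, mean = 12, sd = 3)        # hypothetical years of education

# Assumed data-generating process: log-odds are linear in the features
eta     <- -4 + 0.04 * age + 0.15 * educ
turnout <- rbinom(n, size = 1, prob = plogis(eta))
dat     <- data.frame(turnout, age, educ)

fit <- glm(turnout ~ age + educ, family = binomial, data = dat)

print(coef(fit))                           # coefficients on the log-odds scale
print(exp(coef(fit)))                      # the same coefficients as odds ratios

pi_hat <- predict(fit, type = "response")  # soft classification: probabilities
pred   <- as.integer(pi_hat >= 0.5)        # hard classification at threshold 0.5
```

The coefficients are what make the model interpretable: each one states how the log-odds of the outcome change per unit of the feature, holding the others fixed.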

Most classification algorithms lie between these extremes (decision trees, random forests, SVMs, neural networks).

Useful R Packages for Your Project

# Core workflow
library(tidymodels)   # rsample, recipes, parsnip, yardstick, workflows, tune
library(tidyverse)    # data manipulation and visualization

# Model engines
library(glmnet)       # lasso, ridge, elastic nets
library(kknn)         # k-nearest neighbors
library(ranger)       # fast random forests
library(xgboost)      # gradient boosting

# Missing data
library(mice)         # multiple imputation by chained equations

# Interpretability
library(DALEXtra)     # DALEX integration with tidymodels
library(bonsai)       # partykit engine (interpretable trees)

# Text analysis (if applicable)
library(tidytext)     # tidy text mining
library(textrecipes)  # text pre-processing in recipes

Outlook: Beyond This Course

Where the field is going.

Tree-Based Methods and Ensembles

Decision Trees

Recursive binary splits of the feature space. Highly interpretable but prone to over-fitting.

Random Forests

An ensemble of many decorrelated trees. Each tree is trained on a bootstrap sample, and each split considers only a random subset of the features. Predictions are averaged (regression) or decided by majority vote (classification).

Gradient Boosting (XGBoost)

Trees are trained sequentially, each one fitted to the errors of the ensemble built so far. Often among the strongest performers on tabular data.
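
The sequential error-correcting idea can be illustrated with a bare-bones boosting loop in base R. This is a sketch of squared-error boosting with one-split trees (stumps) on simulated data, not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The helper names `fit_stump` and `predict_stump`, the learning rate, and the number of trees are assumptions.

```r
set.seed(7)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)           # simulated nonlinear relationship

# Fit a regression stump: the single split on x that minimizes squared error
fit_stump <- function(x, r) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # candidate cut points (midpoints)
  sse  <- vapply(cuts, function(cut) {
    left  <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

n_trees    <- 100
learn_rate <- 0.1
pred       <- rep(mean(y), n)              # start from the overall mean
mse_path   <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  resid <- y - pred                        # errors of the ensemble so far
  stump <- fit_stump(x, resid)             # next tree is fitted to those errors
  pred  <- pred + learn_rate * predict_stump(stump, x)
  mse_path[m] <- mean((y - pred)^2)        # training error after m trees
}

print(round(mse_path[c(1, 10, 50, 100)], 3))
```

Training error can only go down in this loop, which is exactly why boosting needs tuning (number of trees, learning rate) against held-out data to avoid overfitting.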

BART (Bayesian Additive Regression Trees)

A Bayesian ensemble of trees. Naturally provides posterior intervals — well suited for causal and social-science inference.

Brief Note on Unsupervised ML

This course focused on supervised ML. Key unsupervised methods relevant for social scientists:

\(k\)-Means clustering: partitions instances into \(k\) groups based on feature similarity.
Hierarchical clustering: builds a dendrogram — useful for exploring natural groupings without pre-specifying \(k\).
Topic models (LDA): unsupervised discovery of latent themes in large corpora of text.
PCA / Factor Analysis: dimensionality reduction and latent structure detection.

These methods are valuable for theory generation and exploration when no outcome variable is available.
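
Three of the methods listed above are available in base R and can be tried in a few lines. The sketch below uses the built-in `iris` features with the label removed; the choice of data, of three clusters, and of Ward linkage are illustrative assumptions.

```r
# Unsupervised methods see only the features, so the Species label is dropped
X <- scale(iris[, 1:4])

set.seed(123)
km <- kmeans(X, centers = 3, nstart = 20)    # k-means with k fixed in advance

hc        <- hclust(dist(X), method = "ward.D2")  # hierarchical clustering: full dendrogram
hc_groups <- cutree(hc, k = 3)                    # choose k afterwards by cutting the tree

pca           <- prcomp(X)                        # PCA on the standardized features
var_explained <- pca$sdev^2 / sum(pca$sdev^2)     # share of variance per component

print(table(kmeans = km$cluster, hclust = hc_groups))  # how far the two clusterings agree
print(round(var_explained, 2))
```

Calling `plot(hc)` draws the dendrogram, which is often the most useful output when the number of groups is not known beforehand.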

The core workflow — recipe → train → evaluate → tune → deploy — remains the same across all algorithms (Steenbergen, 2025).
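
One possible rendering of that workflow in tidymodels is sketched below. It assumes the `tidymodels` and `kknn` packages are installed, and the data set (`iris`), the model (kNN), the tuning grid, and the number of folds are all illustrative choices rather than course prescriptions.

```r
library(tidymodels)   # assumes tidymodels and kknn are installed

set.seed(2025)
split <- initial_split(iris, prop = 0.75, strata = Species)
train <- training(split)

# Recipe: pre-processing steps, estimated on the training data only
rec <- recipe(Species ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# Model specification with one hyperparameter left open for tuning
spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Train and evaluate across resamples, one run per candidate value of k
folds <- vfold_cv(train, v = 5, strata = Species)
tuned <- tune_grid(wf,
                   resamples = folds,
                   grid      = tibble(neighbors = c(3, 5, 7, 9)),
                   metrics   = metric_set(accuracy))

# Fix the best k, refit on the full training set, score once on the test set
best  <- select_best(tuned, metric = "accuracy")
final <- finalize_workflow(wf, best) |>
  last_fit(split, metrics = metric_set(accuracy))
print(collect_metrics(final))

# "Deploy": the fitted workflow can now predict for new data
fitted_wf <- extract_workflow(final)
new_preds <- predict(fitted_wf, new_data = head(iris))
```

Swapping in another algorithm only requires changing the `spec` object and the tuning grid; the recipe, resampling, and evaluation code stay as they are.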

Reference List

Aydede, Y. (2023). Machine learning toolbox for social scientists: Applied predictive analytics with R (1st ed.). Chapman & Hall/CRC.

Biecek, P. (2018). DALEX: Explainers for complex predictive models in R. Journal of Machine Learning Research, 19(84), 1–5.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth International Group.

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726

Cimentada, J. (2020). Machine learning for social scientists. https://cimentadaj.github.io/ml_socsci/index.html

De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19. https://doi.org/10.1080/10691898.2011.11889627

Jacobucci, R., Grimm, K. J., & Zhang, Z. (2023). Machine learning for social and behavioral research. The Guilford Press.

Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.

Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319–342. https://doi.org/10.1023/A:1022645801436

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Steenbergen, M. (2025). Introduction to machine learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.

Xu, Q.-S., & Liang, Y.-Z. (2001). Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems, 56(1), 1–11. https://doi.org/10.1016/S0169-7439(00)00122-2