Introduction to Machine Learning in the Social Sciences

Data Sources, Research Resources & ML Fundamentals

Adrian Stanciu & Erik Paessler

Faculty of Humanities, Education and Social Sciences (FHSE), University of Luxembourg

Data Sources & Research Resources

Where does the data come from, and what makes it suitable for machine learning?

Areas of Application

Machine learning has found productive applications across many areas of the social sciences.

Text and language

Classifying open-ended survey responses, detecting political sentiment in social media, automating the coding of qualitative content.

Behavioral data

Predicting outcomes such as voting, health behavior, social mobility, or educational attainment from large observational datasets.

Survey & registry data

Feature selection from multi-hundred-variable surveys; supplementing missing measurements; predicting dropout or non-response.

Experimental data

Estimating heterogeneous treatment effects; finding subgroup patterns that classical ANOVA can miss.

Requirements of Application

Not every research situation calls for machine learning. Some requirements and considerations:

Sample size: ML generally benefits from larger samples. With very small samples (e.g., \(n < 100\)), classical models may outperform ML methods.
Number of features: The comparative advantage of ML grows with the number of predictive features (especially \(P > 10\)).
Research goal: ML excels at prediction; causal inference still requires careful design and theory.
Interpretability needs: Results must be communicable. Black-box models require additional tools.

ML may be overkill for small, tightly controlled experiments with few variables; classical regression may suffice.

Dataset Quality & Data Preparation

The quality of ML results is fundamentally bounded by the quality of the data.

Common pre-processing steps:

Step	What to do	Why
Missing values	Impute or remove	Algorithms require complete cases or explicit handling
Scaling	Normalize or standardize	Distance-based and penalized methods are scale-sensitive
Encoding	Dummy/one-hot encode	Categorical features must be numeric
Outlier treatment	Detect and handle	Outliers can dominate loss minimization
Class imbalance	Oversample / undersample	Imbalanced outcomes skew performance metrics

In R, the recipes package from the tidymodels ecosystem handles most of these steps in a reproducible pipeline.
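As a minimal sketch of such a pipeline, the example below covers three rows of the table (missing values, scaling, encoding) with `recipes`. The `survey` data frame and its column names are hypothetical and exist only for illustration; outlier treatment and class imbalance are left out.

```r
library(recipes)

# Hypothetical toy survey data (not from the slides).
survey <- data.frame(
  income    = c(2100, 3400, NA, 5200, 2900, 4100),
  age       = c(23, 45, 37, NA, 51, 30),
  education = factor(c("low", "high", "mid", "high", "low", "mid")),
  trust     = c(3.2, 6.1, 4.8, 7.0, 3.9, 5.5)
)

rec <- recipe(trust ~ ., data = survey) |>
  step_impute_median(all_numeric_predictors()) |>  # missing values: impute
  step_normalize(all_numeric_predictors()) |>      # scaling: mean 0, sd 1
  step_dummy(all_nominal_predictors())             # encoding: dummy variables

# prep() estimates medians, means, and standard deviations from the data;
# bake() applies the prepared steps.
prepped <- prep(rec, training = survey)
baked   <- bake(prepped, new_data = NULL)
baked
```

In a real analysis the recipe would be prepared on the training set only and then applied to the test set, so that the medians and scaling constants contain no test-set information.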

What Is Machine Learning?

Definition, taxonomy, and its place in the social-science toolkit.

Definition

The design of algorithms that learn from data and are capable of improvement as new experiences emerge. Specifically, the algorithm “is said to learn from experience \(E\) with respect to some task \(T\) and performance measure \(P\), if its performance at task \(T\), as measured by \(P\), improves with experience \(E\)” (Mitchell, 1997).

Three core concepts:

Experience \(E\): data
Task \(T\): e.g., regression, classification, clustering
Performance \(P\): a metric that quantifies how well the task is accomplished

A continuum of ML algorithms

Supervised Machine Learning

Requires labelled outcomes.

Classification tasks (discrete labels)
Regression tasks (continuous scores)

Algorithm learns from training data to predict labels on unseen data.

Unsupervised Machine Learning

No labelled outcomes required.

Cluster analysis
Topic modelling
Dimensionality reduction

Algorithm organizes data into systematic patterns. Human evaluation validates the resulting structure.
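A small sketch of the unsupervised case, using k-means from base R on simulated data. The two attitude scales and the group structure are assumptions made up for this illustration; the algorithm only sees the features, and the cross-table at the end stands in for the human validation step.

```r
set.seed(2024)

# Hypothetical data: two groups of respondents measured on two attitude scales.
group_a <- cbind(scale_1 = rnorm(50, mean = 2, sd = 0.5),
                 scale_2 = rnorm(50, mean = 2, sd = 0.5))
group_b <- cbind(scale_1 = rnorm(50, mean = 6, sd = 0.5),
                 scale_2 = rnorm(50, mean = 6, sd = 0.5))
attitudes  <- rbind(group_a, group_b)
true_group <- rep(c("a", "b"), each = 50)

# k-means receives only the features, never true_group (no labels).
clusters <- kmeans(attitudes, centers = 2, nstart = 10)

# Validation: compare the discovered clusters with what we know.
tab <- table(true_group, cluster = clusters$cluster)
tab
```

With real data there is usually no `true_group` to compare against, which is why substantive judgment is needed to decide whether the clusters are meaningful.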

ML Jargon & Formal Setup

Translating between the familiar and the new.

ML Jargon

The ML community uses slightly different terminology from classical statistics.

Traditional statistics	Machine Learning
Data	Set
Unit / observation	Sample, example, or instance
Variable	Feature
Outcome / dependent variable	Label
Parameter	Weight
Intercept	Bias
Model selection criterion	Performance metric
Cross-validation	Re-sampling

The Generic Learning Problem

An algorithm learns the following model

\[ y_i = \phi \left(\boldsymbol{x}_i, \boldsymbol{\theta} \right) + \varepsilon_i \]

where:

\(i\) is the instance
\(y\) is the label (a class or a numerical score)
\(\phi(\cdot)\) is an unknown function to be learnt from data
\(\boldsymbol{x}\) is the vector of predictive features
\(\boldsymbol{\theta}\) contains the parameters (weights and tuning parameters)
\(\varepsilon\) is irreducible error

The algorithm then generates predictions

\[ \psi_i = f \left( \boldsymbol{x}_i^*, \boldsymbol{q} \right) \]

where \(\boldsymbol{x}^*\) is the subset of relevant features selected during training, and \(\boldsymbol{q}\) contains the optimal parameter values.
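The setup can be made concrete with a simulation. In the sketch below, the "true" function \(\phi\) is assumed to be linear with \(\boldsymbol{\theta} = (2, 0.5)\); in real research it is unknown. The algorithm (here ordinary linear regression via `lm()`) learns \(f\) and \(\boldsymbol{q}\) from the data and then generates predictions \(\psi_i\).

```r
set.seed(1)
n <- 200

x       <- runif(n, min = 0, max = 10)   # one predictive feature
epsilon <- rnorm(n, mean = 0, sd = 1)    # irreducible error
y       <- 2 + 0.5 * x + epsilon         # assumed phi(x, theta), theta = (2, 0.5)

# Learning: estimate f and its parameters q from the data.
fit <- lm(y ~ x)
q   <- coef(fit)
q

# Prediction: psi_i = f(x_i, q)
psi <- predict(fit, newdata = data.frame(x = x))
head(cbind(y = y, psi = psi))
```

The learnt values in `q` should land close to, but not exactly on, the assumed \(\boldsymbol{\theta}\), because the error term \(\varepsilon\) cannot be predicted.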

The Loss Function

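In statistics, loss equals error. Predictions are rarely perfect: for at least one instance, \(\psi_i \neq y_i\). The loss function \(L(\psi_i, y_i)\) quantifies the error for each instance; summed over all training instances, it gives the total loss.

The learning process selects \(f(\cdot)\) and \(\boldsymbol{q}\) such that the total loss over the \(n_1\) training instances is minimized.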

Absolute loss (\(L_1\)-norm)

\[L = \sum_{i=1}^{n_1} |\psi_i - y_i|\]

Less sensitive to outliers.

Quadratic loss (squared \(L_2\)-norm)

\[L = \sum_{i=1}^{n_1} (\psi_i - y_i)^2\]

Foundation of ordinary least squares. More sensitive to large errors.
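A short worked example shows the difference in outlier sensitivity. The five labels and predictions are made-up numbers; the last instance is an outlier that the prediction misses badly.

```r
y   <- c(10, 12, 11, 13, 50)   # observed labels; the last one is an outlier
psi <- c(11, 11, 12, 12, 12)   # predictions

abs_loss  <- sum(abs(psi - y))   # absolute loss
quad_loss <- sum((psi - y)^2)    # quadratic loss

# Share of the total loss that comes from the single outlier
share_abs  <- abs(psi[5] - y[5]) / abs_loss
share_quad <- (psi[5] - y[5])^2 / quad_loss

c(abs_loss = abs_loss, quad_loss = quad_loss)
round(c(share_abs = share_abs, share_quad = share_quad), 3)

# The best constant prediction differs by loss function:
# the median minimizes absolute loss, the mean minimizes quadratic loss.
c(median = median(y), mean = mean(y))
```

Under quadratic loss nearly all of the total comes from the one outlier, so a model minimizing it is pulled strongly toward that instance. The mean of 19.2 versus the median of 12 shows the same pull.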

Performance Metrics

Performance metrics \(P(\psi_i, y_i)\) measure how well the model accounts for the data.

For regression tasks, three metrics are standard (computed on the test set):

\[\text{MAE} = \frac{1}{n_2} \sum_{i=1}^{n_2} |\psi_i - y_i|\]

\[\text{RMSE} = \sqrt{\frac{1}{n_2} \sum_{i=1}^{n_2} (\psi_i - y_i)^2}\]

\[R^2 = r_{\psi,y}^2\]

MAE and RMSE: smaller is better (they measure error)
\(R^2\): larger is better (here the squared correlation between predictions and observed labels, read as the share of variance explained)
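The three formulas translate directly into base R. The four test-set labels and predictions below are invented for illustration; in `tidymodels`, the `yardstick` functions `mae()`, `rmse()`, and `rsq()` compute the same quantities from a data frame.

```r
# Hypothetical test set with n_2 = 4 instances.
y   <- c(3, 5, 7, 9)    # observed labels
psi <- c(2, 6, 7, 11)   # model predictions

mae  <- mean(abs(psi - y))         # mean absolute error
rmse <- sqrt(mean((psi - y)^2))    # root mean squared error
r2   <- cor(psi, y)^2              # squared correlation

round(c(MAE = mae, RMSE = rmse, R2 = r2), 3)
```

RMSE is larger than MAE here because squaring gives the largest error (2 units on the last instance) more weight.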

Our goal: achieve the best possible predictive performance, but not at the cost of interpretability.

Training & Evaluation

The central principle of supervised machine learning.

A Division of Labor

Two distinct processes must be kept strictly separate:

Training

Learning the functional form, parameters, and relevant features from data.

Evaluation

Assessing the performance of your model. This requires data that the model has never seen.

Cardinal rule: Never train and evaluate on the same data.

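Measuring performance on training data produces re-substitution error, i.e. an optimistic estimate of true performance. A flexible model can partly memorize the training data, noise included, so a low training error says little about how it will perform on new instances.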
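The optimism of re-substitution error can be seen in a small simulation. The data-generating process (a straight line plus noise) and the deliberately over-flexible degree-10 polynomial are assumptions chosen to make the effect visible.

```r
set.seed(7)

# Hypothetical data generator: linear signal plus noise.
make_data <- function(n) {
  x <- runif(n, min = 0, max = 10)
  data.frame(x = x, y = 1 + 0.8 * x + rnorm(n, sd = 2))
}
train <- make_data(25)    # data the model learns from
test  <- make_data(500)   # data the model has never seen

rmse <- function(model, data) {
  sqrt(mean((predict(model, newdata = data) - data$y)^2))
}

# A deliberately over-flexible model: degree-10 polynomial on 25 instances.
flexible <- lm(y ~ poly(x, 10), data = train)

rmse_train <- rmse(flexible, train)   # re-substitution error
rmse_test  <- rmse(flexible, test)    # error on unseen data

c(train = rmse_train, test = rmse_test)
```

The training RMSE looks better than the test RMSE because the polynomial has fitted part of the noise in the 25 training instances.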

The Split Sample Approach

The most fundamental approach to separating training from evaluation:

If necessary, randomize the \(n\) instances.
Set aside a fraction \(p\) for training \(\Rightarrow\) training set of size \(n_1 = p \cdot n\).
Set aside the remaining fraction \(1 - p\) for evaluation \(\Rightarrow\) test set of size \(n_2 = n - n_1\).

\[ \underbrace{n_1}_{\text{training}} + \underbrace{n_2}_{\text{test}} = n \]

Note

Stratified sampling preserves the outcome distribution across the split (important for imbalanced labels).
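Randomization is not advisable for time-series data; split chronologically instead, so that the model is never trained on observations that come after the test period.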
Typical split: 60–80% training, 20–40% test.
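The three steps can be sketched in base R as follows. The data frame, the sample size, and \(p = 0.7\) are illustrative assumptions. In `tidymodels`, `rsample::initial_split()` performs the same task and accepts a `strata` argument for stratified sampling.

```r
set.seed(99)

# Hypothetical data set with n = 250 instances.
n      <- 250
survey <- data.frame(id = seq_len(n), score = rnorm(n))
p      <- 0.7   # fraction used for training

# Step 1: randomize the instances.
shuffled <- survey[sample(n), ]

# Step 2: the first n_1 = p * n instances form the training set.
n1    <- round(p * n)
train <- shuffled[seq_len(n1), ]

# Step 3: the remaining n_2 = n - n_1 instances form the test set.
test <- shuffled[(n1 + 1):n, ]
n2   <- nrow(test)

c(n1 = n1, n2 = n2, n = n1 + n2)
```

Because `p * n` is not always a whole number, the sketch rounds \(n_1\); the two sets still add up to \(n\) and share no instances.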

A Summary in Simplest Terms

Keeping it simple.

The Dog Days Are Over

Imagine you want to teach a computer to guess how much a dog costs to buy, based on facts you know about it.

Features:

How big it is
How old it is
How big its kennel is
What breed it is

Train/test split:

We split our collection of dogs into two piles
60% go into the training pile, i.e. the examples we show the computer
40% go into the test pile, i.e. dogs that are hidden from the computer completely and used to test it later
We check that both piles have a similar spread of prices, so neither pile is biased.

Pre-processing:

Most dogs are priced normally, but a few rare breeds are astronomically expensive
A computer fixates on outliers — like a champion pedigree worth fifty thousand euros among hundreds of mixed breeds at a few hundred
The log transformation compresses the scale so the rare pedigree does not dominate everything the computer learns

Or Have They Just Begun?

Training:

We fit a linear regression — the simplest possible model
It draws a straight line through the data: for every extra kilogram of body weight, the price goes up by roughly this much
We inspect the coefficients to see which features matter most
We check \(R^2\) to see how much of the variation in price we can explain
We check RMSE to see roughly how far off our predictions are for the dogs in the hidden test pile (if the price was log-transformed, RMSE is on the log scale until predictions are converted back to euros)

Interpretation:

We draw a partial dependence plot for body size
This asks: if we could magically change just the size of every dog and leave everything else the same, how would predicted price change?
It isolates one feature’s contribution — holding breed, age, and kennel size constant, how does size alone affect price?
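The whole dog example can be sketched in base R. Everything about the data is an assumption for illustration: the simulated dogs, the column names, and the price mechanism are invented, and the partial dependence values are computed by hand rather than with a dedicated package.

```r
set.seed(42)
n <- 300

# Hypothetical dogs; every number below is an assumption.
breed <- sample(c("mixed", "labrador", "pedigree"), n,
                replace = TRUE, prob = c(0.70, 0.25, 0.05))
breed_bonus <- c(mixed = 0, labrador = 0.8, pedigree = 3)
dogs <- data.frame(
  size_kg   = runif(n, 3, 60),
  age_years = runif(n, 0.2, 12),
  kennel_m2 = runif(n, 1, 10),
  breed     = factor(breed)
)
dogs$price <- exp(5.5 + 0.01 * dogs$size_kg - 0.05 * dogs$age_years +
                    unname(breed_bonus[breed]) + rnorm(n, sd = 0.3))

# Train/test split: 60% training pile, 40% hidden test pile.
in_train <- sample(n, size = round(0.6 * n))
train <- dogs[in_train, ]
test  <- dogs[-in_train, ]
summary(log(train$price))   # compare the spread of prices in both piles
summary(log(test$price))

# Pre-processing and training: linear regression on log(price).
fit <- lm(log(price) ~ size_kg + age_years + kennel_m2 + breed, data = train)
coef(fit)

# Evaluation on the hidden test pile.
pred_log <- predict(fit, newdata = test)
r2       <- cor(pred_log, log(test$price))^2
rmse_log <- sqrt(mean((pred_log - log(test$price))^2))     # log scale
rmse_eur <- sqrt(mean((exp(pred_log) - test$price)^2))     # naive back-transform
c(R2 = r2, RMSE_log = rmse_log, RMSE_eur = rmse_eur)

# Partial dependence for body size: give every dog the same size,
# leave everything else unchanged, and average the predictions.
size_grid <- seq(5, 60, by = 5)
pd_size <- sapply(size_grid, function(s) {
  everyone_resized <- train
  everyone_resized$size_kg <- s
  mean(predict(fit, newdata = everyone_resized))
})
plot(size_grid, pd_size, type = "b",
     xlab = "Body size (kg)", ylab = "Average predicted log(price)")
```

For a linear regression the partial dependence curve is simply a straight line whose slope equals the coefficient of `size_kg`; the plot becomes more informative for models that can learn non-linear shapes.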

Note

Our code may become messy when cleaning steps and modelling steps are written separately; bundling them into a workflow keeps them together.

The ML Workflow

How to deploy a model.

Overview

A supervised ML workflow follows a consistent cycle:

Data Preparation → Training → Evaluation → Improvement → Deployment
                        ↑                       ↓
                        └───────────────────────┘
                              (tuning loop)

Each stage has specific objectives and tools.

Stage	Goal	Key tool in `tidymodels`
Data Preparation	Clean, split, engineer features	`rsample`, `recipes`
Training	Fit model to training data	`parsnip`
Evaluation	Measure performance on test data	`yardstick`
Improvement	Tune hyperparameters	`tune`, `dials`
Deployment	Finalize and apply to new data	`workflows`

The `tidymodels` Ecosystem

tidymodels is a collection of R packages built around a consistent grammar for ML.

rsample

Creates split samples and cross-validation folds.

recipes

Pre-processes data: scaling, encoding, imputation, feature engineering.

parsnip

Unified model interface — same syntax for linear regression, random forests, neural networks, etc.

yardstick

Computes performance metrics (MAE, RMSE, \(R^2\), accuracy, AUC, …).

workflows

Bundles recipe and model into one object, ensuring that pre-processing estimated on the training set is correctly applied to the test set.

tune & dials

Define and search over tuning parameter grids.

Workflows are critical because they help prevent leakage, the cardinal error of using test-set information during training.
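The following sketch strings the packages together for a regression task. The simulated `dogs` data and all object names are hypothetical; the tuning stage (`tune`, `dials`) is omitted because a plain linear regression has no tuning parameters.

```r
library(tidymodels)

set.seed(123)
n <- 300

# Hypothetical data; the price mechanism is an assumption for illustration.
dogs <- data.frame(
  size_kg   = runif(n, 3, 60),
  age_years = runif(n, 0.2, 12),
  breed     = factor(sample(c("mixed", "labrador", "pedigree"), n,
                            replace = TRUE))
)
dogs$log_price <- 5.5 + 0.01 * dogs$size_kg - 0.05 * dogs$age_years +
  ifelse(dogs$breed == "pedigree", 3, 0) + rnorm(n, sd = 0.3)

# Data preparation: stratified 60/40 split (rsample) and recipe (recipes).
dog_split <- initial_split(dogs, prop = 0.6, strata = log_price)
train <- training(dog_split)
test  <- testing(dog_split)

dog_recipe <- recipe(log_price ~ ., data = train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# Training: model specification (parsnip) bundled with the recipe (workflows).
dog_model <- linear_reg() |>
  set_engine("lm")

dog_workflow <- workflow() |>
  add_recipe(dog_recipe) |>
  add_model(dog_model)

dog_fit <- fit(dog_workflow, data = train)

# Evaluation: predictions on the test set and metrics (yardstick).
test_predictions <- predict(dog_fit, new_data = test) |>
  bind_cols(test)

regression_metrics <- metric_set(mae, rmse, rsq)
test_results <- regression_metrics(test_predictions,
                                   truth = log_price, estimate = .pred)
test_results
```

Because the recipe lives inside the workflow, `fit()` estimates the scaling constants from `train` only, and `predict()` reuses them for `test`.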

Reference List

Aydede, Y. (2023). Machine learning toolbox for social scientists: Applied predictive analytics with R (1st ed.). Chapman & Hall/CRC.

Biecek, P. (2018). DALEX: Explainers for complex predictive models in R. Journal of Machine Learning Research, 19(84), 1–5.

Breiman, L. (1984). Classification and regression trees. Wadsworth International Group.

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726

Cimentada, J. (2020). Machine learning for social scientists. https://cimentadaj.github.io/ml_socsci/index.html

De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19. https://doi.org/10.1080/10691898.2011.11889627

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24, 395–419. https://doi.org/10.1146/annurev-polisci-053119-015921

Jacobucci, R., Grimm, K. J., & Zhang, Z. (2023). Machine learning for social and behavioral research. The Guilford Press.

Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.

Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319–342. https://doi.org/10.1023/A:1022645801436

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Steenbergen, M. (2025). Introduction to machine learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.

Xu, Q.-S., & Liang, Y.-Z. (2001). Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems, 56(1), 1–11. https://doi.org/10.1016/S0169-7439(00)00122-2
