Data Sources & Research Resources

Where does the data come from, and what makes it suitable for machine learning?

Areas of Application

Machine learning has found productive applications across many areas of the social sciences.

Text and language

Classifying open-ended survey responses, detecting political sentiment in social media, automating the coding of qualitative content.

Behavioral data

Predicting outcomes such as voting, health behavior, social mobility, or educational attainment from large observational datasets.

Survey & registry data

Feature selection from multi-hundred-variable surveys; supplementing missing measurements; predicting dropout or non-response.

Experimental data

Estimating heterogeneous treatment effects; finding subgroup patterns that a classical ANOVA may miss.

Requirements of Application

Not every research situation calls for machine learning. Some requirements and considerations:

  • Sample size: ML generally benefits from larger samples. With very small samples (e.g., \(n < 100\)), classical models may outperform ML methods.
  • Number of features: The comparative advantage of ML grows with the number of predictive features (especially \(P > 10\)).
  • Research goal: ML excels at prediction; causality still requires careful design and theory.
  • Interpretability needs: Results must be communicable. Black-box models require additional tools.

ML may be overkill for small, tightly controlled experiments with few variables; classical regression may suffice.

Dataset Quality & Data Preparation

The quality of ML results is fundamentally bounded by the quality of the data.

Common pre-processing steps:

| Step | What to do | Why |
| --- | --- | --- |
| Missing values | Impute or remove | Algorithms require complete cases or explicit handling |
| Scaling | Normalize or standardize | Distance-based and penalized methods are scale-sensitive |
| Encoding | Dummy/one-hot encode | Categorical features must be numeric |
| Outlier treatment | Detect and handle | Outliers can dominate loss minimization |
| Class imbalance | Oversample / undersample | Imbalanced outcomes skew performance metrics |

In R, the recipes package from the tidymodels ecosystem handles most of these steps in a reproducible pipeline.
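The first three steps in the table can be illustrated without any package. The following is a minimal sketch in plain Python, not the recipes API; the toy rows, the feature names `age` and `region`, and the choice of median imputation are assumptions made for illustration. The point it shows is that every pre-processing quantity (imputation value, mean, standard deviation, category levels) is estimated on the training rows and then reused unchanged on the test rows.

```python
from statistics import mean, median, pstdev

# Hypothetical toy data: age has a missing value (None), region is categorical.
train = [
    {"age": 20.0, "region": "north"},
    {"age": None, "region": "south"},
    {"age": 40.0, "region": "north"},
    {"age": 60.0, "region": "west"},
]
test = [{"age": None, "region": "south"}, {"age": 50.0, "region": "east"}]

# Learn all pre-processing quantities from the training set only.
observed = [row["age"] for row in train if row["age"] is not None]
age_median = median(observed)  # value used to impute missing ages
imputed = [age_median if row["age"] is None else row["age"] for row in train]
age_mean = mean(imputed)
age_sd = pstdev(imputed)
levels = sorted({row["region"] for row in train})  # categories seen in training


def preprocess(row):
    """Impute, standardize, and one-hot encode a single row."""
    age = age_median if row["age"] is None else row["age"]
    z = (age - age_mean) / age_sd
    # A region not seen in training (e.g. "east") becomes an all-zero vector.
    dummies = [1.0 if row["region"] == level else 0.0 for level in levels]
    return [z] + dummies


X_train = [preprocess(row) for row in train]
X_test = [preprocess(row) for row in test]
```

This sketch keeps one indicator column per training category; dropping a reference category, as in classical dummy coding, would be an equally valid choice.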

Best-Case Examples in the Social Sciences

Prediction of political behavior

Identifying swing voters from large-scale survey data using ensemble methods. ML can reveal non-linear feature interactions that a standard regression specification may mask.

Prediction of individual health on the basis of psychological and behavioural variables

Given a set of behavioural variables, such as personality questionnaires, sleep patterns, social media usage, and self-reported stress levels, can we predict the likelihood of negative health effects?

Missing value imputation

Predicting who declines to answer sensitive survey questions from observable characteristics, then imputing plausible values.

Heterogeneous treatment effects

Identifying subgroups for whom a policy or intervention has stronger or weaker effects — moving beyond the average treatment effect.

In each case, the research design has to be theoretically informed. ML identifies patterns on the basis of a researcher's design, but it does not interpret them. Both design and interpretation have to work together.

What Is Machine Learning?

Definition, taxonomy, and its place in the social-science toolkit.

Definition

The design of algorithms that learn from data and are capable of improvement as new experiences emerge. Specifically, the algorithm “is said to learn from experience \(E\) with respect to some task \(T\) and performance measure \(P\), if its performance at task \(T\), as measured by \(P\), improves with experience \(E\)” (Mitchell, 1997).

Three core concepts:

  • Experience \(E\): data
  • Task \(T\): e.g., regression, classification, clustering
  • Performance \(P\): a metric that quantifies how well the task is accomplished

A continuum of ML algorithms

Supervised Machine Learning

Requires labelled outcomes.

  • Classification tasks (discrete labels)
  • Regression tasks (continuous scores)

Algorithm learns from training data to predict labels on unseen data.

Unsupervised Machine Learning

No labelled outcomes required.

  • Cluster analysis
  • Topic modelling
  • Dimensionality reduction

Algorithm organizes data into systematic patterns. Human evaluation validates the resulting structure.

Main uses of ML in applied social sciences

Three major uses have been identified in the literature (Grimmer et al., 2021):

1. Automation

Tedious, time-consuming, error-prone coding tasks. Example: scaling thousands of open-ended survey responses.

2. Prediction

Predicting future or out-of-sample behavior. Example: forecasting electoral turnout at the precinct level.

3. Induction — theory generation

Discovering previously unseen or unconsidered patterns; developing new theoretical hypotheses from the data.

Controversy around inductive uses remains, particularly in relation to the Humean critique of induction. Past experience alone does not guarantee future accuracy.

ML Jargon & Formal Setup

Translating between the familiar and the new.

ML Jargon

The ML community uses slightly different terminology from classical statistics.

| Traditional statistics | Machine Learning |
| --- | --- |
| Data | Set |
| Unit / observation | Sample, example, or instance |
| Variable | Feature |
| Outcome / dependent variable | Label |
| Parameter | Weight |
| Intercept | Bias |
| Model selection criterion | Performance metric |
| Cross-validation | Re-sampling |

The Generic Learning Problem

An algorithm learns the following model

\[ y_i = \phi \left(\boldsymbol{x}_i, \boldsymbol{\theta} \right) + \varepsilon_i \]

where:

  • \(i\) is the instance
  • \(y\) is the label (a class or a numerical score)
  • \(\phi(.)\) is an unknown function to be learnt from data
  • \(\boldsymbol{x}\) is the vector of predictive features
  • \(\boldsymbol{\theta}\) contains the parameters (weights and tuning parameters)
  • \(\varepsilon\) is irreducible error

The algorithm then generates predictions

\[ \psi_i = f \left( \boldsymbol{x}_i^*, \boldsymbol{q} \right) \]

where \(f(.)\) is the learned approximation of \(\phi(.)\), \(\boldsymbol{x}^*\) is the subset of relevant features selected during training, and \(\boldsymbol{q}\) contains the optimal parameter values.
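As a minimal sketch of this setup, assume a single feature and a linear form for \(f(.)\), so that \(\boldsymbol{q}\) holds one bias and one weight chosen by least squares. The data values and the function names `fit_line` and `predict` are hypothetical and serve only to illustrate the two steps: learning \(\boldsymbol{q}\) from training data, then generating a prediction \(\psi\) for a new instance.

```python
# Hypothetical training data: one feature x and a numeric label y.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(x, y):
    """Return (bias, weight) that minimize quadratic loss for f(x) = bias + weight * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    weight = sxy / sxx
    bias = y_bar - weight * x_bar
    return bias, weight


def predict(x, q):
    """Generate predictions psi from features x and learned parameters q."""
    bias, weight = q
    return [bias + weight * xi for xi in x]


q = fit_line(x_train, y_train)   # training: learn the parameters
psi = predict([6.0], q)          # prediction for an unseen instance
```

With these numbers the learned weight is about 1.97 and the bias about 0.09, so the prediction for \(x = 6\) is about 11.91.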

The Loss Function

In statistics, loss equals error. For at least one instance, \(\psi_i \neq y_i\). The loss function \(L(\psi_i, y_i)\) quantifies the total error across all instances.

The learning process selects \(f(.)\) and \(\boldsymbol{q}\) such that loss on the \(n_1\) training instances is minimized.

Absolute loss (\(L_1\)-norm)

\[L = \sum_{i=1}^{n_1} |\psi_i - y_i|\]

Less sensitive to outliers.

Quadratic loss (\(L_2\)-norm)

\[L = \sum_{i=1}^{n_1} (\psi_i - y_i)^2\]

Foundation of ordinary least squares. More sensitive to large errors.
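The difference in outlier sensitivity can be seen in a small numeric sketch. The labels and the two sets of predictions below are made up for illustration: one set spreads the error evenly, the other concentrates it in a single instance.

```python
def absolute_loss(psi, y):
    """Sum of absolute errors (L1)."""
    return sum(abs(p - t) for p, t in zip(psi, y))


def quadratic_loss(psi, y):
    """Sum of squared errors (L2)."""
    return sum((p - t) ** 2 for p, t in zip(psi, y))


y = [10.0, 12.0, 11.0, 13.0]
psi_even = [11.0, 11.0, 12.0, 12.0]      # every prediction is off by 1
psi_outlier = [10.0, 12.0, 11.0, 17.0]   # one prediction is off by 4

l1_even = absolute_loss(psi_even, y)
l1_outlier = absolute_loss(psi_outlier, y)
l2_even = quadratic_loss(psi_even, y)
l2_outlier = quadratic_loss(psi_outlier, y)
```

Both prediction sets have the same absolute loss (4), but the quadratic loss of the set with the single large error is four times higher (16 versus 4).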

Performance Metrics

Performance metrics \(P(\psi_i, y_i)\) measure how well the model accounts for the data.

For regression tasks, three metrics are standard (computed on the test set):

\[\text{MAE} = \frac{1}{n_2} \sum_{i=1}^{n_2} |\psi_i - y_i|\]

\[\text{RMSE} = \sqrt{\frac{1}{n_2} \sum_{i=1}^{n_2} (\psi_i - y_i)^2}\]

\[R^2 = r_{\psi,y}^2\]

  • MAE and RMSE: smaller is better (they measure error)
  • \(R^2\): larger is better (it measures explained variance)
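The three formulas translate directly into code. This is a minimal sketch with made-up test labels and predictions; the function names are illustrative, and \(R^2\) is computed as the squared correlation between predictions and labels, as defined above.

```python
from math import sqrt


def mae(psi, y):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(psi, y)) / len(y)


def rmse(psi, y):
    """Root mean squared error."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(psi, y)) / len(y))


def r_squared(psi, y):
    """Squared Pearson correlation between predictions and labels."""
    n = len(y)
    p_bar = sum(psi) / n
    y_bar = sum(y) / n
    spy = sum((p - p_bar) * (t - y_bar) for p, t in zip(psi, y))
    spp = sum((p - p_bar) ** 2 for p in psi)
    syy = sum((t - y_bar) ** 2 for t in y)
    return spy ** 2 / (spp * syy)


# Hypothetical test-set labels and predictions.
y_test = [3.0, 5.0, 7.0, 9.0]
psi_test = [2.0, 6.0, 6.0, 10.0]

test_mae = mae(psi_test, y_test)
test_rmse = rmse(psi_test, y_test)
test_r2 = r_squared(psi_test, y_test)
```

Here every prediction is off by exactly 1, so MAE and RMSE both equal 1, and \(R^2\) is 0.9. Because a squared correlation is unchanged when all predictions are shifted or rescaled, it is worth reading \(R^2\) alongside MAE or RMSE rather than on its own.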

Our goal: achieve best possible predictive performance, but not at the cost of interpretability.

Training & Evaluation

The central principle of supervised machine learning.

A Division of Labor

Two distinct processes must be kept strictly separate:

Training

Learning the functional form, parameters, and relevant features from data.

Evaluation

Assessing the performance of your model. This requires data that the model has never seen.

Cardinal rule: Never train and evaluate on the same data.

Measuring performance on training data produces the re-substitution error, which is an optimistic estimate of true performance: a flexible model may simply have memorized the training data.
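The optimism of the re-substitution error is easiest to see with a model that memorizes by construction. The sketch below uses a one-nearest-neighbour rule, which is an illustrative choice and not a method prescribed here; all data values are made up.

```python
def predict_1nn(x_new, x_train, y_train):
    """Return the label of the closest training instance (pure memorization)."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    return y_train[nearest]


def mae(psi, y):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(psi, y)) / len(y)


# Hypothetical training and test data with one feature.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [1.5, 1.0, 3.5, 3.0]
x_test = [1.4, 2.6, 3.4]
y_test = [1.0, 3.0, 3.0]

train_pred = [predict_1nn(x, x_train, y_train) for x in x_train]
test_pred = [predict_1nn(x, x_train, y_train) for x in x_test]

resub_error = mae(train_pred, y_train)  # evaluated on the data used for training
test_error = mae(test_pred, y_test)     # evaluated on unseen data
```

Each training instance is its own nearest neighbour, so the re-substitution error is exactly 0, while the error on the unseen instances is 0.5.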

The Split Sample Approach

The most fundamental approach to separating training from evaluation:

  1. If necessary, randomize the \(n\) instances.
  2. Set aside a fraction \(p\) for training \(\Rightarrow\) training set of size \(n_1 = p \cdot n\).
  3. Set aside the remaining fraction \(1 - p\) for evaluation \(\Rightarrow\) test set of size \(n_2 = n - n_1\).

\[ \underbrace{n_1}_{\text{training}} + \underbrace{n_2}_{\text{test}} = n \]

Note

  • Stratified sampling preserves the outcome distribution across the split (important for imbalanced labels).
  • Randomization is not advisable for time-series data.
  • Typical split: 60–80% training, 20–40% test.
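The three steps, and the stratified variant from the note, can be sketched in plain Python. The function names, the default \(p = 0.7\), and the seed are assumptions for illustration; in the tidymodels setting this job is handled by rsample.

```python
import random


def split_sample(instances, p=0.7, seed=42):
    """Shuffle a copy of the instances, then split into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n1 = round(p * len(shuffled))  # training set size n1 = p * n
    return shuffled[:n1], shuffled[n1:]


def stratified_split(instances, labels, p=0.7, seed=42):
    """Split within each label class so both sets keep similar class shares."""
    rng = random.Random(seed)
    by_class = {}
    for inst, lab in zip(instances, labels):
        by_class.setdefault(lab, []).append(inst)
    train, test = [], []
    for lab in sorted(by_class):
        group = by_class[lab]
        rng.shuffle(group)
        n1 = round(p * len(group))
        train += [(inst, lab) for inst in group[:n1]]
        test += [(inst, lab) for inst in group[n1:]]
    return train, test


# Hypothetical data: 100 instances, of which 10% carry a rare label.
data = list(range(100))
labels = ["rare" if i < 10 else "common" for i in data]

train, test = split_sample(data, p=0.7)
strat_train, strat_test = stratified_split(data, labels, p=0.7)
```

With the stratified version, the rare label makes up 10% of both the training set (7 of 70) and the test set (3 of 30); a plain random split only achieves this on average.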

A Summary in Simplest Terms

Keeping it simple.

The Dog Days Are Over

Imagine you want to teach a computer to guess how much a dog costs to buy, based on facts you know about it.

Features:

  • How big it is
  • How old it is
  • How big its kennel is
  • What breed it is

Train/test split:

  • We split our collection of dogs into two piles
  • 60% go into the training pile — i.e. the examples we show the computer
  • 40% go into the test pile, i.e. dogs that are hidden from the computer completely and used to test it later
  • We check that both piles have a similar spread of prices, so neither pile is biased.

Pre-processing:

  • Most dogs are priced normally, but a few rare breeds are astronomically expensive
  • A computer fixates on outliers — like a champion pedigree worth fifty thousand euros among hundreds of mixed breeds at a few hundred
  • The log transformation compresses the scale so the rare pedigree does not dominate everything the computer learns
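How strongly the log transformation compresses the scale can be checked on a handful of numbers. The prices below are made up to mirror the example (a few hundred euros for most dogs, fifty thousand for one pedigree), and the natural logarithm is assumed.

```python
from math import exp, log

# Hypothetical prices in euros: four ordinary dogs and one champion pedigree.
prices = [300.0, 450.0, 600.0, 800.0, 50000.0]
log_prices = [log(p) for p in prices]

# Ratio of the largest to the smallest value, before and after the transformation.
ratio_raw = max(prices) / min(prices)
ratio_log = max(log_prices) / min(log_prices)

# Values on the log scale can be mapped back to euros with exp().
back_transformed = [exp(v) for v in log_prices]
```

On the raw scale the pedigree is more than 160 times the cheapest dog; on the log scale the largest value is less than twice the smallest.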

Or Have They Just Begun?

Training:

  • We fit a linear regression — the simplest possible model
  • It draws a straight line through the data: for every extra kilogram of body weight, the price goes up by roughly this much
  • We inspect the coefficients to see which features matter most
  • We check \(R^2\) to see how much of the variation in price we can explain
  • We check RMSE on the dogs from the hidden test pile to see roughly how far off our predictions typically are (in euros only after transforming predictions back from the log scale)

Interpretation:

  • We draw a partial dependence plot for body size
  • This asks: if we could magically change just the size of every dog and leave everything else the same, how would predicted price change?
  • It isolates one feature’s contribution — holding breed, age, and kennel size constant, how does size alone affect price?
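The "magically change just the size" idea can be written down in a few lines. This is a minimal sketch: `predict_price` stands in for a fitted model, its coefficients are invented, and the three dogs and the feature names `weight_kg`, `age_years`, and `kennel_m2` are hypothetical.

```python
def predict_price(dog):
    """Hypothetical fitted model: a linear rule with made-up coefficients."""
    return (200.0
            + 15.0 * dog["weight_kg"]
            - 10.0 * dog["age_years"]
            + 5.0 * dog["kennel_m2"])


dogs = [
    {"weight_kg": 8.0, "age_years": 2.0, "kennel_m2": 2.0},
    {"weight_kg": 25.0, "age_years": 5.0, "kennel_m2": 6.0},
    {"weight_kg": 12.0, "age_years": 1.0, "kennel_m2": 4.0},
]


def partial_dependence(model, data, feature, grid):
    """Average prediction when `feature` is set to each grid value for every row."""
    curve = []
    for value in grid:
        # Overwrite one feature, leave all other features as observed.
        preds = [model({**row, feature: value}) for row in data]
        curve.append(sum(preds) / len(preds))
    return curve


grid = [5.0, 10.0, 20.0]
pd_curve = partial_dependence(predict_price, dogs, "weight_kg", grid)
```

For a linear model the resulting curve is a straight line whose slope equals the coefficient of the feature (here 15 per kilogram); for more flexible models the same procedure can reveal curved or stepped relationships.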

Note

Our code may be messy: cleaning steps and modelling steps, if written separately, can be tidied up by bundling them into a workflow.

The ML Workflow

How to deploy a model.

Overview

A supervised ML workflow follows a consistent cycle:

Data Preparation → Training → Evaluation → Improvement → Deployment
                        ↑                       ↓
                        └───────────────────────┘
                              (tuning loop)

Each stage has specific objectives and tools.

| Stage | Goal | Key tool in tidymodels |
| --- | --- | --- |
| Data Preparation | Clean, split, engineer features | rsample, recipes |
| Training | Fit model to training data | parsnip |
| Evaluation | Measure performance on test data | yardstick |
| Improvement | Tune hyperparameters | tune, dials |
| Deployment | Finalize and apply to new data | workflows |

The tidymodels Ecosystem

tidymodels is a collection of R packages built around a consistent grammar for ML.

rsample

Creates split samples and cross-validation folds.

recipes

Pre-processes data: scaling, encoding, imputation, feature engineering.

parsnip

Unified model interface — same syntax for linear regression, random forests, neural networks, etc.

yardstick

Computes performance metrics (MAE, RMSE, \(R^2\), accuracy, AUC, …).

workflows

Bundles recipe and model into one object, ensuring that pre-processing from the training set is correctly applied to the test set.

tune & dials

Define and search over tuning parameter grids.

Workflows are critical, as they help prevent the cardinal error of using test-set information during training.
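The idea behind bundling can be sketched independently of any package. The `Workflow` class below is a hypothetical stand-in and not the workflows API: it combines one pre-processing step (standardization) with a one-feature linear model, stores the training mean and standard deviation when fitted, and reuses them unchanged at prediction time.

```python
from statistics import mean, pstdev


class Workflow:
    """Bundle a standardization step and a one-feature linear model."""

    def fit(self, x, y):
        # Pre-processing parameters come from the training data only.
        self.center = mean(x)
        self.scale = pstdev(x)
        z = [(xi - self.center) / self.scale for xi in x]
        # Least squares on a standardized feature (mean 0, variance 1):
        # the bias is the mean label, the weight is the covariance of z and y.
        self.bias = mean(y)
        self.weight = sum(zi * (yi - self.bias) for zi, yi in zip(z, y)) / len(z)
        return self

    def predict(self, x):
        # Reuse the stored training mean and SD; never re-estimate on new data.
        z = [(xi - self.center) / self.scale for xi in x]
        return [self.bias + self.weight * zi for zi in z]


# Hypothetical data in which the label is exactly twice the feature.
wf = Workflow().fit([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
test_pred = wf.predict([6.0, 7.0])
```

Because the test values are scaled with the training mean (3) and standard deviation, the predictions for 6 and 7 come out as 12 and 14. Re-standardizing the test values on their own would silently change the meaning of the feature.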

Reference List

Aydede, Y. (2023). Machine learning toolbox for social scientists: Applied predictive analytics with R (1st ed.). Chapman & Hall/CRC.
Biecek, P. (2018). DALEX: Explainers for complex predictive models in R. Journal of Machine Learning Research, 19(84), 1–5.
Breiman, L. (1984). Classification and regression trees. Wadsworth International Group.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Cimentada, J. (2020). Machine learning for social scientists. https://cimentadaj.github.io/ml_socsci/index.html
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19. https://doi.org/10.1080/10691898.2011.11889627
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2021). Machine learning for social science: An agnostic approach. Annual Review of Political Science, 24, 395–419. https://doi.org/10.1146/annurev-polisci-053119-015921
Jacobucci, R., Grimm, K. J., & Zhang, Z. (2023). Machine learning for social and behavioral research. The Guilford Press.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.
Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3(4), 319–342. https://doi.org/10.1023/A:1022645801436
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Steenbergen, M. (2025). Introduction to machine learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.
Xu, Q.-S., & Liang, Y.-Z. (2001). Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems, 56(1), 1–11. https://doi.org/10.1016/S0169-7439(00)00122-2