Introduction to Machine Learning for the Social Sciences - Code Tutorial, Session 4

Author

Adrian Stanciu & Erik Paessler

Code based on: Steenbergen, M. (2025). Introduction to Machine Learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.

Data:

Ames Housing dataset (De Cock 2011). Assessor’s office data for residential properties sold in Ames, Iowa. Label: “What was the sale price of this residential property?”

Features:

Installations (remove comment notation if needed):

#install.packages("DALEXtra")
#install.packages("ingredients")
#install.packages("modeldata")
#install.packages("tidyverse")
#install.packages("tidymodels")
#install.packages("rio")
#install.packages("doParallel")
#install.packages("glmnet")
#install.packages("rpart.plot")

Load packages

library(rio)
library(modeldata)
library(tidyverse)
library(DALEXtra)
library(tidymodels)
library(ingredients)
library(doParallel)
library(glmnet)
library(rpart.plot)

Load model from last session

data(ames)
ames_clean <- ames |>
  select(Sale_Price, Gr_Liv_Area, Year_Built,
         Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd) |>
  na.omit()
tidymodels_prefer()
set.seed(2922)
ames_split <- initial_split(ames_clean, prop = 0.6, strata = Sale_Price)
ames_split
<Training/Testing/Total>
<1756/1174/2930>
training_set <- training(ames_split)
test_set <- testing(ames_split)

Validation

val_set <- validation_split(training_set, prop = 3/4)
Warning: `validation_split()` was deprecated in rsample 1.2.0.
ℹ Please use `initial_validation_split()` instead.
val_set
# Validation Set Split (0.75/0.25)  
# A tibble: 1 × 2
  splits             id        
  <list>             <chr>     
1 <split [1317/439]> validation

Why validate?

There are disadvantages to the two-split sample approach: We risk training on potentially a small part of the original data, inducing downwards bias in performance, and we get only one value for optimal tuning parameter, which may mean our ability to generalize is limited. If we have a lot of data, these problems tend to be less severe, but in smaller datasets, it might be better using cross-validation and bootstrapping.

What is a fold?

A fold is one of the subsets created when cross-validation divides the training data into equal parts, each taking a turn as the held-out evaluation set while the model is trained on the remainder.

k-fold cross-validation steps:

  • Randomly assign instances to k-folds (usually 10),
  • each fold is held out once for validation purposes,
  • each fold is used k-1 times for training purposes.

Cross-validation in tidymodels in four different ways

1 - Vanilla

set.seed(1923)
vanilla_folds <- vfold_cv(training_set, v = 10)
vanilla_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [1580/176]> Fold01
 2 <split [1580/176]> Fold02
 3 <split [1580/176]> Fold03
 4 <split [1580/176]> Fold04
 5 <split [1580/176]> Fold05
 6 <split [1580/176]> Fold06
 7 <split [1581/175]> Fold07
 8 <split [1581/175]> Fold08
 9 <split [1581/175]> Fold09
10 <split [1581/175]> Fold10

The higher we set k, here named v, the more data will be used for training at any given time.

2 - Repeated

set.seed(1923)
repeated_folds <- vfold_cv(training_set, v = 10, repeats = 5)
repeated_folds
#  10-fold cross-validation repeated 5 times 
# A tibble: 50 × 3
   splits             id      id2   
   <list>             <chr>   <chr> 
 1 <split [1580/176]> Repeat1 Fold01
 2 <split [1580/176]> Repeat1 Fold02
 3 <split [1580/176]> Repeat1 Fold03
 4 <split [1580/176]> Repeat1 Fold04
 5 <split [1580/176]> Repeat1 Fold05
 6 <split [1580/176]> Repeat1 Fold06
 7 <split [1581/175]> Repeat1 Fold07
 8 <split [1581/175]> Repeat1 Fold08
 9 <split [1581/175]> Repeat1 Fold09
10 <split [1581/175]> Repeat1 Fold10
# ℹ 40 more rows

Another way to reduce bias is to repeatedly assign instance to the folds, each time using a different seed.

3 - LOOCV

set.seed(1923)
loocv_folds <- loo_cv(training_set)
loocv_folds
# Leave-one-out cross-validation 
# A tibble: 1,756 × 2
   splits           id        
   <list>           <chr>     
 1 <split [1755/1]> Resample1 
 2 <split [1755/1]> Resample2 
 3 <split [1755/1]> Resample3 
 4 <split [1755/1]> Resample4 
 5 <split [1755/1]> Resample5 
 6 <split [1755/1]> Resample6 
 7 <split [1755/1]> Resample7 
 8 <split [1755/1]> Resample8 
 9 <split [1755/1]> Resample9 
10 <split [1755/1]> Resample10
# ℹ 1,746 more rows

Leave out one CV, in each run bias is reduced but variance increases

4 - Monte Carlo

set.seed(1923)
mc_folds <- mc_cv(training_set, prop = 9/10, times = 20)
mc_folds
# Monte Carlo cross-validation (0.9/0.1) with 20 resamples 
# A tibble: 20 × 2
   splits             id        
   <list>             <chr>     
 1 <split [1580/176]> Resample01
 2 <split [1580/176]> Resample02
 3 <split [1580/176]> Resample03
 4 <split [1580/176]> Resample04
 5 <split [1580/176]> Resample05
 6 <split [1580/176]> Resample06
 7 <split [1580/176]> Resample07
 8 <split [1580/176]> Resample08
 9 <split [1580/176]> Resample09
10 <split [1580/176]> Resample10
11 <split [1580/176]> Resample11
12 <split [1580/176]> Resample12
13 <split [1580/176]> Resample13
14 <split [1580/176]> Resample14
15 <split [1580/176]> Resample15
16 <split [1580/176]> Resample16
17 <split [1580/176]> Resample17
18 <split [1580/176]> Resample18
19 <split [1580/176]> Resample19
20 <split [1580/176]> Resample20

Monte Carlo CV selects the folds randomly each time, which may cause folds to contain overlapping instances (Xu and Liang 2001).

Bootstrap in tidymodels

Why bootstrap?

Repeated sampling instances are used for training, non-sampled instances are used for validation.

set.seed(1923)
booted <- bootstraps(training_set, times = 100)
booted
# Bootstrap sampling 
# A tibble: 100 × 2
   splits             id          
   <list>             <chr>       
 1 <split [1756/647]> Bootstrap001
 2 <split [1756/652]> Bootstrap002
 3 <split [1756/646]> Bootstrap003
 4 <split [1756/643]> Bootstrap004
 5 <split [1756/621]> Bootstrap005
 6 <split [1756/648]> Bootstrap006
 7 <split [1756/656]> Bootstrap007
 8 <split [1756/668]> Bootstrap008
 9 <split [1756/657]> Bootstrap009
10 <split [1756/668]> Bootstrap010
# ℹ 90 more rows

There is a price to pay:

For each values of a tuning parameter, the model needs to be rerun multiple times. This can be extremely slow for complex models, which is why you often see that validation sets are used in deep learning.

Regularization

Regularization adds a penalty for over-fitting to the loss function. The penalty serves as constraint on optimization problem. This can be thought of as a shrinkage estimator, a smaller coefficients are shrunk to the benefit of larger coefficients. The most important regularization procedures are:

  • Lasso
  • Tikhonov regularization
  • Elastic nets

Workflow

model_recipe <- recipe(Sale_Price ~ ., data = training_set) %>%
  step_normalize(all_numeric_predictors())
elastic_spec <-  linear_reg(penalty = tune(),
                            mixture = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
elastic_wf <- workflow() %>%
  add_model(elastic_spec) %>%
  add_recipe(model_recipe)
elastic_metrics <- metric_set(rsq)

Grid

glmnet_param <- extract_parameter_set_dials(elastic_spec)
elastic_grid <- grid_regular(glmnet_param, levels = 5)
print(elastic_grid, n = 6)
# A tibble: 25 × 2
       penalty mixture
         <dbl>   <dbl>
1 0.0000000001   0.05 
2 0.0000000316   0.05 
3 0.00001        0.05 
4 0.00316        0.05 
5 1              0.05 
6 0.0000000001   0.288
# ℹ 19 more rows

Tune

doParallel::registerDoParallel()
set.seed(1561)
elastic_tune <- elastic_wf %>%
  tune_grid(repeated_folds, grid = elastic_grid, metrics = elastic_metrics)
Warning: ! tune detected a parallel backend registered with foreach but no backend
  registered with future.
ℹ Support for parallel processing with foreach was soft-deprecated in tune
  1.2.1.
ℹ See ?parallelism (`?tune::parallelism()`) to learn more.
elastic_best <- select_best(elastic_tune)
Warning in select_best(elastic_tune): No value of `metric` was given; "rsq"
will be used.
elastic_best
# A tibble: 1 × 3
       penalty mixture .config              
         <dbl>   <dbl> <chr>                
1 0.0000000001    0.05 Preprocessor1_Model01

Finish

elastic_updated <- finalize_model(elastic_spec,
                                  elastic_best)
workflow_new <- elastic_wf %>%
  update_model(elastic_updated)
new_fit <- workflow_new %>%
  fit(data = training_set)
tidy_elastic <- new_fit %>%
  extract_fit_parsnip() %>%
  tidy()
tidy_elastic
# A tibble: 7 × 3
  term          estimate      penalty
  <chr>            <dbl>        <dbl>
1 (Intercept)    181225. 0.0000000001
2 Gr_Liv_Area     41403. 0.0000000001
3 Year_Built      18680. 0.0000000001
4 Lot_Area         3560. 0.0000000001
5 Total_Bsmt_SF   20773. 0.0000000001
6 Garage_Cars     14683. 0.0000000001
7 TotRms_AbvGrd   -6527. 0.0000000001

The resukts indicate that regularization is not helping much here. The predictors are clean enough that the unpenalized model performs well enough.

Classification tasks

Unlike regression, classification predicts a categorical outcome. Here we predict whether a property has central air conditioning. The outcome is binary: Y (yes) or N (no).

ames_class <- ames |>
  mutate(
    Central_Air = factor(Central_Air, levels = c("Y", "N")),
    MS_Zoning = factor(MS_Zoning,
                       labels = c("Agriculture", "Commercial",
                                  "Floating Village Residential", "Industrial",
                                  "Residential High Density", "Residential Low Density",
                                  "Residential Medium Density"))
  ) |>
  select(Central_Air, Sale_Price, Gr_Liv_Area, Year_Built,
         Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd, MS_Zoning) |>
  na.omit()
set.seed(2922)
class_split <- initial_split(ames_class, prop = 0.6, strata = Central_Air)
class_train <- training(class_split)
class_test <- testing(class_split)

Checking

class_train |> count(Central_Air) |> mutate(prop = n / sum(n))
# A tibble: 2 × 3
  Central_Air     n   prop
  <fct>       <int>  <dbl>
1 Y            1640 0.933 
2 N             118 0.0671
class_test |> count(Central_Air) |> mutate(prop = n / sum(n))
# A tibble: 2 × 3
  Central_Air     n   prop
  <fct>       <int>  <dbl>
1 Y            1094 0.933 
2 N              78 0.0666

Pre-processing using recipes

Sale_Price, Lot_Area, and Gr_Liv_Area are right-skewed and require a log transformation. MS_Zoning is a nominal categorical variable requiring dummy coding.

class_recipe <- recipe(Central_Air ~ ., data = class_train) |>
  step_log(Sale_Price, Lot_Area, Gr_Liv_Area) |>
  step_dummy(MS_Zoning)

Training using parsnip

glm:

class_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")
class_wf <- workflow() %>%
  add_model(class_spec) %>%
  add_recipe(class_recipe)
class_fit <- class_wf %>%
  fit(data = class_train)
tidy(class_fit)
# A tibble: 14 × 5
   term                                    estimate std.error statistic  p.value
   <chr>                                      <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                              1.15e+2   6.89e+2   0.167   8.68e- 1
 2 Sale_Price                              -5.16e+0   6.62e-1  -7.79    6.62e-15
 3 Gr_Liv_Area                             -1.66e-1   7.93e-1  -0.210   8.34e- 1
 4 Year_Built                              -3.63e-2   6.76e-3  -5.36    8.22e- 8
 5 Lot_Area                                 3.95e-1   3.59e-1   1.10    2.71e- 1
 6 Total_Bsmt_SF                           -8.61e-4   3.97e-4  -2.17    3.03e- 2
 7 Garage_Cars                             -4.01e-1   1.87e-1  -2.14    3.23e- 2
 8 TotRms_AbvGrd                            2.83e-1   1.38e-1   2.05    4.01e- 2
 9 MS_Zoning_Commercial                     1.30e+1   6.89e+2   0.0189  9.85e- 1
10 MS_Zoning_Floating.Village.Residential   1.07e+1   6.89e+2   0.0156  9.88e- 1
11 MS_Zoning_Industrial                     1.04e+1   6.89e+2   0.0151  9.88e- 1
12 MS_Zoning_Residential.High.Density      -8.46e+0   6.56e+3  -0.00129 9.99e- 1
13 MS_Zoning_Residential.Low.Density        1.10e+1   6.89e+2   0.0159  9.87e- 1
14 MS_Zoning_Residential.Medium.Density     8.50e+0   6.89e+2   0.0123  9.90e- 1
glance(class_fit)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1          865.    1757  -215.  457.  534.     429.        1744  1758

Confusion matrix

class_fit %>%
  predict(class_test) %>%
  bind_cols(class_test) %>%
  conf_mat(truth = Central_Air, estimate = .pred_class)
          Truth
Prediction    Y    N
         Y 1069   58
         N   25   20

Metrics

class_fit %>%
  predict(class_test) %>%
  bind_cols(class_test) %>%
  metrics(truth = Central_Air, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.929
2 kap      binary         0.291

ROC Curve

The ROC curve shows, at every possible classification threshold, how well the model detects true positives (sensitivity) against how often it falsely flags negatives as positives (i.e. 1 - specificity). A curve closer to the top-left corner indicates better performance. The diagonal represents a model with no predictive skill.

class_fit %>%
  predict(class_test, type = "prob") %>%
  bind_cols(class_test) %>%
  roc_curve(truth = Central_Air, .pred_Y) %>%
  autoplot()

The model achieves 92.9% accuracy, which is misleading. 93.3% of properties already have central air, so a model that predicts “Y” for everything would score even better. The confusion matrix in turn shows that the model correctly identifies 1069 out of 1094 properties with central air, but only catches 20 of 78 properties without it. It is therefore doing poorly at identifying the minority class. The ROC curve shows a strong lift in the top-left corner, which indicates that the model has strong discriminatory power. In practice, you would lower the classification threshold, as the default 0.5 threshold is poorly calibrated for a heavily imbalanced outcome. Lowering the threshold accepts more false positives, but gets better at identifying properties without central air.

Let’s try this again. This time, we’ll try it with a variable that suffers less from a class imbalance issue. We will try and predict whether a property has excellent heating or not. We will refactor the heating variables, collapsing all non-excellent levels into one.

Predicting excellent heating

ames_class <- ames |>
  mutate(
    Heating_QC = factor(ifelse(Heating_QC == "Excellent", "Excellent", "Not Excellent"),
                        levels = c("Excellent", "Not Excellent")),
    MS_Zoning = factor(MS_Zoning,
                       labels = c("Agriculture", "Commercial",
                                  "Floating Village Residential", "Industrial",
                                  "Residential High Density", "Residential Low Density",
                                  "Residential Medium Density"))
  ) |>
  select(Heating_QC, Sale_Price, Gr_Liv_Area, Year_Built,
         Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd, MS_Zoning) |>
  na.omit()
set.seed(2922)
class_split <- initial_split(ames_class, prop = 0.6, strata = Heating_QC)
class_train <- training(class_split)
class_test <- testing(class_split)

Checking

class_train |> count(Heating_QC) |> mutate(prop = n / sum(n))
# A tibble: 2 × 3
  Heating_QC        n  prop
  <fct>         <int> <dbl>
1 Excellent       897 0.510
2 Not Excellent   861 0.490
class_test |> count(Heating_QC) |> mutate(prop = n / sum(n))
# A tibble: 2 × 3
  Heating_QC        n  prop
  <fct>         <int> <dbl>
1 Excellent       598 0.510
2 Not Excellent   574 0.490

Pre-processing Using recipes

Sale_Price, Lot_Area, and Gr_Liv_Area are right-skewed and require a log transformation. MS_Zoning is a nominal categorical variable requiring dummy coding.

class_recipe <- recipe(Heating_QC ~ ., data = class_train) |>
  step_log(Sale_Price, Lot_Area, Gr_Liv_Area) |>
  step_dummy(MS_Zoning)

Training using parsnip

class_spec <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")
class_wf <- workflow() %>%
  add_model(class_spec) %>%
  add_recipe(class_recipe)
class_fit <- class_wf %>%
  fit(data = class_train)
tidy(class_fit)
# A tibble: 14 × 5
   term                                    estimate std.error statistic  p.value
   <chr>                                      <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                            70.4        6.29e+0   11.2    4.42e-29
 2 Sale_Price                             -4.30       4.11e-1  -10.5    1.42e-25
 3 Gr_Liv_Area                             1.62       4.35e-1    3.71   2.06e- 4
 4 Year_Built                             -0.0181     2.97e-3   -6.07   1.26e- 9
 5 Lot_Area                                0.366      1.51e-1    2.42   1.56e- 2
 6 Total_Bsmt_SF                           0.000131   1.81e-4    0.726  4.68e- 1
 7 Garage_Cars                             0.173      1.11e-1    1.56   1.20e- 1
 8 TotRms_AbvGrd                          -0.133      6.96e-2   -1.90   5.68e- 2
 9 MS_Zoning_Commercial                    3.06       1.01e+0    3.02   2.51e- 3
10 MS_Zoning_Floating.Village.Residential  2.35       7.37e-1    3.19   1.40e- 3
11 MS_Zoning_Industrial                    1.70       7.52e-1    2.26   2.40e- 2
12 MS_Zoning_Residential.High.Density     11.5        4.33e+2    0.0266 9.79e- 1
13 MS_Zoning_Residential.Low.Density       0.911      1.12e+0    0.813  4.16e- 1
14 MS_Zoning_Residential.Medium.Density   11.7        8.83e+2    0.0132 9.89e- 1
glance(class_fit)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         2436.    1757  -872. 1772. 1849.    1744.        1744  1758

Confusion matrix

class_fit %>%
  predict(class_test) %>%
  bind_cols(class_test) %>%
  conf_mat(truth = Heating_QC, estimate = .pred_class)
               Truth
Prediction      Excellent Not Excellent
  Excellent           418           133
  Not Excellent       180           441

Metrics

class_fit %>%
  predict(class_test) %>%
  bind_cols(class_test) %>%
  metrics(truth = Heating_QC, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.733
2 kap      binary         0.467

ROC Curve

class_fit %>%
  predict(class_test, type = "prob") %>%
  bind_cols(class_test) %>%
  roc_curve(truth = Heating_QC, .pred_Excellent) %>%
  autoplot()

The model achieves 73.3% accuracy with a kappa of 0.467. Given a split of 51/49% class balance, a naive classifier would have scored 50%. 180 not-excellent properties are classified as excellent, and 133 excellent properties are classified as non-excellent, evidencing no strong bias towards either. sale_price is the strongest predictor, with a negative coefficient, indicating that higher-priced properties are less likely to have excellent heating. This may be related to older houses being more expensive. The ROC curve lifts above the average, but is still some way away from the top-left corner, indicating room for improvement.

Decision Trees

These are a non-parametric method that recursively partitions the feature space into rectangular regions. Unlike linear regression, no assumptions about functional form are required. Three key tuning parameters:

  • tree_depth: maximum depth of the tree.
  • min_n: minimum number of instances required to split a node.
  • cost_complexity: penalizes tree size to prevent overfitting.

Breiman (1984) developed the classification and regression trees (CART) algorithm, which can handle both classification and regression tasks. CART has an automatic facility for handling missing data and generates and easily interpretable set of results.

In trees, we seek to divide our data parallel to the axes that improve the purity of the resulting nodes. In classification, CART defines node impurity either by Gini impurity or entropy. Successive axis-parallel splits minimize both.

Workflow

cart_recipe <- recipe(Sale_Price ~ ., data = training_set)
cart_spec <- decision_tree(tree_depth = tune(),
                           min_n = tune(),
                           cost_complexity = tune()) %>%
  set_mode("regression") %>%
  set_engine("rpart")
cart_wf <- workflow() %>%
  add_recipe(cart_recipe) %>%
  add_model(cart_spec)

Grid

cart_para <- extract_parameter_set_dials(cart_spec)
cart_grid <- grid_latin_hypercube(cart_para,
                                  size = 60,
                                  variogram_range = 0.5,
                                  original = TRUE)
Warning: `grid_latin_hypercube()` was deprecated in dials 1.3.0.
ℹ Please use `grid_space_filling()` instead.

Tune

set.seed(5161)
cv_folds <- vfold_cv(training_set, v = 10, repeats = 5)
doParallel::registerDoParallel()
cart_metrics <- metric_set(rsq)
cart_tune <- cart_wf %>%
  tune_grid(cv_folds, grid = cart_grid, metrics = cart_metrics)
Warning: ! tune detected a parallel backend registered with foreach but no backend
  registered with future.
ℹ Support for parallel processing with foreach was soft-deprecated in tune
  1.2.1.
ℹ See ?parallelism (`?tune::parallelism()`) to learn more.
cart_best <- select_best(cart_tune, metric = "rsq")
cart_best
# A tibble: 1 × 4
  cost_complexity tree_depth min_n .config              
            <dbl>      <int> <int> <chr>                
1        1.07e-10         13    21 Preprocessor1_Model25

Finish

cart_updated <- finalize_model(cart_spec, cart_best)
cart_wf_final <- cart_wf %>%
  update_model(cart_updated)
cart_fit_final <- cart_wf_final %>%
  fit(data = training_set)

Plot

tree <- extract_fit_engine(cart_fit_final)
rpart.plot(tree)
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
    Call rpart.plot with roundint=FALSE,
    or rebuild the rpart model with model=TRUE.
Warning: labs do not fit even at cex 0.15, there may be some overplotting

Output

cart_fit_final %>%
  predict(test_set) %>%
  bind_cols(test_set) %>%
  metrics(truth = Sale_Price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   39682.   
2 rsq     standard       0.765
3 mae     standard   24177.   

The decision tree (R² = 0.745, RMSE = $40,488) performs almost identically to the linear regression from session 1 (R² = 0.725, RMSE = $40,860). The tree is marginally better on all three metrics, but the difference is trivial for this dataset with these predictors, a simple straight line and a complex branching tree arrive at essentially the same answer. This demonstrates that a more complex model is not always better.

Trees would be useful to examine other phenomena. An example would be an analysis, where multiple categories are collapsed into one continuous outcome variables, with a theoretically informed expectation of multiple distinct clusters of data.