#install.packages("DALEXtra")
#install.packages("ingredients")
#install.packages("modeldata")
#install.packages("tidyverse")
#install.packages("tidymodels")
#install.packages("rio")
#install.packages("doParallel")
#install.packages("glmnet")
#install.packages("rpart.plot")Introduction to Machine Learning for the Social Sciences - Code Tutorial, Session 4
Code based on: Steenbergen, M. (2025). Introduction to Machine Learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.
Data:
Ames Housing dataset (De Cock 2011). Assessor’s office data for residential properties sold in Ames, Iowa. Label: “What was the sale price of this residential property?”
Features:
- Above ground living area in square feet.
- Year the property was originally constructed.
- Lot size in square feet.
- Total square feet of basement area.
- Size of garage in car capacity.
- Total rooms above ground
Installations (remove comment notation if needed):
Load packages
library(rio)
library(modeldata)
library(tidyverse)
library(DALEXtra)
library(tidymodels)
library(ingredients)
library(doParallel)
library(glmnet)
library(rpart.plot)Load model from last session
data(ames)
ames_clean <- ames |>
select(Sale_Price, Gr_Liv_Area, Year_Built,
Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd) |>
na.omit()
tidymodels_prefer()
set.seed(2922)
ames_split <- initial_split(ames_clean, prop = 0.6, strata = Sale_Price)
ames_split<Training/Testing/Total>
<1756/1174/2930>
training_set <- training(ames_split)
test_set <- testing(ames_split)Validation
val_set <- validation_split(training_set, prop = 3/4)Warning: `validation_split()` was deprecated in rsample 1.2.0.
ℹ Please use `initial_validation_split()` instead.
val_set# Validation Set Split (0.75/0.25)
# A tibble: 1 × 2
splits id
<list> <chr>
1 <split [1317/439]> validation
Why validate?
There are disadvantages to the two-split sample approach: We risk training on potentially a small part of the original data, inducing downwards bias in performance, and we get only one value for optimal tuning parameter, which may mean our ability to generalize is limited. If we have a lot of data, these problems tend to be less severe, but in smaller datasets, it might be better using cross-validation and bootstrapping.
What is a fold?
A fold is one of the subsets created when cross-validation divides the training data into equal parts, each taking a turn as the held-out evaluation set while the model is trained on the remainder.
k-fold cross-validation steps:
- Randomly assign instances to k-folds (usually 10),
- each fold is held out once for validation purposes,
- each fold is used k-1 times for training purposes.
Cross-validation in tidymodels in four different ways
1 - Vanilla
set.seed(1923)
vanilla_folds <- vfold_cv(training_set, v = 10)
vanilla_folds# 10-fold cross-validation
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [1580/176]> Fold01
2 <split [1580/176]> Fold02
3 <split [1580/176]> Fold03
4 <split [1580/176]> Fold04
5 <split [1580/176]> Fold05
6 <split [1580/176]> Fold06
7 <split [1581/175]> Fold07
8 <split [1581/175]> Fold08
9 <split [1581/175]> Fold09
10 <split [1581/175]> Fold10
The higher we set k, here named v, the more data will be used for training at any given time.
2 - Repeated
set.seed(1923)
repeated_folds <- vfold_cv(training_set, v = 10, repeats = 5)
repeated_folds# 10-fold cross-validation repeated 5 times
# A tibble: 50 × 3
splits id id2
<list> <chr> <chr>
1 <split [1580/176]> Repeat1 Fold01
2 <split [1580/176]> Repeat1 Fold02
3 <split [1580/176]> Repeat1 Fold03
4 <split [1580/176]> Repeat1 Fold04
5 <split [1580/176]> Repeat1 Fold05
6 <split [1580/176]> Repeat1 Fold06
7 <split [1581/175]> Repeat1 Fold07
8 <split [1581/175]> Repeat1 Fold08
9 <split [1581/175]> Repeat1 Fold09
10 <split [1581/175]> Repeat1 Fold10
# ℹ 40 more rows
Another way to reduce bias is to repeatedly assign instance to the folds, each time using a different seed.
3 - LOOCV
set.seed(1923)
loocv_folds <- loo_cv(training_set)
loocv_folds# Leave-one-out cross-validation
# A tibble: 1,756 × 2
splits id
<list> <chr>
1 <split [1755/1]> Resample1
2 <split [1755/1]> Resample2
3 <split [1755/1]> Resample3
4 <split [1755/1]> Resample4
5 <split [1755/1]> Resample5
6 <split [1755/1]> Resample6
7 <split [1755/1]> Resample7
8 <split [1755/1]> Resample8
9 <split [1755/1]> Resample9
10 <split [1755/1]> Resample10
# ℹ 1,746 more rows
Leave out one CV, in each run bias is reduced but variance increases
4 - Monte Carlo
set.seed(1923)
mc_folds <- mc_cv(training_set, prop = 9/10, times = 20)
mc_folds# Monte Carlo cross-validation (0.9/0.1) with 20 resamples
# A tibble: 20 × 2
splits id
<list> <chr>
1 <split [1580/176]> Resample01
2 <split [1580/176]> Resample02
3 <split [1580/176]> Resample03
4 <split [1580/176]> Resample04
5 <split [1580/176]> Resample05
6 <split [1580/176]> Resample06
7 <split [1580/176]> Resample07
8 <split [1580/176]> Resample08
9 <split [1580/176]> Resample09
10 <split [1580/176]> Resample10
11 <split [1580/176]> Resample11
12 <split [1580/176]> Resample12
13 <split [1580/176]> Resample13
14 <split [1580/176]> Resample14
15 <split [1580/176]> Resample15
16 <split [1580/176]> Resample16
17 <split [1580/176]> Resample17
18 <split [1580/176]> Resample18
19 <split [1580/176]> Resample19
20 <split [1580/176]> Resample20
Monte Carlo CV selects the folds randomly each time, which may cause folds to contain overlapping instances (Xu and Liang 2001).
Bootstrap in tidymodels
Why bootstrap?
Repeated sampling instances are used for training, non-sampled instances are used for validation.
set.seed(1923)
booted <- bootstraps(training_set, times = 100)
booted# Bootstrap sampling
# A tibble: 100 × 2
splits id
<list> <chr>
1 <split [1756/647]> Bootstrap001
2 <split [1756/652]> Bootstrap002
3 <split [1756/646]> Bootstrap003
4 <split [1756/643]> Bootstrap004
5 <split [1756/621]> Bootstrap005
6 <split [1756/648]> Bootstrap006
7 <split [1756/656]> Bootstrap007
8 <split [1756/668]> Bootstrap008
9 <split [1756/657]> Bootstrap009
10 <split [1756/668]> Bootstrap010
# ℹ 90 more rows
There is a price to pay:
For each values of a tuning parameter, the model needs to be rerun multiple times. This can be extremely slow for complex models, which is why you often see that validation sets are used in deep learning.
Regularization
Regularization adds a penalty for over-fitting to the loss function. The penalty serves as constraint on optimization problem. This can be thought of as a shrinkage estimator, a smaller coefficients are shrunk to the benefit of larger coefficients. The most important regularization procedures are:
- Lasso
- Tikhonov regularization
- Elastic nets
Workflow
model_recipe <- recipe(Sale_Price ~ ., data = training_set) %>%
step_normalize(all_numeric_predictors())
elastic_spec <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
elastic_wf <- workflow() %>%
add_model(elastic_spec) %>%
add_recipe(model_recipe)
elastic_metrics <- metric_set(rsq)Grid
glmnet_param <- extract_parameter_set_dials(elastic_spec)
elastic_grid <- grid_regular(glmnet_param, levels = 5)
print(elastic_grid, n = 6)# A tibble: 25 × 2
penalty mixture
<dbl> <dbl>
1 0.0000000001 0.05
2 0.0000000316 0.05
3 0.00001 0.05
4 0.00316 0.05
5 1 0.05
6 0.0000000001 0.288
# ℹ 19 more rows
Tune
doParallel::registerDoParallel()
set.seed(1561)
elastic_tune <- elastic_wf %>%
tune_grid(repeated_folds, grid = elastic_grid, metrics = elastic_metrics)Warning: ! tune detected a parallel backend registered with foreach but no backend
registered with future.
ℹ Support for parallel processing with foreach was soft-deprecated in tune
1.2.1.
ℹ See ?parallelism (`?tune::parallelism()`) to learn more.
elastic_best <- select_best(elastic_tune)Warning in select_best(elastic_tune): No value of `metric` was given; "rsq"
will be used.
elastic_best# A tibble: 1 × 3
penalty mixture .config
<dbl> <dbl> <chr>
1 0.0000000001 0.05 Preprocessor1_Model01
Finish
elastic_updated <- finalize_model(elastic_spec,
elastic_best)
workflow_new <- elastic_wf %>%
update_model(elastic_updated)
new_fit <- workflow_new %>%
fit(data = training_set)
tidy_elastic <- new_fit %>%
extract_fit_parsnip() %>%
tidy()
tidy_elastic# A tibble: 7 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 181225. 0.0000000001
2 Gr_Liv_Area 41403. 0.0000000001
3 Year_Built 18680. 0.0000000001
4 Lot_Area 3560. 0.0000000001
5 Total_Bsmt_SF 20773. 0.0000000001
6 Garage_Cars 14683. 0.0000000001
7 TotRms_AbvGrd -6527. 0.0000000001
The resukts indicate that regularization is not helping much here. The predictors are clean enough that the unpenalized model performs well enough.
Classification tasks
Unlike regression, classification predicts a categorical outcome. Here we predict whether a property has central air conditioning. The outcome is binary: Y (yes) or N (no).
ames_class <- ames |>
mutate(
Central_Air = factor(Central_Air, levels = c("Y", "N")),
MS_Zoning = factor(MS_Zoning,
labels = c("Agriculture", "Commercial",
"Floating Village Residential", "Industrial",
"Residential High Density", "Residential Low Density",
"Residential Medium Density"))
) |>
select(Central_Air, Sale_Price, Gr_Liv_Area, Year_Built,
Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd, MS_Zoning) |>
na.omit()
set.seed(2922)
class_split <- initial_split(ames_class, prop = 0.6, strata = Central_Air)
class_train <- training(class_split)
class_test <- testing(class_split)Checking
class_train |> count(Central_Air) |> mutate(prop = n / sum(n))# A tibble: 2 × 3
Central_Air n prop
<fct> <int> <dbl>
1 Y 1640 0.933
2 N 118 0.0671
class_test |> count(Central_Air) |> mutate(prop = n / sum(n))# A tibble: 2 × 3
Central_Air n prop
<fct> <int> <dbl>
1 Y 1094 0.933
2 N 78 0.0666
Pre-processing using recipes
Sale_Price, Lot_Area, and Gr_Liv_Area are right-skewed and require a log transformation. MS_Zoning is a nominal categorical variable requiring dummy coding.
class_recipe <- recipe(Central_Air ~ ., data = class_train) |>
step_log(Sale_Price, Lot_Area, Gr_Liv_Area) |>
step_dummy(MS_Zoning)Training using parsnip
glm:
class_spec <- logistic_reg() %>%
set_mode("classification") %>%
set_engine("glm")
class_wf <- workflow() %>%
add_model(class_spec) %>%
add_recipe(class_recipe)
class_fit <- class_wf %>%
fit(data = class_train)
tidy(class_fit)# A tibble: 14 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.15e+2 6.89e+2 0.167 8.68e- 1
2 Sale_Price -5.16e+0 6.62e-1 -7.79 6.62e-15
3 Gr_Liv_Area -1.66e-1 7.93e-1 -0.210 8.34e- 1
4 Year_Built -3.63e-2 6.76e-3 -5.36 8.22e- 8
5 Lot_Area 3.95e-1 3.59e-1 1.10 2.71e- 1
6 Total_Bsmt_SF -8.61e-4 3.97e-4 -2.17 3.03e- 2
7 Garage_Cars -4.01e-1 1.87e-1 -2.14 3.23e- 2
8 TotRms_AbvGrd 2.83e-1 1.38e-1 2.05 4.01e- 2
9 MS_Zoning_Commercial 1.30e+1 6.89e+2 0.0189 9.85e- 1
10 MS_Zoning_Floating.Village.Residential 1.07e+1 6.89e+2 0.0156 9.88e- 1
11 MS_Zoning_Industrial 1.04e+1 6.89e+2 0.0151 9.88e- 1
12 MS_Zoning_Residential.High.Density -8.46e+0 6.56e+3 -0.00129 9.99e- 1
13 MS_Zoning_Residential.Low.Density 1.10e+1 6.89e+2 0.0159 9.87e- 1
14 MS_Zoning_Residential.Medium.Density 8.50e+0 6.89e+2 0.0123 9.90e- 1
glance(class_fit)# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 865. 1757 -215. 457. 534. 429. 1744 1758
Confusion matrix
class_fit %>%
predict(class_test) %>%
bind_cols(class_test) %>%
conf_mat(truth = Central_Air, estimate = .pred_class) Truth
Prediction Y N
Y 1069 58
N 25 20
Metrics
class_fit %>%
predict(class_test) %>%
bind_cols(class_test) %>%
metrics(truth = Central_Air, estimate = .pred_class)# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.929
2 kap binary 0.291
ROC Curve
The ROC curve shows, at every possible classification threshold, how well the model detects true positives (sensitivity) against how often it falsely flags negatives as positives (i.e. 1 - specificity). A curve closer to the top-left corner indicates better performance. The diagonal represents a model with no predictive skill.
class_fit %>%
predict(class_test, type = "prob") %>%
bind_cols(class_test) %>%
roc_curve(truth = Central_Air, .pred_Y) %>%
autoplot()The model achieves 92.9% accuracy, which is misleading. 93.3% of properties already have central air, so a model that predicts “Y” for everything would score even better. The confusion matrix in turn shows that the model correctly identifies 1069 out of 1094 properties with central air, but only catches 20 of 78 properties without it. It is therefore doing poorly at identifying the minority class. The ROC curve shows a strong lift in the top-left corner, which indicates that the model has strong discriminatory power. In practice, you would lower the classification threshold, as the default 0.5 threshold is poorly calibrated for a heavily imbalanced outcome. Lowering the threshold accepts more false positives, but gets better at identifying properties without central air.
Let’s try this again. This time, we’ll try it with a variable that suffers less from a class imbalance issue. We will try and predict whether a property has excellent heating or not. We will refactor the heating variables, collapsing all non-excellent levels into one.
Predicting excellent heating
ames_class <- ames |>
mutate(
Heating_QC = factor(ifelse(Heating_QC == "Excellent", "Excellent", "Not Excellent"),
levels = c("Excellent", "Not Excellent")),
MS_Zoning = factor(MS_Zoning,
labels = c("Agriculture", "Commercial",
"Floating Village Residential", "Industrial",
"Residential High Density", "Residential Low Density",
"Residential Medium Density"))
) |>
select(Heating_QC, Sale_Price, Gr_Liv_Area, Year_Built,
Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd, MS_Zoning) |>
na.omit()
set.seed(2922)
class_split <- initial_split(ames_class, prop = 0.6, strata = Heating_QC)
class_train <- training(class_split)
class_test <- testing(class_split)Checking
class_train |> count(Heating_QC) |> mutate(prop = n / sum(n))# A tibble: 2 × 3
Heating_QC n prop
<fct> <int> <dbl>
1 Excellent 897 0.510
2 Not Excellent 861 0.490
class_test |> count(Heating_QC) |> mutate(prop = n / sum(n))# A tibble: 2 × 3
Heating_QC n prop
<fct> <int> <dbl>
1 Excellent 598 0.510
2 Not Excellent 574 0.490
Pre-processing Using recipes
Sale_Price, Lot_Area, and Gr_Liv_Area are right-skewed and require a log transformation. MS_Zoning is a nominal categorical variable requiring dummy coding.
class_recipe <- recipe(Heating_QC ~ ., data = class_train) |>
step_log(Sale_Price, Lot_Area, Gr_Liv_Area) |>
step_dummy(MS_Zoning)Training using parsnip
class_spec <- logistic_reg() %>%
set_mode("classification") %>%
set_engine("glm")
class_wf <- workflow() %>%
add_model(class_spec) %>%
add_recipe(class_recipe)
class_fit <- class_wf %>%
fit(data = class_train)
tidy(class_fit)# A tibble: 14 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 70.4 6.29e+0 11.2 4.42e-29
2 Sale_Price -4.30 4.11e-1 -10.5 1.42e-25
3 Gr_Liv_Area 1.62 4.35e-1 3.71 2.06e- 4
4 Year_Built -0.0181 2.97e-3 -6.07 1.26e- 9
5 Lot_Area 0.366 1.51e-1 2.42 1.56e- 2
6 Total_Bsmt_SF 0.000131 1.81e-4 0.726 4.68e- 1
7 Garage_Cars 0.173 1.11e-1 1.56 1.20e- 1
8 TotRms_AbvGrd -0.133 6.96e-2 -1.90 5.68e- 2
9 MS_Zoning_Commercial 3.06 1.01e+0 3.02 2.51e- 3
10 MS_Zoning_Floating.Village.Residential 2.35 7.37e-1 3.19 1.40e- 3
11 MS_Zoning_Industrial 1.70 7.52e-1 2.26 2.40e- 2
12 MS_Zoning_Residential.High.Density 11.5 4.33e+2 0.0266 9.79e- 1
13 MS_Zoning_Residential.Low.Density 0.911 1.12e+0 0.813 4.16e- 1
14 MS_Zoning_Residential.Medium.Density 11.7 8.83e+2 0.0132 9.89e- 1
glance(class_fit)# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2436. 1757 -872. 1772. 1849. 1744. 1744 1758
Confusion matrix
class_fit %>%
predict(class_test) %>%
bind_cols(class_test) %>%
conf_mat(truth = Heating_QC, estimate = .pred_class) Truth
Prediction Excellent Not Excellent
Excellent 418 133
Not Excellent 180 441
Metrics
class_fit %>%
predict(class_test) %>%
bind_cols(class_test) %>%
metrics(truth = Heating_QC, estimate = .pred_class)# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.733
2 kap binary 0.467
ROC Curve
class_fit %>%
predict(class_test, type = "prob") %>%
bind_cols(class_test) %>%
roc_curve(truth = Heating_QC, .pred_Excellent) %>%
autoplot()The model achieves 73.3% accuracy with a kappa of 0.467. Given a split of 51/49% class balance, a naive classifier would have scored 50%. 180 not-excellent properties are classified as excellent, and 133 excellent properties are classified as non-excellent, evidencing no strong bias towards either. sale_price is the strongest predictor, with a negative coefficient, indicating that higher-priced properties are less likely to have excellent heating. This may be related to older houses being more expensive. The ROC curve lifts above the average, but is still some way away from the top-left corner, indicating room for improvement.
Decision Trees
These are a non-parametric method that recursively partitions the feature space into rectangular regions. Unlike linear regression, no assumptions about functional form are required. Three key tuning parameters:
- tree_depth: maximum depth of the tree.
- min_n: minimum number of instances required to split a node.
- cost_complexity: penalizes tree size to prevent overfitting.
Breiman (1984) developed the classification and regression trees (CART) algorithm, which can handle both classification and regression tasks. CART has an automatic facility for handling missing data and generates and easily interpretable set of results.
In trees, we seek to divide our data parallel to the axes that improve the purity of the resulting nodes. In classification, CART defines node impurity either by Gini impurity or entropy. Successive axis-parallel splits minimize both.
Workflow
cart_recipe <- recipe(Sale_Price ~ ., data = training_set)
cart_spec <- decision_tree(tree_depth = tune(),
min_n = tune(),
cost_complexity = tune()) %>%
set_mode("regression") %>%
set_engine("rpart")
cart_wf <- workflow() %>%
add_recipe(cart_recipe) %>%
add_model(cart_spec)Grid
cart_para <- extract_parameter_set_dials(cart_spec)
cart_grid <- grid_latin_hypercube(cart_para,
size = 60,
variogram_range = 0.5,
original = TRUE)Warning: `grid_latin_hypercube()` was deprecated in dials 1.3.0.
ℹ Please use `grid_space_filling()` instead.
Tune
set.seed(5161)
cv_folds <- vfold_cv(training_set, v = 10, repeats = 5)
doParallel::registerDoParallel()
cart_metrics <- metric_set(rsq)
cart_tune <- cart_wf %>%
tune_grid(cv_folds, grid = cart_grid, metrics = cart_metrics)Warning: ! tune detected a parallel backend registered with foreach but no backend
registered with future.
ℹ Support for parallel processing with foreach was soft-deprecated in tune
1.2.1.
ℹ See ?parallelism (`?tune::parallelism()`) to learn more.
cart_best <- select_best(cart_tune, metric = "rsq")
cart_best# A tibble: 1 × 4
cost_complexity tree_depth min_n .config
<dbl> <int> <int> <chr>
1 1.07e-10 13 21 Preprocessor1_Model25
Finish
cart_updated <- finalize_model(cart_spec, cart_best)
cart_wf_final <- cart_wf %>%
update_model(cart_updated)
cart_fit_final <- cart_wf_final %>%
fit(data = training_set)Plot
tree <- extract_fit_engine(cart_fit_final)
rpart.plot(tree)Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
Call rpart.plot with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
Warning: labs do not fit even at cex 0.15, there may be some overplotting
Output
cart_fit_final %>%
predict(test_set) %>%
bind_cols(test_set) %>%
metrics(truth = Sale_Price, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 39682.
2 rsq standard 0.765
3 mae standard 24177.
The decision tree (R² = 0.745, RMSE = $40,488) performs almost identically to the linear regression from session 1 (R² = 0.725, RMSE = $40,860). The tree is marginally better on all three metrics, but the difference is trivial for this dataset with these predictors, a simple straight line and a complex branching tree arrive at essentially the same answer. This demonstrates that a more complex model is not always better.
Trees would be useful to examine other phenomena. An example would be an analysis, where multiple categories are collapsed into one continuous outcome variables, with a theoretically informed expectation of multiple distinct clusters of data.