Introduction to Machine Learning for the Social Sciences - Code Tutorial, Session 3

Author

Adrian Stanciu & Erik Paessler

Code based on: Steenbergen, M. (2025). Introduction to Machine Learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.

Data:

Ames Housing dataset (De Cock 2011). Assessor’s office data for residential properties sold in Ames, Iowa. Label: “What was the sale price of this residential property?”

Features:

Above ground living area in square feet.
Year the property was originally constructed.
Lot size in square feet.
Total square feet of basement area.
Size of garage in car capacity.
Total rooms above ground

Installations (remove comment notation if needed):

#install.packages("DALEXtra")
#install.packages("ingredients")
#install.packages("modeldata")
#install.packages("tidyverse")
#install.packages("tidymodels")
#install.packages("rio")
#install.packages("doParallel")
#install.packages("glmnet")
#install.packages("rpart.plot")

Load packages

library(rio)
library(modeldata)
library(tidyverse)
library(DALEXtra)
library(tidymodels)
library(ingredients)
library(doParallel)
library(glmnet)
library(rpart.plot)

Splitting the Sample Using rsample

data(ames)
ames_clean <- ames |>
  select(Sale_Price, Gr_Liv_Area, Year_Built,
         Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd) |>
  na.omit()

Setup

tidymodels_prefer()
set.seed(2922)
ames_split <- initial_split(ames_clean, prop = 0.6, strata = Sale_Price)
ames_split

<Training/Testing/Total>
<1756/1174/2930>

initial_split() from rsample splits your data into training and test sets. When breaking down each argument from the code:

ames_clean — the dataset being split
prop = 0.6 — 60% of the data goes to the training set, 40% to the test set
strata = Sale_Price — ensures the distribution of Sale_Price is similar in both sets by sampling proportionally within bins of the outcome, rather than splitting purely at random.

The last point is why the KS test returned p = 0.99. Stratification enforces distributional similarity, so the result is expected.

Execution

training_set <- training(ames_split)
test_set <- testing(ames_split)

Checking

ks.test(training_set$Sale_Price, test_set$Sale_Price)

Warning in ks.test.default(training_set$Sale_Price, test_set$Sale_Price):
p-value will be approximate in the presence of ties


    Asymptotic two-sample Kolmogorov-Smirnov test

data:  training_set$Sale_Price and test_set$Sale_Price
D = 0.016455, p-value = 0.9912
alternative hypothesis: two-sided

Pre-processing Using recipes

For a regression problem, there really is often not any pre-processing needed. But: The outcome of Sale_Price and predictors such as Lot_Area and Gr_Liv_Area are, on the basis of theoretical expectation, likely right-skewed and would benefit from a log transformation.

Check distributions of all numeric variables

ames_clean |>
  select(where(is.numeric)) |>
  pivot_longer(everything()) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~name, scales = "free") +
  theme_bw()

Skewness statistic (|skew| > 1 warrants transformation)

ames_clean |>
  select(where(is.numeric)) |>
  summarise(across(everything(), \(x) mean((x - mean(x))^3) / sd(x)^3))

# A tibble: 1 × 7
  Sale_Price Gr_Liv_Area Year_Built Lot_Area Total_Bsmt_SF Garage_Cars
       <dbl>       <dbl>      <dbl>    <dbl>         <dbl>       <dbl>
1       1.74        1.27     -0.604     12.8          1.15      -0.221
# ℹ 1 more variable: TotRms_AbvGrd <dbl>

Apply log to right-skewed variables in training and test sets separately (transformation parameters must not be learned from the test set):

training_set <- training_set |>
  mutate(across(c(Sale_Price, Lot_Area, Gr_Liv_Area), log))

test_set <- test_set |>
  mutate(across(c(Sale_Price, Lot_Area, Gr_Liv_Area), log))

Training Using parsnip

We specify the algorithm, any tuning parameters, and the mode (i.e., task). We set the external learning engine (= implementation of the algorithm) using set_engine. We declare the model formula and training set using fit. Algorithms and engines can be found at tidymodels.org/find/parsnip

Using lm

ames_fit <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm") %>%
  fit(Sale_Price ~ ., data = training_set)
tidy(ames_fit)

# A tibble: 7 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -1.72     0.405         -4.24 2.31e-  5
2 Gr_Liv_Area    0.634    0.0257        24.7  8.20e-116
3 Year_Built     0.00417  0.000189      22.1  1.68e- 95
4 Lot_Area       0.0793   0.00948        8.37 1.21e- 16
5 Total_Bsmt_SF  0.000195 0.0000124     15.8  2.27e- 52
6 Garage_Cars    0.0910   0.00791       11.5  1.50e- 29
7 TotRms_AbvGrd -0.0260   0.00486       -5.35 9.82e-  8

glance(ames_fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.795         0.794 0.185     1128.       0     6   476. -936. -892.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Based on the t-statistic (the statistic column, which measures the strength of evidence net of uncertainty), Gr_Liv_Area is the strongest predictor with a t-statistic of 22.7 and a p-value of 6.48e-100.

However, you cannot directly compare the raw coefficients to rank predictors here. Each is on a different scale (square feet vs. years vs. number of cars), so the coefficient sizes are not comparable to each other.

Evolution Using yardstick

Prediction

ames_fit %>%
  predict(test_set) %>%
  bind_cols(test_set)

# A tibble: 1,174 × 8
   .pred Sale_Price Gr_Liv_Area Year_Built Lot_Area Total_Bsmt_SF Garage_Cars
   <dbl>      <dbl>       <dbl>      <int>    <dbl>         <dbl>       <dbl>
 1  12.2       12.3        7.41       1960    10.4           1080           2
 2  11.6       11.6        6.80       1961     9.36           882           1
 3  12.0       12.1        7.19       1958     9.57          1329           1
 4  12.3       12.2        7.40       1997     9.53           928           2
 5  12.2       12.2        7.38       1998     9.21           926           2
 6  12.3       12.4        7.39       1995     8.59          1595           2
 7  12.1       12.1        7.29       1998     9.04           789           2
 8  12.3       12.3        7.31       1985     8.83          1488           2
 9  12.6       12.9        7.53       2010     9.34          1856           3
10  12.4       12.3        7.64       1978     9.49          1542           2
# ℹ 1,164 more rows
# ℹ 1 more variable: TotRms_AbvGrd <int>

Metrics

ames_fit %>%
  predict(test_set) %>%
  bind_cols(test_set) %>%
  metrics(truth = Sale_Price, estimate = .pred)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.194
2 rsq     standard       0.773
3 mae     standard       0.131

Partial Dependence Plot

DALEX = Descriptive mAchine Learning EXplanations (Biecek 2018)

Create explainer:

ames_explainer <- explain_tidymodels(
  model = ames_fit,
  data = test_set,
  y = test_set$Sale_Price
)

Preparation of a new explainer is initiated
  -> model label       :  model_fit  (  default  )
  -> data              :  1174  rows  7  cols 
  -> data              :  tibble converted into a data.frame 
  -> target variable   :  1174  values 
  -> predict function  :  yhat.model_fit  will be used (  default  )
  -> predicted values  :  No value for predict function target column. (  default  )
  -> model_info        :  package parsnip , ver. 1.3.2 , task regression (  default  ) 
  -> predicted values  :  numerical, min =  11.06472 , mean =  12.02728 , max =  13.78078  
  -> residual function :  difference between y and yhat (  default  )
  -> residuals         :  numerical, min =  -1.962382 , mean =  -0.008625293 , max =  0.7334518  
  A new explainer has been created!

Plotter:

pdp_area <- model_profile(
  ames_explainer,
  variables = "Gr_Liv_Area",
  N = NULL
)
as_tibble(pdp_area$agr_profiles) %>%
  ggplot(aes(`_x_`, `_yhat_`)) +
  geom_line(size = 1.2, alpha = 0.8, color = "#31688EFF") +
  geom_rug(sides = "b", col = "#31688EFF") +
  labs(
    x = "Above Ground Living Area (sq ft)",
    y = "Predicted Sale Price",
    color = NULL
  ) +
  theme_bw()

Substantive interpretation:

The R² of 0.744 in the training set indicates the six predictors explain about 74% of the variance in sale prices. The RMSE in the test set of $40,860 tells you that on average the predictions are off by that amount, which is substantial given a mean price around $180,000 (approx. 23% error). The PDP for Gr_Liv_Area shows the expected positive relationship, as above-ground living area increases, predicted sale price rises roughly linearly. The partial dependence plot shows the average predicted sale price as Gr_Liv_Area increases, holding all other predictors constant

Workflows

Sometimes in data, you will have missing data, or columns such as property_id identifying the property, which is not a predictive feature. With 74 variables in the ames dataset, we also need to select only the relevant predictors. The plan is to select relevant columns, turn MS_Zoning into a properly labelled factor, drop missing values, and create a recipe that creates dummies for MS_Zoning, since that is what the regression needs, and ensures that property_id will not be used as a predictor but remains available to flag instances.

Preparation

head(ames)

# A tibble: 6 × 74
  MS_SubClass             MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
  <fct>                   <fct>            <dbl>    <int> <fct>  <fct> <fct>    
1 One_Story_1946_and_New… Resident…          141    31770 Pave   No_A… Slightly…
2 One_Story_1946_and_New… Resident…           80    11622 Pave   No_A… Regular  
3 One_Story_1946_and_New… Resident…           81    14267 Pave   No_A… Slightly…
4 One_Story_1946_and_New… Resident…           93    11160 Pave   No_A… Regular  
5 Two_Story_1946_and_New… Resident…           74    13830 Pave   No_A… Slightly…
6 Two_Story_1946_and_New… Resident…           78     9978 Pave   No_A… Slightly…
# ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
#   Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, …

work_df <- ames %>%
  mutate(
    property_id = row_number(),
    MS_Zoning = factor(MS_Zoning,
                       labels = c("Agriculture", "Commercial",
                                  "Floating Village Residential", "Industrial",
                                  "Residential High Density", "Residential Low Density",
                                  "Residential Medium Density"))
  ) %>%
  select(Sale_Price, property_id, MS_Zoning, Gr_Liv_Area, Year_Built,
         Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd) %>%
  na.omit()

Split sample

set.seed(1560)
ames_split <- initial_split(work_df, prop = 0.6)
training_set <- training(ames_split)
test_set <- testing(ames_split)

Recipe

ames_recipe <-
  recipe(Sale_Price ~ ., data = training_set) %>%
  update_role(property_id, new_role = "id") %>%
  step_dummy(MS_Zoning)
tidy(ames_recipe)

# A tibble: 1 × 6
  number operation type  trained skip  id         
   <int> <chr>     <chr> <lgl>   <lgl> <chr>      
1      1 step      dummy FALSE   FALSE dummy_ppepF

How does recipe process data?

The recipe call screens the data for types (numeric, factors, strings and Boolean), as well as roles (e.g. outcome or predictor). Any estimation is performed on the data specified in the recipe. Estimates from the training set will then be used to inform operations in the test set, and nothing new will be estimated in the test set. Workflows help you make efficient machine learning projects. Using workflows means recognising that modeling is not just about fitting a model, but also about various pre- and post-processing steps.

Workflow example

ames_model <-
  linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
ames_flow <-
  workflow() %>%
  add_model(ames_model) %>%
  add_recipe(ames_recipe)
ames_fit <- fit(ames_flow, training_set)
tidy(ames_fit)

# A tibble: 13 × 5
   term                                   estimate std.error statistic   p.value
   <chr>                                     <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                            -1.24e+6 80938.      -15.4   4.19e- 50
 2 Gr_Liv_Area                             8.83e+1     3.49     25.3   2.63e-120
 3 Year_Built                              6.29e+2    41.0      15.3   6.29e- 50
 4 Lot_Area                                7.28e-1     0.149     4.88  1.13e-  6
 5 Total_Bsmt_SF                           5.86e+1     2.66     22.0   5.97e- 95
 6 Garage_Cars                             1.49e+4  1644.        9.09  2.70e- 19
 7 TotRms_AbvGrd                          -5.64e+3  1021.       -5.52  3.87e-  8
 8 MS_Zoning_Commercial                   -1.29e+4 11408.       -1.13  2.59e-  1
 9 MS_Zoning_Floating.Village.Residential -6.91e+3  4314.       -1.60  1.10e-  1
10 MS_Zoning_Industrial                   -4.37e+3  5124.       -0.853 3.94e-  1
11 MS_Zoning_Residential.High.Density     -5.37e+4 38451.       -1.40  1.63e-  1
12 MS_Zoning_Residential.Low.Density      -9.31e+3 11019.       -0.845 3.98e-  1
13 MS_Zoning_Residential.Medium.Density   -2.36e+4 27832.       -0.848 3.96e-  1