#install.packages("DALEXtra")
#install.packages("ingredients")
#install.packages("modeldata")
#install.packages("tidyverse")
#install.packages("tidymodels")
#install.packages("rio")
#install.packages("doParallel")
#install.packages("glmnet")
#install.packages("rpart.plot")Introduction to Machine Learning for the Social Sciences - Code Tutorial, Session 3
Code based on: Steenbergen, M. (2025). Introduction to Machine Learning. Course in 29th Summer School in Social Sciences Methods, Università della Svizzera italiana.
Data:
Ames Housing dataset (De Cock 2011). Assessor’s office data for residential properties sold in Ames, Iowa. Label: “What was the sale price of this residential property?”
Features:
- Above ground living area in square feet.
- Year the property was originally constructed.
- Lot size in square feet.
- Total square feet of basement area.
- Size of garage in car capacity.
- Total rooms above ground
Installations (remove comment notation if needed):
Load packages
library(rio)
library(modeldata)
library(tidyverse)
library(DALEXtra)
library(tidymodels)
library(ingredients)
library(doParallel)
library(glmnet)
library(rpart.plot)Splitting the Sample Using rsample
data(ames)
ames_clean <- ames |>
select(Sale_Price, Gr_Liv_Area, Year_Built,
Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd) |>
na.omit()Setup
tidymodels_prefer()
set.seed(2922)
ames_split <- initial_split(ames_clean, prop = 0.6, strata = Sale_Price)
ames_split<Training/Testing/Total>
<1756/1174/2930>
initial_split() from rsample splits your data into training and test sets. When breaking down each argument from the code:
- ames_clean — the dataset being split
- prop = 0.6 — 60% of the data goes to the training set, 40% to the test set
- strata = Sale_Price — ensures the distribution of Sale_Price is similar in both sets by sampling proportionally within bins of the outcome, rather than splitting purely at random.
The last point is why the KS test returned p = 0.99. Stratification enforces distributional similarity, so the result is expected.
Execution
training_set <- training(ames_split)
test_set <- testing(ames_split)Checking
ks.test(training_set$Sale_Price, test_set$Sale_Price)Warning in ks.test.default(training_set$Sale_Price, test_set$Sale_Price):
p-value will be approximate in the presence of ties
Asymptotic two-sample Kolmogorov-Smirnov test
data: training_set$Sale_Price and test_set$Sale_Price
D = 0.016455, p-value = 0.9912
alternative hypothesis: two-sided
Pre-processing Using recipes
For a regression problem, there really is often not any pre-processing needed. But: The outcome of Sale_Price and predictors such as Lot_Area and Gr_Liv_Area are, on the basis of theoretical expectation, likely right-skewed and would benefit from a log transformation.
Check distributions of all numeric variables
ames_clean |>
select(where(is.numeric)) |>
pivot_longer(everything()) |>
ggplot(aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~name, scales = "free") +
theme_bw()Skewness statistic (|skew| > 1 warrants transformation)
ames_clean |>
select(where(is.numeric)) |>
summarise(across(everything(), \(x) mean((x - mean(x))^3) / sd(x)^3))# A tibble: 1 × 7
Sale_Price Gr_Liv_Area Year_Built Lot_Area Total_Bsmt_SF Garage_Cars
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.74 1.27 -0.604 12.8 1.15 -0.221
# ℹ 1 more variable: TotRms_AbvGrd <dbl>
Apply log to right-skewed variables in training and test sets separately (transformation parameters must not be learned from the test set):
training_set <- training_set |>
mutate(across(c(Sale_Price, Lot_Area, Gr_Liv_Area), log))
test_set <- test_set |>
mutate(across(c(Sale_Price, Lot_Area, Gr_Liv_Area), log))Training Using parsnip
We specify the algorithm, any tuning parameters, and the mode (i.e., task). We set the external learning engine (= implementation of the algorithm) using set_engine. We declare the model formula and training set using fit. Algorithms and engines can be found at tidymodels.org/find/parsnip
Using lm
ames_fit <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm") %>%
fit(Sale_Price ~ ., data = training_set)
tidy(ames_fit)# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.72 0.405 -4.24 2.31e- 5
2 Gr_Liv_Area 0.634 0.0257 24.7 8.20e-116
3 Year_Built 0.00417 0.000189 22.1 1.68e- 95
4 Lot_Area 0.0793 0.00948 8.37 1.21e- 16
5 Total_Bsmt_SF 0.000195 0.0000124 15.8 2.27e- 52
6 Garage_Cars 0.0910 0.00791 11.5 1.50e- 29
7 TotRms_AbvGrd -0.0260 0.00486 -5.35 9.82e- 8
glance(ames_fit)# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.795 0.794 0.185 1128. 0 6 476. -936. -892.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Based on the t-statistic (the statistic column, which measures the strength of evidence net of uncertainty), Gr_Liv_Area is the strongest predictor with a t-statistic of 22.7 and a p-value of 6.48e-100.
However, you cannot directly compare the raw coefficients to rank predictors here. Each is on a different scale (square feet vs. years vs. number of cars), so the coefficient sizes are not comparable to each other.
Evolution Using yardstick
Prediction
ames_fit %>%
predict(test_set) %>%
bind_cols(test_set)# A tibble: 1,174 × 8
.pred Sale_Price Gr_Liv_Area Year_Built Lot_Area Total_Bsmt_SF Garage_Cars
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 12.2 12.3 7.41 1960 10.4 1080 2
2 11.6 11.6 6.80 1961 9.36 882 1
3 12.0 12.1 7.19 1958 9.57 1329 1
4 12.3 12.2 7.40 1997 9.53 928 2
5 12.2 12.2 7.38 1998 9.21 926 2
6 12.3 12.4 7.39 1995 8.59 1595 2
7 12.1 12.1 7.29 1998 9.04 789 2
8 12.3 12.3 7.31 1985 8.83 1488 2
9 12.6 12.9 7.53 2010 9.34 1856 3
10 12.4 12.3 7.64 1978 9.49 1542 2
# ℹ 1,164 more rows
# ℹ 1 more variable: TotRms_AbvGrd <int>
Metrics
ames_fit %>%
predict(test_set) %>%
bind_cols(test_set) %>%
metrics(truth = Sale_Price, estimate = .pred)# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.194
2 rsq standard 0.773
3 mae standard 0.131
Partial Dependence Plot
DALEX = Descriptive mAchine Learning EXplanations (Biecek 2018)
Create explainer:
ames_explainer <- explain_tidymodels(
model = ames_fit,
data = test_set,
y = test_set$Sale_Price
)Preparation of a new explainer is initiated
-> model label : model_fit ( default )
-> data : 1174 rows 7 cols
-> data : tibble converted into a data.frame
-> target variable : 1174 values
-> predict function : yhat.model_fit will be used ( default )
-> predicted values : No value for predict function target column. ( default )
-> model_info : package parsnip , ver. 1.3.2 , task regression ( default )
-> predicted values : numerical, min = 11.06472 , mean = 12.02728 , max = 13.78078
-> residual function : difference between y and yhat ( default )
-> residuals : numerical, min = -1.962382 , mean = -0.008625293 , max = 0.7334518
A new explainer has been created!
Plotter:
pdp_area <- model_profile(
ames_explainer,
variables = "Gr_Liv_Area",
N = NULL
)
as_tibble(pdp_area$agr_profiles) %>%
ggplot(aes(`_x_`, `_yhat_`)) +
geom_line(size = 1.2, alpha = 0.8, color = "#31688EFF") +
geom_rug(sides = "b", col = "#31688EFF") +
labs(
x = "Above Ground Living Area (sq ft)",
y = "Predicted Sale Price",
color = NULL
) +
theme_bw()Substantive interpretation:
The R² of 0.744 in the training set indicates the six predictors explain about 74% of the variance in sale prices. The RMSE in the test set of $40,860 tells you that on average the predictions are off by that amount, which is substantial given a mean price around $180,000 (approx. 23% error). The PDP for Gr_Liv_Area shows the expected positive relationship, as above-ground living area increases, predicted sale price rises roughly linearly. The partial dependence plot shows the average predicted sale price as Gr_Liv_Area increases, holding all other predictors constant
Workflows
Sometimes in data, you will have missing data, or columns such as property_id identifying the property, which is not a predictive feature. With 74 variables in the ames dataset, we also need to select only the relevant predictors. The plan is to select relevant columns, turn MS_Zoning into a properly labelled factor, drop missing values, and create a recipe that creates dummies for MS_Zoning, since that is what the regression needs, and ensures that property_id will not be used as a predictor but remains available to flag instances.
Preparation
head(ames)# A tibble: 6 × 74
MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
<fct> <fct> <dbl> <int> <fct> <fct> <fct>
1 One_Story_1946_and_New… Resident… 141 31770 Pave No_A… Slightly…
2 One_Story_1946_and_New… Resident… 80 11622 Pave No_A… Regular
3 One_Story_1946_and_New… Resident… 81 14267 Pave No_A… Slightly…
4 One_Story_1946_and_New… Resident… 93 11160 Pave No_A… Regular
5 Two_Story_1946_and_New… Resident… 74 13830 Pave No_A… Slightly…
6 Two_Story_1946_and_New… Resident… 78 9978 Pave No_A… Slightly…
# ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
# Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
# Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
# Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
# Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
# Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
# Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, …
work_df <- ames %>%
mutate(
property_id = row_number(),
MS_Zoning = factor(MS_Zoning,
labels = c("Agriculture", "Commercial",
"Floating Village Residential", "Industrial",
"Residential High Density", "Residential Low Density",
"Residential Medium Density"))
) %>%
select(Sale_Price, property_id, MS_Zoning, Gr_Liv_Area, Year_Built,
Lot_Area, Total_Bsmt_SF, Garage_Cars, TotRms_AbvGrd) %>%
na.omit()Split sample
set.seed(1560)
ames_split <- initial_split(work_df, prop = 0.6)
training_set <- training(ames_split)
test_set <- testing(ames_split)Recipe
ames_recipe <-
recipe(Sale_Price ~ ., data = training_set) %>%
update_role(property_id, new_role = "id") %>%
step_dummy(MS_Zoning)
tidy(ames_recipe)# A tibble: 1 × 6
number operation type trained skip id
<int> <chr> <chr> <lgl> <lgl> <chr>
1 1 step dummy FALSE FALSE dummy_ppepF
How does recipe process data?
The recipe call screens the data for types (numeric, factors, strings and Boolean), as well as roles (e.g. outcome or predictor). Any estimation is performed on the data specified in the recipe. Estimates from the training set will then be used to inform operations in the test set, and nothing new will be estimated in the test set. Workflows help you make efficient machine learning projects. Using workflows means recognising that modeling is not just about fitting a model, but also about various pre- and post-processing steps.
Workflow example
ames_model <-
linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
ames_flow <-
workflow() %>%
add_model(ames_model) %>%
add_recipe(ames_recipe)
ames_fit <- fit(ames_flow, training_set)
tidy(ames_fit)# A tibble: 13 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.24e+6 80938. -15.4 4.19e- 50
2 Gr_Liv_Area 8.83e+1 3.49 25.3 2.63e-120
3 Year_Built 6.29e+2 41.0 15.3 6.29e- 50
4 Lot_Area 7.28e-1 0.149 4.88 1.13e- 6
5 Total_Bsmt_SF 5.86e+1 2.66 22.0 5.97e- 95
6 Garage_Cars 1.49e+4 1644. 9.09 2.70e- 19
7 TotRms_AbvGrd -5.64e+3 1021. -5.52 3.87e- 8
8 MS_Zoning_Commercial -1.29e+4 11408. -1.13 2.59e- 1
9 MS_Zoning_Floating.Village.Residential -6.91e+3 4314. -1.60 1.10e- 1
10 MS_Zoning_Industrial -4.37e+3 5124. -0.853 3.94e- 1
11 MS_Zoning_Residential.High.Density -5.37e+4 38451. -1.40 1.63e- 1
12 MS_Zoning_Residential.Low.Density -9.31e+3 11019. -0.845 3.98e- 1
13 MS_Zoning_Residential.Medium.Density -2.36e+4 27832. -0.848 3.96e- 1