Loading everything up front. Main workhorse here is
recipes which lets you chain preprocessing steps together
without worrying about train/test leakage.
# Helper packages
library(dplyr) # for data manipulation
library(ggplot2) # for awesome graphics
# Feature engineering packages
library(caret) # for various ML tasks
library(recipes) # for feature engineering steps
library(modeldata) # for datasets (attrition, etc.)Two datasets get used across the chapter. The Ames housing data is the main one — it has property sale records from Ames, Iowa with 80 variables. A 70/30 stratified split is done on the response so price ranges are equally represented in both sets.
# Ames housing data
ames <- AmesHousing::make_ames()
# Job attrition data
attrition <- modeldata::attrition
# Create a train/test split of Ames data
set.seed(42)
split <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)
ames_test <- rsample::testing(split)Before touching any predictors it’s worth asking whether
Sale_Price itself needs fixing. The issue with raw house
prices is that they tend to bunch up at lower values with a long tail
stretching right — a handful of expensive properties can distort model
training.
# Skewed target
p1 <- ggplot(ames_train, aes(x = Sale_Price)) +
geom_histogram(bins = 45, fill = "#2c7bb6", colour = "white") +
scale_x_continuous(labels = scales::dollar) +
theme_minimal() +
labs(title = "Original Sale Price", x = "Sale Price", y = "Count")
# Log-transformed target
p2 <- ggplot(ames_train, aes(x = Sale_Price)) +
geom_histogram(bins = 45, fill = "#2c7bb6", colour = "white") +
scale_x_log10(labels = scales::dollar) +
theme_minimal() +
labs(title = "Log-Transformed Sale Price", x = "Sale Price (log scale)", y = "Count")
gridExtra::grid.arrange(p1, p2, nrow = 1)Distribution of Sale_Price before and after log transformation.
Log scale makes it much more bell-shaped. That’s what we want — residuals will be more evenly spread and predictions more stable.
step_log() handles this inside the pipeline. Using base
10 here so predictions can be back-transformed with a simple
10^x.
If you’re not sure log is the right choice, Box-Cox finds the best power transformation automatically by estimating lambda from the training data.
ames_recipe_bc <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_BoxCox(Sale_Price)
ames_recipe_bcNote: whichever transformation is applied, predictions need to be back-transformed at evaluation time — otherwise RMSE and MAE will be on the wrong scale.
Real datasets have gaps. The question is what to do about them. Simply dropping any row with an NA is easy but loses information and can bias results if missingness isn’t random.
make_ames() is already cleaned, so we go back to
ames_raw which still has the original NAs.
A raster plot of the logical NA matrix is one of the quickest ways to spot patterns — are missing values scattered randomly, or do they cluster in specific columns/rows?
ames_raw %>%
is.na() %>%
reshape2::melt() %>%
ggplot(aes(Var2, Var1, fill = value)) +
geom_raster() +
scale_fill_manual(name = "Status",
values = c("FALSE" = "#d9d9d9", "TRUE" = "#d7191c"),
labels = c("Present", "Missing")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Variables", y = "Observations",
title = "Missing value heatmap - Raw Ames Data")Heatmap of missing values in the raw Ames data. Each column is a variable; dark cells indicate a missing value.
It also helps to just count them per column and rank:
ames_raw %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
tidyr::pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") %>%
filter(n_missing > 0) %>%
arrange(desc(n_missing)) %>%
ggplot(aes(x = reorder(variable, n_missing), y = n_missing)) +
geom_col(fill = "#d7191c") +
coord_flip() +
theme_minimal() +
labs(x = NULL, y = "Number of missing values",
title = "Missing values by variable")Number of missing values per variable.
Three options are shown below — pick based on how much compute time you can afford and how much accuracy matters for that variable.
Fastest option. Just plugs in the median value from the training set. Works fine when missingness is low and the variable isn’t doing heavy lifting in the model.
Finds the k most similar rows (by other features) and uses their average. More work to compute but respects relationships between variables better than a flat median.
ames_recipe_knn <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_impute_knn(all_predictors(), neighbors = 5)
ames_recipe_knnk = 5 is a reasonable default. Somewhere in the 5–10 range is usually fine.
Trains a separate bagged tree for each variable that has NAs, using all other features as predictors. Most accurate but slowest — good when you have time and imputation quality matters.
ames_recipe_bag <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_impute_bag(all_predictors())
ames_recipe_bagSome variables contribute almost nothing — either because they barely change across observations (near-zero variance) or because they’re near-duplicates of other columns (high correlation). Keeping them wastes memory, slows down training, and can confuse certain algorithms.
step_nzv() removes columns where one value dominates
(e.g. 95%+ of rows are the same). Here’s the dimension of the dataset
after it runs:
recipe(Sale_Price ~ ., data = ames_train) %>%
step_nzv(all_predictors()) %>%
prep(training = ames_train, retain = TRUE) %>%
juice() %>%
dim()## [1] 2049 60
To actually see which columns got flagged:
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
tibble::rownames_to_column("variable") %>%
filter(nzv == TRUE) %>%
arrange(desc(freqRatio))## variable freqRatio percentUnique zeroVar nzv
## 1 Pool_Area 2041.00000 0.43923865 FALSE TRUE
## 2 Utilities 1023.00000 0.14641288 FALSE TRUE
## 3 Three_season_porch 1013.00000 0.97608590 FALSE TRUE
## 4 Pool_QC 680.33333 0.24402147 FALSE TRUE
## 5 Low_Qual_Fin_SF 674.33333 1.22010737 FALSE TRUE
## 6 Condition_2 406.20000 0.39043436 FALSE TRUE
## 7 BsmtFin_SF_2 363.00000 9.37042460 FALSE TRUE
## 8 Street 340.50000 0.09760859 FALSE TRUE
## 9 Screen_Porch 233.50000 4.68521230 FALSE TRUE
## 10 Roof_Matl 144.42857 0.29282577 FALSE TRUE
## 11 Misc_Val 141.42857 1.51293314 FALSE TRUE
## 12 Heating 100.85000 0.24402147 FALSE TRUE
## 13 Enclosed_Porch 96.00000 7.51586140 FALSE TRUE
## 14 Functional 38.95918 0.39043436 FALSE TRUE
## 15 Misc_Feature 31.39683 0.29282577 FALSE TRUE
## 16 BsmtFin_Type_2 24.13699 0.34163006 FALSE TRUE
## 17 Kitchen_AbvGr 22.77907 0.19521718 FALSE TRUE
## 18 Alley 22.11628 0.14641288 FALSE TRUE
## 19 Bsmt_Cond 21.62353 0.29282577 FALSE TRUE
## 20 Land_Slope 21.14130 0.14641288 FALSE TRUE
## 21 Land_Contour 20.85227 0.19521718 FALSE TRUE
Pairs of features with r > 0.85 are likely measuring the same
thing. step_corr() keeps one and drops the other:
recipe(Sale_Price ~ ., data = ames_train) %>%
step_nzv(all_predictors()) %>%
step_corr(all_numeric_predictors(), threshold = 0.85) %>%
prep() %>%
juice() %>%
dim()## [1] 2049 59
Even after filtering, raw numeric variables can cause problems. Two issues come up constantly: skewness and incompatible scales.
Algorithms that rely on distance (KNN, SVM) or assume linearity (regression) are sensitive to heavy-tailed predictors. A power transformation brings them closer to symmetric.
Raw distributions of three area-based predictors:
ames_train %>%
select(Lot_Area, Gr_Liv_Area, First_Flr_SF) %>%
tidyr::pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
ggplot(aes(value)) +
geom_histogram(bins = 40, fill = "#2c7bb6", colour = "white") +
facet_wrap(~feature, scales = "free") +
theme_minimal() +
labs(title = "Before transformation", x = NULL, y = "Count")Distribution of several right-skewed numeric features.
Yeo-Johnson is used rather than Box-Cox because it works even when values include zeros or negatives. After transformation:
ames_recipe_yj <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_nzv(all_predictors()) %>%
step_YeoJohnson(all_numeric_predictors())
# Preview the transformation on Gr_Liv_Area
prep_yj <- prep(ames_recipe_yj, training = ames_train, retain = TRUE)
juice(prep_yj) %>%
select(Lot_Area, Gr_Liv_Area, First_Flr_SF) %>%
tidyr::pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
ggplot(aes(value)) +
geom_histogram(bins = 40, fill = "#1a9641", colour = "white") +
facet_wrap(~feature, scales = "free") +
theme_minimal() +
labs(title = "After Yeo-Johnson transformation", x = NULL, y = "Count")Much better — the long right tails are gone.
A variable in square feet will swamp a binary 0/1 flag in any distance calculation. Normalising puts everything on the same footing: mean 0, SD 1.
ames_recipe_std <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_nzv(all_predictors()) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors()) # center + scale
ames_recipe_stdQuick sanity check — means should be ~0 and SDs ~1:
prep_std <- prep(ames_recipe_std, training = ames_train, retain = TRUE)
juice(prep_std) %>%
select(Gr_Liv_Area, Lot_Area, First_Flr_SF) %>%
summarise(across(everything(), list(mean = mean, sd = sd)))## # A tibble: 1 × 6
## Gr_Liv_Area_mean Gr_Liv_Area_sd Lot_Area_mean Lot_Area_sd First_Flr_SF_mean
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.32e-15 1 5.69e-16 1 -2.21e-15
## # ℹ 1 more variable: First_Flr_SF_sd <dbl>
Important: all normalisation parameters (mean, SD,
lambda values, etc.) come from the training set only.
recipes handles this correctly — prep() fits
on train, bake() applies to new data without re-estimating
anything.
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Ulaanbaatar
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] modeldata_1.5.1 recipes_1.3.1 caret_7.0-1 lattice_0.22-7
## [5] ggplot2_4.0.2 dplyr_1.2.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.53 bslib_0.9.0
## [4] vctrs_0.7.1 tools_4.5.1 generics_0.1.4
## [7] stats4_4.5.1 parallel_4.5.1 tibble_3.3.1
## [10] AmesHousing_0.0.4 pkgconfig_2.0.3 ModelMetrics_1.2.2.2
## [13] Matrix_1.7-3 data.table_1.17.8 RColorBrewer_1.1-3
## [16] S7_0.2.0 lifecycle_1.0.5 compiler_4.5.1
## [19] farver_2.1.2 stringr_1.5.2 codetools_0.2-20
## [22] htmltools_0.5.8.1 class_7.3-23 sass_0.4.10
## [25] yaml_2.3.10 prodlim_2025.04.28 tidyr_1.3.1
## [28] furrr_0.3.1 pillar_1.11.0 jquerylib_0.1.4
## [31] MASS_7.3-65 cachem_1.1.0 gower_1.0.2
## [34] iterators_1.0.14 rpart_4.1.24 foreach_1.5.2
## [37] nlme_3.1-168 parallelly_1.45.1 lava_1.8.2
## [40] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7
## [43] future_1.67.0 reshape2_1.4.5 purrr_1.1.0
## [46] listenv_0.9.1 labeling_0.4.3 splines_4.5.1
## [49] fastmap_1.2.0 grid_4.5.1 cli_3.6.5
## [52] magrittr_2.0.3 survival_3.8-3 future.apply_1.20.0
## [55] withr_3.0.2 scales_1.4.0 lubridate_1.9.4
## [58] timechange_0.3.0 rmarkdown_2.29 globals_0.18.0
## [61] nnet_7.3-20 gridExtra_2.3 timeDate_4051.111
## [64] evaluate_1.0.5 knitr_1.50 hardhat_1.4.2
## [67] rsample_1.3.2 rlang_1.1.7 Rcpp_1.1.0
## [70] glue_1.8.0 pROC_1.19.0.1 ipred_0.9-15
## [73] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1
## [76] plyr_1.8.9