3.1 — Packages and Data Needed

Loading everything up front. Main workhorse here is recipes which lets you chain preprocessing steps together without worrying about train/test leakage.

# Helper packages
library(dplyr)     # for data manipulation
library(ggplot2)   # for awesome graphics

# Feature engineering packages
library(caret)     # for various ML tasks
library(recipes)   # for feature engineering steps
library(modeldata) # for datasets (attrition, etc.)

Two datasets get used across the chapter. The Ames housing data is the main one — it has property sale records from Ames, Iowa with 80 variables. A 70/30 stratified split is done on the response so price ranges are equally represented in both sets.

# Ames housing data
ames <- AmesHousing::make_ames()

# Job attrition data
attrition <- modeldata::attrition

# Create a train/test split of Ames data
set.seed(42)
split  <- rsample::initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- rsample::training(split)
ames_test  <- rsample::testing(split)

3.2 — Should We Transform the Target?

Before touching any predictors it’s worth asking whether Sale_Price itself needs fixing. The issue with raw house prices is that they tend to bunch up at lower values with a long tail stretching right — a handful of expensive properties can distort model training.

Checking the shape of Sale_Price

# Skewed target
p1 <- ggplot(ames_train, aes(x = Sale_Price)) +
  geom_histogram(bins = 45, fill = "#2c7bb6", colour = "white") +
  scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  labs(title = "Original Sale Price", x = "Sale Price", y = "Count")

# Log-transformed target
p2 <- ggplot(ames_train, aes(x = Sale_Price)) +
  geom_histogram(bins = 45, fill = "#2c7bb6", colour = "white") +
  scale_x_log10(labels = scales::dollar) +
  theme_minimal() +
  labs(title = "Log-Transformed Sale Price", x = "Sale Price (log scale)", y = "Count")

gridExtra::grid.arrange(p1, p2, nrow = 1)
Distribution of Sale_Price before and after log transformation.

Distribution of Sale_Price before and after log transformation.

Log scale makes it much more bell-shaped. That’s what we want — residuals will be more evenly spread and predictions more stable.

Adding log transform to a recipe

step_log() handles this inside the pipeline. Using base 10 here so predictions can be back-transformed with a simple 10^x.

# Option 1: log transform the response directly
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(Sale_Price, base = 10)

ames_recipe

Alternative: Box-Cox

If you’re not sure log is the right choice, Box-Cox finds the best power transformation automatically by estimating lambda from the training data.

ames_recipe_bc <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(Sale_Price)

ames_recipe_bc

Note: whichever transformation is applied, predictions need to be back-transformed at evaluation time — otherwise RMSE and MAE will be on the wrong scale.


3.3 — Dealing with Missing Data

Real datasets have gaps. The question is what to do about them. Simply dropping any row with an NA is easy but loses information and can bias results if missingness isn’t random.

Step 1: See where the missingness is

make_ames() is already cleaned, so we go back to ames_raw which still has the original NAs.

# Use the raw Ames data which contains missing values
ames_raw <- AmesHousing::ames_raw

A raster plot of the logical NA matrix is one of the quickest ways to spot patterns — are missing values scattered randomly, or do they cluster in specific columns/rows?

ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
    geom_raster() +
    scale_fill_manual(name = "Status",
                      values = c("FALSE" = "#d9d9d9", "TRUE" = "#d7191c"),
                      labels = c("Present", "Missing")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Variables", y = "Observations",
         title = "Missing value heatmap - Raw Ames Data")
Heatmap of missing values in the raw Ames data. Each column is a variable; dark cells indicate a missing value.

Heatmap of missing values in the raw Ames data. Each column is a variable; dark cells indicate a missing value.

It also helps to just count them per column and rank:

ames_raw %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  tidyr::pivot_longer(everything(),
                      names_to  = "variable",
                      values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  arrange(desc(n_missing)) %>%
  ggplot(aes(x = reorder(variable, n_missing), y = n_missing)) +
    geom_col(fill = "#d7191c") +
    coord_flip() +
    theme_minimal() +
    labs(x = NULL, y = "Number of missing values",
         title = "Missing values by variable")
Number of missing values per variable.

Number of missing values per variable.

Step 2: Pick an imputation method

Three options are shown below — pick based on how much compute time you can afford and how much accuracy matters for that variable.

Option A: Median imputation

Fastest option. Just plugs in the median value from the training set. Works fine when missingness is low and the variable isn’t doing heavy lifting in the model.

ames_recipe_impute <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_impute_median(Gr_Liv_Area)   # median imputation for a numeric predictor

ames_recipe_impute

Option B: KNN imputation

Finds the k most similar rows (by other features) and uses their average. More work to compute but respects relationships between variables better than a flat median.

ames_recipe_knn <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_impute_knn(all_predictors(), neighbors = 5)

ames_recipe_knn

k = 5 is a reasonable default. Somewhere in the 5–10 range is usually fine.

Option C: Bagged tree imputation

Trains a separate bagged tree for each variable that has NAs, using all other features as predictors. Most accurate but slowest — good when you have time and imputation quality matters.

ames_recipe_bag <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_impute_bag(all_predictors())

ames_recipe_bag

3.4 — Cutting Out Useless Features

Some variables contribute almost nothing — either because they barely change across observations (near-zero variance) or because they’re near-duplicates of other columns (high correlation). Keeping them wastes memory, slows down training, and can confuse certain algorithms.

Near-zero variance filter

step_nzv() removes columns where one value dominates (e.g. 95%+ of rows are the same). Here’s the dimension of the dataset after it runs:

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_predictors()) %>%
  prep(training = ames_train, retain = TRUE) %>%
  juice() %>%
  dim()
## [1] 2049   60

To actually see which columns got flagged:

caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column("variable") %>%
  filter(nzv == TRUE) %>%
  arrange(desc(freqRatio))
##              variable  freqRatio percentUnique zeroVar  nzv
## 1           Pool_Area 2041.00000    0.43923865   FALSE TRUE
## 2           Utilities 1023.00000    0.14641288   FALSE TRUE
## 3  Three_season_porch 1013.00000    0.97608590   FALSE TRUE
## 4             Pool_QC  680.33333    0.24402147   FALSE TRUE
## 5     Low_Qual_Fin_SF  674.33333    1.22010737   FALSE TRUE
## 6         Condition_2  406.20000    0.39043436   FALSE TRUE
## 7        BsmtFin_SF_2  363.00000    9.37042460   FALSE TRUE
## 8              Street  340.50000    0.09760859   FALSE TRUE
## 9        Screen_Porch  233.50000    4.68521230   FALSE TRUE
## 10          Roof_Matl  144.42857    0.29282577   FALSE TRUE
## 11           Misc_Val  141.42857    1.51293314   FALSE TRUE
## 12            Heating  100.85000    0.24402147   FALSE TRUE
## 13     Enclosed_Porch   96.00000    7.51586140   FALSE TRUE
## 14         Functional   38.95918    0.39043436   FALSE TRUE
## 15       Misc_Feature   31.39683    0.29282577   FALSE TRUE
## 16     BsmtFin_Type_2   24.13699    0.34163006   FALSE TRUE
## 17      Kitchen_AbvGr   22.77907    0.19521718   FALSE TRUE
## 18              Alley   22.11628    0.14641288   FALSE TRUE
## 19          Bsmt_Cond   21.62353    0.29282577   FALSE TRUE
## 20         Land_Slope   21.14130    0.14641288   FALSE TRUE
## 21       Land_Contour   20.85227    0.19521718   FALSE TRUE

High-correlation filter

Pairs of features with r > 0.85 are likely measuring the same thing. step_corr() keeps one and drops the other:

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.85) %>%
  prep() %>%
  juice() %>%
  dim()
## [1] 2049   59

3.5 — Fixing the Shape and Scale of Numeric Predictors

Even after filtering, raw numeric variables can cause problems. Two issues come up constantly: skewness and incompatible scales.

3.5.1 Skewness

Algorithms that rely on distance (KNN, SVM) or assume linearity (regression) are sensitive to heavy-tailed predictors. A power transformation brings them closer to symmetric.

Raw distributions of three area-based predictors:

ames_train %>%
  select(Lot_Area, Gr_Liv_Area, First_Flr_SF) %>%
  tidyr::pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(value)) +
    geom_histogram(bins = 40, fill = "#2c7bb6", colour = "white") +
    facet_wrap(~feature, scales = "free") +
    theme_minimal() +
    labs(title = "Before transformation", x = NULL, y = "Count")
Distribution of several right-skewed numeric features.

Distribution of several right-skewed numeric features.

Yeo-Johnson is used rather than Box-Cox because it works even when values include zeros or negatives. After transformation:

ames_recipe_yj <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_predictors()) %>%
  step_YeoJohnson(all_numeric_predictors())

# Preview the transformation on Gr_Liv_Area
prep_yj <- prep(ames_recipe_yj, training = ames_train, retain = TRUE)

juice(prep_yj) %>%
  select(Lot_Area, Gr_Liv_Area, First_Flr_SF) %>%
  tidyr::pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(value)) +
    geom_histogram(bins = 40, fill = "#1a9641", colour = "white") +
    facet_wrap(~feature, scales = "free") +
    theme_minimal() +
    labs(title = "After Yeo-Johnson transformation", x = NULL, y = "Count")

Much better — the long right tails are gone.

3.5.2 Standardisation

A variable in square feet will swamp a binary 0/1 flag in any distance calculation. Normalising puts everything on the same footing: mean 0, SD 1.

ames_recipe_std <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_predictors()) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())  # center + scale

ames_recipe_std

Quick sanity check — means should be ~0 and SDs ~1:

prep_std <- prep(ames_recipe_std, training = ames_train, retain = TRUE)

juice(prep_std) %>%
  select(Gr_Liv_Area, Lot_Area, First_Flr_SF) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))
## # A tibble: 1 × 6
##   Gr_Liv_Area_mean Gr_Liv_Area_sd Lot_Area_mean Lot_Area_sd First_Flr_SF_mean
##              <dbl>          <dbl>         <dbl>       <dbl>             <dbl>
## 1         1.32e-15              1      5.69e-16           1         -2.21e-15
## # ℹ 1 more variable: First_Flr_SF_sd <dbl>

Important: all normalisation parameters (mean, SD, lambda values, etc.) come from the training set only. recipes handles this correctly — prep() fits on train, bake() applies to new data without re-estimating anything.


Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Ulaanbaatar
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] modeldata_1.5.1 recipes_1.3.1   caret_7.0-1     lattice_0.22-7 
## [5] ggplot2_4.0.2   dplyr_1.2.0    
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         xfun_0.53            bslib_0.9.0         
##  [4] vctrs_0.7.1          tools_4.5.1          generics_0.1.4      
##  [7] stats4_4.5.1         parallel_4.5.1       tibble_3.3.1        
## [10] AmesHousing_0.0.4    pkgconfig_2.0.3      ModelMetrics_1.2.2.2
## [13] Matrix_1.7-3         data.table_1.17.8    RColorBrewer_1.1-3  
## [16] S7_0.2.0             lifecycle_1.0.5      compiler_4.5.1      
## [19] farver_2.1.2         stringr_1.5.2        codetools_0.2-20    
## [22] htmltools_0.5.8.1    class_7.3-23         sass_0.4.10         
## [25] yaml_2.3.10          prodlim_2025.04.28   tidyr_1.3.1         
## [28] furrr_0.3.1          pillar_1.11.0        jquerylib_0.1.4     
## [31] MASS_7.3-65          cachem_1.1.0         gower_1.0.2         
## [34] iterators_1.0.14     rpart_4.1.24         foreach_1.5.2       
## [37] nlme_3.1-168         parallelly_1.45.1    lava_1.8.2          
## [40] tidyselect_1.2.1     digest_0.6.37        stringi_1.8.7       
## [43] future_1.67.0        reshape2_1.4.5       purrr_1.1.0         
## [46] listenv_0.9.1        labeling_0.4.3       splines_4.5.1       
## [49] fastmap_1.2.0        grid_4.5.1           cli_3.6.5           
## [52] magrittr_2.0.3       survival_3.8-3       future.apply_1.20.0 
## [55] withr_3.0.2          scales_1.4.0         lubridate_1.9.4     
## [58] timechange_0.3.0     rmarkdown_2.29       globals_0.18.0      
## [61] nnet_7.3-20          gridExtra_2.3        timeDate_4051.111   
## [64] evaluate_1.0.5       knitr_1.50           hardhat_1.4.2       
## [67] rsample_1.3.2        rlang_1.1.7          Rcpp_1.1.0          
## [70] glue_1.8.0           pROC_1.19.0.1        ipred_0.9-15        
## [73] rstudioapi_0.17.1    jsonlite_2.0.0       R6_2.6.1            
## [76] plyr_1.8.9