1 Overview

This report reproduces the outputs of Sections 3.1 – 3.5 of Chapter 3 (Feature & Target Engineering) of the book Hands-On Machine Learning with R by Bradley Boehmke and Brandon Greenwell. These sections cover several important preprocessing techniques used in machine learning:

• Loading and preparing required libraries
• Target variable transformations
• Handling missing values
• Feature filtering techniques
• Numeric feature engineering methods

The dataset used in this chapter is the Ames Housing dataset, which contains housing information used to predict house prices.

All analyses are performed in R using several packages including:

• dplyr
• ggplot2
• visdat
• caret
• recipes
• AmesHousing

The goal of this report is to demonstrate how feature engineering techniques can improve data quality and modeling performance.

2 Prerequisites

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(visdat)
## Warning: package 'visdat' was built under R version 4.5.2
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: lattice
library(recipes)
## Warning: package 'recipes' was built under R version 4.5.2
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
## 
##     step
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.5.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.5.2
library(forecast)
## Warning: package 'forecast' was built under R version 4.5.2

2.1 Load Dataset

ames <- make_ames()

ames_train <- ames

3 Target Engineering

Target engineering refers to transforming the response variable to improve model performance.

Many parametric models, such as linear regression, assume normally distributed errors, and a strongly skewed response often violates that assumption. Real-world responses such as sale prices are frequently right-skewed.

In such cases, applying transformations such as log transformation or Box-Cox transformation can help normalize the distribution and improve model predictions.
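As a rough illustration of why such transformations help, the sketch below simulates a right-skewed response and compares its skewness before and after taking logs. The simulated data and the `skewness()` helper are assumptions made for this example; they are not part of the book's code or the Ames data.

```r
# Simulate a hypothetical right-skewed response (not the Ames data)
set.seed(123)
y <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness; 0 indicates a symmetric distribution
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(y)       # clearly positive: long right tail
skew_log <- skewness(log(y))  # close to 0 after the log transformation
```

A skewness near zero after the transformation suggests the log scale is a more symmetric target for modeling.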

3.1 Log Transformation

transformed_response <- log(ames_train$Sale_Price)

summary(transformed_response)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.456  11.771  11.983  12.021  12.271  13.534

3.2 Using Recipes for Log Transformation

ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())

ames_recipe
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
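
Printing a recipe only lists the planned steps; nothing is computed until the recipe is prepared and baked. The following minimal sketch shows that workflow on a small hypothetical data frame (`toy` and its column names are illustrative, not from the report).

```r
library(recipes)

# Hypothetical toy data; column names are illustrative only
toy <- data.frame(y = c(100, 1000, 10000), x = c(1, 2, 3))

# Define the recipe: log-transform the outcome
rec <- recipe(y ~ x, data = toy)
rec <- step_log(rec, all_outcomes())

# prep() estimates the steps on the training data; bake() applies them
prepped <- prep(rec, training = toy)
baked <- bake(prepped, new_data = NULL)  # NULL returns the processed training set

# Values on the log scale can be mapped back with exp()
back_transformed <- exp(baked$y)
```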

3.3 Log Transformation Example

log(-0.5)
## Warning in log(-0.5): NaNs produced
## [1] NaN
log1p(-0.5)
## [1] -0.6931472
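
The difference arises because `log()` is undefined for negative inputs, whereas `log1p(x)` computes log(1 + x) and is therefore defined for any x greater than -1. The sketch below shows this and, for values at or below -1, one possible workaround of adding a constant offset first. The example vectors and the offset rule are assumptions chosen for illustration.

```r
# log1p(x) is log(1 + x), so it is defined for any x > -1
x <- c(-0.5, 0, 0.5, 10)
shifted <- log1p(x)

# expm1() inverts log1p()
recovered <- expm1(shifted)

# For values at or below -1, shift by a constant first (illustrative choice)
z <- c(-3, -1, 0, 2)
offset <- 1 - min(z)            # makes the smallest shifted value equal to 1
z_log <- log(z + offset)
z_back <- exp(z_log) - offset   # undo the log, then undo the shift
```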

3.4 Box-Cox Transformation

y <- forecast::BoxCox(10, lambda = 0.5)
y
## [1] 4.324555
## attr(,"lambda")
## [1] 0.5
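
This value agrees with the usual Box-Cox definition: (y^lambda - 1) / lambda when lambda is not zero, and log(y) when lambda is zero. The helper `box_cox()` below is a hypothetical hand-written sketch of that definition, not the implementation used by the forecast package.

```r
# Hypothetical helper implementing the textbook Box-Cox definition
box_cox <- function(y, lambda) {
  if (lambda == 0) {
    log(y)
  } else {
    (y^lambda - 1) / lambda
  }
}

box_cox(10, lambda = 0.5)  # (sqrt(10) - 1) / 0.5
box_cox(10, lambda = 0)    # reduces to log(10)
box_cox(10, lambda = 1)    # a shift only: 10 - 1
```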

3.5 Inverse Box-Cox Transformation

inv_box_cox <- function(x, lambda) {
  
  if (lambda == 0) exp(x)
  else (lambda * x + 1)^(1/lambda)
  
}

inv_box_cox(y, 0.5)
## [1] 10
## attr(,"lambda")
## [1] 0.5

4 Dealing with Missingness

Missing values are a common issue in real-world datasets.

Before applying machine learning models, it is important to understand:

• how many missing values exist
• where they occur
• how they should be handled

There are several strategies for handling missing data, including:

• deletion
• imputation
• model-based estimation

4.1 Checking Missing Values

The raw Ames dataset contains missing values.

ames_raw <- AmesHousing::ames_raw

sum(is.na(ames_raw))
## [1] 13997
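
A single total does not show where the missing values occur. One way to break the count down by variable is `colSums(is.na(...))`, sketched here on a small hypothetical data frame (`df` is illustrative; the same calls should apply to `ames_raw`).

```r
# Hypothetical toy data frame with missing values
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

total_missing      <- sum(is.na(df))       # overall count
missing_per_column <- colSums(is.na(df))   # count per variable
missing_share      <- colMeans(is.na(df))  # proportion per variable

# Variables ordered from most to least missing
most_missing <- sort(missing_per_column, decreasing = TRUE)
```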

5 Visualizing Missing Values

Visualizing missing data helps identify patterns and understand which variables contain the most missing values.

ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill=value)) +
  geom_raster() +
  coord_flip() +
  scale_fill_grey(name = "",
                  labels = c("Present", "Missing")) +
  labs(x = "Variable", y = "Observation") +
  theme(axis.text.y = element_text(size = 4))

5.1 Using visdat Package

The visdat package provides a convenient way to visualize missing data.

vis_miss(ames_raw, cluster = TRUE)

6 Imputation

Imputation replaces missing values with estimated values.

Common imputation strategies include:

• Mean imputation
• Median imputation
• Mode imputation
• Model-based imputation
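
The simplest of these strategies can be sketched in base R. The vectors below are hypothetical, and `stat_mode()` is a helper written for this example, since base R has no built-in function that returns the most frequent value.

```r
# Hypothetical toy vectors with missing values
num <- c(1, 2, NA, 4, 100)
grp <- c("a", "b", "a", NA, "a")

# Helper for the most frequent value (not a base R function)
stat_mode <- function(x) {
  counts <- table(x)  # table() drops NA by default
  names(counts)[which.max(counts)]
}

mean_imputed   <- ifelse(is.na(num), mean(num, na.rm = TRUE), num)
median_imputed <- ifelse(is.na(num), median(num, na.rm = TRUE), num)
mode_imputed   <- ifelse(is.na(grp), stat_mode(grp), grp)
```

In this toy example the outlier 100 pulls the mean to 26.75 while the median stays at 3, which is one reason the median is often preferred for skewed features.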

6.1 Median Imputation

ames_recipe %>%
  step_impute_median(Gr_Liv_Area)
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
## • Median imputation for: Gr_Liv_Area
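
The imputation value is learned from the training data when the recipe is prepared and is then reused for any new data. The following sketch illustrates this with two small hypothetical data frames (`train` and `new` are illustrative names, not objects from this report).

```r
library(recipes)

# Hypothetical training and new data; x contains missing values
train <- data.frame(y = c(10, 20, 30, 40, 50), x = c(1, 2, NA, 4, 100))
new   <- data.frame(y = c(60, 70),             x = c(NA, 5))

rec <- recipe(y ~ x, data = train)
rec <- step_impute_median(rec, x)

# prep() learns the median of x from the training data (here 3)
prepped <- prep(rec, training = train)

# bake() fills missing x values in new data with the training median
baked_new <- bake(prepped, new_data = new)
```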

6.2 Mode Imputation

ames_recipe %>%
  step_impute_mode(all_nominal())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
## • Mode imputation for: all_nominal()

6.3 K-Nearest Neighbor Imputation

ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 6)
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
## • K-nearest neighbor imputation for: all_predictors()

6.4 Tree-Based Imputation

ames_recipe %>%
  step_impute_bag(all_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
## • Bagged tree imputation for: all_predictors()

7 Feature Filtering

Datasets often contain features that provide little or no predictive value. Removing these features can improve model performance and reduce computation time. One common method is identifying near-zero variance predictors.

7.1 Near Zero Variance Detection

caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  dplyr::filter(nzv)
##               rowname  freqRatio percentUnique zeroVar  nzv
## 1              Street  243.16667    0.06825939   FALSE TRUE
## 2               Alley   22.76667    0.10238908   FALSE TRUE
## 3        Land_Contour   21.94167    0.13651877   FALSE TRUE
## 4           Utilities 1463.50000    0.10238908   FALSE TRUE
## 5          Land_Slope   22.31200    0.10238908   FALSE TRUE
## 6         Condition_2  223.07692    0.27303754   FALSE TRUE
## 7           Roof_Matl  125.52174    0.27303754   FALSE TRUE
## 8           Bsmt_Cond   21.44262    0.20477816   FALSE TRUE
## 9      BsmtFin_Type_2   23.57547    0.23890785   FALSE TRUE
## 10       BsmtFin_SF_2  515.80000    9.35153584   FALSE TRUE
## 11            Heating  106.85185    0.20477816   FALSE TRUE
## 12    Low_Qual_Fin_SF  722.50000    1.22866894   FALSE TRUE
## 13      Kitchen_AbvGr   21.67442    0.13651877   FALSE TRUE
## 14         Functional   38.97143    0.27303754   FALSE TRUE
## 15      Open_Porch_SF   25.00000    8.60068259   FALSE TRUE
## 16     Enclosed_Porch  112.31818    6.24573379   FALSE TRUE
## 17 Three_season_porch  964.33333    1.05802048   FALSE TRUE
## 18       Screen_Porch  205.69231    4.12969283   FALSE TRUE
## 19          Pool_Area 2917.00000    0.47781570   FALSE TRUE
## 20            Pool_QC  729.25000    0.17064846   FALSE TRUE
## 21       Misc_Feature   29.72632    0.20477816   FALSE TRUE
## 22           Misc_Val  157.05556    1.29692833   FALSE TRUE

7.2 Removing Zero and Near-Zero Variance Features

ames_recipe %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
## • Zero variance filter on: all_predictors()
## • Sparse, unbalanced variable filter on: all_predictors()

8 Numeric Feature Engineering

Numeric features may require transformation to improve model performance.

Common techniques include:

• reducing skewness
• standardizing features
• scaling variables

8.1 Skewness Reduction

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Yeo-Johnson transformation on: all_numeric_predictors()
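
Unlike Box-Cox, the Yeo-Johnson transformation accepts zero and negative values. In recipes, lambda is estimated for each variable when the recipe is prepared; the sketch below instead applies the standard definition with a fixed, user-supplied lambda. `yeo_johnson()` is a hypothetical helper written for illustration, not the recipes implementation.

```r
# Hypothetical helper implementing the Yeo-Johnson definition for a fixed lambda
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1
  out[pos] <- if (lambda != 0) {
    ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    log1p(y[pos])
  }
  # Negative values: mirrored transformation with exponent 2 - lambda
  out[!pos] <- if (lambda != 2) {
    -(((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    -log1p(-y[!pos])
  }
  out
}

y <- c(-2, -0.5, 0, 1, 10)
yeo_johnson(y, lambda = 1)  # lambda = 1 leaves the data unchanged
yeo_johnson(y, lambda = 0)  # log1p() for the non-negative values
```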

8.2 Standardization

Standardization rescales numeric features so that they have:

• mean = 0
• standard deviation = 1

This is particularly important for models sensitive to feature scale, such as KNN, SVM, and neural networks.

ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
## • Centering for: all_numeric() -all_outcomes()
## • Scaling for: all_numeric() -all_outcomes()
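
Centering and scaling amount to subtracting the training mean and dividing by the training standard deviation, and the same two values are reused for new data. The base R sketch below shows the arithmetic on hypothetical feature values (`train_x` and `new_x` are illustrative, not taken from the Ames data).

```r
# Hypothetical training and new observations of one numeric feature
train_x <- c(800, 1200, 1500, 2100, 3400)
new_x   <- c(1000, 2500)

# Statistics are estimated on the training data only
mu    <- mean(train_x)
sigma <- sd(train_x)

train_z <- (train_x - mu) / sigma  # mean 0, standard deviation 1
new_z   <- (new_x - mu) / sigma    # reuses the training statistics

# scale() gives the same result for the training data
train_z_scale <- as.numeric(scale(train_x))
```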