1 Overview
This report reproduces the outputs of Sections 3.1 – 3.5 of Chapter 3 (Feature & Target Engineering) of Hands-On Machine Learning with R by Bradley Boehmke and Brandon Greenwell. These sections cover several important preprocessing techniques used in machine learning:
• Loading and preparing required libraries
• Target variable transformations
• Handling missing values
• Feature filtering techniques
• Numeric feature engineering methods
The dataset used in this chapter is the Ames Housing dataset, which contains housing information used to predict house prices.
All analyses are performed in R using several packages, including:
• dplyr
• ggplot2
• visdat
• caret
• recipes
• AmesHousing
The goal of this report is to demonstrate how feature engineering techniques can improve data quality and modeling performance.
2 Prerequisites
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(visdat)
## Warning: package 'visdat' was built under R version 4.5.2
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: lattice
library(recipes)
## Warning: package 'recipes' was built under R version 4.5.2
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
##
## step
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.5.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.5.2
library(forecast)
## Warning: package 'forecast' was built under R version 4.5.2
2.1 Load Dataset
ames <- make_ames()
ames_train <- ames
3 Target Engineering
Target engineering refers to transforming the response variable to improve model performance.
Many parametric models, such as ordinary linear regression, assume normally distributed errors. However, real-world responses such as sale prices are often right-skewed.
In such cases, applying a log or Box-Cox transformation can bring the distribution closer to normal and improve model predictions.
3.1 Log Transformation
transformed_response <- log(ames_train$Sale_Price)
summary(transformed_response)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.456 11.771 11.983 12.021 12.271 13.534
3.2 Using Recipes for Log Transformation
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())
ames_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
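Printing a recipe only lists the planned steps; nothing is transformed until the recipe is prepared and applied. The sketch below illustrates this with `prep()` and `bake()` from recipes on a small made-up data frame. The `toy` data and its column names are assumptions for illustration, not part of the Ames data.

```r
library(recipes)

# Hypothetical toy data standing in for the Ames training set
toy <- data.frame(
  price = c(100000, 150000, 250000, 400000),
  area  = c(900, 1200, 1800, 2600)
)

# Define the recipe: log-transform the outcome
rec <- step_log(recipe(price ~ ., data = toy), all_outcomes())

# prep() estimates whatever the steps need from the training data
prepped <- prep(rec, training = toy)

# bake() with new_data = NULL returns the processed training data
baked <- bake(prepped, new_data = NULL)

# The outcome column now holds log(price)
all.equal(baked$price, log(toy$price))
```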
3.3 Log Transformation Example
log(-0.5)
## Warning in log(-0.5): NaNs produced
## [1] NaN
log1p(-0.5)
## [1] -0.6931472
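The difference between the two calls is that `log1p(x)` computes `log(1 + x)`, so it is defined for any `x > -1`, including zero. A minimal base R sketch, using an arbitrary example vector, shows the equivalence and how `expm1()` undoes the transformation:

```r
# Arbitrary example values, including a negative value and zero
x <- c(-0.5, 0, 0.5, 10)

shifted <- log1p(x)          # same as log(1 + x)
recovered <- expm1(shifted)  # same as exp(shifted) - 1

all.equal(shifted, log(1 + x))  # TRUE
all.equal(recovered, x)         # TRUE
```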
3.4 Box-Cox Transformation
y <- forecast::BoxCox(10, lambda = 0.5)
y
## [1] 4.324555
## attr(,"lambda")
## [1] 0.5
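For reference, the Box-Cox transformation is (x^lambda - 1) / lambda when lambda is not 0, and log(x) when lambda is 0. The sketch below reimplements it in base R to make the arithmetic visible; `box_cox` is a hypothetical helper name, not a function from forecast.

```r
# Hypothetical helper: Box-Cox transformation for positive x
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# (sqrt(10) - 1) / 0.5, which agrees with the forecast::BoxCox() value above
box_cox(10, lambda = 0.5)
```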
3.5 Inverse Box-Cox Transformation
inv_box_cox <- function(x, lambda) {
  if (lambda == 0) exp(x)
  else (lambda * x + 1)^(1/lambda)
}
inv_box_cox(y, 0.5)
## [1] 10
## attr(,"lambda")
## [1] 0.5
4 Dealing with Missingness
Missing values are a common issue in real-world datasets.
Before applying machine learning models, it is important to understand:
• how many missing values exist
• where they occur
• how they should be handled
There are several strategies for handling missing data, including:
• deletion
• imputation
• model-based estimation
4.1 Checking Missing Values
The raw Ames dataset contains missing values.
ames_raw <- AmesHousing::ames_raw
sum(is.na(ames_raw))
## [1] 13997
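A single total says how many values are missing but not where they occur. One way to break the count down by column is `colSums(is.na(...))`. The sketch below uses a small made-up data frame (the `toy` data and its column names are assumptions) so that it runs without the Ames data; the same call could be applied to `ames_raw`.

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  lot_frontage = c(80, NA, 65, NA),
  alley        = c(NA, NA, "Grvl", NA),
  sale_price   = c(215000, 105000, 172000, 244000)
)

# Number of missing values per column, largest first
missing_per_column <- sort(colSums(is.na(toy)), decreasing = TRUE)

# Share of rows missing in each column, in percent
missing_share <- round(100 * missing_per_column / nrow(toy), 1)

missing_per_column
missing_share
```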
5 Visualizing Missing Values
Visualizing missing data helps identify patterns and understand which variables contain the most missing values.
ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
  geom_raster() +
  coord_flip() +
  scale_fill_grey(name = "",
                  labels = c("Present", "Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))
5.1 Using visdat Package
The visdat package provides a convenient way to visualize missing data.
vis_miss(ames_raw, cluster = TRUE)
6 Imputation
Imputation replaces missing values with estimated values.
Common imputation strategies include:
• Mean imputation
• Median imputation
• Mode imputation
• Model-based imputation
6.1 Median Imputation
ames_recipe %>%
  step_impute_median(Gr_Liv_Area)
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
## • Median imputation for: Gr_Liv_Area
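To show what median imputation does to a single column, here is a minimal base R sketch on a made-up vector (the values are assumptions for illustration). In a recipe, the median would be estimated from the training data when the recipe is prepared and then reused for new data.

```r
# Hypothetical living-area values with two missing entries
area <- c(1200, NA, 1500, 900, NA, 2100)

# Median of the observed values: 900, 1200, 1500, 2100
area_median <- median(area, na.rm = TRUE)

# Replace each missing entry with that median
area_imputed <- ifelse(is.na(area), area_median, area)

area_median
area_imputed
```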
6.2 Mode Imputation
ames_recipe %>%
  step_impute_mode(all_nominal())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
## • Mode imputation for: all_nominal()
6.3 K-Nearest Neighbor Imputation
ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 6)
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
## • K-nearest neighbor imputation for: all_predictors()
6.4 Tree-Based Imputation
ames_recipe %>%
  step_impute_bag(all_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
## • Bagged tree imputation for: all_predictors()
7 Feature Filtering
Datasets often contain features that provide little or no predictive value. Removing these features can improve model performance and reduce computation time. One common method is identifying near-zero variance predictors.
7.1 Near Zero Variance Detection
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  dplyr::filter(nzv)
## rowname freqRatio percentUnique zeroVar nzv
## 1 Street 243.16667 0.06825939 FALSE TRUE
## 2 Alley 22.76667 0.10238908 FALSE TRUE
## 3 Land_Contour 21.94167 0.13651877 FALSE TRUE
## 4 Utilities 1463.50000 0.10238908 FALSE TRUE
## 5 Land_Slope 22.31200 0.10238908 FALSE TRUE
## 6 Condition_2 223.07692 0.27303754 FALSE TRUE
## 7 Roof_Matl 125.52174 0.27303754 FALSE TRUE
## 8 Bsmt_Cond 21.44262 0.20477816 FALSE TRUE
## 9 BsmtFin_Type_2 23.57547 0.23890785 FALSE TRUE
## 10 BsmtFin_SF_2 515.80000 9.35153584 FALSE TRUE
## 11 Heating 106.85185 0.20477816 FALSE TRUE
## 12 Low_Qual_Fin_SF 722.50000 1.22866894 FALSE TRUE
## 13 Kitchen_AbvGr 21.67442 0.13651877 FALSE TRUE
## 14 Functional 38.97143 0.27303754 FALSE TRUE
## 15 Open_Porch_SF 25.00000 8.60068259 FALSE TRUE
## 16 Enclosed_Porch 112.31818 6.24573379 FALSE TRUE
## 17 Three_season_porch 964.33333 1.05802048 FALSE TRUE
## 18 Screen_Porch 205.69231 4.12969283 FALSE TRUE
## 19 Pool_Area 2917.00000 0.47781570 FALSE TRUE
## 20 Pool_QC 729.25000 0.17064846 FALSE TRUE
## 21 Misc_Feature 29.72632 0.20477816 FALSE TRUE
## 22 Misc_Val 157.05556 1.29692833 FALSE TRUE
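The two metrics in this table can be reproduced by hand. `freqRatio` is the count of the most common value divided by the count of the second most common, and `percentUnique` is the number of distinct values as a percentage of the number of rows. The sketch below uses a made-up vector; the cutoffs (`freqRatio` above 95/5 and `percentUnique` at or below 10) reflect my understanding of the `caret::nearZeroVar()` defaults and should be treated as an assumption.

```r
# Hypothetical feature: 98 paved streets, 2 gravel streets
street <- c(rep("Pave", 98), rep("Grvl", 2))

# Value counts, most frequent first
counts <- sort(table(street), decreasing = TRUE)

# Most common count over second most common count: 98 / 2
freq_ratio <- as.numeric(counts[1] / counts[2])

# Distinct values as a percentage of rows: 2 / 100
percent_unique <- 100 * length(counts) / length(street)

# Assumed default cutoffs: freqCut = 95/5, uniqueCut = 10
is_nzv <- freq_ratio > 95 / 5 && percent_unique <= 10

c(freq_ratio = freq_ratio, percent_unique = percent_unique)
is_nzv
```

A feature with a single distinct value has no second most common value, so it is handled separately as a zero-variance case.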
7.2 Removing Zero and Near-Zero Variance Features
ames_recipe %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
## • Zero variance filter on: all_predictors()
## • Sparse, unbalanced variable filter on: all_predictors()
8 Numeric Feature Engineering
Numeric features may require transformation to improve model performance.
Common techniques include:
• reducing skewness
• standardizing features
• scaling variables
8.1 Skewness Reduction
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Yeo-Johnson transformation on: all_numeric_predictors()
8.2 Standardization
Standardization rescales numeric features so that they have:
• mean = 0
• standard deviation = 1
This is particularly important for models sensitive to feature scale, such as:
• KNN
• SVM
• neural networks
ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
## • Centering for: all_numeric() -all_outcomes()
## • Scaling for: all_numeric() -all_outcomes()
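Centering followed by scaling amounts to subtracting the mean and dividing by the standard deviation. A minimal base R sketch on a made-up vector (the values are assumptions for illustration) checks that the result has mean 0 and standard deviation 1 and that it matches base R's `scale()`:

```r
# Hypothetical numeric feature
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center (subtract the mean), then scale (divide by the standard deviation)
z <- (x - mean(x)) / sd(x)

mean(z)  # approximately 0
sd(z)    # approximately 1

# Same result as base R's scale(), which returns a one-column matrix
all.equal(z, as.numeric(scale(x)))
```

In a recipe, the mean and standard deviation would be estimated from the training data when the recipe is prepared and then applied unchanged to new data.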