This chapter leverages the following packages for data manipulation, visualization, and feature engineering.
# Helper packages
library(dplyr) # for data manipulation
library(ggplot2) # for awesome graphics
library(visdat) # for additional visualizations
# Feature engineering packages
library(caret) # for various ML tasks
library(recipes) # for feature engineering tasks
Transforming the response variable can improve predictive performance, especially for parametric models.
# Manual transformation
transformed_response <- log(ames_train$Sale_Price)
# Using a blueprint (recipe)
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())
ames_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Log transformation on: all_outcomes()
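Predictions from a model fit to a log-transformed response are on the log scale and need to be back-transformed before they can be read in the original units. The following is a minimal sketch of that round trip using base R only; the `sale_price` values are made up for illustration and are not taken from the Ames data.

```r
# Toy sale prices (hypothetical values, not taken from the Ames data)
sale_price <- c(105000, 172000, 244000, 189900, 755000)

# Forward transform: the model would be trained on the log of the response
log_price <- log(sale_price)

# Back-transform values from the log scale to the original units
back_transformed <- exp(log_price)

# The round trip recovers the original values (up to floating point error)
round_trip_ok <- isTRUE(all.equal(back_transformed, sale_price))

# log1p() / expm1() are a common alternative when the response can be zero
zero_ok <- c(0, 10, 250)
zero_round_trip_ok <- isTRUE(all.equal(expm1(log1p(zero_ok)), zero_ok))
```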
# Box-Cox requires a strictly positive response;
# use step_YeoJohnson() instead if the response contains zeros or negative values
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(all_outcomes())
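To make the Box-Cox step less of a black box, here is a small base R sketch of the transformation and its inverse for a fixed lambda. The helper names `box_cox()` and `inv_box_cox()` are hypothetical, and lambda is supplied by hand here, whereas the recipe step estimates it from the training data.

```r
# Box-Cox transform for a fixed lambda (hypothetical helper, not part of recipes)
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))                 # defined only for strictly positive values
  if (abs(lambda) < 1e-8) {
    log(y)                              # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda
  }
}

# Inverse transform, used to return values to the original scale
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) {
    exp(z)
  } else {
    (lambda * z + 1)^(1 / lambda)
  }
}

y <- c(1, 2, 5, 10, 50)                 # made-up positive values
z <- box_cox(y, lambda = 0.5)
recovered <- inv_box_cox(z, lambda = 0.5)
```

With lambda = 1 the transform only shifts the data by one, and with lambda = 0 it reduces to the log transformation shown earlier.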
Note: The Ames housing data is relatively clean. The visdat package is used to visualize missingness.
# Visualize missing data (if any)
vis_miss(ames_train)
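A plot of missingness is easier to interpret alongside plain counts. The sketch below tallies missing values with base R on a small made-up data frame; `toy` and its column names are illustrative assumptions, and the same calls should work on any data frame.

```r
# Small made-up data frame with a few missing values
toy <- data.frame(
  lot_area    = c(8450, NA, 11250, 9550),
  garage_type = c("Attchd", "Detchd", NA, NA),
  year_built  = c(2003, 1976, 2001, 1915)
)

# Total number of missing cells
n_missing_total <- sum(is.na(toy))

# Missing values per column, as counts and as percentages of rows
n_missing_by_col   <- colSums(is.na(toy))
pct_missing_by_col <- 100 * colMeans(is.na(toy))
```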
The recipes package provides several steps for imputing missing values.
# Impute via median/mode
ames_recipe %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Median imputation for: all_numeric_predictors()
## • Mode imputation for: all_nominal_predictors()
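What the median and mode steps do can be sketched by hand in base R. The example below uses a made-up `train` data frame. The imputation values are computed from the training data only, so the same values can later be applied to new data.

```r
# Made-up training data with one missing value per column
train <- data.frame(
  lot_area    = c(8000, NA, 12000, 10000),
  garage_type = c("Attchd", "Attchd", NA, "Detchd"),
  stringsAsFactors = FALSE
)

# Statistics learned from the training data only
lot_area_median  <- median(train$lot_area, na.rm = TRUE)
garage_counts    <- table(train$garage_type)      # table() ignores NA by default
garage_type_mode <- names(garage_counts)[which.max(garage_counts)]

# Fill the gaps with the learned values
imputed <- train
imputed$lot_area[is.na(imputed$lot_area)]       <- lot_area_median
imputed$garage_type[is.na(imputed$garage_type)] <- garage_type_mode
```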
# K-nearest neighbor imputation
ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 5)
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • K-nearest neighbor imputation for: all_predictors()
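The idea behind K-nearest neighbor imputation is to fill a missing value with a summary of that variable among the most similar complete rows. The sketch below is a simplified base R illustration for a single numeric column; `knn_impute_one()` is a hypothetical helper, and Euclidean distance is swapped in for simplicity, so it should not be read as how the recipe step is implemented.

```r
# Made-up data: y is missing in the last row
toy <- data.frame(
  x1 = c(1, 2, 3, 10, 11),
  x2 = c(1, 2, 3, 10, 11),
  y  = c(10, 20, 30, 100, NA)
)

# Impute one numeric column from its k nearest complete rows
# (hypothetical helper; Euclidean distance on the given predictors)
knn_impute_one <- function(data, target, predictors, k) {
  missing_rows <- which(is.na(data[[target]]))
  donor_rows   <- which(!is.na(data[[target]]))
  donors <- as.matrix(data[donor_rows, predictors, drop = FALSE])
  for (i in missing_rows) {
    # Distance from row i to every donor row
    diffs <- sweep(donors, 2, unlist(data[i, predictors]))
    dists <- sqrt(rowSums(diffs^2))
    # Average the target over the k closest donors
    nearest <- donor_rows[order(dists)[seq_len(min(k, length(donor_rows)))]]
    data[[target]][i] <- mean(data[[target]][nearest])
  }
  data
}

imputed <- knn_impute_one(toy, target = "y", predictors = c("x1", "x2"), k = 2)
```

Here the two closest donors to the last row are rows 4 and 3, so the missing value is replaced by the average of their `y` values.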
# Tree-based imputation (bagged trees)
ames_recipe %>%
  step_impute_bag(all_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Bagged tree imputation for: all_predictors()
Remove features that provide little to no information, such as those with zero or near-zero variance.
ames_recipe %>%
  step_nzv(all_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Sparse, unbalanced variable filter on: all_predictors()
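A near-zero variance filter is usually described in terms of two quantities: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The base R sketch below computes both for a single vector. The helper names are hypothetical, and the thresholds (a ratio above 95/5 and fewer than 10% unique values) are the commonly cited defaults, assumed here rather than taken from the package source.

```r
# Frequency ratio and percent unique for one variable (hypothetical helper)
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Flag a variable as zero or near-zero variance (assumed thresholds)
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  if (length(unique(x)) <= 1) return(TRUE)   # zero variance: a single value
  m <- nzv_metrics(x)
  m[["freq_ratio"]] > freq_cut && m[["pct_unique"]] < unique_cut
}

mostly_constant <- c(rep("Pave", 99), "Grvl")      # one dominant level
balanced        <- rep(c("A", "B", "C", "D"), 25)  # evenly spread levels

flag_mostly_constant <- is_nzv(mostly_constant)
flag_balanced        <- is_nzv(balanced)
```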
Correct for skewed predictors using a transformation such as Yeo-Johnson.
ames_recipe %>%
  step_YeoJohnson(all_numeric_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Yeo-Johnson transformation on: all_numeric_predictors()
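Unlike Box-Cox, the Yeo-Johnson transformation is defined for zero and negative values because it treats non-negative and negative inputs with separate formulas. The following base R sketch applies it for a fixed lambda; `yeo_johnson()` is a hypothetical helper, and lambda is chosen by hand here rather than estimated from the training data as the recipe step does.

```r
# Yeo-Johnson transform for a fixed lambda (hypothetical helper)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0

  # Non-negative values
  if (abs(lambda) < 1e-8) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }

  # Negative values
  if (abs(lambda - 2) < 1e-8) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

x <- c(-2, 0, 3)                        # made-up values, including a negative
unchanged   <- yeo_johnson(x, lambda = 1)
transformed <- yeo_johnson(x, lambda = 0)
```

With lambda = 1 the transformation leaves the data unchanged, which makes a convenient sanity check.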
Center (subtract the mean) and scale (divide by the standard deviation) the numeric features so that each has a mean of 0 and a standard deviation of 1.
ames_recipe %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 80
##
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Centering for: all_numeric_predictors()
## • Scaling for: all_numeric_predictors()
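Standardization can be sketched by hand to show where the mean and standard deviation come from. In the base R example below, both are computed on made-up training values and then reused on new values, which keeps information from the new data out of the preprocessing; `train_x` and `test_x` are illustrative assumptions.

```r
# Made-up numeric feature, split into training and test values
train_x <- c(2, 4, 6, 8)
test_x  <- c(5, 10)

# Statistics learned from the training data only
mu    <- mean(train_x)
sigma <- sd(train_x)

# Apply the same centering and scaling to both sets
train_std <- (train_x - mu) / sigma
test_std  <- (test_x - mu) / sigma
```

The standardized training values have mean 0 and standard deviation 1 by construction; the test values generally will not, because they are scaled with the training statistics.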