3.1 Prerequisites

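This chapter uses the following packages for data manipulation, visualization, and feature engineering. The examples assume a training set named ames_train, with Sale_Price as the response, has already been created.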

# Helper packages
library(dplyr)      # for data manipulation
library(ggplot2)    # for awesome graphics
library(visdat)     # for additional visualizations

# Feature engineering packages
library(caret)      # for various ML tasks
library(recipes)    # for feature engineering tasks

3.2 Target Engineering

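Transforming the response variable can improve predictive performance, especially for parametric models, which often assume normally distributed errors. A right-skewed response such as Sale_Price is a typical candidate.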

Option 1: Log Transformation

# Manual transformation
transformed_response <- log(ames_train$Sale_Price)

# Using a blueprint (recipe)
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())

ames_recipe
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Log transformation on: all_outcomes()
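
A model fit to a log-transformed response produces predictions on the log scale, so they need to be passed through exp() before they can be read as prices. The following base-R sketch illustrates that round trip. The sale_price values are made up for illustration and are not taken from ames_train.

```r
# Hypothetical sale prices (illustrative values only, not from ames_train)
sale_price <- c(105000, 172000, 189900, 244000, 755000)

# Forward transformation: compresses the long right tail
log_price <- log(sale_price)

# Back-transformation: returns values to the original dollar scale
back_transformed <- exp(log_price)

# Compare the spread before and after the transformation
range_ratio_raw <- max(sale_price) / min(sale_price)
range_ratio_log <- max(log_price) / min(log_price)

# log1p() computes log(1 + x), a common choice when zeros are present
log1p_zero <- log1p(0)
```

If the response can be zero, log() returns -Inf, which is why log1p() (paired with expm1() for the back-transformation) is often used instead.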

Option 2: Box-Cox Transformation

# Box-Cox requires strictly positive values
# Use step_YeoJohnson() instead if the data contain zeros or negative values
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(all_outcomes())
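
To make the transformation itself concrete, here is a minimal base-R sketch of the Box-Cox formula for a fixed lambda. The helper name box_cox() and the example values are hypothetical. In a recipe, step_BoxCox() estimates lambda from the training data rather than taking it as an argument.

```r
# Box-Cox transformation for a given lambda (illustrative helper, not part of recipes)
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))            # defined only for strictly positive values
  if (abs(lambda) < 1e-8) {
    log(y)                         # limiting case as lambda approaches 0
  } else {
    (y^lambda - 1) / lambda
  }
}

# Hypothetical positive values
y <- c(1, 2, 4, 8)

bc_log      <- box_cox(y, 0)    # equivalent to log(y)
bc_sqrt     <- box_cox(y, 0.5)  # a shifted and scaled square root
bc_identity <- box_cox(y, 1)    # y - 1, so the shape is unchanged
```

Seen this way, the log transformation in Option 1 is the special case lambda = 0.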

3.3 Dealing with Missingness

3.3.1 Visualizing Missing Values

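Note: The Ames housing data is relatively clean, so the plot may show few or no missing values. The vis_miss() function from the visdat package draws each cell of the data frame as either missing or observed.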

# Visualize missing data (if any)
vis_miss(ames_train)
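
A numeric summary is a useful complement to the plot. The sketch below counts missing values per column in base R. The housing data frame and its column names are hypothetical stand-ins, not the Ames data.

```r
# Hypothetical data frame with a few missing values (not the Ames data)
housing <- data.frame(
  lot_area  = c(8450, NA, 11250, 9550, NA),
  garage    = c("Attached", "Detached", NA, "Attached", "Attached"),
  year_sold = c(2008, 2007, 2008, 2006, 2008)
)

# Number of missing values in each column
missing_counts <- colSums(is.na(housing))

# Percentage of missing values in each column
missing_pct <- 100 * colMeans(is.na(housing))
```

The same two lines work on ames_train by replacing housing with the training set.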

3.3.2 Imputation

The recipes package provides several imputation methods, each added to the blueprint as a step_impute_*() step.

# 3.3.2.1 Impute via Median/Mode
ames_recipe %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Median imputation for: all_numeric_predictors()
## • Mode imputation for: all_nominal_predictors()

# 3.3.2.2 K-nearest neighbor imputation
ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 5)
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • K-nearest neighbor imputation for: all_predictors()

# 3.3.2.3 Tree-based imputation (Bagged Trees)
ames_recipe %>%
  step_impute_bag(all_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Bagged tree imputation for: all_predictors()
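
The simplest of these approaches, median and mode imputation, can be sketched in a few lines of base R. This illustrates the idea and is not how recipes implements it. The train data frame and its columns are hypothetical.

```r
# Hypothetical training data with missing values (not the Ames data)
train <- data.frame(
  lot_area = c(8450, NA, 11250, 9550, 14260),
  garage   = c("Attached", "Detached", NA, "Attached", "Attached"),
  stringsAsFactors = FALSE
)

# Learn the imputation values from the training data only
lot_area_median <- median(train$lot_area, na.rm = TRUE)
garage_mode     <- names(which.max(table(train$garage)))

# Fill the gaps with the learned values
imputed <- train
imputed$lot_area[is.na(imputed$lot_area)] <- lot_area_median
imputed$garage[is.na(imputed$garage)]     <- garage_mode
```

The median and mode are computed once from the training data and would be reused unchanged on new data. The KNN and bagged-tree steps follow the same pattern but predict each missing value from the other predictors instead of using a single constant.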

3.4 Feature Filtering

Features with zero or near-zero variance provide little to no information to a model and can be removed.

ames_recipe %>%
  step_nzv(all_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Sparse, unbalanced variable filter on: all_predictors()
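
A common rule of thumb flags a feature as near-zero variance when its most frequent value heavily outnumbers the second most frequent value and it has few unique values relative to the number of rows. The sketch below implements that heuristic in base R. The helper name is hypothetical, and its cutoffs mirror the freq_cut = 95/5 and unique_cut = 10 defaults of step_nzv().

```r
# Illustrative near-zero variance check (hypothetical helper, not part of recipes)
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)

  # A single distinct value means zero variance
  if (length(counts) <= 1) return(TRUE)

  # Ratio of the most common value to the second most common value
  freq_ratio <- counts[[1]] / counts[[2]]

  # Distinct values as a percentage of the number of observations
  pct_unique <- 100 * length(counts) / length(x)

  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Hypothetical features
mostly_one_level <- c(rep("Paved", 98), rep("Gravel", 2))
balanced         <- rep(c("Yes", "No"), 50)
constant         <- rep(1, 100)

nzv_flags <- c(
  mostly_one_level = is_near_zero_var(mostly_one_level),
  balanced         = is_near_zero_var(balanced),
  constant         = is_near_zero_var(constant)
)
```

Both conditions must hold, so a high-cardinality numeric feature is not flagged merely because one value repeats.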

3.5 Numeric Feature Engineering

3.5.1 Skewness

Skewed numeric predictors can be made more symmetric with a transformation such as Yeo-Johnson, which, unlike Box-Cox, also accepts zero and negative values.

ames_recipe %>%
  step_YeoJohnson(all_numeric_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Yeo-Johnson transformation on: all_numeric_predictors()
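
For reference, the Yeo-Johnson transformation for a fixed lambda can be sketched in base R as follows. The helper yeo_johnson() and the example values are hypothetical. In a recipe, step_YeoJohnson() estimates a separate lambda for each selected predictor from the training data.

```r
# Yeo-Johnson transformation for a given lambda (illustrative helper, not part of recipes)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0

  # Non-negative values: Box-Cox applied to y + 1
  if (abs(lambda) < 1e-8) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }

  # Negative values: mirrored form with exponent 2 - lambda
  if (abs(lambda - 2) < 1e-8) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }

  out
}

# Hypothetical values, including zero and negatives
y <- c(-2, -0.5, 0, 1, 3)

yj_identity <- yeo_johnson(y, 1)    # lambda = 1 leaves the data unchanged
yj_half     <- yeo_johnson(y, 0.5)  # pulls in the right tail
```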

3.5.2 Standardization

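Standardization combines centering (subtracting the mean) and scaling (dividing by the standard deviation) so that each numeric feature has a mean of 0 and a standard deviation of 1. This puts features measured in different units on a common scale.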

ames_recipe %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 80
## 
## ── Operations
## • Box-Cox transformation on: all_outcomes()
## • Centering for: all_numeric_predictors()
## • Scaling for: all_numeric_predictors()
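
The arithmetic behind these two steps is short enough to write out in base R. The sketch below uses made-up living-area values and shows one detail that is easy to miss: the mean and standard deviation are estimated on the training data and then reused for new data.

```r
# Hypothetical training and new observations (illustrative values only)
train_area <- c(850, 1200, 1500, 1786, 2400)
new_area   <- c(1000, 2000)

# Estimate the centering and scaling parameters on the training data
train_mean <- mean(train_area)
train_sd   <- sd(train_area)

# Standardize the training data
train_std <- (train_area - train_mean) / train_sd

# Reuse the training parameters for new data; do not re-estimate them
new_std <- (new_area - train_mean) / train_sd
```

In the recipes workflow, this split corresponds to prep() estimating the parameters on the training set and bake() applying them to any data set.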