Hands-on Machine Learning with R: Chapter 3 Feature and Target Engineering

1 Overview

This report reproduces the outputs of Chapter 3 (Feature & Target Engineering) of the book Hands-On Machine Learning with R by Bradley Boehmke and Brandon Greenwell.

It covers Sections 3.1 – 3.5 of the chapter, which present several important preprocessing techniques used in machine learning:

• Loading and preparing required libraries
• Target variable transformations
• Handling missing values
• Feature filtering techniques
• Numeric feature engineering methods

The dataset used in this chapter is the Ames Housing dataset, which contains housing information used to predict house prices.

All analyses are performed in R using the following packages:

• dplyr
• ggplot2
• visdat
• caret
• recipes
• AmesHousing
• reshape2
• forecast

The goal of this report is to demonstrate how feature engineering techniques can improve data quality and modeling performance.

2 Prerequisites

In this section, we load all the required packages used throughout the chapter. These libraries provide tools for data manipulation, visualization, and machine learning preprocessing.

library(dplyr)
library(ggplot2)
library(visdat)

library(caret)
library(recipes)

library(AmesHousing)
library(reshape2)
library(forecast)

2.1 Load Dataset

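The Ames Housing dataset is used throughout this chapter. It contains housing attributes used to predict the selling price of houses. Note that no train/test split is performed in this report: ames_train is simply the full processed dataset returned by make_ames().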

ames <- make_ames()

ames_train <- ames

3 Target Engineering

Target engineering refers to transforming the response variable to improve model performance.

Transforming the response is not always required, but it often helps parametric models. Ordinary linear regression, for example, assumes normally distributed prediction errors, and a strongly right-skewed response such as sale price tends to violate that assumption.

In such cases, a log transformation or a Box-Cox transformation can make the distribution more symmetric and improve model predictions.

3.1 Log Transformation

A simple way to reduce right skewness is to apply a logarithmic transformation to the target variable.

transformed_response <- log(ames_train$Sale_Price)

summary(transformed_response)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.456  11.771  11.983  12.021  12.271  13.534
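
To see why the log helps, the following minimal sketch compares sample skewness before and after the transformation. It uses simulated right-skewed prices rather than the Ames data, and sample_skewness() is a hypothetical helper written for this illustration, not a function from any of the packages loaded above.

```r
# Simulated right-skewed "prices" (an assumption; not the Ames data)
set.seed(123)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- sample_skewness(price)       # clearly positive (right-skewed)
skew_log <- sample_skewness(log(price))  # much closer to zero

c(raw = skew_raw, log = skew_log)
```

The same comparison can be run on ames_train$Sale_Price to check how much the log reduces skewness in the real data.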

3.2 Using Recipes for Log Transformation

Instead of manually transforming variables, the recipes package allows us to define preprocessing steps that can later be applied to training and test data.

ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())

ames_recipe
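
A recipe is only a specification until it is estimated and applied. The sketch below shows the usual prep() and bake() sequence from recipes on a small made-up data frame; toy and its columns are assumptions for illustration and stand in for ames_train.

```r
library(recipes)

# Hypothetical toy data standing in for ames_train
toy <- data.frame(
  price = c(100000, 150000, 200000, 400000),
  area  = c(900, 1200, 1500, 2600)
)

# 1. Define the blueprint: log-transform the outcome
toy_recipe <- recipe(price ~ area, data = toy)
toy_recipe <- step_log(toy_recipe, all_outcomes())

# 2. Estimate whatever the steps need from the training data
toy_prepped <- prep(toy_recipe, training = toy)

# 3. Apply the estimated recipe; new_data = NULL returns the processed training set
toy_baked <- bake(toy_prepped, new_data = NULL)
toy_baked$price
```

Passing a different data frame as new_data applies the same estimated steps to it, which is how test data would be processed with parameters learned only from the training data.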

3.3 Log Transformation Example

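The log transformation is undefined for non-positive numbers: log() returns NaN (with a warning) for negative input and -Inf for zero. If the non-positive values are small (between -1 and 0), log1p(), which computes log(1 + x), can be used instead.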

log(-0.5)

## [1] NaN

log1p(-0.5)

## [1] -0.6931472
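
As a small illustration of the offset idea, the sketch below applies log1p() to a made-up vector containing a negative value and a zero, then maps the result back with expm1(), the base R inverse of log1p(). The vector x is an assumption chosen only for the example.

```r
x <- c(-0.5, 0, 0.5, 10)   # hypothetical values, all greater than -1

log(0)                     # -Inf: the plain log cannot handle zeros

shifted <- log1p(x)        # log(1 + x), finite for every x > -1
shifted

# expm1() computes exp(x) - 1, so it undoes log1p()
back <- expm1(shifted)
all.equal(back, x)
```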

3.4 Box-Cox Transformation

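Another commonly used transformation is the Box-Cox transformation, a family of power transformations indexed by a parameter lambda (lambda = 0 corresponds to the log transformation). The function itself does not choose lambda: the value is either supplied, as in the example below with lambda = 0.5, or estimated from the training data, for example with forecast::BoxCox.lambda() or recipes::step_BoxCox().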

y <- forecast::BoxCox(10, lambda = 0.5)
y

## [1] 4.324555
## attr(,"lambda")
## [1] 0.5
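
The following base R sketch spells out the Box-Cox formula and shows one way lambda could be estimated: by maximizing the profile log-likelihood of an intercept-only normal model. Both box_cox() and box_cox_loglik() are hypothetical helpers, the data are simulated, and this is not necessarily the estimator that forecast or recipes use internally.

```r
# Box-Cox transform for strictly positive y (hypothetical helper)
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

box_cox(10, 0.5)  # (sqrt(10) - 1) / 0.5, about 4.324555 as in the output above

# Profile log-likelihood of lambda for an intercept-only normal model
box_cox_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Simulated log-normal response (an assumption), so lambda near 0 is expected
set.seed(123)
y_sim <- rlnorm(2000, meanlog = 12, sdlog = 0.5)

lambda_hat <- optimize(box_cox_loglik, interval = c(-2, 2),
                       y = y_sim, maximum = TRUE)$maximum
lambda_hat
```

Because the simulated response is log-normal, the estimate should land close to zero, which corresponds to the log transformation.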

3.5 Inverse Box-Cox Transformation

To convert the transformed values back to their original scale, we apply the inverse transformation.

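inv_box_cox <- function(x, lambda) {
  if (lambda == 0) {
    # lambda = 0 is the log transformation, so invert with exp()
    exp(x)
  } else {
    # invert y = (x^lambda - 1) / lambda
    (lambda * x + 1)^(1 / lambda)
  }
}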

inv_box_cox(y, 0.5)

## [1] 10
## attr(,"lambda")
## [1] 0.5

4 Dealing with Missingness

Missing values are a common issue in real-world datasets.

Before applying machine learning models, it is important to understand:

• how many missing values exist
• where they occur
• how they should be handled

There are several strategies for handling missing data, including:

• deletion
• imputation
• model-based estimation

4.1 Checking Missing Values

The raw Ames dataset contains missing values.

ames_raw <- AmesHousing::ames_raw

sum(is.na(ames_raw))

## [1] 13997
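
A single total says how many values are missing but not where they occur. The base R sketch below breaks the count down by column and shows how many rows listwise deletion would keep; toy is a made-up data frame used only for illustration, and the same calls can be applied to ames_raw.

```r
# Hypothetical toy data with a few missing cells
toy <- data.frame(
  lot_frontage = c(80, NA, 65, NA),
  garage_type  = c("Attchd", NA, "Detchd", "Attchd"),
  sale_price   = c(215000, 105000, 172000, 244000)
)

total_missing  <- sum(is.na(toy))           # how many cells are missing overall
missing_by_col <- colSums(is.na(toy))       # where they occur, by column
complete_rows  <- sum(complete.cases(toy))  # rows that deletion would keep

missing_by_col[missing_by_col > 0]
```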

5 Visualizing Missing Values

Visualizing missing data helps identify patterns and understand which variables contain the most missing values.

ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill=value)) +
  geom_raster() +
  coord_flip() +
  scale_fill_grey(name = "",
                  labels = c("Present", "Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))

5.1 Using visdat Package

The visdat package provides convenient functions for visualizing missing data patterns.

vis_miss(ames_raw, cluster = TRUE)

6 Imputation

Imputation replaces missing values with estimated values.

Common imputation strategies include:

• Mean imputation
• Median imputation
• Mode imputation
• Model-based imputation

6.1 Median Imputation

Median imputation replaces missing values with the median of the variable.

ames_recipe %>%
  step_impute_median(Gr_Liv_Area)
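
What the step does can be sketched in base R: the median is learned from the training values and then reused for any new data. The vectors and the impute_with() helper below are hypothetical and only illustrate the idea.

```r
train_area <- c(1200, NA, 1500, 900, NA, 2100)  # hypothetical training values
test_area  <- c(NA, 1750)                       # hypothetical new data

# Learn the median from the training data only
area_median <- median(train_area, na.rm = TRUE)

# Hypothetical helper: replace NA with a fixed value
impute_with <- function(x, value) {
  x[is.na(x)] <- value
  x
}

train_imputed <- impute_with(train_area, area_median)
test_imputed  <- impute_with(test_area, area_median)
train_imputed
test_imputed
```

Reusing the training median for the test data avoids leaking information from the test set into preprocessing.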

6.2 Mode Imputation

For categorical variables, the most common value (mode) is used.

ames_recipe %>%
  step_impute_mode(all_nominal())
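
The same idea for a categorical variable, as a minimal base R sketch on a made-up character vector (garage is an assumption for illustration):

```r
garage <- c("Attchd", "Detchd", NA, "Attchd", "BuiltIn", NA)  # hypothetical values

# table() ignores NA by default; which.max() returns the first most frequent level
garage_mode <- names(which.max(table(garage)))

garage_imputed <- ifelse(is.na(garage), garage_mode, garage)
garage_imputed
```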

6.3 K-Nearest Neighbor Imputation

KNN imputation replaces missing values based on the values of the k most similar observations.

ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 6)
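
The mechanics can be sketched in base R for a single missing numeric value. This is a simplified illustration under stated assumptions: it uses Euclidean distance on standardized features, whereas the recipes documentation describes Gower's distance, and both toy and knn_impute_one() are hypothetical.

```r
# Hypothetical toy data: `frontage` is missing for the last row
toy <- data.frame(
  area     = c(1000, 1100, 1200, 3000, 3100, 1150),
  rooms    = c(5, 5, 6, 9, 10, 6),
  frontage = c(60, 62, 66, 110, 120, NA)
)

# Hypothetical helper: impute one missing value of `target` in row `row`
knn_impute_one <- function(data, target, row, k) {
  donors   <- which(!is.na(data[[target]]))
  features <- setdiff(names(data), target)
  # Standardize features so that no single variable dominates the distance
  scaled <- scale(data[features])
  diffs  <- sweep(scaled[donors, , drop = FALSE], 2, scaled[row, ])
  dists  <- sqrt(rowSums(diffs^2))
  # Average the target over the k closest donor rows
  nearest <- donors[order(dists)[seq_len(k)]]
  mean(data[[target]][nearest])
}

imputed <- knn_impute_one(toy, target = "frontage", row = 6, k = 3)
imputed
```

Here the three nearest neighbors are the other small houses, so the imputed value is the mean of their frontage values rather than the overall mean.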

6.4 Tree-Based Imputation

Tree-based imputation predicts each missing value from the other predictors; step_impute_bag() does this with bagged decision trees.

ames_recipe %>%
  step_impute_bag(all_predictors())

7 Feature Filtering

Datasets often contain features that provide little or no predictive value. Removing these features can improve model performance and reduce computation time. One common method is identifying near-zero variance predictors.

7.1 Near Zero Variance Detection

caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  dplyr::filter(nzv)
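
The two metrics behind this check can be sketched in base R. The thresholds mirror caret's documented defaults (freqCut = 95/5 and uniqueCut = 10); nzv_metrics() itself is a hypothetical helper, not caret's implementation, and its handling of constant columns is a simplification.

```r
# Hypothetical helper computing near-zero variance diagnostics for one vector
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Count of the most common value divided by the count of the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of all observations
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    zero_var       = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique < unique_cut)
  )
}

mostly_zero <- c(rep(0, 98), 1, 2)  # dominated by a single value
spread_out  <- 1:100                # every value distinct

nzv_metrics(mostly_zero)
nzv_metrics(spread_out)
```

A feature is flagged only when both conditions hold: one value dominates and there are few distinct values relative to the sample size.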

7.2 Removing Zero and Near-Zero Variance Features

ames_recipe %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())

8 Numeric Feature Engineering

Numeric features may require transformation to improve model performance.

Common techniques include:

• reducing skewness
• standardizing features
• scaling variables

8.1 Skewness Reduction

The Yeo-Johnson transformation is commonly used to reduce skewness in numeric variables. Unlike Box-Cox, it also accepts zero and negative values.

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors())
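
For reference, the Yeo-Johnson formula can be written out in base R as below. yeo_johnson() is a hypothetical helper that takes lambda as given; step_YeoJohnson() estimates lambda from the training data, which this sketch does not attempt.

```r
# Hypothetical helper: Yeo-Johnson transform for a fixed lambda
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0

  # Non-negative values: Box-Cox with parameter lambda applied to y + 1
  if (abs(lambda) < 1e-8) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }

  # Negative values: Box-Cox with parameter 2 - lambda applied to 1 - y, sign flipped
  if (abs(lambda - 2) < 1e-8) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }

  out
}

yeo_johnson(c(-2, 0, 3), lambda = 1)    # lambda = 1 leaves the values unchanged
yeo_johnson(c(-2, 0, 3), lambda = 0.5)  # defined for negative, zero, and positive input
```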

8.2 Standardization

Standardization rescales numeric features so that they have:

• mean = 0
• standard deviation = 1

This is particularly important for models sensitive to feature scale such as:

• KNN
• SVM
• neural networks

ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
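
The arithmetic behind these two steps can be sketched in base R. The means and standard deviations are learned from the training data and then reused for new data; the train and test data frames below are simulated assumptions for illustration.

```r
# Simulated data (an assumption; not the Ames data)
set.seed(123)
train <- data.frame(area = rnorm(50, 1500, 400), rooms = rnorm(50, 6, 1.5))
test  <- data.frame(area = c(1000, 2000), rooms = c(4, 8))

# Learn centering and scaling constants from the training data only
centers <- colMeans(train)
scales  <- apply(train, 2, sd)

# Apply the same constants to both sets
train_std <- scale(train, center = centers, scale = scales)
test_std  <- scale(test,  center = centers, scale = scales)

round(colMeans(train_std), 10)  # training means are zero up to rounding
apply(train_std, 2, sd)         # training standard deviations are one
test_std
```

The standardized test values are generally not centered at exactly zero, which is expected because the constants come from the training data.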