This report reproduces the outputs of Sections 3.1 – 3.5 of Chapter 3 (Feature & Target Engineering) from the book Hands-On Machine Learning with R by Bradley Boehmke and Brandon Greenwell.
These sections cover several important preprocessing techniques used in machine learning:
• Loading and preparing required libraries
• Target variable transformations
• Handling missing values
• Feature filtering techniques
• Numeric feature engineering methods
The dataset used in this chapter is the Ames Housing dataset, which contains housing information used to predict house prices.
All analyses are performed in R using several packages, including dplyr, ggplot2, visdat, caret, recipes, AmesHousing, reshape2, and forecast.
The goal of this report is to demonstrate how feature engineering techniques can improve data quality and modeling performance.
In this section, we load all the required packages used throughout the chapter. These libraries provide tools for data manipulation, visualization, and machine learning preprocessing.
library(dplyr)
library(ggplot2)
library(visdat)
library(caret)
library(recipes)
library(AmesHousing)
library(reshape2)
library(forecast)
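The Ames Housing dataset is used throughout this chapter. The make_ames() function returns a processed version of the data, and for simplicity this report treats the full dataset as the training set rather than holding out a test set.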
ames <- make_ames()
ames_train <- ames
Target engineering refers to transforming the response variable to improve model performance.
Some models, particularly parametric ones such as ordinary linear regression, assume that the prediction errors (and hence the response) are approximately normally distributed. However, real-world data often exhibit skewed distributions.
In such cases, applying transformations such as log transformation or Box-Cox transformation can help normalize the distribution and improve model predictions.
A simple way to reduce skewness is by applying a logarithmic transformation to the target variable.
transformed_response <- log(ames_train$Sale_Price)
summary(transformed_response)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.456 11.771 11.983 12.021 12.271 13.534
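As a rough illustration of why the log helps, the sketch below compares a simple moment-based skewness before and after the transformation. The simulated prices and the skewness() helper are assumptions made for this example and are not part of the chapter.

```r
# Simulated right-skewed "prices" (hypothetical data, not the Ames values)
set.seed(123)
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(prices)       # clearly positive for log-normal data
skew_log <- skewness(log(prices))  # close to zero after the log

# Predictions made on the log scale are returned to the original scale with exp()
back <- exp(log(prices))
```

Because the model is then fit on the log scale, its predictions need exp() applied before they can be read as prices.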
Instead of manually transforming variables, the recipes package allows us to define preprocessing steps that can later be applied to training and test data.
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
step_log(all_outcomes())
ames_recipe
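Printing a recipe only describes the planned steps; nothing is transformed yet. A minimal sketch of how the recipe might then be estimated and applied with prep() and bake() follows. It rebuilds the objects so that it runs on its own, and it assumes, as this report does, that the full dataset serves as the training set.

```r
library(recipes)
library(AmesHousing)

ames_train <- make_ames()

ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())

# prep() estimates whatever each step needs from the training data
prepped <- prep(ames_recipe, training = ames_train)

# bake() applies the prepared steps to a data set
baked <- bake(prepped, new_data = ames_train)

head(baked$Sale_Price)
```

The same prepped object could be passed to bake() with a held-out test set, so that the test data receive exactly the steps learned on the training data.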
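The log transformation is only defined for positive numbers. Applied to a negative value, log() returns NaN (with a warning); applied to zero, it returns -Inf. When the nonpositive values are small (greater than -1), log1p(), which computes log(1 + x), can be used instead.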
log(-0.5)
## [1] NaN
log1p(-0.5)
## [1] -0.6931472
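A short sketch of the edge cases, using made-up values: log1p() has a matching inverse, expm1(), which is what would be used to return values to the original scale.

```r
x <- c(-0.5, 0, 0.5)

log(0)           # -Inf: zero is outside the domain of log()
log1p(x)         # log(1 + x), defined for every x > -1
expm1(log1p(x))  # expm1() undoes log1p(), recovering the original values
```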
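Another commonly used transformation is the Box-Cox transformation, a family of power transformations indexed by a parameter lambda. In practice lambda is estimated from the training data; the example below fixes lambda = 0.5 purely for illustration.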
y <- forecast::BoxCox(10, lambda = 0.5)
y
## [1] 4.324555
## attr(,"lambda")
## [1] 0.5
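To make the output above less opaque, here is a minimal sketch of the standard Box-Cox definition for a positive value y: (y^lambda - 1) / lambda when lambda is not zero, and log(y) when lambda is zero. The box_cox() helper is a hypothetical name used only for this illustration; it is not a function from forecast or recipes.

```r
# Hypothetical helper mirroring the standard Box-Cox definition (y must be > 0)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox(10, lambda = 0.5)  # (sqrt(10) - 1) / 0.5, matching the value above
box_cox(10, lambda = 0)    # identical to log(10)
box_cox(10, lambda = 1)    # only a shift by one: 10 - 1
```

Seen this way, the log transformation is simply the lambda = 0 member of the family, and lambda = 1 leaves the shape of the distribution unchanged.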
To convert the transformed values back to their original scale, we apply the inverse transformation.
inv_box_cox <- function(x, lambda) {
  # lambda = 0 corresponds to the log transformation, so its inverse is exp()
  if (lambda == 0) exp(x)
  else (lambda * x + 1)^(1/lambda)
}
inv_box_cox(y, 0.5)
## [1] 10
## attr(,"lambda")
## [1] 0.5
Missing values are a common issue in real-world datasets.
Before applying machine learning models, it is important to understand:
• how many missing values exist
• where they occur
• how they should be handled
There are several strategies for handling missing data, including deleting the affected observations or features and imputing the missing values.
The raw Ames dataset contains missing values.
ames_raw <- AmesHousing::ames_raw
sum(is.na(ames_raw))
## [1] 13997
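A single total says little about where the gaps are. The sketch below shows one way the count could be broken down by variable and by row; it uses a small made-up data frame (toy) rather than ames_raw so that the results are easy to check by eye.

```r
# Toy data frame (hypothetical values) with a few missing entries
toy <- data.frame(
  lot_area = c(8450, NA, 11250, 9550),
  alley    = c(NA, NA, "Grvl", NA),
  price    = c(208500, 181500, 223500, 140000)
)

sum(is.na(toy))                          # total number of missing cells
missing_by_col <- colSums(is.na(toy))    # missing cells per variable
sort(missing_by_col, decreasing = TRUE)  # variables with the most NAs first
mean(!complete.cases(toy))               # share of rows with at least one NA
```

The same calls should work unchanged on ames_raw.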
Visualizing missing data helps identify patterns and understand which variables contain the most missing values.
ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
  geom_raster() +
  coord_flip() +
  scale_fill_grey(name = "",
                  labels = c("Present", "Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))
The visdat package provides convenient functions for visualizing missing data patterns.
vis_miss(ames_raw, cluster = TRUE)
Imputation replaces missing values with estimated values.
Common imputation strategies include median imputation, mode imputation, K-nearest neighbor (KNN) imputation, and tree-based imputation.
Median imputation replaces missing values with the median of the variable.
ames_recipe %>%
step_impute_median(Gr_Liv_Area)
For categorical variables, the most common value (mode) is used.
ames_recipe %>%
step_impute_mode(all_nominal())
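Since the processed Ames data contain no missing values, the two steps above have nothing to fill in there. The sketch below applies them to a small made-up data frame so that the effect is visible; the toy data and column names are assumptions for this illustration only.

```r
library(recipes)

# Toy data (hypothetical) with one missing numeric and one missing factor value
toy <- data.frame(
  y    = c(10, 20, 30, 40, 50),
  area = c(1, 2, NA, 4, 100),
  type = factor(c("a", "b", "a", NA, "a"))
)

imputed <- recipe(y ~ ., data = toy) %>%
  step_impute_median(area) %>%         # median of 1, 2, 4, 100 is 3
  step_impute_mode(all_nominal()) %>%  # most frequent level is "a"
  prep(training = toy) %>%
  bake(new_data = NULL)                # NULL returns the processed training set

imputed
```

Note that the median (3) is unaffected by the extreme value 100, which is one reason it is often preferred over the mean for skewed features.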
KNN imputation replaces missing values based on the values of the k most similar observations.
ames_recipe %>%
step_impute_knn(all_predictors(), neighbors = 6)
Tree-based imputation predicts missing values using bagged decision trees fitted on the other predictors.
ames_recipe %>%
step_impute_bag(all_predictors())
Datasets often contain features that provide little or no predictive value. Removing these features can improve model performance and reduce computation time. One common method is identifying near-zero variance predictors.
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
tibble::rownames_to_column() %>%
dplyr::filter(nzv)
ames_recipe %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors())
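The metrics reported by nearZeroVar() can be approximated by hand. The sketch below is a simplified, hypothetical re-implementation for a single vector. It assumes caret's default cutoffs (freqCut = 95/5 and uniqueCut = 10) and does not reproduce caret's separate handling of constant columns.

```r
# Hypothetical helper approximating the two metrics on a single vector
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of all observations
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freq_ratio     = freq_ratio,
    percent_unique = percent_unique,
    nzv            = freq_ratio > freq_cut && percent_unique <= unique_cut
  )
}

mostly_zero <- c(rep(0, 98), 1, 2)  # 98 zeros and two other values
varied      <- 1:100                # every value distinct

nzv_metrics(mostly_zero)$nzv  # TRUE: ratio 98, 3% unique values
nzv_metrics(varied)$nzv       # FALSE: ratio 1, 100% unique values
```

A feature is flagged only when both conditions hold: one value dominates and there are few distinct values overall.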
Numeric features may require transformation to improve model performance.
Common techniques include:
• reducing skewness
• standardizing features
• scaling variables
The Yeo-Johnson transformation is commonly used to reduce skewness in numeric variables. Unlike Box-Cox, it also accepts zero and negative values.
recipe(Sale_Price ~ ., data = ames_train) %>%
step_YeoJohnson(all_numeric_predictors())
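For reference, the Yeo-Johnson formula with a fixed lambda can be sketched in a few lines. The yeo_johnson() helper below is a hypothetical name for illustration; step_YeoJohnson() additionally estimates a lambda for each variable during prep(), which this sketch does not attempt.

```r
# Hypothetical helper implementing the Yeo-Johnson formula for a fixed lambda
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1
  if (lambda == 0) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }
  # Negative values: mirrored Box-Cox of (1 - y) with parameter 2 - lambda
  if (lambda == 2) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -(((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda))
  }
  out
}

x <- c(-3, -0.5, 0, 0.5, 3)
yeo_johnson(x, lambda = 1)  # lambda = 1 leaves the values unchanged
yeo_johnson(x, lambda = 0)  # log-like compression of the positive values
```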
Standardization rescales numeric features so that they have:
• mean = 0
• standard deviation = 1
This is particularly important for models sensitive to feature scale, such as k-nearest neighbors, support vector machines, regularized regression, and neural networks.
ames_recipe %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes())
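The arithmetic behind these two steps is simple enough to sketch in base R. The numbers below are made up; the point of the example is that the mean and standard deviation are learned from the training values and then reused for new data, which is what a prepped recipe does when it is baked on a test set.

```r
# Hypothetical training and test values for one numeric feature
train <- c(1200, 1500, 1800, 2400, 3100)
test  <- c(1000, 2000)

# Learn the centering and scaling constants from the training data only
mu    <- mean(train)
sigma <- sd(train)

train_std <- (train - mu) / sigma  # mean 0, standard deviation 1
test_std  <- (test - mu) / sigma   # reuses the training constants

train_std
test_std
```

Recomputing the mean and standard deviation on the test set instead would leak information about the test data and make the two sets inconsistent.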