1 Introduction

This report reproduces several examples from Chapter 3 of the book Hands-On Machine Learning with R written by Bradley Boehmke and Brandon Greenwell. The main purpose of this assignment is to implement a variety of feature engineering techniques that are commonly used during the data preprocessing stage in machine learning workflows.

Specifically, this report covers Sections 3.1 to 3.5, which introduce the following preprocessing steps:

  • Loading required libraries and datasets
  • Transforming the response variable
  • Handling missing values
  • Identifying non-informative features
  • Performing numeric feature transformations

The Ames Housing dataset is used throughout the examples. This dataset contains housing attributes from Ames, Iowa and is commonly used for predictive modeling tasks such as estimating house sale prices.

2 Required Libraries

Before performing any feature engineering, we load the libraries used for data manipulation, visualization, and preprocessing.

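# Data manipulation and visualization
library(dplyr)
library(ggplot2)
library(visdat)

# Feature engineering
library(caret)
library(recipes)

# Data and helper functions (melt(), BoxCox())
library(AmesHousing)
library(reshape2)
library(forecast)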

3 Dataset Preparation

The data are loaded with make_ames() from the AmesHousing package, which returns a processed version of the Ames Housing data. Its attributes are used to predict house sale prices.

To understand the structure of the dataset, we display its dimensions and preview the first few observations.

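# make_ames() returns a processed version of the Ames Housing data
ames <- make_ames()

# No train/test split is made here: the full dataset serves as the training data
ames_train <- ames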

dim(ames_train)
## [1] 2930   81
head(ames_train)
glimpse(ames_train)
## Rows: 2,930
## Columns: 81
## $ MS_SubClass        <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning          <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Qual       <fct> Above_Average, Average, Above_Average, Good, Averag…
## $ Overall_Cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Qual         <fct> Typical, Typical, Typical, Good, Typical, Typical, …
## $ Exter_Cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Qual          <fct> Typical, Typical, Typical, Typical, Good, Typical, …
## $ Bsmt_Cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Low_Qual_Fin_SF    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Kitchen_Qual       <fct> Typical, Typical, Good, Excellent, Typical, Good, G…
## $ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Fireplace_Qu       <fct> Good, No_Fireplace, No_Fireplace, Typical, Typical,…
## $ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Qual        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Garage_Cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…

4 Target Engineering

Target engineering refers to transforming the response variable in order to improve model performance. Parametric models such as ordinary linear regression assume normally distributed prediction errors, an assumption that is often violated when the response is strongly skewed, as sale prices typically are. Transformations such as the logarithm or the Box-Cox transformation can reduce this skewness.

4.1 Log Transformation

A common technique to reduce skewness is applying a logarithmic transformation.

transformed_response <- log(ames_train$Sale_Price)

summary(transformed_response)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.456  11.771  11.983  12.021  12.271  13.534
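
To see why the log helps, the following minimal sketch compares the sample skewness before and after the transformation. It uses simulated right-skewed data instead of Sale_Price so that it runs in base R without any packages. The skewness() helper and the simulated price vector are illustrative assumptions, not part of the book's code.

```r
# Hypothetical helper: moment-based sample skewness (base R only)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(123)
# Simulated right-skewed "prices" (an assumption standing in for Sale_Price)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

skew_raw <- skewness(price)       # clearly positive: long right tail
skew_log <- skewness(log(price))  # close to zero after the log

# Values on the log scale are mapped back to the original scale with exp()
back_transformed <- exp(log(price))
```

The same exp() back-transformation applies to predictions from a model fitted on the log scale.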

4.2 Log Transformation Using Recipes

Instead of manually transforming the response variable, the recipes package allows us to define preprocessing steps that can later be applied during model training.

ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())

ames_recipe
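
Printing the recipe only shows the blueprint. The sketch below, on a small made-up data frame (the toy values and object names are assumptions for illustration), suggests that once the recipe is prepared and baked, the outcome matches the manual log() transformation.

```r
library(recipes)

# Toy data (hypothetical values, not taken from the Ames data)
toy <- data.frame(
  price = c(100000, 150000, 300000, 750000),
  area  = c(900, 1200, 2000, 3500)
)

# Blueprint: log-transform the outcome
toy_recipe <- recipe(price ~ area, data = toy) %>%
  step_log(all_outcomes())

# prep() estimates the steps on the training data; bake() applies them
toy_prepped <- prep(toy_recipe, training = toy)
toy_baked   <- bake(toy_prepped, new_data = NULL)

# The baked outcome should equal the manual transformation
same_as_manual <- isTRUE(all.equal(toy_baked$price, log(toy$price)))
```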

4.3 Log Transformation Example

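The logarithm is defined only for positive numbers: log() returns NaN for negative values and -Inf for zero. When the non-positive values are small (greater than -1), log1p() can be used instead. It computes log(1 + x), so an offset of one is added before the log is taken.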

log(-0.5)
## [1] NaN
log1p(-0.5)
## [1] -0.6931472
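
As a small base R illustration (the vector x is made up for this example), the sketch below shows the behavior at zero and the matching inverse function expm1(), which undoes log1p().

```r
x <- c(0, 0.5, 10)

log_at_zero <- log(0)     # -Inf: the plain log is not finite at zero
log1p_vals  <- log1p(x)   # log(1 + x): finite for every x > -1

# expm1() computes exp(x) - 1, the inverse of log1p()
recovered <- expm1(log1p_vals)
```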

4.4 Box-Cox Transformation

Another transformation technique that can stabilize variance and reduce skewness is the Box-Cox transformation.

y <- forecast::BoxCox(10, lambda = 0.5)

y
## [1] 4.324555
## attr(,"lambda")
## [1] 0.5
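
For positive values, the Box-Cox transformation is (x^lambda - 1) / lambda when lambda is not zero, and log(x) when lambda equals zero. The sketch below writes this formula out by hand to show where the value printed above comes from. The helper box_cox_manual() is a hypothetical name used for illustration and is not part of the forecast package.

```r
# Hypothetical helper implementing the Box-Cox formula for positive x
box_cox_manual <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

bc_half <- box_cox_manual(10, lambda = 0.5)  # (sqrt(10) - 1) / 0.5
bc_zero <- box_cox_manual(10, lambda = 0)    # reduces to log(10)
bc_one  <- box_cox_manual(10, lambda = 1)    # only a shift: 10 - 1
```

With lambda = 1 the data are merely shifted, so values of lambda further from 1 correspond to stronger transformations.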

4.5 Inverse Box-Cox Transformation

To convert the transformed values back to their original scale, an inverse Box-Cox function can be applied.

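# Inverse of the Box-Cox transformation for a given lambda
inv_box_cox <- function(x, lambda) {
  if (lambda == 0) {
    exp(x)                         # lambda = 0: inverse of log(x)
  } else {
    (lambda * x + 1)^(1 / lambda)  # otherwise: inverse of (x^lambda - 1) / lambda
  }
}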

inv_box_cox(y, 0.5)
## [1] 10
## attr(,"lambda")
## [1] 0.5

5 Dealing with Missing Values

Missing values are a common challenge in real-world datasets. Before building machine learning models, it is important to identify and understand where missing values occur.

5.1 Checking Missing Values

ames_raw <- AmesHousing::ames_raw

sum(is.na(ames_raw))
## [1] 13997
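
The total alone does not show where the missing values are. A minimal base R sketch of a per-variable breakdown follows. It uses a small made-up data frame (an assumption for illustration) so that it runs without the AmesHousing package; the same calls can be applied to ames_raw.

```r
# Toy data frame with missing values (hypothetical, not the Ames data)
toy <- data.frame(
  area   = c(900, NA, 2000, 3500),
  alley  = c(NA, NA, "Pave", NA),
  garage = c(1, 2, 2, 0)
)

total_missing  <- sum(is.na(toy))       # overall count
missing_by_col <- colSums(is.na(toy))   # count per variable
missing_share  <- colMeans(is.na(toy))  # proportion per variable

# Variables sorted from most to least missing
sorted_missing <- sort(missing_by_col, decreasing = TRUE)
```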

6 Visualizing Missing Values

Visualizing missing data can help reveal patterns and determine which variables contain missing values.

ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill=value)) +
  geom_raster() +
  coord_flip() +
  scale_fill_grey(name = "",
                  labels = c("Present","Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))

6.1 Using the visdat Package

vis_miss(ames_raw, cluster = TRUE)

7 Imputation

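Imputation replaces missing values with estimates based on the available data. The recipes below only define the imputation steps; no values are imputed until the recipe is estimated with prep() and applied with bake().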

7.1 Median Imputation

ames_recipe %>%
  step_impute_median(Gr_Liv_Area)
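
To show what the step does once the recipe is estimated, here is a minimal sketch on a made-up data frame with one missing value (the toy data and object names are assumptions for illustration).

```r
library(recipes)

# Toy data: one missing value in the predictor (hypothetical values)
toy <- data.frame(
  price = c(100, 120, 150, 180, 210),
  area  = c(10, NA, 30, 40, 50)
)

impute_recipe <- recipe(price ~ area, data = toy) %>%
  step_impute_median(area)

# prep() computes the median of the observed values (10, 30, 40, 50)
impute_prepped <- prep(impute_recipe, training = toy)

# bake() fills the missing entry with that median
imputed <- bake(impute_prepped, new_data = NULL)
imputed$area
```

The median is computed from the training data only, so the same value is reused when the prepared recipe is later applied to new data.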

7.2 Mode Imputation

ames_recipe %>%
  step_impute_mode(all_nominal())

7.3 K-Nearest Neighbor Imputation

ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 6)

7.4 Tree-Based Imputation

ames_recipe %>%
  step_impute_bag(all_predictors())

8 Feature Filtering

Feature filtering aims to remove variables that provide little or no predictive value. Removing these variables can simplify the model and reduce computational cost.

8.1 Detecting Near-Zero Variance Predictors

caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  dplyr::filter(nzv)
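
The nzv flag is based on two diagnostics: the frequency ratio (count of the most common value divided by the count of the second most common) and the percentage of unique values. The base R sketch below is a simplified approximation of that rule using caret's documented default cutoffs (freqCut = 95/5, uniqueCut = 10). The helper nzv_metrics() and the example vectors are hypothetical and do not reproduce caret's implementation.

```r
# Hypothetical helper approximating the nearZeroVar() diagnostics
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  n_unique <- length(counts)
  # Most common value relative to the second most common value
  freq_ratio <- if (n_unique > 1) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * n_unique / length(x)
  zero_var <- n_unique == 1
  data.frame(
    freq_ratio = freq_ratio,
    percent_unique = percent_unique,
    zero_var = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Almost always zero: flagged as near-zero variance
pool_like <- c(rep(0, 98), 500, 800)
pool_metrics <- nzv_metrics(pool_like)

# Every value distinct: not flagged
lot_like <- seq(5000, 14900, by = 100)
lot_metrics <- nzv_metrics(lot_like)
```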

8.2 Removing Low Variance Features

ames_recipe %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())

9 Numeric Feature Engineering

Numeric variables sometimes require transformation to improve modeling performance.

9.1 Skewness Reduction

The Yeo-Johnson transformation is commonly used to reduce skewness in numeric predictors. Unlike Box-Cox, it also accepts zero and negative values.

skew_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors())

skew_recipe

tidy(skew_recipe)

In the tidy() output, the trained column for the Yeo-Johnson step is still FALSE because the recipe has only been defined and has not yet been estimated with prep().

The recipes workflow is:

recipe() → step_*() → prep() → bake()
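
The sketch below walks through these four stages on simulated data and checks the trained flag before and after prep(). The data frames and object names are assumptions made for illustration; the example is not taken from the book.

```r
library(recipes)

set.seed(123)
# Simulated data with a right-skewed predictor (hypothetical)
train   <- data.frame(y = rnorm(200), x = rlnorm(200))
new_obs <- data.frame(x = c(0.5, 1, 5))

# recipe() + step_*(): define the blueprint
yj_recipe <- recipe(y ~ x, data = train) %>%
  step_YeoJohnson(all_numeric_predictors())

trained_before <- tidy(yj_recipe)$trained   # not yet estimated

# prep(): estimate the transformation parameters on the training data
yj_prepped <- prep(yj_recipe, training = train)
trained_after <- tidy(yj_prepped)$trained

# Estimated lambda for each transformed variable
tidy(yj_prepped, number = 1)

# bake(): apply the trained steps to the training data or to new data
train_baked <- bake(yj_prepped, new_data = NULL)
new_baked   <- bake(yj_prepped, new_data = new_obs)
```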

9.2 Standardization

Standardization rescales numeric variables so that they have a mean of zero and a standard deviation of one.

This preprocessing step is especially important for algorithms that are sensitive to feature scale, such as KNN, SVM, and neural networks.

std_recipe <- ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

std_recipe

tidy(std_recipe)
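
As a rough base R illustration of what centering and scaling compute, the sketch below standardizes a simulated variable by hand. The vectors train_x and new_x are made up for this example. The mean and standard deviation are estimated on the training values and then reused for new values, which corresponds conceptually to the prep() and bake() stages of a recipe.

```r
set.seed(123)
# Simulated predictor values (hypothetical)
train_x <- rnorm(100, mean = 1500, sd = 500)
new_x   <- c(1000, 1500, 2500)

# Estimate the parameters on the training data only
center <- mean(train_x)
spread <- sd(train_x)

# Apply them to the training data and to new data
train_z <- (train_x - center) / spread
new_z   <- (new_x - center) / spread

# Base R's scale() gives the same result on the training data
matches_scale <- isTRUE(all.equal(as.numeric(scale(train_x)), train_z))
```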

10 Conclusion

This report demonstrated several important feature engineering techniques discussed in Chapter 3 of Hands-On Machine Learning with R.

The examples illustrated how preprocessing methods such as target transformations, missing value imputation, feature filtering, and numeric feature scaling can improve data quality prior to model training.

Proper preprocessing is an essential step in the machine learning pipeline, as it helps ensure that models receive well-structured and informative data.