Jayden Khalifa Armand
Student ID : 114035109
This report reproduces several examples from Chapter 3 of the book Hands-On Machine Learning with R written by Bradley Boehmke and Brandon Greenwell. The main purpose of this assignment is to implement a variety of feature engineering techniques that are commonly used during the data preprocessing stage in machine learning workflows.
Specifically, this report focuses on Sections 3.1 to 3.5, which introduce several important preprocessing strategies: target engineering, handling missing values, feature filtering, and numeric feature engineering.
The Ames Housing dataset is used throughout the examples. This dataset contains housing attributes from Ames, Iowa and is commonly used for predictive modeling tasks such as estimating house sale prices.
Before performing any feature engineering, we load the libraries used for data manipulation, visualization, and preprocessing.
library(dplyr)
library(ggplot2)
library(visdat)
library(caret)
library(recipes)
library(AmesHousing)
library(reshape2)
library(forecast)
To better understand the structure of the dataset, we display its dimensions and preview the first few observations.
ames <- make_ames()
ames_train <- ames
dim(ames_train)
## [1] 2930 81
head(ames_train)
glimpse(ames_train)
## Rows: 2,930
## Columns: 81
## $ MS_SubClass <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Qual <fct> Above_Average, Average, Above_Average, Good, Averag…
## $ Overall_Cond <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Qual <fct> Typical, Typical, Typical, Good, Typical, Typical, …
## $ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Qual <fct> Typical, Typical, Typical, Typical, Good, Typical, …
## $ Bsmt_Cond <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Low_Qual_Fin_SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Kitchen_Qual <fct> Typical, Typical, Good, Excellent, Typical, Good, G…
## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Fireplace_Qu <fct> Good, No_Fireplace, No_Fireplace, Typical, Typical,…
## $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Qual <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Garage_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
Target engineering refers to transforming the response variable in order to improve model performance. Parametric models such as ordinary linear regression assume that the prediction errors are approximately normally distributed, and a strongly skewed response often violates that assumption. Because real-world responses such as sale prices are frequently right-skewed, a logarithmic or Box-Cox transformation can help address this issue.
A common technique to reduce skewness is applying a logarithmic transformation.
transformed_response <- log(ames_train$Sale_Price)
summary(transformed_response)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.456 11.771 11.983 12.021 12.271 13.534
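To see why the log transformation helps, one can compare a skewness estimate before and after applying it. The sketch below is only an illustration: it uses simulated right-skewed values rather than the Ames sale prices, and `sample_skewness()` is a hypothetical helper written for this example, not a function from the book or from any package.

```r
# Hypothetical helper: moment-based sample skewness (not a package function).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(123)
# Simulated right-skewed "prices" (assumption: log-normal, for illustration only).
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

skew_raw <- sample_skewness(prices)
skew_log <- sample_skewness(log(prices))

# The raw values are right-skewed; the logged values are much closer to symmetric.
c(raw = skew_raw, logged = skew_log)
```

A skewness near zero indicates a roughly symmetric distribution, which is what the log transformation aims for on right-skewed data.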
Instead of manually transforming the response variable, the recipes package allows us to define preprocessing steps that can later be applied during model training.
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())
ames_recipe
The logarithm is undefined for negative numbers, so log() returns NaN for them (and -Inf for zero). When the nonpositive values are small (greater than -1), log1p(), which computes log(1 + x), can be used instead.
log(-0.5)
## [1] NaN
log1p(-0.5)
## [1] -0.6931472
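The behavior of the two functions at and around zero can be compared side by side. This is a minimal base R sketch with made-up input values:

```r
# log() is undefined for negatives (NaN, with a warning) and gives -Inf at zero.
x <- c(-0.5, 0, 0.5)
log_x <- suppressWarnings(log(x))

# log1p(x) computes log(1 + x), so it is defined for every x > -1.
log1p_x <- log1p(x)

data.frame(x = x, log = log_x, log1p = log1p_x)
```

Note that log1p() shifts the input by one, so the transformed values are not on the same scale as log(x); the inverse is expm1(), not exp().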
Another transformation technique that can stabilize variance and reduce skewness is the Box-Cox transformation.
y <- forecast::BoxCox(10, lambda = 0.5)
y
## [1] 4.324555
## attr(,"lambda")
## [1] 0.5
To convert the transformed values back to their original scale, an inverse Box-Cox function can be applied.
inv_box_cox <- function(x, lambda) {
  # lambda = 0 corresponds to the log transform, so its inverse is exp()
  if (lambda == 0) exp(x)
  else (lambda * x + 1)^(1 / lambda)
}
inv_box_cox(y, 0.5)
## [1] 10
## attr(,"lambda")
## [1] 0.5
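The forward transformation can also be written out by hand to make the round trip explicit. The sketch below assumes the standard Box-Cox definition for positive values, (x^lambda - 1) / lambda, with log(x) when lambda is 0; `box_cox()` is a hypothetical helper for illustration and is not the implementation used inside the forecast package.

```r
# Hypothetical helper: Box-Cox transform for positive x (base R only).
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform, matching the definition used in the text.
inv_box_cox <- function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda * x + 1)^(1 / lambda)
}

prices <- c(10, 100, 1000)  # illustrative values, not Ames data

for (lambda in c(0, 0.5, 2)) {
  transformed <- box_cox(prices, lambda)
  recovered <- inv_box_cox(transformed, lambda)
  # The round trip should return the original values up to floating-point error.
  stopifnot(isTRUE(all.equal(recovered, prices)))
}

# Agrees with the value 4.324555 shown for lambda = 0.5.
box_cox(10, lambda = 0.5)
```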
Missing values are a common challenge in real-world datasets. Before building machine learning models, it is important to identify and understand where missing values occur.
ames_raw <- AmesHousing::ames_raw
sum(is.na(ames_raw))
## [1] 13997
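A single total does not show where the missing values are concentrated. One way to break the count down by variable is `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`toy`) in place of `ames_raw`, so the column names and values are assumptions for illustration only.

```r
# Toy data frame standing in for ames_raw (values are made up).
toy <- data.frame(
  lot_area = c(8450, NA, 11250, 9550),
  alley    = c(NA, NA, "Grvl", NA),
  price    = c(208500, 181500, 223500, 140000)
)

# Total number of missing cells, as sum(is.na(...)) reports.
total_missing <- sum(is.na(toy))

# Missing cells per column, largest first.
missing_by_column <- sort(colSums(is.na(toy)), decreasing = TRUE)
missing_by_column
```

The same two calls can be applied to `ames_raw` to rank its variables by the number of missing entries.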
Visualizing missing data can help reveal patterns and determine which variables contain missing values.
ames_raw %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill = value)) +
  geom_raster() +
  coord_flip() +
  scale_fill_grey(name = "",
                  labels = c("Present", "Missing")) +
  xlab("Observation") +
  theme(axis.text.y = element_text(size = 4))
### Using visdat Package
vis_miss(ames_raw, cluster = TRUE)
Imputation is a technique used to replace missing values with estimated values based on available data.
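# Median imputation for a single numeric predictor
ames_recipe %>%
  step_impute_median(Gr_Liv_Area)
# Mode imputation for categorical variables
ames_recipe %>%
  step_impute_mode(all_nominal())
# K-nearest-neighbor imputation using 6 neighbors
ames_recipe %>%
  step_impute_knn(all_predictors(), neighbors = 6)
# Bagged-tree imputation
ames_recipe %>%
  step_impute_bag(all_predictors())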
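The steps above only declare the imputation; the replacement values are estimated by prep() and applied by bake(). The following is a minimal sketch on a made-up data frame (`toy` and its columns are assumptions, not Ames variables) that shows median and mode imputation end to end.

```r
library(recipes)

# Toy training data with one missing numeric and one missing categorical value.
toy <- data.frame(
  price = c(100, 150, 200, 250, 300),
  area  = c(1000, NA, 1400, 1800, 2200),
  type  = factor(c("house", "house", NA, "condo", "house"))
)

impute_recipe <- recipe(price ~ ., data = toy) %>%
  step_impute_median(area) %>%  # fill numeric NA with the training median
  step_impute_mode(type)        # fill categorical NA with the most frequent level

# prep() estimates the median and mode; bake() applies them.
imputed <- impute_recipe %>%
  prep(training = toy) %>%
  bake(new_data = NULL)

imputed
```

In this toy example the missing `area` becomes the median of the four observed values and the missing `type` becomes the most common level.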
Feature filtering aims to remove variables that provide little or no predictive value. Removing these variables can simplify the model and reduce computational cost.
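# List the variables flagged as zero or near-zero variance
caret::nearZeroVar(ames_train, saveMetrics = TRUE) %>%
  tibble::rownames_to_column() %>%
  dplyr::filter(nzv)
# Add zero and near-zero variance filters as recipe steps
ames_recipe %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())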
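To make the filtering rule concrete, the sketch below runs `caret::nearZeroVar()` on a made-up data frame with one constant column, one nearly constant column, and one varied column. It relies on caret's default thresholds (a frequency ratio above 95/5 together with at most 10% unique values); the data and column names are assumptions for illustration.

```r
library(caret)

# Toy data: one constant column, one nearly constant column, one varied column.
toy <- data.frame(
  const  = rep(1, 100),
  rare   = c(rep(0, 98), 1, 1),
  varied = 1:100
)

metrics <- nearZeroVar(toy, saveMetrics = TRUE)
metrics

# Names of columns flagged as zero or near-zero variance.
flagged <- rownames(metrics)[metrics$nzv]
flagged
```

Here `const` is flagged because it has a single unique value, and `rare` is flagged because its most common value is 49 times more frequent than the next one while only 2% of its values are unique.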
Numeric variables sometimes require transformation to improve modeling performance.
The Yeo-Johnson transformation is commonly used to reduce skewness in numeric predictors.
skew_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric_predictors())
skew_recipe
tidy(skew_recipe)
The tidy() output reports trained = FALSE for the Yeo-Johnson step because the recipe has not yet been passed to prep().
The recipes workflow follows this order:
recipe() → step_*() → prep() → bake()
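The full sequence can be sketched on a small simulated data set. The data frame `toy` and its columns are assumptions made for illustration; the point is only to show how the `trained` flag reported by tidy() changes once prep() has been run.

```r
library(recipes)

set.seed(42)
# Simulated data (assumption): a right-skewed predictor and a numeric outcome.
toy <- data.frame(
  price = rnorm(50, mean = 200, sd = 20),
  area  = rlnorm(50, meanlog = 7, sdlog = 0.5)
)

# 1. recipe() + step_*(): define the blueprint; nothing is estimated yet.
blueprint <- recipe(price ~ ., data = toy) %>%
  step_YeoJohnson(all_numeric_predictors())
trained_before <- tidy(blueprint)$trained

# 2. prep(): estimate the Yeo-Johnson parameter from the training data.
prepped <- prep(blueprint, training = toy)
trained_after <- tidy(prepped)$trained

# 3. bake(): apply the trained transformation to data.
baked <- bake(prepped, new_data = toy)

c(before = trained_before, after = trained_after)
```

Keeping estimation (prep) separate from application (bake) means the parameters learned on training data can later be applied unchanged to new data.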
Standardization rescales numeric variables so that they have a mean of zero and a standard deviation of one.
This preprocessing step is especially important for algorithms that are sensitive to feature scale, such as KNN, SVM, and neural networks.
std_recipe <- ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())
std_recipe
tidy(std_recipe)
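The arithmetic behind centering and scaling is simple enough to verify by hand. The sketch below uses made-up values in base R rather than the recipe itself, so it illustrates the calculation and not the recipes implementation.

```r
# Illustrative values only (not Ames data).
area <- c(900, 1200, 1500, 1800, 2100)

# Centering subtracts the mean; scaling divides by the standard deviation.
area_std <- (area - mean(area)) / sd(area)

# Base R's scale() performs the same two operations.
area_scale <- as.numeric(scale(area))

round(c(mean = mean(area_std), sd = sd(area_std)), 10)
```

After standardization every numeric predictor is expressed in standard-deviation units, so no single variable dominates distance-based calculations merely because of its measurement scale.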
This report demonstrated several important feature engineering techniques discussed in Chapter 3 of Hands-On Machine Learning with R.
The examples illustrated how preprocessing methods such as target transformations, missing value imputation, feature filtering, and numeric feature scaling can improve data quality prior to model training.
Proper preprocessing is an essential step in the machine learning pipeline, as it helps ensure that models receive well-structured and informative data.