6. Feature engineering with recipes
Feature engineering encompasses activities that reformat predictor values to make them easier for a model to use effectively.
This includes transforming and encoding the data to best represent their important characteristics.
There are many other examples of preprocessing to build better features for modeling:
Correlation between predictors can be reduced via feature extraction or the removal of some predictors.
When some predictors have missing values, they can be imputed using a sub-model.
Models that use variance-type measures may benefit from coercing the distribution of some skewed predictors to be symmetric by estimating a transformation.
Different models have different preprocessing requirements and some, such as tree-based models, require very little preprocessing at all.
In this chapter, we introduce the recipes package, which you can use to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets.
This chapter uses the Ames housing data and the R objects created in the book so far.
In this section, we will focus on a small subset of the predictors available in the Ames housing data:
The neighborhood (qualitative, with 29 neighborhoods in the training set)
The general living area (continuous, named Gr_Liv_Area)
The year built (Year_Built)
The type of building (Bldg_Type, with values OneFam, TwoFmCon, Duplex, Twnhs, and TwnhsE)
Suppose that an initial ordinary linear regression model were fit to these data. Recalling that, in Chapter 4, the sale prices were pre-logged, a standard call to lm() might look like:
library(tidymodels)
## -- Attaching packages --------
## v broom 0.7.0 v recipes 0.1.13
## v dials 0.0.8 v rsample 0.0.7
## v dplyr 1.0.0 v tibble 3.0.3
## v ggplot2 3.3.2 v tidyr 1.1.0
## v infer 0.5.3 v tune 0.1.1
## v modeldata 0.0.2 v workflows 0.1.2
## v parsnip 0.1.2 v yardstick 0.0.7
## v purrr 0.3.4
## -- Conflicts -----------------
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x recipes::step() masks stats::step()
setwd('C:/Users/DellPC/Desktop/Corner/R_source_code/Julia_Silge/tidy_model_R_book')
ames <- read.csv('ames.csv')
ames_split <- initial_split(ames, prop = 0.75)  # 75% training / 25% testing
ames_split
## <Analysis/Assess/Total>
## <2198/732/2930>
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area)+ Year_Built + Bldg_Type, data = ames)
##
## Call:
## lm(formula = Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) +
## Year_Built + Bldg_Type, data = ames)
##
## Coefficients:
##                                          (Intercept)  -0.918763
##                                  NeighborhoodBlueste  -0.008016
##                                NeighborhoodBriardale  -0.086835
##                                NeighborhoodBrookside  -0.051273
##                              NeighborhoodClear_Creek   0.007754
##                            NeighborhoodCollege_Creek  -0.026704
##                                 NeighborhoodCrawford   0.051173
##                                  NeighborhoodEdwards  -0.089570
##                                  NeighborhoodGilbert  -0.073986
##                              NeighborhoodGreen_Hills   0.182670
##                                   NeighborhoodGreens   0.134898
##                   NeighborhoodIowa_DOT_and_Rail_Road  -0.124235
##                                 NeighborhoodLandmark  -0.052245
##                           NeighborhoodMeadow_Village  -0.130165
##                                 NeighborhoodMitchell  -0.039783
##                               NeighborhoodNorth_Ames  -0.040253
##                          NeighborhoodNorthpark_Villa  -0.014048
##                               NeighborhoodNorthridge   0.044334
##                       NeighborhoodNorthridge_Heights   0.093323
##                           NeighborhoodNorthwest_Ames  -0.039794
##                                 NeighborhoodOld_Town  -0.069213
##                                   NeighborhoodSawyer  -0.044531
##                              NeighborhoodSawyer_West  -0.057711
##                                 NeighborhoodSomerset   0.009713
##  NeighborhoodSouth_and_West_of_Iowa_State_University  -0.068506
##                              NeighborhoodStone_Brook   0.105713
##                               NeighborhoodTimberland   0.020153
##                                  NeighborhoodVeenker   0.048301
##                                   log10(Gr_Liv_Area)   0.634400
##                                           Year_Built   0.002068
##                                      Bldg_TypeOneFam   0.103844
##                                       Bldg_TypeTwnhs   0.006958
##                                      Bldg_TypeTwnhsE   0.062351
##                                    Bldg_TypeTwoFmCon   0.072614
When this call is executed, several things happen. Sale price is defined as the outcome, while the neighborhood, general living area, year built, and building type variables are all defined as predictors.
A log transformation is applied to the general living area predictor.
The neighborhood and building type columns are converted from a non-numeric format to a numeric format (since least squares requires numeric predictors).
As mentioned in Chapter 3, the formula method will apply these data manipulations to any data, including new data, that are passed to the predict() function.
library(tidymodels)
simple_ames <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_dummy(all_nominal())
simple_ames <- prep(simple_ames, training = ames_train, retain = TRUE)
simple_ames
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 4
##
## Training data contained 2198 data points and no missing data.
##
## Operations:
##
## Log transformation on Gr_Liv_Area [trained]
## Dummy variables from Neighborhood, Bldg_Type [trained]
Note that, after preparing the recipe, the print output shows the results of the selectors (e.g., Neighborhood and Bldg_Type are listed instead of all_nominal()).
One important argument to prep() is retain. When TRUE (the default), the prepared version of the training set is kept within the recipe; this data set has been preprocessed using all of the steps listed in the recipe. Retaining it avoids recomputing those steps later but, for a large training set, it can use a substantial amount of memory.
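If memory is a concern, the processed copy can be dropped; a minimal sketch (commented out here, since later code relies on the retained training set):
# prep(simple_ames, training = ames_train, retain = FALSE)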
The third phase of recipe usage is to apply the preprocessing operations to a data set using the bake() function. The bake() function can apply the recipe to any data set. To use the test set, the syntax would be:
test_ex <- bake(simple_ames, new_data = ames_test)
names(test_ex) %>% head()
## [1] "Gr_Liv_Area" "Year_Built" "Sale_Price"
## [4] "Neighborhood_Blueste" "Neighborhood_Briardale" "Neighborhood_Brookside"
Note the dummy variable columns starting with Neighborhood_. The bake() function can also take selectors so that, if we only wanted the neighborhood results, we could use:
bake(simple_ames, ames_test, starts_with('Neighborhood_'))
## # A tibble: 732 x 27
## Neighborhood_Bl~ Neighborhood_Br~ Neighborhood_Br~ Neighborhood_Cl~
## <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 1 0 0
## # ... with 722 more rows, and 23 more variables:
## # Neighborhood_College_Creek <dbl>, Neighborhood_Crawford <dbl>,
## # Neighborhood_Edwards <dbl>, Neighborhood_Gilbert <dbl>,
## # Neighborhood_Green_Hills <dbl>, Neighborhood_Greens <dbl>,
## # Neighborhood_Iowa_DOT_and_Rail_Road <dbl>, Neighborhood_Landmark <dbl>,
## # Neighborhood_Meadow_Village <dbl>, Neighborhood_Mitchell <dbl>,
## # Neighborhood_North_Ames <dbl>, Neighborhood_Northpark_Villa <dbl>,
## # Neighborhood_Northridge <dbl>, Neighborhood_Northridge_Heights <dbl>,
## # Neighborhood_Northwest_Ames <dbl>, Neighborhood_Old_Town <dbl>,
## # Neighborhood_Sawyer <dbl>, Neighborhood_Sawyer_West <dbl>,
## # Neighborhood_Somerset <dbl>,
## # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
## # Neighborhood_Stone_Brook <dbl>, Neighborhood_Timberland <dbl>,
## # Neighborhood_Veenker <dbl>
To get the processed version of the training set, we could use bake() and pass in ames_train but, as previously mentioned, this would repeat calculations that have already been executed. Instead, we can use new_data = NULL to quickly return the training set (if retain = TRUE was used); this accesses the data component of the prepared recipe.
bake(simple_ames, new_data = NULL) %>% nrow()
## [1] 2198
To reiterate, using a recipe is a three-phase process: first, recipe() defines the preprocessing operations; second, prep() estimates any required quantities from the training set; third, bake() applies the preprocessing to data sets.
Encoding qualitative data in a numeric format
One of the most common feature engineering tasks is transforming nominal or qualitative data (factors or characters) so that they can be encoded or represented numerically. Sometimes we can alter the factor levels of a qualitative column in helpful ways prior to such a transformation.
For example, step_unknown() can be used to change missing values to a dedicated factor level. Similarly, if we anticipate that a new factor level may be encountered in future data, step_novel() can allot a new level for this purpose.
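A minimal sketch applying both steps to Neighborhood (hypothetical here, since this column has no missing values in our training set):
# handle missing values and factor levels not seen during training
recipe(Sale_Price ~ Neighborhood, data = ames_train) %>%
  step_unknown(Neighborhood, new_level = "unknown") %>% # NA -> "unknown"
  step_novel(Neighborhood, new_level = "new")           # unseen levels -> "new"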
Additionally, step_other() can be used to analyze the frequencies of the factor levels in the training set and convert infrequently occurring values to a catch-all level of "other", based on a threshold that we specify. A good example is the Neighborhood predictor in our data:
ames_train %>% ggplot(aes(y = Neighborhood)) + geom_bar() + labs(y= NULL)
Here there are two neighborhoods with fewer than five properties in the training data; in this case, no houses at all in the Landmark neighborhood were included in the training set. For some models, it may be problematic to have dummy variables with a single nonzero entry in the column. At a minimum, it is highly improbable that these features would be important to a model. If we add step_other(Neighborhood, threshold = 0.01) to our recipe, the bottom 1% of the neighborhoods will be lumped into a new level called "other". In this training set, this will catch nine neighborhoods.
For the Ames data, we can amend the recipe to use:
simple_ames <- recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type, data = ames_train)%>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal())
Many, but not all, underlying model calculations require predictor values to be encoded as numbers. Notable exceptions include tree-based models, rule-based models, and naive Bayes models.
There are a few strategies for converting a factor predictor to a numeric format. The most common method is to create 'dummy' or indicator variables. Let's take the predictor in the Ames data for the building type, which is a factor variable with five levels. For dummy variables, the single Bldg_Type column would be replaced with four numeric columns whose values are either zero or one. These binary variables represent specific factor level values. In R, the convention is to exclude a column for the first factor level (OneFam): a column is created for TwoFmCon that is one when the row has that value and zero otherwise, and three other columns are similarly created for the remaining levels:
Raw Data   TwoFmCon  Duplex  Twnhs  TwnhsE
OneFam            0       0      0       0
TwoFmCon          1       0      0       0
Duplex            0       1      0       0
Twnhs             0       0      1       0
TwnhsE            0       0      0       1
The full set of encodings can be used for some models. This is traditionally called one-hot encoding and can be achieved using the one_hot argument of step_dummy().
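A minimal sketch (this small recipe is illustrative, separate from our running example):
# create all five Bldg_Type indicator columns instead of four
recipe(Sale_Price ~ Bldg_Type, data = ames_train) %>%
  step_dummy(Bldg_Type, one_hot = TRUE)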
One helpful feature of step_dummy() is that there is more control over how the resulting dummy variables are named. In base R, dummy variable names mash the variable name together with the level, resulting in names like NeighborhoodVeenker. Recipes, by default, use an underscore as the separator between the name and level (e.g., Neighborhood_Veenker), and there is an option to use custom formatting for the names. The default naming convention in recipes makes it easier to capture those new columns in future steps using a selector such as starts_with('Neighborhood_').
There are other methods for doing this transformation to a numeric format. Feature hashing methods only consider the value of the category to assign it to a predefined pool of dummy variables. This can be a good strategy when there are a large number of possible categories, but the statistical properties may not be optimal. For example, it may unnecessarily alias categories together (by assigning them to the same dummy variable). This reduces the specificity of the encoding and, if that dummy variable were important, it would be difficult to determine which of the categories is driving the effect.
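The textrecipes package provides a step for this; a hedged sketch, assuming step_dummy_hash() is available in your installed version:
# library(textrecipes)
# recipe(Sale_Price ~ Neighborhood, data = ames_train) %>%
#   step_dummy_hash(Neighborhood, num_terms = 16)  # pool of 16 hashed columns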
Interaction effects involve two or more predictors. Such an effect occurs when one predictor has an effect on the outcome that is contingent on one or more other predictors. For example, if you were trying to predict your morning commute time, two potential predictors could be the amount of traffic and the time of day. In this case, you could add an interaction term between the two predictors to the model along with the original two predictors (which are called the 'main effects'). Numerically, an interaction term between predictors is encoded as their product. Interactions are only defined in terms of their effect on the outcome and can be combinations of different types of data (e.g., numeric, categorical, etc.).
After exploring the Ames training set, we might find that the regression slopes for the general living area differ for different building types:
ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) +
  geom_point(alpha = .2) +
  facet_wrap(~ Bldg_Type) +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, col = 'red') +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = 'General Living Area', y = 'Sale Price (USD)')
How are interactions specified in a recipe? A base R formula specifies an interaction with a :, so we would use:
Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type + log10(Gr_Liv_Area):Bldg_Type
## Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type +
## log10(Gr_Liv_Area):Bldg_Type
# or
Sale_Price ~ Neighborhood + log10(Gr_Liv_Area)*Bldg_Type
## Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) * Bldg_Type
where * expands those columns to the main effects and interaction term. Again, the formula method does many things simultaneously and understands that a factor variable (such as Bldg_Type) should be expanded into dummy variables first and that the interaction should involve all of the resulting binary columns.
Recipes are more explicit and sequential, and they give you more control. With the current recipe, step_dummy() has already created dummy variables. How would we combine these for an interaction? The additional step would look like step_interact(~ interaction terms), where the terms on the right-hand side of the tilde are the interactions. These can include selectors, so it would be appropriate to use:
simple_ames <- recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ Gr_Liv_Area:starts_with('Bldg_Type_'))
simple_ames
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 4
##
## Operations:
##
## Log transformation on Gr_Liv_Area
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal()
## Interactions with Gr_Liv_Area:starts_with("Bldg_Type_")
Additional interactions can be specified in this formula by separating them with +. Also note that the recipe will only utilize interactions between different variables; if the formula uses var_1:var_1, this term will be ignored.
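For instance, a hypothetical second interaction (not part of our running recipe) could be added as:
# step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_") +
#                 Year_Built:starts_with("Bldg_Type_"))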
Suppose that, in a recipe, we had not yet made dummy variables for building types. It would be inappropriate to include a factor column in this step; the following should be avoided:
# avoid this: Bldg_Type is still a factor at this point
# step_interact(~ Gr_Liv_Area:Bldg_Type)
Order matters. The general living area is log transformed prior to creating the interaction term, so subsequent interactions with this variable will also use the log scale.
The sale price data are already log transformed in the ames data frame. Why not use:
#step_log(Sale_Price, base = 10)
This will cause a failure when the recipe is applied to new properties whose sale price is not known. Since price is what we are trying to predict, there probably won't be a column in the data for this variable. In fact, to avoid information leakage, many tidymodels packages isolate the data being used when making predictions. This means that the training set and any outcome columns are not available for use at prediction time.
For simple transformations of the outcome column(s), we strongly suggest that those operations be conducted outside of the recipe.
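A sketch of such an up-front transformation (commented out here, since the ames.csv used in this chapter already stores Sale_Price on the log10 scale):
# transform the outcome once, before splitting or building recipes
# ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))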
However, there are other circumstances where this is not an adequate solution. For example, in classification models where there is a severe class imbalance, it is common to conduct subsampling of the data that are given to the modeling function. For example, suppose that there were two classes and a 10% event rate.
The themis package has recipe steps that can be used for this purpose. For simple down-sampling, we would use:
#step_downsample(outcome_column_name)
The problem is that the same subsampling process should not be applied to the data being predicted. As a result, when using a recipe, we need a mechanism to ensure that some operations are only applied to the data that are given to the model. Each step function has an option called skip that, when set to TRUE, causes the step to be ignored by the bake() function when it is used with a new data set. In this way, you can isolate the steps that affect the modeling data without causing errors when they are applied to new samples. However, all steps are applied when using bake(new_data = NULL).
At the time of this writing, the step functions in the recipes and themis packages that are only applied to the modeling data are: step_adasyn(), step_bsmote(), step_downsample(), step_filter(), step_nearmiss(), step_rose(), step_sample(), step_slice(), step_smote(), step_tomek(), and step_upsample().
Spline Functions
When a predictor has a nonlinear relationship with the outcome, some types of predictive models can adaptively approximate this relationship during training. However, simpler is usually better, and it is not uncommon to try to use a simple model, such as a linear fit, and add in specific nonlinear features for predictors that may need them.
One common method for doing this is to use spline functions to represent the data. Splines replace the existing numeric predictor with a set of columns that allow a model to emulate a flexible, nonlinear relationship. As more spline terms are added to the data, the capacity to nonlinearly represent the relationship increases.
library(patchwork)
library(splines)
plot_smoother <- function(deg_free) {
ggplot(ames_train, aes(x = Latitude, y = Sale_Price)) +
geom_point(alpha = .2) +
scale_y_log10() +
geom_smooth(
method = lm,
formula = y ~ ns(x, df = deg_free),
col = "red",
se = FALSE
) +
ggtitle(paste(deg_free, "Spline Terms"))
}
( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )
Based on these plots, moderately nonlinear terms for the geographic predictors look reasonable. Combining everything so far, the recipe below uses step_ns() to add natural spline terms for Latitude and Longitude:
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec_prepped <- prep(ames_rec)
ames_train_prepped <- juice(ames_rec_prepped)
ames_test_prepped <- bake(ames_rec_prepped, ames_test)
# Fit the model; Note that the column Sale_Price has already been
# log transformed.
lm_fit <- lm(Sale_Price ~ ., data = ames_train_prepped)
tidy(lm_fit)
## # A tibble: 73 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.385 0.389 0.992 3.21e- 1
## 2 Gr_Liv_Area 0.422 0.0702 6.02 2.06e- 9
## 3 Year_Built 0.00181 0.000142 12.8 3.52e-36
## 4 Neighborhood_Briardale -0.123 0.0450 -2.74 6.24e- 3
## 5 Neighborhood_Brookside -0.0351 0.0432 -0.812 4.17e- 1
## 6 Neighborhood_Clear_Creek -0.0682 0.0457 -1.49 1.36e- 1
## 7 Neighborhood_College_Creek -0.0390 0.0496 -0.788 4.31e- 1
## 8 Neighborhood_Crawford 0.149 0.0415 3.59 3.35e- 4
## 9 Neighborhood_Edwards -0.109 0.0456 -2.38 1.72e- 2
## 10 Neighborhood_Gilbert -0.0106 0.0228 -0.467 6.41e- 1
## # ... with 63 more rows
To make predictions on the test set, we use the standard syntax:
predict(lm_fit, ames_test_prepped %>% head())
## 1 2 3 4 5 6
## 5.317438 5.407259 5.297857 5.213703 5.363925 5.663064
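Calling tidy() on a prepared recipe gives a summary of its steps: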
tidy(ames_rec_prepped)
## # A tibble: 5 x 6
## number operation type trained skip id
## <int> <chr> <chr> <lgl> <lgl> <chr>
## 1 1 step log TRUE FALSE log_b0oeG
## 2 2 step other TRUE FALSE other_wk8RZ
## 3 3 step dummy TRUE FALSE dummy_bu1nQ
## 4 4 step interact TRUE FALSE interact_1GM08
## 5 5 step ns TRUE FALSE ns_RYSzR
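Each step is given a random id suffix by default (e.g., log_b0oeG above). We can instead supply our own id when adding a step, which makes it easier to identify that step later: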
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10, id = "my_id") %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec_prepped <- prep(ames_rec)
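Finally, columns can be given roles other than outcome or predictor with update_role(). For example, if the data contained a column such as a street address that should be carried along but not used as a predictor, it could be re-roled; the call below is commented out because our data has no address column: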
#ames_rec %>% update_role(address, new_role = "street address")