library(tidymodels)
library(modeldatatoo)
tidymodels_prefer()
theme_set(theme_bw())
options(
  pillar.advice = FALSE,
  pillar.min_title_chars = Inf
)
set.seed(295)
hotel_rates <-
  data_hotel_rates() %>%
  sample_n(5000) %>%
  arrange(arrival_date) %>%
  select(-arrival_date_num, -arrival_date) %>%
  mutate(
    company = factor(as.character(company)),
    country = factor(as.character(country)),
    agent = factor(as.character(agent))
  )

Feature Engineering
What is Feature Engineering?
Think of a feature as some representation of a predictor that will be used in a model.
Example representations:
Interactions (see the sketch after this list)
Polynomial expansions/splines
Principal component analysis (PCA) feature extraction
There are a lot of examples in Feature Engineering and Selection (FES).
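As a quick preview of the recipes syntax introduced later, here is a minimal sketch of an interaction feature; the pairing of lead_time and adults is a hypothetical choice for illustration, not one used in this section:

int_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_rates) %>%
  # cross two numeric predictors to create a new interaction feature
  step_interact(terms = ~ lead_time:adults)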
Example: Dates
How can we represent date columns for our model?
When we use a date column in its native format, most models in R convert it to an integer.
We can re-engineer it as (see the sketch after this list):
Days since a reference date
Day of the week
Month
Year
Indicators for holidays
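A minimal sketch of these date features with recipes, assuming a data frame df with a date column arrival_date and an outcome y (our hotel data drops arrival_date before modeling, so this is illustrative); the reference date is arbitrary:

date_rec <-
  recipe(y ~ arrival_date, data = df) %>%
  # days since a reference date
  step_mutate(days_since = as.numeric(arrival_date - as.Date("2016-01-01"))) %>%
  # day of the week, month, and year as new columns
  step_date(arrival_date, features = c("dow", "month", "year")) %>%
  # binary indicators for holidays
  step_holiday(arrival_date, holidays = c("NewYearsDay", "ChristmasDay")) %>%
  # remove the original date column once the features exist
  step_rm(arrival_date)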
General definitions
Data preprocessing steps allow your model to fit.
Feature engineering steps help the model do the least work to predict the outcome as well as possible.
The recipes package can handle both!
Hotel Data
We'll use data on hotels to predict the cost of a room.
The data are in the modeldatatoo package. We sampled down the data and refactored some columns in the setup code at the top of this section.
Data Splitting Strategy
Data Spending
Let's split the data into a training set (75%) and testing set (25%):
set.seed(4028)
hotel_split <-
  initial_split(hotel_rates, strata = avg_price_per_room)

hotel_train <- training(hotel_split)
hotel_test <- testing(hotel_split)

Resampling Strategy
We'll use simple 10-fold cross-validation (stratified sampling):
set.seed(472)
hotel_rs <- vfold_cv(hotel_train, strata = avg_price_per_room)
hotel_rs
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
   splits             id
   <list>             <chr>
 1 <split [3372/377]> Fold01
 2 <split [3373/376]> Fold02
 3 <split [3373/376]> Fold03
 4 <split [3373/376]> Fold04
 5 <split [3373/376]> Fold05
 6 <split [3374/375]> Fold06
 7 <split [3375/374]> Fold07
 8 <split [3376/373]> Fold08
 9 <split [3376/373]> Fold09
10 <split [3376/373]> Fold10
Prepare your data for modeling
The recipes package is an extensible framework for pipeable sequences of preprocessing and feature engineering steps.
Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets.
The resulting processed output can be used as inputs for statistical or machine learning models.
A first recipe
hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train)

The recipe() function assigns columns to roles of "outcome" or "predictor" using the formula.

summary(hotel_rec)
# A tibble: 28 × 4
   variable                  type      role      source
   <chr>                     <list>    <chr>     <chr>
 1 lead_time                 <chr [2]> predictor original
 2 arrival_date_day_of_month <chr [2]> predictor original
 3 stays_in_weekend_nights   <chr [2]> predictor original
 4 stays_in_week_nights      <chr [2]> predictor original
 5 adults                    <chr [2]> predictor original
 6 children                  <chr [2]> predictor original
 7 babies                    <chr [2]> predictor original
 8 meal                      <chr [3]> predictor original
 9 country                   <chr [3]> predictor original
10 market_segment            <chr [3]> predictor original
# ℹ 18 more rows
The type column contains information on each variable's type (e.g., numeric or nominal).
Create indicator variables
hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors())

For any factor or character predictors, make binary indicators.
There are many recipe steps that can convert categorical predictors to numeric columns.
step_dummy() records the levels of the categorical predictors in the training set.
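To see the result, we can estimate the recipe on the training set and get the processed data back; this is just an inspection sketch (looking at the meal columns is one example):

hotel_rec %>%
  prep() %>%                  # estimate the steps using the training set
  bake(new_data = NULL) %>%   # return the processed training set
  select(starts_with("meal")) %>%
  slice(1:3)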
Filter out constant columns
hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

In case there is a factor level that was never observed in the training data (resulting in a column of all 0s), we can delete any zero-variance predictors that have a single unique value.
Normalization
hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

This centers and scales the numeric predictors.
The recipe will use the training set to estimate the means and standard deviations of the data.
All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).
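A small sketch to make that concrete: the training-set statistics can be inspected with tidy(), and baking the test set reuses them (number = 3 refers to step_normalize(), the third step in this recipe):

norm_prepped <- prep(hotel_rec)

# the means and standard deviations estimated from the training set
tidy(norm_prepped, number = 3)

# applying the recipe to the test set reuses those statistics
bake(norm_prepped, new_data = hotel_test) %>% slice(1:3)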
Reduce correlation
hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

To deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.
Other possible steps
hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())

PCA feature extraction…

hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  embed::step_umap(all_numeric_predictors(), outcome = vars(avg_price_per_room))

A fancy supervised dimension reduction technique from machine learning…

hotel_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spline_natural(year_day, deg_free = 10)

Nonlinear transforms like natural splines, and so on!
Minimal recipe
hotel_indicators <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

Measuring Performance
We'll compute two measures: mean absolute error and the coefficient of determination.
The focus will be on MAE for parameter optimization. We'll use a metric set to compute these:
reg_metrics <- metric_set(mae, rsq)

Using a workflow
set.seed(9)

hotel_lm_wflow <-
  workflow() %>%
  add_recipe(hotel_indicators) %>%
  add_model(linear_reg())

ctrl <- control_resamples(save_pred = TRUE)

hotel_lm_res <-
  hotel_lm_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_lm_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>
1 mae     standard   17.7       1      NA Preprocessor1_Model1
2 rsq     standard    0.858     1      NA Preprocessor1_Model1
Holdout predictions
# Since we used `save_pred = TRUE`
lm_val_pred <- collect_predictions(hotel_lm_res)
lm_val_pred %>% slice(1:7)
# A tibble: 7 × 5
  id     .pred  .row avg_price_per_room .config
  <chr>  <dbl> <int>              <dbl> <chr>
1 Fold02  25.4    11               36   Preprocessor1_Model1
2 Fold02  26.9    16               46   Preprocessor1_Model1
3 Fold02  18.2    41               39.4 Preprocessor1_Model1
4 Fold02  51.9    57               54.9 Preprocessor1_Model1
5 Fold02  44.7    60               49   Preprocessor1_Model1
6 Fold02  46.2    67               49   Preprocessor1_Model1
7 Fold02  44.7    69               49   Preprocessor1_Model1
Calibration Plot
library(probably)
cal_plot_regression(hotel_lm_res, alpha = 1 / 5)

What do we do with the agent and company data?
There are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?
We could:
make the full set of indicator variables 😳
lump agents and companies that rarely occur into an "other" group
use feature hashing to create a smaller set of indicator variables (see the sketch after this list)
use effect encoding to replace the agent and company columns with the estimated effect of that predictor (in the extra materials)
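A sketch of the feature hashing option, assuming the textrecipes package is installed; num_terms = 32 is an arbitrary choice for illustration, not a tuned value:

library(textrecipes)

hotel_hash_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # hash agent and company into a fixed number of indicator columns
  step_dummy_hash(agent, company, num_terms = 32) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())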
Per-agent statistics
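The original slide shows per-agent summaries; here is a hedged sketch of the kind of statistics involved (the choice of summaries is illustrative):

hotel_train %>%
  group_by(agent) %>%
  summarize(
    n = n(),                                # reservations per agent
    mean_price = mean(avg_price_per_room)   # typical room price per agent
  ) %>%
  arrange(desc(n))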
Collapsing factor levels
There is a recipe step that will redefine factor levels based on their frequency in the training set:
hotel_other_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  step_other(agent, threshold = 0.001) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

Using this code, 34 agents (out of 98) were collapsed into "other" based on the training set.
We could try to optimize the threshold for collapsing (see the next set of slides on model tuning).
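As a preview of that, the threshold can be marked for tuning with tune() instead of being fixed; this sketch just sets up the tunable recipe:

hotel_other_tune_rec <-
  recipe(avg_price_per_room ~ ., data = hotel_train) %>%
  step_YeoJohnson(lead_time) %>%
  # flag the collapsing threshold as a tuning parameter
  step_other(agent, threshold = tune()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())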
Does othering help?
hotel_other_wflow <-
  hotel_lm_wflow %>%
  update_recipe(hotel_other_rec)

hotel_other_res <-
  hotel_other_wflow %>%
  fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)

collect_metrics(hotel_other_res)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>
1 mae     standard   17.8       1      NA Preprocessor1_Model1
2 rsq     standard    0.855     1      NA Preprocessor1_Model1
The MAE is about the same, and the fit completes much faster.