A simple example of a formula used in a linear model to predict sale prices of houses:
> library(caret)
> data("Sacramento")
>
> mod1 <- lm(log(price) ~ type + sqft, data = Sacramento, subset = beds > 2)
The purpose of this code chunk:

1. subset some of the data points (subset)
2. create a design matrix for the two predictor variables (one of which is qualitative)
3. log-transform the outcome
4. fit a linear regression model

The first two steps create the design matrix (usually represented by X).
However, there are significant limitations to what this framework can do and, in some cases, it can be very inefficient.
This is mostly due to the formula interface having been written well before large-scale modeling and machine learning were commonplace.
For example, a chain of preprocessing operations can't be expressed inline (e.g. y ~ scale(center(knn_impute(x)))), and multivariate outcomes must be explicitly bound together with cbind.
A more in-depth discussion of these issues can be found in this blog post.
Formulas have been re-implemented in different packages for a variety of reasons:
> # ?lme4::lmer
> # Subjects need to be in the data but are not part of the model
> lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
>
> # BradleyTerry2
> # We want the outcome to be a function of a
> # competitor-specific function of reach
> BTm(outcome = 1, player1 = winner, player2 = loser,
+ formula = ~reach[..] + (1 | ..),
+ data = boxers)
>
> # party::mob (using the modeltools package for formulas)
> mob(diabetes ~ glucose | pregnant + mass + age,
+ data = PimaIndiansDiabetes)
A general list of possible variable roles could be:

- outcomes
- predictors
- stratification variables
- conditioning or faceting variables (e.g. for lattice or ggplot2)
- case weights (*)
- offsets (*)
- error terms (limited to the Error function in the aov function) (*)

(*) These can be handled in formulas but are hard-coded into the functions.
We can approach the design matrix and preprocessing steps by first specifying a sequence of steps.
1. price is an outcome
2. type and sqft are predictors
3. log-transform price
4. convert type to dummy variables

A recipe is a specification of intent.
One issue with the formula method is that it couples the specification of your predictors with its implementation. Recipes, as you'll see, separate the planning from the doing.
Website: https://topepo.github.io/recipes
A recipe can be trained then applied to any data.
> library(recipes)
> library(dplyr)
>
> ## Create an initial recipe with only predictors and outcome
> rec <- recipe(price ~ type + sqft, data = Sacramento)
> rec <- rec %>%
+ step_log(price) %>%
+ step_dummy(type)
>
> rec_trained <- prep(rec, training = Sacramento, retain = TRUE)
> design_mat <- bake(rec_trained, newdata = Sacramento)
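Because the recipe is now trained, the same object can be applied to data it has never seen. A minimal sketch, assuming a hypothetical data frame new_sales with the same columns as Sacramento:

# Apply the trained recipe to new data (`new_sales` is hypothetical)
new_design_mat <- bake(rec_trained, newdata = new_sales)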
In the last slide, we used dplyr-like syntax for selecting variables, such as step_dummy(type).
In some cases, the names of the predictors may not be known at the time when you construct a recipe (or model formula). For example:
dplyr selectors can also be used on variable names, such as
> step_spatialsign(matches("^PC[1-9]"), all_numeric(), -all_outcomes())
Variables can be selected by name, role, data type, or any combination of these.
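For instance, a short sketch that mixes the selector types, reusing the Sacramento recipe from above (the step choices here are illustrative, not from the original):

rec %>%
  step_center(sqft) %>%                        # select by name
  step_scale(all_numeric(), -all_outcomes())   # select by type, minus a role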
Need to add more preprocessing or other operations?
> standardized <- rec_trained %>%
+ step_center(all_numeric()) %>%
+ step_scale(all_numeric()) %>%
+ step_pca(all_numeric())
>
> ## Only estimate the new parts:
> standardized <- prep(standardized)
If an initial step is computationally expensive, you don’t have to redo those operations to add more.
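Once the new steps are estimated, the extended recipe is baked as before. A minimal sketch, again using the Sacramento data:

# Apply the extended recipe; only the new steps required estimation
standardized_mat <- bake(standardized, newdata = Sacramento)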
More steps are on the way (e.g. autoencoders, more imputation methods, etc.)
One of the package vignettes shows how to write your own step functions.
Recipes can also be created manually by assigning roles to variables:
> rec <- recipe(x = Sacramento) %>%
+ add_role(price, new_role = "outcome") %>%
+ add_role(type, sqft, new_role = "predictor") %>%
+ add_role(zip, new_role = "strata")
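As shown later with summary(pca), the variables and their assigned roles can be checked at any point:

# List each variable with its type and role
summary(rec)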
Also, the sequential nature of steps means that they don't have to be R operations; they could call other compute engines (e.g. Weka, scikit-learn, TensorFlow, etc.)
We can create wrappers to work with recipes too:
> lin_reg.recipe <- function(rec, data, ...) {
+   trained <- prep(rec, training = data)
+   # lm.fit() needs a numeric predictor matrix and a vector outcome
+   lm.fit(x = as.matrix(bake(trained, newdata = data, all_predictors())),
+          y = bake(trained, newdata = data, all_outcomes())[[1]], ...)
+ }
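A hypothetical call, assuming a recipe like the earlier Sacramento one so that all predictors are numeric after baking:

# `sale_rec` mirrors the earlier recipe: log outcome, dummy variables
sale_rec <- recipe(price ~ type + sqft, data = Sacramento) %>%
  step_log(price) %>%
  step_dummy(type)
lm_fit <- lin_reg.recipe(sale_rec, data = Sacramento)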
Kuhn and Johnson (2013) analyze a data set where thousands of cells are determined to be well-segmented (WS) or poorly segmented (PS) based on 58 image features. We would like to make predictions of the segmentation quality based on these features.
> library(dplyr)
> library(caret)
> data("segmentationData")
>
> seg_train <- segmentationData %>%
+ filter(Case == "Train") %>%
+ select(-Case, -Cell)
> seg_test <- segmentationData %>%
+ filter(Case == "Test") %>%
+ select(-Case, -Cell)
> rec <- recipe(Class ~ ., data = seg_train)
>
> basic <- rec %>%
+ # Correct some predictors for skewness
+ step_YeoJohnson(all_predictors()) %>%
+ # Standardize the values
+ step_center(all_predictors()) %>%
+ step_scale(all_predictors())
>
> # Estimate the transformation and standardization parameters
> basic <- prep(basic, training = seg_train, verbose = FALSE, retain = TRUE)
> pca <- basic %>% step_pca(all_predictors(), threshold = .9)
> summary(pca)
# A tibble: 59 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 AngleCh1 numeric predictor original
2 AreaCh1 numeric predictor original
3 AvgIntenCh1 numeric predictor original
4 AvgIntenCh2 numeric predictor original
5 AvgIntenCh3 numeric predictor original
6 AvgIntenCh4 numeric predictor original
7 ConvexHullAreaRatioCh1 numeric predictor original
8 ConvexHullPerimRatioCh1 numeric predictor original
9 DiffIntenDensityCh1 numeric predictor original
10 DiffIntenDensityCh3 numeric predictor original
# ... with 49 more rows
> pca <- prep(pca)
> summary(pca)
# A tibble: 16 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 Class nominal outcome original
2 PC01 numeric predictor derived
3 PC02 numeric predictor derived
4 PC03 numeric predictor derived
5 PC04 numeric predictor derived
6 PC05 numeric predictor derived
7 PC06 numeric predictor derived
8 PC07 numeric predictor derived
9 PC08 numeric predictor derived
10 PC09 numeric predictor derived
11 PC10 numeric predictor derived
12 PC11 numeric predictor derived
13 PC12 numeric predictor derived
14 PC13 numeric predictor derived
15 PC14 numeric predictor derived
16 PC15 numeric predictor derived
> pca <- bake(pca, newdata = seg_test, everything())
> pca[1:4, 1:8]
# A tibble: 4 x 8
Class PC01 PC02 PC03 PC04 PC05 PC06 PC07
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 PS 4.86 -5.85 -0.891 -4.13 1.84 -2.29 -3.88
2 PS 3.28 -1.51 0.353 -2.24 0.441 -0.911 0.800
3 WS -7.03 -1.77 -2.42 -0.652 3.22 -0.212 0.118
4 WS -6.96 -2.08 -2.89 -1.79 3.20 -0.845 -0.204
> ggplot(pca, aes(x = PC01, y = PC02, color = Class)) + geom_point(alpha = .4)
> kern_pca <- basic %>%
+ step_kpca(all_predictors(), num = 2,
+ options = list(kernel = "rbfdot",
+ kpar = list(sigma = 0.05)))
>
> kern_pca <- prep(kern_pca)
>
> kern_pca <- bake(kern_pca, newdata = seg_test, everything())
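The new features can be plotted the same way as the ordinary principal components; this sketch assumes step_kpca's default kPC naming prefix:

# Plot the two kernel PCA features (column names assume the default prefix)
ggplot(kern_pca, aes(x = kPC1, y = kPC2, color = Class)) + geom_point(alpha = .4)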
> dist_to_classes <- basic %>%
+ step_classdist(all_predictors(), class = "Class") %>%
+ # Take log of the new distance features
+ step_log(starts_with("classdist"))
>
> dist_to_classes <- prep(dist_to_classes, verbose = FALSE)
>
> # All variables are retained plus an additional one for each class
> dist_to_classes <- bake(dist_to_classes, newdata = seg_test, matches("[Cc]lass"))
> dist_to_classes
# A tibble: 1,010 x 3
Class classdist_PS classdist_WS
<fctr> <dbl> <dbl>
1 PS 1.53 1.74
2 PS 1.35 1.46
3 WS 1.71 1.53
4 WS 1.75 1.61
5 PS 1.47 1.65
6 WS 1.48 1.47
7 WS 1.49 1.55
8 WS 1.55 1.40
9 PS 1.54 1.71
10 PS 1.55 1.57
# ... with 1,000 more rows
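The two distance features can be plotted against each other; the column names below come from the output above:

# Plot the log class-distance features for the test set
ggplot(dist_to_classes, aes(x = classdist_PS, y = classdist_WS, color = Class)) +
  geom_point(alpha = .4)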
tidyselect is now on CRAN, and caret methods for recipes (instead of using preProcess) are on the way:

model1 <- train(recipe, data = data, method, ...)

as an alternative to

model2 <- train(x, y, method, preProcess, ...) # or
model3 <- train(y ~ x1 + x2, data = data, method, preProcess, ...)
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_Indonesia.1252 LC_CTYPE=English_Indonesia.1252
[3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Indonesia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 recipes_0.1.1 broom_0.4.3 dplyr_0.7.4
[5] caret_6.0-78 lattice_0.20-35 ggplot2_2.2.1.9000 knitr_1.18
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 lubridate_1.7.1 tidyr_0.7.2 class_7.3-14
[5] utf8_1.1.3 assertthat_0.2.0 rprojroot_1.3-1 digest_0.6.13
[9] ipred_0.9-6 psych_1.7.8 foreach_1.4.4 R6_2.2.2
[13] plyr_1.8.4 backports_1.1.2 stats4_3.4.3 evaluate_0.10.1
[17] highr_0.6 pillar_1.0.1 rlang_0.1.6 lazyeval_0.2.1
[21] kernlab_0.9-25 rpart_4.1-11 Matrix_1.2-12 rmarkdown_1.8
[25] labeling_0.3 splines_3.4.3 CVST_0.2-1 ddalpha_1.3.1
[29] gower_0.1.2 stringr_1.2.0 foreign_0.8-69 munsell_0.4.3
[33] compiler_3.4.3 pkgconfig_2.0.1 mnormt_1.5-5 dimRed_0.1.0
[37] htmltools_0.3.6 nnet_7.3-12 tidyselect_0.2.3 tibble_1.4.1
[41] prodlim_1.6.1 DRR_0.0.2 codetools_0.2-15 RcppRoll_0.2.2
[45] crayon_1.3.4 withr_2.1.1.9000 MASS_7.3-47 ModelMetrics_1.1.0
[49] grid_3.4.3 nlme_3.1-131 gtable_0.2.0 magrittr_1.5
[53] scales_0.5.0 cli_1.0.0 stringi_1.1.6 reshape2_1.4.3
[57] timeDate_3042.101 robustbase_0.92-8 lava_1.6 iterators_1.0.9
[61] tools_3.4.3 glue_1.2.0 DEoptimR_1.0-8 purrr_0.2.4
[65] sfsmisc_1.1-1 parallel_3.4.3 survival_2.41-3 yaml_2.1.16
[69] colorspace_1.3-2 bindr_0.1