R Model Formulas

A simple example of a formula used in a linear model to predict sale prices of houses:

> library(caret)
> data("Sacramento")
> 
> mod1 <- lm(log(price) ~ type + sqft, data = Sacramento, subset = beds > 2)

The purpose of this code chunk:

  1. subset some of the data points (subset)
  2. create a design matrix for 2 predictor variable (but 3 model terms)
  3. log transform the outcome variable
  4. fit a linear regression model

The first two steps create the design matrix (usually represented by X).

Summarizing the Model Formula Method

However, there are significant limitations to what this framework can do and, in some cases, it can be very inefficient.

This is mostly due of being written well before large scale modeling and machine learning were commonplace.

Limitations of the Current System

A more in-depth discussion of these issues can be found in this blog post.

Variable Roles

Formulas have been re-implemented in different packages for a variety of different reasons:

> # ?lme4::lmer
> # Subjects need to be in the data but are not part of the model
> lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
> 
> # BradleyTerry2
> # We want to make the outcomes to be a function of a 
> # competitor-specific function of reach 
> BTm(outcome = 1, player1 = winner, player2 = loser, 
+     formula = ~reach[..] + (1 | ..), 
+     data = boxers)
> 
> # mboost::mob (using the modeltools package for formulas)
> mob(diabetes ~ glucose | pregnant + mass +  age,
+     data = PimaIndiansDiabetes)

Variable Roles

A general list of possible variables roles could be

(*) These can be handled in formulas but are hard-coded into the functions.

Recipes

We can approach the design matrix and preprocessing steps by first specifying a sequence of steps.

  1. price is an outcome
  2. type and sqft are predictors
  3. log transform price
  4. convert type to dummy variables

A recipe is a specification of intent.

One issue with the formula method is that it couples the specification for your predictors along with the implementation.

Recipes, as you’ll see, separates the planning from the doing.

Website: https://topepo.github.io/recipes

Recipes

A recipe can be trained then applied to any data.

> library(recipes) 
> library(dplyr)
> 
> ## Create an initial recipe with only predictors and outcome
> rec <- recipe(price ~ type + sqft, data = Sacramento)
> rec <- rec %>% 
+   step_log(price) %>%
+   step_dummy(type)
> 
> rec_trained <- prep(rec, training = Sacramento, retain = TRUE)
> design_mat <- bake(rec_trained, newdata = Sacramento)

Selecting Variables

In the last slide, we used dplyr-like syntax for selecting variables such as step_dummy(type).

In some cases, the names of the predictors may not be known at the time when you construct a recipe (or model formula). For example:

dplyr selectors can also be used on variables names, such as

> step_spatialsign(matches("^PC[1-9]"), all_numeric(), -all_outcomes())

Variables can be selected by name, role, data type, or any combination of these.

Extending

Need to add more preprocessing or other operations?

> standardized <- rec_trained %>%
+   step_center(all_numeric()) %>%
+   step_scale(all_numeric()) %>%
+   step_pca(all_numeric())
>           
> ## Only estimate the new parts:
> standardized <- prep(standardized)

If an initial step is computationally expensive, you don’t have to redo those operations to add more.

Available Steps

More on the way (i.e. autoencoders, more imputation methods, etc.)

One of the package vignettes shows how to write your own step functions.

Extending

Recipes can also be created with different roles manually

> rec <- recipe(x  = Sacramento) %>%
+   add_role(price, new_role = "outcome") %>%
+   add_role(type, sqft, new_role = "predictor") %>%
+   add_role(zip, new_role = "strata")

Also, the sequential nature of steps means that they don’t have to be R operations and could call other compute engines (e.g. Weka, scikit-learn, Tensorflow, etc. )

We can create wrappers to work with recipes too:

> lin_reg.recipe <- function(rec, data, ...) {
+   trained <- prep(rec, training = data)
+   lm.fit(x = bake(trained, newdata = data, all_predictors()),
+          y = bake(trained, newdata = data, all_outcomes()), ...)
+ }

An Example

Kuhn and Johnson (2013) analyze a data set where thousands of cells are determined to be well-segmented (WS) or poorly segmented (PS) based on 58 image features. We would like to make predictions of the segmentation quality based on these features.

> library(dplyr)
> library(caret)
> data("segmentationData")
> 
> seg_train <- segmentationData %>% 
+   filter(Case == "Train") %>% 
+   select(-Case, -Cell)
> seg_test  <- segmentationData %>% 
+   filter(Case == "Test")  %>% 
+   select(-Case, -Cell)

A Simple Recipe

> rec <- recipe(Class  ~ ., data = seg_train)
> 
> basic <- rec %>%
+   # Correct some predictors for skewness
+   step_YeoJohnson(all_predictors()) %>%
+   # Standardize the values
+   step_center(all_predictors()) %>%
+   step_scale(all_predictors())
> 
> # Estimate the transformation and standardization parameters 
> basic <- prep(basic, training = seg_train, verbose = FALSE, retain = TRUE)  

Principal Component Analysis

> pca <- basic %>% step_pca(all_predictors(), threshold = .9)
> summary(pca)
# A tibble: 59 x 4
   variable                type    role      source  
   <chr>                   <chr>   <chr>     <chr>   
 1 AngleCh1                numeric predictor original
 2 AreaCh1                 numeric predictor original
 3 AvgIntenCh1             numeric predictor original
 4 AvgIntenCh2             numeric predictor original
 5 AvgIntenCh3             numeric predictor original
 6 AvgIntenCh4             numeric predictor original
 7 ConvexHullAreaRatioCh1  numeric predictor original
 8 ConvexHullPerimRatioCh1 numeric predictor original
 9 DiffIntenDensityCh1     numeric predictor original
10 DiffIntenDensityCh3     numeric predictor original
# ... with 49 more rows

Principal Component Analysis

> pca <- prep(pca)
> summary(pca)
# A tibble: 16 x 4
   variable type    role      source  
   <chr>    <chr>   <chr>     <chr>   
 1 Class    nominal outcome   original
 2 PC01     numeric predictor derived 
 3 PC02     numeric predictor derived 
 4 PC03     numeric predictor derived 
 5 PC04     numeric predictor derived 
 6 PC05     numeric predictor derived 
 7 PC06     numeric predictor derived 
 8 PC07     numeric predictor derived 
 9 PC08     numeric predictor derived 
10 PC09     numeric predictor derived 
11 PC10     numeric predictor derived 
12 PC11     numeric predictor derived 
13 PC12     numeric predictor derived 
14 PC13     numeric predictor derived 
15 PC14     numeric predictor derived 
16 PC15     numeric predictor derived 
> pca <- bake(pca, newdata = seg_test, everything())

Principal Component Analysis

> pca[1:4, 1:8]
# A tibble: 4 x 8
  Class   PC01  PC02   PC03   PC04  PC05   PC06   PC07
  <fctr> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1 PS      4.86 -5.85 -0.891 -4.13  1.84  -2.29  -3.88 
2 PS      3.28 -1.51  0.353 -2.24  0.441 -0.911  0.800
3 WS     -7.03 -1.77 -2.42  -0.652 3.22  -0.212  0.118
4 WS     -6.96 -2.08 -2.89  -1.79  3.20  -0.845 -0.204
> ggplot(pca, aes(x = PC01, y = PC02, color = Class)) + geom_point(alpha = .4)

Principal Component Analysis

Kernel Principal Component Analysis

> kern_pca <- basic %>% 
+   step_kpca(all_predictors(), num = 2, 
+             options = list(kernel = "rbfdot", 
+                            kpar = list(sigma = 0.05)))
> 
> kern_pca <- prep(kern_pca)
> 
> kern_pca <- bake(kern_pca, newdata = seg_test, everything())

Kernel Principal Component Analysis

Distance to Each Class Centroid

> dist_to_classes <- basic %>% 
+   step_classdist(all_predictors(), class = "Class") %>%
+   # Take log of the new distance features
+   step_log(starts_with("classdist"))
> 
> dist_to_classes <- prep(dist_to_classes, verbose = FALSE)
> 
> # All variables are retained plus an additional one for each class
> dist_to_classes <- bake(dist_to_classes, newdata = seg_test, matches("[Cc]lass"))
> dist_to_classes
# A tibble: 1,010 x 3
   Class  classdist_PS classdist_WS
   <fctr>        <dbl>        <dbl>
 1 PS             1.53         1.74
 2 PS             1.35         1.46
 3 WS             1.71         1.53
 4 WS             1.75         1.61
 5 PS             1.47         1.65
 6 WS             1.48         1.47
 7 WS             1.49         1.55
 8 WS             1.55         1.40
 9 PS             1.54         1.71
10 PS             1.55         1.57
# ... with 1,000 more rows

Distance to Each Class

Next Steps

model1 <- train(recipe, data = data, method, ...)

as an alternative to

model2 <- train(x, y, method, preProcess, ...) # or
model3 <- train(y ~ x1 + x2, data = data, method, preProcess, ...)
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_Indonesia.1252  LC_CTYPE=English_Indonesia.1252   
[3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Indonesia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2       recipes_0.1.1      broom_0.4.3        dplyr_0.7.4       
[5] caret_6.0-78       lattice_0.20-35    ggplot2_2.2.1.9000 knitr_1.18        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14       lubridate_1.7.1    tidyr_0.7.2        class_7.3-14      
 [5] utf8_1.1.3         assertthat_0.2.0   rprojroot_1.3-1    digest_0.6.13     
 [9] ipred_0.9-6        psych_1.7.8        foreach_1.4.4      R6_2.2.2          
[13] plyr_1.8.4         backports_1.1.2    stats4_3.4.3       evaluate_0.10.1   
[17] highr_0.6          pillar_1.0.1       rlang_0.1.6        lazyeval_0.2.1    
[21] kernlab_0.9-25     rpart_4.1-11       Matrix_1.2-12      rmarkdown_1.8     
[25] labeling_0.3       splines_3.4.3      CVST_0.2-1         ddalpha_1.3.1     
[29] gower_0.1.2        stringr_1.2.0      foreign_0.8-69     munsell_0.4.3     
[33] compiler_3.4.3     pkgconfig_2.0.1    mnormt_1.5-5       dimRed_0.1.0      
[37] htmltools_0.3.6    nnet_7.3-12        tidyselect_0.2.3   tibble_1.4.1      
[41] prodlim_1.6.1      DRR_0.0.2          codetools_0.2-15   RcppRoll_0.2.2    
[45] crayon_1.3.4       withr_2.1.1.9000   MASS_7.3-47        ModelMetrics_1.1.0
[49] grid_3.4.3         nlme_3.1-131       gtable_0.2.0       magrittr_1.5      
[53] scales_0.5.0       cli_1.0.0          stringi_1.1.6      reshape2_1.4.3    
[57] timeDate_3042.101  robustbase_0.92-8  lava_1.6           iterators_1.0.9   
[61] tools_3.4.3        glue_1.2.0         DEoptimR_1.0-8     purrr_0.2.4       
[65] sfsmisc_1.1-1      parallel_3.4.3     survival_2.41-3    yaml_2.1.16       
[69] colorspace_1.3-2   bindr_0.1         

Thanks!