A simple example of a formula used in a linear model to predict sale prices of houses:
> library(caret)
> data("Sacramento")
>
> mod1 <- lm(log(price) ~ type + sqft, data = Sacramento, subset = beds > 2)
The purpose of this code chunk:

1. subset some of the data points (subset)
2. create a design matrix for the two predictor variables (one of which is qualitative)
3. log-transform the outcome
4. fit a linear regression model

The first two steps create the design matrix (usually represented by X).
However, there are significant limitations to what this framework can do and, in some cases, it can be very inefficient.
This is mostly due to the formula interface having been written well before large-scale modeling and machine learning were commonplace.
For example, a chain of preprocessing operations can't be expressed inline (e.g. y ~ scale(center(knn_impute(x)))), and multivariate outcomes must be explicitly bound together with cbind.
A more in-depth discussion of these issues can be found in this blog post.
Formulas have been re-implemented in different packages for a variety of reasons:
> # ?lme4::lmer
> # Subjects need to be in the data but are not part of the model
> lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
>
> # BradleyTerry2
> # We want the outcome to be a function of a
> # competitor-specific function of reach
> BTm(outcome = 1, player1 = winner, player2 = loser,
+ formula = ~reach[..] + (1 | ..),
+ data = boxers)
>
> # party::mob (using the modeltools package for formulas)
> mob(diabetes ~ glucose | pregnant + mass + age,
+ data = PimaIndiansDiabetes)
A general list of possible variable roles could be:

- outcomes
- predictors
- stratification variables
- conditioning or faceting variables (e.g. for lattice or ggplot2)
- case weights (*)
- offsets (*)
- error terms (limited to the Error function in the aov function) (*)

(*) These can be handled in formulas but are hard-coded into the functions.
We can approach the design matrix and preprocessing steps by first specifying a sequence of steps.
1. price is an outcome
2. type and sqft are predictors
3. log-transform price
4. convert type to dummy variables

A recipe is a specification of intent.
One issue with the formula method is that it couples the specification of your predictors with its implementation. Recipes, as you'll see, separate the planning from the doing.
Website: https://topepo.github.io/recipes
A recipe can be trained then applied to any data.
> library(recipes)
> library(dplyr)
>
> ## Create an initial recipe with only predictors and outcome
> rec <- recipe(price ~ type + sqft, data = Sacramento)
> rec <- rec %>%
+ step_log(price) %>%
+ step_dummy(type)
>
> rec_trained <- prep(rec, training = Sacramento, retain = TRUE)
> design_mat <- bake(rec_trained, newdata = Sacramento)
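Because the recipe is now trained, the same object can be applied to data it has never seen. A minimal sketch, assuming a hypothetical data frame new_sales with the same columns as Sacramento:

# Apply the trained recipe to new data (`new_sales` is hypothetical)
new_design_mat <- bake(rec_trained, newdata = new_sales)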
In the last slide, we used dplyr-like syntax for selecting variables, such as step_dummy(type).
In some cases, the names of the predictors may not be known at the time when you construct a recipe (or model formula). For example:
dplyr selectors can also be used on variable names, such as
> step_spatialsign(matches("^PC[1-9]"), all_numeric(), -all_outcomes())
Variables can be selected by name, role, data type, or any combination of these.
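For instance, a short sketch that mixes the selector types, reusing the Sacramento recipe from above (the step choices here are illustrative, not from the original):

rec %>%
  step_center(sqft) %>%                        # select by name
  step_scale(all_numeric(), -all_outcomes())   # select by type, minus a role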
Need to add more preprocessing or other operations?
> standardized <- rec_trained %>%
+ step_center(all_numeric()) %>%
+ step_scale(all_numeric()) %>%
+ step_pca(all_numeric())
>
> ## Only estimate the new parts:
> standardized <- prep(standardized)
If an initial step is computationally expensive, you don’t have to redo those operations to add more.
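Once the new steps are estimated, the extended recipe is baked as before. A minimal sketch, again using the Sacramento data:

# Apply the extended recipe; only the new steps required estimation
standardized_mat <- bake(standardized, newdata = Sacramento)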
More steps are on the way (e.g. autoencoders, more imputation methods, etc.)
One of the package vignettes shows how to write your own step functions.
Recipes can also be created manually by assigning roles to variables:
> rec <- recipe(x = Sacramento) %>%
+ add_role(price, new_role = "outcome") %>%
+ add_role(type, sqft, new_role = "predictor") %>%
+ add_role(zip, new_role = "strata")
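As shown later with summary(pca), the variables and their assigned roles can be checked at any point:

# List each variable with its type and role
summary(rec)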
Also, the sequential nature of steps means that they don't have to be R operations; they could call other compute engines (e.g. Weka, scikit-learn, TensorFlow, etc.)
We can create wrappers to work with recipes too:
> lin_reg.recipe <- function(rec, data, ...) {
+   trained <- prep(rec, training = data)
+   # lm.fit() needs a numeric predictor matrix and a vector outcome
+   lm.fit(x = as.matrix(bake(trained, newdata = data, all_predictors())),
+          y = bake(trained, newdata = data, all_outcomes())[[1]], ...)
+ }
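A hypothetical call, assuming a recipe like the earlier Sacramento one so that all predictors are numeric after baking:

# `sale_rec` mirrors the earlier recipe: log outcome, dummy variables
sale_rec <- recipe(price ~ type + sqft, data = Sacramento) %>%
  step_log(price) %>%
  step_dummy(type)
lm_fit <- lin_reg.recipe(sale_rec, data = Sacramento)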
Kuhn and Johnson (2013) analyze a data set where thousands of cells are determined to be well-segmented (WS) or poorly segmented (PS) based on 58 image features. We would like to make predictions of the segmentation quality based on these features.
> library(dplyr)
> library(caret)
> data("segmentationData")
>
> seg_train <- segmentationData %>%
+ filter(Case == "Train") %>%
+ select(-Case, -Cell)
> seg_test <- segmentationData %>%
+ filter(Case == "Test") %>%
+ select(-Case, -Cell)
> rec <- recipe(Class ~ ., data = seg_train)
>
> basic <- rec %>%
+ # Correct some predictors for skewness
+ step_YeoJohnson(all_predictors()) %>%
+ # Standardize the values
+ step_center(all_predictors()) %>%
+ step_scale(all_predictors())
>
> # Estimate the transformation and standardization parameters
> basic <- prep(basic, training = seg_train, verbose = FALSE, retain = TRUE)
> pca <- basic %>% step_pca(all_predictors(), threshold = .9)
> summary(pca)
# A tibble: 59 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 AngleCh1 numeric predictor original
2 AreaCh1 numeric predictor original
3 AvgIntenCh1 numeric predictor original
4 AvgIntenCh2 numeric predictor original
5 AvgIntenCh3 numeric predictor original
6 AvgIntenCh4 numeric predictor original
7 ConvexHullAreaRatioCh1 numeric predictor original
8 ConvexHullPerimRatioCh1 numeric predictor original
9 DiffIntenDensityCh1 numeric predictor original
10 DiffIntenDensityCh3 numeric predictor original
# ... with 49 more rows
> pca <- prep(pca)
> summary(pca)
# A tibble: 16 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 Class nominal outcome original
2 PC01 numeric predictor derived
3 PC02 numeric predictor derived
4 PC03 numeric predictor derived
5 PC04 numeric predictor derived
6 PC05 numeric predictor derived
7 PC06 numeric predictor derived
8 PC07 numeric predictor derived
9 PC08 numeric predictor derived
10 PC09 numeric predictor derived
11 PC10 numeric predictor derived
12 PC11 numeric predictor derived
13 PC12 numeric predictor derived
14 PC13 numeric predictor derived
15 PC14 numeric predictor derived
16 PC15 numeric predictor derived
> pca <- bake(pca, newdata = seg_test, everything())
> pca[1:4, 1:8]
# A tibble: 4 x 8
Class PC01 PC02 PC03 PC04 PC05 PC06 PC07
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 PS 4.86 -5.85 -0.891 -4.13 1.84 -2.29 -3.88
2 PS 3.28 -1.51 0.353 -2.24 0.441 -0.911 0.800
3 WS -7.03 -1.77 -2.42 -0.652 3.22 -0.212 0.118
4 WS -6.96 -2.08 -2.89 -1.79 3.20 -0.845 -0.204
> ggplot(pca, aes(x = PC01, y = PC02, color = Class)) + geom_point(alpha = .4)
> kern_pca <- basic %>%
+ step_kpca(all_predictors(), num = 2,
+ options = list(kernel = "rbfdot",
+ kpar = list(sigma = 0.05)))
>
> kern_pca <- prep(kern_pca)
>
> kern_pca <- bake(kern_pca, newdata = seg_test, everything())
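The new features can be plotted the same way as the ordinary principal components; this sketch assumes step_kpca's default kPC naming prefix:

# Plot the two kernel PCA features (column names assume the default prefix)
ggplot(kern_pca, aes(x = kPC1, y = kPC2, color = Class)) + geom_point(alpha = .4)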
> dist_to_classes <- basic %>%
+ step_classdist(all_predictors(), class = "Class") %>%
+ # Take log of the new distance features
+ step_log(starts_with("classdist"))
>
> dist_to_classes <- prep(dist_to_classes, verbose = FALSE)
>
> # All variables are retained plus an additional one for each class
> dist_to_classes <- bake(dist_to_classes, newdata = seg_test, matches("[Cc]lass"))
> dist_to_classes
# A tibble: 1,010 x 3
Class classdist_PS classdist_WS
<fctr> <dbl> <dbl>
1 PS 1.53 1.74
2 PS 1.35 1.46
3 WS 1.71 1.53
4 WS 1.75 1.61
5 PS 1.47 1.65
6 WS 1.48 1.47
7 WS 1.49 1.55
8 WS 1.55 1.40
9 PS 1.54 1.71
10 PS 1.55 1.57
# ... with 1,000 more rows
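The two distance features can be plotted against each other; the column names below come from the output above:

# Plot the log class-distance features for the test set
ggplot(dist_to_classes, aes(x = classdist_PS, y = classdist_WS, color = Class)) +
  geom_point(alpha = .4)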
tidyselect is now on CRAN, and caret methods for recipes (instead of using preProcess) are on the way:

model1 <- train(recipe, data = data, method, ...)

as an alternative to

model2 <- train(x, y, method, preProcess, ...) # or
model3 <- train(y ~ x1 + x2, data = data, method, preProcess, ...)
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=English_Indonesia.1252 LC_CTYPE=English_Indonesia.1252
[3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Indonesia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 recipes_0.1.1 broom_0.4.3 dplyr_0.7.4
[5] caret_6.0-78 lattice_0.20-35 ggplot2_2.2.1.9000 knitr_1.18
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 lubridate_1.7.1 tidyr_0.7.2 class_7.3-14
[5] utf8_1.1.3 assertthat_0.2.0 rprojroot_1.3-1 digest_0.6.13
[9] ipred_0.9-6 psych_1.7.8 foreach_1.4.4 R6_2.2.2
[13] plyr_1.8.4 backports_1.1.2 stats4_3.4.3 evaluate_0.10.1
[17] highr_0.6 pillar_1.0.1 rlang_0.1.6 lazyeval_0.2.1
[21] kernlab_0.9-25 rpart_4.1-11 Matrix_1.2-12 rmarkdown_1.8
[25] labeling_0.3 splines_3.4.3 CVST_0.2-1 ddalpha_1.3.1
[29] gower_0.1.2 stringr_1.2.0 foreign_0.8-69 munsell_0.4.3
[33] compiler_3.4.3 pkgconfig_2.0.1 mnormt_1.5-5 dimRed_0.1.0
[37] htmltools_0.3.6 nnet_7.3-12 tidyselect_0.2.3 tibble_1.4.1
[41] prodlim_1.6.1 DRR_0.0.2 codetools_0.2-15 RcppRoll_0.2.2
[45] crayon_1.3.4 withr_2.1.1.9000 MASS_7.3-47 ModelMetrics_1.1.0
[49] grid_3.4.3 nlme_3.1-131 gtable_0.2.0 magrittr_1.5
[53] scales_0.5.0 cli_1.0.0 stringi_1.1.6 reshape2_1.4.3
[57] timeDate_3042.101 robustbase_0.92-8 lava_1.6 iterators_1.0.9
[61] tools_3.4.3 glue_1.2.0 DEoptimR_1.0-8 purrr_0.2.4
[65] sfsmisc_1.1-1 parallel_3.4.3 survival_2.41-3 yaml_2.1.16
[69] colorspace_1.3-2 bindr_0.1