suppressPackageStartupMessages( require(tidyverse) )

The rsample package, which came out in July 2017, seems to be a good alternative to pipelearner. It should work hand in hand with recipes and caret, and its sampling functions should be similar to the resample objects in modelr.

1 Example for 10 x 10-fold cross validation

set.seed(4622)

data = rsample::attrition

formula = Attrition ~ JobSatisfaction + Gender + MonthlyIncome

rs = rsample::vfold_cv( data, v = 10, repeats = 10)

rs
## #  10-fold cross-validation repeated 10 times 
## # A tibble: 100 x 3
##          splits       id    id2
##          <list>    <chr>  <chr>
##  1 <S3: rsplit> Repeat01 Fold01
##  2 <S3: rsplit> Repeat01 Fold02
##  3 <S3: rsplit> Repeat01 Fold03
##  4 <S3: rsplit> Repeat01 Fold04
##  5 <S3: rsplit> Repeat01 Fold05
##  6 <S3: rsplit> Repeat01 Fold06
##  7 <S3: rsplit> Repeat01 Fold07
##  8 <S3: rsplit> Repeat01 Fold08
##  9 <S3: rsplit> Repeat01 Fold09
## 10 <S3: rsplit> Repeat01 Fold10
## # ... with 90 more rows

1.1 Sample

The rsample object is basically just like the pipelearner object without the trained models. Similar to modelr resample objects, we can convert the rsample splits to data using as.data.frame(). Without a data argument this returns the analysis set.

split = rs$splits[[1]]

dim( as.data.frame(split) )
## [1] 1323   31
dim( as.data.frame(split, data = 'analysis') )
## [1] 1323   31
dim( as.data.frame(split, data = 'assessment') )
## [1] 147  31
dim( rsample::analysis(split) )
## [1] 1323   31
dim( rsample::assessment(split) )
## [1] 147  31
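
The row counts above are what you would get if each split only recorded which rows belong to the analysis and assessment sets. A minimal base-R sketch of that idea follows, assuming a data set with 1470 rows (1323 + 147, as above). The names `fold_id`, `analysis_idx` and `assessment_idx` are made up for illustration, and this is not rsample's actual implementation.

```r
set.seed(4622)

n = 1470   # rows in the data, from 1323 + 147 above
v = 10     # number of folds

# assign every row to exactly one fold, in random order
fold_id = sample(rep(seq_len(v), length.out = n))

# for fold 1: rows held out for assessment, all other rows for analysis
assessment_idx = which(fold_id == 1)
analysis_idx   = which(fold_id != 1)

length(analysis_idx)
## [1] 1323
length(assessment_idx)
## [1] 147
```

Subsetting the data with these index vectors, for example `data[analysis_idx, ]`, would then materialise a split only when it is needed.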

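1.2 Fit glm model

rs_mod = rs %>%
  mutate( formula = list(formula)
          # glm() converts each split via as.data.frame(), i.e. it fits on the analysis set
          , fit = map2( formula, splits, glm, family = 'binomial' )
          # predictions for the held-out assessment set (link scale by default)
          , preds = map2( fit, splits, function(x,y) predict(x, newdata = rsample::assessment(y)) )
          )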
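
Note that predict() on a binomial glm returns values on the link (log-odds) scale unless told otherwise. The following sketch on simulated data shows how those values relate to predicted probabilities. The data and the names `d`, `train` and `test` are made up for illustration and have nothing to do with the attrition data.

```r
set.seed(1)

# simulated binary outcome, for illustration only
d = data.frame(x = rnorm(200))
d$y = rbinom(200, 1, plogis(0.5 * d$x))

train = d[1:150, ]
test  = d[151:200, ]

fit_sim = glm(y ~ x, data = train, family = 'binomial')

# default: linear predictor (log-odds)
pred_link = predict(fit_sim, newdata = test)
# type = 'response': predicted probabilities
pred_prob = predict(fit_sim, newdata = test, type = 'response')

# applying the inverse logit to the log-odds gives the probabilities
all.equal(plogis(pred_link), pred_prob)
```

If probabilities are wanted in the preds column, passing type = 'response' to predict() would be one way to get them.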

Measured with object.size(), the rsample object appears to increase the memory reserved for the data splits by a factor of about 100 in this particular case. The next section compares this with pryr::object_size.

1.3 Memory allocation

There are two functions for checking memory allocation, object.size and pryr::object_size. I don't understand the difference between them very well, but Max Kuhn claims that pryr::object_size is the function of choice for revealing the actual in-memory size of rsample objects.

memory_tib = tibble( object = list(data
                      , rs
                      , select(rs_mod, -formula, -preds)
                      , select(rs_mod, -preds, - fit)
                      , select(rs_mod, -formula, -fit) 
                      )
        , object_str = c('original data'
                          , 'rsample object'
                          , 'rsample object + fit'
                          , 'rsample object + formula'
                          , 'rsample object + preds'
                          )
        ) %>%
  mutate( object_size = map_dbl(object, function(x) pryr::object_size(x) / pryr::object_size(data) )
          , object.size = map_dbl(object, function(x) object.size(x) / object.size(data) ) ) %>%
  select( object_str, object_size, object.size)

print(memory_tib)
## # A tibble: 5 x 3
##                 object_str object_size object.size
##                      <chr>       <dbl>       <dbl>
## 1            original data    1.000000      1.0000
## 2           rsample object    3.237688    102.5814
## 3     rsample object + fit   81.033910    547.3621
## 4 rsample object + formula    3.483933    103.0416
## 5   rsample object + preds    4.146146    106.0336
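
One likely reason for the gap between the two columns is that object.size() does not detect when elements of a list are shared and counts each of them in full, whereas pryr::object_size is documented to account for shared elements. A base-R sketch with made-up data (`d_demo` and `shared` are illustrative names):

```r
# a data frame and a list that refers to the same data frame 100 times
d_demo = data.frame(x = runif(1e4), y = runif(1e4))
shared = rep(list(d_demo), 100)

# object.size() adds up every list element separately,
# so the reported size is roughly 100 times that of d_demo
ratio_shared = as.numeric(object.size(shared)) / as.numeric(object.size(d_demo))
ratio_shared
```

If every split in an rsample object refers to the same underlying data, object.size() would overstate its size in the same way.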

If we were to split the original data in the traditional way, without using indices, we would increase the memory needed for the data by roughly 100x. pryr::object_size shows that rsample instead keeps the whole object at about 3x the size of the original data. Further, we see that keeping the models in the modelling data frame is very costly and increases the memory need by 50-100x in this case. However, most models will not scale in proportion to the size of the data.
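
A rough base-R sketch of why indices are cheaper than copies, using a made-up integer data frame with the same dimensions as the data here (1470 x 31). The names `indices` and `copies` are illustrative, this is not how rsample stores its splits internally, and the ratios will differ from the table above.

```r
set.seed(4622)

n = 1470
p = 31
d_int = as.data.frame(matrix(sample.int(100, n * p, replace = TRUE), nrow = n))

# 100 analysis sets of 1323 rows each, stored as row indices only
indices = replicate(100, sample.int(n, 1323), simplify = FALSE)

# the same 100 analysis sets stored as materialised copies of the rows
copies = lapply(indices, function(i) d_int[i, ])

size_d       = as.numeric(object.size(d_int))
size_indices = as.numeric(object.size(indices))
size_copies  = as.numeric(object.size(copies))

# data plus indices, relative to the original data
ratio_index = (size_d + size_indices) / size_d
# 100 materialised copies, relative to the original data
ratio_copy  = size_copies / size_d

ratio_index
ratio_copy
```

Here object.size() is adequate because neither list shares elements: the copies really are separate objects.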

2 Using broom

broom has several functions that are meant to tidy up R objects. Here we check whether it makes sense to keep a broom return value instead of the model.

require(pryr)

# a single fold model could be used instead: m = rs_mod$fit[[1]]
# here we refit on the full data
m = glm(formula, data, family = 'binomial')

glance = broom::glance(m)
tidy   = broom::tidy(m)
augment = broom::augment(m, data)

# one row per broom function; object_size is the size of the broom
# return value relative to the size of the model
tib_broom = tibble( .f_str = c('glance', 'tidy', 'augment')
                    , .f_broom = list(broom::glance
                                      , broom::tidy
                                      , broom::augment) ) %>%
  mutate( m = list(m)
          , broom_return = map2(.f_broom, m, function(x,y) x(y) )
          , object_size  = map2_dbl(broom_return, m, function(x,y) as.numeric(pryr::object_size(x)) / as.numeric(pryr::object_size(y)) ) )

By simply saving the broom return values and discarding the model we can save a lot of memory.
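
The same idea can be sketched without broom: keep only a small summary table and drop the fitted model. The example below uses base R's coef(summary()) as a stand-in for broom::tidy() and runs on simulated data, so the numbers are not comparable to the attrition results. `d_sim`, `fit_full` and `coef_tab` are illustrative names.

```r
set.seed(1)

# simulated data, for illustration only
d_sim = data.frame(x = rnorm(1000))
d_sim$y = rbinom(1000, 1, plogis(0.5 * d_sim$x))

fit_full = glm(y ~ x, data = d_sim, family = 'binomial')

# coefficient table: estimate, standard error, z value, p value
coef_tab = coef(summary(fit_full))

# size of the summary table relative to the full model object
ratio_summary = as.numeric(object.size(coef_tab)) / as.numeric(object.size(fit_full))
ratio_summary
```

The trade-off is that predictions on new data are no longer possible once the model itself has been discarded.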