suppressPackageStartupMessages( require(tidyverse) )
This package, which came out in July 2017, seems to be a good alternative to pipelearner. It should work hand in hand with recipes and caret, and its sampling functions should be similar to the resample objects in modelr.
set.seed(4622)
data = rsample::attrition
formula = Attrition ~ JobSatisfaction + Gender + MonthlyIncome
rs = rsample::vfold_cv( data, v = 10, repeats = 10)
rs
## # 10-fold cross-validation repeated 10 times
## # A tibble: 100 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <S3: rsplit> Repeat01 Fold01
## 2 <S3: rsplit> Repeat01 Fold02
## 3 <S3: rsplit> Repeat01 Fold03
## 4 <S3: rsplit> Repeat01 Fold04
## 5 <S3: rsplit> Repeat01 Fold05
## 6 <S3: rsplit> Repeat01 Fold06
## 7 <S3: rsplit> Repeat01 Fold07
## 8 <S3: rsplit> Repeat01 Fold08
## 9 <S3: rsplit> Repeat01 Fold09
## 10 <S3: rsplit> Repeat01 Fold10
## # ... with 90 more rows
It is basically just like the pipelearner object without the trained models. Similar to modelr resample objects, we can convert the rsample splits to data frames using as.data.frame().
split = rs$splits[[1]]
dim( as.data.frame(split) )
## [1] 1323 31
dim( as.data.frame(split, data = 'analysis') )
## [1] 1323 31
dim( as.data.frame(split, data = 'assessment') )
## [1] 147 31
dim( rsample::analysis(split) )
## [1] 1323 31
dim( rsample::assessment(split) )
## [1] 147 31
rs_mod = rs %>%
  mutate( formula = list(formula)
          # glm() coerces the rsplit passed as data to a data frame, which defaults to the analysis set
          , fit = map2( formula, splits, glm, family = 'binomial' )
          # predict on the held-out assessment set of each split
          , preds = map2( fit, splits, function(x,y) predict(x, newdata = rsample::assessment(y)) )
          )
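The preds column now holds link-scale predictions for each assessment set. As a minimal sketch of how they could be used (assuming the Attrition factor has the levels No/Yes and a 0.5 probability cutoff), we can compute a simple out-of-sample accuracy per split:
acc = map2_dbl( rs_mod$preds, rs_mod$splits
                , function(p, s) {
                    obs  = rsample::assessment(s)$Attrition
                    # convert link-scale predictions to probabilities, then to class labels
                    pred = ifelse( binomial()$linkinv(p) > 0.5, 'Yes', 'No' )
                    mean( pred == obs )
                  } )
summary(acc)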
At first glance it looks as if the rsample object increases the memory reserved for the data splits by a factor of 100 in this particular case.
There are two functions for checking memory allocation, utils::object.size() and pryr::object_size(). I do not fully understand the difference between them, but Max Kuhn states that pryr::object_size is the function of choice for revealing the actual in-memory size of rsample objects, since it accounts for memory shared between objects.
memory_tib = tibble( object = list( data
                                    , rs
                                    , select(rs_mod, -formula, -preds)
                                    , select(rs_mod, -preds, -fit)
                                    , select(rs_mod, -formula, -fit)
                                    )
                     , object_str = c( 'original data'
                                       , 'rsample object'
                                       , 'rsample object + fit'
                                       , 'rsample object + formula'
                                       , 'rsample object + preds'
                                       )
                     ) %>%
  mutate( object_size = map_dbl( object, function(x) pryr::object_size(x) / pryr::object_size(data) )
          , object.size = map_dbl( object, function(x) object.size(x) / object.size(data) ) ) %>%
  select( object_str, object_size, object.size )
print(memory_tib)
## # A tibble: 5 x 3
## object_str object_size object.size
## <chr> <dbl> <dbl>
## 1 original data 1.000000 1.0000
## 2 rsample object 3.237688 102.5814
## 3 rsample object + fit 81.033910 547.3621
## 4 rsample object + formula 3.483933 103.0416
## 5 rsample object + preds 4.146146 106.0336
If we were to split the original data in a traditional, non-index-based way, we would increase the memory needed for the data by a factor of 100. pryr::object_size shows that with rsample we can instead limit the additionally required memory to about 3x the original data. Further, we see that keeping the models in the modelling dataframe is very costly and increases the memory requirement by 50-100x in this case; however, most models will not scale in proportion to the size of the data.
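To illustrate what the non-index-based approach would cost, here is a minimal sketch that materialises every analysis and assessment set as a regular data frame; the resulting objects should take up close to 100x the memory of the original data:
# 100 copies of ~90% of the data plus 100 copies of ~10% of the data
dfs     = map( rs$splits, rsample::analysis )
dfs_ass = map( rs$splits, rsample::assessment )
pryr::object_size( dfs, dfs_ass ) / pryr::object_size( data )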
broom has several functions that are meant to tidy up R model objects. Here we will check whether it could make sense to keep a broom return value instead of the model.
require(pryr)
# refit the model on the full dataset (rs_mod$fit[[1]] would work just as well)
m = glm(formula, data, family = 'binomial')
glance = broom::glance(m)
tidy = broom::tidy(m)
augment = broom::augment(m, data)
tib_broom = tibble( .f_broom = c( broom::glance
                                  , broom::tidy
                                  , broom::augment )
                    , .f_str = c( 'glance', 'tidy', 'augment' ) ) %>%
  mutate( m = list(m)
          , broom_return = map2( .f_broom, m, function(x,y) x(y) )
          , object_size = map2_dbl( broom_return, m
                                    , function(x,y) pryr::object_size(x) / pryr::object_size(y) ) )
By simply saving the broom returns and discarding the model, we can save a lot of memory.
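As a closing sketch, this is how the cross-validation dataframe from above could keep only the tidied coefficient summaries instead of the fitted models; the memory footprint should then shrink back towards that of the plain rsample object:
rs_lean = rs_mod %>%
  mutate( tidied = map( fit, broom::tidy ) ) %>%  # keep coefficient summaries only
  select( -formula, -fit, -preds )

pryr::object_size( rs_lean ) / pryr::object_size( data )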