Running out of memory is a common problem when building predictive models in R on large datasets. We show how to significantly reduce the R memory footprint of glm() for fitting generalized linear models and of rpart() for fitting decision trees. We demonstrate the memory reduction on a toy problem, using simulated datasets generated by the R function defined below (after loading the required packages).
if( !require(dplyr) ){ install.packages("dplyr") }
if( !require(ggplot2) ){ install.packages("ggplot2") }
if( !require(rpart) ){ install.packages("rpart") }
if( !require(rpart.plot) ){ install.packages("rpart.plot") }
if( !require(tibble) ){ install.packages("tibble") }
library(dplyr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(tibble)
simulated_data.fn <- function(nrows){
  df <- tibble::as_tibble(
    data.frame(
      X1 = round(rnorm(nrows, mean = 0, sd = 1), 1),
      X2 = round(rnorm(nrows, mean = 0, sd = 1), 1)
    ))
  df$Y <- as.factor(ifelse(df$X1 * df$X2 > 0, "Yes", "No"))
  return(df)
}
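For a quick sense of what the simulated data look like, here is a small illustrative peek (the seed and the choice of 10 rows are ours, not part of the original post): each row has two rounded standard-normal predictors X1 and X2, and Y is "Yes" exactly when X1 * X2 > 0.
set.seed(2024)  # illustrative seed, only for reproducibility of this peek
head(simulated_data.fn(nrows = 10))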
We first need a function that breaks down a fitted model's total memory size into its individual model components. The function below does this, reporting each component's memory size in megabytes.
memory_breakdown.fn <- function(model){
  MB <- sapply(model, FUN = function(t){
    10^(-6) * as.numeric(gsub(pattern = "bytes", replacement = "", x = utils::object.size(t)))
  }, simplify = TRUE)
  return(MB)
}
We begin by fitting a pair of GLMs with training sets of 10k and 100k rows, respectively, and then calculate their estimated total memory sizes.
glm.fit1 <- glm(formula = (Y ~ X1 + X2),
                data = simulated_data.fn(nrows = 10^4),
                family = binomial)
glm.fit2 <- glm(formula = (Y ~ X1 + X2),
                data = simulated_data.fn(nrows = 10^5),
                family = binomial)
sum(memory_breakdown.fn(model = glm.fit1))
## [1] 5.950192
sum(memory_breakdown.fn(model = glm.fit2))
## [1] 57.79019
round(sum(memory_breakdown.fn(model = glm.fit2)) / sum(memory_breakdown.fn(model = glm.fit1)), 2)
## [1] 9.71
Naturally, the total memory size increases as the size of the training set increases. However, we want to know exactly where the memory size increases. Let us take a look at the memory breakdown of glm.fit2 relative to that of glm.fit1.
sort( round(memory_breakdown.fn(model = glm.fit2) / memory_breakdown.fn(model = glm.fit1), 2) )
## coefficients R rank family
## 1.00 1.00 1.00 1.00
## deviance aic null.deviance iter
## 1.00 1.00 1.00 1.00
## df.residual df.null converged boundary
## 1.00 1.00 1.00 1.00
## call formula terms control
## 1.00 1.00 1.00 1.00
## method xlevels model data
## 1.00 1.00 9.73 9.93
## effects qr residuals fitted.values
## 9.98 9.98 10.00 10.00
## linear.predictors weights prior.weights y
## 10.00 10.00 10.00 10.00
We see that the increase occurs in the following glm() model components: model, data, effects, qr, residuals, fitted.values, linear.predictors, weights, prior.weights, and y. We write a function that removes these model components in order to reduce the memory footprint of glm(). (For qr, we keep the list itself and drop only the large qr matrix stored inside it.)
lean_glm.fn <- function(glm_fitted){
  # Drop the large, training-data-sized components; predict() with newdata does not need them.
  glm_fitted$model <- NULL
  glm_fitted$data <- NULL
  glm_fitted$effects <- NULL
  glm_fitted$qr$qr <- NULL   # keep qr itself, drop only its large qr matrix
  glm_fitted$residuals <- NULL
  glm_fitted$fitted.values <- NULL
  glm_fitted$linear.predictors <- NULL
  glm_fitted$weights <- NULL
  glm_fitted$prior.weights <- NULL
  glm_fitted$y <- NULL
  return(glm_fitted)
}
lean.glm1 <- lean_glm.fn(glm_fitted = glm.fit1)
lean.glm2 <- lean_glm.fn(glm_fitted = glm.fit2)
Now, let us compare the total memory usage of glm() and lean_glm.fn().
sum(memory_breakdown.fn(model = lean.glm1))
## [1] 0.179872
sum(memory_breakdown.fn(model = glm.fit1))
## [1] 5.950192
sum(memory_breakdown.fn(model = lean.glm2))
## [1] 0.179872
sum(memory_breakdown.fn(model = glm.fit2))
## [1] 57.79019
We have just stripped away a lot of fat from glm(), but this does not impact our ability to make accurate predictions. We verify this by comparing the predictions made by the lean models against those made by the original glm() fits.
test <- simulated_data.fn(nrows = 10^2)
abs( predict(object = lean.glm1, newdata = test, type = "response") -
predict(object = glm.fit1, newdata = test, type = "response") )
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 91 92 93 94 95 96 97 98 99 100
## 0 0 0 0 0 0 0 0 0 0
abs( predict(object = lean.glm2, newdata = test, type = "response") -
predict(object = glm.fit2, newdata = test, type = "response") )
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 91 92 93 94 95 96 97 98 99 100
## 0 0 0 0 0 0 0 0 0 0
We have learned how to make glm() lean. We follow the same approach to make rpart() lean.
rpart.fit1 <- rpart(formula = (Y ~ X1 + X2),
                    data = simulated_data.fn(nrows = 10^4),
                    method = "class")
rpart.fit2 <- rpart(formula = (Y ~ X1 + X2),
                    data = simulated_data.fn(nrows = 10^5),
                    method = "class")
sort( round(memory_breakdown.fn(model = rpart.fit2) / memory_breakdown.fn(model = rpart.fit1), 2) )
## frame call terms
## 1.00 1.00 1.00
## cptable method parms
## 1.00 1.00 1.00
## control functions numresp
## 1.00 1.00 1.00
## splits variable.importance ordered
## 1.00 1.00 1.00
## y where
## 9.99 10.00
We see that the memory size increases in the following model components: where and y. We write a function that removes these model components in order to reduce the memory footprint of rpart().
lean_rpart.fn <- function(tree_fitted){
  tree_fitted$where <- NULL   # leaf assignment for each training observation
  tree_fitted$y <- NULL       # stored copy of the training response
  return(tree_fitted)
}
lean.rpart1 <- lean_rpart.fn(tree_fitted = rpart.fit1)
lean.rpart2 <- lean_rpart.fn(tree_fitted = rpart.fit2)
Let us compare the total memory usage of lean_rpart.fn() and rpart().
sum(memory_breakdown.fn(model = lean.rpart1))
## [1] 0.060232
sum(memory_breakdown.fn(model = rpart.fit1))
## [1] 0.780488
sum(memory_breakdown.fn(model = lean.rpart2))
## [1] 0.060232
sum(memory_breakdown.fn(model = rpart.fit2))
## [1] 7.260488
The fat is now stripped away from rpart(), but again this does not impact our ability to make accurate predictions. We verify this by comparing the predictions made by the lean models against those made by the original rpart() fits.
test <- simulated_data.fn(nrows = 10^2)
head(
abs( predict(object = lean.rpart1, newdata = test, type = "prob") -
predict(object = rpart.fit1, newdata = test, type = "prob") )
)
## No Yes
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
head(
abs( predict(object = lean.rpart2, newdata = test, type = "prob") -
predict(object = rpart.fit2, newdata = test, type = "prob") )
)
## No Yes
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Our lean rpart models make accurate predictions, and lean_rpart.fn() also does not impact our ability to plot fitted trees. Awesome!
prp(x = lean.rpart1, roundint = FALSE)
Plotting the memory sizes of fitted models for training sets ranging from 100k to 1M rows shows that the memory sizes of lean_glm.fn() and lean_rpart.fn() remain constant as the training set grows larger, unlike those of glm() and rpart(). The memory footprint reduction is simply amazing!
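Here is a minimal sketch of how such a plot could be produced for the GLM case, reusing the helpers defined above; the grid of training set sizes, the name plot_df, and the ggplot styling are illustrative choices and not taken from the original benchmark.
sizes <- seq(from = 10^5, to = 10^6, by = 10^5)
plot_df <- do.call(rbind, lapply(sizes, function(n){
  fit <- glm(formula = (Y ~ X1 + X2),
             data = simulated_data.fn(nrows = n),
             family = binomial)
  data.frame(nrows = c(n, n),
             model = c("glm()", "lean_glm.fn()"),
             MB    = c(sum(memory_breakdown.fn(model = fit)),
                       sum(memory_breakdown.fn(model = lean_glm.fn(glm_fitted = fit)))))
}))
ggplot(data = plot_df, mapping = aes(x = nrows, y = MB, colour = model)) +
  geom_line() +
  geom_point() +
  labs(x = "Training set size (rows)", y = "Total model memory size (MB)")
The same loop with rpart() and lean_rpart.fn() gives the corresponding plot for decision trees.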
Just remember, there is no such thing as a free lunch, and life is all about tradeoffs and compromises! The takeaway is that there is indeed a price to be paid for reducing the memory footprint of glm() and rpart(). For example, with lean_glm.fn() you will no longer be able to call functions like summary(), anova(), and residuals(), but that is perfectly fine if the only functionality you really need is the ability to make predictions. You can use glm() and rpart() on a smaller representative sample while investigating and searching for the best-fitting model, then use lean_glm.fn() and lean_rpart.fn() to deploy your best model at a much larger scale without leaving a huge R memory footprint.
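As a hedged sketch of that tradeoff and workflow (the 1M-row refit, the object names full_fit and deployed, and the file name are illustrative assumptions, not from the original post): summary() errors on a lean fit because the components it needs have been removed, while the lean object is still all you need to keep for prediction at scale.
tryCatch(summary(object = lean.glm1),
         error = function(e) cat("summary() fails on the lean model:", conditionMessage(e), "\n"))
# Explore with full glm() objects on a smaller sample, then refit on the full
# data and keep (and save) only the lean version for deployment.
full_fit <- glm(formula = (Y ~ X1 + X2),
                data = simulated_data.fn(nrows = 10^6),
                family = binomial)
deployed <- lean_glm.fn(glm_fitted = full_fit)
rm(full_fit)                                   # drop the heavy full fit from memory
saveRDS(deployed, file = "deployed_glm.rds")   # illustrative file name; the saved object stays small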