Running out of memory is a common problem when building predictive models in R on large datasets. We show how to significantly reduce the R memory footprint of glm() for fitting generalized linear models and of rpart() for fitting decision trees. We demonstrate the memory reduction on a toy problem, using simulated datasets generated by the R function defined below (after loading the required packages).
if( !require(dplyr) ){ install.packages("dplyr") }
if( !require(ggplot2) ){ install.packages("ggplot2") }
if( !require(rpart) ){ install.packages("rpart") }
if( !require(rpart.plot) ){ install.packages("rpart.plot") }
if( !require(tibble) ){ install.packages("tibble") }
library(dplyr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(tibble)
simulated_data.fn <- function(nrows){
  df <- tibble::as_tibble(
    data.frame(
      X1 = round(rnorm(nrows, mean = 0, sd = 1), 1),
      X2 = round(rnorm(nrows, mean = 0, sd = 1), 1)
    ))
  df$Y <- as.factor(ifelse(df$X1 * df$X2 > 0, "Yes", "No"))
  return(df)
}
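For a quick sense of what the simulated data look like, here is a small illustrative peek (the seed and the choice of 10 rows are ours, not part of the original post): each row has two rounded standard-normal predictors X1 and X2, and Y is "Yes" exactly when X1 * X2 > 0.
set.seed(2024)  # illustrative seed, only for reproducibility of this peek
head(simulated_data.fn(nrows = 10))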
We first need a function that breaks down a fitted model's total memory size into its individual model components. The function below does this, reporting each component's memory size in megabytes.
memory_breakdown.fn <- function(model){
  MB <- sapply(model, FUN = function(t){
    10^(-6) * as.numeric(gsub(pattern = "bytes", replacement = "", x = utils::object.size(t)))
  }, simplify = TRUE)
  return(MB)
}
We begin by fitting a pair of GLMs with training sets of 10k and 100k rows, respectively, and then calculate their estimated total memory sizes.
glm.fit1 <- glm(formula = (Y ~ X1 + X2),
                data = simulated_data.fn(nrows = 10^4),
                family = binomial)
glm.fit2 <- glm(formula = (Y ~ X1 + X2),
                data = simulated_data.fn(nrows = 10^5),
                family = binomial)
sum(memory_breakdown.fn(model = glm.fit1))
## [1] 5.950192
sum(memory_breakdown.fn(model = glm.fit2))
## [1] 57.79019
round(sum(memory_breakdown.fn(model = glm.fit2)) / sum(memory_breakdown.fn(model = glm.fit1)), 2)
## [1] 9.71
Naturally, the total memory size increases as the size of the training set increases. However, we want to know exactly where the memory size increases. Let us take a look at the memory breakdown of glm.fit2 relative to that of glm.fit1.
sort( round(memory_breakdown.fn(model = glm.fit2) / memory_breakdown.fn(model = glm.fit1), 2) )
## coefficients R rank family
## 1.00 1.00 1.00 1.00
## deviance aic null.deviance iter
## 1.00 1.00 1.00 1.00
## df.residual df.null converged boundary
## 1.00 1.00 1.00 1.00
## call formula terms control
## 1.00 1.00 1.00 1.00
## method xlevels model data
## 1.00 1.00 9.73 9.93
## effects qr residuals fitted.values
## 9.98 9.98 10.00 10.00
## linear.predictors weights prior.weights y
## 10.00 10.00 10.00 10.00
We see that the increase occurs in the following glm() model components: model, data, effects, qr, residuals, fitted.values, linear.predictors, weights, prior.weights, and y. We write a function that removes these model components in order to reduce the memory footprint of glm(). (For qr, we keep the list itself and drop only the large qr matrix stored inside it.)
lean_glm.fn <- function(glm_fitted){
  # Drop the large, training-data-sized components; predict() with newdata does not need them.
  glm_fitted$model <- NULL
  glm_fitted$data <- NULL
  glm_fitted$effects <- NULL
  glm_fitted$qr$qr <- NULL   # keep qr itself, drop only its large qr matrix
  glm_fitted$residuals <- NULL
  glm_fitted$fitted.values <- NULL
  glm_fitted$linear.predictors <- NULL
  glm_fitted$weights <- NULL
  glm_fitted$prior.weights <- NULL
  glm_fitted$y <- NULL
  return(glm_fitted)
}
lean.glm1 <- lean_glm.fn(glm_fitted = glm.fit1)
lean.glm2 <- lean_glm.fn(glm_fitted = glm.fit2)
Now, let us compare the total memory usage of glm() and lean_glm.fn().
sum(memory_breakdown.fn(model = lean.glm1))
## [1] 0.179872
sum(memory_breakdown.fn(model = glm.fit1))
## [1] 5.950192
sum(memory_breakdown.fn(model = lean.glm2))
## [1] 0.179872
sum(memory_breakdown.fn(model = glm.fit2))
## [1] 57.79019
We have just stripped away a lot of fat from glm(), but this does not impact our ability to make accurate predictions. We verify this by comparing the predictions made by the lean models against those made by the original glm() fits.
test <- simulated_data.fn(nrows = 10^2)
abs( predict(object = lean.glm1, newdata = test, type = "response") -
predict(object = glm.fit1, newdata = test, type = "response") )
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 91 92 93 94 95 96 97 98 99 100
## 0 0 0 0 0 0 0 0 0 0
abs( predict(object = lean.glm2, newdata = test, type = "response") -
predict(object = glm.fit2, newdata = test, type = "response") )
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 91 92 93 94 95 96 97 98 99 100
## 0 0 0 0 0 0 0 0 0 0
We have learned how to make glm() lean. We follow the same approach to make rpart() lean.
rpart.fit1 <- rpart(formula = (Y ~ X1 + X2),
                    data = simulated_data.fn(nrows = 10^4),
                    method = "class")
rpart.fit2 <- rpart(formula = (Y ~ X1 + X2),
                    data = simulated_data.fn(nrows = 10^5),
                    method = "class")
sort( round(memory_breakdown.fn(model = rpart.fit2) / memory_breakdown.fn(model = rpart.fit1), 2) )
## frame call terms
## 1.00 1.00 1.00
## cptable method parms
## 1.00 1.00 1.00
## control functions numresp
## 1.00 1.00 1.00
## splits variable.importance ordered
## 1.00 1.00 1.00
## y where
## 9.99 10.00
We see that the memory size increases in the following model components: where and y. We write a function that removes these model components in order to reduce the memory footprint of rpart().
lean_rpart.fn <- function(tree_fitted){
  tree_fitted$where <- NULL   # leaf assignment for each training observation
  tree_fitted$y <- NULL       # stored copy of the training response
  return(tree_fitted)
}
lean.rpart1 <- lean_rpart.fn(tree_fitted = rpart.fit1)
lean.rpart2 <- lean_rpart.fn(tree_fitted = rpart.fit2)
Let us compare the total memory usage of lean_rpart.fn() and rpart().
sum(memory_breakdown.fn(model = lean.rpart1))
## [1] 0.060232
sum(memory_breakdown.fn(model = rpart.fit1))
## [1] 0.780488
sum(memory_breakdown.fn(model = lean.rpart2))
## [1] 0.060232
sum(memory_breakdown.fn(model = rpart.fit2))
## [1] 7.260488
The fat is now stripped away from rpart(), but again this does not impact our ability to make accurate predictions. We verify this by comparing the predictions made by the lean models against those made by the original rpart() fits.
test <- simulated_data.fn(nrows = 10^2)
head(
abs( predict(object = lean.rpart1, newdata = test, type = "prob") -
predict(object = rpart.fit1, newdata = test, type = "prob") )
)
## No Yes
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
head(
abs( predict(object = lean.rpart2, newdata = test, type = "prob") -
predict(object = rpart.fit2, newdata = test, type = "prob") )
)
## No Yes
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Our lean rpart models make accurate predictions, and lean_rpart.fn() also does not impact our ability to plot fitted trees. Awesome!
prp(x = lean.rpart1, roundint = FALSE)
Plotting the memory sizes of fitted models for training sets ranging from 100k to 1M rows shows that the memory sizes of lean_glm.fn() and lean_rpart.fn() remain constant as the training set grows larger, unlike those of glm() and rpart(). The memory footprint reduction is simply amazing!
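Here is a minimal sketch of how such a plot could be produced for the GLM case, reusing the helpers defined above; the grid of training set sizes, the name plot_df, and the ggplot styling are illustrative choices and not taken from the original benchmark.
sizes <- seq(from = 10^5, to = 10^6, by = 10^5)
plot_df <- do.call(rbind, lapply(sizes, function(n){
  fit <- glm(formula = (Y ~ X1 + X2),
             data = simulated_data.fn(nrows = n),
             family = binomial)
  data.frame(nrows = c(n, n),
             model = c("glm()", "lean_glm.fn()"),
             MB    = c(sum(memory_breakdown.fn(model = fit)),
                       sum(memory_breakdown.fn(model = lean_glm.fn(glm_fitted = fit)))))
}))
ggplot(data = plot_df, mapping = aes(x = nrows, y = MB, colour = model)) +
  geom_line() +
  geom_point() +
  labs(x = "Training set size (rows)", y = "Total model memory size (MB)")
The same loop with rpart() and lean_rpart.fn() gives the corresponding plot for decision trees.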
Just remember, there is no such thing as a free lunch, and life is all about tradeoffs and compromises! The takeaway is that there is indeed a price to be paid for reducing the memory footprint of glm() and rpart(). For example, with lean_glm.fn() you will no longer be able to call functions like summary(), anova(), and residuals(), but that is perfectly fine if the only functionality you really need is the ability to make predictions. You can use glm() and rpart() on a smaller representative sample while investigating and searching for the best-fitting model, then use lean_glm.fn() and lean_rpart.fn() to deploy your best model at a much larger scale without leaving a huge R memory footprint.
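As a hedged sketch of that tradeoff and workflow (the 1M-row refit, the object names full_fit and deployed, and the file name are illustrative assumptions, not from the original post): summary() errors on a lean fit because the components it needs have been removed, while the lean object is still all you need to keep for prediction at scale.
tryCatch(summary(object = lean.glm1),
         error = function(e) cat("summary() fails on the lean model:", conditionMessage(e), "\n"))
# Explore with full glm() objects on a smaller sample, then refit on the full
# data and keep (and save) only the lean version for deployment.
full_fit <- glm(formula = (Y ~ X1 + X2),
                data = simulated_data.fn(nrows = 10^6),
                family = binomial)
deployed <- lean_glm.fn(glm_fitted = full_fit)
rm(full_fit)                                   # drop the heavy full fit from memory
saveRDS(deployed, file = "deployed_glm.rds")   # illustrative file name; the saved object stays small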