Saving and loading R objects: performance testing

Author: Jeff Johnston

2015-04-23 15:47:23

This was inspired by (and borrows heavily from) Hadley Wickham’s saveRDS and compression document.

R’s saveRDS function compresses with gzip by default, and readRDS decompresses transparently. saveRDS can also use the other compression methods built into R (bzip2 and xz), or you can supply your own connection. Combined with the pipe function, a connection lets us delegate compression and decompression to command-line tools. This makes it possible to test a few compression methods that are not implemented in R, such as parallel gzip (pigz), snappy (snzip) and lz4.
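As a minimal sketch of the routes described above: choose a built-in method by name, open a connection yourself, or hand compression to an external tool through pipe. The object and file names are made up for illustration, and the last part assumes a gzip command-line tool is on the PATH (it is skipped otherwise).

```r
# Small illustrative object (not the benchmark data).
obj <- data.frame(a = 1:1000, b = rnorm(1000))

# 1. Built-in compression, chosen by name.
f_gz <- tempfile(fileext = ".rds")
f_xz <- tempfile(fileext = ".rds")
saveRDS(obj, f_gz)                     # gzip is the default
saveRDS(obj, f_xz, compress = "xz")
from_gz <- readRDS(f_gz)
from_xz <- readRDS(f_xz)

# 2. A connection you open yourself; here bzip2 via bzfile().
f_bz <- tempfile(fileext = ".rds")
con <- bzfile(f_bz, "wb")
saveRDS(obj, con)
close(con)
from_bz <- readRDS(f_bz)               # readRDS detects the compression itself

# 3. A pipe connection that delegates compression to an external tool.
#    Assumption: a `gzip` executable is available; otherwise this is skipped.
from_pipe <- NULL
if (nzchar(Sys.which("gzip"))) {
  f_pipe <- tempfile(fileext = ".rds")
  con <- pipe(paste("gzip -c >", shQuote(f_pipe)), "wb")
  saveRDS(obj, con)
  close(con)                           # closing the pipe flushes the output
  from_pipe <- readRDS(f_pipe)
}
```

Because the file written through the pipe is an ordinary gzip stream, plain readRDS can read it back without any pipe on the loading side.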

If you are using a Mac with Homebrew, you can easily install these three extra compression tools:

brew install pigz
brew install snzip
brew install lz4

library(dplyr)
library(magrittr)
library(ggplot2)
library(pander)
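panderOptions("table.split.table", Inf)

# Open a write connection with con_func, save the object, then close it.
# Built-in connection functions (file, gzfile) take only a filename; the pipe
# helpers also take a mode plus any extra arguments (such as cores) via `...`.
saveRDS_wrapper <- function(object, filename, con_func, func_type, ...) {
  if(func_type == "builtin") {
    con <- con_func(filename)
  } else {
    con <- con_func(filename, mode="write", ...)
  }
  on.exit(close(con))
  saveRDS(object, con)
}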

readRDS_wrapper <- function(filename, con_func, func_type, ...) {
  if(func_type == "builtin") {
    con <- con_func(filename)
  } else {
    con <- con_func(filename, mode="read", ...)
  }
  on.exit(close(con))
  readRDS(con)
}

# Each helper returns an open pipe connection that hands compression
# (mode="write") or decompression (mode="read") to a command-line tool.
# shQuote() protects filenames containing spaces or shell metacharacters.
snappy_pipe <- function(filename, mode="read") {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | snzip -dc"), "rb")
  } else {
    con <- pipe(paste0("snzip > ", shQuote(filename)), "wb")
  }
  con
}

pigz_pipe <- function(filename, mode="read", cores=4) {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | pigz -dcp ", cores), "rb")
  } else {
    con <- pipe(paste0("pigz -p ", cores, " > ", shQuote(filename)), "wb")
  }
  con
}

lz4_pipe <- function(filename, mode="read") {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | lz4 -d"), "rb")
  } else {
    con <- pipe(paste0("lz4 -z > ", shQuote(filename)), "wb")
  }
  con
}

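# Time one save and one load of `object` through con_func and record the size
# of the file written. Times are elapsed (wall-clock) seconds.
roundtrip_timings <- function(object, con_func, compression_type, func_type, ...) {
  testfile <- tempfile()
  on.exit(unlink(testfile))  # remove the temporary file when done

  save_time <- system.time(saveRDS_wrapper(object, testfile, con_func, func_type=func_type, ...))[["elapsed"]]
  load_time <- system.time(x <- readRDS_wrapper(testfile, con_func, func_type=func_type, ...))[["elapsed"]]
  size <- file.info(testfile)$size / (1024) ^ 2  # bytes to MB

  data_frame(compression_type = compression_type,
             save_time        = save_time,
             load_time        = load_time,
             file_size_mb     = size)
}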

We’ll benchmark these various compression methods against using no compression with a large data frame (nearly 100MB uncompressed). Feel free to experiment with objects that are more representative of the data you normally use in R.
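The "nearly 100MB" figure can be checked with a little arithmetic. This is a sketch that assumes the column types of the data frame built below (two double columns and one integer column); the names bytes_per_row, estimated_mb and ratio are only for illustration.

```r
n <- 5e6

# Bytes per row: two double columns (8 bytes each) and one integer column (4 bytes).
bytes_per_row <- 8 + 8 + 4
estimated_mb <- n * bytes_per_row / 1024^2
round(estimated_mb, 2)  # 95.37

# Cross-check on a smaller frame built the same way. object.size() also counts
# a small fixed overhead for names and attributes, so the ratio is just above 1.
m <- 1e5
x <- runif(m)
small <- data.frame(x = x, y = x, z = sample(1:10, m, replace = TRUE))
ratio <- as.numeric(object.size(small)) / (m * bytes_per_row)
ratio
```

The estimate agrees with the file_size_mb reported for the uncompressed ("none") case in the results table further down.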

n <- 5e6
x <- runif(n)
z <- sample(1:10, n, replace=TRUE)
o <- data.frame(x = x, y = x, z = z)

tmp <- tempfile()
saveRDS(o, tmp)
o <- readRDS(tmp)

times <- bind_rows(
  roundtrip_timings(o, file,        compression_type="none",           func_type="builtin"),
  roundtrip_timings(o, gzfile,      compression_type="gzip",           func_type="builtin"),
  roundtrip_timings(o, pigz_pipe,   compression_type="pigz (4 cores)", func_type="pipe", cores=4),
  roundtrip_timings(o, snappy_pipe, compression_type="snappy",         func_type="pipe"),
  roundtrip_timings(o, lz4_pipe,    compression_type="lz4",            func_type="pipe")
  # To include bzip2 and xz, add a comma after the lz4 line and uncomment:
  #roundtrip_timings(o, bzfile, compression_type="bzip", func_type="builtin"),
  #roundtrip_timings(o, xzfile, compression_type="xz", func_type="builtin")
)

raw_save <- times$save_time[1]
raw_load <- times$load_time[1]
raw_size <- times$file_size_mb[1]

times %<>% transform(save_slowdown = save_time / raw_save,
                     load_slowdown = load_time / raw_load,
                     size_reduction_percent = 100 * (raw_size - file_size_mb) / raw_size)

times %>% pander(caption="Results")
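Results (times in seconds, file sizes in MB)

| compression_type | save_time | load_time | file_size_mb | save_slowdown | load_slowdown | size_reduction_percent |
|------------------|-----------|-----------|--------------|---------------|---------------|------------------------|
| none             | 0.22      | 0.165     | 95.37        | 1             | 1             | 0                      |
| gzip             | 10.54     | 0.425     | 53.84        | 47.92         | 2.576         | 43.54                  |
| pigz (4 cores)   | 2.829     | 0.322     | 53.8         | 12.86         | 1.952         | 43.59                  |
| snappy           | 0.443     | 0.146     | 73.91        | 2.014         | 0.8848        | 22.5                   |
| lz4              | 0.581     | 0.242     | 76.62        | 2.641         | 1.467         | 19.66                  |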

To visualize these results, we will plot the relative performance of each compression method when compared to no compression:
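A minimal ggplot2 sketch of one way to draw such a plot, using the relative columns from the results table above. The layout (one bar-chart panel per metric) and the object names results, long and p are assumptions for illustration, not necessarily what I used for the original figure.

```r
library(ggplot2)

# Values copied from the results table above.
results <- data.frame(
  compression_type       = c("none", "gzip", "pigz (4 cores)", "snappy", "lz4"),
  save_slowdown          = c(1, 47.92, 12.86, 2.014, 2.641),
  load_slowdown          = c(1, 2.576, 1.952, 0.8848, 1.467),
  size_reduction_percent = c(0, 43.54, 43.59, 22.5, 19.66),
  stringsAsFactors = FALSE
)

# Reshape to long form: one row per (compression method, metric) pair.
metrics <- c("save_slowdown", "load_slowdown", "size_reduction_percent")
long <- do.call(rbind, lapply(metrics, function(m) {
  data.frame(compression_type = results$compression_type,
             metric           = m,
             value            = results[[m]],
             stringsAsFactors = FALSE)
}))

# Keep the methods and metrics in table order rather than alphabetical order.
long$compression_type <- factor(long$compression_type,
                                levels = results$compression_type)
long$metric <- factor(long$metric, levels = metrics)

# One panel per metric, each with its own y scale.
p <- ggplot(long, aes(x = compression_type, y = value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ metric, scales = "free_y") +
  labs(x = "Compression method", y = NULL) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

Free y scales are used because the save slowdown for gzip is far larger than any of the load slowdowns, which would otherwise flatten the other panels.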

When I first rendered this document in RStudio, the results were quite different from those produced by rendering the same document in an R console session with rmarkdown::render. Only the no-compression result seems to slow down on load; the results for the compression methods do not. I re-rendered the document several times and the difference was consistent. I cannot explain it.
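One way to check whether a difference like this is measurement noise is to repeat each round trip several times and compare medians. The sketch below is an assumption about how that could be done, not part of the benchmark above: time_roundtrip and its reps argument are hypothetical names, and it only covers the built-in file and gzfile connections on a much smaller data frame.

```r
# Repeat a save/load round trip `reps` times and return the median elapsed
# time (in seconds) for each direction.
time_roundtrip <- function(object, con_func, reps = 5) {
  one_run <- function() {
    f <- tempfile()
    on.exit(unlink(f))
    # Time opening, writing and closing together, since data may only be
    # flushed to disk when the connection is closed.
    save_time <- system.time({
      con <- con_func(f, "wb")
      saveRDS(object, con)
      close(con)
    })[["elapsed"]]
    load_time <- system.time({
      con <- con_func(f, "rb")
      readRDS(con)
      close(con)
    })[["elapsed"]]
    c(save = save_time, load = load_time)
  }
  runs <- vapply(seq_len(reps), function(i) one_run(), c(save = 0, load = 0))
  apply(runs, 1, median)
}

# Small illustrative data frame so the sketch runs quickly.
small <- data.frame(x = runif(1e5), z = sample(1:10, 1e5, replace = TRUE))
raw_times  <- time_roundtrip(small, file)
gzip_times <- time_roundtrip(small, gzfile)
raw_times
gzip_times
```

Running the same script in both environments and comparing the medians would show whether the gap persists across repetitions within a single session.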

You can see the results from a run under the R console here.

The source RMarkdown file for this document can be found here.

Session info

sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pander_0.5.1  ggplot2_1.0.1 magrittr_1.5  dplyr_0.4.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5      digest_0.6.8     assertthat_0.1   MASS_7.3-40     
##  [5] grid_3.2.0       plyr_1.8.2       gtable_0.1.2     DBI_0.3.1       
##  [9] formatR_1.2      scales_0.2.4     evaluate_0.7     lazyeval_0.1.10 
## [13] reshape2_1.4.1   rmarkdown_0.5.1  labeling_0.3     proto_0.3-10    
## [17] tools_3.2.0      stringr_0.6.2    munsell_0.4.2    parallel_3.2.0  
## [21] colorspace_1.2-6 htmltools_0.2.6  knitr_1.9