Saving and loading R objects: performance testing

Author: Jeff Johnston

2015-04-23 15:47:23

This was inspired by (and borrows heavily from) Hadley Wickham’s saveRDS and compression document.

R’s saveRDS and readRDS functions use gzip compression by default. saveRDS can also use the other compression methods built into R, such as bzip2 and xz. In addition, you can supply your own connection instead of a file name. Combined with the pipe function, this lets us delegate compression and decompression to command line tools, and so test a few compression methods that are not implemented in R, such as parallel gzip (pigz), snappy (snzip) and lz4.
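As a minimal sketch of the connection mechanism described above, the example below round-trips a small object through the system gzip command instead of R's built-in compressor. It assumes a Unix-like shell with gzip on the PATH; the object and the temporary file are illustrative only and are not part of the benchmark.

```r
# Sketch: delegate compression to an external tool through pipe().
# Assumes a Unix-like shell with `gzip` on the PATH (an assumption made
# for this illustration only).
obj <- data.frame(id = 1:1000, value = sqrt(1:1000))
path <- tempfile(fileext = ".rds.gz")

if (nzchar(Sys.which("gzip"))) {
  # Write: saveRDS() streams the serialized object into gzip's stdin.
  con <- pipe(paste("gzip -c >", shQuote(path)), "wb")
  saveRDS(obj, con)
  close(con)

  # Read: gzip decompresses to stdout, which readRDS() consumes.
  con <- pipe(paste("gzip -dc", shQuote(path)), "rb")
  restored <- readRDS(con)
  close(con)

  # Check that the object survived the round trip.
  print(identical(obj, restored))
}
```

The pipe helpers defined later follow the same pattern, with pigz, snzip and lz4 in place of gzip.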

If you are using a Mac with Homebrew, you can easily install these three extra compression tools:

brew install pigz
brew install snzip
brew install lz4
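Before running the benchmarks, it may help to confirm from R that the three tools are actually visible on the PATH. The following is a small sketch using base R's Sys.which; the variable names are illustrative.

```r
# Report which of the external compressors are visible on the PATH.
tools <- c("pigz", "snzip", "lz4")
found <- nzchar(Sys.which(tools))
names(found) <- tools
print(found)

# List any tool that could not be located.
missing_tools <- tools[!found]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```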

library(dplyr)
library(magrittr)
library(ggplot2)
library(pander)
panderOptions("table.split.table", Inf)

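# Save `object` through a connection built by `con_func`.
# func_type "builtin" means a base R connection constructor such as file() or
# gzfile(); anything else is treated as one of the pipe helpers defined below.
saveRDS_wrapper <- function(object, filename, con_func, func_type, ...) {
  if(func_type == "builtin") {
    con <- con_func(filename)
  } else {
    # Forward extra arguments (e.g. cores) so they apply to writes too
    con <- con_func(filename, mode="write", ...)
  }
  on.exit(close(con))
  saveRDS(object, con)
}

# Load an object through a connection built by `con_func`.
readRDS_wrapper <- function(filename, con_func, func_type, ...) {
  if(func_type == "builtin") {
    con <- con_func(filename)
  } else {
    con <- con_func(filename, mode="read", ...)
  }
  on.exit(close(con))
  readRDS(con)
}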

# Each helper returns an open binary pipe to a command line compressor.
# shQuote() protects file names that contain spaces or shell metacharacters.
snappy_pipe <- function(filename, mode="read") {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | snzip -dc"), "rb")
  } else {
    con <- pipe(paste0("snzip > ", shQuote(filename)), "wb")
  }
  con
}

# `cores` sets the number of pigz threads (-p)
pigz_pipe <- function(filename, mode="read", cores=4) {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | pigz -dcp ", cores), "rb")
  } else {
    con <- pipe(paste0("pigz -p ", cores, " > ", shQuote(filename)), "wb")
  }
  con
}

lz4_pipe <- function(filename, mode="read") {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | lz4 -d"), "rb")
  } else {
    con <- pipe(paste0("lz4 -z > ", shQuote(filename)), "wb")
  }
  con
}

# Time one save and one load of `object`, and record the resulting file size.
roundtrip_timings <- function(object, con_func, compression_type, func_type, ...) {
  testfile <- tempfile()
  on.exit(unlink(testfile))  # remove the temporary file when done

  # Elapsed (wall-clock) seconds
  save <- system.time(saveRDS_wrapper(object, testfile, con_func, func_type=func_type, ...))[["elapsed"]]
  load <- system.time(x <- readRDS_wrapper(testfile, con_func, func_type=func_type, ...))[["elapsed"]]
  size <- file.info(testfile)$size / 1024^2  # bytes to MiB

  data_frame(compression_type = compression_type,
             save_time        = save,
             load_time        = load,
             file_size_mb     = size)
}

We’ll benchmark these compression methods against no compression, using a large data frame (nearly 100 MB uncompressed). Feel free to experiment with objects that are more representative of the data you normally use in R.

n <- 5e6
x <- runif(n)
z <- sample(1:10, n, replace=TRUE)
o <- data.frame(x = x, y = x, z = z)

tmp <- tempfile()
saveRDS(o, tmp)
o <- readRDS(tmp)
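The "nearly 100 MB" figure can be checked with a little arithmetic: the data frame holds two double columns and one integer column, so each row costs about 20 bytes. The sketch below works this out and cross-checks the per-row cost on a smaller data frame of the same shape; the names n_small, x_small and small are illustrative.

```r
# Two double columns (8 bytes each) and one integer column (4 bytes).
bytes_per_row <- 8 + 8 + 4
n_rows <- 5e6
expected_mib <- n_rows * bytes_per_row / 1024^2
print(round(expected_mib, 2))  # 95.37

# Cross-check the per-row cost on a smaller data frame of the same shape.
n_small <- 1e5
x_small <- runif(n_small)
small <- data.frame(x = x_small, y = x_small,
                    z = sample(1:10, n_small, replace = TRUE))
per_row <- as.numeric(object.size(small)) / n_small
print(per_row)  # slightly above 20 because of fixed per-object overhead
```

The computed size agrees with the file size reported for the uncompressed case in the results.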

times <- bind_rows(
  roundtrip_timings(o, file,        compression_type="none",           func_type="builtin"),
  roundtrip_timings(o, gzfile,      compression_type="gzip",           func_type="builtin"),
  roundtrip_timings(o, pigz_pipe,   compression_type="pigz (4 cores)", func_type="pipe", cores=4),
  roundtrip_timings(o, snappy_pipe, compression_type="snappy",         func_type="pipe"),
  roundtrip_timings(o, lz4_pipe,    compression_type="lz4",            func_type="pipe")
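  # Built-in bzip2 and xz are disabled in this run. To enable them, add a
  # comma after the lz4 line and uncomment:
  # roundtrip_timings(o, bzfile, compression_type="bzip", func_type="builtin"),
  # roundtrip_timings(o, xzfile, compression_type="xz",   func_type="builtin")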
)

raw_save <- times$save_time[1]
raw_load <- times$load_time[1]
raw_size <- times$file_size_mb[1]

times %<>% transform(save_slowdown = save_time / raw_save,
                     load_slowdown = load_time / raw_load,
                     size_reduction_percent = 100 * (raw_size - file_size_mb) / raw_size)

times %>% pander(caption="Results")

Results
compression_type  save_time  load_time  file_size_mb  save_slowdown  load_slowdown  size_reduction_percent
none              0.22       0.165      95.37         1              1              0
gzip              10.54      0.425      53.84         47.92          2.576          43.54
pigz (4 cores)    2.829      0.322      53.8          12.86          1.952          43.59
snappy            0.443      0.146      73.91         2.014          0.8848         22.5
lz4               0.581      0.242      76.62         2.641          1.467          19.66

To visualize these results, we plot the performance of each compression method relative to no compression.
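The plotting code is not shown in this document, so the following is only a sketch of how such a comparison might be drawn with ggplot2. The numbers are copied from the results table above; the names results, metrics and long, and the choice of a faceted bar chart, are assumptions rather than the original figure.

```r
# Numbers copied from the results table above.
results <- data.frame(
  compression_type = c("none", "gzip", "pigz (4 cores)", "snappy", "lz4"),
  save_slowdown    = c(1, 47.92, 12.86, 2.014, 2.641),
  load_slowdown    = c(1, 2.576, 1.952, 0.8848, 1.467),
  size_reduction_percent = c(0, 43.54, 43.59, 22.5, 19.66),
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per (method, metric) pair.
metrics <- c("save_slowdown", "load_slowdown", "size_reduction_percent")
long <- do.call(rbind, lapply(metrics, function(m) {
  data.frame(compression_type = results$compression_type,
             metric = m, value = results[[m]], stringsAsFactors = FALSE)
}))
# Keep the methods in table order rather than alphabetical order.
long$compression_type <- factor(long$compression_type,
                                levels = results$compression_type)

# Draw one panel per metric, each with its own y scale.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = compression_type, y = value)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ metric, scales = "free_y") +
    labs(x = "Compression method", y = NULL)
  print(p)
}
```

Free y scales are used because the save slowdown for gzip is far larger than any other value and would otherwise flatten the remaining panels.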

When I first rendered this document in RStudio, the results were quite different from those produced by rendering the same document from an R console session with rmarkdown::render. Only the no-compression result seems to slow down on load; the results for the compressed formats do not. I re-rendered the document several times and the results remained consistent. I cannot explain this.
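One way to gauge how much a single load timing can vary, in either environment, is to repeat the load several times and compare the spread. This is a minimal sketch, not an explanation of the difference; it uses a small illustrative object rather than the benchmark data frame, and n_runs is a hypothetical repetition count.

```r
# Sketch: time repeated loads of one uncompressed file to see the spread.
obj <- data.frame(x = runif(1e5))
path <- tempfile(fileext = ".rds")
saveRDS(obj, path, compress = FALSE)

n_runs <- 10  # hypothetical repetition count
load_times <- vapply(seq_len(n_runs), function(i) {
  system.time(readRDS(path))[["elapsed"]]
}, numeric(1))

# Minimum, median and maximum give a quick view of run-to-run variation.
print(summary(load_times))
unlink(path)
```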

You can see the results when run under R console here.

The source RMarkdown file for this document can be found here.

Session info

sessionInfo()

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pander_0.5.1  ggplot2_1.0.1 magrittr_1.5  dplyr_0.4.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5      digest_0.6.8     assertthat_0.1   MASS_7.3-40     
##  [5] grid_3.2.0       plyr_1.8.2       gtable_0.1.2     DBI_0.3.1       
##  [9] formatR_1.2      scales_0.2.4     evaluate_0.7     lazyeval_0.1.10 
## [13] reshape2_1.4.1   rmarkdown_0.5.1  labeling_0.3     proto_0.3-10    
## [17] tools_3.2.0      stringr_0.6.2    munsell_0.4.2    parallel_3.2.0  
## [21] colorspace_1.2-6 htmltools_0.2.6  knitr_1.9