Author: Jeff Johnston
2015-04-23 15:47:23
This was inspired by (and borrows heavily from) Hadley Wickham’s saveRDS and compression document.
R’s saveRDS and readRDS functions use gzip compression by default. saveRDS can also use other built-in methods, such as bzip2 and xz, through its compress argument. In addition, both functions accept a connection in place of a file name. Combined with the pipe function, this lets us delegate compression and decompression to command-line tools, so we can test a few methods that R does not implement, such as parallel gzip (pigz), snappy (snzip) and lz4.
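As a brief illustration of the built-in options, the sketch below writes the same object three ways: with the default gzip, with `compress = "xz"`, and through an explicit `bzfile` connection. The object `obj` and the temporary file names are made up for this example and are not part of the benchmark that follows.

```r
# Illustrative object; not the benchmark data used later in this document.
obj <- data.frame(a = 1:1000, b = sqrt(1:1000))

f_gz <- tempfile(fileext = ".rds")
f_xz <- tempfile(fileext = ".rds")
f_bz <- tempfile(fileext = ".rds")

saveRDS(obj, f_gz)                    # default: gzip
saveRDS(obj, f_xz, compress = "xz")   # alternate built-in method

# Supplying a connection: the connection, not saveRDS, decides the compression.
con <- bzfile(f_bz, "wb")
saveRDS(obj, con)
close(con)

# readRDS detects the built-in formats when given a file name.
all_equal <- identical(readRDS(f_gz), obj) &&
  identical(readRDS(f_xz), obj) &&
  identical(readRDS(f_bz), obj)

unlink(c(f_gz, f_xz, f_bz))
```

The connection form is the one the rest of this document builds on, since a pipe is just another kind of connection.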
If you are using a Mac with Homebrew, you can easily install these three extra compression tools:
brew install pigz
brew install snzip
brew install lz4
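Before running the benchmark, it may help to confirm from within R that the tools are actually on the PATH. The following is a small sketch using `Sys.which`; the variable names `tools`, `paths` and `missing_tools` are hypothetical.

```r
tools <- c("pigz", "snzip", "lz4")

# Sys.which returns the full path of each executable, or "" if it is not found.
paths <- Sys.which(tools)

missing_tools <- tools[!nzchar(paths)]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```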
library(dplyr)
library(magrittr)
library(ggplot2)
library(pander)
panderOptions("table.split.table", Inf)
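# Save `object` to `filename` through a connection created by `con_func`.
# func_type "builtin" means an R connection function such as file or gzfile;
# anything else is treated as one of the pipe helpers defined below.
saveRDS_wrapper <- function(object, filename, con_func, func_type, ...) {
  if(func_type == "builtin") {
    con <- con_func(filename)
  } else {
    # Forward ... (e.g. cores) so that writes use the same options as reads
    con <- con_func(filename, mode="write", ...)
  }
  on.exit(close(con))
  saveRDS(object, con)
}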
# Read an object from `filename` through a connection created by `con_func`.
readRDS_wrapper <- function(filename, con_func, func_type, ...) {
  if(func_type == "builtin") {
    con <- con_func(filename)
  } else {
    con <- con_func(filename, mode="read", ...)
  }
  on.exit(close(con))
  readRDS(con)
}
# Pipe helpers: each returns an open binary connection to a command-line tool.
# shQuote protects file names that contain spaces or shell metacharacters.
snappy_pipe <- function(filename, mode="read") {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | snzip -dc"), "rb")
  } else {
    con <- pipe(paste0("snzip > ", shQuote(filename)), "wb")
  }
  con
}
pigz_pipe <- function(filename, mode="read", cores=4) {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | pigz -dcp ", cores), "rb")
  } else {
    con <- pipe(paste0("pigz -p ", cores, " > ", shQuote(filename)), "wb")
  }
  con
}
lz4_pipe <- function(filename, mode="read") {
  if(mode == "read") {
    con <- pipe(paste0("cat ", shQuote(filename), " | lz4 -d"), "rb")
  } else {
    con <- pipe(paste0("lz4 -z > ", shQuote(filename)), "wb")
  }
  con
}
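# Time one save/load round trip and record the size of the file written.
roundtrip_timings <- function(object, con_func, compression_type, func_type, ...) {
  testfile <- tempfile()
  on.exit(unlink(testfile))  # remove the temporary file when done
  # Elapsed (wall-clock) seconds for each direction
  save_time <- system.time(saveRDS_wrapper(object, testfile, con_func, func_type=func_type, ...))[["elapsed"]]
  load_time <- system.time(x <- readRDS_wrapper(testfile, con_func, func_type=func_type, ...))[["elapsed"]]
  size <- file.info(testfile)$size / (1024) ^ 2  # bytes to MiB
  data_frame(compression_type = compression_type,
             save_time = save_time,
             load_time = load_time,
             file_size_mb = size)
}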
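The timing helper above does not check that the object read back matches what was written. A minimal sketch of such a check follows. It uses the system `gzip` tool as a stand-in, an assumption made so that the example does not depend on pigz, snzip or lz4 being installed. The names `gzip_pipe` and `roundtrip_ok` are hypothetical and not part of the original code.

```r
# Hypothetical pipe helper in the same style as the helpers above.
gzip_pipe <- function(filename, mode = "read") {
  if (mode == "read") {
    pipe(paste("gzip -dc <", shQuote(filename)), "rb")
  } else {
    pipe(paste("gzip -c >", shQuote(filename)), "wb")
  }
}

obj <- data.frame(x = runif(1000), z = sample(1:10, 1000, replace = TRUE))
f <- tempfile()
roundtrip_ok <- NA  # stays NA if gzip is not available

if (nzchar(Sys.which("gzip"))) {
  con <- gzip_pipe(f, "write")
  saveRDS(obj, con)
  close(con)  # closing waits for gzip to finish writing the file

  con <- gzip_pipe(f, "read")
  back <- readRDS(con)
  close(con)

  roundtrip_ok <- identical(obj, back)
  unlink(f)
}
```

The same pattern should apply to any of the pipe helpers: write, close, read back, and compare with `identical`.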
We’ll benchmark these various compression methods against using no compression with a large data frame (nearly 100MB uncompressed). Feel free to experiment with objects that are more representative of the data you normally use in R.
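The "nearly 100MB" figure can be checked by hand. The sketch assumes R's usual storage of 8 bytes per double and 4 bytes per integer, and ignores the small fixed overhead of the object itself. The data frame defined next has two double columns and one integer column of 5e6 rows each.

```r
n <- 5e6
bytes_per_row <- 8 + 8 + 4          # x (double), y (double), z (integer)
total_bytes <- n * bytes_per_row    # 1e8 bytes
total_mib <- total_bytes / 1024^2   # same unit as file_size_mb in this document
```

This comes to roughly 95.37, which agrees with the file_size_mb reported for the uncompressed case in the results table.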
n <- 5e6
x <- runif(n)
z <- sample(1:10, n, replace=TRUE)
o <- data.frame(x = x, y = x, z = z)
tmp <- tempfile()
saveRDS(o, tmp)
o <- readRDS(tmp)
times <- bind_rows(
  roundtrip_timings(o, file, compression_type="none", func_type="builtin"),
  roundtrip_timings(o, gzfile, compression_type="gzip", func_type="builtin"),
  roundtrip_timings(o, pigz_pipe, compression_type="pigz (4 cores)", func_type="pipe", cores=4),
  roundtrip_timings(o, snappy_pipe, compression_type="snappy", func_type="pipe"),
  roundtrip_timings(o, lz4_pipe, compression_type="lz4", func_type="pipe")
  # Disabled; to enable, add a comma after the lz4 line above:
  #roundtrip_timings(o, bzfile, compression_type="bzip", func_type="builtin"),
  #roundtrip_timings(o, xzfile, compression_type="xz", func_type="builtin")
)
raw_save <- times$save_time[1]
raw_load <- times$load_time[1]
raw_size <- times$file_size_mb[1]
times %<>% transform(save_slowdown = save_time / raw_save,
                     load_slowdown = load_time / raw_load,
                     size_reduction_percent = 100 * (raw_size - file_size_mb) / raw_size)
times %>% pander(caption="Results")
| compression_type | save_time | load_time | file_size_mb | save_slowdown | load_slowdown | size_reduction_percent |
|---|---|---|---|---|---|---|
| none | 0.22 | 0.165 | 95.37 | 1 | 1 | 0 |
| gzip | 10.54 | 0.425 | 53.84 | 47.92 | 2.576 | 43.54 |
| pigz (4 cores) | 2.829 | 0.322 | 53.8 | 12.86 | 1.952 | 43.59 |
| snappy | 0.443 | 0.146 | 73.91 | 2.014 | 0.8848 | 22.5 |
| lz4 | 0.581 | 0.242 | 76.62 | 2.641 | 1.467 | 19.66 |
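The derived columns are simple ratios against the uncompressed row. As a worked check, the sketch below recomputes them for the gzip row from the rounded values printed in the table, so the results match the table only to about the displayed precision.

```r
# Values copied from the "none" and "gzip" rows of the table above.
raw  <- c(save_time = 0.22,  load_time = 0.165, file_size_mb = 95.37)
gzip <- c(save_time = 10.54, load_time = 0.425, file_size_mb = 53.84)

save_slowdown <- gzip[["save_time"]] / raw[["save_time"]]
load_slowdown <- gzip[["load_time"]] / raw[["load_time"]]
size_reduction_percent <-
  100 * (raw[["file_size_mb"]] - gzip[["file_size_mb"]]) / raw[["file_size_mb"]]
```

A slowdown below 1, as in the snappy load_slowdown of 0.8848, means the compressed case was faster than the uncompressed one.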
To visualize these results, we will plot the relative performance of each compression method when compared to no compression:
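One way to draw such a plot is sketched below. It uses base graphics rather than ggplot2 to stay dependency-free, so it is not the plotting code used for the original figure. The slowdown values are copied from the results table, and the names `method_names` and `relative` are hypothetical.

```r
# Values copied from the results table (rounded as printed).
method_names <- c("none", "gzip", "pigz (4 cores)", "snappy", "lz4")
relative <- rbind(
  save_slowdown = c(1, 47.92, 12.86, 2.014, 2.641),
  load_slowdown = c(1, 2.576, 1.952, 0.8848, 1.467)
)
colnames(relative) <- method_names

# Grouped bars: one pair (save, load) per compression method.
barplot(relative, beside = TRUE, legend.text = TRUE,
        ylab = "Time relative to no compression",
        main = "Save and load slowdown by compression method")
abline(h = 1, lty = 2)  # baseline: same speed as no compression
```

Because gzip's save slowdown dominates the linear scale, a logarithmic axis may be worth trying when the smaller differences matter.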
When I first rendered this document in RStudio, I noticed that the results differ noticeably from those produced by rendering the same document in an R console session with rmarkdown::render. Only the no-compression load time appears to slow down; the load times for the compressed files do not. I re-rendered the document several times and the difference is consistent. I cannot explain this.
You can see the results when run under R console here.
The source RMarkdown file for this document can be found here.
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pander_0.5.1 ggplot2_1.0.1 magrittr_1.5 dplyr_0.4.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.5 digest_0.6.8 assertthat_0.1 MASS_7.3-40
## [5] grid_3.2.0 plyr_1.8.2 gtable_0.1.2 DBI_0.3.1
## [9] formatR_1.2 scales_0.2.4 evaluate_0.7 lazyeval_0.1.10
## [13] reshape2_1.4.1 rmarkdown_0.5.1 labeling_0.3 proto_0.3-10
## [17] tools_3.2.0 stringr_0.6.2 munsell_0.4.2 parallel_3.2.0
## [21] colorspace_1.2-6 htmltools_0.2.6 knitr_1.9