1 Preface

This document is designated to investigate some practical issues of input approaches considering time costs in R.

We are not going to dive deep into R chasm but going to cover the most applicable cases such the fastest way to write data to a file in R environment.

It is recommended to have some R basic skills but not required. Each case is provided with a reproducible example, the useful links are added as well.

The function to read data are as follows (to be tested for performance outcome):

1.1 PC & OS features:

  • PC: i5, RAM 8 Gb
  • OS: Win7 64bit
  • R: 3.3.1
  • RStudio: 0.99.486

As the object to be tested we consider the dataframe as follows:

#set the dataframe size (number of rows) for the single run
df_size <- 1000000
#declare file variables and dataframe (n)
file_output <- "test_to_write.csv"
file_output2 <- "test_iotools_to_write.csv"
file_output3 <- "test_rds_to_write.rds"
file_output4 <- "test_feather_to_write.feather"
n<-0
make_df <- function(){
#to generate the main dataframe
set.seed (123)
#  one dimension character dataframe of 15 symbol length per cell
w <- as.data.frame(1:df_size)
#populate character dataframe 
w <- apply(w, 1, function(x){paste(sample(letters,15), collapse = "")})
#paste character dataframe into the main one
n <- data.frame(x=1:df_size,y=rnorm(1:df_size),z=rnorm(1:df_size),w=w,stringsAsFactors = FALSE)
#convert column 'w' from factor to character vector 
n$w <- as.character(n$w)
rm(w)
gc()

return(n)
}

n <- make_df()

2 Tesing criteria:

  1. time to write data;
  2. memory used (critical for ff package);
  3. access to data saved (whether outer users can use data saved).

2.1 WRITE.TABLE() - util package

We consider the speed of data writing by WRITE.TABLE() - write.csv, write.csv2 are wrappers of WRITE.TABLE (they are not being investigated):

2.1.1 base version of write.table() - no additional adjustments:

#create the dataframe to record test runs through all functions
test_results <- data.frame(package=character(), tool=character(), phase=character(), cores=numeric(), sys=numeric(), elapsed=numeric(), stringsAsFactors = FALSE)
#to record the test run
time_used <- system.time(write.table(n,file_output,row.names = FALSE))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="write.table", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package        tool phase cores  sys elapsed
## 1    util write.table  base  6.92 0.13     7.6

2.2 WRITE_DELIM() - readr package (read flat/tabular text files from disk (or a connection)

2.2.1 base version of write_delim() - no additional adjustments:

if(!require(readr)){
  library(readr)  
}
#to record the test run
time_used <- system.time(write_delim(n,file_output,delim=";"))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="readr", tool="write_delim", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package        tool phase cores  sys elapsed
## 1   readr write_delim  base 10.19 0.36   10.77

2.3 WRITE.CSV.RAW() - iotools package (I/O Tools for Streaming)

N.B.: File format used in iotools package is not compatible properly with the rest of reading function tested.

2.3.1 base version of WRITE.CSV.RAW() - no additional adjustments:

if(!require(iotools)){
  library(iotools)  
}
#to record the test run
  cat(noquote(paste(paste(names(n), collapse = ";"), "\n")),file = file_output2)
time_used <- system.time(write.csv.raw(n,file_output2,sep=";",append=TRUE)
)
#make a vector to add to dataframe
vector_towrite <- data.frame(package="iotools",tool="write.csv.raw",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package          tool phase cores  sys elapsed
## 1 iotools write.csv.raw  base  1.75 0.11    2.69

2.4 saveRDS - base package (installed with R on default):

This function is interesting to use because of the speed of writing files and the file size allocated (smaller than files saved with ‘write.table’, ‘write_delim’, ‘write_csv’, write.csv.sql,write.csv.raw).

The file is saved as binary of ASCII format (details - type “?saveRDS”“).

You are unlikely to have correct access to the data saved as ‘.rds’ with other reading functions. Therefore, that could be the critical issue for other users of you functions or business cycle.

2.4.1 base version of saveRDS() - no additional adjustments:

#to record the test run
time_used <- system.time(saveRDS(n,file_output3))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="base",tool="saveRDS",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package    tool phase cores  sys elapsed
## 1    base saveRDS  base  2.48 0.06     2.6

2.5 write_feather - feather package

Read and write feather files, a lightweight binary columnar data store designed for maximum speed. NB: Current purpose of feather package (here: “At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.”

2.5.1 base version of write_feather() - no additional adjustments:

#to record the test run
time_used <- system.time(write_feather(n,file_output4))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="feather",tool="write_feather",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package          tool phase cores  sys elapsed
## 1 feather write_feather  base  0.08 0.02    0.09

2.6 fwrite - data.table package (Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using

no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development)

2.6.1 base version of fwrite() - no additional adjustments:

if(!require(data.table)){
  library(data.table)  
}
#to record the test run
time_used <- system.time(fwrite(n,file_output,sep=";"))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fwrite",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##      package   tool phase cores  sys elapsed
## 1 data.table fwrite  base  0.41 0.04    0.15

3 Speed test of different sample sizes:

The goal of this test is to calculate AVG indicators for theoretical sample such as AVG time elapsed since the specific function is run.

The sample structure is identical to the dataframe used in the above-mentioned examples.

The user defines the number of trials (simulations).

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   515554  27.6   25086107 1339.8  48996304 2616.7
## Vcells 17439316 133.1  191998699 1464.9 171609952 1309.3

4 Graphic output of the test results:

The performance test is based upon the sample size of (1000, 5000, 10^{4}, 510^{4}, 10^{5}, 510^{5}, 10^{6}, 510^{6}, 10^{7}) with four columns (integer, numeric, numeric, character). The character is treated as not a factor.

## [1] "Samples < 10mio rows"

## [1] "All samples"

## [1] "all samples"

5 REAL LIFE EXAMPLE

## [1] "REAL START"
## [1] "Real sample size"
## [1] "rows = 1435200 cols = 32"
##                   tools_aggr     qty meanElapsed
## 1     data.table fwrite base 1435200        0.88
## 2 feather write_feather base 1435200        7.25
## 3 iotools write.csv.raw base 1435200        8.19
## 4          base saveRDS base 1435200       17.49
## 5     readr write_delim base 1435200       34.69
## 6      util write.table base 1435200       34.91

rm(real_sample)
gc()