This document is designated to investigate some practical issues of input approaches considering time costs in R.
We are not going to dive deep into R chasm but going to cover the most applicable cases such the fastest way to write data to a file in R environment.
It is recommended to have some R basic skills but not required. Each case is provided with a reproducible example, the useful links are added as well.
The function to read data are as follows (to be tested for performance outcome):
As the object to be tested we consider the dataframe as follows:
#set the dataframe size (number of rows) for the single run
df_size <- 1000000
#declare file variables and dataframe (n)
file_output <- "test_to_write.csv"
file_output2 <- "test_iotools_to_write.csv"
file_output3 <- "test_rds_to_write.rds"
file_output4 <- "test_feather_to_write.feather"
n<-0
make_df <- function(){
#to generate the main dataframe
set.seed (123)
# one dimension character dataframe of 15 symbol length per cell
w <- as.data.frame(1:df_size)
#populate character dataframe
w <- apply(w, 1, function(x){paste(sample(letters,15), collapse = "")})
#paste character dataframe into the main one
n <- data.frame(x=1:df_size,y=rnorm(1:df_size),z=rnorm(1:df_size),w=w,stringsAsFactors = FALSE)
#convert column 'w' from factor to character vector
n$w <- as.character(n$w)
rm(w)
gc()
return(n)
}
n <- make_df()
We consider the speed of data writing by WRITE.TABLE() - write.csv, write.csv2 are wrappers of WRITE.TABLE (they are not being investigated):
#create the dataframe to record test runs through all functions
test_results <- data.frame(package=character(), tool=character(), phase=character(), cores=numeric(), sys=numeric(), elapsed=numeric(), stringsAsFactors = FALSE)
#to record the test run
time_used <- system.time(write.table(n,file_output,row.names = FALSE))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="write.table", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
## package tool phase cores sys elapsed
## 1 util write.table base 6.92 0.13 7.6
if(!require(readr)){
library(readr)
}
#to record the test run
time_used <- system.time(write_delim(n,file_output,delim=";"))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="readr", tool="write_delim", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
## package tool phase cores sys elapsed
## 1 readr write_delim base 10.19 0.36 10.77
N.B.: File format used in iotools package is not compatible properly with the rest of reading function tested.
if(!require(iotools)){
library(iotools)
}
#to record the test run
cat(noquote(paste(paste(names(n), collapse = ";"), "\n")),file = file_output2)
time_used <- system.time(write.csv.raw(n,file_output2,sep=";",append=TRUE)
)
#make a vector to add to dataframe
vector_towrite <- data.frame(package="iotools",tool="write.csv.raw",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
## package tool phase cores sys elapsed
## 1 iotools write.csv.raw base 1.75 0.11 2.69
This function is interesting to use because of the speed of writing files and the file size allocated (smaller than files saved with ‘write.table’, ‘write_delim’, ‘write_csv’, write.csv.sql,write.csv.raw).
The file is saved as binary of ASCII format (details - type “?saveRDS”“).
You are unlikely to have correct access to the data saved as ‘.rds’ with other reading functions. Therefore, that could be the critical issue for other users of you functions or business cycle.
#to record the test run
time_used <- system.time(saveRDS(n,file_output3))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="base",tool="saveRDS",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
## package tool phase cores sys elapsed
## 1 base saveRDS base 2.48 0.06 2.6
Read and write feather files, a lightweight binary columnar data store designed for maximum speed. NB: Current purpose of feather package (here: “At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.”
#to record the test run
time_used <- system.time(write_feather(n,file_output4))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="feather",tool="write_feather",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
## package tool phase cores sys elapsed
## 1 feather write_feather base 0.08 0.02 0.09
no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development)
if(!require(data.table)){
library(data.table)
}
#to record the test run
time_used <- system.time(fwrite(n,file_output,sep=";"))
#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fwrite",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
## package tool phase cores sys elapsed
## 1 data.table fwrite base 0.41 0.04 0.15
The goal of this test is to calculate AVG indicators for theoretical sample such as AVG time elapsed since the specific function is run.
The sample structure is identical to the dataframe used in the above-mentioned examples.
The user defines the number of trials (simulations).
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 515554 27.6 25086107 1339.8 48996304 2616.7
## Vcells 17439316 133.1 191998699 1464.9 171609952 1309.3
The performance test is based upon the sample size of (1000, 5000, 10^{4}, 510^{4}, 10^{5}, 510^{5}, 10^{6}, 510^{6}, 10^{7}) with four columns (integer, numeric, numeric, character). The character is treated as not a factor.
## [1] "Samples < 10mio rows"
## [1] "All samples"
## [1] "all samples"
## [1] "REAL START"
## [1] "Real sample size"
## [1] "rows = 1435200 cols = 32"
## tools_aggr qty meanElapsed
## 1 data.table fwrite base 1435200 0.88
## 2 feather write_feather base 1435200 7.25
## 3 iotools write.csv.raw base 1435200 8.19
## 4 base saveRDS base 1435200 17.49
## 5 readr write_delim base 1435200 34.69
## 6 util write.table base 1435200 34.91
rm(real_sample)
gc()