This document investigates practical aspects of input/output (I/O) in R, together with data manipulation hints, with attention to run time and memory (RAM) limits.
We are not going to dive deep into R, but we will cover the most common cases, such as the fastest way to read data from a file into the R environment.
Basic R skills are recommended but not required. Each case comes with a reproducible example, and useful links are added as well.
The functions for reading data, which will be tested for performance, are as follows:
A good description of the pros and cons of R (in comparison with Python) is available at https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis.
For inquisitive minds, the APL language could be an alternative to R in some ways: http://tryapl.org/. Admittedly, APL is not as widespread as R or Python for data science purposes.
There are many sources on the reasons to use R and on its pros and cons. Useful links about R:
briefly about R: https://www.r-project.org/about.html
visualization: http://shiny.rstudio.com/gallery/
The cornerstone of good R practice is an understanding of R's basic data structures and of the basics of functional programming in R.
In fact, most issues arise from misunderstanding or misapplying specific data structures.
A good description of data structures can be found here: http://adv-r.had.co.nz/Data-structures.html
Good notes about functional programming can be found here:
http://adv-r.had.co.nz/Functional-programming.html
The data structures considered here are vectors, matrices and arrays, data frames, and lists:
Vectors: atomic vectors and lists are R's one-dimensional data structures. They are the cornerstone of all other data structures in R.
Matrices and arrays: data structures for storing 2d and higher-dimensional data.
Data frames: the most important data structure for storing data in R. Data frames combine the behavior of lists and matrices, which makes them ideally suited to statistical data.
#make a vector
print("make a vector")
## [1] "make a vector"
v <- c(1,2,3,4,5)
str(v) #check the structure of the object
##  num [1:5] 1 2 3 4 5
class(v) #check the class of the object
## [1] "numeric"
is.atomic(v) #check that the object is atomic (all elements share one type)
## [1] TRUE
is.vector(v) #check that the object is a vector
## [1] TRUE
# a vector must be homogeneous!
#the next line does not raise an error: c() silently coerces every element to character
#v <- c(1,2,"sfgs",4,5)
Useful link with vector code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Vectors
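The homogeneity rule for atomic vectors can be checked directly. The sketch below is illustrative only (the names `v_mixed` and `v_back` are not from the original text); it shows that mixing types in `c()` coerces rather than fails.

```r
# Mixing numbers and a string: c() coerces every element to character.
v_mixed <- c(1, 2, "sfgs", 4, 5)
class(v_mixed)        # "character"

# Coercion follows the order logical < integer < double < character.
class(c(TRUE, 2L))    # "integer"
class(c(2L, 3.5))     # "numeric"

# Converting back turns the non-numeric string into NA (a warning is raised).
v_back <- suppressWarnings(as.numeric(v_mixed))
```

The practical consequence is that one stray string in a numeric column changes the type of the whole vector, which is a frequent source of surprises after reading data from files.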
#make a matrix & array
print("make a matrix")
## [1] "make a matrix"
m <- matrix(1:10,nrow=5,ncol=2)
str(m)
##  int [1:5, 1:2] 1 2 3 4 5 6 7 8 9 10
class(m) #R >= 4.0.0 prints "matrix" "array" here
## [1] "matrix"
is.atomic(m)
## [1] TRUE
is.matrix(m)
## [1] TRUE
print("make an array")
## [1] "make an array"
dim(m) <- c(2,5,1)
str(m)
##  int [1:2, 1:5, 1] 1 2 3 4 5 6 7 8 9 10
class(m)
## [1] "array"
is.atomic(m)
## [1] TRUE
is.array(m)
## [1] TRUE
#an array must be homogeneous!
#the next line converts all numbers into character!
m[1,3,1] <- "dfd"
Useful link with matrix & array code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Matrices-and-Arrays
#make a dataframe
print("make a dataframe")
## [1] "make a dataframe"
df <- data.frame(x=c(1:5),y=c(6:10),stringsAsFactors = FALSE)
str(df)
## 'data.frame':    5 obs. of  2 variables:
##  $ x: int  1 2 3 4 5
##  $ y: int  6 7 8 9 10
class(df)
## [1] "data.frame"
is.atomic(df)
## [1] FALSE
is.data.frame(df)
## [1] TRUE
#a dataframe can be heterogeneous, but all columns must have the same number of rows!
#the next line will generate an error
#df <- data.frame(x=c(1:5),y=c(7:10),stringsAsFactors = FALSE)
Useful link with dataframe code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Data-Frames
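The two data frame rules above (columns may differ in type, but not in length) can be sketched as follows. The names `df_ok` and `res` are illustrative, and `tryCatch` is used only so that the failing call does not stop the script.

```r
# Columns of different types are fine as long as their lengths match.
df_ok <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(df_ok, class)

# Columns of length 5 and 4 cannot be combined, so data.frame() signals an error.
# tryCatch() captures the error message instead of aborting.
res <- tryCatch(
  data.frame(x = 1:5, y = 7:10),
  error = function(e) conditionMessage(e)
)
res
```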
print("make a list")
## [1] "make a list"
lst <- list("sdfsdfs",1,1,c(2323,"dfsdf"),5.6, data.frame(x=c(1:5),y=c(6:10)),list(1,"34"))
str(lst)
## List of 7
##  $ : chr "sdfsdfs"
##  $ : num 1
##  $ : num 1
##  $ : chr [1:2] "2323" "dfsdf"
##  $ : num 5.6
##  $ :'data.frame':    5 obs. of  2 variables:
##   ..$ x: int [1:5] 1 2 3 4 5
##   ..$ y: int [1:5] 6 7 8 9 10
##  $ :List of 2
##   ..$ : num 1
##   ..$ : chr "34"
class(lst)
## [1] "list"
is.atomic(lst)
## [1] FALSE
is.list(lst)
## [1] TRUE
Useful link with list code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Lists
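Because a list can hold elements of any type, how an element is extracted matters. A minimal sketch with an illustrative list (`lst_demo` and its element names are assumptions, not part of the example above):

```r
lst_demo <- list(label = "sdfsdfs", value = 5.6, inner = list(1, "34"))

sub_list <- lst_demo["value"]     # single brackets keep the list wrapper
element  <- lst_demo[["value"]]   # double brackets return the element itself
nested   <- lst_demo$inner[[2]]   # "$" extracts by name, like [[ ]]

class(sub_list)   # "list"
class(element)    # "numeric"
```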
Vectorization is the greatest feature I have come across in R after writing many loops in VBA, Java or PL/SQL to process raw data. To compare vectorization with explicit loops, consider an example.
We have the vector c(1,2,4,1,4,3,5,3,1,0), which should be converted into c(1,2,4,4,4,4,5,5,5,5); that is, each element is replaced by the largest value seen so far (a running maximum). We can solve the problem with a loop or with vectorization.
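Before timing anything on a large vector, the small example can be checked by hand. The sketch below uses an illustrative helper, `running_max_loop`, which is a simplified stand-in and not the function benchmarked later; it is compared with base R's `cummax()`.

```r
x_small  <- c(1, 2, 4, 1, 4, 3, 5, 3, 1, 0)
expected <- c(1, 2, 4, 4, 4, 4, 5, 5, 5, 5)

# Explicit loop: carry the largest value seen so far forward.
running_max_loop <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (x[i] < x[i - 1]) x[i] <- x[i - 1]
  }
  x
}

# Vectorized: cummax() computes the same running maximum in one call.
identical(running_max_loop(x_small), expected)   # TRUE
identical(cummax(x_small), expected)             # TRUE
```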
Option 1. LOOP
#zoo provides na.locf(), which is used in Option 2; install it if it is missing
if(!require(zoo)){
  install.packages("zoo")
  library(zoo)
}
time_elapsed<-data.frame(phase=character(3), user=double(3), sys=double(3), elapsed=double(3), stringsAsFactors = FALSE)
# function to carry the running maximum forward through the vector
max_vector2 <- function(x){
  for (i in seq_along(x)){
    #check the bound first so that x[i+1] is never read past the end of the vector
    if((i+1) <= length(x) && x[i] > x[i+1]){
      x[i+1] <- x[i]
    }
  }
  return(x)
}
#set a vector size
sample_size <- 10000000
#populate a vector
x <- sample(1:sample_size, sample_size, replace=TRUE)
#check the unordered vector
head(x)
## [1] 4501222 5868746 8246053 1019033 7313071 2219213
#paste data into summary vector
time_elapsed[1,2:4] <- system.time(x2 <- max_vector2(x))[1:3]
time_elapsed[1,1] <- "LOOP"
#check the outcome of the procedure
head(x2)
## [1] 4501222 5868746 8246053 8246053 8246053 8246053
#clear the garbage
rm(max_vector2, x, x2)
gc()
Option 2. Vectorization
#populate a vector
x <- sample(1:sample_size, sample_size, replace=TRUE)
head(x)
## [1] 7459797 3274752 4412254 1975226 3400899 7814052
#LOOP via vectorization
#note: this expression only carries forward values that exceed their immediate predecessor,
#so in general it is not a running maximum (compare head(x2) here with the cummax() result below)
time_elapsed[2,2:4] <- system.time(x2 <- na.locf(x*NA^(!c(TRUE,x[-1]>x[-length(x)]))))[1:3]
time_elapsed[2,1] <- "vectorization"
head(x2)
## [1] 7459797 7459797 4412254 4412254 3400899 7814052
#the fastest approach
#check the unordered vector
print("the fastest approach to solve the problem.")
## [1] "the fastest approach to solve the problem."
head(x)
## [1] 7459797 3274752 4412254 1975226 3400899 7814052
#paste data into summary vector
time_elapsed[3,2:4] <- system.time(x2 <- cummax(x))[1:3]
time_elapsed[3,1] <- "vectorization optimal"
#check the outcome of the procedure
head(x2)
## [1] 7459797 7459797 7459797 7459797 7459797 7814052
#compare the timings
print("Comparison of all three approaches")
## [1] "Comparison of all three approaches"
print(paste("Sample size is",sample_size))
## [1] "Sample size is 1e+07"
time_elapsed
##                   phase  user  sys elapsed
## 1                  LOOP 29.14 0.00   29.33
## 2         vectorization  1.67 0.53    2.20
## 3 vectorization optimal  0.05 0.00    0.05
print("clear the garbage")
rm(x, x2, time_elapsed)
gc()
In this section, we explore widely used approaches to reading data into the R environment and try to choose the best one for specific purposes. As a source we consider csv files.
The following packages are used (they must be installed):
The test object is the following data frame:
#write.csv.raw() comes from iotools and write_feather() from feather
library(iotools)
library(feather)
file_output <- "test.csv"
file_output2 <- "test_iotools.csv"
file_output3 <- "test_rds.rds"
file_output4 <- "test_feather.feather"
#set the dataframe size (number of rows) for the single run
df_size <- 100000
make_df <- function(){
  #to generate the main dataframe
  set.seed (123)
  # one dimension character dataframe of 15 symbol length per cell
  w <- as.data.frame(1:df_size)
  #populate character dataframe
  w <- apply(w, 1, function(x){paste(sample(letters,15), collapse = "")})
  #paste character dataframe into the main one
  n <- data.frame(x=1:df_size,y=rnorm(df_size),z=rnorm(df_size),w=w)
  #convert column 'w' from factor to character vector
  n$w <- as.character(n$w)
  #to write the dataframe to csv file for further exercises
  if (file.exists(file_output)) file.remove(file_output)
  if (file.exists(file_output2)) file.remove(file_output2)
  #common file output
  write.csv2(n, file_output, row.names = FALSE)
  #file output for iotools: write the header line first (no trailing space), then append the data
  cat(paste(names(n), collapse = ";"), "\n", sep = "", file = file_output2)
  write.csv.raw(n,file_output2,sep=";",append=TRUE)
  #create file in RDS format
  saveRDS(n,file_output3)
  #create file in feather format
  write_feather(n,file_output4)
  rm(n,w)
  gc()
}
make_df()
Testing criteria:
We will focus mainly on item 1, because most of the packages keep the data in RAM, except for the ff package (which is RAM efficient). For item 3 we will investigate the output of the iotools I/O functions.
We consider the speed of reading data with read.table(); read.csv and read.csv2 are wrappers around read.table and are not investigated separately:
#create the dataframe to record test runs through all functions
#note: the 'cores' column stores the user CPU time (first element of system.time()), not a core count
test_results <- data.frame(package=character(), tool=character(), phase=character(), cores=numeric(), sys=numeric(), elapsed=numeric(), stringsAsFactors = FALSE)
#to record the test run
time_used <- system.time(check <- read.table(file_output, header=TRUE, sep=";", dec=",", stringsAsFactors = FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))
##   x           y          z               w
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2118444 113.2    3205452 171.2  2637877 140.9
## Vcells 2641824  20.2    9128236  69.7 80621400 615.1
#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="read.table", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package       tool phase cores  sys elapsed
## 1    util read.table  base  0.49 0.02    0.51
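The statement that read.csv2 is a wrapper around read.table can be checked on a small file. This is a sketch under stated assumptions: the temporary file and the `demo` data frame are illustrative, and the explicit arguments passed to read.table are the documented defaults of read.csv2.

```r
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(x = 1:3, y = c(0.5, 1.25, -2), w = c("a", "b", "c"),
                   stringsAsFactors = FALSE)
write.csv2(demo, tmp, row.names = FALSE)

# The wrapper with its defaults ...
via_wrapper <- read.csv2(tmp, stringsAsFactors = FALSE)

# ... and read.table() with the same settings spelled out.
via_table <- read.table(tmp, header = TRUE, sep = ";", quote = "\"", dec = ",",
                        fill = TRUE, comment.char = "", stringsAsFactors = FALSE)

same <- identical(via_wrapper, via_table)
unlink(tmp)
```

Since both calls go through the same parser, timing read.table alone is a reasonable proxy for the wrappers.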
#to record the test run
time_used <- system.time(check <- read.table(file_output, header=TRUE, sep=";", dec=",", stringsAsFactors = FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))
##   x          y          z               w
## 1 1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2 2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3 3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4 4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5 5 -0.4075815  0.8010883 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   419988  22.5   34326404 1833.3  40438471 2159.7
## Vcells 34135143 260.5  230731912 1760.4 283850101 2165.7
#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="read.table", phase="+stringsAsFactors", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package       tool             phase cores sys elapsed
## 1    util read.table +stringsAsFactors 96.83 0.6   104.6
#make the vector of column classes
var_types <- unlist(sapply(read.table(file_output, header=TRUE, sep=";", stringsAsFactors = FALSE, dec=","),class))
#to record the test run
time_used <- system.time(check <- read.table(file_output, header=TRUE, sep=";", stringsAsFactors = FALSE, dec=",", colClasses = var_types))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))
##   x          y          z               w
## 1 1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2 2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3 3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4 4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5 5 -0.4075815  0.8010883 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   420049  22.5   26388277 1409.3  40438471 2159.7
## Vcells 34135721 260.5  177253308 1352.4 283850101 2165.7
#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="read.table", phase="+colClasses", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package       tool       phase cores  sys elapsed
## 1    util read.table +colClasses 37.66 0.33   38.85
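The colClasses run above reads the whole file once just to learn the column classes. A common alternative, sketched here on an illustrative temporary file (the names `peek`, `col_types` and `full` are assumptions), is to infer the classes from the first rows only by using the `nrows` argument of read.table.

```r
tmp <- tempfile(fileext = ".csv")
write.csv2(data.frame(x = 1:1000, y = rnorm(1000), w = "abc"),
           tmp, row.names = FALSE)

# Infer the column classes from the first 100 rows only ...
peek <- read.table(tmp, header = TRUE, sep = ";", dec = ",", nrows = 100,
                   stringsAsFactors = FALSE)
col_types <- sapply(peek, class)

# ... then reuse them for the full read, so read.table() can skip type guessing.
full <- read.table(tmp, header = TRUE, sep = ";", dec = ",",
                   colClasses = col_types, stringsAsFactors = FALSE)
unlink(tmp)
```

The shortcut assumes the first rows are representative: if a column looks like integers in the sample but contains decimals or text further down, the full read will fail or mis-type it.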
if(!require(readr)){
  install.packages("readr")
  library(readr)
}
#to record the test run
time_used <- system.time(check <- read_csv2(file_output, col_names = TRUE,progress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))
## Source: local data frame [5 x 4]
##
##       x          y          z               w
##   <int>      <dbl>      <dbl>           <chr>
## 1     1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2     2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3     3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4     4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5     5 -0.4075815  0.8010883 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells   435194  23.3   16888496 902.0  40438471 2159.7
## Vcells 34164698 260.7  113442116 865.5 283850101 2165.7
#make a vector to add to dataframe
vector_towrite <- data.frame(package="readr", tool="read_csv2", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package      tool phase cores  sys elapsed
## 1   readr read_csv2  base  7.69 0.12     8.3
#to record the test run
time_used <- system.time(check <- read_csv2(file_output, col_types = list(col_double(), col_double(), col_double(), col_character()),progress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))
## Source: local data frame [5 x 4]
##
##       x          y          z               w
##   <dbl>      <dbl>      <dbl>           <chr>
## 1     1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2     2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3     3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4     4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5     5 -0.4075815  0.8010883 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   435219  23.3   13002364  694.5  40438471 2159.7
## Vcells 34165055 260.7  133266583 1016.8 283850101 2165.7
#make a vector to add to dataframe
vector_towrite <- data.frame(package="readr", tool="read_csv2", phase="+col_types", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package      tool      phase cores  sys elapsed
## 1   readr read_csv2 +col_types 18.87 0.22   19.11
It is not shown in tests because of extremely poor performance!
The feather package reads and writes feather files, a lightweight binary columnar data store designed for maximum speed. NB: on the current purpose of the feather package, its authors state: “At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.”
library(feather)
print(file_output4)
## [1] "test_feather.feather"
#to record the test run
#time_used <- system.time(check <- read_feather(file_output4))
#check data for the completeness and consistency (temporary variable)
#print(head(check,n=5))
#remove the temporary variable
#rm(check)
#gc()
#note: the timing call above is commented out, so time_used still holds the values of the
#previous readr run and the row printed below repeats that readr timing
#make a vector to add to dataframe
vector_towrite <- data.frame(package="feather",tool="read_feather",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package         tool phase cores  sys elapsed
## 1 feather read_feather  base 18.87 0.22   19.11
The data.table package is for fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). It offers a natural and flexible syntax, for faster development.
if(!require(data.table)){
  install.packages("data.table")
  library(data.table)
}
## Loading required package: data.table
#to record the test run
time_used <- system.time(check <- fread(file_output,dec=",",sep=";", showProgress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
#remove the temporary variable
rm(check)
gc()
#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fread",phase="base",cores=time_used[[1]],sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
#to record the test run
time_used <- system.time(check <- fread(file_output,dec=",",sep=";",stringsAsFactors = FALSE, showProgress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
##    x           y          z               w
## 1: 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2: 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3: 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4: 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5: 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2142007 114.4    3205452 171.2  2637877 140.9
## Vcells 2563670  19.6    7302584  55.8 80621353 615.1
#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fread",phase="+stringsAsFactors",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##      package  tool             phase cores sys elapsed
## 1 data.table fread +stringsAsFactors  0.19   0    0.19
#make the vector of column classes
var_types <- unlist(sapply(fread(file_output,dec=",",sep=";",stringsAsFactors = FALSE, showProgress=FALSE),class))
#to record the test run
time_used <- system.time(check <- fread(file_output,dec=",",sep=";",stringsAsFactors = FALSE,colClasses=var_types, showProgress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
##    x           y          z               w
## 1: 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2: 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3: 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4: 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5: 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2142579 114.5    3205452 171.2  2637877 140.9
## Vcells 2565230  19.6    7302584  55.8 80621353 615.1
#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fread",phase="+colClasses",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##      package  tool       phase cores sys elapsed
## 1 data.table fread +colClasses  0.19   0    0.19
N.B.: The file format used by the iotools package is not fully compatible with the other reading functions tested.
if(!require(iotools)){
  install.packages("iotools")
  library(iotools)
}
#to record the test run
time_used <- system.time(check <- read.csv.raw(file_output2,sep=";"))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
##   x           y          z               w
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2143877 114.5    3205452 171.2  2637877 140.9
## Vcells 2557315  19.6    7302584  55.8 80621353 615.1
#make a vector to add to dataframe
vector_towrite <- data.frame(package="iotools",tool="read.csv.raw",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package         tool phase cores sys elapsed
## 1 iotools read.csv.raw  base  0.09   0    0.14
#make the vector of column classes
var_types <- unlist(sapply(read.csv.raw(file_output2,sep=";"),class))
#to record the test run
time_used <- system.time(check <- read.csv.raw(file_output2,colClasses=var_types,sep=";"))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
##   x           y          z               w
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2143919 114.5    3205452 171.2  2637877 140.9
## Vcells 2557745  19.6    7302584  55.8 80621353 615.1
#make a vector to add to dataframe
vector_towrite <- data.frame(package="iotools",tool="read.csv.raw",phase="+colClasses",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package         tool       phase cores sys elapsed
## 1 iotools read.csv.raw +colClasses  0.09   0    0.09
‘readRDS’ is interesting because of its reading speed and the size of the saved file (smaller than files saved with ‘write.table’, ‘write_delim’, ‘write_csv’, ‘write.csv.sql’, ‘write.csv.raw’).
To use ‘readRDS’, the source file must have been saved in rds format (“*.rds”) with ‘saveRDS’ or a similar wrapper function. The file is saved in binary or ASCII format (for details, type “?saveRDS”).
Other reading functions are unlikely to read data saved as ‘.rds’ correctly. That could be a critical issue for other users of your functions or for the business cycle.
‘readRDS’ can be considered an upgraded ‘load’. It can be considered memory efficient compared with ‘load’ because it returns the saved object to the caller instead of restoring it into the workspace under its original name, which ‘load’ always does (for details see: http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/)
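The difference between the two mechanisms can be sketched with base R only. The object and file names below are illustrative; a separate environment is used for `load` so that nothing in the workspace is overwritten.

```r
obj <- data.frame(x = 1:3, w = c("a", "b", "c"), stringsAsFactors = FALSE)
rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

saveRDS(obj, rds_file)          # stores one object, without its name
save(obj, file = rdata_file)    # stores the object together with its name

# readRDS() returns the object, so the caller chooses the new name.
restored <- readRDS(rds_file)

# load() recreates the object under its saved name and returns that name.
env <- new.env()
loaded_names <- load(rdata_file, envir = env)

unlink(c(rds_file, rdata_file))
```

With `load`, an existing object of the same name in the target environment is silently replaced, which is why `readRDS` is often the safer choice inside functions.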
#to record the test run
time_used <- system.time(check <- readRDS(file_output3))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
##   x           y          z               w
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf
#remove the temporary variable
rm(check)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2143891 114.5    3205452 171.2  2637877 140.9
## Vcells 2557650  19.6    7302584  55.8 80621353 615.1
#make a vector to add to dataframe
vector_towrite <- data.frame(package="base",tool="readRDS",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)
##   package    tool phase cores sys elapsed
## 1    base readRDS  base  0.06   0    0.08
The goal of this test is to calculate average indicators for a theoretical sample, such as the average time elapsed when a specific function is run.
The sample structure is identical to the data frame used in the examples above.
The user defines the number of trials (simulations).
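The averaging idea can be sketched as a small helper. This is not the code used to produce the results reported here; `mean_elapsed` is a hypothetical function that runs a reader several times on the same file and returns the mean elapsed time in seconds.

```r
# Hypothetical helper: call reader(path) n_trials times and
# return the mean elapsed time (seconds) reported by system.time().
mean_elapsed <- function(reader, path, n_trials = 5) {
  times <- vapply(seq_len(n_trials), function(i) {
    system.time(reader(path))[["elapsed"]]
  }, numeric(1))
  mean(times)
}

# Usage on a small illustrative file.
tmp <- tempfile(fileext = ".csv")
write.csv2(data.frame(x = 1:1000, y = rnorm(1000)), tmp, row.names = FALSE)
avg_read_csv2 <- mean_elapsed(read.csv2, tmp, n_trials = 3)
unlink(tmp)
```

Averaging over several trials smooths out effects such as disk caching and garbage collection, which can dominate a single measurement on small files.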
if(!require(ggplot2)){
  install.packages("ggplot2")
  library(ggplot2)
}
if(!require(scales)){
  install.packages("scales")
  library(scales)
}
The performance test is based on sample sizes of (1000, 5000, 10^4, 5×10^4, 10^5, 5×10^5, 10^6, 5×10^6, 10^7) rows with four columns (integer, numeric, numeric, character). The character column is not converted to a factor.
## [1] "Samples < 10mio rows"
## [1] "All samples"
## [1] "all samples, no read.table"
## [1] "start"
## [1] "REAL START"
## [1] "Real sample size"
## [1] "rows = 267745 cols = 151"
## [1] TRUE
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 12191301 651.1   39765694 2123.8  42141368 2250.6
## Vcells 91033827 694.6  301245044 2298.4 467287108 3565.2
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
##                      tools_aggr    qty meanElapsed
## 1             read_feather base 267745        0.89
## 2                  readRDS base 267745        2.56
## 3                read_csv2 base 267745        3.23
## 4       fread +stringsAsFactors 267745        3.93
## 5             fread +colClasses 267745        4.15
## 6                    fread base 267745        4.78
## 7      read.csv.raw +colClasses 267745        5.30
## 8             read.csv.raw base 267745        6.08
## 9        read.table +colClasses 260878       13.73
## 10 read.table +stringsAsFactors 260878       16.04
## 11              read.table base 260878       17.50