1 Preface

This document investigates practical issues of input/output (I/O) approaches in R, together with data manipulation hints, with regard to time cost and memory (RAM) limitations.

We are not going to dive deep into R, but we will cover the most applicable cases, such as the fastest way to read data from a file into the R environment.

Basic R skills are recommended but not required. Each case comes with a reproducible example, and useful links are added as well.

The functions for reading data are as follows (each is tested for performance):

read.table();
read_csv2();
read.csv2.sql();
read_feather();
fread();
read.csv.raw();
readRDS().

1.1 PC & OS features:

PC: i5, RAM 8 Gb
OS: Win7 64bit
R: 3.3.1
RStudio: 0.99.902

2 Brief review of R

2.1 Advantages of R

vectorization (No loops) - amazing reduction of time costs
very flexible data management
simple scripts for hard analytical cases
reproducibility
universe of packages (at least 90% of all problems are solved)
free software
open source
large & growing community
sound graphic capabilities

2.2 Disadvantages of R

hard for novices to start programming
difficult to find a single source that describes both the major features and advanced programming
RAM is easily exhausted
solving the remaining 10% of problems can be a real inferno
lack of sufficient examples for packages on CRAN
the many tricks are easy to forget

For a good description of the pros and cons of R (in comparison with Python), see https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis.

For inquisitive minds, here is a link to the APL language, which could be an alternative to R in some ways: http://tryapl.org/. Admittedly, APL is not as widespread as R or Python for data science purposes.

There are many sources on the reasons to use R and on its pros and cons. Useful links about R:

briefly about R: https://www.r-project.org/about.html
visualization: http://shiny.rstudio.com/gallery/

2.3 Basic data structures in R

The cornerstone of good R practice is an understanding of R's basic data structures and of the basics of functional programming in R.

In fact, most issues arise from misunderstanding or misapplying specific data structures.

A good description of data structures can be found here: http://adv-r.had.co.nz/Data-structures.html

Good notes about functional programming can be found here:
http://adv-r.had.co.nz/Functional-programming.html

The description linked above covers, among others, the following elements:

Vectors introduces you to atomic vectors and lists, R's 1d data structures. They are the cornerstone of all other data structures in R.
Matrices and arrays introduces matrices and arrays, data structures for storing 2d and higher dimensional data.
Data frames teaches you about the data frame, the most important data structure for storing data in R. Data frames combine the behavior of lists and matrices to make a structure ideally suited for the needs of statistical data.

2.3.1 Examples of data structures:

#make a vector
print ("make a vector")

## [1] "make a vector"

v <- c(1,2,3,4,5)
str(v) #check the structure of the object

##  num [1:5] 1 2 3 4 5

class(v) #check the class of the object

## [1] "numeric"

is.atomic(v) #check whether the object is an atomic vector (all elements of one type)

## [1] TRUE

is.vector(v) #check that the object is a vector

## [1] TRUE

# vector must be homogeneous!!!
#the next script will not generate an error: all elements are silently coerced to character
#v <- c(1,2,"sfgs",4,5)

Useful link with vector code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Vectors
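Mixing types inside c() does not fail: R coerces every element to the most flexible type present. The following is a small sketch of that rule with made-up values; the object names are chosen for illustration only.

```r
# coercion order: logical < integer < double < character (most flexible wins)
mixed_num <- c(1L, 2.5, TRUE)        # TRUE becomes 1, the result is double
mixed_chr <- c(1, 2, "sfgs", 4, 5)   # every number becomes a string

typeof(mixed_num)
typeof(mixed_chr)
print(mixed_chr)

# a list keeps each element's own type instead of coercing
kept <- list(1, "sfgs", TRUE)
sapply(kept, typeof)
```

If elements of different types must be kept together, a list (or a data frame with one type per column) is the structure to use.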

#make a matrix & array 
print ("make a matrix")

## [1] "make a matrix"

m <- matrix(1:10,nrow=5,ncol=2)
str(m)

##  int [1:5, 1:2] 1 2 3 4 5 6 7 8 9 10

class(m)

## [1] "matrix"

is.atomic(m)

## [1] TRUE

is.matrix(m)

## [1] TRUE

print("make an array")

## [1] "make an array"

dim(m) <- c(2,5,1)
str(m)

##  int [1:2, 1:5, 1] 1 2 3 4 5 6 7 8 9 10

class(m)

## [1] "array"

is.atomic(m)

## [1] TRUE

is.array(m)

## [1] TRUE

#array must be homogeneous!!!
#the next script will convert all numbers into character!
m[1,3,1] <- "dfd"

Useful link with matrix&array code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Matrices-and-Arrays

#make a dataframe
print("make a dataframe")

## [1] "make a dataframe"

df <- data.frame(x=c(1:5),y=c(6:10),stringsAsFactors = FALSE)
str(df)

## 'data.frame':    5 obs. of  2 variables:
##  $ x: int  1 2 3 4 5
##  $ y: int  6 7 8 9 10

class(df)

## [1] "data.frame"

is.atomic(df)

## [1] FALSE

is.data.frame(df)

## [1] TRUE

#dataframe could be heterogeneous but each column must contain equal number of rows!
#the next script will generate an error
#df <- data.frame(x=c(1:5),y=c(7:10),stringsAsFactors = FALSE)

Useful link with dataframe code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Data-Frames
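The remark that data frames combine the behavior of lists and matrices can be checked directly. The sketch below uses a small made-up data frame (df_demo is a name chosen for this illustration).

```r
df_demo <- data.frame(x = 1:3, w = c("a", "b", "c"), stringsAsFactors = FALSE)

# list behavior: a data frame is a list of equal-length columns
is.list(df_demo)
length(df_demo)      # number of columns, not rows
df_demo[["w"]]       # extract one column as a plain vector

# matrix behavior: two dimensions and [row, column] indexing
dim(df_demo)
df_demo[2, "x"]
```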

print ("make a list")

## [1] "make a list"

lst <- list("sdfsdfs",1,1,c(2323,"dfsdf"),5.6, data.frame(x=c(1:5),y=c(6:10)),list(1,"34")) 
str(lst)

## List of 7
##  $ : chr "sdfsdfs"
##  $ : num 1
##  $ : num 1
##  $ : chr [1:2] "2323" "dfsdf"
##  $ : num 5.6
##  $ :'data.frame':    5 obs. of  2 variables:
##   ..$ x: int [1:5] 1 2 3 4 5
##   ..$ y: int [1:5] 6 7 8 9 10
##  $ :List of 2
##   ..$ : num 1
##   ..$ : chr "34"

class(lst)

## [1] "list"

is.atomic(lst)

## [1] FALSE

is.list(lst)

## [1] TRUE

Useful link with list code examples: http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-Lists

2.3.2 Example of vectorization

Vectorization is the greatest feature I have come across in R after dealing with many loops in VBA, Java or PL/SQL to process raw data. To compare the effectiveness of vectorization with that of loops, consider an example.

We have the vector c(1,2,4,1,4,3,5,3,1,0), which should be converted into the output c(1,2,4,4,4,4,5,5,5,5), that is, each element is replaced by the largest value seen so far. To solve the problem we could use either a loop or vectorization.

Option 1. LOOP

if(!require(zoo)){
  library(zoo)  
}

time_elapsed<-data.frame(phase=character(3), user=double(3), sys=double(3), elapsed=double(3), stringsAsFactors = FALSE)
# function to replace each element with the running maximum
max_vector2 <- function(x){
  #stop at length(x)-1 so that x[i+1] always exists
  for (i in seq_len(length(x) - 1)){
    if(x[i] > x[i+1]){
      x[i+1] <- x[i]
    }
  }
  return(x)
}
#set a vector size
sample_size <- 10000000
#populate a vector
x <- sample(1:sample_size, sample_size, replace=TRUE)
#check the unordered vector
head(x)

## [1] 4501222 5868746 8246053 1019033 7313071 2219213

#paste data into summary vector
time_elapsed[1,2:4] <- system.time(x2 <- max_vector2(x))[1:3]
time_elapsed[1,1] <- as.character("LOOP")

#check the outcome of the procedure
head(x2)

## [1] 4501222 5868746 8246053 8246053 8246053 8246053

#clear the garbage
rm(max_vector2, x, x2)
gc()

Option 2. Vectorization

#populate a vector
x <- sample(1:sample_size, sample_size, replace=TRUE)
head(x)

## [1] 7459797 3274752 4412254 1975226 3400899 7814052

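#vectorized attempt: values not greater than their predecessor are set to NA and filled forward by na.locf()
#NB: this carries forward the last local increase, not the running maximum (compare the output below with cummax)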
time_elapsed[2,2:4] <-system.time(x2 <- na.locf(x*NA^(!c(TRUE,x[-1]>x[-length(x)]))))[1:3]
time_elapsed[2,1] <- "vectorization"           
head(x2)

## [1] 7459797 7459797 4412254 4412254 3400899 7814052

#the fastest approach 
#check the unordered vector
print("the fastest approach to solve the problem.")

## [1] "the fastest approach to solve the problem."

head(x)

## [1] 7459797 3274752 4412254 1975226 3400899 7814052

#paste data into summary vector
time_elapsed[3,2:4] <- system.time(x2 <- cummax(x))[1:3]
time_elapsed[3,1] <- "vectorization optimal"    
#check the outcome of the procedure
head(x2)

## [1] 7459797 7459797 7459797 7459797 7459797 7814052

#compare all three approaches
print("Comparison of all three approaches")

## [1] "Comparison of all three approaches"

print(paste("Sample size is",sample_size))

## [1] "Sample size is 1e+07"

time_elapsed

##                   phase  user  sys elapsed
## 1                  LOOP 29.14 0.00   29.33
## 2         vectorization  1.67 0.53    2.20
## 3 vectorization optimal  0.05 0.00    0.05

print("clear the garbage")
rm(x, x2, time_elapsed)
gc()
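The three approaches are not interchangeable, which a small check makes visible. The sketch below uses base R only: running_max_loop() and locf() are hypothetical helpers written for this illustration, and locf() merely imitates zoo's na.locf() for vectors whose first element is not NA.

```r
# hypothetical helper: explicit loop that carries the largest value forward
running_max_loop <- function(x) {
  for (i in seq_len(length(x) - 1)) {
    if (x[i] > x[i + 1]) x[i + 1] <- x[i]
  }
  x
}

# hypothetical helper: fill each NA with the last non-NA value before it
locf <- function(v) v[cummax(seq_along(v) * (!is.na(v)))]

# the expression from Option 2, with locf() standing in for na.locf()
masked_fill <- function(x) {
  keep <- c(TRUE, x[-1] > x[-length(x)])
  locf(x * NA^(!keep))
}

x1 <- c(1, 2, 4, 1, 4, 3, 5, 3, 1, 0)   # the vector from the text
x2 <- c(5, 1, 2, 1)                     # a rise that stays below the maximum

identical(running_max_loop(x1), cummax(x1))
identical(masked_fill(x1), cummax(x1))
identical(running_max_loop(x2), cummax(x2))
masked_fill(x2)   # differs from cummax(x2)
cummax(x2)
```

On the ten-element vector all three agree, but on x2 the masked fill returns 5 5 2 2 while cummax() returns 5 5 5 5. Only the loop and cummax() compute the running maximum in general.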

3 Reading Files

In this section, we explore widely used approaches to reading data into the R environment and try to choose the best one for specific purposes. As a source we consider csv files.

The following packages are used (they must be installed):

data.table
sqldf
readr
iotools
feather
ff (big memory -> see later)

As the object to be tested we consider the dataframe as follows:

file_output <- "test.csv"
file_output2 <- "test_iotools.csv"
file_output3 <- "test_rds.rds"
file_output4 <- "test_feather.feather"


#set the dataframe size (number of rows) for the single run
df_size <- 100000

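#write.csv.raw() comes from iotools and write_feather() from feather
library(iotools)
library(feather)

make_df <- function(){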
#to generate the main dataframe
set.seed (123)
#  one dimension character dataframe of 15 symbol length per cell
w <- as.data.frame(1:df_size)
#populate character dataframe 
w <- apply(w, 1, function(x){paste(sample(letters,15), collapse = "")})
#paste character dataframe into the main one
n <- data.frame(x=1:df_size,y=rnorm(1:df_size),z=rnorm(1:df_size),w=w)
#convert column 'w' from factor to character vector 
n$w <- as.character(n$w)
#to write the dataframe to csv file for further exercises
if (file.exists(file_output)) file.remove(file_output)
if (file.exists(file_output2)) file.remove(file_output2)
#common file output
write.csv2(n, file_output, row.names = FALSE)
#file output for iotools: write the header line first (sep = "" avoids a trailing space after the last column name)
cat(paste(names(n), collapse = ";"), "\n", sep = "", file = file_output2)
write.csv.raw(n,file_output2,sep=";",append=TRUE)
#create file in RDS format
saveRDS(n,file_output3)
#create file in feather format
write_feather(n,file_output4)


rm(n,w)
gc()
}

make_df()

Testing criteria:

time to read data;
memory used (critical for the ff package);
access to the saved data (whether outside users can use the saved data).

We will focus mainly on item 1, because most of the packages keep data in RAM, except for the RAM-efficient ff package. For item 3 we will look at the output of the iotools I/O functions.
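The first two criteria can be measured with base R alone. The sketch below is one possible way to do it, not the procedure used for the results in this document: time_reader() is a hypothetical helper, the temporary file and its contents are made up, and the reader is passed in as a function so that any of the tools tested later could be plugged in.

```r
# hypothetical helper: run one reader, return timing and object size as one row
time_reader <- function(package, tool, phase, read_fun) {
  timing <- system.time(result <- read_fun())
  data.frame(package = package, tool = tool, phase = phase,
             user = timing[["user.self"]],
             sys = timing[["sys.self"]],
             elapsed = timing[["elapsed"]],
             size_mb = as.numeric(object.size(result)) / 1024^2,
             stringsAsFactors = FALSE)
}

# small made-up file in the same "csv2" layout (sep = ";", dec = ",")
tmp <- tempfile(fileext = ".csv")
write.csv2(data.frame(x = 1:1000, y = rnorm(1000)), tmp, row.names = FALSE)

one_run <- time_reader("utils", "read.csv2", "base", function() read.csv2(tmp))
print(one_run)
unlink(tmp)
```

Rows returned by such a helper can be stacked with rbind() to build a results table.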

3.1 READING DATA

3.1.1 READ.TABLE() - utils package

We consider the speed of reading data with READ.TABLE(). read.csv and read.csv2 are wrappers around READ.TABLE, so they are not investigated separately:

3.1.1.1 base version of read.table() - no additional adjustments:

#create the dataframe to record test runs through all functions
#NB: the 'cores' column stores the user CPU time reported by system.time()
test_results <- data.frame(package=character(), tool=character(), phase=character(), cores=numeric(), sys=numeric(), elapsed=numeric(), stringsAsFactors = FALSE)
#to record the test run
time_used <- system.time(check <- read.table(file_output, header=TRUE, sep=";", dec=","))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))

##   x           y          z               w
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2118444 113.2    3205452 171.2  2637877 140.9
## Vcells 2641824  20.2    9128236  69.7 80621400 615.1

#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="read.table", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package       tool phase cores  sys elapsed
## 1    util read.table  base  0.49 0.02    0.51

3.1.1.2 setting ‘stringsAsFactors=FALSE’ - to switch off autoconversion of characters into factors while reading the file.

#to record the test run
time_used <- system.time(check <- read.table(file_output, header=TRUE, sep=";", dec=",", stringsAsFactors = FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))

##   x          y          z               w
## 1 1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2 2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3 3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4 4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5 5 -0.4075815  0.8010883 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   419988  22.5   34326404 1833.3  40438471 2159.7
## Vcells 34135143 260.5  230731912 1760.4 283850101 2165.7

#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="read.table", phase="+stringsAsFactors", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package       tool             phase cores sys elapsed
## 1    util read.table +stringsAsFactors 96.83 0.6   104.6

3.1.1.3 setting ‘colClasses’ - to predefine types of columns to be read.

#make the vector of column classes
var_types <- unlist(sapply(read.table(file_output, header=TRUE, sep=";", stringsAsFactors = FALSE, dec=","),class))
#to record the test run           
time_used <- system.time(check <- read.table(file_output, header=TRUE, sep=";", stringsAsFactors = FALSE, dec=",", colClasses = var_types))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))

##   x          y          z               w
## 1 1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2 2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3 3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4 4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5 5 -0.4075815  0.8010883 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   420049  22.5   26388277 1409.3  40438471 2159.7
## Vcells 34135721 260.5  177253308 1352.4 283850101 2165.7

#make a vector to add to dataframe
vector_towrite <- data.frame(package="util", tool="read.table", phase="+colClasses", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package       tool       phase cores  sys elapsed
## 1    util read.table +colClasses 37.66 0.33   38.85
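Besides colClasses, the help page of read.table mentions nrows and comment.char as settings that can reduce reading time and memory use. The sketch below shows how they might be combined; the temporary file, its contents and the object names are made up for illustration, and no timing claim is attached to it.

```r
# small made-up file in the same layout as the test file (sep = ";", dec = ",")
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(x = 1:100, y = rnorm(100), w = "abc", stringsAsFactors = FALSE)
write.csv2(demo, tmp, row.names = FALSE)

# plain call, types are guessed from the data
plain <- read.table(tmp, header = TRUE, sep = ";", dec = ",",
                    stringsAsFactors = FALSE)

# tuned call: column types given, row count given, comment scanning switched off
tuned <- read.table(tmp, header = TRUE, sep = ";", dec = ",",
                    stringsAsFactors = FALSE,
                    colClasses = c("integer", "numeric", "character"),
                    nrows = 100, comment.char = "")

str(tuned)
all.equal(plain, tuned)
unlink(tmp)
```

Both calls return the same data; the tuned one only spares read.table the work of guessing.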

3.1.2 READ_CSV2() - readr package (read flat/tabular text files from disk or a connection)

3.1.2.1 base version of read_csv2() - no additional adjustments:

if(!require(readr)){
  library(readr)  
}

#to record the test run
time_used <- system.time(check <- read_csv2(file_output, col_names = TRUE,progress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))

## Source: local data frame [5 x 4]
## 
##       x          y          z               w
##   <int>      <dbl>      <dbl>           <chr>
## 1     1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2     2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3     3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4     4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5     5 -0.4075815  0.8010883 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells   435194  23.3   16888496 902.0  40438471 2159.7
## Vcells 34164698 260.7  113442116 865.5 283850101 2165.7

#make a vector to add to dataframe
vector_towrite <- data.frame(package="readr", tool="read_csv2", phase="base", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package      tool phase cores  sys elapsed
## 1   readr read_csv2  base  7.69 0.12     8.3

3.1.2.2 setting ‘col_types’ - to predefine types of columns to be read (for details see https://cran.r-project.org/web/packages/readr/vignettes/column-types.html):

#to record the test run
time_used <- system.time(check <- read_csv2(file_output, col_types = list(col_double(), col_double(), col_double(), col_character()),progress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check, n=5))

## Source: local data frame [5 x 4]
## 
##       x          y          z               w
##   <dbl>      <dbl>      <dbl>           <chr>
## 1     1  0.5208814 -0.3584017 htjuwakqxzpgrsb
## 2     2  0.9885139 -0.2763683 xgbhusnmrlqiodz
## 3     3  0.2013788  0.9409982 zwqsakpefdcgryb
## 4     4  0.2257295  2.0055577 dflgsaipcjzbkxe
## 5     5 -0.4075815  0.8010883 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   435219  23.3   13002364  694.5  40438471 2159.7
## Vcells 34165055 260.7  133266583 1016.8 283850101 2165.7

#make a vector to add to dataframe
vector_towrite <- data.frame(package="readr", tool="read_csv2", phase="+col_types", cores=time_used[[1]], sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <-rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package      tool      phase cores  sys elapsed
## 1   readr read_csv2 +col_types 18.87 0.22   19.11

3.1.3 READ.CSV2.SQL() - sqldf package

3.1.3.1 READ.CSV2.SQL() reads data from a file via SQL scripting.

It is not shown in the tests because of its extremely poor performance!

3.1.4 READ_FEATHER - feather package

Read and write feather files, a lightweight binary columnar data store designed for maximum speed. NB: the current purpose of the feather package, as stated by its authors: “At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.”

3.1.4.1 base version of read_feather() - no additional adjustments:

library(feather)
print(file_output4)

## [1] "test_feather.feather"

#to record the test run
time_used <- system.time(check <- read_feather(file_output4))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
#remove the temporary variable
rm(check)
gc()
#make a vector to add to dataframe
vector_towrite <- data.frame(package="feather",tool="read_feather",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

3.1.5 FREAD() - data.table package

The package is for fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development.

3.1.5.1 base version of fread() - no additional adjustments:

if(!require(data.table)){
  library(data.table)  
}

## Loading required package: data.table

#to record the test run
time_used <- system.time(check <- fread(file_output,dec=",",sep=";", showProgress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))
#remove the temporary variable
rm(check)
gc()
#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fread",phase="base",cores=time_used[[1]],sys=time_used[[2]], elapsed=time_used[[3]], stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

3.1.5.2 setting ‘stringsAsFactors=FALSE’ - to switch off autoconversion of characters into factors while reading the file.

#to record the test run
time_used <- system.time(check <- fread(file_output,dec=",",sep=";",stringsAsFactors = FALSE, showProgress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))

##    x           y          z               w
## 1: 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2: 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3: 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4: 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5: 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2142007 114.4    3205452 171.2  2637877 140.9
## Vcells 2563670  19.6    7302584  55.8 80621353 615.1

#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fread",phase="+stringsAsFactors",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##      package  tool             phase cores sys elapsed
## 1 data.table fread +stringsAsFactors  0.19   0    0.19

3.1.5.3 setting ‘colClasses’ - to predefine types of columns to be read.

#make the vector of column classes
var_types <- unlist(sapply(fread(file_output,dec=",",sep=";",stringsAsFactors = FALSE, showProgress=FALSE),class))
#to record the test run
time_used <- system.time(check <- fread(file_output,dec=",",sep=";",stringsAsFactors = FALSE,colClasses=var_types, showProgress=FALSE))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))

##    x           y          z               w
## 1: 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2: 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3: 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4: 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5: 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2142579 114.5    3205452 171.2  2637877 140.9
## Vcells 2565230  19.6    7302584  55.8 80621353 615.1

#make a vector to add to dataframe
vector_towrite <- data.frame(package="data.table",tool="fread",phase="+colClasses",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##      package  tool       phase cores sys elapsed
## 1 data.table fread +colClasses  0.19   0    0.19

3.1.6 READ.CSV.RAW() - iotools package (I/O Tools for Streaming)

N.B.: The file format used by the iotools package is not fully compatible with the other reading functions tested.

3.1.6.1 base version of READ.CSV.RAW() - no additional adjustments:

if(!require(iotools)){
  library(iotools)  
}

#to record the test run
time_used <- system.time(check <- read.csv.raw(file_output2,sep=";"))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))

##   x           y          z              w 
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2143877 114.5    3205452 171.2  2637877 140.9
## Vcells 2557315  19.6    7302584  55.8 80621353 615.1

#make a vector to add to dataframe
vector_towrite <- data.frame(package="iotools",tool="read.csv.raw",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package         tool phase cores sys elapsed
## 1 iotools read.csv.raw  base  0.09   0    0.14

3.1.6.2 setting ‘colClasses’ - to predefine types of columns to be read.

#make the vector of column classes
var_types <- unlist(sapply(read.csv.raw(file_output2,sep=";"),class))
#to record the test run
time_used <- system.time(check <- read.csv.raw(file_output2,colClasses=var_types,sep=";"))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))

##   x           y          z              w 
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2143919 114.5    3205452 171.2  2637877 140.9
## Vcells 2557745  19.6    7302584  55.8 80621353 615.1

#make a vector to add to dataframe
vector_towrite <- data.frame(package="iotools",tool="read.csv.raw",phase="+colClasses",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package         tool       phase cores sys elapsed
## 1 iotools read.csv.raw +colClasses  0.09   0    0.09

3.1.7 readRDS - base package (installed with R by default):

This function is interesting because of its reading speed and the size of the files it works with (smaller than files saved with ‘write.table’, ‘write_delim’, ‘write_csv’, write.csv.sql, write.csv.raw).
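The file size claim is easy to check on a small scale. The sketch below writes the same made-up data frame once as a csv file and once as an rds file and compares the sizes on disk; the data, the file names and the row count are assumptions for illustration, and the ratio will differ for other data.

```r
set.seed(1)
n <- 2000
demo <- data.frame(
  x = 1:n,
  y = rnorm(n),
  w = vapply(seq_len(n),
             function(i) paste(sample(letters, 15), collapse = ""),
             character(1)),
  stringsAsFactors = FALSE
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
write.csv2(demo, csv_file, row.names = FALSE)
saveRDS(demo, rds_file)   # compressed by default

# size on disk in bytes
sizes <- file.size(c(csv_file, rds_file))
names(sizes) <- c("csv", "rds")
print(sizes)

# the round trip returns the same object
same <- identical(readRDS(rds_file), demo)
print(same)
unlink(c(csv_file, rds_file))
```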

To use ‘readRDS’ the source file must be saved in rds format, “*.rds” (with ‘saveRDS’ or any similar wrapper function). The file is saved in binary or ASCII format (for details type “?saveRDS”).

You are unlikely to get correct access to data saved as ‘.rds’ with other reading functions. Therefore, that could be a critical issue for other users of your functions or business cycle.

‘readRDS’ could be considered an “upgraded” ‘load’. It could be considered more convenient and memory-friendly than ‘load’ because it returns only the single saved object, to be assigned to any name, whereas ‘load’ always restores every saved object under its original name (for details see: http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/)
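The difference between the two functions shows up in how the object comes back. A minimal sketch with a made-up data frame and temporary files:

```r
original <- data.frame(x = 1:3, w = c("a", "b", "c"), stringsAsFactors = FALSE)

rds_file <- tempfile(fileext = ".rds")
rda_file <- tempfile(fileext = ".RData")
saveRDS(original, rds_file)        # stores the object only, without its name
save(original, file = rda_file)    # stores the object together with its name

# readRDS() returns the object, so it can be assigned to any name
copy <- readRDS(rds_file)

# load() recreates the object under its saved name and returns that name
rm(original)
restored_names <- load(rda_file)
print(restored_names)

identical(copy, original)
unlink(c(rds_file, rda_file))
```

Because load() writes into the current environment by name, it can silently overwrite an existing object called original; readRDS() leaves that choice to the caller.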

3.1.7.1 base version of readRDS() - no additional adjustments:

#to record the test run
time_used <- system.time(check <- readRDS(file_output3))
#check data for the completeness and consistency (temporary variable)
print(head(check,n=5))

##   x           y          z               w
## 1 1  1.41505487  1.6431569 htjuwakqxzpgrsb
## 2 2  0.09534078  0.4065932 xgbhusnmrlqiodz
## 3 3  1.44062019  2.4036034 zwqsakpefdcgryb
## 4 4 -0.90839706 -0.4997547 dflgsaipcjzbkxe
## 5 5 -0.83125594 -0.1823739 rcjgzxqpohmuvaf

#remove the temporary variable
rm(check)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2143891 114.5    3205452 171.2  2637877 140.9
## Vcells 2557650  19.6    7302584  55.8 80621353 615.1

#make a vector to add to dataframe
vector_towrite <- data.frame(package="base",tool="readRDS",phase="base",cores=time_used[[1]],sys=time_used[[2]],elapsed=time_used[[3]],stringsAsFactors = FALSE)
#add to dataframe
test_results <- rbind(test_results, vector_towrite)
#print the vector with time elapsed
print(vector_towrite)

##   package    tool phase cores sys elapsed
## 1    base readRDS  base  0.06   0    0.08

4 Speed test of different sample sizes:

The goal of this test is to calculate average indicators for a theoretical sample, such as the average time elapsed when a specific function is run.

The sample structure is identical to the dataframe used in the examples above.

The user defines the number of trials (simulations).

if(!require(ggplot2)){
  library(ggplot2)  
}
if(!require(scales)){
  library(scales)  
}

4.1 Graphic output of the test results:

The performance test is based on sample sizes of 1000, 5000, 10^4, 5×10^4, 10^5, 5×10^5, 10^6, 5×10^6 and 10^7 rows with four columns (integer, numeric, numeric, character). The character column is not treated as a factor.
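The code of the speed test itself is not shown here. As a rough idea of its shape, the sketch below repeats a read several times for each sample size and averages the elapsed time. It is an assumption about how such a test could be organized, not the code behind the results: it uses only read.csv2 from base R, two small sizes, two columns and three trials.

```r
sizes <- c(1000, 5000)   # sample sizes (rows); kept small for illustration
trials <- 3              # number of simulations per size
results <- data.frame()

for (n in sizes) {
  # write a made-up sample of n rows to a temporary csv file
  tmp <- tempfile(fileext = ".csv")
  write.csv2(data.frame(x = 1:n, y = rnorm(n)), tmp, row.names = FALSE)
  for (trial in seq_len(trials)) {
    elapsed <- system.time(read.csv2(tmp))[["elapsed"]]
    results <- rbind(results,
                     data.frame(tool = "read.csv2", size = n, elapsed = elapsed,
                                stringsAsFactors = FALSE))
  }
  unlink(tmp)
}

# average elapsed time per tool and sample size
avg <- aggregate(elapsed ~ tool + size, data = results, FUN = mean)
print(avg)
```

Further readers can be added as extra rows with a different tool label, after which the averaged table can be plotted by size.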

[Plot: samples < 10 mio rows]

[Plot: all samples]

[Plot: all samples, no read.table]

5 REAL LIFE EXAMPLE

## [1] "Real sample size"

## [1] "rows =  267745 cols =  151"

## [1] TRUE

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 12191301 651.1   39765694 2123.8  42141368 2250.6
## Vcells 91033827 694.6  301245044 2298.4 467287108 3565.2

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

##                      tools_aggr    qty meanElapsed
## 1             read_feather base 267745        0.89
## 2                  readRDS base 267745        2.56
## 3                read_csv2 base 267745        3.23
## 4       fread +stringsAsFactors 267745        3.93
## 5             fread +colClasses 267745        4.15
## 6                    fread base 267745        4.78
## 7      read.csv.raw +colClasses 267745        5.30
## 8             read.csv.raw base 267745        6.08
## 9        read.table +colClasses 260878       13.73
## 10 read.table +stringsAsFactors 260878       16.04
## 11              read.table base 260878       17.50

Performance issues of vectorization and reading data from files in R

Demyd Dzyuban

12/09/2016