## Why benchmark read functions
Reading large flat text files is one of the most common tasks in data science and in quantitative genomics applications. Any programming language offers several ways to perform a given task, and reading files in R is no exception.
Two aspects are worth considering when benchmarking file-reading functions in R: elapsed time and memory footprint. Here we focus only on read time, not on memory usage.
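Both quantities can be measured with base R. The following is a minimal sketch, assuming a small synthetic file (the `toy` data frame and the temporary path are made up for illustration and are not the genotype file used in this post). Note that `object.size()` reports the size of the loaded object, which is only a proxy for memory footprint and not the peak memory used while reading.

```r
# Write a small synthetic space-delimited file (hypothetical data).
tmp <- tempfile(fileext = ".dat")
toy <- data.frame(id = 1:1000, group = rep(c("a", "b"), 500), value = runif(1000))
write.table(toy, tmp, quote = FALSE, row.names = FALSE, col.names = FALSE)

# system.time() returns user, system and elapsed times;
# "elapsed" is the wall-clock time in seconds.
timing <- system.time(dat <- read.table(tmp, header = FALSE))
elapsed_secs <- timing[["elapsed"]]

# Size of the resulting object in bytes (not peak memory during the read).
size_bytes <- as.numeric(object.size(dat))

print(elapsed_secs)
print(format(object.size(dat), units = "Kb"))
unlink(tmp)
```

Compared with subtracting two `Sys.time()` values, `system.time()` always reports seconds, which avoids mixing units such as secs and mins when comparing results.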
Let’s compare a few functions, starting with fread from data.table.
library(data.table)
library(tidyverse) # includes readr
setwd("D:/Breed composition hypore")
init<-Sys.time()
f1<-fread(file="genotypes_Hypor_JP_9010_2012-09-29.dat")
fin<-Sys.time()
print(fin-init)
## Time difference of 1.329033 secs
dim(f1)
## [1] 9010 3
class(f1)
## [1] "data.table" "data.frame"
Now let’s compare with read_delim from readr.
init<-Sys.time()
f2<-read_delim("genotypes_Hypor_JP_9010_2012-09-29.dat",delim = " ",col_names = FALSE)
fin<-Sys.time()
print(fin-init)
## Time difference of 0.4439561 secs
dim(f2)
## [1] 9010 3
class(f2)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Now use the traditional read.table from base R.
init<-Sys.time()
f3<-read.table("genotypes_Hypor_JP_9010_2012-09-29.dat",header = FALSE)
fin<-Sys.time()
print(fin-init)
## Time difference of 1.142066 mins
dim(f3)
## [1] 9010 3
class(f3)
## [1] "data.frame"
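The three functions perform the same task, yet even for a modest-size file the difference in read time is large: read.table took about 1.14 minutes (roughly 68.5 seconds), against 1.33 seconds for fread and 0.44 seconds for read_delim. In this single run read.table was therefore about 50 times slower than fread and about 150 times slower than read_delim, while fread and read_delim both finished in under two seconds.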
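A single timed run can be noisy (for example because of disk caching), so it may help to repeat each read and compare a summary such as the median. The sketch below is one way to do that, not a definitive benchmark: `median_elapsed` is a hypothetical helper, the file is synthetic, and fread and read_delim are timed only if their packages are installed.

```r
# Hypothetical helper: run a reader function `reps` times on `path`
# and return the median elapsed time in seconds.
median_elapsed <- function(reader, path, reps = 5) {
  times <- vapply(seq_len(reps), function(i) {
    system.time(reader(path))[["elapsed"]]
  }, numeric(1))
  median(times)
}

# Synthetic space-delimited file standing in for the real data.
tmp <- tempfile(fileext = ".dat")
toy <- data.frame(id = 1:2000, code = sample(0:2, 2000, replace = TRUE))
write.table(toy, tmp, quote = FALSE, row.names = FALSE, col.names = FALSE)

results <- c(read.table = median_elapsed(function(p) read.table(p, header = FALSE), tmp))

# Time the other two readers only when their packages are available.
if (requireNamespace("data.table", quietly = TRUE)) {
  results["fread"] <- median_elapsed(function(p) data.table::fread(file = p), tmp)
}
if (requireNamespace("readr", quietly = TRUE)) {
  results["read_delim"] <- median_elapsed(function(p) {
    suppressMessages(readr::read_delim(p, delim = " ", col_names = FALSE))
  }, tmp)
}

print(results)
unlink(tmp)
```

On a file this small the differences will be tiny; the point is the pattern of repeating the measurement, which can then be applied to the real file.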
Note: changing the number of columns may affect the performance of read functions in R significantly.
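One way to probe this is to hold the number of cells fixed and vary only the shape of the file. The sketch below does so for read.table with synthetic integer data; `time_shape` is a hypothetical helper and the dimensions are arbitrary choices, so the timings it prints are only indicative for a given machine.

```r
# Hypothetical helper: write an n_rows x n_cols matrix of integers to a
# temporary file and return the elapsed seconds read.table needs to read it.
time_shape <- function(n_rows, n_cols) {
  tmp <- tempfile(fileext = ".dat")
  on.exit(unlink(tmp))
  m <- matrix(sample(0:2, n_rows * n_cols, replace = TRUE), nrow = n_rows)
  write.table(m, tmp, quote = FALSE, row.names = FALSE, col.names = FALSE)
  system.time(read.table(tmp, header = FALSE))[["elapsed"]]
}

# Same number of cells (200,000): tall-and-narrow vs short-and-wide.
shapes <- data.frame(n_rows = c(20000, 200), n_cols = c(10, 1000))
shapes$elapsed <- mapply(time_shape, shapes$n_rows, shapes$n_cols)
print(shapes)
```

The same helper could wrap fread or read_delim instead of read.table to see whether they react to the file shape in the same way.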