Benchmark_reading_multiple_files

1. Objective

Compare the elapsed time of two pieces of code that read multiple files. The first uses a for-loop; the second uses lapply.

Thanks to Thanh Le for the second code.

2. Data and Benchmark

The data consist of 252 files of rainfall measurements. Each file contains 1826 lines and 2 columns. The benchmark is run with 10, 100, 300 and 500 replications. The total number of files actually read in each run is therefore n_replications * 252.
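As a quick check of the workload, the total number of file reads per run can be computed directly from the figures above. This is only an arithmetic sketch; the variable names are chosen here for illustration.

```r
# File count and replication counts taken from the text above
n_files <- 252
replications <- c(10, 100, 300, 500)

# Total number of file reads performed in each benchmark run
total_reads <- replications * n_files
names(total_reads) <- replications
print(total_reads)
```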

3. Coding

Here is the code that we used:

library(readr)
library(dplyr)
library(rbenchmark)
library(data.table)
library(bit64)
library(pryr)

res <- new.env()

test_benchmark <- function(repli){
  # Method 1: for-loop, one variable per file
  res$result1 <- benchmark(replications = repli, {
    path <- "/pluviofrance/data/csv/"
    # pattern is a regular expression, not a glob
    files <- list.files(path = path, pattern = "\\.csv$")
    for (i in seq_along(files)) {
      name <- substr(files[i], 1, nchar(files[i]) - 4)  # drop ".csv"
      content <- fread(paste0(path, files[i]))
      assign(name, content)
    }
  })

  # Memory in use between the two runs
  # (the value is not auto-printed when called inside a function)
  mem_used()

  # Method 2: lapply, then export the named list to the global environment
  res$result2 <- benchmark(replications = repli, {
    path <- "/pluviofrance/data/csv/"
    files <- list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
    all_csv <- suppressMessages(lapply(files, function(x) fread(x)))
    csv_name <- basename(files) %>% gsub("\\.csv$", "", x = .)
    names(all_csv) <- csv_name
    list2env(all_csv, envir = .GlobalEnv)
  })

  # Keep columns 2 to 8 (replications ... sys.child) and label each row
  mess <- as.data.frame(rbind(cbind(Code = "loop", res$result1[2:8]),
                              cbind(Code = "lapply", res$result2[2:8])))
  return(mess)
}
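For readers without access to the rainfall files, the two reading strategies can be tried on throwaway data. The sketch below is not the benchmark above: it assumes small, made-up CSV files in a temporary directory (the `station*.csv` names and their contents are hypothetical), and it swaps in base R `read.csv` and `system.time` for `fread` and `benchmark` so that no extra package is needed. Both variants return a named list instead of creating one variable per file.

```r
# Create a few small CSV files in a temporary directory (hypothetical data)
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (k in 1:5) {
  demo <- data.frame(day = 1:10, rain = k * (1:10))
  write.csv(demo, file.path(tmp_dir, paste0("station", k, ".csv")),
            row.names = FALSE)
}

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Variant 1: for-loop filling a pre-allocated list
read_loop <- function(files) {
  out <- vector("list", length(files))
  for (i in seq_along(files)) out[[i]] <- read.csv(files[i])
  names(out) <- sub("\\.csv$", "", basename(files))
  out
}

# Variant 2: lapply over the file paths
read_lapply <- function(files) {
  out <- lapply(files, read.csv)
  names(out) <- sub("\\.csv$", "", basename(files))
  out
}

# Time 20 repetitions of each variant with base R
t_loop   <- system.time(for (r in 1:20) res_loop   <- read_loop(files))
t_lapply <- system.time(for (r in 1:20) res_lapply <- read_lapply(files))
print(rbind(loop = t_loop, lapply = t_lapply)[, "elapsed"])
```

On files this small the timings are dominated by noise, so the sketch is mainly useful for confirming that both variants return the same data.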

4. Result

#10 replications
repli10 <- test_benchmark(repli = 10); print(repli10)

##     Code replications elapsed relative user.self sys.self user.child
## 1   loop           10   2.111        1     1.965    0.146          0
## 2 lapply           10   1.886        1     1.759    0.127          0
##   sys.child
## 1         0
## 2         0

#100 replications
repli100 <- test_benchmark(repli = 100); print(repli100)

##     Code replications elapsed relative user.self sys.self user.child
## 1   loop          100  18.879        1    17.652    1.223          0
## 2 lapply          100  18.547        1    17.292    1.250          0
##   sys.child
## 1         0
## 2         0

#300 replications
repli300 <- test_benchmark(repli = 300); print(repli300)

##     Code replications elapsed relative user.self sys.self user.child
## 1   loop          300  56.845        1    53.113    3.722          0
## 2 lapply          300  56.008        1    52.263    3.732          0
##   sys.child
## 1         0
## 2         0

#500 replications
repli500 <- test_benchmark(repli = 500); print(repli500)

##     Code replications elapsed relative user.self sys.self user.child
## 1   loop          500  94.391        1    88.288    6.085          0
## 2 lapply          500  93.727        1    87.484    6.225          0
##   sys.child
## 1         0
## 2         0
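The `relative` column is 1 in every row because each `benchmark()` call above contains a single expression, so there is nothing for it to be compared against. A direct comparison can be worked out by hand from the reported elapsed times. The sketch below copies those values from the output above; the column names it adds are chosen here for illustration.

```r
# Elapsed times (seconds) copied from the benchmark output above
results <- data.frame(
  replications = c(10, 100, 300, 500),
  loop   = c(2.111, 18.879, 56.845, 94.391),
  lapply = c(1.886, 18.547, 56.008, 93.727)
)

n_files <- 252  # files read per replication

# Ratio of the two elapsed times (1 would mean equal speed)
results$loop_over_lapply <- round(results$loop / results$lapply, 3)

# Average time per file read, in milliseconds
reads <- results$replications * n_files
results$ms_per_file_loop   <- round(1000 * results$loop   / reads, 3)
results$ms_per_file_lapply <- round(1000 * results$lapply / reads, 3)

print(results)
```

The ratio stays above 1 at every replication count, and from 100 replications onward it is below 1.02.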

Conclusion

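The second code is much tidier than the first, and for-loops have a reputation for being slow in R. Surprisingly, the benchmark shows only a slight difference between the two: the lapply version is faster by about 0.2 to 0.8 seconds of total elapsed time at every replication count (for example 93.727 s versus 94.391 s at 500 replications), which is less than 2% from 100 replications onward.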

Benchmark_reading_multiple_files

Vịt Trần

November 15, 2016
