This post compares the elapsed time of two pieces of code that read multiple files. The first uses a for-loop; the second uses lapply.
Thanks to anh Thanh Le for the second piece of code.
The data consist of 252 files of rainfall measurements. Each file contains 1826 lines and 2 columns. The benchmark is run with 10, 100, 300 and 500 replications, so the total number of files actually read is n_replications * 252.
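As a quick check of the workload size, the total number of file reads per setting follows directly from the figures above. This is only a small sketch; the variable names are illustrative and not part of the benchmark code.

```r
# Number of files in the data set and the replication settings tested
n_files <- 252
replications <- c(10, 100, 300, 500)

# Total file reads per benchmark run = replications * files
total_reads <- replications * n_files
names(total_reads) <- replications

# Gives 2520, 25200, 75600 and 126000 reads respectively
print(total_reads)
```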
Here is the code we used:
library(readr)
library(dplyr)
library(rbenchmark)
library(data.table)
library(bit64)
library(pryr)
res <- new.env()
test_benchmark <- function(repli) {
  # Method 1: for-loop, assigning each file to a variable named after it
  res$result1 <- benchmark(replications = repli, {
    path <- "/pluviofrance/data/csv/"
    files <- list.files(path = path, pattern = "\\.csv$")
    for (i in seq_along(files)) {
      # Strip the ".csv" extension to get the variable name
      name <- substr(files[i], 1, nchar(files[i]) - 4)
      content <- fread(paste0(path, files[i]))
      assign(name, content)
    }
  })
  mem_used()  # return value is discarded here (not printed inside a function)
  # Method 2: lapply, then export the named list to the global environment
  res$result2 <- benchmark(replications = repli, {
    path <- "/pluviofrance/data/csv/"
    files <- list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
    all_csv <- suppressMessages(lapply(files, function(x) fread(x)))
    csv_name <- basename(files) %>% gsub("\\.csv", "", x = .)
    names(all_csv) <- csv_name
    list2env(all_csv, envir = .GlobalEnv)
  })
  # Combine both results, dropping the first column (the test label)
  mess <- as.data.frame(rbind(
    cbind(Code = "loop", res$result1[2:8]),
    cbind(Code = "lapply", res$result2[2:8])
  ))
  return(mess)
}
#10 replications
repli10 <- test_benchmark(repli = 10); print(repli10)
##     Code replications elapsed relative user.self sys.self user.child
## 1   loop           10   2.111        1     1.965    0.146          0
## 2 lapply           10   1.886        1     1.759    0.127          0
##   sys.child
## 1         0
## 2         0
#100 replications
repli100 <- test_benchmark(repli = 100); print(repli100)
##     Code replications elapsed relative user.self sys.self user.child
## 1   loop          100  18.879        1    17.652    1.223          0
## 2 lapply          100  18.547        1    17.292    1.250          0
##   sys.child
## 1         0
## 2         0
#300 replications
repli300 <- test_benchmark(repli = 300); print(repli300)
##     Code replications elapsed relative user.self sys.self user.child
## 1   loop          300  56.845        1    53.113    3.722          0
## 2 lapply          300  56.008        1    52.263    3.732          0
##   sys.child
## 1         0
## 2         0
#500 replications
repli500 <- test_benchmark(repli = 500); print(repli500)
##     Code replications elapsed relative user.self sys.self user.child
## 1   loop          500  94.391        1    88.288    6.085          0
## 2 lapply          500  93.727        1    87.484    6.225          0
##   sys.child
## 1         0
## 2         0
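To put the four runs on a common scale, the elapsed times can be converted to an average cost per file read and a percentage gap. The sketch below only re-uses the numbers printed above; the data frame and its column names are made up for illustration.

```r
# Elapsed seconds copied from the benchmark output above
timings <- data.frame(
  replications = c(10, 100, 300, 500),
  loop   = c(2.111, 18.879, 56.845, 94.391),
  lapply = c(1.886, 18.547, 56.008, 93.727)
)
n_files <- 252  # files read per replication

# Average milliseconds per file read for each approach
total_reads <- timings$replications * n_files
timings$loop_ms_per_file   <- 1000 * timings$loop   / total_reads
timings$lapply_ms_per_file <- 1000 * timings$lapply / total_reads

# Gap between the two approaches, as a percentage of the loop time
timings$gap_pct <- 100 * (timings$loop - timings$lapply) / timings$loop

print(round(timings, 2))
```

With these figures both approaches come out at roughly 0.74 to 0.84 ms per file, and the gap falls from about 11% at 10 replications to under 2% from 100 replications onwards.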
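The second version is much tidier than the first, and for-loops in R have a reputation for being slow. Surprisingly, the benchmark shows only a slight difference between the two: lapply is about 11% faster at 10 replications (1.886 s vs 2.111 s), and the gap shrinks to under 2% at 100 replications and more. Most of the time is presumably spent in fread itself, so the looping construct hardly matters here. Note that the relative column is 1 in every row because each version was benchmarked in its own benchmark() call; passing both expressions to a single call would make that column a meaningful ratio.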
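For readers who want to try the comparison without the rainfall data, here is a minimal, dependency-free sketch. It is not the benchmark used above: it assumes a handful of synthetic CSV files in a temporary directory, uses base read.csv instead of fread, times with system.time instead of rbenchmark, and returns a named list instead of assigning into an environment. The function names read_with_loop and read_with_lapply are hypothetical.

```r
# Create a few small CSV files in a temporary directory (synthetic data)
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:5) {
  df <- data.frame(day = 1:20, rain = round(runif(20, 0, 30), 1))
  write.csv(df, file.path(tmp_dir, sprintf("station_%02d.csv", i)),
            row.names = FALSE)
}

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
csv_names <- tools::file_path_sans_ext(basename(files))

# Approach 1: for-loop filling a pre-allocated named list
read_with_loop <- function(files, csv_names) {
  out <- vector("list", length(files))
  names(out) <- csv_names
  for (i in seq_along(files)) {
    out[[i]] <- read.csv(files[i])
  }
  out
}

# Approach 2: lapply building the same named list
read_with_lapply <- function(files, csv_names) {
  out <- lapply(files, read.csv)
  names(out) <- csv_names
  out
}

loop_result   <- read_with_loop(files, csv_names)
lapply_result <- read_with_lapply(files, csv_names)

# Both approaches should yield identical data
stopifnot(identical(loop_result, lapply_result))

# Rough timing of each approach over 20 repeated runs
print(system.time(for (k in 1:20) read_with_loop(files, csv_names)))
print(system.time(for (k in 1:20) read_with_lapply(files, csv_names)))
```

Keeping the files in a named list, rather than creating one variable per file, also makes it easy to check that both approaches read exactly the same data.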