We will explore several ways of selcting a column of a dataframe based on some predicate and benchmark the time it takes for each process. The methods to be comapred include approaches using the data.table, dplyr, purrr packages and base functions Filter and sapply.We would attempt to select all columns of a dataframe which satisfies some predicate for example is numeric.
library(tidyverse)
library(data.table)
dplyrselec<-function(x){
return( x%>%dplyr::select_if(is.numeric))
}
dplyrselec(iris)%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
Equivalently with a comobination of selcet statement and and base functions we can achieve the same as above.
iris %>% select(which(sapply(., is.numeric)))%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
The base function Filter offers another alternative to select columns based on some predicate.
Filterbase<-function(x){
return(Filter(is.numeric, x) )
}
Filterbase(iris)%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
Using the purrr package:
purrrkeep<-function(x){
x%>%
purrr::keep(is.numeric)
}
purrrkeep(iris) %>%
tail(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 149 6.2 3.4 5.4 2.3
## 150 5.9 3.0 5.1 1.8
Another method using base functions sapply
index <- sapply(iris, is.numeric)
iris[,index]%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
basesapply<-function(x){
return(x[sapply(x, is.numeric)])
}
basesapply(iris)%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
#Equivalently
iris[, sapply(iris, class) == "numeric"]%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
iris[, lapply(iris, is.numeric) == TRUE]%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
library(data.table)
datatable<-function(x){
xdt<-data.table(x)
index <- which(sapply(xdt,is.numeric))
return( xdt[ , index, with=FALSE])
}
datatable(iris)%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 5.1 3.5 1.4 0.2
## 2: 4.9 3.0 1.4 0.2
## 3: 4.7 3.2 1.3 0.2
Equivalently we can achieve the same purpose above with data.table as below:
irisdt<-data.table(iris)
irisdt[, .SD, .SDcols = sapply(irisdt, is.numeric)]%>%head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1: 5.1 3.5 1.4 0.2
## 2: 4.9 3.0 1.4 0.2
## 3: 4.7 3.2 1.3 0.2
irisdt<-irisdt[,Filter(is.numeric, .SD)]%>%head(3)
library(microbenchmark)
microbenchmark(
dplyr::select_if(mtcars, is.numeric),
Filter(is.numeric, mtcars),
unit="relative"
)
## Unit: relative
## expr min lq mean median
## dplyr::select_if(mtcars, is.numeric) 40.98831 35.75103 32.78331 27.25671
## Filter(is.numeric, mtcars) 1.00000 1.00000 1.00000 1.00000
## uq max neval cld
## 27.20633 177.5123 100 b
## 1.00000 1.0000 100 a
mbm = microbenchmark(
datatable(iris),
dplyrselec(iris),
Filterbase(iris),
purrrkeep(iris),
basesapply(iris),
unit="relative",
times=10L
)
mbm2 = microbenchmark(
datatable(iris),
dplyrselec(iris),
Filterbase(iris),
purrrkeep(iris),
basesapply(iris),
times=10L
)
summary(mbm2)
## expr min lq mean median uq max
## 1 datatable(iris) 798.216 815.432 857.5952 836.0680 856.017 1003.946
## 2 dplyrselec(iris) 1141.383 1207.871 1271.8428 1310.6700 1317.041 1340.681
## 3 Filterbase(iris) 32.356 34.607 47.2218 47.1435 56.226 66.453
## 4 purrrkeep(iris) 404.920 416.488 466.2423 449.0390 474.372 684.339
## 5 basesapply(iris) 47.478 49.739 59.4772 53.8365 65.058 93.777
## neval cld
## 1 10 c
## 2 10 d
## 3 10 a
## 4 10 b
## 5 10 a
boxplot(mbm)
Fig. 30
#S3 method for microbenchmark
summary(mbm)
## expr min lq mean median uq
## 1 datatable(iris) 26.621838 23.730839 5.804476 15.694968 14.737585
## 2 dplyrselec(iris) 39.509899 38.451680 8.115970 25.874913 23.491229
## 3 Filterbase(iris) 1.000000 1.000000 1.000000 1.000000 1.000000
## 4 purrrkeep(iris) 13.770064 14.643025 3.579170 10.663037 10.397330
## 5 basesapply(iris) 1.492816 1.735413 1.249264 1.277517 1.261793
## max neval cld
## 1 2.548825 10 bc
## 2 2.520249 10 c
## 3 1.000000 10 a
## 4 1.336688 10 ab
## 5 1.215719 10 a
Using the base fucntion Filter is about 14 time faster than data.table package,23 times faster than dplyr and more than 7 time faster than the purrr package. The closest is using the approach from the the base function sapply which is just about 0.3 slower than Filter. The result from data.table was a little surprising to me because I have always seen it perfoem faster compared to other tidyverse packages. In this example although it does better than dplyr, the purrr package approach is about 2 times faster.
#Another R tip. Use vector(mode = "list") to pre-allocate lists.
result <- vector(mode = "list", 3)
print(result)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
#Pre-allocation is particularly useful when using for-loops.
for(i in seq_along(result)) {
result[[i]] <- i
}
print(result)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3