Tips on Selecting Columns of a Dataframe in R

We will explore several ways of selecting the columns of a dataframe that satisfy some predicate, and benchmark the time each approach takes. The methods compared use the data.table, dplyr, and purrr packages and the base functions Filter and sapply. The running predicate is is.numeric, so each method selects all numeric columns of the dataframe.

library(tidyverse)
library(data.table)


# Select the numeric columns with dplyr
dplyrselec <- function(x) {
  x %>% dplyr::select_if(is.numeric)
}

dplyrselec(iris) %>% head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
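In dplyr 1.0.0 and later, select_if() is marked as superseded in favour of select() combined with where(). The sketch below assumes such a dplyr version is installed; numeric_cols is a name introduced here only for illustration.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, where where() is available inside select()
numeric_cols <- iris %>% select(where(is.numeric))

numeric_cols %>% head(3)
```

This should select the same four numeric columns as dplyrselec(iris).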

Equivalently, a combination of the select statement and base functions achieves the same result as above.

iris %>% select(which(sapply(., is.numeric)))%>%head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

The base function Filter offers another alternative to select columns based on some predicate.

# Select the numeric columns with base::Filter
Filterbase <- function(x) {
  Filter(is.numeric, x)
}

Filterbase(iris) %>% head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
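It can help to see roughly what Filter does under the hood: it applies the predicate to every element (here, every column) and keeps the elements for which it returns TRUE. The function below is a minimal sketch of that idea, not the base R source itself, and filter_sketch is a hypothetical name.

```r
# A minimal sketch of the idea behind Filter(); filter_sketch is a hypothetical name
filter_sketch <- function(f, x) {
  keep <- as.logical(unlist(lapply(x, f)))  # apply the predicate to each column
  x[which(keep)]                            # list-style indexing keeps a data frame
}

# Compare the sketch with the real Filter() on iris
same <- identical(filter_sketch(is.numeric, iris), Filter(is.numeric, iris))
print(same)
```

Seen this way, Filter is essentially an lapply followed by list-style indexing, which is close to what the sapply-based approach does.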

Using the purrr package:

purrrkeep<-function(x){
  x%>%
  purrr::keep(is.numeric)  
}


purrrkeep(iris) %>% 
  tail(2)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 149          6.2         3.4          5.4         2.3
## 150          5.9         3.0          5.1         1.8

Another method uses the base function sapply:

index <- sapply(iris, is.numeric)

iris[,index]%>%head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
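One caveat with matrix-style indexing such as iris[, index]: when only one column satisfies the predicate, the result is simplified to a plain vector unless drop = FALSE is given. The sketch below uses a small made-up dataframe (df and its columns are hypothetical) to show the difference.

```r
# Hypothetical dataframe with a single numeric column
df <- data.frame(id = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))
index <- sapply(df, is.numeric)

vec       <- df[, index]                # simplified to a plain numeric vector
kept      <- df[, index, drop = FALSE]  # stays a dataframe
also_kept <- df[index]                  # list-style indexing also stays a dataframe

is.data.frame(vec)
is.data.frame(kept)
is.data.frame(also_kept)
```

List-style indexing, x[index], avoids the surprise, so it is the safer choice inside a general-purpose function.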

# Select the numeric columns with sapply and list-style indexing
basesapply <- function(x) {
  x[sapply(x, is.numeric)]
}

basesapply(iris) %>% head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

# Equivalent here, because every numeric column of iris has class "numeric"
iris[, sapply(iris, class) == "numeric"]%>%head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2

iris[, lapply(iris, is.numeric) == TRUE]%>%head(3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
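Comparing class to "numeric" and calling is.numeric agree on iris, but they are not the same test in general: an integer column has class "integer", while is.numeric returns TRUE for it. The sketch below uses a made-up dataframe (df and its column names are hypothetical) to show where the two diverge.

```r
# Hypothetical dataframe with an integer, a double, and a character column
df <- data.frame(count = 1:3,
                 ratio = c(0.1, 0.2, 0.3),
                 label = c("x", "y", "z"))

by_class     <- names(df)[sapply(df, class) == "numeric"]  # misses the integer column
by_predicate <- names(df)[sapply(df, is.numeric)]          # includes the integer column

print(by_class)
print(by_predicate)
```

For that reason is.numeric is usually the more robust predicate when the goal is "all number-like columns".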

library(data.table)

# Convert to a data.table, then select the numeric columns by position
datatable <- function(x) {
  xdt <- data.table(x)
  index <- which(sapply(xdt, is.numeric))
  xdt[, index, with = FALSE]
}

datatable(iris) %>% head(3)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:          5.1         3.5          1.4         0.2
## 2:          4.9         3.0          1.4         0.2
## 3:          4.7         3.2          1.3         0.2

Equivalently, the same result can be obtained with data.table using .SD and .SDcols:

irisdt<-data.table(iris)

irisdt[, .SD, .SDcols = sapply(irisdt, is.numeric)]%>%head(3)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:          5.1         3.5          1.4         0.2
## 2:          4.9         3.0          1.4         0.2
## 3:          4.7         3.2          1.3         0.2

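# Filter() also works on .SD; the result is stored under a new name so that irisdt is not overwritten
irisdt_numeric <- irisdt[, Filter(is.numeric, .SD)] %>% head(3)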

library(microbenchmark)
microbenchmark(
    dplyr::select_if(mtcars, is.numeric),
    Filter(is.numeric, mtcars),
    unit="relative"
)

## Unit: relative
##                                  expr      min       lq     mean   median
##  dplyr::select_if(mtcars, is.numeric) 40.98831 35.75103 32.78331 27.25671
##            Filter(is.numeric, mtcars)  1.00000  1.00000  1.00000  1.00000
##        uq      max neval cld
##  27.20633 177.5123   100   b
##   1.00000   1.0000   100  a

mbm = microbenchmark(
datatable(iris),
dplyrselec(iris),
Filterbase(iris),
purrrkeep(iris),
basesapply(iris),

unit="relative",
times=10L
)


mbm2 = microbenchmark(
datatable(iris),
dplyrselec(iris),
Filterbase(iris),
purrrkeep(iris),
basesapply(iris),
times=10L
)



summary(mbm2)

##               expr      min       lq      mean    median       uq      max
## 1  datatable(iris)  798.216  815.432  857.5952  836.0680  856.017 1003.946
## 2 dplyrselec(iris) 1141.383 1207.871 1271.8428 1310.6700 1317.041 1340.681
## 3 Filterbase(iris)   32.356   34.607   47.2218   47.1435   56.226   66.453
## 4  purrrkeep(iris)  404.920  416.488  466.2423  449.0390  474.372  684.339
## 5 basesapply(iris)   47.478   49.739   59.4772   53.8365   65.058   93.777
##   neval  cld
## 1    10   c 
## 2    10    d
## 3    10 a   
## 4    10  b  
## 5    10 a

boxplot(mbm)

Fig. 30: boxplot of the benchmark timings stored in mbm.

#S3 method for microbenchmark
summary(mbm)

##               expr       min        lq     mean    median        uq
## 1  datatable(iris) 26.621838 23.730839 5.804476 15.694968 14.737585
## 2 dplyrselec(iris) 39.509899 38.451680 8.115970 25.874913 23.491229
## 3 Filterbase(iris)  1.000000  1.000000 1.000000  1.000000  1.000000
## 4  purrrkeep(iris) 13.770064 14.643025 3.579170 10.663037 10.397330
## 5 basesapply(iris)  1.492816  1.735413 1.249264  1.277517  1.261793
##        max neval cld
## 1 2.548825    10  bc
## 2 2.520249    10   c
## 3 1.000000    10 a  
## 4 1.336688    10 ab 
## 5 1.215719    10 a

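Using the base function Filter is the fastest approach. By the median column of the relative summary above, it is about 16 times faster than the data.table approach, about 26 times faster than dplyr, and about 11 times faster than purrr. The closest competitor is the base function sapply, whose median time is about 1.3 times that of Filter. The data.table result was a little surprising to me because I have always seen it perform faster than the tidyverse packages. In this example it does better than dplyr, but the purrr approach is roughly 1.5 to 2 times faster than data.table, depending on the run. Note that datatable() converts its input with data.table(x) on every call, so the conversion is included in its timing, and that iris has only 150 rows, so these timings likely reflect fixed per-call overhead more than the cost of the selection itself.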
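One way to check how much of the data.table timing comes from the conversion step is to benchmark the selection on an object that is already a data.table. The sketch below is only an illustration under that assumption: convert_and_select mirrors the datatable() function above, while select_only and res are hypothetical names introduced here. No timings are shown because they will vary by machine and package version.

```r
library(data.table)
library(microbenchmark)

# Same logic as datatable() above: the conversion happens inside the timed call
convert_and_select <- function(x) {
  xdt <- data.table(x)
  index <- which(sapply(xdt, is.numeric))
  xdt[, index, with = FALSE]
}

# Hypothetical helper: expects an existing data.table, so only the selection is timed
select_only <- function(dt) {
  index <- which(sapply(dt, is.numeric))
  dt[, index, with = FALSE]
}

irisdt <- as.data.table(iris)  # convert once, outside the benchmark

res <- microbenchmark(
  convert_and_select(iris),
  select_only(irisdt),
  Filter(is.numeric, iris),
  times = 10L
)
summary(res)
```

If select_only turns out much closer to Filter than convert_and_select does, the conversion accounts for most of the gap seen in the benchmark above.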

#Another R tip. Use vector(mode = "list") to pre-allocate lists.

result <- vector(mode = "list", 3)
print(result)

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL

#Pre-allocation is particularly useful when using for-loops.

for(i in seq_along(result)) {
  result[[i]] <- i
}
print(result)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
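For comparison, the sketch below builds the same list twice, once by growing it one element at a time and once by filling a pre-allocated list. The names n, grown, and prealloc are introduced here for illustration only. Both loops produce the same result; the pre-allocated version simply avoids extending the list on every iteration.

```r
n <- 5

# Growing the list one element at a time
grown <- list()
for (i in seq_len(n)) {
  grown[[length(grown) + 1]] <- i^2
}

# Filling a pre-allocated list of known length
prealloc <- vector(mode = "list", length = n)
for (i in seq_len(n)) {
  prealloc[[i]] <- i^2
}

identical(grown, prealloc)
```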

Nana Boateng

March 10, 2018