Within the curatedMetagenomicData package, parallelization was used to increase performance. However, after some profiling, it was found that parallelization actually slowed processes down compared to similar tasks done in serial. The result is difficult to make sense of, and a small example has been constructed here to reproduce the scenario. Any helpful comments would be welcome.
library(dplyr)
library(BiocParallel)
First, a list, assay_list, of nine data_frame objects was read in using read_tsv. It is loaded from a URL here instead, so as to provide a consistent input and minimize reprocessing.
download.file("https://www.dropbox.com/s/uknlmuu4oc9lsm2/assay_list.Rda?raw=1",
"./assay_list.Rda")
load("./assay_list.Rda")
Next, the join_assays function is defined using full_join from dplyr, as it is faster than many of the available alternatives. The function takes two assays, first_assay and second_assay, and joins them by "rownames". Here "rownames" is actually the first column of the data_frame rather than rownames(data_frame), as a data_frame from dplyr cannot have rownames.
join_assays <- function(first_assay, second_assay) {
    full_join(first_assay, second_assay, by = "rownames")
}
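As a toy illustration of the join behavior (the two data_frame objects below are made up for this example), rows present in only one assay are filled with NA:

first_toy <- data_frame(rownames = c("f1", "f2"), s1 = c(0.1, 0.9))
second_toy <- data_frame(rownames = c("f2", "f3"), s2 = c(0.4, 0.6))
# yields rows f1, f2, f3 with NA where a feature is absent from one assay
join_assays(first_toy, second_toy)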
Next, the parallel_join function is defined, taking an assay_list as an argument. It iterates over the list, using a while loop, until the length is equal to 1. If assay_length is not evenly divisible by 2, the last_assay is set aside so that odd_assays and even_assays have equal length, and it is appended back after the joins. The elements of odd_assays and even_assays are then joined pairwise using bpmapply, which performs the many odd/even joins in parallel. The process is, at least theoretically, faster than joining in serial.
parallel_join <- function(assay_list) {
    while (length(assay_list) > 1) {
        assay_length <- length(assay_list)
        even_assays <- assay_list[seq.int(2, assay_length, 2)]
        if (assay_length %% 2) {
            # odd length: set the last assay aside so the pairs line up
            odd_assays <- assay_list[seq.int(1, assay_length - 1, 2)]
            last_assay <- assay_list[assay_length]
            assay_list <- bpmapply(join_assays, odd_assays, even_assays,
                                   SIMPLIFY = FALSE)
            assay_list <- c(assay_list, last_assay)
        } else {
            odd_assays <- assay_list[seq.int(1, assay_length, 2)]
            assay_list <- bpmapply(join_assays, odd_assays, even_assays,
                                   SIMPLIFY = FALSE)
        }
    }
    assay_list[[1]]
}
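To make the pairing concrete, here is what the index arithmetic does for a list of length nine (an illustrative check, not part of the original code):

assay_length <- 9
seq.int(2, assay_length, 2)      # even positions: 2 4 6 8
seq.int(1, assay_length - 1, 2)  # odd positions:  1 3 5 7
# position 9 is set aside as last_assay and appended after the joins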
The assay_list can be passed to Reduce and all the elements joined in serial using the join_assays function. The timing is as follows:
system.time(Reduce(join_assays, assay_list))
## user system elapsed
## 6.744 0.144 6.887
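For clarity, Reduce folds the list left to right, so with the first three elements it is equivalent to nesting the calls by hand:

# returns TRUE: both sides perform the same nested join_assays calls
identical(
    Reduce(join_assays, assay_list[1:3]),
    join_assays(join_assays(assay_list[[1]], assay_list[[2]]), assay_list[[3]])
)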
Similarly, the assay_list can be passed to the parallel_join function, and all the elements will be joined in parallel using the join_assays function internally. The timing is as follows:
system.time(parallel_join(assay_list))
## user system elapsed
## 17.608 1.780 49.050
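One variable worth ruling out is the registered backend, since bpmapply dispatches to the first registered param by default. A quick sketch for checking and pinning it explicitly (the worker count here is an arbitrary assumption):

registered()  # the first entry is the backend bpmapply uses by default
register(MulticoreParam(workers = 4))  # pin an explicit multicore backend
system.time(parallel_join(assay_list))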
In the absence of further understanding, it seems strange that a serial procedure should be faster than a parallel one. Again, any helpful comments would be welcome. Session info is below.
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.8.1 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.8 digest_0.6.10 rprojroot_1.1 assertthat_0.1
## [5] R6_2.2.0 DBI_0.5-1 backports_1.0.4 magrittr_1.5
## [9] evaluate_0.10 stringi_1.1.2 rmarkdown_1.2 tools_3.3.2
## [13] stringr_1.1.0 parallel_3.3.2 yaml_2.1.14 htmltools_0.3.5
## [17] knitr_1.15.1 tibble_1.2