Within the curatedMetagenomicData package, parallelization was used to increase performance. However, after some profiling, it was found that parallelization actually slowed processes down compared to similar tasks done in serial. The result is difficult to make sense of, and a small example has been constructed here to reproduce the scenario. Any helpful comments would be welcome.
library(dplyr)
library(BiocParallel)
First, a list, assay_list, of nine data_frame objects was read in using read_tsv. It is loaded from a URL here instead, so as to provide a consistent input and minimize reprocessing.
download.file("https://www.dropbox.com/s/uknlmuu4oc9lsm2/assay_list.Rda?raw=1",
"./assay_list.Rda")
load("./assay_list.Rda")
Next, the join_assays function is defined using full_join from dplyr, as it is faster than many of the available alternatives. The function takes two assays, first_assay and second_assay, and joins them by "rownames". Here "rownames" is actually the first column of the data_frame rather than rownames(data_frame), as a data_frame from dplyr cannot have rownames.
join_assays <- function(first_assay, second_assay) {
    full_join(first_assay, second_assay, by = "rownames")
}
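As a toy illustration of the join behavior (the two data_frame objects below are made up for this example), rows present in only one assay are filled with NA:

first_toy <- data_frame(rownames = c("f1", "f2"), s1 = c(0.1, 0.9))
second_toy <- data_frame(rownames = c("f2", "f3"), s2 = c(0.4, 0.6))
# yields rows f1, f2, f3 with NA where a feature is absent from one assay
join_assays(first_toy, second_toy)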
Next, the parallel_join function is defined, taking an assay_list as an argument. It iterates over the list, using a while loop, until the length is equal to 1. If assay_length is not evenly divisible by 2, the last_assay is set aside so that odd_assays and even_assays have equal length, and it is appended back after the joins. The elements of odd_assays and even_assays are then joined pairwise using bpmapply, which performs the many odd/even joins in parallel. The process is, at least theoretically, faster than joining in serial.
parallel_join <- function(assay_list) {
    while (length(assay_list) > 1) {
        assay_length <- length(assay_list)
        even_assays <- assay_list[seq.int(2, assay_length, 2)]
        if (assay_length %% 2) {
            # odd length: set the last assay aside so the pairs line up
            odd_assays <- assay_list[seq.int(1, assay_length - 1, 2)]
            last_assay <- assay_list[assay_length]
            assay_list <- bpmapply(join_assays, odd_assays, even_assays,
                                   SIMPLIFY = FALSE)
            assay_list <- c(assay_list, last_assay)
        } else {
            odd_assays <- assay_list[seq.int(1, assay_length, 2)]
            assay_list <- bpmapply(join_assays, odd_assays, even_assays,
                                   SIMPLIFY = FALSE)
        }
    }
    assay_list[[1]]
}
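To make the pairing concrete, here is what the index arithmetic does for a list of length nine (an illustrative check, not part of the original code):

assay_length <- 9
seq.int(2, assay_length, 2)      # even positions: 2 4 6 8
seq.int(1, assay_length - 1, 2)  # odd positions:  1 3 5 7
# position 9 is set aside as last_assay and appended after the joins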
The assay_list can be passed to Reduce and all the elements joined in serial using the join_assays function. The timing is as follows:
system.time(Reduce(join_assays, assay_list))
## user system elapsed
## 6.744 0.144 6.887
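For clarity, Reduce folds the list left to right, so with the first three elements it is equivalent to nesting the calls by hand:

# returns TRUE: both sides perform the same nested join_assays calls
identical(
    Reduce(join_assays, assay_list[1:3]),
    join_assays(join_assays(assay_list[[1]], assay_list[[2]]), assay_list[[3]])
)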
Similarly, the assay_list can be passed to the parallel_join function, and all the elements will be joined in parallel using the join_assays function internally. The timing is as follows:
system.time(parallel_join(assay_list))
## user system elapsed
## 17.608 1.780 49.050
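One variable worth ruling out is the registered backend, since bpmapply dispatches to the first registered param by default. A quick sketch for checking and pinning it explicitly (the worker count here is an arbitrary assumption):

registered()  # the first entry is the backend bpmapply uses by default
register(MulticoreParam(workers = 4))  # pin an explicit multicore backend
system.time(parallel_join(assay_list))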
In the absence of further understanding, it seems strange that a serial procedure should be faster than a parallel one. Again, any helpful comments would be welcome. Session info is below.
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.8.1 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.8 digest_0.6.10 rprojroot_1.1 assertthat_0.1
## [5] R6_2.2.0 DBI_0.5-1 backports_1.0.4 magrittr_1.5
## [9] evaluate_0.10 stringi_1.1.2 rmarkdown_1.2 tools_3.3.2
## [13] stringr_1.1.0 parallel_3.3.2 yaml_2.1.14 htmltools_0.3.5
## [17] knitr_1.15.1 tibble_1.2