For R users with limited computing resources, we introduce the RunTerraWorkflow package. This package allows users to run workflows implemented in Terra, a cloud-based genomics platform, without writing any workflow code, installing software, or managing cloud resources. Terra’s computing resources rely on Google Cloud Platform (GCP), and to use RunTerraWorkflow you only need to set up a Terra account once, at the beginning.
if (!"RunTerraWorkflow" %in% installed.packages())
devtools::install_github("shbrief/RunTerraWorkflow")
if (!"biobakeR" %in% installed.packages())
devtools::install_github("shbrief/biobakeR")
if (!"AnVIL" %in% installed.packages())
BiocManager::install("AnVIL")
library(RunTerraWorkflow)
library(AnVIL)
You need a Terra account. Once you have your own Terra account, you need two pieces of information from it to use the RunTerraWorkflow package: 1) the email address linked to your Terra account and 2) your billing project name. Here are the basic input parameters used in this vignette. Modify these with YOUR account information!
accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"
setCloudEnv(accountEmail = accountEmail,
            billingProjectName = billingProjectName)
Here is a table of the major functions for the three workflow steps: prepare, run, and check results.
| Steps   | Functions           | Description                               |
|---------|---------------------|-------------------------------------------|
| Prepare | `cloneWorkspace`    | copy the template workspace               |
|         | `updateInput`       | take user's inputs                        |
| Run     | `launchWorkflow`    | launch the workflow in Terra              |
|         | `abortSubmission`   | abort the submission                      |
| Result  | `monitorSubmission` | monitor the status of your workflow run   |
|         | `listOutput`        | display the list of your workflow outputs |
|         | `getOutput`         | download your output                      |
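Putting these functions together, a typical run follows the order in the table above. Here is a minimal end-to-end sketch (not evaluated here); it assumes the cloud environment set up above with `setCloudEnv`, a fresh workspace name, and the bioBakery inputs described later in this vignette.
## Prepare, run, and check results for a hypothetical workspace
cloneWorkspace(workspaceName = "my_biobakery_run", analysis = "bioBakery")
currentInput("my_biobakery_run")           # review the template inputs
## biobakeR::biobakery_updateInput(...)    # supply your own inputs (see below)
launchWorkflow("my_biobakery_run")         # submit the workflow to Terra
monitorSubmission("my_biobakery_run")      # check the submission status
listOutput("my_biobakery_run")             # browse the workflow outputs
## getOutput("my_biobakery_run", keyword = "...", dest_dir = tempdir())  # download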
If you use RunTerraWorkflow within Terra’s RStudio (the AnVIL compute cloud), you don’t need any extra authentication or the gcloud SDK. If you intend to use this package from a local machine, the gcloud SDK must be installed and the billing account used by AnVIL must be authenticated with your user account.
Test the installation with `AnVIL::gcloud_exists()`:
## gcloud_exists() should return TRUE
gcloud_exists()
#> [1] TRUE
If it returns `FALSE`, install the gcloud SDK as follows:
devtools::install_github("rstudio/cloudml")
cloudml::gcloud_install()
gcloud auth login
## You can change the project using this script
gcloud config set project PROJECT_ID
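If the SDK is installed, you can also check and switch the active gcloud account and project from within R using AnVIL's gcloud helpers; a short sketch, where the values are placeholders for your own account and project ID:
## Check and, if needed, change the active gcloud account and project from R
gcloud_account()                          # currently active account
gcloud_account("YOUR_EMAIL@gmail.com")    # placeholder: switch accounts
gcloud_project()                          # currently active project
gcloud_project("YOUR_PROJECT_ID")         # placeholder: switch projects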
In this vignette, we run the Whole Metagenome Shotgun (wmgx) analysis workflow using bioBakery tools. You can find the currently available workspaces using the `availableAnalysis()` function; the values under the `analysis` column can be used for the `analysis` argument. For this vignette, we use `"bioBakery"`.
availableAnalysis()
#> analysis
#> 1 bioBakery
#> 2 salmon
#> description
#> 1 Microbiome analysis using bioBakery
#> 2 Trascript quantification from RNAseq using Salmon | Differential gene expression analysis using DESeq2
analysis <- "bioBakery"
First, you should clone the template workspace using the `cloneWorkspace` function. Note that you need to provide a unique name for the cloned workspace through the `workspaceName` argument.
workspaceName <- "biobakery_test"
If you attempt to clone the template workspace using an existing `workspaceName`, you will get the error message below.
cloneWorkspace(workspaceName, analysis)
#> Error : 'avworkspace_clone' failed:
#> Conflict (HTTP 409).
#> Workspace waldronlab-terra-rstudio/biobakery_test already exists
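One way to avoid this name clash is to check which workspaces you can already access before cloning. A short sketch using `AnVIL::avworkspaces()`, assuming the returned table has a `name` column as in current AnVIL releases:
## List accessible workspaces and check whether the candidate name is taken
ws <- avworkspaces()
"biobakery_test" %in% ws$name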
With a unique workspace name, you can successfully clone the workspace, and the function will return the name of the cloned workspace.
cloneWorkspace(workspaceName = "microbiome", analysis)
#> [1] "waldronlab-terra-rstudio/microbiome"
You can review the current inputs using the `currentInput` function. Below are all the required and optional inputs for the workflow.
currentInput(workspaceName)
#> No encoding supplied: defaulting to UTF-8.
#> $workflowMTX.dataType
#> [1] ""
#>
#> $workflowMTX.inputRead1Files
#> [1] "\"gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt\""
#>
#> $workflowMTX.MaxMemGB_QualityControlTasks
#> [1] ""
#>
#> $workflowMTX.inputExtension
#> [1] "\".fastq.gz\""
#>
#> $workflowMTX.versionSpecifichumanDB
#> [1] "\"gs://run_terra_workflow/databases/kneaddata/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz\""
#>
#> $workflowMTX.versionSpecificUtilityMapping
#> [1] "\"gs://run_terra_workflow/databases/humann/full_mapping_v201901.tar.gz\""
#>
#> $workflowMTX.versionSpecificrrnaDB
#> [1] "\"gs://run_terra_workflow/databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz\""
#>
#> $workflowMTX.customQCDB1
#> [1] ""
#>
#> $workflowMTX.inputRead1Identifier
#> [1] "\"_R1\""
#>
#> $workflowMTX.versionSpecifictrancriptDB
#> [1] "\"gs://run_terra_workflow/databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz\""
#>
#> $workflowMTX.preemptibleAttemptsOverride
#> [1] ""
#>
#> $workflowMTX.inputRead2Identifier
#> [1] "\"_R2\""
#>
#> $workflowMTX.ProjectName
#> [1] "\"ibdmdb_test\""
#>
#> $workflowMTX.MaxMemGB_TaxonomicProfileTasks
#> [1] ""
#>
#> $workflowMTX.versionSpecificChocophlan
#> [1] "\"gs://run_terra_workflow/databases/humann/full_chocophlan.v296_201901.tar.gz\""
#>
#> $workflowMTX.customQCDB2
#> [1] ""
#>
#> $workflowMTX.bypassFunctionalProfiling
#> [1] ""
#>
#> $workflowMTX.MaxMemGB_FunctionalProfileTasks
#> [1] ""
#>
#> $workflowMTX.inputMetadataFile
#> [1] "\"gs://run_terra_workflow/IBDMDB/ibdmdb_demo_metadata.txt\""
#>
#> $workflowMTX.versionSpecificUniRef90
#> [1] "\"gs://run_terra_workflow/databases/humann/uniref90_annotated_v201901.tar.gz\""
#>
#> $workflowMTX.AdapterType
#> [1] "\"NexteraPE\""
`biobakeR::biobakery_currentInput` is a variation of the `currentInput` function optimized for the bioBakery workflow. It displays only the two major inputs, `inputListPath` and `inputFilePath`. Both of these should be stored in a Google bucket:

- `inputFilePath` contains the paths to the input fastq files
- `inputListPath` is the txt file listing those paths

In other words, `inputFilePath` is the content of `inputListPath`: you store the fastq files in a bucket and provide their file paths in the txt file.

This vignette uses six fastq files for the test. Using preemptible instances, this demo set costs about $5 to run.
biobakeR::biobakery_currentInput(workspaceName)
#> No encoding supplied: defaulting to UTF-8.
#> $inputListPath
#> [1] "gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt"
#>
#> $inputFilePath
#> [1] "gs://run_terra_workflow/IBDMDB/HSM7J4NY_R1.fastq.gz"
#> [2] "gs://run_terra_workflow/IBDMDB/HSMA33OT_R1.fastq.gz"
#> [3] "gs://run_terra_workflow/IBDMDB/MSM6J2QD_R1.fastq.gz"
#> [4] "gs://run_terra_workflow/IBDMDB/CSM9X23N_R1.fastq.gz"
#> [5] "gs://run_terra_workflow/IBDMDB/HSM6XRQY_R1.fastq.gz"
#> [6] "gs://run_terra_workflow/IBDMDB/HSMA33KE_R1.fastq.gz"
Before launching the workflow, you should provide the correct input information using the `updateInput` function. RunTerraWorkflow doesn’t support this function yet, but a bioBakery-specific version is available through `biobakeR::biobakery_updateInput`. We use the six demo input files and their metadata.
input <- "gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt"
inputMeta <- "gs://run_terra_workflow/IBDMDB/ibdmdb_demo_metadata.txt"
biobakeR::biobakery_updateInput(workspaceName,
                                InputRead1Files = input,
                                InputMetadataFile = inputMeta,
                                AdapterType = "NexteraPE",
                                ProjectName = "ibdmdb_test",
                                InputExtension = ".fastq.gz",
                                InputRead1Identifier = "_R1",
                                InputRead2Identifier = "_R2")
#> No encoding supplied: defaulting to UTF-8.
#> [1] "Input information is succesfully updated."
Once you have cloned the template workspace and updated the input with your own data, you can launch the workflow using the `launchWorkflow` function.
launchWorkflow(workspaceName)
#> No encoding supplied: defaulting to UTF-8.
#> [1] "Workflow is succesfully launched."
You can check the status of your submissions using the `monitorSubmission` function. The last three columns show the submission and result status.
submissions <- monitorSubmission(workspaceName)
submissions
#> # A tibble: 18 × 6
#> submissionId submitter submissionDate status succeeded failed
#> <chr> <chr> <dttm> <chr> <int> <int>
#> 1 9f71497d-45da-4d89-908… shbrief@… 2022-04-12 00:51:21 Submi… 0 0
#> 2 f158e487-a02a-4de6-924… shbrief@… 2022-04-12 00:47:07 Abort… 0 0
#> 3 4ac530ad-11cf-43b8-91a… shbrief@… 2022-04-12 00:43:55 Submi… 0 0
#> 4 96071a85-03be-4fe2-8d8… shbrief@… 2022-04-12 00:42:43 Submi… 0 0
#> 5 83d92e35-4540-4133-957… shbrief@… 2022-04-12 00:23:21 Abort… 0 0
#> 6 65a9e967-cc90-426d-ae1… shbrief@… 2022-04-12 00:22:15 Abort… 0 0
#> 7 0c82b617-b1d1-4078-aa1… shbrief@… 2022-04-11 23:45:43 Abort… 0 0
#> 8 1a47bfc4-f6f0-43e6-828… shbrief@… 2022-04-11 22:54:36 Abort… 1 0
#> 9 32b10ed9-34b1-4c28-825… shbrief@… 2022-04-11 22:30:53 Done 1 0
#> 10 a0da90f5-397c-4c29-853… shbrief@… 2022-04-11 22:29:24 Done 1 0
#> 11 974a415f-b3fa-49c6-aeb… shbrief@… 2022-04-11 22:28:19 Done 1 0
#> 12 56eb9d61-d8bc-4747-92a… shbrief@… 2022-04-05 23:02:34 Abort… 0 0
#> 13 91a4b648-b6d2-4206-b58… shbrief@… 2022-04-05 22:56:13 Abort… 0 0
#> 14 bea606c1-7ebe-4cf9-9f8… shbrief@… 2021-08-05 10:54:03 Abort… 0 0
#> 15 5d5640ff-424e-4ebc-a1f… shbrief@… 2021-08-05 00:45:19 Abort… 0 0
#> 16 d1dfbae4-01eb-4bfe-bba… shbrief@… 2021-07-20 22:08:00 Abort… 0 0
#> 17 d2021393-8296-4fee-9c4… shbrief@… 2021-07-20 13:54:25 Abort… 0 0
#> 18 f08f93da-21b7-44e1-8f3… shbrief@… 2021-07-20 12:17:01 Done 1 0
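Since `monitorSubmission` returns a tibble, you can manipulate it like any other data frame. For example, to keep only the runs that finished successfully (dplyr is attached in this session):
## Keep submissions that completed with at least one succeeded workflow
submissions %>%
    filter(status == "Done", succeeded > 0) %>%
    select(submissionId, submissionDate)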
To abort the most recently submitted job, you don’t need to specify the `submissionId`.
abortSubmission(workspaceName)
#> Status of the submitted job (submissionId: 9f71497d-45da-4d89-9089-dfac2e541c7e)
#> [1] "Workflow is succesfully aborted."
You can check all the output files from the most recently succeeded submission. If you specify the `submissionId` argument, you can get the output files of that specific submission. Output files can also be subset using the `keyword` argument.
listOutput(workspaceName)
#> Outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> # A tibble: 193 × 5
#> file workflow task type path
#> <chr> <chr> <chr> <chr> <chr>
#> 1 humann_ecs_relab_counts.tsv workflowMTX call-CountRe… outp… gs:/…
#> 2 humann_genefamilies_relab_counts.tsv workflowMTX call-CountRe… outp… gs:/…
#> 3 humann_pathabundance_relab_counts.tsv workflowMTX call-CountRe… outp… gs:/…
#> 4 humann_read_and_species_count_table.tsv workflowMTX call-Functio… outp… gs:/…
#> 5 CSM9X23N_bowtie2_unaligned.fa workflowMTX call-Functio… outp… gs:/…
#> 6 CSM9X23N_diamond_unaligned.fa workflowMTX call-Functio… outp… gs:/…
#> 7 CSM9X23N_genefamilies.tsv workflowMTX call-Functio… outp… gs:/…
#> 8 CSM9X23N_pathabundance.tsv workflowMTX call-Functio… outp… gs:/…
#> 9 CSM9X23N_pathcoverage.tsv workflowMTX call-Functio… outp… gs:/…
#> 10 HSM6XRQY_bowtie2_unaligned.fa workflowMTX call-Functio… outp… gs:/…
#> # … with 183 more rows
listOutput(workspaceName, keyword = "humann")
#> Outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> # A tibble: 5 × 5
#> file workflow task type path
#> <chr> <chr> <chr> <chr> <chr>
#> 1 humann_ecs_relab_counts.tsv workflowMTX call-CountRel… outp… gs:/…
#> 2 humann_genefamilies_relab_counts.tsv workflowMTX call-CountRel… outp… gs:/…
#> 3 humann_pathabundance_relab_counts.tsv workflowMTX call-CountRel… outp… gs:/…
#> 4 humann_read_and_species_count_table.tsv workflowMTX call-Function… outp… gs:/…
#> 5 humann_feature_counts.tsv workflowMTX call-JoinFeat… outp… gs:/…
If your outputs include tsv files, you can check the head of those files using the `tableHead` function without downloading them.
tableHead("HSM7J4NY_genefamilies_relab.tsv", n = 6, workspaceName)
#> Below outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> V1
#> 1 UniRef90_B7B7I3
#> 2 UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
#> 3 UniRef90_B7B7I3|unclassified
#> 4 UniRef90_F3QJ09
#> 5 UniRef90_F3QJ09|g__Parasutterella.s__Parasutterella_excrementihominis
#> 6 UniRef90_A0A374V756
#> V2
#> 1 2.25732e-01
#> 2 2.25683e-01
#> 3 4.95507e-05
#> 4 1.39816e-01
#> 5 1.39816e-01
#> 6 3.66683e-02
The `keyword` argument takes a character string containing a regular expression. In the example below, we check all the `.tsv` outputs for the sample "HSM7J4NY".
listOutput(workspaceName, keyword = "HSM7J4NY.*.tsv")
#> Outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> # A tibble: 9 × 5
#> file workflow task type path
#> <chr> <chr> <chr> <chr> <chr>
#> 1 HSM7J4NY_genefamilies.tsv workflowMTX call-FunctionalProfi… outp… gs:/…
#> 2 HSM7J4NY_pathabundance.tsv workflowMTX call-FunctionalProfi… outp… gs:/…
#> 3 HSM7J4NY_pathcoverage.tsv workflowMTX call-FunctionalProfi… outp… gs:/…
#> 4 HSM7J4NY_ecs.tsv workflowMTX call-RegroupECs outp… gs:/…
#> 5 HSM7J4NY_kos.tsv workflowMTX call-RegroupKOs outp… gs:/…
#> 6 HSM7J4NY_ecs_relab.tsv workflowMTX call-RenormTableECs outp… gs:/…
#> 7 HSM7J4NY_genefamilies_relab.tsv workflowMTX call-RenormTableGenes outp… gs:/…
#> 8 HSM7J4NY_pathabundance_relab.tsv workflowMTX call-RenormTablePath… outp… gs:/…
#> 9 HSM7J4NY.tsv workflowMTX call-TaxonomicProfile outp… gs:/…
You can download any file using the `getOutput` function. Here, we narrow down the download to the HSM7J4NY sample’s `.tsv` output files.
## `HSM7J4NY_dir` is the path to a local directory for the downloaded files,
## e.g. HSM7J4NY_dir <- file.path(tempdir(), "HSM7J4NY")
getOutput(workspaceName, keyword = "HSM7J4NY.*.tsv", dest_dir = HSM7J4NY_dir)
list.files(HSM7J4NY_dir)
#> [1] "HSM7J4NY_ecs_relab.tsv" "HSM7J4NY_ecs.tsv"
#> [3] "HSM7J4NY_genefamilies_relab.tsv" "HSM7J4NY_genefamilies.tsv"
#> [5] "HSM7J4NY_kos.tsv" "HSM7J4NY_pathabundance_relab.tsv"
#> [7] "HSM7J4NY_pathabundance.tsv" "HSM7J4NY_pathcoverage.tsv"
#> [9] "HSM7J4NY.tsv"
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] AnVIL_1.7.14 dplyr_1.0.8 RunTerraWorkflow_0.2.0
#> [4] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] bslib_0.3.1 compiler_4.1.2 pillar_1.7.0
#> [4] BiocManager_1.30.16 formatR_1.12 jquerylib_0.1.4
#> [7] futile.logger_1.4.3 futile.options_1.0.1 tools_4.1.2
#> [10] digest_0.6.29 tibble_3.1.6 jsonlite_1.8.0
#> [13] evaluate_0.15 lifecycle_1.0.1 pkgconfig_2.0.3
#> [16] rlang_1.0.2 DBI_1.1.2 cli_3.2.0
#> [19] rstudioapi_0.13 curl_4.3.2 parallel_4.1.2
#> [22] yaml_2.3.5 xfun_0.30 fastmap_1.1.0
#> [25] biobakeR_0.4.0 httr_1.4.2 stringr_1.4.0
#> [28] knitr_1.38 generics_0.1.2 sass_0.4.1
#> [31] vctrs_0.4.0 tidyselect_1.1.2 glue_1.6.2
#> [34] R6_2.5.1 fansi_1.0.3 rmarkdown_2.13
#> [37] bookdown_0.25 tidyr_1.2.0 purrr_0.3.4
#> [40] lambda.r_1.2.4 magrittr_2.0.3 htmltools_0.5.2
#> [43] ellipsis_0.3.2 rapiclient_0.1.3 assertthat_0.2.1
#> [46] utf8_1.2.2 stringi_1.7.6 crayon_1.5.1