1 Overview

For R users with limited computing resources, we introduce the RunTerraWorkflow package. This package lets users run workflows implemented in Terra, a cloud-based genomics platform, without writing any workflow, installing software, or managing cloud resources. Terra’s computing resources rely on Google Cloud Platform (GCP). To use RunTerraWorkflow, you only need to set up a Terra account once at the beginning.

1.1 Load packages

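## Install the packages used in this vignette if they are missing
## (the GitHub installs require the devtools package)
if (!"RunTerraWorkflow" %in% rownames(installed.packages()))
    devtools::install_github("shbrief/RunTerraWorkflow")
if (!"biobakeR" %in% rownames(installed.packages()))
    devtools::install_github("shbrief/biobakeR")
if (!"AnVIL" %in% rownames(installed.packages()))
    BiocManager::install("AnVIL")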

library(RunTerraWorkflow)
library(AnVIL)
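
As an alternative to scanning installed.packages(), base R's requireNamespace() can report which packages are missing. The sketch below is only an illustration; the helper name missingPackages is hypothetical and not part of RunTerraWorkflow.

```r
# Hypothetical helper (not part of RunTerraWorkflow): return the packages
# from `pkgs` whose namespace cannot be loaded from the current R library.
# Note that requireNamespace() loads (but does not attach) each package.
missingPackages <- function(pkgs) {
    ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!ok]
}

# Packages used in this vignette
needed <- c("RunTerraWorkflow", "biobakeR", "AnVIL")
missingPackages(needed)
```

An empty character vector means everything needed is already installed.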

1.2 Create Terra account

You need a Terra account. Once you have one, you need two pieces of information from it to use the RunTerraWorkflow package: 1) the email address linked to your Terra account and 2) your billing project name. These are the basic input parameters used in this vignette. Modify them with YOUR account information!

accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"

setCloudEnv(accountEmail = accountEmail, 
            billingProjectName = billingProjectName)
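
Because the placeholders above are easy to leave unchanged, a small pre-flight check can catch that before any call to Terra is made. This is a minimal sketch; checkAccountInfo is a hypothetical helper, not a RunTerraWorkflow function, and the email test is deliberately loose.

```r
# Hypothetical pre-flight check (not part of RunTerraWorkflow): stop early if
# the placeholders were not replaced or the email is obviously malformed.
checkAccountInfo <- function(accountEmail, billingProjectName) {
    stopifnot(is.character(accountEmail), length(accountEmail) == 1,
              is.character(billingProjectName), length(billingProjectName) == 1)
    if (grepl("^YOUR_", accountEmail) || grepl("^YOUR_", billingProjectName))
        stop("Replace the YOUR_* placeholders with your own account information.")
    if (!grepl("^[^@[:space:]]+@[^@[:space:]]+$", accountEmail))
        stop("accountEmail does not look like an email address: ", accountEmail)
    invisible(TRUE)
}

# Example with made-up values
checkAccountInfo("someone@example.org", "my-billing-project")
```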

1.3 Major steps

The table below lists the major functions for the three workflow steps: prepare, run, and check the result.

Steps    Functions          Description
Prepare  cloneWorkspace     copy the template workspace
         updateInput        take user’s inputs
Run      launchWorkflow     launch the workflow in Terra
         abortSubmission    abort the submission
Result   monitorSubmission  monitor the status of your workflow run
         listOutput         display the list of your workflow outputs
         getOutput          download your output

1.4 Google Cloud SDK

If you use RunTerraWorkflow within Terra’s RStudio, you need neither extra authentication nor the gcloud SDK; both requirements are already satisfied in the AnVIL compute cloud. To use this package from a local machine, you must install the gcloud SDK, and the billing account used by AnVIL must be authenticated for your user.

Test the installation with AnVIL::gcloud_exists()

## gcloud_exists() should return TRUE
gcloud_exists()
#> [1] TRUE
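
If AnVIL is not installed yet, a rough base-R check is to look for the gcloud executable on the PATH. This sketch only inspects the PATH of the current R session and is not necessarily how AnVIL::gcloud_exists() is implemented; the function name gcloudOnPath is hypothetical.

```r
# Base-R check for the gcloud executable. Sys.which() returns "" when the
# program is not found on the PATH of the current R session.
gcloudOnPath <- function() {
    path <- unname(Sys.which("gcloud"))
    nzchar(path)
}

if (gcloudOnPath()) {
    message("gcloud found at: ", Sys.which("gcloud"))
} else {
    message("gcloud not found on PATH; install the gcloud SDK first.")
}
```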

If it returns FALSE, install the gcloud SDK from R and then authenticate from a terminal:

## In R: install the gcloud SDK
devtools::install_github("rstudio/cloudml")
cloudml::gcloud_install()

## In a terminal (shell commands, not R): authenticate
gcloud auth login

## In a terminal: change the project if needed
gcloud config set project PROJECT_ID
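
If you prefer to stay in R, the gcloud commands can be issued with base R's system2(). The wrapper below is a hedged sketch: gcloudCommand is a hypothetical helper, and by default it only returns the command string (a dry run), so nothing is executed until you set dryRun = FALSE.

```r
# Hypothetical wrapper: build (and optionally run) a gcloud command from R.
# With dryRun = TRUE the command is only returned as a string.
gcloudCommand <- function(args, dryRun = TRUE) {
    cmd <- paste("gcloud", paste(args, collapse = " "))
    if (dryRun) return(cmd)
    if (!nzchar(Sys.which("gcloud")))
        stop("gcloud is not on the PATH of this R session.")
    system2("gcloud", args)
}

gcloudCommand(c("auth", "login"))
gcloudCommand(c("config", "set", "project", "PROJECT_ID"))
```

Note that gcloud auth login opens a browser for authentication, so running it from a non-interactive R session may not work.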

1.5 Example in this vignette: Microbiome analysis

In this vignette, we run the whole metagenome shotgun (wmgx) analysis workflow using bioBakery tools. You can list the currently available workspaces with the availableAnalysis() function; the values in the analysis column can be passed to the analysis argument. For this vignette, we use "bioBakery".

availableAnalysis()
#>    analysis
#> 1 bioBakery
#> 2    salmon
#>                                                                                              description
#> 1                                                                    Microbiome analysis using bioBakery
#> 2 Trascript quantification from RNAseq using Salmon | Differential gene expression analysis using DESeq2

analysis <- "bioBakery"
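
Because the analysis name is case-sensitive ("bioBakery", not "biobakery"), it can help to validate it against the analysis column before cloning. The sketch below uses a hand-typed mock of that column and assumes availableAnalysis() returns a data frame, as the printed output suggests; checkAnalysis is a hypothetical helper.

```r
# Mock of the table printed above (analysis column only); in a live session
# use the data frame returned by availableAnalysis() instead.
available <- data.frame(analysis = c("bioBakery", "salmon"))

# Hypothetical helper: fail early on an unknown or misspelled analysis name.
checkAnalysis <- function(analysis, available) {
    if (!analysis %in% available$analysis)
        stop("Unknown analysis '", analysis, "'. Choose one of: ",
             paste(available$analysis, collapse = ", "))
    analysis
}

analysis <- checkAnalysis("bioBakery", available)
```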

2 Setup

2.1 Clone Workspace

First, clone the template workspace using the cloneWorkspace function. Note that you need to provide a unique name for the cloned workspace through the workspaceName argument.

workspaceName <- "biobakery_test"

If you attempt to clone the template workspace using a workspace name that already exists, you will get the error message below.

cloneWorkspace(workspaceName, analysis)
#> Error : 'avworkspace_clone' failed:
#>   Conflict (HTTP 409).
#>   Workspace waldronlab-terra-rstudio/biobakery_test already exists

With a unique workspace name, the workspace is cloned successfully and the function returns the name of the cloned workspace.

cloneWorkspace(workspaceName = "microbiome", analysis)
#> [1] "waldronlab-terra-rstudio/microbiome"
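
One way to avoid the HTTP 409 conflict is to derive a name that is not already taken before calling cloneWorkspace. The sketch below uses base R's make.unique(); uniqueWorkspaceName is a hypothetical helper, and the vector of existing names is typed in by hand here.

```r
# Hypothetical helper: derive a workspace name that is not in `existing`.
# make.unique() appends a numeric suffix only when the name is already taken.
uniqueWorkspaceName <- function(name, existing) {
    candidates <- make.unique(c(existing, name), sep = "_")
    candidates[length(candidates)]
}

# `existing` is typed in by hand here; in a live session it could come from
# the `name` column of AnVIL::avworkspaces() (assumption).
existing <- c("biobakery_test")
uniqueWorkspaceName("biobakery_test", existing)
uniqueWorkspaceName("microbiome", existing)
```

The returned name can then be passed as the workspaceName argument of cloneWorkspace.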

2.2 Prepare Input

2.2.1 Current input

You can review the current inputs using the currentInput function. The output below shows all the required and optional inputs for the workflow.

currentInput(workspaceName)
#> No encoding supplied: defaulting to UTF-8.
#> $workflowMTX.dataType
#> [1] ""
#> 
#> $workflowMTX.inputRead1Files
#> [1] "\"gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt\""
#> 
#> $workflowMTX.MaxMemGB_QualityControlTasks
#> [1] ""
#> 
#> $workflowMTX.inputExtension
#> [1] "\".fastq.gz\""
#> 
#> $workflowMTX.versionSpecifichumanDB
#> [1] "\"gs://run_terra_workflow/databases/kneaddata/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz\""
#> 
#> $workflowMTX.versionSpecificUtilityMapping
#> [1] "\"gs://run_terra_workflow/databases/humann/full_mapping_v201901.tar.gz\""
#> 
#> $workflowMTX.versionSpecificrrnaDB
#> [1] "\"gs://run_terra_workflow/databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz\""
#> 
#> $workflowMTX.customQCDB1
#> [1] ""
#> 
#> $workflowMTX.inputRead1Identifier
#> [1] "\"_R1\""
#> 
#> $workflowMTX.versionSpecifictrancriptDB
#> [1] "\"gs://run_terra_workflow/databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz\""
#> 
#> $workflowMTX.preemptibleAttemptsOverride
#> [1] ""
#> 
#> $workflowMTX.inputRead2Identifier
#> [1] "\"_R2\""
#> 
#> $workflowMTX.ProjectName
#> [1] "\"ibdmdb_test\""
#> 
#> $workflowMTX.MaxMemGB_TaxonomicProfileTasks
#> [1] ""
#> 
#> $workflowMTX.versionSpecificChocophlan
#> [1] "\"gs://run_terra_workflow/databases/humann/full_chocophlan.v296_201901.tar.gz\""
#> 
#> $workflowMTX.customQCDB2
#> [1] ""
#> 
#> $workflowMTX.bypassFunctionalProfiling
#> [1] ""
#> 
#> $workflowMTX.MaxMemGB_FunctionalProfileTasks
#> [1] ""
#> 
#> $workflowMTX.inputMetadataFile
#> [1] "\"gs://run_terra_workflow/IBDMDB/ibdmdb_demo_metadata.txt\""
#> 
#> $workflowMTX.versionSpecificUniRef90
#> [1] "\"gs://run_terra_workflow/databases/humann/uniref90_annotated_v201901.tar.gz\""
#> 
#> $workflowMTX.AdapterType
#> [1] "\"NexteraPE\""
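
The values printed above carry embedded double quotes (for example "\"_R1\""), and unset inputs appear as empty strings. The following sketch shows one way to tidy such a list for inspection; it works on a small hand-copied mock rather than a live currentInput() result, and stripQuotes is a hypothetical helper.

```r
# Mock of three entries from the list printed above.
inputs <- list(
    workflowMTX.dataType = "",
    workflowMTX.inputExtension = "\".fastq.gz\"",
    workflowMTX.inputRead1Identifier = "\"_R1\""
)

# Strip the embedded double quotes so values can be compared as plain strings.
stripQuotes <- function(x) gsub("^\"|\"$", "", x)
cleaned <- lapply(inputs, stripQuotes)

# Names of the inputs that are currently empty.
names(cleaned)[!nzchar(unlist(cleaned))]
```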


biobakeR::biobakery_currentInput is a variation of the currentInput function optimized for the bioBakery workflow. It displays only the two major inputs, inputListPath and inputFilePath. Both kinds of files should be stored in a Google bucket.

  1. inputFilePath: the paths to the input fastq files
  2. inputListPath: the path to a txt file that lists those fastq paths

In other words, inputFilePath is the content of the file at inputListPath: you save the fastq files in the bucket and provide their paths in the txt file.

This vignette uses six fastq files for the test. Using preemptible instances, this demo set will cost about $5 to run.

biobakeR::biobakery_currentInput(workspaceName)
#> No encoding supplied: defaulting to UTF-8.
#> $inputListPath
#> [1] "gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt"
#> 
#> $inputFilePath
#> [1] "gs://run_terra_workflow/IBDMDB/HSM7J4NY_R1.fastq.gz"
#> [2] "gs://run_terra_workflow/IBDMDB/HSMA33OT_R1.fastq.gz"
#> [3] "gs://run_terra_workflow/IBDMDB/MSM6J2QD_R1.fastq.gz"
#> [4] "gs://run_terra_workflow/IBDMDB/CSM9X23N_R1.fastq.gz"
#> [5] "gs://run_terra_workflow/IBDMDB/HSM6XRQY_R1.fastq.gz"
#> [6] "gs://run_terra_workflow/IBDMDB/HSMA33KE_R1.fastq.gz"
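
To prepare your own inputListPath file, you can write the fastq paths to a text file and upload it to your bucket. The sketch below assumes the list holds one gs:// path per line, as the output above suggests; the file name my_file_list.txt and the bucket YOUR_BUCKET are hypothetical, and the upload step is shown only as a comment.

```r
# R1 fastq paths to analyze (two of the demo files listed above).
fastqPaths <- c(
    "gs://run_terra_workflow/IBDMDB/HSM7J4NY_R1.fastq.gz",
    "gs://run_terra_workflow/IBDMDB/HSMA33OT_R1.fastq.gz"
)

# Write one path per line to a local text file and read it back to check.
listFile <- file.path(tempdir(), "my_file_list.txt")
writeLines(fastqPaths, listFile)
readLines(listFile)

# Upload step, not run here. The bucket name is hypothetical:
# AnVIL::gsutil_cp(listFile, "gs://YOUR_BUCKET/my_file_list.txt")
```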

2.2.2 Update input

Before launching the workflow, you should provide the correct input information using the updateInput function. RunTerraWorkflow doesn’t support this function yet, but a bioBakery-specific version is available as biobakeR::biobakery_updateInput.

We use the six demo input files and their metadata.

input <- "gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt"
inputMeta <- "gs://run_terra_workflow/IBDMDB/ibdmdb_demo_metadata.txt"

biobakeR::biobakery_updateInput(workspaceName,
                                InputRead1Files = input,
                                InputMetadataFile = inputMeta,
                                AdapterType = "NexteraPE",
                                ProjectName = "ibdmdb_test",
                                InputExtension = ".fastq.gz",
                                InputRead1Identifier = "_R1",
                                InputRead2Identifier = "_R2")
#> No encoding supplied: defaulting to UTF-8.
#> [1] "Input information is succesfully updated."
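
Since the input files must live in a Google bucket, it can be worth checking that each path has the gs://bucket/object form before updating the inputs. This is a minimal sketch; isGsPath is a hypothetical helper and only checks the shape of the string, not whether the object exists.

```r
# Hypothetical check: TRUE for strings of the form gs://bucket/object
isGsPath <- function(x) grepl("^gs://[^/]+/.+", x)

input <- "gs://run_terra_workflow/IBDMDB/ibdmdb_file_list.txt"
inputMeta <- "gs://run_terra_workflow/IBDMDB/ibdmdb_demo_metadata.txt"
stopifnot(isGsPath(input), isGsPath(inputMeta))

isGsPath("~/Downloads/ibdmdb_file_list.txt")  # a local path is rejected
```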

3 Run bioBakery workflow

Once you have cloned the template workspace and updated the input with your own data, you can launch the workflow using the launchWorkflow function.

launchWorkflow(workspaceName)
#> No encoding supplied: defaulting to UTF-8.
#> [1] "Workflow is succesfully launched."

3.1 Monitor Progress

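The monitorSubmission function lists the submissions made in the workspace. The last three columns (status, succeeded, and failed) show the submission status and the result.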

submissions <- monitorSubmission(workspaceName)
submissions
#> # A tibble: 18 × 6
#>    submissionId            submitter submissionDate      status succeeded failed
#>    <chr>                   <chr>     <dttm>              <chr>      <int>  <int>
#>  1 9f71497d-45da-4d89-908… shbrief@… 2022-04-12 00:51:21 Submi…         0      0
#>  2 f158e487-a02a-4de6-924… shbrief@… 2022-04-12 00:47:07 Abort…         0      0
#>  3 4ac530ad-11cf-43b8-91a… shbrief@… 2022-04-12 00:43:55 Submi…         0      0
#>  4 96071a85-03be-4fe2-8d8… shbrief@… 2022-04-12 00:42:43 Submi…         0      0
#>  5 83d92e35-4540-4133-957… shbrief@… 2022-04-12 00:23:21 Abort…         0      0
#>  6 65a9e967-cc90-426d-ae1… shbrief@… 2022-04-12 00:22:15 Abort…         0      0
#>  7 0c82b617-b1d1-4078-aa1… shbrief@… 2022-04-11 23:45:43 Abort…         0      0
#>  8 1a47bfc4-f6f0-43e6-828… shbrief@… 2022-04-11 22:54:36 Abort…         1      0
#>  9 32b10ed9-34b1-4c28-825… shbrief@… 2022-04-11 22:30:53 Done           1      0
#> 10 a0da90f5-397c-4c29-853… shbrief@… 2022-04-11 22:29:24 Done           1      0
#> 11 974a415f-b3fa-49c6-aeb… shbrief@… 2022-04-11 22:28:19 Done           1      0
#> 12 56eb9d61-d8bc-4747-92a… shbrief@… 2022-04-05 23:02:34 Abort…         0      0
#> 13 91a4b648-b6d2-4206-b58… shbrief@… 2022-04-05 22:56:13 Abort…         0      0
#> 14 bea606c1-7ebe-4cf9-9f8… shbrief@… 2021-08-05 10:54:03 Abort…         0      0
#> 15 5d5640ff-424e-4ebc-a1f… shbrief@… 2021-08-05 00:45:19 Abort…         0      0
#> 16 d1dfbae4-01eb-4bfe-bba… shbrief@… 2021-07-20 22:08:00 Abort…         0      0
#> 17 d2021393-8296-4fee-9c4… shbrief@… 2021-07-20 13:54:25 Abort…         0      0
#> 18 f08f93da-21b7-44e1-8f3… shbrief@… 2021-07-20 12:17:01 Done           1      0
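
The returned table can be filtered like any data frame. As an illustration, the sketch below picks the most recent submission with at least one succeeded workflow. It runs on a small mock: the ids are placeholders, the dates and counts are copied from rows 1, 8, and 9 above, and the UTC time zone is an assumption.

```r
# Mock with the columns used here; ids are placeholders, dates and counts are
# copied from rows 1, 8 and 9 of the table above.
submissions <- data.frame(
    submissionId = c("id-row1", "id-row8", "id-row9"),
    submissionDate = as.POSIXct(c("2022-04-12 00:51:21",
                                  "2022-04-11 22:54:36",
                                  "2022-04-11 22:30:53"), tz = "UTC"),
    succeeded = c(0L, 1L, 1L),
    failed = c(0L, 0L, 0L)
)

# Keep submissions with at least one succeeded workflow, newest first.
ok <- submissions[submissions$succeeded > 0, ]
ok <- ok[order(ok$submissionDate, decreasing = TRUE), ]
ok$submissionId[1]
```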

3.2 Abort submission

To abort the most recently submitted job, you don’t need to specify submissionId.

abortSubmission(workspaceName)
#> Status of the submitted job (submissionId: 9f71497d-45da-4d89-9089-dfac2e541c7e)
#> [1] "Workflow is succesfully aborted."

4 Result

4.1 List Outputs

You can check all the output files from the most recent successful submission using the listOutput function. If you specify the submissionId argument, you get the output files of that specific submission. Output files can also be subset using the keyword argument.

listOutput(workspaceName)
#> Outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> # A tibble: 193 × 5
#>    file                                    workflow    task          type  path 
#>    <chr>                                   <chr>       <chr>         <chr> <chr>
#>  1 humann_ecs_relab_counts.tsv             workflowMTX call-CountRe… outp… gs:/…
#>  2 humann_genefamilies_relab_counts.tsv    workflowMTX call-CountRe… outp… gs:/…
#>  3 humann_pathabundance_relab_counts.tsv   workflowMTX call-CountRe… outp… gs:/…
#>  4 humann_read_and_species_count_table.tsv workflowMTX call-Functio… outp… gs:/…
#>  5 CSM9X23N_bowtie2_unaligned.fa           workflowMTX call-Functio… outp… gs:/…
#>  6 CSM9X23N_diamond_unaligned.fa           workflowMTX call-Functio… outp… gs:/…
#>  7 CSM9X23N_genefamilies.tsv               workflowMTX call-Functio… outp… gs:/…
#>  8 CSM9X23N_pathabundance.tsv              workflowMTX call-Functio… outp… gs:/…
#>  9 CSM9X23N_pathcoverage.tsv               workflowMTX call-Functio… outp… gs:/…
#> 10 HSM6XRQY_bowtie2_unaligned.fa           workflowMTX call-Functio… outp… gs:/…
#> # … with 183 more rows
listOutput(workspaceName, keyword = "humann")
#> Outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> # A tibble: 5 × 5
#>   file                                    workflow    task           type  path 
#>   <chr>                                   <chr>       <chr>          <chr> <chr>
#> 1 humann_ecs_relab_counts.tsv             workflowMTX call-CountRel… outp… gs:/…
#> 2 humann_genefamilies_relab_counts.tsv    workflowMTX call-CountRel… outp… gs:/…
#> 3 humann_pathabundance_relab_counts.tsv   workflowMTX call-CountRel… outp… gs:/…
#> 4 humann_read_and_species_count_table.tsv workflowMTX call-Function… outp… gs:/…
#> 5 humann_feature_counts.tsv               workflowMTX call-JoinFeat… outp… gs:/…

If your outputs include tsv files, you can check the head of those files using the tableHead function without downloading them.

tableHead("HSM7J4NY_genefamilies_relab.tsv", n = 6, workspaceName)   
#> Below outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#>                                                                      V1
#> 1                                                       UniRef90_B7B7I3
#> 2          UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
#> 3                                          UniRef90_B7B7I3|unclassified
#> 4                                                       UniRef90_F3QJ09
#> 5 UniRef90_F3QJ09|g__Parasutterella.s__Parasutterella_excrementihominis
#> 6                                                   UniRef90_A0A374V756
#>            V2
#> 1 2.25732e-01
#> 2 2.25683e-01
#> 3 4.95507e-05
#> 4 1.39816e-01
#> 5 1.39816e-01
#> 6 3.66683e-02
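
For a tsv file that is already on your disk, a similar preview is possible with base R alone. The sketch below is a local stand-in, not a description of how tableHead works internally; the file name is hypothetical and the three rows are copied from the output above.

```r
# Small stand-in file with three rows copied from the output above.
tsvFile <- file.path(tempdir(), "example_genefamilies_relab.tsv")
writeLines(c("UniRef90_B7B7I3\t2.25732e-01",
             "UniRef90_B7B7I3|unclassified\t4.95507e-05",
             "UniRef90_F3QJ09\t1.39816e-01"), tsvFile)

# Read only the first n rows; quote = "" keeps feature names intact.
n <- 2
head2 <- read.table(tsvFile, sep = "\t", header = FALSE, nrows = n,
                    quote = "", stringsAsFactors = FALSE)
head2
```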

4.2 Get Outputs

The keyword argument takes a character string containing a regular expression. In the example below, we check all the .tsv outputs of the sample “HSM7J4NY”.

listOutput(workspaceName, keyword = "HSM7J4NY.*.tsv")
#> Outputs are from the submissionId 1a47bfc4-f6f0-43e6-8289-1947fd2caf29
#> # A tibble: 9 × 5
#>   file                             workflow    task                  type  path 
#>   <chr>                            <chr>       <chr>                 <chr> <chr>
#> 1 HSM7J4NY_genefamilies.tsv        workflowMTX call-FunctionalProfi… outp… gs:/…
#> 2 HSM7J4NY_pathabundance.tsv       workflowMTX call-FunctionalProfi… outp… gs:/…
#> 3 HSM7J4NY_pathcoverage.tsv        workflowMTX call-FunctionalProfi… outp… gs:/…
#> 4 HSM7J4NY_ecs.tsv                 workflowMTX call-RegroupECs       outp… gs:/…
#> 5 HSM7J4NY_kos.tsv                 workflowMTX call-RegroupKOs       outp… gs:/…
#> 6 HSM7J4NY_ecs_relab.tsv           workflowMTX call-RenormTableECs   outp… gs:/…
#> 7 HSM7J4NY_genefamilies_relab.tsv  workflowMTX call-RenormTableGenes outp… gs:/…
#> 8 HSM7J4NY_pathabundance_relab.tsv workflowMTX call-RenormTablePath… outp… gs:/…
#> 9 HSM7J4NY.tsv                     workflowMTX call-TaxonomicProfile outp… gs:/…
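
In a regular expression an unescaped "." matches any single character, so the pattern above is slightly more permissive than it looks. The sketch below compares it with a stricter variant using base R's grepl(); whether listOutput applies the keyword in exactly this way is an assumption, and the last file name is made up for the comparison.

```r
files <- c("HSM7J4NY_genefamilies.tsv", "HSM7J4NY.tsv",
           "CSM9X23N_genefamilies.tsv",
           "HSM7J4NY_notes_tsv.txt")   # hypothetical file name

# Pattern used in this vignette: "." matches any single character
loose <- grepl("HSM7J4NY.*.tsv", files)
files[loose]

# Stricter variant: literal ".tsv" at the end of the name
strict <- grepl("^HSM7J4NY.*\\.tsv$", files)
files[strict]
```

For the outputs listed in this vignette both patterns select the same nine files; the stricter form only matters when other file names contain "tsv" elsewhere.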

You can download any file using the getOutput function. Here, we narrow down the download to the .tsv output files of the HSM7J4NY sample.

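## Destination directory for the download (assumed here; any writable directory works)
HSM7J4NY_dir <- file.path(tempdir(), "HSM7J4NY")
dir.create(HSM7J4NY_dir, showWarnings = FALSE)
getOutput(workspaceName, keyword = "HSM7J4NY.*.tsv", dest_dir = HSM7J4NY_dir)
list.files(HSM7J4NY_dir)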
#> [1] "HSM7J4NY_ecs_relab.tsv"           "HSM7J4NY_ecs.tsv"                
#> [3] "HSM7J4NY_genefamilies_relab.tsv"  "HSM7J4NY_genefamilies.tsv"       
#> [5] "HSM7J4NY_kos.tsv"                 "HSM7J4NY_pathabundance_relab.tsv"
#> [7] "HSM7J4NY_pathabundance.tsv"       "HSM7J4NY_pathcoverage.tsv"       
#> [9] "HSM7J4NY.tsv"
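
Once the files are downloaded, they can be read into R in one pass. The sketch below uses a stand-in directory with two tiny made-up tsv files (the feature names and values are placeholders, not real results) and collects every .tsv file into a list named after the file.

```r
# Stand-in for a download directory with two tiny TSV files.
outDir <- file.path(tempdir(), "HSM7J4NY_example")
dir.create(outDir, showWarnings = FALSE)
writeLines("featureA\t0.5", file.path(outDir, "HSM7J4NY_ecs.tsv"))
writeLines("featureB\t0.25", file.path(outDir, "HSM7J4NY_kos.tsv"))

# Read every .tsv file into a list named after the file (without extension).
tsvFiles <- list.files(outDir, pattern = "\\.tsv$", full.names = TRUE)
tables <- lapply(tsvFiles, read.table, sep = "\t", header = FALSE,
                 quote = "", stringsAsFactors = FALSE)
names(tables) <- tools::file_path_sans_ext(basename(tsvFiles))
names(tables)
```

With real downloads, replace outDir with the directory passed as dest_dir to getOutput.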

5 Session Info

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] AnVIL_1.7.14           dplyr_1.0.8            RunTerraWorkflow_0.2.0
#> [4] BiocStyle_2.22.0      
#> 
#> loaded via a namespace (and not attached):
#>  [1] bslib_0.3.1          compiler_4.1.2       pillar_1.7.0        
#>  [4] BiocManager_1.30.16  formatR_1.12         jquerylib_0.1.4     
#>  [7] futile.logger_1.4.3  futile.options_1.0.1 tools_4.1.2         
#> [10] digest_0.6.29        tibble_3.1.6         jsonlite_1.8.0      
#> [13] evaluate_0.15        lifecycle_1.0.1      pkgconfig_2.0.3     
#> [16] rlang_1.0.2          DBI_1.1.2            cli_3.2.0           
#> [19] rstudioapi_0.13      curl_4.3.2           parallel_4.1.2      
#> [22] yaml_2.3.5           xfun_0.30            fastmap_1.1.0       
#> [25] biobakeR_0.4.0       httr_1.4.2           stringr_1.4.0       
#> [28] knitr_1.38           generics_0.1.2       sass_0.4.1          
#> [31] vctrs_0.4.0          tidyselect_1.1.2     glue_1.6.2          
#> [34] R6_2.5.1             fansi_1.0.3          rmarkdown_2.13      
#> [37] bookdown_0.25        tidyr_1.2.0          purrr_0.3.4         
#> [40] lambda.r_1.2.4       magrittr_2.0.3       htmltools_0.5.2     
#> [43] ellipsis_0.3.2       rapiclient_0.1.3     assertthat_0.2.1    
#> [46] utf8_1.2.2           stringi_1.7.6        crayon_1.5.1