RunTerraWorkflow: A ‘runnable’ workflow package for Terra-implemented analysis pipelines using Google Cloud resources

Sehyun Oh1, Martin Morgan2, Levi Waldron1


1 Graduate School of Public Health and Health Policy, City University of New York, New York, NY
2 Roswell Park Cancer Institute, Buffalo, NY

Overview

Background

For R users with limited computing resources, we introduce the RunTerraWorkflow package. Terra is a cloud-based genomics platform whose computing resources rely on Google Cloud Platform (GCP). This package is a convenient wrapper around the AnVIL package that allows users to run the workflows implemented in Terra. With it, users can run computationally demanding, non-R-based workflows from a local R session without writing any workflows, installing software, or managing cloud resources. To use RunTerraWorkflow, you only need to set up a Terra account once at the beginning.

Major steps

Running a workflow involves three major steps: prepare, run, and check results. The table below lists the major functions for each step.

Table 1: Major functions to run Terra workflow

Steps     Functions           Description
Prepare   cloneWorkspace      Copy the template workspace
          updateInput         Update the workflow inputs
Run       launchWorkflow      Launch the workflow in Terra
          monitorSubmission   Monitor the status of your workflow run
          abortSubmission     Abort the submission
Result    listOutput          Display the list of your workflow outputs
          getOutput           Download your output

Create Terra account

You need to set up a Terra account. Once you have your own Terra account, you need two pieces of information from it to use the RunTerraWorkflow package:

  1. The email address linked to your Terra account
  2. Your billing project name

accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"

Prepare Inputs

Currently, RunTerraWorkflow supports two workflows: bioBakery and salmon (transcript quantification). In this vignette, we run bioBakery’s Whole Metagenome Shotgun (wmgx) analysis workflow.

Clone Workspace

First, clone the template workspace using the cloneWorkspace function. Note that you need to provide a unique name for the cloned workspace through the workspaceName argument. A successful cloning returns the name of the cloned workspace.

workspaceName <- "microbiome"  # reused by the later function calls
cloneWorkspace(workspaceName = workspaceName, analysis = "bioBakery")
#> [1] "waldronlab-terra-rstudio/microbiome"
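
Because the workspace name must be unique, one option is to append a timestamp to a base name. The helper below is only a sketch: makeWorkspaceName and its arguments are hypothetical names and are not part of RunTerraWorkflow.

```r
# Hypothetical helper (not part of RunTerraWorkflow): build a workspace
# name that is unlikely to collide by appending a UTC timestamp.
makeWorkspaceName <- function(base, time = Sys.time()) {
    stamp <- format(time, "%Y%m%d-%H%M%S", tz = "UTC")
    paste(base, stamp, sep = "-")
}

uniqueName <- makeWorkspaceName("microbiome")
uniqueName
```

The resulting string could then be passed as the workspaceName argument of cloneWorkspace.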

Prepare Input

The main input is a text file listing the paths to the input fastq files. These input files should be stored in a Google Cloud Storage bucket. Below is the content of an example input file.

$ cat ibdmdb_file_list.txt 
gs://run_terra_workflow/IBDMDB/HSM7J4NY_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSMA33OT_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/MSM6J2QD_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/CSM9X23N_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSM6XRQY_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSMA33KE_R1.fastq.gz
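
A file list like the one above can be assembled in R. The following is a minimal sketch that assumes every sample follows the <sample>_R1.fastq.gz naming pattern seen in the example; the variable names are illustrative, and copying the resulting file to a bucket is left out.

```r
# Assumed inputs for illustration: the bucket prefix and three sample IDs
# taken from the example listing above.
bucket <- "gs://run_terra_workflow/IBDMDB"
samples <- c("HSM7J4NY", "HSMA33OT", "MSM6J2QD")

# One gs:// path per line, following the <sample>_R1.fastq.gz pattern
paths <- paste0(bucket, "/", samples, "_R1.fastq.gz")

# Write the list to a local text file; this file still has to be copied
# to a Google Cloud bucket before it can serve as workflow input.
listFile <- file.path(tempdir(), "ibdmdb_file_list.txt")
writeLines(paths, listFile)
readLines(listFile)
```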

For the bioBakery workflow, we suggest using the biobakeR package, which provides bioBakery-specific input handling functions. You can update the input file using the biobakeR::biobakery_updateInput function.

input <- "gs://{YOUR_FASTQ_LIST}.txt"
biobakeR::biobakery_updateInput(workspaceName, InputRead1Files = input)

Run bioBakery Workflow

Once you have cloned the template workspace and updated the input with your own data, you can launch the workflow using the launchWorkflow function.

launchWorkflow(workspaceName)
#> [1] "Workflow is succesfully launched."
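
A launched workflow can run for a long time, so it may be convenient to poll its status rather than check by hand. The sketch below is hypothetical: this vignette does not describe what monitorSubmission returns, so waitForStatus accepts any function that returns a single status string, and the status labels are placeholders rather than Terra's actual values.

```r
# Hypothetical polling helper (not part of RunTerraWorkflow).
# `getStatus` is any zero-argument function returning one status string;
# the status labels used here are placeholders.
waitForStatus <- function(getStatus,
                          done = c("Succeeded", "Failed", "Aborted"),
                          interval = 60, maxTries = 100) {
    for (i in seq_len(maxTries)) {
        status <- getStatus()
        if (status %in% done) return(status)
        Sys.sleep(interval)  # wait before the next check
    }
    stop("Gave up after ", maxTries, " checks; last status: ", status)
}

# Stand-in status function for demonstration: reports "Running" twice,
# then "Succeeded".
fakeStatus <- local({
    n <- 0
    function() {
        n <<- n + 1
        if (n < 3) "Running" else "Succeeded"
    }
})

finalStatus <- waitForStatus(fakeStatus, interval = 0)
finalStatus
```

In practice, getStatus would wrap a call that extracts the current status from the monitoring function's result, whatever shape that result takes.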

Workflow Outputs

Key Outputs

The bioBakery workflow has several intermediate outputs and a final zip archive that includes a report of exploratory figures plus compiled data tables. The current version of the bioBakery workflow performs quality control, genus-level taxonomic profiling, and functional profiling.

Access Output Files

You can list all the output files from the most recent successful submission. If you specify the submissionId argument, you get the output files of that specific submission instead. Output files can also be subset using the keyword argument.

res <- listOutput(workspaceName, keyword = "humann")
#> Outputs are from the submissionId 4ac530ad-11cf-43b8-91a0-350bf91a2295
head(res, 3)
#> # A tibble: 3 × 5
#>   file                                  workflow    task             type  path 
#>   <chr>                                 <chr>       <chr>            <chr> <chr>
#> 1 humann_ecs_relab_counts.tsv           workflowMTX call-CountRelab… outp… gs:/…
#> 2 humann_genefamilies_relab_counts.tsv  workflowMTX call-CountRelab… outp… gs:/…
#> 3 humann_pathabundance_relab_counts.tsv workflowMTX call-CountRelab… outp… gs:/…
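
The returned table can be narrowed further with ordinary R subsetting. The sketch below uses a toy data frame standing in for the listOutput result, with only the file and workflow columns filled in from the output shown above, and filters it by a regular expression on the file name.

```r
# Toy stand-in for the table returned by listOutput; the values are copied
# from the example output above, and the remaining columns are omitted.
outputs <- data.frame(
    file = c("humann_ecs_relab_counts.tsv",
             "humann_genefamilies_relab_counts.tsv",
             "humann_pathabundance_relab_counts.tsv"),
    workflow = "workflowMTX",
    stringsAsFactors = FALSE
)

# Keep only the rows whose file name matches a regular expression
pathab <- outputs[grepl("pathabundance", outputs$file), ]
pathab$file
```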

You can download any file using the getOutput function. Here, we download the tsv output files of the HSM7J4NY sample.

getOutput(workspaceName, keyword = "HSM7J4NY.*.tsv", dest_dir = res_dir)
list.files(res_dir)
#> [1] "HSM7J4NY_ecs_relab.tsv"           "HSM7J4NY_ecs.tsv"                
#> [3] "HSM7J4NY_genefamilies_relab.tsv"  "HSM7J4NY_genefamilies.tsv"       
#> [5] "HSM7J4NY_kos.tsv"                 "HSM7J4NY_pathabundance_relab.tsv"
#> [7] "HSM7J4NY_pathabundance.tsv"       "HSM7J4NY_pathcoverage.tsv"       
#> [9] "HSM7J4NY.tsv"
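
Once downloaded, the tab-separated files can be read into R for downstream analysis. The following sketch writes a small toy file first so that it runs on its own; the directory, file name, column names, and values are made up for illustration and are not real workflow results.

```r
# Toy file standing in for a downloaded output; the column names and
# values are made up for illustration.
demoDir <- file.path(tempdir(), "demo_output")
dir.create(demoDir, showWarnings = FALSE)
writeLines(c("feature\tabundance",
             "pathway_A\t0.6",
             "pathway_B\t0.4"),
           file.path(demoDir, "demo_pathabundance_relab.tsv"))

# Read every tsv file in the directory into a named list of data frames
tsvFiles <- list.files(demoDir, pattern = "\\.tsv$", full.names = TRUE)
tables <- lapply(tsvFiles, read.delim, stringsAsFactors = FALSE)
names(tables) <- basename(tsvFiles)

tables[["demo_pathabundance_relab.tsv"]]
```

Pointing list.files at the actual download directory instead of demoDir would load the real output files the same way.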

Summary

  • You can run bioBakery, a Python-based whole metagenome shotgun sequencing analysis workflow, and obtain taxonomic and functional profiles from raw fastq files with the following four R functions: cloneWorkspace, updateInput, launchWorkflow, and getOutput.

  • R users can run established, non-R-based workflows on flexible cloud resources from an R session, without writing a workflow, learning a new language, or managing any software or hardware, which improves user experience and interoperability.

  • The RunTerraWorkflow package can combine batch processing and downstream analysis in a single R vignette, enhancing reproducibility.

Resources

  • For further information on Terra, please check this workshop.
  • Computing credit is available upon request. Please contact the main author.

Reference

McIver, LJ. et al. bioBakery: a meta’omic analysis environment. Bioinformatics (2018).
Schatz, MC. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom (2022).