RunTerraWorkflow: A ‘runnable’ workflow package for Terra-implemented analysis pipelines using Google Cloud resources

Sehyun Oh¹, Martin Morgan², Levi Waldron¹

¹ Graduate School of Public Health and Health Policy, City University of New York, New York, NY
² Roswell Park Cancer Institute, Buffalo, NY

Overview

Background

For R users with limited computing resources, we introduce the RunTerraWorkflow package. Terra is a cloud-based genomics platform whose computing resources rely on Google Cloud Platform (GCP). This package is a convenient wrapper around the AnVIL package that lets users run the workflows implemented in Terra. With this package, users can run computationally demanding, non-R-based workflows from a local R session without writing any workflows, installing software, or managing cloud resources. To use RunTerraWorkflow, you only need to set up a Terra account once at the beginning.

Major steps

Running a workflow involves three major steps: prepare, run, and check results. The table below lists the major functions for each step.

Table 1: Major functions to run Terra workflow
Steps	Functions	Description
Prepare	cloneWorkspace	Copy the template workspace
	updateInput	Update the workflow inputs
Run	launchWorkflow	Launch the workflow in Terra
	monitorSubmission	Monitor the status of your workflow run
	abortSubmission	Abort the submission
Result	listOutput	Display the list of your workflow outputs
	getOutput	Download your output

Create Terra account

You first need to set up a Terra account. Once you have your own Terra account, you need two pieces of information from it to use the RunTerraWorkflow package:

The email address linked to your Terra account
Your billing project name

accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"

Prepare Inputs

Currently, RunTerraWorkflow supports two workflows: bioBakery and salmon (transcriptome quantification). In this vignette, we run bioBakery’s Whole Metagenome Shotgun (wmgx) analysis workflow.

Clone Workspace

First, clone the template workspace using the cloneWorkspace function. Note that you need to provide a unique name for the cloned workspace through the workspaceName argument. A successful cloning returns the name of the cloned workspace.

cloneWorkspace(workspaceName = "microbiome", analysis = "bioBakery")
#> [1] "waldronlab-terra-rstudio/microbiome"
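
Because the workspace name must be unique, one simple option is to append a timestamp to a fixed prefix. The following is a minimal base R sketch under stated assumptions: make_workspace_name is a hypothetical helper that is not part of RunTerraWorkflow, and it assumes that letters, digits, and dashes are acceptable in a workspace name.

```r
# Hypothetical helper (not part of RunTerraWorkflow): build a workspace name
# that is unlikely to collide by appending a UTC timestamp to a prefix.
make_workspace_name <- function(prefix, time = Sys.time()) {
  stamp <- format(time, "%Y%m%d-%H%M%S", tz = "UTC")
  paste(prefix, stamp, sep = "-")
}

# Fixed time point so the example is reproducible
example_time <- as.POSIXct("2022-01-02 03:04:05", tz = "UTC")
example_name <- make_workspace_name("microbiome", example_time)
print(example_name)
```

The resulting string could then be supplied as the workspaceName argument of cloneWorkspace.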

Prepare Input

The main input is a text file listing the paths to the input fastq files. These input files should be stored in a Google Cloud Storage bucket. Below is the content of an example input file.

$ cat ibdmdb_file_list.txt 
gs://run_terra_workflow/IBDMDB/HSM7J4NY_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSMA33OT_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/MSM6J2QD_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/CSM9X23N_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSM6XRQY_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSMA33KE_R1.fastq.gz
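
A file list like the one above can be assembled and sanity-checked in base R before it is uploaded. The sketch below is only an illustration: the bucket prefix and sample IDs are copied from the example listing, the regular expression is an assumed minimal check for gs:// paths ending in .fastq.gz, and the file is written to a temporary local directory. Copying the list file to a bucket is not shown.

```r
# Bucket prefix and sample IDs taken from the example listing above
bucket_prefix <- "gs://run_terra_workflow/IBDMDB"
sample_ids <- c("HSM7J4NY", "HSMA33OT", "MSM6J2QD")

# Build one gs:// path per sample for the R1 reads
fastq_paths <- sprintf("%s/%s_R1.fastq.gz", bucket_prefix, sample_ids)

# Assumed sanity check: gs://<bucket>/<object>.fastq.gz
is_valid <- grepl("^gs://[^/]+/.+\\.fastq\\.gz$", fastq_paths)
stopifnot(all(is_valid))

# Write one path per line to a local file (temporary directory for the demo)
list_file <- file.path(tempdir(), "ibdmdb_file_list.txt")
writeLines(fastq_paths, list_file)
print(readLines(list_file))
```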

For the bioBakery workflow, we suggest using the biobakeR package, which provides bioBakery-specific input handling functions. You can update the input file using the biobakeR::biobakery_updateInput function.

input <- "gs://{YOUR_FASTQ_LIST}.txt"
biobakeR::biobakery_updateInput(workspaceName, InputRead1Files = input)

Run bioBakery Workflow

Once you have cloned the template workspace and updated the input with your own data, you can launch the workflow using the launchWorkflow function.

launchWorkflow(workspaceName)
#> [1] "Workflow is succesfully launched."
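
Cloud workflows can run for hours, so a launched submission is typically checked repeatedly until it finishes. The sketch below illustrates a generic polling loop in base R. It makes several assumptions: get_status is a hypothetical zero-argument function standing in for whatever call reports the submission status (it is not the signature of monitorSubmission), and the status strings "Submitted", "Running", "Succeeded", "Failed", and "Aborted" are assumed labels.

```r
# Poll a status function until it reports a terminal state.
# `get_status` is a hypothetical stand-in, NOT part of RunTerraWorkflow.
wait_for_submission <- function(get_status, interval_sec = 60, max_checks = 100,
                                terminal = c("Succeeded", "Failed", "Aborted")) {
  for (i in seq_len(max_checks)) {
    status <- get_status()
    message(sprintf("check %d: %s", i, status))
    if (status %in% terminal) {
      return(status)
    }
    Sys.sleep(interval_sec)
  }
  "TimedOut"  # gave up before reaching a terminal state
}

# Usage with a stub that mimics a run finishing on the third check
fake_statuses <- c("Submitted", "Running", "Succeeded")
counter <- 0
stub_status <- function() {
  counter <<- counter + 1
  fake_statuses[min(counter, length(fake_statuses))]
}
final_status <- wait_for_submission(stub_status, interval_sec = 0)
```

Capping the number of checks keeps an unattended session from waiting forever if a submission never reaches a terminal state.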

Workflow Outputs

Key Outputs

The bioBakery workflow has several intermediate outputs and a final zip archive that includes a report of exploratory figures plus compiled data tables. The current version of the bioBakery workflow performs quality control, genus-level taxonomic profiling, and functional profiling.

Access Output Files

You can list all the output files from the most recent successful submission. If you specify the submissionId argument, you get the output files of that specific submission. Output files can also be subset using the keyword argument.

res <- listOutput(workspaceName, keyword = "humann")
#> Outputs are from the submissionId 4ac530ad-11cf-43b8-91a0-350bf91a2295
head(res, 3)
#> # A tibble: 3 × 5
#>   file                                  workflow    task             type  path 
#>   <chr>                                 <chr>       <chr>            <chr> <chr>
#> 1 humann_ecs_relab_counts.tsv           workflowMTX call-CountRelab… outp… gs:/…
#> 2 humann_genefamilies_relab_counts.tsv  workflowMTX call-CountRelab… outp… gs:/…
#> 3 humann_pathabundance_relab_counts.tsv workflowMTX call-CountRelab… outp… gs:/…
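
Since the result is a table with one row per output file, ordinary data frame operations apply to it. The sketch below uses a small hand-made data frame with the same column names as the printed tibble. All values in the task, type, and path columns are invented placeholders, because the real values are truncated in the printout above.

```r
# Hypothetical stand-in for the table returned by listOutput.
# task, type, and path values are invented placeholders.
res_example <- data.frame(
  file = c("humann_ecs_relab_counts.tsv",
           "humann_genefamilies_relab_counts.tsv",
           "HSM7J4NY_ecs.tsv"),
  workflow = "workflowMTX",
  task = c("task-a", "task-a", "task-b"),
  type = "placeholder",
  path = c("gs://example-bucket/a/humann_ecs_relab_counts.tsv",
           "gs://example-bucket/a/humann_genefamilies_relab_counts.tsv",
           "gs://example-bucket/b/HSM7J4NY_ecs.tsv"),
  stringsAsFactors = FALSE
)

# Keep only the relative-abundance count tables and extract their paths
relab <- res_example[grepl("relab_counts", res_example$file), ]
relab_paths <- relab$path

# Count how many output files each task produced
files_per_task <- table(res_example$task)
print(relab_paths)
print(files_per_task)
```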

You can download any file using the getOutput function. Here, we download the tsv output files of the HSM7J4NY sample.

getOutput(workspaceName, keyword = "HSM7J4NY.*.tsv", dest_dir = res_dir)

list.files(res_dir)
#> [1] "HSM7J4NY_ecs_relab.tsv"           "HSM7J4NY_ecs.tsv"                
#> [3] "HSM7J4NY_genefamilies_relab.tsv"  "HSM7J4NY_genefamilies.tsv"       
#> [5] "HSM7J4NY_kos.tsv"                 "HSM7J4NY_pathabundance_relab.tsv"
#> [7] "HSM7J4NY_pathabundance.tsv"       "HSM7J4NY_pathcoverage.tsv"       
#> [9] "HSM7J4NY.tsv"
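
The keyword used above looks like a regular expression. Assuming it is matched that way (the source does not state this explicitly), the unescaped dot before tsv matches any character, so the pattern can select more than intended. The base R sketch below compares the original pattern with a stricter, anchored one on a few file names; the last two names are invented for the demonstration.

```r
# Example file names; the last two are invented for the demonstration
output_files <- c("HSM7J4NY_ecs.tsv",
                  "HSM7J4NY.tsv",
                  "HSMA33OT_ecs.tsv",
                  "HSM7J4NY_R1.fastq.gz",
                  "HSM7J4NY_tsv_notes.txt")

# Pattern as written in the text: "." matches any single character
loose <- grep("HSM7J4NY.*.tsv", output_files, value = TRUE)

# Stricter variant: literal dot and anchors at both ends
strict <- grep("^HSM7J4NY.*\\.tsv$", output_files, value = TRUE)

print(loose)
print(strict)
```

The loose pattern also picks up HSM7J4NY_tsv_notes.txt, while the anchored pattern keeps only the two tsv files for that sample.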

Summary

You can run bioBakery, a Python-based whole metagenomic sequencing analysis workflow, and get the taxonomic and functional profiles from raw fastq files with the following four R functions: cloneWorkspace, updateInput, launchWorkflow, and getOutput.
R users can run established, non-R-based workflows on flexible cloud resources from an R session without writing a workflow, learning a new language, or managing any software or hardware, which improves user experience and interoperability.
The RunTerraWorkflow package can combine batch processing and downstream analysis in a single R vignette, enhancing reproducibility.

Resources

For further information on Terra, please check this workshop.
Computing credit is available upon request. Please contact the main author.

References

McIver, LJ. et al. bioBakery: a meta’omic analysis environment. Bioinformatics (2018).
Schatz, MC. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom (2022).