For R users with limited computing resources, we introduce the RunTerraWorkflow package. Terra is a cloud-based genomics platform whose computing resources rely on Google Cloud Platform (GCP). This package is a convenient wrapper around the AnVIL package that allows users to run the workflows implemented in Terra. With this package, users can run computationally demanding, non-R-based workflows from a local R session without writing any workflows, installing software, or managing cloud resources. To use RunTerraWorkflow, you only need to set up a Terra account once at the beginning.
Running a workflow involves three major steps: prepare, run, and check results. The table below lists the main functions for each step.
| Steps | Functions | Description |
|---|---|---|
| Prepare | cloneWorkspace | Copy the template workspace |
| | updateInput | Update the workflow inputs |
| Run | launchWorkflow | Launch the workflow in Terra |
| | monitorSubmission | Monitor the status of your workflow run |
| | abortSubmission | Abort the submission |
| Result | listOutput | Display the list of your workflow outputs |
| | getOutput | Download your output |
You first need to set up a Terra account. Once you have your own Terra account, you need two pieces of information from it to use the RunTerraWorkflow package:
accountEmail <- "YOUR_EMAIL@gmail.com"
billingProjectName <- "YOUR_BILLING_ACCOUNT"

Currently, RunTerraWorkflow supports two workflows: bioBakery and salmon (quantifying transcriptome). In this vignette, we run bioBakery’s Whole Metagenome Shotgun (wmgx) analysis workflow.
First, clone the template workspace using the cloneWorkspace function.
Note that you need to provide a unique name for the cloned workspace
through the workspaceName argument. Successful cloning returns the name of
the cloned workspace.
cloneWorkspace(workspaceName = "microbiome", analysis = "bioBakery")
#> [1] "waldronlab-terra-rstudio/microbiome"

The major input is a text file listing the paths to the input fastq files. These input files should be stored in a Google Cloud Storage bucket. Below is the content of an example input file.
$ cat ibdmdb_file_list.txt
gs://run_terra_workflow/IBDMDB/HSM7J4NY_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSMA33OT_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/MSM6J2QD_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/CSM9X23N_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSM6XRQY_R1.fastq.gz
gs://run_terra_workflow/IBDMDB/HSMA33KE_R1.fastq.gz

For the bioBakery workflow, we suggest using the biobakeR
package, which provides bioBakery-specific input handling functions. You can
update the input file using the biobakeR::biobakery_updateInput function.
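
The list file itself can be written from R before it is uploaded. The following is a minimal sketch in base R, assuming the bucket prefix and the `<sample>_R1.fastq.gz` naming seen in the example listing; the variable names and the temporary output location are illustrative, not part of RunTerraWorkflow or biobakeR.

```r
# Bucket prefix and sample IDs copied from the example listing above;
# replace them with your own (the variable names are illustrative).
bucket_prefix <- "gs://run_terra_workflow/IBDMDB"
sample_ids <- c("HSM7J4NY", "HSMA33OT", "MSM6J2QD")

# One gs:// path per sample, following the <sample>_R1.fastq.gz pattern.
fastq_paths <- file.path(bucket_prefix, paste0(sample_ids, "_R1.fastq.gz"))

# Write one path per line to a local text file.
list_file <- file.path(tempdir(), "ibdmdb_file_list.txt")
writeLines(fastq_paths, list_file)

# Inspect the file that was written.
readLines(list_file)
```

The resulting file still has to be copied to a bucket (for example with `gsutil cp`, outside R). The gs:// path of the uploaded list is what the `input` variable in the next step refers to.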
input <- "gs://{YOUR_FASTQ_LIST}.txt"
biobakeR::biobakery_updateInput(workspaceName, InputRead1Files = input)

Once you have cloned the template workspace and updated the input with your own
data, you can launch the workflow using the launchWorkflow function.
launchWorkflow(workspaceName)
#> [1] "Workflow is succesfully launched."

The bioBakery workflow has several intermediate outputs and a final zip archive that includes a report of exploratory figures plus compiled data tables. The current version of the bioBakery workflow performs quality control, genus-level taxonomic profiling, and functional profiling.
You can check all the output files from the most recent successful submission.
If you specify the submissionId argument, you get the output files of that
specific submission. Output files can also be subset using the keyword argument.
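
To preview what a keyword would select, the pattern can be tried on plain file names first. The sketch below uses base R only and assumes that the keyword is matched as a regular expression against file names; the vignette does not state this explicitly, although the pattern "HSM7J4NY.*.tsv" used later suggests it. The file names come from the outputs shown in this vignette, except for one hypothetical non-tsv name added for contrast.

```r
# File names taken from the outputs shown in this vignette, plus one
# hypothetical name ("HSM7J4NY.log") that should not match a tsv filter.
files <- c("humann_ecs_relab_counts.tsv",
           "HSM7J4NY_ecs.tsv",
           "HSM7J4NY.tsv",
           "HSM7J4NY.log")

# A plain keyword matches anywhere in the file name.
humann_files <- files[grepl("humann", files)]

# In a regular expression "." matches any character, so escape it and
# anchor the end of the name to require a literal ".tsv" suffix.
sample_tsv <- files[grepl("HSM7J4NY.*\\.tsv$", files)]

humann_files
sample_tsv
```

The call below applies the plain keyword "humann" to the actual workflow outputs.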
res <- listOutput(workspaceName, keyword = "humann")
#> Outputs are from the submissionId 4ac530ad-11cf-43b8-91a0-350bf91a2295
head(res, 3)
#> # A tibble: 3 × 5
#> file workflow task type path
#> <chr> <chr> <chr> <chr> <chr>
#> 1 humann_ecs_relab_counts.tsv workflowMTX call-CountRelab… outp… gs:/…
#> 2 humann_genefamilies_relab_counts.tsv workflowMTX call-CountRelab… outp… gs:/…
#> 3 humann_pathabundance_relab_counts.tsv workflowMTX call-CountRelab… outp… gs:/…

You can download any file using the getOutput function. Here, we download the
HSM7J4NY sample’s tsv output files.
getOutput(workspaceName, keyword = "HSM7J4NY.*.tsv", dest_dir = res_dir)
list.files(res_dir)
#> [1] "HSM7J4NY_ecs_relab.tsv" "HSM7J4NY_ecs.tsv"
#> [3] "HSM7J4NY_genefamilies_relab.tsv" "HSM7J4NY_genefamilies.tsv"
#> [5] "HSM7J4NY_kos.tsv" "HSM7J4NY_pathabundance_relab.tsv"
#> [7] "HSM7J4NY_pathabundance.tsv" "HSM7J4NY_pathcoverage.tsv"
#> [9] "HSM7J4NY.tsv"

You can run bioBakery, a Python-based whole metagenome sequencing analysis
workflow, and get the taxonomic and functional profiles from raw fastq files
with the following four R functions: cloneWorkspace, updateInput, launchWorkflow,
and getOutput.
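
These calls can be grouped into two small helpers: one that prepares and submits the run, and one that fetches results once the submission has finished. This is only a sketch that chains the calls as they appear in this vignette. The helper names submit_biobakery and fetch_biobakery, their argument names, and the default keyword are illustrative assumptions, not part of RunTerraWorkflow. Defining the helpers does not contact Terra; the package functions are only invoked when a helper is called.

```r
# Hypothetical helper: prepare the workspace and submit the workflow.
# Mirrors the cloneWorkspace / biobakery_updateInput / launchWorkflow
# calls shown earlier in this vignette.
submit_biobakery <- function(workspaceName, input) {
  # Prepare: copy the template workspace, then point it at the input list.
  RunTerraWorkflow::cloneWorkspace(workspaceName = workspaceName,
                                   analysis = "bioBakery")
  biobakeR::biobakery_updateInput(workspaceName, InputRead1Files = input)
  # Run: submit the workflow to Terra.
  RunTerraWorkflow::launchWorkflow(workspaceName)
}

# Hypothetical helper: download outputs after the submission has succeeded.
# The default keyword (all .tsv files) is an assumption for illustration.
fetch_biobakery <- function(workspaceName, dest_dir, keyword = "\\.tsv$") {
  RunTerraWorkflow::getOutput(workspaceName, keyword = keyword,
                              dest_dir = dest_dir)
  # Return the local file names so they can be passed to downstream analysis.
  list.files(dest_dir)
}
```

Submission and retrieval are kept separate because the workflow runs asynchronously in Terra, so outputs are not available immediately after launch.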
R users can run established, non-R-based workflows on flexible cloud resources from an R session, without writing a workflow, learning a new language, or managing any software or hardware, which improves user experience and interoperability.
The RunTerraWorkflow package can combine batch processing and downstream analysis in a single R vignette, enhancing reproducibility.
McIver, LJ. et al. bioBakery: a meta’omic analysis environment. Bioinformatics (2018).
Schatz, MC. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom (2022).