Contents

1 Overview

bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. bioBakery is built and maintained by Huttenhower lab.

For R users with limited computing resources, we introduce biobakeR workflow package. This package allows users to run Whole Metagenome Shotgun (wmgx) workflow version 3 on Google Cloud without writing any workflow, installing softwares, nor managing cloud resources. The computing resource is managed by Terra, a cloud-based genomics platform, and users only need to setup their account only once at the beginning to use biobakeR.

0. Pre-requisite
You need Terra account setup. The email address linked to Terra account and the billing project name are required to use biobakeR package.

1. Input
1-1. cloneWorkspace function copies the template workspace containing the bioBakery workflow. Workspace is the main building block of Terra, where different resources are delivered through. For biobakeR, the ‘template’ workspace is where default workflow is store and users will have their own copy of it, where they can modify, compute, and save their own results.
1-2. updateInput function takes user’s inputs. There are two required inputs: ProjectName (string) and InputRead1Files (tsv file). You can optionally provide metadata tsv file through InputMetadataFile argument. You can check the current input information using currentInput function.

2. Run workflow
launchWorkflow function launch the bioBakery workflow in Terra. You can abort the submission using abortSubmission function. For now, the input files should be saved in Google bucket.

3. Result
3-1. monitorSubmission function allows you to monitor the status of your workflow run.
3-2. listOutput function displays the list of your workflow outputs.
3-3. tableHead function allows you to check the head of output tsv files.
3-4. getOutput function allows you to download your outputs.


Here are the basic input parameters used in this example.

accountEmail <- "shbrief@gmail.com"
billingProjectName <- "waldronlab-terra-rstudio"
workspaceName <- "mtx_workflow_biobakery_ver3"

2 Setup

2.1 Clone Workspace

First, you should clone the template workspace using cloneWorkspace function. Note that you need to provide a unique name for the cloned workspace through workspaceName argument. “mtx_workflow_biobakery_ver3” already exists, so the below cloning fails.

cloneWorkspace(accountEmail, billingProjectName, workspaceName)
## [1] "Your workspaceName already exists. Try a new one."

With the unique workspace name, you can succesfully clone the workspace.

cloneWorkspace(accountEmail, billingProjectName, workspaceName = "test")
## [1] "Workspace is successfully cloned"

2.2 Format of inputs

input is a table containing the input files’ Google bucket location. This example input list file contains 6 fastq files. AnVIL::gsutil_pipe function allows you to read a google bucket object without downloading it. More detail can be found in AnVIL vignette.

input <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_file_list.txt"
read.table(gsutil_pipe(input), sep  = "\t")
##                                                                         V1
## 1 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM7J4NY_R1.fastq.gz
## 2 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33OT_R1.fastq.gz
## 3 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/MSM6J2QD_R1.fastq.gz
## 4 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/CSM9X23N_R1.fastq.gz
## 5 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM6XRQY_R1.fastq.gz
## 6 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33KE_R1.fastq.gz

inputMeta contains the metadata of the samples and the optional input for biobakeR.

inputMeta <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_demo_metadata.txt"
read.table(gsutil_pipe(inputMeta), sep = "\t", header = TRUE)
##     Sample Sequencing.Type         Site Age
## 1 HSM7J4NY             MTX   Cincinnati  27
## 2 HSMA33OT             MTX   Cincinnati  16
## 3 MSM6J2QD             MTX          MGH  24
## 4 CSM9X23N             MGX Cedars-Sinai  28
## 5 HSM6XRQY             MGX   Cincinnati  27
## 6 HSMA33KE             MGX   Cincinnati  15

2.3 Prepare Input

2.3.1 Current input

You can review the current input data.

currentInput(accountEmail, billingProjectName, workspaceName)
## $inputListPath
## [1] "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_file_list.txt"
## 
## $inputFilePath
## [1] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM7J4NY_R1.fastq.gz"
## [2] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33OT_R1.fastq.gz"
## [3] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/MSM6J2QD_R1.fastq.gz"
## [4] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/CSM9X23N_R1.fastq.gz"
## [5] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM6XRQY_R1.fastq.gz"
## [6] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33KE_R1.fastq.gz"

If you want to check additional input information, you can set inputOnly = FALSE.

config <- currentInput(accountEmail, billingProjectName, workspaceName, inputOnly = FALSE)
names(config)
## [1] "deleted"             "inputs"              "methodConfigVersion"
## [4] "methodRepoMethod"    "name"                "namespace"          
## [7] "outputs"             "prerequisites"

For example, you can check databases used for this version of workflow.

db_ind <- grep("versionSpecific", names(config$input))
config$input[db_ind]
## $workflowMTX.versionSpecifichumanDB
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/kneaddata/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz\""
## 
## $workflowMTX.versionSpecificUtilityMapping
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/humann/full_mapping_v201901.tar.gz\""
## 
## $workflowMTX.versionSpecificrrnaDB
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz\""
## 
## $workflowMTX.versionSpecifictrancriptDB
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz\""
## 
## $workflowMTX.versionSpecificChocophlan
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/humann/full_chocophlan.v296_201901.tar.gz\""
## 
## $workflowMTX.versionSpecificUniRef90
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/humann/uniref90_annotated_v201901.tar.gz\""

2.3.2 Update input

Before launching the workflow, you should provide the correct input information using updateInput function.

## This function is not available yet.
input <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_file_list.txt"
inputMeta <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_demo_metadata.txt"
updateInput(inputPath = input,
            inputMetadataPath = inputMeta,
            billingProjectName, workspaceName)

3 Run bioBakery workflow

Once you cloned the template workspace and updated the input with your own data, you can launch the workflow using launchWorkflow function.

launchWorkflow(accountEmail, billingProjectName, workspaceName)
## [1] "Workflow is succesfully launched."

3.1 Monitor Progress

submissions <- monitorSubmission(accountEmail, billingProjectName, workspaceName)
submissions
## # A tibble: 54 x 6
##    submissionId         submitter    submissionDate      status succeeded failed
##    <chr>                <chr>        <dttm>              <chr>      <int>  <int>
##  1 b0463af3-0f3f-4995-… shbrief@gma… 2020-11-24 10:44:03 Submi…         0      0
##  2 c4fee2eb-3e43-4e3c-… shbrief@gma… 2020-11-18 14:07:17 Abort…         0      0
##  3 11edc1aa-48aa-4e63-… shbrief@gma… 2020-11-18 14:00:32 Abort…         0      0
##  4 e6d2270a-6dc5-4636-… shbrief@gma… 2020-11-12 01:17:04 Abort…         0      0
##  5 48aa0bf9-c095-4729-… shbrief@gma… 2020-11-12 01:13:53 Abort…         0      0
##  6 e9dd3cd7-1cc1-48b6-… shbrief@gma… 2020-11-12 01:04:43 Abort…         0      0
##  7 8461ad86-5ab4-4abc-… shbrief@gma… 2020-11-12 00:56:33 Abort…         0      0
##  8 19c5bd27-2b55-42b1-… shbrief@gma… 2020-11-12 00:52:53 Abort…         0      0
##  9 6947fe70-f22c-47c5-… shbrief@gma… 2020-11-12 00:50:04 Abort…         0      0
## 10 87adddce-5f43-40b0-… shbrief@gma… 2020-11-12 00:48:14 Done           1      0
## # … with 44 more rows

To abort the most recently submitted job:

abortSubmission(accountEmail, billingProjectName, workspaceName,
                submissionId = submissions$submissionId[1])
## [1] "Workflow is succesfully aborted."

4 Result

4.1 List Outputs (need to review)

For the demonstration purpose, we aborted the launched workflow above. This workspace had a successful run with the same input already. To search that, we screened all the previous submission status and found the one with ‘Done’ status.

jobs <- monitorSubmission(accountEmail, billingProjectName, workspaceName,
                          mostRecentOnly = FALSE)
jobs_succeed <- which(jobs$succeeded == 1)
jobs[jobs_succeed,]
## # A tibble: 12 x 6
##    submissionId         submitter    submissionDate      status succeeded failed
##    <chr>                <chr>        <dttm>              <chr>      <int>  <int>
##  1 87adddce-5f43-40b0-… shbrief@gma… 2020-11-12 00:48:14 Done           1      0
##  2 116db295-b769-4235-… shbrief@gma… 2020-11-12 00:46:06 Done           1      0
##  3 75f01bcf-495e-4808-… shbrief@gma… 2020-11-11 13:35:16 Done           1      0
##  4 9de324b2-314c-4858-… shbrief@gma… 2020-11-05 20:34:52 Done           1      0
##  5 b30fa933-2074-4adc-… shbrief@gma… 2020-11-04 22:32:28 Done           1      0
##  6 b2f7f504-a9ed-4c2b-… shbrief@gma… 2020-11-04 22:30:14 Done           1      0
##  7 0ade77db-665d-4259-… shbrief@gma… 2020-11-04 22:26:59 Done           1      0
##  8 522ac46e-c1a9-495e-… shbrief@gma… 2020-11-04 22:25:41 Done           1      0
##  9 573a88a7-3443-47bc-… shbrief@gma… 2020-11-04 22:16:14 Done           1      0
## 10 7ef750c1-f574-4899-… shbrief@gma… 2020-11-01 00:45:20 Abort…         1      0
## 11 9e20eec5-1b6a-46d2-… shbrief@gma… 2020-10-13 01:49:36 Done           1      0
## 12 36502cde-6cbf-4cca-… shbrief@gma… 2020-10-13 01:04:29 Done           1      0

To access the output of the submitted job, you need a submission Id.

submission_id <- jobs[jobs_succeed,]$submissionId[1]
submission_id
## [1] "87adddce-5f43-40b0-a5a1-f7a3e3851d16"

You can check all the output files from a specific submission or you can subset a specific outputs using keyword argument.

listOutput(accountEmail, billingProjectName, workspaceName, submissionId = submission_id)
## # A tibble: 69 x 4
##    file                workflow   task         path                             
##    <chr>               <chr>      <chr>        <chr>                            
##  1 humann_ecs_relab_c… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-…
##  2 humann_genefamilie… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-…
##  3 humann_pathabundan… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-…
##  4 humann_read_and_sp… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
##  5 HSM7J4NY_genefamil… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
##  6 HSM7J4NY_pathabund… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
##  7 HSM7J4NY_pathcover… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
##  8 HSMA33OT_genefamil… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
##  9 HSMA33OT_pathabund… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 10 HSMA33OT_pathcover… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## # … with 59 more rows
listOutput(accountEmail, billingProjectName, workspaceName, submissionId = submission_id, 
           keyword = "humann")
## # A tibble: 5 x 4
##   file                workflow   task         path                              
##   <chr>               <chr>      <chr>        <chr>                             
## 1 humann_ecs_relab_c… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-d…
## 2 humann_genefamilie… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-d…
## 3 humann_pathabundan… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-d…
## 4 humann_read_and_sp… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-d…
## 5 humann_feature_cou… workflowM… call-JoinFe… gs://fc-071d1d53-e186-44ad-8951-d…

Using tableHead function, you can check the head of output .tsv files without dowwnloading them.

tableHead("HSM7J4NY_genefamilies_relab.tsv", n = 6,
          accountEmail, billingProjectName, workspaceName, 
          submissionId = submission_id)   # gene families relative abundance
##                                                                      V1
## 1                                                       UniRef90_B7B7I3
## 2          UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
## 3                                          UniRef90_B7B7I3|unclassified
## 4                                                       UniRef90_F3QJ09
## 5 UniRef90_F3QJ09|g__Parasutterella.s__Parasutterella_excrementihominis
## 6                                                   UniRef90_A0A374V756
##            V2
## 1 2.25732e-01
## 2 2.25683e-01
## 3 4.95507e-05
## 4 1.39816e-01
## 5 1.39816e-01
## 6 3.66683e-02

4.2 Get Outputs

Here, we list the example of major output files from bioBakery workflow.

4.2.1 Sample-specific outputs

keyword argument takes a character string containing a regular expression and as an example here, we check all .tsv output of the sample, “HSM7J4NY”.

listOutput(accountEmail, billingProjectName, workspaceName, submission_id, 
           keyword = "HSM7J4NY.*.tsv")
## # A tibble: 9 x 4
##   file              workflow  task          path                                
##   <chr>             <chr>     <chr>         <chr>                               
## 1 HSM7J4NY_genefam… workflow… call-Functio… gs://fc-071d1d53-e186-44ad-8951-d85…
## 2 HSM7J4NY_pathabu… workflow… call-Functio… gs://fc-071d1d53-e186-44ad-8951-d85…
## 3 HSM7J4NY_pathcov… workflow… call-Functio… gs://fc-071d1d53-e186-44ad-8951-d85…
## 4 HSM7J4NY_ecs.tsv  workflow… call-Regroup… gs://fc-071d1d53-e186-44ad-8951-d85…
## 5 HSM7J4NY_kos.tsv  workflow… call-Regroup… gs://fc-071d1d53-e186-44ad-8951-d85…
## 6 HSM7J4NY_ecs_rel… workflow… call-RenormT… gs://fc-071d1d53-e186-44ad-8951-d85…
## 7 HSM7J4NY_genefam… workflow… call-RenormT… gs://fc-071d1d53-e186-44ad-8951-d85…
## 8 HSM7J4NY_pathabu… workflow… call-RenormT… gs://fc-071d1d53-e186-44ad-8951-d85…
## 9 HSM7J4NY.tsv      workflow… call-Taxonom… gs://fc-071d1d53-e186-44ad-8951-d85…

You can download any file using getOutput function. Here, we narrow down the download to HSM7J4NY sample’s .tsv output files.

HSM7J4NY_dir <- "~/data2/biobakeR/inst/extdata/outputs/HSM7J4NY"
getOutput(accountEmail, billingProjectName, workspaceName, submission_id, 
          keyword = "HSM7J4NY.*.tsv", dest_dir = HSM7J4NY_dir)

4.2.1.1 HSM7J4NY_genefamilies.tsv

##                                                 X..Gene.Family
## 1                                                     UNMAPPED
## 2                                              UniRef90_B7B7I3
## 3 UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
##   HSM7J4NY_Abundance.RPKs
## 1              10963114.0
## 2                583244.9
## 3                583116.8

4.2.1.2 HSM7J4NY_pathabundance.tsv

##                                                      X..Pathway
## 1                                                      UNMAPPED
## 2                                                  UNINTEGRATED
## 3 UNINTEGRATED|g__Parabacteroides.s__Parabacteroides_distasonis
##   HSM7J4NY_Abundance
## 1          2875610.0
## 2           675985.8
## 3           150306.1

4.2.1.3 HSM7J4NY_pathcoverage.tsv

##                                          X..Pathway HSM7J4NY_Coverage
## 1                                          UNMAPPED                 1
## 2                                      UNINTEGRATED                 1
## 3 UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae                 1

4.2.1.4 HSM7J4NY_ecs.tsv

##                                   X..Gene.Family HSM7J4NY_Abundance.RPKs
## 1                                       UNMAPPED             10963114.00
## 2                                      UNGROUPED              2552297.40
## 3 UNGROUPED|g__Alistipes.s__Alistipes_finegoldii                41352.53

4.2.1.5 HSM7J4NY_kos.tsv

##                                   X..Gene.Family HSM7J4NY_Abundance.RPKs
## 1                                       UNMAPPED             10963114.00
## 2                                      UNGROUPED              2544777.64
## 3 UNGROUPED|g__Alistipes.s__Alistipes_finegoldii                41201.35

4.2.1.6 HSM7J4NY_ecs_relab.tsv

##                                 X..Gene.Family HSM7J4NY_Abundance.RPKs
## 1                                      1.1.1.1             0.000435738
## 2 1.1.1.1|g__Alistipes.s__Alistipes_finegoldii             0.000435738
## 3                                    1.1.1.100             0.002691470

4.2.1.7 HSM7J4NY_genefamilies_relab.tsv

##                                                 X..Gene.Family
## 1                                              UniRef90_B7B7I3
## 2 UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
## 3                                 UniRef90_B7B7I3|unclassified
##   HSM7J4NY_Abundance.RPKs
## 1             2.25732e-01
## 2             2.25683e-01
## 3             4.95507e-05

4.2.1.8 HSM7J4NY_pathabundance_relab.tsv

##                                             X..Pathway HSM7J4NY_Abundance
## 1              PWY-1042: glycolysis IV (plant cytosol)          0.0859454
## 2 PWY-1042: glycolysis IV (plant cytosol)|unclassified          0.0224201
## 3     ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)          0.0501175

4.2.1.9 HSM7J4NY.tsv

##                   X.clade_name NCBI_tax_id relative_abundance
## 1                  k__Bacteria           2           73.37069
## 2                   k__Archaea        2157           26.62931
## 3 k__Bacteria|p__Bacteroidetes       2|976           72.77130
##   additional_species
## 1                   
## 2                   
## 3

4.2.2 Output files for visualization

Output files for visualization is saved in ibdmdb_test_visualizations.zip.

dest_dir <- "~/data2/biobakeR/inst/extdata/outputs"
getOutput(accountEmail, billingProjectName, workspaceName, submission_id, 
          keyword = "visualization", dest_dir = dest_dir)
unzip(file.path(dest_dir, "ibdmdb_test_visualizations.zip"), exdir = dest_dir)

Here are the list of files used for visualization.

list.files(data_dir)
##  [1] "humann_feature_counts.tsv"              
##  [2] "humann_read_and_species_count_table.tsv"
##  [3] "kneaddata_read_count_table.tsv"         
##  [4] "metaphlan_taxonomic_profiles.tsv"       
##  [5] "microbial_counts_table.tsv"             
##  [6] "pathabundance_relab.tsv"                
##  [7] "qc_counts_orphans_table.tsv"            
##  [8] "qc_counts_pairs_table.tsv"              
##  [9] "taxa_counts_table.tsv"                  
## [10] "top_average_pathways_names.tsv"

4.2.2.1 humann_feature_counts.tsv

This contains the feature counts (pathways, gene families, ecs) for each sample.

##   X..samples humann_ecs_relab_counts humann_genefamilies_relab_counts
## 1   CSM9X23N                    1049                            88227
## 2   HSM6XRQY                    1150                            64841
## 3   HSM7J4NY                     416                             9002
##   humann_pathabundance_relab_counts
## 1                               209
## 2                               262
## 3                                42

4.2.2.2 humann_read_and_species_count_table.tsv

This contains the counts of reads aligning at each step plus the total number of species identified for each sample.

##   X..samples total.reads total.nucleotide.aligned total.translated.aligned
## 1   HSM6XRQY    17166083                 13505861                 14076343
## 2   HSMA33OT    11189134                  3394815                  4348515
## 3   MSM6J2QD     3858456                   545341                   883737
##   total.species
## 1            27
## 2            26
## 3            10

4.2.2.3 kneaddata_read_count_table.tsv

This contains the read counts (split into pairs and orphans) for each step in the quality control process for each sample.

##     Sample raw.pair1 raw.pair2 trimmed.pair1 trimmed.pair2
## 1 CSM9X23N  10529590  10529590      10529590      10529590
## 2 HSM6XRQY   8655985   8655985       8655985       8655985
## 3 HSM7J4NY   7241429   7241429       7241425       7241425
##   decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.pair1
## 1                                                     10482541
## 2                                                      8622357
## 3                                                      7192958
##   decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.pair2
## 1                                                     10482541
## 2                                                      8622357
## 3                                                      7192958
##   decontaminated.hg37dec_v0.1.pair1 decontaminated.hg37dec_v0.1.pair2
## 1                          10429946                          10429946
## 2                           8573730                           8573730
## 3                           5764053                           5764053
##   decontaminated.human_hg38_refMrna.pair1
## 1                                10482541
## 2                                 8622359
## 3                                 7192959
##   decontaminated.human_hg38_refMrna.pair2
## 1                                10482541
## 2                                 8622359
## 3                                 7192959
##   decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.orphan1
## 1                                                          16559
## 2                                                           9642
## 3                                                          22026
##   decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.orphan2
## 1                                                          16158
## 2                                                           9009
## 3                                                          21765
##   decontaminated.hg37dec_v0.1.orphan1 decontaminated.hg37dec_v0.1.orphan2
## 1                               19029                               18559
## 2                               11519                               11078
## 3                              775171                               46603
##   decontaminated.human_hg38_refMrna.orphan1
## 1                                     16560
## 2                                      9642
## 3                                     22027
##   decontaminated.human_hg38_refMrna.orphan2 final.pair1 final.pair2
## 1                                     16158    10429946    10429946
## 2                                      9009     8573730     8573730
## 3                                     21764     5764053     5764053
##   final.orphan1 final.orphan2
## 1         16546         16129
## 2          9630          8993
## 3         21992         21736

4.2.2.4 metaphlan_taxonomic_profiles.tsv

This contains the merged taxonomic profiles for all samples.

##                 X..taxonomy  CSM9X23N  HSM6XRQY HSM7J4NY  HSMA33KE  HSMA33OT
## 1                k__Archaea   0.00000   0.00000 26.62931   0.00000   0.00000
## 2               k__Bacteria 100.00000 100.00000 73.37069 100.00000 100.00000
## 3 k__Bacteria|p__Firmicutes   6.76299   2.84778  0.00000  27.44129   6.94853
##    MSM6J2QD
## 1   0.00000
## 2 100.00000
## 3   1.22901

4.2.2.5 microbial_counts_table.tsv

This table includes counts ratios for each step of the quality control process for all samples.

##   X..Sample rRNA...Trim rRNA...Raw hg37dec_v0.1...Trim hg37dec_v0.1...Raw
## 1  CSM9X23N     0.99566    0.99720             0.99065            0.99219
## 2  HSM6XRQY     0.99623    0.99730             0.99059            0.99166
## 3  HSM7J4NY     1.04517    1.04833             0.79829            0.80070
##   mRNA...Trim mRNA...Raw
## 1     0.99554    0.99709
## 2     0.99612    0.99719
## 3     0.99333    0.99633

4.2.2.6 pathabundance_relab.tsv

This is a merged table of the pathway abundances for all samples normalized to relative abundance.

##                                                                                    X..Pathway
## 1                                        1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis
## 2   1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Alistipes.s__Alistipes_finegoldii
## 3 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Anaerostipes.s__Anaerostipes_hadrus
##   CSM9X23N_Abundance HSM6XRQY_Abundance HSM7J4NY_Abundance HSMA33KE_Abundance
## 1          0.0120141          0.0128154           0.023005        1.03715e-02
## 2          0.0000000          0.0000000           0.000000        9.96737e-05
## 3          0.0000000          0.0000000           0.000000        1.65680e-04
##   HSMA33OT_Abundance MSM6J2QD_Abundance
## 1         0.00895635          0.0155297
## 2         0.00000000          0.0000000
## 3         0.00000000          0.0000000

4.2.2.7 qc_counts_orphans_table.tsv

This is table with the total number of orphan reads not aligning to each of the reference tables.

##   X..Sample rRNA.orphan1 rRNA.orphan2 hg37dec_v0.1.orphan1 hg37dec_v0.1.orphan2
## 1  CSM9X23N        16559        16158                19029                18559
## 2  HSM6XRQY         9642         9009                11519                11078
## 3  HSM7J4NY        22026        21765               775171                46603
##   mRNA.orphan1 mRNA.orphan2
## 1        16560        16158
## 2         9642         9009
## 3        22027        21764

4.2.2.8 qc_counts_pairs_table.tsv

This is table with the total number of paired reads not aligning to each of the reference tables.

##   X..Sample      Raw     Trim     rRNA hg37dec_v0.1     mRNA
## 1  CSM9X23N 10529590 10529590 10482541     10429946 10482541
## 2  HSM6XRQY  8655985  8655985  8622357      8573730  8622359
## 3  HSM7J4NY  7241429  7241425  7192958      5764053  7192959

4.2.2.9 taxa_counts_table.tsv

This table includes the total number of species and genera before and after filtering.

##   X..Sample Species Species.filtered Genera Genera.filtered
## 1  CSM9X23N      34               33     19              19
## 2  HSM6XRQY      33               30     22              21
## 3  HSM7J4NY      11               11      6               6

4.2.2.10 top_average_pathways_names.tsv

This table includes the top pathways by average abundance, with their full names, including average abundance and variance.

##                                                 X..Pathway Average.abundance
## 1                  PWY-1042: glycolysis IV (plant cytosol)            0.0399
## 2         ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)            0.0232
## 3 PWY-7219: adenosine ribonucleotides de novo biosynthesis            0.0231
##   Variance
## 1 0.001020
## 2 0.000225
## 3 0.000180

4.2.3 Make SummarizedExperiment

4.2.3.1 Taxonomy profile

taxo_profile <- read.csv(file.path(data_dir, "metaphlan_taxonomic_profiles.tsv"),
                         sep = "\t", header = TRUE)
taxo_profile <- tibble::column_to_rownames(taxo_profile, var = "X..taxonomy")
meta <- read.table("~/data2/biobakeR/inst/extdata/ibdmdb_demo_metadata.txt", 
                   sep = "\t", header = TRUE)
colData <- DataFrame(meta)
se <- SummarizedExperiment(assays = list(taxo_profile = taxo_profile),
                           colData = colData)
se
## class: SummarizedExperiment 
## dim: 202 6 
## metadata(0):
## assays(1): taxo_profile
## rownames(202): k__Archaea k__Bacteria ...
##   k__Archaea|p__Euryarchaeota|c__Thermoplasmata|o__Methanomassiliicoccales|f__Methanomassiliicoccaceae|g__Methanomassiliicoccus|s__Candidatus_Methanomassiliicoccus_intestinalis
##   k__Bacteria|p__Proteobacteria|c__Proteobacteria_unclassified|o__Proteobacteria_unclassified|f__Proteobacteria_unclassified|g__Proteobacteria_unclassified|s__Proteobacteria_bacterium_CAG_139
## rowData names(0):
## colnames(6): CSM9X23N HSM6XRQY ... HSMA33OT MSM6J2QD
## colData names(4): Sample Sequencing.Type Site Age

4.2.3.2 Pathway abundance

path_abun <- read.csv(file.path(data_dir, "pathabundance_relab.tsv"),
                      sep = "\t", header = TRUE)
path_abun <- tibble::column_to_rownames(path_abun, var = "X..Pathway")
colnames(path_abun) <- gsub("_Abundance", "", colnames(path_abun))
se <- SummarizedExperiment(assays = list(taxo_profile = path_abun),
                           colData = colData)
se
## class: SummarizedExperiment 
## dim: 2625 6 
## metadata(0):
## assays(1): taxo_profile
## rownames(2625): 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis
##   1CMET2-PWY: N10-formyl-tetrahydrofolate
##   biosynthesis|g__Alistipes.s__Alistipes_finegoldii ... VALSYN-PWY:
##   L-valine
##   biosynthesis|g__Ruthenibacterium.s__Ruthenibacterium_lactatiformans
##   VALSYN-PWY: L-valine biosynthesis|unclassified
## rowData names(0):
## colnames(6): CSM9X23N HSM6XRQY ... HSMA33OT MSM6J2QD
## colData names(4): Sample Sequencing.Type Site Age