bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. bioBakery is built and maintained by Huttenhower lab.
For R users with limited computing resources, we introduce biobakeR workflow package. This package allows users to run Whole Metagenome Shotgun (wmgx) workflow version 3 on Google Cloud without writing any workflow, installing softwares, nor managing cloud resources. The computing resource is managed by Terra, a cloud-based genomics platform, and users only need to setup their account only once at the beginning to use biobakeR.
0. Pre-requisite
You need Terra account setup.
The email address linked to Terra account and the billing project name are required
to use biobakeR package.
1. Input
1-1. cloneWorkspace function copies the template workspace containing the bioBakery
workflow. Workspace is the main building block of Terra,
where different resources are delivered through. For biobakeR, the ‘template’
workspace is where default workflow is store and users will have their own copy
of it, where they can modify, compute, and save their own results.
1-2. updateInput function takes user’s inputs. There are two required inputs: ProjectName
(string) and InputRead1Files (tsv file). You can optionally provide metadata tsv file
through InputMetadataFile argument. You can check the current input information using
currentInput function.
2. Run workflow
launchWorkflow function launch the bioBakery workflow in Terra. You can abort
the submission using abortSubmission function. For now, the input files should
be saved in Google bucket.
3. Result
3-1. monitorSubmission function allows you to monitor the status of your workflow run.
3-2. listOutput function displays the list of your workflow outputs.
3-3. tableHead function allows you to check the head of output tsv files.
3-4. getOutput function allows you to download your outputs.
Here are the basic input parameters used in this example.
accountEmail <- "shbrief@gmail.com"
billingProjectName <- "waldronlab-terra-rstudio"
workspaceName <- "mtx_workflow_biobakery_ver3"
First, you should clone the template workspace using cloneWorkspace function.
Note that you need to provide a unique name for the cloned workspace through workspaceName
argument. “mtx_workflow_biobakery_ver3” already exists, so the below cloning fails.
cloneWorkspace(accountEmail, billingProjectName, workspaceName)
## [1] "Your workspaceName already exists. Try a new one."
With the unique workspace name, you can succesfully clone the workspace.
cloneWorkspace(accountEmail, billingProjectName, workspaceName = "test")
## [1] "Workspace is successfully cloned"
input is a table containing the input files’ Google bucket location. This
example input list file contains 6 fastq files. AnVIL::gsutil_pipe function
allows you to read a google bucket object without downloading it. More detail
can be found in AnVIL vignette.
input <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_file_list.txt"
read.table(gsutil_pipe(input), sep = "\t")
## V1
## 1 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM7J4NY_R1.fastq.gz
## 2 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33OT_R1.fastq.gz
## 3 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/MSM6J2QD_R1.fastq.gz
## 4 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/CSM9X23N_R1.fastq.gz
## 5 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM6XRQY_R1.fastq.gz
## 6 gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33KE_R1.fastq.gz
inputMeta contains the metadata of the samples and the optional input for biobakeR.
inputMeta <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_demo_metadata.txt"
read.table(gsutil_pipe(inputMeta), sep = "\t", header = TRUE)
## Sample Sequencing.Type Site Age
## 1 HSM7J4NY MTX Cincinnati 27
## 2 HSMA33OT MTX Cincinnati 16
## 3 MSM6J2QD MTX MGH 24
## 4 CSM9X23N MGX Cedars-Sinai 28
## 5 HSM6XRQY MGX Cincinnati 27
## 6 HSMA33KE MGX Cincinnati 15
You can review the current input data.
currentInput(accountEmail, billingProjectName, workspaceName)
## $inputListPath
## [1] "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_file_list.txt"
##
## $inputFilePath
## [1] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM7J4NY_R1.fastq.gz"
## [2] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33OT_R1.fastq.gz"
## [3] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/MSM6J2QD_R1.fastq.gz"
## [4] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/CSM9X23N_R1.fastq.gz"
## [5] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSM6XRQY_R1.fastq.gz"
## [6] "gs://fc-7130738a-5cde-4238-b00a-e07eba6047f2/IBDMDB/HSMA33KE_R1.fastq.gz"
If you want to check additional input information, you can set inputOnly = FALSE.
config <- currentInput(accountEmail, billingProjectName, workspaceName, inputOnly = FALSE)
names(config)
## [1] "deleted" "inputs" "methodConfigVersion"
## [4] "methodRepoMethod" "name" "namespace"
## [7] "outputs" "prerequisites"
For example, you can check databases used for this version of workflow.
db_ind <- grep("versionSpecific", names(config$input))
config$input[db_ind]
## $workflowMTX.versionSpecifichumanDB
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/kneaddata/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz\""
##
## $workflowMTX.versionSpecificUtilityMapping
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/humann/full_mapping_v201901.tar.gz\""
##
## $workflowMTX.versionSpecificrrnaDB
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/kneaddata/SILVA_128_LSUParc_SSUParc_ribosomal_RNA_v0.2.tar.gz\""
##
## $workflowMTX.versionSpecifictrancriptDB
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/kneaddata/Homo_sapiens_hg38_transcriptome_Bowtie2_v0.1.tar.gz\""
##
## $workflowMTX.versionSpecificChocophlan
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/humann/full_chocophlan.v296_201901.tar.gz\""
##
## $workflowMTX.versionSpecificUniRef90
## [1] "\"gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/databases/humann/uniref90_annotated_v201901.tar.gz\""
Before launching the workflow, you should provide the correct input information
using updateInput function.
## This function is not available yet.
input <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_file_list.txt"
inputMeta <- "gs://fc-07ee4ddc-5b5b-46f6-bed7-809aa14bb012/IBDMDB/ibdmdb_demo_metadata.txt"
updateInput(inputPath = input,
inputMetadataPath = inputMeta,
billingProjectName, workspaceName)
Once you cloned the template workspace and updated the input with your own data,
you can launch the workflow using launchWorkflow function.
launchWorkflow(accountEmail, billingProjectName, workspaceName)
## [1] "Workflow is succesfully launched."
submissions <- monitorSubmission(accountEmail, billingProjectName, workspaceName)
submissions
## # A tibble: 54 x 6
## submissionId submitter submissionDate status succeeded failed
## <chr> <chr> <dttm> <chr> <int> <int>
## 1 b0463af3-0f3f-4995-… shbrief@gma… 2020-11-24 10:44:03 Submi… 0 0
## 2 c4fee2eb-3e43-4e3c-… shbrief@gma… 2020-11-18 14:07:17 Abort… 0 0
## 3 11edc1aa-48aa-4e63-… shbrief@gma… 2020-11-18 14:00:32 Abort… 0 0
## 4 e6d2270a-6dc5-4636-… shbrief@gma… 2020-11-12 01:17:04 Abort… 0 0
## 5 48aa0bf9-c095-4729-… shbrief@gma… 2020-11-12 01:13:53 Abort… 0 0
## 6 e9dd3cd7-1cc1-48b6-… shbrief@gma… 2020-11-12 01:04:43 Abort… 0 0
## 7 8461ad86-5ab4-4abc-… shbrief@gma… 2020-11-12 00:56:33 Abort… 0 0
## 8 19c5bd27-2b55-42b1-… shbrief@gma… 2020-11-12 00:52:53 Abort… 0 0
## 9 6947fe70-f22c-47c5-… shbrief@gma… 2020-11-12 00:50:04 Abort… 0 0
## 10 87adddce-5f43-40b0-… shbrief@gma… 2020-11-12 00:48:14 Done 1 0
## # … with 44 more rows
To abort the most recently submitted job:
abortSubmission(accountEmail, billingProjectName, workspaceName,
submissionId = submissions$submissionId[1])
## [1] "Workflow is succesfully aborted."
For the demonstration purpose, we aborted the launched workflow above. This workspace had a successful run with the same input already. To search that, we screened all the previous submission status and found the one with ‘Done’ status.
jobs <- monitorSubmission(accountEmail, billingProjectName, workspaceName,
mostRecentOnly = FALSE)
jobs_succeed <- which(jobs$succeeded == 1)
jobs[jobs_succeed,]
## # A tibble: 12 x 6
## submissionId submitter submissionDate status succeeded failed
## <chr> <chr> <dttm> <chr> <int> <int>
## 1 87adddce-5f43-40b0-… shbrief@gma… 2020-11-12 00:48:14 Done 1 0
## 2 116db295-b769-4235-… shbrief@gma… 2020-11-12 00:46:06 Done 1 0
## 3 75f01bcf-495e-4808-… shbrief@gma… 2020-11-11 13:35:16 Done 1 0
## 4 9de324b2-314c-4858-… shbrief@gma… 2020-11-05 20:34:52 Done 1 0
## 5 b30fa933-2074-4adc-… shbrief@gma… 2020-11-04 22:32:28 Done 1 0
## 6 b2f7f504-a9ed-4c2b-… shbrief@gma… 2020-11-04 22:30:14 Done 1 0
## 7 0ade77db-665d-4259-… shbrief@gma… 2020-11-04 22:26:59 Done 1 0
## 8 522ac46e-c1a9-495e-… shbrief@gma… 2020-11-04 22:25:41 Done 1 0
## 9 573a88a7-3443-47bc-… shbrief@gma… 2020-11-04 22:16:14 Done 1 0
## 10 7ef750c1-f574-4899-… shbrief@gma… 2020-11-01 00:45:20 Abort… 1 0
## 11 9e20eec5-1b6a-46d2-… shbrief@gma… 2020-10-13 01:49:36 Done 1 0
## 12 36502cde-6cbf-4cca-… shbrief@gma… 2020-10-13 01:04:29 Done 1 0
To access the output of the submitted job, you need a submission Id.
submission_id <- jobs[jobs_succeed,]$submissionId[1]
submission_id
## [1] "87adddce-5f43-40b0-a5a1-f7a3e3851d16"
You can check all the output files from a specific submission or you can subset
a specific outputs using keyword argument.
listOutput(accountEmail, billingProjectName, workspaceName, submissionId = submission_id)
## # A tibble: 69 x 4
## file workflow task path
## <chr> <chr> <chr> <chr>
## 1 humann_ecs_relab_c… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-…
## 2 humann_genefamilie… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-…
## 3 humann_pathabundan… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-…
## 4 humann_read_and_sp… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 5 HSM7J4NY_genefamil… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 6 HSM7J4NY_pathabund… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 7 HSM7J4NY_pathcover… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 8 HSMA33OT_genefamil… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 9 HSMA33OT_pathabund… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## 10 HSMA33OT_pathcover… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-…
## # … with 59 more rows
listOutput(accountEmail, billingProjectName, workspaceName, submissionId = submission_id,
keyword = "humann")
## # A tibble: 5 x 4
## file workflow task path
## <chr> <chr> <chr> <chr>
## 1 humann_ecs_relab_c… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-d…
## 2 humann_genefamilie… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-d…
## 3 humann_pathabundan… workflowM… call-CountR… gs://fc-071d1d53-e186-44ad-8951-d…
## 4 humann_read_and_sp… workflowM… call-Functi… gs://fc-071d1d53-e186-44ad-8951-d…
## 5 humann_feature_cou… workflowM… call-JoinFe… gs://fc-071d1d53-e186-44ad-8951-d…
Using tableHead function, you can check the head of output .tsv files without
dowwnloading them.
tableHead("HSM7J4NY_genefamilies_relab.tsv", n = 6,
accountEmail, billingProjectName, workspaceName,
submissionId = submission_id) # gene families relative abundance
## V1
## 1 UniRef90_B7B7I3
## 2 UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
## 3 UniRef90_B7B7I3|unclassified
## 4 UniRef90_F3QJ09
## 5 UniRef90_F3QJ09|g__Parasutterella.s__Parasutterella_excrementihominis
## 6 UniRef90_A0A374V756
## V2
## 1 2.25732e-01
## 2 2.25683e-01
## 3 4.95507e-05
## 4 1.39816e-01
## 5 1.39816e-01
## 6 3.66683e-02
Here, we list the example of major output files from bioBakery workflow.
keyword argument takes a character string containing a regular expression and as
an example here, we check all .tsv output of the sample, “HSM7J4NY”.
listOutput(accountEmail, billingProjectName, workspaceName, submission_id,
keyword = "HSM7J4NY.*.tsv")
## # A tibble: 9 x 4
## file workflow task path
## <chr> <chr> <chr> <chr>
## 1 HSM7J4NY_genefam… workflow… call-Functio… gs://fc-071d1d53-e186-44ad-8951-d85…
## 2 HSM7J4NY_pathabu… workflow… call-Functio… gs://fc-071d1d53-e186-44ad-8951-d85…
## 3 HSM7J4NY_pathcov… workflow… call-Functio… gs://fc-071d1d53-e186-44ad-8951-d85…
## 4 HSM7J4NY_ecs.tsv workflow… call-Regroup… gs://fc-071d1d53-e186-44ad-8951-d85…
## 5 HSM7J4NY_kos.tsv workflow… call-Regroup… gs://fc-071d1d53-e186-44ad-8951-d85…
## 6 HSM7J4NY_ecs_rel… workflow… call-RenormT… gs://fc-071d1d53-e186-44ad-8951-d85…
## 7 HSM7J4NY_genefam… workflow… call-RenormT… gs://fc-071d1d53-e186-44ad-8951-d85…
## 8 HSM7J4NY_pathabu… workflow… call-RenormT… gs://fc-071d1d53-e186-44ad-8951-d85…
## 9 HSM7J4NY.tsv workflow… call-Taxonom… gs://fc-071d1d53-e186-44ad-8951-d85…
You can download any file using getOutput function. Here, we narrow down the
download to HSM7J4NY sample’s .tsv output files.
HSM7J4NY_dir <- "~/data2/biobakeR/inst/extdata/outputs/HSM7J4NY"
getOutput(accountEmail, billingProjectName, workspaceName, submission_id,
keyword = "HSM7J4NY.*.tsv", dest_dir = HSM7J4NY_dir)
## X..Gene.Family
## 1 UNMAPPED
## 2 UniRef90_B7B7I3
## 3 UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
## HSM7J4NY_Abundance.RPKs
## 1 10963114.0
## 2 583244.9
## 3 583116.8
## X..Pathway
## 1 UNMAPPED
## 2 UNINTEGRATED
## 3 UNINTEGRATED|g__Parabacteroides.s__Parabacteroides_distasonis
## HSM7J4NY_Abundance
## 1 2875610.0
## 2 675985.8
## 3 150306.1
## X..Pathway HSM7J4NY_Coverage
## 1 UNMAPPED 1
## 2 UNINTEGRATED 1
## 3 UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae 1
## X..Gene.Family HSM7J4NY_Abundance.RPKs
## 1 UNMAPPED 10963114.00
## 2 UNGROUPED 2552297.40
## 3 UNGROUPED|g__Alistipes.s__Alistipes_finegoldii 41352.53
## X..Gene.Family HSM7J4NY_Abundance.RPKs
## 1 UNMAPPED 10963114.00
## 2 UNGROUPED 2544777.64
## 3 UNGROUPED|g__Alistipes.s__Alistipes_finegoldii 41201.35
## X..Gene.Family HSM7J4NY_Abundance.RPKs
## 1 1.1.1.1 0.000435738
## 2 1.1.1.1|g__Alistipes.s__Alistipes_finegoldii 0.000435738
## 3 1.1.1.100 0.002691470
## X..Gene.Family
## 1 UniRef90_B7B7I3
## 2 UniRef90_B7B7I3|g__Parabacteroides.s__Parabacteroides_merdae
## 3 UniRef90_B7B7I3|unclassified
## HSM7J4NY_Abundance.RPKs
## 1 2.25732e-01
## 2 2.25683e-01
## 3 4.95507e-05
## X..Pathway HSM7J4NY_Abundance
## 1 PWY-1042: glycolysis IV (plant cytosol) 0.0859454
## 2 PWY-1042: glycolysis IV (plant cytosol)|unclassified 0.0224201
## 3 ANAGLYCOLYSIS-PWY: glycolysis III (from glucose) 0.0501175
## X.clade_name NCBI_tax_id relative_abundance
## 1 k__Bacteria 2 73.37069
## 2 k__Archaea 2157 26.62931
## 3 k__Bacteria|p__Bacteroidetes 2|976 72.77130
## additional_species
## 1
## 2
## 3
Output files for visualization is saved in ibdmdb_test_visualizations.zip.
dest_dir <- "~/data2/biobakeR/inst/extdata/outputs"
getOutput(accountEmail, billingProjectName, workspaceName, submission_id,
keyword = "visualization", dest_dir = dest_dir)
unzip(file.path(dest_dir, "ibdmdb_test_visualizations.zip"), exdir = dest_dir)
Here are the list of files used for visualization.
list.files(data_dir)
## [1] "humann_feature_counts.tsv"
## [2] "humann_read_and_species_count_table.tsv"
## [3] "kneaddata_read_count_table.tsv"
## [4] "metaphlan_taxonomic_profiles.tsv"
## [5] "microbial_counts_table.tsv"
## [6] "pathabundance_relab.tsv"
## [7] "qc_counts_orphans_table.tsv"
## [8] "qc_counts_pairs_table.tsv"
## [9] "taxa_counts_table.tsv"
## [10] "top_average_pathways_names.tsv"
This contains the feature counts (pathways, gene families, ecs) for each sample.
## X..samples humann_ecs_relab_counts humann_genefamilies_relab_counts
## 1 CSM9X23N 1049 88227
## 2 HSM6XRQY 1150 64841
## 3 HSM7J4NY 416 9002
## humann_pathabundance_relab_counts
## 1 209
## 2 262
## 3 42
This contains the counts of reads aligning at each step plus the total number of species identified for each sample.
## X..samples total.reads total.nucleotide.aligned total.translated.aligned
## 1 HSM6XRQY 17166083 13505861 14076343
## 2 HSMA33OT 11189134 3394815 4348515
## 3 MSM6J2QD 3858456 545341 883737
## total.species
## 1 27
## 2 26
## 3 10
This contains the read counts (split into pairs and orphans) for each step in the quality control process for each sample.
## Sample raw.pair1 raw.pair2 trimmed.pair1 trimmed.pair2
## 1 CSM9X23N 10529590 10529590 10529590 10529590
## 2 HSM6XRQY 8655985 8655985 8655985 8655985
## 3 HSM7J4NY 7241429 7241429 7241425 7241425
## decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.pair1
## 1 10482541
## 2 8622357
## 3 7192958
## decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.pair2
## 1 10482541
## 2 8622357
## 3 7192958
## decontaminated.hg37dec_v0.1.pair1 decontaminated.hg37dec_v0.1.pair2
## 1 10429946 10429946
## 2 8573730 8573730
## 3 5764053 5764053
## decontaminated.human_hg38_refMrna.pair1
## 1 10482541
## 2 8622359
## 3 7192959
## decontaminated.human_hg38_refMrna.pair2
## 1 10482541
## 2 8622359
## 3 7192959
## decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.orphan1
## 1 16559
## 2 9642
## 3 22026
## decontaminated.SILVA_128_LSUParc_SSUParc_ribosomal_RNA.orphan2
## 1 16158
## 2 9009
## 3 21765
## decontaminated.hg37dec_v0.1.orphan1 decontaminated.hg37dec_v0.1.orphan2
## 1 19029 18559
## 2 11519 11078
## 3 775171 46603
## decontaminated.human_hg38_refMrna.orphan1
## 1 16560
## 2 9642
## 3 22027
## decontaminated.human_hg38_refMrna.orphan2 final.pair1 final.pair2
## 1 16158 10429946 10429946
## 2 9009 8573730 8573730
## 3 21764 5764053 5764053
## final.orphan1 final.orphan2
## 1 16546 16129
## 2 9630 8993
## 3 21992 21736
This contains the merged taxonomic profiles for all samples.
## X..taxonomy CSM9X23N HSM6XRQY HSM7J4NY HSMA33KE HSMA33OT
## 1 k__Archaea 0.00000 0.00000 26.62931 0.00000 0.00000
## 2 k__Bacteria 100.00000 100.00000 73.37069 100.00000 100.00000
## 3 k__Bacteria|p__Firmicutes 6.76299 2.84778 0.00000 27.44129 6.94853
## MSM6J2QD
## 1 0.00000
## 2 100.00000
## 3 1.22901
This table includes counts ratios for each step of the quality control process for all samples.
## X..Sample rRNA...Trim rRNA...Raw hg37dec_v0.1...Trim hg37dec_v0.1...Raw
## 1 CSM9X23N 0.99566 0.99720 0.99065 0.99219
## 2 HSM6XRQY 0.99623 0.99730 0.99059 0.99166
## 3 HSM7J4NY 1.04517 1.04833 0.79829 0.80070
## mRNA...Trim mRNA...Raw
## 1 0.99554 0.99709
## 2 0.99612 0.99719
## 3 0.99333 0.99633
This is a merged table of the pathway abundances for all samples normalized to relative abundance.
## X..Pathway
## 1 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis
## 2 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Alistipes.s__Alistipes_finegoldii
## 3 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Anaerostipes.s__Anaerostipes_hadrus
## CSM9X23N_Abundance HSM6XRQY_Abundance HSM7J4NY_Abundance HSMA33KE_Abundance
## 1 0.0120141 0.0128154 0.023005 1.03715e-02
## 2 0.0000000 0.0000000 0.000000 9.96737e-05
## 3 0.0000000 0.0000000 0.000000 1.65680e-04
## HSMA33OT_Abundance MSM6J2QD_Abundance
## 1 0.00895635 0.0155297
## 2 0.00000000 0.0000000
## 3 0.00000000 0.0000000
This is table with the total number of orphan reads not aligning to each of the reference tables.
## X..Sample rRNA.orphan1 rRNA.orphan2 hg37dec_v0.1.orphan1 hg37dec_v0.1.orphan2
## 1 CSM9X23N 16559 16158 19029 18559
## 2 HSM6XRQY 9642 9009 11519 11078
## 3 HSM7J4NY 22026 21765 775171 46603
## mRNA.orphan1 mRNA.orphan2
## 1 16560 16158
## 2 9642 9009
## 3 22027 21764
This is table with the total number of paired reads not aligning to each of the reference tables.
## X..Sample Raw Trim rRNA hg37dec_v0.1 mRNA
## 1 CSM9X23N 10529590 10529590 10482541 10429946 10482541
## 2 HSM6XRQY 8655985 8655985 8622357 8573730 8622359
## 3 HSM7J4NY 7241429 7241425 7192958 5764053 7192959
This table includes the total number of species and genera before and after filtering.
## X..Sample Species Species.filtered Genera Genera.filtered
## 1 CSM9X23N 34 33 19 19
## 2 HSM6XRQY 33 30 22 21
## 3 HSM7J4NY 11 11 6 6
This table includes the top pathways by average abundance, with their full names, including average abundance and variance.
## X..Pathway Average.abundance
## 1 PWY-1042: glycolysis IV (plant cytosol) 0.0399
## 2 ANAGLYCOLYSIS-PWY: glycolysis III (from glucose) 0.0232
## 3 PWY-7219: adenosine ribonucleotides de novo biosynthesis 0.0231
## Variance
## 1 0.001020
## 2 0.000225
## 3 0.000180
taxo_profile <- read.csv(file.path(data_dir, "metaphlan_taxonomic_profiles.tsv"),
sep = "\t", header = TRUE)
taxo_profile <- tibble::column_to_rownames(taxo_profile, var = "X..taxonomy")
meta <- read.table("~/data2/biobakeR/inst/extdata/ibdmdb_demo_metadata.txt",
sep = "\t", header = TRUE)
colData <- DataFrame(meta)
se <- SummarizedExperiment(assays = list(taxo_profile = taxo_profile),
colData = colData)
se
## class: SummarizedExperiment
## dim: 202 6
## metadata(0):
## assays(1): taxo_profile
## rownames(202): k__Archaea k__Bacteria ...
## k__Archaea|p__Euryarchaeota|c__Thermoplasmata|o__Methanomassiliicoccales|f__Methanomassiliicoccaceae|g__Methanomassiliicoccus|s__Candidatus_Methanomassiliicoccus_intestinalis
## k__Bacteria|p__Proteobacteria|c__Proteobacteria_unclassified|o__Proteobacteria_unclassified|f__Proteobacteria_unclassified|g__Proteobacteria_unclassified|s__Proteobacteria_bacterium_CAG_139
## rowData names(0):
## colnames(6): CSM9X23N HSM6XRQY ... HSMA33OT MSM6J2QD
## colData names(4): Sample Sequencing.Type Site Age
path_abun <- read.csv(file.path(data_dir, "pathabundance_relab.tsv"),
sep = "\t", header = TRUE)
path_abun <- tibble::column_to_rownames(path_abun, var = "X..Pathway")
colnames(path_abun) <- gsub("_Abundance", "", colnames(path_abun))
se <- SummarizedExperiment(assays = list(taxo_profile = path_abun),
colData = colData)
se
## class: SummarizedExperiment
## dim: 2625 6
## metadata(0):
## assays(1): taxo_profile
## rownames(2625): 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis
## 1CMET2-PWY: N10-formyl-tetrahydrofolate
## biosynthesis|g__Alistipes.s__Alistipes_finegoldii ... VALSYN-PWY:
## L-valine
## biosynthesis|g__Ruthenibacterium.s__Ruthenibacterium_lactatiformans
## VALSYN-PWY: L-valine biosynthesis|unclassified
## rowData names(0):
## colnames(6): CSM9X23N HSM6XRQY ... HSMA33OT MSM6J2QD
## colData names(4): Sample Sequencing.Type Site Age