The curatedMetagenomicData package provides taxonomic, functional, and gene marker abundance for samples collected from different bodysites. It provides data from aproximately 3000 human microbiome samples that have been highly processed, refined, and curated such that analysis that might otherwise require a computing cluster can be done on an ordianary laptop.
curatedMetagenomicData
providescuratedMetagenomicData
provides 6 types of data for each dataset:
Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.
Currently, curatedMetagenomicData
provides:
These datasets are documented in the reference manual.
curatedMetagenomicData
ResourcesUse of the resources in curatedMetagenomicData
is simplified with the use of Bioconductor’s ExperimentHub platform, which allows for the accessing of data through an intuitive interface. First, curatedMetagenomicData
is installed using BiocInstaller
and then called as a library - the process allows for the user to simply call datasets as functions because the package is aware of the resources present in ExperimentHub S3 buckets.
# BiocInstaller::biocLite("curatedMetagenomicData")
library(curatedMetagenomicData)
A function call to a dataset name will return an expression set object which can be manipulated or serialized.
LomanNJ_2013_Mi.genefamilies_relab.stool()
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 337637 features, 9 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: OBK1122 OBK1196 ... OBK4961 (9 total)
## varLabels: subjectID first ... number_reads (19 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## pubMedIds: 23571589
## Annotation:
ExpressionSet
ObjectsAll datasets are represented as ExpressionSet
objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase
package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.
To access the experiment-level metadata the experimentData()
function is used to return a MIAME
(Minimum Information About a Microarray Experiment) object.
LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
experimentData()
## Experiment data
## Experimenter name: Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ
## Laboratory: Institute of Microbiology and Infection, University of Birmingham, Birmingham, England.
## Contact information:
## Title: A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4.
## URL:
## PMIDs: 23571589
##
## Abstract: A 308 word abstract is available. Use 'abstract' method.
## notes:
## Sequencing technology:
## Illumina
To access the subject-level metadata the pData()
function is used to return a data.frame
containing subject-level variables of study.
LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
pData() %>%
head()
## subjectID first repeat stooltexture daysafteronset hus stec_count
## OBK1122 OBK1122 1122 0 <NA> <NA> <NA> <NA>
## OBK1196 OBK1196 1196 0 <NA> <NA> <NA> <NA>
## OBK1253 OBK1253 1253 0 <NA> <NA> <NA> <NA>
## OBK2535 OBK2535 2535 0 <NA> <NA> <NA> <NA>
## OBK2638 OBK2638 2638 0 <NA> <NA> <NA> <NA>
## OBK2723 OBK2723 2723 0 <NA> <NA> <NA> <NA>
## shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122 <NA> 8.00 33 <NA>
## OBK1196 <NA> 17.50 >99 <NA>
## OBK1253 <NA> 12.10 >99 <NA>
## OBK2535 <NA> 17.70 33.04 <NA>
## OBK2638 <NA> 18.74 67.51 <NA>
## OBK2723 <NA> 11.81 1.86 <NA>
## stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122 y <NA> <NA> positive
## OBK1196 n <NA> <NA> negative
## OBK1253 n <NA> <NA> negative
## OBK2535 <NA> <NA> <NA> negative
## OBK2638 <NA> <NA> <NA> negative
## OBK2723 <NA> <NA> <NA> negative
## disease bodysite country number_reads
## OBK1122 stec2-positive stool germany 1758352
## OBK1196 stec2-positive stool germany 13423104
## OBK1253 stec2-positive stool germany 9949304
## OBK2535 stec2-positive stool germany 664514
## OBK2638 stec2-positive stool germany 1009732
## OBK2723 stec2-positive stool germany 8017954
To access the data itself the exprs()
function is used to return a long and skinny matrix
in which samples are column rather than rows.
LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
exprs() %>%
head()
## OBK1122 OBK1196 OBK1253 OBK2535 OBK2638
## UNMAPPED 0.08320560 0.12063500 0.11415300 0.071663900 0.05090200
## UniRef90_unknown 0.08003470 0.07693680 0.08019060 0.122752000 0.14018600
## UniRef90_P03641 0.00418468 0.00176623 0.00326958 0.001151130 0.01027770
## UniRef90_P69486 0.00357096 0.00225170 0.00369798 0.001339380 0.01182520
## UniRef90_P03631 0.00311485 0.00170172 0.00302549 0.000995641 0.00930738
## UniRef90_P69172 0.00300317 0.00204393 0.00246346 0.001040070 0.00769270
## OBK2723 OBK4096 OBK4328 OBK4961
## UNMAPPED 0.09993320 0.0654341 0.12492000 0.0479208
## UniRef90_unknown 0.06659520 0.0512603 0.06270720 0.1208630
## UniRef90_P03641 0.00294142 0.0759686 0.00117431 0.0419860
## UniRef90_P69486 0.00318407 0.0783307 0.00240333 0.0473116
## UniRef90_P03631 0.00266316 0.0680656 0.00194322 0.0383972
## UniRef90_P69172 0.00235623 0.0608840 0.00163305 0.0309943
Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.
Absolute raw count data can be estimated from the relative count data by multiplying the rows of the ExpressionSet
data by the number of reads, as found in the pData
. The sweep
function from base R
provides a way to quickly multiply every row of the matrix by the number of reads vector.
LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
exprs() %>%
sweep(., 2, LomanNJ_2013_Mi.genefamilies_relab.stool()$number_reads, "*") %>%
head()
## OBK1122 OBK1196 OBK1253 OBK2535 OBK2638
## UNMAPPED 146304.733 1619296.15 1135742.90 47621.6648 51397.378
## UniRef90_unknown 140729.175 1032730.67 797840.66 81570.4225 141550.290
## UniRef90_P03641 7358.140 23708.29 32530.05 764.9420 10377.723
## UniRef90_P69486 6279.005 30224.80 36792.33 890.0368 11940.283
## UniRef90_P03631 5477.003 22842.36 30101.52 661.6174 9397.959
## UniRef90_P69172 5280.630 27435.88 24509.71 691.1411 7767.565
## OBK2723 OBK4096 OBK4328 OBK4961
## UNMAPPED 801259.80 10792.570 1372570.74 17631.12
## UniRef90_unknown 533957.25 8454.771 689001.51 44468.16
## UniRef90_P03641 23584.17 12530.109 12902.85 15447.57
## UniRef90_P69486 25529.73 12919.709 26406.82 17406.98
## UniRef90_P03631 21353.09 11226.604 21351.32 14127.17
## UniRef90_P69172 18892.14 10042.085 17943.30 11403.48
phyloseq
The phyloseq
bioconductor package provides a number of useful functions related to taxonomic analysis and readily produces ggPlot2
graphics. curatedMetagenomicData
provides 6 types of data; of these, only the taxonomic profiles make sense to analyse as phylogenetic data. A convienence function, ExpressionSet2phyloseq()
, is provided within the curatedMetagenomicData
package to allow for coversion between the ExpressionSet
class and the phyloseq
class; it is used as follows.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq()
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 399 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 399 taxa by 8 taxonomic ranks ]
Exploratory data analysis might begin through the examination of clade-specific realative abundances contained in the OTU (Operational Taxanomic Unit) table. A dataset containing taxanomic profiles is first converted to a phyloseq
object and then the otu_table()
function is used to view the contents and structure of the table.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
otu_table() %>%
head()
## OTU Table: [6 taxa and 9 samples]
## taxa are rows
## OBK1122 OBK1196 OBK1253 OBK2535 OBK2638 OBK2723 OBK4096
## k__Bacteria 1000000 994785 999901 1000000 1000000 989760 1000000
## p__Bacteroidetes 786168 525459 808827 236409 69300 503518 392508
## p__Firmicutes 157688 432482 177192 268285 24606 443958 180435
## p__Proteobacteria 56145 25305 9536 495306 892984 31430 427056
## c__Bacteroidia 786168 525459 808827 236409 69300 503518 392508
## c__Clostridia 150854 248574 103997 219160 23622 405306 180435
## OBK4328 OBK4961
## k__Bacteria 1000000 1000000
## p__Bacteroidetes 407899 55926
## p__Firmicutes 467944 0
## p__Proteobacteria 155 944074
## c__Bacteroidia 407899 55926
## c__Clostridia 396215 0
Similarly, the process of exploratory data analysis will likely include an examination of study-related metadata in order to understand the type of questions that can be asked of the data. To examine the structure and contents of metadata, a phyloseq
object is created and then the sample_data()
function is used to view the table.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
sample_data() %>%
head()
## subjectID first repeat. stooltexture daysafteronset hus
## OBK1122 OBK1122 1122 0 <NA> <NA> <NA>
## OBK1196 OBK1196 1196 0 <NA> <NA> <NA>
## OBK1253 OBK1253 1253 0 <NA> <NA> <NA>
## OBK2535 OBK2535 2535 0 <NA> <NA> <NA>
## OBK2638 OBK2638 2638 0 <NA> <NA> <NA>
## OBK2723 OBK2723 2723 0 <NA> <NA> <NA>
## stec_count shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122 <NA> <NA> 8.00 33 <NA>
## OBK1196 <NA> <NA> 17.50 >99 <NA>
## OBK1253 <NA> <NA> 12.10 >99 <NA>
## OBK2535 <NA> <NA> 17.70 33.04 <NA>
## OBK2638 <NA> <NA> 18.74 67.51 <NA>
## OBK2723 <NA> <NA> 11.81 1.86 <NA>
## stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122 y <NA> <NA> positive
## OBK1196 n <NA> <NA> negative
## OBK1253 n <NA> <NA> negative
## OBK2535 <NA> <NA> <NA> negative
## OBK2638 <NA> <NA> <NA> negative
## OBK2723 <NA> <NA> <NA> negative
## disease bodysite country number_reads
## OBK1122 stec2-positive stool germany 1758352
## OBK1196 stec2-positive stool germany 13423104
## OBK1253 stec2-positive stool germany 9949304
## OBK2535 stec2-positive stool germany 664514
## OBK2638 stec2-positive stool germany 1009732
## OBK2723 stec2-positive stool germany 8017954
As with the OTU and metadata tables, the structure and contents of the taxanomic table should also be examined. This is done by converting an ExpressionSet
object to a phyloseq
object and then using the tax_table()
function to expose the underlying taxanomic table. The table provides a complete listing of every clade available within the dataset down to the strain level.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
tax_table() %>%
head()
## Taxonomy Table: [6 taxa by 8 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## k__Bacteria "Bacteria" NA NA NA NA
## p__Bacteroidetes "Bacteria" "Bacteroidetes" NA NA NA
## p__Firmicutes "Bacteria" "Firmicutes" NA NA NA
## p__Proteobacteria "Bacteria" "Proteobacteria" NA NA NA
## c__Bacteroidia "Bacteria" "Bacteroidetes" "Bacteroidia" NA NA
## c__Clostridia "Bacteria" "Firmicutes" "Clostridia" NA NA
## Genus Species Strain
## k__Bacteria NA NA NA
## p__Bacteroidetes NA NA NA
## p__Firmicutes NA NA NA
## p__Proteobacteria NA NA NA
## c__Bacteroidia NA NA NA
## c__Clostridia NA NA NA
The process of subseting begins with the names of phylogenetic ranks, which are easily examined using the rank_names()
function.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
rank_names()
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## [8] "Strain"
With knowledge of the rank names it becomes possible to filter by taxa using the subset_taxa()
function. In the example below, where a species is not given taxa are removed.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(!is.na(Species))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 264 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 264 taxa by 8 taxonomic ranks ]
Even greater specificity can be achieved using a more specific taxa such as strain.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(!is.na(Strain))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 124 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 124 taxa by 8 taxonomic ranks ]
Where a specific level of taxa is desired for analysis it can be obtained through the use of the prune_taxa()
function in combination with an apply()
function.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
tax_table() %>%
apply(., 1, function(x) {sum(!is.na(x))}) ->
taxa_level
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
prune_taxa(taxa_level == 2, .) %>%
taxa_names()
## [1] "p__Bacteroidetes" "p__Firmicutes" "p__Proteobacteria"
## [4] "p__Actinobacteria" "p__Viruses_noname" "p__Euryarchaeota"
## [7] "p__Verrucomicrobia"
More advanced pruning of taxa can be done with filtering functions, such as the following one whereby taxa are kept only if they are among the most abundant 10% in at least two samples.
topp(0.10) %>%
filterfun_sample() ->
top_ten
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
genefilter_sample(., top_ten, A = 2) ->
filtered_taxa
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 399 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 399 taxa by 8 taxonomic ranks ]
Heatmaps of taxonomic abundance are easily made by passing a phyloseq
class object to the plot_heatmap()
function. Here, extensive subsetting is done to reduce the amount of OTUs that appear on the y-axis.
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa) %>%
subset_taxa(!is.na(Strain)) %>%
plot_heatmap(., method="PCoA", distance="bray")
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa) %>%
subset_taxa(!is.na(Strain)) ->
unsorted_taxa
par(mar = c(20, 4, 0, 0) + 0.15)
barplot(sort(taxa_sums(unsorted_taxa), TRUE)[1:20] / nsamples(unsorted_taxa),
las = 2)
c("Shannon", "Simpson", "InvSimpson") ->
alpha_measures
unsorted_taxa %>%
plot_richness(., "c_difficile_frequency", "stxab_detected",
measures = alpha_measures)
unsorted_taxa %>%
estimate_richness(., measures = alpha_measures) %>%
pairs()
unsorted_taxa %>%
distance(method="bray") %>%
hclust() %>%
plot(main="Bray-Curtis Dissimilarity", xlab="", sub = "")
unsorted_taxa %>%
ordinate(., method = "PCoA", distance = "bray") ->
ordinated_taxa
unsorted_taxa %>%
plot_ordination(., ordinated_taxa, color="c_difficile_frequency",
title = "Bray-Curtis Principal Coordinates Analysis")
plot_scree(ordinated_taxa)
Authors welcome the addition of new datasets provided they can be or already have been run through the MetaPhlAn2 and HUMAnN2 pipelines. Please read the developer documentation and contact the maintainer if you have a shotgun metagenomic dataset that would be of interest to the Bioconductor community.
Development of the curatedMetagenomicData
package occurs on GitHub and bugs reported via GitHub issues will recieve the quickest response. Please visit the project repository and file an issue should you find one.
Visit the Bioconductor support site at https://support.bioconductor.org/, breifly describe your issue, and add the curatedmetagenomicdata.