library(curatedMetagenomicData)
BioinforMA, February 6th, 2020
curatedMetagenomicDatalibrary(curatedMetagenomicData)
The manually curated metadata for all available samples are provided in a single table called combined_metadata
?combined_metadata View(combined_metadata)
Using the combined_metadata object we can see all the dataset names included in curatedMetagenomicData:
table(combined_metadata$dataset_name)[1:20]
## ## AsnicarF_2017 BackhedF_2015 Bengtsson-PalmeJ_2015 ## 24 381 70 ## BritoIL_2016 Castro-NallarE_2015 ChengpingW_2017 ## 312 32 97 ## ChngKR_2016 CosteaPI_2017 DavidLA_2015 ## 78 279 49 ## DhakanDB_2019 FengQ_2015 FerrettiP_2018 ## 110 154 215 ## GopalakrishnanV_2018 HanniganGD_2017 HansenLBS_2018 ## 25 82 208 ## Heitz-BuschartA_2016 HMP_2012 JieZ_2017 ## 53 749 385 ## KarlssonFH_2013 KieserS_2018 ## 145 27
sort(table(combined_metadata$disease),decreasing=T)[1:5]
## ## healthy CRC T2D ACVD hypertension ## 6229 305 223 214 169
Individual data projects can be fetched via per-dataset functions or the curatedMetagenomicData() function. A function call to a dataset name returns a Bioconductor ExpressionSet object:
loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
loman.eset
## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 736 features, 43 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: OBK1122 OBK1196 ... OBK5066 (43 total) ## varLabels: subjectID body_site ... NCBI_accession (19 total) ## varMetadata: labelDescription ## featureData: none ## experimentData: use 'experimentData(object)' ## pubMedIds: 23571589 ## Annotation:
All datasets are represented as ExpressionSet objects. The exprs() function allows accessing the data (in this case relative abundance):
exprs( loman.eset )[1:6, 1:5] #first 6 rows and 5 columns
## OBK1122 OBK1196 OBK1253 ## k__Bacteria 100.00000 100.00000 100.00000 ## k__Bacteria|p__Bacteroidetes 83.55662 30.41764 79.41203 ## k__Bacteria|p__Firmicutes 15.14819 68.30191 19.28784 ## k__Bacteria|p__Proteobacteria 1.29520 0.06303 0.97218 ## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 83.55662 30.41764 79.41203 ## k__Bacteria|p__Firmicutes|c__Clostridia 14.93141 33.22720 11.26228 ## OBK2535 OBK2535b ## k__Bacteria 99.39953 99.63084 ## k__Bacteria|p__Bacteroidetes 15.14525 7.03101 ## k__Bacteria|p__Firmicutes 25.71252 45.17325 ## k__Bacteria|p__Proteobacteria 58.53298 47.29051 ## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 15.14525 7.03101 ## k__Bacteria|p__Firmicutes|c__Clostridia 16.50313 13.97996
To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.
head( pData( loman.eset ) )
## subjectID body_site antibiotics_current_use study_condition disease ## OBK1122 OBK1122 stool <NA> STEC STEC ## OBK1196 OBK1196 stool <NA> STEC STEC ## OBK1253 OBK1253 stool <NA> STEC STEC ## OBK2535 OBK2535 stool <NA> STEC STEC ## OBK2535b OBK2535b stool <NA> STEC STEC ## OBK2638 OBK2638 stool <NA> STEC STEC ## age_category country non_westernized DNA_extraction_kit number_reads ## OBK1122 adult DEU no Qiagen 464060 ## OBK1196 adult DEU no Qiagen 30181380 ## OBK1253 adult DEU no Qiagen 65704818 ## OBK2535 adult DEU no Qiagen 43597736 ## OBK2535b adult DEU no Qiagen 22253314 ## OBK2638 adult DEU no Qiagen 1105492 ## number_bases minimum_read_length median_read_length days_after_onset ## OBK1122 69609000 150 150 NA ## OBK1196 4527207000 150 150 NA ## OBK1253 9855722700 150 150 NA ## OBK2535 6539660400 150 150 3 ## OBK2535b 3337997100 150 150 NA ## OBK2638 165823800 150 150 5 ## stec_count shigatoxin_2_elisa stool_texture curator ## OBK1122 <NA> <NA> <NA> Paolo_Manghi ## OBK1196 <NA> <NA> <NA> Paolo_Manghi ## OBK1253 <NA> <NA> <NA> Paolo_Manghi ## OBK2535 moderate positive smooth Paolo_Manghi ## OBK2535b <NA> <NA> <NA> Paolo_Manghi ## OBK2638 high positive bloody Paolo_Manghi ## NCBI_accession ## OBK1122 ERR262939;ERR262938 ## OBK1196 ERR262941;ERR262940 ## OBK1253 ERR262942;ERR260476 ## OBK2535 ERR260478;ERR260477 ## OBK2535b ERR260479 ## OBK2638 ERR262943;ERR260480
We can now retrieve all the stool-associated MetaPhlAn2 datasets with the following command (if dryrun=FALSE is set):
esl <- curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)
The following creates a list of two ExpressionSet objects with the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles:
oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
"Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
## Working on BritoIL_2016.metaphlan_bugs_list.oralcavity
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
## Working on Castro-NallarE_2015.metaphlan_bugs_list.oralcavity
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
esl
## List of length 2 ## names(2): BritoIL_2016.metaphlan_bugs_list.oralcavity ...
esl[[1]]
## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 1864 features, 140 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: M1.1.SA M1.10.SA ... WL.9.SA (140 total) ## varLabels: subjectID body_site ... NCBI_accession (17 total) ## varMetadata: labelDescription ## featureData: none ## experimentData: use 'experimentData(object)' ## pubMedIds: 27409808 ## Annotation:
esl[[2]]
## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 755 features, 32 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: ES_001 ES_007 ... ES_211 (32 total) ## varLabels: subjectID body_site ... NCBI_accession (18 total) ## varMetadata: labelDescription ## featureData: none ## experimentData: use 'experimentData(object)' ## pubMedIds: 26336637 ## Annotation:
We can use mergeData to merge them all to a single ExpressionSet object:
eset <- mergeData(esl) eset
## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 1914 features, 172 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.1.SA ## BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.10.SA ... ## Castro-NallarE_2015.metaphlan_bugs_list.oralcavity:ES_211 (172 ## total) ## varLabels: subjectID body_site ... studyID (19 total) ## varMetadata: labelDescription ## featureData: none ## experimentData: use 'experimentData(object)' ## Annotation:
Now we'll analyse E. coli distribution in the Loman dataset we looked at earlier.
We first need to create a vector of E. coli relative abundances. This grep call with a "$" at the end selects the only row that ends with "s__Escherichia_coli":
x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset )), ]
summary( x )
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0000 0.6047 2.3710 13.6897 11.8291 90.4474
Let's now plot a histogram of E. coli relative abundance distribution:
hist( x, xlab = "Relative Abundance", main="E. Coli Relative Abundance",
breaks="FD")
phyloseqphyloseq is another R library that facilitates taxonomic data analysis and visualization.
curatedMetagenomicData provides the ExpressionSet2phyloseq() function to create phyloseq class objects:
Warning: this is only valid for taxonomy, thus MetaPhlAn2 but not HUMAnN2 profiles!
library(phyloseq) loman.pseq = ExpressionSet2phyloseq( loman.eset )
phyloseq objects have 3 components:
loman.pseq
## phyloseq-class experiment-level object ## otu_table() OTU Table: [ 736 taxa and 43 samples ] ## sample_data() Sample Data: [ 43 samples by 19 sample variables ] ## tax_table() Taxonomy Table: [ 736 taxa by 8 taxonomic ranks ]
otu_table() returns the same exprs( loman.eset ) did: the taxonomic abundance table. Here are the first 6 rows and 5 columns:
otu_table( loman.pseq )[1:6, 1:5]
## OTU Table: [6 taxa and 5 samples] ## taxa are rows ## OBK1122 OBK1196 OBK1253 OBK2535 OBK2535b ## k__Bacteria 100.00000 100.00000 100.00000 99.39953 99.63084 ## p__Bacteroidetes 83.55662 30.41764 79.41203 15.14525 7.03101 ## p__Firmicutes 15.14819 68.30191 19.28784 25.71252 45.17325 ## p__Proteobacteria 1.29520 0.06303 0.97218 58.53298 47.29051 ## c__Bacteroidia 83.55662 30.41764 79.41203 15.14525 7.03101 ## c__Clostridia 14.93141 33.22720 11.26228 16.50313 13.97996
The participant data (or metadata) that was available from pData( loman.eset ) is now availble using sample_data() on the phyloseq object:
sample_data( loman.pseq )[1:6, 1:5]
## subjectID body_site antibiotics_current_use study_condition disease ## OBK1122 OBK1122 stool <NA> STEC STEC ## OBK1196 OBK1196 stool <NA> STEC STEC ## OBK1253 OBK1253 stool <NA> STEC STEC ## OBK2535 OBK2535 stool <NA> STEC STEC ## OBK2535b OBK2535b stool <NA> STEC STEC ## OBK2638 OBK2638 stool <NA> STEC STEC
But the phyloseq object also provides information on the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.
head( tax_table( loman.pseq ) )
## Taxonomy Table: [6 taxa by 8 taxonomic ranks]: ## Kingdom Phylum Class Order Family Genus ## k__Bacteria "Bacteria" NA NA NA NA NA ## p__Bacteroidetes "Bacteria" "Bacteroidetes" NA NA NA NA ## p__Firmicutes "Bacteria" "Firmicutes" NA NA NA NA ## p__Proteobacteria "Bacteria" "Proteobacteria" NA NA NA NA ## c__Bacteroidia "Bacteria" "Bacteroidetes" "Bacteroidia" NA NA NA ## c__Clostridia "Bacteria" "Firmicutes" "Clostridia" NA NA NA ## Species Strain ## k__Bacteria NA NA ## p__Bacteroidetes NA NA ## p__Firmicutes NA NA ## p__Proteobacteria NA NA ## c__Bacteroidia NA NA ## c__Clostridia NA NA
The process of subsetting takes advantadge of the taxonomic ranks:
rank_names( loman.pseq )
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species" ## [8] "Strain"
Taxa can be filtered by these rank names. For example, to return an object with only species and strains:
subset_taxa( loman.pseq, !is.na(Species) & is.na(Strain))
## phyloseq-class experiment-level object ## otu_table() OTU Table: [ 287 taxa and 43 samples ] ## sample_data() Sample Data: [ 43 samples by 19 sample variables ] ## tax_table() Taxonomy Table: [ 287 taxa by 8 taxonomic ranks ]
Or to keep only certain taxa (eg: the Bacteroidetes phylum), we can use the prune_taxa function:
loman.bd = prune_taxa('p__Bacteroidetes', loman.pseq)
head( taxa_names( loman.bd ) )
## [1] "p__Bacteroidetes"
Let's now select only the taxa that are among the top 5% in abundance in at least five samples:
keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5) summary(keepotu)
## Mode FALSE TRUE ## logical 624 112
subset_taxa(loman.pseq, keepotu)
## phyloseq-class experiment-level object ## otu_table() OTU Table: [ 112 taxa and 43 samples ] ## sample_data() Sample Data: [ 43 samples by 19 sample variables ] ## tax_table() Taxonomy Table: [ 112 taxa by 8 taxonomic ranks ]
The phyloseq package provides the plot_heatmap() function to create heatmaps using a variety of built-in dissimilarity metrics.
Let's first apply the same abundance filter as above, and keep only species:
loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Species) & is.na(Strain))
We'll plot a heatmap on Bray-Curtis dissimilarity using PCoA for sample ordination:
plot_heatmap(loman.filt, method="PCoA", distance="bray")
We can also plot the top 20 most abundant species in the dataset:
barplot(sort(taxa_sums(loman.filt) / nsamples(loman.filt), TRUE)[1:20],
ylab = "Total counts", las = 2)
The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity indices in the species-level data, stratifying by stool texture:
alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.filt, "stool_texture", measures = alphas)
We can now plot a PCoA of species-level abundances from the Loman dataset, using Bray-Curtis dissimilarity:
ordinated_taxa = ordinate(loman.filt, method="PCoA", distance="bray")
plot_ordination(loman.filt, ordinated_taxa, color="stool_texture",
title = "Bray-Curtis Principal Coordinates Analysis")
Option A:
Think of a research question you would like to answer with cMD - and go for it!
Option B:
Choose a dataset and check if different bodysite or different conditions were analysed
Plot a heatmap of the 10 most abundant species
Choose a species and compare its abundance between the bodysites or conditions
Choose a dataset and check if different bodysite or different conditions were analysed
ds_nielsen <- NielsenHB_2014.metaphlan_bugs_list.stool() table(pData(ds_nielsen)$study_condition)
## ## control IBD ## 248 148
table(pData(ds_nielsen)$bodysite)
## < table of extent 0 >
Plot a heatmap of the 10 most abundant species
ds_nielsen_pseq <- ExpressionSet2phyloseq(ds_nielsen) ds_nielsen_pseq_sp <- subset_taxa(ds_nielsen_pseq, !is.na(Species) & is.na(Strain)) top_10 <- names(sort(taxa_sums(ds_nielsen_pseq_sp), TRUE)[1:10]) ds_nielsen_pseq_sp <- prune_taxa(top_10, ds_nielsen_pseq) plot_heatmap( ds_nielsen_pseq_sp )
Choose a species and compare its abundance between the bodysites or conditions
relab <- exprs(ds_nielsen)[grep("s__Butyrivibrio_crossotus$", rownames(ds_nielsen)), ]
sample_condt <- pData(ds_nielsen)$study_condition
x <- data.frame(relab, sample_condt)
library(ggplot2)
ggplot(x) +
geom_boxplot(aes(sample_condt, relab))
curatedMetagenomicData also includes the three output tables that HUMAnN2 generates when metagenomes are profiled:
LomanNJ_2013.genefamilies_relab.stool()
LomanNJ_2013.pathabundance_relab.stool()
LomanNJ_2013.pathcoverage.stool()
Use the HUMAnN2 tables to obtain more information about the dataset you chose.