BioinforMA, February 6th, 2020

Let's start curatedMetagenomicData

library(curatedMetagenomicData)

Browsing available data

The manually curated metadata for all available samples are provided in a single table called combined_metadata

?combined_metadata
View(combined_metadata)

Browsing available data

Using the combined_metadata object we can see all the dataset names included in curatedMetagenomicData:

table(combined_metadata$dataset_name)[1:20]
## 
##         AsnicarF_2017         BackhedF_2015 Bengtsson-PalmeJ_2015 
##                    24                   381                    70 
##          BritoIL_2016   Castro-NallarE_2015       ChengpingW_2017 
##                   312                    32                    97 
##           ChngKR_2016         CosteaPI_2017          DavidLA_2015 
##                    78                   279                    49 
##         DhakanDB_2019            FengQ_2015        FerrettiP_2018 
##                   110                   154                   215 
##  GopalakrishnanV_2018       HanniganGD_2017        HansenLBS_2018 
##                    25                    82                   208 
##  Heitz-BuschartA_2016              HMP_2012             JieZ_2017 
##                    53                   749                   385 
##       KarlssonFH_2013          KieserS_2018 
##                   145                    27

Exercise: What are the most studied diseases with metagenomics?

Exercise: What are the most studied diseases with metagenomics?

sort(table(combined_metadata$disease),decreasing=T)[1:5]
## 
##      healthy          CRC          T2D         ACVD hypertension 
##         6229          305          223          214          169

Accessing datasets

Individual data projects can be fetched via per-dataset functions or the curatedMetagenomicData() function. A function call to a dataset name returns a Bioconductor ExpressionSet object:

loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache

Accessing datasets

loman.eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 736 features, 43 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: OBK1122 OBK1196 ... OBK5066 (43 total)
##   varLabels: subjectID body_site ... NCBI_accession (19 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23571589 
## Annotation:

Using ExpressionSet Objects

All datasets are represented as ExpressionSet objects. The exprs() function allows accessing the data (in this case relative abundance):

exprs( loman.eset )[1:6, 1:5]  #first 6 rows and 5 columns
##                                               OBK1122   OBK1196   OBK1253
## k__Bacteria                                 100.00000 100.00000 100.00000
## k__Bacteria|p__Bacteroidetes                 83.55662  30.41764  79.41203
## k__Bacteria|p__Firmicutes                    15.14819  68.30191  19.28784
## k__Bacteria|p__Proteobacteria                 1.29520   0.06303   0.97218
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia  83.55662  30.41764  79.41203
## k__Bacteria|p__Firmicutes|c__Clostridia      14.93141  33.22720  11.26228
##                                              OBK2535 OBK2535b
## k__Bacteria                                 99.39953 99.63084
## k__Bacteria|p__Bacteroidetes                15.14525  7.03101
## k__Bacteria|p__Firmicutes                   25.71252 45.17325
## k__Bacteria|p__Proteobacteria               58.53298 47.29051
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 15.14525  7.03101
## k__Bacteria|p__Firmicutes|c__Clostridia     16.50313 13.97996

Using ExpressionSet Objects

To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.

head( pData( loman.eset ) )
##          subjectID body_site antibiotics_current_use study_condition disease
## OBK1122    OBK1122     stool                    <NA>            STEC    STEC
## OBK1196    OBK1196     stool                    <NA>            STEC    STEC
## OBK1253    OBK1253     stool                    <NA>            STEC    STEC
## OBK2535    OBK2535     stool                    <NA>            STEC    STEC
## OBK2535b  OBK2535b     stool                    <NA>            STEC    STEC
## OBK2638    OBK2638     stool                    <NA>            STEC    STEC
##          age_category country non_westernized DNA_extraction_kit number_reads
## OBK1122         adult     DEU              no             Qiagen       464060
## OBK1196         adult     DEU              no             Qiagen     30181380
## OBK1253         adult     DEU              no             Qiagen     65704818
## OBK2535         adult     DEU              no             Qiagen     43597736
## OBK2535b        adult     DEU              no             Qiagen     22253314
## OBK2638         adult     DEU              no             Qiagen      1105492
##          number_bases minimum_read_length median_read_length days_after_onset
## OBK1122      69609000                 150                150               NA
## OBK1196    4527207000                 150                150               NA
## OBK1253    9855722700                 150                150               NA
## OBK2535    6539660400                 150                150                3
## OBK2535b   3337997100                 150                150               NA
## OBK2638     165823800                 150                150                5
##          stec_count shigatoxin_2_elisa stool_texture      curator
## OBK1122        <NA>               <NA>          <NA> Paolo_Manghi
## OBK1196        <NA>               <NA>          <NA> Paolo_Manghi
## OBK1253        <NA>               <NA>          <NA> Paolo_Manghi
## OBK2535    moderate           positive        smooth Paolo_Manghi
## OBK2535b       <NA>               <NA>          <NA> Paolo_Manghi
## OBK2638        high           positive        bloody Paolo_Manghi
##               NCBI_accession
## OBK1122  ERR262939;ERR262938
## OBK1196  ERR262941;ERR262940
## OBK1253  ERR262942;ERR260476
## OBK2535  ERR260478;ERR260477
## OBK2535b           ERR260479
## OBK2638  ERR262943;ERR260480

Using ExpressionSet Objects

We can now retrieve all the stool-associated MetaPhlAn2 datasets with the following command (if dryrun=FALSE is set):

esl <- curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)

Using ExpressionSet Objects

The following creates a list of two ExpressionSet objects with the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles:

oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
## Working on BritoIL_2016.metaphlan_bugs_list.oralcavity
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
## Working on Castro-NallarE_2015.metaphlan_bugs_list.oralcavity
## snapshotDate(): 2019-10-22
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache
esl
## List of length 2
## names(2): BritoIL_2016.metaphlan_bugs_list.oralcavity ...
esl[[1]]
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1864 features, 140 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: M1.1.SA M1.10.SA ... WL.9.SA (140 total)
##   varLabels: subjectID body_site ... NCBI_accession (17 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 27409808 
## Annotation:
esl[[2]]
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 755 features, 32 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: ES_001 ES_007 ... ES_211 (32 total)
##   varLabels: subjectID body_site ... NCBI_accession (18 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 26336637 
## Annotation:

Using ExpressionSet Objects

We can use mergeData to merge them all to a single ExpressionSet object:

eset <- mergeData(esl)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1914 features, 172 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.1.SA
##     BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.10.SA ...
##     Castro-NallarE_2015.metaphlan_bugs_list.oralcavity:ES_211 (172
##     total)
##   varLabels: subjectID body_site ... studyID (19 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

Escherichia coli distribution

Now we'll analyse E. coli distribution in the Loman dataset we looked at earlier.

We first need to create a vector of E. coli relative abundances. This grep call with a "$" at the end selects the only row that ends with "s__Escherichia_coli":

x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset )), ]
summary( x )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6047  2.3710 13.6897 11.8291 90.4474

Escherichia coli distribution

Let's now plot a histogram of E. coli relative abundance distribution:

hist( x, xlab = "Relative Abundance", main="E. Coli Relative Abundance",
      breaks="FD")

Taxonomic analysis with phyloseq

phyloseq is another R library that facilitates taxonomic data analysis and visualization.

curatedMetagenomicData provides the ExpressionSet2phyloseq() function to create phyloseq class objects:

Warning: this is only valid for taxonomy, thus MetaPhlAn2 but not HUMAnN2 profiles!

library(phyloseq)
loman.pseq = ExpressionSet2phyloseq( loman.eset )

Components of a phyloseq object

phyloseq objects have 3 components:

loman.pseq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 736 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 736 taxa by 8 taxonomic ranks ]

Components of a phyloseq object

otu_table() returns the same exprs( loman.eset ) did: the taxonomic abundance table. Here are the first 6 rows and 5 columns:

otu_table( loman.pseq )[1:6, 1:5]
## OTU Table:          [6 taxa and 5 samples]
##                      taxa are rows
##                     OBK1122   OBK1196   OBK1253  OBK2535 OBK2535b
## k__Bacteria       100.00000 100.00000 100.00000 99.39953 99.63084
## p__Bacteroidetes   83.55662  30.41764  79.41203 15.14525  7.03101
## p__Firmicutes      15.14819  68.30191  19.28784 25.71252 45.17325
## p__Proteobacteria   1.29520   0.06303   0.97218 58.53298 47.29051
## c__Bacteroidia     83.55662  30.41764  79.41203 15.14525  7.03101
## c__Clostridia      14.93141  33.22720  11.26228 16.50313 13.97996

Components of a phyloseq object

The participant data (or metadata) that was available from pData( loman.eset ) is now availble using sample_data() on the phyloseq object:

sample_data( loman.pseq )[1:6, 1:5]
##          subjectID body_site antibiotics_current_use study_condition disease
## OBK1122    OBK1122     stool                    <NA>            STEC    STEC
## OBK1196    OBK1196     stool                    <NA>            STEC    STEC
## OBK1253    OBK1253     stool                    <NA>            STEC    STEC
## OBK2535    OBK2535     stool                    <NA>            STEC    STEC
## OBK2535b  OBK2535b     stool                    <NA>            STEC    STEC
## OBK2638    OBK2638     stool                    <NA>            STEC    STEC

Components of a phyloseq object

But the phyloseq object also provides information on the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.

head( tax_table( loman.pseq ) )
## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##                   Kingdom    Phylum           Class         Order Family Genus
## k__Bacteria       "Bacteria" NA               NA            NA    NA     NA   
## p__Bacteroidetes  "Bacteria" "Bacteroidetes"  NA            NA    NA     NA   
## p__Firmicutes     "Bacteria" "Firmicutes"     NA            NA    NA     NA   
## p__Proteobacteria "Bacteria" "Proteobacteria" NA            NA    NA     NA   
## c__Bacteroidia    "Bacteria" "Bacteroidetes"  "Bacteroidia" NA    NA     NA   
## c__Clostridia     "Bacteria" "Firmicutes"     "Clostridia"  NA    NA     NA   
##                   Species Strain
## k__Bacteria       NA      NA    
## p__Bacteroidetes  NA      NA    
## p__Firmicutes     NA      NA    
## p__Proteobacteria NA      NA    
## c__Bacteroidia    NA      NA    
## c__Clostridia     NA      NA

Subsetting / Pruning

The process of subsetting takes advantadge of the taxonomic ranks:

rank_names( loman.pseq )
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
## [8] "Strain"

Taxa can be filtered by these rank names. For example, to return an object with only species and strains:

subset_taxa( loman.pseq, !is.na(Species) & is.na(Strain))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 287 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 287 taxa by 8 taxonomic ranks ]

Subsetting / Pruning

Or to keep only certain taxa (eg: the Bacteroidetes phylum), we can use the prune_taxa function:

loman.bd = prune_taxa('p__Bacteroidetes', loman.pseq)
head( taxa_names( loman.bd ) )
## [1] "p__Bacteroidetes"

Advanced Pruning

Let's now select only the taxa that are among the top 5% in abundance in at least five samples:

keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)
##    Mode   FALSE    TRUE 
## logical     624     112
subset_taxa(loman.pseq, keepotu)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 112 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 112 taxa by 8 taxonomic ranks ]

Taxonomy Heatmap

The phyloseq package provides the plot_heatmap() function to create heatmaps using a variety of built-in dissimilarity metrics.

Let's first apply the same abundance filter as above, and keep only species:

loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Species) & is.na(Strain))

Taxonomy Heatmap

We'll plot a heatmap on Bray-Curtis dissimilarity using PCoA for sample ordination:

plot_heatmap(loman.filt, method="PCoA", distance="bray")

Taxonomy Histogram

We can also plot the top 20 most abundant species in the dataset:

barplot(sort(taxa_sums(loman.filt) / nsamples(loman.filt), TRUE)[1:20],
        ylab = "Total counts", las = 2)

Alpha Diversity Estimation

The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity indices in the species-level data, stratifying by stool texture:

alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.filt, "stool_texture", measures = alphas)

Ordination Analysis

We can now plot a PCoA of species-level abundances from the Loman dataset, using Bray-Curtis dissimilarity:

ordinated_taxa = ordinate(loman.filt, method="PCoA", distance="bray")
plot_ordination(loman.filt, ordinated_taxa, color="stool_texture", 
                title = "Bray-Curtis Principal Coordinates Analysis")

Bonus Exercises

Option A:

Think of a research question you would like to answer with cMD - and go for it!

Option B:

Choose a dataset and check if different bodysite or different conditions were analysed

Plot a heatmap of the 10 most abundant species

Choose a species and compare its abundance between the bodysites or conditions

Option B:

Choose a dataset and check if different bodysite or different conditions were analysed

ds_nielsen <-  NielsenHB_2014.metaphlan_bugs_list.stool()
table(pData(ds_nielsen)$study_condition)
## 
## control     IBD 
##     248     148
table(pData(ds_nielsen)$bodysite)
## < table of extent 0 >

Option B:

Plot a heatmap of the 10 most abundant species

ds_nielsen_pseq <- ExpressionSet2phyloseq(ds_nielsen)
ds_nielsen_pseq_sp <- subset_taxa(ds_nielsen_pseq, !is.na(Species) & is.na(Strain))
top_10 <- names(sort(taxa_sums(ds_nielsen_pseq_sp), TRUE)[1:10])
ds_nielsen_pseq_sp <- prune_taxa(top_10, ds_nielsen_pseq)
plot_heatmap( ds_nielsen_pseq_sp )

Option B:

Choose a species and compare its abundance between the bodysites or conditions

relab <- exprs(ds_nielsen)[grep("s__Butyrivibrio_crossotus$", rownames(ds_nielsen)), ]
sample_condt <- pData(ds_nielsen)$study_condition
x <- data.frame(relab, sample_condt)

library(ggplot2)
ggplot(x) +
    geom_boxplot(aes(sample_condt, relab))

Bonus: HUMANn2 Resources

curatedMetagenomicData also includes the three output tables that HUMAnN2 generates when metagenomes are profiled:

  • Abundance of gene families
LomanNJ_2013.genefamilies_relab.stool()
  • Metabolic pathway coverage
LomanNJ_2013.pathabundance_relab.stool()
  • Metabolic pathway abundance
LomanNJ_2013.pathcoverage.stool()

Bonus exercise:

Use the HUMAnN2 tables to obtain more information about the dataset you chose.