curatedMetagenomicData Microbiome taxonomic and functional information from metagenomic datasets

BioinforMA, February 6th, 2020

Let's start `curatedMetagenomicData`

library(curatedMetagenomicData)

Browsing available data

The manually curated metadata for all available samples are provided in a single table called combined_metadata

?combined_metadata
View(combined_metadata)

Browsing available data

Using the combined_metadata object we can see all the dataset names included in curatedMetagenomicData:

table(combined_metadata$dataset_name)[1:20]

## 
##         AsnicarF_2017         BackhedF_2015 Bengtsson-PalmeJ_2015 
##                    24                   381                    70 
##          BritoIL_2016   Castro-NallarE_2015       ChengpingW_2017 
##                   312                    32                    97 
##           ChngKR_2016         CosteaPI_2017          DavidLA_2015 
##                    78                   279                    49 
##         DhakanDB_2019            FengQ_2015        FerrettiP_2018 
##                   110                   154                   215 
##  GopalakrishnanV_2018       HanniganGD_2017        HansenLBS_2018 
##                    25                    82                   208 
##  Heitz-BuschartA_2016              HMP_2012             JieZ_2017 
##                    53                   749                   385 
##       KarlssonFH_2013          KieserS_2018 
##                   145                    27

Exercise: What are the most studied diseases with metagenomics?

sort(table(combined_metadata$disease),decreasing=T)[1:5]

## 
##      healthy          CRC          T2D         ACVD hypertension 
##         6229          305          223          214          169

Accessing datasets

Individual data projects can be fetched via per-dataset functions or the curatedMetagenomicData() function. A function call to a dataset name returns a Bioconductor ExpressionSet object:

loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()

## snapshotDate(): 2019-10-22

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## loading from cache

Accessing datasets

loman.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 736 features, 43 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: OBK1122 OBK1196 ... OBK5066 (43 total)
##   varLabels: subjectID body_site ... NCBI_accession (19 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23571589 
## Annotation:

Using ExpressionSet Objects

All datasets are represented as ExpressionSet objects. The exprs() function allows accessing the data (in this case relative abundance):

exprs( loman.eset )[1:6, 1:5]  #first 6 rows and 5 columns

##                                               OBK1122   OBK1196   OBK1253
## k__Bacteria                                 100.00000 100.00000 100.00000
## k__Bacteria|p__Bacteroidetes                 83.55662  30.41764  79.41203
## k__Bacteria|p__Firmicutes                    15.14819  68.30191  19.28784
## k__Bacteria|p__Proteobacteria                 1.29520   0.06303   0.97218
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia  83.55662  30.41764  79.41203
## k__Bacteria|p__Firmicutes|c__Clostridia      14.93141  33.22720  11.26228
##                                              OBK2535 OBK2535b
## k__Bacteria                                 99.39953 99.63084
## k__Bacteria|p__Bacteroidetes                15.14525  7.03101
## k__Bacteria|p__Firmicutes                   25.71252 45.17325
## k__Bacteria|p__Proteobacteria               58.53298 47.29051
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 15.14525  7.03101
## k__Bacteria|p__Firmicutes|c__Clostridia     16.50313 13.97996

Using ExpressionSet Objects

To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.

head( pData( loman.eset ) )

##          subjectID body_site antibiotics_current_use study_condition disease
## OBK1122    OBK1122     stool                    <NA>            STEC    STEC
## OBK1196    OBK1196     stool                    <NA>            STEC    STEC
## OBK1253    OBK1253     stool                    <NA>            STEC    STEC
## OBK2535    OBK2535     stool                    <NA>            STEC    STEC
## OBK2535b  OBK2535b     stool                    <NA>            STEC    STEC
## OBK2638    OBK2638     stool                    <NA>            STEC    STEC
##          age_category country non_westernized DNA_extraction_kit number_reads
## OBK1122         adult     DEU              no             Qiagen       464060
## OBK1196         adult     DEU              no             Qiagen     30181380
## OBK1253         adult     DEU              no             Qiagen     65704818
## OBK2535         adult     DEU              no             Qiagen     43597736
## OBK2535b        adult     DEU              no             Qiagen     22253314
## OBK2638         adult     DEU              no             Qiagen      1105492
##          number_bases minimum_read_length median_read_length days_after_onset
## OBK1122      69609000                 150                150               NA
## OBK1196    4527207000                 150                150               NA
## OBK1253    9855722700                 150                150               NA
## OBK2535    6539660400                 150                150                3
## OBK2535b   3337997100                 150                150               NA
## OBK2638     165823800                 150                150                5
##          stec_count shigatoxin_2_elisa stool_texture      curator
## OBK1122        <NA>               <NA>          <NA> Paolo_Manghi
## OBK1196        <NA>               <NA>          <NA> Paolo_Manghi
## OBK1253        <NA>               <NA>          <NA> Paolo_Manghi
## OBK2535    moderate           positive        smooth Paolo_Manghi
## OBK2535b       <NA>               <NA>          <NA> Paolo_Manghi
## OBK2638        high           positive        bloody Paolo_Manghi
##               NCBI_accession
## OBK1122  ERR262939;ERR262938
## OBK1196  ERR262941;ERR262940
## OBK1253  ERR262942;ERR260476
## OBK2535  ERR260478;ERR260477
## OBK2535b           ERR260479
## OBK2638  ERR262943;ERR260480

Using ExpressionSet Objects

We can now retrieve all the stool-associated MetaPhlAn2 datasets with the following command (if dryrun=FALSE is set):

esl <- curatedMetagenomicData("*metaphlan_bugs_list.stool*", dryrun = TRUE)

Using ExpressionSet Objects

The following creates a list of two ExpressionSet objects with the BritoIL_2016 and Castro-NallarE_2015 oral cavity taxonomic profiles:

oral <- c("BritoIL_2016.metaphlan_bugs_list.oralcavity",
          "Castro-NallarE_2015.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)

## Working on BritoIL_2016.metaphlan_bugs_list.oralcavity

## snapshotDate(): 2019-10-22

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## loading from cache

## Working on Castro-NallarE_2015.metaphlan_bugs_list.oralcavity

## snapshotDate(): 2019-10-22

## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation

## loading from cache

esl

## List of length 2
## names(2): BritoIL_2016.metaphlan_bugs_list.oralcavity ...

esl[[1]]

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1864 features, 140 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: M1.1.SA M1.10.SA ... WL.9.SA (140 total)
##   varLabels: subjectID body_site ... NCBI_accession (17 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 27409808 
## Annotation:

esl[[2]]

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 755 features, 32 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: ES_001 ES_007 ... ES_211 (32 total)
##   varLabels: subjectID body_site ... NCBI_accession (18 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 26336637 
## Annotation:

Using ExpressionSet Objects

We can use mergeData to merge them all to a single ExpressionSet object:

eset <- mergeData(esl)
eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1914 features, 172 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.1.SA
##     BritoIL_2016.metaphlan_bugs_list.oralcavity:M1.10.SA ...
##     Castro-NallarE_2015.metaphlan_bugs_list.oralcavity:ES_211 (172
##     total)
##   varLabels: subjectID body_site ... studyID (19 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

Escherichia coli distribution

Now we'll analyse E. coli distribution in the Loman dataset we looked at earlier.

We first need to create a vector of E. coli relative abundances. This grep call with a "$" at the end selects the only row that ends with "s__Escherichia_coli":

x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset )), ]
summary( x )

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6047  2.3710 13.6897 11.8291 90.4474

Escherichia coli distribution

Let's now plot a histogram of E. coli relative abundance distribution:

hist( x, xlab = "Relative Abundance", main="E. Coli Relative Abundance",
      breaks="FD")

Taxonomic analysis with `phyloseq`

phyloseq is another R library that facilitates taxonomic data analysis and visualization.

curatedMetagenomicData provides the ExpressionSet2phyloseq() function to create phyloseq class objects:

Warning: this is only valid for taxonomy, thus MetaPhlAn2 but not HUMAnN2 profiles!

library(phyloseq)
loman.pseq = ExpressionSet2phyloseq( loman.eset )

Components of a phyloseq object

phyloseq objects have 3 components:

loman.pseq

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 736 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 736 taxa by 8 taxonomic ranks ]

Components of a phyloseq object

otu_table() returns the same exprs( loman.eset ) did: the taxonomic abundance table. Here are the first 6 rows and 5 columns:

otu_table( loman.pseq )[1:6, 1:5]

## OTU Table:          [6 taxa and 5 samples]
##                      taxa are rows
##                     OBK1122   OBK1196   OBK1253  OBK2535 OBK2535b
## k__Bacteria       100.00000 100.00000 100.00000 99.39953 99.63084
## p__Bacteroidetes   83.55662  30.41764  79.41203 15.14525  7.03101
## p__Firmicutes      15.14819  68.30191  19.28784 25.71252 45.17325
## p__Proteobacteria   1.29520   0.06303   0.97218 58.53298 47.29051
## c__Bacteroidia     83.55662  30.41764  79.41203 15.14525  7.03101
## c__Clostridia      14.93141  33.22720  11.26228 16.50313 13.97996

Components of a phyloseq object

The participant data (or metadata) that was available from pData( loman.eset ) is now availble using sample_data() on the phyloseq object:

sample_data( loman.pseq )[1:6, 1:5]

##          subjectID body_site antibiotics_current_use study_condition disease
## OBK1122    OBK1122     stool                    <NA>            STEC    STEC
## OBK1196    OBK1196     stool                    <NA>            STEC    STEC
## OBK1253    OBK1253     stool                    <NA>            STEC    STEC
## OBK2535    OBK2535     stool                    <NA>            STEC    STEC
## OBK2535b  OBK2535b     stool                    <NA>            STEC    STEC
## OBK2638    OBK2638     stool                    <NA>            STEC    STEC

Components of a phyloseq object

But the phyloseq object also provides information on the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.

head( tax_table( loman.pseq ) )

## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##                   Kingdom    Phylum           Class         Order Family Genus
## k__Bacteria       "Bacteria" NA               NA            NA    NA     NA   
## p__Bacteroidetes  "Bacteria" "Bacteroidetes"  NA            NA    NA     NA   
## p__Firmicutes     "Bacteria" "Firmicutes"     NA            NA    NA     NA   
## p__Proteobacteria "Bacteria" "Proteobacteria" NA            NA    NA     NA   
## c__Bacteroidia    "Bacteria" "Bacteroidetes"  "Bacteroidia" NA    NA     NA   
## c__Clostridia     "Bacteria" "Firmicutes"     "Clostridia"  NA    NA     NA   
##                   Species Strain
## k__Bacteria       NA      NA    
## p__Bacteroidetes  NA      NA    
## p__Firmicutes     NA      NA    
## p__Proteobacteria NA      NA    
## c__Bacteroidia    NA      NA    
## c__Clostridia     NA      NA

Subsetting / Pruning

The process of subsetting takes advantadge of the taxonomic ranks:

rank_names( loman.pseq )

## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
## [8] "Strain"

Taxa can be filtered by these rank names. For example, to return an object with only species and strains:

subset_taxa( loman.pseq, !is.na(Species) & is.na(Strain))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 287 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 287 taxa by 8 taxonomic ranks ]

Subsetting / Pruning

Or to keep only certain taxa (eg: the Bacteroidetes phylum), we can use the prune_taxa function:

loman.bd = prune_taxa('p__Bacteroidetes', loman.pseq)
head( taxa_names( loman.bd ) )

## [1] "p__Bacteroidetes"

Advanced Pruning

Let's now select only the taxa that are among the top 5% in abundance in at least five samples:

keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)

##    Mode   FALSE    TRUE 
## logical     624     112

subset_taxa(loman.pseq, keepotu)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 112 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 112 taxa by 8 taxonomic ranks ]

Taxonomy Heatmap

The phyloseq package provides the plot_heatmap() function to create heatmaps using a variety of built-in dissimilarity metrics.

Let's first apply the same abundance filter as above, and keep only species:

loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Species) & is.na(Strain))

Taxonomy Heatmap

We'll plot a heatmap on Bray-Curtis dissimilarity using PCoA for sample ordination:

plot_heatmap(loman.filt, method="PCoA", distance="bray")

Taxonomy Histogram

We can also plot the top 20 most abundant species in the dataset:

barplot(sort(taxa_sums(loman.filt) / nsamples(loman.filt), TRUE)[1:20],
        ylab = "Total counts", las = 2)

Alpha Diversity Estimation

The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity indices in the species-level data, stratifying by stool texture:

alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.filt, "stool_texture", measures = alphas)

Ordination Analysis

We can now plot a PCoA of species-level abundances from the Loman dataset, using Bray-Curtis dissimilarity:

ordinated_taxa = ordinate(loman.filt, method="PCoA", distance="bray")
plot_ordination(loman.filt, ordinated_taxa, color="stool_texture", 
                title = "Bray-Curtis Principal Coordinates Analysis")

Bonus Exercises

Option A:

Think of a research question you would like to answer with cMD - and go for it!

Option B:

Choose a dataset and check if different bodysite or different conditions were analysed

Plot a heatmap of the 10 most abundant species

Choose a species and compare its abundance between the bodysites or conditions

Option B:

Choose a dataset and check if different bodysite or different conditions were analysed

ds_nielsen <-  NielsenHB_2014.metaphlan_bugs_list.stool()
table(pData(ds_nielsen)$study_condition)

## 
## control     IBD 
##     248     148

table(pData(ds_nielsen)$bodysite)

## < table of extent 0 >

Option B:

Plot a heatmap of the 10 most abundant species

ds_nielsen_pseq <- ExpressionSet2phyloseq(ds_nielsen)
ds_nielsen_pseq_sp <- subset_taxa(ds_nielsen_pseq, !is.na(Species) & is.na(Strain))
top_10 <- names(sort(taxa_sums(ds_nielsen_pseq_sp), TRUE)[1:10])
ds_nielsen_pseq_sp <- prune_taxa(top_10, ds_nielsen_pseq)
plot_heatmap( ds_nielsen_pseq_sp )

Option B:

Choose a species and compare its abundance between the bodysites or conditions

relab <- exprs(ds_nielsen)[grep("s__Butyrivibrio_crossotus$", rownames(ds_nielsen)), ]
sample_condt <- pData(ds_nielsen)$study_condition
x <- data.frame(relab, sample_condt)

library(ggplot2)
ggplot(x) +
    geom_boxplot(aes(sample_condt, relab))

Bonus: HUMANn2 Resources

curatedMetagenomicData also includes the three output tables that HUMAnN2 generates when metagenomes are profiled:

Abundance of gene families

LomanNJ_2013.genefamilies_relab.stool()

Metabolic pathway coverage

LomanNJ_2013.pathabundance_relab.stool()

Metabolic pathway abundance

LomanNJ_2013.pathcoverage.stool()

Bonus exercise:

Use the HUMAnN2 tables to obtain more information about the dataset you chose.

Let's start curatedMetagenomicData

Browsing available data

Browsing available data

Exercise: What are the most studied diseases with metagenomics?

Exercise: What are the most studied diseases with metagenomics?

Accessing datasets

Accessing datasets

Using ExpressionSet Objects

Using ExpressionSet Objects

Using ExpressionSet Objects

Using ExpressionSet Objects

Using ExpressionSet Objects

Escherichia coli distribution

Escherichia coli distribution

Taxonomic analysis with phyloseq

Components of a phyloseq object

Components of a phyloseq object

Components of a phyloseq object

Components of a phyloseq object

Subsetting / Pruning

Subsetting / Pruning

Advanced Pruning

Taxonomy Heatmap

Taxonomy Heatmap

Taxonomy Histogram

Alpha Diversity Estimation

Ordination Analysis

Bonus Exercises

Option B:

Option B:

Option B:

Bonus: HUMANn2 Resources

Bonus exercise:

Let's start `curatedMetagenomicData`

Taxonomic analysis with `phyloseq`