Using curatedMetagenomicData Resources

Use of the resources in curatedMetagenomicData is simplified with the use of Bioconductor’s ExperimentHub platform, which allows for the accessing of data through an intuitive interface. First, curatedMetagenomicData is installed using BiocInstaller and then called as a library - the process allows for the user to simply call datasets as functions because the package is aware of the resources present in ExperimentHub S3 buckets.

# BiocInstaller::biocLite("curatedMetagenomicData")  #Bioconductor version
# BiocInstaller::biocLite("waldronlab/curatedMetagenomicData")  #bleeding edge version
suppressPackageStartupMessages(library(curatedMetagenomicData))

Available samples and metadata

The manually curated metadata for all available samples are provided in a single table combined_metadata:

?combined_metadata
View(combined_metadata)
table(combined_metadata$antibiotics_current_use)
table(combined_metadata$disease)

Accessing datasets

Individual data projects can be fetched via per-dataset functions or the curatedMetagenomicData() function. A function call to a dataset name returns a Bioconductor ExpressionSet object:

loman.eset = LomanNJ_2013.metaphlan_bugs_list.stool()

See all available datasets (note that the default dryrun=TRUE prevents these from actually being downloaded:

curatedMetagenomicData("*")

All metaphlan datasets:

curatedMetagenomicData("*metaphlan*")

All HMP datasets:

curatedMetagenomicData("HMP*")

All HMP metaphlan bugs datasets:

curatedMetagenomicData("HMP_2012.metaphlan*")

Or to actually download these:

hmp.esetlist = curatedMetagenomicData("HMP_2012.metaphlan*", dryrun=FALSE)
hmp.esetlist

To download them as phyloseq objects, with relative abundances multiplied by sequencing depth:

hmp.esetlist = curatedMetagenomicData("HMP_2012.metaphlan*", dryrun=FALSE,
                                     counts=TRUE, bugs.as.phyloseq = TRUE )

Previous versions can be specified once there is more than one version:

cmdValidVersions()

Using ExpressionSet Objects

All datasets are represented as ExpressionSet objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the experimentData() function is used to return a MIAME (Minimum Information About a Microarray Experiment) object.

experimentData( loman.eset )

To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.

head( pData( loman.eset ) )

To access the data itself (in this case relative abundance), the exprs() function returns a variables by samples (rows by columns) numeric matrix. Note the presence of “synthetic” clades at all levels of the taxonomy, starting with kingdom, e.g. k__bacteria here:

exprs( loman.eset )[1:6, 1:5]  #first 6 rows and 5 columns

Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.

E. coli prevalence

Here’s a direct, exploratory analysis of E. coli prevalence in the Loman dataset using the ExpressionSet object. More elegant solutions will be provided later using subsetting methods provided by the phyloseq package, but for users familiar with grep() and the ExpressionSet object, such manual methods may suffice.

First, which E. coli-related taxa are available?

grep("coli", rownames(loman.eset), value=TRUE)

Create a vector of E. coli relative abundances. This grep call with a “$” at the end selects the only row that ends with “s__Escherichia_coli“:

x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset)), ]
summary( x )

This could be plotted as a histogram:

hist( x, xlab = "Relative Abundance", main="Prevalence of E. Coli",
      breaks="FD")

Taxonomy-Aware Analysis using phyloseq

For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of taxonomy-aware, ecological analysis and plotting by conversion to a phyloseq class object. curatedMetagenomicData provides the ExpressionSet2phyloseq() function to make this easy:

suppressPackageStartupMessages(library(phyloseq))
loman.pseq = ExpressionSet2phyloseq( loman.eset )

Estimating Absolute Raw Count Data

Absolute raw count data can be estimated from the relative count data by multiplying the columns of the ExpressionSet data by the number of reads for each sample, as found in the pData column “number_reads”. You could do this manually using the sweep() function (dividing by 100 to convert relative abundance percentages to fractions):

loman.counts = sweep(exprs( loman.eset ) / 100, 2, loman.eset$number_reads, "*")
loman.counts[1:6, 1:5]

or just set the relab argument in Expressionset2phyloseq() to FALSE:

loman.pseq2 = ExpressionSet2phyloseq( loman.eset, relab=FALSE )
otu_table( loman.pseq2 )[1:6, 1:5]

Note the simplified row names of the OTU table, resulting from the default argument simplify=TRUE. The simplified names are convenient, and lossless because taxonomic information is now attainable by tax_table(loman.pseq2)

Components of a phyloseq object

This phyloseq objects contain 3 components, with extractor functions hinted at by its show method:

loman.pseq

otu_table() returns the same thing as exprs( loman.eset ) did, the Operational Taxanomic Unit (OTU) table. Here are the first 6 rows and 5 columns:

otu_table( loman.pseq )[1:6, 1:5]

The same patient or participant data that was available from pData( loman.eset ) is now availble using sample_data() on the phyloseq object:

sample_data( loman.pseq )[1:6, 1:5]

But this object also is aware of the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.

head( tax_table( loman.pseq ) )

Subsetting / Pruning

The process of subsetting begins with the names of phylogenetic ranks:

rank_names( loman.pseq )

Taxa can be filtered by these rank names. For example, to return an object with only species and strains:

subset_taxa( loman.pseq, !is.na(Species))

To keep only phylum-level data (not class or finer, and not kingdom-level):

subset_taxa( loman.pseq, is.na(Class) & !is.na(Phylum))

Or to keep only Bacteroidetes phylum. Note that taxa names have been shortened from the rownames of the ExpressionSet object, for nicer plotting.

loman.bd = subset_taxa( loman.pseq, Phylum == "Bacteroidetes")
head( taxa_names( loman.bd ) )

Advanced Pruning

The phyloseq package provides advanced pruning of taxa, such as the following which keeps only taxa that are among the most abundant 5% in at least five samples:

keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)
subset_taxa(loman.pseq, keepotu)

Note that phyloseq also provides topk() for selecting the most abundant k taxa, and other functions for advanced pruning of taxa.

Taxonomy Heatmap

The phyloseq package provides the plot_heatmap() function to create heatmaps using a variety of built-in dissimilarity metrics for clustering. Here, we apply the same abundance filter as above, keep only strain-level OTUs. This function supports a large number of distance and ordination methods, here we use Bray-Curtis dissimilarity for distance and PCoA as the ordination method for organizing the heatmap.

loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Strain))
plot_heatmap(loman.filt, method="PCoA", distance="bray")

Taxonomy Histogram

Here we plot the top 20 most abundant species (not strains), defined by the sum of abundance across all samples in the dataset:

loman.sp = subset_taxa(loman.pseq, !is.na(Species) & is.na(Strain))
par(mar = c(20, 4, 0, 0) + 0.15) #increase margin size on the bottom
barplot(sort(taxa_sums(loman.sp), TRUE)[1:20] / nsamples(loman.sp),
        ylab = "Total counts", las = 2)

Alpha Diversity Estimation

The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity in the species-level data, stratifying by stool texture:

alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.sp, "stool_texture", measures = alphas)

Let’s compare these three alpha diversity measures:

pairs( estimate_richness(loman.sp, measures = alphas) )

Beta Diversity / Dissimilarity Clustering

Numerous beta diversity / dissimilarity are provided by the distance() function when provided a phyloseq object, and these can be used for any kind of clustering or classification scheme. For example, here is a hierarchical clustering dendrogram produced by the hclust() from the base R stats package with “Ward” linkage:

mydist = distance(loman.sp, method="bray")
myhclust = hclust( mydist )
plot(myhclust, main="Bray-Curtis Dissimilarity", 
     method="ward.D", xlab="Samples", sub = "")

Ordination Analysis

The phyloseq package provides a variety of ordination methods, with convenient options for labelling points. Here is a Principal Coordinates Analysis plot of species-level taxa from the Loman dataset, using Bray-Curtis distance:

ordinated_taxa = ordinate(loman.sp, method="PCoA", distance="bray")
plot_ordination(loman.sp, ordinated_taxa, color="stool_texture", 
                title = "Bray-Curtis Principal Coordinates Analysis")
plot_scree(ordinated_taxa, title="Screeplot")

Example: Rampelli Africa dataset

Rampelli S et al.: Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. Biol. 2015, 25:1682–1693.

library(curatedMetagenomicData)
library(curatedMetagenomicData)
Rampelli = curatedMetagenomicData("RampelliS_2015.metaphlan_bugs_list.stool", 
                       dryrun = FALSE, counts = FALSE,
                       bugs.as.phyloseq = TRUE)[[1]]
summary(otu_table(Rampelli))
summary(sample_data(Rampelli))

Initial exploration

phyloseq help vignettes here.

Rampelli
Rampelli = prune_taxa(taxa_sums(Rampelli) > 0, Rampelli)
Rampelli

Subsetting and pruning

rank_names(Rampelli)
subset_taxa(Rampelli, !is.na(Strain))
(Rampelli.sp_strain = subset_taxa(Rampelli, !is.na(Species)))

taxonomy.level = apply(tax_table(Rampelli), 1, function(x) sum(!is.na(x)))
Rampelli.phy = prune_taxa(taxonomy.level==2, Rampelli)
taxa_names(Rampelli.phy)

Advanced pruning

Keep taxa only if they are in the most abundant 5% of taxa in at least two samples:

f1<- filterfun_sample(topp(0.05))
pru <- genefilter_sample(Rampelli, f1, A=2)
summary(pru)
Rampelli.pru <- subset_taxa(Rampelli, pru)

More help here.

Heatmap

plot_heatmap(Rampelli.pru, method="PCoA", distance="bray")

Barplot

par(mar = c(18, 4, 0, 0) + 0.1) # make more room on bottom margin
barplot(sort(taxa_sums(Rampelli.sp_strain), TRUE)[1:30]/nsamples(Rampelli.sp_strain), las=2)

Alpha diversity estimates

  • Look at ?phyloseq::estimate_richness
  • Supported measures of alpha diversity are:
    • “Observed”, “Chao1”, “ACE”, “Shannon”, “Simpson”, “InvSimpson”, “Fisher”
    • more information from vegan package

Note, you can ignore warning about singletons:

Comparison of alpha diversity estimates

alphas = estimate_richness(Rampelli.counts, measures=alpha_meas)
pairs(alphas)

Beta diversity / dissimilarity

E.g. Bray-Curtis dissimilarity between all pairs of samples:

plot(hclust(phyloseq::distance(Rampelli, method="bray")), 
     main="Bray-Curtis Dissimilarity", xlab="", sub = "")
  • Dozens of distance measures are available
    • see ?phyloseq::distance and ?phyloseq::distanceMethodList

Ordination

ord = ordinate(Rampelli, method="PCoA", distance="bray")
plot_ordination(Rampelli, ord, color="camp", shape="camp") + 
  ggplot2::ggtitle("Bray-Curtis Principal Coordinates Analysis")
  • Available methods are “DCA”, “CCA”, “RDA”, “CAP”, “DPCoA”, “NMDS”, “MDS”, “PCoA”
  • PCoA approximately preserves distances in a two-dimensional projection

Not much “horseshoe” effect here.

Ordination (cont’d)

plot_scree(ord) + ggplot2::ggtitle("Screeplot")