Contents

1 What curatedMetagenomicData provides

curatedMetagenomicData provides 6 types of data for each dataset:

  1. species-level taxonomic profiles,
  2. presence of unique, clade-specific markers,
  3. abundance of unique, clade-specific markers,
  4. abundance of gene families,
  5. metabolic pathway coverage, and
  6. metabolic pathway abundance.

Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.

Currently, curatedMetagenomicData provides:

These datasets are documented in the reference manual.

2 Using curatedMetagenomicData Resources

Use of the resources in curatedMetagenomicData is simplified with the use of Bioconductor’s ExperimentHub platform, which allows for the accessing of data through an intuitive interface. First, curatedMetagenomicData is installed using BiocInstaller and then called as a library - the process allows for the user to simply call datasets as functions because the package is aware of the resources present in ExperimentHub S3 buckets.

# BiocInstaller::biocLite("curatedMetagenomicData")
library(curatedMetagenomicData)

A function call to a dataset name will return an expression set object which can be manipulated or serialized.

LomanNJ_2013_Mi.genefamilies_relab.stool()
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 337637 features, 9 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: OBK1122 OBK1196 ... OBK4961 (9 total)
##   varLabels: subjectID first ... number_reads (19 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23571589 
## Annotation:

3 Features of ExpressionSet Objects

All datasets are represented as ExpressionSet objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the experimentData() function is used to return a MIAME (Minimum Information About a Microarray Experiment) object.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
experimentData()
## Experiment data
##   Experimenter name: Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ 
##   Laboratory: Institute of Microbiology and Infection, University of Birmingham, Birmingham, England. 
##   Contact information:  
##   Title: A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. 
##   URL:  
##   PMIDs: 23571589 
## 
##   Abstract: A 308 word abstract is available. Use 'abstract' method.
##   notes:
##    Sequencing technology:     
##       Illumina

To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
pData() %>%
head()
##         subjectID first repeat stooltexture daysafteronset  hus stec_count
## OBK1122   OBK1122  1122      0         <NA>           <NA> <NA>       <NA>
## OBK1196   OBK1196  1196      0         <NA>           <NA> <NA>       <NA>
## OBK1253   OBK1253  1253      0         <NA>           <NA> <NA>       <NA>
## OBK2535   OBK2535  2535      0         <NA>           <NA> <NA>       <NA>
## OBK2638   OBK2638  2638      0         <NA>           <NA> <NA>       <NA>
## OBK2723   OBK2723  2723      0         <NA>           <NA> <NA>       <NA>
##         shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122             <NA>          8.00       33          <NA>
## OBK1196             <NA>         17.50      >99          <NA>
## OBK1253             <NA>         12.10      >99          <NA>
## OBK2535             <NA>         17.70    33.04          <NA>
## OBK2638             <NA>         18.74    67.51          <NA>
## OBK2723             <NA>         11.81     1.86          <NA>
##         stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122              y      <NA>       <NA>              positive
## OBK1196              n      <NA>       <NA>              negative
## OBK1253              n      <NA>       <NA>              negative
## OBK2535           <NA>      <NA>       <NA>              negative
## OBK2638           <NA>      <NA>       <NA>              negative
## OBK2723           <NA>      <NA>       <NA>              negative
##                disease bodysite country number_reads
## OBK1122 stec2-positive    stool germany      1758352
## OBK1196 stec2-positive    stool germany     13423104
## OBK1253 stec2-positive    stool germany      9949304
## OBK2535 stec2-positive    stool germany       664514
## OBK2638 stec2-positive    stool germany      1009732
## OBK2723 stec2-positive    stool germany      8017954

To access the data itself the exprs() function is used to return a long and skinny matrix in which samples are column rather than rows.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
exprs() %>%
head()
##                     OBK1122    OBK1196    OBK1253     OBK2535    OBK2638
## UNMAPPED         0.08320560 0.12063500 0.11415300 0.071663900 0.05090200
## UniRef90_unknown 0.08003470 0.07693680 0.08019060 0.122752000 0.14018600
## UniRef90_P03641  0.00418468 0.00176623 0.00326958 0.001151130 0.01027770
## UniRef90_P69486  0.00357096 0.00225170 0.00369798 0.001339380 0.01182520
## UniRef90_P03631  0.00311485 0.00170172 0.00302549 0.000995641 0.00930738
## UniRef90_P69172  0.00300317 0.00204393 0.00246346 0.001040070 0.00769270
##                     OBK2723   OBK4096    OBK4328   OBK4961
## UNMAPPED         0.09993320 0.0654341 0.12492000 0.0479208
## UniRef90_unknown 0.06659520 0.0512603 0.06270720 0.1208630
## UniRef90_P03641  0.00294142 0.0759686 0.00117431 0.0419860
## UniRef90_P69486  0.00318407 0.0783307 0.00240333 0.0473116
## UniRef90_P03631  0.00266316 0.0680656 0.00194322 0.0383972
## UniRef90_P69172  0.00235623 0.0608840 0.00163305 0.0309943

Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.

4 Estimating Absolute Raw Count Data

Absolute raw count data can be estimated from the relative count data by multiplying the rows of the ExpressionSet data by the number of reads, as found in the pData. The sweep function from base R provides a way to quickly multiply every row of the matrix by the number of reads vector.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
exprs() %>%
sweep(., 2, LomanNJ_2013_Mi.genefamilies_relab.stool()$number_reads, "*") %>%
head()
##                     OBK1122    OBK1196    OBK1253    OBK2535    OBK2638
## UNMAPPED         146304.733 1619296.15 1135742.90 47621.6648  51397.378
## UniRef90_unknown 140729.175 1032730.67  797840.66 81570.4225 141550.290
## UniRef90_P03641    7358.140   23708.29   32530.05   764.9420  10377.723
## UniRef90_P69486    6279.005   30224.80   36792.33   890.0368  11940.283
## UniRef90_P03631    5477.003   22842.36   30101.52   661.6174   9397.959
## UniRef90_P69172    5280.630   27435.88   24509.71   691.1411   7767.565
##                    OBK2723   OBK4096    OBK4328  OBK4961
## UNMAPPED         801259.80 10792.570 1372570.74 17631.12
## UniRef90_unknown 533957.25  8454.771  689001.51 44468.16
## UniRef90_P03641   23584.17 12530.109   12902.85 15447.57
## UniRef90_P69486   25529.73 12919.709   26406.82 17406.98
## UniRef90_P03631   21353.09 11226.604   21351.32 14127.17
## UniRef90_P69172   18892.14 10042.085   17943.30 11403.48

5 Taxonomy-Aware Analysis using phyloseq

The phyloseq bioconductor package provides a number of useful functions related to taxonomic analysis and readily produces ggPlot2 graphics. curatedMetagenomicData provides 6 types of data; of these, only the taxonomic profiles make sense to analyse as phylogenetic data. A convienence function, ExpressionSet2phyloseq(), is provided within the curatedMetagenomicData package to allow for coversion between the ExpressionSet class and the phyloseq class; it is used as follows.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq()
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 399 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 399 taxa by 8 taxonomic ranks ]

5.1 Exploratory Data Analysis

Exploratory data analysis might begin through the examination of clade-specific realative abundances contained in the OTU (Operational Taxanomic Unit) table. A dataset containing taxanomic profiles is first converted to a phyloseq object and then the otu_table() function is used to view the contents and structure of the table.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>% 
otu_table() %>% 
head()
## OTU Table:          [6 taxa and 9 samples]
##                      taxa are rows
##                   OBK1122 OBK1196 OBK1253 OBK2535 OBK2638 OBK2723 OBK4096
## k__Bacteria       1000000  994785  999901 1000000 1000000  989760 1000000
## p__Bacteroidetes   786168  525459  808827  236409   69300  503518  392508
## p__Firmicutes      157688  432482  177192  268285   24606  443958  180435
## p__Proteobacteria   56145   25305    9536  495306  892984   31430  427056
## c__Bacteroidia     786168  525459  808827  236409   69300  503518  392508
## c__Clostridia      150854  248574  103997  219160   23622  405306  180435
##                   OBK4328 OBK4961
## k__Bacteria       1000000 1000000
## p__Bacteroidetes   407899   55926
## p__Firmicutes      467944       0
## p__Proteobacteria     155  944074
## c__Bacteroidia     407899   55926
## c__Clostridia      396215       0

Similarly, the process of exploratory data analysis will likely include an examination of study-related metadata in order to understand the type of questions that can be asked of the data. To examine the structure and contents of metadata, a phyloseq object is created and then the sample_data() function is used to view the table.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>% 
sample_data() %>% 
head()
##         subjectID first repeat. stooltexture daysafteronset  hus
## OBK1122   OBK1122  1122       0         <NA>           <NA> <NA>
## OBK1196   OBK1196  1196       0         <NA>           <NA> <NA>
## OBK1253   OBK1253  1253       0         <NA>           <NA> <NA>
## OBK2535   OBK2535  2535       0         <NA>           <NA> <NA>
## OBK2638   OBK2638  2638       0         <NA>           <NA> <NA>
## OBK2723   OBK2723  2723       0         <NA>           <NA> <NA>
##         stec_count shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122       <NA>             <NA>          8.00       33          <NA>
## OBK1196       <NA>             <NA>         17.50      >99          <NA>
## OBK1253       <NA>             <NA>         12.10      >99          <NA>
## OBK2535       <NA>             <NA>         17.70    33.04          <NA>
## OBK2638       <NA>             <NA>         18.74    67.51          <NA>
## OBK2723       <NA>             <NA>         11.81     1.86          <NA>
##         stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122              y      <NA>       <NA>              positive
## OBK1196              n      <NA>       <NA>              negative
## OBK1253              n      <NA>       <NA>              negative
## OBK2535           <NA>      <NA>       <NA>              negative
## OBK2638           <NA>      <NA>       <NA>              negative
## OBK2723           <NA>      <NA>       <NA>              negative
##                disease bodysite country number_reads
## OBK1122 stec2-positive    stool germany      1758352
## OBK1196 stec2-positive    stool germany     13423104
## OBK1253 stec2-positive    stool germany      9949304
## OBK2535 stec2-positive    stool germany       664514
## OBK2638 stec2-positive    stool germany      1009732
## OBK2723 stec2-positive    stool germany      8017954

As with the OTU and metadata tables, the structure and contents of the taxanomic table should also be examined. This is done by converting an ExpressionSet object to a phyloseq object and then using the tax_table() function to expose the underlying taxanomic table. The table provides a complete listing of every clade available within the dataset down to the strain level.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
tax_table() %>%
head()
## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##                   Kingdom    Phylum           Class         Order Family
## k__Bacteria       "Bacteria" NA               NA            NA    NA    
## p__Bacteroidetes  "Bacteria" "Bacteroidetes"  NA            NA    NA    
## p__Firmicutes     "Bacteria" "Firmicutes"     NA            NA    NA    
## p__Proteobacteria "Bacteria" "Proteobacteria" NA            NA    NA    
## c__Bacteroidia    "Bacteria" "Bacteroidetes"  "Bacteroidia" NA    NA    
## c__Clostridia     "Bacteria" "Firmicutes"     "Clostridia"  NA    NA    
##                   Genus Species Strain
## k__Bacteria       NA    NA      NA    
## p__Bacteroidetes  NA    NA      NA    
## p__Firmicutes     NA    NA      NA    
## p__Proteobacteria NA    NA      NA    
## c__Bacteroidia    NA    NA      NA    
## c__Clostridia     NA    NA      NA

5.2 Subseting / Pruning

The process of subseting begins with the names of phylogenetic ranks, which are easily examined using the rank_names() function.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
rank_names()
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
## [8] "Strain"

With knowledge of the rank names it becomes possible to filter by taxa using the subset_taxa() function. In the example below, where a species is not given taxa are removed.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(!is.na(Species))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 264 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 264 taxa by 8 taxonomic ranks ]

Even greater specificity can be achieved using a more specific taxa such as strain.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(!is.na(Strain))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 124 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 124 taxa by 8 taxonomic ranks ]

Where a specific level of taxa is desired for analysis it can be obtained through the use of the prune_taxa() function in combination with an apply() function.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
tax_table() %>%
apply(., 1, function(x) {sum(!is.na(x))}) ->
taxa_level

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
prune_taxa(taxa_level == 2, .) %>%
taxa_names()
## [1] "p__Bacteroidetes"   "p__Firmicutes"      "p__Proteobacteria" 
## [4] "p__Actinobacteria"  "p__Viruses_noname"  "p__Euryarchaeota"  
## [7] "p__Verrucomicrobia"

5.3 Advanced Pruning

More advanced pruning of taxa can be done with filtering functions, such as the following one whereby taxa are kept only if they are among the most abundant 10% in at least two samples.

topp(0.10) %>%
filterfun_sample() ->
top_ten

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
genefilter_sample(., top_ten, A = 2) ->
filtered_taxa
    
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 399 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 399 taxa by 8 taxonomic ranks ]

5.4 Taxonomy Heatmap

Heatmaps of taxonomic abundance are easily made by passing a phyloseq class object to the plot_heatmap() function. Here, extensive subsetting is done to reduce the amount of OTUs that appear on the y-axis.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa) %>%
subset_taxa(!is.na(Strain)) %>% 
plot_heatmap(., method="PCoA", distance="bray")

5.5 Taxonomy Histogram

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa) %>%
subset_taxa(!is.na(Strain)) ->
unsorted_taxa

par(mar = c(20, 4, 0, 0) + 0.15)

barplot(sort(taxa_sums(unsorted_taxa), TRUE)[1:20] / nsamples(unsorted_taxa),
        las = 2)

5.6 Alpha Diversity Estimation

c("Shannon", "Simpson", "InvSimpson") ->
alpha_measures

unsorted_taxa %>%
plot_richness(., "c_difficile_frequency", "stxab_detected",
              measures = alpha_measures)

5.7 Alpha Diversity Estimate Comparison

unsorted_taxa %>%
estimate_richness(., measures = alpha_measures) %>%
pairs()

5.8 Beta Diversity / Dissimilarity Clustering

unsorted_taxa %>%
distance(method="bray") %>% 
hclust() %>% 
plot(main="Bray-Curtis Dissimilarity", xlab="", sub = "")

5.9 Ordination Analysis

unsorted_taxa %>%
ordinate(., method = "PCoA", distance = "bray") ->
ordinated_taxa

unsorted_taxa %>%
plot_ordination(., ordinated_taxa, color="c_difficile_frequency", 
                title = "Bray-Curtis Principal Coordinates Analysis")

plot_scree(ordinated_taxa)

6 Adding Datasets

Authors welcome the addition of new datasets provided they can be or already have been run through the MetaPhlAn2 and HUMAnN2 pipelines. Please read the developer documentation and contact the maintainer if you have a shotgun metagenomic dataset that would be of interest to the Bioconductor community.

7 Reporting Bugs

Development of the curatedMetagenomicData package occurs on GitHub and bugs reported via GitHub issues will recieve the quickest response. Please visit the project repository and file an issue should you find one.

8 Other Issues

Visit the Bioconductor support site at https://support.bioconductor.org/, breifly describe your issue, and add the curatedmetagenomicdata.