3 Features of `ExpressionSet` Objects

All datasets are represented as ExpressionSet objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.

To access the experiment-level metadata the experimentData() function is used to return a MIAME (Minimum Information About a Microarray Experiment) object.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
experimentData()

## Experiment data
##   Experimenter name: Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ 
##   Laboratory: Institute of Microbiology and Infection, University of Birmingham, Birmingham, England. 
##   Contact information:  
##   Title: A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. 
##   URL:  
##   PMIDs: 23571589 
## 
##   Abstract: A 308 word abstract is available. Use 'abstract' method.
##   notes:
##    Sequencing technology:     
##       Illumina

To access the subject-level metadata the pData() function is used to return a data.frame containing subject-level variables of study.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
pData() %>%
head()

##         subjectID first repeat stooltexture daysafteronset  hus stec_count
## OBK1122   OBK1122  1122      0         <NA>           <NA> <NA>       <NA>
## OBK1196   OBK1196  1196      0         <NA>           <NA> <NA>       <NA>
## OBK1253   OBK1253  1253      0         <NA>           <NA> <NA>       <NA>
## OBK2535   OBK2535  2535      0         <NA>           <NA> <NA>       <NA>
## OBK2638   OBK2638  2638      0         <NA>           <NA> <NA>       <NA>
## OBK2723   OBK2723  2723      0         <NA>           <NA> <NA>       <NA>
##         shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122             <NA>          8.00       33          <NA>
## OBK1196             <NA>         17.50      >99          <NA>
## OBK1253             <NA>         12.10      >99          <NA>
## OBK2535             <NA>         17.70    33.04          <NA>
## OBK2638             <NA>         18.74    67.51          <NA>
## OBK2723             <NA>         11.81     1.86          <NA>
##         stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122              y      <NA>       <NA>              positive
## OBK1196              n      <NA>       <NA>              negative
## OBK1253              n      <NA>       <NA>              negative
## OBK2535           <NA>      <NA>       <NA>              negative
## OBK2638           <NA>      <NA>       <NA>              negative
## OBK2723           <NA>      <NA>       <NA>              negative
##                disease bodysite country number_reads
## OBK1122 stec2-positive    stool germany      1758352
## OBK1196 stec2-positive    stool germany     13423104
## OBK1253 stec2-positive    stool germany      9949304
## OBK2535 stec2-positive    stool germany       664514
## OBK2638 stec2-positive    stool germany      1009732
## OBK2723 stec2-positive    stool germany      8017954

To access the data itself the exprs() function is used to return a long and skinny matrix in which samples are column rather than rows.

LomanNJ_2013_Mi.genefamilies_relab.stool() %>%
exprs() %>%
head()

##                     OBK1122    OBK1196    OBK1253     OBK2535    OBK2638
## UNMAPPED         0.08320560 0.12063500 0.11415300 0.071663900 0.05090200
## UniRef90_unknown 0.08003470 0.07693680 0.08019060 0.122752000 0.14018600
## UniRef90_P03641  0.00418468 0.00176623 0.00326958 0.001151130 0.01027770
## UniRef90_P69486  0.00357096 0.00225170 0.00369798 0.001339380 0.01182520
## UniRef90_P03631  0.00311485 0.00170172 0.00302549 0.000995641 0.00930738
## UniRef90_P69172  0.00300317 0.00204393 0.00246346 0.001040070 0.00769270
##                     OBK2723   OBK4096    OBK4328   OBK4961
## UNMAPPED         0.09993320 0.0654341 0.12492000 0.0479208
## UniRef90_unknown 0.06659520 0.0512603 0.06270720 0.1208630
## UniRef90_P03641  0.00294142 0.0759686 0.00117431 0.0419860
## UniRef90_P69486  0.00318407 0.0783307 0.00240333 0.0473116
## UniRef90_P03631  0.00266316 0.0680656 0.00194322 0.0383972
## UniRef90_P69172  0.00235623 0.0608840 0.00163305 0.0309943

Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.

5 Taxonomy-Aware Analysis using `phyloseq`

The phyloseq bioconductor package provides a number of useful functions related to taxonomic analysis and readily produces ggPlot2 graphics. curatedMetagenomicData provides 6 types of data; of these, only the taxonomic profiles make sense to analyse as phylogenetic data. A convienence function, ExpressionSet2phyloseq(), is provided within the curatedMetagenomicData package to allow for coversion between the ExpressionSet class and the phyloseq class; it is used as follows.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq()

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 399 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 399 taxa by 8 taxonomic ranks ]

5.1 Exploratory Data Analysis

Exploratory data analysis might begin through the examination of clade-specific realative abundances contained in the OTU (Operational Taxanomic Unit) table. A dataset containing taxanomic profiles is first converted to a phyloseq object and then the otu_table() function is used to view the contents and structure of the table.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>% 
otu_table() %>% 
head()

## OTU Table:          [6 taxa and 9 samples]
##                      taxa are rows
##                   OBK1122 OBK1196 OBK1253 OBK2535 OBK2638 OBK2723 OBK4096
## k__Bacteria       1000000  994785  999901 1000000 1000000  989760 1000000
## p__Bacteroidetes   786168  525459  808827  236409   69300  503518  392508
## p__Firmicutes      157688  432482  177192  268285   24606  443958  180435
## p__Proteobacteria   56145   25305    9536  495306  892984   31430  427056
## c__Bacteroidia     786168  525459  808827  236409   69300  503518  392508
## c__Clostridia      150854  248574  103997  219160   23622  405306  180435
##                   OBK4328 OBK4961
## k__Bacteria       1000000 1000000
## p__Bacteroidetes   407899   55926
## p__Firmicutes      467944       0
## p__Proteobacteria     155  944074
## c__Bacteroidia     407899   55926
## c__Clostridia      396215       0

Similarly, the process of exploratory data analysis will likely include an examination of study-related metadata in order to understand the type of questions that can be asked of the data. To examine the structure and contents of metadata, a phyloseq object is created and then the sample_data() function is used to view the table.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>% 
sample_data() %>% 
head()

##         subjectID first repeat. stooltexture daysafteronset  hus
## OBK1122   OBK1122  1122       0         <NA>           <NA> <NA>
## OBK1196   OBK1196  1196       0         <NA>           <NA> <NA>
## OBK1253   OBK1253  1253       0         <NA>           <NA> <NA>
## OBK2535   OBK2535  2535       0         <NA>           <NA> <NA>
## OBK2638   OBK2638  2638       0         <NA>           <NA> <NA>
## OBK2723   OBK2723  2723       0         <NA>           <NA> <NA>
##         stec_count shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122       <NA>             <NA>          8.00       33          <NA>
## OBK1196       <NA>             <NA>         17.50      >99          <NA>
## OBK1253       <NA>             <NA>         12.10      >99          <NA>
## OBK2535       <NA>             <NA>         17.70    33.04          <NA>
## OBK2638       <NA>             <NA>         18.74    67.51          <NA>
## OBK2723       <NA>             <NA>         11.81     1.86          <NA>
##         stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122              y      <NA>       <NA>              positive
## OBK1196              n      <NA>       <NA>              negative
## OBK1253              n      <NA>       <NA>              negative
## OBK2535           <NA>      <NA>       <NA>              negative
## OBK2638           <NA>      <NA>       <NA>              negative
## OBK2723           <NA>      <NA>       <NA>              negative
##                disease bodysite country number_reads
## OBK1122 stec2-positive    stool germany      1758352
## OBK1196 stec2-positive    stool germany     13423104
## OBK1253 stec2-positive    stool germany      9949304
## OBK2535 stec2-positive    stool germany       664514
## OBK2638 stec2-positive    stool germany      1009732
## OBK2723 stec2-positive    stool germany      8017954

As with the OTU and metadata tables, the structure and contents of the taxanomic table should also be examined. This is done by converting an ExpressionSet object to a phyloseq object and then using the tax_table() function to expose the underlying taxanomic table. The table provides a complete listing of every clade available within the dataset down to the strain level.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
tax_table() %>%
head()

## Taxonomy Table:     [6 taxa by 8 taxonomic ranks]:
##                   Kingdom    Phylum           Class         Order Family
## k__Bacteria       "Bacteria" NA               NA            NA    NA    
## p__Bacteroidetes  "Bacteria" "Bacteroidetes"  NA            NA    NA    
## p__Firmicutes     "Bacteria" "Firmicutes"     NA            NA    NA    
## p__Proteobacteria "Bacteria" "Proteobacteria" NA            NA    NA    
## c__Bacteroidia    "Bacteria" "Bacteroidetes"  "Bacteroidia" NA    NA    
## c__Clostridia     "Bacteria" "Firmicutes"     "Clostridia"  NA    NA    
##                   Genus Species Strain
## k__Bacteria       NA    NA      NA    
## p__Bacteroidetes  NA    NA      NA    
## p__Firmicutes     NA    NA      NA    
## p__Proteobacteria NA    NA      NA    
## c__Bacteroidia    NA    NA      NA    
## c__Clostridia     NA    NA      NA

5.2 Subseting / Pruning

The process of subseting begins with the names of phylogenetic ranks, which are easily examined using the rank_names() function.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
rank_names()

## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
## [8] "Strain"

With knowledge of the rank names it becomes possible to filter by taxa using the subset_taxa() function. In the example below, where a species is not given taxa are removed.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(!is.na(Species))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 264 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 264 taxa by 8 taxonomic ranks ]

Even greater specificity can be achieved using a more specific taxa such as strain.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(!is.na(Strain))

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 124 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 124 taxa by 8 taxonomic ranks ]

Where a specific level of taxa is desired for analysis it can be obtained through the use of the prune_taxa() function in combination with an apply() function.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
tax_table() %>%
apply(., 1, function(x) {sum(!is.na(x))}) ->
taxa_level

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
prune_taxa(taxa_level == 2, .) %>%
taxa_names()

## [1] "p__Bacteroidetes"   "p__Firmicutes"      "p__Proteobacteria" 
## [4] "p__Actinobacteria"  "p__Viruses_noname"  "p__Euryarchaeota"  
## [7] "p__Verrucomicrobia"

5.3 Advanced Pruning

More advanced pruning of taxa can be done with filtering functions, such as the following one whereby taxa are kept only if they are among the most abundant 10% in at least two samples.

topp(0.10) %>%
filterfun_sample() ->
top_ten

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
genefilter_sample(., top_ten, A = 2) ->
filtered_taxa
    
LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 399 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 399 taxa by 8 taxonomic ranks ]

5.4 Taxonomy Heatmap

Heatmaps of taxonomic abundance are easily made by passing a phyloseq class object to the plot_heatmap() function. Here, extensive subsetting is done to reduce the amount of OTUs that appear on the y-axis.

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa) %>%
subset_taxa(!is.na(Strain)) %>% 
plot_heatmap(., method="PCoA", distance="bray")

5.5 Taxonomy Histogram

LomanNJ_2013_Mi.metaphlan_bugs_list.stool() %>%
ExpressionSet2phyloseq() %>%
subset_taxa(., filtered_taxa) %>%
subset_taxa(!is.na(Strain)) ->
unsorted_taxa

par(mar = c(20, 4, 0, 0) + 0.15)

barplot(sort(taxa_sums(unsorted_taxa), TRUE)[1:20] / nsamples(unsorted_taxa),
        las = 2)

5.6 Alpha Diversity Estimation

c("Shannon", "Simpson", "InvSimpson") ->
alpha_measures

unsorted_taxa %>%
plot_richness(., "c_difficile_frequency", "stxab_detected",
              measures = alpha_measures)

5.7 Alpha Diversity Estimate Comparison

unsorted_taxa %>%
estimate_richness(., measures = alpha_measures) %>%
pairs()

5.8 Beta Diversity / Dissimilarity Clustering

unsorted_taxa %>%
distance(method="bray") %>% 
hclust() %>% 
plot(main="Bray-Curtis Dissimilarity", xlab="", sub = "")

5.9 Ordination Analysis

unsorted_taxa %>%
ordinate(., method = "PCoA", distance = "bray") ->
ordinated_taxa

unsorted_taxa %>%
plot_ordination(., ordinated_taxa, color="c_difficile_frequency", 
                title = "Bray-Curtis Principal Coordinates Analysis")

plot_scree(ordinated_taxa)

curatedMetagenomicData

Abstract

Contents

1 What `curatedMetagenomicData` provides

2 Using `curatedMetagenomicData` Resources

3 Features of `ExpressionSet` Objects

4 Estimating Absolute Raw Count Data

5 Taxonomy-Aware Analysis using `phyloseq`

5.1 Exploratory Data Analysis

5.2 Subseting / Pruning

5.3 Advanced Pruning

5.4 Taxonomy Heatmap

5.5 Taxonomy Histogram

5.6 Alpha Diversity Estimation

5.7 Alpha Diversity Estimate Comparison

5.8 Beta Diversity / Dissimilarity Clustering

5.9 Ordination Analysis

6 Adding Datasets

7 Reporting Bugs

8 Other Issues

curatedMetagenomicData

Abstract

Contents

1 What curatedMetagenomicData provides

2 Using curatedMetagenomicData Resources

3 Features of ExpressionSet Objects

4 Estimating Absolute Raw Count Data

5 Taxonomy-Aware Analysis using phyloseq

5.1 Exploratory Data Analysis

5.2 Subseting / Pruning

5.3 Advanced Pruning

5.4 Taxonomy Heatmap

5.5 Taxonomy Histogram

5.6 Alpha Diversity Estimation

5.7 Alpha Diversity Estimate Comparison

5.8 Beta Diversity / Dissimilarity Clustering

5.9 Ordination Analysis

6 Adding Datasets

7 Reporting Bugs

8 Other Issues

1 What `curatedMetagenomicData` provides

2 Using `curatedMetagenomicData` Resources

3 Features of `ExpressionSet` Objects

5 Taxonomy-Aware Analysis using `phyloseq`