The following demonstrates some uses of the
plot_heatmap function in the phyloseq package for R and Bioconductor.
In a 2010 article in BMC Genomics, Rajaram and Oono describe an approach to creating a heatmap using ordination methods (namely, NMDS and PCA) to organize the rows and columns instead of (hierarchical) cluster analysis. In many cases the ordination-based ordering does a much better job than h-clustering at providing an order of elements that is easily interpretable. The authors provided an immediately useful example of their approach as the NeatMap package for R. The NeatMap package can be used directly on the abundance table (
"otuTable"-class) of phylogenetic-sequencing data, but the NMDS or PCA ordination options that it supports are not based on ecological distances. To fill this void, and because phyloseq already provides support for a large number of ecological distances and ordination methods, phyloseq now includes the
plot_heatmap() function: an ecology-oriented variant of the NeatMap approach to organizing a heatmap and build it using ggplot2 graphics tools. The distance and method arguments are the same as for the plot_ordination function, and support large number of distances and ordination methods, respectively, with a strong leaning toward ecology. This function also provides the options to re-label the OTU and sample axis-ticks with a taxonomic name and/or sample variable, respectively, in the hope that this might hasten your interpretation of the patterns (See the documentation for the
taxa.label arguments, and the examples below). Note that this function makes no attempt to overlay dendrograms from hierarchical clustering next to the axes, as hierarchical clustering is not used to organize these plots. Also note that each re-ordered axis repeats at the edge, and so apparent clusters at the far right/left or top/bottom of the heat-map may actually be the same. For now, the placement of this edge can be considered arbitrary, so beware of this artifact of the graphic and visually check if there are two “mergeable” clusters at the edges of a particular axis. If you benefit from this phyloseq-specific implementation of the NeatMap approach, please cite the NeatMap article, as well as phyloseq.
Traditionally heatmaps have been used to emphasize data that is above or below a threshold as “hot” or “cold” colors, respectively. In the case of OTU-abundance data, however, it seems the most common need is to see the relative patterns of high-abundance OTUs against a background of taxa that are mostly low-abundance or absent in a sparse matrix. Furthermore, there is usually not an obvious or intrinsically meaningful abundance value to use as a suitable threshold for the traditional “cold/hot” display. For these reasons, the default color scheme in
plot_heatmap maps a very dark blue color to the lowest abundance values, up to a very light blue for the highest abundance values. The dark blue for the lowest abundance values is not very much lighter than black - the color used to represent missing or zero abundance values by default - for a coherent, blue-oriented color scheme in which the eye should be drawn to the lighter shades.
If, for whatever reason, you need to change this default color scheme, it is possible through the
na.value arguments. Several examples are provided below. The character-string values supplied to these arguments need to be the names of R colors. There are over 600 English color names that are understood by R (try
colors() at the R terminal), as well as other finely-resolved color gradient nomenclatures. The examples below use a 6-digit hexadecimal color representation, and a nice table summary of these colors is available at the R Cookbook.
For further details, not that the
na.value parameters are passed along to ggplot2's scale_gradient function, which does an excellent job selecting suitable colors of your gradient, provided that you select colors that make sense to have at two ends of a gradient.
I also got some useful ideas and suggestions at the following WordPress page regarding the construction of heatmaps using ggplot2's
By default, the color mapping of
plot_heatmap is transformed to a log-scale of base 4, using
log_trans(4) from the scales package. This is an arbitrary choice that you might need to adjust based on your needs and data. If specifying an alternative transformation object to the
trans argument, you probably need to load the scales package first. Since scales is a required package for
phyloseq, you should already have it installed if you are at this point. Any transformation object that is valid for the scales package should work here, but the relative contrast and the way it represents your data could change dramatically based on this choice, so make this selection carefully; or better yet, try several different transformations if you think data is being “left in the background” or too much information is being “pushed to the foreground”, for example.
Let's finally get started with some code. Of course, we have to start by loading packages and data.
##  '1.5.8'
data("GlobalPatterns") library("ggplot2") packageVersion("ggplot2")
##  '0.9.3.1'
ggplot2 package theme set. See the ggplot2 online documentation for further help.
The following two lines subset the dataset to just the top 300 most abundant Bacteria taxa across all samples (in this case, with no prior preprocessing. Not recommended, but quick).
gpt <- subset_taxa(GlobalPatterns, Kingdom == "Bacteria") gpt <- prune_taxa(names(sort(taxa_sums(gpt), TRUE)[1:300]), gpt) plot_heatmap(gpt, sample.label = "SampleType")
Subset the dataset to something manageable that can be reasonably represented in one plot. In the following examples, the Crenarchaeota phylum.
gpac <- subset_taxa(GlobalPatterns, Phylum == "Crenarchaeota")
Now let's see how our
plot_heatmap function works with all default settings.
Here is an example re-labelling based on the “SampleType” sample variable and the taxonomic rank of “Family”.
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family")
Changing the color scheme might be worthwhile, depending on the graphics device or paper on which you want to display the heatmap.
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low = "#000033", high = "#CCFF66")
Here is a dark-blue to red scheme.
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low = "#000033", high = "#FF3300")
A very dark-blue to very light-blue scheme
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low = "#000033", high = "#66CCFF")
Here is a “dark on light” color scheme. Note that we change the background value (the value of the NA and 0 elements)
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low = "#66CCFF", high = "#000033", na.value = "white")
This is a similar color scheme as the previous, but the “near zero” color is closer to a cream color, and the colors themselves are closer to blue-grey. This is better overall contrast than a lot of schemes, but may not be as exciting.
plot_heatmap(gpac, "NMDS", "bray", "SampleType", "Family", low = "#FFFFCC", high = "#000033", na.value = "white")
Now try the default color scheme, but using different ecological distances/ordinations. For example, NMDS ordination on the jaccard distance.
plot_heatmap(gpac, "NMDS", "jaccard")
Detrended correspondence analysis.
plot_heatmap(gpac, "DCA", "none", "SampleType", "Family")
Unconstrained redundancy analysis (Principle Components Analysis, PCA)
plot_heatmap(gpac, "RDA", "none", "SampleType", "Family")
PCoA/MDS ordination on the (default) bray-curtis distance.
plot_heatmap(gpac, "PCoA", "bray", "SampleType", "Family")
MDS/PCoA ordination on the Unweighted-UniFrac distance.
plot_heatmap(gpac, "PCoA", "unifrac", "SampleType", "Family")
Now try weighted-UniFrac distance and MDS/PCoA ordination.
plot_heatmap(gpac, "MDS", "unifrac", "SampleType", "Family", weighted = TRUE)
Here is how you might create a heatmap using base-R graphics
and the more common (but problematic) hierarchical clustering
organization, in case you want to compare with
plot_heatmap, for example.