Taxa Plot Tutorial

This tutorial shows you how to make a taxa plot from a phyloseq object. Requires phyloseq and tidyverse.

Start by loading the required R packages

(Packages must be installed prior to running this tutorial)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(phyloseq)

Load your phyloseq object

This tutorial assumes that you have a phyloseq object of the data that you want to plot. The example phyloseq object shown here has 9 samples, 9 sample variables, and 12,003 unique taxa.

toy_ps <- read_rds("/Users/lgschaer/Desktop/toy_phyloseq.rds")  
toy_ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 12003 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 12003 taxa by 7 taxonomic ranks ]

Explore the sample variables in the test phyloseq object

This phyloseq object has 9 sample variables: SampleID, Environoment, Sample_Type, Replicate, Substrate, Transfer, Substrate_Label, Environment_Label_Location, and Environment_Label.

head(sample_data(toy_ps))
##          SampleID Environment Sample_Type Replicate Substrate Transfer
## CCONR1T1 CCONR1T1     Compost  Enrichment        R1   Control       T1
## CCONR2T1 CCONR2T1     Compost  Enrichment        R2   Control       T1
## CCONR3T1 CCONR3T1     Compost  Enrichment        R3   Control       T1
## CTAR1T1   CTAR1T1     Compost  Enrichment        R1        TA       T1
## CTAR2T1   CTAR2T1     Compost  Enrichment        R2        TA       T1
## CTAR3T1   CTAR3T1     Compost  Enrichment        R3        TA       T1
##          Substrate_Label Environment_Label_Location Environment_Label
## CCONR1T1         Control       Compost\nCalumet, MI           Compost
## CCONR2T1         Control       Compost\nCalumet, MI           Compost
## CCONR3T1         Control       Compost\nCalumet, MI           Compost
## CTAR1T1  Terephthalamide       Compost\nCalumet, MI           Compost
## CTAR2T1  Terephthalamide       Compost\nCalumet, MI           Compost
## CTAR3T1  Terephthalamide       Compost\nCalumet, MI           Compost

STEP #1: Convert the phyloseq object into a dataframe that can be more easily manipulated

The tax_glom() function specifies the most specific taxonomic level you are interested in. Change the taxonomic level to the most specific level you are interested in plotting. The more broad the category, the quicker the code will run (ie it will run faster for Phylum than Genus). This step will probably run slowly for a large data set.

This block of code will output a data frame including sample variables, taxonomic, and abundance information.

genusabundance <- toy_ps %>%
  tax_glom(taxrank = "Genus") %>%                        # Set to smallest taxonomic level you are interested in
  transform_sample_counts(function(x) {x/sum(x)} ) %>%   # Transform to rel. abundance
  psmelt()                                               # Melt to long format
head(genusabundance)
##         OTU   Sample Abundance SampleID Environment Sample_Type Replicate
## 6157 sp7848 CCONR2T1 0.6666667 CCONR2T1     Compost  Enrichment        R2
## 1279   sp16  CTAR3T1 0.5578226  CTAR3T1     Compost  Enrichment        R3
## 3887  sp413 CCONR2T1 0.3333333 CCONR2T1     Compost  Enrichment        R2
## 2710   sp28  CTPR3T1 0.2451447  CTPR3T1     Compost  Enrichment        R3
## 963   sp135 CCONR1T1 0.2263978 CCONR1T1     Compost  Enrichment        R1
## 2715   sp28  CTPR2T1 0.1867858  CTPR2T1     Compost  Enrichment        R2
##      Substrate Transfer Substrate_Label Environment_Label_Location
## 6157   Control       T1         Control       Compost\nCalumet, MI
## 1279        TA       T1 Terephthalamide       Compost\nCalumet, MI
## 3887   Control       T1         Control       Compost\nCalumet, MI
## 2710       TPA       T1   Terephthalate       Compost\nCalumet, MI
## 963    Control       T1         Control       Compost\nCalumet, MI
## 2715       TPA       T1   Terephthalate       Compost\nCalumet, MI
##      Environment_Label  Kingdom         Phylum               Class
## 6157           Compost Bacteria Proteobacteria Alphaproteobacteria
## 1279           Compost Bacteria Proteobacteria Gammaproteobacteria
## 3887           Compost Bacteria Proteobacteria Gammaproteobacteria
## 2710           Compost Bacteria Proteobacteria Gammaproteobacteria
## 963            Compost Bacteria   Bacteroidota         Bacteroidia
## 2715           Compost Bacteria Proteobacteria Gammaproteobacteria
##                   Order              Family          Genus
## 6157        Rhizobiales        Rhizobiaceae   Ochrobactrum
## 1279    Burkholderiales      Comamonadaceae    Ramlibacter
## 3887 Steroidobacterales Steroidobacteraceae Steroidobacter
## 2710    Xanthomonadales    Xanthomonadaceae    Thermomonas
## 963     Chitinophagales    Chitinophagaceae  Edaphobaculum
## 2715    Xanthomonadales    Xanthomonadaceae    Thermomonas

STEP #2: Filter, group and modify data to prepare for plotting.

Here we will make a simplified data frame that only includes the variables we are interested in plotting for all taxonomic levels.

select() function: Choose which variables to include in the plot. Select ALL variables that will be used for x-axis and/or facet labels in addition to all taxonomic levels you are interested in.

filter() function: Filter out all taxa with zero percent abundance. Filter can also be used to remove treatments or conditions that are not to be included in the plot.

mutate() function: Convert the taxonomic level columns to character vectors (they are factors by default), can also be used to add additional columns or modify existing columns as desired.

all <- genusabundance %>%
  select(Phylum, Class, Family, Genus, Sample, Abundance, Substrate, Substrate_Label, Replicate) %>%
  filter(Abundance != 0) %>%
  mutate(
    Phylum = as.character(Phylum),
    Class = as.character(Class),
    Family = as.character(Family),
    Genus = as.character(Genus))
head(all)
##           Phylum               Class              Family          Genus
## 1 Proteobacteria Alphaproteobacteria        Rhizobiaceae   Ochrobactrum
## 2 Proteobacteria Gammaproteobacteria      Comamonadaceae    Ramlibacter
## 3 Proteobacteria Gammaproteobacteria Steroidobacteraceae Steroidobacter
## 4 Proteobacteria Gammaproteobacteria    Xanthomonadaceae    Thermomonas
## 5   Bacteroidota         Bacteroidia    Chitinophagaceae  Edaphobaculum
## 6 Proteobacteria Gammaproteobacteria    Xanthomonadaceae    Thermomonas
##     Sample Abundance Substrate Substrate_Label Replicate
## 1 CCONR2T1 0.6666667   Control         Control        R2
## 2  CTAR3T1 0.5578226        TA Terephthalamide        R3
## 3 CCONR2T1 0.3333333   Control         Control        R2
## 4  CTPR3T1 0.2451447       TPA   Terephthalate        R3
## 5 CCONR1T1 0.2263978   Control         Control        R1
## 6  CTPR2T1 0.1867858       TPA   Terephthalate        R2

STEP #3: Prepare to plot Phylum

Here we will group the data even more to prepare to make a taxa plot at the Phylum level

group_by() function: Groups data by the specified variables. Include ONLY the variables that will be used in the ggplot() command to make the plot. Adding additional variables here can lead to aesthetic problems with the plot.

summarise() function: This function will collapse the data based on the groups specified in the group_by() function above. Here we will calculate the average abundance for each group (sum of the abundance of each phyla for each unique substrate/replicate combination divided by the total number of samples in each group).

More detailed explanation of this code block: select() select the variables desired for plotting, dropping excess variables will help code run faster. group_by() group by the variables that will be used for plotting, these will be updated for each plot. mutate() add a new column giving the number of samples in each group. In this case there is only one sample in each group, but the code provided here will calculate whatever the number of samples in the group is. ungroup() remove previous groupings. group_by() group by the variables that will be used for plotting PLUS the number of samples in each group (totalSum), and the taxonomic level you are going to plot, in this case, Genus. summarise() is used to do two things: (1) calculate the sum of abundance of all organisms in each group belonging to each genus. In this case, for each unique substrate and replicate combination, we calculate the sum() the abundance of each genus. (2) calculate relative abundance. unique() selects only the unique observations and removes any duplicate rows.

phylum <- all %>%
  select(Substrate, Substrate_Label, Replicate, Phylum, Abundance) %>%  #choose variables to work with
  group_by(Substrate, Substrate_Label, Replicate) %>%                   #group by variables used to plot NOT taxonomic level
  mutate(totalSum = sum(Abundance)) %>%                                 #calculate total abundance of each Phylum
  ungroup() %>%                                                         #remove grouping variables
  group_by(Substrate, Substrate_Label, Replicate, Phylum) %>%           #group by same variables PLUS taxonomic level
  summarise(                                                            
    Abundance = sum(Abundance),                                         #sum abundance in each phylum for each  group
    totalSum,
    RelAb = Abundance/totalSum) %>%                                     #calculate relative abundance
  unique()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate', 'Phylum'. You can override using the `.groups` argument.
head(phylum)
## # A tibble: 6 × 7
## # Groups:   Substrate, Substrate_Label, Replicate, Phylum [6]
##   Substrate Substrate_Label Replicate Phylum           Abundance total…¹   RelAb
##   <chr>     <chr>           <chr>     <chr>                <dbl>   <dbl>   <dbl>
## 1 Control   Control         R1        Acidobacteriota    0.0275        1 0.0275 
## 2 Control   Control         R1        Actinobacteriota   0.0422        1 0.0422 
## 3 Control   Control         R1        Bacteroidota       0.446         1 0.446  
## 4 Control   Control         R1        Bdellovibrionota   0.00336       1 0.00336
## 5 Control   Control         R1        Chloroflexi        0.00397       1 0.00397
## 6 Control   Control         R1        Crenarchaeota      0.00275       1 0.00275
## # … with abbreviated variable name ¹​totalSum

Look at descriptive statistics to check that everything is as we would expect. Maximum relative abundance should be no more than one, and the minimum should be greater than zero. The mean should be between the maximum and the minimum.

max(phylum$RelAb)
## [1] 1
mean(phylum$RelAb)
## [1] 0.08108108
min(phylum$RelAb)
## [1] 8.058667e-05

STEP #4: Prepare colors for the plot

Calculate how many colors will be needed. This is equal to the length of the unique values in the Phylum column of the “phylum” data frame, which in this case is 18. Here we will make a character vector containing a list of colors (phylum_colors) and then check the length to be sure there are enough colors… Here we have 25 colors, which is more than enough!

length(unique(phylum$Phylum))
## [1] 18
phylum_colors <- c(
  "grey22", "darkcyan", "orchid1", "green", "orange", "blue", "tomato2", "olivedrab", "grey47",
  "cyan", "coral3", "darkgreen", "magenta", "palegoldenrod", "dodgerblue", "firebrick", "yellow", "purple4",
  "lightblue", "grey77", "mediumpurple1", "tan4", "red", "darkblue", "yellowgreen")

length(phylum_colors)
## [1] 25

STEP #5: Make a plot!

geom_col(): this creates the columns in the plot. The bars should add to exactly 1, if they do not add to 1, check grouping variables since this is the most likely place something went wrong. If the bars do add exactly to 1, percentages from the data frame can be multiplied by 100 to be interpreted as relative abundances. If you end up with many small blocks of the same color, check grouping variables in Step # 3, it’s likely you have multiple groups of the same genus represented by a single bar.

Note on using aes(): Anything inside the aes() function will create a legend (for example, “Fill = Phylum” is inside the aes() function since we want a legend to interpret the phylum colors). Anything outside outside aes() will NOT create a legend (for example, “Color = black” is outside the aes() function since we do not want a legend for the outline around the colored blocks).

facet_grid(): divides the figure into panels based on the variables provided.

scale_fill_manual(): specifies vector of colors to use instead of default colors.

theme_linedraw(): sets theme for graphic. Additional themes are available.

theme(): aesthetic tweaks to appearance, font size, legend position, etc.

guides(): adjusts legend appearance. Can alter the number of rows/columns in the legend.

ggplot(phylum)+
  geom_col(mapping = aes(x = Replicate, y = RelAb, fill = Phylum), color = "black", position = "stack", show.legend = TRUE)+
  facet_grid(cols = vars(Substrate_Label))+
  ylab("Proportion of Community") +
  xlab(NULL)+
  scale_fill_manual(values = phylum_colors) +
  theme_linedraw()+
  theme(axis.text.y = element_text(size = 20, color = "black"),
        axis.title.y = element_text(size = 18, color = "black"),
        axis.text.x = element_text(size = 20, angle = 0, vjust = 1, hjust = 0.5, color = "black"),
        legend.text = element_text(size = 13),
        legend.position = "bottom",
        legend.spacing.x = unit(0.1, 'mm'),
        legend.spacing.y = unit(0.05, 'mm'),
        plot.margin=grid::unit(c(0.1,0.1,0.1,0.1), "mm"),
        strip.text = element_text(size = 18, face = "bold", angle = 0),
        legend.title = element_text(face="bold", size = 22))+
  guides(fill=guide_legend(ncol=3,byrow=TRUE))

STEP #6: Plotting other taxonomic levels, grouping low abundance taxa.

Plotting other taxonomic levels is very similar, starting with the “all” dataframe we created earlier, modify the grouping variables to desired taxonomic level. This example will be grouped by genus.

genus <- all %>%
  select(Substrate, Substrate_Label, Replicate, Genus, Abundance) %>%
  group_by(Substrate, Substrate_Label, Replicate) %>%
  mutate(totalSum = sum(Abundance)) %>%
  ungroup() %>%
  group_by(Substrate, Substrate_Label, Replicate, Genus) %>%
  summarise(
    Abundance = sum(Abundance),
    totalSum,
    RelAb = Abundance/totalSum) %>%
  unique()
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate'. You can override using the `.groups` argument.
head(genus)
## # A tibble: 6 × 7
## # Groups:   Substrate, Substrate_Label, Replicate [1]
##   Substrate Substrate_Label Replicate Genus              Abund…¹ total…²   RelAb
##   <chr>     <chr>           <chr>     <chr>                <dbl>   <dbl>   <dbl>
## 1 Control   Control         R1        Adhaeribacter      0.0107        1 0.0107 
## 2 Control   Control         R1        Aeromicrobium      0.00122       1 0.00122
## 3 Control   Control         R1        Ahniella           0.00458       1 0.00458
## 4 Control   Control         R1        Algoriphagus       0.00397       1 0.00397
## 5 Control   Control         R1        Altererythrobacter 0.0131        1 0.0131 
## 6 Control   Control         R1        Amaricoccus        0.00519       1 0.00519
## # … with abbreviated variable names ¹​Abundance, ²​totalSum

Check the distribution of the data, max, mean and minimum should all be between 1 and >0. Here we also check the number of unique genera, there are 230! It would be impossible to visualize 230 colors in a single figure.

max(genus$RelAb)
## [1] 0.6666667
mean(genus$RelAb)
## [1] 0.01092233
min(genus$RelAb)
## [1] 5.597537e-05
length(unique(genus$Genus))
## [1] 230

One solution for this is to group low abundance genera into a single category. Starting with the data frame “all” created earlier, we will group all of the genera present at less than 5% relative abundance into a single group.

Using mutate() paired with ifelse() and summarise() we will group any taxa with low (< 5% relative abundance) into a single category, reducing the number of colors needed.

More detailed explanation of this code block: select() select the variables desired for plotting, dropping excess variables will help code run faster. group_by() group by the variables that will be used for plotting, these will be updated for each plot. mutate() add a new column giving the number of samples in each group. In this case there is only one sample in each group, but the code provided here will calculate whatever the number of samples in the group is. ungroup() remove previous groupings. group_by() group by the variables that will be used for plotting PLUS the number of samples in each group (totalSum), and the taxonomic level you are going to plot, in this case, Genus. summarise() is used to do two things: (1) calculate the sum of abundance of all organisms in each group belonging to each genus. In this case, for each unique substrate and replicate combination, we calculate the sum() the abundance of each genus. (2) we use ifelse() to check if the summed abundance is less than 0.05 (5%), if it is, the Genus name is replaced with “< 5 %”. Now, as the last step, we use group_by() and summarise() to calculate the sum of abundance again (so the low abundance taxa are treated as a single category), and calculate relative abundance. unique() selects only the unique observations and removes any duplicate rows.

genus <- all %>%
  select(Substrate, Substrate_Label, Replicate, Genus, Abundance) %>%
  group_by(Substrate, Substrate_Label, Replicate) %>%
  mutate(totalSum = sum(Abundance)) %>%
  ungroup() %>%
  group_by(Substrate, Substrate_Label, Replicate, Genus, totalSum) %>%
  summarise(
    Abundance = sum(Abundance),
    Genus = ifelse(Abundance < 0.05, "< 5 %", Genus)) %>%               #change Genus label to group low abundance taxa together
  group_by(Substrate, Substrate_Label, Replicate, Genus, totalSum) %>%  #now group and summarize again to group newly labeled low abundance taxa together
  summarise(
    Abundance = sum(Abundance),
    RelAb = Abundance/totalSum) %>%
  unique()
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate', 'Genus'. You can override using the `.groups` argument.
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate', 'Genus', 'totalSum'. You can override using the `.groups`
## argument.
head(genus)
## # A tibble: 6 × 7
## # Groups:   Substrate, Substrate_Label, Replicate, Genus, totalSum [6]
##   Substrate Substrate_Label Replicate Genus          totalSum Abundance  RelAb
##   <chr>     <chr>           <chr>     <chr>             <dbl>     <dbl>  <dbl>
## 1 Control   Control         R1        < 5 %                 1    0.723  0.723 
## 2 Control   Control         R1        Edaphobaculum         1    0.226  0.226 
## 3 Control   Control         R1        Hassallia             1    0.0501 0.0501
## 4 Control   Control         R2        Ochrobactrum          1    0.667  0.667 
## 5 Control   Control         R2        Steroidobacter        1    0.333  0.333 
## 6 Control   Control         R3        < 5 %                 1    0.713  0.713

Now check the number of unique genera we have… 17 unique genera! Much easier to visualize!

length(unique(genus$Genus))
## [1] 17

One way to get a long vector of colors is to use colorRampPalette to make a color palette based on a few supplied colors.

colFunc <- colorRampPalette(c("purple4", "lightpink", "firebrick", "orange", "lemonchiffon", "olivedrab4", "darkcyan", "lightblue", "darkblue"))
color_list <- colFunc(length(unique(genus$Genus)))
genus_colors <- c("black", color_list)
length(genus_colors)
## [1] 18

Now we can plot the data! This same process can be used to plot any taxonomic level where there are too many unique genera to visualize all of them.

ggplot(genus)+
  geom_col(mapping = aes(x = Replicate, y = RelAb, fill = Genus), color = "black", position = "stack", show.legend = TRUE)+
  facet_grid(cols = vars(Substrate_Label))+
  ylab("Proportion of Community") +
  xlab(NULL)+
  scale_fill_manual(values = genus_colors) +
  theme_linedraw()+
  theme(axis.text.y = element_text(size = 20, color = "black"),
        axis.title.y = element_text(size = 18, color = "black"),
        axis.text.x = element_text(size = 20, angle = 0, vjust = 1, hjust = 0.5, color = "black"),
        legend.text = element_text(size = 13),
        legend.position = "bottom",
        legend.spacing.x = unit(0.1, 'mm'),
        legend.spacing.y = unit(0.05, 'mm'),
        plot.margin=grid::unit(c(0.1,0.1,0.1,0.1), "mm"),
        strip.text = element_text(size = 18, face = "bold", angle = 0),
        legend.title = element_text(face="bold", size = 22))+
  guides(fill=guide_legend(ncol=3,byrow=TRUE))

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.