This tutorial shows you how to make a taxa plot from a phyloseq object. Requires phyloseq and tidyverse.
(Packages must be installed prior to running this tutorial)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(phyloseq)
This tutorial assumes that you have a phyloseq object of the data that you want to plot. The example phyloseq object shown here has 9 samples, 9 sample variables, and 12,003 unique taxa.
toy_ps <- read_rds("/Users/lgschaer/Desktop/toy_phyloseq.rds")
toy_ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 12003 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 9 sample variables ]
## tax_table() Taxonomy Table: [ 12003 taxa by 7 taxonomic ranks ]
This phyloseq object has 9 sample variables: SampleID, Environoment, Sample_Type, Replicate, Substrate, Transfer, Substrate_Label, Environment_Label_Location, and Environment_Label.
head(sample_data(toy_ps))
## SampleID Environment Sample_Type Replicate Substrate Transfer
## CCONR1T1 CCONR1T1 Compost Enrichment R1 Control T1
## CCONR2T1 CCONR2T1 Compost Enrichment R2 Control T1
## CCONR3T1 CCONR3T1 Compost Enrichment R3 Control T1
## CTAR1T1 CTAR1T1 Compost Enrichment R1 TA T1
## CTAR2T1 CTAR2T1 Compost Enrichment R2 TA T1
## CTAR3T1 CTAR3T1 Compost Enrichment R3 TA T1
## Substrate_Label Environment_Label_Location Environment_Label
## CCONR1T1 Control Compost\nCalumet, MI Compost
## CCONR2T1 Control Compost\nCalumet, MI Compost
## CCONR3T1 Control Compost\nCalumet, MI Compost
## CTAR1T1 Terephthalamide Compost\nCalumet, MI Compost
## CTAR2T1 Terephthalamide Compost\nCalumet, MI Compost
## CTAR3T1 Terephthalamide Compost\nCalumet, MI Compost
The tax_glom() function specifies the most specific taxonomic level you are interested in. Change the taxonomic level to the most specific level you are interested in plotting. The more broad the category, the quicker the code will run (ie it will run faster for Phylum than Genus). This step will probably run slowly for a large data set.
This block of code will output a data frame including sample variables, taxonomic, and abundance information.
genusabundance <- toy_ps %>%
tax_glom(taxrank = "Genus") %>% # Set to smallest taxonomic level you are interested in
transform_sample_counts(function(x) {x/sum(x)} ) %>% # Transform to rel. abundance
psmelt() # Melt to long format
head(genusabundance)
## OTU Sample Abundance SampleID Environment Sample_Type Replicate
## 6157 sp7848 CCONR2T1 0.6666667 CCONR2T1 Compost Enrichment R2
## 1279 sp16 CTAR3T1 0.5578226 CTAR3T1 Compost Enrichment R3
## 3887 sp413 CCONR2T1 0.3333333 CCONR2T1 Compost Enrichment R2
## 2710 sp28 CTPR3T1 0.2451447 CTPR3T1 Compost Enrichment R3
## 963 sp135 CCONR1T1 0.2263978 CCONR1T1 Compost Enrichment R1
## 2715 sp28 CTPR2T1 0.1867858 CTPR2T1 Compost Enrichment R2
## Substrate Transfer Substrate_Label Environment_Label_Location
## 6157 Control T1 Control Compost\nCalumet, MI
## 1279 TA T1 Terephthalamide Compost\nCalumet, MI
## 3887 Control T1 Control Compost\nCalumet, MI
## 2710 TPA T1 Terephthalate Compost\nCalumet, MI
## 963 Control T1 Control Compost\nCalumet, MI
## 2715 TPA T1 Terephthalate Compost\nCalumet, MI
## Environment_Label Kingdom Phylum Class
## 6157 Compost Bacteria Proteobacteria Alphaproteobacteria
## 1279 Compost Bacteria Proteobacteria Gammaproteobacteria
## 3887 Compost Bacteria Proteobacteria Gammaproteobacteria
## 2710 Compost Bacteria Proteobacteria Gammaproteobacteria
## 963 Compost Bacteria Bacteroidota Bacteroidia
## 2715 Compost Bacteria Proteobacteria Gammaproteobacteria
## Order Family Genus
## 6157 Rhizobiales Rhizobiaceae Ochrobactrum
## 1279 Burkholderiales Comamonadaceae Ramlibacter
## 3887 Steroidobacterales Steroidobacteraceae Steroidobacter
## 2710 Xanthomonadales Xanthomonadaceae Thermomonas
## 963 Chitinophagales Chitinophagaceae Edaphobaculum
## 2715 Xanthomonadales Xanthomonadaceae Thermomonas
Here we will make a simplified data frame that only includes the variables we are interested in plotting for all taxonomic levels.
select() function: Choose which variables to include in the plot. Select ALL variables that will be used for x-axis and/or facet labels in addition to all taxonomic levels you are interested in.
filter() function: Filter out all taxa with zero percent abundance. Filter can also be used to remove treatments or conditions that are not to be included in the plot.
mutate() function: Convert the taxonomic level columns to character vectors (they are factors by default), can also be used to add additional columns or modify existing columns as desired.
all <- genusabundance %>%
select(Phylum, Class, Family, Genus, Sample, Abundance, Substrate, Substrate_Label, Replicate) %>%
filter(Abundance != 0) %>%
mutate(
Phylum = as.character(Phylum),
Class = as.character(Class),
Family = as.character(Family),
Genus = as.character(Genus))
head(all)
## Phylum Class Family Genus
## 1 Proteobacteria Alphaproteobacteria Rhizobiaceae Ochrobactrum
## 2 Proteobacteria Gammaproteobacteria Comamonadaceae Ramlibacter
## 3 Proteobacteria Gammaproteobacteria Steroidobacteraceae Steroidobacter
## 4 Proteobacteria Gammaproteobacteria Xanthomonadaceae Thermomonas
## 5 Bacteroidota Bacteroidia Chitinophagaceae Edaphobaculum
## 6 Proteobacteria Gammaproteobacteria Xanthomonadaceae Thermomonas
## Sample Abundance Substrate Substrate_Label Replicate
## 1 CCONR2T1 0.6666667 Control Control R2
## 2 CTAR3T1 0.5578226 TA Terephthalamide R3
## 3 CCONR2T1 0.3333333 Control Control R2
## 4 CTPR3T1 0.2451447 TPA Terephthalate R3
## 5 CCONR1T1 0.2263978 Control Control R1
## 6 CTPR2T1 0.1867858 TPA Terephthalate R2
Here we will group the data even more to prepare to make a taxa plot at the Phylum level
group_by() function: Groups data by the specified variables. Include ONLY the variables that will be used in the ggplot() command to make the plot. Adding additional variables here can lead to aesthetic problems with the plot.
summarise() function: This function will collapse the data based on the groups specified in the group_by() function above. Here we will calculate the average abundance for each group (sum of the abundance of each phyla for each unique substrate/replicate combination divided by the total number of samples in each group).
More detailed explanation of this code block: select() select the variables desired for plotting, dropping excess variables will help code run faster. group_by() group by the variables that will be used for plotting, these will be updated for each plot. mutate() add a new column giving the number of samples in each group. In this case there is only one sample in each group, but the code provided here will calculate whatever the number of samples in the group is. ungroup() remove previous groupings. group_by() group by the variables that will be used for plotting PLUS the number of samples in each group (totalSum), and the taxonomic level you are going to plot, in this case, Genus. summarise() is used to do two things: (1) calculate the sum of abundance of all organisms in each group belonging to each genus. In this case, for each unique substrate and replicate combination, we calculate the sum() the abundance of each genus. (2) calculate relative abundance. unique() selects only the unique observations and removes any duplicate rows.
phylum <- all %>%
select(Substrate, Substrate_Label, Replicate, Phylum, Abundance) %>% #choose variables to work with
group_by(Substrate, Substrate_Label, Replicate) %>% #group by variables used to plot NOT taxonomic level
mutate(totalSum = sum(Abundance)) %>% #calculate total abundance of each Phylum
ungroup() %>% #remove grouping variables
group_by(Substrate, Substrate_Label, Replicate, Phylum) %>% #group by same variables PLUS taxonomic level
summarise(
Abundance = sum(Abundance), #sum abundance in each phylum for each group
totalSum,
RelAb = Abundance/totalSum) %>% #calculate relative abundance
unique()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate', 'Phylum'. You can override using the `.groups` argument.
head(phylum)
## # A tibble: 6 × 7
## # Groups: Substrate, Substrate_Label, Replicate, Phylum [6]
## Substrate Substrate_Label Replicate Phylum Abundance total…¹ RelAb
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Control Control R1 Acidobacteriota 0.0275 1 0.0275
## 2 Control Control R1 Actinobacteriota 0.0422 1 0.0422
## 3 Control Control R1 Bacteroidota 0.446 1 0.446
## 4 Control Control R1 Bdellovibrionota 0.00336 1 0.00336
## 5 Control Control R1 Chloroflexi 0.00397 1 0.00397
## 6 Control Control R1 Crenarchaeota 0.00275 1 0.00275
## # … with abbreviated variable name ¹totalSum
Look at descriptive statistics to check that everything is as we would expect. Maximum relative abundance should be no more than one, and the minimum should be greater than zero. The mean should be between the maximum and the minimum.
max(phylum$RelAb)
## [1] 1
mean(phylum$RelAb)
## [1] 0.08108108
min(phylum$RelAb)
## [1] 8.058667e-05
Calculate how many colors will be needed. This is equal to the length of the unique values in the Phylum column of the “phylum” data frame, which in this case is 18. Here we will make a character vector containing a list of colors (phylum_colors) and then check the length to be sure there are enough colors… Here we have 25 colors, which is more than enough!
length(unique(phylum$Phylum))
## [1] 18
phylum_colors <- c(
"grey22", "darkcyan", "orchid1", "green", "orange", "blue", "tomato2", "olivedrab", "grey47",
"cyan", "coral3", "darkgreen", "magenta", "palegoldenrod", "dodgerblue", "firebrick", "yellow", "purple4",
"lightblue", "grey77", "mediumpurple1", "tan4", "red", "darkblue", "yellowgreen")
length(phylum_colors)
## [1] 25
geom_col(): this creates the columns in the plot. The bars should add to exactly 1, if they do not add to 1, check grouping variables since this is the most likely place something went wrong. If the bars do add exactly to 1, percentages from the data frame can be multiplied by 100 to be interpreted as relative abundances. If you end up with many small blocks of the same color, check grouping variables in Step # 3, it’s likely you have multiple groups of the same genus represented by a single bar.
Note on using aes(): Anything inside the aes() function will create a legend (for example, “Fill = Phylum” is inside the aes() function since we want a legend to interpret the phylum colors). Anything outside outside aes() will NOT create a legend (for example, “Color = black” is outside the aes() function since we do not want a legend for the outline around the colored blocks).
facet_grid(): divides the figure into panels based on the variables provided.
scale_fill_manual(): specifies vector of colors to use instead of default colors.
theme_linedraw(): sets theme for graphic. Additional themes are available.
theme(): aesthetic tweaks to appearance, font size, legend position, etc.
guides(): adjusts legend appearance. Can alter the number of rows/columns in the legend.
ggplot(phylum)+
geom_col(mapping = aes(x = Replicate, y = RelAb, fill = Phylum), color = "black", position = "stack", show.legend = TRUE)+
facet_grid(cols = vars(Substrate_Label))+
ylab("Proportion of Community") +
xlab(NULL)+
scale_fill_manual(values = phylum_colors) +
theme_linedraw()+
theme(axis.text.y = element_text(size = 20, color = "black"),
axis.title.y = element_text(size = 18, color = "black"),
axis.text.x = element_text(size = 20, angle = 0, vjust = 1, hjust = 0.5, color = "black"),
legend.text = element_text(size = 13),
legend.position = "bottom",
legend.spacing.x = unit(0.1, 'mm'),
legend.spacing.y = unit(0.05, 'mm'),
plot.margin=grid::unit(c(0.1,0.1,0.1,0.1), "mm"),
strip.text = element_text(size = 18, face = "bold", angle = 0),
legend.title = element_text(face="bold", size = 22))+
guides(fill=guide_legend(ncol=3,byrow=TRUE))
Plotting other taxonomic levels is very similar, starting with the “all” dataframe we created earlier, modify the grouping variables to desired taxonomic level. This example will be grouped by genus.
genus <- all %>%
select(Substrate, Substrate_Label, Replicate, Genus, Abundance) %>%
group_by(Substrate, Substrate_Label, Replicate) %>%
mutate(totalSum = sum(Abundance)) %>%
ungroup() %>%
group_by(Substrate, Substrate_Label, Replicate, Genus) %>%
summarise(
Abundance = sum(Abundance),
totalSum,
RelAb = Abundance/totalSum) %>%
unique()
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate'. You can override using the `.groups` argument.
head(genus)
## # A tibble: 6 × 7
## # Groups: Substrate, Substrate_Label, Replicate [1]
## Substrate Substrate_Label Replicate Genus Abund…¹ total…² RelAb
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Control Control R1 Adhaeribacter 0.0107 1 0.0107
## 2 Control Control R1 Aeromicrobium 0.00122 1 0.00122
## 3 Control Control R1 Ahniella 0.00458 1 0.00458
## 4 Control Control R1 Algoriphagus 0.00397 1 0.00397
## 5 Control Control R1 Altererythrobacter 0.0131 1 0.0131
## 6 Control Control R1 Amaricoccus 0.00519 1 0.00519
## # … with abbreviated variable names ¹Abundance, ²totalSum
Check the distribution of the data, max, mean and minimum should all be between 1 and >0. Here we also check the number of unique genera, there are 230! It would be impossible to visualize 230 colors in a single figure.
max(genus$RelAb)
## [1] 0.6666667
mean(genus$RelAb)
## [1] 0.01092233
min(genus$RelAb)
## [1] 5.597537e-05
length(unique(genus$Genus))
## [1] 230
One solution for this is to group low abundance genera into a single category. Starting with the data frame “all” created earlier, we will group all of the genera present at less than 5% relative abundance into a single group.
Using mutate() paired with ifelse() and summarise() we will group any taxa with low (< 5% relative abundance) into a single category, reducing the number of colors needed.
More detailed explanation of this code block: select() select the variables desired for plotting, dropping excess variables will help code run faster. group_by() group by the variables that will be used for plotting, these will be updated for each plot. mutate() add a new column giving the number of samples in each group. In this case there is only one sample in each group, but the code provided here will calculate whatever the number of samples in the group is. ungroup() remove previous groupings. group_by() group by the variables that will be used for plotting PLUS the number of samples in each group (totalSum), and the taxonomic level you are going to plot, in this case, Genus. summarise() is used to do two things: (1) calculate the sum of abundance of all organisms in each group belonging to each genus. In this case, for each unique substrate and replicate combination, we calculate the sum() the abundance of each genus. (2) we use ifelse() to check if the summed abundance is less than 0.05 (5%), if it is, the Genus name is replaced with “< 5 %”. Now, as the last step, we use group_by() and summarise() to calculate the sum of abundance again (so the low abundance taxa are treated as a single category), and calculate relative abundance. unique() selects only the unique observations and removes any duplicate rows.
genus <- all %>%
select(Substrate, Substrate_Label, Replicate, Genus, Abundance) %>%
group_by(Substrate, Substrate_Label, Replicate) %>%
mutate(totalSum = sum(Abundance)) %>%
ungroup() %>%
group_by(Substrate, Substrate_Label, Replicate, Genus, totalSum) %>%
summarise(
Abundance = sum(Abundance),
Genus = ifelse(Abundance < 0.05, "< 5 %", Genus)) %>% #change Genus label to group low abundance taxa together
group_by(Substrate, Substrate_Label, Replicate, Genus, totalSum) %>% #now group and summarize again to group newly labeled low abundance taxa together
summarise(
Abundance = sum(Abundance),
RelAb = Abundance/totalSum) %>%
unique()
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate', 'Genus'. You can override using the `.groups` argument.
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## `summarise()` has grouped output by 'Substrate', 'Substrate_Label',
## 'Replicate', 'Genus', 'totalSum'. You can override using the `.groups`
## argument.
head(genus)
## # A tibble: 6 × 7
## # Groups: Substrate, Substrate_Label, Replicate, Genus, totalSum [6]
## Substrate Substrate_Label Replicate Genus totalSum Abundance RelAb
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Control Control R1 < 5 % 1 0.723 0.723
## 2 Control Control R1 Edaphobaculum 1 0.226 0.226
## 3 Control Control R1 Hassallia 1 0.0501 0.0501
## 4 Control Control R2 Ochrobactrum 1 0.667 0.667
## 5 Control Control R2 Steroidobacter 1 0.333 0.333
## 6 Control Control R3 < 5 % 1 0.713 0.713
Now check the number of unique genera we have… 17 unique genera! Much easier to visualize!
length(unique(genus$Genus))
## [1] 17
One way to get a long vector of colors is to use colorRampPalette to make a color palette based on a few supplied colors.
colFunc <- colorRampPalette(c("purple4", "lightpink", "firebrick", "orange", "lemonchiffon", "olivedrab4", "darkcyan", "lightblue", "darkblue"))
color_list <- colFunc(length(unique(genus$Genus)))
genus_colors <- c("black", color_list)
length(genus_colors)
## [1] 18
Now we can plot the data! This same process can be used to plot any taxonomic level where there are too many unique genera to visualize all of them.
ggplot(genus)+
geom_col(mapping = aes(x = Replicate, y = RelAb, fill = Genus), color = "black", position = "stack", show.legend = TRUE)+
facet_grid(cols = vars(Substrate_Label))+
ylab("Proportion of Community") +
xlab(NULL)+
scale_fill_manual(values = genus_colors) +
theme_linedraw()+
theme(axis.text.y = element_text(size = 20, color = "black"),
axis.title.y = element_text(size = 18, color = "black"),
axis.text.x = element_text(size = 20, angle = 0, vjust = 1, hjust = 0.5, color = "black"),
legend.text = element_text(size = 13),
legend.position = "bottom",
legend.spacing.x = unit(0.1, 'mm'),
legend.spacing.y = unit(0.05, 'mm'),
plot.margin=grid::unit(c(0.1,0.1,0.1,0.1), "mm"),
strip.text = element_text(size = 18, face = "bold", angle = 0),
legend.title = element_text(face="bold", size = 22))+
guides(fill=guide_legend(ncol=3,byrow=TRUE))
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.