Vegan cheat sheet

Introduction and main takeaways

In this workshop, we worked through real data on bird communities in three different habitat types. The data are published in a really cool paper: Adams, Bryce T. and Stephen N. Matthews. “Diverse temperate forest bird assemblages demonstrate closer correspondence to plant species composition than vegetation structure.” Ecography. https://doi.org/10.1111/ecog.04487

With the functions from this workshop, you’ll be able to answer the following questions about the data (and your own!):

How speciose are my communities? specnumber()
How diverse are my communities? diversity()
How different are my communities in species composition? adonis(), rda(), metaMDS()
How do species drive the separation of my communities? envfit()
How is community structure related to specific environmental variables? cca()

Let’s start by loading in our libraries and data.

# libraries
library(tidyverse)
library(vegan)
library(ggvegan)

# data
# bird communities
birds <- read_csv(here::here("data", "bird-comm.csv")) %>% 
  column_to_rownames("site")

# environmental variables
env <- read_csv(here::here("data", "env-var.csv"))

# set up a "metadata" frame - will be useful for plotting later!
site_type <- env %>% 
  select(site, landtype)

What’s all this data?

Bird communities were sampled using point counts at 120 points distributed amongst 3 ecological landtypes: 1. dry (ridgetops), 2. riparian (bottomlands), and 3. mixed (hillslopes). The landtypes can be distinguished using their most common tree species, though the site is speciose (~75 tree species).

The birds data frame is in the format that data generally have to be in to start doing any analysis, which is to say that samples (or sites) are in rows, and species are columns. The only entries in the data frame are counts.
If you’d like to explore what species are in this set, you can translate alpha codes to common names using this chart.

head(birds, 3)

##        ACFL AMCR AMGO AMRE AMRO BAOR BAWW BGGN BHCO BLBW BLJA BRTH BTNW BWWA
## site_1    0    0    1    0    0    0    1    0    0    0    1    0    0    0
## site_2    9    0    0    0    3    0    2    2    1    0    1    0    0    0
## site_3    4    0    0    0    3    0    0    4    4    0    0    0    0    0
##        CACH CARW CEDW CERW CHSP COYE EAPH EATO EAWP ETTI FISP GCFL GRCA HOWA
## site_1    0    0    0    0    0    0    0    1    2    0    0    0    0    7
## site_2    0    0    0    0    0    0    1    0    0    0    0    0    0    4
## site_3    0    0    0    3    0    0    0    1    0    2    0    0    0    2
##        INBU KEWA LOWA NOCA NOPA OVEN PIWA PRAW RBGR REVI RWBL SCTA SUTA WBNU
## site_1    0    0    0    0    0    6    0    0    0   12    0    0    0    1
## site_2    0    0    0    1    0    6    0    0    0    6    0    4    0    0
## site_3    0    0    0    0    0    8    0    0    0    9    0    3    0    3
##        WEVI WEWA WOTH YBCH YTVI YTWA
## site_1    0    0    0    0    1    0
## site_2    0    2    7    0    1    0
## site_3    0    0    6    0    4    0

The env data frame contains all the “environmental” variables that were measured, which in this case was tree community composition and canopy structure. An explanation of the columns is below:

Column	Meaning
site	point count (1-120)
landtype	ecological landtype (dry, riparian, mixed)
stems_ha	total stem density (# of stems per hectare) of both under and overstory stems
big_stem_bas	large stem basal area (m2/hectare) of trees [stems/trees >/=8 cm diameter at breast height (DBH)]
canopy_cover	percent canopy cover
canopy_height	mean canopy height (m)

How speciose are my communities?

specnumber() will tell you the number of species within each sample. You can then run an analysis of variance to ask if mean species richness is significantly different across sites.

sppr <- specnumber(birds)

# analysis of variance takes the same form as the usual models you'd see in R
# response ~ dependent, data = environmental grouping
sppr_aov <- aov(sppr ~ landtype, data = site_type)
summary(sppr_aov)

##              Df Sum Sq Mean Sq F value Pr(>F)  
## landtype      2     86    43.0   2.774 0.0648 .
## Residuals   207   3209    15.5                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like there’s no significant difference in species richness between landtypes. Let’s plot!

sppr_df <- sppr %>% 
  enframe() %>% 
  full_join(site_type, by = c("name" = "site"))

pal <- c("lightsalmon1", "gold1", "palegreen4")

plot_sppr <- ggplot(sppr_df, aes(x = landtype, y = value, fill = landtype)) +
  geom_boxplot() +
  scale_fill_manual(values = pal) +
  scale_x_discrete(labels = c("dry \n (n = 96)", "mix \n (n = 59)", "riparian \n (n = 55)")) +
  theme(legend.position = "none",
        plot.background = element_rect("white"),
        panel.background = element_rect("white"),
        panel.grid = element_line("grey90"),
        axis.line = element_line("gray25"),
        axis.text = element_text(size = 12, color = "gray25"),
        axis.title = element_text(color = "gray25"),
        legend.text = element_text(size = 12)) + 
  labs(x = "Ecological landtype",
       y = "Number of species per site",
       title = "Species richness")
plot_sppr

How diverse are my communities?

There are many diversity metrics out there, but I’ve only included Shannon diversity here, takes into account species abundance and evenness: \[ H = -\sum_{i=1}^{R}p_iln(p_i) \] where \(R\) is total richness, \(p_i\) is the proportion of \(R\) of the \(i\)th species. The diversity() function in vegan will calculate Shannon, Simpson, and Fisher’s alpha - just select which one you want in the arguments. Here, I’ll calculate Shannon diversity for each site, then plot mean Shannon diversity per landtype.

shannondiv <- diversity(birds)
head(shannondiv)

##   site_1   site_2   site_3   site_4   site_5   site_6 
## 1.812353 2.443171 2.502177 2.449965 2.505759 2.508055

You can do the same analysis of variance on Shannon diversity.

sppdiv_aov <- aov(shannondiv ~ landtype, data = site_type)
summary(sppdiv_aov)

##              Df Sum Sq Mean Sq F value Pr(>F)  
## landtype      2  0.399 0.19958   2.817 0.0621 .
## Residuals   207 14.668 0.07086                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No significant difference in diversity either. Another plot, now in column form.

shandiv_df <- shannondiv %>% 
  # put all those calculations into a data frame
  enframe() %>% 
  # rename columns for ease of joining
  rename(site = name,
         shan_div = value)

div_plot_df <- shandiv_df %>% 
  # join with site_type
  full_join(site_type, ., by = "site") %>% 
  # group by landtype
  group_by(landtype) %>% 
  # calculate mean and standard error of diversity
  summarize(mean = round(mean(shan_div), 2),
            err = sd(shan_div)/sqrt(length(shan_div))) %>% 
  dplyr::mutate(label = "mean") %>% 
  unite("mean_label", label, mean, sep = " = ", remove = FALSE)

clean_background <- theme(plot.background = element_rect("white"),
        panel.background = element_rect("white"),
        panel.grid = element_line("white"),
        axis.line = element_line("gray25"),
        axis.text = element_text(size = 12, color = "gray25"),
        axis.title = element_text(color = "gray25"),
        legend.text = element_text(size = 12),
        legend.key = element_rect("white"))

plot_shandiv <- ggplot(div_plot_df, aes(x = landtype, y = mean, fill = landtype)) +
  geom_col(color = "black") +
  scale_fill_manual(values = pal) +
  geom_errorbar(aes(ymin = mean - err, ymax = mean + err), width = 0.5) +
  geom_text(aes(x = landtype, y = mean + err + 0.07, label = mean_label)) +
  scale_x_discrete(labels = c("dry \n (n = 96)", "mix \n (n = 59)", "riparian \n (n = 55)")) +
  scale_y_continuous(limits = c(0, 2.75), expand = c(0,0)) +
  clean_background + 
  theme(legend.position = "none") +
  labs(x = "Ecological landtype",
       y = "Mean Shannon diversity",
       title = "Shannon diversity")
plot_shandiv

How different are my communities in species composition?

Permutational Multivariate Analysis of Variance (perMANOVA)

Assessing differences in community composition is done with permutational Multivariate Analysis of Variance, or perMANOVA. These tests are done on distances, meaning that they assess the differences between communities based on dissimilarity. With perMANOVA, the null hypothesis is that the centroids of your groups (in ordination space as defined by the dissimilarity measure you’ve chosen) are equivalent for all groups. In other words, you are asking if, following some measure of (dis)similarity, the community composition of sites between groups is the same.

The function adonis() will do this for you.

bird_perm <- adonis(birds ~ landtype, data = env)
bird_perm

## 
## Call:
## adonis(formula = birds ~ landtype, data = env) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## landtype    2    0.9465 0.47326  3.6684 0.03423  0.001 ***
## Residuals 207   26.7049 0.12901         0.96577           
## Total     209   27.6514                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bird communities differ by landtype, yay! We can now represent our communities in ordination space, which distills the multivariate (i.e. species) dataset into two axes. Like diversity, there are very many different methods of ordination. I’m still wrapping my head around how to pick the best method for any given dataset, but the way most people tend to choose a method is to go with whatever has been done in their field.

Principal Components Analysis (PCA)

Principal Components Analysis (PCA) is a basic form of ordination wherein the goal is to

reduce a data set with n cases (objects) and p variables (attributes) to a smaller number of synthetic variables that represent most of the information in the original data set. Thus, we reduce a data set of n objects in a p-dimensional space to n objects in a reduced k-dimensional space, where k is typically much smaller than p (McCune and Grace 2001).

For a fun explanation of PCA using superheroes, see Allison Horst’s lectures on PCA here and here.

The function in vegan for PCA is rda(), which technically stands for Redundancy Analysis. I won’t go into RDA, but when you run this function on your site x species matrix without any environmental variables it does a PCA.

birdPCA <- rda(birds)
birdPCA

## Call: rda(X = birds)
## 
##               Inertia Rank
## Total           97.82     
## Unconstrained   97.82   48
## Inertia is variance 
## 
## Eigenvalues for unconstrained axes:
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 23.059 12.015  8.711  6.485  5.608  4.859  4.120  3.187 
## (Showing 8 of 48 unconstrained eigenvalues)

PCAscores <- scores(birdPCA, display = "sites") %>% 
  as.data.frame() %>% 
  rownames_to_column("site") %>% 
  full_join(site_type, by = "site")

PCAvect <- scores(birdPCA, display = "species") %>% 
  as.data.frame()

plot_PCA <- ggplot() +
  geom_point(data = PCAscores, aes(x = PC1, y = PC2, color = landtype)) +
  scale_color_manual(values = pal) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_segment(data = PCAvect, aes(x = 0, y = 0, xend = PC1, yend = PC2), arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text(data = PCAvect, aes(x = PC1, y = PC2, label = rownames(PCAvect))) +
  clean_background +
  labs(x = "PC1 (23.57%)",
       y = "PC2 (12.23%)",
       title = "Principal Components Analysis") 
plot_PCA

There’s a lot to look at here, but I would pay most attention to the arrows and their direction. In this ordination, points are sites and arrows are species. Length of the arrow indicates the amount of variation in your communities explained by that particular variable (longer arrows -> larger increase) and the angle of the arrows to each other indicates correlations (the more obtuse the angle, the less correlated).

Working with `ggvegan`

You can take the same steps above using ggvegan. This package gets vegan outputs into dataframes that can then be manipulated using ggplot2 (or you can plot directly). I’ve used it here for PCA, but this works for any vegan output. The number of steps taken to get to a plot are more or less the same, but you might like this more!

autoplot() will give you an automatic plot without customization.

# autoplot()
PCA_biplot <- autoplot(birdPCA)
PCA_biplot

fortify() will get give you a data frame from which you can then extract elements.

# fortify()
PCA_fortify <- fortify(birdPCA)

# extract site coords (points)
PCA_fort_sites <- PCA_fortify %>% 
  filter(Score == "sites") %>% 
  full_join(., site_type, by = c("Label" = "site"))

# extract species coords (segment)
PCA_fort_species <- PCA_fortify %>% 
  filter(Score == "species")

PCA_fortify_plot <- ggplot() +
  geom_point(data = PCA_fort_sites, aes(x = PC1, y = PC2, col = landtype)) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  scale_color_manual(values = c("coral", "lightgreen", "darkblue")) +
  geom_segment(data = PCA_fort_species, aes(x = 0, xend = PC1, y = 0, yend = PC2), 
               arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text(data = PCA_fort_species, aes(x = PC1, y = PC2, label = Label)) +
  clean_background +
  labs(x = "PC1 (23.57%)",
       y = "PC2 (12.23%)",
       title = "Principal Components Analysis - using fortify()")
PCA_fortify_plot

Non-metric Multidimensional Scaling (NMDS)

Most community ecologists use Non-metric Multidimensional Scaling (NMDS). You can imagine NMDS as a reduction of axes, where all your “axes” are the species within a sample, and each sample exists relative to others on the axes. For this dataset, imagine an axis that describes relative abundance of ACFL (Acadian Flycatchers) - all points exist somewhere on that axis relative to the others. Now consider another axis describing the relative abundance of KEWA (Kentucky Warblers) - all points still exist somewhere on that axis, but now there are two axes describing the position of your points. NMDS allows you to collapse all these species axes (in this case, 48) into 2 to plot in cartesian space in order to visualize the differences between samples and sites. We’ll do this by using metaMDS().

bird_NMDS <- metaMDS(birds)

## Wisconsin double standardization
## Run 0 stress 0.2503763 
## Run 1 stress 0.2531572 
## Run 2 stress 0.2561305 
## Run 3 stress 0.262508 
## Run 4 stress 0.2650259 
## Run 5 stress 0.2651364 
## Run 6 stress 0.2562296 
## Run 7 stress 0.2612796 
## Run 8 stress 0.2628524 
## Run 9 stress 0.263064 
## Run 10 stress 0.2575937 
## Run 11 stress 0.2534864 
## Run 12 stress 0.2536264 
## Run 13 stress 0.2555303 
## Run 14 stress 0.2615288 
## Run 15 stress 0.2614233 
## Run 16 stress 0.2490149 
## ... New best solution
## ... Procrustes: rmse 0.0190721  max resid 0.1574522 
## Run 17 stress 0.258342 
## Run 18 stress 0.257506 
## Run 19 stress 0.2573904 
## Run 20 stress 0.2560168 
## *** No convergence -- monoMDS stopping criteria:
##      2: no. of iterations >= maxit
##     18: stress ratio > sratmax

bird_NMDS

## 
## Call:
## metaMDS(comm = birds) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     wisconsin(birds) 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.2490149 
## Stress type 1, weak ties
## No convergent solutions - best solution after 20 tries
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'wisconsin(birds)'

Looking at the stressplot for your NMDS is an important part of evaluating how well the ordination represented the complexity in your data. The x-axis is the observed dissimilarity, and the y-axis is the ordination distance. The stressplot shows you how closely the ordination (y-axis) represents the dissimilarities calculated (x-axis). The points around the red stair steps are the communities, and the distance from the line represents the “stress”, or how they are pulled from their original position to be represented in their ordination.

stressplot(bird_NMDS)

You can plot your NMDS output in base R…

plot(bird_NMDS)

… but you can also extract elements from the output to plot in ggplot, which is much nicer.

plot_df <- scores(bird_NMDS, display = "sites") %>% 
  as.data.frame() %>% 
  rownames_to_column("site") %>% 
  full_join(site_type, by = "site")

plot_nmds <- ggplot(plot_df, aes(x = NMDS1, y = NMDS2, color = landtype, shape = landtype)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_manual(values = pal) +
  stat_ellipse(linetype = 2, size = 1) +
  clean_background +
  labs(title = "NMDS")
plot_nmds

Things to consider about stress

Generally, an average stress of > 0.2 is an indication that the ordination didn’t do a good job of representing community structure. However, NMDS stress increases as sample size increases, so if you have many samples (as this dataset does) your ordination will likely exhibit a lot of stress. As a quick example, I’ve subsampled 15 communities from the original set of 210. If you look at the stress from the NMDS and the stress plot, it’s within the range that’s considered “acceptable.”

sub <- birds[sample(nrow(birds), 15), ]
subNMDS <- metaMDS(sub)

## Wisconsin double standardization
## Run 0 stress 0.1555764 
## Run 1 stress 0.155526 
## ... New best solution
## ... Procrustes: rmse 0.03580192  max resid 0.09832266 
## Run 2 stress 0.1623118 
## Run 3 stress 0.1555261 
## ... Procrustes: rmse 0.0001005491  max resid 0.000268755 
## ... Similar to previous best
## Run 4 stress 0.155526 
## ... New best solution
## ... Procrustes: rmse 0.0001006753  max resid 0.0002691296 
## ... Similar to previous best
## Run 5 stress 0.1623118 
## Run 6 stress 0.1688421 
## Run 7 stress 0.1722275 
## Run 8 stress 0.1736257 
## Run 9 stress 0.1555764 
## ... Procrustes: rmse 0.03576806  max resid 0.09791903 
## Run 10 stress 0.155526 
## ... Procrustes: rmse 0.0001122697  max resid 0.0002880662 
## ... Similar to previous best
## Run 11 stress 0.2234705 
## Run 12 stress 0.1623118 
## Run 13 stress 0.1835247 
## Run 14 stress 0.155526 
## ... Procrustes: rmse 5.168914e-05  max resid 0.0001379997 
## ... Similar to previous best
## Run 15 stress 0.1623118 
## Run 16 stress 0.2203251 
## Run 17 stress 0.1623118 
## Run 18 stress 0.2228459 
## Run 19 stress 0.1688422 
## Run 20 stress 0.1757642 
## *** Solution reached

stressplot(subNMDS)

Using `envfit()`

envfit() is a vegan function that allows you to determine the relative contribution of environmental variables to the separation of your communities along ordination axes. However, you can also use it to answer the same questions about species: How do species contribute to the dissimilarity of communities?

# envfit() takes the output of metaMDS() and the species matrix you created
fit <- envfit(bird_NMDS, birds, perm = 999) 

# extract p-values for each species
fit_pvals <- fit$vectors$pvals %>% 
  as.data.frame() %>% 
  rownames_to_column("species") %>% 
  dplyr::rename("pvals" = ".")

# extract coordinates for species, only keep species with p-val = 0.001
fit_spp <- fit %>% 
  scores(., display = "vectors") %>% 
  as.data.frame() %>% 
  rownames_to_column("species") %>% 
  full_join(., fit_pvals, by = "species") %>% 
  filter(pvals == 0.001)

# new plot
nmds_plot_new <- ggplot(plot_df, aes(x = NMDS1, y = NMDS2)) +
  coord_fixed() +
  geom_point(aes(color = landtype, shape = landtype), size = 3, alpha = 0.8) +
  stat_ellipse(aes(color = landtype)) +
  scale_color_manual(values = pal) +
  geom_segment(data = fit_spp, aes(x = 0, xend = NMDS1, y = 0, yend = NMDS2),
               arrow = arrow(length = unit(0.25, "cm")),
               col = "black") +
  geom_text(data = fit_spp, aes(label = species)) +
  clean_background
nmds_plot_new

How is community structure related to specific environmental variables?

The above ordination methods are “unconstrained” ordination, which means that the ordination is done only considering species counts in the site x species matrix. Constrained ordination is appropriate for what’s common in ecological data: a matrix of communities (site x species) and another matrix of environmental characteristics (site x environment). Our environmental matrix is env, created up top.

With constrained ordination, we can ask how environmental variables relate to community composition. Again, there are many different methods for constrained ordination and what’s appropriate for your data is up to you. This example demonstrates Canonical Correspondence Analysis, or CCA.

Let’s say I’m interested in the effects of canopy height, stems/hectare, and big stem diameter on bird community structure.

birdCCA <- cca(birds ~ canopy_height + stems_ha + big_stem_bas, data = env)
birdCCA

## Call: cca(formula = birds ~ canopy_height + stems_ha + big_stem_bas,
## data = env)
## 
##               Inertia Proportion Rank
## Total         1.88755    1.00000     
## Constrained   0.16996    0.09004    3
## Unconstrained 1.71759    0.90996   47
## Inertia is scaled Chi-square 
## 
## Eigenvalues for constrained axes:
##    CCA1    CCA2    CCA3 
## 0.11865 0.02884 0.02246 
## 
## Eigenvalues for unconstrained axes:
##     CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8 
## 0.24246 0.13582 0.12619 0.07986 0.07478 0.06388 0.06137 0.05277 
## (Showing 8 of 47 unconstrained eigenvalues)

We can do this with the base plotting tool:

plot(birdCCA)

But we can also extract this info to plot in ggplot. The tricky additional step is to get the scaling right for the vectors (blue in the base plot). To do this, you can store the base plot in an object (I call mine ccaplot), run that objectccaplot, and extract the attr(,"arrow.mul") value from the summary that pops up in the console. Note: I don’t know how to reproducibly do this; I just run the object and manually take out the value.

Then, I’ll extract the scores for the biplot vectors, site coordinates, and species coordinates.

# vectors
ccavectors <- as.matrix(scores(birdCCA, display = "bp", scaling = "species")*7.627807) %>% 
  as.data.frame()

# site coordinates
site_data <- scores(birdCCA, display = "sites") %>% 
  as.data.frame() %>% 
  rownames_to_column("site") %>% 
  full_join(site_type, by = "site")

# species coordinates
species_data <- scores(birdCCA, display = "species") %>% 
  as.data.frame()

# plotting
plot_cca <- ggplot(site_data) +
  geom_point(aes(x = CCA1, y = CCA2, color = landtype), shape = 19, size = 2, alpha = 0.8) +
  coord_fixed() +
  scale_color_manual(values = pal) +
  geom_segment(data = ccavectors, aes(x = 0, y = 0, xend = CCA1, yend = CCA2), arrow = arrow(length = unit(0.2, "cm"))) +
  geom_vline(xintercept = c(0), color = "grey70", linetype = 2) +
  geom_hline(yintercept = c(0), color = "grey70", linetype = 2) +
  geom_point(data = species_data, aes(x = CCA1, y = CCA2), shape = 17, size = 2, color = "slateblue") +
  scale_x_continuous(limits = c(-12, 10)) +
  scale_y_continuous(limits = c(-3, 12)) +
  geom_text(data = ccavectors, aes(x = CCA1, y = CCA2, label = rownames(ccavectors)), nudge_x = 0.3, nudge_y = 0.3) +
  clean_background +
  labs(title = "Canonical Correspondence Analysis")
plot_cca

Bibliography and further reading

Many of these functions are described in more technical language in Jari Oksanen’s own tutorial for using vegan for ordination. He also has very helpful slides from his course on multivariate methods.
Roeland Kindt’s BiodiversityR provides a Graphical User Interface to several functions in vegan - thanks to Roeland for reaching out to me!
Additionally, Roeland’s Tree Diversity Analysis manual provides an overview of statistical methods for biodiversity studies, download here.
Michael M. Palmer’s Ordination Methods - an overview contains descriptions of many different ordination methods that I revisit often.
This comparison of different techniques in multivariate statistics is basically a flowchart for deciding which test is best for your questions and data. It’s a lifesaver! It comes from a whole course on multivariate statistics in ecology at UMass Amherst.
McCune and Grace’s Analysis of Ecological Communities is an invaluable resource for learning about different analyses, and what is appropriate given your data.
For microbial ecologists, GUSTA ME is a great summary of ordination techniques in the context of microbial work.