Community Ecology Methods in R

Some examples & code to practice with

GK Smith & ER Moran, October 2013

Part 1. Community data & Diversity measures

library(vegan)
## Warning: package 'vegan' was built under R version 2.15.3
## Loading required package: permute
## This is vegan 2.0-7
library(Hotelling)
## Loading required package: corpcor

To begin, let's open up one of the built-in data sets, called dune:

data(dune)

We can examine the data

head(dune)
##    Belper Empnig Junbuf Junart Airpra Elepal Rumace Viclat Brarut Ranfla
## 2       3      0      0      0      0      0      0      0      0      0
## 13      0      0      3      0      0      0      0      0      0      2
## 4       2      0      0      0      0      0      0      0      2      0
## 16      0      0      0      3      0      8      0      0      4      2
## 6       0      0      0      0      0      0      6      0      6      0
## 1       0      0      0      0      0      0      0      0      0      0
##    Cirarv Hyprad Leoaut Potpal Poapra Calcus Tripra Trirep Antodo Salrep
## 2       0      0      5      0      4      0      0      5      0      0
## 13      0      0      2      0      2      0      0      2      0      0
## 4       2      0      2      0      4      0      0      1      0      0
## 16      0      0      0      0      0      3      0      0      0      0
## 6       0      0      3      0      3      0      5      5      3      0
## 1       0      0      0      0      4      0      0      0      0      0
##    Achmil Poatri Chealb Elyrep Sagpro Plalan Agrsto Lolper Alogen Brohor
## 2       3      7      0      4      0      0      0      5      2      4
## 13      0      9      1      0      2      0      5      0      5      0
## 4       0      5      0      4      5      0      8      5      2      3
## 16      0      2      0      0      0      0      7      0      4      0
## 6       2      4      0      0      0      5      0      6      0      0
## 1       1      2      0      4      0      0      0      7      0      0
names(dune)
##  [1] "Belper" "Empnig" "Junbuf" "Junart" "Airpra" "Elepal" "Rumace"
##  [8] "Viclat" "Brarut" "Ranfla" "Cirarv" "Hyprad" "Leoaut" "Potpal"
## [15] "Poapra" "Calcus" "Tripra" "Trirep" "Antodo" "Salrep" "Achmil"
## [22] "Poatri" "Chealb" "Elyrep" "Sagpro" "Plalan" "Agrsto" "Lolper"
## [29] "Alogen" "Brohor"

So we have a data set of species abundance values, with one row per sampling site and one column per species.

Now let's look at number of species and the number of sampling sites in the dataset:

ncol(dune)  # for the number of species
## [1] 30
nrow(dune)  # for the number of sites
## [1] 20

Alright, on to the fun stuff: diversity! We begin by calculating the Shannon and Simpson index values for each site using the diversity function.

shann <- diversity(dune)  # for the Shannon index
shann
##     2    13     4    16     6     1     8     5    17    15    10    11 
## 2.253 2.100 2.427 1.960 2.346 1.440 2.435 2.544 1.876 1.979 2.399 2.106 
##     9    18     3    20    14    19    12     7 
## 2.494 2.079 2.194 2.048 1.864 2.134 2.114 2.472
simp <- diversity(dune, "simpson")  # Simpson index
simp
##      2     13      4     16      6      1      8      5     17     15 
## 0.8900 0.8522 0.9007 0.8430 0.9002 0.7346 0.9087 0.9140 0.8356 0.8507 
##     10     11      9     18      3     20     14     19     12      7 
## 0.9032 0.8672 0.9116 0.8615 0.8788 0.8678 0.8333 0.8741 0.8686 0.9075
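
To see exactly what diversity() computes, here is a minimal by-hand check for the first site (vegan uses natural logarithms for the Shannon index by default, and its "simpson" option returns 1 - D):

p <- as.numeric(dune[1, ])/sum(dune[1, ])  # relative abundance of each species
p <- p[p > 0]  # drop absent species so log() is defined
-sum(p * log(p))  # Shannon index; should match shann[1] above
1 - sum(p^2)  # Simpson index; should match simp[1] above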
par(mfrow = c(1, 2))  # to generate panels with 1 row of 2 graphs
hist(shann)
hist(simp)

[Figure: histograms of the Shannon and Simpson index values]

Next we can calculate pairwise distance measures between sites based on their species composition:

par(mfrow = c(1, 2))
bray = vegdist(dune, "bray")
gower = vegdist(dune, "gower")
hist(bray)
hist(gower)

[Figure: histograms of the Bray-Curtis and Gower distance values]
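
To see what goes into one of these numbers, here is a by-hand check of the Bray-Curtis distance between the first two sites (the sum of absolute abundance differences divided by the total abundance across both sites):

s1 <- as.numeric(dune[1, ])
s2 <- as.numeric(dune[2, ])
sum(abs(s1 - s2))/sum(s1 + s2)  # should match as.matrix(bray)[1, 2]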

To try some rarefaction, we use the rarefy and rarecurve functions.

sp.abund <- rowSums(dune)  # the total number of individuals found in each plot
sp.rich <- specnumber(dune)  # the observed number of species in each plot
raremax <- min(sp.abund)  # rarefaction uses the smallest number of observations (individuals here) per sample to estimate the expected number of species if every sample contained only that many observations
raremax
## [1] 15
Srare <- rarefy(dune, raremax)
par(mfrow = c(1, 2))
plot(sp.rich, Srare, xlab = "Observed No. of Species", ylab = "Rarefied No. of Species")
rarecurve(dune, col = "blue")

[Figure: observed vs. rarefied species numbers, and rarefaction curves for each site]

# Notice that the rarefied numbers are generally lower than the observed
# numbers, as expected. Notice also that the curves mostly asymptote,
# indicating thorough sampling.
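
As an aside, newer versions of vegan let rarecurve() mark the rarefaction cutoff directly on the curves via its sample argument, which draws a vertical line at the chosen sample size and horizontal lines at the corresponding rarefied richness values:

rarecurve(dune, step = 1, sample = raremax, col = "blue")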

Part 2. Ordination methods

PCA

# By setting the seed number you can be sure to follow the example
# analysis exactly.
set.seed(12345)

# Generate a set of 50 random values (x), and then another set of 50 (y)
# derived from x. These will represent the abundances of two species,
# measured at 50 different sites.
x = rnorm(50, mean = 50, sd = 10)
y = x - runif(50, min = 0, max = 15)

# Convert the data to a data.frame object and rename the columns.

sp_site_matrix = data.frame(x, y)
names(sp_site_matrix) = c("spA", "spB")
# Let's examine the data by plotting it

plot(x, y, pch = 19, xlab = "Abundance of species A", ylab = "Abundance of species B", 
    asp = 1)

[Figure: abundance of species B plotted against abundance of species A]

How are the abundances related?

# Calculate the covariance matrix:
cov(sp_site_matrix)
##       spA   spB
## spA 120.3 125.4
## spB 125.4 149.4
# Species A has a lower variance (120.3) than species B (149.4),
# reflecting slightly less variation in its abundance among the sites.
# Calculate the correlation matrix:
cor(sp_site_matrix)
##        spA    spB
## spA 1.0000 0.9359
## spB 0.9359 1.0000
# Each species is perfectly correlated with itself (the diagonal elements
# are 1), and the correlation between the two species is high, ~0.94.

You can think of the rotation we'd like to perform as one in which we end up with a new series of values that describe the same points, but in which there is ZERO correlation between the two variables (i.e. the off-diagonal elements are 0 instead of ~0.94).

To get there we calculate the eigenvalues and eigenvectors of our original correlation matrix:

e.cor = eigen(cor(sp_site_matrix))
eigenvalues = e.cor$values
eigenvectors = e.cor$vectors
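
Two quick sanity checks on the decomposition: the eigenvalues of a correlation matrix sum to the number of variables, and the eigenvector matrix is orthonormal:

sum(eigenvalues)  # equals 2, the number of variables
round(t(eigenvectors) %*% eigenvectors, 10)  # the 2 x 2 identity matrix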

A plot of the rescaled values should look exactly like our previous figure, but with new scales for the x and the y axes.

x2 = scale(sp_site_matrix)
plot(x2, pch = 19, xlab = "Scaled abundance of species A", ylab = "Scaled abundance of species B", 
    asp = 1)

[Figure: scaled abundance of species B vs. scaled abundance of species A]
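
As a quick check, scale() has centered each column to mean 0 and rescaled it to standard deviation 1:

round(colMeans(x2), 10)  # column means are (numerically) zero
apply(x2, 2, sd)  # column standard deviations are 1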

To calculate the slopes of the Principal Component axes, we use the eigenvectors, since they describe the rotation we want to perform:

pc1.slope = eigenvectors[2, 1]/eigenvectors[1, 1]  # slope = rise/run: 2nd element over 1st
pc2.slope = eigenvectors[2, 2]/eigenvectors[1, 2]

plot(x2, pch = 19, xlab = "Scaled abundance of species A", ylab = "Scaled abundance of species B", 
    asp = 1)
abline(0, pc1.slope, col = "blue", lwd = 2, lty = 2)
abline(0, pc2.slope, col = "purple", lwd = 2, lty = 2)
text(-2.5, -1.5, "PC 1", col = "blue")
text(1.5, -0.5, "PC 2", col = "purple")

[Figure: scaled abundances with the PC 1 (blue) and PC 2 (purple) axes overlaid]

Note how the lines whose slopes we calculated using the eigenvectors map onto the first major axis of variation (blue) and the second major axis of variation (purple). Note as well that these axes are perpendicular (i.e. orthogonal) to one another.

Now, we can actually perform the rotation and re-plot our values (equivalent to plotting the scores, a term we'll see again later). To do this we multiply our rescaled values by the eigenvectors (equivalent to loadings, another term we'll see in a bit).

This requires matrix multiplication rather than element-wise multiplication, so we use R's %*% operator (the * symbol surrounded by % symbols):

scores = x2 %*% eigenvectors

# And we plot:

plot(scores, pch = 19, ylim = c(-2.5, 2.5), xlab = "Principal Component 1", 
    ylab = "Principal Component 2", las = 1)
abline(h = 0, col = "blue", lwd = 2, lty = 2)
abline(v = 0, col = "purple", lwd = 2, lty = 2)
text(3, -0.5, "PC 1", col = "blue")
text(0.5, 2, "PC 2", col = "purple")

[Figure: the scores plotted on the rotated axes, PC 1 vs. PC 2]

Now the trendlines describing the first and second major axes of variation are perfectly horizontal (y = 0) and vertical (x = 0).
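
We can also confirm numerically that the rotation achieved our goal: the two score columns are uncorrelated, and their variances equal the eigenvalues:

round(cor(scores), 10)  # the off-diagonal elements are now 0
apply(scores, 2, var)  # equal to the eigenvalues calculated earlier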

How would we do this without all that work by hand?
By using one of several built-in R functions that calculate Principal Components. For example, using the function prcomp:

pc_output = prcomp(sp_site_matrix, center = TRUE, scale = TRUE)
pc_output
## Standard deviations:
## [1] 1.3914 0.2532
## 
## Rotation:
##         PC1     PC2
## spA -0.7071 -0.7071
## spB -0.7071  0.7071

The output of the prcomp function yields two primary objects:
1. A set of standard deviation values.
2. A table of 'rotation' values.
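
Both should look familiar: the rotation matrix is the eigenvector matrix we computed by hand (possibly with the signs of some columns flipped), and squaring the standard deviations recovers the eigenvalues:

eigenvectors  # compare to pc_output$rotation (column signs may differ)
eigenvalues  # compare to pc_output$sdev^2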

Let's examine the values in $sdev first:

pc_output$sdev
## [1] 1.3914 0.2532
(pc_output$sdev)^2
## [1] 1.93587 0.06413

These values reflect how much of the total variation in the original data is explained by each PC axis. To determine the proportional amounts, we divide by the total variance.

pc_output$sdev^2/(sum(pc_output$sdev^2))
## [1] 0.96794 0.03206

So the first PC axis explains ~97% of the total variance, while the remaining ~3% is explained by PC 2. We could obtain the same values by running the summary function on the prcomp output object:

summary(pc_output)
## Importance of components:
##                          PC1    PC2
## Standard deviation     1.391 0.2532
## Proportion of Variance 0.968 0.0321
## Cumulative Proportion  0.968 1.0000

This presents the importance of each component, first listing the standard deviation of each PC, then the proportion of the total variance it explains, then the cumulative proportion of variance explained, going from the first to the last PC.
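
The proportion and cumulative proportion rows can be reproduced by hand from the standard deviations:

prop_var = pc_output$sdev^2/sum(pc_output$sdev^2)
prop_var  # proportion of variance, as calculated earlier
cumsum(prop_var)  # cumulative proportion of variance explained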

There are a couple of different helpful plots we can generate with our Principal Components Analysis output:

par(mfrow = c(1, 2))
plot(pc_output, las = 1)
biplot(pc_output, las = 1)

[Figure: screeplot and biplot of the PCA output]

The first is a screeplot, which illustrates the relative contributions of each PC to explaining the variation in the original dataset. These values will always be declining, as each PC explains as much of the remaining variation as possible. Screeplots can be useful when evaluating how many PC axes should be retained when using PCA for dimension-reduction purposes.
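
For example, one rough rule of thumb is to retain enough axes to explain some target fraction of the variance; a minimal sketch, using a hypothetical 90% cutoff:

cum_var = cumsum(pc_output$sdev^2)/sum(pc_output$sdev^2)
which(cum_var >= 0.9)[1]  # number of axes needed to reach 90% (here, 1)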

The second is a biplot, which should look familiar (since it resembles the rotated figure we produced above - if you can imagine the axes re-scaled).

Or does it? Actually, it resembles a version of the plot we made above, but flipped left to right. That's because the sign of each PC axis is arbitrary: a high PC score may correspond to low (or more negative) values of your original data. It is helpful to examine the loadings (and especially their signs) to determine how a particular variable is reflected in the PC scores.
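
We can see the sign flip directly by comparing our hand-rotated scores to the scores stored by prcomp in pc_output$x; in this run the magnitudes match but the signs of PC 1 are reversed, which is why the biplot appears mirrored:

head(cbind(by_hand = scores[, 1], prcomp = pc_output$x[, 1]))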

par(mfrow = c(1, 3))
biplot(pc_output)
plot(pc_output$x, pch = 19)
plot(pc_output$x, pch = 19, asp = 1)

[Figure: biplot alongside plots of the PC scores, without and with asp = 1]

Another way to see this is to look at the arrows in the biplot. Their directions show that the abundances of both species load negatively onto PC 1, meaning high abundances of each species are reflected by low PC 1 scores. The directions of the arrows in the second dimension indicate that PC 2 is positively correlated with the abundance of species B and negatively correlated with the abundance of species A.
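
A direct way to check these relationships is to correlate the original variables with the scores; the signs mirror what the arrows show:

cor(sp_site_matrix, pc_output$x)
# Both species correlate negatively with PC 1; PC 2 correlates positively
# with spB and negatively with spA.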