#Load libraries
library("hillR")
library("tidyverse")
library("vegan")
library("asbio")
library("gridExtra")

Overview:

  1. Calculate diversity indices to reduce the dimensionality of species columns as response variables. As part of this, you will review richness, Simpson's diversity, and Shannon's diversity, along with the associated concept of Hill numbers.
  2. Explore multivariate methods, including Multivariate Analysis of Variance (MANOVA), ordination, and multivariate regression.

Lab Introduction: The Analysis of Multivariate Data

Welcome back! So far the data sets we have been working with have not contained more than two predictor or response variables. However, that is often NOT the case in ecology; instead, we are interested in how multiple response variables relate simultaneously to one or more predictors. What happens if we are interested in the impacts of nitrogen addition on a plant community and we measure the cover of every plant we find? Suddenly, instead of nitrogen predicting one plant, we are using nitrogen to predict MANY plants. What if we are interested in the impacts of many climate variables on plant germination?

Today’s lab will show how different multivariate methods can help identify patterns AND reduce the dimensionality of our data. This includes useful summary statistics such as richness, diversity indices, and Hill numbers, as well as multivariate techniques such as PCA, RDA, factor analysis, and random forest. This is certainly not an exhaustive list, but these are commonly used techniques that can help us better understand data sets with many variables.

Summary/Comparison of Multivariate Methods

Summarizing Multivariate Response Variables: Richness, Diversity, and Hill Numbers

Background

Brief Overview

This section focuses on alpha diversity. Richness, diversity indices, and Hill numbers are indices that summarize multivariate data across some unit of collection. These indices are often used in reference to species but, because they represent counts of unique entities (richness) or counts of unique entities weighted by their abundance (evenness), they have broad application (e.g. DNA sequences, interactions, OTUs). These summary statistics can serve as either predictors or response variables, depending on the data and the questions being asked. The key takeaway of this section, although we go into a bit of detail about measures of diversity, is that diversity indices effectively reduce the dimensionality of response variables that would otherwise be multiple columns of species abundance data.

Richness, Simpson's, and Shannon's Diversity As mentioned above, richness refers to the number of unique entities within a group. Methods that take abundance into account, and thereby represent measures of evenness, include Simpson's and Shannon's diversity. Both of these indices use the proportion of individuals belonging to a given species (pi) relative to the total number of individuals. Because of differences in their calculations, however, Simpson's diversity puts more weight on common species and evenness, whereas Shannon's diversity is more sensitive to species richness and hence to the presence of rare species. One is not better than the other, and it is often good practice to report both, since they add different layers of information. These are often plotted in diversity profile curves using richness, Shannon, and Simpson values transformed to Hill numbers.

Fig. 1. Shannon's and Simpson's Diversity

Hill Numbers Following an important publication (Jost et al. 2006), Hill numbers have become, in many cases, the preferred way to report diversity. Hill numbers are a convenient algebraic transformation of richness and other diversity indices (e.g. Simpson's and Shannon's), such that values for diversity can be intuitively interpreted as "equivalent numbers of species". For example, if diversity = 4 in group A and diversity = 8 in group B, we can say group B is twice as diverse as A when our metric for diversity is expressed as equivalent species. If you did not use Hill numbers, you could deduce that diversity is greater in group B, but not by an intuitive or comparable magnitude. When you transform values for diversity into effective species, you are linearizing the relationship between different measures of diversity (Figure 2). Click here if you are new to the concept of Hill numbers. Click this link if you are interested in learning more about this transformation. Hill numbers are calculated using an equation that contains the order parameter "q". Increasing q increases the weight given to abundant species. q = 0 is species richness, which ignores abundance; q = 1 and q = 2 correspond to Shannon's and inverse Simpson's diversity, respectively. Hill numbers are expressed as the effective number of species (or whatever else you are measuring), which should be thought of as the equivalent number of equally abundant species.
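
To make the role of q concrete, here is a minimal sketch (not part of the original lab) of the general Hill number formula, qD = (sum of pi^q)^(1/(1-q)), applied to a made-up abundance vector. The q = 1 case is handled as a limit, which works out to the exponential of Shannon entropy.

hill_number <- function(counts, q) {
  p <- counts[counts > 0] / sum(counts)   # relative abundance of each species
  if (q == 1) {
    exp(-sum(p * log(p)))                 # q = 1 is the limit: exp(Shannon entropy)
  } else {
    sum(p^q)^(1 / (1 - q))                # general Hill number formula for q != 1
  }
}

abund <- c(50, 25, 12, 6, 3, 1)           # hypothetical abundances for 6 species
sapply(c(0, 1, 2), function(q) hill_number(abund, q)) # richness, exp(Shannon), inverse Simpson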

Fig. 2. Linearizing the relationship between diversity indices to make for more intuitive comparison

Image taken from

Estimating Diversity We do not go into detail here, but if you are less familiar with methods of calculating and estimating diversity, check out Chapter 13: The Measurement of Biodiversity in A Primer of Ecological Statistics (Gotelli & Ellison 2013). In the code to follow, we will calculate species diversity to describe diversity at the different elevations from which we collected data. It is important to remember that if we wish to compare the diversity of two groups thoroughly, sampling effect needs to be taken into account. Sampling effect refers to two ideas: 1) the diversity of species sampled from a group is correlated with abundance in that group, and 2) we collect only a sample of what is there, such that if we were to continue collecting, we would expect to accumulate more species. Making inferences about the diversity of an assemblage, with confidence intervals around our estimates, typically involves: 1) creating rarefaction curves, 2) comparing these rarefaction curves statistically or by gauging overlap in their confidence intervals, and 3) extrapolating species richness beyond the range of sampling units through the use of asymptotic estimators (Anne Chao does a lot of work on this). In Figure 3 below, we see a diversity profile with rarefaction curves. Each rarefaction curve has an interpolated and an extrapolated component, each line represents a unique measure of diversity, and the y axis can be interpreted as species equivalents. We see that diversity for all measures, including richness, is not very different between the girdled and logged groups.
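
If you want to produce rarefaction/extrapolation curves like those in Figure 3, the iNEXT package (listed below among the diversity packages) does this directly. Here is a minimal sketch using made-up abundance vectors for the two forest types; the numbers are hypothetical and only illustrate the workflow.

# install.packages("iNEXT") # iNEXT is not loaded at the top of this lab
girdled_counts <- c(30, 12, 8, 5, 3, 2, 1, 1)     # hypothetical species abundances
logged_counts  <- c(25, 15, 9, 6, 4, 2, 2, 1, 1)  # hypothetical species abundances
out <- iNEXT::iNEXT(list(girdled = girdled_counts, logged = logged_counts),
                    q = c(0, 1, 2), datatype = "abundance")
iNEXT::ggiNEXT(out, type = 1) # interpolated/extrapolated diversity curves with 95% CIs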

Fig. 3. Diversity profiles for two treatments (forest types: logged and girdled) for all three diversity measures. These diversity profiles indicate interpolated and extrapolated diversity (using a Chao asymptotic estimator) and include their 95% CIs.

Practice

The data we will work with comprise 3 variables: abundance, plot, and elevation. The abundance data contain 20 plant species across 3 elevations, with 20 plots at each elevation for a total of 60 plots. We want to know how plant communities change with elevation. In this case we have 20 response variables, so what can we do? We could run 20 different linear models, but is there a more efficient way? We probably do not want to have to discuss 20 models in a paper. This is where summaries like richness and diversity come in.

community_data <- read.csv("community_data.csv")

There are lots of packages in R to calculate diversity and hill numbers. A few popular ones include:

  • vegan
  • iNEXT
  • SpadeR

We will do it manually to show under-the-hood processes. Below we create 3 functions to calculate richness and diversity manually.

Richness

#The function calculates the total number of unique species. It uses `length()` to count the number of entries in `data` that are not equal to zero (`data != 0`)
richness <- function (data) {
  
  rich <- length(data[data != 0])
  
  return(rich)
}

rich <- apply(community_data[,3:22], 1, richness) # using `apply()` to apply `richness()` (the function we made above) to each row of columns 3 through 22. `rich` is a vector of the number of unique species in each row, in other words, the number of unique species in each plot at each elevation
rich
##  [1]  5  5  5  5  5  5  4  5  5  4  5  4  5  5  5  5  5  5  5  4 20 20 20 20 20
## [26] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 13 12 11 13 14 14 14 14 13 12
## [51] 14 16 14 16 12 14 16 15 14 13

Shannon Diversity

#The function calculates Shannon diversity. Recall that Shannon diversity is calculated as the sum, over species, of the proportion of individuals belonging to the ith species multiplied by the log of that proportion; the sum is then multiplied by -1.
shannon <- function (data) {
  
  p <- data[data != 0]/sum(data)
  shan <- -sum(p * log(p))
  
  return(shan)
}

shan <- apply(community_data[,3:22], 1, shannon) # applying `shannon()` with `apply()`. The arguments indicate we apply `shannon()` to each row (indicated by the `1` in `apply()`) of columns 3 through 22
shan
##  [1] 1.432260 1.419891 1.494874 1.556983 1.452994 1.313304 1.353525 1.520321
##  [9] 1.467236 1.340860 1.409487 1.382378 1.385268 1.510398 1.505855 1.414772
## [17] 1.419642 1.399484 1.428677 1.277404 2.944390 2.954487 2.942325 2.962073
## [25] 2.934473 2.961329 2.952017 2.956611 2.960266 2.951858 2.948055 2.949616
## [33] 2.948878 2.947339 2.946155 2.953208 2.913105 2.974281 2.929710 2.966129
## [41] 1.847007 1.740696 1.595708 1.976775 1.732708 1.902856 1.962446 2.018403
## [49] 1.800655 1.814542 1.814255 2.017998 1.741357 2.010400 1.715785 1.851572
## [57] 2.030285 1.993660 1.815625 1.759718

Simpsons Diversity

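#The function calculates Simpson's diversity (1 - D) using the finite-sample form: 1 - sum(n_i*(n_i - 1)) / (N*(N - 1)), where n_i is the count of the ith species and N is the total count. Note this differs slightly from vegan's default, which uses 1 - sum(p_i^2).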
simpson <- function (data) {
  
  non_zero <- data[data != 0]
  simp <- 1 - ( sum(non_zero * (non_zero - 1)) / (sum(non_zero) * (sum(non_zero)-1)) )
  
  return(simp)
}

simp <- apply(community_data[,3:22], 1, simpson)
simp
##  [1] 0.7633333 0.7521368 0.7807882 0.8030303 0.7581792 0.7037037 0.7736842
##  [8] 0.7977208 0.7647059 0.7517241 0.7405303 0.7721774 0.7462121 0.7899160
## [15] 0.7862069 0.7609195 0.7504456 0.7349206 0.7486772 0.7179487 0.9502350
## [22] 0.9507407 0.9489676 0.9508613 0.9480911 0.9511717 0.9506371 0.9505582
## [29] 0.9507405 0.9510058 0.9499554 0.9512008 0.9507051 0.9494976 0.9498326
## [36] 0.9509920 0.9469448 0.9523810 0.9491603 0.9518905 0.7778846 0.7585921
## [43] 0.7485081 0.8245902 0.7698246 0.7975647 0.8172973 0.8288288 0.7866550
## [50] 0.7931388 0.7809077 0.8161491 0.7682571 0.8178227 0.7639241 0.7922535
## [57] 0.8234707 0.8180857 0.7787975 0.7730849

If we did the same calculations using the vegan package, it would look like the code below. The Shannon values match ours exactly, but the Simpson values differ slightly: vegan's "simpson" index is 1 - sum(p_i^2), whereas our manual function used the finite-sample form 1 - sum(n_i*(n_i - 1)) / (N*(N - 1)).

shan1 <- diversity(community_data[,3:22], index = "shannon") # shannon diversity
shan1
##  [1] 1.432260 1.419891 1.494874 1.556983 1.452994 1.313304 1.353525 1.520321
##  [9] 1.467236 1.340860 1.409487 1.382378 1.385268 1.510398 1.505855 1.414772
## [17] 1.419642 1.399484 1.428677 1.277404 2.944390 2.954487 2.942325 2.962073
## [25] 2.934473 2.961329 2.952017 2.956611 2.960266 2.951858 2.948055 2.949616
## [33] 2.948878 2.947339 2.946155 2.953208 2.913105 2.974281 2.929710 2.966129
## [41] 1.847007 1.740696 1.595708 1.976775 1.732708 1.902856 1.962446 2.018403
## [49] 1.800655 1.814542 1.814255 2.017998 1.741357 2.010400 1.715785 1.851572
## [57] 2.030285 1.993660 1.815625 1.759718
simp1 <- diversity(community_data[,3:22], index = "simpson") # simpson diversity
simp1
##  [1] 0.7328000 0.7242798 0.7538644 0.7786961 0.7382271 0.6776406 0.7350000
##  [8] 0.7681756 0.7428571 0.7266667 0.7180900 0.7480469 0.7235996 0.7673469
## [15] 0.7600000 0.7355556 0.7283737 0.7145062 0.7219388 0.6995398 0.9450986
## [22] 0.9459146 0.9445538 0.9464592 0.9436399 0.9465318 0.9458115 0.9460317
## [29] 0.9462132 0.9460267 0.9453663 0.9456055 0.9455661 0.9450607 0.9451993
## [36] 0.9453979 0.9418262 0.9479921 0.9433013 0.9471548 0.7659172 0.7477551
## [43] 0.7376602 0.8110723 0.7596953 0.7866391 0.8064000 0.8177778 0.7750865
## [50] 0.7805493 0.7700617 0.8044898 0.7585323 0.8057958 0.7543750 0.7812500
## [57] 0.8143210 0.8048907 0.7690625 0.7620408

Next we will combine into a data frame.

community_data1 <- as.data.frame(cbind(elevation=community_data$elevation, plot=community_data$plot, rich, shan, simp))
head(community_data1)
##   elevation  plot rich             shan              simp
## 1      high Plot1    5 1.43225952491776 0.763333333333333
## 2      high Plot2    5 1.41989055660587 0.752136752136752
## 3      high Plot3    5 1.49487353489826 0.780788177339901
## 4      high Plot4    5 1.55698310433322 0.803030303030303
## 5      high Plot5    5 1.45299385525672 0.758179231863442
## 6      high Plot6    5 1.31330418505401 0.703703703703704

Notice we have effectively reduced the number of response variables in our data set from 20 to 3. This is much more manageable for running linear models. Of course, by doing this we change our response variable: instead of being able to say something like "this plant is more abundant at lower elevations," we can say something like "richness and diversity are greater at lower elevations."
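
As a quick illustration (not part of the original lab), one of these summaries can now go straight into a simple model. Note that `cbind()` at the step above coerces the numeric columns to character (you can check with `str(community_data1)`), so the diversity column is converted back to numeric inside the formula.

mod_div <- lm(as.numeric(as.character(shan)) ~ elevation, data = community_data1) # does Shannon diversity differ among elevations?
summary(mod_div)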

Recall that Hill numbers are transformed versions of richness, Simpson's, and Shannon's diversity.

Here I use the hillR package to calculate them.

community_data1$h0 <- hillR::hill_taxa(community_data[,3:22], q = 0) # Hill number of order 0 (richness); `community_data1$h0` adds these values to the community_data1 data frame in a column named `h0`
community_data1$h1 <- hillR::hill_taxa(community_data[,3:22], q = 1) # Hill number of order 1 (the exponential of Shannon diversity), added as a column named `h1`
community_data1$h2 <- hillR::hill_taxa(community_data[,3:22], q = 2) # Hill number of order 2 (inverse Simpson diversity), added as a column named `h2`
community_data1
##    elevation   plot rich             shan              simp h0        h1
## 1       high  Plot1    5 1.43225952491776 0.763333333333333  5  4.188152
## 2       high  Plot2    5 1.41989055660587 0.752136752136752  5  4.136668
## 3       high  Plot3    5 1.49487353489826 0.780788177339901  5  4.458773
## 4       high  Plot4    5 1.55698310433322 0.803030303030303  5  4.744486
## 5       high  Plot5    5 1.45299385525672 0.758179231863442  5  4.275897
## 6       high  Plot6    5 1.31330418505401 0.703703703703704  5  3.718440
## 7       high  Plot7    4 1.35352527060842 0.773684210526316  4  3.871048
## 8       high  Plot8    5 1.52032086608605 0.797720797720798  5  4.573693
## 9       high  Plot9    5 1.46723572803534 0.764705882352941  5  4.337229
## 10      high Plot10    4  1.3408598246035 0.751724137931034  4  3.822329
## 11      high Plot11    5 1.40948707878429 0.740530303030303  5  4.093855
## 12      high Plot12    4 1.38237787447815 0.772177419354839  4  3.984365
## 13      high Plot13    5 1.38526804901328 0.746212121212121  5  3.995897
## 14      high Plot14    5 1.51039790046389 0.789915966386555  5  4.528532
## 15      high Plot15    5 1.50585515670471 0.786206896551724  5  4.508007
## 16      high Plot16    5 1.41477229956808 0.760919540229885  5  4.115549
## 17      high Plot17    5  1.4196424116535 0.750445632798574  5  4.135641
## 18      high Plot18    5 1.39948379083823 0.734920634920635  5  4.053107
## 19      high Plot19    5 1.42867746009952 0.748677248677249  5  4.173176
## 20      high Plot20    4 1.27740385851185 0.717948717948718  4  3.587314
## 21       mid  Plot1   20 2.94439042217718 0.950235017626322 20 18.999077
## 22       mid  Plot2   20 2.95448725799586 0.950740702372319 20 19.191880
## 23       mid  Plot3   20 2.94232486030438  0.94896761573571 20 18.959874
## 24       mid  Plot4   20 2.96207302927355 0.950861326442722 20 19.338019
## 25       mid  Plot5   20 2.93447284146345 0.948091062095845 20 18.811584
## 26       mid  Plot6   20 2.96132945566823  0.95117168818747 20 19.323645
## 27       mid  Plot7   20 2.95201686445789 0.950637107634932 20 19.144527
## 28       mid  Plot8   20 2.95661148975522 0.950558213716108 20 19.232691
## 29       mid  Plot9   20 2.96026645738146 0.950740487582593 20 19.303115
## 30       mid Plot10   20 2.95185811847897 0.951005786718104 20 19.141488
## 31       mid Plot11   20 2.94805471675688 0.949955442990479 20 19.068823
## 32       mid Plot12   20 2.94961580219372 0.951200835363731 20 19.098615
## 33       mid Plot13   20 2.94887845535767 0.950705052878966 20 19.084538
## 34       mid Plot14   20 2.94733926791339 0.949497608705191 20 19.055185
## 35       mid Plot15   20  2.9461552363842 0.949832615973219 20 19.032637
## 36       mid Plot16   20 2.95320835341558 0.950991994430908 20 19.167351
## 37       mid Plot17   20 2.91310478309302 0.946944770857814 20 18.413881
## 38       mid Plot18   20 2.97428063568894 0.952380952380952 20 19.575536
## 39       mid Plot19   20 2.92970977036856 0.949160340464688 20 18.722196
## 40       mid Plot20   20 2.96612870566209 0.951890547263682 20 19.416607
## 41       low  Plot1   13 1.84700663314071 0.777884615384615 13  6.340811
## 42       low  Plot2   12  1.7406962158642 0.758592132505176 12  5.701311
## 43       low  Plot3   11 1.59570783106231 0.748508098891731 11  4.931819
## 44       low  Plot4   13  1.9767750901353 0.824590163934426 13  7.219423
## 45       low  Plot5   14 1.73270786784507 0.769824561403509 14  5.655949
## 46       low  Plot6   14 1.90285557122293 0.797564687975647 14  6.705014
## 47       low  Plot7   14 1.96244640586172 0.817297297297297 14  7.116716
## 48       low  Plot8   14 2.01840286880169 0.828828828828829 14  7.526295
## 49       low  Plot9   13   1.800655271742 0.786654960491659 13  6.053613
## 50       low Plot10   12 1.81454220339747 0.793138760880696 12  6.138265
## 51       low Plot11   14 1.81425468710767 0.780907668231612 14  6.136501
## 52       low Plot12   16 2.01799810593734 0.816149068322981 16  7.523249
## 53       low Plot13   14 1.74135709069995   0.7682570593963 14  5.705080
## 54       low Plot14   16 2.01040015009854 0.817822651448639 16  7.466304
## 55       low Plot15   12 1.71578545470725 0.763924050632911 12  5.561042
## 56       low Plot16   14 1.85157225721411 0.792253521126761 14  6.369827
## 57       low Plot17   16 2.03028453431716 0.823470661672909 16  7.616253
## 58       low Plot18   15  1.9936599240182 0.818085668958223 15  7.342357
## 59       low Plot19   14 1.81562450413645  0.77879746835443 14  6.144913
## 60       low Plot20   13 1.75971795242715 0.773084886128364 13  5.810798
##           h2
## 1   3.742515
## 2   3.626866
## 3   4.062802
## 4   4.518672
## 5   3.820106
## 6   3.102128
## 7   3.773585
## 8   4.313609
## 9   3.888889
## 10  3.658537
## 11  3.547231
## 12  3.968992
## 13  3.617940
## 14  4.298246
## 15  4.166667
## 16  3.781513
## 17  3.681529
## 18  3.502703
## 19  3.596330
## 20  3.328228
## 21 18.214476
## 22 18.489281
## 23 18.035505
## 24 18.677342
## 25 17.743058
## 26 18.702715
## 27 18.454113
## 28 18.529412
## 29 18.591906
## 30 18.527679
## 31 18.303716
## 32 18.384224
## 33 18.370907
## 34 18.201908
## 35 18.247937
## 36 18.314322
## 37 17.189854
## 38 19.227848
## 39 17.637097
## 40 18.923185
## 41  4.271992
## 42  3.964401
## 43  3.811849
## 44  5.293030
## 45  4.161383
## 46  4.686895
## 47  5.165289
## 48  5.487805
## 49  4.446154
## 50  4.556831
## 51  4.348993
## 52  5.114823
## 53  4.141340
## 54  5.149220
## 55  4.071247
## 56  4.571429
## 57  5.385638
## 58  5.125333
## 59  4.330176
## 60  4.202401

We can plot these to examine the difference in simulated diversity at the three elevations. Here we see that diversity is greatest at mid elevations: richness is 20 and the higher q's are around 18, which tells us that diversity there is equivalent to a community containing roughly 18 equally abundant species. Compare this with the low elevations: richness is also fairly high there, but the drop-off between q = 0 and q = 1 tells us that those communities are uneven and dominated by a few species.

library(gridExtra)

plot_dat <- community_data1 %>% 
  dplyr::select(elevation, plot, h0, h1, h2) %>% 
  gather(3:5, key = "var", value = "val") %>% 
  group_by(elevation, var) %>% 
  summarise(med = quantile(val, 0.5), 
            lc = quantile(val, 0.025),
            uc = quantile(val, 0.975))
## `summarise()` has grouped output by 'elevation'. You can override using the
## `.groups` argument.
p1 <- ggplot2::ggplot(data = subset(plot_dat, elevation == "high")) +
  geom_errorbar(aes(x = var, ymin = lc, ymax = uc), width = 0) +
  geom_point(aes(var, med), pch = 21, fill = "gray65") +
  labs(x = "q", y = "Effective number of species") +
  scale_x_discrete(labels = c("0", "1", "2")) +
  ggtitle("High") +
  coord_cartesian(ylim = c(0,20)) +
  theme_classic()

p2 <- ggplot2::ggplot(data = subset(plot_dat, elevation == "mid")) +
  geom_errorbar(aes(x = var, ymin = lc, ymax = uc), width = 0) +
  geom_point(aes(var, med), pch = 21, fill = "gray65") +
  labs(x = "q", y = "Effective number of plants") +
  scale_x_discrete(labels = c("0", "1", "2")) +
  ggtitle("Mid") +
  coord_cartesian(ylim = c(0,20)) +
  theme_classic()

p3 <- ggplot2::ggplot(data = subset(plot_dat, elevation == "low")) +
  geom_errorbar(aes(x = var, ymin = lc, ymax = uc), width = 0) +
  geom_point(aes(var, med), pch = 21, fill = "gray65") +
  labs(x = "q", y = "Effective number of plants") +
  scale_x_discrete(labels = c("0", "1", "2")) +
  ggtitle("Low") +
  coord_cartesian(ylim = c(0,20)) +
  theme_classic()

grid.arrange(p1, p2, p3, ncol=3)

Setting the Stage for Multivariate Predictors: A Quick Review on Multicollinearity

Multicollinearity among variables is common when data sets have many variables, yet an assumption of regression is that the predictor variables are not strongly correlated. Further, when multiple response variables are taken from the same individual, they are not independent of each other. Essentially, if a linear model contains predictors that are too highly correlated, the model will have trouble fitting parameters. Let's simulate some data and take a quick look.

Let’s say we are interested in the physical strain exerted by a protagonist as they flee the evil beings (trollocs, fades, etc.). To do this we measure their running speed, their fear, and their heart rate.

We want to know what drives up their heart rate more: their fear or their running speed? We plot each predictor, fear and running_speed, against heart_rate.
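
The simulation and plotting code are not shown in this document, so here is a minimal sketch with assumed sample size and coefficients; the outputs below come from the lab's original (hidden) simulation and will not match this sketch exactly.

set.seed(1)
n <- 60                                                    # assumed number of observations
running_speed <- runif(n, min = 2, max = 18)               # running speed (e.g. km/h)
fear <- 0.5 * running_speed + rnorm(n, sd = 0.3)           # fear tracks speed closely, so the two are strongly correlated
heart_rate <- 70 + 8.5 * running_speed + rnorm(n, sd = 3)  # heart rate driven mainly by speed

par(mfrow = c(1, 2))
plot(running_speed, heart_rate) # heart rate vs running speed
plot(fear, heart_rate)          # heart rate vs fear
par(mfrow = c(1, 1))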

Here we see that both predictors, fear and running_speed, are linearly related to heart rate. But are those two predictors correlated with each other?

cor(running_speed, fear)
## [1] 0.9907435

Looks like they are. Now we will put these variables into separate linear models and output their coefficient estimates. We are doing this step so that we can compare these coefficients to those from a model that includes both predictors.

mod1 <- glm(heart_rate ~ running_speed) # simple linear model with running speed predicting heart rate
summary(mod1)$coefficients # extracts just the coefficient table from the `summary(mod1)` output. You could also use `coef(mod1)`, but that does not return the standard errors or significance tests, and those will be helpful for making the point about multicollinearity
##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)   70.47677 0.69755439 101.0341 1.401099e-57
## running_speed  8.44568 0.06972009 121.1370 2.388965e-61
mod2 <- glm(heart_rate ~ fear)
summary(mod2)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 72.89309  1.7039722 42.77833 7.008553e-40
## fear        16.50321  0.3425042 48.18397 2.633252e-42

Now, let’s check out the coefficients for the model that considers fear and running_speed together as predictors in the model.

mod3 <- glm(heart_rate ~ running_speed + fear)
summary(mod3)$coefficients
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)   70.5127973  0.7045466 100.0825126 1.968991e-56
## running_speed  8.1303935  0.5169581  15.7273750 2.375045e-20
## fear           0.6272086  1.0188815   0.6155854 5.411378e-01

Here we see the problem with collinearity. The relationship between fear and heart rate disappears, and the confidence in both estimates (judging from the standard errors and p-values) is much lower. Multicollinearity can cause small changes in the model to have drastic impacts on the coefficient estimates. So what does this have to do with multivariate data sets? If we have data sets with many predictor variables, some of those predictors are bound to be correlated. Also, all of the models we have run so far have had one response variable, but what if we have more?
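
A common diagnostic for this problem (not used in the original lab) is the variance inflation factor, available for example from the car package:

# install.packages("car") # car is not loaded at the top of this lab
car::vif(mod3) # VIF values well above roughly 5-10 flag problematic collinearity among predictors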

Multivariate Analyses: PCA, Factor Analysis, RDA

Matrix algebra is central to understanding the math behind multivariate analyses. If you are not yet familiar with it, it is highly recommended that you take time to read/watch the recommended resources to get a true understanding of these methods. I give an in-depth example for MANOVA using Excel to explicitly show the matrix math involved, but I do not go into that depth for the others. Helpful review includes:

Variance, Covariance, and Correlation Matrices It is no surprise that if I am interested in modeling two or more response variables, or in reducing multiple variables into a representative component, a natural step is to ask how, mathematically, those variables covary with each other. Click here for a review of the concept of covariance. The structure of variance, covariance, and correlation among multiple variables in a multivariate dataset is a major component of the under-the-hood math involved in many multivariate statistics. Three matrices that you will read about repeatedly when conducting such analyses are:

  1. covariance matrix (also called the variance-covariance matrix, variance matrix) (Fig. 4A)
  2. matrix of sample standard deviations (Fig.4B )
  3. matrix of sample correlations (Fig.4C)
Fig. 4. Outlining components of a covariance, standard deviation, and sample correlation matrix
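
As a quick illustration (not part of the original lab), all three matrices can be computed from any numeric data frame with base R; here the built-in mtcars data stand in for a multivariate data set.

X <- mtcars[, c("mpg", "hp", "wt")] # any numeric multivariate data
cov(X)                              # 1) variance-covariance matrix (variances on the diagonal)
sqrt(diag(cov(X)))                  # 2) sample standard deviations
cor(X)                              # 3) sample correlation matrix
cov2cor(cov(X))                     # the same correlations, derived directly from the covariance matrix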

Types of multivariate analyses

Gotelli & Ellison group multivariate analyses into 4 categories. While I think factor analysis fits better in a lesson on structural equation modeling, it is mentioned here under ordination, because it can be used, like PCA, as a data reduction technique.

  1. MANOVA (multivariate ANOVA): Compares multivariate means. Similar to the ANOVA design, where you compute AMONG and WITHIN group variance, MANOVA extends this to cases where you have multiple response variables (and, the predictor is categorical)-hence, matrix algebra is necessary.
  2. Ordination: Creates new variables called principal axes along which samples are ordered or scored. These effectively reduce the dimensionality of the data set (i.e. they are data reduction techniques). Ordination methods include: Principal Component Analysis (PCA), Factor Analysis, Principal Coordinates Analysis, Correspondence Analysis, and Non-metric Multidimensional Scaling. These methods are useful for data exploration and pattern recognition. The resulting biplots (graphs that plot samples along the two principal axes) allow us to explore relationships among environmental variables and species assemblages. Additionally, the scores or factor loadings can be used as response or predictor variables in a model.
  3. Classification: Grouping similar objects into classes that are distinguishable from neighboring classes. Classification methods include: Cluster Analysis and Discriminant Analysis.
  4. Multivariate multiple regression: Regression with multiple response variables. These methods include: redundancy analysis (RDA) and canonical correspondence analysis (CCA).

MANOVA

Mechanics of MANOVA

If you are new to MANOVA, I recommend first watching this video.

Read in the data. I chose the same pitcher plant dataset referenced in Chapter 12 of A Primer of Ecological Statistics (Gotelli and Ellison 2012). The dataset has 10 morphological measurements of the pitcher plant (Darlingtonia californica) at 4 sites. The four sites will be treated as our categorical predictor variable. Later we will subset the 10 variables to the 7 variables referenced in the chapter. By using a MANOVA we are evaluating whether the morphological measurements for Darlingtonia differ significantly among sites.

Fig. 5. Null hypotheses for the pitcher plant example and a picture of Darlingtonia californica along with the measurements taken

dat<-read.csv("manova_example.csv", header=TRUE)
head(dat)
##   site height mouth.diam tube.diam keel.diam wing1.length wing2.length
## 1  TJH    654       38.4      16.6       6.4           85           76
## 2  TJH    413       22.2      17.2       5.9           55           26
## 3  TJH    610       31.2      19.9       6.7           62           60
## 4  TJH    546       34.4      20.8       6.3           84           79
## 5  TJH    665       30.5      20.4       6.6           60           51
## 6  TJH    665       33.6      19.5       6.6           84           66
##   wingspread hoodarea wingarea tubearea
## 1         55    63.77    33.65    87.15
## 2         60    21.10     7.36    44.78
## 3         78    28.47    15.75    56.64
## 4         95    48.82    30.47    76.31
## 5         30    29.48    11.33   100.22
## 6         82    55.67    27.54   106.12
dat_long <- gather(dat, variable, value, height:tubearea, factor_key=TRUE) # `gather()` changes data from a wide format (each variable is a column) to a long format (each variable name becomes a row entry in a new column called "variable")
head(dat_long)
##   site variable value
## 1  TJH   height   654
## 2  TJH   height   413
## 3  TJH   height   610
## 4  TJH   height   546
## 5  TJH   height   665
## 6  TJH   height   665
p1 <- ggplot(dat, aes(x = site, y = height, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")
p2 <- ggplot(dat, aes(x = site, y = mouth.diam, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")
p3 <- ggplot(dat, aes(x = site, y = tube.diam, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")
p4 <- ggplot(dat, aes(x = site, y = keel.diam, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")
p5 <- ggplot(dat, aes(x = site, y = wing1.length, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")
p6 <- ggplot(dat, aes(x = site, y = wing2.length, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")
p7 <- ggplot(dat, aes(x = site, y = wingspread, fill = site)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2) + theme(legend.position="none")

grid.arrange(p1, p2, p3,p4,p5,p6,p7, ncol=4)

Assumptions and preliminary tests

MANOVA makes a number of assumptions about the data (listed at the end of this section under "Considerations for MANOVA"). This section is not focused on meeting those assumptions; instead, I focus on comparing the process of conducting a MANOVA in R and by hand.

dat_long %>%
  group_by(variable) %>%
  summarise(N = n())
## # A tibble: 10 × 2
##    variable         N
##    <fct>        <int>
##  1 height          84
##  2 mouth.diam      84
##  3 tube.diam       84
##  4 keel.diam       84
##  5 wing1.length    84
##  6 wing2.length    84
##  7 wingspread      84
##  8 hoodarea        84
##  9 wingarea        84
## 10 tubearea        84

There are 84 observations per variable.

Evaluate Multivariate normality

The null hypothesis of the Doornik-Hansen test for multivariate normality is that the variables are multivariate normal (Doornik and Hansen 2008).

vars <- as.matrix(dat[2:8]) # make a matrix of the response (dependent) variables of interest. I chose the variables in columns 2 through 8 of `dat` for consistency with the example in Gotelli & Ellison chapter 12. Note that `manova()`, which we use below, requires the dependent variables to be in a matrix of their own; `vars` would be a data frame if I did not use `as.matrix()`
head(vars)
##      height mouth.diam tube.diam keel.diam wing1.length wing2.length wingspread
## [1,]    654       38.4      16.6       6.4           85           76         55
## [2,]    413       22.2      17.2       5.9           55           26         60
## [3,]    610       31.2      19.9       6.7           62           60         78
## [4,]    546       34.4      20.8       6.3           84           79         95
## [5,]    665       30.5      20.4       6.6           60           51         30
## [6,]    665       33.6      19.5       6.6           84           66         82
DH.test(vars, Y.names = NULL) #Doornik-Hansen test for multivariate normality
## $multi
##     E df P(Chi > E)
## 1 NaN 14        NaN
## 
## $univ
##      E df P(Chi > E)
## Y1 NaN  2        NaN
## Y2 NaN  2        NaN
## Y3 NaN  2        NaN
## Y4 NaN  2        NaN
## Y5 NaN  2        NaN
## Y6 NaN  2        NaN
## Y7 NaN  2        NaN

The data do not meet assumptions of multivariate normality. We could consider running MANOVA on the data after transforming the outcome variables. You can also perform the test regardless as MANOVA is fairly robust to deviations from normality.

Evaluate Multicollinearity

cor(vars)
##                  height  mouth.diam     tube.diam  keel.diam  wing1.length
## height        1.0000000  0.63006322  1.571517e-01 -0.1421400  3.362256e-01
## mouth.diam    0.6300632  1.00000000 -4.686177e-02 -0.3917945  5.729505e-01
## tube.diam     0.1571517 -0.04686177  1.000000e+00  0.5169862 -6.306916e-05
## keel.diam    -0.1421400 -0.39179448  5.169862e-01  1.0000000 -3.358236e-01
## wing1.length  0.3362256  0.57295054 -6.306916e-05 -0.3358236  1.000000e+00
## wing2.length  0.3173115  0.43080463  8.470713e-02 -0.2744624  8.214029e-01
## wingspread    0.1718566  0.24136594  2.038861e-01 -0.1902531  6.149062e-01
##              wing2.length wingspread
## height         0.31731146  0.1718566
## mouth.diam     0.43080463  0.2413659
## tube.diam      0.08470713  0.2038861
## keel.diam     -0.27446240 -0.1902531
## wing1.length   0.82140285  0.6149062
## wing2.length   1.00000000  0.6999089
## wingspread     0.69990888  1.0000000

Wing lengths 1 and 2 are highly correlated (r is about 0.82).

Run MANOVA function

The manova() function accepts a formula argument with the dependent variables formatted as a matrix and the grouping factor on the right of the ~. In milliseconds manova() is doing a bunch of matrix math that is essentially summarizing variance-covariance matrices. These matrices effectively summarize within and among group variance in the multivariate data.

pitcher.manova<-manova(vars ~ dat$site) # run the  model
pitcher.manova
## Call:
##    manova(vars ~ dat$site)
## 
## Terms:
##                 dat$site Residuals
## height           79790.1  655148.8
## mouth.diam        1187.2    2080.3
## tube.diam          215.7     566.6
## keel.diam          113.2     256.5
## wing1.length     11670.9   27697.9
## wing2.length      6949.2   36111.2
## wingspread       20489.7   80820.2
## Deg. of Freedom        3        80
## 
## Residual standard errors: 90.49508 5.099353 2.661245 1.790527 18.60709 21.24595 31.78447
## Estimated effects may be unbalanced

The meaning of the terms will be clearer if you go through MANOVA.xlsx, which uses the same dataset. There I work out the example "by hand", at least stepwise, in Excel.

The dat$site values are analogous to the values along the diagonal of the hypothesis matrix (H). The hypothesis matrix is discussed in more detail in MANOVA.xlsx. The hypothesis matrix, more generally referred to as a sums of squares & cross products matrix, is a variance-covariance-style matrix that essentially shows how the group means vary around the overall mean. This is analogous to "among-group variance" in ANOVA.

The Residuals values are the diagonal, or variances, of the error matrix (E). The error matrix is a variance-covariance (sums of squares & cross products) matrix that represents within-group variance.

Together, the H and E matrices are used to generate test statistics (e.g. Wilks' lambda, Pillai's trace, Hotelling-Lawley trace, Roy's greatest root) that help us evaluate whether there is a significant difference. All of them essentially evaluate the ratio of the error matrix (E) to the total variance (E + H), so an approximate F statistic can be derived from each. Below we see that the default test statistic used by manova() is Pillai's trace, which is noted for being the most robust to violations of the MANOVA assumptions (e.g. multivariate normality).

pitcher.manova<-summary(manova(vars ~ dat$site))# get a summary of the model 
pitcher.manova
##           Df Pillai approx F num Df den Df    Pr(>F)    
## dat$site   3 1.1156    6.428     21    228 4.273e-14 ***
## Residuals 80                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output indicates there are significant differences in pitcher plant measurements among the sites.

In the code below we can extract the Hypothesis Matrix (H) and Error matrix (E) after running the summary function. These values will match the values in the MANOVA.xlsx file

pitcher.manova$SS
## $`dat$site`
##                 height mouth.diam   tube.diam   keel.diam wing1.length
## height       79790.128  8531.8580 -1200.14200 -2791.22071   26539.6879
## mouth.diam    8531.858  1187.1745  -352.02552  -353.65717    2812.9177
## tube.diam    -1200.142  -352.0255   215.67971    95.29806    -589.2937
## keel.diam    -2791.221  -353.6572    95.29806   113.24999   -1036.7754
## wing1.length 26539.688  2812.9177  -589.29370 -1036.77541   11670.9094
## wing2.length 16569.670  1450.2677  -208.49824  -634.11924    8523.6188
## wingspread   26676.726  1077.1965  1003.14423  -598.47513    9579.4701
##              wing2.length wingspread
## height         16569.6700 26676.7264
## mouth.diam      1450.2677  1077.1965
## tube.diam       -208.4982  1003.1442
## keel.diam       -634.1192  -598.4751
## wing1.length    8523.6188  9579.4701
## wing2.length    6949.1870  8188.8191
## wingspread      8188.8191 20489.7484
## 
## $Residuals
##                  height  mouth.diam tube.diam  keel.diam wing1.length
## height       655148.765 22343.64200 4968.2170  448.16000   30651.9550
## mouth.diam    22343.642  2080.27218  277.1055  -76.97283    3685.3490
## tube.diam      4968.217   277.10552  566.5778  182.73444     588.9437
## keel.diam       448.160   -76.97283  182.7344  256.47894    -244.4603
## wing1.length  30651.955  3685.34900  588.9437 -244.46030   27697.9002
## wing2.length  39878.580  3659.76567  700.1232 -461.00576   25296.2145
## wingspread    20217.345  3314.23683  811.9058 -565.91773   29254.4347
##              wing2.length wingspread
## height         39878.5800 20217.3450
## mouth.diam      3659.7657  3314.2368
## tube.diam        700.1232   811.9058
## keel.diam       -461.0058  -565.9177
## wing1.length   25296.2145 29254.4347
## wing2.length   36111.2297 38039.3476
## wingspread     38039.3476 80820.2039
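
To connect these matrices to the test statistics discussed above, here is a minimal sketch (not in the original lab or the xlsx file) that computes Pillai's trace and Wilks' lambda directly from H and E:

H <- pitcher.manova$SS$`dat$site` # hypothesis (among-group) SSCP matrix
E <- pitcher.manova$SS$Residuals  # error (within-group) SSCP matrix

sum(diag(H %*% solve(H + E)))     # Pillai's trace; should match the value reported above
det(E) / det(H + E)               # Wilks' lambda: within-group to total "generalized variance"

summary(manova(vars ~ dat$site), test = "Wilks") # built-in Wilks test for comparison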

We can also explore each variable separately, the "univariate results".

summary(aov(vars ~ dat$site))
##  Response height :
##             Df Sum Sq Mean Sq F value  Pr(>F)  
## dat$site     3  79790 26596.7  3.2477 0.02614 *
## Residuals   80 655149  8189.4                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response mouth.diam :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## dat$site     3 1187.2  395.72  15.218 6.355e-08 ***
## Residuals   80 2080.3   26.00                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response tube.diam :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## dat$site     3 215.68  71.893  10.151 9.715e-06 ***
## Residuals   80 566.58   7.082                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response keel.diam :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## dat$site     3 113.25  37.750  11.775 1.815e-06 ***
## Residuals   80 256.48   3.206                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response wing1.length :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## dat$site     3  11671  3890.3  11.236 3.143e-06 ***
## Residuals   80  27698   346.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response wing2.length :
##             Df Sum Sq Mean Sq F value   Pr(>F)   
## dat$site     3   6949 2316.40  5.1317 0.002687 **
## Residuals   80  36111  451.39                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response wingspread :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## dat$site     3  20490  6829.9  6.7606 0.0004025 ***
## Residuals   80  80820  1010.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output tells us there are significant differences in the means for ALL of the morphological variables among the 4 sites.

Considerations for MANOVA:

MANOVA has multiple assumptions to consider, which may limit its applicability:

  1. Observations are independent and randomly sampled.
  2. Within-group errors are equal among groups and normally distributed.
  3. Covariances are equal among groups.
  4. Errors of the multivariate variables must conform to a multivariate normal distribution.

If residuals are not multivariate normal (refer to Gotelli & Ellison 2012 pg. 394-98 for a helpful discussion on how to evaluate multivariate normality), then PERMANOVA (permutational multivariate ANOVA) is a non-parametric alternative to MANOVA.

PerMANOVA (permutational analysis of variance), aka distance-based MANOVA, aka non-parametric MANOVA

Click here for a more detailed discussion of PERMANOVA.

This method has earned itself many names and is most commonly referred to as PerMANOVA; each name emphasizes a different attribute of the design. Like ANOVA, MANOVA, and regression, this method tests for differences in a measure of centrality (here, centroids) and spread (here, dispersion). The method was developed relatively recently, after Anderson (2001) showed that the sums of squares could be calculated directly from distances among data points, rather than from distances between the data points and the mean (as in ANOVA and MANOVA).

Fig. 7. Comparison of how between- and within-group distance is calculated in ANOVA vs distance-based MANOVA

So PerMANOVA extends the same logic as ANOVA, MANOVA, and regression by partitioning variance using between- and within-group sums of squares, but unique to this method is the use of distance measures, hence the other name, distance-based MANOVA. There is a plethora of distance and dissimilarity measures that can be applied (Fig. 8) and we won't go into detail about each here; each one works better for different types of data. For example, we will use Bray-Curtis, which is commonly used in ecology because it handles zero-inflated data (e.g. species abundance data) well. If you want more background, refer to pg. 402-406 in Gotelli & Ellison 2002 or some worked examples here.

Fig. 8. Distance-Based/Dissimilarity Measures (Gotelli & Ellison 2002)

Another difference, which earns the method its other identifier as permutational MANOVA, is that "significance" is evaluated using permutation. Permutation is also the reason this technique relaxes the strict assumptions of MANOVA: rather than assuming any underlying distribution, permutation constructs a null distribution. Generally speaking, a permutation is one possible rearrangement of the objects into a linear sequence or order. As it applies to PerMANOVA, permuting the data means shuffling the observations within a variable (Figure 9). Permuting the data repeatedly over many iterations allows you to re-evaluate the data and output the desired test statistic for each iteration. The number of iterations (permutations) is up to you and is specified in the function arguments; if you specified 1000 iterations, you would get a distribution of 1000 test statistics (the default is typically 999). In the case of PerMANOVA, a pseudo F-statistic is calculated for each permutation, and you compare your observed F-statistic to those pseudo F-statistics to evaluate significance. If the null hypothesis is true and the groups do not differ, then shuffling the labels should be meaningless. The null hypothesis being tested in PerMANOVA is that the centroids and dispersion of all groups are the same. This is similar to the null hypothesis of ANOVA, where means and variances are assumed to be equal, except that in multivariate space the mean is analogous to the centroid and the spread is analogous to dispersion. Rejecting this hypothesis indicates that the groups are different; the proportion of F-values in the null distribution that are more extreme than the observed F-statistic gives the p-value that enables you to reject (or fail to reject) this hypothesis.

Fig. 9. PerMANOVA permutes the raw data. This provides a visual example of what permuted data look like.
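
To make the idea of a permutation concrete, here is a toy sketch (not from the lab) of a single permutation step using made-up data: the response values are shuffled relative to the fixed group labels, which is equivalent to shuffling the labels themselves. Repeating this many times and recomputing the test statistic each time builds the null distribution.

set.seed(10)
labels   <- rep(c("A", "B"), each = 5)            # two hypothetical groups of 5 observations
response <- rnorm(10)                             # some made-up response values
permuted <- sample(response)                      # one permutation: shuffle responses across observations
data.frame(labels, observed = response, permuted) # compare original and permuted arrangements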

We will continue with the pitcher plant dataset used in the MANOVA example. In this case we are testing the null hypothesis that morphological measurements are the same at all sites.

Here is an overview of the steps of a PerMANOVA that we will follow in the script; click here for a 15-minute overview.


Now, adonis() enables us to do all three of those steps in one line of code. (Note: in newer versions of the vegan package, adonis() has been superseded by adonis2(), which takes the same kind of formula.)
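
For reference (not in the original lab), the first of those steps on its own, building the Bray-Curtis dissimilarity matrix from the raw measurements, looks like this with vegan's vegdist():

bray_mat <- vegdist(vars, method = "bray") # pairwise Bray-Curtis dissimilarities among the 84 plants (rows)
as.matrix(bray_mat)[1:5, 1:5]              # peek at the first few pairwise dissimilarities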

set.seed(355)
group<-dat$site
adon.results<-adonis(vars ~ group, method="bray",perm=999) #By using this function we are conducting three steps in one: 1) making the dissimilarity matrix from the raw data `vars`, 2) Calculating the within and between variance of the dissimilarity matrix to get the observed F-statistic 3) Permuting the raw data 999 times to get the distribution of the pseudo F statistics for the significance test.
adon.results
## 
## Call:
## adonis(formula = vars ~ group, permutations = 999, method = "bray") 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs  MeanSqs F.Model      R2 Pr(>F)   
## group      3   0.10452 0.034840  4.9387 0.15626  0.002 **
## Residuals 80   0.56436 0.007055         0.84374          
## Total     83   0.66888                  1.00000          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see there are ‘significant’ differences in the centroids among the sites (which matches our results for the MANOVA). It is interesting to consider that despite our data not meeting the assumptions of a MANOVA, the results of the MANOVA were robust. We confirmed with a non-parametric analysis (one that doesn’t assume a distribution) that the groups differ in their centroids.

Interpretation of outputs:

Let's make a visual to go with our perMANOVA results. For an ANOVA you would show a bar chart or box plot of the single response variable; the equivalent for a MANOVA or perMANOVA is an ordination.

dispr <- betadisper(vegdist(vars), group) #This function calculates the values for dispersion in multivariate space so we can plot on an ordination plot (PCoA)
## Warning in betadisper(vegdist(vars), group): some squared distances are negative
## and changed to zero
dispr
## 
##  Homogeneity of multivariate dispersions
## 
## Call: betadisper(d = vegdist(vars), group = group)
## 
## No. of Positive Eigenvalues: 33
## No. of Negative Eigenvalues: 50
## 
## Average distance to median:
##      DG      HD     LEH     TJH 
## 0.07485 0.06006 0.07509 0.07042 
## 
## Eigenvalues for PCoA axes:
## (Showing 8 of 83 eigenvalues)
##   PCoA1   PCoA2   PCoA3   PCoA4   PCoA5   PCoA6   PCoA7   PCoA8 
## 0.44792 0.17012 0.04984 0.03413 0.02017 0.01877 0.01453 0.01324
plot(dispr, main = "Ordination Centroids and Dispersion", sub = "")

This ordination plot is the analogue of an ANOVA boxplot, which illustrates a mean and its variance; for a perMANOVA, the plot shows the group (site) centroids and their dispersion.

boxplot(dispr, main = "", xlab = "")

We can see in this plot that dispersion may be greater in LEH or DG, but TJH has some outliers that increase its spread. An assumption of the perMANOVA is that dispersion is homogeneous among groups.

Next we run a significance test evaluating whether dispersion is homogeneous across all groups ("homogeneity of multivariate dispersions"). This is important because we want to verify that the results of adonis are not confounded with dispersion: if we want to claim that between-group differences are significant, we should make sure this is not an artifact of within-group dispersion. When dispersions are quite different, adonis may return a significant p-value, but the result may be heavily influenced by differences in composition within groups rather than differences in composition between groups.

permutest(dispr) # permutation test for the homogeneity of dispersion
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df   Sum Sq    Mean Sq      F N.Perm Pr(>F)
## Groups     3 0.002052 0.00068409 0.4002    999  0.761
## Residuals 80 0.136764 0.00170955

Results indicate that dispersion is similar among all the groups.

Ordination

Review of the Mechanics of Ordination:

Ordination is a wonderful approach for getting a grasp on multivariate datasets. These approaches can be used to identify interesting patterns in data or to reduce the number of dimensions. These approaches are the standard in genetic and chemistry analyses which can have thousands of variables. Here, we focus on identifying variables that are correlated with each other to create new axes or factors that represent both variables.

For this part we will work with the ordination.csv data in the folder.

ord_dat <- read.csv("ordination.csv")

You can see that we have trait data for three plant genera: Piper, Miconia, and Psychotria. Can we figure out which traits are associated with which genus?

Principal component analysis (PCA)

This ordination technique builds new axes called "components" based on the variation that exists in the data. In Gotelli and Ellison's words, PCA performs data reduction by creating "new variables as linear combinations of the original variables." In this dataset we have 11 predictor variables and we want to associate them with plant genera. With 11 variables, our data theoretically exist in 11-dimensional space. A PCA identifies the axis in that space along which the most variation exists, which becomes PC1. PC2 is the next axis of most variation that is orthogonal to PC1. This continues until as many PCs have been drawn as there are variables. Often PC1 and PC2 explain a high proportion of the variation in the data. PCA Review

Let’s perform a PCA on these data. You’ll notice that since we are interested in identifying the variation in our 11 predictors we only perform the PCA on those (columns 2 through 12).

Examining the PCA using summary() gives us the variance explained by each principal component. This is important because if a few PCs explain most of the variation, we can essentially plot what was 11-dimensional data in 2- or 3-dimensional space. This seems magical, but of course it isn't: it works because our variables are correlated, so what seemed like 11-dimensional data really isn't. Some of our variables are redundant.

pca1 <- princomp(ord_dat[,2:12]) # Makes a pca object called `pca1`. First, it takes columns 2 through 12 in the data set `ord_dat` and calculates the attributes of the principal components (i.e. loadings, scores, etc.).

#use `pca1$` or `str()` to explore all the attributes our `pca1` object. We explore these more below.
pca1$loadings # loadings of each variable on the principal component (aka. eigenvectors). Shows to what magnitude that variable loads on the component. Also returns information on the variance explained by each principal component axis.
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## height     0.832  0.542                                                 
## width      0.535 -0.837                                                 
## leaves     0.135        -0.668 -0.662 -0.266        -0.109              
## veins                                        -0.449  0.116  0.728 -0.309
## leflength                0.122  0.213 -0.844  0.180               -0.376
## lefwidth                        0.165 -0.421 -0.232 -0.184  0.307  0.619
## fruits                  -0.720  0.682                                   
## trichomes                                    -0.510  0.730 -0.294       
## odor                                          0.481  0.496  0.482  0.245
## thickness                             -0.110 -0.252        -0.212  0.544
## taste                          -0.110         0.391  0.380         0.129
##           Comp.10 Comp.11
## height                   
## width                    
## leaves                   
## veins     -0.239   0.311 
## leflength -0.204         
## lefwidth   0.472         
## fruits                   
## trichomes  0.285  -0.162 
## odor      -0.110  -0.461 
## thickness -0.753         
## taste      0.127   0.805 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091
## Cumulative Var  0.091  0.182  0.273  0.364  0.455  0.545  0.636  0.727  0.818
##                Comp.10 Comp.11
## SS loadings      1.000   1.000
## Proportion Var   0.091   0.091
## Cumulative Var   0.909   1.000
head(pca1$scores) #These represent the projection onto the PC, if you take the scores of the first two principal components, you will have the coordinates of the sample points for a biplot/pca graph
##          Comp.1     Comp.2    Comp.3      Comp.4    Comp.5     Comp.6
## [1,] -67.769814 -22.197014 -9.814909 -16.9151054 -3.669413 -0.8720094
## [2,]  -2.916698   2.952394  3.858568   0.3547213 -1.038405 -0.4547091
## [3,] -69.196715  18.416240 -3.146421  -5.9515717 -4.015966  4.4385313
## [4,] -51.882182  29.695378 -8.777544 -14.2312159  2.537111 -1.4123881
## [5,] -63.120452  24.059685  1.431729  -3.4769059  4.399495 -0.0166952
## [6,] -11.963417 -33.917056 -7.141689 -16.6132591 -6.507611  2.7650920
##          Comp.7      Comp.8      Comp.9    Comp.10     Comp.11
## [1,] -0.8774505  6.06190417 -3.54879432 -0.7271848  0.04562485
## [2,]  0.2245316  6.51637644 -0.48398837 -0.3301458  3.09551773
## [3,]  0.6641767  0.25506426 -0.03406119 -0.4564666  0.16903850
## [4,]  3.3136672  2.71327452  0.88067679  0.2266849  0.82767673
## [5,] -0.6937256 -1.54312874  2.46771984 -3.3447145  0.39224696
## [6,] -3.3672291 -0.07923123 -1.42332925 -0.3139246 -1.73332922
summary(pca1) # The summary function applied to the pca object returns the proportion of variance explained by each principal component.
## Importance of components:
##                            Comp.1      Comp.2      Comp.3      Comp.4
## Standard deviation     93.1187860 26.79029797 16.63195700 9.480983358
## Proportion of Variance  0.8830104  0.07308804  0.02816942 0.009153727
## Cumulative Proportion   0.8830104  0.95609843  0.98426785 0.993421577
##                             Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     6.535351873 2.4953026038 2.4450309067 1.9440314630
## Proportion of Variance 0.004349398 0.0006340706 0.0006087793 0.0003848556
## Cumulative Proportion  0.997770975 0.9984050454 0.9990138248 0.9993986803
##                              Comp.9      Comp.10      Comp.11
## Standard deviation     1.5494775989 1.3837163363 1.2607024551
## Proportion of Variance 0.0002444904 0.0001949779 0.0001618514
## Cumulative Proportion  0.9996431707 0.9998381486 1.0000000000

A scree plot is a great way to visualize how much of the variation in the data each principal component represents.

screeplot(pca1)

The output from the PCA also includes “scores.” Scores tell us where each of our observations falls along each of these axes in ordination space. Since PC1 and PC2 explain most of the variation (> 95%), plotting PC1 against PC2 will give us a good picture of the trait space. First I will make a new data frame for plotting, then plot it!

plot_dat <- data.frame(genus = ord_dat$genus, 
                       PC1 = pca1$scores[,1],
                       PC2 = pca1$scores[,2])

ggplot(data = plot_dat) +
  geom_point(aes(PC1, PC2, fill = genus), pch = 21, color = "black") +
  scale_fill_manual(values = wesanderson::wes_palette("Moonrise3")) +
  theme_classic()

Often PCA plots look like this. PC1 and PC2 represent most of the variation in plant trait space, and we color our points by another variable we are interested in (and that was not used in the ordination) to see whether the observations cluster by that variable. Here we see a fair amount of overlap among genera, but it does appear that Psychotria has more variation on PC1 (which is the most important axis given that it represents 88% of the variation in our traits). But what does that mean? What is PC1?

This brings up the trade-off of ordination techniques. Representing many different traits as components sacrifices interpretability. It is easy to say that plant A and plant B differ in size, but what does it mean if they differ in PC1? We can get to the bottom of this! PCAs also output “loadings”, which tell us how much a variable “loads” onto a PC. The greater the loading, the more that PC represents that variable.

loadings(pca1)
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## height     0.832  0.542                                                 
## width      0.535 -0.837                                                 
## leaves     0.135        -0.668 -0.662 -0.266        -0.109              
## veins                                        -0.449  0.116  0.728 -0.309
## leflength                0.122  0.213 -0.844  0.180               -0.376
## lefwidth                        0.165 -0.421 -0.232 -0.184  0.307  0.619
## fruits                  -0.720  0.682                                   
## trichomes                                    -0.510  0.730 -0.294       
## odor                                          0.481  0.496  0.482  0.245
## thickness                             -0.110 -0.252        -0.212  0.544
## taste                          -0.110         0.391  0.380         0.129
##           Comp.10 Comp.11
## height                   
## width                    
## leaves                   
## veins     -0.239   0.311 
## leflength -0.204         
## lefwidth   0.472         
## fruits                   
## trichomes  0.285  -0.162 
## odor      -0.110  -0.461 
## thickness -0.753         
## taste      0.127   0.805 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091
## Cumulative Var  0.091  0.182  0.273  0.364  0.455  0.545  0.636  0.727  0.818
##                Comp.10 Comp.11
## SS loadings      1.000   1.000
## Proportion Var   0.091   0.091
## Cumulative Var   0.909   1.000

This shows us that PC1 and PC2 are primarily composed of height and width, which tells us that the main variation in the data is related to plant size. If we use PC1 as a predictor variable in a further analysis, we know that we are predicting our response variable using plant size.

In practice we do this using the scores: each observation gets a “score,” which is its position along each PC. First we make a new data set that contains only the PCs, and then let’s run some quick models.

model_data <- data.frame(genus = ord_dat$genus, 
                         PC1 = pca1$scores[,1],
                         PC2 = pca1$scores[,2],
                         PC3 = pca1$scores[,3]) #data frame compiling the pca scores

mod1 <- glm(PC1 ~ genus, data = model_data)
summary(mod1)
## 
## Call:
## glm(formula = PC1 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -153.79   -48.97   -14.61    56.80   317.32  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -44.41      21.38  -2.078  0.04348 * 
## genusPiper         31.33      30.23   1.036  0.30564   
## genusPsychotria   101.91      30.23   3.371  0.00155 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7311.47)
## 
##     Null deviance: 416213  on 47  degrees of freedom
## Residual deviance: 329016  on 45  degrees of freedom
## AIC: 568.19
## 
## Number of Fisher Scoring iterations: 2
mod2 <- glm(PC2 ~ genus, data = model_data)
summary(mod2)
## 
## Call:
## glm(formula = PC2 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -51.456  -16.014   -2.985   17.004   83.437  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -5.342      6.846  -0.780    0.439
## genusPiper         7.134      9.681   0.737    0.465
## genusPsychotria    8.890      9.681   0.918    0.363
## 
## (Dispersion parameter for gaussian family taken to be 749.8024)
## 
##     Null deviance: 34451  on 47  degrees of freedom
## Residual deviance: 33741  on 45  degrees of freedom
## AIC: 458.87
## 
## Number of Fisher Scoring iterations: 2
mod3 <- glm(PC3 ~ genus, data = model_data)
summary(mod3)
## 
## Call:
## glm(formula = PC3 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -72.917   -4.583    1.491    6.305   22.921  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.975      3.519   3.119  0.00317 ** 
## genusPiper        -9.685      4.977  -1.946  0.05792 .  
## genusPsychotria  -23.240      4.977  -4.669 2.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 198.161)
## 
##     Null deviance: 13277.9  on 47  degrees of freedom
## Residual deviance:  8917.2  on 45  degrees of freedom
## AIC: 395
## 
## Number of Fisher Scoring iterations: 2

Effect sizes become trickier with PCs because the scale of each PC reflects the variation in the original data; we can re-scale the scores if we want comparable coefficients. Also, just because PC1 represents the most variation in the data does not mean it is the most predictive. Remember that these are summaries of the original variables, so if those original variables are not very predictive, the PCs won’t be either. Here PC1 and PC2 are aspects of plant size, which is actually not that predictive of differences among genera. PC3 is fruits and leaves, which are far more predictive. Compare mod3 to mod1 and mod2.
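Here is a minimal sketch of that re-scaling, using the model_data data frame built above (model_data_std and mod1_std are just illustrative names). Standardizing the scores puts the coefficients in standard-deviation units of each PC, which makes effect sizes easier to compare across PCs.

model_data_std <- model_data
model_data_std[, c("PC1", "PC2", "PC3")] <- scale(model_data_std[, c("PC1", "PC2", "PC3")]) # center and scale each PC score
mod1_std <- glm(PC1 ~ genus, data = model_data_std)
summary(mod1_std) # same t- and p-values as mod1; only the scale of the estimates changes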

We can also look at the biplot. This gives us a visual representation of the loadings and scores. Here the numbers are the individual observations, and the arrows represent the direction and strength of variation in each trait. All of the traits are plotted, but most of them vary little, so their arrows are, for the most part, clustered at the origin.

biplot(pca1)

Before moving on, it is worth briefly discussing explained variance. Here PC1 explained 88% of the variance, which is very high. If you use PC1 as a predictor or response variable, you can be fairly confident that this one variable is a sufficient summary. What if the explained variance is lower? It might be a problem or it might not; it depends on the data! PCs of genetic data are often around 5%, which seems bad, BUT these data sets often have over 10,000 variables. If the PCs were no help at all (i.e., all variables were completely uncorrelated), we would expect the explained variance of PC1 to be roughly 0.0001 (1/10,000). So 5% is pretty good! Generally, the larger the ordination space (i.e., the more variables you have), the lower we expect the explained variance to be, even when the PC captures a lot of meaningful variation.
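To put that in context, here is a quick back-of-the-envelope comparison (assuming pca1 is the princomp object fit earlier on ord_dat[,2:12]):

p <- ncol(ord_dat[, 2:12])       # number of variables in the ordination (11 here)
1 / p                            # expected proportion per axis if all variables were uncorrelated (~0.091)
pca1$sdev^2 / sum(pca1$sdev^2)   # observed proportion of variance explained by each PC

The further the observed proportions sit above that 1/p baseline, the more work the early PCs are actually doing.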

Factor analysis (FA)

Factor analysis is another technique for reducing the dimensionality of data. While PCAs and RDAs do this by creating new variables that are linear combinations of the original variables, FA treats our data as linear combinations of underlying “factors.” For example, if we measure the length and width of a plant, we are really measuring plant size, so length and width are just linear combinations of plant size. Factors represent “latent,” or unmeasured, variables. We will talk more about this next week with SEM.

Here is some code

fa1 <- factanal(ord_dat[,2:12], factors = 4, rotation="varimax", scores = "regression")
fa1
## 
## Call:
## factanal(x = ord_dat[, 2:12], factors = 4, scores = "regression",     rotation = "varimax")
## 
## Uniquenesses:
##    height     width    leaves     veins leflength  lefwidth    fruits trichomes 
##     0.156     0.005     0.200     0.560     0.005     0.201     0.840     0.899 
##      odor thickness     taste 
##     0.427     0.838     0.267 
## 
## Loadings:
##           Factor1 Factor2 Factor3 Factor4
## height     0.481   0.777                 
## width      0.601   0.661   0.137  -0.421 
## leaves             0.846   0.270         
## veins             -0.532          -0.381 
## leflength  0.940                   0.329 
## lefwidth   0.884          -0.103         
## fruits             0.386                 
## trichomes  0.199   0.121   0.217         
## odor                       0.709  -0.245 
## thickness  0.104                   0.385 
## taste              0.254   0.774   0.256 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.339   2.275   1.266   0.721
## Proportion Var   0.213   0.207   0.115   0.066
## Cumulative Var   0.213   0.420   0.535   0.600
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 30.85 on 17 degrees of freedom.
## The p-value is 0.0208
load <- fa1$loadings[,1:2]

plot(load,type="n") 
text(load,labels=names(ord_dat[,2:12]),cex=.7)

This plot shows how variables group in ordination space. We can see that width and height are grouped. So are leaf length and leaf width. Compare this to the PCA biplot. Does it look similar?

Notice the “factors =” argument. When performing a factor analysis you need to specify how many factors you believe to be present in the data. This is tricky. You may base this decision on prior knowledge of the system. There are also tools for thinking about the number of factors in your data.

library(nFactors)

nfact1 <- nMreg(ord_dat[,2:12], cor = TRUE, model = "components", details = TRUE)
summary(nfact1)
## Report For a nFactors Class 
## 
## Details: 
## 
##   v    values      mreg    tmreg     pmreg
## 1 1 3.4166261 0.8543799 4.366586 -5.115989
## 2 2 2.0729868 0.6008300 3.380077 -4.276461
## 3 3 1.4043531 0.4773909 3.483552 -4.371272
## 4 4 1.1701466 0.3968408 3.546287 -4.427941
## 5 5 0.9335793 0.3879396 4.591330 -5.288695
## 6 6 0.7852437 0.3229172      NaN       NaN
## 
## 
##  Number of factors retained by index 
## 
##   b t.p p.b 
##   4   5   5
ev <- eigen(cor(ord_dat[,2:12])) # get eigenvalues
ap <- parallel(subject=nrow(ord_dat[,2:12]),var=ncol(ord_dat[,2:12]),
               rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

These results suggest 4 or 5 factors. Run both and see if the interpretation changes.
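For example, a quick sketch of the five-factor fit for comparison (fa2 is just an illustrative name; the call mirrors the one above):

fa2 <- factanal(ord_dat[, 2:12], factors = 5, rotation = "varimax", scores = "regression")
fa2$loadings # do the same variables still group together as in the 4-factor solution?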

Just like with PCAs, we can extract the factor “scores,” which give each observation’s position on each factor. Looking at the loadings, FA1 is again mostly size-related traits, and this time “leaves” and “fruits” load onto FA2. Based on our findings from the PCA, we would expect FA2 to be more predictive of differences between genera.

model_data <- data.frame(genus = ord_dat$genus, 
                         FA1 = fa1$scores[,1],
                         FA2 = fa1$scores[,2],
                         FA3 = fa1$scores[,3],
                         FA4 = fa1$scores[,4])

mod1 <- glm(FA1 ~ genus, data = model_data)
summary(mod1)
## 
## Call:
## glm(formula = FA1 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.12807  -0.81722  -0.05944   0.72871   1.91975  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.12120    0.25008   0.485    0.630
## genusPiper      -0.33492    0.35367  -0.947    0.349
## genusPsychotria -0.02868    0.35367  -0.081    0.936
## 
## (Dispersion parameter for gaussian family taken to be 1.000658)
## 
##     Null deviance: 46.132  on 47  degrees of freedom
## Residual deviance: 45.030  on 45  degrees of freedom
## AIC: 141.15
## 
## Number of Fisher Scoring iterations: 2
mod2 <- glm(FA3 ~ genus, data = model_data)
summary(mod2)
## 
## Call:
## glm(formula = FA3 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1517  -0.3125   0.1268   0.4195   1.4257  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.5848     0.1893  -3.089  0.00344 ** 
## genusPiper        1.1703     0.2677   4.372  7.2e-05 ***
## genusPsychotria   0.5840     0.2677   2.181  0.03442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.5733297)
## 
##     Null deviance: 36.757  on 47  degrees of freedom
## Residual deviance: 25.800  on 45  degrees of freedom
## AIC: 114.42
## 
## Number of Fisher Scoring iterations: 2

Considerations for ORDINATION techniques:

Broad generalizations from biplots of PCA, NMDS, etc. should be avoided. For testing hypotheses, it is better to use the principal axes (or their scores) as variables in a model alongside the response. The mechanics of these methods are often hidden and their assumptions rarely spelled out, so the user should be aware of them, justify the choice of particular arguments, and state those choices explicitly in the results. For example, there are many distance or dissimilarity measures that can be applied (e.g. Euclidean, Bray-Curtis, Jaccard; see Table 12.6 in Gotelli and Ellison 2012), as well as different rotation methods.
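As a rough sketch, here is how a few of those dissimilarity matrices could be computed from the raw data with vegan’s vegdist (note that Bray-Curtis and Jaccard expect non-negative data and are usually applied to abundance or presence/absence matrices, so this is purely illustrative for our trait data):

euc_d  <- vegdist(ord_dat[, 2:12], method = "euclidean") # Euclidean distance
bray_d <- vegdist(ord_dat[, 2:12], method = "bray")      # Bray-Curtis dissimilarity
jac_d  <- vegdist(ord_dat[, 2:12], method = "jaccard")   # Jaccard, often used with presence/absence data
as.matrix(euc_d)[1:3, 1:3]                               # peek at the first few pairwise distances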

Redundancy analysis (RDA)

Fig. 9. Framework for Redundancy Analyses taken from Legendre & Legendre 2012

The redundancy analysis framework (Fig. 9) combines methods we have already encountered, but is unique in allowing multiple predictor variables. Using an RDA we can predict a matrix of response variables with a matrix of predictor variables. Under the hood, an RDA performs a multiple regression on each response column separately, generating a matrix of predicted values, and then ordinates that fitted matrix. The RDA can then tell us the contribution of each predictor to producing the matrix of predicted values. That is key, because it is tempting to interpret the length of the arrows in an RDA plot as effect sizes, but they are not.
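To make the regression-then-ordination idea concrete, here is a rough sketch of that mechanism (illustrative only; vegan’s rda() handles centering and scaling internally, so the numbers will not match its output exactly):

# 1) regress each response column on the predictor, 2) keep the fitted values, 3) ordinate the fitted matrix
fitted_mat <- apply(ord_dat[, 2:12], 2, function(y) fitted(lm(y ~ genus, data = ord_dat)))
pca_of_fits <- princomp(fitted_mat) # the constrained (RDA) axes are, up to centering/scaling, the PCs of these fitted values
summary(pca_of_fits)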

The data we have aren’t ideal for exploring this method because they contain only one predictor, genus. Either way, let’s explore the function and its output.

response <- ord_dat[,2:12] # `rda` requires a separate matrix of the response variables; here we pull out the responses in columns 2 through 12
rda1 <- vegan::rda(response ~ genus, data = ord_dat) # fit the rda object

This plot could use better formatting, but you can see that it looks similar to the PCA (look at where the height and width arrows go). It also plots the region of ordination space associated with our predictor (shown in blue), which in this case is genus.

plot(rda1, xlim = c(-10, 10), ylim = c(-10, 10))

Here is another, more typical RDA. In this case the ordination space is genetic variation composed of 30,000 SNPs (so the data set has 30,000 response columns). The point colors show different populations, and the RDA uses environmental variables to predict the genetic variation.

Fig. 10. Example RDA plot using SNP data as response variables and the environmental data as predictor variables

Considerations for Multivariate Regression: Significance testing can be done with permutation (Monte Carlo) tests; the null hypothesis is that the predictors have no relationship with the response variables. Distance-based RDA allows other dissimilarity measures to be applied and does not assume that the data are multivariate normal or that the dissimilarity measure is Euclidean.
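A sketch of what that could look like with vegan (assuming a reasonably recent version; anova() on an rda object runs a permutation test, and dbrda() fits the distance-based version):

anova(rda1, permutations = 999)               # overall permutation test of the constrained model
anova(rda1, by = "terms", permutations = 999) # permutation test for each predictor term
dbrda1 <- dbrda(vegdist(response, method = "bray") ~ genus, data = ord_dat) # distance-based RDA on a Bray-Curtis matrix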

Key Takeaways:

After this lab, you should be able to:

  1. Justify why the methods boxed in red fit in their respective categories. You can see from the table below that we have just scratched the surface of the multivariate methods available!
Summary/Comparison of Multivariate Methods

  2. Explain the similarities and differences among MANOVA, perMANOVA, ANOVA, and regression with regard to how they use signal-to-noise ratios (between- and within-group variance) in significance tests.

  3. Explain permutation and its application to multivariate methods.

  4. Calculate a dissimilarity matrix from raw data.

  5. Understand the mechanics and utility of principal component analysis for reducing the dimensionality of a data set.

  6. Understand how exploratory PCA and FA can be used in hypothesis testing, as predictors in models.