library(hillR)
library(tidyverse)
library(vegan)

Welcome back! So far the datasets we have been working with have been fairly simple. By that I mean that our models have not contained a bunch of predictor or response variables, and interpretation of the model outputs has been fairly straightforward. Clean datasets like these are often not what we work with in ecology. What happens if we are interested in the impacts of nitrogen addition on a plant community and we measure the cover of all plants we find? Suddenly, instead of nitrogen predicting one plant, we are using nitrogen to predict MANY plants. What if we are interested in the impacts of many climate variables on plant germination?

Today’s lab will be about different methods of identifying patterns in our data and reducing its dimensionality. This includes useful summary statistics such as richness, diversity indices, and Hill numbers. It also includes multivariate techniques such as PCA, RDA, factor analysis, and random forest. This is certainly not an exhaustive list, but these are commonly used techniques that can help us better understand datasets with many variables.

Before getting too far down this rabbit hole, why is having too many variables in a dataset bad? One of the main problems is multicollinearity; the absence of strong collinearity is an assumption of regression. Essentially, if a linear model contains predictors that are too highly correlated, the model will have trouble fitting parameters. Let’s simulate some data and take a quick look.

Let’s say we are interested in the physical strain exerted by a protagonist as they flee the evil beings (trollocs, fades, etc.). To do this we measure their running speed, their fear, and their heart rate. We want to know what drives up their heart rate more, their fear or running speed?
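The simulation code is not shown here, so below is a minimal sketch of how data like these might be generated (the values are made up and will not reproduce the exact output that follows).

#Made-up values: fear closely tracks running speed, and both drive heart rate
set.seed(123)
running_speed <- runif(50, 0, 20)
fear <- running_speed / 2 + rnorm(50, sd = 0.3)
heart_rate <- 70 + 8.5 * running_speed + rnorm(50, sd = 3)

#Both predictors look linearly related to heart rate
plot(running_speed, heart_rate)
plot(fear, heart_rate)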

Here we see that both variables are linearly related to heart rate. Are they correlated?

cor(running_speed, fear)
## [1] 0.9907435

Looks like they are. Let’s run some models and see what happens.

mod1 <- glm(heart_rate ~ running_speed)
summary(mod1)$coefficients
##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)   70.47677 0.69755439 101.0341 1.401099e-57
## running_speed  8.44568 0.06972009 121.1370 2.388965e-61
mod2 <- glm(heart_rate ~ fear)
summary(mod2)$coefficients
##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 72.89309  1.7039722 42.77833 7.008553e-40
## fear        16.50321  0.3425042 48.18397 2.633252e-42

These look pretty ordinary, but what happens when fear and running speed are both in the model?

mod3 <- glm(heart_rate ~ running_speed + fear)
summary(mod3)$coefficients
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)   70.5127973  0.7045466 100.0825126 1.968991e-56
## running_speed  8.1303935  0.5169581  15.7273750 2.375045e-20
## fear           0.6272086  1.0188815   0.6155854 5.411378e-01

Here we see the problem with collinearity. The relationship between fear and heart rate is gone and the confidence in both estimates is much lower (look at those standard errors around the estimates). Multicollinearity can cause small changes in the model to have drastic impacts on the coefficient estimates. So what does this have to do with multivariate datasets? If we have datasets with many predictor variables, some of them are bound to be correlated. Also, all of the models we have run so far have had one response variable, but what if we have more?

Richness, diversity, and Hill numbers

Richness, diversity indices, and Hill numbers are all ways of summarizing multivariate data (in fact, Hill numbers are just a convenient algebraic transformation of richness and other diversity indices). These summary statistics can serve as either predictors or response variables, depending on the data and the questions being asked. For the next section I simulated some community data. These data show the abundance of 20 plant species across 3 elevations. There are 20 plots at each elevation for a total of 60 plots.

community_data <- read.csv("community_data.csv")

I won’t waste space here examining these data, but you can use functions from the previous labs to explore them. We want to know how plant communities change with elevation. In this case we have 20 response variables, so what can we do? We could run 20 different linear models, but is there a more efficient way? We probably do not want to have to discuss 20 models in a paper.

This is where summaries like richness and diversity come in. Richness is the number of species present, and diversity indices account for both the number of species and their relative abundances (evenness). Two common diversity indices are the Shannon index and Simpson’s index.

There are functions in the vegan package to calculate these but you can also do them manually. They are just equations after all.

Richness

richness <- function (data) {
  
  #Count the species with non-zero abundance
  rich <- length(data[data != 0])
  
  return(rich)
}

rich <- apply(community_data[,3:22], 1, richness)

Shannon diversity

shannon <- function (data) {
  
  #Proportional abundance of each species present, then H' = -sum(p * ln(p))
  p <- data[data != 0]/sum(data)
  shan <- -sum(p * log(p))
  
  return(shan)
}

shan <- apply(community_data[,3:22], 1, shannon)

Simpson’s diversity

simpson <- function (data) {
  
  #Finite-sample form of Simpson's index: 1 - sum(n*(n-1)) / (N*(N-1))
  non_zero <- data[data != 0]
  simp <- 1 - ( sum(non_zero * (non_zero - 1)) / (sum(non_zero) * (sum(non_zero)-1)) )
  
  return(simp)
}

simp <- apply(community_data[,3:22], 1, simpson)

Here is how to do the same thing using vegan. (Note that vegan’s “simpson” index is calculated as 1 - sum(p^2), so it will differ slightly from the finite-sample version above.)

shan1 <- diversity(community_data[,3:22], index = "shannon")
simp1 <- diversity(community_data[,3:22], index = "simpson")
inv_simp1 <- diversity(community_data[,3:22], index = "invsimpson")

We can clean this up a little.

#Using data.frame() keeps elevation and plot from being coerced when combined with the numeric summaries
community_data1 <- data.frame(elevation = community_data$elevation,
                              plot = community_data$plot,
                              rich, shan, simp)

We have now effectively reduced the number of response variables in our dataset from 20 to 3. This is much more manageable for running linear models. Of course, by doing this we change our response variable. So instead of being able to say something like “this plant is more abundant at lower elevations” we can say that “lower elevations see greater richness and diversity.”
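For example, here is a quick sketch of the kind of model we can now run, with richness as the response and elevation as the predictor:

rich_mod <- glm(rich ~ elevation, data = data.frame(elevation = community_data$elevation, rich = rich))
summary(rich_mod)$coefficients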

Hill numbers are another useful tool for summarizing multiple columns. Hill numbers are calculated from an equation that contains a parameter “q”. Increasing q increases the weight given to abundant species. q = 0 is species richness, which ignores abundance entirely; q = 1 corresponds to Shannon diversity (its exponential) and q = 2 to inverse Simpson diversity. Hill numbers are expressed as the “effective number of species” (or whatever else you are measuring), which should be thought of as the number of equally abundant species that would give the same diversity value.
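To make the role of q concrete, here is a minimal sketch of the general Hill number, D = (sum of p_i^q)^(1/(1 - q)), with q = 1 handled as its limiting case, the exponential of Shannon diversity:

hill_number <- function (data, q) {
  
  p <- data[data != 0]/sum(data)
  
  if (q == 1) {
    exp(-sum(p * log(p)))   #limit of the formula at q = 1 (exponential of Shannon)
  } else {
    sum(p^q)^(1/(1 - q))
  }
}

#These should match the hillR output below
apply(community_data[,3:22], 1, hill_number, q = 2)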

Here I use the hillR package to calculate them.

community_data$h0 <- hillR::hill_taxa(community_data[,3:22], q = 0)
community_data$h1 <- hillR::hill_taxa(community_data[,3:22], q = 1)
community_data$h2 <- hillR::hill_taxa(community_data[,3:22], q = 2)

We can plot these to examine the difference in simulated diversity at the three elevations. Here we see that diversity is greatest at mid elevations: richness is 20 and the higher q values are around 18, which tells us that the diversity there is equivalent to a community containing 18 equally abundant species. Compare this with the high and low elevations. There we also see high richness, but the drop-off between q = 0 and q = 1 tells us that those communities are uneven and are dominated by a few species.
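The plotting code is not shown; here is a sketch of one way to make the plot described above, assuming the h0–h2 columns added earlier:

community_data %>%
  pivot_longer(cols = c(h0, h1, h2), names_to = "q", values_to = "hill") %>%
  ggplot(aes(x = factor(elevation), y = hill, fill = q)) +
  geom_boxplot() +
  labs(x = "Elevation", y = "Effective number of species") +
  theme_classic()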

Ordination

Ordination is a wonderful approach for getting a grasp on multivariate datasets. These approaches can be used to identify interesting patterns in data or to reduce the number of dimensions. They are standard in genetic and chemistry analyses, which can have thousands of variables. The math behind them is too complicated to get into here, but essentially they aim to identify variables that are correlated with each other and create new axes or factors that represent those correlated variables together.

For this part we will work with the “ordination.csv” data in the folder.

ord_dat <- read.csv("ordination.csv")

You can see that we have trait data for the three plant species we have considered in previous labs. Can we figure out which traits are associated with which species?

Principal component analysis (PCA)

This ordination technique builds new axes called “components” based on the variation that exists in the data. In Gotelli and Ellison’s words: PCA performs data reduction by creating “new variables as linear combinations of the original variables.” In this dataset we have 11 predictor variables and we want to associate them with plant genera. With 11 variables, our data theoretically exist in 11-dimensional space. A PCA identifies the axis in this space along which the most variation exists, which becomes PC1. PC2 is the axis with the next most variation that is orthogonal to PC1. This continues until as many PCs have been drawn as there are predictor variables. Often PC1 and PC2 explain a high amount of the variation in the data.

Let’s perform a PCA on these data. You’ll notice that since we are interested in identifying the variation in our 11 predictors we only perform the PCA on those (columns 2 through 12).

Examining the PCA using summary() gives us the variance explained by each principal component. This is important because if a few PCs explain most of the variation, we can essentially plot what was 11-dimensional data in 2- or 3-dimensional space. This seems magical but of course it isn’t. It works because our variables are correlated, so what seemed like 11-dimensional data really isn’t; some of our variables are redundant.

pca1 <- princomp(ord_dat[,2:12])
summary(pca1)
## Importance of components:
##                            Comp.1      Comp.2      Comp.3      Comp.4
## Standard deviation     93.1187860 26.79029797 16.63195700 9.480983358
## Proportion of Variance  0.8830104  0.07308804  0.02816942 0.009153727
## Cumulative Proportion   0.8830104  0.95609843  0.98426785 0.993421577
##                             Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     6.535351873 2.4953026038 2.4450309067 1.9440314630
## Proportion of Variance 0.004349398 0.0006340706 0.0006087793 0.0003848556
## Cumulative Proportion  0.997770975 0.9984050454 0.9990138248 0.9993986803
##                              Comp.9      Comp.10      Comp.11
## Standard deviation     1.5494775989 1.3837163363 1.2607024551
## Proportion of Variance 0.0002444904 0.0001949779 0.0001618514
## Cumulative Proportion  0.9996431707 0.9998381486 1.0000000000

A scree plot is a great way to visualize how much of the variation in the data each PC represents.

screeplot(pca1)

The PCA output also includes “scores.” Scores tell us where each of our observations falls along each of these axes in ordination space. Since PC1 and PC2 explain most of the variation (> 95%), plotting PC1 against PC2 will give us good info about the trait space. First I will make a new dataframe for plotting, then plot it!

plot_dat <- data.frame(genus = ord_dat$genus, 
                       PC1 = pca1$scores[,1],
                       PC2 = pca1$scores[,2])

ggplot(data = plot_dat) +
  geom_point(aes(PC1, PC2, fill = genus), pch = 21, color = "black") +
  scale_fill_manual(values = wesanderson::wes_palette("Moonrise3")) +
  theme_classic()

Often PCA plots look like this. PC1 and PC2 represent most of the variation in plant trait space and we color our points by another variable we are interested in (and that was not used in the ordination) in order to see if the observations cluster by that variable. Here we see a fair amount of overlap among genera, but it does appear that Psychotria has more variation on PC1 (which is the most important axis given that it represents 88% of the variation in our traits). But what does that mean? What is PC1?

This brings up the trade-off of ordination techniques. Representing many different traits as components sacrifices interpretability. It is easy to say that plant A and plant B differ in size, but what does it mean if they differ in PC1? We can get to the bottom of this! PCAs also output “loadings,” which tell us how much a variable “loads” onto a PC. The greater the loading, the more that PC represents that variable.

loadings(pca1)
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## height     0.832  0.542                                                 
## width      0.535 -0.837                                                 
## leaves     0.135        -0.668 -0.662 -0.266        -0.109              
## veins                                        -0.449  0.116  0.728 -0.309
## leflength                0.122  0.213 -0.844  0.180               -0.376
## lefwidth                        0.165 -0.421 -0.232 -0.184  0.307  0.619
## fruits                  -0.720  0.682                                   
## trichomes                                    -0.510  0.730 -0.294       
## odor                                          0.481  0.496  0.482  0.245
## thickness                             -0.110 -0.252        -0.212  0.544
## taste                          -0.110         0.391  0.380         0.129
##           Comp.10 Comp.11
## height                   
## width                    
## leaves                   
## veins     -0.239   0.311 
## leflength -0.204         
## lefwidth   0.472         
## fruits                   
## trichomes  0.285  -0.162 
## odor      -0.110  -0.461 
## thickness -0.753         
## taste      0.127   0.805 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091
## Cumulative Var  0.091  0.182  0.273  0.364  0.455  0.545  0.636  0.727  0.818
##                Comp.10 Comp.11
## SS loadings      1.000   1.000
## Proportion Var   0.091   0.091
## Cumulative Var   0.909   1.000

This shows us that PC1 and PC2 are primarily composed of height and width. This tells us that the main variation in the data is related to plant size. If we use PC1 as a predictor variable in a further analysis, we know that we are predicting our response variable using plant size.

In practice, we do this using the scores. Each observation gets a “score,” which is its position along each PC. First we make a new dataset that has only the PCs, and then let’s run some quick models.

model_data <- data.frame(genus = ord_dat$genus, 
                         PC1 = pca1$scores[,1],
                         PC2 = pca1$scores[,2],
                         PC3 = pca1$scores[,3])

mod1 <- glm(PC1 ~ genus, data = model_data)
summary(mod1)
## 
## Call:
## glm(formula = PC1 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -153.79   -48.97   -14.61    56.80   317.32  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -44.41      21.38  -2.078  0.04348 * 
## genusPiper         31.33      30.23   1.036  0.30564   
## genusPsychotria   101.91      30.23   3.371  0.00155 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7311.47)
## 
##     Null deviance: 416213  on 47  degrees of freedom
## Residual deviance: 329016  on 45  degrees of freedom
## AIC: 568.19
## 
## Number of Fisher Scoring iterations: 2
mod2 <- glm(PC2 ~ genus, data = model_data)
summary(mod2)
## 
## Call:
## glm(formula = PC2 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -51.456  -16.014   -2.985   17.004   83.437  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -5.342      6.846  -0.780    0.439
## genusPiper         7.134      9.681   0.737    0.465
## genusPsychotria    8.890      9.681   0.918    0.363
## 
## (Dispersion parameter for gaussian family taken to be 749.8024)
## 
##     Null deviance: 34451  on 47  degrees of freedom
## Residual deviance: 33741  on 45  degrees of freedom
## AIC: 458.87
## 
## Number of Fisher Scoring iterations: 2
mod3 <- glm(PC3 ~ genus, data = model_data)
summary(mod3)
## 
## Call:
## glm(formula = PC3 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -72.917   -4.583    1.491    6.305   22.921  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       10.975      3.519   3.119  0.00317 ** 
## genusPiper        -9.685      4.977  -1.946  0.05792 .  
## genusPsychotria  -23.240      4.977  -4.669 2.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 198.161)
## 
##     Null deviance: 13277.9  on 47  degrees of freedom
## Residual deviance:  8917.2  on 45  degrees of freedom
## AIC: 395
## 
## Number of Fisher Scoring iterations: 2

Effect sizes become trickier with PCs because the scale of a PC reflects the variation in the original data. We can standardize the scores if we want comparable effect sizes. Also, just because PC1 represents the most variation in the data does not mean it is the most predictive. Remember that these are summaries of the original variables, so if those original variables are not very predictive, the PCs won’t be either. Here PC1 and PC2 are aspects of plant size, which is actually not that predictive of differences in genus. PC3 is fruits and leaves, which are far more predictive. Compare mod3 with mod1 and mod2.
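For example, a quick sketch of standardizing a score before modeling, so the coefficients are in standard-deviation units of the PC:

model_data$PC1_z <- as.numeric(scale(model_data$PC1))
mod1_z <- glm(PC1_z ~ genus, data = model_data)
summary(mod1_z)$coefficients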

We can also look at the biplot. This gives us a visual representation of the loadings. Here the numbers are the individual observations and the arrows represent the direction and strength of variation in each trait. All of the traits are plotted here, but most do not vary much, so their arrows are clustered near the origin.

biplot(pca1) #I know this is ugly. There are ways to make them prettier but I am lazy right now

Before moving on it is worth briefly discussing explained variance. Here PC1 explained 88% of the variance, which is really high. If you use PC1 as a predictor or response variable, you can be pretty sure that one variable is a sufficient summary. What if the explained variance is lower? It might be a problem or it might not; it depends on the data! PCs of genetic data often explain around 5%, which seems bad, BUT these data often have over 10,000 predictor variables. So if the PCs were no help (and all variables were perfectly uncorrelated) we would expect the explained variance of PC1 to be ~0.0001. So 5% is pretty good! Generally, the larger the ordination space (i.e. the more variables you have), the lower we expect our explained variance to be, even when a PC captures a lot of real structure.

Redundancy analysis (RDA)

Redundancy analysis is a technique similar to PCA, but it allows for the addition of multiple predictor variables. Using an RDA we can predict a matrix of response variables with a matrix of predictor variables. What an RDA does is perform a multiple regression on each of the response columns separately, generating a matrix of predicted values, and then tell us the contribution of each predictor to producing that matrix of predicted values. That is key, because it is tempting to interpret the length of the lines in an RDA plot as effect sizes, but they are not.

I am happy to chat about this further, but this type of analysis isn’t great for the data we have here because we only have one predictor. Regardless, here is the code for one.

response <- ord_dat[,2:12]
rda1 <- vegan::rda(response ~ genus, data = ord_dat)

Again, this plot sucks. BUT you can see that it looks similar to the PCA (look at where the height and width lines go), and it also plots the region of ordination space associated with our predictor (shown in blue), which in this case is genus.

plot(rda1, xlim = c(-10, 10), ylim = c(-10, 10))

Here is another RDA that is more standard. Here the ordination space is genetic variation which is composed of 30,000 SNPs (so the dataset has 30,000 columns). The point colors show different populations and the RDA is of environmental traits predicting genetic variation.

Factor analysis (FA)

Factor analysis is another technique for reducing the dimensionality of data. While PCAs and RDAs do this by creating new variables that are linear combinations of the original variables, FA treats our measured variables as linear combinations of underlying “factors.” For example, if we measure the length and width of a plant, we are really measuring plant size, so length and width are just linear combinations of an underlying size factor.

Here is some code

fa1 <- factanal(ord_dat[,2:12], factors = 4, rotation = "varimax", scores = "regression")

#Plot the loadings of each variable on the first two factors
load <- fa1$loadings[,1:2]

plot(load, type = "n") 
text(load, labels = names(ord_dat[,2:12]), cex = .7)

This plot shows how variables group in ordination space. We can see that width and height are grouped. So are leaf length and leaf width. Compare this to the PCA biplot. Does it look similar?

Notice the “factors =” argument. When performing a factor analysis you need to specify how many factors you believe to be present in the data. This is tricky. You may base this decision on prior knowledge of the system. There are also tools for thinking about the number of factors in your data.

library(nFactors)

nfact1 <- nMreg(ord_dat[,2:12], cor = TRUE, model = "components", details = TRUE)
summary(nfact1)
## Report For a nFactors Class 
## 
## Details: 
## 
##   v    values      mreg    tmreg     pmreg
## 1 1 3.4166261 0.8543799 4.366586 -5.115989
## 2 2 2.0729868 0.6008300 3.380077 -4.276461
## 3 3 1.4043531 0.4773909 3.483552 -4.371272
## 4 4 1.1701466 0.3968408 3.546287 -4.427941
## 5 5 0.9335793 0.3879396 4.591330 -5.288695
## 6 6 0.7852437 0.3229172      NaN       NaN
## 
## 
##  Number of factors retained by index 
## 
##   b t.p p.b 
##   4   5   5
ev <- eigen(cor(ord_dat[,2:12])) # get eigenvalues
ap <- parallel(subject=nrow(ord_dat[,2:12]),var=ncol(ord_dat[,2:12]),
               rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

These results suggest 4 or 5 factors. Run both and see if the interpretation changes.

Just like with PCAs, we can extract the factor “scores,” which give each observation’s position on each factor. Here I look at FA1 and FA3. FA1 is a lot of size stuff again, and this time “leaves” and “fruits” load onto FA3. Based on our findings from the PCA, we would expect FA3 to be more predictive of differences between genera.

model_data <- data.frame(genus = ord_dat$genus, 
                         FA1 = fa1$scores[,1],
                         FA2 = fa1$scores[,2],
                         FA3 = fa1$scores[,3],
                         FA4 = fa1$scores[,4])

mod1 <- glm(FA1 ~ genus, data = model_data)
summary(mod1)
## 
## Call:
## glm(formula = FA1 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.12807  -0.81722  -0.05944   0.72871   1.91975  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.12120    0.25008   0.485    0.630
## genusPiper      -0.33492    0.35367  -0.947    0.349
## genusPsychotria -0.02868    0.35367  -0.081    0.936
## 
## (Dispersion parameter for gaussian family taken to be 1.000658)
## 
##     Null deviance: 46.132  on 47  degrees of freedom
## Residual deviance: 45.030  on 45  degrees of freedom
## AIC: 141.15
## 
## Number of Fisher Scoring iterations: 2
mod2 <- glm(FA3 ~ genus, data = model_data)
summary(mod2)
## 
## Call:
## glm(formula = FA3 ~ genus, data = model_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1517  -0.3125   0.1268   0.4195   1.4257  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.5848     0.1893  -3.089  0.00344 ** 
## genusPiper        1.1703     0.2677   4.372  7.2e-05 ***
## genusPsychotria   0.5840     0.2677   2.181  0.03442 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.5733297)
## 
##     Null deviance: 36.757  on 47  degrees of freedom
## Residual deviance: 25.800  on 45  degrees of freedom
## AIC: 114.42
## 
## Number of Fisher Scoring iterations: 2

Machine learning techniques

These techniques are becoming more and more common in ecology-related fields as a way of reducing the number of variables in a model. Unlike ordination, which focuses on creating new variables by combining redundant variables, many of these techniques focus on throwing out variables that do not seem important. How these techniques figure out what is and is not important gets at how they “learn.” This is typically done by assessing a variable’s ability to predict the data. As for specific techniques, I will only provide code for random forest, which is probably the most common technique I see, but there are certainly others used for variable reduction.

Training and testing data

Training and testing data are important principles in machine learning, but they are also a very useful concept in statistics in general. It comes down to prediction. Remember that models are tools for quantifying relationships in our data and for making predictions about unobserved data. With this in mind, it is important that models are good enough to explain the data we collected and general enough to predict data we did not collect.

This gets at the problem of overfitting. An overfit model explains the data we did collect very well, but lacks any generality. At its most extreme, a model that contains as many predictive terms as observations can explain the collected data perfectly.

Consider these data:
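The original x and y are not shown, so here is a sketch of similar data: a noisy linear relationship, with made-up values chosen only to roughly resemble the model output further down.

set.seed(42)
x <- runif(50, 0, 50)
y <- 23 + 1.1 * x + rnorm(50, sd = 5)

plot(x, y)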

Which of these models do you think explains the observed data better? Which is actually more realistic?
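The original figures are not shown, but here is a sketch of the kind of comparison they make: a simple line versus a wildly flexible polynomial with nearly as many terms as observations.

fit_simple <- lm(y ~ x)
fit_overfit <- lm(y ~ poly(x, 25)) #an absurdly flexible model

plot(x, y)
lines(sort(x), predict(fit_simple)[order(x)], col = "blue")
lines(sort(x), predict(fit_overfit)[order(x)], col = "red")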

Validation (and cross-validation) is one method of making our models more general. The idea is that we withhold a portion of our data, say 20% of it. We then build our model on the 80% of the data we kept and use that model to predict the 20% we withheld. The best model is the one that best predicts the withheld data.

We can do this using the caret package, but you can also code this in R yourself!
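Coding it yourself might look something like this (a sketch of a single 80/20 hold-out, using the hypothetical x and y from above):

dat <- data.frame(x = x, y = y)
train_rows <- sample(nrow(dat), size = round(0.8 * nrow(dat))) #hold out 20% of rows

fit <- lm(y ~ x, data = dat[train_rows, ])
pred <- predict(fit, newdata = dat[-train_rows, ])

sqrt(mean((dat$y[-train_rows] - pred)^2)) #RMSE on the withheld data

Now here is the caret version.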

library(caret)

#Sets up the type of cross validation. This is 5 fold. We can also use 10 fold or LOOCV.
train.control <- trainControl(method = "cv", number = 5)

#Train the model
model <- train(y ~ x, method = "lm", data = data.frame(x=x, y=y), trControl = train.control)
#Summarize the results
summary(model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1686  -2.7827  -0.0863   2.7866  10.7438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.0929     1.4387   16.05   <2e-16 ***
## x             1.1120     0.0491   22.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.01 on 48 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9126 
## F-statistic: 512.9 on 1 and 48 DF,  p-value: < 2.2e-16

Which model from above does this resemble?

plot(x, y) 
lines(x, predict(model), col = "red")

Random forest

Random forest is a widely used technique for identifying the most important variables in our data. In random forest, “importance” is judged by a variable’s ability to predict the data. The random forest algorithm builds many decision trees, each using random subsets of the observations and predictor variables, and combines them to predict our response variable (that is an oversimplification; I would recommend reviewing the reading or watching a YouTube video. It’s pretty rad.). Important variables are those that do a good job of predicting our data when they are in the model, but make the model much worse when they are removed. If our response variable is categorical, random forest performs classification, meaning the algorithm tries to classify our observations into groups. If the response variable is continuous, it performs regression and tries to predict the exact value.

Like before, we want to break our dataset into a training and testing dataset. We will want testing data to assess how well the model performs.

keep <- sample(1:nrow(ord_dat), round(0.7*nrow(ord_dat))) #Randomly determine what rows to keep
train <- ord_dat[keep,] #Create training dataset
test <- ord_dat[-keep,] #Create testing dataset

Now we can run the actual model. Here we build it using the training data. In this example the random forest is performing classification because it is using our predictor variables to classify our response variables into one of the three genera.

library(randomForest)

#View which variables are important for predicting differences in plant genera
rf1 <- randomForest(as.factor(train$genus) ~ ., data=train[,2:12]) #genus must be a factor for classification
varImpPlot(rf1)

These variables may be important, but is the model any good? Let’s use our testing data to generate predictions and then compare them to the actual values. Here we use the model we built from the training data to predict the genus of each observation in the testing data. Then we compare the predicted values to the real groups.

pred <-  predict(rf1, newdata=test[,2:12])
table(test[,1], pred)
##             pred
##              Miconia Piper Psychotria
##   Miconia          3     0          0
##   Piper            0     6          0
##   Psychotria       0     1          4

This looks pretty good! Our model only messed up one prediction. So how do we interpret it? We can say that fruits and leaves are more important for differentiating between the genera than our other variables, but we cannot make any claims about effect sizes. What we can do is subset our data to these more important variables and then run a traditional model, as sketched below.
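A quick sketch of that follow-up, in the same style as the earlier models (assuming fruits and leaves came out as the most important variables in your importance plot):

fruit_mod <- glm(fruits ~ genus, data = ord_dat)
leaf_mod <- glm(leaves ~ genus, data = ord_dat)

summary(fruit_mod)$coefficients
summary(leaf_mod)$coefficients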