Load libraries and data

library(here)
library(dplyr)
library(lme4)
library(sjPlot)
library(ggplot2)
library(tidyr)
library(patchwork)
library(MuMIn)
library(lmerTest)

cabralarch <- read.csv(here("Week_11", "Data", "cabral_arch_data.csv"))
cabralisland <- read.csv(here("Week_11", "Data", "cabral_island_data.csv"))

Linear Mixed Model

To start, let’s consider that the data are naturally organized into groups (archipelagoes), and we should account for this structure (non-independence) in the model. In addition, some of the predictors are defined at the scale of the archipelago, which means we need an archipelago random effect in order to not pseudoreplicate when testing those predictors. Make a model with species richness as the response, and with a random effect for Archipelago. Species richness could be modeled as discrete count data (e.g., a negative binomial distribution), but we’ll finish covering GLMMs later; for now you can use log(Richness+1) to get a pretty normal looking response. What proportion of the variation in species richness occurs at the archipelago scale, and what proportion occurs within archipelagoes?

# species richness response, random effect for archipelago
cabral.richness = lmer(log(Species+1) ~ (1|Archipelago), data = cabralisland) 
# fit an intercept, i.e. the mean of the data, and then let that intercept vary randomly between archipelago


summary(cabral.richness)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(Species + 1) ~ (1 | Archipelago)
##    Data: cabralisland
## 
## REML criterion at convergence: 501.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1507 -0.5357  0.0796  0.5790  1.9455 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  Archipelago (Intercept) 0.6408   0.8005  
##  Residual                0.8243   0.9079  
## Number of obs: 174, groups:  Archipelago, 23
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)   4.7685     0.1865 18.9104   25.57 3.94e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The random effects section will report the variance estimates for any random effects terms in the model. The fixed effects section will report the parameter estimates for any fixed effects terms in the model.

The proportion of variation in species richness that occurs at the archipelago scale is 64%, and the residual variance is 82%.

Variation

Which archipelagoes are particularly diverse, and which are depauperate?

ranef(cabral.richness)
## $Archipelago
##                             (Intercept)
## Aldabra                      0.28121897
## Azores                       0.36773210
## Balearic Islands             1.18183176
## Canaries                     1.45904002
## Cape Verde                  -0.22621387
## Cook islands                -0.22591288
## Dutch Caribbean              0.91678059
## Galapagos                    0.09980580
## Hawaii                      -0.09227461
## Iles Crozet                 -1.16604580
## Inner Seychelles             0.06058546
## Juan Fernandez               0.06298814
## Kuril Islands                0.12453644
## Madeira                      0.74721555
## Marianas                    -0.11819580
## Marquesas                    0.10645794
## North. Cal. Channel Islands  0.67822271
## Phoenix Islands             -1.35368520
## Pitcairn Islands            -0.94759395
## Prince Edward Islands       -0.90859247
## Revillagigedo Islands       -0.71128117
## Society Islands             -0.02387580
## Tristan da Cunha            -0.31274390
## 
## with conditional variances for "Archipelago"

Random effects

plot_model(cabral.richness, type ="re")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the sjPlot package.
##   Please report the issue at <https://github.com/strengejacke/sjPlot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Canaries and Balaeric Islands are particularly diverse with the highest positive values (they have high species richness in comparison to other islands), while the Phoenix Islands and Iles Crozet are depauperate.

Mixed models

Exploratory plots

Now let’s think about the six predictors. Make some exploratory plots of the effect of each variable on richness. You’ll need to merge the datasets. Think about which predictors might need to be transformed for use in a linear model.

cabralmerged <- left_join(cabralisland, cabralarch, by = "Archipelago")

# pivot long
cabrallong <- pivot_longer(cabralmerged, c(Area, Elev, Temp, number.islands, distance, age), names_to = "category", values_to = "values")

ggplot(cabrallong, aes(values, Species)) +
  geom_point() +
  facet_wrap(~ category, scales = "free")

logcabrallong <- cabrallong %>%
  mutate(logvalues = log(values))

ggplot(logcabrallong, aes(logvalues, log(Species))) +
  geom_point() +
  facet_wrap(~ category, scales = "free")

Mixed models

Construct mixed model(s) that test the roles of the six predictors in explaining species richness. Plot fitted effects (fixed and random), plus model diagnostics. Calculate how much variation the predictors explain, at the two different scales in the data (island and archipelago). I.e., present R2 values for the two scales. Also, how much of the total variation have they explained, according to R2GLMM(m)?

# all variables richness model
cabral.richness.modall = lmer(log(Species+1) ~ log(Area) + log(Elev + 1) + log(Temp) + (1|Archipelago) + log(number.islands) + log(distance) + log(age), data = cabralmerged)
summary(cabral.richness.modall)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(Species + 1) ~ log(Area) + log(Elev + 1) + log(Temp) + (1 |  
##     Archipelago) + log(number.islands) + log(distance) + log(age)
##    Data: cabralmerged
## 
## REML criterion at convergence: 309.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.6881 -0.4890  0.0988  0.5617  2.0021 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  Archipelago (Intercept) 0.2752   0.5246  
##  Residual                0.2390   0.4889  
## Number of obs: 174, groups:  Archipelago, 23
## 
## Fixed effects:
##                      Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)           1.79175    0.71416  28.82931   2.509  0.01799 *  
## log(Area)             0.32851    0.03223 161.03491  10.192  < 2e-16 ***
## log(Elev + 1)         0.10453    0.03968 166.66229   2.635  0.00922 ** 
## log(Temp)             0.87107    0.15552  85.47693   5.601 2.54e-07 ***
## log(number.islands)   0.09998    0.18204  18.59259   0.549  0.58939    
## log(distance)        -0.27510    0.07445  20.07228  -3.695  0.00143 ** 
## log(age)              0.10516    0.06422  19.02819   1.638  0.11795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) lg(Ar) l(E+1) lg(Tm) lg(n.) lg(ds)
## log(Area)    0.123                                   
## log(Elev+1) -0.314 -0.757                            
## log(Temp)   -0.504 -0.260  0.344                     
## lg(nmbr.sl) -0.375 -0.031  0.032  0.029              
## log(distnc) -0.534  0.132 -0.096 -0.267 -0.081       
## log(age)    -0.174  0.047 -0.064 -0.061 -0.315  0.252

Random effects plots

plot_model(cabral.richness.modall, type ="re")

Fixed effects plot

fixef(cabral.richness.modall)
##         (Intercept)           log(Area)       log(Elev + 1)           log(Temp) 
##          1.79175009          0.32851415          0.10453193          0.87107414 
## log(number.islands)       log(distance)            log(age) 
##          0.09997955         -0.27510497          0.10515856
par(mfrow = c(3,2))
areap <- plot_model(cabral.richness.modall, type = "eff", terms = "Area")
elevp <- plot_model(cabral.richness.modall, type = "eff", terms = "Elev")
tempp <- plot_model(cabral.richness.modall, type = "eff", terms = "Temp")
islandsp <- plot_model(cabral.richness.modall, type = "eff", terms = "number.islands")
distancep <- plot_model(cabral.richness.modall, type = "eff", terms = "distance")
agep <- plot_model(cabral.richness.modall, type = "eff", terms = "age")

(areap + elevp + tempp) / (distancep + agep + islandsp)

Model diagnostics

pd1<- plot_model(cabral.richness.modall, type="diag")[[1]]
pd2<- plot_model(cabral.richness.modall, type="diag")[[2]]

pd1+ pd2

Total & residual variation

#R^2 = 1 - variance of full model/ variance of model with not predictors 
1 - VarCorr(cabral.richness.modall)$Archipelago[1]/VarCorr(cabral.richness)$Archipelago[1]
## [1] 0.5705734
#sigma() extracts the residual standard deviation from the model.
1 - (sigma(cabral.richness.modall)^2)/(sigma(cabral.richness)^2)
## [1] 0.710061
#total variation have they explained, according to R2GLMM(m)
r.squaredGLMM(cabral.richness.modall)
##            R2m       R2c
## [1,] 0.6655305 0.8445342

57% of the variation between archipelagos is explained by the predictors. 71% of the variation within islands in an archipelago is explained by the predictors. The fixed effects explained 67% of the total variation and the random effects explained 84% of the variation.

Use appropriate hypothesis tests (as described in lecture) to test the significance of the predictors. How do you interpret the results of these tests, and the effects plots, in light of hypotheses for what controls species richness in islands and archipelagos? What are the denominator degrees of freedom for each predictor? This is essentially telling you how much replication there is for that predictor, minus the number of fitted parameters for that predictor. Do the denominator df make sense? Why or why not?

anova(cabral.richness.modall, ddf = "Satterthwaite", type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
##                      Sum Sq Mean Sq NumDF   DenDF  F value    Pr(>F)    
## log(Area)           24.8240 24.8240     1 161.035 103.8734 < 2.2e-16 ***
## log(Elev + 1)        1.6588  1.6588     1 166.662   6.9410  0.009218 ** 
## log(Temp)            7.4969  7.4969     1  85.477  31.3702 2.542e-07 ***
## log(number.islands)  0.0721  0.0721     1  18.593   0.3016  0.589391    
## log(distance)        3.2634  3.2634     1  20.072  13.6553  0.001426 ** 
## log(age)             0.6408  0.6408     1  19.028   2.6815  0.117952    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is the model we’ve used the best model? Often I just stick with one big model when the ratio of data to parameters is pretty good. But one might instead prefer to find the best model(s), while accounting for model selection uncertainty. Use AICc in some capacity to assess which predictors are important, what the ‘best’ model is, and how sure you are about what the best model is. The details of how you do it are up to you, as long as it seems justifiable. Remember to set REML=FALSE when comparing models with different sets of predictors.