In this section, we analyze species richness, which represents count data (number of species). Therefore, a Poisson Generalized Linear Model (GLM) is the most appropriate approach.
First, we load the dataset and inspect its structure to understand the variables we are working with.
# Load the dataset from the CSV file
forest <- read.csv("forest.csv")
# Display the internal structure of the data frame (variable types)
str(forest)## 'data.frame': 200 obs. of 4 variables:
## $ forest : num 71.9 20.9 83.2 37 74.5 ...
## $ forest.clumpy: num 0.861 0.796 0.822 0.799 0.902 ...
## $ sp.richness : int 109 48 117 50 78 71 79 96 102 61 ...
## $ Empenigr : int 1 0 1 0 0 0 0 0 1 0 ...
## forest forest.clumpy sp.richness Empenigr
## Min. : 8.28 Min. :0.7609 Min. : 23.00 Min. :0.00
## 1st Qu.:39.09 1st Qu.:0.8457 1st Qu.: 72.00 1st Qu.:0.00
## Median :53.89 Median :0.8695 Median : 85.00 Median :0.00
## Mean :53.69 Mean :0.8637 Mean : 85.17 Mean :0.35
## 3rd Qu.:68.40 3rd Qu.:0.8858 3rd Qu.:100.00 3rd Qu.:1.00
## Max. :91.60 Max. :0.9487 Max. :127.00 Max. :1.00
Before modeling, it is crucial to visualize the distribution of our response variable (Species Richness). Count data typically follows a right-skewed Poisson distribution.
# Plot a histogram to visualize the frequency of species richness counts
hist(forest$sp.richness,
main = "Histogram of Species Richness",
xlab = "Species Richness (Count)",
col = "lightblue")We fit a Poisson GLM using the log link function. The mathematical representation is:
\[ \log(\lambda) = \beta_0 + \beta_1(Forest) + \beta_2(Clumpy) + \beta_3(Forest \times Clumpy) \]
# Fit the full Poisson GLM including the interaction term (*) between predictors
model_full <- glm(sp.richness ~ forest * forest.clumpy,
family = poisson(link = "log"),
data = forest)
# Display the model summary to check the significance of the interaction term
summary(model_full)##
## Call:
## glm(formula = sp.richness ~ forest * forest.clumpy, family = poisson(link = "log"),
## data = forest)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.903143 0.526995 3.611 0.000305 ***
## forest 0.025377 0.008490 2.989 0.002797 **
## forest.clumpy 2.502664 0.614149 4.075 4.6e-05 ***
## forest:forest.clumpy -0.021457 0.009995 -2.147 0.031813 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 887.57 on 199 degrees of freedom
## Residual deviance: 575.27 on 196 degrees of freedom
## AIC: 1834.5
##
## Number of Fisher Scoring iterations: 4
If the interaction is not significant, we simplify the model by
removing it, leaving only the additive main effects
(+).
# Fit the final additive model without the interaction term (+)
model_final <- glm(sp.richness ~ forest + forest.clumpy,
family = poisson(link = "log"),
data = forest)
# Display the summary to interpret the main effects
summary(model_final)##
## Call:
## glm(formula = sp.richness ~ forest + forest.clumpy, family = poisson(link = "log"),
## data = forest)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.9556051 0.1892674 15.616 < 2e-16 ***
## forest 0.0071777 0.0004181 17.166 < 2e-16 ***
## forest.clumpy 1.2671322 0.2103598 6.024 1.71e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 887.57 on 199 degrees of freedom
## Residual deviance: 579.88 on 197 degrees of freedom
## AIC: 1837.1
##
## Number of Fisher Scoring iterations: 4
A strict assumption of the Poisson model is that the variance equals the mean. We check for overdispersion by dividing the residual deviance by the residual degrees of freedom.
\[ Dispersion\_Ratio = \frac{Residual\_Deviance}{Residual\_df} \]
# Calculate the dispersion ratio
dispersion_ratio <- deviance(model_final) / df.residual(model_final)
print(dispersion_ratio)## [1] 2.943544
# Decision Rule:
# If value is close to 1 -> Good (No overdispersion)
# If value > 1.5 or 2 -> Overdispersion problem exists
# If overdispersion exists, we must use the 'quasipoisson' family to correct standard errors
model_final_quasi <- glm(sp.richness ~ forest + forest.clumpy,
family = quasipoisson(link = "log"),
data = forest)We use diagnostic plots to verify the model assumptions (e.g., residual patterns, influential outliers).
# Divide the plotting area into a 2x2 grid to show all 4 diagnostic plots together
par(mfrow = c(2, 2))
# Generate the diagnostic plots for the final model
plot(model_final)forest.clumpy) had a significant
positive effect, indicating that highly aggregated forests support
higher species richness.In this section, we analyze the presence or absence of a specific plant species (Empetrum nigrum). Since the data is binary (0 = absent, 1 = present), a Binomial GLM (Logistic Regression) is required.
We use the logit link function to model the probability (\(p\)) of the species being present:
\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots \]
# Fit the full binomial GLM including the interaction term (*)
model2_full <- glm(Empenigr ~ forest * forest.clumpy,
family = binomial(link = "logit"),
data = forest)
# Check the summary to evaluate predictors and their interaction
summary(model2_full)##
## Call:
## glm(formula = Empenigr ~ forest * forest.clumpy, family = binomial(link = "logit"),
## data = forest)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.7165 13.4065 -2.291 0.0220 *
## forest 0.5084 0.2101 2.420 0.0155 *
## forest.clumpy 32.8914 15.4238 2.133 0.0330 *
## forest:forest.clumpy -0.5558 0.2441 -2.276 0.0228 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 258.98 on 199 degrees of freedom
## Residual deviance: 235.90 on 196 degrees of freedom
## AIC: 243.9
##
## Number of Fisher Scoring iterations: 4
Depending on the significance of the interaction in the full model, we can fit an additive model for comparison or final interpretation.
# Fit the additive binomial model without the interaction term (+)
model2_final <- glm(Empenigr ~ forest + forest.clumpy,
family = binomial(link = "logit"),
data = forest)
summary(model2_final)##
## Call:
## glm(formula = Empenigr ~ forest + forest.clumpy, family = binomial(link = "logit"),
## data = forest)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.306573 3.895496 -0.592 0.553775
## forest 0.033682 0.008951 3.763 0.000168 ***
## forest.clumpy -0.214035 4.306505 -0.050 0.960361
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 258.98 on 199 degrees of freedom
## Residual deviance: 241.67 on 197 degrees of freedom
## AIC: 247.67
##
## Number of Fisher Scoring iterations: 4
Just like the Poisson model, we check the residuals for the Binomial model.
# Set up a 2x2 grid for diagnostic plots
par(mfrow = c(2, 2))
# Generate plots to check for influential data points or severe deviations
plot(model2_final)