Starting off

Remember to set your working directory before you start. You also need to load today’s packages.

library(car)
library(caret)
library(effects)
library(emmeans)
library(GGally)
library(ggplot2)
library(ggthemes)
library(multcomp)
library(MuMIn)
library(relaimpo)

Multiple regression and model selection

This week’s tutorial covers much of Chapters 8 and 9 of Quinn & Keough (2024) on multiple regression, predictor importance, and model selection.

The birdabund.csv dataset (described in Quinn & Keough, 2024) contains information on bird abundance in forest fragments in southeastern Australia. The columns in the dataset are:

  • abund: bird abundance in the fragment (the response variable)
  • area: fragment area (ha)
  • yr.isol: year the fragment was isolated by clearing
  • dist: distance to the nearest fragment (km)
  • ldist: distance to the nearest larger fragment (km)
  • graze: grazing intensity, on a scale from 1 (light) to 5 (heavy)
  • alt: altitude (m)

Inspect the data

Read in the data and convert the graze variable to a factor. Then inspect the output of the summary() command (which is not shown because it is too wide for this page).

d <- read.csv("Data/birdabund.csv")
d$graze <- factor(d$graze)
summary(d)

Next, let’s check how the variables are related to each other with a scatterplot matrix.

ggpairs(d) +
  theme_few()

area, dist, and ldist appear to be highly right-skewed with a few very high-leverage points. To fix this, let’s log10-transform them.

d$area.log10 <- log10(d$area)
d$dist.log10 <- log10(d$dist)
d$ldist.log10 <- log10(d$ldist)

Don’t just assume that your transformations made things better. Always check.

# Exclude the untransformed variables in columns 2, 4, and 5
ggpairs(d[,-c(2, 4, 5)]) +
  theme_few()

Looks pretty good to me. All of our predictors are positively correlated with the response (abund). None of our predictor variables are particularly highly correlated with one another. (According to one rule of thumb, \(\left|r\right|\ge 0.7\) is an indicator of potentially problematic collinearity.)
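If you want a numeric check to go along with the scatterplot matrix, here is a quick sketch (the factor graze is excluded because Pearson correlations apply only to the numeric variables):

# Pairwise correlations among the response and the continuous predictors;
# |r| >= 0.7 between two predictors would be a possible collinearity red flag
round(cor(d[, c("abund", "yr.isol", "alt",
                "area.log10", "dist.log10", "ldist.log10")]), 2)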

Now we will begin our model-selection procedure.

Build the full model

We have six predictor variables. Our first step is to construct a “full model” that contains all the terms we believe might have an effect. Ideally, this full model should be biologically sensible and based on your expert knowledge of the system, including any interactions you believe might be important. Typically, a full model will not include interactions higher than three-way (i.e., it excludes four-way, five-way, etc. interactions) because they are unlikely to explain much variation and because they are difficult to interpret. Because we lack expert knowledge of this system, we will build our full model with main effects only.

Because we are going to compare models, we need to make sure that we do not fit them to different subsets of the data, so we set the na.action option to "na.fail". This way, if there are any missing data in our dataset, the model-fitting operation will fail and we will be forced to deal with the missing data in some way (e.g., by removing any records with missing values) before we can continue. Let’s fit a linear model containing all predictors.

op <- options(na.action = "na.fail")
abund.fullmodel <- lm(abund ~ area.log10 + dist.log10 + ldist.log10 + graze + alt + yr.isol,
                      data = d)

There are no missing data in this dataset, so the model was built as requested.

Check assumptions of the full model

Rather than simply using R’s built-in residuals plot, we will try some of the more advanced regression diagnostics available in the car package.

Check for linearity

crPlots(abund.fullmodel)

These plots show “partial residuals” (i.e., the residuals from the full model, with the predictor’s own linear effect added back in). Here is how to interpret:

  • For a continuous variable like area.log10, the blue dashed line shows the linear effect of this predictor as estimated by the model. The magenta line is a LOWESS smoother that helps detect non-linearity. If it deviates substantially from the blue line, this suggests the linearity assumption may be violated.
    • These plots suggest no strong deviations from linearity.
  • For a categorical variable like graze, the residuals (adjusted for other variables in the model) are presented as box plots for each level of the factor.
    • This plot suggests that grazing intensity levels of 1-4 have similar effects on bird abundance (after adjusting for the other predictors in the model).

If the assumption of linearity is violated, consider including polynomial terms or transforming the predictor. The boxTidwell() function in the car package can help with transformations of predictors.
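For example, here is a minimal sketch for this dataset (illustrative only; boxTidwell() requires strictly positive predictors, so we use the untransformed area and pass the remaining predictors through other.x):

# Estimate a power transformation for area while holding the other
# predictors fixed; an estimated power near 0 supports a log transformation
boxTidwell(abund ~ area,
           other.x = ~ dist.log10 + ldist.log10 + graze + alt + yr.isol,
           data = d)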

In this case, everything looks pretty good.

Check for constant variance (homoscedasticity)

The spreadLevelPlot() function plots the absolute values of the studentized residuals against the fitted values. Ideally, you want this plot to be a flat line, indicating that the error variance is constant. However, commonly you will find that the line is increasing, suggesting that higher fitted values have higher error variance. As long as this is not extreme, it is not a problem. But if the error variance is not constant, a power transformation of the response variable using the powerTransform() function in the car package can help.

spreadLevelPlot(abund.fullmodel)

## 
## Suggested power transformation:  0.6213003

The function suggests that we may want to transform our response variable with a power transformation of 0.62, which is somewhat close to a square-root transformation. However, I generally do not like to transform variables unless the model looks inadequate. If you want to formally test this assumption, you can use the ncvTest() function.
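For example, applied to the full model:

# Formal test of non-constant variance (the null hypothesis is that the
# error variance does not change with the fitted values)
ncvTest(abund.fullmodel)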

Check for influential points and outliers

A regression influence plot displays the Studentized residuals against leverage (hat-values) on the x-axis, with each point drawn as a bubble whose size is proportional to Cook’s D (a measure of that point’s influence on the results of the regression). By default, the function prints “noteworthy” data points, which it defines as points with large residuals, high leverage, or high Cook’s D. Data points with Cook’s D > 1 are considered highly influential (i.e., removing that data point from the analysis would strongly affect your parameter estimates).

influencePlot(abund.fullmodel, id = FALSE)

There are no highly influential points. If you want to formally test for the presence of outliers, you can use the outlierTest() function.
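For example:

# Bonferroni-adjusted outlier test based on the largest Studentized residual
outlierTest(abund.fullmodel)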

Check for normality of residuals

The qqPlot() function produces an improved version of R’s default Q-Q plot. It plots the Studentized residuals against the quantiles of the t-distribution they should follow, together with a confidence envelope, so you can check whether the residuals are consistent with that distribution.

qqPlot(abund.fullmodel, id = FALSE)

If non-normal, a power transformation of the response variable using the powerTransform() function in the car package can help.
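For example, a quick sketch (powerTransform() requires a strictly positive response; an estimated power near 1 indicates that no transformation is needed):

# Estimate a Box-Cox power transformation for the response variable
summary(powerTransform(abund.fullmodel))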

Check for independence of residuals

The Durbin-Watson test checks for autocorrelation in the residuals at a lag of 1 (i.e., between successive residuals). This test is useful when the observations and residuals have a natural order (such as in time-series or spatial data). Since that does not apply here, the test is not informative for these data; the code below is shown for reference only.

durbinWatsonTest(abund.fullmodel)

Check for multicollinearity

Finally, let’s also check the variance inflation factors (VIF). According to one rule of thumb, VIF>5 warrants further investigation, while VIF>10 suggests that our standard errors may be substantially inflated due to multicollinearity. This inflation can make it difficult to distinguish the individual effects of correlated predictors, leading to less precise (but still unbiased) coefficient estimates.

vif(abund.fullmodel)
##                 GVIF Df GVIF^(1/(2*Df))
## area.log10  2.201116  1        1.483616
## dist.log10  1.907874  1        1.381258
## ldist.log10 2.228456  1        1.492801
## graze       7.925036  4        1.295314
## alt         1.597400  1        1.263883
## yr.isol     3.252591  1        1.803494

Because our model contains a factor (i.e., graze), this function reports generalized VIFs (GVIF) rather than VIFs. The rightmost column of this table gives \(GVIF^{1/(2\times d.f.)}\), which is comparable to a standard VIF. To interpret it using common VIF thresholds, we compare these values to \(\sqrt{5}=2.24\) (which may warrant further investigation) and \(\sqrt{10}=3.16\) (which may be worrisome).
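To make this comparison explicit, you can square the adjusted GVIF values and read them against the usual thresholds of 5 and 10:

# Squaring GVIF^(1/(2*Df)) (the third column of the vif() output) puts the
# values back on the ordinary VIF scale
vif(abund.fullmodel)[, 3]^2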

There is no evidence of strong multicollinearity. As discussed in lecture, how best to identify and deal with multicollinearity is a matter of some debate.

Estimate relative predictor importance (optional)

Now let’s examine the relative predictor importances.1 We will use the LMG metric (described in lecture) and get its 95% CIs using bootstrapping. Generally, this is done on the full model.

abund.boot <- boot.relimp(abund.fullmodel)
plot(booteval.relimp(abund.boot))

Based on these results, area.log10 and graze appear to have the greatest effects. yr.isol and alt may also have smaller effects.

All-subsets model comparison (i.e., “dredging” for the best reduced model)

The dredge() function in the MuMIn package ranks all models that are subsets of the full model, while respecting the principle of marginality (i.e., if an interaction is included in a model, the relevant main effects are also included).2 By default, MuMIn uses the corrected Akaike information criterion (AICc) to rank models. The AICc includes a correction that makes it more appropriate than the AIC when sample size is small; at large sample sizes, it converges on the AIC. Therefore, AICc can always be used in place of AIC.

\(\Delta\)AICc is the difference in AICc between the top-ranked model and the model under consideration. According to Burnham & Anderson (2002), the following rules of thumb are appropriate for nested models:

  • 0 < \(\Delta\)AICc < 2: substantial support
  • 4 < \(\Delta\)AICc < 7: considerably less support
  • \(\Delta\)AICc > 10: essentially no support

Let’s print all models with \(\Delta\)AICc < 4.

dd <- dredge(abund.fullmodel)
(dd.subset <- subset(dd, delta < 4))
## Global model call: lm(formula = abund ~ area.log10 + dist.log10 + ldist.log10 + 
##     graze + alt + yr.isol, data = d)
## ---
## Model selection table 
##    (Int)      alt are.l10 dst.l10 grz lds.l10   yr.isl df   logLik  AICc delta weight
## 11 15.72            7.247           +                   7 -175.522 367.4  0.00  0.480
## 12 14.46 0.008227   7.203           +                   8 -175.435 369.9  2.56  0.134
## 27 14.41            7.038           +  0.5906           8 -175.457 370.0  2.60  0.131
## 43 48.47            7.192           +         -0.01664  8 -175.467 370.0  2.62  0.129
## 15 14.62            7.204  0.4602   +                   8 -175.499 370.1  2.68  0.126
## Models ranked by AICc(x)

Let’s also plot the model selection table.

plot(dd.subset)

These top-ranked models all contain an intercept, area.log10, and graze. The top-ranked model contains only these variables. Some variables (e.g., yr.isol) seem to have a modest effect based on the LMG metric but are not included in our top-ranked model according to the AICc. This is likely because they do not improve predictive performance enough to justify the added model complexity. If we had a larger sample size, yr.isol (etc.) might be included in the top model because its coefficient could be estimated more precisely.

Our top-ranked model has an Akaike weight of 0.480. This weight can be interpreted as the weight of evidence that this is the “best” model (i.e., closest to the truth) among the models with \(\Delta\)AICc < 4. Let’s fit this model.

abund.bestmodel <- lm(abund ~ area.log10 + graze, data = d)
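Alternatively, you could refit the top-ranked model directly from the dredge table rather than typing out the formula (a small sketch; this is equivalent to the lm() call above):

# get.models() refits models from the model selection table;
# subset = 1 selects the top-ranked model
get.models(dd, subset = 1)[[1]]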

Check assumptions of the reduced model

Let’s check if our reduced model meets the assumptions of a regression, the same way we did for the full model above. (Output not shown to save space.)

crPlots(abund.bestmodel)
spreadLevelPlot(abund.bestmodel)
influencePlot(abund.bestmodel, id = FALSE)
qqPlot(abund.bestmodel, id = FALSE)
vif(abund.bestmodel)

If you run this code, you will see that the assumptions of the model don’t appear to be severely violated.

Interpret the reduced model

The assumptions seem to be satisfied, so let’s summarize the model. The S() function is in the car package. It is a modified version of the summary() function.

S(abund.bestmodel)
## Call: lm(formula = abund ~ area.log10 + graze, data = d)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.7164     2.7674   5.679 6.87e-07 ***
## area.log10    7.2472     1.2551   5.774 4.90e-07 ***
## graze2        0.3826     2.9123   0.131 0.895993    
## graze3       -0.1893     2.5498  -0.074 0.941119    
## graze4       -1.5916     2.9762  -0.535 0.595182    
## graze5      -11.8938     2.9311  -4.058 0.000174 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard deviation: 5.883 on 50 degrees of freedom
## Multiple R-squared: 0.727
## F-statistic: 26.63 on 5 and 50 DF,  p-value: 5.148e-13 
##    AIC    BIC 
## 365.04 379.22

And get the confidence intervals:

Confint(abund.bestmodel)
##                Estimate      2.5 %    97.5 %
## (Intercept)  15.7163938  10.157887 21.274901
## area.log10    7.2472294   4.726241  9.768218
## graze2        0.3826488  -5.466821  6.232119
## graze3       -0.1892869  -5.310705  4.932131
## graze4       -1.5915795  -7.569483  4.386324
## graze5      -11.8938327 -17.781171 -6.006494

The top-ranked model according to the AICc explains 72.7% of the variance in bird abundance and contains log10-transformed fragment area and grazing intensity as predictors.

To interpret this model, remember that area.log10 is log10-transformed. The intercept (15.716) represents the predicted bird abundance when area.log10=0 (corresponding to 1 ha on the original scale) and graze=1 (its reference level). Each unit increase in log10-transformed area corresponds to a 10-fold increase in area on the untransformed scale; for every 10-fold increase in area, the bird abundance is predicted to increase by 7.247 (if graze is held constant). Bird abundances at grazing intensities of 2, 3, and 4 do not differ significantly from the reference level of graze=1. However, when grazing intensity is 5, the predicted bird abundance declines by 11.894 compared to graze=1 (if area.log10 is held constant).
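As a quick numerical check of this interpretation, we can compare the predicted abundances for a 1 ha patch and a 10 ha patch, both at the reference grazing level (a small sketch using predict() and a throwaway data frame, here called check):

# area.log10 = 0 corresponds to 1 ha; area.log10 = 1 corresponds to 10 ha;
# the difference between the two predictions should equal the area.log10
# coefficient (about 7.25)
check <- data.frame(area.log10 = c(0, 1),
                    graze = factor(1, levels = levels(d$graze)))
predict(abund.bestmodel, newdata = check)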

Our coefficient table suggests that grazing levels 1 to 4 do not differ, whereas grazing level 5 does. To test this formally, we could use pairwise contrasts with Tukey’s adjustment for multiple comparisons.

emm <- emmeans(abund.bestmodel, ~graze)
pairs(emm, adjust = "tukey")
##  contrast        estimate   SE df t.ratio p.value
##  graze1 - graze2   -0.383 2.91 50  -0.131  0.9999
##  graze1 - graze3    0.189 2.55 50   0.074  1.0000
##  graze1 - graze4    1.592 2.98 50   0.535  0.9833
##  graze1 - graze5   11.894 2.93 50   4.058  0.0016
##  graze2 - graze3    0.572 2.58 50   0.222  0.9994
##  graze2 - graze4    1.974 3.05 50   0.648  0.9662
##  graze2 - graze5   12.276 2.71 50   4.534  0.0003
##  graze3 - graze4    1.402 2.70 50   0.520  0.9849
##  graze3 - graze5   11.705 2.30 50   5.087  0.0001
##  graze4 - graze5   10.302 2.84 50   3.624  0.0058
## 
## P value adjustment: tukey method for comparing a family of 5 estimates

And get the “compact letter display”:

(tukey_cld <- cld(emm))
##  graze emmean   SE df lower.CL upper.CL .group
##  5       10.6 1.79 50     6.98     14.2  1    
##  4       20.9 2.22 50    16.41     25.3   2   
##  3       22.3 1.53 50    19.21     25.4   2   
##  1       22.5 1.95 50    18.56     26.4   2   
##  2       22.9 2.09 50    18.66     27.0   2   
## 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 5 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.

Together, our results suggest that area.log10 and graze are approximately equally important predictors of bird abundance, and that the effect of graze is driven almost entirely by grazing level 5.

Now, let’s plot the effects in our top-ranked model using the allEffects() function in the effects package.3

# Create custom x-axis labels for each panel
custom_xlabs <- list(
  area.log10 = list(lab = expression(log[10](Area))),
  graze = list(lab = "Grazing intensity")
)

# Plot effects with custom x-axis labels
plot(allEffects(abund.bestmodel),
     main = NULL,
     ylab = "Bird abundance",
     ylim = c(0, 40),
     axes = list(x = custom_xlabs))

Note that allEffects() produces marginal effects, meaning that it adjusts for the other predictor(s) in the model. In other words, unlike the coefficient table, the effect of area.log10 on bird abundance is not calculated assuming that graze is at its reference level (graze=1). Instead, it computes a weighted average of the predicted abundance across all levels of graze. And when calculating the effect for each level of graze, allEffects() does so by holding area.log10 at its mean, not at 0 (as in the coefficient table).

To add “a”, “b”, “ab” (etc.) indicators to the right-hand plot (corresponding to “1”, “2”, “12” (etc.) in the “compact letter display”), the easiest way is to edit the figure in Adobe Illustrator, Inkscape, etc.

Make predictions on new data (optional)

Imagine that we discovered a new forest patch in this region. Suppose it has an area of 0.5 ha and a grazing intensity of 1. What is its predicted bird abundance?

Validate the model

Before using our model to make predictions, we should validate it, either using new data or using cross-validation. We will conduct leave-one-out cross-validation (LOOCV) using the trainControl() and train() functions from the caret package.

train_control <- trainControl(method = "LOOCV")
cv_model <- train(abund ~ area.log10 + graze, data = d,
                  method = "lm",
                  trControl = train_control)
cv_model$results$Rsquared
## [1] 0.6635024

A cross-validated \(R^2\) of 0.664 indicates moderate-to-high predictive power on out-of-sample data. It is only slightly lower than the multiple \(R^2\) (0.727), suggesting that the model is not overfitted and that it generalizes to out-of-sample data reasonably well.

Now let’s get the root mean squared error (RMSE), which can be loosely interpreted as the typical size of a prediction error.4

cv_model$results$RMSE
## [1] 6.186616

This suggests that our predictions for new patches will tend to be off by around 6.2 birds on average. If this magnitude of error is acceptable, then we can continue to make our prediction.

Make the prediction

Now, we can create a data frame for the new data. We need one column for each variable in our model: area.log10 and graze. We must log10-transform the area as we did for our original data. We will calculate prediction intervals, which express our uncertainty in forecasting bird abundance for this specific new patch.5

newdata <- data.frame(area = 0.5, graze = 1)
newdata$graze <- factor(newdata$graze)
newdata$area.log10 <- log10(newdata$area)
predict(abund.bestmodel, newdata = newdata, interval = "prediction")
##        fit       lwr      upr
## 1 13.53476 0.1965995 26.87292

This patch has a predicted bird abundance of 13.5 (95% PI: 0.2-26.9).

Finally, return options to default settings.

options(op)

This week’s task

Create an R script that contains an appropriate analysis of the following dataset.

The air pollution dataset (pollution.csv) contains information about air pollution for each of 41 different U.S. cities averaged over 1969 to 1971. The data was originally published by Sokal & Rohlf (1981). The variables in the file are the following:

The response variable is SO2. Fit an appropriate model for this data. (For the purposes of this exercise, do not worry about interactions.) Note that you may have to transform the response variable and/or the predictors to get a good fit. How much air pollution does your model predict for a city of 1,000,000 people with average values for all other variables?


  1. Standardized partial regression coefficients are commonly used for this purpose, but they are sensitive to collinearity. If you wish to estimate standardized partial regression coefficients, you can use the std.coef() function from the MuMIn package. Another approach commonly used in ecology and environmental science is hierarchical partitioning, which can be implemented with the glmm.hp() function in the glmm.hp package.↩︎

  2. Dredging is common practice, but some experts (e.g., Burnham & Anderson, 2002) advise against it. They suggest creating a set of models based on your biological knowledge about the system, then ranking your models and choosing the best one. You can do this using the model.sel() function in the MuMIn package. As an alternative to selecting a single best model, you can use model averaging. For instance, this set of commands creates a 95% confidence set of models, prints the coefficient table and confidence intervals, and prints the variable importance values (as the sum of Akaike weights over all models in the set): avgmod.95p <- model.avg(dd, cumsum(weight) <= 0.95, fit = TRUE); summary(avgmod.95p); Confint(avgmod.95p); sw(avgmod.95p). Model averaging may be particularly useful when your goal is to make predictions on new data.↩︎

  3. If we had an interaction term (e.g., between area.log10 and alt), we could plot it as plot(effect("area.log10:alt", abund.fullmodel), multiline = TRUE) or use the persp() function from the rsm package to get a 3D plot showing the interaction between two continuous variables, e.g.: persp(abund.fullmodel, ~area.log10 + alt, zlab = "Predicted bird abundance")↩︎

  4. As an alternative to the RMSE (or in addition to it), you may present the mean absolute error (MAE), which is the average magnitude of a prediction error. MAE is easier to interpret than RMSE, but RMSE better matches the assumption of normality and weights larger prediction errors more heavily. You can access the MAE as cv_model$results$MAE.↩︎

  5. If, instead of predicting bird abundance in a single new patch, we were interested in estimating the mean bird abundance across all possible 0.5 ha patches with graze = 1, we would use interval = "confidence" rather than interval = "prediction".↩︎