Remember to set your working directory before you start. You also need to load today’s packages.
library(car)
library(caret)
library(effects)
library(emmeans)
library(GGally)
library(ggplot2)
library(ggthemes)
library(multcomp)
library(MuMIn)
library(relaimpo)
This week’s tutorial covers much of Chapters 8 and 9 of Quinn & Keough (2024) on multiple regression, predictor importance, and model selection.
The birdabund.csv dataset (described in Quinn & Keough, 2024) contains information on bird abundance in forest fragments in southeastern Australia. The columns in the dataset are:
abund: abundance of forest birds (response variable; number of forest birds per count)
area: fragment area (ha)
dist: distance to the nearest other fragment (km)
ldist: distance to the nearest larger fragment (km)
graze: grazing pressure (1 to 5, indicating light to heavy)
alt: altitude above sea level (m)
yr.isol: number of years since isolation
Read in the data and convert the graze variable to a factor. Then inspect the output of the summary() command (which is not shown because it is too wide for this page).
d <- read.csv("Data/birdabund.csv")
d$graze <- factor(d$graze)
summary(d)
Next, let’s check how the variables are related to each other with a scatterplot matrix.
ggpairs(d) +
theme_few()
area, dist, and ldist appear
to be highly right-skewed with a few very high-leverage points. To fix
this, let’s log10-transform them.
d$area.log10 <- log10(d$area)
d$dist.log10 <- log10(d$dist)
d$ldist.log10 <- log10(d$ldist)
Don’t just assume that your transformations made things better. Always check.
# Exclude the untransformed variables in columns 2, 4, and 5
ggpairs(d[,-c(2, 4, 5)]) +
theme_few()
Looks pretty good to me. All of our predictors are positively correlated with the response (abund). None of our predictor variables are particularly highly correlated with one another. (According to one rule of thumb, \(\left|r\right|\ge 0.7\) is an indicator of potentially problematic collinearity.)
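If you prefer to see the correlations as a numeric matrix rather than reading them off the plot, a quick check might look like this (a sketch using the transformed predictors):
round(cor(d[, c("abund", "area.log10", "dist.log10", "ldist.log10", "alt", "yr.isol")]), 2)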
Now we will begin our model-selection procedure.
We have six predictor variables. Our first step is to construct a “full model” that contains all the terms we believe might have an effect. You should start with a biologically sensible “full model” constructed based on your expert knowledge of the system, including any interactions you believe might be important. Typically, a full model will not include any interactions that are higher than three-way (i.e., it excludes four-way, five-way, etc. interactions) because they are unlikely to explain much variation and because they are difficult to interpret. Because we lack expert knowledge on this system, we will build our full model including main effects only.
We are going to do model comparison, and we need to make sure that we
do not fit our models to different subsets of the data, so we need to
set our options to na.fail. This way, if we have any
missing data in our dataset, the operation will fail and we will be
forced to deal with the missing data in some way (e.g. by removing any
records with missing data) before we can continue. Let’s fit a linear
model containing all predictors.
op <- options(na.action = "na.fail")
abund.fullmodel <- lm(abund ~ area.log10 + dist.log10 + ldist.log10 + graze + alt + yr.isol,
data = d)
There are no missing data in this dataset, so the model was built as requested.
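If there had been missing values, one simple (though not always best) way to proceed would have been to drop the incomplete rows before fitting, for example:
d <- na.omit(d)  # hypothetical: remove all rows containing any missing values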
Rather than simply using R’s built-in residuals plot, we will try
some of the more advanced regression diagnostics available in the
car package.
crPlots(abund.fullmodel)
These plots show “partial residuals” (i.e., the residuals from the full model, with the predictor’s own linear effect added back in). Here is how to interpret them:
For a continuous predictor such as area.log10, the blue dashed line shows the linear effect of this predictor as estimated by the model. The magenta line is a LOWESS smoother that helps detect non-linearity. If it deviates substantially from the blue line, this suggests the linearity assumption may be violated.
For the factor graze, the partial residuals (adjusted for the other variables in the model) are presented as box plots for each level of the factor.
If the assumption of linearity is violated, consider including
polynomial terms or transforming the predictor. The
boxTidwell() function in the car package can
help with transformations of predictors.
In this case, everything looks pretty good.
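For illustration, if the smoothers had suggested non-linearity, a boxTidwell() call might look like the sketch below. Note that boxTidwell() requires strictly positive predictors, so the untransformed variables are used here, and the remaining terms are passed through other.x so they are adjusted for but not transformed:
boxTidwell(abund ~ area + dist + ldist,
           other.x = ~ graze + alt + yr.isol,
           data = d)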
The spreadLevelPlot() function plots the absolute value of the Studentized residuals against the fitted values. Ideally, you want
this plot to be a flat line, indicating that the error variance is
constant. However, commonly you will find that the line is increasing,
suggesting that higher values have higher error variance. As long as
this is not extreme, it is not a problem. But if the error variance is
not constant, a power transformation of the response variable using the
powerTransform() function in the car package
can help.
spreadLevelPlot(abund.fullmodel)
##
## Suggested power transformation: 0.6213003
The function suggests that we may want to transform our response
variable with a power transformation of 0.62, which is somewhat close to
a square-root transformation. However, I generally do not like to
transform variables unless the model looks inadequate. If you want to
formally test this assumption, you can use the ncvTest()
function.
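For example:
ncvTest(abund.fullmodel)  # score test of non-constant error variance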
A regression influence plot shows the Studentized residuals plotted against leverage (on the x-axis), with each point drawn as a bubble whose size is proportional to Cook’s D (a measure of that point’s influence on the results of the regression). By default, the function identifies “noteworthy” data points, which it defines as those with large residuals, high leverage, or high Cook’s D. Data points with Cook’s D > 1 are considered highly influential (i.e., removing that data point from the analysis would strongly affect your parameter estimates).
influencePlot(abund.fullmodel, id = FALSE)
There are no highly influential points. If you want to formally test
for the presence of outliers, you can use the outlierTest()
function.
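For example:
outlierTest(abund.fullmodel)  # Bonferroni-adjusted test of the largest Studentized residual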
The qqPlot() function is an improved version of R’s default QQ plot. It compares the Studentized residuals to the appropriate t-distribution and adds a confidence envelope.
qqPlot(abund.fullmodel, id = FALSE)
If non-normal, a power transformation of the response variable using
the powerTransform() function in the car
package can help.
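A sketch of how you might estimate the transformation parameter from the fitted model:
summary(powerTransform(abund.fullmodel))  # estimates the Box-Cox lambda for the response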
The Durbin-Watson test checks for autocorrelation in the residuals at a lag of 1 (i.e., between successive residuals). This test is useful when the observations and residuals have a natural order (such as in time-series or spatial data). Since this does not apply here, the call below is shown for reference but was not run.
durbinWatsonTest(abund.fullmodel)
Finally, let’s also check the variance inflation factors (VIF). According to one rule of thumb, VIF>5 warrants further investigation, while VIF>10 suggests that our standard errors may be substantially inflated due to multicollinearity. This inflation can make it difficult to distinguish the individual effects of correlated predictors, leading to less precise (but still unbiased) coefficient estimates.
vif(abund.fullmodel)
## GVIF Df GVIF^(1/(2*Df))
## area.log10 2.201116 1 1.483616
## dist.log10 1.907874 1 1.381258
## ldist.log10 2.228456 1 1.492801
## graze 7.925036 4 1.295314
## alt 1.597400 1 1.263883
## yr.isol 3.252591 1 1.803494
Because our model contains a factor (i.e., graze), this function reports generalized VIFs (GVIF) rather than ordinary VIFs. The rightmost column of this table represents \(GVIF^{1/(2\times d.f.)}\), which is analogous to the square root of a standard VIF. To interpret it using the common VIF thresholds, we therefore compare these values to \(\sqrt{5}=2.24\) (which may warrant further investigation) and \(\sqrt{10}=3.16\) (which may be worrisome).
There is no evidence of strong multicollinearity. As discussed in lecture, how best to identify and deal with multicollinearity is a matter of some debate.
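If you want a quick programmatic check rather than comparing the values by eye, something like this sketch works (column 3 of the vif() output is the \(GVIF^{1/(2\times d.f.)}\) column):
gvif.adj <- vif(abund.fullmodel)[, "GVIF^(1/(2*Df))"]
any(gvif.adj > sqrt(5))  # FALSE here, so no predictor crosses even the milder threshold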
Now let’s examine the relative predictor importances.1 We will use the LMG metric (described in lecture) and get its 95% CIs using bootstrapping. Generally, this is done on the full model.
abund.boot <- boot.relimp(abund.fullmodel)
plot(booteval.relimp(abund.boot))
Based on these results, area.log10 and
graze appear to have the greatest effects.
yr.isol and alt may also have smaller
effects.
The dredge() function in the MuMIn package ranks all models that are subsets of the full model, while respecting the principle of marginality (i.e., if an interaction is included in a model, the relevant main effects are also included).2 By default, MuMIn uses the corrected Akaike information criterion (AICc) to rank models. The AICc includes a correction that makes it more appropriate than the AIC when the sample size is small; at large sample sizes, it converges on the AIC. Therefore, AICc can always be used rather than AIC.
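For reference, the correction is \(AIC_c = AIC + \frac{2k(k+1)}{n-k-1}\), where \(k\) is the number of estimated parameters and \(n\) is the sample size; as \(n\) grows relative to \(k\), the correction term shrinks toward zero.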
\(\Delta\)AICc is the difference in AICc between the top-ranked model and the model under consideration. According to Burnham & Anderson (2002), the following rules of thumb are appropriate for nested models: models with \(\Delta\)AICc \(\le 2\) have substantial support, models with \(\Delta\)AICc between 4 and 7 have considerably less support, and models with \(\Delta\)AICc \(> 10\) have essentially no support.
Let’s print all models with \(\Delta\)AICc < 4.
dd <- dredge(abund.fullmodel)
(dd.subset <- subset(dd, delta < 4))
## Global model call: lm(formula = abund ~ area.log10 + dist.log10 + ldist.log10 +
## graze + alt + yr.isol, data = d)
## ---
## Model selection table
## (Int) alt are.l10 dst.l10 grz lds.l10 yr.isl df logLik AICc delta weight
## 11 15.72 7.247 + 7 -175.522 367.4 0.00 0.480
## 12 14.46 0.008227 7.203 + 8 -175.435 369.9 2.56 0.134
## 27 14.41 7.038 + 0.5906 8 -175.457 370.0 2.60 0.131
## 43 48.47 7.192 + -0.01664 8 -175.467 370.0 2.62 0.129
## 15 14.62 7.204 0.4602 + 8 -175.499 370.1 2.68 0.126
## Models ranked by AICc(x)
Let’s also plot the model selection table.
plot(dd.subset)
These top-ranked models all contain an intercept,
area.log10, and graze. The top-ranked model
contains only these variables. Some variables (e.g.,
yr.isol) seem to have a modest effect based on the LMG
metric but are not included in our top-ranked model according to the
AICc. This is likely because they do not improve
predictive performance enough to justify the added complexity of the
model. If we had a larger sample size, yr.isol (etc.) might
be included in the top model because its coefficient could be estimated
more precisely.
Our top-ranked model has an Akaike weight of 0.480. This weight can be interpreted as the weight of evidence that this is the “best” model (i.e., closest to the truth) among the models with \(\Delta\)AICc < 4. Let’s fit this model.
abund.bestmodel <- lm(abund ~ area.log10 + graze, data = d)
Let’s check if our reduced model meets the assumptions of a regression, the same way we did for the full model above. (Output not shown to save space.)
crPlots(abund.bestmodel)
spreadLevelPlot(abund.bestmodel)
influencePlot(abund.bestmodel, id = FALSE)
qqPlot(abund.bestmodel, id = FALSE)
vif(abund.bestmodel)
If you run this code, you will see that the assumptions of the model don’t appear to be severely violated.
The assumptions seem to be satisfied, so let’s summarize the model.
The S() function is in the car package. It is
a modified version of the summary() function.
S(abund.bestmodel)
## Call: lm(formula = abund ~ area.log10 + graze, data = d)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.7164 2.7674 5.679 6.87e-07 ***
## area.log10 7.2472 1.2551 5.774 4.90e-07 ***
## graze2 0.3826 2.9123 0.131 0.895993
## graze3 -0.1893 2.5498 -0.074 0.941119
## graze4 -1.5916 2.9762 -0.535 0.595182
## graze5 -11.8938 2.9311 -4.058 0.000174 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 5.883 on 50 degrees of freedom
## Multiple R-squared: 0.727
## F-statistic: 26.63 on 5 and 50 DF, p-value: 5.148e-13
## AIC BIC
## 365.04 379.22
And get the confidence intervals:
Confint(abund.bestmodel)
## Estimate 2.5 % 97.5 %
## (Intercept) 15.7163938 10.157887 21.274901
## area.log10 7.2472294 4.726241 9.768218
## graze2 0.3826488 -5.466821 6.232119
## graze3 -0.1892869 -5.310705 4.932131
## graze4 -1.5915795 -7.569483 4.386324
## graze5 -11.8938327 -17.781171 -6.006494
The top-ranked model according to the AICc explains 72.7% of the variance in bird abundance and contains log10-transformed fragment area and grazing intensity as predictors.
To interpret this model, remember that area.log10 is
log10-transformed. The intercept (15.716) represents the
predicted bird abundance when area.log10=0 (corresponding
to 1 ha on the original scale) and graze=1 (its reference
level). Each unit increase in log10-transformed area
corresponds to a 10-fold increase in area on the untransformed scale;
for every 10-fold increase in area, the bird abundance is predicted to
increase by 7.247 (if graze is held constant). Bird
abundances at grazing intensities of 2, 3, and 4 do not differ
significantly from the reference level of graze=1. However,
when grazing intensity is 5, the predicted bird abundance declines by
11.894 compared to graze=1 (if area.log10 is
held constant).
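As a quick arithmetic check of this interpretation, we can reproduce a few predictions by hand from the coefficient table (with graze held at its reference level of 1):
15.716 + 7.247 * log10(1)    # 1-ha fragment: 15.7 birds (the intercept)
15.716 + 7.247 * log10(10)   # 10-ha fragment: 23.0 birds
15.716 + 7.247 * log10(100)  # 100-ha fragment: 30.2 birds (7.25 more per 10-fold increase in area)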
Our coefficient table suggests that grazing levels 1 to 4 do not differ, whereas grazing level 5 does. To test this formally, we could use pairwise contrasts with Tukey’s adjustment for multiple comparisons.
emm <- emmeans(abund.bestmodel, ~graze)
pairs(emm, adjust = "tukey")
## contrast estimate SE df t.ratio p.value
## graze1 - graze2 -0.383 2.91 50 -0.131 0.9999
## graze1 - graze3 0.189 2.55 50 0.074 1.0000
## graze1 - graze4 1.592 2.98 50 0.535 0.9833
## graze1 - graze5 11.894 2.93 50 4.058 0.0016
## graze2 - graze3 0.572 2.58 50 0.222 0.9994
## graze2 - graze4 1.974 3.05 50 0.648 0.9662
## graze2 - graze5 12.276 2.71 50 4.534 0.0003
## graze3 - graze4 1.402 2.70 50 0.520 0.9849
## graze3 - graze5 11.705 2.30 50 5.087 0.0001
## graze4 - graze5 10.302 2.84 50 3.624 0.0058
##
## P value adjustment: tukey method for comparing a family of 5 estimates
And get the “compact letter display”:
(tukey_cld <- cld(emm))
## graze emmean SE df lower.CL upper.CL .group
## 5 10.6 1.79 50 6.98 14.2 1
## 4 20.9 2.22 50 16.41 25.3 2
## 3 22.3 1.53 50 19.21 25.4 2
## 1 22.5 1.95 50 18.56 26.4 2
## 2 22.9 2.09 50 18.66 27.0 2
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 5 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
Together, our results suggest that area.log10 and
graze are approximately equally important predictors of
bird abundance, and that the effect of graze is driven
almost entirely by grazing level 5.
Now, let’s plot the effects in our top-ranked model using the
allEffects() function in the effects
package.3
# Create custom x-axis labels for each panel
custom_xlabs <- list(
area.log10 = list(lab = expression(log[10](Area))),
graze = list(lab = "Grazing intensity")
)
# Plot effects with custom x-axis labels
plot(allEffects(abund.bestmodel),
main = NULL,
ylab = "Bird abundance",
ylim = c(0, 40),
axes = list(x = custom_xlabs))
Note that allEffects() produces marginal
effects, meaning that it adjusts for the other predictor(s) in the
model. In other words, unlike the coefficient table, the effect of
area.log10 on bird abundance is not calculated
assuming that graze is at its reference level
(graze=1). Instead, it computes a weighted average of the
predicted abundance across all levels of graze. And when
calculating the effect for each level of graze,
allEffects() does so by holding area.log10 at
its mean, not at 0 (as in the coefficient table).
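As an illustrative sketch, the values in the graze panel can be reproduced by predicting at each grazing level while holding area.log10 at its mean; these should match (up to rounding) the emmeans shown earlier:
predict(abund.bestmodel,
        newdata = data.frame(area.log10 = mean(d$area.log10),
                             graze = factor(1:5)))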
To add “a”, “b”, “ab” (etc.) indicators to the right-hand plot (corresponding to “1”, “2”, “12” (etc.) in the “compact letter display”), the easiest way is to edit the figure in Adobe Illustrator, Inkscape, etc.
Imagine that we discovered a new forest patch in this region. Suppose it has an area of 0.5 ha and a grazing intensity of 1. What is its predicted bird abundance?
Before using our model to make predictions, we should validate it,
either using new data or using cross-validation. We will conduct
leave-one-out cross-validation (LOOCV) using the
trainControl() and train() functions from the
caret package.
train_control <- trainControl(method = "LOOCV")
cv_model <- train(abund ~ area.log10 + graze, data = d,
method = "lm",
trControl = train_control)
cv_model$results$Rsquared
## [1] 0.6635024
A cross-validated \(R^2\) of 0.664 indicates moderate-to-high predictive power on out-of-sample data. It is only slightly lower than the multiple \(R^2\) (0.727), suggesting that the model is not overfitted and that it generalizes to out-of-sample data reasonably well.
Now let’s get the root mean squared error (RMSE), which can be loosely interpreted as the typical size of a prediction error.4
cv_model$results$RMSE
## [1] 6.186616
This suggests that our predictions for new patches will tend to be off by around 6.2 birds on average. If this magnitude of error is acceptable, then we can continue to make our prediction.
Now, we can create a data frame for the new data. We need one column
for each variable in our model: area.log10 and
graze. We must log10 transform the area as we
did for our original data. We will calculate prediction intervals, which
express our uncertainty in forecasting bird abundance for this specific
new patch.5
newdata <- data.frame(area = 0.5, graze = 1)
newdata$graze <- factor(newdata$graze)
newdata$area.log10 <- log10(newdata$area)
predict(abund.bestmodel, newdata = newdata, interval = "prediction")
## fit lwr upr
## 1 13.53476 0.1965995 26.87292
This patch has a predicted bird abundance of 13.5 (95% PI: 0.2-26.9).
Finally, return options to default settings.
options(op)
Create an R script that contains an appropriate analysis of the following dataset.
The air pollution dataset (pollution.csv) contains information about air pollution for each of 41 different U.S. cities averaged over 1969 to 1971. The data was originally published by Sokal & Rohlf (1981). The variables in the file are the following:
City: City name
SO2: Sulfur dioxide content of air in micrograms per cubic meter
Temp: Average annual temperature in degrees Fahrenheit
Man: Number of manufacturing enterprises employing 20 or more workers
Pop: Population size in thousands from the 1970 census
Wind: Average annual wind speed in miles per hour
Rain: Average annual precipitation in inches
RainDays: Average number of days with precipitation per year
The response variable is SO2. Fit an appropriate model
for this data. (For the purposes of this exercise, do not worry about
interactions.) Note that you may have to transform the response variable
and/or the predictors to get a good fit. How much air pollution does
your model predict for a city of 1,000,000 people with average values
for all other variables?
Standardized partial regression coefficients are
commonly used for this purpose, but they are sensitive to collinearity.
If you wish to estimate standardized partial regression coefficients,
you can use the std.coef() from the MuMIn
package. Another approach commonly used in Ecology and Environmental
Sciences is hierarchical partitioning. This can be implemented with the
glmm.hp() function in the glmm.hp package.↩︎
Dredging is common practice, but some experts (e.g.,
Burnham & Anderson, 2002) advise against it. They suggest creating a
set of models based on your biological knowledge about the system, then
ranking your models and choosing the best one. You can do this using the
model.sel() function in the MuMIn package. As
an alternative to selecting a single best model, you can use model
averaging. For instance, this set of commands creates a 95% confidence
set of models, prints the coefficient table and confidence intervals,
and prints the variable importance values (as the sum of Akaike weights
over all models in the set):
avgmod.95p <- model.avg(dd, cumsum(weight) <= 0.95, fit = TRUE); summary(avgmod.95p); Confint(avgmod.95p); sw(avgmod.95p).
Model averaging may be particularly useful when your goal is to make
predictions on new data.↩︎
If we had an interaction term (e.g., between
area.log10 and alt), we could plot it as
plot(effect("area.log10:alt", abund.fullmodel), multiline = TRUE)
or use the persp() function from the rsm
package to get a 3D plot showing the interaction between two continuous
variables, e.g.:
persp(abund.fullmodel, ~area.log10 + alt, zlab = "Predicted bird abundance")↩︎
As an alternative to the RMSE (or in addition to it),
you may present the mean absolute error (MAE), which is the average
magnitude of a prediction error. MAE is easier to interpret than RMSE,
but RMSE better matches the assumption of normality and weights larger
prediction errors more heavily. You can access the MAE as
cv_model$results$MAE.↩︎
If, instead of predicting bird abundance in a single new
patch, we were interested in estimating the mean bird abundance across
all possible 0.5 ha patches with graze = 1, we would use
interval = "confidence" rather than
interval = "prediction".↩︎