In this week’s lab exercise we will first go through calculating the AIC values “by hand”, and then using a addon package MuMIn. Doing the calculations by hand serves three purposes: 1) you practice using R!, 2) you gain a deeper understanding of how much work the add-on package is doing for you, and 3) sometimes the add-on package won’t work for a new class of model, and if you can do it by hand you won’t be stuck.
We’ll begin by using a simulated dataset from the paper “Confronting collinearity: comparing methods for disentangling the effects of habitat loss and fragmentation” by Adam Smith and colleagues. They had real data, but I have written a few lines of code that produce datasets that are similar. They have 4 variables that describe patches in a landscape (Amount, Edge, MnPatch, and Hetero), and simulated a normally distributed response variable. Amount, Edge and Hetero affect the response, Y, and MnPatch does not. All four of the variables are correlated to some extent. The script and data will be available in package NRES803. The code to make the data is in the examples section of ?smithsim, or simply load the dataframe with data(“smithsim”). Include all the outputs and figures in your Rmd file. Answer the numbered questions.
Fit a global model that includes all four variables.
#library(NRES803)
smithsim<-read.csv("smithsim.csv")
global <- lm(Y~Amount+Edge+MnPatch+Hetero, data=smithsim)
summary(global)
##
## Call:
## lm(formula = Y ~ Amount + Edge + MnPatch + Hetero, data = smithsim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4933 -2.6799 -0.0211 2.3678 11.3932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1129 0.1899 -0.594 0.5527
## Amount 1.5933 0.5211 3.058 0.0024 **
## Edge -2.0668 0.4406 -4.691 3.92e-06 ***
## MnPatch 0.2714 0.3033 0.895 0.3715
## Hetero 2.1978 0.2290 9.599 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.541 on 345 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.313
## F-statistic: 40.74 on 4 and 345 DF, p-value: < 2.2e-16
old.par = par(mfrow=c(2,2))
plot(global)
Therefore the global model is Y = -0.1129 + 1.5933 X Amount - 2.0668 X Edge + 0.2714 X MnPatch + 2.1978X Hetero + ε If I put these values: (intercept=0, Amount = 2, Edge = -2, MnPatch = 0, Hetero = 2), and the known residual standard error of 3.5
Y=15.2158
truth <- lm(Y~Amount+Edge+Hetero, data=smithsim)
summary(truth)
##
## Call:
## lm(formula = Y ~ Amount + Edge + Hetero, data = smithsim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1007 -2.6453 0.0434 2.3937 11.7930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1014 0.1894 -0.535 0.592867
## Amount 1.7978 0.4681 3.840 0.000146 ***
## Edge -2.0584 0.4404 -4.674 4.23e-06 ***
## Hetero 2.1856 0.2285 9.565 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.54 on 346 degrees of freedom
## Multiple R-squared: 0.3193, Adjusted R-squared: 0.3134
## F-statistic: 54.09 on 3 and 346 DF, p-value: < 2.2e-16
old.par = par(mfrow=c(2,2))
plot(truth)
For truth model: Y = -0.1014 + 1.7978Amount -2.0584Edge + 2.18568Hetero + ε
(intercept=0, Amount = 2, Edge = -2, MnPatch = 0, Hetero = 2), and the known residual standard error of 3.5
Y=15.58376
If we see the residuals for both models, they seems similar, and can not find significant difference.
One strategy for deciding if we can remove a variable from a model is “backward selection”. We can do this with an F-test.
anova(truth,global)
## Analysis of Variance Table
##
## Model 1: Y ~ Amount + Edge + Hetero
## Model 2: Y ~ Amount + Edge + MnPatch + Hetero
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 346 4336.2
## 2 345 4326.2 1 10.04 0.8007 0.3715
So we can justify using the smaller model because there is no significant difference between the simpler and more complex models. Now we can check to see if we can further simplify the model truth. I use the function update() to create a model that is identical except that it removes the variable Amount.
#smaller.model <- update(truth,.~.-Amount)
smaller.model <- update(truth,.~.-Edge)
#smaller.model <- update(truth,.~.-Hetero)
anova(smaller.model, truth)
## Analysis of Variance Table
##
## Model 1: Y ~ Amount + Hetero
## Model 2: Y ~ Amount + Edge + Hetero
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 347 4610.1
## 2 346 4336.2 1 273.82 21.849 4.234e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Can we eliminate the Amount variable from our model truth? No, because doing so leads to a significant change in the residual sum of squares.
smaller.model <- update(truth,.~.-Edge) anova(smaller.model, truth) Analysis of Variance Table
Model 1: Y ~ Amount + Hetero Model 2: Y ~ Amount + Edge + Hetero Res.Df RSS Df Sum of Sq F Pr(>F)
1 347 4610.1
2 346 4336.2 1 273.82 21.849 4.234e-06 *** — Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
smaller.model <- update(truth,.~.-Hetero) anova(smaller.model, truth) Analysis of Variance Table
Model 1: Y ~ Amount + Edge Model 2: Y ~ Amount + Edge + Hetero Res.Df RSS Df Sum of Sq F Pr(>F)
1 347 5482.9
2 346 4336.2 1 1146.6 91.492 < 2.2e-16 *** — Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
For the both, the p value is significant but the sum of squared error increased significantly. Therefore we can’t eliminate other variables from the model
Now we’re going to try AIC model selection. The first step, no matter how you do the calculations, is to write out a list object containing all of the models we want to fit. In this case, we’ll simply do all of the possible main effect models – we have plenty of data:
models <- list(Y~1,
Y~Amount,
Y~Edge,
Y~MnPatch,
Y~Hetero,
Y~Amount + Edge,
Y~Amount + MnPatch,
Y~Amount + Hetero,
Y~Edge + MnPatch,
Y~Edge + Hetero,
Y~MnPatch + Hetero,
Y~Amount+Edge+Hetero,
Y~Amount+Edge+MnPatch,
Y~Amount+MnPatch+Hetero,
Y~Edge+MnPatch+Hetero,
Y~Amount+Edge+MnPatch+Hetero)
This amounts to 16 possible models. This may seem time consuming, but in fact you have to write out all the models at some point in order to check them, and by doing it in a list we gain a lot of flexibility and speed later on. For example, we don’t need to fit each model individually. Putting all our model formulas in a list makes it dead easy to fit all the models with one line of code using map() 1. This function takes a list as it’s first argument, and passes each item in the list to the function defined in the 2nd argument. It also passes any additional arguments given to map() to the function. In this case I give the argument data=smithsim to make sure the data are available to lm().
library(purrr)
fits <- map(models, lm, data=smithsim)
Next we want to get the AIC values for each of these models. We use map_dbl(), which is similar to map(), but returns a numeric vector:
aic <- map_dbl(fits,AIC)
aic
## [1] 2012.749 1981.386 2003.985 1994.779 1901.672 1964.264 1983.353 1903.578
## [9] 1996.685 1896.756 1903.558 1884.147 1966.180 1904.977 1892.695 1885.336
The best model has the smallest AIC.
which.min(aic)
## [1] 12
models[[which.min(aic)]] # which model is that?
## Y ~ Amount + Edge + Hetero
Which happens to be the true model. However, how many other models are close by? Here is where we manually calculate an AIC table. Start by calculating the ΔAIC values, and convert them to the model weights.
deltas <- aic - min(aic) # smallest will be zero
weights <- exp(-deltas/2) # this is an intermediate step
weights <- weights/sum(weights) # so we can do this
In this step we’re really taking advantage of R’s “vectorizing” of calculations to automatically repeat the same calculation across an entire vector.
We also want the number of estimated parameters in each model.
npars <- map_dbl(fits, "rank") + 1
If the second argument to map_dbl() is a character vector, map_dbl() tries to extract an object of that name from each element of the list. Finally, use tibble() to show all the results together.2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tibble(AIC=aic,k=npars,deltas=deltas,weights=weights)
## # A tibble: 16 × 4
## AIC k deltas weights
## <dbl> <dbl> <dbl> <dbl>
## 1 2013. 2 129. 7.57e-29
## 2 1981. 3 97.2 4.89e-22
## 3 2004. 3 120. 6.05e-27
## 4 1995. 3 111. 6.04e-25
## 5 1902. 3 17.5 9.98e- 5
## 6 1964. 4 80.1 2.56e-18
## 7 1983. 4 99.2 1.83e-22
## 8 1904. 4 19.4 3.85e- 5
## 9 1997. 4 113. 2.33e-25
## 10 1897. 4 12.6 1.17e- 3
## 11 1904. 4 19.4 3.89e- 5
## 12 1884. 5 0 6.38e- 1
## 13 1966. 5 82.0 9.81e-19
## 14 1905. 5 20.8 1.91e- 5
## 15 1893. 5 8.55 8.88e- 3
## 16 1885. 6 1.19 3.52e- 1
The first thing is to sort the models so they go from lowest AIC to highest AIC. This will mess up the row numbers, so we need to put them in the table directly before sorting:
aic.table <- tibble(model=1:16,AIC=aic,k=npars,deltas=deltas,weights=weights)
at2 <- aic.table %>%
arrange(AIC) %>%
mutate(cumw = cumsum(weights))
at2
## # A tibble: 16 × 6
## model AIC k deltas weights cumw
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12 1884. 5 0 6.38e- 1 0.638
## 2 16 1885. 6 1.19 3.52e- 1 0.990
## 3 15 1893. 5 8.55 8.88e- 3 0.999
## 4 10 1897. 4 12.6 1.17e- 3 1.00
## 5 5 1902. 3 17.5 9.98e- 5 1.00
## 6 11 1904. 4 19.4 3.89e- 5 1.00
## 7 8 1904. 4 19.4 3.85e- 5 1.00
## 8 14 1905. 5 20.8 1.91e- 5 1
## 9 6 1964. 4 80.1 2.56e-18 1
## 10 13 1966. 5 82.0 9.81e-19 1
## 11 2 1981. 3 97.2 4.89e-22 1
## 12 7 1983. 4 99.2 1.83e-22 1
## 13 4 1995. 3 111. 6.04e-25 1
## 14 9 1997. 4 113. 2.33e-25 1
## 15 3 2004. 3 120. 6.05e-27 1
## 16 1 2013. 2 129. 7.57e-29 1
Model 12 with an AIC of 1884.147 and a weight of 0.637 which is substantial. The next best model is model 16 with an AIC of 1885.336 and a weight of 0.351.
The AIC difference between the models (12,16) or (12,15) is less than 10, therefore, I believe that there are no significant difference among the models. After seeing the scientific notation changed, models with weights greater than 0.05 are models 12, 16, 15, 10, and 5.
However, since the model 16 has “Amount + Edge + MnPatch + Hetero” and model 12 has “Amount + Edge + Hetero”, we can easily eliminate the variable MnPatch. Therefore the model 12 is the best choice.
And that’s all fine, but what if you want that table to be “pretty”, and actually formatted as a word table so you could put it in a manuscript? A final step for most analyses is to put the table into a word processing document. However, word processors don’t handle numbers very well, so a better approach is to copy the table into a spreadsheet like Excel, format all the numbers, and then copy from Excel into Word. On a windoze box, you can copy an R object to the windows clipboard like this:
write.csv(aic.table,"clipboard",row.names=FALSE)
should work. In both cases, change to Excel and paste. You could also sort it before doing the write.table() but that’s not essential. Once in Excel you will have to use the Text to Columns dialog to get the results into a table. Then it is easy to change the scientific notation for the weights column to something readable, adjust the number of decimal places, and sort the table how you like. Or we can do it directly in R Markdown. First load the pander package.
library(pander)
pander(as.data.frame(at2),caption="Table 1 AIC table of smithsim results.")
| model | AIC | k | deltas | weights | cumw |
|---|---|---|---|---|---|
| 12 | 1884 | 5 | 0 | 0.6378 | 0.6378 |
| 16 | 1885 | 6 | 1.189 | 0.352 | 0.9898 |
| 15 | 1893 | 5 | 8.548 | 0.008879 | 0.9986 |
| 10 | 1897 | 4 | 12.61 | 0.001166 | 0.9998 |
| 5 | 1902 | 3 | 17.53 | 9.979e-05 | 0.9999 |
| 11 | 1904 | 4 | 19.41 | 3.888e-05 | 0.9999 |
| 8 | 1904 | 4 | 19.43 | 3.848e-05 | 1 |
| 14 | 1905 | 5 | 20.83 | 1.911e-05 | 1 |
| 6 | 1964 | 4 | 80.12 | 2.556e-18 | 1 |
| 13 | 1966 | 5 | 82.03 | 9.806e-19 | 1 |
| 2 | 1981 | 3 | 97.24 | 4.891e-22 | 1 |
| 7 | 1983 | 4 | 99.21 | 1.829e-22 | 1 |
| 4 | 1995 | 3 | 110.6 | 6.041e-25 | 1 |
| 9 | 1997 | 4 | 112.5 | 2.329e-25 | 1 |
| 3 | 2004 | 3 | 119.8 | 6.054e-27 | 1 |
| 1 | 2013 | 2 | 128.6 | 7.568e-29 | 1 |
When you click “Knit Word” in RStudio you get a word document that is completely editable, and the table is nicely produced and ready to go. Incidentally, pander does pretty cool things with other R objects too.
pander(truth)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.1014 | 0.1894 | -0.5352 | 0.5929 |
| Amount | 1.798 | 0.4681 | 3.84 | 0.0001461 |
| Edge | -2.058 | 0.4404 | -4.674 | 4.234e-06 |
| Hetero | 2.186 | 0.2285 | 9.565 | 2.186e-19 |
pander(global)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.1129 | 0.1899 | -0.5943 | 0.5527 |
| Amount | 1.593 | 0.5211 | 3.058 | 0.002403 |
| Edge | -2.067 | 0.4406 | -4.691 | 3.923e-06 |
| MnPatch | 0.2714 | 0.3033 | 0.8948 | 0.3715 |
| Hetero | 2.198 | 0.229 | 9.599 | 1.714e-19 |
Is there nothing R can’t do? Well, you still have to write the interpretation of the model!
And here is the easy way to do all of the above. First, make sure you have the MuMIn package installed on your computer.
library("MuMIn")
model.sel(fits)
## Model selection table
## (Intrc) Amont Edge MnPtc Heter df logLik AICc delta weight
## 12 -0.1014 1.7980 -2.05800 2.186 5 -937.073 1884.3 0.00 0.646
## 16 -0.1129 1.5930 -2.06700 0.27140 2.198 6 -936.668 1885.6 1.26 0.344
## 15 -0.1302 -1.04300 0.67820 2.425 5 -941.348 1892.9 8.55 0.009
## 10 -0.1015 -0.61020 2.477 4 -944.378 1896.9 12.55 0.001
## 5 -0.1176 2.174 3 -947.836 1901.7 17.42 0.000
## 11 -0.1221 0.07515 2.145 4 -947.779 1903.7 19.35 0.000
## 8 -0.1160 -0.0760 2.217 4 -947.789 1903.7 19.37 0.000
## 14 -0.1263 -0.2643 0.24080 2.228 5 -947.489 1905.2 20.83 0.000
## 6 -0.1808 3.2860 -2.18200 4 -978.132 1964.4 80.06 0.000
## 13 -0.1851 3.2150 -2.18500 0.09820 5 -978.090 1966.4 82.03 0.000
## 2 -0.1975 1.3210 3 -987.693 1981.5 97.13 0.000
## 7 -0.2003 1.2730 0.06331 4 -987.677 1983.5 99.15 0.000
## 4 -0.2432 1.04200 3 -994.390 1994.8 110.53 0.000
## 9 -0.2410 0.09607 0.97500 4 -994.343 1996.8 112.48 0.000
## 3 -0.2028 0.75990 3 -998.993 2004.1 119.73 0.000
## 1 -0.1964 2 -1004.375 2012.8 128.46 0.000
## Models ranked by AICc(x)
where fits is the list of fitted model objects we created before. One line! Two, if you count loading the library. This package also adds the estimated coefficients to the table, leaving blank spaces where the coefficient is not included in the model. If you look closely you’ll see that the AIC values in this table are slightly different from the ones we calculated by hand, because model.sel() is using the 2nd order sample size correction by default. If you do:
model.sel(fits, rank = AIC)
## Model selection table
## (Intrc) Amont Edge MnPtc Heter df logLik AIC delta weight
## 12 -0.1014 1.7980 -2.05800 2.186 5 -937.073 1884.1 0.00 0.638
## 16 -0.1129 1.5930 -2.06700 0.27140 2.198 6 -936.668 1885.3 1.19 0.352
## 15 -0.1302 -1.04300 0.67820 2.425 5 -941.348 1892.7 8.55 0.009
## 10 -0.1015 -0.61020 2.477 4 -944.378 1896.8 12.61 0.001
## 5 -0.1176 2.174 3 -947.836 1901.7 17.53 0.000
## 11 -0.1221 0.07515 2.145 4 -947.779 1903.6 19.41 0.000
## 8 -0.1160 -0.0760 2.217 4 -947.789 1903.6 19.43 0.000
## 14 -0.1263 -0.2643 0.24080 2.228 5 -947.489 1905.0 20.83 0.000
## 6 -0.1808 3.2860 -2.18200 4 -978.132 1964.3 80.12 0.000
## 13 -0.1851 3.2150 -2.18500 0.09820 5 -978.090 1966.2 82.03 0.000
## 2 -0.1975 1.3210 3 -987.693 1981.4 97.24 0.000
## 7 -0.2003 1.2730 0.06331 4 -987.677 1983.4 99.21 0.000
## 4 -0.2432 1.04200 3 -994.390 1994.8 110.63 0.000
## 9 -0.2410 0.09607 0.97500 4 -994.343 1996.7 112.54 0.000
## 3 -0.2028 0.75990 3 -998.993 2004.0 119.84 0.000
## 1 -0.1964 2 -1004.375 2012.7 128.60 0.000
## Models ranked by AIC(x)
Based on AICc and model weights, the best model for predicting incidence is Amount + Edge + Hetero.
You can get exactly the same results we obtained manually. It is also possible to calculate the 2nd order corrected AIC manually, but involves a couple of extra lines, so we’ll skip it. In general there is no reason to not use the 2nd order correction because it naturally converges to the uncorrected AIC as sample size increases. The table is even easier to understand if we put the model formulas into the table.
library(purrr)
modnames <- map_chr(models,~deparse(.x[[3]]))
names(fits) <- modnames
pander(model.sel(fits)[,-6], split.tables=Inf)
## Warning in `[.model.selection`(model.sel(fits), , -6): cannot recalculate
## "weights" on an incomplete object
| (Intercept) | Amount | Edge | MnPatch | Hetero | logLik | AICc | delta | weight | |
|---|---|---|---|---|---|---|---|---|---|
| Amount + Edge + Hetero | -0.1014 | 1.798 | -2.058 | NA | 2.186 | -937.1 | 1884 | 0 | 0.6456 |
| Amount + Edge + MnPatch + Hetero | -0.1129 | 1.593 | -2.067 | 0.2714 | 2.198 | -936.7 | 1886 | 1.259 | 0.344 |
| Edge + MnPatch + Hetero | -0.1302 | NA | -1.043 | 0.6782 | 2.425 | -941.3 | 1893 | 8.548 | 0.008989 |
| Edge + Hetero | -0.1015 | NA | -0.6102 | NA | 2.477 | -944.4 | 1897 | 12.55 | 0.001215 |
| Hetero | -0.1176 | NA | NA | NA | 2.174 | -947.8 | 1902 | 17.42 | 0.0001065 |
| MnPatch + Hetero | -0.1221 | NA | NA | 0.07515 | 2.145 | -947.8 | 1904 | 19.35 | 4.052e-05 |
| Amount + Hetero | -0.116 | -0.076 | NA | NA | 2.217 | -947.8 | 1904 | 19.37 | 4.01e-05 |
| Amount + MnPatch + Hetero | -0.1263 | -0.2643 | NA | 0.2408 | 2.228 | -947.5 | 1905 | 20.83 | 1.935e-05 |
| Amount + Edge | -0.1808 | 3.286 | -2.182 | NA | NA | -978.1 | 1964 | 80.06 | 2.664e-18 |
| Amount + Edge + MnPatch | -0.1851 | 3.215 | -2.185 | 0.0982 | NA | -978.1 | 1966 | 82.03 | 9.926e-19 |
| Amount | -0.1975 | 1.321 | NA | NA | NA | -987.7 | 1981 | 97.13 | 5.218e-22 |
| Amount + MnPatch | -0.2003 | 1.273 | NA | 0.06331 | NA | -987.7 | 1983 | 99.15 | 1.907e-22 |
| MnPatch | -0.2432 | NA | NA | 1.042 | NA | -994.4 | 1995 | 110.5 | 6.445e-25 |
| Edge + MnPatch | -0.241 | NA | 0.09607 | 0.975 | NA | -994.3 | 1997 | 112.5 | 2.428e-25 |
| Edge | -0.2028 | NA | 0.7599 | NA | NA | -999 | 2004 | 119.7 | 6.459e-27 |
| 1 | -0.1964 | NA | NA | NA | NA | -1004 | 2013 | 128.5 | 8.216e-29 |
The [, -6] just removes the superfluous column “family” – it is the same for all the models so no point in adding to the table. The split.tables=Inf argument to pander just tells pander to not try and be smart about dividing up the result to fit inside a fixed number of columns. While it pains me to admit it, Word is actually smarter than R when it comes to sizing the columns of the table to fit on a page. You might already be familiar with base R functions lapply(), sapply() etc.; the map() functions do the same thing, but are more picky about the arguments they get and the values they return. This makes them safer to program with. The map() functions are found in package purrr in the tidyverse.↩︎
In fact, we could have done all in the individual calculations directly inside the call to tibble(). In this case I needed to explain each line one by one, so that becomes awkward.↩︎