In this week’s lab exercise we will first calculate the AIC values “by hand”, and then use the add-on package MuMIn. Doing the calculations by hand serves three purposes: 1) you practice using R!, 2) you gain a deeper understanding of how much work the add-on package is doing for you, and 3) sometimes the add-on package won’t work for a new class of model, and if you can do it by hand you won’t be stuck.

We’ll begin by using a simulated dataset from the paper “Confronting collinearity: comparing methods for disentangling the effects of habitat loss and fragmentation” by Adam Smith and colleagues. They had real data, but I have written a few lines of code that produce similar datasets. There are four variables that describe patches in a landscape (Amount, Edge, MnPatch, and Hetero), and a simulated, normally distributed response variable Y. Amount, Edge, and Hetero affect the response; MnPatch does not. All four of the variables are correlated to some extent. The script and data are available in the package NRES803. The code to make the data is in the examples section of ?smithsim, or simply load the data frame with data("smithsim"). Include all the outputs and figures in your Rmd file. Answer the numbered questions.
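You can check the collinearity claim for yourself once the data are loaded; a quick look at the pairwise correlations (an optional check, not part of the exercise):

library(NRES803)
data(smithsim)
round(cor(smithsim[, c("Amount", "Edge", "MnPatch", "Hetero")]), 2)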

Fit a global model that includes all four variables.

# library(NRES803); data(smithsim)  # alternative: load the data from the package
smithsim <- read.csv("smithsim.csv")
global <- lm(Y ~ Amount + Edge + MnPatch + Hetero, data = smithsim)
summary(global)
## 
## Call:
## lm(formula = Y ~ Amount + Edge + MnPatch + Hetero, data = smithsim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4933  -2.6799  -0.0211   2.3678  11.3932 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1129     0.1899  -0.594   0.5527    
## Amount        1.5933     0.5211   3.058   0.0024 ** 
## Edge         -2.0668     0.4406  -4.691 3.92e-06 ***
## MnPatch       0.2714     0.3033   0.895   0.3715    
## Hetero        2.1978     0.2290   9.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.541 on 345 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.313 
## F-statistic: 40.74 on 4 and 345 DF,  p-value: < 2.2e-16
old.par <- par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(global)
par(old.par)                     # restore the previous plotting layout

  1. Check the summary of the fitted model and plot the residuals – this is what the residuals SHOULD look like! Compare the estimated coefficients with the known true values of (intercept=0, Amount = 2, Edge = -2, MnPatch = 0, Hetero = 2), and the known residual standard error of 3.5. Now fit the TRUE model; recall that the true effect of the MnPatch variable is 0. This model enforces that by not estimating a coefficient for this variable.

The fitted global model is Ŷ = -0.1129 + 1.5933 × Amount - 2.0668 × Edge + 0.2714 × MnPatch + 2.1978 × Hetero. Compared with the known true values (intercept = 0, Amount = 2, Edge = -2, MnPatch = 0, Hetero = 2), every estimate is within about one standard error of the truth, and the residual standard error of 3.541 is very close to the known value of 3.5.

truth <- lm(Y~Amount+Edge+Hetero, data=smithsim)
summary(truth)
## 
## Call:
## lm(formula = Y ~ Amount + Edge + Hetero, data = smithsim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1007  -2.6453   0.0434   2.3937  11.7930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1014     0.1894  -0.535 0.592867    
## Amount        1.7978     0.4681   3.840 0.000146 ***
## Edge         -2.0584     0.4404  -4.674 4.23e-06 ***
## Hetero        2.1856     0.2285   9.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.54 on 346 degrees of freedom
## Multiple R-squared:  0.3193, Adjusted R-squared:  0.3134 
## F-statistic: 54.09 on 3 and 346 DF,  p-value: < 2.2e-16
old.par <- par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(truth)
par(old.par)                     # restore the previous plotting layout

The fitted true model is Ŷ = -0.1014 + 1.7978 × Amount - 2.0584 × Edge + 2.1856 × Hetero. Again, every estimate is within one standard error of the true values (intercept = 0, Amount = 2, Edge = -2, Hetero = 2), and the residual standard error of 3.54 is essentially the known value of 3.5.

  1. Use the summary and plots of residuals to compare the estimated coefficients and residual standard error to the true values. Is this model closer to the truth than the global model? How can you tell?

The residual plots for the two models look essentially identical. The coefficient estimates from the true model are slightly closer to the known values (e.g., Amount is 1.7978 vs. 1.5933 against a true value of 2), and its standard errors are slightly smaller, so the true model is marginally closer to the truth; the difference between the two fits is small, however.

One strategy for deciding if we can remove a variable from a model is “backward selection”. We can do this with an F-test.

anova(truth,global)
## Analysis of Variance Table
## 
## Model 1: Y ~ Amount + Edge + Hetero
## Model 2: Y ~ Amount + Edge + MnPatch + Hetero
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    346 4336.2                           
## 2    345 4326.2  1     10.04 0.8007 0.3715

So we can justify using the smaller model, because there is no significant difference between the simpler and more complex models. Now we can check whether we can simplify the model truth any further. I use the function update() to create a model that is identical except that it removes the variable Edge.

#smaller.model <-  update(truth,.~.-Amount)
smaller.model <-  update(truth,.~.-Edge)
#smaller.model <-  update(truth,.~.-Hetero)
anova(smaller.model, truth)
## Analysis of Variance Table
## 
## Model 1: Y ~ Amount + Hetero
## Model 2: Y ~ Amount + Edge + Hetero
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    347 4610.1                                  
## 2    346 4336.2  1    273.82 21.849 4.234e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can we eliminate the Edge variable from our model truth? No, because doing so leads to a significant increase in the residual sum of squares.

  1. Repeat these two steps for the other two variables, remembering to always start from the model truth. Can we eliminate any additional variables?

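The Edge test was already shown above, so the remaining two variables are Amount and Hetero. For Amount, dropping a single term from truth is equivalent to the t-test on that term in summary(truth): the F statistic is the square of the t value (3.840^2 ≈ 14.75), with the same p-value of 0.000146 (output not shown):

smaller.model <- update(truth,.~.-Amount)
anova(smaller.model, truth)

Then test Hetero: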

smaller.model <- update(truth,.~.-Hetero)
anova(smaller.model, truth)
## Analysis of Variance Table
## 
## Model 1: Y ~ Amount + Edge
## Model 2: Y ~ Amount + Edge + Hetero
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    347 5482.9                                  
## 2    346 4336.2  1    1146.6 91.492 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In every case the p-value is highly significant: dropping Amount, Edge, or Hetero produces a large and significant increase in the residual sum of squares. Therefore we cannot eliminate any additional variables from the model truth.

Now we’re going to try AIC model selection. The first step, no matter how you do the calculations, is to write out a list object containing all of the models we want to fit. In this case, we’ll simply do all of the possible main effect models – we have plenty of data:

models <- list(Y~1,
              Y~Amount,
              Y~Edge,
              Y~MnPatch,
              Y~Hetero,
              Y~Amount + Edge,
              Y~Amount + MnPatch,
              Y~Amount + Hetero,
              Y~Edge + MnPatch,
              Y~Edge + Hetero,
              Y~MnPatch + Hetero,
              Y~Amount+Edge+Hetero,
              Y~Amount+Edge+MnPatch,
              Y~Amount+MnPatch+Hetero,
              Y~Edge+MnPatch+Hetero,
              Y~Amount+Edge+MnPatch+Hetero)

This amounts to 16 possible models. Writing them all out may seem time consuming, but in fact you have to write out all the models at some point in order to check them, and by doing it in a list we gain a lot of flexibility and speed later on. For example, we don’t need to fit each model individually. Putting all our model formulas in a list makes it dead easy to fit all the models with one line of code using map()[1]. This function takes a list as its first argument, and passes each item in the list to the function given as the second argument. It also passes any additional arguments given to map() on to that function. In this case I give the argument data=smithsim to make sure the data are available to lm().

library(purrr)
fits <- map(models, lm, data=smithsim)

Next we want to get the AIC values for each of these models. We use map_dbl(), which is similar to map(), but returns a numeric vector:

aic <- map_dbl(fits,AIC)
aic
##  [1] 2012.749 1981.386 2003.985 1994.779 1901.672 1964.264 1983.353 1903.578
##  [9] 1996.685 1896.756 1903.558 1884.147 1966.180 1904.977 1892.695 1885.336

The best model has the smallest AIC.

which.min(aic)
## [1] 12
models[[which.min(aic)]] # which model is that?
## Y ~ Amount + Edge + Hetero

Which happens to be the true model. However, how many other models are close by? Here is where we manually calculate an AIC table. Start by calculating the ΔAIC values, and convert them to the model weights.

deltas <- aic - min(aic) # smallest will be zero
weights <- exp(-deltas/2) # this is an intermediate step
weights <- weights/sum(weights) # so we can do this
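These two lines implement the standard Akaike weight formula w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), where Δ_i = AIC_i - min(AIC). Each weight lies between 0 and 1 and the 16 weights sum to 1, so a weight can be read as the relative support for a model within this set.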

In this step we’re really taking advantage of R’s “vectorizing” of calculations to automatically repeat the same calculation across an entire vector.
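For example, subtracting a single number from a vector subtracts it from every element, with no loop required (a trivial illustration, not part of the analysis):

c(1884, 1885, 1893) - 1884
## [1] 0 1 9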

We also want the number of estimated parameters in each model.

npars <- map_dbl(fits, "rank") + 1 # rank counts the regression coefficients; + 1 adds the residual variance

If the second argument to map_dbl() is a character vector, map_dbl() tries to extract a component of that name from each element of the list. Finally, use tibble() to show all the results together.[2]
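The same character shortcut works for any named component of the fitted models; for example, every lm object stores its residual degrees of freedom, so one call extracts all 16 of them:

map_dbl(fits, "df.residual")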

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tibble(AIC=aic,k=npars,deltas=deltas,weights=weights)
## # A tibble: 16 × 4
##      AIC     k deltas  weights
##    <dbl> <dbl>  <dbl>    <dbl>
##  1 2013.     2 129.   7.57e-29
##  2 1981.     3  97.2  4.89e-22
##  3 2004.     3 120.   6.05e-27
##  4 1995.     3 111.   6.04e-25
##  5 1902.     3  17.5  9.98e- 5
##  6 1964.     4  80.1  2.56e-18
##  7 1983.     4  99.2  1.83e-22
##  8 1904.     4  19.4  3.85e- 5
##  9 1997.     4 113.   2.33e-25
## 10 1897.     4  12.6  1.17e- 3
## 11 1904.     4  19.4  3.89e- 5
## 12 1884.     5   0    6.38e- 1
## 13 1966.     5  82.0  9.81e-19
## 14 1905.     5  20.8  1.91e- 5
## 15 1893.     5   8.55 8.88e- 3
## 16 1885.     6   1.19 3.52e- 1

The first thing is to sort the models so they go from lowest AIC to highest AIC. This will mess up the row numbers, so we need to put them in the table directly before sorting:

aic.table <- tibble(model=1:16,AIC=aic,k=npars,deltas=deltas,weights=weights)
at2 <- aic.table %>%
  arrange(AIC) %>%
  mutate(cumw = cumsum(weights))
at2
## # A tibble: 16 × 6
##    model   AIC     k deltas  weights  cumw
##    <int> <dbl> <dbl>  <dbl>    <dbl> <dbl>
##  1    12 1884.     5   0    6.38e- 1 0.638
##  2    16 1885.     6   1.19 3.52e- 1 0.990
##  3    15 1893.     5   8.55 8.88e- 3 0.999
##  4    10 1897.     4  12.6  1.17e- 3 1.00 
##  5     5 1902.     3  17.5  9.98e- 5 1.00 
##  6    11 1904.     4  19.4  3.89e- 5 1.00 
##  7     8 1904.     4  19.4  3.85e- 5 1.00 
##  8    14 1905.     5  20.8  1.91e- 5 1    
##  9     6 1964.     4  80.1  2.56e-18 1    
## 10    13 1966.     5  82.0  9.81e-19 1    
## 11     2 1981.     3  97.2  4.89e-22 1    
## 12     7 1983.     4  99.2  1.83e-22 1    
## 13     4 1995.     3 111.   6.04e-25 1    
## 14     9 1997.     4 113.   2.33e-25 1    
## 15     3 2004.     3 120.   6.05e-27 1    
## 16     1 2013.     2 129.   7.57e-29 1

  1. Which models have a substantial weight? Can we be completely confident in the top model?

Models 12 and 16 carry essentially all of the weight. Model 12 (Y ~ Amount + Edge + Hetero) has the lowest AIC (1884.147) and a weight of 0.638; model 16 (the global model) is only 1.19 AIC units behind, with a weight of 0.352. Together their cumulative weight is 0.99, and every other model has a weight below 0.01 (the next best, model 15, is 8.55 AIC units behind with a weight of 0.0089). Because model 16 sits within 2 AIC units of the top model and the evidence ratio between them is only 0.638/0.352 ≈ 1.8, we cannot be completely confident in the top model.

However, model 16 is just model 12 plus MnPatch, and adding MnPatch raises the AIC: an extra parameter that does not improve the fit adds nothing, so we can eliminate the variable MnPatch. Therefore model 12 is the best choice.


And that’s all fine, but what if you want that table to be “pretty”, and actually formatted as a Word table so you can put it in a manuscript? A final step for most analyses is to put the table into a word processing document. However, word processors don’t handle numbers very well, so a better approach is to copy the table into a spreadsheet like Excel, format all the numbers, and then copy from Excel into Word. On a Windows machine, you can copy an R object to the clipboard like this:

write.csv(aic.table,"clipboard",row.names=FALSE)

In both cases, switch to Excel and paste. You could also sort the table before calling write.csv(), but that’s not essential. Once in Excel you will have to use the Text to Columns dialog to get the results into a table. Then it is easy to change the scientific notation in the weights column to something readable, adjust the number of decimal places, and sort the table however you like. Or we can do it all directly in R Markdown. First load the pander package.

library(pander)
pander(as.data.frame(at2),caption="Table 1 AIC table of smithsim results.")
Table 1 AIC table of smithsim results.

| model |  AIC | k | deltas |   weights |   cumw |
|------:|-----:|--:|-------:|----------:|-------:|
|    12 | 1884 | 5 |      0 |    0.6378 | 0.6378 |
|    16 | 1885 | 6 |  1.189 |     0.352 | 0.9898 |
|    15 | 1893 | 5 |  8.548 |  0.008879 | 0.9986 |
|    10 | 1897 | 4 |  12.61 |  0.001166 | 0.9998 |
|     5 | 1902 | 3 |  17.53 | 9.979e-05 | 0.9999 |
|    11 | 1904 | 4 |  19.41 | 3.888e-05 | 0.9999 |
|     8 | 1904 | 4 |  19.43 | 3.848e-05 |      1 |
|    14 | 1905 | 5 |  20.83 | 1.911e-05 |      1 |
|     6 | 1964 | 4 |  80.12 | 2.556e-18 |      1 |
|    13 | 1966 | 5 |  82.03 | 9.806e-19 |      1 |
|     2 | 1981 | 3 |  97.24 | 4.891e-22 |      1 |
|     7 | 1983 | 4 |  99.21 | 1.829e-22 |      1 |
|     4 | 1995 | 3 |  110.6 | 6.041e-25 |      1 |
|     9 | 1997 | 4 |  112.5 | 2.329e-25 |      1 |
|     3 | 2004 | 3 |  119.8 | 6.054e-27 |      1 |
|     1 | 2013 | 2 |  128.6 | 7.568e-29 |      1 |

When you click “Knit Word” in RStudio you get a word document that is completely editable, and the table is nicely produced and ready to go. Incidentally, pander does pretty cool things with other R objects too.

pander(truth)
Fitting linear model: Y ~ Amount + Edge + Hetero

|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:------------|---------:|-----------:|--------:|-----------:|
| (Intercept) |  -0.1014 |     0.1894 | -0.5352 |     0.5929 |
| Amount      |    1.798 |     0.4681 |    3.84 |  0.0001461 |
| Edge        |   -2.058 |     0.4404 |  -4.674 |  4.234e-06 |
| Hetero      |    2.186 |     0.2285 |   9.565 |  2.186e-19 |
pander(global)
Fitting linear model: Y ~ Amount + Edge + MnPatch + Hetero

|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:------------|---------:|-----------:|--------:|-----------:|
| (Intercept) |  -0.1129 |     0.1899 | -0.5943 |     0.5527 |
| Amount      |    1.593 |     0.5211 |   3.058 |   0.002403 |
| Edge        |   -2.067 |     0.4406 |  -4.691 |  3.923e-06 |
| MnPatch     |   0.2714 |     0.3033 |  0.8948 |     0.3715 |
| Hetero      |    2.198 |      0.229 |   9.599 |  1.714e-19 |

Is there nothing R can’t do? Well, you still have to write the interpretation of the model!

And here is the easy way to do all of the above. First, make sure you have the MuMIn package installed on your computer.

library("MuMIn")
model.sel(fits)
## Model selection table 
##    (Intrc)   Amont     Edge   MnPtc Heter df    logLik   AICc  delta weight
## 12 -0.1014  1.7980 -2.05800         2.186  5  -937.073 1884.3   0.00  0.646
## 16 -0.1129  1.5930 -2.06700 0.27140 2.198  6  -936.668 1885.6   1.26  0.344
## 15 -0.1302         -1.04300 0.67820 2.425  5  -941.348 1892.9   8.55  0.009
## 10 -0.1015         -0.61020         2.477  4  -944.378 1896.9  12.55  0.001
## 5  -0.1176                          2.174  3  -947.836 1901.7  17.42  0.000
## 11 -0.1221                  0.07515 2.145  4  -947.779 1903.7  19.35  0.000
## 8  -0.1160 -0.0760                  2.217  4  -947.789 1903.7  19.37  0.000
## 14 -0.1263 -0.2643          0.24080 2.228  5  -947.489 1905.2  20.83  0.000
## 6  -0.1808  3.2860 -2.18200                4  -978.132 1964.4  80.06  0.000
## 13 -0.1851  3.2150 -2.18500 0.09820        5  -978.090 1966.4  82.03  0.000
## 2  -0.1975  1.3210                         3  -987.693 1981.5  97.13  0.000
## 7  -0.2003  1.2730          0.06331        4  -987.677 1983.5  99.15  0.000
## 4  -0.2432                  1.04200        3  -994.390 1994.8 110.53  0.000
## 9  -0.2410          0.09607 0.97500        4  -994.343 1996.8 112.48  0.000
## 3  -0.2028          0.75990                3  -998.993 2004.1 119.73  0.000
## 1  -0.1964                                 2 -1004.375 2012.8 128.46  0.000
## Models ranked by AICc(x)

where fits is the list of fitted model objects we created before. One line! Two, if you count loading the library. This package also adds the estimated coefficients to the table, leaving blank spaces where a coefficient is not included in the model. If you look closely you’ll see that the AIC values in this table are slightly different from the ones we calculated by hand, because model.sel() uses the second-order, sample-size-corrected AICc by default. If you do:

model.sel(fits, rank = AIC)
## Model selection table 
##    (Intrc)   Amont     Edge   MnPtc Heter df    logLik    AIC  delta weight
## 12 -0.1014  1.7980 -2.05800         2.186  5  -937.073 1884.1   0.00  0.638
## 16 -0.1129  1.5930 -2.06700 0.27140 2.198  6  -936.668 1885.3   1.19  0.352
## 15 -0.1302         -1.04300 0.67820 2.425  5  -941.348 1892.7   8.55  0.009
## 10 -0.1015         -0.61020         2.477  4  -944.378 1896.8  12.61  0.001
## 5  -0.1176                          2.174  3  -947.836 1901.7  17.53  0.000
## 11 -0.1221                  0.07515 2.145  4  -947.779 1903.6  19.41  0.000
## 8  -0.1160 -0.0760                  2.217  4  -947.789 1903.6  19.43  0.000
## 14 -0.1263 -0.2643          0.24080 2.228  5  -947.489 1905.0  20.83  0.000
## 6  -0.1808  3.2860 -2.18200                4  -978.132 1964.3  80.12  0.000
## 13 -0.1851  3.2150 -2.18500 0.09820        5  -978.090 1966.2  82.03  0.000
## 2  -0.1975  1.3210                         3  -987.693 1981.4  97.24  0.000
## 7  -0.2003  1.2730          0.06331        4  -987.677 1983.4  99.21  0.000
## 4  -0.2432                  1.04200        3  -994.390 1994.8 110.63  0.000
## 9  -0.2410          0.09607 0.97500        4  -994.343 1996.7 112.54  0.000
## 3  -0.2028          0.75990                3  -998.993 2004.0 119.84  0.000
## 1  -0.1964                                 2 -1004.375 2012.7 128.60  0.000
## Models ranked by AIC(x)

Based on the model weights (under either AICc or AIC), the best model for predicting Y is Amount + Edge + Hetero.

You can get exactly the same results we obtained manually. It is also possible to calculate the second-order corrected AICc by hand, but it involves a couple of extra lines, so we’ll skip it here. In general there is no reason not to use the second-order correction, because it naturally converges to the uncorrected AIC as the sample size increases. The table is even easier to understand if we put the model formulas into it.
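For reference, the second-order correction is AICc = AIC + 2k(k + 1)/(n - k - 1), where k is the number of estimated parameters and n is the sample size. Here is a minimal sketch of the by-hand version, reusing the aic and npars vectors from above:

n <- nrow(smithsim) # 350 observations
aicc <- aic + 2 * npars * (npars + 1) / (n - npars - 1)
min(aicc) # about 1884.3, matching the model.sel() table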

library(purrr)
modnames <- map_chr(models, ~deparse(.x[[3]])) # element 3 of a formula is its right-hand side
names(fits) <- modnames
pander(model.sel(fits)[,-6], split.tables=Inf)
## Warning in `[.model.selection`(model.sel(fits), , -6): cannot recalculate
## "weights" on an incomplete object
|                                  | (Intercept) |  Amount |    Edge | MnPatch | Hetero | logLik | AICc | delta |    weight |
|:---------------------------------|------------:|--------:|--------:|--------:|-------:|-------:|-----:|------:|----------:|
| Amount + Edge + Hetero           |     -0.1014 |   1.798 |  -2.058 |      NA |  2.186 | -937.1 | 1884 |     0 |    0.6456 |
| Amount + Edge + MnPatch + Hetero |     -0.1129 |   1.593 |  -2.067 |  0.2714 |  2.198 | -936.7 | 1886 | 1.259 |     0.344 |
| Edge + MnPatch + Hetero          |     -0.1302 |      NA |  -1.043 |  0.6782 |  2.425 | -941.3 | 1893 | 8.548 |  0.008989 |
| Edge + Hetero                    |     -0.1015 |      NA | -0.6102 |      NA |  2.477 | -944.4 | 1897 | 12.55 |  0.001215 |
| Hetero                           |     -0.1176 |      NA |      NA |      NA |  2.174 | -947.8 | 1902 | 17.42 | 0.0001065 |
| MnPatch + Hetero                 |     -0.1221 |      NA |      NA | 0.07515 |  2.145 | -947.8 | 1904 | 19.35 | 4.052e-05 |
| Amount + Hetero                  |      -0.116 |  -0.076 |      NA |      NA |  2.217 | -947.8 | 1904 | 19.37 |  4.01e-05 |
| Amount + MnPatch + Hetero        |     -0.1263 | -0.2643 |      NA |  0.2408 |  2.228 | -947.5 | 1905 | 20.83 | 1.935e-05 |
| Amount + Edge                    |     -0.1808 |   3.286 |  -2.182 |      NA |     NA | -978.1 | 1964 | 80.06 | 2.664e-18 |
| Amount + Edge + MnPatch          |     -0.1851 |   3.215 |  -2.185 |  0.0982 |     NA | -978.1 | 1966 | 82.03 | 9.926e-19 |
| Amount                           |     -0.1975 |   1.321 |      NA |      NA |     NA | -987.7 | 1981 | 97.13 | 5.218e-22 |
| Amount + MnPatch                 |     -0.2003 |   1.273 |      NA | 0.06331 |     NA | -987.7 | 1983 | 99.15 | 1.907e-22 |
| MnPatch                          |     -0.2432 |      NA |      NA |   1.042 |     NA | -994.4 | 1995 | 110.5 | 6.445e-25 |
| Edge + MnPatch                   |      -0.241 |      NA | 0.09607 |   0.975 |     NA | -994.3 | 1997 | 112.5 | 2.428e-25 |
| Edge                             |     -0.2028 |      NA |  0.7599 |      NA |     NA |   -999 | 2004 | 119.7 | 6.459e-27 |
| 1                                |     -0.1964 |      NA |      NA |      NA |     NA |  -1004 | 2013 | 128.5 | 8.216e-29 |

The [, -6] just removes the superfluous column “family” – it is the same for all the models, so there is no point in adding it to the table. The split.tables=Inf argument tells pander not to try to be smart about dividing the result up to fit inside a fixed number of columns. While it pains me to admit it, Word is actually smarter than R when it comes to sizing the columns of a table to fit on a page.

[1] You might already be familiar with the base R functions lapply(), sapply(), etc.; the map() functions do the same thing, but are more picky about the arguments they get and the values they return. This makes them safer to program with. The map() functions are found in the package purrr, part of the tidyverse.↩︎

[2] In fact, we could have done all of the individual calculations directly inside the call to tibble(), but I needed to explain each line one by one, and that would have been awkward.↩︎