Markdown Created By: Jessie Bell, 2023
Markdown Data: Highstat.com
Libraries Used:
library(ggcorrplot) #for making ggpairs
library(GGally) #for making ggpairs
library(ggplot2) #for making ggplots
library(knitr) #for r markdown
library(gridExtra) #for ggarrange
library(ggiraphExtra) #for ggPredict()
library(plyr) # I think this need for ggPredict
library(mgcv) #for gam()
Forest bird densities were measured in 56 forest patches in south-eastern Victoria, Australia. The aim of the study was to relate bird densities to six habitat variables; size of the forest patch, distance to the nearest patch, distance to the nearest larger patch, mean altitude of the patch, year of isolation by clearing, and an index of stock grazing history (1 = light, 5 = intensive). The variables are given in the table below.
birdData <- read.table("Loyn.txt", header = T, sep="\t")
Check each of the following:
A. Are there outliers in response and explanatory variables?
birdData$fGRAZE <- factor(birdData$GRAZE) #turning graze into factor data
#setup the graphing pannel
op <- par(mfrow=c(4,2), mar=c(3,3,3,1))
dotchart(birdData$ABUND, main= "ABUND", group=birdData$fGRAZE)
plot(0,0, type = "n", axes = F)
dotchart(birdData$AREA, main="Area", groups = birdData$fGRAZE)
dotchart(birdData$DIST, main = "DIST", group = birdData$fGRAZE)
dotchart(birdData$LDIST, main = "LDIST", group = birdData$fGRAZE)
dotchart(birdData$YR.ISOL, main = "YR.ISOL", group = birdData$fGRAZE)
dotchart(birdData$ALT, main = "ALT", group = birdData$fGRAZE)
dotchart(birdData$GRAZE, main = "GRAZE", group = birdData$fGRAZE)
Figure 1: Isolated points at the far ends, and on either side in a dotplot, suggest potential outliers. This is the case for two observations with high AREA values, one observation with a high DIST value, and a couple of observations with high LDIST values (note that these are all different forest patches).
par(op)
If two or three observations (the same) have larger values for all variables, then the decision what to do is easy, just drop them from the analysis. However, if we do this here, we lose too many observations. The alternative is to apply a transformation on AREA, DIST and LDIST.
Based on the values of these three variables, a strong transformation is needed, for example, a logarithmic (base 10) transformation or a natural logarithmic transformation. Note that all three variables are related as they measure size and distance. Variables like size, distance, and volume often need a transformation. It is easier to justify a transformation on only a subset of variables if they are somehow ecologically ‘related’.
You should always redo 1b if you end up transforming your data. (Sometimes things can look good in 1a and you won’t notice that you need to transform the data until 1b. Go back, transform, and move through again.)
Transform Data by taking log10():
birdData$L.AREA <- log10(birdData$AREA)
birdData$L.DIST <- log10(birdData$DIST)
birdData$L.LDist <- log10(birdData$LDIST)
B. Is there collinearity of the explanatory variables?
There are 3 tools for this:
[1] Pairwise Scatterplots
[2] Correlation Coefficients
ggpair.setup <- birdData[,c("ABUND", "L.AREA","L.DIST","L.LDist", "YR.ISOL", "ALT", "GRAZE")]
ggpairs(ggpair.setup, columns = 1:7)
Figure 2: Assessing [1] pairwise scatterplots and [2] correlation coefficients. We included GRAZE in the pairplot, but there is an argument that it should not be included as it is a nominal variable. However, it is actually ordinal; so, it does make some sense to include it, but you should interpret the correlation coefficients involving GRAZE with care as the difference between 4 and 5 may not be the same as the difference between 1 and 2 (the correlation coefficient assumes it is). We also included the response variable ABUND in the pairplot as it avoids repeating the same graph when we look at relationships between response and explanatory variables in the next paragraph. However, if you have a large number of explanatory variables, you would not normally use the response variable at this stage.
Focussing only on the explanatory variables in Fig. 2, there seems to be some correlation between L.DIST and L.LDIST; GRAZE and L.AREA; and GRAZE and YR.ISOL. However, the value of 0.6 (and –0.6) is not large enough to worry us.
[3] Variance Inflation Factor
The last took we use for collinearity is VIF:
#library(car) use car function VIF on linear models for this
#I am usin corvif function built by Zuur (available on the website: highstat.com)
corvif(ggpair.setup[, c(-1,-7)])
##
##
## Variance inflation factors
##
## GVIF
## L.AREA 1.622200
## L.DIST 1.622396
## L.LDist 2.008157
## YR.ISOL 1.201719
## ALT 1.347805
According to Zuur, we can keep all of the variables in our analysis.
C. Are there relationships between response variable and explanatory variables?
We now look at relationships between the response variable and the explanatory variables. The most obvious tool for this task is a pairplot that contains the response variable and a set of explanatory variables (Fig. 2). If you have a large number of explanatory variables (10−15), then multiple pairplots may be needed. If you have more than 10−15 explanatory variables, then pairplots are less useful. However, if you have more than 10 explanatory variables, then it is very likely that there will be high collinearity.
Other ways you can assess this are with: coplot, xyplot, and plot.design.
Based on the pairplot in Fig. 2, we expect that the variables L.AREA and GRAZE will play an important role in the analyses.
Before running your lm(), you should think about what you are doing. In this problem, we want to find the relationship between bird densities (ABUND) and the 6 explanatory variables. It could be that birds have different affect with GRAZE involved. If that is the case, we have to include an interaction term between GRAZE and L.AREA.
The problem is that there are a large number of potential two-way interactions. And there could also be three-way interactions; birds may respond in a different way to AREA if GRAZE values are low in combination with low values for altitude (ALT). The smaller the data set (56 observations is small), the more difficult it is to include multiple interaction terms. Especially with multiple nominal explanatory variables with more than 2 levels involved.
The statistics professionals believe you do 1-3 below, but you can decide to do any of the 7:
[1] Start with a model with no interactions. Apply the model, model selection, and model validation. If the validation shows that there are patterns in the residuals, investigate why. Adding interactions may be an option to improve the model.
[2] Use biological knowledge to decide which, if any, interactions are sensible to add.
[3] Apply a good data exploration to see which interaction may be important.
[4] Identify the prime explanatory variables(s) of interest. In the bird example, this would be GRAZE and AREA as we can control them using mgmt. decisions. Include interactions between these variables and any of the other variables.
[5] Only include the main terms and 2-way interaction terms.
[6] Only include higher interaction terms if you have good reasons (i.e. biological justification) to do so.
[7] Include all interactions by default.
We will start the bird data analysis with no interactions. The following R code applies a linear regression in R.
Model1 <- lm(ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT + fGRAZE, data = birdData)
The question now is: Should we look at the numerical output first or the graphical output? There is no point in applying a detailed model validation if nothing is significant. On the other hand, why look at the numerical output if all the assumptions are violated?
Perhaps starting with the numerical output is better as it takes less time and is easier. There are multiple ways of getting numerical output for our linear regression model:
[1] Summary(Model1)
[2] drop1(Model1, test=F)
[3] anova(Model1)
Each of them produces a slightly different output.
summary(Model1)
##
## Call:
## lm(formula = ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT +
## fGRAZE, data = birdData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8992 -2.7245 -0.2772 2.7052 11.2811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.68025 115.16348 0.319 0.7515
## L.LDist 0.79765 2.13759 0.373 0.7107
## L.AREA 6.83303 1.50330 4.545 3.97e-05 ***
## L.DIST 0.33286 2.74778 0.121 0.9041
## YR.ISOL -0.01277 0.05803 -0.220 0.8267
## ALT 0.01070 0.02390 0.448 0.6565
## fGRAZE2 0.52851 3.25221 0.163 0.8716
## fGRAZE3 0.06601 2.95871 0.022 0.9823
## fGRAZE4 -1.24877 3.19838 -0.390 0.6980
## fGRAZE5 -12.47309 4.77827 -2.610 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.105 on 46 degrees of freedom
## Multiple R-squared: 0.7295, Adjusted R-squared: 0.6766
## F-statistic: 13.78 on 9 and 46 DF, p-value: 2.115e-10
#The first part of the output tells you which model was applied and some basic information on the residuals. The part below ‘Coefficients’ gives the estimated regression parameters, standard errors, t-values, and p-values. The only confusing part of this output is perhaps the absence of GRAZE level 1. It is used as a baseline. Hence, a patch that has GRAZE level 2 has 0.52 birds (density) more than a patch with level 1, and a patch with GRAZE level 5 has 12.4 birds less than a patch with level 1. The corresponding p-values tell you whether a patch is significantly different from level 1. Dalgaard (2002) shows how to change the baseline and adjust for multiple comparisons. Note that you should not assess the significance of a factor by the individual p-values. We will give a better method for this in a moment. You should not drop individual levels of a nominal variable. They all go in or you drop the whole variable. The last bit of the code gives the R2 and adjusted R2 (for model selection). The rest of the output you should, hopefully, be familiar with.
H0: \(\beta\) = 0
drop1(Model1)
## Single term deletions
##
## Model:
## ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT + fGRAZE
## Df Sum of Sq RSS AIC
## <none> 1714.4 211.60
## L.LDist 1 5.19 1719.6 209.77
## L.AREA 1 770.01 2484.4 230.38
## L.DIST 1 0.55 1715.0 209.62
## YR.ISOL 1 1.81 1716.2 209.66
## ALT 1 7.47 1721.9 209.85
## fGRAZE 4 413.50 2127.9 215.70
#The full model has a sum of squares of 1714.43. Each time, one term is dropped in turn, and each time, the residual sum of squares is calculated. These are then used to calculate an F-statistic and a corresponding p-value. For example, to get the output on the first line, R fits two models. The first model contains all explanatory variables and the second model all, but L.AREA. It then uses the residual sums of squares of each model in the following F-statistic
#The larger the value of the F-statistic, the more evidence there is to reject this hypothesis. In fact, the F-statistic follows an F-distribution, assuming homogeneity, normality, independence and no residual patterns. In this case, we can reject the null hypothesis.
#In linear regression, the p-values from the drop1 function are the same as those obtained by the t-statistic from the summary command, but for non-Gaussian GLMs, this is not necessarily the case. The null-hypothesis underlying the F-statistic is that the regression parameter from the term that was dropped is equal to 0. Basically, we are comparing a full and (repeatedly) a nested model. If the model has multiple nominal variables, the drop1 function gives one p-value for each variable, which is handy.
The null hypothesis of the fstat for anova(lm()):
H0: The 4 regression parameters for GRAZE (levels 2-5) are equal to 0.
anova(Model1)
## Analysis of Variance Table
##
## Response: ABUND
## Df Sum Sq Mean Sq F value Pr(>F)
## L.LDist 1 88.4 88.4 2.3728 0.130315
## L.AREA 1 3584.5 3584.5 96.1755 7.539e-13 ***
## L.DIST 1 0.1 0.1 0.0018 0.966623
## YR.ISOL 1 458.8 458.8 12.3109 0.001019 **
## ALT 1 78.2 78.2 2.0979 0.154281
## fGRAZE 4 413.5 103.4 2.7736 0.037992 *
## Residuals 46 1714.4 37.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#add another Model
Model2 <- lm(ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT, data = birdData)
anova(Model1, Model2)
## Analysis of Variance Table
##
## Model 1: ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT + fGRAZE
## Model 2: ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 1714.4
## 2 50 2127.9 -4 -413.5 2.7736 0.03799 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The sum of squares on the first row, 3471.0, is the regression sum of squares from the model ABUNDi = α + β × L.AREAi + εi. The 65.5 on the second line is the decrease in residual sum of squares if L.DIST is added to this model (to see this, fit a model with only the intercept and L.AREA, and a model with intercept, L.AREA, and L.DIST and compare the two residual sum of squares obtained from the anova commands; the difference will be 65.5). A theoretical justification for this table can be found in Section 1.3 in Wood (2006)
#The nice thing about this approach is that the last line gives us one p-value for the nominal variable GRAZE (as it is the last variable that is added), and we need this to assess whether GRAZE is significant. The disadvantage of this way of testing is that the p-values will depend on the order the variables: Change the order and you get a different conclusion
#Note that the last line of the anova command and the drop1 are identical. That is because the same nested models are being compared.
# The anova function can also be used to compare models that are nested. Suppose we fit a linear model with all explanatory variables, and a model with all explanatory variables, except GRAZE. These models are nested as the second model is a special case of the first, assuming all four regression parameters for the GRAZE levels are equal to zero (see below). The R code and its output are below.
#note that the p-value is the same as anova(M1) command. Then why do the comparison? The advantage of the anova(M1, M2) command is that we can control which terms are dropped. This is especially useful with multiple interaction terms.
Not all explanatory variables are significantly different from 0 as can be seen from teh p-values of the t-stats (summary(lm))) and the p-values of the Fstat (drop.1(lm())) commands in Step 2.
There are 3 main approaches to deciding what to do with the information from step 2.
[1] Drop individual explanatory variables one by one based on hypothesis testing procedures (step 2). You will need to drop the least significant term in this situation obtained in t-test of summary(lm()) or anova(()) for nominal variables or interactive variables.
[2] Drop individual explanatory variables one by one (and each time you will need to refit the model). You use the Akaike information criteria (AIC) in this case. It measures the goodness of fit and model complexity. The advantage is that R has a function that compares the model either backwards or forwards, making life easy. The disadvantage is that it is conservative. Once AIC has selected the best model, a backwards selection is applied by the command step(Model1) and its output is given as:
step(Model1)
## Start: AIC=211.6
## ABUND ~ L.LDist + L.AREA + L.DIST + YR.ISOL + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - L.DIST 1 0.55 1715.0 209.62
## - YR.ISOL 1 1.81 1716.2 209.66
## - L.LDist 1 5.19 1719.6 209.77
## - ALT 1 7.47 1721.9 209.85
## <none> 1714.4 211.60
## - fGRAZE 4 413.50 2127.9 215.70
## - L.AREA 1 770.01 2484.4 230.38
##
## Step: AIC=209.62
## ABUND ~ L.LDist + L.AREA + YR.ISOL + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - YR.ISOL 1 1.73 1716.7 207.68
## - ALT 1 7.07 1722.0 207.85
## - L.LDist 1 8.57 1723.5 207.90
## <none> 1715.0 209.62
## - fGRAZE 4 413.28 2128.2 213.71
## - L.AREA 1 769.64 2484.6 228.38
##
## Step: AIC=207.68
## ABUND ~ L.LDist + L.AREA + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - L.LDist 1 8.32 1725.0 205.95
## - ALT 1 9.71 1726.4 205.99
## <none> 1716.7 207.68
## - fGRAZE 4 848.77 2565.5 222.18
## - L.AREA 1 790.20 2506.9 226.88
##
## Step: AIC=205.95
## ABUND ~ L.AREA + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - ALT 1 5.37 1730.4 204.12
## <none> 1725.0 205.95
## - fGRAZE 4 914.23 2639.3 221.76
## - L.AREA 1 1130.78 2855.8 232.18
##
## Step: AIC=204.12
## ABUND ~ L.AREA + fGRAZE
##
## Df Sum of Sq RSS AIC
## <none> 1730.4 204.12
## - fGRAZE 4 1136.5 2866.9 224.40
## - L.AREA 1 1153.8 2884.2 230.73
##
## Call:
## lm(formula = ABUND ~ L.AREA + fGRAZE, data = birdData)
##
## Coefficients:
## (Intercept) L.AREA fGRAZE2 fGRAZE3 fGRAZE4 fGRAZE5
## 15.7164 7.2472 0.3826 -0.1893 -1.5916 -11.8938
The first part of the step() function shows the model containing all of the variables. This has an AIC of 211.6 before each term is dropped. The lower the AIC, the better the model, as judged by AIC. The optimal model above is lm(ABUND ~ L.AREA + fGRAZE). Now you should reapply this model and see whether both terms are significant. Note that both the summary() command and the anova() command are needed for this.
#reapplying the model
Model3 <- lm(ABUND ~ L.AREA + fGRAZE, data = birdData)
summary(Model3) #GRAZE 2-4 not significant t-test, significant anova
##
## Call:
## lm(formula = ABUND ~ L.AREA + fGRAZE, data = birdData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0849 -2.4793 -0.0817 2.6486 11.6344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.7164 2.7674 5.679 6.87e-07 ***
## L.AREA 7.2472 1.2551 5.774 4.90e-07 ***
## fGRAZE2 0.3826 2.9123 0.131 0.895993
## fGRAZE3 -0.1893 2.5498 -0.074 0.941119
## fGRAZE4 -1.5916 2.9762 -0.535 0.595182
## fGRAZE5 -11.8938 2.9311 -4.058 0.000174 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.883 on 50 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.6997
## F-statistic: 26.63 on 5 and 50 DF, p-value: 5.148e-13
anova(Model3) #L.AREA and fGRAZE significant
## Analysis of Variance Table
##
## Response: ABUND
## Df Sum Sq Mean Sq F value Pr(>F)
## L.AREA 1 3471.0 3471.0 100.2944 1.530e-13 ***
## fGRAZE 4 1136.5 284.1 8.2101 3.598e-05 ***
## Residuals 50 1730.4 34.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#you could also try to see whether adding interaction between L.AREA and fGRAZE improves the model. You should be able to get one p-value for this interaction term.
[3] Specify a priori chosen models and compare these models with one another. **see koala case study chapter in Zuur et al., 2009 for a good example of this.
Once you have selected the optimal model, apply model validation. This process has all of the following steps:
[1] Plot the standard residuals against fitted values to assess homogeneity.
[2] Make a histogram of the residuals to verify normality. You can also use QQplot.
[3] Plot the residuals against the explanatory variable from the model. If you see a pattern you are violating the independence assumption.
[4] Plot the residuals against each explanatory NOT shown in the model. If you see a pattern, include the omitted explanatory variable and refit the model. If residual patterns disappear, include the term, even if it is not significant.
[5] Assess the model for influential observations. A useful tool for this is Cook’s Distance Function.
Figure 3 interpretation: There is some evidence of heterogeneity (as can be seen in residuals vs. fitted values) and non-normality. It seems there is less spread in sites with GRAZE level 5. Based on the graph that shows residuals vs. L.AREA, we seem to have some violation of independence.
dp <- ggplot(Model3, aes(x=Model3$residuals)) +
geom_density(fill="blue", colour=NA, alpha=.2) +
geom_line(stat = "density") +
expand_limits(y = 0) +
labs(title="Density plot residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Residual") +
ylab("Density")
qqp <- ggplot(Model3, aes(sample = Model3$residuals))+
stat_qq() + stat_qq_line(col="steelblue")+
labs(title="QQ residuals",
x="Theoretical Qualtiles",
y="Sample Quantiles")
#*QQ plot plots our residuals vs. normal distribution, a line means data are normal
#Now plot fitted vs. residuals to determine if equal variance above and below line
fitres.p <- ggplot(Model3,
aes(Model3$fitted.values,
Model3$residuals))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed", col="steelblue")+
labs(title="Residuals vs. Fitted",
x="Fitted Values", y="Residuals")
std.resid.p<-ggplot(Model3, aes(.fitted, sqrt(abs(.stdresid))))+
geom_point(na.rm=TRUE)+
stat_smooth(method="loess", na.rm = TRUE)+
xlab("Fitted Value")+
ylab(expression(sqrt("|Std. residuals|")))+
ggtitle("Scale-Location")
ggarrange(dp, qqp, fitres.p, std.resid.p, labels = c("A", "B", "C", "D"))
## `geom_smooth()` using formula = 'y ~ x'
Figure 3: Evaluating Model Assumptions. (A) Density Plot of residuals has a bit of a right skew. (B) The Q-Q plot looks normal, with a few points further from the line around x = 2, indicating some heaviness in the tails. (C) Residuals vs. fitted show no indication of non-linearity seen. (D) Scale-location plot, another type of residual vs. fitted that shows no clear trend in residual variance throughout the plot.
Create a graph that shows what the model is doing.
ggplot(birdData, aes(L.AREA, ABUND, color=fGRAZE))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x)
Figure 4: Graphing all with same line, not taking into account multi regression lines
Model3 <- lm(ABUND ~ L.AREA + GRAZE, data = birdData)
#plotting the prediction model
ggPredict(Model3) #*if lines are not parallel then you know there is not an interaction term between L.AREA and fGRAZE in the model.
Figure 5: Model Prediction if an interaction term was significant, the lines would have different slopes and would not be parallel. Here you can see there is no interaction between L.AREA and F.GRAZE. Note Graze is not fGRAZE here, I don’t think ggPredict likes categories.
#* plot the results of linear regression model because it helps with interpretation.
The scatterplot of ABUND vs. L.AREA (Fig. 2) and the fit of the lines in Fig. 4 all suggest that imposing a linear L.AREA affect might be incorrect. From a biological point of view, it also makes more sense to assume that the larger the forest patches, the higher the number of birds, but only up to a certain level. A generalized additive model is a method that can be used to verify the type of model required. If the GAM indicates that the smoother is a straight line, then we know that the linear regression model is correct.
We will use GAM with the Gaussian Distribution and apply the following model:
The smoothing functions \(f_1\) are estimated by a thin plate regression spline (See Ch. 3 for more on this). There are other alternatives, like cubic regression splines, etc…
It is not important that you know the differences between all of these smoothers, but it becomes an issue for very large data sets. The R code to run the GAM is:
AM1 <- gam(ABUND ~ s(L.AREA)+s(L.DIST)+s(L.LDist)+s(YR.ISOL)+s(ALT)+fGRAZE, data=birdData) #something is wrong with the output below. Numbers do not match others.
anova(AM1) #notice we do not get an Ftest in this output. Instead we get the Wald test (approximate). This shows the significance of each term in the model.
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ABUND ~ s(L.AREA) + s(L.DIST) + s(L.LDist) + s(YR.ISOL) + s(ALT) +
## fGRAZE
##
## Parametric Terms:
## df F p-value
## fGRAZE 4 3.281 0.0202
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(L.AREA) 3.030 3.650 8.709 7.16e-05
## s(L.DIST) 2.524 3.160 0.541 0.645
## s(L.LDist) 1.000 1.000 0.219 0.642
## s(YR.ISOL) 2.842 3.376 1.384 0.281
## s(ALT) 1.000 1.000 0.720 0.401
**Above you can see that none of the p-values are significant. This means the smoothers aren’t exactly what we want and we are back to the data selection process as in the linear model sections Steps 3. We can either compare a priori selected models (not discussed here), use hypothesis testing procedures, or the model selection tool like AIC.
[1] The hypothesis testing is the easiest. Just drop the least significant term from the model, refit the model, and repeat the process until all terms are significant. This is quick and dirty, but very useful with the computed times are long.
[2] Using AIC is harder because there is not a command like step() that will work it all for you. You have to manually drop things 1 by 1 and decide which is best. This can take a long time with many vars.
[3] The last method, the optimal amount of smoothing is estimated with a method called, cross-validation. Here, 1 degree of freedom produces a straight line and 10 degrees of freedom is a highly non-linear curve. In linear regression, a non-significant term is still consuming one degree of freedom. The gam function is able to produce smoothers with 0 degrees of freedom, which basically removes the need to refit th emodel without the terms. It only works with thin plate regression splines and cubic regression splines.
Here is the code for [3]:
AM2 <- gam(ABUND ~ s(L.AREA, bs="cs")+s(L.DIST, bs="cs")+s(L.LDist, bs="cs")+s(YR.ISOL, bs="cs")+s(ALT, bs="cs")+fGRAZE, data=birdData)
#* cs tells r to use the cubic regression spline. Don't worry about understanding them. Just know that thin plates are generally a tiny bit more linear that cubics.
anova(AM2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ABUND ~ s(L.AREA, bs = "cs") + s(L.DIST, bs = "cs") + s(L.LDist,
## bs = "cs") + s(YR.ISOL, bs = "cs") + s(ALT, bs = "cs") +
## fGRAZE
##
## Parametric Terms:
## df F p-value
## fGRAZE 4 4.092 0.00674
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(L.AREA) 2.369e+00 9.000e+00 4.033 1.41e-06
## s(L.DIST) 2.991e+00 9.000e+00 0.431 0.270
## s(L.LDist) 1.735e-08 9.000e+00 0.000 0.782
## s(YR.ISOL) 2.669e+00 9.000e+00 0.450 0.199
## s(ALT) 1.181e-08 9.000e+00 0.000 0.593
** Here we will keep L.AREA and fGRAZE in our ABUND model.
AM3 <- gam(ABUND~s(L.AREA, bs="cs")+fGRAZE, data=birdData)
plot(AM3) #The model validation should follow the same steps as linear regression. The only differences are that the residuals are obtained by the command:
AM3.residuals <- resid(AM3)
#and there is not a function that easily plots these against the fitted values. so do this:
fit.AM3 <- fitted(AM3)
plot(fit.AM3, AM3.residuals, xlab="Fitted Values", ylab="Residuals")
Again! It is important that you plot this data for each individual explanatory variable. If any of these graphs shows a pattern, you must find the solution.
Finally, we need to decide if the GAM was necessary. We ended up with the same ending as we did without the GAM. Let’s see which is better!
The underlying null hypothesis here is that linear is accepted. Since p is low, the null has to go. We reject the null and we will go with the GAM. I think. We also prefer the GAM because it shows no residual patterns…however, the non-linear L.Area affect is not linear and we should sample more of this patch next time we go out.
#Model3 vs. AM3
anova(Model3, AM3)
## Analysis of Variance Table
##
## Model 1: ABUND ~ L.AREA + GRAZE
## Model 2: ABUND ~ s(L.AREA, bs = "cs") + fGRAZE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 53.000 2200.9
## 2 48.058 1520.0 4.9415 680.92 4.3567 0.002461 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Once you read chapter 4, you can extend on this appendix. (see page 550). Other options are the bioluminescent case and two-dimensional smoothers. These extensions won’t exactly work with these data though, because there are only 56 observations.
# Mixed Effects Models and Extensions in Ecology with R (2009)
# Zuur, Ieno, Walker, Saveliev and Smith. Springer
# This file was produced by Alain Zuur (highstat@highstat.com)
# www.highstat.com
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
Loyn <- read.table("Loyn.txt", header = T, sep="\t")
Loyn$fGRAZE <- factor(Loyn$GRAZE)
op<- par(mfrow=c(4,2),mar=c(3,3,3,1))
dotchart(Loyn$ABUND,main="ABUND",group=Loyn$fGRAZE)
plot(0,0,type="n",axes=F)
dotchart(Loyn$AREA,main="AREA",group=Loyn$fGRAZE)
dotchart(Loyn$DIST,main="DIST",group=Loyn$fGRAZE)
dotchart(Loyn$LDIST,main="LDIST",group=Loyn$fGRAZE)
dotchart(Loyn$YR.ISOL,main="YR.ISOL",group=Loyn$fGRAZE)
dotchart(Loyn$ALT,main="ALT",group=Loyn$fGRAZE)
dotchart(Loyn$GRAZE,main="GRAZE",group=Loyn$fGRAZE)
par(op)
Loyn$L.AREA<-log10(Loyn$AREA)
Loyn$L.DIST<-log10(Loyn$DIST)
Loyn$L.LDIST<-log10(Loyn$LDIST)
Z<-cbind(Loyn$ABUND,Loyn$L.AREA,Loyn$L.DIST,Loyn$L.LDIST,Loyn$YR.ISOL,Loyn$ALT,Loyn$GRAZE)
colnames(Z)<-c("ABUND","L.AREA","L.DIST","L.LDIST","YR.ISOL","ALT","GRAZE")
pairs(Z, lower.panel=panel.smooth2,
upper.panel=panel.cor,diag.panel=panel.hist)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
corvif(Z[,c(-1,-7)])
##
##
## Variance inflation factors
##
## GVIF
## L.AREA 1.622200
## L.DIST 1.622396
## L.LDIST 2.008157
## YR.ISOL 1.201719
## ALT 1.347805
M1 <- lm(ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT +
fGRAZE, data = Loyn)
summary(M1)
##
## Call:
## lm(formula = ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT +
## fGRAZE, data = Loyn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8992 -2.7245 -0.2772 2.7052 11.2811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.68025 115.16348 0.319 0.7515
## L.AREA 6.83303 1.50330 4.545 3.97e-05 ***
## L.DIST 0.33286 2.74778 0.121 0.9041
## L.LDIST 0.79765 2.13759 0.373 0.7107
## YR.ISOL -0.01277 0.05803 -0.220 0.8267
## ALT 0.01070 0.02390 0.448 0.6565
## fGRAZE2 0.52851 3.25221 0.163 0.8716
## fGRAZE3 0.06601 2.95871 0.022 0.9823
## fGRAZE4 -1.24877 3.19838 -0.390 0.6980
## fGRAZE5 -12.47309 4.77827 -2.610 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.105 on 46 degrees of freedom
## Multiple R-squared: 0.7295, Adjusted R-squared: 0.6766
## F-statistic: 13.78 on 9 and 46 DF, p-value: 2.115e-10
anova(M1)
## Analysis of Variance Table
##
## Response: ABUND
## Df Sum Sq Mean Sq F value Pr(>F)
## L.AREA 1 3471.0 3471.0 93.1303 1.247e-12 ***
## L.DIST 1 65.5 65.5 1.7568 0.191565
## L.LDIST 1 136.5 136.5 3.6630 0.061868 .
## YR.ISOL 1 458.8 458.8 12.3109 0.001019 **
## ALT 1 78.2 78.2 2.0979 0.154281
## fGRAZE 4 413.5 103.4 2.7736 0.037992 *
## Residuals 46 1714.4 37.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(M1,test="F")
## Single term deletions
##
## Model:
## ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1714.4 211.60
## L.AREA 1 770.01 2484.4 230.38 20.6603 3.97e-05 ***
## L.DIST 1 0.55 1715.0 209.62 0.0147 0.90411
## L.LDIST 1 5.19 1719.6 209.77 0.1392 0.71075
## YR.ISOL 1 1.81 1716.2 209.66 0.0485 0.82675
## ALT 1 7.47 1721.9 209.85 0.2004 0.65650
## fGRAZE 4 413.50 2127.9 215.70 2.7736 0.03799 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Verify drop1
M1A <- lm(ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT +
fGRAZE, data = Loyn)
M1B <- lm(ABUND ~ L.DIST + L.LDIST + YR.ISOL + ALT +
fGRAZE, data = Loyn)
anova(M1A,M1B)
## Analysis of Variance Table
##
## Model 1: ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
## Model 2: ABUND ~ L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 1714.4
## 2 47 2484.4 -1 -770.01 20.66 3.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Model 1: ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
#Model 2: ABUND ~ L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
# Res.Df RSS Df Sum of Sq F Pr(>F)
#1 46 1714.43
#2 47 2484.44 -1 -770.01 20.660 3.97e-05 ***
((2484.44-1714.43)/1) / (1714.43/(56-9))
## [1] 21.10933
#Verify anova table
M1A <- lm(ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT +
fGRAZE, data = Loyn)
M1B <- lm(ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL ,
data = Loyn)
anova(M1A,M1B)
## Analysis of Variance Table
##
## Model 1: ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
## Model 2: ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 1714.4
## 2 51 2206.1 -5 -491.69 2.6385 0.03528 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M1C <- lm(ABUND ~ 1, data = Loyn)
anova(M1C)
## Analysis of Variance Table
##
## Response: ABUND
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 55 6337.9 115.23
M1C <- lm(ABUND ~ fGRAZE+L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT, data = Loyn)
anova(M1A)
## Analysis of Variance Table
##
## Response: ABUND
## Df Sum Sq Mean Sq F value Pr(>F)
## L.AREA 1 3471.0 3471.0 93.1303 1.247e-12 ***
## L.DIST 1 65.5 65.5 1.7568 0.191565
## L.LDIST 1 136.5 136.5 3.6630 0.061868 .
## YR.ISOL 1 458.8 458.8 12.3109 0.001019 **
## ALT 1 78.2 78.2 2.0979 0.154281
## fGRAZE 4 413.5 103.4 2.7736 0.037992 *
## Residuals 46 1714.4 37.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(M1C)
## Analysis of Variance Table
##
## Response: ABUND
## Df Sum Sq Mean Sq F value Pr(>F)
## fGRAZE 4 3453.7 863.42 23.1665 1.557e-10 ***
## L.AREA 1 1153.8 1153.85 30.9589 1.297e-06 ***
## L.DIST 1 1.4 1.43 0.0383 0.8457
## L.LDIST 1 2.6 2.61 0.0699 0.7926
## YR.ISOL 1 4.5 4.47 0.1198 0.7308
## ALT 1 7.5 7.47 0.2004 0.6565
## Residuals 46 1714.4 37.27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Model 1: ABUND ~ L.AREA
#Model 2: ABUND ~ L.AREA + L.DIST
# Res.Df RSS Df Sum of Sq F Pr(>F)
#1 54 2866.94
#2 53 2801.47 1 65.48 1.2388 0.2707
((2866.94-2801.47)/1) / (2801.47/(56-3))
## [1] 1.238603
#ok
#Analysis of Variance Table
#
#Response: ABUND
# Df Sum Sq Mean Sq F value Pr(>F)
#L.AREA 1 3471.0 3471.0 93.1303 1.247e-12 ***
#L.DIST 1 65.5 65.5 1.7568 0.191565
M2 <- lm(ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT, data = Loyn)
anova(M1, M2)
## Analysis of Variance Table
##
## Model 1: ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
## Model 2: ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 1714.4
## 2 50 2127.9 -4 -413.5 2.7736 0.03799 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step(M1)
## Start: AIC=211.6
## ABUND ~ L.AREA + L.DIST + L.LDIST + YR.ISOL + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - L.DIST 1 0.55 1715.0 209.62
## - YR.ISOL 1 1.81 1716.2 209.66
## - L.LDIST 1 5.19 1719.6 209.77
## - ALT 1 7.47 1721.9 209.85
## <none> 1714.4 211.60
## - fGRAZE 4 413.50 2127.9 215.70
## - L.AREA 1 770.01 2484.4 230.38
##
## Step: AIC=209.62
## ABUND ~ L.AREA + L.LDIST + YR.ISOL + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - YR.ISOL 1 1.73 1716.7 207.68
## - ALT 1 7.07 1722.0 207.85
## - L.LDIST 1 8.57 1723.5 207.90
## <none> 1715.0 209.62
## - fGRAZE 4 413.28 2128.2 213.71
## - L.AREA 1 769.64 2484.6 228.38
##
## Step: AIC=207.68
## ABUND ~ L.AREA + L.LDIST + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - L.LDIST 1 8.32 1725.0 205.95
## - ALT 1 9.71 1726.4 205.99
## <none> 1716.7 207.68
## - fGRAZE 4 848.77 2565.5 222.18
## - L.AREA 1 790.20 2506.9 226.88
##
## Step: AIC=205.95
## ABUND ~ L.AREA + ALT + fGRAZE
##
## Df Sum of Sq RSS AIC
## - ALT 1 5.37 1730.4 204.12
## <none> 1725.0 205.95
## - fGRAZE 4 914.23 2639.3 221.76
## - L.AREA 1 1130.78 2855.8 232.18
##
## Step: AIC=204.12
## ABUND ~ L.AREA + fGRAZE
##
## Df Sum of Sq RSS AIC
## <none> 1730.4 204.12
## - fGRAZE 4 1136.5 2866.9 224.40
## - L.AREA 1 1153.8 2884.2 230.73
##
## Call:
## lm(formula = ABUND ~ L.AREA + fGRAZE, data = Loyn)
##
## Coefficients:
## (Intercept) L.AREA fGRAZE2 fGRAZE3 fGRAZE4 fGRAZE5
## 15.7164 7.2472 0.3826 -0.1893 -1.5916 -11.8938
M3 <- lm(ABUND ~ L.AREA + fGRAZE, data = Loyn)
op <- par(mfrow = c(2, 2))
plot(M3) #standard graphical output
win.graph()
op <- par(mfrow = c(2, 2))
#Check for normality
E <- rstandard(M3)
hist(E)
#qqnorm(E)
#Check for independence: residuals versus individual #explanatory variables
plot(y = E, x = Loyn$L.AREA, xlab = "AREA", ylab = "Residuals")
abline(0,0)
plot(E ~ Loyn$fGRAZE, xlab = "GRAZE", ylab = "Residuals")
abline(0, 0)
par(op)
plot(Loyn$L.AREA,Loyn$ABUND)
#Loyn$SL.AREA<-sort(Loyn$L.AREA)
D1<-data.frame(L.AREA=Loyn$L.AREA[Loyn$GRAZE==1],fGRAZE="1")
D2<-data.frame(L.AREA=Loyn$L.AREA[Loyn$GRAZE==2],fGRAZE="2")
D3<-data.frame(L.AREA=Loyn$L.AREA[Loyn$GRAZE==3],fGRAZE="3")
D4<-data.frame(L.AREA=Loyn$L.AREA[Loyn$GRAZE==4],fGRAZE="4")
D5<-data.frame(L.AREA=Loyn$L.AREA[Loyn$GRAZE==5],fGRAZE="5")
P1<-predict(M3,newdata=D1)
P2<-predict(M3,newdata=D2)
P3<-predict(M3,newdata=D3)
P4<-predict(M3,newdata=D4)
P5<-predict(M3,newdata=D5)
D1<-data.frame(L.AREA = Loyn$L.AREA, fGRAZE = "1")
P1<-predict(M3, newdata = D1)
lines(D1$L.AREA,P1,lty=1)
lines(D2$L.AREA,P2,lty=2)
lines(D3$L.AREA,P3,lty=3)
lines(D4$L.AREA,P4,lty=4)
lines(D5$L.AREA,P5,lty=5)
library(mgcv)
AM1<-gam(ABUND~s(L.AREA)+s(L.DIST)+s(L.LDIST)+
s(YR.ISOL)+s(ALT)+fGRAZE, data = Loyn)
anova(AM1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ABUND ~ s(L.AREA) + s(L.DIST) + s(L.LDIST) + s(YR.ISOL) + s(ALT) +
## fGRAZE
##
## Parametric Terms:
## df F p-value
## fGRAZE 4 3.281 0.0202
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(L.AREA) 3.030 3.650 8.709 7.16e-05
## s(L.DIST) 2.524 3.160 0.541 0.645
## s(L.LDIST) 1.000 1.000 0.219 0.642
## s(YR.ISOL) 2.842 3.376 1.384 0.281
## s(ALT) 1.000 1.000 0.720 0.401
AM2<-gam(ABUND ~ s(L.AREA, bs = "cs") + s(L.DIST, bs = "cs") +
s(L.LDIST,bs = "cs") + s(YR.ISOL, bs = "cs") +
s(ALT, bs = "cs") + fGRAZE, data = Loyn)
anova(AM2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ABUND ~ s(L.AREA, bs = "cs") + s(L.DIST, bs = "cs") + s(L.LDIST,
## bs = "cs") + s(YR.ISOL, bs = "cs") + s(ALT, bs = "cs") +
## fGRAZE
##
## Parametric Terms:
## df F p-value
## fGRAZE 4 4.092 0.00674
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(L.AREA) 2.369e+00 9.000e+00 4.033 1.41e-06
## s(L.DIST) 2.991e+00 9.000e+00 0.431 0.270
## s(L.LDIST) 1.735e-08 9.000e+00 0.000 0.782
## s(YR.ISOL) 2.669e+00 9.000e+00 0.450 0.199
## s(ALT) 1.181e-08 9.000e+00 0.000 0.593
AM3 <- gam(ABUND ~ s(L.AREA, bs = "cs") + fGRAZE, data = Loyn)
plot(AM3)
E.AM3 <- resid(AM3)
Fit.AM3 <- fitted(AM3)
plot(x = Fit.AM3, y = E.AM3, xlab = "Fitted values",
ylab = "Residuals")
M3<-lm(ABUND ~ L.AREA + fGRAZE, data = Loyn)
AM3<-gam(ABUND ~ s(L.AREA, bs = "cs") + fGRAZE, data = Loyn)
anova(M3, AM3)
## Analysis of Variance Table
##
## Model 1: ABUND ~ L.AREA + fGRAZE
## Model 2: ABUND ~ s(L.AREA, bs = "cs") + fGRAZE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 50.000 1730.4
## 2 48.058 1520.0 1.9415 210.37 3.4258 0.04196 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1