Markdown Author: Jessie Bell, 2023
Libraries Used: lattice, GGally, nlme, mgcv
Answers: plum
nereisData <- read.table("Nereis.txt", header = T, sep="\t")
names(nereisData)
## [1] "concentration" "biomass" "nutrient"
Cleveland dotplots are useful for detecting outliers and violations of homogeneity in a dataset. Here the points are not far from one another, so there is no indication of outliers. There does appear to be a violation of homogeneity, though: the spread of the observations for nutrient 1 (circles) is much tighter than for nutrients 2 and 3. You can also see that the mean value for nutrient 2 seems larger than for the other two, so we can expect nutrient to be a key player in our regression.
dotchart(nereisData$concentration,
groups=factor(nereisData$nutrient),
ylab = "Order of Observations",
xlab = "Concentration",
main = "Cleveland Dotplot",
pch = nereisData$nutrient)
Figure 1: Conditional Cleveland dotplot of Nereis concentration by nutrient (values 1, 2, and 3). Different symbols are used for each nutrient, and the graph suggests a violation of homogeneity. The x-axis shows the value of a particular observation; the y-axis shows the observation number.
Each panel below is a scatterplot of two variables. The graph does not show any obvious relationships between concentration and biomass, but there seems to be a clear relationship between concentration and nutrients. You can make even more intricate pairplots by using ?pairs in R.
pairs(nereisData) #not sure how nutrient is so different from biomass. Both look similar to me
Figure 2: Pairplots for concentration, biomass, and nutrient. Each panel is a scatterplot between two variables. It is also possible to add regression or smoothing lines in each panel. In general, it does not make sense to add a nominal variable (nutrient) to a pairplot. In this case, there are only two explanatory variables; hence it does not do any harm to include nutrient.
Routinely apply this to your data too, whenever you can. As with the Cleveland dotplot, nutrient level 2 seems higher than the others. You can also see less spread for nutrient 3, indicating a likely violation of homogeneity in the data.
boxplot(concentration ~ factor(nutrient),
varwidth = T, #tells the function to create width of boxes proportional to sample size per level
xlab = "nutrients",
main = "Boxplot of concentration condition on nutrient", ylab = "concentration", col = c("red", "steelblue", "plum"), data= nereisData)
Figure 3: Boxplot of concentration conditional on the nominal variable, nutrient. The horizontal line in each box is the median; the boxes define the hinges (the 25th to 75th percentiles), and the whiskers extend to 1.5 times the hinge. Points outside this interval are plotted as dots. Such points may (or may not) be outliers; one should not label them as outliers purely on the basis of a boxplot! The width of the boxes is proportional to the number of observations per class.
The Nereis data have only two explanatory variables, so xyplots are less appropriate for them. We will instead use a different dataset to show this part: the isotopic composition of teeth from whales stranded in Scotland. This graph is used to help determine whether all whales have similar nitrogen-age relationships. Fig. 4 suggests that some do!
whaleData <- read.table("teethNitrogen.txt", header = T) #left out the seperate part that was used in Nereis data above. CRAY!
names(whaleData)
## [1] "X15N" "Age" "Tooth"
library(lattice)
xyplot(X15N ~ Age | factor(Tooth),
type = "l",
xlab = "Estimated age",
ylab = expression(paste(delta^{15}, "N")),
strip = function(bg="white", ...)
strip.default(bg="white", ...),
scales = list (tck = c(-1, 0)), #makes the ticks go inside graph
data=whaleData)
Figure 4: Nitrogen concentration in teeth versus age for each of the 11 whales stranded in Scotland. The graph was made using xyplot from the lattice package.
In the second step of data analysis, we have to apply a model. The mother of all models is without a doubt the linear regression model. The bivariate linear regression model is defined by:
\(Y_i = \alpha + \beta X_i + \epsilon_i\), where \(\epsilon_i \sim N(0, \sigma^2)\)
\(Y_i\) is the response variable and \(X_i\) is the explanatory variable. The unexplained information is captured in the residuals \(\epsilon_i\), which are assumed to be normally distributed with expectation 0 and variance \(\sigma^2\). The parameters \(\alpha\) and \(\beta\) are the population intercept and slope; they are not known values. In practice we take a sample and obtain estimates a and b, together with confidence intervals. A confidence interval tells us how often, over repeated sampling, the interval would contain the true population value. In most cases, \(\beta\) (the slope) is of primary interest, as it tells us whether there is a relationship between X and Y.
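For concreteness, here is a minimal sketch of fitting such a bivariate model in R with lm(), using the Nereis variables purely as placeholders (this is not the analysis carried out below):
M0 <- lm(concentration ~ biomass, data = nereisData) #illustrative bivariate regression
summary(M0) #estimates a and b with standard errors and p-values
confint(M0) #confidence intervals for alpha and beta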
Multiple linear regression is an extension of bivariate regression and is written as:
\(Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_M X_{Mi} + \epsilon_i\), where \(\epsilon_i \sim N(0, \sigma^2)\)
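A matching sketch of a multiple regression call, again with the Nereis variables purely as placeholders (nutrient is entered as a factor because it is nominal):
M0b <- lm(concentration ~ biomass + factor(nutrient), data = nereisData) #illustrative only
summary(M0b)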
Always apply the simplest statistical technique on your data, but ensure that it is applied correctly. The next parts of this will go through examples of violations of assumptions, and what to do if your data violate the assumptions.
Assumptions of Linear Models
1. Normality
This applies to the residuals; some experts argue that normality is unnecessary if your sample size is large enough.
2. Homogeneity/Heteroscedasticity
Violations of homogeneity are common in environmental data; spreads can change across seasons, between males and females, etc. The easiest way to deal with this violation is to transform the data; this is called mean-variance stabilising. Formal tests of normality and equal variance themselves assume normality, so it is preferred to judge from a visualization of the data rather than from statistical tests (a minimal sketch of such graphical checks appears after this list).
3. Fixed X (X represents explanatory variables)
This is the assumption that the explanatory variables are deterministic. Chapter 7 discusses ways to brute-force these data (bootstrapping).
4. Independence
You can have violations both in the data and in the nature of the data. A violation in the data occurs when the value of Y at one \(X_i\) is influenced by the value at another \(X_i\); this can sometimes be fixed with a transformation that linearises the relationship. Violations in the nature of the data happen because some things are naturally dependent on one another. For example, what you eat right now depends on what you ate 10 minutes ago, and if we have a large number of birds at time t, it is likely we will also have a large number of birds at time t - 1. The same holds for spatial locations close to each other and for sampling pelagic bioluminescence along a depth gradient (see example 4).
5. Correct Model Selection
It is assumed that you have selected the best and simplest model to explain your data.
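Here is the minimal sketch of graphical checks for assumptions 1, 2, and 4, applied to an illustrative refit of the Nereis model from above. Note that the acf plot is only meaningful when the observations have a natural ordering (e.g., in time or space); it is shown here purely to illustrate the call:
M.check <- lm(concentration ~ biomass + factor(nutrient), data = nereisData) #illustrative fit
E.check <- resid(M.check)
op <- par(mfrow = c(1, 3))
qqnorm(E.check); qqline(E.check) #1. normality of the residuals
plot(fitted(M.check), E.check, xlab = "Fitted values", ylab = "Residuals") #2. homogeneity of spread
acf(E.check, main = "ACF of residuals") #4. independence; only meaningful for ordered data
par(op)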
This example violates the assumptions of normality and homogeneity (i.e., the residuals are heteroscedastic).
The ash-free dry weight (AFD) of 398 wedge clams (Donax hanleyanus) is plotted against length for six different months. These data were measured on a beach in Argentina in 1997. The initial scatterplot of the data (Fig. 5) shows a clear non-linear relationship, so both AFD and length were log-transformed to linearise it. Note that this transformation is only necessary if we want to apply linear regression; the untransformed data can be analysed with additive modelling (Chapter 3).
clamData <- read.table("Clams.txt", header = T)
a <- names(clamData) #AFD stands for Ash Free Dry Weight
clamData2 <- clamData[1:3]
library(GGally)
ggpairs(clamData2) #AFD needs to be transformed; if we do that we will also need to log-transform length
Figure 5: A correlogram (pairs plot made with ggpairs) showing possible relationships at the data exploration stage.
clamData2$log.length <- log(clamData2$LENGTH)
clamData2$log.AFD <- log(clamData2$AFD)
#we also need to change months into factors
clamData2$f.Month <- as.factor(clamData2$MONTH)
coplot(log.AFD ~ log.length |f.Month, data = clamData2)
Figure 6: Coplot indicating that the log-transformed AFD and length data follow a clear linear relationship in all months. It seems sensible to apply linear regression to model this relationship. Due to different stages of the life cycle of wedge clams, the biomass-length relationship may change between months, especially before and after the spawning period in September - October and February - March. This justifies adding a length-month interaction term. This model is also known as an analysis of covariance (ANCOVA).
M1 <- lm(log.AFD ~ log.length * f.Month, data = clamData2)
summary(M1)
##
## Call:
## lm(formula = log.AFD ~ log.length * f.Month, data = clamData2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73851 -0.06425 0.01041 0.08376 0.42588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.48318 0.20049 -57.274 < 2e-16 ***
## log.length 3.03740 0.06508 46.674 < 2e-16 ***
## f.Month3 0.57761 1.68290 0.343 0.73162
## f.Month4 0.03783 0.21698 0.174 0.86167
## f.Month9 -0.61448 0.35031 -1.754 0.08020 .
## f.Month11 0.94007 0.34068 2.759 0.00607 **
## f.Month12 1.68056 0.97240 1.728 0.08474 .
## log.length:f.Month3 -0.29988 0.72241 -0.415 0.67829
## log.length:f.Month4 -0.07172 0.07232 -0.992 0.32197
## log.length:f.Month9 0.15166 0.13162 1.152 0.24993
## log.length:f.Month11 -0.30605 0.11597 -2.639 0.00865 **
## log.length:f.Month12 -0.57034 0.31532 -1.809 0.07127 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1284 on 386 degrees of freedom
## Multiple R-squared: 0.9869, Adjusted R-squared: 0.9866
## F-statistic: 2652 on 11 and 386 DF, p-value: < 2.2e-16
drop1(M1, test="F") #the drop 1 function compares the model to the model with the interaction term dropped. The F-test is used to compare the residual sums of squares of both the models.
## Single term deletions
##
## Model:
## log.AFD ~ log.length * f.Month
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 6.3592 -1622.3
## log.length:f.Month 5 0.22558 6.5848 -1618.5 2.7385 0.01906 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The row labelled <none> gives the output of the full model, and the last row shows the model without the interaction. Note that this model is nested within the full model. The F-statistic shows that the interaction is significant at the 5% level. However, before trusting the F-statistic and using the "magic 5%" as a rejection rule, we need to be confident that all model assumptions are valid. Hence, we enter the next stage of the analysis: model validation.
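As an aside, the same comparison can be made explicitly by fitting the nested model without the interaction and handing both models to anova(); this minimal sketch should reproduce the drop1() F-test:
M1.noInt <- lm(log.AFD ~ log.length + f.Month, data = clamData2) #model without the interaction
anova(M1.noInt, M1) #F-test comparing the residual sums of squares of the nested models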
The standard model validation graphs are:
1. Residuals vs. fitted values, to verify homogeneity
2. A QQ-plot or histogram of the residuals, to verify normality
3. Residuals vs. each explanatory variable, to check independence
op <- par(mfrow = c(2, 2), mar = c(5, 4, 1, 2)) #mar means margins
plot(M1, add.smooth=F, which = 1)
E <- resid(M1)
hist(E, xlab = "Residuals", main = "", col="#f7b1bf")
plot(clamData2$log.length, E, xlab = "Log(Length)", ylab = "Residuals")
plot(clamData2$f.Month, E, xlab = "Month", ylab = "Residuals", col = c("darkgreen", "steelblue", "tomato3", "#395b50", "#bc8da0", "red"))
Figure 7: [A] Fitted values versus residuals (homogeneity). [B] Histogram of the residuals (normality). [C] Residuals versus length (independence). [D] Residuals versus month. Here you can see minor evidence of non-normality (B) and of heteroscedasticity (D). We will also try a non-visual tool for this below.
par(op)
Two formal tests for equal variances are:
1. Bartlett's test (assumes normality)
2. The Scheffe-Box test, a.k.a. Hartley's \(F_{max}\)
The observations for log(length) in Fig. 7C appear to have a different spread before and after 2.275 (although I find this hard to see myself), so we test this formally below.
E1 <- E[clamData2$log.length <= 2.275]
E2 <- E[clamData2$log.length > 2.275]
var.test(E1, E2) #H0: the ratio of the two variances equals 1 (equal variances); Ha: the ratio is not equal to 1
##
## F test to compare two variances
##
## data: E1 and E2
## F = 1.0173, num df = 23, denom df = 373, p-value = 0.8839
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6007081 2.0264106
## sample estimates:
## ratio of variances
## 1.017268
bartlett.test(E, clamData2$f.Month) #very sensitive to departures from normality
##
## Bartlett test of homogeneity of variances
##
## data: E and clamData2$f.Month
## Bartlett's K-squared = 34.286, df = 5, p-value = 2.089e-06
#check normality
hist((E[clamData2$MONTH == 12]), col="plum") #bimodal histogram
The conclusion of the linear regression (or ANCOVA) model is that there is a significant relationship between biomass, length, and month, with a weak but significant interaction between length and month. However, the p-value for this interaction term is borderline (0.02), and we would have preferred to see no patterns at all in the residuals. Both the tests and the graphical output give us reason to doubt the suitability of this model for these data. In Chapter 4, we discuss extensions of the linear regression model that can be used to test whether we need different variances per month.
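As a preview of that Chapter 4 extension (and assuming the nlme package is installed), a sketch of a GLS fit that allows a separate residual variance per month looks like this; comparing AICs with an equal-variance refit hints at whether the extra variance parameters are worth it:
library(nlme)
M1.gls <- gls(log.AFD ~ log.length * f.Month,
weights = varIdent(form = ~ 1 | f.Month), #separate residual variance per month
data = clamData2)
M1.equal <- gls(log.AFD ~ log.length * f.Month, data = clamData2) #equal variances, refitted with gls so the AICs are comparable
AIC(M1.equal, M1.gls)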
This example violates the assumption of independence.
Figure 4 (the xyplot from the lattice package) shows that Moby's isotope ratios increase with age, and a linear regression was applied to model this pattern.
whaleData <- read.table("TeethNitrogen.txt", header=T)
As before, we check:
1. Residuals vs. fitted values, to verify homogeneity
2. A QQ-plot or histogram of the residuals, to verify normality
3. Residuals vs. each explanatory variable, to check independence
TN <- whaleData
M2 <- lm(X15N ~ Age, subset = (TN$Tooth == "Moby"), data = TN)
op <- par(mfrow = c(2, 2))
plot(M2, add.smooth=F)
Figure 8: [A] and [C] show residuals versus fitted values. Note the clear pattern! [B] QQ-plot for normality. [D] Standardized residuals versus leverage, with Cook's distance contours superimposed; in this case the Cook's values are small and the contours cannot be clearly seen. The QQ-plot looks good and the residuals look normally distributed. Panel D shows some potentially influential observations; leverage measures whether an observation has extreme values of the explanatory variables. Panel A, however, shows a clear violation of independence.
par(op)
summary(M2)
##
## Call:
## lm(formula = X15N ~ Age, data = TN, subset = (TN$Tooth == "Moby"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07102 -0.28706 0.04346 0.33820 1.13724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.748940 0.163559 71.83 <2e-16 ***
## Age 0.113794 0.006186 18.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4859 on 40 degrees of freedom
## Multiple R-squared: 0.8943, Adjusted R-squared: 0.8917
## F-statistic: 338.4 on 1 and 40 DF, p-value: < 2.2e-16
\(y_i = 11.748 + 0.113 \times \mathrm{Age}_i\)
The estimated slope and intercept are significantly different from 0 at the 5% level. The model explains 89% of the variation; the estimator for \(\sigma\) is equal to s = 0.486. But the problem is that we still have to reject this model because there is a clear violation of independence. Solutions will be given in Chapters 6 and 7.
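One quick way to see this dependence numerically is the auto-correlation function of the residuals, taken in the order the observations appear in the data (assumed here to be ordered by increasing age within the tooth); strong correlation at short lags confirms the violation:
E.moby <- resid(M2)
acf(E.moby, main = "Auto-correlation of the Moby residuals") #assumes the rows are ordered by age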
This example violates the assumption of homogeneity.
Nereis <- read.table("Nereis.txt", header=T)
Nereis$fbiomass <- factor(Nereis$biomass)
Nereis$fnutrient <- factor(Nereis$nutrient)
M3 <- lm(concentration ~ fbiomass * fnutrient,
data = Nereis)
drop1(M3, test = "F") #The numerical output obtained by the drop1 command is printed below and shows that the biomass-nutrient interaction term is significant at the 5% level.
## Single term deletions
##
## Model:
## concentration ~ fbiomass * fnutrient
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 13.630 -23.746
## fbiomass:fnutrient 8 11.553 25.183 -12.121 3.1785 0.009899 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
op <- par(mfrow = c(1, 2))
plot(resid(M3) ~ Nereis$fbiomass, xlab = "Biomass",
ylab = "Residuals")
plot(resid(M3) ~ Nereis$fnutrient,
xlab = "Nutrient", ylab = "Residuals")
Figure 9: The boxplot of residuals vs. biomass shows a clear violation of homogeneity. Applying a transformation to concentration may solve this problem, but the disadvantage of a transformation is that we change the type of relationship between the response and explanatory variables. So again, we need to reject the linear regression model for these data.
par(op)
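As with the clam example, an alternative to transforming concentration is to allow a different residual spread per biomass level, the route taken in Chapter 4; a hedged sketch with nlme (not the analysis performed in the book at this point):
library(nlme)
M3.gls <- gls(concentration ~ fbiomass * fnutrient,
weights = varIdent(form = ~ 1 | fbiomass), #separate variance per biomass level
data = Nereis)
summary(M3.gls)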
This example violates the assumption of linearity.
ISIT <- read.table("ISIT.txt", header = T)
ISIT$f.Station <- as.factor(ISIT$Station)
xyplot(Sources ~ SampleDepth | f.Station, data = ISIT, xlab = "Sample Depth", ylab = "Sources",
strip = function(bg = 'white', ...)
strip.default(bg = 'white', ...),
panel = function(x, y) {
panel.grid(h = -1, v = 2)
I1 <- order(x)
llines(x[I1], y[I1], col = 1)})
There is no point in applying a linear regression model with Sources as the response variable and Depth and Station as explanatory variables (plus an interaction between them) because the relationships are not linear and the variation per station differs. Perhaps it is better to consider station as a random effect (Chapter 5). Another problem is that depth can be seen as a spatial gradient. Hence, there may be spatial correlation along the depth gradient. In Chapter 5, we discuss random effect models, and in Chapter 7 spatial correlation for smoothing models. A full analysis of this data set is presented in Chapter 17.
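To give a flavour of where Chapters 3 and 5 head (and assuming the mgcv package is available), a sketch of a smoother for depth plus a random intercept per station could look like the following; this only illustrates the syntax and is not the model actually used in Chapter 17:
library(mgcv)
M4 <- gamm(Sources ~ s(SampleDepth), #smooth depth effect
random = list(f.Station = ~ 1), #random intercept per station
data = ISIT)
summary(M4$gam) #the smooth term
summary(M4$lme) #the random-effect part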
The data exploration should filter out any typing mistakes (typos), identify possible outliers and the need for a data transformation, and provide some ideas about the follow-up analyses. Typos should obviously be corrected before continuing with any analysis, but do not apply a transformation to the response variable yet unless there are strong reasons to do so; some of the methods discussed in later chapters can deal with (groups of) extreme observations or heterogeneity. Many books will tell you to routinely apply a data transformation to linearise the relationship. Well, if you are particularly fond of linear regression, then yes, but (generalised) additive (mixed) modelling is designed precisely to model non-linear relationships. Even heterogeneity, such as that encountered in Fig. 9, can be dealt with (as will be explained in Chapter 4), so you do not need to apply a transformation to stabilise the mean-variance relationship, provided you are willing to read the rest of this book. The only thing we cannot solve with any of the techniques discussed in later chapters is observations with extreme explanatory variables. If this happens with your data, then a transformation of the explanatory variable(s) could well be justified at this stage.
The original aim of this chapter was to simply illustrate the linear regression model for an ecological data set and discuss the numerical and graphical output. However, in preparing this book, we had access to about 15 data sets, and in Zuur et al. (2007), we had access to a further 20 data sets. In none of these real data sets could we find a non-trivial example for a linear regression model for which all assumptions held. This clearly identifies the limitation of linear regression for analysing ecological data. Hence, our choice of the title of this chapter.
So, what can we do? The problem of heterogeneity can be solved by either allowing for different variances in the linear regression model (using generalised least squares estimation) or using a different distribution and model structure (Poisson, negative binomial, and Gamma distributions in GLM); the dependence problem requires models that allow for more flexibility than regression (e.g. smoothing methods) and a model for the error structure (e.g. temporal or spatial correlation, or correlation along another gradient such as age or depth). We will also need to consider nested data and random effects. Taken together, all these techniques lead to a mixed modelling approach and, if combined with GLM and GAM, to generalised linear mixed modelling (GLMM) and generalised additive mixed modelling (GAMM).
Chapter 4 shows how we can deal with heterogeneity in linear regression and smoothing models, random effects for nested data are introduced in Chapter 5, and temporal and spatial correlation structures are discussed in Chapters 6 and 7. In Chapter 8, we introduce different distributions for count data, binary data, proportional data, and zero inflated count data. These are then used in Chapters 9, 10, and 11. Finally, Chapters 12 and 13 discuss how we can incorporate correlation structures and random effects in models for count data, binary data, and proportional data. See Fig. 2.12 for a schematic overview.
Before reading on, we strongly advise reading Appendix A, as it provides a more detailed discussion of linear regression. It is essential that you are familiar with all the steps discussed in that appendix.
[1] Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. & Smith, G. M. Mixed Effects Models and Extensions in Ecology with R (Springer, 2010).