09/11/2020
Any statistical analysis makes assumptions.
If assumptions are violated, you may be using the wrong analysis and may draw the wrong conclusion.
You can’t assume that the same assumptions that seemed OK last time will be OK this time – every dataset is different and assumptions need to be checked.
Residual plots are a good way to check assumptions.
Invertebrates are often used as biological controls, but their effectiveness may be reduced by interactions with other species in the food web. Ingo looked at how bird exclusion might change on the effectiveness of biological control of aphids in an oat field in Germany (Grass et al., 2017).
He established eight plots, four of which were randomly allocated a plastic netting treatment to exclude birds. He measured number of aphids in each plot a few days after the treatment was applied, and again six weeks later.
He wants to know:
Is there an effect of the netting treatment on changes in aphid numbers across the sampling times?
The data and analysis functions described in this talk are available in the ecostats package.
An early version of the ecostats package is available on CRAN but you want to use version 0.1.5 or later, available on github:
library(devtools)
## Loading required package: usethis
install_github("dwarton/ecostats")
## Skipping install of 'ecostats' from a github remote, the SHA1 (bac6564c) has not changed since last install. ## Use `force = TRUE` to force installation
library(ecostats)
data(aphidsBACI)
plot(logcount~interaction(Treatment,Time), xlab="",
names=c("Excluded/Before","Control/Before","Excluded/After",
"Control/After"),col=c(4,2,4,2), data=aphidsBACI)
aphidLM = lm(logcount~Plot+Treatment*Time, data=aphidsBACI) par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5)) plot(aphidLM,which=1:2)
The points don’t all lie on the line on the QQ-plot.
Do we have a problem with our assumptions?
Residual plots never look perfect, and we should always expect some deviation from the expected trends.
e.g. Plots for random normal samples:
par(mfrow=c(1,4),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5))
for (i in 1:4) { qqnorm(rnorm(20)); abline(c(0,1),lty=3,col="grey")}
plotenvelope functionI have written a function (available in the ecostats package) that will take (pretty much) any fitted object, simulate new data, and repeatedly refit the model to the data and reconstruct residual plots.
A global envelope is constructed around these simulated residual plots.
global = all points should fall inside the envelope if assumptions are satisfied
plotenvelopepar(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5)) plotenvelope(aphidLM)
## Registered S3 method overwritten by 'spatstat': ## method from ## print.boxx cli
Sometimes people add a smoother, sometimes with confidence bands, to a residual plot. However, this gives pointwise confidence -> multiple testing issue, often falsely declares a problem.
resGAM = mgcv::gam(residuals(aphidLM)~s(fitted(aphidLM))) plot(resGAM,xlab="Fitted values",ylab="Residuals",res=TRUE) abline(c(0,0),col="red",lty=3)
Simulation envelopes around a function, or a set of points, need to be global, otherwise they would be hard to interpret (and you would “see” problems with assumptions too often).
The plotenvelope function uses a tricky modern method (Myllymäki et al 2017) to get global envelopes out of simulated data. (The method was originally designed for K functions in point process models but I adapted it to residual plots.)
The smoother should always be inside the envelope (but the data need not be)
par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5)) plotenvelope(aphidLM,add.smooth=TRUE,which=c(1,3))
Invertebrates are often used as biological controls, but their effectiveness may be reduced by interactions with other species in the food web. Ingo looked at how bird exclusion might change on the effectiveness of biological control of aphids in an oat field in Germany (Grass et al., 2017).
He established eight plots, four of which were randomly allocated a plastic netting treatment to exclude birds. He measured number of aphids in each plot a few days after the treatment was applied, and again six weeks later.
He wants to know:
Is there an effect of the netting treatment on changes in aphid numbers across the sampling times?
anova(aphidLM)
## Analysis of Variance Table ## ## Response: logcount ## Df Sum Sq Mean Sq F value Pr(>F) ## Plot 7 0.8986 0.1284 0.4603 0.833357 ## Time 1 5.4675 5.4675 19.6038 0.004434 ** ## Treatment:Time 1 0.7397 0.7397 2.6522 0.154527 ## Residuals 6 1.6734 0.2789 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is no evidence of an interaction :(
That is, we have no evidence that bird exclusion decreases aphid numbers
Edgar (Anderson 1936) was studying Iris flowers. For 50 flowers on each of three Iris species, he took four measurements – length and width of petals, and length and width of sepals.
He wants to know:
How do these three species of Iris differ in terms of flower size and shape?
data(iris) head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa
There are four response variables here, each measuring something different about Iris flowers, so we want to fit a multivariate linear model.
iris$Y = with(iris, cbind(Sepal.Length,Sepal.Width,Petal.Length,
Petal.Width) )
pairs(iris$Y, col=iris$Species)
If you fit a linear model using lm with a matrix response, this fits a multivariate linear model (of type mlm)
iris.mlm = lm(Y~Species,data=iris)
try(plot(iris.mlm))
## Error : 'plot.mlm' is not implemented yet
plot doesn’t work for multivariate linear models!
mlm objects?Because it is not immediately obvious how to do residual plots for multivariate linear models.
If each response is normally distributed, that does not necessarily mean that all species jointly are multivariate normal.
So if each response satisfies a linear model, that doesn’t necessarily mean that all assumptions of the multivariate linear model are satisfied.
mlm objectsBut data do come from a multivariate linear model if each response satisfies a linear model as a function of all other responses variables as well as X variables.
The plotenvelope function knows about this and will produce a plot of these full conditional residuals (calculated using a new function cresiduals).
try(plot(iris.mlm))
## Error : 'plot.mlm' is not implemented yet
par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5)) plotenvelope(iris.mlm, n.sim=499)
There is some suggestion of a fan-shape, could try log-transforming…
plotenvelope just tells you if there is evidence of an assumption violation – if it is “significant”.
But with assumption checks, what matters is not if it is “significant”, what matters is how big the violations are. (Small violations of assumptions only have small implications.)
So when checking assumptions, don’t just think about whether data are in/out of the envelope, think about how far they are from what was expected…
plotenvelope(iris.mlm, which=3, add.smooth=TRUE)
The trend on the scale-location plot goes from about 0.6 up to 0.9 (50% increase) which is slightly concerning…
log-transform the iris data and repeat the above analyses.
Does log-transformed data better satisfy multivariate linear model assumptions?
Add a trend to the scale-location plot. How does the range of variation in the scale of residuals compare to what was seen for untransformed data?
Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59: 2–5.
Grass et al. (2017) Insectivorous birds disrupt biological control of cereal aphids. Ecology, 98: 1583-90.
Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H. and Hahn, U. (2017), Global envelope tests for spatial processes. J. R. Stat. Soc. B, 79: 381-404.