plotenvelope – making sense of residual plots using simulation envelopes

09/11/2020

Why residual plots

Why residual plots?

Any statistical analysis makes assumptions.

If assumptions are violated, you may be using the wrong analysis and may draw the wrong conclusion.

You can’t assume that the same assumptions that seemed OK last time will be OK this time – every dataset is different and assumptions need to be checked.

Residual plots are a good way to check assumptions.

Biological control of aphids

Example - Can birds reduce the effectiveness of a biological control?

Invertebrates are often used as biological controls, but their effectiveness may be reduced by interactions with other species in the food web. Ingo looked at how bird exclusion might change on the effectiveness of biological control of aphids in an oat field in Germany (Grass et al., 2017).

He established eight plots, four of which were randomly allocated a plastic netting treatment to exclude birds. He measured number of aphids in each plot a few days after the treatment was applied, and again six weeks later.

He wants to know:

Is there an effect of the netting treatment on changes in aphid numbers across the sampling times?

Get the ecostats package

The data and analysis functions described in this talk are available in the ecostats package.

An early version of the ecostats package is available on CRAN but you want to use version 0.1.5 or later, available on github:

library(devtools)

## Loading required package: usethis

install_github("dwarton/ecostats")

## Skipping install of 'ecostats' from a github remote, the SHA1 (bac6564c) has not changed since last install.
##   Use `force = TRUE` to force installation

library(ecostats)

Load and plot the aphid data…

data(aphidsBACI)
plot(logcount~interaction(Treatment,Time), xlab="",
     names=c("Excluded/Before","Control/Before","Excluded/After",
             "Control/After"),col=c(4,2,4,2), data=aphidsBACI)

Checking linear model assumptions

aphidLM = lm(logcount~Plot+Treatment*Time, data=aphidsBACI)
par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5))
plot(aphidLM,which=1:2)

The points don’t all lie on the line on the QQ-plot.
Do we have a problem with our assumptions?

But they are never all on the line

Residual plots never look perfect, and we should always expect some deviation from the expected trends.

e.g. Plots for random normal samples:

par(mfrow=c(1,4),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5))
for (i in 1:4) { qqnorm(rnorm(20)); abline(c(0,1),lty=3,col="grey")}

Compare our data to random samples!

The `plotenvelope` function

I have written a function (available in the ecostats package) that will take (pretty much) any fitted object, simulate new data, and repeatedly refit the model to the data and reconstruct residual plots.

A global envelope is constructed around these simulated residual plots.

global = all points should fall inside the envelope if assumptions are satisfied

Using `plotenvelope`

par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5))
plotenvelope(aphidLM)

## Registered S3 method overwritten by 'spatstat':
##   method     from
##   print.boxx cli

The global thing

Sometimes people add a smoother, sometimes with confidence bands, to a residual plot. However, this gives pointwise confidence -> multiple testing issue, often falsely declares a problem.

resGAM = mgcv::gam(residuals(aphidLM)~s(fitted(aphidLM)))
plot(resGAM,xlab="Fitted values",ylab="Residuals",res=TRUE)
abline(c(0,0),col="red",lty=3)

The global thing

Simulation envelopes around a function, or a set of points, need to be global, otherwise they would be hard to interpret (and you would “see” problems with assumptions too often).

The plotenvelope function uses a tricky modern method (Myllymäki et al 2017) to get global envelopes out of simulated data. (The method was originally designed for K functions in point process models but I adapted it to residual plots.)

Using plotenvelope on a smoother

The smoother should always be inside the envelope (but the data need not be)

par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5))
plotenvelope(aphidLM,add.smooth=TRUE,which=c(1,3))

Example - Can birds reduce the effectiveness of a biological control?

He wants to know:

Is there an effect of the netting treatment on changes in aphid numbers across the sampling times?

Test for an interaction…

anova(aphidLM)

## Analysis of Variance Table
## 
## Response: logcount
##                Df Sum Sq Mean Sq F value   Pr(>F)   
## Plot            7 0.8986  0.1284  0.4603 0.833357   
## Time            1 5.4675  5.4675 19.6038 0.004434 **
## Treatment:Time  1 0.7397  0.7397  2.6522 0.154527   
## Residuals       6 1.6734  0.2789                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is no evidence of an interaction :(

That is, we have no evidence that bird exclusion decreases aphid numbers

Multivariate linear models

How do Iris flowers differ across species?

Edgar (Anderson 1936) was studying Iris flowers. For 50 flowers on each of three Iris species, he took four measurements – length and width of petals, and length and width of sepals.

He wants to know:

How do these three species of Iris differ in terms of flower size and shape?

How do Iris flowers differ across species?

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

There are four response variables here, each measuring something different about Iris flowers, so we want to fit a multivariate linear model.

Plot the data!

iris$Y = with(iris, cbind(Sepal.Length,Sepal.Width,Petal.Length,
                    Petal.Width) )
pairs(iris$Y, col=iris$Species)

Fit a multivariate linear model and check assumptions

If you fit a linear model using lm with a matrix response, this fits a multivariate linear model (of type mlm)

iris.mlm = lm(Y~Species,data=iris)

try(plot(iris.mlm))

## Error : 'plot.mlm' is not implemented yet

plot doesn’t work for multivariate linear models!

Why no plot function for `mlm` objects?

Because it is not immediately obvious how to do residual plots for multivariate linear models.

If each response is normally distributed, that does not necessarily mean that all species jointly are multivariate normal.

So if each response satisfies a linear model, that doesn’t necessarily mean that all assumptions of the multivariate linear model are satisfied.

plotenvelope can handle `mlm` objects

But data do come from a multivariate linear model if each response satisfies a linear model as a function of all other responses variables as well as X variables.

The plotenvelope function knows about this and will produce a plot of these full conditional residuals (calculated using a new function cresiduals).

Check assumptions (plot doesn’t work)

try(plot(iris.mlm))

## Error : 'plot.mlm' is not implemented yet

Check assumptions (plotenvelope)

par(mfrow=c(1,2),mgp=c(1.75,0.75,0),mar=c(3,3,2,0.5))
plotenvelope(iris.mlm, n.sim=499)

There is some suggestion of a fan-shape, could try log-transforming…

Assumption violations: how big?

plotenvelope just tells you if there is evidence of an assumption violation – if it is “significant”.

But with assumption checks, what matters is not if it is “significant”, what matters is how big the violations are. (Small violations of assumptions only have small implications.)

So when checking assumptions, don’t just think about whether data are in/out of the envelope, think about how far they are from what was expected…

Assumption violations: how big?

plotenvelope(iris.mlm, which=3, add.smooth=TRUE)

The trend on the scale-location plot goes from about 0.6 up to 0.9 (50% increase) which is slightly concerning…

Exercise

log-transform the iris data and repeat the above analyses.

Does log-transformed data better satisfy multivariate linear model assumptions?

Add a trend to the scale-location plot. How does the range of variation in the scale of residuals compare to what was seen for untransformed data?

References

Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59: 2–5.

Grass et al. (2017) Insectivorous birds disrupt biological control of cereal aphids. Ecology, 98: 1583-90.

Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H. and Hahn, U. (2017), Global envelope tests for spatial processes. J. R. Stat. Soc. B, 79: 381-404.

Why residual plots

Why residual plots?

Biological control of aphids

Example - Can birds reduce the effectiveness of a biological control?

Get the ecostats package

Load and plot the aphid data…

Checking linear model assumptions

But they are never all on the line

Compare our data to random samples!

Compare our data to random samples!

Compare our data to random samples!

The plotenvelope function

Using plotenvelope

The global thing

The global thing

Using plotenvelope on a smoother

Example - Can birds reduce the effectiveness of a biological control?

Test for an interaction…

Multivariate linear models

How do Iris flowers differ across species?

How do Iris flowers differ across species?

Plot the data!

Fit a multivariate linear model and check assumptions

Why no plot function for mlm objects?

plotenvelope can handle mlm objects

Check assumptions (plot doesn’t work)

Check assumptions (plotenvelope)

Assumption violations: how big?

Assumption violations: how big?

Exercise

References

The `plotenvelope` function

Using `plotenvelope`

Why no plot function for `mlm` objects?

plotenvelope can handle `mlm` objects