Notes 12
Poisson Counts | Part 2: Diagnostics for Outliers
Setup
Loading packages
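The loading chunk itself isn't shown; here is a minimal sketch, assuming the nminer data come from the GLMsData package that accompanies the book:

library(GLMsData) # assumed source of the nminer data
library(dplyr)    # mutate(), select(), bind_cols()
library(ggplot2)
library(broom)    # augment()
library(statmod)  # qresid()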
Introduction
In these notes, we will discuss diagnostic measures for identifying outliers in a dataset with respect to a model. (Note: These diagnostic measures are for all kinds of generalized linear models, not just Poisson models for counts; they just happen to fall within our Poisson unit.)
The three kinds of outliers (and the diagnostic measure used to identify each type) are:
Outliers in terms of \(y\) (residuals)
Outliers in terms of \(x\) or the \(x\)s (leverage) - Sections 3.4, 8.4 in book
Outliers in terms of influence (Cook’s distance). Influential observations are outliers that substantially change the fitted model when removed from the dataset. - Sections 3.6, 8.8 in book
Of the three types, I would argue that outliers in terms of influence are most important.
Connecting the three types is this result: influential observations are outliers in terms of both residuals and leverage.
Noisy miner data example cont.
(See Section 8.12)
To aid in identifying the observations, we add an observation ID number.
data(nminer)
nminer <- nminer |>
mutate(ID = 1:nrow(nminer)) |>
select(ID, Minerab, Eucs)
head(nminer, 5)
## ID Minerab Eucs
## 1 1 0 2
## 2 2 0 10
## 3 3 3 16
## 4 4 2 20
## 5 5 8 19
Plot of model fit
We fit the same GLM to the data as in Notes 11. Unfortunately, the ID variable is not carried through by augment() because it isn't one of the variables in the model.
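The fitting code isn't shown here; a minimal sketch consistent with Notes 11 and the output below (the object name fit is assumed, and fitted values are requested on the response scale):

# Poisson GLM of bird abundance on number of eucalypts, log link (default)
fit <- glm(Minerab ~ Eucs, family = poisson, data = nminer)
augment(fit, type.predict = "response")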
## # A tibble: 31 × 8
## Minerab Eucs .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2 0.523 -1.02 0.0354 1.49 0.00994 -1.04
## 2 0 10 1.30 -1.61 0.0398 1.47 0.0281 -1.65
## 3 3 16 2.58 0.255 0.0406 1.50 0.00151 0.261
## 4 2 20 4.07 -1.14 0.0491 1.49 0.0285 -1.17
## 5 8 19 3.63 1.98 0.0454 1.45 0.131 2.02
## 6 1 18 3.24 -1.46 0.0430 1.48 0.0364 -1.49
## 7 8 12 1.63 3.56 0.0399 1.34 0.536 3.63
## 8 5 16 2.58 1.33 0.0406 1.48 0.0501 1.36
## 9 0 3 0.586 -1.08 0.0363 1.49 0.0114 -1.10
## 10 4 12 1.63 1.56 0.0399 1.47 0.0740 1.59
## # ℹ 21 more rows
So we use bind_cols() to add the ID column.
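A sketch of that step, producing the fit_dat object used in the code below:

# Attach the ID column from nminer to the augmented model output
fit_dat <- augment(fit, type.predict = "response") |>
  bind_cols(nminer |> select(ID))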
Below is a plot of the observed counts (shown by ID numbers) and the predicted counts. Note that geom_text() is used to plot the IDs instead of points.
ggplot(
  data = fit_dat,
  mapping = aes(x = Eucs, y = Minerab, label = ID)
) +
  geom_text(size = 3, position = "jitter") +
  geom_line(aes(y = .fitted)) +
  labs(
    x = "Number of Eucalyptus trees",
    y = "Number of Noisy Miner birds"
  )
Which points appear to be outliers in terms of
\(y\) (residuals)? 11, 7, 17
in terms of \(x\) (leverage)? 17, 11
in terms of influence (Cook’s distance)? 7, 17, 5
Outliers in terms of y (residuals)
We covered residuals in depth in Notes 9. Recall that quantile residuals are preferred for discrete distributions like the Poisson.
When the model is appropriate, quantile residuals follow a standard normal distribution, so quantile residuals with absolute values larger than 2 or 3 should be considered outliers.
fit_dat |>
mutate(resid = qresid(fit)) |>
ggplot(
mapping = aes(x = Eucs, y = resid, label=ID)
) +
geom_text(size = 3) +
geom_hline(yintercept = 0) +
labs(y = "Quantile residuals")
Which observation ID indicates the most extreme outlier in the dataset (in terms of \(y\))? 7
Why does the plot change slightly when you rerun the chunk? Quantile residuals are computed using random number generation, so the values differ slightly each time the chunk runs.
Looking back at the plot with the observed and fitted counts, note that observations 7 and 17 are roughly the same distance from the line.
In other words, their response residuals have similar magnitude.
Why does 7 have a larger quantile residual than 17? For the Poisson distribution, the variance is equal to the mean. The mean for observation 7 is much smaller than the mean for observation 17. So, relative to the variance, observation 7 is further from its mean.
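The side-by-side comparison below can be reproduced by filtering the augmented data; a sketch:

fit_dat |>
  filter(ID %in% c(7, 17))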
## ID Minerab Eucs .fitted .resid .hat .sigma .cooksd .std.resid
## 1 7 8 12 1.634872 3.560305 0.03987337 1.337853 0.5359525 3.633482
## 2 17 7 30 12.720681 -1.754678 0.31108992 1.449738 0.8431715 -2.114055
tibble(
x = seq(from = 0, to = 10, by = 1),
probability = dpois(x, lambda = 1.635)
) |>
ggplot(
mapping = aes(x = x, y = probability)
) +
geom_line() +
geom_point() +
labs(
title = "Poisson distribution for obs 7"
)
tibble(
x = seq(from = 0, to = 24, by = 1),
probability = dpois(x, lambda = 12.721)
) |>
ggplot(
mapping = aes(x = x, y = probability)
) +
geom_line() +
geom_point() +
labs(title = "Poisson distribution for obs 17")
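As a quick numerical check (not part of the original output), we can compare the tail probabilities of each observed count under its fitted Poisson distribution:

ppois(7, lambda = 1.635, lower.tail = FALSE) # P(Y >= 8) for obs 7: very small
ppois(7, lambda = 12.721)                    # P(Y <= 7) for obs 17: much larger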
Outliers in terms of Xs (leverage)
The leverage measures how much of an outlier an observation is in terms of \(x\) or the \(x\)s. (It does not depend on the response at all.)
The leverage is given by the augment() function as .hat. So we don't get confused, let's rename this variable.
fit_dat |>
rename(leverage = .hat) |>
arrange(desc(leverage)) |>
select(leverage, ID, .fitted, Eucs) |>
head(5)
## leverage ID .fitted Eucs
## 1 0.48961774 11 15.977664 32
## 2 0.31108992 17 12.720681 30
## 3 0.08542349 30 6.419505 24
## 4 0.05428985 29 4.560339 21
## 5 0.04905501 4 4.069074 20
Why does point 11 have the largest leverage? It has the greatest Eucs value.
What constitutes a large leverage value? The sum of the leverage values is equal to the number of beta coefficients \(p\).
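A sketch of the computation behind the output below (here \(p = 2\): intercept and slope):

fit_dat |>
  summarize(sum_lev = sum(.hat))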
## sum_lev
## 1 2
So the average leverage is \(p/n\). Observations with leverage values greater than 2 or 3 times \(p/n\) are flagged as outliers in terms of the explanatory variables.
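The baseline below is the average leverage \(p/n = 2/31\):

2 / nrow(fit_dat)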
## [1] 0.06451613
Outliers in terms of influence (Cook’s distance)
Cook’s distance measures the change in model fit from leaving out the observation. It is given by augment() as .cooksd. A rule of thumb is that observations with Cook’s distance greater than 1 can be identified as potentially influential.
fit_dat |>
mutate(resid = qresid(fit)) |>
rename(leverage = .hat) |>
arrange(desc(.cooksd)) |>
select(.cooksd, ID, resid, leverage) |>
head(6)
## .cooksd ID resid leverage
## 1 0.84317148 17 -1.7373953 0.31108992
## 2 0.53728929 11 0.6613919 0.48961774
## 3 0.53595247 7 3.5060946 0.03987337
## 4 0.13110736 5 1.8515415 0.04544019
## 5 0.12951918 31 1.9620053 0.04014949
## 6 0.07796161 20 1.7870093 0.04544019
Sorting the data by Cook’s distance reveals observation 17 as the most influential. Overall, observations 17, 11, and 7 have much more influence than any of the other points.
The plot below demonstrates how the expected counts change when each of the points is left out.
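The plotting code isn't shown; here is a sketch of one way to reproduce it (the choice of observations 7, 11, and 17 and the color mapping are assumptions):

# Refit the model without each influential observation and overlay
# the refitted curves on the full-data fit.
loo_fits <- bind_rows(lapply(c(7, 11, 17), function(i) {
  fit_i <- glm(Minerab ~ Eucs, family = poisson,
               data = nminer |> filter(ID != i))
  nminer |>
    mutate(
      .fitted = predict(fit_i, newdata = nminer, type = "response"),
      left_out = paste("without obs", i)
    )
}))

ggplot(data = fit_dat, mapping = aes(x = Eucs)) +
  geom_line(aes(y = .fitted)) + # full-data fit (black)
  geom_line(data = loo_fits, aes(y = .fitted, color = left_out))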
Why does this plot agree with observation 17 having the largest Cook’s distance? Because the line fit without observation 17 (green) is farthest from the black line fit to the full data.
Connecting the 3 measures
The Cook’s distance increases both
as the magnitude of the residual increases
as the leverage increases
This means that the most influential points are outliers in terms of BOTH y and x.
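This connection is visible in the formula: for a GLM, Cook’s distance is approximately \(D_i = \frac{(r''_i)^2}{p} \cdot \frac{h_i}{1 - h_i}\), where \(r''_i\) is the standardized Pearson residual, \(h_i\) is the leverage, and \(p\) is the number of coefficients. The first factor grows with the residual and the second with the leverage.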
fit_dat |>
mutate(resid = qresid(fit)) |>
rename(leverage = .hat) |>
arrange(desc(.cooksd)) |>
select(.cooksd, ID, resid, leverage) |>
head(3)
## .cooksd ID resid leverage
## 1 0.8431715 17 -1.6849353 0.31108992
## 2 0.5372893 11 0.7654911 0.48961774
## 3 0.5359525 7 3.7991035 0.03987337
Looking at the three most influential observations (17, 11, and 7):
observation 7 has the largest residual but a small leverage
observation 11 has the largest leverage but a small residual
observation 17 is the most influential because it has a somewhat large residual AND a large leverage.