Notes 12
Poisson Counts | Part 2: Diagnostics for Outliers
Setup
Loading packages
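The loading chunk itself isn't shown; here is a minimal sketch, assuming the nminer data come from the GLMsData package that accompanies the book:

library(GLMsData) # assumed source of the nminer data
library(dplyr)    # mutate(), select(), bind_cols()
library(ggplot2)
library(broom)    # augment()
library(statmod)  # qresid()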
Introduction
In these notes, we will discuss diagnostic measures for identifying outliers in a dataset with respect to a model. (Note: These diagnostic measures are for all kinds of generalized linear models, not just Poisson models for counts; they just happen to fall within our Poisson unit.)
The three kinds of outliers (and the diagnostic measure used to identify each type) are:
Outliers in terms of \(y\) (residuals)
Outliers in terms of \(x\) or the \(x\)s (leverage) - Sections 3.4, 8.4 in book
Outliers in terms of influence (Cook’s distance). Influential observations are outliers that substantially change the fitted model when removed from the dataset. - Sections 3.6, 8.8 in book
Of the three types, I would argue that outliers in terms of influence are most important.
Connecting the three types is this result: influential observations are outliers in terms of both residuals and leverage.
Noisy miner data example cont.
(See Section 8.12)
To aid in identifying the observations, we add an observation ID number.
data(nminer)
nminer <- nminer |>
mutate(ID = 1:nrow(nminer)) |>
select(ID, Minerab, Eucs)
head(nminer, 5)
## ID Minerab Eucs
## 1 1 0 2
## 2 2 0 10
## 3 3 3 16
## 4 4 2 20
## 5 5 8 19
Plot of model fit
We fit the same GLM to the data as in Notes 11. Unfortunately, the ID variable is not carried through by augment() because it isn't one of the variables in the model.
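The fitting code isn't shown here; a minimal sketch consistent with Notes 11 and the output below (the object name fit is assumed, and fitted values are requested on the response scale):

# Poisson GLM of bird abundance on number of eucalypts, log link (default)
fit <- glm(Minerab ~ Eucs, family = poisson, data = nminer)
augment(fit, type.predict = "response")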
## # A tibble: 31 × 8
## Minerab Eucs .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2 0.523 -1.02 0.0354 1.49 0.00994 -1.04
## 2 0 10 1.30 -1.61 0.0398 1.47 0.0281 -1.65
## 3 3 16 2.58 0.255 0.0406 1.50 0.00151 0.261
## 4 2 20 4.07 -1.14 0.0491 1.49 0.0285 -1.17
## 5 8 19 3.63 1.98 0.0454 1.45 0.131 2.02
## 6 1 18 3.24 -1.46 0.0430 1.48 0.0364 -1.49
## 7 8 12 1.63 3.56 0.0399 1.34 0.536 3.63
## 8 5 16 2.58 1.33 0.0406 1.48 0.0501 1.36
## 9 0 3 0.586 -1.08 0.0363 1.49 0.0114 -1.10
## 10 4 12 1.63 1.56 0.0399 1.47 0.0740 1.59
## # ℹ 21 more rows
So we use bind_cols() to add the ID column.
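A sketch of that step, producing the fit_dat object used in the code below:

# Attach the ID column from nminer to the augmented model output
fit_dat <- augment(fit, type.predict = "response") |>
  bind_cols(nminer |> select(ID))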
Below is a plot of the observed counts (shown by ID numbers) and the predicted counts. Note that geom_text() is used to plot the IDs instead of points.
ggplot(
  data = fit_dat,
  mapping = aes(x = Eucs, y = Minerab, label = ID)
) +
  geom_text(size = 3, position = "jitter") +
  geom_line(aes(y = .fitted)) +
  labs(
    x = "Number of Eucalyptus trees",
    y = "Number of Noisy Miner birds"
  )
Which points appear to be outliers in terms of
\(y\) (residuals)? 11, 7, 17
in terms of \(x\) (leverage)? 17, 11
in terms of influence (Cook’s distance)? 7, 17, 5
Outliers in terms of y (residuals)
We covered residuals in depth in Notes 9. Recall that quantile residuals are preferred for discrete distributions like the Poisson.
When the model is appropriate, quantile residuals follow a standard normal distribution, so quantile residuals with absolute values larger than 2 or 3 should be considered outliers.
fit_dat |>
mutate(resid = qresid(fit)) |>
ggplot(
mapping = aes(x = Eucs, y = resid, label=ID)
) +
geom_text(size = 3) +
geom_hline(yintercept = 0) +
labs(y = "Quantile residuals")
Which observation ID indicates the most extreme outlier in the dataset (in terms of \(y\))? 7
Why does the plot change slightly when you rerun the chunk? Quantile residuals are computed using random number generation, so the values differ slightly each time the chunk runs.
Looking back at the plot with the observed and fitted counts, note that observations 7 and 17 are roughly the same distance from the line.
In other words, their response residuals have similar magnitude.
Why does 7 have a larger quantile residual than 17? For the Poisson distribution, the variance is equal to the mean. The mean for observation 7 is much smaller than the mean for observation 17. So, relative to the variance, observation 7 is further from its mean.
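The side-by-side comparison below can be reproduced by filtering the augmented data; a sketch:

fit_dat |>
  filter(ID %in% c(7, 17))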
## ID Minerab Eucs .fitted .resid .hat .sigma .cooksd .std.resid
## 1 7 8 12 1.634872 3.560305 0.03987337 1.337853 0.5359525 3.633482
## 2 17 7 30 12.720681 -1.754678 0.31108992 1.449738 0.8431715 -2.114055
tibble(
x = seq(from = 0, to = 10, by = 1),
probability = dpois(x, lambda = 1.635)
) |>
ggplot(
mapping = aes(x = x, y = probability)
) +
geom_line() +
geom_point() +
labs(
title = "Poisson distribution for obs 7"
)
tibble(
x = seq(from = 0, to = 24, by = 1),
probability = dpois(x, lambda = 12.721)
) |>
ggplot(
mapping = aes(x = x, y = probability)
) +
geom_line() +
geom_point() +
labs(title = "Poisson distribution for obs 17")
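As a quick numerical check (not part of the original output), we can compare the tail probabilities of each observed count under its fitted Poisson distribution:

ppois(7, lambda = 1.635, lower.tail = FALSE) # P(Y >= 8) for obs 7: very small
ppois(7, lambda = 12.721)                    # P(Y <= 7) for obs 17: much larger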
Outliers in terms of Xs (leverage)
The leverage measures how much of an outlier an observation is in terms of \(x\) or the \(x\)s. (It does not depend on the response at all.)
The leverage is given by the augment() function as .hat. So we don't get confused, let's rename this variable.
fit_dat |>
rename(leverage = .hat) |>
arrange(desc(leverage)) |>
select(leverage, ID, .fitted, Eucs) |>
head(5)
## leverage ID .fitted Eucs
## 1 0.48961774 11 15.977664 32
## 2 0.31108992 17 12.720681 30
## 3 0.08542349 30 6.419505 24
## 4 0.05428985 29 4.560339 21
## 5 0.04905501 4 4.069074 20
Why does point 11 have the largest leverage? It has the greatest Eucs value.
What constitutes a large leverage value? The sum of the leverage values is equal to the number of beta coefficients \(p\).
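A sketch of the computation behind the output below (here \(p = 2\): intercept and slope):

fit_dat |>
  summarize(sum_lev = sum(.hat))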
## sum_lev
## 1 2
So the average leverage is \(p/n\). Observations with leverage values greater than 2 or 3 times \(p/n\) are flagged as outliers in terms of the explanatory variables.
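The baseline below is the average leverage \(p/n = 2/31\):

2 / nrow(fit_dat)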
## [1] 0.06451613
Outliers in terms of influence (Cook’s distance)
Cook’s distance measures the change in model fit from leaving out the observation. It is given by augment() as .cooksd. A rule of thumb is that observations with Cook’s distance greater than 1 can be identified as potentially influential.
fit_dat |>
mutate(resid = qresid(fit)) |>
rename(leverage = .hat) |>
arrange(desc(.cooksd)) |>
select(.cooksd, ID, resid, leverage) |>
head(6)
## .cooksd ID resid leverage
## 1 0.84317148 17 -1.7373953 0.31108992
## 2 0.53728929 11 0.6613919 0.48961774
## 3 0.53595247 7 3.5060946 0.03987337
## 4 0.13110736 5 1.8515415 0.04544019
## 5 0.12951918 31 1.9620053 0.04014949
## 6 0.07796161 20 1.7870093 0.04544019
Sorting the data by Cook’s distance reveals observation 17 as the most influential. Overall, observations 17, 11, and 7 have much more influence than any of the other points.
The plot below demonstrates how the expected counts change when each of the points is left out.
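The plotting code isn't shown; here is a sketch of one way to reproduce it (the choice of observations 7, 11, and 17 and the color mapping are assumptions):

# Refit the model without each influential observation and overlay
# the refitted curves on the full-data fit.
loo_fits <- bind_rows(lapply(c(7, 11, 17), function(i) {
  fit_i <- glm(Minerab ~ Eucs, family = poisson,
               data = nminer |> filter(ID != i))
  nminer |>
    mutate(
      .fitted = predict(fit_i, newdata = nminer, type = "response"),
      left_out = paste("without obs", i)
    )
}))

ggplot(data = fit_dat, mapping = aes(x = Eucs)) +
  geom_line(aes(y = .fitted)) + # full-data fit (black)
  geom_line(data = loo_fits, aes(y = .fitted, color = left_out))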
Why does this plot agree with observation 17 having the largest Cook’s distance? Because the line fit without observation 17 (green) is farthest from the black line fit to the full data.
Connecting the 3 measures
The Cook’s distance increases both
as the magnitude of the residual increases
as the leverage increases
This means that the most influential points are outliers in terms of BOTH y and x.
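This connection is visible in the formula: for a GLM, Cook’s distance is approximately \(D_i = \frac{(r''_i)^2}{p} \cdot \frac{h_i}{1 - h_i}\), where \(r''_i\) is the standardized Pearson residual, \(h_i\) is the leverage, and \(p\) is the number of coefficients. The first factor grows with the residual and the second with the leverage.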
fit_dat |>
mutate(resid = qresid(fit)) |>
rename(leverage = .hat) |>
arrange(desc(.cooksd)) |>
select(.cooksd, ID, resid, leverage) |>
head(3)
## .cooksd ID resid leverage
## 1 0.8431715 17 -1.6849353 0.31108992
## 2 0.5372893 11 0.7654911 0.48961774
## 3 0.5359525 7 3.7991035 0.03987337
Looking at the three most influential observations (17, 11, and 7):
observation 7 has the largest residual but a small leverage
observation 11 has the largest leverage but a small residual
observation 17 is the most influential because it has a somewhat large residual AND a large leverage.