Notes 9
Modeling Proportions with Binomial Distribution | Part 4: Evaluating Model Fit with Residuals
Setup
Loading packages
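The package-loading chunk itself isn't shown above, only its warnings. Presumably it loads something like the following (the dplyr/tidyr/tibble lines are assumptions based on the functions used later in these notes):
library(ggplot2)   # plotting
library(dplyr)     # assumed: mutate(), select(), rename(), filter()
library(tidyr)     # assumed: pivot_longer()
library(tibble)    # assumed: tibble()
library(broom)     # augment()
library(statmod)   # qresid() for quantile residuals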
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'statmod' was built under R version 4.3.3
Residuals
Section 8.3.1 - Response residuals are insufficient for GLMs
In (normal-based) linear regression, we use residuals to evaluate the fit of the model. These residuals are calculated as “actual minus predicted”. However, a main takeaway from these notes is that such residuals are inadequate for assessing a fitted generalized linear model.
These residuals, \(y_i - \hat{\mu}_i\), are referred to as “response residuals” in the book. See Section 8.3.1. The problem is that for GLMs, the variance of the response depends on the mean.
For example, see Figure 8.1, which shows a dataset of cherry trees (introduced in Example 3.14).
- y = Volume
- x = log(Girth)
- Uses a GLM based on the Gamma distribution, where the variance grows with the mean.
- Two residuals are the same size, but one is an outlier and the other isn’t, because the variance depends on the mean.
Section 8.3.2 - Pearson residuals
Pearson residuals divide out the effect of non-constant variance. They are defined as
\[r_P = \frac{y-\hat{\mu}}{\sqrt{V(\hat{\mu})}}\]
This uses the variance function \(V()\). For all distributions used in GLMs, the variance of the responses is a function of the mean. Table 5.1 (p. 221) lists these variance functions.
Note that the Pearson residual for the normal distribution is just the response residual \(y-\hat{\mu}\).
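As a quick check of the definition (my own sketch, not from the book): for a Poisson GLM the variance function is \(V(\mu)=\mu\), so the Pearson residuals returned by resid() should match \((y-\hat{\mu})/\sqrt{\hat{\mu}}\) computed by hand.
set.seed(1)
d <- tibble(x = runif(30), y = rpois(30, lambda = exp(1 + x)))
f <- glm(y ~ x, family = poisson, data = d)
mu_hat <- fitted(f)
# difference between resid()'s Pearson residuals and the hand computation:
max(abs(resid(f, type = "pearson") - (d$y - mu_hat) / sqrt(mu_hat)))  # essentially zero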
Section 8.3.3 - Deviance residuals
These residuals use a quantity called the deviance \(d(y,\mu)\) which is a distance measure between y and \(\mu\). The deviance for different distributions is also shown in Table 5.1 above. Note:
- The deviance is an appropriate distance measure in that it equals zero when y = \(\mu\) and grows as y gets farther from \(\mu\).
- For instance, the deviance for the normal distribution is the squared difference \((y - \mu)^2\).
Deviance residuals are defined as
\[r_D = sign(y-\hat{\mu})\sqrt{d(y,\hat{\mu})}\]
Again, the deviance residual for the normal distribution is just the response residual \(y-\hat{\mu}\).
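Another sketch of my own, continuing with the Poisson fit from above: the Poisson unit deviance is \(d(y,\mu) = 2\{y\log(y/\mu) - (y-\mu)\}\) (with \(y\log(y/\mu)\) taken as 0 when \(y=0\)), so sign(y - mu_hat) * sqrt(d) should reproduce resid(f, "deviance").
# Poisson unit deviance; the y*log(y/mu) term is 0 when y = 0
dev_unit <- 2 * (ifelse(d$y == 0, 0, d$y * log(d$y / mu_hat)) - (d$y - mu_hat))
r_D <- sign(d$y - mu_hat) * sqrt(dev_unit)
max(abs(resid(f, type = "deviance") - r_D))  # essentially zero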
Section 8.3.4 - Quantile residuals
Under certain conditions, the Pearson and deviance residuals follow approximate normal distributions (assuming the model is true). Quantile residuals, in contrast, are exactly normally distributed apart from the sampling variability in the estimated parameters, so they are preferred. Quantile residuals are definitely preferred for discrete distributions – we’ll see why later.
Understanding quantile residuals relies on distribution functions like we introduced in Notes 2.
- pnorm: the cumulative distribution function (or CDF)
- qnorm: the quantile function
Cumulative distribution function (or CDF)
This gives the probability the random variable is less than or equal to a value x. For example, from Notes 2, this gives a picture of pnorm(1, mean=0, sd=1):
tibble(
z = seq(from=-4, to=4, by=.05),
Density = dnorm(z, mean = 0, sd = 1),
x_range = ifelse(z <= 1, z, NA)
) |>
ggplot(
aes(x = z, y= Density)
) + geom_line() +
geom_area(
aes(x = x_range, y = Density),
fill = "grey50"
) + labs(title = "pnorm(1)")
Plotting the CDF for the standard normal distribution, pnorm(z). Note now that probability is on the y-axis:
tibble(
z = seq(from=-4, to=4, by=.05),
Probability = pnorm(z, mean = 0, sd = 1)
) |>
ggplot(
aes(x = z, y= Probability)
) + geom_line() +
labs(title="Normal CDF, pnorm(z)")
Quantile function
The quantile function for the standard normal distribution, qnorm(z). Note that probability is now on the x-axis and the normal value z is on the y-axis.
tibble(
Probability = seq(from=.005, to=.995, by=.005),
z = qnorm(Probability, mean = 0, sd = 1)
) |>
ggplot(
aes(x = Probability, y= z)
) + geom_line() +
labs(title="Normal quantile function, qnorm(z)")
Definition of quantile residual
The quantile residual \(r_Q\) for an observation has the same cumulative probability on a standard normal distribution as \(y\) does for the fitted GLM distribution.
Quantile residuals for continuous response distribution
See Figure 8.2, p. 301
- Left: 0.3297 is the cumulative probability for the observation \(y=1.2\) on the fitted GLM distribution, an exponential distribution with \(\hat{\mu}=3\). In R, this is:
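(The chunk isn’t shown in the notes; presumably it evaluates the exponential CDF at 1.2 with rate \(1/\hat{\mu} = 1/3\):)
pexp(1.2, rate = 1/3)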
## [1] 0.32968
- Right: The quantile residual \(r_Q = -0.441\) has a cumulative probability on a standard normal distribution of 0.3297.
How can we use R to go from the probability 0.3297 to the quantile residual -0.441?
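Presumably with qnorm(), the normal quantile function (the exact chunk isn’t shown):
qnorm(pexp(1.2, rate = 1/3))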
## [1] -0.4407417
Note: since quantile residuals follow the standard normal distribution, what sorts of values for quantile residuals indicate outliers?
Quantile residuals
- with absolute values greater than 2 happen about 5% of the time
- with absolute values greater than 3 happen about 0.3% of the time <- outliers
Section 8.3.4.2 - Quantile Residuals: Discrete response
For discrete GLM distributions (like binomial, Poisson), there’s a modification to obtain quantile residuals.
This is because the CDF has jumps at discrete values of the distribution. Following the book, consider calculating a quantile residual for the observation \(y=1\) for the fitted Poisson model with \(\hat{\mu}=2.6\).
The CDF for the Poisson distribution with \(\hat{\mu}=2.6\) (Fig 8.3, left):
tibble(
y = seq(from = -.5, to = 9.5, by = .1),
Probability = ppois(y, lambda = 2.6)
) |>
ggplot() +
geom_step(aes(x = y, y = Probability))
The value of the CDF at \(y = 1\) depends on whether we approach it from the left or the right:
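(The chunk isn’t shown; presumably it evaluates the Poisson CDF just below and at \(y=1\):)
ppois(0, lambda = 2.6)  # approaching y = 1 from the left: P(Y <= 0)
ppois(1, lambda = 2.6)  # at y = 1: P(Y <= 1)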
## [1] 0.07427358
## [1] 0.2673849
So the solution is to obtain the probability by generating a random number in that range (we can use the runif() function to do this). In the book, this value is 0.149.
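(A call like the following; the value depends on the random seed, so it won’t match the book’s 0.149 exactly:)
runif(1, min = ppois(0, lambda = 2.6), max = ppois(1, lambda = 2.6))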
## [1] 0.1513309
Then the quantile residual is the standard normal quantile corresponding to a probability of 0.149:
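(Presumably computed with qnorm():)
qnorm(0.149)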
## [1] -1.040732
Range of possible quantile residuals:
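(Presumably the normal quantiles of the two CDF values found above:)
qnorm(ppois(0, lambda = 2.6))  # smallest possible quantile residual for y = 1
qnorm(ppois(1, lambda = 2.6))  # largest possible quantile residual for y = 1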
## [1] -1.444494
## [1] -0.6206955
Note: When quantile residuals are generated using random number generation, the book recommends calculating multiple sets of them.
Residuals simulation examples
To illustrate calculating these 4 different types of residuals, we simulate from a binomial generalized linear model:
Random component: Proportions \(y\sim Binomial(\mu, 3)\) (the proportion of successes out of 3 trials)
Systematic component: (uses logit link) \(\log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 x\). I’ll pick parameter values \(\beta_0 = 0\) and \(\beta_1 = 1\) and use \(x\) values from -4 to 4.
set.seed(123)
sim <- tibble(
x = seq(from=-4, to=4, length = 60),
prob = exp(x) / (1 + exp(x)),
count = rbinom(n = length(x), size = 3, prob = prob),
prop = count / 3,
Trials = 3
)
sim
## # A tibble: 60 × 5
## x prob count prop Trials
## <dbl> <dbl> <int> <dbl> <dbl>
## 1 -4 0.0180 0 0 3
## 2 -3.86 0.0205 0 0 3
## 3 -3.73 0.0235 0 0 3
## 4 -3.59 0.0268 0 0 3
## 5 -3.46 0.0305 1 0.333 3
## 6 -3.32 0.0348 0 0 3
## 7 -3.19 0.0397 0 0 3
## 8 -3.05 0.0452 1 0.333 3
## 9 -2.92 0.0514 0 0 3
## 10 -2.78 0.0584 0 0 3
## # ℹ 50 more rows
Plot of the proportions vs. x:
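(The plotting chunk isn’t shown; a call like this reproduces it:)
ggplot(sim, aes(x = x, y = prop)) +
geom_point()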
Fit logit model
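The fitting chunk isn’t shown in the notes; presumably it is a binomial glm() on the proportions with the number of trials supplied as prior weights (consistent with the (weights) column in the augment() output below):
fit <- glm(prop ~ x, family = binomial, weights = Trials, data = sim)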
Plot fitted proportion values \(\hat{\mu}\) along with the observed proportions
ggplot(
data = augment(fit, type.predict = "response"),
mapping = aes(x = x)
) +
geom_point(aes(y = prop)) +
geom_line(aes(y = .fitted))
Calculate four kinds of residuals:
- Response residuals: resid(fit, "response")
- Pearson residuals: resid(fit, "pearson")
- Deviance residuals: resid(fit, "deviance")
- Quantile residuals: qresid(fit) (uses the {statmod} package)
resid <- augment(fit, type.predict = "response") |>
mutate(
resp_resid = resid(fit, "response"),
pearson_resid = resid(fit, "pearson"),
dev_resid = resid(fit, "deviance"),
quant_resid = statmod::qresid(fit)
) |>
select(-c(.hat, .sigma, .cooksd, .resid, .std.resid)) |>
rename(response = resp_resid, pearson = pearson_resid, deviance = dev_resid, quantile = quant_resid)
head(resid, 5)
## # A tibble: 5 × 8
## prop x `(weights)` .fitted response pearson deviance quantile
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 -4 3 0.0194 -0.0194 -0.244 -0.343 0.324
## 2 0 -3.86 3 0.0224 -0.0224 -0.262 -0.369 -1.35
## 3 0 -3.73 3 0.0258 -0.0258 -0.282 -0.396 -0.372
## 4 0 -3.59 3 0.0297 -0.0297 -0.303 -0.425 -0.672
## 5 0.333 -3.46 3 0.0341 0.299 2.85 1.75 2.03
Plotting the four kinds of residuals:
resid |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
ggplot(
mapping = aes(x = x)
) +
geom_point(aes(y = Residual)) +
facet_wrap(~type, scales = "free_y")
Discussing different kinds of residuals
Big idea: quantile residuals are the best!
Note that the quantile residuals best give the kind of residual plot we want to see for a correctly specified model:
- Centered around 0
- Constant vertical spread
Problem with Pearson and deviance residuals: they are harder to interpret because of the bands of points, caused by the fact that the proportion \(y\) can only take the values 0, 1/3, 2/3 and 1. To show the bands more clearly:
resid |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
filter(type %in% c("pearson", "deviance")) |>
mutate(prop = round(prop, 2) |> as.character()) |>
ggplot(
mapping = aes(x = x, y = Residual, color = prop)
) +
geom_point() +
facet_wrap(~type)
Quadratic data
Model only has linear dependence
Suppose the data truly has a quadratic dependence on x, but only a linear dependence is included in the model.
Simulate the data:
set.seed(123)
sim2 <- tibble(
x = seq(from=-3, to=3, length = 60),
lin_pred = .5*(x^2 - 4),
prob = exp(lin_pred) / (1 + exp(lin_pred)),
count = rbinom(n = length(x), size = 3, prob = prob),
prop = count / 3,
Trials = 3
)
sim2
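Fit the linear-only logit model (the chunk isn’t shown; presumably the same glm() call as before, applied to sim2):
fit2 <- glm(prop ~ x, family = binomial, weights = Trials, data = sim2)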
Plot fitted proportion values \(\hat{\mu}\) along with the observed proportions
ggplot(
data = augment(fit2, type.predict = "response"),
mapping = aes(x = x)
) +
geom_point(aes(y = prop)) +
geom_line(aes(y = .fitted))
resid2 <- augment(fit2, type.predict = "response") |>
mutate(
resp_resid = resid(fit2, "response"),
pearson_resid = resid(fit2, "pearson"),
dev_resid = resid(fit2, "deviance"),
quant_resid = statmod::qresid(fit2)
) |>
select(-c(.hat, .sigma, .cooksd, .resid, .std.resid)) |>
rename(response = resp_resid, pearson = pearson_resid, deviance = dev_resid, quantile = quant_resid)
resid2 |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
ggplot(
mapping = aes(x = x)
) +
geom_point(aes(y = Residual)) +
facet_wrap(~type, scales = "free_y")
Correct quadratic model specified
Fit the model. Note that we use the function I() here to include x^2 in the model without making it a variable in the dataset first.
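(Presumably:)
fit3 <- glm(prop ~ x + I(x^2), family = binomial, weights = Trials, data = sim2)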
Plot fitted proportion values \(\hat{\mu}\) along with the observed proportions
ggplot(
data = augment(fit3, type.predict = "response"),
mapping = aes(x = x)
) +
geom_point(aes(y = prop)) +
geom_line(aes(y = .fitted))
Calculate residuals
resid3 <- augment(fit3, type.predict = "response") |>
mutate(
resp_resid = resid(fit3, "response"),
pearson_resid = resid(fit3, "pearson"),
dev_resid = resid(fit3, "deviance"),
quant_resid = statmod::qresid(fit3)
) |>
select(-c(.hat, .sigma, .cooksd, .resid, .std.resid)) |>
rename(response = resp_resid, pearson = pearson_resid, deviance = dev_resid, quantile = quant_resid)
Plot residuals
resid3 |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
ggplot(
mapping = aes(x = x)
) +
geom_point(aes(y = Residual)) +
facet_wrap(~type, scales = "free_y")