Notes 9
Modeling Proportions with Binomial Distribution | Part 4: Evaluating Model Fit with Residuals
Setup
Loading packages
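The package-loading chunk itself isn't shown above, only its warnings. Presumably it loads something like the following (the dplyr/tidyr/tibble lines are assumptions based on the functions used later in these notes):
library(ggplot2)   # plotting
library(dplyr)     # assumed: mutate(), select(), rename(), filter()
library(tidyr)     # assumed: pivot_longer()
library(tibble)    # assumed: tibble()
library(broom)     # augment()
library(statmod)   # qresid() for quantile residuals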
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'broom' was built under R version 4.3.3
## Warning: package 'statmod' was built under R version 4.3.3
Residuals
Section 8.3.1 - Response residuals are insufficient for GLMs
In (normal-based) linear regression, we use residuals to evaluate the fit of the model. These residuals are calculated as “actual minus predicted”. However, a main takeaway from these notes is that such residuals are inadequate for assessing a fitted generalized linear model.
These residuals, \(y_i - \hat{\mu}_i\), are referred to as “response residuals” in the book. See Section 8.3.1. The problem is that for GLMs, the variance of the response depends on the mean.
For example, see Figure 8.1, which shows a dataset of cherry trees (introduced in Example 3.14).
- y = Volume
- x = log(Girth)
- Uses a GLM based on the Gamma distribution, where the variance grows with the mean.
- Two residuals are the same size, but one is an outlier and the other isn’t, because the variance depends on the mean.
Section 8.3.2 - Pearson residuals
Pearson residuals divide out the effect of non-constant variance. They are defined as
\[r_P = \frac{y-\hat{\mu}}{\sqrt{V(\hat{\mu})}}\]
This uses the variance function \(V()\). For all distributions used in GLMs, the variance of the responses is a function of the mean. Table 5.1 (p. 221) lists these variance functions.
Note that the Pearson residual for the normal distribution is just the response residual \(y-\hat{\mu}\).
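As a quick check of the definition (my own sketch, not from the book): for a Poisson GLM the variance function is \(V(\mu)=\mu\), so the Pearson residuals returned by resid() should match \((y-\hat{\mu})/\sqrt{\hat{\mu}}\) computed by hand.
set.seed(1)
d <- tibble(x = runif(30), y = rpois(30, lambda = exp(1 + x)))
f <- glm(y ~ x, family = poisson, data = d)
mu_hat <- fitted(f)
# difference between resid()'s Pearson residuals and the hand computation:
max(abs(resid(f, type = "pearson") - (d$y - mu_hat) / sqrt(mu_hat)))  # essentially zero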
Section 8.3.3 - Deviance residuals
These residuals use a quantity called the deviance \(d(y,\mu)\) which is a distance measure between y and \(\mu\). The deviance for different distributions is also shown in Table 5.1 above. Note:
- The deviance is an appropriate distance measure in that it equals zero when y = \(\mu\) and grows as y gets farther from \(\mu\).
- For instance, the deviance for the normal distribution is the squared difference \((y - \mu)^2\).
Deviance residuals are defined as
\[r_D = sign(y-\hat{\mu})\sqrt{d(y,\hat{\mu})}\]
Again, the deviance residual for the normal distribution is just the response residual \(y-\hat{\mu}\).
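Another sketch of my own, continuing with the Poisson fit from above: the Poisson unit deviance is \(d(y,\mu) = 2\{y\log(y/\mu) - (y-\mu)\}\) (with \(y\log(y/\mu)\) taken as 0 when \(y=0\)), so sign(y - mu_hat) * sqrt(d) should reproduce resid(f, "deviance").
# Poisson unit deviance; the y*log(y/mu) term is 0 when y = 0
dev_unit <- 2 * (ifelse(d$y == 0, 0, d$y * log(d$y / mu_hat)) - (d$y - mu_hat))
r_D <- sign(d$y - mu_hat) * sqrt(dev_unit)
max(abs(resid(f, type = "deviance") - r_D))  # essentially zero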
Section 8.3.4 - Quantile residuals
Under certain conditions, the Pearson and deviance residuals follow approximate normal distributions (assuming the model is true). Quantile residuals, in contrast, are exactly normally distributed apart from the sampling variability in the estimated parameters, so they are preferred. Quantile residuals are definitely preferred for discrete distributions – we’ll see why later.
Understanding quantile residuals relies on distribution functions like we introduced in Notes 2.
- pnorm: the cumulative distribution function (or CDF)
- qnorm: the quantile function
Cumulative distribution function (or CDF)
This gives the probability the random variable is less than or equal to a value x. For example, from Notes 2, this gives a picture of pnorm(1, mean=0, sd=1):
tibble(
z = seq(from=-4, to=4, by=.05),
Density = dnorm(z, mean = 0, sd = 1),
x_range = ifelse(z <= 1, z, NA)
) |>
ggplot(
aes(x = z, y= Density)
) + geom_line() +
geom_area(
aes(x = x_range, y = Density),
fill = "grey50"
) + labs(title = "pnorm(1)")
Plotting the CDF for the standard normal distribution, pnorm(z). Note now that probability is on the y-axis:
tibble(
z = seq(from=-4, to=4, by=.05),
Probability = pnorm(z, mean = 0, sd = 1)
) |>
ggplot(
aes(x = z, y= Probability)
) + geom_line() +
labs(title="Normal CDF, pnorm(z)")
Quantile function
The quantile function for the standard normal distribution, qnorm(z). Note that probability is now on the x-axis and the normal value z is on the y-axis.
tibble(
Probability = seq(from=.005, to=.995, by=.005),
z = qnorm(Probability, mean = 0, sd = 1)
) |>
ggplot(
aes(x = Probability, y= z)
) + geom_line() +
labs(title="Normal quantile function, qnorm(z)")
Definition of quantile residual
The quantile residual \(r_Q\) for an observation has the same cumulative probability on a standard normal distribution as \(y\) does for the fitted GLM distribution.
Quantile residuals for continuous response distribution
See Figure 8.2, p. 301
- Left: 0.3297 is the cumulative probability for the observation \(y=1.2\) on the fitted GLM distribution, an exponential distribution with \(\hat{\mu}=3\). In R, this is:
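(The chunk isn’t shown in the notes; presumably it evaluates the exponential CDF at 1.2 with rate \(1/\hat{\mu} = 1/3\):)
pexp(1.2, rate = 1/3)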
## [1] 0.32968
- Right: The quantile residual \(r_Q = -0.441\) has a cumulative probability on a standard normal distribution of 0.3297.
How can we use R to go from the probability 0.3297 to the quantile residual -0.441?
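Presumably with qnorm(), the normal quantile function (the exact chunk isn’t shown):
qnorm(pexp(1.2, rate = 1/3))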
## [1] -0.4407417
Note: since quantile residuals follow the standard normal distribution, what sorts of values for quantile residuals indicate outliers?
Quantile residuals
- with absolute values greater than 2 happen about 5% of the time
- with absolute values greater than 3 happen about 0.3% of the time <- outliers
Section 8.3.4.2 - Quantile Residuals: Discrete response
For discrete GLM distributions (like binomial, Poisson), there’s a modification to obtain quantile residuals.
This is because the CDF has jumps at discrete values of the distribution. Following the book, consider calculating a quantile residual for the observation \(y=1\) for the fitted Poisson model with \(\hat{\mu}=2.6\).
The CDF for the Poisson distribution with \(\hat{\mu}=2.6\) (Fig 8.3, left):
tibble(
y = seq(from = -.5, to = 9.5, by = .1),
Probability = ppois(y, lambda = 2.6)
) |>
ggplot() +
geom_step(aes(x = y, y = Probability))
The value of the CDF at \(y = 1\) depends on whether we approach it from the left or the right:
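(The chunk isn’t shown; presumably it evaluates the Poisson CDF just below and at \(y=1\):)
ppois(0, lambda = 2.6)  # approaching y = 1 from the left: P(Y <= 0)
ppois(1, lambda = 2.6)  # at y = 1: P(Y <= 1)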
## [1] 0.07427358
## [1] 0.2673849
So the solution is to obtain the probability by generating a random number in that range (we can use the runif() function to do this). In the book, this value is 0.149.
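(A call like the following; the value depends on the random seed, so it won’t match the book’s 0.149 exactly:)
runif(1, min = ppois(0, lambda = 2.6), max = ppois(1, lambda = 2.6))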
## [1] 0.1513309
Then the quantile residual is the standard normal quantile corresponding to a probability of 0.149:
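(Presumably computed with qnorm():)
qnorm(0.149)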
## [1] -1.040732
Range of possible quantile residuals:
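(Presumably the normal quantiles of the two CDF values found above:)
qnorm(ppois(0, lambda = 2.6))  # smallest possible quantile residual for y = 1
qnorm(ppois(1, lambda = 2.6))  # largest possible quantile residual for y = 1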
## [1] -1.444494
## [1] -0.6206955
Note: When quantile residuals are generated using random number generation, the book recommends calculating multiple sets of them.
Residuals simulation examples
To illustrate calculating these 4 different types of residuals, we simulate from a binomial generalized linear model:
Random component: Proportions \(y\sim Binomial(\mu, 3)\) (the proportion of successes out of 3 trials)
Systematic component: (uses logit link) \(\log\frac{\mu}{1-\mu} = \beta_0 + \beta_1 x\). I’ll pick parameter values \(\beta_0 = 0\) and \(\beta_1 = 1\) and use \(x\) values from -4 to 4.
set.seed(123)
sim <- tibble(
x = seq(from=-4, to=4, length = 60),
prob = exp(x) / (1 + exp(x)),
count = rbinom(n = length(x), size = 3, prob = prob),
prop = count / 3,
Trials = 3
)
sim
## # A tibble: 60 × 5
## x prob count prop Trials
## <dbl> <dbl> <int> <dbl> <dbl>
## 1 -4 0.0180 0 0 3
## 2 -3.86 0.0205 0 0 3
## 3 -3.73 0.0235 0 0 3
## 4 -3.59 0.0268 0 0 3
## 5 -3.46 0.0305 1 0.333 3
## 6 -3.32 0.0348 0 0 3
## 7 -3.19 0.0397 0 0 3
## 8 -3.05 0.0452 1 0.333 3
## 9 -2.92 0.0514 0 0 3
## 10 -2.78 0.0584 0 0 3
## # ℹ 50 more rows
Plot of the proportions vs. x:
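(The plotting chunk isn’t shown; a call like this reproduces it:)
ggplot(sim, aes(x = x, y = prop)) +
geom_point()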
Fit logit model
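The fitting chunk isn’t shown in the notes; presumably it is a binomial glm() on the proportions with the number of trials supplied as prior weights (consistent with the (weights) column in the augment() output below):
fit <- glm(prop ~ x, family = binomial, weights = Trials, data = sim)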
Plot fitted proportion values \(\hat{\mu}\) along with the observed proportions
ggplot(
data = augment(fit, type.predict = "response"),
mapping = aes(x = x)
) +
geom_point(aes(y = prop)) +
geom_line(aes(y = .fitted))
Calculate four kinds of residuals:
- Response residuals: resid(fit, "response")
- Pearson residuals: resid(fit, "pearson")
- Deviance residuals: resid(fit, "deviance")
- Quantile residuals: qresid(fit) (uses the {statmod} package)
resid <- augment(fit, type.predict = "response") |>
mutate(
resp_resid = resid(fit, "response"),
pearson_resid = resid(fit, "pearson"),
dev_resid = resid(fit, "deviance"),
quant_resid = statmod::qresid(fit)
) |>
select(-c(.hat, .sigma, .cooksd, .resid, .std.resid)) |>
rename(response = resp_resid, pearson = pearson_resid, deviance = dev_resid, quantile = quant_resid)
head(resid, 5)
## # A tibble: 5 × 8
## prop x `(weights)` .fitted response pearson deviance quantile
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 -4 3 0.0194 -0.0194 -0.244 -0.343 0.324
## 2 0 -3.86 3 0.0224 -0.0224 -0.262 -0.369 -1.35
## 3 0 -3.73 3 0.0258 -0.0258 -0.282 -0.396 -0.372
## 4 0 -3.59 3 0.0297 -0.0297 -0.303 -0.425 -0.672
## 5 0.333 -3.46 3 0.0341 0.299 2.85 1.75 2.03
Plotting the four kinds of residuals:
resid |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
ggplot(
mapping = aes(x = x)
) +
geom_point(aes(y = Residual)) +
facet_wrap(~type, scales = "free_y")
Discussing different kinds of residuals
Big idea: quantile residuals are the best!
Note that the quantile residuals best give the kind of residual plot we want to see for a correctly specified model:
- Centered around 0
- Constant vertical spread
Problem with Pearson and deviance residuals: they are harder to interpret because of the bands of points, caused by the fact that the proportion \(y\) can only take the values 0, 1/3, 2/3 and 1. To show the bands more clearly:
resid |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
filter(type %in% c("pearson", "deviance")) |>
mutate(prop = round(prop, 2) |> as.character()) |>
ggplot(
mapping = aes(x = x, y = Residual, color = prop)
) +
geom_point() +
facet_wrap(~type)
Quadratic data
Model only has linear dependence
Suppose the data truly has a quadratic dependence on x, but only a linear dependence is included in the model.
Simulate the data:
set.seed(123)
sim2 <- tibble(
x = seq(from=-3, to=3, length = 60),
lin_pred = .5*(x^2 - 4),
prob = exp(lin_pred) / (1 + exp(lin_pred)),
count = rbinom(n = length(x), size = 3, prob = prob),
prop = count / 3,
Trials = 3
)
sim2
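Fit the linear-only logit model (the chunk isn’t shown; presumably the same glm() call as before, applied to sim2):
fit2 <- glm(prop ~ x, family = binomial, weights = Trials, data = sim2)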
Plot fitted proportion values \(\hat{\mu}\) along with the observed proportions
ggplot(
data = augment(fit2, type.predict = "response"),
mapping = aes(x = x)
) +
geom_point(aes(y = prop)) +
geom_line(aes(y = .fitted))
resid2 <- augment(fit2, type.predict = "response") |>
mutate(
resp_resid = resid(fit2, "response"),
pearson_resid = resid(fit2, "pearson"),
dev_resid = resid(fit2, "deviance"),
quant_resid = statmod::qresid(fit2)
) |>
select(-c(.hat, .sigma, .cooksd, .resid, .std.resid)) |>
rename(response = resp_resid, pearson = pearson_resid, deviance = dev_resid, quantile = quant_resid)
resid2 |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
ggplot(
mapping = aes(x = x)
) +
geom_point(aes(y = Residual)) +
facet_wrap(~type, scales = "free_y")
Correct quadratic model specified
Fit the model. Note that we use the function I() here to include x^2 in the model without making it a variable in the dataset first.
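(Presumably:)
fit3 <- glm(prop ~ x + I(x^2), family = binomial, weights = Trials, data = sim2)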
Plot fitted proportion values \(\hat{\mu}\) along with the observed proportions
ggplot(
data = augment(fit3, type.predict = "response"),
mapping = aes(x = x)
) +
geom_point(aes(y = prop)) +
geom_line(aes(y = .fitted))
Calculate residuals
resid3 <- augment(fit3, type.predict = "response") |>
mutate(
resp_resid = resid(fit3, "response"),
pearson_resid = resid(fit3, "pearson"),
dev_resid = resid(fit3, "deviance"),
quant_resid = statmod::qresid(fit3)
) |>
select(-c(.hat, .sigma, .cooksd, .resid, .std.resid)) |>
rename(response = resp_resid, pearson = pearson_resid, deviance = dev_resid, quantile = quant_resid)
Plot residuals
resid3 |>
pivot_longer(cols = c(response, pearson, deviance, quantile), names_to = "type", values_to = "Residual") |>
ggplot(
mapping = aes(x = x)
) +
geom_point(aes(y = Residual)) +
facet_wrap(~type, scales = "free_y")