suppressPackageStartupMessages({
  library(tidyverse)
})
needledat <- readr::read_csv("needle_sharing.csv")
needledat2 <- needledat %>%
  dplyr::filter(sex %in% c("M", "F") &
                  ethn %in% c("White", "AA", "Hispanic") &
                  !is.na(homeless)) %>%
  mutate(
    homeless = recode(homeless, "0" = "no", "1" = "yes"),
    hiv = recode(
      hivstat,
      "0" = "negative",
      "1" = "positive",
      "2" = "yes"
    )
  )
meanvarzeros <- needledat2 %>%
  filter(complete.cases(shared_syr)) %>%
  summarise(
    mean = mean(shared_syr),
    var = var(shared_syr),
    fraczero = sum(shared_syr == 0) / length(shared_syr)
  )
meanvarzeros
The mean number of needle-sharing events per participant is 3.12, the variance is 113, and the fraction of participants who never shared a needle is 0.774. Since the variance is far larger than the mean, these counts are heavily over-dispersed relative to a Poisson distribution, which will matter below. (Note how I put computed results in the text here rather than typing the numbers in manually; they will update automatically if the analysis changes!)
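In the R Markdown source this is done with inline code chunks, along these lines (the exact expressions and rounding here are an illustration, not necessarily the original source):

The mean number of needle-sharing events per participant is `r round(meanvarzeros$mean, 2)`, the variance is `r round(meanvarzeros$var, 0)`, and the fraction of participants who never shared a needle is `r round(meanvarzeros$fraczero, 3)`.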
This was done in the lecture using base R, but let's do it here with ggplot2. Filtering to complete cases is not strictly necessary, because ggplot2 drops missing values anyway, but it gets rid of a warning (try it without the filter). Specifying the binwidth is also not strictly necessary (the default of 30 bins gives bins of roughly this width for these data), but binwidth = 2 makes the grouping explicit: 0 and 1 in the same bin, 2 and 3 together, and so on.
library(ggplot2)
filter(needledat2, complete.cases(shared_syr)) %>%
  ggplot(aes(x = shared_syr)) +
  geom_histogram(binwidth = 2, fill = "blue")
fit.pois <- glm(shared_syr ~ sex + ethn + homeless + dprsn_dx,
                data = needledat2,
                family = poisson(link = "log"))
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
fit.negbin <- MASS::glm.nb(shared_syr ~ sex + ethn + homeless + dprsn_dx,
                           data = needledat2)
## Warning: glm.fit: algorithm did not converge
## (the warning above was repeated many times while glm.nb iterated)
## Warning in MASS::glm.nb(shared_syr ~ sex + ethn + homeless + dprsn_dx, data =
## needledat2): alternation limit reached
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
fit.ZIpois <- pscl::zeroinfl(shared_syr ~ sex + ethn + homeless + dprsn_dx | 1,
                             dist = "poisson",
                             data = needledat2)
fit.ZInegbin <-
  pscl::zeroinfl(shared_syr ~ sex + ethn + homeless + dprsn_dx | 1,
                 dist = "negbin",
                 data = needledat2)
I want to calculate the log-likelihood from each model. The simplest way is to call the logLik function one at a time:
logLik(fit.pois)
## 'log Lik.' -707.0519 (df=6)
logLik(fit.negbin)
## 'log Lik.' -141.1498 (df=7)
logLik(fit.ZIpois)
## 'log Lik.' -289.9595 (df=7)
logLik(fit.ZInegbin)
## 'log Lik.' -139.8061 (df=8)
Just to demonstrate a fancier way that could be used on many models, I’ll create a list of model objects:
listofmodels <- list(
  poisson = fit.pois,
  negbin = fit.negbin,
  ZIpois = fit.ZIpois,
  ZInegbin = fit.ZInegbin
)
Then demonstrate how lapply (and related functions like sapply) can be used to implicitly loop over the elements of a list and do exactly the same thing:
lapply(listofmodels, logLik)
## $poisson
## 'log Lik.' -707.0519 (df=6)
##
## $negbin
## 'log Lik.' -141.1498 (df=7)
##
## $ZIpois
## 'log Lik.' -289.9595 (df=7)
##
## $ZInegbin
## 'log Lik.' -139.8061 (df=8)
OK, now to actually answer the question. First, to get an idea of how big a difference in \(-2 \times \log(\text{likelihood})\) would be statistically significant for a difference of one degree of freedom:
qchisq(p = 0.95, df = 1) # critical value is 3.84
## [1] 3.841459
qchisq(p = 0.05, df = 1, lower.tail = FALSE) # equivalent
## [1] 3.841459
Or for a difference of two degrees of freedom:
qchisq(p = 0.95, df = 2) # critical value is about 6
## [1] 5.991465
All of the differences in \(-2 \times \log(\text{likelihood})\) above are far larger (in the hundreds) than these critical values, except for the difference between the Negative Binomial and zero-inflated Negative Binomial models, which is only \(2 \times (141.15 - 139.81) \approx 2.7\), below the 3.84 cutoff. So zero inflation does not appear to help the Negative Binomial model, but it does help the Poisson model, and the Negative Binomial model fits far better than the Poisson model.
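Just as a sketch, the same comparisons can be computed directly from the list of models defined above (this uses only objects already created, and the numbers should match the logLik output shown earlier):

ll <- sapply(listofmodels, logLik)   # named numeric vector of log-likelihoods
2 * (ll["negbin"] - ll["poisson"])   # Negative Binomial vs Poisson: huge
2 * (ll["ZIpois"] - ll["poisson"])   # zero-inflated Poisson vs Poisson: huge
2 * (ll["ZInegbin"] - ll["negbin"])  # zero-inflated NB vs NB: about 2.7, not significant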
These were the (base graphics) functions defined to create the first two panels of residual plots for all of these types of models.
plotpanel1 <- function(fit, ...) {
  plot(
    x = predict(fit),
    y = residuals(fit, type = "pearson"),
    xlab = "Predicted Values",
    ylab = "Pearson Residuals",
    ...
  )
  abline(h = 0, lty = 3)
  lines(lowess(x = predict(fit), y = resid(fit, type = "pearson")),
        col = "green")
}
plotpanel2 <- function(fit, ...) {
  resids <- scale(residuals(fit, type = "pearson"))
  qqnorm(resids, ylab = "Std Pearson resid.", ...)
  qqline(resids)
}
Let’s make these plots. As a shortcut, remember that list of models? I’m going to use an explicit for loop this time instead of lapply, so that I can use the list’s names as plot titles.
Although we saw some evidence from the chi-square comparison that the Negative Binomial distribution fits better than the Poisson (not surprising, since these data are very over-dispersed), the diagnostic plots below show that the Negative Binomial model still does not fit well at all. I would treat any interpretation of the coefficients of these models with plenty of skepticism. But this dataset is tricky, and I’m not sure offhand of a good model for it.
Note, the line par(mfrow=c(1, 2)) only works for base graphics (not ggplot2), and creates a 1 row by 2 column plot panel.
par(mfrow=c(1, 2))
for (i in seq_along(listofmodels)) {
  plotpanel1(listofmodels[[i]], main = names(listofmodels)[i])
  plotpanel2(listofmodels[[i]], main = names(listofmodels)[i])
}
Here is a tibble that we can use to make histograms, density plots, etc.
preds <- tibble(
  observed = fit.pois$y,
  # predict() on glm / glm.nb fits defaults to the link (log) scale, while
  # predict() on zeroinfl fits defaults to the response scale, so request
  # the response (count) scale explicitly for all four models:
  poisson = predict(fit.pois, type = "response"),
  negbin = predict(fit.negbin, type = "response"),
  ZIpois = predict(fit.ZIpois, type = "response"),
  ZInegbin = predict(fit.ZInegbin, type = "response")
)
Just to help with plotting, let’s pivot this into long format:
preds.long <- pivot_longer(preds, everything())
What did this do?
preds.long
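It stacked the five columns of preds into just two: a name column recording which series each value came from (observed, or one of the four models) and a value column holding the corresponding count or predicted count, so there is one row per series per participant. That long format is exactly what facet_wrap(~name) below needs.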
ggplot(preds.long, aes(x = value)) +
  facet_wrap(~name) +
  geom_histogram(binwidth = 2)
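The same long tibble also works for the density plots mentioned above; for example, swapping the geom and overlaying the series rather than faceting:

ggplot(preds.long, aes(x = value, colour = name)) +
  geom_density()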
We can see that none of the models come close to capturing the extreme observed counts of 30+. In reality, these data might require a more complex mixture model: for example, zero inflation plus a mixture of two different count distributions, one with a much higher mean. That is beyond the scope of this course, but more flexible mixture models along those lines could fit this dataset better.
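Purely as an illustration of that idea (we did not cover this), here is a minimal sketch of a two-component Poisson mixture using the flexmix package, assuming it is installed; the hope is that one latent component absorbs the very high counts while the other handles the bulk of the data near zero. This omits the additional zero-inflation piece and is not tuned to this dataset:

library(flexmix)
# keep only rows with no missing values in the variables used by the model
mixdat <- dplyr::filter(needledat2,
                        complete.cases(shared_syr, sex, ethn, homeless, dprsn_dx))
fit.mix <- flexmix(shared_syr ~ sex + ethn + homeless + dprsn_dx,
                   data = mixdat,
                   k = 2,                                # two latent components
                   model = FLXMRglm(family = "poisson"))
summary(fit.mix)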