Negative Binomial Regression

Objectives and Setup

Complete the MySay survey
Introduction of Negative Binomial (in comparison to Poisson)
Interpretation of coefficients
Generate model predictions
Nested model comparisons

# We need the following packages
library(tidyverse)
library(psyntur)
library(modelr)
library(MASS)     # glm.nb for Negative Binomial Generalised Linear Models
library(AER)      # dispersion test

Assumptions of Poisson models?

One parameter: \(\lambda\) (Greek lambda) which is the rate of an event.
Rate parameter \(\lambda\) is the expected number of events / frequency in fixed window (day, hour, km\(^2\) etc).
Lower bound at 0: distribution is not necessarily symmetric (for low rates).
As the average rate increases so does the variance, hence mean = variance.
Underdispersion: mean is larger than variance
Overdispersion: mean is smaller than variance

Dispersion check

# Generate more random numbers
x <- rpois(10000, lambda = 1.69)
# Calculate the mean 
mean(x)

[1] 1.6951

# Variance
var(x)

[1] 1.657102

For random samples from a Poisson distribution both mean and variance should be close to the \(\lambda\) parameter value.

The ratio of variance and mean var(x) / mean(x) shows that the variance is 0.978 times larger than the mean, so essentially equal.

Repeat for a \(\lambda\) value of 3.5.

Example: smoking data

# Load the smoking data
smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv")

# Mean and variance of `cigs`
describe(smoking_df, 
         avg = mean(cigs),
         var = var(cigs),
         disp = var(cigs) / mean(cigs))

# A tibble: 1 × 3
    avg   var  disp
  <dbl> <dbl> <dbl>
1  8.69  188.  21.7

Variance is substantially larger than the mean!

Example: publications of PhD students

Number of publications of PhD students in biochemistry in the US.

# Load data
biochem_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/biochemist.csv")

# Show data header
glimpse(biochem_df)

Rows: 915
Columns: 6
$ publications <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ gender       <chr> "Men", "Women", "Women", "Men", "Women", "Women", "Women"…
$ married      <chr> "Married", "Single", "Single", "Married", "Single", "Marr…
$ children     <dbl> 0, 0, 0, 1, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, …
$ prestige     <dbl> 2.520, 2.050, 3.750, 1.180, 3.750, 3.590, 3.190, 2.960, 4…
$ mentor       <dbl> 7, 6, 6, 3, 26, 2, 3, 4, 6, 0, 14, 13, 3, 4, 0, 1, 7, 13,…

Example: publications of PhD students

How much larger is the variance of publications compared to the average?

Example: publications of PhD students

# Mean and variance
describe(biochem_df, 
          avg = mean(publications),
          var = var(publications),
          disp = var(publications) / mean(publications))

# A tibble: 1 × 3
    avg   var  disp
  <dbl> <dbl> <dbl>
1  1.69  3.71  2.19

The variance is two times larger than the mean!

Evidence of overdispersion (variance is larger than one would expect under the assumptions of a Poisson process with the same mean).

Dispersion check

We can test whether the mean of publications \(\approx\) var of publication (equi-dispersion) assumption of a Poisson model was violated. Start by fitting a Poisson model:

m_poisson <- glm(publications ~ 1, data = biochem_df, family = poisson(link='log'))

Dispersion check

# Load AER
library(AER)
# Null hypothesis test for equi-dispersion
dispersiontest(m_poisson)

    Overdispersion test

data:  m_poisson
z = 4.6792, p-value = 1.44e-06
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  2.188963

Dispersion is ratio of variance to mean.
p-value of \(< 0.05\) means that the we reject the null hypothesis of equi-dispersion.
Data are incompatible with the assumption of equi-dispersion .
Therefore, equi-dispersion assumption of Poisson distribution was violated.

Negative binomial

Often used for data that show signs of overdispersion (larger variance than average rate).
Similar properties to Poisson distribution but extra parameter for variance (dispersion).
Main difference: Poisson distribution has a single parameter \(\lambda\) that represents both mean and variance.
Negative binomial distribution has two parameters:

\[ y_i \sim \text{NegBinomial}(\mu_i, r), \]

Distinguishing the mean \(\mu\) and dispersion parameter \(r\) (aka the ‘size’ and ‘shape’ of a distribution) means that

The width of the distribution is independent of the the location of the distribution.
The mean of the distribution varies as a function of the predictor variables

\[ \log(\mu_i) = \beta_0 + \beta_1 \times x_i. \]

where a log-link allows us to vary the mean as a linear function of predictors (\(x\)).

Probability of a value given a specific distribution

What is the probability of getting a rate of exactly 3 in Poisson distribution with \(\lambda\) = 1.75?

dpois(x = 3, lambda = 1.75)

[1] 0.15522

Probability of a value given a specific distribution

What is the probability of getting a rate of exactly 3 in a Negative Binomial distribution with \(\mu\) = 1.75 and \(r\) = 0.85?

dnbinom(x = 3, mu = 1.75, size = 0.85)

[1] 0.08805563

Exercise

A random variable \(x\) is distributed according to a Negative Binomial distribution with mean parameter \(\mu\) = 2.5 and dispersion \(r\) = 3. What is the probability that \(x\) takes the value of 2?
A random variable \(x\) is distributed according to a Negative Binomial distribution with mean parameter \(\mu\) = 10 and dispersion \(r\) = 4. What is the probability that \(x\) takes the value of 3?

dbinom(x = ..., mu = ..., size = ...)

Negative Binomial regression in \(R\)

# MASS package loads function `glm.nb`
library(MASS)

# Repeat the model from earlier but as NB
m_nb <- glm.nb(publications ~ 1, data = biochem_df)

Negative Binomial regression in \(R\)

# Negative Binomial model
summary(m_nb)$coef

             Estimate Std. Error  z value     Pr(>|z|)
(Intercept) 0.5264408 0.03586252 14.67942 8.734017e-49

Estimate is identical but standard error – and therefore z-value and p-value – aren’t.

# Poisson model
summary(m_poisson)$coef

             Estimate Std. Error  z value     Pr(>|z|)
(Intercept) 0.5264408 0.02540804 20.71945 2.312911e-95

The standard error is underestimated in the Poisson model: it doesn’t account for overdispersion. Therefore, if the equi-dispersion assumption of the Poisson model was violated, it is likely to lead to incorrect rejections of the null hypothesis.

Negative Binomial regression in \(R\)

# Dispersion parameter
r <- m_nb$theta

[1] 1.706205

# Average number of publications
mu <- exp(coef(m_nb))

[1] 1.692896

# Variance (compare to descriptive summary from earlier)
mu + mu ^ 2 / r

[1] 3.372588

Negative Binomial regression in \(R\)

Adding predictor variables to model:

m_nb_1 <- glm.nb(publications ~ children + married, data = biochem_df)

Interpretation of parameter value estimates same as in Poisson models:

summary(m_nb_1)$coef

                Estimate Std. Error   z value     Pr(>|z|)
(Intercept)    0.6431733 0.05764912 11.156689 6.641969e-29
children      -0.1228265 0.05348967 -2.296266 2.166065e-02
marriedSingle -0.1780358 0.08497173 -2.095236 3.615002e-02

Negative Binomial: models coefficients

What is the factor by which the mean (number of publications) changes for every unit increase in childen, i.e. \(e^{\beta_1}\) or

exp(-0.1228265)

[1] 0.8844171

For every additional child someone has the mean number of publications (during their PhD) changes by a factor of .884 (which is smaller than 1, so it’s decreasing), so it goes down by (1 - .885) * 100 = 11.6 %

Negative Binomial: models coefficients

What is the factor by which the mean (number of publications) changes for married vs single individuals?

Dummy coding 0 and 1 uses alphabetical order (by default):

contrasts(factor(biochem_df$married))

        Single
Married      0
Single       1

exp(-0.1780358)

[1] 0.8369125

Model predictions for Negative Binomial

# Data frame for both levels controlling for children
new_df <- tibble(married = c('Married', 'Single'),
                 children = 0)

# Generate predicted values on log scale
add_predictions(data = new_df, model = m_nb_1)

# A tibble: 2 × 3
  married children  pred
  <chr>      <dbl> <dbl>
1 Married        0 0.643
2 Single         0 0.465

Compare to estimates of (Intercept) and marriedSingle:

summary(m_nb_1)$coef

Model predictions for Negative Binomial

# Make a new dataframe with married and children
new_df <- tibble(married = 'Married', # married PhD students
                 children = c(0, 1, 2, 3, 4, 5)) # with 0-5 children

# Predictions are shown in log average values
add_predictions(data = new_df, model = m_nb_1)

# A tibble: 6 × 3
  married children   pred
  <chr>      <dbl>  <dbl>
1 Married        0 0.643 
2 Married        1 0.520 
3 Married        2 0.398 
4 Married        3 0.275 
5 Married        4 0.152 
6 Married        5 0.0290

Model predictions for Negative Binomial

# Make a new dataframe with married and children
new_df <- tibble(married = 'Married', # married PhD students
                 children = c(0, 1, 2, 3, 4, 5)) # with 0-5 children

# Predictions are shown as average number of publications
add_predictions(data = new_df, model = m_nb_1, type = "response")

# A tibble: 6 × 3
  married children  pred
  <chr>      <dbl> <dbl>
1 Married        0  1.90
2 Married        1  1.68
3 Married        2  1.49
4 Married        3  1.32
5 Married        4  1.16
6 Married        5  1.03

Predictions exercise (manually)

betas <- coef(m_nb_1) # assigned the coefficient values to a vector `betas`
betas

  (Intercept)      children marriedSingle 
    0.6431733    -0.1228265    -0.1780358

What is the predicted log mean of number of publications for married PhD students with no children?
What is the predicted mean of number of publications for married PhD students with no children?
What is the predicted mean of number of publications for married PhD students with two children?

Model comparison: Deviance

Residual deviance in summary() output (next slide) and deviance() is not the deviance of this model!

deviance(m_nb_1)

[1] 1001.536

This is twice the likelihood ratio comparing our model to the saturated model (perfect fit to data because one parameter per observation).

Deviance calculation from log-likelihood is similar to Binomial models, not similar to Poisson.

-2 * as.numeric(logLik(m_nb_1))

[1] 3213.316

Model comparison: Deviance

Call:
glm.nb(formula = publications ~ children + married, data = biochem_df, 
    init.theta = 1.735947852, link = log)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.64317    0.05765  11.157   <2e-16 ***
children      -0.12283    0.05349  -2.296   0.0217 *  
marriedSingle -0.17804    0.08497  -2.095   0.0362 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.7359) family taken to be 1)

    Null deviance: 1008.1  on 914  degrees of freedom
Residual deviance: 1001.5  on 912  degrees of freedom
AIC: 3221.3

Number of Fisher Scoring iterations: 1

              Theta:  1.736 
          Std. Err.:  0.182 

 2 x log-likelihood:  -3213.316

Model comparison: Deviance

Deviance in normal linear regression and binomial logistic regression is \(-2 \times \log(\mathcal{L})\) where \(\mathcal{L}\) is the likelihood. That’s the same for Negative Binomial:

-2 * as.numeric(logLik(m_nb_1))

[1] 3213.316

Model comparison: Deviance

-2 * as.numeric(logLik(m_nb))

[1] 3219.873

-2 * as.numeric(logLik(m_nb_1))

[1] 3213.316

Which model is better?

Model comparison: LRT

anova(m_nb, m_nb_1, test = 'Chisq')

Likelihood ratio tests of Negative Binomial Models

Response: publications
               Model    theta Resid. df    2 x log-lik.   Test    df LR stat.
1                  1 1.706205       914       -3219.873                      
2 children + married 1.735948       912       -3213.316 1 vs 2     2  6.55726
     Pr(Chi)
1           
2 0.03767984

theta: estimated value of dispersion parameter \(r\)
LR stat. = difference in deviances
Note: if you don’t add the model with the larger deviance first, ignore the sign of the difference in deviance

Model comparison: LRT

anova(m_nb, m_nb_1, test = 'Chisq')

Likelihood ratio tests of Negative Binomial Models

Response: publications
               Model    theta Resid. df    2 x log-lik.   Test    df LR stat.    Pr(Chi)
1                  1 1.706205       914       -3219.873                                 
2 children + married 1.735948       912       -3213.316 1 vs 2     2  6.55726 0.03767984

# Deviance of model 1
deviance_m_nb <- -2 * as.numeric(logLik(m_nb))
# Deviance of model 2
deviance_m_nb_1 <- -2 * as.numeric(logLik(m_nb_1))

Model comparison: LRT

anova(m_nb, m_nb_1, test = 'Chisq')

Likelihood ratio tests of Negative Binomial Models

Response: publications
               Model    theta Resid. df    2 x log-lik.   Test    df LR stat.    Pr(Chi)
1                  1 1.706205       914       -3219.873                                 
2 children + married 1.735948       912       -3213.316 1 vs 2     2  6.55726 0.03767984

# Their difference
diff_of_dev <- deviance_m_nb - deviance_m_nb_1

[1] 6.55726

pchisq(diff_of_dev, df = 2, lower.tail = FALSE)

[1] 0.03767984

Exercise: Poisson and Negative Binomial

Start by implementing a Poisson model where the average number of publications is \(\text{publications}_i \sim \text{Poisson}(\lambda_i)\) and varies as a function of

\[ \mathcal{M}_3: \text{log}(\lambda_i) = \beta_0 + \beta_1 \times \text{married}_i + \beta_2 \times \text{prestige}_i + \beta_3 \times \text{children}_i \\ \] Conduct a dispersion check for the Poisson model.

Exercise: Poisson and Negative Binomial

As a next step implement the same model \(\mathcal{M}_3\) using Negative Binomial distribution \(\text{publications}_i \sim \text{NegBinomial}(\mu_i, r)\).

Compare the standard errors of the model coefficients obtained via summary. Are the standard errors of the Poisson model smaller compared to the results of the Negative Binomial model? What’s the consequence of underestimated standard errors (given the result of the dispersion check) for our inference?

Compare multiple models

Using anova(..., ...., ..., test = 'Chisq') compare the three Negative Binomial models you have built in this session.

Which model is the best model and what does that mean?
Try to calculate the difference in the deviance (LR stat.).
Start by determining how you can calculate the value in 2 x log-lik. from the value provided by logLik for each model.
Use pchisq to reproduce the p-values shown in Pr(Chi).

Evaluating the results of the best fitting model

Use summary for the best fitting model to obtain the slope coefficients. By what factor does the number of publications change for each predictor?

What is the predicted log and raw number of publications for a single PhD students with 3 children at an institution with a prestige of 3?