Chapter 3: Binomial Regression

Chapter 3 Notes

3.1 Binomial Regression Model

There are \(n_i\) trials and for each trial, there are outcomes \(Y_i\) ~ \(Binom(n_i,p_i)\) with binomial distribution with probability \(p_i\). In general, our goal is to predict \(p_i\).

\[\mathbb{P}(Y_i = y_i) = {n_i \choose y_i}p_{i}^{y_i}(1-p_i)^{n_i-y_i}\]

We assume that each \(Y_i\) is independent (IID)
All the \(Y_i\) are subject to q predictors \(\mathbf{x} = (x_{i1},x_{i2},..,x_{iq})\)
One group of trials, lets say the jth group: \(y_j = \beta_{0}+ \beta_{1}x_{i4j1}+\beta_{2}x_{j2}+...+\beta_{q}x_{jq})\) is called a covariate class.

Systematic component or linear predictor: \(\eta_i = \beta_{0}+ \beta_{1}x_{i4i1}+\beta_{2}x_{i2}+...+\beta_{q}x_{iq}\)
Link Function:\(logit: \mathbb{R} \rightarrow \mathbb{R}\), \(logit(p_i) = log(\frac{p_i}{1-p_i}) = \eta_i \in \mathbb{R}\) and \(ilogit(\mathbf{x}) = \frac{e^{\eta_i}}{1+e^{\eta_i}} \in [0,1]\)
Random Component: \(Y_i\) ~ \(Binom(n_i,p_i)\)

Using maximum likelihood method:

\[l(\mathbf{\beta})=\sum [y_{i}\eta_i - n_ilog(1+e^{\eta_{i}})+log({n_i \choose y_i})]\] Where we will maximize this function using some type of iterative technique.

We will study Binomial Regression using O-rings dataset. For context, Each shuttle had two boosters, each with 3 O-rings. This means in total we had 6 O-rings per shuttle. For each shuttle mission, we had trials \(n_i: n_1 = 6, n_2 = 6,...,n_{23}=6\) since \(1 \leq i \leq 23\). There were 23 missions and each with 6 trials. This means for each mission we output a \(y_i = (y_1,y_2,...,y_6)\) but we treat as a single count.

data(orings)
head(orings)

##   temp damage
## 1   53      5
## 2   57      1
## 3   58      1
## 4   63      1
## 5   66      0
## 6   67      0

Plotting:

plot(damage/6 ~ temp, orings, xlim=c(25,85), ylim=c(0,1), xlab= "Temp", ylab = "Probability of Damage")

We want to predict the probability of failure in a given O-ring and if it’s related to the launch temperature. In order to do this regression, we need 2 pieces of data, \(m_i\) and \(y_i\).

\(m_i\) : the total number of trials
\(y_i\): the number of “success” in the trial
We need to create a 2-column matrix \([y_i , m_i - y_i]\)

bmod <- glm(cbind(damage, 6-damage) ~ temp, family = "binomial", data= orings)
summary(bmod)

## 
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = "binomial", 
##     data = orings)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.66299    3.29626   3.538 0.000403 ***
## temp        -0.21623    0.05318  -4.066 4.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38.898  on 22  degrees of freedom
## Residual deviance: 16.912  on 21  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 6

Let’s use this summary and plot our fit:

\(ilogit (\eta_i) = \frac{e^{\eta_{i}}}{1-e^{\eta_i}}\)

We just need to plot the \(ilogit(.)\) function.

plot(damage/6 ~ temp, orings, xlim=c(25,85), ylim=c(0,1), xlab= "Temp", ylab = "Probability of Damage")
x <- seq(25,81,1)
lines(x, ilogit(coef(bmod)[1]+coef(bmod)[2]*x))

Let’s see what our probability is at 31C:

x <- 31
ilogit(coef(bmod)[1]+coef(bmod)[2]*x)

## (Intercept) 
##   0.9930342

plot(damage/6 ~ temp, orings, xlim=c(25,85), ylim=c(0,1), xlab= "Temp", ylab = "Probability of Damage")
x <- seq(25,81,1)
lines(x, ilogit(coef(bmod)[1]+coef(bmod)[2]*x))
abline(v=31, h=0.9930342,  col="red", lwd=2)

It seems that at 31C, there is a \(99\%\) chance of damage in the O-ring. However, before we make any conclusions, let us consider some inferential techniques.

3.2 Inference

We know that Deviance is:

\[D = 2 \sum [y_ilog(y_i)/\hat{y_i} + (m_i - y_i)log(m_i - y_i)/(m_i - \hat{y_i)}\] where \(\hat{y_i}\) are fitted values. We can use the likelihood method to test the goodness of fit.

Note: This isn’t possible for the Logistic Regression. For Logistic Regression we must indirect error estimation or direct error estimation techniques. We can look at AIC, AICc, BIC, Cross Validation errors, and ROC Curves. There also exist Hosmer-Lemeshow test, but sensitive to choice of bins.

Let’s test the fit of our model:

\(H_0:\) The model is a sufficient fit

\(H_1:\) The model is NOT a sufficient fit

residual_deviance <- deviance(bmod)
residual_deviance_df <- df.residual(bmod)

pchisq(residual_deviance, residual_deviance_df, lower= FALSE)

## [1] 0.7164099

We find that \(p-value = 0.7164099\), which is greater than \(0.05\). We fail to reject the null hypothesis. Assuming the null hypothesis is true, the observed data is likely under our null so we don’t have enough evidence to reject the null hypothesis. The model is a sufficient fit.

Note: Altough our model is a sufficient fit, our model may or may not be correct.

Let’s test our null model:

null_residual <- 38.898 
null_residual_df <- 22 

pchisq(null_residual, null_residual_df, lower= FALSE)

## [1] 0.0144964

We find that the null model is not a sufficient fit.

Note: Remember that we’re relying on asymptotic results like \(D\) ~ \(\chi^2(n - q - 1)\), where \(n:\) # trials and \(q:\) # of predictors. So if \(n_i < 5\) \(\forall i\), we run into issues.

We can also test nested models with this test as well!

\(D_S - D_L\) ~ \(\chi^2(l-s)\) where \(L\) denotes the larger model and \(S\) the smaller. This is asymptotically true assuming the smaller model is true and distributional assumptions hold.

\(H_0:\): The smaller model is as sufficient as the larger model

\(H_1:\) The smaller model is NOT as sufficient as the larger model

pchisq(null_residual- residual_deviance, null_residual_df - residual_deviance_df, lower= FALSE)

## [1] 2.746864e-06

#Smaller Model Deviance - Larger Model Deviance, Smaller Model DF - Larger Model DF

Alternatively, we can just use the ANOVA function:

anova(bmod, test="Chi")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(damage, 6 - damage)
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    22     38.898              
## temp  1   21.985        21     16.912 2.747e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since we only have one predictor this is equivalent to:

drop1(bmod, test= "Chi") #This is just anova() but only dropping one variable

## Single term deletions
## 
## Model:
## cbind(damage, 6 - damage) ~ temp
##        Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>      16.912 33.675                     
## temp    1   38.898 53.660 21.985 2.747e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our result is that we reject the null and so our smaller model is NOT sufficient. However more importantly, since this is just one variable we can also conclude that temperature is in fact a significant predictor.

What if we ungroup the data and treat it is as logistic regression?

erings <- erings <- with(orings, data.frame(temp=rep(temp,each=6), damage=as.vector(sapply(orings$damage, function(x) rep(c(0,1), times = c(6-x,x))))))

head(erings)

##   temp damage
## 1   53      0
## 2   53      1
## 3   53      1
## 4   53      1
## 5   53      1
## 6   53      1

dim(erings)

## [1] 138   2

We should expect 138 observations because \(23 * 6 = 138\).

lmod <- glm(damage ~ temp, family = "binomial", erings)
summary(lmod)

## 
## Call:
## glm(formula = damage ~ temp, family = "binomial", data = erings)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.66299    3.29616   3.538 0.000403 ***
## temp        -0.21623    0.05318  -4.066 4.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 76.745  on 137  degrees of freedom
## Residual deviance: 54.759  on 136  degrees of freedom
## AIC: 58.759
## 
## Number of Fisher Scoring iterations: 6

plot(damage ~ temp, erings, xlim=c(25,85), ylim=c(0,1), xlab= "Temp", ylab = "Probability of Damage")
x <- seq(0,85,1)
lines(ilogit(coef(lmod)[1] + coef(lmod)[2]*x))

Let’s once again test at 31C:

ilogit(coef(lmod)[1] + coef(lmod)[2]*31)

## (Intercept) 
##   0.9930342

Altough our coefficients, standard errors, and p-values are difficult, we still obtain the same results. We can check the confidence interval of our parameters \(\beta\):

confint(lmod)

## Waiting for profiling to be done...

##                  2.5 %    97.5 %
## (Intercept)  5.5751948 18.737598
## temp        -0.3326569 -0.120179

This likelihood method is preferred over \(\hat{\beta} + 2se(\hat{\beta)}\) approach.

3.3 Pearson’s \(\chi^2\) Statistics

There are alternative methods to testing the fit of the model that isn’t the deviance method. One such method is the Pearson’s test statistic.

\[X^2 = \sum \frac{(O_i - E_i)^2}{E_i}\] where \(O_i\) is the observed count and \(E_i\) is the expected count for the case \(i\). For a binomial response, we count the number of successes for which \(O_i = y_i\) while \(E_i = n_i\hat{p_i}\) and the failures for which \(O_i = n_i - y_i\) and \(E_i = n_i(1-p_i)\). Taking these results:

\[X^2 = \sum \frac{(y_i - n_i \hat{p_i})^2}{n_i\hat{p_i}(1-\hat{p_i})}\]

with this result, we can define the Pearson’s Residuals:

\[r_{i}^{p} = (y_i - n_i \hat{p_i})/ \sqrt{var(y_i)}\] Which we can view as some type of standardized residual, then:

\[X^2 = \sum (r_{i}^{p})^2\]

This means that the Pearson’s \(X^2\) is analogous to the RSS used in normal linear models.

The Pearson’s \(X^2\) will be close to the deviance and can be used in the same manner. With the same hypotheses, we may use this statistic over the deviance one. However, Pearson’s in some case may fail because our models are designed to minimize the deviance and NOT Pearson’s \(X^2\).

3.4 Overdispersion

If the binomial specification is correct we expect that the residual deviance will be approximately distributed \(\chi ^2\) with the appropriate degrees of freedom.

\[D(M) \sim \chi 2 (n-p)\]

However sometimes we observe Deviances \(D(M)\) that are very large. This can be due to 4 reasons:

The model structure is wrong: we have not included the right predictors or we have not transformed or combined them in the correct way
There are some outliers and we need to identify them and remove them. Typically we notice this in model diagnostics
The binomial distribution assumption might be wrong: if larger numbers of points are identified as outliers (or sparse data)
the binomial distribution makes sense but some changes are need (This is what we deal with in dispersion)

\((4)\) is typically some deficiencies in the random part of the model. The random part would be Y. \(Y \sim Binom(n,p)\) arises when the probability of success \(p\) is iid for each trial within the group. If the group size is \(n\), then \(var(Y) = np(1-p)\) assuming that the binomial assumption is correct. However, if this is violated, the variance may be greater.

More formally: For binomial regression model, we assume that for a given \(i\), we have \(Y_{i} \sim Binom(n_i,p_i)\) with \(p_i = p(x_i)\). So we can impose a relationship \(\mathbb{E}[Y_i] = n_i p_i\) and \(\mathbb{V}ar[Y_i] = n_i p_i (1-p_i)\). If this assumption is broken, then we may face Overdisperson or Underdisperson.

Overdispersion: \(\mathbb{V}ar(Y_i) > n_ip_i(1 − p_i)\); Meaning that our observed variance is greater than the expected variance.

Underdispersion: \(\mathbb{V}ar(Y_i) < n_ip_i(1 − p_i)\); Meaning that our observed variance is less than the expected variance.

Question: So what causes Overdispersion ?

Answer:

(1) When the identical assumption is violated

We typically assume in the binomial regression that for a given \(\mathbf{x_i}\). \(p_i (\mathbf{x_i}) is a constant value.\) That is, with the same \(\mathbf{X} = \mathbf{x_j}\), probability of \(Y = 1\) should be \(p_j\) constant.
In practice that is not necessarily true. Subject or people can have different probability of “successes” (\(Y = 1\)) given the same set of predictor variables. This is because there maybe some other characteristic that also affects the probability of having \(Y_i\) but the information isn’t collected
In laymen terms, although you’re using the identical predictor values to predict the probability of sucescess for two individuals, the outcome can vary quite a bit. Perhaps information that isn’t collected is looks. There are studies that argue that “better” looks tend correlated with greater success in life. Two outcomes \(Y_i\) and \(Y_j\) are NOT identical.

(2) When the independent assumption is violated

If the response has a common cause, the response will tend to be positively correlated
In many human or animal studies, one response can influence another’s response

Conclusion: Both of these violations lead to the same consequences: clusters in population, which inflates the variance.

Questions: How do we deal with Overdispersion?

Answer: The simplest strategy is to introduced an additional dispersion parameter \(\sigma ^2\)

Assuming that \(\mathbb{E}(Y_i)\) and \(\mathbb{V}ar(Y_i) = \sigma^2 n_i p_i (1-p_i)\), we can estimate:

\[\hat{\sigma}^2 = \frac{X^2}{n-p}\]

where \(X^2 = \sum \frac{(y_i - n_i \hat{p_i})^2}{n_i\hat{p_i}(1-\hat{p_i})}\), there Pearson’s \(X^2\) statistic. Note that \(X^2 \sim \chi ^2 (n-p)\) only when there is no dispersion. Alsom \(\mathbb{E}[\hat{\sigma}^2] = \frac{\mathbb{E}X^2}{n-p} \approx 1\)

Notice that the estimation of \(\mathbf{\beta}\) is unaffected since \(\sigma^2\) doesn’t change the mean response
However, we need to scale up the variance and it becomes: \(\mathbb{V}ar [\hat{\mathbf{\beta}}] = \hat{\sigma}^2 \mathbf{I}^{-1}(\mathbf{\beta})\)
Notice that when there NO dispersion, \(\sigma^2 \approx 1\) so nothing changes
To compare models now, we need to use an F-statistic (rather than LRT):

\[F = \frac{[D(small) - D(large)]/[df_{small} - df_{large}]}{\hat{\sigma}^2}\] \(df_{small}\) is the degrees of freedom of residuals for the small model

The The F-statistical is only an approximately F distributed, in contrast to the Gaussian case

Question: Is there another way to deal with Dispersion?

Answer:

For a binomial regression we assume that \(Y_{i} \sim Binom(n_i,p_i)\). where \(p_i = p(x_i)\) is modelled through a link function.
If no overdispersion we assume that all subjects with \(X = x_i\), have the same probability to have a sucess \(Y = 1\).
when there is over-dispersion, we might expect all subjects with \(X = x_i\) to have slightly different probability but still close to \(p(x_i)\). Why?: This is because of clustering effect. That is:

Assume \(Y_{i} \sim Binom(n_i,p_i)\). Let \(p_{i0} = p(x_i)\) be the overall probability of success for all subjects with \(X = x_i\), \(p_i\) is random with \(\mathbb{E}[p_i] = p_{i0}\) and \(\mathbb{V}ar[p_i] = \tau^{2}_{i}\)

We achieve a new model: \(\mathbb{E}[Y_i] = n_ip(x_i)\) and We achieve a new mode: \(\mathbb{V}ar[Y_i] = (n^{2}_{i} - n_i)\tau^{2}_{i} + n_ip_{i0}(1-p_{i0}\)
Note that \(n_ip_{i0}(1-p_{i0}\) is the expected variance
Notice our previous stratergy introducing the parameter \(\sigma^2\) ensures that: \(\mathbb{V}ar[Y_i] = \sigma^2 n_i p_{i0}(1-p_{i0})\) where \(\sigma^2\) is constant and independent of \(n_i\) and \(p_i\)
This is equivalent to choosing: \[\tau^{2}_{i} = \frac{(\sigma^2 - 1)p_{i0} (1- p_{i0})}{n_i - 1}\]

Question: Is there another way to deal with overdispersion:

Answer: try beta-binomial distribution

Recall for \(Y \sim Binom(n,p)\): \[\mathbb{P}(Y = y) = {n \choose y}p^{y}(1-p)^{n-y}\]

Recall for \(X \sim Beta(\alpha, \beta)\): \[\frac{1}{B(\alpha,\beta)} x^{\alpha - 1} (1-x)^{\beta - 1}\] where \(x \in [0,1]\),\(\alpha > 0\), \(\beta > 0\)

\(\mathbb{E}[X]= \frac{\alpha}{\alpha + \beta}, \mathbb{V}ar[X] = \frac{\alpha}{\alpha + \beta} \frac{\beta}{\alpha + \beta} \frac{1}{\alpha + \beta + 1}\)
Now let’s assume \(p \sim Beta(\alpha, \beta)\), the beta-binomial distribution

Now we’re assuming: \(Y \sim Binom(n,p)\) and \(p \sim Beta(\alpha, \beta)\)