Poisson versus Negative Binomial model:

Two distributions are proposed in the text for use in count regression. The Poisson distribution is the most common application. It is described in the reading of modelling several natural processes: events like a rare illness that have low prevalence in a population, those events whose occurrence is proportional to the observed duration between events (i.e., events that are generally rare but may also cluster temporally), or any event where the time between events follows an exponential distribution. Poisson regression also works best when the count only considers ‘absence’ or ‘presence’ of an event, and not an event with multiple possible outcomes.

In contrast, the negative binomial distribution accommodates processes that do not meet the expectation of proportional variance relative to mean. Negative binomial regression is an alternative to Poisson regression in the case that that the assumptions are not met.

In practice, this means that Poisson models cannot be overdispersed. Overdispersion in the context of generalized linear models means that the variance of observations exceed what is expected based on the underlying parameters.

For homework 5, count regression is used to estimate the number of wine cases ordered of a specific brand. Both Poisson and negative binomial regressions are to be considered. To support one over the other, consider models for each regression, plus the findings of the R function ‘check_dispersion() in the performance package.

poissonmod <- glm(wine_train, formula = TARGET ~., family = poisson)

summary(poissonmod)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9803  -0.7083   0.0639   0.5756   3.2351  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.618e+00  2.368e-01   6.830 8.49e-12 ***
## FixedAcidity       -1.785e-04  1.001e-03  -0.178 0.858472    
## VolatileAcidity    -3.296e-02  7.888e-03  -4.178 2.94e-05 ***
## CitricAcid          4.358e-03  7.178e-03   0.607 0.543785    
## ResidualSugar      -5.403e-05  1.831e-04  -0.295 0.767882    
## Chlorides          -4.827e-02  1.939e-02  -2.489 0.012815 *  
## FreeSulfurDioxide   1.275e-04  4.173e-05   3.057 0.002239 ** 
## TotalSulfurDioxide  9.401e-05  2.698e-05   3.484 0.000493 ***
## Density            -3.618e-01  2.332e-01  -1.552 0.120766    
## pH                 -1.708e-02  9.073e-03  -1.883 0.059759 .  
## Sulphates          -1.092e-02  6.657e-03  -1.640 0.101005    
## Alcohol             1.492e-03  1.677e-03   0.890 0.373490    
## LabelAppeal         1.324e-01  7.369e-03  17.963  < 2e-16 ***
## AcidIndex          -8.671e-02  5.479e-03 -15.824  < 2e-16 ***
## STARS               3.094e-01  5.532e-03  55.936  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15334.3  on 8674  degrees of freedom
## Residual deviance:  9962.1  on 8660  degrees of freedom
##   (4120 observations deleted due to missingness)
## AIC: 31705
## 
## Number of Fisher Scoring iterations: 5

The initial Poisson model indicates several variables which are highly correlated to the number of cases sold. Next, lets check dispersion using the two different functions.

summary(glm.nb(wine_train, formula = TARGET ~.))
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 49024.79946, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9802  -0.7083   0.0639   0.5756   3.2350  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.618e+00  2.368e-01   6.830 8.50e-12 ***
## FixedAcidity       -1.785e-04  1.001e-03  -0.178 0.858492    
## VolatileAcidity    -3.296e-02  7.888e-03  -4.178 2.94e-05 ***
## CitricAcid          4.358e-03  7.178e-03   0.607 0.543793    
## ResidualSugar      -5.402e-05  1.831e-04  -0.295 0.767908    
## Chlorides          -4.827e-02  1.939e-02  -2.489 0.012816 *  
## FreeSulfurDioxide   1.276e-04  4.173e-05   3.056 0.002240 ** 
## TotalSulfurDioxide  9.401e-05  2.698e-05   3.484 0.000493 ***
## Density            -3.618e-01  2.332e-01  -1.552 0.120773    
## pH                 -1.708e-02  9.074e-03  -1.883 0.059760 .  
## Sulphates          -1.092e-02  6.657e-03  -1.640 0.101004    
## Alcohol             1.492e-03  1.677e-03   0.890 0.373527    
## LabelAppeal         1.324e-01  7.369e-03  17.963  < 2e-16 ***
## AcidIndex          -8.671e-02  5.479e-03 -15.824  < 2e-16 ***
## STARS               3.094e-01  5.532e-03  55.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(49024.8) family taken to be 1)
## 
##     Null deviance: 15333.7  on 8674  degrees of freedom
## Residual deviance:  9961.8  on 8660  degrees of freedom
##   (4120 observations deleted due to missingness)
## AIC: 31708
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  49025 
##           Std. Err.:  61688 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -31675.54
dispersiontest(poissonmod)
## 
##  Overdispersion test
## 
## data:  poissonmod
## z = -9.8815, p-value = 1
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##  0.8686586
check_overdispersion(poissonmod)
## # Overdispersion test
## 
##        dispersion ratio =    0.852
##   Pearson's Chi-Squared = 7381.489
##                 p-value =        1
## No overdispersion detected.

Both checks for overdispersion indicate there is no need to use a negative binomial model. Next, lets visualize the wine case values using a rootogram, which compares a histogram to the expected Poisson distribution.

rootogram(poissonmod)

Clearly, these data do not match the expected distributions, owing clearly to the zero-inflated response variable. Because of this, a zero-inflated model can be used instead.

mod_zinf <- zeroinfl(data = wine_train, formula = TARGET ~., dist = 'poisson')
summary(mod_zinf)
## 
## Call:
## zeroinfl(formula = TARGET ~ ., data = wine_train, dist = "poisson")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.111684 -0.409481 -0.008564  0.372250  5.365008 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.478e+00  2.452e-01   6.026 1.68e-09 ***
## FixedAcidity        1.073e-04  1.029e-03   0.104  0.91694    
## VolatileAcidity    -1.389e-02  8.109e-03  -1.712  0.08684 .  
## CitricAcid         -1.661e-03  7.353e-03  -0.226  0.82125    
## ResidualSugar      -1.168e-04  1.876e-04  -0.623  0.53350    
## Chlorides          -2.289e-02  1.994e-02  -1.148  0.25094    
## FreeSulfurDioxide   2.357e-05  4.222e-05   0.558  0.57664    
## TotalSulfurDioxide -1.845e-05  2.684e-05  -0.687  0.49181    
## Density            -3.101e-01  2.413e-01  -1.285  0.19884    
## pH                  7.124e-03  9.354e-03   0.762  0.44632    
## Sulphates          -1.607e-04  6.857e-03  -0.023  0.98131    
## Alcohol             6.709e-03  1.719e-03   3.902 9.55e-05 ***
## LabelAppeal         2.323e-01  7.664e-03  30.312  < 2e-16 ***
## AcidIndex          -1.907e-02  5.905e-03  -3.230  0.00124 ** 
## STARS               9.924e-02  6.331e-03  15.674  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -4.8669920  1.6038736  -3.035 0.002409 ** 
## FixedAcidity       -0.0028185  0.0068730  -0.410 0.681741    
## VolatileAcidity     0.1484965  0.0526773   2.819 0.004818 ** 
## CitricAcid         -0.0300724  0.0487991  -0.616 0.537730    
## ResidualSugar      -0.0015925  0.0012209  -1.304 0.192122    
## Chlorides           0.1193311  0.1282052   0.931 0.351966    
## FreeSulfurDioxide  -0.0007017  0.0002826  -2.483 0.013034 *  
## TotalSulfurDioxide -0.0011350  0.0001816  -6.250 4.11e-10 ***
## Density             0.8880925  1.5838941   0.561 0.575001    
## pH                  0.2308894  0.0617621   3.738 0.000185 ***
## Sulphates           0.1493774  0.0448840   3.328 0.000874 ***
## Alcohol             0.0360414  0.0115976   3.108 0.001886 ** 
## LabelAppeal         0.7180759  0.0517719  13.870  < 2e-16 ***
## AcidIndex           0.4247642  0.0309185  13.738  < 2e-16 ***
## STARS              -2.3531344  0.0717061 -32.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 36 
## Log-likelihood: -1.385e+04 on 30 Df
rootogram(mod_zinf)

While not an ideal fit, the zero-inflated model helps explain more of the zero values, and graphically accomodates the non-zero values better as well.