Models for excess zeros using pscl package (Hurdle and zero-inflated regression models) and their interpretations

References

Load packages

## Load pscl
library(pscl)
## Load sandwich package for robust sandwich covariance matrix estimators
library(sandwich)
## Load lmtest package for coeftest
library(lmtest)
## Load MASS for glm.nb
library(MASS)

Load data: Demand for medical care by the elderly (see first reference)

## Load
load("./DebTrivedi.RData")
## Show summary
summary(DebTrivedi)
      ofp             ofnp             opp              opnp             emer             hosp      
 Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   :0.000  
 1st Qu.: 1.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:0.000  
 Median : 4.00   Median :  0.00   Median :  0.00   Median :  0.00   Median : 0.000   Median :0.000  
 Mean   : 5.77   Mean   :  1.62   Mean   :  0.75   Mean   :  0.54   Mean   : 0.264   Mean   :0.296  
 3rd Qu.: 8.00   3rd Qu.:  1.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.: 0.000   3rd Qu.:0.000  
 Max.   :89.00   Max.   :104.00   Max.   :141.00   Max.   :155.00   Max.   :12.000   Max.   :8.000  
       health        numchron    adldiff        region          age       black         gender     married   
 poor     : 554   Min.   :0.00   no :3507   midwest:1157   Min.   : 6.6   no :3890   female:2628   no :2000  
 average  :3509   1st Qu.:1.00   yes: 899   noreast: 837   1st Qu.: 6.9   yes: 516   male  :1778   yes:2406  
 excellent: 343   Median :1.00              other  :1614   Median : 7.3                                      
                  Mean   :1.54              west   : 798   Mean   : 7.4                                      
                  3rd Qu.:2.00                             3rd Qu.: 7.8                                      
                  Max.   :8.00                             Max.   :10.9                                      
     school         faminc      employed   privins    medicaid  
 Min.   : 0.0   Min.   :-1.01   no :3951   no : 985   no :4004  
 1st Qu.: 8.0   1st Qu.: 0.91   yes: 455   yes:3421   yes: 402  
 Median :11.0   Median : 1.70                                   
 Mean   :10.3   Mean   : 2.53                                   
 3rd Qu.:12.0   3rd Qu.: 3.17                                   
 Max.   :18.0   Max.   :54.84                                   
## Show head
head(DebTrivedi)
  ofp ofnp opp opnp emer hosp  health numchron adldiff region age black gender married school faminc employed
1   5    0   0    0    0    1 average        2      no  other 6.9   yes   male     yes      6 2.8810      yes
2   1    0   2    0    2    0 average        2      no  other 7.4    no female     yes     10 2.7478       no
3  13    0   0    0    3    3    poor        4     yes  other 6.6   yes female      no     10 0.6532       no
4  16    0   5    0    1    1    poor        2     yes  other 7.6    no   male     yes      3 0.6588       no
5   3    0   0    0    0    0 average        2     yes  other 7.9    no female     yes      6 0.6588       no
6  17    0   0    0    0    0    poor        5     yes  other 6.6    no female      no      7 0.3301       no
  privins medicaid
1     yes       no
2     yes       no
3      no      yes
4     yes       no
5     yes       no
6      no      yes

Deb and Trivedi (1997) analyzed data on 4406 individuals, aged 66 and over, who are covered by Medicare, a public insurance program.

1       ofp     Number of physician office visits (integer outcome)
2      ofnp
3       opp
4      opnp
5      emer
6      hosp     Number of hospital stays (integer)
7    health     Self-perceived health status (poor, average, excellent)
8  numchron     Number of chronic condition (integer)
9   adldiff
10   region
11      age
12    black
13   gender     Gender (female, male)
14  married
15   school     Number of years of education (integer)
16   faminc
17 employed
18  privins     Private insurance indicator (no, yes)
19 medicaid

Check outcome (Number of physician office visits)

## Outcome
numberOfPhysicianOfficeVisits <- DebTrivedi$ofp
## Plot
plot(sort(numberOfPhysicianOfficeVisits))

plot of chunk unnamed-chunk-4

Poisson regression

This appears too optimistic with the standard variance estimation. The estimation by the robust variance estimator appears more realistic.

## Create a formula
formula <- ofp ~ hosp + health + numchron + gender + school + privins
## Fit Poisson
modelPoisson <- glm(formula = formula,
                    family  = poisson(link = "log"),
                    data    = DebTrivedi)
summary(modelPoisson)

Call:
glm(formula = formula, family = poisson(link = "log"), data = DebTrivedi)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-8.405  -1.996  -0.674   0.705  16.362  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.02887    0.02378   43.26   <2e-16 ***
hosp             0.16480    0.00600   27.48   <2e-16 ***
healthpoor       0.24831    0.01784   13.92   <2e-16 ***
healthexcellent -0.36199    0.03030  -11.95   <2e-16 ***
numchron         0.14664    0.00458   32.02   <2e-16 ***
gendermale      -0.11232    0.01295   -8.68   <2e-16 ***
school           0.02614    0.00184   14.18   <2e-16 ***
privinsyes       0.20169    0.01686   11.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23168  on 4398  degrees of freedom
AIC: 35959

Number of Fisher Scoring iterations: 5
## Test using robust sandwich covariance matrix estimators
coeftest(modelPoisson, vcov = sandwich)

z test of coefficients:

                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.02887    0.06453   15.94  < 2e-16 ***
hosp             0.16480    0.02195    7.51  5.9e-14 ***
healthpoor       0.24831    0.05402    4.60  4.3e-06 ***
healthexcellent -0.36199    0.07745   -4.67  3.0e-06 ***
numchron         0.14664    0.01291   11.36  < 2e-16 ***
gendermale      -0.11232    0.03534   -3.18   0.0015 ** 
school           0.02614    0.00508    5.14  2.7e-07 ***
privinsyes       0.20169    0.04313    4.68  2.9e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Exponentiated coefficients
exp(coef(modelPoisson))
    (Intercept)            hosp      healthpoor healthexcellent        numchron      gendermale          school 
         2.7979          1.1792          1.2819          0.6963          1.1579          0.8938          1.0265 
     privinsyes 
         1.2235 

Interpretation: The “baseline” average office visit count is 2.8. The other exponentiated coefficients are interpreted multiplicatively. One unit increase in hospital stays increases the average office visits by 1.18 times. Poor health compared to average health status increases the average office visits by 1.28 times, whereas excellent health decreases the average office visits by 0.70 times.

Quasi-Poisson regression

The dispersion parameter is greater than 1, indicating overdispersion

## Fit quasi-Poisson
modelQuasiPoisson <- glm(formula = formula,
                         family  = quasipoisson(link = "log"),
                         data    = DebTrivedi)
summary(modelQuasiPoisson)

Call:
glm(formula = formula, family = quasipoisson(link = "log"), data = DebTrivedi)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-8.405  -1.996  -0.674   0.705  16.362  

Coefficients:
                Estimate Std. Error t value    Pr(>|t|)    
(Intercept)      1.02887    0.06159   16.70     < 2e-16 ***
hosp             0.16480    0.01553   10.61     < 2e-16 ***
healthpoor       0.24831    0.04621    5.37 0.000000081 ***
healthexcellent -0.36199    0.07848   -4.61 0.000004086 ***
numchron         0.14664    0.01186   12.36     < 2e-16 ***
gendermale      -0.11232    0.03352   -3.35     0.00081 ***
school           0.02614    0.00477    5.48 0.000000046 ***
privinsyes       0.20169    0.04366    4.62 0.000003959 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 6.706)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23168  on 4398  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The point estimates have the same interpretations as the regular Poisson regression.

Negative binomial regression

## Fit negative binomial
modelNB <- glm.nb(formula = formula,
                  data    = DebTrivedi)
summary(modelNB)

Call:
glm.nb(formula = formula, data = DebTrivedi, init.theta = 1.206603534, 
    link = log)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.047  -0.995  -0.295   0.296   5.818  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      0.92926    0.05459   17.02  < 2e-16 ***
hosp             0.21777    0.02018   10.79  < 2e-16 ***
healthpoor       0.30501    0.04851    6.29  3.2e-10 ***
healthexcellent -0.34181    0.06092   -5.61  2.0e-08 ***
numchron         0.17492    0.01209   14.47  < 2e-16 ***
gendermale      -0.12649    0.03122   -4.05  5.1e-05 ***
school           0.02682    0.00439    6.10  1.0e-09 ***
privinsyes       0.22440    0.03946    5.69  1.3e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.207) family taken to be 1)

    Null deviance: 5743.7  on 4405  degrees of freedom
Residual deviance: 5044.5  on 4398  degrees of freedom
AIC: 24359

Number of Fisher Scoring iterations: 1

              Theta:  1.2066 
          Std. Err.:  0.0336 

 2 x log-likelihood:  -24341.1070 
## Exponentiated coefficients
exp(coef(modelNB))
    (Intercept)            hosp      healthpoor healthexcellent        numchron      gendermale          school 
         2.5326          1.2433          1.3566          0.7105          1.1911          0.8812          1.0272 
     privinsyes 
         1.2516 

Interpretation: The “baseline” average office visit count is 2.5. The other exponentiated coefficients are interpreted multiplicatively. One unit increase in hospital stays increases the average office visits by 1.24 times. Poor health compared to average health status increases the average office visits by 1.36 times, whereas excellent health decreases the average office visits by 0.71 times.

Hurdle model by pscl::hurdle

This is a two-component model: A truncated count component, such as Poisson, geometric or negative binomial, is employed for positive counts, and a hurdle (binary) component models zero vs. larger counts. For the latter, either a binomial model or a censored count distribution can be employed. (Ref 1) This is appropriate if the underlying data generating processes are different for zero and positive outcomes, i.e., only structural zeros are present.

## Fit hurdle
modelHurdle <- hurdle(formula = formula,
                      dist    = "negbin",
                      data    = DebTrivedi)
summary(modelHurdle)

Call:
hurdle(formula = formula, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.172 -0.708 -0.274  0.320 18.009 

Count model coefficients (truncated negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19770    0.05897   20.31  < 2e-16 ***
hosp             0.21190    0.02140    9.90  < 2e-16 ***
healthpoor       0.31596    0.04806    6.57  4.9e-11 ***
healthexcellent -0.33186    0.06609   -5.02  5.1e-07 ***
numchron         0.12642    0.01245   10.15  < 2e-16 ***
gendermale      -0.06832    0.03242   -2.11    0.035 *  
school           0.02069    0.00453    4.56  5.0e-06 ***
privinsyes       0.10017    0.04262    2.35    0.019 *  
Log(theta)       0.33325    0.04275    7.79  6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      0.04315    0.13985    0.31  0.75769    
hosp             0.31245    0.09144    3.42  0.00063 ***
healthpoor      -0.00872    0.16102   -0.05  0.95683    
healthexcellent -0.28957    0.14268   -2.03  0.04241 *  
numchron         0.53521    0.04538   11.79  < 2e-16 ***
gendermale      -0.41566    0.08761   -4.74  2.1e-06 ***
school           0.05854    0.01199    4.88  1.0e-06 ***
privinsyes       0.74712    0.10088    7.41  1.3e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.396
Number of iterations in BFGS optimization: 16 
Log-likelihood: -1.21e+04 on 17 Df

## The hurdle part can be fit as follows.
hurdlePart <- glm(formula = I(ofp>0) ~ hosp + health + numchron + gender + school + privins,
                  data    = DebTrivedi,
                  family  = binomial(link = "logit"))
coef(summary(hurdlePart))
                 Estimate Std. Error  z value  Pr(>|z|)
(Intercept)      0.043147    0.13985  0.30852 7.577e-01
hosp             0.312449    0.09143  3.41743 6.321e-04
healthpoor      -0.008716    0.16102 -0.05413 9.568e-01
healthexcellent -0.289570    0.14268 -2.02949 4.241e-02
numchron         0.535213    0.04538 11.79472 4.156e-32
gendermale      -0.415658    0.08761 -4.74456 2.090e-06
school           0.058541    0.01199  4.88282 1.046e-06
privinsyes       0.747120    0.10088  7.40614 1.300e-13

## Exponentiated coefficients
expCoef <- exp(coef((modelHurdle)))
expCoef <- matrix(expCoef, ncol = 2)
rownames(expCoef) <- names(coef(hurdlePart))
colnames(expCoef) <- c("Count_model","Zero_hurdle_model")
expCoef
                Count_model Zero_hurdle_model
(Intercept)          3.3125            1.0441
hosp                 1.2360            1.3668
healthpoor           1.3716            0.9913
healthexcellent      0.7176            0.7486
numchron             1.1348            1.7078
gendermale           0.9340            0.6599
school               1.0209            1.0603
privinsyes           1.1054            2.1109

Interpretation: (Zero-hurdle model) The baseline odds of having a positive count vs zero is 1.04. This odds is increased by 1.37 times by a unit increase in hospital stay. Poor health does not have significant effect, whereas excellent health decreases it by 0.75 times. (Positive count model) Given the response is positive (among those who have positive counts), the average count is 3.3. This is increased by a unit increase in hospital stay by 1.24 among those who have positive counts. Poor health increases it by 1.37 times, whereas excellent health decreases it by 0.72, both among those who have positive counts.

The coefficients do not have to match in the two parts.

## hurdle part without health
modelHurdleSimpler <- hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
                             dist    = "negbin",
                             data    = DebTrivedi)
summary(modelHurdleSimpler)

Call:
hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + 
    school + privins, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.172 -0.709 -0.273  0.323 17.918 

Count model coefficients (truncated negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19770    0.05897   20.31  < 2e-16 ***
hosp             0.21190    0.02140    9.90  < 2e-16 ***
healthpoor       0.31596    0.04806    6.57  4.9e-11 ***
healthexcellent -0.33186    0.06609   -5.02  5.1e-07 ***
numchron         0.12642    0.01245   10.15  < 2e-16 ***
gendermale      -0.06832    0.03242   -2.11    0.035 *  
school           0.02069    0.00453    4.56  5.0e-06 ***
privinsyes       0.10017    0.04262    2.35    0.019 *  
Log(theta)       0.33325    0.04275    7.79  6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.0159     0.1378    0.12  0.90788    
hosp          0.3184     0.0911    3.50  0.00047 ***
numchron      0.5478     0.0436   12.57  < 2e-16 ***
gendermale   -0.4191     0.0875   -4.79  1.7e-06 ***
school        0.0571     0.0119    4.78  1.7e-06 ***
privinsyes    0.7457     0.1003    7.43  1.1e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.396
Number of iterations in BFGS optimization: 16 
Log-likelihood: -1.21e+04 on 15 Df

## LRT to compare these models
lmtest::lrtest(modelHurdle, modelHurdleSimpler)
Likelihood ratio test

Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins | 
    hosp + numchron + gender + school + privins
  #Df LogLik Df Chisq Pr(>Chisq)
1  17 -12088                    
2  15 -12090 -2  3.99       0.14

The hurdle part does not have to be the same as the positive count part. Here, the hosp variable is removed. A likelihood ratio test is not significant, indicating the simpler model is sufficient.

Zero-inflated regression

This is also a two part model. However, both parts predict zero counts. The count model predicts some zero counts, and on the top of that the zero-inflation binary model part adds zero counts, thus, the name zero “inflation”. This is more suitable for situations where there are two types of zeros, i.e., structural zeros and sampling zeros. Those who have the structural zeros are the people who can only have zeros. Those who have sampling zeros are the people who could have experienced non-zero outcome, but by chance experienced zeros.

## Fit zeroinfl
modelZeroInfl <- zeroinfl(formula = formula,
                          dist    = "negbin",
                          data    = DebTrivedi)
summary(modelZeroInfl)

Call:
zeroinfl(formula = formula, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.197 -0.710 -0.278  0.326 17.766 

Count model coefficients (negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19347    0.05674   21.04  < 2e-16 ***
hosp             0.20121    0.02039    9.87  < 2e-16 ***
healthpoor       0.28719    0.04594    6.25  4.1e-10 ***
healthexcellent -0.31354    0.06298   -4.98  6.4e-07 ***
numchron         0.12895    0.01194   10.80  < 2e-16 ***
gendermale      -0.08009    0.03103   -2.58   0.0099 ** 
school           0.02134    0.00437    4.89  1.0e-06 ***
privinsyes       0.12681    0.04169    3.04   0.0023 ** 
Log(theta)       0.39473    0.03515   11.23  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -0.0635     0.2767   -0.23   0.8184    
hosp             -0.8176     0.4388   -1.86   0.0624 .  
healthpoor        0.1018     0.4407    0.23   0.8174    
healthexcellent   0.1049     0.3096    0.34   0.7348    
numchron         -1.2463     0.1792   -6.96  3.5e-12 ***
gendermale        0.6494     0.2005    3.24   0.0012 ** 
school           -0.0848     0.0268   -3.17   0.0015 ** 
privinsyes       -1.1581     0.2244   -5.16  2.4e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 1.484 
Number of iterations in BFGS optimization: 31 
Log-likelihood: -1.21e+04 on 17 Df

## Exponentiated coefficients
expCoef <- exp(coef((modelZeroInfl)))
expCoef <- matrix(expCoef, ncol = 2)
rownames(expCoef) <- names(coef(hurdlePart))
colnames(expCoef) <- c("Count_model","Zero_inflation_model")
expCoef
                Count_model Zero_inflation_model
(Intercept)          3.2985               0.9384
hosp                 1.2229               0.4415
healthpoor           1.3327               1.1071
healthexcellent      0.7309               1.1106
numchron             1.1376               0.2876
gendermale           0.9230               1.9143
school               1.0216               0.9187
privinsyes           1.1352               0.3141

Interpretation: (Zero-inflation model) The baseline odds of being among those who never visit physician officie is 0.94. The odds is decreased by one unit increase in hospital stay by 0.44. Both poor health and excellent health increase the odds of being among those who never visit physician office, but they are non-significant. Each unit increase in the number of chronic condition decrease it by 0.29. (Count model) The baseline number of physician office visit is 3.29 among those who have a chance of visiting physician office. A unit increase in hospital stay increase it by 1.22 times among those who have a chance of visiting physician office. Poor health increases it by 1.33 times, whereas excellent health decreases it by 0.73 times among those who have a chance of visiting physician office.

The coefficients do not have to match in the two parts.

## zeroinfl part without health
modelZeroInflSimpler <- zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
                                 dist    = "negbin",
                                 data    = DebTrivedi)
summary(modelZeroInflSimpler)

Call:
zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + 
    school + privins, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.196 -0.711 -0.278  0.325 17.845 

Count model coefficients (negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19372    0.05666   21.07  < 2e-16 ***
hosp             0.20148    0.02036    9.90  < 2e-16 ***
healthpoor       0.28513    0.04509    6.32  2.6e-10 ***
healthexcellent -0.31934    0.06040   -5.29  1.2e-07 ***
numchron         0.12900    0.01193   10.81  < 2e-16 ***
gendermale      -0.08028    0.03102   -2.59   0.0097 ** 
school           0.02142    0.00436    4.92  8.8e-07 ***
privinsyes       0.12586    0.04159    3.03   0.0025 ** 
Log(theta)       0.39414    0.03503   11.25  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.0468     0.2686   -0.17   0.8615    
hosp         -0.8005     0.4208   -1.90   0.0571 .  
numchron     -1.2479     0.1783   -7.00  2.6e-12 ***
gendermale    0.6477     0.2001    3.24   0.0012 ** 
school       -0.0838     0.0263   -3.19   0.0014 ** 
privinsyes   -1.1756     0.2201   -5.34  9.3e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 1.483 
Number of iterations in BFGS optimization: 28 
Log-likelihood: -1.21e+04 on 15 Df

## LRT to compare these models
lmtest::lrtest(modelZeroInfl, modelZeroInflSimpler)
Likelihood ratio test

Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins | 
    hosp + numchron + gender + school + privins
  #Df LogLik Df Chisq Pr(>Chisq)
1  17 -12091                    
2  15 -12091 -2  0.15       0.93

Again the hosp variable can be removed.