Models for excess zeros using pscl package

References

(pscl) Regression Models for Count Data in R http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
Data: http://www.jstatsoft.org/v27/i08/

Load packages

## Load pscl
library(pscl)
## Load sandwich package for robust sandwich covariance matrix estimators
library(sandwich)
## Load lmtest package for coeftest
library(lmtest)
## Load MASS for glm.nb
library(MASS)

Load data: Demand for medical care by the elderly (see first reference)

## Load
load("./DebTrivedi.RData")
## Show summary
summary(DebTrivedi)

      ofp             ofnp             opp              opnp             emer             hosp      
 Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   :0.000  
 1st Qu.: 1.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:0.000  
 Median : 4.00   Median :  0.00   Median :  0.00   Median :  0.00   Median : 0.000   Median :0.000  
 Mean   : 5.77   Mean   :  1.62   Mean   :  0.75   Mean   :  0.54   Mean   : 0.264   Mean   :0.296  
 3rd Qu.: 8.00   3rd Qu.:  1.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.: 0.000   3rd Qu.:0.000  
 Max.   :89.00   Max.   :104.00   Max.   :141.00   Max.   :155.00   Max.   :12.000   Max.   :8.000  
       health        numchron    adldiff        region          age       black         gender     married   
 poor     : 554   Min.   :0.00   no :3507   midwest:1157   Min.   : 6.6   no :3890   female:2628   no :2000  
 average  :3509   1st Qu.:1.00   yes: 899   noreast: 837   1st Qu.: 6.9   yes: 516   male  :1778   yes:2406  
 excellent: 343   Median :1.00              other  :1614   Median : 7.3                                      
                  Mean   :1.54              west   : 798   Mean   : 7.4                                      
                  3rd Qu.:2.00                             3rd Qu.: 7.8                                      
                  Max.   :8.00                             Max.   :10.9                                      
     school         faminc      employed   privins    medicaid  
 Min.   : 0.0   Min.   :-1.01   no :3951   no : 985   no :4004  
 1st Qu.: 8.0   1st Qu.: 0.91   yes: 455   yes:3421   yes: 402  
 Median :11.0   Median : 1.70                                   
 Mean   :10.3   Mean   : 2.53                                   
 3rd Qu.:12.0   3rd Qu.: 3.17                                   
 Max.   :18.0   Max.   :54.84

Check outcome (Number of physician office visits)

## Outcome
numberOfPhysicianOfficeVisits <- DebTrivedi$ofp
## Plot
plot(sort(numberOfPhysicianOfficeVisits))

plot of chunk unnamed-chunk-4

Poisson regression

This appears too optimistic with the standard variance estimation. The estimation by the robust variance estimator appears more realistic.

## Create a formula
formula <- ofp ~ hosp + health + numchron + gender + school + privins
## Fit Poisson
modelPoisson <- glm(formula = formula,
                    family  = poisson(link = "log"),
                    data    = DebTrivedi)
summary(modelPoisson)


Call:
glm(formula = formula, family = poisson(link = "log"), data = DebTrivedi)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-8.405  -1.996  -0.674   0.705  16.362  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.02887    0.02378   43.26   <2e-16 ***
hosp             0.16480    0.00600   27.48   <2e-16 ***
healthpoor       0.24831    0.01784   13.92   <2e-16 ***
healthexcellent -0.36199    0.03030  -11.95   <2e-16 ***
numchron         0.14664    0.00458   32.02   <2e-16 ***
gendermale      -0.11232    0.01295   -8.68   <2e-16 ***
school           0.02614    0.00184   14.18   <2e-16 ***
privinsyes       0.20169    0.01686   11.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23168  on 4398  degrees of freedom
AIC: 35959

Number of Fisher Scoring iterations: 5

## Test using robust sandwich covariance matrix estimators
coeftest(modelPoisson, vcov = sandwich)


z test of coefficients:

                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.02887    0.06453   15.94  < 2e-16 ***
hosp             0.16480    0.02195    7.51  5.9e-14 ***
healthpoor       0.24831    0.05402    4.60  4.3e-06 ***
healthexcellent -0.36199    0.07745   -4.67  3.0e-06 ***
numchron         0.14664    0.01291   11.36  < 2e-16 ***
gendermale      -0.11232    0.03534   -3.18   0.0015 ** 
school           0.02614    0.00508    5.14  2.7e-07 ***
privinsyes       0.20169    0.04313    4.68  2.9e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Quasi-Poisson regression

The dispersion parameter is greater than 1, indicating overdispersion

## Fit quasi-Poisson
modelQuasiPoisson <- glm(formula = formula,
                         family  = quasipoisson(link = "log"),
                         data    = DebTrivedi)
summary(modelQuasiPoisson)


Call:
glm(formula = formula, family = quasipoisson(link = "log"), data = DebTrivedi)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-8.405  -1.996  -0.674   0.705  16.362  

Coefficients:
                Estimate Std. Error t value    Pr(>|t|)    
(Intercept)      1.02887    0.06159   16.70     < 2e-16 ***
hosp             0.16480    0.01553   10.61     < 2e-16 ***
healthpoor       0.24831    0.04621    5.37 0.000000081 ***
healthexcellent -0.36199    0.07848   -4.61 0.000004086 ***
numchron         0.14664    0.01186   12.36     < 2e-16 ***
gendermale      -0.11232    0.03352   -3.35     0.00081 ***
school           0.02614    0.00477    5.48 0.000000046 ***
privinsyes       0.20169    0.04366    4.62 0.000003959 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 6.706)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23168  on 4398  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Negative binomial regression

## Fit negative binomial
modelNB <- glm.nb(formula = formula,
                  data    = DebTrivedi)
summary(modelNB)


Call:
glm.nb(formula = formula, data = DebTrivedi, init.theta = 1.206603534, 
    link = log)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.047  -0.995  -0.295   0.296   5.818  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      0.92926    0.05459   17.02  < 2e-16 ***
hosp             0.21777    0.02018   10.79  < 2e-16 ***
healthpoor       0.30501    0.04851    6.29  3.2e-10 ***
healthexcellent -0.34181    0.06092   -5.61  2.0e-08 ***
numchron         0.17492    0.01209   14.47  < 2e-16 ***
gendermale      -0.12649    0.03122   -4.05  5.1e-05 ***
school           0.02682    0.00439    6.10  1.0e-09 ***
privinsyes       0.22440    0.03946    5.69  1.3e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.207) family taken to be 1)

    Null deviance: 5743.7  on 4405  degrees of freedom
Residual deviance: 5044.5  on 4398  degrees of freedom
AIC: 24359

Number of Fisher Scoring iterations: 1

              Theta:  1.2066 
          Std. Err.:  0.0336 

 2 x log-likelihood:  -24341.1070

Hurdle model by pscl::hurdle

They are two-component models: A truncated count component, such as Poisson, geometric or negative binomial, is employed for positive counts, and a hurdle component models zero vs. larger counts. For the latter, either a binomial model or a censored count distribution can be employed. (Ref 1)

## Fit hurdle
modelHurdle <- hurdle(formula = formula,
                      dist    = "negbin",
                      data    = DebTrivedi)
summary(modelHurdle)


Call:
hurdle(formula = formula, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.172 -0.708 -0.274  0.320 18.009 

Count model coefficients (truncated negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19770    0.05897   20.31  < 2e-16 ***
hosp             0.21190    0.02140    9.90  < 2e-16 ***
healthpoor       0.31596    0.04806    6.57  4.9e-11 ***
healthexcellent -0.33186    0.06609   -5.02  5.1e-07 ***
numchron         0.12642    0.01245   10.15  < 2e-16 ***
gendermale      -0.06832    0.03242   -2.11    0.035 *  
school           0.02069    0.00453    4.56  5.0e-06 ***
privinsyes       0.10017    0.04262    2.35    0.019 *  
Log(theta)       0.33325    0.04275    7.79  6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      0.04315    0.13985    0.31  0.75769    
hosp             0.31245    0.09144    3.42  0.00063 ***
healthpoor      -0.00872    0.16102   -0.05  0.95683    
healthexcellent -0.28957    0.14268   -2.03  0.04241 *  
numchron         0.53521    0.04538   11.79  < 2e-16 ***
gendermale      -0.41566    0.08761   -4.74  2.1e-06 ***
school           0.05854    0.01199    4.88  1.0e-06 ***
privinsyes       0.74712    0.10088    7.41  1.3e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.396
Number of iterations in BFGS optimization: 16 
Log-likelihood: -1.21e+04 on 17 Df


## hurdle part without health
modelHurdleSimpler <- hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
                             dist    = "negbin",
                             data    = DebTrivedi)
summary(modelHurdleSimpler)


Call:
hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + 
    school + privins, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.172 -0.709 -0.273  0.323 17.918 

Count model coefficients (truncated negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19770    0.05897   20.31  < 2e-16 ***
hosp             0.21190    0.02140    9.90  < 2e-16 ***
healthpoor       0.31596    0.04806    6.57  4.9e-11 ***
healthexcellent -0.33186    0.06609   -5.02  5.1e-07 ***
numchron         0.12642    0.01245   10.15  < 2e-16 ***
gendermale      -0.06832    0.03242   -2.11    0.035 *  
school           0.02069    0.00453    4.56  5.0e-06 ***
privinsyes       0.10017    0.04262    2.35    0.019 *  
Log(theta)       0.33325    0.04275    7.79  6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.0159     0.1378    0.12  0.90788    
hosp          0.3184     0.0911    3.50  0.00047 ***
numchron      0.5478     0.0436   12.57  < 2e-16 ***
gendermale   -0.4191     0.0875   -4.79  1.7e-06 ***
school        0.0571     0.0119    4.78  1.7e-06 ***
privinsyes    0.7457     0.1003    7.43  1.1e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 1.396
Number of iterations in BFGS optimization: 16 
Log-likelihood: -1.21e+04 on 15 Df


## LRT to compare these models
lmtest::lrtest(modelHurdle, modelHurdleSimpler)

Likelihood ratio test

Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins | 
    hosp + numchron + gender + school + privins
  #Df LogLik Df Chisq Pr(>Chisq)
1  17 -12088                    
2  15 -12090 -2  3.99       0.14

Zero-inflated regression

## Fit zeroinfl
modelZeroInfl <- zeroinfl(formula = formula,
                          dist    = "negbin",
                          data    = DebTrivedi)
summary(modelZeroInfl)


Call:
zeroinfl(formula = formula, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.197 -0.710 -0.278  0.326 17.766 

Count model coefficients (negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19347    0.05674   21.04  < 2e-16 ***
hosp             0.20121    0.02039    9.87  < 2e-16 ***
healthpoor       0.28719    0.04594    6.25  4.1e-10 ***
healthexcellent -0.31354    0.06298   -4.98  6.4e-07 ***
numchron         0.12895    0.01194   10.80  < 2e-16 ***
gendermale      -0.08009    0.03103   -2.58   0.0099 ** 
school           0.02134    0.00437    4.89  1.0e-06 ***
privinsyes       0.12681    0.04169    3.04   0.0023 ** 
Log(theta)       0.39473    0.03515   11.23  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -0.0635     0.2767   -0.23   0.8184    
hosp             -0.8176     0.4388   -1.86   0.0624 .  
healthpoor        0.1018     0.4407    0.23   0.8174    
healthexcellent   0.1049     0.3096    0.34   0.7348    
numchron         -1.2463     0.1792   -6.96  3.5e-12 ***
gendermale        0.6494     0.2005    3.24   0.0012 ** 
school           -0.0848     0.0268   -3.17   0.0015 ** 
privinsyes       -1.1581     0.2244   -5.16  2.4e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 1.484 
Number of iterations in BFGS optimization: 31 
Log-likelihood: -1.21e+04 on 17 Df


## zeroinfl part without health
modelZeroInflSimpler <- zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
                                 dist    = "negbin",
                                 data    = DebTrivedi)
summary(modelZeroInflSimpler)


Call:
zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + 
    school + privins, data = DebTrivedi, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.196 -0.711 -0.278  0.325 17.845 

Count model coefficients (negbin with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      1.19372    0.05666   21.07  < 2e-16 ***
hosp             0.20148    0.02036    9.90  < 2e-16 ***
healthpoor       0.28513    0.04509    6.32  2.6e-10 ***
healthexcellent -0.31934    0.06040   -5.29  1.2e-07 ***
numchron         0.12900    0.01193   10.81  < 2e-16 ***
gendermale      -0.08028    0.03102   -2.59   0.0097 ** 
school           0.02142    0.00436    4.92  8.8e-07 ***
privinsyes       0.12586    0.04159    3.03   0.0025 ** 
Log(theta)       0.39414    0.03503   11.25  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.0468     0.2686   -0.17   0.8615    
hosp         -0.8005     0.4208   -1.90   0.0571 .  
numchron     -1.2479     0.1783   -7.00  2.6e-12 ***
gendermale    0.6477     0.2001    3.24   0.0012 ** 
school       -0.0838     0.0263   -3.19   0.0014 ** 
privinsyes   -1.1756     0.2201   -5.34  9.3e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 1.483 
Number of iterations in BFGS optimization: 28 
Log-likelihood: -1.21e+04 on 15 Df


## LRT to compare these models
lmtest::lrtest(modelZeroInfl, modelZeroInflSimpler)

Likelihood ratio test

Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins | 
    hosp + numchron + gender + school + privins
  #Df LogLik Df Chisq Pr(>Chisq)
1  17 -12091                    
2  15 -12091 -2  0.15       0.93

Table of Contents: http://rpubs.com/kaz_yos/toc
Table of Contents (Japanese): http://rpubs.com/kaz_yos/toc-j
Top Page: http://rpubs.com/kaz_yos/
SlideShare: http://www.slideshare.net/kaz_yos