(pscl) Regression Models for Count Data in R http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
## Load pscl
library(pscl)
## Load sandwich package for robust sandwich covariance matrix estimators
library(sandwich)
## Load lmtest package for coeftest
library(lmtest)
## Load MASS for glm.nb
library(MASS)
## Load
load("./DebTrivedi.RData")
## Show summary
summary(DebTrivedi)
ofp ofnp opp opnp emer hosp
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.000
1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:0.000
Median : 4.00 Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000 Median :0.000
Mean : 5.77 Mean : 1.62 Mean : 0.75 Mean : 0.54 Mean : 0.264 Mean :0.296
3rd Qu.: 8.00 3rd Qu.: 1.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.:0.000
Max. :89.00 Max. :104.00 Max. :141.00 Max. :155.00 Max. :12.000 Max. :8.000
health numchron adldiff region age black gender married
poor : 554 Min. :0.00 no :3507 midwest:1157 Min. : 6.6 no :3890 female:2628 no :2000
average :3509 1st Qu.:1.00 yes: 899 noreast: 837 1st Qu.: 6.9 yes: 516 male :1778 yes:2406
excellent: 343 Median :1.00 other :1614 Median : 7.3
Mean :1.54 west : 798 Mean : 7.4
3rd Qu.:2.00 3rd Qu.: 7.8
Max. :8.00 Max. :10.9
school faminc employed privins medicaid
Min. : 0.0 Min. :-1.01 no :3951 no : 985 no :4004
1st Qu.: 8.0 1st Qu.: 0.91 yes: 455 yes:3421 yes: 402
Median :11.0 Median : 1.70
Mean :10.3 Mean : 2.53
3rd Qu.:12.0 3rd Qu.: 3.17
Max. :18.0 Max. :54.84
## Outcome
numberOfPhysicianOfficeVisits <- DebTrivedi$ofp
## Plot
plot(sort(numberOfPhysicianOfficeVisits))
This appears too optimistic with the standard variance estimation. The estimation by the robust variance estimator appears more realistic.
## Create a formula
formula <- ofp ~ hosp + health + numchron + gender + school + privins
## Fit Poisson
modelPoisson <- glm(formula = formula,
family = poisson(link = "log"),
data = DebTrivedi)
summary(modelPoisson)
Call:
glm(formula = formula, family = poisson(link = "log"), data = DebTrivedi)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.405 -1.996 -0.674 0.705 16.362
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.02887 0.02378 43.26 <2e-16 ***
hosp 0.16480 0.00600 27.48 <2e-16 ***
healthpoor 0.24831 0.01784 13.92 <2e-16 ***
healthexcellent -0.36199 0.03030 -11.95 <2e-16 ***
numchron 0.14664 0.00458 32.02 <2e-16 ***
gendermale -0.11232 0.01295 -8.68 <2e-16 ***
school 0.02614 0.00184 14.18 <2e-16 ***
privinsyes 0.20169 0.01686 11.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 26943 on 4405 degrees of freedom
Residual deviance: 23168 on 4398 degrees of freedom
AIC: 35959
Number of Fisher Scoring iterations: 5
## Test using robust sandwich covariance matrix estimators
coeftest(modelPoisson, vcov = sandwich)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.02887 0.06453 15.94 < 2e-16 ***
hosp 0.16480 0.02195 7.51 5.9e-14 ***
healthpoor 0.24831 0.05402 4.60 4.3e-06 ***
healthexcellent -0.36199 0.07745 -4.67 3.0e-06 ***
numchron 0.14664 0.01291 11.36 < 2e-16 ***
gendermale -0.11232 0.03534 -3.18 0.0015 **
school 0.02614 0.00508 5.14 2.7e-07 ***
privinsyes 0.20169 0.04313 4.68 2.9e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The dispersion parameter is greater than 1, indicating overdispersion
## Fit quasi-Poisson
modelQuasiPoisson <- glm(formula = formula,
family = quasipoisson(link = "log"),
data = DebTrivedi)
summary(modelQuasiPoisson)
Call:
glm(formula = formula, family = quasipoisson(link = "log"), data = DebTrivedi)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.405 -1.996 -0.674 0.705 16.362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.02887 0.06159 16.70 < 2e-16 ***
hosp 0.16480 0.01553 10.61 < 2e-16 ***
healthpoor 0.24831 0.04621 5.37 0.000000081 ***
healthexcellent -0.36199 0.07848 -4.61 0.000004086 ***
numchron 0.14664 0.01186 12.36 < 2e-16 ***
gendermale -0.11232 0.03352 -3.35 0.00081 ***
school 0.02614 0.00477 5.48 0.000000046 ***
privinsyes 0.20169 0.04366 4.62 0.000003959 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 6.706)
Null deviance: 26943 on 4405 degrees of freedom
Residual deviance: 23168 on 4398 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
## Fit negative binomial
modelNB <- glm.nb(formula = formula,
data = DebTrivedi)
summary(modelNB)
Call:
glm.nb(formula = formula, data = DebTrivedi, init.theta = 1.206603534,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.047 -0.995 -0.295 0.296 5.818
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.92926 0.05459 17.02 < 2e-16 ***
hosp 0.21777 0.02018 10.79 < 2e-16 ***
healthpoor 0.30501 0.04851 6.29 3.2e-10 ***
healthexcellent -0.34181 0.06092 -5.61 2.0e-08 ***
numchron 0.17492 0.01209 14.47 < 2e-16 ***
gendermale -0.12649 0.03122 -4.05 5.1e-05 ***
school 0.02682 0.00439 6.10 1.0e-09 ***
privinsyes 0.22440 0.03946 5.69 1.3e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.207) family taken to be 1)
Null deviance: 5743.7 on 4405 degrees of freedom
Residual deviance: 5044.5 on 4398 degrees of freedom
AIC: 24359
Number of Fisher Scoring iterations: 1
Theta: 1.2066
Std. Err.: 0.0336
2 x log-likelihood: -24341.1070
They are two-component models: A truncated count component, such as Poisson, geometric or negative binomial, is employed for positive counts, and a hurdle component models zero vs. larger counts. For the latter, either a binomial model or a censored count distribution can be employed. (Ref 1)
## Fit hurdle
modelHurdle <- hurdle(formula = formula,
dist = "negbin",
data = DebTrivedi)
summary(modelHurdle)
Call:
hurdle(formula = formula, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.172 -0.708 -0.274 0.320 18.009
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19770 0.05897 20.31 < 2e-16 ***
hosp 0.21190 0.02140 9.90 < 2e-16 ***
healthpoor 0.31596 0.04806 6.57 4.9e-11 ***
healthexcellent -0.33186 0.06609 -5.02 5.1e-07 ***
numchron 0.12642 0.01245 10.15 < 2e-16 ***
gendermale -0.06832 0.03242 -2.11 0.035 *
school 0.02069 0.00453 4.56 5.0e-06 ***
privinsyes 0.10017 0.04262 2.35 0.019 *
Log(theta) 0.33325 0.04275 7.79 6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.04315 0.13985 0.31 0.75769
hosp 0.31245 0.09144 3.42 0.00063 ***
healthpoor -0.00872 0.16102 -0.05 0.95683
healthexcellent -0.28957 0.14268 -2.03 0.04241 *
numchron 0.53521 0.04538 11.79 < 2e-16 ***
gendermale -0.41566 0.08761 -4.74 2.1e-06 ***
school 0.05854 0.01199 4.88 1.0e-06 ***
privinsyes 0.74712 0.10088 7.41 1.3e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.396
Number of iterations in BFGS optimization: 16
Log-likelihood: -1.21e+04 on 17 Df
## hurdle part without health
modelHurdleSimpler <- hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
dist = "negbin",
data = DebTrivedi)
summary(modelHurdleSimpler)
Call:
hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender +
school + privins, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.172 -0.709 -0.273 0.323 17.918
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19770 0.05897 20.31 < 2e-16 ***
hosp 0.21190 0.02140 9.90 < 2e-16 ***
healthpoor 0.31596 0.04806 6.57 4.9e-11 ***
healthexcellent -0.33186 0.06609 -5.02 5.1e-07 ***
numchron 0.12642 0.01245 10.15 < 2e-16 ***
gendermale -0.06832 0.03242 -2.11 0.035 *
school 0.02069 0.00453 4.56 5.0e-06 ***
privinsyes 0.10017 0.04262 2.35 0.019 *
Log(theta) 0.33325 0.04275 7.79 6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0159 0.1378 0.12 0.90788
hosp 0.3184 0.0911 3.50 0.00047 ***
numchron 0.5478 0.0436 12.57 < 2e-16 ***
gendermale -0.4191 0.0875 -4.79 1.7e-06 ***
school 0.0571 0.0119 4.78 1.7e-06 ***
privinsyes 0.7457 0.1003 7.43 1.1e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.396
Number of iterations in BFGS optimization: 16
Log-likelihood: -1.21e+04 on 15 Df
## LRT to compare these models
lmtest::lrtest(modelHurdle, modelHurdleSimpler)
Likelihood ratio test
Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins |
hosp + numchron + gender + school + privins
#Df LogLik Df Chisq Pr(>Chisq)
1 17 -12088
2 15 -12090 -2 3.99 0.14
## Fit zeroinfl
modelZeroInfl <- zeroinfl(formula = formula,
dist = "negbin",
data = DebTrivedi)
summary(modelZeroInfl)
Call:
zeroinfl(formula = formula, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.197 -0.710 -0.278 0.326 17.766
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19347 0.05674 21.04 < 2e-16 ***
hosp 0.20121 0.02039 9.87 < 2e-16 ***
healthpoor 0.28719 0.04594 6.25 4.1e-10 ***
healthexcellent -0.31354 0.06298 -4.98 6.4e-07 ***
numchron 0.12895 0.01194 10.80 < 2e-16 ***
gendermale -0.08009 0.03103 -2.58 0.0099 **
school 0.02134 0.00437 4.89 1.0e-06 ***
privinsyes 0.12681 0.04169 3.04 0.0023 **
Log(theta) 0.39473 0.03515 11.23 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0635 0.2767 -0.23 0.8184
hosp -0.8176 0.4388 -1.86 0.0624 .
healthpoor 0.1018 0.4407 0.23 0.8174
healthexcellent 0.1049 0.3096 0.34 0.7348
numchron -1.2463 0.1792 -6.96 3.5e-12 ***
gendermale 0.6494 0.2005 3.24 0.0012 **
school -0.0848 0.0268 -3.17 0.0015 **
privinsyes -1.1581 0.2244 -5.16 2.4e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.484
Number of iterations in BFGS optimization: 31
Log-likelihood: -1.21e+04 on 17 Df
## zeroinfl part without health
modelZeroInflSimpler <- zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
dist = "negbin",
data = DebTrivedi)
summary(modelZeroInflSimpler)
Call:
zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender +
school + privins, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.196 -0.711 -0.278 0.325 17.845
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19372 0.05666 21.07 < 2e-16 ***
hosp 0.20148 0.02036 9.90 < 2e-16 ***
healthpoor 0.28513 0.04509 6.32 2.6e-10 ***
healthexcellent -0.31934 0.06040 -5.29 1.2e-07 ***
numchron 0.12900 0.01193 10.81 < 2e-16 ***
gendermale -0.08028 0.03102 -2.59 0.0097 **
school 0.02142 0.00436 4.92 8.8e-07 ***
privinsyes 0.12586 0.04159 3.03 0.0025 **
Log(theta) 0.39414 0.03503 11.25 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0468 0.2686 -0.17 0.8615
hosp -0.8005 0.4208 -1.90 0.0571 .
numchron -1.2479 0.1783 -7.00 2.6e-12 ***
gendermale 0.6477 0.2001 3.24 0.0012 **
school -0.0838 0.0263 -3.19 0.0014 **
privinsyes -1.1756 0.2201 -5.34 9.3e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.483
Number of iterations in BFGS optimization: 28
Log-likelihood: -1.21e+04 on 15 Df
## LRT to compare these models
lmtest::lrtest(modelZeroInfl, modelZeroInflSimpler)
Likelihood ratio test
Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins |
hosp + numchron + gender + school + privins
#Df LogLik Df Chisq Pr(>Chisq)
1 17 -12091
2 15 -12091 -2 0.15 0.93