(pscl) Regression Models for Count Data in R http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
DebTrivedi.rda: Example data from Deb & Trivedi (1997) in R binary format: http://www.jstatsoft.org/v27/i08/
Hilbe. Negative Binomial Regression, 2nd ed.
Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. http://www.ncbi.nlm.nih.gov/pubmed/21854279
Sources of zeros: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3238139/figure/F1/
## Load pscl
library(pscl)
## Load sandwich package for robust sandwich covariance matrix estimators
library(sandwich)
## Load lmtest package for coeftest
library(lmtest)
## Load MASS for glm.nb
library(MASS)
## Load
load("./DebTrivedi.RData")
## Show summary
summary(DebTrivedi)
ofp ofnp opp opnp emer hosp
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.000
1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:0.000
Median : 4.00 Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000 Median :0.000
Mean : 5.77 Mean : 1.62 Mean : 0.75 Mean : 0.54 Mean : 0.264 Mean :0.296
3rd Qu.: 8.00 3rd Qu.: 1.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.:0.000
Max. :89.00 Max. :104.00 Max. :141.00 Max. :155.00 Max. :12.000 Max. :8.000
health numchron adldiff region age black gender married
poor : 554 Min. :0.00 no :3507 midwest:1157 Min. : 6.6 no :3890 female:2628 no :2000
average :3509 1st Qu.:1.00 yes: 899 noreast: 837 1st Qu.: 6.9 yes: 516 male :1778 yes:2406
excellent: 343 Median :1.00 other :1614 Median : 7.3
Mean :1.54 west : 798 Mean : 7.4
3rd Qu.:2.00 3rd Qu.: 7.8
Max. :8.00 Max. :10.9
school faminc employed privins medicaid
Min. : 0.0 Min. :-1.01 no :3951 no : 985 no :4004
1st Qu.: 8.0 1st Qu.: 0.91 yes: 455 yes:3421 yes: 402
Median :11.0 Median : 1.70
Mean :10.3 Mean : 2.53
3rd Qu.:12.0 3rd Qu.: 3.17
Max. :18.0 Max. :54.84
## Show head
head(DebTrivedi)
ofp ofnp opp opnp emer hosp health numchron adldiff region age black gender married school faminc employed
1 5 0 0 0 0 1 average 2 no other 6.9 yes male yes 6 2.8810 yes
2 1 0 2 0 2 0 average 2 no other 7.4 no female yes 10 2.7478 no
3 13 0 0 0 3 3 poor 4 yes other 6.6 yes female no 10 0.6532 no
4 16 0 5 0 1 1 poor 2 yes other 7.6 no male yes 3 0.6588 no
5 3 0 0 0 0 0 average 2 yes other 7.9 no female yes 6 0.6588 no
6 17 0 0 0 0 0 poor 5 yes other 6.6 no female no 7 0.3301 no
privins medicaid
1 yes no
2 yes no
3 no yes
4 yes no
5 yes no
6 no yes
Deb and Trivedi (1997) analyzed data on 4406 individuals, aged 66 and over, who are covered by Medicare, a public insurance program.
1 ofp Number of physician office visits (integer outcome)
2 ofnp
3 opp
4 opnp
5 emer
6 hosp Number of hospital stays (integer)
7 health Self-perceived health status (poor, average, excellent)
8 numchron Number of chronic condition (integer)
9 adldiff
10 region
11 age
12 black
13 gender Gender (female, male)
14 married
15 school Number of years of education (integer)
16 faminc
17 employed
18 privins Private insurance indicator (no, yes)
19 medicaid
## Outcome
numberOfPhysicianOfficeVisits <- DebTrivedi$ofp
## Plot
plot(sort(numberOfPhysicianOfficeVisits))
This appears too optimistic with the standard variance estimation. The estimation by the robust variance estimator appears more realistic.
## Create a formula
formula <- ofp ~ hosp + health + numchron + gender + school + privins
## Fit Poisson
modelPoisson <- glm(formula = formula,
family = poisson(link = "log"),
data = DebTrivedi)
summary(modelPoisson)
Call:
glm(formula = formula, family = poisson(link = "log"), data = DebTrivedi)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.405 -1.996 -0.674 0.705 16.362
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.02887 0.02378 43.26 <2e-16 ***
hosp 0.16480 0.00600 27.48 <2e-16 ***
healthpoor 0.24831 0.01784 13.92 <2e-16 ***
healthexcellent -0.36199 0.03030 -11.95 <2e-16 ***
numchron 0.14664 0.00458 32.02 <2e-16 ***
gendermale -0.11232 0.01295 -8.68 <2e-16 ***
school 0.02614 0.00184 14.18 <2e-16 ***
privinsyes 0.20169 0.01686 11.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 26943 on 4405 degrees of freedom
Residual deviance: 23168 on 4398 degrees of freedom
AIC: 35959
Number of Fisher Scoring iterations: 5
## Test using robust sandwich covariance matrix estimators
coeftest(modelPoisson, vcov = sandwich)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.02887 0.06453 15.94 < 2e-16 ***
hosp 0.16480 0.02195 7.51 5.9e-14 ***
healthpoor 0.24831 0.05402 4.60 4.3e-06 ***
healthexcellent -0.36199 0.07745 -4.67 3.0e-06 ***
numchron 0.14664 0.01291 11.36 < 2e-16 ***
gendermale -0.11232 0.03534 -3.18 0.0015 **
school 0.02614 0.00508 5.14 2.7e-07 ***
privinsyes 0.20169 0.04313 4.68 2.9e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Exponentiated coefficients
exp(coef(modelPoisson))
(Intercept) hosp healthpoor healthexcellent numchron gendermale school
2.7979 1.1792 1.2819 0.6963 1.1579 0.8938 1.0265
privinsyes
1.2235
Interpretation: The “baseline” average office visit count is 2.8. The other exponentiated coefficients are interpreted multiplicatively. One unit increase in hospital stays increases the average office visits by 1.18 times. Poor health compared to average health status increases the average office visits by 1.28 times, whereas excellent health decreases the average office visits by 0.70 times.
The dispersion parameter is greater than 1, indicating overdispersion
## Fit quasi-Poisson
modelQuasiPoisson <- glm(formula = formula,
family = quasipoisson(link = "log"),
data = DebTrivedi)
summary(modelQuasiPoisson)
Call:
glm(formula = formula, family = quasipoisson(link = "log"), data = DebTrivedi)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.405 -1.996 -0.674 0.705 16.362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.02887 0.06159 16.70 < 2e-16 ***
hosp 0.16480 0.01553 10.61 < 2e-16 ***
healthpoor 0.24831 0.04621 5.37 0.000000081 ***
healthexcellent -0.36199 0.07848 -4.61 0.000004086 ***
numchron 0.14664 0.01186 12.36 < 2e-16 ***
gendermale -0.11232 0.03352 -3.35 0.00081 ***
school 0.02614 0.00477 5.48 0.000000046 ***
privinsyes 0.20169 0.04366 4.62 0.000003959 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 6.706)
Null deviance: 26943 on 4405 degrees of freedom
Residual deviance: 23168 on 4398 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
The point estimates have the same interpretations as the regular Poisson regression.
## Fit negative binomial
modelNB <- glm.nb(formula = formula,
data = DebTrivedi)
summary(modelNB)
Call:
glm.nb(formula = formula, data = DebTrivedi, init.theta = 1.206603534,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.047 -0.995 -0.295 0.296 5.818
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.92926 0.05459 17.02 < 2e-16 ***
hosp 0.21777 0.02018 10.79 < 2e-16 ***
healthpoor 0.30501 0.04851 6.29 3.2e-10 ***
healthexcellent -0.34181 0.06092 -5.61 2.0e-08 ***
numchron 0.17492 0.01209 14.47 < 2e-16 ***
gendermale -0.12649 0.03122 -4.05 5.1e-05 ***
school 0.02682 0.00439 6.10 1.0e-09 ***
privinsyes 0.22440 0.03946 5.69 1.3e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.207) family taken to be 1)
Null deviance: 5743.7 on 4405 degrees of freedom
Residual deviance: 5044.5 on 4398 degrees of freedom
AIC: 24359
Number of Fisher Scoring iterations: 1
Theta: 1.2066
Std. Err.: 0.0336
2 x log-likelihood: -24341.1070
## Exponentiated coefficients
exp(coef(modelNB))
(Intercept) hosp healthpoor healthexcellent numchron gendermale school
2.5326 1.2433 1.3566 0.7105 1.1911 0.8812 1.0272
privinsyes
1.2516
Interpretation: The “baseline” average office visit count is 2.5. The other exponentiated coefficients are interpreted multiplicatively. One unit increase in hospital stays increases the average office visits by 1.24 times. Poor health compared to average health status increases the average office visits by 1.36 times, whereas excellent health decreases the average office visits by 0.71 times.
This is a two-component model: A truncated count component, such as Poisson, geometric or negative binomial, is employed for positive counts, and a hurdle (binary) component models zero vs. larger counts. For the latter, either a binomial model or a censored count distribution can be employed. (Ref 1) This is appropriate if the underlying data generating processes are different for zero and positive outcomes, i.e., only structural zeros are present.
## Fit hurdle
modelHurdle <- hurdle(formula = formula,
dist = "negbin",
data = DebTrivedi)
summary(modelHurdle)
Call:
hurdle(formula = formula, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.172 -0.708 -0.274 0.320 18.009
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19770 0.05897 20.31 < 2e-16 ***
hosp 0.21190 0.02140 9.90 < 2e-16 ***
healthpoor 0.31596 0.04806 6.57 4.9e-11 ***
healthexcellent -0.33186 0.06609 -5.02 5.1e-07 ***
numchron 0.12642 0.01245 10.15 < 2e-16 ***
gendermale -0.06832 0.03242 -2.11 0.035 *
school 0.02069 0.00453 4.56 5.0e-06 ***
privinsyes 0.10017 0.04262 2.35 0.019 *
Log(theta) 0.33325 0.04275 7.79 6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.04315 0.13985 0.31 0.75769
hosp 0.31245 0.09144 3.42 0.00063 ***
healthpoor -0.00872 0.16102 -0.05 0.95683
healthexcellent -0.28957 0.14268 -2.03 0.04241 *
numchron 0.53521 0.04538 11.79 < 2e-16 ***
gendermale -0.41566 0.08761 -4.74 2.1e-06 ***
school 0.05854 0.01199 4.88 1.0e-06 ***
privinsyes 0.74712 0.10088 7.41 1.3e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.396
Number of iterations in BFGS optimization: 16
Log-likelihood: -1.21e+04 on 17 Df
## The hurdle part can be fit as follows.
hurdlePart <- glm(formula = I(ofp>0) ~ hosp + health + numchron + gender + school + privins,
data = DebTrivedi,
family = binomial(link = "logit"))
coef(summary(hurdlePart))
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.043147 0.13985 0.30852 7.577e-01
hosp 0.312449 0.09143 3.41743 6.321e-04
healthpoor -0.008716 0.16102 -0.05413 9.568e-01
healthexcellent -0.289570 0.14268 -2.02949 4.241e-02
numchron 0.535213 0.04538 11.79472 4.156e-32
gendermale -0.415658 0.08761 -4.74456 2.090e-06
school 0.058541 0.01199 4.88282 1.046e-06
privinsyes 0.747120 0.10088 7.40614 1.300e-13
## Exponentiated coefficients
expCoef <- exp(coef((modelHurdle)))
expCoef <- matrix(expCoef, ncol = 2)
rownames(expCoef) <- names(coef(hurdlePart))
colnames(expCoef) <- c("Count_model","Zero_hurdle_model")
expCoef
Count_model Zero_hurdle_model
(Intercept) 3.3125 1.0441
hosp 1.2360 1.3668
healthpoor 1.3716 0.9913
healthexcellent 0.7176 0.7486
numchron 1.1348 1.7078
gendermale 0.9340 0.6599
school 1.0209 1.0603
privinsyes 1.1054 2.1109
Interpretation: (Zero-hurdle model) The baseline odds of having a positive count vs zero is 1.04. This odds is increased by 1.37 times by a unit increase in hospital stay. Poor health does not have significant effect, whereas excellent health decreases it by 0.75 times. (Positive count model) Given the response is positive (among those who have positive counts), the average count is 3.3. This is increased by a unit increase in hospital stay by 1.24 among those who have positive counts. Poor health increases it by 1.37 times, whereas excellent health decreases it by 0.72, both among those who have positive counts.
The coefficients do not have to match in the two parts.
## hurdle part without health
modelHurdleSimpler <- hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
dist = "negbin",
data = DebTrivedi)
summary(modelHurdleSimpler)
Call:
hurdle(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender +
school + privins, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.172 -0.709 -0.273 0.323 17.918
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19770 0.05897 20.31 < 2e-16 ***
hosp 0.21190 0.02140 9.90 < 2e-16 ***
healthpoor 0.31596 0.04806 6.57 4.9e-11 ***
healthexcellent -0.33186 0.06609 -5.02 5.1e-07 ***
numchron 0.12642 0.01245 10.15 < 2e-16 ***
gendermale -0.06832 0.03242 -2.11 0.035 *
school 0.02069 0.00453 4.56 5.0e-06 ***
privinsyes 0.10017 0.04262 2.35 0.019 *
Log(theta) 0.33325 0.04275 7.79 6.5e-15 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0159 0.1378 0.12 0.90788
hosp 0.3184 0.0911 3.50 0.00047 ***
numchron 0.5478 0.0436 12.57 < 2e-16 ***
gendermale -0.4191 0.0875 -4.79 1.7e-06 ***
school 0.0571 0.0119 4.78 1.7e-06 ***
privinsyes 0.7457 0.1003 7.43 1.1e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.396
Number of iterations in BFGS optimization: 16
Log-likelihood: -1.21e+04 on 15 Df
## LRT to compare these models
lmtest::lrtest(modelHurdle, modelHurdleSimpler)
Likelihood ratio test
Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins |
hosp + numchron + gender + school + privins
#Df LogLik Df Chisq Pr(>Chisq)
1 17 -12088
2 15 -12090 -2 3.99 0.14
The hurdle part does not have to be the same as the positive count part. Here, the hosp variable is removed. A likelihood ratio test is not significant, indicating the simpler model is sufficient.
This is also a two part model. However, both parts predict zero counts. The count model predicts some zero counts, and on the top of that the zero-inflation binary model part adds zero counts, thus, the name zero “inflation”. This is more suitable for situations where there are two types of zeros, i.e., structural zeros and sampling zeros. Those who have the structural zeros are the people who can only have zeros. Those who have sampling zeros are the people who could have experienced non-zero outcome, but by chance experienced zeros.
## Fit zeroinfl
modelZeroInfl <- zeroinfl(formula = formula,
dist = "negbin",
data = DebTrivedi)
summary(modelZeroInfl)
Call:
zeroinfl(formula = formula, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.197 -0.710 -0.278 0.326 17.766
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19347 0.05674 21.04 < 2e-16 ***
hosp 0.20121 0.02039 9.87 < 2e-16 ***
healthpoor 0.28719 0.04594 6.25 4.1e-10 ***
healthexcellent -0.31354 0.06298 -4.98 6.4e-07 ***
numchron 0.12895 0.01194 10.80 < 2e-16 ***
gendermale -0.08009 0.03103 -2.58 0.0099 **
school 0.02134 0.00437 4.89 1.0e-06 ***
privinsyes 0.12681 0.04169 3.04 0.0023 **
Log(theta) 0.39473 0.03515 11.23 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0635 0.2767 -0.23 0.8184
hosp -0.8176 0.4388 -1.86 0.0624 .
healthpoor 0.1018 0.4407 0.23 0.8174
healthexcellent 0.1049 0.3096 0.34 0.7348
numchron -1.2463 0.1792 -6.96 3.5e-12 ***
gendermale 0.6494 0.2005 3.24 0.0012 **
school -0.0848 0.0268 -3.17 0.0015 **
privinsyes -1.1581 0.2244 -5.16 2.4e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.484
Number of iterations in BFGS optimization: 31
Log-likelihood: -1.21e+04 on 17 Df
## Exponentiated coefficients
expCoef <- exp(coef((modelZeroInfl)))
expCoef <- matrix(expCoef, ncol = 2)
rownames(expCoef) <- names(coef(hurdlePart))
colnames(expCoef) <- c("Count_model","Zero_inflation_model")
expCoef
Count_model Zero_inflation_model
(Intercept) 3.2985 0.9384
hosp 1.2229 0.4415
healthpoor 1.3327 1.1071
healthexcellent 0.7309 1.1106
numchron 1.1376 0.2876
gendermale 0.9230 1.9143
school 1.0216 0.9187
privinsyes 1.1352 0.3141
Interpretation: (Zero-inflation model) The baseline odds of being among those who never visit physician officie is 0.94. The odds is decreased by one unit increase in hospital stay by 0.44. Both poor health and excellent health increase the odds of being among those who never visit physician office, but they are non-significant. Each unit increase in the number of chronic condition decrease it by 0.29. (Count model) The baseline number of physician office visit is 3.29 among those who have a chance of visiting physician office. A unit increase in hospital stay increase it by 1.22 times among those who have a chance of visiting physician office. Poor health increases it by 1.33 times, whereas excellent health decreases it by 0.73 times among those who have a chance of visiting physician office.
The coefficients do not have to match in the two parts.
## zeroinfl part without health
modelZeroInflSimpler <- zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender + school + privins,
dist = "negbin",
data = DebTrivedi)
summary(modelZeroInflSimpler)
Call:
zeroinfl(formula = ofp ~ hosp + health + numchron + gender + school + privins | hosp + numchron + gender +
school + privins, data = DebTrivedi, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.196 -0.711 -0.278 0.325 17.845
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19372 0.05666 21.07 < 2e-16 ***
hosp 0.20148 0.02036 9.90 < 2e-16 ***
healthpoor 0.28513 0.04509 6.32 2.6e-10 ***
healthexcellent -0.31934 0.06040 -5.29 1.2e-07 ***
numchron 0.12900 0.01193 10.81 < 2e-16 ***
gendermale -0.08028 0.03102 -2.59 0.0097 **
school 0.02142 0.00436 4.92 8.8e-07 ***
privinsyes 0.12586 0.04159 3.03 0.0025 **
Log(theta) 0.39414 0.03503 11.25 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0468 0.2686 -0.17 0.8615
hosp -0.8005 0.4208 -1.90 0.0571 .
numchron -1.2479 0.1783 -7.00 2.6e-12 ***
gendermale 0.6477 0.2001 3.24 0.0012 **
school -0.0838 0.0263 -3.19 0.0014 **
privinsyes -1.1756 0.2201 -5.34 9.3e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.483
Number of iterations in BFGS optimization: 28
Log-likelihood: -1.21e+04 on 15 Df
## LRT to compare these models
lmtest::lrtest(modelZeroInfl, modelZeroInflSimpler)
Likelihood ratio test
Model 1: ofp ~ hosp + health + numchron + gender + school + privins
Model 2: ofp ~ hosp + health + numchron + gender + school + privins |
hosp + numchron + gender + school + privins
#Df LogLik Df Chisq Pr(>Chisq)
1 17 -12091
2 15 -12091 -2 0.15 0.93
Again the hosp variable can be removed.