The dvisits data comes from the Australian Health Survey of 1977–78 and consist of 5190 single adults where young and old have been oversampled.
data(dvisits, package="faraway")
head(dvisits)
pois_reg <- glm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, family=poisson, data = dvisits)
pois_reg
##
## Call: glm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, family = poisson, data = dvisits)
##
## Coefficients:
## (Intercept) sex age agesq income levyplus
## -2.22385 0.15688 1.05630 -0.84870 -0.20532 0.12319
## freepoor freerepa illness actdays hscore chcond1
## -0.44006 0.07980 0.18695 0.12685 0.03008 0.11409
## chcond2
## 0.14116
##
## Degrees of Freedom: 5189 Total (i.e. Null); 5177 Residual
## Null Deviance: 5635
## Residual Deviance: 4380 AIC: 6737
summary(pois_reg)
##
## Call:
## glm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, family = poisson, data = dvisits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9170 -0.6862 -0.5743 -0.4839 5.7005
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.223848 0.189816 -11.716 <2e-16 ***
## sex 0.156882 0.056137 2.795 0.0052 **
## age 1.056299 1.000780 1.055 0.2912
## agesq -0.848704 1.077784 -0.787 0.4310
## income -0.205321 0.088379 -2.323 0.0202 *
## levyplus 0.123185 0.071640 1.720 0.0855 .
## freepoor -0.440061 0.179811 -2.447 0.0144 *
## freerepa 0.079798 0.092060 0.867 0.3860
## illness 0.186948 0.018281 10.227 <2e-16 ***
## actdays 0.126846 0.005034 25.198 <2e-16 ***
## hscore 0.030081 0.010099 2.979 0.0029 **
## chcond1 0.114085 0.066640 1.712 0.0869 .
## chcond2 0.141158 0.083145 1.698 0.0896 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4379.5 on 5177 degrees of freedom
## AIC: 6737.1
##
## Number of Fisher Scoring iterations: 6
Since the median deviance residual is close to zero(-0.5743), this means that our model is not biased in one direction (i.e. the outcome is neither over- nor underestimated).
These results are somehow reassuring. First, the null deviance (5634.8) is high, which means it makes sense to use more than a single parameter for fitting the model. Second, the residual deviance is relatively low, which indicates that the log likelihood of our model is close to the log likelihood of the saturated model.
However, for a well-fitting model, the residual deviance should be close to the degrees of freedom (5177), which is not the case here. For example, this could be a result of overdispersion where the variation is greater than predicted by the model. This can happen for a Poisson model when the actual variance exceeds the assumed mean of \(\mu =Var(Y)\)
par(mfrow=c(2,2))
plot(pois_reg)
The red solid line is the conditional mean at each value of fitted value where grey dotted line is the approximate mean.
reduced_pois_reg <- step(pois_reg, direction="backward")
## Start: AIC=6737.08
## doctorco ~ sex + age + agesq + income + levyplus + freepoor +
## freerepa + illness + actdays + hscore + chcond1 + chcond2
##
## Df Deviance AIC
## - agesq 1 4380.1 6735.7
## - freerepa 1 4380.3 6735.8
## - age 1 4380.6 6736.2
## <none> 4379.5 6737.1
## - chcond2 1 4382.4 6738.0
## - chcond1 1 4382.5 6738.0
## - levyplus 1 4382.5 6738.1
## - income 1 4385.0 6740.5
## - freepoor 1 4386.2 6741.8
## - sex 1 4387.4 6743.0
## - hscore 1 4388.1 6743.7
## - illness 1 4481.8 6837.4
## - actdays 1 4917.1 7272.7
##
## Step: AIC=6735.7
## doctorco ~ sex + age + income + levyplus + freepoor + freerepa +
## illness + actdays + hscore + chcond1 + chcond2
##
## Df Deviance AIC
## - freerepa 1 4381.0 6734.5
## <none> 4380.1 6735.7
## - age 1 4383.0 6736.5
## - chcond1 1 4383.2 6736.8
## - levyplus 1 4383.3 6736.9
## - chcond2 1 4383.5 6737.0
## - income 1 4385.0 6738.6
## - freepoor 1 4386.8 6740.4
## - sex 1 4388.0 6741.5
## - hscore 1 4389.1 6742.7
## - illness 1 4481.9 6835.4
## - actdays 1 4917.1 7270.7
##
## Step: AIC=6734.53
## doctorco ~ sex + age + income + levyplus + freepoor + illness +
## actdays + hscore + chcond1 + chcond2
##
## Df Deviance AIC
## <none> 4381.0 6734.5
## - levyplus 1 4383.4 6735.0
## - chcond1 1 4384.3 6735.9
## - chcond2 1 4384.7 6736.3
## - income 1 4386.7 6738.2
## - age 1 4387.1 6738.7
## - freepoor 1 4389.1 6740.6
## - sex 1 4389.5 6741.0
## - hscore 1 4390.2 6741.8
## - illness 1 4482.7 6834.2
## - actdays 1 4917.6 7269.2
summary(reduced_pois_reg)
##
## Call:
## glm(formula = doctorco ~ sex + age + income + levyplus + freepoor +
## illness + actdays + hscore + chcond1 + chcond2, family = poisson,
## data = dvisits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0004 -0.6851 -0.5761 -0.4858 5.7284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.089063 0.100811 -20.723 < 2e-16 ***
## sex 0.162000 0.055824 2.902 0.00371 **
## age 0.355131 0.143196 2.480 0.01314 *
## income -0.199806 0.084328 -2.369 0.01782 *
## levyplus 0.083689 0.053544 1.563 0.11805
## freepoor -0.469596 0.176360 -2.663 0.00775 **
## illness 0.186101 0.018260 10.191 < 2e-16 ***
## actdays 0.126611 0.005029 25.177 < 2e-16 ***
## hscore 0.031116 0.010065 3.092 0.00199 **
## chcond1 0.121100 0.066389 1.824 0.06814 .
## chcond2 0.158894 0.081762 1.943 0.05197 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4381.0 on 5179 degrees of freedom
## AIC: 6734.5
##
## Number of Fisher Scoring iterations: 6
AIC: 6737.1 -> 6734.5 (with a reduced model). The AIC improved~!
Ans: Based on the coefficients, the 2 variables with the top 2 coefficients are freepoor and age. They are also statistical significant. The older folks and the folks covered by government because low income, recent immigrant, unemployed are the 2 types of person to visit the doctor the most under my selected model, the reduced model.
# tail(dvisits,1)
(lambda <- predict(pois_reg, tail(dvisits,1), type="response"))
## 5190
## 0.1533837
The mean amount of visits to the doctor for patient 5190 would be ~0.16 visits. We will set \(λ=0.1533837\)
for (i in 0:4) {
print(noquote(paste0("Probability of ", i, " doctor visits: ", round(dpois(i, lambda = lambda)*100,2), '%')))
}
## [1] Probability of 0 doctor visits: 85.78%
## [1] Probability of 1 doctor visits: 13.16%
## [1] Probability of 2 doctor visits: 1.01%
## [1] Probability of 3 doctor visits: 0.05%
## [1] Probability of 4 doctor visits: 0%
The probability of 0 or 1 doctor visit is ~99% as seen above.
lin_reg <- lm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, data=dvisits)
summary(lin_reg)
##
## Call:
## lm(formula = doctorco ~ sex + age + agesq + income + levyplus +
## freepoor + freerepa + illness + actdays + hscore + chcond1 +
## chcond2, data = dvisits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1352 -0.2588 -0.1435 -0.0433 7.0327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.027632 0.072220 0.383 0.70202
## sex 0.033811 0.021604 1.565 0.11764
## age 0.203201 0.410016 0.496 0.62020
## agesq -0.062103 0.458716 -0.135 0.89231
## income -0.057323 0.033089 -1.732 0.08326 .
## levyplus 0.035179 0.024882 1.414 0.15748
## freepoor -0.103314 0.052471 -1.969 0.04901 *
## freerepa 0.033241 0.038157 0.871 0.38371
## illness 0.059946 0.008357 7.173 8.39e-13 ***
## actdays 0.103192 0.003657 28.216 < 2e-16 ***
## hscore 0.016976 0.005190 3.271 0.00108 **
## chcond1 0.004384 0.023740 0.185 0.85349
## chcond2 0.041617 0.035863 1.160 0.24592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7139 on 5177 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.2
## F-statistic: 109.1 on 12 and 5177 DF, p-value: < 2.2e-16
The fit is different as age is no longer a predictor of statistical significance. When comparing the prediction of same last person’s # of doctor’s visit, you can see the difference.
(lambda <- predict(lin_reg, tail(dvisits,1), type="response"))
## 5190
## 0.1606531
Just for comparison. The mean amount of visits to the doctor for patient 5190 would be ~0.16 visits. The underlying rate is higher in linear model.
par(mfrow = c(1,2))
plot(predict(pois_reg),residuals(pois_reg),xlab="Fitted",ylab=" Residuals", main = "poisson reg")
plot(predict(lin_reg),residuals(lin_reg),xlab="Fitted",ylab=" Residuals", main = "lienar reg")
The graphs of the 2 residual plots differed in a sense that the poisson regression has some sort of gradual decrease in residuals against the Fitted while the linear regression has some straight line decreases.
# dvisits$predLM <- predict(lin_reg, newdata = dvisits, type = "response")
# dvisits$predPOIS <- predict(pois_reg, newdata = dvisits, type = "response")
#
# plot(predict(pois_reg),residuals(pois_reg))
# lines(predLM ~ age, dvisits, col = 2)
# lines(predPOIS ~ age, dvisits, col = 4)
# legend("topleft", legend = c("lm", "poisson"), col = c(2,4), lty = 1)