Question (from Extending the Linear Model with R [ELMR], p.73)

The dvisits data come from the Australian Health Survey of 1977–78 and consist of 5190 single adults, in which the young and the old have been oversampled.

(a) Build a Poisson regression model with doctorco as the response and sex, age, agesq, income, levyplus, freepoor, freerepa, illness, actdays, hscore, chcond1 and chcond2 as possible predictor variables. Considering the deviance of this model, does this model fit the data?

data(dvisits, package="faraway")
head(dvisits)
pois_reg <- glm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, family=poisson, data = dvisits)

pois_reg
## 
## Call:  glm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, family = poisson, data = dvisits)
## 
## Coefficients:
## (Intercept)          sex          age        agesq       income     levyplus  
##    -2.22385      0.15688      1.05630     -0.84870     -0.20532      0.12319  
##    freepoor     freerepa      illness      actdays       hscore      chcond1  
##    -0.44006      0.07980      0.18695      0.12685      0.03008      0.11409  
##     chcond2  
##     0.14116  
## 
## Degrees of Freedom: 5189 Total (i.e. Null);  5177 Residual
## Null Deviance:       5635 
## Residual Deviance: 4380  AIC: 6737
summary(pois_reg)
## 
## Call:
## glm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, family = poisson, data = dvisits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9170  -0.6862  -0.5743  -0.4839   5.7005  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.223848   0.189816 -11.716   <2e-16 ***
## sex          0.156882   0.056137   2.795   0.0052 ** 
## age          1.056299   1.000780   1.055   0.2912    
## agesq       -0.848704   1.077784  -0.787   0.4310    
## income      -0.205321   0.088379  -2.323   0.0202 *  
## levyplus     0.123185   0.071640   1.720   0.0855 .  
## freepoor    -0.440061   0.179811  -2.447   0.0144 *  
## freerepa     0.079798   0.092060   0.867   0.3860    
## illness      0.186948   0.018281  10.227   <2e-16 ***
## actdays      0.126846   0.005034  25.198   <2e-16 ***
## hscore       0.030081   0.010099   2.979   0.0029 ** 
## chcond1      0.114085   0.066640   1.712   0.0869 .  
## chcond2      0.141158   0.083145   1.698   0.0896 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4379.5  on 5177  degrees of freedom
## AIC: 6737.1
## 
## Number of Fisher Scoring iterations: 6

Since the median deviance residual (-0.5743) is reasonably close to zero, the model does not appear to be biased in one direction (i.e. the outcome is neither systematically over- nor underestimated).

These results are somewhat reassuring. First, the null deviance (5634.8) is high, so it makes sense to use more than a single parameter to fit the model. Second, the residual deviance is considerably lower, which indicates that the log-likelihood of our model is much closer to that of the saturated model.

However, for a well-fitting model the residual deviance should be close to its degrees of freedom (5177), which is not quite the case here: the residual deviance (4379.5) is noticeably lower. With counts this small, the \(\chi^2\) approximation for the deviance is also unreliable, so the deviance alone is not a definitive test of fit. A common concern with count data is overdispersion, where the actual variation is greater than the model allows; a Poisson model assumes \(Var(Y) = \mu\), and a residual deviance well above its degrees of freedom would point towards the variance exceeding the mean.
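
As a rough check (a sketch, not part of the original output, and the \(\chi^2\) approximation is dubious for counts this small), the deviance-based goodness-of-fit p-value and a Pearson-based dispersion estimate can be computed directly:

# Goodness-of-fit test: a large p-value gives no evidence of lack of fit
pchisq(deviance(pois_reg), df.residual(pois_reg), lower.tail = FALSE)
# Pearson dispersion estimate: values well above 1 would suggest overdispersion
sum(residuals(pois_reg, type = "pearson")^2) / df.residual(pois_reg)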

(b) Plot the residuals and the fitted values - why are there lines of observations on the plot?

par(mfrow=c(2,2))
plot(pois_reg)

The lines of observations appear because doctorco takes only a handful of integer values (0, 1, 2, ...): all observations sharing the same observed count fall on the same smooth curve as the fitted value varies, producing one curved band per count. In the default diagnostic plot, the red solid line is a smoothed estimate of the mean residual at each fitted value, and the grey dotted line marks zero.
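
To make this explicit, the deviance residuals can be coloured by the observed count, so that each band corresponds to a single value of doctorco (a quick sketch, not part of the original output):

plot(predict(pois_reg), residuals(pois_reg),
     col = dvisits$doctorco + 1,
     xlab = "Linear predictor", ylab = "Deviance residuals",
     main = "Residuals grouped by observed count")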

(c) Use backward elimination with a critical p-value of 5% to reduce the model as much as possible. Report your model.

reduced_pois_reg <- step(pois_reg, direction="backward")
## Start:  AIC=6737.08
## doctorco ~ sex + age + agesq + income + levyplus + freepoor + 
##     freerepa + illness + actdays + hscore + chcond1 + chcond2
## 
##            Df Deviance    AIC
## - agesq     1   4380.1 6735.7
## - freerepa  1   4380.3 6735.8
## - age       1   4380.6 6736.2
## <none>          4379.5 6737.1
## - chcond2   1   4382.4 6738.0
## - chcond1   1   4382.5 6738.0
## - levyplus  1   4382.5 6738.1
## - income    1   4385.0 6740.5
## - freepoor  1   4386.2 6741.8
## - sex       1   4387.4 6743.0
## - hscore    1   4388.1 6743.7
## - illness   1   4481.8 6837.4
## - actdays   1   4917.1 7272.7
## 
## Step:  AIC=6735.7
## doctorco ~ sex + age + income + levyplus + freepoor + freerepa + 
##     illness + actdays + hscore + chcond1 + chcond2
## 
##            Df Deviance    AIC
## - freerepa  1   4381.0 6734.5
## <none>          4380.1 6735.7
## - age       1   4383.0 6736.5
## - chcond1   1   4383.2 6736.8
## - levyplus  1   4383.3 6736.9
## - chcond2   1   4383.5 6737.0
## - income    1   4385.0 6738.6
## - freepoor  1   4386.8 6740.4
## - sex       1   4388.0 6741.5
## - hscore    1   4389.1 6742.7
## - illness   1   4481.9 6835.4
## - actdays   1   4917.1 7270.7
## 
## Step:  AIC=6734.53
## doctorco ~ sex + age + income + levyplus + freepoor + illness + 
##     actdays + hscore + chcond1 + chcond2
## 
##            Df Deviance    AIC
## <none>          4381.0 6734.5
## - levyplus  1   4383.4 6735.0
## - chcond1   1   4384.3 6735.9
## - chcond2   1   4384.7 6736.3
## - income    1   4386.7 6738.2
## - age       1   4387.1 6738.7
## - freepoor  1   4389.1 6740.6
## - sex       1   4389.5 6741.0
## - hscore    1   4390.2 6741.8
## - illness   1   4482.7 6834.2
## - actdays   1   4917.6 7269.2
summary(reduced_pois_reg)
## 
## Call:
## glm(formula = doctorco ~ sex + age + income + levyplus + freepoor + 
##     illness + actdays + hscore + chcond1 + chcond2, family = poisson, 
##     data = dvisits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0004  -0.6851  -0.5761  -0.4858   5.7284  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.089063   0.100811 -20.723  < 2e-16 ***
## sex          0.162000   0.055824   2.902  0.00371 ** 
## age          0.355131   0.143196   2.480  0.01314 *  
## income      -0.199806   0.084328  -2.369  0.01782 *  
## levyplus     0.083689   0.053544   1.563  0.11805    
## freepoor    -0.469596   0.176360  -2.663  0.00775 ** 
## illness      0.186101   0.018260  10.191  < 2e-16 ***
## actdays      0.126611   0.005029  25.177  < 2e-16 ***
## hscore       0.031116   0.010065   3.092  0.00199 ** 
## chcond1      0.121100   0.066389   1.824  0.06814 .  
## chcond2      0.158894   0.081762   1.943  0.05197 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4381.0  on 5179  degrees of freedom
## AIC: 6734.5
## 
## Number of Fisher Scoring iterations: 6

The AIC improved from 6737.1 for the full model to 6734.5 for the reduced model.
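
Note that step() eliminates terms by AIC rather than by the 5% p-value criterion in the question. A strictly p-value-based backward elimination would instead inspect likelihood-ratio tests and drop the least significant term until every remaining p-value is below 0.05; a sketch of one such step:

# Likelihood-ratio p-values for dropping each term from the current model;
# drop the term with the largest p-value above 0.05, refit, and repeat.
drop1(reduced_pois_reg, test = "Chisq")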

(d) What sort of person would be predicted to visit the doctor the most under your selected model?

Ans: In the reduced model, the two coefficients largest in magnitude are freepoor (-0.47) and age (0.36), and both are statistically significant. Since the freepoor coefficient is negative, being covered free by the government because of low income predicts fewer visits, not more. So the person predicted to visit the doctor the most under my selected (reduced) model is an older female with low income who has private cover (levyplus = 1) rather than the free low-income government cover (freepoor = 0), with many recent illnesses, many days of reduced activity, a high hscore, and a chronic condition limiting activity.
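
One way to check this reading of the coefficients is to build such a hypothetical person from the observed ranges of the predictors and ask the reduced model for a prediction (a sketch; variable codings as documented in ?faraway::dvisits):

heavy_user <- with(dvisits, data.frame(
  sex = 1,                       # female
  age = max(age),                # oldest observed age
  income = min(income),          # lowest observed income
  levyplus = 1, freepoor = 0,    # private cover, not the free low-income scheme
  illness = max(illness), actdays = max(actdays), hscore = max(hscore),
  chcond1 = 0, chcond2 = 1))     # chronic condition limiting activity
predict(reduced_pois_reg, heavy_user, type = "response")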

(e) For the last person in the dataset, compute the predicted probability distribution for their visits to the doctor, i.e., give the probability they visit 0,1,2, etc. times.

# tail(dvisits,1)
(lambda <- predict(pois_reg, tail(dvisits,1), type="response"))
##      5190 
## 0.1533837

The expected number of visits to the doctor for patient 5190 is ~0.15. We will set \(\lambda = 0.1533837\).

for (i in 0:4) {
  print(noquote(paste0("Probability of ", i, " doctor visits: ",
                       round(dpois(i, lambda = lambda) * 100, 2), "%")))
}
## [1] Probability of 0 doctor visits: 85.78%
## [1] Probability of 1 doctor visits: 13.16%
## [1] Probability of 2 doctor visits: 1.01%
## [1] Probability of 3 doctor visits: 0.05%
## [1] Probability of 4 doctor visits: 0%

The probability of 0 or 1 doctor visits is ~99%, as seen above.
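
Equivalently, dpois() is vectorised, so the whole distribution (as probabilities rather than percentages) can be computed in a single call:

round(dpois(0:4, lambda = lambda), 4)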

(f) Fit a comparable (Gaussian) linear model and graphically compare the fits. Describe how they differ.

lin_reg <- lm(doctorco ~ sex + age + agesq + income + levyplus + freepoor + freerepa + illness + actdays + hscore + chcond1 + chcond2, data=dvisits)
summary(lin_reg)
## 
## Call:
## lm(formula = doctorco ~ sex + age + agesq + income + levyplus + 
##     freepoor + freerepa + illness + actdays + hscore + chcond1 + 
##     chcond2, data = dvisits)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1352 -0.2588 -0.1435 -0.0433  7.0327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.027632   0.072220   0.383  0.70202    
## sex          0.033811   0.021604   1.565  0.11764    
## age          0.203201   0.410016   0.496  0.62020    
## agesq       -0.062103   0.458716  -0.135  0.89231    
## income      -0.057323   0.033089  -1.732  0.08326 .  
## levyplus     0.035179   0.024882   1.414  0.15748    
## freepoor    -0.103314   0.052471  -1.969  0.04901 *  
## freerepa     0.033241   0.038157   0.871  0.38371    
## illness      0.059946   0.008357   7.173 8.39e-13 ***
## actdays      0.103192   0.003657  28.216  < 2e-16 ***
## hscore       0.016976   0.005190   3.271  0.00108 ** 
## chcond1      0.004384   0.023740   0.185  0.85349    
## chcond2      0.041617   0.035863   1.160  0.24592    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7139 on 5177 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:    0.2 
## F-statistic: 109.1 on 12 and 5177 DF,  p-value: < 2.2e-16

The fits differ: age, which was significant in the reduced Poisson model, is not significant here, and sex and income, which were significant in the full Poisson model, also lose significance in the Gaussian model. Comparing the prediction for the same last person's number of doctor visits also shows a difference.

(lambda <- predict(lin_reg, tail(dvisits,1), type="response"))
##      5190 
## 0.1606531

Just for comparison: the linear model predicts ~0.16 visits for patient 5190, slightly higher than the ~0.15 predicted by the Poisson model.

par(mfrow = c(1,2))
plot(predict(pois_reg), residuals(pois_reg), xlab = "Fitted", ylab = "Residuals", main = "Poisson regression")
plot(predict(lin_reg), residuals(lin_reg), xlab = "Fitted", ylab = "Residuals", main = "Linear regression")

The two residual plots differ: in the Poisson regression the bands of points (one per observed count) curve gradually downward as the fitted value increases, whereas in the linear regression they appear as straight, parallel lines with slope -1, because for a fixed observed count the raw residual is simply the observed value minus the fitted value.


# dvisits$predLM <- predict(lin_reg, newdata = dvisits, type = "response")
# dvisits$predPOIS <- predict(pois_reg, newdata = dvisits, type = "response")
# 
# plot(predict(pois_reg),residuals(pois_reg))
# lines(predLM ~ age, dvisits, col = 2)
# lines(predPOIS ~ age, dvisits, col = 4)
# legend("topleft", legend = c("lm", "poisson"), col = c(2,4), lty = 1)
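
The Poisson fitted values are strictly positive by construction (they are exponentials of the linear predictor), whereas the Gaussian model's fitted values are not constrained to be nonnegative.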