For class this week I used my model building skills to finish Quiz 2. The quiz was similar to the Poisson regression assignment we completed earlier, so I was able to apply the same process. I enjoyed learning and practicing this process because I do not have much background in mathematical statistics, and Poisson regression is relatively new to me aside from the R code I used in STAT 320.
### Quiz 2
For my report I will be looking at the esdcomp dataset, which contains information on 44 emergency room doctors and the following variables: visits (numeric), complaints (numeric), residency (binary), gender (binary), revenue (numeric), and hours (numeric). My overall goal is to build the best model for predicting the number of complaints a doctor receives based on some subset of the predictor variables.
Data exploration: An important step in choosing a model is understanding the data you are working with. A good way to do this is to plot the predictors with the response variable and measure the strength of the relationship.
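Before plotting, here is a minimal setup sketch. I am assuming the esdcomp data comes with the faraway package (which also supplies the halfnorm function used later in this report):
library(faraway) # assumed home of the esdcomp data and halfnorm()
data(esdcomp)
str(esdcomp) # confirm variable types before plotting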
hist(esdcomp$complaints, xlab="number of complaints")
This is a histogram of the target response variable, complaints. The distribution is heavily right skewed, which is consistent with a Poisson-distributed count.
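A Poisson variable has its mean equal to its variance, so comparing the sample mean and variance of the raw counts gives a rough check (only rough, since the marginal variance can exceed the mean once the mean depends on predictors):
mean(esdcomp$complaints) # sample mean of the counts
var(esdcomp$complaints) # sample variance, for comparison against the mean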
boxplot(esdcomp$complaints~esdcomp$residency, xlab="residency", ylab="number of complaints")
This boxplot shows the number of complaints by whether or not the doctor is in residency. Based on this plot, doctors who are not in residency tend to receive more complaints.
boxplot(esdcomp$complaints~esdcomp$gender, xlab="gender", ylab="number of complaints")
This boxplot shows the number of complaints by the gender (sex) of the doctor. Both groups have roughly the same median, but more male doctors have high numbers of complaints than female doctors. This appears to be a weaker predictor of the number of complaints than residency.
plot(log(esdcomp$complaints)~esdcomp$visits, xlab="visits", ylab="log number of complaints")
Because visits is a numeric variable, I plotted it against the log number of complaints to investigate the relationship between the variables. There appears to be a moderate positive linear relationship.
plot(log(esdcomp$complaints)~esdcomp$hours, xlab="hours", ylab="log number of complaints")
Because hours is a numeric variable, I plotted it against the log number of complaints as well. There appears to be some positive linear relationship, similar to what we saw when plotting log complaints against visits.
plot(log(esdcomp$complaints)~esdcomp$revenue, xlab="revenue", ylab="log number of complaints")
The log number of complaints appears to be scattered mostly at random across the revenue values. Revenue does not appear to be a strong predictor of the number of complaints.
Based on the graphs, residency, visits, and hours look to be the strongest of the five explanatory variables for predicting the number of complaints.
Next I want to fit and compare models that each use a single predictor for complaints. We can look at the significance of each predictor individually and also see which model has the lowest AIC, in order to keep the model simple while retaining predictive power.
modVisits <- glm(complaints~visits, family = "poisson", data = esdcomp)
coef(modVisits)
## (Intercept) visits
## -0.8493259139 0.0008101308
(visitsMeans <- round(exp(coef(modVisits)), digits = 4))
## (Intercept) visits
## 0.4277 1.0008
Every extra visit a doctor has is associated with a multiplicative increase of 1.0008 in the doctor's mean number of complaints.
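Because the per-visit effect is so small, it is easier to interpret at a larger scale. As a sanity check, using a hypothetical increment of 100 visits:
exp(100 * coef(modVisits)["visits"]) # about 1.084, i.e. roughly an 8.4% increase in mean complaints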
modResidency <- glm(complaints~residency, family = "poisson", data = esdcomp)
coef(modResidency)
## (Intercept) residencyY
## 1.3758231 -0.4203116
(residencyMeans <- round(exp(coef(modResidency)), digits = 4))
## (Intercept) residencyY
## 3.9583 0.6568
Doctors that are not in residency have 3.9583 complaints, on average. Doctors that are in residency have 3.9583 × 0.6568 ≈ 2.60 complaints, on average (the exponentiated coefficients multiply on the response scale; they do not add).
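To double check that multiplication in R (a quick sketch using the model fit above):
exp(coef(modResidency)[1]) # mean complaints for doctors not in residency
exp(sum(coef(modResidency))) # mean complaints for doctors in residency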
modRevenue <- glm(complaints~revenue, family = "poisson", data = esdcomp)
coef(modRevenue)
## (Intercept) revenue
## 0.154157869 0.004011676
(revenueMeans <- round(exp(coef(modRevenue)), digits = 4))
## (Intercept) revenue
## 1.1667 1.0040
Every extra dollar per hour of revenue a doctor earns is associated with a multiplicative increase of 1.0040 in the mean number of complaints.
modHours <- glm(complaints~hours, family = "poisson", data = esdcomp)
coef(modHours)
## (Intercept) hours
## -0.939810277 0.001445967
(hoursMeans <- round(exp(coef(modHours)), digits = 4))
## (Intercept) hours
## 0.3907 1.0014
Every extra hour a doctor works is associated with a multiplicative increase of 1.0014 in the mean number of complaints.
modGender <- glm(complaints~gender, family = "poisson", data = esdcomp)
coef(modGender)
## (Intercept) genderM
## 0.9490806 0.3387737
(genderMeans <- round(exp(coef(modGender)), digits = 4))
## (Intercept) genderM
## 2.5833 1.4032
Doctors that are female have 2.5833 complaints, on average. Doctors that are male have 2.5833 × 1.4032 ≈ 3.62 complaints, on average.
We can use AIC to compare which model is best while also trying to keep the model simple. In this case all of the models have the same number of predictors, so simplicity does not distinguish them.
AIC(modVisits)
## [1] 181.0258
AIC(modResidency)
## [1] 210.0828
AIC(modRevenue)
## [1] 213.7092
AIC(modHours)
## [1] 191.4638
AIC(modGender)
## [1] 213.2371
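For easier side-by-side comparison, the same AIC values can also be collected in one call (a small sketch reusing the models fit above):
sapply(list(visits = modVisits, residency = modResidency, revenue = modRevenue, hours = modHours, gender = modGender), AIC)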
Since lower AIC is better, the visits model is the best by a wide margin, followed by hours and residency. It makes sense that visits would be the strongest predictor, since more patient visits give more opportunities for complaints in general. At this point it seems likely that visits will be used in at least some capacity to predict the number of complaints.
It is a good idea to test for lack of fit on the model with the best AIC, to determine whether this model is appropriate for the data.
(vdf <- nrow(esdcomp) - length(coef(modVisits)))
## [1] 42
(teststat <- summary(modVisits)$deviance)
## [1] 54.24606
pchisq(teststat, df=vdf, lower.tail=F)
## [1] 0.09750312
The p-value of 0.098 is above 0.05, so there is no significant evidence of lack of fit and we can move forward using this model.
mod2a <- glm(complaints~visits+residency, data = esdcomp, family = "poisson")
mod2b <- glm(complaints~visits+revenue, data = esdcomp, family = "poisson")
mod2c <- glm(complaints~visits+hours, data = esdcomp, family = "poisson")
mod2d <- glm(complaints~visits+gender, data = esdcomp, family = "poisson")
The anova (drop-in-deviance) tests determine whether the larger model fits significantly better than the simpler model, enough to justify its added complexity.
anova(mod2a, modVisits, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + residency
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 50.882
## 2 42 54.246 -1 -3.3639 0.06664 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod2b, modVisits, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + revenue
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 52.109
## 2 42 54.246 -1 -2.137 0.1438
anova(mod2c, modVisits, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + hours
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 53.566
## 2 42 54.246 -1 -0.67983 0.4096
anova(mod2d, modVisits, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + gender
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 53.898
## 2 42 54.246 -1 -0.34833 0.5551
According to the drop-in-deviance test results, none of the additional variables significantly improves the fit at the 0.05 level (residency comes closest, at p = 0.067), so we should not add another variable. Our model keeps visits as the only predictor. Since visits is the only predictor, there are no interactions to consider, and we will not look at quadratic terms for now.
Despite not finding any lack of fit, we can still check the half-normal residual plot to make sure it aligns with our current conclusions.
halfnorm(residuals(modVisits))
There are a couple of outliers that could be cause for concern. Based on the lack-of-fit results, however, we will continue with the model.
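To see which observations those are, we can pull out the rows with the largest absolute residuals (a quick sketch; taking the top two is just based on what the plot labels):
head(order(-abs(residuals(modVisits))), 2) # row indices of the two most extreme residuals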
(vci <- exp(confint(modVisits, parm=2, level = .95)))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.000540 1.001083
Every visit that a doctor has is associated with a multiplicative change in the doctor's mean number of complaints. We are 95% confident that this multiplicative change is between 1.000540 and 1.001083 per visit.
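As with the point estimate, the per-visit interval is easier to read when rescaled. For a hypothetical increment of 100 visits, the same profile-likelihood bounds become:
exp(100 * confint(modVisits, parm = 2, level = .95)) # CI for the multiplicative effect of 100 extra visits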
disp <- sum(residuals(modVisits, type = "pearson")^2) / modVisits$df.residual
disp
## [1] 1.272378
(sqrt(disp))
## [1] 1.127997
Based on the dispersion estimate, we need to multiply our standard errors by 1.128. Since the dispersion is close to 1, the overdispersion is mild, and it looks fine to continue with this model.
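One convenient way to apply that adjustment is to refit with the quasi-Poisson family, which scales the standard errors by the square root of the Pearson dispersion automatically (a sketch; modVisitsQ is just a new name for the refit):
modVisitsQ <- glm(complaints~visits, family = "quasipoisson", data = esdcomp)
summary(modVisitsQ)$coefficients # same estimates, standard errors inflated by about 1.128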
Based on this full model-building analysis, my final recommendation is that we use visits to predict the number of complaints a doctor will receive.