For class this week I used my model-building skills to finish Quiz 2. The quiz was similar to the Poisson regression assignment we completed earlier, so I was able to apply the same process. I enjoyed learning and practicing this process because I do not have much background in mathematical statistics, and Poisson regression is relatively new to me aside from the R code I used in STAT 320.

### Quiz 2

For my report I will be looking at the esdcomp dataset, which contains information on 44 emergency room doctors and the following variables: visits (numeric), complaints (numeric), residency (binary), gender (binary), revenue (numeric), hours (numeric). My overall goal is to build the best model for predicting the number of complaints against a doctor based on some of these predictor variables.
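
If the data is not already loaded, here is a minimal setup sketch, assuming esdcomp comes from the faraway package (which also provides the halfnorm() function used later):

library(faraway)
str(esdcomp)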

  1. Explain why Poisson regression is appropriate (rather than the usual linear model). In this case, Poisson regression is appropriate for several reasons. The first reason is that our response variable, the number of complaints against emergency room doctors, is a count, and we often use Poisson regression rather than a linear model for count responses. We will also assume the observations are independent of one another; in general this should hold since a doctor can only see one patient at a time in the emergency room, though there could be an issue if the same patient filed complaints against the same doctor on multiple occasions. Poisson regression also assumes the mean of the Poisson random variable equals its variance; once I have a model with strong prediction I can check that this assumption is met (a quick numerical check on the raw counts follows below). The final assumption is linearity: the log of the mean rate must be a linear function of the predictors. This will also be checked once a model is developed.
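
As a rough first look at the equal mean and variance assumption (rough, because the model's mean changes with the predictors), we can compare the sample mean and variance of the raw counts:

mean(esdcomp$complaints)
var(esdcomp$complaints)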

Data exploration: An important step in choosing a model is understanding the data you are working with. A good way to do this is to plot each predictor against the response variable and measure the strength of the relationships.
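
One quick way to measure the strength of the relationships for the numeric predictors is a correlation matrix on the raw scale (the plots below look at the same relationships with the response on the log scale):

cor(esdcomp[, c("complaints", "visits", "revenue", "hours")])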

hist(esdcomp$complaints, xlab="number of complaints")

This is a histogram of the target response variable, complaints. We can see that the distribution is heavily right skewed, which is consistent with the kind of count data a Poisson model is designed to handle.
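
The frequency table of the complaint counts makes the skew concrete:

table(esdcomp$complaints)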

boxplot(esdcomp$complaints~esdcomp$residency, xlab="residency", ylab="number of complaints")

This boxplot shows the number of complaints based on whether or not the doctor is in residency. Based on this plot, doctors who are not in residency tend to have more complaints.

boxplot(esdcomp$complaints~esdcomp$gender, xlab="gender", ylab="number of complaints")

This boxplot shows the number of complaints based on the gender (sex) of the doctor. Both genders have roughly the same median, but there are more male doctors with high numbers of complaints than female doctors. Gender appears to be a weaker predictor of the number of complaints than residency.

plot(log(esdcomp$complaints)~esdcomp$visits, xlab="visits", ylab="log number of complaints")

Because visits is a numeric variable, I plotted it against the log number of complaints to investigate the relationship between the variables. There appears to be a moderately strong positive linear relationship.

plot(log(esdcomp$complaints)~esdcomp$hours, xlab="hours", ylab="log number of complaints")

Because hours is a numeric variable, I plotted it against the log number of complaints to investigate the relationship between the variables. There appears to be some positive linear relationship, similar to what we saw when plotting log complaints against visits.

plot(log(esdcomp$complaints)~esdcomp$revenue, xlab="revenue", ylab="log number of complaints")

The log number of complaints appears to be scattered fairly randomly across the revenue values, so revenue does not appear to be a strong predictor of the number of complaints.

Exploratory findings

Based on the graphs, I would say that residency, visits, and hours look to be the strongest predictors of the number of complaints out of the five explanatory variables.

Next I want to fit and compare models that each use a single predictor of complaints. We can look at the significance of each predictor individually and also see which model has the lowest AIC, in order to keep the model simple while still predicting well.

modVisits <- glm(complaints~visits, family = "poisson", data = esdcomp)
coef(modVisits)
##   (Intercept)        visits 
## -0.8493259139  0.0008101308
(visitsMeans <- round(exp(coef(modVisits)), digits = 4))
## (Intercept)      visits 
##      0.4277      1.0008

Every extra visit a doctor has is associated with a multiplicative increase of 1.0008 in the mean number of complaints the doctor receives.
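
To see what this multiplier implies on the count scale, we can predict the mean number of complaints at a couple of visit totals (the values 2000 and 3000 below are arbitrary illustrative inputs, not quantities from the quiz):

predict(modVisits, newdata = data.frame(visits = c(2000, 3000)), type = "response")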

modResidency <- glm(complaints~residency, family = "poisson", data = esdcomp)
coef(modResidency)
## (Intercept)  residencyY 
##   1.3758231  -0.4203116
(residencyMeans <- round(exp(coef(modResidency)), digits = 4))
## (Intercept)  residencyY 
##      3.9583      0.6568

Doctors that are not in residency have 3.9583 complaints, on average. Because the residency coefficient is a multiplier, doctors that are in residency have 3.9583 * 0.6568 ≈ 2.60 complaints, on average.
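
As a check on this interpretation, a Poisson model with a single factor simply reproduces the observed group means, so the raw averages by residency status should agree with the model-based means:

tapply(esdcomp$complaints, esdcomp$residency, mean)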

modRevenue <- glm(complaints~revenue, family = "poisson", data = esdcomp)
coef(modRevenue)
## (Intercept)     revenue 
## 0.154157869 0.004011676
(revenueMeans <- round(exp(coef(modRevenue)), digits = 4))
## (Intercept)     revenue 
##      1.1667      1.0040

Each one-unit increase in a doctor's revenue is associated with a multiplicative increase of 1.0040 in the mean number of complaints.

modHours <- glm(complaints~hours, family = "poisson", data = esdcomp)
coef(modHours)
##  (Intercept)        hours 
## -0.939810277  0.001445967
(hoursMeans <- round(exp(coef(modHours)), digits = 4))
## (Intercept)       hours 
##      0.3907      1.0014

Every extra hour a doctor works is associated with a multiplicative increase of 1.0014 in the mean number of complaints.

modGender <- glm(complaints~gender, family = "poisson", data = esdcomp)
coef(modGender)
## (Intercept)     genderM 
##   0.9490806   0.3387737
(genderMeans <- round(exp(coef(modGender)), digits = 4))
## (Intercept)     genderM 
##      2.5833      1.4032

Doctors that are female have 2.5833 complaints, on average. Because the gender coefficient is a multiplier, doctors that are male have 2.5833 * 1.4032 ≈ 3.62 complaints, on average.

We can use AIC to compare the models while also favoring simplicity. In this case all of the models have the same number of predictors, so simplicity does not differentiate them.

AIC(modVisits)
## [1] 181.0258
AIC(modResidency)
## [1] 210.0828
AIC(modRevenue)
## [1] 213.7092
AIC(modHours)
## [1] 191.4638
AIC(modGender)
## [1] 213.2371

Since lower AIC is better we would say that the visits model is the best by far followed by hours and residency. It makes sense that visits would be the strongest predictor since one could logically reason that more patient visits would lead to more complaints in general. At this point it seems likely that visits will be used at least in some capacity to predict the number of complaints.
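
For convenience, the same comparison can be collected into a single table:

AIC(modVisits, modResidency, modRevenue, modHours, modGender)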

It is a good idea to test for lack of fit on the model with the best AIC to determine whether it is appropriate to be using this model on this data.

(vdf <- nrow(esdcomp) - length(coef(modVisits)))
## [1] 42
(teststat <- summary(modVisits)$deviance)
## [1] 54.24606
pchisq(teststat, df=vdf, lower.tail=F)
## [1] 0.09750312

The p-value of about 0.098 gives no significant evidence of lack of fit, so we can move forward using this model.

  1. Drop-in-deviance test: We can use the drop-in-deviance test along with forward selection to determine whether a second variable should be added to our model.
mod2a <- glm(complaints~visits+residency, data = esdcomp, family = "poisson")
mod2b <- glm(complaints~visits+revenue, data = esdcomp, family = "poisson")
mod2c <- glm(complaints~visits+hours, data = esdcomp, family = "poisson")
mod2d <- glm(complaints~visits+gender, data = esdcomp, family = "poisson")

The anova tests determine whether the added complexity of the two-predictor model is justified by a significantly better fit than the simpler model.

anova(mod2a, modVisits, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + residency
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        41     50.882                       
## 2        42     54.246 -1  -3.3639  0.06664 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod2b, modVisits, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + revenue
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     52.109                     
## 2        42     54.246 -1   -2.137   0.1438
anova(mod2c, modVisits, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + hours
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     53.566                     
## 2        42     54.246 -1 -0.67983   0.4096
anova(mod2d, modVisits, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + gender
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     53.898                     
## 2        42     54.246 -1 -0.34833   0.5551

According to the drop-in-deviance test results, none of the additional variables produce a significant improvement at the 0.05 level (residency comes closest with a p-value of about 0.067), so we should not add another variable. Our model keeps visits as the only predictor. With a single predictor there are no interaction terms to consider, and we will set quadratic terms aside for now; a sketch of that check follows below.
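
If we later wanted to revisit the quadratic question, the check would follow the same drop-in-deviance pattern; a sketch (the model name is just illustrative):

modVisitsSq <- glm(complaints~visits+I(visits^2), family = "poisson", data = esdcomp)
anova(modVisitsSq, modVisits, test="Chisq")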

Despite not finding any lack of fit, we can still check the half-normal plot of the residuals (halfnorm() from the faraway package) to make sure it aligns with our current beliefs.

halfnorm(residuals(modVisits))

There are a couple of outliers that could be cause for concern. Based on the fit results, however, we will continue with the model.
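
To see how large those outlying residuals actually are, we can list the biggest absolute deviance residuals (showing three is an arbitrary choice):

sort(abs(residuals(modVisits)), decreasing = TRUE)[1:3]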

  1. Calculate and interpret a confidence interval for a regression coefficient. We can construct a 95% confidence interval for the slope of our visits model:
(vci <- exp(confint(modVisits, parm=2, level = .95)))
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 1.000540 1.001083

Every visit that a doctor has is associated with a multiplicative change in their mean number of complaints. We are 95% confident that this multiplicative change is between 1.000540 and 1.001083.
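
For comparison, a Wald-type interval based on the estimated standard error can be computed with confint.default(); for a model like this it will typically be close to the profile-likelihood interval above:

exp(confint.default(modVisits, parm = 2, level = .95))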

  1. Calculate and interpret the dispersion parameter:
disp <- sum(residuals(modVisits, type = "pearson")^2) / modVisits$df.residual
disp
## [1] 1.272378
(sqrt(disp))
## [1] 1.127997

Based on the dispersion estimate, we would multiply our standard errors by 1.128 to account for overdispersion. Since the estimate is close to 1, overdispersion is not a major concern and it looks good to continue with this model.
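
If we wanted the inflated standard errors directly in the model output rather than scaling them by hand, one option (a sketch, not part of the original analysis) is to refit with the quasi-Poisson family, which estimates this same Pearson-based dispersion and multiplies the standard errors by its square root:

modVisitsQ <- glm(complaints~visits, family = "quasipoisson", data = esdcomp)
summary(modVisitsQ)$dispersion    # should match the Pearson-based estimate above
summary(modVisitsQ)$coefficients  # standard errors inflated by sqrt(dispersion)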

Based on this full model building analysis, my final recommendation is that we use visits to predict the number of complaints a doctor will receive.