library(faraway)
library(ggplot2)
library(cats)
data(esdcomp)
summary(esdcomp)
## visits complaints residency gender revenue
## Min. : 879 Min. : 0.000 N:24 F:12 Min. :206.4
## 1st Qu.:2036 1st Qu.: 1.000 Y:20 M:32 1st Qu.:235.9
## Median :2384 Median : 2.000 Median :258.5
## Mean :2386 Mean : 3.341 Mean :260.1
## 3rd Qu.:2813 3rd Qu.: 5.000 3rd Qu.:282.3
## Max. :3763 Max. :11.000 Max. :334.9
## hours
## Min. : 589
## 1st Qu.:1207
## Median :1512
## Mean :1417
## 3rd Qu.:1642
## Max. :1917
Below will go through each variable used as a predictor of number of complaints to determine if poisson regression should be used, and to get a better feel for the data. Mostly I wanted to make pretty rainbow graphs with cats on them.
The graph below is a histogram of the number of complaints. It is a count, which points to poisson. The data also seems to peak near the mean of 3.34, and is right skewed, which also points to poisson. This seems like a poisson distribution with a fat tail, (possibly due to not enough data points?)
ggplot(data = esdcomp, aes(x = complaints)) + add_cat() + geom_histogram(binwidth = .5, fill = "red", bins = 12) + labs(title = "Histogram of Complaints", x = "# of Complaints", y = "Count by # of Complaints")
mean(esdcomp$complaints)
## [1] 3.340909
var(esdcomp$complaints)
## [1] 7.671776
The following plots are fairly superfluous, but I wanted rainbow cats and thought it would be good to start predicting which variables might be the best predictors.
Complaints by number of visits has a fairly strong positive trend, so this may be a decent predictor where as number of visits go up, so do complaints. This would make sense since you have more exposures to getting possible complaints as you get more visits. Additionally we can notice some heteroscedasticity with vairance increasing as x does, pointing to poisson.
ggplot(data = esdcomp, aes(x = visits, y = complaints)) + add_cat() + geom_point(size = 3, col = "darkorange") + labs(title = "Complaints by # of Visits", x = "# of Visits", y = "# of Complaints")
Complaints by revenue has a pretty weak positive trend, where complaints slightly trend up as revenue goes up. This may be collinear with number of visits since I would anticipate that doctors earning more would have more experience, aka more visits. Possible heteroscedasticity, but overall a fairly scattered graph.
ggplot(data = esdcomp, aes(x = revenue, y = complaints)) + add_cat() + geom_point(size = 3, col = "gold") + labs(title = "Complaints by Revenue", x = "Revenue", y = "# of Complaints")
There seems to be a positive trend between number of complaints received and the number of hours worked, where complaints trend up as # of ours worked increases. There is also heteroscedasticity in this graph where variance increases as x does, which points to poisson. This again may be collinear with the number of visits and the revenue earned.
ggplot(data = esdcomp, aes(x = hours, y = complaints)) + add_cat() + geom_point(size = 3, col = "seagreen") + labs(title = "Complaints by # of Hours", x = "# of Hours", y = "# of Complaints")
This graph appears to show that male doctors and female doctors have the same median number of complaints, but a male docto seems more likely to have a higher number of complaints than a female doctor. Makes sense. That’s all.
ggplot(data = esdcomp, aes(x = gender, y = complaints, color = gender)) + add_cat() + geom_boxplot(lwd = .8, outlier.size = 4) + scale_color_manual(values = c("slateblue2", "steelblue1")) + labs(title = "Complaints by Gender", x = "Gender", y = "# of Complaints")
This graph shows that overall, doctors that are not in a residency are more likely to have complaints than those who are currently in residency training. I would have anticipated the opposite, but perhaps those who are out of residency are more lazy? Very weird. There is also difference in variance between the two.
ggplot(data = esdcomp, aes(x = residency, y = complaints, color = residency)) + add_cat() + geom_boxplot(lwd = .8, outlier.size = 4) + scale_color_manual(values = c("hotpink2", "mediumorchid3")) + labs(title = "Complaints Based on Residency Status", x = "In a Residency?", y = "# of Complaints")
Yes, our early evidence points to that we would want to conduct poisson regression. This is because our response variable is a count, it is slightly poisson distributed with the mean at the peak, and analysis of our variables show heteroscedasticity with more variance as number of complaints increases.
Forward & Backward Selection
As you can see below, both forward and backward selection land on the model using visits and residency as a factor as predictors when doing model selection by AIC, so I will proceed using this model.
library(MASS)
modMin <- glm(complaints ~ 1, data = esdcomp, family = "poisson")
modFull <- glm(complaints ~ visits + factor(residency) + factor(gender) + revenue + hours, data = esdcomp, family = "poisson")
modForward <- stepAIC(modMin, direction = "forward", scope = list(upper = ~visits + factor(residency) + factor(gender) + revenue + hours, lower = ~1))
## Start: AIC=214.23
## complaints ~ 1
##
## Df Deviance AIC
## + visits 1 54.246 181.03
## + hours 1 64.684 191.46
## + factor(residency) 1 83.303 210.08
## + factor(gender) 1 86.457 213.24
## + revenue 1 86.929 213.71
## <none> 89.447 214.23
##
## Step: AIC=181.03
## complaints ~ visits
##
## Df Deviance AIC
## + factor(residency) 1 50.882 179.66
## + revenue 1 52.109 180.89
## <none> 54.246 181.03
## + hours 1 53.566 182.35
## + factor(gender) 1 53.898 182.68
##
## Step: AIC=179.66
## complaints ~ visits + factor(residency)
##
## Df Deviance AIC
## <none> 50.882 179.66
## + revenue 1 50.404 181.18
## + factor(gender) 1 50.739 181.52
## + hours 1 50.870 181.65
modBackward <- stepAIC(modFull, direction = "backward")
## Start: AIC=184.77
## complaints ~ visits + factor(residency) + factor(gender) + revenue +
## hours
##
## Df Deviance AIC
## - hours 1 50.051 182.83
## - factor(gender) 1 50.251 183.03
## - revenue 1 50.665 183.44
## - factor(residency) 1 51.319 184.10
## <none> 49.995 184.78
## - visits 1 57.568 190.35
##
## Step: AIC=182.83
## complaints ~ visits + factor(residency) + factor(gender) + revenue
##
## Df Deviance AIC
## - factor(gender) 1 50.404 181.18
## - revenue 1 50.739 181.52
## - factor(residency) 1 51.321 182.10
## <none> 50.051 182.83
## - visits 1 75.947 206.73
##
## Step: AIC=181.18
## complaints ~ visits + factor(residency) + revenue
##
## Df Deviance AIC
## - revenue 1 50.882 179.66
## - factor(residency) 1 52.109 180.89
## <none> 50.404 181.18
## - visits 1 76.427 205.21
##
## Step: AIC=179.66
## complaints ~ visits + factor(residency)
##
## Df Deviance AIC
## <none> 50.882 179.66
## - factor(residency) 1 54.246 181.03
## - visits 1 83.303 210.08
Below is the model chosen through automatic regression. An increase of one visit is associated with a multiplicative change of 1.0008105 in the mean number of complaints, holding residency status constant. Also, a doctor that holds current residency status is associated with a multiplicative change of .7318550 in the mean number of complaints.
mod0 <- glm(complaints ~ visits + factor(residency), data = esdcomp, family = "poisson")
summary(mod0)
##
## Call:
## glm(formula = complaints ~ visits + factor(residency), family = "poisson",
## data = esdcomp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9264 -0.9142 -0.3832 0.6496 1.9869
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7274574 0.3972932 -1.831 0.0671 .
## visits 0.0008101 0.0001429 5.668 1.45e-08 ***
## factor(residency)Y -0.3121729 0.1725024 -1.810 0.0703 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 50.882 on 41 degrees of freedom
## AIC: 179.66
##
## Number of Fisher Scoring iterations: 5
exp(coef(mod0))
## (Intercept) visits factor(residency)Y
## 0.4831359 1.0008105 0.7318550
Now I will test for lack of fit with the model I currently have selected, mod0, that has the variables visits and residency as a factor selected. We find a pvalue of .138621. Under the typical level of significance of .05, we do not have lack of fit, so I feel semi confident in this model so far.
teststat <- summary(mod0)$deviance
wdf <- nrow(esdcomp) - length(coef(mod0))
pchisq(teststat, df = wdf, lower.tail= FALSE)
## [1] 0.138621
Next we will be conducting drop in deviance tests on each of the additional variables to see if we should include any in our final model.
Below we conduct a drop-in-deviance test by adding the variable gender as a factor. It finds a pvalue of .705 which does not provide sufficient evidence that the more complex model is an improvement over our null model.
mod1 <- glm(complaints ~ visits + factor(residency) + factor(gender), data = esdcomp, family = "poisson")
summary(mod1)
##
## Call:
## glm(formula = complaints ~ visits + factor(residency) + factor(gender),
## family = "poisson", data = esdcomp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9514 -0.9058 -0.3792 0.6189 1.9395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7611853 0.4072631 -1.869 0.0616 .
## visits 0.0007989 0.0001456 5.487 4.1e-08 ***
## factor(residency)Y -0.3046352 0.1736236 -1.755 0.0793 .
## factor(gender)M 0.0780814 0.2076261 0.376 0.7069
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 50.739 on 40 degrees of freedom
## AIC: 181.52
##
## Number of Fisher Scoring iterations: 5
teststat <- deviance(mod0) - deviance(mod1)
pchisq(teststat, df = 1, lower.tail = FALSE)
## [1] 0.7049961
Below we conduct a drop-in-deviance test by adding the variable revenue. It finds a pvalue of .489 which, although better than the previous model, does not provide sufficient evidence that the more complex model is an improvement over our null model.
mod2 <- glm(complaints ~ visits + factor(residency) + revenue, data = esdcomp, family = "poisson")
summary(mod1)
##
## Call:
## glm(formula = complaints ~ visits + factor(residency) + factor(gender),
## family = "poisson", data = esdcomp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9514 -0.9058 -0.3792 0.6189 1.9395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7611853 0.4072631 -1.869 0.0616 .
## visits 0.0007989 0.0001456 5.487 4.1e-08 ***
## factor(residency)Y -0.3046352 0.1736236 -1.755 0.0793 .
## factor(gender)M 0.0780814 0.2076261 0.376 0.7069
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 50.739 on 40 degrees of freedom
## AIC: 181.52
##
## Number of Fisher Scoring iterations: 5
teststat <- deviance(mod0) - deviance(mod2)
pchisq(teststat, df = 1, lower.tail = FALSE)
## [1] 0.4891331
Below we conduct a drop-in-deviance test by adding the variable hours. It finds a pvalue of .913, the worst of them all, which does not provide sufficient evidence that the more complex model is an improvement over our null model.
mod3 <- glm(complaints ~ visits + factor(residency) + hours, data = esdcomp, family = "poisson")
summary(mod3)
##
## Call:
## glm(formula = complaints ~ visits + factor(residency) + hours,
## family = "poisson", data = esdcomp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9053 -0.9335 -0.3787 0.6450 1.9762
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.696e-01 5.553e-01 -1.386 0.165786
## visits 7.927e-04 2.141e-04 3.703 0.000213 ***
## factor(residency)Y -3.042e-01 1.872e-01 -1.625 0.104211
## hours 5.582e-05 5.111e-04 0.109 0.913031
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 50.870 on 40 degrees of freedom
## AIC: 181.65
##
## Number of Fisher Scoring iterations: 5
teststat <- deviance(mod0) - deviance(mod3)
pchisq(teststat, df = 1, lower.tail = FALSE)
## [1] 0.9130706
After conducting drop-in-deviance tests for each additional variable, it is clear to see that we do not want to add anymore variables and to stick with the model selected through automatic regression.
Below we have the confidence intervals for our two predictors, visits and residency as a factor. visits: Holding residency status constant, we are 95% confident that an additional visit is associated with a multiplicative change in the mean number of complaints between 1.000530 and 1.001091 Residency: Holding visits constant, we are 95% confident that holding residency status is associated with a multiplicative change in the mean number of complaints between .5219009 and 1.0262710.
visitsCI <- exp(.0008101 + c(-1, 1)*0.0001429*1.96)
residencyCI <- exp(-0.3121729 + c(-1, 1)*0.1725024*1.96)
visitsCI
## [1] 1.000530 1.001091
residencyCI
## [1] 0.5219009 1.0262710
We can calculate our dispersion parameter to be 1.234 (lucky number!), which means we would need to multiply our standard errors by 1.111009. Our dispersion parameter is pretty close to 1, so it isn’t too worrisome.
dp <- sum(residuals(mod0, type = "pearson")^2/mod0$df.residual)
sqrt(dp)
## [1] 1.111009
Enjoy these cats.
here_kitty()
## mrrrrowwww
here_kitty()
## mew
here_kitty()
## mrrrrowwww
here_kitty()
## mew
here_kitty()
## mrrrrowwww
here_kitty()
## hissst
here_kitty()
## purrr
here_kitty()
## purrr
here_kitty()
## hissst
here_kitty()
## hissst