library(faraway)
data(esdcomp)
attach(esdcomp)
incre <- seq(0,11, by = 1)
hist(complaints, breaks = incre)
As you can see, the majority of our complaints fall in the 0 to 1 bin; larger counts are much less common. The esdcomp dataset looks into the factors associated with the number of complaints received by emergency room doctors. Poisson regression is a natural choice here because our target variable (complaints received) is a count over some period of time. Linear regression would be a poor fit because the number of complaints can never be negative, whereas the Poisson model's log link guarantees nonnegative fitted means.
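As a minimal sketch of why the log link avoids negative predictions (using the same visits model fitted later in this analysis; demo_fit is just an illustrative name):
# Sketch: the Poisson GLM models log(mu), so every fitted mean exp(eta) is
# strictly positive, unlike a Gaussian linear model whose fitted values are
# unconstrained.
demo_fit <- glm(complaints ~ visits, family = "poisson", data = esdcomp)
head(predict(demo_fit, type = "link"))      # eta = log(mu); can be negative
head(predict(demo_fit, type = "response"))  # mu = exp(eta); always > 0
all(predict(demo_fit, type = "response") > 0)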
begmod <- glm(complaints~1, family = "poisson")
pmod1 <- glm(complaints~visits, family = "poisson")
pmod2 <- glm(complaints~residency, family = "poisson")
pmod3 <- glm(complaints~gender, family = "poisson")
pmod4 <- glm(complaints~revenue, family = "poisson")
pmod5 <- glm(complaints~hours, family = "poisson")
AIC(begmod)
## [1] 214.2269
AICtest <- c(AIC(pmod1),AIC(pmod2),AIC(pmod3),AIC(pmod4),AIC(pmod5))
AICtest
## [1] 181.0258 210.0828 213.2371 213.7092 191.4638
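As an aside, the same single-predictor sweep can be written in one pass (a sketch, assuming the esdcomp data frame is available as above):
# Sketch: fit each one-predictor Poisson model and collect the AICs in one call.
preds <- c("visits", "residency", "gender", "revenue", "hours")
aic_sweep <- sapply(preds, function(p) {
  AIC(glm(reformulate(p, response = "complaints"),
          family = "poisson", data = esdcomp))
})
sort(aic_sweep)  # visits should have the lowest AIC, matching the values above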
The model with the lowest AIC, and therefore our best single predictor of complaints, is the one using the number of patient visits. We want to make sure this model is significantly better than the grand-mean (intercept-only) model, which we can check with a drop-in-deviance test.
anova(pmod1,begmod, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits
## Model 2: complaints ~ 1
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 42 54.246
## 2 43 89.447 -1 -35.201 2.974e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
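For reference, the same statistic and p-value can be reproduced by hand from the two residual deviances (a sketch):
# Sketch: drop-in-deviance statistic = change in residual deviance, compared
# to a chi-square with df equal to the number of added parameters (here 1).
dev_drop <- deviance(begmod) - deviance(pmod1)        # 89.447 - 54.246 ≈ 35.2
df_drop  <- df.residual(begmod) - df.residual(pmod1)  # 43 - 42 = 1
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)    # ≈ 2.97e-09, as above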
Because visits is our best single predictor, pmod1 now becomes the current model we build on going forward.
currentmod <- pmod1
amod <- glm(complaints~visits + residency, family = "poisson")
bmod <- glm(complaints~visits + gender, family = "poisson")
cmod <- glm(complaints~visits + revenue, family = "poisson")
dmod <- glm(complaints~visits + hours, family = "poisson")
One way to see whether a two-predictor model is significantly better than the model with just visits is to use a drop-in-deviance test.
anova(amod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + residency
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 50.882
## 2 42 54.246 -1 -3.3639 0.06664 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(bmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + gender
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 53.898
## 2 42 54.246 -1 -0.34833 0.5551
anova(cmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + revenue
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 52.109
## 2 42 54.246 -1 -2.137 0.1438
anova(dmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + hours
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 53.566
## 2 42 54.246 -1 -0.67983 0.4096
Using a drop-in-deviance test with an alpha of .05, no two-predictor model was significantly better than the model with just visits, although residency and revenue came closest.
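An equivalent shortcut for these one-at-a-time checks (a sketch; add1() runs the same likelihood-ratio tests against the current model in a single call):
# Sketch: compare currentmod against each single-term addition at once.
add1(currentmod, scope = ~ visits + residency + gender + revenue + hours,
     test = "Chisq")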
We can also try a few more flexible models by squaring and taking the square root of some predictors.
SqrtRev <- sqrt(revenue)
Revsquared <- revenue^2
SqrtHour <- sqrt(hours)
HourSq <- hours^2
Wmod <- glm(complaints~visits + HourSq, family = "poisson")
Xmod <- glm(complaints~visits + Revsquared, family = "poisson")
Ymod <- glm(complaints~visits + SqrtRev, family = "poisson")
Zmod <- glm(complaints~visits + SqrtHour, family = "poisson")
anova(Wmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + HourSq
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 53.566
## 2 42 54.246 -1 -0.68029 0.4095
anova(Xmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + Revsquared
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 51.776
## 2 42 54.246 -1 -2.4702 0.116
anova(Ymod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + SqrtRev
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 52.274
## 2 42 54.246 -1 -1.9723 0.1602
anova(Zmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: complaints ~ visits + SqrtHour
## Model 2: complaints ~ visits
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 41 53.599
## 2 42 54.246 -1 -0.64671 0.4213
Again, none of the transformed predictors is significant, so we conclude that the model with visits alone is our best model.
summary(pmod1)
##
## Call:
## glm(formula = complaints ~ visits, family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8125 -0.9655 -0.2594 0.9004 1.9214
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8493259 0.3807341 -2.231 0.0257 *
## visits 0.0008101 0.0001384 5.852 4.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 54.246 on 42 degrees of freedom
## AIC: 181.03
##
## Number of Fisher Scoring iterations: 5
Let λ be the mean number of complaints, so the Poisson model works on the log scale:
log(λ) = β0 + β1(visits), fitted as log(λ̂) = -0.8493 + 0.0008101(visits)
We can interpret this model as follows: an increase of one patient visit is associated with a multiplicative change of e^0.0008101 ≈ 1.00081 in the mean number of complaints received.
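The multiplicative factor quoted above is just the exponentiated coefficient (a quick sketch):
# Sketch: exponentiating the slope gives the multiplicative change per visit.
exp(coef(pmod1)["visits"])   # ≈ 1.00081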
Poisson regression assumes the variance equals the mean, so before trusting the standard errors we should estimate the dispersion parameter from the Pearson residuals.
dp <- sum(residuals(pmod1, type = "pearson")^2) / pmod1$df.residual
dp
## [1] 1.272378
We have mild overdispersion, with the variance estimated at about 1.27 times the mean. This is not severe, but we can refit the summary using this dispersion estimate so that the standard errors account for it.
summary(pmod1, dispersion = dp)
##
## Call:
## glm(formula = complaints ~ visits, family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8125 -0.9655 -0.2594 0.9004 1.9214
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8493259 0.4294671 -1.978 0.048 *
## visits 0.0008101 0.0001561 5.188 2.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.272378)
##
## Null deviance: 89.447 on 43 degrees of freedom
## Residual deviance: 54.246 on 42 degrees of freedom
## AIC: 181.03
##
## Number of Fisher Scoring iterations: 5
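An equivalent way to obtain these inflated standard errors (a sketch, not part of the original fit) is to refit with the quasi-Poisson family, which estimates the same Pearson dispersion automatically:
# Sketch: quasi-Poisson estimates the dispersion from the Pearson statistic,
# so its standard errors should match the dispersion-adjusted summary above.
qmod <- glm(complaints ~ visits, family = "quasipoisson")
summary(qmod)$dispersion   # should be close to dp ≈ 1.27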
Finally, we calculate a 95% confidence interval for the visits coefficient in the model found through forward selection.
myz <- qnorm(.975)
myci <- .0008101 + c(-1,1)*myz *.0001561
myci
## [1] 0.0005041496 0.0011160504
Double check with the profile-likelihood interval from confint():
confint(pmod1, level = .95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.6116781800 -0.118992060
## visits 0.0005400825 0.001082786
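To get the multiplicative (rate-ratio) scale used in the interpretation below, we can exponentiate either interval (a sketch):
# Sketch: exponentiate the interval endpoints to put them on the rate-ratio scale.
exp(myci)                        # Wald interval, ≈ (1.000504, 1.001117)
exp(confint(pmod1)["visits", ])  # profile-likelihood interval for comparison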
We can interpret this confidence interval as follows: we are 95% confident that each additional patient visit is associated with a multiplicative change in the mean number of complaints of between about 1.000504 and 1.001117.