library(faraway)
data(esdcomp)
attach(esdcomp)
incre <- seq(0,11, by = 1)
hist(complaints, breaks = incre)

As you can see the majority of the our complaints fall between 0 and 1. More than 1 complaint is more seldom. The datset esdcomp looks into the factors that lead to complaints recieved in an emergency room. It would be ideal to use Poisson Regression in this scenario because our target variable (complaints recieved) is a count over some period of time. Linear Regression would not be effective because there is no way the number of complaints could be negative. Poisson Regression allows us to get rid of negative values.

begmod <- glm(complaints~1, family = "poisson")
pmod1 <- glm(complaints~visits, family = "poisson")
pmod2 <- glm(complaints~residency, family = "poisson")
pmod3 <- glm(complaints~gender, family = "poisson")
pmod4 <- glm(complaints~revenue, family = "poisson")
pmod5 <- glm(complaints~hours, family = "poisson")
AIC(begmod)
## [1] 214.2269
AICtest <- c(AIC(pmod1),AIC(pmod2),AIC(pmod3),AIC(pmod4),AIC(pmod5))
AICtest
## [1] 181.0258 210.0828 213.2371 213.7092 191.4638

The model with the lowest AIC, and therefore our best predictor for complaints, is the number of patient visits the hospital has. We want to make sure that our new model is significantly better than our grand mean model. We can do this using a drop in deviance test.

anova(pmod1,begmod, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits
## Model 2: complaints ~ 1
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        42     54.246                          
## 2        43     89.447 -1  -35.201 2.974e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because this is our best predictor, pmod1, now becomes the currentmodel we will use going forward

currentmod <- pmod1

amod <- glm(complaints~visits + residency, family = "poisson")
bmod <- glm(complaints~visits + gender, family = "poisson")
cmod <- glm(complaints~visits + revenue, family = "poisson")
dmod <- glm(complaints~visits + hours, family = "poisson")

One way we can see whether our 2 predictor model is significantly better than our model with just oligarchy is to use a drop in deviance test

anova(amod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + residency
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        41     50.882                       
## 2        42     54.246 -1  -3.3639  0.06664 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(bmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + gender
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     53.898                     
## 2        42     54.246 -1 -0.34833   0.5551
anova(cmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + revenue
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     52.109                     
## 2        42     54.246 -1   -2.137   0.1438
anova(dmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + hours
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     53.566                     
## 2        42     54.246 -1 -0.67983   0.4096

Using a drop in deviance test with an alpha of .05, no 2 predictor model was significantly better than our model with just visits. However, a few came close such as residency and revenue.

We can try to add a few more complex models by squaring and square rooting some predictors

SqrtRev <- sqrt(revenue)
Revsquared <- revenue^2
SqrtHour <- sqrt(hours)
HourSq <- hours^2

Wmod <- glm(complaints~visits + HourSq, family = "poisson")
Xmod <- glm(complaints~visits + Revsquared, family = "poisson")
Ymod <- glm(complaints~visits + SqrtRev, family = "poisson")
Zmod <- glm(complaints~visits + SqrtHour, family = "poisson")

anova(Wmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + HourSq
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     53.566                     
## 2        42     54.246 -1 -0.68029   0.4095
anova(Xmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + Revsquared
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     51.776                     
## 2        42     54.246 -1  -2.4702    0.116
anova(Ymod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + SqrtRev
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     52.274                     
## 2        42     54.246 -1  -1.9723   0.1602
anova(Zmod,pmod1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: complaints ~ visits + SqrtHour
## Model 2: complaints ~ visits
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        41     53.599                     
## 2        42     54.246 -1 -0.64671   0.4213

Again we have no other predictors that are significant. So, we can conclude that our predictor with visits is our best model

summary(pmod1)
## 
## Call:
## glm(formula = complaints ~ visits, family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8125  -0.9655  -0.2594   0.9004   1.9214  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.8493259  0.3807341  -2.231   0.0257 *  
## visits       0.0008101  0.0001384   5.852 4.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 54.246  on 42  degrees of freedom
## AIC: 181.03
## 
## Number of Fisher Scoring iterations: 5

Let log(λ) be equal to the mean number of complaints at the hospital

log(λ) = β0 + .0008101(visits)

We can interpret our model: An increase of 1 patient visits is associated with a multiplicative change of 1.00081 in the mean number of complaints recieved

Because our model experiences some dispersion which is not accounted for, we should calculate the dispersion

dp <- sum(residuals(pmod1, type = "pearson")^2 / pmod1$df.residual)
dp
## [1] 1.272378

We have a bit of over dispersion with our variance being about 1.27 times larger than the mean. There does not appear to be much dispersion and likely explains the good fit of the model.

summary(pmod1, dispersion = dp)
## 
## Call:
## glm(formula = complaints ~ visits, family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8125  -0.9655  -0.2594   0.9004   1.9214  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.8493259  0.4294671  -1.978    0.048 *  
## visits       0.0008101  0.0001561   5.188 2.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.272378)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 54.246  on 42  degrees of freedom
## AIC: 181.03
## 
## Number of Fisher Scoring iterations: 5

We should calculate a confidence interval for our model found through forward selection. We will use a 95% confidence interval

myz <- qnorm(.975)
myci <- .0008101 + c(-1,1)*myz *.0001561
myci
## [1] 0.0005041496 0.0011160504

Double Check

confint(pmod1, level = .95)
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept) -1.6116781800 -0.118992060
## visits       0.0005400825  0.001082786

We can interpret this confidence interval by: An increase of 1 patient visitor, we can be 95% sure, will result in a multiplicative increase between 1.000504 and 1.001117