Emergency Health Services

library(faraway)
library(ggplot2)
library(cats)
data(esdcomp)
summary(esdcomp)

##      visits       complaints     residency gender    revenue     
##  Min.   : 879   Min.   : 0.000   N:24      F:12   Min.   :206.4  
##  1st Qu.:2036   1st Qu.: 1.000   Y:20      M:32   1st Qu.:235.9  
##  Median :2384   Median : 2.000                    Median :258.5  
##  Mean   :2386   Mean   : 3.341                    Mean   :260.1  
##  3rd Qu.:2813   3rd Qu.: 5.000                    3rd Qu.:282.3  
##  Max.   :3763   Max.   :11.000                    Max.   :334.9  
##      hours     
##  Min.   : 589  
##  1st Qu.:1207  
##  Median :1512  
##  Mean   :1417  
##  3rd Qu.:1642  
##  Max.   :1917

Exploratory Analyis

Below will go through each variable used as a predictor of number of complaints to determine if poisson regression should be used, and to get a better feel for the data. Mostly I wanted to make pretty rainbow graphs with cats on them.

The graph below is a histogram of the number of complaints. It is a count, which points to poisson. The data also seems to peak near the mean of 3.34, and is right skewed, which also points to poisson. This seems like a poisson distribution with a fat tail, (possibly due to not enough data points?)

ggplot(data = esdcomp, aes(x = complaints)) + add_cat() + geom_histogram(binwidth = .5, fill = "red", bins = 12) + labs(title = "Histogram of Complaints", x = "# of Complaints", y = "Count by # of Complaints")

mean(esdcomp$complaints)

## [1] 3.340909

var(esdcomp$complaints)

## [1] 7.671776

The following plots are fairly superfluous, but I wanted rainbow cats and thought it would be good to start predicting which variables might be the best predictors.

Complaints by number of visits has a fairly strong positive trend, so this may be a decent predictor where as number of visits go up, so do complaints. This would make sense since you have more exposures to getting possible complaints as you get more visits. Additionally we can notice some heteroscedasticity with vairance increasing as x does, pointing to poisson.

ggplot(data = esdcomp, aes(x = visits, y = complaints)) + add_cat() + geom_point(size = 3, col = "darkorange") + labs(title = "Complaints by # of Visits", x = "# of Visits", y = "# of Complaints")

Complaints by revenue has a pretty weak positive trend, where complaints slightly trend up as revenue goes up. This may be collinear with number of visits since I would anticipate that doctors earning more would have more experience, aka more visits. Possible heteroscedasticity, but overall a fairly scattered graph.

ggplot(data = esdcomp, aes(x = revenue, y = complaints)) + add_cat() + geom_point(size = 3, col = "gold") + labs(title = "Complaints by Revenue", x = "Revenue", y = "# of Complaints")

There seems to be a positive trend between number of complaints received and the number of hours worked, where complaints trend up as # of ours worked increases. There is also heteroscedasticity in this graph where variance increases as x does, which points to poisson. This again may be collinear with the number of visits and the revenue earned.

ggplot(data = esdcomp, aes(x = hours, y = complaints)) + add_cat() + geom_point(size = 3, col = "seagreen") + labs(title = "Complaints by # of Hours", x = "# of Hours", y = "# of Complaints")

This graph appears to show that male doctors and female doctors have the same median number of complaints, but a male docto seems more likely to have a higher number of complaints than a female doctor. Makes sense. That’s all.

ggplot(data = esdcomp, aes(x = gender, y = complaints, color = gender)) + add_cat() + geom_boxplot(lwd = .8, outlier.size = 4) + scale_color_manual(values = c("slateblue2", "steelblue1")) + labs(title = "Complaints by Gender", x = "Gender", y = "# of Complaints")

This graph shows that overall, doctors that are not in a residency are more likely to have complaints than those who are currently in residency training. I would have anticipated the opposite, but perhaps those who are out of residency are more lazy? Very weird. There is also difference in variance between the two.

ggplot(data = esdcomp, aes(x = residency, y = complaints, color = residency)) + add_cat() + geom_boxplot(lwd = .8, outlier.size = 4) + scale_color_manual(values = c("hotpink2", "mediumorchid3")) + labs(title = "Complaints Based on Residency Status", x = "In a Residency?", y = "# of Complaints")

Poisson?

Yes, our early evidence points to that we would want to conduct poisson regression. This is because our response variable is a count, it is slightly poisson distributed with the mean at the peak, and analysis of our variables show heteroscedasticity with more variance as number of complaints increases.

Model Building Process

Forward & Backward Selection

As you can see below, both forward and backward selection land on the model using visits and residency as a factor as predictors when doing model selection by AIC, so I will proceed using this model.

library(MASS)
modMin <- glm(complaints ~ 1, data = esdcomp, family = "poisson")
modFull <- glm(complaints ~ visits + factor(residency) + factor(gender) + revenue + hours, data = esdcomp, family = "poisson")
modForward <- stepAIC(modMin, direction = "forward", scope = list(upper = ~visits + factor(residency) + factor(gender) + revenue + hours, lower = ~1))

## Start:  AIC=214.23
## complaints ~ 1
## 
##                     Df Deviance    AIC
## + visits             1   54.246 181.03
## + hours              1   64.684 191.46
## + factor(residency)  1   83.303 210.08
## + factor(gender)     1   86.457 213.24
## + revenue            1   86.929 213.71
## <none>                   89.447 214.23
## 
## Step:  AIC=181.03
## complaints ~ visits
## 
##                     Df Deviance    AIC
## + factor(residency)  1   50.882 179.66
## + revenue            1   52.109 180.89
## <none>                   54.246 181.03
## + hours              1   53.566 182.35
## + factor(gender)     1   53.898 182.68
## 
## Step:  AIC=179.66
## complaints ~ visits + factor(residency)
## 
##                  Df Deviance    AIC
## <none>                50.882 179.66
## + revenue         1   50.404 181.18
## + factor(gender)  1   50.739 181.52
## + hours           1   50.870 181.65

modBackward <- stepAIC(modFull, direction = "backward")

## Start:  AIC=184.77
## complaints ~ visits + factor(residency) + factor(gender) + revenue + 
##     hours
## 
##                     Df Deviance    AIC
## - hours              1   50.051 182.83
## - factor(gender)     1   50.251 183.03
## - revenue            1   50.665 183.44
## - factor(residency)  1   51.319 184.10
## <none>                   49.995 184.78
## - visits             1   57.568 190.35
## 
## Step:  AIC=182.83
## complaints ~ visits + factor(residency) + factor(gender) + revenue
## 
##                     Df Deviance    AIC
## - factor(gender)     1   50.404 181.18
## - revenue            1   50.739 181.52
## - factor(residency)  1   51.321 182.10
## <none>                   50.051 182.83
## - visits             1   75.947 206.73
## 
## Step:  AIC=181.18
## complaints ~ visits + factor(residency) + revenue
## 
##                     Df Deviance    AIC
## - revenue            1   50.882 179.66
## - factor(residency)  1   52.109 180.89
## <none>                   50.404 181.18
## - visits             1   76.427 205.21
## 
## Step:  AIC=179.66
## complaints ~ visits + factor(residency)
## 
##                     Df Deviance    AIC
## <none>                   50.882 179.66
## - factor(residency)  1   54.246 181.03
## - visits             1   83.303 210.08

Below is the model chosen through automatic regression. An increase of one visit is associated with a multiplicative change of 1.0008105 in the mean number of complaints, holding residency status constant. Also, a doctor that holds current residency status is associated with a multiplicative change of .7318550 in the mean number of complaints.

mod0 <- glm(complaints ~ visits + factor(residency), data = esdcomp, family = "poisson")
summary(mod0)

## 
## Call:
## glm(formula = complaints ~ visits + factor(residency), family = "poisson", 
##     data = esdcomp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9264  -0.9142  -0.3832   0.6496   1.9869  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.7274574  0.3972932  -1.831   0.0671 .  
## visits              0.0008101  0.0001429   5.668 1.45e-08 ***
## factor(residency)Y -0.3121729  0.1725024  -1.810   0.0703 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 50.882  on 41  degrees of freedom
## AIC: 179.66
## 
## Number of Fisher Scoring iterations: 5

exp(coef(mod0))

##        (Intercept)             visits factor(residency)Y 
##          0.4831359          1.0008105          0.7318550

Now I will test for lack of fit with the model I currently have selected, mod0, that has the variables visits and residency as a factor selected. We find a pvalue of .138621. Under the typical level of significance of .05, we do not have lack of fit, so I feel semi confident in this model so far.

teststat <- summary(mod0)$deviance
wdf <- nrow(esdcomp) - length(coef(mod0))
pchisq(teststat, df = wdf, lower.tail= FALSE)

## [1] 0.138621

Drop In Deviance

Next we will be conducting drop in deviance tests on each of the additional variables to see if we should include any in our final model.

Below we conduct a drop-in-deviance test by adding the variable gender as a factor. It finds a pvalue of .705 which does not provide sufficient evidence that the more complex model is an improvement over our null model.

mod1 <- glm(complaints ~ visits + factor(residency) + factor(gender), data = esdcomp, family = "poisson")
summary(mod1)

## 
## Call:
## glm(formula = complaints ~ visits + factor(residency) + factor(gender), 
##     family = "poisson", data = esdcomp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9514  -0.9058  -0.3792   0.6189   1.9395  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.7611853  0.4072631  -1.869   0.0616 .  
## visits              0.0007989  0.0001456   5.487  4.1e-08 ***
## factor(residency)Y -0.3046352  0.1736236  -1.755   0.0793 .  
## factor(gender)M     0.0780814  0.2076261   0.376   0.7069    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 50.739  on 40  degrees of freedom
## AIC: 181.52
## 
## Number of Fisher Scoring iterations: 5

teststat <- deviance(mod0) - deviance(mod1)
pchisq(teststat, df = 1, lower.tail = FALSE)

## [1] 0.7049961

Below we conduct a drop-in-deviance test by adding the variable revenue. It finds a pvalue of .489 which, although better than the previous model, does not provide sufficient evidence that the more complex model is an improvement over our null model.

mod2 <- glm(complaints ~ visits + factor(residency) + revenue, data = esdcomp, family = "poisson")
summary(mod1)

## 
## Call:
## glm(formula = complaints ~ visits + factor(residency) + factor(gender), 
##     family = "poisson", data = esdcomp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9514  -0.9058  -0.3792   0.6189   1.9395  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.7611853  0.4072631  -1.869   0.0616 .  
## visits              0.0007989  0.0001456   5.487  4.1e-08 ***
## factor(residency)Y -0.3046352  0.1736236  -1.755   0.0793 .  
## factor(gender)M     0.0780814  0.2076261   0.376   0.7069    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 50.739  on 40  degrees of freedom
## AIC: 181.52
## 
## Number of Fisher Scoring iterations: 5

teststat <- deviance(mod0) - deviance(mod2)
pchisq(teststat, df = 1, lower.tail = FALSE)

## [1] 0.4891331

Below we conduct a drop-in-deviance test by adding the variable hours. It finds a pvalue of .913, the worst of them all, which does not provide sufficient evidence that the more complex model is an improvement over our null model.

mod3 <- glm(complaints ~ visits + factor(residency) + hours, data = esdcomp, family = "poisson")
summary(mod3)

## 
## Call:
## glm(formula = complaints ~ visits + factor(residency) + hours, 
##     family = "poisson", data = esdcomp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9053  -0.9335  -0.3787   0.6450   1.9762  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -7.696e-01  5.553e-01  -1.386 0.165786    
## visits              7.927e-04  2.141e-04   3.703 0.000213 ***
## factor(residency)Y -3.042e-01  1.872e-01  -1.625 0.104211    
## hours               5.582e-05  5.111e-04   0.109 0.913031    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89.447  on 43  degrees of freedom
## Residual deviance: 50.870  on 40  degrees of freedom
## AIC: 181.65
## 
## Number of Fisher Scoring iterations: 5

teststat <- deviance(mod0) - deviance(mod3)
pchisq(teststat, df = 1, lower.tail = FALSE)

## [1] 0.9130706

After conducting drop-in-deviance tests for each additional variable, it is clear to see that we do not want to add anymore variables and to stick with the model selected through automatic regression.

Confidence Interval

Below we have the confidence intervals for our two predictors, visits and residency as a factor. visits: Holding residency status constant, we are 95% confident that an additional visit is associated with a multiplicative change in the mean number of complaints between 1.000530 and 1.001091 Residency: Holding visits constant, we are 95% confident that holding residency status is associated with a multiplicative change in the mean number of complaints between .5219009 and 1.0262710.

visitsCI <- exp(.0008101 + c(-1, 1)*0.0001429*1.96)
residencyCI <- exp(-0.3121729 + c(-1, 1)*0.1725024*1.96)
visitsCI

## [1] 1.000530 1.001091

residencyCI

## [1] 0.5219009 1.0262710

Dispersion Parameter

We can calculate our dispersion parameter to be 1.234 (lucky number!), which means we would need to multiply our standard errors by 1.111009. Our dispersion parameter is pretty close to 1, so it isn’t too worrisome.

dp <- sum(residuals(mod0, type = "pearson")^2/mod0$df.residual)
sqrt(dp)

## [1] 1.111009

CATS

Enjoy these cats.

here_kitty()

## mrrrrowwww

here_kitty()

## mew

here_kitty()

## mrrrrowwww

here_kitty()

## mew

here_kitty()

## mrrrrowwww

here_kitty()

## hissst

here_kitty()

## purrr

here_kitty()

## purrr

here_kitty()

## hissst

here_kitty()

## hissst

Poisson Regression Quiz

StatsWithSarah

3/8/2021