1. Introduction

Healthcare-associated Methicillin-resistant Staphylococcus aureus (MRSA) infections occur due to the spreading of staph bacteria during invasive procedures and medical device implants, from dirty hands, and between patients in long-term care wards/hospitals (“General Information” and “MRSA infection”). According to the CDC, about \(5\%\) of hospital patients carry MRSA bacteria on their skin, meaning that many patients can put themselves and others at risk of infection (“General Information”). Additionally, untreated MRSA infections can progress to pneumonia and sepsis (“General Information”). For the two aforementioned reasons, MRSA infections are closely tracked by the CDC so they can develop methods to reduce MRSA and other healthcare-associated infections (“What CDC is Doing”).
I will be focusing specifically on MRSA data from the state of California in 2022, where hospitals report yearly MRSA infections to the National Healthcare Safety Network (NHSN) of the CDC (California Department of Public Health). The California Department of Public Health then downloads and analyzes this data for MRSA prevention purposes (California Department of Public Health).

I want to know what factors can predict higher reported MRSA cases in California hospitals in 2022. For example, if major teaching hospitals predicted higher case counts, prevention efforts might focus on proper hygiene for healthcare workers (especially for undergraduate and medical students). Alternatively, is the number of patients days a better predictor of MRSA infections? This information would be useful in determining next steps for MRSA prevention efforts.
Research Question: Can we use patient days and hospital type to predict the number of MRSA infections per year in California hospitals?

Data wrangling
After downloading the data from California hospitals the California Department of Public Health published on data.gov, I culled through the data to ensure there wouldn’t be major problems when opening the data and analyzing it in R. I deleted 4 rows at the very top of the original data set because they pooled the data from all of the other observations based on the hospital’s risk-adjustment category (these are unnecessary for my analysis). 4 other observations were omitted because they had blanks under reported infections. There were no blanks under either explanatory variable that I want to explore (Patient Days and Hospital Category).

mrsadata <- read.csv("MRSAdata.csv")
infectionaverages <- read.csv("infectionaverages.csv")

2. Methodology

\(\text{\underline{a. Important Summary Statistics}}\):

Reported infections (from excel)
Mean: 2.11
SD: 3.45

Patient Days (from excel)
Mean: 37,446
SD: 40,551

It is concerning to me that the standard deviations of reported infections and patient days are higher than the means. There is a lot of variability in this data set, stemming from the different numbers of each hospital type and the huge variation in number of infections for each hospital type, with most hospitals hardly having any infections but some having more than 15 (see major teaching in the box plot). Clearly, averaging across all 452 hospitals (and all hospital types) is not very useful!

typetable <- mrsadata %>% count(Hospital_Type)
typetable
barplot(typetable$n, main="Hospitals of each type Reporting MRSA BSIs",
legend.text = typetable$Hospital_Type,
col=c("darkgreen","darkblue","turquoise","green", "red", "purple", "pink", "grey", "brown"),
args.legend = list(x = "topleft", inset = c(0.24, -0.12)), cex.names="0.5",las="2")

For the following plot, I referenced Zach’s code on Statology to remove the hospital labels from the x axis because they were all overlapping with one another. Instead, I used the key to determine which box corresponds to each hospital type.

library(ggplot2)
qplot(Hospital_Type, Infections_Reported, data = mrsadata,
geom= "boxplot", fill = Hospital_Type, main="Reported MRSA Infections by Hospital Type") + 
  theme(axis.text.x=element_blank())
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Looking at the infections by hospital type, long-term acute care hospitals seem to have the highest median infections, though major teaching would definitely have the highest average infections since there are some outliers with more than 15 infections each. Large community hospitals also have comparable case counts but the other 6 hospital types have comparatively low numbers of reported infections.

\(\text{\underline{b. Linear Regression is not the right choice}}\):

The theoretical linear model is \(Infections = \beta_0 + \beta_1PatientDays + \epsilon\), where \(\epsilon \sim NI(0,\sigma)\)

mrsalinearmodel<- lm(Infections_Reported~Patient_Days, data=mrsadata)
summary(mrsalinearmodel)
## 
## Call:
## lm(formula = Infections_Reported ~ Patient_Days, data = mrsadata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0182 -1.0276 -0.1302  0.4703 17.4615 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.103e-01  1.516e-01  -1.387    0.166    
## Patient_Days  6.192e-05  2.749e-06  22.524   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.367 on 450 degrees of freedom
## Multiple R-squared:  0.5299, Adjusted R-squared:  0.5289 
## F-statistic: 507.3 on 1 and 450 DF,  p-value: < 2.2e-16

The fitted model is \(\hat{Infections} = -0.2103 + 0.00006192PatientDays\)
Looking at the linear model, the relationship between patient days and reported MRSA infections is statistically significant; the adjusted \(R^2\) value suggests that this model explains a good amount of the variability between hospitals. However, since reported infections are count data, the coefficients do not make a lot of sense. If patient days increase by 1, reported infections cannot increase by a fractional amount. Also, if there are no patient days, we would expect the hospital to not be running or there to be zero reported cases (instead of a negative number of cases)!

The following code will plot the predicted values (using this SLR model) along with the actual values for reported infections.

plot(x=mrsadata$Patient_Days, y=mrsalinearmodel$fitted.values, xlab='Patient Days',
     ylab='Infections Reported', ylim=c(-0.1, 32))
points(x=mrsadata$Patient_Days, y=mrsadata$Infections_Reported, pch=2, col=3)

The linear model appears to fit the actual reported values pretty well, but I also need to check model conditions.

plot(mrsalinearmodel, which=1)

plot(mrsalinearmodel, which=2)

histogram(~mrsalinearmodel$residuals,
xlab = "Residuals", data=mrsadata)

Looking at the residuals vs fitted plot, there is a pattern to the residuals and the variance is not consistent across the fitted values (it appears to increase a bit). I also have concerns about the normality of the residuals because the standardized residuals deviate a lot from the QQ line. Finally, the histogram provides further evidence that the residuals are not normally distributed (though they are centered at 0), since the density for a residual of 0 is very high.

I will also model the data using multiple linear regression with both Patient Days and Hospital Type as explanatory variables to see if it is any different from the simple linear regression model.
The theoretical model is \(Infections = \beta_0 + \beta_1PatientDays + \beta_2HospitalType + \epsilon\), where \(\epsilon \sim NI(0,\sigma)\)
Note that I simplified the 8 indicator variables to just \(\beta_2HospitalType\) because the theoretical model would look unwieldy. The summary below shows that there are 8 indicator variables and the reference is major teaching.

mrsadaystypelinearmodel<- lm(Infections_Reported~Patient_Days + Hospital_Type, data=mrsadata)
summary(mrsadaystypelinearmodel)
## 
## Call:
## lm(formula = Infections_Reported ~ Patient_Days + Hospital_Type, 
##     data = mrsadata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5572 -0.9209 -0.0616  0.6675 15.9849 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                               -3.468e-01  2.324e-01  -1.492
## Patient_Days                               7.242e-05  3.650e-06  19.841
## Hospital_TypeCommunity, >250 Beds         -1.827e+00  4.674e-01  -3.909
## Hospital_TypeCommunity, 125-250 Beds      -3.677e-01  3.545e-01  -1.037
## Hospital_TypeCritical Access               1.998e-01  4.323e-01   0.462
## Hospital_TypeFree-Standing Rehabilitation -4.654e-01  6.649e-01  -0.700
## Hospital_TypeLong-Term Acute Care          3.776e+00  5.148e-01   7.336
## Hospital_TypeMajor Teaching               -7.181e-01  3.787e-01  -1.896
## Hospital_TypePediatric                    -2.077e+00  7.382e-01  -2.814
## Hospital_TypeRehabilitation Unit           1.058e-02  3.628e-01   0.029
##                                           Pr(>|t|)    
## (Intercept)                               0.136291    
## Patient_Days                               < 2e-16 ***
## Hospital_TypeCommunity, >250 Beds         0.000107 ***
## Hospital_TypeCommunity, 125-250 Beds      0.300248    
## Hospital_TypeCritical Access              0.644150    
## Hospital_TypeFree-Standing Rehabilitation 0.484337    
## Hospital_TypeLong-Term Acute Care         1.06e-12 ***
## Hospital_TypeMajor Teaching               0.058599 .  
## Hospital_TypePediatric                    0.005110 ** 
## Hospital_TypeRehabilitation Unit          0.976761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.165 on 442 degrees of freedom
## Multiple R-squared:  0.614,  Adjusted R-squared:  0.6061 
## F-statistic: 78.12 on 9 and 442 DF,  p-value: < 2.2e-16
plot(mrsadaystypelinearmodel, which=1)

plot(mrsadaystypelinearmodel, which=2)

Looking at the residual plots for the linear model with two predictors, I still have concerns about several model conditions. The residuals deviate from the QQ line, suggesting that they are not normally distributed. The residuals vs. fitted plot has a pattern and the variance appears to increase as the fitted values increase (which is actually a condition of Poisson regression).

\(\text{\underline{c. Poisson Regression}}\):

Since a linear regression model does not work well for this data, I will explore another model. A Poisson regression model is often a good choice for count data. The histogram below shows how reported MRSA infections appear to be described well by a Poisson distribution as it is slightly skewed right (Roback and Legler).

The average reported cases can be denoted by \(\lambda\), and we want to know if the variability between \(\lambda_i\), where i is an individual hospital, can be explained by hospital type and/or patient days.
Poisson distributions utilize a log function to link \(\lambda\) and the predictor variables.

Theoretical model (Roback and Legler):
\(log(\lambda_i) = \beta_0 + \beta_1x_i\) so…
\(log(\lambda_i) = \beta_0 + \beta_1PatientDays\) and \(log(\lambda_i) = \beta_0 + \beta_1PatientDays + \beta_2HospitalType\)
The theoretical model with interaction is \(log(\lambda_i) = \beta_0 + \beta_1PatientDays + \beta_2HospitalType + \beta_3PatientDays*HospitalType\)
There is no \(\epsilon\) term for Poisson regression. We assume that \(\lambda = variance\), which means that \(\lambda\) includes both necessary parameters.

NOTE: I again simplified the 2 predictor model to just \(\beta_2HospitalType\) though it would really include 8 indicator variables, with one for each hospital type except for major teaching since that is the base level.
The interaction model would actually have 16 indicator variables, 8 for each hospital type for both the hospital type and the interaction term.
(Roback and Legler)

I chose not to develop a model with square terms because it was going to be a lot more difficult to directly compare MRSA infections between hospital types.

Model Conditions (from Roback and Legler):
1. “Poisson Response: The response variable is a count per unit of time or space, described by a Poisson distribution”

histogram(~mrsadata$Infections_Reported,
xlab = "Reported Infections", data=mrsadata)

2. “Independence: The observations must be independent of one another” Each observation comes from a single hospital. However, there may be some concerns about independence. For example, if there is an outbreak of MRSA in a given sports league, it would spread to several hospitals in that county, regardless of type (community, teaching, pediatric, etc.).

  1. “Mean of a Poisson random variable must be equal to its variance”

I can check this by comparing the mean infections reported to the variance for each group of hospital type.
I got the following code from “Poisson Regression - R Data Analysis Examples,” but changed sd to the variance so it would be easier to directly interpret.

with(mrsadata, tapply(Infections_Reported, Hospital_Type, function(x) {
  sprintf("Mean (variance) = %1.2f (%1.2f)", mean(x), sd(x)^2)
}))
##             Community, <125 Beds             Community, >250 Beds 
##  "Mean (variance) = 0.65 (1.07)"  "Mean (variance) = 3.25 (6.35)" 
##          Community, 125-250 Beds                  Critical Access 
##  "Mean (variance) = 1.86 (4.33)"  "Mean (variance) = 0.09 (0.08)" 
##     Free-Standing Rehabilitation             Long-Term Acute Care 
##  "Mean (variance) = 0.00 (0.00)" "Mean (variance) = 4.86 (10.22)" 
##                   Major Teaching                        Pediatric 
## "Mean (variance) = 4.40 (26.56)"  "Mean (variance) = 1.70 (5.57)" 
##              Rehabilitation Unit 
##  "Mean (variance) = 0.12 (0.14)"

I have some concerns about the mean being equal to the variance. Some hospital types are more concerning than others. For example, the mean and the variance are quite similar for rehabilitation units but are very different for major teaching hospitals.

  1. log(\(\lambda\)) is a linear function of the explanatory variables

Theoretically, I could assess the linearity by plotting \(log(\lambda)\) against the patient days for each hospital type. This is not very feasible for my data set since there are 9 hospital types and assessing the linearity of the model overall by looking at each of the 9 cases would become unwieldy.

3. Fitting the Models + 5. Results

\(\text{\underline{a. Patient Days model}}\): First, I will fit the model that uses only patient days as a predictor.

poissondays <- glm(Infections_Reported ~ Patient_Days, family = poisson, data = mrsadata)
summary(poissondays)
## 
## Call:
## glm(formula = Infections_Reported ~ Patient_Days, family = poisson, 
##     data = mrsadata)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.079e-02  4.628e-02  -0.233    0.816    
## Patient_Days  1.359e-05  3.861e-07  35.194   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1874.5  on 451  degrees of freedom
## Residual deviance: 1099.2  on 450  degrees of freedom
## AIC: 1812.6
## 
## Number of Fisher Scoring iterations: 5

\(\text{\underline{b. Patient Days + Hospital Type model}}\): I will also fit the 2 predictor model (patient days and hospital type) using the following code.
The first part of the code relevels the hospital type variable so all hospitals are compared to major teaching rather than small community hospitals. I chose to relevel it this way since there are so many major teaching hospitals. The code is from “Specify Reference Factor Level in Linear Regression in R”.

mrsadata$Hospital_Type <- relevel(factor(mrsadata$Hospital_Type), ref = "Major Teaching")
poissondaystype <- glm(Infections_Reported ~ Patient_Days + Hospital_Type, family = poisson, 
                       data = mrsadata)
summary(poissondaystype)
## 
## Call:
## glm(formula = Infections_Reported ~ Patient_Days + Hospital_Type, 
##     family = poisson, data = mrsadata)
## 
## Coefficients:
##                                             Estimate Std. Error z value
## (Intercept)                                4.109e-01  7.926e-02   5.185
## Patient_Days                               1.125e-05  5.329e-07  21.109
## Hospital_TypeCommunity, <125 Beds         -1.004e+00  1.492e-01  -6.729
## Hospital_TypeCommunity, >250 Beds         -1.168e-01  1.008e-01  -1.159
## Hospital_TypeCommunity, 125-250 Beds      -2.072e-01  1.088e-01  -1.904
## Hospital_TypeCritical Access              -2.904e+00  5.826e-01  -4.985
## Hospital_TypeFree-Standing Rehabilitation -1.684e+01  6.067e+02  -0.028
## Hospital_TypeLong-Term Acute Care          9.455e-01  1.197e-01   7.898
## Hospital_TypePediatric                    -6.114e-01  2.481e-01  -2.465
## Hospital_TypeRehabilitation Unit          -2.615e+00  3.856e-01  -6.780
##                                           Pr(>|z|)    
## (Intercept)                               2.17e-07 ***
## Patient_Days                               < 2e-16 ***
## Hospital_TypeCommunity, <125 Beds         1.71e-11 ***
## Hospital_TypeCommunity, >250 Beds           0.2465    
## Hospital_TypeCommunity, 125-250 Beds        0.0569 .  
## Hospital_TypeCritical Access              6.19e-07 ***
## Hospital_TypeFree-Standing Rehabilitation   0.9779    
## Hospital_TypeLong-Term Acute Care         2.82e-15 ***
## Hospital_TypePediatric                      0.0137 *  
## Hospital_TypeRehabilitation Unit          1.20e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1874.51  on 451  degrees of freedom
## Residual deviance:  751.27  on 442  degrees of freedom
## AIC: 1480.6
## 
## Number of Fisher Scoring iterations: 14

\(\text{\underline{c. Patient Days*Hospital Type model}}\):
Finally, I will fit the model that includes the Patient Days, Hospital Type interaction term with the following code.

mrsadata$Hospital_Type <- relevel(factor(mrsadata$Hospital_Type), ref = "Major Teaching")
poissondaystypeint <- glm(Infections_Reported ~ Patient_Days + Hospital_Type + 
                            Patient_Days:Hospital_Type, family = poisson, data = mrsadata)
summary(poissondaystypeint)
## 
## Call:
## glm(formula = Infections_Reported ~ Patient_Days + Hospital_Type + 
##     Patient_Days:Hospital_Type, family = poisson, data = mrsadata)
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             4.705e-01  8.011e-02
## Patient_Days                                            1.076e-05  5.559e-07
## Hospital_TypeCommunity, <125 Beds                      -2.099e+00  3.032e-01
## Hospital_TypeCommunity, >250 Beds                      -4.798e-01  3.167e-01
## Hospital_TypeCommunity, 125-250 Beds                   -8.770e-01  2.642e-01
## Hospital_TypeCritical Access                           -6.130e+00  2.276e+00
## Hospital_TypeFree-Standing Rehabilitation              -1.677e+01  1.316e+03
## Hospital_TypeLong-Term Acute Care                       1.766e-01  3.566e-01
## Hospital_TypePediatric                                 -2.438e+00  9.265e-01
## Hospital_TypeRehabilitation Unit                       -3.101e+00  6.377e-01
## Patient_Days:Hospital_TypeCommunity, <125 Beds          5.976e-05  1.284e-05
## Patient_Days:Hospital_TypeCommunity, >250 Beds          4.126e-06  3.512e-06
## Patient_Days:Hospital_TypeCommunity, 125-250 Beds       1.573e-05  5.651e-06
## Patient_Days:Hospital_TypeCritical Access               6.037e-04  3.224e-04
## Patient_Days:Hospital_TypeFree-Standing Rehabilitation -1.076e-05  1.041e-01
## Patient_Days:Hospital_TypeLong-Term Acute Care          3.434e-05  1.544e-05
## Patient_Days:Hospital_TypePediatric                     2.167e-05  9.687e-06
## Patient_Days:Hospital_TypeRehabilitation Unit           5.879e-05  6.141e-05
##                                                        z value Pr(>|z|)    
## (Intercept)                                              5.873 4.28e-09 ***
## Patient_Days                                            19.353  < 2e-16 ***
## Hospital_TypeCommunity, <125 Beds                       -6.921 4.47e-12 ***
## Hospital_TypeCommunity, >250 Beds                       -1.515 0.129779    
## Hospital_TypeCommunity, 125-250 Beds                    -3.319 0.000902 ***
## Hospital_TypeCritical Access                            -2.693 0.007085 ** 
## Hospital_TypeFree-Standing Rehabilitation               -0.013 0.989833    
## Hospital_TypeLong-Term Acute Care                        0.495 0.620449    
## Hospital_TypePediatric                                  -2.631 0.008509 ** 
## Hospital_TypeRehabilitation Unit                        -4.863 1.16e-06 ***
## Patient_Days:Hospital_TypeCommunity, <125 Beds           4.655 3.25e-06 ***
## Patient_Days:Hospital_TypeCommunity, >250 Beds           1.175 0.240081    
## Patient_Days:Hospital_TypeCommunity, 125-250 Beds        2.784 0.005373 ** 
## Patient_Days:Hospital_TypeCritical Access                1.872 0.061150 .  
## Patient_Days:Hospital_TypeFree-Standing Rehabilitation   0.000 0.999918    
## Patient_Days:Hospital_TypeLong-Term Acute Care           2.224 0.026149 *  
## Patient_Days:Hospital_TypePediatric                      2.237 0.025312 *  
## Patient_Days:Hospital_TypeRehabilitation Unit            0.957 0.338391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1874.5  on 451  degrees of freedom
## Residual deviance:  703.8  on 434  degrees of freedom
## AIC: 1449.1
## 
## Number of Fisher Scoring iterations: 14

As I mentioned above in the theoretical model section, I am not going to write out the full fitted models because it would be 17 terms, which is not really useful to look at since it would be too busy!

4. Assessing Model Conditions

See under part 2c Poisson Regression Model Conditions. Assessing model conditions with residuals versus fitted plots does not work as well for complicated Poisson Regression Models with categorical variables like mine (with 9 categories) as it does for linear regression models.

5. Results + Conclusion, continued: \(\text{\underline{a. Which Poisson regression model is best?}}\):

To decide which of my three models is the best for predicting MRSA infections, I will use a likelihood ratio test. A likelihood ratio test is useful for my Poisson regression models because I cannot use \(R^2\) values or F tests to compare the models like I can for linear regression models (because neither are included in the model summary). It works like a nested f test, where I have a full and a nested model and test the hypothesis that the more complex model is needed (Marin).

\(H_0\): There is no difference between the residual deviance of the simple model and the complex model
\(Res.Dev.nested-Res.Dev.Full=0\)

\(H_a\): The complex model has a lower residual deviance than the simple model (meaning more of the variance is explained by the model).
\(Res.Dev.nested-Res.Dev.Full>0\)

Full Model: interaction model: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays + \beta_2HospitalType + \beta_3PatientDays*HospitalType\)
Nested Model: only patient days: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays\)

I got the code for the lmtest from Stack Exchange (“Likelihood Ratio vs Wald test”).

library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
lrtest(glm(mrsadata$Infections_Reported ~ mrsadata$Patient_Days), glm(mrsadata$Infections_Reported ~ 
mrsadata$Patient_Days + mrsadata$Hospital_Type + mrsadata$Patient_Days:mrsadata$Hospital_Type, 
family="poisson"))

The p value for the LRT is nearly 0, so I reject the null hypothesis. The interaction model explains more variance in MRSA infections than the patient days model, so at least one of the extra terms is an important predictor. To ensure we actually need the interaction term, I will also compare the interaction model to the two predictor model.
\(H_0\): \(Res.Dev.nested-Res.Dev.Full=0\)

\(H_a\): \(Res.Dev.nested-Res.Dev.Full>0\)

Full Model: interaction model: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays + \beta_2HospitalType + \beta_3PatientDays*HospitalType\)
Nested Model: two predictor model: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays + \beta_2HospitalType\)

lrtest(glm(mrsadata$Infections_Reported ~ mrsadata$Patient_Days + mrsadata$Hospital_Type), 
       glm(mrsadata$Infections_Reported ~ 
mrsadata$Patient_Days + mrsadata$Hospital_Type + mrsadata$Patient_Days:mrsadata$Hospital_Type, 
family="poisson"))

The p value for this second LRT is also nearly 0, so I reject the null hypothesis. The interaction term is important to include in my Poisson regression model predicting MRSA infections.

\(\text{\underline{b. Rate Ratios}}\):

Proof (From Roback and Legler)

\(log(\lambda_x) = \beta_0 + \beta_1X\)
\(log(\lambda_{x+1}) = \beta_0 + \beta_1(X+1)\)
\(log(\lambda_{x+1}) - (log\lambda_x) = \beta_1\)

\(\lambda_{x+1}/\lambda_x = e^{\beta_1}\)

I can apply this proof to my interaction model to be able to compare hospital types! My model is \(log(\lambda) = \beta_0 + \beta_1X_1 + \beta_2X_2 ... +\beta_{17}X_{17}\) since there are 16 indicator variables, the patient days term, and the intercept. This means that the ratio of a different hospital type to the base, major teaching, with the same number of patient days can be represented by \(e^{\beta_{type}}*e^{\beta_{interaction}}\) where the \(\beta_{type}\) and the \(\beta_{interaction}\) correspond to the coefficients for the hospital I want to know about since these are the only indicator variables that are different from the base of major teaching! Here is an example: For small community hospitals, the coefficients I found in the model summary are -2.099 and 5.976 e -5. The calculation in row 1 of the table below shows that the average small community hospital has approximately 0.12 of the reported MRSA infections that major teaching hospitals do. I included the calculations for all 8 hospital types (they are all ratios with major teaching hospitals).

Looking at the table above, I noted that all of the hospital types had lower average cases than major teaching hospitals, except for long-term acute care. This makes sense when looking back at the box plot from section 2. As you can see below (if we assume that the mean for major teaching is actually higher on the graph due to the outliers), all of the other hospital types except for long-term acute care have lower reported case counts.

qplot(Hospital_Type, Infections_Reported, data = mrsadata,
geom= "boxplot", fill = Hospital_Type, main="Reported MRSA Infections by Hospital Type") + 
  theme(axis.text.x=element_blank())

If we hold the hospital type steady but increase patient days by 1, we would see a 1.000x increase in infections (all the indicator variables cancel out). I wanted to interpret the coefficient for patient days, but I don’t think this would be very practical. The patient days would increase by more than 1 in the real world, so we could do that calculation if necessary…

6. Discussion and Critique

  1. Looking at the interaction model summary, only small and mid-sized community hospitals and pediatric hospitals have both statistically significant hospital type and interaction term coefficients. I conclude that these three hospital types can predict a statistically significantly lower number of MRSA infections compared to major teaching hospitals. This conclusion is different from my rate ratios because there is a lot of variability in the data (especially for major teaching hospitals). In my calculation of the rate ratios, I did not take the SE into account and create confidence intervals. This means that there is probably an overlap in CIs for teaching hospitals and the other types because they have several outliers or there are not very many observations. For example, there are only 12 free-standing rehabilitation hospitals and no recorded infections, so the SE calculation is probably not very accurate and the CI is much larger, leading to an overlap with the \(95\%\) CI for major teaching hospitals.

  2. The interaction model is the best choice for predicting MRSA infections in California hospitals from Patient Days and Hospital Type. However, we cannot utilize this model to predict infections for a given hospital since some of the model conditions are not met (observations are not necessarily independent, variance is sometimes greater than the mean, potentially too many zeroes in the response distribution).

  3. To improve the model, I could do the following:

    1. Extend the model to include other predictors and potentially higher-order predictors (this will make it much harder to compare hospitals, though).
    2. I could relevel the hospital type category to compare hospitals to a different base hospital type with less variability than major teaching hospitals.
    3. Explore other models
      Other possible models: Address the excess “zeroes”: zero-inflated model
      -Includes two equations (count data and excess zeroes)
      Address the Variance being greater than the mean: negative binomial regression
      -Same structure as Poisson with an “extra parameter to model the over-dispersion” (UCLA)
  4. Next Steps
    Once the model is “optimized,” (ie. model conditions are met and the model fits the data well), it could be utilized to determine which hospitals are predicted to have the most MRSA infections and better allocate public health resources to those hospitals that need the most assistance with MRSA prevention efforts.

References