1. Introduction
Healthcare-associated Methicillin-resistant Staphylococcus aureus
(MRSA) infections occur due to the spreading of staph bacteria during
invasive procedures and medical device implants, from dirty hands, and
between patients in long-term care wards/hospitals (“General
Information” and “MRSA infection”). According to the CDC, about \(5\%\) of hospital patients carry MRSA
bacteria on their skin, meaning that many patients can put themselves
and others at risk of infection (“General Information”). Additionally,
untreated MRSA infections can progress to pneumonia and sepsis (“General
Information”). For the two aforementioned reasons, MRSA infections are
closely tracked by the CDC so they can develop methods to reduce MRSA
and other healthcare-associated infections (“What CDC is Doing”).
I will be focusing specifically on MRSA data from the state of
California in 2022, where hospitals report yearly MRSA infections to the
National Healthcare Safety Network (NHSN) of the CDC (California
Department of Public Health). The California Department of Public Health
then downloads and analyzes this data for MRSA prevention purposes
(California Department of Public Health).
I want to know what factors can predict higher reported MRSA cases in
California hospitals in 2022. For example, if major teaching hospitals
predicted higher case counts, prevention efforts might focus on proper
hygiene for healthcare workers (especially for undergraduate and medical
students). Alternatively, is the number of patients days a better
predictor of MRSA infections? This information would be useful in
determining next steps for MRSA prevention efforts.
Research Question: Can we use patient days and hospital type to
predict the number of MRSA infections per year in California
hospitals?
Data wrangling
After downloading the data from California hospitals the California
Department of Public Health published on data.gov, I culled through the
data to ensure there wouldn’t be major problems when opening the data
and analyzing it in R. I deleted 4 rows at the very top of the original
data set because they pooled the data from all of the other observations
based on the hospital’s risk-adjustment category (these are unnecessary
for my analysis). 4 other observations were omitted because they had
blanks under reported infections. There were no blanks under either
explanatory variable that I want to explore (Patient Days and Hospital
Category).
mrsadata <- read.csv("MRSAdata.csv")
infectionaverages <- read.csv("infectionaverages.csv")
2. Methodology
a. Important Summary Statistics
Reported infections (from excel)
Mean: 2.11
SD: 3.45
Patient Days (from excel)
Mean: 37,446
SD: 40,551
It is concerning to me that the standard deviations of reported infections and patient days are higher than the means. There is a lot of variability in this data set, stemming from the different numbers of each hospital type and the huge variation in number of infections for each hospital type, with most hospitals hardly having any infections but some having more than 15 (see major teaching in the box plot). Clearly, averaging across all 452 hospitals (and all hospital types) is not very useful!
typetable <- mrsadata %>% count(Hospital_Type)
typetable
barplot(typetable$n, main="Hospitals of each type Reporting MRSA BSIs",
legend.text = typetable$Hospital_Type,
col=c("darkgreen","darkblue","turquoise","green", "red", "purple", "pink", "grey", "brown"),
args.legend = list(x = "topleft", inset = c(0.24, -0.12)), cex.names="0.5",las="2")
For the following plot, I referenced Zach’s code on Statology to remove the hospital labels from the x axis because they were all overlapping with one another. Instead, I used the key to determine which box corresponds to each hospital type.
library(ggplot2)
qplot(Hospital_Type, Infections_Reported, data = mrsadata,
geom= "boxplot", fill = Hospital_Type, main="Reported MRSA Infections by Hospital Type") +
theme(axis.text.x=element_blank())
Looking at the infections by hospital type, long-term acute care hospitals seem to have the highest median infections, though major teaching would definitely have the highest average infections since there are some outliers with more than 15 infections each. Large community hospitals also have comparable case counts but the other 6 hospital types have comparatively low numbers of reported infections.
b. Linear Regression is not the right choice
The theoretical linear model is \(Infections = \beta_0 + \beta_1PatientDays + \epsilon\), where \(\epsilon \sim NI(0,\sigma)\)
mrsalinearmodel<- lm(Infections_Reported~Patient_Days, data=mrsadata)
summary(mrsalinearmodel)
##
## Call:
## lm(formula = Infections_Reported ~ Patient_Days, data = mrsadata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0182 -1.0276 -0.1302 0.4703 17.4615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.103e-01 1.516e-01 -1.387 0.166
## Patient_Days 6.192e-05 2.749e-06 22.524 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.367 on 450 degrees of freedom
## Multiple R-squared: 0.5299, Adjusted R-squared: 0.5289
## F-statistic: 507.3 on 1 and 450 DF, p-value: < 2.2e-16
The fitted model is \(\hat{Infections} =
-0.2103 + 0.00006192PatientDays\)
Looking at the linear model, the relationship between patient days and
reported MRSA infections is statistically significant; the adjusted
\(R^2\) value suggests that this model
explains a good amount of the variability between hospitals. However,
since reported infections are count data, the coefficients do
not make a lot of sense. If patient days increase by 1, reported
infections cannot increase by a fractional amount. Also, if there are no
patient days, we would expect the hospital to not be running or there to
be zero reported cases (instead of a negative number of cases)!
The following code will plot the predicted values (using this SLR model) along with the actual values for reported infections.
plot(x=mrsadata$Patient_Days, y=mrsalinearmodel$fitted.values, xlab='Patient Days',
ylab='Infections Reported', ylim=c(-0.1, 32))
points(x=mrsadata$Patient_Days, y=mrsadata$Infections_Reported, pch=2, col=3)
The linear model appears to fit the actual reported values pretty well, but I also need to check model conditions.
plot(mrsalinearmodel, which=1)
plot(mrsalinearmodel, which=2)
histogram(~mrsalinearmodel$residuals,
xlab = "Residuals", data=mrsadata)
Looking at the residuals vs fitted plot, there is a pattern to the residuals and the variance is not consistent across the fitted values (it appears to increase a bit). I also have concerns about the normality of the residuals because the standardized residuals deviate a lot from the QQ line. Finally, the histogram provides further evidence that the residuals are not normally distributed (though they are centered at 0), since the density for a residual of 0 is very high.
I will also model the data using multiple linear regression with both
Patient Days and Hospital Type as explanatory variables to see if it is
any different from the simple linear regression model.
The theoretical model is \(Infections =
\beta_0 + \beta_1PatientDays + \beta_2HospitalType + \epsilon\),
where \(\epsilon \sim
NI(0,\sigma)\)
Note that I simplified the 8 indicator variables to just \(\beta_2HospitalType\) because the
theoretical model would look unwieldy. The summary below shows that
there are 8 indicator variables and the reference is major
teaching.
mrsadaystypelinearmodel<- lm(Infections_Reported~Patient_Days + Hospital_Type, data=mrsadata)
summary(mrsadaystypelinearmodel)
##
## Call:
## lm(formula = Infections_Reported ~ Patient_Days + Hospital_Type,
## data = mrsadata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5572 -0.9209 -0.0616 0.6675 15.9849
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -3.468e-01 2.324e-01 -1.492
## Patient_Days 7.242e-05 3.650e-06 19.841
## Hospital_TypeCommunity, >250 Beds -1.827e+00 4.674e-01 -3.909
## Hospital_TypeCommunity, 125-250 Beds -3.677e-01 3.545e-01 -1.037
## Hospital_TypeCritical Access 1.998e-01 4.323e-01 0.462
## Hospital_TypeFree-Standing Rehabilitation -4.654e-01 6.649e-01 -0.700
## Hospital_TypeLong-Term Acute Care 3.776e+00 5.148e-01 7.336
## Hospital_TypeMajor Teaching -7.181e-01 3.787e-01 -1.896
## Hospital_TypePediatric -2.077e+00 7.382e-01 -2.814
## Hospital_TypeRehabilitation Unit 1.058e-02 3.628e-01 0.029
## Pr(>|t|)
## (Intercept) 0.136291
## Patient_Days < 2e-16 ***
## Hospital_TypeCommunity, >250 Beds 0.000107 ***
## Hospital_TypeCommunity, 125-250 Beds 0.300248
## Hospital_TypeCritical Access 0.644150
## Hospital_TypeFree-Standing Rehabilitation 0.484337
## Hospital_TypeLong-Term Acute Care 1.06e-12 ***
## Hospital_TypeMajor Teaching 0.058599 .
## Hospital_TypePediatric 0.005110 **
## Hospital_TypeRehabilitation Unit 0.976761
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.165 on 442 degrees of freedom
## Multiple R-squared: 0.614, Adjusted R-squared: 0.6061
## F-statistic: 78.12 on 9 and 442 DF, p-value: < 2.2e-16
plot(mrsadaystypelinearmodel, which=1)
plot(mrsadaystypelinearmodel, which=2)
Looking at the residual plots for the linear model with two predictors, I still have concerns about several model conditions. The residuals deviate from the QQ line, suggesting that they are not normally distributed. The residuals vs. fitted plot has a pattern and the variance appears to increase as the fitted values increase (which is actually a condition of Poisson regression).
c. Poisson Regression
Since a linear regression model does not work well for this data, I will explore another model. A Poisson regression model is often a good choice for count data. The histogram below shows how reported MRSA infections appear to be described well by a Poisson distribution as it is slightly skewed right (Roback and Legler).
The average reported cases can be denoted by \(\lambda\), and we want to know if the
variability between \(\lambda_i\),
where i is an individual hospital, can be explained by hospital type
and/or patient days.
Poisson distributions utilize a log function to link \(\lambda\) and the predictor variables.
Theoretical model (Roback and Legler):
\(log(\lambda_i) = \beta_0 +
\beta_1x_i\) so…
\(log(\lambda_i) = \beta_0 +
\beta_1PatientDays\) and \(log(\lambda_i) = \beta_0 + \beta_1PatientDays +
\beta_2HospitalType\)
The theoretical model with interaction is \(log(\lambda_i) = \beta_0 + \beta_1PatientDays +
\beta_2HospitalType + \beta_3PatientDays*HospitalType\)
There is no \(\epsilon\) term for
Poisson regression. We assume that \(\lambda =
variance\), which means that \(\lambda\) includes both necessary
parameters.
NOTE: I again simplified the 2 predictor model to
just \(\beta_2HospitalType\) though it
would really include 8 indicator variables, with one for each hospital
type except for major teaching since that is the base level.
The interaction model would actually have 16 indicator variables, 8 for
each hospital type for both the hospital type and the interaction
term.
(Roback and Legler)
I chose not to develop a model with square terms because it was going to be a lot more difficult to directly compare MRSA infections between hospital types.
Model Conditions (from Roback and Legler):
1. “Poisson Response: The response variable is a count per unit of time
or space, described by a Poisson distribution”
histogram(~mrsadata$Infections_Reported,
xlab = "Reported Infections", data=mrsadata)
“Independence: The observations must be independent of one another” Each observation comes from a single hospital. However, there may be some concerns about independence. For example, if there is an outbreak of MRSA in a given sports league, it would spread to several hospitals in that county, regardless of type (community, teaching, pediatric, etc.).
“Mean of a Poisson random variable must be equal to its variance”
I can check this by comparing the mean infections reported to the
variance for each group of hospital type.
I got the following code from “Poisson Regression - R Data Analysis
Examples,” but changed sd to the variance so it would be easier to
directly interpret.
with(mrsadata, tapply(Infections_Reported, Hospital_Type, function(x) {
sprintf("Mean (variance) = %1.2f (%1.2f)", mean(x), sd(x)^2)
}))
## Community, <125 Beds Community, >250 Beds
## "Mean (variance) = 0.65 (1.07)" "Mean (variance) = 3.25 (6.35)"
## Community, 125-250 Beds Critical Access
## "Mean (variance) = 1.86 (4.33)" "Mean (variance) = 0.09 (0.08)"
## Free-Standing Rehabilitation Long-Term Acute Care
## "Mean (variance) = 0.00 (0.00)" "Mean (variance) = 4.86 (10.22)"
## Major Teaching Pediatric
## "Mean (variance) = 4.40 (26.56)" "Mean (variance) = 1.70 (5.57)"
## Rehabilitation Unit
## "Mean (variance) = 0.12 (0.14)"
I have some concerns about the mean being equal to the variance. Some hospital types are more concerning than others. For example, the mean and the variance are quite similar for rehabilitation units but are very different for major teaching hospitals.
Theoretically, I could assess the linearity by plotting \(log(\lambda)\) against the patient days for each hospital type. This is not very feasible for my data set since there are 9 hospital types and assessing the linearity of the model overall by looking at each of the 9 cases would become unwieldy.
3. Fitting the Models + 5. Results
a. Patient Days Model
First, I will fit the model that uses only patient days as a predictor.
poissondays <- glm(Infections_Reported ~ Patient_Days, family = poisson, data = mrsadata)
summary(poissondays)
##
## Call:
## glm(formula = Infections_Reported ~ Patient_Days, family = poisson,
## data = mrsadata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.079e-02 4.628e-02 -0.233 0.816
## Patient_Days 1.359e-05 3.861e-07 35.194 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1874.5 on 451 degrees of freedom
## Residual deviance: 1099.2 on 450 degrees of freedom
## AIC: 1812.6
##
## Number of Fisher Scoring iterations: 5
b. Patient Days + Hospital Type model
I will also fit the 2 predictor model (patient days and hospital
type) using the following code.
The first part of the code relevels the hospital type variable so all
hospitals are compared to major teaching rather than small community
hospitals. I chose to relevel it this way since there are so many major
teaching hospitals. The code is from “Specify Reference Factor Level in
Linear Regression in R”.
mrsadata$Hospital_Type <- relevel(factor(mrsadata$Hospital_Type), ref = "Major Teaching")
poissondaystype <- glm(Infections_Reported ~ Patient_Days + Hospital_Type, family = poisson,
data = mrsadata)
summary(poissondaystype)
##
## Call:
## glm(formula = Infections_Reported ~ Patient_Days + Hospital_Type,
## family = poisson, data = mrsadata)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 4.109e-01 7.926e-02 5.185
## Patient_Days 1.125e-05 5.329e-07 21.109
## Hospital_TypeCommunity, <125 Beds -1.004e+00 1.492e-01 -6.729
## Hospital_TypeCommunity, >250 Beds -1.168e-01 1.008e-01 -1.159
## Hospital_TypeCommunity, 125-250 Beds -2.072e-01 1.088e-01 -1.904
## Hospital_TypeCritical Access -2.904e+00 5.826e-01 -4.985
## Hospital_TypeFree-Standing Rehabilitation -1.684e+01 6.067e+02 -0.028
## Hospital_TypeLong-Term Acute Care 9.455e-01 1.197e-01 7.898
## Hospital_TypePediatric -6.114e-01 2.481e-01 -2.465
## Hospital_TypeRehabilitation Unit -2.615e+00 3.856e-01 -6.780
## Pr(>|z|)
## (Intercept) 2.17e-07 ***
## Patient_Days < 2e-16 ***
## Hospital_TypeCommunity, <125 Beds 1.71e-11 ***
## Hospital_TypeCommunity, >250 Beds 0.2465
## Hospital_TypeCommunity, 125-250 Beds 0.0569 .
## Hospital_TypeCritical Access 6.19e-07 ***
## Hospital_TypeFree-Standing Rehabilitation 0.9779
## Hospital_TypeLong-Term Acute Care 2.82e-15 ***
## Hospital_TypePediatric 0.0137 *
## Hospital_TypeRehabilitation Unit 1.20e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1874.51 on 451 degrees of freedom
## Residual deviance: 751.27 on 442 degrees of freedom
## AIC: 1480.6
##
## Number of Fisher Scoring iterations: 14
c. Patient Days*Hospital Type
model
Finally, I will fit the model that includes the Patient Days, Hospital
Type interaction term with the following code.
mrsadata$Hospital_Type <- relevel(factor(mrsadata$Hospital_Type), ref = "Major Teaching")
poissondaystypeint <- glm(Infections_Reported ~ Patient_Days + Hospital_Type +
Patient_Days:Hospital_Type, family = poisson, data = mrsadata)
summary(poissondaystypeint)
##
## Call:
## glm(formula = Infections_Reported ~ Patient_Days + Hospital_Type +
## Patient_Days:Hospital_Type, family = poisson, data = mrsadata)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 4.705e-01 8.011e-02
## Patient_Days 1.076e-05 5.559e-07
## Hospital_TypeCommunity, <125 Beds -2.099e+00 3.032e-01
## Hospital_TypeCommunity, >250 Beds -4.798e-01 3.167e-01
## Hospital_TypeCommunity, 125-250 Beds -8.770e-01 2.642e-01
## Hospital_TypeCritical Access -6.130e+00 2.276e+00
## Hospital_TypeFree-Standing Rehabilitation -1.677e+01 1.316e+03
## Hospital_TypeLong-Term Acute Care 1.766e-01 3.566e-01
## Hospital_TypePediatric -2.438e+00 9.265e-01
## Hospital_TypeRehabilitation Unit -3.101e+00 6.377e-01
## Patient_Days:Hospital_TypeCommunity, <125 Beds 5.976e-05 1.284e-05
## Patient_Days:Hospital_TypeCommunity, >250 Beds 4.126e-06 3.512e-06
## Patient_Days:Hospital_TypeCommunity, 125-250 Beds 1.573e-05 5.651e-06
## Patient_Days:Hospital_TypeCritical Access 6.037e-04 3.224e-04
## Patient_Days:Hospital_TypeFree-Standing Rehabilitation -1.076e-05 1.041e-01
## Patient_Days:Hospital_TypeLong-Term Acute Care 3.434e-05 1.544e-05
## Patient_Days:Hospital_TypePediatric 2.167e-05 9.687e-06
## Patient_Days:Hospital_TypeRehabilitation Unit 5.879e-05 6.141e-05
## z value Pr(>|z|)
## (Intercept) 5.873 4.28e-09 ***
## Patient_Days 19.353 < 2e-16 ***
## Hospital_TypeCommunity, <125 Beds -6.921 4.47e-12 ***
## Hospital_TypeCommunity, >250 Beds -1.515 0.129779
## Hospital_TypeCommunity, 125-250 Beds -3.319 0.000902 ***
## Hospital_TypeCritical Access -2.693 0.007085 **
## Hospital_TypeFree-Standing Rehabilitation -0.013 0.989833
## Hospital_TypeLong-Term Acute Care 0.495 0.620449
## Hospital_TypePediatric -2.631 0.008509 **
## Hospital_TypeRehabilitation Unit -4.863 1.16e-06 ***
## Patient_Days:Hospital_TypeCommunity, <125 Beds 4.655 3.25e-06 ***
## Patient_Days:Hospital_TypeCommunity, >250 Beds 1.175 0.240081
## Patient_Days:Hospital_TypeCommunity, 125-250 Beds 2.784 0.005373 **
## Patient_Days:Hospital_TypeCritical Access 1.872 0.061150 .
## Patient_Days:Hospital_TypeFree-Standing Rehabilitation 0.000 0.999918
## Patient_Days:Hospital_TypeLong-Term Acute Care 2.224 0.026149 *
## Patient_Days:Hospital_TypePediatric 2.237 0.025312 *
## Patient_Days:Hospital_TypeRehabilitation Unit 0.957 0.338391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1874.5 on 451 degrees of freedom
## Residual deviance: 703.8 on 434 degrees of freedom
## AIC: 1449.1
##
## Number of Fisher Scoring iterations: 14
As I mentioned above in the theoretical model section, I am not going to write out the full fitted models because it would be 17 terms, which is not really useful to look at since it would be too busy!
4. Assessing Model Conditions
See under part 2c Poisson Regression Model Conditions. Assessing model conditions with residuals versus fitted plots does not work as well for complicated Poisson Regression Models with categorical variables like mine (with 9 categories) as it does for linear regression models.
5. Results + Conclusion, continued:
a. Which Poisson regression model is best?
To decide which of my three models is the best for predicting MRSA infections, I will use a likelihood ratio test. A likelihood ratio test is useful for my Poisson regression models because I cannot use \(R^2\) values or F tests to compare the models like I can for linear regression models (because neither are included in the model summary). It works like a nested f test, where I have a full and a nested model and test the hypothesis that the more complex model is needed (Marin).
\(H_0\): There is no difference
between the residual deviance of the simple model and the complex
model
\(Res.Dev.nested-Res.Dev.Full=0\)
\(H_a\): The complex model has a
lower residual deviance than the simple model (meaning more of the
variance is explained by the model).
\(Res.Dev.nested-Res.Dev.Full>0\)
Full Model: interaction model: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays +
\beta_2HospitalType + \beta_3PatientDays*HospitalType\)
Nested Model: only patient days: \(log(\lambda_i) = \beta_0 +
\beta_1PatientDays\)
I got the code for the lmtest from Stack Exchange (“Likelihood Ratio vs Wald test”).
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
lrtest(glm(mrsadata$Infections_Reported ~ mrsadata$Patient_Days), glm(mrsadata$Infections_Reported ~
mrsadata$Patient_Days + mrsadata$Hospital_Type + mrsadata$Patient_Days:mrsadata$Hospital_Type,
family="poisson"))
The p value for the LRT is nearly 0, so I reject the null hypothesis.
The interaction model explains more variance in MRSA infections than the
patient days model, so at least one of the extra terms is an important
predictor. To ensure we actually need the interaction term, I will also
compare the interaction model to the two predictor model.
\(H_0\): \(Res.Dev.nested-Res.Dev.Full=0\)
\(H_a\): \(Res.Dev.nested-Res.Dev.Full>0\)
Full Model: interaction model: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays +
\beta_2HospitalType + \beta_3PatientDays*HospitalType\)
Nested Model: two predictor model: \(log(\lambda_i) = \beta_0 + \beta_1PatientDays +
\beta_2HospitalType\)
lrtest(glm(mrsadata$Infections_Reported ~ mrsadata$Patient_Days + mrsadata$Hospital_Type),
glm(mrsadata$Infections_Reported ~
mrsadata$Patient_Days + mrsadata$Hospital_Type + mrsadata$Patient_Days:mrsadata$Hospital_Type,
family="poisson"))
The p value for this second LRT is also nearly 0, so I reject the null hypothesis. The interaction term is important to include in my Poisson regression model predicting MRSA infections.
b. Rate Ratios
Proof (From Roback and Legler)
\(log(\lambda_x) = \beta_0 +
\beta_1X\)
\(log(\lambda_{x+1}) = \beta_0 +
\beta_1(X+1)\)
\(log(\lambda_{x+1}) - (log\lambda_x) =
\beta_1\)
\(\lambda_{x+1}/\lambda_x = e^{\beta_1}\)
I can apply this proof to my interaction model to be able to compare hospital types! My model is \(log(\lambda) = \beta_0 + \beta_1X_1 + \beta_2X_2 ... +\beta_{17}X_{17}\) since there are 16 indicator variables, the patient days term, and the intercept. This means that the ratio of a different hospital type to the base, major teaching, with the same number of patient days can be represented by \(e^{\beta_{type}}*e^{\beta_{interaction}}\) where the \(\beta_{type}\) and the \(\beta_{interaction}\) correspond to the coefficients for the hospital I want to know about since these are the only indicator variables that are different from the base of major teaching! Here is an example: For small community hospitals, the coefficients I found in the model summary are -2.099 and 5.976 e -5. The calculation in row 1 of the table below shows that the average small community hospital has approximately 0.12 of the reported MRSA infections that major teaching hospitals do. I included the calculations for all 8 hospital types (they are all ratios with major teaching hospitals).
Looking at the table above, I noted that all of the hospital types had lower average cases than major teaching hospitals, except for long-term acute care. This makes sense when looking back at the box plot from section 2. As you can see below (if we assume that the mean for major teaching is actually higher on the graph due to the outliers), all of the other hospital types except for long-term acute care have lower reported case counts.
qplot(Hospital_Type, Infections_Reported, data = mrsadata,
geom= "boxplot", fill = Hospital_Type, main="Reported MRSA Infections by Hospital Type") +
theme(axis.text.x=element_blank())
If we hold the hospital type steady but increase patient days by 1, we would see a 1.000x increase in infections (all the indicator variables cancel out). I wanted to interpret the coefficient for patient days, but I don’t think this would be very practical. The patient days would increase by more than 1 in the real world, so we could do that calculation if necessary…
6. Discussion and Critique
Looking at the interaction model summary, only small and mid-sized community hospitals and pediatric hospitals have both statistically significant hospital type and interaction term coefficients. I conclude that these three hospital types can predict a statistically significantly lower number of MRSA infections compared to major teaching hospitals. This conclusion is different from my rate ratios because there is a lot of variability in the data (especially for major teaching hospitals). In my calculation of the rate ratios, I did not take the SE into account and create confidence intervals. This means that there is probably an overlap in CIs for teaching hospitals and the other types because they have several outliers or there are not very many observations. For example, there are only 12 free-standing rehabilitation hospitals and no recorded infections, so the SE calculation is probably not very accurate and the CI is much larger, leading to an overlap with the \(95\%\) CI for major teaching hospitals.
The interaction model is the best choice for predicting MRSA infections in California hospitals from Patient Days and Hospital Type. However, we cannot utilize this model to predict infections for a given hospital since some of the model conditions are not met (observations are not necessarily independent, variance is sometimes greater than the mean, potentially too many zeroes in the response distribution).
To improve the model, I could do the following:
Next Steps
Once the model is “optimized,” (ie. model conditions are met and the
model fits the data well), it could be utilized to determine which
hospitals are predicted to have the most MRSA infections and better
allocate public health resources to those hospitals that need the most
assistance with MRSA prevention efforts.
References