Analysis of Nosocomial Infection Control Efficacy in US Hospitals
Statistical Analysis Project
Introduction
In the 1970s the Centers for Disease Control conducted a nationwide study of hospitals to assess the status of hospital-based infections. The purpose of the study was to evaluate what hospitals were doing to prevent the spread of infection and whether those steps were helpful [1].
The project grew out of a concern about whether the steps taken by individual infection control specialists in hospitals were actually reducing infections. This is often an issue in medical research because you cannot "control" for this type of experiment. It would be highly unethical and illegal to knowingly infect patients to find out whether your steps were effective, and randomly assigning soap to some surgeries and a placebo to others is out of the question [2].
SENIC stands for Study on the Efficacy of Nosocomial Infection Control. It consists of a random sample of 113 hospitals from the original 338 hospitals surveyed. There are 113 rows and 12 columns, 1 for identification and 11 for important variables.
Packages used
knitr, car, xtable, corrplot, emmeans, multcomp, ggpubr
Goal
The project aims to investigate the effectiveness of Nosocomial Infection Control programs in reducing hospital-acquired infections in the United States.
Exploratory Data Analysis
Description
Below is a table listing each column's number, name, and description:
library(xtable)
## Load the SENIC data; check your working directory so that this file can be found
senic <- read.csv("data/senic.csv")
## Load the description dataset (paths are case-sensitive on some systems)
senic_description <- read.csv("Data/senic_description.csv")
## Render the description data.frame as an HTML table
senic_description.table <- xtable(senic_description)
print(senic_description.table, type="html")
Variable_Number | Variable_Name | Description |
---|---|---|
1 | Identification Number (Hospital) | 1-113 |
2 | Length of stay (stay) | Average length of stay of all patients in hospital (in days) |
3 | Age (age) | Average age of patients (in years) |
4 | Infection risk (infprob) | Average estimated probability of acquiring infection in hospital (in percent) |
5 | Routine culturing ratio (culratio) | Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100 |
6 | Routine chest x-ray ratio (xratio) | Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100 |
7 | Number of beds (nbeds) | Average number of beds in hospital during study period |
8 | Medical school affiliation (medschl) | 1=Yes, 2=No |
9 | Region (region) | Geographic regions, where 1 = Northeast, 2 = North Central, 3 = South, 4 = West |
10 | Average daily census (census) | Average number of patients in hospital per day during study period |
11 | Number of nurses (nurses) | Average number of full-time equivalent registered and licensed practical nurses during study period (number full time plus one half the number of part time) |
12 | Available facilities and services (service) | Percent of 35 potential facilities and services that are provided by the hospital |
SENIC Data Visualization
Here are the first six observations of the senic data:
Hospital | stay | age | infprob | culratio | xratio | nbeds | medschl | region | census | nurses | service |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7.13 | 55.70 | 4.10 | 9.00 | 39.60 | 279 | 2 | 4 | 207 | 241 | 60.00 |
2 | 8.82 | 58.20 | 1.60 | 3.80 | 51.70 | 80 | 2 | 2 | 51 | 52 | 40.00 |
3 | 8.34 | 56.90 | 2.70 | 8.10 | 74.00 | 107 | 2 | 3 | 82 | 54 | 20.00 |
4 | 8.95 | 53.70 | 5.60 | 18.90 | 122.80 | 147 | 2 | 4 | 53 | 148 | 40.00 |
5 | 11.20 | 56.50 | 5.70 | 34.50 | 88.90 | 180 | 2 | 1 | 134 | 151 | 40.00 |
6 | 9.76 | 50.90 | 5.10 | 21.90 | 97.00 | 150 | 2 | 2 | 147 | 106 | 40.00 |
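I have also provided a correlation matrix of the numeric variables to show which are most closely associated:
## Drop the non-numeric-scale columns: 1 (Hospital ID), 8 (medschl), and 9 (region)
senic_corr <- senic[, -c(1, 8, 9)]
corr.table <- xtable(cor(senic_corr))
print(corr.table, type="html")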
Variable | stay | age | infprob | culratio | xratio | nbeds | census | nurses | service |
---|---|---|---|---|---|---|---|---|---|
stay | 1.00 | 0.19 | 0.53 | 0.33 | 0.38 | 0.41 | 0.47 | 0.34 | 0.36 |
age | 0.19 | 1.00 | 0.00 | -0.23 | -0.02 | -0.06 | -0.05 | -0.08 | -0.04 |
infprob | 0.53 | 0.00 | 1.00 | 0.56 | 0.45 | 0.36 | 0.38 | 0.39 | 0.41 |
culratio | 0.33 | -0.23 | 0.56 | 1.00 | 0.42 | 0.14 | 0.14 | 0.20 | 0.19 |
xratio | 0.38 | -0.02 | 0.45 | 0.42 | 1.00 | 0.05 | 0.06 | 0.08 | 0.11 |
nbeds | 0.41 | -0.06 | 0.36 | 0.14 | 0.05 | 1.00 | 0.98 | 0.92 | 0.79 |
census | 0.47 | -0.05 | 0.38 | 0.14 | 0.06 | 0.98 | 1.00 | 0.91 | 0.78 |
nurses | 0.34 | -0.08 | 0.39 | 0.20 | 0.08 | 0.92 | 0.91 | 1.00 | 0.78 |
service | 0.36 | -0.04 | 0.41 | 0.19 | 0.11 | 0.79 | 0.78 | 0.78 | 1.00 |
QQ Plot of infection probability
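Here is the same correlation matrix drawn with the corrplot package:
The matrix shows several highly correlated variables: nbeds, census, and nurses are correlated with one another at 0.91 or above, and service correlates at 0.78 to 0.79 with each of them. Infection probability is most strongly associated with culratio (0.56) and stay (0.53). With this picture of the data, we can state a hypothesis and test it.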
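As a rough illustration of how highly correlated pairs can be pulled out of a correlation matrix programmatically, the sketch below re-enters the hospital-size rows of the table above by hand. The 0.9 cutoff and the names `size_corr`, `cutoff`, and `high_pairs` are my own choices for illustration, not part of the original analysis.

```r
# Correlations copied by hand from the table above (hospital-size variables only)
vars <- c("nbeds", "census", "nurses", "service")
size_corr <- matrix(c(1.00, 0.98, 0.92, 0.79,
                      0.98, 1.00, 0.91, 0.78,
                      0.92, 0.91, 1.00, 0.78,
                      0.79, 0.78, 0.78, 1.00),
                    nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.9  # assumed threshold for "highly correlated"

# Look only above the diagonal so each pair is reported once
idx <- which(abs(size_corr) > cutoff & upper.tri(size_corr), arr.ind = TRUE)
high_pairs <- data.frame(var1 = vars[idx[, "row"]],
                         var2 = vars[idx[, "col"]],
                         r    = size_corr[idx])
print(high_pairs)
```

With the full data loaded, the same idea applies to `cor(senic_corr)` directly; pairs this strongly correlated would be a concern for multicollinearity if they were used together as predictors.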
ANOVA and Post-hoc Analysis
Objective
To test whether the mean infection risk is the same in four geographic regions.
Action
- Perform one-way ANOVA to compare mean infection risk across regions.
- Conduct Tukey’s procedure to obtain confidence intervals for pairwise comparisons.
- Perform a different pairwise comparison procedure (e.g., Bonferroni or Scheffé).
Result
senic$region <- factor(senic$region)
senic$medschl <- factor(senic$medschl)
model_a <- lm(infprob~region, data = senic)
anova(model_a)
## Analysis of Variance Table
##
## Response: infprob
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 13.997 4.6656 2.714 0.04839 *
## Residuals 109 187.383 1.7191
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Test whether the mean infection probability differs among regions.
H0: μ1 = μ2 = μ3 = μ4 (all four regions have the same mean infection probability)
H1: at least one regional mean differs
Decision rule: if F* ≤ F(0.95; 3, 109) ≈ 2.69, fail to reject H0; otherwise reject H0 in favor of H1.
Since F* = 2.714 exceeds the critical value (p = 0.048 in the ANOVA table), we reject H0: at the 5% significance level, the mean infection probability differs for at least one region, although the evidence is borderline.
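The decision rule can be checked directly from the numbers in the ANOVA table using base R. This is a minimal sketch: the sums of squares and degrees of freedom are typed in from the output above, and the variable names are mine.

```r
# Values read off the ANOVA table above
ss_region <- 13.997;  df_region <- 3
ss_resid  <- 187.383; df_resid  <- 109

# F* is the ratio of the two mean squares
f_star <- (ss_region / df_region) / (ss_resid / df_resid)

# Critical value F(0.95; 3, 109) and the upper-tail p-value of F*
f_crit  <- qf(0.95, df_region, df_resid)
p_value <- pf(f_star, df_region, df_resid, lower.tail = FALSE)

cat("F* =", round(f_star, 3),
    " critical value =", round(f_crit, 3),
    " p-value =", round(p_value, 5), "\n")
```

The p-value computed this way should agree with the 0.04839 reported by `anova(model_a)`, up to rounding of the inputs.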
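library(multcomp)
## User-defined contrasts, one row per region. As the output below shows, each
## estimate is that region's mean minus the region 1 (reference) mean.
comp <- rbind('1' = c(1,0,0,0),
'2' = c(0,1,0,0),
'3' = c(0,0,1,0),
'4' = c(0,0,0,1))
comp_test <- glht(model_a, linfct = mcp(region=comp))
## All six pairwise differences (Tukey contrasts)
comp_tukey <- glht(model_a, linfct = mcp(region="Tukey"))
comp_test
comp_tukey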
##
## General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Linear Hypotheses:
## Estimate
## 1 == 0 0.0000
## 2 == 0 -0.4670
## 3 == 0 -0.9337
## 4 == 0 -0.4795
##
## General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Linear Hypotheses:
## Estimate
## 2 - 1 == 0 -0.4670
## 3 - 1 == 0 -0.9337
## 4 - 1 == 0 -0.4795
## 3 - 2 == 0 -0.4667
## 4 - 2 == 0 -0.0125
## 4 - 3 == 0 0.4542
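The two outputs above are closely related: each Tukey contrast is the difference between two of the "versus region 1" estimates. The sketch below reconstructs the six pairwise estimates from the four values printed in the first output; the vector is typed in by hand and the names `vs_region1`, `pairs_idx`, and `tukey_est` are illustrative.

```r
# Differences from region 1, copied from the user-defined contrast output above
vs_region1 <- c("1" = 0.0000, "2" = -0.4670, "3" = -0.9337, "4" = -0.4795)

# All pairs (i, j) with i < j; each Tukey contrast is "j - i"
pairs_idx <- combn(names(vs_region1), 2)
tukey_est <- vs_region1[pairs_idx[2, ]] - vs_region1[pairs_idx[1, ]]
names(tukey_est) <- paste(pairs_idx[2, ], "-", pairs_idx[1, ])

print(round(tukey_est, 4))
```

The largest difference in absolute value is "3 - 1", which is the pair singled out by the confidence intervals.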
Tukey's procedure tells us that the only significant difference is between region1 and region3, since this is the only pair whose 95% confidence interval does not contain 0. This is clearer in the following plot.
mean_test <- emmeans(model_a, specs = "region")
mean_pairs <- pairs(mean_test)
tukey_pairs <- confint(mean_pairs, adjust="Tukey", level=0.95)
plot(tukey_pairs)
We will now perform the same pairwise comparisons using two other procedures, Bonferroni and Scheffé.
Bonferroni:
## contrast estimate SE df lower.CL upper.CL
## region1 - region2 0.4670 0.339 109 -0.4448 1.38
## region1 - region3 0.9337 0.328 109 0.0511 1.82
## region1 - region4 0.4795 0.411 109 -0.6247 1.58
## region2 - region3 0.4667 0.317 109 -0.3838 1.32
## region2 - region4 0.0125 0.401 109 -1.0663 1.09
## region3 - region4 -0.4542 0.392 109 -1.5085 0.60
##
## Confidence level used: 0.95
## Conf-level adjustment: bonferroni method for 6 estimates
region1 - region3 is the only pair whose interval does not contain 0, so it is the only significant difference.
Scheffe:
## contrast estimate SE df lower.CL upper.CL
## region1 - region2 0.4670 0.339 109 -0.49651 1.43
## region1 - region3 0.9337 0.328 109 0.00109 1.87
## region1 - region4 0.4795 0.411 109 -0.68736 1.65
## region2 - region3 0.4667 0.317 109 -0.43209 1.37
## region2 - region4 0.0125 0.401 109 -1.12750 1.15
## region3 - region4 -0.4542 0.392 109 -1.56825 0.66
##
## Confidence level used: 0.95
## Conf-level adjustment: scheffe method with rank 3
region1 - region3 is again the only pair whose interval does not contain 0, though only barely (lower limit 0.00109).
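The three procedures differ only in the multiplier applied to the standard error. As a hedged sketch using base R quantile functions (not the emmeans calls that produced the tables), the multipliers for 4 regions, 6 pairwise comparisons, and 109 residual degrees of freedom can be compared as follows. The estimate and SE for region1 - region3 are the rounded values from the output, so the limits will differ slightly from those reported.

```r
alpha    <- 0.05
g        <- 4              # number of regions
n_pairs  <- choose(g, 2)   # 6 pairwise comparisons
df_resid <- 109

# Multipliers applied to the standard error of a pairwise difference
tukey_mult   <- qtukey(1 - alpha, nmeans = g, df = df_resid) / sqrt(2)
bonf_mult    <- qt(1 - alpha / (2 * n_pairs), df_resid)
scheffe_mult <- sqrt((g - 1) * qf(1 - alpha, g - 1, df_resid))

# region1 - region3: estimate and (rounded) SE from the tables above
est <- 0.9337
se  <- 0.328

mults <- c(Tukey = tukey_mult, Bonferroni = bonf_mult, Scheffe = scheffe_mult)
intervals <- data.frame(method     = names(mults),
                        multiplier = unname(mults),
                        lower      = est - unname(mults) * se,
                        upper      = est + unname(mults) * se)
print(intervals)
```

For all pairwise comparisons, Tukey gives the narrowest intervals and Scheffé the widest, which is why the Scheffé lower limit for region1 - region3 sits so close to 0.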
Conclusion
In this analysis, we aimed to assess whether there is a significant difference in infection probability (infprob) among four geographic regions (region1, region2, region3, region4) based on data from the SENIC project. We followed a step-by-step approach, including ANOVA, Tukey’s procedure, and alternative pairwise comparison procedures, to draw meaningful conclusions regarding regional differences in infection risk.
ANOVA: We initially conducted an analysis of variance (ANOVA) to test the null hypothesis (H0) that the mean infection probability is the same in all four geographic regions, versus the alternative hypothesis (H1) that at least one region differs from the others. The test yielded F* = 2.714 with a p-value of 0.048. The critical value F(0.95; 3, 109) is approximately 2.69, so F* exceeds it and we reject H0: at the 5% significance level there is evidence, albeit borderline, that mean infection probability differs among regions.
Post-hoc Pairwise Comparisons: To delve further into the differences among the regions, we performed pairwise comparisons using Tukey’s procedure, as well as alternative procedures such as Bonferroni and Scheffe. Here are the key findings:
Tukey’s Procedure: After applying Tukey’s procedure, we found that the only significant mean comparison was between region1 and region3. All other pairwise comparisons were not significant.
Bonferroni Procedure: The Bonferroni procedure also indicated that the only significant pairwise comparison was between region1 and region3.
Scheffe Procedure: The Scheffe procedure yielded results consistent with Tukey and Bonferroni, identifying region1-region3 as the only significant pairwise comparison.
Overall summary and conclusion
Upon comparing the results of different pairwise comparison procedures, we found consistent evidence that the significant difference in infection probability was primarily driven by region1 and region3. This suggests that there may be specific factors or conditions in these regions that lead to varying infection risks compared to the other regions.
In summary, the overall ANOVA is borderline significant (p = 0.048), and all three pairwise procedures agree that the difference lies between region1 (Northeast) and region3 (South): the mean infection probability is estimated to be about 0.93 percentage points higher in region1. No other pair of regions differs significantly. These findings could have practical implications for healthcare professionals and policymakers interested in improving infection control measures in specific regions.
This analysis serves as a valuable contribution to understanding regional variations in nosocomial infection risk, providing a basis for further investigation and targeted interventions as needed.
Citation
[1] Haley, R. W., Culver, D. H., White, J. W., Morgan, W. M., Emori, T. G., Munn, V. P., & Hooton, T. M. (1985). The efficacy of infection surveillance and control programs in preventing nosocomial infections in US hospitals. American Journal of Epidemiology, 121(2), 182–205. https://doi.org/10.1093/oxfordjournals.aje.a113990
[2] Eickhoff, T. C. (1980). General comments on the Study on the Efficacy of Nosocomial Infection Control (SENIC Project). American Journal of Epidemiology, 111(5), 465–469. https://doi.org/10.1093/oxfordjournals.aje.a112926