Analysis of Nosocomial Infection Control Efficacy in US Hospitals

Statistical Analysis Project

Introduction

In the 1970s, the Centers for Disease Control conducted a nationwide study of hospitals to assess the status of hospital-acquired (nosocomial) infections. The purpose of the study was to evaluate what hospitals were doing to prevent the spread of infection and whether those steps were effective [1].

The project grew out of a concern about whether the steps taken by individual infection control specialists in hospitals were actually reducing infections. This is a common difficulty in medical research because a controlled experiment is not possible. It would be highly unethical and illegal to knowingly infect patients to find out whether the steps were effective, and randomly assigning soap to some surgeries and a placebo to others is out of the question [2].

SENIC stands for Study on the Efficacy of Nosocomial Infection Control. The dataset analyzed here is a random sample of 113 hospitals from the original 338 hospitals surveyed. It has 113 rows and 12 columns: 1 for identification and 11 for the variables of interest.

Packages used

knitr, car, xtable, corrplot, emmeans, multcomp, ggpubr

Goal

The project aims to investigate the effectiveness of Nosocomial Infection Control programs in reducing hospital-acquired infections in the United States.

Exploratory Data Analysis

Description

Below is a table listing each variable's number, name, and description:

senic <- read.csv("data/senic.csv") ## make sure you check your directory and that this file 

## Load description dataset
senic_description <- read.csv("Data/senic_description.csv")

## Demonstrate data.frame
senic_description.table <- xtable(senic_description)
print(senic_description.table, type="html")
| Variable_Number | Variable_Name | Description |
|---|---|---|
| 1 | Identification Number (Hospital) | 1-113 |
| 2 | Length of stay (stay) | Average length of stay of all patients in hospital (in days) |
| 3 | Age (age) | Average age of patients (in years) |
| 4 | Infection risk (infprob) | Average estimated probability of acquiring infection in hospital (in percent) |
| 5 | Routine culturing ratio (culratio) | Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100 |
| 6 | Routine chest x-ray ratio (xratio) | Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100 |
| 7 | Number of beds (nbeds) | Average number of beds in hospital during study period |
| 8 | Medical school affiliation (medschl) | 1 = Yes, 2 = No |
| 9 | Region (region) | Geographic regions, where 1 = Northeast, 2 = North Central, 3 = South, 4 = West |
| 10 | Average daily census (census) | Average number of patients in hospital per day during study period |
| 11 | Number of nurses (nurses) | Average number of full-time equivalent registered and licensed practical nurses during study period (number full time plus one half the number of part time) |
| 12 | Available facilities and services (service) | Percent of 35 potential facilities and services that are provided by the hospital |
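Because region and medschl are stored as numeric codes, it can help to attach the labels from the description table when converting them to factors. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are illustrative assumptions, not rows from the SENIC file, and only the labels come from the table above.

```r
# Toy data frame with the same coding as the description table
# (hypothetical values for illustration only, not taken from senic.csv).
toy <- data.frame(
  region  = c(4, 2, 3, 1, 2),
  medschl = c(2, 2, 1, 2, 1)
)

# Attach human-readable labels to the numeric codes.
toy$region <- factor(toy$region, levels = 1:4,
                     labels = c("Northeast", "North Central", "South", "West"))
toy$medschl <- factor(toy$medschl, levels = 1:2, labels = c("Yes", "No"))

# Count hospitals per labeled region.
print(table(toy$region))
```

The analysis in this report keeps the numeric level names (region1, region2, ...), so applying labels like this would change how the contrasts are named in the output but not the estimates.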

SENIC Data Visualization

Here are the first 6 observations of the senic data:

senic.table <- xtable(head(senic))
print(senic.table, type="html")
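| Hospital | stay | age | infprob | culratio | xratio | nbeds | medschl | region | census | nurses | service |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7.13 | 55.70 | 4.10 | 9.00 | 39.60 | 279 | 2 | 4 | 207 | 241 | 60.00 |
| 2 | 8.82 | 58.20 | 1.60 | 3.80 | 51.70 | 80 | 2 | 2 | 51 | 52 | 40.00 |
| 3 | 8.34 | 56.90 | 2.70 | 8.10 | 74.00 | 107 | 2 | 3 | 82 | 54 | 20.00 |
| 4 | 8.95 | 53.70 | 5.60 | 18.90 | 122.80 | 147 | 2 | 4 | 53 | 148 | 40.00 |
| 5 | 11.20 | 56.50 | 5.70 | 34.50 | 88.90 | 180 | 2 | 1 | 134 | 151 | 40.00 |
| 6 | 9.76 | 50.90 | 5.10 | 21.90 | 97.00 | 150 | 2 | 2 | 147 | 106 | 40.00 |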

I have also provided a correlation matrix to see which variables are most closely associated:

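## Keep only the numeric variables: drop the ID column and the two categorical columns
senic_corr <- senic[, !(names(senic) %in% c("Hospital", "medschl", "region"))]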
corr.table <- xtable(cor(senic_corr))
print(corr.table, type="html")
| | stay | age | infprob | culratio | xratio | nbeds | census | nurses | service |
|---|---|---|---|---|---|---|---|---|---|
| stay | 1.00 | 0.19 | 0.53 | 0.33 | 0.38 | 0.41 | 0.47 | 0.34 | 0.36 |
| age | 0.19 | 1.00 | 0.00 | -0.23 | -0.02 | -0.06 | -0.05 | -0.08 | -0.04 |
| infprob | 0.53 | 0.00 | 1.00 | 0.56 | 0.45 | 0.36 | 0.38 | 0.39 | 0.41 |
| culratio | 0.33 | -0.23 | 0.56 | 1.00 | 0.42 | 0.14 | 0.14 | 0.20 | 0.19 |
| xratio | 0.38 | -0.02 | 0.45 | 0.42 | 1.00 | 0.05 | 0.06 | 0.08 | 0.11 |
| nbeds | 0.41 | -0.06 | 0.36 | 0.14 | 0.05 | 1.00 | 0.98 | 0.92 | 0.79 |
| census | 0.47 | -0.05 | 0.38 | 0.14 | 0.06 | 0.98 | 1.00 | 0.91 | 0.78 |
| nurses | 0.34 | -0.08 | 0.39 | 0.20 | 0.08 | 0.92 | 0.91 | 1.00 | 0.78 |
| service | 0.36 | -0.04 | 0.41 | 0.19 | 0.11 | 0.79 | 0.78 | 0.78 | 1.00 |

QQ Plot of infection probability

ggqqplot(senic$infprob)

Here is the same correlation matrix drawn with the corrplot package:

corrplot(cor(senic_corr), method = 'shade', order = 'AOE', diag = FALSE)

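We can see from the data that several variables are highly correlated: nbeds, census, and nurses have pairwise correlations of 0.91 to 0.98, and service correlates with each of them at 0.78 to 0.79. Infection risk (infprob) is most strongly correlated with culratio (0.56) and stay (0.53). From here we can state our hypothesis and test it against the data.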
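To make "highly correlated" concrete, the pairs above a chosen cutoff can be pulled out of the correlation matrix programmatically. This is a minimal sketch that retypes the hospital-size block of the printed matrix rather than reading the data file; the 0.9 cutoff and the object names are my own assumptions.

```r
# Correlations copied from the printed matrix above (rounded to 2 decimals).
vars <- c("nbeds", "census", "nurses", "service")
r <- matrix(c(1.00, 0.98, 0.92, 0.79,
              0.98, 1.00, 0.91, 0.78,
              0.92, 0.91, 1.00, 0.78,
              0.79, 0.78, 0.78, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

threshold <- 0.9  # assumed cutoff for "highly correlated"

# Look only at the upper triangle so each pair is reported once.
idx <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(r)[idx[, "row"]],
                         var2 = colnames(r)[idx[, "col"]],
                         r    = r[idx])
print(high_pairs)
```

With the full data, the same logic could be applied to `cor(senic_corr)` in place of the hand-typed matrix.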

ANOVA and Post-hoc Analysis

Objective

To test whether the mean infection risk is the same in four geographic regions.

Action

  1. Perform one-way ANOVA to compare mean infection risk across regions.
  2. Conduct Tukey’s procedure to obtain confidence intervals for pairwise comparisons.
  3. Perform a different pairwise comparison procedure (e.g., Bonferroni or Scheffé).

Result

senic$region <- factor(senic$region)
senic$medschl <- factor(senic$medschl)

model_a <- lm(infprob~region, data = senic)

anova(model_a)
## Analysis of Variance Table
## 
## Response: infprob
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## region      3  13.997  4.6656   2.714 0.04839 *
## Residuals 109 187.383  1.7191                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

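Test whether mean infection probability differs among regions.

H0: mu1 = mu2 = mu3 = mu4, where mu_i is the mean infection risk in region i

H1: not all region means are equal

If F* <= F(0.95;3,109), which is approximately 2.69, fail to reject H0; otherwise reject H0 and conclude H1.

Since F* = 2.714 exceeds 2.69 (p = 0.048 in the ANOVA table), we reject H0: at the 5% significance level, mean infection risk is not the same in all four regions. The evidence is borderline, since F* only narrowly exceeds the critical value.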
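The decision rule can be checked directly in base R from the numbers in the ANOVA table. This sketch recomputes F* from the sums of squares, then obtains the critical value with qf() and the p-value with pf(); the variable names are my own.

```r
# Values taken from the ANOVA table above.
ss_region <- 13.997;  df_region <- 3
ss_resid  <- 187.383; df_resid  <- 109

# F* is the ratio of the two mean squares.
f_star <- (ss_region / df_region) / (ss_resid / df_resid)

# Critical value at alpha = 0.05 and the p-value of the observed F*.
alpha   <- 0.05
f_crit  <- qf(1 - alpha, df_region, df_resid)
p_value <- pf(f_star, df_region, df_resid, lower.tail = FALSE)

print(c(f_star = f_star, f_crit = f_crit, p_value = p_value))
print(f_star > f_crit)  # TRUE means H0 is rejected at level alpha
```

The p-value from pf() should agree with the Pr(>F) column of the ANOVA table, which is a quick way to confirm that the degrees of freedom used for the critical value are the right ones.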

comp <- rbind('1' = c(1,0,0,0),
              '2' = c(0,1,0,0),
              '3' = c(0,0,1,0),
              '4' = c(0,0,0,1))

comp_test <- glht(model_a, linfct =  mcp(region=comp))
comp_tukey <- glht(model_a, linfct = mcp(region="Tukey"))

comp_test
## 
##   General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Linear Hypotheses:
##        Estimate
## 1 == 0   0.0000
## 2 == 0  -0.4670
## 3 == 0  -0.9337
## 4 == 0  -0.4795
comp_tukey
## 
##   General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Linear Hypotheses:
##            Estimate
## 2 - 1 == 0  -0.4670
## 3 - 1 == 0  -0.9337
## 4 - 1 == 0  -0.4795
## 3 - 2 == 0  -0.4667
## 4 - 2 == 0  -0.0125
## 4 - 3 == 0   0.4542

The Tukey intervals tell us that the only significant difference is between region1 and region3, since that is the only pair whose 95% confidence interval does not contain 0. The output above prints only the estimates; the intervals are easier to see in the following plot.

mean_test <- emmeans(model_a, specs = "region")
mean_pairs <- pairs(mean_test)

tukey_pairs <- confint(mean_pairs, adjust="Tukey", level=0.95)
plot(tukey_pairs)

We will now perform the same pairwise comparisons using two other procedures, Bonferroni and Scheffe.

Bonferroni:

bonf_pairs <- confint(mean_pairs, adjust="bonferroni", level=0.95)
bonf_pairs
##  contrast          estimate    SE  df lower.CL upper.CL
##  region1 - region2   0.4670 0.339 109  -0.4448     1.38
##  region1 - region3   0.9337 0.328 109   0.0511     1.82
##  region1 - region4   0.4795 0.411 109  -0.6247     1.58
##  region2 - region3   0.4667 0.317 109  -0.3838     1.32
##  region2 - region4   0.0125 0.401 109  -1.0663     1.09
##  region3 - region4  -0.4542 0.392 109  -1.5085     0.60
## 
## Confidence level used: 0.95 
## Conf-level adjustment: bonferroni method for 6 estimates

region1 - region3 is the only pairwise comparison whose interval does not contain 0, so it is the only significant difference.

plot(bonf_pairs)

Scheffe:

scheffe_pairs <- confint(mean_pairs, adjust="scheffe", level=0.95)
scheffe_pairs
##  contrast          estimate    SE  df lower.CL upper.CL
##  region1 - region2   0.4670 0.339 109 -0.49651     1.43
##  region1 - region3   0.9337 0.328 109  0.00109     1.87
##  region1 - region4   0.4795 0.411 109 -0.68736     1.65
##  region2 - region3   0.4667 0.317 109 -0.43209     1.37
##  region2 - region4   0.0125 0.401 109 -1.12750     1.15
##  region3 - region4  -0.4542 0.392 109 -1.56825     0.66
## 
## Confidence level used: 0.95 
## Conf-level adjustment: scheffe method with rank 3

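region1 - region3 is again the only pairwise comparison whose interval does not contain 0. Its lower bound (0.00109) is barely above 0, which reflects how conservative the Scheffe procedure is for pairwise comparisons.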

plot(scheffe_pairs)
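All three procedures build each interval as estimate ± multiplier × SE and differ only in the multiplier. The sketch below recomputes the region1 - region3 interval in base R under each adjustment, using the textbook formulas for the multipliers. I am assuming these match what emmeans does for this model, and because the estimate and SE are the rounded values from the printed output, the bounds will differ slightly from those shown above.

```r
# Estimate and standard error for region1 - region3, copied from the
# printed output above (rounded, so bounds differ slightly from emmeans).
est <- 0.9337
se  <- 0.328
df  <- 109            # residual degrees of freedom
k   <- 4              # number of regions
m   <- choose(k, 2)   # number of pairwise comparisons (6)
alpha <- 0.05

# Critical multipliers for simultaneous 95% intervals.
mult <- c(
  tukey      = qtukey(1 - alpha, nmeans = k, df = df) / sqrt(2),
  bonferroni = qt(1 - alpha / (2 * m), df),
  scheffe    = sqrt((k - 1) * qf(1 - alpha, k - 1, df))
)

intervals <- data.frame(lower = est - mult * se, upper = est + mult * se)
print(round(cbind(multiplier = mult, intervals), 3))
```

A larger multiplier means a wider interval, so comparing the three multipliers shows directly which procedure is most conservative for this set of six pairwise comparisons.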

Conclusion

In this analysis, we aimed to assess whether there is a significant difference in infection probability (infprob) among four geographic regions (region1, region2, region3, region4) based on data from the SENIC project. We followed a step-by-step approach, including ANOVA, Tukey’s procedure, and alternative pairwise comparison procedures, to draw meaningful conclusions regarding regional differences in infection risk.

ANOVA: We first conducted an analysis of variance (ANOVA) to test the null hypothesis (H0) that mean infection probability is the same across the four geographic regions, versus the alternative hypothesis (H1) that at least one region differs from the others. The test yielded F* = 2.714 with a p-value of 0.048. The critical value F(0.95;3,109) is approximately 2.69. Since F* > F(0.95;3,109), we reject H0 and conclude that mean infection probability differs among the regions at the 5% significance level, although only narrowly.

Post-hoc Pairwise Comparisons: To delve further into the differences among the regions, we performed pairwise comparisons using Tukey’s procedure, as well as alternative procedures such as Bonferroni and Scheffe. Here are the key findings:

  • Tukey’s Procedure: After applying Tukey’s procedure, we found that the only significant mean comparison was between region1 and region3. All other pairwise comparisons were not significant.

  • Bonferroni Procedure: The Bonferroni procedure also indicated that the only significant pairwise comparison was between region1 and region3.

  • Scheffe Procedure: The Scheffe procedure yielded results consistent with Tukey and Bonferroni, identifying region1-region3 as the only significant pairwise comparison.

Overall summary and conclusion

Upon comparing the results of the different pairwise comparison procedures, we found consistent evidence that the significant difference in infection probability is driven by region1 and region3 (Northeast and South, per the variable description): the estimated mean infection risk is about 0.93 percentage points higher in region1. This suggests that there may be specific factors or conditions in these regions that lead to different infection risks.

In summary, our analysis indicates that mean infection probability differs among the four geographic regions at the 5% significance level, although the overall evidence is borderline (p = 0.048), and that the difference between region1 and region3 is the only pairwise comparison that reaches significance. These findings could have practical implications for healthcare professionals and policymakers interested in improving infection control measures in specific regions.

This analysis serves as a valuable contribution to understanding regional variations in nosocomial infection risk, providing a basis for further investigation and targeted interventions as needed.

Citation

[1] Haley, R. W., Culver, D. H., White, J. W., Morgan, W. M., Emori, T. G., Munn, V. P., & Hooton, T. M. (1985). The efficacy of infection surveillance and control programs in preventing nosocomial infections in US hospitals. American Journal of Epidemiology, 121(2), 182–205. https://doi.org/10.1093/oxfordjournals.aje.a113990

[2] Eickhoff, T. C. (1980). General comments on the Study on the Efficacy of Nosocomial Infection Control (SENIC Project). American Journal of Epidemiology, 111(5), 465–469. https://doi.org/10.1093/oxfordjournals.aje.a112926