Description

The 2014-2015 “Healthcare Associated Infections” dataset provided by https://www.medicare.gov measures how often patients in a particular hospital contract certain infections during the course of their medical treatment, when compared to like hospitals. The HAI measures apply to all patients treated in acute care hospitals, including adult, pediatric, neonatal, Medicare, and non-Medicare patients. The infection types captured by the dataset are:

  1. central line-associated bloodstream infections (CLABSI)
  2. central line-associated bloodstream infections (CLABSI) (ICU only)
  3. catheter-associated urinary tract infections (CAUTI)
  4. catheter-associated urinary tract infections (CAUTI) (ICU only)
  5. surgical site infection (SSI) from colon surgery
  6. surgical site infection (SSI) from abdominal hysterectomy
  7. methicillin-resistant Staphylococcus aureus (MRSA) blood laboratory-identified events (bloodstream infections)
  8. Clostridium difficile (C.diff.) laboratory-identified events (intestinal infections)

The data are captured over reporting periods of six months to one year. For each measure, the CDC calculates a Standardized Infection Ratio (SIR): the number of observed infections divided by the number of predicted infections. The prediction may take into account the type of patient care location, number of patients with an existing infection, laboratory methods, hospital affiliation with a medical school, bed size of the hospital, patient age, and classification of patient health. Predicted values are determined by the National Healthcare Safety Network using historical data. An SIR greater than 1.0 indicates that more HAIs were observed than predicted; conversely, an SIR less than 1.0 indicates that fewer HAIs were observed than predicted.

NOTE: If the number of predicted infections is less than 1, the SIR and confidence interval cannot be calculated.
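
As a small illustration of the ratio and of the NOTE above, the sketch below computes an SIR from observed and predicted counts and returns NA when fewer than one infection is predicted. The helper name `compute_sir` and all counts are made up for illustration; this is not the CDC's implementation.

```r
# Hypothetical helper (not part of the original analysis):
# SIR = observed infections / predicted infections.
# Returns NA when fewer than 1 infection is predicted, mirroring the NOTE above.
compute_sir <- function(observed, predicted) {
  ifelse(predicted < 1, NA_real_, observed / predicted)
}

# Made-up counts for three hospitals
observed  <- c(12, 3, 0)
predicted <- c(8.0, 6.0, 0.4)

sir <- compute_sir(observed, predicted)
print(sir)  # 1.5 0.5 NA
```

The first hospital saw more infections than predicted (SIR above 1.0), the second saw fewer, and the third has no SIR because its predicted count is below 1.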

Find dataset here: https://data.world/health/hospital-infections/workspace/project-summary

Objective

To give a sense of which type of HAI a patient is most likely to encounter as a result of hospitalization, and which hospitals a patient may want to avoid.


Library imports

library(DescTools)
library(dplyr)
library(tidyr)
library(reshape2)
library(ggplot2)
library(plotly)

Read in data set

data = read.csv('/Users/Micho/Downloads/Healthcare_Associated_Infections_-_Hospital.csv')

Get some descriptive statistics

dim(data)
## [1] 222864     16
colnames(data)
##  [1] "Provider.ID"          "Hospital.Name"        "Address"             
##  [4] "City"                 "State"                "ZIP.Code"            
##  [7] "County.Name"          "Phone.Number"         "Measure.Name"        
## [10] "Measure.ID"           "Compared.to.National" "Score"               
## [13] "Footnote"             "Measure.Start.Date"   "Measure.End.Date"    
## [16] "Location"

The columns we will focus on are:

  1. “Hospital.Name” - name of the hospital
  2. “City” - city in which the hospital is located
  3. “State” - state/territory in which the hospital is located
  4. “Measure.ID” - code indicating the infection type being assessed and the specific measurement of that infection (i.e. lower confidence bound, upper confidence bound, number of predicted cases, number of actual cases, SIR measurement)
  5. “Compared.to.National” - one of the following: “Worse than the National Benchmark”, “No Different than National Benchmark”, “Better than the National Benchmark”
  6. “Score” - the measurement value associated with “Measure.ID” for that row

Most of the remaining columns provide identifying information that is not useful for discovering the prevalence of certain HAIs. The date ranges the data span (6 months to 1 year) are too long to provide useful information about how infections relate to the time of year.
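
Because “Measure.ID” packs two pieces of information into one code, it can help to see them pulled apart. The sketch below assumes the IDs follow the pattern HAI_<type>_<measurement>, as the NUMERATOR codes used later in this analysis do. The specific example IDs are assumptions chosen for illustration.

```r
# Example IDs assumed to follow the pattern HAI_<type>_<measurement>
ids <- c("HAI_1_SIR", "HAI_1a_NUMERATOR", "HAI_6_ELIGCASES")

# Split each ID into the infection-type code and the measurement suffix
parsed <- data.frame(
  Measure.ID  = ids,
  infection   = sub("^(HAI_[0-9]+[a-z]?)_.*$", "\\1", ids),
  measurement = sub("^HAI_[0-9]+[a-z]?_", "", ids),
  stringsAsFactors = FALSE
)
print(parsed)
```

The analysis below does not split the column this way. It matches on the measurement suffix directly with pattern filters, which has the same effect for filtering.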

Format data

# get predicted cases (ELIGCASES), observed cases (NUMERATOR), SIR measure (SIR) 
rows_to_keep = which(data$Measure.ID %like any% c("%ELIGCASES%", "%NUMERATOR%", "%SIR%"))
# filter and mutate data set
data.filtered = data[rows_to_keep, ] %>%
  # ignore rows with footnote comments
  filter(Score != "Not Available" & Footnote == '') %>%
  # convert data type of Score from factor to numeric ->
  # requires conversion from factor to character, then conversion from character to numeric
  mutate(Score=as.numeric(as.character(Score))) %>%
  # two or more hospitals can have the same name yet be in different city & state,
  # so include the city & state in name for group by
  mutate(Hospital.Name=sprintf("%s (%s, %s)", as.character(Hospital.Name),
                               as.character(City), as.character(State))) %>%
  # keep only the relevant columns
  select(Hospital.Name, State, Measure.ID, Compared.to.National, Score)
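
The two-step conversion of Score (factor to character, then character to numeric) matters because calling as.numeric() directly on a factor returns its internal level codes rather than the values it displays. A minimal sketch with made-up scores:

```r
# Made-up Score column stored as a factor, as read.csv() produces when
# stringsAsFactors = TRUE (the default before R 4.0)
score <- factor(c("7", "10", "2.5"))

# Wrong: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(score)
print(codes)   # 3 1 2

# Right: convert to character first to recover the displayed values
values <- as.numeric(as.character(score))
print(values)  # 7.0 10.0 2.5
```

If Score is already read in as character (the default from R 4.0 onward), the as.character() step is harmless and the result is the same.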

Which infection type is most frequent on a national level?

# extract the observed occurrences for each infection type
infection_type = data.filtered %>%
  filter(Measure.ID %like any% c("%NUMERATOR%")) %>%
  group_by(Measure.ID) %>%
  # total observed cases per infection type
  # (length() would only count the number of reporting rows)
  summarise(count=sum(Score)) %>%
  arrange(desc(count))

labs = c("HAI_6_NUMERATOR"  = "C.diff. Intestinal Infection",
         "HAI_2_NUMERATOR"  = "Catheter-Associated Urinary Tract Infection",
         "HAI_2a_NUMERATOR" = "Catheter-Associated Urinary Tract Infection (ICU only)",
         "HAI_3_NUMERATOR"  = "Surgical Site Infection from Colon Surgery",
         "HAI_5_NUMERATOR"  = "MRSA Bloodstream Infection",
         "HAI_1_NUMERATOR"  = "Central Line-Associated Bloodstream Infection",
         "HAI_1a_NUMERATOR" = "Central Line-Associated Bloodstream Infection (ICU only)",
         "HAI_4_NUMERATOR"  = "Surgical Site Infection from Abdominal Hysterectomy")

ggplot(infection_type, aes(x=reorder(Measure.ID, count), y=count, group=1)) + 
  coord_flip() + # change from vertical plot to horizontal plot
  geom_bar(stat="identity") +
  scale_x_discrete(labels=labs) +
  labs(x="Infection Type", y="Count") + 
  labs(title="Overall National HAI Frequency") +
  theme(plot.title=element_text(hjust=0.5)) # center title  

The plot above gives a general, national-level picture of HAIs by showing the total number of observed occurrences of each infection type. C.diff. intestinal infections are by far the most common type of HAI.
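
When totalling occurrences per infection type, it is worth keeping in mind that counting rows and summing scores answer different questions. The sketch below uses made-up NUMERATOR rows, one per hospital and infection type, to show the difference. The values are purely illustrative.

```r
# Made-up NUMERATOR rows: one row per hospital and infection type
toy <- data.frame(
  Measure.ID = c("HAI_6_NUMERATOR", "HAI_6_NUMERATOR",
                 "HAI_2_NUMERATOR", "HAI_2_NUMERATOR", "HAI_2_NUMERATOR"),
  Score      = c(40, 25, 3, 5, 2)
)

by_measure <- split(toy$Score, toy$Measure.ID)

# Number of hospitals reporting each measure (row count)
n_reporting <- sapply(by_measure, length)

# Total observed infections per measure (sum of scores)
total_cases <- sapply(by_measure, sum)

print(n_reporting)  # HAI_2_NUMERATOR: 3, HAI_6_NUMERATOR: 2
print(total_cases)  # HAI_2_NUMERATOR: 10, HAI_6_NUMERATOR: 65
```

In this toy data the measure reported by more hospitals has far fewer total infections, so the two summaries can rank infection types differently.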

Compare average SIR measurement by state.

# determine average SIR measurement for each state
SIR_avg_by_state = data.filtered %>%
  filter(Measure.ID %like any% c("%SIR%")) %>%
  group_by(State) %>%
  summarise(SIR_average=mean(Score))
# create column for color differentiation
SIR_avg_by_state$color = ifelse(SIR_avg_by_state$SIR_average > 1.0, "> 1.0", "<= 1.0")

ggplot(SIR_avg_by_state, aes(x=State, y=SIR_average, fill=color)) + 
  geom_bar(stat="identity") +
  # name the colors so each one is tied to its category rather than to level order
  scale_fill_manual(values=c("<= 1.0"="#999999", "> 1.0"="#FF6666")) +
  theme(axis.text.x=element_text(angle=90, hjust=1, size=8.0)) + # adjust x-axis labels
  labs(y="Average SIR Measurement") + 
  labs(title="Average Standardized Infection Ratio\n(SIR) Measurement by State/Territory") +
  theme(plot.title=element_text(hjust=0.5)) + # center title
  guides(fill=guide_legend(title="SIR", reverse=TRUE)) # modify legend title and ordering

The plot above shows the average SIR measurement for each U.S. state/territory. Only Rhode Island and the Virgin Islands have an average SIR measurement above 1.0, which indicates that, on average, more HAIs were observed there than predicted (based on historical data analyzed by the National Healthcare Safety Network). This is an aggregate measure that describes the state as a whole rather than any individual hospital, but it can still be informative about the general level of risk associated with each state/territory. It is also worth observing that no U.S. state is absent from this plot, suggesting that HAIs are present to some extent in every state.
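
To illustrate why a state-level average says little about any single hospital, the sketch below uses made-up SIR values for two hypothetical states ("AA" and "BB"). One state averages below 1.0 even though one of its hospitals is well above it.

```r
# Made-up SIR values for hospitals in two hypothetical states
toy_sir <- data.frame(
  State = c("AA", "AA", "AA", "BB", "BB"),
  Score = c(0.4, 0.6, 1.7, 1.2, 1.1)
)

# Average SIR per state, the same aggregate used in the analysis above
state_avg <- sapply(split(toy_sir$Score, toy_sir$State), mean)

# Share of hospitals in each state with an SIR above 1.0
share_above <- sapply(split(toy_sir$Score > 1.0, toy_sir$State), mean)

print(state_avg)    # AA: 0.9, BB: 1.15
print(share_above)  # AA: one in three, BB: all
```

State "AA" would be colored as "<= 1.0" in a plot like the one above, yet a third of its hospitals exceed their predicted infection counts.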

Compare how frequently hospitals report more infections than predicted.

SIR_count_by_hospital = data.filtered %>%
  filter(Measure.ID %like any% c("%SIR%")) %>% # get SIR scores
  group_by(Hospital.Name) %>%
  summarise(count_above=sum(Score > 1.0), count_below=sum(Score <= 1.0)) %>%
  arrange(desc(count_above)) %>%
  slice(seq_len(20)) # get top 20


plot_ly(SIR_count_by_hospital, x=~count_above, y=~Hospital.Name,
        name='Actual > Predicted', type='bar', width=975, height=600,
        marker=list(color = 'black', line=list(color = 'rgb(8,48,107)', width = 1.5))) %>%
  add_trace(x=~count_below, name='Actual <= Predicted', marker=list(color="lime")) %>%
  layout(barmode="group", margin=list(t=70),
         title="Combined Summary of the 8 Infection Types\nActual Count VS Predicted Count (20 Worst Hospitals)\n\n",
         titlefont=list(size=15),
         xaxis=list(title="Count"),
         yaxis=list(side="left", title=NA, categoryarray=~count_above))

Note that not every hospital in the plot has a measure for each of the eight infection types. Certain values were listed as not available (see NOTE in the description) or could not be calculated, which explains the missing values for some hospitals. The plot shows, for each hospital, the number of infection types that occurred more often than predicted (black) and the number that occurred no more often than predicted (green). According to the plot, Chesapeake General Hospital in Chesapeake, VA had more observed occurrences than predicted for all eight infection types. This hospital is located in neither Rhode Island nor the Virgin Islands, the only state and territory previously found to have an average SIR measurement above 1.0. Next we look more closely at the infection frequency for this hospital to determine which HAI is most responsible and how the hospital-specific distribution compares with the national distribution.

Frequency of infections at the hospital with the most infection types above prediction: Chesapeake General Hospital

# filter to get values of actual infection counts
worst_hospital = data.filtered[data.filtered$Hospital.Name == "CHESAPEAKE GENERAL HOSPITAL (CHESAPEAKE, VA)", ] %>%
  filter(Measure.ID %like any% c("%NUMERATOR%")) %>%
  select(Measure.ID, Score)
  
ggplot(worst_hospital, aes(x=reorder(Measure.ID, Score), y=Score, group=1)) + 
  coord_flip() + # change from vertical plot to horizontal plot
  geom_bar(stat="identity") +
  scale_x_discrete(labels=labs) +
  labs(x="Infection Type", y="Count") + 
  labs(title="Infection Frequency at Chesapeake\nGeneral Hospital (Chesapeake, VA)") +
  theme(plot.title=element_text(hjust=0.5)) # center title  

The plot above shows the number of actual occurrences of each infection type at Chesapeake General Hospital in Chesapeake, VA. The distribution is somewhat similar to the national distribution with intestinal infections being the most common HAI.
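
One way to make the comparison between a single hospital and the national picture more concrete is to convert both sets of counts to shares of their own totals, so the very different scales no longer matter. The sketch below uses made-up counts for four infection types. None of the numbers come from the dataset.

```r
# Made-up observed counts, in the same infection-type order for both vectors
national <- c(C.diff = 500, CAUTI = 200, CLABSI = 150, MRSA = 150)
hospital <- c(C.diff = 30,  CAUTI = 10,  CLABSI = 6,   MRSA = 4)

# Convert counts to shares of each total so the two scales are comparable
shares <- rbind(national = prop.table(national),
                hospital = prop.table(hospital))
print(round(shares, 2))
```

Each row sums to 1, so the infection types can be compared column by column between the hospital and the national totals.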


Conclusion

Healthcare-associated infections occur in every U.S. state/territory. Hospitalized patients should be vigilant about intestinal infections and catheter-associated infections, the two most common HAIs at the national level.