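The 2014-2015 “Healthcare Associated Infections” (HAI) dataset provided by https://www.medicare.gov measures how often patients in a particular hospital contract certain infections during the course of their medical treatment, compared with similar hospitals. The HAI measures apply to all patients treated in acute care hospitals, including adult, pediatric, neonatal, Medicare, and non-Medicare patients. The infection types captured by the dataset are central line-associated bloodstream infections (overall and ICU only), catheter-associated urinary tract infections (overall and ICU only), surgical site infections from colon surgery, surgical site infections from abdominal hysterectomy, MRSA bloodstream infections, and C. diff. intestinal infections.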
Each measure is collected over a reporting period of six months to one year. The CDC calculates a Standardized Infection Ratio (SIR), the ratio of observed to predicted infections. The SIR may take into account the type of patient care location, number of patients with an existing infection, laboratory methods, hospital affiliation with a medical school, bed size of the hospital, patient age, and classification of patient health. Predicted values are determined by the National Healthcare Safety Network using historical data. An SIR greater than 1.0 indicates that more HAIs were observed than predicted; conversely, an SIR less than 1.0 indicates that fewer HAIs were observed than predicted.
NOTE: If the number of predicted infections is less than 1, the SIR and confidence interval cannot be calculated.
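As a minimal sketch of the ratio described above, the snippet below treats the SIR as observed infections divided by predicted infections, with the risk adjustment assumed to be already folded into the predicted value. The helper name `compute_sir` and the example numbers are hypothetical and are not taken from the dataset; the helper returns `NA` when the predicted count is below 1, mirroring the note.

```r
# Hypothetical helper: SIR = observed / predicted.
# Returns NA when predicted < 1, because the SIR is not calculated in that case.
compute_sir <- function(observed, predicted) {
  ifelse(predicted < 1, NA_real_, observed / predicted)
}

# Illustrative values only (not taken from the dataset)
observed  <- c(12, 3, 1)
predicted <- c(8.0, 6.0, 0.4)

sir <- compute_sir(observed, predicted)
print(sir)
```

In this made-up example the first hospital has an SIR of 1.5 (more infections than predicted), the second 0.5 (fewer than predicted), and the third has no SIR because fewer than one infection was predicted.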
Find dataset here: https://data.world/health/hospital-infections/workspace/project-summary
The goal of this analysis is to give a sense of which type of HAI one is most likely to encounter as a result of hospitalization and which hospitals one may want to avoid.
library(DescTools)
library(dplyr)
library(tidyr)
library(reshape2)
library(ggplot2)
library(plotly)
data = read.csv('/Users/Micho/Downloads/Healthcare_Associated_Infections_-_Hospital.csv')
dim(data)
## [1] 222864 16
colnames(data)
## [1] "Provider.ID" "Hospital.Name" "Address"
## [4] "City" "State" "ZIP.Code"
## [7] "County.Name" "Phone.Number" "Measure.Name"
## [10] "Measure.ID" "Compared.to.National" "Score"
## [13] "Footnote" "Measure.Start.Date" "Measure.End.Date"
## [16] "Location"
The columns we will focus on are Hospital.Name, State, Measure.ID, Compared.to.National, and Score.
Most of the other columns provide identifying information that will not be useful in discovering the prevalence of certain HAIs. The date ranges the data span (6 months to 1 year) are too broad to provide useful information about how infections relate to the time of year.
# get predicted cases (ELIGCASES), observed cases (NUMERATOR), SIR measure (SIR)
rows_to_keep = which(data$Measure.ID %like any% c("%ELIGCASES%", "%NUMERATOR%", "%SIR%"))
# filter and mutate data set
data.filtered = data[rows_to_keep, ] %>%
# ignore rows with footnote comments
filter(Score != "Not Available" & Footnote == '') %>%
# convert data type of Score from factor to numeric ->
# requires conversion from factor to character, then conversion from character to numeric
mutate(Score=as.numeric(as.character(Score))) %>%
# two or more hospitals can have the same name yet be in different city & state,
# so include the city & state in name for group by
mutate(Hospital.Name=sprintf("%s (%s, %s)", as.character(Hospital.Name),
as.character(City), as.character(State))) %>%
# keep only the relevant columns
select(Hospital.Name, State, Measure.ID, Compared.to.National, Score)
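The row filter above matches `Measure.ID` against several SQL-style patterns with `%like any%` from DescTools. As a rough base R sketch of the same idea, assuming the intent is a case-sensitive "contains any of these substrings" test, the patterns can be joined into one regular expression and passed to `grepl()`. The `Measure.ID` values below are illustrative examples of the naming scheme, not a full list.

```r
# Illustrative Measure.ID values following the HAI_<n>_<SUFFIX> naming scheme
measure_ids <- c("HAI_1_SIR", "HAI_1_NUMERATOR", "HAI_1_ELIGCASES",
                 "HAI_1_CI_LOWER", "HAI_1_DOPC_DAYS")

# Substrings to keep: predicted cases, observed cases, and the SIR itself
patterns <- c("ELIGCASES", "NUMERATOR", "SIR")

# Join the substrings into one regular expression: "ELIGCASES|NUMERATOR|SIR"
keep <- grepl(paste(patterns, collapse = "|"), measure_ids)

# Only the three matching IDs remain
print(measure_ids[keep])
```

This avoids the extra dependency, at the cost of writing the patterns as regular expressions rather than `%`-style wildcards.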
# extract the observed occurrences for each infection type
infection_type = data.filtered %>%
filter(Measure.ID %like any% c("%NUMERATOR%")) %>%
group_by(Measure.ID) %>%
# total the observed infections (Score) across all hospitals for each measure
summarise(count=sum(Score)) %>%
arrange(desc(count))
labs = c("HAI_6_NUMERATOR" = "C.diff. Intestinal Infection",
"HAI_2_NUMERATOR" = "Catheter-Associated Urinary Tract Infection",
"HAI_2a_NUMERATOR" = "Catheter-Associated Urinary Tract Infection (ICU only)",
"HAI_3_NUMERATOR" = "Surgical Site Infection from Colon Surgery",
"HAI_5_NUMERATOR" = "MRSA Bloodstream Infection",
"HAI_1_NUMERATOR" = "Central Line-Associated Bloodstream Infection",
"HAI_1a_NUMERATOR" = "Central Line-Associated Bloodstream Infection (ICU only)",
"HAI_4_NUMERATOR" = "Surgical Site Infection from Abdominal Hysterectomy")
# figure size is set by the graphics device or chunk options; ggplot() has no width/height arguments
ggplot(infection_type, aes(x=reorder(Measure.ID, count), y=count, group=1)) +
coord_flip() + # change from vertical plot to horizontal plot
geom_bar(stat="identity") +
scale_x_discrete(labels=labs) +
labs(x="Infection Type", y="Count") +
labs(title="Overall National HAI Frequency") +
theme(plot.title=element_text(hjust=0.5)) # center title
We can get a general, national-level picture of HAIs from the plot above that shows the total number of actual occurrences of each HAI nationally. C.diff. intestinal infections are by far the most common type of HAI.
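A per-measure tally can mean two different things: the number of hospital rows that report the measure, or the sum of the observed infections in those rows. The sketch below uses made-up data (the `toy` data frame is hypothetical) to show that the two summaries can rank infection types differently, so it is worth checking which one a plot is based on.

```r
# Made-up observed-infection rows: one row per hospital and measure
toy <- data.frame(
  Measure.ID = c("HAI_6_NUMERATOR", "HAI_6_NUMERATOR",
                 "HAI_2_NUMERATOR", "HAI_2_NUMERATOR", "HAI_2_NUMERATOR"),
  Score = c(40, 25, 3, 5, 2)
)

# Number of reporting rows per measure
rows_per_measure <- tapply(toy$Score, toy$Measure.ID, length)

# Total observed infections per measure
total_per_measure <- tapply(toy$Score, toy$Measure.ID, sum)

print(rows_per_measure)
print(total_per_measure)
```

Here HAI_2 has more reporting rows (3 versus 2), while HAI_6 has far more observed infections (65 versus 10).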
# determine average SIR measurement for each state
SIR_avg_by_state = data.filtered %>%
filter(Measure.ID %like any% c("%SIR%")) %>%
group_by(State) %>%
summarise(SIR_average=mean(Score))
# create column for color differentiation
SIR_avg_by_state$color = ifelse(SIR_avg_by_state$SIR_average > 1.0, "> 1.0", "<= 1.0")
# figure size is set by the graphics device or chunk options; ggplot() has no width/height arguments
ggplot(SIR_avg_by_state, aes(x=State, y=SIR_average, fill=color)) +
geom_bar(stat="identity") +
scale_fill_manual(values=c("#999999", "#FF6666")) + # change bar colors
theme(axis.text.x=element_text(angle=90, hjust=1, size=8.0)) + # adjust x-axis labels
labs(y="Average SIR Measurement") +
labs(title="Average Standardized Infection Ratio\n(SIR) Measurement by State/Territory") +
theme(plot.title=element_text(hjust=0.5)) + # center title
guides(fill=guide_legend(title="SIR", reverse=TRUE)) # modify legend title and ordering
The plot above shows the average SIR for each U.S. state and territory. Only Rhode Island and the Virgin Islands have an average SIR above 1.0, meaning that, on average, their hospitals observed more HAIs than predicted (based on historical data analyzed by the National Healthcare Safety Network). This is an aggregate measure: it describes the state or territory as a whole, not any individual hospital, but it can be informative about the general level of risk there. It is also worth observing that no U.S. state is missing from the plot, which suggests that HAIs are present to some extent in every state.
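The state-level figure here is an unweighted mean of hospital-level SIR values, so a small hospital counts as much as a large one. As a rough sketch on made-up numbers, the snippet below contrasts that mean with a pooled ratio (total observed over total predicted). The pooled ratio is shown only as an alternative for comparison, not as the method used in this analysis, and the `toy` data frame is hypothetical.

```r
# Made-up hospitals within one state
toy <- data.frame(
  hospital  = c("A", "B", "C"),
  observed  = c(2, 30, 4),
  predicted = c(1, 40, 5)
)

# Hospital-level SIRs: 2.00, 0.75, 0.80
toy$sir <- toy$observed / toy$predicted

# Unweighted mean of the hospital SIRs (each hospital counts equally)
mean_of_sirs <- mean(toy$sir)

# Pooled ratio: total observed infections over total predicted infections
pooled_sir <- sum(toy$observed) / sum(toy$predicted)

print(c(mean_of_sirs = mean_of_sirs, pooled_sir = pooled_sir))
```

In this example the unweighted mean is above 1.0 while the pooled ratio is below 1.0, because the one small hospital with a high SIR pulls the mean up.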
SIR_count_by_hospital = data.filtered %>%
filter(Measure.ID %like any% c("%SIR%")) %>% # get SIR scores
group_by(Hospital.Name) %>%
summarise(count_above=length(which(Score > 1.0)), count_below=length(which(Score <= 1.0))) %>%
arrange(desc(count_above)) %>%
slice(seq_len(20)) # get top 20
plot_ly(SIR_count_by_hospital, x=~count_above, y=~Hospital.Name,
name='Actual > Predicted', type='bar', width=975, height=600,
marker=list(color = 'black', line=list(color = 'rgb(8,48,107)', width = 1.5))) %>%
add_trace(x=~count_below, name='Actual <= Predicted', marker=list(color="lime")) %>%
layout(barmode="group", margin=list(t=70),
title="Combined Summary of the 8 Infection Types\nActual Count VS Predicted Count (20 Worst Hospitals)\n\n",
titlefont=list(size=15),
xaxis=list(title="Count"),
yaxis=list(side="left", title=NA, categoryarray=~count_above))
Note that not every hospital in the plot has an SIR for each of the eight infection types. Some values were listed as not available or could not be calculated (see the NOTE in the description), which explains the missing values for certain hospitals. The plot shows, for each hospital, the number of infection types that occurred more often than predicted (black) and the number that occurred no more often than predicted (green). According to the plot, Chesapeake General Hospital in Chesapeake, VA had more actual occurrences than predicted for all eight infection types. Notably, this hospital is in neither Rhode Island nor the Virgin Islands, the only state and territory whose average SIR exceeded 1.0. Next we look more closely at the infection frequency for this hospital to determine which HAI contributes most and how the hospital-specific distribution compares with the national distribution.
# filter to get values of actual infection counts
worst_hospital = data.filtered[data.filtered$Hospital.Name == "CHESAPEAKE GENERAL HOSPITAL (CHESAPEAKE, VA)", ] %>%
filter(Measure.ID %like any% c("%NUMERATOR%")) %>%
select(Measure.ID, Score)
# figure size is set by the graphics device or chunk options; ggplot() has no width/height arguments
ggplot(worst_hospital, aes(x=reorder(Measure.ID, Score), y=Score, group=1)) +
coord_flip() + # change from vertical plot to horizontal plot
geom_bar(stat="identity") +
scale_x_discrete(labels=labs) +
labs(x="Infection Type", y="Count") +
labs(title="Infection Frequency at Chesapeake\nGeneral Hospital (Chesapeake, VA)") +
theme(plot.title=element_text(hjust=0.5)) # center title
The plot above shows the number of actual occurrences of each infection type at Chesapeake General Hospital in Chesapeake, VA. The distribution is somewhat similar to the national distribution with intestinal infections being the most common HAI.
Healthcare-associated infections occur in every U.S. state and territory in the data. If hospitalized, one should be vigilant about intestinal infections and catheter-associated infections, the two most common HAIs at the national level.