CHG impact

Import data

CHG compliance

Data from the CHG compliance dashboard is fairly straightforward to download. Go to device type and only select central lines. Then export the heatmap to CSV

Show code

CHG_raw <- "TableauData/2024-09-22/Overall Rate by Hospital and Department.csv" %>% 
  read_csv(col_types = "ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc",
           skip = 1)
## These are the same months, but split across 2023-2024
# • `31-Dec-23` -> `31-Dec-23...23`
# • `31-Dec-23` -> `31-Dec-23...24`

CHG_clean <- CHG_raw %>%
  rename(Hospital=...1, Benchmark=...2, AtGoal=...3, Measure=...4) %>%
  
  mutate(Measure = case_when(str_detect(Measure, "Numerator") ~ "Numerator",
                             str_detect(Measure, "Denominator") ~ "Denominator",
                             TRUE ~ "Other")) %>%
  filter(Measure %in% c("Numerator", "Denominator")) %>%
  
  # Make long format
  pivot_longer(cols = -c(Hospital, Benchmark, AtGoal, Measure), 
               names_to = "Date", values_to = "n") %>%
  
  # Parse the numbers
  mutate(n = str_remove(n, ","), # Remove comma
         n = as.numeric(n)) %>%
  
  # make 31-Dec-23...23 amd 31-Dec-23...24 the same
  mutate(Date = case_when(Date=="31-Dec-23...23" ~ "31-Dec-23",
                          Date=="31-Dec-23...24" ~ "31-Dec-23",
                          TRUE ~ Date)) %>%
  
  filter(Date!="27-Aug-23") %>%
  
  # Group by Hospital + Month + Denominator
  group_by(Hospital, Date, Measure) %>%
  summarise(n = sum(n, na.rm = T)) %>%
  
  # Pivot data out 
  pivot_wider(names_from = Measure, values_from = n) %>%
  
  # Process dates
  mutate(Date = lubridate::dmy(Date)) %>%
  ungroup() 

rm(CHG_raw)

Show code

CHG_raw <- "TableauData/2024-10-18 CHG CL only/All Overall Rate by Hospital and Department_data.csv" %>% 
  read_csv()

CHG_clean <- CHG_raw %>%
  select(Hospital, Date=`Week of Census Time`,
         Denominator=`All Lines Denominator`,
         Numerator=`All Lines Numerator`) %>%
  
  mutate(Date = mdy(Date)) %>%
  
  group_by(Hospital, Date) %>%
  summarise(Denominator = sum(Denominator, na.rm = T),
            Numerator   = sum(Numerator, na.rm = T)) %>%
  ungroup() %>%
  
  # Remove most recent week 
  filter(Date != ymd("2024-10-13"))
   

rm(CHG_raw)

Show code

CHG_clean %>%
  rmarkdown::paged_table()

Note

I can only download the past 12 months of data from the dashboard. Will need to get access to more of the data to do a good pre-post analysis

This is a quick glimpse of the CHG compliance rates over time

Show code

# CHG_clean %>%
#   mutate(rate = Numerator / Denominator ) %>%
# 
#   ggplot(aes(x=Date, y=rate, color=Hospital)) + 
#   geom_line(key_glyph = "timeseries") +
#   geom_point() + 
#   scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +
#   facet_wrap("Hospital", ncol = 4) +
#   guides(color=F) +
#   theme_bw()

CHG_clean %>%
  mutate(rate = Numerator / Denominator ) %>%

  ggplot(aes(x=Date, y=rate, color=Denominator)) + 
  geom_line(key_glyph = "timeseries") +
  geom_point() + 
  scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +
  facet_wrap("Hospital", ncol = 4) +
  # scale_colour_gradient2(trans="log10", mid = "yellow", high = "red")
  scale_color_viridis_c(option = "D", trans="log10") +
  # guides(color=F) +
  theme_bw()

Because it is reported weekly (and everything else is reported monthly), we need to convert the weekly data into monthly data

How are the weeks calculated?

When you extract the data from Tableau, the dates it lists are all Sundays. It doesn’t look like it’s explicitly documented anywhere, but I assume that this means the week starts on Sunday, meaning the date provided is day 1 of 7 for that week

Some weeks span the course of multiple months¹. In these cases, we will group them with whichever month that the majority of their days fell under

Show code

week_to_month <- function(date, beginning=T) {
  require(lubridate)
  
  # Make sure you are given only one date, not a vector
  stopifnot(length(date)==1)
  
  # Calculate the midpoint of the week, which is 3 days away from the start 
  # (or end) date
  if(beginning) {
    mid_date <- date + days(3)
  } else({
    # Otherwise, treat the given day as the end of the week
    mid_date <- date - days(3)
  })
  
  # Extract the month and the year from this date
  new_month <- month(mid_date)
  new_year  <- year(mid_date)
  
  # Create a new date and return
  new_date <- ymd(paste0(new_year, "-", new_month, "-15"))
  return(new_date)
  
}

Show code

CHG_monthly <- CHG_clean %>% 
  #  Needed to only process one date at a time
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  group_by(Hospital, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T))

Show code

CHG_clean %>% 
  #  Needed to only process one date at a time
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  group_by(Hospital, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T)) %>%
  mutate(rate = Numerator / Denominator) %>%
 
  ggplot(aes(x=Month, y=rate, color=Hospital)) +
  geom_line(key_glyph = "timeseries") +
  geom_point() +
  scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +
  facet_wrap("Hospital", ncol = 4) +
  guides(color=F) +
  theme_bw()

NHSN data

Show code

### THIS WAS A CHUNK FOR OBSERVED RATES, BUT IT'S INCLUDED IN SIR FILE NOW
CLABSI_abs_raw <- read_csv("TableauData/2024-09-22/Observed Infections by Month & Year.csv")

CLABSI_abs <- CLABSI_abs_raw %>%
  select(-LOCATIONS) %>%
  rename(Hospital=ORGID) %>%
  
  # Make long format
  pivot_longer(cols = -Hospital, 
               names_to = "Date", values_to = "n") %>%
  
  mutate(Date = ymd(paste0(Date, "-1")),
         Month = month(Date),
         Year = year(Date),
         Qrt = quarter(Date))

rm(CLABSI_abs_raw)

CLABSI_abs %>%
  ggplot(aes(x=Date, y=n, color=Hospital)) + 
  # geom_jitter(key_glyph = "timeseries") +
  geom_line(key_glyph = "timeseries") + 
  scale_x_date(date_breaks = "6 months", date_labels = "%b\n'%y") +
  facet_wrap("Hospital", ncol = 3) +
  guides(color=F) +
  theme_bw() +
  labs(y="# CLABSI")

Workflow

This isn’t as straightforward of a download from Tableau, so to keep it reproducible this is my workflow for downloading the data

Navigate to the NHSN Infections Dashboard, go to the Compare Hospitals dashboard
Select “all” for the Month & Year drop down
Ensure the Infection Type is set to CLABSI
Use the Hospital drop down to select the hospital of interest

Once you’ve selected the hospital, export the data to CSV with the following steps:

Click the “Observed & Expected by Month & Year” panel (bottom left)
Click the download icon in the top right corner (close to the share option). There should be an option to download data (if it’s grey’ed out, something is wrong). Clicking the data icon opens a new window
In the new window, select Full Data (instead of summary)
Click Show Fields, scroll up and select “all”
Finally, hit download. To keep it organized, I renamed the downloaded file with the hospital’s abbreviation

I did the above steps for all of the hospitals that I could select in Tableau

Show code

SIR_junk <- "TableauData/2024-10-09 CLASBI SIR" %>%
  
  # Use `fs` package to get the names of all of the CSVs
  fs::dir_ls() %>%
  
  # Read all of the CSVs using map
  purrr::map(readr::read_csv, 
             col_types = "ccccccccccccccccccccccccccccccccccccccc")  %>%
  
  # Bind into a single table
  list_rbind()

Missing data

This has quite a bit of missing data (since we downloaded all of the fields), so we can take a quick look at what is missing

Show code

library(naniar)
naniar::vis_miss(SIR_junk ) + coord_flip()

This is seen a little easier in table format, showing quite a few fields have 100% missing data, which we will go ahead and remove

Show code

naniar::miss_var_summary(SIR_junk) %>%
  rmarkdown::paged_table()

Show code

to_remove <- naniar::miss_var_summary(SIR_junk) %>% 
  filter(pct_miss==100) %>% 
  dplyr::pull(variable)

df_temp <- SIR_junk %>%
  select(-to_remove)

rm(to_remove)

Renaming and dropping columns

There are also some uninformative columns (meaning that they say the same thing in every row) which we should remove. We also need to convert each column to the correct data type and give them better names, which we will to here.

Show code

df_temp %>%
  mutate(across(everything(), as.factor)) %>%
  summary()

Show code

## Not run, but this verifies the assumptions below

# Is SUMMARYYM == Month of SUMMARYYM
df_temp %>%
  filter(SUMMARYYM != `Month of SUMMARYYM`)

# Is ORGID == ORGID (test)
df_temp %>%
  filter(ORGID != `ORGID (test)`)

# Verify that NUMPRED & INFCOUNT are all numeric data
df_temp %>% 
  filter(!is.na(NUMPRED)) %>%
  mutate(NUMPRED = as.numeric(NUMPRED)) %>%
  filter(is.na(NUMPRED))
df_temp %>% 
  filter(!is.na(INFCOUNT)) %>%
  mutate(INFCOUNT = as.numeric(INFCOUNT)) %>%
  filter(is.na(INFCOUNT))

# Is INFCOUNT == Infections Observed
df_temp %>%
  mutate(INFCOUNT = as.numeric(INFCOUNT),
         `Infections Observed` = as.numeric(`Infections Observed`)) %>%
  filter(INFCOUNT != `Infections Observed`)

# Verify that ORGID captures CCN, `Org ID (2)`, `Org ID (copy)`
df_temp %>% 
  count(ORGID, CCN, `Org ID (2)`, `Org ID (copy)`)

# Verify that NUMCLDAYS is all numeric data
df_temp %>% 
  filter(!is.na(NUMCLDAYS)) %>%
  mutate(NUMCLDAYS = as.numeric(NUMCLDAYS)) %>%
  filter(is.na(NUMCLDAYS))

# Verify that SIR is all numeric data (where it is not missing)
df_temp %>% 
  filter(!is.na(SIR)) %>%
  mutate(SIR = as.numeric(SIR)) %>%
  filter(is.na(SIR))

# Is SIR == SIR (all)
df_temp %>%
  filter(SIR != `SIR (all)`)

These are the changes we will make to these columns. The old name of the column is shown in [brackets]

Month [Month of SUMMARYYM]: Character of the month-year. Will set the day to be the 15th
Date [derived from Month of SUMMARYYM]: Date Date format of Month, with the day set to be the 15th
removed [Infection Type (fix)]: All that is said before was “CLABSI” (for all rows, no missing data)
removed [LOCATION]: All that is said before was “(All Locations)” (for all rows, no missing data)
removed [ORGID (test)]: This is the same as [ORGID], so we will just keep Hosp
removed [Infection Type]: All that is said before was “CLA” (for all rows, no missing data)
Hosp [ORGID]: Factor of the hospital abbreviation
data_source [Table Name]: Factor; appears to be the CSV file that the data was pulled from
removed [INFCOUNT]: Aside from having decimal points, this is the same as [Infections Observed], so we will just keep Inf_obs
Inf_pred [NUMPRED]: Numeric
removed [CCN]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate
Inf_obs [Infections Observed]: Numeric, looks like it’s the number of infections that occured
removed [LOCATIONS]: All that is said before was “(All Locations)” (for all rows, no missing data)
CL_days [NUMCLDAYS]: Numeric
removed [ORGID (2)]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate
removed [ORGID (copy)]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate
SIR [SIR]: Numeric
removed [SIR (all)]: This is the same as [SIR], so we will just keep SIR
SIR_p [SIR_PVAL]: Character. There are periods for some of these p values, which we will make NA
SIR_95CI [SIR95CI]: Character
removed [SUMMARYYM]: This is the same as [Month of SUMMARYYM], so we will just keep Month

Show code

SIR_full <- df_temp %>%
  mutate(Month    = `Month of SUMMARYYM`,
         Date     = dmy(str_glue("15-{Month}")),
         Hosp     = as_factor(ORGID),
         Inf_pred = as.numeric(NUMPRED),
         Inf_obs  = as.numeric(`Infections Observed`),
         CL_days  = as.numeric(NUMCLDAYS),
         SIR      = as.numeric(SIR),
         SIR_p    = SIR_PVAL,
         SIR_95CI = SIR95CI,
         data_source = as_factor(`Table Name`),
         # .keep="none" is the same as transmute, keeps only new columns
         .keep="none") %>%
  
  # Reorder SIR to be with p-value & CI
  relocate(SIR, .before = SIR_p) %>%
  
  # Replace p=. with NA 
  mutate(SIR_p = na_if(SIR_p, "."))

SIR_full %>% rmarkdown::paged_table()

Show code

rm(df_temp, SIR_junk)

SIR

You can see that the SIR is not calculated for most of the hospitals in the data set. Below is a table showing what percent of each variable is missing, stratified by hospital. This is because the SIR is not calculated when the number of predicted infections are less than 1, per NHSN resources (pg 5). More on this later

Show code

# gg_miss_upset(SIR)
SIR_full %>%
  group_by(Hosp) %>%
  miss_var_summary() %>%
  select(-n_miss) %>%
  pivot_wider(names_from = variable, values_from = pct_miss) %>% 
  select(Hosp, starts_with("SIR")) %>%
  knitr::kable(caption = "Percent of variable that is missing, by hospital")

Percent of variable that is missing, by hospital
Hosp	SIR	SIR_p	SIR_95CI
BMC	100.00000	100.00000	100.00000
BRN	100.00000	100.00000	100.00000
BRX	100.00000	100.00000	100.00000
CCMC	100.00000	100.00000	100.00000
GMH	100.00000	100.00000	100.00000
GRMC	100.00000	100.00000	100.00000
HRS	100.00000	100.00000	100.00000
JAX	100.00000	100.00000	100.00000
JMC	100.00000	100.00000	100.00000
PRN	100.00000	100.00000	100.00000
PVH	100.00000	100.00000	100.00000
RMH	100.00000	100.00000	100.00000
SMR	100.00000	100.00000	100.00000
STJ	100.00000	100.00000	100.00000
UHC	100.00000	100.00000	100.00000
UTN	100.00000	100.00000	100.00000
WHL	97.05882	97.05882	97.05882
WTZ	100.00000	100.00000	100.00000
WVUH	0.00000	0.00000	0.00000

Hospital data

This comes from a Google Sheet where I store some of the info for each hospital

Show code

library(googlesheets4)
options(gargle_oauth_email = TRUE)
gs4_auth(cache = TRUE)
if(!exists("hospitals")){
  hospitals_raw <- paste0("https://docs.google.com/",
                          "spreadsheets/d/",
                          "1zZ1o1HhzOPrbj2KBdGsaO4rJvLDPMWqN0ogjMCNHfGw/") %>%
    read_sheet(sheet = "CHG Hospitals")
  
  hospitals <- hospitals_raw
  rm(hospitals_raw)
}
hospitals %>% rmarkdown::paged_table()

The reporting across these two databases aren’t exactly the same for the hospitals. There are some hospitals that show up in the CHG dataset but not the NHSN dataset. Specifically, Children’s, Fairmont, St Francis, and Thomas.

Show code

hospitals %>%
  select(NHSN_abbv, CHG_abbv, name_short) %>%
  full_join(count(SIR_full, Hosp), by=c("NHSN_abbv"="Hosp")) %>%
  rename(NHSN=n) %>%
  full_join(count(CHG_clean, Hospital), by=c("CHG_abbv"="Hospital")) %>%
  rename(CHG=n) %>% 
  relocate(name_short) %>%
  filter(!is.na(NHSN_abbv)|!is.na(CHG_abbv)) %>%
  
  gt(rowname_col = "name_short") %>%
  fmt_missing() %>%
  
  tab_spanner(
    label = '# rows',
    columns = c(NHSN, CHG)
  ) %>%
  #  cols_label() %>%

  tab_header(
    title = md('**Hospitals in this dataset**'),
    # subtitle = md('Showing counts of ')
  ) %>%
  opt_stylize(5)

	NHSN_abbv	CHG_abbv	# rows
Hospitals in this dataset
	NHSN_abbv	CHG_abbv	NHSN	CHG
Ruby	WVUH	WVU/UHA	68	54
Childrens	—	CH	—	54
UHC	UHC	UHC	67	54
Camden Clark	CCMC	CCM	68	54
Berkeley	BMC	BMC	67	54
Barnesville	BRN	BRN	39	10
Braxton	BRX	BRX	31	37
Fairmont	—	FMT	—	53
Garrett	GRMC	GRMC	68	54
Grant	GMH	GMH	64	51
Harrison	HRS	HRS	66	13
Jackson	JAX	JAX	68	36
Jefferson	JMC	JMC	67	46
Potomac Valley	PVH	PVH	68	54
Princeton	PRN	PRN	68	54
Reynolds	RMH	RMH	68	54
St Francis	—	STF	—	17
St Joes	STJ	STJ	67	49
Summersville	SMR	SMR	67	54
Thomas	—	TMH	—	27
Uniontown	UTN	UTN	67	54
Wetzel	WTZ	WTZ	62	37
Wheeling	WHL	WHL	68	54

What should we do?

I assume that Children’s (CH) should be lumped with Ruby (WVUH) but what about Fairmont (FMT)? Until I hear otherwise, I will

Lump CH with WVUH
Ignore FMT, STF, and TMH

Pooling together predicted infections

From the NHSN documentation

The standardized infection ratio (SIR) is a summary measure used to track HAIs at a national, state, or local level over time. The SIR adjusts for various facility and/or patient-level factors that contribute to HAI risk within each facility. The method of calculating an SIR is similar to the method used to calculate the Standardized Mortality Ratio (SMR), a summary statistic widely used in public health to analyze mortality data. In HAI data analysis, the SIR compares the actual number of HAIs reported to the number that would be predicted, given the standard population (i.e., NHSN baseline), adjusting for several risk factors that have been found to be significantly associated with differences in infection incidence.

The formula for SIR is:

\[ SIR = \frac{Observed\ HAIs}{Predicted\ HAIs} \]

The problem

The SIR is only calculated when the number of predicted HAIs is at least 1, which prevents extremely high values and fluctuations in the SIR. As our data reports \(CLABSI_{obs}\) and \(CLABSI_{pred}\) for each hospital-month, very few instances have a \(CLABSI_{pred} \ge 1\) , so the SIR is not calculated in most cases.

Moreover, the predicted number of CLABSIs varies greatly among hospitals, as shown in an extract of the raw data below.

Show code

SIR_full %>%
  filter(Hosp %in% c("WVUH", "UHC", "GRMC", "JAX")) %>%
  select(Hosp, Month, Inf_pred, Inf_obs, SIR) %>%
  group_by(Hosp) %>%
  filter(row_number()<4) %>%
  ungroup() %>%
  # Add seperators, which requires changing everything to chars
  mutate(across(everything(), as.character)) %>%
  add_row(Hosp = "...", Month="...", 
          Inf_pred = "...", Inf_obs = "...", 
          SIR = "...", .after = 3) %>%
  add_row(Hosp = "...", Month="...", 
          Inf_pred = "...", Inf_obs = "...", 
          SIR = "...", .after = 7) %>%
  add_row(Hosp = "...", Month="...", 
          Inf_pred = "...", Inf_obs = "...", 
          SIR = "...", .after = 11) %>%
  knitr::kable(caption = "Sample of NSHN data for hospitals of various sizes")

Sample of NSHN data for hospitals of various sizes
Hosp	Month	Inf_pred	Inf_obs	SIR
GRMC	Jan-19	0.046	0	NA
GRMC	Feb-19	0.038	0	NA
GRMC	Mar-19	0.033	0	NA
…	…	…	…	…
JAX	Jan-19	0.002	0	NA
JAX	Feb-19	0.003	0	NA
JAX	Mar-19	0	0	NA
…	…	…	…	…
UHC	Jan-19	0.424	0	NA
UHC	Feb-19	0.448	1	NA
UHC	Mar-19	0.401	0	NA
…	…	…	…	…
WVUH	Jan-19	2.925	2	0.684
WVUH	Feb-19	2.689	1	0.372
WVUH	Mar-19	3.147	3	0.953

These histograms show the more detailed breakdown of each hospital’s monthly expected CLABSIs. As expected, some of the small hospitals (e.g. critical access) have incredibly low \(CLABSI_{pred}\) (some are zero), whereas larger hopsitals like Ruby (WVUM) mostly have monthly expected rates of >1

Show code

SIR_full %>%
  ggplot(aes(x=Inf_pred)) +
  geom_histogram() +
  facet_wrap("Hosp", scales = "free_y", ncol = 4) +  
  theme_bw() +
  scale_x_log10(breaks = scales::log_breaks(),
                labels = scales::label_log()) +
  labs(title="Histogram of monthly predicted CLABSIs, by hospital",
       x = "Predicted infections (visually log transformed)")

Solutions

The NHSN documentation (page 12) suggests using a longer time period in order to reach the 1.0 threshold of predicted infections.

This looks like it may work for some of the medium sized hospitals (e.g. UTN, CCMC, RMH), but most of the smaller hospitals don’t reach that threshold across the entire time that we have data for (not to mention what would happen if you split it by pre-post). This is demonstrated in the table below, which shows the total (sum) of each hospital’s expected & observed CLABSIs (across the entire time period)

Show code

SIR_full %>%
  group_by(Hosp) %>%
  summarise(expected = sum(Inf_pred),
            observed = sum(Inf_obs)) %>%
  arrange(expected) %>%
  rmarkdown::paged_table()

Show code



# SIR_full %>%
#   mutate(Rollout = if_else(Date < ymd("2023-09-01"), "Before", "After")) %>%
#   group_by(Hosp, Rollout) %>%
#   summarise(expected = sum(Inf_pred),
#             observed = sum(Inf_obs)) %>%
#   pivot_wider(names_from = Rollout,
#               names_glue = "{Rollout}_{.value}",
#               values_from = c(expected, observed)) %>%
#   mutate(Total_expected = Before_expected + After_expected,
#          Total_observed = Before_observed + After_observed) %>%
#   relocate(Hosp, Before_expected, After_expected, Total_expected, 
#            Before_observed, After_observed, Total_observed)

#1: Pooling the months

Aggregating the predicted infections across multiple months (until the sum across those months reaches 1) is what is suggested by NHSN. Because each hospital has different rates of predicted infections, the length of time needed to pool across (e.g. 2 months, 5 months, 1 year) may vary depending on the hospital.

I made a function that splits each hospitals data into chunks² of time (e.g. April 2024 to July 2024), trying to keep the number of predicted infection roughly even within each chunk. This means that the duration of each chunk is variable, depending on how long it takes to reach a sum(Inf_pred) >= 1, but that now each row of data should have at least one³ predicted infection, allowing us to calculate the SIR.

Code for the function

create_buckets <- function(data, chunk_size=1, round_up=0.8, 
                           date_col="Date", data_col="Inf_pred",
                           verbose=F) {
  #' Creating buckets for predicted infections
  #'
  #' This function takes a data.frame (`data`) from containing a column with 
  #' the number of predicted infections (`data_col`) and splits it up into
  #' chunks where the sum of `data_col` within each chunk is at least 
  #' `chunk_size`
  #' 
  #' @param data Data frame containing a column of predicted infections
  #' @param chunk_size Target minimum number of infections to have in each 
  #' bucket. This should ideally be 1
  #' @param round_up At what cutoff (of remainder of infections) should you 
  #' increase the bucket size? 
  #' @param date_col Label for a column with date objects
  #' @param data_col Label for the column in `data` with the predicted infections
  #' @param verbose Should warnings be given?
  #'
  #' @return
  #' The same data frame, but with a new column named bucket containing the
  #' bucket IDs
  
  # Ensure data is sorted by date
  data <- data %>% arrange(!!rlang::sym(date_col))
  
  # This is the total number of predicted infections
  total <- data %>% dplyr::pull(data_col) %>% sum()
  
  ### FIRST: Calculate the number of buckets to break things down into
  
  # Calculate the number of buckets to divide things up into. This is floor 
  # division, meaning that it will round down (to make sure each chunk has 
  # greater than then size specified in chunk_size). For example, if there is 
  # a total of 2.3 predicted infections, it will split it into two chunks 
  # since 2.3 %/% 1 = 2 (assuming a target chunk size of at least 1)
  n_buckets <- total %/% chunk_size
  
  if(total %% chunk_size >= round_up){
    # If the remainder of the total infections is greater than the cutoff given
    # in round_up (e.g. total predicted of 2.8 has a remiander of 0.8), increase
    # the bucket size by one to prevent wasted data
    n_buckets <- n_buckets + 1
    if(verbose) warning("increasing bucket size")
  }
  
  ### SECOND: Determine the actual size chunk size to target
  actual_chunk_size <- total / n_buckets
  
  ### THIRD: Define values used during the loop
  # This is the vector where we store what bucket that row falls under
  bucket <- c()
  # This is the starting bucket ID
  bucket_id <- 1
  # This represents the running total (cumulative sum) of predicted infections
  # for the bucket
  running_total <- 0
  
  ### FOURTH: Loop thriught the data
  # Run through each row of the data, keeping a running sum of predicted 
  # infections. Once the running total exceeds our target chunk size (see
  # `actual_chunk_size` in step 2), reset `running_total` to zero and move
  # to then next bucket
  for (i in 1:nrow(data)) {
    running_total <- running_total + dplyr::pull(data, data_col)[i]
    
    # If the running total exceeds the actual chunk size, assign the current
    #  bucket, reset the running total, and move to the bucket
    if (running_total >= actual_chunk_size) {
      bucket <- c(bucket, bucket_id) # append bucket_id to running list
      bucket_id <- bucket_id + 1
      running_total <- 0
      
    } else {
      bucket <- c(bucket, bucket_id) # append bucket_id to running list
    }
  }
  ### FIFTH: Add the vector of bucket_id's calculated in the loop as a 
  #          column to the original data, and return the data frame
  data$bucket <- bucket
  return(data)
}

For a larger hospital like UHC, we could probably get away with lumping things into 2-3 month bins, but medium-small hospitals (like GRMC & RMH) require much longer periods of time to reach a (cumulative) total of 1 infection. This is demonstrated in the two respective tables below:

Show code

SIR_full %>% 
  filter(Hosp=="UHC") %>%
  select(Month, Date, Hosp, Inf_pred) %>%
  create_buckets() %>%
  group_by(bucket) %>%
  # View()
  filter(bucket<3 | bucket>28) %>%
  mutate(cumsum = cumsum(Inf_pred)) %>%
  mutate(bucket = str_glue("Chunk #{bucket} (size={sum(Inf_pred)})")) %>%
  select(-Date, -Hosp) %>%
  gt(groupname_col = "bucket", rowname_col = "Month") %>%
  tab_stub_indent(everything(), 5) %>%
   cols_label(Inf_pred = "Monthly",
              cumsum = md("Chunk's running total")) %>%
  tab_spanner(
    label = md('**Predicted infections** (UHC)'),
    columns = c(Inf_pred, cumsum)
  ) %>%
  tab_header(
    title = 'Example of grouping months together in chunks',
    subtitle = md('Data is limited to UHC, **not all chunks shown**')
  ) %>%
  opt_stylize(5)

	Predicted infections (UHC)
Example of grouping months together in chunks
Data is limited to UHC, not all chunks shown
	Monthly	Chunk’s running total
Chunk #1 (size=1.273)
Jan-19	0.424	0.424
Feb-19	0.448	0.872
Mar-19	0.401	1.273
Chunk #2 (size=1.018)
Apr-19	0.372	0.372
May-19	0.312	0.684
Jun-19	0.334	1.018
Chunk #29 (size=1.609)
Mar-24	0.821	0.821
Apr-24	0.788	1.609
Chunk #30 (size=1.678)
May-24	0.918	0.918
Jun-24	0.760	1.678
Chunk #31 (size=0.915)
Jul-24	0.915	0.915

Show code

SIR_full %>% 
  filter(Hosp %in% c("GRMC", "RMH", "UHC")) %>%
  nest(.by = Hosp) %>%
  mutate(data = map(data, create_buckets)) %>%
  unnest(data) %>%
  
  
  # select(Month, Date, Hosp, Inf_pred) %>%
  # create_buckets() %>%
  group_by(Hosp, bucket) %>%
  summarise(
    median = qwraps2::median_iqr(Inf_pred),
    Inf_pred = sum(Inf_pred),
    # Inf_obs  = sum(Inf_obs),
    months = n(),
    start = min(Date),
    end   = max(Date)
  ) %>%
  ungroup() %>%
  # rowwise() %>%
  
  # Filter data to show select rows from UHC too
  mutate(keep = case_when(Hosp != "UHC" ~ TRUE,
                          bucket <3     ~ TRUE,
                          bucket >28    ~ TRUE,
                          TRUE          ~ FALSE)) %>%
  filter(keep) %>% select(-keep) %>%
  
  mutate(bucket = str_glue("Chunk #{bucket}")) %>%
  mutate(start = format(start, "%b '%y"),
         end = format(end, "%b '%y")) %>%
  relocate(months, median, .after=end) %>%
  # select(-Inf_obs)
  
  gt(groupname_col = "Hosp", rowname_col = "bucket") %>%
  tab_stub_indent(everything(), 5) %>%
  gt::cols_merge_range(start, end) %>%
   cols_label(start = md("Time span<br>&nbsp;"),
              Inf_pred = md("Total infections<br>(during chunk)"),
              median = md("Median (IQR)<br>monthly infections"),
              months = md("Duration<br>(months)")) %>%
  opt_stylize(5, color = "green") %>%
  
  tab_header(
    title = md('**Grouping months together in chunks**'),
    subtitle = md('Showing **predicted** infections; *select hospitals shown*')
  )

	Total infections (during chunk)	Time span	Duration (months)	Median (IQR) monthly infections
Grouping months together in chunks
Showing predicted infections; select hospitals shown
GRMC
Chunk #1	1.023	Jan '19–Aug '21	32	0.03 (0.02, 0.04)
Chunk #2	1.010	Sep '21–Nov '22	15	0.07 (0.04, 0.09)
Chunk #3	0.955	Dec '22–Aug '24	21	0.05 (0.04, 0.05)
RMH
Chunk #1	1.242	Jan '19–Nov '20	23	0.05 (0.04, 0.06)
Chunk #2	1.234	Dec '20–Jan '22	14	0.08 (0.07, 0.10)
Chunk #3	1.272	Feb '22–May '23	16	0.08 (0.07, 0.09)
Chunk #4	1.005	Jun '23–Aug '24	15	0.07 (0.05, 0.08)
UHC
Chunk #1	1.273	Jan '19–Mar '19	3	0.42 (0.41, 0.44)
Chunk #2	1.018	Apr '19–Jun '19	3	0.33 (0.32, 0.35)
Chunk #29	1.609	Mar '24–Apr '24	2	0.80 (0.80, 0.81)
Chunk #30	1.678	May '24–Jun '24	2	0.84 (0.80, 0.88)
Chunk #31	0.915	Jul '24–Jul '24	1	0.92 (0.92, 0.92)

A visual representation of this process is shown below for RMH, with the original data shown in red and the aggregated data in grey⁴

Show code

SIR_full %>% 
  filter(Hosp=="RMH") %>%
  select(Month, Date, Hosp, Inf_pred) %>%
  create_buckets() %>%
  group_by(bucket) %>%
  summarise(total = sum(Inf_pred),
            start = min(Date),
            end = max(Date)) %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=total)) +
  geom_col(aes(x=Date, y=Inf_pred), data=filter(SIR_full, Hosp=="RMH"), fill="red") +
  theme_bw() +
  scale_x_date(date_breaks = "6 months", date_labels = "%b\n'%y") +
  labs(title="Predicted infections vs time: RMH",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

This is the same type of figure among most of the hospitals. I’ve omitted hospitals that had fewer than 1 total predicted infections (across the entire data set), as we will address these in the next section. I’ve also omitted WVUH (Ruby) as it always has at least one monthly predicted infection

Show code

df <- SIR_full %>%
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data) %>% 
  # select(Month, Date, Hosp, Inf_pred) %>%
  # create_buckets() %>%
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date)) 

# df %>%
#   group_by(Hosp) %>%
#   filter(n()>1) %>%
#   filter(Inf_pred==min(Inf_pred))

# df %>%
#   group_by(Hosp) %>%
#   filter(n()>1) %>%
#   filter(Hosp!="WVUH") %>%
#   # mutate(duration = as.numeric(days(end - start)) / (24 * 60 * 60)) %>%
#   mutate(months = time_length(end - start, unit="months")) %>%
#   ggplot(aes(x=Hosp, y=Inf_pred)) + 
#   geom_boxplot() +
#   geom_jitter(aes(color=months),alpha=.5)
  
df %>%
  group_by(Hosp) %>%
  filter(n()>1) %>% 
  filter(Hosp!="WVUH") %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(13), xmax=end+days(13), ymin=0, ymax=Inf_pred)) +
  facet_wrap("Hosp") + 
  geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,
           data=filter(SIR_full, Hosp %in% pull(filter(group_by(df, Hosp), n()>1, Hosp!="WVUH"),Hosp)) ) +
  theme_bw() + 
  labs(title="Predicted infections vs time",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

This allows us to finally calculate the SIR across more of the hospitals than we could at first (when we could only view the SIR for WVUH)

Show code

df %>%
  group_by(Hosp) %>%
  filter(n()>1) %>% 
  # filter(Hosp!="WVUH") %>%
  mutate(SIR = Inf_obs / Inf_pred) %>%
  mutate(SIR = if_else(SIR==0, 0.1, SIR)) %>%
  ggplot(aes(fill=SIR)) +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=SIR)) +
  facet_wrap("Hosp") +
  theme_bw() +
  scale_fill_viridis_c() +
  # scale_fill_gradient2(midpoint = 1)
  labs(title="SIR using pooled time periods",
       y = "Standardized Infection Ratio\n(SIR)")

Show code

rm(df)

#2: Pooling hospitals

In the previous section, we focused on aggregating predicted infections across multiple months. This allowed us to give each hospital “chunks” of time during which we would expect to see one observed infection.

When we did this, we skipped over⁵ hospitals that did not have a total of at least one predicted infection (across the entire dataset). These tended to be the smaller hospitals, often (but not always) critical access hospitals. In this section, we will return to those hospitals that we skipped over and try to find a way to get pool their predicted infections to reach thresholds greater than one.

Recall the table shown at the beginning of the Solutions section, which showed the total number of predicted infections by hospital. Below is the same table, supplemented with each hospital’s classification⁶ and their median monthly number of central line days:

Hospital classification terminology

I am open to using different terminology here for classification. One thought would be to do it by bed size (large, medium, small), but I don’t have an accurate source for each hospital’s bed size

Show code

SIR_full %>%
  group_by(Hosp) %>%
  summarise(expected = sum(Inf_pred),
            observed = sum(Inf_obs),
            # median_CL_days  = qwraps2::median_iqr(CL_days)
            median_CL_days  = median(CL_days)
            ) %>%
  arrange(expected) %>%
  left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%
  select(Hosp:median_CL_days, hosp_type) %>%
  relocate(hosp_type, .before = expected) %>%
  ungroup() %>%
  mutate(totalAboveOne = expected >=1,
         totalAboveOne = if_else(totalAboveOne, 
                                 ">= 1 total expected infection",
                                 "<1 total expected infection")) %>%
  
  gt(rowname_col = "Hosp", groupname_col = "totalAboveOne") %>%
  
  tab_stub_indent(everything(), 5) %>%
  
  cols_label(expected = md("**Predicted**"),
             observed = md("Observed"),
             # median_CL_days = md("Median (IQR) *monthly*<br>central line days"),
             median_CL_days = md("Median *monthly*<br>central line days"),
             hosp_type = md("Hospital Type")) %>%
   tab_spanner(
    label = md('Total infections'),
    columns = c(expected, observed)
  ) %>%
  gtExtras::gt_color_rows(
    columns = expected,
    domain = c(0,50)
    # palette = c("purple", "lightgrey", "green")
  ) %>%
  data_color(
    columns = median_CL_days,
    domain = c(0,1000),
    palette = "BuPu",
    # pal_type = "continuous"
  ) %>%
  tab_style_body(
    columns = hosp_type,
    values = "Critical access",
    style = cell_text(color = '#9f0101')
  ) %>%
  tab_style_body(
    columns = hosp_type,
    values = "Community",
    style = cell_text(color = '#5a3286')
  ) %>%
  tab_style_body(
    columns = hosp_type,
    values = "???",
    style = cell_fill(color = '#e5ce90')
  ) %>%

  tab_header(
    title = md('Total number of infections & central line days, by hospital'),
    subtitle = md(str_glue('Data shown is across all months (**{format(min(SIR_full$Date), "%b %Y")} to {format(max(SIR_full$Date), "%b %Y")}**)'))
  ) %>%
  opt_stylize(6, color="gray")

	Hospital Type	Total infections		Median monthly central line days
Total number of infections & central line days, by hospital
Data shown is across all months (Jan 2019 to Aug 2024)
	Hospital Type	Predicted	Observed	Median monthly central line days
<1 total expected infection
HRS	Critical access	0.023	0	0.0
BRN	Critical access	0.028	0	0.0
BRX	Critical access	0.067	0	5.0
JAX	Critical access	0.242	2	9.0
WTZ	???	0.262	0	1.0
STJ	Critical access	0.434	0	19.0
PVH	Critical access	0.670	1	30.0
GMH	Critical access	0.728	0	37.5
JMC	Critical access	0.896	0	47.0
SMR	Critical access	0.979	0	46.0
>= 1 total expected infection
GRMC	Critical access	2.988	2	67.5
RMH	Community	4.753	3	95.0
UTN	Community	13.707	36	288.0
PRN	Community	21.481	61	482.5
BMC	Community	21.961	6	359.0
CCMC	Community	36.191	24	487.0
WHL	Community	41.671	39	668.5
UHC	Community	43.610	34	747.0
WVUH	Academic	185.902	116	2433.0

Misclassifications?

GRMC is listed as a critical access hospital, but has a much higher number of predicted infections than other critical access hospitals. Any idea on what is causing that? Regardless, might make sense to look at it separate from other CAHs
WTZ didn’t have critical access designation when I looked it up, but it sure behaves like a CAH

By hospital type

The first solution that comes to mind would be to lump all the CAH together and treat them as the “same hospital”. The NHSN reference looks like they use an intercept-only model for CAHs (Table 3 on page 12), so that tells me they didn’t find many useful predictors within each critical access hospital.

If we lump together all of the critical access hospitals, we get some acceptable results, shown in the figure below. Again, I’m not showing the WVUH data since it throws off the scale of the Y-axis. The only issue here is WTZ still can’t be used and we’ve now pooled GRMC’s data with all of the critical access hospitals.

Show code

CAH <- hospitals %>% 
  filter(hosp_type=="Critical access") %>%
  pull(NHSN_abbv)

df <- SIR_full %>%
  mutate(Hosp = case_when(
    Hosp %in% CAH ~ "Critical access",
    TRUE ~ Hosp
  )) %>%
  group_by(Hosp, Month, Date) %>%
  # arrange(Date) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs = sum(Inf_obs),
            CL_days = sum(CL_days)) %>%
  ungroup() 

df2 <- df %>%
  
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            Hosp=="Critical access" ~ 1.05,
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data)  %>% 
  
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date)) 


df2 %>%
  group_by(Hosp) %>%
  # filter(n()>1) %>% 
  filter(Hosp!="WVUH") %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +
  facet_wrap("Hosp") + 
  geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,
           data=filter(df, Hosp!="WVUH") ) +
  theme_bw() + 
  labs(title="Predicted infections vs time",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

Show code



# df %>%
#   ggplot(aes(x=Inf_pred)) +
#   geom_histogram() +
#   facet_wrap("Hosp", scales = "free", ncol = 4)

rm(df, df2, CAH)

Central line days

Another option would be to use some cutoff to aggregate hospitals together. The one that comes to mind is central line days, as this tracks well with predicted number of infections (as would be expected)⁷. The natural cutoff in the dataset appears to be around 150 central line days per month, as shown below

Show code

SIR_full %>%
  left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%
  ggplot(aes(x=Hosp, y=CL_days)) +
  geom_boxplot(aes(color=hosp_type)) + 
  # geom_boxplot() + facet_wrap("hosp_type", scales = "free_x") +
  coord_flip() + 
  scale_y_sqrt() +
  geom_hline(yintercept=150, linetype="dashed", color = "black") +
  geom_segment(aes(x = 2.5, y = 300, xend = 1.5, yend = 150),
                  arrow = arrow(length = unit(0.25, "cm"))) +
  annotate("text", x=2.5, y=310, label="150", hjust="left") +
  labs(title="Monthly central line days, by hopsital",
       y = "CL days (transformed)") +
  theme_bw()

If we use this cutoff to aggregate smaller hospitals together, we get the most number of chunks per “hospital” among all of the solutions we’ve done this far. The trade off is, we now have tons of hospitals aggregated into one group (specifically BRN, BRX, GMH, GRMC, HRS, JAX, JMC, PVH, RMH, SMR, STJ, and WTZ)

Show code

cutoff <- 150

df <- SIR_full %>%
  
  # Calculate average CL_days, and rename Hosp if below cutoff
  group_by(Hosp) %>%
  mutate(mean_CLD = mean(CL_days),
         CL_rare = mean_CLD < cutoff) %>%
  mutate(Hosp = if_else(CL_rare, "CL rare", Hosp)) %>%
  ungroup() %>%
  
  group_by(Hosp, Month, Date) %>%
  # arrange(Date) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs = sum(Inf_obs),
            CL_days = sum(CL_days)) %>%
  ungroup() 
  
df2 <- df %>%
  
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            # Hosp=="CL rare" ~ 1.05,
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data)  %>% 
  
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date)) 
   
# df2 %>% filter(Hosp=="CL rare") %>% tail()

df2 %>%
  group_by(Hosp) %>%
  # filter(n()>1) %>% 
  filter(Hosp!="WVUH") %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +
  facet_wrap("Hosp", ncol = 2) + 
  geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,
           data=filter(df, Hosp!="WVUH") ) +
  theme_bw() + 
  labs(title="Predicted infections vs time",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

Show code


rm(cutoff, df, df2)

Joining the data

At this point we have a few data.frames we are working with

CHG data, which has the number of patients who got CHG bathing (Numerator) and the number of patients who were eligible (Denominator). This is at a hospital level where each row is the weekly data
NHSN data, which has the predicted (Inf_pred) and observed (Inf_obs) number of infections. This is monthly data for each hospital, but as shown in the “Pooling together predicted infections” section, we can aggregate it across months &/or hospitals
Hospital data, which comes from the Google Sheet

This is where it gets fun! To get the CHG data to line up with the NHSN data, we need to

Pool together the NSHN data, both across hospitals and time
Pool the CHG data across hospitals (e.g. lump all the CAH together), matching what we did in step 1
Pool the CHG data by month
Aggregate the CHG data across time, specific to that hospital’s chunks of time (since we chunked the NHSN into different time frames depending on the hospital

Show code

temp_NHSN <- SIR_full
temp_CHG  <- CHG_clean

Step 1: Aggregate NHSN data

This is effectively what we did in Pooling together predicted infections

Important

I will assume we are grouping together all hospitals that have an fewer than 150 central line days per month (on average)

Show code

cutoff <- 150

# Store the names of the small hospitals for later
small_hosp <- temp_NHSN %>%
  # Calculate average CL_days, and rename Hosp if below cutoff
  group_by(Hosp) %>%
  filter(mean(CL_days) < cutoff) %>%
  pull(Hosp) %>%
  unique() %>% 
  as.character()

# Make the NHSN data pooled by hospital and chunked into time buckets
temp_NHSN <- temp_NHSN %>%
  # Calculate average CL_days, and rename Hosp if below cutoff
  group_by(Hosp) %>%
  mutate(mean_CLD = mean(CL_days),
         CL_rare = mean_CLD < cutoff) %>%
  mutate(Hosp = if_else(CL_rare, "Small", Hosp)) %>%
  ungroup() %>%
  
  group_by(Hosp, Month, Date) %>%
  # arrange(Date) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs = sum(Inf_obs),
            CL_days = sum(CL_days),
            .groups = "drop") %>%
  ungroup() %>%
  
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            # Hosp=="Small" ~ 1.05,
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data)  %>% 
  
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date), 
            .groups = "drop")

Step 2: Aggregate CHG by hospital

As mentioned in the section on importing Hospital data, the CHG dashboard is more granular than the NHSN data (e.g. Childrens Hospital is split from Ruby) and some of the abbreviations differ between the datasets (e.g. CCMC vs CCM).

Here, we will fix those abbreviations (we will use the NHSN abbreviations) and merge CH with WVUH. We will also group together the smaller hospitals (e.g. CAH), based on what we did in step 1 above. Once all of that is done, we sum up everything up by hospital (or hospital group) & week.

Show code

temp_CHG <- temp_CHG %>%
  select(Hospital, Date, Denominator, Numerator) %>%
  left_join(select(hospitals, NHSN_abbv, CHG_abbv), 
            by=c("Hospital"="CHG_abbv")) %>%
  
  # Make a new column named Hosp, with the adjusted names
  # Rename hospital names to set us up to pool across names
  # !!! Be careful of which one you use here (Hospital vs NHSN_abbv)
  mutate(Hosp = case_when(Hospital=="CH"           ~ "WVUH",
                          NHSN_abbv %in% small_hosp ~ "Small",
                          TRUE                 ~ NHSN_abbv)) %>%
  
  # # Check your work
  # filter(Hospital!=NHSN_abbv | is.na(NHSN_abbv)) %>%
  # count(Hospital, NHSN_abbv, Hosp)
  select(-Hospital, -NHSN_abbv) %>%
  
  # Remove hospitals not listed in NHSN
  filter(Hosp %in% temp_NHSN$Hosp) %>%
  
  # Aggregate data
  group_by(Hosp, Date) %>%
  summarise(Denominator = sum(Denominator, na.rm = T),
            Numerator   = sum(Numerator, na.rm = T),
            .groups = "drop") %>%
  ungroup()

Step 3: Aggregate CHG by month

This is exactly what we did in the section where we imported the CHG compliance data, we just have to do it again now that we aggregated the CHG data by hospital

Show code

temp_CHG <- temp_CHG %>% 
  #  Needed to only process one date at a time
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  group_by(Hosp, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T))

Step 4: Match CHG chunks to NHSN chunks

At this point, we have NHSN data broken down into chunks, like the sample below

Show code

temp_NHSN %>%
  filter(Hosp %in% c("PRN", "Small", "CCMC")) %>%
  group_by(Hosp) %>%
  filter(bucket == max(bucket)) %>%
  select(Hosp, starts_with("Inf"), start, end) %>%
  knitr::kable(caption="Sample of NHSN data")

Sample of NHSN data
Hosp	Inf_pred	Inf_obs	start	end
CCMC	0.979	1	2024-07-15	2024-08-15
PRN	0.943	1	2024-06-15	2024-08-15
Small	0.917	0	2024-03-15	2024-08-15

To join the CHG compliance data with our chunked NHSN data, we will work row-by-row through the NHSN table. For each row, we will filter the CHG data:

The hospital name matches (\(Hosp_{\ CHG} = Hosp_{\ NHSN}\))
The month of the CHG data is after or equal to the start date in NHSN (\(Month_{\ CHG} \ge Start_{\ NHSN}\))
The month of CHG data is before or equal to the end date in NHSN (\(Month_{\ CHG} \le End_{\ NHSN}\))

For the sample data shown above, this would translate to these three sets of tables

Desired output, with rows of NHSN shown on the left, and filtered CHG data on the right

Hosp	Inf_pred	Inf_obs	start	end
CCMC	0.979	1	2024-07-15	2024-08-15

Hosp	Month	Denominator	Numerator
CCMC	2024-07-15	1998	1823
CCMC	2024-08-15	1887	1723

Hosp	Inf_pred	Inf_obs	start	end
PRN	0.943	1	2024-06-15	2024-08-15

Hosp	Month	Denominator	Numerator
PRN	2024-06-15	602	480
PRN	2024-07-15	712	581
PRN	2024-08-15	574	497

Hosp	Inf_pred	Inf_obs	start	end
Small	0.917	0	2024-03-15	2024-08-15

Hosp	Month	Denominator	Numerator
Small	2024-03-15	645	399
Small	2024-04-15	478	375
Small	2024-05-15	663	556
Small	2024-06-15	397	338
Small	2024-07-15	551	487
Small	2024-08-15	413	344

Code for the function

# Define a helper function that will filter the CHG data based on given parameters

match_CHG <- function(hosp, 
                      start, end, 
                      data_name,
                      date_col="Month",
                      hosp_col = "Hosp") {
  #' Matching CHG data to pooled NHSN data
  #' 
  #' @param hosp String of the hospital name, used to filter the column
  #' named `hosp_col` in the data.frame named `data_name`
  #' @param start The date to begin with (e.g. discards any dates before)
  #' @param start The date to end with (e.g. discards any dates after)
  #' @param data_name a string matching the name of the global variable 
  #' with the CHG data (e.g. "temp_CHG" or "CHG_monthly")
  #' @param date_col The name of the column in `data_name` where we should
  #' filter the dates from
  #' @param hosp_col The name of the column in `data_name` where we should
  #' filter the hospitals from
  #'
  #' @return
  #' A data frame with one row. The colums will depend on if the filter
  #' function was able to identify any data from the CHG table meeting
  #' the given criteria
  #' 
  #' If it found match(es), it will return a df with the following columns
  #' - Denominator: the sum of the denominator during time period
  #' - Numerator:   the sum of the numerator during time period
  #' - CHG_start:   the variable given from `start`
  #' - CHG_end:     the variable given from `end`
  #' - Months_CHG:  a nested list with the months that were aggregated
  #' 
  #' If no matches were found, it will return a df with the following columns
  #' - CHG_start:   the variable given from `start`
  #' - CHG_end:     the variable given from `end`
  #' 
  #' @examples
    #' # example code 
    #' match_CHG(hosp = "PRN",
    #'           start = ymd("2023-11-15"),
    #'           end = ymd("2024-01-15"),
    #'           data_name = "CHG_monthly",
    #'           hosp_col = "Hospital")
    #'           
    #' # This is more useful using it inside of a mutate function
    #' temp_NHSN %>%
    #'   rowwise() %>%
    #'   mutate(CHG = pmap(list(hosp=Hosp, 
    #'                          start=start, end=end, 
    #'                          data_name="temp_CHG"),
    #'                     match_CHG)) %>%
    #'   unnest(CHG)         
  
  
  # I couldn't find an easy way to pass the data frame using map, so this 
  # will have to do
  data <- get(data_name)
  
  ### PART 1: Run validation tests
  # Check that data_name matches to a data.frame
  testthat::expect_s3_class(data, "data.frame")
  
  # Test that the provided data has columns with the given names
  testthat::expect_contains(names(data), hosp_col)
  testthat::expect_contains(names(data), date_col)

  # Ensure that the column for date is indeed dates, and the dates provided
  # are in date format
  testthat::expect_s3_class(data[[date_col]], "Date")
  testthat::expect_s3_class(start, "Date")
  testthat::expect_s3_class(end, "Date")
  
  # Check that start & end are length of one (e.g. forgot rowwise)
  testthat::expect_length(start, 1)
  testthat::expect_length(end, 1)
  
  # Make sure the name of the hospital is found in the data
  testthat::expect_contains(unique(data[[hosp_col]]), hosp)
  
  
  ### PART 2: Filter CHG data to that hospital's time period
  # Filter to only get the info we want
  filter_data <- data %>%
    filter(!!rlang::sym(hosp_col) %in% hosp) %>%
    filter(!!rlang::sym(date_col) >= start) %>%
    filter(!!rlang::sym(date_col) <= end) 
   
  ### PART 3: Aggregate to make it one row
  single_row <- filter_data %>%
    ungroup() %>%
    group_by(!!rlang::sym(hosp_col)) %>%
    summarise(Denominator = sum(Denominator, na.rm = T),
              Numerator   = sum(Numerator, na.rm = T),
              Months_CHG  = list(!!rlang::sym(date_col))) %>%
    mutate(CHG_start = start,
           CHG_end   = end) %>%
    select(-!!rlang::sym(hosp_col)) %>%
    relocate(Months_CHG, .after = CHG_end)
  
  ### PART 4: If no matching data is found, make a placeholder row
  #           that will be filled with NAs
  if(nrow(single_row)==0){
    single_row <- tibble(
      # Hospital = hosp,
      # Denominator = -1,
      # Numerator = -1,
      # Months_CHG = list(),
      CHG_start = start,
      CHG_end   = end)
  }
  return(single_row)
  
}

We ultimately will sum together all of the numerators & denominators (within each table) to give us the overall numerators & denominators for CHG bathing during that time period. Once we’ve done that, we can calculate the CHG_rate during that chunk of time, and our data is ready for analysis!

Here is the final, joined data:

Show code

temp_joined <- temp_NHSN %>%
  rowwise() %>%
  mutate(CHG = pmap(list(hosp=Hosp, start=start, end=end, data_name="temp_CHG"),
                    match_CHG)) %>%
  unnest(CHG) %>%
  
  mutate(CHG_rate = Numerator / Denominator,
         SIR = Inf_obs / Inf_pred) %>%
  
  select(Hosp, Inf_pred, Inf_obs, SIR, 
         start, end, 
         CHG_rate, Numerator, Denominator)

temp_joined %>%
  group_by(Hosp) %>%
  arrange(Hosp, desc(end)) %>%
  rmarkdown::paged_table()

Show code

# x %>%
#   
#   ggplot(aes(x=CHG_rate, y=SIR)) +
#   geom_jitter() +
#   geom_smooth() +
#   facet_wrap("Hosp")

Note

Much of the CHG data is listed as NA, becuase the CHG data only goes back to 2023-10-01, whereas the NHSN data goes back to 2019-01-15.

We really need to get the CHG data from earlier (ideally during implementation) to take full advantage of the NHSN data

Descriptive statistics

Todo

Will work on once we have some of the methodology stuff sorted above (and ideally the full CHG data)

For this run (2024-10-31) of the analysis:

The data set for CHG bathing spans from Oct 01 2023 to Oct 06 2024
The data set for NHSN spans from Jan 2019 to Aug 2024

Show code

SIR_full %>%
  group_by(Hosp) %>%
  summarise(start = min(Date),
            end = max(Date)) %>%
  ggplot(aes(x=Hosp)) +
  # geom_point(aes(y=start)) +
  geom_errorbar(aes(ymin = start, ymax=end), width=.3) +
  coord_flip() +
  theme_bw() + 
  labs(title="Available NHSN data",
       y="Date", x="Hospital")

Analysis

Pre-post

In an ideal world, we could get CHG data going back to around the time of the roll out (or before), identify the time period where hospitals transitioned, and compare the SIRs before & after the roll out

Show code

set.seed(4)
df <- tibble(y = pnorm(-10:10 * .3, sd = .5),
       noise = rnorm(1:21, sd = .02)) %>%
  mutate(x = row_number(),
         y2 = y+noise,
         y3 = y+.05,
         y3 = if_else(y3>=1, y2-.05, y3),
         rate = y3)

df %>%
  ggplot(aes(x=x, y=rate)) + 
  geom_point() + geom_line() +
  annotate("rect", xmin=0, xmax=8.5,
           fill="red",
           ymin=0, ymax=1, alpha=0.3) +
  annotate("rect", xmin=8.5, xmax=12.5,
           fill="blue",
           ymin=0, ymax=1, alpha=0.3) +
  annotate("rect", xmin=12.5, xmax=21,
           fill="green",
           ymin=0, ymax=1, alpha=0.3) +
  annotate("text", x=4, y=.5, label="Before", hjust="left") +
  annotate("text", x=17, y=.5, label="After", hjust="right") +
  annotate("text", x=10.5, y=.9, label="Transition", hjust="inward") +
  scale_y_continuous(labels = scales::percent) +
  theme_classic() +
  labs(x="Time", y="Rate", title="This would be great")

Show code

rm(df)

If we can’t get that information, we could just assume that rates of CHG bathing were low before September of 2023 and split the data into Pre/Post by the date

Show code

df <- temp_joined %>%
  mutate(Cluster = case_when(end < ymd("2023-09-01")   ~ "Before",
                             start > ymd("2023-09-30") ~ "After")) %>%
  filter(!is.na(Cluster))


df %>%
  ggplot(aes(fill=Cluster)) +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=SIR)) +
  facet_wrap("Hosp") +
  theme_bw() +
  theme(legend.position = "bottom") +

  labs(y="SIR",
       title="SIR, before & after")

Show code

df %>%
  ggplot(aes(x=Hosp, y=SIR, color=Cluster)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw() +
  labs(y="SIR", x="",
       title="SIR, before & after")

Show code

t.test(SIR ~ Cluster, data=df)


    Welch Two Sample t-test

data:  SIR by Cluster
t = -1.4226, df = 63.002, p-value = 0.1598
alternative hypothesis: true difference in means between group After and group Before is not equal to 0
95 percent confidence interval:
 -0.6322592  0.1064125
sample estimates:
 mean in group After mean in group Before 
           0.7101937            0.9731171

Show code

# CHG_monthly %>%
#   mutate(Hospital = case_when(Hospital %in% small_hosp ~ "Small",
#                               TRUE ~ Hospital)) %>%
#   group_by(Hospital, Month) %>% 
#   summarise(Denominator =sum(Denominator , na.rm = T),
#             Numerator   =sum(Numerator   , na.rm = T)) %>%
#   ggplot(aes(x=Month, y=Numerator/Denominator, color=Hospital)) +
#   geom_line() +
#   facet_wrap("Hospital")

GLM

Another option would be using a generalized linear model to predict the number of infections observed, and using predicted infections (and CHG bathing) as predictors. Since we would be looking at count data, we could do a Poisson regression (or more likely a negative binomial regression)…

\[ \text{Observed infections} = \beta_0 + \beta_1({predicted\ infections}) + \beta_2(CHG\ rate) + ... + \epsilon \]

Show code

# library(DHARMa)
mod_nb <- MASS::glm.nb(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined) 

summary(mod_nb)

# Compare to poisson
mod_pos <- glm(Inf_obs ~ Inf_pred + CHG_rate, 
               family = 'poisson',
               data = temp_joined) 
calc_dispersion <- function(mod, verbose=F){
  # Identify the model family
  fam_raw <- mod$family$family
  
  if(str_detect(fam_raw, "poisson")) family <- "poisson"
  if(str_detect(fam_raw, "Negative Binomial")) family <- "NB"
  
  E2 <- resid(mod, type = "pearson")
  N <- nrow(mod$model)
  p <- length(coef(mod)) 
  
  if(family=="NB") p <- p + 1 # '+1' is for variance parameter in NB
  
  result <- sum(E2^2) / (N - p)
  
  if(verbose) {
    if(result>1) message("Overdisperion")
    if(result<1) message("Underdisperion")
  }
  
  return(result)
}
# DHARMa::testDispersion(mod_nb)
calc_dispersion(mod_nb)
calc_dispersion(mod_pos)
rm(mod_nb, mod_pos, calc_dispersion)

ZIP & ZINB

I haven’t highlighted this yet (it will go into descriptive statistics) but our data has a whole lot of zeros, namely for the observed infections

Figure 2: High number of zeros, highlighted by histograms of the number of observed infections

I’ve used zero-inflated regression models before, but I’m not sure if that would apply here. These are used when there are two processes⁸ causing a mixed distribution, but I don’t think that is the case here. But leaving this section in just in case

Show code

library(pscl) # also see {ZIM} package

pscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | 
           Inf_pred , # these are the predictors for the zero inflated part
         dist = "poisson", # Use 'negbin' for negative binomial
         data = temp_joined) %>%
  summary()

# # Versus
# glm(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined, family = poisson) %>%
#   summary()

Show code

x %>%
  mutate(CHG_rate = Numerator / Denominator,
         SIR = Inf_obs / Inf_pred) %>%
  filter(!is.na(CHG_rate)) %>%
  pscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | 
           Inf_pred + CHG_rate, # these are the predictors for the zero inflated part
         dist = "negbin", 
         data = .) %>%
  summary()

Show code

df1 <- CHG_clean %>%
  
  #  Make CHG data monthly
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  # Recode CHG hospital names to NHSN names
  mutate(Hospital = recode(Hospital, 
                           "CCM"="CCMC",
                           
                           "WVU/UHA" = "WVUH",
                           "CH"      = "WVUH",
                           "FMT"     = "WVUH",
                           
                           "STF"     = "TMH")) %>%
  
  # Aggregate data
  group_by(Hospital, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T),
            .groups = "drop") 

# df1 %>% count(Hospital)
# SIR_full %>%count(Hosp)

df2 <- SIR_full %>%
  select(Date, Hosp, Inf_pred, Inf_obs, CL_days) %>%
  left_join(df1, by=c("Date"="Month", "Hosp"="Hospital")) %>%
  rename(CHG_numerator=Numerator,
         CHG_denominator=Denominator) %>%
  mutate(CHG_rate = CHG_numerator / CHG_denominator) %>%
  mutate(postCHG = Date > ymd("2023-08-01"))

library(glmmTMB)
glmmTMB(Inf_obs ~ Inf_pred + postCHG + (1|Hosp) ,
        data = df2, 
        family = "nbinom1") %>% 
  summary()

Footnotes

For example, March 31st (2024) was on a Sunday, so the week starting on 3/31/24 mostly reflects April 2024↩︎
I use the terms chunks and buckets interchangeably (I might also use the term “bin” somewhere in here too↩︎
This isn’t strictly the case, since there is a trade off between making sure each chunk has a sum(Inf_pred) > 1 (which reduces the number of chunks we get) and maximizing the number of chunks (to give us the most granular data points about time). So some chunks do have fewer than 1 predicted infection, but I tried to keep that to a minimum↩︎
Note that the area of the grey boxes is not proportional to the total number of infections, only the height is↩︎
Even if we did try with these hospitals, they would have one chunk of time (spanning the entire data) which still would be insufficient to calculate a SIR from↩︎
These classifications can be changed on the Google Sheet and will update whenever I run the analysis again↩︎
In practice, this is kind of splitting hospitals by predicted infections. But I think I prefer this because it’s more translatable to those who don’t know how NHSN does all of it’s math (like me!)↩︎
I think it would apply if we had a situation like: Does the hospital have patients with central lines (the logistical regression portion) and if so, how many CLABSIs did they get (the negative binomial portion)↩︎