Import data

CHG compliance

Data from the CHG compliance dashboard is fairly straightforward to download. Go to device type and only select central lines. Then export the heatmap to CSV

Show code
CHG_raw <- "TableauData/2024-09-22/Overall Rate by Hospital and Department.csv" %>% 
  read_csv(col_types = "ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc",
           skip = 1)
## These are the same months, but split across 2023-2024
# • `31-Dec-23` -> `31-Dec-23...23`
# • `31-Dec-23` -> `31-Dec-23...24`

CHG_clean <- CHG_raw %>%
  rename(Hospital=...1, Benchmark=...2, AtGoal=...3, Measure=...4) %>%
  
  mutate(Measure = case_when(str_detect(Measure, "Numerator") ~ "Numerator",
                             str_detect(Measure, "Denominator") ~ "Denominator",
                             TRUE ~ "Other")) %>%
  filter(Measure %in% c("Numerator", "Denominator")) %>%
  
  # Make long format
  pivot_longer(cols = -c(Hospital, Benchmark, AtGoal, Measure), 
               names_to = "Date", values_to = "n") %>%
  
  # Parse the numbers
  mutate(n = str_remove(n, ","), # Remove comma
         n = as.numeric(n)) %>%
  
  # make 31-Dec-23...23 amd 31-Dec-23...24 the same
  mutate(Date = case_when(Date=="31-Dec-23...23" ~ "31-Dec-23",
                          Date=="31-Dec-23...24" ~ "31-Dec-23",
                          TRUE ~ Date)) %>%
  
  filter(Date!="27-Aug-23") %>%
  
  # Group by Hospital + Month + Denominator
  group_by(Hospital, Date, Measure) %>%
  summarise(n = sum(n, na.rm = T)) %>%
  
  # Pivot data out 
  pivot_wider(names_from = Measure, values_from = n) %>%
  
  # Process dates
  mutate(Date = lubridate::dmy(Date)) %>%
  ungroup() 

rm(CHG_raw)
Show code
CHG_raw <- "TableauData/2024-10-18 CHG CL only/All Overall Rate by Hospital and Department_data.csv" %>% 
  read_csv()

CHG_clean <- CHG_raw %>%
  select(Hospital, Date=`Week of Census Time`,
         Denominator=`All Lines Denominator`,
         Numerator=`All Lines Numerator`) %>%
  
  mutate(Date = mdy(Date)) %>%
  
  group_by(Hospital, Date) %>%
  summarise(Denominator = sum(Denominator, na.rm = T),
            Numerator   = sum(Numerator, na.rm = T)) %>%
  ungroup() %>%
  
  # Remove most recent week 
  filter(Date != ymd("2024-10-13"))
   

rm(CHG_raw)
Show code
CHG_clean %>%
  rmarkdown::paged_table()
Note

I can only download the past 12 months of data from the dashboard. Will need to get access to more of the data to do a good pre-post analysis

This is a quick glimpse of the CHG compliance rates over time

Show code
# CHG_clean %>%
#   mutate(rate = Numerator / Denominator ) %>%
# 
#   ggplot(aes(x=Date, y=rate, color=Hospital)) + 
#   geom_line(key_glyph = "timeseries") +
#   geom_point() + 
#   scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +
#   facet_wrap("Hospital", ncol = 4) +
#   guides(color=F) +
#   theme_bw()

CHG_clean %>%
  mutate(rate = Numerator / Denominator ) %>%

  ggplot(aes(x=Date, y=rate, color=Denominator)) + 
  geom_line(key_glyph = "timeseries") +
  geom_point() + 
  scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +
  facet_wrap("Hospital", ncol = 4) +
  # scale_colour_gradient2(trans="log10", mid = "yellow", high = "red")
  scale_color_viridis_c(option = "D", trans="log10") +
  # guides(color=F) +
  theme_bw()

Because it is reported weekly (and everything else is reported monthly), we need to convert the weekly data into monthly data

How are the weeks calculated?

When you extract the data from Tableau, the dates it lists are all Sundays. It doesn’t look like it’s explicitly documented anywhere, but I assume that this means the week starts on Sunday, meaning the date provided is day 1 of 7 for that week

Some weeks span the course of multiple months1. In these cases, we will group them with whichever month that the majority of their days fell under

Show code
week_to_month <- function(date, beginning=T) {
  require(lubridate)
  
  # Make sure you are given only one date, not a vector
  stopifnot(length(date)==1)
  
  # Calculate the midpoint of the week, which is 3 days away from the start 
  # (or end) date
  if(beginning) {
    mid_date <- date + days(3)
  } else({
    # Otherwise, treat the given day as the end of the week
    mid_date <- date - days(3)
  })
  
  # Extract the month and the year from this date
  new_month <- month(mid_date)
  new_year  <- year(mid_date)
  
  # Create a new date and return
  new_date <- ymd(paste0(new_year, "-", new_month, "-15"))
  return(new_date)
  
}
Show code
CHG_monthly <- CHG_clean %>% 
  #  Needed to only process one date at a time
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  group_by(Hospital, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T))
Show code
CHG_clean %>% 
  #  Needed to only process one date at a time
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  group_by(Hospital, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T)) %>%
  mutate(rate = Numerator / Denominator) %>%
 
  ggplot(aes(x=Month, y=rate, color=Hospital)) +
  geom_line(key_glyph = "timeseries") +
  geom_point() +
  scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +
  facet_wrap("Hospital", ncol = 4) +
  guides(color=F) +
  theme_bw()

NHSN data

Show code
### THIS WAS A CHUNK FOR OBSERVED RATES, BUT IT'S INCLUDED IN SIR FILE NOW
CLABSI_abs_raw <- read_csv("TableauData/2024-09-22/Observed Infections by Month & Year.csv")

CLABSI_abs <- CLABSI_abs_raw %>%
  select(-LOCATIONS) %>%
  rename(Hospital=ORGID) %>%
  
  # Make long format
  pivot_longer(cols = -Hospital, 
               names_to = "Date", values_to = "n") %>%
  
  mutate(Date = ymd(paste0(Date, "-1")),
         Month = month(Date),
         Year = year(Date),
         Qrt = quarter(Date))

rm(CLABSI_abs_raw)

CLABSI_abs %>%
  ggplot(aes(x=Date, y=n, color=Hospital)) + 
  # geom_jitter(key_glyph = "timeseries") +
  geom_line(key_glyph = "timeseries") + 
  scale_x_date(date_breaks = "6 months", date_labels = "%b\n'%y") +
  facet_wrap("Hospital", ncol = 3) +
  guides(color=F) +
  theme_bw() +
  labs(y="# CLABSI")

Workflow

This isn’t as straightforward of a download from Tableau, so to keep it reproducible this is my workflow for downloading the data

  • Navigate to the NHSN Infections Dashboard, go to the Compare Hospitals dashboard

  • Select “all” for the Month & Year drop down

  • Ensure the Infection Type is set to CLABSI

  • Use the Hospital drop down to select the hospital of interest

Once you’ve selected the hospital, export the data to CSV with the following steps:

  1. Click the “Observed & Expected by Month & Year” panel (bottom left)

  2. Click the download icon in the top right corner (close to the share option). There should be an option to download data (if it’s grey’ed out, something is wrong). Clicking the data icon opens a new window

  3. In the new window, select Full Data (instead of summary)

  4. Click Show Fields, scroll up and select “all

  5. Finally, hit download. To keep it organized, I renamed the downloaded file with the hospital’s abbreviation

I did the above steps for all of the hospitals that I could select in Tableau

Show code
SIR_junk <- "TableauData/2024-10-09 CLASBI SIR" %>%
  
  # Use `fs` package to get the names of all of the CSVs
  fs::dir_ls() %>%
  
  # Read all of the CSVs using map
  purrr::map(readr::read_csv, 
             col_types = "ccccccccccccccccccccccccccccccccccccccc")  %>%
  
  # Bind into a single table
  list_rbind()

Missing data

This has quite a bit of missing data (since we downloaded all of the fields), so we can take a quick look at what is missing

Show code
library(naniar)
naniar::vis_miss(SIR_junk ) + coord_flip()

This is seen a little easier in table format, showing quite a few fields have 100% missing data, which we will go ahead and remove

Show code
naniar::miss_var_summary(SIR_junk) %>%
  rmarkdown::paged_table()
Show code
to_remove <- naniar::miss_var_summary(SIR_junk) %>% 
  filter(pct_miss==100) %>% 
  dplyr::pull(variable)

df_temp <- SIR_junk %>%
  select(-to_remove)

rm(to_remove)

Renaming and dropping columns

There are also some uninformative columns (meaning that they say the same thing in every row) which we should remove. We also need to convert each column to the correct data type and give them better names, which we will to here.

Show code
df_temp %>%
  mutate(across(everything(), as.factor)) %>%
  summary()
Show code
## Not run, but this verifies the assumptions below

# Is SUMMARYYM == Month of SUMMARYYM
df_temp %>%
  filter(SUMMARYYM != `Month of SUMMARYYM`)

# Is ORGID == ORGID (test)
df_temp %>%
  filter(ORGID != `ORGID (test)`)

# Verify that NUMPRED & INFCOUNT are all numeric data
df_temp %>% 
  filter(!is.na(NUMPRED)) %>%
  mutate(NUMPRED = as.numeric(NUMPRED)) %>%
  filter(is.na(NUMPRED))
df_temp %>% 
  filter(!is.na(INFCOUNT)) %>%
  mutate(INFCOUNT = as.numeric(INFCOUNT)) %>%
  filter(is.na(INFCOUNT))

# Is INFCOUNT == Infections Observed
df_temp %>%
  mutate(INFCOUNT = as.numeric(INFCOUNT),
         `Infections Observed` = as.numeric(`Infections Observed`)) %>%
  filter(INFCOUNT != `Infections Observed`)

# Verify that ORGID captures CCN, `Org ID (2)`, `Org ID (copy)`
df_temp %>% 
  count(ORGID, CCN, `Org ID (2)`, `Org ID (copy)`)

# Verify that NUMCLDAYS is all numeric data
df_temp %>% 
  filter(!is.na(NUMCLDAYS)) %>%
  mutate(NUMCLDAYS = as.numeric(NUMCLDAYS)) %>%
  filter(is.na(NUMCLDAYS))

# Verify that SIR is all numeric data (where it is not missing)
df_temp %>% 
  filter(!is.na(SIR)) %>%
  mutate(SIR = as.numeric(SIR)) %>%
  filter(is.na(SIR))

# Is SIR == SIR (all)
df_temp %>%
  filter(SIR != `SIR (all)`)

These are the changes we will make to these columns. The old name of the column is shown in [brackets]

  • Month [Month of SUMMARYYM]: Character of the month-year. Will set the day to be the 15th

  • Date [derived from Month of SUMMARYYM]: Date Date format of Month, with the day set to be the 15th

  • removed [Infection Type (fix)]: All that is said before was “CLABSI” (for all rows, no missing data)

  • removed [LOCATION]: All that is said before was “(All Locations)” (for all rows, no missing data)

  • removed [ORGID (test)]: This is the same as [ORGID], so we will just keep Hosp

  • removed [Infection Type]: All that is said before was “CLA” (for all rows, no missing data)

  • Hosp [ORGID]: Factor of the hospital abbreviation

  • data_source [Table Name]: Factor; appears to be the CSV file that the data was pulled from

  • removed [INFCOUNT]: Aside from having decimal points, this is the same as [Infections Observed], so we will just keep Inf_obs

  • Inf_pred [NUMPRED]: Numeric

  • removed [CCN]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate

  • Inf_obs [Infections Observed]: Numeric, looks like it’s the number of infections that occured

  • removed [LOCATIONS]: All that is said before was “(All Locations)” (for all rows, no missing data)

  • CL_days [NUMCLDAYS]: Numeric

  • removed [ORGID (2)]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate

  • removed [ORGID (copy)]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate

  • SIR [SIR]: Numeric

  • removed [SIR (all)]: This is the same as [SIR], so we will just keep SIR

  • SIR_p [SIR_PVAL]: Character. There are periods for some of these p values, which we will make NA

  • SIR_95CI [SIR95CI]: Character

  • removed [SUMMARYYM]: This is the same as [Month of SUMMARYYM], so we will just keep Month

Show code
SIR_full <- df_temp %>%
  mutate(Month    = `Month of SUMMARYYM`,
         Date     = dmy(str_glue("15-{Month}")),
         Hosp     = as_factor(ORGID),
         Inf_pred = as.numeric(NUMPRED),
         Inf_obs  = as.numeric(`Infections Observed`),
         CL_days  = as.numeric(NUMCLDAYS),
         SIR      = as.numeric(SIR),
         SIR_p    = SIR_PVAL,
         SIR_95CI = SIR95CI,
         data_source = as_factor(`Table Name`),
         # .keep="none" is the same as transmute, keeps only new columns
         .keep="none") %>%
  
  # Reorder SIR to be with p-value & CI
  relocate(SIR, .before = SIR_p) %>%
  
  # Replace p=. with NA 
  mutate(SIR_p = na_if(SIR_p, "."))

SIR_full %>% rmarkdown::paged_table()
Show code
rm(df_temp, SIR_junk)

SIR

You can see that the SIR is not calculated for most of the hospitals in the data set. Below is a table showing what percent of each variable is missing, stratified by hospital. This is because the SIR is not calculated when the number of predicted infections are less than 1, per NHSN resources (pg 5). More on this later

Show code
# gg_miss_upset(SIR)
SIR_full %>%
  group_by(Hosp) %>%
  miss_var_summary() %>%
  select(-n_miss) %>%
  pivot_wider(names_from = variable, values_from = pct_miss) %>% 
  select(Hosp, starts_with("SIR")) %>%
  knitr::kable(caption = "Percent of variable that is missing, by hospital")
Percent of variable that is missing, by hospital
Hosp SIR SIR_p SIR_95CI
BMC 100.00000 100.00000 100.00000
BRN 100.00000 100.00000 100.00000
BRX 100.00000 100.00000 100.00000
CCMC 100.00000 100.00000 100.00000
GMH 100.00000 100.00000 100.00000
GRMC 100.00000 100.00000 100.00000
HRS 100.00000 100.00000 100.00000
JAX 100.00000 100.00000 100.00000
JMC 100.00000 100.00000 100.00000
PRN 100.00000 100.00000 100.00000
PVH 100.00000 100.00000 100.00000
RMH 100.00000 100.00000 100.00000
SMR 100.00000 100.00000 100.00000
STJ 100.00000 100.00000 100.00000
UHC 100.00000 100.00000 100.00000
UTN 100.00000 100.00000 100.00000
WHL 97.05882 97.05882 97.05882
WTZ 100.00000 100.00000 100.00000
WVUH 0.00000 0.00000 0.00000

Hospital data

This comes from a Google Sheet where I store some of the info for each hospital

Show code
library(googlesheets4)
options(gargle_oauth_email = TRUE)
gs4_auth(cache = TRUE)
if(!exists("hospitals")){
  hospitals_raw <- paste0("https://docs.google.com/",
                          "spreadsheets/d/",
                          "1zZ1o1HhzOPrbj2KBdGsaO4rJvLDPMWqN0ogjMCNHfGw/") %>%
    read_sheet(sheet = "CHG Hospitals")
  
  hospitals <- hospitals_raw
  rm(hospitals_raw)
}
hospitals %>% rmarkdown::paged_table()

The reporting across these two databases aren’t exactly the same for the hospitals. There are some hospitals that show up in the CHG dataset but not the NHSN dataset. Specifically, Children’s, Fairmont, St Francis, and Thomas.

Show code
hospitals %>%
  select(NHSN_abbv, CHG_abbv, name_short) %>%
  full_join(count(SIR_full, Hosp), by=c("NHSN_abbv"="Hosp")) %>%
  rename(NHSN=n) %>%
  full_join(count(CHG_clean, Hospital), by=c("CHG_abbv"="Hospital")) %>%
  rename(CHG=n) %>% 
  relocate(name_short) %>%
  filter(!is.na(NHSN_abbv)|!is.na(CHG_abbv)) %>%
  
  gt(rowname_col = "name_short") %>%
  fmt_missing() %>%
  
  tab_spanner(
    label = '# rows',
    columns = c(NHSN, CHG)
  ) %>%
  #  cols_label() %>%

  tab_header(
    title = md('**Hospitals in this dataset**'),
    # subtitle = md('Showing counts of ')
  ) %>%
  opt_stylize(5) 
Hospitals in this dataset
NHSN_abbv CHG_abbv # rows
NHSN CHG
Ruby WVUH WVU/UHA 68 54
Childrens CH 54
UHC UHC UHC 67 54
Camden Clark CCMC CCM 68 54
Berkeley BMC BMC 67 54
Barnesville BRN BRN 39 10
Braxton BRX BRX 31 37
Fairmont FMT 53
Garrett GRMC GRMC 68 54
Grant GMH GMH 64 51
Harrison HRS HRS 66 13
Jackson JAX JAX 68 36
Jefferson JMC JMC 67 46
Potomac Valley PVH PVH 68 54
Princeton PRN PRN 68 54
Reynolds RMH RMH 68 54
St Francis STF 17
St Joes STJ STJ 67 49
Summersville SMR SMR 67 54
Thomas TMH 27
Uniontown UTN UTN 67 54
Wetzel WTZ WTZ 62 37
Wheeling WHL WHL 68 54
What should we do?

I assume that Children’s (CH) should be lumped with Ruby (WVUH) but what about Fairmont (FMT)? Until I hear otherwise, I will

  1. Lump CH with WVUH

  2. Ignore FMT, STF, and TMH

Pooling together predicted infections

From the NHSN documentation

The standardized infection ratio (SIR) is a summary measure used to track HAIs at a national, state, or local level over time. The SIR adjusts for various facility and/or patient-level factors that contribute to HAI risk within each facility. The method of calculating an SIR is similar to the method used to calculate the Standardized Mortality Ratio (SMR), a summary statistic widely used in public health to analyze mortality data. In HAI data analysis, the SIR compares the actual number of HAIs reported to the number that would be predicted, given the standard population (i.e., NHSN baseline), adjusting for several risk factors that have been found to be significantly associated with differences in infection incidence.

The formula for SIR is:

\[ SIR = \frac{Observed\ HAIs}{Predicted\ HAIs} \]

The problem

The SIR is only calculated when the number of predicted HAIs is at least 1, which prevents extremely high values and fluctuations in the SIR. As our data reports \(CLABSI_{obs}\) and \(CLABSI_{pred}\) for each hospital-month, very few instances have a \(CLABSI_{pred} \ge 1\) , so the SIR is not calculated in most cases.

Moreover, the predicted number of CLABSIs varies greatly among hospitals, as shown in an extract of the raw data below.

Show code
SIR_full %>%
  filter(Hosp %in% c("WVUH", "UHC", "GRMC", "JAX")) %>%
  select(Hosp, Month, Inf_pred, Inf_obs, SIR) %>%
  group_by(Hosp) %>%
  filter(row_number()<4) %>%
  ungroup() %>%
  # Add seperators, which requires changing everything to chars
  mutate(across(everything(), as.character)) %>%
  add_row(Hosp = "...", Month="...", 
          Inf_pred = "...", Inf_obs = "...", 
          SIR = "...", .after = 3) %>%
  add_row(Hosp = "...", Month="...", 
          Inf_pred = "...", Inf_obs = "...", 
          SIR = "...", .after = 7) %>%
  add_row(Hosp = "...", Month="...", 
          Inf_pred = "...", Inf_obs = "...", 
          SIR = "...", .after = 11) %>%
  knitr::kable(caption = "Sample of NSHN data for hospitals of various sizes")
Sample of NSHN data for hospitals of various sizes
Hosp Month Inf_pred Inf_obs SIR
GRMC Jan-19 0.046 0 NA
GRMC Feb-19 0.038 0 NA
GRMC Mar-19 0.033 0 NA
JAX Jan-19 0.002 0 NA
JAX Feb-19 0.003 0 NA
JAX Mar-19 0 0 NA
UHC Jan-19 0.424 0 NA
UHC Feb-19 0.448 1 NA
UHC Mar-19 0.401 0 NA
WVUH Jan-19 2.925 2 0.684
WVUH Feb-19 2.689 1 0.372
WVUH Mar-19 3.147 3 0.953

These histograms show the more detailed breakdown of each hospital’s monthly expected CLABSIs. As expected, some of the small hospitals (e.g. critical access) have incredibly low \(CLABSI_{pred}\) (some are zero), whereas larger hopsitals like Ruby (WVUM) mostly have monthly expected rates of >1

Show code
SIR_full %>%
  ggplot(aes(x=Inf_pred)) +
  geom_histogram() +
  facet_wrap("Hosp", scales = "free_y", ncol = 4) +  
  theme_bw() +
  scale_x_log10(breaks = scales::log_breaks(),
                labels = scales::label_log()) +
  labs(title="Histogram of monthly predicted CLABSIs, by hospital",
       x = "Predicted infections (visually log transformed)")

Solutions

The NHSN documentation (page 12) suggests using a longer time period in order to reach the 1.0 threshold of predicted infections.

This looks like it may work for some of the medium sized hospitals (e.g. UTN, CCMC, RMH), but most of the smaller hospitals don’t reach that threshold across the entire time that we have data for (not to mention what would happen if you split it by pre-post). This is demonstrated in the table below, which shows the total (sum) of each hospital’s expected & observed CLABSIs (across the entire time period)

Show code
SIR_full %>%
  group_by(Hosp) %>%
  summarise(expected = sum(Inf_pred),
            observed = sum(Inf_obs)) %>%
  arrange(expected) %>%
  rmarkdown::paged_table()
Show code


# SIR_full %>%
#   mutate(Rollout = if_else(Date < ymd("2023-09-01"), "Before", "After")) %>%
#   group_by(Hosp, Rollout) %>%
#   summarise(expected = sum(Inf_pred),
#             observed = sum(Inf_obs)) %>%
#   pivot_wider(names_from = Rollout,
#               names_glue = "{Rollout}_{.value}",
#               values_from = c(expected, observed)) %>%
#   mutate(Total_expected = Before_expected + After_expected,
#          Total_observed = Before_observed + After_observed) %>%
#   relocate(Hosp, Before_expected, After_expected, Total_expected, 
#            Before_observed, After_observed, Total_observed)

#1: Pooling the months

Aggregating the predicted infections across multiple months (until the sum across those months reaches 1) is what is suggested by NHSN. Because each hospital has different rates of predicted infections, the length of time needed to pool across (e.g. 2 months, 5 months, 1 year) may vary depending on the hospital.

I made a function that splits each hospitals data into chunks2 of time (e.g. April 2024 to July 2024), trying to keep the number of predicted infection roughly even within each chunk. This means that the duration of each chunk is variable, depending on how long it takes to reach a sum(Inf_pred) >= 1, but that now each row of data should have at least one3 predicted infection, allowing us to calculate the SIR.

Code for the function
create_buckets <- function(data, chunk_size=1, round_up=0.8, 
                           date_col="Date", data_col="Inf_pred",
                           verbose=F) {
  #' Creating buckets for predicted infections
  #'
  #' This function takes a data.frame (`data`) from containing a column with 
  #' the number of predicted infections (`data_col`) and splits it up into
  #' chunks where the sum of `data_col` within each chunk is at least 
  #' `chunk_size`
  #' 
  #' @param data Data frame containing a column of predicted infections
  #' @param chunk_size Target minimum number of infections to have in each 
  #' bucket. This should ideally be 1
  #' @param round_up At what cutoff (of remainder of infections) should you 
  #' increase the bucket size? 
  #' @param date_col Label for a column with date objects
  #' @param data_col Label for the column in `data` with the predicted infections
  #' @param verbose Should warnings be given?
  #'
  #' @return
  #' The same data frame, but with a new column named bucket containing the
  #' bucket IDs
  
  # Ensure data is sorted by date
  data <- data %>% arrange(!!rlang::sym(date_col))
  
  # This is the total number of predicted infections
  total <- data %>% dplyr::pull(data_col) %>% sum()
  
  ### FIRST: Calculate the number of buckets to break things down into
  
  # Calculate the number of buckets to divide things up into. This is floor 
  # division, meaning that it will round down (to make sure each chunk has 
  # greater than then size specified in chunk_size). For example, if there is 
  # a total of 2.3 predicted infections, it will split it into two chunks 
  # since 2.3 %/% 1 = 2 (assuming a target chunk size of at least 1)
  n_buckets <- total %/% chunk_size
  
  if(total %% chunk_size >= round_up){
    # If the remainder of the total infections is greater than the cutoff given
    # in round_up (e.g. total predicted of 2.8 has a remiander of 0.8), increase
    # the bucket size by one to prevent wasted data
    n_buckets <- n_buckets + 1
    if(verbose) warning("increasing bucket size")
  }
  
  ### SECOND: Determine the actual size chunk size to target
  actual_chunk_size <- total / n_buckets
  
  ### THIRD: Define values used during the loop
  # This is the vector where we store what bucket that row falls under
  bucket <- c()
  # This is the starting bucket ID
  bucket_id <- 1
  # This represents the running total (cumulative sum) of predicted infections
  # for the bucket
  running_total <- 0
  
  ### FOURTH: Loop thriught the data
  # Run through each row of the data, keeping a running sum of predicted 
  # infections. Once the running total exceeds our target chunk size (see
  # `actual_chunk_size` in step 2), reset `running_total` to zero and move
  # to then next bucket
  for (i in 1:nrow(data)) {
    running_total <- running_total + dplyr::pull(data, data_col)[i]
    
    # If the running total exceeds the actual chunk size, assign the current
    #  bucket, reset the running total, and move to the bucket
    if (running_total >= actual_chunk_size) {
      bucket <- c(bucket, bucket_id) # append bucket_id to running list
      bucket_id <- bucket_id + 1
      running_total <- 0
      
    } else {
      bucket <- c(bucket, bucket_id) # append bucket_id to running list
    }
  }
  ### FIFTH: Add the vector of bucket_id's calculated in the loop as a 
  #          column to the original data, and return the data frame
  data$bucket <- bucket
  return(data)
}

For a larger hospital like UHC, we could probably get away with lumping things into 2-3 month bins, but medium-small hospitals (like GRMC & RMH) require much longer periods of time to reach a (cumulative) total of 1 infection. This is demonstrated in the two respective tables below:

Show code
SIR_full %>% 
  filter(Hosp=="UHC") %>%
  select(Month, Date, Hosp, Inf_pred) %>%
  create_buckets() %>%
  group_by(bucket) %>%
  # View()
  filter(bucket<3 | bucket>28) %>%
  mutate(cumsum = cumsum(Inf_pred)) %>%
  mutate(bucket = str_glue("Chunk #{bucket} (size={sum(Inf_pred)})")) %>%
  select(-Date, -Hosp) %>%
  gt(groupname_col = "bucket", rowname_col = "Month") %>%
  tab_stub_indent(everything(), 5) %>%
   cols_label(Inf_pred = "Monthly",
              cumsum = md("Chunk's running total")) %>%
  tab_spanner(
    label = md('**Predicted infections** (UHC)'),
    columns = c(Inf_pred, cumsum)
  ) %>%
  tab_header(
    title = 'Example of grouping months together in chunks',
    subtitle = md('Data is limited to UHC, **not all chunks shown**')
  ) %>%
  opt_stylize(5)
Example of grouping months together in chunks
Data is limited to UHC, not all chunks shown
Predicted infections (UHC)
Monthly Chunk’s running total
Chunk #1 (size=1.273)
Jan-19 0.424 0.424
Feb-19 0.448 0.872
Mar-19 0.401 1.273
Chunk #2 (size=1.018)
Apr-19 0.372 0.372
May-19 0.312 0.684
Jun-19 0.334 1.018
Chunk #29 (size=1.609)
Mar-24 0.821 0.821
Apr-24 0.788 1.609
Chunk #30 (size=1.678)
May-24 0.918 0.918
Jun-24 0.760 1.678
Chunk #31 (size=0.915)
Jul-24 0.915 0.915
Show code
SIR_full %>% 
  filter(Hosp %in% c("GRMC", "RMH", "UHC")) %>%
  nest(.by = Hosp) %>%
  mutate(data = map(data, create_buckets)) %>%
  unnest(data) %>%
  
  
  # select(Month, Date, Hosp, Inf_pred) %>%
  # create_buckets() %>%
  group_by(Hosp, bucket) %>%
  summarise(
    median = qwraps2::median_iqr(Inf_pred),
    Inf_pred = sum(Inf_pred),
    # Inf_obs  = sum(Inf_obs),
    months = n(),
    start = min(Date),
    end   = max(Date)
  ) %>%
  ungroup() %>%
  # rowwise() %>%
  
  # Filter data to show select rows from UHC too
  mutate(keep = case_when(Hosp != "UHC" ~ TRUE,
                          bucket <3     ~ TRUE,
                          bucket >28    ~ TRUE,
                          TRUE          ~ FALSE)) %>%
  filter(keep) %>% select(-keep) %>%
  
  mutate(bucket = str_glue("Chunk #{bucket}")) %>%
  mutate(start = format(start, "%b '%y"),
         end = format(end, "%b '%y")) %>%
  relocate(months, median, .after=end) %>%
  # select(-Inf_obs)
  
  gt(groupname_col = "Hosp", rowname_col = "bucket") %>%
  tab_stub_indent(everything(), 5) %>%
  gt::cols_merge_range(start, end) %>%
   cols_label(start = md("Time span<br>&nbsp;"),
              Inf_pred = md("Total infections<br>(during chunk)"),
              median = md("Median (IQR)<br>monthly infections"),
              months = md("Duration<br>(months)")) %>%
  opt_stylize(5, color = "green") %>%
  
  tab_header(
    title = md('**Grouping months together in chunks**'),
    subtitle = md('Showing **predicted** infections; *select hospitals shown*')
  ) 
Grouping months together in chunks
Showing predicted infections; select hospitals shown
Total infections
(during chunk)
Time span
 
Duration
(months)
Median (IQR)
monthly infections
GRMC
Chunk #1 1.023 Jan '19–Aug '21 32 0.03 (0.02, 0.04)
Chunk #2 1.010 Sep '21–Nov '22 15 0.07 (0.04, 0.09)
Chunk #3 0.955 Dec '22–Aug '24 21 0.05 (0.04, 0.05)
RMH
Chunk #1 1.242 Jan '19–Nov '20 23 0.05 (0.04, 0.06)
Chunk #2 1.234 Dec '20–Jan '22 14 0.08 (0.07, 0.10)
Chunk #3 1.272 Feb '22–May '23 16 0.08 (0.07, 0.09)
Chunk #4 1.005 Jun '23–Aug '24 15 0.07 (0.05, 0.08)
UHC
Chunk #1 1.273 Jan '19–Mar '19 3 0.42 (0.41, 0.44)
Chunk #2 1.018 Apr '19–Jun '19 3 0.33 (0.32, 0.35)
Chunk #29 1.609 Mar '24–Apr '24 2 0.80 (0.80, 0.81)
Chunk #30 1.678 May '24–Jun '24 2 0.84 (0.80, 0.88)
Chunk #31 0.915 Jul '24–Jul '24 1 0.92 (0.92, 0.92)

A visual representation of this process is shown below for RMH, with the original data shown in red and the aggregated data in grey4

Show code
SIR_full %>% 
  filter(Hosp=="RMH") %>%
  select(Month, Date, Hosp, Inf_pred) %>%
  create_buckets() %>%
  group_by(bucket) %>%
  summarise(total = sum(Inf_pred),
            start = min(Date),
            end = max(Date)) %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=total)) +
  geom_col(aes(x=Date, y=Inf_pred), data=filter(SIR_full, Hosp=="RMH"), fill="red") +
  theme_bw() +
  scale_x_date(date_breaks = "6 months", date_labels = "%b\n'%y") +
  labs(title="Predicted infections vs time: RMH",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

This is the same type of figure among most of the hospitals. I’ve omitted hospitals that had fewer than 1 total predicted infections (across the entire data set), as we will address these in the next section. I’ve also omitted WVUH (Ruby) as it always has at least one monthly predicted infection

Show code
df <- SIR_full %>%
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data) %>% 
  # select(Month, Date, Hosp, Inf_pred) %>%
  # create_buckets() %>%
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date)) 

# df %>%
#   group_by(Hosp) %>%
#   filter(n()>1) %>%
#   filter(Inf_pred==min(Inf_pred))

# df %>%
#   group_by(Hosp) %>%
#   filter(n()>1) %>%
#   filter(Hosp!="WVUH") %>%
#   # mutate(duration = as.numeric(days(end - start)) / (24 * 60 * 60)) %>%
#   mutate(months = time_length(end - start, unit="months")) %>%
#   ggplot(aes(x=Hosp, y=Inf_pred)) + 
#   geom_boxplot() +
#   geom_jitter(aes(color=months),alpha=.5)
  
df %>%
  group_by(Hosp) %>%
  filter(n()>1) %>% 
  filter(Hosp!="WVUH") %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(13), xmax=end+days(13), ymin=0, ymax=Inf_pred)) +
  facet_wrap("Hosp") + 
  geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,
           data=filter(SIR_full, Hosp %in% pull(filter(group_by(df, Hosp), n()>1, Hosp!="WVUH"),Hosp)) ) +
  theme_bw() + 
  labs(title="Predicted infections vs time",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

This allows us to finally calculate the SIR across more of the hospitals than we could at first (when we could only view the SIR for WVUH)

Show code
df %>%
  group_by(Hosp) %>%
  filter(n()>1) %>% 
  # filter(Hosp!="WVUH") %>%
  mutate(SIR = Inf_obs / Inf_pred) %>%
  mutate(SIR = if_else(SIR==0, 0.1, SIR)) %>%
  ggplot(aes(fill=SIR)) +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=SIR)) +
  facet_wrap("Hosp") +
  theme_bw() +
  scale_fill_viridis_c() +
  # scale_fill_gradient2(midpoint = 1)
  labs(title="SIR using pooled time periods",
       y = "Standardized Infection Ratio\n(SIR)")

Show code
rm(df)

#2: Pooling hospitals

In the previous section, we focused on aggregating predicted infections across multiple months. This allowed us to give each hospital “chunks” of time during which we would expect to see one observed infection.

When we did this, we skipped over5 hospitals that did not have a total of at least one predicted infection (across the entire dataset). These tended to be the smaller hospitals, often (but not always) critical access hospitals. In this section, we will return to those hospitals that we skipped over and try to find a way to get pool their predicted infections to reach thresholds greater than one.

Recall the table shown at the beginning of the Solutions section, which showed the total number of predicted infections by hospital. Below is the same table, supplemented with each hospital’s classification6 and their median monthly number of central line days:

Hospital classification terminology

I am open to using different terminology here for classification. One thought would be to do it by bed size (large, medium, small), but I don’t have an accurate source for each hospital’s bed size

Show code
SIR_full %>%
  group_by(Hosp) %>%
  summarise(expected = sum(Inf_pred),
            observed = sum(Inf_obs),
            # median_CL_days  = qwraps2::median_iqr(CL_days)
            median_CL_days  = median(CL_days)
            ) %>%
  arrange(expected) %>%
  left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%
  select(Hosp:median_CL_days, hosp_type) %>%
  relocate(hosp_type, .before = expected) %>%
  ungroup() %>%
  mutate(totalAboveOne = expected >=1,
         totalAboveOne = if_else(totalAboveOne, 
                                 ">= 1 total expected infection",
                                 "<1 total expected infection")) %>%
  
  gt(rowname_col = "Hosp", groupname_col = "totalAboveOne") %>%
  
  tab_stub_indent(everything(), 5) %>%
  
  cols_label(expected = md("**Predicted**"),
             observed = md("Observed"),
             # median_CL_days = md("Median (IQR) *monthly*<br>central line days"),
             median_CL_days = md("Median *monthly*<br>central line days"),
             hosp_type = md("Hospital Type")) %>%
   tab_spanner(
    label = md('Total infections'),
    columns = c(expected, observed)
  ) %>%
  gtExtras::gt_color_rows(
    columns = expected,
    domain = c(0,50)
    # palette = c("purple", "lightgrey", "green")
  ) %>%
  data_color(
    columns = median_CL_days,
    domain = c(0,1000),
    palette = "BuPu",
    # pal_type = "continuous"
  ) %>%
  tab_style_body(
    columns = hosp_type,
    values = "Critical access",
    style = cell_text(color = '#9f0101')
  ) %>%
  tab_style_body(
    columns = hosp_type,
    values = "Community",
    style = cell_text(color = '#5a3286')
  ) %>%
  tab_style_body(
    columns = hosp_type,
    values = "???",
    style = cell_fill(color = '#e5ce90')
  ) %>%

  tab_header(
    title = md('Total number of infections & central line days, by hospital'),
    subtitle = md(str_glue('Data shown is across all months (**{format(min(SIR_full$Date), "%b %Y")} to {format(max(SIR_full$Date), "%b %Y")}**)'))
  ) %>%
  opt_stylize(6, color="gray") 
Total number of infections & central line days, by hospital
Data shown is across all months (Jan 2019 to Aug 2024)
Hospital Type Total infections Median monthly
central line days
Predicted Observed
<1 total expected infection
HRS Critical access 0.023 0 0.0
BRN Critical access 0.028 0 0.0
BRX Critical access 0.067 0 5.0
JAX Critical access 0.242 2 9.0
WTZ ??? 0.262 0 1.0
STJ Critical access 0.434 0 19.0
PVH Critical access 0.670 1 30.0
GMH Critical access 0.728 0 37.5
JMC Critical access 0.896 0 47.0
SMR Critical access 0.979 0 46.0
>= 1 total expected infection
GRMC Critical access 2.988 2 67.5
RMH Community 4.753 3 95.0
UTN Community 13.707 36 288.0
PRN Community 21.481 61 482.5
BMC Community 21.961 6 359.0
CCMC Community 36.191 24 487.0
WHL Community 41.671 39 668.5
UHC Community 43.610 34 747.0
WVUH Academic 185.902 116 2433.0
Misclassifications?
  • GRMC is listed as a critical access hospital, but has a much higher number of predicted infections than other critical access hospitals. Any idea on what is causing that? Regardless, might make sense to look at it separate from other CAHs

  • WTZ didn’t have critical access designation when I looked it up, but it sure behaves like a CAH

By hospital type

The first solution that comes to mind would be to lump all the CAH together and treat them as the “same hospital”. The NHSN reference looks like they use an intercept-only model for CAHs (Table 3 on page 12), so that tells me they didn’t find many useful predictors within each critical access hospital.

If we lump together all of the critical access hospitals, we get some acceptable results, shown in the figure below. Again, I’m not showing the WVUH data since it throws off the scale of the Y-axis. The only issue here is WTZ still can’t be used and we’ve now pooled GRMC’s data with all of the critical access hospitals.

Show code
CAH <- hospitals %>% 
  filter(hosp_type=="Critical access") %>%
  pull(NHSN_abbv)

df <- SIR_full %>%
  mutate(Hosp = case_when(
    Hosp %in% CAH ~ "Critical access",
    TRUE ~ Hosp
  )) %>%
  group_by(Hosp, Month, Date) %>%
  # arrange(Date) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs = sum(Inf_obs),
            CL_days = sum(CL_days)) %>%
  ungroup() 

df2 <- df %>%
  
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            Hosp=="Critical access" ~ 1.05,
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data)  %>% 
  
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date)) 


df2 %>%
  group_by(Hosp) %>%
  # filter(n()>1) %>% 
  filter(Hosp!="WVUH") %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +
  facet_wrap("Hosp") + 
  geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,
           data=filter(df, Hosp!="WVUH") ) +
  theme_bw() + 
  labs(title="Predicted infections vs time",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

Show code


# df %>%
#   ggplot(aes(x=Inf_pred)) +
#   geom_histogram() +
#   facet_wrap("Hosp", scales = "free", ncol = 4)

rm(df, df2, CAH)

Central line days

Another option would be to use some cutoff to aggregate hospitals together. The one that comes to mind is central line days, as this tracks well with predicted number of infections (as would be expected)7. The natural cutoff in the dataset appears to be around 150 central line days per month, as shown below

Show code
SIR_full %>%
  left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%
  ggplot(aes(x=Hosp, y=CL_days)) +
  geom_boxplot(aes(color=hosp_type)) + 
  # geom_boxplot() + facet_wrap("hosp_type", scales = "free_x") +
  coord_flip() + 
  scale_y_sqrt() +
  geom_hline(yintercept=150, linetype="dashed", color = "black") +
  geom_segment(aes(x = 2.5, y = 300, xend = 1.5, yend = 150),
                  arrow = arrow(length = unit(0.25, "cm"))) +
  annotate("text", x=2.5, y=310, label="150", hjust="left") +
  labs(title="Monthly central line days, by hopsital",
       y = "CL days (transformed)") +
  theme_bw()

If we use this cutoff to aggregate smaller hospitals together, we get the most number of chunks per “hospital” among all of the solutions we’ve done this far. The trade off is, we now have tons of hospitals aggregated into one group (specifically BRN, BRX, GMH, GRMC, HRS, JAX, JMC, PVH, RMH, SMR, STJ, and WTZ)

Show code
cutoff <- 150

df <- SIR_full %>%
  
  # Calculate average CL_days, and rename Hosp if below cutoff
  group_by(Hosp) %>%
  mutate(mean_CLD = mean(CL_days),
         CL_rare = mean_CLD < cutoff) %>%
  mutate(Hosp = if_else(CL_rare, "CL rare", Hosp)) %>%
  ungroup() %>%
  
  group_by(Hosp, Month, Date) %>%
  # arrange(Date) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs = sum(Inf_obs),
            CL_days = sum(CL_days)) %>%
  ungroup() 
  
df2 <- df %>%
  
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            # Hosp=="CL rare" ~ 1.05,
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data)  %>% 
  
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date)) 
   
# df2 %>% filter(Hosp=="CL rare") %>% tail()

df2 %>%
  group_by(Hosp) %>%
  # filter(n()>1) %>% 
  filter(Hosp!="WVUH") %>%
  ggplot() +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +
  facet_wrap("Hosp", ncol = 2) + 
  geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,
           data=filter(df, Hosp!="WVUH") ) +
  theme_bw() + 
  labs(title="Predicted infections vs time",
       subtitle = "Date buckets are in grey, original data is in red",
       caption = paste0("The width of the grey bars reflect the duration of the chunks",
       "\nThe hight of the grey bars are the total number of predicted infections during the chunk"),
       y = "Predicted infections",
       x="")

Show code

rm(cutoff, df, df2)

Joining the data

At this point we have a few data.frames we are working with

  • CHG data, which has the number of patients who got CHG bathing (Numerator) and the number of patients who were eligible (Denominator). This is at a hospital level where each row is the weekly data

  • NHSN data, which has the predicted (Inf_pred) and observed (Inf_obs) number of infections. This is monthly data for each hospital, but as shown in the “Pooling together predicted infections” section, we can aggregate it across months &/or hospitals

  • Hospital data, which comes from the Google Sheet


This is where it gets fun! To get the CHG data to line up with the NHSN data, we need to

  1. Pool together the NSHN data, both across hospitals and time

  2. Pool the CHG data across hospitals (e.g. lump all the CAH together), matching what we did in step 1

  3. Pool the CHG data by month

  4. Aggregate the CHG data across time, specific to that hospital’s chunks of time (since we chunked the NHSN into different time frames depending on the hospital

Show code
temp_NHSN <- SIR_full
temp_CHG  <- CHG_clean

Step 1: Aggregate NHSN data

This is effectively what we did in Pooling together predicted infections

Important

I will assume we are grouping together all hospitals that have an fewer than 150 central line days per month (on average)

Show code
cutoff <- 150

# Store the names of the small hospitals for later
small_hosp <- temp_NHSN %>%
  # Calculate average CL_days, and rename Hosp if below cutoff
  group_by(Hosp) %>%
  filter(mean(CL_days) < cutoff) %>%
  pull(Hosp) %>%
  unique() %>% 
  as.character()

# Make the NHSN data pooled by hospital and chunked into time buckets
temp_NHSN <- temp_NHSN %>%
  # Calculate average CL_days, and rename Hosp if below cutoff
  group_by(Hosp) %>%
  mutate(mean_CLD = mean(CL_days),
         CL_rare = mean_CLD < cutoff) %>%
  mutate(Hosp = if_else(CL_rare, "Small", Hosp)) %>%
  ungroup() %>%
  
  group_by(Hosp, Month, Date) %>%
  # arrange(Date) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs = sum(Inf_obs),
            CL_days = sum(CL_days),
            .groups = "drop") %>%
  ungroup() %>%
  
  nest(.by = Hosp) %>%
  # Play around with the numbers to get the right sizes
  mutate(c_size = case_when(Hosp=="PRN" ~ 1.1,
                            Hosp=="WHL" ~ 0.95,
                            Hosp=="BMC" ~ .92, 
                            # Hosp=="Small" ~ 1.05,
                            TRUE ~ formals(create_buckets)[["chunk_size"]])) %>%
  
  mutate(data = map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%
  unnest(data)  %>% 
  
  group_by(Hosp, bucket) %>%
  summarise(Inf_pred = sum(Inf_pred),
            Inf_obs  = sum(Inf_obs),
            CL_days  = sum(CL_days),
            start = min(Date),
            end = max(Date), 
            .groups = "drop") 

Step 2: Aggregate CHG by hospital

As mentioned in the section on importing Hospital data, the CHG dashboard is more granular than the NHSN data (e.g. Childrens Hospital is split from Ruby) and some of the abbreviations differ between the datasets (e.g. CCMC vs CCM).

Here, we will fix those abbreviations (we will use the NHSN abbreviations) and merge CH with WVUH. We will also group together the smaller hospitals (e.g. CAH), based on what we did in step 1 above. Once all of that is done, we sum up everything up by hospital (or hospital group) & week.

Show code
temp_CHG <- temp_CHG %>%
  select(Hospital, Date, Denominator, Numerator) %>%
  left_join(select(hospitals, NHSN_abbv, CHG_abbv), 
            by=c("Hospital"="CHG_abbv")) %>%
  
  # Make a new column named Hosp, with the adjusted names
  # Rename hospital names to set us up to pool across names
  # !!! Be careful of which one you use here (Hospital vs NHSN_abbv)
  mutate(Hosp = case_when(Hospital=="CH"           ~ "WVUH",
                          NHSN_abbv %in% small_hosp ~ "Small",
                          TRUE                 ~ NHSN_abbv)) %>%
  
  # # Check your work
  # filter(Hospital!=NHSN_abbv | is.na(NHSN_abbv)) %>%
  # count(Hospital, NHSN_abbv, Hosp)
  select(-Hospital, -NHSN_abbv) %>%
  
  # Remove hospitals not listed in NHSN
  filter(Hosp %in% temp_NHSN$Hosp) %>%
  
  # Aggregate data
  group_by(Hosp, Date) %>%
  summarise(Denominator = sum(Denominator, na.rm = T),
            Numerator   = sum(Numerator, na.rm = T),
            .groups = "drop") %>%
  ungroup() 

Step 3: Aggregate CHG by month

This is exactly what we did in the section where we imported the CHG compliance data, we just have to do it again now that we aggregated the CHG data by hospital

Show code
temp_CHG <- temp_CHG %>% 
  #  Needed to only process one date at a time
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  group_by(Hosp, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T))

Step 4: Match CHG chunks to NHSN chunks

At this point, we have NHSN data broken down into chunks, like the sample below

Show code
temp_NHSN %>%
  filter(Hosp %in% c("PRN", "Small", "CCMC")) %>%
  group_by(Hosp) %>%
  filter(bucket == max(bucket)) %>%
  select(Hosp, starts_with("Inf"), start, end) %>%
  knitr::kable(caption="Sample of NHSN data")
Sample of NHSN data
Hosp Inf_pred Inf_obs start end
CCMC 0.979 1 2024-07-15 2024-08-15
PRN 0.943 1 2024-06-15 2024-08-15
Small 0.917 0 2024-03-15 2024-08-15

To join the CHG compliance data with our chunked NHSN data, we will work row-by-row through the NHSN table. For each row, we will filter the CHG data:

  • The hospital name matches (\(Hosp_{\ CHG} = Hosp_{\ NHSN}\))

  • The month of the CHG data is after or equal to the start date in NHSN (\(Month_{\ CHG} \ge Start_{\ NHSN}\))

  • The month of CHG data is before or equal to the end date in NHSN (\(Month_{\ CHG} \le End_{\ NHSN}\))


For the sample data shown above, this would translate to these three sets of tables

Desired output, with rows of NHSN shown on the left, and filtered CHG data on the right

Hosp Inf_pred Inf_obs start end
CCMC 0.979 1 2024-07-15 2024-08-15
Hosp Month Denominator Numerator
CCMC 2024-07-15 1998 1823
CCMC 2024-08-15 1887 1723
Hosp Inf_pred Inf_obs start end
PRN 0.943 1 2024-06-15 2024-08-15
Hosp Month Denominator Numerator
PRN 2024-06-15 602 480
PRN 2024-07-15 712 581
PRN 2024-08-15 574 497
Hosp Inf_pred Inf_obs start end
Small 0.917 0 2024-03-15 2024-08-15
Hosp Month Denominator Numerator
Small 2024-03-15 645 399
Small 2024-04-15 478 375
Small 2024-05-15 663 556
Small 2024-06-15 397 338
Small 2024-07-15 551 487
Small 2024-08-15 413 344

Code for the function
# Define a helper function that will filter the CHG data based on given parameters

match_CHG <- function(hosp, 
                      start, end, 
                      data_name,
                      date_col="Month",
                      hosp_col = "Hosp") {
  #' Matching CHG data to pooled NHSN data
  #' 
  #' @param hosp String of the hospital name, used to filter the column
  #' named `hosp_col` in the data.frame named `data_name`
  #' @param start The date to begin with (e.g. discards any dates before)
  #' @param start The date to end with (e.g. discards any dates after)
  #' @param data_name a string matching the name of the global variable 
  #' with the CHG data (e.g. "temp_CHG" or "CHG_monthly")
  #' @param date_col The name of the column in `data_name` where we should
  #' filter the dates from
  #' @param hosp_col The name of the column in `data_name` where we should
  #' filter the hospitals from
  #'
  #' @return
  #' A data frame with one row. The colums will depend on if the filter
  #' function was able to identify any data from the CHG table meeting
  #' the given criteria
  #' 
  #' If it found match(es), it will return a df with the following columns
  #' - Denominator: the sum of the denominator during time period
  #' - Numerator:   the sum of the numerator during time period
  #' - CHG_start:   the variable given from `start`
  #' - CHG_end:     the variable given from `end`
  #' - Months_CHG:  a nested list with the months that were aggregated
  #' 
  #' If no matches were found, it will return a df with the following columns
  #' - CHG_start:   the variable given from `start`
  #' - CHG_end:     the variable given from `end`
  #' 
  #' @examples
    #' # example code 
    #' match_CHG(hosp = "PRN",
    #'           start = ymd("2023-11-15"),
    #'           end = ymd("2024-01-15"),
    #'           data_name = "CHG_monthly",
    #'           hosp_col = "Hospital")
    #'           
    #' # This is more useful using it inside of a mutate function
    #' temp_NHSN %>%
    #'   rowwise() %>%
    #'   mutate(CHG = pmap(list(hosp=Hosp, 
    #'                          start=start, end=end, 
    #'                          data_name="temp_CHG"),
    #'                     match_CHG)) %>%
    #'   unnest(CHG)         
  
  
  # I couldn't find an easy way to pass the data frame using map, so this 
  # will have to do
  data <- get(data_name)
  
  ### PART 1: Run validation tests
  # Check that data_name matches to a data.frame
  testthat::expect_s3_class(data, "data.frame")
  
  # Test that the provided data has columns with the given names
  testthat::expect_contains(names(data), hosp_col)
  testthat::expect_contains(names(data), date_col)

  # Ensure that the column for date is indeed dates, and the dates provided
  # are in date format
  testthat::expect_s3_class(data[[date_col]], "Date")
  testthat::expect_s3_class(start, "Date")
  testthat::expect_s3_class(end, "Date")
  
  # Check that start & end are length of one (e.g. forgot rowwise)
  testthat::expect_length(start, 1)
  testthat::expect_length(end, 1)
  
  # Make sure the name of the hospital is found in the data
  testthat::expect_contains(unique(data[[hosp_col]]), hosp)
  
  
  ### PART 2: Filter CHG data to that hospital's time period
  # Filter to only get the info we want
  filter_data <- data %>%
    filter(!!rlang::sym(hosp_col) %in% hosp) %>%
    filter(!!rlang::sym(date_col) >= start) %>%
    filter(!!rlang::sym(date_col) <= end) 
   
  ### PART 3: Aggregate to make it one row
  single_row <- filter_data %>%
    ungroup() %>%
    group_by(!!rlang::sym(hosp_col)) %>%
    summarise(Denominator = sum(Denominator, na.rm = T),
              Numerator   = sum(Numerator, na.rm = T),
              Months_CHG  = list(!!rlang::sym(date_col))) %>%
    mutate(CHG_start = start,
           CHG_end   = end) %>%
    select(-!!rlang::sym(hosp_col)) %>%
    relocate(Months_CHG, .after = CHG_end)
  
  ### PART 4: If no matching data is found, make a placeholder row
  #           that will be filled with NAs
  if(nrow(single_row)==0){
    single_row <- tibble(
      # Hospital = hosp,
      # Denominator = -1,
      # Numerator = -1,
      # Months_CHG = list(),
      CHG_start = start,
      CHG_end   = end)
  }
  return(single_row)
  
}

We ultimately will sum together all of the numerators & denominators (within each table) to give us the overall numerators & denominators for CHG bathing during that time period. Once we’ve done that, we can calculate the CHG_rate during that chunk of time, and our data is ready for analysis!

Here is the final, joined data:

Show code
temp_joined <- temp_NHSN %>%
  rowwise() %>%
  mutate(CHG = pmap(list(hosp=Hosp, start=start, end=end, data_name="temp_CHG"),
                    match_CHG)) %>%
  unnest(CHG) %>%
  
  mutate(CHG_rate = Numerator / Denominator,
         SIR = Inf_obs / Inf_pred) %>%
  
  select(Hosp, Inf_pred, Inf_obs, SIR, 
         start, end, 
         CHG_rate, Numerator, Denominator)

temp_joined %>%
  group_by(Hosp) %>%
  arrange(Hosp, desc(end)) %>%
  rmarkdown::paged_table()
Show code
# x %>%
#   
#   ggplot(aes(x=CHG_rate, y=SIR)) +
#   geom_jitter() +
#   geom_smooth() +
#   facet_wrap("Hosp")
Note

Much of the CHG data is listed as NA, becuase the CHG data only goes back to 2023-10-01, whereas the NHSN data goes back to 2019-01-15.

We really need to get the CHG data from earlier (ideally during implementation) to take full advantage of the NHSN data

Descriptive statistics

Todo

Will work on once we have some of the methodology stuff sorted above (and ideally the full CHG data)

For this run (2024-10-31) of the analysis:

  • The data set for CHG bathing spans from Oct 01 2023 to Oct 06 2024

  • The data set for NHSN spans from Jan 2019 to Aug 2024

Show code
SIR_full %>%
  group_by(Hosp) %>%
  summarise(start = min(Date),
            end = max(Date)) %>%
  ggplot(aes(x=Hosp)) +
  # geom_point(aes(y=start)) +
  geom_errorbar(aes(ymin = start, ymax=end), width=.3) +
  coord_flip() +
  theme_bw() + 
  labs(title="Available NHSN data",
       y="Date", x="Hospital") 

Show code
temp_joined %>%
  ggplot(aes(fill=SIR)) +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=SIR)) +
  facet_wrap("Hosp", ncol = 2) +
  
  
  scale_fill_binned(low = "#FEE5D9", high="#A50F15") + # RColorBrewer::brewer.pal(5, "Greens")
  theme_bw() +
  theme(legend.position = "bottom") +

  labs(y="SIR",
       title="SIR") 

Show code
temp_joined %>%
  ggplot(aes(fill=Denominator)) +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=CHG_rate)) +
  facet_wrap("Hosp", ncol = 2) +
  xlim(c(min(temp_joined$start), max(temp_joined$end)))+
  scale_y_continuous(labels = scales::percent) +
  scale_fill_binned(low = "#EDF8E9", high="#006D2C") + # RColorBrewer::brewer.pal(5, "Greens")
  theme_bw() +
  theme(legend.position = "bottom") +

  labs(y="CHG compliance",
       title="CHG compliance") 

Figure 1: SIR vs CHG

Analysis

Pre-post

In an ideal world, we could get CHG data going back to around the time of the roll out (or before), identify the time period where hospitals transitioned, and compare the SIRs before & after the roll out

Show code
set.seed(4)
df <- tibble(y = pnorm(-10:10 * .3, sd = .5),
       noise = rnorm(1:21, sd = .02)) %>%
  mutate(x = row_number(),
         y2 = y+noise,
         y3 = y+.05,
         y3 = if_else(y3>=1, y2-.05, y3),
         rate = y3)

df %>%
  ggplot(aes(x=x, y=rate)) + 
  geom_point() + geom_line() +
  annotate("rect", xmin=0, xmax=8.5,
           fill="red",
           ymin=0, ymax=1, alpha=0.3) +
  annotate("rect", xmin=8.5, xmax=12.5,
           fill="blue",
           ymin=0, ymax=1, alpha=0.3) +
  annotate("rect", xmin=12.5, xmax=21,
           fill="green",
           ymin=0, ymax=1, alpha=0.3) +
  annotate("text", x=4, y=.5, label="Before", hjust="left") +
  annotate("text", x=17, y=.5, label="After", hjust="right") +
  annotate("text", x=10.5, y=.9, label="Transition", hjust="inward") +
  scale_y_continuous(labels = scales::percent) +
  theme_classic() +
  labs(x="Time", y="Rate", title="This would be great")

Show code
rm(df)

If we can’t get that information, we could just assume that rates of CHG bathing were low before September of 2023 and split the data into Pre/Post by the date

Show code
df <- temp_joined %>%
  mutate(Cluster = case_when(end < ymd("2023-09-01")   ~ "Before",
                             start > ymd("2023-09-30") ~ "After")) %>%
  filter(!is.na(Cluster))


df %>%
  ggplot(aes(fill=Cluster)) +
  geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=SIR)) +
  facet_wrap("Hosp") +
  theme_bw() +
  theme(legend.position = "bottom") +

  labs(y="SIR",
       title="SIR, before & after") 

Show code
df %>%
  ggplot(aes(x=Hosp, y=SIR, color=Cluster)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw() +
  labs(y="SIR", x="",
       title="SIR, before & after")

Show code
t.test(SIR ~ Cluster, data=df) 

    Welch Two Sample t-test

data:  SIR by Cluster
t = -1.4226, df = 63.002, p-value = 0.1598
alternative hypothesis: true difference in means between group After and group Before is not equal to 0
95 percent confidence interval:
 -0.6322592  0.1064125
sample estimates:
 mean in group After mean in group Before 
           0.7101937            0.9731171 
Show code
# CHG_monthly %>%
#   mutate(Hospital = case_when(Hospital %in% small_hosp ~ "Small",
#                               TRUE ~ Hospital)) %>%
#   group_by(Hospital, Month) %>% 
#   summarise(Denominator =sum(Denominator , na.rm = T),
#             Numerator   =sum(Numerator   , na.rm = T)) %>%
#   ggplot(aes(x=Month, y=Numerator/Denominator, color=Hospital)) +
#   geom_line() +
#   facet_wrap("Hospital")

GLM

Another option would be using a generalized linear model to predict the number of infections observed, and using predicted infections (and CHG bathing) as predictors. Since we would be looking at count data, we could do a Poisson regression (or more likely a negative binomial regression)…

\[ \text{Observed infections} = \beta_0 + \beta_1({predicted\ infections}) + \beta_2(CHG\ rate) + ... + \epsilon \]

Show code
# library(DHARMa)
mod_nb <- MASS::glm.nb(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined) 

summary(mod_nb)

# Compare to poisson
mod_pos <- glm(Inf_obs ~ Inf_pred + CHG_rate, 
               family = 'poisson',
               data = temp_joined) 
calc_dispersion <- function(mod, verbose=F){
  # Identify the model family
  fam_raw <- mod$family$family
  
  if(str_detect(fam_raw, "poisson")) family <- "poisson"
  if(str_detect(fam_raw, "Negative Binomial")) family <- "NB"
  
  E2 <- resid(mod, type = "pearson")
  N <- nrow(mod$model)
  p <- length(coef(mod)) 
  
  if(family=="NB") p <- p + 1 # '+1' is for variance parameter in NB
  
  result <- sum(E2^2) / (N - p)
  
  if(verbose) {
    if(result>1) message("Overdisperion")
    if(result<1) message("Underdisperion")
  }
  
  return(result)
}
# DHARMa::testDispersion(mod_nb)
calc_dispersion(mod_nb)
calc_dispersion(mod_pos)
rm(mod_nb, mod_pos, calc_dispersion)

ZIP & ZINB

I haven’t highlighted this yet (it will go into descriptive statistics) but our data has a whole lot of zeros, namely for the observed infections

Show code
SIR_full %>%
  ggplot() +
  geom_bar(aes(x=Inf_obs)) +
  labs(x="Observed infections (per month)",
       y="Count of occurances",
       title = "By month")

Show code
temp_joined %>%
  ggplot() +
  geom_bar(aes(x=Inf_obs)) +
  labs(x="Observed infections (per chunk)",
       y="Count of occurances",
       title = "By chunk")

Figure 2: High number of zeros, highlighted by histograms of the number of observed infections

I’ve used zero-inflated regression models before, but I’m not sure if that would apply here. These are used when there are two processes8 causing a mixed distribution, but I don’t think that is the case here. But leaving this section in just in case

Show code
library(pscl) # also see {ZIM} package

pscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | 
           Inf_pred , # these are the predictors for the zero inflated part
         dist = "poisson", # Use 'negbin' for negative binomial
         data = temp_joined) %>%
  summary()

# # Versus
# glm(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined, family = poisson) %>%
#   summary()
Show code
x %>%
  mutate(CHG_rate = Numerator / Denominator,
         SIR = Inf_obs / Inf_pred) %>%
  filter(!is.na(CHG_rate)) %>%
  pscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | 
           Inf_pred + CHG_rate, # these are the predictors for the zero inflated part
         dist = "negbin", 
         data = .) %>%
  summary()
Show code
df1 <- CHG_clean %>%
  
  #  Make CHG data monthly
  rowwise() %>%
  mutate(Month = week_to_month(Date)) %>%
  ungroup() %>%
  
  # Recode CHG hospital names to NHSN names
  mutate(Hospital = recode(Hospital, 
                           "CCM"="CCMC",
                           
                           "WVU/UHA" = "WVUH",
                           "CH"      = "WVUH",
                           "FMT"     = "WVUH",
                           
                           "STF"     = "TMH")) %>%
  
  # Aggregate data
  group_by(Hospital, Month) %>%
  summarise(Denominator =sum(Denominator , na.rm = T),
            Numerator   =sum(Numerator   , na.rm = T),
            .groups = "drop") 

# df1 %>% count(Hospital)
# SIR_full %>%count(Hosp)

df2 <- SIR_full %>%
  select(Date, Hosp, Inf_pred, Inf_obs, CL_days) %>%
  left_join(df1, by=c("Date"="Month", "Hosp"="Hospital")) %>%
  rename(CHG_numerator=Numerator,
         CHG_denominator=Denominator) %>%
  mutate(CHG_rate = CHG_numerator / CHG_denominator) %>%
  mutate(postCHG = Date > ymd("2023-08-01"))

library(glmmTMB)
glmmTMB(Inf_obs ~ Inf_pred + postCHG + (1|Hosp) ,
        data = df2, 
        family = "nbinom1") %>% 
  summary()

Footnotes

  1. For example, March 31st (2024) was on a Sunday, so the week starting on 3/31/24 mostly reflects April 2024↩︎

  2. I use the terms chunks and buckets interchangeably (I might also use the term “bin” somewhere in here too↩︎

  3. This isn’t strictly the case, since there is a trade off between making sure each chunk has a sum(Inf_pred) > 1 (which reduces the number of chunks we get) and maximizing the number of chunks (to give us the most granular data points about time). So some chunks do have fewer than 1 predicted infection, but I tried to keep that to a minimum↩︎

  4. Note that the area of the grey boxes is not proportional to the total number of infections, only the height is↩︎

  5. Even if we did try with these hospitals, they would have one chunk of time (spanning the entire data) which still would be insufficient to calculate a SIR from↩︎

  6. These classifications can be changed on the Google Sheet and will update whenever I run the analysis again↩︎

  7. In practice, this is kind of splitting hospitals by predicted infections. But I think I prefer this because it’s more translatable to those who don’t know how NHSN does all of it’s math (like me!)↩︎

  8. I think it would apply if we had a situation like: Does the hospital have patients with central lines (the logistical regression portion) and if so, how many CLABSIs did they get (the negative binomial portion)↩︎