Data from the CHG compliance dashboard is fairly straightforward to download. Go to device type and only select central lines. Then export the heatmap to CSV
Show code
CHG_raw <-"TableauData/2024-09-22/Overall Rate by Hospital and Department.csv"%>%read_csv(col_types ="ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc",skip =1)## These are the same months, but split across 2023-2024# • `31-Dec-23` -> `31-Dec-23...23`# • `31-Dec-23` -> `31-Dec-23...24`CHG_clean <- CHG_raw %>%rename(Hospital=...1, Benchmark=...2, AtGoal=...3, Measure=...4) %>%mutate(Measure =case_when(str_detect(Measure, "Numerator") ~"Numerator",str_detect(Measure, "Denominator") ~"Denominator",TRUE~"Other")) %>%filter(Measure %in%c("Numerator", "Denominator")) %>%# Make long formatpivot_longer(cols =-c(Hospital, Benchmark, AtGoal, Measure), names_to ="Date", values_to ="n") %>%# Parse the numbersmutate(n =str_remove(n, ","), # Remove comman =as.numeric(n)) %>%# make 31-Dec-23...23 amd 31-Dec-23...24 the samemutate(Date =case_when(Date=="31-Dec-23...23"~"31-Dec-23", Date=="31-Dec-23...24"~"31-Dec-23",TRUE~ Date)) %>%filter(Date!="27-Aug-23") %>%# Group by Hospital + Month + Denominatorgroup_by(Hospital, Date, Measure) %>%summarise(n =sum(n, na.rm = T)) %>%# Pivot data out pivot_wider(names_from = Measure, values_from = n) %>%# Process datesmutate(Date = lubridate::dmy(Date)) %>%ungroup() rm(CHG_raw)
Show code
CHG_raw <-"TableauData/2024-10-18 CHG CL only/All Overall Rate by Hospital and Department_data.csv"%>%read_csv()CHG_clean <- CHG_raw %>%select(Hospital, Date=`Week of Census Time`,Denominator=`All Lines Denominator`,Numerator=`All Lines Numerator`) %>%mutate(Date =mdy(Date)) %>%group_by(Hospital, Date) %>%summarise(Denominator =sum(Denominator, na.rm = T),Numerator =sum(Numerator, na.rm = T)) %>%ungroup() %>%# Remove most recent week filter(Date !=ymd("2024-10-13"))rm(CHG_raw)
Show code
CHG_clean %>% rmarkdown::paged_table()
Note
I can only download the past 12 months of data from the dashboard. Will need to get access to more of the data to do a good pre-post analysis
This is a quick glimpse of the CHG compliance rates over time
Because it is reported weekly (and everything else is reported monthly), we need to convert the weekly data into monthly data
How are the weeks calculated?
When you extract the data from Tableau, the dates it lists are all Sundays. It doesn’t look like it’s explicitly documented anywhere, but I assume that this means the week starts on Sunday, meaning the date provided is day 1 of 7 for that week
Some weeks span the course of multiple months1. In these cases, we will group them with whichever month that the majority of their days fell under
Show code
week_to_month <-function(date, beginning=T) {require(lubridate)# Make sure you are given only one date, not a vectorstopifnot(length(date)==1)# Calculate the midpoint of the week, which is 3 days away from the start # (or end) dateif(beginning) { mid_date <- date +days(3) } else({# Otherwise, treat the given day as the end of the week mid_date <- date -days(3) })# Extract the month and the year from this date new_month <-month(mid_date) new_year <-year(mid_date)# Create a new date and return new_date <-ymd(paste0(new_year, "-", new_month, "-15"))return(new_date)}
Show code
CHG_monthly <- CHG_clean %>%# Needed to only process one date at a timerowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%group_by(Hospital, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T))
Show code
CHG_clean %>%# Needed to only process one date at a timerowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%group_by(Hospital, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T)) %>%mutate(rate = Numerator / Denominator) %>%ggplot(aes(x=Month, y=rate, color=Hospital)) +geom_line(key_glyph ="timeseries") +geom_point() +scale_x_date(date_breaks ="3 months", date_labels ="%b\n'%y") +facet_wrap("Hospital", ncol =4) +guides(color=F) +theme_bw()
NHSN data
Show code
### THIS WAS A CHUNK FOR OBSERVED RATES, BUT IT'S INCLUDED IN SIR FILE NOWCLABSI_abs_raw <-read_csv("TableauData/2024-09-22/Observed Infections by Month & Year.csv")CLABSI_abs <- CLABSI_abs_raw %>%select(-LOCATIONS) %>%rename(Hospital=ORGID) %>%# Make long formatpivot_longer(cols =-Hospital, names_to ="Date", values_to ="n") %>%mutate(Date =ymd(paste0(Date, "-1")),Month =month(Date),Year =year(Date),Qrt =quarter(Date))rm(CLABSI_abs_raw)CLABSI_abs %>%ggplot(aes(x=Date, y=n, color=Hospital)) +# geom_jitter(key_glyph = "timeseries") +geom_line(key_glyph ="timeseries") +scale_x_date(date_breaks ="6 months", date_labels ="%b\n'%y") +facet_wrap("Hospital", ncol =3) +guides(color=F) +theme_bw() +labs(y="# CLABSI")
Workflow
This isn’t as straightforward of a download from Tableau, so to keep it reproducible this is my workflow for downloading the data
Use the Hospital drop down to select the hospital of interest
Once you’ve selected the hospital, export the data to CSV with the following steps:
Click the “Observed & Expected by Month & Year” panel (bottom left)
Click the download icon in the top right corner (close to the share option). There should be an option to download data (if it’s grey’ed out, something is wrong). Clicking the data icon opens a new window
In the new window, select Full Data (instead of summary)
Click Show Fields, scroll up and select “all”
Finally, hit download. To keep it organized, I renamed the downloaded file with the hospital’s abbreviation
I did the above steps for all of the hospitals that I could select in Tableau
Show code
SIR_junk <-"TableauData/2024-10-09 CLASBI SIR"%>%# Use `fs` package to get the names of all of the CSVs fs::dir_ls() %>%# Read all of the CSVs using map purrr::map(readr::read_csv, col_types ="ccccccccccccccccccccccccccccccccccccccc") %>%# Bind into a single tablelist_rbind()
Missing data
This has quite a bit of missing data (since we downloaded all of the fields), so we can take a quick look at what is missing
There are also some uninformative columns (meaning that they say the same thing in every row) which we should remove. We also need to convert each column to the correct data type and give them better names, which we will to here.
## Not run, but this verifies the assumptions below# Is SUMMARYYM == Month of SUMMARYYMdf_temp %>%filter(SUMMARYYM !=`Month of SUMMARYYM`)# Is ORGID == ORGID (test)df_temp %>%filter(ORGID !=`ORGID (test)`)# Verify that NUMPRED & INFCOUNT are all numeric datadf_temp %>%filter(!is.na(NUMPRED)) %>%mutate(NUMPRED =as.numeric(NUMPRED)) %>%filter(is.na(NUMPRED))df_temp %>%filter(!is.na(INFCOUNT)) %>%mutate(INFCOUNT =as.numeric(INFCOUNT)) %>%filter(is.na(INFCOUNT))# Is INFCOUNT == Infections Observeddf_temp %>%mutate(INFCOUNT =as.numeric(INFCOUNT),`Infections Observed`=as.numeric(`Infections Observed`)) %>%filter(INFCOUNT !=`Infections Observed`)# Verify that ORGID captures CCN, `Org ID (2)`, `Org ID (copy)`df_temp %>%count(ORGID, CCN, `Org ID (2)`, `Org ID (copy)`)# Verify that NUMCLDAYS is all numeric datadf_temp %>%filter(!is.na(NUMCLDAYS)) %>%mutate(NUMCLDAYS =as.numeric(NUMCLDAYS)) %>%filter(is.na(NUMCLDAYS))# Verify that SIR is all numeric data (where it is not missing)df_temp %>%filter(!is.na(SIR)) %>%mutate(SIR =as.numeric(SIR)) %>%filter(is.na(SIR))# Is SIR == SIR (all)df_temp %>%filter(SIR !=`SIR (all)`)
These are the changes we will make to these columns. The old name of the column is shown in [brackets]
Month [Month of SUMMARYYM]: Character of the month-year. Will set the day to be the 15th
Date [derived from Month of SUMMARYYM]: Date Date format of Month, with the day set to be the 15th
removed [Infection Type (fix)]: All that is said before was “CLABSI” (for all rows, no missing data)
removed [LOCATION]: All that is said before was “(All Locations)” (for all rows, no missing data)
removed [ORGID (test)]: This is the same as [ORGID], so we will just keep Hosp
removed [Infection Type]: All that is said before was “CLA” (for all rows, no missing data)
Hosp [ORGID]: Factor of the hospital abbreviation
data_source [Table Name]: Factor; appears to be the CSV file that the data was pulled from
removed [INFCOUNT]: Aside from having decimal points, this is the same as [Infections Observed], so we will just keep Inf_obs
Inf_pred [NUMPRED]: Numeric
removed [CCN]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate
Inf_obs [Infections Observed]: Numeric, looks like it’s the number of infections that occured
removed [LOCATIONS]: All that is said before was “(All Locations)” (for all rows, no missing data)
CL_days [NUMCLDAYS]: Numeric
removed [ORGID (2)]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate
removed [ORGID (copy)]: This maps 1:1 with [ORGID], so will just use Hosp as a surrogate
SIR [SIR]: Numeric
removed [SIR (all)]: This is the same as [SIR], so we will just keep SIR
SIR_p [SIR_PVAL]: Character. There are periods for some of these p values, which we will make NA
SIR_95CI [SIR95CI]: Character
removed [SUMMARYYM]: This is the same as [Month of SUMMARYYM], so we will just keep Month
Show code
SIR_full <- df_temp %>%mutate(Month =`Month of SUMMARYYM`,Date =dmy(str_glue("15-{Month}")),Hosp =as_factor(ORGID),Inf_pred =as.numeric(NUMPRED),Inf_obs =as.numeric(`Infections Observed`),CL_days =as.numeric(NUMCLDAYS),SIR =as.numeric(SIR),SIR_p = SIR_PVAL,SIR_95CI = SIR95CI,data_source =as_factor(`Table Name`),# .keep="none" is the same as transmute, keeps only new columns.keep="none") %>%# Reorder SIR to be with p-value & CIrelocate(SIR, .before = SIR_p) %>%# Replace p=. with NA mutate(SIR_p =na_if(SIR_p, "."))SIR_full %>% rmarkdown::paged_table()
Show code
rm(df_temp, SIR_junk)
SIR
You can see that the SIR is not calculated for most of the hospitals in the data set. Below is a table showing what percent of each variable is missing, stratified by hospital. This is because the SIR is not calculated when the number of predicted infections are less than 1, per NHSN resources (pg 5). More on this later
Show code
# gg_miss_upset(SIR)SIR_full %>%group_by(Hosp) %>%miss_var_summary() %>%select(-n_miss) %>%pivot_wider(names_from = variable, values_from = pct_miss) %>%select(Hosp, starts_with("SIR")) %>% knitr::kable(caption ="Percent of variable that is missing, by hospital")
Percent of variable that is missing, by hospital
Hosp
SIR
SIR_p
SIR_95CI
BMC
100.00000
100.00000
100.00000
BRN
100.00000
100.00000
100.00000
BRX
100.00000
100.00000
100.00000
CCMC
100.00000
100.00000
100.00000
GMH
100.00000
100.00000
100.00000
GRMC
100.00000
100.00000
100.00000
HRS
100.00000
100.00000
100.00000
JAX
100.00000
100.00000
100.00000
JMC
100.00000
100.00000
100.00000
PRN
100.00000
100.00000
100.00000
PVH
100.00000
100.00000
100.00000
RMH
100.00000
100.00000
100.00000
SMR
100.00000
100.00000
100.00000
STJ
100.00000
100.00000
100.00000
UHC
100.00000
100.00000
100.00000
UTN
100.00000
100.00000
100.00000
WHL
97.05882
97.05882
97.05882
WTZ
100.00000
100.00000
100.00000
WVUH
0.00000
0.00000
0.00000
Hospital data
This comes from a Google Sheet where I store some of the info for each hospital
The reporting across these two databases aren’t exactly the same for the hospitals. There are some hospitals that show up in the CHG dataset but not the NHSN dataset. Specifically, Children’s, Fairmont, St Francis, and Thomas.
I assume that Children’s (CH) should be lumped with Ruby (WVUH) but what about Fairmont (FMT)? Until I hear otherwise, I will
Lump CH with WVUH
Ignore FMT, STF, and TMH
Pooling together predicted infections
From the NHSN documentation
The standardized infection ratio (SIR) is a summary measure used to track HAIs at a national, state, or local level over time. The SIR adjusts for various facility and/or patient-level factors that contribute to HAI risk within each facility. The method of calculating an SIR is similar to the method used to calculate the Standardized Mortality Ratio (SMR), a summary statistic widely used in public health to analyze mortality data. In HAI data analysis, the SIR compares the actual number of HAIs reported to the number that would be predicted, given the standard population (i.e., NHSN baseline), adjusting for several risk factors that have been found to be significantly associated with differences in infection incidence.
The formula for SIR is:
\[
SIR = \frac{Observed\ HAIs}{Predicted\ HAIs}
\]
The problem
The SIR is only calculated when the number of predicted HAIs is at least 1, which prevents extremely high values and fluctuations in the SIR. As our data reports \(CLABSI_{obs}\) and \(CLABSI_{pred}\) for each hospital-month, very few instances have a \(CLABSI_{pred} \ge 1\) , so the SIR is not calculated in most cases.
Moreover, the predicted number of CLABSIs varies greatly among hospitals, as shown in an extract of the raw data below.
Show code
SIR_full %>%filter(Hosp %in%c("WVUH", "UHC", "GRMC", "JAX")) %>%select(Hosp, Month, Inf_pred, Inf_obs, SIR) %>%group_by(Hosp) %>%filter(row_number()<4) %>%ungroup() %>%# Add seperators, which requires changing everything to charsmutate(across(everything(), as.character)) %>%add_row(Hosp ="...", Month="...", Inf_pred ="...", Inf_obs ="...", SIR ="...", .after =3) %>%add_row(Hosp ="...", Month="...", Inf_pred ="...", Inf_obs ="...", SIR ="...", .after =7) %>%add_row(Hosp ="...", Month="...", Inf_pred ="...", Inf_obs ="...", SIR ="...", .after =11) %>% knitr::kable(caption ="Sample of NSHN data for hospitals of various sizes")
Sample of NSHN data for hospitals of various sizes
Hosp
Month
Inf_pred
Inf_obs
SIR
GRMC
Jan-19
0.046
0
NA
GRMC
Feb-19
0.038
0
NA
GRMC
Mar-19
0.033
0
NA
…
…
…
…
…
JAX
Jan-19
0.002
0
NA
JAX
Feb-19
0.003
0
NA
JAX
Mar-19
0
0
NA
…
…
…
…
…
UHC
Jan-19
0.424
0
NA
UHC
Feb-19
0.448
1
NA
UHC
Mar-19
0.401
0
NA
…
…
…
…
…
WVUH
Jan-19
2.925
2
0.684
WVUH
Feb-19
2.689
1
0.372
WVUH
Mar-19
3.147
3
0.953
These histograms show the more detailed breakdown of each hospital’s monthly expected CLABSIs. As expected, some of the small hospitals (e.g. critical access) have incredibly low \(CLABSI_{pred}\) (some are zero), whereas larger hopsitals like Ruby (WVUM) mostly have monthly expected rates of >1
The NHSN documentation (page 12) suggests using a longer time period in order to reach the 1.0 threshold of predicted infections.
This looks like it may work for some of the medium sized hospitals (e.g. UTN, CCMC, RMH), but most of the smaller hospitals don’t reach that threshold across the entire time that we have data for (not to mention what would happen if you split it by pre-post). This is demonstrated in the table below, which shows the total (sum) of each hospital’s expected & observed CLABSIs (across the entire time period)
Aggregating the predicted infections across multiple months (until the sum across those months reaches 1) is what is suggested by NHSN. Because each hospital has different rates of predicted infections, the length of time needed to pool across (e.g. 2 months, 5 months, 1 year) may vary depending on the hospital.
I made a function that splits each hospitals data into chunks2 of time (e.g. April 2024 to July 2024), trying to keep the number of predicted infection roughly even within each chunk. This means that the duration of each chunk is variable, depending on how long it takes to reach a sum(Inf_pred) >= 1, but that now each row of data should have at least one3 predicted infection, allowing us to calculate the SIR.
Code for the function
create_buckets <-function(data, chunk_size=1, round_up=0.8, date_col="Date", data_col="Inf_pred",verbose=F) {#' Creating buckets for predicted infections#'#' This function takes a data.frame (`data`) from containing a column with #' the number of predicted infections (`data_col`) and splits it up into#' chunks where the sum of `data_col` within each chunk is at least #' `chunk_size`#' #' @param data Data frame containing a column of predicted infections#' @param chunk_size Target minimum number of infections to have in each #' bucket. This should ideally be 1#' @param round_up At what cutoff (of remainder of infections) should you #' increase the bucket size? #' @param date_col Label for a column with date objects#' @param data_col Label for the column in `data` with the predicted infections#' @param verbose Should warnings be given?#'#' @return#' The same data frame, but with a new column named bucket containing the#' bucket IDs# Ensure data is sorted by date data <- data %>%arrange(!!rlang::sym(date_col))# This is the total number of predicted infections total <- data %>% dplyr::pull(data_col) %>%sum()### FIRST: Calculate the number of buckets to break things down into# Calculate the number of buckets to divide things up into. This is floor # division, meaning that it will round down (to make sure each chunk has # greater than then size specified in chunk_size). For example, if there is # a total of 2.3 predicted infections, it will split it into two chunks # since 2.3 %/% 1 = 2 (assuming a target chunk size of at least 1) n_buckets <- total %/% chunk_sizeif(total %% chunk_size >= round_up){# If the remainder of the total infections is greater than the cutoff given# in round_up (e.g. total predicted of 2.8 has a remiander of 0.8), increase# the bucket size by one to prevent wasted data n_buckets <- n_buckets +1if(verbose) warning("increasing bucket size") }### SECOND: Determine the actual size chunk size to target actual_chunk_size <- total / n_buckets### THIRD: Define values used during the loop# This is the vector where we store what bucket that row falls under bucket <-c()# This is the starting bucket ID bucket_id <-1# This represents the running total (cumulative sum) of predicted infections# for the bucket running_total <-0### FOURTH: Loop thriught the data# Run through each row of the data, keeping a running sum of predicted # infections. Once the running total exceeds our target chunk size (see# `actual_chunk_size` in step 2), reset `running_total` to zero and move# to then next bucketfor (i in1:nrow(data)) { running_total <- running_total + dplyr::pull(data, data_col)[i]# If the running total exceeds the actual chunk size, assign the current# bucket, reset the running total, and move to the bucketif (running_total >= actual_chunk_size) { bucket <-c(bucket, bucket_id) # append bucket_id to running list bucket_id <- bucket_id +1 running_total <-0 } else { bucket <-c(bucket, bucket_id) # append bucket_id to running list } }### FIFTH: Add the vector of bucket_id's calculated in the loop as a # column to the original data, and return the data frame data$bucket <- bucketreturn(data)}
For a larger hospital like UHC, we could probably get away with lumping things into 2-3 month bins, but medium-small hospitals (like GRMC & RMH) require much longer periods of time to reach a (cumulative) total of 1 infection. This is demonstrated in the two respective tables below:
Show code
SIR_full %>%filter(Hosp=="UHC") %>%select(Month, Date, Hosp, Inf_pred) %>%create_buckets() %>%group_by(bucket) %>%# View()filter(bucket<3| bucket>28) %>%mutate(cumsum =cumsum(Inf_pred)) %>%mutate(bucket =str_glue("Chunk #{bucket} (size={sum(Inf_pred)})")) %>%select(-Date, -Hosp) %>%gt(groupname_col ="bucket", rowname_col ="Month") %>%tab_stub_indent(everything(), 5) %>%cols_label(Inf_pred ="Monthly",cumsum =md("Chunk's running total")) %>%tab_spanner(label =md('**Predicted infections** (UHC)'),columns =c(Inf_pred, cumsum) ) %>%tab_header(title ='Example of grouping months together in chunks',subtitle =md('Data is limited to UHC, **not all chunks shown**') ) %>%opt_stylize(5)
A visual representation of this process is shown below for RMH, with the original data shown in red and the aggregated data in grey4
Show code
SIR_full %>%filter(Hosp=="RMH") %>%select(Month, Date, Hosp, Inf_pred) %>%create_buckets() %>%group_by(bucket) %>%summarise(total =sum(Inf_pred),start =min(Date),end =max(Date)) %>%ggplot() +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=total)) +geom_col(aes(x=Date, y=Inf_pred), data=filter(SIR_full, Hosp=="RMH"), fill="red") +theme_bw() +scale_x_date(date_breaks ="6 months", date_labels ="%b\n'%y") +labs(title="Predicted infections vs time: RMH",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")
This is the same type of figure among most of the hospitals. I’ve omitted hospitals that had fewer than 1 total predicted infections (across the entire data set), as we will address these in the next section. I’ve also omitted WVUH (Ruby) as it always has at least one monthly predicted infection
Show code
df <- SIR_full %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%# select(Month, Date, Hosp, Inf_pred) %>%# create_buckets() %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date)) # df %>%# group_by(Hosp) %>%# filter(n()>1) %>%# filter(Inf_pred==min(Inf_pred))# df %>%# group_by(Hosp) %>%# filter(n()>1) %>%# filter(Hosp!="WVUH") %>%# # mutate(duration = as.numeric(days(end - start)) / (24 * 60 * 60)) %>%# mutate(months = time_length(end - start, unit="months")) %>%# ggplot(aes(x=Hosp, y=Inf_pred)) + # geom_boxplot() +# geom_jitter(aes(color=months),alpha=.5)df %>%group_by(Hosp) %>%filter(n()>1) %>%filter(Hosp!="WVUH") %>%ggplot() +geom_rect(aes(xmin=start-days(13), xmax=end+days(13), ymin=0, ymax=Inf_pred)) +facet_wrap("Hosp") +geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,data=filter(SIR_full, Hosp %in%pull(filter(group_by(df, Hosp), n()>1, Hosp!="WVUH"),Hosp)) ) +theme_bw() +labs(title="Predicted infections vs time",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")
This allows us to finally calculate the SIR across more of the hospitals than we could at first (when we could only view the SIR for WVUH)
In the previous section, we focused on aggregating predicted infections across multiple months. This allowed us to give each hospital “chunks” of time during which we would expect to see one observed infection.
When we did this, we skipped over5 hospitals that did not have a total of at least one predicted infection (across the entire dataset). These tended to be the smaller hospitals, often (but not always) critical access hospitals. In this section, we will return to those hospitals that we skipped over and try to find a way to get pool their predicted infections to reach thresholds greater than one.
Recall the table shown at the beginning of the Solutions section, which showed the total number of predicted infections by hospital. Below is the same table, supplemented with each hospital’s classification6 and their median monthly number of central line days:
Hospital classification terminology
I am open to using different terminology here for classification. One thought would be to do it by bed size (large, medium, small), but I don’t have an accurate source for each hospital’s bed size
Show code
SIR_full %>%group_by(Hosp) %>%summarise(expected =sum(Inf_pred),observed =sum(Inf_obs),# median_CL_days = qwraps2::median_iqr(CL_days)median_CL_days =median(CL_days) ) %>%arrange(expected) %>%left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%select(Hosp:median_CL_days, hosp_type) %>%relocate(hosp_type, .before = expected) %>%ungroup() %>%mutate(totalAboveOne = expected >=1,totalAboveOne =if_else(totalAboveOne, ">= 1 total expected infection","<1 total expected infection")) %>%gt(rowname_col ="Hosp", groupname_col ="totalAboveOne") %>%tab_stub_indent(everything(), 5) %>%cols_label(expected =md("**Predicted**"),observed =md("Observed"),# median_CL_days = md("Median (IQR) *monthly*<br>central line days"),median_CL_days =md("Median *monthly*<br>central line days"),hosp_type =md("Hospital Type")) %>%tab_spanner(label =md('Total infections'),columns =c(expected, observed) ) %>% gtExtras::gt_color_rows(columns = expected,domain =c(0,50)# palette = c("purple", "lightgrey", "green") ) %>%data_color(columns = median_CL_days,domain =c(0,1000),palette ="BuPu",# pal_type = "continuous" ) %>%tab_style_body(columns = hosp_type,values ="Critical access",style =cell_text(color ='#9f0101') ) %>%tab_style_body(columns = hosp_type,values ="Community",style =cell_text(color ='#5a3286') ) %>%tab_style_body(columns = hosp_type,values ="???",style =cell_fill(color ='#e5ce90') ) %>%tab_header(title =md('Total number of infections & central line days, by hospital'),subtitle =md(str_glue('Data shown is across all months (**{format(min(SIR_full$Date), "%b %Y")} to {format(max(SIR_full$Date), "%b %Y")}**)')) ) %>%opt_stylize(6, color="gray")
Total number of infections & central line days, by hospital
Data shown is across all months (Jan 2019 to Aug 2024)
Hospital Type
Total infections
Median monthly central line days
Predicted
Observed
<1 total expected infection
HRS
Critical access
0.023
0
0.0
BRN
Critical access
0.028
0
0.0
BRX
Critical access
0.067
0
5.0
JAX
Critical access
0.242
2
9.0
WTZ
???
0.262
0
1.0
STJ
Critical access
0.434
0
19.0
PVH
Critical access
0.670
1
30.0
GMH
Critical access
0.728
0
37.5
JMC
Critical access
0.896
0
47.0
SMR
Critical access
0.979
0
46.0
>= 1 total expected infection
GRMC
Critical access
2.988
2
67.5
RMH
Community
4.753
3
95.0
UTN
Community
13.707
36
288.0
PRN
Community
21.481
61
482.5
BMC
Community
21.961
6
359.0
CCMC
Community
36.191
24
487.0
WHL
Community
41.671
39
668.5
UHC
Community
43.610
34
747.0
WVUH
Academic
185.902
116
2433.0
Misclassifications?
GRMC is listed as a critical access hospital, but has a much higher number of predicted infections than other critical access hospitals. Any idea on what is causing that? Regardless, might make sense to look at it separate from other CAHs
WTZ didn’t have critical access designation when I looked it up, but it sure behaves like a CAH
By hospital type
The first solution that comes to mind would be to lump all the CAH together and treat them as the “same hospital”. The NHSN reference looks like they use an intercept-only model for CAHs (Table 3 on page 12), so that tells me they didn’t find many useful predictors within each critical access hospital.
If we lump together all of the critical access hospitals, we get some acceptable results, shown in the figure below. Again, I’m not showing the WVUH data since it throws off the scale of the Y-axis. The only issue here is WTZ still can’t be used and we’ve now pooled GRMC’s data with all of the critical access hospitals.
Show code
CAH <- hospitals %>%filter(hosp_type=="Critical access") %>%pull(NHSN_abbv)df <- SIR_full %>%mutate(Hosp =case_when( Hosp %in% CAH ~"Critical access",TRUE~ Hosp )) %>%group_by(Hosp, Month, Date) %>%# arrange(Date) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days)) %>%ungroup() df2 <- df %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, Hosp=="Critical access"~1.05,TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date)) df2 %>%group_by(Hosp) %>%# filter(n()>1) %>% filter(Hosp!="WVUH") %>%ggplot() +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +facet_wrap("Hosp") +geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,data=filter(df, Hosp!="WVUH") ) +theme_bw() +labs(title="Predicted infections vs time",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")
Another option would be to use some cutoff to aggregate hospitals together. The one that comes to mind is central line days, as this tracks well with predicted number of infections (as would be expected)7. The natural cutoff in the dataset appears to be around 150 central line days per month, as shown below
Show code
SIR_full %>%left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%ggplot(aes(x=Hosp, y=CL_days)) +geom_boxplot(aes(color=hosp_type)) +# geom_boxplot() + facet_wrap("hosp_type", scales = "free_x") +coord_flip() +scale_y_sqrt() +geom_hline(yintercept=150, linetype="dashed", color ="black") +geom_segment(aes(x =2.5, y =300, xend =1.5, yend =150),arrow =arrow(length =unit(0.25, "cm"))) +annotate("text", x=2.5, y=310, label="150", hjust="left") +labs(title="Monthly central line days, by hopsital",y ="CL days (transformed)") +theme_bw()
If we use this cutoff to aggregate smaller hospitals together, we get the most number of chunks per “hospital” among all of the solutions we’ve done this far. The trade off is, we now have tons of hospitals aggregated into one group (specifically BRN, BRX, GMH, GRMC, HRS, JAX, JMC, PVH, RMH, SMR, STJ, and WTZ)
Show code
cutoff <-150df <- SIR_full %>%# Calculate average CL_days, and rename Hosp if below cutoffgroup_by(Hosp) %>%mutate(mean_CLD =mean(CL_days),CL_rare = mean_CLD < cutoff) %>%mutate(Hosp =if_else(CL_rare, "CL rare", Hosp)) %>%ungroup() %>%group_by(Hosp, Month, Date) %>%# arrange(Date) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days)) %>%ungroup() df2 <- df %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, # Hosp=="CL rare" ~ 1.05,TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date)) # df2 %>% filter(Hosp=="CL rare") %>% tail()df2 %>%group_by(Hosp) %>%# filter(n()>1) %>% filter(Hosp!="WVUH") %>%ggplot() +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +facet_wrap("Hosp", ncol =2) +geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,data=filter(df, Hosp!="WVUH") ) +theme_bw() +labs(title="Predicted infections vs time",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")
Show code
rm(cutoff, df, df2)
Joining the data
At this point we have a few data.frames we are working with
CHG data, which has the number of patients who got CHG bathing (Numerator) and the number of patients who were eligible (Denominator). This is at a hospital level where each row is the weekly data
NHSN data, which has the predicted (Inf_pred) and observed (Inf_obs) number of infections. This is monthly data for each hospital, but as shown in the “Pooling together predicted infections” section, we can aggregate it across months &/or hospitals
This is where it gets fun! To get the CHG data to line up with the NHSN data, we need to
Pool together the NSHN data, both across hospitals and time
Pool the CHG data across hospitals (e.g. lump all the CAH together), matching what we did in step 1
Pool the CHG data by month
Aggregate the CHG data across time, specific to that hospital’s chunks of time (since we chunked the NHSN into different time frames depending on the hospital
I will assume we are grouping together all hospitals that have an fewer than 150 central line days per month (on average)
Show code
cutoff <-150# Store the names of the small hospitals for latersmall_hosp <- temp_NHSN %>%# Calculate average CL_days, and rename Hosp if below cutoffgroup_by(Hosp) %>%filter(mean(CL_days) < cutoff) %>%pull(Hosp) %>%unique() %>%as.character()# Make the NHSN data pooled by hospital and chunked into time bucketstemp_NHSN <- temp_NHSN %>%# Calculate average CL_days, and rename Hosp if below cutoffgroup_by(Hosp) %>%mutate(mean_CLD =mean(CL_days),CL_rare = mean_CLD < cutoff) %>%mutate(Hosp =if_else(CL_rare, "Small", Hosp)) %>%ungroup() %>%group_by(Hosp, Month, Date) %>%# arrange(Date) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),.groups ="drop") %>%ungroup() %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, # Hosp=="Small" ~ 1.05,TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date), .groups ="drop")
Step 2: Aggregate CHG by hospital
As mentioned in the section on importing Hospital data, the CHG dashboard is more granular than the NHSN data (e.g. Childrens Hospital is split from Ruby) and some of the abbreviations differ between the datasets (e.g. CCMC vs CCM).
Here, we will fix those abbreviations (we will use the NHSN abbreviations) and merge CH with WVUH. We will also group together the smaller hospitals (e.g. CAH), based on what we did in step 1 above. Once all of that is done, we sum up everything up by hospital (or hospital group) & week.
Show code
temp_CHG <- temp_CHG %>%select(Hospital, Date, Denominator, Numerator) %>%left_join(select(hospitals, NHSN_abbv, CHG_abbv), by=c("Hospital"="CHG_abbv")) %>%# Make a new column named Hosp, with the adjusted names# Rename hospital names to set us up to pool across names# !!! Be careful of which one you use here (Hospital vs NHSN_abbv)mutate(Hosp =case_when(Hospital=="CH"~"WVUH", NHSN_abbv %in% small_hosp ~"Small",TRUE~ NHSN_abbv)) %>%# # Check your work# filter(Hospital!=NHSN_abbv | is.na(NHSN_abbv)) %>%# count(Hospital, NHSN_abbv, Hosp)select(-Hospital, -NHSN_abbv) %>%# Remove hospitals not listed in NHSNfilter(Hosp %in% temp_NHSN$Hosp) %>%# Aggregate datagroup_by(Hosp, Date) %>%summarise(Denominator =sum(Denominator, na.rm = T),Numerator =sum(Numerator, na.rm = T),.groups ="drop") %>%ungroup()
Step 3: Aggregate CHG by month
This is exactly what we did in the section where we imported the CHG compliance data, we just have to do it again now that we aggregated the CHG data by hospital
Show code
temp_CHG <- temp_CHG %>%# Needed to only process one date at a timerowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%group_by(Hosp, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T))
Step 4: Match CHG chunks to NHSN chunks
At this point, we have NHSN data broken down into chunks, like the sample below
To join the CHG compliance data with our chunked NHSN data, we will work row-by-row through the NHSN table. For each row, we will filter the CHG data:
The hospital name matches (\(Hosp_{\ CHG} = Hosp_{\ NHSN}\))
The month of the CHG data is after or equal to the start date in NHSN (\(Month_{\ CHG} \ge Start_{\ NHSN}\))
The month of CHG data is before or equal to the end date in NHSN (\(Month_{\ CHG} \le End_{\ NHSN}\))
For the sample data shown above, this would translate to these three sets of tables
Desired output, with rows of NHSN shown on the left, and filtered CHG data on the right
Hosp
Inf_pred
Inf_obs
start
end
CCMC
0.979
1
2024-07-15
2024-08-15
Hosp
Month
Denominator
Numerator
CCMC
2024-07-15
1998
1823
CCMC
2024-08-15
1887
1723
Hosp
Inf_pred
Inf_obs
start
end
PRN
0.943
1
2024-06-15
2024-08-15
Hosp
Month
Denominator
Numerator
PRN
2024-06-15
602
480
PRN
2024-07-15
712
581
PRN
2024-08-15
574
497
Hosp
Inf_pred
Inf_obs
start
end
Small
0.917
0
2024-03-15
2024-08-15
Hosp
Month
Denominator
Numerator
Small
2024-03-15
645
399
Small
2024-04-15
478
375
Small
2024-05-15
663
556
Small
2024-06-15
397
338
Small
2024-07-15
551
487
Small
2024-08-15
413
344
Code for the function
# Define a helper function that will filter the CHG data based on given parametersmatch_CHG <-function(hosp, start, end, data_name,date_col="Month",hosp_col ="Hosp") {#' Matching CHG data to pooled NHSN data#' #' @param hosp String of the hospital name, used to filter the column#' named `hosp_col` in the data.frame named `data_name`#' @param start The date to begin with (e.g. discards any dates before)#' @param start The date to end with (e.g. discards any dates after)#' @param data_name a string matching the name of the global variable #' with the CHG data (e.g. "temp_CHG" or "CHG_monthly")#' @param date_col The name of the column in `data_name` where we should#' filter the dates from#' @param hosp_col The name of the column in `data_name` where we should#' filter the hospitals from#'#' @return#' A data frame with one row. The colums will depend on if the filter#' function was able to identify any data from the CHG table meeting#' the given criteria#' #' If it found match(es), it will return a df with the following columns#' - Denominator: the sum of the denominator during time period#' - Numerator: the sum of the numerator during time period#' - CHG_start: the variable given from `start`#' - CHG_end: the variable given from `end`#' - Months_CHG: a nested list with the months that were aggregated#' #' If no matches were found, it will return a df with the following columns#' - CHG_start: the variable given from `start`#' - CHG_end: the variable given from `end`#' #' @examples#' # example code #' match_CHG(hosp = "PRN",#' start = ymd("2023-11-15"),#' end = ymd("2024-01-15"),#' data_name = "CHG_monthly",#' hosp_col = "Hospital")#' #' # This is more useful using it inside of a mutate function#' temp_NHSN %>%#' rowwise() %>%#' mutate(CHG = pmap(list(hosp=Hosp, #' start=start, end=end, #' data_name="temp_CHG"),#' match_CHG)) %>%#' unnest(CHG) # I couldn't find an easy way to pass the data frame using map, so this # will have to do data <-get(data_name)### PART 1: Run validation tests# Check that data_name matches to a data.frame testthat::expect_s3_class(data, "data.frame")# Test that the provided data has columns with the given names testthat::expect_contains(names(data), hosp_col) testthat::expect_contains(names(data), date_col)# Ensure that the column for date is indeed dates, and the dates provided# are in date format testthat::expect_s3_class(data[[date_col]], "Date") testthat::expect_s3_class(start, "Date") testthat::expect_s3_class(end, "Date")# Check that start & end are length of one (e.g. forgot rowwise) testthat::expect_length(start, 1) testthat::expect_length(end, 1)# Make sure the name of the hospital is found in the data testthat::expect_contains(unique(data[[hosp_col]]), hosp)### PART 2: Filter CHG data to that hospital's time period# Filter to only get the info we want filter_data <- data %>%filter(!!rlang::sym(hosp_col) %in% hosp) %>%filter(!!rlang::sym(date_col) >= start) %>%filter(!!rlang::sym(date_col) <= end) ### PART 3: Aggregate to make it one row single_row <- filter_data %>%ungroup() %>%group_by(!!rlang::sym(hosp_col)) %>%summarise(Denominator =sum(Denominator, na.rm = T),Numerator =sum(Numerator, na.rm = T),Months_CHG =list(!!rlang::sym(date_col))) %>%mutate(CHG_start = start,CHG_end = end) %>%select(-!!rlang::sym(hosp_col)) %>%relocate(Months_CHG, .after = CHG_end)### PART 4: If no matching data is found, make a placeholder row# that will be filled with NAsif(nrow(single_row)==0){ single_row <-tibble(# Hospital = hosp,# Denominator = -1,# Numerator = -1,# Months_CHG = list(),CHG_start = start,CHG_end = end) }return(single_row)}
We ultimately will sum together all of the numerators & denominators (within each table) to give us the overall numerators & denominators for CHG bathing during that time period. Once we’ve done that, we can calculate the CHG_rate during that chunk of time, and our data is ready for analysis!
In an ideal world, we could get CHG data going back to around the time of the roll out (or before), identify the time period where hospitals transitioned, and compare the SIRs before & after the roll out
If we can’t get that information, we could just assume that rates of CHG bathing were low before September of 2023 and split the data into Pre/Post by the date
Welch Two Sample t-test
data: SIR by Cluster
t = -1.4226, df = 63.002, p-value = 0.1598
alternative hypothesis: true difference in means between group After and group Before is not equal to 0
95 percent confidence interval:
-0.6322592 0.1064125
sample estimates:
mean in group After mean in group Before
0.7101937 0.9731171
Another option would be using a generalized linear model to predict the number of infections observed, and using predicted infections (and CHG bathing) as predictors. Since we would be looking at count data, we could do a Poisson regression (or more likely a negative binomial regression)…
# library(DHARMa)mod_nb <- MASS::glm.nb(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined) summary(mod_nb)# Compare to poissonmod_pos <-glm(Inf_obs ~ Inf_pred + CHG_rate, family ='poisson',data = temp_joined) calc_dispersion <-function(mod, verbose=F){# Identify the model family fam_raw <- mod$family$familyif(str_detect(fam_raw, "poisson")) family <-"poisson"if(str_detect(fam_raw, "Negative Binomial")) family <-"NB" E2 <-resid(mod, type ="pearson") N <-nrow(mod$model) p <-length(coef(mod)) if(family=="NB") p <- p +1# '+1' is for variance parameter in NB result <-sum(E2^2) / (N - p)if(verbose) {if(result>1) message("Overdisperion")if(result<1) message("Underdisperion") }return(result)}# DHARMa::testDispersion(mod_nb)calc_dispersion(mod_nb)calc_dispersion(mod_pos)rm(mod_nb, mod_pos, calc_dispersion)
ZIP & ZINB
I haven’t highlighted this yet (it will go into descriptive statistics) but our data has a whole lot of zeros, namely for the observed infections
Figure 2: High number of zeros, highlighted by histograms of the number of observed infections
I’ve used zero-inflated regressionmodels before, but I’m not sure if that would apply here. These are used when there are two processes8 causing a mixed distribution, but I don’t think that is the case here. But leaving this section in just in case
Show code
library(pscl) # also see {ZIM} packagepscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | Inf_pred , # these are the predictors for the zero inflated partdist ="poisson", # Use 'negbin' for negative binomialdata = temp_joined) %>%summary()# # Versus# glm(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined, family = poisson) %>%# summary()
Show code
x %>%mutate(CHG_rate = Numerator / Denominator,SIR = Inf_obs / Inf_pred) %>%filter(!is.na(CHG_rate)) %>% pscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | Inf_pred + CHG_rate, # these are the predictors for the zero inflated partdist ="negbin", data = .) %>%summary()
For example, March 31st (2024) was on a Sunday, so the week starting on 3/31/24 mostly reflects April 2024↩︎
I use the terms chunks and buckets interchangeably (I might also use the term “bin” somewhere in here too↩︎
This isn’t strictly the case, since there is a trade off between making sure each chunk has a sum(Inf_pred) > 1 (which reduces the number of chunks we get) and maximizing the number of chunks (to give us the most granular data points about time). So some chunks do have fewer than 1 predicted infection, but I tried to keep that to a minimum↩︎
Note that the area of the grey boxes is not proportional to the total number of infections, only the height is↩︎
Even if we did try with these hospitals, they would have one chunk of time (spanning the entire data) which still would be insufficient to calculate a SIR from↩︎
These classifications can be changed on the Google Sheet and will update whenever I run the analysis again↩︎
In practice, this is kind of splitting hospitals by predicted infections. But I think I prefer this because it’s more translatable to those who don’t know how NHSN does all of it’s math (like me!)↩︎
I think it would apply if we had a situation like: Does the hospital have patients with central lines (the logistical regression portion) and if so, how many CLABSIs did they get (the negative binomial portion)↩︎
Source Code
---title: "CHG impact"format: html: toc: true code-fold: true code-summary: "Show code" code-tools: true lightbox: trueeditor: visual---```{r setup, include=FALSE, echo=F}knitr::opts_chunk$set(warning=F, message=F, echo=T)library(tidyverse)library(lubridate)library(naniar) # for missing datalibrary(googlesheets4)library(gt); library(gtExtras)```# Import data## CHG complianceData from the [CHG compliance dashboard](https://tableau.wvumedicine.org/#/views/CHGTreatmentCompliance/AllCHGTreatmentCompliance?:iid=1) is fairly straightforward to download. Go to `device type` and only select central lines. Then export the heatmap to CSV```{r read_CHG_old}#| echo: true#| # eval: falseCHG_raw <-"TableauData/2024-09-22/Overall Rate by Hospital and Department.csv"%>%read_csv(col_types ="ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc",skip =1)## These are the same months, but split across 2023-2024# • `31-Dec-23` -> `31-Dec-23...23`# • `31-Dec-23` -> `31-Dec-23...24`CHG_clean <- CHG_raw %>%rename(Hospital=...1, Benchmark=...2, AtGoal=...3, Measure=...4) %>%mutate(Measure =case_when(str_detect(Measure, "Numerator") ~"Numerator",str_detect(Measure, "Denominator") ~"Denominator",TRUE~"Other")) %>%filter(Measure %in%c("Numerator", "Denominator")) %>%# Make long formatpivot_longer(cols =-c(Hospital, Benchmark, AtGoal, Measure), names_to ="Date", values_to ="n") %>%# Parse the numbersmutate(n =str_remove(n, ","), # Remove comman =as.numeric(n)) %>%# make 31-Dec-23...23 amd 31-Dec-23...24 the samemutate(Date =case_when(Date=="31-Dec-23...23"~"31-Dec-23", Date=="31-Dec-23...24"~"31-Dec-23",TRUE~ Date)) %>%filter(Date!="27-Aug-23") %>%# Group by Hospital + Month + Denominatorgroup_by(Hospital, Date, Measure) %>%summarise(n =sum(n, na.rm = T)) %>%# Pivot data out pivot_wider(names_from = Measure, values_from = n) %>%# Process datesmutate(Date = lubridate::dmy(Date)) %>%ungroup() rm(CHG_raw)``````{r read_CHG_new}CHG_raw <-"TableauData/2024-10-18 CHG CL only/All Overall Rate by Hospital and Department_data.csv"%>%read_csv()CHG_clean <- CHG_raw %>%select(Hospital, Date=`Week of Census Time`,Denominator=`All Lines Denominator`,Numerator=`All Lines Numerator`) %>%mutate(Date =mdy(Date)) %>%group_by(Hospital, Date) %>%summarise(Denominator =sum(Denominator, na.rm = T),Numerator =sum(Numerator, na.rm = T)) %>%ungroup() %>%# Remove most recent week filter(Date !=ymd("2024-10-13"))rm(CHG_raw)``````{r}CHG_clean %>% rmarkdown::paged_table()```::: callout-noteI can only download the past 12 months of data from the dashboard. Will need to get access to more of the data to do a good pre-post analysis:::This is a quick glimpse of the CHG compliance rates over time```{r}#| fig-height: 8# CHG_clean %>%# mutate(rate = Numerator / Denominator ) %>%# # ggplot(aes(x=Date, y=rate, color=Hospital)) + # geom_line(key_glyph = "timeseries") +# geom_point() + # scale_x_date(date_breaks = "3 months", date_labels = "%b\n'%y") +# facet_wrap("Hospital", ncol = 4) +# guides(color=F) +# theme_bw()CHG_clean %>%mutate(rate = Numerator / Denominator ) %>%ggplot(aes(x=Date, y=rate, color=Denominator)) +geom_line(key_glyph ="timeseries") +geom_point() +scale_x_date(date_breaks ="3 months", date_labels ="%b\n'%y") +facet_wrap("Hospital", ncol =4) +# scale_colour_gradient2(trans="log10", mid = "yellow", high = "red")scale_color_viridis_c(option ="D", trans="log10") +# guides(color=F) +theme_bw()```Because it is reported weekly (and everything else is reported monthly), we need to convert the weekly data into monthly data::: callout-note## How are the weeks calculated?When you extract the data from Tableau, the dates it lists are all Sundays. It doesn't look like it's explicitly documented anywhere, but I assume that this means the **week starts on Sunday**, meaning the date provided is day 1 of 7 for that week:::Some weeks span the course of multiple months[^1]. In these cases, we will group them with whichever month that the majority of their days fell under[^1]: For example, March 31st (2024) was on a Sunday, so the week starting on 3/31/24 mostly reflects April 2024```{r}week_to_month <-function(date, beginning=T) {require(lubridate)# Make sure you are given only one date, not a vectorstopifnot(length(date)==1)# Calculate the midpoint of the week, which is 3 days away from the start # (or end) dateif(beginning) { mid_date <- date +days(3) } else({# Otherwise, treat the given day as the end of the week mid_date <- date -days(3) })# Extract the month and the year from this date new_month <-month(mid_date) new_year <-year(mid_date)# Create a new date and return new_date <-ymd(paste0(new_year, "-", new_month, "-15"))return(new_date)}``````{r}CHG_monthly <- CHG_clean %>%# Needed to only process one date at a timerowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%group_by(Hospital, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T))``````{r}#| fig-height: 8CHG_clean %>%# Needed to only process one date at a timerowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%group_by(Hospital, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T)) %>%mutate(rate = Numerator / Denominator) %>%ggplot(aes(x=Month, y=rate, color=Hospital)) +geom_line(key_glyph ="timeseries") +geom_point() +scale_x_date(date_breaks ="3 months", date_labels ="%b\n'%y") +facet_wrap("Hospital", ncol =4) +guides(color=F) +theme_bw()```## NHSN data```{r}#| eval: false#| fig-height: 8### THIS WAS A CHUNK FOR OBSERVED RATES, BUT IT'S INCLUDED IN SIR FILE NOWCLABSI_abs_raw <-read_csv("TableauData/2024-09-22/Observed Infections by Month & Year.csv")CLABSI_abs <- CLABSI_abs_raw %>%select(-LOCATIONS) %>%rename(Hospital=ORGID) %>%# Make long formatpivot_longer(cols =-Hospital, names_to ="Date", values_to ="n") %>%mutate(Date =ymd(paste0(Date, "-1")),Month =month(Date),Year =year(Date),Qrt =quarter(Date))rm(CLABSI_abs_raw)CLABSI_abs %>%ggplot(aes(x=Date, y=n, color=Hospital)) +# geom_jitter(key_glyph = "timeseries") +geom_line(key_glyph ="timeseries") +scale_x_date(date_breaks ="6 months", date_labels ="%b\n'%y") +facet_wrap("Hospital", ncol =3) +guides(color=F) +theme_bw() +labs(y="# CLABSI")```### WorkflowThis isn't as straightforward of a download from Tableau, so to keep it reproducible this is my workflow for downloading the data- Navigate to the [NHSN Infections Dashboard](https://tableau.wvumedicine.org/#/workbooks/959/views), go to the Compare Hospitals dashboard- Select "**all**" for the `Month & Year` drop down- Ensure the `Infection Type` is set to **CLABSI**- Use the `Hospital` drop down to select the hospital of interestOnce you've selected the hospital, export the data to CSV with the following steps:1. Click the "*Observed & Expected by Month & Year*" panel (bottom left)2. Click the **download icon** in the top right corner (close to the share option). There should be an option to download **data** (if it's grey'ed out, something is wrong). Clicking the data icon opens a new window3. In the new window, select `Full Data` (instead of summary)4. Click `Show Fields`, scroll up and select "**all**"5. Finally, hit download. To keep it organized, I renamed the downloaded file with the hospital's abbreviationI did the above steps for all of the hospitals that I could select in Tableau```{r}#| code-fold: trueSIR_junk <-"TableauData/2024-10-09 CLASBI SIR"%>%# Use `fs` package to get the names of all of the CSVs fs::dir_ls() %>%# Read all of the CSVs using map purrr::map(readr::read_csv, col_types ="ccccccccccccccccccccccccccccccccccccccc") %>%# Bind into a single tablelist_rbind()```### Missing dataThis has quite a bit of missing data (since we downloaded all of the fields), so we can take a quick look at what is missing```{r}#| fig-height: 8library(naniar)naniar::vis_miss(SIR_junk ) +coord_flip()```This is seen a little easier in table format, showing quite a few fields have 100% missing data, which we will go ahead and remove```{r}naniar::miss_var_summary(SIR_junk) %>% rmarkdown::paged_table()``````{r}to_remove <- naniar::miss_var_summary(SIR_junk) %>%filter(pct_miss==100) %>% dplyr::pull(variable)df_temp <- SIR_junk %>%select(-to_remove)rm(to_remove)```### Renaming and dropping columnsThere are also some uninformative columns (meaning that they say the same thing in every row) which we should remove. We also need to convert each column to the correct data type and give them better names, which we will to here.```{r}#| eval: falsedf_temp %>%mutate(across(everything(), as.factor)) %>%summary()``````{r}#| eval: false## Not run, but this verifies the assumptions below# Is SUMMARYYM == Month of SUMMARYYMdf_temp %>%filter(SUMMARYYM !=`Month of SUMMARYYM`)# Is ORGID == ORGID (test)df_temp %>%filter(ORGID !=`ORGID (test)`)# Verify that NUMPRED & INFCOUNT are all numeric datadf_temp %>%filter(!is.na(NUMPRED)) %>%mutate(NUMPRED =as.numeric(NUMPRED)) %>%filter(is.na(NUMPRED))df_temp %>%filter(!is.na(INFCOUNT)) %>%mutate(INFCOUNT =as.numeric(INFCOUNT)) %>%filter(is.na(INFCOUNT))# Is INFCOUNT == Infections Observeddf_temp %>%mutate(INFCOUNT =as.numeric(INFCOUNT),`Infections Observed`=as.numeric(`Infections Observed`)) %>%filter(INFCOUNT !=`Infections Observed`)# Verify that ORGID captures CCN, `Org ID (2)`, `Org ID (copy)`df_temp %>%count(ORGID, CCN, `Org ID (2)`, `Org ID (copy)`)# Verify that NUMCLDAYS is all numeric datadf_temp %>%filter(!is.na(NUMCLDAYS)) %>%mutate(NUMCLDAYS =as.numeric(NUMCLDAYS)) %>%filter(is.na(NUMCLDAYS))# Verify that SIR is all numeric data (where it is not missing)df_temp %>%filter(!is.na(SIR)) %>%mutate(SIR =as.numeric(SIR)) %>%filter(is.na(SIR))# Is SIR == SIR (all)df_temp %>%filter(SIR !=`SIR (all)`)```These are the changes we will make to these columns. The old name of the column is shown in \[*brackets*\]- `Month`\[*Month of SUMMARYYM*\]: [Character]{.underline} of the month-year. Will set the day to be the 15th- `Date`\[derived from *Month of SUMMARYYM*\]: [Date]{.underline} Date format of `Month`, with the day set to be the 15th- **removed** \[*Infection Type (fix)*\]: All that is said before was "CLABSI" (for all rows, no missing data)- **removed** \[*LOCATION*\]: All that is said before was "(All Locations)" (for all rows, no missing data)- **removed** \[*ORGID (test)*\]: This is the same as \[*ORGID*\], so we will just keep `Hosp`- **removed** \[*Infection Type*\]: All that is said before was "CLA" (for all rows, no missing data)- `Hosp`\[*ORGID*\]: [Factor]{.underline} of the hospital abbreviation- `data_source`\[*Table Name*\]: [Factor]{.underline}; appears to be the CSV file that the data was pulled from- **removed** \[*INFCOUNT*\]: Aside from having decimal points, this is the same as \[*Infections Observed*\], so we will just keep `Inf_obs`- `Inf_pred`\[*NUMPRED*\]: [Numeric]{.underline}- **removed** \[*CCN*\]: This maps 1:1 with \[*ORGID*\], so will just use `Hosp` as a surrogate- `Inf_obs`\[*Infections Observed*\]: [Numeric]{.underline}, looks like it's the number of infections that occured- **removed** \[*LOCATIONS*\]: All that is said before was "(All Locations)" (for all rows, no missing data)- `CL_days`\[*NUMCLDAYS*\]: [Numeric]{.underline}- **removed** \[*ORGID (2)*\]: This maps 1:1 with \[*ORGID*\], so will just use `Hosp` as a surrogate- **removed** \[*ORGID (copy)*\]: This maps 1:1 with \[*ORGID*\], so will just use `Hosp` as a surrogate- `SIR`\[*SIR*\]: [Numeric]{.underline}- **removed** \[*SIR (all)*\]: This is the same as \[*SIR*\], so we will just keep `SIR`- `SIR_p`\[*SIR_PVAL*\]: [Character]{.underline}. There are periods for some of these p values, which we will make NA- `SIR_95CI`\[*SIR95CI*\]: [Character]{.underline}- **removed** \[*SUMMARYYM*\]: This is the same as \[*Month of SUMMARYYM*\], so we will just keep `Month````{r}SIR_full <- df_temp %>%mutate(Month =`Month of SUMMARYYM`,Date =dmy(str_glue("15-{Month}")),Hosp =as_factor(ORGID),Inf_pred =as.numeric(NUMPRED),Inf_obs =as.numeric(`Infections Observed`),CL_days =as.numeric(NUMCLDAYS),SIR =as.numeric(SIR),SIR_p = SIR_PVAL,SIR_95CI = SIR95CI,data_source =as_factor(`Table Name`),# .keep="none" is the same as transmute, keeps only new columns.keep="none") %>%# Reorder SIR to be with p-value & CIrelocate(SIR, .before = SIR_p) %>%# Replace p=. with NA mutate(SIR_p =na_if(SIR_p, "."))SIR_full %>% rmarkdown::paged_table()rm(df_temp, SIR_junk)```### SIRYou can see that the SIR is not calculated for most of the hospitals in the data set. Below is a table showing what **percent of each variable is missing**, stratified by hospital. This is because the SIR is not calculated when the number of predicted infections are less than 1, per [NHSN resources (pg 5)](https://www.cdc.gov/nhsn/pdfs/ps-analysis-resources/nhsn-sir-guide.pdf#page=5). More on this later```{r}# gg_miss_upset(SIR)SIR_full %>%group_by(Hosp) %>%miss_var_summary() %>%select(-n_miss) %>%pivot_wider(names_from = variable, values_from = pct_miss) %>%select(Hosp, starts_with("SIR")) %>% knitr::kable(caption ="Percent of variable that is missing, by hospital")```## Hospital dataThis comes from a [Google Sheet](https://docs.google.com/spreadsheets/d/1zZ1o1HhzOPrbj2KBdGsaO4rJvLDPMWqN0ogjMCNHfGw/edit?usp=sharing) where I store some of the info for each hospital```{r}library(googlesheets4)options(gargle_oauth_email =TRUE)gs4_auth(cache =TRUE)if(!exists("hospitals")){ hospitals_raw <-paste0("https://docs.google.com/","spreadsheets/d/","1zZ1o1HhzOPrbj2KBdGsaO4rJvLDPMWqN0ogjMCNHfGw/") %>%read_sheet(sheet ="CHG Hospitals") hospitals <- hospitals_rawrm(hospitals_raw)}hospitals %>% rmarkdown::paged_table()```The reporting across these two databases aren't exactly the same for the hospitals. There are some hospitals that show up in the CHG dataset but not the NHSN dataset. Specifically, Children's, Fairmont, St Francis, and Thomas.```{r}hospitals %>%select(NHSN_abbv, CHG_abbv, name_short) %>%full_join(count(SIR_full, Hosp), by=c("NHSN_abbv"="Hosp")) %>%rename(NHSN=n) %>%full_join(count(CHG_clean, Hospital), by=c("CHG_abbv"="Hospital")) %>%rename(CHG=n) %>%relocate(name_short) %>%filter(!is.na(NHSN_abbv)|!is.na(CHG_abbv)) %>%gt(rowname_col ="name_short") %>%fmt_missing() %>%tab_spanner(label ='# rows',columns =c(NHSN, CHG) ) %>%# cols_label() %>%tab_header(title =md('**Hospitals in this dataset**'),# subtitle = md('Showing counts of ') ) %>%opt_stylize(5) ```::: callout-note## What should we do?I assume that Children's (CH) should be lumped with Ruby (WVUH) but what about Fairmont (FMT)? Until I hear otherwise, I will1. Lump CH with WVUH2. Ignore FMT, STF, and TMH:::# Pooling together predicted infectionsFrom the NHSN documentation> The standardized infection ratio (SIR) is a summary measure used to track HAIs at a national, state, or local level over time. The SIR adjusts for various facility and/or patient-level factors that contribute to HAI risk within each facility. The method of calculating an SIR is similar to the method used to calculate the Standardized Mortality Ratio (SMR), a summary statistic widely used in public health to analyze mortality data. In HAI data analysis, the SIR compares the actual number of HAIs reported to the number that would be predicted, given the standard population (i.e., NHSN baseline), adjusting for several risk factors that have been found to be significantly associated with differences in infection incidence.The formula for SIR is:$$SIR = \frac{Observed\ HAIs}{Predicted\ HAIs}$$## The problemThe SIR *is only calculated when the number of predicted HAIs is at **least 1***, which prevents extremely high values and fluctuations in the SIR. As our data reports $CLABSI_{obs}$ and $CLABSI_{pred}$ for each hospital-month, very few instances have a $CLABSI_{pred} \ge 1$ , so the SIR is not calculated in most cases.Moreover, the predicted number of CLABSIs varies greatly among hospitals, as shown in an extract of the raw data below.```{r}SIR_full %>%filter(Hosp %in%c("WVUH", "UHC", "GRMC", "JAX")) %>%select(Hosp, Month, Inf_pred, Inf_obs, SIR) %>%group_by(Hosp) %>%filter(row_number()<4) %>%ungroup() %>%# Add seperators, which requires changing everything to charsmutate(across(everything(), as.character)) %>%add_row(Hosp ="...", Month="...", Inf_pred ="...", Inf_obs ="...", SIR ="...", .after =3) %>%add_row(Hosp ="...", Month="...", Inf_pred ="...", Inf_obs ="...", SIR ="...", .after =7) %>%add_row(Hosp ="...", Month="...", Inf_pred ="...", Inf_obs ="...", SIR ="...", .after =11) %>% knitr::kable(caption ="Sample of NSHN data for hospitals of various sizes")```These histograms show the more detailed breakdown of each hospital's **monthly expected** CLABSIs. As expected, some of the small hospitals (e.g. critical access) have incredibly low $CLABSI_{pred}$ (some are zero), whereas larger hopsitals like Ruby (WVUM) mostly have monthly expected rates of \>1```{r}#| fig-height: 7.5#| fig-width: 8SIR_full %>%ggplot(aes(x=Inf_pred)) +geom_histogram() +facet_wrap("Hosp", scales ="free_y", ncol =4) +theme_bw() +scale_x_log10(breaks = scales::log_breaks(),labels = scales::label_log()) +labs(title="Histogram of monthly predicted CLABSIs, by hospital",x ="Predicted infections (visually log transformed)")```## SolutionsThe NHSN documentation ([page 12](https://www.cdc.gov/nhsn/pdfs/ps-analysis-resources/nhsn-sir-guide.pdf#page=12)) suggests using a longer time period in order to reach the 1.0 threshold of predicted infections.This looks like it may work for some of the medium sized hospitals (e.g. UTN, CCMC, RMH), but most of the smaller hospitals don't reach that threshold across the entire time that we have data for (not to mention what would happen if you split it by pre-post). This is demonstrated in the table below, which shows the total (sum) of each hospital's expected & observed CLABSIs (*across the **entire time period***)```{r}#| collapse: TRUESIR_full %>%group_by(Hosp) %>%summarise(expected =sum(Inf_pred),observed =sum(Inf_obs)) %>%arrange(expected) %>% rmarkdown::paged_table()# SIR_full %>%# mutate(Rollout = if_else(Date < ymd("2023-09-01"), "Before", "After")) %>%# group_by(Hosp, Rollout) %>%# summarise(expected = sum(Inf_pred),# observed = sum(Inf_obs)) %>%# pivot_wider(names_from = Rollout,# names_glue = "{Rollout}_{.value}",# values_from = c(expected, observed)) %>%# mutate(Total_expected = Before_expected + After_expected,# Total_observed = Before_observed + After_observed) %>%# relocate(Hosp, Before_expected, After_expected, Total_expected, # Before_observed, After_observed, Total_observed)```### #1: Pooling the monthsAggregating the predicted infections across multiple months (until the sum across those months reaches 1) is what is suggested by NHSN. Because each hospital has different rates of predicted infections, the length of time needed to pool across (e.g. 2 months, 5 months, 1 year) may vary depending on the hospital.I made a function that splits each hospitals data into chunks[^2] of time (e.g. April 2024 to July 2024), trying to keep the number of predicted infection roughly even within each chunk. This means that the *duration* of each chunk is variable, depending on how long it takes to reach a `sum(Inf_pred) >= 1`, but that now each row of data should have at least one[^3] predicted infection, allowing us to calculate the SIR.[^2]: I use the terms **chunks** and **buckets** interchangeably (I might also use the term "bin" somewhere in here too[^3]: This isn't strictly the case, since there is a trade off between making sure each chunk has a `sum(Inf_pred) > 1` (which reduces the number of chunks we get) and maximizing the number of chunks (to give us the most granular data points about time). So some chunks do have fewer than 1 predicted infection, but I tried to keep that to a minimum```{r}#| code-fold: true#| code-summary: "Code for the function"create_buckets <-function(data, chunk_size=1, round_up=0.8, date_col="Date", data_col="Inf_pred",verbose=F) {#' Creating buckets for predicted infections#'#' This function takes a data.frame (`data`) from containing a column with #' the number of predicted infections (`data_col`) and splits it up into#' chunks where the sum of `data_col` within each chunk is at least #' `chunk_size`#' #' @param data Data frame containing a column of predicted infections#' @param chunk_size Target minimum number of infections to have in each #' bucket. This should ideally be 1#' @param round_up At what cutoff (of remainder of infections) should you #' increase the bucket size? #' @param date_col Label for a column with date objects#' @param data_col Label for the column in `data` with the predicted infections#' @param verbose Should warnings be given?#'#' @return#' The same data frame, but with a new column named bucket containing the#' bucket IDs# Ensure data is sorted by date data <- data %>%arrange(!!rlang::sym(date_col))# This is the total number of predicted infections total <- data %>% dplyr::pull(data_col) %>%sum()### FIRST: Calculate the number of buckets to break things down into# Calculate the number of buckets to divide things up into. This is floor # division, meaning that it will round down (to make sure each chunk has # greater than then size specified in chunk_size). For example, if there is # a total of 2.3 predicted infections, it will split it into two chunks # since 2.3 %/% 1 = 2 (assuming a target chunk size of at least 1) n_buckets <- total %/% chunk_sizeif(total %% chunk_size >= round_up){# If the remainder of the total infections is greater than the cutoff given# in round_up (e.g. total predicted of 2.8 has a remiander of 0.8), increase# the bucket size by one to prevent wasted data n_buckets <- n_buckets +1if(verbose) warning("increasing bucket size") }### SECOND: Determine the actual size chunk size to target actual_chunk_size <- total / n_buckets### THIRD: Define values used during the loop# This is the vector where we store what bucket that row falls under bucket <-c()# This is the starting bucket ID bucket_id <-1# This represents the running total (cumulative sum) of predicted infections# for the bucket running_total <-0### FOURTH: Loop thriught the data# Run through each row of the data, keeping a running sum of predicted # infections. Once the running total exceeds our target chunk size (see# `actual_chunk_size` in step 2), reset `running_total` to zero and move# to then next bucketfor (i in1:nrow(data)) { running_total <- running_total + dplyr::pull(data, data_col)[i]# If the running total exceeds the actual chunk size, assign the current# bucket, reset the running total, and move to the bucketif (running_total >= actual_chunk_size) { bucket <-c(bucket, bucket_id) # append bucket_id to running list bucket_id <- bucket_id +1 running_total <-0 } else { bucket <-c(bucket, bucket_id) # append bucket_id to running list } }### FIFTH: Add the vector of bucket_id's calculated in the loop as a # column to the original data, and return the data frame data$bucket <- bucketreturn(data)}```For a larger hospital like UHC, we could probably get away with lumping things into 2-3 month bins, but medium-small hospitals (like GRMC & RMH) require much longer periods of time to reach a (cumulative) total of 1 infection. This is demonstrated in the two respective tables below:```{r}SIR_full %>%filter(Hosp=="UHC") %>%select(Month, Date, Hosp, Inf_pred) %>%create_buckets() %>%group_by(bucket) %>%# View()filter(bucket<3| bucket>28) %>%mutate(cumsum =cumsum(Inf_pred)) %>%mutate(bucket =str_glue("Chunk #{bucket} (size={sum(Inf_pred)})")) %>%select(-Date, -Hosp) %>%gt(groupname_col ="bucket", rowname_col ="Month") %>%tab_stub_indent(everything(), 5) %>%cols_label(Inf_pred ="Monthly",cumsum =md("Chunk's running total")) %>%tab_spanner(label =md('**Predicted infections** (UHC)'),columns =c(Inf_pred, cumsum) ) %>%tab_header(title ='Example of grouping months together in chunks',subtitle =md('Data is limited to UHC, **not all chunks shown**') ) %>%opt_stylize(5)``````{r}SIR_full %>%filter(Hosp %in%c("GRMC", "RMH", "UHC")) %>%nest(.by = Hosp) %>%mutate(data =map(data, create_buckets)) %>%unnest(data) %>%# select(Month, Date, Hosp, Inf_pred) %>%# create_buckets() %>%group_by(Hosp, bucket) %>%summarise(median = qwraps2::median_iqr(Inf_pred),Inf_pred =sum(Inf_pred),# Inf_obs = sum(Inf_obs),months =n(),start =min(Date),end =max(Date) ) %>%ungroup() %>%# rowwise() %>%# Filter data to show select rows from UHC toomutate(keep =case_when(Hosp !="UHC"~TRUE, bucket <3~TRUE, bucket >28~TRUE,TRUE~FALSE)) %>%filter(keep) %>%select(-keep) %>%mutate(bucket =str_glue("Chunk #{bucket}")) %>%mutate(start =format(start, "%b '%y"),end =format(end, "%b '%y")) %>%relocate(months, median, .after=end) %>%# select(-Inf_obs)gt(groupname_col ="Hosp", rowname_col ="bucket") %>%tab_stub_indent(everything(), 5) %>% gt::cols_merge_range(start, end) %>%cols_label(start =md("Time span<br> "),Inf_pred =md("Total infections<br>(during chunk)"),median =md("Median (IQR)<br>monthly infections"),months =md("Duration<br>(months)")) %>%opt_stylize(5, color ="green") %>%tab_header(title =md('**Grouping months together in chunks**'),subtitle =md('Showing **predicted** infections; *select hospitals shown*') ) ```A visual representation of this process is shown below for RMH, with the original data shown in red and the aggregated data in grey[^4][^4]: Note that the area of the grey boxes is not proportional to the total number of infections, only the height is```{r}SIR_full %>%filter(Hosp=="RMH") %>%select(Month, Date, Hosp, Inf_pred) %>%create_buckets() %>%group_by(bucket) %>%summarise(total =sum(Inf_pred),start =min(Date),end =max(Date)) %>%ggplot() +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=total)) +geom_col(aes(x=Date, y=Inf_pred), data=filter(SIR_full, Hosp=="RMH"), fill="red") +theme_bw() +scale_x_date(date_breaks ="6 months", date_labels ="%b\n'%y") +labs(title="Predicted infections vs time: RMH",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")```This is the same type of figure among most of the hospitals. I've omitted hospitals that had fewer than 1 total predicted infections (across the entire data set), as we will address these in the next section. I've also omitted WVUH (Ruby) as it always has at least one monthly predicted infection```{r}#| collapse: TRUEdf <- SIR_full %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%# select(Month, Date, Hosp, Inf_pred) %>%# create_buckets() %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date)) # df %>%# group_by(Hosp) %>%# filter(n()>1) %>%# filter(Inf_pred==min(Inf_pred))# df %>%# group_by(Hosp) %>%# filter(n()>1) %>%# filter(Hosp!="WVUH") %>%# # mutate(duration = as.numeric(days(end - start)) / (24 * 60 * 60)) %>%# mutate(months = time_length(end - start, unit="months")) %>%# ggplot(aes(x=Hosp, y=Inf_pred)) + # geom_boxplot() +# geom_jitter(aes(color=months),alpha=.5)df %>%group_by(Hosp) %>%filter(n()>1) %>%filter(Hosp!="WVUH") %>%ggplot() +geom_rect(aes(xmin=start-days(13), xmax=end+days(13), ymin=0, ymax=Inf_pred)) +facet_wrap("Hosp") +geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,data=filter(SIR_full, Hosp %in%pull(filter(group_by(df, Hosp), n()>1, Hosp!="WVUH"),Hosp)) ) +theme_bw() +labs(title="Predicted infections vs time",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")```This allows us to finally calculate the SIR across more of the hospitals than we could at first (when we could only view the SIR for WVUH)```{r}df %>%group_by(Hosp) %>%filter(n()>1) %>%# filter(Hosp!="WVUH") %>%mutate(SIR = Inf_obs / Inf_pred) %>%mutate(SIR =if_else(SIR==0, 0.1, SIR)) %>%ggplot(aes(fill=SIR)) +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=SIR)) +facet_wrap("Hosp") +theme_bw() +scale_fill_viridis_c() +# scale_fill_gradient2(midpoint = 1)labs(title="SIR using pooled time periods",y ="Standardized Infection Ratio\n(SIR)")``````{r}rm(df)```### #2: Pooling hospitalsIn the previous section, we focused on aggregating predicted infections across multiple months. This allowed us to give each hospital "chunks" of time during which we would expect to see one observed infection.When we did this, we skipped over[^5] hospitals that did **not** have a total of at least one predicted infection (across the entire dataset). These tended to be the smaller hospitals, often (but not always) critical access hospitals. In this section, we will return to those hospitals that we skipped over and try to find a way to get pool their predicted infections to reach thresholds greater than one.[^5]: Even if we did try with these hospitals, they would have one chunk of time (spanning the entire data) which still would be insufficient to calculate a SIR fromRecall the table shown at the beginning of the [Solutions] section, which showed the total number of predicted infections by hospital. Below is the same table, supplemented with each hospital's classification[^6] and their median monthly number of central line days:[^6]: These classifications can be changed on the Google Sheet and will update whenever I run the analysis again::: callout-note## Hospital classification terminologyI am open to using different terminology here for classification. One thought would be to do it by bed size (large, medium, small), but I don't have an accurate source for each hospital's bed size:::```{r}SIR_full %>%group_by(Hosp) %>%summarise(expected =sum(Inf_pred),observed =sum(Inf_obs),# median_CL_days = qwraps2::median_iqr(CL_days)median_CL_days =median(CL_days) ) %>%arrange(expected) %>%left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%select(Hosp:median_CL_days, hosp_type) %>%relocate(hosp_type, .before = expected) %>%ungroup() %>%mutate(totalAboveOne = expected >=1,totalAboveOne =if_else(totalAboveOne, ">= 1 total expected infection","<1 total expected infection")) %>%gt(rowname_col ="Hosp", groupname_col ="totalAboveOne") %>%tab_stub_indent(everything(), 5) %>%cols_label(expected =md("**Predicted**"),observed =md("Observed"),# median_CL_days = md("Median (IQR) *monthly*<br>central line days"),median_CL_days =md("Median *monthly*<br>central line days"),hosp_type =md("Hospital Type")) %>%tab_spanner(label =md('Total infections'),columns =c(expected, observed) ) %>% gtExtras::gt_color_rows(columns = expected,domain =c(0,50)# palette = c("purple", "lightgrey", "green") ) %>%data_color(columns = median_CL_days,domain =c(0,1000),palette ="BuPu",# pal_type = "continuous" ) %>%tab_style_body(columns = hosp_type,values ="Critical access",style =cell_text(color ='#9f0101') ) %>%tab_style_body(columns = hosp_type,values ="Community",style =cell_text(color ='#5a3286') ) %>%tab_style_body(columns = hosp_type,values ="???",style =cell_fill(color ='#e5ce90') ) %>%tab_header(title =md('Total number of infections & central line days, by hospital'),subtitle =md(str_glue('Data shown is across all months (**{format(min(SIR_full$Date), "%b %Y")} to {format(max(SIR_full$Date), "%b %Y")}**)')) ) %>%opt_stylize(6, color="gray") ```::: callout-warning## Misclassifications?- **GRMC** is listed as a critical access hospital, but has a much higher number of predicted infections than other critical access hospitals. Any idea on what is causing that? Regardless, might make sense to look at it separate from other CAHs- **WTZ** didn't have critical access designation when I looked it up, but it sure behaves like a CAH:::#### By hospital typeThe first solution that comes to mind would be to lump all the CAH together and treat them as the "same hospital". The NHSN reference looks like they use an intercept-only model for CAHs ([Table 3 on page 12](https://www.cdc.gov/nhsn/pdfs/ps-analysis-resources/nhsn-sir-guide.pdf#page=18)), so that tells me they didn't find many useful predictors within each critical access hospital.If we lump together all of the critical access hospitals, we get some acceptable results, shown in the figure below. Again, I'm not showing the WVUH data since it throws off the scale of the Y-axis. The only issue here is **WTZ still can't be used** and we've now **pooled GRMC's data with all of the critical access** hospitals.```{r}#| collapse: trueCAH <- hospitals %>%filter(hosp_type=="Critical access") %>%pull(NHSN_abbv)df <- SIR_full %>%mutate(Hosp =case_when( Hosp %in% CAH ~"Critical access",TRUE~ Hosp )) %>%group_by(Hosp, Month, Date) %>%# arrange(Date) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days)) %>%ungroup() df2 <- df %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, Hosp=="Critical access"~1.05,TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date)) df2 %>%group_by(Hosp) %>%# filter(n()>1) %>% filter(Hosp!="WVUH") %>%ggplot() +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +facet_wrap("Hosp") +geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,data=filter(df, Hosp!="WVUH") ) +theme_bw() +labs(title="Predicted infections vs time",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")# df %>%# ggplot(aes(x=Inf_pred)) +# geom_histogram() +# facet_wrap("Hosp", scales = "free", ncol = 4)rm(df, df2, CAH)```#### Central line daysAnother option would be to use some cutoff to aggregate hospitals together. The one that comes to mind is central line days, as this tracks well with predicted number of infections (as would be expected)[^7]. The natural cutoff in the dataset appears to be around 150 central line days per month, as shown below[^7]: In practice, this is kind of splitting hospitals by predicted infections. But I think I prefer this because it's more translatable to those who don't know how NHSN does all of it's math (like me!)```{r}SIR_full %>%left_join(hospitals, by=c("Hosp"="NHSN_abbv")) %>%ggplot(aes(x=Hosp, y=CL_days)) +geom_boxplot(aes(color=hosp_type)) +# geom_boxplot() + facet_wrap("hosp_type", scales = "free_x") +coord_flip() +scale_y_sqrt() +geom_hline(yintercept=150, linetype="dashed", color ="black") +geom_segment(aes(x =2.5, y =300, xend =1.5, yend =150),arrow =arrow(length =unit(0.25, "cm"))) +annotate("text", x=2.5, y=310, label="150", hjust="left") +labs(title="Monthly central line days, by hopsital",y ="CL days (transformed)") +theme_bw()```If we use this cutoff to aggregate smaller hospitals together, we get the most number of chunks per "hospital" among all of the solutions we've done this far. The trade off is, we now have tons of hospitals aggregated into one group (specifically `r str_flatten_comma(as.character(unique(pull(filter(group_by(SIR_full, Hosp), mean(CL_days)<150), Hosp))),last = ", and ")`)```{r}#| collapse: true#| fig-height: 8cutoff <-150df <- SIR_full %>%# Calculate average CL_days, and rename Hosp if below cutoffgroup_by(Hosp) %>%mutate(mean_CLD =mean(CL_days),CL_rare = mean_CLD < cutoff) %>%mutate(Hosp =if_else(CL_rare, "CL rare", Hosp)) %>%ungroup() %>%group_by(Hosp, Month, Date) %>%# arrange(Date) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days)) %>%ungroup() df2 <- df %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, # Hosp=="CL rare" ~ 1.05,TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date)) # df2 %>% filter(Hosp=="CL rare") %>% tail()df2 %>%group_by(Hosp) %>%# filter(n()>1) %>% filter(Hosp!="WVUH") %>%ggplot() +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0, ymax=Inf_pred)) +facet_wrap("Hosp", ncol =2) +geom_col(aes(x=Date, y=Inf_pred), fill="red", alpha=0.5,data=filter(df, Hosp!="WVUH") ) +theme_bw() +labs(title="Predicted infections vs time",subtitle ="Date buckets are in grey, original data is in red",caption =paste0("The width of the grey bars reflect the duration of the chunks","\nThe hight of the grey bars are the total number of predicted infections during the chunk"),y ="Predicted infections",x="")rm(cutoff, df, df2)```# Joining the dataAt this point we have a few data.frames we are working with- **CHG data**, which has the number of patients who got CHG bathing (`Numerator`) and the number of patients who were eligible (`Denominator`). This is at a hospital level where each row is the [weekly]{.underline} data- **NHSN data**, which has the predicted (`Inf_pred`) and observed (`Inf_obs`) number of infections. This is [monthly]{.underline} data for each hospital, but as shown in the "[Pooling together predicted infections]" section, we can aggregate it across months &/or hospitals- **Hospital data**, which comes from the [Google Sheet](https://docs.google.com/spreadsheets/d/1zZ1o1HhzOPrbj2KBdGsaO4rJvLDPMWqN0ogjMCNHfGw/edit?usp=sharing)------------------------------------------------------------------------This is where it gets fun! To get the CHG data to line up with the NHSN data, we need to1. Pool together the NSHN data, both across hospitals and time2. Pool the CHG data across hospitals (e.g. lump all the CAH together), matching what we did in step 13. Pool the CHG data by month4. Aggregate the CHG data across time, *specific to that hospital's chunks of time* (since we chunked the NHSN into different time frames depending on the hospital```{r}temp_NHSN <- SIR_fulltemp_CHG <- CHG_clean```## Step 1: Aggregate NHSN dataThis is effectively what we did in [Pooling together predicted infections]::: callout-importantI will assume we are grouping together all hospitals that have an fewer than 150 central line days per month (on average):::```{r}cutoff <-150# Store the names of the small hospitals for latersmall_hosp <- temp_NHSN %>%# Calculate average CL_days, and rename Hosp if below cutoffgroup_by(Hosp) %>%filter(mean(CL_days) < cutoff) %>%pull(Hosp) %>%unique() %>%as.character()# Make the NHSN data pooled by hospital and chunked into time bucketstemp_NHSN <- temp_NHSN %>%# Calculate average CL_days, and rename Hosp if below cutoffgroup_by(Hosp) %>%mutate(mean_CLD =mean(CL_days),CL_rare = mean_CLD < cutoff) %>%mutate(Hosp =if_else(CL_rare, "Small", Hosp)) %>%ungroup() %>%group_by(Hosp, Month, Date) %>%# arrange(Date) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),.groups ="drop") %>%ungroup() %>%nest(.by = Hosp) %>%# Play around with the numbers to get the right sizesmutate(c_size =case_when(Hosp=="PRN"~1.1, Hosp=="WHL"~0.95, Hosp=="BMC"~ .92, # Hosp=="Small" ~ 1.05,TRUE~formals(create_buckets)[["chunk_size"]])) %>%mutate(data =map2(data, c_size, \ (x,y) create_buckets(x, chunk_size = y))) %>%unnest(data) %>%group_by(Hosp, bucket) %>%summarise(Inf_pred =sum(Inf_pred),Inf_obs =sum(Inf_obs),CL_days =sum(CL_days),start =min(Date),end =max(Date), .groups ="drop") ```## Step 2: Aggregate CHG by hospitalAs mentioned in the section on importing [Hospital data], the CHG dashboard is more granular than the NHSN data (e.g. Childrens Hospital is split from Ruby) and some of the abbreviations differ between the datasets (e.g. CCMC vs CCM).Here, we will fix those abbreviations (we will use the NHSN abbreviations) and merge CH with WVUH. We will also group together the smaller hospitals (e.g. CAH), based on what we did in step 1 above. Once all of that is done, we sum up everything up by hospital (or hospital group) & week.```{r}temp_CHG <- temp_CHG %>%select(Hospital, Date, Denominator, Numerator) %>%left_join(select(hospitals, NHSN_abbv, CHG_abbv), by=c("Hospital"="CHG_abbv")) %>%# Make a new column named Hosp, with the adjusted names# Rename hospital names to set us up to pool across names# !!! Be careful of which one you use here (Hospital vs NHSN_abbv)mutate(Hosp =case_when(Hospital=="CH"~"WVUH", NHSN_abbv %in% small_hosp ~"Small",TRUE~ NHSN_abbv)) %>%# # Check your work# filter(Hospital!=NHSN_abbv | is.na(NHSN_abbv)) %>%# count(Hospital, NHSN_abbv, Hosp)select(-Hospital, -NHSN_abbv) %>%# Remove hospitals not listed in NHSNfilter(Hosp %in% temp_NHSN$Hosp) %>%# Aggregate datagroup_by(Hosp, Date) %>%summarise(Denominator =sum(Denominator, na.rm = T),Numerator =sum(Numerator, na.rm = T),.groups ="drop") %>%ungroup() ```## Step 3: Aggregate CHG by monthThis is exactly what we did in the section where we imported the [CHG compliance] data, we just have to do it again now that we aggregated the CHG data by hospital```{r}temp_CHG <- temp_CHG %>%# Needed to only process one date at a timerowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%group_by(Hosp, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T))```## Step 4: Match CHG chunks to NHSN chunksAt this point, we have NHSN data broken down into chunks, like the sample below```{r}temp_NHSN %>%filter(Hosp %in%c("PRN", "Small", "CCMC")) %>%group_by(Hosp) %>%filter(bucket ==max(bucket)) %>%select(Hosp, starts_with("Inf"), start, end) %>% knitr::kable(caption="Sample of NHSN data")```To join the [CHG compliance] data with our chunked NHSN data, we will work row-by-row through the NHSN table. **For each row**, we will filter the CHG data:- The hospital name matches ($Hosp_{\ CHG} = Hosp_{\ NHSN}$)- The month of the CHG data is after or equal to the start date in NHSN ($Month_{\ CHG} \ge Start_{\ NHSN}$)- The month of CHG data is before or equal to the end date in NHSN ($Month_{\ CHG} \le End_{\ NHSN}$)------------------------------------------------------------------------For the sample data shown above, this would translate to these three sets of tables::: {#tbl-panel layout-ncol="2"}```{r}#| echo: falsetemp_NHSN %>%filter(Hosp =="CCMC") %>%group_by(Hosp) %>%filter(bucket ==max(bucket)) %>%select(Hosp, starts_with("Inf"), start, end) %>% knitr::kable()``````{r}#| echo: falsex <- temp_NHSN %>%filter(Hosp =="CCMC") %>%group_by(Hosp) %>%filter(bucket ==max(bucket))temp_CHG %>%filter(Hosp=="CCMC", Month>=x[["start"]], Month<=x[["end"]]) %>% knitr::kable()rm(x)``````{r}#| echo: falsetemp_NHSN %>%filter(Hosp =="PRN") %>%group_by(Hosp) %>%filter(bucket ==max(bucket)) %>%select(Hosp, starts_with("Inf"), start, end) %>% knitr::kable()``````{r}#| echo: falsex <- temp_NHSN %>%filter(Hosp =="PRN") %>%group_by(Hosp) %>%filter(bucket ==max(bucket))temp_CHG %>%filter(Hosp=="PRN", Month>=x[["start"]], Month<=x[["end"]]) %>% knitr::kable()rm(x)``````{r}#| echo: falsetemp_NHSN %>%filter(Hosp =="Small") %>%group_by(Hosp) %>%filter(bucket ==max(bucket)) %>%select(Hosp, starts_with("Inf"), start, end) %>% knitr::kable()``````{r}#| echo: falsex <- temp_NHSN %>%filter(Hosp =="Small") %>%group_by(Hosp) %>%filter(bucket ==max(bucket))temp_CHG %>%filter(Hosp=="Small", Month>=x[["start"]], Month<=x[["end"]]) %>% knitr::kable()rm(x)```Desired output, with rows of NHSN shown on the left, and filtered CHG data on the right:::------------------------------------------------------------------------```{r}#| code-summary: "Code for the function"# Define a helper function that will filter the CHG data based on given parametersmatch_CHG <-function(hosp, start, end, data_name,date_col="Month",hosp_col ="Hosp") {#' Matching CHG data to pooled NHSN data#' #' @param hosp String of the hospital name, used to filter the column#' named `hosp_col` in the data.frame named `data_name`#' @param start The date to begin with (e.g. discards any dates before)#' @param start The date to end with (e.g. discards any dates after)#' @param data_name a string matching the name of the global variable #' with the CHG data (e.g. "temp_CHG" or "CHG_monthly")#' @param date_col The name of the column in `data_name` where we should#' filter the dates from#' @param hosp_col The name of the column in `data_name` where we should#' filter the hospitals from#'#' @return#' A data frame with one row. The colums will depend on if the filter#' function was able to identify any data from the CHG table meeting#' the given criteria#' #' If it found match(es), it will return a df with the following columns#' - Denominator: the sum of the denominator during time period#' - Numerator: the sum of the numerator during time period#' - CHG_start: the variable given from `start`#' - CHG_end: the variable given from `end`#' - Months_CHG: a nested list with the months that were aggregated#' #' If no matches were found, it will return a df with the following columns#' - CHG_start: the variable given from `start`#' - CHG_end: the variable given from `end`#' #' @examples#' # example code #' match_CHG(hosp = "PRN",#' start = ymd("2023-11-15"),#' end = ymd("2024-01-15"),#' data_name = "CHG_monthly",#' hosp_col = "Hospital")#' #' # This is more useful using it inside of a mutate function#' temp_NHSN %>%#' rowwise() %>%#' mutate(CHG = pmap(list(hosp=Hosp, #' start=start, end=end, #' data_name="temp_CHG"),#' match_CHG)) %>%#' unnest(CHG) # I couldn't find an easy way to pass the data frame using map, so this # will have to do data <-get(data_name)### PART 1: Run validation tests# Check that data_name matches to a data.frame testthat::expect_s3_class(data, "data.frame")# Test that the provided data has columns with the given names testthat::expect_contains(names(data), hosp_col) testthat::expect_contains(names(data), date_col)# Ensure that the column for date is indeed dates, and the dates provided# are in date format testthat::expect_s3_class(data[[date_col]], "Date") testthat::expect_s3_class(start, "Date") testthat::expect_s3_class(end, "Date")# Check that start & end are length of one (e.g. forgot rowwise) testthat::expect_length(start, 1) testthat::expect_length(end, 1)# Make sure the name of the hospital is found in the data testthat::expect_contains(unique(data[[hosp_col]]), hosp)### PART 2: Filter CHG data to that hospital's time period# Filter to only get the info we want filter_data <- data %>%filter(!!rlang::sym(hosp_col) %in% hosp) %>%filter(!!rlang::sym(date_col) >= start) %>%filter(!!rlang::sym(date_col) <= end) ### PART 3: Aggregate to make it one row single_row <- filter_data %>%ungroup() %>%group_by(!!rlang::sym(hosp_col)) %>%summarise(Denominator =sum(Denominator, na.rm = T),Numerator =sum(Numerator, na.rm = T),Months_CHG =list(!!rlang::sym(date_col))) %>%mutate(CHG_start = start,CHG_end = end) %>%select(-!!rlang::sym(hosp_col)) %>%relocate(Months_CHG, .after = CHG_end)### PART 4: If no matching data is found, make a placeholder row# that will be filled with NAsif(nrow(single_row)==0){ single_row <-tibble(# Hospital = hosp,# Denominator = -1,# Numerator = -1,# Months_CHG = list(),CHG_start = start,CHG_end = end) }return(single_row)}```We ultimately will sum together all of the numerators & denominators (within each table) to give us the overall numerators & denominators for CHG bathing during that time period. Once we've done that, we can calculate the `CHG_rate` during that chunk of time, and our data is ready for analysis!Here is the final, joined data:```{r}temp_joined <- temp_NHSN %>%rowwise() %>%mutate(CHG =pmap(list(hosp=Hosp, start=start, end=end, data_name="temp_CHG"), match_CHG)) %>%unnest(CHG) %>%mutate(CHG_rate = Numerator / Denominator,SIR = Inf_obs / Inf_pred) %>%select(Hosp, Inf_pred, Inf_obs, SIR, start, end, CHG_rate, Numerator, Denominator)temp_joined %>%group_by(Hosp) %>%arrange(Hosp, desc(end)) %>% rmarkdown::paged_table()# x %>%# # ggplot(aes(x=CHG_rate, y=SIR)) +# geom_jitter() +# geom_smooth() +# facet_wrap("Hosp")```::: callout-noteMuch of the CHG data is listed as NA, becuase the CHG data only goes back to `r min(CHG_clean$Date)`, whereas the NHSN data goes back to `r min(SIR_full$Date)`.We really need to get the CHG data from earlier (ideally during implementation) to take full advantage of the NHSN data:::# Descriptive statistics::: callout-tip## TodoWill work on once we have some of the methodology stuff sorted above (and ideally the full CHG data):::For this run (`r today()`) of the analysis:- The data set for CHG bathing spans from **`r format(min(CHG_clean$Date), "%b %d %Y")`** to **`r format(max(CHG_clean$Date), "%b %d %Y")`**- The data set for NHSN spans from **`r format(min(SIR_full$Date), "%b %Y")`** to **`r format(max(SIR_full$Date), "%b %Y")`**```{r}#| fig-height: 6.5#| fig-width: 8SIR_full %>%group_by(Hosp) %>%summarise(start =min(Date),end =max(Date)) %>%ggplot(aes(x=Hosp)) +# geom_point(aes(y=start)) +geom_errorbar(aes(ymin = start, ymax=end), width=.3) +coord_flip() +theme_bw() +labs(title="Available NHSN data",y="Date", x="Hospital") ```::: {#fig-SIR-vs-CHG layout-ncol="2"}```{r}#| fig-width: 4#| fig-height: 5.5temp_joined %>%ggplot(aes(fill=SIR)) +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=SIR)) +facet_wrap("Hosp", ncol =2) +scale_fill_binned(low ="#FEE5D9", high="#A50F15") +# RColorBrewer::brewer.pal(5, "Greens")theme_bw() +theme(legend.position ="bottom") +labs(y="SIR",title="SIR") ``````{r}#| fig-width: 4#| fig-height: 5.5temp_joined %>%ggplot(aes(fill=Denominator)) +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=CHG_rate)) +facet_wrap("Hosp", ncol =2) +xlim(c(min(temp_joined$start), max(temp_joined$end)))+scale_y_continuous(labels = scales::percent) +scale_fill_binned(low ="#EDF8E9", high="#006D2C") +# RColorBrewer::brewer.pal(5, "Greens")theme_bw() +theme(legend.position ="bottom") +labs(y="CHG compliance",title="CHG compliance") ```SIR vs CHG:::# Analysis## Pre-postIn an ideal world, we could get CHG data going back to around the time of the roll out (or before), identify the time period where hospitals transitioned, and compare the SIRs before & after the roll out```{r}#| fig-width: 5#| fig-height: 4#| collapse: trueset.seed(4)df <-tibble(y =pnorm(-10:10* .3, sd = .5),noise =rnorm(1:21, sd = .02)) %>%mutate(x =row_number(),y2 = y+noise,y3 = y+.05,y3 =if_else(y3>=1, y2-.05, y3),rate = y3)df %>%ggplot(aes(x=x, y=rate)) +geom_point() +geom_line() +annotate("rect", xmin=0, xmax=8.5,fill="red",ymin=0, ymax=1, alpha=0.3) +annotate("rect", xmin=8.5, xmax=12.5,fill="blue",ymin=0, ymax=1, alpha=0.3) +annotate("rect", xmin=12.5, xmax=21,fill="green",ymin=0, ymax=1, alpha=0.3) +annotate("text", x=4, y=.5, label="Before", hjust="left") +annotate("text", x=17, y=.5, label="After", hjust="right") +annotate("text", x=10.5, y=.9, label="Transition", hjust="inward") +scale_y_continuous(labels = scales::percent) +theme_classic() +labs(x="Time", y="Rate", title="This would be great")rm(df)```If we can't get that information, we could just assume that rates of CHG bathing were low before September of 2023 and split the data into Pre/Post by the date```{r}df <- temp_joined %>%mutate(Cluster =case_when(end <ymd("2023-09-01") ~"Before", start >ymd("2023-09-30") ~"After")) %>%filter(!is.na(Cluster))df %>%ggplot(aes(fill=Cluster)) +geom_rect(aes(xmin=start-days(14), xmax=end+days(14), ymin=0.01, ymax=SIR)) +facet_wrap("Hosp") +theme_bw() +theme(legend.position ="bottom") +labs(y="SIR",title="SIR, before & after") df %>%ggplot(aes(x=Hosp, y=SIR, color=Cluster)) +geom_boxplot() +coord_flip() +theme_bw() +labs(y="SIR", x="",title="SIR, before & after")t.test(SIR ~ Cluster, data=df) ``````{r}# CHG_monthly %>%# mutate(Hospital = case_when(Hospital %in% small_hosp ~ "Small",# TRUE ~ Hospital)) %>%# group_by(Hospital, Month) %>% # summarise(Denominator =sum(Denominator , na.rm = T),# Numerator =sum(Numerator , na.rm = T)) %>%# ggplot(aes(x=Month, y=Numerator/Denominator, color=Hospital)) +# geom_line() +# facet_wrap("Hospital")```## GLMAnother option would be using a generalized linear model to predict the number of infections observed, and using predicted infections (and CHG bathing) as predictors. Since we would be looking at count data, we could do a Poisson regression (or more likely a negative binomial regression)...$$\text{Observed infections} = \beta_0 + \beta_1({predicted\ infections}) + \beta_2(CHG\ rate) + ... + \epsilon$$```{r}#| eval: false# library(DHARMa)mod_nb <- MASS::glm.nb(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined) summary(mod_nb)# Compare to poissonmod_pos <-glm(Inf_obs ~ Inf_pred + CHG_rate, family ='poisson',data = temp_joined) calc_dispersion <-function(mod, verbose=F){# Identify the model family fam_raw <- mod$family$familyif(str_detect(fam_raw, "poisson")) family <-"poisson"if(str_detect(fam_raw, "Negative Binomial")) family <-"NB" E2 <-resid(mod, type ="pearson") N <-nrow(mod$model) p <-length(coef(mod)) if(family=="NB") p <- p +1# '+1' is for variance parameter in NB result <-sum(E2^2) / (N - p)if(verbose) {if(result>1) message("Overdisperion")if(result<1) message("Underdisperion") }return(result)}# DHARMa::testDispersion(mod_nb)calc_dispersion(mod_nb)calc_dispersion(mod_pos)rm(mod_nb, mod_pos, calc_dispersion)```### ZIP & ZINBI haven't highlighted this yet (it will go into descriptive statistics) but our data has a whole lot of zeros, namely for the observed infections::: {#fig-Zero-Inflated layout-ncol="2"}```{r}#| fig-width: 4#| fig-height: 3SIR_full %>%ggplot() +geom_bar(aes(x=Inf_obs)) +labs(x="Observed infections (per month)",y="Count of occurances",title ="By month")``````{r}#| fig-width: 4#| fig-height: 3temp_joined %>%ggplot() +geom_bar(aes(x=Inf_obs)) +labs(x="Observed infections (per chunk)",y="Count of occurances",title ="By chunk")```High number of zeros, highlighted by histograms of the number of observed infections:::I've used **zero-inflated regression** [models before](https://www.hunterratliff1.com/Pedi_PCR_files/IDWeek_2020_Poster.pdf), but I'm not sure if that would apply here. These are used when there are two processes[^8] causing a mixed distribution, but I don't think that is the case here. But leaving this section in just in case[^8]: I think it would apply if we had a situation like: Does the hospital have patients with central lines (the logistical regression portion) and if so, how many CLABSIs did they get (the negative binomial portion)```{r}#| eval: false#| code-fold: showlibrary(pscl) # also see {ZIM} packagepscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | Inf_pred , # these are the predictors for the zero inflated partdist ="poisson", # Use 'negbin' for negative binomialdata = temp_joined) %>%summary()# # Versus# glm(Inf_obs ~ Inf_pred + CHG_rate, data = temp_joined, family = poisson) %>%# summary()``````{r}#| eval: falsex %>%mutate(CHG_rate = Numerator / Denominator,SIR = Inf_obs / Inf_pred) %>%filter(!is.na(CHG_rate)) %>% pscl::zeroinfl(Inf_obs ~ Inf_pred + CHG_rate | Inf_pred + CHG_rate, # these are the predictors for the zero inflated partdist ="negbin", data = .) %>%summary()``````{r}#| eval: falsedf1 <- CHG_clean %>%# Make CHG data monthlyrowwise() %>%mutate(Month =week_to_month(Date)) %>%ungroup() %>%# Recode CHG hospital names to NHSN namesmutate(Hospital =recode(Hospital, "CCM"="CCMC","WVU/UHA"="WVUH","CH"="WVUH","FMT"="WVUH","STF"="TMH")) %>%# Aggregate datagroup_by(Hospital, Month) %>%summarise(Denominator =sum(Denominator , na.rm = T),Numerator =sum(Numerator , na.rm = T),.groups ="drop") # df1 %>% count(Hospital)# SIR_full %>%count(Hosp)df2 <- SIR_full %>%select(Date, Hosp, Inf_pred, Inf_obs, CL_days) %>%left_join(df1, by=c("Date"="Month", "Hosp"="Hospital")) %>%rename(CHG_numerator=Numerator,CHG_denominator=Denominator) %>%mutate(CHG_rate = CHG_numerator / CHG_denominator) %>%mutate(postCHG = Date >ymd("2023-08-01"))library(glmmTMB)glmmTMB(Inf_obs ~ Inf_pred + postCHG + (1|Hosp) ,data = df2, family ="nbinom1") %>%summary()```