Part 0: Data Loading and Setup

Step 1. Load Required Packages

  • Required packages for data cleaning: tidyverse
  • Required packages for presentation: knitr, kableExtra
  • Required packages for data retrieval: vroom

Step 2. Import Run Mapping Information

Due to incompatibilities between the Clinical Classification Software (CCS) groupings in the data from 2016-2017 and the Clinical Classification Software (CCSR), the primary analysis will be completed exclusively with data from 2018-2019.

However, there are a small number (35) of diagnoses that can be directly mapped between the two systems. A sensitivity analysis can be conducted on one of the diagnoses listed in the crosswalk. The workflow can process and ingest the 2016 and 2017 by switching the initial ingest from the currently loaded one to “LU_byYear_2016-2019-AlexMossawir_3.15.26.xlsx”.

2018 Documentation

Dataset covers Inpatient Hospital Discharges that occured in calendar year 2018, statewide coverage, issued at the Discharge level. Product of the Office of Quality and Patient Safety.

2.35M Rows, 33 Columns

Source: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/yjgt-tq93/about_data

2019 Documentation

Dataset covers Inpatient Hospital Discharges that occured in calendar year 2019, statewide coverage, issued at the Discharge level. Product of the Office of Quality and Patient Safety.

2.34M Rows, 33 Columns

Source: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/4ny4-j5zv/about_data

#   Import Project Lookup
#--------------------------------------------------------------------------------------------------
#   This can be used to add additional years or remove years more quickly. This is currently
#   set to 2018-2019, but by switching the excel here to:
#           "LU_byYear_2016-2019-AlexMossawir_3.15.26.xlsx"
#   the full 2016-2019 range can be ingested.
#--------------------------------------------------------------------------------------------------

lu_run <- read_excel("LU_byYear_2018-2019-AlexMossawir_3.15.26.xlsx")
##Print LU for Reference
lu_run %>%
kable(
    caption = "Project Lookup"
  ) 
Project Lookup
run_id lu_year lu_url lu_dict limit
puf_2018 2018 https://health.data.ny.gov/api/views/yjgt-tq93/rows.csv?accessType=DOWNLOAD dict_2018 NA
puf_2019 2019 https://health.data.ny.gov/api/views/4ny4-j5zv/rows.csv?accessType=DOWNLOAD dict_2019 NA

Step 3. Generate Necessary Fields for Downloading CSV

Original version of this workflow queried the OData API, but I was able to confirm that even with pagination through a few different approaches, this data source has a hard cap at 500k rows per query. CSV will be downloaded and cleaned for some inconsistencies in unicode and column headings.

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data, to avoid re-downloading every time this
#     gets knit.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run$cache <- file.path(paste0("data/raw/",lu_run$run_id,".rds"))

## File Timestamp saved to
lu_run$cache_ts <- file.path(paste0("data/raw/",lu_run$run_id,"_ts.txt"))

Step 3. Import Data

Helper Function - import_fetch or load

This function checks if the document specified in the run mapping is currently cached within the project folder. If it is, it loads the data into the LU directly. If it is not, it pulls the number of rows specified in the document (or all) from the url in the lookup, caches the new file, and loads into the LU.

#   Fetch Or Load Function
#--------------------------------------------------------------------------------------------------
#   The function will first check for cached data at the specified location. Then, if it exists,
#   it'll load the cached data. If it doesn't exist, it'll use request generated previously.
#--------------------------------------------------------------------------------------------------
#   This function will be called on an entire row of the LU at a time - The following columns will
#   always need to be in the LU for this to work:
#       - cache - relative file path + name of .rds for each dataset imported
#       - cache_ts - relative file path + name of .txt file with time that the url was queried
#       - lu_url - odata URL where data is being pulled from
#       - request - httr2 request to get data from url
#   The following columns will be generated within the LU by this process:
#       - data - a full dataframe of the data for use
#       - data_ts - timestamp of when the data was first queried
#
#--------------------------------------------------------------------------------------------------

import_fetchorload <- function(LU){ 
  
  ## Check if the cached data exists 
if(file.exists(LU$cache[[1]])) {
  ## If so - print the following information:
  print(
  paste0(
    "Loading data from ", LU$cache[[1]],
    " (saved at ", read_lines(LU$cache_ts[[1]])[[1]], ")",
    " - saving to df."
        )
      )
  ## Read in data from cached location and ts
  LU$data [[1]]<- readRDS(LU$cache[[1]])
  LU$data_timestamp <- read_lines(LU$cache_ts[[1]])
  return(LU)
  
} else {
  ##Print the message
    print(
      paste0(
        "No data saved at ", as.character(LU$cache[[1]]), 
        "- loading from url: ",as.character(LU$lu_url[[1]])
          )
        )
  if (is.na(LU$limit[[1]])==TRUE) {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]])))
  } else {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]]), n_max = LU$limit[[1]]))}
  ts <- strftime(Sys.time())
  cat(ts, file = LU$cache_ts[[1]])
  saveRDS(temp,LU$cache[[1]])
  LU$data_timestamp <- ts
  LU$data[[1]] <- temp
  return(LU)
}}

Loop through LU to import data

#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#   Final output will be a series of messages in console indicating whether it needed to query 
#   the URL or  whether it was able to just pull from the DF directly. Data frames will be stored
#   Within the LU table that we've been operating on.
#--------------------------------------------------------------------------------------------------

lu_run <- rowwise(lu_run) %>% 
  mutate(import_fetchorload(cur_data()))
## [1] "Loading data from data/raw/puf_2018.rds (saved at 2026-03-14 15:48:06) - saving to df."
## [1] "Loading data from data/raw/puf_2019.rds (saved at 2026-03-14 15:51:40) - saving to df."

2018

Hospital Service Area Hospital County Operating Certificate Number Permanent Facility Id Facility Name Age Group Zip Code - 3 digits Gender Race Ethnicity Length of Stay Type of Admission Patient Disposition Discharge Year CCSR Diagnosis Code CCSR Diagnosis Description CCSR Procedure Code CCSR Procedure Description APR DRG Code APR DRG Description APR MDC Code APR MDC Description APR Severity of Illness Code APR Severity of Illness Description APR Risk of Mortality APR Medical Surgical Description Payment Typology 1 Payment Typology 2 Payment Typology 3 Birth Weight Emergency Department Indicator Total Charges Total Costs
Capital/Adirond Columbia 1001000 000146 Columbia Memorial Hospital 50 to 69 125 F White Not Span/Hispanic 2 Elective Short-term Hospital 2018 CIR020 CEREBRAL INFARCTION NA NA 045 CVA & PRECEREBRAL OCCLUSION W INFARCT 01 DISEASES AND DISORDERS OF THE NERVOUS SYSTEM 3 Major Major Medical Managed Care, Unspecified Self-Pay NA NA TRUE 17591.41 3570.06
New York City Manhattan 7002009 001445 Harlem Hospital Center 50 to 69 100 F Black/African American Not Span/Hispanic 16 Emergency Short-term Hospital 2018 RSP017 POSTPROCEDURAL OR POSTOPERATIVE RESPIRATORY SYSTEM COMPLICATION ESA003 MECHANICAL VENTILATION 121 OTHER RESPIRATORY & CHEST PROCEDURES 04 DISEASES AND DISORDERS OF THE RESPIRATORY SYSTEM 4 Extreme Extreme Surgical Medicare Medicaid NA NA TRUE 106190.21 78913.24
New York City Manhattan 7002009 001445 Harlem Hospital Center 50 to 69 100 F Black/African American Not Span/Hispanic 4 Emergency Home or Self Care 2018 MBD003 BIPOLAR AND RELATED DISORDERS MHT001 PHARMACOTHERAPY FOR MENTAL HEALTH (EXCLUDING SUBSTANCE USE) 753 BIPOLAR DISORDERS 19 MENTAL DISEASES AND DISORDERS 2 Moderate Minor Medical Medicaid NA NA NA FALSE 5600.00 4161.53

2019

Hospital Service Area Hospital County Operating Certificate Number Permanent Facility Id Facility Name Age Group Zip Code - 3 digits Gender Race Ethnicity Length of Stay Type of Admission Patient Disposition Discharge Year CCSR Diagnosis Code CCSR Diagnosis Description CCSR Procedure Code CCSR Procedure Description APR DRG Code APR DRG Description APR MDC Code APR MDC Description APR Severity of Illness Code APR Severity of Illness Description APR Risk of Mortality APR Medical Surgical Description Payment Typology 1 Payment Typology 2 Payment Typology 3 Birth Weight Emergency Department Indicator Total Charges Total Costs
Hudson Valley Westchester 5902001 001045 White Plains Hospital Center 30 to 49 106 M Other Race Not Span/Hispanic 2 Emergency Home or Self Care 2019 INF008 VIRAL INFECTION NA NA 723 VIRAL ILLNESS 18 INFECTIOUS AND PARASITIC DISEASES (SYSTEMIC OR UNSPECIFIED SITES) 2 Moderate Minor Medical Private Health Insurance NA NA NA TRUE 26507 4773.11
Hudson Valley Westchester 5902001 001045 White Plains Hospital Center 70 or Older 105 F Black/African American Not Span/Hispanic 3 Emergency Home or Self Care 2019 GEN002 ACUTE AND UNSPECIFIED RENAL FAILURE NA NA 469 ACUTE KIDNEY INJURY 11 DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT 2 Moderate Moderate Medical Medicare NA NA NA TRUE 20693 5631.30
Hudson Valley Westchester 5902001 001045 White Plains Hospital Center 50 to 69 105 F Other Race Not Span/Hispanic 7 Elective Home or Self Care 2019 GEN006 OTHER SPECIFIED AND UNSPECIFIED DISEASES OF KIDNEY AND URETERS GIS005 EXPLORATION OF PERITONEAL CAVITY 443 KIDNEY & URINARY TRACT PROCEDURES FOR NONMALIGNANCY 11 DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT 4 Extreme Extreme Surgical Medicare Blue Cross/Blue Shield NA NA FALSE 138252 29848.69

Step 4. Import all Dictionaries

Dictionaries for 2016-2017 have been created for future analysis and are saved in the same folder, but the workflow will only read in dictionaries for current runs.

2018 Dictionary

dict_2018 is sourced from 2018 tab in Excel file shared in repository.

This was manually extracted from: https://health.data.ny.gov/api/views/yjgt-tq93/files/9667d39a-be57-4df1-9ce4-537bce120215?download=true&filename=NYSDOH_SPARCS_De-Identified_Data_Dictionary_2018.pdf

Additional API field names have been included in Data_Col

2018 Dictionary

dict_2018 is sourced from 2018 tab in Excel file shared in repository.

This was manually extracted from: https://health.data.ny.gov/api/views/yjgt-tq93/files/9667d39a-be57-4df1-9ce4-537bce120215?download=true&filename=NYSDOH_SPARCS_De-Identified_Data_Dictionary_2018.pdf

Additional API field names have been included in Data_Col

#   Import dictionaries
#--------------------------------------------------------------------------------------------------
#   This can be used to add additional years or remove years more quickly. building this process 
#   on a subset of 20 rows from 2016 and 2017, and then revising to add in the full year range 
#   and the full datasets once workflows are clean, so it can be batched.
#   
#   This is designed so that each dictionary is sourced from a different tab in an excel document.
#   
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx"
#--------------------------------------------------------------------------------------------------
 
##pull years out of the run lu to loop over

year <- as.character(lu_run$lu_year)
##create a dictionary list of sheets read from excel, looping over years
dicts <- lapply(year, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)

})
## 
## Loading dictionary from:DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx
## Read sheet: 2018
## Loading dictionary from:DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx
## Read sheet: 2019
## set the names of the objects using years
dicts <- setNames(dicts, paste0("dict_",year))

Part 0-1: Data Structure Merging

Column names and data types are slightly different across years - they are standardized across datasets.

Step 1. Column Rename and Type Classification

Helper Function - standardize_columns

This function uses the appropriate dictionary for the ingested dataset (ie. 2018 data will get dict_2018) as specified in the lookup. This dataset will rename columns to the standard names across variables, and ensure types are aligned.

#   Standardize Columns Function
#-------------------------------------------------------------------------------------------------
#   This is a function that standardizes the columns based on the provided dictionaries -
#   If this fails, it means that there is something that does not match in the dictionary.
#   
#   This is by design - I want it to be able to account for every single row, if something is
#   mapped incorrectly in the dataset, then it neeeds a new dictionary.
#
#   For example, while making this, I copy/pasted the rows from a print statement, but I
#   dropped one of the underscores on "__id" for the 2017, but not for the 2016 data. 
#
#--------------------------------------------------------------------------------------------------

standardize_columns <- function(run_lu, dictionaries){ 
    # Grab dictionary name out of the lookup
##Call working versions of the data and dictionary objects from the lists that contain them
    w_data <- run_lu$data[[1]]
# Grab dictionary from dicts
    w_dict <- dictionaries[[ run_lu$lu_dict[[1]] ]]  
    
## Named vector of old names using
    w_names <- setNames(w_dict$api_field, w_dict$final_name)

## Fill a_data column in the run LU with the new, renamed dataset
    temp <- w_data %>%
  ## Use the named vector
 
          rename(!!!w_names) %>%
          mutate(
            across(
                all_of(w_dict$final_name[w_dict$final_type == "character"]),
                as.character
                   ),
            across(
                all_of(w_dict$final_name[w_dict$final_type == "numeric"]),
                as.numeric
                  )
                )
        run_lu$a_data[[1]] <- temp
    
    return(run_lu)
}

Loop through datasets to standardize

#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#   If this fails, check the console for issues with the LUs not matching.
#--------------------------------------------------------------------------------------------------    
lu_run <- rowwise(lu_run) %>% 
  mutate(standardize_columns(cur_data(),dicts))

Data Outputs

2018

v_hsa v_county operating_certificate_number v_facilityid facility_name v_age v_patientzip v_gender v_race v_ethnicity v_los v_admissiontype v_disposition v_year v_dx_ccsr dxccsr_description v_pr_ccsr prccsr_description v_apr_drg apr_drg_description v_mdc v_mdc_description v_apr_soi apr_soi_description v_apr_rom v_apr_type v_pay_1 v_pay_2 v_pay_3 birth_weight flag_ed total_charges total_costs
Capital/Adirond Columbia 1001000 000146 Columbia Memorial Hospital 50 to 69 125 F White Not Span/Hispanic 2 Elective Short-term Hospital 2018 CIR020 CEREBRAL INFARCTION NA NA 045 CVA & PRECEREBRAL OCCLUSION W INFARCT 01 DISEASES AND DISORDERS OF THE NERVOUS SYSTEM 3 Major Major Medical Managed Care, Unspecified Self-Pay NA NA TRUE 17591.41 3570.06
New York City Manhattan 7002009 001445 Harlem Hospital Center 50 to 69 100 F Black/African American Not Span/Hispanic 16 Emergency Short-term Hospital 2018 RSP017 POSTPROCEDURAL OR POSTOPERATIVE RESPIRATORY SYSTEM COMPLICATION ESA003 MECHANICAL VENTILATION 121 OTHER RESPIRATORY & CHEST PROCEDURES 04 DISEASES AND DISORDERS OF THE RESPIRATORY SYSTEM 4 Extreme Extreme Surgical Medicare Medicaid NA NA TRUE 106190.21 78913.24
New York City Manhattan 7002009 001445 Harlem Hospital Center 50 to 69 100 F Black/African American Not Span/Hispanic 4 Emergency Home or Self Care 2018 MBD003 BIPOLAR AND RELATED DISORDERS MHT001 PHARMACOTHERAPY FOR MENTAL HEALTH (EXCLUDING SUBSTANCE USE) 753 BIPOLAR DISORDERS 19 MENTAL DISEASES AND DISORDERS 2 Moderate Minor Medical Medicaid NA NA NA FALSE 5600.00 4161.53

2019

v_hsa v_county operating_certificate_number v_facilityid facility_name v_age v_patientzip v_gender v_race v_ethnicity v_los v_admissiontype v_disposition v_year v_dx_ccsr dxccsr_description v_pr_ccsr prccsr_description v_apr_drg apr_drg_description v_mdc v_mdc_description v_apr_soi apr_soi_description v_apr_rom v_apr_type v_pay_1 v_pay_2 v_pay_3 birth_weight flag_ed total_charges total_costs
Hudson Valley Westchester 5902001 001045 White Plains Hospital Center 30 to 49 106 M Other Race Not Span/Hispanic 2 Emergency Home or Self Care 2019 INF008 VIRAL INFECTION NA NA 723 VIRAL ILLNESS 18 INFECTIOUS AND PARASITIC DISEASES (SYSTEMIC OR UNSPECIFIED SITES) 2 Moderate Minor Medical Private Health Insurance NA NA NA TRUE 26507 4773.11
Hudson Valley Westchester 5902001 001045 White Plains Hospital Center 70 or Older 105 F Black/African American Not Span/Hispanic 3 Emergency Home or Self Care 2019 GEN002 ACUTE AND UNSPECIFIED RENAL FAILURE NA NA 469 ACUTE KIDNEY INJURY 11 DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT 2 Moderate Moderate Medical Medicare NA NA NA TRUE 20693 5631.30
Hudson Valley Westchester 5902001 001045 White Plains Hospital Center 50 to 69 105 F Other Race Not Span/Hispanic 7 Elective Home or Self Care 2019 GEN006 OTHER SPECIFIED AND UNSPECIFIED DISEASES OF KIDNEY AND URETERS GIS005 EXPLORATION OF PERITONEAL CAVITY 443 KIDNEY & URINARY TRACT PROCEDURES FOR NONMALIGNANCY 11 DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT 4 Extreme Extreme Surgical Medicare Blue Cross/Blue Shield NA NA FALSE 138252 29848.69

Part 1: Data Subsetting on Exclusion Criteria

Step 1. Data Quality Rule Import

Based on the contents of the data dictionaries and the key variables being used for the analysis (permanent facility id), I have created a rule dictionary with instructions to be placed into the rule engine.

This first pass is to run through the basic data quality filters - not study design filters. Any observations that cannot be used for this analysis (e.g. enhanced de-identification redacting facility information, missing key case-mix identifier like age group) will be filtered through this stage. The engine will count the number of rows in each run’s dataset and log it, as it passes the data onto the next rule.

All rule dictionaries will have the same structure, so they can be fed through the same update/logging engine downstream, this pass uses the lu filters rule dictionary.

Dictionaries have the following structure:

  • stage: This indicates which rules should be applied - I am likely keeping these dictionaries in separate tabs, but I wanted to make this filter in case I stack it later.
  • rule_id: Unique identifier for the rules for QA/querying
  • rule_name: Unique legible identifier for QA
  • rule condition: the logic that will be used to identify which rows the rules are applied to.
  • rule_outcome: For filtering, all rules apply exclude, but downstream and for sensitivity analysis, the rule engine may be used for partitioning the dataset or other outcomes - this will be used to determine that process.
  • applies: List of which runs to apply the rule to - for example, the 2016 and 2017 datasets have the abortion_edit_indicator for enhanced de-identification, while the 2018 and 2019 datasets do not have this indicator, and identifying redactions must be done by searching for “Redacted for Confidentiality” within facility_name.
  • order: Rules are applied sequentially based on this column -logged_as This is the clean string used to flag what the transformation was for the log table, to be formatted for final tables and outputs.
#   Import LU
#--------------------------------------------------------------------------------------------------
#   Set the stage here with the variable stage - this step is currently the data_quality pass, so
#   it's going to pull in the rules flagged with "data_quality" - in this case, from tab name.
#   Store it to data quality, again. If we want to import multiple LUs  - add tab names to 
#   the list.
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "LU_AppliedFilters_2016-2019-AlexMossawir_3.7.26.xlsx"
stage <- c("data_quality")
#--------------------------------------------------------------------------------------------------
 
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)
})
## 
## Loading dictionary from:LU_AppliedFilters_2016-2019-AlexMossawir_3.7.26.xlsx
## Read sheet: data_quality
temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))

dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts)

##Print LU for Reference
dicts$lu_data_quality %>%
kable(
    caption = "Rules Lookup for Filtering"
  ) 
Rules Lookup for Filtering
stage rule_id rule_name rule_condition rule_outcome output_prefix applies order logged_as
data_quality 1 exclude_abortions flag_abortion == “Y” exclude a puf_2016, puf_2017 2 Enhanced De-ID
data_quality 2 exclude_deid facility_name == “Redacted for Confidentiality” exclude a puf_2018, puf_2019 3 Enhanced De-ID
data_quality 3 missing_facility is.na(v_facilityid) exclude a all 1 Missing Facility ID
data_quality 4 missing_age is.na(v_age) exclude a all 4 Missing Age Group
data_quality 5 newborn v_admissiontype == “Newborn” exclude a all 5 Excluding Newborns
data_quality 6 medical_only !(v_apr_type == “Medical”) exclude a all 6 Include Only Medical Admissions

Step 2. Data Quality Rules Pass

The structure of the engine will first loop through over run_id using lu_run, this is the exterior loop. For each run, the engine will first determine which rules apply applies.

Then it will loop over the rules in an interior loop. For each rule, it will log iniitial nrow, and save to audit log. Then it will either apply the rule filtering or skip over it. It will output final nrows to the audit log, and pass the revised data to the next step of the interior loop. It proceeds sequentially over the rows, so working data will be updated once the rule has been executed and logged.

The inner loop passes the final audit lists and the final w_data back to the outer loop. The outer loop saves the audit log and the data to columns in the lookup.

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data - this is write-only, just to keep records.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run[[paste0("cache_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,".rds"))
print(lu_run[[paste0("cache_", stage)]])
## [1] "data/data_quality/puf_2018.rds" "data/data_quality/puf_2019.rds"
## File Timestamp saved to
lu_run[[paste0("cache_ts_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,"_ts.txt"))
print(lu_run[[paste0("cache_ts_", stage)]])
## [1] "data/data_quality/puf_2018_ts.txt" "data/data_quality/puf_2019_ts.txt"

Helper Functions - Filtering and Exclusion Rules

These functions are hidden by default - press show to see the code!

flag_rule_application

This function is called at a run_id level within the outer loop. Once the working lookup is selected, it loops over the rule list to determine if the rule is applicable to this particular run. i.e: a rule with the applies text “puf_2016, puf_2017” will be split into a vector and evaluated - it will only be applied when the run_id is in that vector.

#Function to Determine Rule Applicability - Within Outer Loop
#--------------------------------------------------------------------------------------------------
#   Within outer loop, looking at the current run_id that is being passed, 
#    and determine which rules should be applied.
#--------------------------------------------------------------------------------------------------

flag_rule_application <- function(rule, run_id) {
  applies_vec <- rule$applies_rule
  if ("all" %in% applies_vec) { TRUE } else if (run_id %in% applies_vec) { TRUE } else { FALSE }
}

apply_exclude

This function is called at a rule level within the inner loop. Because the lookup stores the logic for what columns should be excluded as a string, it needs to be unquoted to be properly evaluated as logic. This applies that logic with a filter function from dplyr in order to exclude anything that fits that criteria.

#Function Within Inner Loop - Actually Exclude
#--------------------------------------------------------------------------------------------------
#   Applies filter on the dataset using rule expression - called within the inner loop
#--------------------------------------------------------------------------------------------------

##function to if it excludes, does it
apply_exclude <- function(w_data, rule_expr) {
  w_data %>% filter(!(!!rule_expr))
}

apply_rules

This function is called on each specific rule list within the inner loop. It takes a single rule from the inner loop and applies it to the w_data working data set. It also generates a list of values to output to an audit table so all transformations (or skipped rules) can be tracked. This audit row lists the initial number of rows, and the final number of rows after transformation, and reports the number of observations affected by each rule.

This is the core of the inner loop. It returns transformed w_data and a row of the audit table.

#--------------------------------------------------------------------------------------------------
#   Function to Apply Rules - Within Inner Loop
#--------------------------------------------------------------------------------------------------
#   This takes the working dictionary and working data from the interior loop, and continues,
#   passing run_id from the outer loop. It always references [[1]] because it will be passed
#   rowwise. Outputs the updated dataset and a new list, audit_log.
#--------------------------------------------------------------------------------------------------
apply_rule <- function(rule, w_data, run_id) {
##Prelim Information  
  run_flag <- rule$flag_rule_applies
#Generate Audit Row
audit_row <- list(
  stage         = rule$stage,
  run_id        = run_id,
  rule_id       = rule$rule_id,
  rule_name     = rule$rule_name,
  rule_condition= rule$rule_condition,
  rule_outcome  = rule$rule_outcome,
  n_initial     = nrow(w_data),
  logged_as     = rule$logged_as,
  skipped       = !run_flag,
  applied       = run_flag
)
##     cat ( paste0("Run Id: ", string(run_id),   " Rule Name ",rule$rule_name), " Run flag: ", run_flag, " Class W_Data: ",  class(w_data))
## Apply dplyr rules based on what rules text is - this can be updated when I add others. But.
if (run_flag == FALSE) {
  # do nothing
} else if (rule$rule_outcome == "exclude") {
  rule_expr <- parse_expr(rule$rule_condition)
  w_data <- apply_exclude(w_data, rule_expr)
}
audit_row$n_final  <- nrow(w_data)
audit_row$n_change <- audit_row$n_initial - audit_row$n_final

audit_row$timestamp <- Sys.time()
  
   return(list(w_data = w_data, 
              audit_row = audit_row))
         
} 

outer_pass

This function is the core of the outer loop. It is passed a single row from lu_run, and uses that information to call the relevant objects and manage the inner loop.

While it only takes a row from lu_run as a direct argument, it references the following global variables to preserve clarity and naming:

  • stage: This is used to specify throughout that we are on the data quality filtering step. This allows future steps to be built off the same engine.
  • stage_prefix: This is used to specify the output column name’s prefix - in this case, it is ‘a2’
  • prev_col: This is used to specify the source column name - in this case it is ‘a_data’.
#Function Within Outer Loop - Manage Inner Loop
#--------------------------------------------------------------------------------------------------
#   This is called by the outer loop and is used to manage the inner loop. It passes both
#   transformed data and the audit log back.
#--------------------------------------------------------------------------------------------------

outer_pass <- function(lu_row) {
      
  ##Get that run_id, data, dict to keep passing through  
    run_id <- lu_row$run_id
  # the whole reason this loop wasn't working is because this a_data is apparently. for some reason. a list length 1
    ## with a tibble at that first place. 
    w_data <- lu_row[[prev_col]][[1]]
    w_dict <- dicts[[paste0("lu_",stage)]]
    
    ## Convert rules from a dataframe into list of lists
    rules <- apply(w_dict, 1, as.list)

    for (j in seq_along(rules)) {
      rule <- rules[[j]]
      ##Applies is currently a vector hand-written in excel
        ##First remove all whitespace, then split out what remains
              ## It's gonna put the list into a list of length one with a vector in it.
              #you literally just want the vector.
      rule$applies_rule <- strsplit(gsub("\\s", "",rule$applies), ",")[[1]]
      
      ##We're looping through all the rules rn
      rule$flag_rule_applies <- flag_rule_application(rule, run_id)
      ## and saving the new items back into rules   
       rules[[j]] <- rule
          }

  ## Create an empty list object so that we can store the audit as we go.
    audit <- list()
  ##i did this just counting first but seq_along actually passes info along each
    for (j in seq_along(rules)) {
    ##pull out list j from dictionary
      rule <- rules[[j]]
##Call function apply_rule
##    Apply rule input:
##        rule = one single row (but as list) of dictionary
##        w_data = working dataset
##        run_id = the same for everything in this inner loop
##    Apply rule output:
##        a list of: w_data, audit_row    
      result <- apply_rule(rule, w_data, run_id)
      ## replace w_data with the new w_data saved in result returned list
      w_data <- result$w_data
      ##store 
      audit[[j]] <- result$audit_row
  
      
      }
    list(
    data = w_data,
    audit = audit
  )

}

Execute Function Calls to Clean Data

The functions above will refer to global stage, prev_col, stage_prefix variables. This is to allow for flexibility of the pipeline. prev_col indicates where the data for this stage should be sourced from, and stage_prefix indicates what the column name of the destination should be. I will revise these away from using the global variable approach when I have more time.

#--------------------------------------------------------------------------------------------------
## Stage Variable
stage <- "data_quality"
stage_prefix <- "a2_"

## Column Data Source
prev_col <- "a_data"

#--------------------------------------------------------------------------------------------------

newcolname <- paste0(stage_prefix, "data")

#--------------------------------------------------------------------------------------------------
results <- vector("list", nrow(lu_run))

#--------------------------------------------------------------------------------------------------

for (i in seq_len(nrow(lu_run))) {
  ##Actually run the nested loops here, saving to a new vector
  results[[i]] <- outer_pass(lu_run[i, ])
  
  ## Save the outputted day in the specified new column
  lu_run[[newcolname]][[i]] <- results[[i]]$data
  ##save audit log to a table
  lu_run$audit[[i]] <- bind_rows(results[[i]]$audit)
  ## Cache this
  saveRDS(results[[i]]$data,lu_run$cache_data_quality[[i]])
  ##Make a timestamp
  ts <- strftime(Sys.time())
  
  ##save the timestamp
  cat (ts, file = lu_run$cache_ts_data_quality[[i]])
}
rm(results)

Step 3. Analytical Sample

The workflow has filtered the data down to the analytical sample and outputted the results of each step of the process into a complete audit table.

The complete table with n at each stage is below, and a summary table of initial observations, exclusion volume, and final analytical sample size, by year, can be found below.

Complete Audit Summary Below

Click to view Complete Table of Sample Exclusions
#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
audit_table <- lu_run$audit %>% 
  bind_rows() %>%
  arrange(run_id, rule_id) %>%
  mutate(
    logged_as = factor(logged_as, levels = unique(logged_as))
  ) %>%
  group_by(run_id, logged_as) %>%
  summarise(
    n_initial = max(n_initial),
    Excluded = sum(n_change),
    n_final = min(n_final),
  )

audit_table <- audit_table %>%
  mutate(
year = factor(run_id, 
                 levels = c("puf_2016","puf_2017","puf_2018","puf_2019"),
                 labels = c("2016", 
                            "2017",
                            "2018",
                            "2019")
))

audit_table %>% kable()
run_id logged_as n_initial Excluded n_final year
puf_2018 Enhanced De-ID 2352807 8804 2344003 2018
puf_2018 Missing Facility ID 2344003 0 2344003 2018
puf_2018 Missing Age Group 2344003 0 2344003 2018
puf_2018 Excluding Newborns 2344003 212809 2131194 2018
puf_2018 Include Only Medical Admissions 2131194 562563 1568631 2018
puf_2019 Enhanced De-ID 2339462 9071 2330391 2019
puf_2019 Missing Facility ID 2330391 0 2330391 2019
puf_2019 Missing Age Group 2330391 0 2330391 2019
puf_2019 Excluding Newborns 2330391 210594 2119797 2019
puf_2019 Include Only Medical Admissions 2119797 563070 1556727 2019

Summary Table

Analytical Sample Size
2018 2019
Initial Sample Size 2,352,807 2,339,462
Enhanced De-ID 8,804 9,071
Excluding Newborns 212,809 210,594
Include Only Medical Admissions 562,563 563,070
Final Analytical Sample 1,568,631 1,556,727

Part 2: Variable Selection, Recoding, and Data Cleaning

Step 1. Outcome Variable

Name the variable and describe its scale (continuous, binary, etc.). Report the distribution: mean and SD for continuous outcomes, or frequency and % for binary outcomes. Note any recoding from the raw variable (e.g., converting BMD from NHANES units, collapsing categories).

The primary outcome variable for this analysis is a binary flag indicating whether a hospital discharge is assigned to a selected CCSR Diagnosis Group (1- dischage falls into the category).

Per the data dictionaries, SPARCS data from 2016-2017 reflects the Clinical Classification Software (CCS) codes and descriptions. 2018-2019 reflect Clinical Classification Software - Refined (CCSR) codes and descriptions. The year version of each category has not been specified in the data dictionaries provided.

The CCS and CCSR categories have largely been found to be non-translateable, as CCS beta was built off of ICD-9, while CCSR is using ICD-10. For this reason, the analysis will be restricted to 2018-2019 data, and the additional data points from 2016-2017 may be incorporated for sensitivity analysis.

Note: There are 35 codes with a 1-1 match between CCS and CCSR, per a comparison published by ARQ, so these specific codes can be recoded and highlighted in the data as candidates for a sensitivity analysis including 2016-2017 data.

https://hcup-us.ahrq.gov/toolssoftware/ccsr/DXCCSR-vs-Beta-CCS-Comparison.pdf

count_ccsr <- bind_rows(lu_run$a2_data) %>% 
  select(
  c("v_dx_ccsr","dxccsr_description")
) %>%
  unique() %>% nrow()

There are 472 total diagnosis codes present in the datasets, across 22 body system categories sourced from the ICD-10. This is not a complete match to the MDC body system grouping, which reflects 25 total categories relevant to hospital resource use, though the two have overlap and both are present in this dataset.

top_10 <- bind_rows(lu_run$a2_data) %>% 
  select(
  c("v_dx_ccsr","dxccsr_description")
        )%>%
  group_by(v_dx_ccsr,dxccsr_description) %>%
    summarize(
    count_obs = n()
  )  %>% arrange(desc(count_obs)) 


head(top_10, n=10) %>% kable(
            col.names = c("CCSR Code", "CCSR Description", "n"),
       caption= "Highest Volume CCSR Codes",
  format.args = list(big.mark = ","))
Highest Volume CCSR Codes
CCSR Code CCSR Description n
INF002 SEPTICEMIA 235,185
CIR019 HEART FAILURE 126,452
MBD017 ALCOHOL-RELATED DISORDERS 99,747
MBD001 SCHIZOPHRENIA SPECTRUM AND OTHER PSYCHOTIC DISORDERS 83,942
RSP002 PNEUMONIA (EXCEPT THAT CAUSED BY TUBERCULOSIS) 82,004
RSP008 CHRONIC OBSTRUCTIVE PULMONARY DISEASE AND BRONCHIECTASIS 71,042
END003 DIABETES MELLITUS WITH COMPLICATION 63,873
GEN004 URINARY TRACT INFECTIONS 61,030
SKN001 SKIN AND SUBCUTANEOUS TISSUE INFECTIONS 60,424
PRG023 COMPLICATIONS SPECIFIED DURING CHILDBIRTH 58,836
ccsr_list <- c("INF002", "CIR019")

This initial analysis will report the initial data and exploratory data analysis will be conducted on INF002 and CIR019 selected because of the high volume of observations.

The final analysis will be conducted with a list of 3-5 categories, shortlisted for total volume and distribution across hospitals. The list will be refined with a literature review.

Outcome Variable: Recode as Binary Variables

#   Import Dictionary
#--------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "Dictionary_Recoding_AlexMossawir_3-15-26.xlsx"
stage <- c("recode")
#--------------------------------------------------------------------------------------------------
 
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)
})
## 
## Loading dictionary from:Dictionary_Recoding_AlexMossawir_3-15-26.xlsx
## Read sheet: recode
temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))

dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts)

Helper Function - expand_variables

This function takes arguments for the lookup to source data from, the column name it is writing to, and the column name it is sourcing from. It outputs just lu_run.

expand_variables_row <- function(row, prev_col) {
  # Extract the nested data for this row
  w_data <- row[[prev_col]][[1]]
##stupid long way to only get the lookup lines for this stage
## this is uh. why I decided not to stack dictionaries. and then I did anyway.
  w_dict <- dicts[["lu_recode"]][dicts[["lu_recode"]]$stage == "expand", ]
  
  # Loop through columns to expand
  for (i in seq_len(nrow(w_dict))) {
    colsearch <- w_dict[[i, "col"]]
    strsearch <- w_dict[[i, "values"]]
    col_output <- paste0(w_dict[i, "new_col"], "_", w_dict[i, "values"])
    
    w_data <- w_data %>%
      mutate(!!col_output := as.integer(!!sym(colsearch) == strsearch))
  }
  
  # Return as list to store in list-column
  list(w_data)
}


#-------------------------------------------------------------------------------

Loop by run through the one-hot expansion code

lu_run <- lu_run %>%
  rowwise() %>%
  mutate(
    a2_data = expand_variables_row(cur_data(), prev_col = "a2_data")
  ) %>%
  ungroup()
temp <- bind_rows(lu_run$a2_data)
outcomes <- grep("^dx_", colnames(temp), value = TRUE)

year <- temp %>%
  select(c("v_year", outcomes)) %>%
  group_by(v_year) %>%
  summarize(
      Total = n(),
      across(all_of(outcomes), ~ sum(as.numeric(.x), na.rm = TRUE)),
    round((across(all_of(outcomes), ~ sum(as.numeric(.x), na.rm = TRUE) / n()*100, .names = "{.col}_pct")),2)
  )

year%>% 
  kable(
  col.names = c("Year", "Total Observations", "CIR019 n", "INF002 n", "CIR019 %", "INF002 %"),
  caption= "Highest Volume CCSR Codes - Frequency by Year",
  format.args = list(big.mark = ","))
Highest Volume CCSR Codes - Frequency by Year
Year Total Observations CIR019 n INF002 n CIR019 % INF002 %
2018 1,568,631 62,675 114,408 4.0 7.29
2019 1,556,727 63,777 120,777 4.1 7.76

Step 2. Primary Exposure Variable

Primary exposure will be the hospital-level identifier, permanent_facility_id in the dataset, representing facility-level coding and documentation patterns.

Facility IDs were previously set to be the character data type in the data standardization pass. Because facility ID is categorical, but there are 208 unique facilities, I will not recode this variable.

Step 3. Key Covariates

Key covariates will be selected to proxy patient case-mix:

  • Discharge Year (categorical, discharge_year)
  • Patient age group (categorical, age_group)
  • Database-encoded gender (categorical, gender)
  • Primary payor type (categorical, payment_typology_1)
  • APR-DRG Severity of Illness Code (categorical, apr_severity_of_illness_code)

While Discharge Year and Patient Age Group are ordinal categorical, I am beginning my analysis by converting all to factors and handling as if they are nominative categorical. As I continue to run my analysis, I will revisit these.

Recode and Manage Variables

Based on a review of the data, of the key covariates, the only one that needs recoding is Primary Payor Type.

I believe that Payment Typology is going to be a primary driver of variation, and further analysis will be best suited by a full bit-mapping of of the combination of the three primary payment typologies. A patient who is under-insured cannot be distinguished from a patient who is fully insured without complete integration of all three variables. At this point, however, the Primary Payment Typology will be classed into Private, Public, Self-Pay, and Other.

All other variables will be converted to labeled factors without recoding, to ensure they are handled correctly for the logistic regression model.

The codes for the All Patients Refined Major Diagnostic Categories (APR MDC) have remained standard throughout the 2016-2019 period, but the text of the APR-MDC description column has shown variability. To make this column legible, all APR MDC codes have been mapped to the following document: https://hcup-us.ahrq.gov/db/nation/nis/APR-DRGsV20MethodologyOverviewandBibliography.pdf

Helper Function

recode_variable

This function takes lu_run and a current column name, looks it up in the recoding LU and converts it to the values, levels, and labels that are specifies.

rm(temp)

recode_variable <- function(lu_run, col_input) {
### Always pull working data + dict  
w_dict <- dicts[["lu_recode"]]
w_data <- lu_run[[prev_col]][[1]]

##Filter to the column in question
w_dict <- w_dict %>% filter(col==col_input)

## Needed for recoding
col_output <- w_dict$new_col[[1]]
w_values <- w_dict$values
w_new_values <- w_dict$new_values
recode_vec <- w_new_values %>% setNames(w_values)

##Recode
w_data[[col_output]] <- recode(
  w_data[[col_input]], 
  !!!recode_vec)


##Needed for converting to factors
level_vec <- unique(w_dict$level)
label_vec <- unique(w_dict$label)

##Factors
w_data <- w_data %>%
mutate(
  !!col_output :=   factor(!!sym(col_output),
                 levels = level_vec,
                 labels = label_vec
                ))

lu_run[[newdataname]][[1]]  <- w_data


return(lu_run)
}

Recode

Set the global recoding variables (data source, data destination), source the list of columns that will be recoded from the dictionary ingested above.

stage_prefix <- "a2_"
prev_col <- "a2_data"
newdataname <- paste0(stage_prefix, "data")
stage <- "recode"

columns <- unique(dicts[["lu_recode"]]$col[dicts[["lu_recode"]]$stage == "recode"])

## this is so so so inefficient but I don't want to figure out map and across
for (row in seq_len(nrow(lu_run))) {
  for (col in columns) {
    lu_run[row, ] <- recode_variable(lu_run[row, ], col_input = col)
  }
}

Step 4. Missingness

Because all variables of interest now have the prefix “v” to indicate a covariate of interest or “dx” to indicate that it is the outcome of interest (diagnosis), columns of interest can be selected based on what they start with.

Helper Function - missingness

This function creates a summary table of missing values - searches for columns from a list of prefixes. The two arguments it takes are the data frame (this can be passed with map, which I just learned. finally.), and the prefix list.

missingness <- function(w_data, prefixes){
w_data %>%
  summarize(
    across(.cols = starts_with(prefixes),
        .fns = ~sum(is.na(.x)), .names = "{.col}")) %>%
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "nmissing") %>%
  mutate(
      run = unique(w_data$v_year),
      n = nrow(w_data),
      pct_missing = nmissing / n * 100
    )
}

Call missingess over all the data.

I’ve learned how map works. :)

results <-  lu_run$a2_data %>% 
  map(~missingness(w_data= .x, prefixes=c("vp_", "dx_")))


results <-  bind_rows(results) %>% arrange(desc(pct_missing))

results %>% kable(
    format.args = list(big.mark = ",")
)
variable nmissing run n pct_missing
dx_CIR019 46 2018 1,568,631 0.0029325
dx_INF002 46 2018 1,568,631 0.0029325
dx_CIR019 38 2019 1,556,727 0.0024410
dx_INF002 38 2019 1,556,727 0.0024410
vp_age 0 2018 1,568,631 0.0000000
vp_pay_1 0 2018 1,568,631 0.0000000
vp_year 0 2018 1,568,631 0.0000000
vp_apr_soi 0 2018 1,568,631 0.0000000
vp_gender 0 2018 1,568,631 0.0000000
vp_age 0 2019 1,556,727 0.0000000
vp_pay_1 0 2019 1,556,727 0.0000000
vp_year 0 2019 1,556,727 0.0000000
vp_apr_soi 0 2019 1,556,727 0.0000000
vp_gender 0 2019 1,556,727 0.0000000

Virtually no data is missing from the primary covariates, and outcome - 0.002%, or 46 and 36 observations respectively over >1.5M respectively.

Data was missing from the primary exposure (v_facilityid), because it was redacted for confidentiality. This data has been excluded from the analysis in an earlier stage. These missing values are systematically encoded, as they are all related to specific sensitive procedures and diagnoses. I think it is unlikely that these are uniform across the facilities, but with facility_id and any higher level facility level information redacted, it is difficult to investigate further.

Part 3: Descriptive Statistics

Using gtsummary to create a descriptive statistics table for my analytical sample. Please note, the primary exposure is Permanent Facility ID in the dataset (v_facilityid). This is a categorical variable, but it has a very high cardinality - as a result, I am sharing the total number of Unique Facility IDs in the dataset in its own row in the column, but not subsetting any of the other covariates on this variable.

table1 <- lu_run$a2_data %>% bind_rows()

includelist <- table1 %>% 
  select(starts_with(c("vp_","dx_"))) %>%
  colnames()

facilityrow <- tibble(
  variable = "facility_id",
  label = "Facility ID",
   stat_0 = paste0(length(unique(table1$v_facilityid)), " unique facilities"))


# summarize the data with our package
tbl1 <- table1 %>%
  tbl_summary(
    include = includelist,
    label= list(vp_age ~ "Age group, years",
                vp_pay_1 ~ "Primary payment type",
                vp_year ~ "Discharge year",
                vp_apr_soi ~ "APR-DRG Severity of Illness",
                vp_gender ~ "Database-encoded gender",
                dx_CIR019 ~ "Heart Failure (CIR019)",
                dx_INF002 ~ "Septicemia (INF002)"
                ))
tbl1 <- tbl1 %>%
  modify_table_body(
    ~ bind_rows(facilityrow, .x)
  ) %>%
  modify_caption(caption= "Table 1. Characteristics of 2018-2019 New York State Hospital Discharge Data") %>%
  modify_footnote_header("Observations subject to enhanced de-identification have been excluded from this analysis, as they lack hospital-level identifiers. No dimensions for key covariates listed are missing from observations that are not subject to enhanced de-identification.", columns=all_stat_cols(), replace=FALSE)
tbl1
Table 1. Characteristics of 2018-2019 New York State Hospital Discharge Data
Characteristic N = 3,125,3581,2
Facility ID 208 unique facilities
Age group, years
    0-17 193,231 (6.2%)
    18-29 336,659 (11%)
    30-49 623,119 (20%)
    50-69 915,443 (29%)
    70+ 1,056,906 (34%)
Primary payment type
    Private 630,537 (20%)
    Public 2,340,672 (75%)
    Self-Pay 60,421 (1.9%)
    Other 93,728 (3.0%)
Discharge year
    2018 1,568,631 (50%)
    2019 1,556,727 (50%)
APR-DRG Severity of Illness
    Minor 680,606 (22%)
    Moderate 1,233,410 (39%)
    Major 898,765 (29%)
    Extreme 312,577 (10%)
Database-encoded gender
    Female 1,699,696 (54%)
    Male 1,425,634 (46%)
    Unknown 28 (<0.1%)
Heart Failure (CIR019) 126,452 (4.0%)
    Unknown 84
Septicemia (INF002) 235,185 (7.5%)
    Unknown 84
1 n (%)
2 Observations subject to enhanced de-identification have been excluded from this analysis, as they lack hospital-level identifiers. No dimensions for key covariates listed are missing from observations that are not subject to enhanced de-identification.

Part 4 Exploratory Data Analysis

Step 1: Distribution of the outcome variable

The outcome variable is binary, so there is not much in the way of distribution to identify, so I am producing a bar chart of the two outcomes of interest for Figure 1.

figure1 <- table1 %>% 
  pivot_longer(
    cols = outcomes, 
    names_to = "outcomename",
    values_to = "outcomevalue") %>%
  group_by(outcomename, outcomevalue) %>%
  summarize(n= n()) %>% mutate(outcomevalue = factor(outcomevalue, 
                                                     levels = c(0,1), 
                                                     labels= c("No","Yes")),
                               outcomename = case_match(outcomename,
                                                    "dx_CIR019" ~ "Heart Failure (CIR019)",
                                                    "dx_INF002" ~ "Septicemia (INF002)")
                               ) %>% filter( !(is.na(outcomevalue)))


ggplot(figure1, aes(x = outcomevalue, y = n)) + geom_col(fill = "pink") + facet_wrap(~ outcomename) + scale_y_continuous(labels = label_comma()) + theme_classic() + 
  labs(title = "Figure 1. Clinical Classification Systems-Refined (CCSR) Diagnosis Distribution \nfor Heart Failure and Septicemia across NYS Hospital Discharges 2018-2019",
       y = "Discharge Count", x = "Diagnosis")

The CCSR categories for this phase of exploratory data analysis are the two highest volume categories in the dataset, of over 472 distinct diagnosis categories present in the two years. Despite being the most common, their positive rates remain low: Heart failure at 4% in 2018, and 4.1% in 2019, and Septicemia at 7.29% in 2018, and 7.76% in 2019. This heavy skew towards 0 in this binary category is a result of the variety of diagnosis categories, and is primarily indicative of the scale of variety and the sparsity of outcomes that I can expect in this dataset.

Step 2: Association between the outcome and primary exposure

Because all covariates are categorical, the outcome is binary, and the primary exposure is categorical with high cardinality (208 different facilities), the standard methods of identifying an association do not necessarily apply here.

figure2 <- table1%>% 
  pivot_longer(
    cols = outcomes, 
    names_to = "outcomename",
    values_to = "outcomevalue") %>%
  group_by(v_facilityid, outcomename) %>%
  summarize(n= n(), 
            diagnoses = sum(outcomevalue, na.rm=TRUE)) %>% 
  mutate(
    prev = diagnoses/n ,
    outcomename = case_match(outcomename,
               "dx_CIR019" ~ "Heart Failure (CIR019)",
                "dx_INF002" ~ "Septicemia (INF002)")
  )

ggplot(figure2, aes(x=outcomename, y= prev)) + 
         geom_violin(alpha = 0.5, fill="purple", color = NA)+ 
         #geom_jitter(width = 0.2, seed = 1, alpha = 0.5, color = "palevioletred") +
                theme_classic() +
          geom_boxplot(fill=NA, width=0.1) +
  scale_y_continuous(labels = label_percent()) + 
  labs(title = "Figure 2. Rate of CCSR Diagnoses for Heart Failure (CIR019) and Septicemia (INF002)\namong all Hospital Discharges  across all 208 NYS facilities 2018-2019",
       y = "Rate of Diagnosis\n(Facility Diagnoses/Total Facilty Discharges)", x = "CCSR Category")

Assessing facility-level rates of both heart Failure (CIR019) and Septicemia (INF002) (calculated as the total volume of discharges with a specific diagnosis category over the total number of discharges), we see that there is facility-level variation between categories. Because prevalence distribution for both outcomes differs across facilities, this indicates an association between facility and the diagnosis rate. The distribution of septicemia diagnosis is particularly wide, ranging from a diagnosis rate of 0% to near 20%. Both Heart Failure and Septicimia appear to have 2 potential outlier facilities, which may be worth further investigation.

Step 3: Association between the outcome and primary exposure, stratified by age.

Both selected outcomes Heart Failure and Septicemia are linked to age (as well as many other case-mix factors). For this next EDA, I want to confirm that there is still substantial facility-level variation in diagnosis rate when the data is stratified by age.

figure3 <- table1%>% 
  pivot_longer(
    cols = outcomes, 
    names_to = "outcomename",
    values_to = "outcomevalue") %>%
  group_by(v_facilityid, vp_age, outcomename) %>%
  summarize(n= n(), 
            diagnoses = sum(outcomevalue, na.rm=TRUE)) %>% 
  mutate(
    prev = diagnoses/n ,
    outcomename = case_match(outcomename,
               "dx_CIR019" ~ "Heart Failure (CIR019)",
                "dx_INF002" ~ "Septicemia (INF002)")
  ) %>% filter(outcomename == "Septicemia (INF002)",
               n >= 30)

ggplot(figure3, aes(x=vp_age, y= prev, fill=vp_age))  +
         geom_violin(alpha = 0.5,  color = NA)+ 
         #geom_jitter(width = 0.2, seed = 1, alpha = 0.5, color = "palevioletred") +
                theme_classic() +
          geom_boxplot(fill=NA, width=0.1) +
  scale_y_continuous(labels = label_percent()) + 
  labs(title = "Figure 3. Rate of CCSR Diagnoses for Septicemia (INF002) by Age Group among\nNYS Facilities with >=30 Discharges per Age Group, 2018-2019",
       y = "Rate of Diagnosis\n(Facility Diagnoses/Total Facilty Discharges)" , x = "Facility-Level Septicemia Diagnosis Rate"
       ) + theme(legend.position = "none")

Cells with fewer than 30 total discharges were excluded from this analysis to stabilize diagnosis rate estimates, and heart failure was excluded because of age related sparsity. This initial assessment indicates that even when discharges have been stratified by age group, facilities have very strong variation in diagnosis rate of septicemia. Because this facility level variation is still present with a major case-mix variable controlled for, there is reason to look into other variables as sources for diagnosis variation between facilities.