Part 0: Data Loading and Setup

Step 1. Load Required Packages

Required packages for data cleaning: tidyverse
Required packages for presentation: knitr, kableExtra
Required packages for data retrieval: vroom

Step 2. Import Run Mapping Information

Due to incompatibilities between the Clinical Classification Software (CCS) groupings in the data from 2016-2017 and the Clinical Classification Software (CCSR), the primary analysis will be completed exclusively with data from 2018-2019.

However, there are a small number (35) of diagnoses that can be directly mapped between the two systems. A sensitivity analysis can be conducted on one of the diagnoses listed in the crosswalk. The workflow can process and ingest the 2016 and 2017 by switching the initial ingest from the currently loaded one to “LU_byYear_2016-2019-AlexMossawir_3.15.26.xlsx”.

2018 Documentation

Dataset covers Inpatient Hospital Discharges that occured in calendar year 2018, statewide coverage, issued at the Discharge level. Product of the Office of Quality and Patient Safety.

2.35M Rows, 33 Columns

Source: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/yjgt-tq93/about_data

2019 Documentation

Dataset covers Inpatient Hospital Discharges that occured in calendar year 2019, statewide coverage, issued at the Discharge level. Product of the Office of Quality and Patient Safety.

2.34M Rows, 33 Columns

Source: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/4ny4-j5zv/about_data

#   Import Project Lookup
#--------------------------------------------------------------------------------------------------
#   This can be used to add additional years or remove years more quickly. This is currently
#   set to 2018-2019, but by switching the excel here to:
#           "LU_byYear_2016-2019-AlexMossawir_3.15.26.xlsx"
#   the full 2016-2019 range can be ingested.
#--------------------------------------------------------------------------------------------------

lu_run <- read_excel("LU_byYear_2018-2019-AlexMossawir_3.15.26.xlsx")
##Print LU for Reference
lu_run %>%
kable(
    caption = "Project Lookup"
  )

Project Lookup
run_id	lu_year	lu_url	lu_dict	limit
puf_2018	2018	https://health.data.ny.gov/api/views/yjgt-tq93/rows.csv?accessType=DOWNLOAD	dict_2018	NA
puf_2019	2019	https://health.data.ny.gov/api/views/4ny4-j5zv/rows.csv?accessType=DOWNLOAD	dict_2019	NA

Step 3. Generate Necessary Fields for Downloading CSV

Original version of this workflow queried the OData API, but I was able to confirm that even with pagination through a few different approaches, this data source has a hard cap at 500k rows per query. CSV will be downloaded and cleaned for some inconsistencies in unicode and column headings.

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data, to avoid re-downloading every time this
#     gets knit.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run$cache <- file.path(paste0("data/raw/",lu_run$run_id,".rds"))

## File Timestamp saved to
lu_run$cache_ts <- file.path(paste0("data/raw/",lu_run$run_id,"_ts.txt"))

Step 3. Import Data

Helper Function - import_fetch or load

This function checks if the document specified in the run mapping is currently cached within the project folder. If it is, it loads the data into the LU directly. If it is not, it pulls the number of rows specified in the document (or all) from the url in the lookup, caches the new file, and loads into the LU.

#   Fetch Or Load Function
#--------------------------------------------------------------------------------------------------
#   The function will first check for cached data at the specified location. Then, if it exists,
#   it'll load the cached data. If it doesn't exist, it'll use request generated previously.
#--------------------------------------------------------------------------------------------------
#   This function will be called on an entire row of the LU at a time - The following columns will
#   always need to be in the LU for this to work:
#       - cache - relative file path + name of .rds for each dataset imported
#       - cache_ts - relative file path + name of .txt file with time that the url was queried
#       - lu_url - odata URL where data is being pulled from
#       - request - httr2 request to get data from url
#   The following columns will be generated within the LU by this process:
#       - data - a full dataframe of the data for use
#       - data_ts - timestamp of when the data was first queried
#
#--------------------------------------------------------------------------------------------------

import_fetchorload <- function(LU){ 
  
  ## Check if the cached data exists 
if(file.exists(LU$cache[[1]])) {
  ## If so - print the following information:
  print(
  paste0(
    "Loading data from ", LU$cache[[1]],
    " (saved at ", read_lines(LU$cache_ts[[1]])[[1]], ")",
    " - saving to df."
        )
      )
  ## Read in data from cached location and ts
  LU$data [[1]]<- readRDS(LU$cache[[1]])
  LU$data_timestamp <- read_lines(LU$cache_ts[[1]])
  return(LU)
  
} else {
  ##Print the message
    print(
      paste0(
        "No data saved at ", as.character(LU$cache[[1]]), 
        "- loading from url: ",as.character(LU$lu_url[[1]])
          )
        )
  if (is.na(LU$limit[[1]])==TRUE) {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]])))
  } else {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]]), n_max = LU$limit[[1]]))}
  ts <- strftime(Sys.time())
  cat(ts, file = LU$cache_ts[[1]])
  saveRDS(temp,LU$cache[[1]])
  LU$data_timestamp <- ts
  LU$data[[1]] <- temp
  return(LU)
}}

Loop through LU to import data

#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#   Final output will be a series of messages in console indicating whether it needed to query 
#   the URL or  whether it was able to just pull from the DF directly. Data frames will be stored
#   Within the LU table that we've been operating on.
#--------------------------------------------------------------------------------------------------

lu_run <- rowwise(lu_run) %>% 
  mutate(import_fetchorload(cur_data()))

## [1] "Loading data from data/raw/puf_2018.rds (saved at 2026-03-14 15:48:06) - saving to df."
## [1] "Loading data from data/raw/puf_2019.rds (saved at 2026-03-14 15:51:40) - saving to df."

2018

Hospital Service Area	Hospital County	Operating Certificate Number	Permanent Facility Id	Facility Name	Age Group	Zip Code - 3 digits	Gender	Race	Ethnicity	Length of Stay	Type of Admission	Patient Disposition	Discharge Year	CCSR Diagnosis Code	CCSR Diagnosis Description	CCSR Procedure Code	CCSR Procedure Description	APR DRG Code	APR DRG Description	APR MDC Code	APR MDC Description	APR Severity of Illness Code	APR Severity of Illness Description	APR Risk of Mortality	APR Medical Surgical Description	Payment Typology 1	Payment Typology 2	Payment Typology 3	Birth Weight	Emergency Department Indicator	Total Charges	Total Costs
Capital/Adirond	Columbia	1001000	000146	Columbia Memorial Hospital	50 to 69	125	F	White	Not Span/Hispanic	2	Elective	Short-term Hospital	2018	CIR020	CEREBRAL INFARCTION	NA	NA	045	CVA & PRECEREBRAL OCCLUSION W INFARCT	01	DISEASES AND DISORDERS OF THE NERVOUS SYSTEM	3	Major	Major	Medical	Managed Care, Unspecified	Self-Pay	NA	NA	TRUE	17591.41	3570.06
New York City	Manhattan	7002009	001445	Harlem Hospital Center	50 to 69	100	F	Black/African American	Not Span/Hispanic	16	Emergency	Short-term Hospital	2018	RSP017	POSTPROCEDURAL OR POSTOPERATIVE RESPIRATORY SYSTEM COMPLICATION	ESA003	MECHANICAL VENTILATION	121	OTHER RESPIRATORY & CHEST PROCEDURES	04	DISEASES AND DISORDERS OF THE RESPIRATORY SYSTEM	4	Extreme	Extreme	Surgical	Medicare	Medicaid	NA	NA	TRUE	106190.21	78913.24
New York City	Manhattan	7002009	001445	Harlem Hospital Center	50 to 69	100	F	Black/African American	Not Span/Hispanic	4	Emergency	Home or Self Care	2018	MBD003	BIPOLAR AND RELATED DISORDERS	MHT001	PHARMACOTHERAPY FOR MENTAL HEALTH (EXCLUDING SUBSTANCE USE)	753	BIPOLAR DISORDERS	19	MENTAL DISEASES AND DISORDERS	2	Moderate	Minor	Medical	Medicaid	NA	NA	NA	FALSE	5600.00	4161.53

2019

Hospital Service Area	Hospital County	Operating Certificate Number	Permanent Facility Id	Facility Name	Age Group	Zip Code - 3 digits	Gender	Race	Ethnicity	Length of Stay	Type of Admission	Patient Disposition	Discharge Year	CCSR Diagnosis Code	CCSR Diagnosis Description	CCSR Procedure Code	CCSR Procedure Description	APR DRG Code	APR DRG Description	APR MDC Code	APR MDC Description	APR Severity of Illness Code	APR Severity of Illness Description	APR Risk of Mortality	APR Medical Surgical Description	Payment Typology 1	Payment Typology 2	Payment Typology 3	Birth Weight	Emergency Department Indicator	Total Charges	Total Costs
Hudson Valley	Westchester	5902001	001045	White Plains Hospital Center	30 to 49	106	M	Other Race	Not Span/Hispanic	2	Emergency	Home or Self Care	2019	INF008	VIRAL INFECTION	NA	NA	723	VIRAL ILLNESS	18	INFECTIOUS AND PARASITIC DISEASES (SYSTEMIC OR UNSPECIFIED SITES)	2	Moderate	Minor	Medical	Private Health Insurance	NA	NA	NA	TRUE	26507	4773.11
Hudson Valley	Westchester	5902001	001045	White Plains Hospital Center	70 or Older	105	F	Black/African American	Not Span/Hispanic	3	Emergency	Home or Self Care	2019	GEN002	ACUTE AND UNSPECIFIED RENAL FAILURE	NA	NA	469	ACUTE KIDNEY INJURY	11	DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT	2	Moderate	Moderate	Medical	Medicare	NA	NA	NA	TRUE	20693	5631.30
Hudson Valley	Westchester	5902001	001045	White Plains Hospital Center	50 to 69	105	F	Other Race	Not Span/Hispanic	7	Elective	Home or Self Care	2019	GEN006	OTHER SPECIFIED AND UNSPECIFIED DISEASES OF KIDNEY AND URETERS	GIS005	EXPLORATION OF PERITONEAL CAVITY	443	KIDNEY & URINARY TRACT PROCEDURES FOR NONMALIGNANCY	11	DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT	4	Extreme	Extreme	Surgical	Medicare	Blue Cross/Blue Shield	NA	NA	FALSE	138252	29848.69

Step 4. Import all Dictionaries

Dictionaries for 2016-2017 have been created for future analysis and are saved in the same folder, but the workflow will only read in dictionaries for current runs.

2018 Dictionary

dict_2018 is sourced from 2018 tab in Excel file shared in repository.

This was manually extracted from: https://health.data.ny.gov/api/views/yjgt-tq93/files/9667d39a-be57-4df1-9ce4-537bce120215?download=true&filename=NYSDOH_SPARCS_De-Identified_Data_Dictionary_2018.pdf

Additional API field names have been included in Data_Col

2018 Dictionary

dict_2018 is sourced from 2018 tab in Excel file shared in repository.

This was manually extracted from: https://health.data.ny.gov/api/views/yjgt-tq93/files/9667d39a-be57-4df1-9ce4-537bce120215?download=true&filename=NYSDOH_SPARCS_De-Identified_Data_Dictionary_2018.pdf

Additional API field names have been included in Data_Col

#   Import dictionaries
#--------------------------------------------------------------------------------------------------
#   This can be used to add additional years or remove years more quickly. building this process 
#   on a subset of 20 rows from 2016 and 2017, and then revising to add in the full year range 
#   and the full datasets once workflows are clean, so it can be batched.
#   
#   This is designed so that each dictionary is sourced from a different tab in an excel document.
#   
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx"
#--------------------------------------------------------------------------------------------------
 
##pull years out of the run lu to loop over

year <- as.character(lu_run$lu_year)
##create a dictionary list of sheets read from excel, looping over years
dicts <- lapply(year, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)

})

## 
## Loading dictionary from:DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx
## Read sheet: 2018
## Loading dictionary from:DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx
## Read sheet: 2019

## set the names of the objects using years
dicts <- setNames(dicts, paste0("dict_",year))

Part 0-1: Data Structure Merging

Column names and data types are slightly different across years - they are standardized across datasets.

Step 1. Column Rename and Type Classification

Helper Function - standardize_columns

This function uses the appropriate dictionary for the ingested dataset (ie. 2018 data will get dict_2018) as specified in the lookup. This dataset will rename columns to the standard names across variables, and ensure types are aligned.

#   Standardize Columns Function
#-------------------------------------------------------------------------------------------------
#   This is a function that standardizes the columns based on the provided dictionaries -
#   If this fails, it means that there is something that does not match in the dictionary.
#   
#   This is by design - I want it to be able to account for every single row, if something is
#   mapped incorrectly in the dataset, then it neeeds a new dictionary.
#
#   For example, while making this, I copy/pasted the rows from a print statement, but I
#   dropped one of the underscores on "__id" for the 2017, but not for the 2016 data. 
#
#--------------------------------------------------------------------------------------------------

standardize_columns <- function(run_lu, dictionaries){ 
    # Grab dictionary name out of the lookup
##Call working versions of the data and dictionary objects from the lists that contain them
    w_data <- run_lu$data[[1]]
# Grab dictionary from dicts
    w_dict <- dictionaries[[ run_lu$lu_dict[[1]] ]]  
    
## Named vector of old names using
    w_names <- setNames(w_dict$api_field, w_dict$final_name)

## Fill a_data column in the run LU with the new, renamed dataset
    temp <- w_data %>%
  ## Use the named vector
 
          rename(!!!w_names) %>%
          mutate(
            across(
                all_of(w_dict$final_name[w_dict$final_type == "character"]),
                as.character
                   ),
            across(
                all_of(w_dict$final_name[w_dict$final_type == "numeric"]),
                as.numeric
                  )
                )
        run_lu$a_data[[1]] <- temp
    
    return(run_lu)
}

Loop through datasets to standardize

#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#   If this fails, check the console for issues with the LUs not matching.
#--------------------------------------------------------------------------------------------------    
lu_run <- rowwise(lu_run) %>% 
  mutate(standardize_columns(cur_data(),dicts))

Data Outputs

2018

v_hsa	v_county	operating_certificate_number	v_facilityid	facility_name	v_age	v_patientzip	v_gender	v_race	v_ethnicity	v_los	v_admissiontype	v_disposition	v_year	v_dx_ccsr	dxccsr_description	v_pr_ccsr	prccsr_description	v_apr_drg	apr_drg_description	v_mdc	v_mdc_description	v_apr_soi	apr_soi_description	v_apr_rom	v_apr_type	v_pay_1	v_pay_2	v_pay_3	birth_weight	flag_ed	total_charges	total_costs
Capital/Adirond	Columbia	1001000	000146	Columbia Memorial Hospital	50 to 69	125	F	White	Not Span/Hispanic	2	Elective	Short-term Hospital	2018	CIR020	CEREBRAL INFARCTION	NA	NA	045	CVA & PRECEREBRAL OCCLUSION W INFARCT	01	DISEASES AND DISORDERS OF THE NERVOUS SYSTEM	3	Major	Major	Medical	Managed Care, Unspecified	Self-Pay	NA	NA	TRUE	17591.41	3570.06
New York City	Manhattan	7002009	001445	Harlem Hospital Center	50 to 69	100	F	Black/African American	Not Span/Hispanic	16	Emergency	Short-term Hospital	2018	RSP017	POSTPROCEDURAL OR POSTOPERATIVE RESPIRATORY SYSTEM COMPLICATION	ESA003	MECHANICAL VENTILATION	121	OTHER RESPIRATORY & CHEST PROCEDURES	04	DISEASES AND DISORDERS OF THE RESPIRATORY SYSTEM	4	Extreme	Extreme	Surgical	Medicare	Medicaid	NA	NA	TRUE	106190.21	78913.24
New York City	Manhattan	7002009	001445	Harlem Hospital Center	50 to 69	100	F	Black/African American	Not Span/Hispanic	4	Emergency	Home or Self Care	2018	MBD003	BIPOLAR AND RELATED DISORDERS	MHT001	PHARMACOTHERAPY FOR MENTAL HEALTH (EXCLUDING SUBSTANCE USE)	753	BIPOLAR DISORDERS	19	MENTAL DISEASES AND DISORDERS	2	Moderate	Minor	Medical	Medicaid	NA	NA	NA	FALSE	5600.00	4161.53

2019

v_hsa	v_county	operating_certificate_number	v_facilityid	facility_name	v_age	v_patientzip	v_gender	v_race	v_ethnicity	v_los	v_admissiontype	v_disposition	v_year	v_dx_ccsr	dxccsr_description	v_pr_ccsr	prccsr_description	v_apr_drg	apr_drg_description	v_mdc	v_mdc_description	v_apr_soi	apr_soi_description	v_apr_rom	v_apr_type	v_pay_1	v_pay_2	v_pay_3	birth_weight	flag_ed	total_charges	total_costs
Hudson Valley	Westchester	5902001	001045	White Plains Hospital Center	30 to 49	106	M	Other Race	Not Span/Hispanic	2	Emergency	Home or Self Care	2019	INF008	VIRAL INFECTION	NA	NA	723	VIRAL ILLNESS	18	INFECTIOUS AND PARASITIC DISEASES (SYSTEMIC OR UNSPECIFIED SITES)	2	Moderate	Minor	Medical	Private Health Insurance	NA	NA	NA	TRUE	26507	4773.11
Hudson Valley	Westchester	5902001	001045	White Plains Hospital Center	70 or Older	105	F	Black/African American	Not Span/Hispanic	3	Emergency	Home or Self Care	2019	GEN002	ACUTE AND UNSPECIFIED RENAL FAILURE	NA	NA	469	ACUTE KIDNEY INJURY	11	DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT	2	Moderate	Moderate	Medical	Medicare	NA	NA	NA	TRUE	20693	5631.30
Hudson Valley	Westchester	5902001	001045	White Plains Hospital Center	50 to 69	105	F	Other Race	Not Span/Hispanic	7	Elective	Home or Self Care	2019	GEN006	OTHER SPECIFIED AND UNSPECIFIED DISEASES OF KIDNEY AND URETERS	GIS005	EXPLORATION OF PERITONEAL CAVITY	443	KIDNEY & URINARY TRACT PROCEDURES FOR NONMALIGNANCY	11	DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT	4	Extreme	Extreme	Surgical	Medicare	Blue Cross/Blue Shield	NA	NA	FALSE	138252	29848.69

Part 1: Data Subsetting on Exclusion Criteria

Step 1. Data Quality Rule Import

Based on the contents of the data dictionaries and the key variables being used for the analysis (permanent facility id), I have created a rule dictionary with instructions to be placed into the rule engine.

This first pass is to run through the basic data quality filters - not study design filters. Any observations that cannot be used for this analysis (e.g. enhanced de-identification redacting facility information, missing key case-mix identifier like age group) will be filtered through this stage. The engine will count the number of rows in each run’s dataset and log it, as it passes the data onto the next rule.

All rule dictionaries will have the same structure, so they can be fed through the same update/logging engine downstream, this pass uses the lu filters rule dictionary.

Dictionaries have the following structure:

stage: This indicates which rules should be applied - I am likely keeping these dictionaries in separate tabs, but I wanted to make this filter in case I stack it later.
rule_id: Unique identifier for the rules for QA/querying
rule_name: Unique legible identifier for QA
rule condition: the logic that will be used to identify which rows the rules are applied to.
rule_outcome: For filtering, all rules apply exclude, but downstream and for sensitivity analysis, the rule engine may be used for partitioning the dataset or other outcomes - this will be used to determine that process.
applies: List of which runs to apply the rule to - for example, the 2016 and 2017 datasets have the abortion_edit_indicator for enhanced de-identification, while the 2018 and 2019 datasets do not have this indicator, and identifying redactions must be done by searching for “Redacted for Confidentiality” within facility_name.
order: Rules are applied sequentially based on this column -logged_as This is the clean string used to flag what the transformation was for the log table, to be formatted for final tables and outputs.

#   Import LU
#--------------------------------------------------------------------------------------------------
#   Set the stage here with the variable stage - this step is currently the data_quality pass, so
#   it's going to pull in the rules flagged with "data_quality" - in this case, from tab name.
#   Store it to data quality, again. If we want to import multiple LUs  - add tab names to 
#   the list.
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "LU_AppliedFilters_2016-2019-AlexMossawir_3.7.26.xlsx"
stage <- c("data_quality")
#--------------------------------------------------------------------------------------------------
 
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)
})

## 
## Loading dictionary from:LU_AppliedFilters_2016-2019-AlexMossawir_3.7.26.xlsx
## Read sheet: data_quality

temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))

dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts)

##Print LU for Reference
dicts$lu_data_quality %>%
kable(
    caption = "Rules Lookup for Filtering"
  )

Rules Lookup for Filtering
stage	rule_id	rule_name	rule_condition	rule_outcome	output_prefix	applies	order	logged_as
data_quality	1	exclude_abortions	flag_abortion == “Y”	exclude	a	puf_2016, puf_2017	2	Enhanced De-ID
data_quality	2	exclude_deid	facility_name == “Redacted for Confidentiality”	exclude	a	puf_2018, puf_2019	3	Enhanced De-ID
data_quality	3	missing_facility	is.na(v_facilityid)	exclude	a	all	1	Missing Facility ID
data_quality	4	missing_age	is.na(v_age)	exclude	a	all	4	Missing Age Group
data_quality	5	newborn	v_admissiontype == “Newborn”	exclude	a	all	5	Excluding Newborns
data_quality	6	medical_only	!(v_apr_type == “Medical”)	exclude	a	all	6	Include Only Medical Admissions

Step 2. Data Quality Rules Pass

The structure of the engine will first loop through over run_id using lu_run, this is the exterior loop. For each run, the engine will first determine which rules apply applies.

Then it will loop over the rules in an interior loop. For each rule, it will log iniitial nrow, and save to audit log. Then it will either apply the rule filtering or skip over it. It will output final nrows to the audit log, and pass the revised data to the next step of the interior loop. It proceeds sequentially over the rows, so working data will be updated once the rule has been executed and logged.

The inner loop passes the final audit lists and the final w_data back to the outer loop. The outer loop saves the audit log and the data to columns in the lookup.

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data - this is write-only, just to keep records.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run[[paste0("cache_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,".rds"))
print(lu_run[[paste0("cache_", stage)]])

## [1] "data/data_quality/puf_2018.rds" "data/data_quality/puf_2019.rds"

## File Timestamp saved to
lu_run[[paste0("cache_ts_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,"_ts.txt"))
print(lu_run[[paste0("cache_ts_", stage)]])

## [1] "data/data_quality/puf_2018_ts.txt" "data/data_quality/puf_2019_ts.txt"

Helper Functions - Filtering and Exclusion Rules

These functions are hidden by default - press show to see the code!

flag_rule_application

This function is called at a run_id level within the outer loop. Once the working lookup is selected, it loops over the rule list to determine if the rule is applicable to this particular run. i.e: a rule with the applies text “puf_2016, puf_2017” will be split into a vector and evaluated - it will only be applied when the run_id is in that vector.

#Function to Determine Rule Applicability - Within Outer Loop
#--------------------------------------------------------------------------------------------------
#   Within outer loop, looking at the current run_id that is being passed, 
#    and determine which rules should be applied.
#--------------------------------------------------------------------------------------------------

flag_rule_application <- function(rule, run_id) {
  applies_vec <- rule$applies_rule
  if ("all" %in% applies_vec) { TRUE } else if (run_id %in% applies_vec) { TRUE } else { FALSE }
}

apply_exclude

This function is called at a rule level within the inner loop. Because the lookup stores the logic for what columns should be excluded as a string, it needs to be unquoted to be properly evaluated as logic. This applies that logic with a filter function from dplyr in order to exclude anything that fits that criteria.

#Function Within Inner Loop - Actually Exclude
#--------------------------------------------------------------------------------------------------
#   Applies filter on the dataset using rule expression - called within the inner loop
#--------------------------------------------------------------------------------------------------

##function to if it excludes, does it
apply_exclude <- function(w_data, rule_expr) {
  w_data %>% filter(!(!!rule_expr))
}

apply_rules

This function is called on each specific rule list within the inner loop. It takes a single rule from the inner loop and applies it to the w_data working data set. It also generates a list of values to output to an audit table so all transformations (or skipped rules) can be tracked. This audit row lists the initial number of rows, and the final number of rows after transformation, and reports the number of observations affected by each rule.

This is the core of the inner loop. It returns transformed w_data and a row of the audit table.

#--------------------------------------------------------------------------------------------------
#   Function to Apply Rules - Within Inner Loop
#--------------------------------------------------------------------------------------------------
#   This takes the working dictionary and working data from the interior loop, and continues,
#   passing run_id from the outer loop. It always references [[1]] because it will be passed
#   rowwise. Outputs the updated dataset and a new list, audit_log.
#--------------------------------------------------------------------------------------------------
apply_rule <- function(rule, w_data, run_id) {
##Prelim Information  
  run_flag <- rule$flag_rule_applies
#Generate Audit Row
audit_row <- list(
  stage         = rule$stage,
  run_id        = run_id,
  rule_id       = rule$rule_id,
  rule_name     = rule$rule_name,
  rule_condition= rule$rule_condition,
  rule_outcome  = rule$rule_outcome,
  n_initial     = nrow(w_data),
  logged_as     = rule$logged_as,
  skipped       = !run_flag,
  applied       = run_flag
)
##     cat ( paste0("Run Id: ", string(run_id),   " Rule Name ",rule$rule_name), " Run flag: ", run_flag, " Class W_Data: ",  class(w_data))
## Apply dplyr rules based on what rules text is - this can be updated when I add others. But.
if (run_flag == FALSE) {
  # do nothing
} else if (rule$rule_outcome == "exclude") {
  rule_expr <- parse_expr(rule$rule_condition)
  w_data <- apply_exclude(w_data, rule_expr)
}
audit_row$n_final  <- nrow(w_data)
audit_row$n_change <- audit_row$n_initial - audit_row$n_final

audit_row$timestamp <- Sys.time()
  
   return(list(w_data = w_data, 
              audit_row = audit_row))
         
}

outer_pass

This function is the core of the outer loop. It is passed a single row from lu_run, and uses that information to call the relevant objects and manage the inner loop.

While it only takes a row from lu_run as a direct argument, it references the following global variables to preserve clarity and naming:

stage: This is used to specify throughout that we are on the data quality filtering step. This allows future steps to be built off the same engine.
stage_prefix: This is used to specify the output column name’s prefix - in this case, it is ‘a2’
prev_col: This is used to specify the source column name - in this case it is ‘a_data’.

#Function Within Outer Loop - Manage Inner Loop
#--------------------------------------------------------------------------------------------------
#   This is called by the outer loop and is used to manage the inner loop. It passes both
#   transformed data and the audit log back.
#--------------------------------------------------------------------------------------------------

outer_pass <- function(lu_row) {
      
  ##Get that run_id, data, dict to keep passing through  
    run_id <- lu_row$run_id
  # the whole reason this loop wasn't working is because this a_data is apparently. for some reason. a list length 1
    ## with a tibble at that first place. 
    w_data <- lu_row[[prev_col]][[1]]
    w_dict <- dicts[[paste0("lu_",stage)]]
    
    ## Convert rules from a dataframe into list of lists
    rules <- apply(w_dict, 1, as.list)

    for (j in seq_along(rules)) {
      rule <- rules[[j]]
      ##Applies is currently a vector hand-written in excel
        ##First remove all whitespace, then split out what remains
              ## It's gonna put the list into a list of length one with a vector in it.
              #you literally just want the vector.
      rule$applies_rule <- strsplit(gsub("\\s", "",rule$applies), ",")[[1]]
      
      ##We're looping through all the rules rn
      rule$flag_rule_applies <- flag_rule_application(rule, run_id)
      ## and saving the new items back into rules   
       rules[[j]] <- rule
          }

  ## Create an empty list object so that we can store the audit as we go.
    audit <- list()
  ##i did this just counting first but seq_along actually passes info along each
    for (j in seq_along(rules)) {
    ##pull out list j from dictionary
      rule <- rules[[j]]
##Call function apply_rule
##    Apply rule input:
##        rule = one single row (but as list) of dictionary
##        w_data = working dataset
##        run_id = the same for everything in this inner loop
##    Apply rule output:
##        a list of: w_data, audit_row    
      result <- apply_rule(rule, w_data, run_id)
      ## replace w_data with the new w_data saved in result returned list
      w_data <- result$w_data
      ##store 
      audit[[j]] <- result$audit_row
  
      
      }
    list(
    data = w_data,
    audit = audit
  )

}

Execute Function Calls to Clean Data

The functions above will refer to global stage, prev_col, stage_prefix variables. This is to allow for flexibility of the pipeline. prev_col indicates where the data for this stage should be sourced from, and stage_prefix indicates what the column name of the destination should be. I will revise these away from using the global variable approach when I have more time.

#--------------------------------------------------------------------------------------------------
## Stage Variable
stage <- "data_quality"
stage_prefix <- "a2_"

## Column Data Source
prev_col <- "a_data"

#--------------------------------------------------------------------------------------------------

newcolname <- paste0(stage_prefix, "data")

#--------------------------------------------------------------------------------------------------
results <- vector("list", nrow(lu_run))

#--------------------------------------------------------------------------------------------------

for (i in seq_len(nrow(lu_run))) {
  ##Actually run the nested loops here, saving to a new vector
  results[[i]] <- outer_pass(lu_run[i, ])
  
  ## Save the outputted day in the specified new column
  lu_run[[newcolname]][[i]] <- results[[i]]$data
  ##save audit log to a table
  lu_run$audit[[i]] <- bind_rows(results[[i]]$audit)
  ## Cache this
  saveRDS(results[[i]]$data,lu_run$cache_data_quality[[i]])
  ##Make a timestamp
  ts <- strftime(Sys.time())
  
  ##save the timestamp
  cat (ts, file = lu_run$cache_ts_data_quality[[i]])
}
rm(results)

Step 3. Analytical Sample

The workflow has filtered the data down to the analytical sample and outputted the results of each step of the process into a complete audit table.

The complete table with n at each stage is below, and a summary table of initial observations, exclusion volume, and final analytical sample size, by year, can be found below.

Complete Audit Summary Below

Click to view Complete Table of Sample Exclusions

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
audit_table <- lu_run$audit %>% 
  bind_rows() %>%
  arrange(run_id, rule_id) %>%
  mutate(
    logged_as = factor(logged_as, levels = unique(logged_as))
  ) %>%
  group_by(run_id, logged_as) %>%
  summarise(
    n_initial = max(n_initial),
    Excluded = sum(n_change),
    n_final = min(n_final),
  )

audit_table <- audit_table %>%
  mutate(
year = factor(run_id, 
                 levels = c("puf_2016","puf_2017","puf_2018","puf_2019"),
                 labels = c("2016", 
                            "2017",
                            "2018",
                            "2019")
))

audit_table %>% kable()

run_id	logged_as	n_initial	Excluded	n_final	year
puf_2018	Enhanced De-ID	2352807	8804	2344003	2018
puf_2018	Missing Facility ID	2344003	0	2344003	2018
puf_2018	Missing Age Group	2344003	0	2344003	2018
puf_2018	Excluding Newborns	2344003	212809	2131194	2018
puf_2018	Include Only Medical Admissions	2131194	562563	1568631	2018
puf_2019	Enhanced De-ID	2339462	9071	2330391	2019
puf_2019	Missing Facility ID	2330391	0	2330391	2019
puf_2019	Missing Age Group	2330391	0	2330391	2019
puf_2019	Excluding Newborns	2330391	210594	2119797	2019
puf_2019	Include Only Medical Admissions	2119797	563070	1556727	2019

Summary Table

Analytical Sample Size
	2018	2019
Initial Sample Size	2,352,807	2,339,462
Enhanced De-ID	8,804	9,071
Excluding Newborns	212,809	210,594
Include Only Medical Admissions	562,563	563,070
Final Analytical Sample	1,568,631	1,556,727

Part 2: Variable Selection, Recoding, and Data Cleaning

Step 1. Outcome Variable

Name the variable and describe its scale (continuous, binary, etc.). Report the distribution: mean and SD for continuous outcomes, or frequency and % for binary outcomes. Note any recoding from the raw variable (e.g., converting BMD from NHANES units, collapsing categories).

The primary outcome variable for this analysis is a binary flag indicating whether a hospital discharge is assigned to a selected CCSR Diagnosis Group (1- dischage falls into the category).

Per the data dictionaries, SPARCS data from 2016-2017 reflects the Clinical Classification Software (CCS) codes and descriptions. 2018-2019 reflect Clinical Classification Software - Refined (CCSR) codes and descriptions. The year version of each category has not been specified in the data dictionaries provided.

The CCS and CCSR categories have largely been found to be non-translateable, as CCS beta was built off of ICD-9, while CCSR is using ICD-10. For this reason, the analysis will be restricted to 2018-2019 data, and the additional data points from 2016-2017 may be incorporated for sensitivity analysis.

Note: There are 35 codes with a 1-1 match between CCS and CCSR, per a comparison published by ARQ, so these specific codes can be recoded and highlighted in the data as candidates for a sensitivity analysis including 2016-2017 data.

https://hcup-us.ahrq.gov/toolssoftware/ccsr/DXCCSR-vs-Beta-CCS-Comparison.pdf

count_ccsr <- bind_rows(lu_run$a2_data) %>% 
  select(
  c("v_dx_ccsr","dxccsr_description")
) %>%
  unique() %>% nrow()

There are 472 total diagnosis codes present in the datasets, across 22 body system categories sourced from the ICD-10. This is not a complete match to the MDC body system grouping, which reflects 25 total categories relevant to hospital resource use, though the two have overlap and both are present in this dataset.

top_10 <- bind_rows(lu_run$a2_data) %>% 
  select(
  c("v_dx_ccsr","dxccsr_description")
        )%>%
  group_by(v_dx_ccsr,dxccsr_description) %>%
    summarize(
    count_obs = n()
  )  %>% arrange(desc(count_obs)) 


head(top_10, n=10) %>% kable(
            col.names = c("CCSR Code", "CCSR Description", "n"),
       caption= "Highest Volume CCSR Codes",
  format.args = list(big.mark = ","))

Highest Volume CCSR Codes
CCSR Code	CCSR Description	n
INF002	SEPTICEMIA	235,185
CIR019	HEART FAILURE	126,452
MBD017	ALCOHOL-RELATED DISORDERS	99,747
MBD001	SCHIZOPHRENIA SPECTRUM AND OTHER PSYCHOTIC DISORDERS	83,942
RSP002	PNEUMONIA (EXCEPT THAT CAUSED BY TUBERCULOSIS)	82,004
RSP008	CHRONIC OBSTRUCTIVE PULMONARY DISEASE AND BRONCHIECTASIS	71,042
END003	DIABETES MELLITUS WITH COMPLICATION	63,873
GEN004	URINARY TRACT INFECTIONS	61,030
SKN001	SKIN AND SUBCUTANEOUS TISSUE INFECTIONS	60,424
PRG023	COMPLICATIONS SPECIFIED DURING CHILDBIRTH	58,836

ccsr_list <- c("INF002", "CIR019")

This initial analysis will report the initial data and exploratory data analysis will be conducted on INF002 and CIR019 selected because of the high volume of observations.

The final analysis will be conducted with a list of 3-5 categories, shortlisted for total volume and distribution across hospitals. The list will be refined with a literature review.

Outcome Variable: Recode as Binary Variables

#   Import Dictionary
#--------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "Dictionary_Recoding_AlexMossawir_3-15-26.xlsx"
stage <- c("recode")
#--------------------------------------------------------------------------------------------------
 
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)
})

## 
## Loading dictionary from:Dictionary_Recoding_AlexMossawir_3-15-26.xlsx
## Read sheet: recode

temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))

dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts)

Helper Function - expand_variables

This function takes arguments for the lookup to source data from, the column name it is writing to, and the column name it is sourcing from. It outputs just lu_run.

expand_variables_row <- function(row, prev_col) {
  # Extract the nested data for this row
  w_data <- row[[prev_col]][[1]]
##stupid long way to only get the lookup lines for this stage
## this is uh. why I decided not to stack dictionaries. and then I did anyway.
  w_dict <- dicts[["lu_recode"]][dicts[["lu_recode"]]$stage == "expand", ]
  
  # Loop through columns to expand
  for (i in seq_len(nrow(w_dict))) {
    colsearch <- w_dict[[i, "col"]]
    strsearch <- w_dict[[i, "values"]]
    col_output <- paste0(w_dict[i, "new_col"], "_", w_dict[i, "values"])
    
    w_data <- w_data %>%
      mutate(!!col_output := as.integer(!!sym(colsearch) == strsearch))
  }
  
  # Return as list to store in list-column
  list(w_data)
}


#-------------------------------------------------------------------------------

Loop by run through the one-hot expansion code

lu_run <- lu_run %>%
  rowwise() %>%
  mutate(
    a2_data = expand_variables_row(cur_data(), prev_col = "a2_data")
  ) %>%
  ungroup()

temp <- bind_rows(lu_run$a2_data)
outcomes <- grep("^dx_", colnames(temp), value = TRUE)

year <- temp %>%
  select(c("v_year", outcomes)) %>%
  group_by(v_year) %>%
  summarize(
      Total = n(),
      across(all_of(outcomes), ~ sum(as.numeric(.x), na.rm = TRUE)),
    round((across(all_of(outcomes), ~ sum(as.numeric(.x), na.rm = TRUE) / n()*100, .names = "{.col}_pct")),2)
  )

year%>% 
  kable(
  col.names = c("Year", "Total Observations", "CIR019 n", "INF002 n", "CIR019 %", "INF002 %"),
  caption= "Highest Volume CCSR Codes - Frequency by Year",
  format.args = list(big.mark = ","))

Highest Volume CCSR Codes - Frequency by Year
Year	Total Observations	CIR019 n	INF002 n	CIR019 %	INF002 %
2018	1,568,631	62,675	114,408	4.0	7.29
2019	1,556,727	63,777	120,777	4.1	7.76

Step 2. Primary Exposure Variable

Primary exposure will be the hospital-level identifier, permanent_facility_id in the dataset, representing facility-level coding and documentation patterns.

Facility IDs were previously set to be the character data type in the data standardization pass. Because facility ID is categorical, but there are 208 unique facilities, I will not recode this variable.

Step 3. Key Covariates

Key covariates will be selected to proxy patient case-mix:

Discharge Year (categorical, discharge_year)
Patient age group (categorical, age_group)
Database-encoded gender (categorical, gender)
Primary payor type (categorical, payment_typology_1)
APR-DRG Severity of Illness Code (categorical, apr_severity_of_illness_code)

While Discharge Year and Patient Age Group are ordinal categorical, I am beginning my analysis by converting all to factors and handling as if they are nominative categorical. As I continue to run my analysis, I will revisit these.

Recode and Manage Variables

Based on a review of the data, of the key covariates, the only one that needs recoding is Primary Payor Type.

I believe that Payment Typology is going to be a primary driver of variation, and further analysis will be best suited by a full bit-mapping of of the combination of the three primary payment typologies. A patient who is under-insured cannot be distinguished from a patient who is fully insured without complete integration of all three variables. At this point, however, the Primary Payment Typology will be classed into Private, Public, Self-Pay, and Other.

All other variables will be converted to labeled factors without recoding, to ensure they are handled correctly for the logistic regression model.

The codes for the All Patients Refined Major Diagnostic Categories (APR MDC) have remained standard throughout the 2016-2019 period, but the text of the APR-MDC description column has shown variability. To make this column legible, all APR MDC codes have been mapped to the following document: https://hcup-us.ahrq.gov/db/nation/nis/APR-DRGsV20MethodologyOverviewandBibliography.pdf

Helper Function

recode_variable

This function takes lu_run and a current column name, looks it up in the recoding LU and converts it to the values, levels, and labels that are specifies.

rm(temp)

recode_variable <- function(lu_run, col_input) {
### Always pull working data + dict  
w_dict <- dicts[["lu_recode"]]
w_data <- lu_run[[prev_col]][[1]]

##Filter to the column in question
w_dict <- w_dict %>% filter(col==col_input)

## Needed for recoding
col_output <- w_dict$new_col[[1]]
w_values <- w_dict$values
w_new_values <- w_dict$new_values
recode_vec <- w_new_values %>% setNames(w_values)

##Recode
w_data[[col_output]] <- recode(
  w_data[[col_input]], 
  !!!recode_vec)


##Needed for converting to factors
level_vec <- unique(w_dict$level)
label_vec <- unique(w_dict$label)

##Factors
w_data <- w_data %>%
mutate(
  !!col_output :=   factor(!!sym(col_output),
                 levels = level_vec,
                 labels = label_vec
                ))

lu_run[[newdataname]][[1]]  <- w_data


return(lu_run)
}

Recode

Set the global recoding variables (data source, data destination), source the list of columns that will be recoded from the dictionary ingested above.

stage_prefix <- "a2_"
prev_col <- "a2_data"
newdataname <- paste0(stage_prefix, "data")
stage <- "recode"

columns <- unique(dicts[["lu_recode"]]$col[dicts[["lu_recode"]]$stage == "recode"])

## this is so so so inefficient but I don't want to figure out map and across
for (row in seq_len(nrow(lu_run))) {
  for (col in columns) {
    lu_run[row, ] <- recode_variable(lu_run[row, ], col_input = col)
  }
}

Step 4. Missingness

Because all variables of interest now have the prefix “v” to indicate a covariate of interest or “dx” to indicate that it is the outcome of interest (diagnosis), columns of interest can be selected based on what they start with.

Helper Function - missingness

This function creates a summary table of missing values - searches for columns from a list of prefixes. The two arguments it takes are the data frame (this can be passed with map, which I just learned. finally.), and the prefix list.

missingness <- function(w_data, prefixes){
w_data %>%
  summarize(
    across(.cols = starts_with(prefixes),
        .fns = ~sum(is.na(.x)), .names = "{.col}")) %>%
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "nmissing") %>%
  mutate(
      run = unique(w_data$v_year),
      n = nrow(w_data),
      pct_missing = nmissing / n * 100
    )
}

Call missingess over all the data.

I’ve learned how map works. :)

results <-  lu_run$a2_data %>% 
  map(~missingness(w_data= .x, prefixes=c("vp_", "dx_")))


results <-  bind_rows(results) %>% arrange(desc(pct_missing))

results %>% kable(
    format.args = list(big.mark = ",")
)

variable	nmissing	run	n	pct_missing
dx_CIR019	46	2018	1,568,631	0.0029325
dx_INF002	46	2018	1,568,631	0.0029325
dx_CIR019	38	2019	1,556,727	0.0024410
dx_INF002	38	2019	1,556,727	0.0024410
vp_age	0	2018	1,568,631	0.0000000
vp_pay_1	0	2018	1,568,631	0.0000000
vp_year	0	2018	1,568,631	0.0000000
vp_apr_soi	0	2018	1,568,631	0.0000000
vp_gender	0	2018	1,568,631	0.0000000
vp_age	0	2019	1,556,727	0.0000000
vp_pay_1	0	2019	1,556,727	0.0000000
vp_year	0	2019	1,556,727	0.0000000
vp_apr_soi	0	2019	1,556,727	0.0000000
vp_gender	0	2019	1,556,727	0.0000000

Virtually no data is missing from the primary covariates, and outcome - 0.002%, or 46 and 36 observations respectively over >1.5M respectively.

Data was missing from the primary exposure (v_facilityid), because it was redacted for confidentiality. This data has been excluded from the analysis in an earlier stage. These missing values are systematically encoded, as they are all related to specific sensitive procedures and diagnoses. I think it is unlikely that these are uniform across the facilities, but with facility_id and any higher level facility level information redacted, it is difficult to investigate further.

Part 3: Descriptive Statistics

Using gtsummary to create a descriptive statistics table for my analytical sample. Please note, the primary exposure is Permanent Facility ID in the dataset (v_facilityid). This is a categorical variable, but it has a very high cardinality - as a result, I am sharing the total number of Unique Facility IDs in the dataset in its own row in the column, but not subsetting any of the other covariates on this variable.

table1 <- lu_run$a2_data %>% bind_rows()

includelist <- table1 %>% 
  select(starts_with(c("vp_","dx_"))) %>%
  colnames()

facilityrow <- tibble(
  variable = "facility_id",
  label = "Facility ID",
   stat_0 = paste0(length(unique(table1$v_facilityid)), " unique facilities"))


# summarize the data with our package
tbl1 <- table1 %>%
  tbl_summary(
    include = includelist,
    label= list(vp_age ~ "Age group, years",
                vp_pay_1 ~ "Primary payment type",
                vp_year ~ "Discharge year",
                vp_apr_soi ~ "APR-DRG Severity of Illness",
                vp_gender ~ "Database-encoded gender",
                dx_CIR019 ~ "Heart Failure (CIR019)",
                dx_INF002 ~ "Septicemia (INF002)"
                ))
tbl1 <- tbl1 %>%
  modify_table_body(
    ~ bind_rows(facilityrow, .x)
  ) %>%
  modify_caption(caption= "Table 1. Characteristics of 2018-2019 New York State Hospital Discharge Data") %>%
  modify_footnote_header("Observations subject to enhanced de-identification have been excluded from this analysis, as they lack hospital-level identifiers. No dimensions for key covariates listed are missing from observations that are not subject to enhanced de-identification.", columns=all_stat_cols(), replace=FALSE)
tbl1

Table 1. Characteristics of 2018-2019 New York State Hospital Discharge Data
Characteristic	N = 3,125,358^1,2
Facility ID	208 unique facilities
Age group, years
0-17	193,231 (6.2%)
18-29	336,659 (11%)
30-49	623,119 (20%)
50-69	915,443 (29%)
70+	1,056,906 (34%)
Primary payment type
Private	630,537 (20%)
Public	2,340,672 (75%)
Self-Pay	60,421 (1.9%)
Other	93,728 (3.0%)
Discharge year
2018	1,568,631 (50%)
2019	1,556,727 (50%)
APR-DRG Severity of Illness
Minor	680,606 (22%)
Moderate	1,233,410 (39%)
Major	898,765 (29%)
Extreme	312,577 (10%)
Database-encoded gender
Female	1,699,696 (54%)
Male	1,425,634 (46%)
Unknown	28 (<0.1%)
Heart Failure (CIR019)	126,452 (4.0%)
Unknown	84
Septicemia (INF002)	235,185 (7.5%)
Unknown	84
¹ n (%)
² Observations subject to enhanced de-identification have been excluded from this analysis, as they lack hospital-level identifiers. No dimensions for key covariates listed are missing from observations that are not subject to enhanced de-identification.

Part 4 Exploratory Data Analysis

Step 1: Distribution of the outcome variable

The outcome variable is binary, so there is not much in the way of distribution to identify, so I am producing a bar chart of the two outcomes of interest for Figure 1.

figure1 <- table1 %>% 
  pivot_longer(
    cols = outcomes, 
    names_to = "outcomename",
    values_to = "outcomevalue") %>%
  group_by(outcomename, outcomevalue) %>%
  summarize(n= n()) %>% mutate(outcomevalue = factor(outcomevalue, 
                                                     levels = c(0,1), 
                                                     labels= c("No","Yes")),
                               outcomename = case_match(outcomename,
                                                    "dx_CIR019" ~ "Heart Failure (CIR019)",
                                                    "dx_INF002" ~ "Septicemia (INF002)")
                               ) %>% filter( !(is.na(outcomevalue)))


ggplot(figure1, aes(x = outcomevalue, y = n)) + geom_col(fill = "pink") + facet_wrap(~ outcomename) + scale_y_continuous(labels = label_comma()) + theme_classic() + 
  labs(title = "Figure 1. Clinical Classification Systems-Refined (CCSR) Diagnosis Distribution \nfor Heart Failure and Septicemia across NYS Hospital Discharges 2018-2019",
       y = "Discharge Count", x = "Diagnosis")

The CCSR categories for this phase of exploratory data analysis are the two highest volume categories in the dataset, of over 472 distinct diagnosis categories present in the two years. Despite being the most common, their positive rates remain low: Heart failure at 4% in 2018, and 4.1% in 2019, and Septicemia at 7.29% in 2018, and 7.76% in 2019. This heavy skew towards 0 in this binary category is a result of the variety of diagnosis categories, and is primarily indicative of the scale of variety and the sparsity of outcomes that I can expect in this dataset.

Step 2: Association between the outcome and primary exposure

Because all covariates are categorical, the outcome is binary, and the primary exposure is categorical with high cardinality (208 different facilities), the standard methods of identifying an association do not necessarily apply here.

figure2 <- table1%>% 
  pivot_longer(
    cols = outcomes, 
    names_to = "outcomename",
    values_to = "outcomevalue") %>%
  group_by(v_facilityid, outcomename) %>%
  summarize(n= n(), 
            diagnoses = sum(outcomevalue, na.rm=TRUE)) %>% 
  mutate(
    prev = diagnoses/n ,
    outcomename = case_match(outcomename,
               "dx_CIR019" ~ "Heart Failure (CIR019)",
                "dx_INF002" ~ "Septicemia (INF002)")
  )

ggplot(figure2, aes(x=outcomename, y= prev)) + 
         geom_violin(alpha = 0.5, fill="purple", color = NA)+ 
         #geom_jitter(width = 0.2, seed = 1, alpha = 0.5, color = "palevioletred") +
                theme_classic() +
          geom_boxplot(fill=NA, width=0.1) +
  scale_y_continuous(labels = label_percent()) + 
  labs(title = "Figure 2. Rate of CCSR Diagnoses for Heart Failure (CIR019) and Septicemia (INF002)\namong all Hospital Discharges  across all 208 NYS facilities 2018-2019",
       y = "Rate of Diagnosis\n(Facility Diagnoses/Total Facilty Discharges)", x = "CCSR Category")

Assessing facility-level rates of both heart Failure (CIR019) and Septicemia (INF002) (calculated as the total volume of discharges with a specific diagnosis category over the total number of discharges), we see that there is facility-level variation between categories. Because prevalence distribution for both outcomes differs across facilities, this indicates an association between facility and the diagnosis rate. The distribution of septicemia diagnosis is particularly wide, ranging from a diagnosis rate of 0% to near 20%. Both Heart Failure and Septicimia appear to have 2 potential outlier facilities, which may be worth further investigation.

Step 3: Association between the outcome and primary exposure, stratified by age.

Both selected outcomes Heart Failure and Septicemia are linked to age (as well as many other case-mix factors). For this next EDA, I want to confirm that there is still substantial facility-level variation in diagnosis rate when the data is stratified by age.

figure3 <- table1%>% 
  pivot_longer(
    cols = outcomes, 
    names_to = "outcomename",
    values_to = "outcomevalue") %>%
  group_by(v_facilityid, vp_age, outcomename) %>%
  summarize(n= n(), 
            diagnoses = sum(outcomevalue, na.rm=TRUE)) %>% 
  mutate(
    prev = diagnoses/n ,
    outcomename = case_match(outcomename,
               "dx_CIR019" ~ "Heart Failure (CIR019)",
                "dx_INF002" ~ "Septicemia (INF002)")
  ) %>% filter(outcomename == "Septicemia (INF002)",
               n >= 30)

ggplot(figure3, aes(x=vp_age, y= prev, fill=vp_age))  +
         geom_violin(alpha = 0.5,  color = NA)+ 
         #geom_jitter(width = 0.2, seed = 1, alpha = 0.5, color = "palevioletred") +
                theme_classic() +
          geom_boxplot(fill=NA, width=0.1) +
  scale_y_continuous(labels = label_percent()) + 
  labs(title = "Figure 3. Rate of CCSR Diagnoses for Septicemia (INF002) by Age Group among\nNYS Facilities with >=30 Discharges per Age Group, 2018-2019",
       y = "Rate of Diagnosis\n(Facility Diagnoses/Total Facilty Discharges)" , x = "Facility-Level Septicemia Diagnosis Rate"
       ) + theme(legend.position = "none")

Cells with fewer than 30 total discharges were excluded from this analysis to stabilize diagnosis rate estimates, and heart failure was excluded because of age related sparsity. This initial assessment indicates that even when discharges have been stratified by age group, facilities have very strong variation in diagnosis rate of septicemia. Because this facility level variation is still present with a major case-mix variable controlled for, there is reason to look into other variables as sources for diagnosis variation between facilities.

Import and Clean SPARCS Data

EPI 553: Principles of Statistical Inference II

Alex Mossawir

2026-02-28

Part 0: Data Loading and Setup

Step 1. Load Required Packages

Step 2. Import Run Mapping Information

2018 Documentation

2019 Documentation

Step 3. Generate Necessary Fields for Downloading CSV

Step 3. Import Data

Helper Function - import_fetch or load

Loop through LU to import data

2018

2019

Step 4. Import all Dictionaries

2018 Dictionary

2018 Dictionary

Part 0-1: Data Structure Merging

Step 1. Column Rename and Type Classification

Helper Function - standardize_columns

Loop through datasets to standardize

Data Outputs

2018

2019

Part 1: Data Subsetting on Exclusion Criteria

Step 1. Data Quality Rule Import

Step 2. Data Quality Rules Pass

Helper Functions - Filtering and Exclusion Rules

Execute Function Calls to Clean Data

Step 3. Analytical Sample

Complete Audit Summary Below

Summary Table

Part 2: Variable Selection, Recoding, and Data Cleaning

Step 1. Outcome Variable

Outcome Variable: Recode as Binary Variables

Helper Function - expand_variables

Loop by run through the one-hot expansion code

Step 2. Primary Exposure Variable

Step 3. Key Covariates

Recode and Manage Variables

Helper Function

Recode

Step 4. Missingness

Helper Function - missingness

Call missingess over all the data.

Part 3: Descriptive Statistics

Part 4 Exploratory Data Analysis

Step 1: Distribution of the outcome variable

Step 2: Association between the outcome and primary exposure

Step 3: Association between the outcome and primary exposure, stratified by age.