Due to incompatibilities between the Clinical Classification Software (CCS) groupings in the data from 2016-2017 and the Clinical Classification Software (CCSR), the primary analysis will be completed exclusively with data from 2018-2019.
However, there are a small number (35) of diagnoses that can be directly mapped between the two systems. A sensitivity analysis can be conducted on one of the diagnoses listed in the crosswalk. The workflow can process and ingest the 2016 and 2017 by switching the initial ingest from the currently loaded one to “LU_byYear_2016-2019-AlexMossawir_3.15.26.xlsx”.
Dataset covers Inpatient Hospital Discharges that occured in calendar year 2018, statewide coverage, issued at the Discharge level. Product of the Office of Quality and Patient Safety.
2.35M Rows, 33 Columns
Dataset covers Inpatient Hospital Discharges that occured in calendar year 2019, statewide coverage, issued at the Discharge level. Product of the Office of Quality and Patient Safety.
2.34M Rows, 33 Columns
# Import Project Lookup
#--------------------------------------------------------------------------------------------------
# This can be used to add additional years or remove years more quickly. This is currently
# set to 2018-2019, but by switching the excel here to:
# "LU_byYear_2016-2019-AlexMossawir_3.15.26.xlsx"
# the full 2016-2019 range can be ingested.
#--------------------------------------------------------------------------------------------------
lu_run <- read_excel("LU_byYear_2018-2019-AlexMossawir_3.15.26.xlsx")
##Print LU for Reference
lu_run %>%
kable(
caption = "Project Lookup"
) | run_id | lu_year | lu_url | lu_dict | limit |
|---|---|---|---|---|
| puf_2018 | 2018 | https://health.data.ny.gov/api/views/yjgt-tq93/rows.csv?accessType=DOWNLOAD | dict_2018 | NA |
| puf_2019 | 2019 | https://health.data.ny.gov/api/views/4ny4-j5zv/rows.csv?accessType=DOWNLOAD | dict_2019 | NA |
Original version of this workflow queried the OData API, but I was able to confirm that even with pagination through a few different approaches, this data source has a hard cap at 500k rows per query. CSV will be downloaded and cleaned for some inconsistencies in unicode and column headings.
# Cache Info and Location
#--------------------------------------------------------------------------------------------------
# Workflow will check project file for cached data, to avoid re-downloading every time this
# gets knit.
#--------------------------------------------------------------------------------------------------
## File cached data saved to
lu_run$cache <- file.path(paste0("data/raw/",lu_run$run_id,".rds"))
## File Timestamp saved to
lu_run$cache_ts <- file.path(paste0("data/raw/",lu_run$run_id,"_ts.txt"))This function checks if the document specified in the run mapping is currently cached within the project folder. If it is, it loads the data into the LU directly. If it is not, it pulls the number of rows specified in the document (or all) from the url in the lookup, caches the new file, and loads into the LU.
# Fetch Or Load Function
#--------------------------------------------------------------------------------------------------
# The function will first check for cached data at the specified location. Then, if it exists,
# it'll load the cached data. If it doesn't exist, it'll use request generated previously.
#--------------------------------------------------------------------------------------------------
# This function will be called on an entire row of the LU at a time - The following columns will
# always need to be in the LU for this to work:
# - cache - relative file path + name of .rds for each dataset imported
# - cache_ts - relative file path + name of .txt file with time that the url was queried
# - lu_url - odata URL where data is being pulled from
# - request - httr2 request to get data from url
# The following columns will be generated within the LU by this process:
# - data - a full dataframe of the data for use
# - data_ts - timestamp of when the data was first queried
#
#--------------------------------------------------------------------------------------------------
import_fetchorload <- function(LU){
## Check if the cached data exists
if(file.exists(LU$cache[[1]])) {
## If so - print the following information:
print(
paste0(
"Loading data from ", LU$cache[[1]],
" (saved at ", read_lines(LU$cache_ts[[1]])[[1]], ")",
" - saving to df."
)
)
## Read in data from cached location and ts
LU$data [[1]]<- readRDS(LU$cache[[1]])
LU$data_timestamp <- read_lines(LU$cache_ts[[1]])
return(LU)
} else {
##Print the message
print(
paste0(
"No data saved at ", as.character(LU$cache[[1]]),
"- loading from url: ",as.character(LU$lu_url[[1]])
)
)
if (is.na(LU$limit[[1]])==TRUE) {
temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]])))
} else {
temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]]), n_max = LU$limit[[1]]))}
ts <- strftime(Sys.time())
cat(ts, file = LU$cache_ts[[1]])
saveRDS(temp,LU$cache[[1]])
LU$data_timestamp <- ts
LU$data[[1]] <- temp
return(LU)
}}# Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
# Final output will be a series of messages in console indicating whether it needed to query
# the URL or whether it was able to just pull from the DF directly. Data frames will be stored
# Within the LU table that we've been operating on.
#--------------------------------------------------------------------------------------------------
lu_run <- rowwise(lu_run) %>%
mutate(import_fetchorload(cur_data()))## [1] "Loading data from data/raw/puf_2018.rds (saved at 2026-03-14 15:48:06) - saving to df."
## [1] "Loading data from data/raw/puf_2019.rds (saved at 2026-03-14 15:51:40) - saving to df."
| Hospital Service Area | Hospital County | Operating Certificate Number | Permanent Facility Id | Facility Name | Age Group | Zip Code - 3 digits | Gender | Race | Ethnicity | Length of Stay | Type of Admission | Patient Disposition | Discharge Year | CCSR Diagnosis Code | CCSR Diagnosis Description | CCSR Procedure Code | CCSR Procedure Description | APR DRG Code | APR DRG Description | APR MDC Code | APR MDC Description | APR Severity of Illness Code | APR Severity of Illness Description | APR Risk of Mortality | APR Medical Surgical Description | Payment Typology 1 | Payment Typology 2 | Payment Typology 3 | Birth Weight | Emergency Department Indicator | Total Charges | Total Costs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Capital/Adirond | Columbia | 1001000 | 000146 | Columbia Memorial Hospital | 50 to 69 | 125 | F | White | Not Span/Hispanic | 2 | Elective | Short-term Hospital | 2018 | CIR020 | CEREBRAL INFARCTION | NA | NA | 045 | CVA & PRECEREBRAL OCCLUSION W INFARCT | 01 | DISEASES AND DISORDERS OF THE NERVOUS SYSTEM | 3 | Major | Major | Medical | Managed Care, Unspecified | Self-Pay | NA | NA | TRUE | 17591.41 | 3570.06 |
| New York City | Manhattan | 7002009 | 001445 | Harlem Hospital Center | 50 to 69 | 100 | F | Black/African American | Not Span/Hispanic | 16 | Emergency | Short-term Hospital | 2018 | RSP017 | POSTPROCEDURAL OR POSTOPERATIVE RESPIRATORY SYSTEM COMPLICATION | ESA003 | MECHANICAL VENTILATION | 121 | OTHER RESPIRATORY & CHEST PROCEDURES | 04 | DISEASES AND DISORDERS OF THE RESPIRATORY SYSTEM | 4 | Extreme | Extreme | Surgical | Medicare | Medicaid | NA | NA | TRUE | 106190.21 | 78913.24 |
| New York City | Manhattan | 7002009 | 001445 | Harlem Hospital Center | 50 to 69 | 100 | F | Black/African American | Not Span/Hispanic | 4 | Emergency | Home or Self Care | 2018 | MBD003 | BIPOLAR AND RELATED DISORDERS | MHT001 | PHARMACOTHERAPY FOR MENTAL HEALTH (EXCLUDING SUBSTANCE USE) | 753 | BIPOLAR DISORDERS | 19 | MENTAL DISEASES AND DISORDERS | 2 | Moderate | Minor | Medical | Medicaid | NA | NA | NA | FALSE | 5600.00 | 4161.53 |
| Hospital Service Area | Hospital County | Operating Certificate Number | Permanent Facility Id | Facility Name | Age Group | Zip Code - 3 digits | Gender | Race | Ethnicity | Length of Stay | Type of Admission | Patient Disposition | Discharge Year | CCSR Diagnosis Code | CCSR Diagnosis Description | CCSR Procedure Code | CCSR Procedure Description | APR DRG Code | APR DRG Description | APR MDC Code | APR MDC Description | APR Severity of Illness Code | APR Severity of Illness Description | APR Risk of Mortality | APR Medical Surgical Description | Payment Typology 1 | Payment Typology 2 | Payment Typology 3 | Birth Weight | Emergency Department Indicator | Total Charges | Total Costs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hudson Valley | Westchester | 5902001 | 001045 | White Plains Hospital Center | 30 to 49 | 106 | M | Other Race | Not Span/Hispanic | 2 | Emergency | Home or Self Care | 2019 | INF008 | VIRAL INFECTION | NA | NA | 723 | VIRAL ILLNESS | 18 | INFECTIOUS AND PARASITIC DISEASES (SYSTEMIC OR UNSPECIFIED SITES) | 2 | Moderate | Minor | Medical | Private Health Insurance | NA | NA | NA | TRUE | 26507 | 4773.11 |
| Hudson Valley | Westchester | 5902001 | 001045 | White Plains Hospital Center | 70 or Older | 105 | F | Black/African American | Not Span/Hispanic | 3 | Emergency | Home or Self Care | 2019 | GEN002 | ACUTE AND UNSPECIFIED RENAL FAILURE | NA | NA | 469 | ACUTE KIDNEY INJURY | 11 | DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT | 2 | Moderate | Moderate | Medical | Medicare | NA | NA | NA | TRUE | 20693 | 5631.30 |
| Hudson Valley | Westchester | 5902001 | 001045 | White Plains Hospital Center | 50 to 69 | 105 | F | Other Race | Not Span/Hispanic | 7 | Elective | Home or Self Care | 2019 | GEN006 | OTHER SPECIFIED AND UNSPECIFIED DISEASES OF KIDNEY AND URETERS | GIS005 | EXPLORATION OF PERITONEAL CAVITY | 443 | KIDNEY & URINARY TRACT PROCEDURES FOR NONMALIGNANCY | 11 | DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT | 4 | Extreme | Extreme | Surgical | Medicare | Blue Cross/Blue Shield | NA | NA | FALSE | 138252 | 29848.69 |
Dictionaries for 2016-2017 have been created for future analysis and are saved in the same folder, but the workflow will only read in dictionaries for current runs.
dict_2018 is sourced from 2018 tab in Excel file shared
in repository.
This was manually extracted from: https://health.data.ny.gov/api/views/yjgt-tq93/files/9667d39a-be57-4df1-9ce4-537bce120215?download=true&filename=NYSDOH_SPARCS_De-Identified_Data_Dictionary_2018.pdf
Additional API field names have been included in
Data_Col
dict_2018 is sourced from 2018 tab in Excel file shared
in repository.
This was manually extracted from: https://health.data.ny.gov/api/views/yjgt-tq93/files/9667d39a-be57-4df1-9ce4-537bce120215?download=true&filename=NYSDOH_SPARCS_De-Identified_Data_Dictionary_2018.pdf
Additional API field names have been included in
Data_Col
# Import dictionaries
#--------------------------------------------------------------------------------------------------
# This can be used to add additional years or remove years more quickly. building this process
# on a subset of 20 rows from 2016 and 2017, and then revising to add in the full year range
# and the full datasets once workflows are clean, so it can be batched.
#
# This is designed so that each dictionary is sourced from a different tab in an excel document.
#
#--------------------------------------------------------------------------------------------------
#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx"
#--------------------------------------------------------------------------------------------------
##pull years out of the run lu to loop over
year <- as.character(lu_run$lu_year)
##create a dictionary list of sheets read from excel, looping over years
dicts <- lapply(year, function(row){
cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
read_excel(dict,sheet=row)
})##
## Loading dictionary from:DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx
## Read sheet: 2018
## Loading dictionary from:DataDictionary_2016-2019-CSVs_AlexMossawir_3-13-26.xlsx
## Read sheet: 2019
Column names and data types are slightly different across years - they are standardized across datasets.
This function uses the appropriate dictionary for the ingested
dataset (ie. 2018 data will get dict_2018) as specified in
the lookup. This dataset will rename columns to the standard names
across variables, and ensure types are aligned.
# Standardize Columns Function
#-------------------------------------------------------------------------------------------------
# This is a function that standardizes the columns based on the provided dictionaries -
# If this fails, it means that there is something that does not match in the dictionary.
#
# This is by design - I want it to be able to account for every single row, if something is
# mapped incorrectly in the dataset, then it neeeds a new dictionary.
#
# For example, while making this, I copy/pasted the rows from a print statement, but I
# dropped one of the underscores on "__id" for the 2017, but not for the 2016 data.
#
#--------------------------------------------------------------------------------------------------
standardize_columns <- function(run_lu, dictionaries){
# Grab dictionary name out of the lookup
##Call working versions of the data and dictionary objects from the lists that contain them
w_data <- run_lu$data[[1]]
# Grab dictionary from dicts
w_dict <- dictionaries[[ run_lu$lu_dict[[1]] ]]
## Named vector of old names using
w_names <- setNames(w_dict$api_field, w_dict$final_name)
## Fill a_data column in the run LU with the new, renamed dataset
temp <- w_data %>%
## Use the named vector
rename(!!!w_names) %>%
mutate(
across(
all_of(w_dict$final_name[w_dict$final_type == "character"]),
as.character
),
across(
all_of(w_dict$final_name[w_dict$final_type == "numeric"]),
as.numeric
)
)
run_lu$a_data[[1]] <- temp
return(run_lu)
}# Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
# If this fails, check the console for issues with the LUs not matching.
#--------------------------------------------------------------------------------------------------
lu_run <- rowwise(lu_run) %>%
mutate(standardize_columns(cur_data(),dicts))| v_hsa | v_county | operating_certificate_number | v_facilityid | facility_name | v_age | v_patientzip | v_gender | v_race | v_ethnicity | v_los | v_admissiontype | v_disposition | v_year | v_dx_ccsr | dxccsr_description | v_pr_ccsr | prccsr_description | v_apr_drg | apr_drg_description | v_mdc | v_mdc_description | v_apr_soi | apr_soi_description | v_apr_rom | v_apr_type | v_pay_1 | v_pay_2 | v_pay_3 | birth_weight | flag_ed | total_charges | total_costs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Capital/Adirond | Columbia | 1001000 | 000146 | Columbia Memorial Hospital | 50 to 69 | 125 | F | White | Not Span/Hispanic | 2 | Elective | Short-term Hospital | 2018 | CIR020 | CEREBRAL INFARCTION | NA | NA | 045 | CVA & PRECEREBRAL OCCLUSION W INFARCT | 01 | DISEASES AND DISORDERS OF THE NERVOUS SYSTEM | 3 | Major | Major | Medical | Managed Care, Unspecified | Self-Pay | NA | NA | TRUE | 17591.41 | 3570.06 |
| New York City | Manhattan | 7002009 | 001445 | Harlem Hospital Center | 50 to 69 | 100 | F | Black/African American | Not Span/Hispanic | 16 | Emergency | Short-term Hospital | 2018 | RSP017 | POSTPROCEDURAL OR POSTOPERATIVE RESPIRATORY SYSTEM COMPLICATION | ESA003 | MECHANICAL VENTILATION | 121 | OTHER RESPIRATORY & CHEST PROCEDURES | 04 | DISEASES AND DISORDERS OF THE RESPIRATORY SYSTEM | 4 | Extreme | Extreme | Surgical | Medicare | Medicaid | NA | NA | TRUE | 106190.21 | 78913.24 |
| New York City | Manhattan | 7002009 | 001445 | Harlem Hospital Center | 50 to 69 | 100 | F | Black/African American | Not Span/Hispanic | 4 | Emergency | Home or Self Care | 2018 | MBD003 | BIPOLAR AND RELATED DISORDERS | MHT001 | PHARMACOTHERAPY FOR MENTAL HEALTH (EXCLUDING SUBSTANCE USE) | 753 | BIPOLAR DISORDERS | 19 | MENTAL DISEASES AND DISORDERS | 2 | Moderate | Minor | Medical | Medicaid | NA | NA | NA | FALSE | 5600.00 | 4161.53 |
| v_hsa | v_county | operating_certificate_number | v_facilityid | facility_name | v_age | v_patientzip | v_gender | v_race | v_ethnicity | v_los | v_admissiontype | v_disposition | v_year | v_dx_ccsr | dxccsr_description | v_pr_ccsr | prccsr_description | v_apr_drg | apr_drg_description | v_mdc | v_mdc_description | v_apr_soi | apr_soi_description | v_apr_rom | v_apr_type | v_pay_1 | v_pay_2 | v_pay_3 | birth_weight | flag_ed | total_charges | total_costs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hudson Valley | Westchester | 5902001 | 001045 | White Plains Hospital Center | 30 to 49 | 106 | M | Other Race | Not Span/Hispanic | 2 | Emergency | Home or Self Care | 2019 | INF008 | VIRAL INFECTION | NA | NA | 723 | VIRAL ILLNESS | 18 | INFECTIOUS AND PARASITIC DISEASES (SYSTEMIC OR UNSPECIFIED SITES) | 2 | Moderate | Minor | Medical | Private Health Insurance | NA | NA | NA | TRUE | 26507 | 4773.11 |
| Hudson Valley | Westchester | 5902001 | 001045 | White Plains Hospital Center | 70 or Older | 105 | F | Black/African American | Not Span/Hispanic | 3 | Emergency | Home or Self Care | 2019 | GEN002 | ACUTE AND UNSPECIFIED RENAL FAILURE | NA | NA | 469 | ACUTE KIDNEY INJURY | 11 | DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT | 2 | Moderate | Moderate | Medical | Medicare | NA | NA | NA | TRUE | 20693 | 5631.30 |
| Hudson Valley | Westchester | 5902001 | 001045 | White Plains Hospital Center | 50 to 69 | 105 | F | Other Race | Not Span/Hispanic | 7 | Elective | Home or Self Care | 2019 | GEN006 | OTHER SPECIFIED AND UNSPECIFIED DISEASES OF KIDNEY AND URETERS | GIS005 | EXPLORATION OF PERITONEAL CAVITY | 443 | KIDNEY & URINARY TRACT PROCEDURES FOR NONMALIGNANCY | 11 | DISEASES AND DISORDERS OF THE KIDNEY AND URINARY TRACT | 4 | Extreme | Extreme | Surgical | Medicare | Blue Cross/Blue Shield | NA | NA | FALSE | 138252 | 29848.69 |
Based on the contents of the data dictionaries and the key variables being used for the analysis (permanent facility id), I have created a rule dictionary with instructions to be placed into the rule engine.
This first pass is to run through the basic data quality filters - not study design filters. Any observations that cannot be used for this analysis (e.g. enhanced de-identification redacting facility information, missing key case-mix identifier like age group) will be filtered through this stage. The engine will count the number of rows in each run’s dataset and log it, as it passes the data onto the next rule.
All rule dictionaries will have the same structure, so they can be
fed through the same update/logging engine downstream, this pass uses
the lu filters rule dictionary.
Dictionaries have the following structure:
exclude, but downstream and for sensitivity analysis, the
rule engine may be used for partitioning the dataset or other outcomes -
this will be used to determine that process.# Import LU
#--------------------------------------------------------------------------------------------------
# Set the stage here with the variable stage - this step is currently the data_quality pass, so
# it's going to pull in the rules flagged with "data_quality" - in this case, from tab name.
# Store it to data quality, again. If we want to import multiple LUs - add tab names to
# the list.
#--------------------------------------------------------------------------------------------------
#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "LU_AppliedFilters_2016-2019-AlexMossawir_3.7.26.xlsx"
stage <- c("data_quality")
#--------------------------------------------------------------------------------------------------
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
read_excel(dict,sheet=row)
})##
## Loading dictionary from:LU_AppliedFilters_2016-2019-AlexMossawir_3.7.26.xlsx
## Read sheet: data_quality
temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))
dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts)
##Print LU for Reference
dicts$lu_data_quality %>%
kable(
caption = "Rules Lookup for Filtering"
) | stage | rule_id | rule_name | rule_condition | rule_outcome | output_prefix | applies | order | logged_as |
|---|---|---|---|---|---|---|---|---|
| data_quality | 1 | exclude_abortions | flag_abortion == “Y” | exclude | a | puf_2016, puf_2017 | 2 | Enhanced De-ID |
| data_quality | 2 | exclude_deid | facility_name == “Redacted for Confidentiality” | exclude | a | puf_2018, puf_2019 | 3 | Enhanced De-ID |
| data_quality | 3 | missing_facility | is.na(v_facilityid) | exclude | a | all | 1 | Missing Facility ID |
| data_quality | 4 | missing_age | is.na(v_age) | exclude | a | all | 4 | Missing Age Group |
| data_quality | 5 | newborn | v_admissiontype == “Newborn” | exclude | a | all | 5 | Excluding Newborns |
| data_quality | 6 | medical_only | !(v_apr_type == “Medical”) | exclude | a | all | 6 | Include Only Medical Admissions |
The structure of the engine will first loop through over
run_id using lu_run, this is the exterior
loop. For each run, the engine will first determine which rules apply
applies.
Then it will loop over the rules in an interior loop. For each rule, it will log iniitial nrow, and save to audit log. Then it will either apply the rule filtering or skip over it. It will output final nrows to the audit log, and pass the revised data to the next step of the interior loop. It proceeds sequentially over the rows, so working data will be updated once the rule has been executed and logged.
The inner loop passes the final audit lists and the final w_data back to the outer loop. The outer loop saves the audit log and the data to columns in the lookup.
# Cache Info and Location
#--------------------------------------------------------------------------------------------------
# Workflow will check project file for cached data - this is write-only, just to keep records.
#--------------------------------------------------------------------------------------------------
## File cached data saved to
lu_run[[paste0("cache_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,".rds"))
print(lu_run[[paste0("cache_", stage)]])## [1] "data/data_quality/puf_2018.rds" "data/data_quality/puf_2019.rds"
## File Timestamp saved to
lu_run[[paste0("cache_ts_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,"_ts.txt"))
print(lu_run[[paste0("cache_ts_", stage)]])## [1] "data/data_quality/puf_2018_ts.txt" "data/data_quality/puf_2019_ts.txt"
These functions are hidden by default - press show to see the code!
flag_rule_application
This function is called at a run_id level within the
outer loop. Once the working lookup is selected, it
loops over the rule list to determine if the rule is applicable to this
particular run. i.e: a rule with the applies text
“puf_2016, puf_2017” will be split into a vector and evaluated - it will
only be applied when the run_id is in that vector.
#Function to Determine Rule Applicability - Within Outer Loop
#--------------------------------------------------------------------------------------------------
# Within outer loop, looking at the current run_id that is being passed,
# and determine which rules should be applied.
#--------------------------------------------------------------------------------------------------
flag_rule_application <- function(rule, run_id) {
applies_vec <- rule$applies_rule
if ("all" %in% applies_vec) { TRUE } else if (run_id %in% applies_vec) { TRUE } else { FALSE }
}apply_exclude
This function is called at a rule level within the
inner loop. Because the lookup stores the logic for
what columns should be excluded as a string, it needs to be unquoted to
be properly evaluated as logic. This applies that logic with a filter
function from dplyr in order to exclude anything that fits that
criteria.
#Function Within Inner Loop - Actually Exclude
#--------------------------------------------------------------------------------------------------
# Applies filter on the dataset using rule expression - called within the inner loop
#--------------------------------------------------------------------------------------------------
##function to if it excludes, does it
apply_exclude <- function(w_data, rule_expr) {
w_data %>% filter(!(!!rule_expr))
}apply_rules
This function is called on each specific rule list
within the inner loop. It takes a single
rule from the inner loop and applies it to the
w_data working data set. It also generates a list of values
to output to an audit table so all transformations (or skipped rules)
can be tracked. This audit row lists the initial number of rows, and the
final number of rows after transformation, and reports the number of
observations affected by each rule.
This is the core of the inner loop. It returns transformed
w_data and a row of the audit table.
#--------------------------------------------------------------------------------------------------
# Function to Apply Rules - Within Inner Loop
#--------------------------------------------------------------------------------------------------
# This takes the working dictionary and working data from the interior loop, and continues,
# passing run_id from the outer loop. It always references [[1]] because it will be passed
# rowwise. Outputs the updated dataset and a new list, audit_log.
#--------------------------------------------------------------------------------------------------
apply_rule <- function(rule, w_data, run_id) {
##Prelim Information
run_flag <- rule$flag_rule_applies
#Generate Audit Row
audit_row <- list(
stage = rule$stage,
run_id = run_id,
rule_id = rule$rule_id,
rule_name = rule$rule_name,
rule_condition= rule$rule_condition,
rule_outcome = rule$rule_outcome,
n_initial = nrow(w_data),
logged_as = rule$logged_as,
skipped = !run_flag,
applied = run_flag
)
## cat ( paste0("Run Id: ", string(run_id), " Rule Name ",rule$rule_name), " Run flag: ", run_flag, " Class W_Data: ", class(w_data))
## Apply dplyr rules based on what rules text is - this can be updated when I add others. But.
if (run_flag == FALSE) {
# do nothing
} else if (rule$rule_outcome == "exclude") {
rule_expr <- parse_expr(rule$rule_condition)
w_data <- apply_exclude(w_data, rule_expr)
}
audit_row$n_final <- nrow(w_data)
audit_row$n_change <- audit_row$n_initial - audit_row$n_final
audit_row$timestamp <- Sys.time()
return(list(w_data = w_data,
audit_row = audit_row))
} outer_pass
This function is the core of the outer loop. It is
passed a single row from lu_run, and uses that information
to call the relevant objects and manage the inner
loop.
While it only takes a row from lu_run as a direct
argument, it references the following global variables to preserve
clarity and naming:
#Function Within Outer Loop - Manage Inner Loop
#--------------------------------------------------------------------------------------------------
# This is called by the outer loop and is used to manage the inner loop. It passes both
# transformed data and the audit log back.
#--------------------------------------------------------------------------------------------------
outer_pass <- function(lu_row) {
##Get that run_id, data, dict to keep passing through
run_id <- lu_row$run_id
# the whole reason this loop wasn't working is because this a_data is apparently. for some reason. a list length 1
## with a tibble at that first place.
w_data <- lu_row[[prev_col]][[1]]
w_dict <- dicts[[paste0("lu_",stage)]]
## Convert rules from a dataframe into list of lists
rules <- apply(w_dict, 1, as.list)
for (j in seq_along(rules)) {
rule <- rules[[j]]
##Applies is currently a vector hand-written in excel
##First remove all whitespace, then split out what remains
## It's gonna put the list into a list of length one with a vector in it.
#you literally just want the vector.
rule$applies_rule <- strsplit(gsub("\\s", "",rule$applies), ",")[[1]]
##We're looping through all the rules rn
rule$flag_rule_applies <- flag_rule_application(rule, run_id)
## and saving the new items back into rules
rules[[j]] <- rule
}
## Create an empty list object so that we can store the audit as we go.
audit <- list()
##i did this just counting first but seq_along actually passes info along each
for (j in seq_along(rules)) {
##pull out list j from dictionary
rule <- rules[[j]]
##Call function apply_rule
## Apply rule input:
## rule = one single row (but as list) of dictionary
## w_data = working dataset
## run_id = the same for everything in this inner loop
## Apply rule output:
## a list of: w_data, audit_row
result <- apply_rule(rule, w_data, run_id)
## replace w_data with the new w_data saved in result returned list
w_data <- result$w_data
##store
audit[[j]] <- result$audit_row
}
list(
data = w_data,
audit = audit
)
}The functions above will refer to global stage,
prev_col, stage_prefix variables. This is to
allow for flexibility of the pipeline. prev_col indicates
where the data for this stage should be sourced from, and
stage_prefix indicates what the column name of the
destination should be. I will revise these away from using the global
variable approach when I have more time.
#--------------------------------------------------------------------------------------------------
## Stage Variable
stage <- "data_quality"
stage_prefix <- "a2_"
## Column Data Source
prev_col <- "a_data"
#--------------------------------------------------------------------------------------------------
newcolname <- paste0(stage_prefix, "data")
#--------------------------------------------------------------------------------------------------
results <- vector("list", nrow(lu_run))
#--------------------------------------------------------------------------------------------------
for (i in seq_len(nrow(lu_run))) {
##Actually run the nested loops here, saving to a new vector
results[[i]] <- outer_pass(lu_run[i, ])
## Save the outputted day in the specified new column
lu_run[[newcolname]][[i]] <- results[[i]]$data
##save audit log to a table
lu_run$audit[[i]] <- bind_rows(results[[i]]$audit)
## Cache this
saveRDS(results[[i]]$data,lu_run$cache_data_quality[[i]])
##Make a timestamp
ts <- strftime(Sys.time())
##save the timestamp
cat (ts, file = lu_run$cache_ts_data_quality[[i]])
}
rm(results)The workflow has filtered the data down to the analytical sample and outputted the results of each step of the process into a complete audit table.
The complete table with n at each stage is below, and a summary table of initial observations, exclusion volume, and final analytical sample size, by year, can be found below.
# Cache Info and Location
#--------------------------------------------------------------------------------------------------
audit_table <- lu_run$audit %>%
bind_rows() %>%
arrange(run_id, rule_id) %>%
mutate(
logged_as = factor(logged_as, levels = unique(logged_as))
) %>%
group_by(run_id, logged_as) %>%
summarise(
n_initial = max(n_initial),
Excluded = sum(n_change),
n_final = min(n_final),
)
audit_table <- audit_table %>%
mutate(
year = factor(run_id,
levels = c("puf_2016","puf_2017","puf_2018","puf_2019"),
labels = c("2016",
"2017",
"2018",
"2019")
))
audit_table %>% kable()| run_id | logged_as | n_initial | Excluded | n_final | year |
|---|---|---|---|---|---|
| puf_2018 | Enhanced De-ID | 2352807 | 8804 | 2344003 | 2018 |
| puf_2018 | Missing Facility ID | 2344003 | 0 | 2344003 | 2018 |
| puf_2018 | Missing Age Group | 2344003 | 0 | 2344003 | 2018 |
| puf_2018 | Excluding Newborns | 2344003 | 212809 | 2131194 | 2018 |
| puf_2018 | Include Only Medical Admissions | 2131194 | 562563 | 1568631 | 2018 |
| puf_2019 | Enhanced De-ID | 2339462 | 9071 | 2330391 | 2019 |
| puf_2019 | Missing Facility ID | 2330391 | 0 | 2330391 | 2019 |
| puf_2019 | Missing Age Group | 2330391 | 0 | 2330391 | 2019 |
| puf_2019 | Excluding Newborns | 2330391 | 210594 | 2119797 | 2019 |
| puf_2019 | Include Only Medical Admissions | 2119797 | 563070 | 1556727 | 2019 |
| 2018 | 2019 | |
|---|---|---|
| Initial Sample Size | 2,352,807 | 2,339,462 |
| Enhanced De-ID | 8,804 | 9,071 |
| Excluding Newborns | 212,809 | 210,594 |
| Include Only Medical Admissions | 562,563 | 563,070 |
| Final Analytical Sample | 1,568,631 | 1,556,727 |
Name the variable and describe its scale (continuous, binary, etc.). Report the distribution: mean and SD for continuous outcomes, or frequency and % for binary outcomes. Note any recoding from the raw variable (e.g., converting BMD from NHANES units, collapsing categories).
The primary outcome variable for this analysis is a binary flag indicating whether a hospital discharge is assigned to a selected CCSR Diagnosis Group (1- dischage falls into the category).
Per the data dictionaries, SPARCS data from 2016-2017 reflects the Clinical Classification Software (CCS) codes and descriptions. 2018-2019 reflect Clinical Classification Software - Refined (CCSR) codes and descriptions. The year version of each category has not been specified in the data dictionaries provided.
The CCS and CCSR categories have largely been found to be non-translateable, as CCS beta was built off of ICD-9, while CCSR is using ICD-10. For this reason, the analysis will be restricted to 2018-2019 data, and the additional data points from 2016-2017 may be incorporated for sensitivity analysis.
Note: There are 35 codes with a 1-1 match between CCS and CCSR, per a comparison published by ARQ, so these specific codes can be recoded and highlighted in the data as candidates for a sensitivity analysis including 2016-2017 data.
https://hcup-us.ahrq.gov/toolssoftware/ccsr/DXCCSR-vs-Beta-CCS-Comparison.pdf
count_ccsr <- bind_rows(lu_run$a2_data) %>%
select(
c("v_dx_ccsr","dxccsr_description")
) %>%
unique() %>% nrow()There are 472 total diagnosis codes present in the datasets, across 22 body system categories sourced from the ICD-10. This is not a complete match to the MDC body system grouping, which reflects 25 total categories relevant to hospital resource use, though the two have overlap and both are present in this dataset.
top_10 <- bind_rows(lu_run$a2_data) %>%
select(
c("v_dx_ccsr","dxccsr_description")
)%>%
group_by(v_dx_ccsr,dxccsr_description) %>%
summarize(
count_obs = n()
) %>% arrange(desc(count_obs))
head(top_10, n=10) %>% kable(
col.names = c("CCSR Code", "CCSR Description", "n"),
caption= "Highest Volume CCSR Codes",
format.args = list(big.mark = ","))| CCSR Code | CCSR Description | n |
|---|---|---|
| INF002 | SEPTICEMIA | 235,185 |
| CIR019 | HEART FAILURE | 126,452 |
| MBD017 | ALCOHOL-RELATED DISORDERS | 99,747 |
| MBD001 | SCHIZOPHRENIA SPECTRUM AND OTHER PSYCHOTIC DISORDERS | 83,942 |
| RSP002 | PNEUMONIA (EXCEPT THAT CAUSED BY TUBERCULOSIS) | 82,004 |
| RSP008 | CHRONIC OBSTRUCTIVE PULMONARY DISEASE AND BRONCHIECTASIS | 71,042 |
| END003 | DIABETES MELLITUS WITH COMPLICATION | 63,873 |
| GEN004 | URINARY TRACT INFECTIONS | 61,030 |
| SKN001 | SKIN AND SUBCUTANEOUS TISSUE INFECTIONS | 60,424 |
| PRG023 | COMPLICATIONS SPECIFIED DURING CHILDBIRTH | 58,836 |
This initial analysis will report the initial data and exploratory data analysis will be conducted on INF002 and CIR019 selected because of the high volume of observations.
The final analysis will be conducted with a list of 3-5 categories, shortlisted for total volume and distribution across hospitals. The list will be refined with a literature review.
# Import Dictionary
#--------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "Dictionary_Recoding_AlexMossawir_3-15-26.xlsx"
stage <- c("recode")
#--------------------------------------------------------------------------------------------------
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
read_excel(dict,sheet=row)
})##
## Loading dictionary from:Dictionary_Recoding_AlexMossawir_3-15-26.xlsx
## Read sheet: recode
This function takes arguments for the lookup to source data from, the column name it is writing to, and the column name it is sourcing from. It outputs just lu_run.
expand_variables_row <- function(row, prev_col) {
# Extract the nested data for this row
w_data <- row[[prev_col]][[1]]
##stupid long way to only get the lookup lines for this stage
## this is uh. why I decided not to stack dictionaries. and then I did anyway.
w_dict <- dicts[["lu_recode"]][dicts[["lu_recode"]]$stage == "expand", ]
# Loop through columns to expand
for (i in seq_len(nrow(w_dict))) {
colsearch <- w_dict[[i, "col"]]
strsearch <- w_dict[[i, "values"]]
col_output <- paste0(w_dict[i, "new_col"], "_", w_dict[i, "values"])
w_data <- w_data %>%
mutate(!!col_output := as.integer(!!sym(colsearch) == strsearch))
}
# Return as list to store in list-column
list(w_data)
}
#-------------------------------------------------------------------------------lu_run <- lu_run %>%
rowwise() %>%
mutate(
a2_data = expand_variables_row(cur_data(), prev_col = "a2_data")
) %>%
ungroup()temp <- bind_rows(lu_run$a2_data)
outcomes <- grep("^dx_", colnames(temp), value = TRUE)
year <- temp %>%
select(c("v_year", outcomes)) %>%
group_by(v_year) %>%
summarize(
Total = n(),
across(all_of(outcomes), ~ sum(as.numeric(.x), na.rm = TRUE)),
round((across(all_of(outcomes), ~ sum(as.numeric(.x), na.rm = TRUE) / n()*100, .names = "{.col}_pct")),2)
)
year%>%
kable(
col.names = c("Year", "Total Observations", "CIR019 n", "INF002 n", "CIR019 %", "INF002 %"),
caption= "Highest Volume CCSR Codes - Frequency by Year",
format.args = list(big.mark = ","))| Year | Total Observations | CIR019 n | INF002 n | CIR019 % | INF002 % |
|---|---|---|---|---|---|
| 2018 | 1,568,631 | 62,675 | 114,408 | 4.0 | 7.29 |
| 2019 | 1,556,727 | 63,777 | 120,777 | 4.1 | 7.76 |
Primary exposure will be the hospital-level identifier,
permanent_facility_id in the dataset, representing
facility-level coding and documentation patterns.
Facility IDs were previously set to be the character data type in the data standardization pass. Because facility ID is categorical, but there are 208 unique facilities, I will not recode this variable.
Key covariates will be selected to proxy patient case-mix:
While Discharge Year and Patient Age Group are ordinal categorical, I am beginning my analysis by converting all to factors and handling as if they are nominative categorical. As I continue to run my analysis, I will revisit these.
Based on a review of the data, of the key covariates, the only one that needs recoding is Primary Payor Type.
I believe that Payment Typology is going to be a primary driver of variation, and further analysis will be best suited by a full bit-mapping of of the combination of the three primary payment typologies. A patient who is under-insured cannot be distinguished from a patient who is fully insured without complete integration of all three variables. At this point, however, the Primary Payment Typology will be classed into Private, Public, Self-Pay, and Other.
All other variables will be converted to labeled factors without recoding, to ensure they are handled correctly for the logistic regression model.
The codes for the All Patients Refined Major Diagnostic Categories (APR MDC) have remained standard throughout the 2016-2019 period, but the text of the APR-MDC description column has shown variability. To make this column legible, all APR MDC codes have been mapped to the following document: https://hcup-us.ahrq.gov/db/nation/nis/APR-DRGsV20MethodologyOverviewandBibliography.pdf
recode_variable
This function takes lu_run and a current column name,
looks it up in the recoding LU and converts it to the values, levels,
and labels that are specifies.
rm(temp)
recode_variable <- function(lu_run, col_input) {
### Always pull working data + dict
w_dict <- dicts[["lu_recode"]]
w_data <- lu_run[[prev_col]][[1]]
##Filter to the column in question
w_dict <- w_dict %>% filter(col==col_input)
## Needed for recoding
col_output <- w_dict$new_col[[1]]
w_values <- w_dict$values
w_new_values <- w_dict$new_values
recode_vec <- w_new_values %>% setNames(w_values)
##Recode
w_data[[col_output]] <- recode(
w_data[[col_input]],
!!!recode_vec)
##Needed for converting to factors
level_vec <- unique(w_dict$level)
label_vec <- unique(w_dict$label)
##Factors
w_data <- w_data %>%
mutate(
!!col_output := factor(!!sym(col_output),
levels = level_vec,
labels = label_vec
))
lu_run[[newdataname]][[1]] <- w_data
return(lu_run)
}Set the global recoding variables (data source, data destination), source the list of columns that will be recoded from the dictionary ingested above.
stage_prefix <- "a2_"
prev_col <- "a2_data"
newdataname <- paste0(stage_prefix, "data")
stage <- "recode"
columns <- unique(dicts[["lu_recode"]]$col[dicts[["lu_recode"]]$stage == "recode"])
## this is so so so inefficient but I don't want to figure out map and across
for (row in seq_len(nrow(lu_run))) {
for (col in columns) {
lu_run[row, ] <- recode_variable(lu_run[row, ], col_input = col)
}
}Because all variables of interest now have the prefix “v” to indicate a covariate of interest or “dx” to indicate that it is the outcome of interest (diagnosis), columns of interest can be selected based on what they start with.
This function creates a summary table of missing values - searches for columns from a list of prefixes. The two arguments it takes are the data frame (this can be passed with map, which I just learned. finally.), and the prefix list.
missingness <- function(w_data, prefixes){
w_data %>%
summarize(
across(.cols = starts_with(prefixes),
.fns = ~sum(is.na(.x)), .names = "{.col}")) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "nmissing") %>%
mutate(
run = unique(w_data$v_year),
n = nrow(w_data),
pct_missing = nmissing / n * 100
)
}I’ve learned how map works. :)
results <- lu_run$a2_data %>%
map(~missingness(w_data= .x, prefixes=c("vp_", "dx_")))
results <- bind_rows(results) %>% arrange(desc(pct_missing))
results %>% kable(
format.args = list(big.mark = ",")
)| variable | nmissing | run | n | pct_missing |
|---|---|---|---|---|
| dx_CIR019 | 46 | 2018 | 1,568,631 | 0.0029325 |
| dx_INF002 | 46 | 2018 | 1,568,631 | 0.0029325 |
| dx_CIR019 | 38 | 2019 | 1,556,727 | 0.0024410 |
| dx_INF002 | 38 | 2019 | 1,556,727 | 0.0024410 |
| vp_age | 0 | 2018 | 1,568,631 | 0.0000000 |
| vp_pay_1 | 0 | 2018 | 1,568,631 | 0.0000000 |
| vp_year | 0 | 2018 | 1,568,631 | 0.0000000 |
| vp_apr_soi | 0 | 2018 | 1,568,631 | 0.0000000 |
| vp_gender | 0 | 2018 | 1,568,631 | 0.0000000 |
| vp_age | 0 | 2019 | 1,556,727 | 0.0000000 |
| vp_pay_1 | 0 | 2019 | 1,556,727 | 0.0000000 |
| vp_year | 0 | 2019 | 1,556,727 | 0.0000000 |
| vp_apr_soi | 0 | 2019 | 1,556,727 | 0.0000000 |
| vp_gender | 0 | 2019 | 1,556,727 | 0.0000000 |
Virtually no data is missing from the primary covariates, and outcome - 0.002%, or 46 and 36 observations respectively over >1.5M respectively.
Data was missing from the primary exposure (v_facilityid), because it was redacted for confidentiality. This data has been excluded from the analysis in an earlier stage. These missing values are systematically encoded, as they are all related to specific sensitive procedures and diagnoses. I think it is unlikely that these are uniform across the facilities, but with facility_id and any higher level facility level information redacted, it is difficult to investigate further.
Using gtsummary to create a descriptive statistics table for my analytical sample. Please note, the primary exposure is Permanent Facility ID in the dataset (v_facilityid). This is a categorical variable, but it has a very high cardinality - as a result, I am sharing the total number of Unique Facility IDs in the dataset in its own row in the column, but not subsetting any of the other covariates on this variable.
table1 <- lu_run$a2_data %>% bind_rows()
includelist <- table1 %>%
select(starts_with(c("vp_","dx_"))) %>%
colnames()
facilityrow <- tibble(
variable = "facility_id",
label = "Facility ID",
stat_0 = paste0(length(unique(table1$v_facilityid)), " unique facilities"))
# summarize the data with our package
tbl1 <- table1 %>%
tbl_summary(
include = includelist,
label= list(vp_age ~ "Age group, years",
vp_pay_1 ~ "Primary payment type",
vp_year ~ "Discharge year",
vp_apr_soi ~ "APR-DRG Severity of Illness",
vp_gender ~ "Database-encoded gender",
dx_CIR019 ~ "Heart Failure (CIR019)",
dx_INF002 ~ "Septicemia (INF002)"
))
tbl1 <- tbl1 %>%
modify_table_body(
~ bind_rows(facilityrow, .x)
) %>%
modify_caption(caption= "Table 1. Characteristics of 2018-2019 New York State Hospital Discharge Data") %>%
modify_footnote_header("Observations subject to enhanced de-identification have been excluded from this analysis, as they lack hospital-level identifiers. No dimensions for key covariates listed are missing from observations that are not subject to enhanced de-identification.", columns=all_stat_cols(), replace=FALSE)
tbl1| Characteristic | N = 3,125,3581,2 |
|---|---|
| Facility ID | 208 unique facilities |
| Age group, years | |
| 0-17 | 193,231 (6.2%) |
| 18-29 | 336,659 (11%) |
| 30-49 | 623,119 (20%) |
| 50-69 | 915,443 (29%) |
| 70+ | 1,056,906 (34%) |
| Primary payment type | |
| Private | 630,537 (20%) |
| Public | 2,340,672 (75%) |
| Self-Pay | 60,421 (1.9%) |
| Other | 93,728 (3.0%) |
| Discharge year | |
| 2018 | 1,568,631 (50%) |
| 2019 | 1,556,727 (50%) |
| APR-DRG Severity of Illness | |
| Minor | 680,606 (22%) |
| Moderate | 1,233,410 (39%) |
| Major | 898,765 (29%) |
| Extreme | 312,577 (10%) |
| Database-encoded gender | |
| Female | 1,699,696 (54%) |
| Male | 1,425,634 (46%) |
| Unknown | 28 (<0.1%) |
| Heart Failure (CIR019) | 126,452 (4.0%) |
| Unknown | 84 |
| Septicemia (INF002) | 235,185 (7.5%) |
| Unknown | 84 |
| 1 n (%) | |
| 2 Observations subject to enhanced de-identification have been excluded from this analysis, as they lack hospital-level identifiers. No dimensions for key covariates listed are missing from observations that are not subject to enhanced de-identification. | |
The outcome variable is binary, so there is not much in the way of distribution to identify, so I am producing a bar chart of the two outcomes of interest for Figure 1.
figure1 <- table1 %>%
pivot_longer(
cols = outcomes,
names_to = "outcomename",
values_to = "outcomevalue") %>%
group_by(outcomename, outcomevalue) %>%
summarize(n= n()) %>% mutate(outcomevalue = factor(outcomevalue,
levels = c(0,1),
labels= c("No","Yes")),
outcomename = case_match(outcomename,
"dx_CIR019" ~ "Heart Failure (CIR019)",
"dx_INF002" ~ "Septicemia (INF002)")
) %>% filter( !(is.na(outcomevalue)))
ggplot(figure1, aes(x = outcomevalue, y = n)) + geom_col(fill = "pink") + facet_wrap(~ outcomename) + scale_y_continuous(labels = label_comma()) + theme_classic() +
labs(title = "Figure 1. Clinical Classification Systems-Refined (CCSR) Diagnosis Distribution \nfor Heart Failure and Septicemia across NYS Hospital Discharges 2018-2019",
y = "Discharge Count", x = "Diagnosis")
The CCSR categories for this phase of exploratory data analysis are the
two highest volume categories in the dataset, of over 472 distinct
diagnosis categories present in the two years. Despite being the most
common, their positive rates remain low: Heart failure at 4% in 2018,
and 4.1% in 2019, and Septicemia at 7.29% in 2018, and 7.76% in 2019.
This heavy skew towards 0 in this binary category is a result of the
variety of diagnosis categories, and is primarily indicative of the
scale of variety and the sparsity of outcomes that I can expect in this
dataset.
Because all covariates are categorical, the outcome is binary, and the primary exposure is categorical with high cardinality (208 different facilities), the standard methods of identifying an association do not necessarily apply here.
figure2 <- table1%>%
pivot_longer(
cols = outcomes,
names_to = "outcomename",
values_to = "outcomevalue") %>%
group_by(v_facilityid, outcomename) %>%
summarize(n= n(),
diagnoses = sum(outcomevalue, na.rm=TRUE)) %>%
mutate(
prev = diagnoses/n ,
outcomename = case_match(outcomename,
"dx_CIR019" ~ "Heart Failure (CIR019)",
"dx_INF002" ~ "Septicemia (INF002)")
)
ggplot(figure2, aes(x=outcomename, y= prev)) +
geom_violin(alpha = 0.5, fill="purple", color = NA)+
#geom_jitter(width = 0.2, seed = 1, alpha = 0.5, color = "palevioletred") +
theme_classic() +
geom_boxplot(fill=NA, width=0.1) +
scale_y_continuous(labels = label_percent()) +
labs(title = "Figure 2. Rate of CCSR Diagnoses for Heart Failure (CIR019) and Septicemia (INF002)\namong all Hospital Discharges across all 208 NYS facilities 2018-2019",
y = "Rate of Diagnosis\n(Facility Diagnoses/Total Facilty Discharges)", x = "CCSR Category")
Assessing facility-level rates of both heart Failure (CIR019) and
Septicemia (INF002) (calculated as the total volume of discharges with a
specific diagnosis category over the total number of discharges), we see
that there is facility-level variation between categories. Because
prevalence distribution for both outcomes differs across facilities,
this indicates an association between facility and the diagnosis rate.
The distribution of septicemia diagnosis is particularly wide, ranging
from a diagnosis rate of 0% to near 20%. Both Heart Failure and
Septicimia appear to have 2 potential outlier facilities, which may be
worth further investigation.
Both selected outcomes Heart Failure and Septicemia are linked to age (as well as many other case-mix factors). For this next EDA, I want to confirm that there is still substantial facility-level variation in diagnosis rate when the data is stratified by age.
figure3 <- table1%>%
pivot_longer(
cols = outcomes,
names_to = "outcomename",
values_to = "outcomevalue") %>%
group_by(v_facilityid, vp_age, outcomename) %>%
summarize(n= n(),
diagnoses = sum(outcomevalue, na.rm=TRUE)) %>%
mutate(
prev = diagnoses/n ,
outcomename = case_match(outcomename,
"dx_CIR019" ~ "Heart Failure (CIR019)",
"dx_INF002" ~ "Septicemia (INF002)")
) %>% filter(outcomename == "Septicemia (INF002)",
n >= 30)
ggplot(figure3, aes(x=vp_age, y= prev, fill=vp_age)) +
geom_violin(alpha = 0.5, color = NA)+
#geom_jitter(width = 0.2, seed = 1, alpha = 0.5, color = "palevioletred") +
theme_classic() +
geom_boxplot(fill=NA, width=0.1) +
scale_y_continuous(labels = label_percent()) +
labs(title = "Figure 3. Rate of CCSR Diagnoses for Septicemia (INF002) by Age Group among\nNYS Facilities with >=30 Discharges per Age Group, 2018-2019",
y = "Rate of Diagnosis\n(Facility Diagnoses/Total Facilty Discharges)" , x = "Facility-Level Septicemia Diagnosis Rate"
) + theme(legend.position = "none")
Cells with fewer than 30 total discharges were excluded from this
analysis to stabilize diagnosis rate estimates, and heart failure was
excluded because of age related sparsity. This initial assessment
indicates that even when discharges have been stratified by age group,
facilities have very strong variation in diagnosis rate of septicemia.
Because this facility level variation is still present with a major
case-mix variable controlled for, there is reason to look into other
variables as sources for diagnosis variation between facilities.