GI_Data_Review

Data Curation Process

The raw OR case dataset was imported from Excel and standardized prior to analysis. Key steps included:

Column cleaning and removal of identifiers
- Column names were standardized (lowercase, snake_case).
- Patient identifiers, provider names, and procedure descriptions were excluded to ensure de-identification.
Parsing dates and times
- Dates were converted to Date format.
- All time fields (scheduled start, pre-op arrival, MD available, room ready, anesthesia sign-off, pre-op complete, in/out room times) were parsed consistently into hms objects.
- Date–time stamps (e.g., consent completion, H&P completion) were converted to POSIXct for accurate interval calculations.
Filtering
- Only first cases were retained (first_case == “Yes”).
- Cases were restricted to those scheduled for 07:30 (Mon–Wed, Fri) or 08:30 (Thu) to ensure a uniform start reference.
Compliance flags
- Binary indicators were generated for workflow readiness milestones (e.g., pre-op by 06:00, MD by 07:10, room/anesthesia readiness, pre-op complete). These provide simple compliance benchmarks.
Lag variables
- Interval (lag) variables were created to capture elapsed minutes between milestones (e.g., scheduled start → pre-op arrival, pre-op → anesthesia sign-off, in-room → out-room).
- All lags were rounded up to the next whole minute for conservative estimates.
Outcome variable
- The outcome is_on_time was standardized to a binary format (1 = On-time, 0 = Late) to allow use in regression and machine learning models.
Handling missing values (for modeling)
- Numeric lags: imputed using the median.
- Categorical variables (day of week, patient class): missing values replaced with “Unknown”.
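
The filtering, lag, flag, and imputation steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the report (which appears to have been run in R). The rows are made up, and every field name except `first_case` and `mins_sched_to_preop` is a hypothetical stand-in.

```python
from datetime import datetime
from math import ceil
from statistics import median

# Hypothetical rows for illustration only; not the real dataset.
cases = [
    {"date": "2025-03-03", "first_case": "Yes", "scheduled_start": "07:30",
     "preop_arrival": "05:48", "patient_class": "OP"},
    {"date": "2025-03-06", "first_case": "Yes", "scheduled_start": "08:30",
     "preop_arrival": "07:02:20", "patient_class": None},
    {"date": "2025-03-04", "first_case": "Yes", "scheduled_start": "07:30",
     "preop_arrival": None, "patient_class": "IP"},
    {"date": "2025-03-05", "first_case": "No", "scheduled_start": "10:00",
     "preop_arrival": "08:15", "patient_class": "OP"},
]

def to_seconds(text):
    """Parse 'HH:MM' or 'HH:MM:SS' into seconds since midnight; None stays None."""
    if text is None:
        return None
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:
        parts.append(0)
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds

def expected_start(date_text):
    """Uniform start reference: 08:30 on Thursday, 07:30 on Mon-Wed and Fri."""
    weekday = datetime.strptime(date_text, "%Y-%m-%d").weekday()  # Monday = 0
    if weekday == 3:
        return "08:30"
    if weekday in (0, 1, 2, 4):
        return "07:30"
    return None

kept = []
for row in cases:
    # Keep first cases scheduled at the reference start time for that weekday.
    if row["first_case"] != "Yes":
        continue
    if row["scheduled_start"] != expected_start(row["date"]):
        continue
    sched = to_seconds(row["scheduled_start"])
    arrival = to_seconds(row["preop_arrival"])
    # Lag in minutes, rounded up; negative means arrival before scheduled start.
    lag = None if arrival is None else ceil((arrival - sched) / 60)
    # Compliance flag: 1 if pre-op arrival was at or before 06:00.
    flag = None if arrival is None else int(arrival <= to_seconds("06:00"))
    kept.append({
        "mins_sched_to_preop": lag,
        "preop_by_0600": flag,
        "patient_class": row["patient_class"] or "Unknown",
    })

# Median imputation for missing numeric lags.
observed = [r["mins_sched_to_preop"] for r in kept
            if r["mins_sched_to_preop"] is not None]
fill = median(observed)
for r in kept:
    if r["mins_sched_to_preop"] is None:
        r["mins_sched_to_preop"] = fill

for r in kept:
    print(r)
```

The sign convention here (arrival minus scheduled start, so early arrivals are negative) is an assumption, chosen because the ROC threshold reported later in this review is negative.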

Statistical Test

Boxplots show each lag (in minutes) from pre-op arrival to a second process or task. We also calculated the interval from scheduled start to pre-op arrival, which tells us how early a patient is arriving.

Compliance Summary

Compliance rates are based on the time by which each task should be done. This shows how we performed, but it gives only a rough picture.

Outcome Distribution

Table: Distribution of First Case On-time Status
Status	Count	Percent
Late	182	25%
On-time	532	75%

The table shows that 75% of first cases (532 of 714) started on time. Great job, and kudos to the team. Let's get better.
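
For reference, the percentages follow directly from the counts in the table. A short sketch of the arithmetic, using only the two counts reported above:

```python
# Counts taken from the outcome distribution table.
counts = {"Late": 182, "On-time": 532}
total = sum(counts.values())

# Percent of all first cases in each status.
percents = {status: 100 * n / total for status, n in counts.items()}

for status, pct in percents.items():
    print(f"{status}: {counts[status]} of {total} = {pct:.1f}%")
```

Rounded to whole numbers, these give the 25% and 75% shown in the table.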

Modeling

Top left is a t-test example that was used to find which factors (variables) differed significantly between the two groups (on-time vs late starts). The t-test compares group means, so it only works on numeric variables and not on categories. Top right is a similar test (rank-sum) that compares ranks instead of means, so it also handles skewed or ordinal variables. Bottom left and right show a Random Forest model fit on all the variables we loaded. The model ranks the variables (alone or in combination) by how much weight they carry in predicting the outcome (late vs on time).
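
As a rough illustration of how the two tests differ, the sketch below computes a Welch t statistic (based on means) and a rank-sum statistic in its Mann-Whitney U form (based on pairwise ordering) for one numeric lag. The values are made up, the function names are hypothetical, and p-values are left out; in practice they would come from a statistics package.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical lags in minutes (scheduled start to pre-op arrival); not real data.
on_time = [-110, -95, -102, -88, -120]
late = [-60, -75, -45, -90, -30]

def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_sum_u(a, b):
    """Mann-Whitney U for a: number of pairs where a's value exceeds b's (ties count half)."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

t_stat = welch_t(on_time, late)
u_late = rank_sum_u(late, on_time)
# Share of (late, on-time) pairs in which the late case arrived later.
share = u_late / (len(late) * len(on_time))

print(f"Welch t = {t_stat:.2f}")
print(f"U = {u_late:.0f} of {len(late) * len(on_time)} pairs ({share:.0%})")
```

The t statistic is sensitive to extreme values because it uses means and variances, while U depends only on the ordering of the values.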

Predicted Margin

This plot shows how early we need to tell patients to arrive in order to fall into the on-time-start class. In short, patients who arrive 30 minutes early are predicted not to start on time. Various possible factors (age, IV access, etc.) could be at play, but a simple intervention could be to test this on a select group of cases for X amount of time.

ROC and Clustering

## 
## Call:
## roc.default(response = df_rf$is_on_time, predictor = df_rf$mins_sched_to_preop)
## 
## Data: df_rf$mins_sched_to_preop in 182 controls (df_rf$is_on_time 0) > 532 cases (df_rf$is_on_time 1).
## Area under the curve: 0.6008

##   threshold sensitivity specificity
## 1     -83.5   0.8909774   0.2692308
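
To help read the output above, the sketch below does two things. First, it converts the reported sensitivity and specificity back into case counts. Second, it shows on made-up values how an AUC and a cutoff can be computed for a "controls > cases" direction. Picking the cutoff by Youden's J (sensitivity + specificity - 1) is an assumption; the output does not state how the threshold of -83.5 was selected. Function names and toy values are hypothetical.

```python
# Part 1: values reported in the output above.
n_late, n_on_time = 182, 532                  # controls (0) and cases (1)
sensitivity, specificity = 0.8909774, 0.2692308

on_time_correct = round(sensitivity * n_on_time)   # on-time cases below the threshold
late_correct = round(specificity * n_late)         # late cases above the threshold
youden_j = sensitivity + specificity - 1

print(f"On-time correctly classified: {on_time_correct} of {n_on_time}")
print(f"Late correctly classified: {late_correct} of {n_late}")
print(f"Youden's J at the reported threshold: {youden_j:.3f}")

# Part 2: toy example (made-up lags in minutes, not real data).
late = [-60, -75, -45, -100, -30]        # controls
on_time = [-110, -95, -102, -88, -120]   # cases

def auc(controls, cases):
    """Share of (control, case) pairs where the control value is higher; ties count half."""
    wins = sum((c > k) + 0.5 * (c == k) for c in controls for k in cases)
    return wins / (len(controls) * len(cases))

def sens_spec(controls, cases, threshold):
    """Treat values below the threshold as predicted on-time."""
    sens = sum(k < threshold for k in cases) / len(cases)
    spec = sum(c > threshold for c in controls) / len(controls)
    return sens, spec

# Candidate thresholds: midpoints between neighbouring observed values.
values = sorted(set(late + on_time))
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = max(candidates, key=lambda t: sum(sens_spec(late, on_time, t)) - 1)

toy_auc = auc(late, on_time)
best_sens, best_spec = sens_spec(late, on_time, best)
print(f"Toy AUC = {toy_auc:.2f}, best threshold = {best}")
```

Read this way, the reported threshold catches most on-time cases but separates out few late ones, which is consistent with an AUC only modestly above 0.5.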

Random Forest Modeling Predicted Probability based on Class (IP,OP,ED)

This shows the probability of starting on time as a function of the day of the week.

GI_Data_Review_R2

Eli Ebrahimdoost, RN

2025-09-10