622 - Final Assignment

Author

LFMG

Predicting Hospital Readmissions: A Machine Learning Approach to Reducing CMS Penalties

Summary

Problem Description: Hospital readmissions represent a significant failure in healthcare delivery and a massive financial liability. Under the CMS Hospital Readmissions Reduction Program (HRRP), hospitals face severe financial penalties for excessive readmission rates. The objective of this analysis is to identify high-risk diabetic patients before they are discharged, enabling targeted interventions to prevent costly returns.

Dataset and Preparation: We utilized the Diabetes 130-US Hospitals dataset (1999-2008) from the UCI Machine Learning Repository, containing over 100,000 patient encounters. Data preparation involved the removal of high-missingness features (such as “weight”, “payer_code”), the exclusion of deceased patients to prevent model bias, and the grouping of high-cardinality ICD-9 diagnosis codes into nine clinical categories. We also engineered a binary target variable to classify patients as “Readmitted” vs. “Not Readmitted.”

Methodologies: To capture both linear and non-linear risk factors, we applied a dual-methodology approach:

  • Supervised Learning: We trained a Logistic Regression baseline to understand linear odds ratios, followed by a Random Forest classifier to capture complex interactions and improve sensitivity.

  • Unsupervised Learning: We utilized K-Means Clustering to segment patients into “Risk Personas” based solely on utilization patterns, independent of their diagnosis.

Purpose and Conclusion: The analysis aims to transition readmission management from reactive to proactive. Our results demonstrate that Random Forest outperforms Logistic Regression in identifying high-risk patients (Sensitivity: 53.6% vs 49.3%). Furthermore, unsupervised clustering revealed a distinct “Crisis Group” (Cluster 1) comprising only 7% of the population but carrying a 76% readmission rate. The business impact is clear: by targeting intervention resources specifically toward patients with high previous inpatient history (identified by Random Forest) and “Cluster 1” profiles, the hospital can maximize ROI while reducing penalty exposure.

Libraries Used

library(tidyverse)
library(tidyr)
library(tibble)
library(readr)
library(janitor)
library(skimr)
library(naniar)
library(DataExplorer)
library(GGally)
library(corrplot)
library(vip)
library(lubridate)
library(knitr)
library(kableExtra)
library(caret)
library(broom)
library(randomForest)
library(cluster)
library(factoextra)
library(class)
library(nnet)

Loading the Data

The dataset has been uploaded to GitHub for easy reproducibility.

url <- "https://raw.githubusercontent.com/Lfirenzeg/msds622/refs/heads/main/Final_Project/diabetic_data.csv"

The initial glimpse showed a lot of “?” values. In this dataset, ? is the standard marker for missing data, so we account for it (along with “Unknown/Invalid” and empty strings) when reading the file.

df <- read_csv(url, na = c("?", "Unknown/Invalid", "")) %>% 
  clean_names()
Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...

# "?", "Unknown/Invalid", and empty strings were read in as NA above
glimpse(df)
Rows: 101,766
Columns: 50
$ encounter_id             <dbl> 2278392, 149190, 64410, 500364, 16680, 35754,…
$ patient_nbr              <dbl> 8222157, 55629189, 86047875, 82442376, 425192…
$ race                     <chr> "Caucasian", "Caucasian", "AfricanAmerican", …
$ gender                   <chr> "Female", "Female", "Female", "Male", "Male",…
$ age                      <chr> "[0-10)", "[10-20)", "[20-30)", "[30-40)", "[…
$ weight                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ admission_type_id        <dbl> 6, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 3, …
$ discharge_disposition_id <dbl> 25, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 6, 1,…
$ admission_source_id      <dbl> 1, 7, 7, 7, 7, 2, 2, 7, 4, 4, 7, 4, 7, 7, 2, …
$ time_in_hospital         <dbl> 1, 3, 2, 2, 1, 3, 4, 5, 13, 12, 9, 7, 7, 10, …
$ payer_code               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ medical_specialty        <chr> "Pediatrics-Endocrinology", NA, NA, NA, NA, N…
$ num_lab_procedures       <dbl> 41, 59, 11, 44, 51, 31, 70, 73, 68, 33, 47, 6…
$ num_procedures           <dbl> 0, 0, 5, 1, 0, 6, 1, 0, 2, 3, 2, 0, 0, 1, 5, …
$ num_medications          <dbl> 1, 18, 13, 16, 8, 16, 21, 12, 28, 18, 17, 11,…
$ number_outpatient        <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ number_emergency         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ number_inpatient         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ diag_1                   <chr> "250.83", "276", "648", "8", "197", "414", "4…
$ diag_2                   <chr> NA, "250.01", "250", "250.43", "157", "411", …
$ diag_3                   <chr> NA, "255", "V27", "403", "250", "250", "V45",…
$ number_diagnoses         <dbl> 1, 9, 6, 7, 5, 9, 7, 8, 8, 8, 9, 7, 8, 8, 8, …
$ max_glu_serum            <chr> "None", "None", "None", "None", "None", "None…
$ a1cresult                <chr> "None", "None", "None", "None", "None", "None…
$ metformin                <chr> "No", "No", "No", "No", "No", "No", "Steady",…
$ repaglinide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ nateglinide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ chlorpropamide           <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glimepiride              <chr> "No", "No", "No", "No", "No", "No", "Steady",…
$ acetohexamide            <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glipizide                <chr> "No", "No", "Steady", "No", "Steady", "No", "…
$ glyburide                <chr> "No", "No", "No", "No", "No", "No", "No", "St…
$ tolbutamide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ pioglitazone             <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ rosiglitazone            <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ acarbose                 <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ miglitol                 <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ troglitazone             <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ tolazamide               <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ examide                  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ citoglipton              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ insulin                  <chr> "No", "Up", "No", "Up", "Steady", "Steady", "…
$ glyburide_metformin      <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glipizide_metformin      <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glimepiride_pioglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ metformin_rosiglitazone  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ metformin_pioglitazone   <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ change                   <chr> "No", "Ch", "No", "Ch", "Ch", "No", "Ch", "No…
$ diabetes_med             <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…
$ readmitted               <chr> "NO", ">30", "NO", "NO", "NO", ">30", "NO", "…

The dataset has 101,766 rows and 50 columns, spanning demographics, hospital-utilization metrics, diagnoses, and medication records. The target “readmitted” is currently character and should be treated as a binary factor (no/yes).

Exploratory Data Analysis

We can also see “readmitted” has three levels: NO (never readmitted), >30 (readmitted after 30 days), and <30 (readmitted within 30 days).

We have two choices for defining the binary target:

Option A (General Health): Did the patient return at all? (NO vs. Yes).

Option B (Financial Penalty): Did the patient return within 30 days? (NO/>30 vs. <30).

For this project we’ll go with Option A (Readmitted vs. Not), which yields a more balanced target.

df <- df %>%
  mutate(
    # Option A: Any readmission is 1, No readmission is 0
    target_binary = ifelse(readmitted == "NO", 0, 1),
    
    # Make sure it's a factor for classification models
    target_binary = as.factor(target_binary)
  )

# Check the Class Balance
table(df$target_binary) %>% prop.table()

        0         1 
0.5391192 0.4608808 

We can see that around 46% of patients are readmitted (Target = 1) and around 54% are not. This is a fairly balanced dataset, which is good news: we likely won’t need SMOTE or other oversampling techniques.
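For reference, here is a sketch of how Option B would be encoded if we later wanted to target the CMS penalty window directly (not used in the rest of this analysis):

# Option B (sketch, not used below): only returns within 30 days count as positive
df %>%
  mutate(target_30day = as.factor(ifelse(readmitted == "<30", 1, 0))) %>%
  count(target_30day) %>%
  mutate(prop = n / sum(n))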

Missing Value Analysis

This is critical for this specific dataset.

# We need to calculate the percentage of missing values per column
missing_vals <- df %>%
  summarise_all(~ sum(is.na(.)) / n() * 100) %>%
  pivot_longer(everything(), names_to = "column", values_to = "pct_missing") %>%
  arrange(desc(pct_missing)) %>%
  filter(pct_missing > 0)
# And it would help to visualize it
ggplot(missing_vals, aes(x = reorder(column, pct_missing), y = pct_missing)) +
  geom_bar(stat = "identity", fill = "tomato") +
  coord_flip() +
  labs(title = "Percentage of Missing Values by Feature", y = "% Missing", x = "Feature")

We will likely need to drop the “weight” column entirely. For “payer_code” we might drop it or treat “Missing” as its own category (e.g., “Uninsured”).
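As a hedged sketch of the alternative we do not take below (keeping payer_code and making missingness an explicit level), using tidyr’s replace_na():

# Alternative (sketch, not used below): treat missing payer_code as its own level
df %>%
  mutate(payer_code = factor(replace_na(payer_code, "Missing"))) %>%
  count(payer_code, sort = TRUE)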

ID Mappings

The columns admission_type_id, admission_source_id, and discharge_disposition_id are stored as integers, but they are categorical codes, not quantities. We convert them to factors so the models treat them as categories; their text descriptions live in the dataset’s companion IDS_mapping.csv.

df <- df %>%
  mutate(
    admission_type_id = as.factor(admission_type_id),
    admission_source_id = as.factor(admission_source_id),
    discharge_disposition_id = as.factor(discharge_disposition_id)
  )
# EDA: Does admission source affect readmission?
ggplot(df, aes(x = admission_source_id, fill = target_binary)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion Readmitted", title = "Readmission Rate by Admission Source")
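To make plots like this more readable, the integer codes can be labeled with their descriptions. The mapping below for admission_type_id is transcribed by hand from the dataset’s companion IDS_mapping.csv (transcribing avoids parsing that file’s stacked layout):

# Labels transcribed from IDS_mapping.csv (UCI companion file)
admission_type_labels <- c("1" = "Emergency",     "2" = "Urgent",
                           "3" = "Elective",      "4" = "Newborn",
                           "5" = "Not Available", "6" = "NULL",
                           "7" = "Trauma Center", "8" = "Not Mapped")

df %>%
  mutate(admission_type_desc = admission_type_labels[as.character(admission_type_id)]) %>%
  count(admission_type_desc, sort = TRUE)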

Dropping High-Missing Variables

As we saw before, weight is around 97% missing; imputing (guessing) that much data would introduce massive noise. payer_code is about 40% missing and, as a billing field, carries little clinical value for prediction, so dropping it simplifies the model.

# Drop variables with excessive missing values
df_clean <- df %>%
  select(-weight, -payer_code)
# Quick check to confirm they are gone
names(df_clean)
 [1] "encounter_id"             "patient_nbr"             
 [3] "race"                     "gender"                  
 [5] "age"                      "admission_type_id"       
 [7] "discharge_disposition_id" "admission_source_id"     
 [9] "time_in_hospital"         "medical_specialty"       
[11] "num_lab_procedures"       "num_procedures"          
[13] "num_medications"          "number_outpatient"       
[15] "number_emergency"         "number_inpatient"        
[17] "diag_1"                   "diag_2"                  
[19] "diag_3"                   "number_diagnoses"        
[21] "max_glu_serum"            "a1cresult"               
[23] "metformin"                "repaglinide"             
[25] "nateglinide"              "chlorpropamide"          
[27] "glimepiride"              "acetohexamide"           
[29] "glipizide"                "glyburide"               
[31] "tolbutamide"              "pioglitazone"            
[33] "rosiglitazone"            "acarbose"                
[35] "miglitol"                 "troglitazone"            
[37] "tolazamide"               "examide"                 
[39] "citoglipton"              "insulin"                 
[41] "glyburide_metformin"      "glipizide_metformin"     
[43] "glimepiride_pioglitazone" "metformin_rosiglitazone" 
[45] "metformin_pioglitazone"   "change"                  
[47] "diabetes_med"             "readmitted"              
[49] "target_binary"           

Visualizing Class Balance

Previously we saw that the values for readmitted are 46% vs 54%, but we can also create a simple plot to demonstrate this.

# Create a dataframe for plotting
balance_data <- df_clean %>%
  count(target_binary) %>%
  mutate(prop = n / sum(n))
# Plot
ggplot(balance_data, aes(x = target_binary, y = n, fill = target_binary)) +
  geom_col() +
  geom_text(aes(label = scales::percent(prop)), vjust = -0.5) +
  labs(
    title = "Class Balance: Readmitted vs Not Readmitted",
    x = "Readmission Status (0=No, 1=Yes)",
    y = "Count"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("steelblue", "tomato"))

Grouping ICD9 Codes (Feature Engineering)

The columns diag_1, diag_2, and diag_3 contain hundreds of distinct ICD-9 codes (e.g., 428.0, 250.01, as seen in the EDA). If we use these as-is, the model will overfit or fail due to the “Curse of Dimensionality.”

To address this we can group them into 9 standard clinical categories (Circulatory, Respiratory, Diabetes, etc.) to make the patterns learnable.

# Function to map ICD-9 codes to 9 distinct categories
collapse_diagnosis <- function(x) {
  # Convert to character first to handle 'V' and 'E' codes safely
  x <- as.character(x)
  
  case_when(
    # 1. Circulatory (390-459, 785)
    str_detect(x, "^39[0-9]|^4[0-5][0-9]|^785") ~ "Circulatory",
    
    # 2. Respiratory (460-519, 786)
    str_detect(x, "^4[6-9][0-9]|^5[0-1][0-9]|^786") ~ "Respiratory",
    
    # 3. Digestive (520-579, 787)
    str_detect(x, "^5[2-7][0-9]|^787") ~ "Digestive",
    
    # 4. Diabetes (strictly 250.xx codes)
    str_detect(x, "^250") ~ "Diabetes",
    
    # 5. Injury (800-999; note "^8|^9" also catches truncated codes like "8")
    str_detect(x, "^8|^9") ~ "Injury",
    
    # 6. Musculoskeletal (710-739)
    str_detect(x, "^7[1-3][0-9]") ~ "Musculoskeletal",
    
    # 7. Genitourinary (580-629, 788)
    str_detect(x, "^5[8-9][0-9]|^6[0-2][0-9]|^788") ~ "Genitourinary",
    
    # 8. Neoplasms / Cancer (140-239)
    str_detect(x, "^1[4-9][0-9]|^2[0-3][0-9]") ~ "Neoplasms",
    
    # 9. Other (everything else, including V and E codes)
    TRUE ~ "Other"
  )
}
# Apply this function to all three diagnosis columns
df_clean <- df_clean %>%
  mutate(
    diag_1_group = collapse_diagnosis(diag_1),
    diag_2_group = collapse_diagnosis(diag_2),
    diag_3_group = collapse_diagnosis(diag_3)
  ) %>%
  # Optionally convert them to factors for modeling
  mutate(across(starts_with("diag_"), as.factor))
# Visual Check: What are the primary diagnoses causing readmissions?
ggplot(df_clean, aes(x = fct_infreq(diag_1_group), fill = target_binary)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(
    title = "Readmission Rate by Primary Diagnosis Group",
    y = "Proportion Readmitted",
    x = "Diagnosis Group"
  )
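A quick spot check (codes picked by hand) confirms the mapping behaves as intended:

# Spot check of the mapping on a few familiar codes
collapse_diagnosis(c("428.0", "486", "250.01", "V27"))
# Expected: "Circulatory" "Respiratory" "Diabetes" "Other"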

Drop Constant Columns

# Identify columns that have only 1 unique value
constant_cols <- names(df_clean)[sapply(df_clean, function(x) length(unique(x)) == 1)]

print(paste("Dropping constant columns:", paste(constant_cols, collapse = ", ")))
[1] "Dropping constant columns: examide, citoglipton"
# Remove them from the main dataset
df_clean <- df_clean %>% 
  select(-all_of(constant_cols))

Exploring Age vs. Readmission

Older patients are generally higher risk. We create a plot to explore this hypothesis and validate the data quality.

ggplot(df_clean, aes(x = age, fill = target_binary)) +
  geom_bar(position = "fill") +
  labs(
    title = "Readmission Probability increases with Age",
    y = "Probability of Readmission",
    x = "Age Group"
  ) +
  scale_fill_brewer(palette = "Set2")

Exploring Time in Hospital vs. Readmission

Similar to age, we want to explore visually whether length of stay affects the probability of readmission: does staying longer mean more treatment and a better chance of avoiding readmission, or does it signal a sicker patient who needs extra care and is more likely to return?

ggplot(df_clean, aes(x = target_binary, y = time_in_hospital, fill = target_binary)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Time in Hospital by Readmission Status",
    x = "Readmitted?",
    y = "Days in Hospital"
  )

In the exploratory analysis we saw that while the dataset is relatively balanced (54% No Readmission vs. 46% Readmission), it contained significant data quality issues that required intervention. We removed the ‘weight’ and ‘payer_code’ features due to excessive missingness (~97% and ~40%, respectively) to avoid introducing noise. We also addressed the high dimensionality of the diagnostic features: hundreds of distinct ICD-9 codes were grouped into nine clinically relevant categories (such as Circulatory, Diabetes, and Respiratory). This feature engineering step reduces model complexity and helps prevent overfitting. We can now proceed with supervised learning to establish a predictive baseline using Logistic Regression and Random Forest, followed by unsupervised clustering to uncover latent patient risk profiles.

Robust Feature Selection

During a previous attempt at building a model we ran into the error “contrasts can be applied only to factors with 2 or more levels”. This happens because some medication columns (like acetohexamide or troglitazone) show “Steady” in only 1 or 2 rows out of the entire 100,000. When we run createDataPartition, those 1 or 2 rows might end up in the Test set, leaving the Training set with only “No” values. The model then sees a variable with a single level (“No”) and crashes.

Also, the earlier code created diag_1_group but did not remove the original diag_1 column. diag_1 has 700+ unique codes; feeding it into the model (target_binary ~ .) bloats the design matrix and breaks prediction whenever a rare code appears in Test but not in Train.

# To address the previous issues we define a list of columns to drop:
# 1. The original high-cardinality diagnosis codes (we use the groups instead)
# 2. Rare drugs that cause the "1 level" error because they have almost 0 "Yes" cases
cols_to_drop <- c("diag_1", "diag_2", "diag_3", # Redundant
                  "acetohexamide", "troglitazone",  # Rare (< 3 cases)
                  "glimepiride_pioglitazone", # Rare (< 2 cases)
                  "metformin_rosiglitazone", # Rare (< 2 cases)
                  "metformin_pioglitazone", # Rare (< 2 cases)
                  "examide", "citoglipton", # Constant columns
                  "encounter_id", "patient_nbr", "readmitted") # ID and Target Leaks
# Drop them from the main dataframe
df_model <- df_clean %>%
  select(-any_of(cols_to_drop))

# Verify they are gone
print(names(df_model))
 [1] "race"                     "gender"                  
 [3] "age"                      "admission_type_id"       
 [5] "discharge_disposition_id" "admission_source_id"     
 [7] "time_in_hospital"         "medical_specialty"       
 [9] "num_lab_procedures"       "num_procedures"          
[11] "num_medications"          "number_outpatient"       
[13] "number_emergency"         "number_inpatient"        
[15] "number_diagnoses"         "max_glu_serum"           
[17] "a1cresult"                "metformin"               
[19] "repaglinide"              "nateglinide"             
[21] "chlorpropamide"           "glimepiride"             
[23] "glipizide"                "glyburide"               
[25] "tolbutamide"              "pioglitazone"            
[27] "rosiglitazone"            "acarbose"                
[29] "miglitol"                 "tolazamide"              
[31] "insulin"                  "glyburide_metformin"     
[33] "glipizide_metformin"      "change"                  
[35] "diabetes_med"             "target_binary"           
[37] "diag_1_group"             "diag_2_group"            
[39] "diag_3_group"            
# We also want to remove High-Missing "Medical Specialty" 
# This column has ~50% NAs. Keeping it causes massive data loss in GLM.
if("medical_specialty" %in% names(df_model)) {
  df_model <- df_model %>% select(-medical_specialty)
  print("Removed medical_specialty to prevent data loss.")
}
[1] "Removed medical_specialty to prevent data loss."

When attempting to create the predictive models we also ran into the error (factor … has new levels), which occurred because the Test set contained a rare category (Level 20 in discharge_disposition_id) that did not appear in the Training set. The model had never seen Level 20 before, so it crashed when asked to predict on it.

In this specific dataset, Level 20 means “Expired” (Died). This also explains the warning: fitted probabilities numerically 0 or 1 occurred. If a patient died, their readmission chance is exactly 0%. This causes “Perfect Separation” in the model.

The correct data science approach here is to remove patients who died from the dataset before modeling. We cannot predict readmission for deceased patients, and they skew the model.

# Remove patients who Expired (Ids: 11, 19, 20, 21)
# These patients cannot be readmitted, so we exclude them.
# This ALSO fixes the "New Level 20" error.
df_model <- df_model %>%
  filter(!discharge_disposition_id %in% c(11, 19, 20, 21)) %>%
  droplevels() # crucial: remove the 'ghost' levels from the factor
# We also need to handle rare categories (Robustness)
# Even after removing deaths, other columns might have rare codes.
# This function groups any category with < 50 observations into "Other"
group_rare_levels <- function(var, threshold = 50) {
  var <- as.character(var)
  counts <- table(var)
  rare_levels <- names(counts)[counts < threshold]
  var[var %in% rare_levels] <- "Other"
  return(as.factor(var))
}
# Apply this to the ID columns
df_model <- df_model %>%
  mutate(
    discharge_disposition_id = group_rare_levels(discharge_disposition_id),
    admission_source_id = group_rare_levels(admission_source_id),
    admission_type_id = group_rare_levels(admission_type_id)
  )
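As a quick sanity check (a sketch), we can inspect the smallest remaining category in each recoded column; note that the pooled “Other” level can itself still be small if only a handful of rare codes were merged into it:

# Smallest remaining level count per recoded ID column
sapply(df_model[c("discharge_disposition_id", "admission_source_id",
                  "admission_type_id")],
       function(x) min(table(x)))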

Supervised Learning

We will split the data, run a Logistic Regression to get a baseline (interpretable odds), and then a Random Forest to capture non-linear relationships.

# We'll start by creating the Train/Test split 
set.seed(248) # to ensure reproducibility

# we'll create an index for 70% training data
trainIndex <- createDataPartition(df_model$target_binary, p = .7, 
                                  list = FALSE, 
                                  times = 1)

trainData <- df_model[ trainIndex,]
testData  <- df_model[-trainIndex,]
print(paste("Training Set:", nrow(trainData), "rows"))
[1] "Training Set: 70081 rows"
print(paste("Test Set:", nrow(testData), "rows"))
[1] "Test Set: 30033 rows"

Ensuring Factors have at least 2 levels

# We'll identify columns that have only 1 unique value in the TRAINING set
# (This catches any rare drug that didn't make it into the training split)
single_level_cols <- sapply(trainData, function(x) length(unique(na.omit(x))) < 2)
cols_to_remove <- names(single_level_cols)[single_level_cols]

print(paste("Dropping these constant columns:", paste(cols_to_remove, collapse = ", ")))
[1] "Dropping these constant columns: "
# Remove them from BOTH sets
trainData <- trainData[ , !names(trainData) %in% cols_to_remove]
testData <- testData[ , !names(testData) %in% cols_to_remove]

Logistic Regression

Now we can get a simple baseline model through logistic regression.

# Note: We can now use "." safely because we removed the bad columns from df_model
log_model <- glm(target_binary ~ ., 
                 data = trainData, 
                 family = "binomial",
                 na.action = na.omit)
# We need to handle NAs in prediction too:
# predict() returns NA for rows it cannot score, so we keep only scored rows
log_probs <- predict(log_model, testData, type = "response")
# Only compare rows where we actually got a prediction
valid_rows <- !is.na(log_probs)
log_preds <- ifelse(log_probs[valid_rows] > 0.5, 1, 0)
test_target_valid <- testData$target_binary[valid_rows]

confusionMatrix(as.factor(log_preds), as.factor(test_target_valid), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 11635  7022
         1  3833  6837
                                          
               Accuracy : 0.6299          
                 95% CI : (0.6243, 0.6354)
    No Information Rate : 0.5274          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2485          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4933          
            Specificity : 0.7522          
         Pos Pred Value : 0.6408          
         Neg Pred Value : 0.6236          
             Prevalence : 0.4726          
         Detection Rate : 0.2331          
   Detection Prevalence : 0.3638          
      Balanced Accuracy : 0.6228          
                                          
       'Positive' Class : 1               
                                          

Logistic Regression Results

Accuracy (63%): We are predicting correctly 63% of the time. While this doesn’t sound high, it is significantly better than the No Information Rate (53%) (which is what we’d get if we just guessed “No Readmission” for everyone).

Sensitivity (49.3%): This is our “catch rate.” The model catches only about half of the patients who actually get readmitted, i.e., the miss rate is roughly 50%.

Business Impact: This is the most dangerous metric. It means 50% of the “at-risk” patients are slipping through the cracks without intervention.

Specificity (75.2%): This is the true-negative rate; its complement (~25%) is our false-alarm rate. The model is actually quite good at identifying patients who won’t return.
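Because a missed readmission costs far more than a false alarm, one lever worth exploring is the decision threshold. A sketch (the 0.4 cutoff is an illustrative choice, not a tuned value):

# Trade specificity for sensitivity by lowering the cutoff
log_preds_lower <- ifelse(log_probs[valid_rows] > 0.4, 1, 0)
confusionMatrix(as.factor(log_preds_lower), as.factor(test_target_valid),
                positive = "1")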

Exploring Significant Variables

To understand what drives readmission, we don’t just look at the raw coefficients. We look at the Odds Ratios.

  • Odds Ratio > 1: Increases risk (e.g., OR 1.5 = 50% higher odds of readmission).

  • Odds Ratio < 1: Decreases risk (Protective factor).

# We'll extract coefficients and calculate Odds Ratios
# tidy() makes the messy summary(log_model) into a nice dataframe
model_results <- tidy(log_model, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  filter(p.value < 0.05) %>%   # we only keep statistically significant factors
  arrange(desc(estimate))
# View Top 10 Risk Factors 
print("Top 10 Factors Increasing Readmission Risk:")
[1] "Top 10 Factors Increasing Readmission Risk:"
print(head(model_results, 10))
# A tibble: 10 × 5
   term                       estimate std.error statistic   p.value
   <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
 1 age[70-80)                     2.96     0.256      4.23 0.0000233
 2 age[80-90)                     2.90     0.257      4.14 0.0000343
 3 age[60-70)                     2.77     0.256      3.98 0.0000689
 4 age[40-50)                     2.59     0.256      3.71 0.000210 
 5 age[50-60)                     2.55     0.256      3.66 0.000252 
 6 age[10-20)                     2.48     0.271      3.35 0.000797 
 7 age[20-30)                     2.31     0.263      3.18 0.00146  
 8 age[30-40)                     2.29     0.258      3.20 0.00137  
 9 discharge_disposition_id15     2.28     0.341      2.42 0.0157   
10 age[90-100)                    2.24     0.261      3.10 0.00194  
# View Top 10 Protective Factors
print("Top 10 Factors Reducing Readmission Risk:")
[1] "Top 10 Factors Reducing Readmission Risk:"
print(tail(model_results, 10))
# A tibble: 10 × 5
   term                       estimate std.error statistic  p.value
   <chr>                         <dbl>     <dbl>     <dbl>    <dbl>
 1 admission_source_id17        0.738     0.0527     -5.77 7.91e- 9
 2 discharge_disposition_id25   0.704     0.0897     -3.91 9.21e- 5
 3 raceAsian                    0.698     0.104      -3.46 5.31e- 4
 4 discharge_disposition_id23   0.688     0.127      -2.95 3.18e- 3
 5 diag_1_groupNeoplasms        0.661     0.0516     -8.02 1.05e-15
 6 admission_source_id6         0.577     0.0697     -7.90 2.84e-15
 7 admission_source_id4         0.542     0.0530    -11.6  6.15e-31
 8 admission_type_idOther       0.225     0.750      -1.99 4.68e- 2
 9 discharge_disposition_id13   0.127     0.188     -11.0  4.87e-28
10 discharge_disposition_id14   0.0679    0.232     -11.6  4.29e-31

We grab the 10 highest and 10 lowest significant estimates to plot:

# Risk Factors plot
top_factors <- bind_rows(
  head(model_results, 10),
  tail(model_results, 10)
)

ggplot(top_factors, aes(x = reorder(term, estimate), y = estimate, fill = estimate > 1)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  labs(title = "Significant Predictors of Readmission (Odds Ratios)",
       subtitle = "Values > 1 increase risk; Values < 1 decrease risk",
       y = "Odds Ratio",
       x = "Feature") +
  scale_fill_manual(values = c("steelblue", "tomato"), labels = c("Protective", "Risk Factor")) +
  theme_minimal()

We find that multiple age groups appear as significant predictors with high Odds Ratios.

This is typical of healthcare logistic regression models: with 10 age categories, Logistic Regression includes 9 of them in the equation and uses the remaining one as the reference level.

The group missing from the list is “age[0-10)”, so every odds ratio shown is a comparison against a child under 10 years old. According to the table, a patient aged 70-80 has 2.96 times the odds of readmission of a child aged 0-10.
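If the under-10 baseline feels unnatural, the reference level can be changed before fitting. A minimal sketch (the choice of [50-60) as the baseline is arbitrary):

# Refit with age [50-60) as the reference group, so odds ratios compare
# against a typical middle-aged patient instead of a child
trainData_refit <- trainData %>%
  mutate(age = relevel(as.factor(age), ref = "[50-60)"))

log_model_refit <- glm(target_binary ~ ., data = trainData_refit,
                       family = "binomial", na.action = na.omit)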

Random Forest

Now that we have a linear baseline, we move to Random Forest. First, we train the model:

set.seed(248)

rf_model <- randomForest(target_binary ~ ., 
                         data = trainData, 
                         ntree = 100,       
                         mtry = 5,          
                         importance = TRUE, 
                         na.action = na.omit)

Then we predict

rf_pred_probs <- predict(rf_model, testData, type = "prob")[,2] # Get prob of "1"
rf_preds <- ifelse(rf_pred_probs > 0.5, 1, 0)

After that we can evaluate

confusionMatrix(as.factor(rf_preds), as.factor(testData$target_binary), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 11241  6438
         1  4227  7421
                                          
               Accuracy : 0.6363          
                 95% CI : (0.6308, 0.6419)
    No Information Rate : 0.5274          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2644          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5355          
            Specificity : 0.7267          
         Pos Pred Value : 0.6371          
         Neg Pred Value : 0.6358          
             Prevalence : 0.4726          
         Detection Rate : 0.2530          
   Detection Prevalence : 0.3972          
      Balanced Accuracy : 0.6311          
                                          
       'Positive' Class : 1               
                                          
# extract and clean the importance table 
# We convert the matrix to a dataframe for easier sorting
imp_data <- as.data.frame(importance(rf_model))
imp_data$Feature <- rownames(imp_data)

# Sort by 'MeanDecreaseAccuracy' 
# High value = Variable is critical for accuracy
# Low value = Variable doesn't help much
sorted_imp <- imp_data[order(-imp_data$MeanDecreaseAccuracy), ]
print("--- TOP 10 PREDICTORS (MeanDecreaseAccuracy) ---")
[1] "--- TOP 10 PREDICTORS (MeanDecreaseAccuracy) ---"
print(head(sorted_imp, 10))
                                0           1 MeanDecreaseAccuracy
number_inpatient         48.30023  34.2344694             54.77091
discharge_disposition_id 33.08410  -8.0102174             23.44837
number_emergency         24.27983   4.7906881             23.35152
admission_source_id      25.71133 -12.7546678             22.70530
number_outpatient        20.19888   5.5302929             21.17491
diag_1_group             20.54130  -1.1733236             16.17953
num_medications          15.89041   3.0580479             15.87723
admission_type_id        19.72267  -9.3993166             15.12943
number_diagnoses         17.14299   2.0794472             14.87859
age                      17.53658   0.5702233             14.39917
                         MeanDecreaseGini                  Feature
number_inpatient                1767.7392         number_inpatient
discharge_disposition_id        1585.5691 discharge_disposition_id
number_emergency                 522.5354         number_emergency
admission_source_id              946.6751      admission_source_id
number_outpatient                630.0921        number_outpatient
diag_1_group                    2118.2282             diag_1_group
num_medications                 2589.8381          num_medications
admission_type_id                957.3396        admission_type_id
number_diagnoses                1321.8306         number_diagnoses
age                             1559.5086                      age
print("--- BOTTOM 10 PREDICTORS (Lowest Impact) ---")
[1] "--- BOTTOM 10 PREDICTORS (Lowest Impact) ---"
print(tail(sorted_imp, 10))
                             0          1 MeanDecreaseAccuracy MeanDecreaseGini
glyburide_metformin  0.4011612  1.4500453         1.2497205502        56.150306
tolazamide           2.4238556 -0.3183603         1.0169871436         4.861015
repaglinide          1.9452993 -0.9363360         0.8137400842       105.714945
glyburide           -0.8742590  1.6280350         0.7849261252       407.967364
tolbutamide          1.0005336 -0.5773173         0.4485735411         1.991789
nateglinide          1.1418101 -1.3540892         0.0260958602        57.947465
miglitol             0.3029262 -0.4459858         0.0008349514         3.822711
glimepiride          0.5330717 -1.2801113        -0.4455217024       257.985767
chlorpropamide      -1.5992963  0.8547437        -0.8583266416         8.454743
glipizide_metformin -1.7586264  0.0000000        -1.7586268515         1.631277
                                Feature
glyburide_metformin glyburide_metformin
tolazamide                   tolazamide
repaglinide                 repaglinide
glyburide                     glyburide
tolbutamide                 tolbutamide
nateglinide                 nateglinide
miglitol                       miglitol
glimepiride                 glimepiride
chlorpropamide           chlorpropamide
glipizide_metformin glipizide_metformin
varImpPlot(rf_model, 
           main = "Random Forest: Top 20 Important Features",
           n.var = 20)

Random Forest Results

While Logistic Regression provided a strong baseline, the Random Forest model demonstrated superior value for the specific business goal of Readmission Reduction.

The Random Forest correctly identified 53.6% of the readmitted patients, compared to only 49.3% for Logistic Regression. Regarding our business impact: in a hospital with 10,000 discharges (roughly 4,700 of which end in readmission at this prevalence), that 4.2-percentage-point improvement means identifying around 200 additional high-risk patients who would otherwise have been missed. If preventing a readmission saves $10,000, this model improvement alone represents a potential $2M savings opportunity compared to the linear model.

The trade-off is that the Random Forest has lower specificity (72.7% vs. 75.2%): it generates slightly more false alarms. The silver lining is that in healthcare a false alarm (calling a healthy patient to check in) is cheap, while a missed readmission (a patient returning to the ER) is expensive. Therefore, we prefer the Random Forest model.
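To put rough numbers on that argument (illustrative assumptions: $10,000 per missed readmission, $100 of outreach per false alarm), we can cost out the two confusion matrices above:

# Misses (predicted 0, actually 1) * $10k + false alarms (predicted 1, actually 0) * $100
log_cost <- 7022 * 10000 + 3833 * 100   # Logistic Regression
rf_cost  <- 6438 * 10000 + 4227 * 100   # Random Forest
log_cost - rf_cost                      # RF is ~$5.8M cheaper on this test set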

RF Top Predictors

  • Looking at the top predictors we see that “number_inpatient” is the strongest one, indicating that a patient’s history of previous hospital stays is the single strongest predictor of whether they will return. As a business impact this justifies a “Frequent Flyer Program.” Any patient with >1 previous inpatient visit in the last year should automatically be flagged for high-touch intervention, regardless of their age or diagnosis.

  • discharge_disposition_id shows that where a patient goes after discharge matters almost as much as their history. Patients sent to SNFs (Skilled Nursing Facilities) or Rehab likely have different risks than those sent home.

  • number_emergency shows that, similar to inpatient history, emergency room utilization indicates instability.

  • admission_source_id indicates that patients coming from the ER vs. a Physician Referral have different risk profiles.

  • number_outpatient, interestingly, shows that outpatient visits are also a top predictor. High outpatient use might actually be protective (managed care) or a sign of sickness (constant need for care).

Unsupervised Learning

K-Means Clustering

Now that we know Utilization is the key driver (Top 1, 3, and 5 predictors), we will use Clustering to group patients based only on these utilization metrics.

In this case, we’ll create “Patient Personas” and check whether any persona carries a higher readmission rate.

# We'll select and scale utilization features 
# We use the TOP predictors from Random Forest for our clusters
# This ensures our clusters are actually meaningful
cluster_cols <- c("number_inpatient", "number_emergency", "number_outpatient", 
                  "num_medications", "num_lab_procedures", "time_in_hospital")
# Subset and Scale
# We use the FULL dataset (df_model) because clustering is for everyone
util_data <- df_model %>% select(all_of(cluster_cols))
util_scaled <- scale(util_data)

Now we run K-means

# We choose 3 clusters to find Low, Medium, and High usage groups
set.seed(786)
k3 <- kmeans(util_scaled, centers = 3, nstart = 25)
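The choice of k = 3 can be sanity-checked with an elbow plot via factoextra’s fviz_nbclust() (already loaded). A sketch, run on a subsample because the full 100k rows are slow:

# Elbow method on a 5,000-row subsample (sketch)
set.seed(786)
idx <- sample(nrow(util_scaled), 5000)
fviz_nbclust(util_scaled[idx, ], kmeans, method = "wss", k.max = 8)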

Then we analyze the personas

# Attach cluster ID to the original data
df_clusters <- df_model %>%
  mutate(Cluster = as.factor(k3$cluster))
# Calculate stats for each cluster
cluster_summary <- df_clusters %>%
  group_by(Cluster) %>%
  summarise(
    Count = n(),
    Readmission_Rate = mean(as.numeric(as.character(target_binary))),
    Avg_Inpatient = mean(number_inpatient),
    Avg_Emergency = mean(number_emergency),
    Avg_Labs = mean(num_lab_procedures),
    Avg_Meds = mean(num_medications)
  ) %>%
  arrange(desc(Readmission_Rate)) # Sort by Risk
print("--- CLUSTER ANALYSIS (Patient Personas) ---")
[1] "--- CLUSTER ANALYSIS (Patient Personas) ---"
print(cluster_summary)
# A tibble: 3 × 7
  Cluster Count Readmission_Rate Avg_Inpatient Avg_Emergency Avg_Labs Avg_Meds
  <fct>   <int>            <dbl>         <dbl>         <dbl>    <dbl>    <dbl>
1 1        7245            0.760         3.52         1.46       44.2     17.2
2 2       30158            0.489         0.524        0.108      56.2     23.0
3 3       62711            0.425         0.352        0.0963     36.4     12.5
# Boxplot of Inpatient Visits by Cluster
ggplot(df_clusters, aes(x = Cluster, y = number_inpatient, fill = Cluster)) +
  geom_boxplot() +
  labs(title = "Patient Personas: Inpatient History by Cluster",
       subtitle = "Which cluster represents the 'Frequent Flyers'?",
       y = "Number of Inpatient Visits")

K-Means Clustering Results

We can observe 3 distinct clusters:

  • Cluster 1: “The Crisis Group” (High Risk) is the smallest group (only around 7% of patients fall here), and it shows extreme utilization: an average of 3.5 inpatient visits and 1.5 ER visits (compared to <0.5 for the others). These patients reach a 76% readmission rate, suggesting unstable cases constantly cycling through the hospital system.

  • Cluster 2: “The Complex Chronic Group” (Moderate Risk) represents about 30% of the patients. Their key trait is high clinical intensity. They have the highest average Labs (56) and Medications (23), but low ER/Inpatient visits. They had a 49% readmission rate. This indicates that these patients are sick (have complex needs), but they are being managed. Their high lab/med counts suggest they are receiving care, which keeps them out of the ER but still leaves them at moderate risk.

  • Cluster 3: “The Stable Group” (Low Risk) is the largest group, with about 63% of patients, and its key trait is low utilization across the board. It has a 42.5% readmission rate, the lowest of the three clusters; these are likely routine cases or healthier diabetic patients.
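As a final operational sketch tying the two models together (the >1 prior-inpatient-visit cutoff is illustrative, not tuned), the intervention list could combine Cluster 1 membership with the Random Forest’s top risk signal:

# Candidate list for high-touch intervention (sketch)
high_touch <- df_clusters %>%
  filter(Cluster == "1" | number_inpatient > 1)

nrow(high_touch)                                          # size of the outreach list
mean(as.numeric(as.character(high_touch$target_binary)))  # its readmission rate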