Modeling Hospital Stay Duration and Readmission Risk in Diabetic Patients Using Machine Learning

Introduction

Hospital readmissions and prolonged lengths of stay are critical challenges in managing diabetes-related hospital care. Frequent readmissions can indicate poor disease management, while extend hospitalizations strain healthcare resources and increase patient risk. Leveraging machine learning (ML) to predict these outcomes can help identify high-risk patients early, guide interventions, and improve care planning.

This project focuses on predicting two important outcomes among diabetic patients:

The duration of hospital stays, a continuous variable representing healthcare utilization.
The prediction of readmission, a classification problem that reflects care quality.

By analyzing clinical, demographic, and treatment-related features, the project applies predictive modeling and interpretability techniques to uncover key drivers of these outcomes and enhance decision support in diabetes care.

Dataset Details

Link	https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008
Title	Diabetes 130-US Hospitals for Years 1999-2008
Purpose of Dataset	The dataset is designed to analyze hospital readmission patterns of diabetic patients, particularly in determining the likelihood of a patient being readmitted within 30 days of discharge after hospitalization due to diabetes-related health issues.
Dimension	101767 rows, 50 columns
Content	10 years of clinical care (1999–2008) across 130 U.S. hospitals
Summary	• Class Distribution (readmitted): • “NO”: 54,864 (most frequent) • “>30” and “<30”: Remaining • Race: 76,099 patients are “Caucasian” (dominant group) • Age Groups: 10 discrete ranges like [70-80), [60-70), etc. • Missing values: Most entries have “?” for weight, payer_code, and medical_specialty — indicating missing or unknown data

Research Questions

What features are most important in predicting readmission and length of stay, as identified by machine learning interpretability methods?
⁠Which machine learning model most accurately predicts the length of hospital stay after diabetic patient readmission?
⁠How accurately can machine learning models classify diabetic patients into readmission categories?

Research Objectives

To apply feature importance and interpretability techniques to identify the most influential factors associated with diabetic readmissions.
⁠To develop and evaluate machine learning models for predicting the length of stay and hospital readmission among diabetic patients.
⁠To build and assess classification model to predict the readmission risk using structured health data.

Implementation

Data Cleaning

Accessing data source

Before performing data analysis in R, we import a library called dplyr, which is a package for data manipulation and transformation. Other than that, we also import knitr library to perform dynamic report generation. ggplot2 on the other hand will help in customizing data visualization. In addition, the corrplot package is used to visualize correlation matrices, allowing us to identify relationships between numeric variables effectively. The tidyr package supports data tidying tasks such as reshaping data,handling missing values, ensuring our dataset is structured appropriately for analysis.

library(dplyr)
library(knitr)
library(ggplot2)

Next, read diabetic_data.csv as the source file by directly inserting the file name with its directory. To improve performance, we use head(,5) to display the first 5 rows. We will consider “?” as string while reading the data.

df <- read.csv("diabetic_data.csv",na.strings = "?")
kable(head(df,5), align = "l")

encounter_id	patient_nbr	race	gender	age	weight	admission_type_id	discharge_disposition_id	admission_source_id	time_in_hospital	payer_code	medical_specialty	num_lab_procedures	num_procedures	num_medications	number_outpatient	number_inpatient	diag_1	diag_2	diag_3	number_diagnoses	max_glu_serum	A1Cresult	metformin	repaglinide	nateglinide	chlorpropamide	glimepiride	acetohexamide	glipizide	glyburide	tolbutamide	pioglitazone	rosiglitazone	acarbose	miglitol	troglitazone	tolazamide	examide	citoglipton	insulin	glyburide.metformin	glipizide.metformin	glimepiride.pioglitazone	metformin.rosiglitazone	metformin.pioglitazone	change	diabetesMed	readmitted
2278392	8222157	Caucasian	Female	[0-10)	NA	6	25	1	1	NA	Pediatrics-Endocrinology	41	0	1	0	0	250.83	NA	NA	1	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	NO
149190	55629189	Caucasian	Female	[10-20)	NA	1	1	7	3	NA	NA	59	0	18	0	0	276	250.01	255	9	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Up	No	No	No	No	No	Ch	Yes	>30
64410	86047875	AfricanAmerican	Female	[20-30)	NA	1	1	7	2	NA	NA	11	5	13	2	1	648	250	V27	6	None	None	No	No	No	No	No	No	Steady	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Yes	NO
500364	82442376	Caucasian	Male	[30-40)	NA	1	1	7	2	NA	NA	44	1	16	0	0	8	250.43	403	7	None	None	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	No	Up	No	No	No	No	No	Ch	Yes	NO
16680	42519267	Caucasian	Male	[40-50)	NA	1	1	7	1	NA	NA	51	0	8	0	0	197	157	250	5	None	None	No	No	No	No	No	No	Steady	No	No	No	No	No	No	No	No	No	No	Steady	No	No	No	No	No	Ch	Yes	NO

As a reference to some of the columns with id, we can refer to IDS_mapping.csv file.

# read 'IDS_Mapping.csv'
# df <- read.csv(file.choose(),na.strings = "?")
id_reference <- read.csv("IDS_mapping.csv")
#id_reference
kable(id_reference, align = "l")

DISCHARGE_DISPOSITION_DESCRIPTION	ADMISSION_SOURCE_DESCRIPTION	ADMISSION_TYPE_DESCRIPTION
0 - Unknown	0 - Unknown	0 - Unknown
1 - Discharged to home	1 - Physician Referral	1 - Emergency
2 - Discharged/transferred to another short term hospital	2 - Clinic Referral	2 - Urgent
3 - Discharged/transferred to SNF	3 - HMO Referral	3 - Elective
4 - Discharged/transferred to ICF	4 - Transfer from a hospital	4 - Newborn
5 - Discharged/transferred to another type of inpatient care institution	5 - Transfer from a Skilled Nursing Facility (SNF)	7 - Trauma Center
6 - Discharged/transferred to home with home health service	6 - Transfer from another health care facility
7 - Left AMA	7 - Emergency Room
8 - Discharged/transferred to home under care of Home IV provider	8 - Court/Law Enforcement
9 - Admitted as an inpatient to this hospital	10 - Transfer from critial access hospital
10 - Neonate discharged to another hospital for neonatal aftercare	11 - Normal Delivery
11 - Expired	12 - Premature Delivery
12 - Still patient or expected to return for outpatient services	13 - Sick Baby
13 - Hospice / home	14 - Extramural Birth
14 - Hospice / medical facility	17 - Transfer From Another Home Health Agency
15 - Discharged/transferred within this institution to Medicare approved swing bed	18 - Readmission to Same Home Health Agency
16 - Discharged/transferred/referred another institution for outpatient services	21 - Transfer from hospital inpt/same fac reslt in a sep claim
17 - Discharged/transferred/referred to this institution for outpatient services	22 - Born inside this hospital
19 - Expired at home. Medicaid only, hospice.	23 - Born outside this hospital
20 - Expired in a medical facility. Medicaid only, hospice.	24 - Transfer from Ambulatory Surgery Center
21 - Expired, place unknown. Medicaid only, hospice.	25 - Transfer from Hospice
22 - Discharged/transferred to another rehab fac including rehab units of a hospital .
23 - Discharged/transferred to a long term care hospital.
24 - Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare.
27 - Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere
28 - Discharged/transferred to a federal health care facility.
29 - Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital
30 - Discharged/transferred to a Critical Access Hospital (CAH).

Understanding raw data

The dataset are likely related to diabetics patient records. Data can be understand by identifying number of rows, dimension, and structure.

nrow(df)

## [1] 101766

dim(df)

## [1] 101766     50

Generally, there are 101766 observations and 50 variables. The observation covers demographic, diagnostic, and treatment information.

str(df)

## 'data.frame':    101766 obs. of  50 variables:
##  $ encounter_id            : int  2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ race                    : chr  "Caucasian" "Caucasian" "AfricanAmerican" "Caucasian" ...
##  $ gender                  : chr  "Female" "Female" "Female" "Male" ...
##  $ age                     : chr  "[0-10)" "[10-20)" "[20-30)" "[30-40)" ...
##  $ weight                  : chr  NA NA NA NA ...
##  $ admission_type_id       : int  6 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  25 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : chr  NA NA NA NA ...
##  $ medical_specialty       : chr  "Pediatrics-Endocrinology" NA NA NA ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : chr  "250.83" "276" "648" "8" ...
##  $ diag_2                  : chr  NA "250.01" "250" "250.43" ...
##  $ diag_3                  : chr  NA "255" "V27" "403" ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : chr  "None" "None" "None" "None" ...
##  $ A1Cresult               : chr  "None" "None" "None" "None" ...
##  $ metformin               : chr  "No" "No" "No" "No" ...
##  $ repaglinide             : chr  "No" "No" "No" "No" ...
##  $ nateglinide             : chr  "No" "No" "No" "No" ...
##  $ chlorpropamide          : chr  "No" "No" "No" "No" ...
##  $ glimepiride             : chr  "No" "No" "No" "No" ...
##  $ acetohexamide           : chr  "No" "No" "No" "No" ...
##  $ glipizide               : chr  "No" "No" "Steady" "No" ...
##  $ glyburide               : chr  "No" "No" "No" "No" ...
##  $ tolbutamide             : chr  "No" "No" "No" "No" ...
##  $ pioglitazone            : chr  "No" "No" "No" "No" ...
##  $ rosiglitazone           : chr  "No" "No" "No" "No" ...
##  $ acarbose                : chr  "No" "No" "No" "No" ...
##  $ miglitol                : chr  "No" "No" "No" "No" ...
##  $ troglitazone            : chr  "No" "No" "No" "No" ...
##  $ tolazamide              : chr  "No" "No" "No" "No" ...
##  $ examide                 : chr  "No" "No" "No" "No" ...
##  $ citoglipton             : chr  "No" "No" "No" "No" ...
##  $ insulin                 : chr  "No" "Up" "No" "Up" ...
##  $ glyburide.metformin     : chr  "No" "No" "No" "No" ...
##  $ glipizide.metformin     : chr  "No" "No" "No" "No" ...
##  $ glimepiride.pioglitazone: chr  "No" "No" "No" "No" ...
##  $ metformin.rosiglitazone : chr  "No" "No" "No" "No" ...
##  $ metformin.pioglitazone  : chr  "No" "No" "No" "No" ...
##  $ change                  : chr  "No" "Ch" "No" "Ch" ...
##  $ diabetesMed             : chr  "No" "Yes" "Yes" "Yes" ...
##  $ readmitted              : chr  "NO" ">30" "NO" "NO" ...

From the structure above, we can observed that most of the data are categorical with lots of missing values labelled as “?”.

After exploring about some fields with id (discharge_disposition_id, admission_source_id, admission_type_id), we observe that there are some id that contains “Null”, “Not Mapped”, “Not Available”, and “Unknown/Invalid” category. To make it standardized, we change the ‘id’ into 0, with “Unknown” as it description.

df <- df %>%
  mutate(
    discharge_disposition_id = case_when(
      discharge_disposition_id %in% c(18, 25, 26) ~ 0,
      TRUE ~ discharge_disposition_id
    ),
    admission_source_id = case_when(
      admission_source_id %in% c(9, 15, 16, 19, 20) ~ 0,
      TRUE ~ admission_source_id
    ),
    admission_type_id = case_when(
      admission_type_id %in% c(5, 6, 8) ~ 0,
      TRUE ~ admission_type_id
    )
  )

Below are the list of unique id which contains 0 id as it category.

list(
  admission_source_id = sort(unique(df$admission_source_id)),
  admission_type_id = sort(unique(df$admission_type_id)),
  discharge_disposition_id = sort(unique(df$discharge_disposition_id))
)

## $admission_source_id
##  [1]  0  1  2  3  4  5  6  7  8 10 11 13 14 17 22 25
## 
## $admission_type_id
## [1] 0 1 2 3 4 7
## 
## $discharge_disposition_id
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 19 20 22 23 24 27 28

Remove duplicates

Duplicates are removed to ensure data accuracy and reliability.

df <-  df[!duplicated(df), ]
nrow(df)

## [1] 101766

We can observe that there are no duplicated rows as the number of row before and after are tally.

Since this dataset is about patient hospital admission, encounter_id should be unique.

pk <- (unique(df$encounter_id))
length(pk)

## [1] 101766

length(!is.na(pk))

## [1] 101766

As we can see, encounter_id is unique and contains no missing value. Thus, it can act as a primary key in this dataset.

Handling missing values

Handle missing value is a critical step in data cleaning to ensure the data are ready for analysis. Ignoring this can lead to inaccurate analysis and biased model.

Firstly, we get the total number of missing values and missing percentage for each column.

missing_summary <- data.frame(
  missing_count = sapply(df, function(x) sum(is.na(x))),
  missing_percent = paste(round(sapply(df, function(x) mean(is.na(x)) * 100),2),'%')
)
kable(missing_summary, align = "l")

	missing_count	missing_percent
encounter_id	0	0 %
patient_nbr	0	0 %
race	2273	2.23 %
gender	0	0 %
age	0	0 %
weight	98569	96.86 %
admission_type_id	0	0 %
discharge_disposition_id	0	0 %
admission_source_id	0	0 %
time_in_hospital	0	0 %
payer_code	40256	39.56 %
medical_specialty	49949	49.08 %
num_lab_procedures	0	0 %
num_procedures	0	0 %
num_medications	0	0 %
number_outpatient	0	0 %
number_emergency	0	0 %
number_inpatient	0	0 %
diag_1	21	0.02 %
diag_2	358	0.35 %
diag_3	1423	1.4 %
number_diagnoses	0	0 %
max_glu_serum	0	0 %
A1Cresult	0	0 %
metformin	0	0 %
repaglinide	0	0 %
nateglinide	0	0 %
chlorpropamide	0	0 %
glimepiride	0	0 %
acetohexamide	0	0 %
glipizide	0	0 %
glyburide	0	0 %
tolbutamide	0	0 %
pioglitazone	0	0 %
rosiglitazone	0	0 %
acarbose	0	0 %
miglitol	0	0 %
troglitazone	0	0 %
tolazamide	0	0 %
examide	0	0 %
citoglipton	0	0 %
insulin	0	0 %
glyburide.metformin	0	0 %
glipizide.metformin	0	0 %
glimepiride.pioglitazone	0	0 %
metformin.rosiglitazone	0	0 %
metformin.pioglitazone	0	0 %
change	0	0 %
diabetesMed	0	0 %
readmitted	0	0 %

Based on the table above, weight have the highest missing value which is 97%, followed by prayer_code and medical_speciality with 40% and 49% respectively, race with less than 2%, and lastly diag_1, diag_2, and diag_3 with almost 0%. Column with almost 100% missing value will be removed while applying imputation to the column with less than 50% missing values.

Since there are too much missing value in weight column, the column will be removed as it is no longer useful for analysis.

df <- df %>% select(-weight)
dim(df)

## [1] 101766     49

As we can see, weight column has been removed since the variable decrease by 1.

Next, missing values in payer_code and medical_speciality column will be imputed to “Unknown” category.

df$payer_code[is.na(df$payer_code)] <- "Unknown" 
unique(df$payer_code)

##  [1] "Unknown" "MC"      "MD"      "HM"      "UN"      "BC"      "SP"     
##  [8] "CP"      "SI"      "DM"      "CM"      "CH"      "PO"      "WC"     
## [15] "OT"      "OG"      "MP"      "FR"

df$medical_specialty[is.na(df$medical_specialty)] <- "Unknown" 
unique(df$medical_specialty)

##  [1] "Pediatrics-Endocrinology"            
##  [2] "Unknown"                             
##  [3] "InternalMedicine"                    
##  [4] "Family/GeneralPractice"              
##  [5] "Cardiology"                          
##  [6] "Surgery-General"                     
##  [7] "Orthopedics"                         
##  [8] "Gastroenterology"                    
##  [9] "Surgery-Cardiovascular/Thoracic"     
## [10] "Nephrology"                          
## [11] "Orthopedics-Reconstructive"          
## [12] "Psychiatry"                          
## [13] "Emergency/Trauma"                    
## [14] "Pulmonology"                         
## [15] "Surgery-Neuro"                       
## [16] "Obsterics&Gynecology-GynecologicOnco"
## [17] "ObstetricsandGynecology"             
## [18] "Pediatrics"                          
## [19] "Hematology/Oncology"                 
## [20] "Otolaryngology"                      
## [21] "Surgery-Colon&Rectal"                
## [22] "Pediatrics-CriticalCare"             
## [23] "Endocrinology"                       
## [24] "Urology"                             
## [25] "Psychiatry-Child/Adolescent"         
## [26] "Pediatrics-Pulmonology"              
## [27] "Neurology"                           
## [28] "Anesthesiology-Pediatric"            
## [29] "Radiology"                           
## [30] "Pediatrics-Hematology-Oncology"      
## [31] "Psychology"                          
## [32] "Podiatry"                            
## [33] "Gynecology"                          
## [34] "Oncology"                            
## [35] "Pediatrics-Neurology"                
## [36] "Surgery-Plastic"                     
## [37] "Surgery-Thoracic"                    
## [38] "Surgery-PlasticwithinHeadandNeck"    
## [39] "Ophthalmology"                       
## [40] "Surgery-Pediatric"                   
## [41] "Pediatrics-EmergencyMedicine"        
## [42] "PhysicalMedicineandRehabilitation"   
## [43] "InfectiousDiseases"                  
## [44] "Anesthesiology"                      
## [45] "Rheumatology"                        
## [46] "AllergyandImmunology"                
## [47] "Surgery-Maxillofacial"               
## [48] "Pediatrics-InfectiousDiseases"       
## [49] "Pediatrics-AllergyandImmunology"     
## [50] "Dentistry"                           
## [51] "Surgeon"                             
## [52] "Surgery-Vascular"                    
## [53] "Osteopath"                           
## [54] "Psychiatry-Addictive"                
## [55] "Surgery-Cardiovascular"              
## [56] "PhysicianNotFound"                   
## [57] "Hematology"                          
## [58] "Proctology"                          
## [59] "Obstetrics"                          
## [60] "SurgicalSpecialty"                   
## [61] "Radiologist"                         
## [62] "Pathology"                           
## [63] "Dermatology"                         
## [64] "SportsMedicine"                      
## [65] "Speech"                              
## [66] "Hospitalist"                         
## [67] "OutreachServices"                    
## [68] "Cardiology-Pediatric"                
## [69] "Perinatology"                        
## [70] "Neurophysiology"                     
## [71] "Endocrinology-Metabolism"            
## [72] "DCPTEAM"                             
## [73] "Resident"

The unique categories are listed above with “Unknown” as missing value.

Since column race has only 2% of missing value, we perform imputation by mode.

mode_race <- names(which.max(table(df$race))) #Impute with mode
mode_race

## [1] "Caucasian"

“Caucasion” is the mode and those missing value will be assign as it.

df$race[is.na(df$race)] <- mode_race
unique(df$race)

## [1] "Caucasian"       "AfricanAmerican" "Other"           "Asian"          
## [5] "Hispanic"

For gender column, there are currently three categories which are “Female”, “Male”, and “Unknown/Invalid”. We consider “Unknown/Invalid” as missing value, thus it will be filtered.

unique(df$gender)

## [1] "Female"          "Male"            "Unknown/Invalid"

df <- df %>% filter(!(gender == "Unknown/Invalid"))
nrow(df)

## [1] 101763

diag_1, diag_2, and diag_3 contains the code for primary, secondary and tertiary diagnosis. Since not all patients are diagnosed will those three at once, we will just change the null value to “Not Available”.

df <- df %>%
  mutate(across(c(diag_1, diag_2, diag_3), ~ ifelse(is.na(.), "Not available", .)))

After handling missing value, there are currently 101763 observations with -3 difference as compared to original observation.

Add aggregated column

Column diag_1, diag_2, and diag_3 contains the diagnosis code. Each patient might only have one or two diagnosis, per admission. Thus, calculating the number of diagnosis will be helpful for further analysis.

df <- df %>% mutate(diag_count = rowSums(!is.na(select(.,diag_1,diag_2,diag_3))))
dim(df)

## [1] 101763     50

Our variables are back to 50 after adding a new column.

Change data type

For a useful analysis, we will change chr type to factor type. Factor will make it more efficient for categorical grouping that will be used during classification modelling.

df <- df %>% mutate(across(where(is.character), as.factor))%>% mutate(across(where(is.numeric), as.integer))
str(df)

## 'data.frame':    101763 obs. of  50 variables:
##  $ encounter_id            : int  2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ race                    : Factor w/ 5 levels "AfricanAmerican",..: 3 3 1 3 3 3 3 3 3 3 ...
##  $ gender                  : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 2 1 1 ...
##  $ age                     : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ admission_type_id       : int  0 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  0 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : Factor w/ 18 levels "BC","CH","CM",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ medical_specialty       : Factor w/ 73 levels "AllergyandImmunology",..: 38 72 72 72 72 72 72 72 72 19 ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : Factor w/ 717 levels "10","11","110",..: 125 144 455 555 55 264 264 277 253 283 ...
##  $ diag_2                  : Factor w/ 749 levels "11","110","111",..: 710 80 79 98 25 247 247 315 261 47 ...
##  $ diag_3                  : Factor w/ 790 levels "11","110","111",..: 748 122 768 249 87 87 772 87 230 318 ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
##  $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
##  $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ examide                 : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citoglipton             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
##  $ glyburide.metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glipizide.metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.pioglitazone  : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...
##  $ diag_count              : int  3 3 3 3 3 3 3 3 3 3 ...

Each category will be considered as one level of factors and non-decimal columns (id columns) will be considered as int type rather than num.

Relabelling interval bins

Column age show intervals that presented in left-closed and right-open bins. To improve readability, we will change the label by excluding the brackets with correct bins.

new_age_label <- paste0(seq(0, 90, by = 10),"-", seq(9, 99, by = 10))
levels(df$age) <- new_age_label
unique(df$age)

##  [1] 0-9   10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
## Levels: 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99

The factor for age are now re-code into 10 bins, from 0-99.

str(df$age)

##  Factor w/ 10 levels "0-9","10-19",..: 1 2 3 4 5 6 7 8 9 10 ...

Save cleaned data

Storing data in .rds format will preserve the full structure and can avoid formatting issues. Thus, cleaned data will be saved as diabetic_cleaned.rds for future use.

saveRDS(df,"diabetic_cleaned.rds")

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is performed to explore patterns, detect anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. The following visualizations and summaries aim to understand the characteristics of the diabetic patients and how various factors might influence hospital readmission.

Summary statistics

This gives a quick overview of the dataset, including minimum, maximum, mean, and quartiles for numerical variables, and frequency counts for categorical variables. This helps detect any skewness, unusual ranges, or imbalanced categories.

# Load necessary libraries for data manipulation and visualization
library(dplyr)     # For data manipulation
library(ggplot2)   # For plotting charts

# Display basic statistics (min, max, mean, etc.) for each column
summary(df)

##   encounter_id        patient_nbr                     race          gender     
##  Min.   :    12522   Min.   :      135   AfricanAmerican:19210   Female:54708  
##  1st Qu.: 84959748   1st Qu.: 23412964   Asian          :  641   Male  :47055  
##  Median :152388294   Median : 45500490   Caucasian      :78370                 
##  Mean   :165200788   Mean   : 54329650   Hispanic       : 2037                 
##  3rd Qu.:230269803   3rd Qu.: 87545714   Other          : 1505                 
##  Max.   :443867222   Max.   :189502619                                         
##                                                                                
##       age        admission_type_id discharge_disposition_id admission_source_id
##  70-79  :26066   Min.   :0.000     Min.   : 0.00            Min.   : 0.000     
##  60-69  :22482   1st Qu.:1.000     1st Qu.: 1.00            1st Qu.: 1.000     
##  50-59  :17256   Median :1.000     Median : 1.00            Median : 7.000     
##  80-89  :17197   Mean   :1.452     Mean   : 2.82            Mean   : 5.712     
##  40-49  : 9685   3rd Qu.:2.000     3rd Qu.: 3.00            3rd Qu.: 7.000     
##  30-39  : 3775   Max.   :7.000     Max.   :28.00            Max.   :25.000     
##  (Other): 5302                                                                 
##  time_in_hospital   payer_code                 medical_specialty
##  Min.   : 1.000   Unknown:40255   Unknown               :49947  
##  1st Qu.: 2.000   MC     :32439   InternalMedicine      :14635  
##  Median : 4.000   HM     : 6274   Emergency/Trauma      : 7565  
##  Mean   : 4.396   SP     : 5007   Family/GeneralPractice: 7440  
##  3rd Qu.: 6.000   BC     : 4655   Cardiology            : 5351  
##  Max.   :14.000   MD     : 3532   Surgery-General       : 3099  
##                   (Other): 9601   (Other)               :13726  
##  num_lab_procedures num_procedures num_medications number_outpatient
##  Min.   :  1.0      Min.   :0.00   Min.   : 1.00   Min.   : 0.0000  
##  1st Qu.: 31.0      1st Qu.:0.00   1st Qu.:10.00   1st Qu.: 0.0000  
##  Median : 44.0      Median :1.00   Median :15.00   Median : 0.0000  
##  Mean   : 43.1      Mean   :1.34   Mean   :16.02   Mean   : 0.3694  
##  3rd Qu.: 57.0      3rd Qu.:2.00   3rd Qu.:20.00   3rd Qu.: 0.0000  
##  Max.   :132.0      Max.   :6.00   Max.   :81.00   Max.   :42.0000  
##                                                                     
##  number_emergency  number_inpatient      diag_1          diag_2     
##  Min.   : 0.0000   Min.   : 0.0000   428    : 6862   276    : 6752  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   414    : 6580   428    : 6662  
##  Median : 0.0000   Median : 0.0000   786    : 4016   250    : 6071  
##  Mean   : 0.1978   Mean   : 0.6356   410    : 3614   427    : 5036  
##  3rd Qu.: 0.0000   3rd Qu.: 1.0000   486    : 3508   401    : 3736  
##  Max.   :76.0000   Max.   :21.0000   427    : 2766   496    : 3305  
##                                      (Other):74417   (Other):70201  
##      diag_3      number_diagnoses max_glu_serum A1Cresult     metformin    
##  250    :11555   Min.   : 1.000   >200: 1485    >7  : 3812   Down  :  575  
##  401    : 8288   1st Qu.: 6.000   >300: 1264    >8  : 8216   No    :81776  
##  276    : 5175   Median : 8.000   None:96417    None:84745   Steady:18345  
##  428    : 4577   Mean   : 7.423   Norm: 2597    Norm: 4990   Up    : 1067  
##  427    : 3955   3rd Qu.: 9.000                                            
##  414    : 3664   Max.   :16.000                                            
##  (Other):64549                                                             
##  repaglinide     nateglinide     chlorpropamide  glimepiride    acetohexamide  
##  Down  :    45   Down  :    11   Down  :     1   Down  :  194   No    :101762  
##  No    :100224   No    :101060   No    :101677   No    :96572   Steady:     1  
##  Steady:  1384   Steady:   668   Steady:    79   Steady: 4670                  
##  Up    :   110   Up    :    24   Up    :     6   Up    :  327                  
##                                                                                
##                                                                                
##                                                                                
##   glipizide      glyburide     tolbutamide     pioglitazone   rosiglitazone 
##  Down  :  560   Down  :  564   No    :101740   Down  :  118   Down  :   87  
##  No    :89078   No    :91113   Steady:    23   No    :94436   No    :95399  
##  Steady:11355   Steady: 9274                   Steady: 6975   Steady: 6099  
##  Up    :  770   Up    :  812                   Up    :  234   Up    :  178  
##                                                                             
##                                                                             
##                                                                             
##    acarbose        miglitol      troglitazone     tolazamide     examide    
##  Down  :     3   Down  :     5   No    :101760   No    :101724   No:101763  
##  No    :101455   No    :101725   Steady:     3   Steady:    38              
##  Steady:   295   Steady:    31                   Up    :     1              
##  Up    :    10   Up    :     2                                              
##                                                                             
##                                                                             
##                                                                             
##  citoglipton   insulin      glyburide.metformin glipizide.metformin
##  No:101763   Down  :12218   Down  :     6       No    :101750      
##              No    :47380   No    :101057       Steady:    13      
##              Steady:30849   Steady:   692                          
##              Up    :11316   Up    :     8                          
##                                                                    
##                                                                    
##                                                                    
##  glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone
##  No    :101762            No    :101761           No    :101762         
##  Steady:     1            Steady:     2           Steady:     1         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  change     diabetesMed readmitted    diag_count
##  Ch:47009   No :23402   <30:11357   Min.   :3   
##  No:54754   Yes:78361   >30:35545   1st Qu.:3   
##                         NO :54861   Median :3   
##                                     Mean   :3   
##                                     3rd Qu.:3   
##                                     Max.   :3   
##

Age Group Distribution

This bar plot shows how patients are distributed across different age intervals. It helps to identify which age groups have higher hospital admission rates among diabetic patients.

# 1 Show how many patients are in each age group
ggplot(df, aes(age)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Age Groups", x = "Age Group", y = "Count")

Relationship: Age vs Time in Hospital

This boxplot shows how the time spent in the hospital varies across different age groups. ‘age’ is a categorical variable (in 10-year bins), and ‘time_in_hospital’ is numerical. The boxplot helps us observe the median stay and spread within each age group.

ggplot(df, aes(x = age, y = time_in_hospital)) +
  geom_boxplot(fill = "orchid") +  # Creates a boxplot with purple fill
  labs(
    title = "Age vs Time in Hospital",   # Title of the chart
    x = "Age",                           # Label for x-axis
    y = "Hospital Stay (days)"          # Label for y-axis
  )

Correlation: Number of Diagnoses vs Time in Hospital

This scatter plot examines the relationship between the number of diagnoses (diag_count) and the duration of hospital stay. Each dot represents a patient’s data. The red line is a linear regression trend line that helps assess whether more diagnoses are associated with longer hospital stays.

ggplot(df, aes(x = number_diagnoses, y = time_in_hospital)) +
  geom_point(alpha = 0.3) +                     # Scatter plot with transparent points for better visibility
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # Adds a linear trend line (no confidence interval)
  labs(
    title = "Diagnosis Count vs Time in Hospital",  # Chart title
     x = "Diagnosis Count",                          # x-axis label
     y = "Time in Hospital"                          # y-axis label
   )

## `geom_smooth()` using formula = 'y ~ x'

Race Distribution

This visualization shows the number of patients by race. It highlights racial demographics in the dataset and helps identify if certain groups are over- or under-represented.

# Show how the dataset is distributed across different races
# Convert race counts to a table
race_count <- as.data.frame(table(df$race))

# Pie chart
ggplot(race_count, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +  # Makes it circular
  labs(title = "Race Distribution") +
  theme_void()

Readmission Status

This chart breaks down the number of patients by their readmission status, such as “<30”, “>30”, and “NO”. It provides insights into how common readmission is and whether most patients return soon or much later.

# Show how many patients fall into each readmission class
readmit_count <- as.data.frame(table(df$readmitted))

ggplot(readmit_count, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Readmission Proportion") +
  theme_void()

Gender vs Readmission

This grouped bar chart shows the relationship between gender and readmission. It helps examine whether male or female patients tend to be readmitted more frequently.

# Check the readmission rate based on gender
ggplot(df, aes(gender, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(title = "Gender vs Readmission", x = "Gender", y = "Count")

Admission Types

This graph illustrates how patients are admitted, based on admission type codes (e.g., emergency, urgent, elective). Understanding admission types can help in identifying the severity or urgency of patient cases.

# Visualize types of hospital admissions in the dataset
ggplot(df, aes(admission_type_id)) +
  geom_bar(fill = "coral") +
  labs(title = "Admission Types", x = "Admission Type ID", y = "Count")

Correlation Matrix

Load the Dataset

data <- readRDS("diabetic_cleaned.rds")
summary(data)

##   encounter_id        patient_nbr                     race          gender     
##  Min.   :    12522   Min.   :      135   AfricanAmerican:19210   Female:54708  
##  1st Qu.: 84959748   1st Qu.: 23412964   Asian          :  641   Male  :47055  
##  Median :152388294   Median : 45500490   Caucasian      :78370                 
##  Mean   :165200788   Mean   : 54329650   Hispanic       : 2037                 
##  3rd Qu.:230269803   3rd Qu.: 87545714   Other          : 1505                 
##  Max.   :443867222   Max.   :189502619                                         
##                                                                                
##       age        admission_type_id discharge_disposition_id admission_source_id
##  70-79  :26066   Min.   :0.000     Min.   : 0.00            Min.   : 0.000     
##  60-69  :22482   1st Qu.:1.000     1st Qu.: 1.00            1st Qu.: 1.000     
##  50-59  :17256   Median :1.000     Median : 1.00            Median : 7.000     
##  80-89  :17197   Mean   :1.452     Mean   : 2.82            Mean   : 5.712     
##  40-49  : 9685   3rd Qu.:2.000     3rd Qu.: 3.00            3rd Qu.: 7.000     
##  30-39  : 3775   Max.   :7.000     Max.   :28.00            Max.   :25.000     
##  (Other): 5302                                                                 
##  time_in_hospital   payer_code                 medical_specialty
##  Min.   : 1.000   Unknown:40255   Unknown               :49947  
##  1st Qu.: 2.000   MC     :32439   InternalMedicine      :14635  
##  Median : 4.000   HM     : 6274   Emergency/Trauma      : 7565  
##  Mean   : 4.396   SP     : 5007   Family/GeneralPractice: 7440  
##  3rd Qu.: 6.000   BC     : 4655   Cardiology            : 5351  
##  Max.   :14.000   MD     : 3532   Surgery-General       : 3099  
##                   (Other): 9601   (Other)               :13726  
##  num_lab_procedures num_procedures num_medications number_outpatient
##  Min.   :  1.0      Min.   :0.00   Min.   : 1.00   Min.   : 0.0000  
##  1st Qu.: 31.0      1st Qu.:0.00   1st Qu.:10.00   1st Qu.: 0.0000  
##  Median : 44.0      Median :1.00   Median :15.00   Median : 0.0000  
##  Mean   : 43.1      Mean   :1.34   Mean   :16.02   Mean   : 0.3694  
##  3rd Qu.: 57.0      3rd Qu.:2.00   3rd Qu.:20.00   3rd Qu.: 0.0000  
##  Max.   :132.0      Max.   :6.00   Max.   :81.00   Max.   :42.0000  
##                                                                     
##  number_emergency  number_inpatient      diag_1          diag_2     
##  Min.   : 0.0000   Min.   : 0.0000   428    : 6862   276    : 6752  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   414    : 6580   428    : 6662  
##  Median : 0.0000   Median : 0.0000   786    : 4016   250    : 6071  
##  Mean   : 0.1978   Mean   : 0.6356   410    : 3614   427    : 5036  
##  3rd Qu.: 0.0000   3rd Qu.: 1.0000   486    : 3508   401    : 3736  
##  Max.   :76.0000   Max.   :21.0000   427    : 2766   496    : 3305  
##                                      (Other):74417   (Other):70201  
##      diag_3      number_diagnoses max_glu_serum A1Cresult     metformin    
##  250    :11555   Min.   : 1.000   >200: 1485    >7  : 3812   Down  :  575  
##  401    : 8288   1st Qu.: 6.000   >300: 1264    >8  : 8216   No    :81776  
##  276    : 5175   Median : 8.000   None:96417    None:84745   Steady:18345  
##  428    : 4577   Mean   : 7.423   Norm: 2597    Norm: 4990   Up    : 1067  
##  427    : 3955   3rd Qu.: 9.000                                            
##  414    : 3664   Max.   :16.000                                            
##  (Other):64549                                                             
##  repaglinide     nateglinide     chlorpropamide  glimepiride    acetohexamide  
##  Down  :    45   Down  :    11   Down  :     1   Down  :  194   No    :101762  
##  No    :100224   No    :101060   No    :101677   No    :96572   Steady:     1  
##  Steady:  1384   Steady:   668   Steady:    79   Steady: 4670                  
##  Up    :   110   Up    :    24   Up    :     6   Up    :  327                  
##                                                                                
##                                                                                
##                                                                                
##   glipizide      glyburide     tolbutamide     pioglitazone   rosiglitazone 
##  Down  :  560   Down  :  564   No    :101740   Down  :  118   Down  :   87  
##  No    :89078   No    :91113   Steady:    23   No    :94436   No    :95399  
##  Steady:11355   Steady: 9274                   Steady: 6975   Steady: 6099  
##  Up    :  770   Up    :  812                   Up    :  234   Up    :  178  
##                                                                             
##                                                                             
##                                                                             
##    acarbose        miglitol      troglitazone     tolazamide     examide    
##  Down  :     3   Down  :     5   No    :101760   No    :101724   No:101763  
##  No    :101455   No    :101725   Steady:     3   Steady:    38              
##  Steady:   295   Steady:    31                   Up    :     1              
##  Up    :    10   Up    :     2                                              
##                                                                             
##                                                                             
##                                                                             
##  citoglipton   insulin      glyburide.metformin glipizide.metformin
##  No:101763   Down  :12218   Down  :     6       No    :101750      
##              No    :47380   No    :101057       Steady:    13      
##              Steady:30849   Steady:   692                          
##              Up    :11316   Up    :     8                          
##                                                                    
##                                                                    
##                                                                    
##  glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone
##  No    :101762            No    :101761           No    :101762         
##  Steady:     1            Steady:     2           Steady:     1         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  change     diabetesMed readmitted    diag_count
##  Ch:47009   No :23402   <30:11357   Min.   :3   
##  No:54754   Yes:78361   >30:35545   1st Qu.:3   
##                         NO :54861   Median :3   
##                                     Mean   :3   
##                                     3rd Qu.:3   
##                                     Max.   :3   
##

Variable Selection

In correlation analysis, only numeric variables are applicable. Therefore, the steps below outline the selection of numerical variables.

# Select only numeric columns
numeric_data <- data[sapply(data, is.numeric)]

# Check selected columns
str(numeric_data)

## 'data.frame':    101763 obs. of  14 variables:
##  $ encounter_id            : int  2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ admission_type_id       : int  0 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  0 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ diag_count              : int  3 3 3 3 3 3 3 3 3 3 ...

Normality Assumption

The normality assumption needs to be assessed to determine the appropriate type of correlation analysis—-whether to use Pearson or Spearman. If the data are normally distributed, Pearson will be used; otherwise, Spearman correlation will be applied.

# Set up plotting layout
par(mfrow = c(2, 2))  # 2x2 grid for each variable

# Loop through columns
for (col in names(numeric_data)) {
  # Skip if no variance or too many NAs
  if (sd(numeric_data[[col]], na.rm = TRUE) == 0 || sum(is.na(numeric_data[[col]])) > 0.5 * nrow(numeric_data)) next
  
  # Histogram with density curve
  hist(numeric_data[[col]], 
       breaks = 30, 
       prob = TRUE, 
       main = paste("Histogram of", col),
       xlab = col)
  lines(density(numeric_data[[col]], na.rm = TRUE), col = "red", lwd = 2)
  
  # Q-Q plot
  qqnorm(numeric_data[[col]], main = paste("Q-Q Plot of", col))
  qqline(numeric_data[[col]], col = "red")
}

# Reset plotting layout
par(mfrow = c(1, 1))

CONCLUSION: The histogram shows that most of the independent variables are not normally distributed. Therefore, Spearman correlation analysis will be used to assess the relationships.

Correlation Matrix

Next, the correlation matrix will be calculated. If any variable has a standard deviation of zero (0), it will be removed from the list of numerical variables, and the matrix will be recalculated.

# Calculate the correlation matrix [Output: warning]
cor_matrix <- cor(numeric_data, use = "complete.obs", method = "spearman")

## Warning in cor(numeric_data, use = "complete.obs", method = "spearman"): the
## standard deviation is zero

# Find columns with zero standard deviation
zero_sd_col <- names(numeric_data)[sapply(numeric_data, function(x) sd(x, na.rm = TRUE) == 0)]
print(zero_sd_col)

## [1] "diag_count"

# Remove columns with zero standard deviation
numeric_data_filtered <- numeric_data[, sapply(numeric_data, function(x) sd(x, na.rm = TRUE) != 0)]

# Recompute correlation matrix
cor_matrix <- cor(numeric_data_filtered, use = "complete.obs", method = "spearman")
print(round(cor_matrix, digits = 4))

##                          encounter_id patient_nbr admission_type_id
## encounter_id                   1.0000      0.5436            0.1031
## patient_nbr                    0.5436      1.0000            0.0210
## admission_type_id              0.1031      0.0210            1.0000
## discharge_disposition_id       0.1582      0.1726           -0.0305
## admission_source_id           -0.0381      0.0375           -0.7001
## time_in_hospital              -0.0599     -0.0170            0.0047
## num_lab_procedures            -0.0087      0.0266           -0.1700
## num_procedures                -0.0314     -0.0194            0.2123
## num_medications                0.1022      0.0447            0.0648
## number_outpatient              0.1509      0.1548           -0.0725
## number_emergency               0.1312      0.1132           -0.0435
## number_inpatient               0.0374      0.0259           -0.0273
## number_diagnoses               0.2925      0.2404           -0.0352
##                          discharge_disposition_id admission_source_id
## encounter_id                               0.1582             -0.0381
## patient_nbr                                0.1726              0.0375
## admission_type_id                         -0.0305             -0.7001
## discharge_disposition_id                   1.0000              0.0492
## admission_source_id                        0.0492              1.0000
## time_in_hospital                           0.2381              0.0040
## num_lab_procedures                         0.0729              0.1342
## num_procedures                            -0.0029             -0.2067
## num_medications                            0.1839             -0.0620
## number_outpatient                          0.0825              0.0281
## number_emergency                           0.0524              0.1058
## number_inpatient                           0.0914              0.0580
## number_diagnoses                           0.2116              0.1084
##                          time_in_hospital num_lab_procedures num_procedures
## encounter_id                      -0.0599            -0.0087        -0.0314
## patient_nbr                       -0.0170             0.0266        -0.0194
## admission_type_id                  0.0047            -0.1700         0.2123
## discharge_disposition_id           0.2381             0.0729        -0.0029
## admission_source_id                0.0040             0.1342        -0.2067
## time_in_hospital                   1.0000             0.3368         0.1872
## num_lab_procedures                 0.3368             1.0000         0.0229
## num_procedures                     0.1872             0.0229         1.0000
## num_medications                    0.4651             0.2516         0.3515
## number_outpatient                 -0.0133            -0.0236        -0.0237
## number_emergency                  -0.0009             0.0060        -0.0459
## number_inpatient                   0.0922             0.0413        -0.0636
## number_diagnoses                   0.2366             0.1691         0.0673
##                          num_medications number_outpatient number_emergency
## encounter_id                      0.1022            0.1509           0.1312
## patient_nbr                       0.0447            0.1548           0.1132
## admission_type_id                 0.0648           -0.0725          -0.0435
## discharge_disposition_id          0.1839            0.0825           0.0524
## admission_source_id              -0.0620            0.0281           0.1058
## time_in_hospital                  0.4651           -0.0133          -0.0009
## num_lab_procedures                0.2516           -0.0236           0.0060
## num_procedures                    0.3515           -0.0237          -0.0459
## num_medications                   1.0000            0.0743           0.0437
## number_outpatient                 0.0743            1.0000           0.1774
## number_emergency                  0.0437            0.1774           1.0000
## number_inpatient                  0.0992            0.1565           0.2223
## number_diagnoses                  0.2937            0.1130           0.0923
##                          number_inpatient number_diagnoses
## encounter_id                       0.0374           0.2925
## patient_nbr                        0.0259           0.2404
## admission_type_id                 -0.0273          -0.0352
## discharge_disposition_id           0.0914           0.2116
## admission_source_id                0.0580           0.1084
## time_in_hospital                   0.0922           0.2366
## num_lab_procedures                 0.0413           0.1691
## num_procedures                    -0.0636           0.0673
## num_medications                    0.0992           0.2937
## number_outpatient                  0.1565           0.1130
## number_emergency                   0.2223           0.0923
## number_inpatient                   1.0000           0.1358
## number_diagnoses                   0.1358           1.0000

Plotting the Correlation Matrix

To facilitate interpretation, a heatmap of the correlation matrix is presented below.

# Load required packages
library(corrplot)

## corrplot 0.95 loaded

# Plot the correlation matrix
corrplot(cor_matrix, method = "color", addCoef.col = "black", number.cex = 0.7, tl.cex = 0.8)

CONCLUSION: This correlation matrix shows relationships between healthcare-related variables, with values ranging from -0.70 (strong negative) to 1.00 (perfect positive). Most correlations are weak, but a few stand out as moderately strong.

Strongest Negative Correlation

i) The strongest relationship is between admission_type_id and admission_source_id (-0.70), indicating these variables move in opposite directions.
📌 For example, emergency admissions might come from different sources than planned admissions.

Hospital Stay Duration Relationships

i) time_in_hospital moderately correlates with:
a. num_medications (0.47) — patients staying longer tend to receive more medications.
b. num_lab_procedures (0.34) — longer stays involve more lab tests.

📌 Longer hospitalization is associated with increased medication and testing.

Treatment Intensity Interconnections

i) num_procedures and num_medications (0.35) — patients undergoing more procedures also receive more medications.

ii) number_diagnoses and num_medications (0.29) — patients with more diagnoses tend to receive more medications.

📌 These moderate correlations suggest that treatment intensity variables reinforce each other.

Artificial Correlation Between Identifiers

i) encounter_id and patient_nbr show a correlation of 0.54, but this is artificial.
📌 These are identifiers and not clinical features — their correlation reflects ID generation patterns, not medical trends.

Regrouping Data

The initial correlation analysis across all observations showed limited correlation between variables. However, medical data often contains patient-specific patterns that can be obscured in population-level analyses. By regrouping the data by patient number and analyzing correlations within patients, we account for between-patient variability and are better able to detect meaningful within-patient relationships between variables.

# Load required packages
library(dplyr)

data_sum <- data %>%
  group_by(patient_nbr) %>%
  summarise(
    encounter_count = n_distinct(encounter_id),
    total_lab_procedures = sum(num_lab_procedures, na.rm = TRUE),
    total_procedures = sum(num_procedures, na.rm = TRUE),
    total_medications = sum(num_medications, na.rm = TRUE),
    total_outpatient = sum(number_outpatient, na.rm = TRUE),
    total_emergency = sum(number_emergency, na.rm = TRUE),
    total_inpatient = sum(number_inpatient, na.rm = TRUE),
    total_time_in_hospital = sum(time_in_hospital, na.rm = TRUE),
    total_num_of_diagnoses = sum(number_diagnoses, na.rm = TRUE)
  )

print(data_sum, width = Inf)

## # A tibble: 71,515 × 10
##    patient_nbr encounter_count total_lab_procedures total_procedures
##          <int>           <int>                <int>            <int>
##  1         135               2                  108                7
##  2         378               1                   49                1
##  3         729               1                   68                2
##  4         774               1                   46                0
##  5         927               1                   49                0
##  6        1152               5                  209               10
##  7        1305               1                   52                1
##  8        1314               3                  151               13
##  9        1629               1                   21                0
## 10        2025               1                   47                2
##    total_medications total_outpatient total_emergency total_inpatient
##                <int>            <int>           <int>           <int>
##  1                47                0               0               1
##  2                11                0               0               0
##  3                23                0               0               0
##  4                20                0               0               0
##  5                 5                0               0               0
##  6                81                0               0               7
##  7                16                0               0               0
##  8                39                0               0               3
##  9                15                3               1               2
## 10                18                0               0               0
##    total_time_in_hospital total_num_of_diagnoses
##                     <int>                  <int>
##  1                     11                     13
##  2                      2                      3
##  3                      4                      9
##  4                      3                      9
##  5                      5                      3
##  6                     42                     24
##  7                      9                      9
##  8                      6                     23
##  9                     14                      7
## 10                     12                      9
## # ℹ 71,505 more rows

Plotting New Correlation Matrix

# Calculate the correlation matrix
cor_matrix <- cor(data_sum, use = "complete.obs", method = "pearson")
cor_matrix

##                         patient_nbr encounter_count total_lab_procedures
## patient_nbr             1.000000000      -0.0238243          -0.01266973
## encounter_count        -0.023824305       1.0000000           0.86901056
## total_lab_procedures   -0.012669730       0.8690106           1.00000000
## total_procedures       -0.029122133       0.4288023           0.39436875
## total_medications      -0.008321752       0.8591692           0.80549272
## total_outpatient        0.073490419       0.3400953           0.29218231
## total_emergency         0.027980837       0.3761894           0.30920086
## total_inpatient         0.001316926       0.7840189           0.69579065
## total_time_in_hospital -0.032822009       0.7950961           0.78456614
## total_num_of_diagnoses  0.046033256       0.9535311           0.85011767
##                        total_procedures total_medications total_outpatient
## patient_nbr                 -0.02912213      -0.008321752       0.07349042
## encounter_count              0.42880229       0.859169174       0.34009527
## total_lab_procedures         0.39436875       0.805492718       0.29218231
## total_procedures             1.00000000       0.532230844       0.12860526
## total_medications            0.53223084       1.000000000       0.32574607
## total_outpatient             0.12860526       0.325746075       1.00000000
## total_emergency              0.14623582       0.318199562       0.18820467
## total_inpatient              0.31742801       0.675796319       0.29234923
## total_time_in_hospital       0.43400973       0.812212343       0.25548036
## total_num_of_diagnoses       0.43703018       0.865713293       0.35260557
##                        total_emergency total_inpatient total_time_in_hospital
## patient_nbr                 0.02798084     0.001316926            -0.03282201
## encounter_count             0.37618944     0.784018938             0.79509610
## total_lab_procedures        0.30920086     0.695790647             0.78456614
## total_procedures            0.14623582     0.317428005             0.43400973
## total_medications           0.31819956     0.675796319             0.81221234
## total_outpatient            0.18820467     0.292349230             0.25548036
## total_emergency             1.00000000     0.486626309             0.26405294
## total_inpatient             0.48662631     1.000000000             0.63023969
## total_time_in_hospital      0.26405294     0.630239685             1.00000000
## total_num_of_diagnoses      0.36386185     0.734070207             0.79334257
##                        total_num_of_diagnoses
## patient_nbr                        0.04603326
## encounter_count                    0.95353105
## total_lab_procedures               0.85011767
## total_procedures                   0.43703018
## total_medications                  0.86571329
## total_outpatient                   0.35260557
## total_emergency                    0.36386185
## total_inpatient                    0.73407021
## total_time_in_hospital             0.79334257
## total_num_of_diagnoses             1.00000000

# Plot the correlation matrix with new variables
corrplot(cor_matrix, method = "color", addCoef.col = "black", number.cex = 0.7)

CONCLUSION:

Strongest Correlations

i) encounter_count has very strong positive correlations with:
    a. total_num_of_diagnoses (0.954)
    b. total_medications (0.859)
    c. total_lab_procedures (0.869)
    d. total_time_in_hospital (0.795)

📌 Patients with more medical encounters tend to undergo more diagnostic tests, receive more medications, and have longer hospital stays.

ii) total_num_of_diagnoses is also strongly correlated with:
    a. total_medications (0.866)
    b. total_lab_procedures (0.850)
    c. total_time_in_hospital (0.793)

📌 These strong positive correlations indicate that as the number of diagnoses increases, so do the medical interventions and hospital resources used, which aligns with clinical expectations.

Moderate Correlations

i) total_time_in_hospital is moderately correlated with:
a. total_medications (0.812)
b. total_inpatient (0.630)

📌 Longer hospital stays involve more medical interventions.

ii) total_inpatient is moderately correlated with:
a. total_medications (0.676)
b. total_num_of_diagnoses (0.734)

📌 Inpatient care is associated with higher treatment complexity.

Weak or Negligible Correlations

i) patient_nbr (likely a patient identifier) shows almost no correlation with other variables, as expected (all values close to 0).

ii) total_outpatient and total_emergency show weaker correlations with other variables (mostly < 0.4).

Prediction & Result

Regression: Modeling Hospital Stay Duration

# =========================
# 01. Load Libraries
# =========================

# ============================================
# 02. Load and Preprocess Dataset
# ============================================
library(dplyr)

df <- readRDS("diabetic_cleaned.rds")

# Convert character columns to factors
df <- mutate(df, across(where(is.character), as.factor))



# Create Empty Lists

id_like_vars <- c()
high_missing_vars <- c()
binary_vars <- c()
numeric_vars <- c()
categorical_vars <- c()

# Track which variables have already been classified
classified_vars <- c()


# ID-like Variables
id_like_vars <- names(df)[sapply(df, function(x) length(unique(x)) == nrow(df))]
classified_vars <- union(classified_vars, id_like_vars)


# High Missing Value Variables
missing_pct <- sapply(df, function(x) mean(is.na(x)))
high_missing_vars <- names(missing_pct[missing_pct > 0.3])
high_missing_vars <- setdiff(high_missing_vars, classified_vars)
classified_vars <- union(classified_vars, high_missing_vars)


# Binary Variables (2 unique values only)
binary_vars <- names(df)[sapply(df, function(x) length(unique(x)) == 2)]
binary_vars <- setdiff(binary_vars, classified_vars)
classified_vars <- union(classified_vars, binary_vars)


# Numeric Variables
numeric_vars <- names(df)[sapply(df, function(x) is.numeric(x) | is.integer(x))]
numeric_vars <- setdiff(numeric_vars, classified_vars)
classified_vars <- union(classified_vars, numeric_vars)


# Categorical Variables (Factor with > 2 levels)
categorical_vars <- names(df)[sapply(df, is.factor)]
categorical_vars <- setdiff(categorical_vars, classified_vars)
classified_vars <- union(classified_vars, categorical_vars)

# ============================================
# 03. Feature Preparation
# ============================================

# Define modeling target
target_var <- "time_in_hospital"

# Exclude variables
exclude_vars <- c(id_like_vars, high_missing_vars,categorical_vars, target_var, "readmitted")
all_vars <- setdiff(names(df), exclude_vars)
all_vars <- unique(c(all_vars, "age"))

# Subset to usable predictors + target
model_df <- df[, c(target_var, all_vars)]

# Drop rows with missing values (simple)
model_df <- na.omit(model_df)

# Convert character-type columns to factor if missed
model_df <- model_df %>% mutate(across(where(is.character), as.factor))

# Check structure
dim(model_df)

## [1] 101763     24

Train-Test Split

The dataset was partitioned into training and testing subsets to ensure consistency across both subsets for valid model evaluation.

# ============================================
# 04. Train-Test Split
# ============================================
library(caret)

## Loading required package: lattice

set.seed(99)
train_index <- createDataPartition(model_df[[target_var]], p = 0.7, list = FALSE)
train_data <- model_df[train_index, ]
test_data <- model_df[-train_index, ]

# Ensure factor levels match between train and test
for (col in names(train_data)) {
  if (is.factor(train_data[[col]])) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
  }
}

Feature Importance

Model was trained using the Ranger algorithm to evaluate feature importance for predicting hospital stay duration, revealing the top predictors based on impurity-based measures and are visualized as ranked bar chart above.

# ============================================
# 05. Feature Importance- Ranger
# ============================================

library(ranger)
library(dplyr)
library(caret)
library(tibble)

# Define target variable
target_var <- "time_in_hospital"

# Remove rows with NA  
train_data <- na.omit(train_data)

# Ensure factors are typed consistently
train_data <- train_data %>%
  mutate(across(where(is.character), as.factor))

# Train ranger model with importance
set.seed(123)
ranger_model <- ranger(
  formula = as.formula(paste(target_var, "~ .")),
  data = train_data,
  importance = "impurity",   # options: "impurity", "permutation"
  num.trees = 100
)

# Extract variable importance
importance_df <- as.data.frame(ranger_model$variable.importance) %>%
  rownames_to_column("Variable") %>%
  setNames(c("Variable", "Importance")) %>%
  arrange(desc(Importance))

# View top 15 variables
print(head(importance_df, 15))

##                    Variable Importance
## 1           num_medications 113297.056
## 2        num_lab_procedures  77335.627
## 3               patient_nbr  48921.809
## 4  discharge_disposition_id  45460.302
## 5            num_procedures  30813.725
## 6          number_diagnoses  22344.653
## 7                       age  20654.923
## 8         admission_type_id  15023.388
## 9       admission_source_id  14815.870
## 10         number_inpatient  14787.299
## 11        number_outpatient   8118.175
## 12         number_emergency   6334.988
## 13                   gender   6258.516
## 14                   change   6151.329
## 15              diabetesMed   4387.343

# Bar plot
library(ggplot2)
ggplot(importance_df[1:15, ], aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Important Variables from Ranger",
       x = "Variable", y = "Importance") +
  theme_minimal()

Variables Selection

The top 15 most important variables identified earlier were selected to construct a reduced training dataset, facilitating more focused and computationally efficient model development.

# ============================================
# 06. Select Top N Variables
# ============================================

top_n <- 15
top_vars <- importance_df$Variable[1:top_n]
print(top_vars)

##  [1] "num_medications"          "num_lab_procedures"      
##  [3] "patient_nbr"              "discharge_disposition_id"
##  [5] "num_procedures"           "number_diagnoses"        
##  [7] "age"                      "admission_type_id"       
##  [9] "admission_source_id"      "number_inpatient"        
## [11] "number_outpatient"        "number_emergency"        
## [13] "gender"                   "change"                  
## [15] "diabetesMed"

# Subset train data to include only top variables + target
train_top <- train_data[, c(target_var, top_vars)]

# Optional: Convert all character to factors again if needed
train_top <- train_top %>%
  mutate(across(where(is.character), as.factor))

Models

Linear Regression Model

The selected were subsequently used to train three predictive models, Multiple Linear Regression, Random Forest, and XGBoost. This approach aimed to evaluate and compare each model’s ability to predict hospital stay duration by leveraging the most influential features while balancing interpretability, computational efficiency, and predictive accuracy.

# ============================================
# 07. Train Linear Model (lm)
# ============================================

lm_model <- lm(as.formula(paste(target_var, "~ .")), data = train_top)
summary(lm_model)

## 
## Call:
## lm(formula = as.formula(paste(target_var, "~ .")), data = train_top)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9922 -1.7139 -0.4894  1.1919 12.7692 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.631e-01  2.443e-01   1.896  0.05799 .  
## num_medications           1.351e-01  1.427e-03  94.651  < 2e-16 ***
## num_lab_procedures        3.047e-02  5.178e-04  58.838  < 2e-16 ***
## patient_nbr              -4.822e-09  2.538e-10 -18.997  < 2e-16 ***
## discharge_disposition_id  6.430e-02  2.458e-03  26.166  < 2e-16 ***
## num_procedures            7.396e-02  6.243e-03  11.847  < 2e-16 ***
## number_diagnoses          1.256e-01  5.460e-03  23.001  < 2e-16 ***
## age10-19                  1.135e-01  2.631e-01   0.431  0.66621    
## age20-29                 -4.365e-01  2.500e-01  -1.746  0.08078 .  
## age30-39                 -4.930e-01  2.435e-01  -2.025  0.04289 *  
## age40-49                 -5.007e-01  2.406e-01  -2.081  0.03744 *  
## age50-59                 -6.359e-01  2.399e-01  -2.650  0.00805 ** 
## age60-69                 -5.051e-01  2.398e-01  -2.106  0.03521 *  
## age70-79                 -2.448e-01  2.397e-01  -1.021  0.30714    
## age80-89                  6.364e-02  2.402e-01   0.265  0.79105    
## age90-99                  1.395e-01  2.459e-01   0.567  0.57048    
## admission_type_id         1.536e-02  1.468e-02   1.047  0.29529    
## admission_source_id       9.294e-04  3.219e-03   0.289  0.77280    
## number_inpatient          9.453e-02  7.833e-03  12.068  < 2e-16 ***
## number_outpatient        -7.544e-02  7.540e-03 -10.005  < 2e-16 ***
## number_emergency         -4.795e-02  1.002e-02  -4.787 1.70e-06 ***
## genderMale               -8.291e-02  1.917e-02  -4.325 1.52e-05 ***
## changeNo                 -9.515e-02  2.249e-02  -4.231 2.33e-05 ***
## diabetesMedYes           -1.462e-01  2.616e-02  -5.590 2.28e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.525 on 71211 degrees of freedom
## Multiple R-squared:  0.287,  Adjusted R-squared:  0.2868 
## F-statistic:  1246 on 23 and 71211 DF,  p-value: < 2.2e-16

plot(lm_model)

Random Forest

# ============================================
# 08. Train Random Forest
# ============================================

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ranger':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

set.seed(123)
rf_model <- randomForest(
  formula = as.formula(paste(target_var, "~ .")),
  data = train_top,
  ntree = 30,
  importance = TRUE
)
print(rf_model)

## 
## Call:
##  randomForest(formula = as.formula(paste(target_var, "~ .")),      data = train_top, ntree = 30, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 30
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 5.95309
##                     % Var explained: 33.4

varImpPlot(rf_model)

XGBoost

# ============================================
# 09. Train XGBoost
# ============================================
library(xgboost)

## 
## Attaching package: 'xgboost'

## The following object is masked from 'package:dplyr':
## 
##     slice

library(Matrix)

# Convert to numeric matrix (XGBoost requires numeric input)
# One-hot encode categorical variables
dummies <- model.matrix(as.formula(paste(target_var, "~ .")), data = train_top)[, -1]
labels <- train_top[[target_var]]

# Create DMatrix
dtrain <- xgb.DMatrix(data = dummies, label = labels)

# Train XGBoost model
set.seed(123)
xgb_model <- xgboost(
  data = dtrain,
  nrounds = 100,
  objective = "reg:squarederror",
  verbose = 0
)

# Get and plot feature importance
print(xgb_model)

## ##### xgb.Booster
## raw: 395.1 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = nrounds, 
##     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
##     early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
##     callbacks = callbacks, objective = "reg:squarederror")
## params (as set within xgb.train):
##   objective = "reg:squarederror", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 23 
## niter: 100
## nfeatures : 23 
## evaluation_log:
##   iter train_rmse
##  <num>      <num>
##      1   3.861824
##      2   3.213284
##    ---        ---
##     99   2.139742
##    100   2.139444

importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = top_n, measure = "Gain")

Performance

The predictive performance of the models was assessed on the test set using RMSE and R-squared metrics, enabling a comparative analysis of each model’s accuracy and goodness-of-fit in estimating hospital stay duration based on the selected top predictors.

# =========================
# 10. Evaluate Models
# =========================

# Required libraries
library(caret)       # for R2()
library(broom)       # for tidy model output
library(ggplot2)     # for plotting

# --- Custom Evaluation Functions ---
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

r_squared <- function(actual, predicted) {
  cor(actual, predicted)^2
}

# --- Prepare Test Set with Top Variables ---
test_top <- test_data[, c(target_var, top_vars)]

# Ensure data types match
test_top <- test_top %>%
  mutate(across(where(is.character), as.factor))

# True labels
test_labels <- test_top[[target_var]]

# ========================
# --- LINEAR MODEL ---
# ========================
lm_preds <- predict(lm_model, newdata = test_top)
lm_rmse <- rmse(test_labels, lm_preds)
lm_r2 <- r_squared(test_labels, lm_preds)

cat("\n====================\nLinear Model Evaluation\n====================\n")

## 
## ====================
## Linear Model Evaluation
## ====================

cat("  RMSE     :", round(lm_rmse, 4), "\n")

##   RMSE     : 2.5078

cat("  R-squared:", round(lm_r2, 4), "\n")

##   R-squared: 0.2889

# ========================
# --- RANDOM FOREST ---
# ========================
rf_preds <- predict(rf_model, newdata = test_top)
rf_rmse <- rmse(test_labels, rf_preds)
rf_r2 <- R2(rf_preds, test_labels)

cat("\n====================\nRandom Forest Evaluation\n====================\n")

## 
## ====================
## Random Forest Evaluation
## ====================

cat("  RMSE     :", round(rf_rmse, 4), "\n")

##   RMSE     : 2.3588

cat("  R-squared:", round(rf_r2, 4), "\n")

##   R-squared: 0.3715

# ========================
# --- XGBOOST ---
# ========================
# One-hot encode test set (same structure as train)
test_dummies <- model.matrix(as.formula(paste(target_var, "~ .")), data = test_top)[, -1]
dtest <- xgb.DMatrix(data = test_dummies)

xgb_preds <- predict(xgb_model, dtest)
xgb_rmse <- rmse(test_labels, xgb_preds)
xgb_r2 <- R2(xgb_preds, test_labels)

cat("\n====================\nXGBoost Evaluation\n====================\n")

## 
## ====================
## XGBoost Evaluation
## ====================

cat("  RMSE     :", round(xgb_rmse, 4), "\n")

##   RMSE     : 2.3089

cat("  R-squared:", round(xgb_r2, 4), "\n")

##   R-squared: 0.3979

# Combine results into a summary data frame
results_df <- data.frame(
  Model = c("Linear Regression", "Random Forest", "XGBoost"),
  RMSE = c(lm_rmse, rf_rmse, xgb_rmse),
  R_Squared = c(lm_r2, rf_r2, xgb_r2)
)

print(results_df)

##               Model     RMSE R_Squared
## 1 Linear Regression 2.507828 0.2889076
## 2     Random Forest 2.358822 0.3715352
## 3           XGBoost 2.308908 0.3979181

XGBoost is the best-performing model among the three. It achieves the lowest prediction error (RMSE) and the highest explanatory power (R-squared). It is therefore the most suitable model for predicting hospital stay duration.

Tuning

A grid search with cross-validation was conducted to optimize XGBoost hyperparameters,identifying the parameter combination that minimized test RMSE, thereby enhancing the model’s generalization performance while maintaining computational efficiency.

# ============================================
# 11. Tuning with Cross-Validation
# ============================================

# Define a small, efficient tuning grid
quick_grid <- expand.grid(
  max_depth = c(3, 5),
  eta = c(0.05, 0.1),
  subsample = c(0.8),
  colsample_bytree = c(0.8),
  min_child_weight = c(1)
)

best_rmse <- Inf
best_params <- list()
best_nrounds <- 100

for (i in 1:nrow(quick_grid)) {
  params <- list(
    booster = "gbtree",
    objective = "reg:squarederror",
    eval_metric = "rmse",
    max_depth = quick_grid$max_depth[i],
    eta = quick_grid$eta[i],
    subsample = quick_grid$subsample[i],
    colsample_bytree = quick_grid$colsample_bytree[i],
    min_child_weight = quick_grid$min_child_weight[i]
  )
  
  set.seed(123)
  cv_model <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = 300,
    nfold = 3,                        # fewer folds for speed
    early_stopping_rounds = 10,
    verbose = 0
  )
  
  mean_rmse <- min(cv_model$evaluation_log$test_rmse_mean)
  
  if (mean_rmse < best_rmse) {
    best_rmse <- mean_rmse
    best_params <- params
    best_nrounds <- cv_model$best_iteration
  }
}

cat("Best RMSE:", round(best_rmse, 4), "\n")

## Best RMSE: 2.3167

print(best_params)

## $booster
## [1] "gbtree"
## 
## $objective
## [1] "reg:squarederror"
## 
## $eval_metric
## [1] "rmse"
## 
## $max_depth
## [1] 5
## 
## $eta
## [1] 0.1
## 
## $subsample
## [1] 0.8
## 
## $colsample_bytree
## [1] 0.8
## 
## $min_child_weight
## [1] 1

Final Model

Utilizing the optimal hyperparameters derived from cross-validation, the final XGBoost model demonstrated enhanced performance, confirming its improved ability to explain and predict hospital stay duration based on the selected features.

# ============================================
# 12. Final Model with Best Parameters
# ============================================
set.seed(123)
xgb_model_best <- xgboost(
  data = dtrain,
  nrounds = best_nrounds,
  params = best_params,
  verbose = 0
)

# Variable Importance
importance_matrix_best <- xgb.importance(model = xgb_model_best)
xgb.plot.importance(importance_matrix_best, top_n = top_n, measure = "Gain")

# Evaluate
preds <- predict(xgb_model_best, dtrain)
rmse_final <- sqrt(mean((preds - labels)^2))
r2_final <- 1 - sum((labels - preds)^2) / sum((labels - mean(labels))^2)

cat("\nFinal Tuned XGBoost Performance]\n")

## 
## Final Tuned XGBoost Performance]

cat("RMSE:", round(rmse_final, 4), "\n")

## RMSE: 2.2054

cat("R² Score:", round(r2_final, 4), "\n")

## R² Score: 0.4559

RMSE (Root Mean Squared Error): 2.2054 This indicates that, on average, the model’s predictions deviate by approximately 2.2 days from the actual hospital stay durations. In practical terms, it reflects the typical prediction error, which is acceptable if hospital stays range over a broad scale. A lower RMSE suggests better accuracy and less variability in the prediction errors.

R² Score (Coefficient of Determination): 0.4559 This value means that approximately 45.6% of the variability in hospital stay duration can be explained by the model using the selected features. While not extremely high, it reflects moderate explanatory power, which is expected in healthcare datasets where many unobserved factors (e.g., clinical decisions, patient-specific nuances) also influence outcomes.

Overall Interpretation

These values suggest that the tuned XGBoost model is reasonably effective at predicting hospital stay duration using the selected variables earlier. The model provides a useful approximation and could be valuable in hospital resource planning, early discharge prediction, or case prioritization, though there is still room for improvement through the inclusion of more detailed clinical data or unstructured variables.

Classification: Predict Readmission Class

In this new process, we are going to find top features first, and then do the classification on the top features only (removing the low features) But first, we are removing the features with lower than 2 factors level.

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1
## ✔ purrr     1.0.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ tidyr::expand()         masks Matrix::expand()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::lift()           masks caret::lift()
## ✖ randomForest::margin()  masks ggplot2::margin()
## ✖ tidyr::pack()           masks Matrix::pack()
## ✖ xgboost::slice()        masks dplyr::slice()
## ✖ tidyr::unpack()         masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)
library(randomForest)
library(e1071)
library(xgboost)
library(ROCR)

# Load cleaned dataset
df <- readRDS("diabetic_recleaned.rds")

# Convert target variable to factor
df$readmitted <- as.factor(df$readmitted)

# Check class distribution
table(df$readmitted)

## 
##   <30   >30    NO 
##  2929  8865 15684

# Split into training and testing sets
set.seed(123)
train_index <- createDataPartition(df$readmitted, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

Identify and remove low level factors

# 01: Identify the problem columns
sapply(train_data, function(col) {
  is.factor(col) && length(unique(col)) < 2
})

##             encounter_id              patient_nbr                     race 
##                    FALSE                    FALSE                    FALSE 
##                   gender                      age        admission_type_id 
##                    FALSE                    FALSE                    FALSE 
## discharge_disposition_id      admission_source_id         time_in_hospital 
##                    FALSE                    FALSE                    FALSE 
##               payer_code        medical_specialty       num_lab_procedures 
##                    FALSE                    FALSE                    FALSE 
##           num_procedures          num_medications        number_outpatient 
##                    FALSE                    FALSE                    FALSE 
##         number_emergency         number_inpatient                   diag_1 
##                    FALSE                    FALSE                    FALSE 
##                   diag_2                   diag_3         number_diagnoses 
##                    FALSE                    FALSE                    FALSE 
##            max_glu_serum                A1Cresult                metformin 
##                    FALSE                    FALSE                    FALSE 
##              repaglinide              nateglinide           chlorpropamide 
##                    FALSE                    FALSE                    FALSE 
##              glimepiride            acetohexamide                glipizide 
##                    FALSE                     TRUE                    FALSE 
##                glyburide              tolbutamide             pioglitazone 
##                    FALSE                     TRUE                    FALSE 
##            rosiglitazone                 acarbose                 miglitol 
##                    FALSE                    FALSE                    FALSE 
##             troglitazone               tolazamide                  examide 
##                     TRUE                    FALSE                     TRUE 
##              citoglipton                  insulin      glyburide.metformin 
##                     TRUE                    FALSE                    FALSE 
##      glipizide.metformin glimepiride.pioglitazone  metformin.rosiglitazone 
##                    FALSE                     TRUE                     TRUE 
##   metformin.pioglitazone                   change              diabetesMed 
##                    FALSE                    FALSE                    FALSE 
##               readmitted               diag_count 
##                    FALSE                    FALSE

names(which(sapply(train_data, function(col) is.factor(col) && length(unique(col)) < 2)))

## [1] "acetohexamide"            "tolbutamide"             
## [3] "troglitazone"             "examide"                 
## [5] "citoglipton"              "glimepiride.pioglitazone"
## [7] "metformin.rosiglitazone"

# 02: Drop problematic columns
bad_vars <- names(which(sapply(train_data, function(col) is.factor(col) && length(unique(col)) < 2)))
train_data_clean <- train_data[ , !(names(train_data) %in% bad_vars)]

test_data_clean <- test_data[ , !(names(test_data) %in% bad_vars)]

# double-check
bad_vars

## [1] "acetohexamide"            "tolbutamide"             
## [3] "troglitazone"             "examide"                 
## [5] "citoglipton"              "glimepiride.pioglitazone"
## [7] "metformin.rosiglitazone"

names(train_data)

##  [1] "encounter_id"             "patient_nbr"             
##  [3] "race"                     "gender"                  
##  [5] "age"                      "admission_type_id"       
##  [7] "discharge_disposition_id" "admission_source_id"     
##  [9] "time_in_hospital"         "payer_code"              
## [11] "medical_specialty"        "num_lab_procedures"      
## [13] "num_procedures"           "num_medications"         
## [15] "number_outpatient"        "number_emergency"        
## [17] "number_inpatient"         "diag_1"                  
## [19] "diag_2"                   "diag_3"                  
## [21] "number_diagnoses"         "max_glu_serum"           
## [23] "A1Cresult"                "metformin"               
## [25] "repaglinide"              "nateglinide"             
## [27] "chlorpropamide"           "glimepiride"             
## [29] "acetohexamide"            "glipizide"               
## [31] "glyburide"                "tolbutamide"             
## [33] "pioglitazone"             "rosiglitazone"           
## [35] "acarbose"                 "miglitol"                
## [37] "troglitazone"             "tolazamide"              
## [39] "examide"                  "citoglipton"             
## [41] "insulin"                  "glyburide.metformin"     
## [43] "glipizide.metformin"      "glimepiride.pioglitazone"
## [45] "metformin.rosiglitazone"  "metformin.pioglitazone"  
## [47] "change"                   "diabetesMed"             
## [49] "readmitted"               "diag_count"

remove_constant_factors <- function(df) {
  df[, !(sapply(df, function(col) is.factor(col) && length(unique(col)) < 2))]
}

train_data <- remove_constant_factors(train_data)
test_data <- remove_constant_factors(test_data)

# double-check again
bad_vars

## [1] "acetohexamide"            "tolbutamide"             
## [3] "troglitazone"             "examide"                 
## [5] "citoglipton"              "glimepiride.pioglitazone"
## [7] "metformin.rosiglitazone"

names(train_data)

##  [1] "encounter_id"             "patient_nbr"             
##  [3] "race"                     "gender"                  
##  [5] "age"                      "admission_type_id"       
##  [7] "discharge_disposition_id" "admission_source_id"     
##  [9] "time_in_hospital"         "payer_code"              
## [11] "medical_specialty"        "num_lab_procedures"      
## [13] "num_procedures"           "num_medications"         
## [15] "number_outpatient"        "number_emergency"        
## [17] "number_inpatient"         "diag_1"                  
## [19] "diag_2"                   "diag_3"                  
## [21] "number_diagnoses"         "max_glu_serum"           
## [23] "A1Cresult"                "metformin"               
## [25] "repaglinide"              "nateglinide"             
## [27] "chlorpropamide"           "glimepiride"             
## [29] "glipizide"                "glyburide"               
## [31] "pioglitazone"             "rosiglitazone"           
## [33] "acarbose"                 "miglitol"                
## [35] "tolazamide"               "insulin"                 
## [37] "glyburide.metformin"      "glipizide.metformin"     
## [39] "metformin.pioglitazone"   "change"                  
## [41] "diabetesMed"              "readmitted"              
## [43] "diag_count"

names(test_data)

##  [1] "encounter_id"             "patient_nbr"             
##  [3] "race"                     "gender"                  
##  [5] "age"                      "admission_type_id"       
##  [7] "discharge_disposition_id" "admission_source_id"     
##  [9] "time_in_hospital"         "payer_code"              
## [11] "medical_specialty"        "num_lab_procedures"      
## [13] "num_procedures"           "num_medications"         
## [15] "number_outpatient"        "number_emergency"        
## [17] "number_inpatient"         "diag_1"                  
## [19] "diag_2"                   "diag_3"                  
## [21] "number_diagnoses"         "max_glu_serum"           
## [23] "A1Cresult"                "metformin"               
## [25] "repaglinide"              "nateglinide"             
## [27] "chlorpropamide"           "glimepiride"             
## [29] "glipizide"                "glyburide"               
## [31] "pioglitazone"             "rosiglitazone"           
## [33] "acarbose"                 "miglitol"                
## [35] "tolazamide"               "insulin"                 
## [37] "glyburide.metformin"      "change"                  
## [39] "diabetesMed"              "readmitted"              
## [41] "diag_count"

# Drop Non-factor columns with a single unique value
#Sometimes model.matrix() can also break if a non-factor column (e.g. numeric) has zero variance.

constant_cols <- names(which(sapply(train_data, function(col) length(unique(col)) < 2)))
train_data <- train_data[, !(names(train_data) %in% constant_cols)]
test_data <- test_data[, !(names(test_data) %in% constant_cols)]

Top features

# 01. Create matrix for XGBoost
train_matrix <- model.matrix(readmitted ~ . -1, data = train_data)
train_label <- as.numeric(train_data$readmitted) - 1

# 02. Train the XGBoost model
library(xgboost)

xgb_model <- xgboost(
  data = train_matrix,
  label = train_label,
  nrounds = 50,
  objective = "multi:softprob",
  num_class = length(unique(train_label)),
  eval_metric = "mlogloss",
  verbose = 0
)

# 03. Get top features
importance <- xgb.importance(model = xgb_model)
print(head(importance, 15))

##                      Feature        Gain       Cover   Frequency
##                       <char>       <num>       <num>       <num>
##  1:              patient_nbr 0.174532370 0.069586833 0.112475334
##  2:         number_inpatient 0.145456155 0.046084727 0.027844771
##  3:             encounter_id 0.132107759 0.049256796 0.117079588
##  4: discharge_disposition_id 0.082950532 0.039615854 0.041876781
##  5:          num_medications 0.034468069 0.023408695 0.065555799
##  6:         number_diagnoses 0.033398607 0.013583347 0.023459768
##  7:         number_emergency 0.030517039 0.022946993 0.019951765
##  8:       num_lab_procedures 0.029424893 0.011609299 0.056566542
##  9:         time_in_hospital 0.017659103 0.006313778 0.037711028
## 10:        number_outpatient 0.012540098 0.015338450 0.014251261
## 11:           num_procedures 0.012320383 0.002245293 0.024556018
## 12:           diabetesMedYes 0.010777714 0.012117138 0.008989257
## 13:             payer_codeMC 0.007219052 0.005566069 0.008770007
## 14:      admission_source_id 0.007149614 0.004703997 0.010304758
## 15:                diag_1V58 0.007025155 0.008946530 0.006358255

# 04. Plot the features
xgb.plot.importance(importance_matrix = importance[1:15, ])

Classification Model

# 01. Load libraries and data
library(ranger)
library(dplyr)

# 02. Select only top features (exclude patient_nbr & encounter_id)
selected_vars <- c(
  "readmitted",
  "number_inpatient",
  "discharge_disposition_id",
  "num_medications",
  "number_diagnoses",
  "number_emergency",
  "num_lab_procedures",
  "time_in_hospital",
  "number_outpatient",
  "num_procedures",
  "diabetesMed",
  "payer_code",
  "admission_source_id",
  "diag_1"
)

df_selected <- df[, selected_vars]

# 03. Remove columns with only one unique value
df_selected <- df_selected %>%
  select(where(~ length(unique(.)) > 1))

# 04. Train/test split
set.seed(123)
train_index <- createDataPartition(df_selected$readmitted, p = 0.8, list = FALSE)
train_data <- df_selected[train_index, ]
test_data  <- df_selected[-train_index, ]

Option A: Random Forest (ranger)

# 05A. Train with ranger
rf_model <- ranger(readmitted ~ ., data = train_data, importance = "impurity")

# 06A. Predict and evaluate
pred_rf <- predict(rf_model, test_data)$predictions
rf_conf <- confusionMatrix(pred_rf, test_data$readmitted)
rf_stats <- rf_conf$overall[c("Accuracy", "Kappa")]

# Print RandomForest Confusion matrix
rf_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  <30  >30   NO
##        <30   26   24   10
##        >30  183  566  430
##        NO   376 1183 2696
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5985          
##                  95% CI : (0.5854, 0.6115)
##     No Information Rate : 0.5708          
##     P-Value [Acc > NIR] : 1.73e-05        
##                                           
##                   Kappa : 0.1764          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: <30 Class: >30 Class: NO
## Sensitivity            0.044444     0.3192    0.8597
## Specificity            0.993074     0.8353    0.3388
## Pos Pred Value         0.433333     0.4801    0.6336
## Neg Pred Value         0.897129     0.7203    0.6449
## Prevalence             0.106480     0.3227    0.5708
## Detection Rate         0.004732     0.1030    0.4907
## Detection Prevalence   0.010921     0.2146    0.7745
## Balanced Accuracy      0.518759     0.5772    0.5993

Option B: XGBoost (with numeric matrix input)

# 05B. Prepare XGBoost input
# Must convert factors to dummy vars
train_matrix <- model.matrix(readmitted ~ . -1, data = train_data)
train_label <- as.numeric(train_data$readmitted) - 1

test_matrix <- model.matrix(readmitted ~ . -1, data = test_data)
test_label <- as.numeric(test_data$readmitted) - 1

# 6B. Train XGBoost model
xgb_model <- xgboost(
  data = train_matrix,
  label = train_label,
  nrounds = 50,
  objective = "multi:softmax",
  num_class = length(unique(train_label)),
  eval_metric = "mlogloss",
  verbose = 0
)

# 7B. Predict and evaluate
pred_xgb <- predict(xgb_model, test_matrix)
pred_xgb <- factor(pred_xgb, labels = levels(train_data$readmitted))

xgb_conf <- confusionMatrix(pred_xgb, test_data$readmitted)
xgb_stats <- xgb_conf$overall[c("Accuracy", "Kappa")]

# Print XGBoost Confusion matrix
xgb_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  <30  >30   NO
##        <30   26   32    8
##        >30  193  583  462
##        NO   366 1158 2666
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5961         
##                  95% CI : (0.583, 0.6091)
##     No Information Rate : 0.5708         
##     P-Value [Acc > NIR] : 7.701e-05      
##                                          
##                   Kappa : 0.1769         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: <30 Class: >30 Class: NO
## Sensitivity            0.044444     0.3288    0.8501
## Specificity            0.991852     0.8240    0.3537
## Pos Pred Value         0.393939     0.4709    0.6363
## Neg Pred Value         0.897015     0.7204    0.6396
## Prevalence             0.106480     0.3227    0.5708
## Detection Rate         0.004732     0.1061    0.4853
## Detection Prevalence   0.012013     0.2253    0.7627
## Balanced Accuracy      0.518148     0.5764    0.6019

# 08. Combine Metrics for Comparison

compare_df <- data.frame(
  Metric = names(rf_stats),
  RandomForest = round(as.numeric(rf_stats), 4),
  XGBoost = round(as.numeric(xgb_stats), 4)
)

# Print comparison
print(compare_df)

##     Metric RandomForest XGBoost
## 1 Accuracy       0.5985  0.5961
## 2    Kappa       0.1764  0.1769

# Optional: Show class-wise stats
print(rf_conf$byClass)

##            Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: <30  0.04444444   0.9930739      0.4333333      0.8971292 0.4333333
## Class: >30  0.31923294   0.8352593      0.4800679      0.7202781 0.4800679
## Class: NO   0.85969388   0.3388465      0.6336075      0.6448749 0.6336075
##                Recall         F1 Prevalence Detection Rate Detection Prevalence
## Class: <30 0.04444444 0.08062016  0.1064798    0.004732435            0.0109210
## Class: >30 0.31923294 0.38346883  0.3227157    0.103021478            0.2145977
## Class: NO  0.85969388 0.72953592  0.5708045    0.490717146            0.7744813
##            Balanced Accuracy
## Class: <30         0.5187592
## Class: >30         0.5772461
## Class: NO          0.5992702

print(xgb_conf$byClass)

##            Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: <30  0.04444444   0.9918517      0.3939394      0.8970155 0.3939394
## Class: >30  0.32882121   0.8239721      0.4709208      0.7203947 0.4709208
## Class: NO   0.85012755   0.3536896      0.6362768      0.6395706 0.6362768
##                Recall         F1 Prevalence Detection Rate Detection Prevalence
## Class: <30 0.04444444 0.07987711  0.1064798    0.004732435           0.01201311
## Class: >30 0.32882121 0.38724676  0.3227157    0.106115763           0.22533673
## Class: NO  0.85012755 0.72781873  0.5708045    0.485256644           0.76265016
##            Balanced Accuracy
## Class: <30         0.5181481
## Class: >30         0.5763966
## Class: NO          0.6019086

Results & Interpretation

OVERALL COMPARISON

Both models are only moderately better than chance. Accuracy and Kappa are very similar, suggesting minimal difference in performance.

CLASS-WISE PERFORMANCE

🔹 Class <30 (most underrepresented) ❌ Both models perform poorly on detecting <30 readmission (sensitivity = 4.4%)

🔹 Class >30 ⚖️ Slightly better sensitivity with XGBoost, but difference is marginal.

🔹 Class NO (majority class) ✅ Both models favor this class. Slightly better balance with XGBoost.

Interpretation

Both models are heavily biased toward the majority class (NO). Extremely poor sensitivity for <30, which is the critical class for early readmission — this should be improved. Accuracy isn’t a strong metric here; consider using macro/micro averaged F1 scores or AUC per class. No major difference between models — you can prioritize the faster one (usually XGBoost).

Using other metric for both models

# 01. Step 1: Prepare Libraries

# List of required packages
required_packages <- c("MLmetrics", "yardstick", "dplyr", "pROC")

# Install any that aren't already installed
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

library(MLmetrics)

## 
## Attaching package: 'MLmetrics'

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

## The following object is masked from 'package:base':
## 
##     Recall

library(yardstick)

## 
## Attaching package: 'yardstick'

## The following object is masked _by_ '.GlobalEnv':
## 
##     rmse

## The following object is masked from 'package:readr':
## 
##     spec

## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity

library(dplyr)
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# 02. Create Prediction Results Data Frame

# Use this — works fine if rf_model is trained with caret::train()
# Assume these are your actual and predicted labels
actual <- test_data$readmitted
pred_rf <- predict(rf_model, data = test_data)$predictions
rf_probs <- predict(rf_model, data = test_data, type = "response")$predictions

# Recreate the test matrix (same variables as during training!)
test_matrix <- model.matrix(readmitted ~ . -1, data = test_data)

# Predict class
pred_xgb <- predict(xgb_model, newdata = test_matrix)

# If model used "multi:softmax", output is class index (e.g., 0, 1, 2)
# So you may need to map back to class labels:
levels(actual)  # Check the actual class order

## [1] "<30" ">30" "NO"

pred_xgb <- factor(pred_xgb, levels = 0:2, labels = levels(actual))

# 03. Compute F1 Scores
# Create results data frames
rf_results <- data.frame(truth = actual, prediction = pred_rf)
xgb_results <- data.frame(truth = actual, prediction = pred_xgb)

# Convert columns to factors (yardstick requires factors)
rf_results$truth <- as.factor(rf_results$truth)
rf_results$prediction <- as.factor(rf_results$prediction)

xgb_results$truth <- as.factor(xgb_results$truth)
xgb_results$prediction <- as.factor(xgb_results$prediction)

# Compute F1 macro and micro for Random Forest
rf_fma <- f_meas(rf_results, truth = truth, estimate = prediction, average = "macro")
rf_fmi <- f_meas(rf_results, truth = truth, estimate = prediction, average = "micro")

# Compute F1 macro and micro for XGBoost
xg_fma <- f_meas(xgb_results, truth = truth, estimate = prediction, average = "macro")
xg_fmi <- f_meas(xgb_results, truth = truth, estimate = prediction, average = "micro")

# Combine into a data frame
f1_summary <- data.frame(
  Model = c("Random Forest", "Random Forest", "XGBoost", "XGBoost"),
  Metric = c("F1 Macro", "F1 Micro", "F1 Macro", "F1 Micro"),
  Estimate = c(rf_fma$.estimate, rf_fmi$.estimate, xg_fma$.estimate, xg_fmi$.estimate)
)

# Save to CSV
write.csv(f1_summary, "f1_score.csv", row.names = FALSE)

f1_summary

##           Model   Metric  Estimate
## 1 Random Forest F1 Macro 0.3978750
## 2 Random Forest F1 Micro 0.3978750
## 3       XGBoost F1 Macro 0.3983142
## 4       XGBoost F1 Micro 0.3983142

F1 Score Interpretation

✅ F1 Macro (treats all classes equally): *On average, both models perform similarly across all classes. This metric treats each class equally, even if one class (e.g., “NO”) dominates. A score below 0.5 suggests poor ability to balance precision and recall, especially for minority classes (<30, >30).

✅ F1 Micro (accounts for class imbalance): *Interpretation: These values show how well the models perform across all instances, weighted by class frequency. Micro F1 is generally close to accuracy in multi-class problems. Both models are underperforming overall — correctly predicting fewer than half of the cases.

🔍 Final Analysis: Both models perform nearly the same in terms of F1 scores. F1 < 0.4 implies room for improvement, possibly due to: -Class imbalance (e.g., “NO” dominating) -Limited predictive power in selected features

Conclusion

The analysis revealed strong correlations between medical encounter counts and variables such as diagnoses, medications, and hospital stay duration, confirming that patients with more complex needs tend to receive more treatments and stay longer. Regression modeling using tuned XGBoost provided a reasonable prediction of hospital stay duration, suggesting its potential use in hospital resource planning despite room for improvement with more detailed clinical data. However, classification models for predicting readmission performed poorly, particularly for early readmissions (<30 days), due to class imbalance and limited predictive power in the features used. While overall accuracy appeared moderate, low sensitivity and F1 scores highlighted the models’ inability to effectively detect critical readmission cases, underscoring the need for better feature engineering and evaluation metrics tailored to imbalanced healthcare data.