Modeling Hospital Stay Duration and Readmission Risk in Diabetic Patients Using Machine Learning
Introduction
Hospital readmissions and prolonged lengths of stay are critical challenges in managing diabetes-related hospital care. Frequent readmissions can indicate poor disease management, while extend hospitalizations strain healthcare resources and increase patient risk. Leveraging machine learning (ML) to predict these outcomes can help identify high-risk patients early, guide interventions, and improve care planning.
This project focuses on predicting two important outcomes among diabetic patients:
The duration of hospital stays, a continuous variable representing healthcare utilization.
The prediction of readmission, a classification problem that reflects care quality.
By analyzing clinical, demographic, and treatment-related features, the project applies predictive modeling and interpretability techniques to uncover key drivers of these outcomes and enhance decision support in diabetes care.
Dataset Details
| Link | https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008 |
| Title | Diabetes 130-US Hospitals for Years 1999-2008 |
| Purpose of Dataset | The dataset is designed to analyze hospital readmission patterns of diabetic patients, particularly in determining the likelihood of a patient being readmitted within 30 days of discharge after hospitalization due to diabetes-related health issues. |
| Dimension | 101767 rows, 50 columns |
| Content | 10 years of clinical care (1999β2008) across 130 U.S. hospitals |
| Summary | β’ Class Distribution (readmitted): Β Β Β Β β’ βNOβ: 54,864 (most frequent) Β Β Β Β β’ β>30β and β<30β: Remaining β’ Race: 76,099 patients are βCaucasianβ (dominant group) β’ Age Groups: 10 discrete ranges like [70-80), [60-70), etc. β’ Missing values: Most entries have β?β for weight, payer_code, and medical_specialty β indicating missing or unknown data |
Research Questions
What features are most important in predicting readmission and length of stay, as identified by machine learning interpretability methods?
β Which machine learning model most accurately predicts the length of hospital stay after diabetic patient readmission?
β How accurately can machine learning models classify diabetic patients into readmission categories?
Research Objectives
To apply feature importance and interpretability techniques to identify the most influential factors associated with diabetic readmissions.
β To develop and evaluate machine learning models for predicting the length of stay and hospital readmission among diabetic patients.
β To build and assess classification model to predict the readmission risk using structured health data.
Implementation
Data Cleaning
Accessing data source
Before performing data analysis in R, we import a library called
dplyr, which is a package for data manipulation and
transformation. Other than that, we also import knitr
library to perform dynamic report generation. ggplot2 on
the other hand will help in customizing data visualization. In addition,
the corrplot package is used to visualize correlation
matrices, allowing us to identify relationships between numeric
variables effectively. The tidyr package supports data
tidying tasks such as reshaping data,handling missing values, ensuring
our dataset is structured appropriately for analysis.
Next, read diabetic_data.csv as the source file by
directly inserting the file name with its directory. To improve
performance, we use head(,5) to display the first 5 rows.
We will consider β?β as string while reading the data.
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide.metformin | glipizide.metformin | glimepiride.pioglitazone | metformin.rosiglitazone | metformin.pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2278392 | 8222157 | Caucasian | Female | [0-10) | NA | 6 | 25 | 1 | 1 | NA | Pediatrics-Endocrinology | 41 | 0 | 1 | 0 | 0 | 0 | 250.83 | NA | NA | 1 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | NO |
| 149190 | 55629189 | Caucasian | Female | [10-20) | NA | 1 | 1 | 7 | 3 | NA | NA | 59 | 0 | 18 | 0 | 0 | 0 | 276 | 250.01 | 255 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 64410 | 86047875 | AfricanAmerican | Female | [20-30) | NA | 1 | 1 | 7 | 2 | NA | NA | 11 | 5 | 13 | 2 | 0 | 1 | 648 | 250 | V27 | 6 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | NO |
| 500364 | 82442376 | Caucasian | Male | [30-40) | NA | 1 | 1 | 7 | 2 | NA | NA | 44 | 1 | 16 | 0 | 0 | 0 | 8 | 250.43 | 403 | 7 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 16680 | 42519267 | Caucasian | Male | [40-50) | NA | 1 | 1 | 7 | 1 | NA | NA | 51 | 0 | 8 | 0 | 0 | 0 | 197 | 157 | 250 | 5 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | NO |
As a reference to some of the columns with id, we can refer to IDS_mapping.csv file.
# read 'IDS_Mapping.csv'
# df <- read.csv(file.choose(),na.strings = "?")
id_reference <- read.csv("IDS_mapping.csv")
#id_reference
kable(id_reference, align = "l")| DISCHARGE_DISPOSITION_DESCRIPTION | ADMISSION_SOURCE_DESCRIPTION | ADMISSION_TYPE_DESCRIPTION |
|---|---|---|
| 0 - Unknown | 0 - Unknown | 0 - Unknown |
| 1 - Discharged to home | 1 - Physician Referral | 1 - Emergency |
| 2 - Discharged/transferred to another short term hospital | 2 - Clinic Referral | 2 - Urgent |
| 3 - Discharged/transferred to SNF | 3 - HMO Referral | 3 - Elective |
| 4 - Discharged/transferred to ICF | 4 - Transfer from a hospital | 4 - Newborn |
| 5 - Discharged/transferred to another type of inpatient care institution | 5 - Transfer from a Skilled Nursing Facility (SNF) | 7 - Trauma Center |
| 6 - Discharged/transferred to home with home health service | 6 - Transfer from another health care facility | |
| 7 - Left AMA | 7 - Emergency Room | |
| 8 - Discharged/transferred to home under care of Home IV provider | 8 - Court/Law Enforcement | |
| 9 - Admitted as an inpatient to this hospital | 10 - Transfer from critial access hospital | |
| 10 - Neonate discharged to another hospital for neonatal aftercare | 11 - Normal Delivery | |
| 11 - Expired | 12 - Premature Delivery | |
| 12 - Still patient or expected to return for outpatient services | 13 - Sick Baby | |
| 13 - Hospice / home | 14 - Extramural Birth | |
| 14 - Hospice / medical facility | 17 - Transfer From Another Home Health Agency | |
| 15 - Discharged/transferred within this institution to Medicare approved swing bed | 18 - Readmission to Same Home Health Agency | |
| 16 - Discharged/transferred/referred another institution for outpatient services | 21 - Transfer from hospital inpt/same fac reslt in a sep claim | |
| 17 - Discharged/transferred/referred to this institution for outpatient services | 22 - Born inside this hospital | |
| 19 - Expired at home. Medicaid only, hospice. | 23 - Born outside this hospital | |
| 20 - Expired in a medical facility. Medicaid only, hospice. | 24 - Transfer from Ambulatory Surgery Center | |
| 21 - Expired, place unknown. Medicaid only, hospice. | 25 - Transfer from Hospice | |
| 22 - Discharged/transferred to another rehab fac including rehab units of a hospital . | ||
| 23 - Discharged/transferred to a long term care hospital. | ||
| 24 - Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare. | ||
| 27 - Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere | ||
| 28 - Discharged/transferred to a federal health care facility. | ||
| 29 - Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital | ||
| 30 - Discharged/transferred to a Critical Access Hospital (CAH). |
Understanding raw data
The dataset are likely related to diabetics patient records. Data can be understand by identifying number of rows, dimension, and structure.
## [1] 101766
## [1] 101766 50
Generally, there are 101766 observations and 50 variables. The observation covers demographic, diagnostic, and treatment information.
## 'data.frame': 101766 obs. of 50 variables:
## $ encounter_id : int 2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
## $ patient_nbr : int 8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
## $ race : chr "Caucasian" "Caucasian" "AfricanAmerican" "Caucasian" ...
## $ gender : chr "Female" "Female" "Female" "Male" ...
## $ age : chr "[0-10)" "[10-20)" "[20-30)" "[30-40)" ...
## $ weight : chr NA NA NA NA ...
## $ admission_type_id : int 6 1 1 1 1 2 3 1 2 3 ...
## $ discharge_disposition_id: int 25 1 1 1 1 1 1 1 1 3 ...
## $ admission_source_id : int 1 7 7 7 7 2 2 7 4 4 ...
## $ time_in_hospital : int 1 3 2 2 1 3 4 5 13 12 ...
## $ payer_code : chr NA NA NA NA ...
## $ medical_specialty : chr "Pediatrics-Endocrinology" NA NA NA ...
## $ num_lab_procedures : int 41 59 11 44 51 31 70 73 68 33 ...
## $ num_procedures : int 0 0 5 1 0 6 1 0 2 3 ...
## $ num_medications : int 1 18 13 16 8 16 21 12 28 18 ...
## $ number_outpatient : int 0 0 2 0 0 0 0 0 0 0 ...
## $ number_emergency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : int 0 0 1 0 0 0 0 0 0 0 ...
## $ diag_1 : chr "250.83" "276" "648" "8" ...
## $ diag_2 : chr NA "250.01" "250" "250.43" ...
## $ diag_3 : chr NA "255" "V27" "403" ...
## $ number_diagnoses : int 1 9 6 7 5 9 7 8 8 8 ...
## $ max_glu_serum : chr "None" "None" "None" "None" ...
## $ A1Cresult : chr "None" "None" "None" "None" ...
## $ metformin : chr "No" "No" "No" "No" ...
## $ repaglinide : chr "No" "No" "No" "No" ...
## $ nateglinide : chr "No" "No" "No" "No" ...
## $ chlorpropamide : chr "No" "No" "No" "No" ...
## $ glimepiride : chr "No" "No" "No" "No" ...
## $ acetohexamide : chr "No" "No" "No" "No" ...
## $ glipizide : chr "No" "No" "Steady" "No" ...
## $ glyburide : chr "No" "No" "No" "No" ...
## $ tolbutamide : chr "No" "No" "No" "No" ...
## $ pioglitazone : chr "No" "No" "No" "No" ...
## $ rosiglitazone : chr "No" "No" "No" "No" ...
## $ acarbose : chr "No" "No" "No" "No" ...
## $ miglitol : chr "No" "No" "No" "No" ...
## $ troglitazone : chr "No" "No" "No" "No" ...
## $ tolazamide : chr "No" "No" "No" "No" ...
## $ examide : chr "No" "No" "No" "No" ...
## $ citoglipton : chr "No" "No" "No" "No" ...
## $ insulin : chr "No" "Up" "No" "Up" ...
## $ glyburide.metformin : chr "No" "No" "No" "No" ...
## $ glipizide.metformin : chr "No" "No" "No" "No" ...
## $ glimepiride.pioglitazone: chr "No" "No" "No" "No" ...
## $ metformin.rosiglitazone : chr "No" "No" "No" "No" ...
## $ metformin.pioglitazone : chr "No" "No" "No" "No" ...
## $ change : chr "No" "Ch" "No" "Ch" ...
## $ diabetesMed : chr "No" "Yes" "Yes" "Yes" ...
## $ readmitted : chr "NO" ">30" "NO" "NO" ...
From the structure above, we can observed that most of the data are categorical with lots of missing values labelled as β?β.
After exploring about some fields with id
(discharge_disposition_id, admission_source_id, admission_type_id), we
observe that there are some id that contains βNullβ, βNot
Mappedβ, βNot Availableβ, and βUnknown/Invalidβ category. To make it
standardized, we change the βidβ into 0, with βUnknownβ
as it description.
df <- df %>%
mutate(
discharge_disposition_id = case_when(
discharge_disposition_id %in% c(18, 25, 26) ~ 0,
TRUE ~ discharge_disposition_id
),
admission_source_id = case_when(
admission_source_id %in% c(9, 15, 16, 19, 20) ~ 0,
TRUE ~ admission_source_id
),
admission_type_id = case_when(
admission_type_id %in% c(5, 6, 8) ~ 0,
TRUE ~ admission_type_id
)
)Below are the list of unique id which contains 0
id as it category.
list(
admission_source_id = sort(unique(df$admission_source_id)),
admission_type_id = sort(unique(df$admission_type_id)),
discharge_disposition_id = sort(unique(df$discharge_disposition_id))
)## $admission_source_id
## [1] 0 1 2 3 4 5 6 7 8 10 11 13 14 17 22 25
##
## $admission_type_id
## [1] 0 1 2 3 4 7
##
## $discharge_disposition_id
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 19 20 22 23 24 27 28
Remove duplicates
Duplicates are removed to ensure data accuracy and reliability.
## [1] 101766
We can observe that there are no duplicated rows as the number of row before and after are tally.
Since this dataset is about patient hospital admission, encounter_id should be unique.
## [1] 101766
## [1] 101766
As we can see, encounter_id is unique and contains no missing value. Thus, it can act as a primary key in this dataset.
Handling missing values
Handle missing value is a critical step in data cleaning to ensure the data are ready for analysis. Ignoring this can lead to inaccurate analysis and biased model.
Firstly, we get the total number of missing values and missing percentage for each column.
missing_summary <- data.frame(
missing_count = sapply(df, function(x) sum(is.na(x))),
missing_percent = paste(round(sapply(df, function(x) mean(is.na(x)) * 100),2),'%')
)
kable(missing_summary, align = "l")| missing_count | missing_percent | |
|---|---|---|
| encounter_id | 0 | 0 % |
| patient_nbr | 0 | 0 % |
| race | 2273 | 2.23 % |
| gender | 0 | 0 % |
| age | 0 | 0 % |
| weight | 98569 | 96.86 % |
| admission_type_id | 0 | 0 % |
| discharge_disposition_id | 0 | 0 % |
| admission_source_id | 0 | 0 % |
| time_in_hospital | 0 | 0 % |
| payer_code | 40256 | 39.56 % |
| medical_specialty | 49949 | 49.08 % |
| num_lab_procedures | 0 | 0 % |
| num_procedures | 0 | 0 % |
| num_medications | 0 | 0 % |
| number_outpatient | 0 | 0 % |
| number_emergency | 0 | 0 % |
| number_inpatient | 0 | 0 % |
| diag_1 | 21 | 0.02 % |
| diag_2 | 358 | 0.35 % |
| diag_3 | 1423 | 1.4 % |
| number_diagnoses | 0 | 0 % |
| max_glu_serum | 0 | 0 % |
| A1Cresult | 0 | 0 % |
| metformin | 0 | 0 % |
| repaglinide | 0 | 0 % |
| nateglinide | 0 | 0 % |
| chlorpropamide | 0 | 0 % |
| glimepiride | 0 | 0 % |
| acetohexamide | 0 | 0 % |
| glipizide | 0 | 0 % |
| glyburide | 0 | 0 % |
| tolbutamide | 0 | 0 % |
| pioglitazone | 0 | 0 % |
| rosiglitazone | 0 | 0 % |
| acarbose | 0 | 0 % |
| miglitol | 0 | 0 % |
| troglitazone | 0 | 0 % |
| tolazamide | 0 | 0 % |
| examide | 0 | 0 % |
| citoglipton | 0 | 0 % |
| insulin | 0 | 0 % |
| glyburide.metformin | 0 | 0 % |
| glipizide.metformin | 0 | 0 % |
| glimepiride.pioglitazone | 0 | 0 % |
| metformin.rosiglitazone | 0 | 0 % |
| metformin.pioglitazone | 0 | 0 % |
| change | 0 | 0 % |
| diabetesMed | 0 | 0 % |
| readmitted | 0 | 0 % |
Based on the table above, weight have the highest missing value which is 97%, followed by prayer_code and medical_speciality with 40% and 49% respectively, race with less than 2%, and lastly diag_1, diag_2, and diag_3 with almost 0%. Column with almost 100% missing value will be removed while applying imputation to the column with less than 50% missing values.
Since there are too much missing value in weight column, the column will be removed as it is no longer useful for analysis.
## [1] 101766 49
As we can see, weight column has been removed since the variable decrease by 1.
Next, missing values in payer_code and medical_speciality column will be imputed to βUnknownβ category.
## [1] "Unknown" "MC" "MD" "HM" "UN" "BC" "SP"
## [8] "CP" "SI" "DM" "CM" "CH" "PO" "WC"
## [15] "OT" "OG" "MP" "FR"
## [1] "Pediatrics-Endocrinology"
## [2] "Unknown"
## [3] "InternalMedicine"
## [4] "Family/GeneralPractice"
## [5] "Cardiology"
## [6] "Surgery-General"
## [7] "Orthopedics"
## [8] "Gastroenterology"
## [9] "Surgery-Cardiovascular/Thoracic"
## [10] "Nephrology"
## [11] "Orthopedics-Reconstructive"
## [12] "Psychiatry"
## [13] "Emergency/Trauma"
## [14] "Pulmonology"
## [15] "Surgery-Neuro"
## [16] "Obsterics&Gynecology-GynecologicOnco"
## [17] "ObstetricsandGynecology"
## [18] "Pediatrics"
## [19] "Hematology/Oncology"
## [20] "Otolaryngology"
## [21] "Surgery-Colon&Rectal"
## [22] "Pediatrics-CriticalCare"
## [23] "Endocrinology"
## [24] "Urology"
## [25] "Psychiatry-Child/Adolescent"
## [26] "Pediatrics-Pulmonology"
## [27] "Neurology"
## [28] "Anesthesiology-Pediatric"
## [29] "Radiology"
## [30] "Pediatrics-Hematology-Oncology"
## [31] "Psychology"
## [32] "Podiatry"
## [33] "Gynecology"
## [34] "Oncology"
## [35] "Pediatrics-Neurology"
## [36] "Surgery-Plastic"
## [37] "Surgery-Thoracic"
## [38] "Surgery-PlasticwithinHeadandNeck"
## [39] "Ophthalmology"
## [40] "Surgery-Pediatric"
## [41] "Pediatrics-EmergencyMedicine"
## [42] "PhysicalMedicineandRehabilitation"
## [43] "InfectiousDiseases"
## [44] "Anesthesiology"
## [45] "Rheumatology"
## [46] "AllergyandImmunology"
## [47] "Surgery-Maxillofacial"
## [48] "Pediatrics-InfectiousDiseases"
## [49] "Pediatrics-AllergyandImmunology"
## [50] "Dentistry"
## [51] "Surgeon"
## [52] "Surgery-Vascular"
## [53] "Osteopath"
## [54] "Psychiatry-Addictive"
## [55] "Surgery-Cardiovascular"
## [56] "PhysicianNotFound"
## [57] "Hematology"
## [58] "Proctology"
## [59] "Obstetrics"
## [60] "SurgicalSpecialty"
## [61] "Radiologist"
## [62] "Pathology"
## [63] "Dermatology"
## [64] "SportsMedicine"
## [65] "Speech"
## [66] "Hospitalist"
## [67] "OutreachServices"
## [68] "Cardiology-Pediatric"
## [69] "Perinatology"
## [70] "Neurophysiology"
## [71] "Endocrinology-Metabolism"
## [72] "DCPTEAM"
## [73] "Resident"
The unique categories are listed above with βUnknownβ as missing value.
Since column race has only 2% of missing value, we perform imputation by mode.
## [1] "Caucasian"
βCaucasionβ is the mode and those missing value will be assign as it.
## [1] "Caucasian" "AfricanAmerican" "Other" "Asian"
## [5] "Hispanic"
For gender column, there are currently three categories which are βFemaleβ, βMaleβ, and βUnknown/Invalidβ. We consider βUnknown/Invalidβ as missing value, thus it will be filtered.
## [1] "Female" "Male" "Unknown/Invalid"
## [1] 101763
diag_1, diag_2, and diag_3 contains the code for primary, secondary and tertiary diagnosis. Since not all patients are diagnosed will those three at once, we will just change the null value to βNot Availableβ.
After handling missing value, there are currently 101763 observations with -3 difference as compared to original observation.
Add aggregated column
Column diag_1, diag_2, and diag_3 contains the diagnosis code. Each patient might only have one or two diagnosis, per admission. Thus, calculating the number of diagnosis will be helpful for further analysis.
## [1] 101763 50
Our variables are back to 50 after adding a new column.
Change data type
For a useful analysis, we will change chr type to
factor type. Factor will make it more efficient for
categorical grouping that will be used during classification
modelling.
df <- df %>% mutate(across(where(is.character), as.factor))%>% mutate(across(where(is.numeric), as.integer))
str(df)## 'data.frame': 101763 obs. of 50 variables:
## $ encounter_id : int 2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
## $ patient_nbr : int 8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
## $ race : Factor w/ 5 levels "AfricanAmerican",..: 3 3 1 3 3 3 3 3 3 3 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 2 1 1 ...
## $ age : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ admission_type_id : int 0 1 1 1 1 2 3 1 2 3 ...
## $ discharge_disposition_id: int 0 1 1 1 1 1 1 1 1 3 ...
## $ admission_source_id : int 1 7 7 7 7 2 2 7 4 4 ...
## $ time_in_hospital : int 1 3 2 2 1 3 4 5 13 12 ...
## $ payer_code : Factor w/ 18 levels "BC","CH","CM",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ medical_specialty : Factor w/ 73 levels "AllergyandImmunology",..: 38 72 72 72 72 72 72 72 72 19 ...
## $ num_lab_procedures : int 41 59 11 44 51 31 70 73 68 33 ...
## $ num_procedures : int 0 0 5 1 0 6 1 0 2 3 ...
## $ num_medications : int 1 18 13 16 8 16 21 12 28 18 ...
## $ number_outpatient : int 0 0 2 0 0 0 0 0 0 0 ...
## $ number_emergency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : int 0 0 1 0 0 0 0 0 0 0 ...
## $ diag_1 : Factor w/ 717 levels "10","11","110",..: 125 144 455 555 55 264 264 277 253 283 ...
## $ diag_2 : Factor w/ 749 levels "11","110","111",..: 710 80 79 98 25 247 247 315 261 47 ...
## $ diag_3 : Factor w/ 790 levels "11","110","111",..: 748 122 768 249 87 87 772 87 230 318 ...
## $ number_diagnoses : int 1 9 6 7 5 9 7 8 8 8 ...
## $ max_glu_serum : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ repaglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ nateglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ chlorpropamide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glimepiride : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ acetohexamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glipizide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
## $ glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
## $ tolbutamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
## $ acarbose : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ miglitol : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ troglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ tolazamide : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ examide : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ citoglipton : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
## $ glyburide.metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glipizide.metformin : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.pioglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ change : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ readmitted : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...
## $ diag_count : int 3 3 3 3 3 3 3 3 3 3 ...
Each category will be considered as one level of factors and
non-decimal columns (id columns) will be considered as int
type rather than num.
Relabelling interval bins
Column age show intervals that presented in left-closed and right-open bins. To improve readability, we will change the label by excluding the brackets with correct bins.
new_age_label <- paste0(seq(0, 90, by = 10),"-", seq(9, 99, by = 10))
levels(df$age) <- new_age_label
unique(df$age)## [1] 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
## Levels: 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
The factor for age are now re-code into 10 bins, from 0-99.
## Factor w/ 10 levels "0-9","10-19",..: 1 2 3 4 5 6 7 8 9 10 ...
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is performed to explore patterns, detect anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. The following visualizations and summaries aim to understand the characteristics of the diabetic patients and how various factors might influence hospital readmission.
Summary statistics
This gives a quick overview of the dataset, including minimum, maximum, mean, and quartiles for numerical variables, and frequency counts for categorical variables. This helps detect any skewness, unusual ranges, or imbalanced categories.
# Load necessary libraries for data manipulation and visualization
library(dplyr) # For data manipulation
library(ggplot2) # For plotting charts
# Display basic statistics (min, max, mean, etc.) for each column
summary(df)## encounter_id patient_nbr race gender
## Min. : 12522 Min. : 135 AfricanAmerican:19210 Female:54708
## 1st Qu.: 84959748 1st Qu.: 23412964 Asian : 641 Male :47055
## Median :152388294 Median : 45500490 Caucasian :78370
## Mean :165200788 Mean : 54329650 Hispanic : 2037
## 3rd Qu.:230269803 3rd Qu.: 87545714 Other : 1505
## Max. :443867222 Max. :189502619
##
## age admission_type_id discharge_disposition_id admission_source_id
## 70-79 :26066 Min. :0.000 Min. : 0.00 Min. : 0.000
## 60-69 :22482 1st Qu.:1.000 1st Qu.: 1.00 1st Qu.: 1.000
## 50-59 :17256 Median :1.000 Median : 1.00 Median : 7.000
## 80-89 :17197 Mean :1.452 Mean : 2.82 Mean : 5.712
## 40-49 : 9685 3rd Qu.:2.000 3rd Qu.: 3.00 3rd Qu.: 7.000
## 30-39 : 3775 Max. :7.000 Max. :28.00 Max. :25.000
## (Other): 5302
## time_in_hospital payer_code medical_specialty
## Min. : 1.000 Unknown:40255 Unknown :49947
## 1st Qu.: 2.000 MC :32439 InternalMedicine :14635
## Median : 4.000 HM : 6274 Emergency/Trauma : 7565
## Mean : 4.396 SP : 5007 Family/GeneralPractice: 7440
## 3rd Qu.: 6.000 BC : 4655 Cardiology : 5351
## Max. :14.000 MD : 3532 Surgery-General : 3099
## (Other): 9601 (Other) :13726
## num_lab_procedures num_procedures num_medications number_outpatient
## Min. : 1.0 Min. :0.00 Min. : 1.00 Min. : 0.0000
## 1st Qu.: 31.0 1st Qu.:0.00 1st Qu.:10.00 1st Qu.: 0.0000
## Median : 44.0 Median :1.00 Median :15.00 Median : 0.0000
## Mean : 43.1 Mean :1.34 Mean :16.02 Mean : 0.3694
## 3rd Qu.: 57.0 3rd Qu.:2.00 3rd Qu.:20.00 3rd Qu.: 0.0000
## Max. :132.0 Max. :6.00 Max. :81.00 Max. :42.0000
##
## number_emergency number_inpatient diag_1 diag_2
## Min. : 0.0000 Min. : 0.0000 428 : 6862 276 : 6752
## 1st Qu.: 0.0000 1st Qu.: 0.0000 414 : 6580 428 : 6662
## Median : 0.0000 Median : 0.0000 786 : 4016 250 : 6071
## Mean : 0.1978 Mean : 0.6356 410 : 3614 427 : 5036
## 3rd Qu.: 0.0000 3rd Qu.: 1.0000 486 : 3508 401 : 3736
## Max. :76.0000 Max. :21.0000 427 : 2766 496 : 3305
## (Other):74417 (Other):70201
## diag_3 number_diagnoses max_glu_serum A1Cresult metformin
## 250 :11555 Min. : 1.000 >200: 1485 >7 : 3812 Down : 575
## 401 : 8288 1st Qu.: 6.000 >300: 1264 >8 : 8216 No :81776
## 276 : 5175 Median : 8.000 None:96417 None:84745 Steady:18345
## 428 : 4577 Mean : 7.423 Norm: 2597 Norm: 4990 Up : 1067
## 427 : 3955 3rd Qu.: 9.000
## 414 : 3664 Max. :16.000
## (Other):64549
## repaglinide nateglinide chlorpropamide glimepiride acetohexamide
## Down : 45 Down : 11 Down : 1 Down : 194 No :101762
## No :100224 No :101060 No :101677 No :96572 Steady: 1
## Steady: 1384 Steady: 668 Steady: 79 Steady: 4670
## Up : 110 Up : 24 Up : 6 Up : 327
##
##
##
## glipizide glyburide tolbutamide pioglitazone rosiglitazone
## Down : 560 Down : 564 No :101740 Down : 118 Down : 87
## No :89078 No :91113 Steady: 23 No :94436 No :95399
## Steady:11355 Steady: 9274 Steady: 6975 Steady: 6099
## Up : 770 Up : 812 Up : 234 Up : 178
##
##
##
## acarbose miglitol troglitazone tolazamide examide
## Down : 3 Down : 5 No :101760 No :101724 No:101763
## No :101455 No :101725 Steady: 3 Steady: 38
## Steady: 295 Steady: 31 Up : 1
## Up : 10 Up : 2
##
##
##
## citoglipton insulin glyburide.metformin glipizide.metformin
## No:101763 Down :12218 Down : 6 No :101750
## No :47380 No :101057 Steady: 13
## Steady:30849 Steady: 692
## Up :11316 Up : 8
##
##
##
## glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone
## No :101762 No :101761 No :101762
## Steady: 1 Steady: 2 Steady: 1
##
##
##
##
##
## change diabetesMed readmitted diag_count
## Ch:47009 No :23402 <30:11357 Min. :3
## No:54754 Yes:78361 >30:35545 1st Qu.:3
## NO :54861 Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
##
Age Group Distribution
This bar plot shows how patients are distributed across different age intervals. It helps to identify which age groups have higher hospital admission rates among diabetic patients.
# 1 Show how many patients are in each age group
ggplot(df, aes(age)) +
geom_bar(fill = "steelblue") +
labs(title = "Distribution of Age Groups", x = "Age Group", y = "Count")Relationship: Age vs Time in Hospital
This boxplot shows how the time spent in the hospital varies across different age groups. βageβ is a categorical variable (in 10-year bins), and βtime_in_hospitalβ is numerical. The boxplot helps us observe the median stay and spread within each age group.
ggplot(df, aes(x = age, y = time_in_hospital)) +
geom_boxplot(fill = "orchid") + # Creates a boxplot with purple fill
labs(
title = "Age vs Time in Hospital", # Title of the chart
x = "Age", # Label for x-axis
y = "Hospital Stay (days)" # Label for y-axis
)Correlation: Number of Diagnoses vs Time in Hospital
This scatter plot examines the relationship between the number of diagnoses (diag_count) and the duration of hospital stay. Each dot represents a patientβs data. The red line is a linear regression trend line that helps assess whether more diagnoses are associated with longer hospital stays.
ggplot(df, aes(x = number_diagnoses, y = time_in_hospital)) +
geom_point(alpha = 0.3) + # Scatter plot with transparent points for better visibility
geom_smooth(method = "lm", se = FALSE, color = "red") + # Adds a linear trend line (no confidence interval)
labs(
title = "Diagnosis Count vs Time in Hospital", # Chart title
x = "Diagnosis Count", # x-axis label
y = "Time in Hospital" # y-axis label
)## `geom_smooth()` using formula = 'y ~ x'
Race Distribution
This visualization shows the number of patients by race. It highlights racial demographics in the dataset and helps identify if certain groups are over- or under-represented.
# Show how the dataset is distributed across different races
# Convert race counts to a table
race_count <- as.data.frame(table(df$race))
# Pie chart
ggplot(race_count, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") + # Makes it circular
labs(title = "Race Distribution") +
theme_void()Readmission Status
This chart breaks down the number of patients by their readmission status, such as β<30β, β>30β, and βNOβ. It provides insights into how common readmission is and whether most patients return soon or much later.
# Show how many patients fall into each readmission class
readmit_count <- as.data.frame(table(df$readmitted))
ggplot(readmit_count, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "Readmission Proportion") +
theme_void()Gender vs Readmission
This grouped bar chart shows the relationship between gender and readmission. It helps examine whether male or female patients tend to be readmitted more frequently.
# Check the readmission rate based on gender
ggplot(df, aes(gender, fill = readmitted)) +
geom_bar(position = "dodge") +
labs(title = "Gender vs Readmission", x = "Gender", y = "Count")Admission Types
This graph illustrates how patients are admitted, based on admission type codes (e.g., emergency, urgent, elective). Understanding admission types can help in identifying the severity or urgency of patient cases.
# Visualize types of hospital admissions in the dataset
ggplot(df, aes(admission_type_id)) +
geom_bar(fill = "coral") +
labs(title = "Admission Types", x = "Admission Type ID", y = "Count")Correlation Matrix
Load the Dataset
## encounter_id patient_nbr race gender
## Min. : 12522 Min. : 135 AfricanAmerican:19210 Female:54708
## 1st Qu.: 84959748 1st Qu.: 23412964 Asian : 641 Male :47055
## Median :152388294 Median : 45500490 Caucasian :78370
## Mean :165200788 Mean : 54329650 Hispanic : 2037
## 3rd Qu.:230269803 3rd Qu.: 87545714 Other : 1505
## Max. :443867222 Max. :189502619
##
## age admission_type_id discharge_disposition_id admission_source_id
## 70-79 :26066 Min. :0.000 Min. : 0.00 Min. : 0.000
## 60-69 :22482 1st Qu.:1.000 1st Qu.: 1.00 1st Qu.: 1.000
## 50-59 :17256 Median :1.000 Median : 1.00 Median : 7.000
## 80-89 :17197 Mean :1.452 Mean : 2.82 Mean : 5.712
## 40-49 : 9685 3rd Qu.:2.000 3rd Qu.: 3.00 3rd Qu.: 7.000
## 30-39 : 3775 Max. :7.000 Max. :28.00 Max. :25.000
## (Other): 5302
## time_in_hospital payer_code medical_specialty
## Min. : 1.000 Unknown:40255 Unknown :49947
## 1st Qu.: 2.000 MC :32439 InternalMedicine :14635
## Median : 4.000 HM : 6274 Emergency/Trauma : 7565
## Mean : 4.396 SP : 5007 Family/GeneralPractice: 7440
## 3rd Qu.: 6.000 BC : 4655 Cardiology : 5351
## Max. :14.000 MD : 3532 Surgery-General : 3099
## (Other): 9601 (Other) :13726
## num_lab_procedures num_procedures num_medications number_outpatient
## Min. : 1.0 Min. :0.00 Min. : 1.00 Min. : 0.0000
## 1st Qu.: 31.0 1st Qu.:0.00 1st Qu.:10.00 1st Qu.: 0.0000
## Median : 44.0 Median :1.00 Median :15.00 Median : 0.0000
## Mean : 43.1 Mean :1.34 Mean :16.02 Mean : 0.3694
## 3rd Qu.: 57.0 3rd Qu.:2.00 3rd Qu.:20.00 3rd Qu.: 0.0000
## Max. :132.0 Max. :6.00 Max. :81.00 Max. :42.0000
##
## number_emergency number_inpatient diag_1 diag_2
## Min. : 0.0000 Min. : 0.0000 428 : 6862 276 : 6752
## 1st Qu.: 0.0000 1st Qu.: 0.0000 414 : 6580 428 : 6662
## Median : 0.0000 Median : 0.0000 786 : 4016 250 : 6071
## Mean : 0.1978 Mean : 0.6356 410 : 3614 427 : 5036
## 3rd Qu.: 0.0000 3rd Qu.: 1.0000 486 : 3508 401 : 3736
## Max. :76.0000 Max. :21.0000 427 : 2766 496 : 3305
## (Other):74417 (Other):70201
## diag_3 number_diagnoses max_glu_serum A1Cresult metformin
## 250 :11555 Min. : 1.000 >200: 1485 >7 : 3812 Down : 575
## 401 : 8288 1st Qu.: 6.000 >300: 1264 >8 : 8216 No :81776
## 276 : 5175 Median : 8.000 None:96417 None:84745 Steady:18345
## 428 : 4577 Mean : 7.423 Norm: 2597 Norm: 4990 Up : 1067
## 427 : 3955 3rd Qu.: 9.000
## 414 : 3664 Max. :16.000
## (Other):64549
## repaglinide nateglinide chlorpropamide glimepiride acetohexamide
## Down : 45 Down : 11 Down : 1 Down : 194 No :101762
## No :100224 No :101060 No :101677 No :96572 Steady: 1
## Steady: 1384 Steady: 668 Steady: 79 Steady: 4670
## Up : 110 Up : 24 Up : 6 Up : 327
##
##
##
## glipizide glyburide tolbutamide pioglitazone rosiglitazone
## Down : 560 Down : 564 No :101740 Down : 118 Down : 87
## No :89078 No :91113 Steady: 23 No :94436 No :95399
## Steady:11355 Steady: 9274 Steady: 6975 Steady: 6099
## Up : 770 Up : 812 Up : 234 Up : 178
##
##
##
## acarbose miglitol troglitazone tolazamide examide
## Down : 3 Down : 5 No :101760 No :101724 No:101763
## No :101455 No :101725 Steady: 3 Steady: 38
## Steady: 295 Steady: 31 Up : 1
## Up : 10 Up : 2
##
##
##
## citoglipton insulin glyburide.metformin glipizide.metformin
## No:101763 Down :12218 Down : 6 No :101750
## No :47380 No :101057 Steady: 13
## Steady:30849 Steady: 692
## Up :11316 Up : 8
##
##
##
## glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone
## No :101762 No :101761 No :101762
## Steady: 1 Steady: 2 Steady: 1
##
##
##
##
##
## change diabetesMed readmitted diag_count
## Ch:47009 No :23402 <30:11357 Min. :3
## No:54754 Yes:78361 >30:35545 1st Qu.:3
## NO :54861 Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
##
Variable Selection
In correlation analysis, only numeric variables are applicable. Therefore, the steps below outline the selection of numerical variables.
# Select only numeric columns
numeric_data <- data[sapply(data, is.numeric)]
# Check selected columns
str(numeric_data)## 'data.frame': 101763 obs. of 14 variables:
## $ encounter_id : int 2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
## $ patient_nbr : int 8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
## $ admission_type_id : int 0 1 1 1 1 2 3 1 2 3 ...
## $ discharge_disposition_id: int 0 1 1 1 1 1 1 1 1 3 ...
## $ admission_source_id : int 1 7 7 7 7 2 2 7 4 4 ...
## $ time_in_hospital : int 1 3 2 2 1 3 4 5 13 12 ...
## $ num_lab_procedures : int 41 59 11 44 51 31 70 73 68 33 ...
## $ num_procedures : int 0 0 5 1 0 6 1 0 2 3 ...
## $ num_medications : int 1 18 13 16 8 16 21 12 28 18 ...
## $ number_outpatient : int 0 0 2 0 0 0 0 0 0 0 ...
## $ number_emergency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : int 0 0 1 0 0 0 0 0 0 0 ...
## $ number_diagnoses : int 1 9 6 7 5 9 7 8 8 8 ...
## $ diag_count : int 3 3 3 3 3 3 3 3 3 3 ...
Normality Assumption
The normality assumption needs to be assessed to determine the appropriate type of correlation analysisβ-whether to use Pearson or Spearman. If the data are normally distributed, Pearson will be used; otherwise, Spearman correlation will be applied.
# Set up plotting layout
par(mfrow = c(2, 2)) # 2x2 grid for each variable
# Loop through columns
for (col in names(numeric_data)) {
# Skip if no variance or too many NAs
if (sd(numeric_data[[col]], na.rm = TRUE) == 0 || sum(is.na(numeric_data[[col]])) > 0.5 * nrow(numeric_data)) next
# Histogram with density curve
hist(numeric_data[[col]],
breaks = 30,
prob = TRUE,
main = paste("Histogram of", col),
xlab = col)
lines(density(numeric_data[[col]], na.rm = TRUE), col = "red", lwd = 2)
# Q-Q plot
qqnorm(numeric_data[[col]], main = paste("Q-Q Plot of", col))
qqline(numeric_data[[col]], col = "red")
}CONCLUSION: The histogram shows that most of the independent variables are not normally distributed. Therefore, Spearman correlation analysis will be used to assess the relationships.
Correlation Matrix
Next, the correlation matrix will be calculated. If any variable has a standard deviation of zero (0), it will be removed from the list of numerical variables, and the matrix will be recalculated.
# Calculate the correlation matrix [Output: warning]
cor_matrix <- cor(numeric_data, use = "complete.obs", method = "spearman")## Warning in cor(numeric_data, use = "complete.obs", method = "spearman"): the
## standard deviation is zero
# Find columns with zero standard deviation
zero_sd_col <- names(numeric_data)[sapply(numeric_data, function(x) sd(x, na.rm = TRUE) == 0)]
print(zero_sd_col)## [1] "diag_count"
# Remove columns with zero standard deviation
numeric_data_filtered <- numeric_data[, sapply(numeric_data, function(x) sd(x, na.rm = TRUE) != 0)]
# Recompute correlation matrix
cor_matrix <- cor(numeric_data_filtered, use = "complete.obs", method = "spearman")
print(round(cor_matrix, digits = 4))## encounter_id patient_nbr admission_type_id
## encounter_id 1.0000 0.5436 0.1031
## patient_nbr 0.5436 1.0000 0.0210
## admission_type_id 0.1031 0.0210 1.0000
## discharge_disposition_id 0.1582 0.1726 -0.0305
## admission_source_id -0.0381 0.0375 -0.7001
## time_in_hospital -0.0599 -0.0170 0.0047
## num_lab_procedures -0.0087 0.0266 -0.1700
## num_procedures -0.0314 -0.0194 0.2123
## num_medications 0.1022 0.0447 0.0648
## number_outpatient 0.1509 0.1548 -0.0725
## number_emergency 0.1312 0.1132 -0.0435
## number_inpatient 0.0374 0.0259 -0.0273
## number_diagnoses 0.2925 0.2404 -0.0352
## discharge_disposition_id admission_source_id
## encounter_id 0.1582 -0.0381
## patient_nbr 0.1726 0.0375
## admission_type_id -0.0305 -0.7001
## discharge_disposition_id 1.0000 0.0492
## admission_source_id 0.0492 1.0000
## time_in_hospital 0.2381 0.0040
## num_lab_procedures 0.0729 0.1342
## num_procedures -0.0029 -0.2067
## num_medications 0.1839 -0.0620
## number_outpatient 0.0825 0.0281
## number_emergency 0.0524 0.1058
## number_inpatient 0.0914 0.0580
## number_diagnoses 0.2116 0.1084
## time_in_hospital num_lab_procedures num_procedures
## encounter_id -0.0599 -0.0087 -0.0314
## patient_nbr -0.0170 0.0266 -0.0194
## admission_type_id 0.0047 -0.1700 0.2123
## discharge_disposition_id 0.2381 0.0729 -0.0029
## admission_source_id 0.0040 0.1342 -0.2067
## time_in_hospital 1.0000 0.3368 0.1872
## num_lab_procedures 0.3368 1.0000 0.0229
## num_procedures 0.1872 0.0229 1.0000
## num_medications 0.4651 0.2516 0.3515
## number_outpatient -0.0133 -0.0236 -0.0237
## number_emergency -0.0009 0.0060 -0.0459
## number_inpatient 0.0922 0.0413 -0.0636
## number_diagnoses 0.2366 0.1691 0.0673
## num_medications number_outpatient number_emergency
## encounter_id 0.1022 0.1509 0.1312
## patient_nbr 0.0447 0.1548 0.1132
## admission_type_id 0.0648 -0.0725 -0.0435
## discharge_disposition_id 0.1839 0.0825 0.0524
## admission_source_id -0.0620 0.0281 0.1058
## time_in_hospital 0.4651 -0.0133 -0.0009
## num_lab_procedures 0.2516 -0.0236 0.0060
## num_procedures 0.3515 -0.0237 -0.0459
## num_medications 1.0000 0.0743 0.0437
## number_outpatient 0.0743 1.0000 0.1774
## number_emergency 0.0437 0.1774 1.0000
## number_inpatient 0.0992 0.1565 0.2223
## number_diagnoses 0.2937 0.1130 0.0923
## number_inpatient number_diagnoses
## encounter_id 0.0374 0.2925
## patient_nbr 0.0259 0.2404
## admission_type_id -0.0273 -0.0352
## discharge_disposition_id 0.0914 0.2116
## admission_source_id 0.0580 0.1084
## time_in_hospital 0.0922 0.2366
## num_lab_procedures 0.0413 0.1691
## num_procedures -0.0636 0.0673
## num_medications 0.0992 0.2937
## number_outpatient 0.1565 0.1130
## number_emergency 0.2223 0.0923
## number_inpatient 1.0000 0.1358
## number_diagnoses 0.1358 1.0000
Plotting the Correlation Matrix
To facilitate interpretation, a heatmap of the correlation matrix is presented below.
## corrplot 0.95 loaded
# Plot the correlation matrix
corrplot(cor_matrix, method = "color", addCoef.col = "black", number.cex = 0.7, tl.cex = 0.8)CONCLUSION: This correlation matrix shows relationships between healthcare-related variables, with values ranging from -0.70 (strong negative) to 1.00 (perfect positive). Most correlations are weak, but a few stand out as moderately strong.
Strongest Negative Correlation
i) The strongest relationship is
between admission_type_id and admission_source_id
(-0.70), indicating these variables move in opposite
directions.
π For example, emergency admissions might come from different sources
than planned admissions.
Hospital Stay Duration Relationships
i) time_in_hospital moderately
correlates with:
Β Β Β Β a. num_medications (0.47) β patients
staying longer tend to receive more medications.
Β Β Β Β b. num_lab_procedures (0.34) β longer
stays involve more lab tests.
π Longer hospitalization is associated with increased medication and testing.
Treatment Intensity Interconnections
i) num_procedures and num_medications (0.35) β patients undergoing more procedures also receive more medications.
ii) number_diagnoses and num_medications (0.29) β patients with more diagnoses tend to receive more medications.
π These moderate correlations suggest that treatment intensity variables reinforce each other.
Artificial Correlation Between Identifiers
i) encounter_id and patient_nbr
show a correlation of 0.54, but this is
artificial.
π These are identifiers and not clinical features β their correlation
reflects ID generation patterns, not medical trends.
Regrouping Data
The initial correlation analysis across all observations showed limited correlation between variables. However, medical data often contains patient-specific patterns that can be obscured in population-level analyses. By regrouping the data by patient number and analyzing correlations within patients, we account for between-patient variability and are better able to detect meaningful within-patient relationships between variables.
# Load required packages
library(dplyr)
data_sum <- data %>%
group_by(patient_nbr) %>%
summarise(
encounter_count = n_distinct(encounter_id),
total_lab_procedures = sum(num_lab_procedures, na.rm = TRUE),
total_procedures = sum(num_procedures, na.rm = TRUE),
total_medications = sum(num_medications, na.rm = TRUE),
total_outpatient = sum(number_outpatient, na.rm = TRUE),
total_emergency = sum(number_emergency, na.rm = TRUE),
total_inpatient = sum(number_inpatient, na.rm = TRUE),
total_time_in_hospital = sum(time_in_hospital, na.rm = TRUE),
total_num_of_diagnoses = sum(number_diagnoses, na.rm = TRUE)
)
print(data_sum, width = Inf)## # A tibble: 71,515 Γ 10
## patient_nbr encounter_count total_lab_procedures total_procedures
## <int> <int> <int> <int>
## 1 135 2 108 7
## 2 378 1 49 1
## 3 729 1 68 2
## 4 774 1 46 0
## 5 927 1 49 0
## 6 1152 5 209 10
## 7 1305 1 52 1
## 8 1314 3 151 13
## 9 1629 1 21 0
## 10 2025 1 47 2
## total_medications total_outpatient total_emergency total_inpatient
## <int> <int> <int> <int>
## 1 47 0 0 1
## 2 11 0 0 0
## 3 23 0 0 0
## 4 20 0 0 0
## 5 5 0 0 0
## 6 81 0 0 7
## 7 16 0 0 0
## 8 39 0 0 3
## 9 15 3 1 2
## 10 18 0 0 0
## total_time_in_hospital total_num_of_diagnoses
## <int> <int>
## 1 11 13
## 2 2 3
## 3 4 9
## 4 3 9
## 5 5 3
## 6 42 24
## 7 9 9
## 8 6 23
## 9 14 7
## 10 12 9
## # βΉ 71,505 more rows
Plotting New Correlation Matrix
# Calculate the correlation matrix
cor_matrix <- cor(data_sum, use = "complete.obs", method = "pearson")
cor_matrix## patient_nbr encounter_count total_lab_procedures
## patient_nbr 1.000000000 -0.0238243 -0.01266973
## encounter_count -0.023824305 1.0000000 0.86901056
## total_lab_procedures -0.012669730 0.8690106 1.00000000
## total_procedures -0.029122133 0.4288023 0.39436875
## total_medications -0.008321752 0.8591692 0.80549272
## total_outpatient 0.073490419 0.3400953 0.29218231
## total_emergency 0.027980837 0.3761894 0.30920086
## total_inpatient 0.001316926 0.7840189 0.69579065
## total_time_in_hospital -0.032822009 0.7950961 0.78456614
## total_num_of_diagnoses 0.046033256 0.9535311 0.85011767
## total_procedures total_medications total_outpatient
## patient_nbr -0.02912213 -0.008321752 0.07349042
## encounter_count 0.42880229 0.859169174 0.34009527
## total_lab_procedures 0.39436875 0.805492718 0.29218231
## total_procedures 1.00000000 0.532230844 0.12860526
## total_medications 0.53223084 1.000000000 0.32574607
## total_outpatient 0.12860526 0.325746075 1.00000000
## total_emergency 0.14623582 0.318199562 0.18820467
## total_inpatient 0.31742801 0.675796319 0.29234923
## total_time_in_hospital 0.43400973 0.812212343 0.25548036
## total_num_of_diagnoses 0.43703018 0.865713293 0.35260557
## total_emergency total_inpatient total_time_in_hospital
## patient_nbr 0.02798084 0.001316926 -0.03282201
## encounter_count 0.37618944 0.784018938 0.79509610
## total_lab_procedures 0.30920086 0.695790647 0.78456614
## total_procedures 0.14623582 0.317428005 0.43400973
## total_medications 0.31819956 0.675796319 0.81221234
## total_outpatient 0.18820467 0.292349230 0.25548036
## total_emergency 1.00000000 0.486626309 0.26405294
## total_inpatient 0.48662631 1.000000000 0.63023969
## total_time_in_hospital 0.26405294 0.630239685 1.00000000
## total_num_of_diagnoses 0.36386185 0.734070207 0.79334257
## total_num_of_diagnoses
## patient_nbr 0.04603326
## encounter_count 0.95353105
## total_lab_procedures 0.85011767
## total_procedures 0.43703018
## total_medications 0.86571329
## total_outpatient 0.35260557
## total_emergency 0.36386185
## total_inpatient 0.73407021
## total_time_in_hospital 0.79334257
## total_num_of_diagnoses 1.00000000
# Plot the correlation matrix with new variables
corrplot(cor_matrix, method = "color", addCoef.col = "black", number.cex = 0.7)CONCLUSION:
Strongest Correlations
i) encounter_count has very strong positive
correlations with:
Β Β Β Β a. total_num_of_diagnoses (0.954)
Β Β Β Β b. total_medications (0.859)
Β Β Β Β c.Β total_lab_procedures (0.869)
Β Β Β Β d.Β total_time_in_hospital (0.795)
π Patients with more medical encounters tend to undergo more diagnostic tests, receive more medications, and have longer hospital stays.
ii) total_num_of_diagnoses is also strongly
correlated with:
Β Β Β Β a. total_medications (0.866)
Β Β Β Β b. total_lab_procedures (0.850)
Β Β Β Β c.Β total_time_in_hospital (0.793)
π These strong positive correlations indicate that as the number of diagnoses increases, so do the medical interventions and hospital resources used, which aligns with clinical expectations.
Moderate Correlations
i) total_time_in_hospital is moderately
correlated with:
Β Β Β Β a. total_medications (0.812)
Β Β Β Β b. total_inpatient (0.630)
π Longer hospital stays involve more medical interventions.
ii) total_inpatient is moderately
correlated with:
Β Β Β Β a. total_medications (0.676)
Β Β Β Β b. total_num_of_diagnoses (0.734)
π Inpatient care is associated with higher treatment complexity.
Weak or Negligible Correlations
i) patient_nbr (likely a patient identifier) shows almost no correlation with other variables, as expected (all values close to 0).
ii) total_outpatient and total_emergency show weaker correlations with other variables (mostly < 0.4).
Prediction & Result
Regression: Modeling Hospital Stay Duration
# ============================================
# 02. Load and Preprocess Dataset
# ============================================
library(dplyr)
df <- readRDS("diabetic_cleaned.rds")
# Convert character columns to factors
df <- mutate(df, across(where(is.character), as.factor))
# Create Empty Lists
id_like_vars <- c()
high_missing_vars <- c()
binary_vars <- c()
numeric_vars <- c()
categorical_vars <- c()
# Track which variables have already been classified
classified_vars <- c()
# ID-like Variables
id_like_vars <- names(df)[sapply(df, function(x) length(unique(x)) == nrow(df))]
classified_vars <- union(classified_vars, id_like_vars)
# High Missing Value Variables
missing_pct <- sapply(df, function(x) mean(is.na(x)))
high_missing_vars <- names(missing_pct[missing_pct > 0.3])
high_missing_vars <- setdiff(high_missing_vars, classified_vars)
classified_vars <- union(classified_vars, high_missing_vars)
# Binary Variables (2 unique values only)
binary_vars <- names(df)[sapply(df, function(x) length(unique(x)) == 2)]
binary_vars <- setdiff(binary_vars, classified_vars)
classified_vars <- union(classified_vars, binary_vars)
# Numeric Variables
numeric_vars <- names(df)[sapply(df, function(x) is.numeric(x) | is.integer(x))]
numeric_vars <- setdiff(numeric_vars, classified_vars)
classified_vars <- union(classified_vars, numeric_vars)
# Categorical Variables (Factor with > 2 levels)
categorical_vars <- names(df)[sapply(df, is.factor)]
categorical_vars <- setdiff(categorical_vars, classified_vars)
classified_vars <- union(classified_vars, categorical_vars)# ============================================
# 03. Feature Preparation
# ============================================
# Define modeling target
target_var <- "time_in_hospital"
# Exclude variables
exclude_vars <- c(id_like_vars, high_missing_vars,categorical_vars, target_var, "readmitted")
all_vars <- setdiff(names(df), exclude_vars)
all_vars <- unique(c(all_vars, "age"))
# Subset to usable predictors + target
model_df <- df[, c(target_var, all_vars)]
# Drop rows with missing values (simple)
model_df <- na.omit(model_df)
# Convert character-type columns to factor if missed
model_df <- model_df %>% mutate(across(where(is.character), as.factor))
# Check structure
dim(model_df)## [1] 101763 24
Train-Test Split
The dataset was partitioned into training and testing subsets to ensure consistency across both subsets for valid model evaluation.
# ============================================
# 04. Train-Test Split
# ============================================
library(caret)## Loading required package: lattice
set.seed(99)
train_index <- createDataPartition(model_df[[target_var]], p = 0.7, list = FALSE)
train_data <- model_df[train_index, ]
test_data <- model_df[-train_index, ]
# Ensure factor levels match between train and test
for (col in names(train_data)) {
if (is.factor(train_data[[col]])) {
test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
}
}Feature Importance
Model was trained using the Ranger algorithm to evaluate feature importance for predicting hospital stay duration, revealing the top predictors based on impurity-based measures and are visualized as ranked bar chart above.
# ============================================
# 05. Feature Importance- Ranger
# ============================================
library(ranger)
library(dplyr)
library(caret)
library(tibble)
# Define target variable
target_var <- "time_in_hospital"
# Remove rows with NA
train_data <- na.omit(train_data)
# Ensure factors are typed consistently
train_data <- train_data %>%
mutate(across(where(is.character), as.factor))
# Train ranger model with importance
set.seed(123)
ranger_model <- ranger(
formula = as.formula(paste(target_var, "~ .")),
data = train_data,
importance = "impurity", # options: "impurity", "permutation"
num.trees = 100
)
# Extract variable importance
importance_df <- as.data.frame(ranger_model$variable.importance) %>%
rownames_to_column("Variable") %>%
setNames(c("Variable", "Importance")) %>%
arrange(desc(Importance))
# View top 15 variables
print(head(importance_df, 15))## Variable Importance
## 1 num_medications 113297.056
## 2 num_lab_procedures 77335.627
## 3 patient_nbr 48921.809
## 4 discharge_disposition_id 45460.302
## 5 num_procedures 30813.725
## 6 number_diagnoses 22344.653
## 7 age 20654.923
## 8 admission_type_id 15023.388
## 9 admission_source_id 14815.870
## 10 number_inpatient 14787.299
## 11 number_outpatient 8118.175
## 12 number_emergency 6334.988
## 13 gender 6258.516
## 14 change 6151.329
## 15 diabetesMed 4387.343
# Bar plot
library(ggplot2)
ggplot(importance_df[1:15, ], aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Important Variables from Ranger",
x = "Variable", y = "Importance") +
theme_minimal()Variables Selection
The top 15 most important variables identified earlier were selected to construct a reduced training dataset, facilitating more focused and computationally efficient model development.
# ============================================
# 06. Select Top N Variables
# ============================================
top_n <- 15
top_vars <- importance_df$Variable[1:top_n]
print(top_vars)## [1] "num_medications" "num_lab_procedures"
## [3] "patient_nbr" "discharge_disposition_id"
## [5] "num_procedures" "number_diagnoses"
## [7] "age" "admission_type_id"
## [9] "admission_source_id" "number_inpatient"
## [11] "number_outpatient" "number_emergency"
## [13] "gender" "change"
## [15] "diabetesMed"
Models
Linear Regression Model
The selected were subsequently used to train three predictive models, Multiple Linear Regression, Random Forest, and XGBoost. This approach aimed to evaluate and compare each modelβs ability to predict hospital stay duration by leveraging the most influential features while balancing interpretability, computational efficiency, and predictive accuracy.
# ============================================
# 07. Train Linear Model (lm)
# ============================================
lm_model <- lm(as.formula(paste(target_var, "~ .")), data = train_top)
summary(lm_model)##
## Call:
## lm(formula = as.formula(paste(target_var, "~ .")), data = train_top)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9922 -1.7139 -0.4894 1.1919 12.7692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.631e-01 2.443e-01 1.896 0.05799 .
## num_medications 1.351e-01 1.427e-03 94.651 < 2e-16 ***
## num_lab_procedures 3.047e-02 5.178e-04 58.838 < 2e-16 ***
## patient_nbr -4.822e-09 2.538e-10 -18.997 < 2e-16 ***
## discharge_disposition_id 6.430e-02 2.458e-03 26.166 < 2e-16 ***
## num_procedures 7.396e-02 6.243e-03 11.847 < 2e-16 ***
## number_diagnoses 1.256e-01 5.460e-03 23.001 < 2e-16 ***
## age10-19 1.135e-01 2.631e-01 0.431 0.66621
## age20-29 -4.365e-01 2.500e-01 -1.746 0.08078 .
## age30-39 -4.930e-01 2.435e-01 -2.025 0.04289 *
## age40-49 -5.007e-01 2.406e-01 -2.081 0.03744 *
## age50-59 -6.359e-01 2.399e-01 -2.650 0.00805 **
## age60-69 -5.051e-01 2.398e-01 -2.106 0.03521 *
## age70-79 -2.448e-01 2.397e-01 -1.021 0.30714
## age80-89 6.364e-02 2.402e-01 0.265 0.79105
## age90-99 1.395e-01 2.459e-01 0.567 0.57048
## admission_type_id 1.536e-02 1.468e-02 1.047 0.29529
## admission_source_id 9.294e-04 3.219e-03 0.289 0.77280
## number_inpatient 9.453e-02 7.833e-03 12.068 < 2e-16 ***
## number_outpatient -7.544e-02 7.540e-03 -10.005 < 2e-16 ***
## number_emergency -4.795e-02 1.002e-02 -4.787 1.70e-06 ***
## genderMale -8.291e-02 1.917e-02 -4.325 1.52e-05 ***
## changeNo -9.515e-02 2.249e-02 -4.231 2.33e-05 ***
## diabetesMedYes -1.462e-01 2.616e-02 -5.590 2.28e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.525 on 71211 degrees of freedom
## Multiple R-squared: 0.287, Adjusted R-squared: 0.2868
## F-statistic: 1246 on 23 and 71211 DF, p-value: < 2.2e-16
Random Forest
# ============================================
# 08. Train Random Forest
# ============================================
library(randomForest)## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(123)
rf_model <- randomForest(
formula = as.formula(paste(target_var, "~ .")),
data = train_top,
ntree = 30,
importance = TRUE
)
print(rf_model)##
## Call:
## randomForest(formula = as.formula(paste(target_var, "~ .")), data = train_top, ntree = 30, importance = TRUE)
## Type of random forest: regression
## Number of trees: 30
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 5.95309
## % Var explained: 33.4
XGBoost
# ============================================
# 09. Train XGBoost
# ============================================
library(xgboost)##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(Matrix)
# Convert to numeric matrix (XGBoost requires numeric input)
# One-hot encode categorical variables
dummies <- model.matrix(as.formula(paste(target_var, "~ .")), data = train_top)[, -1]
labels <- train_top[[target_var]]
# Create DMatrix
dtrain <- xgb.DMatrix(data = dummies, label = labels)
# Train XGBoost model
set.seed(123)
xgb_model <- xgboost(
data = dtrain,
nrounds = 100,
objective = "reg:squarederror",
verbose = 0
)
# Get and plot feature importance
print(xgb_model)## ##### xgb.Booster
## raw: 395.1 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, objective = "reg:squarederror")
## params (as set within xgb.train):
## objective = "reg:squarederror", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 23
## niter: 100
## nfeatures : 23
## evaluation_log:
## iter train_rmse
## <num> <num>
## 1 3.861824
## 2 3.213284
## --- ---
## 99 2.139742
## 100 2.139444
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = top_n, measure = "Gain")Performance
The predictive performance of the models was assessed on the test set using RMSE and R-squared metrics, enabling a comparative analysis of each modelβs accuracy and goodness-of-fit in estimating hospital stay duration based on the selected top predictors.
# =========================
# 10. Evaluate Models
# =========================
# Required libraries
library(caret) # for R2()
library(broom) # for tidy model output
library(ggplot2) # for plotting
# --- Custom Evaluation Functions ---
rmse <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
r_squared <- function(actual, predicted) {
cor(actual, predicted)^2
}
# --- Prepare Test Set with Top Variables ---
test_top <- test_data[, c(target_var, top_vars)]
# Ensure data types match
test_top <- test_top %>%
mutate(across(where(is.character), as.factor))
# True labels
test_labels <- test_top[[target_var]]
# ========================
# --- LINEAR MODEL ---
# ========================
lm_preds <- predict(lm_model, newdata = test_top)
lm_rmse <- rmse(test_labels, lm_preds)
lm_r2 <- r_squared(test_labels, lm_preds)
cat("\n====================\nLinear Model Evaluation\n====================\n")##
## ====================
## Linear Model Evaluation
## ====================
## RMSE : 2.5078
## R-squared: 0.2889
# ========================
# --- RANDOM FOREST ---
# ========================
rf_preds <- predict(rf_model, newdata = test_top)
rf_rmse <- rmse(test_labels, rf_preds)
rf_r2 <- R2(rf_preds, test_labels)
cat("\n====================\nRandom Forest Evaluation\n====================\n")##
## ====================
## Random Forest Evaluation
## ====================
## RMSE : 2.3588
## R-squared: 0.3715
# ========================
# --- XGBOOST ---
# ========================
# One-hot encode test set (same structure as train)
test_dummies <- model.matrix(as.formula(paste(target_var, "~ .")), data = test_top)[, -1]
dtest <- xgb.DMatrix(data = test_dummies)
xgb_preds <- predict(xgb_model, dtest)
xgb_rmse <- rmse(test_labels, xgb_preds)
xgb_r2 <- R2(xgb_preds, test_labels)
cat("\n====================\nXGBoost Evaluation\n====================\n")##
## ====================
## XGBoost Evaluation
## ====================
## RMSE : 2.3089
## R-squared: 0.3979
# Combine results into a summary data frame
results_df <- data.frame(
Model = c("Linear Regression", "Random Forest", "XGBoost"),
RMSE = c(lm_rmse, rf_rmse, xgb_rmse),
R_Squared = c(lm_r2, rf_r2, xgb_r2)
)
print(results_df)## Model RMSE R_Squared
## 1 Linear Regression 2.507828 0.2889076
## 2 Random Forest 2.358822 0.3715352
## 3 XGBoost 2.308908 0.3979181
XGBoost is the best-performing model among the three. It achieves the lowest prediction error (RMSE) and the highest explanatory power (R-squared). It is therefore the most suitable model for predicting hospital stay duration.
Tuning
A grid search with cross-validation was conducted to optimize XGBoost hyperparameters,identifying the parameter combination that minimized test RMSE, thereby enhancing the modelβs generalization performance while maintaining computational efficiency.
# ============================================
# 11. Tuning with Cross-Validation
# ============================================
# Define a small, efficient tuning grid
quick_grid <- expand.grid(
max_depth = c(3, 5),
eta = c(0.05, 0.1),
subsample = c(0.8),
colsample_bytree = c(0.8),
min_child_weight = c(1)
)
best_rmse <- Inf
best_params <- list()
best_nrounds <- 100
for (i in 1:nrow(quick_grid)) {
params <- list(
booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "rmse",
max_depth = quick_grid$max_depth[i],
eta = quick_grid$eta[i],
subsample = quick_grid$subsample[i],
colsample_bytree = quick_grid$colsample_bytree[i],
min_child_weight = quick_grid$min_child_weight[i]
)
set.seed(123)
cv_model <- xgb.cv(
params = params,
data = dtrain,
nrounds = 300,
nfold = 3, # fewer folds for speed
early_stopping_rounds = 10,
verbose = 0
)
mean_rmse <- min(cv_model$evaluation_log$test_rmse_mean)
if (mean_rmse < best_rmse) {
best_rmse <- mean_rmse
best_params <- params
best_nrounds <- cv_model$best_iteration
}
}
cat("Best RMSE:", round(best_rmse, 4), "\n")## Best RMSE: 2.3167
## $booster
## [1] "gbtree"
##
## $objective
## [1] "reg:squarederror"
##
## $eval_metric
## [1] "rmse"
##
## $max_depth
## [1] 5
##
## $eta
## [1] 0.1
##
## $subsample
## [1] 0.8
##
## $colsample_bytree
## [1] 0.8
##
## $min_child_weight
## [1] 1
Final Model
Utilizing the optimal hyperparameters derived from cross-validation, the final XGBoost model demonstrated enhanced performance, confirming its improved ability to explain and predict hospital stay duration based on the selected features.
# ============================================
# 12. Final Model with Best Parameters
# ============================================
set.seed(123)
xgb_model_best <- xgboost(
data = dtrain,
nrounds = best_nrounds,
params = best_params,
verbose = 0
)
# Variable Importance
importance_matrix_best <- xgb.importance(model = xgb_model_best)
xgb.plot.importance(importance_matrix_best, top_n = top_n, measure = "Gain")# Evaluate
preds <- predict(xgb_model_best, dtrain)
rmse_final <- sqrt(mean((preds - labels)^2))
r2_final <- 1 - sum((labels - preds)^2) / sum((labels - mean(labels))^2)
cat("\nFinal Tuned XGBoost Performance]\n")##
## Final Tuned XGBoost Performance]
## RMSE: 2.2054
## RΒ² Score: 0.4559
RMSE (Root Mean Squared Error): 2.2054 This indicates that, on average, the modelβs predictions deviate by approximately 2.2 days from the actual hospital stay durations. In practical terms, it reflects the typical prediction error, which is acceptable if hospital stays range over a broad scale. A lower RMSE suggests better accuracy and less variability in the prediction errors.
RΒ² Score (Coefficient of Determination): 0.4559 This value means that approximately 45.6% of the variability in hospital stay duration can be explained by the model using the selected features. While not extremely high, it reflects moderate explanatory power, which is expected in healthcare datasets where many unobserved factors (e.g., clinical decisions, patient-specific nuances) also influence outcomes.
Overall Interpretation
These values suggest that the tuned XGBoost model is reasonably effective at predicting hospital stay duration using the selected variables earlier. The model provides a useful approximation and could be valuable in hospital resource planning, early discharge prediction, or case prioritization, though there is still room for improvement through the inclusion of more detailed clinical data or unstructured variables.
Classification: Predict Readmission Class
In this new process, we are going to find top features first, and then do the classification on the top features only (removing the low features) But first, we are removing the features with lower than 2 factors level.
## ββ Attaching core tidyverse packages ββββββββββββββββββββββββ tidyverse 2.0.0 ββ
## β forcats 1.0.0 β readr 2.1.5
## β lubridate 1.9.4 β stringr 1.5.1
## β purrr 1.0.4 β tidyr 1.3.1
## ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
## β randomForest::combine() masks dplyr::combine()
## β tidyr::expand() masks Matrix::expand()
## β dplyr::filter() masks stats::filter()
## β dplyr::lag() masks stats::lag()
## β purrr::lift() masks caret::lift()
## β randomForest::margin() masks ggplot2::margin()
## β tidyr::pack() masks Matrix::pack()
## β xgboost::slice() masks dplyr::slice()
## β tidyr::unpack() masks Matrix::unpack()
## βΉ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
library(randomForest)
library(e1071)
library(xgboost)
library(ROCR)
# Load cleaned dataset
df <- readRDS("diabetic_recleaned.rds")
# Convert target variable to factor
df$readmitted <- as.factor(df$readmitted)
# Check class distribution
table(df$readmitted)##
## <30 >30 NO
## 2929 8865 15684
# Split into training and testing sets
set.seed(123)
train_index <- createDataPartition(df$readmitted, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]Identify and remove low level factors
# 01: Identify the problem columns
sapply(train_data, function(col) {
is.factor(col) && length(unique(col)) < 2
})## encounter_id patient_nbr race
## FALSE FALSE FALSE
## gender age admission_type_id
## FALSE FALSE FALSE
## discharge_disposition_id admission_source_id time_in_hospital
## FALSE FALSE FALSE
## payer_code medical_specialty num_lab_procedures
## FALSE FALSE FALSE
## num_procedures num_medications number_outpatient
## FALSE FALSE FALSE
## number_emergency number_inpatient diag_1
## FALSE FALSE FALSE
## diag_2 diag_3 number_diagnoses
## FALSE FALSE FALSE
## max_glu_serum A1Cresult metformin
## FALSE FALSE FALSE
## repaglinide nateglinide chlorpropamide
## FALSE FALSE FALSE
## glimepiride acetohexamide glipizide
## FALSE TRUE FALSE
## glyburide tolbutamide pioglitazone
## FALSE TRUE FALSE
## rosiglitazone acarbose miglitol
## FALSE FALSE FALSE
## troglitazone tolazamide examide
## TRUE FALSE TRUE
## citoglipton insulin glyburide.metformin
## TRUE FALSE FALSE
## glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
## FALSE TRUE TRUE
## metformin.pioglitazone change diabetesMed
## FALSE FALSE FALSE
## readmitted diag_count
## FALSE FALSE
## [1] "acetohexamide" "tolbutamide"
## [3] "troglitazone" "examide"
## [5] "citoglipton" "glimepiride.pioglitazone"
## [7] "metformin.rosiglitazone"
# 02: Drop problematic columns
bad_vars <- names(which(sapply(train_data, function(col) is.factor(col) && length(unique(col)) < 2)))
train_data_clean <- train_data[ , !(names(train_data) %in% bad_vars)]## [1] "acetohexamide" "tolbutamide"
## [3] "troglitazone" "examide"
## [5] "citoglipton" "glimepiride.pioglitazone"
## [7] "metformin.rosiglitazone"
## [1] "encounter_id" "patient_nbr"
## [3] "race" "gender"
## [5] "age" "admission_type_id"
## [7] "discharge_disposition_id" "admission_source_id"
## [9] "time_in_hospital" "payer_code"
## [11] "medical_specialty" "num_lab_procedures"
## [13] "num_procedures" "num_medications"
## [15] "number_outpatient" "number_emergency"
## [17] "number_inpatient" "diag_1"
## [19] "diag_2" "diag_3"
## [21] "number_diagnoses" "max_glu_serum"
## [23] "A1Cresult" "metformin"
## [25] "repaglinide" "nateglinide"
## [27] "chlorpropamide" "glimepiride"
## [29] "acetohexamide" "glipizide"
## [31] "glyburide" "tolbutamide"
## [33] "pioglitazone" "rosiglitazone"
## [35] "acarbose" "miglitol"
## [37] "troglitazone" "tolazamide"
## [39] "examide" "citoglipton"
## [41] "insulin" "glyburide.metformin"
## [43] "glipizide.metformin" "glimepiride.pioglitazone"
## [45] "metformin.rosiglitazone" "metformin.pioglitazone"
## [47] "change" "diabetesMed"
## [49] "readmitted" "diag_count"
remove_constant_factors <- function(df) {
df[, !(sapply(df, function(col) is.factor(col) && length(unique(col)) < 2))]
}
train_data <- remove_constant_factors(train_data)
test_data <- remove_constant_factors(test_data)## [1] "acetohexamide" "tolbutamide"
## [3] "troglitazone" "examide"
## [5] "citoglipton" "glimepiride.pioglitazone"
## [7] "metformin.rosiglitazone"
## [1] "encounter_id" "patient_nbr"
## [3] "race" "gender"
## [5] "age" "admission_type_id"
## [7] "discharge_disposition_id" "admission_source_id"
## [9] "time_in_hospital" "payer_code"
## [11] "medical_specialty" "num_lab_procedures"
## [13] "num_procedures" "num_medications"
## [15] "number_outpatient" "number_emergency"
## [17] "number_inpatient" "diag_1"
## [19] "diag_2" "diag_3"
## [21] "number_diagnoses" "max_glu_serum"
## [23] "A1Cresult" "metformin"
## [25] "repaglinide" "nateglinide"
## [27] "chlorpropamide" "glimepiride"
## [29] "glipizide" "glyburide"
## [31] "pioglitazone" "rosiglitazone"
## [33] "acarbose" "miglitol"
## [35] "tolazamide" "insulin"
## [37] "glyburide.metformin" "glipizide.metformin"
## [39] "metformin.pioglitazone" "change"
## [41] "diabetesMed" "readmitted"
## [43] "diag_count"
## [1] "encounter_id" "patient_nbr"
## [3] "race" "gender"
## [5] "age" "admission_type_id"
## [7] "discharge_disposition_id" "admission_source_id"
## [9] "time_in_hospital" "payer_code"
## [11] "medical_specialty" "num_lab_procedures"
## [13] "num_procedures" "num_medications"
## [15] "number_outpatient" "number_emergency"
## [17] "number_inpatient" "diag_1"
## [19] "diag_2" "diag_3"
## [21] "number_diagnoses" "max_glu_serum"
## [23] "A1Cresult" "metformin"
## [25] "repaglinide" "nateglinide"
## [27] "chlorpropamide" "glimepiride"
## [29] "glipizide" "glyburide"
## [31] "pioglitazone" "rosiglitazone"
## [33] "acarbose" "miglitol"
## [35] "tolazamide" "insulin"
## [37] "glyburide.metformin" "change"
## [39] "diabetesMed" "readmitted"
## [41] "diag_count"
# Drop Non-factor columns with a single unique value
#Sometimes model.matrix() can also break if a non-factor column (e.g. numeric) has zero variance.
constant_cols <- names(which(sapply(train_data, function(col) length(unique(col)) < 2)))
train_data <- train_data[, !(names(train_data) %in% constant_cols)]
test_data <- test_data[, !(names(test_data) %in% constant_cols)]Top features
# 01. Create matrix for XGBoost
train_matrix <- model.matrix(readmitted ~ . -1, data = train_data)
train_label <- as.numeric(train_data$readmitted) - 1
# 02. Train the XGBoost model
library(xgboost)
xgb_model <- xgboost(
data = train_matrix,
label = train_label,
nrounds = 50,
objective = "multi:softprob",
num_class = length(unique(train_label)),
eval_metric = "mlogloss",
verbose = 0
)
# 03. Get top features
importance <- xgb.importance(model = xgb_model)
print(head(importance, 15))## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: patient_nbr 0.174532370 0.069586833 0.112475334
## 2: number_inpatient 0.145456155 0.046084727 0.027844771
## 3: encounter_id 0.132107759 0.049256796 0.117079588
## 4: discharge_disposition_id 0.082950532 0.039615854 0.041876781
## 5: num_medications 0.034468069 0.023408695 0.065555799
## 6: number_diagnoses 0.033398607 0.013583347 0.023459768
## 7: number_emergency 0.030517039 0.022946993 0.019951765
## 8: num_lab_procedures 0.029424893 0.011609299 0.056566542
## 9: time_in_hospital 0.017659103 0.006313778 0.037711028
## 10: number_outpatient 0.012540098 0.015338450 0.014251261
## 11: num_procedures 0.012320383 0.002245293 0.024556018
## 12: diabetesMedYes 0.010777714 0.012117138 0.008989257
## 13: payer_codeMC 0.007219052 0.005566069 0.008770007
## 14: admission_source_id 0.007149614 0.004703997 0.010304758
## 15: diag_1V58 0.007025155 0.008946530 0.006358255
Classification Model
# 01. Load libraries and data
library(ranger)
library(dplyr)
# 02. Select only top features (exclude patient_nbr & encounter_id)
selected_vars <- c(
"readmitted",
"number_inpatient",
"discharge_disposition_id",
"num_medications",
"number_diagnoses",
"number_emergency",
"num_lab_procedures",
"time_in_hospital",
"number_outpatient",
"num_procedures",
"diabetesMed",
"payer_code",
"admission_source_id",
"diag_1"
)
df_selected <- df[, selected_vars]
# 03. Remove columns with only one unique value
df_selected <- df_selected %>%
select(where(~ length(unique(.)) > 1))
# 04. Train/test split
set.seed(123)
train_index <- createDataPartition(df_selected$readmitted, p = 0.8, list = FALSE)
train_data <- df_selected[train_index, ]
test_data <- df_selected[-train_index, ]Option A: Random Forest (ranger)
# 05A. Train with ranger
rf_model <- ranger(readmitted ~ ., data = train_data, importance = "impurity")
# 06A. Predict and evaluate
pred_rf <- predict(rf_model, test_data)$predictions
rf_conf <- confusionMatrix(pred_rf, test_data$readmitted)
rf_stats <- rf_conf$overall[c("Accuracy", "Kappa")]## Confusion Matrix and Statistics
##
## Reference
## Prediction <30 >30 NO
## <30 26 24 10
## >30 183 566 430
## NO 376 1183 2696
##
## Overall Statistics
##
## Accuracy : 0.5985
## 95% CI : (0.5854, 0.6115)
## No Information Rate : 0.5708
## P-Value [Acc > NIR] : 1.73e-05
##
## Kappa : 0.1764
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: <30 Class: >30 Class: NO
## Sensitivity 0.044444 0.3192 0.8597
## Specificity 0.993074 0.8353 0.3388
## Pos Pred Value 0.433333 0.4801 0.6336
## Neg Pred Value 0.897129 0.7203 0.6449
## Prevalence 0.106480 0.3227 0.5708
## Detection Rate 0.004732 0.1030 0.4907
## Detection Prevalence 0.010921 0.2146 0.7745
## Balanced Accuracy 0.518759 0.5772 0.5993
Option B: XGBoost (with numeric matrix input)
# 05B. Prepare XGBoost input
# Must convert factors to dummy vars
train_matrix <- model.matrix(readmitted ~ . -1, data = train_data)
train_label <- as.numeric(train_data$readmitted) - 1
test_matrix <- model.matrix(readmitted ~ . -1, data = test_data)
test_label <- as.numeric(test_data$readmitted) - 1
# 6B. Train XGBoost model
xgb_model <- xgboost(
data = train_matrix,
label = train_label,
nrounds = 50,
objective = "multi:softmax",
num_class = length(unique(train_label)),
eval_metric = "mlogloss",
verbose = 0
)
# 7B. Predict and evaluate
pred_xgb <- predict(xgb_model, test_matrix)
pred_xgb <- factor(pred_xgb, labels = levels(train_data$readmitted))
xgb_conf <- confusionMatrix(pred_xgb, test_data$readmitted)
xgb_stats <- xgb_conf$overall[c("Accuracy", "Kappa")]## Confusion Matrix and Statistics
##
## Reference
## Prediction <30 >30 NO
## <30 26 32 8
## >30 193 583 462
## NO 366 1158 2666
##
## Overall Statistics
##
## Accuracy : 0.5961
## 95% CI : (0.583, 0.6091)
## No Information Rate : 0.5708
## P-Value [Acc > NIR] : 7.701e-05
##
## Kappa : 0.1769
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: <30 Class: >30 Class: NO
## Sensitivity 0.044444 0.3288 0.8501
## Specificity 0.991852 0.8240 0.3537
## Pos Pred Value 0.393939 0.4709 0.6363
## Neg Pred Value 0.897015 0.7204 0.6396
## Prevalence 0.106480 0.3227 0.5708
## Detection Rate 0.004732 0.1061 0.4853
## Detection Prevalence 0.012013 0.2253 0.7627
## Balanced Accuracy 0.518148 0.5764 0.6019
# 08. Combine Metrics for Comparison
compare_df <- data.frame(
Metric = names(rf_stats),
RandomForest = round(as.numeric(rf_stats), 4),
XGBoost = round(as.numeric(xgb_stats), 4)
)
# Print comparison
print(compare_df)## Metric RandomForest XGBoost
## 1 Accuracy 0.5985 0.5961
## 2 Kappa 0.1764 0.1769
## Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: <30 0.04444444 0.9930739 0.4333333 0.8971292 0.4333333
## Class: >30 0.31923294 0.8352593 0.4800679 0.7202781 0.4800679
## Class: NO 0.85969388 0.3388465 0.6336075 0.6448749 0.6336075
## Recall F1 Prevalence Detection Rate Detection Prevalence
## Class: <30 0.04444444 0.08062016 0.1064798 0.004732435 0.0109210
## Class: >30 0.31923294 0.38346883 0.3227157 0.103021478 0.2145977
## Class: NO 0.85969388 0.72953592 0.5708045 0.490717146 0.7744813
## Balanced Accuracy
## Class: <30 0.5187592
## Class: >30 0.5772461
## Class: NO 0.5992702
## Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: <30 0.04444444 0.9918517 0.3939394 0.8970155 0.3939394
## Class: >30 0.32882121 0.8239721 0.4709208 0.7203947 0.4709208
## Class: NO 0.85012755 0.3536896 0.6362768 0.6395706 0.6362768
## Recall F1 Prevalence Detection Rate Detection Prevalence
## Class: <30 0.04444444 0.07987711 0.1064798 0.004732435 0.01201311
## Class: >30 0.32882121 0.38724676 0.3227157 0.106115763 0.22533673
## Class: NO 0.85012755 0.72781873 0.5708045 0.485256644 0.76265016
## Balanced Accuracy
## Class: <30 0.5181481
## Class: >30 0.5763966
## Class: NO 0.6019086
Results & Interpretation
- OVERALL COMPARISON
Both models are only moderately better than chance. Accuracy and Kappa are very similar, suggesting minimal difference in performance.
- CLASS-WISE PERFORMANCE
πΉ Class <30 (most underrepresented) β Both models perform poorly on detecting <30 readmission (sensitivity = 4.4%)
πΉ Class >30 βοΈ Slightly better sensitivity with XGBoost, but difference is marginal.
πΉ Class NO (majority class) β Both models favor this class. Slightly better balance with XGBoost.
- Interpretation
Both models are heavily biased toward the majority class (NO). Extremely poor sensitivity for <30, which is the critical class for early readmission β this should be improved. Accuracy isnβt a strong metric here; consider using macro/micro averaged F1 scores or AUC per class. No major difference between models β you can prioritize the faster one (usually XGBoost).
Using other metric for both models
# 01. Step 1: Prepare Libraries
# List of required packages
required_packages <- c("MLmetrics", "yardstick", "dplyr", "pROC")
# Install any that aren't already installed
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg)
}
}
library(MLmetrics)##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
##
## Attaching package: 'yardstick'
## The following object is masked _by_ '.GlobalEnv':
##
## rmse
## The following object is masked from 'package:readr':
##
## spec
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# 02. Create Prediction Results Data Frame
# Use this β works fine if rf_model is trained with caret::train()
# Assume these are your actual and predicted labels
actual <- test_data$readmitted
pred_rf <- predict(rf_model, data = test_data)$predictions
rf_probs <- predict(rf_model, data = test_data, type = "response")$predictions
# Recreate the test matrix (same variables as during training!)
test_matrix <- model.matrix(readmitted ~ . -1, data = test_data)
# Predict class
pred_xgb <- predict(xgb_model, newdata = test_matrix)
# If model used "multi:softmax", output is class index (e.g., 0, 1, 2)
# So you may need to map back to class labels:
levels(actual) # Check the actual class order## [1] "<30" ">30" "NO"
# 03. Compute F1 Scores
# Create results data frames
rf_results <- data.frame(truth = actual, prediction = pred_rf)
xgb_results <- data.frame(truth = actual, prediction = pred_xgb)
# Convert columns to factors (yardstick requires factors)
rf_results$truth <- as.factor(rf_results$truth)
rf_results$prediction <- as.factor(rf_results$prediction)
xgb_results$truth <- as.factor(xgb_results$truth)
xgb_results$prediction <- as.factor(xgb_results$prediction)
# Compute F1 macro and micro for Random Forest
rf_fma <- f_meas(rf_results, truth = truth, estimate = prediction, average = "macro")
rf_fmi <- f_meas(rf_results, truth = truth, estimate = prediction, average = "micro")
# Compute F1 macro and micro for XGBoost
xg_fma <- f_meas(xgb_results, truth = truth, estimate = prediction, average = "macro")
xg_fmi <- f_meas(xgb_results, truth = truth, estimate = prediction, average = "micro")
# Combine into a data frame
f1_summary <- data.frame(
Model = c("Random Forest", "Random Forest", "XGBoost", "XGBoost"),
Metric = c("F1 Macro", "F1 Micro", "F1 Macro", "F1 Micro"),
Estimate = c(rf_fma$.estimate, rf_fmi$.estimate, xg_fma$.estimate, xg_fmi$.estimate)
)
# Save to CSV
write.csv(f1_summary, "f1_score.csv", row.names = FALSE)## Model Metric Estimate
## 1 Random Forest F1 Macro 0.3978750
## 2 Random Forest F1 Micro 0.3978750
## 3 XGBoost F1 Macro 0.3983142
## 4 XGBoost F1 Micro 0.3983142
F1 Score Interpretation
β F1 Macro (treats all classes equally): *On average, both models perform similarly across all classes. This metric treats each class equally, even if one class (e.g., βNOβ) dominates. A score below 0.5 suggests poor ability to balance precision and recall, especially for minority classes (<30, >30).
β F1 Micro (accounts for class imbalance): *Interpretation: These values show how well the models perform across all instances, weighted by class frequency. Micro F1 is generally close to accuracy in multi-class problems. Both models are underperforming overall β correctly predicting fewer than half of the cases.
π Final Analysis: Both models perform nearly the same in terms of F1 scores. F1 < 0.4 implies room for improvement, possibly due to: -Class imbalance (e.g., βNOβ dominating) -Limited predictive power in selected features
Conclusion
The analysis revealed strong correlations between medical encounter counts and variables such as diagnoses, medications, and hospital stay duration, confirming that patients with more complex needs tend to receive more treatments and stay longer. Regression modeling using tuned XGBoost provided a reasonable prediction of hospital stay duration, suggesting its potential use in hospital resource planning despite room for improvement with more detailed clinical data. However, classification models for predicting readmission performed poorly, particularly for early readmissions (<30 days), due to class imbalance and limited predictive power in the features used. While overall accuracy appeared moderate, low sensitivity and F1 scores highlighted the modelsβ inability to effectively detect critical readmission cases, underscoring the need for better feature engineering and evaluation metrics tailored to imbalanced healthcare data.