Final Project DATA110

Predicting 30-Day Hospital Readmission for Diabetic Patients

Introduction

Diabetes is a critical, and often preventable, chronic disease that has rapidly risen over time (World Health Organization, 2024). This occurs when either the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces (World Health Organization, 2024). Insulin is essential, functioning like a key to let blood sugar into cells in the body to use as energy (CDC, 2024). Diabetes impacts millions in the nation, with over 37 million Americans diagnosed and nearly 9 million Americans that are unaware they have it (Zlotek & UChicago Medicine AdventHealth, 2024).

Detecting diabetes in early stages is crucial to prevent health complications, including heart disease, kidney disease, nerve damage, and vision problems (Zlotek & UChicago Medicine AdventHealth, 2024). Personally, my family has a history of diabetes, with my dad being pre-diabetic. Being able to predict how likely or severe diabetes will be can benefit not only me, but also millions of people across the nation.

Dataset Information

Original source: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008 

This data was collected by using the Health Facts database, a national warehouse that collects comprehensive clinical records across hospital throughout the United States. The group extracted a dataset of 10 years (1999-2008) of clinical data at 130 hospitals and integrated delivery networks throughout the United States. The encounters of interest were extracted from the database with 55 attributes. Then, preliminary analysis and prepossessing of the data were performed to retain only features and encounters that contain sufficient information.

A more extensive study focusing on the methodology of the dataset can be seen: https://doi.org/10.1155/2014/781670

Variables:

Total Number of Variables: 50

Categorical Variables (41): encounter_id, patient_nbr, race, gender, age, admission_type_id, discharge_disposition_id, admission_source_id, payer_code, medical_specialty, diag_1, diag_2, diag_3, max_glu_serum, A1Cresult, metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, citoglipton, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, metformin-pioglitazone, change, diabetesMed, readmitted

Numerical Variables (9): weight, time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses

A full extensive list of the variables, including descriptions and values, as well as the % missing can be seen on the website.

Research questions:

  • Which factors are significantly correlated with whether or not a diabetic patient will be readmitted within 30 days?

  • Are there any disparities in readmission rates by race?

Data Analysis

Importing Libraries and Dataset

#loading libraries that I will be using throughout the project  
library(tidyverse)   
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)    

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
df <- read_csv("diabetic_data.csv") 
Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

#when viewing the dataset, all na values are set to ?  #substituting all ? values to na:    
df <- df %>%    
  mutate(across(where(is.character), ~na_if(., "?")))  #source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na    # in order to make the readmitted variable a binary variable, patients that are not readmitted within 30 days are set to false, while patients readmitted within 30 days are set to true.   
df <- df %>%       
  filter(!is.na(readmitted)) %>% #removes the nas in the variables needed for visualization as well as logistic regression models      
  filter(!is.na(number_inpatient)) %>%       
  filter(!is.na(number_outpatient)) %>%       
  filter(!is.na(number_emergency)) %>%       
  filter(!is.na(num_lab_procedures)) %>%       
  filter(!is.na(num_procedures)) %>%       
  filter(!is.na(number_diagnoses)) %>%       
  filter(!is.na(age)) %>%       
  filter(!is.na(gender)) %>%       
  filter(!is.na(race)) %>%      
  mutate(readmit_binary = ifelse(readmitted == "<30", 1, 0)) 

Logistic Regression

Logistic Regression is used for binary classification, which is fitting for the readmitted variable that we have now made binary (Castro & Ferreira, 2022). Logistic regression can be performed in R using the glm() (generalized linear model function) and using the family argument (Chugh, 2023).

When making the original model, all possible variables were added to establish a baseline of the model, called a full model (Schrader, n.d.). Then, a backwards model was made to identify the variables that are essential to the model. Finally, a refined model was created with the variables identified by the backwards model.

full_model <- glm(readmit_binary ~ number_inpatient + number_outpatient + number_emergency + num_lab_procedures + num_procedures + number_diagnoses + age + gender + race, data = df, family = "binomial")    
summary(full_model) 

Call:
glm(formula = readmit_binary ~ number_inpatient + number_outpatient + 
    number_emergency + num_lab_procedures + num_procedures + 
    number_diagnoses + age + gender + race, family = "binomial", 
    data = df)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -4.1958990  0.5840736  -7.184 6.78e-13 ***
number_inpatient       0.2722469  0.0065050  41.852  < 2e-16 ***
number_outpatient     -0.0023662  0.0075826  -0.312 0.754994    
number_emergency       0.0336336  0.0084674   3.972 7.12e-05 ***
num_lab_procedures     0.0017486  0.0005296   3.302 0.000961 ***
num_procedures        -0.0083522  0.0062279  -1.341 0.179893    
number_diagnoses       0.0542156  0.0059663   9.087  < 2e-16 ***
age[10-20)             0.9144768  0.6058895   1.509 0.131219    
age[20-30)             1.4356830  0.5887182   2.439 0.014742 *  
age[30-40)             1.4102483  0.5859435   2.407 0.016093 *  
age[40-50)             1.3345591  0.5846101   2.283 0.022441 *  
age[50-60)             1.2780429  0.5842582   2.187 0.028709 *  
age[60-70)             1.4440709  0.5841091   2.472 0.013426 *  
age[70-80)             1.5014401  0.5840464   2.571 0.010148 *  
age[80-90)             1.5126549  0.5842453   2.589 0.009623 ** 
age[90-100)            1.4240265  0.5869933   2.426 0.015268 *  
genderMale             0.0084978  0.0206832   0.411 0.681178    
genderUnknown/Invalid -5.9621613 43.9540598  -0.136 0.892102    
raceAsian             -0.0450017  0.1343859  -0.335 0.737724    
raceCaucasian         -0.0146727  0.0266361  -0.551 0.581731    
raceHispanic          -0.0389685  0.0771137  -0.505 0.613321    
raceOther             -0.0929917  0.0912568  -1.019 0.308198    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69886  on 99492  degrees of freedom
Residual deviance: 67601  on 99471  degrees of freedom
AIC: 67645

Number of Fisher Scoring iterations: 7
#backward elimination to eliminate insigificant variables   
backwards_model <- step(full_model, direction = "backward") #source: https://alain-vandormael.netlify.app/post/varselect/    summary(backwards_model)
Start:  AIC=67644.73
readmit_binary ~ number_inpatient + number_outpatient + number_emergency + 
    num_lab_procedures + num_procedures + number_diagnoses + 
    age + gender + race

                     Df Deviance   AIC
- race                4    67602 67638
- gender              2    67601 67641
- number_outpatient   1    67601 67643
- num_procedures      1    67603 67645
<none>                     67601 67645
- num_lab_procedures  1    67612 67654
- number_emergency    1    67616 67658
- age                 9    67685 67711
- number_diagnoses    1    67685 67727
- number_inpatient    1    69249 69291

Step:  AIC=67638.08
readmit_binary ~ number_inpatient + number_outpatient + number_emergency + 
    num_lab_procedures + num_procedures + number_diagnoses + 
    age + gender

                     Df Deviance   AIC
- gender              2    67602 67634
- number_outpatient   1    67602 67636
- num_procedures      1    67604 67638
<none>                     67602 67638
- num_lab_procedures  1    67613 67647
- number_emergency    1    67618 67652
- age                 9    67687 67705
- number_diagnoses    1    67687 67721
- number_inpatient    1    69254 69288

Step:  AIC=67634.36
readmit_binary ~ number_inpatient + number_outpatient + number_emergency + 
    num_lab_procedures + num_procedures + number_diagnoses + 
    age

                     Df Deviance   AIC
- number_outpatient   1    67602 67632
- num_procedures      1    67604 67634
<none>                     67602 67634
- num_lab_procedures  1    67613 67643
- number_emergency    1    67618 67648
- age                 9    67687 67701
- number_diagnoses    1    67687 67717
- number_inpatient    1    69254 69284

Step:  AIC=67632.48
readmit_binary ~ number_inpatient + number_emergency + num_lab_procedures + 
    num_procedures + number_diagnoses + age

                     Df Deviance   AIC
- num_procedures      1    67604 67632
<none>                     67602 67632
- num_lab_procedures  1    67614 67642
- number_emergency    1    67618 67646
- age                 9    67687 67699
- number_diagnoses    1    67687 67715
- number_inpatient    1    69262 69290

Step:  AIC=67632.28
readmit_binary ~ number_inpatient + number_emergency + num_lab_procedures + 
    number_diagnoses + age

                     Df Deviance   AIC
<none>                     67604 67632
- num_lab_procedures  1    67615 67641
- number_emergency    1    67620 67646
- age                 9    67690 67700
- number_diagnoses    1    67688 67714
- number_inpatient    1    69278 69304

Updated Logistic Model (only w/ variables left after AIC)

model <- glm(readmit_binary ~ number_inpatient + num_lab_procedures + age + number_diagnoses + number_emergency, data = df, family = "binomial")   
summary(model) 

Call:
glm(formula = readmit_binary ~ number_inpatient + num_lab_procedures + 
    age + number_diagnoses + number_emergency, family = "binomial", 
    data = df)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -4.2058842  0.5834848  -7.208 5.67e-13 ***
number_inpatient    0.2729187  0.0064596  42.250  < 2e-16 ***
num_lab_procedures  0.0017147  0.0005279   3.248  0.00116 ** 
age[10-20)          0.9165406  0.6058004   1.513  0.13029    
age[20-30)          1.4338744  0.5886028   2.436  0.01485 *  
age[30-40)          1.4069255  0.5858104   2.402  0.01632 *  
age[40-50)          1.3305777  0.5844774   2.277  0.02281 *  
age[50-60)          1.2719910  0.5841315   2.178  0.02944 *  
age[60-70)          1.4371766  0.5839950   2.461  0.01386 *  
age[70-80)          1.4958845  0.5839544   2.562  0.01042 *  
age[80-90)          1.5092776  0.5841721   2.584  0.00978 ** 
age[90-100)         1.4231302  0.5869160   2.425  0.01532 *  
number_diagnoses    0.0534689  0.0059220   9.029  < 2e-16 ***
number_emergency    0.0335913  0.0084532   3.974 7.07e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69886  on 99492  degrees of freedom
Residual deviance: 67604  on 99479  degrees of freedom
AIC: 67632

Number of Fisher Scoring iterations: 6

Statistical Analysis:

Logistic Equations based on glm() have the following equation:

Where p is the probability that an event occurs given the log-odds, which is determined by the logistic equation. For this equation, the value 0 is classified as a fail, and the value 1 is classified as a success (Berg, n.d.).

In this case, p is defined as the probability that a patient is readmitted within 30 days.

Full Model Equation: log-odds = 0.2729 * number_inpatient - 0.002* number_outpatient + 0.0336 * number_emergency + 0.0017 * num_lab_procedures - 0.008 * num_procedures + 0.0542 * number_diagnoses + 0.9145 * age[10-20) + 1.436 * age[20-30) + 1.410 * age[30-40) + 1.335 * age[40-50) + 1.278 * age[50-60) + 1.444 * age[60-70) + 1.501 * age[70-80) + 1.513 * age[80-90) + 1.424 * age[90-100) + 0.008 * genderMale - 5.962 * genderUnknown/Invalid race - 0.045 * raceAsian - 0.0146 * raceCaucasian - 0.0390 * raceHispanic - 0.092 * raceOther

Revised Model Equation: log-odds = 0.2729 * number_inpatient+ 0.0017 * num_lab_procedures + 0.9165 * age[10-20) + 1.434 * age[20-30) + 1.4069 * age[30-40) + 1.331 * age[40-50) + 1.272 * age[50-60) + 1.437 * age[60-70) + 1.496 * age[70-80) + 1.509 * age[80-90) + 1.423 * age[90-100) + 0.053 * number_diagnoses + 0.0336 * number_emergency

Looking at the AIC values and full model results, the variables that are the most correlated with whether or not someone is readmitted is the number of lab procedures, the number of emergency visits, age, the number of diagnoses, and the number of inpatient visits.

The AIC values of the revised model are as follows:

  • num_lab_procedures: 67641

  • number_emergency: 67646

  • age: 67700

  • number_diagnoses: 67714

  • number_inpatient: 69304

The AIC of the model is a relative measure of model parsimony (which is defined to be the simplest model that fits the data), so its value does not mean anything unless we compare it to different models (Bluecology blog, 2018). Looking at the AICs of the models during backward elimination, it is clear that the AICs have decreased as more and more variables were eliminated. Additionally, looking at the AIC values of the final model, we can see that even though the number of lab procedures and the number of past emergency visits have very low coefficients, they have very necessary impact, as removing these variables would increase the AIC by a significant value (8.72 for num_lab_procedures, 13.72 for number_emergency). The most critical predictor is the number of past inpatient visits, as its AIC value is the greatest by a significant relative margin compared to the AICs of the other variables in the updates model. The change in AIC if the variable was removed from the model would be 1,617.72, which is a massive jump compared to other variables.

The p-values of the revised model are as follows:

  • number_inpatient: <2e-16

  • num_lab_procedures: 0.00116

  • age[10-20): 0.1303

  • age[20-30): 0.0149

  • age[30-40): 0.0163

  • age[40-50): 0.0228

  • age[50-60): 0.0294

  • age[60-70): 0.0139

  • age[70-80): 0.0104

  • age[80-90): 0.0098

  • age[90-100): 0.0153

  • number_diagnoses: <2e-16

  • number_emergency: 7.07e-05

Looking at the p-values of all of the variables, the most significant predictors were number_inpatient, number_diagnoses, and number_emergencies, which shows how hospital information about a patient is essential in knowing if a patient will be readmitted or not. The change in p-values of the age variable when comparing different age groups was very surprising to me. I suspected that younger age groups such as age[10-20) would have lower p-values, because seniors have a higher risk of developing diabetes (CDC, 2024). However, there wasn’t a very clear trend in p-values that I expected, as the ages with the highest p-values were the age groups 40-50 and 50-60.

There is no r^2 value for my model because I am performing a logistic regression because logistic regression uses maximum likelihood estimation, unlike linear regression which uses least squares (Rumella, 2020).

Creating Final Plots

I ultimately decided to use the logistic regression plot I used to display the correlation between the number of inpatient visits and the readmission rates, while also adding race as a factor for color. To add additional interactivity, I decided to make this plot using plotly to include interactivity (Line Charts in Ggplot2 , 2015).

To create this final plot, I learned several new things. Firstly, I learned how to use geom_jitter(), which allowed overlapping points to be shown as they are moved very slightly in a random directionndefined. I also learned how to create logistic lines by integrating ggplot into plotly, and also learned how using toWebGL() can help a lot when it comes performance issues associated with rendering really large datasets (just like this dataset) (Sievert, 2019).

#in order to use the regression line,   
p <- ggplot(df, aes(x = number_inpatient, y = readmit_binary, color = race)) +   geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) + #alpha is used to emphasize the density of these clusters of points, as the darker a cluster is, the more points there are.       
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se=FALSE) + #removing the confidence intervals because there would be too much clutter and all of the se bounds would overlap on top of each other.      
  scale_y_continuous(limits = c(-0.1, 1.1)) #the addition rage of +-0.1 can show more points that are slightly increased or decreased through the jitter function   
p_webgl <- ggplotly(p) %>%        
  style(hoveron = NULL) %>% #without this piece of code, there were several warning messages because webGL scattergl elements do not support/require hoveron properties. By making the hoveron attribute null, we can prevent the warnings while also keeping the graph functional  (Source: https://www.statology.org/suppress-warnings-in-r/ and https://github.com/plotly/plotly.R/issues/1582 and https://plotly-r.com/controlling-tooltips)      
  toWebGL()  %>% #webGL is used because of the sheer amount of points that need to be used for the graph. This causes significant performance issues on my laptop, so WebGL() is necessary to allow the graph to render much faster. (Source: https://plotly-r.com/performance)      
  layout(     
    title = list(text = "Number of Inpatient Vists vs. Readmission Probability by Race"),           
    xaxis = list(title = "Number of Inpatient Visits in the Proceding Year"),      
    yaxis = list(title = "Patient Readmission")        
    )     
`geom_smooth()` using formula = 'y ~ x'
p_webgl 

Looking at the visualization, there are several key patterns to note. Firstly, looking at the geom_jitter, the points seem to extend out more when patients are readmitted, which indicate that there is an association with higher previous inpatient visits and a higher probability that a patient will be readmitted. Then, looking at the distributions of the different races, it is surprising to see that many races, including Asians and Hispanics, have very little data entries due to their curve length and their individual scatter (by clicking on the different races to toggle variability). This most likely caused their curves to be different from the curves showing the patient data for African Americans and Caucasian patients, which essentially overlapped each other. This shows that there are equal likelihoods of patients being readmitted to the hospital based on the number of previous in-patient visits, regardless of race.

Final Plot in Tableau

For the plot in Tableau, I wanted to compare the average number of lab procedures that a patient has performed on them during the encounter based on their readmission condition (<30, >30, or NO). To compare them, I wanted to give the user the option to select the age groups that they wanted to choose, as a scatter of the dots don’t give a lot of information at first. I also decided to change the sizes of the dots to number of patients in the age group to give additional information.

Below is an example of comparing two age groups in Tableau:

The visualization can also be accessed here: https://public.tableau.com/app/profile/joyce.fang7225/viz/FinalProject-Plot2/TheAverageNumberofLabProceduresDependingonPatientReadmission?publish=yes

The visualization shows that there are some minor differences in age groups. Some outliers for the age group were the 0-10 age group and the 10-20 age group, likely because adolescents and teenagers are still growing and thus lab procedures for these children are different compared to adults. For younger adults, there seems to be an order of lab procedures ([NO] < [<30] < [>30]) but as the age groups get larger, the trend slowly shifts until at age groups 50-60, the general order is ([NO] < [>30] < [<30]). Age groups 80-90 and 90-100 are also outliers. I was quite surprised at the points for the average number of lab procedures, as the patients who were readmitted within 30 days and patients who did not get readmitted had the same average number of lab procedures, with only the patients who were readmitted after 30 days being a little lower. This is because the age group from 80-90 had the highest significance in the logistic model, yet there does not seem to be a very obvious pattern between the average number of lab procedures for patients with readmission types >30 and NO compared to the average number of lab procedures for patients with readmission type <30.

Overall, there were general trends for adults from 20-80, but on both ends of the age spectrum there were definitely outlying patterns.

References

Line Charts in ggplot2 . (2015). Plotly.com. https://plotly.com/ggplot2/line-charts/

alebot. (2019, July 19). toWebGL: “scattergl” objects don’t have these attributes: “hoveron.” GitHub. https://github.com/plotly/plotly.R/issues/1582

Berg, S. M. van den. (n.d.). Chapter 15 Generalised linear models: logistic regression | Analysing Data using Linear Models. In bookdown.org. https://bookdown.org/pingapang9/linear_models_bookdown/logistic.html

Bluecology blog. (2018, April 13). How do I interpret the AIC | R-bloggers. https://www.r-bloggers.com/2018/04/how-do-i-interpret-the-aic/

Castro, H. M., & Ferreira, J. C. (2022). Linear and logistic regression models: when to use and how to interpret them? Jornal Brasileiro de Pneumologia, 48(6), e20220439. https://doi.org/10.36416/1806-3756/e20220439

CDC. (2024, May 15). Diabetes Basics. Www.cdc.gov. https://www.cdc.gov/diabetes/about/index.html

Centers for Disease Control and Prevention. (2024, May 15). Diabetes Risk Factors. Diabetes. https://www.cdc.gov/diabetes/risk-factors/index.html

Chugh, V. (2023, March). Logistic Regression in R Tutorial. Www.datacamp.com. https://www.datacamp.com/tutorial/logistic-regression-R

Frost, J. (2017). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Statistics by Jim. https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

Rumella. (2020, April 30). Help with understanding logistic regression’s $R^2$ value. Cross Validated. https://stats.stackexchange.com/questions/463739/help-with-understanding-logistic-regressions-r2-value

sape research group. (2026). ggplot2 Quick Reference: geom_jitter | Software and Programmer Efficiency Research Group. Inf.usi.ch. https://sape.inf.usi.ch/quick-reference/ggplot2/geom_jitter.html

Schork, J. (2022, December 9). How to Disable Hover Information in plotly Graph in R (Example). Statistics Globe. https://statisticsglobe.com/disable-hover-information-plotly-r

Schrader, R. (n.d.). Lecture 5: A Taste of Model Selection for Multiple Regression. Department of Mathematics & Statistics at the University of New Mexico. Retrieved May 10, 2026, from https://math.unm.edu/~schrader/biostat/bio2/notes/splec5b.pdf

Sievert, C. (2019, December 19). 24 Improving performance | Interactive web-based data visualization with R, plotly, and shiny. Plotly-R.com. https://plotly-r.com/performance

Starbucks. (2022, July 17). R - Replace specific value contents with NA [duplicate]. Stack Overflow. https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na

user8229029. (2020, October 16). How to work with huge data in R Shiny and plotly? Stack Overflow. https://stackoverflow.com/questions/64392725/how-to-work-with-huge-data-in-r-shiny-and-plotly

Vandormael, A. (2022, November 11). Variable selection methods for linear regression in R. Netlify.app. https://alain-vandormael.netlify.app/post/varselect/

World Health Organization. (2024, November 14). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes

Zach. (2022, August 11). How to Suppress Warnings in R (With Examples). Statology. https://www.statology.org/suppress-warnings-in-r/

Zlotek, J., & UChicago Medicine AdventHealth. (2024, October 27). Understanding Diabetes: The Importance of Early Detection and Management. UChicago Medicine AdventHealth. https://www.uchicagomedicineadventhealth.org/blog/understanding-diabetes-importance-early-detection-and-management