Joyce Fang Final Project DATA110

Predicting 30-Day Hospital Readmission for Diabetic Patients

Introduction

Diabetes is a serious, and often preventable, chronic disease whose prevalence has risen rapidly over time (World Health Organization, 2024). It occurs when either the pancreas does not produce enough insulin or the body cannot effectively use the insulin it produces (World Health Organization, 2024). Insulin is essential, functioning like a key that lets blood sugar into the body's cells to be used as energy (CDC, 2024). Diabetes affects millions in the nation, with over 37 million Americans diagnosed and nearly 9 million Americans who are unaware they have it (Zlotek & UChicago Medicine AdventHealth, 2024).

Detecting diabetes in its early stages is crucial for preventing health complications, including heart disease, kidney disease, nerve damage, and vision problems (Zlotek & UChicago Medicine AdventHealth, 2024). Personally, my family has a history of diabetes, and my dad is pre-diabetic. Being able to predict how likely or severe diabetes will be can benefit not only me, but also millions of people across the nation.

Dataset Information

Original source: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008

This data was collected by

Variables:

Total Number of Variables: 50 (the columns in diabetic_data.csv, as reported by read_csv() below)

Categorical Variables: race, gender, age, admission_type_id, discharge_disposition_id, admission_source_id, payer_code, medical_specialty, diag_1, diag_2, diag_3, max_glu_serum, A1Cresult, metformin, repaglinide, nateglinide, among others

Numerical Variables: time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses

The full list of variables is available on the UCI page linked above.

Research question: Which factors are significantly correlated with whether or not a diabetic patient will be readmitted within 30 days?

Data Analysis

Importing Libraries and Dataset

# load the libraries used in this analysis
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# import the dataset as a data frame
df <- read_csv("diabetic_data.csv")

Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

# in the raw dataset, all missing values are recorded as "?"
# replace every "?" with NA:
df <- df %>% mutate(across(where(is.character), ~na_if(., "?"))) 
#source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na  

# To make readmitted a binary variable, patients readmitted within 30 days are coded 1; all other patients (readmitted after 30 days or not readmitted) are coded 0.
df <- df %>% 
  filter(!is.na(readmitted)) %>% # removes rows where readmitted is missing
  mutate(readmit_binary = ifelse(readmitted == "<30", 1, 0)) %>% 
  drop_na() # drops every row with a missing value in any column, leaving 1,043 rows for the models below
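Because drop_na() is called with no column arguments, a row is removed if any of its 50 columns is missing, which likely explains why the models below are fit on only 1,043 rows. The sketch below uses a small made-up data frame (toy_df and its values are hypothetical, not taken from the dataset) to show how the row count changes when completeness is required only for the model variables. It uses base R's complete.cases(); tidyr's drop_na() also accepts column names for the same purpose.

```r
# Toy data frame (hypothetical values) with one mostly-missing column.
toy_df <- data.frame(
  readmitted       = c("<30", "NO", ">30", "NO", "<30", "NO"),
  number_inpatient = c(2, 0, 1, 0, 3, NA),
  weight           = c(NA, NA, "[75-100)", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Requiring every column to be complete keeps only rows with no NA anywhere.
all_cols <- toy_df[complete.cases(toy_df), ]

# Requiring only the model variables to be complete keeps far more rows.
model_vars <- c("readmitted", "number_inpatient")
model_cols <- toy_df[complete.cases(toy_df[, model_vars]), ]

rows_kept <- c(all_columns = nrow(all_cols), model_columns = nrow(model_cols))
print(rows_kept)
```

In the toy data, a single sparsely recorded column is enough to discard most rows, so it is worth checking which columns drive the loss before dropping missing values.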

Logistic Regression

Logistic regression is used for binary classification, which suits the readmitted variable now that it has been made binary (Castro & Ferreira, 2022). In R, logistic regression can be fit with glm() (the generalized linear model function) by setting the family argument to "binomial".

To build the original model, a broad set of candidate predictors was included to establish a baseline, called the full model (Schrader, n.d.). Then, backward elimination was used to identify the variables that are essential to the model. Finally, a refined model was created with the variables that backward elimination retained.
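The three steps can be sketched on simulated data before applying them to the hospital data. Everything in this example is an assumption made for illustration: the data are simulated, and the names sim, signal, noise_a, and noise_b are hypothetical. Only glm() and step() are the same functions used in this report.

```r
set.seed(110)  # arbitrary seed so the simulation is reproducible

# Simulated data (not the hospital data): one real predictor, two noise predictors.
n <- 500
sim <- data.frame(
  signal  = rnorm(n),
  noise_a = rnorm(n),
  noise_b = rnorm(n)
)
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * sim$signal))

# Step 1: full model with every candidate predictor.
sim_full <- glm(y ~ signal + noise_a + noise_b, data = sim, family = "binomial")

# Step 2: backward elimination, dropping one term at a time while that lowers the AIC.
sim_reduced <- step(sim_full, direction = "backward", trace = 0)

# Step 3: the terms that survive are the ones a refined model would use.
kept_terms <- attr(terms(sim_reduced), "term.labels")
print(kept_terms)
```

Setting trace = 0 hides the step-by-step table; leaving it at the default prints a trace like the one shown later in this section.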

full_model <- glm(readmit_binary ~ number_inpatient + number_outpatient + number_emergency + num_lab_procedures + num_procedures + number_diagnoses + age + gender + race, data = df, family = "binomial")  
summary(full_model)


Call:
glm(formula = readmit_binary ~ number_inpatient + number_outpatient + 
    number_emergency + num_lab_procedures + num_procedures + 
    number_diagnoses + age + gender + race, family = "binomial", 
    data = df)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.338e+01  8.827e+02  -0.015 0.987905    
number_inpatient    3.709e-01  8.368e-02   4.432 9.32e-06 ***
number_outpatient  -1.954e-03  5.040e-02  -0.039 0.969074    
number_emergency    1.561e-01  1.139e-01   1.371 0.170317    
num_lab_procedures -1.984e-02  5.708e-03  -3.477 0.000508 ***
num_procedures      1.682e-02  6.389e-02   0.263 0.792390    
number_diagnoses   -4.115e-02  8.484e-02  -0.485 0.627665    
age[10-20)         -3.624e-01  1.248e+03   0.000 0.999768    
age[20-30)          1.208e+01  8.827e+02   0.014 0.989084    
age[30-40)          1.289e+01  8.827e+02   0.015 0.988352    
age[40-50)          1.267e+01  8.827e+02   0.014 0.988547    
age[50-60)          1.287e+01  8.827e+02   0.015 0.988370    
age[60-70)          1.287e+01  8.827e+02   0.015 0.988369    
age[70-80)          1.298e+01  8.827e+02   0.015 0.988265    
age[80-90)          1.271e+01  8.827e+02   0.014 0.988517    
age[90-100)         1.238e+01  8.827e+02   0.014 0.988811    
genderMale          2.747e-01  2.208e-01   1.244 0.213505    
raceAsian          -1.367e+01  8.827e+02  -0.015 0.987643    
raceCaucasian      -9.221e-01  5.144e-01  -1.792 0.073057 .  
raceOther          -2.442e-01  9.314e-01  -0.262 0.793202    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 659.03  on 1042  degrees of freedom
Residual deviance: 611.08  on 1023  degrees of freedom
AIC: 651.08

Number of Fisher Scoring iterations: 13
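The standard errors near 883 on the intercept, most age levels, and raceAsian are far larger than the estimates themselves. That pattern usually appears when a category contains very few patients or no readmissions at all, so its coefficient cannot be estimated reliably. The sketch below reproduces the effect on made-up data (toy_sep and its group labels are hypothetical) and shows the cross-tabulation that reveals it.

```r
# Made-up data: group "B" has no events at all, so its log-odds is unbounded.
toy_sep <- data.frame(
  group = factor(rep(c("A", "B"), each = 50)),
  event = c(rep(c(1, 0), times = c(10, 40)), rep(0, 50))
)

# The cross-tabulation shows the empty cell (group B, event = 1).
cell_counts <- table(toy_sep$group, toy_sep$event)
print(cell_counts)

# glm() may warn about fitted probabilities of 0 or 1; the warning is
# suppressed here because producing that situation is the point.
sep_fit <- suppressWarnings(glm(event ~ group, data = toy_sep, family = "binomial"))
sep_se  <- coef(summary(sep_fit))[, "Std. Error"]
print(sep_se)
```

Running table(df$age, df$readmit_binary) and table(df$race, df$readmit_binary) on the cleaned data would show whether the same thing is happening here.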

# justification using backward elimination
# source: https://alain-vandormael.netlify.app/post/varselect/
backwards_model <- step(full_model, direction = "backward")
# summary(backwards_model)  # not run here; the selected model is refit below

Start:  AIC=651.08
readmit_binary ~ number_inpatient + number_outpatient + number_emergency + 
    num_lab_procedures + num_procedures + number_diagnoses + 
    age + gender + race

                     Df Deviance    AIC
- age                 9   613.75 635.75
- race                3   614.70 648.70
- number_outpatient   1   611.09 649.09
- num_procedures      1   611.15 649.15
- number_diagnoses    1   611.31 649.31
- gender              1   612.64 650.64
- number_emergency    1   612.82 650.82
<none>                    611.08 651.08
- num_lab_procedures  1   622.90 660.90
- number_inpatient    1   630.92 668.92

Step:  AIC=635.75
readmit_binary ~ number_inpatient + number_outpatient + number_emergency + 
    num_lab_procedures + num_procedures + number_diagnoses + 
    gender + race

                     Df Deviance    AIC
- race                3   617.26 633.26
- number_outpatient   1   613.75 633.75
- number_diagnoses    1   613.82 633.82
- num_procedures      1   613.88 633.88
- number_emergency    1   614.92 634.92
- gender              1   615.58 635.58
<none>                    613.75 635.75
- num_lab_procedures  1   626.18 646.18
- number_inpatient    1   633.38 653.38

Step:  AIC=633.26
readmit_binary ~ number_inpatient + number_outpatient + number_emergency + 
    num_lab_procedures + num_procedures + number_diagnoses + 
    gender

                     Df Deviance    AIC
- number_outpatient   1   617.26 631.26
- num_procedures      1   617.34 631.34
- number_diagnoses    1   617.34 631.34
- number_emergency    1   618.58 632.58
- gender              1   618.98 632.98
<none>                    617.26 633.26
- num_lab_procedures  1   628.87 642.87
- number_inpatient    1   637.16 651.16

Step:  AIC=631.26
readmit_binary ~ number_inpatient + number_emergency + num_lab_procedures + 
    num_procedures + number_diagnoses + gender

                     Df Deviance    AIC
- num_procedures      1   617.34 629.34
- number_diagnoses    1   617.34 629.34
- number_emergency    1   618.59 630.59
- gender              1   618.98 630.98
<none>                    617.26 631.26
- num_lab_procedures  1   628.92 640.92
- number_inpatient    1   637.30 649.30

Step:  AIC=629.34
readmit_binary ~ number_inpatient + number_emergency + num_lab_procedures + 
    number_diagnoses + gender

                     Df Deviance    AIC
- number_diagnoses    1   617.43 627.43
- number_emergency    1   618.63 628.63
- gender              1   619.12 629.12
<none>                    617.34 629.34
- num_lab_procedures  1   628.97 638.97
- number_inpatient    1   637.31 647.31

Step:  AIC=627.43
readmit_binary ~ number_inpatient + number_emergency + num_lab_procedures + 
    gender

                     Df Deviance    AIC
- number_emergency    1   618.67 626.67
- gender              1   619.22 627.22
<none>                    617.43 627.43
- num_lab_procedures  1   630.38 638.38
- number_inpatient    1   637.31 645.31

Step:  AIC=626.67
readmit_binary ~ number_inpatient + num_lab_procedures + gender

                     Df Deviance    AIC
- gender              1   620.44 626.44
<none>                    618.67 626.67
- num_lab_procedures  1   631.17 637.17
- number_inpatient    1   645.65 651.65

Step:  AIC=626.44
readmit_binary ~ number_inpatient + num_lab_procedures

                     Df Deviance    AIC
<none>                    620.44 626.44
- num_lab_procedures  1   632.88 636.88
- number_inpatient    1   648.33 652.33
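For a logistic model fit to 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so the AIC is the residual deviance plus twice the number of estimated coefficients. The short check below recomputes AIC values from the deviances printed above. The coefficient counts (20 for the full model, 11 after dropping age, 3 for the final model) are read off the output, and aic_from_deviance is a helper name introduced here for illustration.

```r
# AIC for a binary-outcome logistic model:
# residual deviance + 2 * (number of estimated coefficients).
aic_from_deviance <- function(deviance, n_coef) deviance + 2 * n_coef

# Deviances and coefficient counts read from the output above.
aic_full   <- aic_from_deviance(611.08, n_coef = 20)  # full model
aic_no_age <- aic_from_deviance(613.75, n_coef = 11)  # age removed (9 fewer coefficients)
aic_final  <- aic_from_deviance(620.44, n_coef = 3)   # intercept + two predictors

print(c(full = aic_full, no_age = aic_no_age, final = aic_final))
```

Dropping age barely changes the deviance but removes nine coefficients, which is why it is the first term eliminated.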

Looking at the AIC and the logistic model results, backward elimination retains two predictors, num_lab_procedures and number_inpatient; removing either one would raise the AIC above 626.44. Thus, our refined model will only use these two variables:

model <- glm(readmit_binary ~ number_inpatient + num_lab_procedures, data = df, family = "binomial")  
summary(model)


Call:
glm(formula = readmit_binary ~ number_inpatient + num_lab_procedures, 
    family = "binomial", data = df)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.578589   0.281558  -5.607 2.06e-08 ***
number_inpatient    0.396225   0.073913   5.361 8.29e-08 ***
num_lab_procedures -0.019229   0.005373  -3.579 0.000345 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 659.03  on 1042  degrees of freedom
Residual deviance: 620.44  on 1040  degrees of freedom
AIC: 626.44

Number of Fisher Scoring iterations: 5
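The coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below does this with the estimates printed above (typed in by hand rather than taken from the fitted object) and computes a predicted probability for a hypothetical patient. The patient values (2 prior inpatient visits, 40 lab procedures) are made up for illustration.

```r
# Estimates copied from the summary(model) output above.
coefs <- c(
  intercept          = -1.578589,
  number_inpatient   =  0.396225,
  num_lab_procedures = -0.019229
)

# Odds ratios: multiplicative change in the odds of readmission per one-unit increase.
odds_ratios <- exp(coefs[c("number_inpatient", "num_lab_procedures")])
print(round(odds_ratios, 3))

# Predicted probability for a hypothetical patient (values are illustrative).
linear_predictor <- coefs[["intercept"]] +
  coefs[["number_inpatient"]] * 2 +
  coefs[["num_lab_procedures"]] * 40
predicted_prob <- plogis(linear_predictor)  # inverse logit
print(round(predicted_prob, 3))
```

Each additional prior inpatient visit multiplies the odds of 30-day readmission by roughly 1.49, while each additional lab procedure multiplies them by roughly 0.98.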

Exploration Plots

Firstly, I wanted to explore the relationship between whether or not a patient was readmitted and the number of inpatient visits of the patient in the year preceding the encounter.

ggplot(df, aes(x = as.factor(readmit_binary), y = number_inpatient)) + 
  geom_violin()

Because so many data points sit at 0, it is hard to read the relationship from a violin plot. What if the fitted logistic curve is plotted instead?

ggplot(df, aes(x = number_inpatient, y = readmit_binary)) +
  geom_point() + 
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + 
  scale_y_continuous(limits = c(0, 1))

`geom_smooth()` using formula = 'y ~ x'

In this plot, only a handful of points are visible because many observations share the same coordinates and are drawn on top of one another.

With geom_jitter(), a small amount of random noise is added to each point's position, which reveals the overlapping points.

ggplot(df, aes(x = number_inpatient, y = readmit_binary)) +
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) + # alpha makes dense clusters visible: the darker a cluster, the more points it contains
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + 
  scale_y_continuous(limits = c(-0.1, 1.1)) # the extra range of ±0.1 keeps points that the jitter moved slightly above 1 or below 0

`geom_smooth()` using formula = 'y ~ x'
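Because the jittered points only hint at the pattern, a numeric summary can help: the share of patients readmitted at each number of prior inpatient visits. The sketch below uses a small made-up data frame (toy_visits is hypothetical) rather than df.

```r
# Made-up example data: prior inpatient visits and a 0/1 readmission flag.
toy_visits <- data.frame(
  number_inpatient = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3),
  readmit_binary   = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 1)
)

# The mean of a 0/1 variable within each group is the proportion readmitted.
readmit_rate <- aggregate(readmit_binary ~ number_inpatient, data = toy_visits, FUN = mean)
print(readmit_rate)
```

Replacing toy_visits with df in the aggregate() call would give the observed readmission proportion at each visit count, which is what the fitted curve smooths.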

I also wanted to see the relationship between number_inpatient and num_lab_procedures, which were used as the explanatory variables.

ggplot(df, aes(x = number_inpatient, y = num_lab_procedures)) + 
  geom_point() + 
  geom_smooth()

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

There isn’t a strong association between the two explanatory variables, which suggests that multicollinearity is not a concern for the logistic regression (Frost, 2017).
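The visual impression can be checked numerically. With only two predictors, the variance inflation factor (VIF) for either one is 1 / (1 - r^2), where r is their correlation; values near 1 indicate little multicollinearity. The sketch below uses simulated, independently drawn predictors (not the hospital data), so the variable names only mimic the real ones and the Poisson rates are arbitrary assumptions.

```r
set.seed(110)  # arbitrary seed for reproducibility

# Simulated stand-ins for the two predictors, drawn independently.
sim_inpatient <- rpois(1000, lambda = 0.6)
sim_labs      <- rpois(1000, lambda = 43)

# Pearson correlation between the two predictors.
r <- cor(sim_inpatient, sim_labs)

# With exactly two predictors, VIF = 1 / (1 - r^2) for both of them.
vif_two_predictors <- 1 / (1 - r^2)
print(c(correlation = r, vif = vif_two_predictors))
```

The same two lines applied to df$number_inpatient and df$num_lab_procedures would put a number on the weak association seen in the scatterplot.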

Additionally, I also wanted to see the relationship between readmission rates and the number of lab procedures. Since lab procedures is also a discrete numerical variable, I will use the same types of graphs:

ggplot(df, aes(x = as.factor(readmit_binary), y = num_lab_procedures)) + 
  geom_violin()

ggplot(df, aes(x = num_lab_procedures, y = readmit_binary)) +
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) + # alpha makes dense clusters visible: the darker a cluster, the more points it contains
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + 
  scale_y_continuous(limits = c(-0.1, 1.1)) # the extra range of ±0.1 keeps points that the jitter moved slightly above 1 or below 0

`geom_smooth()` using formula = 'y ~ x'

There doesn’t seem to be a very strong relationship, most likely because so few patients were readmitted. However, the points also appear denser at higher numbers of lab procedures. This may simply reflect how many lab procedures patients typically have rather than any link to readmission, which a histogram of num_lab_procedures can show:

ggplot(df, aes(x = num_lab_procedures)) + 
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Creating Final Plots

I ultimately decided to use the logistic regression plot of the number of inpatient visits against readmission, with race mapped to color. To make the plot interactive, I converted it with plotly's ggplotly() (Line Charts in ggplot2, 2015).

p <- ggplot(df, aes(x = number_inpatient, y = readmit_binary, color = race)) +
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) + # alpha makes dense clusters visible: the darker a cluster, the more points it contains
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + 
  scale_y_continuous(limits = c(-0.1, 1.1)) # the extra range of ±0.1 keeps points that the jitter moved slightly above 1 or below 0

plotly::ggplotly(p)

`geom_smooth()` using formula = 'y ~ x'

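Looking at this graph (in which minority groups such as Asian and Other are hard to see), and noting that Caucasian had the smallest p-value of the race levels in the full model (0.073, which is not significant at the 0.05 level), I wanted to compare the logistic regression curve for Caucasian patients against the curve for all minority groups combined to see if there are any disparities.

To do this, I recoded race into two groups, White and Non-White: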

df_plot <- df %>% 
  mutate(race = ifelse(race == "Caucasian", "White", "Non-White"))

p <- ggplot(df_plot, aes(x = number_inpatient, y = readmit_binary, color = race)) +
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) + # alpha makes dense clusters visible: the darker a cluster, the more points it contains
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + 
  scale_y_continuous(limits = c(-0.1, 1.1)) + # the extra range of ±0.1 keeps points that the jitter moved slightly above 1 or below 0
  labs(
    title = "Number of Inpatient Visits vs. Readmission for Diabetes Patients", 
    caption = "Data Source: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008"
  )

plotly::ggplotly(p)

`geom_smooth()` using formula = 'y ~ x'

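In the graph, non-White patients have a higher predicted readmission probability than White patients as the number of inpatient visits increases, which could point to disparities in healthcare. Because race was dropped during backward elimination and the cleaned sample is small, this difference should be read as exploratory rather than conclusive.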
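Whether the two curves really differ can be tested by adding a group term and its interaction with number_inpatient to the model, then comparing it with the model that omits them. The sketch below does this on simulated data in which both groups share the same true curve, so the group proportions, sample size, and coefficients are all assumptions for illustration.

```r
set.seed(110)  # arbitrary seed for reproducibility

# Simulated data: both groups follow the same true relationship.
n <- 1000
sim <- data.frame(
  number_inpatient = rpois(n, lambda = 0.6),
  group = factor(sample(c("White", "Non-White"), n, replace = TRUE, prob = c(0.75, 0.25)))
)
sim$readmit_binary <- rbinom(n, size = 1, prob = plogis(-2 + 0.4 * sim$number_inpatient))

# Models without and with a group-specific intercept and slope.
fit_common <- glm(readmit_binary ~ number_inpatient, data = sim, family = "binomial")
fit_group  <- glm(readmit_binary ~ number_inpatient * group, data = sim, family = "binomial")

# Likelihood-ratio test: a small p-value would suggest the curves differ by group.
lrt <- anova(fit_common, fit_group, test = "Chisq")
p_value <- lrt[2, "Pr(>Chi)"]
print(p_value)
```

Applied to df_plot with race in place of group, the same comparison would indicate whether the gap between the two curves is larger than chance alone would produce.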

Final Plot in Tableau

To make a distinct plot in Tableau, I added the insulin variable, since insulin is a key hormone and central to diabetes, as described in the introduction. I used a circle chart to compare the average number of lab procedures for diabetes patients who were readmitted within 30 days, readmitted after 30 days, or not readmitted. The view is faceted by these three categories of patients, and within each facet the patients are divided into four categories according to the change recorded in the insulin variable. The Tableau Classic Medium color palette was applied to this graph. The graph is below:

The graph can also be accessed here: https://public.tableau.com/views/FinalProject-Plot2/Sheet1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

Essay

The data that I have chosen for this final project is on diabetes. Diabetes is a serious, and often preventable, chronic disease whose prevalence has risen rapidly over time (World Health Organization, 2024). It occurs when either the pancreas does not produce enough insulin or the body cannot effectively use the insulin it produces (World Health Organization, 2024). Insulin is essential, functioning like a key that lets blood sugar into the body's cells to be used as energy (CDC, 2024). Diabetes affects millions in the nation, with over 37 million Americans diagnosed and nearly 9 million Americans who are unaware they have it (Zlotek & UChicago Medicine AdventHealth, 2024).

The dataset contains 50 variables, both categorical (age, gender, race, etc.) and numerical (num_lab_procedures, num_medications, etc.). In this project, the response variable was readmitted (whether the patient was readmitted within 30 days, after more than 30 days, or not at all), and the predictors in my full model were number_inpatient, number_outpatient, number_emergency, num_lab_procedures, num_procedures, number_diagnoses, age, gender, and race.

Bibliography

Line Charts in ggplot2. (2015). Plotly.com. https://plotly.com/ggplot2/line-charts/

Castro, H. M., & Ferreira, J. C. (2022). Linear and logistic regression models: when to use and how to interpret them? Jornal Brasileiro de Pneumologia, 48(6), e20220439. https://doi.org/10.36416/1806-3756/e20220439

CDC. (2024, May 15). Diabetes Basics. Www.cdc.gov. https://www.cdc.gov/diabetes/about/index.html

Frost, J. (2017). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Statistics by Jim. https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

Schrader, R. (n.d.). Lecture 5: A Taste of Model Selection for Multiple Regression. Department of Mathematics & Statistics at the University of New Mexico. Retrieved May 10, 2026, from https://math.unm.edu/~schrader/biostat/bio2/notes/splec5b.pdf

World Health Organization. (2024, November 14). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes

Zlotek, J., & UChicago Medicine AdventHealth. (2024, October 27). Understanding Diabetes: The Importance of Early Detection and Management. UChicago Medicine AdventHealth. https://www.uchicagomedicineadventhealth.org/blog/understanding-diabetes-importance-early-detection-and-management