Predicting 30-Day Hospital Readmission for Diabetic Patients
Source: Kaiser Permanente
Introduction
Diabetes is a serious, and often preventable, chronic disease whose prevalence has risen rapidly over time (World Health Organization, 2024). It occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces (World Health Organization, 2024). Insulin is essential: it functions like a key that lets blood sugar into the body's cells to be used as energy (CDC, 2024). Diabetes affects millions in the United States, with over 37 million Americans diagnosed and nearly 9 million who are unaware they have it (Zlotek & UChicago Medicine AdventHealth, 2024).
Detecting diabetes in early stages is crucial to prevent health complications, including heart disease, kidney disease, nerve damage, and vision problems (Zlotek & UChicago Medicine AdventHealth, 2024). Personally, my family has a history of diabetes, with my dad being pre-diabetic. Being able to predict how likely or severe diabetes will be can benefit not only me, but also millions of people across the nation.
A full list of the variables is available on the dataset's website (https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008).
Research question: Which factors are significantly correlated with whether or not a diabetic patient will be readmitted within 30 days?
Data Analysis
Importing Libraries and Dataset
# Load the libraries used in this analysis
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Import the dataset as a data frame
df <- read_csv("diabetic_data.csv")
Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Cleaning
# In the raw dataset, missing values are coded as "?".
# Replace every "?" in the character columns with NA.
# Source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na
df <- df %>%
  mutate(across(where(is.character), ~ na_if(., "?")))

# Make the readmitted variable binary: patients readmitted within 30 days ("<30")
# are coded as 1, and all other patients are coded as 0.
df <- df %>%
  filter(!is.na(readmitted)) %>%  # remove rows with a missing readmitted value
  mutate(readmit_binary = ifelse(readmitted == "<30", 1, 0)) %>%
  drop_na()  # with no arguments, drops every row that has an NA in any column
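Because drop_na() is called without arguments, it removes every row that has a missing value in any column. The model summary later in this report shows 1042 null degrees of freedom, which corresponds to 1,043 rows out of the 101,766 that were read in. A minimal sketch of how per-column missingness could be inspected before dropping rows is shown here. The toy data frame and its values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  weight     = c(NA, NA, "[75-100)", NA),
  race       = c("Caucasian", NA, "Asian", "Caucasian"),
  readmitted = c("<30", "NO", ">30", "NO"),
  stringsAsFactors = FALSE
)

# Share of missing values in each column
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Number of rows that have no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

If one column accounts for most of the missing values, restricting the drop to the model's variables (for example, by naming columns inside drop_na()) would keep more rows; whether that is appropriate here is a judgment call for the analysis.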
Logistic Regression
Logistic regression is used for binary classification, which suits the readmitted variable now that it has been made binary (Castro & Ferreira, 2022). In R, logistic regression can be fit with the glm() (generalized linear model) function by setting the family argument to "binomial".
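For reference, the model that glm() fits with a binomial family (whose default link is the logit) can be written as follows. Here p is the probability that readmit_binary equals 1 and x_1, ..., x_k are the predictors; these symbols are notation introduced for this sketch and do not appear elsewhere in the report.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```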
To establish a baseline, a full model was first fit with all candidate variables (Schrader, n.d.). Then, a backward selection model was made to identify the variables that are essential to the model. Finally, a refined model was created with the variables identified by the backward selection.
full_model <- glm(
  readmit_binary ~ number_inpatient + number_outpatient + number_emergency +
    num_lab_procedures + num_procedures + number_diagnoses + age + gender + race,
  data = df,
  family = "binomial"
)
summary(full_model)
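The code for the backward model is not shown in this report. A minimal sketch of one common approach, AIC-based backward elimination with step() from base R's stats package, is given here on simulated data. Whether the report used step() is an assumption, and the variable names x1 to x3 and all simulated values are hypothetical.

```r
set.seed(1)  # simulated data for illustration only, not the hospital data
n <- 500
toy <- data.frame(
  x1 = rpois(n, lambda = 1),  # hypothetical count predictor that drives the outcome
  x2 = rnorm(n),              # hypothetical noise predictor
  x3 = rnorm(n)               # hypothetical noise predictor
)
toy$y <- rbinom(n, size = 1, prob = plogis(-1.5 + 0.4 * toy$x1))

# Full model with every candidate predictor
full <- glm(y ~ x1 + x2 + x3, data = toy, family = "binomial")

# Backward elimination: repeatedly drop the term whose removal lowers AIC the most,
# and stop when no removal lowers AIC any further
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Applied to the report's data, the same call would take full_model as its first argument. Because step() only accepts a removal when AIC decreases, the selected model never has a higher AIC than the starting model.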
Looking at the AIC and the logistic model results, the two variables most strongly associated with readmission are num_lab_procedures and number_inpatient. Thus, the refined model uses only these two variables:
model <- glm(
  readmit_binary ~ number_inpatient + num_lab_procedures,
  data = df,
  family = "binomial"
)
summary(model)
Call:
glm(formula = readmit_binary ~ number_inpatient + num_lab_procedures,
family = "binomial", data = df)
Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)        -1.578589   0.281558  -5.607 2.06e-08 ***
number_inpatient    0.396225   0.073913   5.361 8.29e-08 ***
num_lab_procedures -0.019229   0.005373  -3.579 0.000345 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 659.03 on 1042 degrees of freedom
Residual deviance: 620.44 on 1040 degrees of freedom
AIC: 626.44
Number of Fisher Scoring iterations: 5
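The estimates in this summary are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies the two slope estimates from the output above rather than refitting the model; the object names coefs and odds_ratios are introduced here for illustration.

```r
# Slope estimates copied from the summary(model) output
coefs <- c(number_inpatient = 0.396225, num_lab_procedures = -0.019229)

# Exponentiate the log-odds coefficients to obtain odds ratios
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

Under this model, each additional prior inpatient visit multiplies the odds of readmission within 30 days by about 1.49, and each additional lab procedure multiplies them by about 0.98, holding the other variable fixed.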
Exploration Plots
Firstly, I wanted to explore the relationship between whether or not a patient was readmitted and the number of inpatient visits of the patient in the year preceding the encounter.
ggplot(df, aes(x = as.factor(readmit_binary), y = number_inpatient)) +
  geom_violin()
Because so many data points sit at 0, it is often hard to read the relationship from a violin plot. What if the logistic regression curve were plotted instead?
In a plain scatterplot, only a few points are visible because most observations lie exactly on top of one another.
geom_jitter() adds a small amount of random noise to each point's position, which reveals the overlapping points.
ggplot(df, aes(x = number_inpatient, y = readmit_binary)) +
  # alpha emphasizes density: the darker a cluster is, the more points it contains
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  # the additional range of ±0.1 keeps points that the jitter moved slightly below 0 or above 1
  scale_y_continuous(limits = c(-0.1, 1.1))
`geom_smooth()` using formula = 'y ~ x'
I also wanted to see the relationship between number_inpatient and num_lab_procedures, which were used as the explanatory variables.
ggplot(df, aes(x = number_inpatient, y = num_lab_procedures)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
There isn’t a strong association between the two explanatory variables, which suggests that multicollinearity is unlikely to be a problem for the logistic regression (Frost, 2017).
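The visual impression can also be checked numerically. The sketch below computes the Pearson correlation between two predictors and the corresponding variance inflation factor (VIF); with exactly two predictors, the VIF of either one equals 1 / (1 - r^2). The data are simulated stand-ins, and the names x_inpatient and x_lab are hypothetical.

```r
set.seed(42)  # simulated stand-ins for the two predictors, not the real data
x_inpatient <- rpois(200, lambda = 1)
x_lab <- rpois(200, lambda = 40)

# Pearson correlation between the two predictors
r <- cor(x_inpatient, x_lab)

# With only two predictors, each one's VIF is 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

print(c(correlation = r, vif = vif))
```

On the report's data, the same two lines would use df$number_inpatient and df$num_lab_procedures. A VIF close to 1 indicates that the predictors carry little redundant information.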
I also wanted to see the relationship between readmission and the number of lab procedures. Since the number of lab procedures is also a discrete numerical variable, I will use the same types of graphs:
ggplot(df, aes(x = as.factor(readmit_binary), y = num_lab_procedures)) +
  geom_violin()
ggplot(df, aes(x = num_lab_procedures, y = readmit_binary)) +
  # alpha emphasizes density: the darker a cluster is, the more points it contains
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  # the additional range of ±0.1 keeps points that the jitter moved slightly below 0 or above 1
  scale_y_continuous(limits = c(-0.1, 1.1))
`geom_smooth()` using formula = 'y ~ x'
There doesn’t seem to be a very strong correlation, most likely because so few patients were readmitted. However, the density of points also appears to increase as the number of lab procedures increases. This could simply reflect how many lab procedures patients typically have, rather than any relationship with readmission:
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
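One way to separate how common a lab-procedure count is from how it relates to readmission is to compare, bin by bin, the number of patients with the share of them who were readmitted. The sketch below does this on simulated data; the bin width of 10, the column lab_bin, and all values are assumptions made for illustration.

```r
set.seed(7)  # simulated data for illustration only
toy <- data.frame(
  num_lab_procedures = rpois(1000, lambda = 40),
  readmit_binary = rbinom(1000, size = 1, prob = 0.1)
)

# Group the lab-procedure counts into bins of width 10
toy$lab_bin <- cut(
  toy$num_lab_procedures,
  breaks = seq(0, 100, by = 10),
  include.lowest = TRUE
)

# Number of patients per bin (this is what the point density reflects)
n_per_bin <- table(toy$lab_bin)

# Share of patients readmitted within each bin (NA for empty bins)
rate_per_bin <- tapply(toy$readmit_binary, toy$lab_bin, mean)

print(cbind(n = as.vector(n_per_bin), rate = round(as.vector(rate_per_bin), 3)))
```

If the counts vary widely across bins while the readmission share stays roughly flat, the darker regions in the jitter plot reflect where patients are concentrated rather than a relationship with readmission.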
Creating Final Plots
I ultimately decided to use the logistic regression plot that displays the relationship between the number of inpatient visits and readmission, with race added as the color variable. To make the plot interactive, I built it with plotly (Line Charts in ggplot2, 2015).
p <- ggplot(df, aes(x = number_inpatient, y = readmit_binary, color = race)) +
  # alpha emphasizes density: the darker a cluster is, the more points it contains
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  # the additional range of ±0.1 keeps points that the jitter moved slightly below 0 or above 1
  scale_y_continuous(limits = c(-0.1, 1.1))

plotly::ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'
Looking at this graph (in which smaller groups such as Asian and Other are hard to see), and given that the p-value for the Caucasian group is much more significant, I wanted to compare the logistic regression line for Caucasian patients with that for minority groups to see whether there are any disparities.
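The next plot uses a data frame named df_plot, whose construction is not shown in this report. One way it might have been built is to pool every race other than Caucasian into a single group, as sketched below on a toy data frame. The pooling rule, the label "Non-Caucasian", and the toy values are all assumptions rather than the report's actual code.

```r
# Hypothetical stand-in for df; the values are illustrative only
toy <- data.frame(
  race = c("Caucasian", "AfricanAmerican", "Asian", "Other", "Caucasian"),
  number_inpatient = c(0, 2, 1, 0, 3),
  readmit_binary = c(0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Assumption: every race other than "Caucasian" is pooled into one group
df_plot <- toy
df_plot$race <- ifelse(df_plot$race == "Caucasian", "Caucasian", "Non-Caucasian")

print(table(df_plot$race))
```

Pooling the smaller groups gives the fitted curve more observations to work with, at the cost of hiding differences among those groups.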
p <- ggplot(df_plot, aes(x = number_inpatient, y = readmit_binary, color = race)) +
  # alpha emphasizes density: the darker a cluster is, the more points it contains
  geom_jitter(width = 0.15, height = 0.05, alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  # the additional range of ±0.1 keeps points that the jitter moved slightly below 0 or above 1
  scale_y_continuous(limits = c(-0.1, 1.1)) +
  labs(
    title = "Number of Inpatient Visits vs. Readmission for Diabetes Patients",
    caption = "Data Source: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008"
  )

plotly::ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'
In the graph, non-white patients show a much higher readmission rate than white patients as the number of inpatient visits increases, which could reveal disparities in healthcare.
Final Plot in Tableau
To make a unique plot in Tableau, I also added the insulin variable, since insulin is a key hormone and, as described in the introduction, highly relevant to diabetes. I used a circle chart to compare the average number of lab procedures for diabetes patients who were readmitted within 30 days, those who were readmitted after 30 days, and those who were not readmitted. I created a faceted view for these three categories of patients; within each facet, the patients were divided into four categories depending on changes to their insulin level. Tableau Classic Medium colors were applied to this graph. The graph is below:
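The quantity behind this chart, the average number of lab procedures for each combination of readmission category and insulin category, can also be computed in R. The sketch below uses base R's aggregate() on a toy data frame. The category labels other than "<30" and all of the numbers are assumptions made for illustration.

```r
# Hypothetical toy data; the labels and values are illustrative only
toy <- data.frame(
  readmitted = c("<30", "<30", ">30", ">30", "NO", "NO"),
  insulin = c("Up", "No", "Steady", "Steady", "Down", "Down"),
  num_lab_procedures = c(50, 40, 30, 50, 20, 40)
)

# Mean number of lab procedures for each readmitted-by-insulin combination
avg_labs <- aggregate(
  num_lab_procedures ~ readmitted + insulin,
  data = toy,
  FUN = mean
)

print(avg_labs)
```

Each row of the result corresponds to one circle in a chart of this kind: one readmission category, one insulin category, and the mean lab-procedure count for that group.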
The data that I have chosen for this final project is on diabetes.
The dataset contains 50 variables, both categorical (age, gender, race, etc.) and numerical (num_lab_procedures, num_medications, etc.). In this project, the variables used in my full model were readmitted (whether the patient was readmitted within 30 days, after more than 30 days, or not at all), number_inpatient, number_outpatient, number_emergency, num_lab_procedures, num_procedures, number_diagnoses, age, gender, and race.
Bibliography
Line Charts in ggplot2. (2015). Plotly.com. https://plotly.com/ggplot2/line-charts/
Castro, H. M., & Ferreira, J. C. (2022). Linear and logistic regression models: when to use and how to interpret them? Jornal Brasileiro de Pneumologia, 48(6), e20220439. https://doi.org/10.36416/1806-3756/e20220439
CDC. (2024, May 15). Diabetes Basics. Www.cdc.gov. https://www.cdc.gov/diabetes/about/index.html
Frost, J. (2017). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Statistics by Jim. https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
Schrader, R. (n.d.). Lecture 5: A Taste of Model Selection for Multiple Regression. Department of Mathematics & Statistics at the University of New Mexico. Retrieved May 10, 2026, from https://math.unm.edu/~schrader/biostat/bio2/notes/splec5b.pdf
World Health Organization. (2024, November 14). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes
Zlotek, J., & UChicago Medicine AdventHealth. (2024, October 27). Understanding Diabetes: The Importance of Early Detection and Management. UChicago Medicine AdventHealth. https://www.uchicagomedicineadventhealth.org/blog/understanding-diabetes-importance-early-detection-and-management