Predicting 30-Day Hospital Readmission for Diabetic Patients
Introduction
Diabetes is a critical, and often preventable, chronic disease that has rapidly risen over time (World Health Organization, 2024). This occurs when either the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces (World Health Organization, 2024). Insulin is essential, functioning like a key to let blood sugar into cells in the body to use as energy (CDC, 2024). Diabetes impacts millions in the nation, with over 37 million Americans diagnosed and nearly 9 million Americans that are unaware they have it (Zlotek & UChicago Medicine AdventHealth, 2024).
Detecting diabetes in early stages is crucial to prevent health complications, including heart disease, kidney disease, nerve damage, and vision problems (Zlotek & UChicago Medicine AdventHealth, 2024). Personally, my family has a history of diabetes, with my dad being pre-diabetic. Being able to predict how likely or severe diabetes will be can benefit not only me, but also millions of people across the nation.
This data was collected by using the Health Facts database, a national warehouse that collects comprehensive clinical records across hospital throughout the United States. The group extracted a dataset of 10 years (1999-2008) of clinical data at 130 hospitals and integrated delivery networks throughout the United States. The encounters of interest were extracted from the database with 55 attributes. Then, preliminary analysis and prepossessing of the data were performed to retain only features and encounters that contain sufficient information.
A full extensive list of the variables, including descriptions and values, as well as the % missing can be seen on the website.
Research questions:
Which factors are significantly correlated with whether or not a diabetic patient will be readmitted within 30 days?
Are there any disparities in readmission rates by race?
Data Analysis
Importing Libraries and Dataset
#loading libraries that I will be using throughout the project library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
df <-read_csv("diabetic_data.csv")
Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Cleaning
#when viewing the dataset, all na values are set to ? #substituting all ? values to na: df <- df %>%mutate(across(where(is.character), ~na_if(., "?"))) #source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na # in order to make the readmitted variable a binary variable, patients that are not readmitted within 30 days are set to false, while patients readmitted within 30 days are set to true. df <- df %>%filter(!is.na(readmitted)) %>%#removes the nas in the variables needed for visualization as well as logistic regression models filter(!is.na(number_inpatient)) %>%filter(!is.na(number_outpatient)) %>%filter(!is.na(number_emergency)) %>%filter(!is.na(num_lab_procedures)) %>%filter(!is.na(num_procedures)) %>%filter(!is.na(number_diagnoses)) %>%filter(!is.na(age)) %>%filter(!is.na(gender)) %>%filter(!is.na(race)) %>%mutate(readmit_binary =ifelse(readmitted =="<30", 1, 0))
Logistic Regression
Logistic Regression is used for binary classification, which is fitting for the readmitted variable that we have now made binary (Castro & Ferreira, 2022). Logistic regression can be performed in R using the glm() (generalized linear model function) and using the family argument (Chugh, 2023).
When making the original model, all possible variables were added to establish a baseline of the model, called a full model (Schrader, n.d.). Then, a backwards model was made to identify the variables that are essential to the model. Finally, a refined model was created with the variables identified by the backwards model.
full_model <-glm(readmit_binary ~ number_inpatient + number_outpatient + number_emergency + num_lab_procedures + num_procedures + number_diagnoses + age + gender + race, data = df, family ="binomial") summary(full_model)
Updated Logistic Model (only w/ variables left after AIC)
model <-glm(readmit_binary ~ number_inpatient + num_lab_procedures + age + number_diagnoses + number_emergency, data = df, family ="binomial") summary(model)
Call:
glm(formula = readmit_binary ~ number_inpatient + num_lab_procedures +
age + number_diagnoses + number_emergency, family = "binomial",
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.2058842 0.5834848 -7.208 5.67e-13 ***
number_inpatient 0.2729187 0.0064596 42.250 < 2e-16 ***
num_lab_procedures 0.0017147 0.0005279 3.248 0.00116 **
age[10-20) 0.9165406 0.6058004 1.513 0.13029
age[20-30) 1.4338744 0.5886028 2.436 0.01485 *
age[30-40) 1.4069255 0.5858104 2.402 0.01632 *
age[40-50) 1.3305777 0.5844774 2.277 0.02281 *
age[50-60) 1.2719910 0.5841315 2.178 0.02944 *
age[60-70) 1.4371766 0.5839950 2.461 0.01386 *
age[70-80) 1.4958845 0.5839544 2.562 0.01042 *
age[80-90) 1.5092776 0.5841721 2.584 0.00978 **
age[90-100) 1.4231302 0.5869160 2.425 0.01532 *
number_diagnoses 0.0534689 0.0059220 9.029 < 2e-16 ***
number_emergency 0.0335913 0.0084532 3.974 7.07e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 69886 on 99492 degrees of freedom
Residual deviance: 67604 on 99479 degrees of freedom
AIC: 67632
Number of Fisher Scoring iterations: 6
Statistical Analysis:
Logistic Equations based on glm() have the following equation:
Where p is the probability that an event occurs given the log-odds, which is determined by the logistic equation. For this equation, the value 0 is classified as a fail, and the value 1 is classified as a success (Berg, n.d.).
In this case, p is defined as the probability that a patient is readmitted within 30 days.
Looking at the AIC values and full model results, the variables that are the most correlated with whether or not someone is readmitted is the number of lab procedures, the number of emergency visits, age, the number of diagnoses, and the number of inpatient visits.
The AIC values of the revised model are as follows:
num_lab_procedures: 67641
number_emergency: 67646
age: 67700
number_diagnoses: 67714
number_inpatient: 69304
The AIC of the model is a relative measure of model parsimony (which is defined to be the simplest model that fits the data), so its value does not mean anything unless we compare it to different models (Bluecology blog, 2018). Looking at the AICs of the models during backward elimination, it is clear that the AICs have decreased as more and more variables were eliminated. Additionally, looking at the AIC values of the final model, we can see that even though the number of lab procedures and the number of past emergency visits have very low coefficients, they have very necessary impact, as removing these variables would increase the AIC by a significant value (8.72 for num_lab_procedures, 13.72 for number_emergency). The most critical predictor is the number of past inpatient visits, as its AIC value is the greatest by a significant relative margin compared to the AICs of the other variables in the updates model. The change in AIC if the variable was removed from the model would be 1,617.72, which is a massive jump compared to other variables.
The p-values of the revised model are as follows:
number_inpatient: <2e-16
num_lab_procedures: 0.00116
age[10-20): 0.1303
age[20-30): 0.0149
age[30-40): 0.0163
age[40-50): 0.0228
age[50-60): 0.0294
age[60-70): 0.0139
age[70-80): 0.0104
age[80-90): 0.0098
age[90-100): 0.0153
number_diagnoses: <2e-16
number_emergency: 7.07e-05
Looking at the p-values of all of the variables, the most significant predictors were number_inpatient, number_diagnoses, and number_emergencies, which shows how hospital information about a patient is essential in knowing if a patient will be readmitted or not. The change in p-values of the age variable when comparing different age groups was very surprising to me. I suspected that younger age groups such as age[10-20) would have lower p-values, because seniors have a higher risk of developing diabetes (CDC, 2024). However, there wasn’t a very clear trend in p-values that I expected, as the ages with the highest p-values were the age groups 40-50 and 50-60.
There is no r^2 value for my model because I am performing a logistic regression because logistic regression uses maximum likelihood estimation, unlike linear regression which uses least squares (Rumella, 2020).
Creating Final Plots
I ultimately decided to use the logistic regression plot I used to display the correlation between the number of inpatient visits and the readmission rates, while also adding race as a factor for color. To add additional interactivity, I decided to make this plot using plotly to include interactivity (Line Charts in Ggplot2 , 2015).
To create this final plot, I learned several new things. Firstly, I learned how to use geom_jitter(), which allowed overlapping points to be shown as they are moved very slightly in a random directionndefined. I also learned how to create logistic lines by integrating ggplot into plotly, and also learned how using toWebGL() can help a lot when it comes performance issues associated with rendering really large datasets (just like this dataset) (Sievert, 2019).
#in order to use the regression line, p <-ggplot(df, aes(x = number_inpatient, y = readmit_binary, color = race)) +geom_jitter(width =0.15, height =0.05, alpha =0.2) +#alpha is used to emphasize the density of these clusters of points, as the darker a cluster is, the more points there are. geom_smooth(method ="glm", method.args =list(family ="binomial"), se=FALSE) +#removing the confidence intervals because there would be too much clutter and all of the se bounds would overlap on top of each other. scale_y_continuous(limits =c(-0.1, 1.1)) #the addition rage of +-0.1 can show more points that are slightly increased or decreased through the jitter function p_webgl <-ggplotly(p) %>%style(hoveron =NULL) %>%#without this piece of code, there were several warning messages because webGL scattergl elements do not support/require hoveron properties. By making the hoveron attribute null, we can prevent the warnings while also keeping the graph functional (Source: https://www.statology.org/suppress-warnings-in-r/ and https://github.com/plotly/plotly.R/issues/1582 and https://plotly-r.com/controlling-tooltips) toWebGL() %>%#webGL is used because of the sheer amount of points that need to be used for the graph. This causes significant performance issues on my laptop, so WebGL() is necessary to allow the graph to render much faster. (Source: https://plotly-r.com/performance) layout( title =list(text ="Number of Inpatient Vists vs. Readmission Probability by Race"), xaxis =list(title ="Number of Inpatient Visits in the Proceding Year"), yaxis =list(title ="Patient Readmission") )