Diabetes is a critical, and often preventable, chronic disease that has rapidly risen over time (World Health Organization, 2024). This occurs when either the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces (World Health Organization, 2024). Insulin is essential, functioning like a key to let blood sugar into cells in the body to use as energy (CDC, 2024). Diabetes impacts millions in the nation, with over 37 million Americans diagnosed and nearly 9 million Americans that are unaware they have it (Zlotek & UChicago Medicine AdventHealth, 2024).
Detecting diabetes in early stages is crucial to prevent health complications, including heart disease, kidney disease, nerve damage, and vision problems (Zlotek & UChicago Medicine AdventHealth, 2024). Personally, my family has a history of diabetes, with my dad being pre-diabetic. Being able to predict how likely or severe diabetes will be can benefit not only me, but also millions of people across the nation.
This data was collected by using the Health Facts database, a national warehouse that collects comprehensive clinical records across hospital throughout the United States. The group extracted a dataset of 10 years (1999-2008) of clinical data at 130 hospitals and integrated delivery networks throughout the United States. The encounters of interest were extracted from the database with 55 attributes. Then, preliminary analysis and prepossessing of the data were performed to retain only features and encounters that contain sufficient information.
A full extensive list of the variables, including descriptions and values, as well as the % missing can be seen on the website.
The follow variables are used in the study:
Variable Name
Role
Type
Demographic
Description
readmitted
Target
Categorical
Days to inpatient readmission. Values: <30 if the patient was readmitted in less than 30 days, >30 if the patient was readmitted in more than 30 days, and No for no record of readmission.
race
Feature
Categorical
Race
Values: Caucasian, Asian, African American, Hispanic, and other
gender
Feature
Categorical
Gender
Values: male, female, and unknown/invalid
age
Feature
Categorical
Age
Grouped in 10-year intervals: [0, 10), [10, 20),…, [90, 100)
num_lab_procedures
Feature
Integer
Number of lab tests performed during the encounter
num_procedures
Feature
Integer
Number of procedures (other than lab tests) performed during the encounter
number_outpatient
Feature
Integer
Number of outpatient visits of the patient in the year preceding the encounter
number_emergency
Feature
Integer
Number of emergency visits of the patient in the year preceding the encounter
number_inpatient
Feature
Integer
Number of inpatient visits of the patient in the year preceding the encounter
number_diagnoses
Feature
Integer
Number of diagnoses entered to the system
Research questions:
Which factors are significantly correlated with whether or not a diabetic patient will be readmitted within 30 days?
Are there any disparities in readmission rates by race or age?
Data Analysis
Importing Libraries and Dataset
#loading libraries that I will be using throughout the project library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
df <-read_csv("diabetic_data.csv")
Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Cleaning
#when viewing the dataset, all na values are set to ? #substituting all ? values to na: df <- df %>%mutate(across(where(is.character), ~na_if(., "?"))) #source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na # in order to make the readmitted variable a binary variable, patients that are not readmitted within 30 days are set to false, while patients readmitted within 30 days are set to true. df <- df %>%filter(!is.na(readmitted)) %>%#removes the nas in the variables needed for visualization as well as logistic regression models filter(!is.na(number_inpatient)) %>%filter(!is.na(number_outpatient)) %>%filter(!is.na(number_emergency)) %>%filter(!is.na(num_lab_procedures)) %>%filter(!is.na(num_procedures)) %>%filter(!is.na(number_diagnoses)) %>%filter(!is.na(age)) %>%filter(!is.na(gender)) %>%filter(!is.na(race)) %>%mutate(readmit_binary =if_else(readmitted =="<30", 1, 0))
Logistic Regression
Logistic Regression is used for binary classification, which is fitting for the readmitted variable that we have now made binary (Castro & Ferreira, 2022). Logistic regression can be performed in R using the glm() (generalized linear model function) and using the family argument (Chugh, 2023).
When making the original model, all possible variables were added to establish a baseline of the model, called a full model (Schrader, n.d.). Then, a backwards model was made to identify the variables that are essential to the model. Finally, a refined model was created with the variables identified by the backwards model.
full_model <-glm(readmit_binary ~ number_inpatient + number_outpatient + number_emergency + num_lab_procedures + num_procedures + number_diagnoses + age + gender + race, data = df, family ="binomial") summary(full_model)
Updated Logistic Model (only w/ variables left after AIC)
model <-glm(readmit_binary ~ number_inpatient + num_lab_procedures + age + number_diagnoses + number_emergency, data = df, family ="binomial") summary(model)
Call:
glm(formula = readmit_binary ~ number_inpatient + num_lab_procedures +
age + number_diagnoses + number_emergency, family = "binomial",
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.2058842 0.5834848 -7.208 5.67e-13 ***
number_inpatient 0.2729187 0.0064596 42.250 < 2e-16 ***
num_lab_procedures 0.0017147 0.0005279 3.248 0.00116 **
age[10-20) 0.9165406 0.6058004 1.513 0.13029
age[20-30) 1.4338744 0.5886028 2.436 0.01485 *
age[30-40) 1.4069255 0.5858104 2.402 0.01632 *
age[40-50) 1.3305777 0.5844774 2.277 0.02281 *
age[50-60) 1.2719910 0.5841315 2.178 0.02944 *
age[60-70) 1.4371766 0.5839950 2.461 0.01386 *
age[70-80) 1.4958845 0.5839544 2.562 0.01042 *
age[80-90) 1.5092776 0.5841721 2.584 0.00978 **
age[90-100) 1.4231302 0.5869160 2.425 0.01532 *
number_diagnoses 0.0534689 0.0059220 9.029 < 2e-16 ***
number_emergency 0.0335913 0.0084532 3.974 7.07e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 69886 on 99492 degrees of freedom
Residual deviance: 67604 on 99479 degrees of freedom
AIC: 67632
Number of Fisher Scoring iterations: 6
Statistical Analysis:
Logistic Equations based on glm() have the following equation:
Where p is the probability that an event occurs given the log-odds, which is determined by the logistic equation. For this equation, the value 0 is classified as a failure, and the value 1 is classified as a success (Berg, n.d.).
In this case, p is defined as the probability that a patient is readmitted within 30 days.
Looking at the AIC values and full model results, the variables that are the most correlated with whether or not someone is readmitted is the number of lab procedures, the number of emergency visits, age, the number of diagnoses, and the number of inpatient visits.
The AIC values of the revised model are as follows:
num_lab_procedures: 67641
number_emergency: 67646
age: 67700
number_diagnoses: 67714
number_inpatient: 69304
The AIC of the model is a relative measure of model parsimony (which is defined to be the simplest model that fits the data), so its value does not mean anything unless we compare it to different models (Bluecology blog, 2018). Looking at the AICs of the models during backward elimination, it is clear that the AICs have decreased as more and more variables were eliminated. Additionally, looking at the AIC values of the final model, we can see that even though the number of lab procedures and the number of past emergency visits have very low coefficients, they have very necessary impact, as removing these variables would increase the AIC by a significant value (8.72 for num_lab_procedures, 13.72 for number_emergency). The most critical predictor is the number of past inpatient visits, as its AIC value is the greatest by a significant relative margin compared to the AICs of the other variables in the updates model. The change in AIC if the variable was removed from the model would be 1,617.72, which is a massive jump compared to other variables.
The p-values of the revised model are as follows:
number_inpatient: <2e-16
num_lab_procedures: 0.00116
age[10-20): 0.1303
age[20-30): 0.0149
age[30-40): 0.0163
age[40-50): 0.0228
age[50-60): 0.0294
age[60-70): 0.0139
age[70-80): 0.0104
age[80-90): 0.0098
age[90-100): 0.0153
number_diagnoses: <2e-16
number_emergency: 7.07e-05
Looking at the p-values of all of the variables, the most significant predictors were number_inpatient, number_diagnoses, and number_emergencies, which shows how hospital information about a patient is essential in knowing if a patient will be readmitted or not. The change in p-values of the age variable when comparing different age groups was very surprising to me. I suspected that younger age groups such as age[10-20) would have lower p-values, because seniors have a higher risk of developing diabetes (CDC, 2024). However, there wasn’t a very clear trend in p-values that I expected, as the ages with the highest p-values were the age groups 40-50 and 50-60.
There is no r^2 value for my model because I am performing a logistic regression. Logistic regression uses maximum likelihood estimation, unlike linear regression which uses least squares (Rumella, 2020).
Creating Final Plots
I ultimately decided to use the logistic regression plot I used to display the correlation between the number of inpatient visits and the readmission rates, while also adding race as a factor for color. To add additional interactivity, I decided to make this plot using plotly to include interactivity (Line Charts in Ggplot2 , 2015).
set.seed(1008) #random seed that can be used for reproducibilitysample_df <- df %>%sample_n(5000) #we have to take a sample so that rpubs will won't crash the document. an image of the graph with all the samples will be attached below. #in order to use the regression line, p <-ggplot(sample_df, aes(x = number_inpatient, y = readmit_binary, color = race)) +geom_jitter(width =0.15, height =0.05, alpha =0.2) +#alpha is used to emphasize the density of these clusters of points, as the darker a cluster is, the more points there are. geom_smooth(method ="glm", method.args =list(family ="binomial"), se=FALSE) +#removing the confidence intervals because there would be too much clutter and all of the se bounds would overlap on top of each other. scale_y_continuous(limits =c(-0.1, 1.1)) +#the addition rage of +-0.1 can show more points that are slightly increased or decreased through the jitter function theme_light()+scale_color_brewer(palette ="Dark2") p_webgl <-ggplotly(p) %>%style(hoveron =NULL) %>%#without this piece of code, there were several warning messages because webGL scattergl elements do not support/require hoveron properties. By making the hoveron attribute null, we can prevent the warnings while also keeping the graph functional (Source: https://www.statology.org/suppress-warnings-in-r/ and https://github.com/plotly/plotly.R/issues/1582 and https://plotly-r.com/controlling-tooltips) layout( title =list(text ="Number of Inpatient Visits vs. Readmission Probability by Race"), xaxis =list(title ="Number of Inpatient Visits in the Proceding Year"), yaxis =list(title ="Patient Readmission"),annotations =list(x =1.2, y =-0.0, text ="Source: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008", showarrow = F, xref='paper', yref='paper', xanchor='auto', yanchor='auto', xshift=0, yshift=0,font=list(size=11)), margin =list(t =100, b =50, l =50, r =50) #used to stop the title from getting cut off, source: https://plotly.com/python/figure-labels/ )
`geom_smooth()` using formula = 'y ~ x'
p_webgl
Note: Because the visualization above is only a ~5% sample of the entire data, the graph will vary sometimes. This is some of the limitations that come with having to reduce the dataset in order to have sufficient performance and not crash rpubs.
Here is the visualization with all samples included:
To create this final plot, I learned several new things. Firstly, I learned how to use geom_jitter(), which allowed overlapping points to be shown as they are moved very slightly in a random directionndefined. I also learned how to create logistic lines by integrating ggplot into plotly, and also learned how using toWebGL() can help a lot when it comes performance issues associated with rendering really large datasets (just like this dataset) (Sievert, 2019). I also learned how to sample and debug my code, as if I used the entire dataset for the visualization, Rpubs would crash and wouldn’t show any of my images or code. This took a long time to realize, and I learned that the data visualization process is not always linear.
Looking at the visualization, there are several key patterns to note. Looking at the geom_jitter, the points seem to extend out more when patients are readmitted, which indicate that there is an association with higher previous inpatient visits and a higher probability that a patient will be readmitted. Then, looking at the distributions of the different races, it is surprising to see that many races, including Asians and Hispanics, have very little data entries due to their curve length and their individual scatter (by clicking on the different races to toggle variability). This most likely caused their curves to be different from the curves showing the patient data for African Americans and Caucasian patients, which essentially overlapped each other. This shows that there are equal likelihoods of patients being readmitted to the hospital based on the number of previous in-patient visits, regardless of race.
Final Plot in Tableau
For the plot in Tableau, I wanted to compare the average number of lab procedures that a patient has performed on them during the encounter based on their readmission condition (<30, >30, or NO). To compare them, I wanted to give the user the option to select the age groups that they wanted to choose, as a scatter of the dots don’t give a lot of information at first. I also decided to change the sizes of the dots to number of patients in the age group to give additional information.
Below is an example of comparing three age groups in Tableau:
The visualization shows that there are some minor differences in age groups. Some outliers for the age group were the 0-10 age group and the 10-20 age group, likely because adolescents and teenagers are still growing and thus lab procedures for these children are different compared to adults. For younger adults, there seems to be an order of lab procedures ([NO] < [<30] < [>30]) but as the age groups get larger, the trend slowly shifts until at age groups 50-60, the general order is ([NO] < [>30] < [<30]). Age groups 80-90 and 90-100 are also outliers. I was quite surprised at the points for the average number of lab procedures, as the patients who were readmitted within 30 days and patients who did not get readmitted had the same average number of lab procedures, with only the patients who were readmitted after 30 days being a little lower. This is because the age group from 80-90 had the highest significance in the logistic model, yet there does not seem to be a very obvious pattern between the average number of lab procedures for patients with readmission types >30 and NO compared to the average number of lab procedures for patients with readmission type <30.
Overall, there were general trends for adults from 20-80, but on both ends of the age spectrum there were definitely outlying patterns.
References
Line Charts in ggplot2 . (2015). Plotly.com. https://plotly.com/ggplot2/line-charts/
alebot. (2019, July 19). toWebGL: “scattergl” objects don’t have these attributes: “hoveron.” GitHub. https://github.com/plotly/plotly.R/issues/1582
Berg, S. M. van den. (n.d.). Chapter 15 Generalised linear models: logistic regression | Analysing Data using Linear Models. In bookdown.org. https://bookdown.org/pingapang9/linear_models_bookdown/logistic.html
Bluecology blog. (2018, April 13). How do I interpret the AIC | R-bloggers. https://www.r-bloggers.com/2018/04/how-do-i-interpret-the-aic/
Castro, H. M., & Ferreira, J. C. (2022). Linear and logistic regression models: when to use and how to interpret them? Jornal Brasileiro de Pneumologia, 48(6), e20220439. https://doi.org/10.36416/1806-3756/e20220439
CDC. (2024, May 15). Diabetes Basics. Www.cdc.gov. https://www.cdc.gov/diabetes/about/index.html
Centers for Disease Control and Prevention. (2024, May 15). Diabetes Risk Factors. Diabetes. https://www.cdc.gov/diabetes/risk-factors/index.html
Chugh, V. (2023, March). Logistic Regression in R Tutorial. Www.datacamp.com. https://www.datacamp.com/tutorial/logistic-regression-R
Frost, J. (2017). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Statistics by Jim. https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
Rumella. (2020, April 30). Help with understanding logistic regression’s $R^2$ value. Cross Validated. https://stats.stackexchange.com/questions/463739/help-with-understanding-logistic-regressions-r2-value
sape research group. (2026). ggplot2 Quick Reference: geom_jitter | Software and Programmer Efficiency Research Group. Inf.usi.ch. https://sape.inf.usi.ch/quick-reference/ggplot2/geom_jitter.html
Schork, J. (2022, December 9). How to Disable Hover Information in plotly Graph in R (Example). Statistics Globe. https://statisticsglobe.com/disable-hover-information-plotly-r
Schrader, R. (n.d.). Lecture 5: A Taste of Model Selection for Multiple Regression. Department of Mathematics & Statistics at the University of New Mexico. Retrieved May 10, 2026, from https://math.unm.edu/~schrader/biostat/bio2/notes/splec5b.pdf
Setting the Font, Title, Legend Entries, and Axis Titles. (n.d.). Plotly.com. https://plotly.com/python/figure-labels/
Sievert, C. (2019, December 19). 24 Improving performance | Interactive web-based data visualization with R, plotly, and shiny. Plotly-R.com. https://plotly-r.com/performance
Starbucks. (2022, July 17). R - Replace specific value contents with NA [duplicate]. Stack Overflow. https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na
user8229029. (2020, October 16). How to work with huge data in R Shiny and plotly? Stack Overflow. https://stackoverflow.com/questions/64392725/how-to-work-with-huge-data-in-r-shiny-and-plotly
Vandormael, A. (2022, November 11). Variable selection methods for linear regression in R. Netlify.app. https://alain-vandormael.netlify.app/post/varselect/
World Health Organization. (2024, November 14). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes
Zach. (2022, August 11). How to Suppress Warnings in R (With Examples). Statology. https://www.statology.org/suppress-warnings-in-r/
Zlotek, J., & UChicago Medicine AdventHealth. (2024, October 27). Understanding Diabetes: The Importance of Early Detection and Management. UChicago Medicine AdventHealth. https://www.uchicagomedicineadventhealth.org/blog/understanding-diabetes-importance-early-detection-and-management