Numerical Variables (37): Time in Hospital, Number of Lab Procedures, Number of Procedures, Number of Medications, Number of Outpatients, Number of Emergencies, Number_Inpatient, Number_Diagnoses, etc.
A full extensive list of the variables can be seen on the website.
Introduction/Background
I chose this dataset because I was interested in diabetes medication and the impact that it can have on patients. My dad is pre-diabetic and has to take a lot of medicine to manage blood sugar levels. Because I have a personal connection to this disease, I wanted to learn more about it through this project and analyze diabetes data. Diabetes is also extremely prevalent, with ~12% of the U.S. population having diabetes, according to the CDC. This means that if medication expenses are high, millions of Americans will not be able to treat their disease. This is an extremely important issue, and I wanted to understand more about this topic. I found this dataset by searching for diabetes datasets, and this dataset was the first one from a very reliable sources that I could use for this project.
I used a few variables out of the extensive 50 variables in this dataset. Before creating my linear regression model or plotting, I first replaced all “?” values to na values using the mutate function after extensive research online after importing my dataset from the online website. To create my linear regression model, I included the numerical variables num_medications (my dependent variable), time_in_hospital, num_lab_procedures, and number_diagnoses. In order to plot the data in an efficient manner, I created a summary dataset that captured the variables I wanted to capture by grouping the data by age (a categorical variable used to change the colors) and time in hospital (my independent variables), then summarized the dataset using the summarize() function to find the mean num_medication for each group. I ultimately decided to use highcharter since I believed that the highchart line graph would be able to show the trends of the data very well.
Importing Libraries + Dataframe
#loading libraries that I will be using in library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(RColorBrewer)#importing the dataframe in df <-read_csv("diabetic_data.csv")
Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Cleaning
#when viewing the dataset, all na values are set to ? #substituting all ? values to na: df <- df %>%mutate(across(where(is.character), ~na_if(., "?"))) #source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na view(df)
Linear Regression
model <-lm(num_medications ~ time_in_hospital + num_lab_procedures + number_diagnoses, data = df)#justification using backward elimination: backwards_model <-step(model, direction ="backward") #source: https://alain-vandormael.netlify.app/post/varselect/
Call:
lm(formula = num_medications ~ time_in_hospital + num_lab_procedures +
number_diagnoses, data = df)
Residuals:
Min 1Q Median 3Q Max
-22.246 -4.634 -0.927 3.526 59.224
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.305222 0.094241 45.68 <2e-16 ***
time_in_hospital 1.072206 0.007898 135.76 <2e-16 ***
num_lab_procedures 0.049085 0.001183 41.50 <2e-16 ***
number_diagnoses 0.658510 0.011695 56.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.009 on 101762 degrees of freedom
Multiple R-squared: 0.2564, Adjusted R-squared: 0.2564
F-statistic: 1.17e+04 on 3 and 101762 DF, p-value: < 2.2e-16
When you look at the backwards elimination model, you can see that the AIC and Residual Sum of Squared (RSS) increases if a variable is removed, and the p-values are highly significant.
By looking at results of the linear regression model the equation is:
#because i have over 100,000 entries, i need to make a summary dataset that captures the statistics that i need df_plot <- df %>%group_by(time_in_hospital) %>%#grouping by time_in_hospital, our dependent variable %>% summarise(avg_meds =mean(num_medications),.groups ="drop"#to allow plotting to work )view(df_plot)
Exploration Plots
#I wanted to first see the relationship between the number of medications and the time in hospital without grouping by age: ggplot(data = df_plot, aes(x = time_in_hospital, y = avg_meds)) +geom_point() +geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As you can see, the points display a really nice positive trend.
#I also wanted to test the other variables that I included in the linear regression model, but to do that I have to create another dataset. I removed df_plot_num_lab_procedures <- df %>%#using similar method as df_plot, just changing the independent variable group_by(num_lab_procedures) %>%summarise(avg_meds =mean(num_medications),.groups ="drop"#to allow plotting to work )view(df_plot_num_lab_procedures)df_plot_number_diagnoses <- df %>%group_by(number_diagnoses) %>%summarise(avg_meds =mean(num_medications),.groups ="drop"#to allow plotting to work )view(df_plot_number_diagnoses)
#now, I want to plot the graphs: ggplot(data = df_plot_num_lab_procedures, aes(x = num_lab_procedures, y = avg_meds)) +geom_point() +geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
There definitely still is a very strong positive trend, but it seems to be less linear than the previous graph. The data is also much more varied a the number of lab procedures increases, which makes sense because it is rare to have to many of them. This means that there is a higher error interval, and thus could account for the less linear pattern.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The number_diagnoses graph has a much less consistent trend, especially with a dip as the number_diagnoses approaches 11. This could be attributed to the fact that the number_diagnoses plot does not have as many points, and thus could have more varied patterns.
Diagnostics Plots
I ultimately decided to plot the relationship between the number of medications and the time in hospital, as it had shown the most linear and consistent pattern. I chose to use age as the categorical factor to include color into my graph.
Cleaning for Plots
To include the age as a factor, I have to create a separate dataset that also groups by age as well as the dependent variable (in this case it’s time_in_hospital)
df_plot <- df %>%filter(!is.na(age)) %>%#filtering out na_ages to make the plot workgroup_by(time_in_hospital, age) %>%#grouping by ages for categorical data for highchart (will be used later)summarise(avg_meds =mean(num_medications),.groups ="drop"#to allow plotting to work )view(df_plot)
To create the actual graph, I decided to use highcharter since I feel like it has the most customizability and because I really enjoyed learning about it during class. I chose to use a line graph since the independent variable is continuous and can show trends over the time in hospital.
#first, I'm going to create a vector that will contain the colors I want to use in my chart: cols <-brewer.pal(10, "Spectral")
highchart() %>%hc_add_series(df_plot, "line", hcaes(x = time_in_hospital, y = avg_meds, group = age)) %>%hc_colors(cols) %>%#changing from default colors hc_title(text ="Time in Hospital vs. Number of Medications for Diabetes Cases in Hospitals (1999-2008)") %>%#adding a detailed title hc_xAxis(title =list(text ="Days in Hospital")) %>%#adding axis labelshc_yAxis(title =list(text ="Average Medications")) %>%hc_legend(enabled =TRUE,title =list(text ="Age Group") ) %>%hc_caption(text ="Data Source: UCI Machine Learning Repository - Diabetes 130-US Hospitals (1999-2008)" ) %>%hc_add_theme(hc_theme_flat()) #changing from default theme
The graph displays a constant positive trend, as the average medications increase as the days in hospital increase.
It is also interesting that the line for average medications for the age group from 10-20 is lower than the rest of the age groups.
Because of how significant the graph is, I wanted to also make another graph:
df_plot1 <- df %>%filter(!is.na(age)) %>%#filtering out na_ages to make the plot workgroup_by(num_lab_procedures, age) %>%summarise(meds =mean(num_medications),.groups ="drop"#to allow highchart to work )highchart() %>%hc_add_series(df_plot1, "line", hcaes(x = num_lab_procedures, y = meds, group = age)) %>%hc_title(text ="# Lab Procedures vs. Number of Medications") %>%hc_xAxis(title =list(text ="# Lab Procedures")) %>%hc_yAxis(title =list(text ="Average Medications")) %>%hc_legend(enabled =TRUE,title =list(text ="Age Group") ) %>%hc_caption(text ="Data Source: UCI Machine Learning Repository - Diabetes 130-US Hospitals (1999-2008)" ) %>%hc_add_theme(hc_theme_flat())
This graph contains much more noise, as there are a lot more values for the number of lab procedures. This makes the graph above less effective than the previous graph.
Conclusion: There is a high correlation between the time in hospital and the average number of medications. This can be seen through a graph that showcases the positive and consistence linear trend. In the future, I would like to be able to analyze more data from this dataset, as the dataset contains over 50 variables and thus opens a ton of possibilities. I hope to be able to obtain a computer that is able to process all 100,000 entries and be able to see what the graphs will look like someday.