Analyzing the Correlation between the Number of Medications and Time in Hospital

Research question: Can you predict the number of medications someone needs given the time in hospital?

Dataset

Original source: https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008

Total Number of Variables: 50

Categorical Variables (13): Race, Gender, Age, Weight, Admission Type, Discharge Disposition, Admission Source, Payer Code, Medical Specialty, Diagnostic 1, 2, 3, A1cresult, Metformin

Numerical Variables (37): Time in Hospital, Number of Lab Procedures, Number of Procedures, Number of Medications, Number of Outpatients, Number of Emergencies, Number_Inpatient, Number_Diagnoses, etc.

A full extensive list of the variables can be seen on the website.

Introduction/Background

I chose this dataset because I was interested in diabetes medication and the impact that it can have on patients. My dad is pre-diabetic and has to take a lot of medicine to manage blood sugar levels. Because I have a personal connection to this disease, I wanted to learn more about it through this project and analyze diabetes data. Diabetes is also extremely prevalent, with ~12% of the U.S. population having diabetes, according to the CDC. This means that if medication expenses are high, millions of Americans will not be able to treat their disease. This is an extremely important issue, and I wanted to understand more about this topic. I found this dataset by searching for diabetes datasets, and this dataset was the first one from a very reliable sources that I could use for this project.

I used a few variables out of the extensive 50 variables in this dataset. Before creating my linear regression model or plotting, I first replaced all “?” values to na values using the mutate function after extensive research online after importing my dataset from the online website. To create my linear regression model, I included the numerical variables num_medications (my dependent variable), time_in_hospital, num_lab_procedures, and number_diagnoses. In order to plot the data in an efficient manner, I created a summary dataset that captured the variables I wanted to capture by grouping the data by age (a categorical variable used to change the colors) and time in hospital (my independent variables), then summarized the dataset using the summarize() function to find the mean num_medication for each group. I ultimately decided to use highcharter since I believed that the highchart line graph would be able to show the trends of the data very well.

Importing Libraries + Dataframe

#loading libraries that I will be using in 
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(RColorBrewer)

#importing the dataframe in 
df <- read_csv("diabetic_data.csv")

Rows: 101766 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): race, gender, age, weight, payer_code, medical_specialty, diag_1, ...
dbl (13): encounter_id, patient_nbr, admission_type_id, discharge_dispositio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

#when viewing the dataset, all na values are set to ? 
#substituting all ? values to na: 
df <- df %>% mutate(across(where(is.character), ~na_if(., "?"))) #source: https://stackoverflow.com/questions/49457877/r-replace-specific-value-contents-with-na 
view(df)

Linear Regression

model <- lm(num_medications ~ time_in_hospital + num_lab_procedures + number_diagnoses, data = df)

#justification using backward elimination: 
backwards_model <- step(model, direction = "backward") #source: https://alain-vandormael.netlify.app/post/varselect/

Start:  AIC=396311.4
num_medications ~ time_in_hospital + num_lab_procedures + number_diagnoses

                     Df Sum of Sq     RSS    AIC
<none>                            4998723 396311
- num_lab_procedures  1     84601 5083325 398017
- number_diagnoses    1    155736 5154459 399432
- time_in_hospital    1    905369 5904093 413250

summary(backwards_model)


Call:
lm(formula = num_medications ~ time_in_hospital + num_lab_procedures + 
    number_diagnoses, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.246  -4.634  -0.927   3.526  59.224 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.305222   0.094241   45.68   <2e-16 ***
time_in_hospital   1.072206   0.007898  135.76   <2e-16 ***
num_lab_procedures 0.049085   0.001183   41.50   <2e-16 ***
number_diagnoses   0.658510   0.011695   56.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.009 on 101762 degrees of freedom
Multiple R-squared:  0.2564,    Adjusted R-squared:  0.2564 
F-statistic: 1.17e+04 on 3 and 101762 DF,  p-value: < 2.2e-16

When you look at the backwards elimination model, you can see that the AIC and Residual Sum of Squared (RSS) increases if a variable is removed, and the p-values are highly significant.

By looking at results of the linear regression model the equation is:

num_medications = 1.099792 * time_in_hospital + 0.053965 * num_lab_procedures + 0.632820 * number_diagnoses + 4.138114

Cleaning and Summary Dataset

#because i have over 100,000 entries, i need to make a summary dataset that captures the statistics that i need 
df_plot <- df %>%
  group_by(time_in_hospital) %>% #grouping by time_in_hospital, our dependent variable %>% 
  summarise(
    avg_meds = mean(num_medications),
    .groups = "drop" #to allow plotting to work
  )
view(df_plot)

Exploration Plots

#I wanted to first see the relationship between the number of medications and the time in hospital without grouping by age: 

ggplot(data = df_plot, aes(x = time_in_hospital, y = avg_meds)) + 
  geom_point() + 
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As you can see, the points display a really nice positive trend.

#I also wanted to test the other variables that I included in the linear regression model, but to do that I have to create another dataset. I removed 
df_plot_num_lab_procedures <- df %>% #using similar method as df_plot, just changing the independent variable 
  group_by(num_lab_procedures) %>%
  summarise(
    avg_meds = mean(num_medications),
    .groups = "drop" #to allow plotting to work
  )
view(df_plot_num_lab_procedures)

df_plot_number_diagnoses <- df %>%
  group_by(number_diagnoses) %>%
  summarise(
    avg_meds = mean(num_medications),
    .groups = "drop" #to allow plotting to work
  )
view(df_plot_number_diagnoses)

#now, I want to plot the graphs: 
ggplot(data = df_plot_num_lab_procedures, aes(x = num_lab_procedures, y = avg_meds)) + 
  geom_point() + 
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There definitely still is a very strong positive trend, but it seems to be less linear than the previous graph. The data is also much more varied a the number of lab procedures increases, which makes sense because it is rare to have to many of them. This means that there is a higher error interval, and thus could account for the less linear pattern.

ggplot(data = df_plot_number_diagnoses, aes(x = number_diagnoses, y = avg_meds)) + 
  geom_point() + 
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The number_diagnoses graph has a much less consistent trend, especially with a dip as the number_diagnoses approaches 11. This could be attributed to the fact that the number_diagnoses plot does not have as many points, and thus could have more varied patterns.

Diagnostics Plots

I ultimately decided to plot the relationship between the number of medications and the time in hospital, as it had shown the most linear and consistent pattern. I chose to use age as the categorical factor to include color into my graph.

Cleaning for Plots

To include the age as a factor, I have to create a separate dataset that also groups by age as well as the dependent variable (in this case it’s time_in_hospital)

df_plot <- df %>%
  filter(!is.na(age)) %>% #filtering out na_ages to make the plot work
  group_by(time_in_hospital, age) %>% #grouping by ages for categorical data for highchart (will be used later)
  summarise(
    avg_meds = mean(num_medications),
    .groups = "drop" #to allow plotting to work
  )
view(df_plot)

To create the actual graph, I decided to use highcharter since I feel like it has the most customizability and because I really enjoyed learning about it during class. I chose to use a line graph since the independent variable is continuous and can show trends over the time in hospital.

#first, I'm going to create a vector that will contain the colors I want to use in my chart: 
cols <- brewer.pal(10, "Spectral")

highchart() %>%
  hc_add_series(df_plot, "line", hcaes(x = time_in_hospital, y = avg_meds, group = age)) %>%
  hc_colors(cols) %>% #changing from default colors 
  hc_title(text = "Time in Hospital vs. Number of Medications for Diabetes Cases in Hospitals (1999-2008)") %>% #adding a detailed title 
  hc_xAxis(title = list(text = "Days in Hospital")) %>% #adding axis labels
  hc_yAxis(title = list(text = "Average Medications")) %>%
  hc_legend(
    enabled = TRUE,
    title = list(text = "Age Group")
  ) %>%
  hc_caption(
    text = "Data Source: UCI Machine Learning Repository - Diabetes 130-US Hospitals (1999-2008)"
  ) %>% 
  hc_add_theme(hc_theme_flat()) #changing from default theme

The graph displays a constant positive trend, as the average medications increase as the days in hospital increase.

It is also interesting that the line for average medications for the age group from 10-20 is lower than the rest of the age groups.

Because of how significant the graph is, I wanted to also make another graph:

df_plot1 <- df %>%
  filter(!is.na(age)) %>% #filtering out na_ages to make the plot work
  group_by(num_lab_procedures, age) %>%
  summarise(
    meds = mean(num_medications),
    .groups = "drop" #to allow highchart to work
  )

highchart() %>%
  hc_add_series(df_plot1, "line", hcaes(x = num_lab_procedures, y = meds, group = age)) %>%
  hc_title(text = "# Lab Procedures vs. Number of Medications") %>% 
  hc_xAxis(title = list(text = "# Lab Procedures")) %>%
  hc_yAxis(title = list(text = "Average Medications")) %>%
  hc_legend(
    enabled = TRUE,
    title = list(text = "Age Group")
  ) %>%
  hc_caption(
    text = "Data Source: UCI Machine Learning Repository - Diabetes 130-US Hospitals (1999-2008)"
  ) %>% 
  hc_add_theme(hc_theme_flat())

This graph contains much more noise, as there are a lot more values for the number of lab procedures. This makes the graph above less effective than the previous graph.

Conclusion: There is a high correlation between the time in hospital and the average number of medications. This can be seen through a graph that showcases the positive and consistence linear trend. In the future, I would like to be able to analyze more data from this dataset, as the dataset contains over 50 variables and thus opens a ton of possibilities. I hope to be able to obtain a computer that is able to process all 100,000 entries and be able to see what the graphs will look like someday.