Analysis of Diabetes Prediction Dataset

Overview

The purpose of this dataset was to help find predictors of diabetes. The goal is to help medical professionals identify patients with potential risk factors of diabetes.The dataset contains information on patient demographics such as age and gender, as well as medical information including blood glucose levels and hypertension. The observations in this dataset were obtained from research studies and healthcare institutions. The dataset was obtained via Kaggle.

Mustafa, T. Z. (2021). Diabetes Prediction Dataset. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("~/Downloads")
data <- read.csv(file= 'diabetes_prediction_dataset.csv')

str(data)

## 'data.frame':    100000 obs. of  9 variables:
##  $ gender             : chr  "Female" "Female" "Male" "Female" ...
##  $ age                : num  80 54 28 36 76 20 44 79 42 32 ...
##  $ hypertension       : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ heart_disease      : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ smoking_history    : chr  "never" "No Info" "never" "current" ...
##  $ bmi                : num  25.2 27.3 27.3 23.4 20.1 ...
##  $ HbA1c_level        : num  6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
##  $ blood_glucose_level: int  140 80 158 155 155 85 200 85 145 100 ...
##  $ diabetes           : int  0 0 0 0 0 0 1 0 0 0 ...

library(dplyr)
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")

ggplot(data, aes(x = hypertension, fill = diabetes)) +
  geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
  labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
        fill= "Diabetes")

Comparison of Two Categorical Variables

The purpose of this graph is to show the relationship between having hypertension and diabetes. For this graph, I changed the values of 1 in both the diabetes and hypertension variable to “Yes” and the value 0 to “No.” These integers were meant to represent categorical variables with 1 correlating to being positive for hypertension or diabetes and 0 correlating to being negative to hypertension or diabetes. Based on our graph, we can see that there is no correlation between having diabetes and having hypertension, with most patients within our sample having neither diabetes or hypertension. Thus, hypertension would not be a good predictor of diabetes within patients.

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
  geom_line(size = 1) +  # Lines representing interaction
  geom_point(size = 2) + # Points for data
  labs(
    title = "Interaction of Blood Glucose and HbA1c Levels",
    x = "Blood Glucose Level",
    y = "HbA1c Level",
    color = "Presence of Diabetes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 12),
    legend.position = "top"
  )

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplotly(interact)

Comparison of Two Numerical Variables

The two variables being compared here are Blood Glucose Level and HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level is a measure of average blood sugar levels over 2-3 months and is measured in %. Blood glucose level is the amount of blood sugar in a patients blood at a given time. Our plot shows a correlation between high levels in HbA1c and blood glucose levels and it’s relationship with diabetes. This plot suggests patients with high levels of HbA1c and blood glucose levels are at high risk for diabetes.

knitr::opts_chunk$set(echo = TRUE)

 boxplot(data$age ~ data$diabetes,
         xlab="Patient Age",
         ylab="Presence of Diabetes",
         col = c("skyblue", "purple"),
         main="Visualization of Diabetes by Age",
         cex.main = 1.1,
         col.main = "navy")

Comparison of a Numerical and Categorical Variable

This chart shows the relationship between age and diabetes diagnosis. This chart shows that as patients grow older they are at higher risk for diabetes. This would make age a predictor for diabetes, with the average age for those without diabetes being 40 years and 60 years for those with diabetes. However, we can see that we do have a few outliers for those with diabetes with some patients being diagnosed at 20 or younger. This could be from the population of people with Type I diabetes which is usually diagnosed in patients under 20. Our dataset does not distinguish between type I and type II diabetes.

Conclusion

Diabetes can be predicted using age, blood glucose levels, and HbA1c levels in patients. However, our results showed hypertension is not a good predictor of diabetes. Based on our findings, it would be interesting to look at factors that occur once a patient becomes older that could lead to diabetes at an older age. As well, diet could be another factor to look into, such as grams of sugar consumed on a daily or weekly basis, to see if sugar consumption can predict diabetes. Another important thing to analyze in future studies would be Type I versus Type II diabetes, as it could be the reason for some of the outliers in the boxplot.

Analysis of Diabetes Prediction Dataset

Gabriella Khalil

2025-02-05