Overview
This dataset is intended to help identify predictors of diabetes, so that medical professionals can recognize patients with potential risk factors. It contains patient demographics, such as age and gender, as well as medical information, including blood glucose levels and hypertension status. The observations were obtained from research studies and healthcare institutions, and the dataset itself was obtained via Kaggle.
Mustafa, T. Z. (2021). Diabetes Prediction Dataset. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Downloads")
data <- read.csv(file= 'diabetes_prediction_dataset.csv')
str(data)
## 'data.frame': 100000 obs. of 9 variables:
## $ gender : chr "Female" "Female" "Male" "Female" ...
## $ age : num 80 54 28 36 76 20 44 79 42 32 ...
## $ hypertension : int 0 0 0 0 1 0 0 0 0 0 ...
## $ heart_disease : int 1 0 0 0 1 0 0 0 0 0 ...
## $ smoking_history : chr "never" "No Info" "never" "current" ...
## $ bmi : num 25.2 27.3 27.3 23.4 20.1 ...
## $ HbA1c_level : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
## $ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
## $ diabetes : int 0 0 0 0 0 0 1 0 0 0 ...
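The `str()` output above lists `hypertension`, `heart_disease`, and `diabetes` as 0/1 integers, even though they are categorical indicators. One possible way to label such columns is base R's `factor()`. The sketch below is an illustration under that assumption; `to_yes_no` is a hypothetical helper name and the input vector is made up for the example.

```r
# Convert a 0/1 indicator vector into a factor labelled "No"/"Yes".
# `to_yes_no` is a hypothetical helper name, not part of the original analysis.
to_yes_no <- function(x) {
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Toy input for illustration only; these values are not taken from the dataset.
flags <- c(0L, 1L, 0L, 0L, 1L)
labelled <- to_yes_no(flags)
print(table(labelled))
```

On the real data this could be applied as `data$diabetes <- to_yes_no(data$diabetes)`. A factor keeps the level order fixed, so "No" is listed before "Yes" in legends, and any value other than 0 or 1 becomes `NA` rather than passing through silently.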
library(dplyr)
# Relabel the 0/1 indicator columns so the axis and legend read "Yes"/"No"
data$diabetes <- recode(data$diabetes, "1" = "Yes", "0" = "No")
data$hypertension <- recode(data$hypertension, "1" = "Yes", "0" = "No")
# Side-by-side bars: patient counts by hypertension status, split by diabetes
ggplot(data, aes(x = hypertension, fill = diabetes)) +
  geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
  labs(title = "Risk of Diabetes by Hypertension", x = "Hypertension",
       y = "Frequency", fill = "Diabetes")
Comparison of Two Categorical Variables
The purpose of this graph is to show the relationship between hypertension and diabetes. For this graph, I changed the value 1 in both the diabetes and hypertension variables to “Yes” and the value 0 to “No.” These integers represent categorical variables, with 1 corresponding to being positive for hypertension or diabetes and 0 corresponding to being negative. Based on the graph, there appears to be no clear correlation between having diabetes and having hypertension, with most patients in the sample having neither diabetes nor hypertension. Read this way, hypertension would not be a good predictor of diabetes in patients, although the bars show raw counts rather than rates within each group.
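Because the bar chart shows raw counts, and the two hypertension groups may differ greatly in size, one way to check the reading above is to compare the share of patients with diabetes within each hypertension group. The following is a minimal sketch of that check, not part of the original analysis. `diabetes_share` and the `toy` data frame are hypothetical names introduced here, and the toy rows are invented purely to show the mechanics.

```r
# Share of patients with and without diabetes within each hypertension group.
# `df` is assumed to have columns `hypertension` and `diabetes` coded
# "Yes"/"No", as produced by the recoding step in the chunk above.
diabetes_share <- function(df) {
  counts <- table(hypertension = df$hypertension, diabetes = df$diabetes)
  prop.table(counts, margin = 1) # row-wise proportions: each row sums to 1
}

# Toy rows for illustration only; these are NOT values from the Kaggle data.
toy <- data.frame(
  hypertension = c("No", "No", "No", "No", "Yes", "Yes"),
  diabetes     = c("No", "No", "No", "Yes", "No", "Yes")
)
shares <- diabetes_share(toy)
print(round(shares, 2))
```

On the real data, `diabetes_share(data)` could be run after the recoding step. If the "Yes" column differs noticeably between the two rows, that would suggest an association that the count-based chart can hide. `chisq.test(table(data$hypertension, data$diabetes))` from base R would be one way to test it formally.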
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
interact <- ggplot(data, aes(x = blood_glucose_level, y = HbA1c_level, group = diabetes, color = diabetes)) +
  geom_line(linewidth = 1) + # Lines representing interaction; `linewidth` replaces the deprecated `size`
  geom_point(size = 2) + # Points for data
  labs(
    title = "Interaction of Blood Glucose and HbA1c Levels",
    x = "Blood Glucose Level",
    y = "HbA1c Level",
    color = "Presence of Diabetes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 12),
    legend.position = "top"
  )
ggplotly(interact)