Introduction
This week's data dive explores the relationship between selected health indicators and diabetes occurrence using a logistic regression model. The binary target variable is “Diabetes_binary”, which indicates whether an individual has been diagnosed with diabetes (1) or not (0). We use four explanatory variables:
HighBP (High Blood Pressure)
HighChol (High Cholesterol)
BMI (Body Mass Index)
Smoker (Smoking status)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(dplyr)
library(ggplot2)
library(broom)
Data Preparation
First, we load the dataset.
dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
# Select explanatory variables and target variable
X <- dataset %>% select(HighBP, HighChol, BMI, Smoker)
Y <- dataset$Diabetes_binary
# Fit a logistic regression model
model <- glm(Diabetes_binary ~ HighBP + HighChol + BMI + Smoker, data = dataset, family = binomial)
# Summary of the model
summary(model)
##
## Call:
## glm(formula = Diabetes_binary ~ HighBP + HighChol + BMI + Smoker,
## family = binomial, data = dataset)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.592869 0.044500 -80.74 <2e-16 ***
## HighBP 1.232811 0.017721 69.57 <2e-16 ***
## HighChol 0.817665 0.017422 46.93 <2e-16 ***
## BMI 0.079131 0.001422 55.65 <2e-16 ***
## Smoker 0.222751 0.017103 13.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98000 on 70691 degrees of freedom
## Residual deviance: 81142 on 70687 degrees of freedom
## AIC: 81152
##
## Number of Fisher Scoring iterations: 4
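To make the fitted equation concrete, the sketch below plugs the rounded coefficients from the summary output into the logistic function for one hypothetical individual (high blood pressure, high cholesterol, BMI of 30, non-smoker). The profile values are illustrative assumptions, not a row from the dataset. Because the file name indicates a 50/50 split sample, the resulting probability describes this balanced sample rather than population prevalence.

```r
# Coefficients copied (rounded) from the summary(model) output above
b <- c("(Intercept)" = -3.592869, HighBP = 1.232811, HighChol = 0.817665,
       BMI = 0.079131, Smoker = 0.222751)

# Hypothetical individual (assumed values, for illustration only);
# the leading 1 multiplies the intercept
x <- c("(Intercept)" = 1, HighBP = 1, HighChol = 1, BMI = 30, Smoker = 0)

# Linear predictor on the log-odds scale
eta <- sum(b * x)

# plogis() is the inverse logit: 1 / (1 + exp(-eta))
p <- plogis(eta)
round(c(log_odds = eta, probability = p), 3)
```

For this profile the predicted probability is roughly 0.70. With the fitted object available, `predict(model, newdata = ..., type = "response")` should give the same value up to rounding.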
Logistic Regression Model
The logistic regression model was fitted to predict the probability of having diabetes based on the selected health indicators. Below are the coefficients from the model:
# Extract coefficients and profile-likelihood confidence intervals
coefs <- tidy(model)
# suppressMessages() hides the "Waiting for profiling to be done..." note
conf_int <- suppressMessages(confint(model))
# Combine coefficients with confidence intervals
results <- coefs %>%
  mutate(CI_lower = conf_int[, 1], CI_upper = conf_int[, 2])
Using the Standard Error for at least one coefficient, build a confidence interval (Display Results)
# Display results in a table
knitr::kable(results, caption = "Coefficients and Confidence Intervals")
term | estimate | std.error | statistic | p.value | CI_lower | CI_upper |
---|---|---|---|---|---|---|
(Intercept) | -3.5928694 | 0.0445003 | -80.73803 | 0 | -3.6803289 | -3.5058895 |
HighBP | 1.2328106 | 0.0177206 | 69.56919 | 0 | 1.1980991 | 1.2675634 |
HighChol | 0.8176648 | 0.0174216 | 46.93389 | 0 | 0.7835261 | 0.8518182 |
BMI | 0.0791309 | 0.0014220 | 55.64756 | 0 | 0.0763510 | 0.0819252 |
Smoker | 0.2227507 | 0.0171030 | 13.02407 | 0 | 0.1892304 | 0.2562735 |
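The intervals in the table come from `confint()`, which profiles the likelihood rather than using the standard error directly. As a minimal sketch of the standard-error approach, a 95% Wald interval for the HighBP coefficient can be built by hand from the estimate and standard error reported above. The variable names here are illustrative.

```r
# Estimate and standard error for HighBP, copied from the table above
est <- 1.2328106
se  <- 0.0177206

# 95% Wald interval: estimate +/- z * SE, with z = qnorm(0.975) (about 1.96)
z <- qnorm(0.975)
wald_ci <- est + c(-1, 1) * z * se
round(wald_ci, 4)
```

The result is roughly 1.198 to 1.268, which matches the profile-likelihood interval in the table to about three decimal places, as expected with a sample this large. For all coefficients at once, `confint.default(model)` computes the same kind of Wald interval.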
Interpretation of Coefficients
Each coefficient represents the change in the log odds of having diabetes for a one-unit increase in the corresponding explanatory variable, holding the other variables constant:
HighBP: Having high blood pressure (HighBP = 1 rather than 0) increases the log odds of having diabetes by approximately 1.23.
HighChol: Having high cholesterol (HighChol = 1 rather than 0) increases the log odds of having diabetes by approximately 0.82.
BMI: A one-unit increase in BMI corresponds to an increase of about 0.08 in the log odds of diabetes.
Smoker: Smoking increases the log odds of having diabetes by approximately 0.22.
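Log odds are hard to read directly, so a common next step is to exponentiate the coefficients into odds ratios. The sketch below does this with the estimates copied from the coefficient table; the 5-unit BMI comparison is an illustrative choice, not something taken from the original analysis.

```r
# Log-odds estimates copied from the coefficient table above
log_odds <- c(HighBP = 1.2328106, HighChol = 0.8176648,
              BMI = 0.0791309, Smoker = 0.2227507)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)

# A 5-unit BMI difference multiplies the odds by exp(5 * beta_BMI)
bmi_5 <- exp(5 * log_odds[["BMI"]])
round(bmi_5, 2)
```

The odds ratios come out at about 3.43 for HighBP, 2.27 for HighChol, 1.08 per BMI unit, and 1.25 for Smoker; a 5-unit BMI difference corresponds to odds roughly 1.49 times higher. With the fitted object available, `tidy(model, exponentiate = TRUE, conf.int = TRUE)` from broom should return the same odds ratios together with their intervals.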
Confidence Intervals for Coefficients
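The confidence intervals give a range of coefficient values consistent with the data at a chosen confidence level (95% in this case). For example, the interval for HighBP runs from about 1.20 to 1.27, and the interval for Smoker from about 0.19 to 0.26. None of the four intervals contains zero, which agrees with the very small p-values in the table.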
Conclusion
The logistic regression model shows that high blood pressure, high cholesterol, BMI, and smoking are all statistically significant predictors of diabetes in this dataset, each with a positive coefficient:
Managing high blood pressure and cholesterol could be crucial steps in preventing diabetes.
The positive association between smoking and diabetes suggests that smoking cessation programs could also help reduce diabetes incidence.