Introduction

This week's data dive explores the relationship between selected health indicators and diabetes occurrence using a logistic regression model. The binary target variable is “Diabetes_binary”, which indicates whether an individual has been diagnosed with diabetes (1) or not (0). We will use four explanatory variables: HighBP, HighChol, BMI, and Smoker.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
# dplyr and ggplot2 are already attached by tidyverse
library(broom)

Data Preparation

First, we load the dataset.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
# Select explanatory variables and target variable
X <- dataset %>% select(HighBP, HighChol, BMI, Smoker)
Y <- dataset$Diabetes_binary
# Fit a logistic regression model
model <- glm(Diabetes_binary ~ HighBP + HighChol + BMI + Smoker, data = dataset, family = binomial)
# Summary of the model
summary(model)
## 
## Call:
## glm(formula = Diabetes_binary ~ HighBP + HighChol + BMI + Smoker, 
##     family = binomial, data = dataset)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.592869   0.044500  -80.74   <2e-16 ***
## HighBP       1.232811   0.017721   69.57   <2e-16 ***
## HighChol     0.817665   0.017422   46.93   <2e-16 ***
## BMI          0.079131   0.001422   55.65   <2e-16 ***
## Smoker       0.222751   0.017103   13.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 98000  on 70691  degrees of freedom
## Residual deviance: 81142  on 70687  degrees of freedom
## AIC: 81152
## 
## Number of Fisher Scoring iterations: 4

Logistic Regression Model

The logistic regression model was fitted to predict the probability of having diabetes based on the selected health indicators. Below are the coefficients from the model:

# Extract coefficients and profile-likelihood confidence intervals
coefs <- tidy(model)
# suppressMessages() hides the "Waiting for profiling to be done..." message
conf_int <- suppressMessages(confint(model))
# Combine coefficients with confidence intervals
results <- coefs %>%
  mutate(CI_lower = conf_int[, 1], CI_upper = conf_int[, 2])
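
To make the link between the fitted coefficients and a predicted probability concrete, the sketch below applies the inverse logit to one hypothetical individual. The coefficient values are copied from the summary output; the profile itself (high blood pressure, high cholesterol, BMI of 30, non-smoker) is an assumption chosen only for illustration.

```r
# Coefficients copied from summary(model) above
beta <- c(Intercept = -3.592869, HighBP = 1.232811, HighChol = 0.817665,
          BMI = 0.079131, Smoker = 0.222751)

# Hypothetical individual (illustrative values, not taken from the dataset)
person <- c(Intercept = 1, HighBP = 1, HighChol = 1, BMI = 30, Smoker = 0)

# Linear predictor on the log-odds scale, then inverse logit
log_odds <- sum(beta * person)
prob <- plogis(log_odds)

round(c(log_odds = log_odds, probability = prob), 3)
```

With the fitted object in the session, predict(model, newdata = ..., type = "response") on a one-row data frame with the same values should give the same probability.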

Using the Standard Error for at least one coefficient, build a confidence interval (Display Results)

# Display results in a table
knitr::kable(results, caption = "Coefficients and Confidence Intervals")
Coefficients and Confidence Intervals

term          estimate     std.error   statistic   p.value   CI_lower     CI_upper
(Intercept)   -3.5928694   0.0445003   -80.73803   0         -3.6803289   -3.5058895
HighBP         1.2328106   0.0177206    69.56919   0          1.1980991    1.2675634
HighChol       0.8176648   0.0174216    46.93389   0          0.7835261    0.8518182
BMI            0.0791309   0.0014220    55.64756   0          0.0763510    0.0819252
Smoker         0.2227507   0.0171030    13.02407   0          0.1892304    0.2562735
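
The intervals in the table come from confint(), which profiles the likelihood. A standard-error-based (Wald) interval, estimate ± z × SE, can also be built by hand. The sketch below does this for HighBP using the estimate and standard error from the table; with a sample this large, the two approaches should agree closely.

```r
# Estimate and standard error for HighBP, copied from the table above
estimate  <- 1.2328106
std_error <- 0.0177206

# 95% Wald interval: estimate +/- z * SE, with z the 97.5% normal quantile
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)

round(wald_ci, 4)
```

With the fitted object in the session, confint.default(model) returns Wald intervals for every coefficient at once.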

Interpretation of Coefficients

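Each coefficient represents the change in the log odds of having diabetes for a one-unit increase in the corresponding explanatory variable, holding the other variables constant. The estimates are 1.233 for HighBP, 0.818 for HighChol, 0.079 per unit of BMI, and 0.223 for Smoker. All four are positive, so each is associated with higher odds of diabetes.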
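
Log odds are hard to read directly, so a common step is to exponentiate the coefficients into odds ratios. The sketch below does this with the estimates copied from the model output. The 5-unit BMI change is an arbitrary illustrative choice, not something taken from the analysis.

```r
# Log-odds coefficients copied from the model summary
log_odds_coef <- c(HighBP = 1.2328106, HighChol = 0.8176648,
                   BMI = 0.0791309, Smoker = 0.2227507)

# Exponentiate to get the odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(log_odds_coef)
round(odds_ratios, 2)

# For a continuous predictor, a k-unit change multiplies the odds by exp(k * beta);
# k = 5 BMI units is an illustrative assumption
bmi_5_units <- exp(5 * log_odds_coef[["BMI"]])
round(bmi_5_units, 2)
```

An odds ratio above 1 means higher odds of diabetes for a one-unit increase in that predictor, with the other predictors held fixed.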

Confidence Intervals for Coefficients

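The confidence intervals give a range of coefficient values consistent with the data at a chosen confidence level (95% in this case). For example, the interval for HighBP runs from 1.198 to 1.268 on the log-odds scale. None of the four predictor intervals contains zero, which agrees with the small p-values in the model summary.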
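
As a rough check, the interval bounds can be tested against zero and then moved to the odds-ratio scale by exponentiating them. The sketch below uses the bounds copied from the results table; the data frame and column names other than CI_lower and CI_upper are introduced here for illustration.

```r
# 95% interval bounds on the log-odds scale, copied from the results table
ci <- data.frame(
  term     = c("HighBP", "HighChol", "BMI", "Smoker"),
  CI_lower = c(1.1980991, 0.7835261, 0.0763510, 0.1892304),
  CI_upper = c(1.2675634, 0.8518182, 0.0819252, 0.2562735)
)

# An interval that excludes 0 on the log-odds scale excludes 1 on the odds-ratio scale
ci$excludes_zero <- ci$CI_lower > 0 | ci$CI_upper < 0

# Exponentiate the bounds to get intervals for the odds ratios
ci$OR_lower <- exp(ci$CI_lower)
ci$OR_upper <- exp(ci$CI_upper)

ci
```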

Conclusion

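The logistic regression model shows that high blood pressure, high cholesterol, BMI, and smoking are all statistically significant predictors of diabetes in this dataset. Each coefficient is positive with a p-value below 2e-16, and HighBP has the largest estimate (1.233).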