Data Dive — GLMs

Introduction

This week's data dive explores the relationship between selected health indicators and diabetes occurrence using a logistic regression model. The binary target variable is “Diabetes_binary”, which indicates whether an individual has been diagnosed with diabetes (1) or not (0). We will use four explanatory variables:

HighBP (High Blood Pressure)
HighChol (High Cholesterol)
BMI (Body Mass Index)
Smoker (Smoking status)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

# dplyr and ggplot2 are already attached by library(tidyverse) above
library(broom)

Data Preparation

First, we load the dataset.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")

## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dataset)

# Select explanatory variables and target variable
X <- dataset %>% select(HighBP, HighChol, BMI, Smoker)
Y <- dataset$Diabetes_binary

# Fit a logistic regression model
model <- glm(Diabetes_binary ~ HighBP + HighChol + BMI + Smoker, data = dataset, family = binomial)

# Summary of the model
summary(model)

## 
## Call:
## glm(formula = Diabetes_binary ~ HighBP + HighChol + BMI + Smoker, 
##     family = binomial, data = dataset)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.592869   0.044500  -80.74   <2e-16 ***
## HighBP       1.232811   0.017721   69.57   <2e-16 ***
## HighChol     0.817665   0.017422   46.93   <2e-16 ***
## BMI          0.079131   0.001422   55.65   <2e-16 ***
## Smoker       0.222751   0.017103   13.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 98000  on 70691  degrees of freedom
## Residual deviance: 81142  on 70687  degrees of freedom
## AIC: 81152
## 
## Number of Fisher Scoring iterations: 4
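
The deviance figures in the summary above can be turned into a quick overall check of the model. The following is a minimal sketch, not part of the original analysis: it hard-codes the rounded deviances and degrees of freedom printed by `summary(model)`, computes a likelihood-ratio statistic against the intercept-only model, and derives McFadden's pseudo R-squared. The variable names are illustrative.

```r
# Values copied from the summary(model) output (deviances are rounded there).
null_deviance     <- 98000
residual_deviance <- 81142
df_null           <- 70691
df_residual       <- 70687

# Likelihood-ratio test: drop in deviance compared with a chi-square distribution.
lr_stat <- null_deviance - residual_deviance
df_diff <- df_null - df_residual
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# McFadden's pseudo R-squared; for 0/1 outcomes the deviance equals -2 * log-likelihood.
mcfadden_r2 <- 1 - residual_deviance / null_deviance

print(c(lr_stat = lr_stat, df = df_diff, p_value = p_value, mcfadden_r2 = mcfadden_r2))
```

With the fitted object available, `anova(model, test = "Chisq")` should give the same kind of comparison term by term.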

Logistic Regression Model

The logistic regression model was fitted to predict the probability of having diabetes based on the selected health indicators. Below are the coefficients from the model:

# Extract coefficients and confidence intervals
coefs <- tidy(model)
conf_int <- confint(model)

## Waiting for profiling to be done...

# Combine coefficients with confidence intervals
# (confint() returns its rows in the same term order as tidy(model))
results <- coefs %>%
  mutate(CI_lower = conf_int[,1], CI_upper = conf_int[,2])

Confidence Intervals for the Coefficients (Display Results)

# Display results in a table
knitr::kable(results, caption = "Coefficients and Confidence Intervals")

Coefficients and Confidence Intervals
term	estimate	std.error	statistic	CI_lower	CI_upper
(Intercept)	-3.5928694	0.0445003	-80.73803	-3.6803289	-3.5058895
HighBP	1.2328106	0.0177206	69.56919	1.1980991	1.2675634
HighChol	0.8176648	0.0174216	46.93389	0.7835261	0.8518182
BMI	0.0791309	0.0014220	55.64756	0.0763510	0.0819252
Smoker	0.2227507	0.0171030	13.02407	0.1892304	0.2562735
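
The coefficients in the table can be converted into a predicted probability for a given individual. The sketch below is illustrative only: the profile (high blood pressure, high cholesterol, BMI of 30, smoker) is a hypothetical example, and the coefficients are hard-coded from the table rather than read from the fitted object.

```r
# Coefficient estimates copied from the table (log-odds scale).
beta <- c(intercept = -3.5928694, HighBP = 1.2328106, HighChol = 0.8176648,
          BMI = 0.0791309, Smoker = 0.2227507)

# Hypothetical individual; the leading 1 multiplies the intercept.
profile <- c(intercept = 1, HighBP = 1, HighChol = 1, BMI = 30, Smoker = 1)

# Linear predictor (log odds), then the inverse logit to obtain a probability.
linear_predictor <- sum(beta * profile)
prob <- plogis(linear_predictor)

print(round(c(log_odds = linear_predictor, probability = prob), 3))
```

With the model object in memory, `predict(model, newdata = ..., type = "response")` performs the same calculation. The file name suggests a 50/50 split between diabetic and non-diabetic respondents, so such probabilities presumably describe this balanced sample rather than the general population.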

Interpretation of Coefficients

Each coefficient represents the change in the log odds of having diabetes for a one-unit increase in the corresponding explanatory variable, holding the other variables constant. HighBP, HighChol, and Smoker are 0/1 indicators, so for them a one-unit increase means moving from "no" to "yes":

HighBP: Having high blood pressure (HighBP = 1 versus 0) increases the log odds of having diabetes by approximately 1.23.
HighChol: Having high cholesterol (HighChol = 1 versus 0) increases the log odds of having diabetes by approximately 0.82.
BMI: Each additional BMI point increases the log odds of having diabetes by approximately 0.08.
Smoker: Being a smoker (Smoker = 1 versus 0) increases the log odds of having diabetes by approximately 0.22.
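
Log odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The following is a minimal sketch using the estimates from the model output, hard-coded so that it runs on its own. The 5-point BMI comparison is an arbitrary illustrative choice.

```r
# Coefficient estimates (log-odds scale) copied from the model output.
log_odds <- c(HighBP = 1.2328106, HighChol = 0.8176648,
              BMI = 0.0791309, Smoker = 0.2227507)

# exp() of a log-odds coefficient is the multiplicative change in the odds.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))

# For a multi-unit change, scale the coefficient before exponentiating.
bmi_5_or <- exp(5 * log_odds[["BMI"]])
print(round(bmi_5_or, 2))
```

An odds ratio above 1 means higher odds of diabetes. With the fitted object available, `exp(coef(model))` gives the same quantities, including the intercept.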

Confidence Intervals for Coefficients

The confidence intervals provide a range within which we expect the true coefficient to lie with a certain level of confidence (95% in this case). For example:

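The confidence interval for HighBP is [1.198, 1.268], meaning we are 95% confident that the true coefficient for high blood pressure (on the log-odds scale) lies within this range. Because the interval lies entirely above zero, the positive association with diabetes is statistically significant at the 5% level.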
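
A confidence interval can also be built by hand from a coefficient's standard error. The sketch below computes a Wald interval (estimate plus or minus z times the standard error) for HighBP, using the estimate and standard error from the results table. Note that `confint()` on a `glm` object reports profile-likelihood intervals, so the Wald limits should be close to, but not identical with, the tabulated ones.

```r
# HighBP estimate and standard error copied from the results table.
estimate  <- 1.2328106
std_error <- 0.0177206

# Two-sided 95% interval: critical value from the standard normal distribution.
z <- qnorm(0.975)
wald_ci <- estimate + c(lower = -1, upper = 1) * z * std_error
print(round(wald_ci, 4))

# The same interval on the odds-ratio scale.
or_ci <- exp(wald_ci)
print(round(or_ci, 2))
```

With the fitted object available, `confint.default(model)` returns Wald intervals for all coefficients at once.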

Conclusion

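The logistic regression model shows that high blood pressure, high cholesterol, BMI, and smoking are all statistically significant predictors of diabetes (p < 2e-16 for each), and each is positively associated with the outcome:

High blood pressure and high cholesterol have the largest coefficients, so managing them could be a crucial step in preventing diabetes.
The positive association between smoking and diabetes suggests that smoking cessation programs could also help reduce diabetes incidence, although this observational model shows association rather than causation.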