Introduction

This week’s data dive extends our work on linear models to generalized linear models (GLMs). The goal is to build a logistic regression model, a GLM with a binomial family, that predicts the probability of having diabetes from several health-related variables. We will also diagnose the model, highlight any issues, and interpret one of the coefficients.

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.2
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.2
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
# View data structure (note: str(data) would describe the base R function data(), not our tibble)
str(dataset)
# Convert diabetes_binary to factor for logistic regression
dataset$Diabetes_binary <- as.factor(dataset$Diabetes_binary)
# Build a Generalized Linear Model (Logistic Regression)
# Predicting diabetes status based on BMI, physical activity, and smoking status
glm_model <- glm(Diabetes_binary ~ BMI + PhysActivity + Smoker, 
                 data = dataset, family = binomial)

Summary for model

# Summary of the model
summary(glm_model)
## 
## Call:
## glm(formula = Diabetes_binary ~ BMI + PhysActivity + Smoker, 
##     family = binomial, data = dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.669419   0.045274  -58.96   <2e-16 ***
## BMI           0.096964   0.001381   70.20   <2e-16 ***
## PhysActivity -0.502127   0.017771  -28.25   <2e-16 ***
## Smoker        0.326924   0.015985   20.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 98000  on 70691  degrees of freedom
## Residual deviance: 89857  on 70688  degrees of freedom
## AIC: 89865
## 
## Number of Fisher Scoring iterations: 4
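The deviance figures in the summary can be turned into an overall test of the model. The sketch below is a rough check rather than part of the original analysis: it copies the null and residual deviances from the output above (the null deviance is printed rounded), runs a likelihood-ratio test against the intercept-only model, and computes McFadden's pseudo R-squared. The variable names are my own.

```r
# Values copied from the summary(glm_model) output above
null_dev  <- 98000   # null deviance on 70691 degrees of freedom
resid_dev <- 89857   # residual deviance on 70688 degrees of freedom
df_diff   <- 70691 - 70688   # three predictors were added

# Likelihood-ratio (deviance) test against the intercept-only model
lr_stat <- null_dev - resid_dev
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# McFadden's pseudo R-squared; valid here because the outcome is ungrouped 0/1
mcfadden_r2 <- 1 - resid_dev / null_dev

c(lr_stat = lr_stat, df = df_diff, p_value = p_value, mcfadden_r2 = mcfadden_r2)
```

A large drop in deviance with a small pseudo R-squared would mean the predictors are clearly informative yet leave most of the variation in diabetes status unexplained.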
# Model diagnostics - Checking for multicollinearity using Variance Inflation Factor (VIF)
vif(glm_model)
##          BMI PhysActivity       Smoker 
##     1.010116     1.013019     1.004818
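  1. Multicollinearity: The Variance Inflation Factor (VIF) measures how strongly each explanatory variable is correlated with the others. A VIF above 5 (or, by a looser rule of thumb, 10) suggests a multicollinearity problem. Here all three VIFs are about 1.0, so multicollinearity is not a concern for this model.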
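To make the VIF idea concrete, here is a minimal sketch of the textbook definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. It runs on simulated stand-in data, not the BRFSS file, and the column names are illustrative. As far as I understand, car::vif() works from the fitted model's coefficient covariance matrix, so its values for a GLM can differ slightly from this calculation.

```r
set.seed(1)
n <- 2000

# Simulated stand-ins for the three predictors (NOT the BRFSS data)
bmi    <- rnorm(n, mean = 30, sd = 7)
phys   <- rbinom(n, 1, plogis(1.5 - 0.03 * bmi))  # mildly related to BMI
smoker <- rbinom(n, 1, 0.45)
X <- data.frame(bmi, phys, smoker)

# Textbook VIF: regress each predictor on the others, then 1 / (1 - R^2)
manual_vif <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

round(manual_vif, 3)
```

Values close to 1 mean that a predictor is almost unexplained by the others, which matches what the VIF output above shows for the real model.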
# Diagnose the model by checking residuals and fitted values
par(mfrow = c(2, 2)) # Set plot layout for diagnostics
# Model diagnostics - index plot of deviance residuals
# (stored as dev_resid so the residuals() function is not shadowed)
dev_resid <- residuals(glm_model, type = "deviance")
plot(dev_resid, main = "Residuals Plot", ylab = "Deviance Residuals", xlab = "Index")
abline(h = 0, col = "red")

Model Insights

Model Summary: The summary output gives the estimate, standard error, z value, and p-value for each coefficient. All three predictors have p-values below 2e-16, so each is statistically significant. Higher BMI and smoking are associated with higher odds of diabetes, while physical activity is associated with lower odds.
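To show how the coefficients translate into a predicted probability, the sketch below plugs the estimates reported in the summary into the inverse logit. The helper predict_prob() and the two example profiles are hypothetical. Because the dataset is a 50/50 split (per the file name), the resulting probabilities describe this balanced sample and should not be read as population prevalence.

```r
# Coefficient estimates copied from the summary(glm_model) output
coefs <- c(intercept = -2.669419, BMI = 0.096964,
           PhysActivity = -0.502127, Smoker = 0.326924)

# Hypothetical helper: linear predictor, then inverse logit via plogis()
predict_prob <- function(bmi, phys_activity, smoker) {
  eta <- coefs[["intercept"]] +
    coefs[["BMI"]] * bmi +
    coefs[["PhysActivity"]] * phys_activity +
    coefs[["Smoker"]] * smoker
  plogis(eta)
}

# Two illustrative profiles with the same BMI
p_active_nonsmoker <- predict_prob(bmi = 30, phys_activity = 1, smoker = 0)
p_inactive_smoker  <- predict_prob(bmi = 30, phys_activity = 0, smoker = 1)

c(active_nonsmoker = p_active_nonsmoker, inactive_smoker = p_inactive_smoker)
```

With the fitted model object available, predict(glm_model, newdata = ..., type = "response") should give the same numbers up to rounding of the copied coefficients.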

# Deviance residuals for the plots below
# (glm_model$residuals holds the working residuals, which are not meant for these plots)
dev_resid <- residuals(glm_model, type = "deviance")

# Q-Q plot for residuals
qqnorm(dev_resid)
qqline(dev_resid, col = "blue")

# Plot fitted values vs residuals
plot(fitted(glm_model), dev_resid, main = "Fitted vs Residuals",
     xlab = "Fitted Values", ylab = "Deviance Residuals")
abline(h = 0, col = "red")

# Histogram of residuals (for a binary outcome these are not expected to look normal)
hist(dev_resid, main = "Histogram of Residuals", xlab = "Deviance Residuals")

Diagnostic Plots Explanation

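In my analysis, I generated several diagnostic plots to evaluate the logistic regression model: an index plot of the deviance residuals, a Q-Q plot, a plot of fitted values against residuals, and a histogram of the residuals. Because the response is binary, these residuals are not expected to be normally distributed, so the plots are most useful for spotting outliers and systematic patterns rather than for testing normality.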
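For a binary outcome, one commonly used alternative to the raw residual plots is a binned residual plot: observations are grouped by fitted probability and the average residual in each bin is plotted. The sketch below is not part of the original analysis. It uses simulated data and a single simulated predictor, and the choice of 20 bins is arbitrary.

```r
set.seed(42)
n <- 5000

# Simulated data (NOT the BRFSS data): one continuous predictor, binary outcome
bmi <- rnorm(n, mean = 30, sd = 7)
y   <- rbinom(n, 1, plogis(-3 + 0.1 * bmi))
fit <- glm(y ~ bmi, family = binomial)

fitted_p <- fitted(fit)
raw_res  <- y - fitted_p   # response residuals: observed minus fitted

# Group observations into 20 equal-count bins by fitted probability
n_bins <- 20
breaks <- quantile(fitted_p, probs = seq(0, 1, length.out = n_bins + 1))
bin    <- cut(fitted_p, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  mean_fitted = as.vector(tapply(fitted_p, bin, mean)),
  mean_resid  = as.vector(tapply(raw_res, bin, mean)),
  n           = as.vector(table(bin))
)
# Approximate 2-standard-error band for each bin average
binned$band <- 2 * as.vector(tapply(raw_res, bin, sd)) / sqrt(binned$n)

plot(binned$mean_fitted, binned$mean_resid,
     ylim = range(c(binned$mean_resid, binned$band, -binned$band)),
     main = "Binned Residuals", xlab = "Mean Fitted Probability",
     ylab = "Mean Residual")
abline(h = 0, col = "red")
lines(binned$mean_fitted, binned$band, lty = 2)
lines(binned$mean_fitted, -binned$band, lty = 2)
```

If the model is adequate, most bin averages should fall inside the dashed band with no visible trend. A curve in the points would suggest a missing non-linear term.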

Interpretation of one coefficient (BMI)

exp(coef(glm_model)["BMI"])  # Exponentiate to interpret in terms of odds ratio
##      BMI 
## 1.101821

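Coefficient Interpretation: The odds ratio for BMI is about 1.10. Holding physical activity and smoking status constant, each one-unit increase in BMI multiplies the odds of having diabetes by about 1.10, an increase of roughly 10%.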
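The odds ratio above is a point estimate. As a rough sketch of how its uncertainty could be reported, the code below builds a 95% Wald confidence interval from the BMI estimate and standard error printed in the model summary, and also rescales the effect to a 5-unit BMI difference. The 5-unit step is only an illustrative choice.

```r
# Estimate and standard error for BMI, copied from the model summary
beta_bmi <- 0.096964
se_bmi   <- 0.001381

# Odds ratio per one-unit increase in BMI
or_bmi <- exp(beta_bmi)

# 95% Wald interval: build it on the log-odds scale, then exponentiate
z      <- qnorm(0.975)
ci_bmi <- exp(beta_bmi + c(-1, 1) * z * se_bmi)

# Odds ratio for a 5-unit difference in BMI (illustrative step size)
or_bmi_5 <- exp(5 * beta_bmi)

c(or = or_bmi, lower = ci_bmi[1], upper = ci_bmi[2], or_5_units = or_bmi_5)
```

With the fitted model available, exp(confint.default(glm_model)) should return the same Wald interval for every coefficient at once.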

Conclusion

This week’s data dive allowed us to explore how health indicators such as BMI, physical activity, and smoking relate to diabetes risk using a generalized linear model. BMI has a significant positive association with diabetes risk: each additional unit raises the odds by about 10%. The VIF values near 1 indicate that multicollinearity is not a problem for this model, but further investigation into possible non-linearity in the relationships may be necessary to improve the model’s performance.
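One way the non-linearity question could be followed up is to add a squared BMI term and compare the two nested models with a likelihood-ratio test. The sketch below does this on simulated data in which a quadratic effect is deliberately built in, so it illustrates the mechanics only and says nothing about the BRFSS data. All object names are hypothetical.

```r
set.seed(7)
n <- 5000

# Simulated data (NOT the BRFSS data) with a built-in quadratic BMI effect
bmi <- rnorm(n, mean = 30, sd = 7)
eta <- -3 + 0.10 * bmi - 0.004 * (bmi - 30)^2
sim <- data.frame(y = rbinom(n, 1, plogis(eta)), bmi = bmi)

# Nested models: linear in BMI versus linear plus squared term
fit_linear <- glm(y ~ bmi, data = sim, family = binomial)
fit_quad   <- glm(y ~ bmi + I(bmi^2), data = sim, family = binomial)

# Likelihood-ratio test: does the squared term reduce the deviance enough?
lrt <- anova(fit_linear, fit_quad, test = "Chisq")
lrt
```

On the real data, the same comparison would use glm_model as the baseline and update(glm_model, . ~ . + I(BMI^2)) as the larger model.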