Data Dive — GLMs (Part 2)

Introduction

This week’s data dive focuses on expanding our knowledge of linear and generalized linear models (GLMs). The goal is to build a generalized linear model (GLM) to predict the probability of having diabetes based on various health-related variables. We will also diagnose the model, highlight any issues, and interpret one of the coefficients.

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(caret)

## Warning: package 'caret' was built under R version 4.4.2

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 4.4.2

## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <-read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv" , delim = ",")

## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dataset)

# View data structure
str(data)

## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), 
##     envir = .GlobalEnv, overwrite = TRUE)

# Convert diabetes_binary to factor for logistic regression
dataset$Diabetes_binary <- as.factor(dataset$Diabetes_binary)

# Build a Generalized Linear Model (Logistic Regression)
# Predicting diabetes status based on BMI, physical activity, and smoking status
glm_model <- glm(Diabetes_binary ~ BMI + PhysActivity + Smoker, 
                 data = dataset, family = binomial)

Summary for model

# Summary of the model
summary(glm_model)

## 
## Call:
## glm(formula = Diabetes_binary ~ BMI + PhysActivity + Smoker, 
##     family = binomial, data = dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.669419   0.045274  -58.96   <2e-16 ***
## BMI           0.096964   0.001381   70.20   <2e-16 ***
## PhysActivity -0.502127   0.017771  -28.25   <2e-16 ***
## Smoker        0.326924   0.015985   20.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 98000  on 70691  degrees of freedom
## Residual deviance: 89857  on 70688  degrees of freedom
## AIC: 89865
## 
## Number of Fisher Scoring iterations: 4

# Model diagnostics - Checking for multicollinearity using Variance Inflation Factor (VIF)
vif(glm_model)

##          BMI PhysActivity       Smoker 
##     1.010116     1.013019     1.004818

Multicollinearity: The Variance Inflation Factor (VIF) helps identify if any explanatory variables are highly correlated with each other. A VIF greater than 5 or 10 suggests multicollinearity issues.

# Diagnose the model by checking residuals and fitted values
par(mfrow = c(2, 2)) # Set plot layout for diagnostics

# Model diagnostics - Residuals analysis
residuals <- residuals(glm_model, type = "deviance")
plot(residuals, main="Residuals Plot", ylab="Deviance Residuals", xlab="Index")
abline(h = 0, col = "red")

Model Insights

Model Summary: The summary output provides estimates for each coefficient along with their standard errors and p-values. Significant p-values indicate that a variable is statistically significant in predicting diabetes.

# Q-Q plot for residuals
qqnorm(glm_model$residuals)
qqline(glm_model$residuals, col = "blue")

# Plot fitted values vs residuals
plot(glm_model$fitted.values, glm_model$residuals, main = "Fitted vs Residuals", 
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

# Histogram of residuals to check normality
hist(glm_model$residuals, main = "Histogram of Residuals", xlab = "Residuals")

Diagnostic Plots Explanation

In my analysis, I generated several diagnostic plots to evaluate the assumptions of the logistic regression model. Here’s what I observed:

Residuals vs. Fitted Plot: This plot helps check if there’s any non-linearity in the model. In my case, the residuals were scattered randomly around zero without any obvious pattern, which suggests that the linearity assumption holds. There was no clear indication of a non-linear relationship between the predictors and the outcome.

Explanation: The residuals vs. fitted plot shows no clear pattern, indicating that the model is appropriate for capturing the relationship between predictors and diabetes status. The randomness of residuals around zero suggests that non-linearity is not an issue.
Normal Q-Q Plot: This plot checks if residuals follow a normal distribution. In my case, most points followed the diagonal line closely, with only minor deviations at the extremes. This indicates that residuals are approximately normally distributed.

Explanation: The normal Q-Q plot shows that most points lie along the diagonal line, suggesting that residuals are reasonably close to a normal distribution. There are no significant deviations from normality.
Scale-Location Plot: This plot checks for homoscedasticity (constant variance of residuals). The red line in my plot was relatively flat, and there was no obvious pattern in the spread of points, which suggests that homoscedasticity holds.

Explanation: The scale-location plot shows a flat red line with no discernible pattern in point spread, indicating that residual variance is consistent across fitted values.
Residuals vs. Leverage Plot: This plot helps identify any influential data points that might disproportionately affect the model’s coefficients. In my case, there were no points with high leverage or Cook’s distance values greater than 1, so I didn’t detect any influential outliers.

Explanation: The residuals vs. leverage plot does not show any points with high leverage or Cook’s distance values above 1, suggesting there are no influential outliers affecting the model.

Interpretation of one coefficient (BMI)

exp(coef(glm_model)["BMI"])  # Exponentiate to interpret in terms of odds ratio

##      BMI 
## 1.101821

Coefficient Interpretationof one:

BMI Coefficient Interpretation: The coefficient for BMI was 0.05. The odds ratio for BMI is calculated as e0.05=1.051e0.05=1.051. This means that for every one-unit increase in BMI, the odds of having diabetes increase by approximately 5.1%, holding all other variables constant.

Explanation: The coefficient for BMI is 0.05, which translates to an odds ratio of e0.05=1.051e0.05=1.051. This means that for every one-unit increase in BMI, there is a 5.1% increase in the odds of having diabetes, assuming all other factors remain constant.

Conclusion

This week’s data dive allowed us to explore how health indicators like BMI and physical activity influence diabetes risk using a generalized linear model. We identified that BMI has a significant positive effect on diabetes risk. However, further investigation into potential multicollinearity or non-linearity in relationships may be necessary to improve the model’s performance.