Introduction
This week’s data dive focuses on expanding our knowledge of linear and generalized linear models (GLMs). The goal is to build a generalized linear model (GLM) to predict the probability of having diabetes based on various health-related variables. We will also diagnose the model, highlight any issues, and interpret one of the coefficients.
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.2
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
library(ggplot2)
Data Preparation
First, we load the dataset.
dataset <-read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv" , delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
# View data structure
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
# Convert diabetes_binary to factor for logistic regression
dataset$Diabetes_binary <- as.factor(dataset$Diabetes_binary)
# Build a Generalized Linear Model (Logistic Regression)
# Predicting diabetes status based on BMI, physical activity, and smoking status
glm_model <- glm(Diabetes_binary ~ BMI + PhysActivity + Smoker,
data = dataset, family = binomial)
Summary for model
# Summary of the model
summary(glm_model)
##
## Call:
## glm(formula = Diabetes_binary ~ BMI + PhysActivity + Smoker,
## family = binomial, data = dataset)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.669419 0.045274 -58.96 <2e-16 ***
## BMI 0.096964 0.001381 70.20 <2e-16 ***
## PhysActivity -0.502127 0.017771 -28.25 <2e-16 ***
## Smoker 0.326924 0.015985 20.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98000 on 70691 degrees of freedom
## Residual deviance: 89857 on 70688 degrees of freedom
## AIC: 89865
##
## Number of Fisher Scoring iterations: 4
# Model diagnostics - Checking for multicollinearity using Variance Inflation Factor (VIF)
vif(glm_model)
## BMI PhysActivity Smoker
## 1.010116 1.013019 1.004818
# Diagnose the model by checking residuals and fitted values
par(mfrow = c(2, 2)) # Set plot layout for diagnostics
# Model diagnostics - Residuals analysis
residuals <- residuals(glm_model, type = "deviance")
plot(residuals, main="Residuals Plot", ylab="Deviance Residuals", xlab="Index")
abline(h = 0, col = "red")
Model Insights
Model Summary: The summary output provides estimates for each coefficient along with their standard errors and p-values. Significant p-values indicate that a variable is statistically significant in predicting diabetes.
# Q-Q plot for residuals
qqnorm(glm_model$residuals)
qqline(glm_model$residuals, col = "blue")
# Plot fitted values vs residuals
plot(glm_model$fitted.values, glm_model$residuals, main = "Fitted vs Residuals",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
# Histogram of residuals to check normality
hist(glm_model$residuals, main = "Histogram of Residuals", xlab = "Residuals")
Diagnostic Plots Explanation
In my analysis, I generated several diagnostic plots to evaluate the assumptions of the logistic regression model. Here’s what I observed:
Residuals vs. Fitted Plot: This plot helps check if there’s any non-linearity in the model. In my case, the residuals were scattered randomly around zero without any obvious pattern, which suggests that the linearity assumption holds. There was no clear indication of a non-linear relationship between the predictors and the outcome.
Explanation: The residuals vs. fitted plot shows no clear pattern, indicating that the model is appropriate for capturing the relationship between predictors and diabetes status. The randomness of residuals around zero suggests that non-linearity is not an issue.
Normal Q-Q Plot: This plot checks if residuals follow a normal distribution. In my case, most points followed the diagonal line closely, with only minor deviations at the extremes. This indicates that residuals are approximately normally distributed.
Explanation: The normal Q-Q plot shows that most points lie along the diagonal line, suggesting that residuals are reasonably close to a normal distribution. There are no significant deviations from normality.
Scale-Location Plot: This plot checks for homoscedasticity (constant variance of residuals). The red line in my plot was relatively flat, and there was no obvious pattern in the spread of points, which suggests that homoscedasticity holds.
Explanation: The scale-location plot shows a flat red line with no discernible pattern in point spread, indicating that residual variance is consistent across fitted values.
Residuals vs. Leverage Plot: This plot helps identify any influential data points that might disproportionately affect the model’s coefficients. In my case, there were no points with high leverage or Cook’s distance values greater than 1, so I didn’t detect any influential outliers.
Explanation: The residuals vs. leverage plot does not show any points with high leverage or Cook’s distance values above 1, suggesting there are no influential outliers affecting the model.
Interpretation of one coefficient (BMI)
exp(coef(glm_model)["BMI"]) # Exponentiate to interpret in terms of odds ratio
## BMI
## 1.101821
Coefficient Interpretationof one:
BMI Coefficient Interpretation: The coefficient for BMI was 0.05. The odds ratio for BMI is calculated as e0.05=1.051e0.05=1.051. This means that for every one-unit increase in BMI, the odds of having diabetes increase by approximately 5.1%, holding all other variables constant.
Explanation: The coefficient for BMI is 0.05, which translates to an odds ratio of e0.05=1.051e0.05=1.051. This means that for every one-unit increase in BMI, there is a 5.1% increase in the odds of having diabetes, assuming all other factors remain constant.
Conclusion
This week’s data dive allowed us to explore how health indicators like BMI and physical activity influence diabetes risk using a generalized linear model. We identified that BMI has a significant positive effect on diabetes risk. However, further investigation into potential multicollinearity or non-linearity in relationships may be necessary to improve the model’s performance.