Introduction
The goal of this data dive is to enhance our understanding of regression models by incorporating additional variables and evaluating their impact. We will expand upon last week’s simple linear regression model by adding 1-3 more variables, including an interaction or binary term, and assess the model using diagnostic plots.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(dplyr)
library(ggplot2)
Data Preparation
First, we load the dataset.
dataset <-read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv" , delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
R Code for Centering Variables
You can center the variables before creating the interaction term to help mitigate multicollinearity:
# Centering variables
dataset$HighBP_centered <- scale(dataset$HighBP, center = TRUE, scale = FALSE)
dataset$PhysActivity_centered <- scale(dataset$PhysActivity, center = TRUE, scale = FALSE)
# Rebuild the model with centered interaction term
extended_model_centered <- lm(Diabetes_binary ~ HighBP_centered * PhysActivity_centered + BMI, data = dataset)
# Check VIF again
vif_values_centered <- vif(extended_model_centered)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
print(vif_values_centered)
## HighBP_centered PhysActivity_centered
## 1.075085 1.058040
## BMI HighBP_centered:PhysActivity_centered
## 1.084057 1.018559
# Check VIF with type = "predictor"
vif_values_centered <- vif(extended_model_centered, type = "predictor")
## GVIFs computed for predictors
print(vif_values_centered)
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## HighBP_centered 1.084057 3 1.013543 PhysActivity_centered
## PhysActivity_centered 1.084057 3 1.013543 HighBP_centered
## BMI 1.084057 1 1.041181 --
## Other Predictors
## HighBP_centered BMI
## PhysActivity_centered BMI
## BMI HighBP_centered, PhysActivity_centered
VIF Check: By setting type = “predictor”, you can assess multicollinearity for each predictor separately, including interaction terms.
Model Building
Initial Model
# Initial simple linear regression model
initial_model <- lm(Diabetes_binary ~ HighBP, data = dataset)
summary(initial_model)
##
## Call:
## lm(formula = Diabetes_binary ~ HighBP, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66791 -0.28328 0.02441 0.33209 0.71672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.283279 0.002631 107.7 <2e-16 ***
## HighBP 0.384626 0.003505 109.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4622 on 70690 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1455
## F-statistic: 1.204e+04 on 1 and 70690 DF, p-value: < 2.2e-16
Extended Model
We will add BMI and PhysActivity as additional predictors. An interaction term between HighBP and PhysActivity is also included to explore potential interactions.
# Extended model with additional variables and interaction term
extended_model <- lm(Diabetes_binary ~ HighBP * PhysActivity + BMI, data = dataset)
summary(extended_model_centered)
##
## Call:
## lm(formula = Diabetes_binary ~ HighBP_centered * PhysActivity_centered +
## BMI, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6177 -0.3313 -0.1165 0.3761 0.9403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0774540 0.0075674 10.235 < 2e-16
## HighBP_centered 0.3234711 0.0035271 91.709 < 2e-16
## PhysActivity_centered -0.0906909 0.0037980 -23.879 < 2e-16
## BMI 0.0141978 0.0002469 57.499 < 2e-16
## HighBP_centered:PhysActivity_centered 0.0439655 0.0077068 5.705 1.17e-08
##
## (Intercept) ***
## HighBP_centered ***
## PhysActivity_centered ***
## BMI ***
## HighBP_centered:PhysActivity_centered ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4486 on 70687 degrees of freedom
## Multiple R-squared: 0.1952, Adjusted R-squared: 0.1952
## F-statistic: 4286 on 4 and 70687 DF, p-value: < 2.2e-16
Variable Selection Rationale
BMI: Chosen due to its known association with diabetes risk.
PhysActivity: Included because physical activity can influence diabetes management.
Interaction Term (HighBP * PhysActivity): To explore
if the effect of high blood pressure on diabetes risk is modified by
physical activity levels.
Diagnostic Plots
We will use diagnostic plots to evaluate the assumptions of the regression model.
# Diagnostic plots
par(mfrow = c(2, 3))
plot(extended_model_centered)
Linearity Assumption:
We can assess linearity using the Residuals vs Fitted plot
Severity assessment:
If the red line is approximately horizontal and close to y=0, the linearity assumption is likely met.
Minor deviations suggest a low severity issue.
Large, systematic deviations indicate a high severity issue.
Confidence level:
High confidence if the red line is very close to horizontal.
Medium confidence if there are slight deviations.
Low confidence if there are clear patterns or strong curvature.
We can check this using the Q-Q plot.
Severity assessment:
Points closely following the diagonal line suggest normality.
Minor deviations at the tails indicate a low severity issue.
Significant S-shaped curves or other patterns suggest a high severity issue.
Confidence level:
High confidence if points closely follow the diagonal line.
Medium confidence if there are minor deviations, especially at the tails.
Low confidence if there are significant deviations throughout.
# Calculate Cook's distance
cooks_d <- cooks.distance(extended_model_centered)
# Create a data frame with the Cook's distance values and their indices
cooks_df <- data.frame(obs = 1:length(cooks_d), cooks_distance = cooks_d)
# Calculate the cutoff value (4/n)
n <- nrow(cooks_df)
cutoff <- 4 / n
# Create the Cook's Distance plot
ggplot(cooks_df, aes(x = obs, y = cooks_distance)) +
geom_point() +
geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
labs(title = "Cook's Distance Plot",
x = "Observation Index",
y = "Cook's Distance") +
theme_minimal() +
annotate("text", x = max(cooks_df$obs), y = cutoff,
label = "Cutoff Line (4/n)", vjust = -0.5, hjust = 1, color = "red")
Insights Gathered
Residuals vs Fitted Plot: Checks for non-linearity.
Ideally, residuals should be randomly scattered around zero.
Interpretation: The Residuals vs Fitted plot shows a
clear “X” pattern, indicating that the residuals are not randomly
distributed around zero. This suggests that the linearity assumption may
not hold, and there could be some non-linear relationships in the data
that are not being captured by the model.
Q-Q Plot: Assesses normality of residuals. Points
should lie on the diagonal line.
Interpretation: The Normal Q-Q plot shows that most
points follow the straight line, indicating that residuals are
approximately normally distributed. However, there are deviations at
both tails, suggesting potential outliers or non-normality in the
extreme values.
Scale-Location Plot: Examines homoscedasticity
(constant variance). A horizontal line indicates equal spread.
Interpretation: The Scale-Location plot displays a
distinct “X” pattern, indicating heteroscedasticity. The residual
variance is not constant across fitted values, which violates one of the
assumptions of linear regression.
Residuals vs Leverage Plot: Identifies influential
observations. Points outside Cook’s distance lines are
influential.
Interpretation: The Residuals vs Leverage plot shows
most points clustered around low leverage values and no extreme outliers
with high Cook’s distance. This suggests that while some observations
have higher leverage, they do not appear to be highly influential on the
model.
Cooks Distance Plot: The Cook’s Distance plot reveals
several influential observations, particularly in the range of
observation indices from 0 to 35,000, with the highest Cook’s distance
value around 0.004 at approximately index 30,000. While most data points
have relatively low Cook’s distances, the presence of multiple points
exceeding the cutoff line (4/n) suggests that there are several
influential observations that may have a substantial impact on the
regression results and warrant further investigation.
Further Investigation
If any assumptions are violated, consider transformations or adding/removing variables.
Investigate multicollinearity using VIF (Variance Inflation Factor).
# Check for multicollinearity
library(car)
vif(extended_model_centered)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## HighBP_centered PhysActivity_centered
## 1.075085 1.058040
## BMI HighBP_centered:PhysActivity_centered
## 1.084057 1.018559
VIF Interpretation
HighBP: VIF = 3.78
PhysActivity: VIF = 2.75
BMI: VIF = 1.08
HighBP:PhysActivity (Interaction Term): VIF = 4.84
Multicollinearity Analysis
VIF Values:
A VIF value above 5-10 indicates a potential multicollinearity problem.
The interaction term HighBP : PhysActivity has the highest VIF, suggesting some multicollinearity.
Addressing Multicollinearity:
Consider centering the variables involved in the interaction term to reduce multicollinearity.
Check if removing the interaction term significantly affects model performance and interpretability.
Conclusion
The extended regression model provides a more comprehensive
understanding of factors affecting diabetes risk. Future work could
explore additional interactions or non-linear relationships.