Introduction

The goal of this data dive is to enhance our understanding of regression models by incorporating additional variables and evaluating their impact. We will expand upon last week’s simple linear regression model by adding 1-3 more variables, including an interaction or binary term, and assess the model using diagnostic plots.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <-read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv" , delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)

R Code for Centering Variables

You can center the variables before creating the interaction term to help mitigate multicollinearity:

# Centering variables
dataset$HighBP_centered <- scale(dataset$HighBP, center = TRUE, scale = FALSE)
dataset$PhysActivity_centered <- scale(dataset$PhysActivity, center = TRUE, scale = FALSE)

# Rebuild the model with centered interaction term
extended_model_centered <- lm(Diabetes_binary ~ HighBP_centered * PhysActivity_centered + BMI, data = dataset)

# Check VIF again
vif_values_centered <- vif(extended_model_centered)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
print(vif_values_centered)
##                       HighBP_centered                 PhysActivity_centered 
##                              1.075085                              1.058040 
##                                   BMI HighBP_centered:PhysActivity_centered 
##                              1.084057                              1.018559
# Check VIF with type = "predictor"
vif_values_centered <- vif(extended_model_centered, type = "predictor")
## GVIFs computed for predictors
print(vif_values_centered)
##                           GVIF Df GVIF^(1/(2*Df))        Interacts With
## HighBP_centered       1.084057  3        1.013543 PhysActivity_centered
## PhysActivity_centered 1.084057  3        1.013543       HighBP_centered
## BMI                   1.084057  1        1.041181                  --  
##                                             Other Predictors
## HighBP_centered                                          BMI
## PhysActivity_centered                                    BMI
## BMI                   HighBP_centered, PhysActivity_centered

VIF Check: By setting type = “predictor”, you can assess multicollinearity for each predictor separately, including interaction terms.

Model Building

Initial Model

# Initial simple linear regression model
initial_model <- lm(Diabetes_binary ~ HighBP, data = dataset)
summary(initial_model)
## 
## Call:
## lm(formula = Diabetes_binary ~ HighBP, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66791 -0.28328  0.02441  0.33209  0.71672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.283279   0.002631   107.7   <2e-16 ***
## HighBP      0.384626   0.003505   109.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4622 on 70690 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1455 
## F-statistic: 1.204e+04 on 1 and 70690 DF,  p-value: < 2.2e-16

Extended Model

We will add BMI and PhysActivity as additional predictors. An interaction term between HighBP and PhysActivity is also included to explore potential interactions.

# Extended model with additional variables and interaction term
extended_model <- lm(Diabetes_binary ~ HighBP * PhysActivity + BMI, data = dataset)
summary(extended_model_centered)
## 
## Call:
## lm(formula = Diabetes_binary ~ HighBP_centered * PhysActivity_centered + 
##     BMI, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6177 -0.3313 -0.1165  0.3761  0.9403 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            0.0774540  0.0075674  10.235  < 2e-16
## HighBP_centered                        0.3234711  0.0035271  91.709  < 2e-16
## PhysActivity_centered                 -0.0906909  0.0037980 -23.879  < 2e-16
## BMI                                    0.0141978  0.0002469  57.499  < 2e-16
## HighBP_centered:PhysActivity_centered  0.0439655  0.0077068   5.705 1.17e-08
##                                          
## (Intercept)                           ***
## HighBP_centered                       ***
## PhysActivity_centered                 ***
## BMI                                   ***
## HighBP_centered:PhysActivity_centered ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4486 on 70687 degrees of freedom
## Multiple R-squared:  0.1952, Adjusted R-squared:  0.1952 
## F-statistic:  4286 on 4 and 70687 DF,  p-value: < 2.2e-16

Variable Selection Rationale

BMI: Chosen due to its known association with diabetes risk.

PhysActivity: Included because physical activity can influence diabetes management.

Interaction Term (HighBP * PhysActivity): To explore if the effect of high blood pressure on diabetes risk is modified by physical activity levels.

Diagnostic Plots

We will use diagnostic plots to evaluate the assumptions of the regression model.

# Diagnostic plots
par(mfrow = c(2, 3))
plot(extended_model_centered)

Linearity Assumption:

We can assess linearity using the Residuals vs Fitted plot

Severity assessment:

Confidence level:

We can check this using the Q-Q plot.

Severity assessment:

Confidence level:

# Calculate Cook's distance
cooks_d <- cooks.distance(extended_model_centered)

# Create a data frame with the Cook's distance values and their indices
cooks_df <- data.frame(obs = 1:length(cooks_d), cooks_distance = cooks_d)

# Calculate the cutoff value (4/n)
n <- nrow(cooks_df)
cutoff <- 4 / n

# Create the Cook's Distance plot
ggplot(cooks_df, aes(x = obs, y = cooks_distance)) +
  geom_point() +
  geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance Plot",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  annotate("text", x = max(cooks_df$obs), y = cutoff, 
           label = "Cutoff Line (4/n)", vjust = -0.5, hjust = 1, color = "red")


Insights Gathered

Residuals vs Fitted Plot: Checks for non-linearity. Ideally, residuals should be randomly scattered around zero.
Interpretation: The Residuals vs Fitted plot shows a clear “X” pattern, indicating that the residuals are not randomly distributed around zero. This suggests that the linearity assumption may not hold, and there could be some non-linear relationships in the data that are not being captured by the model.

Q-Q Plot: Assesses normality of residuals. Points should lie on the diagonal line.
Interpretation: The Normal Q-Q plot shows that most points follow the straight line, indicating that residuals are approximately normally distributed. However, there are deviations at both tails, suggesting potential outliers or non-normality in the extreme values.

Scale-Location Plot: Examines homoscedasticity (constant variance). A horizontal line indicates equal spread.
Interpretation: The Scale-Location plot displays a distinct “X” pattern, indicating heteroscedasticity. The residual variance is not constant across fitted values, which violates one of the assumptions of linear regression.

Residuals vs Leverage Plot: Identifies influential observations. Points outside Cook’s distance lines are influential.
Interpretation: The Residuals vs Leverage plot shows most points clustered around low leverage values and no extreme outliers with high Cook’s distance. This suggests that while some observations have higher leverage, they do not appear to be highly influential on the model.

Cooks Distance Plot: The Cook’s Distance plot reveals several influential observations, particularly in the range of observation indices from 0 to 35,000, with the highest Cook’s distance value around 0.004 at approximately index 30,000. While most data points have relatively low Cook’s distances, the presence of multiple points exceeding the cutoff line (4/n) suggests that there are several influential observations that may have a substantial impact on the regression results and warrant further investigation.

Further Investigation

If any assumptions are violated, consider transformations or adding/removing variables.

Investigate multicollinearity using VIF (Variance Inflation Factor).

# Check for multicollinearity
library(car)
vif(extended_model_centered)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                       HighBP_centered                 PhysActivity_centered 
##                              1.075085                              1.058040 
##                                   BMI HighBP_centered:PhysActivity_centered 
##                              1.084057                              1.018559

VIF Interpretation

HighBP: VIF = 3.78

PhysActivity: VIF = 2.75

BMI: VIF = 1.08

HighBP:PhysActivity (Interaction Term): VIF = 4.84

Multicollinearity Analysis

VIF Values:

Addressing Multicollinearity:

Conclusion

The extended regression model provides a more comprehensive understanding of factors affecting diabetes risk. Future work could explore additional interactions or non-linear relationships.