Data Dive — Regression Diagnostics

Introduction

The goal of this data dive is to enhance our understanding of regression models by incorporating additional variables and evaluating their impact. We will expand upon last week’s simple linear regression model by adding 1-3 more variables, including an interaction or binary term, and assess the model using diagnostic plots.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <-read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv" , delim = ",")

## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dataset)

R Code for Centering Variables

You can center the variables before creating the interaction term to help mitigate multicollinearity:

# Centering variables
dataset$HighBP_centered <- scale(dataset$HighBP, center = TRUE, scale = FALSE)
dataset$PhysActivity_centered <- scale(dataset$PhysActivity, center = TRUE, scale = FALSE)

# Rebuild the model with centered interaction term
extended_model_centered <- lm(Diabetes_binary ~ HighBP_centered * PhysActivity_centered + BMI, data = dataset)

# Check VIF again
vif_values_centered <- vif(extended_model_centered)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

print(vif_values_centered)

##                       HighBP_centered                 PhysActivity_centered 
##                              1.075085                              1.058040 
##                                   BMI HighBP_centered:PhysActivity_centered 
##                              1.084057                              1.018559

# Check VIF with type = "predictor"
vif_values_centered <- vif(extended_model_centered, type = "predictor")

## GVIFs computed for predictors

print(vif_values_centered)

##                           GVIF Df GVIF^(1/(2*Df))        Interacts With
## HighBP_centered       1.084057  3        1.013543 PhysActivity_centered
## PhysActivity_centered 1.084057  3        1.013543       HighBP_centered
## BMI                   1.084057  1        1.041181                  --  
##                                             Other Predictors
## HighBP_centered                                          BMI
## PhysActivity_centered                                    BMI
## BMI                   HighBP_centered, PhysActivity_centered

VIF Check: By setting type = “predictor”, you can assess multicollinearity for each predictor separately, including interaction terms.

This approach should help manage multicollinearity in models with interaction terms.

Model Building

Initial Model

# Initial simple linear regression model
initial_model <- lm(Diabetes_binary ~ HighBP, data = dataset)
summary(initial_model)

## 
## Call:
## lm(formula = Diabetes_binary ~ HighBP, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66791 -0.28328  0.02441  0.33209  0.71672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.283279   0.002631   107.7   <2e-16 ***
## HighBP      0.384626   0.003505   109.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4622 on 70690 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1455 
## F-statistic: 1.204e+04 on 1 and 70690 DF,  p-value: < 2.2e-16

Extended Model

We will add BMI and PhysActivity as additional predictors. An interaction term between HighBP and PhysActivity is also included to explore potential interactions.

# Extended model with additional variables and interaction term
extended_model <- lm(Diabetes_binary ~ HighBP * PhysActivity + BMI, data = dataset)
summary(extended_model_centered)

## 
## Call:
## lm(formula = Diabetes_binary ~ HighBP_centered * PhysActivity_centered + 
##     BMI, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6177 -0.3313 -0.1165  0.3761  0.9403 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            0.0774540  0.0075674  10.235  < 2e-16
## HighBP_centered                        0.3234711  0.0035271  91.709  < 2e-16
## PhysActivity_centered                 -0.0906909  0.0037980 -23.879  < 2e-16
## BMI                                    0.0141978  0.0002469  57.499  < 2e-16
## HighBP_centered:PhysActivity_centered  0.0439655  0.0077068   5.705 1.17e-08
##                                          
## (Intercept)                           ***
## HighBP_centered                       ***
## PhysActivity_centered                 ***
## BMI                                   ***
## HighBP_centered:PhysActivity_centered ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4486 on 70687 degrees of freedom
## Multiple R-squared:  0.1952, Adjusted R-squared:  0.1952 
## F-statistic:  4286 on 4 and 70687 DF,  p-value: < 2.2e-16

Variable Selection Rationale

BMI: Chosen due to its known association with diabetes risk.

PhysActivity: Included because physical activity can influence diabetes management.

Interaction Term (HighBP * PhysActivity): To explore if the effect of high blood pressure on diabetes risk is modified by physical activity levels.

Diagnostic Plots

We will use diagnostic plots to evaluate the assumptions of the regression model.

# Diagnostic plots
par(mfrow = c(2, 3))
plot(extended_model_centered)

Linearity Assumption:

We can assess linearity using the Residuals vs Fitted plot

Severity assessment:

If the red line is approximately horizontal and close to y=0, the linearity assumption is likely met.
Minor deviations suggest a low severity issue.
Large, systematic deviations indicate a high severity issue.

Confidence level:

High confidence if the red line is very close to horizontal.
Medium confidence if there are slight deviations.
Low confidence if there are clear patterns or strong curvature.

We can check this using the Q-Q plot.

Severity assessment:

Points closely following the diagonal line suggest normality.
Minor deviations at the tails indicate a low severity issue.
Significant S-shaped curves or other patterns suggest a high severity issue.

Confidence level:

High confidence if points closely follow the diagonal line.
Medium confidence if there are minor deviations, especially at the tails.
Low confidence if there are significant deviations throughout.

# Calculate Cook's distance
cooks_d <- cooks.distance(extended_model_centered)

# Create a data frame with the Cook's distance values and their indices
cooks_df <- data.frame(obs = 1:length(cooks_d), cooks_distance = cooks_d)

# Calculate the cutoff value (4/n)
n <- nrow(cooks_df)
cutoff <- 4 / n

# Create the Cook's Distance plot
ggplot(cooks_df, aes(x = obs, y = cooks_distance)) +
  geom_point() +
  geom_hline(yintercept = cutoff, linetype = "dashed", color = "red") +
  labs(title = "Cook's Distance Plot",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal() +
  annotate("text", x = max(cooks_df$obs), y = cutoff, 
           label = "Cutoff Line (4/n)", vjust = -0.5, hjust = 1, color = "red")

Insights Gathered

Residuals vs Fitted Plot: Checks for non-linearity. Ideally, residuals should be randomly scattered around zero.
Interpretation: The Residuals vs Fitted plot shows a clear “X” pattern, indicating that the residuals are not randomly distributed around zero. This suggests that the linearity assumption may not hold, and there could be some non-linear relationships in the data that are not being captured by the model.

Q-Q Plot: Assesses normality of residuals. Points should lie on the diagonal line.
Interpretation: The Normal Q-Q plot shows that most points follow the straight line, indicating that residuals are approximately normally distributed. However, there are deviations at both tails, suggesting potential outliers or non-normality in the extreme values.

Scale-Location Plot: Examines homoscedasticity (constant variance). A horizontal line indicates equal spread.
Interpretation: The Scale-Location plot displays a distinct “X” pattern, indicating heteroscedasticity. The residual variance is not constant across fitted values, which violates one of the assumptions of linear regression.

Residuals vs Leverage Plot: Identifies influential observations. Points outside Cook’s distance lines are influential.
Interpretation: The Residuals vs Leverage plot shows most points clustered around low leverage values and no extreme outliers with high Cook’s distance. This suggests that while some observations have higher leverage, they do not appear to be highly influential on the model.

Cooks Distance Plot: The Cook’s Distance plot reveals several influential observations, particularly in the range of observation indices from 0 to 35,000, with the highest Cook’s distance value around 0.004 at approximately index 30,000. While most data points have relatively low Cook’s distances, the presence of multiple points exceeding the cutoff line (4/n) suggests that there are several influential observations that may have a substantial impact on the regression results and warrant further investigation.

Further Investigation

If any assumptions are violated, consider transformations or adding/removing variables.

Investigate multicollinearity using VIF (Variance Inflation Factor).

# Check for multicollinearity
library(car)
vif(extended_model_centered)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

##                       HighBP_centered                 PhysActivity_centered 
##                              1.075085                              1.058040 
##                                   BMI HighBP_centered:PhysActivity_centered 
##                              1.084057                              1.018559

VIF Interpretation

HighBP: VIF = 3.78

PhysActivity: VIF = 2.75

BMI: VIF = 1.08

HighBP:PhysActivity (Interaction Term): VIF = 4.84

Multicollinearity Analysis

VIF Values:

A VIF value above 5-10 indicates a potential multicollinearity problem.
The interaction term HighBP : PhysActivity has the highest VIF, suggesting some multicollinearity.

Addressing Multicollinearity:

Consider centering the variables involved in the interaction term to reduce multicollinearity.
Check if removing the interaction term significantly affects model performance and interpretability.

Conclusion

The extended regression model provides a more comprehensive understanding of factors affecting diabetes risk. Future work could explore additional interactions or non-linear relationships.