Variable Selection Methods in Regression

HW: (1) Find the dataset to fit multiple regession and interpret the results.

HW: (2) Reading about variable selection method in regression and the implementatin in R.

Suppose the multiple linear regression model with the mathematical formulation given as:

\[ Y=\beta_{0} +\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots +\beta_{p}X_{p}+\epsilon.\]

Where Y is response variable (dependent) and X is predictors (independent) variables.

data("mtcars")
dim(mtcars)

## [1] 32 11

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Let us consider the mtcars dataset which is built in R and take $mpg$ as response variable and remaining $10$ as predictors.

Also discussed Variable selection methods that play major role in regression modeling by selecting important predictors to be included in the model. In the multiple regression modeling, the variable selection methods might improve interpretability, reduces overfitting, and enhances prediction accuracy.

-In R, they are several robust approaches ranging from traditional methods (like filter methods (based correlation, p_values/ t-student, and mutual information) and selection (forward, backward and stepwise) methids ) to modern regularization like LASSO and Ridge regressions.

-To tackle variable selection methods in multiple linear regression different methods of variable selection are discussed and implementation of each method in R using to mtcars dataset

Full multiple linear regression model before applying variable selection methods

# Define response correctly
y <- mtcars$mpg
# Remove mpg from predictors
x <- mtcars[, -1]
full_model <-lm(y ~ ., data=x)
cat("===============================================\n")

## ===============================================

cat("Full model summary\n")

## Full model summary

cat("===============================================")

## ===============================================

summary(full_model)

## 
## Call:
## lm(formula = y ~ ., data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

formula(full_model)

## y ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

round(coef(full_model),2)

## (Intercept)         cyl        disp          hp        drat          wt 
##       12.30       -0.11        0.01       -0.02        0.79       -3.72 
##        qsec          vs          am        gear        carb 
##        0.82        0.32        2.52        0.66       -0.20

The full multiple linear regression model can be expressed as \[ mpg = 12. 30 -0.11*cyl + 0.01*disp -0.02*hp +0.79*drat -3.72*wt +0.82*qsec+0.32*vs+2.52*am+0.66*gear-0.20*carb\] with the adjusted r-squared \[R^{2}= 0.8066\].

plot the actual mpg vs the predicted mpg

predicted <- predict(full_model)
# Plot Actual vs predicted
plot(
  y,
  predicted,
  main = "Actual vs Predicted (Correlation Model)",
  xlab = "Actual MPG",
  ylab = "Predicted MPG",
  pch = 19,
  col = "blue"
)

abline(0,1, col = "red", lwd = 2)

### Regression Diagnostic Plots The regression diagnostic plots check the regression assumptions about: (1) Linearity: Use residuals vs fitted plot for checking linearity (2) Normality of the residuals: Use Q-Q plot for checking normality of the residuals (3) Independence (4) Homoscedasticity (equality variables) of residuals

# Diagnostic plot
par(mfrow = c(2,2))
plot(full_model)

(1) The residual vs fitted plot (“top_left”) is for checking the normality assumption: The red line is not horizontal to zero and it shows a curved pattern meaning that the pattern suggests the relationship may not be purely linear. Therefore, the linearity assumption is not satisfied. (2) The Q-Q plot for checking normality (top_right plot) shows points deviate from the line (45-degree reference line) at both tails, especially the upper tail. There is an S-shaped pattern that suggests heavy tailed distribution. Therefore, the residuals are not normally distributed even taking a reference to histogram of residuals below that shows not bell shaped distributions of the residuals.

hist(residuals(full_model),
     xlab = "Residuals",
     ylab = "Frequencies",
     col = "blue",
     main = "The Histogram of residuals")

(3) The residuals vs fitted plot(“top_left plot”) also is used for checking the homoscedasticity assumption It shows that as residuals appear to spread out as fitted values increase and the spread is not constant across the range. This confirms that the the assumption of homoscedasticity is violated. (4) The scale location plot (bottom_left plot) shows a clear upward trend in red line ( not horizontal line). This indicates increasing variance and fitted values increase. Thus, the heteroscedasticity is present (i.e. violation of homoscedasticity assumption.)

To check for the independence assumption, there is a need of statistical tests since the diagnostic plots above cannot detect independence violations. ### The Durbin-Watson Test: Checking for Independence/autocorrelation in Regression The Durbin-Watson statistic ranges from 0 to 4. General decision rule $1.5 \,to\, 2.5$: Acceptable (little to no independence) $< 1.0\, or\, > 3.0$: Significant independence is likely present Exactly 2.0: Perfect independence

#install.packages("lmtest")
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

dw_test <- dwtest(full_model)
print(dw_test)

## 
##  Durbin-Watson test
## 
## data:  full_model
## DW = 1.8609, p-value = 0.1574
## alternative hypothesis: true autocorrelation is greater than 0

Interpretation: There is an acceptable independence since $DW=1.86$ that is in between 1.5 and 2.5. Therefore, the model formulated violate the assumption of independence of the residuals.

Based on the results of the full model, there is need of discussion on different variable selection methods to improve the full model so that the regression assumptions are satisfied.

1. Variable Selection by Filter Methods

Filter methods are traditional variable selection techniques that choose important predictors before building a regression model using statistical measures like correlation, p-values, ANOVA, or mutual information, without depending on the regression model itself. ### 1.1. Correlation based method

# correlation with response
#response <- mtcars$mpg
#predictors<- mtcars[,-1]
data(mtcars)
cat("===========================================\n")

## ===========================================

cat("The correlation values with response are:\n")

## The correlation values with response are:

cat("===========================================\n")

## ===========================================

cor_values <- cor(mtcars[, -1], mtcars$mpg)
cor_values

##            [,1]
## cyl  -0.8521620
## disp -0.8475514
## hp   -0.7761684
## drat  0.6811719
## wt   -0.8676594
## qsec  0.4186840
## vs    0.6640389
## am    0.5998324
## gear  0.4802848
## carb -0.5509251

# Select important variables (threshold rule)
cor_values <- cor(mtcars[, -1], mtcars$mpg)[,1]
cat("===========================================\n")

## ===========================================

cat("The selected variables with absolute correlation value greater than 0.7 are:\n")

## The selected variables with absolute correlation value greater than 0.7 are:

cat("===========================================\n")

## ===========================================

selected_vars <- names(cor_values[abs(cor_values) > 0.7])
selected_vars

## [1] "cyl"  "disp" "hp"   "wt"

Build a regression model based on correlation values with response

#Build regression model
formula_corr <- as.formula(paste("mpg ~", paste(selected_vars, collapse = " + ")))
corr_model <- lm(formula_corr, data = mtcars)
formula(corr_model)

## mpg ~ cyl + disp + hp + wt

cat("===========================================\n")

## ===========================================

cat("MODEL COEFFICIENTS\n")

## MODEL COEFFICIENTS

cat("===========================================\n")

## ===========================================

summary(corr_model)

## 
## Call:
## lm(formula = formula_corr, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## disp         0.01160    0.01173   0.989 0.331386    
## hp          -0.02054    0.01215  -1.691 0.102379    
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8262 
## F-statistic: 37.84 on 4 and 27 DF,  p-value: 1.061e-10

The regression model formulated based on correlation values with response by considering absolute correlation greater than 0.7 is written as \[mpg = 40.828-1.293*cyl +0.011*disp - 0.020*hp -3.853*wt \] with the adjusted r-squared \[R^2 =0.826\]

Plot the actual vs predicted

#  Actual vs Predicted
# -------------------------------------------
predicted_corr <- predict(corr_model)

plot(
  mtcars$mpg,
  predicted_corr,
  main = "Actual vs Predicted (Correlation Model)",
  xlab = "Actual MPG",
  ylab = "Predicted MPG",
  pch = 19,
  col = "blue"
)

abline(0,1, col = "red", lwd = 2)

Diagnostic plots

#Diagnostic Plots

par(mfrow = c(2,2))
plot(corr_model)

Residual Analysis

# Residual Analysis
# -------------------------------------------
hist(residuals(corr_model),
     main = "Residual Histogram",
     col = "lightblue",
     xlab = "Residuals")

# Normality test
shapiro.test(residuals(corr_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(corr_model)
## W = 0.93705, p-value = 0.06177

Interpretations The regression model built based on correlation values with response at threshold $|cor|>0.7$ satisfies the regression assumptions according to the above diagnostics plots and the test for normality that shows that the residuals are somehow normally distributed since p_value of the test is greater than 0.05.

1.2. Variable selection based on t-test / Linear Regression p-value Filter

p_values <- sapply(mtcars[, -1], function(x)
  summary(lm(mpg ~ x, data = mtcars))$coefficients[2, 4])
p_values

##          cyl         disp           hp         drat           wt         qsec 
## 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05 1.293959e-10 1.708199e-02 
##           vs           am         gear         carb 
## 3.415937e-05 2.850207e-04 5.400948e-03 1.084446e-03

# Select significant variables
cat("==================================================\n")

## ==================================================

cat("The significant variables selected are:\n")

## The significant variables selected are:

cat("==================================================\n")

## ==================================================

vars <- names(p_values[p_values < 0.05])
vars

##  [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Build final model using selected predictors

formula_final <- as.formula(paste("mpg ~", paste(vars, collapse = " + ")))
pv_model <- lm(formula_final, data = mtcars)
summary(pv_model)

## 
## Call:
## lm(formula = formula_final, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Interpretation: The filter method based on p-values produces the same model as fitting regression before selecting any predictor variables.

1.3. Variable Selection based on ANOVA filter method

predictors <- names(mtcars[, -1])

f_pvalues <- sapply(predictors, function(x) {
  model <- lm(mtcars$mpg ~ mtcars[[x]])
  anova(model)$`Pr(>F)`[1]
})

f_pvalues

##          cyl         disp           hp         drat           wt         qsec 
## 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05 1.293959e-10 1.708199e-02 
##           vs           am         gear         carb 
## 3.415937e-05 2.850207e-04 5.400948e-03 1.084446e-03

#Select important variables
# Small p-value (< 0.05)  implies variable significantly affects mpg
selected_vars_anova <- names(f_pvalues[f_pvalues < 0.05])
selected_vars_anova

##  [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

1.4. Filter Method using Mutual Information (MI)

Mutual Information (MI) measures how much information a predictor provides about the response variable. It is categorized as: Mi score = 0 means no relationship, Mi score >0 some dependency exists, Mi score large strong relationship

#install.packages("FSelectorRcpp")
library(FSelectorRcpp)
df<- mtcars
#Compute Mutual Information
mi_scores <- information_gain(mpg ~ ., data = df)

## Warning in .information_gain.data.frame(x = x, y = y, type = type, equal =
## equal, : Dependent variable is a numeric! It will be converted to factor with
## simple factor(y). We do not discretize dependent variable in FSelectorRcpp by
## default! You can choose equal frequency binning discretization by setting equal
## argument to TRUE.

mi_scores

##    attributes importance
## 1         cyl  0.9745606
## 2        disp  1.3284969
## 3          hp  1.3901361
## 4        drat  1.3938435
## 5          wt  2.1666445
## 6        qsec  1.6456199
## 7          vs  0.6419925
## 8          am  0.0000000
## 9        gear  0.6045494
## 10       carb  0.6182415

Selecting important variables at above 0.8 mi score

# Select important variables 
selected_vars_mi <- mi_scores$attributes[mi_scores$importance > 0.80]
selected_vars_mi

## [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec"

Build the model based on the selected important variables

# Create regression formula
final_formula <- as.formula(  paste("mpg ~", paste(selected_vars_mi, collapse = " + ")))

# Fit final regression model
mi_model <- lm(final_formula, data = mtcars)

# Model summary
summary(mi_model)

## 
## Call:
## lm(formula = final_formula, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9682 -1.5795 -0.4353  1.1662  5.5272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 26.30736   14.62994   1.798  0.08424 . 
## cyl         -0.81856    0.81156  -1.009  0.32282   
## disp         0.01320    0.01204   1.097  0.28307   
## hp          -0.01793    0.01551  -1.156  0.25846   
## drat         1.32041    1.47948   0.892  0.38065   
## wt          -4.19083    1.25791  -3.332  0.00269 **
## qsec         0.40146    0.51658   0.777  0.44436   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 25 degrees of freedom
## Multiple R-squared:  0.8548, Adjusted R-squared:   0.82 
## F-statistic: 24.53 on 6 and 25 DF,  p-value: 2.45e-09

Writing the model mathematically \[mpg =26.307 -0.818*cyl +0.132*disp -0.0179*hp+1.320*drat -4.190*wt+0.401*qsec\] with the adjusted r-squared \[ R^{2}= 0.82\]

2. Tradictional Stepwise Selection

This techniques iterately adds or removes predictor variables based on the AIC to find the best balance model fit and complexity.

The Akaike Information Criterion (AIC) is a statistical metric used to compare and select the best model among a set of competing models. The model with the lowest AIC score is considered the best.

2.1. Forward Selection Method:

Stats with no variables and adds predictors one by one.

# Load dataset
data(mtcars)

# Full Model
full_model <- lm(mpg ~ ., data = mtcars)
# Null Model
null_model <- lm(mpg ~ 1, data = mtcars)

#Forward Selection
forward_model <- step(
  null_model,
  scope = list(lower = null_model, upper = full_model),
  direction = "forward"
)

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + vs    1    54.228 224.09 68.283
## + carb  1    44.602 233.72 69.628
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## + gear  1     1.137 277.19 75.086
## + am    1     0.002 278.32 75.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## + carb  1   13.7724 177.40 62.805
## <none>              191.17 63.198
## + qsec  1   10.5674 180.60 63.378
## + gear  1    3.0281 188.14 64.687
## + disp  1    2.6796 188.49 64.746
## + vs    1    0.7059 190.47 65.080
## + am    1    0.1249 191.05 65.177
## + drat  1    0.0010 191.17 65.198
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + am    1    6.6228 170.00 63.442
## + disp  1    6.1762 170.44 63.526
## + carb  1    2.5187 174.10 64.205
## + drat  1    2.2453 174.38 64.255
## + qsec  1    1.4010 175.22 64.410
## + gear  1    0.8558 175.76 64.509
## + vs    1    0.0599 176.56 64.654

# Display Best Model
cat("====================================\n")

## ====================================

cat(" BEST SELECTED MODEL\n")

##  BEST SELECTED MODEL

cat("====================================\n")

## ====================================

formula(forward_model)

## mpg ~ wt + cyl + hp

# Model Summary
cat("\n====================================\n")

## 
## ====================================

cat(" MODEL SUMMARY\n")

##  MODEL SUMMARY

cat("====================================\n")

## ====================================

summary(forward_model)

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

#Model Coefficients

cat("\n====================================\n")

## 
## ====================================

cat(" MODEL COEFFICIENTS\n")

##  MODEL COEFFICIENTS

cat("====================================\n")

## ====================================

round(coef(forward_model),2)

## (Intercept)          wt         cyl          hp 
##       38.75       -3.17       -0.94       -0.02

# OR
#coefficients(forward_model)

The final forward model is written as \[mpg =38.75 -3.17*wt -0.94*cyl - 0.02*hp\] with the adjusted r-squared \[ R^{2} =0.826\]

# Diagnostic Plots
par(mfrow = c(2,2))
plot(forward_model)

# Predicted vs Actual
predicted_forward <- predict(forward_model)

plot(
  mtcars$mpg,
  predicted_forward,
  main = "Actual vs Predicted",
  xlab = "Actual MPG",
  ylab = "Predicted MPG",
  pch = 19,
  col = "blue"
)

abline(0,1, col = "red", lwd = 2)

# Residual Histogram
# -----------------------------
hist(
  residuals(forward_model),
  main = "Histogram of Residuals",
  xlab = "Residuals",
  col = "lightblue",
  border = "black"
)

# Test for normality
cat("\n====================================\n")

## 
## ====================================

cat(" SHAPIRO-WILK NORMALITY TEST\n")

##  SHAPIRO-WILK NORMALITY TEST

cat("====================================\n")

## ====================================

shapiro.test(residuals(forward_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(forward_model)
## W = 0.93455, p-value = 0.05252

Interpretation: Since p_value =0.05252 > 0.05, then the residuals are normally distributed. There is no significant evidence against normality, meaning the residuals of the forward_model are approximately normally distributed. Therefore, the normality assumption of the regression model is satisfied.

# Test for Homoscedasticity
# -----------------------------
#install.packages("lmtest")   # run once
library(lmtest)

cat("\n====================================\n")

## 
## ====================================

cat(" BREUSCH-PAGAN TEST\n")

##  BREUSCH-PAGAN TEST

cat("====================================\n")

## ====================================

bptest(forward_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  forward_model
## BP = 2.9351, df = 3, p-value = 0.4017

Interpretation Since the p-value is greater than 0.05, then there is no evidence of heteroscedasticity in the residuals. The assumption of homescedasticity is satisfied.

Checking for Multicollinearity issue

Interpretation of VIF: Generally, -If $VIF=1$: No multicollinearit -If $1<VIF<5$: there is moderate correlation, usually acceptable -If $VIF\geq 5$: High multicollinearity concern -If $IVIF\geq 10$: Serious multicollinearity problem

# Multicollinearity Check
#install.packages("car")   # run once
library(car)

## Loading required package: carData

cat("\n====================================\n")

## 
## ====================================

cat(" VARIANCE INFLATION FACTOR (VIF)\n")

##  VARIANCE INFLATION FACTOR (VIF)

cat("====================================\n")

## ====================================

vif(forward_model)

##       wt      cyl       hp 
## 2.580486 4.757456 3.258481

Interpretation The Variance Inflation Factor (VIF) values for wt, cyl, and hp are 2.58, 4.76, and 3.26 respectively. Since all VIF values are below 5, there is no evidence of severe multicollinearity among the explanatory variables in the model. Therefore, the multicollinearity assumption is reasonably satisfied.

2.2. Backward Elimination:

Starts with all variables and removes them one by one. We model mpg as response variable from mtcars dataset

# Step 1: Fit Full Regression Model
full_model_b <- lm(mpg ~ ., data = mtcars)

# View full model summary
summary(full_model_b)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

# Apply Backward Elimination
backward_model <- step(
  full_model_b,
  direction = "backward"
)

## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790

# Display Final Selected Model
cat("====================================\n")

## ====================================

cat(" FINAL SELECTED MODEL\n")

##  FINAL SELECTED MODEL

cat("====================================\n")

## ====================================

formula(backward_model)

## mpg ~ wt + qsec + am

#Summary of Final Model
cat("\n====================================\n")

## 
## ====================================

cat(" MODEL SUMMARY\n")

##  MODEL SUMMARY

cat("====================================\n")

## ====================================

summary(backward_model)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

# Display Model Coefficients
cat("\n====================================\n")

## 
## ====================================

cat(" MODEL COEFFICIENTS\n")

##  MODEL COEFFICIENTS

cat("====================================\n")

## ====================================

round(coef(backward_model),3)

## (Intercept)          wt        qsec          am 
##       9.618      -3.917       1.226       2.936

The final model by using backward elimination method can be written mathematically as \[ mpg = 9.618 -3.917*wt +1.226*qsec+2.936*am\]

#Diagnostic Plots
par(mfrow = c(2,2))
plot(backward_model)

#  Actual vs Predicted Plot

predicted_backward <- predict(backward_model)

plot(
  mtcars$mpg,
  predicted_backward,
  main = "Actual vs Predicted Values",
  xlab = "Actual MPG",
  ylab = "Predicted MPG",
  pch = 19,
  col = "blue"
)

abline(0,1, col = "red", lwd = 2)

### Checking the regression Assumptions: Normality of the errors

# Histogram of Residuals
hist(
  residuals(backward_model),
  main = "Histogram of Residuals",
  xlab = "Residuals",
  col = "lightblue",
  border = "black"
)

Shapiro-wilk normality test has hypotheses to be tested $H_0$: Residuals are normally distributed. $H_1$: Residuals are not normally distributed. Decision will be mabe based on rule that states that: If p_value <0.05, reject $H_0$ (non-normal residuals). If p_value > 0.05, fail to reject $H_0$ ( residuals are normally distributed.)

# Normality Test

cat("\n====================================\n")

## 
## ====================================

cat(" SHAPIRO-WILK TEST\n")

##  SHAPIRO-WILK TEST

cat("====================================\n")

## ====================================

shapiro.test(residuals(backward_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(backward_model)
## W = 0.9411, p-value = 0.08043

Interpretation: The Shapiro–Wilk test for normality of residuals gives W = 0.9411 with a p-value of 0.08043. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This indicates that the residuals are approximately normally distributed, and the normality assumption of the regression model is satisfied.

#Test for Homoscedasticity
library(lmtest)

cat("\n====================================\n")

## 
## ====================================

cat(" BREUSCH-PAGAN TEST\n")

##  BREUSCH-PAGAN TEST

cat("====================================\n")

## ====================================

bptest(backward_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  backward_model
## BP = 6.1871, df = 3, p-value = 0.1029

Interpretation Since the p-value is greater than 0.05, this indicates that there is no evidence of heteroscedasticity, and the assumption of Homoscedasticity (constant variance of residuals) is satisfied.

# Multicollinearity Test
library(car)

cat("\n====================================\n")

## 
## ====================================

cat(" VARIANCE INFLATION FACTOR (VIF)\n")

##  VARIANCE INFLATION FACTOR (VIF)

cat("====================================\n")

## ====================================

round(vif(backward_model),3)

##    wt  qsec    am 
## 2.483 1.364 2.541

The VIF values for wt, qsec, and am are 2.88, 1.36, and 2.54 respectively. Since all VIF values are below 5, there is no evidence of severe multicollinearity among the explanatory variables in the model. Therefore, the multicollinearity assumption is satisfied.

2.3. Stepwise Regression:

Combines both, adding and removing variables to optimize the model.

# STEPWISE REGRESSION (FORWARD + BACKWARD)
# Fit Full and Null Models
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

# Stepwise Regression (Both Directions)
stepwise_model <- step(
  null_model,
  scope = list(lower = null_model, upper = full_model),
  direction = "both"
)

## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq     RSS     AIC
## + cyl   1     87.15  191.17  63.198
## + hp    1     83.27  195.05  63.840
## + qsec  1     82.86  195.46  63.908
## + vs    1     54.23  224.09  68.283
## + carb  1     44.60  233.72  69.628
## + disp  1     31.64  246.68  71.356
## <none>               278.32  73.217
## + drat  1      9.08  269.24  74.156
## + gear  1      1.14  277.19  75.086
## + am    1      0.00  278.32  75.217
## - wt    1    847.73 1126.05 115.943
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1    14.551 176.62 62.665
## + carb  1    13.772 177.40 62.805
## <none>              191.17 63.198
## + qsec  1    10.567 180.60 63.378
## + gear  1     3.028 188.14 64.687
## + disp  1     2.680 188.49 64.746
## + vs    1     0.706 190.47 65.080
## + am    1     0.125 191.05 65.177
## + drat  1     0.001 191.17 65.198
## - cyl   1    87.150 278.32 73.217
## - wt    1   117.162 308.33 76.494
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## - hp    1    14.551 191.17 63.198
## + am    1     6.623 170.00 63.442
## + disp  1     6.176 170.44 63.526
## - cyl   1    18.427 195.05 63.840
## + carb  1     2.519 174.10 64.205
## + drat  1     2.245 174.38 64.255
## + qsec  1     1.401 175.22 64.410
## + gear  1     0.856 175.76 64.509
## + vs    1     0.060 176.56 64.654
## - wt    1   115.354 291.98 76.750

# Final Selected Model
cat("====================================\n")

## ====================================

cat(" FINAL STEPWISE MODEL\n")

##  FINAL STEPWISE MODEL

cat("====================================\n")

## ====================================

formula(stepwise_model)

## mpg ~ wt + cyl + hp

# Model Summary

cat("\n====================================\n")

## 
## ====================================

cat(" MODEL SUMMARY\n")

##  MODEL SUMMARY

cat("====================================\n")

## ====================================

summary(stepwise_model)

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

#Coefficients
cat("\n====================================\n")

## 
## ====================================

cat(" MODEL COEFFICIENTS\n")

##  MODEL COEFFICIENTS

cat("====================================\n")

## ====================================

round(coef(stepwise_model),3)

## (Intercept)          wt         cyl          hp 
##      38.752      -3.167      -0.942      -0.018

cat("\n====================================\n")

## 
## ====================================

cat(" ADJUSTED R-SQUARED\n")

##  ADJUSTED R-SQUARED

cat("====================================\n")

## ====================================

round(summary(stepwise_model)$adj.r.squared,3)

## [1] 0.826

Therefore, the final stepwise model is written as

\[mpg = 38.752 -3.167*wt -0.942*cyl -0.018*hp\] with \[R^{2}=0.826\]

#Diagnostic Plots

par(mfrow = c(2,2))
plot(stepwise_model)

# Actual vs Predicted Plot
predicted_stepwise <- predict(stepwise_model)

plot(
  mtcars$mpg,
  predicted_stepwise,
  main = "Actual vs Predicted (Stepwise Model)",
  xlab = "Actual MPG",
  ylab = "Predicted MPG",
  pch = 19,
  col = "blue"
)

abline(0, 1, col = "red", lwd = 2)

#  Residual Histogram
hist(
  residuals(stepwise_model),
  main = "Residual Histogram",
  xlab = "Residuals",
  col = "lightblue",
  border = "black"
)

#  Normality Test
cat("\n====================================\n")

## 
## ====================================

cat(" SHAPIRO-WILK TEST\n")

##  SHAPIRO-WILK TEST

cat("====================================\n")

## ====================================

shapiro.test(residuals(stepwise_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(stepwise_model)
## W = 0.93455, p-value = 0.05252

The normality assumption is satisfied since the p-value is greater than 0.05.

# Homoscedasticity Test
# -----------------------------------------
library(lmtest)

cat("\n====================================\n")

## 
## ====================================

cat(" BREUSCH-PAGAN TEST\n")

##  BREUSCH-PAGAN TEST

cat("====================================\n")

## ====================================

bptest(stepwise_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  stepwise_model
## BP = 2.9351, df = 3, p-value = 0.4017

The homoscedasticity assumption is satisfied since p-value of the test stastic is greater than 0.05.

# Multicollinearity (VIF)
library(car)

cat("\n====================================\n")

## 
## ====================================

cat(" VIF VALUES\n")

##  VIF VALUES

cat("====================================\n")

## ====================================

vif(stepwise_model)

##       wt      cyl       hp 
## 2.580486 4.757456 3.258481

Since all VIF values are in between 1 and 5, then there is no issues of multicollinearity in the predictors variables.

Conclusion The forward Selection method and stepwise regession produce the model with the same predictors model and same value of R-squared. The final forward model is written as \[mpg =38.752 -3.167*wt -0.942*cyl - 0.018*hp\] with \[ R^{2}= 0.826\]

The final backward model is given as \[ mpg = 9.618 -3.917*wt +1.226*qsec+2.936*am\]

with \[ R^{2}= 0.834\]

The stepwise model that combines forward and backward is given as

\[ mpg = 38.752 -3.167*wt -0.942*cyl -0.018*hp\] with \[R^{2}=0.826\]

3. Variable selection methods based on p-value

The above forward, backward, and stepwise methods based on p-value. For forward method adds variable if p_value is less than 0.05 and remove variable if p_value is less than 0.10 for backward elimination method. Entry variable rquires strong evidence (5%) and remove allows weaker evidence (10% to 15 %).

Let me implement stepwise method in R on mtcars dataset

# Setup  of an algorithm
predictors <- names(mtcars)[names(mtcars) != "mpg"] 
# creating an empty vector where selected predictors will be stored.
selected_vars <- c()  
# A variable is allowed to enter the model only if its p-value is less than 0.05
threshold_enter <- 0.05   # variable enters model
#A variable is removed from the model if its p-value becomes greater than 0.10
threshold_remove <- 0.10  # variable removed from model

model_stepwise_pvalue <- lm(mpg ~ 1, data = mtcars)


# Stepwise Loop (p-value based)

repeat {
  changed <- FALSE
  # FORWARD STEP (ADD)
  remaining_vars <- setdiff(predictors, selected_vars)
  best_p <- 1 # intercept only
  best_var <- NULL
  best_model <- NULL

  for (var in remaining_vars) {
    formula_try <- as.formula(
      paste("mpg ~", paste(c(selected_vars, var), collapse = " + "))
    )
    model_try <- lm(formula_try, data = mtcars)
    pvals <- summary(model_try)$coefficients
    var_p <- pvals[var, "Pr(>|t|)"]
    
    if (!is.na(var_p) && var_p < best_p) {
      best_p <- var_p
      best_var <- var
      best_model <- model_try
    }
  }

  if (!is.null(best_var) && best_p < threshold_enter) {
    selected_vars <- c(selected_vars, best_var)
    current_model <- best_model
    changed <- TRUE
    cat("Added:", best_var, " | p-value:", best_p, "\n")
  }
  # BACKWARD STEP (REMOVE)
  if (length(selected_vars) > 0) {
    model_formula <- as.formula(
      paste("mpg ~", paste(selected_vars, collapse = " + "))
    )
    model_stepwise_pvalue <- lm(model_formula, data = mtcars)
    pvals <- summary(current_model)$coefficients[-1, "Pr(>|t|)"]
    worst_p <- max(pvals, na.rm = TRUE)
    worst_var <- names(which.max(pvals))

    if (worst_p > threshold_remove) {

      selected_vars <- setdiff(selected_vars, worst_var)
      changed <- TRUE

      cat("Removed:", worst_var, " | p-value:", worst_p, "\n")
    }
  }
  # Stop if no changes
  if (!changed) break
}

## Added: wt  | p-value: 1.293959e-10 
## Added: cyl  | p-value: 0.001064282

# Final Model

cat("\n====================================\n")

## 
## ====================================

cat(" FINAL STEPWISE MODEL (p-value)\n")

##  FINAL STEPWISE MODEL (p-value)

cat("====================================\n")

## ====================================

formula(model_stepwise_pvalue)

## mpg ~ wt + cyl

# Model Summary

summary(model_stepwise_pvalue)

## 
## Call:
## lm(formula = model_formula, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

# Coefficients

round(coef(model_stepwise_pvalue),3)

## (Intercept)          wt         cyl 
##      39.686      -3.191      -1.508

The Stepwise model based on p-value can be written as \[mpg = 39.686 -3.191*wt -1.508*cyl\] With the adjusted r-squared \[ R^2=0.830.\]

# Diagnostic Plots
par(mfrow = c(2,2))
plot(model_stepwise_pvalue)

# Actual vs Predicted

pred_stepwise_pvalue <- predict(model_stepwise_pvalue)

plot(
  mtcars$mpg,
  pred_stepwise_pvalue,
  main = "Actual vs Predicted (p-value Stepwise)",
  xlab = "Actual MPG",
  ylab = "Predicted MPG",
  pch = 19,
  col = "blue"
)

abline(0,1, col = "red", lwd = 2)

#  Residual Diagnostics

hist(residuals(model_stepwise_pvalue),
     main = "Residual Histogram",
     xlab = "Residuals",
     col = "lightblue")

shapiro.test(residuals(model_stepwise_pvalue))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model_stepwise_pvalue)
## W = 0.93745, p-value = 0.06341

This results above confirm that the assumptions of regression modeling are satisfied since the residuals are normally distributed, the residuals have constant variance (homoscedasticity), there is no multicollinearity in the predictors variables, and the residuals are independent.

4. Variable selection by regularization/ Penalty Methods

Regulatization methods build variable selection directly into the model’s maths by penalizing large coefficients, forcing less important variables towards zero.

4.1 LASSO Regression

Standard linear regression tries to minimize the sum of squared errors. Lassp adds a penality to this equation based on the absolute size of the regression coefficient. \[ minimize: \sum(y-\hat(y))^2+\lambda \sum|\beta|.\]

Where $# is the tuning parameter that controls the penalty strength. If $\lambda =0$, result to standard linear regression model.

Implementing LASSO Regression on mtcars data

#install.packages("glmnet")
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 5.0

# prepare data (glmnet requires a matrix for predictors)
x <- as.matrix(mtcars[, -1]) # all variables expect mpg
y <- mtcars$mpg              # response/target variable

# Cross-validation to find the optimal lambda penalty
set.seed(42) # set seeds for reproducible cross-validation splits
cv_lasso <- cv.glmnet(x,y, alpha =1) # alpha=1 specifies LASSO
best_lambda <- cv_lasso$lambda.min
print(best_lambda)

## [1] 0.8007036

# Fit the final model using the best lambda value found
lasso_model <- glmnet(x,y,alpha=1, lambda=best_lambda)
# View the final model coefficient 
round(coef(lasso_model),3)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                 s0
## (Intercept) 35.999
## cyl         -0.885
## disp         .    
## hp          -0.012
## drat         .    
## wt          -2.709
## qsec         .    
## vs           .    
## am           .    
## gear         .    
## carb         .

Display Non-Zero Coefficients (Selected Variables)

coef_matrix <- as.matrix(coef(lasso_model))
# Show only non-zero coefficients
round(coef_matrix[coef_matrix[,1] != 0, , drop = FALSE],3)

##                 s0
## (Intercept) 35.999
## cyl         -0.885
## hp          -0.012
## wt          -2.709

The lasso model can be written as \[ mpd = 35.999 -0.885*cyl - 0.012*hp -2.709*wt\]

# Predictions

predicted_lasso <- predict(lasso_model, s = best_lambda,newx = x)
residuals <- y - predicted_lasso

#R-squared
rss <- sum((y - predicted_lasso)^2)
tss <- sum((y - mean(y))^2)
r_squared <- 1 - rss/tss
cat("R-squared:", round(r_squared,3), "\n\n")

## R-squared: 0.821

Plot Actual vs Predicted Values

#Predicted Values
predicted <- predict(lasso_model, s = best_lambda, newx = x)
plot(y, predicted,
     xlab = "Actual MPG",
     ylab = "Predicted MPG",
     main = "Actual vs Predicted Values")

# Add reference line
abline(a = 0, b = 1, col = "red", lwd = 2)

### Diagnostic plots (1) Predicted vs residuals to check linearity and homescedasticity conditions in the residuals. (2) Q-Q plot for checking the normality of the residuals. If residuals are normally distributed, the points should lie approximately on a straight line. (3) Histogram of Residuals: The Histogram of Residuals is a simple but important graphical tool used to check the normality assumption of regression residuals. (4) Actual vs Predicted: The Actual vs Predicted plot is used to evaluate how well a regression model performs in predicting the response variable. The main roles of actual vs predicted plot include measure overall model accuracy, detecting bias in predictions, detecting non-linearity problems, identifying heteroscedasticity,etc. (5) Scale location plot to check the model assumption of homoscedasticity (equality of variance for the residuals). (6) Residual Order Plot for autocorrelation checking. It is used to check whether residuals are independent and randomly distributed over the sequence of observations.

# Diagnostic Plots
par(mfrow = c(2,3))

# 1 Residuals vs Fitted
plot(predicted_lasso, residuals,
     main = "Residuals vs Fitted",
     xlab = "Fitted Values",
     ylab = "Residuals")

abline(h = 0, col = "red")

# 2 Q-Q Plot
qqnorm(residuals,
       main = "Normal Q-Q Plot")

qqline(residuals, col = "red")

# 3 Histogram of Residuals
hist(residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     col = "lightblue")

# 4 Actual vs Predicted
plot(y, predicted_lasso,
     main = "Actual vs Predicted",
     xlab = "Actual MPG",
     ylab = "Predicted MPG")

abline(a = 0, b = 1, col = "red")

# 5 Scale-Location Plot
plot(predicted_lasso,
     sqrt(abs(residuals)),
     main = "Scale-Location Plot",
     xlab = "Fitted Values",
     ylab = "Sqrt(|Residuals|)")

# 6 Residual Order Plot
plot(residuals,
     type = "b",
     main = "Residual Order Plot",
     xlab = "Observation Order",
     ylab = "Residuals")

abline(h = 0, col = "red")

4.2 Ridge Regression (L2 Regularization)

Mathematically, Ridge regression modifies standard ordinary least squares (OLS) linear regression by adding an L2 regularization penality to the cost function.

To find the regression coefficients ($\beta$), Ridge regression minimizes

\[ minimizes\, Q(\beta) = \sum_{i=1}^{n}\left(y_{i}-\hat(y_{i})\right)^2 +\lambda \sum_{j=1}^{n}\beta_{j}^2 \] In matrix form can be written as, \[ Minimize\, Q(\beta)=(Y-X\beta)^{T}(Y-XB) + \lambda \beta^{T}\beta. \] This method shinks coefficients close to zero but keeps all variables. If $\lambda=0$, the penalty term cancels out, collapsing back to a standard OLS linear regression. If $\lambda \to \infty$, the penalty dominates, forcing all predictor coefficients ($\beta_1, \beta_2, \ldots, \beta_p$) to shink asymptotically toward zero (but never reaching absolute zero). It is excellent for handling correlated data (multicollinearity).

Implementing Ridge regression in R

Like Lass, Ridge regression uses the glmnet package and requires data inputs to be formateted as a matrix. The key difference is setting $\alpha=0$.

library(glmnet)
# convert predictors as matrix
x <- as.matrix(mtcars[,-1])
# response variable
y <- mtcars$mpg


# Specify ridge regression by setting alpha=0 (alpha =1 would be LASSO)
set.seed(42) # set seeds for reproducible cross-validation splits
cv_ridge <- cv.glmnet(x,y, alpha=0)
best_lambda <- cv_ridge$lambda.min
cat("OPtimal lambda value:", best_lambda, "\n\n")

## OPtimal lambda value: 3.308517

# Fif the model using the best lambda value found
ridge_model <- glmnet(x,y, alpha=0,lambda=best_lambda)
# View the model coefficient
##Ridge shinks coefficients but leaves none at exactly zero
coef(ridge_model)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 20.979863290
## cyl         -0.376481709
## disp        -0.005379931
## hp          -0.011404572
## drat         1.056798768
## wt          -1.175218799
## qsec         0.158603589
## vs           0.800089614
## am           1.557664947
## gear         0.538768857
## carb        -0.521201209

The final ridge model is written as \[mpg = 20.97-0.376*cyl-0.005*disp -0.01*hp+1.056*drat-1.175*wt+0.158*qsec +0.8*vs+1.557*am+0.538*gear -0.521*carb\]

# Predictions
predicted_ridge <- predict(ridge_model, s = best_lambda,newx = x)
residuals <- y - predicted_ridge

#R-squared
rss <- sum((y - predicted_ridge)^2)
tss <- sum((y - mean(y))^2)
r_squared <- 1 - rss/tss
cat("R-squared:", round(r_squared,3), "\n\n")

## R-squared: 0.844

#Diagnostic Plots
par(mfrow = c(2,3))

# 1. Actual vs Predicted
plot(y, predicted_ridge,
     main = "Actual vs Predicted",
     xlab = "Actual MPG",
     ylab = "Predicted MPG")
abline(a = 0, b = 1, col = "red")

# 2. Residuals vs Fitted
plot(predicted_ridge, residuals,
     main = "Residuals vs Fitted",
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

# 3. Q-Q Plot
qqnorm(residuals, main = "Q-Q Plot")
qqline(residuals, col = "red")

# 4. Histogram of Residuals
hist(residuals,
     main = "Histogram of Residuals",
     col = "lightblue")

# 5. Scale-Location Plot
plot(predicted_ridge,
     sqrt(abs(residuals)),
     main = "Scale-Location Plot",
     xlab = "Fitted Values",
     ylab = "sqrt(|Residuals|)")

# 6. Residual Order Plot
plot(residuals,
     type = "b",
     main = "Residual Order Plot",
     ylab = "Residuals",
     xlab = "Observation Order")
abline(h = 0, col = "red")

4.3. Elastic Net regression

Elastic net regression combines both LASSO(L1 regularization) and Ridge(L2 regularizatio). It selects groups of correlated variables together rather than picking just one.

Step-by-step Elastic Net Regression implementation in R

library(glmnet)
x <- as.matrix(mtcars[,-1])
y <- mtcars$mpg
# Looking through different alpha options to find the one with lowest error
alphas <- seq(0.1,0.9, by =0.1)
search_results <- data.frame(alpha=numeric(), best_lambda=numeric(), 
                             min_MSE=numeric())
set.seed(42) # set seed for reproducible cross-validation splits
for (a in alphas){
  cv_fit <- cv.glmnet(x,y,alpha=a)
  
  # Store results
  search_results<- rbind(search_results,data.frame(
                         Alpha=a,
                         Best_Lambda = cv_fit$lambda.min,
                         Min_MSE = min(cv_fit$cvm)
  ))
}
# Identify the absolute best performing Alpha and Lambda combination
best_row <- search_results[which.min(search_results$Min_MSE),]
best_alpha <- best_row$Alpha
best_lambda <- best_row$Best_Lambda

cat("====Tuning Results====\n")

## ====Tuning Results====

cat("Optimal Alpha (Min Profile):", best_alpha, "\n")

## Optimal Alpha (Min Profile): 0.7

cat("Optimal Lambda(Penalty Strength):", best_lambda,"\n")

## Optimal Lambda(Penalty Strength): 0.7884199

# Fit the Elastic Net Model using optimized parameters
elastic_model <- glmnet(x,y,alpha=best_alpha, lambda=best_lambda)

#Display model coefficients
coef(elastic_model)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) 34.13114793
## cyl         -0.81144465
## disp         .         
## hp          -0.01453356
## drat         0.28102708
## wt          -2.40724833
## qsec         .         
## vs           .         
## am           0.63640139
## gear         .         
## carb        -0.14661973

# Extracting active predictors as matrix 
coef_matrix <-as.matrix(coef(elastic_model))
round(as.data.frame(coef_matrix),3)

##                 s0
## (Intercept) 34.131
## cyl         -0.811
## disp         0.000
## hp          -0.015
## drat         0.281
## wt          -2.407
## qsec         0.000
## vs           0.000
## am           0.636
## gear         0.000
## carb        -0.147

The final elastic net regression model is can be written as \[mpg = 34.131 -0.811*cyl - 0.015*hp + 0.281*drat -2.407*wt +0.636*am\]

# Predictions

predicted_elastic <- predict(elastic_model, newx = x)
residuals <- y - predicted_elastic

#R-squared
rss <- sum((y - predicted_elastic)^2)
tss <- sum((y - mean(y))^2)
r_squared <- 1 - rss/tss
cat("R-squared:", round(r_squared,3), "\n\n")

## R-squared: 0.837

Diagnostic Plots for the Elastic Net Regression model formulated

#Diagnostic Plots
par(mfrow = c(2,3))

# 1. Actual vs Predicted
plot(y, predicted_elastic,
     main = "Actual vs Predicted",
     xlab = "Actual MPG",
     ylab = "Predicted MPG")
abline(a = 0, b = 1, col = "red")

# 2. Residuals vs Fitted
plot(predicted_elastic, residuals,
     main = "Residuals vs Fitted",
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red")

# 3. Q-Q Plot
qqnorm(residuals, main = "Q-Q Plot")
qqline(residuals, col = "red")

# 4. Histogram of Residuals
hist(residuals,
     main = "Histogram of Residuals",
     col = "lightblue")

# 5. Scale-Location Plot
plot(predicted_elastic,
     sqrt(abs(residuals)),
     main = "Scale-Location Plot",
     xlab = "Fitted Values",
     ylab = "sqrt(|Residuals|)")

# 6. Residual Order Plot
plot(residuals,
     type = "b",
     main = "Residual Order Plot",
     ylab = "Residuals",
     xlab = "Observation Order")
abline(h = 0, col = "red")

Therefore, the variable selection methods regularization methods (Lasso and Ridge) regressions perform better compare to the traditional filter and selection methods. This conclusion is drown based on the observation made from the above diagnostic plots produced for each variable selection methods discussed and implemented in R.

5. Variable Selection by Demension Reduction Method

Dimension reduction requires more than just filtering to select variables. Instead, you compress your high-dimensional data matrix into a smaller collection of uncorrelated components, or you calculate how much each original variable contributes to these components before filtering them out.

PCA is generally used first for exploratory dimension reduction because it summarizes predictor variability without considering the response variable. However, when prediction and variable relevance to the response are important, Partial Least Squares (PLS) is preferred because it constructs components using both predictors and the response variable.

5.1. Principal Component Analysis (PCA):

-PCA is mainly used to reduce dimensionality, remove multicollinearity, and explore the structure of predictor variables in predictive regression problems. -Create new features (Principal Components) that capture the maximum variance in the data.

predictors <-mtcars[,-1] # Remove response variable "mpg"
# run PCA
pca_output <-prcomp(predictors, scale. = TRUE)
# View the loading for the first two Principal Components
loadings <- pca_output$rotation[,1:2]
# Select variables that contribute heavily (absolute loading>0.35) to PC1 and .4 to pc2
important_vars_pc1 <- loadings[abs(loadings[, 1])>0.35, 1, drop=FALSE]
important_vars_pc2 <- loadings[abs(loadings[, 2])> 0.45, 2, drop=FALSE] # less important
print("Variables Selecting major variance in PC1:")

## [1] "Variables Selecting major variance in PC1:"

print(important_vars_pc1)

##            PC1
## cyl  0.4029711
## disp 0.3959243
## hp   0.3543255
## wt   0.3668004

print("Variables Selecting major variance in PC2:")

## [1] "Variables Selecting major variance in PC2:"

print(important_vars_pc2)

##             PC2
## qsec -0.4606627
## gear  0.4651622

# PCA + REGRESSION MODELING, Dataset: mtcars, Response Variable: mpg
#Define Response and Predictors
response <- mtcars$mpg
predictors <- mtcars[, -1]   # remove mpg

# Step 2: Run PCA
pca_output <- prcomp(predictors,scale. = TRUE)

#Loadings
loadings <- pca_output$rotation

#Important Variables from PC1
# -----------------------------------------
important_vars_pc1 <- names(
  which(abs(loadings[,1]) > 0.35))
cat("\n====================================\n")

## 
## ====================================

cat(" IMPORTANT VARIABLES IN PC1\n")

##  IMPORTANT VARIABLES IN PC1

cat("====================================\n")

## ====================================

print(important_vars_pc1)

## [1] "cyl"  "disp" "hp"   "wt"

# Important Variables from PC2
important_vars_pc2 <- names(
  which(abs(loadings[,2]) > 0.45)
)

cat("\n====================================\n")

## 
## ====================================

cat(" IMPORTANT VARIABLES IN PC2\n")

##  IMPORTANT VARIABLES IN PC2

cat("====================================\n")

## ====================================

print(important_vars_pc2)

## [1] "qsec" "gear"

# Combine Important Variables
selected_vars <- unique(
  c(important_vars_pc1, important_vars_pc2)
)

cat("\n====================================\n")

## 
## ====================================

cat(" FINAL SELECTED VARIABLES\n")

##  FINAL SELECTED VARIABLES

cat("====================================\n")

## ====================================

print(selected_vars)

## [1] "cyl"  "disp" "hp"   "wt"   "qsec" "gear"

# Build Regression Model
formula_pca <- as.formula(paste("mpg ~", paste(selected_vars, collapse = " + ")))

pca_model <- lm(formula_pca, data = mtcars)

# Final Model Summary
# -----------------------------------------
cat("\n====================================\n")

## 
## ====================================

cat(" FINAL PCA REGRESSION MODEL\n")

##  FINAL PCA REGRESSION MODEL

cat("====================================\n")

## ====================================

summary(pca_model)

## 
## Call:
## lm(formula = formula_pca, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0730 -1.9388 -0.3331  1.2097  5.6469 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 23.91504   15.11927   1.582  0.12627   
## cyl         -0.68063    0.84624  -0.804  0.42881   
## disp         0.01639    0.01262   1.299  0.20592   
## hp          -0.02397    0.01711  -1.401  0.17343   
## wt          -4.38939    1.25990  -3.484  0.00184 **
## qsec         0.54093    0.55859   0.968  0.34213   
## gear         1.24438    1.18909   1.046  0.30535   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.542 on 25 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.822 
## F-statistic: 24.87 on 6 and 25 DF,  p-value: 2.129e-09

cat("\n====================================\n")

## 
## ====================================

cat(" Formula ofr PCA Regression model\n")

##  Formula ofr PCA Regression model

cat("====================================\n")

## ====================================

formula(pca_model)

## mpg ~ cyl + disp + hp + wt + qsec + gear

# Model Coefficients

cat("\n====================================\n")

## 
## ====================================

cat(" MODEL COEFFICIENTS\n")

##  MODEL COEFFICIENTS

cat("====================================\n")

## ====================================

round(coef(pca_model),3)

## (Intercept)         cyl        disp          hp          wt        qsec 
##      23.915      -0.681       0.016      -0.024      -4.389       0.541 
##        gear 
##       1.244

The final PCA regregression model can be expressed as \[mpg = 23.915 -0.681*cyl + 0.016*disp -0.024*hp-4.389*wt+0.541*qsec+1.244*gear\] With \[R^{2}=0.822\]

# Diagnostic Plots
par(mfrow = c(2,2))
plot(pca_model)

# Actual vs Predicted Plot
predicted_pca <- predict(pca_model)

plot(
  response,
  predicted_pca,
  pch = 19,
  col = "blue",
  main = "Actual vs Predicted",
  xlab = "Actual MPG",
  ylab = "Predicted MPG"
)

abline(0,1, col = "red", lwd = 2)

# Residual Histogram
hist(
  residuals(pca_model),
  col = "lightblue",
  main = "Histogram of Residuals",
  xlab = "Residuals"
)

# Normality Test
cat("\n====================================\n")

## 
## ====================================

cat(" SHAPIRO-WILK TEST\n")

##  SHAPIRO-WILK TEST

cat("====================================\n")

## ====================================

shapiro.test(residuals(pca_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(pca_model)
## W = 0.90718, p-value = 0.009481

#Homoscedasticity Test
library(lmtest)

cat("\n====================================\n")

## 
## ====================================

cat(" BREUSCH-PAGAN TEST\n")

##  BREUSCH-PAGAN TEST

cat("====================================\n")

## ====================================

bptest(pca_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  pca_model
## BP = 3.1053, df = 6, p-value = 0.7955

# Multicollinearity Check
library(car)

cat("\n====================================\n")

## 
## ====================================

cat(" VARIANCE INFLATION FACTOR (VIF)\n")

##  VARIANCE INFLATION FACTOR (VIF)

cat("====================================\n")

## ====================================

vif(pca_model)

##       cyl      disp        hp        wt      qsec      gear 
## 10.953603 11.731869  6.596148  7.287848  4.778055  3.691100

Interpretation: The Shapiro–Wilk test indicates that residuals are normally distributed (p = 0.2261). The Breusch–Pagan test shows no evidence of heteroscedasticity (p = 0.1352), confirming homoscedasticity assumption. However, the Variance Inflation Factor (VIF) results reveal severe multicollinearity among several predictors, particularly disp, cyl, and wt, indicating that the regression coefficients may be unstable and difficult to interpret.

Therefore, Principal Components Analysis (PCA) with selected variables still have multicollinearity,and weak dimension reduction. This is no the best approach to use in this particalar dataset.

5.2. Partial Least Squares (PLS):

Similar to PCA, but creates components that specifically maximize the covariance between predictor variables and response variable.

The first two latent components, which are linear combinations of the original predictors and represent the most significant variance associated with the response variable, will be used to build the pls regression model in R based on $ncomp = 2$.

#install.packages("pls")
library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

data(mtcars)
# Fit PLS model with cross-validation
set.seed(123)
pls_model <- plsr(mpg ~ ., data = mtcars, scale = TRUE,validation = "CV")
# Model Summary
#summary(pls_model)
#predictions
predicted_pls <- predict(pls_model, ncomp = 2, newdata = mtcars) # two components


# Diagnostic Plots
par(mfrow =c(2,2))
#plot of actual vs predicted
plot(mtcars$mpg, predicted_pls,
     xlab = "Actual MPG",
     ylab = "Predicted MPG",
     main = "Actual vs Predicted (PLS)")

abline(a = 0, b = 1, col = "red")

# Residual
residuals_pls <- mtcars$mpg - predicted_pls
# Plot of predicted vs residuals
plot(predicted_pls, residuals_pls,
     main = "Residuals vs Fitted (PLS)",
     xlab = "Fitted values",
     ylab = "Residuals")

abline(h = 0, col = "red")

# Q Q plot
qqnorm(residuals_pls,
       main = "Q-Q Plot of Residuals")
qqline(residuals, col = "red", lwd = 2)

# Histogram
hist(residuals_pls,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     col = "lightblue")

# PLS model coefficients
cat("\n====================================\n")

## 
## ====================================

cat(" The coeffiecients of the pls regression model \n")

##  The coeffiecients of the pls regression model

cat("====================================\n")

## ====================================

model_coef <- coef(pls_model, ncomp = 2)

round(model_coef,3)

## , , 2 comps
## 
##         mpg
## cyl  -0.680
## disp -0.739
## hp   -0.877
## drat  0.609
## wt   -1.438
## qsec  0.038
## vs    0.228
## am    1.044
## gear  0.413
## carb -1.099

# R squared
rss <- sum((mtcars$mpg - predicted_pls)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared <- 1 - rss/tss
cat("The value of adjusted r-squared  is:" ,r_squared, "\n")

## The value of adjusted r-squared  is: 0.853944

Conclusion PCA and PLS are not true variable selection methods because they do not eliminate predictors. Instead, they transform the original variables into latent components. PCA ignores the response variable while PLS incorporates it to improve prediction, but both methods retain all original variables in transformed form.