Franklin Fuchs

University of Nevada, Reno (UNR)

1 Introduction

The aim of this project is to simulate multivariate data from three underlying models with varying degrees of correlation among predictors and to observe how well different variable selection methods perform on each dataset for linear regression. We first simulate 50 datasets for each of the first three underlying models from Section 7 of the paper “Regression Shrinkage and Selection via the Lasso” by Tibshirani (1996). We then fit linear regression models using several variable selection and shrinkage methods and evaluate their performance using the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Adjusted \(R^2\). A future goal is to vectorize all for loops and to reduce redundant variable storage where possible. The following methods are considered.

  • Ordinary Least Squares Regression
  • LASSO Regression
  • Ridge Regression
  • All possible Regression
  • Forward Selection Regression
  • Backward Elimination Regression
  • Stepwise Selection

1.1 Performance Metrics

We use the following performance metrics to evaluate each variable selection method; a short numerical check of all three formulas follows the list.

  • Akaike Information Criterion (AIC): Ways to compute the AIC can be found in “A primer on model selection using the Akaike Information Criterion” by Portet (2020). Let \(k\) be the number of parameters, \(\mathscr{L}\) the likelihood function, and \(\hat{\theta}_{MLE}\) the maximum likelihood estimate of \(\theta\); the AIC is then defined as follows.

\[ AIC = -2\text{log}(\mathscr{L}(\hat{\theta}_{MLE}|y)) + 2k \]

  • Bayesian Information Criterion (BIC): Ways to compute the BIC can be found in “Sensitivity and specificity of information criteria” by Dziak et al. (2020). Let \(k\) be the number of parameters, \(\mathscr{L}\) the likelihood function, \(\hat{\theta}_{MLE}\) the maximum likelihood estimate of \(\theta\), and \(n\) the sample size; the BIC is then defined as follows.

\[ BIC = -2\text{log}(\mathscr{L}(\hat{\theta}_{MLE}|y)) + k\;\text{log}(n) \]

  • Adjusted \(\boldsymbol{R^2}\) Value: The Adjusted \(R^2\) Value is a modified version of \(R^2\) that better accounts for the number of predictors. Let \(n\) be the sample size, \(k\) the number of predictors, RSS the residual sum of squares, and \(\bar{y}\) the mean of the response variable; the Adjusted \(R^2\) Value is then defined as follows.

\[ R_{adj}^{2}=1-\frac{RSS/(n-k-1)}{\sum_{i=1}^{n}(y_i-\bar{y})^2/(n-1)} \]
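As a quick check of these three formulas, they can be computed by hand for a small lm() fit and compared against R's built-in AIC(), BIC(), and summary()$adj.r.squared (note that for AIC and BIC, R counts the error variance as an additional parameter). The toy data and object names below are illustrative only:

# Illustrative check of the AIC, BIC, and Adjusted R-squared formulas (toy data)
set.seed(1)
x <- matrix(rnorm(20 * 3), nrow = 20)
y <- x %*% c(1, 0, 2) + rnorm(20)
fit <- lm(y ~ x)

n <- length(y)
k <- length(coef(fit)) + 1                   # regression coefficients plus the error variance
logLik_fit <- as.numeric(logLik(fit))

aic_manual <- -2 * logLik_fit + 2 * k
bic_manual <- -2 * logLik_fit + k * log(n)
rss <- sum(residuals(fit)^2)
adjr2_manual <- 1 - (rss / (n - length(coef(fit)))) / (sum((y - mean(y))^2) / (n - 1))

c(aic_manual, AIC(fit))                      # the two values should agree
c(bic_manual, BIC(fit))                      # the two values should agree
c(adjr2_manual, summary(fit)$adj.r.squared)  # the two values should agree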

2 Simulating Datasets

In this section we simulate data from the three aforementioned underlying models.
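The code in the following sections assumes that the relevant packages from Section 5.4 have already been attached; a minimal setup chunk covering only the functions actually called below might look like this.

# Packages used in the simulation, model fitting, and plotting code
library(MASS)        # mvrnorm() for multivariate normal predictors
library(glmnet)      # LASSO and Ridge regression
library(olsrr)       # all-possible, forward, backward, and stepwise selection
library(knitr)       # kable() for the results tables
library(kableExtra)  # kable_styling(), add_header_above()
library(tidyverse)   # ggplot2, tidyr::gather(), dplyr pipes
library(hrbrthemes)  # theme_ipsum()
library(viridis)     # scale_fill_viridis()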

2.1 Underlying Model 1

Data for the first underlying model consists of 50 simulated datasets with 20 observations each. Following Tibshirani (1996), the model has \(\boldsymbol{\beta}=(3,1.5,0,0,2,0,0,0)^T\), \(\sigma=3\), and correlation between \(x_i\) and \(x_j\) equal to \(\rho^{|i-j|}\) with \(\rho=0.5\). Thus, we have the model

\[ y = \boldsymbol{\beta}^{T}\boldsymbol{x} + \sigma\boldsymbol{\epsilon} \]

R-Implementation:

# Variable Declarations
n_obs1 <- 20
n_simul1 <- 50
p1 <- 8
corr1 <- 0.5
sigma1 <- 3
beta1 <- c(3, 1.5, 0, 0, 2, 0, 0, 0)

# Simulating The Matrix of Predictors
CovarMat1 <- outer(1:p1, 1:p1, function(i, j){corr1^abs(i - j)})
mean_vect1 <- rep(0, p1)
X_n_simul1 <- list()
for(i in 1:n_simul1){
  X_n_simul1[[i]] <- as.matrix(mvrnorm(n = n_obs1, mu = mean_vect1, Sigma = CovarMat1))
}

# Generate Y (a fresh error vector is drawn for each simulated dataset)
y_n_simul1 <- list()
for(i in 1:n_simul1){
  epsilon1 <- rnorm(n_obs1, mean = 0, sd = sigma1)
  y_n_simul1[[i]] <- X_n_simul1[[i]] %*% beta1 + epsilon1
}

# Create Full Datasets
data1 <- list()
for(i in 1:n_simul1){
  data1[[i]] <- data.frame(cbind(y_n_simul1[[i]], X_n_simul1[[i]]))
  colnames(data1[[i]]) <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8")
}
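To verify that the simulated predictors follow the intended \(\rho^{|i-j|}\) correlation structure, the empirical correlation matrix of one simulated design can be compared to CovarMat1 (a quick check, not part of the analysis):

# Compare sample correlations of the first simulated design to the target rho^|i-j| structure
round(cor(X_n_simul1[[1]]), 2)   # empirical correlations (noisy with only 20 observations)
round(CovarMat1, 2)              # target correlation matrix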

2.2 Underlying Model 2

Data for the second underlying model also consists of 50 simulated datasets with 20 observations each, with the same model structure as model 1 but now with parameter values \(\beta_{j}=0.85\) for every \(j\), \(\sigma=3\), and correlation between \(x_i\) and \(x_j\) of \(\rho^{|i-j|}\) with \(\rho=0.5\).

R-Implementation:

# Variable declarations
n_obs2 <- 20
n_simul2 <- 50
p2 <- 8
corr2 <- 0.5
sigma2 <- 3
beta2 <- rep(0.85, 8)

# Simulating Matrix of Predictors
CovarMat2 <- outer(1:p2, 1:p2, function(i, j){corr2^abs(i - j)})
mean_vect2 <- rep(0, p2)
X_n_simul2 <- list()
for(i in 1:n_simul2){
  X_n_simul2[[i]] <- as.matrix(mvrnorm(n = n_obs2, mu = mean_vect2, Sigma = CovarMat2))
}

# Generate Y (a fresh error vector is drawn for each simulated dataset)
y_n_simul2 <- list()
for(i in 1:n_simul2){
  epsilon2 <- rnorm(n_obs2, mean = 0, sd = sigma2)
  y_n_simul2[[i]] <- X_n_simul2[[i]] %*% beta2 + epsilon2
}

# Create Full Datasets
data2 <- list()
for(i in 1:n_simul2){
  data2[[i]] <- data.frame(cbind(y_n_simul2[[i]], X_n_simul2[[i]]))
  colnames(data2[[i]]) <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8")
}

2.3 Underlying Model 3

Data for the third underlying model also consists of 50 simulated datasets with 20 observations each, with the same model structure as model 1 but now with parameter values \(\boldsymbol{\beta}=(5,0,0,0,0,0,0,0)^T\), \(\sigma=2\), and correlation between \(x_i\) and \(x_j\) of \(\rho^{|i-j|}\) with \(\rho=0.5\).

R-Implementation:

# Variable declaration for simulating dataset
n_obs3 <- 20
n_simul3 <- 50
p3 <- 8
corr3 <- 0.5
sigma3 <- 2
beta3 <- c(5, 0, 0, 0, 0, 0, 0, 0)

# Simulating matrix of predictors X
CovarMat3 <- outer(1:p3, 1:p3, function(i, j){corr3^abs(i - j)})
mean_vect3 <- rep(0, p3)
X_n_simul3 <- list()
for(i in 1:n_simul3){
  X_n_simul3[[i]] <- as.matrix(mvrnorm(n = n_obs3, mu = mean_vect3, Sigma = CovarMat3))
}

# Generate y from X (a fresh error vector is drawn for each simulated dataset)
y_n_simul3 <- list()
for(i in 1:n_simul3){
  epsilon3 <- rnorm(n_obs3, mean = 0, sd = sigma3)
  y_n_simul3[[i]] <- X_n_simul3[[i]] %*% beta3 + epsilon3
}

# Create Full Datasets
data3 <- list()
for(i in 1:n_simul3){
  data3[[i]] <- data.frame(cbind(y_n_simul3[[i]], X_n_simul3[[i]]))
  colnames(data3[[i]]) <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8")
}

3 Fitting Models

In this section, we fit the regression models and measure model performances on each underlying model. For LASSO and Ridge we use the glmnet R package, and for all-possible, forward, backward, and stepwise selection we use the olsrr package.

3.1 OLS Regression

We first fit the OLS Regression model and measure model performances. It is important to mention that no predictors are discarded for OLS Regression. We have the data as \((\boldsymbol{x}^{i},y_i)\) for \(i=1,2,...,N\), where \(\boldsymbol{x}^{i}=(x_{i1},...,x_{ip})^T\), and according to James et al. (2013) we estimate \(\beta_{0},\beta_{1},...,\beta_{p}\) to minimize

\[ \text{RSS} = \sum^{N}_{i=1}\left(y_{i}-\beta_{0}-\sum^{p}_{j=1}\beta_{j}x_{ij}\right)^2 \]

R-Implementation:

#Underlying Model 1
OLS_AIC_Data1 <- vector(length = n_simul1)
OLS_BIC_Data1 <- vector(length = n_simul1)
OLS_R2_Data1 <- vector(length = n_simul1)


for(i in 1:n_simul1){
  OLS_fit1 <- lm(y_n_simul1[[i]]~X_n_simul1[[i]])
  OLS_AIC_Data1[i] <-  AIC(OLS_fit1)
  OLS_BIC_Data1[i] <-  BIC(OLS_fit1)
  OLS_R2_Data1[i] <- summary(OLS_fit1)$adj.r.squared
}

mean_OLS_AIC_Data1 <- mean(OLS_AIC_Data1)
mean_OLS_BIC_Data1 <- mean(OLS_BIC_Data1)
mean_OLS_R2_Data1 <- mean(OLS_R2_Data1)

med_OLS_AIC_Data1 <- median(OLS_AIC_Data1)
med_OLS_BIC_Data1 <- median(OLS_BIC_Data1)
med_OLS_R2_Data1 <- median(OLS_R2_Data1)

OLS_row1 <- cbind("OLS", mean_OLS_AIC_Data1,
                  mean_OLS_BIC_Data1,
                  mean_OLS_R2_Data1,
                  med_OLS_AIC_Data1,
                  med_OLS_BIC_Data1,
                  med_OLS_R2_Data1)

#Underlying Model 2
OLS_AIC_Data2 <- vector(length = n_simul2)
OLS_BIC_Data2 <- vector(length = n_simul2)
OLS_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  OLS_fit2 <- lm(y_n_simul2[[i]]~X_n_simul2[[i]])
  OLS_AIC_Data2[i] <-  AIC(OLS_fit2)
  OLS_BIC_Data2[i] <-  BIC(OLS_fit2)
  OLS_R2_Data2[i] <- summary(OLS_fit2)$adj.r.squared
}

mean_OLS_AIC_Data2 <- mean(OLS_AIC_Data2)
mean_OLS_BIC_Data2 <- mean(OLS_BIC_Data2)
mean_OLS_R2_Data2 <- mean(OLS_R2_Data2)

med_OLS_AIC_Data2 <- median(OLS_AIC_Data2)
med_OLS_BIC_Data2 <- median(OLS_BIC_Data2)
med_OLS_R2_Data2 <- median(OLS_R2_Data2)

OLS_row2 <- cbind("OLS", mean_OLS_AIC_Data2,
                  mean_OLS_BIC_Data2,
                  mean_OLS_R2_Data2,
                  med_OLS_AIC_Data2,
                  med_OLS_BIC_Data2,
                  med_OLS_R2_Data2)

#Underlying Model 3
OLS_AIC_Data3 <- vector(length = n_simul3)
OLS_BIC_Data3 <- vector(length = n_simul3)
OLS_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  OLS_fit3 <- lm(y_n_simul3[[i]]~X_n_simul3[[i]])
  OLS_AIC_Data3[i] <-  AIC(OLS_fit3)
  OLS_BIC_Data3[i] <-  BIC(OLS_fit3)
  OLS_R2_Data3[i] <- summary(OLS_fit3)$adj.r.squared
}

mean_OLS_AIC_Data3 <- mean(OLS_AIC_Data3)
mean_OLS_BIC_Data3 <- mean(OLS_BIC_Data3)
mean_OLS_R2_Data3 <- mean(OLS_R2_Data3)

med_OLS_AIC_Data3 <- median(OLS_AIC_Data3)
med_OLS_BIC_Data3 <- median(OLS_BIC_Data3)
med_OLS_R2_Data3 <- median(OLS_R2_Data3)

OLS_row3 <- cbind("OLS", mean_OLS_AIC_Data3,
                  mean_OLS_BIC_Data3,
                  mean_OLS_R2_Data3,
                  med_OLS_AIC_Data3,
                  med_OLS_BIC_Data3,
                  med_OLS_R2_Data3)
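The lm() fits above minimize the RSS given at the start of this subsection. As a check, the coefficients for a single simulated dataset can be reproduced from the normal equations (a minimal sketch using the objects from Section 2.1):

# Normal-equations solution for one simulated dataset, compared to lm()
X_design <- cbind(1, X_n_simul1[[1]])                       # add an intercept column
beta_normal_eq <- solve(t(X_design) %*% X_design, t(X_design) %*% y_n_simul1[[1]])
beta_lm <- coef(lm(y_n_simul1[[1]] ~ X_n_simul1[[1]]))
cbind(as.vector(beta_normal_eq), beta_lm)                   # the two columns should agree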

3.2 LASSO Regression

We now outline the LASSO technique according to Tibshirani (1996), fit the LASSO Regression model, and measure model performances. We have the data as \((\boldsymbol{x}^{i},y_i)\) for \(i=1,2,...,N\), where \(\boldsymbol{x}^{i}=(x_{i1},...,x_{ip})^T\) and the predictors are standardized. Additionally, we define \(\boldsymbol{\hat{\beta}}=(\hat{\beta}_{1},...,\hat{\beta}_{p})^T\) and a tuning parameter \(t\geq0\); the LASSO estimate \((\boldsymbol{\hat{\alpha}},\boldsymbol{\hat{\beta}})\) is then defined as

\[ (\boldsymbol{\hat{\alpha}},\boldsymbol{\hat{\beta}})=\text{arg min}\left\{\sum^{N}_{i=1}(y_i-\alpha-\sum_{j}\beta_{j}x_{ij})^2\right\} \text{ subject to } \sum_{j}|\beta_j|\leq t \]

R-Implementation:

#Underlying Model 1
LASSO_AIC_Data1 <- vector(length = n_simul1)
LASSO_BIC_Data1 <- vector(length = n_simul1)
LASSO_R2_Data1 <- vector(length = n_simul1)

for(i in 1:n_simul1){
  lasso_fit1 <- glmnet(X_n_simul1[[i]], y_n_simul1[[i]], alpha = 1, family = "gaussian")

  # tLL is twice the log-likelihood gain over the null model, along the lambda path
  tLL <- lasso_fit1$nulldev - deviance(lasso_fit1)
  k <- lasso_fit1$df     # number of nonzero coefficients at each lambda
  n <- lasso_fit1$nobs

  # Evaluate at the last (smallest) lambda on the path; the AIC includes the
  # small-sample correction term 2k(k+1)/(n-k-1)
  LASSO_AIC_Data1[i] <- tail(-tLL + 2*k + 2*k*(k+1)/(n-k-1), n = 1)
  LASSO_BIC_Data1[i] <- tail(log(n)*k - tLL, n = 1)
  LASSO_R2_Data1[i] <- tail(lasso_fit1$dev.ratio, n = 1)   # fraction of null deviance explained
}

mean_LASSO_AIC_Data1 <- mean(LASSO_AIC_Data1)
mean_LASSO_BIC_Data1 <- mean(LASSO_BIC_Data1)
mean_LASSO_R2_Data1 <- mean(LASSO_R2_Data1)

med_LASSO_AIC_Data1 <- median(LASSO_AIC_Data1)
med_LASSO_BIC_Data1 <- median(LASSO_BIC_Data1)
med_LASSO_R2_Data1 <- median(LASSO_R2_Data1)

LASSO_row1 <- cbind("LASSO", mean_LASSO_AIC_Data1,
                  mean_LASSO_BIC_Data1,
                  mean_LASSO_R2_Data1,
                  med_LASSO_AIC_Data1,
                  med_LASSO_BIC_Data1,
                  med_LASSO_R2_Data1)

#Underlying Model 2
LASSO_AIC_Data2 <- vector(length = n_simul2)
LASSO_BIC_Data2 <- vector(length = n_simul2)
LASSO_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  lasso_fit2 = glmnet(X_n_simul2[[i]], y_n_simul2[[i]], alpha = 1, family="gaussian")
  
  tLL <- lasso_fit2$nulldev - deviance(lasso_fit2)
  k <- lasso_fit2$df
  n <- lasso_fit2$nobs
  
  LASSO_AIC_Data2[i] <- tail(-tLL+2*k+2*k*(k+1)/(n-k-1),n=1)
  LASSO_BIC_Data2[i] <- tail(log(n)*k - tLL,n=1)
  LASSO_R2_Data2[i] <- tail(lasso_fit2$dev.ratio,n=1)
}

mean_LASSO_AIC_Data2 <- mean(LASSO_AIC_Data2)
mean_LASSO_BIC_Data2 <- mean(LASSO_BIC_Data2)
mean_LASSO_R2_Data2 <- mean(LASSO_R2_Data2)

med_LASSO_AIC_Data2 <- median(LASSO_AIC_Data2)
med_LASSO_BIC_Data2 <- median(LASSO_BIC_Data2)
med_LASSO_R2_Data2 <- median(LASSO_R2_Data2)

LASSO_row2 <- cbind("LASSO", mean_LASSO_AIC_Data2,
                  mean_LASSO_BIC_Data2,
                  mean_LASSO_R2_Data2,
                  med_LASSO_AIC_Data2,
                  med_LASSO_BIC_Data2,
                  med_LASSO_R2_Data2)

#Underlying Model 3
LASSO_AIC_Data3 <- vector(length = n_simul3)
LASSO_BIC_Data3 <- vector(length = n_simul3)
LASSO_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  lasso_fit3 = glmnet(X_n_simul3[[i]], y_n_simul3[[i]], alpha = 1, family="gaussian")
  
  tLL <- lasso_fit3$nulldev - deviance(lasso_fit3)
  k <- lasso_fit3$df
  n <- lasso_fit3$nobs
  
  LASSO_AIC_Data3[i] <- tail(-tLL+2*k+2*k*(k+1)/(n-k-1),n=1)
  LASSO_BIC_Data3[i] <- tail(log(n)*k - tLL,n=1)
  LASSO_R2_Data3[i] <- tail(lasso_fit3$dev.ratio,n=1)
}

mean_LASSO_AIC_Data3 <- mean(LASSO_AIC_Data3)
mean_LASSO_BIC_Data3 <- mean(LASSO_BIC_Data3)
mean_LASSO_R2_Data3 <- mean(LASSO_R2_Data3)

med_LASSO_AIC_Data3 <- median(LASSO_AIC_Data3)
med_LASSO_BIC_Data3 <- median(LASSO_BIC_Data3)
med_LASSO_R2_Data3 <- median(LASSO_R2_Data3)

LASSO_row3 <- cbind("LASSO", mean_LASSO_AIC_Data3,
                  mean_LASSO_BIC_Data3,
                  mean_LASSO_R2_Data3,
                  med_LASSO_AIC_Data3,
                  med_LASSO_BIC_Data3,
                  med_LASSO_R2_Data3)
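The loops above evaluate each fit at the last (smallest) \(\lambda\) on the default glmnet path. An alternative not used in this project is to choose \(\lambda\) by cross-validation with cv.glmnet(); a minimal sketch for one simulated dataset:

# Choosing lambda by cross-validation (alternative to using the end of the lambda path)
cv_fit <- cv.glmnet(X_n_simul1[[1]], y_n_simul1[[1]], alpha = 1, family = "gaussian")
cv_fit$lambda.min                 # lambda with the smallest cross-validated error
coef(cv_fit, s = "lambda.min")    # coefficients at that lambda; some are exactly zero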

3.3 Ridge Regression

We now outline the Ridge Regression technique according to James et al. (2013), fit the Ridge Regression model, and measure model performances. Ridge Regression has a similar setup to OLS, although the quantity to be minimized differs from that of OLS and LASSO Regression. We define a tuning parameter \(\lambda \geq 0\), and the regression coefficient estimates \(\boldsymbol{\hat{\beta}}\) minimize

\[ \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^2 + \lambda\sum^{p}_{j=1}\beta^{2}_{j} \]

R-Implementation:

#Underlying Model 1
Rid_AIC_Data1 <- vector(length = n_simul1)
Rid_BIC_Data1 <- vector(length = n_simul1)
Rid_R2_Data1 <- vector(length = n_simul1)

for(i in 1:n_simul1){
  Rid_fit1 = glmnet(X_n_simul1[[i]], y_n_simul1[[i]], alpha = 0, family="gaussian")
  
  tLL <- Rid_fit1$nulldev - deviance(Rid_fit1)
  k <- Rid_fit1$df
  n <- Rid_fit1$nobs
  
  Rid_AIC_Data1[i] <- tail(-tLL+2*k+2*k*(k+1)/(n-k-1),n=1)
  Rid_BIC_Data1[i] <- tail(log(n)*k - tLL,n=1)
  Rid_R2_Data1[i] <- tail(Rid_fit1$dev.ratio,n=1)
}

mean_Rid_AIC_Data1 <- mean(Rid_AIC_Data1)
mean_Rid_BIC_Data1 <- mean(Rid_BIC_Data1)
mean_Rid_R2_Data1 <- mean(Rid_R2_Data1)

med_Rid_AIC_Data1 <- median(Rid_AIC_Data1)
med_Rid_BIC_Data1 <- median(Rid_BIC_Data1)
med_Rid_R2_Data1 <- median(Rid_R2_Data1)

Rid_row1 <- cbind("Ridge", mean_Rid_AIC_Data1,
                  mean_Rid_BIC_Data1,
                  mean_Rid_R2_Data1,
                  med_Rid_AIC_Data1,
                  med_Rid_BIC_Data1,
                  med_Rid_R2_Data1)

#Underlying Model 2
Rid_AIC_Data2 <- vector(length = n_simul2)
Rid_BIC_Data2 <- vector(length = n_simul2)
Rid_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  Rid_fit2 = glmnet(X_n_simul2[[i]], y_n_simul2[[i]], alpha = 0, family="gaussian")
  
  tLL <- Rid_fit2$nulldev - deviance(Rid_fit2)
  k <- Rid_fit2$df
  n <- Rid_fit2$nobs
  
  Rid_AIC_Data2[i] <- tail(-tLL+2*k+2*k*(k+1)/(n-k-1),n=1)
  Rid_BIC_Data2[i] <- tail(log(n)*k - tLL,n=1)
  Rid_R2_Data2[i] <- tail(Rid_fit2$dev.ratio,n=1)
}

mean_Rid_AIC_Data2 <- mean(Rid_AIC_Data2)
mean_Rid_BIC_Data2 <- mean(Rid_BIC_Data2)
mean_Rid_R2_Data2 <- mean(Rid_R2_Data2)

med_Rid_AIC_Data2 <- median(Rid_AIC_Data2)
med_Rid_BIC_Data2 <- median(Rid_BIC_Data2)
med_Rid_R2_Data2 <- median(Rid_R2_Data2)

Rid_row2 <- cbind("Ridge", mean_Rid_AIC_Data2,
                  mean_Rid_BIC_Data2,
                  mean_Rid_R2_Data2,
                  med_Rid_AIC_Data2,
                  med_Rid_BIC_Data2,
                  med_Rid_R2_Data2)

#Underlying Model 3
Rid_AIC_Data3 <- vector(length = n_simul3)
Rid_BIC_Data3 <- vector(length = n_simul3)
Rid_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  Rid_fit3 = glmnet(X_n_simul3[[i]], y_n_simul3[[i]], alpha = 0, family="gaussian")
  
  tLL <- Rid_fit3$nulldev - deviance(Rid_fit3)
  k <- Rid_fit3$df
  n <- Rid_fit3$nobs
  
  Rid_AIC_Data3[i] <- tail(-tLL+2*k+2*k*(k+1)/(n-k-1),n=1)
  Rid_BIC_Data3[i] <- tail(log(n)*k - tLL,n=1)
  Rid_R2_Data3[i] <- tail(Rid_fit3$dev.ratio,n=1)
}

mean_Rid_AIC_Data3 <- mean(Rid_AIC_Data3)
mean_Rid_BIC_Data3 <- mean(Rid_BIC_Data3)
mean_Rid_R2_Data3 <- mean(Rid_R2_Data3)

med_Rid_AIC_Data3 <- median(Rid_AIC_Data3)
med_Rid_BIC_Data3 <- median(Rid_BIC_Data3)
med_Rid_R2_Data3 <- median(Rid_R2_Data3)

Rid_row3 <- cbind("Ridge", mean_Rid_AIC_Data3,
                  mean_Rid_BIC_Data3,
                  mean_Rid_R2_Data3,
                  med_Rid_AIC_Data3,
                  med_Rid_BIC_Data3,
                  med_Rid_R2_Data3)
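For a fixed \(\lambda\), the penalized criterion above has a closed-form solution when the predictors and response are centered so that no intercept is needed. A rough sketch for one simulated dataset (illustrative only; glmnet standardizes the predictors and scales \(\lambda\) differently, so the numbers will not match the glmnet fits above exactly):

# Closed-form ridge coefficients for a fixed lambda (centered y and X, no intercept)
X_c <- scale(X_n_simul1[[1]], center = TRUE, scale = FALSE)
y_c <- y_n_simul1[[1]] - mean(y_n_simul1[[1]])
lambda <- 1                                               # illustrative value
ridge_beta <- solve(t(X_c) %*% X_c + lambda * diag(ncol(X_c)), t(X_c) %*% y_c)
ols_beta <- solve(t(X_c) %*% X_c, t(X_c) %*% y_c)
cbind(ridge = as.vector(ridge_beta), ols = as.vector(ols_beta))  # ridge shrinks toward zero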

3.4 All Possible Regression

We now outline the All Possible Regression technique according to Hocking (1976), fit all possible regression models, and measure model performances. All Possible Regression is a computationally intensive variable selection technique for which the number of predictors \(t\) cannot be too large, since the number of possible models is \(2^t\). All \(2^t\) possible models are fit and the model with the best performance is selected. It is important to mention that Hocking (1976) emphasizes approaches and modifications to All Possible Regression that make evaluating all possible subsets more efficient, which we do not consider further in this project. A short base-R sketch of the exhaustive search is given after the code below.

R-Implementation:

# Underlying Model 1
All_AIC_Data1 <- vector(length = n_simul1)
All_BIC_Data1 <- vector(length = n_simul1)
All_R2_Data1 <- vector(length = n_simul1)

for(i in 1:n_simul1){
  sig1 <- lm(Y~.,data=data1[[i]])
  All_fit1 <- ols_step_all_possible(sig1)
  All_AIC_Data1[i] <- min(All_fit1$aic)
  All_BIC_Data1[i] <- min(All_fit1$sbc)
  All_R2_Data1[i] <-  max(All_fit1$adjr)
}

mean_All_AIC_Data1 <- mean(All_AIC_Data1)
mean_All_BIC_Data1 <- mean(All_BIC_Data1)
mean_All_R2_Data1 <- mean(All_R2_Data1)

med_All_AIC_Data1 <- median(All_AIC_Data1)
med_All_BIC_Data1 <- median(All_BIC_Data1)
med_All_R2_Data1 <- median(All_R2_Data1)

All_row1 <- cbind("All-Possible Regression", mean_All_AIC_Data1,
                  mean_All_BIC_Data1,
                  mean_All_R2_Data1,
                  med_All_AIC_Data1,
                  med_All_BIC_Data1,
                  med_All_R2_Data1)

# Underlying Model 2
All_AIC_Data2 <- vector(length = n_simul2)
All_BIC_Data2 <- vector(length = n_simul2)
All_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  sig2 <- lm(Y~.,data=data2[[i]])
  All_fit2 <- ols_step_all_possible(sig2)
  All_AIC_Data2[i] <- min(All_fit2$aic)
  All_BIC_Data2[i] <- min(All_fit2$sbc)
  All_R2_Data2[i] <-  max(All_fit2$adjr)
}

mean_All_AIC_Data2 <- mean(All_AIC_Data2)
mean_All_BIC_Data2 <- mean(All_BIC_Data2)
mean_All_R2_Data2 <- mean(All_R2_Data2)

med_All_AIC_Data2 <- median(All_AIC_Data2)
med_All_BIC_Data2 <- median(All_BIC_Data2)
med_All_R2_Data2 <- median(All_R2_Data2)

All_row2 <- cbind("All-Possible Regression", mean_All_AIC_Data2,
                  mean_All_BIC_Data2,
                  mean_All_R2_Data2,
                  med_All_AIC_Data2,
                  med_All_BIC_Data2,
                  med_All_R2_Data2)

# Underlying Model 3
All_AIC_Data3 <- vector(length = n_simul3)
All_BIC_Data3 <- vector(length = n_simul3)
All_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  sig3 <- lm(Y~.,data=data3[[i]])
  All_fit3 <- ols_step_all_possible(sig3)
  All_AIC_Data3[i] <- min(All_fit3$aic)
  All_BIC_Data3[i] <- min(All_fit3$sbc)
  All_R2_Data3[i] <-  max(All_fit3$adjr)
}

mean_All_AIC_Data3 <- mean(All_AIC_Data3)
mean_All_BIC_Data3 <- mean(All_BIC_Data3)
mean_All_R2_Data3 <- mean(All_R2_Data3)

med_All_AIC_Data3 <- median(All_AIC_Data3)
med_All_BIC_Data3 <- median(All_BIC_Data3)
med_All_R2_Data3 <- median(All_R2_Data3)

All_row3 <- cbind("All-Possible Regression", mean_All_AIC_Data3,
                  mean_All_BIC_Data3,
                  mean_All_R2_Data3,
                  med_All_AIC_Data3,
                  med_All_BIC_Data3,
                  med_All_R2_Data3)
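As noted above, with \(t = 8\) predictors there are \(2^8 = 256\) candidate models (including the intercept-only model). The count can be checked directly, and the exhaustive search can be sketched in base R with combn() and lm(). This is for illustration only and is not the olsrr implementation used above:

# Number of candidate models with 8 predictors: sum over all subset sizes
sum(choose(8, 0:8))            # 256 = 2^8 (including the intercept-only model)

# Brute-force enumeration of all non-empty predictor subsets for one dataset
preds <- paste0("X", 1:8)
subsets <- unlist(lapply(1:8, function(m) combn(preds, m, simplify = FALSE)),
                  recursive = FALSE)
aics <- sapply(subsets, function(vars) AIC(lm(reformulate(vars, response = "Y"),
                                              data = data1[[1]])))
subsets[[which.min(aics)]]     # predictor subset with the smallest AIC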

3.5 Forward Selection

We now outline the Forward Selection technique, fit Forward Selection Regression models with a cutoff p-value of \(0.5\) for the \(F\)-statistic outlined below, and measure model performances. According to Hocking (1976), the Forward Selection technique starts with no variables and adds one variable at a time to the equation until either all variables are in or a criterion to stop the technique is satisfied (a base-R illustration of a single forward step is given after the code below). Furthermore, the variable considered for inclusion at any step is the one yielding the largest F-ratio among the variables being considered for inclusion. Thus, variable \(i\) is added to the \(p\)-term equation if

\[ F_{i} = \underset{i}{\max}\left(\frac{RSS_{p} - RSS_{p+i}}{\hat{\sigma}_{p+i}^2}\right) > F_{in} \]

R-Implementation:

#Underlying Model 1
For_AIC_Data1 <- vector(length = n_simul1)
For_BIC_Data1 <- vector(length = n_simul1)
For_R2_Data1 <- vector(length = n_simul1)

for(i in 1:n_simul1){
  sig1 <- lm(Y~.,data=data1[[i]])
  For_fit1 <- ols_step_forward_p(sig1, penter = 0.5)
  For_AIC_Data1[i] <- min(For_fit1$aic)
  For_BIC_Data1[i] <- min(For_fit1$sbc)
  For_R2_Data1[i] <-  max(For_fit1$adjr)
}

mean_For_AIC_Data1 <- mean(For_AIC_Data1)
mean_For_BIC_Data1 <- mean(For_BIC_Data1)
mean_For_R2_Data1 <- mean(For_R2_Data1)

med_For_AIC_Data1 <- median(For_AIC_Data1)
med_For_BIC_Data1 <- median(For_BIC_Data1)
med_For_R2_Data1 <- median(For_R2_Data1)

For_row1 <- cbind("Forward Selection", mean_For_AIC_Data1,
                  mean_For_BIC_Data1,
                  mean_For_R2_Data1,
                  med_For_AIC_Data1,
                  med_For_BIC_Data1,
                  med_For_R2_Data1)

#Underlying Model 2
For_AIC_Data2 <- vector(length = n_simul2)
For_BIC_Data2 <- vector(length = n_simul2)
For_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  sig2 <- lm(Y~.,data=data2[[i]])
  For_fit2 <- ols_step_forward_p(sig2, penter = 0.5)
  For_AIC_Data2[i] <- min(For_fit2$aic)
  For_BIC_Data2[i] <- min(For_fit2$sbc)
  For_R2_Data2[i] <-  max(For_fit2$adjr)
}

mean_For_AIC_Data2 <- mean(For_AIC_Data2)
mean_For_BIC_Data2 <- mean(For_BIC_Data2)
mean_For_R2_Data2 <- mean(For_R2_Data2)

med_For_AIC_Data2 <- median(For_AIC_Data2)
med_For_BIC_Data2 <- median(For_BIC_Data2)
med_For_R2_Data2 <- median(For_R2_Data2)

For_row2 <- cbind("Forward Selection", mean_For_AIC_Data2,
                  mean_For_BIC_Data2,
                  mean_For_R2_Data2,
                  med_For_AIC_Data2,
                  med_For_BIC_Data2,
                  med_For_R2_Data2)

#Underlying Model 3
For_AIC_Data3 <- vector(length = n_simul3)
For_BIC_Data3 <- vector(length = n_simul3)
For_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  sig3 <- lm(Y~.,data=data3[[i]])
  For_fit3 <- ols_step_forward_p(sig3, penter = 0.5)
  For_AIC_Data3[i] <- min(For_fit3$aic)
  For_BIC_Data3[i] <- min(For_fit3$sbc)
  For_R2_Data3[i] <-  max(For_fit3$adjr)
}

mean_For_AIC_Data3 <- mean(For_AIC_Data3)
mean_For_BIC_Data3 <- mean(For_BIC_Data3)
mean_For_R2_Data3 <- mean(For_R2_Data3)

med_For_AIC_Data3 <- median(For_AIC_Data3)
med_For_BIC_Data3 <- median(For_BIC_Data3)
med_For_R2_Data3 <- median(For_R2_Data3)

For_row3 <- cbind("Forward Selection", mean_For_AIC_Data3,
                  mean_For_BIC_Data3,
                  mean_For_R2_Data3,
                  med_For_AIC_Data3,
                  med_For_BIC_Data3,
                  med_For_R2_Data3)
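A single forward step can be illustrated with base R's add1(), which reports the partial \(F\)-statistic for each candidate variable; the variable with the largest \(F\) (smallest p-value) is the one forward selection would add next. This is a sketch for one dataset and is separate from the olsrr implementation used above:

# One forward-selection step from the intercept-only model (illustrative)
null_fit <- lm(Y ~ 1, data = data1[[1]])
add1(null_fit, scope = ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, test = "F")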

3.6 Backward Elimination

We now outline the Backward Elimination technique, fit Backward Elimination models with a cutoff p-value of \(0.1\) for the \(F\)-statistic outlined below, and measure model performances. According to Hocking (1976), the Backward Elimination technique starts with all variables and eliminates one variable at a time from the equation until either all variables are out or a criterion to stop the technique is satisfied (a base-R illustration of a single backward step is given after the code below). Furthermore, the variable considered for exclusion at any step is the one yielding the smallest F-ratio among the variables being considered for exclusion, provided the F-ratio does not exceed a specified value. Thus, variable \(i\) is removed from the \(p\)-term equation if

\[ F_{i} = \underset{i}{\min}\left(\frac{RSS_{p-i} - RSS_{p}}{\hat{\sigma}_{p}^2}\right) < F_{out} \]

R-Implementation:

#Underlying Model 1
Back_AIC_Data1 <- vector(length = n_simul1)
Back_BIC_Data1 <- vector(length = n_simul1)
Back_R2_Data1 <- vector(length = n_simul1)

for(i in 1:n_simul1){
  sig1 <- lm(Y~.,data=data1[[i]])
  Back_fit1 <- ols_step_backward_p(sig1, prem = 0.1)
  Back_AIC_Data1[i] <- min(Back_fit1$aic)
  Back_BIC_Data1[i] <- min(Back_fit1$sbc)
  Back_R2_Data1[i] <-  max(Back_fit1$adjr)
}

mean_Back_AIC_Data1 <- mean(Back_AIC_Data1)
mean_Back_BIC_Data1 <- mean(Back_BIC_Data1)
mean_Back_R2_Data1 <- mean(Back_R2_Data1)

med_Back_AIC_Data1 <- median(Back_AIC_Data1)
med_Back_BIC_Data1 <- median(Back_BIC_Data1)
med_Back_R2_Data1 <- median(Back_R2_Data1)

Back_row1 <- cbind("Backward Selection", mean_Back_AIC_Data1,
                  mean_Back_BIC_Data1,
                  mean_Back_R2_Data1,
                  med_Back_AIC_Data1,
                  med_Back_BIC_Data1,
                  med_Back_R2_Data1)

#Underlying Model 2
Back_AIC_Data2 <- vector(length = n_simul2)
Back_BIC_Data2 <- vector(length = n_simul2)
Back_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  sig2 <- lm(Y~.,data=data2[[i]])
  Back_fit2 <- ols_step_backward_p(sig2, prem = 0.1)
  Back_AIC_Data2[i] <- min(Back_fit2$aic)
  Back_BIC_Data2[i] <- min(Back_fit2$sbc)
  Back_R2_Data2[i] <-  max(Back_fit2$adjr)
}

mean_Back_AIC_Data2 <- mean(Back_AIC_Data2)
mean_Back_BIC_Data2 <- mean(Back_BIC_Data2)
mean_Back_R2_Data2 <- mean(Back_R2_Data2)

med_Back_AIC_Data2 <- median(Back_AIC_Data2)
med_Back_BIC_Data2 <- median(Back_BIC_Data2)
med_Back_R2_Data2 <- median(Back_R2_Data2)

Back_row2 <- cbind("Backward Selection", mean_Back_AIC_Data2,
                  mean_Back_BIC_Data2,
                  mean_Back_R2_Data2,
                  med_Back_AIC_Data2,
                  med_Back_BIC_Data2,
                  med_Back_R2_Data2)

#Underlying Model 3
Back_AIC_Data3 <- vector(length = n_simul3)
Back_BIC_Data3 <- vector(length = n_simul3)
Back_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  sig3 <- lm(Y~.,data=data3[[i]])
  Back_fit3 <- ols_step_backward_p(sig3, prem = 0.1)
  Back_AIC_Data3[i] <- min(Back_fit3$aic)
  Back_BIC_Data3[i] <- min(Back_fit3$sbc)
  Back_R2_Data3[i] <-  max(Back_fit3$adjr)
}

mean_Back_AIC_Data3 <- mean(Back_AIC_Data3)
mean_Back_BIC_Data3 <- mean(Back_BIC_Data3)
mean_Back_R2_Data3 <- mean(Back_R2_Data3)

med_Back_AIC_Data3 <- median(Back_AIC_Data3)
med_Back_BIC_Data3 <- median(Back_BIC_Data3)
med_Back_R2_Data3 <- median(Back_R2_Data3)

Back_row3 <- cbind("Backward Selection", mean_Back_AIC_Data3,
                  mean_Back_BIC_Data3,
                  mean_Back_R2_Data3,
                  med_Back_AIC_Data3,
                  med_Back_BIC_Data3,
                  med_Back_R2_Data3)
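Similarly, a single backward step can be illustrated with base R's drop1(), which reports the \(F\)-statistic for removing each variable from the full model; the variable with the smallest \(F\) (largest p-value) is the candidate for removal. Again a sketch, separate from ols_step_backward_p():

# One backward-elimination step from the full model (illustrative)
full_fit <- lm(Y ~ ., data = data1[[1]])
drop1(full_fit, test = "F")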

3.7 Stepwise Selection

We now outline the Stepwise Selection technique, fit Stepwise Selection Regression models with an entry cutoff p-value of \(0.15\) and a removal cutoff p-value of \(0.15\), and measure model performances. According to Hocking (1976), the Stepwise Selection technique is a combination of Forward Selection and Backward Elimination such that variables are added as in Forward Selection, but after each addition every variable in the equation can be removed as in Backward Elimination. For comparison, a sketch using base R's AIC-based stepwise search is given after the code below.

R-Implementation:

#Underlying Model 1
Step_AIC_Data1 <- vector(length = n_simul1)
Step_BIC_Data1 <- vector(length = n_simul1)
Step_R2_Data1 <- vector(length = n_simul1)

for(i in 1:n_simul1){
  sig1 <- lm(Y~.,data=data1[[i]])
  Step_fit1 <- ols_step_both_p(sig1, pent = 0.15, prem = 0.15)
  Step_AIC_Data1[i] <- min(Step_fit1$aic)
  Step_BIC_Data1[i] <- min(Step_fit1$sbc)
  Step_R2_Data1[i] <-  max(Step_fit1$adjr)
}

mean_Step_AIC_Data1 <- mean(Step_AIC_Data1)
mean_Step_BIC_Data1 <- mean(Step_BIC_Data1)
mean_Step_R2_Data1 <- mean(Step_R2_Data1)

med_Step_AIC_Data1 <- median(Step_AIC_Data1)
med_Step_BIC_Data1 <- median(Step_BIC_Data1)
med_Step_R2_Data1 <- median(Step_R2_Data1)

Step_row1 <- cbind("Stepwise Selection", mean_Step_AIC_Data1,
                  mean_Step_BIC_Data1,
                  mean_Step_R2_Data1,
                  med_Step_AIC_Data1,
                  med_Step_BIC_Data1,
                  med_Step_R2_Data1)

#Underlying Model 2
Step_AIC_Data2 <- vector(length = n_simul2)
Step_BIC_Data2 <- vector(length = n_simul2)
Step_R2_Data2 <- vector(length = n_simul2)

for(i in 1:n_simul2){
  sig2 <- lm(Y~.,data=data2[[i]])
  Step_fit2 <- ols_step_both_p(sig2, pent = 0.15, prem = 0.15)
  Step_AIC_Data2[i] <- min(Step_fit2$aic)
  Step_BIC_Data2[i] <- min(Step_fit2$sbc)
  Step_R2_Data2[i] <-  max(Step_fit2$adjr)
}

mean_Step_AIC_Data2 <- mean(Step_AIC_Data2)
mean_Step_BIC_Data2 <- mean(Step_BIC_Data2)
mean_Step_R2_Data2 <- mean(Step_R2_Data2)

med_Step_AIC_Data2 <- median(Step_AIC_Data2)
med_Step_BIC_Data2 <- median(Step_BIC_Data2)
med_Step_R2_Data2 <- median(Step_R2_Data2)

Step_row2 <- cbind("Stepwise Selection", mean_Step_AIC_Data2,
                  mean_Step_BIC_Data2,
                  mean_Step_R2_Data2,
                  med_Step_AIC_Data2,
                  med_Step_BIC_Data2,
                  med_Step_R2_Data2)

#Underlying Model 3
Step_AIC_Data3 <- vector(length = n_simul3)
Step_BIC_Data3 <- vector(length = n_simul3)
Step_R2_Data3 <- vector(length = n_simul3)

for(i in 1:n_simul3){
  sig3 <- lm(Y~.,data=data3[[i]])
  Step_fit3 <- ols_step_both_p(sig3, pent = 0.15, prem = 0.15)
  Step_AIC_Data3[i] <- min(Step_fit3$aic)
  Step_BIC_Data3[i] <- min(Step_fit3$sbc)
  Step_R2_Data3[i] <-  max(Step_fit3$adjr)
}

mean_Step_AIC_Data3 <- mean(Step_AIC_Data3)
mean_Step_BIC_Data3 <- mean(Step_BIC_Data3)
mean_Step_R2_Data3 <- mean(Step_R2_Data3)

med_Step_AIC_Data3 <- median(Step_AIC_Data3)
med_Step_BIC_Data3 <- median(Step_BIC_Data3)
med_Step_R2_Data3 <- median(Step_R2_Data3)

Step_row3 <- cbind("Stepwise Selection", mean_Step_AIC_Data3,
                  mean_Step_BIC_Data3,
                  mean_Step_R2_Data3,
                  med_Step_AIC_Data3,
                  med_Step_BIC_Data3,
                  med_Step_R2_Data3)
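For comparison, base R's step() performs a similar both-direction search, but it adds and drops terms based on AIC rather than p-value thresholds, so the selected model need not match ols_step_both_p(). A minimal sketch for one dataset:

# AIC-based both-direction stepwise search (not the p-value rule used above)
step_fit <- step(lm(Y ~ 1, data = data1[[1]]),
                 scope = list(lower = ~ 1, upper = ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8),
                 direction = "both", trace = 0)
AIC(step_fit)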

4 Results and Discussion

4.1 Performance Metrics

  • Table of Results for all methods for all models: We observe that OLS performs the worst over all metrics, as expected, since models 1 and 3 contain multiple \(\beta\) coefficients with value zero, and even for model 2 with exclusively non-zero coefficients, OLS performed worse than all other methods. Although all-possible regression, forward selection, backward selection, and stepwise selection performed better than OLS, they did not perform as well as the Ridge and LASSO models. Both Ridge and LASSO outperformed all other methods for AIC, BIC, and Adjusted \(R^2\) by a large margin. Furthermore, LASSO slightly outperformed Ridge, which makes sense because LASSO can shrink model coefficients to exactly zero and thus better approximate the underlying models, whereas Ridge cannot completely eliminate coefficients for underlying models with zero coefficients (such as models 1 and 3).

Results Tables:

Table_Data1 <- data.frame(rbind(OLS_row1,LASSO_row1,Rid_row1,All_row1
                        ,For_row1,Back_row1,Step_row1))
colnames(Table_Data1) <- c("Method","Mean AIC","Mean BIC","Mean Adj. R-Squared","Median AIC","Median BIC","Median Adj. R-Squared")

Table_Data2 <- data.frame(rbind(OLS_row2,LASSO_row2,Rid_row2,All_row2
                        ,For_row2,Back_row2,Step_row2))
colnames(Table_Data2) <- c("Method","Mean AIC","Mean BIC","Mean Adj. R-Squared","Median AIC","Median BIC","Median Adj. R-Squared")

Table_Data3 <- data.frame(rbind(OLS_row3,LASSO_row3,Rid_row3,All_row3
                        ,For_row3,Back_row3,Step_row3))
colnames(Table_Data3) <- c("Method","Mean AIC","Mean BIC","Mean Adj. R-Squared","Median AIC","Median BIC","Median Adj. R-Squared")

Table_Data1[, c(2:ncol(Table_Data1))] <- sapply(Table_Data1[, c(2:ncol(Table_Data1))], as.numeric)
Table_Data1[, c(2:ncol(Table_Data1))] <- round(Table_Data1[, c(2:ncol(Table_Data1))],2)

Table_Data2[, c(2:ncol(Table_Data2))] <- sapply(Table_Data2[, c(2:ncol(Table_Data2))], as.numeric)
Table_Data2[, c(2:ncol(Table_Data2))] <- round(Table_Data2[, c(2:ncol(Table_Data2))],2)

Table_Data3[, c(2:ncol(Table_Data3))] <- sapply(Table_Data3[, c(2:ncol(Table_Data3))], as.numeric)
Table_Data3[, c(2:ncol(Table_Data3))] <- round(Table_Data3[, c(2:ncol(Table_Data3))],2)

Table_Full <- cbind(Table_Data1, Table_Data2[,-1], Table_Data3[,-1])
Table_Full[, c(2:ncol(Table_Full))] <- sapply(Table_Full[, c(2:ncol(Table_Full))], as.numeric)
Table_Full[, c(2:ncol(Table_Full))] <- round(Table_Full[, c(2:ncol(Table_Full))],2)

kable(Table_Data1) %>%
  kable_styling("striped") %>%
  add_header_above(c(" " = 1, "Underlying Model 1" = 6))

kable(Table_Data2) %>%
  kable_styling("striped") %>%
  add_header_above(c(" " = 1, "Underlying Model 2" = 6))

kable(Table_Data3) %>%
  kable_styling("striped") %>%
  add_header_above(c(" " = 1, "Underlying Model 3" = 6))
Underlying Model 1

Method                  | Mean AIC | Mean BIC | Mean Adj. R-Squared | Median AIC | Median BIC | Median Adj. R-Squared
OLS                     |   107.52 |   117.48 |                0.70 |     109.41 |     119.37 |                  0.71
LASSO                   |  -471.70 |  -476.46 |                0.82 |    -441.02 |    -446.14 |                  0.83
Ridge                   |  -466.06 |  -471.19 |                0.82 |    -436.67 |    -441.80 |                  0.83
All-Possible Regression |   102.26 |   107.93 |                0.75 |     102.66 |     108.30 |                  0.76
Forward Selection       |   102.35 |   108.10 |                0.75 |     102.66 |     108.50 |                  0.76
Backward Selection      |   102.32 |   108.09 |                0.75 |     102.66 |     108.33 |                  0.76
Stepwise Selection      |   103.18 |   108.20 |                0.72 |     103.70 |     108.30 |                  0.73

Underlying Model 2

Method                  | Mean AIC | Mean BIC | Mean Adj. R-Squared | Median AIC | Median BIC | Median Adj. R-Squared
OLS                     |   102.19 |   112.15 |                0.67 |     104.08 |     114.04 |                  0.68
LASSO                   |  -331.91 |  -336.64 |                0.81 |    -299.24 |    -304.37 |                  0.82
Ridge                   |  -329.21 |  -334.34 |                0.80 |    -297.19 |    -302.31 |                  0.81
All-Possible Regression |    96.73 |   102.50 |                0.73 |      98.26 |     103.62 |                  0.73
Forward Selection       |    97.28 |   103.30 |                0.72 |      98.59 |     104.13 |                  0.72
Backward Selection      |    96.82 |   102.72 |                0.73 |      98.27 |     103.75 |                  0.73
Stepwise Selection      |    97.99 |   103.24 |                0.70 |      98.83 |     103.90 |                  0.71

Underlying Model 3

Method                  | Mean AIC | Mean BIC | Mean Adj. R-Squared | Median AIC | Median BIC | Median Adj. R-Squared
OLS                     |    84.53 |    94.49 |                0.89 |      86.26 |      96.21 |                  0.90
LASSO                   |  -480.89 |  -485.85 |                0.94 |    -449.17 |    -454.30 |                  0.94
Ridge                   |  -473.97 |  -479.09 |                0.93 |    -442.46 |    -447.58 |                  0.93
All-Possible Regression |    77.63 |    82.14 |                0.91 |      78.87 |      83.43 |                  0.92
Forward Selection       |    77.70 |    82.22 |                0.91 |      78.87 |      83.43 |                  0.92
Backward Selection      |    77.73 |    82.29 |                0.91 |      79.05 |      83.59 |                  0.92
Stepwise Selection      |    78.31 |    82.23 |                0.91 |      79.25 |      83.43 |                  0.91

4.2 Plots

  • Plotting AIC Values Over All Models: We confirm the results from the tables of the previous subsection by directly plotting every AIC value for every method. OLS performs the worst, the stepwise methods perform slightly better, and LASSO and Ridge models perform the best.
AIC_df <- data.frame(
   OLS = c(OLS_AIC_Data1, OLS_AIC_Data2, OLS_AIC_Data3),
   LASSO = c(LASSO_AIC_Data1, LASSO_AIC_Data2, LASSO_AIC_Data3),
   Ridge = c(Rid_AIC_Data1, Rid_AIC_Data2, Rid_AIC_Data3),
   `All Possible Regression` = c(unlist(All_AIC_Data1), unlist(All_AIC_Data2), unlist(All_AIC_Data3)),
   `Forward Selection` = c(unlist(For_AIC_Data1), unlist(For_AIC_Data2), unlist(For_AIC_Data3)),
   `Backward Elimination` = c(unlist(Back_AIC_Data1), unlist(Back_AIC_Data2), unlist(Back_AIC_Data3)),
   `Stepwise Selection` = c(unlist(Step_AIC_Data1), unlist(Step_AIC_Data2),unlist(Step_AIC_Data3))
)

gathered_data <- gather(AIC_df, key = "key", value = "value")

colnames(gathered_data) <- c("key","value")

gathered_data %>%
  ggplot( aes(reorder(key, value, FUN = median, .desc =TRUE), y=value, fill=key)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.5) +
    geom_jitter(color="black", size=0.4, alpha=0.8) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)) +
    coord_flip() +
    labs(title = "AIC Distribution Over All Models",
      x="Variable Selection Technique",
      y="AIC")

  • Plotting BIC Values Over All Models: We confirm the results (as nearly identical to AIC) from the tables by directly plotting every BIC value for every method. OLS performs the worst, the stepwise methods perform slightly better, and LASSO and Ridge models perform the best.
BIC_df <- data.frame(
   OLS = c(OLS_BIC_Data1, OLS_BIC_Data2, OLS_BIC_Data3),
   LASSO = c(LASSO_BIC_Data1, LASSO_BIC_Data2, LASSO_BIC_Data3),
   Ridge = c(Rid_BIC_Data1, Rid_BIC_Data2, Rid_BIC_Data3),
   `All Possible Regression` = c(unlist(All_BIC_Data1), unlist(All_BIC_Data2), unlist(All_BIC_Data3)),
   `Forward Selection` = c(unlist(For_BIC_Data1), unlist(For_BIC_Data2), unlist(For_BIC_Data3)),
   `Backward Elimination` = c(unlist(Back_BIC_Data1), unlist(Back_BIC_Data2), unlist(Back_BIC_Data3)),
   `Stepwise Selection` = c(unlist(Step_BIC_Data1), unlist(Step_BIC_Data2),unlist(Step_BIC_Data3))
)

gathered_data <- gather(BIC_df, key = "key", value = "value")

colnames(gathered_data) <- c("key","value")

gathered_data %>%
  ggplot( aes(reorder(key, value, FUN = median, .desc =TRUE), y=value, fill=key)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.5) +
    geom_jitter(color="black", size=0.4, alpha=0.8) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)) +
    coord_flip() +
    labs(title = "BIC Distribution Over All Models",
      x="Variable Selection Technique",
      y="BIC")

  • Plotting Adjusted R-Squared Values Over All Models: We again confirm the results from the tables by directly plotting every Adjusted R-Squared value for every method. OLS performs the worst, the stepwise methods perform slightly better, and the LASSO and Ridge models perform the best. It is important to mention that although the performance differences appear less pronounced here than for AIC and BIC, the Adjusted R-Squared is bounded above by 1 (and falls between 0 and 1 here), so its scale is much more compressed than that of AIC and BIC.
R2_df <- data.frame(
   OLS = c(OLS_R2_Data1, OLS_R2_Data2, OLS_R2_Data3),
   LASSO = c(LASSO_R2_Data1, LASSO_R2_Data2, LASSO_R2_Data3),
   Ridge = c(Rid_R2_Data1, Rid_R2_Data2, Rid_R2_Data3),
   `All Possible Regression` = c(unlist(All_R2_Data1), unlist(All_R2_Data2), unlist(All_R2_Data3)),
   `Forward Selection` = c(unlist(For_R2_Data1), unlist(For_R2_Data2), unlist(For_R2_Data3)),
   `Backward Elimination` = c(unlist(Back_R2_Data1), unlist(Back_R2_Data2), unlist(Back_R2_Data3)),
   `Stepwise Selection` = c(unlist(Step_R2_Data1), unlist(Step_R2_Data2),unlist(Step_R2_Data3))
)

gathered_data <- gather(R2_df, key = "key", value = "value")

colnames(gathered_data) <- c("key","value")

gathered_data %>%
  ggplot( aes(reorder(key, value, FUN = median, .desc =TRUE), y=value, fill=key)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.5) +
    geom_jitter(color="black", size=0.4, alpha=0.8) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)) +
    coord_flip() +
    labs(title ="Adjusted R-Squared Distribution Over All Models",
      x="Variable Selection Technique",
      y=expression(paste('Adjusted',R^2)))

  • Plotting AIC and BIC Values for Every Method: We again confirm the same results as from the earlier plots. Additionally, we note that for OLS and the stepwise methods the BIC values are slightly higher than the AIC values, while for the LASSO and Ridge models the AIC values are slightly higher.
Table_Full_New <- rbind(Table_Full[,c(1,2,3,5,6)],Table_Full[,c(1,8,9,11,12)],Table_Full[,c(1,14,15,17,18)])
gathered_Full <- gather(Table_Full_New , key = "key", value = "value",-Method)
colnames(gathered_Full) <- c("method","metric","value")

ggplot(data = gathered_Full , aes(x=method, y=value)) + 
  geom_boxplot(aes(fill=metric)) +
  facet_wrap( ~ method, scales="free") +
  theme_light()

  • Plotting Mean and Median Adjusted \(R^2\) Values for Every Method: We again confirm the same results as from the earlier plots. Interestingly, the median Adjusted \(R^2\) is slightly higher than the mean Adjusted \(R^2\).
Table_Full_New <- rbind(Table_Full[,c(1,4,7)],Table_Full[,c(1,10,13)],Table_Full[,c(1,16,19)])
gathered_Full <- gather(Table_Full_New , key = "key", value = "value",-Method)
colnames(gathered_Full) <- c("method","metric","value")

ggplot(data = gathered_Full , aes(x=method, y=value)) + 
  geom_boxplot(aes(fill=metric)) +
  facet_wrap( ~ method, scales="free") +
  theme_light()

  • Plotting Performances by Model: Although we aim to compare variable selection methods, it can be interesting to consider the performance metrics by underlying model. We observe that overall the AIC, BIC, and Adjusted \(R^2\) are better for models 1 and 3 than for model 2. These results make sense intuitively, since we are comparing variable selection methods, and a linear model with exclusively non-zero coefficients (model 2) will lead to poorer performance than linear models that include zero coefficients (such as models 1 and 3 here).
AIC_ByMod_df <- data.frame(
   'Model-1' = c(OLS_AIC_Data1, LASSO_AIC_Data1, Rid_AIC_Data1,unlist(All_AIC_Data1),unlist(Back_AIC_Data1),unlist(Step_AIC_Data1)),
   'Model-2' = c(OLS_AIC_Data2, LASSO_AIC_Data2, Rid_AIC_Data2,unlist(All_AIC_Data2),unlist(Back_AIC_Data2),unlist(Step_AIC_Data2)),
   'Model-3' = c(OLS_AIC_Data3, LASSO_AIC_Data3, Rid_AIC_Data3,unlist(All_AIC_Data3),unlist(Back_AIC_Data3),unlist(Step_AIC_Data3))
)

BIC_ByMod_df <- data.frame(
   'Model-1' = c(OLS_BIC_Data1, LASSO_BIC_Data1, Rid_BIC_Data1,unlist(All_BIC_Data1),unlist(Back_BIC_Data1),unlist(Step_BIC_Data1)),
   'Model-2' = c(OLS_BIC_Data2, LASSO_BIC_Data2, Rid_BIC_Data2,unlist(All_BIC_Data2),unlist(Back_BIC_Data2),unlist(Step_BIC_Data2)),
   'Model-3' = c(OLS_BIC_Data3, LASSO_BIC_Data3, Rid_BIC_Data3,unlist(All_BIC_Data3),unlist(Back_BIC_Data3),unlist(Step_BIC_Data3))
)

R2_ByMod_df <- data.frame(
   'Model-1' = c(OLS_R2_Data1, LASSO_R2_Data1, Rid_R2_Data1,unlist(All_R2_Data1),unlist(Back_R2_Data1),unlist(Step_R2_Data1)),
   'Model-2' = c(OLS_R2_Data2, LASSO_R2_Data2, Rid_R2_Data2,unlist(All_R2_Data2),unlist(Back_R2_Data2),unlist(Step_R2_Data2)),
   'Model-3' = c(OLS_R2_Data3, LASSO_R2_Data3, Rid_R2_Data3,unlist(All_R2_Data3),unlist(Back_R2_Data3),unlist(Step_R2_Data3))
)

AIC_ByMod_gather <- gather(AIC_ByMod_df, key = "key", value = "value")
colnames(AIC_ByMod_gather) <- c("key","value")

BIC_ByMod_gather <- gather(BIC_ByMod_df, key = "key", value = "value")
colnames(BIC_ByMod_gather) <- c("key","value")

R2_ByMod_gather <- gather(R2_ByMod_df, key = "key", value = "value")
colnames(R2_ByMod_gather) <- c("key","value")

AIC_ByMod_gather %>%
  ggplot( aes(reorder(key, value, FUN = median, .desc =TRUE), y=value, fill=key)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.5) +
    geom_jitter(color="black", size=0.4, alpha=0.8) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)) +
    coord_flip() +
    labs(title = "AIC Distribution By Model",
      x="Variable Selection Technique",
      y="AIC")

BIC_ByMod_gather %>%
  ggplot( aes(reorder(key, value, FUN = median, .desc =TRUE), y=value, fill=key)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.5) +
    geom_jitter(color="black", size=0.4, alpha=0.8) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)) +
    coord_flip() +
    labs(title = "BIC Distribution By Model",
      x="Variable Selection Technique",
      y="BIC")

R2_ByMod_gather %>%
  ggplot( aes(reorder(key, value, FUN = median, .desc =TRUE), y=value, fill=key)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.5) +
    geom_jitter(color="black", size=0.4, alpha=0.8) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)) +
    coord_flip() +
    labs(title = "R-Squared Distribution By Model",
      x="Variable Selection Technique",
      y=expression(paste(R^2)))

4.3 Discussion and Conclusion

The observation that LASSO and Ridge outperform the stepwise methods (and OLS), and that LASSO outperforms Ridge, is in line with the recommendations of Hastie, Tibshirani, and Friedman (2009). Furthermore, there seem to be more modern variable selection methods that outperform LASSO and Ridge (and could be considered in a future project), which follow a mixed integer optimization approach to choosing best subsets, as outlined in the paper by Bertsimas, King, and Mazumder (2015).

5 References

5.1 Acknowledgement

I would like to thank Professor Mihye Ahn for the recommendation of doing such a personal project to supplement her regression and linear models course.

5.2 Papers

  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society (Series B), 58, 267-288.

  • Hocking, R. (1976). A Biometrics Invited Paper. The Analysis and Selection of Variables in Linear Regression. Biometrics, 32, 1-49.

  • Portet S. (2020). A primer on model selection using the Akaike Information Criterion. Infectious Disease Modelling, 5, 111–128.

  • Dziak, J. J., Coffman, D. L., Lanza, S. T., Li, R., & Jermiin, L. S. (2020). Sensitivity and specificity of information criteria. Briefings in bioinformatics, 21(2), 553–565.

  • Bertsimas, D., King, A., & Mazumder, R. (2015). Best Subset Selection via a Modern Optimization Lens. arXiv: Methodology.

5.3 Books

  • Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. New York: Springer.

  • James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.

5.4 R Packages

  • glmnet
  • knitr
  • foreach
  • MASS
  • rlist
  • dvmisc
  • kableExtra
  • ggplot2
  • tidyverse
  • hrbrthemes
  • viridis
  • olsrr
  • extrafont