Preamble

Stepwise regression assumes that the predictor variables are not highly correlated. At each step, a variable is considered for addition to or removal from the set of predictors based on a pre-specified criterion (e.g. AIC or adjusted R-squared). The two main approaches are forward selection, which starts with no variables in the model, and backward elimination, which starts with all candidate predictors.
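
As a minimal sketch, forward selection can be run with base R's step(); the data frame dat and response y here are placeholders, not objects from this analysis:

# forward selection sketch: start from the intercept-only model and
# add one predictor at a time while AIC improves ('dat' and 'y' are placeholders)
null_model <- lm(y ~ 1, data = dat)
full_model <- lm(y ~ ., data = dat)
step(null_model,
     scope = list(lower = formula(null_model), upper = formula(full_model)),
     direction = "forward")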

Lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

The elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. The elastic net penalty is controlled by alpha, and bridges the gap between lasso (alpha=1) and ridge (alpha=0). Note that the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one and discard the others. Ridge is generally good at prediction but tends to be less interpretable.
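
To make the mixing explicit, here is a sketch of the penalty as parameterized in glmnet (X and y are placeholders):

# glmnet's elastic net penalty for coefficients beta:
#   lambda * ( (1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta)) )
# alpha interpolates between the two penalties:
# fit_ridge <- glmnet(X, y, alpha = 0)    # pure ridge (L2)
# fit_mixed <- glmnet(X, y, alpha = 0.5)  # equal mix
# fit_lasso <- glmnet(X, y, alpha = 1)    # pure lasso (L1)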

Here, we will evaluate the usefulness of all three methods using a well-described dataset with the “crime rate” in 47 states of the USA for 1960 as the response variable and 15 predictor variables. This may seem an odd choice, as more recent data with richer sets of predictor features are certainly available. However, a limited dataset with a limited number of features, some of which are linearly correlated, allows us to more easily interpret differences between the regression methods and to evaluate the usefulness of principal component analysis in combination with each of them.

# clear the environment and set seed
rm(list = ls())
set.seed(123)

library(MASS)
library(glmnet)
library(ggplot2)
library(gridExtra)
library(caret)

Stepwise Regression

Load the crime dataset and display a summary. Crime is the response variable; the other variables are predictors.

dfC <- read.csv("http://www.statsci.org/data/general/uscrime.txt", sep="\t")
#print summary
summary(dfC)
##        M               So               Ed             Po1       
##  Min.   :11.90   Min.   :0.0000   Min.   : 8.70   Min.   : 4.50  
##  1st Qu.:13.00   1st Qu.:0.0000   1st Qu.: 9.75   1st Qu.: 6.25  
##  Median :13.60   Median :0.0000   Median :10.80   Median : 7.80  
##  Mean   :13.86   Mean   :0.3404   Mean   :10.56   Mean   : 8.50  
##  3rd Qu.:14.60   3rd Qu.:1.0000   3rd Qu.:11.45   3rd Qu.:10.45  
##  Max.   :17.70   Max.   :1.0000   Max.   :12.20   Max.   :16.60  
##       Po2               LF              M.F              Pop        
##  Min.   : 4.100   Min.   :0.4800   Min.   : 93.40   Min.   :  3.00  
##  1st Qu.: 5.850   1st Qu.:0.5305   1st Qu.: 96.45   1st Qu.: 10.00  
##  Median : 7.300   Median :0.5600   Median : 97.70   Median : 25.00  
##  Mean   : 8.023   Mean   :0.5612   Mean   : 98.30   Mean   : 36.62  
##  3rd Qu.: 9.700   3rd Qu.:0.5930   3rd Qu.: 99.20   3rd Qu.: 41.50  
##  Max.   :15.700   Max.   :0.6410   Max.   :107.10   Max.   :168.00  
##        NW              U1                U2            Wealth    
##  Min.   : 0.20   Min.   :0.07000   Min.   :2.000   Min.   :2880  
##  1st Qu.: 2.40   1st Qu.:0.08050   1st Qu.:2.750   1st Qu.:4595  
##  Median : 7.60   Median :0.09200   Median :3.400   Median :5370  
##  Mean   :10.11   Mean   :0.09547   Mean   :3.398   Mean   :5254  
##  3rd Qu.:13.25   3rd Qu.:0.10400   3rd Qu.:3.850   3rd Qu.:5915  
##  Max.   :42.30   Max.   :0.14200   Max.   :5.800   Max.   :6890  
##       Ineq            Prob              Time           Crime       
##  Min.   :12.60   Min.   :0.00690   Min.   :12.20   Min.   : 342.0  
##  1st Qu.:16.55   1st Qu.:0.03270   1st Qu.:21.60   1st Qu.: 658.5  
##  Median :17.60   Median :0.04210   Median :25.80   Median : 831.0  
##  Mean   :19.40   Mean   :0.04709   Mean   :26.60   Mean   : 905.1  
##  3rd Qu.:22.75   3rd Qu.:0.05445   3rd Qu.:30.45   3rd Qu.:1057.5  
##  Max.   :27.60   Max.   :0.11980   Max.   :44.00   Max.   :1993.0
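
Before scaling, it is worth confirming the collinearity mentioned in the preamble; a quick check:

# inspect pairwise correlations among the predictors; for example,
# Po1 and Po2 (police expenditure in 1960 and 1959) are highly correlated
round(cor(dfC[,-16]), 2)  # full correlation matrix of the predictors
cor(dfC$Po1, dfC$Po2)     # one strongly correlated pair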

Scale the predictor variables, except So. So is really a binary factor indicating whether a state is considered Southern and, therefore, should not be scaled.

colNames <- colnames(dfC[,-2])[1:14] # names of the 14 predictors other than So (Crime excluded)

normalize <- function(df, cols) {
  result <- df # make a copy of the input data frame

  for (j in cols) { # each specified column
    m <- mean(df[,j]) # column mean
    std <- sd(df[,j]) # column (sample) sd

    result[,j] <- (result[,j] - m) / std # vectorized; no per-element sapply needed
  }
  return(result)
}

#normalize predictors except 'So'
dfC.norm <- normalize(dfC, colNames)
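
Equivalently, base R's scale() performs the same centering and scaling; a sketch:

# the same standardization with base R's scale()
dfC.alt <- dfC
dfC.alt[, colNames] <- scale(dfC[, colNames])
all.equal(dfC.alt, dfC.norm) # should report TRUE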

Perform a backward stepwise regression with repeated cross-validation. The output of the individual iterations is suppressed; the results of the final step are given in the commented section of the next code block.

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

lmFit_Step <- train(Crime ~ ., data = dfC.norm, "lmStepAIC", scope = 
                 list(lower = Crime~1, upper = Crime~.), direction = "backward",
                 trControl=ctrl)

# Step:  AIC=503.93
# .outcome ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
# 
#        Df Sum of Sq     RSS    AIC
# <none>              1453068 503.93
# - M.F   1    103159 1556227 505.16
# - U1    1    127044 1580112 505.87
# - Prob  1    247978 1701046 509.34
# - U2    1    255443 1708511 509.55
# - M     1    296790 1749858 510.67
# - Ed    1    445788 1898855 514.51
# - Ineq  1    738244 2191312 521.24
# - Po1   1   1672038 3125105 537.93

Develop a new model with the eight variables found with stepwise regression. This yields an adjusted R-squared of 0.7444.

mod_Step = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = dfC.norm)
summary(mod_Step)
## 
## Call:
## lm(formula = Crime ~ M.F + U1 + Prob + U2 + M + Ed + Ineq + Po1, 
##     data = dfC.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      28.52  31.731  < 2e-16 ***
## M.F            65.83      40.08   1.642  0.10874    
## U1           -109.73      60.20  -1.823  0.07622 .  
## Prob          -86.31      33.89  -2.547  0.01505 *  
## U2            158.22      61.22   2.585  0.01371 *  
## M             117.28      42.10   2.786  0.00828 ** 
## Ed            201.50      59.02   3.414  0.00153 ** 
## Ineq          244.70      55.69   4.394 8.63e-05 ***
## Po1           305.07      46.14   6.613 8.26e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Next, use cross-validation to see how good this model really is. Because we only have 47 data points, let’s use 47-fold cross-validation (equivalent to leave-one-out cross-validation).

SStot <- sum((dfC.norm$Crime - mean(dfC.norm$Crime))^2)
totsse <- 0

for(i in 1:nrow(dfC.norm)) {
  mod_Step_i = lm(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, data = dfC.norm[-i,])
  pred_i <- predict(mod_Step_i, newdata=dfC.norm[i,])
  totsse <- totsse + ((pred_i - dfC.norm[i,16])^2)
}

R2_mod <- 1 - totsse/SStot
R2_mod
##        1 
## 0.667621
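
Since this loop recurs for every model below, a small helper function (a sketch) would keep the code DRY; the remaining sections keep the explicit loops for clarity:

# leave-one-out cross-validated R-squared for an lm formula (sketch)
loocv_r2 <- function(formula, data, response = "Crime") {
  sse <- 0
  for (i in seq_len(nrow(data))) {
    fit <- lm(formula, data = data[-i, ])
    sse <- sse + (predict(fit, newdata = data[i, ]) - data[i, response])^2
  }
  1 - sse / sum((data[[response]] - mean(data[[response]]))^2)
}
# loocv_r2(Crime ~ M.F+U1+Prob+U2+M+Ed+Ineq+Po1, dfC.norm)  # ~0.668, per the loop above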

Note that in the previous model, the p-value for M.F is higher than 0.1. Let’s see what happens if it is removed.

mod_Step = lm(Crime ~ U1+Prob+U2+M+Ed+Ineq+Po1, data = dfC.norm)
summary(mod_Step)
## 
## Call:
## lm(formula = Crime ~ U1 + Prob + U2 + M + Ed + Ineq + Po1, data = dfC.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -520.76 -105.67    9.53  136.28  519.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      29.14  31.062  < 2e-16 ***
## U1            -63.86      54.48  -1.172   0.2482    
## Prob          -84.83      34.61  -2.451   0.0188 *  
## U2            134.13      60.71   2.209   0.0331 *  
## M             134.20      41.70   3.218   0.0026 ** 
## Ed            244.38      54.07   4.520 5.62e-05 ***
## Ineq          264.65      55.52   4.767 2.61e-05 ***
## Po1           314.89      46.73   6.738 4.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.8 on 39 degrees of freedom
## Multiple R-squared:  0.7738, Adjusted R-squared:  0.7332 
## F-statistic: 19.06 on 7 and 39 DF,  p-value: 8.805e-11

Now it appears that U1 is not all that significant. Let’s remove it as well and re-run the model.

mod_Step = lm(Crime ~ Prob+U2+M+Ed+Ineq+Po1, data = dfC.norm)
summary(mod_Step)
## 
## Call:
## lm(formula = Crime ~ Prob + U2 + M + Ed + Ineq + Po1, data = dfC.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      29.27  30.918  < 2e-16 ***
## Prob          -86.44      34.74  -2.488  0.01711 *  
## U2             75.47      34.55   2.185  0.03483 *  
## M             131.98      41.85   3.154  0.00305 ** 
## Ed            219.79      50.07   4.390 8.07e-05 ***
## Ineq          269.91      55.60   4.855 1.88e-05 ***
## Po1           341.84      40.87   8.363 2.56e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11

Next, use cross-validation to evaluate this latest model. The simpler model has a cross-validated R-squared of 0.666, almost identical to that of the model which included M.F and U1. This suggests that these features may be omitted from the final model.

SStot <- sum((dfC.norm$Crime - mean(dfC.norm$Crime))^2)
totsse <- 0

for(i in 1:nrow(dfC.norm)) {
  mod_Step_i = lm(Crime ~ Prob+U2+M+Ed+Ineq+Po1, data = dfC.norm[-i,])
  pred_i <- predict(mod_Step_i,newdata=dfC.norm[i,])
  totsse <- totsse + ((pred_i - dfC.norm[i,16])^2)
}
R2_mod <- 1 - totsse/SStot
R2_mod
##         1 
## 0.6661638

The same result can be obtained with step, as shown below (output not shown).

model <- lm(Crime ~ ., data = dfC.norm)
step(model, direction = "backward")
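
step() can also search in both directions, letting variables re-enter after removal; for example:

# bidirectional search: variables may be added back after removal
step(model, direction = "both", trace = 0) # trace = 0 suppresses the iteration log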

Lasso Regression

Prepare the data for use in lasso regression.

#build the lasso model with 5-fold cross-validation

XP = data.matrix(dfC.norm[,-16]) # predictor matrix
YP = data.matrix(dfC.norm$Crime) # response vector
lasso = cv.glmnet(x=XP, y=YP, alpha=1,
                nfolds = 5, type.measure="mse", family="gaussian")
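
Besides lambda.min, cv.glmnet also reports the more conservative lambda.1se; plotting the fitted object shows the cross-validation curve (a sketch using standard glmnet functionality):

plot(lasso)                   # CV MSE vs log(lambda); lambda.min and lambda.1se are marked
lasso$lambda.1se              # largest lambda within one SE of the minimum CV error
coef(lasso, s = "lambda.1se") # typically a sparser set of coefficients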

Output the coefficients of the variables selected by lasso.

coef(lasso, s=lasso$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 889.057299
## M            85.003217
## So           47.081683
## Ed          124.290064
## Po1         308.889820
## Po2           .       
## LF            1.116734
## M.F          52.037497
## Pop           .       
## NW            4.443519
## U1          -22.257315
## U2           55.981384
## Wealth        .       
## Ineq        180.588656
## Prob        -81.812835
## Time          .

Fit a new model with ten of the eleven variables retained by the lasso (U1 is dropped). The adjusted R-squared is slightly lower than that obtained with stepwise regression.

mod_lasso = lm(Crime ~So+M+Ed+Po1+LF+M.F+NW+U2+Ineq+Prob, data = dfC.norm)
summary(mod_lasso)
## 
## Call:
## lm(formula = Crime ~ So + M + Ed + Po1 + LF + M.F + NW + U2 + 
##     Ineq + Prob, data = dfC.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -410.55 -121.42    5.76  110.54  550.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  863.284     53.510  16.133  < 2e-16 ***
## So           122.792    129.896   0.945  0.35080    
## M            113.614     49.962   2.274  0.02903 *  
## Ed           188.755     64.750   2.915  0.00608 ** 
## Po1          333.470     49.353   6.757 6.86e-08 ***
## LF            25.162     49.588   0.507  0.61496    
## M.F           31.513     43.179   0.730  0.47021    
## NW            -2.883     59.595  -0.048  0.96169    
## U2            72.881     41.795   1.744  0.08973 .  
## Ineq         229.121     68.845   3.328  0.00203 ** 
## Prob         -99.145     40.243  -2.464  0.01867 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 206.6 on 36 degrees of freedom
## Multiple R-squared:  0.7767, Adjusted R-squared:  0.7147 
## F-statistic: 12.53 on 10 and 36 DF,  p-value: 5.374e-09

Use cross-validation to evaluate the model. Note that four of the variables, namely So, LF, M.F, and NW, do not appear to be significant. In fact, without these variables, the model is the same as that obtained with stepwise regression.

SStot <- sum((dfC.norm$Crime - mean(dfC.norm$Crime))^2)
totsse <- 0

for(i in 1:nrow(dfC.norm)) {
  mod_lasso_i = lm(Crime ~ So+M+Ed+Po1+LF+M.F+NW+U2+Ineq+Prob, data = dfC.norm[-i,])
  pred_i <- predict(mod_lasso_i,newdata=dfC.norm[i,])
  totsse <- totsse + ((pred_i - dfC.norm[i,16])^2)
}

R2_mod <- 1 - totsse/SStot
R2_mod
##         1 
## 0.5840486

Elastic Net

The alpha value is varied from 0 to 1 in steps of 0.1, and the resulting R-squared value is calculated for each.

R2=c()

for (i in 0:10) {
  mod_elastic = cv.glmnet(x=as.matrix(dfC.norm[,-16]),y=as.matrix(dfC.norm$Crime),
                          alpha=i/10,nfolds = 5,type.measure="mse",family="gaussian")


#dev.ratio gives the fraction of (null) deviance explained,
#equivalent to R-squared in the Gaussian regression case
  
  R2 = cbind(R2, mod_elastic$glmnet.fit$dev.ratio[which(mod_elastic$glmnet.fit$lambda == mod_elastic$lambda.min)])

}

R2
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.7129289 0.7733668 0.7549545 0.7782892 0.7551427 0.7921597 0.7921574
##           [,8]      [,9]     [,10]     [,11]
## [1,] 0.7636336 0.7552367 0.7569563 0.7931003
alpha_best = (which.max(R2)-1)/10
alpha_best
## [1] 1
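
Note that cv.glmnet re-randomizes the fold assignment on every call, so comparisons across alpha values are noisy. The glmnet documentation recommends fixing foldid when tuning alpha; a sketch, using CV error at lambda.min rather than dev.ratio:

# evaluate every alpha on the same fold assignment
foldid <- sample(rep(1:5, length.out = nrow(dfC.norm)))
cv_err <- sapply(0:10/10, function(a) {
  cv <- cv.glmnet(x=as.matrix(dfC.norm[,-16]), y=as.matrix(dfC.norm$Crime),
                  alpha=a, foldid=foldid, type.measure="mse", family="gaussian")
  min(cv$cvm)                # CV error at lambda.min
})
(0:10/10)[which.min(cv_err)] # alpha minimizing the common-fold CV error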

Use the “best” alpha value to build the model.

Elastic_net=cv.glmnet(x=as.matrix(dfC.norm[,-16]),y=as.matrix(dfC.norm$Crime),alpha=alpha_best,
                      nfolds = 5,type.measure="mse",family="gaussian")

#Output the coefficients of the variables selected by Elastic Net

coef(Elastic_net, s=Elastic_net$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 888.760971
## M            82.888559
## So           47.952148
## Ed          116.058564
## Po1         310.189409
## Po2           .       
## LF            2.228622
## M.F          49.818594
## Pop           .       
## NW            3.623195
## U1          -13.837819
## U2           46.698552
## Wealth        .       
## Ineq        175.547425
## Prob        -80.415268
## Time          .

The elastic net retains more variables than the lasso (10) or stepwise (8); the model fitted below uses 12. (Because cv.glmnet assigns folds at random, the exact set of selected variables can vary from run to run.) Next, compare how this new model performs against the lasso and stepwise models.

mod_Elastic_net = lm(Crime ~So+M+Ed+Po1+M.F+Pop+NW+U1+U2+Wealth+Ineq+Prob, data = dfC.norm)
summary(mod_Elastic_net)
## 
## Call:
## lm(formula = Crime ~ So + M + Ed + Po1 + M.F + Pop + NW + U1 + 
##     U2 + Wealth + Ineq + Prob, data = dfC.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -434.18 -107.01   18.55  115.88  470.32 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   897.29      51.91  17.286  < 2e-16 ***
## So             22.89     125.35   0.183  0.85621    
## M             112.71      49.35   2.284  0.02876 *  
## Ed            195.70      62.94   3.109  0.00378 ** 
## Po1           293.18      64.99   4.511 7.32e-05 ***
## M.F            48.92      48.12   1.017  0.31656    
## Pop           -33.25      45.63  -0.729  0.47113    
## NW             19.16      57.71   0.332  0.74195    
## U1            -89.76      65.68  -1.367  0.18069    
## U2            140.78      66.77   2.108  0.04245 *  
## Wealth         83.30      95.53   0.872  0.38932    
## Ineq          285.77      85.19   3.355  0.00196 ** 
## Prob          -92.75      41.12  -2.255  0.03065 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.6 on 34 degrees of freedom
## Multiple R-squared:  0.7971, Adjusted R-squared:  0.7255 
## F-statistic: 11.13 on 12 and 34 DF,  p-value: 1.52e-08

The adjusted R-squared (0.7255) is similar to that obtained with lasso and stepwise regression. Let’s use cross-validation to evaluate the model.

SStot <- sum((dfC.norm$Crime - mean(dfC.norm$Crime))^2)
totsse <- 0

for(i in 1:nrow(dfC.norm)) {
  mod_EN_i = lm(Crime ~ So+M+Ed+Po1+M.F+Pop+NW+U1+U2+Wealth+Ineq+Prob, data = dfC.norm[-i,])
  pred_i <- predict(mod_EN_i, newdata=dfC.norm[i,])
  totsse <- totsse + ((pred_i - dfC.norm[i,16])^2)
}
R2_mod <- 1 - totsse/SStot
R2_mod
##        1 
## 0.485607

The resultant cross-validated R-squared is much worse. This is not surprising: for most of the variables, the p-values indicate they are not significant. If the insignificant variables were removed, one would end up with a model akin to that obtained with lasso or stepwise regression.

Build stepwise, Lasso, and Elastic Net models using Principal Component Analysis

The following code block performs PCA on the scaled predictors.

pca <- prcomp(dfC[,1:15], scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.4534 1.6739 1.4160 1.07806 0.97893 0.74377
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688
## Cumulative Proportion  0.4013 0.5880 0.7217 0.79920 0.86308 0.89996
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.56729 0.55444 0.48493 0.44708 0.41915 0.35804
## Proportion of Variance 0.02145 0.02049 0.01568 0.01333 0.01171 0.00855
## Cumulative Proportion  0.92142 0.94191 0.95759 0.97091 0.98263 0.99117
##                           PC13   PC14    PC15
## Standard deviation     0.26333 0.2418 0.06793
## Proportion of Variance 0.00462 0.0039 0.00031
## Cumulative Proportion  0.99579 0.9997 1.00000

The following are useful visualizations when deciding how many principal components to choose.

screeplot(pca, type="lines",col="blue")

var <- pca$sdev^2
propvar <- var/sum(var)

plot(propvar, xlab = "Principal Component", ylab = "Proportion of Variance Explained",ylim = c(0,1), type = "b")

plot(cumsum(propvar), xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0,1), type = "b")
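
A simple numeric rule (a sketch) is to keep the smallest number of PCs that reaches a chosen cumulative-variance threshold, e.g. 90%:

# smallest k explaining at least 90% of the variance
k <- which(cumsum(propvar) >= 0.90)[1]
k # 7, per the cumulative proportions in the summary above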

Here, let’s use all the principal components (PCs) instead of the original variables and evaluate the performance of the three models above.

# Creating a dataframe of response variable and PCs
PCcrime <- as.data.frame(cbind(pca$x, dfC[,16]))
colnames(PCcrime)[16] <- "Crime"
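
One benefit of this transformation is that, unlike the original predictors, the PC scores are mutually uncorrelated; a quick sketch to verify:

# largest off-diagonal correlation among the PC scores: effectively zero
max(abs(cor(PCcrime[,1:15])[lower.tri(diag(15))]))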

Stepwise Regression using PCs and Cross Validation

Use the code block below to perform backward stepwise selection with repeated five-fold cross-validation. The final model is shown in the commented section.

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
set.seed(1)
lmFit_Step_PC <- train(Crime ~ ., data = PCcrime, "lmStepAIC", scope = 
                      list(lower = Crime~1, upper = Crime~.), direction = "backward",trControl=ctrl)

# Step:  AIC=507.37
# .outcome ~ PC1 + PC2 + PC4 + PC5 + PC6 + PC7 + PC12 + PC14 + 
#     PC15
# 
#        Df Sum of Sq     RSS    AIC
# <none>              1498159 507.37
# - PC15  1     82173 1580332 507.88
# - PC6   1     92261 1590420 508.18
# - PC14  1    129212 1627371 509.26
# - PC7   1    203535 1701694 511.36
# - PC4   1    257832 1755991 512.83
# - PC12  1    494595 1992754 518.78
# - PC2   1    633037 2131196 521.94
# - PC1   1   1177568 2675727 532.63
# - PC5   1   2312556 3810715 549.25

Fit a new model with these nine PCs.

mod_Step_PC = lm(Crime ~ PC15+PC6+PC14+PC7+PC4+PC12+PC2+PC1+PC5, data = PCcrime)
summary(mod_Step_PC)
## 
## Call:
## lm(formula = Crime ~ PC15 + PC6 + PC14 + PC7 + PC4 + PC12 + PC2 + 
##     PC1 + PC5, data = PCcrime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -351.54  -93.58  -20.42  121.74  553.91 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      29.35  30.836  < 2e-16 ***
## PC15         -622.21     436.77  -1.425 0.162660    
## PC6           -60.21      39.89  -1.509 0.139665    
## PC14          219.19     122.70   1.786 0.082235 .  
## PC7           117.26      52.30   2.242 0.031040 *  
## PC4            69.45      27.52   2.523 0.016048 *  
## PC12          289.61      82.87   3.495 0.001249 ** 
## PC2           -70.08      17.72  -3.954 0.000334 ***
## PC1            65.22      12.09   5.393 4.17e-06 ***
## PC5          -229.04      30.31  -7.557 5.20e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201.2 on 37 degrees of freedom
## Multiple R-squared:  0.7823, Adjusted R-squared:  0.7293 
## F-statistic: 14.77 on 9 and 37 DF,  p-value: 8.755e-10

We obtain an adjusted R-squared of 0.729 using the nine PCs selected by backward stepwise regression with cross-validation. This is slightly lower than the value obtained by the same method on the original variables. Note that PC15 and PC6 are not significant, so we remove them before cross-validating, as shown below.

SStot <- sum((dfC$Crime - mean(dfC$Crime))^2)
totsse <- 0
for(i in 1:nrow(PCcrime)) {
  mod_step_PC_i = lm(Crime ~ PC14+PC7+PC4+PC12+PC2+PC1+PC5, data = PCcrime[-i,])
  pred_i <- predict(mod_step_PC_i, newdata=PCcrime[i,])
  totsse <- totsse + ((pred_i - PCcrime[i,16])^2)
}
R2_mod <- 1 - totsse/SStot
R2_mod
##         1 
## 0.6271988
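
A model on PCs can still be interpreted in terms of the original predictors by rotating the coefficients back; a sketch using mod_Step_PC and pca from above (intercept not handled):

# express the PC coefficients in terms of the (scaled) original variables
b_pc <- coef(mod_Step_PC)[-1]                    # drop the intercept
b_scaled <- pca$rotation[, names(b_pc)] %*% b_pc # coefficients on scaled predictors
b_orig <- b_scaled / pca$scale                   # undo the predictor scaling
round(b_orig, 2)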

Lasso Regression using PCs and Cross-Validation

Build the lasso model and display the coefficients of the variables it selects.

XP = data.matrix(PCcrime[,-16]) # matrix of PC scores
YP = data.matrix(PCcrime$Crime) # response vector
lasso_PC = cv.glmnet(x=XP, y=YP, alpha=1,
                nfolds = 5, type.measure="mse", family="gaussian")

coef(lasso_PC, s=lasso_PC$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  905.085106
## PC1           61.350753
## PC2          -64.418018
## PC3           18.497103
## PC4           60.649976
## PC5         -219.356045
## PC6          -47.463846
## PC7          100.540212
## PC8           11.613371
## PC9          -17.620884
## PC10          35.107541
## PC11           7.970122
## PC12         263.128175
## PC13          45.775592
## PC14         179.970042
## PC15        -482.612769

Note that at lambda.min the lasso retains all 15 PCs. Fit a new model with twelve of them, dropping the three with the smallest coefficients (PC8, PC9, and PC11), compared to the ten original variables used earlier.

mod_lasso_PC = lm(Crime ~PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC10+PC12+PC13+PC14+PC15, data = PCcrime)
summary(mod_lasso_PC)
## 
## Call:
## lm(formula = Crime ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + 
##     PC10 + PC12 + PC13 + PC14 + PC15, data = PCcrime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -404.62  -85.20   -6.35  111.78  526.89 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      29.48  30.698  < 2e-16 ***
## PC1            65.22      12.15   5.369 5.70e-06 ***
## PC2           -70.08      17.80  -3.936 0.000389 ***
## PC3            25.19      21.05   1.197 0.239582    
## PC4            69.45      27.64   2.512 0.016913 *  
## PC5          -229.04      30.44  -7.523 9.82e-09 ***
## PC6           -60.21      40.07  -1.503 0.142142    
## PC7           117.26      52.53   2.232 0.032313 *  
## PC10           56.32      66.66   0.845 0.404101    
## PC12          289.61      83.24   3.479 0.001398 ** 
## PC13           81.79     113.18   0.723 0.474838    
## PC14          219.19     123.25   1.778 0.084288 .  
## PC15         -622.21     438.74  -1.418 0.165237    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.1 on 34 degrees of freedom
## Multiple R-squared:  0.7981, Adjusted R-squared:  0.7269 
## F-statistic:  11.2 on 12 and 34 DF,  p-value: 1.408e-08

We obtain a similar adjusted R-squared of 0.7269 using the twelve selected PCs, versus the ten original variables used with lasso.

Let’s see how this cross-validates.

SStot <- sum((dfC$Crime - mean(dfC$Crime))^2)
totsse <- 0

for(i in 1:nrow(PCcrime)) {
  mod_lasso_i = lm(Crime ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC10+PC12+PC13+PC14+PC15, data = PCcrime[-i,])
  pred_i <- predict(mod_lasso_i,newdata=PCcrime[i,])
  totsse <- totsse + ((pred_i - PCcrime[i,16])^2)
}

R2_mod <- 1 - totsse/SStot
R2_mod
##         1 
## 0.5857863

The cross-validated R-squared is considerably worse than the value obtained with stepwise regression on the PCs. However, note that PC3, PC6, PC10, PC13, and PC15 do not appear to be significant. If we remove them, we get the same model as when we keep only the significant variables from the stepwise PC model.

Elastic Net using PCs and Cross-Validation

Develop an Elastic Net model using the PCs and display the best alpha value.

R2_PC=c()

for (i in 0:10) {
  model = cv.glmnet(x=as.matrix(PCcrime[,-16]),y=as.matrix(PCcrime$Crime),
                    alpha=i/10,nfolds = 5,type.measure="mse",family="gaussian")
  
  R2_PC = cbind(R2_PC,model$glmnet.fit$dev.ratio[which(model$glmnet.fit$lambda == model$lambda.min)])
  
}

R2_PC
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.7782769 0.7654027 0.7558492 0.7743684 0.8021926 0.7470497 0.7872682
##           [,8]      [,9]     [,10]     [,11]
## [1,] 0.7464576 0.7992367 0.7650309 0.7940698
alpha_best_PC = (which.max(R2_PC)-1)/10
alpha_best_PC
## [1] 0.4

Observe that the best alpha value is 0.4, a mix of the two penalties slightly nearer the ridge end. The R-squared values across the alpha grid are slightly higher here than with the original variables. Let’s build the model using this alpha value and display the coefficients of the variables selected by the elastic net. Next, we will develop a new model followed by cross-validation.

Elastic_net_PC=cv.glmnet(x=as.matrix(PCcrime[,-16]),y=as.matrix(PCcrime$Crime),alpha=alpha_best_PC,
                   nfolds = 5,type.measure="mse",family="gaussian")

coef(Elastic_net_PC, s=Elastic_net_PC$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  905.085106
## PC1           61.350753
## PC2          -64.418018
## PC3           18.497103
## PC4           60.649976
## PC5         -219.356045
## PC6          -47.463846
## PC7          100.540212
## PC8           11.613371
## PC9          -17.620884
## PC10          35.107541
## PC11           7.970122
## PC12         263.128175
## PC13          45.775592
## PC14         179.970042
## PC15        -482.612769

Fit a new model with ten of the PCs (PC8, PC9, PC10, PC11, and PC13 are dropped).

mod_Elastic_net_PC = lm(Crime ~PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC12+PC14+PC15, data = PCcrime)
summary(mod_Elastic_net_PC)
## 
## Call:
## lm(formula = Crime ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + 
##     PC12 + PC14 + PC15, data = PCcrime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -376.38  -92.82  -14.14   98.01  515.01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      29.17  31.029  < 2e-16 ***
## PC1            65.22      12.02   5.427 4.06e-06 ***
## PC2           -70.08      17.61  -3.979 0.000321 ***
## PC3            25.19      20.82   1.210 0.234195    
## PC4            69.45      27.35   2.539 0.015574 *  
## PC5          -229.04      30.12  -7.605 5.37e-09 ***
## PC6           -60.21      39.64  -1.519 0.137515    
## PC7           117.26      51.97   2.256 0.030239 *  
## PC12          289.61      82.35   3.517 0.001201 ** 
## PC14          219.19     121.94   1.798 0.080642 .  
## PC15         -622.21     434.06  -1.433 0.160350    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200 on 36 degrees of freedom
## Multiple R-squared:  0.7908, Adjusted R-squared:  0.7327 
## F-statistic: 13.61 on 10 and 36 DF,  p-value: 1.785e-09

Finally, cross-validate this model.

SStot <- sum((dfC$Crime - mean(dfC$Crime))^2)
totsse <- 0

for(i in 1:nrow(PCcrime)) {
  mod_EN_PC_i = lm(Crime ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC12+PC14+PC15, data = PCcrime[-i,])
  pred_i <- predict(mod_EN_PC_i, newdata=PCcrime[i,])
  totsse <- totsse + ((pred_i - PCcrime[i,16])^2)
}

R2_mod <- 1 - totsse/SStot
R2_mod
##         1 
## 0.6267523