library(tidyverse)
library(reshape2)
library(caret)
library(glmnet)
library(forecast)
library(gains)
We use BostonHousing.csv, which contains information on 506 US Census housing tracts in the Boston area.
Boston <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/BostonHousing.csv")
There are two types of modeling: explanatory and predictive. The goal of predictive modeling is to predict new data accurately, so we partition the data into training and validation sets. To build a model that predicts new data well, we hold out some of the data when fitting the model and then use that held-out data to test it. The held-out rows form the "validation set", while the rows used to build the model form the "training set": the training set is used to fit the model, and the validation set is used to assess its predictive performance.
An explanatory model using all observations:
MEDV = -28.81 - 0.26(CRIM) + 3.76(CHAS) + 8.28(RM)
selected.housing.vars <- c("MEDV", "CRIM", "CHAS", "RM")
# Explanatory model parameters
lm(MEDV ~ ., data = Boston[, selected.housing.vars])
##
## Call:
## lm(formula = MEDV ~ ., data = Boston[, selected.housing.vars])
##
## Coefficients:
## (Intercept) CRIM CHAS RM
## -28.8107 -0.2607 3.7630 8.2782
A predictive (training) model using ~75% of the observations:
MEDV = -30.75 - 0.27(CRIM) + 3.55(CHAS) + 8.54(RM)
Note that the betas (intercept and slope estimates) in the training model are similar to those in the explanatory model fit on all of the rows (the two sets of coefficients are compared side by side in a short sketch after the training-model summary below).
The training model may not explain the full data set quite as well, but it can be better suited to predicting new data.
# Partition data into training and validation sets
set.seed(100)
sample.housing <- sample(c(1:506), 380) # ~75% of the data
training.housing <- Boston[sample.housing, selected.housing.vars]
validation.housing <- Boston[-sample.housing, selected.housing.vars]
# Fit Multiple linear regression model, Median house price (MEDV) by the other 3 variables (.)
housing.lm <- lm(MEDV ~ ., data = training.housing)
summary(housing.lm)
##
## Call:
## lm(formula = MEDV ~ ., data = training.housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.710 -2.897 -0.182 2.342 39.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.75453 3.07439 -10.003 < 2e-16 ***
## CRIM -0.27386 0.03332 -8.218 3.39e-15 ***
## CHAS 3.54500 1.29196 2.744 0.00636 **
## RM 8.54200 0.48528 17.602 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.87 on 376 degrees of freedom
## Multiple R-squared: 0.5484, Adjusted R-squared: 0.5448
## F-statistic: 152.2 on 3 and 376 DF, p-value: < 2.2e-16
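As noted above, the training-set coefficients are close to those of the explanatory model fit on all 506 rows. Here is a minimal side-by-side comparison, refitting the all-rows model since it was not stored earlier:
# Refit the explanatory (all-rows) model and compare its coefficients with the training fit
explanatory.lm <- lm(MEDV ~ ., data = Boston[, selected.housing.vars])
round(cbind(All_506_rows = coef(explanatory.lm), Training_380_rows = coef(housing.lm)), 2)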
The predicted median house price for a tract that does not bound the Charles River, has a crime rate of 0.1, and averages 6 rooms per dwelling is:
MEDV = -30.75 - 0.27(0.1) + 3.55(0) + 8.54(6) = 20.463, or about $20,463 (MEDV is measured in $1000s)
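As a quick sanity check, the same prediction can be obtained from the fitted model directly (a minimal sketch; the tract values below are the hypothetical ones from the sentence above):
# Hypothetical tract: crime rate 0.1, not on the Charles River, 6 rooms on average
new.tract <- data.frame(CRIM = 0.1, CHAS = 0, RM = 6)
predict(housing.lm, new.tract) # should be close to the hand-calculated 20.46 (the coefficients above are rounded)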
Here is how accurate the model is on the validation set: the ME and RMSE are not extremely high, and the MAPE of about 22% implies the predictions are off by roughly 22% on average, i.e., the model is about 78% accurate on new observations.
housing.lm.predictions <- predict(housing.lm, validation.housing)
accuracy(housing.lm.predictions, validation.housing$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.39906 7.039879 4.60643 1.185454 21.8375
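As a sanity check, these error measures can be computed directly from the validation residuals (a minimal sketch; accuracy() from the forecast package reports the same quantities):
# Compute the error measures by hand from the validation residuals
valid.errors <- validation.housing$MEDV - housing.lm.predictions
mean(valid.errors)                                      # ME
sqrt(mean(valid.errors^2))                              # RMSE
mean(abs(valid.errors))                                 # MAE
mean(abs(valid.errors / validation.housing$MEDV)) * 100 # MAPE (%)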
Among all of the predictors:
MEDV and CAT.MEDV should be highly correlated because CAT.MEDV is simply 1 if MEDV is above $30,000 (i.e., MEDV > 30) and 0 otherwise.
AGE (proportion of pre-1940 buildings) and LSTAT (percentage of lower-status population) might measure similar things, since older buildings may be more "affordable" for lower-status populations.
INDUS (proportion of non-retail business acres), NOX (nitric oxides concentration), and TAX (property tax rate) are likely to be positively correlated, as all three tend to be high in commercial and industrial areas.
Looking at the correlation matrix:
correlation_matrix <- cor(na.omit(Boston))
round(correlation_matrix, 2)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO
## CRIM 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## ZN -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## INDUS 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## CHAS -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## NOX 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## RM -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## AGE 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## DIS -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## RAD 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## TAX 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## PTRATIO 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## LSTAT 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## MEDV -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## CAT..MEDV -0.15 0.37 -0.37 0.11 -0.23 0.64 -0.19 0.12 -0.20 -0.27 -0.44
## LSTAT MEDV CAT..MEDV
## CRIM 0.46 -0.39 -0.15
## ZN -0.41 0.36 0.37
## INDUS 0.60 -0.48 -0.37
## CHAS -0.05 0.18 0.11
## NOX 0.59 -0.43 -0.23
## RM -0.61 0.70 0.64
## AGE 0.60 -0.38 -0.19
## DIS -0.50 0.25 0.12
## RAD 0.49 -0.38 -0.20
## TAX 0.54 -0.47 -0.27
## PTRATIO 0.37 -0.51 -0.44
## LSTAT 1.00 -0.74 -0.47
## MEDV -0.74 1.00 0.79
## CAT..MEDV -0.47 0.79 1.00
ggplot(data = melt(correlation_matrix),
aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(limit = c(-1, 1), midpoint = 0,
low = "dark blue", high = "dark red", mid = "white")
Boston.subset <- c("CHAS", "MEDV", "INDUS", "LSTAT", "RAD", "PTRATIO", "ZN")
From this table, I would keep CHAS, MEDV, INDUS, LSTAT, RAD, PTRATIO, and ZN to perform stepwise regression on.
Stepwise regression is a way of selecting some, but not all, of the variables in a dataset with many predictors. It aims to explain most of the variability without using too few or too many predictors by penalizing models that add predictors without improving the fit enough. In the textbook, this is referred to as the bias-variance tradeoff. There are several criteria that implement this penalty, including AIC and BIC.
With the variables kept in the previous step, we will run stepwise regression (using the AIC criterion) to select the best model.
# Split into training and validation using the same sample.housing indices from the earlier set.seed()/sample() split
training.subset <- Boston[sample.housing, Boston.subset]
validation.subset <- Boston[-sample.housing, Boston.subset]
subset_model <- lm(MEDV ~ ., data = training.subset) # full model with all 6 predictors
small_model <- lm(MEDV ~ 1, data = training.subset)  # intercept-only model
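The AIC values printed by step() below are computed with extractAIC(); as a reference point, here they are for the two boundary models, with a BIC-style alternative left commented out (a minimal sketch, not part of the original analysis):
extractAIC(small_model)  # intercept-only model; its AIC matches the "Start" value of the forward search
extractAIC(subset_model) # full 6-predictor model; its AIC matches the "Start" value of the backward search
# step(subset_model, direction = "both", k = log(nrow(training.subset))) # BIC-flavored search (penalty log(n) instead of 2)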
Forward stepwise regression starts with no predictors and adds one predictor at a time until no remaining predictor improves (lowers) the AIC.
step_forward <- step(small_model, scope = list(lower = small_model, upper = subset_model), direction = "forward")
## Start: AIC=1645.14
## MEDV ~ 1
##
## Df Sum of Sq RSS AIC
## + LSTAT 1 16979.1 11709 1306.6
## + PTRATIO 1 7761.7 20926 1527.3
## + INDUS 1 7646.6 21041 1529.3
## + ZN 1 4672.6 24015 1579.6
## + RAD 1 4487.6 24200 1582.5
## + CHAS 1 560.9 28127 1639.6
## <none> 28688 1645.1
##
## Step: AIC=1306.62
## MEDV ~ LSTAT
##
## Df Sum of Sq RSS AIC
## + PTRATIO 1 2004.85 9704.1 1237.2
## + CHAS 1 537.62 11171.3 1290.8
## + ZN 1 155.82 11553.1 1303.5
## <none> 11708.9 1306.6
## + INDUS 1 16.62 11692.3 1308.1
## + RAD 1 0.88 11708.1 1308.6
##
## Step: AIC=1237.25
## MEDV ~ LSTAT + PTRATIO
##
## Df Sum of Sq RSS AIC
## + CHAS 1 298.329 9405.8 1227.4
## + RAD 1 231.599 9472.5 1230.1
## <none> 9704.1 1237.2
## + INDUS 1 44.338 9659.7 1237.5
## + ZN 1 0.123 9704.0 1239.2
##
## Step: AIC=1227.38
## MEDV ~ LSTAT + PTRATIO + CHAS
##
## Df Sum of Sq RSS AIC
## + RAD 1 207.216 9198.5 1220.9
## <none> 9405.8 1227.4
## + INDUS 1 23.570 9382.2 1228.4
## + ZN 1 8.789 9397.0 1229.0
##
## Step: AIC=1220.92
## MEDV ~ LSTAT + PTRATIO + CHAS + RAD
##
## Df Sum of Sq RSS AIC
## <none> 9198.5 1220.9
## + ZN 1 8.4474 9190.1 1222.6
## + INDUS 1 0.0003 9198.5 1222.9
Backward stepwise regression starts with all predictors and removes one at a time until no further removal improves (lowers) the AIC.
step_back <- step(subset_model, direction = "backward")
## Start: AIC=1224.53
## MEDV ~ CHAS + INDUS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - INDUS 1 0.9 9190.1 1222.6
## - ZN 1 9.4 9198.5 1222.9
## <none> 9189.2 1224.5
## - RAD 1 171.5 9360.7 1229.6
## - CHAS 1 280.0 9469.1 1233.9
## - PTRATIO 1 1755.8 10945.0 1289.0
## - LSTAT 1 7842.8 17032.0 1457.0
##
## Step: AIC=1222.57
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - ZN 1 8.4 9198.5 1220.9
## <none> 9190.1 1222.6
## - RAD 1 206.9 9397.0 1229.0
## - CHAS 1 282.2 9472.3 1232.1
## - PTRATIO 1 1755.4 10945.5 1287.0
## - LSTAT 1 9328.5 18518.6 1486.8
##
## Step: AIC=1220.92
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO
##
## Df Sum of Sq RSS AIC
## <none> 9198.5 1220.9
## - RAD 1 207.2 9405.8 1227.4
## - CHAS 1 273.9 9472.5 1230.1
## - PTRATIO 1 1971.9 11170.5 1292.7
## - LSTAT 1 10604.4 19802.9 1510.3
"Both" (bidirectional) stepwise regression considers adding or removing a predictor at each step, keeping whichever change most improves the AIC.
step_both <- step(subset_model, direction = "both")
## Start: AIC=1224.53
## MEDV ~ CHAS + INDUS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - INDUS 1 0.9 9190.1 1222.6
## - ZN 1 9.4 9198.5 1222.9
## <none> 9189.2 1224.5
## - RAD 1 171.5 9360.7 1229.6
## - CHAS 1 280.0 9469.1 1233.9
## - PTRATIO 1 1755.8 10945.0 1289.0
## - LSTAT 1 7842.8 17032.0 1457.0
##
## Step: AIC=1222.57
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO + ZN
##
## Df Sum of Sq RSS AIC
## - ZN 1 8.4 9198.5 1220.9
## <none> 9190.1 1222.6
## + INDUS 1 0.9 9189.2 1224.5
## - RAD 1 206.9 9397.0 1229.0
## - CHAS 1 282.2 9472.3 1232.1
## - PTRATIO 1 1755.4 10945.5 1287.0
## - LSTAT 1 9328.5 18518.6 1486.8
##
## Step: AIC=1220.92
## MEDV ~ CHAS + LSTAT + RAD + PTRATIO
##
## Df Sum of Sq RSS AIC
## <none> 9198.5 1220.9
## + ZN 1 8.4 9190.1 1222.6
## + INDUS 1 0.0 9198.5 1222.9
## - RAD 1 207.2 9405.8 1227.4
## - CHAS 1 273.9 9472.5 1230.1
## - PTRATIO 1 1971.9 11170.5 1292.7
## - LSTAT 1 10604.4 19802.9 1510.3
All 3 stepwise regression methods arrived at exactly the same model, containing 4 predictors:
MEDV = 55.40 - 0.91(LSTAT) - 1.24(PTRATIO) + 3.67(CHAS) + 0.10(RAD)
This model predicts higher median house prices for tracts that bound the Charles River, have better accessibility to radial highways, have a smaller share of lower-status residents, and have a lower pupil-teacher ratio.
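As a quick illustration of that interpretation, the fitted stepwise model can be applied to a hypothetical tract (the predictor values below are made up for illustration):
# Hypothetical tract on the Charles River with LSTAT = 10, PTRATIO = 18, RAD = 5
predict(step_forward, data.frame(CHAS = 1, LSTAT = 10, PTRATIO = 18, RAD = 5))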
summary(step_forward)
##
## Call:
## lm(formula = MEDV ~ LSTAT + PTRATIO + CHAS + RAD, data = training.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1039 -2.9201 -0.6114 1.9998 20.2015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39791 2.44011 22.703 < 2e-16 ***
## LSTAT -0.90531 0.04354 -20.792 < 2e-16 ***
## PTRATIO -1.24401 0.13875 -8.966 < 2e-16 ***
## CHAS 3.67386 1.09934 3.342 0.000916 ***
## RAD 0.10337 0.03557 2.906 0.003872 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.953 on 375 degrees of freedom
## Multiple R-squared: 0.6794, Adjusted R-squared: 0.6759
## F-statistic: 198.6 on 4 and 375 DF, p-value: < 2.2e-16
summary(step_back)
##
## Call:
## lm(formula = MEDV ~ CHAS + LSTAT + RAD + PTRATIO, data = training.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1039 -2.9201 -0.6114 1.9998 20.2015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39791 2.44011 22.703 < 2e-16 ***
## CHAS 3.67386 1.09934 3.342 0.000916 ***
## LSTAT -0.90531 0.04354 -20.792 < 2e-16 ***
## RAD 0.10337 0.03557 2.906 0.003872 **
## PTRATIO -1.24401 0.13875 -8.966 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.953 on 375 degrees of freedom
## Multiple R-squared: 0.6794, Adjusted R-squared: 0.6759
## F-statistic: 198.6 on 4 and 375 DF, p-value: < 2.2e-16
summary(step_both)
##
## Call:
## lm(formula = MEDV ~ CHAS + LSTAT + RAD + PTRATIO, data = training.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1039 -2.9201 -0.6114 1.9998 20.2015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.39791 2.44011 22.703 < 2e-16 ***
## CHAS 3.67386 1.09934 3.342 0.000916 ***
## LSTAT -0.90531 0.04354 -20.792 < 2e-16 ***
## RAD 0.10337 0.03557 2.906 0.003872 **
## PTRATIO -1.24401 0.13875 -8.966 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.953 on 375 degrees of freedom
## Multiple R-squared: 0.6794, Adjusted R-squared: 0.6759
## F-statistic: 198.6 on 4 and 375 DF, p-value: < 2.2e-16
Because all three stepwise methods chose the same model, their accuracy measures (ME, RMSE, etc.) are identical.
Compared to the model from before (using CRIM, CHAS, and RM), the stepwise model has higher error values on every measure except the ME.
On this validation set, the model from before appears to be the better choice for predicting new data.
step_for.prediction <- predict(step_forward, validation.subset)
step_back.prediction <- predict(step_back, validation.subset)
step_both.prediction <- predict(step_both, validation.subset)
accuracy(step_for.prediction, validation.subset$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.23483 7.48106 5.526723 1.253665 24.49312
accuracy(step_back.prediction, validation.subset$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.23483 7.48106 5.526723 1.253665 24.49312
accuracy(step_both.prediction, validation.subset$MEDV)
## ME RMSE MAE MPE MAPE
## Test set 1.23483 7.48106 5.526723 1.253665 24.49312
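For easier comparison, the two validation-set error profiles can be stacked into one table (a minimal sketch; the first row is the CRIM/CHAS/RM model, the second the 4-predictor stepwise model):
# Validation-set accuracy of the earlier model and the stepwise model, side by side
round(rbind(accuracy(housing.lm.predictions, validation.housing$MEDV),
            accuracy(step_for.prediction, validation.subset$MEDV)), 3)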
Ridge, lasso, and elastic-net regression are similar in that they shrink (regularize) the coefficient estimates to avoid overfitting: ridge penalizes the sum of squared coefficients, lasso penalizes the sum of absolute coefficients (and can shrink some of them all the way to zero), and elastic net uses a weighted mix of the two penalties. The glmnet() function from the glmnet package makes these methods easy to fit. It takes x, y, alpha, and lambda arguments: alpha is set to 0 for ridge regression, 1 for lasso regression, and a value in between for elastic-net regression, while a good lambda can be chosen with the cv.glmnet() function.
Before we begin the regression methods, we need to split our data into a training and validation set. In addition, we need to specify our x (predictors) and y (MEDV) values:
# Partition data into training and validation sets
set.seed(100)
sample.housing <- Boston$MEDV %>% createDataPartition(p = 0.75, list = F)
train.housing <- Boston[sample.housing, ]
test.housing <- Boston[-sample.housing, ]
# Because glmnet() doesn't accept the generic y ~ x formula interface, we need to split into x and y
# x: Predictors
x <- model.matrix(MEDV ~ ., train.housing)
# y: Median House Price (MEDV)
y <- train.housing$MEDV
Let’s start with ridge regression (alpha = 0, lambda specified by cv.glmnet())
set.seed(100)
cv <- cv.glmnet(x = x, y = y, alpha = 0)
# Lambda: 0.704
cv$lambda.min
## [1] 0.7041333
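cv.glmnet() also reports lambda.1se, a more conservative choice (the largest lambda whose cross-validated error is within one standard error of the minimum), which some analysts prefer:
# A more conservative regularization strength than lambda.min
cv$lambda.1se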
ridge_model <- glmnet(x = x, y = y, alpha = 0, lambda = cv$lambda.min)
coef(ridge_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 33.989958051
## (Intercept) .
## CRIM -0.095189550
## ZN -0.006998894
## INDUS 0.053124566
## CHAS 2.932989998
## NOX -9.797127860
## RM 1.536356840
## AGE -0.005143717
## DIS -0.481190272
## RAD 0.075969327
## TAX -0.003859152
## PTRATIO -0.494638221
## LSTAT -0.446374738
## CAT..MEDV 10.517543795
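The second "(Intercept)" row above (shown as ".") appears because model.matrix() already adds an intercept column while glmnet() fits its own intercept; dropping that column avoids the duplication (a minimal sketch, not used in the fits below):
# Drop model.matrix()'s intercept column so only glmnet()'s intercept remains
x.no.intercept <- model.matrix(MEDV ~ ., train.housing)[, -1]
head(colnames(x.no.intercept))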
x.test <- model.matrix(MEDV ~ ., test.housing)
predictions <- predict(ridge_model, x.test) %>% as.vector()
ridge_performance <- data.frame(
RMSE_Ridge = RMSE(predictions, test.housing$MEDV),
R2_Ridge = R2(predictions, test.housing$MEDV)
)
Now, lasso regression (alpha = 1, lambda specified by cv.glmnet()). Unlike ridge, lasso can shrink coefficients exactly to zero; note in the output below that AGE is dropped ("." in the coefficient column).
set.seed(100)
cv <- cv.glmnet(x = x, y = y, alpha = 1)
# Lambda: 0.020
cv$lambda.min
## [1] 0.02005397
lasso_model <- glmnet(x = x, y = y, alpha = 1, lambda = cv$lambda.min)
coef(lasso_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 39.664950709
## (Intercept) .
## CRIM -0.101771949
## ZN -0.004528387
## INDUS 0.105570997
## CHAS 2.854117644
## NOX -12.645648668
## RM 0.994761360
## AGE .
## DIS -0.561979114
## RAD 0.134420750
## TAX -0.006270727
## PTRATIO -0.509880236
## LSTAT -0.508404961
## CAT..MEDV 11.556642427
predictions <- predict(lasso_model, x.test) %>% as.vector()
lasso_performance <- data.frame(
RMSE_Lasso = RMSE(predictions, test.housing$MEDV),
R2_Lasso = R2(predictions, test.housing$MEDV)
)
Now, elastic-net regression (alpha somewhere between 0 and 1, with both alpha and lambda chosen by cross-validation via caret's train()):
set.seed(100)
# tuneLength = 10 asks caret to try 10 candidate alpha values (each with 10 lambdas); trainControl("cv", number = 10) uses 10-fold cross-validation
elastic_model <- train(MEDV ~., data = train.housing, method = "glmnet",
trControl = trainControl("cv", number = 10), tuneLength = 10)
# Best alpha/lambda value: 0.1, 0.09
elastic_model$bestTune
## alpha lambda
## 5 0.1 0.09265453
coef(elastic_model$finalModel, elastic_model$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 39.340729302
## CRIM -0.102717551
## ZN -0.005007817
## INDUS 0.104230351
## CHAS 2.868251553
## NOX -12.699999589
## RM 1.068836528
## AGE .
## DIS -0.558677745
## RAD 0.135319773
## TAX -0.006319288
## PTRATIO -0.518322370
## LSTAT -0.501344572
## CAT..MEDV 11.419733740
predictions <- predict(elastic_model, x.test) %>% as.vector()
elastic_performance <- data.frame(
RMSE_Elastic = RMSE(predictions, test.housing$MEDV),
R2_Elastic = R2(predictions, test.housing$MEDV)
)
As we can see, all 3 models have comparable R^2 values (~84%). Lasso and Elastic-Net have very similar RMSE values, although Lasso has the lowest RMSE.
ridge_performance
## RMSE_Ridge R2_Ridge
## 1 3.896098 0.8440277
lasso_performance
## RMSE_Lasso R2_Lasso
## 1 3.818446 0.8458754
elastic_performance
## RMSE_Elastic R2_Elastic
## 1 3.818529 0.8461946
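Finally, the three performance tables can be combined into a single side-by-side view (a minimal sketch using the data frames created above):
# All three regularized models evaluated on the same test split
cbind(ridge_performance, lasso_performance, elastic_performance)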