Download Boston housing data

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
options(scipen = 10, digits = 4)
url.train <- "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names"
#download.file(url.train, destfile = "HousingValues.csv")
#download.file(url.names, destfile = "HousingFeatureNames.txt")

# Read the training data into memory
train <- read.table("HousingValues.csv")
# Feature names can be found by referring to the downloaded names .txt file above
names(train) <- c("CRIM",      #per capita crime rate by town
                  "ZN",        #proportion of residential land zoned for lots over 25,000 sq.ft.
                  "INDUS",     #proportion of non-retail business acres per town
                  "CHAS",      #Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
                  "NOX",       #nitric oxides concentration (parts per 10 million)
                  "RM",        #average number of rooms per dwelling
                  "AGE",       #proportion of owner-occupied units built prior to 1940
                  "DIS",       #weighted distances to five Boston employment centres
                  "RAD",       #index of accessibility to radial highways
                  "TAX",       #full-value property-tax rate per $10,000
                  "PTRATIO",   #pupil-teacher ratio by town
                  "AA",        #1000(AA - 0.63)^2 where AA is the proportion of African Americans by town
                  "LSTAT",     #% lower status of the population
                  "MEDV")      #Median value of owner-occupied homes in $1000's

Let’s implement an OLS model, a Lasso, a Ridge, and an Elastic Net using the caret package:

preproc <- preProcess(train,
                      method = c("center", "scale"))
train.preproc <- predict(preproc, train)
model.names <- c("OLS",
                 "LassoTop5",
                 "LassoAllVars",
                 "Ridge",
                 "ElasticNetTop5",
                 "ElasticNetAllVars")
set.seed(12928)
model.ols <- train(MEDV ~ .,
                   data = train.preproc,
                   method = "lm")
coefficients <- data.frame(model.ols$finalModel$coefficients[-1]) # drop the intercept

model.lasso <- train(MEDV ~ .,
                     data = train.preproc,
                     method = "lasso",
                     tuneLength = 20)
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
lasso.coef <- predict(model.lasso$finalModel, type='coef', mode='norm')$coefficients
coefficients <- cbind(coefficients,
                      lasso.coef[6,],                    # step with exactly five non-zero coefficients
                      lasso.coef[nrow(lasso.coef),])     # full (unpenalized) solution

model.ridge <- train(MEDV ~ .,
                     data = train.preproc,
                     method = "ridge",
                     tuneLength = 30)
ridge.coef <- predict(model.ridge$finalModel, type='coef', mode='norm')$coefficients
coefficients <- cbind(coefficients,
                      ridge.coef[nrow(ridge.coef),])     # final ridge solution with all predictors

model.enet <- train(MEDV ~ .,
                    data = train.preproc,
                    method = "enet",
                    tuneLength = 8)
enet.coef <- predict(model.enet$finalModel, type='coef', mode='norm')$coefficients
coefficients <- cbind(coefficients,
                      enet.coef[6,],                     # step with exactly five non-zero coefficients
                      enet.coef[nrow(enet.coef),])       # full solution with all predictors

names(coefficients) <- model.names
print(coefficients)
##               OLS LassoTop5 LassoAllVars       Ridge ElasticNetTop5
## CRIM    -0.101017   0.00000    -0.101017 -0.09879790        0.00000
## ZN       0.117715   0.00000     0.117715  0.11321565        0.00000
## INDUS    0.015335   0.00000     0.015335  0.00740187        0.00000
## CHAS     0.074199   0.02812     0.074199  0.07584613        0.01719
## NOX     -0.223848   0.00000    -0.223848 -0.21550125        0.00000
## RM       0.291056   0.31375     0.291056  0.29593049        0.31261
## AGE      0.002119   0.00000     0.002119 -0.00005977        0.00000
## DIS     -0.337836   0.00000    -0.337836 -0.33046120        0.00000
## RAD      0.289749   0.00000     0.289749  0.26731671        0.00000
## TAX     -0.226032   0.00000    -0.226032 -0.20492254        0.00000
## PTRATIO -0.224271  -0.16336    -0.224271 -0.22283876       -0.16091
## AA       0.092432   0.04446     0.092432  0.09288774        0.03870
## LSTAT   -0.407447  -0.39076    -0.407447 -0.40549200       -0.38940
##         ElasticNetAllVars
## CRIM            -0.092993
## ZN               0.100851
## INDUS           -0.012596
## CHAS             0.080575
## NOX             -0.190239
## RM               0.310943
## AGE             -0.006156
## DIS             -0.306213
## RAD              0.209349
## TAX             -0.153623
## PTRATIO         -0.219130
## AA               0.094485
## LSTAT           -0.399768

I’m not actually sure why a few of the coefficients end up not being shrunk in the ridge and enet regressions… Anyway, the lasso here converges to OLS, so the unpenalized solution (fraction = 1) was optimal for minimizing RMSE.
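
To double-check that, we can look at the tuning values caret actually selected (bestTune is stored on every train object) and count how many coefficients each column zeroes out; this is just a quick inspection of the objects already created above.

model.lasso$bestTune              # fraction chosen for the lasso
model.ridge$bestTune              # lambda chosen for the ridge
colSums(coefficients == 0)        # number of exactly-zero coefficients per model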

Looking at the stepwise algorithm used to estimate the lasso, we see that INDUS gets added in the 10th step, removed in the 13th step, and re-added in the 14th. If we look at the coefficients at each step, you can see that exactly one variable enters the active set per step (or leaves it again when its coefficient path crosses zero).

print(model.lasso$finalModel)
## 
## Call:
## enet(x = as.matrix(x), y = y, lambda = 0)
## Cp statistics of the Lasso fit 
## Cp: 1393.00 1111.19  447.49  166.36  143.59  112.93  105.28   92.52   63.72   44.70   44.45   38.93   22.59   10.34   12.03   14.00 
## DF:  1  2  3  4  5  6  7  8  9 10 11 12 13 12 13 14 
## Sequence of  moves:
##      LSTAT RM PTRATIO AA CHAS CRIM DIS NOX ZN INDUS RAD TAX INDUS INDUS
## Var     13  6      11 12    4    1   8   5  2     3   9  10    -3     3
## Step     1  2       3  4    5    6   7   8  9    10  11  12    13    14
##      AGE   
## Var    7 16
## Step  15 16
print(lasso.coef)
##         CRIM      ZN      INDUS    CHAS      NOX     RM      AGE      DIS
## 0   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.0000 0.000000  0.00000
## 1   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.0000 0.000000  0.00000
## 2   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.1824 0.000000  0.00000
## 3   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.2796 0.000000  0.00000
## 4   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.2953 0.000000  0.00000
## 5   0.000000 0.00000  0.000e+00 0.02812  0.00000 0.3138 0.000000  0.00000
## 6  -0.005689 0.00000  0.000e+00 0.03853  0.00000 0.3211 0.000000  0.00000
## 7  -0.014446 0.00000  0.000e+00 0.04453  0.00000 0.3245 0.000000 -0.02373
## 8  -0.023557 0.00000  0.000e+00 0.05650 -0.06452 0.3266 0.000000 -0.09853
## 9  -0.034976 0.03616  0.000e+00 0.06572 -0.11141 0.3233 0.000000 -0.17631
## 10 -0.036499 0.04101 -2.355e-03 0.06703 -0.11665 0.3227 0.000000 -0.18732
## 11 -0.046558 0.04918 -1.002e-02 0.06967 -0.13564 0.3188 0.000000 -0.21160
## 12 -0.068071 0.07617 -1.235e-18 0.07148 -0.17022 0.3080 0.000000 -0.26175
## 13 -0.096495 0.11036  0.000e+00 0.07455 -0.21177 0.2930 0.000000 -0.33000
## 14 -0.099683 0.11540  1.083e-02 0.07435 -0.21982 0.2919 0.000000 -0.33602
## 15 -0.101017 0.11772  1.534e-02 0.07420 -0.22385 0.2911 0.002119 -0.33784
##        RAD      TAX PTRATIO      AA   LSTAT
## 0  0.00000  0.00000  0.0000 0.00000  0.0000
## 1  0.00000  0.00000  0.0000 0.00000 -0.1095
## 2  0.00000  0.00000  0.0000 0.00000 -0.2920
## 3  0.00000  0.00000 -0.1309 0.00000 -0.3828
## 4  0.00000  0.00000 -0.1463 0.01972 -0.3857
## 5  0.00000  0.00000 -0.1634 0.04446 -0.3908
## 6  0.00000  0.00000 -0.1690 0.05236 -0.3905
## 7  0.00000  0.00000 -0.1754 0.06102 -0.4013
## 8  0.00000  0.00000 -0.1905 0.06714 -0.4028
## 9  0.00000  0.00000 -0.1929 0.07229 -0.4045
## 10 0.00000  0.00000 -0.1928 0.07287 -0.4045
## 11 0.02026  0.00000 -0.1990 0.07633 -0.4047
## 12 0.12663 -0.08925 -0.2089 0.08273 -0.4055
## 13 0.26057 -0.19722 -0.2208 0.09083 -0.4057
## 14 0.28099 -0.21748 -0.2232 0.09202 -0.4064
## 15 0.28975 -0.22603 -0.2243 0.09243 -0.4074
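
The same path is easier to read visually. The elasticnet package supplies a plot method for the fitted enet object, so (as a quick sketch with default settings) we can draw the coefficient paths directly:

plot(model.lasso$finalModel)   # coefficient paths across the lasso steps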

Since ridge does not perform feature selection, none of its coefficients are ever shrunk exactly to zero; every predictor remains in the final model:

print(model.ridge$finalModel) # ridge keeps all predictors in the model
## 
## Call:
## enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Sequence of  moves:
##      LSTAT RM PTRATIO AA CHAS CRIM DIS NOX ZN INDUS RAD TAX INDUS INDUS
## Var     13  6      11 12    4    1   8   5  2     3   9  10    -3     3
## Step     1  2       3  4    5    6   7   8  9    10  11  12    13    14
##      AGE   
## Var    7 16
## Step  15 16
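
A quick way to confirm that ridge drops nothing is to check whether any coefficient in the final row of ridge.coef (computed above) is exactly zero:

any(ridge.coef[nrow(ridge.coef), ] == 0)   # expect FALSE, matching the Ridge column above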

Here are the plots caret used to determine the optimal tuning parameters. The lasso’s RMSE was essentially strictly decreasing as the fraction approached the full solution (the full solution being OLS), so fraction = 1 was selected and it’s no surprise that the lasso performs no shrinkage. The ridge, on the other hand, clearly had an interior optimum in lambda.

plot(model.ridge)

plot(model.lasso)

plot(model.enet)

For the elastic net, RMSE was again used to select the optimal model (smallest value is best). The final values used were fraction = 1 and lambda = 0.01, which is consistent with the individual lasso and ridge plots. Notice that enet tunes both parameters: the lasso’s fraction and the ridge’s lambda.
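
As with the other models, the chosen values and the corresponding resampling results live on the train object (a quick inspection, assuming the column names caret uses for this method, fraction and lambda):

model.enet$bestTune                        # fraction = 1, lambda = 0.01
subset(model.enet$results, fraction == 1)  # RMSE across lambda values at the chosen fraction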

Dimensionality Reduction

Another way to approach this issue is with so-called dimensionality-reduction techniques. One example is principal component regression, where you define new predictors that are orthogonal linear combinations of the raw variables, constructed so that the first few directions capture as much of the variance in the data as possible.
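
Before fitting, it can help to see what those components look like. Here is a small illustration using prcomp on the preprocessed predictors built earlier (this mirrors, but is separate from, the PCA that caret performs internally for the pcr method):

pca <- prcomp(train.preproc[, names(train.preproc) != "MEDV"])  # PCA on the 13 predictors
summary(pca)$importance[, 1:5]   # std. dev. and share of variance for the first 5 components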

model.pcr <- train(MEDV ~ .,
                   data = train.preproc,
                   method = "pcr",
                   tuneLength = 13)
## Loading required package: pls
## Warning: package 'pls' was built under R version 3.2.3
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings
plot(model.pcr)

Using this method, the minimum RMSE is achieved with 12 components, so dimensionality reduction doesn’t appear terribly effective here. That said, the RMSE with only 5 components is only marginally worse, so we can do almost as well even after dropping 8 of our 13 dimensions.
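
That comparison can be read straight off the resampling results stored on the train object (a quick sketch, assuming the ncomp and RMSE column names caret reports for pcr):

model.pcr$results[model.pcr$results$ncomp %in% c(5, 12), c("ncomp", "RMSE")]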