Download Boston housing data

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
options(scipen = 10, digits = 4)
url.train <- "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
url.names <- "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names"
#download.file(url.train, destfile = "HousingValues.csv")
#download.file(url.names, destfile = "HousingFeatureNames.txt")

# Read the training data into memory
train <- read.table("HousingValues.csv")
# Feature names can be found by referring to the downloaded names .txt file above
names(train) <- c("CRIM",      #per capita crime rate by town
                  "ZN",        #proportion of residential land zoned for lots over 25,000 sq.ft.
                  "INDUS",     #proportion of non-retail business acres per town
                  "CHAS",      #Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
                  "NOX",       #nitric oxides concentration (parts per 10 million)
                  "RM",        #average number of rooms per dwelling
                  "AGE",       #proportion of owner-occupied units built prior to 1940
                  "DIS",       #weighted distances to five Boston employment centres
                  "RAD",       #index of accessibility to radial highways
                  "TAX",       #full-value property-tax rate per $10,000
                  "PTRATIO",   #pupil-teacher ratio by town
                  "AA",        #1000(AA - 0.63)^2 where AA is the proportion of African Americans by town
                  "LSTAT",     #% lower status of the population
                  "MEDV")      #Median value of owner-occupied homes in $1000's

Let’s implement an OLS model, a Lasso, a Ridge, and an Elastic Net using the caret package:

preproc <- preProcess(train,
                      method = c("center", "scale"))
train.preproc <- predict(preproc, train)
model.names <- c("OLS",
                 "LassoTop5",
                 "LassoAllVars",
                 "Ridge",
                 "ElasticNetTop5",
                 "ElasticNetAllVars")
set.seed(12928)
model.ols <- train(MEDV ~ .,
                   data = train.preproc,
                   method = "lm")
coefficients <- data.frame(model.ols$finalModel$coefficients[-1]) # drop the intercept

model.lasso <- train(MEDV ~ .,
                     data = train.preproc,
                     method = "lasso",
                     tuneLength = 20)
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
lasso.coef <- predict(model.lasso$finalModel, type='coef', mode='norm')$coefficients
coefficients <- cbind(coefficients,
                      lasso.coef[6,],                    # step with exactly five non-zero coefficients
                      lasso.coef[nrow(lasso.coef),])     # full (unpenalized) solution

model.ridge <- train(MEDV ~ .,
                     data = train.preproc,
                     method = "ridge",
                     tuneLength = 30)
ridge.coef <- predict(model.ridge$finalModel, type='coef', mode='norm')$coefficients
coefficients <- cbind(coefficients,
                      ridge.coef[nrow(ridge.coef),])     # final ridge solution with all predictors

model.enet <- train(MEDV ~ .,
                    data = train.preproc,
                    method = "enet",
                    tuneLength = 8)
enet.coef <- predict(model.enet$finalModel, type='coef', mode='norm')$coefficients
coefficients <- cbind(coefficients,
                      enet.coef[6,],                     # step with exactly five non-zero coefficients
                      enet.coef[nrow(enet.coef),])       # full solution with all predictors

names(coefficients) <- model.names
print(coefficients)
##               OLS LassoTop5 LassoAllVars       Ridge ElasticNetTop5
## CRIM    -0.101017   0.00000    -0.101017 -0.09879790        0.00000
## ZN       0.117715   0.00000     0.117715  0.11321565        0.00000
## INDUS    0.015335   0.00000     0.015335  0.00740187        0.00000
## CHAS     0.074199   0.02812     0.074199  0.07584613        0.01719
## NOX     -0.223848   0.00000    -0.223848 -0.21550125        0.00000
## RM       0.291056   0.31375     0.291056  0.29593049        0.31261
## AGE      0.002119   0.00000     0.002119 -0.00005977        0.00000
## DIS     -0.337836   0.00000    -0.337836 -0.33046120        0.00000
## RAD      0.289749   0.00000     0.289749  0.26731671        0.00000
## TAX     -0.226032   0.00000    -0.226032 -0.20492254        0.00000
## PTRATIO -0.224271  -0.16336    -0.224271 -0.22283876       -0.16091
## AA       0.092432   0.04446     0.092432  0.09288774        0.03870
## LSTAT   -0.407447  -0.39076    -0.407447 -0.40549200       -0.38940
##         ElasticNetAllVars
## CRIM            -0.092993
## ZN               0.100851
## INDUS           -0.012596
## CHAS             0.080575
## NOX             -0.190239
## RM               0.310943
## AGE             -0.006156
## DIS             -0.306213
## RAD              0.209349
## TAX             -0.153623
## PTRATIO         -0.219130
## AA               0.094485
## LSTAT           -0.399768

I’m not actually sure why a few of the coefficients end up not being shrunk in the ridge and enet regressions… Anyway, the lasso here converges to OLS, so the unpenalized solution (fraction = 1) was optimal for minimizing RMSE.
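
To double-check that, we can look at the tuning values caret actually selected (bestTune is stored on every train object) and count how many coefficients each column zeroes out; this is just a quick inspection of the objects already created above.

model.lasso$bestTune              # fraction chosen for the lasso
model.ridge$bestTune              # lambda chosen for the ridge
colSums(coefficients == 0)        # number of exactly-zero coefficients per model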

Looking at the stepwise algorithm used to estimate the lasso, we see that INDUS gets added in the 10th step, removed in the 13th step, and re-added in the 14th. If we look at the coefficients at each step, you can see that exactly one variable enters the active set per step (or leaves it again when its coefficient path crosses zero).

print(model.lasso$finalModel)
## 
## Call:
## enet(x = as.matrix(x), y = y, lambda = 0)
## Cp statistics of the Lasso fit 
## Cp: 1393.00 1111.19  447.49  166.36  143.59  112.93  105.28   92.52   63.72   44.70   44.45   38.93   22.59   10.34   12.03   14.00 
## DF:  1  2  3  4  5  6  7  8  9 10 11 12 13 12 13 14 
## Sequence of  moves:
##      LSTAT RM PTRATIO AA CHAS CRIM DIS NOX ZN INDUS RAD TAX INDUS INDUS
## Var     13  6      11 12    4    1   8   5  2     3   9  10    -3     3
## Step     1  2       3  4    5    6   7   8  9    10  11  12    13    14
##      AGE   
## Var    7 16
## Step  15 16
print(lasso.coef)
##         CRIM      ZN      INDUS    CHAS      NOX     RM      AGE      DIS
## 0   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.0000 0.000000  0.00000
## 1   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.0000 0.000000  0.00000
## 2   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.1824 0.000000  0.00000
## 3   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.2796 0.000000  0.00000
## 4   0.000000 0.00000  0.000e+00 0.00000  0.00000 0.2953 0.000000  0.00000
## 5   0.000000 0.00000  0.000e+00 0.02812  0.00000 0.3138 0.000000  0.00000
## 6  -0.005689 0.00000  0.000e+00 0.03853  0.00000 0.3211 0.000000  0.00000
## 7  -0.014446 0.00000  0.000e+00 0.04453  0.00000 0.3245 0.000000 -0.02373
## 8  -0.023557 0.00000  0.000e+00 0.05650 -0.06452 0.3266 0.000000 -0.09853
## 9  -0.034976 0.03616  0.000e+00 0.06572 -0.11141 0.3233 0.000000 -0.17631
## 10 -0.036499 0.04101 -2.355e-03 0.06703 -0.11665 0.3227 0.000000 -0.18732
## 11 -0.046558 0.04918 -1.002e-02 0.06967 -0.13564 0.3188 0.000000 -0.21160
## 12 -0.068071 0.07617 -1.235e-18 0.07148 -0.17022 0.3080 0.000000 -0.26175
## 13 -0.096495 0.11036  0.000e+00 0.07455 -0.21177 0.2930 0.000000 -0.33000
## 14 -0.099683 0.11540  1.083e-02 0.07435 -0.21982 0.2919 0.000000 -0.33602
## 15 -0.101017 0.11772  1.534e-02 0.07420 -0.22385 0.2911 0.002119 -0.33784
##        RAD      TAX PTRATIO      AA   LSTAT
## 0  0.00000  0.00000  0.0000 0.00000  0.0000
## 1  0.00000  0.00000  0.0000 0.00000 -0.1095
## 2  0.00000  0.00000  0.0000 0.00000 -0.2920
## 3  0.00000  0.00000 -0.1309 0.00000 -0.3828
## 4  0.00000  0.00000 -0.1463 0.01972 -0.3857
## 5  0.00000  0.00000 -0.1634 0.04446 -0.3908
## 6  0.00000  0.00000 -0.1690 0.05236 -0.3905
## 7  0.00000  0.00000 -0.1754 0.06102 -0.4013
## 8  0.00000  0.00000 -0.1905 0.06714 -0.4028
## 9  0.00000  0.00000 -0.1929 0.07229 -0.4045
## 10 0.00000  0.00000 -0.1928 0.07287 -0.4045
## 11 0.02026  0.00000 -0.1990 0.07633 -0.4047
## 12 0.12663 -0.08925 -0.2089 0.08273 -0.4055
## 13 0.26057 -0.19722 -0.2208 0.09083 -0.4057
## 14 0.28099 -0.21748 -0.2232 0.09202 -0.4064
## 15 0.28975 -0.22603 -0.2243 0.09243 -0.4074
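
The same path is easier to read visually. The elasticnet package supplies a plot method for the fitted enet object, so (as a quick sketch with default settings) we can draw the coefficient paths directly:

plot(model.lasso$finalModel)   # coefficient paths across the lasso steps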

Since ridge does not perform feature selection, none of its coefficients are ever shrunk exactly to zero; every predictor remains in the final model:

print(model.ridge$finalModel) # ridge keeps all predictors in the model
## 
## Call:
## enet(x = as.matrix(x), y = y, lambda = param$lambda)
## Sequence of  moves:
##      LSTAT RM PTRATIO AA CHAS CRIM DIS NOX ZN INDUS RAD TAX INDUS INDUS
## Var     13  6      11 12    4    1   8   5  2     3   9  10    -3     3
## Step     1  2       3  4    5    6   7   8  9    10  11  12    13    14
##      AGE   
## Var    7 16
## Step  15 16
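
A quick way to confirm that ridge drops nothing is to check whether any coefficient in the final row of ridge.coef (computed above) is exactly zero:

any(ridge.coef[nrow(ridge.coef), ] == 0)   # expect FALSE, matching the Ridge column above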

Here are the plots caret used to determine the optimal tuning parameters. The lasso’s RMSE was essentially strictly decreasing as the fraction approached the full solution (the full solution being OLS), so fraction = 1 was selected and it’s no surprise that the lasso performs no shrinkage. The ridge, on the other hand, clearly had an interior optimum in lambda.

plot(model.ridge)

plot(model.lasso)

plot(model.enet)

For the elastic net, RMSE was again used to select the optimal model (smallest value is best). The final values used were fraction = 1 and lambda = 0.01, which is consistent with the individual lasso and ridge plots. Notice that enet tunes both parameters: the lasso’s fraction and the ridge’s lambda.
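
As with the other models, the chosen values and the corresponding resampling results live on the train object (a quick inspection, assuming the column names caret uses for this method, fraction and lambda):

model.enet$bestTune                        # fraction = 1, lambda = 0.01
subset(model.enet$results, fraction == 1)  # RMSE across lambda values at the chosen fraction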

Dimensionality Reduction

Another way to approach this issue is with so-called dimensionality-reduction techniques. One example is principal component regression, where you define new predictors that are orthogonal linear combinations of the raw variables, constructed so that the first few directions capture as much of the variance in the data as possible.
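
Before fitting, it can help to see what those components look like. Here is a small illustration using prcomp on the preprocessed predictors built earlier (this mirrors, but is separate from, the PCA that caret performs internally for the pcr method):

pca <- prcomp(train.preproc[, names(train.preproc) != "MEDV"])  # PCA on the 13 predictors
summary(pca)$importance[, 1:5]   # std. dev. and share of variance for the first 5 components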

model.pcr <- train(MEDV ~ .,
                   data = train.preproc,
                   method = "pcr",
                   tuneLength = 13)
## Loading required package: pls
## Warning: package 'pls' was built under R version 3.2.3
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings
plot(model.pcr)

Using this method, the minimum RMSE is achieved with 12 components, so dimensionality reduction doesn’t appear terribly effective here. That said, the RMSE with only 5 components is only marginally worse, so we can do almost as well even after dropping 8 of our 13 dimensions.
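
That comparison can be read straight off the resampling results stored on the train object (a quick sketch, assuming the ncomp and RMSE column names caret reports for pcr):

model.pcr$results[model.pcr$results$ncomp %in% c(5, 12), c("ncomp", "RMSE")]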