Libraries to load
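
The analysis below calls regsubsets() from the leaps package, createDataPartition() from the caret package, and glmnet() / cv.glmnet() from the glmnet package, so these need to be attached first:

library(leaps)    # regsubsets() for forward stepwise selection
library(caret)    # createDataPartition() for the train/test split
library(glmnet)   # glmnet() and cv.glmnet() for lasso and ridge regression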

Load the Hitters data from the ISLR package and remove rows with missing values:

data(Hitters, package = "ISLR")
Hitters <- na.omit(Hitters)

head(Hitters)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
## -Al Newman          185   37     1   23   8    21     2    214    42      1
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
## -Al Newman           30    9     24      N        E      76     127      7
##                   Salary NewLeague
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A
## -Al Newman          70.0         A
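
As a quick sanity check on the cleaned data, its dimensions can be confirmed:

dim(Hitters)  # expect 263 rows and 20 columns after removing incomplete cases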

1 Feature Selection

1.1 Forward selection

leaps.hitters <- regsubsets(Salary ~ ., data = Hitters, nvmax = 5, method = "forward")
summary.leaps.hitters <- summary(leaps.hitters)
summary.leaps.hitters
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 5, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: forward
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "

According to the forward stepwise selection procedure (implemented by regsubsets), the best five features are CRBI, Hits, PutOuts, DivisionW, and AtBat.
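
As an optional check, the fitted coefficients of the selected five-variable model can be extracted directly from the regsubsets object:

coef(leaps.hitters, id = 5)  # coefficients of the best 5-predictor model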

1.2 Lasso regression

1.2.1 Split data and fit

Split the dataset into 60% train and 40% test:

set.seed(500316995)
inTrain <- createDataPartition(Hitters$Salary, p = 0.6)[[1]]
allFeatures <- colnames(Hitters)[!grepl("Salary", colnames(Hitters))]
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
X_train <- x[inTrain, ]
X_test <- x[-inTrain, ]
y_train <- y[inTrain]
y_test <- y[-inTrain]

Perform a lasso regression using the glmnet package:

grid <- 10^seq(6, -2, length = 100)
lasso.mod <- glmnet(X_train, y_train, alpha = 1, lambda = grid, standardize = TRUE)
dim(coef(lasso.mod))
## [1]  20 100
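
The coefficient matrix has 20 rows (the intercept plus 19 predictors) and 100 columns, one per λ in the grid. As an optional check, the df component of the fitted glmnet object counts the nonzero coefficients at each λ:

head(lasso.mod$df)  # number of nonzero coefficients, largest lambda first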

1.2.2 Inspect coefficients as a function of λ

Plot the lasso regression coefficients as a function of λ:

plot(lasso.mod, xvar = "lambda", label = TRUE)

Use cross-validation to find the best λ value:

set.seed(500316995)
# Using cross-validation for Lasso to find the best lambda (based on cvm 'mean
# cross-validated error')
cv.lasso <- cv.glmnet(X_train, y_train, alpha = 1)
plot(cv.lasso)

bestlam <- cv.lasso$lambda.min
bestlam
## [1] 30.86259

Using the built-in cross-validation function in glmnet, we find that the best λ value is 30.86.
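
cv.glmnet also reports a more conservative choice, lambda.1se, the largest λ whose CV error is within one standard error of the minimum; it is shown here only for comparison and is not used below:

cv.lasso$lambda.1se  # more heavily regularised alternative to lambda.min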

1.2.3 Check if model is sparse

# lasso coefficients at the optimal lambda (check which are shrunk to zero)
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##  -9.575110911   0.000000000   2.469001327   0.000000000   1.410804495 
##           RBI         Walks         Years        CAtBat         CHits 
##   0.000000000   0.892335332   0.000000000   0.000000000   0.000000000 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##   0.000000000   0.242969193   0.301318510   0.000000000   0.000000000 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -93.357154269   0.001328902  -0.022002716   0.000000000   0.000000000

We find that 11 coefficients have been shrunk exactly to zero: AtBat, HmRun, RBI, Years, CAtBat, CHits, CHmRun, CWalks, LeagueN, Errors, and NewLeagueN. The lasso fit is therefore sparse.
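
The same count can be obtained programmatically rather than by reading the printout:

sum(lasso.coef[-1] == 0)  # coefficients (excluding the intercept) shrunk exactly to zero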

1.2.4 Assess on the test set

# predict on test set using optimal lambda value estimated by CV
lasso.pred <- predict(lasso.mod, s = bestlam, newx = X_test)
# compute MSE
mean((lasso.pred - y_test)^2)
## [1] 145267

Using the lasso model with the CV-selected λ to predict salaries in the test set, the mean squared error is 145267.
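
For context, this can be compared against a naive baseline that predicts the mean training salary for every player in the test set (an illustrative comparison under the same split):

mean((mean(y_train) - y_test)^2)  # test MSE of a constant predictor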

1.3 Ridge regression (optional)

Perform a ridge regression using the glmnet package:

grid <- 10^seq(6, -2, length = 100)
ridge.mod <- glmnet(X_train, y_train, alpha = 0, lambda = grid, standardize = TRUE)
dim(coef(ridge.mod))
## [1]  20 100

Plot the Ridge regression coefficients as a function of λ:

plot(ridge.mod, xvar = "lambda", label = TRUE)

Use cross-validation to find the best λ value in ridge regression:

set.seed(500316995)
# Using cross-validation for ridge regression to find the best lambda (based on
# cvm, the mean cross-validated error)
cv.ridge <- cv.glmnet(X_train, y_train, alpha = 0)
plot(cv.ridge)

bestlam <- cv.ridge$lambda.min
bestlam
## [1] 168.58

Using the built-in cross-validation function in glmnet, we find that the best λ value for ridge regression is 168.58.

Check if the ridge regression model is sparse:

# ridge coefficients at the optimal lambda (check whether any are shrunk to zero)
ridge.coef <- predict(ridge.mod, type = "coefficients", s = bestlam)[1:20, ]
ridge.coef
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##  -55.60561635    0.17440086    1.53611125   -1.28134551    2.08941514 
##           RBI         Walks         Years        CAtBat         CHits 
##    0.49146219    1.44253959    1.61073297    0.01053200    0.07068033 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##    0.40817053    0.13475881    0.12978393   -0.04523344   32.63958130 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -109.75188716    0.06955367   -0.12052601   -2.45726905    5.05239799

We find that no coefficients have been shrunk exactly to zero, so ridge regression does not perform feature selection.

# predict on test set using optimal lambda value estimated by CV
ridge.pred <- predict(ridge.mod, s = bestlam, newx = X_test)
# compute MSE
mean((ridge.pred - y_test)^2)
## [1] 141449.2

The test mean squared error of the ridge regression model is 141449.2, which is lower than the lasso regression model's 145267.
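
A compact way to put the two results side by side, using the predictions computed above:

# summarise test-set performance of both penalised models
data.frame(model = c("lasso", "ridge"),
           test_MSE = c(mean((lasso.pred - y_test)^2),
                        mean((ridge.pred - y_test)^2)))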

APP. Student Info

Course: STAT5003_Computational Statistical Methods
Assignment: Lab Week 8
Student Name: Yujun Yao (June Yao)
SID: 500316995
Email: