Load the data:
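The code below also relies on the leaps, caret, and glmnet packages (for regsubsets, createDataPartition, and the lasso/ridge fits respectively); a minimal setup, assuming they are installed:
library(leaps)   # regsubsets for forward stepwise selection
library(caret)   # createDataPartition for the train/test split
library(glmnet)  # lasso and ridge regression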
data(Hitters, package = "ISLR")
Hitters <- na.omit(Hitters)
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## -Al Newman 30 9 24 N E 76 127 7
## Salary NewLeague
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
## -Al Newman 70.0 A
Perform forward stepwise selection with up to five predictors using regsubsets from the leaps package:
leaps.hitters <- regsubsets(Salary ~ ., data = Hitters, nvmax = 5, method = "forward")
summary.leaps.hitters <- summary(leaps.hitters)
summary.leaps.hitters
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 5, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
According to the forward stepwise selection procedure (implemented by regsubsets), the best five features are CRBI, Hits, PutOuts, DivisionW, and AtBat.
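The variables chosen at a given subset size can also be read off programmatically; a quick check of the 5-variable model, using the fitted regsubsets object above:
# Coefficients of the best 5-variable model found by forward selection
coef(leaps.hitters, id = 5)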
Split the dataset into 60% train and 40% test:
set.seed(500316995)
inTrain <- createDataPartition(Hitters$Salary, p = 0.6)[[1]]
allFeatures <- colnames(Hitters)[!grepl("Salary", colnames(Hitters))]
x <- model.matrix(Salary ~ ., data = Hitters)[, -1]
y <- Hitters$Salary
X_train <- x[inTrain, ]
X_test <- x[-inTrain, ]
y_train <- y[inTrain]
y_test <- y[-inTrain]
Perform a lasso regression using the glmnet package:
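One way to fit the lasso with glmnet, assuming the same λ grid as the ridge fit further below (alpha = 1 selects the lasso penalty):
# Fit the lasso over a decreasing grid of lambda values
grid <- 10^seq(6, -2, length = 100)
lasso.mod <- glmnet(X_train, y_train, alpha = 1, lambda = grid, standardize = TRUE)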
Plot the lasso regression coefficients as a function of λ:
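A sketch of the coefficient-path plot against log(λ), assuming the lasso.mod object fitted above:
# Each curve traces one coefficient as lambda varies; labels mark the variables
plot(lasso.mod, xvar = "lambda", label = TRUE)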
Use cross-validation to find the best λ value:
set.seed(500316995)
# Using cross-validation for Lasso to find the best lambda (based on cvm 'mean
# cross-validated error')
cv.lasso <- cv.glmnet(X_train, y_train, alpha = 1)
plot(cv.lasso)
Using the built-in cross-validation function in glmnet, the best λ value is found to be 30.86.
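The optimal λ used for the coefficient extraction and prediction below is taken here to be the CV minimiser; a minimal sketch:
# Lambda minimising the mean cross-validated error (reported above as 30.86)
bestlam <- cv.lasso$lambda.min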
# Lasso for feature selection
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)[1:20, ]
lasso.coef
## (Intercept) AtBat Hits HmRun Runs
## -9.575110911 0.000000000 2.469001327 0.000000000 1.410804495
## RBI Walks Years CAtBat CHits
## 0.000000000 0.892335332 0.000000000 0.000000000 0.000000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.000000000 0.242969193 0.301318510 0.000000000 0.000000000
## DivisionW PutOuts Assists Errors NewLeagueN
## -93.357154269 0.001328902 -0.022002716 0.000000000 0.000000000
Eleven coefficients have been shrunk to exactly zero: AtBat, HmRun, RBI, Years, CAtBat, CHits, CHmRun, CWalks, LeagueN, Errors, and NewLeagueN. The lasso therefore performs feature selection.
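A quick programmatic check of the zeroed coefficients, using the lasso.coef vector above:
# Predictors whose lasso coefficients are exactly zero
names(lasso.coef)[lasso.coef == 0]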
# predict on test set using optimal lambda value estimated by CV
lasso.pred <- predict(lasso.mod, s = bestlam, newx = X_test)
# compute MSE
mean((lasso.pred - y_test)^2)
## [1] 145267
Using the model at the optimal λ to predict salaries in the test set, the mean squared error is 145267.
Perform a ridge regression using the glmnet package:
grid <- 10^seq(6, -2, length = 100)
ridge.mod <- glmnet(X_train, y_train, alpha = 0, lambda = grid, standardize = TRUE)
dim(coef(ridge.mod))
## [1] 20 100
Plot the ridge regression coefficients as a function of λ:
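One way to draw the ridge coefficient paths, assuming the ridge.mod object just fitted:
# Ridge coefficients shrink smoothly towards zero but never reach exactly zero
plot(ridge.mod, xvar = "lambda", label = TRUE)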
Use cross-validation to find the best λ value in ridge regression:
set.seed(500316995)
# Using cross-validation for ridge regression to find the best lambda (based on
# cvm 'mean cross-validated error')
cv.ridge <- cv.glmnet(X_train, y_train, alpha = 0)
plot(cv.ridge)
Using the built-in cross-validation function in glmnet, the best λ value for ridge regression is found to be 168.58.
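As with the lasso, the optimal λ used below is taken to be the CV minimiser; a minimal sketch (note that this overwrites the lasso bestlam):
# Lambda minimising the mean cross-validated error for ridge (reported above as 168.58)
bestlam <- cv.ridge$lambda.min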
Check if the ridge regression model is sparse:
# Ridge coefficients at the optimal lambda (check for sparsity)
ridge.coef <- predict(ridge.mod, type = "coefficients", s = bestlam)[1:20, ]
ridge.coef
## (Intercept) AtBat Hits HmRun Runs
## -55.60561635 0.17440086 1.53611125 -1.28134551 2.08941514
## RBI Walks Years CAtBat CHits
## 0.49146219 1.44253959 1.61073297 0.01053200 0.07068033
## CHmRun CRuns CRBI CWalks LeagueN
## 0.40817053 0.13475881 0.12978393 -0.04523344 32.63958130
## DivisionW PutOuts Assists Errors NewLeagueN
## -109.75188716 0.06955367 -0.12052601 -2.45726905 5.05239799
No coefficients have been shrunk to exactly zero, so ridge regression does not perform feature selection.
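A quick programmatic check, using the ridge.coef vector above:
# Count of ridge coefficients that are exactly zero (expected to be 0)
sum(ridge.coef == 0)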
# predict on test set using optimal lambda value estimated by CV
ridge.pred <- predict(ridge.mod, s = bestlam, newx = X_test)
# compute MSE
mean((ridge.pred - y_test)^2)
## [1] 141449.2
The test-set mean squared error of the ridge regression model is 141449.2, which is lower than the 145267 obtained with the lasso model.
Course: STAT5003_Computational Statistical Methods
Assignment: Lab Week 8
Student Name: Yujun Yao(June Yao)
SID: 500316995
Email: yyao2983@uni.sydney.edu.au