In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a). Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector \(\epsilon\) of length n = 100.
set.seed(1)
n = 100
X = rnorm(n)
e = rnorm(n)
(b). Generate a response vector Y of length n = 100 according to the model \[Y=\beta_{0}+\beta_{1}X+\beta_{2}X^2+\beta_{3}X^3+\epsilon\] where \(\beta_{0},\beta_{1},\beta_{2}\) and \(\beta_{3}\) are constants of your choice.
# set beta value
beta_0 = 3
beta_1 = -4
beta_2 = 9
beta_3 = -5
Y = beta_0 + beta_1*X + beta_2*X^2 + beta_3*X^3 + e
(c). Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, \(X^2\),…,\(X^{10}\). What is the best model obtained according to \(C_{p}\), BIC, and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
data = data.frame(Y, X, X2 = X^2, X3 = X^3, X4 = X^4, X5 = X^5, X6 = X^6, X7 = X^7, X8 = X^8, X9 = X^9, X10 = X^10)
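# Equivalently, the polynomial columns can be generated with poly() using raw terms; note the
# columns are then named X1..X10 rather than X, X2, ..., X10 (an alternative sketch, not used below)
data_alt = data.frame(Y = Y, poly(X, 10, raw = TRUE))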
library(leaps)
regfit.full = regsubsets(Y ~ ., data, nvmax = 10)
reg.summary = summary(regfit.full)
# C_p
reg.summary$cp
## [1] 10311.8854570 0.2030361 -0.0638344 0.6067483 2.1782005
## [6] 3.9955812 5.7869063 7.1694092 9.1535580 11.0000000
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l" )
which.min(reg.summary$cp) # 3
## [1] 3
points(3, reg.summary$cp[3], col = 'red', cex=2, pch = 20)
# According to Cp, the optimal model contains 3 variables.
coef(regfit.full,3)
## (Intercept) X2 X3 X9
## -0.9138169986 8.8408733725 -5.0739064911 0.0008650782
# BIC
reg.summary$bic
## [1] -71.32151 -537.20300 -535.03363 -531.88503 -527.75396 -523.35151
## [7] -518.97848 -515.06342 -510.47603 -506.04325
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l" )
which.min(reg.summary$bic) # 2
## [1] 2
points(2, reg.summary$bic[2], col = 'red', cex=2, pch = 20)
# According to BIC, the optimal model contains 2 variables.
coef(regfit.full,2)
## (Intercept) X2 X3
## -0.9406405 8.8773352 -4.9892781
# adjusted R^2
reg.summary$adjr2
## [1] 0.5484938 0.9958712 0.9959286 0.9959452 0.9959215 0.9958860 0.9958509
## [8] 0.9958342 0.9957886 0.9957486
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l" )
which.max(reg.summary$adjr2) # 4
## [1] 4
points(4, reg.summary$adjr2[4], col = 'red', cex=2, pch = 20)
# According to the adjusted R square value, the optimal model contains 4 variables.
coef(regfit.full,4)
## (Intercept) X X2 X3 X5
## -0.92799225 0.38745596 8.84575641 -5.44202574 0.08072292
The true model contains X, X^2 and X^3.
According to the best subset selection results:
Cp: X^2, X^3, X^9
BIC: X^2, X^3
Adjusted R square: X, X^2, X^3, X^5
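As a one-line cross-check, the model sizes chosen by the three criteria can be collected together (output omitted):
# model size selected by each criterion
c(Cp = which.min(reg.summary$cp), BIC = which.min(reg.summary$bic), AdjR2 = which.max(reg.summary$adjr2))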
(d). Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
# forward
regfit.fwd = regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Y ~ ., data = data, nvmax = 10, method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## X FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X4 FALSE FALSE
## X5 FALSE FALSE
## X6 FALSE FALSE
## X7 FALSE FALSE
## X8 FALSE FALSE
## X9 FALSE FALSE
## X10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
## X X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" "*" " " " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" "*" " " " " " " " " " " "*" " "
## 5 ( 1 ) "*" "*" "*" " " "*" " " " " " " "*" " "
## 6 ( 1 ) "*" "*" "*" " " "*" " " " " "*" "*" " "
## 7 ( 1 ) "*" "*" "*" " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" " "
## 9 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# backward
regfit.bwd = regsubsets(Y ~ ., data = data, nvmax=10, method = "backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Y ~ ., data = data, nvmax = 10, method = "backward")
## 10 Variables (and intercept)
## Forced in Forced out
## X FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X4 FALSE FALSE
## X5 FALSE FALSE
## X6 FALSE FALSE
## X7 FALSE FALSE
## X8 FALSE FALSE
## X9 FALSE FALSE
## X10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: backward
## X X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" "*" " " " " " " " " " " "*" " "
## 4 ( 1 ) "*" "*" "*" " " " " " " " " " " "*" " "
## 5 ( 1 ) "*" "*" "*" " " " " " " " " "*" "*" " "
## 6 ( 1 ) "*" "*" "*" " " " " " " " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" " " " " "*" " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Forward and backward stepwise selection identify exactly the same best 2-, 3- and 4-variable models. These also match the best subset selection results in (c) for the 2- and 3-variable models; only the best 4-variable model differs: best subset selection picks X, X^2, X^3 and X^5, while both stepwise methods pick X, X^2, X^3 and X^9.
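This can be confirmed by printing the three 4-variable fits directly (output omitted; the selected variables match the tables above):
coef(regfit.full, 4) # best subset: X, X2, X3, X5
coef(regfit.fwd, 4)  # forward: X, X2, X3, X9
coef(regfit.bwd, 4)  # backward: X, X2, X3, X9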
(e). Now fit a lasso model to the simulated data, again using X, \(X^2\),…,\(X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.
# Before using glmnet function, we must pass in an x matrix as well as a y vector.
library(glmnet)
set.seed(1)
x = model.matrix(Y ~ ., data)[,-1]
y = data$Y
# use CV to select the optimal value of lambda
cv.out = cv.glmnet(x, y, alpha = 1)
plot(cv.out)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.05489217
# coefficient estimates
grid <- 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x,y,alpha=1,lambda = grid)
lasso.coef = predict(lasso.mod, type = "coefficients", s= bestlam)[1:11,]
lasso.coef[lasso.coef!=0]
## (Intercept) X2 X3 X4 X6
## -8.240037e-01 8.606810e+00 -4.971453e+00 4.812840e-02 6.629000e-04
## X8
## 6.913996e-06
After performing cross-validation, the lambda that minimizes the CV error is 0.05489217. Refitting the lasso and extracting the coefficients at this lambda gives the estimates above: the lasso keeps X^2 and X^3 with coefficients close to the true values, assigns very small coefficients to X^4, X^6 and X^8, and shrinks the remaining terms exactly to zero; a variable whose coefficient is zero is excluded from the model.
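For comparison, cv.glmnet also records a more conservative choice, lambda.1se (the largest lambda within one standard error of the minimum CV error); its coefficients can be inspected the same way (a sketch, output omitted):
# coefficients at the one-standard-error lambda
predict(lasso.mod, type = "coefficients", s = cv.out$lambda.1se)[1:11, ]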
(f). Now generate a response vector Y according to the model \[Y=\beta_{0}+\beta_{7}X^7+\epsilon,\] and perform best subset selection and the lasso. Discuss the results obtained.
# new data
beta_7 = 12
Y2 = beta_0 + beta_7 * X^7 + e
data1 = data.frame(Y2, X, X2= X^2, X3 = X^3, X4 = X^4, X5=X^5, X6=X^6, X7 = X^7, X8 = X^8, X9 = X^9, X10 = X^10)
# best subset selection
library(leaps)
regfit.full = regsubsets(Y2~.,data = data1)
reg.summary = summary(regfit.full)
par(mfrow=c(2,3))
# adjust r square
which.max(reg.summary$adjr2) # 4
## [1] 4
plot(reg.summary$adjr2, xlab = "Number of Variables",ylab="Adjusted RSq", type="l")
points(4,reg.summary$adjr2[4],col="red",cex=2,pch=20)
coef(regfit.full,4) # 4-variable model: X, X^2, X^3, X^7
## (Intercept) X X2 X3 X7
## 3.0762524 0.2914016 -0.1617671 -0.2526527 12.0091338
# BIC
which.min(reg.summary$bic) # 1
## [1] 1
plot(reg.summary$bic, xlab = "Number of Variables",ylab="BIC", type="l")
points(1,reg.summary$bic[1],col="red",cex=2,pch=20)
coef(regfit.full,1) # single-variable model: X^7
## (Intercept) X7
## 2.95894 12.00077
# Cp
which.min(reg.summary$cp) # 2
## [1] 2
plot(reg.summary$cp, xlab = "Number of Variables",ylab="Cp", type="l")
points(2,reg.summary$cp[2],col="red",cex=2,pch=20)
coef(regfit.full,2) # 2-variable model: X^2, X^7
## (Intercept) X2 X7
## 3.0704904 -0.1417084 12.0015552
# Variables included in the optimal model
plot(regfit.full,scale = "adjr2")
plot(regfit.full,scale = "bic")
plot(regfit.full,scale = "Cp")
# lasso regression
# Step1: check and omit NA value
# Step2: use model.matrix function to change qualitative variables into dummy variables
x2 = model.matrix(Y2~.,data1)[,-1]
y2 = data1$Y2
# Step3: build model
library(glmnet)
grid <- 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x2,y2,alpha=1,lambda = grid)
# Step4: optimal value of lambda
cv.out = cv.glmnet(x2, y2, alpha = 1)
plot(cv.out)
bestlam = cv.out$lambda.min
bestlam # 21.20275
## [1] 21.20275
# coefficient estimates
lasso.coef = predict(lasso.mod, type = "coefficients", s= bestlam)[1:11,]
lasso.coef[lasso.coef !=0]
## (Intercept) X7
## 4.435343 11.650942
Comments:
The true model is Y = 3 + 12*X^7 + e.
(1) Best subset selection: adjusted R square chooses the 4-variable model, Cp the 2-variable model and BIC the 1-variable model (X^7 only).
(2) The lasso also selects only the single variable X^7, with coefficient estimates for beta0 and beta7 fairly close to the true values.
So, in this case, the lasso recovers the true model structure directly and performs at least as well as best subset selection, for which only the BIC criterion identifies the correct single-variable model.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a). Split the data set into a training set and a test set.
library(ISLR)
attach(College)
set.seed(1)
# method1
set.seed(1)
# note: sample(c(TRUE, FALSE), ...) draws TRUE and FALSE with equal probability, so this gives
# roughly a 50/50 split; the logical vector is recycled over the rows of College when subsetting
train1 = sample(c(TRUE,FALSE), nrow(College)*0.7, replace = TRUE)
test1 = (!train1)
y.test1 = College$Apps[test1]
# method2 (note: this second split is not used below)
set.seed(4)
train2 = sample(1:nrow(College), nrow(College)*0.7) # replace = FALSE
test2 = (-train2)
y.test2 = College[test2,]$Apps
(b). Fit a linear model using least squares on the training set, and report the test error obtained.
lm = lm(Apps~., data = College, subset=train1)
ols.pred = predict(lm, College[test1,])
ols_mse = mean((ols.pred-y.test1)^2)
ols_mse
## [1] 1139285
(c). Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
train.mat = model.matrix(Apps~.,College[train1,])
test.mat = model.matrix(Apps~.,College[test1,])
y_train = College[train1,]$Apps
y_test = College[test1,]$Apps
# build ridge model
library(glmnet)
grid = 10^seq(10,-2,length=100)
ridge.mod = glmnet(train.mat, y_train, alpha = 0, lambda = grid, thresh = 1e-12 )
# find best lambda
cv.out = cv.glmnet(train.mat, y_train, alpha = 0)
bestlam = cv.out$lambda.min
# predict and MSE
ridge.pred = predict(ridge.mod, s = bestlam, newx = test.mat)
ridge_mse = mean((ridge.pred-y_test)^2)
(d). Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# build lasso model
library(glmnet)
grid = 10^seq(10,-2,length=100)
lasso.mod = glmnet(train.mat, y_train, alpha = 1, lambda = grid, thresh = 1e-12 )
# find best lambda
cv.out = cv.glmnet(train.mat, y_train, alpha = 1)
bestlam = cv.out$lambda.min
# predict and MSE
lasso.pred = predict(lasso.mod, s = bestlam, newx = test.mat)
lasso_mse = mean((lasso.pred-y_test)^2)
# non-zero coefficient
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam)
lasso.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -514.44271026
## (Intercept) .
## PrivateYes -428.37924738
## Accept 1.71801006
## Enroll -1.21284078
## Top10perc 51.80050721
## Top25perc -14.63543313
## F.Undergrad 0.09112750
## P.Undergrad -0.03913185
## Outstate -0.11911140
## Room.Board 0.14886025
## Books 0.08442170
## Personal -0.01053865
## PhD -4.14758433
## Terminal -8.39153954
## S.F.Ratio 33.38387097
## perc.alumni 6.57109829
## Expend 0.09697893
## Grad.Rate 5.94725738
At the selected lambda, none of the 17 predictor coefficients is shrunk exactly to zero (the only zero row is the duplicated intercept column produced by model.matrix), so the lasso keeps all of the predictors here.
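This can be checked with a quick count (the duplicated intercept row is zero and is therefore not counted):
# non-zero entries: the intercept plus all 17 predictors
sum(lasso.coef != 0)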
(e). Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
set.seed(2)
pcr.fit = pcr(Apps~., data = College, subset = train1, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 402 17
## Y dimension: 402 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4171 4106 2305 2313 2109 2025 1863
## adjCV 4171 4108 2300 2309 2102 2026 1856
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1866 1866 1786 1755 1746 1766 1743
## adjCV 1859 1867 1779 1746 1733 1757 1734
## 14 comps 15 comps 16 comps 17 comps
## CV 1745 1758 1321 1313
## adjCV 1736 1731 1306 1296
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.814 57.71 64.68 70.58 75.74 80.74 84.31 87.49
## Apps 5.067 70.87 70.91 77.71 79.63 82.22 82.22 82.32
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.46 92.88 94.90 96.74 97.80 98.72 99.37
## Apps 83.99 84.70 85.32 85.32 85.89 85.95 90.88
## 16 comps 17 comps
## X 99.83 100.00
## Apps 93.21 93.71
validationplot(pcr.fit, val.type = "MSEP") # M = 17
pcr.pred = predict(pcr.fit, College[test1,], ncomp = 17)
pcr_mse = mean((pcr.pred-y_test)^2)
(f). Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit = plsr(Apps~.,data=College, subset=train1, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP") # M = 10
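# Alternatively, instead of reading M off the plot, the cross-validated MSEP curve can be
# extracted and minimized directly (a sketch; assumes the array layout returned by pls::MSEP
# with estimate = "CV", whose first entry is the intercept-only model)
cv.msep = MSEP(pls.fit, estimate = "CV")$val[1, 1, -1]
which.min(cv.msep)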
pls.pred = predict(pls.fit, College[test1,], ncomp = 10)
pls_mse = mean((pls.pred-y_test)^2)
(g). Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
# R square = 1 - RSS/TSS
library(magrittr) # in order to use %>% function
library(dplyr)
TSS = sum((y_test - mean(y_test))^2)
data.frame(approach = c("OLS","Ridge","Lasso","PCR","PLS"),
test_MSE = c(ols_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse),
test_R2 = c(1-sum((y_test-ols.pred)^2) / TSS,
1-sum((y_test-ridge.pred)^2) / TSS,
1-sum((y_test-lasso.pred)^2) / TSS,
1-sum((y_test-pcr.pred)^2) / TSS,
1-sum((y_test-pls.pred)^2) / TSS)) %>%
arrange(test_MSE)
Comments:
There is not much difference among the test errors resulting from these five approaches: the smallest test MSE (ridge regression) is only about 6.4% lower than the largest (OLS and PCR). Meanwhile, each of the models performs well, with test R2 values above 0.9.
We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
(a). Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model \[Y = X\beta + \epsilon,\] where \(\beta\) has some elements that are exactly equal to zero.
n = 1000
p = 20
X = matrix(rnorm(n*p, mean =0 , sd = 1), nrow = n)
beta = rnorm(p, mean = 0, sd = 2)
beta[1] = beta[3] = beta[5] = beta[7] = beta[9] = beta[11] = beta[13] = 0
eps = rnorm(n, mean = 1, sd = 3)
Y = X %*% beta + eps
(b). Split your data set into a training set containing 100 observations and a test set containing 900 observations.
df = data.frame(Y,X)
set.seed(42)
train_n = sample(1000,900, replace=FALSE)
# training set (note: this split places 900 observations in the training set and 100 in the test set)
train = df[train_n,]
test = df[-train_n,]
(c). Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
regfit.train = regsubsets(Y~.,train, nvmax = 20)
train.mat = model.matrix(Y~., train, nvmax = 20)
val.errors = rep(NA, 20)
for (i in 1:20){
coefi = coef(regfit.train, id = i)
pred = train.mat[, names(coefi)] %*% coefi
val.errors[i] = mean((train$Y - pred)^2)
}
plot(val.errors,xlab = "Number of predictors of best model",
ylab="Traning MSE",
type ="b",
pch=20,
col= "blue" )
(d). Plot the test set MSE associated with the best model of each size.
test.mat = model.matrix(Y~., test, nvmax = 20)
val.errors = rep(NA, 20)
for (i in 1:20){
coefi = coef(regfit.train, id = i)
pred = test.mat[, names(coefi)] %*% coefi
val.errors[i] = mean((test$Y - pred)^2)
}
plot(val.errors,xlab = "Number of predictors of best model",
ylab="Test MSE",
type ="b",
pch=20,
col= "red" )
(e). For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
which.min(val.errors)
## [1] 12
# Here the test MSE is minimized for the 12-variable model.
(f). How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
coef(regfit.train,10)
## (Intercept) X4 X6 X8 X12 X14
## 1.224632 -1.163012 2.037483 -2.323246 -3.202610 2.169971
## X16 X17 X18 X19 X20
## -1.200095 1.114554 -1.241834 -2.648545 2.950689
beta
## [1] 0.0000000 -0.9201557 0.0000000 -1.1614397 0.0000000 2.0952022
## [7] 0.0000000 -2.4072018 0.0000000 0.1847917 0.0000000 -3.2384499
## [13] 0.0000000 2.1796603 0.8795816 -1.2196297 1.1523020 -1.2475878
## [19] -2.8024330 2.9818237
Comments:
I set beta1, beta3, beta5, beta7, beta9, beta11 and beta13 to zero in the true model.
From the estimated coefficients of the 10-variable model shown above, best subset selection drops every variable whose true coefficient is zero, and the estimated coefficients are quite close to the true values. It does, however, also drop X2, X10 and X15, whose true coefficients are relatively small.
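For reference, the estimates can be lined up against the corresponding true coefficients directly (a small check, output omitted):
# align the estimated coefficients of the 10-variable model with the true beta values
coefi = coef(regfit.train, 10)[-1] # drop the intercept
true_beta = setNames(beta, paste0("X", 1:20))
cbind(estimate = coefi, truth = true_beta[names(coefi)])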
(g). Create a plot displaying \(\sqrt{\sum_{j=1}^{p}(\beta_{j}-\hat\beta_{j}^{r})^{2}}\) for a range of values of r, where \(\hat\beta_{j}^{r}\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?
val.errors = rep(NA, 20)
cols = colnames(df)[-1]
for (i in 1:20) {
coefi = coef(regfit.train, id = i)
# squared differences for the included coefficients, plus the squared true coefficients of the omitted variables
val.errors[i] = sqrt(sum((beta[cols %in% names(coefi)] - coefi[names(coefi) %in% cols])^2) + sum(beta[!(cols %in% names(coefi))]^2))
}
plot(val.errors, xlab = "Number of Predictors",
ylab = "Error between estimated and true coefficients",
pch = 19,
type = "b")
which.min(val.errors)
## [1] 10
Comments:
In (d), the test MSE is minimized by the 12-variable model.
In (g), the model whose coefficient estimates are closest to the true parameters is the 10-variable model.
The two criteria largely agree: a model whose coefficient estimates are close to the true coefficients also tends to have a low test MSE, although the two minima need not occur at exactly the same model size.
We will now try to predict per capita crime rate in the Boston dataset.
(a). Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
data(Boston)
set.seed(1)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
train.n = sample(nrow(Boston), nrow(Boston)*0.7, replace = FALSE)
test.n = -train.n
# train data set
train = Boston[train.n,]
test = Boston[test.n,]
# linear regression
lm = lm(crim~., data = train)
lm.pred = predict(lm, test)
lm.mse = mean((lm.pred - test$crim)^2)
lm.mse
## [1] 58.97718
# Best subset selection
sum(is.na(Boston$crim))
## [1] 0
attach(Boston)
library(leaps)
regfit.full = regsubsets(crim~.,train, nvmax = 13)
test.mat = model.matrix(crim~., data = test)
val.errors = rep(NA, 13)
for (i in 1:13){
coefi = coef(regfit.full, id = i)
bss.pred = test.mat[,names(coefi)]%*%coefi
val.errors[i] = mean((test$crim-bss.pred)^2)
}
which.min(val.errors) # 12
## [1] 12
bss.mse = val.errors[12]
bss.mse
## [1] 58.972
# Forward Stepwise Selection
regfit.fwd = regsubsets(crim~., data = train, nvmax = 13, method = "forward")
val.errors = rep(NA, 13)
for (i in 1:13){
coefi = coef(regfit.fwd, id = i)
fwd.pred = test.mat[,names(coefi)]%*%coefi
val.errors[i] = mean((test$crim-fwd.pred)^2)
}
fwd.mse = val.errors[which.min(val.errors)] # 12
fwd.mse
## [1] 58.972
# backward Stepwise Selection
regfit.bwd = regsubsets(crim~., data = train, nvmax = 13, method = "backward")
val.errors = rep(NA, 13)
for (i in 1:13){
coefi = coef(regfit.bwd, id = i)
bwd.pred = test.mat[,names(coefi)]%*%coefi
val.errors[i] = mean((test$crim-bwd.pred)^2)
}
bwd.mse = val.errors[which.min(val.errors)] # 12
bwd.mse
## [1] 58.972
# Ridge
train.mat = model.matrix(crim~.,data=train)
test.mat = model.matrix(crim~.,data=test)
cv.out = cv.glmnet(train.mat, train$crim, alpha=0)
bestlam = cv.out$lambda.min
ridge.mod = glmnet(train.mat, train$crim, alpha=0, lambda = bestlam)
ridge.pred = predict(ridge.mod, s= bestlam, newx = test.mat)
ridge.mse = mean((ridge.pred-test$crim)^2)
ridge.mse
## [1] 60.14492
# out = glmnet(train.mat,train$crim, alpha = 0, lambda=bestlam)
# ridge.coef = predict(out, type = "coefficients", s= bestlam)
# ridge.coef
# Lasso
cv.out = cv.glmnet(train.mat, train$crim, alpha=1)
bestlam = cv.out$lambda.min
lasso.mod = glmnet(train.mat, train$crim, alpha=1, lambda = bestlam)
lasso.pred = predict(lasso.mod, s= bestlam, newx = test.mat)
lasso.mse = mean((lasso.pred-test$crim)^2)
lasso.mse
## [1] 59.50934
out = glmnet(train.mat,train$crim, alpha = 1, lambda=bestlam)
lasso.coef = predict(out, type = "coefficients", s= bestlam)
lasso.coef # 10-variable model
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 11.80683892
## (Intercept) .
## zn 0.02951219
## indus -0.07725018
## chas -0.56086670
## nox -4.87927183
## rm .
## age .
## dis -0.53487860
## rad 0.47453908
## tax .
## ptratio -0.21591209
## black -0.01185566
## lstat 0.21203003
## medv -0.08579901
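The 10-variable model noted above can be confirmed by counting the predictors the lasso retains (the two intercept rows are excluded):
# number of predictors with non-zero lasso coefficients
sum(lasso.coef[-(1:2), ] != 0)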
# PCR
library(pls)
set.seed(2)
pcr.fit = pcr(crim~., data = train, scale= TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, test[-1], ncomp = 13)
pcr.mse = mean((pcr.pred-test$crim)^2)
pcr.mse
## [1] 58.97718
# PLS
set.seed(2)
pls.fit = plsr(crim~., data = train, scale= TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, test[-1], ncomp = 7)
pls.mse = mean((pls.pred-test$crim)^2)
pls.mse
## [1] 58.87353
(b). Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
library(magrittr) # in order to use %>% function
library(dplyr)
TSS = sum((test$crim - mean(test$crim))^2)
# recompute test-set predictions from the selected 12-variable subset-selection models,
# since bss.pred, fwd.pred and bwd.pred were left at the last (13-variable) loop iteration
bss.pred = test.mat[, names(coef(regfit.full, 12))] %*% coef(regfit.full, 12)
fwd.pred = test.mat[, names(coef(regfit.fwd, 12))] %*% coef(regfit.fwd, 12)
bwd.pred = test.mat[, names(coef(regfit.bwd, 12))] %*% coef(regfit.bwd, 12)
data.frame(method = c("OLS","Best Subset Selection","FWD","BWD","Ridge","Lasso","PCR","PLS"),
test_MSE = c(lm.mse, bss.mse, fwd.mse, bwd.mse,ridge.mse, lasso.mse, pcr.mse, pls.mse),
test_R2 = c(1-sum((test$crim-lm.pred)^2) / TSS,
1-sum((test$crim-bss.pred)^2) / TSS,
1-sum((test$crim-fwd.pred)^2) / TSS,
1-sum((test$crim-bwd.pred)^2) / TSS,
1-sum((test$crim-ridge.pred)^2) / TSS,
1-sum((test$crim-lasso.pred)^2) / TSS,
1-sum((test$crim-pcr.pred)^2) / TSS,
1-sum((test$crim-pls.pred)^2) / TSS),
num_of_variables = c(13,12,12,12,13,10,13,7)
) %>%
arrange(test_MSE)
Comments:
I used eight methods for prediction; the test MSE, test R square and number of variables in the selected model for each are shown above. The methods perform very similarly: the largest test MSE is only about 2.2% larger than the smallest one.
(c). Does your chosen model involve all of the features in the data set? Why or why not?
ANS: No. As shown in (b), all of the methods give very similar test MSEs, so I would choose the model with the fewest variables: the lasso model, which keeps only 10 of the 13 predictors and therefore does not involve all of the features.