Applied Exercises

Exercise 8 in section 6.8

In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

(a). Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector \(\epsilon\) of length n = 100.

set.seed(1)
n = 100
X = rnorm(n)
e = rnorm(n)

(b). Generate a response vector Y of length n = 100 according to the model \[Y=\beta_{0}+\beta_{1}X+\beta_{2}X^2+\beta_{3}X^3+\epsilon\] where \(\beta_{0},\beta_{1},\beta_{2}\) and \(\beta_{3}\) are constants of your choice.

# set beta value
beta_0 = 3
beta_1 = -4
beta_2 = 9
beta_3 = -5
Y = beta_0 + beta_1 + beta_2*X^2 + beta_3*X^3 + e

(c).Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, \(X^2\),…,\(X^{10}\). What is the best model obtained according to \(C_{p}\), BIC, and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y .

data = data.frame(Y, X, X2 = X^2, X3 = X^3, X4 = X^4, X5 = X^5, X6 = X^6, X7 = X^7, X8 = X^8, X9 = X^9, X10 = X^10)
library(leaps)
regfit.full = regsubsets(Y ~ ., data, nvmax = 10)
reg.summary = summary(regfit.full)
# C_p
reg.summary$cp
##  [1] 10311.8854570     0.2030361    -0.0638344     0.6067483     2.1782005
##  [6]     3.9955812     5.7869063     7.1694092     9.1535580    11.0000000
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l" )
which.min(reg.summary$cp) # 3
## [1] 3
points(3, reg.summary$cp[3], col = 'red', cex=2, pch = 20)

# According to BIC, the optimal model contains 3 variables.
coef(regfit.full,3)
##   (Intercept)            X2            X3            X9 
## -0.9138169986  8.8408733725 -5.0739064911  0.0008650782
# BIC
reg.summary$bic
##  [1]  -71.32151 -537.20300 -535.03363 -531.88503 -527.75396 -523.35151
##  [7] -518.97848 -515.06342 -510.47603 -506.04325
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l" )
which.min(reg.summary$bic) # 2
## [1] 2
points(2, reg.summary$bic[2], col = 'red', cex=2, pch = 20)

# According to BIC, the optimal model contains 2 variables.
coef(regfit.full,2)
## (Intercept)          X2          X3 
##  -0.9406405   8.8773352  -4.9892781
# adjusted R^2
reg.summary$adjr2
##  [1] 0.5484938 0.9958712 0.9959286 0.9959452 0.9959215 0.9958860 0.9958509
##  [8] 0.9958342 0.9957886 0.9957486
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l" )
which.max(reg.summary$adjr2) # 4
## [1] 4
points(4, reg.summary$adjr2[4], col = 'red', cex=2, pch = 20)

# According to asjusted R square value, the optimal model contains 4 variables.
coef(regfit.full,4)
## (Intercept)           X          X2          X3          X5 
## -0.92799225  0.38745596  8.84575641 -5.44202574  0.08072292


The original formula contains X, X^2 and X^3.
According to best subset selection method:
Cp: X^2, X^3, X^9
BIC: X^2, X^3
Adjust R square: X, X^2, X^3, X^5.

(d). Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

# forward
regfit.fwd = regsubsets(Y ~ ., data = data, nvmax = 10, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Y ~ ., data = data, nvmax = 10, method = "forward")
## 10 Variables  (and intercept)
##     Forced in Forced out
## X       FALSE      FALSE
## X2      FALSE      FALSE
## X3      FALSE      FALSE
## X4      FALSE      FALSE
## X5      FALSE      FALSE
## X6      FALSE      FALSE
## X7      FALSE      FALSE
## X8      FALSE      FALSE
## X9      FALSE      FALSE
## X10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
##           X   X2  X3  X4  X5  X6  X7  X8  X9  X10
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " "*" "*" " " " " " " " " " " " " " "
## 3  ( 1 )  " " "*" "*" " " " " " " " " " " "*" " "
## 4  ( 1 )  "*" "*" "*" " " " " " " " " " " "*" " "
## 5  ( 1 )  "*" "*" "*" " " "*" " " " " " " "*" " "
## 6  ( 1 )  "*" "*" "*" " " "*" " " " " "*" "*" " "
## 7  ( 1 )  "*" "*" "*" " " "*" " " "*" "*" "*" " "
## 8  ( 1 )  "*" "*" "*" " " "*" "*" "*" "*" "*" " "
## 9  ( 1 )  "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# backward
regfit.bwd = regsubsets(Y ~ ., data = data, nvmax=10, method = "backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(Y ~ ., data = data, nvmax = 10, method = "backward")
## 10 Variables  (and intercept)
##     Forced in Forced out
## X       FALSE      FALSE
## X2      FALSE      FALSE
## X3      FALSE      FALSE
## X4      FALSE      FALSE
## X5      FALSE      FALSE
## X6      FALSE      FALSE
## X7      FALSE      FALSE
## X8      FALSE      FALSE
## X9      FALSE      FALSE
## X10     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: backward
##           X   X2  X3  X4  X5  X6  X7  X8  X9  X10
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " "*" "*" " " " " " " " " " " " " " "
## 3  ( 1 )  " " "*" "*" " " " " " " " " " " "*" " "
## 4  ( 1 )  "*" "*" "*" " " " " " " " " " " "*" " "
## 5  ( 1 )  "*" "*" "*" " " " " " " " " "*" "*" " "
## 6  ( 1 )  "*" "*" "*" " " " " " " " " "*" "*" "*"
## 7  ( 1 )  "*" "*" "*" " " " " "*" " " "*" "*" "*"
## 8  ( 1 )  "*" "*" "*" "*" " " "*" " " "*" "*" "*"
## 9  ( 1 )  "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"


The best 2-variable, 3-variable and 4-variable models identified by forward stepwise selection are the same as backward stepwise selection. Comparing to the results in (c), except that the best 4-variable model selected by the best subset selection is different from them, the other two models are the same as them.

(e). Now fit a lasso model to the simulated data, again using X, \(X^2\),…,\(X^10\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.

# Before using glmnet function, we must pass in an x matrix as well as a y vector.
library(glmnet)
set.seed(1)
x = model.matrix(Y ~ ., data)[,-1]
y = data$Y

# use CV to select the optimal value of lambda
cv.out = cv.glmnet(x, y, alpha = 1)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam
## [1] 0.05489217
# coefficient estimates
grid <- 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x,y,alpha=1,lambda = grid)
lasso.coef = predict(lasso.mod, type = "coefficients", s= bestlam)[1:11,]
lasso.coef[lasso.coef!=0]
##   (Intercept)            X2            X3            X4            X6 
## -8.240037e-01  8.606810e+00 -4.971453e+00  4.812840e-02  6.629000e-04 
##            X8 
##  6.913996e-06


After performing cross validation we get lambda as 0.05489217 for minimum MSE, then using this lambda value to fit lasso regression, finaly obtaining the coefficient estimates as above. A variable with a coefficient of 0 means it will not be selected.

  1. Now generate a response vector Y according to the model \[Y = \beta_{0} + \beta_{7+}X^{7} + \epsilon\], and perform best subset selection and the lasso. Discuss the results obtained.
# new data
beta_7 = 12
Y2 = beta_0 + beta_7 * X^7 + e
data1 = data.frame(Y2, X, X2= X^2, X3 = X^3, X4 = X^4, X5=X^5, X6=X^6, X7 = X^7, X8 = X^8, X9 = X^9, X10 = X^10)

# best subset selection
library(leaps)
regfit.full = regsubsets(Y2~.,data = data1)
reg.summary = summary(regfit.full)
par(mfrow=c(2,3))

# adjust r square
which.max(reg.summary$adjr2) # 4
## [1] 4
plot(reg.summary$adjr2, xlab = "Number of Variables",ylab="Adjusted RSq", type="l")
points(4,reg.summary$bic[4],col="red",cex=2,pch=20)
coef(regfit.full,4) # 4-variable model: X, X^2, X^3, X^7
## (Intercept)           X          X2          X3          X7 
##   3.0762524   0.2914016  -0.1617671  -0.2526527  12.0091338
# BIC
which.min(reg.summary$bic) # 1
## [1] 1
plot(reg.summary$bic, xlab = "Number of Variables",ylab="BIC", type="l")
points(1,reg.summary$bic[1],col="red",cex=2,pch=20)
coef(regfit.full,1) # single-variable model: X^7
## (Intercept)          X7 
##     2.95894    12.00077
# Cp
which.min(reg.summary$cp) # 2
## [1] 2
plot(reg.summary$cp, xlab = "Number of Variables",ylab="Cp", type="l")
points(2,reg.summary$cp[2],col="red",cex=2,pch=20)
coef(regfit.full,2) # 2-variable model: X^2, X^7
## (Intercept)          X2          X7 
##   3.0704904  -0.1417084  12.0015552
# Variables included in the optimal model
plot(regfit.full,scale = "adjr2")
plot(regfit.full,scale = "bic")
plot(regfit.full,scale = "Cp")

# lasso regression
# Step1: check and omit NA value
# Step2: use model.matrix function to change qualitative variables into dummy variables
x2 = model.matrix(Y2~.,data1)[,-1]
y2 = data1$Y2

# Step3: build model
library(glmnet)
grid <- 10^seq(10, -2, length = 100)
lasso.mod = glmnet(x2,y2,alpha=1,lambda = grid)

# Step4: optimal value of lambda
cv.out = cv.glmnet(x2, y2, alpha = 1)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam  # 21.20277
## [1] 21.20275
# coefficient estimates
lasso.coef = predict(lasso.mod, type = "coefficients", s= bestlam)[1:11,]
lasso.coef[lasso.coef !=0]
## (Intercept)          X7 
##    4.435343   11.650942

Comments:
The original formula is: Y = 3 + 12*X^7 + e
(1) The best subset selection: From the results above, adjusted R square chooses 4-variable model, Cp chooses 2-variable model and BIC chooses 1 variable model.
(2) However, the lasso regression just choose single-variable model and the coefficients about beta0 and beta7 are similar to the original coefficients.
So, in this case, lasso model perform better than best subset selection.

Exercise 9 in section 6.8

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a). Split the data set into a training set and a test set.

library(ISLR)
attach(College)
set.seed(1)

# method1
set.seed(1)
train1 = sample(c(TRUE,FALSE), nrow(College)*0.7, replace = TRUE)
test1 = (!train1)
y.test1 = College$Apps[test1]

# method2 (Tip: we will not use this data sets)
set.seed(4)
train2 = sample(1:nrow(College), nrow(College)*0.7) # replace = FALSE
test2 = (-train2)
y.test2 = College[test2,]$Apps

(b). Fit a linear model using least squares on the training set, and report the test error obtained.

lm = lm(Apps~., data = College, subset=train1)
ols.pred = predict(lm, College[test1,])
ols_mse = mean((ols.pred-y.test1)^2)
ols_mse
## [1] 1139285

(c). Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

train.mat = model.matrix(Apps~.,College[train1,])
test.mat = model.matrix(Apps~.,College[test1,])
y_train = College[train1,]$Apps
y_test = College[test1,]$Apps

# build ridge model
library(glmnet)
grid = 10^seq(10,-2,length=100)
ridge.mod = glmnet(train.mat, y_train, alpha = 0, lambda = grid, thresh = 1e-12 )

# find best lambda
cv.out = cv.glmnet(train.mat, y_train, alpha = 0)
bestlam = cv.out$lambda.min

# predict and MSE
ridge.pred = predict(ridge.mod, s = bestlam, newx = test.mat)
ridge_mse = mean((ridge.pred-y_test)^2)

(d). Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# build lasso model
library(glmnet)
grid = 10^seq(10,-2,length=100)
lasso.mod = glmnet(train.mat, y_train, alpha = 1, lambda = grid, thresh = 1e-12 )

# find best lambda
cv.out = cv.glmnet(train.mat, y_train, alpha = 1)
bestlam = cv.out$lambda.min

# predict and MSE
lasso.pred = predict(lasso.mod, s = bestlam, newx = test.mat)
lasso_mse = mean((lasso.pred-y_test)^2)

# non-zero coefficient
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam)
lasso.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -514.44271026
## (Intercept)    .         
## PrivateYes  -428.37924738
## Accept         1.71801006
## Enroll        -1.21284078
## Top10perc     51.80050721
## Top25perc    -14.63543313
## F.Undergrad    0.09112750
## P.Undergrad   -0.03913185
## Outstate      -0.11911140
## Room.Board     0.14886025
## Books          0.08442170
## Personal      -0.01053865
## PhD           -4.14758433
## Terminal      -8.39153954
## S.F.Ratio     33.38387097
## perc.alumni    6.57109829
## Expend         0.09697893
## Grad.Rate      5.94725738


We can see that 15 of the 19 coefficients are non-zero.


(e). Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(2)
pcr.fit = pcr(Apps~., data = College, subset = train1, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 402 17 
##  Y dimension: 402 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4171     4106     2305     2313     2109     2025     1863
## adjCV         4171     4108     2300     2309     2102     2026     1856
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1866     1866     1786      1755      1746      1766      1743
## adjCV     1859     1867     1779      1746      1733      1757      1734
##        14 comps  15 comps  16 comps  17 comps
## CV         1745      1758      1321      1313
## adjCV      1736      1731      1306      1296
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.814    57.71    64.68    70.58    75.74    80.74    84.31    87.49
## Apps    5.067    70.87    70.91    77.71    79.63    82.22    82.22    82.32
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.46     92.88     94.90     96.74     97.80     98.72     99.37
## Apps    83.99     84.70     85.32     85.32     85.89     85.95     90.88
##       16 comps  17 comps
## X        99.83    100.00
## Apps     93.21     93.71
validationplot(pcr.fit, val.type = "MSEP")  # M = 17

pcr.pred = predict(pcr.fit, College[test1,], ncomp = 17)
pcr_mse = mean((pcr.pred-y_test)^2)

(f). Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit = plsr(Apps~.,data=College, subset=train1, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP") # M = 10

pls.pred = predict(pls.fit, College[test1,], ncomp = 10)
pls_mse = mean((pls.pred-y_test)^2)

(g). Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches.

# R square = 1 - RSS/TSS

library(magrittr) # in order to use %>% function
library(dplyr)
TSS = sum((y_test - mean(y_test))^2)
data.frame(approach = c("OLS","Ridge","Lasso","PCR","PLS"),
           test_MSE = c(ols_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse),
           test_R2 = c(1-sum((y_test-ols.pred)^2) / TSS,
                       1-sum((y_test-ridge.pred)^2) / TSS,
                       1-sum((y_test-lasso.pred)^2) / TSS,
                       1-sum((y_test-pcr.pred)^2) / TSS,
                       1-sum((y_test-pls.pred)^2) / TSS)) %>%
 arrange(test_MSE)

Comments:
There is not much difference among the test errors resulting from these 5 approaches. The smallest test MSE(ridge regression) is only less 6.4% than the largest test MSE(OLS&PCR). Meanwhile, each of the models performed well, with test R2 values above 0.9.



Exercise 10 in section 6.8

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

(a). Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model \[Y = X\beta + \epsilon,\] where \(\beta\) has some elements that are exactly equal to zero.

n = 1000
p = 20
X = matrix(rnorm(n*p, mean =0 , sd = 1), nrow = n)
beta = rnorm(p, mean = 0, sd = 2)
beta[1] = beta[3] = beta[5] = beta[7] = beta[9] = beta[11] = beta[13] = 0
eps = rnorm(n, mean = 1, sd = 3)
Y = X %*% beta + eps

(b). Split your data set into a training set containing 100 observations and a test set containing 900 observations.

df = data.frame(Y,X)
set.seed(42)
train_n = sample(1000,900, replace=FALSE)
# training set
train = df[train_n,]
test = df[-train_n,]

(c). Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

regfit.train = regsubsets(Y~.,train, nvmax = 20)
train.mat = model.matrix(Y~., train, nvmax = 20)
val.errors = rep(NA, 20)
for (i in 1:20){
  coefi = coef(regfit.train, id = i)
  pred = train.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((train$Y - pred)^2)
}
plot(val.errors,xlab = "Number of predictors of best model",
     ylab="Traning MSE",
     type ="b",
     pch=20,
     col= "blue" )

(d). Plot the test set MSE associated with the best model of each size.

test.mat = model.matrix(Y~., test, nvmax = 20)
val.errors = rep(NA, 20)
for (i in 1:20){
  coefi = coef(regfit.train, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((test$Y - pred)^2)
}
plot(val.errors,xlab = "Number of predictors of best model",
     ylab="Test MSE",
     type ="b",
     pch=20,
     col= "red" )

(e). For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

which.min(val.errors)
## [1] 12
# The test MSE is minimized for the 10-variable model.

(f). How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

coef(regfit.train,10)
## (Intercept)          X4          X6          X8         X12         X14 
##    1.224632   -1.163012    2.037483   -2.323246   -3.202610    2.169971 
##         X16         X17         X18         X19         X20 
##   -1.200095    1.114554   -1.241834   -2.648545    2.950689
beta
##  [1]  0.0000000 -0.9201557  0.0000000 -1.1614397  0.0000000  2.0952022
##  [7]  0.0000000 -2.4072018  0.0000000  0.1847917  0.0000000 -3.2384499
## [13]  0.0000000  2.1796603  0.8795816 -1.2196297  1.1523020 -1.2475878
## [19] -2.8024330  2.9818237

Comments:
I zeroed beta1, beta3, beta5, beta7, beta9, beta11, beta13 in true model.
From the etimated coefficient values of best subset model, we can see that the 10-variable model which contains the minimized test MSE also eliminate variables with zero coefficients, meanwhile the esitamed coefficients are pretty similar to the true coefficient values. Although it delete the X12, X14 and X16 in this case.

(g). Create a plot displaying \(\sqrt{\sum_{j=1}^{p}(\beta_{0}-\hat\beta_{j}^r)}\)for a range of values of r, where \(\hat\beta_{j}^r\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?

val.errors = rep(NA, 20)
cols = colnames(df)[-1]
for (i in 1:20) {
    coefi = coef(regfit.train, id = i)
    val.errors[i] = sqrt(sum((beta[cols %in% names(coefi)] - coefi[names(coefi) %in% cols])^2) + sum(beta[!(cols %in% names(coefi))])^2)
}
plot(val.errors, xlab = "Number of Predictors", 
     ylab = "MSE for estimated and true coefficients", 
     pch = 19, 
     type = "b")

which.min(val.errors)
## [1] 10

Comments:
In (d), we select a model based on the test MSE, then the result is 10-variable model.
In (g), we select a model based on how close the estimared coefficients are to the true parameters, then it also let us to arrive at selecting the 10-variable model.
It shows that a model that has coefficient estimates closer to true coefficients will also have a lower test MSE.



Exercise 11 in section 6.8

We will now try to predict per capita crime rate in the Boston dataset.

(a). Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
data(Boston)
set.seed(1)

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
train.n = sample(nrow(Boston), nrow(Boston)*0.7, replace = FALSE)
test.n = -train.n
# train data set
train = Boston[train.n,]
test = Boston[test.n,]
# linear regression 
lm = lm(crim~., data = train)
lm.pred = predict(lm, test)
lm.mse = mean((lm.pred - test$crim)^2)
lm.mse
## [1] 58.97718
# Best subset selection
sum(is.na(Boston$crim))
## [1] 0
attach(Boston)
library(leaps)
regfit.full = regsubsets(crim~.,train, nvmax = 13)
test.mat = model.matrix(crim~., data = test)

val.errors = rep(NA, 13)
for (i in 1:13){
  coefi = coef(regfit.full, id = i)
  bss.pred = test.mat[,names(coefi)]%*%coefi
  val.errors[i] = mean((test$crim-bss.pred)^2)
}

which.min(val.errors) # 12
## [1] 12
bss.mse = val.errors[12]
bss.mse
## [1] 58.972
# Forward Stepwise Selection
regfit.fwd = regsubsets(crim~., data = train, nvmax = 13, method = "forward")

val.errors = rep(NA, 13)
for (i in 1:13){
  coefi = coef(regfit.fwd, id = i)
  fwd.pred = test.mat[,names(coefi)]%*%coefi
  val.errors[i] = mean((test$crim-fwd.pred)^2)
}

fwd.mse = val.errors[which.min(val.errors)] # 12
fwd.mse
## [1] 58.972
# backward Stepwise Selection
regfit.bwd = regsubsets(crim~., data = train, nvmax = 13, method = "backward")

val.errors = rep(NA, 13)
for (i in 1:13){
  coefi = coef(regfit.bwd, id = i)
  bwd.pred = test.mat[,names(coefi)]%*%coefi
  val.errors[i] = mean((test$crim-bwd.pred)^2)
}

bwd.mse = val.errors[which.min(val.errors)] # 12
bwd.mse
## [1] 58.972
# Ridge
train.mat = model.matrix(crim~.,data=train)
test.mat = model.matrix(crim~.,data=test)

cv.out = cv.glmnet(train.mat, train$crim, alpha=0)
bestlam = cv.out$lambda.min

ridge.mod = glmnet(train.mat, train$crim, alpha=0, lambda = bestlam)
ridge.pred = predict(ridge.mod, s= bestlam, newx = test.mat)
ridge.mse = mean((ridge.pred-test$crim)^2)
ridge.mse
## [1] 60.14492
# out = glmnet(train.mat,train$crim, alpha = 0, lambda=bestlam)
# ridge.coef = predict(out, type = "coefficients", s= bestlam)
# ridge.coef
# Lasso
cv.out = cv.glmnet(train.mat, train$crim, alpha=1)
bestlam = cv.out$lambda.min

lasso.mod = glmnet(train.mat, train$crim, alpha=1, lambda = bestlam)
lasso.pred = predict(lasso.mod, s= bestlam, newx = test.mat)
lasso.mse = mean((lasso.pred-test$crim)^2)
lasso.mse
## [1] 59.50934
out = glmnet(train.mat,train$crim, alpha = 1, lambda=bestlam)
lasso.coef = predict(out, type = "coefficients", s= bestlam)
lasso.coef # 10-variable model
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 11.80683892
## (Intercept)  .         
## zn           0.02951219
## indus       -0.07725018
## chas        -0.56086670
## nox         -4.87927183
## rm           .         
## age          .         
## dis         -0.53487860
## rad          0.47453908
## tax          .         
## ptratio     -0.21591209
## black       -0.01185566
## lstat        0.21203003
## medv        -0.08579901
# PCR
library(pls)
set.seed(2)
pcr.fit = pcr(crim~., data = train, scale= TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred = predict(pcr.fit, test[-1], ncomp = 13)
pcr.mse = mean((pcr.pred-test$crim)^2)
pcr.mse
## [1] 58.97718
# PLS
set.seed(2)
pls.fit = plsr(crim~., data = train, scale= TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, test[-1], ncomp = 7)
pls.mse = mean((pls.pred-test$crim)^2)
pls.mse
## [1] 58.87353

(b). Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

library(magrittr) # in order to use %>% function
library(dplyr)
TSS = sum((y_test - mean(y_test))^2)

data.frame(method = c("OLS","Best Subset Selection","FWD","BWD","Ridge","Lasso","PCR","PLS"),
           test_MSE = c(lm.mse, bss.mse, fwd.mse, bwd.mse,ridge.mse, lasso.mse, pcr.mse, pls.mse),
           test_R2 = c(1-sum((test$crim-lm.pred)^2) / TSS,
                       1-sum((test$crim-bss.pred)^2) / TSS,
                       1-sum((test$crim-fwd.pred)^2) / TSS,
                       1-sum((test$crim-bwd.pred)^2) / TSS,
                       1-sum((test$crim-ridge.pred)^2) / TSS,
                       1-sum((test$crim-lasso.pred)^2) / TSS,
                       1-sum((test$crim-pcr.pred)^2) / TSS,
                       1-sum((test$crim-pls.pred)^2) / TSS),
           num_of_variables = c(13,12,12,12,13,10,13,7) 
           ) %>%
 arrange(test_MSE)

Comments:
I use 8 methods to do prediction. The detailed information about test MSE, test R square and number of variables for the best model are shown in above. Since all test R-squares exceed 99.99%, all models perform well. In addition, the biggest test MSE is only 2.2% larger than the smallest one.

(c). Does your chosen model involve all of the features in the dataset? Why or why not? ANS:
No, I will not choose a model that involve all the features in the data set. Because from the details shown in (b), all models perform well and there is little difference among them in terms of test MSE and test R2, so I will choose the model which contains the fewest variables, the lasso model.