Chapter 6: Linear Model Selection and Regularization

Subset Selection, Shrinkage Methods (Ridge and Lasso), Dimension Reduction Methods (Principal Components Regression and Partial Least Squares)

Conceptual (1-7) & Applied (8-11)

Problem 1

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:

(a)

Which of the three models with k predictors has the smallest training RSS?

Solution (a)

The best subset model will have the smallest training RSS, because for each k it is chosen after considering all \(\binom{p}{k}\) possible models with k predictors. Forward and backward stepwise selection search only a restricted sequence of models, so they are not guaranteed to find the k-predictor model with the lowest training RSS.
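As an illustration (this sketch is not part of the original answer; the data are simulated and the object names are made up), regsubsets() can be run with both the exhaustive and the forward method on the same data, and the training RSS of the best-subset model of each size is never larger than that of the forward-stepwise model of the same size:

library(leaps)
set.seed(1)
X <- matrix(rnorm(50 * 6), 50, 6)
y <- as.vector(X %*% rnorm(6) + rnorm(50))
dat <- data.frame(y = y, X)
best <- summary(regsubsets(y ~ ., data = dat, nvmax = 6))                      # exhaustive search
fwd  <- summary(regsubsets(y ~ ., data = dat, nvmax = 6, method = "forward"))  # forward stepwise
cbind(best.subset = best$rss, forward = fwd$rss)  # best-subset RSS <= forward RSS for every size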

(b)

Which of the three models with k predictors has the smallest test RSS?

Solution (b)

Cannot say. Best subset selection picks the k-predictor model with the lowest training RSS among all possible combinations, while forward and backward stepwise pick from a restricted set of candidates, also based on training RSS. A lower training RSS does not guarantee a lower test RSS, so any of the three could have the smallest test RSS on a given data set.

(c)

True or False:

Solution (c)

i

The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

TRUE. Forward stepwise builds the (k + 1)-variable model by adding one predictor to its k-variable model, so the k predictors are always retained.

ii

The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by backward stepwise selection.

TRUE. Backward stepwise obtains the k-variable model by deleting one predictor from its (k + 1)-variable model, so the remaining k predictors are a subset of the k + 1.

iii

The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by forward stepwise selection.

FALSE. The backward k-variable model and the forward (k + 1)-variable model come from different search paths, so neither need be nested in the other.

iv

The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

FALSE. Forward and backward stepwise follow different search paths, so there is no guaranteed nesting between a forward k-variable model and a backward (k + 1)-variable model.

v

The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

FALSE. Best subset selection searches all models of each size from scratch, so the best (k + 1)-variable model need not contain all of the predictors in the best k-variable model.

Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a)

The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Solution (a)

iii

Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso constrains (shrinks) the coefficient estimates, so it is less flexible than least squares; the reduction in variance can more than offset the added bias.

(b)

Repeat (a) for ridge regression relative to least squares.

Solution (b)

iii. Same as the lasso.

Ridge regression also shrinks the coefficient estimates, so it is less flexible than least squares and improves prediction accuracy when its increase in bias is less than its decrease in variance.

(c)

Repeat (a) for non-linear methods relative to least squares.

Solution (c)

ii

More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods do not impose the linear form, so they are more flexible than least squares.

Problem 3

Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j})^2\] subject to

\[ \sum_{j=1}^{p} |\beta_j| \le s\]

for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

(a)

As we increase s from 0, the training RSS will:

  i. Increase initially, and then eventually start decreasing in an inverted U shape.

  ii. Decrease initially, and then eventually start increasing in a U shape.

  iii. Steadily increase.

  iv. Steadily decrease.

  v. Remain constant.

Solution (a)

Steadily decrease

As s increases from 0, the constraint on \(\beta\) is relaxed, so the model becomes more and more flexible; the training RSS therefore steadily decreases.
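(Side note, not part of the original answer: this constrained formulation is closely related to the penalized formulation used in Problem 4. For every \(\lambda \ge 0\) there is a budget \(s\) such that minimizing

\[\sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]

gives the same coefficient estimates as the constrained problem above, with small \(s\) corresponding to large \(\lambda\). This is why the answers to Problems 3 and 4 mirror each other.)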

(b)

Repeat (a) for test RSS

Solution (b)

Decrease initially, and then eventually start increasing in a U shape

As s increases the model becomes more and more flexible: the test RSS first decreases, and then starts increasing once the model begins to overfit.

(c)

Repeat (a) for variance.

Solution (c)

Steadily increase

Variance steadily increases with the increase in model flexibility.

(d)

Repeat (a) for (squared) bias.

Solution (d)

Steadily decrease

Bias decreases with the increase in the model flexibility

(e)

Repeat (a) for the irreducible error.

Solution (e)

Remain constant

Irreducible error is independent of model parameters and thus independent of s

Problem 4

Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j})^2 + \lambda \sum_{j=1}^{p} \beta_j^2\] for a particular value of \(\lambda\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

(a)

As we increase \(\lambda\) from 0, the training RSS will:

  i. Increase initially, and then eventually start decreasing in an inverted U shape.

  ii. Decrease initially, and then eventually start increasing in a U shape.

  iii. Steadily increase.

  iv. Steadily decrease.

  v. Remain constant.

Solution (a)

Steadily increase

As \(\lambda\) increases from 0, the penalty on \(\beta\) grows, so the model becomes less and less flexible; the training RSS therefore steadily increases.
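An illustrative sketch (not part of the original answer; the data here are simulated just for this plot) confirming that the training RSS of a ridge fit is non-decreasing in \(\lambda\):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- as.vector(x %*% rnorm(5) + rnorm(100))
fit <- glmnet(x, y, alpha = 0, lambda = 10^seq(-3, 2, length = 50))  # ridge over a lambda grid
train.rss <- colSums((y - predict(fit, newx = x))^2)                 # one training RSS per lambda
plot(log10(fit$lambda), train.rss, type = "b", xlab = "log10(lambda)", ylab = "Training RSS")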

(b)

Repeat (a) for test RSS

Solution (b)

Decrease initially, and then eventually start increasing in a U shape

As \(\lambda\) increases the model becomes less and less flexible: the test RSS first decreases as the variance is reduced, and then starts increasing once the model becomes too inflexible and begins to underfit.

(c)

Repeat (a) for variance.

Solution (c)

Steadily decrease

Variance steadily decreases with the decrease in model flexibility

(d)

Repeat (a) for (squared) bias.

Solution (d)

Steadily increase

Bias increases with the decrease in the model flexibility

(e)

Repeat (a) for the irreducible error.

Solution (e)

Remain constant

Irreducible error is independent of model parameters and thus independent of \(\lambda\)

Problem 5

It is well known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting. Suppose that n = 2, p = 2, \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1 + y_2 = 0\), \(x_{11} + x_{21} = 0\), and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat{\beta}_0 = 0\).

(a)

Write out the ridge regression optimization problem in this setting.

Solution (a)

Given the setting we have \(x_{11} = x_{12} = x_1\) and similarly \(x_{21} = x_{22} = x_2\). With \(\hat{\beta}_0 = 0\), the ridge regression problem reduces to minimizing \((y_1 - \beta_1 x_1 - \beta_2 x_1)^2 + (y_2 - \beta_1 x_2 - \beta_2 x_2)^2 + \lambda (\beta_1^2 + \beta_2^2)\).

(b)

Argue that in this setting, the ridge coefficient estimates satisfy \(\beta_1 = \beta_2\)

Solution (b)

Differentiating the ridge objective with respect to \(\beta_1\) and \(\beta_2\), setting both derivatives to zero, and subtracting the resulting equations yields \(\hat{\beta}_1 = \hat{\beta}_2\) (for \(\lambda > 0\)); the algebra is sketched below.
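A worked version of the algebra (a sketch, using the notation \(x_1 = x_{11} = x_{12}\) and \(x_2 = x_{21} = x_{22}\) from part (a)): setting the partial derivatives of the ridge objective to zero gives

\[-2x_1\big(y_1 - (\beta_1 + \beta_2)x_1\big) - 2x_2\big(y_2 - (\beta_1 + \beta_2)x_2\big) + 2\lambda\beta_1 = 0,\]
\[-2x_1\big(y_1 - (\beta_1 + \beta_2)x_1\big) - 2x_2\big(y_2 - (\beta_1 + \beta_2)x_2\big) + 2\lambda\beta_2 = 0.\]

Subtracting the second equation from the first leaves \(2\lambda(\beta_1 - \beta_2) = 0\), so \(\hat{\beta}_1 = \hat{\beta}_2\) whenever \(\lambda > 0\).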

(c)

Write out the lasso optimization problem in this setting

Solution (c)

Analogously, the lasso problem reduces to minimizing \((y_1 - \beta_1 x_1 - \beta_2 x_1)^2 + (y_2 - \beta_1 x_2 - \beta_2 x_2)^2 + \lambda (|\beta_1| + |\beta_2|)\).

Problem 6

We will now explore (6.12) and (6.13) further.

(a)

Consider (6.12) \[\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2\] with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.12) as a function of \(\beta_1\).

Your plot should confirm that (6.12) is solved by (6.14): \[\hat{\beta}_{j}^{R} = y_j / (1+\lambda)\]

Solution a

y <- 5
lambda <- 5
beta <- seq(-10, 10, 0.1)
plot(beta, (y - beta)^2 + lambda * beta^2, pch = 20, xlab = "beta", ylab = "Ridge optimization")
beta.est <- y / (1 + lambda)
points(beta.est, (y - beta.est)^2 + lambda * beta.est^2, col = "red", pch = 4, lwd = 5)

(b)

Consider (6.13) \[\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\] with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.13) as a function of \(\beta_1\). Your plot should confirm that (6.13) is solved by (6.15).
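For reference, (6.15) is the soft-thresholding solution given in the text:

\[\hat{\beta}_{j}^{L} = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2,\\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2,\\ 0 & \text{if } |y_j| \le \lambda/2. \end{cases}\]

With \(y_1 = 5\) and \(\lambda = 5\) as chosen below, \(y_1 > \lambda/2\), so the minimizer is \(y_1 - \lambda/2 = 2.5\), which is the point marked on the plot.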

Solution (b)

y <- 5
lambda <- 5
beta <- seq(-10, 10, 0.1)
plot(beta, (y - beta)^2 + lambda * abs(beta), pch = 20, xlab = "beta", ylab = "Lasso optimization")
beta.est <- y - lambda / 2
points(beta.est, (y - beta.est)^2 + lambda * abs(beta.est), col = "red", pch = 4, lwd = 5)

Problem 8

In this exercise, we will generate simulated data, and will then use these data to perform best subset selection.

(a)

Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector error of length n = 100.

Solution (a)

set.seed(1)
x<-rnorm(100)
error<-rnorm(100)

(b)

Generate a response vector Y of length n = 100 according to the model \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + error\), where \(\beta_0, \beta_1, \beta_2\), and \(\beta_3\) are constants of your choice.

Solution (b)

b0<-2
b1<-3
b2<-(-4)
b3<-0.5

y<-b0+b1*x+b2*x^2+b3*x^3+error

(c)

Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2, \ldots, X^{10}\). What is the best model obtained according to \(C_p\), BIC, and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.

Solution (c)

#install.packages("leaps")
library(leaps)
data.full<-data.frame(y=y,x=x)
regfit.full<-regsubsets(y~x+I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10,really.big = T)
reg.summary<-summary(regfit.full)
par(mfrow=c(1,3))

#plot C_p for each number of variables; the smallest C_p identifies the best model
plot(reg.summary$cp,xlab="Number of Variables",ylab="C_p",type="l")
points(which.min(reg.summary$cp),reg.summary$cp[which.min(reg.summary$cp)],col="blue",cex=2,pch=20)

#plot model BIC value
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary$bic),reg.summary$bic[which.min(reg.summary$bic)],col="blue",cex=2,pch=20)

#plot adjusted R^2 for each number of variables; the largest adjusted R^2 identifies the best model
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adj R^2",type="l")
points(which.max(reg.summary$adjr2),reg.summary$adjr2[which.max(reg.summary$adjr2)],col="blue",cex=2,pch=20)

#Model
coef(regfit.full,which.max(reg.summary$adjr2))
## (Intercept)           x      I(x^2)      I(x^5) 
##  2.07219472  3.44514720 -4.15676236  0.09022577

The plots show that best subset selection picks a model with 3 predictor variables whether \(C_p\), BIC, or adjusted \(R^2\) is used as the selection criterion.

The selected model uses X, \(X^2\), and \(X^5\) as predictors.

The true relationship, by contrast, was generated from X, \(X^2\), and \(X^3\), so the selected model substitutes \(X^5\) for \(X^3\).

(d)

Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

Solution (d)

Forward Selection Approach

library(leaps)
data.full<-data.frame(y=y,x=x)
regfit.fwd<-regsubsets(y~x+I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10,really.big = T,method = "forward")
reg.summary.fwd<-summary(regfit.fwd)
par(mfrow=c(1,3))

#plot C_p for each number of variables; the smallest C_p identifies the best model
plot(reg.summary.fwd$cp,xlab="Number of Variables",ylab="C_p",type="l")
points(which.min(reg.summary.fwd$cp),reg.summary.fwd$cp[which.min(reg.summary.fwd$cp)],col="blue",cex=2,pch=20)

#plot model BIC value
plot(reg.summary.fwd$bic,xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary.fwd$bic),reg.summary.fwd$bic[which.min(reg.summary.fwd$bic)],col="blue",cex=2,pch=20)

#plot adjusted R^2 for each number of variables; the largest adjusted R^2 identifies the best model
plot(reg.summary.fwd$adjr2,xlab="Number of Variables",ylab="Adj R^2",type="l")
points(which.max(reg.summary.fwd$adjr2),reg.summary.fwd$adjr2[which.max(reg.summary.fwd$adjr2)],col="blue",cex=2,pch=20)

#Model
coef(regfit.fwd,which.max(reg.summary.fwd$adjr2))
## (Intercept)           x      I(x^2)      I(x^5) 
##  2.07219472  3.44514720 -4.15676236  0.09022577

Forward stepwise selection also picks the 3-predictor model.

Here too the selected predictors are X, \(X^2\), and \(X^5\).

Backward Selection Approach

library(leaps)
data.full<-data.frame(y=y,x=x)
regfit.bwd<-regsubsets(y~x+I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10,really.big = T,method = "backward")
reg.summary.bwd<-summary(regfit.bwd)
par(mfrow=c(1,3))

#plot C_p for each number of variables; the smallest C_p identifies the best model
plot(reg.summary.bwd$cp,xlab="Number of Variables",ylab="C_p",type="l")
points(which.min(reg.summary.bwd$cp),reg.summary.bwd$cp[which.min(reg.summary.bwd$cp)],col="blue",cex=2,pch=20)

#plot model BIC value
plot(reg.summary.bwd$bic,xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary.bwd$bic),reg.summary.bwd$bic[which.min(reg.summary.bwd$bic)],col="blue",cex=2,pch=20)

#plot adjusted R^2 for each number of variables; the largest adjusted R^2 identifies the best model
plot(reg.summary.bwd$adjr2,xlab="Number of Variables",ylab="Adj R^2",type="l")
points(which.max(reg.summary.bwd$adjr2),reg.summary.bwd$adjr2[which.max(reg.summary.bwd$adjr2)],col="blue",cex=2,pch=20)

#Model
coef(regfit.bwd,which.max(reg.summary.bwd$adjr2))
## (Intercept)           x      I(x^2)      I(x^5) 
##  2.07219472  3.44514720 -4.15676236  0.09022577

Backward stepwise selection also picks the 3-predictor model with X, \(X^2\), and \(X^5\), so all three approaches agree on this data set.

(e)

Now fit a lasso model to the simulated data, again using \(X, X^2, \ldots, X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.

Solution (e)

library(glmnet)
set.seed(1)
xmat<-model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[,-1]
set.seed(1)
cv.lasso<-cv.glmnet(xmat,y,alpha=1)
plot(cv.lasso)

bestlam<-cv.lasso$lambda.min
bestlam
## [1] 0.02684502
fit.lasso<-glmnet(xmat,y,alpha=1)
predict(fit.lasso,s=bestlam,type="coefficients")[1:11,]
##   (Intercept)             x        I(x^2)        I(x^3)        I(x^4) 
##  2.050652e+00  3.301160e+00 -4.121223e+00  1.298972e-01  0.000000e+00 
##        I(x^5)        I(x^6)        I(x^7)        I(x^8)        I(x^9) 
##  6.621380e-02  0.000000e+00  7.160792e-05  0.000000e+00  0.000000e+00 
##       I(x^10) 
##  2.919318e-06

Cross-validation selects \(\lambda \approx 0.0268\) as the value minimizing the cross-validation MSE.

We use this value of \(\lambda\) to fit the lasso.

The resulting coefficient estimates are shown above: the lasso gives large coefficients to X and \(X^2\), small coefficients to \(X^3\), \(X^5\), \(X^7\), and \(X^{10}\), and shrinks the remaining coefficients exactly to zero.

(f)

Now generate a response vector Y according to the model \[Y = \beta_0 + \beta_7 X^7 + error\] and perform best subset selection and the lasso. Discuss the results obtained.

Solution (f)

Best Subset Method

b7<-7
y <- b0 + b7 * x^7 + error
data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)

coef(regfit.full,2)
## (Intercept)      I(x^2)      I(x^7) 
##   2.0704904  -0.1417084   7.0015552
coef(regfit.full,1)
## (Intercept)      I(x^7) 
##     1.95894     7.00077
coef(regfit.full,4)
## (Intercept)           x      I(x^2)      I(x^3)      I(x^7) 
##   2.0762524   0.2914016  -0.1617671  -0.2526527   7.0091338

We see that \(C_p\) chooses the 2-variable model, BIC chooses the 1-variable model, and adjusted \(R^2\) chooses the 4-variable model.

Lasso Model

xmat <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[, -1]
set.seed(1)
cv.lasso <- cv.glmnet(xmat, y, alpha = 1)
bestlam <- cv.lasso$lambda.min
bestlam
## [1] 13.57478
fit.lasso <- glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = bestlam, type = "coefficients")[1:11, ]
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##    2.904188    0.000000    0.000000    0.000000    0.000000    0.000000 
##      I(x^6)      I(x^7)      I(x^8)      I(x^9)     I(x^10) 
##    0.000000    6.776797    0.000000    0.000000    0.000000

Here the lasso also chooses a one-variable model, containing only \(X^7\).

Thus the lasso (like BIC under best subset selection) recovers a model very close to the true model used to simulate y, whereas \(C_p\) and adjusted \(R^2\) favor models with extra, spurious terms.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a)

Split the data set into a training set and a test set.

Solution (a)

library(ISLR)
attach(College)
set.seed(11)
#Randomly splitting the data into training and test sets in a 7:3 ratio
subset<-sample(nrow(College),nrow(College)*0.7)
train<-College[subset,]
test<-College[-subset,]

(b)

Fit a linear model using least squares on the training set, and report the test error obtained.

Solution (b)

ls.full<-lm(Apps~.,data=train)
summary(ls.full)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5023.2  -408.6   -46.6   336.7  7242.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -140.27742  518.76903  -0.270 0.786955    
## PrivateYes  -511.17144  176.18600  -2.901 0.003872 ** 
## Accept         1.63646    0.05066  32.302  < 2e-16 ***
## Enroll        -1.08353    0.24583  -4.408 1.27e-05 ***
## Top10perc     58.88924    7.33783   8.025 6.66e-15 ***
## Top25perc    -21.37240    6.10623  -3.500 0.000505 ***
## F.Undergrad    0.07107    0.04330   1.641 0.101333    
## P.Undergrad    0.05855    0.03783   1.548 0.122318    
## Outstate      -0.08871    0.02447  -3.625 0.000317 ***
## Room.Board     0.14479    0.06418   2.256 0.024493 *  
## Books         -0.19750    0.29231  -0.676 0.499547    
## Personal       0.03265    0.07920   0.412 0.680391    
## PhD           -8.78197    6.16418  -1.425 0.154844    
## Terminal      -4.79463    6.69874  -0.716 0.474463    
## S.F.Ratio     17.93368   16.68176   1.075 0.282848    
## perc.alumni    2.29991    5.51102   0.417 0.676608    
## Expend         0.07830    0.01574   4.974 8.92e-07 ***
## Grad.Rate      9.14788    3.83929   2.383 0.017541 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1113 on 525 degrees of freedom
## Multiple R-squared:  0.9308, Adjusted R-squared:  0.9286 
## F-statistic: 415.4 on 17 and 525 DF,  p-value: < 2.2e-16
predicted.apps<-predict(ls.full,test)
testerror<-mean((test$Apps-predicted.apps)^2)
testerror
## [1] 769127.1

The mean squared error on the test set is about \(7.691271 \times 10^{5}\).

(c)

Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

Solution (c)

#create matrix for training set and test set

train.mat<-model.matrix(Apps~.,data=train)
test.mat<-model.matrix(Apps~.,data=test)

#defining a grid covering a wide range of lambda; this will be used to find the best value of lambda

grid<-10^seq(4,-2,length=100)

#fitting the ridge regression model
library(glmnet)
ridge<-glmnet(train.mat,train$Apps,alpha=0,lambda=grid,thresh = 1e-12)

#doing cross validation on model

cv.ridge<-cv.glmnet(train.mat,train$Apps,alpha=0,lambda=grid,thresh=1e-12)

#finding the lambda for which cv error is minimum on training data
bestlam.ridge<-cv.ridge$lambda.min
bestlam.ridge
## [1] 0.01
#using the lambda value obtained from cross validation for the ridge model directly on test data set to get the predicted values

pred.newridge<-predict(ridge,s=bestlam.ridge,newx =test.mat)

#Mean Square Error calculation
mean((test$Apps-pred.newridge)^2)
## [1] 769103.1

The test mean squared error for the ridge regression model is about \(7.691031 \times 10^{5}\).

This is slightly smaller than that of the least squares model.

(d)

Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

Solution (d)

lasso<-glmnet(train.mat,train$Apps,alpha=1,lambda=grid,thresh = 1e-12)

#doing cross validation on model

cv.lasso<-cv.glmnet(train.mat,train$Apps,alpha=1,lambda=grid,thresh=1e-12)

#finding the lambda for which cv error is minimum on training data
bestlam.lasso<-cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
#using the lambda value obtained from cross validation for the lasso model directly on test data set to get the predicted values

pred.newlasso<-predict(lasso,s=bestlam.lasso,newx =test.mat)

#Mean Square Error calculation
mean((test$Apps-pred.newlasso)^2)
## [1] 769058.3
#Non zero Coefficienct estimates

predict(lasso,s=bestlam.lasso,type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -326.56814587
## (Intercept)    .         
## PrivateYes  -500.65803166
## Accept         1.54767937
## Enroll        -0.46674168
## Top10perc     47.34473048
## Top25perc    -12.45525428
## F.Undergrad    .         
## P.Undergrad    0.04640836
## Outstate      -0.06891103
## Room.Board     0.12707018
## Books         -0.05928521
## Personal       0.01197316
## PhD           -7.03010228
## Terminal      -4.56073825
## S.F.Ratio     12.44794269
## perc.alumni    .         
## Expend         0.07348693
## Grad.Rate      6.60642007

The test mean squared error for the lasso model is about \(7.690583 \times 10^{5}\).

This is slightly lower than the ridge regression error. In the coefficient estimates above, the lasso sets a few coefficients (F.Undergrad and perc.alumni) exactly to zero and keeps the rest non-zero.

(e)

Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

Solution (e)

library(pls)
pcrmodel<-pcr(Apps~.,data=train,scale=TRUE,validation="CV")

#plotting Mean square error (MSEP) for different number of components
validationplot(pcrmodel,val.type="MSEP")

predict.pcr<-predict(pcrmodel,test,ncomp=17)
mean((test$Apps-predict.pcr)^2)
## [1] 769127.1

We see that the cross-validation error is minimized at M = 17. Using all 17 components gives a test MSE of about \(7.691271 \times 10^{5}\).

This matches the least squares result, as expected: with M = 17 PCR uses all of the components and is equivalent to ordinary least squares.

If we use a smaller value of M, say 5 or 7, the test MSE is higher.
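As a quick check (this snippet is illustrative and its output is not part of the original report; M = 5 is an arbitrary smaller choice), the test MSE for fewer components can be computed in the same way and is expected to be noticeably higher:

predict.pcr5 <- predict(pcrmodel, test, ncomp = 5)
mean((test$Apps - predict.pcr5)^2)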

(f)

Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

Solution (f)

plsrmodel<-plsr(Apps~.,data=train,scale=TRUE,validation="CV")

#plotting Mean square error (MSEP) for different number of components
validationplot(plsrmodel,val.type="MSEP")

predict.plsr<-predict(plsrmodel,test,ncomp=10)
mean((test$Apps-predict.plsr)^2)
## [1] 775233.6

We see that the cross-validation error is minimized around M = 10. Using 10 components gives a test MSE of about \(7.752336 \times 10^{5}\).

This is somewhat higher than the test MSE obtained with least squares.

(g)

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Solution (g)

We use the test \(R^2\) statistic to assess how accurately each approach predicts the number of college applications received.

A higher test \(R^2\) indicates higher prediction accuracy.
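Concretely, the quantity computed in the code below is

\[R^2_{test} = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2},\]

where \(\bar{y}\) is the mean of Apps on the test set.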

#Least Square model
test.avg <- mean(test$Apps)
lm.r2 <- 1 - mean((predicted.apps - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#Ridge model
ridge.r2 <- 1 - mean((pred.newridge - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#Lasso model
lasso.r2 <- 1 - mean((pred.newlasso - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#PCR model
pcr.r2 <- 1 - mean((predict.pcr - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#PLS model
pls.r2 <- 1 - mean((predict.plsr - test$Apps)^2) / mean((test.avg - test$Apps)^2)

Least Square Test R-Square: 0.918807

Ridge Model Test R-Square: 0.9188096

Lasso Model Test R-Square: 0.9188143

PCR Model Test R-Square: 0.918807

PLS Model Test R-Square: 0.9181624

The lasso model gives the highest test \(R^2\), although all five models perform very similarly.

This is consistent with the lasso also having the lowest test MSE among the fitted models.

Problem 10

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

(a)

Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model \(Y = X\beta + error\), where \(\beta\) has some elements that are exactly equal to zero

Solution (a)

set.seed(1)
#Create a matrix of 1000*20
x<-matrix(rnorm(1000*20),1000,20)
#Create a vector of betas and set several of them exactly to zero
b<-rnorm(20)
b[1]<-0
b[3]<-0
b[5]<-0
b[7]<-0
b[9]<-0
#generating error term
error<-rnorm(1000)
#multiplying x and beta matrix to get y
y<-x%*%b+error

(b)

Split your data set into a training set containing 100 observations and a test set containing 900 observations.

Solution (b)

#Create a vector train of length 100 and assign them random index number chosen from 1 to 1000
train<-sample(seq(1000),100,replace=FALSE)
#create a vector test which contains rest of the indices 
test<- (-train)
#subset x matrix for the given sequence in train and test
x.train<-x[train,]
x.test<-x[test,]
y.train<-y[train]
y.test<-y[test]

(c)

Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

Solution (c)

library(leaps)
data.train <- data.frame(y = y.train, x = x.train)
regfit.full <- regsubsets(y ~ ., data = data.train, nvmax = 20)
train.mat <- model.matrix(y ~ ., data = data.train, nvmax = 20)
val.errors <- rep(NA, 20)
for (i in 1:20) {
    coefi <- coef(regfit.full, id = i)
    pred <- train.mat[, names(coefi)] %*% coefi
    val.errors[i] <- mean((pred - y.train)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Training MSE", pch = 19, type = "b",col="orange")

(d)

Plot the test set MSE associated with the best model of each size.

Solution (d)

data.test <- data.frame(y = y.test, x = x.test)
test.mat <- model.matrix(y ~ ., data = data.test, nvmax = 20)
val.errors <- rep(NA, 20)
for (i in 1:20) {
    coefi <- coef(regfit.full, id = i)
    pred <- test.mat[, names(coefi)] %*% coefi
    val.errors[i] <- mean((pred - y.test)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")

(e)

For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

Solution (e)

which.min(val.errors)
## [1] 14

The test set MSE is minimized by the model containing 14 variables.

(f)

How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.

Solution (f)

coef(regfit.full,which.min(val.errors))
## (Intercept)         x.2         x.4         x.8        x.10        x.11 
##   0.1686140   0.3310800  -1.9043844   0.8537035   0.7885128   0.7023013 
##        x.12        x.13        x.14        x.15        x.16        x.17 
##   0.5486439  -0.3692904  -0.8258862  -0.5603339  -0.2197001   0.3084519 
##        x.18        x.19        x.20 
##   1.5824914   0.9099814  -0.7816566

In the true model we set \(\beta_1, \beta_3, \beta_5, \beta_7,\) and \(\beta_9\) exactly to zero.

The best subset model that minimizes the test MSE drops all of those predictors (x.1, x.3, x.5, x.7, x.9), along with x.6, so it has largely recovered the sparsity pattern of the true model.
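As a small check (illustrative; its output is not shown in the original), the indices of the true zero coefficients can be listed directly and compared with the variables excluded above:

paste0("x.", which(b == 0))  # predictors whose true coefficients were set to zero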

(g)

Create a plot displaying \[\sqrt{\sum_{j=1}^{p} \big(\beta_{j} - \hat{\beta}_{j}^{r}\big)^2}\] for a range of values of r, where \(\hat{\beta}_{j}^{r}\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?

Solution (g)

val.errors <- rep(NA, 20)
x_cols <- colnames(x, do.NULL = FALSE, prefix = "x.")
for (i in 1:20) {
    coefi <- coef(regfit.full, id = i)
    # squared differences for included predictors, plus the squared true
    # coefficients of excluded predictors (whose estimates are implicitly 0)
    val.errors[i] <- sqrt(sum((b[x_cols %in% names(coefi)] - coefi[names(coefi) %in% x_cols])^2) + sum(b[!(x_cols %in% names(coefi))]^2))
}
plot(val.errors, xlab = "Number of Predictors", ylab = "Error between estimated and true coefficients", pch = 19, type = "b", col = "orange")

The coefficient error is smallest for the 3-variable model in this run, and it drops again for models with 14 to 20 variables.

The test MSE, however, is minimized by the 14-variable model. Thus the model whose coefficient estimates are closest to the true coefficients need not have the lowest test MSE, i.e. it need not be the best model for prediction.

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

(a)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Solution (a)

Exploring Data

library(MASS)
attach(Boston)
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
#Splitting Data into training and test data 
set.seed(10743959)
subset<-sample(nrow(Boston),nrow(Boston)*0.70)
boston.train<-Boston[subset,]
boston.test<-Boston[-subset,]

Following is the list of the different approaches tried for variable selection and model building:

Full Model

Best Subset

Forward Selection

Backward Selection

Stepwise Selection

Ridge Regression

Lasso Regression

Principal Component Regression (PCR)

Partial Least Square (PLS)

Full Model

#Simple least squares full model

slr.full<-lm(medv~.,data=boston.train)
slr.predict<-predict(slr.full,boston.test)
slr.MSE<-mean((slr.predict-boston.test$medv)^2)
slr.MSE
## [1] 17.96323

Best Subset

#Best Subset Model 
library(leaps)
bsm<-regsubsets(medv~.,data=boston.train,nbest=1,nvmax=13)
summary(bsm)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = boston.train, nbest = 1, 
##     nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  "*" " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 7  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 8  ( 1 )  " "  "*" " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 9  ( 1 )  " "  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 10  ( 1 ) " "  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
#Out of the 13 models, the best is chosen by minimum MSE on the test set

boston.test.mat <- model.matrix(medv~ ., data = boston.test, nvmax = 13)
val.errors <- rep(NA, 13)
for (i in 1:13) {
    coefi <- coef(bsm, id = i)
    pred <- boston.test.mat[, names(coefi)] %*% coefi
    val.errors[i] <- mean((pred - boston.test$medv)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")

which.min(val.errors)
## [1] 11
coef(bsm,which.min(val.errors))
##   (Intercept)          crim            zn          chas           nox 
##  42.504262409  -0.051977242   0.058240750   2.579296005 -16.716321472 
##            rm           dis           rad           tax       ptratio 
##   3.017526315  -1.598507331   0.330049548  -0.013725911  -0.904781765 
##         black         lstat 
##   0.008099312  -0.611990159
bsm.MSE<-val.errors[11]
bsm.MSE
## [1] 17.81448

Best subset selection picks one model for each size (the best 1-predictor model, the best 2-predictor model, and so on) based on training RSS.

To choose among these 13 candidate models, the test set MSE is computed for each; the model with the smallest test MSE is chosen as the final model.

The final model contains 11 predictor variables.

Forward Selection

#forward selection

nullmodel<-lm(medv ~ 1, data = boston.train)
fullmodel<-lm(medv ~ ., data = boston.train)

fwd.model<-step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), 
    direction = "forward")
#Forward model coefficients
summary(fwd.model)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + zn + chas + 
##     black + rad + tax, data = boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5511  -2.7981  -0.5087   1.6696  24.1049 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.553676   6.098471   6.814 4.28e-11 ***
## lstat        -0.620575   0.058679 -10.576  < 2e-16 ***
## rm            3.068242   0.484744   6.330 7.69e-10 ***
## ptratio      -0.899612   0.160989  -5.588 4.69e-08 ***
## dis          -1.568736   0.229077  -6.848 3.47e-11 ***
## nox         -16.280488   4.309207  -3.778 0.000186 ***
## zn            0.056446   0.016975   3.325 0.000979 ***
## chas          2.628518   1.055356   2.491 0.013224 *  
## black         0.008998   0.003350   2.686 0.007590 ** 
## rad           0.305467   0.083179   3.672 0.000279 ***
## tax          -0.013603   0.004534  -3.000 0.002898 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.021 on 343 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:  0.7138 
## F-statistic: 89.05 on 10 and 343 DF,  p-value: < 2.2e-16
#Forward test MSE
fwd.predict<-predict(fwd.model,boston.test)
fwd.mse<-mean((fwd.predict-boston.test$medv)^2)
fwd.mse
## [1] 18.84049

Backward Selection

#backward selection

bwd.model<-step(fullmodel,direction = "backward")
summary(bwd.model)
## 
## Call:
## lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax + 
##     ptratio + black + lstat, data = boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5511  -2.7981  -0.5087   1.6696  24.1049 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.553676   6.098471   6.814 4.28e-11 ***
## zn            0.056446   0.016975   3.325 0.000979 ***
## chas          2.628518   1.055356   2.491 0.013224 *  
## nox         -16.280488   4.309207  -3.778 0.000186 ***
## rm            3.068242   0.484744   6.330 7.69e-10 ***
## dis          -1.568736   0.229077  -6.848 3.47e-11 ***
## rad           0.305467   0.083179   3.672 0.000279 ***
## tax          -0.013603   0.004534  -3.000 0.002898 ** 
## ptratio      -0.899612   0.160989  -5.588 4.69e-08 ***
## black         0.008998   0.003350   2.686 0.007590 ** 
## lstat        -0.620575   0.058679 -10.576  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.021 on 343 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:  0.7138 
## F-statistic: 89.05 on 10 and 343 DF,  p-value: < 2.2e-16
#Backward test MSE
bwd.predict<-predict(bwd.model,boston.test)
bwd.mse<-mean((bwd.predict-boston.test$medv)^2)
bwd.mse
## [1] 18.84049

Stepwise Selection

#stepwise selection

step.model<-step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel), 
    direction = "both")
summary(step.model)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + zn + chas + 
##     black + rad + tax, data = boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5511  -2.7981  -0.5087   1.6696  24.1049 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.553676   6.098471   6.814 4.28e-11 ***
## lstat        -0.620575   0.058679 -10.576  < 2e-16 ***
## rm            3.068242   0.484744   6.330 7.69e-10 ***
## ptratio      -0.899612   0.160989  -5.588 4.69e-08 ***
## dis          -1.568736   0.229077  -6.848 3.47e-11 ***
## nox         -16.280488   4.309207  -3.778 0.000186 ***
## zn            0.056446   0.016975   3.325 0.000979 ***
## chas          2.628518   1.055356   2.491 0.013224 *  
## black         0.008998   0.003350   2.686 0.007590 ** 
## rad           0.305467   0.083179   3.672 0.000279 ***
## tax          -0.013603   0.004534  -3.000 0.002898 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.021 on 343 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:  0.7138 
## F-statistic: 89.05 on 10 and 343 DF,  p-value: < 2.2e-16
#Stepwise test MSE
step.predict<-predict(step.model,boston.test)
step.mse<-mean((step.predict-boston.test$medv)^2)
step.mse
## [1] 18.84049

Ridge Regression

#Ridge Regression

#creating matrix for Ridge Regression
bostontrain.mat<-model.matrix(medv~.,data=boston.train)
bostontest.mat<-model.matrix(medv~.,data=boston.test)

#Defining grid that covers the wide range of lambda 
grid<-10^seq(4,-2,length=100)

#Fitting the ridge regression model
library(glmnet)
ridge.model<-glmnet(bostontrain.mat,boston.train$medv,alpha=0,lambda=grid)

#Finding lambda that gives minimum MSE by cross validation on training set
boston.cv.ridge<-cv.glmnet(bostontrain.mat,boston.train$medv,alpha=0,lambda=grid)

#Visualizing MSE vs Log(lambda)
plot(boston.cv.ridge)

#finding the lambda for which MSE is minimum on training data
boston.bestlam.ridge<-boston.cv.ridge$lambda.min
boston.bestlam.ridge
## [1] 0.1417474
#fitting the Ridge model with with the value of lambda obtained and finding the coefficients
ridge.model.1<-glmnet(bostontrain.mat,boston.train$medv,alpha=0,lambda=boston.bestlam.ridge)
coef(ridge.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  39.726176503
## (Intercept)   .          
## crim         -0.043885158
## zn            0.052687856
## indus        -0.005063095
## chas          2.692011518
## nox         -15.182853995
## rm            3.177687894
## age          -0.008481379
## dis          -1.529508653
## rad           0.269963547
## tax          -0.010966192
## ptratio      -0.881138929
## black         0.008225794
## lstat        -0.588333809
#using the lambda value obtained from cross validation for the ridge model directly on test data set to get the predicted values
boston.pred.newridge<-predict(ridge.model,s=boston.bestlam.ridge,newx=bostontest.mat)

#Mean Square Error calculation
ridge.MSE<-mean((boston.test$medv-boston.pred.newridge)^2)
ridge.MSE
## [1] 17.74425

We fit ridge regression over a grid of \(\lambda\) values and use cross-validation on the training set to estimate the MSE for each \(\lambda\). The \(\lambda\) with the smallest cross-validation MSE is used to build the final model and to predict on the test set.

That value is \(\lambda\) = 0.1417474.

Lasso Regression

#Lasso Regression
lasso.model<-glmnet(bostontrain.mat,boston.train$medv,alpha=1,lambda=grid)  #alpha=1 fits the lasso (alpha=0 would be ridge)

#finding lambda by cross validation and then finding MSE on test set using that lambda
boston.cv.lasso<-cv.glmnet(bostontrain.mat,boston.train$medv,alpha=1,lambda=grid)

#Visualizing MSE vs Log(lambda)
plot(boston.cv.lasso)

#finding the lambda for which cv error is minimum on training data
boston.bestlam.lasso<-boston.cv.lasso$lambda.min
boston.bestlam.lasso
## [1] 0.1232847
#fitting the Lasso model with with the value of lambda obtained and finding the coefficients
lasso.model.1<-glmnet(bostontrain.mat,boston.train$medv,alpha=1,lambda=boston.bestlam.lasso)
coef(lasso.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  40.033394715
## (Intercept)   .          
## crim         -0.044466644
## zn            0.053252258
## indus        -0.002505365
## chas          2.680304197
## nox         -15.340063745
## rm            3.166768913
## age          -0.008441933
## dis          -1.541098565
## rad           0.276257251
## tax          -0.011281348
## ptratio      -0.883796393
## black         0.008225935
## lstat        -0.590324265
#using the lambda value obtained from cross validation for the lasso model directly on test data set to get the predicted values
boston.pred.newlasso<-predict(lasso.model,s=boston.bestlam.lasso,newx=bostontest.mat)

#Mean Square Error calculation
lasso.MSE<-mean((boston.test$medv-boston.pred.newlasso)^2)
lasso.MSE
## [1] 17.76477

We fit the lasso over a grid of \(\lambda\) values and use cross-validation on the training set to estimate the MSE for each \(\lambda\). The \(\lambda\) with the smallest cross-validation MSE is used to build the final model and to predict on the test set.

That value is \(\lambda\) = 0.1232847.

Principal Component Regression (PCR)

#Principal Component Regression
library(pls)
pcr.model<-pcr(medv~.,data=boston.train,scale=TRUE,validation="CV")
#13 component summary
summary(pcr.model)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.398    7.599    7.437    5.937    5.957    5.576    5.598
## adjCV        9.398    7.597    7.434    5.932    5.952    5.567    5.591
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       5.575    5.477    5.447     5.428     5.452     5.253     5.186
## adjCV    5.565    5.467    5.444     5.418     5.445     5.240     5.173
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       47.68    59.12    68.52    75.68    82.00    86.70    90.14
## medv    34.95    39.63    60.77    60.94    65.86    65.96    66.42
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       93.15    95.05     96.84     98.28     99.58    100.00
## medv    67.91    68.10     69.03     69.15     71.49     72.29
#plotting Mean square error (MSEP) for different number of components
validationplot(pcr.model,val.type="MSEP")

#using 5 components to build the final model 
pcr.model.5comp<-pcr(medv~.,data=boston.train,scale=TRUE,ncomp=5)
summary(pcr.model.5comp)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       47.68    59.12    68.52    75.68    82.00
## medv    34.95    39.63    60.77    60.94    65.86
#Regression coefficients on the original predictors implied by models with 1 to 5 components
pcr.model.5comp$coefficients
## , , 1 comps
## 
##                  medv
## crim    -0.6063915501
## zn       0.5653092042
## indus   -0.7708923694
## chas    -0.0006292165
## nox     -0.7499003656
## rm       0.4304083600
## age     -0.6915127075
## dis      0.7132887312
## rad     -0.7111004067
## tax     -0.7475540656
## ptratio -0.4139434089
## black    0.4877422872
## lstat   -0.6904890541
## 
## , , 2 comps
## 
##                medv
## crim    -1.16076039
## zn       0.04235492
## indus   -0.56193240
## chas     0.63284060
## nox     -0.34381707
## rm       0.58457968
## age     -0.17407242
## dis      0.12326631
## rad     -1.17923658
## tax     -1.15681336
## ptratio -0.97561493
## black    0.97449535
## lstat   -0.73255090
## 
## , , 3 comps
## 
##                medv
## crim    -0.40885303
## zn       1.14506849
## indus   -0.57361682
## chas     2.11629628
## nox      0.07154186
## rm       2.92971918
## age     -0.12664889
## dis     -0.12661619
## rad     -0.06747743
## tax     -0.27611708
## ptratio -2.13668603
## black    0.02081913
## lstat   -1.84931819
## 
## , , 4 comps
## 
##                 medv
## crim    -0.446493901
## zn       1.082229500
## indus   -0.560066308
## chas     2.339542181
## nox      0.002820896
## rm       2.929573203
## age     -0.173545861
## dis     -0.095552865
## rad      0.012642142
## tax     -0.218882977
## ptratio -1.892652735
## black    0.159432982
## lstat   -1.922300636
## 
## , , 5 comps
## 
##                medv
## crim    -0.79069410
## zn       0.31377021
## indus   -0.53227588
## chas     0.97980565
## nox     -0.07258062
## rm       4.11921228
## age      0.08861939
## dis     -0.53232412
## rad      0.24973101
## tax     -0.07788551
## ptratio -1.33947559
## black    0.49875496
## lstat   -2.65801809
#Using the PCR model with 5 components (explaining about 82% of the variance in X) to predict on the test data
boston.predict.pcr<-predict(pcr.model,boston.test,ncomp=5)
pcr.MSE<-mean((boston.test$medv-boston.predict.pcr)^2)
pcr.MSE
## [1] 15.96318

We perform principal component regression and examine the cross-validation summary for up to 13 components.

The plot of MSE versus the number of components shows that the cross-validation MSE keeps decreasing all the way to 13 components, which means PCR is not providing much dimension reduction from p + 1 to M + 1 here.

We nonetheless select M = 5 components for the final model, since 5 components already explain about 82% of the variance in the predictors.

The regression coefficients implied by each number of components are also printed above; the 5-component coefficients are what the final model uses for prediction.

Partial Least Square (PLS)

Similar steps are followed as above for the partial least squares method.

library(pls)
#Partial Least Square
plsr.model<-plsr(medv~.,data=boston.train,scale=TRUE,validation="CV")

#plotting Mean square error (MSEP) for different number of components
validationplot(plsr.model,val.type="MSEP")

#using 5 components to build the final model 
plsr.model.5comp<-plsr(medv~.,data=boston.train,scale=TRUE,ncomp=5)
summary(plsr.model.5comp)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       46.42    57.12    64.14    70.68    76.34
## medv    46.90    67.90    70.21    71.38    71.87
#Regression coefficients on the original predictors implied by models with 1 to 5 components
plsr.model.5comp$coefficients
## , , 1 comps
## 
##               medv
## crim    -0.6629002
## zn       0.6346088
## indus   -0.8125719
## chas     0.3234698
## nox     -0.7289083
## rm       1.1522209
## age     -0.6400400
## dis      0.4186334
## rad     -0.5963464
## tax     -0.7481862
## ptratio -0.8083894
## black    0.5970368
## lstat   -1.2711419
## 
## , , 2 comps
## 
##                medv
## crim    -0.42482542
## zn       0.52607570
## indus   -0.47112526
## chas     1.40382834
## nox     -0.24208308
## rm       3.48196710
## age     -0.04543861
## dis     -1.18593277
## rad      0.35129618
## tax     -0.25239724
## ptratio -1.93617630
## black    0.69452051
## lstat   -3.03566596
## 
## , , 3 comps
## 
##               medv
## crim     0.0162841
## zn       0.5614550
## indus   -0.4975755
## chas     0.5868014
## nox     -0.9170569
## rm       3.3403655
## age     -0.3678324
## dis     -1.9690256
## rad      0.9934994
## tax     -0.1655670
## ptratio -1.6703181
## black    0.8684079
## lstat   -4.0099664
## 
## , , 4 comps
## 
##               medv
## crim    -0.0137188
## zn       0.5497794
## indus   -0.3653050
## chas     0.5499696
## nox     -1.3785074
## rm       2.5871245
## age     -0.3460459
## dis     -3.0540460
## rad      1.1149044
## tax     -0.6409048
## ptratio -1.9283887
## black    1.1417590
## lstat   -4.5137243
## 
## , , 5 comps
## 
##                medv
## crim     0.04611646
## zn       1.10369104
## indus   -0.31643319
## chas     0.79412630
## nox     -1.76929120
## rm       2.19271829
## age     -0.40901868
## dis     -3.49536648
## rad      1.46882099
## tax     -0.87414555
## ptratio -2.08221751
## black    0.81717250
## lstat   -4.63089016
boston.predict.plsr<-predict(plsr.model,boston.test,ncomp=5)
plsr.MSE<-mean((boston.test$medv-boston.predict.plsr)^2)
plsr.MSE
## [1] 19.42082

(b)

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Solution (b)

Validation set MSE is used to evaluate model performance.

Below are the different approaches with the MSE obtained on the validation data set:

Full Model: 17.96

Best Subset: 17.81

Forward Selection: 18.84

Backward Selection: 18.84

Stepwise Selection: 18.84

Ridge Regression: 17.74

Lasso Regression: 17.76

Principal Component Regression (PCR): 15.96

Partial Least Square (PLS): 19.42

The smallest validation MSE is obtained by the principal component regression model with 5 principal components.

Interestingly, the validation MSE is identical for forward, backward, and stepwise selection, since all three settle on the same 10-predictor model.

The second smallest validation MSE is obtained by ridge regression.

(c)

Does your chosen model involve all of the features in the data set? Why or why not?

Solution (c)

The PCR model uses 5 components, which are linear combinations of all 13 predictor variables; the contribution of each predictor to each component is given by the loadings.

The ridge regression model also retains all of the predictor variables, since ridge shrinks coefficients toward zero but does not set them exactly to zero.

This suggests that all of the predictors contribute something to predicting the response.