Question 6.6.1 (Difficulty Level - 2)

1. We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, ..., p predictors. Explain your answers:

(a) Which of the three models with k predictors has the smallest training RSS?

Ans: Best subset selection will have the smallest training RSS, since for each k it searches over every possible model with k predictors, whereas forward and backward stepwise only consider models nested within the ones chosen at earlier steps.
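
A quick way to see this empirically is a sketch with the leaps package on simulated data (the data set and the nvmax value below are illustrative assumptions, not part of the exercise):

library(leaps) # provides regsubsets() for best subset and stepwise selection
set.seed(1)

# illustrative data: 10 predictors, only the first two related to the response
x <- matrix(rnorm(200 * 10), 200, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(200)
sim <- data.frame(y, x)

rss_best     <- summary(regsubsets(y ~ ., sim, nvmax = 10, method = "exhaustive"))$rss
rss_forward  <- summary(regsubsets(y ~ ., sim, nvmax = 10, method = "forward"))$rss
rss_backward <- summary(regsubsets(y ~ ., sim, nvmax = 10, method = "backward"))$rss

# for every model size k, best subset's training RSS is <= the stepwise RSS
all(rss_best <= rss_forward & rss_best <= rss_backward)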

(b) Which of the three models with k predictors has the smallest test RSS?

Ans: It is difficult to say which model will have the smallest test RSS from the information given; it depends on the particular data set. Best subset selection searches many more models and can therefore overfit the training data, especially when the number of predictors is large relative to the number of observations, in which case a stepwise approach may happen to achieve a lower test RSS.

(c) True or False:

i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

Ans: True. Forward stepwise obtains the (k + 1)-variable model by adding one predictor to the k-variable model, so the models are nested.

ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by backward stepwise selection.

Ans: True. Backward stepwise obtains the k-variable model by deleting one predictor from the (k + 1)-variable model, so the models are nested.

iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

Ans: False. Forward and backward stepwise selection can choose entirely different sets of predictors, so there is no nesting guarantee between the two procedures.

iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

Ans: False. As in (iii), the forward and backward stepwise models need not be related, so nesting is not guaranteed.

v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

Ans: False. Best subset selection searches over all models of each size independently, so the best k-variable model need not be nested within the best (k + 1)-variable model.

Question 6.6.2 (Difficulty Level - 3)

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Ans: Option (iii) is correct. The lasso is less flexible than least squares: as λ increases, the coefficients are shrunk toward zero (some exactly to zero), which reduces variance at the cost of a small increase in bias. It therefore gives better prediction accuracy when the increase in bias is less than the decrease in variance, for example when least squares overfits.

(b) Repeat (a) for ridge regression relative to least squares.

Ans: Option (iii) is correct. Ridge regression is likewise less flexible than least squares: as the hyperparameter λ increases, the coefficients are shrunk toward zero, decreasing variance and increasing bias. Prediction accuracy improves when the increase in bias is less than the decrease in variance.
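
To visualise this reduced flexibility, a small sketch (illustrative only; the simulated data below are an assumption, not part of the exercise) plots how the ridge and lasso coefficient estimates shrink toward zero as λ increases:

library(glmnet)
set.seed(1)

# illustrative data: 20 predictors, only the first three related to the response
x <- matrix(rnorm(100 * 20), 100, 20)
y <- x[, 1] + 0.5 * x[, 2] - x[, 3] + rnorm(100)

ridge_fit <- glmnet(x, y, alpha = 0) # ridge regression
lasso_fit <- glmnet(x, y, alpha = 1) # lasso

par(mfrow = c(1, 2))
plot(ridge_fit, xvar = "lambda") # ridge: coefficients shrink smoothly but stay non-zero
plot(lasso_fit, xvar = "lambda") # lasso: coefficients are set exactly to zero one by one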

(c) Repeat (a) for non-linear methods relative to least squares.

Ans: Option (ii) is correct. Non-linear methods are more flexible than least squares, so they give improved prediction accuracy when the increase in variance is less than the decrease in bias, i.e. when the true relationship is substantially non-linear.

Question 6.6.9 (Difficulty Level - 5)

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR2)
dat = College 

set.seed(123)

q = sample(nrow(dat), nrow(dat)*0.8)
train = dat[q, ] #training data
test = dat[-q, ]  #test data

library(caret) # using caret package to scale the data
## Loading required package: ggplot2
## Loading required package: lattice
obj <- preProcess(train, method = c('center', 'scale'))

train <- predict(obj, train)
test <- predict(obj, test)

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

mod <- lm(Apps ~., data = train)
pred <- predict(mod, test, type = "response")
mse <- mean((test$Apps - pred)^2)
print(paste("Test error obtained is", mse))
## [1] "Test error obtained is 0.184773455487158"

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet) # glmnet performs ridge regression when alpha = 0
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Loaded glmnet 4.1-3
set.seed(456)

# Creating matrix for train and test data

train_matrix = model.matrix(Apps~., data = train)
test_matrix = model.matrix(Apps~., data = test)

# Choosing best lambda value using cross validation
cv_out = cv.glmnet(train_matrix,train$Apps,alpha=0)
best_lambda = cv_out$lambda.min


# fitting ridge regression model

mod_ridge = glmnet(train_matrix,train$Apps,alpha = 0, lambda=best_lambda)

pred_ridge = predict(mod_ridge, s = best_lambda , newx = test_matrix)


ridge_err = mean((pred_ridge - test$Apps)^2)

print(paste("Test error of ridge model is", ridge_err))
## [1] "Test error of ridge model is 0.347317435297609"

(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# Choosing best lambda value using cross validation
cv_out = cv.glmnet(train_matrix,train$Apps,alpha=1)
best_lambda1 = cv_out$lambda.min


# fitting lasso model

mod_lasso = glmnet(train_matrix,train$Apps,alpha = 1, lambda = best_lambda1)

pred_lasso = predict(mod_lasso, s = best_lambda1 , newx = test_matrix)


lasso_err = mean((pred_lasso - test$Apps)^2)

print(paste("Test error of ridge model is", lasso_err))
## [1] "Test error of ridge model is 0.197016667898902"
l = mod_lasso$beta

print(l[l[,1]!=0,]) # extracting non zero coefficients
##  PrivateYes      Accept   Top10perc   Top25perc F.Undergrad    Outstate 
## -0.16619664  0.77661067  0.18327122 -0.03273613  0.08078508 -0.04293302 
##  Room.Board         PhD    Terminal perc.alumni      Expend   Grad.Rate 
##  0.04563997 -0.02019748 -0.01968370 -0.02404018  0.11522247  0.04198098
print(paste("Total number of non zero coefficients are", length(l[l[,1] != 0])))
## <sparse>[ <logic> ] : .M.sub.i.logical() maybe inefficient
## [1] "Total number of non zero coefficients are 12"

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls) #required to perform PCR
## Warning: package 'pls' was built under R version 4.1.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(042)

mod_pcr <-  pcr(Apps ~., data = train , validation = "CV") #train data is already scaled

summary(mod_pcr)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.001   0.9537   0.4821   0.4835   0.3815   0.3797   0.3700
## adjCV        1.001   0.9544   0.4815   0.4843   0.3744   0.3751   0.3693
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.3696   0.3544   0.3429    0.3407    0.3437    0.3438    0.3440
## adjCV   0.3734   0.3551   0.3412    0.3401    0.3430    0.3431    0.3433
##        14 comps  15 comps  16 comps  17 comps
## CV       0.3444    0.3222    0.3020    0.3009
## adjCV    0.3438    0.3206    0.3011    0.2999
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.382    56.92    64.18    70.08    75.54    80.73    84.55    88.18
## Apps    9.917    77.23    77.49    86.20    86.22    86.92    86.94    88.23
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.48     94.05     96.02     97.51     98.41     98.96     99.41
## Apps    88.89     88.92     88.97     89.07     89.09     89.24     90.35
##       16 comps  17 comps
## X        99.80    100.00
## Apps     91.63     91.92
validationplot(mod_pcr, val.type = "MSEP")

print(paste("As per validation plot, it seems that with M value as 4, the error is reduced substantially. However, for values greater than 4, the reduction in error is very small with increasing number of principal components."))
## [1] "As per validation plot, it seems that with M value as 4, the error is reduced substantially. However, for values greater than 4, the reduction in error is very small with increasing number of principal components."
pred_pcr <- predict(mod_pcr, test, ncomp = 4)

pcr_err = mean((pred_pcr - test$Apps)^2)

print(paste("Test error of pcr model is", pcr_err))
## [1] "Test error of pcr model is 0.509892361751212"

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls) #required to perform PLS

set.seed(506)

mod_plsr <-  plsr(Apps ~., data = train , validation = "CV") #train data is already scaled

summary(mod_plsr)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.001   0.4442   0.3890   0.3361   0.3268   0.3187   0.3097
## adjCV        1.001   0.4429   0.3902   0.3355   0.3260   0.3176   0.3079
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.3042   0.3015   0.3016    0.3007    0.3010    0.3008    0.3005
## adjCV   0.3031   0.3006   0.3007    0.2998    0.3001    0.2999    0.2996
##        14 comps  15 comps  16 comps  17 comps
## CV       0.3007    0.3007    0.3007    0.3007
## adjCV    0.2997    0.2998    0.2998    0.2998
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X          26    51.33    62.55    66.21    69.90    74.00    77.55    80.80
## Apps       81    85.37    89.38    90.28    91.22    91.75    91.84    91.88
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.26     84.89     87.29     91.39     93.28     95.33     98.08
## Apps    91.89     91.91     91.92     91.92     91.92     91.92     91.92
##       16 comps  17 comps
## X        99.50    100.00
## Apps     91.92     91.92
validationplot(mod_plsr, val.type = "MSEP")

print(paste("As per validation plot, it seems that with M value as 3, the error is reduced substantially. However, for values greater than 3, the reduction in error is very small with increasing number of principal components"))
## [1] "As per validation plot, it seems that with M value as 3, the error is reduced substantially. However, for values greater than 3, the reduction in error is very small with increasing number of principal components"
pred_plsr <- predict(mod_plsr, test, ncomp = 3)

plsr_err = mean((pred_plsr - test$Apps)^2)

print(paste("Test error of plsr model is", plsr_err))
## [1] "Test error of plsr model is 0.430798992307428"

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

# R Squared for Ridge model
SSE_ridge <- sum((pred_ridge - test$Apps)^2)
SST <- sum((test$Apps - mean(test$Apps))^2)
R_squared_ridge <- 1 - SSE_ridge / SST

# R Squared for Lasso Model
SSE_lasso <- sum((pred_lasso - test$Apps)^2)
R_squared_lasso <- 1 - SSE_lasso / SST

# R Squared for PCR Model
SSE_pcr <- sum((pred_pcr - test$Apps)^2)
R_squared_pcr <- 1 - SSE_pcr / SST

# R Squared for PLS Model
SSE_plsr <- sum((pred_plsr - test$Apps)^2)
R_squared_pls <- 1 - SSE_plsr / SST


data.frame("model" = c("Linear", "Ridge", "Lasso", "PCR", "PLS"),
           "MSE" = c(mse, ridge_err,lasso_err, pcr_err, plsr_err),
           "R_Squared" = c(summary(mod)$r.squared,R_squared_ridge,R_squared_lasso, R_squared_pcr, R_squared_pls))
##    model       MSE R_Squared
## 1 Linear 0.1847735 0.9192219
## 2  Ridge 0.3473174 0.8632126
## 3  Lasso 0.1970167 0.9224070
## 4    PCR 0.5098924 0.7991841
## 5    PLS 0.4307990 0.8303342

Ans: Across the five approaches, the linear model and the lasso perform best, with very similar test MSE and R-squared values; with either of these, the number of applications can be predicted quite accurately (test R-squared of about 0.92). The PCR model (with M = 4) has the highest test error and the lowest proportion of variance explained. Comparing PCR and PLS, the cross-validation summaries show that PLS explains 89.38% of the variation in the response with only 3 components, whereas PCR needs 14 components to explain 89.24%; this is expected, since PLS constructs its components using the response. The ridge model performs better than the PLS model here; however, if the number of components in the PLS model were increased, it might perform better, as checked in the sketch below.
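
To verify that last point, a sketch (reusing mod_plsr and the scaled test set; the sweep over all 17 component counts is an assumption, not part of the original solution) computes the test MSE for every number of PLS components:

# test MSE of the PLS model for each possible number of components
plsr_err_by_m <- sapply(1:17, function(m) {
  mean((predict(mod_plsr, test, ncomp = m) - test$Apps)^2)
})
which.min(plsr_err_by_m) # component count with the lowest test MSE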