Load Libraries and attach datasets

library(ISLR2)
library(glmnet)
library(pls)
library(leaps)
library(tidyverse)
attach(College)
attach(Boston)





Problem 2.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Justify your answer:
Lasso chooses coefficients that minimizes RSS. Additionally, as the lamda increases, variance decreases at a higher rate than the bias increases. This results in improved accuracy with a small increase in bias compared to the corresponding decrease in variance. The larger shrinkage penalty results in the coefficients being close to zero, which increases shrinkage and reducing the prediction variance. So in this case, the bias-variance trade-off, may be more beneficial than least squares.


(b) Ridge regression relative to least squares, is:

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Justify your answer:
Ridge regression’s justification is very similar to the lasso justification, with the exception that the shrinkage penalty is smaller, meaning that the coefficients don’t get to zero as quickly or absolutely as they do in the lasso method. However, there is still a larger reduction in variance than there is an increase in bias, so once again the bias-variance trade-off may make this model a better fit than least squares.

(c) Non-linear methods relative to least squares, is:

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Justify your answer:
Non-linear methods make less assumptions because they do not assume linearity. As a result, this method is more flexible. When the relationship is in-fact non-linear, which then leads to a decrease in bias that will be more substantial than the increase in variance. For those reasons, once again, the bias-variance trade-off may make this model better than least squares.






Problem 9.

In this exercise, we will predict the number of applications received using the other variables in the College data set.


(a) Split the data set into a training set and a test set.

I split the data 75/25 Train/Test.


set.seed(22)
train_num = sample(nrow(College), (nrow(College)*.75))
train <- College[train_num, ]
test <- College[-train_num, ]


(b) Fit a linear model using least squares on the training set, and report the test error obtained.

The linear model MSE is 894367.7.


fit_lm <- lm(Apps ~ ., data = train)
pred_lm <- predict(fit_lm, test)
LS_MSE <- mean((pred_lm - test$Apps)^2)

LS_MSE
## [1] 894367.7


(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

The ridge regression MSE is 894344.5.


set.seed(22)

train_mat <- model.matrix(Apps ~ ., data = train)
test_mat <- model.matrix(Apps ~ ., data = test)
grid <- 10 ^ seq(10, -2, length = 100)

fit_ridge <- glmnet(train_mat, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv_ridge <- cv.glmnet(train_mat, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- cv_ridge$lambda.min
ridge_pred <- predict(fit_ridge, s = bestlam, newx = test_mat)
ridge_MSE <- mean((ridge_pred - test$Apps)^2)

ridge_MSE
## [1] 894344.5


(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

The lasso model MSE is 894286.5. This model used 18 coefficients, however several were near zero.


set.seed(22)

fit_lasso <- glmnet(train_mat, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv_lasso <- cv.glmnet(train_mat, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam <- cv_lasso$lambda.min
lasso_pred <- predict(fit_lasso, s = bestlam, newx = test_mat)
lasso_MSE <- mean((lasso_pred - test$Apps)^2)
lasso_coef <- predict(fit_lasso, type = 'coefficient', s = bestlam)

lasso_MSE
## [1] 894286.5
lasso_coef[lasso_coef!=0]
##  [1] -544.02484392 -505.69565764    1.61674471   -1.17471944   52.63154083
##  [6]  -19.04816319    0.10540607    0.03969968   -0.07265414    0.12794426
## [11]   -0.16297511    0.07056082   -9.10044055   -2.68997846   15.52020545
## [16]    1.64990949    0.09487090   10.02553687


(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

The PCR model MSE is 1550146 with an M value of 6.


set.seed(22)
fit_pcr = pcr(Apps~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pcr, val.type = 'MSEP')


pcr_pred <- predict(fit_pcr,test, ncomp = 6)
pcr_MSE <- mean((pcr_pred-test$Apps)^2)

pcr_MSE
## [1] 1550146


(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

The PLS-model MSE is 840735.5 with an M value of 8


set.seed(22)

fit_pls = plsr(Apps~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pls, val.type = 'MSEP')

pls_pred <- predict(fit_pls,test, ncomp = 8)
pls_MSE <- mean((pls_pred-test$Apps)^2)

pls_MSE
## [1] 840735.5


(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

There was not a significant difference in MSE between most of the methods, with the exception of PCR and PLS. PLS had the lowest MSE of 840735.5. The ridge and lasso methods also preformed slightly better than least square, but were worse than PLS. PCR performed significantly worse than all the other models, and only became close when so many predictors were included that it was essentially replicating the least squares method.


all_MSE = c(LS_MSE, ridge_MSE, lasso_MSE, pcr_MSE, pls_MSE)
names(all_MSE) = c("ls", "ridge", "lasso", "pcr", "pls")
all_MSE
##        ls     ridge     lasso       pcr       pls 
##  894367.7  894344.5  894286.5 1550146.4  840735.5
which.min(all_MSE)
## lasso 
##     3





Problem 11.

We will now try to predict per capita crime rate in the Boston data set.


(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

#Explore Data
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          lstat      
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   : 1.73  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.: 6.95  
##  Median : 5.000   Median :330.0   Median :19.05   Median :11.36  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :12.65  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:16.95  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00
#Split data into train and test
set.seed(22)
train_num = sample(nrow(Boston), (nrow(Boston)*.75))
train <- Boston[train_num, ]
test <- Boston[-train_num, ]


LS Model

fit_lm <- lm(crim ~ ., data = train)
pred_lm <- predict(fit_lm, test)
LS_MSE <- mean((pred_lm - test$crim)^2)

LS_MSE
## [1] 20.28627


Ridge Model

set.seed(22)

train_mat <- model.matrix(crim ~ ., data = train)
test_mat <- model.matrix(crim ~ ., data = test)
grid <- 10 ^ seq(10, -2, length = 100)

fit_ridge <- glmnet(train_mat, train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
cv_ridge <- cv.glmnet(train_mat, train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- cv_ridge$lambda.min

ridge_pred <- predict(fit_ridge, s = bestlam, newx = test_mat)
ridge_MSE <- mean((ridge_pred - test$crim)^2)

ridge_MSE
## [1] 20.00783


Lasso Model

set.seed(22)

fit_lasso <- glmnet(train_mat, train$crim, alpha = 1, lambda = grid, thresh = 1e-12)
cv_lasso <- cv.glmnet(train_mat, train$crim, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam <- cv_lasso$lambda.min
lasso_pred <- predict(fit_lasso, s = bestlam, newx = test_mat)
lasso_MSE <- mean((lasso_pred - test$crim)^2)

lasso_MSE
## [1] 19.78296


PCR Model

set.seed(22)
fit_pcr = pcr(crim~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pcr, val.type = 'MSEP')


pcr_pred <- predict(fit_pcr,test, ncomp = 7)
pcr_MSE <- mean((pcr_pred-test$crim)^2)

pcr_MSE
## [1] 22.40217


PLS Model

set.seed(22)

fit_pls = plsr(crim~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pls, val.type = 'MSEP')

pls_pred <- predict(fit_pls,test, ncomp = 3)
pls_MSE <- mean((pls_pred-test$crim)^2)

pls_MSE
## [1] 21.37494


(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

The three models that seem to perform well on this data set are LS (20.28 MSE), Ridge (20 MSE), and Lasso (19.78 MSE), with Lasso being the best by a slight margin. PCR (22.40) and PLS (21.37 MSE) perform nearly as well, with PCR being the worst of the five.


all_MSE = c(LS_MSE, ridge_MSE, lasso_MSE, pcr_MSE, pls_MSE)
names(all_MSE) = c("ls", "ridge", "lasso", "pcr", "pls")
all_MSE
##       ls    ridge    lasso      pcr      pls 
## 20.28627 20.00783 19.78296 22.40217 21.37494
which.min(all_MSE)
## lasso 
##     3


(c) Does your chosen model involve all of the features in the data set? Why or why not?

The Lasso model does use all features of the data set, however, when looking at the coefficients, some of them are so close to zero, that the model is effectively not using them.