2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso shrinks the coefficient estimates toward zero (setting some exactly to zero), which lowers variance at the cost of some bias; prediction accuracy improves when the variance reduction outweighs the added bias.

(b) Repeat (a) for ridge regression relative to least squares.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, ridge regression shrinks the coefficient estimates (though without setting any exactly to zero), trading a small increase in bias for a larger decrease in variance.

(c) Repeat (a) for non-linear methods relative to least squares.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods make weaker assumptions about the form of f, so they have lower bias but higher variance than least squares; they win when the bias reduction dominates.
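
All three answers follow from the bias-variance decomposition of the expected test error at a point \(x_0\):

\[
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon).
\]

The lasso and ridge regression reduce the variance term at the cost of some bias, so they help when that trade is favorable; non-linear methods reduce bias at the cost of extra variance and help in the opposite case.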

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
data(College)
set.seed(11)
# Randomly assign half of the observations to the training set, the rest to the test set
train <- sample(1:dim(College)[1], dim(College)[1] / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

fit.lm <- lm(Apps ~ ., data = College.train)
pred.lm <- predict(fit.lm, College.test)
mean((pred.lm - College.test$Apps)^2)
## [1] 1026096

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(Matrix)
library(MatrixModels)
library(glmnet)
## Loaded glmnet 4.1-7
train.matrix <- model.matrix(Apps ~ ., data = College.train)
test.matrix <- model.matrix(Apps ~ ., data = College.test)
grid <- 10 ^ seq(4, -2, length = 100)
fit.ridge <- glmnet(train.matrix, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.matrix, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlambda.ridge <- cv.ridge$lambda.min
bestlambda.ridge
## [1] 0.01
pred.ridge <- predict(fit.ridge, s = bestlambda.ridge, newx = test.matrix)
mean((pred.ridge - College.test$Apps)^2)
## [1] 1026069

The test MSE for ridge regression (1,026,069) is essentially the same as that for least squares (1,026,096), in fact very slightly lower. This is expected, since the selected λ = 0.01 is the smallest value in the grid, so the ridge fit is very close to the least-squares fit.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

fit.lasso <- glmnet(train.matrix, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.matrix, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlambda.lasso <- cv.lasso$lambda.min
bestlambda.lasso
## [1] 0.01
pred.lasso <- predict(fit.lasso, s = bestlambda.lasso, newx = test.matrix)
mean((pred.lasso - College.test$Apps)^2)
## [1] 1026036
predict(fit.lasso, s = bestlambda.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)   37.86520037
## (Intercept)    .         
## PrivateYes  -551.14946609
## Accept         1.74980812
## Enroll        -1.36005786
## Top10perc     65.55655577
## Top25perc    -22.52640339
## F.Undergrad    0.10181853
## P.Undergrad    0.01789131
## Outstate      -0.08706371
## Room.Board     0.15384585
## Books         -0.12227313
## Personal       0.16194591
## PhD          -14.29638634
## Terminal      -1.03118224
## S.F.Ratio      4.47956819
## perc.alumni   -0.45456280
## Expend         0.05618050
## Grad.Rate      9.07242834

The test MSE for the lasso (1,026,036) is likewise essentially the same as that for least squares, again very slightly lower. At the selected λ = 0.01 none of the 17 predictors is shrunk exactly to zero, so all coefficient estimates are non-zero.
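
One way to count the non-zero coefficient estimates directly, rather than reading the printout, is a small sketch using the fitted lasso object above:

lasso.coef <- predict(fit.lasso, s = bestlambda.lasso, type = "coefficients")
# Count entries that are not exactly zero, then drop the intercept
sum(lasso.coef != 0) - 1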

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")

pred.pcr <- predict(fit.pcr, College.test, ncomp = 10)
mean((pred.pcr - College.test$Apps)^2)
## [1] 1867486

With M = 10 components, the test MSE for PCR (1,867,486) is substantially higher than that for least squares.
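
Rather than choosing M by eye from the validation plot, the pls package also provides selectNcomp() to pick the number of components from the cross-validation results. A sketch (the one-sigma heuristic is one common choice; the same call works for the plsr fit in part (f)):

# Smallest M whose CV error is within one standard error of the minimum
ncomp.pcr <- selectNcomp(fit.pcr, method = "onesigma", plot = TRUE)
ncomp.pcr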

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")

pred.pls <- predict(fit.pls, College.test, ncomp = 10)
mean((pred.pls - College.test$Apps)^2)
## [1] 1031287

With M = 10 components, the test MSE for PLS (1,031,287) is only slightly higher than that for least squares.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

test.avg <- mean(College.test$Apps)
lm.r2 <- 1 - mean((pred.lm - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2 <- 1 - mean((pred.ridge - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2 <- 1 - mean((pred.lasso - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2 <- 1 - mean((pred.pcr - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2 <- 1 - mean((pred.pls - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
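
The five nearly identical lines above can also be collapsed into a small helper; this is just a sketch using the prediction objects already computed:

# Test R^2 = 1 - RSS/TSS on the held-out observations
test.r2 <- function(pred, actual) {
    1 - mean((pred - actual)^2) / mean((mean(actual) - actual)^2)
}
sapply(list(lm = pred.lm, ridge = pred.ridge, lasso = pred.lasso,
            pcr = pred.pcr, pls = pred.pls),
       test.r2, actual = College.test$Apps)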

The test \(R^2\) is approximately 0.9104 for least squares, 0.9104 for ridge, 0.9104 for the lasso, 0.8370 for PCR, and 0.9100 for PLS.

All of the models except PCR predict the number of applications received with reasonably high accuracy (test \(R^2 \approx 0.91\)), and among least squares, ridge, the lasso, and PLS there is very little difference in test error; PCR with 10 components performs noticeably worse.

11. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(leaps)
library(MASS)
data(Boston)
set.seed(1)

# Prediction method for regsubsets objects (the leaps package does not supply one):
# rebuild the model matrix from the original formula and multiply it by the
# coefficients of the model with `id` predictors
predict.regsubsets <- function(object, newdata, id, ...) {
    form <- as.formula(object$call[[2]])
    mat <- model.matrix(form, newdata)
    coefi <- coef(object, id = id)
    xvars <- names(coefi)
    mat[, xvars] %*% coefi
}

# 10-fold cross-validation for best subset selection
k <- 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
    best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
    for (i in 1:13) {
        pred <- predict(best.fit, Boston[folds == j, ], id = i)
        cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
    }
}
# Average held-out MSE over the folds for each subset size
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")

This is 10-fold cross-validation for best subset selection; the plot shows the mean CV error as a function of subset size, and the sketch below reports the size with the smallest CV error.
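
A sketch of how to extract that subset size and the corresponding coefficients refit on the full data:

# Subset size with the smallest mean cross-validated MSE
best.size <- which.min(mean.cv.errors)
best.size
# Refit best subset selection on all of Boston and report that model's coefficients
reg.best <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
coef(reg.best, id = best.size)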

x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
cv.out <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out)

This is the lasso, with λ chosen by 10-fold cross-validation; the sketch below reports the selected λ and the coefficients at that value.
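
The selected λ and the lasso coefficients at that λ can be read directly off the cv.glmnet object (a sketch; cv.out here is the lasso fit from the chunk above):

# Lambda minimizing the cross-validated MSE
cv.out$lambda.min
# Coefficients at that lambda; any predictor shown as "." has been dropped by the lasso
coef(cv.out, s = "lambda.min")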

cv.out <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out)

This repeats the cross-validated fit for ridge regression (alpha = 0).

pcr.fit <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.198    7.198    6.786    6.762    6.790    6.821
## adjCV         8.61    7.195    7.195    6.780    6.753    6.784    6.813
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.822    6.689    6.712     6.720     6.712     6.664     6.593
## adjCV    6.812    6.679    6.701     6.708     6.700     6.651     6.580
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr.fit, val.type = "MSEP")

This is PCR with M chosen by cross-validation: the cross-validated RMSEP is lowest when all 13 components are retained (6.593), although most of the improvement comes from the first few components.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error.
We compare the approaches using their cross-validation errors rather than their training errors, and propose the model that achieves the lowest CV error among those fitted above.

(c) Does your chosen model involve all of the features in the data set? Why or why not?
No. The data set contains 13 candidate predictors, and the model selected by cross-validation uses only a subset of them, because dropping variables that contribute little predictive value reduces variance without appreciably increasing bias.