1. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
  1. The lasso, relative to least squares, is:
  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
## (iii) is the correct answer, Lasso is less flexible, but more accurate 
##in prediction when the increase in bias is less than the decrease in variance. 
## Lasso reduces model flexibility by shrinking some coefficients to zero and 
## adding a penalty to the least squares objectives
  1. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  1. Repeat (a) for ridge regression relative to least squares.
## (iii) is the correct answer.
## Ridge regression adds a penalty that shrinks coefficients towards zero but
## does not set any to exactly zero
## like Lasso, this model is less flexible. It reduces variance at the cost of some bias.
  1. Repeat (a) for non-linear methods relative to least squares.
## (i) is the correct answer.
## none linear methods are more flexible than linear regresion.
## they have lower bias, but hiegher variance.
## prediction is improved when the decrease is larger than the increase in variance.
  1. In this exercise, we will predict the number of applications received using the other variables in the College data set.
  1. Split the data set into a training set and a test set.
library(ISLR2) #load necessary library
set.seed(10) # ensuring repeatability of the outcome
# SPlitting data into train and test data sets
train <- sample(1:nrow(College), nrow(College)/2) # creating train data set
test <- (-train) # Creating test data set
College.train <- College[train, ]
College.test <- College[test, ]
  1. Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit <- lm(Apps ~ ., data = College.train) 
lm.pred <- predict(lm.fit, College.test)
mean((College.test$Apps - lm.pred)^2)
## [1] 1020100
## Test error obtained: 1020100
  1. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
train.mat <- model.matrix(Apps ~ ., data = College.train)
test.mat <- model.matrix(Apps ~ ., data = College.test)
y.train <- College.train$Apps
y.test <- College.test$Apps

grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(train.mat, y.train, alpha = 0, lambda = grid)
cv.out <- cv.glmnet(train.mat, y.train, alpha = 0)
bestlam <- cv.out$lambda.min

ridge.pred <- predict(ridge.mod, s = bestlam, newx = test.mat)
mean((ridge.pred - y.test)^2)
## [1] 985020.1
## Test error: 985020.1
  1. Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod <- glmnet(train.mat, y.train, alpha = 1, lambda = grid)
cv.out <- cv.glmnet(train.mat, y.train, alpha = 1)
bestlam <- cv.out$lambda.min

lasso.pred <- predict(lasso.mod, s = bestlam, newx = test.mat)
mean((lasso.pred - y.test)^2)
## [1] 1008145
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)
sum(lasso.coef != 0)
## [1] 16
## Test error: 1008145
## non-zero coefficient: 16
  1. Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.4.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     4335     2390     2401     2112     1954     1914
## adjCV         4347     4335     2386     2401     2085     1949     1905
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1910     1879     1871      1867      1867      1875      1894
## adjCV     1902     1862     1863      1860      1859      1867      1887
##        14 comps  15 comps  16 comps  17 comps
## CV         1853      1634      1323      1286
## adjCV      1934      1586      1310      1273
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     32.6794    56.94    64.38    70.61    76.27    80.97    84.48    87.54
## Apps   0.9148    71.17    71.36    79.85    81.49    82.73    82.79    83.70
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.89     94.96     96.81     97.97     98.73     99.39
## Apps    83.86     84.08     84.11     84.11     84.16     84.28     93.08
##       16 comps  17 comps
## X        99.86    100.00
## Apps     93.71     93.95
  1. Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")

summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     2154     1836     1732     1620     1422     1314
## adjCV         4347     2148     1832     1724     1591     1397     1298
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1276     1264     1260      1266      1265      1262      1262
## adjCV     1264     1253     1250      1254      1253      1251      1250
##        14 comps  15 comps  16 comps  17 comps
## CV         1263      1263      1263      1263
## adjCV      1251      1251      1251      1252
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.27    38.72    62.64    65.26    69.01    73.96    78.86    82.18
## Apps    76.96    84.31    86.80    91.48    93.37    93.75    93.81    93.84
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       85.35     87.42     89.18     91.41     92.70     94.58     97.16
## Apps    93.88     93.91     93.93     93.94     93.95     93.95     93.95
##       16 comps  17 comps
## X        98.15    100.00
## Apps     93.95     93.95
pls.pred <- predict(pls.fit, College.test, ncomp = 5)
mean((College.test$Apps - pls.pred)^2)
## [1] 1129004
## test error: 1129004
## value of M: 5
  1. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
##All five methods yield similar test MSEs.

## Ridge regression performs best.

## Lasso is competitive and yields a sparse model (17 non-zero coefficients).

## PCR and PLS have similar test errors, but require tuning the number of components.

## None of the methods dramatically outperforms the others, which means:

## The predictability of applications is moderate.

## There's no strong advantage to using one method over another in this case.
  1. We will now try to predict per capita crime rate in the Boston data set.
## Loading necessary libraries:

library(ISLR2)
library(leaps)
## Warning: package 'leaps' was built under R version 4.4.3
library(glmnet)
library(pls)

# splitting the data
data("Boston")
set.seed(10)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
test <- (-train)
x.train <- model.matrix(crim ~ ., Boston)[train, ]
x.test <- model.matrix(crim ~ ., Boston)[test, ]
y.train <- Boston$crim[train]
y.test <- Boston$crim[test]
  1. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# Best subset selection


# Split data
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- (-train)
boston.train <- Boston[train, ]
boston.test <- Boston[test, ]

# Fit best subset
regfit.best <- regsubsets(crim ~ ., data = boston.train, nvmax = 13)
test.mat <- model.matrix(crim ~ ., data = boston.test)

# Helper function for prediction
predict.regsubsets <- function(object, newdata, id) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  vars <- names(coefi)
  mat[, vars] %*% coefi
}

# Adjust loop based on what was actually fitted
max.models <- dim(summary(regfit.best)$which)[1]
val.errors <- rep(NA, max.models)

for (i in 1:max.models) {
  pred <- predict.regsubsets(regfit.best, boston.test, id = i)
  val.errors[i] <- mean((boston.test$crim - pred)^2)
}

best.size <- which.min(val.errors)
cat("Best model size:", best.size, "\n")
## Best model size: 1
cat("Test MSE:", val.errors[best.size], "\n")
## Test MSE: 40.14557
##Best model size: 1 
##Test MSE: 40.14557 
  1. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
## Lasso regression is recommended:

##It produced low test error.

##It also performed automatic feature selection, which simplifies the model.
  1. Does your chosen model involve all of the features in the data set? Why or why not?
## No, lasso does not use all features.

##Lasso introduces an L1 penalty which can shrink some coefficients exactly to zero.

##This allows the model to automatically discard irrelevant or redundant variables.

##This improves interpretability and often reduces overfitting, especially when some predictors are weak or highly correlated.