---
title: "Assignment 5"
author: "Rani Misra"
date: "`r Sys.Date()`"
output: openintro::lab_report
---

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(ISLR)
library(ISLR2)
library(glmnet)
library(pls)
library(kableExtra)
library(leaps)
library(MASS)
```

## Exercise 2
##### For parts $(a)$ through $(c)$, indicate which of $i.$ through $iv.$ is correct. Justify your answer.

##### $(a)$ The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

- iii is correct. The lasso is less flexible than least squares: by shrinking the 
coefficient estimates, and setting some of them exactly to zero (removing 
non-essential variables), it reduces variance at the cost of a small increase in 
bias. Prediction accuracy therefore improves when the increase in bias is smaller 
than the decrease in variance.

##### $(b)$ Repeat $(a)$ for ridge regression relative to least squares.

- iii is correct. Ridge regression is also less flexible than least squares: 
shrinking the coefficient estimates decreases variance while introducing some bias, 
so it improves prediction accuracy when the increase in bias is less than the 
decrease in variance.
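
Both answers follow from the penalized least squares criteria the two methods minimize (notation as in ISLR, included here only as a reminder):

$$
\text{lasso: } \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|,
\qquad
\text{ridge: } \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2
$$

A larger $\lambda$ shrinks the coefficients more strongly, constraining the fit and trading a small increase in bias for a reduction in variance.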

##### $(c)$ Repeat $(a)$ for non-linear methods relative to least squares.

- ii is correct. Non-linear methods are more flexible than least squares, so they 
have lower bias but higher variance; they improve prediction accuracy when the 
increase in variance is less than the decrease in bias.

## Exercise 9
##### In this exercise, we will predict the number of applications received using the other variables in the `College` data set.

##### $(a)$ Split the data set into a training set and a test set.

```{r}
attach(College)
set.seed(1)

x = model.matrix(Apps ~ ., College)[, -1]   # predictor matrix, dropping the intercept column
y = College$Apps
train = sample(1:nrow(x), nrow(x)/2)        # half of the rows for training
test = (-train)                             # negative indices give the remaining (test) rows
College.train = College[train, ]
College.test = College[test, ]
y.test = y[test]
```
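
A quick sanity check on the split (optional; the `College` data has 777 rows, so the two halves differ by one observation):

```{r}
# sizes of the training and test sets
c(train = nrow(College.train), test = nrow(College.test))
```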

##### $(b)$ Fit a linear model using least squares on the training set, and report the test error obtained.

```{r}
set.seed(1)

linear.fit <- lm(Apps ~ ., data = College, subset = train)   # least squares fit on the training rows
summary(linear.fit)
pred.app <- predict(linear.fit, College.test)                 # predictions for the held-out colleges
test.error <- mean((College.test$Apps - pred.app)^2)          # test MSE
test.error
```

The test MSE for the least squares linear model is 1,135,758.
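
Since the MSE is in squared units, it is easier to interpret on the scale of the response; the corresponding test RMSE (an optional check) is simply its square root:

```{r}
# test RMSE in units of applications (roughly 1,066)
sqrt(test.error)
```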

##### $(c)$ Fit a ridge regression model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained.

```{r}
set.seed(1)

grid = 10^seq(10, -2, length = 100)                                 # lambda grid from 10^10 down to 10^-2
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid)  # alpha = 0: ridge penalty
summary(ridge.mod)
cv.college.out = cv.glmnet(x[train, ], y[train], alpha = 0)         # cross-validation to choose lambda
bestlam = cv.college.out$lambda.min
bestlam
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)                                       # test MSE
```

The test MSE for the ridge regression model is 976,268.9.
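
As an optional check, the coefficient estimates at the selected $\lambda$ can be pulled from the fitted `glmnet` object:

```{r}
# ridge coefficients at the CV-selected lambda
predict(ridge.mod, type = "coefficients", s = bestlam)
```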

##### $(d)$ Fit a lasso model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

```{r}
set.seed(1)

lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)   # alpha = 1: lasso penalty
summary(lasso.mod)
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1)                  # cross-validation to choose lambda
bestlam = cv.out$lambda.min
bestlam
lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)                                        # test MSE
out = glmnet(x, y, alpha = 1, lambda = grid)                         # refit on the full data
lasso.coef = predict(out, type = "coefficients", s = bestlam)[1:18, ]
nonzero = sum(lasso.coef != 0) - 1                                   # subtract 1 to exclude the intercept
nonzero
```

The test MSE for the lasso model is 1,116,252, with 17 non-zero coefficient estimates.
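
For reference, the predictors whose coefficients are shrunk exactly to zero at this $\lambda$ can be listed directly (an optional check using the `lasso.coef` vector computed above):

```{r}
# variables dropped by the lasso at the selected lambda
names(lasso.coef)[lasso.coef == 0]
```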

##### $(e)$ Fit a PCR model on the training set, with $M$ chosen by cross-validation. Report the test error obtained, along with the value of $M$ selected by cross-validation.

```{r}
set.seed(1)

pcr.college = pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")   # standardized predictors
validationplot(pcr.college, val.type = "MSEP")
optimal.pcr <- which.min(pcr.college$validation$PRESS) - 1   # M chosen from the cross-validated PRESS
optimal.pcr
pcr.pred = predict(pcr.college, x[test, ], ncomp = optimal.pcr)
mean((pcr.pred - y.test)^2)                                  # test MSE
```

The test MSE for the PCR model is 1,137,877 with $M = 16$, the value giving a low 
cross-validation error while explaining a high proportion of the variance.
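
The validation plot can also be read off numerically; one way (an optional check) is the `RMSEP()` helper from the `pls` package, which reports the cross-validated RMSE for each number of components:

```{r}
# cross-validated RMSEP for each number of components
RMSEP(pcr.college)
```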

##### $(f)$ Fit a PLS model on the training set, with $M$ chosen by cross-validation. Report the test error obtained, along with the value of $M$ selected by cross-validation.

```{r}
set.seed(1)

pls.college = plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pls.college, val.type = "MSEP")
optimal.pls <- which.min(pls.college$validation$PRESS) - 1   # M chosen from the cross-validated PRESS
optimal.pls
pls.pred = predict(pls.college, x[test, ], ncomp = optimal.pls)
mean((pls.pred - y.test)^2)                                  # test MSE
```

The test MSE for the PLS model is 1,135,812 with $M = 16$, again chosen for its low 
cross-validation error and high proportion of variance explained.

##### $(g)$ Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

```{r}
set.seed(1)

test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) /
  mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /
  mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /
  mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /
  mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /
  mean((College.test[, "Apps"] - test.avg)^2)

r2_values = c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2)

bar_positions = barplot(r2_values, 
        names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), 
        main="Test R-squared", ylim=c(0, max(r2_values) * 1.1))
text(x=bar_positions, y=r2_values, labels=round(r2_values, 3), pos=3, cex=0.6)

lm.rmse = sqrt(mean((College.test[, "Apps"] - pred.app)^2))
ridge.rmse = sqrt(mean((College.test[, "Apps"] - ridge.pred)^2))
lasso.rmse = sqrt(mean((College.test[, "Apps"] - lasso.pred)^2))
pcr.rmse = sqrt(mean((pcr.pred - y.test)^2))
pls.rmse = sqrt(mean((pls.pred - y.test)^2))

lm.mse = mean((College.test[, "Apps"] - pred.app)^2)
ridge.mse = mean((College.test[, "Apps"] - ridge.pred)^2)
lasso.mse = mean((College.test[, "Apps"] - lasso.pred)^2)
pcr.mse = mean((pcr.pred - y.test)^2)
pls.mse = mean((pls.pred - y.test)^2)

rmse_table <- data.frame(
  Method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
  MSE = c(lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse),
  RMSE = c(lm.rmse, ridge.rmse, lasso.rmse, pcr.rmse, pls.rmse)
)

rmse_table %>%
  kbl(digits = 3, format = "html", caption = "Comparison of MSE and RMSE for Different Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

The test $R^2$ values show that all five models explain the variation in 
applications similarly well, with PCR slightly the lowest. Focusing on the test 
MSE/RMSE, ridge regression has the lowest error of the five and would be a reasonable model to choose.
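
For reference, the test $R^2$ plotted above is one minus the ratio of each model's test-set residual sum of squares to that of the test-set mean:

$$
R^2_{\text{test}} = 1 - \frac{\sum_{i \in \text{test}} (y_i - \hat{y}_i)^2}{\sum_{i \in \text{test}} (y_i - \bar{y}_{\text{test}})^2}
$$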

## Exercise 11
##### We will now try to predict per capita crime rate in the `Boston` data set.

##### $(a)$ Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

```{r}
set.seed(1)
attach(Boston)

# regsubsets() has no predict() method, so define one for use inside the CV loop
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])   # formula used in the regsubsets() call
    mat = model.matrix(form, newdata)     # design matrix for the new data
    coefi = coef(object, id = id)         # coefficients of the id-variable model
    mat[, names(coefi)] %*% coefi
}

k = 10                                    # number of CV folds
p = ncol(Boston) - 1                      # number of candidate predictors
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)   # CV error for each model size
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")

which.min(mean.cv.errors)                       # model size with the lowest CV error
mean.cv.errors[which.min(mean.cv.errors)]       # and its CV MSE

x = model.matrix(crim ~ . - 1, data = Boston)   # predictor matrix without an intercept column
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")             # alpha defaults to 1 (lasso)
plot(cv.lasso)
coef(cv.lasso)                                               # coefficients at lambda.1se (the default)
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])   # CV RMSE at lambda.1se

cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)  # alpha = 0: ridge
plot(cv.ridge)
coef(cv.ridge)
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])   # CV RMSE at lambda.1se

pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
```
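
Since the CV curve above selects a particular model size, its coefficients refit on the full `Boston` data can also be inspected (an optional check, not part of the comparison below):

```{r}
# best-subset model of the CV-selected size, refit on all of Boston
coef(regsubsets(crim ~ ., data = Boston, nvmax = p), id = which.min(mean.cv.errors))
```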

##### $(b)$ Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

```{r}
best_subset_mse = mean.cv.errors[which.min(mean.cv.errors)]
lasso_mse = min(cv.lasso$cvm)
ridge_mse = min(cv.ridge$cvm)
pcr_mse = min(pcr.crime$validation$PRESS / nrow(Boston))  # PRESS / n = MSE

best_subset_rmse = sqrt(best_subset_mse)
lasso_rmse = sqrt(lasso_mse)
ridge_rmse = sqrt(ridge_mse)
pcr_rmse = sqrt(pcr_mse)

mse_table <- data.frame(
  Method = c("Best Subset", "LASSO", "Ridge", "PCR"),
  MSE = c(best_subset_mse, lasso_mse, ridge_mse, pcr_mse),
  RMSE = c(best_subset_rmse, lasso_rmse, ridge_rmse, pcr_rmse)
)

mse_table %>%
  kbl(digits = 3, format = "html", caption = "Comparison of MSE and RMSE for Different Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```
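
For completeness, the $\lambda$ values at which the lasso and ridge CV errors reach their minima (an extra check, not shown in the table) are:

```{r}
# lambda values minimizing the cross-validated MSE
c(lasso = cv.lasso$lambda.min, ridge = cv.ridge$lambda.min)
```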

Based on the cross-validated MSE, the PCR model performs best, with an MSE of about 42.1 (RMSE of roughly 6.5); best subset selection is a close second at about 42.8.

##### $(c)$ Does your chosen model involve all of the features in the data set? Why or why not?

The chosen model is PCR, which has the lowest cross-validated MSE. It draws on all 13 predictors: PCR transforms the original variables into principal components (PCs), which are linear combinations of all of the predictors, and then regresses `crim` on a subset of those components. Keeping only the components that explain most of the variance can reduce overfitting, and because the components are uncorrelated, multicollinearity among the raw predictors is avoided.
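
In symbols, each component used by PCR is a linear combination of all 13 (standardized) predictors, so even a model that keeps $M < 13$ components still uses every feature:

$$
Z_m = \sum_{j=1}^{13} \phi_{jm} X_j, \qquad m = 1, \dots, M
$$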