Before moving to the non-linear world, in chapter 6 we investigate some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.

Question 2.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:

  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(b) Repeat (a) for ridge regression relative to least squares.

  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(c) Repeat (a) for non-linear methods relative to least squares.

  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Question 9.

In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.

#Load and split the College data
library(ISLR) #Load the ISLR library
package 㤼㸱ISLR㤼㸲 was built under R version 3.6.3
set.seed(52) #set the seed for Rs random number generator
sum(is.na(College)) #count the missing vallues
[1] 0
train = sample(1:nrow(College), nrow(College)/2) # create vector of training observations
test = -train # create vector of test observations
college_train = College[train, ] # create training data 
college_test = College[test, ] # create test data 

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

#Number of applications is the "Apps" variable.
lm_fit = lm(Apps~., data=college_train) #fit linear model
lm_pred = predict(lm_fit, college_test) #make predictions on test set
lm_mse=mean((college_test$Apps - lm_pred)^2) #calculate mse
lm_mse #print out mse
[1] 1338031

Test MSE is 1,338,031.

(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

#Pick λ using college_train and report error on college_test
library(glmnet) #load glmnet for lasso and rigde fit
Loading required package: Matrix
Loaded glmnet 4.0
train_mat = model.matrix(Apps~., data=college_train)[,-1] #create training matrix no intercept column
test_mat = model.matrix(Apps~., data=college_test)[,-1] # create test matrix
grid = 10^seq(10, -5, length=1000) # create a list of possible lambda values
mod_ridge = cv.glmnet(train_mat, college_train$Apps, alpha=0, lambda=grid, thresh=1e-12) #fit the ridge using cross validation to choose lambda
lambda_best = mod_ridge$lambda.min # store the best lambda chosen
lambda_best #print out the best lambda
[1] 20.95662
ridge_pred = predict(mod_ridge, newx=test_mat, s=lambda_best)#make predictions on test set
ridge_mse=mean((college_test$Apps - ridge_pred)^2) #calculate mse
ridge_mse #print ridge mse
[1] 1432032

Test MSE is higher than that of OLS, 1,432,032.

(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

mod_lasso = cv.glmnet(train_mat, college_train$Apps, alpha=1, lambda=grid, thresh=1e-12) #fit the lasso using cross validation to choose lambda
lambda_best = mod_lasso$lambda.min # store the best lambda chosen
lambda_best #print out the best lambda
[1] 12.9155
lasso_pred = predict(mod_lasso, newx=test_mat, s=lambda_best) #make predictions on test set
lasso_mse=mean((college_test$Apps - lasso_pred)^2) #calculate mse
lasso_mse #print lasso mse
[1] 1416086

Test MSE is higher than that of OLS, 1,416,086, but lower that the Ridge MSE.

(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
#fit principle component model
pcr_fit = pcr(Apps~., data=college_train, scale=T, validation="CV")
# view validation plot
validationplot(pcr_fit, val.type="MSEP")

#calculate predictions
pcr_pred = predict(pcr_fit, college_test, ncomp=9)
#calculate mse
pcr_mse=mean((college_test$Apps - pcr_pred)^2) 
pcr_mse #print pcr mse
[1] 3048364

Test MSE is much higher than that of OLS, 3,048,364.

(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

#fit partial least squares model
pls_fit = plsr(Apps~., data=college_train, scale=T, validation="CV")
# view validation plot
validationplot(pls_fit, val.type="MSEP")

#calculate predictions
pls_pred = predict(pls_fit, college_test, ncomp=5)
#calculate mse
pls_mse=mean((college_test$Apps - pls_pred)^2)
pls_mse #print pls mse
[1] 2146014

Test MSE is much higher than that of OLS, 2,146,014, but lower than PCR.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

library(tidyverse)
package 㤼㸱tidyverse㤼㸲 was built under R version 3.6.3package 㤼㸱ggplot2㤼㸲 was built under R version 3.6.3package 㤼㸱purrr㤼㸲 was built under R version 3.6.3package 㤼㸱dplyr㤼㸲 was built under R version 3.6.3package 㤼㸱forcats㤼㸲 was built under R version 3.6.3
res<-as_tibble(list(Model=c("Linear", "Ridge", "Lasso","PCR","PLS"), MSE=c(lm_mse,ridge_mse,lasso_mse,pcr_mse,pls_mse)))
res %>% 
  arrange(MSE) %>%
  mutate(MSE=format(MSE,big.mark=","))

The linear model outperforms the other chosen models.

Question 11.

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

set.seed(1)
library(MASS)
library(leaps)
library(glmnet)
#Best subset selection
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")

which.min(rmse.cv)
[1] 9
best_rmse=rmse.cv[which.min(rmse.cv)]
best_rmse
[1] 6.543281
#LASSO
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)

coef(cv.lasso)
14 x 1 sparse Matrix of class "dgCMatrix"
                   1
(Intercept) 2.176491
zn          .       
indus       .       
chas        .       
nox         .       
rm          .       
age         .       
dis         .       
rad         0.150484
tax         .       
ptratio     .       
black       .       
lstat       .       
medv        .       
lasso_rmse=sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
lasso_rmse
[1] 7.921353
#RIDGE
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)

coef(cv.ridge)
14 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  1.523899548
zn          -0.002949852
indus        0.029276741
chas        -0.166526006
nox          1.874769661
rm          -0.142852604
age          0.006207995
dis         -0.094547258
rad          0.045932737
tax          0.002086668
ptratio      0.071258052
black       -0.002605281
lstat        0.035745604
medv        -0.023480540
ridge_rmse=sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
ridge_rmse
[1] 7.669133
#PCR
library(pls)
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
Data:   X dimension: 506 13 
    Y dimension: 506 1
Fit method: svdpc
Number of components considered: 13

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps
CV            8.61    7.175    7.180    6.724
adjCV         8.61    7.174    7.179    6.721
       4 comps  5 comps  6 comps  7 comps  8 comps
CV       6.731    6.727    6.727    6.722    6.614
adjCV    6.725    6.724    6.724    6.718    6.609
       9 comps  10 comps  11 comps  12 comps
CV       6.618     6.607     6.598     6.553
adjCV    6.613     6.602     6.592     6.546
       13 comps
CV        6.488
adjCV     6.481

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps
X       47.70    60.36    69.67    76.45    82.99
crim    30.69    30.87    39.27    39.61    39.61
      6 comps  7 comps  8 comps  9 comps  10 comps
X       88.00    91.14    93.45    95.40     97.04
crim    39.86    40.14    42.47    42.55     42.78
      11 comps  12 comps  13 comps
X        98.46     99.52     100.0
crim     43.04     44.13      45.4
pcr_rmse=sqrt(pcr.fit$validation$adj[13])

13 component pcr fit has lowest CV/adjCV RMSEP.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

library(tidyverse)
res<-as_tibble(list(Model=c("Best Subsets", "Lasso","Ridge","PCR"), RMSE=c(best_rmse,lasso_rmse,ridge_rmse,pcr_rmse)))
res %>% 
  arrange(RMSE)

(c) Does your chosen model involve all of the features in the data set? Why or why not? I would choose the 9 parameter best subset model because it had the best cross-validated RMSE, next to PCR, but it was simpler model than the 13 component PCR model.

---
title: "R Notebook"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

Before moving to the non-linear world, in chapter 6 we investigate some ways in which the simple linear model can be improved, by replacing plain least squares fitting with some alternative fitting procedures.


### Question 2.  
__For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.__  
__(a) The lasso, relative to least squares, is:__

1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.  
2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.  
3. _Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance._  
4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

__(b) Repeat (a) for ridge regression relative to least squares.__

1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.  
2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.  
3. _Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance._  
4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

__(c) Repeat (a) for non-linear methods relative to least squares.__   

1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.  
2. _More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias._  
3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.  
4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.



### Question 9. 
__In this exercise, we will predict the number of applications received using the other variables in the `College` data set.__  
__(a) Split the data set into a training set and a test set.__
```{r}
#Load and split the College data
library(ISLR) #Load the ISLR library
set.seed(52) #set the seed for Rs random number generator
sum(is.na(College)) #count the missing vallues
train = sample(1:nrow(College), nrow(College)/2) # create vector of training observations
test = -train # create vector of test observations
college_train = College[train, ] # create training data 
college_test = College[test, ] # create test data 
```

__(b) Fit a linear model using least squares on the training set, and report the test error obtained.__
```{r}
#Number of applications is the "Apps" variable.
lm_fit = lm(Apps~., data=college_train) #fit linear model
lm_pred = predict(lm_fit, college_test) #make predictions on test set
lm_mse=mean((college_test$Apps - lm_pred)^2) #calculate mse
lm_mse #print out mse
```

Test MSE is `r format(lm_mse, big.mark =",")`.

__(c) Fit a ridge regression model on the training set, with $\lambda$ chosen by cross-validation. Report the test error obtained.__
```{r, warning=FALSE}
#Pick λ using college_train and report error on college_test
library(glmnet) #load glmnet for lasso and rigde fit
train_mat = model.matrix(Apps~., data=college_train)[,-1] #create training matrix no intercept column
test_mat = model.matrix(Apps~., data=college_test)[,-1] # create test matrix
grid = 10^seq(10, -5, length=1000) # create a list of possible lambda values
mod_ridge = cv.glmnet(train_mat, college_train$Apps, alpha=0, lambda=grid, thresh=1e-12) #fit the ridge using cross validation to choose lambda
lambda_best = mod_ridge$lambda.min # store the best lambda chosen
lambda_best #print out the best lambda
ridge_pred = predict(mod_ridge, newx=test_mat, s=lambda_best)#make predictions on test set
ridge_mse=mean((college_test$Apps - ridge_pred)^2) #calculate mse
ridge_mse #print ridge mse
```
Test MSE is higher than that of OLS, `r format(ridge_mse, big.mark =",")`.

__(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.__
```{r, warning=FALSE}
mod_lasso = cv.glmnet(train_mat, college_train$Apps, alpha=1, lambda=grid, thresh=1e-12) #fit the lasso using cross validation to choose lambda
lambda_best = mod_lasso$lambda.min # store the best lambda chosen
lambda_best #print out the best lambda
lasso_pred = predict(mod_lasso, newx=test_mat, s=lambda_best) #make predictions on test set
lasso_mse=mean((college_test$Apps - lasso_pred)^2) #calculate mse
lasso_mse #print lasso mse
```
Test MSE is higher than that of OLS, `r format(lasso_mse, big.mark =",")`, but lower that the Ridge MSE. 

__(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.__
```{r}
library(pls)
#fit principle component model
pcr_fit = pcr(Apps~., data=college_train, scale=T, validation="CV")
# view validation plot
validationplot(pcr_fit, val.type="MSEP")
#calculate predictions
pcr_pred = predict(pcr_fit, college_test, ncomp=9)
#calculate mse
pcr_mse=mean((college_test$Apps - pcr_pred)^2) 
pcr_mse #print pcr mse
```
Test MSE is much higher than that of OLS, `r format(pcr_mse, big.mark =",")`.


__(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.__
```{r}
#fit partial least squares model
pls_fit = plsr(Apps~., data=college_train, scale=T, validation="CV")
# view validation plot
validationplot(pls_fit, val.type="MSEP")
#calculate predictions
pls_pred = predict(pls_fit, college_test, ncomp=5)
#calculate mse
pls_mse=mean((college_test$Apps - pls_pred)^2)
pls_mse #print pls mse
```
Test MSE is much higher than that of OLS, `r format(pls_mse, big.mark =",")`, but lower than PCR.

__(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?__

```{r,message=FALSE}
library(tidyverse)
res<-as_tibble(list(Model=c("Linear", "Ridge", "Lasso","PCR","PLS"), MSE=c(lm_mse,ridge_mse,lasso_mse,pcr_mse,pls_mse)))
res %>% 
  arrange(MSE) %>%
  mutate(MSE=format(MSE,big.mark=","))
```

The linear model outperforms the other chosen models.


### Question 11. 
__We will now try to predict per capita crime rate in the Boston data set.__

__(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.__
```{r}
set.seed(1)
library(MASS)
library(leaps)
library(glmnet)
#Best subset selection
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10 #number of folds
p = ncol(Boston) - 1 # number of predictors
folds = sample(rep(1:k, length = nrow(Boston))) #create the folds
cv.errors = matrix(NA, k, p) #initialize a matrix to store resulting errors
#cycle through every fold and examine which best subset fits the holdout fold.
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean)) #average the cv errors
plot(rmse.cv, pch = 19, type = "b") #plot it
which.min(rmse.cv) #find out how many variables our best model has
best_rmse=rmse.cv[which.min(rmse.cv)] #store the best rmse
best_rmse #print the best rmse
```

```{r}
#LASSO
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
lasso_rmse=sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
lasso_rmse
```


```{r}
#RIDGE
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
ridge_rmse=sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
ridge_rmse
```

```{r}
#PCR
library(pls)
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
pcr_rmse=sqrt(pcr.fit$validation$adj[13])
```
13 component pcr fit has lowest CV/adjCV RMSEP.

__(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.__
```{r}
library(tidyverse)
res<-as_tibble(list(Model=c("Best Subsets", "Lasso","Ridge","PCR"), RMSE=c(best_rmse,lasso_rmse,ridge_rmse,pcr_rmse)))
res %>% 
  arrange(RMSE)
```

__(c) Does your chosen model involve all of the features in the data set? Why or why not?__
I would choose the 9 parameter best subset model because it had the best cross-validated RMSE, next to PCR, but it was simpler model than the 13 component PCR model.