Assignment 5

Rudy Martinez

7/15/2021


Libraries

library(ISLR)
library(tidyverse)
library(glmnet)
library(MASS)
library(pls)
library(caret)
library(leaps)

Exercises

Exercise 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

  • i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  • ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  • iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  • iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.


  • (a) The lasso, relative to least squares, is: Option iii. The lasso's advantage over least squares is rooted in the bias-variance trade-off. By shrinking coefficients (and setting some exactly to zero), the lasso trades a small increase in bias for a potentially large reduction in variance, which can yield more accurate predictions.
  • (b) Repeat (a) for ridge regression relative to least squares: Option iii. Ridge regression likewise trades a small increase in bias for a large decrease in variance, and works best in situations where the least squares estimates have high variance (e.g., highly correlated predictors). A small simulation sketch after this list illustrates the effect.
  • (c) Repeat (a) for non-linear methods relative to least squares. Option ii. A non-linear method is more flexible because fewer assumptions are made about the functional form of f; it improves prediction accuracy when its increase in variance is smaller than its decrease in bias.
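  • The trade-off in (a) and (b) can be illustrated with a small simulation (a sketch, not part of the assignment; it reuses MASS and glmnet from the libraries above). With strongly correlated predictors, ridge estimates are biased toward zero but vary far less across repeated samples than least squares estimates:
#Exercise 2 illustration (sketch)
set.seed(42)
n = 30; p = 10
Sigma = matrix(0.9, p, p); diag(Sigma) = 1   # strongly correlated predictors
beta = rep(1, p)
ols.est = ridge.est = matrix(NA, 100, p)
for (r in 1:100) {
  X = MASS::mvrnorm(n, rep(0, p), Sigma)
  y = as.numeric(X %*% beta + rnorm(n))
  ols.est[r, ] = coef(lm(y ~ X))[-1]
  ridge.est[r, ] = as.numeric(coef(glmnet(X, y, alpha = 0, lambda = 1)))[-1]
}
# average per-coefficient sampling variance; ridge should be much smaller
c(OLS = mean(apply(ols.est, 2, var)), Ridge = mean(apply(ridge.est, 2, var)))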


Exercise 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

  • (a) Split the data set into a training set and a test set.
# Exercise 9-a
attach(College)

set.seed(1)
train_index = sample(1:nrow(College), round(nrow(College) * 0.70))
train = College[train_index, ]
test = College[-train_index, ]
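  • Since caret is already loaded, a stratified split is an alternative worth noting (a sketch; train_strat and test_strat are illustrative names, and all results below use the simple random split above):
# createDataPartition stratifies on the response, keeping the
# distribution of Apps similar in the training and test sets
part_index = createDataPartition(College$Apps, p = 0.70, list = FALSE)
train_strat = College[part_index, ]
test_strat = College[-part_index, ]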


  • (b) Fit a linear model using least squares on the training set, and report the test error obtained.
#Exercise 9-b
lm.fit = lm(Apps~., data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5819.3  -450.7     1.4   328.5  7444.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.486e+02  5.066e+02  -1.083  0.27935    
## PrivateYes  -5.096e+02  1.643e+02  -3.101  0.00203 ** 
## Accept       1.721e+00  4.755e-02  36.201  < 2e-16 ***
## Enroll      -1.042e+00  2.419e-01  -4.308 1.97e-05 ***
## Top10perc    5.354e+01  6.435e+00   8.320 7.60e-16 ***
## Top25perc   -1.617e+01  5.088e+00  -3.178  0.00157 ** 
## F.Undergrad  2.739e-02  4.367e-02   0.627  0.53069    
## P.Undergrad  7.135e-02  3.645e-02   1.957  0.05085 .  
## Outstate    -8.783e-02  2.170e-02  -4.047 5.97e-05 ***
## Room.Board   1.623e-01  5.571e-02   2.914  0.00372 ** 
## Books        2.781e-01  2.718e-01   1.023  0.30675    
## Personal    -4.606e-03  7.254e-02  -0.063  0.94939    
## PhD         -9.694e+00  5.356e+00  -1.810  0.07087 .  
## Terminal    -2.714e-01  6.006e+00  -0.045  0.96397    
## S.F.Ratio    1.652e+01  1.606e+01   1.029  0.30408    
## perc.alumni  2.302e+00  4.848e+00   0.475  0.63517    
## Expend       5.982e-02  1.337e-02   4.476 9.34e-06 ***
## Grad.Rate    7.160e+00  3.517e+00   2.036  0.04227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1021 on 526 degrees of freedom
## Multiple R-squared:  0.9332, Adjusted R-squared:  0.931 
## F-statistic: 432.3 on 17 and 526 DF,  p-value: < 2.2e-16
lm.pred = predict(lm.fit, newdata= test)
lm.err = mean((test$Apps - lm.pred)^2)
lm.err
## [1] 1266407
  • The test error (test MSE) obtained from the linear model fit on all predictors is 1266407.
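  • For scale, a quick sketch converting the MSE to RMSE (the units of Apps) and to a test R-squared, reusing lm.err from above:
# test RMSE and test R-squared, for interpretability
sqrt(lm.err)
1 - lm.err / mean((test$Apps - mean(test$Apps))^2)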


  • (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
#Exercise 9-c
# model.matrix() builds the numeric design matrix that glmnet requires;
# note that train[,-1] drops the first column (Private), so the factor
# Private is excluded from the ridge and lasso fits below, unlike the
# linear model in part (b)
xtrain = model.matrix(Apps~., data = train[,-1])
ytrain = train$Apps
xtest = model.matrix(Apps~., data = test[,-1])
ytest = test$Apps

set.seed(1)
ridge.fit = cv.glmnet(xtrain, ytrain, alpha = 0)
plot(ridge.fit)

ridge.lambda = ridge.fit$lambda.min
paste("Ridge Lambda", ridge.lambda)
## [1] "Ridge Lambda 367.220655836988"
ridge.pred = predict(ridge.fit, s = ridge.lambda, newx = xtest)
ridge.err = mean((ridge.pred-ytest)^2)
paste("Test Error", ridge.err)
## [1] "Test Error 1168672.06016979"


  • (d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#Exercise 9-d
set.seed(1)
lasso.fit = cv.glmnet(xtrain, ytrain, alpha=1)
plot(lasso.fit)

lasso.lambda = lasso.fit$lambda.min
paste("Lasso Lamda", lasso.lambda)
## [1] "Lasso Lamda 1.95974618863191"
lasso.pred = predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err = mean((lasso.pred-ytest)^2)
paste("Lasso Error", lasso.err)
## [1] "Lasso Error 1269433.8248159"


#Exercise 9-d coeff
lasso.coeff = predict(lasso.fit, type="coefficients", s=lasso.lambda)[1:18,]
round(lasso.coeff[lasso.coeff != 0],3)
## (Intercept)      Accept      Enroll   Top10perc   Top25perc F.Undergrad 
##   -1055.063       1.702      -0.945      51.533     -14.532       0.036 
## P.Undergrad    Outstate  Room.Board       Books    Personal         PhD 
##       0.074      -0.107       0.146       0.261      -0.005      -6.878 
##    Terminal   S.F.Ratio perc.alumni      Expend   Grad.Rate 
##       0.608      24.927       0.627       0.063       6.095
  • The values above are the non-zero coefficient estimates: all 16 predictors (plus the intercept) are retained, so at this λ the lasso shrinks no coefficient exactly to zero.
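  • The count can be verified directly (a one-line sketch; the intercept is excluded):
# number of non-zero coefficient estimates, excluding the intercept
sum(lasso.coeff[-1] != 0)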


  • (e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
#Exercise 9-e
pcr.fit = pcr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type="MSEP")

summary(pcr.fit)
## Data:    X dimension: 544 17 
##  Y dimension: 544 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3892     3799     2114     2122     1779     1698     1704
## adjCV         3892     3800     2110     2121     1761     1686     1698
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1705     1654     1607      1606      1612      1613      1615
## adjCV     1702     1643     1600      1600      1606      1608      1609
##        14 comps  15 comps  16 comps  17 comps
## CV         1615      1577      1197      1156
## adjCV      1610      1560      1186      1147
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.058    57.03    64.43    70.29    75.66    80.65    84.26    87.61
## Apps    5.575    71.65    71.65    81.07    82.59    82.61    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.59     92.84     94.93     96.74     97.82     98.72     99.38
## Apps    84.56     84.83     84.86     84.87     85.02     85.06     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.04     93.32
#Exercise 9-e results
pcr.pred = predict(pcr.fit, test, ncomp = 10)
pcr.error = mean((test$Apps - pcr.pred)^2)
paste("Test Error", pcr.error)
## [1] "Test Error 1842925.42996644"
  • M = 10 was selected: the cross-validated RMSEP flattens out around 9-10 components, and further meaningful improvement appears only at 16-17 components, which essentially reproduces the full least squares fit. The resulting test error is 1842925.
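  • The pls package can also choose M automatically; a sketch using its one-standard-error heuristic (the suggested value may differ from the visual choice of M = 10):
# automated component selection via the one-sigma heuristic
selectNcomp(pcr.fit, method = "onesigma", plot = FALSE)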


  • (f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
#Exercise 9-f
pls.fit = plsr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")

summary(pls.fit)
## Data:    X dimension: 544 17 
##  Y dimension: 544 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3892     1939     1730     1538     1422     1329     1187
## adjCV         3892     1934     1730     1531     1402     1297     1175
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1168     1152     1153      1150      1157      1153      1150
## adjCV     1159     1145     1144      1142      1147      1145      1142
##        14 comps  15 comps  16 comps  17 comps
## CV         1149      1149      1149      1149
## adjCV      1140      1140      1140      1140
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.20    62.49    64.91    67.30    72.46    76.95    80.88
## Apps    76.61    82.45    86.93    90.76    92.84    93.06    93.14    93.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.69     85.27      87.8     90.88     92.47     95.11     97.09
## Apps    93.26     93.28      93.3     93.31     93.32     93.32     93.32
##       16 comps  17 comps
## X        98.37    100.00
## Apps     93.32     93.32
#Exercise 9-f test error
pls.pred = predict(pls.fit, test, ncomp = 13)
pls.err = mean((pls.pred - test$Apps)^2)
paste("Test Error", pls.err)
## [1] "Test Error 1266750.68341741"
  • With M = 13, the PLS test error (1266751) is essentially identical to the least squares error and somewhat higher than the ridge regression error.
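  • The choice of M can be checked against the cross-validated error curve directly; a sketch using RMSEP() from the pls package (the first entry of the curve is the intercept-only model, hence the subtraction):
# M minimizing cross-validated RMSEP
cv.rmsep = RMSEP(pls.fit)$val["CV", 1, ]
which.min(cv.rmsep) - 1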


  • (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
    • lm.err = 1266407
    • ridge.err = 1168672.06
    • lasso.err = 1269433.82
    • pcr.error = 1842925.43
    • pls.err = 1266750.68
#Exercise 9-g chart
barplot(c(lm.err,
          ridge.err,
          lasso.err,
          pcr.error,
          pls.err), 
        col="steelblue", 
        xlab="Regression Methods", 
        ylab="Test Error", 
        main="Test Errors for All Methods", 
        names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

  • Comments: Ridge regression achieved the lowest test error, so of these methods it predicts the number of applications most accurately. As the barplot (and the ratio sketch below) shows, the OLS, lasso, and PLS errors are nearly identical to one another; only PCR with M = 10 is noticeably worse.
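  • A quick ratio sketch, reusing the error objects computed above, expresses each method's test error relative to the best:
#Exercise 9-g relative errors
errs = c(OLS = lm.err, Ridge = ridge.err, Lasso = lasso.err,
         PCR = pcr.error, PLS = pls.err)
round(errs / min(errs), 3)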


Exercise 11

We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

#Exercise 11-a
attach(Boston)
set.seed(10743959)

boston.index = sample(nrow(Boston), nrow(Boston)*0.70)
boston.train = Boston[boston.index,]
boston.test = Boston[-boston.index,]

bsm = regsubsets(crim~.,data=boston.train,nbest=1,nvmax=13) 
summary(bsm)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston.train, nbest = 1, 
##     nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   "*"  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" " " "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" " " "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
#Out of the 13 candidate models, the best is chosen by minimum validation error on the test set
boston.test.mat = model.matrix(crim~ ., data = boston.test)
val.errors = rep(NA, 13)
for (i in 1:13) {
    coefi = coef(bsm, id = i)
    pred = boston.test.mat[, names(coefi)] %*% coefi
    val.errors[i] = mean((pred - boston.test$crim)^2)  # compare to crim, the response being predicted
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")

which_min = which.min(val.errors)
which_min
## [1] 1
coef_res = coef(bsm,which.min(val.errors))
coef_res
## (Intercept)         rad 
##  -2.3696479   0.6341561
bsm.MSE = val.errors[which_min]
bsm.MSE
## [1] 500.8023
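  • As an aside, a reusable predict() method for regsubsets objects (the standard helper from the ISLR labs; a sketch equivalent to the loop above) makes this kind of validation less error-prone:
#Exercise 11-a predict helper (sketch)
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}
head(predict(bsm, boston.test, id = 5))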
#Exercise 11-a Ridge Regression

#creating design matrices for ridge and lasso; the intercept column added by
#model.matrix() receives a zero coefficient because glmnet fits its own intercept
bostontrain.mat = model.matrix(crim~.,data=boston.train)
bostontest.mat = model.matrix(crim~.,data=boston.test)

#Defining grid that covers the wide range of lambda 
grid = 10^seq(4,-2,length=100)

#Fitting the ridge regression model
ridge.model = glmnet(bostontrain.mat,boston.train$crim,alpha=0,lambda=grid)

#Finding lambda that gives minimum MSE by cross validation on training set
boston.cv.ridge = cv.glmnet(bostontrain.mat,boston.train$crim,alpha=0,lambda=grid)

#Visualizing MSE vs Log(lambda)
plot(boston.cv.ridge)

#finding the lambda for which MSE is minimum on training data
boston.bestlam.ridge = boston.cv.ridge$lambda.min
boston.bestlam.ridge
## [1] 0.3764936
#fitting the ridge model with the value of lambda obtained and finding the coefficients
ridge.model.1 = glmnet(bostontrain.mat,boston.train$crim,alpha=0,lambda=boston.bestlam.ridge)
coef(ridge.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 14.524886991
## (Intercept)  .          
## zn           0.038828432
## indus       -0.075163341
## chas        -0.855244880
## nox         -8.011304694
## rm           0.100978690
## age          0.007595292
## dis         -0.797614897
## rad          0.445909347
## tax          0.003644403
## ptratio     -0.211698432
## black       -0.009703337
## lstat        0.083984370
## medv        -0.155978332
#predicting on the test set using the cross-validated lambda for the ridge model
boston.pred.newridge = predict(ridge.model,s = boston.bestlam.ridge, newx=bostontest.mat)

#Mean Square Error calculation
ridge.MSE = mean((boston.test$crim-boston.pred.newridge)^2)
ridge.MSE
## [1] 31.61811
#Lasso Regression (alpha = 1 selects the lasso penalty)
lasso.model = glmnet(bostontrain.mat,boston.train$crim,alpha=1,lambda=grid)

#finding lambda by cross validation and then finding MSE on test set using that lambda
boston.cv.lasso = cv.glmnet(bostontrain.mat,boston.train$crim,alpha=1,lambda=grid)

#Visualizing MSE vs Log(lambda)
plot(boston.cv.lasso)

#finding the lambda for which cv error is minimum on training data
boston.bestlam.lasso = boston.cv.lasso$lambda.min
boston.bestlam.lasso
## [1] 0.1232847
#fitting the lasso model with the value of lambda obtained and finding the coefficients
lasso.model.1 = glmnet(bostontrain.mat,boston.train$crim,alpha=1,lambda=boston.bestlam.lasso)
coef(lasso.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  1.890430e+01
## (Intercept)  .           
## zn           4.459417e-02
## indus       -6.786118e-02
## chas        -8.875154e-01
## nox         -1.058333e+01
## rm           9.102107e-02
## age          8.577012e-03
## dis         -9.350947e-01
## rad          5.151007e-01
## tax          8.387289e-04
## ptratio     -2.846172e-01
## black       -9.354300e-03
## lstat        6.818129e-02
## medv        -1.857570e-01
#predicting on the test set using the cross-validated lambda for the lasso model
boston.pred.newlasso = predict(lasso.model,s=boston.bestlam.lasso,newx=bostontest.mat)

#Mean Square Error calculation
lasso.MSE = mean((boston.test$crim-boston.pred.newlasso)^2)
lasso.MSE
## [1] 31.45548


(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

  • Validation set MSE is used to evaluate model performance. The validation MSEs obtained for the approaches above are listed here and collected in the sketch after this list. Based on these results, the ridge and lasso fits perform almost identically (the lasso edging out ridge here), and both dramatically outperform the best subset model.
    • Best Subset: 500.8023
    • Ridge Regression: 31.61811
    • Lasso Regression: 31.45548
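  • Collecting the validation-set MSEs computed above side by side:
#Exercise 11-b comparison
c(BestSubset = bsm.MSE, Ridge = ridge.MSE, Lasso = lasso.MSE)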

(c) Does your chosen model involve all of the features in the data set? Why or why not?

  • It does not involve all of the features in the data set. The lasso shrinks the coefficients of weakly informative predictors toward zero and can set them exactly to zero, so predictors that add little once the others are included are dropped from the final model.