Problem 2 -

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. - The lasso shrinks coefficient estimates, zeroing some out entirely, which reduces variance at the cost of increased bias.

(b) Repeat (a) for ridge regression relative to least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. - This works like the lasso: shrinking the coefficient estimates decreases variance at the cost of some added bias, although ridge shrinks coefficients toward zero rather than dropping predictors entirely.

(c) Repeat (a) for non-linear methods relative to least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. - Non-linear models are more flexible, and the added curvature can produce a better-performing model when the underlying relationship is non-linear, but at the cost of increased variance.
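
A quick simulation sketches the tradeoff in (a): when the number of predictors is large relative to the sample size, the lasso's shrinkage cuts prediction variance enough to offset its added bias. (An illustration only; n, p, and the coefficients below are my own assumptions, not part of the exercise.)

library(glmnet)
set.seed(1)
n=50; p=30
beta=c(rep(1,5),rep(0,p-5))  #only 5 of the 30 predictors matter
x0=matrix(rnorm(p),1,p)      #one fixed test point
preds=replicate(200,{
  x=matrix(rnorm(n*p),n,p)
  y=x%*%beta+rnorm(n)
  ls.fit=lm(y~x)
  la.fit=cv.glmnet(x,y,alpha=1)
  c(ls=unname(coef(ls.fit)[1]+sum(coef(ls.fit)[-1]*x0)),
    la=unname(drop(predict(la.fit,x0,s="lambda.min"))))
})
apply(preds,1,var)  #lasso predictions vary much less across training sets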

Problem 9 -

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
set.seed(7)
training = sample(nrow(College), 0.5*nrow(College))
testing = (-training)
head(training)
## [1] 298 467 415 476 615 706
College.train=College[training,]
College.test=College[testing,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.
The test error for the linear model is 1589512.

lm.fit=lm(Apps~.,data=College.train)
lm.pred=predict(lm.fit,College.test)
mean((lm.pred-College.test$Apps)^2)
## [1] 1589512

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
The test error for the ridge regression model (with λ = 24.77076 chosen by cross-validation) is 1687583, which is a little higher than that of the linear regression model.

library(glmnet)
xtrain=model.matrix(Apps~.,data=College.train[,-1])  #note: [,-1] drops the Private column, not the intercept (see aside below)
ytrain=College.train$Apps
xtest=model.matrix(Apps~.,data=College.test[,-1])
ytest=College.test$Apps
set.seed(1)
grid=10^seq(10,-2,length=100)
ridge.mod=cv.glmnet(xtrain,ytrain,alpha=0,lambda=grid,thresh=1e-12)
plot(ridge.mod)

bestlam.ridge=ridge.mod$lambda.min
bestlam.ridge
## [1] 24.77076
ridge.pred=predict(ridge.mod,s=bestlam.ridge, newx=xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1687583
predict(ridge.mod,s=bestlam.ridge,type="coefficients")[1:18,]
##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.477078e+03  0.000000e+00  1.308713e+00 -3.621240e-01  2.661546e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -2.477345e+00  7.397240e-02  5.409678e-02 -1.044175e-01  1.834637e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
## -9.843825e-03 -1.404217e-02 -8.349868e+00  2.131644e+00  3.788106e+01 
##   perc.alumni        Expend     Grad.Rate 
## -8.870888e+00  1.221154e-01  5.703943e+00
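
As an aside, the duplicated (Intercept) row above appears because model.matrix() keeps its intercept column when the matrix is passed to glmnet, and College.train[,-1] drops the Private column rather than that intercept. A sketch of the more conventional construction (named xtrain2 here so it does not clobber the matrix actually used for these results):

xtrain2=model.matrix(Apps~.,data=College.train)[,-1]  #drop the intercept column, keep the PrivateYes dummy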

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
The test error for the lasso model (with λ = 8.111308 chosen by cross-validation) is 1658070, a little lower than the ridge regression model but not as low as the linear regression model. Twelve predictors remain in this model with non-zero coefficients.

set.seed(7)
lasso.fit=cv.glmnet(xtrain, ytrain, lambda=grid,alpha=1,thresh=1e-12)
bestlam.lass=lasso.fit$lambda.min
bestlam.lass
## [1] 8.111308
lasso.pred=predict(lasso.fit,s=bestlam.lass,newx=xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err #test error
## [1] 1658070
predict(lasso.fit,s=bestlam.lass,type="coefficients")[1:18,]
##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.428626e+03  0.000000e+00  1.355438e+00 -3.139357e-01  2.382525e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  0.000000e+00  3.990514e-02  5.338261e-02 -1.030625e-01  1.693439e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  0.000000e+00  0.000000e+00 -5.854816e+00  0.000000e+00  3.642366e+01 
##   perc.alumni        Expend     Grad.Rate 
## -7.345560e+00  1.200491e-01  4.175079e+00
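
The count of twelve non-zero coefficients quoted above can be checked directly rather than by eye (a quick sketch; the first two entries are the intercept rows, so they are excluded):

lasso.coef9=predict(lasso.fit,s=bestlam.lass,type="coefficients")[1:18,]
sum(lasso.coef9[-(1:2)]!=0)  #number of predictors with non-zero coefficients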

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
Cross-validation actually favors the full 17-component fit, which is just least squares, so the largest model that performs any dimension reduction, M = 16 components, was used, and even that is barely a reduction. The 16-component model has a test error of 1836330, which is higher than the other models.

library(pls)
set.seed(7)
pcr.fit=pcr(Apps~., data=College.train, scale=TRUE,validation="CV")
summary(pcr.fit)  
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3599     3606     1697     1715     1341     1315     1285
## adjCV         3599     3606     1694     1724     1335     1305     1283
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1266     1190     1133      1138      1127      1135      1129
## adjCV     1260     1171     1131      1138      1125      1132      1126
##        14 comps  15 comps  16 comps  17 comps
## CV         1132      1121     963.8     910.6
## adjCV      1129      1120     960.3     906.3
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X     32.1592    56.77    63.87    70.20    75.58    80.62    84.43
## Apps   0.9001    78.40    78.47    87.09    87.87    88.21    89.04
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       88.09    91.13     93.33     95.48     97.17     98.22     99.05
## Apps    90.62    90.69     90.71     90.87     90.87     91.01     91.02
##       15 comps  16 comps  17 comps
## X        99.49     99.83    100.00
## Apps     91.14     93.84     94.63
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,College.test,ncomp=16)
mean((pcr.pred-ytest)^2)  
## [1] 1836330
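
M = 16 was read off the summary by hand here; the pls package can also automate the choice. A sketch using selectNcomp() (its "onesigma" rule picks the smallest M whose CV error is within one standard error of the minimum; output not shown):

selectNcomp(pcr.fit,method="onesigma",plot=TRUE)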

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The PLS model with the lowest CV error was the one with M = 11 components. This model has a test error of 1597785, a little higher than the linear model but still very close.

set.seed(7)
pls.fit=plsr(Apps~., data=College.train, scale=TRUE,validation="CV")
summary(pls.fit)  
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3599     1500     1183     1103     1062    997.9    931.9
## adjCV         3599     1496     1183     1101     1058    983.4    926.1
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       923.1    915.3    917.3     913.6     907.9     911.8     910.7
## adjCV    918.8    911.5    912.8     909.5     903.9     907.3     906.4
##        14 comps  15 comps  16 comps  17 comps
## CV        910.9     910.6     910.6     910.6
## adjCV     906.6     906.3     906.3     906.3
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        24.6    37.65    62.56    66.45    68.51    72.82    76.91
## Apps     83.5    89.70    91.20    92.18    93.76    94.25    94.42
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       81.17    83.20     86.84     89.17     91.45     94.43     95.82
## Apps    94.49    94.55     94.57     94.61     94.63     94.63     94.63
##       15 comps  16 comps  17 comps
## X        97.14     98.26    100.00
## Apps     94.63     94.63     94.63
validationplot(pls.fit,val.type="MSEP")

pls.pred=predict(pls.fit,College.test,ncomp=11)
mean((pls.pred-ytest)^2)  
## [1] 1597785

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Judging by test MSE alone, the best choice would be either the linear or the PLS model, but overall the five approaches performed fairly similarly, with the exception of PCR, which had a noticeably higher MSE. Since we probably do not want to pick a model using that number alone, we can also calculate the test R² for each model. All of the R² values are very close, with the linear regression model the highest. Since that model retains all of the predictors, if you were looking to reduce the size of your model, you could choose the PLS model, which keeps only 11 components. In the end, all of these models predict the number of applications quite accurately, with test R² around 0.9.

test.avg = mean(College.test$Apps)  # baseline: predict the test-set mean for every college
lm.r2 = 1 - mean((lm.pred - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2 = 1 - mean((ridge.pred - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2 = 1 - mean((lasso.pred - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2 = 1 - mean((pcr.pred - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2 = 1 - mean((pls.pred - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lm.r2
## [1] 0.9066308
ridge.r2
## [1] 0.90087
lasso.r2
## [1] 0.9026036
pcr.r2
## [1] 0.8921325
pls.r2
## [1] 0.9061448
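
For a side-by-side view, the five test MSEs and R² values computed above can be collected into a single table (a convenience sketch reusing objects already defined):

test.mse=c(lm=mean((lm.pred-ytest)^2),ridge=ridge.err,lasso=lasso.err,
           pcr=mean((pcr.pred-ytest)^2),pls=mean((pls.pred-ytest)^2))
test.r2=c(lm=lm.r2,ridge=ridge.r2,lasso=lasso.r2,pcr=pcr.r2,pls=pls.r2)
round(cbind(MSE=test.mse,R2=test.r2),4)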

Problem 11 -

We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
library(leaps)
library(glmnet)
library(pls)
names(Boston)  #14 variables
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

#split into train and test

set.seed(7)
train = sample(1:nrow(Boston), 0.5*nrow(Boston))
test = -train
xmat.train=model.matrix(crim~.,Boston[train,])
xmat.test=model.matrix(crim~.,Boston[test,])

#best subset selection - The best subset model is the one with 9 predictors, based on the lowest test error of 40.84844. The coefficients for this model are shown below. Even though this model has the lowest test MSE, it is worth noting that all of the MSE values are very close. This model also has the highest adjusted R² (about 0.46).

regfit.full=regsubsets(crim~., data=Boston[train,], nvmax = 13)
val.errors=rep(NA, 13)
for(i in 1:13){
  coefi=coef(regfit.full, id=i)
  pred=xmat.test[,names(coefi)]%*%coefi
  val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
##  [1] 41.82023 42.47616 42.98094 42.23723 42.46149 41.53291 41.54721
##  [8] 41.02016 40.84844 41.51028 41.18489 41.82330 41.89341
which.min(val.errors)
## [1] 9
coef(regfit.full,9)
##   (Intercept)            zn         indus           nox           dis 
##  19.667885814   0.045968110  -0.173577595 -13.644603194  -1.138607296 
##           rad       ptratio         black         lstat          medv 
##   0.540026488  -0.271362028  -0.006674327   0.261222593  -0.168020438
reg.summary=summary(regfit.full)
reg.summary$adjr2
##  [1] 0.3888275 0.4412589 0.4448041 0.4481259 0.4497513 0.4555164 0.4577326
##  [8] 0.4599693 0.4600838 0.4593302 0.4584755 0.4576504 0.4555833
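
#The loop above multiplies out the predictions by hand because regsubsets() ships without a predict() method. The usual workaround is a small helper (the standard ISLR lab helper, reproduced from memory) that does the same matrix multiplication for any model size:

predict.regsubsets=function(object,newdata,id,...){
  form=as.formula(object$call[[2]])  #recover the formula from the regsubsets() call
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  mat[,names(coefi)]%*%coefi
}
pred9=predict(regfit.full,Boston[test,],id=9)  #e.g., predictions for the 9-variable model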

#ridge regression - The test MSE of the ridge regression is 41.25, which is similar to that of the best subset model.

set.seed(7)
grid=10^seq(10,-2,length=100)
ridge.fit=cv.glmnet(xmat.train, Boston[train,]$crim, lambda=grid,alpha=0)
bestlam=ridge.fit$lambda.min
bestlam #best lambda
## [1] 0.2848036
ridge.pred=predict(ridge.fit,s=bestlam,newx=xmat.test)
ridge.err=mean((ridge.pred-Boston[test,]$crim)^2)
ridge.err #test error
## [1] 41.24571
out=glmnet(xmat.train,Boston[train,]$crim,alpha=0)
predict(out,type='coefficients',s=bestlam)[1:14,]
##  (Intercept)  (Intercept)           zn        indus         chas 
## 14.035425382  0.000000000  0.036629072 -0.141439966 -1.321808043 
##          nox           rm          age          dis          rad 
## -6.528149383 -0.463088367 -0.006378197 -0.848287243  0.415126228 
##          tax      ptratio        black        lstat 
##  0.003537194 -0.133040734 -0.007407636  0.246871085

#Lasso - The test MSE of the lasso is 41.31 when lambda is equal to 0.031, which is very similar to that of the other models.

lasso.fit=cv.glmnet(xmat.train, Boston[train,]$crim, lambda=grid,alpha=1)
bestlam.lass=lasso.fit$lambda.min
bestlam.lass
## [1] 0.03053856
lasso.pred=predict(lasso.fit,s=bestlam.lass,newx=xmat.test)
lasso.err=mean((lasso.pred-Boston[test,]$crim)^2)
lasso.err #test error
## [1] 41.31128
out=glmnet(xmat.train,Boston[train,]$crim,alpha=1,lambda=grid)
lasso.coef=predict(out,type='coefficients',s=bestlam)[1:14,]  #note: s=bestlam is the ridge lambda; s=bestlam.lass would match the lasso CV fit above
lasso.coef
##  (Intercept)  (Intercept)           zn        indus         chas 
## -1.070059816  0.000000000  0.011447266  0.000000000 -0.454970892 
##          nox           rm          age          dis          rad 
##  0.000000000  0.000000000  0.000000000 -0.073400532  0.434968642 
##          tax      ptratio        black        lstat 
##  0.000000000  0.000000000 -0.005045505  0.265759151

#PCR - The lowest CV error occurs when all 13 components are kept, i.e. no dimension reduction, and that full model has a test MSE of 41.89341. However, the plot below shows that after about 3 components there is not much further decrease in the error.

set.seed(7)
pcr.fit=pcr(crim~., data=Boston[train,], scale=TRUE,validation="CV")
summary(pcr.fit)  #lowest CV value at 13 components
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.924    7.321    7.315    7.016    7.008    7.025    7.026
## adjCV        8.924    7.316    7.309    7.004    7.003    7.016    7.014
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.022    7.010    6.969     6.865     6.881     6.817     6.749
## adjCV    7.009    6.976    6.956     6.847     6.863     6.799     6.730
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       49.81    61.39    70.07    77.12    83.77    88.50    91.71
## crim    33.90    34.42    40.37    40.37    40.67    41.66    42.19
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       93.76    95.78     97.39     98.72     99.65    100.00
## crim    43.47    43.84     45.67     45.99     47.08     48.37
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,Boston[test,],ncomp=13)
mean((pcr.pred-Boston$crim[test])^2)  
## [1] 41.89341
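
#Since the CV curve flattens after about 3 components, the 3-component fit can be scored the same way to see how much the extra components buy (a quick check; output not shown).

pcr.pred3=predict(pcr.fit,Boston[test,],ncomp=3)
mean((pcr.pred3-Boston$crim[test])^2)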

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
I would select a model using the best subset selection method. It yielded the lowest test MSE, with the 9 predictors listed below.

coef(regfit.full,9)
##   (Intercept)            zn         indus           nox           dis 
##  19.667885814   0.045968110  -0.173577595 -13.644603194  -1.138607296 
##           rad       ptratio         black         lstat          medv 
##   0.540026488  -0.271362028  -0.006674327   0.261222593  -0.168020438

(c) Does your chosen model involve all of the features in the data set? Why or why not?
No. As you can see above, the model I chose has only 9 predictors. It yielded the lowest validation MSE as well as the highest adjusted R² among the best subset models.