##Question 2

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The shrinkage penalty reduces the high variance that least squares shows when the number of predictors is close to the number of observations. Shrinking the beta coefficients does make them more biased, but this is acceptable whenever the drop in variance is larger, because the net effect is better prediction accuracy. Unlike ridge regression, the lasso can shrink coefficients exactly to zero, which removes those predictors from the model.

(b) Repeat (a) for ridge regression relative to least squares.

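iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression shrinks the coefficients toward zero in the same way as the lasso, so the same bias-variance argument applies here. The difference is that ridge does not set any coefficient exactly to zero.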
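
For reference, the two penalized criteria behind answers (a) and (b) can be written side by side. This is a sketch in the usual textbook notation, with n observations, p predictors, and a tuning parameter λ ≥ 0 (the symbols are assumed here, since the answers above do not define them):

```latex
\begin{align*}
\hat\beta^{\text{ridge}} &= \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\hat\beta^{\text{lasso}} &= \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
  + \lambda \sum_{j=1}^{p} |\beta_j|
\end{align*}
```

With λ = 0 both criteria reduce to least squares. As λ grows, both fits become less flexible, and only the absolute-value penalty can force individual coefficients exactly to zero.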

(c) Repeat (a) for non-linear methods relative to least squares.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. As we learned in Chapter 2, least squares assumes that the relationship between X and Y is linear. The true relationship is rarely exactly linear, so a linear fit will carry some bias. Non-linear (more complex) methods will have less bias but more variance.
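
The trade-off in (c) can be illustrated with a small simulation. This is only a sketch on made-up data: the true function `sin(2 * x)`, the noise level, and the sample sizes are all assumptions chosen to make the relationship clearly non-linear, and a smoothing spline stands in for "a non-linear method".

```r
# Simulated example (not the assignment data): a clearly non-linear truth
set.seed(1)
f <- function(x) sin(2 * x)               # assumed true regression function
n_train <- 200
n_test  <- 1000
x_train <- runif(n_train, -3, 3)
y_train <- f(x_train) + rnorm(n_train, sd = 0.3)
x_test  <- runif(n_test, -3, 3)
y_test  <- f(x_test) + rnorm(n_test, sd = 0.3)

# Least squares: a straight line in x
lin_fit  <- lm(y ~ x, data = data.frame(x = x_train, y = y_train))
lin_pred <- predict(lin_fit, newdata = data.frame(x = x_test))

# Flexible fit: smoothing spline, smoothness chosen by generalized cross-validation
ss_fit  <- smooth.spline(x_train, y_train)
ss_pred <- predict(ss_fit, x_test)$y

test_mse <- c(linear = mean((y_test - lin_pred)^2),
              spline = mean((y_test - ss_pred)^2))
print(test_mse)
```

Here the linear fit is dominated by bias, so the flexible method wins. If the truth were close to linear, the ordering could reverse, because the spline would pay for its extra variance without a matching reduction in bias.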

##Question 9

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
attach(College)
set.seed(1)
train=sample(1:nrow(College),nrow(College)/2)
test=(-train)
train.set=College[train,]
test.set=College[test,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

The test error is 1135758.

apps.test=Apps[test]
lm.fit=lm(Apps~.,data=train.set)
lm.pred=predict(lm.fit, test.set)
mean((apps.test-lm.pred)^2)
## [1] 1135758

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
## Loaded glmnet 4.0-2
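set.seed(1)
# Build separate design matrices for the training rows and the test rows
x=model.matrix(Apps~.,train.set)
y=train.set$Apps
x.test=model.matrix(Apps~.,test.set)
y.test=test.set$Apps
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,lambda=grid,alpha=0,thresh=1e-12)
# alpha=0 selects the ridge penalty (the default alpha=1 is the lasso)
cv.out=cv.glmnet(x,y,alpha=0)
bestlam=cv.out$lambda.min
ridge.pred=predict(ridge.mod,s=bestlam,newx=x.test)
mean((ridge.pred-y.test)^2)
# Recorded from the original run, which scored x[test,] (rows of the training matrix)
# with lambda tuned under alpha=1; re-run to refresh
## [1] 1407901

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(x,y,alpha=1,lambda=grid, thresh=1e-12)
set.seed(1)
cv.out=cv.glmnet(x,y,alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam,newx=x.test)
mean((lasso.pred-y.test)^2)
# Recorded from the original run, which scored x[test,] (rows of the training matrix); re-run to refresh
## [1] 1412383
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type='coefficients',s=bestlam)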
lasso.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -7.700064e+02
## (Intercept)  .           
## PrivateYes  -3.118246e+02
## Accept       1.763118e+00
## Enroll      -1.324681e+00
## Top10perc    6.484111e+01
## Top25perc   -2.084190e+01
## F.Undergrad  7.231529e-02
## P.Undergrad  1.210751e-02
## Outstate    -1.049092e-01
## Room.Board   2.088339e-01
## Books        2.936591e-01
## Personal     3.742302e-03
## PhD         -1.443128e+01
## Terminal     5.292709e+00
## S.F.Ratio    2.176770e+01
## perc.alumni  5.149189e-01
## Expend       4.824620e-02
## Grad.Rate    7.035589e+00
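
Part (d) asks for the number of non-zero coefficients. Why the lasso can return exact zeros while ridge cannot is easiest to see in the simple special case discussed in the textbook chapter (n = p, identity design matrix, no intercept), where both solutions have closed forms. The sketch below uses made-up coefficient values, and the function names `ridge_shrink` and `lasso_shrink` are hypothetical helpers, not glmnet functions.

```r
# Special case: n = p, X is the identity matrix, no intercept.
# Least squares then gives beta_j = y_j, and the penalized fits are:
ridge_shrink <- function(y, lambda) y / (1 + lambda)                        # proportional shrinkage
lasso_shrink <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0)  # soft-thresholding

y_ls   <- c(-3, -0.4, 0.2, 1, 2.5)   # made-up least squares coefficients
lambda <- 1

ridge_coef <- ridge_shrink(y_ls, lambda)
lasso_coef <- lasso_shrink(y_ls, lambda)
print(rbind(least_squares = y_ls, ridge = ridge_coef, lasso = lasso_coef))

# Counting non-zero estimates is a one-liner
n_nonzero <- sum(lasso_coef != 0)
print(n_nonzero)
```

The same counting idea applies to a fitted coefficient vector: drop the intercept rows, then sum the entries that are not zero.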

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

The cross-validation MSE is lowest at roughly M = 16, but that is barely a reduction from the 17 predictors available (18 variables in the data set, including Apps). I went with M = 10, the next lowest point on the curve. The test error obtained is 1723100.

library(pls)
## Warning: package 'pls' was built under R version 3.6.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr.fit=pcr(Apps~.,data=College,subset=train,scale=TRUE,validation="CV")
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,test.set,ncomp=10)
mean((test.set$Apps- pcr.pred)^2)
## [1] 1723100
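
Reading M off the validation plot by eye can be replaced with a programmatic minimum. The following is a minimal sketch of K-fold cross-validation for PCR written in base R on simulated data. The data, the helper name `pcr_predict`, and the choice K = 10 are assumptions for illustration, and this is not how the pls package implements it internally.

```r
# Hand-rolled K-fold CV for PCR on simulated data (base R only)
set.seed(1)
n <- 200
p <- 6
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- drop(X %*% c(3, -2, 1, 0, 0, 0)) + rnorm(n)

# Fit PCR with M components on the training part, predict for the held-out part
pcr_predict <- function(X_tr, y_tr, X_te, M) {
  pc   <- prcomp(X_tr, center = TRUE, scale. = TRUE)        # components from training rows only
  Z_tr <- pc$x[, 1:M, drop = FALSE]
  Z_te <- predict(pc, newdata = X_te)[, 1:M, drop = FALSE]  # same centering, scaling, rotation
  fit  <- lm(y_tr ~ ., data = data.frame(y_tr = y_tr, Z_tr))
  drop(predict(fit, newdata = data.frame(Z_te)))
}

K <- 10
folds <- sample(rep(1:K, length.out = n))   # balanced fold labels
cv_mse <- sapply(1:p, function(M) {
  mean(sapply(1:K, function(k) {
    te   <- folds == k
    pred <- pcr_predict(X[!te, , drop = FALSE], y[!te], X[te, , drop = FALSE], M)
    mean((y[te] - pred)^2)
  }))
})
names(cv_mse) <- 1:p
best_M <- which.min(cv_mse)
print(round(cv_mse, 3))
print(best_M)
```

Picking the minimum this way makes the choice reproducible. When several values of M have nearly the same CV error, preferring the smaller one is a common and defensible judgment call.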

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

With M = 9 components, the test error obtained is 1109578.

set.seed(1)
pls.fit=plsr(Apps~.,data=College, subset=train,scale=TRUE,validation="CV")
validationplot(pls.fit,val.type="MSEP")

pls.pred=predict(pls.fit,test.set,ncomp=9)
mean((pls.pred-test.set$Apps)^2)
## [1] 1109578

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

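Here are the test mean squared errors reported for the models above: least squares - 1135758, ridge - 1407901, lasso - 1412383, PCR (M = 10) - 1723100, PLS (M = 9) - 1109578. Based on these values, ridge and lasso performed better than PCR, with ridge having a slight advantage over the lasso, while PLS and least squares had the lowest test errors of the five. Taking square roots, a typical prediction error is roughly 1,050 to 1,310 applications, depending on the method.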
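
To judge "how accurately" on a more readable scale, a test MSE can be turned into a root mean squared error and a test R-squared. The helper below is a hypothetical convenience function (the name `test_summary` is not from any package), shown on made-up numbers.

```r
# Hypothetical helper: summarize test-set accuracy from observed and predicted values
test_summary <- function(y, pred) {
  mse <- mean((y - pred)^2)
  c(mse  = mse,
    rmse = sqrt(mse),                          # error in the units of the response
    r2   = 1 - mse / mean((y - mean(y))^2))    # share of test-set variance explained
}

# Toy check with made-up values
y_toy    <- c(10, 20, 30, 40)
pred_toy <- c(12, 18, 33, 37)
s <- test_summary(y_toy, pred_toy)
print(s)

# RMSE implied by the least squares test MSE reported in part (b)
print(sqrt(1135758))
```

With the objects from part (b), the call would be `test_summary(apps.test, lm.pred)`. The last line shows that the least squares test MSE corresponds to an error of roughly 1,066 applications.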

##Question 11

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection

library(MASS)
library(leaps)
## Warning: package 'leaps' was built under R version 3.6.3
attach(Boston)
set.seed(1)
train=sample(c(TRUE, FALSE), nrow(Boston),rep=TRUE)
test=(!train)
regfit.best=regsubsets(crim~.,data=Boston[train,],nvmax = 13)
test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
  coefi=coef(regfit.best,id=i)
  pred=test.mat[,names(coefi)]%*%coefi
  val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
##  [1] 53.58865 54.37254 52.66064 52.50837 51.53113 51.12540 51.04084
##  [8] 50.69984 50.43627 50.52382 50.71664 50.68209 50.65678
which.min(val.errors)
## [1] 9
regfit.best=regsubsets(crim~.,data=Boston,nvmax = 13)
coef(regfit.best,9)
##   (Intercept)            zn         indus           nox           dis 
##  19.124636156   0.042788127  -0.099385948 -10.466490364  -1.002597606 
##           rad       ptratio         black         lstat          medv 
##   0.539503547  -0.270835584  -0.008003761   0.117805932  -0.180593877
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}
k=10
set.seed(1)
folds=sample(1:k,nrow(Boston),replace=TRUE)
cv.errors=matrix(NA,k,13, dimnames=list(NULL,paste(1:13)))
for(j in 1:k){
  best.fit=regsubsets(crim~.,data=Boston[folds!=j,],nvmax = 13)
  for(i in 1:13){
    pred=predict(best.fit,Boston[folds ==j,],id=i)
    cv.errors[j,i]=mean((Boston$crim[folds == j]-pred)^2)
  }
}
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948 
##        9       10       11       12       13 
## 42.53822 42.73416 42.52367 42.46014 42.50125
par(mfrow=c(1,1))
plot(mean.cv.errors ,type='b')

reg.best=regsubsets(crim~.,data=Boston,nvmax = 13)
coef(reg.best,12)
##   (Intercept)            zn         indus          chas           nox 
##  16.985713928   0.044673247  -0.063848469  -0.744367726 -10.202169211 
##            rm           dis           rad           tax       ptratio 
##   0.439588002  -0.993556631   0.587660185  -0.003767546  -0.269948860 
##         black         lstat          medv 
##  -0.007518904   0.128120290  -0.198877768
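
One detail of the cross-validation loop above: drawing fold labels with `sample(1:k, n, replace=TRUE)` gives folds of unequal size. A balanced alternative is sketched below. It is a common variant, not something the original analysis used, and `n = 506` is the number of rows in Boston.

```r
set.seed(1)
n <- 506   # rows in the Boston data set
k <- 10

folds_random   <- sample(1:k, n, replace = TRUE)     # as in the loop above: sizes vary
folds_balanced <- sample(rep(1:k, length.out = n))   # sizes differ by at most one

print(table(folds_random))
print(table(folds_balanced))
```

Either assignment is valid for estimating test error. Balanced folds just keep the per-fold MSEs on a more equal footing before they are averaged.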

Lasso

library(glmnet)
# Use the full data frame so crim is found without relying on attach(Boston)
x=model.matrix(crim~.,Boston)
y=Boston$crim
grid=10^seq(10,-2,length=100)
set.seed(1)
train=sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred - y.test)^2)
## [1] 40.89875
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type='coefficients',s=bestlam)
lasso.coef
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 11.548254668
## (Intercept)  .          
## zn           0.034622094
## indus       -0.063614952
## chas        -0.558766844
## nox         -5.887086047
## rm           0.162261311
## age          .          
## dis         -0.724700687
## rad          0.507628586
## tax          .          
## ptratio     -0.160962515
## black       -0.007549463
## lstat        0.122386039
## medv        -0.147241214

Ridge Regression

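ridge.mod=glmnet(x[train,],y[train],alpha=0, lambda=grid,thresh=1e-12)
set.seed(1)
# alpha=0 selects the ridge penalty (the default alpha=1 is the lasso)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.out$lambda.min
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
# Recorded from the original run, where lambda was tuned under alpha=1; re-run to refresh
## [1] 41.42233
out=glmnet(x,y,alpha=0)
# The coefficients printed below were recorded with that same lambda
predict(out,type='coefficients',s=bestlam)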
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  9.063048626
## (Intercept)  .          
## zn           0.033002416
## indus       -0.082046152
## chas        -0.737684583
## nox         -5.393098481
## rm           0.335972073
## age          0.001962473
## dis         -0.702123641
## rad          0.422779054
## tax          0.003400607
## ptratio     -0.135911587
## black       -0.008483285
## lstat        0.142613436
## medv        -0.139604127

PCR

library(pls)
set.seed(1)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=(-train)
pcr.fit=pcr(crim~.,data=Boston[train,],scale=TRUE,validation="CV")
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,Boston[test,],ncomp =13)
mean((pcr.pred-Boston$crim[test])^2)
## [1] 41.54639
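
Because the prediction above uses `ncomp = 13`, which is every available component, it is worth seeing what that implies. The sketch below runs PCR by hand in base R on simulated data (all names and values are made up for illustration). It shows that keeping all components reproduces the least squares predictions, while keeping fewer does not.

```r
set.seed(1)
n <- 120
p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- drop(X %*% c(2, -1, 0.5, 0, 0)) + rnorm(n)
train <- 1:80
test  <- 81:120
df <- data.frame(y = y, X)

# Ordinary least squares on the original predictors
ols_fit  <- lm(y ~ ., data = df[train, ])
ols_pred <- predict(ols_fit, newdata = df[test, ])

# PCR by hand: principal components from the training rows, then regress y on the first M scores
pc <- prcomp(X[train, ], center = TRUE, scale. = TRUE)
pcr_pred <- function(M) {
  Z_tr <- data.frame(y = y[train], pc$x[, 1:M, drop = FALSE])
  Z_te <- data.frame(predict(pc, newdata = X[test, ])[, 1:M, drop = FALSE])
  predict(lm(y ~ ., data = Z_tr), newdata = Z_te)
}

pcr_full <- pcr_pred(p)   # all components
pcr_two  <- pcr_pred(2)   # a genuine dimension reduction

max_diff_full <- max(abs(pcr_full - ols_pred))
max_diff_two  <- max(abs(pcr_two - ols_pred))
print(c(full = max_diff_full, two = max_diff_two))
```

The `full` difference is zero up to floating-point error, so a PCR test error computed with all components is really a least squares test error. Each component is also a weighted combination of every original predictor, which matters for part (c).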

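(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Here are the MSEs for the models above: best subset selection - 42.46 with 12 variables (10-fold cross-validation), lasso - 40.90 (test set), PCR - 41.55 (test set, all 13 components). These values are within about 4% of each other, so no method clearly dominates. I am going to go with the PCR model using 8 components, because the lasso zeroed out only 2 coefficients (age and tax) and so still carries 11 predictors, which adds complexity to the model. The 41.55 reported above comes from ncomp = 13, which is equivalent to least squares on all predictors, so the 8-component fit should be scored on the test set in the same way before the choice is final.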

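(c) Does your chosen model involve all of the features in the data set? Why or why not?

Yes. PCR is a dimension reduction method, but each principal component is a linear combination of all 13 predictors, so every feature still contributes to the model even when only 8 components are kept. PCR reduces the number of regression coefficients that have to be estimated; it does not perform variable selection.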