Notebook prepared by Everton Lima

Introduction to Statistical Learning Solutions (ISLR)

Ch 6 Exercises

Table of Contents

Conceptual Exercises

Applied Exercises

Conceptual Exercises

1

1a

The model likely to have the smallest training RSS is the one produced by best subset selection. The other approaches are greedy and do not consider all possible models, whereas best subset selection evaluates every model considered by forward and backward stepwise selection (and more). For each model size it is therefore guaranteed to select the model with the lowest training RSS.

1b

It is unclear which selection algorithm will produce the model with the lowest test RSS. Best subset selection is guaranteed to find the model with the lowest training RSS, since all subsets are considered. However, it may also overfit the training data, allowing one of the other approaches to achieve a lower test RSS. Thus, with the information given, it is unclear which model will perform best with respect to the test RSS.
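
As a small illustrative sketch (my own addition, using simulated data and variable names that are not part of the exercise), one can check that, at every model size, the training RSS of best subset selection is never larger than that of the greedy searches:

library(leaps)
set.seed(1)
X=matrix(rnorm(100*8),ncol=8)
y.sim=X%*%rnorm(8)+rnorm(100)
dat=data.frame(X,y=y.sim)

rss.best    =summary(regsubsets(y~.,data=dat,nvmax=8,method='exhaustive'))$rss
rss.forward =summary(regsubsets(y~.,data=dat,nvmax=8,method='forward'))$rss
rss.backward=summary(regsubsets(y~.,data=dat,nvmax=8,method='backward'))$rss

# Row k compares the k-variable models; the first column is always the smallest.
cbind(rss.best,rss.forward,rss.backward)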

1c

  1. True. Forward stepwise selection never drops previously chosen predictors; the (k+1)-variable model is obtained by adding one more predictor to the k-variable model.
  2. True. Backward stepwise selection obtains the k-variable model by dropping exactly one predictor from the (k+1)-variable model, choosing the drop that gives the lowest training RSS.
  3. False. While it may occur, the statement is not always true: forward and backward stepwise selection follow different search paths. For example, forward selection picks the single predictor with the lowest RSS for the 1-variable model, yet a combination of other predictors may produce better larger models, so the two paths need not agree.
  4. False. The argument is the same as above; the two algorithms need not consider the same subsets of predictors.
  5. False. Best subset selection may drop previously chosen predictors as the model size grows, since all possible subsets are considered at each size.

2

2a

Alternative iii is the correct choice. The lasso limits the number of predictors, so it reduces variance at the cost of an increase in bias. This can be interpreted as follows: removing a predictor from the model amounts to asserting that the removed feature has no strong relationship with the target value. That assumption may introduce bias, but it decreases variance because fewer coefficients need to be estimated from the data.
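
A quick sketch (my own addition, with simulated data) makes this concrete: as the lasso penalty grows, coefficients are driven exactly to zero, which is where the variance reduction comes from.

library(glmnet)
set.seed(1)
X=matrix(rnorm(100*10),ncol=10)
y.sim=X[,1]+2*X[,2]+rnorm(100)   # only the first two predictors matter
lasso.path=glmnet(X,y.sim,alpha=1)
plot(lasso.path,xvar='lambda',label=TRUE)   # coefficients hit zero one by one as lambda grows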

2b

  1. Again alternative iii. Ridge regression will produce more biased models, as it shrinks the coefficients of predictors that do not have a strong relationship with the target variable; the variance decreases at the cost of an increase in bias.

2c

  1. Alternative ii. Non-linear methods have a higher variance than ordinary least squares. The fitted curve follows the observations more closely, so such a model performs better when there is an underlying non-linear relationship between the predictors and the target variable.

3

3a

There will be a monotonic decrease in the training RSS: as \(s\) increases, a larger set of coefficient vectors becomes feasible, so the fit becomes more flexible (and its variance increases). The correct alternative is iv.

3b

Option ii is the correct one. An increase in \(s\) increases the flexibility of the model, so the test RSS first decreases as the bias falls, then increases once the additional variance outweighs the reduction in bias.

3c

Alternative iii is the correct option for the reasons discussed above.

3d

iv

3e

As its name states, the irreducible error remains constant. Option v is the correct one.

4

4a

Option iii; the training RSS increases steadily as the coefficients are shrunk by the growing penalty term.

4b

Option ii; the model will have its variance reduced at the cost of bias. This causes the test RSS to decrease until the point where the further reduction in variance no longer offsets the increase in bias, after which the test RSS increases.
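
The behaviour described in 4a and 4b can be illustrated with a small simulation (my own sketch, not part of the exercise; the low signal-to-noise setup is chosen so the U-shape in the test RSS is typically visible):

library(glmnet)
set.seed(1)
p=40
X=matrix(rnorm(100*p),ncol=p)
y.sim=X%*%rnorm(p,sd=0.5)+rnorm(100,sd=3)
tr=1:50; te=51:100

grid=10^seq(3,-3,length=50)
fit=glmnet(X[tr,],y.sim[tr],alpha=0,lambda=grid)   # ridge fits over a grid of penalties

train.rss=colSums((y.sim[tr]-predict(fit,newx=X[tr,]))^2)
test.rss =colSums((y.sim[te]-predict(fit,newx=X[te,]))^2)

# Training RSS rises steadily as lambda grows; test RSS typically dips before rising.
matplot(log10(grid),cbind(train.rss,test.rss),type='l',lty=1,col=c('black','red'),
        xlab='log10(lambda)',ylab='RSS')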

4c

Option iv; there will be a steady decrease in variance as the coefficients are shrunk toward zero by the increasing penalty term.

4d

Option iii.

4e

Option v.

5

Skipped

6

6a

y=12
lambda=3

ridge = function(b){
  # Ridge objective (6.12) with a single observation: (y - b)^2 + lambda * b^2
  return( (y-b)^2+lambda*b^2)
}

bs=sapply(0:100,ridge)   # sapply returns a numeric vector, which plot() expects
plot(0:100,bs,type='l',xlab='Beta',ylab='Ridge')

b0x=y/(1+lambda)   # closed-form ridge minimiser
b0y=ridge(b0x)

points(b0x,b0y,col="red",cex=2,pch=20)
text(b0x,b0y,labels='B0',pos=3)
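
As a quick numerical sanity check (my own addition), a one-dimensional optimiser should recover the same closed-form minimiser \(\beta = y/(1+\lambda)\):

optimize(ridge,interval=c(-100,100))$minimum   # numerical minimiser
y/(1+lambda)                                   # closed-form ridge solution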

6b

y=50
lambda=4

lasso = function(b){
  # Lasso objective (6.13) with a single observation: (y - b)^2 + lambda * |b|
  return((y-b)^2+lambda*abs(b))
}

soft.threshold=function(y,lambda){
  # Closed-form lasso minimiser (soft thresholding); renamed to avoid masking base::min.
  if(y>lambda/2)
    return(y-lambda/2)
  else if(y< -lambda/2)
    return(y+lambda/2)
  else
    return(0)
}

bs=sapply(0:100,lasso)
plot(0:100,bs,type='l',xlab='Beta',ylab='Lasso')

b0x=soft.threshold(y,lambda)
b0y=lasso(b0x)

points(b0x,b0y,col="red",cex=2,pch=20)
text(b0x,b0y,labels='B0',pos=3)
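
Similarly (my own addition), a numerical check confirms that the minimiser agrees with the soft-thresholding solution \(y - \lambda/2\), since here \(y > \lambda/2\):

optimize(lasso,interval=c(-100,100))$minimum   # numerical minimiser
y-lambda/2                                     # soft-threshold solution when y > lambda/2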

7

Skipped

Applied Exercises

8

8a

set.seed(42)
n=100
x=rnorm(n)  # predictor vector
e=rnorm(n) # noise vector

8b

betas=sample(1:n,4)
betas
## [1]  3 51 62 41
y=betas[1]+betas[2]*x+betas[3]*x^2+betas[4]*x^3+e

curve(betas[1]+betas[2]*x+betas[3]*x^2+betas[4]*x^3,from=-3, to=3, xlab="x", ylab="y")
points(data.frame(x,y),col='red')

8c

library(leaps)


PrintBestModelByScore=function(model,score,ord=which.min){
  # Prints the coefficients of the best model selected according to a given score.
  # model: the output of the regsubsets function from the leaps package.
  # score: the name of the score in the regsubsets summary (e.g. 'cp', 'bic', 'adjr2').
  # ord:   a function that returns the index of the best model given a vector of scores
  #        (which.min for cp/bic, which.max for adjr2).
  #
  # The function is executed for its side effect (printing) only.
  
  model.summary=summary(model)
  model.bestcoef=ord(model.summary[[score]])
  
  print(coef(model,model.bestcoef))
}

y.bestsub=regsubsets(y~poly(x,10,raw = T),data=data.frame(x,y),method = 'exhaustive')

PrintBestModelByScore(y.bestsub,'cp')
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##            2.851142318           50.487064223           62.381429720 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)6  poly(x, 10, raw = T)8 
##           41.355306390           -0.187316229            0.052618451 
##  poly(x, 10, raw = T)9 poly(x, 10, raw = T)10 
##           -0.001250983           -0.003883804
PrintBestModelByScore(y.bestsub,'bic')
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##           3.002453396          50.959062732          61.859658437 
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)7 
##          40.191887577           0.721914977          -0.156366621 
## poly(x, 10, raw = T)9 
##           0.009546922
PrintBestModelByScore(y.bestsub,'adjr2',ord=which.max)
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            2.86875691           51.01825748           62.25410402 
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)6 
##           39.98526723            0.89899425           -0.09091581 
## poly(x, 10, raw = T)7 poly(x, 10, raw = T)8 poly(x, 10, raw = T)9 
##           -0.21426796            0.01460042            0.01574804
par(mfrow=c(2,2))

# plots
for(score in c('cp','bic','adjr2','rsq')) 
  plot(x=summary(y.bestsub)[[score]],xlab='Number of Predictors',ylab=score,type='l')

8d

# Forward selection

y.forward=regsubsets(y~poly(x,10,raw = T),data=data.frame(x,y),method = 'forward')

PrintBestModelByScore(y.forward,'cp')
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##           3.002453396          50.959062732          61.859658437 
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)7 
##          40.191887577           0.721914977          -0.156366621 
## poly(x, 10, raw = T)9 
##           0.009546922
PrintBestModelByScore(y.forward,'bic')
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##           3.002453396          50.959062732          61.859658437 
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)7 
##          40.191887577           0.721914977          -0.156366621 
## poly(x, 10, raw = T)9 
##           0.009546922
PrintBestModelByScore(y.forward,'adjr2',ord=which.max)
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##            2.850972505           51.025584458           62.413925328 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)4  poly(x, 10, raw = T)5 
##           39.952624289           -0.213185003            0.925439288 
##  poly(x, 10, raw = T)7  poly(x, 10, raw = T)9 poly(x, 10, raw = T)10 
##           -0.220302150            0.016107769            0.000759652
# plots
par(mfrow=c(2,2))

for(score in c('cp','bic','adjr2','rsq')) 
  plot(x=summary(y.forward)[[score]],xlab='Number of Predictors',ylab=score,type='l')

# Backward selection

y.backward=regsubsets(y~poly(x,10,raw=T),data=data.frame(x,y),method = 'backward')

PrintBestModelByScore(y.backward,'cp')
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##            2.851876962           50.460805013           62.378905170 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)6  poly(x, 10, raw = T)7 
##           41.395902335           -0.182196697           -0.007798896 
##  poly(x, 10, raw = T)8 poly(x, 10, raw = T)10 
##            0.049540851           -0.003458933
PrintBestModelByScore(y.backward,'bic')
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##            2.804598315           50.743003586           62.519876526 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)6  poly(x, 10, raw = T)8 
##           41.165713285           -0.227949448            0.060212498 
## poly(x, 10, raw = T)10 
##           -0.003902153
PrintBestModelByScore(y.backward,'adjr2',ord=which.max)
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##             2.84527494            50.75412712            62.37950315 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)5  poly(x, 10, raw = T)6 
##            40.82627209             0.24798080            -0.18334724 
##  poly(x, 10, raw = T)7  poly(x, 10, raw = T)8 poly(x, 10, raw = T)10 
##            -0.03632163             0.05259647            -0.00406704
# plots
par(mfrow=c(2,2))

for(score in c('cp','bic','adjr2','rsq')) 
  plot(x=summary(y.backward)[[score]],xlab='Number of Predictors',ylab=score,type='l')

Forward and backward stepwise selection choose slightly different sets of higher-order terms, both from each other and from best subset selection. However, all of the selected models include the true first-, second- and third-degree terms, and the remaining higher-order terms receive only small coefficients.

8e

# Select the optimal value of lambda by cross validation
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
y.lasso.cv=cv.glmnet(poly(x,degree = 10,raw=T),y)

y.lasso.cv$lambda.min
## [1] 3.411904
y.lasso.cv$lambda.1se
## [1] 4.109651
# Plot the cross validation error as a function of lambda
plot(data.frame(y.lasso.cv$lambda,y.lasso.cv$cvm),xlab='Lambda',ylab='Mean Cross Validation Error',type='l')

# Report the best coefficients found
predict(y.lasso.cv,s = y.lasso.cv$lambda.min,type='coefficients')
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept)  7.398695
## 1           52.579806
## 2           56.966829
## 3           38.949240
## 4            .       
## 5            .       
## 6            .       
## 7            .       
## 8            .       
## 9            .       
## 10           .

As can be observed from the coefficients obtained, the lasso performs much better than both best subset selection and the stepwise approaches, as it correctly selects only the first three polynomial terms.

y.lasso.pred=predict(y.lasso.cv,s=y.lasso.cv$lambda.min,newx=poly(x,10,raw=T))

curve(betas[1]+betas[2]*x+betas[3]*x^2+betas[4]*x^3,from=-3, to=3, xlab="x", ylab="y")
points(data.frame(x,y),col='red')
points(data.frame(x,y.lasso.pred),col='blue')

The predictions made by this model (in blue) accurately follow the observed data (shown in red).

8f

set.seed(345)

n=100
betas=sample(1:n,2)
betas
## [1] 22 28
e=rnorm(n)
x=rnorm(n)

y2=betas[1]+betas[2]*x^7+e

curve(betas[1]+betas[2]*x^7,from=-3,to=3,xlab='x',ylab='y')
points(data.frame(x,y2),col='red')

# Best Subset

y2.bestsub=regsubsets(y2~poly(x,10,raw=T),data=data.frame(x,y2),method = 'exhaustive')
plot(y2.bestsub)

PrintBestModelByScore(y2.bestsub,'cp')
##           (Intercept) poly(x, 10, raw = T)4 
##             57.552612             -1.284602
PrintBestModelByScore(y2.bestsub,'bic')
##           (Intercept) poly(x, 10, raw = T)4 
##             57.552612             -1.284602
PrintBestModelByScore(y2.bestsub,'adjr2',ord=which.max)
##            (Intercept)  poly(x, 10, raw = T)2  poly(x, 10, raw = T)4 
##             108.560102            -538.731967             766.539941 
##  poly(x, 10, raw = T)6  poly(x, 10, raw = T)8 poly(x, 10, raw = T)10 
##            -364.206982              67.255616              -4.143186
par(mfrow=c(2,2))

# plots
for(score in c('cp','bic','adjr2','rsq')) 
  plot(x=summary(y2.bestsub)[[score]],xlab='Number of Predictors',ylab=score,type='l')

# Lasso
y2.lasso.cv=cv.glmnet(poly(x,degree = 10,raw=T),y2)
plot(data.frame(y2.lasso.cv$lambda,y2.lasso.cv$cvm),xlab='Lambda',ylab='Mean Cross Validation Error',type='l')

predict(y2.lasso.cv,s = y2.lasso.cv$lambda.min,type='coefficients')
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 55.839076
## 1           33.718666
## 2            .       
## 3            7.429332
## 4            .       
## 5            .       
## 6            .       
## 7            .       
## 8            .       
## 9            .       
## 10           .

Neither approach performs well, as neither selects the coefficient on the seventh power of x. However, the model found still provides a reasonable approximation of the observed data, as shown below.

y2.lasso.pred=predict(y2.lasso.cv,s = y2.lasso.cv$lambda.min,newx=poly(x,10,raw=T))

curve(betas[1]+betas[2]*x^7,from=-3,to=3,xlab='x',ylab='y')
points(data.frame(x,y2),col='red')
points(data.frame(x,y2.lasso.pred),col='blue')

9

9a

library(ISLR)
n=nrow(College)
x=c()   # vector used to collect the test MSE of each method

set.seed(100)
train=sample(1:n,n/2)
test=-train

9b

lm.fit=lm(Apps~.,data=College[train,])
lm.pred=predict(lm.fit,newdata = College[test,])
x=c(x,lm=mean( (lm.pred-College[test,"Apps"])^2 ))

9c

grid = 10 ^ seq(4, -2, length=100)
train.mat = model.matrix(Apps~., data=College[train,])
test.mat = model.matrix(Apps~., data=College[test,])
library(glmnet)

ridge.cv=cv.glmnet(x=train.mat,y=College[train,"Apps"],alpha=0,thresh=1e-12,lambda = grid)
ridge.fit=glmnet(x=train.mat,y=College[train,"Apps"],alpha=0,lambda = ridge.cv$lambda.min)
ridge.pred=predict(ridge.fit,newx=test.mat )
ridge.coef=predict(ridge.fit,type='coefficients',s=ridge.cv$lambda.min)

ridge.cv$lambda.min
## [1] 14.17474
ridge.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -5.698205e+02
## (Intercept)  .           
## PrivateYes  -5.936518e+02
## Accept       1.280631e+00
## Enroll      -4.716949e-01
## Top10perc    4.794493e+01
## Top25perc   -1.219088e+01
## F.Undergrad  9.839294e-02
## P.Undergrad  7.080416e-02
## Outstate    -7.050049e-02
## Room.Board   3.114784e-01
## Books        4.586375e-02
## Personal     2.040331e-03
## PhD         -7.258500e+00
## Terminal    -2.507214e+00
## S.F.Ratio   -1.510205e+01
## perc.alumni -1.442947e+01
## Expend       6.324409e-02
## Grad.Rate    1.116685e+01
x=c(x,ridge=mean( (ridge.pred-College[test,2])^2))

9d

lasso.cv=cv.glmnet(train.mat,y=College[train,"Apps"],alpha=1,lambda=grid,thresh=1e-12)
lasso.fit=glmnet(x=as.matrix(College[train,-c(1,2)]),y=College[train,2],alpha=1,lambda = lasso.cv$lambda.min)
lasso.coef=predict(lasso.fit,type='coefficients',s=lasso.cv$lambda.min)

lasso.cv$lambda.min
## [1] 16.29751
lasso.coef
## 17 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.364330e+03
## Accept       1.264287e+00
## Enroll       .           
## Top10perc    3.850869e+01
## Top25perc   -4.595076e+00
## F.Undergrad  3.380132e-02
## P.Undergrad  7.214142e-02
## Outstate    -8.197778e-02
## Room.Board   2.618713e-01
## Books        .           
## Personal     .           
## PhD         -2.243368e+00
## Terminal     .           
## S.F.Ratio    .           
## perc.alumni -1.614927e+01
## Expend       6.392972e-02
## Grad.Rate    6.941991e+00
lasso.pred=predict(lasso.cv,newx=test.mat,s=lasso.cv$lambda.min)
x=c(x,lasso=mean( (lasso.pred-College[test,2])^2))

9e

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2)

pcr.fit=pcr(Apps~.,data=College,scale=TRUE,validation='CV',subset=train)
validationplot(pcr.fit,val.type="MSEP")

pcr.fit$ncomp
## [1] 17

Note that pcr.fit$ncomp simply reports the number of components that were fitted (17, the full model), not the CV optimum. From the validation plot, the cross-validation error keeps improving until nearly all components are used; the test error below is computed with 16 components.
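
Rather than reading the optimum off the plot, the cross-validation curve can also be inspected directly (a small addition of mine; MSEP() is the pls accessor for the cross-validated MSE):

msep.cv=MSEP(pcr.fit)$val['CV',1,]   # CV MSE for 0, 1, ..., 17 components
which.min(msep.cv)-1                 # subtract 1 because the first entry is the intercept-only model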

pcr.pred=predict(pcr.fit,newdata = College[test,-c(2)],ncomp = 16)
x=c(x,pcr=mean( (pcr.pred-College[test,2])^2))

9f

pls.fit=plsr(Apps~.,data=College,scale=TRUE,validation='CV',subset=train)
validationplot(pls.fit,val.type="MSEP")

pls.fit$ncomp
## [1] 17
pls.pred=predict(pls.fit,newdata = College[test,-c(2)],ncomp = 10)
x=c(x,pls=mean( (pls.pred-College[test,2])^2))

9g

sort(x)
##      lm     pls   ridge   lasso     pcr 
## 1355557 1365337 1408910 1408910 1531268

From the results obtained there is no substantial difference between fitting the model with least squares, ridge regression, the lasso or partial least squares. The lasso drops the Books, Personal, Terminal and S.F.Ratio predictors entirely, and these are also among the predictors that are not significant in the least squares model, as the summary below shows.

summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = College[train, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2842.8  -487.6   -90.9   347.3  5224.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.967e+02  5.513e+02  -0.901 0.368171    
## PrivateYes  -5.574e+02  1.896e+02  -2.941 0.003482 ** 
## Accept       1.342e+00  6.902e-02  19.451  < 2e-16 ***
## Enroll      -6.678e-01  2.465e-01  -2.709 0.007070 ** 
## Top10perc    5.059e+01  7.358e+00   6.876 2.64e-11 ***
## Top25perc   -1.422e+01  6.140e+00  -2.316 0.021121 *  
## F.Undergrad  1.097e-01  3.888e-02   2.821 0.005039 ** 
## P.Undergrad  7.432e-02  3.893e-02   1.909 0.057031 .  
## Outstate    -7.834e-02  2.741e-02  -2.858 0.004501 ** 
## Room.Board   3.032e-01  6.717e-02   4.514 8.56e-06 ***
## Books        2.648e-02  3.577e-01   0.074 0.941029    
## Personal     8.379e-03  9.253e-02   0.091 0.927894    
## PhD         -7.400e+00  6.474e+00  -1.143 0.253776    
## Terminal    -2.347e+00  7.030e+00  -0.334 0.738667    
## S.F.Ratio   -1.447e+01  1.823e+01  -0.794 0.427878    
## perc.alumni -1.319e+01  5.674e+00  -2.325 0.020606 *  
## Expend       6.334e-02  1.725e-02   3.671 0.000277 ***
## Grad.Rate    1.124e+01  4.067e+00   2.764 0.005989 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 997.5 on 370 degrees of freedom
## Multiple R-squared:  0.9288, Adjusted R-squared:  0.9256 
## F-statistic: 284.1 on 17 and 370 DF,  p-value: < 2.2e-16

Furthermore, principal component regression is the worst-performing procedure on this dataset; the directions of greatest variance in the predictors are evidently not the ones most strongly related to the response.

avg_apps=mean(College[,"Apps"])
1 - mean((College[test, "Apps"] - lm.pred)^2) /mean((College[test, "Apps"] - avg_apps)^2)
## [1] 0.9180139

The best performing model therefore has a test MSE of about 1,355,557, and roughly 92% of the variance present in the test data is explained by the model.
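
To express this error on the scale of the response (a small addition of mine), take the square root of the test MSE:

sqrt(x['lm'])   # root-mean-squared test error, roughly 1,160 applications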

10

10a

p=20
n=1000

set.seed(38)

# Generate noise
e=rnorm(n)

# Generate Observations
X=c()
for(i in 1:p)
  X=cbind(X,rnorm(n,mean = sample(1:100,1),sd = sample(0:50,1)))
colnames(X)<-paste('X',1:20,sep = "")

# Generate Coefficients
betas=sample(0:100,p,replace = T)
betas[ !sample(0:1,size = p,replace = T) ]=0   # randomly zero out roughly half of the coefficients
names(betas)<-paste('X',1:20,sep = "")

# Generate Target Variable
y.mat=X %*% matrix(betas,nrow = 20)   # X %*% beta
y=apply(cbind(y.mat,e),1,sum)         # add the noise term
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13600   20560   22640   22470   24300   30960
y.mat=data.frame(X,y)

print(paste("The coefficients chosen are",paste(betas,collapse = " "))) # coefficients generated in the model
## [1] "The coefficients chosen are 57 0 0 96 0 0 16 86 44 0 0 0 0 6 0 0 0 0 0 75"

10b

train=sample(1:n,100)
test=-train

10c

library(leaps)
# Best subset selection
y.bestsub=regsubsets(y~.,data = y.mat[train,],method = 'exhaustive',nvmax = 20)
y.bestsub.summary=summary(y.bestsub)

# plot train MSE for each model (divide the RSS by the number of training observations)
y.bestsub.mse=y.bestsub.summary$rss/length(train)

plot(y.bestsub.mse,xlab="Number of Variables",ylab="Train MSE",type='l')
axis(1, at=seq(1,20,1))

10d

# plot test MSE for each model 

predict.regsubsets=function(object,newdata,id,...){
  # Predict method for regsubsets objects (not provided by the leaps package).
  # object: a regsubsets fit; newdata: a data frame; id: the model size to use.
  form=as.formula(object$call[[2]])   # recover the formula used in the fit
  mat=model.matrix(form,newdata)      # build the design matrix for newdata
  coefi=coef(object,id=id)            # coefficients of the id-variable model
  xvars=names(coefi)
  mat[,xvars]%*%coefi                 # predictions from the selected columns only
}

ers=c()
for(i in 1:20)
  ers=c(ers, mean((y.mat[test,'y']-predict.regsubsets(y.bestsub,newdata = y.mat[test,],i))^2)  )

plot(ers,type='l',xlab='Number of Predictors',ylab="Test MSE")
axis(side=1,at=1:20)

10e

The minimum value is achieved by a model with 4 predictors.
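
This can be confirmed directly from the objects computed above (a small addition of mine; the exact size can vary with the random seed):

which.min(ers)                      # model size with the lowest test MSE
coef(y.bestsub,id=which.min(ers))   # coefficient estimates of that model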

10f

The model used to generate the data has 7 non-zero coefficients.
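
This can be read off the coefficient vector generated in 10a (a small addition of mine):

sum(betas!=0)     # number of non-zero coefficients (7)
betas[betas!=0]   # the coefficients that actually enter the model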

10g

betas.error=c()
for(i in 1:20){
  coefi=coef(y.bestsub,id=i)[-1]            # drop the intercept
  est=rep(0,p); names(est)=names(betas)     # coefficients left out of the model count as 0
  est[names(coefi)]=coefi
  betas.error=c(betas.error,sqrt(sum((betas-est)^2)))
}

plot(scale(betas.error),col='red',ylab='Scaled coefficient error (red) and test MSE (black)',
     xlab="Number of Predictors",type='l',ylim = c(-3,3))
lines(scale(ers),type='l')
axis(side=1,at=1:20)
axis(side=1,at=1:20)

The error of the estimated coefficients roughly follows the test MSE curve; however, it is minimised by a model with 7 or more predictors rather than by the 4-predictor model that minimises the test MSE.

11

11a

library(glmnet)
library(leaps)
library(pls)
library(MASS)

n=nrow(Boston)
train=sample(1:n,n/2)
test=-train

Boston.train=model.matrix(crim~.,Boston[train,])
Boston.test=model.matrix(crim~.,Boston[test,])
model.errors=c()
# Best Subset Selection
bestsub.train=regsubsets(crim~.,data=Boston[train,],method='exhaustive',nvmax = 14)
bestsub.train.summary=summary(bestsub.train)

ers=c()
for(i in 1:13)
  ers=c(ers, mean((Boston[test,1]-
                     predict.regsubsets(bestsub.train,newdata = Boston[test,],i))^2)  )
plot(ers,type='l',xlab="Number of Predictors",ylab="Test MSE")

par(mfrow=c(2,2))
for(score in c('cp','rss','adjr2','bic')) 
  plot(bestsub.train.summary[[score]],xlab='Number of Predictors',ylab=score,type='l')

dev.off()
## null device 
##           1
bestsub.pred=predict.regsubsets(bestsub.train,newdata = Boston[test,],which.min(ers))
model.errors=c(model.errors,bestsubset=ers[which.min(ers)])

From the plots above we can see that models with a small number of predictors already perform well on all of the scores. This is good evidence that the lasso and ridge regression may perform well on this dataset. Furthermore, the test MSE confirms that very simple models do indeed perform well.

# Lasso
set.seed(37)
lasso.cv=cv.glmnet(x=Boston.train,y=Boston[train,'crim'],alpha = 1)

lasso.cv.cvm.min=which.min(lasso.cv$cvm)
plot(lasso.cv$lambda,lasso.cv$cvm,type='l',xlab='Lambda',ylab='Cross Validation MSE')
points(lasso.cv$lambda.min,lasso.cv$cvm[lasso.cv.cvm.min],pch=20,col="red",cex=1)

lasso.cv$lambda.min
## [1] 0.4810037
coef(lasso.cv,s = lasso.cv$lambda.min)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)  2.1289195
## (Intercept)  .        
## zn           .        
## indus        .        
## chas         .        
## nox          .        
## rm           .        
## age          .        
## dis         -0.1781566
## rad          0.5070153
## tax          .        
## ptratio      .        
## black        .        
## lstat        .        
## medv        -0.1165434

Above you can view the cross-validation MSE as a function of lambda. A small penalty term provides the best-performing model according to the training cross-validation error, and the resulting fit is sparse, keeping only three predictors (dis, rad and medv).

ers=c()
for(l in lasso.cv$lambda){
  lasso.pred=predict(lasso.cv,newx=Boston.test,s = l)
  ers=c(ers,mean( (Boston[test,'crim']-lasso.pred)^2 ))
}

plot(lasso.cv$lambda,ers,type='l',ylab='Test MSE',xlab='Lambda')
points(lasso.cv$lambda.min,ers[which.min(lasso.cv$cvm)],pch=20,col="red",cex=1)

ers.min=which.min(ers)
points(lasso.cv$lambda[ers.min],ers[ers.min],pch=20,col="green",cex=1)

model.errors=c(model.errors,lasso=ers[ers.min])
lasso.pred=predict(lasso.cv,newx=Boston.test,s = lasso.cv$lambda.min)

From the test MSE we can now see that a somewhat larger penalty term would have improved the model. The test MSE obtained using the lambda selected via cross-validation is shown in red, and the lambda that is optimal on the test set is shown in green.
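
For reference, both penalty values can be printed directly (a small addition of mine):

lasso.cv$lambda.min        # lambda chosen by cross-validation on the training data
lasso.cv$lambda[ers.min]   # lambda that minimises the test MSE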

# PCR
set.seed(42)
pcr.fit=pcr(crim~.,validation='CV',scale=T,data=Boston[train,])
validationplot(pcr.fit,val.type="MSEP")

From the PCR cross-validation plot above we can see that there is not a significant change in the cross-validation MSE from 3 through 12 components; the minimum is achieved when 9 principal components are used.

ers=c()
for(i in 1:13){
  pcr.pred=predict(pcr.fit,newdata = Boston[test,],ncomp = i)
  ers=c(ers,mean((Boston[test,'crim']-pcr.pred)^2))
}
plot(ers,type='l',ylab='Test MSE',xlab='Number of Components')
points(9,ers[9],col='red',pch=20)
points(which.min(ers),ers[which.min(ers)],col='green',pch=20)

model.errors=c(model.errors,pcr=ers[9])
pcr.pred=predict(pcr.fit,newdata = Boston[test,],ncomp = 9)

11b

sort(model.errors)
## bestsubset      lasso        pcr 
##   36.13583   36.71278   37.55686
# R squared
cor(Boston[test,'crim'],bestsub.pred)^2
##           [,1]
## [1,] 0.4677941
cor(Boston[test,'crim'],pcr.pred)^2
## [1] 0.4482602
cor(Boston[test,'crim'],lasso.pred)^2
##              1
## [1,] 0.4537028

Best subset selection is the best performing model on this data according to both test MSE and R squared, explaining about 47% of the variance in the test data. I therefore expect models that do not use all of the predictors to tend to perform well on this dataset.

11c

Not all predictors are strongly related to the response variable; using all of them is likely to hurt test performance by overfitting the model.