Notebook prepared by Everton Lima
The model likely to have the smallest training RSS is the one produced by best subset selection. The other approaches are greedy and do not consider all possible models. In particular, best subset selection evaluates every model that forward and backward stepwise selection would evaluate, so for any given model size it is guaranteed to find the model with the lowest training RSS.
It is unclear which selection algorithm will produce the model with the lowest test RSS. Best subset selection has the best chance of finding the model with the lowest training RSS, since it considers every subset, but it is also more prone to overfitting, which could allow one of the stepwise approaches to achieve a lower test RSS. With the information given, we therefore cannot say which method will perform best with respect to the test RSS.
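A quick illustrative check of the training-RSS claim (a minimal sketch assuming the leaps package and a small simulated data set; the .demo object names are arbitrary and not used elsewhere):
# For every model size, best subset selection attains a training RSS no larger
# than forward or backward stepwise selection on the same data.
library(leaps)
set.seed(1)
X.demo=matrix(rnorm(100*8),100,8)
y.demo=as.vector(X.demo%*%rnorm(8))+rnorm(100)
dat.demo=data.frame(X.demo,y.demo)
rss.demo=sapply(c('exhaustive','forward','backward'),function(m)
  summary(regsubsets(y.demo~.,data=dat.demo,nvmax=8,method=m))$rss)
rss.demo  # rows = model size, columns = method; the exhaustive column is never larger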
Alternative iii is the correct choice. The lasso limits the number of predictors, reducing variance at the cost of some added bias. Removing a predictor from the model amounts to asserting that the removed feature has no strong relationship with the response; that assertion may introduce bias, but it decreases variance because fewer coefficients must be estimated from the data.
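A minimal sketch of this behaviour (assuming the glmnet package; the simulated data and .demo names are illustrative only):
# Lasso coefficient paths: coefficients are driven exactly to zero as the penalty
# grows, trading some bias for a reduction in variance.
library(glmnet)
set.seed(1)
x.demo=matrix(rnorm(100*10),100,10)
y.demo=as.vector(x.demo[,1:3]%*%c(3,2,-2))+rnorm(100)
lasso.demo=glmnet(x.demo,y.demo,alpha=1)   # alpha = 1 requests the lasso
plot(lasso.demo,xvar='lambda',label=TRUE)  # coefficients shrink to zero as log(lambda) increases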
The training RSS will decrease monotonically, since a larger set of coefficient vectors becomes feasible as s increases (at the cost of increased variance). The correct alternative is iv.
Option iii; the variance will steadily increase, since as s grows the coefficients are shrunk less by the penalty term and the fit becomes more flexible.
Option ii; as s increases, bias decreases at the cost of added variance. The test RSS will fall as long as the reduction in bias outweighs the increase in variance, and will rise once it no longer does, producing a U shape.
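A small sketch of that U shape (simulated data, glmnet assumed; a larger budget s corresponds to a smaller penalty lambda, and the .demo names are illustrative only):
# Test MSE as the penalty is relaxed: it first decreases, then rises again as the
# model starts to overfit.
library(glmnet)
set.seed(2)
x.demo=matrix(rnorm(200*50),200,50)
y.demo=as.vector(x.demo[,1:5]%*%rep(2,5))+rnorm(200)
train.demo=1:100
fit.demo=glmnet(x.demo[train.demo,],y.demo[train.demo],alpha=1)
mse.demo=colMeans((y.demo[-train.demo]-predict(fit.demo,newx=x.demo[-train.demo,]))^2)
plot(log(fit.demo$lambda),mse.demo,type='l',xlab='log(lambda)  (smaller lambda = larger s)',ylab='Test MSE')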
# A single observation y and ridge penalty lambda
y=12
lambda=3
# Ridge objective for one observation: (y - b)^2 + lambda * b^2
ridge = function(b){
  return( (y-b)^2+lambda*b^2 )
}
# Evaluate the objective on a grid of beta values and plot it
bs=sapply(0:100,ridge)
plot(0:100,bs,type='l',xlab='Beta',ylab='Ridge')
# The ridge objective is minimized at beta = y / (1 + lambda)
b0x=y/(1+lambda)
b0y=ridge(b0x)
points(b0x,b0y,col="red",cex=2,pch=20)
text(b0x,b0y,labels='B0',pos=3)
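As a quick numerical check (reusing the ridge function and the values of y and lambda defined above), the minimizer found by base R's optimize should match the closed form y/(1+lambda):
# Numerical minimization of the ridge objective; should return roughly 12/(1+3) = 3
optimize(ridge,interval=c(0,100))$minimum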
# A single observation y and lasso penalty lambda
y=50
lambda=4
# Lasso objective for one observation: (y - b)^2 + lambda * |b|
lasso = function(b){
  return((y-b)^2+lambda*abs(b))
}
# Soft-thresholding solution of the one-observation lasso problem
soft.threshold=function(y,lambda){
  if(y > lambda/2)
    return(y-lambda/2)
  else if(y < -lambda/2)
    return(y+lambda/2)
  else
    return(0)
}
# Evaluate the objective on a grid of beta values and plot it
bs=sapply(0:100,lasso)
plot(0:100,bs,type='l',xlab='Beta',ylab='Lasso')
b0x=soft.threshold(y,lambda)
b0y=lasso(b0x)
points(b0x,b0y,col="red",cex=2,pch=20)
text(b0x,b0y,labels='B0',pos=3)
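Again a quick numerical check (reusing the lasso function, y and lambda from above) that the minimizer agrees with the soft-thresholding value y - lambda/2 = 48:
# Numerical minimization of the lasso objective; should return roughly 48
optimize(lasso,interval=c(0,100))$minimum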
betas=sample(1:n,4)
betas
## [1] 3 51 62 41
# Generate the response from a cubic polynomial in x plus noise
y=betas[1]+betas[2]*x+betas[3]*x^2+betas[4]*x^3+e
# Plot the true cubic curve together with the simulated observations
curve(betas[1]+betas[2]*x+betas[3]*x^2+betas[4]*x^3,from=-3, to=3, xlab="x", ylab="y")
points(data.frame(x,y),col='red')
library(leaps)
PrintBestModelByScore=function(model,score,ord=which.min){
  # PrintBestModelByScore prints the coefficients of the best model selected.
  # model is the output of the regsubsets function from the leaps package.
  # score is the name of a summary criterion ('cp', 'bic', 'adjr2', ...).
  # ord is a function used to pick the best model from a vector of scores
  # (which.min for cp/bic, which.max for adjr2).
  #
  # The function is executed for its side effects only.
  model.summary=summary(model)
  model.best=ord(model.summary[[score]])
  print(coef(model,model.best))
}
y.bestsub=regsubsets(y~poly(x,10,raw = T),data=data.frame(x,y),method = 'exhaustive')
PrintBestModelByScore(y.bestsub,'cp')
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 2.851142318 50.487064223 62.381429720
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)6 poly(x, 10, raw = T)8
## 41.355306390 -0.187316229 0.052618451
## poly(x, 10, raw = T)9 poly(x, 10, raw = T)10
## -0.001250983 -0.003883804
PrintBestModelByScore(y.bestsub,'bic')
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.002453396 50.959062732 61.859658437
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)7
## 40.191887577 0.721914977 -0.156366621
## poly(x, 10, raw = T)9
## 0.009546922
PrintBestModelByScore(y.bestsub,'adjr2',ord=which.max)
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 2.86875691 51.01825748 62.25410402
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)6
## 39.98526723 0.89899425 -0.09091581
## poly(x, 10, raw = T)7 poly(x, 10, raw = T)8 poly(x, 10, raw = T)9
## -0.21426796 0.01460042 0.01574804
par(mfrow=c(2,2))
# plots
for(score in c('cp','bic','adjr2','rsq'))
plot(x=summary(y.bestsub)[[score]],xlab='Number of Predictors',ylab=score,type='l')
# Forward selection
y.forward=regsubsets(y~poly(x,10,raw = T),data=data.frame(x,y),method = 'forward')
PrintBestModelByScore(y.forward,'cp')
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.002453396 50.959062732 61.859658437
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)7
## 40.191887577 0.721914977 -0.156366621
## poly(x, 10, raw = T)9
## 0.009546922
PrintBestModelByScore(y.forward,'bic')
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.002453396 50.959062732 61.859658437
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)7
## 40.191887577 0.721914977 -0.156366621
## poly(x, 10, raw = T)9
## 0.009546922
PrintBestModelByScore(y.forward,'adjr2',ord=which.max)
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 2.850972505 51.025584458 62.413925328
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)4 poly(x, 10, raw = T)5
## 39.952624289 -0.213185003 0.925439288
## poly(x, 10, raw = T)7 poly(x, 10, raw = T)9 poly(x, 10, raw = T)10
## -0.220302150 0.016107769 0.000759652
# plots
par(mfrow=c(2,2))
for(score in c('cp','bic','adjr2','rsq'))
plot(x=summary(y.forward)[[score]],xlab='Number of Predictors',ylab=score,type='l')
# Backward selection
y.backward=regsubsets(y~poly(x,10,raw=T),data=data.frame(x,y),method = 'backward')
PrintBestModelByScore(y.backward,'cp')
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 2.851876962 50.460805013 62.378905170
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)6 poly(x, 10, raw = T)7
## 41.395902335 -0.182196697 -0.007798896
## poly(x, 10, raw = T)8 poly(x, 10, raw = T)10
## 0.049540851 -0.003458933
PrintBestModelByScore(y.backward,'bic')
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 2.804598315 50.743003586 62.519876526
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)6 poly(x, 10, raw = T)8
## 41.165713285 -0.227949448 0.060212498
## poly(x, 10, raw = T)10
## -0.003902153
PrintBestModelByScore(y.backward,'adjr2',ord=which.max)
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 2.84527494 50.75412712 62.37950315
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)5 poly(x, 10, raw = T)6
## 40.82627209 0.24798080 -0.18334724
## poly(x, 10, raw = T)7 poly(x, 10, raw = T)8 poly(x, 10, raw = T)10
## -0.03632163 0.05259647 -0.00406704
# plots
par(mfrow=c(2,2))
for(score in c('cp','bic','adjr2','rsq'))
plot(x=summary(y.backward)[[score]],xlab='Number of Predictors',ylab=score,type='l')
The models selected by forward and backward stepwise selection are broadly similar to the best subset models from the previous section: all of them include the true predictors (the first three powers of x) along with a few spurious higher-order terms with very small coefficients, although the exact sets of extra terms differ between methods.
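One way to check this directly (a small sketch reusing the summary objects created above) is to compare, for each model size, the sets of variables chosen by the three methods:
# TRUE where forward/backward selection chose exactly the same variables as
# best subset selection for that model size
sapply(1:nrow(summary(y.bestsub)$which),function(k)
  c(fwd=all(summary(y.forward)$which[k,]==summary(y.bestsub)$which[k,]),
    bwd=all(summary(y.backward)$which[k,]==summary(y.bestsub)$which[k,])))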
# Select the optimal value of lambda by cross validation
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
y.lasso.cv=cv.glmnet(poly(x,degree = 10,raw=T),y)
y.lasso.cv$lambda.min
## [1] 3.411904
y.lasso.cv$lambda.1se
## [1] 4.109651
# Plot the cross validation error as a function of lambda
plot(data.frame(y.lasso.cv$lambda,y.lasso.cv$cvm),xlab='Lambda',ylab='Mean Cross Validation Error',type='l')
# Report the best coefficients found
predict(y.lasso.cv,s = y.lasso.cv$lambda.min,type='coefficients')
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 7.398695
## 1 52.579806
## 2 56.966829
## 3 38.949240
## 4 .
## 5 .
## 6 .
## 7 .
## 8 .
## 9 .
## 10 .
As can be observed from the coefficients obtained, the lasso performs much better than best subset selection and the stepwise approaches: it selects exactly the intercept and the first three powers of x, setting the remaining coefficients to zero.
y.lasso.pred=predict(y.lasso.cv,s=y.lasso.cv$lambda.min,newx=poly(x,10,raw=T))
curve(betas[1]+betas[2]*x+betas[3]*x^2+betas[4]*x^3,from=-3, to=3, xlab="x", ylab="y")
points(data.frame(x,y),col='red')
points(data.frame(x,y.lasso.pred),col='blue')
The predictions made by this model (in blue) accurately follow the observed data (shown in red).
set.seed(345)
n=100
betas=sample(1:n,2)
betas
## [1] 22 28
e=rnorm(n)
x=rnorm(n)
y2=betas[1]+betas[2]*x^7+e
curve(betas[1]+betas[2]*x^7,from=-3,to=3,xlab='x',ylab='y')
points(data.frame(x,y2),col='red')
# Best Subset
y2.bestsub=regsubsets(y2~poly(x,10,raw=T),data=data.frame(x,y2),method = 'exhaustive')
plot(y2.bestsub)
PrintBestModelByScore(y2.bestsub,'cp')
## (Intercept) poly(x, 10, raw = T)4
## 57.552612 -1.284602
PrintBestModelByScore(y2.bestsub,'bic')
## (Intercept) poly(x, 10, raw = T)4
## 57.552612 -1.284602
PrintBestModelByScore(y2.bestsub,'adjr2',ord=which.max)
## (Intercept) poly(x, 10, raw = T)2 poly(x, 10, raw = T)4
## 108.560102 -538.731967 766.539941
## poly(x, 10, raw = T)6 poly(x, 10, raw = T)8 poly(x, 10, raw = T)10
## -364.206982 67.255616 -4.143186
par(mfrow=c(2,2))
# plots
for(score in c('cp','bic','adjr2','rsq'))
plot(x=summary(y2.bestsub)[[score]],xlab='Number of Predictors',ylab=score,type='l')
# Lasso
y2.lasso.cv=cv.glmnet(poly(x,degree = 10,raw=T),y2)
plot(data.frame(y2.lasso.cv$lambda,y2.lasso.cv$cvm),xlab='Lambda',ylab='Mean Cross Validation Error',type='l')
predict(y2.lasso.cv,s = y2.lasso.cv$lambda.min,type='coefficients')
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 55.839076
## 1 33.718666
## 2 .
## 3 7.429332
## 4 .
## 5 .
## 6 .
## 7 .
## 8 .
## 9 .
## 10 .
Neither approach performs well, as neither selects the coefficient on the seventh power of x. However, the fitted model still provides a reasonable approximation to the observed data, as shown below.
y2.lasso.pred=predict(y2.lasso.cv,s = y2.lasso.cv$lambda.min,newx=poly(x,10,raw=T))
curve(betas[1]+betas[2]*x^7,from=-3,to=3,xlab='x',ylab='y')
points(data.frame(x,y2),col='red')
points(data.frame(x,y2.lasso.pred),col='blue')
# Least squares fit on the training set
lm.fit=lm(Apps~.,data=College[train,])
lm.pred=predict(lm.fit,newdata = College[test,])
# x collects the test MSE of each model for comparison at the end
x=c(x,lm=mean( (lm.pred-College[test,"Apps"])^2 ))
grid = 10 ^ seq(4, -2, length=100)
train.mat = model.matrix(Apps~., data=College[train,])
test.mat = model.matrix(Apps~., data=College[test,])
library(glmnet)
ridge.cv=cv.glmnet(x=train.mat,y=College[train,"Apps"],alpha=0,thresh=1e-12,lambda = grid)
ridge.fit=glmnet(x=train.mat,y=College[train,"Apps"],alpha=0,lambda = ridge.cv$lambda.min)
ridge.pred=predict(ridge.fit,newx=test.mat )
ridge.coef=predict(ridge.fit,type='coefficients',s=ridge.cv$lambda.min)
ridge.cv$lambda.min
## [1] 14.17474
ridge.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.698205e+02
## (Intercept) .
## PrivateYes -5.936518e+02
## Accept 1.280631e+00
## Enroll -4.716949e-01
## Top10perc 4.794493e+01
## Top25perc -1.219088e+01
## F.Undergrad 9.839294e-02
## P.Undergrad 7.080416e-02
## Outstate -7.050049e-02
## Room.Board 3.114784e-01
## Books 4.586375e-02
## Personal 2.040331e-03
## PhD -7.258500e+00
## Terminal -2.507214e+00
## S.F.Ratio -1.510205e+01
## perc.alumni -1.442947e+01
## Expend 6.324409e-02
## Grad.Rate 1.116685e+01
x=c(x,ridge=mean( (ridge.pred-College[test,2])^2))
lasso.cv=cv.glmnet(train.mat,y=College[train,"Apps"],alpha=1,lambda=grid,thresh=1e-12)
lasso.fit=glmnet(x=as.matrix(College[train,-c(1,2)]),y=College[train,2],alpha=1,lambda = lasso.cv$lambda.min)
lasso.coef=predict(lasso.fit,type='coefficients',s=lasso.cv$lambda.min)
lasso.cv$lambda.min
## [1] 16.29751
lasso.coef
## 17 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.364330e+03
## Accept 1.264287e+00
## Enroll .
## Top10perc 3.850869e+01
## Top25perc -4.595076e+00
## F.Undergrad 3.380132e-02
## P.Undergrad 7.214142e-02
## Outstate -8.197778e-02
## Room.Board 2.618713e-01
## Books .
## Personal .
## PhD -2.243368e+00
## Terminal .
## S.F.Ratio .
## perc.alumni -1.614927e+01
## Expend 6.392972e-02
## Grad.Rate 6.941991e+00
lasso.pred=predict(lasso.fit,newx=as.matrix(College[test,-c(1,2)]))
x=c(x,lasso=mean( (lasso.pred-College[test,2])^2))
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(2)
pcr.fit=pcr(Apps~.,data=College,scale=TRUE,validation='CV',subset=train)
validationplot(pcr.fit,val.type="MSEP")
pcr.fit$ncomp
## [1] 17
Note that pcr.fit$ncomp only reports the number of components that were fitted (here 17, the full model). From the validation plot the cross-validation error is essentially flat once about 16 components are included, so 16 components are used for the test predictions. The test error is given below.
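The component count minimizing the cross-validation error can also be read off programmatically (a short sketch using the MSEP accessor from the pls package):
# CV MSEP for 0, 1, ..., 17 components; subtract 1 because the first entry is the
# intercept-only model
cv.msep=MSEP(pcr.fit,estimate='CV')$val[1,1,]
which.min(cv.msep)-1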
pcr.pred=predict(pcr.fit,newdata = College[test,-c(2)],ncomp = 16)
x=c(x,pcr=mean( (pcr.pred-College[test,2])^2))
pls.fit=plsr(Apps~.,data=College,scale=TRUE,validation='CV',subset=train)
validationplot(pls.fit,val.type="MSEP")
pls.fit$ncomp
## [1] 17
# 10 components chosen from the validation plot above
pls.pred=predict(pls.fit,newdata = College[test,-c(2)],ncomp = 10)
x=c(x,pls=mean( (pls.pred-College[test,2])^2))
sort(x)
## lm pls ridge lasso pcr
## 1355557 1365337 1408910 1408910 1531268
From the results obtained there is not a large difference in test error between least squares, ridge regression, the lasso, and partial least squares. The lasso drives the coefficients of Books, Personal, Terminal and S.F.Ratio to zero (and ridge shrinks them heavily), and these predictors are also not statistically significant in the least squares fit.
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = College[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2842.8 -487.6 -90.9 347.3 5224.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.967e+02 5.513e+02 -0.901 0.368171
## PrivateYes -5.574e+02 1.896e+02 -2.941 0.003482 **
## Accept 1.342e+00 6.902e-02 19.451 < 2e-16 ***
## Enroll -6.678e-01 2.465e-01 -2.709 0.007070 **
## Top10perc 5.059e+01 7.358e+00 6.876 2.64e-11 ***
## Top25perc -1.422e+01 6.140e+00 -2.316 0.021121 *
## F.Undergrad 1.097e-01 3.888e-02 2.821 0.005039 **
## P.Undergrad 7.432e-02 3.893e-02 1.909 0.057031 .
## Outstate -7.834e-02 2.741e-02 -2.858 0.004501 **
## Room.Board 3.032e-01 6.717e-02 4.514 8.56e-06 ***
## Books 2.648e-02 3.577e-01 0.074 0.941029
## Personal 8.379e-03 9.253e-02 0.091 0.927894
## PhD -7.400e+00 6.474e+00 -1.143 0.253776
## Terminal -2.347e+00 7.030e+00 -0.334 0.738667
## S.F.Ratio -1.447e+01 1.823e+01 -0.794 0.427878
## perc.alumni -1.319e+01 5.674e+00 -2.325 0.020606 *
## Expend 6.334e-02 1.725e-02 3.671 0.000277 ***
## Grad.Rate 1.124e+01 4.067e+00 2.764 0.005989 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 997.5 on 370 degrees of freedom
## Multiple R-squared: 0.9288, Adjusted R-squared: 0.9256
## F-statistic: 284.1 on 17 and 370 DF, p-value: < 2.2e-16
Furthermore, principal component regression is the worst performing procedure on this dataset, which suggests that the directions of greatest variance in the predictors are not strongly related to the response.
avg_apps=mean(College[,"Apps"])
1 - mean((College[test, "Apps"] - lm.pred)^2) /mean((College[test, "Apps"] - avg_apps)^2)
## [1] 0.9180139
The best performing model therefore has a test MSE of about 1,355,557, and it explains roughly 92% of the variance in the test data.
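For completeness, the same test R squared can be computed for every model from the stored predictions (a short sketch reusing the objects fitted above):
# Test R^2 for each model, relative to predicting the mean number of applications
test.tss=mean((College[test,"Apps"]-avg_apps)^2)
sapply(list(lm=lm.pred,ridge=ridge.pred,lasso=lasso.pred,pcr=pcr.pred,pls=pls.pred),
       function(p) 1-mean((College[test,"Apps"]-p)^2)/test.tss)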
p=20
n=1000
set.seed(38)
# Generate noise
e=rnorm(n)
# Generate Observations
X=c()
for(i in 1:p)
X=cbind(X,rnorm(n,mean = sample(1:100,1),sd = sample(0:50,1)))
colnames(X)<-paste('X',1:20,sep = "")
# Generate Coefficients
betas=sample(0:100,p,replace = T)
# randomly set roughly half of the coefficients exactly to zero
betas[ !sample(0:1,replace = T,20) ]=0
names(betas)<-paste('X',1:20,sep = "")
# Generate Target Variable
y.mat=X %*% matrix(betas,nrow = 20)
y=apply(cbind(y.mat,e),1,sum)
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13600 20560 22640 22470 24300 30960
y.mat=data.frame(X,y)
print(paste("The coefficients chosen are",paste(betas,collapse = " "))) # coefficients generated in the model
## [1] "The coefficients chosen are 57 0 0 96 0 0 16 86 44 0 0 0 0 6 0 0 0 0 0 75"
library(leaps)
# Best subset selection
y.bestsub=regsubsets(y~.,data = y.mat[train,],method = 'exhaustive',nvmax = 20)
y.bestsub.summary=summary(y.bestsub)
# plot train MSE for each model
y.bestsub.mse=y.bestsub.summary$rss/length(train)  # training MSE = RSS / training-set size
plot(y.bestsub.mse,xlab="Number of Variables",ylab="Train MSE",type='l')
axis(1, at=seq(1,20,1))
# plot test MSE for each model
# Prediction method for regsubsets objects (not provided by the leaps package)
predict.regsubsets=function(object,newdata,id,...){
  # rebuild the design matrix from the formula stored in the regsubsets call
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  # coefficients of the model with `id` predictors
  coefi=coef(object,id=id)
  xvars=names(coefi)
  mat[,xvars]%*%coefi
}
ers=c()
for(i in 1:20)
  ers=c(ers, mean((y.mat[test,'y']-predict.regsubsets(y.bestsub,newdata = y.mat[test,],i))^2) )
plot(ers,type='l',xlab='Number of Predictors',ylab="Test MSE")
axis(side=1,at=1:20)
# For each model size, compute the L2 distance between the estimated coefficients
# and the true coefficients of the variables included in that model
betas.error=c()
for(i in 1:20){
  coefi=coef(y.bestsub,id=i)[-1]
  xvars=names(coefi)
  betas.error=c(betas.error,sqrt(sum((coefi-betas[xvars])^2)))
}
plot(scale(betas.error),col='red',ylab='Test MSE & RSE of Coef',
xlab="Number of Predictors",type='l',ylim = c(-3,3))
lines(scale(ers),type='l')
axis(side=1,at=1:20)
The error in the estimated coefficients roughly follows the test MSE curve; however, it identifies the models with seven or more predictors as having the lowest coefficient estimation error.
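The loop above only compares the coefficients that actually appear in each fitted model; a variant (a sketch under that interpretation) that also charges the error for true nonzero coefficients excluded from a model would be:
# Treat excluded variables as having an estimated coefficient of zero
betas.error.full=sapply(1:20,function(i){
  coefi=coef(y.bestsub,id=i)[-1]
  est=setNames(rep(0,length(betas)),names(betas))
  est[names(coefi)]=coefi
  sqrt(sum((betas-est)^2))
})
plot(betas.error.full,type='l',xlab='Number of Predictors',ylab='Coefficient estimation error')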
library(glmnet)
library(leaps)
library(pls)
library(MASS)
n=nrow(Boston)
train=sample(1:n,n/2)
test=-train
Boston.train=model.matrix(crim~.,Boston[train,])
Boston.test=model.matrix(crim~.,Boston[test,])
model.errors=c()
# Best Subset Selection
bestsub.train=regsubsets(crim~.,data=Boston[train,],method='exhaustive',nvmax = 14)
bestsub.train.summary=summary(bestsub.train)
ers=c()
for(i in 1:13)
ers=c(ers, mean((Boston[test,1]-
predict.regsubsets(bestsub.train,newdata = Boston[test,],i))^2) )
plot(ers,type='l',xlab="Number of Predictors",ylab="Test MSE")
par(mfrow=c(2,2))
for(score in c('cp','rss','adjr2','bic'))
plot(bestsub.train.summary[[score]],xlab='Number of Predictors',ylab=score,type='l')
dev.off()
## null device
## 1
bestsub.pred=predict.regsubsets(bestsub.train,newdata = Boston[test,],which.min(ers))
model.errors=c(model.errors,bestsubset=ers[which.min(ers)])
From the plot above we can see that models with a small number of predictors perform well on all criteria. This is good evidence that the lasso and ridge regression may perform well on this dataset. Furthermore, the test MSE confirms that very simple models do indeed perform well.
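Since ridge regression is mentioned above but not fitted below, here is a brief sketch of it on the same training/test split (glmnet assumed; the result is not added to model.errors):
# Ridge regression on the same split, for comparison with the lasso fit below
set.seed(37)
ridge.cv.boston=cv.glmnet(x=Boston.train,y=Boston[train,'crim'],alpha=0)
ridge.pred.boston=predict(ridge.cv.boston,newx=Boston.test,s=ridge.cv.boston$lambda.min)
mean((Boston[test,'crim']-ridge.pred.boston)^2)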
# Lasso
set.seed(37)
lasso.cv=cv.glmnet(x=Boston.train,y=Boston[train,'crim'],alpha = 1)
lasso.cv.cvm.min=which.min(lasso.cv$cvm)
plot(lasso.cv$lambda,lasso.cv$cvm,type='l',xlab='Lambda',ylab='Cross Validation MSE')
points(lasso.cv$lambda.min,lasso.cv$cvm[lasso.cv.cvm.min],pch=20,col="red",cex=1)
lasso.cv$lambda.min
## [1] 0.4810037
coef(lasso.cv,s = lasso.cv$lambda.min)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.1289195
## (Intercept) .
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis -0.1781566
## rad 0.5070153
## tax .
## ptratio .
## black .
## lstat .
## medv -0.1165434
Above you can view the cross-validation MSE as a function of lambda. A small penalty term provides the best performing model according to the training cross-validation error, resulting in a sparse model with only three predictors (dis, rad and medv).
ers=c()
for(l in lasso.cv$lambda){
lasso.pred=predict(lasso.cv,newx=Boston.test,s = l)
ers=c(ers,mean( (Boston[test,'crim']-lasso.pred)^2 ))
}
plot(lasso.cv$lambda,ers,type='l',ylab='Test MSE',xlab='Lambda')
points(lasso.cv$lambda.min,ers[which.min(lasso.cv$cvm)],pch=20,col="red",cex=1)
ers.min=which.min(ers)
points(lasso.cv$lambda[ers.min],ers[ers.min],pch=20,col="green",cex=1)
model.errors=c(model.errors,lasso=ers[ers.min])
lasso.pred=predict(lasso.cv,newx=Boston.test,s = lasso.cv$lambda.min)
From the test MSE we now see that a larger penalty term would have improved the model. The test MSE obtained with the lambda selected via cross-validation is shown in red, and the lambda with the lowest test MSE is shown in green.
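A common conservative alternative is the one-standard-error rule, which selects a larger penalty than lambda.min; a short sketch of its test MSE using the cross-validation object already fitted:
# Prediction at lambda.1se, the largest lambda within one SE of the CV minimum
lasso.pred.1se=predict(lasso.cv,newx=Boston.test,s=lasso.cv$lambda.1se)
mean((Boston[test,'crim']-lasso.pred.1se)^2)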
# PCR
set.seed(42)
pcr.fit=pcr(crim~.,validation='CV',scale=T,data=Boston[train,])
validationplot(pcr.fit,val.type="MSEP")
From the PCR cross-validation plot above we can see that the cross-validation MSE changes little between 3 and 12 components, with the minimum achieved when 9 principal components are used.
ers=c()
for(i in 1:13){
pcr.pred=predict(pcr.fit,newdata = Boston[test,],ncomp = i)
ers=c(ers,mean((Boston[test,'crim']-pcr.pred)^2))
}
plot(ers,type='l',ylab='Test MSE',xlab='Number of Components')
points(9,ers[9],col='red',pch=20)
points(which.min(ers),ers[which.min(ers)],col='green',pch=20)
model.errors=c(model.errors,pcr=ers[9])
pcr.pred=predict(pcr.fit,newdata = Boston[test,],ncomp = 9)
sort(model.errors)
## bestsubset lasso pcr
## 36.13583 36.71278 37.55686
# R squared
cor(Boston[test,'crim'],bestsub.pred)^2
## [,1]
## [1,] 0.4677941
cor(Boston[test,'crim'],pcr.pred)^2
## [1] 0.4482602
cor(Boston[test,'crim'],lasso.pred)^2
## 1
## [1,] 0.4537028
The model chosen by best subset selection is the best performing model on this data according to both test MSE and R squared, explaining about 46% of the variance in the test data. I therefore expect models that do not use all of the predictors to perform well on this dataset.