The lasso method introduces a little bias, but the trade-off is a greater decrease in variance. When lambda is sufficiently large, the L1 penalty forces some of the coefficient estimates to be exactly 0.
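As a quick illustration of that last point, here is a sketch on simulated data (the objects x.sim, y.sim, and fit.sim are made up for illustration and are not part of the assignment data): at a large enough lambda the L1 penalty drives every slope estimate to exactly 0.
library(glmnet)
# Sketch: simulated data with 5 predictors, only the first three truly matter
set.seed(1)
x.sim <- matrix(rnorm(100*5), 100, 5)
y.sim <- drop(x.sim %*% c(2, -1, 0.5, 0, 0)) + rnorm(100)
fit.sim <- glmnet(x.sim, y.sim, alpha=1)   # alpha = 1 gives the lasso
coef(fit.sim, s=max(fit.sim$lambda))       # largest lambda on the path: all slopes exactly 0
coef(fit.sim, s=min(fit.sim$lambda))       # small lambda: estimates close to least squares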
The lambda penalty in ridge regression, as in the lasso, introduces some bias into the coefficient estimates, but this increase in bias comes with a large reduction in variance compared to least squares regression.
Non-linear methods have less bias than least squares regression because they do not assume a linear relationship between the predictors and the response. This decrease in bias comes with an increase in variance, because the model can pick up shapes in the training data that may not be present in the data used for prediction.
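The code below assumes the usual setup chunk, which is not shown in this section: the packages that provide the data sets and fitting functions, plus a small mse() helper (mse() is not a base R function, so its definition here is an assumption consistent with how it is called below).
# Assumed setup (sketch): packages used below and a mean-squared-error helper
library(ISLR)    # College data
library(MASS)    # Boston data
library(glmnet)  # glmnet() and cv.glmnet() for ridge regression and the lasso
library(leaps)   # regsubsets() for best subset selection
library(pls)     # pcr() and plsr()
mse <- function(actual, predicted) mean((actual - predicted)^2)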
set.seed(1)
train = sample(1:nrow(College), nrow(College)/2)
test = (-train)
y.test = College$Apps[test]
lm9 <- lm(Apps~.,data=College,subset=train)
preds <- predict(lm9, College[test,])
mse(y.test,preds)
## [1] 1135758
The mean squared error of the simple linear regression model using all the predictors to predict the number of applications received was 1135758.
x = model.matrix(Apps~.,College)[,-1]  # predictor matrix, dropping the intercept column
y = College$Apps
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
ridge.pred=predict(ridge.mod,s=cv.out$lambda.min,newx=x[test,])
mse(y.test, ridge.pred)
## [1] 954651.1
The mean squared error obtained from the ridge regression model, using the lambda chosen by cross-validation, is 954651.1, which is 181106.9 less than the MSE of the simple linear regression model.
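For reference, the penalty selected by cross-validation and the resulting ridge coefficient estimates can be inspected from the objects above (a sketch; output not shown):
cv.out$lambda.min                                              # lambda selected by cv.glmnet
predict(ridge.mod,type="coefficients",s=cv.out$lambda.min)     # shrunken, but non-zero, estimates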
lasso.mod=glmnet(x,y,alpha=1, lambda=grid)
cv.out2=cv.glmnet(x[train,], y[train],alpha=1)
cv.out2$lambda.min
## [1] 2.165848
lasso.pred=predict(lasso.mod, s=cv.out2$lambda.min, newx=x[test ,])
mse(y.test, lasso.pred)
## [1] 954155.4
The mean squared error obtained from the lasso model, using the lambda chosen by cross-validation, is 954155.4. That is only 495.7 less than the ridge regression model's MSE, a small difference compared to the improvement both methods show over simple linear regression.
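To see which predictors the lasso zeroes out at this lambda, the coefficients can be printed the same way (a sketch; output not shown):
predict(lasso.mod,type="coefficients",s=cv.out2$lambda.min)  # rows shown as "." are exactly 0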
pcr.fit=pcr(Apps~., data=College, subset = train, scale=TRUE, validation ="CV")
validationplot(pcr.fit, val.type="MSEP")
pcr.pred=predict(pcr.fit,x[test ,],ncomp =5)
mse(y.test, pcr.pred)
## [1] 1963819
The lowest MSEP occurs at 16 components, but since the model uses 17 predictors there is little reason to use PCR just to drop one of them. From the plot, 5 components already gives a low MSEP, and additional components beyond 5 only reduce it slightly. The cross-validation error (RMSEP) with 5 components is 1979, and 5 components explain 81.15% of the variance in the response.
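The cross-validation error and percentage of variance explained quoted above can be read from the fit summary (a sketch; output not shown):
summary(pcr.fit)  # CV error (RMSEP) and % variance explained for each number of components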
pls.fit=plsr(Apps~.,data=College,subset=train,scale=TRUE,validation ="CV")
validationplot(pls.fit,val.type="MSEP")
pls.pred=predict(pls.fit,x[test,],ncomp =6)
mse(y.test, pls.pred)
## [1] 1084846
Using simple linear regression, the test MSE was 1135758. The highest MSE, 1963819, came from PCR, so fitting a simple linear regression with all the predictors predicted the test data better than reducing the dimensions to 5 components. The model that predicted best was the one fit with the lasso, with a test MSE of 954155.4, which is 181602.6 lower than the MSE of the simple linear regression model.
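As a convenience, the five test MSEs can be collected side by side from the prediction objects above (a sketch; output not shown):
# Test MSE of each method on the same held-out half of College
c(lm    = mse(y.test, preds),
  ridge = mse(y.test, ridge.pred),
  lasso = mse(y.test, lasso.pred),
  pcr   = mse(y.test, pcr.pred),
  pls   = mse(y.test, pls.pred))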
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
test = (-train)
y.test = Boston$crim[test]
Best Subset Selection
#Best Subset
regfit.best=regsubsets(crim~.,data=Boston[train,],nvmax=13)
reg.summary=summary(regfit.best)
par(mfrow=c(1,3))
plot(regfit.best ,scale="adjr2")
plot(regfit.best ,scale="Cp")
plot(regfit.best ,scale="bic")
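The model sizes favored by each criterion can also be read directly from reg.summary rather than from the plots (a sketch; output not shown):
which.max(reg.summary$adjr2)  # size with the highest adjusted R^2
which.min(reg.summary$cp)     # size with the lowest Cp
which.min(reg.summary$bic)    # size with the lowest BIC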
# Validation-set errors: test MSE for each model size chosen by best subset
test.mat=model.matrix(crim~.,data=Boston[test,])
val.errors = rep(NA, 13)
for(i in 1:13){
  coefi=coef(regfit.best, id=i)                   # coefficients of the best i-variable model
  pred=test.mat[,names(coefi)]%*%coefi            # predictions on the test set
  val.errors[i]=mean((Boston$crim[test]-pred)^2)  # test MSE
}
which.min(val.errors)
## [1] 1
coef(regfit.best,1)
## (Intercept) rad
## -2.5242729 0.6702535
val.errors[1]
## [1] 40.14557
The Lasso
x = model.matrix(crim~.,Boston)[,-1]
y = Boston$crim
grid=10^seq(10,-2,length=100)
lasso.mod=glmnet(x,y,alpha=1, lambda=grid)
cv.out2=cv.glmnet(x[train,], y[train],alpha=1)
lasso.pred=predict(lasso.mod, s=cv.out2$lambda.min, newx=x[test ,])
mse(y.test, lasso.pred)
## [1] 37.79472
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
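The lambda selected by cross-validation for this lasso fit, which is cited in the discussion at the end of the section, can be checked with (a sketch; output not shown):
cv.out2$lambda.min  # lambda chosen by cv.glmnet for the Boston lasso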
Ridge regression
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
ridge.pred=predict(ridge.mod,s=cv.out$lambda.min,newx=x[test,])
mse(y.test, ridge.pred)
## [1] 38.01174
PCR
pcr.fit=pcr(crim~., data=Boston, subset = train, scale=TRUE, validation ="CV")
validationplot(pcr.fit, val.type="MSEP")
pcr.pred=predict(pcr.fit,x[test ,],ncomp =5)
mse(y.test, pcr.pred)
## [1] 44.23119
After running best subset selection, ridge regression, the lasso, PCR, and the models suggested by best subset selection under different criteria, the model that performed best on the held-out half of the data was the one fit with the lasso. Its test mean squared error was 37.8. The lambda chosen by cross-validation for this lasso model was 0.06805595. The variable nox had the largest estimated coefficient in absolute value, and the variables age and tax were not included.
The lasso shrinks coefficients to exactly 0 when the corresponding variables have little or no effect on the response. For this data, it kept every variable except age and tax.
predict(lasso.mod,type="coefficients",s=cv.out2$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 11.548254668
## zn 0.034622094
## indus -0.063614952
## chas -0.558766844
## nox -5.887086047
## rm 0.162261311
## age .
## dis -0.724700687
## rad 0.507628586
## tax .
## ptratio -0.160962515
## black -0.007549463
## lstat 0.122386039
## medv -0.147241214