Because the lasso model selects the beta or the coefficients values in such a way that the overall cross validation error decreased to a small extent. For λ>0 the coefficients will tend to reach 0. This process of shrinkage will reduce the variance of the predictions but with a small increase in bias. So there is a trade-off.The lasso limits the number of predictors, thus it reduces the inherent variance at the cost of an increase in bias.
for the same reasons as part (a). The only real difference here is in the ridge objective function RSS+λ∑pi=1β2i, where the shrinkage term for ridge regression is slightly different to that of the lasso. RSS+λ∑pi=1β2i this is the ridge objective function and the shrinkage term λ is slightly different from the lasso regression. Ridge regression do not shrink coefficients that are useless t exactly 0 like lasso but just like Lasso the shrinkage will reduce the variance at the cost of an increase in bias.
Non-linear methods are more flexible methods since they are not built on more assumptions. Since we do not consider any assumption the non linear methods have low bias that outweighs any increment in the variance. This is why there will be a high accuracy in the predictions.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
library(pls)
## Warning: package 'pls' was built under R version 4.0.4
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
df <- College
dim(df)
## [1] 777 18
sum(is.na(df))
## [1] 0
str(df)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
set.seed(3)
index = sample(1:nrow(df), 0.7*nrow(df))
train = df[index,]
test = df[-index,]
lm.fit <- lm(Apps ~ ., data = train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3377.8 -432.6 -64.3 350.0 6230.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -638.64835 494.48369 -1.292 0.19708
## PrivateYes -596.53900 164.68933 -3.622 0.00032 ***
## Accept 1.32983 0.06126 21.709 < 2e-16 ***
## Enroll -0.27428 0.21965 -1.249 0.21234
## Top10perc 52.60987 7.00024 7.515 2.46e-13 ***
## Top25perc -17.61480 5.44793 -3.233 0.00130 **
## F.Undergrad 0.04737 0.03878 1.221 0.22248
## P.Undergrad 0.04116 0.03435 1.198 0.23133
## Outstate -0.07459 0.02326 -3.206 0.00143 **
## Room.Board 0.18067 0.05973 3.025 0.00261 **
## Books -0.09611 0.26777 -0.359 0.71979
## Personal 0.02455 0.07656 0.321 0.74859
## PhD -10.13790 5.51802 -1.837 0.06674 .
## Terminal -5.64889 6.08928 -0.928 0.35400
## S.F.Ratio 23.73129 15.22420 1.559 0.11965
## perc.alumni -6.46646 5.04895 -1.281 0.20085
## Expend 0.12574 0.01798 6.992 8.29e-12 ***
## Grad.Rate 11.01750 3.54346 3.109 0.00198 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1042 on 525 degrees of freedom
## Multiple R-squared: 0.9216, Adjusted R-squared: 0.919
## F-statistic: 362.8 on 17 and 525 DF, p-value: < 2.2e-16
ols.pred <- predict(lm.fit, test)
(ols_mse <- mean((ols.pred - test$Apps)^2))
## [1] 1409723
set.seed(1)
x_train = model.matrix(Apps~.,train)[,-1]
y_train = train$Apps
x_test = model.matrix(Apps~.,test)[,-1]
y_test = test$Apps
set.seed(1)
cv.out=cv.glmnet(x_train,y_train,alpha=0, lambda = 10^seq(2, -2, length = 100))
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 29.83647
mod.ridge <- glmnet(y = y_train,
x = x_train,
alpha = 0,
lambda = 10^seq(2,-2, length = 100))
ridge.pred <- predict(mod.ridge, s = bestlam, newx = x_test)
(ridge_mse <- mean((ridge.pred - y_test)^2))
## [1] 1567301
set.seed(1)
cv.out=cv.glmnet(x_train,y_train,alpha=1,lambda = 10^seq(2, -2, length = 100)) # for lasso we put alpha = 1
plot(cv.out)
bestlam =cv.out$lambda.min
lasso.mod=glmnet(x_train,y_train,alpha=1,lambda=10^seq(2,-2, length = 100))
lasso.pred=predict(lasso.mod ,s=bestlam ,newx=x_test)
mean((lasso.pred - y_test)^2)
## [1] 1451043
lasso.coef <- predict(lasso.mod, type = "coefficients", s = cv.out$lambda.min)
round(lasso.coef, 3)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -753.803
## PrivateYes -583.680
## Accept 1.281
## Enroll .
## Top10perc 47.297
## Top25perc -13.561
## F.Undergrad 0.018
## P.Undergrad 0.037
## Outstate -0.065
## Room.Board 0.171
## Books -0.024
## Personal 0.012
## PhD -8.738
## Terminal -5.494
## S.F.Ratio 20.345
## perc.alumni -7.179
## Expend 0.121
## Grad.Rate 9.905
set.seed(5)
model.pcr <- pcr(Apps ~ .,data = train, scale = T, validation = "CV")
validationplot(model.pcr, val.type="MSEP")
model.pcr.mse <- MSEP(model.pcr, estimate = "CV")
# the minimum we got at 17 components
pcr.pred=predict(model.pcr,x_test,ncomp =17)
mean((pcr.pred-y_test)^2)
## [1] 1409723
set.seed(1)
pls.fit=plsr(Apps~.,data=train,scale=TRUE,validation ="CV")
summary (pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3665 1593 1355 1231 1205 1154 1115
## adjCV 3665 1592 1355 1229 1199 1145 1110
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1096 1093 1090 1088 1088 1089 1089
## adjCV 1092 1089 1087 1085 1085 1085 1086
## 14 comps 15 comps 16 comps 17 comps
## CV 1089 1089 1089 1089
## adjCV 1086 1086 1086 1086
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.10 42.12 63.55 67.17 70.56 74.76 78.01 81.33
## Apps 81.44 87.06 89.38 90.32 91.37 91.90 92.07 92.11
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 84.04 87.16 89.65 92.05 93.58 94.52 96.94
## Apps 92.13 92.14 92.15 92.15 92.15 92.16 92.16
## 16 comps 17 comps
## X 98.36 100.00
## Apps 92.16 92.16
validationplot(pls.fit,val.type="MSEP")
model.pls.msep <- MSEP(pls.fit, estimate = "CV")
# we got the minimum at 17 components again
pls.pred=predict(pls.fit,x_test,ncomp =17)
mean((pls.pred -y_test)^2)
## [1] 1409723
Answer: 1409723,1567301, 1451043, 1409723, 1409723 these are the test errors that are obtained from different models.
test.erros = c(1409723,1567301, 1451043, 1409723, 1409723)
min(test.erros)
## [1] 1409723
The minimum test error obtained is 1409723 from three models a linear model using least squares, PCR model, PLS model. The two models that got different and greater test errors than OLS,PCR, and PLS are ridge and lasso which got 1567301, 1451043 respectively.
library(MASS)
boston <- Boston
str(boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
library(leaps) # for resubsets() to perform best subset regression
## Warning: package 'leaps' was built under R version 4.0.4
predict.regsubsets=function(object , newdata ,id ,...){
form=as.formula (object$call [[2]])
mat=model.matrix(form ,newdata )
coefi=coef(object ,id=id)
xvars=names(coefi)
mat[,xvars]%*%coefi
}
regfit.best=regsubsets(crim~.,data=boston,nvmax=13)
k=10
set.seed(1)
folds=sample (1:k,nrow(boston),replace=TRUE)
cv.errors=matrix(NA, k,13, dimnames=list(NULL,paste(1:13)))
for(j in 1:k){
best.fit=regsubsets (crim~.,data=boston[folds!=j,],nvmax=13)
for(i in 1:13){
pred=predict.regsubsets(best.fit ,boston [folds ==j,],id=i)
cv.errors[j,i]= mean( ( boston$crim[ folds==j]-pred)^2)
}
}
mean.cv.errors=apply(cv.errors ,2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
which.min(mean.cv.errors) # we gor 12 th one
## 12
## 12
par(mfrow=c(1,1))
plot(mean.cv.errors ,type='b')
min(mean.cv.errors)
## [1] 42.46014
The minimum of CV errors is 42.46014
set.seed(1)
x_data = model.matrix(crim~.,boston)[,-1]
y_data = boston$crim
set.seed(1)
cv.out=cv.glmnet(x_data,y_data,alpha=1,lambda = 10^seq(2, -2, length = 100))
# for lasso we put alpha = 1
plot(cv.out)
bestlam =cv.out$lambda.min
lasso.mod=glmnet(x_data,y_data,alpha=1,lambda=10^seq(2,-2, length = 100))
lasso.pred=predict(lasso.mod ,s=bestlam ,newx=x_data)
mean((lasso.pred - y_data)^2)
## [1] 40.46977
The mean square error for this model is 40.46977
set.seed(1)
cv.out=cv.glmnet(x_data,y_data,alpha=0, lambda = 10^seq(2, -2, length = 100))
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.1123324
mod.ridge <- glmnet(y = y_data,
x = x_data,
alpha = 0,
lambda = 10^seq(2,-2, length = 100))
ridge.pred <- predict(mod.ridge, s = bestlam, newx = x_data)
(ridge_mse <- mean((ridge.pred - y_data)^2))
## [1] 40.35765
The mean square error for this model is 40.35765
set.seed(5)
model.pcr <- pcr(crim ~ .,data = boston, scale = T, validation = "CV")
validationplot(model.pcr, val.type="MSEP")
model.pcr.mse <- MSEP(model.pcr, estimate = "CV")
# the minimum we got at 13 components
model.pcr.mse
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## 74.13 52.00 51.89 45.53 45.42 45.47
## 6 comps 7 comps 8 comps 9 comps 10 comps 11 comps
## 46.11 46.14 44.27 44.57 44.50 44.39
## 12 comps 13 comps
## 43.64 42.60
pcr.pred=predict(model.pcr,x_data,ncomp =13)
mean((pcr.pred-y_data)^2)
## [1] 40.31607
The mean square error for this model is 40.31607
Answer: From all the models I got the least mean square error for PCR and the MSE is 40.31. But the validation set errors and the cross-validation errors we obtained or not accurate test errors because we used the full data set for training and used it for our predictions as well.
Answer: Yes I chose the PCR model and it has all the 13 variables selected. But we used all the variables in our data set that means we are simply performing the least squares and there is no dimension reduction happened. However the cv error is some what similar to the 3 component model also. So using a small component number for our model would be sufficient.