2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
As λ increases, the variance of the predictions decreases at the cost of a small increase in bias. The lasso is less flexible than least squares and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii.)
The lasso performs well in settings where a small number of predictors have substantially large coefficients and the rest have coefficients that are small or essentially zero. Because the lasso performs variable selection, the resulting model is also easier to interpret.
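As a quick illustration, here is a minimal sketch on simulated data (the variables x, y and lasso.demo are made up for this demo; glmnet is the same package used in question 9 below). The lasso coefficient paths hit exactly zero as λ grows:
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)   # 10 predictors, only 2 truly matter
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(100)
lasso.demo <- glmnet(x, y, alpha = 1)   # alpha = 1 fits the lasso
plot(lasso.demo, xvar = "lambda", label = TRUE)  # paths collapse to exactly zero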
(b) Repeat (a) for ridge regression relative to least squares.
Similar to the lasso: as the tuning parameter λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in variance at the expense of an increase in bias. Ridge regression is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii.)
The key difference between ridge regression and the lasso is that ridge regression does not shrink the coefficients of the less useful variables to exactly zero. Ridge regression performs better when the response is a function of many predictors, all with coefficients of roughly equal size. It also works in situations with more predictors than observations, where least squares has no unique solution.
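Continuing the sketch above on the same simulated data, ridge regression shrinks every coefficient toward zero but never exactly to zero:
ridge.demo <- glmnet(x, y, alpha = 0)   # alpha = 0 fits ridge regression
plot(ridge.demo, xvar = "lambda", label = TRUE)  # paths approach but never hit zero
sum(coef(ridge.demo, s = max(ridge.demo$lambda))[-1] != 0)  # all 10 stay non-zero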
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods are more flexible than least squares. They will give improved prediction accuracy when their increase in variance is less than their decrease in bias. (ii.)
9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
attach(College)
View(College)
(a) Split the data set into a training set and a test set.
set.seed(1)
# 75% of the observations for training, the remainder for testing
trainingind <- sample(nrow(College), 0.75 * nrow(College))
head(trainingind)
## [1] 679 129 509 471 299 270
train<-College[trainingind, ]
test<-College[-trainingind, ]
dim(College)
## [1] 777 18
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
lm.pred=predict(lm.fit, newdata=test)
lm.err = mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604
The test MSE is 1384604 and the training R-squared is 93.36%, meaning the model explains 93.36% of the variance in the training data. Refitting with only the significant predictors might improve the adjusted R-squared.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
#For ridge regression we need glmnet.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Loaded glmnet 4.1-3
set.seed(1)
# Note: train[,-1] drops the first column (Private), so that qualitative
# predictor is excluded here; model.matrix() also adds an intercept column.
xtrain = model.matrix(Apps~., data = train[,-1])
ytrain = train$Apps
xtest = model.matrix(Apps~., data = test[,-1])
ytest = test$Apps
#Performing ridge regression
ridge.mod=glmnet(xtrain,ytrain,alpha=0)
Before predicting with the ridge model, we perform cross-validation to find the best λ.
#Performing cross-validation - by default it performs ten-fold cross-validation
set.seed(1)
cv.out=cv.glmnet(xtrain,ytrain,alpha=0)
plot(cv.out)
bestlam =cv.out$lambda.min
bestlam
## [1] 364.6228
# Predicting with the ridge model at the best lambda (364.6228)
ridge.pred = predict(ridge.mod, s = bestlam, newx = xtest)
ridge.err = mean((ridge.pred - ytest)^2)
ridge.err
## [1] 1260111
The test MSE obtained is 1260111, which is lower than that of the linear regression model.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#Performing Lasso
lasso.mod=glmnet(xtrain,ytrain,alpha=1)
#Performing cross-validation - by default it performs ten-fold cross-validation
set.seed(1)
cv.out=cv.glmnet(xtrain,ytrain,alpha=1)
plot(cv.out)
bestlam =cv.out$lambda.min
lasso.pred = predict(lasso.mod, s = bestlam, newx = xtest)
lasso.err = mean((lasso.pred - ytest)^2)
lasso.err
## [1] 1394834
The test MSE for the lasso is 1394834, higher than both the least squares and ridge models.
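Part (d) also asks for the number of non-zero coefficient estimates. A short sketch to extract them at the cross-validated λ (the exact count depends on bestlam):
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam)
lasso.coef                  # coefficients at the chosen lambda; zeros are dropped variables
sum(lasso.coef != 0) - 1    # number of non-zero coefficients, excluding the intercept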
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.1.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(2)
pcr.fit = pcr(Apps~.,data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3760 2066 2087 1861 1684 1635
## adjCV 3862 3761 2064 2087 1833 1661 1631
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1629 1578 1542 1534 1547 1551 1554
## adjCV 1627 1569 1538 1530 1543 1547 1550
## 14 comps 15 comps 16 comps 17 comps
## CV 1552 1449 1130 1073
## adjCV 1548 1432 1124 1069
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
The summary suggests that M = 10 components is a reasonable choice: the CV error reaches a local minimum there (CV RMSEP of 1534) and improves substantially only when nearly all 17 components, i.e. essentially the full least squares fit, are used.
validationplot(pcr.fit, val.type = "MSEP")
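The choice of M can also be automated instead of read off the plot; a sketch using pls::selectNcomp, whose "onesigma" rule returns the smallest M with CV error within one standard error of the minimum:
selectNcomp(pcr.fit, method = "onesigma", plot = FALSE)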
pcr.pred = predict(pcr.fit, test, ncomp = 10)
pcr.err = mean((pcr.pred - test$Apps)^2)
pcr.err
## [1] 1952693
With a test MSE of 1952693, the PCR model with M = 10 performs worse than all of the previous models.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
set.seed(2)
pls.fit = plsr(Apps~.,data = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1911 1695 1464 1362 1213 1107
## adjCV 3862 1906 1694 1458 1342 1188 1100
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1094 1078 1077 1076 1076 1075 1074
## adjCV 1089 1074 1072 1072 1072 1070 1069
## 14 comps 15 comps 16 comps 17 comps
## CV 1073 1073 1073 1073
## adjCV 1068 1069 1068 1069
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, test, ncomp = 10)
pls.err = mean((pls.pred - test$Apps)^2)
pls.err
## [1] 1392936
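The CV error flattens at around M = 10 components; with M = 10 the test MSE for PLS is 1392936, slightly below the lasso but above least squares and ridge.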
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
        col = "darkslategray1", xlab = "Regression Methods", ylab = "Test Error",
        main = "Test Errors for All Methods",
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
Based on test error, ridge regression has the lowest MSE, so the ridge model may be preferred. The differences among OLS, ridge, lasso, and PLS are modest, while PCR with M = 10 lags noticeably behind.
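To gauge how accurately we can predict applications on an absolute scale, one can also compare test R² across the models (a minimal sketch reusing the predictions computed above; the r2 helper is defined here purely for illustration):
test.avg = mean(test$Apps)
r2 = function(pred) 1 - mean((test$Apps - pred)^2) / mean((test$Apps - test.avg)^2)
c(OLS = r2(lm.pred), Ridge = r2(ridge.pred), Lasso = r2(lasso.pred),
  PCR = r2(pcr.pred), PLS = r2(pls.pred))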
11. We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
attach(Boston)
View(Boston)
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
regfit.full=regsubsets(crim ~.,Boston, nvmax = 13)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., Boston, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
reg.summary = summary(regfit.full)
The best one-variable model includes ‘rad’. The best two-variable model includes ‘rad’ and ‘lstat’.
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
points(9,reg.summary$adjr2[9], col='red', cex=2, pch=20)
plot(reg.summary$cp, xlab="Number of Variables",ylab="Cp",type="l")
plot(reg.summary$bic, xlab="Number of Variables",ylab="BIC",type="l")
which.max(reg.summary$adjr2)
## [1] 9
which.min(reg.summary$rss)
## [1] 13
which.min(reg.summary$cp)
## [1] 8
which.min(reg.summary$bic)
## [1] 3
coef(regfit.full, 9)
## (Intercept) zn indus nox dis
## 19.124636156 0.042788127 -0.099385948 -10.466490364 -1.002597606
## rad ptratio black lstat medv
## 0.539503547 -0.270835584 -0.008003761 0.117805932 -0.180593877
The adjusted R-squared suggests that the 9-variable model is best. The coefficients of the 9-variable model are shown above.
plot(regfit.full, scale = 'adjr2')
set.seed(1)
train = sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test = (!train)
regfit.full = regsubsets(crim~., data = Boston[train,],nvmax = 13)
#We need model matrix for validation
test.mat = model.matrix(crim~., data=Boston[test,])
val.errors =rep(NA,13)
for (i in 1:13) {
  coefi = coef(regfit.full, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((Boston$crim[test] - pred)^2)
}
val.errors
## [1] 53.58865 54.37254 52.66064 52.50837 51.53113 51.12540 51.04084 50.69984
## [9] 50.43627 50.52382 50.71664 50.68209 50.65678
which.min(val.errors)
## [1] 9
bsm.err=(val.errors[9])
bsm.err
## [1] 50.43627
The nine-variable model is the best model from this validation set approach; the lowest MSE reported for best subset selection is 50.43627.
The nine-variable model consists of the following variables: ‘zn’, ‘indus’, ‘nox’, ‘dis’, ‘rad’, ‘ptratio’, ‘black’, ‘lstat’ and ‘medv’.
#For ridge regression we need glmnet.
library(glmnet)
set.seed(1)
boston.x = model.matrix(crim~., data = Boston)[,-1]
boston.y = Boston$crim
# Ridge regression, with lambda chosen by ten-fold cross-validation
ridge.boston = cv.glmnet(boston.x, boston.y, alpha = 0, type.measure = "mse")
plot(ridge.boston)
coef(ridge.boston)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.017516864
## zn -0.002805664
## indus 0.034405928
## chas -0.225250602
## nox 2.249887499
## rm -0.162546004
## age 0.007343331
## dis -0.114928730
## rad 0.059813844
## tax 0.002659110
## ptratio 0.086423005
## black -0.003342067
## lstat 0.044495213
## medv -0.029124577
boston.ridge.err=ridge.boston$cvm[ridge.boston$lambda==ridge.boston$lambda.1se]
boston.ridge.err
## [1] 55.80448
The cross-validated MSE for ridge regression (evaluated at lambda.1se) is 55.80448.
#For Lasso, the alpha value is changed to 1.
library(glmnet)
set.seed(1)
#Performing lasso
lasso.boston=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(lasso.boston)
boston.lasso.err=lasso.boston$cvm[lasso.boston$lambda==lasso.boston$lambda.1se]
boston.lasso.err
## [1] 55.3338
The cross-validated MSE for the lasso (at lambda.1se) is 55.3338, slightly lower than that of ridge regression.
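It is also worth checking how many predictors the lasso retains at lambda.1se (a sketch; coef() on a cv.glmnet object uses lambda.1se by default):
boston.lasso.coef = coef(lasso.boston, s = "lambda.1se")
sum(boston.lasso.coef != 0) - 1   # predictors kept, excluding the intercept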
# Principal Components Regression
library(pls)
set.seed(2)
pcr.boston=pcr(crim~., data=Boston,scale=TRUE,validation="CV")
summary(pcr.boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.229 7.227 6.814 6.799 6.795 6.794
## adjCV 8.61 7.225 7.222 6.807 6.789 6.788 6.787
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.787 6.654 6.664 6.673 6.676 6.651 6.573
## adjCV 6.780 6.645 6.656 6.664 6.666 6.639 6.562
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
In the PCR model the CV error drops sharply up to 8 components and improves only marginally beyond that (the absolute minimum is at 13 components), so the 8-component model is a reasonable choice. Those 8 components explain 93.45% of the variance in the predictors, and the CV RMSEP of 6.654 at 8 components corresponds to an MSE of roughly 44.3, so PCR predicts comparatively well.
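For a direct comparison with the cross-validated MSEs of ridge and the lasso above, the RMSEP can be converted to an MSE (a sketch using pls::MSEP):
MSEP(pcr.boston, estimate = "CV", ncomp = 8)   # about 6.654^2, i.e. roughly 44.3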
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
The best subset selection model had a validation set error of 50.44. Ridge regression had a cross-validated error of 55.80 and the lasso 55.33. PCR had a cross-validated error of roughly 44.3 with 8 components. In short, PCR performed best. Note, however, that these estimates come from different validation schemes (a single validation split for best subset selection versus ten-fold CV for the others), so the comparison is only approximate.
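For a fully apples-to-apples check, one could evaluate PCR on the same held-out split used for best subset selection (a sketch; pcr.tt and pcr.tt.pred are illustrative names, and train/test are the logical indicators defined above):
set.seed(1)
pcr.tt = pcr(crim ~ ., data = Boston[train, ], scale = TRUE, ncomp = 8)
pcr.tt.pred = predict(pcr.tt, Boston[test, ], ncomp = 8)
mean((Boston$crim[test] - pcr.tt.pred)^2)   # test MSE, comparable to bsm.err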
(c) Does your chosen model involve all of the features in the data set? Why or why not?
The chosen model, PCR with eight components, does not use the original features directly: each principal component is a linear combination of all thirteen predictors, so every feature still contributes to the model. Only eight components are kept because that is enough to achieve good prediction accuracy with lower variance and a low cross-validated error.