2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

As λ increases, the variance of the lasso predictions decreases at the cost of a small increase in bias. The lasso is therefore less flexible than least squares and will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii.)

The lasso performs well in settings where a small number of predictors have substantially large coefficients and the rest have small or negligible coefficients (close to zero). Because the lasso performs variable selection, the resulting model is also easier to interpret.

(b) Repeat (a) for ridge regression relative to least squares.

Similar to the lasso: as the tuning parameter λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in variance at the expense of an increase in bias. Ridge regression is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii.)

The main difference between ridge regression and the lasso is that ridge regression does not shrink the coefficients of the less useful variables exactly to zero. Ridge regression performs better when the response is a function of many predictors, all with coefficients of roughly equal size. It also works well when the number of predictors approaches or exceeds the number of observations, a setting in which least squares cannot perform well.
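The contrast between the two penalties is easy to see on simulated data (a sketch, assuming glmnet is installed; the data and the λ value are illustrative, not taken from the exercise):

library(glmnet)
set.seed(1)
# 5 strong predictors, 15 irrelevant ones
x = matrix(rnorm(100 * 20), 100, 20)
y = x %*% c(rep(3, 5), rep(0, 15)) + rnorm(100)

# At the same moderate lambda, the lasso (alpha = 1) sets the weak
# coefficients exactly to zero, while ridge (alpha = 0) only shrinks them.
coef(glmnet(x, y, alpha = 1), s = 0.5)
coef(glmnet(x, y, alpha = 0), s = 0.5)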

(c) Repeat (a) for non-linear methods relative to least squares.

Non-linear methods are more flexible than least squares and will give improved prediction accuracy when their increase in variance is less than their decrease in bias. (ii.)

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)
View(College)

(a) Split the data set into a training set and a test set.

set.seed(1) 
trainingind<-sample(nrow(College), 0.75*nrow(College))
head(trainingind)
## [1] 679 129 509 471 299 270
train<-College[trainingind, ]
test<-College[-trainingind, ]
dim(College)
## [1] 777  18
dim(train)
## [1] 582  18
dim(test)
## [1] 195  18

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit=lm(Apps~., data=train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16
lm.pred=predict(lm.fit, newdata=test)
lm.err = mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604

The test MSE is 1,384,604 and the training R-squared is 93.36%, meaning the model explains 93.36% of the variance in Apps on the training data. Refitting with only the significant predictors would simplify the model, though it cannot increase the ordinary R-squared (it may improve the adjusted R-squared).

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

#For ridge regression we need glmnet. 

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Loaded glmnet 4.1-3
set.seed(1)

# Build design matrices for glmnet (note: train[,-1] and test[,-1] drop the
# first column, Private, so that qualitative predictor is excluded here)
xtrain = model.matrix(Apps~., data = train[,-1])
ytrain = train$Apps
xtest = model.matrix(Apps~., data = test[,-1])
ytest = test$Apps

#Performing ridge regression

ridge.mod=glmnet(xtrain,ytrain,alpha=0)

Before predicting with the ridge model, we perform cross-validation to choose the best λ.

#Performing cross-validation - by default it performs ten-fold cross-validation

set.seed(1)
cv.out=cv.glmnet(xtrain,ytrain,alpha=0)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam
## [1] 364.6228
# Predicting with ridge regression at the best lambda (364.62)

ridge.pred = predict(ridge.mod, s = bestlam, newx = xtest)
ridge.err = mean((ridge.pred - ytest)^2)
ridge.err
## [1] 1260111

The test MSE obtained is 1,260,111, which is lower than that of the linear regression model.
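cv.glmnet also records a more conservative choice, lambda.1se, the largest λ whose CV error is within one standard error of the minimum; comparing it with lambda.min shows how much extra shrinkage the CV curve tolerates (a quick check on the fit above):

# lambda.min minimizes the CV error; lambda.1se is the most regularized
# model within one standard error of that minimum.
c(min = cv.out$lambda.min, one.se = cv.out$lambda.1se)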

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

#Performing Lasso

lasso.mod=glmnet(xtrain,ytrain,alpha=1)
#Performing cross-validation - by default it performs ten-fold cross-validation

set.seed(1)
cv.out=cv.glmnet(xtrain,ytrain,alpha=1)
plot(cv.out)

bestlam = cv.out$lambda.min
lasso.pred = predict(lasso.mod, s = bestlam, newx = xtest)
lasso.err = mean((lasso.pred - ytest)^2)
lasso.err
## [1] 1394834

The test MSE for the lasso is 1,394,834, higher than both the least squares and ridge models.
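Part (d) also asks for the number of non-zero coefficient estimates. One way to extract it from the fit above (a sketch; the count depends on the cross-validated λ, so no output is reproduced here):

# Coefficients at the chosen lambda; glmnet returns a sparse matrix,
# so the non-zero entries can be counted directly.
lasso.coef = predict(lasso.mod, s = bestlam, type = "coefficients")
sum(lasso.coef != 0) - 1   # subtract 1 to exclude the intercept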

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## Warning: package 'pls' was built under R version 4.1.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2)
pcr.fit = pcr(Apps~.,data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3760     2066     2087     1861     1684     1635
## adjCV         3862     3761     2064     2087     1833     1661     1631
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1629     1578     1542      1534      1547      1551      1554
## adjCV     1627     1569     1538      1530      1543      1547      1550
##        14 comps  15 comps  16 comps  17 comps
## CV         1552      1449      1130      1073
## adjCV      1548      1432      1124      1069
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36

The cross-validation error is actually lowest at M = 17 (which is equivalent to least squares), but it reaches a local minimum at M = 10 and only drops again as M approaches 17, so M = 10 offers a reasonable trade-off between dimension reduction and fit.
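The CV-minimizing M can also be read off programmatically (a sketch using the pls package's RMSEP accessor; here the global minimum is M = 17, so choosing M = 10 remains a judgment call based on the shape of the curve):

# Cross-validated RMSEP for each number of components (dropping the
# intercept-only entry), and the M that minimizes it.
cv.rmsep = RMSEP(pcr.fit)$val["CV", 1, -1]
which.min(cv.rmsep)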

validationplot(pcr.fit, val.type = "MSEP")

pcr.pred = predict(pcr.fit, test, ncomp = 10)
pcr.err = mean((pcr.pred - test$Apps)^2)
pcr.err
## [1] 1952693

With a test MSE of 1,952,693, the PCR model with M = 10 does not outperform any of the previous models.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(2)
pls.fit = plsr(Apps~.,data = train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1911     1695     1464     1362     1213     1107
## adjCV         3862     1906     1694     1458     1342     1188     1100
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1094     1078     1077      1076      1076      1075      1074
## adjCV     1089     1074     1072      1072      1072      1070      1069
##        14 comps  15 comps  16 comps  17 comps
## CV         1073      1073      1073      1073
## adjCV      1068      1069      1068      1069
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, test, ncomp = 10)
pls.err = mean((pls.pred - test$Apps)^2)
pls.err
## [1] 1392936

The cross-validation error flattens after about M = 10, so we select M = 10; the resulting test MSE is 1,392,936, slightly lower than the lasso and close to the least squares fit.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
        col = "darkslategray1", xlab = "Regression Methods",
        ylab = "Test Error", main = "Test Errors for All Methods",
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

Based on test MSE, ridge regression has the lowest error (1,260,111) and is the preferred model here. The OLS, lasso, and PLS errors are nearly identical (within about 1% of one another), ridge is roughly 9% lower, and PCR with M = 10 is noticeably worse.
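To judge practical accuracy, the test MSEs can be put back on the scale of the response (a quick check, reusing the error objects computed above):

# RMSE in units of applications: roughly 1,100-1,400 applications of
# typical prediction error across the five methods.
errs = c(OLS = lm.err, Ridge = ridge.err, Lasso = lasso.err,
         PCR = pcr.err, PLS = pls.err)
sqrt(errs)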

11. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
attach(Boston)
View(Boston)
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
regfit.full=regsubsets(crim ~.,Boston, nvmax = 13)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., Boston, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
reg.summary = summary(regfit.full)

The best one-variable model includes ‘rad’. The best two-variable model includes ‘rad’ and ‘lstat’.

par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
points(9,reg.summary$adjr2[9], col='red', cex=2, pch=20)
plot(reg.summary$cp, xlab="Number of Variables",ylab="Cp",type="l")
plot(reg.summary$bic, xlab="Number of Variables",ylab="BIC",type="l")

which.max(reg.summary$adjr2)
## [1] 9

which.min(reg.summary$rss)
## [1] 13
which.min(reg.summary$cp)
## [1] 8
which.min(reg.summary$bic)
## [1] 3
coef(regfit.full, 9)
##   (Intercept)            zn         indus           nox           dis 
##  19.124636156   0.042788127  -0.099385948 -10.466490364  -1.002597606 
##           rad       ptratio         black         lstat          medv 
##   0.539503547  -0.270835584  -0.008003761   0.117805932  -0.180593877

The adjusted R-squared suggests that the 9-variable model is best. The coefficients of the 9-variable model are shown above.

plot(regfit.full, scale = 'adjr2')

set.seed(1)
train = sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test = (!train)
regfit.full = regsubsets(crim~., data = Boston[train,],nvmax = 13)

#We need model matrix for validation

test.mat = model.matrix(crim~., data=Boston[test,])
val.errors =rep(NA,13)
for(i in 1:13){
  coefi=coef(regfit.full, id = i)
  pred=test.mat[, names(coefi)]%*%coefi
  val.errors[i]=mean((Boston$crim[test]-pred)^2)
}

val.errors
##  [1] 53.58865 54.37254 52.66064 52.50837 51.53113 51.12540 51.04084 50.69984
##  [9] 50.43627 50.52382 50.71664 50.68209 50.65678
which.min(val.errors)
## [1] 9
bsm.err = val.errors[9]
bsm.err
## [1] 50.43627

The nine-variable model is the best model under this validation-set approach; its validation MSE of 50.43627 is the lowest across all subset sizes.

The nine-variable model consists of the following variables: ‘zn’, ‘indus’, ‘nox’, ‘dis’, ‘rad’, ‘ptratio’, ‘black’, ‘lstat’, and ‘medv’.
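The matrix-multiplication step in the loop above can be wrapped in a reusable helper, since regsubsets objects have no built-in predict method; a sketch of the usual idiom:

# Helper (equivalent to the loop body above): predict from a regsubsets
# fit of size id by rebuilding the model matrix for the new data.
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the formula (crim ~ .)
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}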

#For ridge regression we need glmnet. 

library(glmnet)
set.seed(1)

boston.x = model.matrix(crim~., data = Boston)[,-1]
boston.y = Boston$crim

#Performing ridge regression

ridge.boston=cv.glmnet(boston.x,boston.y,alpha=0, type.measure = "mse")
plot(ridge.boston)

coef(ridge.boston)   # coefficients at the default s = "lambda.1se"
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.017516864
## zn          -0.002805664
## indus        0.034405928
## chas        -0.225250602
## nox          2.249887499
## rm          -0.162546004
## age          0.007343331
## dis         -0.114928730
## rad          0.059813844
## tax          0.002659110
## ptratio      0.086423005
## black       -0.003342067
## lstat        0.044495213
## medv        -0.029124577
boston.ridge.err=ridge.boston$cvm[ridge.boston$lambda==ridge.boston$lambda.1se]
boston.ridge.err
## [1] 55.80448

The cross-validated mean squared error for ridge regression (at λ = lambda.1se) is 55.80448.

#For Lasso, the alpha value is changed to 1. 

library(glmnet)
set.seed(1)


#Performing lasso

lasso.boston=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(lasso.boston)

boston.lasso.err=lasso.boston$cvm[lasso.boston$lambda==lasso.boston$lambda.1se]
boston.lasso.err
## [1] 55.3338

The cross-validated mean squared error for the lasso is 55.3338, slightly lower than that of ridge regression.
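To see which predictors the lasso actually retains at λ = lambda.1se (a sketch; the retained set depends on the CV fit above, so no output is reproduced):

# Non-zero rows of the sparse coefficient matrix are the retained predictors
coef(lasso.boston, s = "lambda.1se")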

# Principal Components Regression

library(pls)
set.seed(2)
pcr.boston=pcr(crim~., data=Boston,scale=TRUE,validation="CV")
summary(pcr.boston)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.229    7.227    6.814    6.799    6.795    6.794
## adjCV         8.61    7.225    7.222    6.807    6.789    6.788    6.787
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.787    6.654    6.664     6.673     6.676     6.651     6.573
## adjCV    6.780    6.645    6.656     6.664     6.666     6.639     6.562
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

In the PCR model, the cross-validated error is lowest at M = 13 (RMSEP 6.573), but it changes very little beyond M = 8 (RMSEP 6.654), so M = 8 is a reasonable choice. The 8-component model captures 93.45% of the variance in the predictors and 42.47% of the variance in crim. Its cross-validated MSE is 6.654^2 ≈ 44.28, the lowest of the methods tried, so PCR predicts quite well here.
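Squaring the reported RMSEP puts PCR on the same MSE scale as the other methods; the pls package can also report MSEP directly (a sketch against the fit above):

# Cross-validated MSEP per number of components (dropping the
# intercept-only entry); the 8th entry corresponds to the M = 8 model.
msep = MSEP(pcr.boston)$val["CV", 1, -1]
msep[8]   # about 6.654^2 = 44.3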

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

The best subset selection model had a validation set error of 50.44. Ridge regression had a cross-validated error of 55.80, and the lasso 55.33. PCR had a cross-validated error of about 44.28 with the 8-component model. In short, PCR performed best, with the caveat that the best subset figure comes from a single validation split while the others come from 10-fold cross-validation.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

The chosen model was the 8-component PCR fit. It does not estimate a separate coefficient for each of the thirteen features: we keep only eight components in order to reduce variance and validation error. However, each principal component is a linear combination of all thirteen predictors, so every feature still contributes to the predictions indirectly.