Problem 2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is:
The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (iii).
The lasso avoids overfitting by using a less flexible fitting approach than least squares. The lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.
(b) For ridge regression relative to least squares.
Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (iii).
Similar to the lasso, ridge regression also avoids overfitting by using a less flexible fitting approach than least squares. Ridge regression can perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.
(c) For non-linear methods relative to least squares.
Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias (ii).
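
To make the tradeoff concrete, here is a minimal simulation sketch (not part of the original answers; the data-generating setup is invented for illustration). It refits a ridge model and an essentially unpenalized model on 200 resampled training sets and compares the variance of their predictions at one fixed test point:

library(glmnet)
set.seed(1)
n = 50; p = 20
beta = rep(0.5, p)
x0 = matrix(rnorm(p), nrow=1)                # one fixed test point
preds = replicate(200, {
  X = matrix(rnorm(n*p), n, p)               # a fresh training sample
  y = drop(X %*% beta + rnorm(n))
  fit = glmnet(X, y, alpha=0, lambda=c(10, 0))
  drop(predict(fit, newx=x0, s=c(0, 10)))    # predictions at lambda = 0 and 10
})
apply(preds, 1, var)  # the penalized fit (second entry) varies far less across samples

The shrunken predictions are biased toward zero, but their variance across training samples is much smaller, which is exactly the tradeoff options (ii) and (iii) describe.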
Problem 9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)

(a) Split the data set into a training set and a test set.

set.seed(1)
train=sample(c(TRUE,FALSE),nrow(College),replace=TRUE)
test=(!train)

College.train = College[train,]
College.test = College[test,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit = lm(Apps~., data=College.train)
lm.pred = predict(lm.fit, College.test, type="response")
mean((lm.pred-College.test$Apps)^2)
## [1] 984743.1

The test error of the linear model fit is 984743.1.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet) 
set.seed(1)
train.mat = model.matrix(Apps~., data = College.train)
test.mat = model.matrix(Apps~., data = College.test)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(train.mat,College.train$Apps,alpha=0,lambda=grid)
cv.out = cv.glmnet(train.mat,College.train$Apps,alpha=0)
bestlam = cv.out$lambda.min
bestlam
## [1] 394.2365
summary(ridge.mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1800   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
mean((ridge.pred - College.test$Apps)^2)
## [1] 940753.5

The test error of the ridge regression fit with a lambda selected by cross-validation is 940753.5, which is less than the test error of the linear model.
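
As an aside (not part of the original answer), cv.glmnet also stores lambda.1se, the largest λ whose cross-validated error is within one standard error of the minimum; it is a common, more conservative alternative to lambda.min:

cv.out$lambda.1se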

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(train.mat,College.train$Apps,alpha=1,lambda=grid)
summary(lasso.mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1800   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv.outl2=cv.glmnet(train.mat,College.train$Apps,alpha=1)
bestlam2=cv.outl2$lambda.min
bestlam2
## [1] 2.309051
lasso.pred=predict(lasso.mod,s=bestlam2,newx = test.mat)
mean((lasso.pred-College.test$Apps)^2)
## [1] 978962.6

The test error of the lasso model fit with a lambda chosen by cross-validation is 978962.6, which is lower than the linear model test error but higher than the ridge regression test error.

out=glmnet(train.mat,College.train$Apps,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam2)[1:18,]
lasso.coef[lasso.coef!=0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -173.89023228 -649.94882654    1.67083516   -0.75305706   57.37667424 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -20.72078908    0.01792667    0.03670117   -0.08378483    0.10937820 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.05923882   -0.01840355   -9.09196597   -6.37371755   20.24530218 
##   perc.alumni        Expend 
##    8.95013487    0.08288697
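
Part (d) also asks for the number of non-zero coefficient estimates. A direct count (a sketch, not in the original answer) is more reliable than counting the printout by hand; note that the [1:18,] subsetting above appears to drop the last row of the coefficient matrix (Grad.Rate), since model.matrix adds its own intercept column, so counting the full vector is safer:

lasso.all = predict(out, type="coefficients", s=bestlam2)
sum(lasso.all != 0) - 1  # non-zero predictors, excluding the intercept row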

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data:    X dimension: 393 17 
##  Y dimension: 393 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4189     4147     2344     2340     2099     2048     1978
## adjCV         4189     4146     2338     2335     2092     2073     1968
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1972     1949     1889      1893      1895      1906      1913
## adjCV     1964     1933     1875      1881      1884      1892      1900
##        14 comps  15 comps  16 comps  17 comps
## CV         1944      1853      1396      1404
## adjCV      1960      1812      1379      1386
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.858    57.44    64.20    69.91    75.10    80.17    83.82    87.30
## Apps    4.353    70.99    71.18    76.84    78.34    81.03    81.59    82.21
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.26     92.74     94.79     96.70     97.76     98.67     99.37
## Apps    83.31     83.97     83.97     84.34     84.58     84.70     91.28
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.83     93.02
validationplot(pcr.college, val.type="MSEP")

pcr.pred=predict(pcr.college,College.test,ncomp=10)
mean((pcr.pred-College.test$Apps)^2)
## [1] 1682909

The test error of the PCR model with M = 10 components is 1682909, which underperforms all of the previous models.
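
Rather than reading M off the validation plot, the CV-minimizing number of components can be extracted programmatically. A small sketch using the pls package's RMSEP accessor (the exact indexing convention is my assumption and worth double-checking):

cv.rmsep = RMSEP(pcr.college, estimate="CV")
which.min(cv.rmsep$val["CV", 1, ]) - 1  # subtract 1 for the 0-component (intercept-only) entry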
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")

summary(pls.college)
## Data:    X dimension: 393 17 
##  Y dimension: 393 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4189     2180     1909     1755     1696     1455     1332
## adjCV         4189     2171     1906     1741     1665     1431     1318
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1310     1301     1298      1298      1297      1298      1297
## adjCV     1298     1289     1287      1286      1285      1286      1285
##        14 comps  15 comps  16 comps  17 comps
## CV         1297      1296      1296      1296
## adjCV      1285      1285      1285      1285
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.01    44.96    62.49    65.22    68.52    72.89    77.13    80.46
## Apps    75.74    82.40    86.74    90.58    92.34    92.79    92.88    92.93
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.45     84.76     88.08     90.76     92.80     94.45     97.02
## Apps    92.98     93.00     93.01     93.01     93.02     93.02     93.02
##       16 comps  17 comps
## X        98.03    100.00
## Apps     93.02     93.02

Around M = 8 appears to be where the MSEP of the PLS dimension reduction levels off, while M = 9 has a slightly lower CV error, so I'll try both and assess which performs better after looking at the summary.

pls.pred=predict(pls.college,College.test,ncomp=8)
mean((pls.pred-College.test$Apps)^2)
## [1] 978534.3
pls.pred=predict(pls.college,College.test,ncomp=9)
mean((pls.pred-College.test$Apps)^2)
## [1] 1007163

M = 8 results in the best-performing PLS model, with a test error of 978534.3.
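
The pls package also offers selectNcomp(), which automates this choice; a quick sketch (not part of the original answer) using its one-sigma heuristic:

selectNcomp(pls.college, method="onesigma", plot=FALSE)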
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
The models' test errors, from best to worst, are as follows:
Ridge regression (940753.5)
PLS model (978534.3)
Lasso model (978962.6)
Linear model (984743.1)
PCR model (1682909)
The test errors of PLS, lasso, and the linear model are fairly similar to one another, while PCR performs significantly worse and ridge performs a little better. To say how accurately we can predict the number of applications received, let's compute the ridge regression R-squared value.

sum.squares = sum((mean(College.test$Apps) - College.test$Apps)^2)
sum.residuals = sum((ridge.pred - College.test$Apps)^2)
1 - (sum.residuals)/(sum.squares)
## [1] 0.9241129

The R-squared value of ridge regression tells us that this model explains about 92.4% of the variance in Apps.

detach(College)

Problem 11. We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
library(leaps)
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

set.seed(1)
train=sample(c(TRUE,FALSE),nrow(Boston),replace=TRUE)
test=(!train)

Boston.train = Boston[train,]
Boston.test = Boston[test,]

Best Subset Selection:

regfit.full = regsubsets(crim~.,data=Boston.train,nvmax=13)
reg.summary = summary(regfit.full)

par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")

plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
p = which.max(reg.summary$adjr2)
points(p,reg.summary$adjr2[p], col="red",cex=2,pch=20)

plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
p = which.min(reg.summary$cp)
points(p,reg.summary$cp[p],col="red",cex=2,pch=20)

plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
p = which.min(reg.summary$bic)
points(p,reg.summary$bic[p],col="red",cex=2,pch=20)

It appears that both p = 3 and p = 7 could be good options, so I will fit models using both suggested subsets and evaluate them on the test dataset.
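
Alternatively (a sketch, not part of the original answer), every subset size can be scored on the test set directly, using the usual predict method for regsubsets objects from the ISLR labs:

predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the formula from the regsubsets call
  mat = model.matrix(form, newdata)
  coefi = coef(object, id=id)
  mat[, names(coefi)] %*% coefi
}
val.errors = sapply(1:13, function(i) {
  mean((Boston.test$crim - predict(regfit.full, Boston.test, id=i))^2)
})
which.min(val.errors)  # subset size with the lowest validation-set MSE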

summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston.train, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " "   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" " " "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
mean((mean(Boston.test$crim)-Boston.test$crim)^2)  # baseline test MSE: always predicting the mean of crim
## [1] 83.74891
lm.fit = lm(crim~rad+black+lstat, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 52.66064

The test error of the linear model fit with the 3 best predictors is 52.66.

lm.fit = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 51.04084

The test error of the linear model fit with the top 7 predictors is 51.04, which, in my opinion, is not enough of an improvement over the three-predictor model to justify the added complexity of four extra variables.
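
A nested-model F-test (a quick sketch, not part of the original answer) gives a formal check on whether the four extra variables earn their keep:

fit3 = lm(crim~rad+black+lstat, data=Boston.train)
fit7 = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
anova(fit3, fit7)  # F-test for the four added predictors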
Lasso:

set.seed(1)
train.mat = model.matrix(crim~., data = Boston.train)
test.mat = model.matrix(crim~., data = Boston.test)
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=1)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01973085
grid=10^seq(10,-2,length=100)
lasso.mod = glmnet(train.mat,Boston.train$crim,alpha=1,lambda = grid)
lasso.pred = predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred - Boston.test$crim)^2)
## [1] 50.73601

The test error of the lasso model fit with a lambda chosen by cross-validation is 50.74, which is lower than both linear models fit using the best subsets. We can look at which variables were not shrunk to zero.

lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[1:15,]
lasso.coef[lasso.coef!=0]
## (Intercept)          zn       indus        chas         nox          rm 
## 15.97323390  0.04355463 -0.10777115 -0.14788387 -6.31552576 -0.24437358 
##         age         dis         rad     ptratio       black       lstat 
##  0.01121919 -0.74123953  0.47693745 -0.15627085 -0.01470577  0.11441311 
##        medv 
## -0.11381509

Ridge Regression:

cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=0)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.5240686
ridge.mod = glmnet(train.mat,Boston.train$crim,alpha = 0, lambda = grid)
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
mean((ridge.pred - Boston.test$crim)^2)
## [1] 51.45778

The test error of the ridge regression fit with a lambda chosen by cross-validation is 51.46, higher than the lasso test error.
PCR:

pcr.boston=pcr(crim~., data=Boston.train,scale=TRUE,validation="CV")
summary(pcr.boston)
## Data:    X dimension: 262 13 
##  Y dimension: 262 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.065    6.594    6.607    6.076    6.027    6.048    6.104
## adjCV        8.065    6.591    6.603    6.070    6.017    6.042    6.094
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.092    6.010    5.968     6.018     6.042     6.005     5.958
## adjCV    6.081    5.996    5.949     6.003     6.027     5.988     5.941
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       44.58    58.02    68.58    75.64    82.03    87.37    90.66    93.02
## crim    33.90    34.29    44.32    44.91    45.57    45.62    46.13    48.13
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.01     96.87      98.4     99.54    100.00
## crim    48.80     48.96      49.0     49.91     50.71
validationplot(pcr.boston, val.type = "MSEP")

pcr.pred=predict(pcr.boston,Boston.test,ncomp=8)
mean((pcr.pred-Boston.test$crim)^2)
## [1] 53.17176

The test error for the PCR model is 53.17, which is higher than for all the other regression methods.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Based on the results in part (a) above, I would suggest the linear model with the best subset of 3 predictors as the optimal model. Although the lasso and ridge models shown above achieve slightly lower test errors than the three-variable linear model, they still use 12 and 13 predictors, respectively; the small reduction in test error is hard to justify against adding 9–10 extra variables.
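
If the three-predictor model is the final proposal, a natural last step (a sketch, not part of the original answer) is to refit it on the full Boston data so the reported coefficients use all available observations:

final.fit = lm(crim~rad+black+lstat, data=Boston)
coef(final.fit)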
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No, not all of the features are involved. The chosen model comes from best subset selection and, as mentioned above, contains only three of the 13 potential predictors.

detach(Boston)