Assignment 5

Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
      i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
      
      ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
      
      iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
      
      iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
   
    The Correct answer to part A is number III Less Flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease. 

    This is due to Lasso’s solution can have a reduced variance when there is a small increase in bias while the least squares estimate bias get a higher variance. The Lasso can make the coefficient smaller as well as remove the non-essential variables for less variance & higher bias. 
      
(b) Repeat (a) for ridge regression relative to least squares.

    The Correct answer to part B is the same ass part A which is number III, Less Flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease.

    This is due to the fact that the Ridge Regression can also shrink the coefficient estimates as small as possible and leads to keeping some of  the extra variance provided by the incorporation of non-essential variables. It also makes a model with a smaller variance and larger bias relative to least squares. When compared to lasso Ridge has a smaller bias and larger variance. At times ther is an extreme situation of this depending on the size of the  λ penalty used which leads to higher penalty, higher bias, and lower variance. 

(c) Repeat (a) for non-linear methods relative to least squares.

    The Correct answer to part C is number II, more flexible and hence will give improved prediction accuracy when its increase in variance is than its decrease in bias.

    This is primarily due to the reason that non-linear regression is more flexible when it comes to the shape of the curves of the models that it can fit. Non- linear regression tends to have less bias when compared to least squares. Finally, these models tend to have higher variance and yet has the goal of increased prediction accuracy.

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR2)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.2.3

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-7

library(pls)

## Warning: package 'pls' was built under R version 4.2.3

## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings

library(dplyr)

attach(College)

str(College)

## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

pairs(College)

x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

pls.fit<-lm(Apps~., data=College, subset=train)
summary(pls.fit)

## 
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5139.5  -473.3   -21.1   353.2  7402.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -629.36179  639.35741  -0.984 0.325579    
## PrivateYes  -647.56836  192.17056  -3.370 0.000832 ***
## Accept         1.68912    0.05038  33.530  < 2e-16 ***
## Enroll        -1.02383    0.27721  -3.693 0.000255 ***
## Top10perc     48.19124    8.10714   5.944 6.42e-09 ***
## Top25perc    -10.51538    6.44952  -1.630 0.103865    
## F.Undergrad    0.01992    0.05364   0.371 0.710574    
## P.Undergrad    0.04213    0.05348   0.788 0.431373    
## Outstate      -0.09489    0.02674  -3.549 0.000436 ***
## Room.Board     0.14549    0.07243   2.009 0.045277 *  
## Books          0.06660    0.31115   0.214 0.830623    
## Personal       0.05663    0.09453   0.599 0.549475    
## PhD          -10.11489    7.11588  -1.421 0.156027    
## Terminal      -2.29300    8.03546  -0.285 0.775528    
## S.F.Ratio     22.07117   18.70991   1.180 0.238897    
## perc.alumni    2.08121    6.00673   0.346 0.729179    
## Expend         0.07654    0.01672   4.577 6.45e-06 ***
## Grad.Rate      9.99706    4.49821   2.222 0.026857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9367 
## F-statistic:   338 on 17 and 370 DF,  p-value: < 2.2e-16

pred.app<-predict(pls.fit, College.test)
test.error<-mean((College.test$Apps-pred.app)^2)
test.error

## [1] 1020100

    This shows the mean squared error for the linear model using least squares is 1020100.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam

## [1] 411.3927

ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)

## [1] 985020.1

    This shows the Mean Squared Error for the ridge regression model is 985020.1 which is slightly lower than the one with the leaner model. 

(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam

## [1] 24.66235

lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)

## [1] 1008145

out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]

##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -6.324960e+02 -4.087012e+02  1.436837e+00 -1.410240e-01  3.143012e+01 
##     Top25perc   P.Undergrad      Outstate    Room.Board      Personal 
## -8.606536e-01  1.480293e-02 -5.342495e-02  1.205819e-01  4.379135e-05 
##           PhD      Terminal     S.F.Ratio   perc.alumni        Expend 
## -5.121245e+00 -3.371192e+00  2.717231e+00 -1.039648e+00  6.838161e-02 
##     Grad.Rate 
##  4.700317e+00

    This next model shows the Mean Squared Error of the LASSO model id 1008145 higher than the previous model in part c but still lower than the model of part a.   

(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     4345     2371     2391     2104     1949     1898
## adjCV         4347     4345     2368     2396     2085     1939     1891
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1899     1880     1864      1861      1870      1873      1891
## adjCV     1893     1862     1857      1853      1862      1865      1885
##        14 comps  15 comps  16 comps  17 comps
## CV         1903      1727      1295      1260
## adjCV      1975      1669      1283      1249
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     32.6794    56.94    64.38    70.61    76.27    80.97    84.48    87.54
## Apps   0.9148    71.17    71.36    79.85    81.49    82.73    82.79    83.70
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.89     94.96     96.81     97.97     98.73     99.39
## Apps    83.86     84.08     84.11     84.11     84.16     84.28     93.08
##       16 comps  17 comps
## X        99.86    100.00
## Apps     93.71     93.95

validationplot(pcr.college, val.type="MSEP")

pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)

## [1] 1422699

  This model shows the mean squared error for the PCR model is 1422699 and  the selected number of components was 10 as it has a low cv and high varience explained< so far this is the highest mean sqaured error of all the models ran so far. 
  
(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the valueof M selected by cross-validation.

pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")

summary(pls.college)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     2178     1872     1734     1615     1453     1359
## adjCV         4347     2171     1867     1726     1586     1427     1341
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1347     1340     1329      1317      1310      1305      1305
## adjCV     1330     1324     1314      1302      1296      1291      1291
##        14 comps  15 comps  16 comps  17 comps
## CV         1305      1307      1307      1307
## adjCV      1291      1292      1293      1293
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.27    38.72    62.64    65.26    69.01    73.96    78.86    82.18
## Apps    76.96    84.31    86.80    91.48    93.37    93.75    93.81    93.84
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       85.35     87.42     89.18     91.41     92.70     94.58     97.16
## Apps    93.88     93.91     93.93     93.94     93.95     93.95     93.95
##       16 comps  17 comps
## X        98.15    100.00
## Apps     93.95     93.95

pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)

## [1] 1049868

  This model shows tje mean squared error for the PLS model is 1049868 which is simular but in the middle of the pack compared to the other models. The selected number of components was 11 and had the lowest CV and a high variance. 
  
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")

detach(College)

    When comparing the R-squared of the models they show the PCR had the lowest accuracy but not too much lower than the other models. When looking at the Mean Squared Error of the different model the Ridge Regression model has the lowest so it would be the better of model.

Question 11

We will now try to predict per capita crime rate in the Boston data set.

library(leaps)

## Warning: package 'leaps' was built under R version 4.2.3

attach(Boston)

str(Boston)

## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection

set.seed(123)
index = sample(nrow(Boston), 0.7*nrow(Boston))
boston.train<-Boston[index,]
boston.test<-Boston[-index,]

boston.bsm = regsubsets(crim ~ .,data = boston.train, nvmax = 13)
summary(boston.bsm)

## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston.train, nvmax = 13)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " "*"     " "   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

boston.test.mat = model.matrix(crim ~., data = boston.test, nvmax = 12)
val.errors = rep(NA, 12)
for (i in 1:12) {
coefi = coef(boston.bsm, id = i)
pred = boston.test.mat[, names(coefi)] %*% coefi
val.errors[i] = mean((pred - boston.test$crim)^2)
}
plot(val.errors, xlab = "Number of predictors in model", ylab = "Test MSE", pch = 19, type = "b")

which.min(val.errors)

## [1] 11

coef(boston.bsm,which.min(val.errors))

##   (Intercept)            zn         indus          chas           nox 
##  15.299873241   0.052384985  -0.054494823  -0.252605744 -10.641836652 
##            rm           dis           rad           tax       ptratio 
##   0.731054549  -1.025629494   0.683220087  -0.004538344  -0.359671426 
##         lstat          medv 
##   0.130666954  -0.263974404

boston.bsm.MSE = val.errors[11]
boston.bsm.MSE

## [1] 18.67874

      The best subset model adds 1 predictor each time in this case up to 12 predictors within all 12 models, the one that gave the best/ least Means squared error was 11. Concluding the final model selected by the best subset approach with the MSE being 18.68

Ridge Regression

boston.train.mat=model.matrix(crim~.,boston.train)
boston.test.mat=model.matrix(crim~.,boston.test)
grid=10^seq(10,-2,length=100)
set.seed(1)
bridge.mod = glmnet(boston.train.mat,boston.train$crim,alpha=0, lambda = grid, thresh=1e-12)
cv.out=cv.glmnet(boston.train.mat,boston.train$crim,alpha=0, lambda = grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam

## [1] 0.2154435

plot(cv.out)

ridge.model.1<-glmnet(boston.train.mat,boston.train$crim,alpha=0,lambda=bestlam)
coef(ridge.model.1)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  1.072632e+01
## (Intercept)  .           
## zn           4.543365e-02
## indus       -7.477200e-02
## chas        -2.817295e-01
## nox         -8.051290e+00
## rm           6.901592e-01
## age         -2.601379e-04
## dis         -8.892072e-01
## rad          5.868775e-01
## tax          6.594563e-05
## ptratio     -2.900139e-01
## lstat        1.457719e-01
## medv        -2.295003e-01

ridge.pred=predict(ridge.model.1,s=bestlam,newx=boston.test.mat)
ridge.MSE = mean((ridge.pred-boston.test$crim)^2)
ridge.MSE

## [1] 18.1551

      As output the ridge regression model runs a range of values of lambda. In this model it found the best lambda of 0.2154465 in which it got its lowest Means square error of 18.16

Lasso Regression

lasso.model<-glmnet(boston.train.mat,boston.train$crim,alpha=0,lambda=grid)
boston.cv.lasso<-cv.glmnet(boston.train.mat,boston.train$crim,alpha=0,lambda=grid)

plot(boston.cv.lasso)

boston.bestlam.lasso<-boston.cv.lasso$lambda.min
boston.bestlam.lasso

## [1] 0.1629751

lasso.model.1<-glmnet(boston.train.mat,boston.train$crim,alpha=0,lambda=boston.bestlam.lasso)
coef(lasso.model.1)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) 11.6672209672
## (Intercept)  .           
## zn           0.0468570045
## indus       -0.0720333874
## chas        -0.2725308738
## nox         -8.5989190675
## rm           0.6993673676
## age         -0.0001445266
## dis         -0.9184581671
## rad          0.6059432441
## tax         -0.0007972230
## ptratio     -0.3048771521
## lstat        0.1428859802
## medv        -0.2366344026

boston.pred.newlasso<-predict(lasso.model,s=boston.bestlam.lasso,newx=boston.test.mat)

lasso.MSE<-mean((boston.test$crim-boston.pred.newlasso)^2)
lasso.MSE

## [1] 18.24209

    As found in the Lasso regression model the lambda was found of 0.1629751 with the Means squared error being 18.24. Which the lamba is lower than the previous model and the MSE is higher.

PCR

boston.pcr.model<-pcr(crim~.,data=boston.train,scale=TRUE,validation="CV")
summary(boston.pcr.model)

## Data:    X dimension: 354 12 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.525    8.068    8.053    7.676    7.574    7.564    7.541
## adjCV        9.525    8.066    8.050    7.673    7.569    7.561    7.537
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       7.491    7.427    7.424     7.430     7.376     7.292
## adjCV    7.519    7.420    7.417     7.423     7.367     7.282
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.92    64.72    73.86    80.98    87.04    90.43    92.81    95.13
## crim    28.57    29.01    35.55    37.36    37.72    38.42    38.72    40.97
##       9 comps  10 comps  11 comps  12 comps
## X       96.92     98.38     99.50    100.00
## crim    41.14     41.14     42.64     43.88

validationplot(boston.pcr.model,val.type="MSEP")

boston.pcr.model$ncomp

## [1] 12

boston.pcr.12.model= pcr(crim~.,data=boston.train,scale=TRUE,ncomp=12)
summary(boston.pcr.12.model)

## Data:    X dimension: 354 12 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 12
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.92    64.72    73.86    80.98    87.04    90.43    92.81    95.13
## crim    28.57    29.01    35.55    37.36    37.72    38.42    38.72    40.97
##       9 comps  10 comps  11 comps  12 comps
## X       96.92     98.38     99.50    100.00
## crim    41.14     41.14     42.64     43.88

boston.predict.pcr<-predict(boston.pcr.model,boston.test,ncomp=12)
pcr.MSE<-mean((boston.test$crim-boston.predict.pcr)^2)
pcr.MSE

## [1] 18.67968

    As found in the PCR model shows the mean squared error for the PCR model is 18.68 and  the selected number of components was 12.

Model Comparision

model.names = c("Best subset selection", "Lasso", "Ridge regression", "PCR")
error.matrix = c(boston.bsm.MSE, lasso.MSE, ridge.MSE, pcr.MSE)
data.frame(model.names, error.matrix)

##             model.names error.matrix
## 1 Best subset selection     18.67874
## 2                 Lasso     18.24209
## 3      Ridge regression     18.15510
## 4                   PCR     18.67968

    To Sum this comparison of the 4 models show that the Ridge Regression model has the lowest Means sqaured error of 18.16.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

    In conclusion, s shown in the running of the four different models and then model comparision, it was found the Rigde regression is the best model. This is due to it having the lowest Means Sqaure Error of the 4 models which is 18.16. To add though all four models' MSE are within the same range.

(c) Does your chosen model involve all of the features in the data set? Why or why not?
    
    Yes, the choosen model RIdge regression involves all the features in the dataset. It is shown from having a better lambda & the lower MSE which is due to it using as many of the predictors as possible and has multicollinearity like this dataset has.