Chapter 06 (page 259): 2, 9, 11

Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Q3(a) The lasso, relative to least squares, is:
A3(a) iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance:
Justification: Lasso increases bias by constrains or regularizes the coefficient estimates, in doing so coefficients shrink their value and with LASSO have the potential to shrink to 0, allowing the shrinkage methods can also perform variable selection. The constraint should improve the fit, because shrinking the coefficients can significantly reduce their variance.
Q3(b) Repeat (a) for ridge regression relative to least squares.
A3(b) iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Justification: Same reason for LASSO, main difference is LASSO’s method of constraint can bring coefficients to 0, while Ridge Regression can bring them close, but not 0 (due to the method of their shrinkage being different)
Q3(c) Repeat (a) for non-linear methods relative to least squares.
A3(c) ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Justification: Non-linear methods are more flexible than least squares. With the flexibility of non-linear come an decrease in bias along with an increase in variance. The trade off is finding the point when the increase in variance is less than the decrease in bias.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)

Q9(a) Split the data set into a training set and a test set.
A9(a)

set.seed(1)
train=sample(1:nrow(College),nrow(College)/2)
test=(-train)
y.test=College$Apps[test]

Q9(b) Fit a linear model using least squares on the training set, and report the test error obtained.
A9(b) The MSE for the linear model is 1135758.

set.seed(1)
lm.fit = lm(Apps~., 
            data=College, 
            subset=train)
lm.pred = predict(lm.fit, College[test,])
mean((y.test - lm.pred)^2)
## [1] 1135758

Q9(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
A9(c) Cross validation using glmnet chose the best value of λ as 405.8404. Using this λ the MSE for the ridge regression model is 976647.8

set.seed(1)
library(glmnet)
x=model.matrix(Apps ~ ., College)[, -1]
y=College$Apps

grid=10^seq(10,-2,length=100)

ridge.mod=glmnet(x[train,],y[train],alpha=0, lambda=grid, thresh=1e-12)
cv.out=cv.glmnet(x[train,], y[train], alpha=0)
bestlam=cv.out$lambda.min

ridge.pred=predict(ridge.mod,s=bestlam,newx = x[test,])
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2) 
## [1] 976647.8
bestlam
## [1] 405.8404

Q9(d) Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
A9(d) Cross validation chosen the best value of λ as 1.97344. Using this λ the MSE is 1032618 and the number of non-zero coefficient estimates (not including the intercept) are 17.

set.seed(1)
lasso.mod=glmnet(x[train,],y[train],alpha=1, lambda=grid, thresh=1e-1)
plot(lasso.mod)

cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam = cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx = x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1032618
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out, type = "coefficient",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -468.96036214 -491.47351867    1.57117027   -0.76858284   48.25732969 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -12.94716203    0.04293538    0.04406960   -0.08340724    0.14963683 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.01549548    0.02915128   -8.42538012   -3.26671319   14.61167079 
##   perc.alumni        Expend     Grad.Rate 
##   -0.02612797    0.07718333    8.31579869
bestlam
## [1] 1.97344

Q9(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
A9(e) The PCR model has a MSE of 1135758 using 17 for the value of M (selected by cross validation).

library(pls)
set.seed(1)
pcr.fit=pcr(Apps~., 
            data=College, 
            subset= train, 
            scale=TRUE, 
            validation="CV")
summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     4006     2373     2372     2069     1961     1919
## adjCV         4288     4007     2368     2369     1999     1948     1911
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1919     1921     1876      1832      1832      1836      1837
## adjCV     1912     1915     1868      1821      1823      1827      1827
##        14 comps  15 comps  16 comps  17 comps
## CV         1853      1759      1341      1270
## adjCV      1850      1733      1326      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       32.20    57.78    65.31    70.99    76.37    81.27     84.8    87.85
## Apps    13.44    70.93    71.07    79.87    81.15    82.25     82.3    82.33
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.62     92.91     94.98     96.74     97.79     98.72     99.42
## Apps    83.38     84.76     84.80     84.84     85.11     85.14     90.55
##       16 comps  17 comps
## X        99.88    100.00
## Apps     93.42     93.89
validationplot(pcr.fit, val.type = "MSEP", legendpos = "top")

MSEP(pcr.fit)
##        (Intercept)   1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV        18390649  16051334  5630366  5627415  4279138  3845707  3681767
## adjCV     18390649  16059941  5609499  5610221  3996071  3795843  3650145
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV     3682813  3692142  3519204   3354439   3356321   3369472   3374680
## adjCV  3654280  3667606  3490660   3317249   3323781   3338697   3339500
##        14 comps  15 comps  16 comps  17 comps
## CV      3433208   3092457   1799505   1611684
## adjCV   3423289   3004430   1758941   1579659
pcr.pred=predict(pcr.fit, x[test,], ncomp=17)
mean((pcr.pred - y.test)^2)
## [1] 1135758

Q9(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
A9(f) The PLS model has a MSE of 1124003 using 11 for the value of M (selected by cross validation).

set.seed(1)
pls.fit=plsr(Apps~., 
             data=College, 
             subset= train, 
             scale=TRUE, 
             validation="CV")
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     2217     2019     1761     1630     1533     1347
## adjCV         4288     2211     2012     1749     1605     1510     1331
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1309     1303     1286      1283      1283      1277      1271
## adjCV     1296     1289     1273      1270      1270      1264      1258
##        14 comps  15 comps  16 comps  17 comps
## CV         1270      1270      1270      1270
## adjCV      1258      1257      1257      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       27.21    50.73    63.06    65.52    70.20    74.20    78.62    80.81
## Apps    75.39    81.24    86.97    91.14    92.62    93.43    93.56    93.68
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.29     87.17     89.15     91.37     92.58     94.42     96.98
## Apps    93.76     93.79     93.83     93.86     93.88     93.89     93.89
##       16 comps  17 comps
## X        98.78    100.00
## Apps     93.89     93.89
validationplot(pls.fit, val.type="MSEP") 

MSEP_object = MSEP(pls.fit)
pls.pred=predict(pls.fit, x[test,], ncomp=11)
mean((pls.pred - y.test)^2)
## [1] 1124003

Q9(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
A9(g) The results from the 5 approaches in order of lowest test errors are: 1. Ridge Regression - 976647.8, 2. LASSO - 1032618, 3. PLS - 1124003, and tied for last is both PCR & Least Square Model - 1135758. The MSE for all of the tests are very high, the R-squared (helps explain proportion of variance) for all are above .90, with ridge regression having the highest R-squared of .915 which is strong.

#Least Square model
test.avg = mean(y.test)
lm.r2 =1 - mean((lm.pred- y.test)^2)  / mean((test.avg - y.test)^2)
lm.r2
## [1] 0.9015413
ridge.r2 =1 - mean((ridge.pred- y.test)^2) / mean((test.avg - y.test)^2)
ridge.r2
## [1] 0.9153346
lasso.r2 =1 - mean((lasso.pred- y.test)^2)  / mean((test.avg - y.test)^2)
lasso.r2
## [1] 0.9104826
pcr.r2 =1 - mean((pcr.pred- y.test)^2) / mean((test.avg - y.test)^2)
pcr.r2
## [1] 0.9015413
pls.r2 =1 - mean((pls.pred - y.test)^2) / mean((test.avg - y.test)^2)
pls.r2
## [1] 0.9025604
detach(College)

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(ISLR2)
attach(Boston)
sum(is.na(Boston))
## [1] 0
set.seed(1)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=(-train)
y.test=Boston$crim[test]
attach(Boston)

Q11(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
A11(a) Results from the approaches include:
Best Subset Selection - 6 components would fit my model best, with 6 components my Adjusted R Squared is .4353996 (slightly smaller than 7 components at .4373523), my MSE is smallest at 37.50905 (smaller than 5 and 7 components at 37.60826 and 37.83619). Ridge Regression - The ridge regression chose a best lambda of 0.5919159, using 12 components, the MSE is 40.16656 and the R-squared is 0.4950061 LASSO - The LASSO chose a best lambda of 0.01850161, using 11 components,the MSE is 41.01241 and the R-squared is 0.5018015. PCR - The lowest MSEP occurs with 12 components, with the 12 components the MSE is 41.19923 and the R-squared is .5021. PLS - The lowest MSEP occurs with 9 components, with the 9 components the MSE is 42.09646 and the R-squared is .5020.

Best Subset Selection

set.seed(1)
library(leaps)
regfit.full = regsubsets(crim ~ ., data = Boston, nvmax = 13)
regfull_summary = summary(regfit.full)
names(regfull_summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
regfull_summary$rsq
##  [1] 0.3912567 0.4207965 0.4244920 0.4334892 0.4378328 0.4421077 0.4451514
##  [8] 0.4470236 0.4482891 0.4487917 0.4493353 0.4493378
par(mfrow = c(2,2))
plot(regfull_summary$cp) 
plot(regfull_summary$bic) 
plot(regfull_summary$adjr2)


cp_min = which.min(summary(regfit.full)$cp)
cp_min
## [1] 7
bic_min = which.min(summary(regfit.full)$bic)
bic_min
## [1] 2
adj_r2_max = which.max(summary(regfit.full)$adjr2)
adj_r2_max
## [1] 9
coef(regfit.full,7) 
## (Intercept)          zn         nox         dis         rad     ptratio 
##  17.4668235   0.0449679 -12.4578238  -0.9425497   0.5615224  -0.3470306 
##       lstat        medv 
##   0.1147895  -0.1902559

test.mat=model.matrix(crim~.,data = Boston[test,])

val.errors=rep(NA, 12)

#for loop
for (i in 1:12) {
  coefi=coef(regfit.full, id=i)
  pred=test.mat[,names(coefi)] %*% coefi
  val.errors[i]=mean((Boston$crim[test]-pred)^2)
}

val.errors
##  [1] 39.30376 39.24783 38.88351 37.74299 37.60826 37.50905 37.83619 37.86088
##  [9] 37.76300 37.65814 37.70689 37.70938
#finding the lowest
which.min(val.errors)
## [1] 6
#8 components has the lowest test error
coef(regfit.full,6)
##  (Intercept)           zn          nox          dis          rad      ptratio 
##  20.57369642   0.04666807 -11.85795204  -1.04623129   0.57113805  -0.36940158 
##         medv 
##  -0.24759359
summary(regfit.full)$adjr2
##  [1] 0.3900489 0.4184935 0.4210527 0.4289661 0.4322111 0.4353996 0.4373523
##  [8] 0.4381225 0.4382783 0.4376562 0.4370736 0.4359343

Ridge Regress

library(glmnet)
set.seed(1)
#put data in correct format
x = model.matrix(crim ~ ., Boston)[, -1]
y = Boston$crim
grid=10^seq(10,-2,length=100)

par(mfrow=c(1,1))
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid)

plot(ridge.mod)

cv.out=cv.glmnet(x[train,], y[train], alpha=0)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.5919159
ridge.pred=predict(ridge.mod,s=bestlam,newx = x[test,])
mean((ridge.pred - y.test)^2) 
## [1] 40.16656
out = glmnet(x, y, alpha = 0, lambda = grid)
ridge.coef = predict(out, type="coefficients", s = bestlam)[1:13,]
ridge.coef 
##   (Intercept)            zn         indus          chas           nox 
##  4.7800682958  0.0329969247 -0.0761309849 -0.8269527810 -4.6361822431 
##            rm           age           dis           rad           tax 
##  0.5099734692  0.0001390388 -0.7030350702  0.4349603263  0.0040258862 
##       ptratio         lstat          medv 
## -0.1565951690  0.1564164622 -0.1552181711
Ridge.fit=glmnet(x[train, ], y[train], alpha = 0, lambda = bestlam)
Ridge.fit$dev.ratio
## [1] 0.4950061

LASSO

library(glmnet)
set.seed(1)
lasso.mod=glmnet(x[train,],y[train],alpha=1, lambda=grid, thresh=1e-1)
plot(lasso.mod)

cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam = cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx = x[test,])
mean((lasso.pred-y.test)^2)
## [1] 41.01241
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out, type = "coefficient",s=bestlam)[1:13,]
lasso.coef[lasso.coef!=0]
##  (Intercept)           zn        indus         chas          nox           rm 
## 12.191709892  0.042448091 -0.062972172 -0.762720813 -8.906737143  0.550638814 
##          dis          rad          tax      ptratio        lstat         medv 
## -0.934471918  0.581556676 -0.002083489 -0.276107183  0.136948270 -0.205045031
bestlam
## [1] 0.01850161
Lasso.fit=glmnet(x[train, ], y[train], alpha = 1, lambda = bestlam)
Lasso.fit$dev.ratio
## [1] 0.5018015
#best lam = 0.01850161, training error = 51.1149

PCR - The lowest MSEP occurs with 12 components, with the 12 components the MSE is 41.19923 and the R-squared is 50.21.

library(pls)
set.seed(1)
pcr.fit = pcr(crim ~ ., data = Boston, subset = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 253 12 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.681    7.682    7.408    7.144    7.146    7.114
## adjCV        9.275    7.675    7.677    7.405    7.136    7.138    7.106
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.942    6.949    6.896     6.854     6.807     6.728
## adjCV    6.932    6.942    6.892     6.845     6.796     6.717
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.61    63.48    72.71    80.29    86.43    90.37    92.94    95.04
## crim    32.91    33.06    37.89    42.35    42.62    43.12    45.76    45.79
##       9 comps  10 comps  11 comps  12 comps
## X       96.80     98.34     99.50    100.00
## crim    47.16     47.87     48.83     50.21
validationplot(pcr.fit, val.type="MSEP") #lowest MSE occurs with 12 components

names(pcr.fit)
##  [1] "coefficients"  "scores"        "loadings"      "Yloadings"    
##  [5] "projection"    "Xmeans"        "Ymeans"        "fitted.values"
##  [9] "residuals"     "Xvar"          "Xtotvar"       "fit.time"     
## [13] "ncomp"         "method"        "center"        "scale"        
## [17] "validation"    "call"          "terms"         "model"
summary(pcr.fit)$scores
## Data:    X dimension: 253 12 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.681    7.682    7.408    7.144    7.146    7.114
## adjCV        9.275    7.675    7.677    7.405    7.136    7.138    7.106
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.942    6.949    6.896     6.854     6.807     6.728
## adjCV    6.932    6.942    6.892     6.845     6.796     6.717
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.61    63.48    72.71    80.29    86.43    90.37    92.94    95.04
## crim    32.91    33.06    37.89    42.35    42.62    43.12    45.76    45.79
##       9 comps  10 comps  11 comps  12 comps
## X       96.80     98.34     99.50    100.00
## crim    47.16     47.87     48.83     50.21
## NULL
pcr.pred = predict(pcr.fit, x[test, ], ncomp = 12)
mean((pcr.pred - y.test)^2)
## [1] 41.19923
#training error = 41.19923

PLS

set.seed(1)
pls.fit=plsr(crim~., data=Boston, subset= train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data:    X dimension: 253 12 
##  Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.482    6.944    6.811    6.759     6.76    6.752
## adjCV        9.275    7.476    6.935    6.798    6.750     6.75    6.740
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.742    6.734    6.728     6.728     6.728     6.728
## adjCV    6.730    6.722    6.717     6.717     6.717     6.717
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       50.25    58.91    64.51    72.96    80.79    84.54    88.20    91.94
## crim    36.50    45.85    48.45    49.32    49.64    49.97    50.11    50.18
##       9 comps  10 comps  11 comps  12 comps
## X       94.88     96.36     98.01    100.00
## crim    50.20     50.21     50.21     50.21
validationplot(pls.fit, val.type = "MSEP", legendpos = "top")

MSEP(pls.fit)
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           86.02    55.98    48.22    46.38    45.68    45.70    45.59
## adjCV        86.02    55.89    48.10    46.22    45.57    45.57    45.42
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       45.46    45.34    45.27     45.27     45.27     45.27
## adjCV    45.29    45.19    45.12     45.12     45.12     45.12
pcr.pred=predict(pcr.fit, x[test,], ncomp=9)
mean((pcr.pred - y.test)^2)
## [1] 42.92522

Q11(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.
A11(b) The models I would recommend using are: PLS using 9 components, with an MSE of 42.09646 and an R-squared of .5020.
I chose this model (while it was very close to the other models) because it utilized less components/predictor variables and reduce the odds of over fitting while still having a higher R-squared than other models.

Q11(c) Does your chosen model involve all of the features in the data set? Why or why not?
A11(c) No the chosen model do not involve all of the features/predictor variables from the data set. The PLS showed a small MSE at 9 components, with no improvement on the cross validation pass 9 components and minimal improvement on the R-squared (.0001).