For parts (a) through (c), indicate which of i. through iv. is
correct. Justify your answer. i. More flexible and hence will give
improved prediction accuracy when its increase in bias is less than its
decrease in variance. ii. More flexible and hence will give improved
prediction accuracy when its increase in variance is less than its
decrease in bias. iii. Less flexible and hence will give improved
prediction accuracy when its increase in bias is less than its decrease
in variance. iv. Less flexible and hence will give improved prediction
accuracy when its increase in variance is less than its decrease in
bias.
Q3(a) The lasso, relative to least squares, is:
A3(a) iii. Less flexible and hence will give improved
prediction accuracy when its increase in bias is less than its decrease
in variance:
Justification: Lasso increases bias by constrains or
regularizes the coefficient estimates, in doing so coefficients shrink
their value and with LASSO have the potential to shrink to 0, allowing
the shrinkage methods can also perform variable selection. The
constraint should improve the fit, because shrinking the coefficients
can significantly reduce their variance.
Q3(b) Repeat (a) for ridge regression relative to least
squares.
A3(b) iii. Less flexible and hence will give improved
prediction accuracy when its increase in bias is less than its decrease
in variance.
Justification: Same reason for LASSO, main difference
is LASSO’s method of constraint can bring coefficients to 0, while Ridge
Regression can bring them close, but not 0 (due to the method of their
shrinkage being different)
Q3(c) Repeat (a) for non-linear methods relative to
least squares.
A3(c) ii. More flexible and hence will give improved
prediction accuracy when its increase in variance is less than its
decrease in bias.
Justification: Non-linear methods are more flexible
than least squares. With the flexibility of non-linear come an decrease
in bias along with an increase in variance. The trade off is finding the
point when the increase in variance is less than the decrease in
bias.
In this exercise, we will predict the number of applications received
using the other variables in the College data set.
library(ISLR)
attach(College)
Q9(a) Split the data set into a training set and a
test set.
A9(a)
set.seed(1)
train=sample(1:nrow(College),nrow(College)/2)
test=(-train)
y.test=College$Apps[test]
Q9(b) Fit a linear model using least squares on the
training set, and report the test error obtained.
A9(b) The MSE for the linear model is 1135758.
set.seed(1)
lm.fit = lm(Apps~.,
data=College,
subset=train)
lm.pred = predict(lm.fit, College[test,])
mean((y.test - lm.pred)^2)
## [1] 1135758
Q9(c) Fit a ridge regression model on the training
set, with λ chosen by cross-validation. Report the test error
obtained.
A9(c) Cross validation using glmnet chose the best
value of λ as 405.8404. Using this λ the MSE for the ridge regression
model is 976647.8
set.seed(1)
library(glmnet)
x=model.matrix(Apps ~ ., College)[, -1]
y=College$Apps
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0, lambda=grid, thresh=1e-12)
cv.out=cv.glmnet(x[train,], y[train], alpha=0)
bestlam=cv.out$lambda.min
ridge.pred=predict(ridge.mod,s=bestlam,newx = x[test,])
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)
## [1] 976647.8
bestlam
## [1] 405.8404
Q9(d) Fit a lasso model on the training set, with λ
chosen by cross validation. Report the test error obtained, along with
the number of non-zero coefficient estimates.
A9(d) Cross validation chosen the best value of λ as
1.97344. Using this λ the MSE is 1032618 and the number of non-zero
coefficient estimates (not including the intercept) are 17.
set.seed(1)
lasso.mod=glmnet(x[train,],y[train],alpha=1, lambda=grid, thresh=1e-1)
plot(lasso.mod)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam = cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx = x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1032618
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out, type = "coefficient",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -468.96036214 -491.47351867 1.57117027 -0.76858284 48.25732969
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -12.94716203 0.04293538 0.04406960 -0.08340724 0.14963683
## Books Personal PhD Terminal S.F.Ratio
## 0.01549548 0.02915128 -8.42538012 -3.26671319 14.61167079
## perc.alumni Expend Grad.Rate
## -0.02612797 0.07718333 8.31579869
bestlam
## [1] 1.97344
Q9(e) Fit a PCR model on the training set, with M
chosen by cross validation. Report the test error obtained, along with
the value of M selected by cross-validation.
A9(e) The PCR model has a MSE of 1135758 using 17 for
the value of M (selected by cross validation).
library(pls)
set.seed(1)
pcr.fit=pcr(Apps~.,
data=College,
subset= train,
scale=TRUE,
validation="CV")
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4006 2373 2372 2069 1961 1919
## adjCV 4288 4007 2368 2369 1999 1948 1911
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1919 1921 1876 1832 1832 1836 1837
## adjCV 1912 1915 1868 1821 1823 1827 1827
## 14 comps 15 comps 16 comps 17 comps
## CV 1853 1759 1341 1270
## adjCV 1850 1733 1326 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
validationplot(pcr.fit, val.type = "MSEP", legendpos = "top")
MSEP(pcr.fit)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 18390649 16051334 5630366 5627415 4279138 3845707 3681767
## adjCV 18390649 16059941 5609499 5610221 3996071 3795843 3650145
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 3682813 3692142 3519204 3354439 3356321 3369472 3374680
## adjCV 3654280 3667606 3490660 3317249 3323781 3338697 3339500
## 14 comps 15 comps 16 comps 17 comps
## CV 3433208 3092457 1799505 1611684
## adjCV 3423289 3004430 1758941 1579659
pcr.pred=predict(pcr.fit, x[test,], ncomp=17)
mean((pcr.pred - y.test)^2)
## [1] 1135758
Q9(f) Fit a PLS model on the training set, with M
chosen by cross validation. Report the test error obtained, along with
the value of M selected by cross-validation.
A9(f) The PLS model has a MSE of 1124003 using 11 for
the value of M (selected by cross validation).
set.seed(1)
pls.fit=plsr(Apps~.,
data=College,
subset= train,
scale=TRUE,
validation="CV")
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2217 2019 1761 1630 1533 1347
## adjCV 4288 2211 2012 1749 1605 1510 1331
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1309 1303 1286 1283 1283 1277 1271
## adjCV 1296 1289 1273 1270 1270 1264 1258
## 14 comps 15 comps 16 comps 17 comps
## CV 1270 1270 1270 1270
## adjCV 1258 1257 1257 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20 78.62 80.81
## Apps 75.39 81.24 86.97 91.14 92.62 93.43 93.56 93.68
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.29 87.17 89.15 91.37 92.58 94.42 96.98
## Apps 93.76 93.79 93.83 93.86 93.88 93.89 93.89
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.89 93.89
validationplot(pls.fit, val.type="MSEP")
MSEP_object = MSEP(pls.fit)
pls.pred=predict(pls.fit, x[test,], ncomp=11)
mean((pls.pred - y.test)^2)
## [1] 1124003
Q9(g) Comment on the results obtained. How
accurately can we predict the number of college applications received?
Is there much difference among the test errors resulting from these five
approaches?
A9(g) The results from the 5 approaches in order of
lowest test errors are: 1. Ridge Regression - 976647.8, 2. LASSO -
1032618, 3. PLS - 1124003, and tied for last is both PCR & Least
Square Model - 1135758. The MSE for all of the tests are very high, the
R-squared (helps explain proportion of variance) for all are above .90,
with ridge regression having the highest R-squared of .915 which is
strong.
#Least Square model
test.avg = mean(y.test)
lm.r2 =1 - mean((lm.pred- y.test)^2) / mean((test.avg - y.test)^2)
lm.r2
## [1] 0.9015413
ridge.r2 =1 - mean((ridge.pred- y.test)^2) / mean((test.avg - y.test)^2)
ridge.r2
## [1] 0.9153346
lasso.r2 =1 - mean((lasso.pred- y.test)^2) / mean((test.avg - y.test)^2)
lasso.r2
## [1] 0.9104826
pcr.r2 =1 - mean((pcr.pred- y.test)^2) / mean((test.avg - y.test)^2)
pcr.r2
## [1] 0.9015413
pls.r2 =1 - mean((pls.pred - y.test)^2) / mean((test.avg - y.test)^2)
pls.r2
## [1] 0.9025604
detach(College)
We will now try to predict per capita crime rate in the
Boston data set.
library(ISLR2)
attach(Boston)
sum(is.na(Boston))
## [1] 0
set.seed(1)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=(-train)
y.test=Boston$crim[test]
attach(Boston)
Q11(a) Try out some of the regression methods
explored in this chapter, such as best subset selection, the lasso,
ridge regression, and PCR. Present and discuss results for the
approaches that you consider.
A11(a) Results from the approaches include:
Best Subset Selection - 6 components would fit my model
best, with 6 components my Adjusted R Squared is .4353996 (slightly
smaller than 7 components at .4373523), my MSE is smallest at 37.50905
(smaller than 5 and 7 components at 37.60826 and 37.83619).
Ridge Regression - The ridge regression chose a best
lambda of 0.5919159, using 12 components, the MSE is 40.16656 and the
R-squared is 0.4950061 LASSO - The LASSO chose a best
lambda of 0.01850161, using 11 components,the MSE is 41.01241 and the
R-squared is 0.5018015. PCR - The lowest MSEP occurs
with 12 components, with the 12 components the MSE is 41.19923 and the
R-squared is .5021. PLS - The lowest MSEP occurs with 9
components, with the 9 components the MSE is 42.09646 and the R-squared
is .5020.
Best Subset Selection
set.seed(1)
library(leaps)
regfit.full = regsubsets(crim ~ ., data = Boston, nvmax = 13)
regfull_summary = summary(regfit.full)
names(regfull_summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
regfull_summary$rsq
## [1] 0.3912567 0.4207965 0.4244920 0.4334892 0.4378328 0.4421077 0.4451514
## [8] 0.4470236 0.4482891 0.4487917 0.4493353 0.4493378
par(mfrow = c(2,2))
plot(regfull_summary$cp)
plot(regfull_summary$bic)
plot(regfull_summary$adjr2)
cp_min = which.min(summary(regfit.full)$cp)
cp_min
## [1] 7
bic_min = which.min(summary(regfit.full)$bic)
bic_min
## [1] 2
adj_r2_max = which.max(summary(regfit.full)$adjr2)
adj_r2_max
## [1] 9
coef(regfit.full,7)
## (Intercept) zn nox dis rad ptratio
## 17.4668235 0.0449679 -12.4578238 -0.9425497 0.5615224 -0.3470306
## lstat medv
## 0.1147895 -0.1902559
test.mat=model.matrix(crim~.,data = Boston[test,])
val.errors=rep(NA, 12)
#for loop
for (i in 1:12) {
coefi=coef(regfit.full, id=i)
pred=test.mat[,names(coefi)] %*% coefi
val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
## [1] 39.30376 39.24783 38.88351 37.74299 37.60826 37.50905 37.83619 37.86088
## [9] 37.76300 37.65814 37.70689 37.70938
#finding the lowest
which.min(val.errors)
## [1] 6
#8 components has the lowest test error
coef(regfit.full,6)
## (Intercept) zn nox dis rad ptratio
## 20.57369642 0.04666807 -11.85795204 -1.04623129 0.57113805 -0.36940158
## medv
## -0.24759359
summary(regfit.full)$adjr2
## [1] 0.3900489 0.4184935 0.4210527 0.4289661 0.4322111 0.4353996 0.4373523
## [8] 0.4381225 0.4382783 0.4376562 0.4370736 0.4359343
Ridge Regress
library(glmnet)
set.seed(1)
#put data in correct format
x = model.matrix(crim ~ ., Boston)[, -1]
y = Boston$crim
grid=10^seq(10,-2,length=100)
par(mfrow=c(1,1))
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
plot(ridge.mod)
cv.out=cv.glmnet(x[train,], y[train], alpha=0)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.5919159
ridge.pred=predict(ridge.mod,s=bestlam,newx = x[test,])
mean((ridge.pred - y.test)^2)
## [1] 40.16656
out = glmnet(x, y, alpha = 0, lambda = grid)
ridge.coef = predict(out, type="coefficients", s = bestlam)[1:13,]
ridge.coef
## (Intercept) zn indus chas nox
## 4.7800682958 0.0329969247 -0.0761309849 -0.8269527810 -4.6361822431
## rm age dis rad tax
## 0.5099734692 0.0001390388 -0.7030350702 0.4349603263 0.0040258862
## ptratio lstat medv
## -0.1565951690 0.1564164622 -0.1552181711
Ridge.fit=glmnet(x[train, ], y[train], alpha = 0, lambda = bestlam)
Ridge.fit$dev.ratio
## [1] 0.4950061
LASSO
library(glmnet)
set.seed(1)
lasso.mod=glmnet(x[train,],y[train],alpha=1, lambda=grid, thresh=1e-1)
plot(lasso.mod)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam = cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx = x[test,])
mean((lasso.pred-y.test)^2)
## [1] 41.01241
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out, type = "coefficient",s=bestlam)[1:13,]
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox rm
## 12.191709892 0.042448091 -0.062972172 -0.762720813 -8.906737143 0.550638814
## dis rad tax ptratio lstat medv
## -0.934471918 0.581556676 -0.002083489 -0.276107183 0.136948270 -0.205045031
bestlam
## [1] 0.01850161
Lasso.fit=glmnet(x[train, ], y[train], alpha = 1, lambda = bestlam)
Lasso.fit$dev.ratio
## [1] 0.5018015
#best lam = 0.01850161, training error = 51.1149
PCR - The lowest MSEP occurs with 12 components, with the 12 components the MSE is 41.19923 and the R-squared is 50.21.
library(pls)
set.seed(1)
pcr.fit = pcr(crim ~ ., data = Boston, subset = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 253 12
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.681 7.682 7.408 7.144 7.146 7.114
## adjCV 9.275 7.675 7.677 7.405 7.136 7.138 7.106
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.942 6.949 6.896 6.854 6.807 6.728
## adjCV 6.932 6.942 6.892 6.845 6.796 6.717
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 50.61 63.48 72.71 80.29 86.43 90.37 92.94 95.04
## crim 32.91 33.06 37.89 42.35 42.62 43.12 45.76 45.79
## 9 comps 10 comps 11 comps 12 comps
## X 96.80 98.34 99.50 100.00
## crim 47.16 47.87 48.83 50.21
validationplot(pcr.fit, val.type="MSEP") #lowest MSE occurs with 12 components
names(pcr.fit)
## [1] "coefficients" "scores" "loadings" "Yloadings"
## [5] "projection" "Xmeans" "Ymeans" "fitted.values"
## [9] "residuals" "Xvar" "Xtotvar" "fit.time"
## [13] "ncomp" "method" "center" "scale"
## [17] "validation" "call" "terms" "model"
summary(pcr.fit)$scores
## Data: X dimension: 253 12
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.681 7.682 7.408 7.144 7.146 7.114
## adjCV 9.275 7.675 7.677 7.405 7.136 7.138 7.106
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.942 6.949 6.896 6.854 6.807 6.728
## adjCV 6.932 6.942 6.892 6.845 6.796 6.717
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 50.61 63.48 72.71 80.29 86.43 90.37 92.94 95.04
## crim 32.91 33.06 37.89 42.35 42.62 43.12 45.76 45.79
## 9 comps 10 comps 11 comps 12 comps
## X 96.80 98.34 99.50 100.00
## crim 47.16 47.87 48.83 50.21
## NULL
pcr.pred = predict(pcr.fit, x[test, ], ncomp = 12)
mean((pcr.pred - y.test)^2)
## [1] 41.19923
#training error = 41.19923
PLS
set.seed(1)
pls.fit=plsr(crim~., data=Boston, subset= train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 253 12
## Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.482 6.944 6.811 6.759 6.76 6.752
## adjCV 9.275 7.476 6.935 6.798 6.750 6.75 6.740
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.742 6.734 6.728 6.728 6.728 6.728
## adjCV 6.730 6.722 6.717 6.717 6.717 6.717
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 50.25 58.91 64.51 72.96 80.79 84.54 88.20 91.94
## crim 36.50 45.85 48.45 49.32 49.64 49.97 50.11 50.18
## 9 comps 10 comps 11 comps 12 comps
## X 94.88 96.36 98.01 100.00
## crim 50.20 50.21 50.21 50.21
validationplot(pls.fit, val.type = "MSEP", legendpos = "top")
MSEP(pls.fit)
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 86.02 55.98 48.22 46.38 45.68 45.70 45.59
## adjCV 86.02 55.89 48.10 46.22 45.57 45.57 45.42
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 45.46 45.34 45.27 45.27 45.27 45.27
## adjCV 45.29 45.19 45.12 45.12 45.12 45.12
pcr.pred=predict(pcr.fit, x[test,], ncomp=9)
mean((pcr.pred - y.test)^2)
## [1] 42.92522
Q11(b) Propose a model (or set of models) that seem
to perform well on this data set, and justify your answer. Make sure
that you are evaluating model performance using validation set error,
cross validation, or some other reasonable alternative, as opposed to
using training error.
A11(b) The models I would recommend using are: PLS
using 9 components, with an MSE of 42.09646 and an R-squared of
.5020.
I chose this model (while it was very close to the other models) because
it utilized less components/predictor variables and reduce the odds of
over fitting while still having a higher R-squared than other
models.
Q11(c) Does your chosen model involve all of the
features in the data set? Why or why not?
A11(c) No the chosen model do not involve all of the
features/predictor variables from the data set. The PLS showed a small
MSE at 9 components, with no improvement on the cross validation pass 9
components and minimal improvement on the R-squared (.0001).