This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# Load Libraries
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## Warning: package 'pls' was built under R version 4.3.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(leaps)
## Warning: package 'leaps' was built under R version 4.3.3
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
# Question 2
# For parts (a) through (c), indicate which of i. through iv. is correct.
# Justify your answer.
# (a) The lasso, relative to least squares, is:
# i. More flexible and hence will give improved prediction accuracy
# when its increase in bias is less than its decrease in
# variance.
# ii. More flexible and hence will give improved prediction accuracy
# when its increase in variance is less than its decrease
# in bias.
# iii. Less flexible and hence will give improved prediction
# accuracy
# when its increase in bias is less than its decrease in
# variance.
# iv. Less flexible and hence will give improved prediction accuracy
# when its increase in variance is less than its decrease
# in bias.
# Question 2 part (a)
# The lasso, relative to least squares, is: (iii)
# Less flexible and hence will give improved prediction accuracy.
# The lasso is able to reduce model flexibility by shrinking
# coefficient estimates and even sets some estimates to zero,
# which reduces variance. Even though variance is reduced this method may
# increase bias due to underfitting. Overall the lasso can improve prediction
# accuracy provided that the reduction in the variances offsets the increase
# in bias.
# Question 2 part (b) Repeat (a) for ridge regression relative to least squares.
# The ridge regression relative to least squares is: (iii)
# Less flexible and hence will give improved prediction accuracy.
# Ridge regression is similar to lasso but it instead shrinks coefficients
# toward zero value but they key difference is that it doesn't set coefficients
# to exactly zero. Ridge regression is less flexible due to penalization of
# large coeffiencts, which causes reduction in variance and increase in bias.
# Ridge regression can improve prediction accuracy provided that the reduction
# in the variances offsets the increase in bias.
# Question 2 part (c) Repeat (a) for non-linear methods relative to least
# squares.
# For non-linear methods relative to least squares is : (ii)
# More flexible and hence will give improved prediction accuracy when its
# increase in variance is less than its decrease in bias.
# Non-linear models compared to least squares are more flexible and they
# are also less biased as well.
# Question 9
# In this exercise, we will predict the number of applications received
# using the other variables in the College data set.
attach(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
# Question 9 (a) Split the data set into a training set and a test set.
x = model.matrix(Apps~.,College)[,-1]
y = College$Apps
set.seed(10)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
College_train = College[train, ]
College_test = College[test, ]
y_test = y[test]
# Question 9 (b) Fit a linear model using least squares on the training set,
# and report the test error obtained.
college_ls_fit = lm(Apps~., data=College, subset=train)
summary(college_ls_fit)
##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5139.5 -473.3 -21.1 353.2 7402.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -629.36179 639.35741 -0.984 0.325579
## PrivateYes -647.56836 192.17056 -3.370 0.000832 ***
## Accept 1.68912 0.05038 33.530 < 2e-16 ***
## Enroll -1.02383 0.27721 -3.693 0.000255 ***
## Top10perc 48.19124 8.10714 5.944 6.42e-09 ***
## Top25perc -10.51538 6.44952 -1.630 0.103865
## F.Undergrad 0.01992 0.05364 0.371 0.710574
## P.Undergrad 0.04213 0.05348 0.788 0.431373
## Outstate -0.09489 0.02674 -3.549 0.000436 ***
## Room.Board 0.14549 0.07243 2.009 0.045277 *
## Books 0.06660 0.31115 0.214 0.830623
## Personal 0.05663 0.09453 0.599 0.549475
## PhD -10.11489 7.11588 -1.421 0.156027
## Terminal -2.29300 8.03546 -0.285 0.775528
## S.F.Ratio 22.07117 18.70991 1.180 0.238897
## perc.alumni 2.08121 6.00673 0.346 0.729179
## Expend 0.07654 0.01672 4.577 6.45e-06 ***
## Grad.Rate 9.99706 4.49821 2.222 0.026857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9367
## F-statistic: 338 on 17 and 370 DF, p-value: < 2.2e-16
pred_app = predict(college_ls_fit, College_test)
lmtest_error = mean((College_test$Apps-pred_app)^2)
lmtest_error
## [1] 1020100
# The test error (MSE) is 1020100
# Question 9 (c) Fit a ridge regression model on the training set,
# with lambda chosen by cross-validation. Report the test error obtained.
grid = 10^seq(10,-2,length=100)
ridge_mod = glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge_mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv_college_out = cv.glmnet(x[train,],y[train] ,alpha=0)
best_lambda = cv_college_out$lambda.min
best_lambda
## [1] 411.3927
ridge_pred=predict(ridge_mod, s=best_lambda ,newx=x[test,])
ridge_error = mean((ridge_pred - y_test)^2)
print(ridge_error)
## [1] 985020.1
# The test error (MSE) for the ridge model is 985020.1
# Question 9 (d) Fit a lasso model on the training set, with lambda
# chosen by cross validation. Report the test error obtained, along with
# the number of non-zero coefficient estimates.
lasso_mod = glmnet(x[train,], y[train], alpha=1, lambda=grid)
summary(lasso_mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv_out = cv.glmnet(x[train,],y[train],alpha=1)
best_lam2 = cv_out$lambda.min
best_lam2
## [1] 24.66235
lasso_pred = predict(lasso_mod,s=best_lam2, newx=x[test,])
lasso_error = mean((lasso_pred - y_test)^2)
print(lasso_error)
## [1] 1008145
# Number of non-zero coefficeint estimates
out = glmnet(x,y,alpha=1,lambda = grid)
lasso_coef = predict(out,type="coefficients",s=best_lam2)[1:18,]
lasso_coef[lasso_coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -6.324960e+02 -4.087012e+02 1.436837e+00 -1.410240e-01 3.143012e+01
## Top25perc P.Undergrad Outstate Room.Board Personal
## -8.606536e-01 1.480293e-02 -5.342495e-02 1.205819e-01 4.379135e-05
## PhD Terminal S.F.Ratio perc.alumni Expend
## -5.121245e+00 -3.371192e+00 2.717231e+00 -1.039648e+00 6.838161e-02
## Grad.Rate
## 4.700317e+00
# The test error (MSE) for the lasso model is 1008145
# Question 9 (e) Fit a PCR model on the training set, with M chosen by
# cross validation. Report the test error obtained, along with the value of M
# selected by cross-validation.
pcr_college = pcr(Apps~., data=College_train , scale=TRUE, validation="CV")
summary(pcr_college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 4345 2371 2391 2104 1949 1898
## adjCV 4347 4345 2368 2396 2085 1939 1891
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1899 1880 1864 1861 1870 1873 1891
## adjCV 1893 1862 1857 1853 1862 1865 1885
## 14 comps 15 comps 16 comps 17 comps
## CV 1903 1727 1295 1260
## adjCV 1975 1669 1283 1249
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.6794 56.94 64.38 70.61 76.27 80.97 84.48 87.54
## Apps 0.9148 71.17 71.36 79.85 81.49 82.73 82.79 83.70
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.89 94.96 96.81 97.97 98.73 99.39
## Apps 83.86 84.08 84.11 84.11 84.16 84.28 93.08
## 16 comps 17 comps
## X 99.86 100.00
## Apps 93.71 93.95
validationplot(pcr_college, val.type="MSEP")
pcr_pred = predict(pcr_college,x[test,], ncomp = 10)
pcr_error = mean((pcr_pred - y_test)^2)
print(pcr_error)
## [1] 1422699
# The test error (MSE) is 1008145 with a M value of 10
# chosen. M value of 10 was chosen due to the low CV and high variance.
# Question 9 (f) Fit a PLS model on the training set, with M chosen by
# cross validation. Report the test error obtained, along with the value of M
# selected by cross-validation.
pls_college = plsr(Apps~., data = College_train,scale=TRUE, validation="CV")
validationplot(pls_college, val.type="MSEP")
summary(pls_college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 2178 1872 1734 1615 1453 1359
## adjCV 4347 2171 1867 1726 1586 1427 1341
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1347 1340 1329 1317 1310 1305 1305
## adjCV 1330 1324 1314 1302 1296 1291 1291
## 14 comps 15 comps 16 comps 17 comps
## CV 1305 1307 1307 1307
## adjCV 1291 1292 1293 1293
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.27 38.72 62.64 65.26 69.01 73.96 78.86 82.18
## Apps 76.96 84.31 86.80 91.48 93.37 93.75 93.81 93.84
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 85.35 87.42 89.18 91.41 92.70 94.58 97.16
## Apps 93.88 93.91 93.93 93.94 93.95 93.95 93.95
## 16 comps 17 comps
## X 98.15 100.00
## Apps 93.95 93.95
pls_pred = predict(pls_college,x[test,], ncomp=12)
pls_error = mean((pls_pred - y_test)^2)
print(pls_error)
## [1] 1019435
# The test error (MSE) is 1019435 with a M value of 10
# chosen. M value of 12 was chosen due to the low CV.
# (g) Comment on the results obtained. How accurately can we predict
# the number of college applications received? Is there much
# difference among the test errors resulting from these five approaches?
errors = c(lmtest_error, ridge_error, lasso_error, pcr_error, pls_error)
names(errors) <- c("lm MSE", "ridge MSE", "lasso MSE", "pcr MSE", "pls MSE")
print(sort(errors))
## ridge MSE lasso MSE pls MSE lm MSE pcr MSE
## 985020.1 1008144.7 1019435.4 1020099.8 1422698.9
sum_squares = sum((mean(College_test$Apps) - College_test$Apps)^2)
sum_residuals = sum((ridge_pred - College_test$Apps)^2)
ridge_rsquare = 1 - (sum_residuals)/(sum_squares)
print(ridge_rsquare)
## [1] 0.9107905
# The ridge model has the lowest MSE (test error) of the five approaches.
# The test errors of lasso, pls, and linear are very similar to each other,
# and then there is a stark contrast with the pcr being the highest test error
# compared to all the other models and thus it is the worst.
# The ridge model will be selected as a result of lowest MSE.
# Based on the r-square value of the ridge, it can predict 91.1% of the variance
# of 'apps' (college applications).
# Need to detach dataset
detach(College)
# Question 11 We will now try to predict per capita crime rate in the
# Boston data set.
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Question 11 Part a
# Try out some of the regression methods explored in this chapter,
# such as best subset selection, the lasso, ridge regression, and
# PCR. Present and discuss results for the approaches that you consider.
# Subset selection
set.seed(7)
predict_regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict_regsubsets(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
mean_cv_errors = apply(cv.errors, 2, mean)
plot(mean_cv_errors, type = "b", xlab = "Number of variables", ylab = "CV error")
which.min(mean_cv_errors)
## [1] 13
# Subset MSE
mean_cv_errors[which.min(mean_cv_errors)]
## [1] 43.26558
# Lasso
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv_lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv_lasso)
coef(cv_lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.7799525
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.1920089
## tax .
## ptratio .
## black .
## lstat .
## medv .
# Lasso MSE
sqrt(cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.1se])
## [1] 7.765238
# Ridge
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv_ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv_ridge)
coef(cv_ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.523899542
## zn -0.002949852
## indus 0.029276741
## chas -0.166526007
## nox 1.874769665
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
# Ridge MSE
sqrt(cv_ridge$cvm[cv_ridge$lambda == cv_ridge$lambda.1se])
## [1] 7.659135
# PCR
pcr_crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr_crime)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.215 7.218 6.784 6.781 6.785 6.802
## adjCV 8.61 7.212 7.215 6.778 6.774 6.780 6.795
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.799 6.667 6.707 6.699 6.710 6.675 6.600
## adjCV 6.791 6.658 6.698 6.689 6.698 6.661 6.586
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
# Propose a model (or set of models) that seem to perform well on
# this data set, and justify your answer. Make sure that you are
# evaluating model performance using validation set error, cross validation,
# or some other reasonable alternative, as opposed to
# using training error.
# See above
# Looking at the MSE, the ridge had the lowest cross
# validation and thus it is the optimal model to go on.
# Part c
# Does your chosen model involve all of the features in the data
# set? Why or why not?
# Ridge model doesn't have specific variable selection process and includes all
# predictors ant thus all features where used.