In this exercise, we will predict the number of applications received using the other variables in the College data set.
Part A: Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
data(College)
set.seed(1)
trainid = sample(1:nrow(College), nrow(College)/2)
train = College[trainid,]
test = College[-trainid,]
Part B: Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps~., data=train)
lm.pred = predict(lm.fit, test)
lm.err = mean((test$Apps - lm.pred)^2)
lm.err
## [1] 1135758
Part C: Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
train.X = model.matrix(Apps ~ ., data = train)
train.Y = train[, "Apps"]
test.X = model.matrix(Apps ~ ., data = test)
test.Y = test[, "Apps"]
grid = 10 ^ seq(4, -2, length=100)
ridge.mod = glmnet(train.X, train.Y, alpha =0, lambda=grid, thresh = 1e-12)
lambda.best = ridge.mod$lambda.min
ridge.pred = predict(ridge.mod, newx= test.X, s=lambda.best)
(ridge.err = mean((test.Y - ridge.pred)^2))
## [1] 1164319
Part D: Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod = glmnet(train.X, train.Y, alpha =1, lambda=grid, thresh = 1e-12)
lasso.cv = cv.glmnet(train.X, train.Y, alpha =1, lambda=grid, thresh = 1e-12)
lambda.best = lasso.cv$lambda.min
lasso.pred = predict(lasso.mod, newx= test.X, s=lambda.best)
(lasso.err = mean((test.Y - lasso.pred)^2))
## [1] 1135660
predict(lasso.mod, s = lambda.best, type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -7.900363e+02
## (Intercept) .
## PrivateYes -3.070103e+02
## Accept 1.779328e+00
## Enroll -1.469508e+00
## Top10perc 6.672214e+01
## Top25perc -2.230442e+01
## F.Undergrad 9.258974e-02
## P.Undergrad 9.408838e-03
## Outstate -1.083495e-01
## Room.Board 2.115147e-01
## Books 2.912105e-01
## Personal 6.120406e-03
## PhD -1.547200e+01
## Terminal 6.409503e+00
## S.F.Ratio 2.282638e+01
## perc.alumni 1.130498e+00
## Expend 4.856697e-02
## Grad.Rate 7.488081e+00
Part E: Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.0.4
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit = pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type="MSEP")

summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4027 2351 2355 2046 1965 1906
## adjCV 4288 4031 2347 2353 2014 1955 1899
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1910 1913 1871 1799 1799 1802 1800
## adjCV 1903 1908 1866 1790 1792 1795 1793
## 14 comps 15 comps 16 comps 17 comps
## CV 1832 1728 1310 1222
## adjCV 1829 1702 1296 1212
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
pcr.pred = predict(pcr.fit, test, ncomp=10)
(pcr.err = mean((test$Apps - pcr.pred)^2))
## [1] 1723100
Part F: Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type="MSEP")

pls.pred = predict(pls.fit, test, ncomp=10)
(pls.err = mean((test$Apps - pls.pred)^2))
## [1] 1131661