In this exercise, we will predict the number of applications received using the other variables in the College data set.
- (a) Split the data set into a training set and a test set.
# Exercise 9-a
library(ISLR)    # provides the College data set
set.seed(1)
train_index = sample(1:nrow(College), round(nrow(College) * 0.70))
train = College[train_index, ]
test = College[-train_index, ]
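- As a quick sanity check (the counts below follow from the 777 colleges in the data set and match the dimensions reported by pcr() in part (e)):
# Check the ~70/30 split: 544 training rows, 233 test rows
dim(train)
dim(test)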
- (b) Fit a linear model using least squares on the training set, and report the test error obtained.
# Exercise 9-b
lm.fit = lm(Apps ~ ., data = train)   # least squares on all predictors
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5819.3  -450.7     1.4   328.5  7444.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.486e+02  5.066e+02  -1.083  0.27935
## PrivateYes  -5.096e+02  1.643e+02  -3.101  0.00203 **
## Accept       1.721e+00  4.755e-02  36.201  < 2e-16 ***
## Enroll      -1.042e+00  2.419e-01  -4.308 1.97e-05 ***
## Top10perc    5.354e+01  6.435e+00   8.320 7.60e-16 ***
## Top25perc   -1.617e+01  5.088e+00  -3.178  0.00157 **
## F.Undergrad  2.739e-02  4.367e-02   0.627  0.53069
## P.Undergrad  7.135e-02  3.645e-02   1.957  0.05085 .
## Outstate    -8.783e-02  2.170e-02  -4.047 5.97e-05 ***
## Room.Board   1.623e-01  5.571e-02   2.914  0.00372 **
## Books        2.781e-01  2.718e-01   1.023  0.30675
## Personal    -4.606e-03  7.254e-02  -0.063  0.94939
## PhD         -9.694e+00  5.356e+00  -1.810  0.07087 .
## Terminal    -2.714e-01  6.006e+00  -0.045  0.96397
## S.F.Ratio    1.652e+01  1.606e+01   1.029  0.30408
## perc.alumni  2.302e+00  4.848e+00   0.475  0.63517
## Expend       5.982e-02  1.337e-02   4.476 9.34e-06 ***
## Grad.Rate    7.160e+00  3.517e+00   2.036  0.04227 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1021 on 526 degrees of freedom
## Multiple R-squared:  0.9332, Adjusted R-squared:  0.931
## F-statistic: 432.3 on 17 and 526 DF,  p-value: < 2.2e-16
lm.pred = predict(lm.fit, newdata = test)
lm.err = mean((test$Apps - lm.pred)^2)   # test MSE
lm.err
## [1] 1266407
- The test MSE obtained from the least squares model fit on all variables is 1266407.
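- For scale, this MSE can be compared with a trivial baseline (a sketch; this null model simply predicts the training-set mean number of applications for every college):
# Null-model baseline MSE: always predict the mean of Apps in the training set
null.err = mean((test$Apps - mean(train$Apps))^2)
null.err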
- (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
# Exercise 9-c
library(glmnet)   # ridge and lasso
xtrain = model.matrix(Apps ~ ., data = train)[, -1]   # drop the intercept column
ytrain = train$Apps
xtest = model.matrix(Apps ~ ., data = test)[, -1]
ytest = test$Apps
set.seed(1)
ridge.fit = cv.glmnet(xtrain, ytrain, alpha = 0)
plot(ridge.fit)

ridge.lambda = ridge.fit$lambda.min
paste("Ridge Lambda", ridge.lambda)
## [1] "Ridge Lambda 367.220655836988"
ridge.pred = predict(ridge.fit, s = ridge.lambda, newx = xtest)
ridge.err = mean((ridge.pred-ytest)^2)
paste("Test Error", ridge.err)
## [1] "Test Error 1168672.06016979"
- (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# Exercise 9-d
set.seed(1)
lasso.fit = cv.glmnet(xtrain, ytrain, alpha = 1)
plot(lasso.fit)

lasso.lambda = lasso.fit$lambda.min
paste("Lasso Lambda", lasso.lambda)
## [1] "Lasso Lambda 1.95974618863191"
lasso.pred = predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err = mean((lasso.pred-ytest)^2)
paste("Lasso Error", lasso.err)
## [1] "Lasso Error 1269433.8248159"
# Exercise 9-d coefficients
lasso.coeff = predict(lasso.fit, type = "coefficients", s = lasso.lambda)[1:18, ]
round(lasso.coeff[lasso.coeff != 0], 3)
## (Intercept)      Accept      Enroll   Top10perc   Top25perc F.Undergrad
##   -1055.063       1.702      -0.945      51.533     -14.532       0.036
## P.Undergrad    Outstate  Room.Board       Books    Personal         PhD
##       0.074      -0.107       0.146       0.261      -0.005      -6.878
##    Terminal   S.F.Ratio perc.alumni      Expend   Grad.Rate
##       0.608      24.927       0.627       0.063       6.095
- The values above are the non-zero coefficient estimates. At this small λ the lasso shrinks the least squares fit only slightly and retains essentially every predictor, so it performs little variable selection here.
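- The number of non-zero estimates can also be counted directly from the lasso.coeff vector extracted above:
# Count the non-zero predictor coefficients, excluding the intercept
sum(lasso.coeff != 0) - 1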
- (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# Exercise 9-e
library(pls)   # pcr() and plsr()
pcr.fit = pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit)

## Data:    X dimension: 544 17
##          Y dimension: 544 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3892     3799     2114     2122     1779     1698     1704
## adjCV         3892     3800     2110     2121     1761     1686     1698
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1705     1654     1607      1606      1612      1613      1615
## adjCV     1702     1643     1600      1600      1606      1608      1609
##        14 comps  15 comps  16 comps  17 comps
## CV         1615      1577      1197      1156
## adjCV      1610      1560      1186      1147
##
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.058    57.03    64.43    70.29    75.66    80.65    84.26    87.61
## Apps    5.575    71.65    71.65    81.07    82.59    82.61    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.59     92.84     94.93     96.74     97.82     98.72     99.38
## Apps    84.56     84.83     84.86     84.87     85.02     85.06     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.04     93.32
# Exercise 9-e results
pcr.pred = predict(pcr.fit, test, ncomp = 10)   # M = 10 components
pcr.error = mean((test$Apps - pcr.pred)^2)
paste("Test Error", pcr.error)
## [1] "Test Error 1842925.42996644"
- (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# Exercise 9-f
pls.fit = plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)

## Data:    X dimension: 544 17
##          Y dimension: 544 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3892     1939     1730     1538     1422     1329     1187
## adjCV         3892     1934     1730     1531     1402     1297     1175
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1168     1152     1153      1150      1157      1153      1150
## adjCV     1159     1145     1144      1142      1147      1145      1142
##        14 comps  15 comps  16 comps  17 comps
## CV         1149      1149      1149      1149
## adjCV      1140      1140      1140      1140
##
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.20    62.49    64.91    67.30    72.46    76.95    80.88
## Apps    76.61    82.45    86.93    90.76    92.84    93.06    93.14    93.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.69     85.27      87.8     90.88     92.47     95.11     97.09
## Apps    93.26     93.28      93.3     93.31     93.32     93.32     93.32
##       16 comps  17 comps
## X        98.37    100.00
## Apps     93.32     93.32
# Exercise 9-f test error
pls.pred = predict(pls.fit, test, ncomp = 13)   # M = 13 components
pls.err = mean((pls.pred - test$Apps)^2)
paste("Test Error", pls.err)
## [1] "Test Error 1266750.68341741"
- With M = 13, chosen where the CV error has essentially flattened, the PLS test error is nearly identical to that of least squares, though still above ridge regression.
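- A more automatic way to pick M (a sketch assuming a recent version of the pls package, which provides selectNcomp()) is the one-sigma rule: take the smallest M whose CV error is within one standard error of the minimum:
# One-sigma rule for the number of PLS components
selectNcomp(pls.fit, method = "onesigma", plot = FALSE)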
- (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
- lm.err = 1266407
- ridge.err = 1168672.06
- lasso.err = 1269433.82
- pcr.error = 1842925.43
- pls.err = 1266750.68
# Exercise 9-g chart
barplot(c(lm.err, ridge.err, lasso.err, pcr.error, pls.err),
        col = "steelblue",
        xlab = "Regression Methods",
        ylab = "Test Error",
        main = "Test Errors for All Methods",
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

- Comments: Ridge regression achieved the lowest test MSE, so of the five methods it predicts the number of applications most accurately. As the barplot demonstrates, however, the OLS, lasso, and PLS errors are nearly identical, and only PCR with M = 10 stands out with a noticeably larger error; overall the differences among the approaches are modest.
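- To put the errors on an interpretable scale, each MSE can be converted to a test-set R-squared (a sketch; these values are not part of the output above):
# Test-set R^2 for each method: 1 - MSE / variance of Apps in the test set
test.var = mean((test$Apps - mean(test$Apps))^2)
round(1 - c(OLS = lm.err, Ridge = ridge.err, Lasso = lasso.err,
            PCR = pcr.error, PLS = pls.err) / test.var, 3)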