library(tidyverse)
library(openintro)
library(ISLR)
library(ISLR2)
library(glmnet)
library(pls)
library(kableExtra)
library(leaps)
library(MASS)More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
College data
set.##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5741.2 -479.5 15.3 359.6 7258.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.902e+02 6.381e+02 -1.238 0.216410
## PrivateYes -3.070e+02 2.006e+02 -1.531 0.126736
## Accept 1.779e+00 5.420e-02 32.830 < 2e-16 ***
## Enroll -1.470e+00 3.115e-01 -4.720 3.35e-06 ***
## Top10perc 6.673e+01 8.310e+00 8.030 1.31e-14 ***
## Top25perc -2.231e+01 6.533e+00 -3.415 0.000708 ***
## F.Undergrad 9.269e-02 5.529e-02 1.676 0.094538 .
## P.Undergrad 9.397e-03 5.493e-02 0.171 0.864275
## Outstate -1.084e-01 2.700e-02 -4.014 7.22e-05 ***
## Room.Board 2.115e-01 7.224e-02 2.928 0.003622 **
## Books 2.912e-01 3.985e-01 0.731 0.465399
## Personal 6.133e-03 8.803e-02 0.070 0.944497
## PhD -1.548e+01 6.681e+00 -2.316 0.021082 *
## Terminal 6.415e+00 7.290e+00 0.880 0.379470
## S.F.Ratio 2.283e+01 2.047e+01 1.115 0.265526
## perc.alumni 1.134e+00 6.083e+00 0.186 0.852274
## Expend 4.857e-02 1.619e-02 2.999 0.002890 **
## Grad.Rate 7.490e+00 4.397e+00 1.703 0.089324 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1083 on 370 degrees of freedom
## Multiple R-squared: 0.9389, Adjusted R-squared: 0.9361
## F-statistic: 334.3 on 17 and 370 DF, p-value: < 2.2e-16
pred.app<-predict(linear.fit, College.test)
test.error<-mean((College.test$Apps-pred.app)^2)
test.error## [1] 1135758
MSE for a linear model using the least squares is 1,135,758.
set.seed(1)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## [1] 405.8404
## [1] 976268.9
MSE for the ridge regression model is 976,268.9.
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## [1] 1.97344
## [1] 1116252
out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
nonzero=sum(lasso.coef != 0) - 1
nonzero## [1] 17
MSE for the Lasso model is 1,116,252. There are 17 non-zero coefficient estimates.
set.seed(1)
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
validationplot(pcr.college, val.type="MSEP")## [1] 16
## [1] 1137877
MSE for a PCR model is 1,137,877 with the value of \(M\) being 16 due to its low CV and high variance explained.
set.seed(1)
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")## [1] 16
## [1] 1135812
MSE for a PLS model is 1,135,812 with the value of \(M\) being 16 due to its low CV and high variance explained.
set.seed(1)
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) /
mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /
mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /
mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /
mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /
mean((College.test[, "Apps"] - test.avg)^2)
r2_values = c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2)
bar_positions = barplot(r2_values,
names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
main="Test R-squared", ylim=c(0, max(r2_values) * 1.1))
text(x=bar_positions, y=r2_values, labels=round(r2_values, 3), pos=3, cex=0.6)lm.rmse = sqrt(mean((College.test[, "Apps"] - pred.app)^2))
ridge.rmse = sqrt(mean((College.test[, "Apps"] - ridge.pred)^2))
lasso.rmse = sqrt(mean((College.test[, "Apps"] - lasso.pred)^2))
pcr.rmse = sqrt(mean((pcr.pred - y.test)^2))
pls.rmse = sqrt(mean((pls.pred - y.test)^2))
lm.mse = mean((College.test[, "Apps"] - pred.app)^2)
ridge.mse = mean((College.test[, "Apps"] - ridge.pred)^2)
lasso.mse = mean((College.test[, "Apps"] - lasso.pred)^2)
pcr.mse = mean((pcr.pred - y.test)^2)
pls.mse = mean((pls.pred - y.test)^2)
rmse_table <- data.frame(
Method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
MSE = c(lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse),
RMSE = c(lm.rmse, ridge.rmse, lasso.rmse, pcr.rmse, pls.rmse)
)
rmse_table %>%
kbl(digits = 3, format = "html", caption = "Comparison of MSE and RMSE for Different Methods") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)| Method | MSE | RMSE |
|---|---|---|
| OLS | 1135758.3 | 1065.720 |
| Ridge | 976268.9 | 988.063 |
| Lasso | 1116251.6 | 1056.528 |
| PCR | 1137876.9 | 1066.713 |
| PLS | 1135812.2 | 1065.745 |
The \(R^2\) of each model shows that PCR has the lowest accuracy to explain the variance in the data whereas all of the other models are fairly similar. When focused on just the RMSE/MSE, Ridge Regression has the lowest RMSE/MSE and would be a good model to use.
Boston data set.set.seed(1)
attach(Boston)
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")## [1] 9
## [1] 42.81453
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.176491
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.150484
## tax .
## ptratio .
## black .
## lstat .
## medv .
## [1] 7.921353
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.523899542
## zn -0.002949852
## indus 0.029276741
## chas -0.166526007
## nox 1.874769665
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
## [1] 7.669133
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
best_subset_mse = mean.cv.errors[which.min(mean.cv.errors)]
lasso_mse = min(cv.lasso$cvm)
ridge_mse = min(cv.ridge$cvm)
pcr_mse = min(pcr.crime$validation$PRESS / nrow(Boston)) # PRESS / n = MSE
best_subset_rmse = sqrt(best_subset_mse)
lasso_rmse = sqrt(lasso_mse)
ridge_rmse = sqrt(ridge_mse)
pcr_rmse = sqrt(pcr_mse)
mse_table <- data.frame(
Method = c("Best Subset", "LASSO", "Ridge", "PCR"),
MSE = c(best_subset_mse, lasso_mse, ridge_mse, pcr_mse),
RMSE = c(best_subset_rmse, lasso_rmse, ridge_rmse, pcr_rmse)
)
mse_table %>%
kbl(digits = 3, format = "html", caption = "Comparison of MSE and RMSE for Different Methods") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)| Method | MSE | RMSE |
|---|---|---|
| Best Subset | 42.815 | 6.543 |
| LASSO | 44.837 | 6.696 |
| Ridge | 43.334 | 6.583 |
| PCR | 42.097 | 6.488 |
Based on the MSE, the PCR model had the lowest cross-validation error with the MSE of 42.8.
The model chosen is the PCR model. This model has 13 predictors and the lowest MSE. PCR transforms the 13 variables in the data into into principal components (PCs), which are linear combinations of the original variables. This selects a subset of principal components that explain most of the variance, potentially reducing overfitting. Multicollinearity is also avoided by using uncorrelated principal components rather than raw predictors.