The lasso will lead to option iii. Because the lasso shrinks coefficients and can set some exactly to zero, it is less flexible than least squares; this introduces more bias but reduces variance. Since expected test MSE decomposes into variance plus squared bias (plus irreducible error), predictions improve when the reduction in variance exceeds the increase in bias.
Ridge regression will also lead to option iii. Ridge likewise uses shrinkage and is very similar to the lasso, but it retains all coefficients in the model. The penalty term discourages large coefficients, which introduces bias and reduces variance. Again, predictions improve when the reduction in variance exceeds the increase in bias.
Non-linear methods will lead to option i. They allow more complex relationships between the predictors and the response, making the model more flexible, which usually means lower bias and higher variance. Predictions improve when the decrease in bias outweighs the increase in variance. The sketch below illustrates the lasso/ridge contrast.
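As a minimal sketch of that contrast (simulated data; X, beta, ridge.demo, and lasso.demo are illustrative names of my own), fitting ridge and the lasso at the same penalty strength shows the lasso zeroing out noise coefficients while ridge only shrinks them:
library(glmnet)
set.seed(42)
# 20 predictors, only the first 5 truly non-zero
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
beta <- c(rep(2, 5), rep(0, 15))
y <- drop(X %*% beta) + rnorm(100)
# Same penalty strength for both; alpha = 0 is ridge, alpha = 1 is the lasso
ridge.demo <- glmnet(X, y, alpha = 0, lambda = 1)
lasso.demo <- glmnet(X, y, alpha = 1, lambda = 1)
sum(coef(ridge.demo) == 0)  # ridge: no coefficient is exactly zero
sum(coef(lasso.demo) == 0)  # lasso: most of the 15 noise coefficients are zeroed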
library(ISLR2)
library(glmnet)
library(leaps)
library(pls)
data(College)
set.seed(1)
# 80/20 train/test split
n <- nrow(College)
train <- sample(1:n, size = round(0.8 * n))
test <- setdiff(1:n, train)
train_data <- College[train, ]
test_data <- College[test, ]
# Design matrix for glmnet (drop the intercept column)
x <- model.matrix(Apps ~ ., data = College)[, -1]
y <- College$Apps
# Least-squares baseline
lm.fit <- lm(Apps ~ ., data = train_data)
lm.pred <- predict(lm.fit, newdata = test_data)
lm.mse <- mean((lm.pred - test_data$Apps)^2)
lm.mse
## [1] 1578073
# Ridge regression (alpha = 0); choose lambda by cross-validation
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
cv.ridge <- cv.glmnet(x[train, ], y[train], alpha = 0)
bestlam.ridge <- cv.ridge$lambda.min
bestlam.ridge
## [1] 362.6608
ridge.pred <- predict(ridge.mod, s = bestlam.ridge, newx = x[test, ])
ridge.mse <- mean((ridge.pred - y[test])^2)
ridge.mse
## [1] 1445510
cat("\n--- Ridge Results ---\n",
"Test MSE:", round(ridge.mse, 2),
"; best λ:", signif(bestlam.ridge, 3), "\n")
##
## --- Ridge Results ---
## Test MSE: 1445510 ; best λ: 363
# Lasso (alpha = 1); choose lambda by cross-validation
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
cv.lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam.lasso <- cv.lasso$lambda.min
bestlam.lasso
## [1] 1.935412
lasso.pred <- predict(lasso.mod, s = bestlam.lasso, newx = x[test, ])
lasso.mse <- mean((lasso.pred - y[test])^2)
lasso.coef <- predict(lasso.mod, s = bestlam.lasso, type = "coefficients")[1:18, ]
nonzero.count <- sum(lasso.coef != 0) - 1 # exclude intercept
lasso.mse
## [1] 1565419
nonzero.count
## [1] 17
cat("\n--- Lasso Results ---\n",
"Test MSE:", round(lasso.mse, 2),
"; best λ:", signif(bestlam.lasso, 3),
"; non‑zero coefficients:", nonzero.count, "\n")
##
## --- Lasso Results ---
## Test MSE: 1565419 ; best λ: 1.94 ; non‑zero coefficients: 17
# Principal components regression with 10-fold CV
pcr.fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 622 17
## Y dimension: 622 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3834 3753 2060 2071 1672 1594 1579
## adjCV 3834 3754 2057 2072 1665 1583 1576
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1566 1526 1497 1483 1486 1488 1484
## adjCV 1565 1519 1494 1481 1484 1485 1481
## 14 comps 15 comps 16 comps 17 comps
## CV 1486 1464 1160 1079
## adjCV 1483 1454 1150 1073
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.019 57.05 64.13 70.01 75.36 80.38 84.09 87.44
## Apps 4.315 72.01 72.02 81.89 83.65 83.73 83.98 85.12
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.48 92.84 94.92 96.78 97.86 98.72 99.36
## Apps 85.40 85.75 85.75 85.76 85.88 85.94 89.94
## 16 comps 17 comps
## X 99.83 100.00
## Apps 92.88 93.47
validationplot(pcr.fit, val.type = "MSEP")
# Number of components with lowest CV error
pcr.ncomp <- which.min(pcr.fit$validation$PRESS)
pcr.pred <- predict(pcr.fit, newdata = test_data, ncomp = pcr.ncomp)
pcr.mse <- mean((pcr.pred - test_data$Apps)^2)
pcr.mse
## [1] 1578073
pcr.ncomp
## [1] 17
cat("\n--- PCR Results ---\n",
"Test MSE:", round(pcr.mse, 2),
"; Optimal number of components (M):", pcr.ncomp, "\n")
##
## --- PCR Results ---
## Test MSE: 1578073 ; Optimal number of components (M): 17
# Partial least squares with 10-fold CV
pls.fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 622 17
## Y dimension: 622 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3834 1880 1630 1445 1391 1243 1165
## adjCV 3834 1876 1630 1440 1374 1221 1153
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1143 1130 1124 1121 1122 1118 1116
## adjCV 1133 1121 1116 1112 1113 1109 1108
## 14 comps 15 comps 16 comps 17 comps
## CV 1116 1116 1116 1116
## adjCV 1108 1108 1108 1108
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.52 45.30 62.57 65.06 67.50 72.05 76.04 80.49
## Apps 77.30 83.58 87.50 90.88 92.89 93.15 93.24 93.31
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.50 85.41 87.76 91.08 92.72 95.12 96.97
## Apps 93.39 93.42 93.45 93.46 93.46 93.47 93.47
## 16 comps 17 comps
## X 97.98 100.00
## Apps 93.47 93.47
validationplot(pls.fit, val.type = "MSEP")
# Number of components with lowest CV error
pls.ncomp <- which.min(pls.fit$validation$PRESS)
pls.pred <- predict(pls.fit, newdata = test_data, ncomp = pls.ncomp)
pls.mse <- mean((pls.pred - test_data$Apps)^2)
pls.mse
## [1] 1578073
pls.ncomp
## [1] 17
cat("\n--- PLS Results ---\n",
"Test MSE:", round(pls.mse, 2),
"; Optimal number of components (M):", pls.ncomp, "\n")
##
## --- PLS Results ---
## Test MSE: 1578073 ; Optimal number of components (M): 17
cat("Test MSEs:\n")
## Test MSEs:
cat("Linear Regression:", lm.mse, "\n")
## Linear Regression: 1578073
cat("Ridge Regression:", ridge.mse, "\n")
## Ridge Regression: 1445510
cat("Lasso Regression:", lasso.mse, "\n")
## Lasso Regression: 1565419
cat("PCR:", pcr.mse, "\n")
## PCR: 1578073
cat("PLS:", pls.mse, "\n")
## PLS: 1578073
range_values <- range(College$Apps)
print(range_values)
## [1] 81 48094
Predictions are off by about 1,225 applications per college, which is roughly 2.5% of the response range, so the models predict fairly accurately. There is not much difference in test MSE across the approaches: the gap between the best (ridge regression, 1,445,510) and the worst (PCR/linear, 1,578,073) is about 132,000. Note that PCR and PLS both selected M = 17, i.e. all available components, which is why their test MSEs match plain least squares exactly.
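As a sanity check on those figures, the RMSEs (square roots of the test MSEs) and their share of the response range can be computed directly from the objects above (output not shown):
rmse <- sqrt(c(Linear = lm.mse, Ridge = ridge.mse, Lasso = lasso.mse,
               PCR = pcr.mse, PLS = pls.mse))
round(rmse)                                # typical error, in applications
round(100 * rmse / diff(range_values), 1)  # as a percent of the response range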
data(Boston)
set.seed(1)
train_indices <- sample(1:nrow(Boston), size = 0.8 * nrow(Boston))
train_data <- Boston[train_indices, ]
test_data <- Boston[-train_indices, ]
# Ridge Regression Method
x_train <- model.matrix(crim ~ ., data = train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., data = test_data)[, -1]
y_test <- test_data$crim
# Fit ridge with CV
cv.ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda_ridge <- cv.ridge$lambda.min
ridge.pred <- predict(cv.ridge, s = best_lambda_ridge, newx = x_test)
ridge_mse <- mean((ridge.pred - y_test)^2)
cat("Ridge test MSE:", ridge_mse, "\n")
## Ridge test MSE: 65.66146
# Lasso Regression
# Fit lasso with CV
cv.lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda_lasso <- cv.lasso$lambda.min
lasso.pred <- predict(cv.lasso, s = best_lambda_lasso, newx = x_test)
lasso_mse <- mean((lasso.pred - y_test)^2)
cat("Lasso test MSE:", lasso_mse, "\n")
## Lasso test MSE: 64.66192
# Number of non-zero coefficients
lasso.coef <- predict(cv.lasso, s = best_lambda_lasso, type = "coefficients")
cat("Number of non-zero coefficients in lasso:", sum(lasso.coef != 0) - 1, "\n")
## Number of non-zero coefficients in lasso: 11
# PCR Method
pcr.fit <- pcr(crim ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
# Choose number of components with lowest CV error
pcr.ncomp <- which.min(pcr.fit$validation$PRESS)
pcr.pred <- predict(pcr.fit, newdata = test_data, ncomp = pcr.ncomp)
pcr.mse <- mean((pcr.pred - test_data$crim)^2)
cat("PCR test MSE:", pcr.mse, "with components:", pcr.ncomp, "\n")
## PCR test MSE: 64.30352 with components: 12
Among these models, PCR has the lowest test MSE at 64.30, compared with 64.66 for the lasso and 65.66 for ridge, indicating slightly better predictive performance. If predictive accuracy is the priority, PCR is the best choice, but if model simplicity matters, the lasso is the better option.
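To see which predictor the lasso dropped (supporting the simplicity point), one can inspect the sparse coefficient matrix computed above; output not shown:
# Which predictor(s) did the lasso drop at the CV-chosen lambda?
names(which(lasso.coef[, 1] == 0))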
I chose the PCR model since predictive accuracy is the main priority here. One caveat: the CV-selected M of 12 equals the number of predictors in the ISLR2 Boston data (12 predictors besides crim), so PCR retains all components rather than discarding a noisy one, and the fit is essentially the full least-squares fit on rotated predictors. Its edge over ridge and the lasso on this split therefore comes from the data rather than from any dimension reduction.
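Since the selected M equals the number of predictors, a quick check (ols.fit and ols.mse are new names introduced for this comparison; output not shown) would confirm that the PCR and full least-squares test MSEs agree:
# With all 12 components retained, PCR reduces to the full least-squares fit
# on the predictors, so the two test MSEs should coincide
ols.fit <- lm(crim ~ ., data = train_data)
ols.mse <- mean((predict(ols.fit, newdata = test_data) - test_data$crim)^2)
c(OLS = ols.mse, PCR = pcr.mse)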