The lasso is less flexible than least squares because it imposes an L1 penalty that shrinks the coefficients and sets some of them exactly to zero. This adds bias but can reduce variance, especially when many predictors are irrelevant, so the lasso improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
Ridge regression is also less flexible because of its L2 penalty, which shrinks coefficients toward zero but never sets them exactly to zero. Like the lasso, it improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
Non-linear methods are more flexible than least squares and tend to fit the training data more closely. They improve prediction accuracy when the resulting decrease in bias outweighs the increase in variance.
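As a rough illustration of the lasso's side of this tradeoff, the sketch below simulates a setting in which only 5 of 50 predictors are relevant and compares test MSE for least squares and the lasso. The sample size, coefficients, and noise level are arbitrary choices for illustration, not part of the exercise.

library(glmnet)
set.seed(42)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))        # only the first 5 predictors matter
y <- drop(x %*% beta + rnorm(n))
x_new <- matrix(rnorm(n * p), n, p)        # independent test set
y_new <- drop(x_new %*% beta + rnorm(n))
ols_fit <- lm(y ~ x)                       # least squares on all 50 predictors
ols_mse <- mean((cbind(1, x_new) %*% coef(ols_fit) - y_new)^2)
cv_fit <- cv.glmnet(x, y, alpha = 1)       # lasso with CV-chosen lambda
lasso_mse <- mean((predict(cv_fit, s = cv_fit$lambda.min, newx = x_new) - y_new)^2)
c(OLS = ols_mse, Lasso = lasso_mse)        # lasso typically wins in this sparse setting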
data("College")
College <- College %>% mutate(Private = as.numeric(Private == "Yes"))
set.seed(1)
train_idx <- sample(1:nrow(College), nrow(College)/2)
train <- College[train_idx, ]
test <- College[-train_idx, ]
x_train <- model.matrix(Apps ~ ., data = train)[, -1]
y_train <- train$Apps
x_test <- model.matrix(Apps ~ ., data = test)[, -1]
y_test <- test$Apps
# Least squares
lm_fit <- lm(Apps ~ ., data = train)
lm_pred <- predict(lm_fit, test)
mean((lm_pred - y_test)^2)
## [1] 1135758
# Ridge regression (alpha = 0), lambda chosen by cross-validation
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_pred <- predict(cv_ridge, s = cv_ridge$lambda.min, newx = x_test)
mean((ridge_pred - y_test)^2)
## [1] 976261.5
# Lasso (alpha = 1), lambda chosen by cross-validation
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_pred <- predict(cv_lasso, s = cv_lasso$lambda.min, newx = x_test)
mean((lasso_pred - y_test)^2)
## [1] 1115901
lasso_coef <- predict(cv_lasso, s = cv_lasso$lambda.min, type = "coefficients")
sum(lasso_coef != 0)  # non-zero coefficients, including the intercept
## [1] 18
# Principal components regression
pcr_fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")
opt_comp <- which.min(pcr_fit$validation$PRESS)
pcr_pred <- predict(pcr_fit, test, ncomp = opt_comp)
mean((pcr_pred - y_test)^2)
## [1] 1135758
# Partial least squares
pls_fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pls_fit, val.type = "MSEP")
opt_pls <- which.min(pls_fit$validation$PRESS)
pls_pred <- predict(pls_fit, test, ncomp = opt_pls)
mean((pls_pred - y_test)^2)
## [1] 1135758
All five models give broadly similar test errors on this split, with ridge regression achieving the lowest test MSE.
- Lasso slightly reduces test error relative to least squares and can simplify the model by setting coefficients to zero, although at the CV-chosen lambda all predictors remain non-zero here (the 18 non-zero coefficients above include the intercept).
- Ridge helps when many predictors are correlated.
- PCR and PLS reduce dimensionality.
- Least squares may overfit.
**Lasso** is often preferred for its balance of predictive performance and interpretability; a test-R² comparison is sketched below.
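For a scale-free comparison of the five fits, the following sketch computes test R² from the predictions obtained above; test_r2 is a small helper defined here, not part of the original code.

# Test R^2 for each College model, computed from the predictions above
test_r2 <- function(pred) 1 - mean((pred - y_test)^2) / mean((y_test - mean(y_test))^2)
c(OLS   = test_r2(lm_pred),
  Ridge = test_r2(ridge_pred),
  Lasso = test_r2(lasso_pred),
  PCR   = test_r2(pcr_pred),
  PLS   = test_r2(pls_pred))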
data("Boston")
set.seed(1)
train_idx <- sample(1:nrow(Boston), nrow(Boston)*0.7)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]
x_train <- model.matrix(crim ~ ., data = train)[, -1]
y_train <- train$crim
x_test <- model.matrix(crim ~ ., data = test)[, -1]
y_test <- test$crim
# Ridge
cv_ridge_b <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_b_pred <- predict(cv_ridge_b, s = cv_ridge_b$lambda.min, newx = x_test)
ridge_b_mse <- mean((ridge_b_pred - y_test)^2)
# Lasso
cv_lasso_b <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_b_pred <- predict(cv_lasso_b, s = cv_lasso_b$lambda.min, newx = x_test)
lasso_b_mse <- mean((lasso_b_pred - y_test)^2)
lasso_b_coef <- predict(cv_lasso_b, s = cv_lasso_b$lambda.min, type = "coefficients")
# PCR
pcr_b <- pcr(crim ~ ., data = train, scale = TRUE, validation = "CV")
opt_pcr <- which.min(pcr_b$validation$PRESS)
pcr_b_pred <- predict(pcr_b, test, ncomp = opt_pcr)
pcr_b_mse <- mean((pcr_b_pred - y_test)^2)
# MSE Comparison
c(Ridge = ridge_b_mse, Lasso = lasso_b_mse, PCR = pcr_b_mse)
## Ridge Lasso PCR
## 58.75168 58.06509 57.61252
The model with the lowest test MSE is preferred; here the three test MSEs are very close, with PCR slightly lowest, so interpretability can also guide the choice (see the cross-validation sketch after this list).
- Lasso often performs well, offering a sparse solution and helping to avoid overfitting.
- Ridge works better when all predictors contribute to the response.
- PCR may lose predictive power if important predictors load mainly on the components that are dropped.
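As a complementary check that does not depend on this particular train/test split, the sketch below compares the cross-validation errors each method reports on the training data, using the fitted objects from above; dividing the pls PRESS by the number of training observations puts it on the same MSE scale as glmnet's cvm.

# Training-set CV error for each Boston model (MSE scale)
c(Ridge = min(cv_ridge_b$cvm),                        # CV MSE at the best lambda
  Lasso = min(cv_lasso_b$cvm),
  PCR   = min(pcr_b$validation$PRESS) / nrow(train))  # best CV PRESS converted to MSE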
The lasso uses fewer variables because it shrinks some coefficients exactly to zero, which improves interpretability and reduces overfitting. Ridge and PCR keep all of the predictors, relying on shrinkage or dimensionality reduction instead.
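To see which Boston predictors the lasso actually retains at lambda.min, the following quick sketch inspects lasso_b_coef from above.

# Non-zero lasso coefficients for the Boston model
coef_dense <- as.matrix(lasso_b_coef)[, 1]  # named numeric vector of coefficients
coef_dense[coef_dense != 0]                 # intercept plus the selected predictors
sum(coef_dense != 0) - 1                    # number of predictors kept (excluding the intercept)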