In this study, we compare two popular regularized regression methods: Ridge and Lasso regression. Both are regularization techniques used in predictive modeling to mitigate overfitting and improve generalization. However, they differ in how they penalize the model coefficients, and these differences shape their predictive performance.
Ridge regression adds a penalty term to the regression objective that is proportional to the sum of the squared coefficients, known as the L2 norm. This penalty shrinks the coefficients toward zero, reducing their magnitudes without eliminating any predictor entirely. Lasso regression, by contrast, uses a penalty proportional to the sum of the absolute values of the coefficients, the L1 norm. This penalty induces sparsity in the coefficient vector and performs feature selection by setting some coefficients exactly to zero.
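Concretely, in the standard textbook formulation, Ridge chooses coefficients to minimize RSS + λ Σj βj², while Lasso minimizes RSS + λ Σj |βj|, where RSS is the residual sum of squares and λ ≥ 0 controls the penalty strength (glmnet, used below, fits an equivalent formulation with the loss rescaled by the number of observations).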
The objective of this project is to explore and compare the predictive performance of Ridge and Lasso regression techniques in the context of predicting graduation rates for colleges. By leveraging the College dataset, which contains various attributes of colleges across the United States, we aim to gain insights into how these regularization techniques differ in their ability to model graduation rates. Through this comparative analysis, we seek to identify the strengths and limitations of Ridge and Lasso regression.
We will start by loading the necessary libraries for our analysis. These include ‘ISLR’ for accessing the College dataset, ‘glmnet’ for fitting Ridge and Lasso regression models, and ‘dplyr’ for data manipulation.
# Load necessary libraries
library(ISLR) # For accessing the College dataset
library(glmnet) # For Ridge and Lasso regression
library(dplyr) # For data manipulation
The College dataset contains statistics for 777 US colleges, drawn from the 1995 issue of US News and World Report, recorded in 18 variables covering admissions, enrollment, costs, faculty, and graduation outcomes.
Let's load the dataset and explore its structure, variables, and summary statistics.
# Load the College dataset
data(College)
# Display the structure of the dataset
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
# Display summary statistics of the dataset
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
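One point worth noting in the summary is that Grad.Rate has a maximum of 118, i.e., above 100%, which appears to be a data-entry quirk in this dataset. We leave the variable as-is here, since it affects both models equally.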
We split the dataset into training and testing sets, with 80% of the data used for training and the remaining 20% for testing.
# Convert "Private" variable to numeric
College$Private <- as.numeric(College$Private) # Convert factor variable "Private" to numeric for modeling
# Data Splitting
set.seed(123) # Set seed for reproducibility
train_index <- sample(1:nrow(College), 0.8*nrow(College)) # Generate random indices for training set (80% of data)
train_data <- College[train_index, ] # Training set
test_data <- College[-train_index, ] # Test set
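As a quick sanity check (optional), we can confirm that the split sizes match the intended 80/20 proportions:
# Confirm the split sizes
nrow(train_data) # Roughly 80% of the 777 observations
nrow(test_data) # The remaining ~20% held out for testing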
Next, we define the predictor matrix and response vector for the training and test sets:
# Defining Predictor Matrix and Response Vector
x_train <- as.matrix(train_data[, -which(names(train_data) == "Grad.Rate")]) # Predictor matrix for training set
y_train <- train_data$Grad.Rate # Target variable for training set
x_test <- as.matrix(test_data[, -which(names(test_data) == "Grad.Rate")]) # Predictor matrix for test set
y_test <- test_data$Grad.Rate # Target variable for test set
The cv.glmnet function fits the model over a grid of lambda values using k-fold cross-validation (10 folds by default) and selects the value of lambda (the regularization parameter) that minimizes the cross-validated mean squared error (MSE). Note that glmnet standardizes the predictors internally by default before applying the penalty, which matters here because the variables are on very different scales. Setting alpha = 0 gives Ridge regression and alpha = 1 gives Lasso.
# Fit Ridge Regression model
ridge_model <- cv.glmnet(x_train, y_train, alpha = 0) # Fit Ridge Regression model using glmnet package
# Fit Lasso Regression model
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1) # Fit Lasso Regression model using glmnet package
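Before evaluating on the test set, it can be useful (though optional) to inspect the selected penalty values and the cross-validation curves, both of which are available on the cv.glmnet fits; the output is not shown here.
# Inspect the cross-validated fits (optional)
ridge_model$lambda.min # Lambda that minimizes cross-validated MSE for Ridge
lasso_model$lambda.min # Lambda that minimizes cross-validated MSE for Lasso
plot(ridge_model) # CV error versus log(lambda) for Ridge
plot(lasso_model) # CV error versus log(lambda) for Lasso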
# Using MSE to evaluate the models
# Get Mean Squared Error (MSE) on test set for Ridge Regression
ridge_predictions <- predict(ridge_model, s = "lambda.min", newx = x_test) # Predictions using selected lambda
ridge_mse <- mean((ridge_predictions - y_test)^2) # Calculate Mean Squared Error
# Get Mean Squared Error (MSE) on test set for Lasso Regression
lasso_predictions <- predict(lasso_model, s = "lambda.min", newx = x_test) # Predictions using selected lambda
lasso_mse <- mean((lasso_predictions - y_test)^2) # Calculate Mean Squared Error
# Print MSE for Ridge and Lasso Regression
print(paste("Ridge Regression MSE:", ridge_mse))
## [1] "Ridge Regression MSE: 158.405268432072"
print(paste("Lasso Regression MSE:", lasso_mse))
## [1] "Lasso Regression MSE: 161.992111131763"
The MSE values represent the average squared difference between the actual and predicted graduation rates for colleges in the test dataset. A lower MSE indicates better predictive performance, as it reflects smaller errors between the predicted and actual values. In this case, the Ridge regression model has a slightly lower MSE compared to the Lasso regression model.
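To express these errors on the scale of the response, we can take square roots: the root mean squared error is roughly 12.6 graduation-rate points for Ridge and roughly 12.7 for Lasso, so the practical difference between the two models is small.
# RMSE, in graduation-rate percentage points
sqrt(ridge_mse) # Ridge: ~12.6
sqrt(lasso_mse) # Lasso: ~12.7
Next, we examine the residuals of both models on the test set.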
# Calculate residuals
ridge_residuals <- ridge_predictions - y_test
lasso_residuals <- lasso_predictions - y_test
# Plot residuals vs predicted values
par(mfrow = c(1, 2))
plot(ridge_predictions, ridge_residuals, col = "blue", xlab = "Predicted Graduation Rates",
ylab = "Residuals", main = "Residual Analysis: Ridge")
abline(h = 0, col = "red")
plot(lasso_predictions, lasso_residuals, col = "green", xlab = "Predicted Graduation Rates",
ylab = "Residuals", main = "Residual Analysis: Lasso")
abline(h = 0, col = "red")
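In both panels, the residuals should scatter roughly symmetrically around the red zero line; pronounced curvature or a funnel shape would indicate systematic errors or non-constant variance. We now turn to the fitted coefficients themselves.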
# Extract selected coefficients for Ridge Regression
ridge_coefficients <- coef(ridge_model, s = "lambda.min") # Extract coefficients for selected lambda
print("Ridge Regression Coefficients:")
## [1] "Ridge Regression Coefficients:"
print(ridge_coefficients)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.047880e+01
## Private 4.761198e+00
## Apps 7.469641e-04
## Accept 3.033353e-04
## Enroll 5.615738e-05
## Top10perc 7.009384e-02
## Top25perc 1.121298e-01
## F.Undergrad -1.243917e-04
## P.Undergrad -1.198858e-03
## Outstate 7.080958e-04
## Room.Board 2.155294e-03
## Books -1.208491e-03
## Personal -2.152616e-03
## PhD 6.764559e-02
## Terminal -3.843631e-02
## S.F.Ratio -3.157836e-02
## perc.alumni 2.528622e-01
## Expend -3.011967e-04
# Extract selected coefficients for Lasso Regression
lasso_coefficients <- coef(lasso_model, s = "lambda.min") # Extract coefficients for selected lambda
print("Lasso Regression Coefficients:")
## [1] "Lasso Regression Coefficients:"
print(lasso_coefficients)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 27.8697551527
## Private 5.2214694881
## Apps 0.0011903161
## Accept .
## Enroll .
## Top10perc 0.0265843665
## Top25perc 0.1389974203
## F.Undergrad -0.0001844750
## P.Undergrad -0.0012602310
## Outstate 0.0008115123
## Room.Board 0.0021808209
## Books -0.0006035260
## Personal -0.0020528949
## PhD 0.0726900229
## Terminal -0.0497067682
## S.F.Ratio .
## perc.alumni 0.2842915654
## Expend -0.0004203725
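Notice that Lasso has set the coefficients for Accept, Enroll, and S.F.Ratio exactly to zero (printed as "."), whereas Ridge retains every predictor with a small but nonzero coefficient. This is the sparsity and implicit feature selection described in the introduction.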
Plotting the top five coefficients from each model:
# Function to extract top n coefficients excluding intercept
get_top_n_coefficients <- function(coefficients, n = 5) {
  coefficients <- coefficients[-1] # Exclude the intercept
  sorted_coefficients <- sort(abs(coefficients), decreasing = TRUE) # Order by magnitude
  top_n <- sorted_coefficients[1:n]
  top_n_names <- names(top_n)
  return(top_n_names)
}
# Extracting top 5 coefficients for both models
top_n_ridge <- get_top_n_coefficients(ridge_coefficients[, "s1"], n = 5)
top_n_lasso <- get_top_n_coefficients(lasso_coefficients[, "s1"], n = 5)
# Compare both models on the union of their top-5 feature sets so the bars line up
top_features <- union(top_n_ridge, top_n_lasso)
ridge_top5_values <- abs(ridge_coefficients[top_features, "s1"]) # Ridge coefficient magnitudes
lasso_top5_values <- abs(lasso_coefficients[top_features, "s1"]) # Lasso coefficient magnitudes
# Combining coefficients into a single df (rows = features)
top5_coefficients_df <- data.frame(Ridge = ridge_top5_values, Lasso = lasso_top5_values)
# Graphing top 5 feature importance
barplot(t(top5_coefficients_df), beside = TRUE, col = c("blue", "red"),
        main = "Top 5 Feature Importance: Ridge vs Lasso",
        xlab = "Features", ylab = "Absolute Coefficient Values")
legend("topright", legend = c("Ridge", "Lasso"), fill = c("blue", "red"))
Applied to predicting college graduation rates, the Ridge and Lasso models highlight largely the same influential factors, with slight variations. Both place the greatest weight on private status, alumni involvement (perc.alumni), and student academic profile (Top25perc). They differ mainly at the margin of the top-five set: Ridge retains Top10perc among its largest coefficients, while Lasso gives slightly more weight to faculty credentials (PhD and Terminal). This nuanced difference reflects how the two penalties handle correlated and low-signal predictors.
In summary, Ridge and Lasso regression each offer strengths and limitations that suit different analytical needs:
Ridge regression is preferred when predictive accuracy is paramount and the goal is to control overfitting in the presence of multicollinearity. It is also suitable when interpretability is less critical and a model that retains all predictors is acceptable.
Lasso regression is preferred when interpretability and feature selection are crucial and there is a need to identify the most influential predictors. It is particularly useful for high-dimensional datasets or when a simpler, sparser model is desired.
In practice, the choice between Ridge and Lasso regression depends on the specific goals of the analysis, including the trade-offs between predictive accuracy, interpretability, and model complexity.