Introduction

In this study, we compare two popular regularized regression techniques: Ridge and Lasso regression. Both are used in predictive modeling to mitigate overfitting and improve generalization, but they differ in how they apply regularization, and these differences influence their predictive performance.

Ridge regression adds a penalty term to the regression objective that is proportional to the sum of the squared coefficients. This penalty, known as the L2 norm, shrinks the coefficients towards zero, reducing their magnitudes without eliminating any of them. Lasso regression, on the other hand, uses a penalty proportional to the sum of the absolute values of the coefficients, known as the L1 norm. This penalty induces sparsity in the coefficient vector and performs feature selection by setting some coefficients exactly to zero.
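
In standard notation (a textbook formulation, not taken from this report), with tuning parameter lambda >= 0, the two methods minimize

  Ridge:  RSS + lambda * sum_j beta_j^2     (L2 penalty)
  Lasso:  RSS + lambda * sum_j |beta_j|     (L1 penalty)

where RSS = sum_i (y_i - beta_0 - sum_j x_ij * beta_j)^2 is the residual sum of squares. Larger values of lambda impose stronger shrinkage in both cases.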

The objective of this project is to explore and compare the predictive performance of Ridge and Lasso regression techniques in the context of predicting graduation rates for colleges. By leveraging the College dataset, which contains various attributes of colleges across the United States, we aim to gain insights into how these regularization techniques differ in their ability to model graduation rates. Through this comparative analysis, we seek to identify the strengths and limitations of Ridge and Lasso regression.


Libraries

We will start by loading the necessary libraries for our analysis. These include ‘ISLR’ for accessing the College dataset, ‘glmnet’ for fitting Ridge and Lasso regression models, and ‘dplyr’ for data manipulation.

# Load necessary libraries
library(ISLR)   # For accessing the College dataset
library(glmnet) # For Ridge and Lasso regression
library(dplyr)  # For data manipulation


Dataset

The College dataset contains statistics for a large number of US colleges, taken from the 1995 issue of US News and World Report. The data frame has 777 observations on 18 variables.

Let's load the dataset and explore its structure, variables, and summary statistics.

# Load the College dataset
data(College)

# Display the structure of the dataset
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
# Display summary statistics of the dataset
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00


Data Preprocessing and Splitting

We split the dataset into training and testing sets, with 80% of the data used for training and the remaining 20% for testing.

# Convert "Private" variable to numeric
College$Private <- as.numeric(College$Private)  # Convert factor variable "Private" to numeric for modeling

# Data Splitting
set.seed(123)  # Set seed for reproducibility
train_index <- sample(1:nrow(College), 0.8*nrow(College))  # Generate random indices for training set (80% of data)
train_data <- College[train_index, ]  # Training set
test_data <- College[-train_index, ]  # Test set
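
As a quick sanity check (not shown in the original output), we can confirm the sizes of the two sets:

# Check the split sizes (roughly 80% and 20% of the 777 observations)
nrow(train_data)
nrow(test_data)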

Now we define the predictor matrices and response vectors:

# Defining Predictor Matrix and Response Vector
x_train <- as.matrix(train_data[, -which(names(train_data) == "Grad.Rate")])  # Predictor matrix for training set
y_train <- train_data$Grad.Rate  # Target variable for training set
x_test <- as.matrix(test_data[, -which(names(test_data) == "Grad.Rate")])  # Predictor matrix for test set
y_test <- test_data$Grad.Rate  # Target variable for test set


Model Training

The cv.glmnet function fits the model over a grid of lambda values and uses cross-validation to select the value of lambda (the regularization parameter) that minimizes the cross-validated mean squared error (MSE). In glmnet, alpha = 0 corresponds to the Ridge penalty and alpha = 1 to the Lasso penalty; predictors are standardized internally by default, so the penalty acts on a comparable scale across features.

# Fit Ridge Regression model
ridge_model <- cv.glmnet(x_train, y_train, alpha = 0)  # Fit Ridge Regression model using glmnet package

# Fit Lasso Regression model
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1)  # Fit Lasso Regression model using glmnet package
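
To see which penalty strengths cross-validation selected (the exact values depend on the random fold assignment), the fitted cv.glmnet objects can be inspected directly; this is a quick check not shown in the original output.

# Inspect the selected regularization parameters
ridge_model$lambda.min  # lambda minimizing cross-validated MSE for Ridge
lasso_model$lambda.min  # lambda minimizing cross-validated MSE for Lasso

# Plot cross-validated MSE against log(lambda) for each model
par(mfrow = c(1, 2))
plot(ridge_model)
plot(lasso_model)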


Model Evaluation

# Using MSE to evaluate the models
# Get Mean Squared Error (MSE) on test set for Ridge Regression
ridge_predictions <- predict(ridge_model, s = "lambda.min", newx = x_test)  # Predictions using selected lambda
ridge_mse <- mean((ridge_predictions - y_test)^2)  # Calculate Mean Squared Error

# Get Mean Squared Error (MSE) on test set for Lasso Regression
lasso_predictions <- predict(lasso_model, s = "lambda.min", newx = x_test)  # Predictions using selected lambda
lasso_mse <- mean((lasso_predictions - y_test)^2)  # Calculate Mean Squared Error

Mean Squared Error
# Print MSE for Ridge and Lasso Regression
print(paste("Ridge Regression MSE:", ridge_mse))
## [1] "Ridge Regression MSE: 158.405268432072"
print(paste("Lasso Regression MSE:", lasso_mse))
## [1] "Lasso Regression MSE: 161.992111131763"

The MSE values represent the average squared difference between the actual and predicted graduation rates for colleges in the test set. A lower MSE indicates better predictive performance. In this case, the Ridge regression model has a slightly lower test MSE than the Lasso regression model.
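
Because MSE is in squared units, the root mean squared error (RMSE) can be easier to interpret, since it is on the same scale as the graduation rate (percentage points); this is a small addition to the original analysis, and given the MSE values above it works out to roughly 12.6 points for Ridge and 12.7 for Lasso.

# RMSE expresses the test error in percentage points of graduation rate
ridge_rmse <- sqrt(ridge_mse)
lasso_rmse <- sqrt(lasso_mse)
print(paste("Ridge Regression RMSE:", round(ridge_rmse, 2)))
print(paste("Lasso Regression RMSE:", round(lasso_rmse, 2)))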

Residual Analysis
# Calculate residuals
ridge_residuals <- ridge_predictions - y_test
lasso_residuals <- lasso_predictions - y_test

# Plot residuals vs predicted values
par(mfrow = c(1, 2))
plot(ridge_predictions, ridge_residuals, col = "blue", xlab = "Predicted Graduation Rates",
     ylab = "Residuals", main = "Residual Analysis: Ridge")
abline(h = 0, col = "red")
plot(lasso_predictions, lasso_residuals, col = "green", xlab = "Predicted Graduation Rates",
     ylab = "Residuals", main = "Residual Analysis: Lasso")
abline(h = 0, col = "red")


Model Comparison

# Extract selected coefficients for Ridge Regression
ridge_coefficients <- coef(ridge_model, s = "lambda.min")  # Extract coefficients for selected lambda
print("Ridge Regression Coefficients:")
## [1] "Ridge Regression Coefficients:"
print(ridge_coefficients)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  3.047880e+01
## Private      4.761198e+00
## Apps         7.469641e-04
## Accept       3.033353e-04
## Enroll       5.615738e-05
## Top10perc    7.009384e-02
## Top25perc    1.121298e-01
## F.Undergrad -1.243917e-04
## P.Undergrad -1.198858e-03
## Outstate     7.080958e-04
## Room.Board   2.155294e-03
## Books       -1.208491e-03
## Personal    -2.152616e-03
## PhD          6.764559e-02
## Terminal    -3.843631e-02
## S.F.Ratio   -3.157836e-02
## perc.alumni  2.528622e-01
## Expend      -3.011967e-04
# Extract selected coefficients for Lasso Regression
lasso_coefficients <- coef(lasso_model, s = "lambda.min")  # Extract coefficients for selected lambda
print("Lasso Regression Coefficients:")
## [1] "Lasso Regression Coefficients:"
print(lasso_coefficients)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) 27.8697551527
## Private      5.2214694881
## Apps         0.0011903161
## Accept       .           
## Enroll       .           
## Top10perc    0.0265843665
## Top25perc    0.1389974203
## F.Undergrad -0.0001844750
## P.Undergrad -0.0012602310
## Outstate     0.0008115123
## Room.Board   0.0021808209
## Books       -0.0006035260
## Personal    -0.0020528949
## PhD          0.0726900229
## Terminal    -0.0497067682
## S.F.Ratio    .           
## perc.alumni  0.2842915654
## Expend      -0.0004203725
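
Before plotting, a quick way to quantify the sparsity induced by Lasso is to count the non-zero coefficients at the selected lambda. The comparison with lambda.1se (the largest lambda whose cross-validated error is within one standard error of the minimum) is an optional extra, not part of the original analysis.

# Count non-zero coefficients (excluding the intercept) at the selected lambda
sum(coef(lasso_model, s = "lambda.min")[-1, ] != 0)  # features kept by Lasso
sum(coef(ridge_model, s = "lambda.min")[-1, ] != 0)  # Ridge keeps all features

# The more conservative lambda.1se typically yields an even sparser Lasso model
sum(coef(lasso_model, s = "lambda.1se")[-1, ] != 0)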

Plotting the top 5 coefficients from each model (aligned on a common set of features for comparison):

# Function returning the names of the top n coefficients by absolute value, excluding the intercept
get_top_n_coefficients <- function(coefficients, n = 5) {
  # Exclude intercept
  coefficients <- coefficients[-1]
  sorted_coefficients <- sort(abs(coefficients), decreasing = TRUE)
  top_n <- sorted_coefficients[1:n]
  top_n_names <- names(top_n)
  return(top_n_names)
}

# Extracting top 5 coefficients for both models
top_n_ridge <- get_top_n_coefficients(ridge_coefficients[, "s1"], n = 5)
top_n_lasso <- get_top_n_coefficients(lasso_coefficients[, "s1"], n = 5)

# The two top-5 sets need not be identical, so align both models on their union
top_features <- union(top_n_ridge, top_n_lasso)

# Extracting the corresponding coefficient values for the same features from each model
ridge_top_values <- abs(ridge_coefficients[top_features, "s1"])
lasso_top_values <- abs(lasso_coefficients[top_features, "s1"])

# Combining coefficients into a single data frame (rows = features)
top5_coefficients_df <- data.frame(Ridge = ridge_top_values, Lasso = lasso_top_values,
                                   row.names = top_features)

# Graphing top feature importance
barplot(t(top5_coefficients_df), beside = TRUE, col = c("blue", "red"),
        main = "Top Feature Importance: Ridge vs Lasso",
        xlab = "Features", ylab = "Absolute Coefficient Values")
legend("topright", legend = c("Ridge", "Lasso"), fill = c("blue", "red"))

Applied to college graduation rates, Ridge and Lasso regression highlight largely the same influential factors: both give the largest weights to private status (Private), alumni giving (perc.alumni), the share of students from the top 25% of their high-school class (Top25perc), and faculty credentials (PhD). The main difference lies in how the remaining predictors are handled: Ridge keeps small non-zero coefficients for every variable, whereas Lasso removes Accept, Enroll, and S.F.Ratio from the model entirely, reflecting its built-in feature selection.

Coefficient Magnitudes:
  • Ridge Regression: Ridge regression shrinks coefficients towards zero while keeping them non-zero. This is achieved through the L2 regularization penalty, which adds a term proportional to the sum of the squared coefficients to the loss function. In the coefficients above, every feature has a non-zero coefficient, indicating that Ridge regression retains all features in the model.
  • Lasso Regression: Lasso regression, on the other hand, can set some coefficients exactly to zero, leading to a sparse solution. This is achieved through the L1 regularization penalty, which adds a term proportional to the sum of the absolute values of the coefficients to the loss function. In the coefficients above, Lasso has set the coefficients for Accept, Enroll, and S.F.Ratio to zero, indicating that these features are deemed less important for predicting graduation rates.
Interpretability:
  • Ridge Regression: While Ridge regression retains all features in the model, the coefficients are shrunk towards zero. This can make interpretation more challenging, especially when dealing with a large number of features, as the impact of each feature on the target variable may be less clear.
  • Lasso Regression: Lasso regression’s ability to perform feature selection by setting some coefficients to zero can lead to more interpretable models. Features with non-zero coefficients are deemed more important for predicting graduation rates, while features with zero coefficients are considered less relevant and can be ignored.
Model Complexity:
  • Ridge Regression: Ridge regression typically results in models with all features included, albeit with their coefficients shrunk towards zero. This can lead to more complex models, especially when dealing with datasets with a large number of features.
  • Lasso Regression: Lasso regression tends to produce sparser models with fewer features, as it sets some coefficients exactly to zero. This can lead to simpler models with reduced complexity, which may be desirable for better interpretability and computational efficiency.
Performance:
  • Ridge Regression: In this case, Ridge regression has a slightly lower test MSE than Lasso regression, indicating slightly better predictive performance. However, the difference is small and is based on a single train/test split. Further analysis, such as repeating the split or testing on additional datasets, would be needed to determine whether the difference is consistent; a minimal check along these lines is sketched below.
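
The following sketch repeats the 80/20 split several times and records the difference in test MSE between the two models; it is an illustrative addition, not part of the original analysis, and the exact numbers will vary with the random seed.

# Repeat the random split and compare test MSE (illustrative sketch)
set.seed(42)  # arbitrary seed for reproducibility of this check
mse_gap <- replicate(10, {
  idx <- sample(1:nrow(College), floor(0.8 * nrow(College)))
  xtr <- as.matrix(College[idx, -which(names(College) == "Grad.Rate")])
  ytr <- College$Grad.Rate[idx]
  xte <- as.matrix(College[-idx, -which(names(College) == "Grad.Rate")])
  yte <- College$Grad.Rate[-idx]
  ridge <- cv.glmnet(xtr, ytr, alpha = 0)
  lasso <- cv.glmnet(xtr, ytr, alpha = 1)
  mean((predict(ridge, s = "lambda.min", newx = xte) - yte)^2) -
    mean((predict(lasso, s = "lambda.min", newx = xte) - yte)^2)
})
summary(mse_gap)  # negative values favor Ridge, positive values favor Lasso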


Conclusions

In summary, Ridge and Lasso regression each offer unique strengths and limitations that cater to different analytical needs:

Ridge Regression:

Strengths:
  • Retains all features in the model, providing a comprehensive view of the data.
  • Performs well when multicollinearity is present among predictor variables (several predictors in this dataset, such as Apps, Accept, and Enroll, are likely to be highly correlated; a quick check is sketched below).
  • Suitable for scenarios where interpretability is not the primary concern, and the focus is on predictive accuracy.
Limitations:
  • Does not perform feature selection; all features are retained in the model.
  • May struggle with datasets containing a large number of irrelevant features, leading to less interpretable models.

Optimal Usage: Ridge regression is preferred when predictive accuracy is paramount, and the goal is to prevent overfitting in the presence of multicollinearity. It is also suitable when interpretability is less critical, and a comprehensive model view is desired.
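
To illustrate the multicollinearity point above, the correlations among the admissions-related predictors can be checked directly; this is an optional check, not part of the original analysis.

# Pairwise correlations among admissions-related predictors
round(cor(College[, c("Apps", "Accept", "Enroll", "F.Undergrad")]), 2)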

Lasso Regression:

Strengths:
  • Performs feature selection by setting some coefficients to zero, resulting in simpler and more interpretable models.
  • Well-suited for scenarios where identifying the most relevant features is essential for decision-making.
  • Can handle datasets with a large number of predictors by effectively reducing model complexity.
Limitations:
  • May struggle when predictors are highly correlated, as it tends to arbitrarily select one predictor over others.
  • Less effective in situations where all features are potentially relevant, as it may discard useful predictors.

Optimal Usage: Lasso regression is preferred when interpretability and feature selection are crucial, and there is a need to identify the most influential predictors. It is particularly useful when dealing with high-dimensional datasets or when model simplicity is desired.

In practice, the choice between Ridge and Lasso regression depends on the specific goals of the analysis, including the trade-offs between predictive accuracy, interpretability, and model complexity.