Regularization methods such as Ridge and Lasso regression are techniques for reducing model complexity and preventing overfitting in statistical modeling. They are particularly valuable in scenarios with a large number of predictors or when multicollinearity exists among variables. By incorporating a penalty term into the loss function, Ridge and Lasso regression shrink the coefficient estimates, mitigating overfitting and enhancing the model’s ability to generalize to new data.
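Formally, both methods minimize the residual sum of squares plus a penalty on the coefficients; ridge uses a squared (\(\ell_2\)) penalty, while lasso uses an absolute-value (\(\ell_1\)) penalty, which is what allows lasso to shrink some coefficients exactly to zero:
\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
\]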
In this analysis, we use ridge and lasso regression to predict the graduation rate of colleges in the USA. The dataset is the College dataset from the ‘ISLR’ package in R, which includes variables such as whether a college is private, the number of applications received, the number of applications accepted, and the number of new students enrolled, among others.
Our objective is to construct, compare, and evaluate predictive models capable of accurately estimating college graduation rates (“Grad.Rate”) based on these input variables.
In total there are 777 rows and 18 columns. The variable “Grad.Rate” serves as our output variable, while the remaining 17 variables act as predictors. There are no missing values in the dataset.
# Load required packages
library(ISLR)     # College dataset
library(dplyr)    # %>%, sample_frac(), setdiff()
library(tibble)   # as_tibble()
library(corrplot) # correlation plot
library(glmnet)   # ridge and lasso regression
library(Metrics)  # assumed source of the rmse() helper used below
# Load data
data("College")
dataset <- College
# Get an overview of the data
dim(dataset)
## [1] 777 18
head(as_tibble(dataset))
# Convert 'Private' from factor to numeric (0 for 'No', 1 for 'Yes')
dataset$Private <- as.numeric(dataset$Private == 'Yes')
# Check for any missing values
any(is.na(dataset)) # no missing values
## [1] FALSE
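Before fitting any models, it can also help to glance at the distribution of the outcome variable. A minimal optional check (ours, not part of the original analysis; output omitted):
# Inspect the distribution of the outcome variable
summary(dataset$Grad.Rate)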
There is a moderate positive correlation (0.34) between being a private college and the graduation rate: private colleges tend to have slightly higher graduation rates than public ones. There is also a moderate negative correlation (-0.31) between the student-to-faculty ratio and graduation rates; lower student-to-faculty ratios, indicating smaller class sizes, are associated with higher graduation rates. Other variables such as Top10perc, Top25perc, Expend, and PhD have relatively strong positive correlations with Grad.Rate, indicating that higher values of these variables tend to be associated with higher graduation rates. This suggests that colleges with a higher percentage of top students (Top10perc and Top25perc), higher expenditure, and a higher percentage of faculty with Ph.D. degrees tend to have higher graduation rates.
# Plot the correlation matrix using corrplot
cor_value <- cor(dataset)
# Correlation matrix
par(cex = 0.5)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_value, method = "color", col = col(300), type = "upper", order = "hclust",
         addCoef.col = "black",          # add correlation coefficients
         tl.col = "black", tl.srt = 90,  # text label color and rotation
         diag = FALSE, title = "Correlation Matrix",
         mar = c(1, 1, 1, 1))
Figure 1: Correlation Plot
We split the dataset into a training set comprising approximately 70% of the data (544 rows) and a test set comprising the remaining 30% (233 rows). The random seed was set to 123 for reproducibility.
#----------------- SPLIT DATA----------------
set.seed(123)
train = dataset %>% sample_frac(0.7) # get train data
test = dataset %>% setdiff(train) # get test data
x_train = model.matrix(Grad.Rate~., train)[,-1]
x_test = model.matrix(Grad.Rate~., test)[,-1]
# get the outcome for train and test
y_train = train %>%
select(Grad.Rate) %>%
unlist() %>%
as.numeric()
y_test = test %>%
select(Grad.Rate) %>%
unlist() %>%
as.numeric()
# get dimension of the data
cat("Dimension of train data: ", dim(x_train), "\n")
## Dimension of train data: 544 17
cat("Dimension of test data: ", dim(x_test), "\n")
## Dimension of test data: 233 17
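Since the test set is built with setdiff() rather than an explicit row index, a quick sanity check (ours, not part of the original analysis) can confirm that the two sets are disjoint:
# Confirm the train and test sets share no rows (expect 0)
nrow(dplyr::intersect(train, test))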
In ridge regression, the selection of the regularization parameter lambda (\(\lambda\)) is crucial, as it determines the strength of the penalty applied to the coefficients. The value lambda.min = 1.947471 is the lambda that minimizes the mean squared error in the cross-validation process; setting lambda = lambda.min minimizes the prediction error while still guarding against overfitting. The value lambda.1se = 15.07856 is the largest lambda whose cross-validated error lies within one standard error of the minimum. It is typically used when model simplicity is preferred, as it applies stronger shrinkage without sacrificing much predictive accuracy. Setting lambda = lambda.1se strikes a balance between model complexity and performance, yielding a more heavily regularized model that still retains reasonable predictive capability.
# ----------------- RIDGE REGRESSION ----------------
# Find best lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10) # 10-fold cross-validation (alpha = 0 for ridge)
ridge_lambda_min <- cv_ridge$lambda.min # get min lambda value
ridge_lambda_1se <- cv_ridge$lambda.1se # get 1se lambda value
# print lambda
print(ridge_lambda_min)
## [1] 1.947471
print(ridge_lambda_1se)
## [1] 15.07856
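If a simpler, more heavily regularized model were preferred, the fit could instead use lambda.1se; a minimal sketch (the object name ridge_model_1se is ours):
# Fit ridge at lambda.1se for stronger shrinkage
ridge_model_1se <- glmnet(x_train, y_train, alpha = 0, lambda = ridge_lambda_1se)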
In the plot below (Figure 2), the y-axis represents the mean squared error, the x-axis is the log of lambda (log(\(\lambda\))), and the numbers across the top of the plot are the numbers of non-zero coefficients in the model for each value of lambda. The red dots denote the error estimates, and the bars around them are the confidence intervals for those estimates. The first dotted line marks the minimum value of lambda, i.e. log(\(\lambda\)) = 0.6665315, at which all 17 predictor variables are retained in the model. The second dotted line marks the largest lambda within one standard error of the minimum error, which also retains all 17 non-zero coefficients, since ridge regression shrinks coefficients toward zero but does not set them exactly to zero.
# plot cv model
plot(cv_ridge)
Figure 2: Cross-Validation Ridge Model Plot
After fitting the model on the training data and reviewing the coefficients, the following observations were made:

The intercept is approximately 39.41, implying that when all other variables are zero, US colleges have a predicted graduation rate of 39.41.

Private: this variable, which indicates whether a college is private or public, has a coefficient of approximately 4.49. This suggests that private colleges, on average, have a graduation rate approximately 4.49 units higher than non-private colleges.

Top25perc, Top10perc, PhD, perc.alumni: these variables have relatively large positive coefficients compared to the others, suggesting that a one-unit increase in any of them is associated with an increase in the graduation rate. Apps, Accept, Outstate, and Room.Board also have positive coefficients, but very small ones, suggesting that these predictors have relatively small effects on the response variable in this model.

Terminal, S.F.Ratio, Enroll, F.Undergrad, P.Undergrad, Books, Personal, Expend: these variables have negative coefficients, suggesting that a one-unit increase in any of them is associated with a decrease in the response variable (Grad.Rate).
# Fit the model
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = ridge_lambda_min)
# print coefficient values
coef(ridge_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.941455e+01
## Private 4.490714e+00
## Apps 7.765641e-04
## Accept 2.221845e-04
## Enroll -2.764178e-04
## Top10perc 6.715711e-02
## Top25perc 1.358328e-01
## F.Undergrad -3.882698e-05
## P.Undergrad -1.163104e-03
## Outstate 6.168634e-04
## Room.Board 2.034711e-03
## Books -5.832332e-04
## Personal -2.928525e-03
## PhD 9.167559e-02
## Terminal -8.243149e-02
## S.F.Ratio -1.700936e-01
## perc.alumni 2.899054e-01
## Expend -3.170015e-04
The root mean square error (RMSE) for the training set is 12.55158. The RMSE summarizes the typical deviation between the actual values (y_train) and the predicted values (ridge_pred_train) from the ridge regression model: on average, the model’s predictions deviate from the actual graduation rates by approximately 12.55 percentage points.
# Train set Prediction
ridge_pred_train <- predict(ridge_model, newx = x_train)
ridge_train_rmse <- rmse(y_train, ridge_pred_train)
ridge_train_rmse
## [1] 12.55158
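The rmse() helper is assumed here to come from the Metrics package; for reference, an equivalent manual computation:
# RMSE by hand: square root of the mean squared prediction error
sqrt(mean((y_train - as.numeric(ridge_pred_train))^2))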
The RMSE for the test set is 13.02. This measures the typical deviation between the actual values and the predictions of the ridge regression model on the test data: on average, the model’s predictions on unseen data deviate from the actual graduation rates by approximately 13.02 percentage points.
# Test set Prediction
ridge_pred_test <- predict(ridge_model, newx = x_test)
ridge_test_rmse <- rmse(y_test, ridge_pred_test)
ridge_test_rmse
## [1] 13.01708
The RMSE for the test set is slightly higher than for the training set, implying that the model’s performance on the test data is slightly worse than on the training data. This gap suggests some degree of overfitting, but it is relatively small, so overfitting does not appear to be a significant concern here.
#check the difference in rmse
abs(ridge_train_rmse - ridge_test_rmse)
## [1] 0.465502
For the lasso, the value lambda.min = 0.1858955 is the lambda that minimizes the mean squared error in the cross-validation process; this is the level of regularization that prevents overfitting while retaining predictive accuracy. The value lambda.1se = 1.902698 is the largest lambda whose cross-validated error lies within one standard error of the minimum. Setting lambda = lambda.1se for our lasso regression strikes a balance between model complexity and performance, resulting in a simpler model (with fewer non-zero coefficients) that still retains reasonable predictive capability.
# ------------- LASSO REGRESSION ------------
# Find best lambda
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10) # 10-fold cross-validation (alpha = 1 for lasso)
lasso_lambda_min <- cv_lasso$lambda.min # get min lambda value
lasso_lambda_1se <- cv_lasso$lambda.1se # get 1se lambda value
# print lambda
print(lasso_lambda_min)
## [1] 0.1858955
print(lasso_lambda_1se)
## [1] 1.902698
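How aggressively the one-standard-error rule prunes the model can be checked directly from the cross-validation object; a minimal sketch (ours):
# Count predictors with non-zero coefficients at lambda.1se (intercept excluded)
sum(coef(cv_lasso, s = "lambda.1se")[-1, ] != 0)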
From Figure 3, the y-axis represents the mean squared error, the x-axis is the log of lambda (log(\(\lambda\))), and the numbers across the top of the plot are the numbers of non-zero coefficients in the model for each value of lambda. The first dotted line marks the minimum value of lambda (\(\lambda_{\text{min}}\)), i.e. log(\(\lambda_{\text{min}}\)) = -1.68257. This lambda value retains only 13 predictor variables in the model, indicating that 4 predictors were dropped during the regularization process. The second dotted line marks the largest lambda within one standard error of the minimum error, which retains only 8 non-zero coefficients in the lasso model. The red dots denote the error estimates, and the bars around them are the confidence intervals for those estimates.
# plot cv model
plot(cv_lasso)
Figure 3: Cross-Validation Lasso Model Plot
From the coefficients of the lasso model, it is evident that some coefficients have been reduced to exactly zero. The predictors with zero coefficients are Accept, Enroll, F.Undergrad, and Books, indicating that they have effectively been dropped from the model during the regularization process. On the other hand, Private, Apps, Top10perc, Top25perc, P.Undergrad, Outstate, Room.Board, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, and Expend retain non-zero coefficients, suggesting that they continue to contribute to the model’s predictions. Unlike the ridge regression, the intercept of the lasso regression is approximately 36.46, implying that when all other variables are zero, US colleges have a predicted graduation rate of 36.46.
# Fit the model
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = lasso_lambda_min)
# Print the coefficients
coef(lasso_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 36.4623243485
## Private 4.8080660824
## Apps 0.0009039970
## Accept .
## Enroll .
## Top10perc 0.0078727136
## Top25perc 0.1748126729
## F.Undergrad .
## P.Undergrad -0.0012340518
## Outstate 0.0006557908
## Room.Board 0.0020506959
## Books .
## Personal -0.0029136075
## PhD 0.0769753598
## Terminal -0.0754698238
## S.F.Ratio -0.1081763298
## perc.alumni 0.3144156922
## Expend -0.0002939809
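The retained predictors can also be extracted programmatically rather than read off the printout; a minimal sketch (the object name lasso_coef is ours):
# Names of predictors with non-zero lasso coefficients
lasso_coef <- coef(lasso_model)
rownames(lasso_coef)[as.numeric(lasso_coef) != 0][-1] # drop "(Intercept)"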
The root mean square error (RMSE) for the training set is 12.53952, representing the typical deviation between the actual values (y_train) and the predicted values (lasso_pred_train) from our lasso regression model: on average, the model’s predictions deviate from the actual values by approximately 12.54 percentage points.
# Train set Prediction
lasso_pred_train <- predict(lasso_model, newx = x_train)
lasso_train_rmse <- rmse(y_train, lasso_pred_train)
lasso_train_rmse
## [1] 12.53952
The RMSE for the test set is 13.04479, measuring the typical deviation between the actual values (y_test) and the predictions (lasso_pred_test) of the lasso regression model on the test data: on average, the model’s predictions deviate from the actual values by approximately 13.04 percentage points.
# Test set Prediction
lasso_pred_test <- predict(lasso_model, newx = x_test)
lasso_test_rmse <- rmse(y_test, lasso_pred_test)
lasso_test_rmse
## [1] 13.04479
Based on the provided RMSE values for the training set (12.53952) and the test set (13.04479), the model’s performance on the test data is slightly worse than on the training data. While this difference in performance suggests some degree of overfitting, it is relatively small. Therefore, while the model may be slightly overfitting, it does not appear to be a significant concern in this case.
#check for difference between rmse
abs(lasso_train_rmse-lasso_test_rmse)
## [1] 0.5052712
To determine which model performed better, we compare their performance on the test set. The test RMSE for the ridge regression model is 13.02; for the lasso regression model it is 13.04.
Since the RMSE for the ridge regression model is slightly lower than that of the lasso regression model, we can conclude that the ridge regression model performed slightly better on the test set. The Ridge regression model may have performed better in this case because it retains all predictors in the model but penalizes the coefficients to avoid overfitting. This approach allows it to capture more nuanced relationships between predictors and the response variable. On the other hand, the Lasso regression model may have dropped some predictors by setting their coefficients to zero during the regularization process. While this feature selection can reduce model complexity and potentially improve interpretability, it may also lead to loss of information and slightly higher prediction errors.
Another difference is that the absolute gap between train and test RMSE is slightly smaller for ridge regression than for lasso, suggesting that ridge generalizes marginally better on this data.
Using stepwise selection and fitting the model with linear regression, the RMSE for the test data is 13.16379 and the RMSE for the training data is 12.49078.
# Stepwise selection method
step_model <- step(lm(Grad.Rate ~ ., data = train), direction = 'both')
#summary(step_model)
# Predict on training set
train_pred_step <- predict(step_model, newdata = train)
# Make predictions on the test data
test_pred_step <- predict(step_model, newdata = test)
# Evaluate training set performance
step_train_rmse <- sqrt(mean((train$Grad.Rate - train_pred_step)^2))
step_train_rmse
## [1] 12.49078
# Evaluate test set performance
step_test_rmse <- sqrt(mean((test$Grad.Rate - test_pred_step)^2))
step_test_rmse
## [1] 13.16379
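Which predictors the stepwise procedure retained can be read from the fitted object; a minimal sketch (ours, output omitted):
# Final model formula chosen by stepwise selection
formula(step_model)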
Based on these RMSE values, the stepwise-selection lm model performed slightly better on the training set than both ridge and lasso regression, as it has the lowest training RMSE. On the test set, however, it has the highest RMSE, indicating poorer performance than ridge and lasso. This points to overfitting: the model fits the training data well but generalizes less well to the test data. In terms of test RMSE, therefore, ridge and lasso regression are preferable to the stepwise-selection model.
Comparing ridge and lasso, the test RMSE for ridge is lower than for lasso, while the opposite holds on the training set. Moreover, the difference between train and test RMSE is smaller for ridge, indicating that ridge regression is better at controlling overfitting and has better generalization performance. In summary, ridge regression is the better model for this particular dataset and problem.
# Compare RMSE for stepwise selection with Ridge and LASSO regression
all_model <- data.frame(
Model = c("Ridge", "Lasso", "Stepwise Selection"),
Train_RMSE = c(ridge_train_rmse, lasso_train_rmse, step_train_rmse),
Test_RMSE = c(ridge_test_rmse, lasso_test_rmse, step_test_rmse))
all_model$RMSE_difference <- abs(all_model$Train_RMSE - all_model$Test_RMSE)
all_model
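For reference, assembling the RMSE values reported above yields the following comparison (differences computed from the rounded values shown earlier):

Model                Train_RMSE   Test_RMSE   RMSE_difference
Ridge                  12.55158    13.01708           0.46550
Lasso                  12.53952    13.04479           0.50527
Stepwise Selection     12.49078    13.16379           0.67301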
In conclusion, Ridge regression emerges as the preferred model for predicting college graduation rates in the USA, based on its superior performance in controlling overfitting and achieving better generalization compared to LASSO and Stepwise Selection models.