Regularization methods such as Ridge and Lasso regression are techniques for reducing model complexity and preventing overfitting in statistical modeling. They are particularly valuable in scenarios with a large number of predictors or when multicollinearity exists among variables. By incorporating a penalty term into the loss function, Ridge and Lasso regression shrink the coefficient estimates, mitigating overfitting and enhancing the model’s ability to generalize to new data.
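Formally, both methods minimize the residual sum of squares plus a penalty on the coefficients; ridge uses a squared (\(\ell_2\)) penalty, while lasso uses an absolute-value (\(\ell_1\)) penalty, which is what allows lasso to shrink some coefficients exactly to zero:
\[
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
\]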
In this analysis, we use ridge and lasso regression to predict the graduation rate of colleges in the USA. The dataset is the College dataset from the ‘ISLR’ package in R, which includes variables such as whether a college is private, the number of applications received, the number of applications accepted, and the number of new students enrolled, among others.
Our objective is to construct, compare, and evaluate predictive models capable of accurately estimating college graduation rates (“Grad.Rate”) based on these input variables.
In total there are 777 rows and 18 columns. The variable “Grad.Rate” serves as our output variable, while the remaining 17 variables act as predictors. There are no missing values in the dataset.
# Load required packages
library(ISLR)     # College dataset
library(dplyr)    # %>%, sample_frac(), setdiff()
library(tibble)   # as_tibble()
library(corrplot) # correlation plot
library(glmnet)   # ridge and lasso regression
library(Metrics)  # assumed source of the rmse() helper used below
# Load data
data("College")
dataset <- College
# Get an overview of the data
dim(dataset)
## [1] 777 18
head(as_tibble(dataset))
# Convert 'Private' from factor to numeric (0 for 'No', 1 for 'Yes')
dataset$Private <- as.numeric(dataset$Private == 'Yes')
# Check for any missing values
any(is.na(dataset)) # no missing values
## [1] FALSE
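Before fitting any models, it can also help to glance at the distribution of the outcome variable. A minimal optional check (ours, not part of the original analysis; output omitted):
# Inspect the distribution of the outcome variable
summary(dataset$Grad.Rate)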
There is a moderate positive correlation (0.34) between being a private college and the graduation rate: private colleges tend to have slightly higher graduation rates than public ones. There is also a moderate negative correlation (-0.31) between the student-to-faculty ratio and graduation rates; lower student-to-faculty ratios, indicating smaller class sizes, are associated with higher graduation rates. Other variables such as Top10perc, Top25perc, Expend, and PhD have relatively strong positive correlations with Grad.Rate, indicating that higher values of these variables tend to be associated with higher graduation rates. This suggests that colleges with a higher percentage of top students (Top10perc and Top25perc), higher expenditure, and a higher percentage of faculty with Ph.D. degrees tend to have higher graduation rates.
# Plot the correlation matrix using corrplot
cor_value <- cor(dataset)
# Correlation matrix
par(cex = 0.5)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor_value, method = "color", col = col(300), type = "upper", order = "hclust",
         addCoef.col = "black",          # add correlation coefficients
         tl.col = "black", tl.srt = 90,  # text label color and rotation
         diag = FALSE, title = "Correlation Matrix",
         mar = c(1, 1, 1, 1))
Figure 1: Correlation Plot
We split the dataset into a training set comprising approximately 70% of the data (544 rows) and a test set comprising the remaining 30% (233 rows). The random seed was set to 123 for reproducibility.
#----------------- SPLIT DATA----------------
set.seed(123)
train = dataset %>% sample_frac(0.7) # get train data
test = dataset %>% setdiff(train) # get test data
x_train = model.matrix(Grad.Rate~., train)[,-1]
x_test = model.matrix(Grad.Rate~., test)[,-1]
# get the outcome for train and test
y_train = train %>%
select(Grad.Rate) %>%
unlist() %>%
as.numeric()
y_test = test %>%
select(Grad.Rate) %>%
unlist() %>%
as.numeric()
# get dimension of the data
cat("Dimension of train data: ", dim(x_train), "\n")
## Dimension of train data: 544 17
cat("Dimension of test data: ", dim(x_test), "\n")
## Dimension of test data: 233 17
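Since the test set is built with setdiff() rather than an explicit row index, a quick sanity check (ours, not part of the original analysis) can confirm that the two sets are disjoint:
# Confirm the train and test sets share no rows (expect 0)
nrow(dplyr::intersect(train, test))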
In ridge regression, the selection of the regularization parameter lambda (\(\lambda\)) is crucial, as it determines the strength of the penalty applied to the coefficients. The value lambda.min = 1.947471 is the lambda that minimizes the mean squared error in the cross-validation process; setting lambda = lambda.min minimizes the prediction error while still guarding against overfitting. The value lambda.1se = 15.07856 is the largest lambda whose cross-validated error lies within one standard error of the minimum. It is typically used when model simplicity is preferred, as it applies stronger shrinkage without sacrificing much predictive accuracy. Setting lambda = lambda.1se strikes a balance between model complexity and performance, yielding a more heavily regularized model that still retains reasonable predictive capability.
# ----------------- RIDGE REGRESSION ----------------
# Find best lambda
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10) # 10-fold cross-validation (alpha = 0 for ridge)
ridge_lambda_min <- cv_ridge$lambda.min # get min lambda value
ridge_lambda_1se <- cv_ridge$lambda.1se # get 1se lambda value
# print lambda
print(ridge_lambda_min)
## [1] 1.947471
print(ridge_lambda_1se)
## [1] 15.07856
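If a simpler, more heavily regularized model were preferred, the fit could instead use lambda.1se; a minimal sketch (the object name ridge_model_1se is ours):
# Fit ridge at lambda.1se for stronger shrinkage
ridge_model_1se <- glmnet(x_train, y_train, alpha = 0, lambda = ridge_lambda_1se)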
In the plot below (Figure 2), the y-axis represents the mean squared error, the x-axis is the log of lambda (log(\(\lambda\))), and the numbers across the top of the plot are the numbers of non-zero coefficients in the model for each value of lambda. The red dots denote the error estimates, and the bars around them are the confidence intervals for those estimates. The first dotted line marks the minimum value of lambda, i.e. log(\(\lambda\)) = 0.6665315, at which all 17 predictor variables are retained in the model. The second dotted line marks the largest lambda within one standard error of the minimum error, which also retains all 17 non-zero coefficients, since ridge regression shrinks coefficients toward zero but does not set them exactly to zero.
# plot cv model
plot(cv_ridge)
Figure 2: Cross-Validation Ridge Model Plot
After fitting the model on the training data and reviewing the coefficients, the following observations were made:

The intercept is approximately 39.41, implying that when all other variables are zero, US colleges have a predicted graduation rate of 39.41.

Private: this variable, which indicates whether a college is private or public, has a coefficient of approximately 4.49. This suggests that private colleges, on average, have a graduation rate approximately 4.49 units higher than non-private colleges.

Top25perc, Top10perc, PhD, perc.alumni: these variables have relatively large positive coefficients compared to the others, suggesting that a one-unit increase in any of them is associated with an increase in the graduation rate. Apps, Accept, Outstate, and Room.Board also have positive coefficients, but very small ones, suggesting that these predictors have relatively small effects on the response variable in this model.

Terminal, S.F.Ratio, Enroll, F.Undergrad, P.Undergrad, Books, Personal, Expend: these variables have negative coefficients, suggesting that a one-unit increase in any of them is associated with a decrease in the response variable (Grad.Rate).
# Fit the model
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = ridge_lambda_min)
# print coefficient values
coef(ridge_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.941455e+01
## Private 4.490714e+00
## Apps 7.765641e-04
## Accept 2.221845e-04
## Enroll -2.764178e-04
## Top10perc 6.715711e-02
## Top25perc 1.358328e-01
## F.Undergrad -3.882698e-05
## P.Undergrad -1.163104e-03
## Outstate 6.168634e-04
## Room.Board 2.034711e-03
## Books -5.832332e-04
## Personal -2.928525e-03
## PhD 9.167559e-02
## Terminal -8.243149e-02
## S.F.Ratio -1.700936e-01
## perc.alumni 2.899054e-01
## Expend -3.170015e-04
The root mean square error (RMSE) for the training set is 12.55158. The RMSE summarizes the typical deviation between the actual values (y_train) and the predicted values (ridge_pred_train) from the ridge regression model: on average, the model’s predictions deviate from the actual graduation rates by approximately 12.55 percentage points.
# Train set Prediction
ridge_pred_train <- predict(ridge_model, newx = x_train)
ridge_train_rmse <- rmse(y_train, ridge_pred_train)
ridge_train_rmse
## [1] 12.55158
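The rmse() helper is assumed here to come from the Metrics package; for reference, an equivalent manual computation:
# RMSE by hand: square root of the mean squared prediction error
sqrt(mean((y_train - as.numeric(ridge_pred_train))^2))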
The RMSE for the test set is 13.02. This measures the typical deviation between the actual values and the predictions of the ridge regression model on the test data: on average, the model’s predictions on unseen data deviate from the actual graduation rates by approximately 13.02 percentage points.
# Test set Prediction
ridge_pred_test <- predict(ridge_model, newx = x_test)
ridge_test_rmse <- rmse(y_test, ridge_pred_test)
ridge_test_rmse
## [1] 13.01708
The RMSE for the test set is slightly higher than for the training set, implying that the model’s performance on the test data is slightly worse than on the training data. This gap suggests some degree of overfitting, but it is relatively small, so overfitting does not appear to be a significant concern here.
#check the difference in rmse
abs(ridge_train_rmse - ridge_test_rmse)
## [1] 0.465502
For the lasso, the value lambda.min = 0.1858955 is the lambda that minimizes the mean squared error in the cross-validation process; this is the level of regularization that prevents overfitting while retaining predictive accuracy. The value lambda.1se = 1.902698 is the largest lambda whose cross-validated error lies within one standard error of the minimum. Setting lambda = lambda.1se for our lasso regression strikes a balance between model complexity and performance, resulting in a simpler model (with fewer non-zero coefficients) that still retains reasonable predictive capability.
# ------------- LASSO REGRESSION ------------
# Find best lambda
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10) # 10-fold cross-validation (alpha = 1 for lasso)
lasso_lambda_min <- cv_lasso$lambda.min # get min lambda value
lasso_lambda_1se <- cv_lasso$lambda.1se # get 1se lambda value
# print lambda
print(lasso_lambda_min)
## [1] 0.1858955
print(lasso_lambda_1se)
## [1] 1.902698
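How aggressively the one-standard-error rule prunes the model can be checked directly from the cross-validation object; a minimal sketch (ours):
# Count predictors with non-zero coefficients at lambda.1se (intercept excluded)
sum(coef(cv_lasso, s = "lambda.1se")[-1, ] != 0)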
From Figure 3, the y-axis represents the mean squared error, the x-axis is the log of lambda (log(\(\lambda\))), and the numbers across the top of the plot are the numbers of non-zero coefficients in the model for each value of lambda. The first dotted line marks the minimum value of lambda (\(\lambda_{\text{min}}\)), i.e. log(\(\lambda_{\text{min}}\)) = -1.68257. This lambda value retains only 13 predictor variables in the model, indicating that 4 predictors were dropped during the regularization process. The second dotted line marks the largest lambda within one standard error of the minimum error, which retains only 8 non-zero coefficients in the lasso model. The red dots denote the error estimates, and the bars around them are the confidence intervals for those estimates.
# plot cv model
plot(cv_lasso)
Figure 3: Cross-Validation Lasso Model Plot
From the coefficients of the lasso model, it is evident that some coefficients have been reduced to exactly zero. The predictors with zero coefficients are Accept, Enroll, F.Undergrad, and Books, indicating that they have effectively been dropped from the model during the regularization process. On the other hand, Private, Apps, Top10perc, Top25perc, P.Undergrad, Outstate, Room.Board, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, and Expend retain non-zero coefficients, suggesting that they continue to contribute to the model’s predictions. Unlike the ridge regression, the intercept of the lasso regression is approximately 36.46, implying that when all other variables are zero, US colleges have a predicted graduation rate of 36.46.
# Fit the model
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = lasso_lambda_min)
# Print the coefficients
coef(lasso_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 36.4623243485
## Private 4.8080660824
## Apps 0.0009039970
## Accept .
## Enroll .
## Top10perc 0.0078727136
## Top25perc 0.1748126729
## F.Undergrad .
## P.Undergrad -0.0012340518
## Outstate 0.0006557908
## Room.Board 0.0020506959
## Books .
## Personal -0.0029136075
## PhD 0.0769753598
## Terminal -0.0754698238
## S.F.Ratio -0.1081763298
## perc.alumni 0.3144156922
## Expend -0.0002939809
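The retained predictors can also be extracted programmatically rather than read off the printout; a minimal sketch (the object name lasso_coef is ours):
# Names of predictors with non-zero lasso coefficients
lasso_coef <- coef(lasso_model)
rownames(lasso_coef)[as.numeric(lasso_coef) != 0][-1] # drop "(Intercept)"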
The root mean square error (RMSE) for the training set is 12.53952, representing the typical deviation between the actual values (y_train) and the predicted values (lasso_pred_train) from our lasso regression model: on average, the model’s predictions deviate from the actual values by approximately 12.54 percentage points.
# Train set Prediction
lasso_pred_train <- predict(lasso_model, newx = x_train)
lasso_train_rmse <- rmse(y_train, lasso_pred_train)
lasso_train_rmse
## [1] 12.53952
The RMSE for the test set is 13.04479, measuring the typical deviation between the actual values (y_test) and the predictions (lasso_pred_test) of the lasso regression model on the test data: on average, the model’s predictions deviate from the actual values by approximately 13.04 percentage points.
# Test set Prediction
lasso_pred_test <- predict(lasso_model, newx = x_test)
lasso_test_rmse <- rmse(y_test, lasso_pred_test)
lasso_test_rmse
## [1] 13.04479
Based on the provided RMSE values for the training set (12.53952) and the test set (13.04479), the model’s performance on the test data is slightly worse than on the training data. While this difference in performance suggests some degree of overfitting, it is relatively small. Therefore, while the model may be slightly overfitting, it does not appear to be a significant concern in this case.
#check for difference between rmse
abs(lasso_train_rmse-lasso_test_rmse)
## [1] 0.5052712
To determine which model performed better, we compare their performance on the test set. The test RMSE for the ridge regression model is 13.02; for the lasso regression model it is 13.04.
Since the RMSE for the ridge regression model is slightly lower than that of the lasso regression model, we can conclude that the ridge regression model performed slightly better on the test set. The Ridge regression model may have performed better in this case because it retains all predictors in the model but penalizes the coefficients to avoid overfitting. This approach allows it to capture more nuanced relationships between predictors and the response variable. On the other hand, the Lasso regression model may have dropped some predictors by setting their coefficients to zero during the regularization process. While this feature selection can reduce model complexity and potentially improve interpretability, it may also lead to loss of information and slightly higher prediction errors.
Another difference is that the absolute gap between train and test RMSE is slightly smaller for ridge regression than for lasso, suggesting that ridge generalizes marginally better on this data.
Using stepwise selection and fitting the model with linear regression, the RMSE for the test data is 13.16379 and the RMSE for the training data is 12.49078.
# Stepwise selection method
step_model <- step(lm(Grad.Rate ~ ., data = train), direction = 'both')
#summary(step_model)
# Predict on training set
train_pred_step <- predict(step_model, newdata = train)
# Make predictions on the test data
test_pred_step <- predict(step_model, newdata = test)
# Evaluate training set performance
step_train_rmse <- sqrt(mean((train$Grad.Rate - train_pred_step)^2))
step_train_rmse
## [1] 12.49078
# Evaluate test set performance
step_test_rmse <- sqrt(mean((test$Grad.Rate - test_pred_step)^2))
step_test_rmse
## [1] 13.16379
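Which predictors the stepwise procedure retained can be read from the fitted object; a minimal sketch (ours, output omitted):
# Final model formula chosen by stepwise selection
formula(step_model)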
Based on these RMSE values, the stepwise-selection lm model performed slightly better on the training set than both ridge and lasso regression, as it has the lowest training RMSE. On the test set, however, it has the highest RMSE, indicating poorer performance than ridge and lasso. This points to overfitting: the model fits the training data well but generalizes less well to the test data. In terms of test RMSE, therefore, ridge and lasso regression are preferable to the stepwise-selection model.
Comparing ridge and lasso, the test RMSE for ridge is lower than for lasso, while the opposite holds on the training set. Moreover, the difference between train and test RMSE is smaller for ridge, indicating that ridge regression is better at controlling overfitting and has better generalization performance. In summary, ridge regression is the better model for this particular dataset and problem.
# Compare RMSE for stepwise selection with Ridge and LASSO regression
all_model <- data.frame(
Model = c("Ridge", "Lasso", "Stepwise Selection"),
Train_RMSE = c(ridge_train_rmse, lasso_train_rmse, step_train_rmse),
Test_RMSE = c(ridge_test_rmse, lasso_test_rmse, step_test_rmse))
all_model$RMSE_difference <- abs(all_model$Train_RMSE - all_model$Test_RMSE)
all_model
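For reference, assembling the RMSE values reported above yields the following comparison (differences computed from the rounded values shown earlier):

Model                Train_RMSE   Test_RMSE   RMSE_difference
Ridge                  12.55158    13.01708           0.46550
Lasso                  12.53952    13.04479           0.50527
Stepwise Selection     12.49078    13.16379           0.67301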
In conclusion, Ridge regression emerges as the preferred model for predicting college graduation rates in the USA, based on its superior performance in controlling overfitting and achieving better generalization compared to LASSO and Stepwise Selection models.