M4 Project
Report
ALY6015_71821: Intermediate Analytics
SEC_09_Fall_2023_CPS
Northeastern University
Professor: Vladimir Shapiro
By: Zeeshan Ahmad Ansari
Date of Submission: 04 December, 2023
Introduction
This report examines the factors that influence graduation rates at colleges in the United States. Using the College dataset, it builds models to predict graduation rates, with the goal of identifying which institutional characteristics matter most for whether students graduate.
Three modeling techniques are used: Ridge regression, LASSO regression, and stepwise regression. These methods identify the college factors most strongly related to graduation rate and estimate how strong those relationships are. Some of the factors examined include admissions rates, tuition costs, student expenses, and faculty credentials.
The models are evaluated on how accurately they predict a college's graduation rate, using the root mean square error (RMSE); lower error means more accurate predictions. The report also considers how interpretable the models are: can we easily understand which factors are most important?
By the end, the report gives recommendations on the best modeling approach to predict graduation rate. It also highlights specific college factors that show strong connections with how many students end up graduating.
The goal is to provide useful insights for college leaders and policymakers working to improve graduation rates. Increasing how many students complete college is important for issues like college access, affordability, and value across the education system. This report aims to support decision-making in these areas.
Analysis
Library
#The report utilizes a set of libraries for various data processing and visualization tasks.
library(ISLR)
library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
library(psych)
library(reshape2)
library(corrplot)
library(caret)
library(leaps)
library(GGally)
library(pROC)
library(MASS)
library(gridExtra)
library(glmnet)
Task 1
Split the data into a train and test set – refer to the ALY6015_Feature_Selection_R.pdf document for information on how to split a dataset.
In the process of modeling, it is common and essential practice to partition the dataset into two distinct subsets: a training dataset and a testing dataset. The training dataset serves as the foundation upon which the model is trained, while the testing dataset is employed to assess the model's performance. This division is crucial for verifying that the model generalizes effectively to new, unseen data, guarding against the risk of overfitting (Brownlee, 2020).
A frequently employed strategy is the 70/30 split, where 70% of the dataset is allocated to the training set, and the remaining 30% is designated for testing. This approach strikes a balance between providing the model with sufficient data for learning and ensuring a robust evaluation on independent test data.
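As an aside, the same 70/30 split can also be produced with caret's createDataPartition(), which additionally stratifies on the response. The chunk below is an illustrative sketch only (the object names alt_train and alt_test are hypothetical); the analysis that follows uses the base R sampling shown further down.
# Illustrative alternative: a stratified 70/30 split with caret (sketch only)
data("College", package = "ISLR")
set.seed(123)
part_idx <- createDataPartition(College$Grad.Rate, p = 0.7, list = FALSE)
alt_train <- College[part_idx, ]   # ~70% of rows for training
alt_test  <- College[-part_idx, ]  # remaining ~30% for testing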
# Load the College dataset
data("College")
# Display the structure of the dataset
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
# Viewing the first few rows of the dataset
head(College)
# Summary statistics
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
# Summary statistics of the dataset
data_summary <- summary(College)
# Check for missing values
sum(is.na(College))
## [1] 0
college_data <- College
# Split the data into training and testing sets
set.seed(123)
train_indices <- sample(1:nrow(college_data), 0.7 * nrow(college_data))
train_data <- college_data[train_indices, ]
test_data <- college_data[-train_indices, ]
# Separate predictors and response variable for training set
x_train <- model.matrix(Grad.Rate ~ . - 1, data = train_data) # Exclude intercept term
y_train <- train_data$Grad.Rate
# Separate predictors and response variable for test set
x_test <- model.matrix(Grad.Rate ~ . - 1, data = test_data) # Exclude intercept term
y_test <- test_data$Grad.Rate
Ridge Regression
Task 2
Use the cv.glmnet function to estimate the lambda.min and lambda.1se values. Compare and discuss the values.
# Find the best lambda using cross-validation
set.seed(123)
# Use cv.glmnet to estimate lambda.min and lambda.1se
ridge_model <- cv.glmnet(x_train, y_train, alpha = 0)
# The optimal value of lambda minimizes the cross-validated prediction error
# lambda.min - the lambda that minimizes out-of-sample (cross-validated) loss
# lambda.1se - the largest lambda within 1 standard error of lambda.min
# Extract optimal lambda values
best_lambda_min_ridge <- ridge_model$lambda.min
best_lambda_min_ridge
## [1] 1.617762
best_lambda_1se_ridge <- ridge_model$lambda.1se
best_lambda_1se_ridge
## [1] 18.17271
Trade-off Between Regularization and Model Fit:
The smaller value, best_lambda_min_ridge = 1.6177618, applies only light regularization, so the model fits the training data more closely and can capture more complex patterns. The larger value, best_lambda_1se_ridge = 18.1727072, applies a stronger penalty, producing a simpler, more heavily shrunk model that may generalize better to new data.
Selection of Lambda:
The choice between these two lambdas depends on the balance we want to strike between model simplicity and predictive accuracy. If prediction accuracy is of utmost importance, we might lean towards best_lambda_min_ridge. If we prefer a simpler model that still performs reasonably well, we might consider best_lambda_1se_ridge.
Overfitting and Generalization:
Smaller lambda values, like best_lambda_min_ridge, might risk overfitting the training data, meaning the model may not generalize well to new, unseen data. Larger lambda values, like best_lambda_1se_ridge, tend to produce more generalized models, but they might sacrifice some predictive accuracy.
Model Complexity:
The exact interpretation of these values depends on the characteristics of our specific data. A high-dimensional dataset or highly correlated predictors might call for more regularization.
In conclusion, the choice between these two lambdas involves a trade-off between model complexity and generalization. It’s often beneficial to evaluate the performance of the chosen lambda on a separate validation set or test set to ensure good generalization.
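As a quick check of this trade-off, the cross-validated mean squared error stored in the cv.glmnet object can be read off directly at both lambdas. The lines below are a small sketch using the fields of the ridge_model object defined above.
# Sketch: cross-validated MSE at lambda.min and lambda.1se for the ridge model
cv_mse_min_ridge <- ridge_model$cvm[ridge_model$lambda == ridge_model$lambda.min]
cv_mse_1se_ridge <- ridge_model$cvm[ridge_model$lambda == ridge_model$lambda.1se]
cat("CV MSE at lambda.min:", cv_mse_min_ridge, "\n")
cat("CV MSE at lambda.1se:", cv_mse_1se_ridge, "\n")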
Task 3
Plot the results from the cv.glmnet function provide an interpretation. What does this plot tell us?
# Plot results from cv.glmnet
plot(ridge_model)
# Add a caption to the plot
title( sub = "Figure 1: Ridge Model, Lambda values vs. Mean Squared Error")#(Kabacoff, 2015)
The plot tells us that ridge regression can improve the performance of linear regression on the test set, but we need to choose the right value of lambda. The optimal lambda value can be found using cross-validation.
The lambda.min value is relatively small, which suggests that only mild shrinkage is needed to minimize the cross-validated error on the training data.
The lambda.1se value is much larger than the lambda.min value, which suggests that the error curve is fairly flat in this region and we have a good deal of flexibility in choosing the lambda value.
We can consider using a lambda value that is slightly larger than the lambda.min value in order to avoid overfitting.
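If that more conservative choice is preferred, the ridge model can be refit at lambda.1se in exactly the same way. The line below is a sketch; the remainder of this report continues with lambda.min.
# Sketch: a more heavily regularized ridge fit using lambda.1se
ridge_fit_1se <- glmnet(x_train, y_train, alpha = 0, lambda = best_lambda_1se_ridge)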
Task 4
Fit a Ridge regression model against the training set and report on the coefficients. Is there anything interesting?
# Fit Ridge regression model with optimal lambda.min
ridge_fit <- glmnet(x_train, y_train, alpha = 0, lambda = best_lambda_min_ridge)
# Report coefficients
coef(ridge_fit)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 4.166376e+01
## PrivateNo -2.447503e+00
## PrivateYes 2.404691e+00
## Apps 8.400655e-04
## Accept 1.802923e-04
## Enroll -3.149232e-04
## Top10perc 6.630663e-02
## Top25perc 1.409056e-01
## F.Undergrad -3.616917e-05
## P.Undergrad -1.169025e-03
## Outstate 6.149904e-04
## Room.Board 2.016640e-03
## Books -5.561352e-04
## Personal -2.933265e-03
## PhD 9.874167e-02
## Terminal -8.946227e-02
## S.F.Ratio -1.692663e-01
## perc.alumni 2.921334e-01
## Expend -3.482096e-04
In examining the results of our Ridge regression model, we find insightful patterns in the estimated coefficients. The intercept, representing the baseline prediction when all other factors are zero, is approximately 41.66. For the private-status indicators, the model estimates a decrease of about 2.45 for PrivateNo and an increase of about 2.40 for PrivateYes, suggesting a notable influence of private status on the predicted outcome.
Moreover, the model includes a range of numeric predictors, and small changes in certain factors have measurable effects on the estimated outcome. For instance, an increase of one application (Apps) corresponds to an estimated increase of about 0.0008401 in the graduation rate, while a one-point increase in Top10perc has a positive impact of about 0.06631. These estimates provide a nuanced understanding of how specific attributes contribute to the predicted outcome.
The output is printed as a sparse matrix only because of how glmnet stores coefficients; unlike LASSO, Ridge regression does not set coefficients exactly to zero, and every predictor here retains a nonzero, if sometimes very small, coefficient. What the ridge penalty does instead is shrink all coefficients toward zero, which limits the influence of noisy or weakly informative predictors and strikes a balance between model flexibility and overfitting.
The signs and magnitudes of the coefficients offer further insights into the relationships between predictors and the response variable. Positive coefficients suggest a positive impact on the outcome, while negative coefficients imply a negative impact. The magnitude of these coefficients provides a sense of the strength of these relationships.
In summary, our Ridge regression model not only provides predictive power but also offers valuable insights into the relative importance of different factors. The regularization employed helps maintain a manageable and interpretable model, ensuring that our predictions are meaningful and not overly influenced by noise in the data.
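To scan these estimates more easily, the coefficients can be converted to a data frame and ordered by absolute size. This is only a rough screening device, since the predictors are on very different scales and raw magnitudes are therefore not directly comparable measures of importance; the helper below is a sketch.
# Sketch: order ridge coefficients by absolute magnitude for easier inspection
ridge_coef_df <- as.data.frame(as.matrix(coef(ridge_fit)))
colnames(ridge_coef_df) <- "coefficient"
ridge_coef_df[order(abs(ridge_coef_df$coefficient), decreasing = TRUE), , drop = FALSE]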
Task 5
Determine the performance of the fit model against the training set by calculating the root mean square error (RMSE). sqrt(mean((actual - predicted)^2))
# Calculate RMSE on training set
ridge_train_predictions <- predict(ridge_fit, newx = x_train)
ridge_train_rmse <- sqrt(mean((ridge_train_predictions - y_train)^2))
cat("Ridge RMSE on training set:", ridge_train_rmse, "\n")
## Ridge RMSE on training set: 12.5368
The Root Mean Square Error (RMSE) measures how well our Ridge regression model fits the training data. The value obtained is approximately 12.54, meaning that, on average, the model's predictions of Graduation Rate deviate from the actual values by about 12.54 percentage points.
Generally, a lower RMSE indicates better accuracy, and the value should be judged against the scale of the response. Graduation rates in this dataset range from 10 to 118, so an average error of roughly 12.5 points is moderate: not highly precise, but informative. Whether a deviation of this size is acceptable depends on the specific application and on the consequences of prediction errors.
In summary, the training RMSE of about 12.54 indicates the average prediction error of the Ridge regression model on the data it was fit to; interpretation of this value should be made in the context of the specific dataset and the goals of the predictive modeling.
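As a simple way to put the RMSE in context, as discussed above, it can be compared with the spread of Grad.Rate in the training data. The lines below are a sketch.
# Sketch: compare the training RMSE with the spread of the response
range(train_data$Grad.Rate)                            # minimum and maximum graduation rate
sd(train_data$Grad.Rate)                               # standard deviation of the response
ridge_train_rmse / diff(range(train_data$Grad.Rate))   # RMSE as a share of the range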
Task 6
Determine the performance of the fit model against the test set by calculating the root mean square error (RMSE). Is your model overfit?
# Calculate RMSE on test set
ridge_test_predictions <- predict(ridge_fit, newx = x_test)
ridge_test_rmse <- sqrt(mean((ridge_test_predictions - y_test)^2))
cat("Ridge RMSE on test set:", ridge_test_rmse, "\n")
## Ridge RMSE on test set: 13.04473
The Root Mean Square Error (RMSE) calculated on the test set serves as a measure of how well our Ridge regression model performs on new, unseen data. The obtained RMSE value for the test set is approximately 13.04.
When we compare the RMSE on the training set (12.54) with the RMSE on the test set (13.04), we notice a slight increase in prediction error on the test data. This is expected, because the model is being evaluated on data it has not seen during training.
An increase in RMSE from training to test data is a common observation and is not necessarily a cause for concern. However, it’s crucial to ensure that the increase is not too substantial, as it could indicate overfitting.
Overfitting occurs when a model learns the training data too well, capturing noise and specifics that don’t generalize to new data. A significant increase in RMSE from training to test might suggest overfitting. In our case, the increase is relatively small, which is a positive sign.
The RMSE of 13.04 means that, on average, our predictions on the test set are off by around 13.04 percentage points, which provides a measure of the model's accuracy on new data.
The acceptability of this value depends on the context of the specific application. In some cases, a deviation of 13.04 percentage points might be considered acceptable, while in others it might be necessary to improve model performance.
In conclusion, while the test RMSE is slightly higher than the training RMSE, the increase is not alarming. The model demonstrates reasonable generalization to new data, and the prediction error on the test set remains within an acceptable range. Further considerations may be needed based on the specific requirements and goals of your modeling task.
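A quick numerical check of overfitting is simply the gap between test and training RMSE; the sketch below expresses it both in percentage points and relative to the training RMSE.
# Sketch: absolute and relative gap between test and training RMSE for ridge
rmse_gap_ridge <- ridge_test_rmse - ridge_train_rmse
cat("Test - train RMSE gap:", rmse_gap_ridge, "\n")
cat("Relative gap:", round(100 * rmse_gap_ridge / ridge_train_rmse, 1), "% of training RMSE\n")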
LASSO
Task 7
Use the cv.glmnet function to estimate the lambda.min and lambda.1se values. Compare and discuss the values.
# Find the best lambda using cross-validation
set.seed(123)
# Use cv.glmnet to estimate lambda.min and lambda.1se for LASSO
lasso_model <- cv.glmnet(x_train, y_train, alpha = 1)
# The optimal value of lambda minimizes the cross-validated prediction error
# lambda.min - the lambda that minimizes out-of-sample (cross-validated) loss
# lambda.1se - the largest lambda within 1 standard error of lambda.min
# Extract optimal lambda values for LASSO
best_lambda_min_lasso <- lasso_model$lambda.min
best_lambda_min_lasso
## [1] 0.07336353
best_lambda_1se_lasso <- lasso_model$lambda.1se
best_lambda_1se_lasso
## [1] 1.312216
The smaller value, best_lambda_min_lasso = 0.0733635, applies only a light penalty, so the model fits the training data more closely and can capture more complex patterns.
The larger value, best_lambda_1se_lasso = 1.3122165, applies a stronger penalty, producing a simpler, sparser model that may generalize better to new data.
Selection of Lambda:
The choice between these two lambdas depends on the balance we want to strike between model simplicity and predictive accuracy. If prediction accuracy is of utmost importance, we need to lean towards best_lambda_min_lasso. If we want a simpler model that still performs reasonably well, we can consider best_lambda_1se_lasso.
As lambda increases, the model becomes more regularized, and the risk of overfitting decreases. best_lambda_min_lasso is the optimal lambda that minimizes out-of-sample loss.
The values of best_lambda_min_lasso and best_lambda_1se_lasso help understand the magnitude of the penalty applied to the coefficients. Smaller values indicate a weaker penalty, while larger values indicate a stronger penalty.
LASSO has the property of setting some coefficients exactly to zero. The choice of lambda influences the degree of sparsity in the model.
In conclusion, the LASSO regularization has identified two important lambdas. best_lambda_min_lasso provides the optimal balance between complexity and accuracy, while best_lambda_1se_lasso offers a slightly less regularized model. The specific choice depends on your modeling goals and the importance of simplicity versus accuracy in your context.
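Because the degree of sparsity depends on lambda, the number of predictors retained (nonzero coefficients) at the two candidate values can be counted directly from the cross-validation object; the lines below are a sketch.
# Sketch: count nonzero LASSO coefficients (excluding the intercept) at each lambda
nz_min <- sum(as.matrix(coef(lasso_model, s = "lambda.min"))[-1, ] != 0)
nz_1se <- sum(as.matrix(coef(lasso_model, s = "lambda.1se"))[-1, ] != 0)
cat("Nonzero coefficients at lambda.min:", nz_min, "\n")
cat("Nonzero coefficients at lambda.1se:", nz_1se, "\n")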
Task 8
Plot the results from the cv.glmnet function provide an interpretation. What does this plot tell us?
# Plot results from cv.glmnet for LASSO
plot(lasso_model)
# Add a caption to the plot
title( sub = "Figure 2: LASSO Model, Lambda values vs. Mean Squared Error")#(Kabacoff, 2015)
The point where the curve reaches its minimum corresponds to lambda.min. This represents the most intricate model before encountering overfitting, as determined through cross-validation.
As we progress to the right (with an increase in log(lambda)), we enforce more robust regularization, leading to a simplification of the model by zeroing out certain coefficients.
The error bars illustrate that the variability of MSE grows as log(lambda) decreases (reduced regularization), aligning with expectations as more intricate models introduce greater complexity and potential variability.
lambda.1se is selected as a compromise between the intricacy of the model and MSE, aiming to simplify the model without a substantial increase in error.
lambda.1se is commonly employed for model selection to forestall overfitting while still maintaining a reasonable level of prediction error.
The sharp incline near the minimum MSE indicates a threshold beyond which heightened regularization rapidly escalates error. This signifies the point where the model begins to lose crucial predictors, emphasizing the importance of balancing complexity and accuracy.
Task 9
Fit a LASSO regression model against the training set and report on the coefficients. Do any coefficients reduce to zero? If so, which ones?
# Fit LASSO regression model with optimal lambda.min
lasso_fit <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_min_lasso)
# Report coefficients for LASSO
coef(lasso_fit)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 4.344051e+01
## PrivateNo -4.668788e+00
## PrivateYes 2.936056e-13
## Apps 1.265878e-03
## Accept -2.093923e-04
## Enroll -6.955668e-04
## Top10perc 1.094524e-02
## Top25perc 1.775816e-01
## F.Undergrad .
## P.Undergrad -1.254736e-03
## Outstate 6.995230e-04
## Room.Board 2.047359e-03
## Books .
## Personal -2.946155e-03
## PhD 1.246762e-01
## Terminal -1.277848e-01
## S.F.Ratio -1.791803e-01
## perc.alumni 3.232318e-01
## Expend -4.401914e-04
The LASSO regression analysis, designed to find the best way to predict Graduation Rates, uncovered some interesting findings. The intercept gives a baseline estimate of about 43.44 when all predictors are zero. Looking at whether a college is private, being PrivateNo decreases the predicted rate by around 4.67, while the PrivateYes coefficient is essentially zero.
To answer the task question directly: yes, some coefficients are reduced exactly to zero. The number of full-time undergraduates (F.Undergrad) and spending on books (Books) are dropped from the model entirely, their coefficients shrunk to zero by the LASSO penalty.
On the flip side, factors such as the percentage of alumni who donate (perc.alumni) and the percentage of faculty with a Ph.D. (PhD) are retained with comparatively large coefficients, suggesting they play a meaningful role in predicting Graduation Rates.
The LASSO method also makes the model simpler by setting some factors to zero. This simplicity helps us focus on what really matters, making it easier to understand and potentially better at predicting Graduation Rates. In a nutshell, LASSO helps us sift through the noise and highlight the important stuff.
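The zeroed coefficients can also be listed programmatically from the fitted object, which is a convenient check of the statements above (sketch):
# Sketch: list the predictors whose LASSO coefficients are exactly zero
lasso_coef_mat <- as.matrix(coef(lasso_fit))
rownames(lasso_coef_mat)[lasso_coef_mat[, 1] == 0]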
Task 10
Determine the performance of the fit model against the training set by calculating the root mean square error (RMSE). sqrt(mean((actual - predicted)^2))
# Calculate RMSE on training set for LASSO
lasso_train_predictions <- predict(lasso_fit, newx = x_train)
lasso_train_rmse <- sqrt(mean((lasso_train_predictions - y_train)^2))
cat("LASSO RMSE on training set:", lasso_train_rmse, "\n")
## LASSO RMSE on training set: 12.48835
The performance of the LASSO regression model on the training set is evaluated using the Root Mean Square Error (RMSE), which measures the average difference between the predicted and actual Graduation Rates. In this case, the calculated RMSE of approximately 12.49 suggests that, on average, the model’s predictions deviate by around 12.49 percentage points from the true values. When compared to the RMSE obtained from the Ridge regression model on the same training set (12.54), the LASSO model exhibits a similar level of accuracy, indicating comparable performance in balancing precision and simplicity.
The acceptability of this prediction error hinges on the specific requirements and implications within the context of the modeling task. Lower RMSE values generally indicate a better fit, but the significance of the error depends on the particular application. It’s important to note that the evaluation is conducted on the training set, and further analysis should extend to an independent test set to ensure the model’s ability to generalize to new, unseen data.
In summary, the LASSO model, with a training set RMSE of 12.49, demonstrates a reasonable level of accuracy in predicting Graduation Rates. The next phase of evaluation will involve testing its performance on a separate dataset to validate its effectiveness in making accurate predictions beyond the training data.
Task 11
Determine the performance of the fit model against the test set by calculating the root mean square error (RMSE). Is your model overfit?
# Calculate RMSE on test set for LASSO
lasso_test_predictions <- predict(lasso_fit, newx = x_test)
lasso_test_rmse <- sqrt(mean((lasso_test_predictions - y_test)^2))
cat("LASSO RMSE on test set:", lasso_test_rmse, "\n")
## LASSO RMSE on test set: 13.16976
The performance of the LASSO regression model on the test set is evaluated using the Root Mean Square Error (RMSE), which measures the average difference between the predicted and actual Graduation Rates on new, unseen data. In this instance, the calculated RMSE on the test set is approximately 13.17, indicating that, on average, the model’s predictions deviate by around 13.17 percentage points from the actual values.
When considering the RMSE on both the training set (12.49) and the test set (13.17), it’s observed that there is a slight increase in prediction error on the test data. While this increase is relatively modest, it prompts consideration of potential overfitting — a situation where the model may have become too tailored to the training data and struggles to generalize well to new, unseen data.
The slight increase in RMSE from training to test set suggests that the LASSO model might be exhibiting some overfitting tendencies. To ensure a more robust evaluation of overfitting, further investigation and comparison with alternative models may be beneficial. Additionally, exploring the impact of different regularization parameters during model tuning could help strike a better balance between accuracy on the training set and generalization to the test set.
In conclusion, while the LASSO model demonstrates reasonable predictive performance on the test set, the slight increase in RMSE prompts cautious consideration of potential overfitting. Further refinement and fine-tuning of the model parameters may be explored to enhance its generalization to new data.
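One such refinement, sketched below rather than adopted in the main analysis, is to refit LASSO at the more heavily regularized lambda.1se and check whether the gap between training and test error narrows.
# Sketch: a sparser LASSO fit at lambda.1se, with its training and test RMSE
lasso_fit_1se <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_1se_lasso)
sqrt(mean((predict(lasso_fit_1se, newx = x_train) - y_train)^2))  # training RMSE
sqrt(mean((predict(lasso_fit_1se, newx = x_test) - y_test)^2))    # test RMSE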
Task 12
Which model performed better and why? Is that what you expected?
Comparison
# Comparison
cat("Optimal lambda.min for Ridge:", best_lambda_min_ridge, "\n")
## Optimal lambda.min for Ridge: 1.617762
cat("Optimal lambda.min for LASSO:", best_lambda_min_lasso, "\n")
## Optimal lambda.min for LASSO: 0.07336353
# Compare RMSE between Ridge and LASSO on the test set
if (ridge_test_rmse < lasso_test_rmse) {
cat("Ridge performed better on the test set.\n")
} else if (lasso_test_rmse < ridge_test_rmse) {
cat("LASSO performed better on the test set.\n")
} else {
cat("Both models performed equally on the test set.\n")
}
## Ridge performed better on the test set.
In comparing the performance of the Ridge and LASSO regression models, we considered the optimal lambda.min values and the Root Mean Square Error (RMSE) on the test set. The optimal lambda.min for Ridge was found to be approximately 1.62, while for LASSO, it was around 0.0734. Lower values of lambda.min generally indicate less regularization and a closer fit to the training data.
Assessing the RMSE on the test set, we found that the Ridge model exhibited a lower RMSE (13.04) compared to LASSO (13.17). This implies that, on average, the Ridge model’s predictions on the test data were more accurate than those of the LASSO model.
The observed better performance of the Ridge model aligns with expectations to some extent. Ridge regression, with its emphasis on shrinking coefficients without setting them exactly to zero, often results in models that capture important relationships while avoiding overfitting. The LASSO model, on the other hand, enforces sparsity by setting some coefficients precisely to zero, potentially leading to a simpler model but with a risk of oversimplification.
It’s noteworthy that the comparison considers a trade-off between model complexity and accuracy. While Ridge performed better in this specific evaluation, the choice between Ridge and LASSO depends on the specific goals of the modeling task. Further exploration, including fine-tuning of regularization parameters and validation on additional datasets, could provide additional insights and refinement to model selection.
In summary, the comparison suggests that, in this instance, the Ridge model outperformed the LASSO model on the test set. This aligns with the general expectation of Ridge regression’s effectiveness in balancing accuracy and simplicity. However, the ultimate choice should be made considering the specific objectives and nuances of the predictive modeling task.
Task 13
Refer to the ALY6015_Feature_Selection_R.pdf document for how to perform stepwise selection and then fit a model. Did this model perform better or as well as Ridge regression or LASSO? Which method do you prefer and why?
stepwise_model <- stepAIC(lm(Grad.Rate ~ ., data = train_data), direction = "both")
## Start: AIC=2776.64
## Grad.Rate ~ Private + Apps + Accept + Enroll + Top10perc + Top25perc +
## F.Undergrad + P.Undergrad + Outstate + Room.Board + Books +
## Personal + PhD + Terminal + S.F.Ratio + perc.alumni + Expend
##
## Df Sum of Sq RSS AIC
## - Top10perc 1 0.3 84486 2774.7
## - Books 1 1.0 84487 2774.7
## - F.Undergrad 1 2.7 84488 2774.7
## - Enroll 1 20.5 84506 2774.8
## - Accept 1 111.7 84597 2775.4
## - S.F.Ratio 1 216.4 84702 2776.0
## <none> 84486 2776.6
## - Terminal 1 756.6 85242 2779.5
## - Private 1 842.3 85328 2780.0
## - PhD 1 861.2 85347 2780.2
## - Outstate 1 1212.7 85698 2782.4
## - Top25perc 1 1306.4 85792 2783.0
## - P.Undergrad 1 1335.2 85821 2783.2
## - Room.Board 1 1469.0 85955 2784.0
## - Expend 1 1554.5 86040 2784.6
## - Apps 1 1590.9 86077 2784.8
## - Personal 1 1669.9 86156 2785.3
## - perc.alumni 1 5149.1 89635 2806.8
##
## Step: AIC=2774.65
## Grad.Rate ~ Private + Apps + Accept + Enroll + Top25perc + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Books + Personal +
## PhD + Terminal + S.F.Ratio + perc.alumni + Expend
##
## Df Sum of Sq RSS AIC
## - Books 1 1.1 84487 2772.7
## - F.Undergrad 1 2.7 84489 2772.7
## - Enroll 1 21.3 84507 2772.8
## - Accept 1 119.5 84605 2773.4
## - S.F.Ratio 1 216.2 84702 2774.0
## <none> 84486 2774.7
## + Top10perc 1 0.3 84486 2776.6
## - Terminal 1 759.2 85245 2777.5
## - Private 1 844.4 85330 2778.1
## - PhD 1 866.8 85353 2778.2
## - Outstate 1 1217.5 85703 2780.4
## - P.Undergrad 1 1350.0 85836 2781.3
## - Room.Board 1 1474.9 85961 2782.0
## - Personal 1 1669.6 86156 2783.3
## - Expend 1 1704.5 86191 2783.5
## - Apps 1 1770.3 86256 2783.9
## - Top25perc 1 3472.1 87958 2794.5
## - perc.alumni 1 5148.8 89635 2804.8
##
## Step: AIC=2772.65
## Grad.Rate ~ Private + Apps + Accept + Enroll + Top25perc + F.Undergrad +
## P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal +
## S.F.Ratio + perc.alumni + Expend
##
## Df Sum of Sq RSS AIC
## - F.Undergrad 1 2.7 84490 2770.7
## - Enroll 1 21.2 84508 2770.8
## - Accept 1 119.3 84606 2771.4
## - S.F.Ratio 1 216.7 84704 2772.0
## <none> 84487 2772.7
## + Books 1 1.1 84486 2774.7
## + Top10perc 1 0.4 84487 2774.7
## - Terminal 1 777.7 85265 2775.6
## - Private 1 843.3 85330 2776.1
## - PhD 1 882.2 85369 2776.3
## - Outstate 1 1219.6 85707 2778.4
## - P.Undergrad 1 1351.2 85838 2779.3
## - Room.Board 1 1474.9 85962 2780.1
## - Expend 1 1704.5 86192 2781.5
## - Personal 1 1745.2 86232 2781.8
## - Apps 1 1769.9 86257 2781.9
## - Top25perc 1 3498.6 87986 2792.7
## - perc.alumni 1 5177.1 89664 2802.9
##
## Step: AIC=2770.67
## Grad.Rate ~ Private + Apps + Accept + Enroll + Top25perc + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + S.F.Ratio +
## perc.alumni + Expend
##
## Df Sum of Sq RSS AIC
## - Enroll 1 23.8 84514 2768.8
## - Accept 1 117.0 84607 2769.4
## - S.F.Ratio 1 215.0 84705 2770.1
## <none> 84490 2770.7
## + F.Undergrad 1 2.7 84487 2772.7
## + Books 1 1.1 84489 2772.7
## + Top10perc 1 0.4 84489 2772.7
## - Terminal 1 775.4 85265 2773.6
## - Private 1 849.5 85339 2774.1
## - PhD 1 884.8 85375 2774.3
## - Outstate 1 1217.2 85707 2776.4
## - P.Undergrad 1 1459.4 85949 2778.0
## - Room.Board 1 1478.8 85969 2778.1
## - Expend 1 1712.1 86202 2779.6
## - Personal 1 1743.9 86234 2779.8
## - Apps 1 1774.7 86264 2780.0
## - Top25perc 1 3531.1 88021 2790.9
## - perc.alumni 1 5188.1 89678 2801.0
##
## Step: AIC=2768.82
## Grad.Rate ~ Private + Apps + Accept + Top25perc + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + S.F.Ratio +
## perc.alumni + Expend
##
## Df Sum of Sq RSS AIC
## - S.F.Ratio 1 220.5 84734 2768.2
## - Accept 1 311.1 84825 2768.8
## <none> 84514 2768.8
## + Enroll 1 23.8 84490 2770.7
## + F.Undergrad 1 5.3 84508 2770.8
## + Top10perc 1 1.7 84512 2770.8
## + Books 1 0.9 84513 2770.8
## - Terminal 1 778.2 85292 2771.8
## - PhD 1 884.5 85398 2772.5
## - Private 1 897.1 85411 2772.6
## - Outstate 1 1298.5 85812 2775.1
## - P.Undergrad 1 1588.7 86102 2776.9
## - Room.Board 1 1597.5 86111 2777.0
## - Expend 1 1730.5 86244 2777.8
## - Apps 1 1751.4 86265 2778.0
## - Personal 1 1831.7 86345 2778.5
## - Top25perc 1 3510.8 88024 2788.9
## - perc.alumni 1 5209.2 89723 2799.3
##
## Step: AIC=2768.24
## Grad.Rate ~ Private + Apps + Accept + Top25perc + P.Undergrad +
## Outstate + Room.Board + Personal + PhD + Terminal + perc.alumni +
## Expend
##
## Df Sum of Sq RSS AIC
## <none> 84734 2768.2
## - Accept 1 344.9 85079 2768.4
## + S.F.Ratio 1 220.5 84514 2768.8
## + Enroll 1 29.4 84705 2770.1
## + F.Undergrad 1 10.0 84724 2770.2
## + Top10perc 1 1.3 84733 2770.2
## + Books 1 1.3 84733 2770.2
## - Terminal 1 721.3 85455 2770.8
## - PhD 1 822.2 85556 2771.5
## - Private 1 1158.9 85893 2773.6
## - Outstate 1 1383.8 86118 2775.0
## - Expend 1 1509.9 86244 2775.8
## - P.Undergrad 1 1598.0 86332 2776.4
## - Room.Board 1 1602.5 86337 2776.4
## - Personal 1 1729.5 86464 2777.2
## - Apps 1 1769.5 86504 2777.5
## - Top25perc 1 3485.1 88219 2788.1
## - perc.alumni 1 5442.4 90177 2800.0
summary(stepwise_model)
##
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Accept + Top25perc +
## P.Undergrad + Outstate + Room.Board + Personal + PhD + Terminal +
## perc.alumni + Expend, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.842 -7.123 -0.586 6.963 53.423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0093351 3.9355603 8.896 < 2e-16 ***
## PrivateYes 5.4253539 2.0151104 2.692 0.007320 **
## Apps 0.0017183 0.0005165 3.327 0.000939 ***
## Accept -0.0011039 0.0007517 -1.469 0.142508
## Top25perc 0.1787948 0.0382949 4.669 3.84e-06 ***
## P.Undergrad -0.0013048 0.0004127 -3.162 0.001659 **
## Outstate 0.0007828 0.0002661 2.942 0.003403 **
## Room.Board 0.0020948 0.0006616 3.166 0.001634 **
## Personal -0.0029227 0.0008886 -3.289 0.001072 **
## PhD 0.1517769 0.0669276 2.268 0.023745 *
## Terminal -0.1535561 0.0722936 -2.124 0.034127 *
## perc.alumni 0.3310725 0.0567440 5.834 9.40e-09 ***
## Expend -0.0004790 0.0001559 -3.073 0.002227 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.64 on 530 degrees of freedom
## Multiple R-squared: 0.4952, Adjusted R-squared: 0.4838
## F-statistic: 43.33 on 12 and 530 DF, p-value: < 2.2e-16
# Calculate RMSE for the Stepwise selection model on the test set
predictions_stepwise <- predict(stepwise_model, newdata = test_data)
rmse_test_stepwise <- sqrt(mean((predictions_stepwise - y_test)^2))
# Output the RMSE for the Stepwise selection model on the test set
rmse_test_stepwise
## [1] 13.14759
The stepwise selection process resulted in a model with a subset of predictors deemed most influential in explaining the variability in the response variable, Grad.Rate. The final model retains Private status, the number of applications (Apps), the number of acceptances (Accept), the percentage of students from the top 25% of their high school class (Top25perc), the number of part-time undergraduates (P.Undergrad), out-of-state tuition (Outstate), room and board expenses (Room.Board), personal spending (Personal), the percentage of faculty with a Ph.D. (PhD), the percentage of faculty with terminal degrees (Terminal), the percentage of alumni who donate (perc.alumni), and instructional expenditure (Expend).
The coefficients from the stepwise model provide insights into the impact of each predictor on the graduation rate. For instance, a one-unit increase in the number of applications (Apps) is associated with an estimated increase of about 0.0017 percentage points in the graduation rate, holding other predictors constant. Variables such as the percentage of alumni who donate (perc.alumni) and out-of-state tuition (Outstate) are also statistically significant contributors to explaining variation in the graduation rate.
The model’s performance was evaluated on the test set, and the root mean square error (RMSE) was found to be approximately 13.15. This value represents the average deviation between the actual and predicted graduation rates. In the context of RMSE, lower values indicate better predictive performance.
In comparison to the Ridge (test RMSE 13.04) and LASSO (test RMSE 13.17) models, the stepwise model's test RMSE of roughly 13.15 places it between the two: slightly worse than Ridge and marginally better than LASSO. The choice among these methods therefore depends on the objectives of the analysis and the trade-off between model complexity and interpretability. Stepwise selection offers interpretability by retaining a small subset of predictors in an ordinary least squares fit, but on predictive accuracy alone Ridge regression remains the preferred method for this dataset.
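To put the three approaches side by side, the test-set RMSE values computed above can be collected into a small comparison table (sketch):
# Sketch: test-set RMSE for all three models fit in this report
rmse_comparison <- data.frame(
  Model = c("Ridge (lambda.min)", "LASSO (lambda.min)", "Stepwise (AIC)"),
  Test_RMSE = c(ridge_test_rmse, lasso_test_rmse, rmse_test_stepwise)
)
rmse_comparison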
Conclusion
This analysis aimed to pinpoint the main factors influencing graduation rates and construct refined models for predicting outcomes in various higher education institutions across the United States. Utilizing methodologies such as Ridge regression, LASSO regression, and stepwise model selection on the College dataset, the analysis identified influential variables while maintaining a balance between model accuracy and interpretability.
Among the compared methods, Ridge regression emerged as the most effective in balancing prediction performance on the test data (RMSE of 13.04) with insight into the relative importance of different institutional characteristics for graduation success. Notable factors included the percentage of students drawn from the top quartile of their high school class, alumni giving rates, out-of-state tuition, room and board charges, and instructional expenditure.
These findings offer valuable insights for college leadership in making strategic investments in areas like student academic preparation, faculty engagement, and budget allocations to positively impact graduation outcomes. Furthermore, they underscore how timely completion is influenced by a complex network of operational and financial factors.
In summary, this report has shed light on both the key determinants of student success and the optimal approach for predictive analysis. Implementing these insights through customized analytics dashboards and monitoring mechanisms can enable institutions to develop policies and programs that facilitate the transition from enrollment to graduation, serving as the ultimate measure of higher education’s fulfillment of its mission.
References
Brownlee, J. (2020, August 26). Train-test split for Evaluating Machine Learning Algorithms. MachineLearningMastery.com. https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
Kabacoff, R. (2015). R in action : data analysis and graphics with R (Second edition.). Manning Publications.
Appendix
This report was generated from an R Markdown file named as follows:
ALY6015_ZeeshanAhmadAnsari_WEEK_4_FALL_B_2023.Rmd