
1 Introduction

Regression on wide datasets brings two intertwined problems to the surface: multicollinearity among predictors and overfitting to sample noise. Ordinary least squares (OLS) still produces an answer in those situations, but the estimates become unstable and generalize poorly. Regularization techniques address this by adding a penalty term to the loss function that shrinks coefficient estimates toward zero, accepting a small amount of bias in exchange for a large reduction in variance.
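For concreteness, the penalized loss can be written out. The form below is the one the glmnet package documents for a Gaussian response; the notation (n observations, p predictors, mixing parameter α) is introduced here for illustration and does not appear elsewhere in the report.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  \;+\; \lambda\Bigl[\tfrac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}
  \;+\; \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\Bigr]
```

Setting α = 0 leaves only the squared (L2) term and gives Ridge; setting α = 1 leaves only the absolute-value (L1) term and gives LASSO. In both cases λ ≥ 0 controls how strongly the coefficients are pulled toward zero.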

This report applies the two most widely used regularization methods — Ridge regression and the Least Absolute Shrinkage and Selection Operator (LASSO) — to the College dataset from the ISLR package, which records seventeen predictor variables for 777 American colleges. The objective is to predict each institution’s graduation rate. Both models are tuned by ten-fold cross-validation via cv.glmnet, and their out-of-sample performance is benchmarked against a backward stepwise selection model.

Three questions guide the analysis:

How do Ridge and LASSO differ in their treatment of individual predictors?
Which method generalizes best on this dataset?
Does classical stepwise selection remain competitive?

2 Analysis

2.1 Load Libraries and Data

2.1.1 Setup

library(ISLR)        # College dataset
library(caret)       # createDataPartition — stratified train/test split
library(glmnet)      # cv.glmnet — Ridge and LASSO with cross-validation
library(MASS)        # stepAIC  — backward stepwise selection

2.1.2 Data Snapshot

data(College)

Dataset Overview
Metric	Value
Total observations	777
Total variables	18
Predictors	17
Response variable	Grad.Rate
Missing values	0

2.1.3 Explore the Full Dataset

2.2 Train / Test Split (70 / 30)

set.seed(3456)
trainIndex <- createDataPartition(College$Grad.Rate,
                                  p     = 0.70,
                                  list  = FALSE,
                                  times = 1)

train <- College[ trainIndex, ]
test  <- College[-trainIndex, ]

x.train <- model.matrix(Grad.Rate ~ ., data = train)[, -1]
y.train <- train$Grad.Rate
x.test  <- model.matrix(Grad.Rate ~ ., data = test)[,  -1]
y.test  <- test$Grad.Rate

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

Split summary: 546 training observations · 231 test observations · seed fixed at 3456 for reproducibility.
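As a rough illustration of what a stratified split on a numeric response does, the sketch below bins a simulated response into quintiles and samples 70% within each bin. It is a base-R approximation written for this note, not caret's actual implementation, and the simulated `y` is only a stand-in for Grad.Rate.

```r
# Simulated stand-in for a numeric response such as Grad.Rate (assumption).
set.seed(1)
y <- round(rnorm(777, mean = 65, sd = 17))

# Bin the response into quintile groups so each bin covers a similar share of rows.
breaks <- unique(quantile(y, probs = seq(0, 1, by = 0.2)))
bins   <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample 70% of the row indices inside every bin, then pool them.
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.70 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

# The two subsets should have similar response distributions.
c(train_n = length(train_idx), test_n = length(test_idx))
c(train_mean = mean(y[train_idx]), test_mean = mean(y[test_idx]))
```

Because sampling happens within bins, the training and test sets both span the full range of the response, which a plain random split does not guarantee for small samples.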

2.3 Ridge Regression

2.3.1 Cross-Validated Lambda

set.seed(123)
cv.ridge <- cv.glmnet(x.train, y.train, alpha = 0, nfolds = 10)

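Ridge λ.min: 0.978

Ridge λ.1se: 27.851

Ratio (1se / min): 28.5×

The wide gap between the two values indicates a flat CV curve: λ can be increased almost thirtyfold before cross-validated error rises by more than one standard error above its minimum, so much heavier shrinkage costs little predictive accuracy.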
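The relationship between λ.min and λ.1se can be sketched by hand. The grid and error values below are made up for illustration (they are not the report's CV results); the selection rule is the documented one: λ.1se is the largest λ whose mean CV error is within one standard error of the minimum.

```r
# Hypothetical CV summary: a decreasing lambda grid with mean error and its SE.
lambda <- c(50, 20, 8, 3, 1, 0.4)
cvm    <- c(260, 214, 209, 207, 206, 207)  # mean cross-validated error
cvsd   <- c(12, 9, 9, 9, 9, 9)             # standard error of each mean

# lambda.min: the lambda with the smallest mean CV error.
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error stays within one SE of the minimum.
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

When the curve is flat, many λ values fall under the threshold, so λ.1se can sit far to the right of λ.min, as it does for Ridge here.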

2.3.2 Figure 1 — Ridge CV Curve (Interactive)

Interpretation. The flat plateau on the left indicates the model fits well across a wide range of small lambdas. Only once log(λ) rises past roughly 4 does error climb sharply — this is where the penalty becomes strong enough to overshrink informative coefficients.

2.3.3 Coefficients at λ.min

ridge.mod <- glmnet(x.train, y.train, alpha = 0, lambda = cv.ridge$lambda.min)

2.3.4 RMSE — Training and Test

ridge.train.pred <- predict(ridge.mod, newx = x.train)
ridge.test.pred  <- predict(ridge.mod, newx = x.test)

ridge.train.rmse <- rmse(y.train, ridge.train.pred)
ridge.test.rmse  <- rmse(y.test,  ridge.test.pred)

Ridge Train RMSE: 11.947

Ridge Test RMSE: 14.355

2.4 LASSO Regression

2.4.1 Cross-Validated Lambda

set.seed(123)
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1, nfolds = 10)

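LASSO λ.min: 0.123

LASSO λ.1se: 2.011

Both values are much smaller than their Ridge counterparts, but the two sets are not directly comparable: glmnet applies λ to the sum of absolute coefficients for LASSO and to half the sum of squared coefficients for Ridge, so the L1 and L2 penalties operate on different scales.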

2.4.2 Figure 2 — LASSO CV Curve (Interactive)

Hover over any point to see the number of non-zero coefficients remaining in the model at that lambda. As λ grows, predictors drop out one by one.

2.4.3 Coefficients and Variable Selection

lasso.mod <- glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)
lasso.coef <- coef(lasso.mod)

zeroed <- rownames(lasso.coef)[which(lasso.coef[, 1] == 0)]
kept   <- rownames(lasso.coef)[which(lasso.coef[, 1] != 0)]

Four of the 17 predictors were shrunk to exactly zero by LASSO: Accept, Enroll, Top10perc, and Terminal. Two of the three highly collinear admissions variables (Accept and Enroll) were dropped while Apps was retained. This is the expected LASSO behavior for correlated predictors: it tends to keep one member of a correlated group and zero out the rest.

2.4.4 RMSE — Training and Test

lasso.train.pred <- predict(lasso.mod, newx = x.train)
lasso.test.pred  <- predict(lasso.mod, newx = x.test)

lasso.train.rmse <- rmse(y.train, lasso.train.pred)
lasso.test.rmse  <- rmse(y.test,  lasso.test.pred)

LASSO Train RMSE: 11.928

LASSO Test RMSE: 14.46

2.5 Figure 3 — Ridge vs. LASSO Coefficients (Interactive)

Click on a legend entry to hide that model’s bars and focus on the other. Drag a box over a region to zoom in; double-click to reset.

2.6 Figure 4 — Coefficient Shrinkage Paths (Interactive)

Ridge (left) shrinks coefficients smoothly and asymptotically toward zero — none ever reach it. LASSO (right) drives coefficients to exactly zero at finite λ values, producing the characteristic kinks where each predictor exits the model entirely. The dashed vertical line marks log(λ.min) for each model.
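The contrast between the two paths is easiest to see in the textbook special case of an orthonormal design, where both estimators have closed forms. That assumption does not hold for the College predictors, and the function names and input values below are hypothetical; the sketch only illustrates why LASSO produces exact zeros and Ridge does not.

```r
# Closed-form shrinkage of an OLS coefficient z under an orthonormal design.
# LASSO: soft-thresholding, which subtracts lambda and clips at zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
# Ridge: proportional shrinkage, which rescales but never reaches zero.
ridge_shrink   <- function(z, lambda) z / (1 + lambda)

z <- c(3.0, -1.2, 0.4, -0.1)   # hypothetical unpenalized coefficients

lasso_est <- soft_threshold(z, lambda = 0.5)
ridge_est <- ridge_shrink(z, lambda = 0.5)

round(rbind(ols = z, lasso = lasso_est, ridge = ridge_est), 3)
```

Any coefficient whose magnitude is below λ is set to exactly zero by the soft-threshold, which corresponds to the kinks in the LASSO panel; the Ridge row shrinks every entry by the same factor and keeps all of them non-zero.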

2.7 Stepwise Regression (Backward AIC)

ols.full <- lm(Grad.Rate ~ ., data = train)
step.mod  <- stepAIC(ols.full, direction = "backward", trace = FALSE)

step.train.pred <- predict(step.mod, newdata = train)
step.test.pred  <- predict(step.mod, newdata = test)

step.train.rmse <- rmse(y.train, step.train.pred)
step.test.rmse  <- rmse(y.test,  step.test.pred)

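Stepwise retained 11 variables, close to LASSO’s 13 survivors. Both selections were made on the same training data, so the agreement is suggestive rather than independent evidence. It indicates that a handful of the 17 predictors add little once the others are included, not that the underlying model for graduation rate is sparse.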
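To show what backward elimination by AIC does in isolation, the sketch below runs it on simulated data in which only two of six predictors carry signal. It uses base R's `step()` in place of `MASS::stepAIC` so that it runs without extra packages; the data and variable names are invented for illustration.

```r
# Simulated data: y depends on x1 and x2 only; x3..x6 are pure noise.
set.seed(42)
n   <- 200
sim <- data.frame(matrix(rnorm(n * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
sim$y <- 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

# Start from the full model and drop terms while doing so lowers AIC.
full    <- lm(y ~ ., data = sim)
reduced <- step(full, direction = "backward", trace = 0)

# Names of the predictors that survived elimination.
kept <- attr(terms(reduced), "term.labels")
kept
```

AIC penalizes each extra coefficient only lightly, so a noise variable can occasionally survive; this is one reason stepwise and LASSO selections need not match exactly.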

2.8 Model Comparison

2.8.1 Table 1 — Performance Summary

Table 1. Model performance and complexity summary.
Model	λ.min	Variables Kept	Train RMSE	Test RMSE
Ridge	0.9779	17	11.9466	14.3546
LASSO	0.1234	13	11.9281	14.4598
Stepwise	NA	11	11.8993	14.7233
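The table can be summarized by two derived quantities per model: the train-to-test gap and the distance from the best test RMSE. The sketch below recomputes them from the values in Table 1; the data frame and column names are chosen here for illustration.

```r
# RMSE values copied from Table 1.
perf <- data.frame(
  model = c("Ridge", "LASSO", "Stepwise"),
  train = c(11.9466, 11.9281, 11.8993),
  test  = c(14.3546, 14.4598, 14.7233),
  stringsAsFactors = FALSE
)

# Gap between test and training error: a rough indicator of overfitting.
perf$gap <- perf$test - perf$train
# Distance of each model's test RMSE from the best one.
perf$vs_best <- perf$test - min(perf$test)

best <- perf$model[which.min(perf$test)]
perf
best
```

The ordering by training RMSE is the reverse of the ordering by test RMSE, which is the pattern expected when the less constrained models fit the training sample slightly more closely.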

2.8.2 Figure 5 — Train vs. Test RMSE (Interactive)

2.8.3 Figure 6 — Predicted vs. Actual on Test Set (Interactive)

Hover over any point to see the exact actual and predicted values for that institution. The dashed diagonal represents perfect prediction; points above indicate over-prediction, points below indicate under-prediction. Mild compression toward the mean at both extremes is expected from shrinkage, and to some degree from any regression whose predictors explain only part of the variance.

3 Conclusion

All three models performed similarly, with test-set RMSE clustered between 14.4 and 14.7 percentage points against a training RMSE of about 11.9. Ridge was the marginal winner with the lowest test RMSE (14.35) while retaining every predictor by construction. LASSO followed closely (14.46) and reduced the model from seventeen to thirteen predictors through automatic variable selection. Stepwise selection retained eleven predictors and had the highest test RMSE (14.72).

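Practical recommendation. LASSO is a sensible default when the dataset is suspected to contain genuine signal in only a minority of its predictors, or when a smaller, more interpretable model is worth a small loss in accuracy. Ridge remains preferable when every predictor is expected to contribute at least a small effect, or when collinearity is severe enough that arbitrarily dropping one of a group of correlated variables is undesirable. On this dataset the difference in test RMSE between the two is small enough that the choice can rest on interpretability.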

3.0.1 Key Drivers of Graduation Rate

Rank	Driver	Variable
1	Academic Selectivity	Top25perc
2	Institutional Resources	Outstate
3	Alumni Engagement	perc.alumni
4	Institutional Type	PrivateYes

4 Limitations and Recommendations

4.1 Limitations

#	Limitation	Impact
1	Single 70/30 train/test split: results depend on one random partition	The ~0.11 RMSE gap between Ridge and LASSO may partly reflect sampling variation
2	Linearity assumption: Ridge, LASSO, and OLS all impose strict linearity	Compression at extremes (Fig 6) suggests some unexploited non-linearity
3	Dated dataset: College data reflects mid-1990s American higher education	Coefficients may not reflect current drivers of graduation outcomes
4	Missing confounders: no demographic, financial-aid, or institutional-mission variables	Model cannot account for factors known to influence graduation rates
5	Elastic net omitted: alpha restricted to 0 (Ridge) or 1 (LASSO)	A blended penalty may outperform both endpoints on this dataset

4.2 Recommendations

#	Recommendation
1	Repeated cross-validation — use 50+ random splits to quantify RMSE uncertainty
2	Add elastic net — grid-search over alpha ∈ {0, 0.25, 0.5, 0.75, 1.0}
3	Benchmark nonlinear methods — random forests, gradient-boosted trees, GAMs
4	Refit on contemporary data — verify the same predictors carry the same signal today
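The first recommendation can be sketched in a few lines. The example below repeats a 70/30 split 50 times on simulated data with an ordinary linear model, then reports the mean and standard deviation of test RMSE. The data, the model, and the helper `one_split` are placeholders for illustration; with the real data the same loop would wrap the Ridge, LASSO, and stepwise fits.

```r
# Simulated stand-in data (assumption): two predictors and a noisy response.
set.seed(2026)
n   <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 60 + 5 * sim$x1 - 3 * sim$x2 + rnorm(n, sd = 10)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# One random 70/30 split: fit on the training rows, score on the held-out rows.
one_split <- function() {
  idx <- sample.int(n, size = round(0.7 * n))
  fit <- lm(y ~ x1 + x2, data = sim[idx, ])
  rmse(sim$y[-idx], predict(fit, newdata = sim[-idx, ]))
}

# Repeat the split 50 times to see how much test RMSE moves between partitions.
test_rmse <- replicate(50, one_split())
c(mean = mean(test_rmse), sd = sd(test_rmse))
```

If the spread across splits is comparable to the differences between models in Table 1, those differences should not be read as a reliable ranking.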

5 References

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. https://doi.org/10.1080/00401706.1970.10488634

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer. https://www.statlearning.com/

Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288.

6 Appendix: Complete R Code

The listings below collect the modelling code for each analysis step, in order. Run top to bottom in a fresh R session, with the same seeds and package versions, they should reproduce the reported numerical results.

1. Libraries

library(ISLR)        # College dataset
library(caret)       # createDataPartition — stratified train/test split
library(glmnet)      # cv.glmnet — Ridge and LASSO with cross-validation
library(MASS)        # stepAIC  — backward stepwise selection

2. Load Data

data(College)

3. Train/Test Split

set.seed(3456)
trainIndex <- createDataPartition(College$Grad.Rate,
                                  p     = 0.70,
                                  list  = FALSE,
                                  times = 1)

train <- College[ trainIndex, ]
test  <- College[-trainIndex, ]

x.train <- model.matrix(Grad.Rate ~ ., data = train)[, -1]
y.train <- train$Grad.Rate
x.test  <- model.matrix(Grad.Rate ~ ., data = test)[,  -1]
y.test  <- test$Grad.Rate

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

4. Ridge

set.seed(123)
cv.ridge <- cv.glmnet(x.train, y.train, alpha = 0, nfolds = 10)

ridge.mod <- glmnet(x.train, y.train, alpha = 0, lambda = cv.ridge$lambda.min)

ridge.train.pred <- predict(ridge.mod, newx = x.train)
ridge.test.pred  <- predict(ridge.mod, newx = x.test)

ridge.train.rmse <- rmse(y.train, ridge.train.pred)
ridge.test.rmse  <- rmse(y.test,  ridge.test.pred)

5. LASSO

set.seed(123)
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1, nfolds = 10)

lasso.mod <- glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)
lasso.coef <- coef(lasso.mod)

zeroed <- rownames(lasso.coef)[which(lasso.coef[, 1] == 0)]
kept   <- rownames(lasso.coef)[which(lasso.coef[, 1] != 0)]

lasso.train.pred <- predict(lasso.mod, newx = x.train)
lasso.test.pred  <- predict(lasso.mod, newx = x.test)

lasso.train.rmse <- rmse(y.train, lasso.train.pred)
lasso.test.rmse  <- rmse(y.test,  lasso.test.pred)

6. Stepwise

ols.full <- lm(Grad.Rate ~ ., data = train)
step.mod  <- stepAIC(ols.full, direction = "backward", trace = FALSE)

step.train.pred <- predict(step.mod, newdata = train)
step.test.pred  <- predict(step.mod, newdata = test)

step.train.rmse <- rmse(y.train, step.train.pred)
step.test.rmse  <- rmse(y.test,  step.test.pred)

Report prepared for ALY6015 — Intermediate Analytics, Northeastern University · Knitted with R Markdown on April 23, 2026

Regularization Methods for Predicting College Graduation Rates

A Comparative Study of Ridge, LASSO, and Stepwise Selection

ANUSH GOEL | ALY6015 — Intermediate Analytics

April 23, 2026
