Regression on wide datasets brings two intertwined problems to the surface: multicollinearity among predictors and overfitting to sample noise. Ordinary least squares (OLS) still produces an answer in those situations, but the estimates become unstable and generalize poorly. Regularization techniques address this by adding a penalty term to the loss function that shrinks coefficient estimates toward zero, accepting a small amount of bias in exchange for a large reduction in variance.
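The bias-for-variance trade described above can be seen in a small simulation. The sketch below uses synthetic data (not the College data) and the textbook closed-form ridge solution; the variable names and the penalty value `lambda = 5` are illustrative assumptions, and glmnet parameterizes its penalty differently, so this lambda is not on the same scale as the values reported later.

```r
# Illustrative sketch on synthetic data; all names and values are assumptions.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)     # nearly collinear with x1
X  <- scale(cbind(x1, x2))         # centred, unit-variance predictors
y  <- x1 + x2 + rnorm(n)
y  <- y - mean(y)                  # centre the response so no intercept is needed

# OLS: solve (X'X) b = X'y
b.ols <- solve(crossprod(X), crossprod(X, y))

# Ridge: solve (X'X + lambda * I) b = X'y
lambda  <- 5
b.ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

# Side-by-side coefficients and their squared lengths
print(cbind(OLS = drop(b.ols), Ridge = drop(b.ridge)))
print(c(OLS = sum(b.ols^2), Ridge = sum(b.ridge^2)))
```

For any positive lambda the ridge coefficient vector is shorter than the OLS one, which is the shrinkage the penalty term is designed to produce.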
This report applies the two most widely used regularization methods —
Ridge regression and the Least Absolute
Shrinkage and Selection Operator (LASSO) — to the
College dataset from the ISLR package, which records
seventeen predictor variables for 777 American colleges. The objective
is to predict each institution’s graduation rate. Both models are tuned
by ten-fold cross-validation via cv.glmnet, and their
out-of-sample performance is benchmarked against a backward stepwise
selection model.
Three questions guide the analysis:
| Metric | Value |
|---|---|
| Total observations | 777 |
| Total variables | 18 |
| Predictors | 17 |
| Response variable | Grad.Rate |
| Missing values | 0 |
library(ISLR)    # College dataset
library(caret)   # createDataPartition()

set.seed(3456)   # fix the RNG so the split is reproducible
trainIndex <- createDataPartition(College$Grad.Rate,
                                  p = 0.70,
                                  list = FALSE,
                                  times = 1)
train <- College[ trainIndex, ]
test  <- College[-trainIndex, ]
x.train <- model.matrix(Grad.Rate ~ ., data = train)[, -1]   # drop the intercept column
y.train <- train$Grad.Rate
x.test  <- model.matrix(Grad.Rate ~ ., data = test)[, -1]
y.test  <- test$Grad.Rate
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

Split summary: 546 training observations · 231 test observations · seed fixed at 3456 for reproducibility.
The substantial gap between lambda.min and lambda.1se signals a flat CV curve: heavier regularization buys model simplicity at little predictive cost.
Interpretation. The flat plateau on the left indicates the model fits well across a wide range of small lambdas. Only once log(λ) rises past roughly 4 does error climb sharply — this is where the penalty becomes strong enough to overshrink informative coefficients.
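As a rough sketch of the tuning step behind a curve like this, the block below runs ten-fold cross-validation for a ridge fit with `cv.glmnet`. The data are synthetic stand-ins (the names `x`, `y`, `n`, and `p` are assumptions, not objects from this report), so the selected lambdas will differ from those reported here.

```r
# Illustrative sketch: synthetic stand-in data, not the College data.
library(glmnet)

set.seed(3456)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(3, -2, 1)) + rnorm(n)

# alpha = 0 selects the ridge (L2) penalty; nfolds = 10 gives ten-fold CV
cv.ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)

print(cv.ridge$lambda.min)   # lambda with the lowest mean cross-validated error
print(cv.ridge$lambda.1se)   # largest lambda within one standard error of that minimum
```

Calling `plot(cv.ridge)` on the fitted object draws the cross-validated error against log(λ), which is the kind of curve interpreted above.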
Both LASSO values (lambda.min and lambda.1se) are much smaller than the Ridge equivalents because the L1 and L2 penalties operate on different scales.
As λ grows, predictors drop out one by one.
lasso.mod <- glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)   # refit at the CV-selected lambda
lasso.coef <- coef(lasso.mod)
zeroed <- rownames(lasso.coef)[which(lasso.coef[, 1] == 0)]
kept <- setdiff(rownames(lasso.coef)[which(lasso.coef[, 1] != 0)], "(Intercept)")   # count predictors only

LASSO shrank 4 of the 17 predictors to exactly zero: Accept, Enroll, Top10perc, and Terminal. Two of the three highly collinear admissions variables (Accept and Enroll) were dropped while Apps was retained, which is the expected LASSO behavior for correlated predictors.
Ridge (left) shrinks coefficients smoothly and
asymptotically toward zero — none ever reach it. LASSO
(right) drives coefficients to exactly zero at finite λ values,
producing the characteristic kinks where each predictor exits the model
entirely. The dashed vertical line marks log(λ.min) for
each model.
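The difference between the two panels follows from how each penalty acts on a single coefficient. As a sketch, assume the idealized case of orthonormal predictors, where ridge rescales the OLS estimate and LASSO soft-thresholds it. The helper names `ridge.shrink` and `lasso.shrink` and the example coefficients are hypothetical, chosen only for illustration.

```r
# Idealised sketch: orthonormal predictors, penalties scaled as
#   ridge: 0.5 * RSS + 0.5 * lambda * sum(b^2)
#   LASSO: 0.5 * RSS + lambda * sum(abs(b))
# Helper names are hypothetical, not part of glmnet.
ridge.shrink <- function(b, lambda) b / (1 + lambda)
lasso.shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b.ols   <- c(4, -2, 0.5)      # made-up OLS estimates
lambdas <- c(0, 1, 3, 10)

# One column per lambda, one row per coefficient
ridge.path <- sapply(lambdas, function(l) ridge.shrink(b.ols, l))
lasso.path <- sapply(lambdas, function(l) lasso.shrink(b.ols, l))
colnames(ridge.path) <- colnames(lasso.path) <- paste0("lambda=", lambdas)

print(round(ridge.path, 3))   # shrinks proportionally, never exactly zero
print(lasso.path)             # a coefficient is exactly zero once lambda >= |b|
```

In this simplified setting each LASSO coefficient leaves the model at the lambda equal to its absolute OLS value, which is the source of the kinks in the LASSO paths.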
library(MASS)   # stepAIC()

ols.full <- lm(Grad.Rate ~ ., data = train)
step.mod <- stepAIC(ols.full, direction = "backward", trace = FALSE)   # backward elimination by AIC
step.train.pred <- predict(step.mod, newdata = train)
step.test.pred  <- predict(step.mod, newdata = test)
step.train.rmse <- rmse(y.train, step.train.pred)
step.test.rmse  <- rmse(y.test, step.test.pred)

Stepwise retained 11 variables, close to LASSO's 13 survivors. That two different selection methods arrive at similar subsets suggests that several of the seventeen predictors add little once the others are in the model.
| Model | λ.min | Variables Kept | Train RMSE | Test RMSE |
|---|---|---|---|---|
| Ridge | 0.9779 | 17 | 11.9466 | 14.3546 |
| LASSO | 0.1234 | 13 | 11.9281 | 14.4598 |
| Stepwise | NA | 11 | 11.8993 | 14.7233 |
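The Ridge and LASSO rows of a table like this can be assembled along the following lines. This is a minimal sketch on synthetic data, so its numbers will not match the table; the object names (`x.tr`, `x.te`, `results`, `res`) and the simple `sample`-based split are assumptions rather than the report's own code.

```r
# Illustrative sketch on synthetic data; names and split are assumptions.
library(glmnet)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(3456)
n <- 300; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(3, -2, 1)) + rnorm(n)
idx  <- sample(n, size = round(0.7 * n))      # 70/30 split
x.tr <- x[idx, ];  y.tr <- y[idx]
x.te <- x[-idx, ]; y.te <- y[-idx]

# Fit each penalty, then score it at the CV-selected lambda.min
results <- lapply(c(Ridge = 0, LASSO = 1), function(a) {
  cv   <- cv.glmnet(x.tr, y.tr, alpha = a, nfolds = 10)
  beta <- coef(cv, s = "lambda.min")[-1, 1]   # drop the intercept
  data.frame(
    lambda.min = cv$lambda.min,
    kept       = sum(beta != 0),
    train.rmse = rmse(y.tr, drop(predict(cv, newx = x.tr, s = "lambda.min"))),
    test.rmse  = rmse(y.te, drop(predict(cv, newx = x.te, s = "lambda.min")))
  )
})
res <- do.call(rbind, results)
print(res)
```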
The dashed diagonal represents perfect prediction; points above indicate over-prediction, points below indicate under-prediction. Mild compression toward the mean at both extremes is a typical regularization side-effect.
All three models performed comparably, with test-set RMSE clustered between 14.35 and 14.72 percentage points. Ridge was the marginal winner, achieving the lowest test RMSE (14.35) while retaining every predictor by construction. LASSO came within about 0.11 (14.46) while reducing the model from seventeen to thirteen predictors through automatic variable selection, and stepwise selection settled on a similar subset of eleven predictors with a slightly higher test RMSE (14.72).
Practical recommendation. LASSO should be the default regularization choice when the dataset is suspected to contain genuine signal in only a minority of its predictors. Ridge remains preferable when every predictor is expected to contribute at least a small effect, or when collinearity is severe enough that dropping one variable from a correlated group is undesirable.
| Rank | Driver | Variable | Direction |
|---|---|---|---|
| 1 | Academic Selectivity | Top25perc | |
| 2 | Institutional Resources | Outstate | |
| 3 | Alumni Engagement | perc.alumni | |
| 4 | Institutional Type | PrivateYes | |
| # | Limitation | Impact |
|---|---|---|
| 1 | Single 70/30 train/test split (results depend on one random partition) | The ~0.11 test RMSE gap between LASSO and Ridge may partly reflect sampling variation |
| 2 | Linearity assumption — Ridge, LASSO, and OLS all impose strict linearity | Compression at extremes (Fig 6) suggests some unexploited non-linearity |
| 3 | Dated dataset — College data reflects mid-1990s American higher education | Coefficients may not reflect current drivers of graduation outcomes |
| 4 | Missing confounders — no demographic, financial-aid, or institutional-mission variables | Model cannot account for factors known to influence graduation rates |
| 5 | Elastic net omitted — alpha restricted to 0 (Ridge) or 1 (LASSO) | A blended penalty may outperform both endpoints on this dataset |
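The first limitation can be made concrete by repeating the split many times and looking at the spread of test RMSE. The sketch below does this on synthetic data with a plain `lm` fit standing in for the report's models; the data, the 50 repetitions, and all object names are illustrative assumptions.

```r
# Illustrative sketch: synthetic data and OLS stand in for the report's models.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(3456)
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

# Repeat the 70/30 split 50 times and collect the test RMSE each time
test.rmse <- replicate(50, {
  idx <- sample(n, size = round(0.7 * n))
  fit <- lm(y ~ ., data = dat[idx, ])
  rmse(dat$y[-idx], predict(fit, newdata = dat[-idx, ]))
})

print(c(mean = mean(test.rmse), sd = sd(test.rmse)))
print(range(test.rmse))   # spread attributable to the partition alone
```

If the spread across partitions is of the same order as the gap between two models, a single split cannot reliably rank them.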
| # | Recommendation |
|---|---|
| 1 | Repeated cross-validation — use 50+ random splits to quantify RMSE uncertainty |
| 2 | Add elastic net — grid-search over alpha ∈ {0, 0.25, 0.5, 0.75, 1.0} |
| 3 | Benchmark nonlinear methods — random forests, gradient-boosted trees, GAMs |
| 4 | Refit on contemporary data — verify the same predictors carry the same signal today |
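The elastic net recommendation could be prototyped roughly as follows. This is a sketch on synthetic data, not a result for the College data: it searches the alpha grid listed above and reuses one fold assignment through the `foldid` argument of `cv.glmnet` so that every alpha is scored on identical folds. The names `foldid`, `alphas`, `cv.err`, and `best.alpha` are assumptions.

```r
# Illustrative sketch on synthetic data; the alpha grid matches the one suggested above.
library(glmnet)

set.seed(3456)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(3, -2, 1)) + rnorm(n)

# Reuse one fold assignment so every alpha is scored on identical folds
foldid <- sample(rep(1:10, length.out = n))
alphas <- c(0, 0.25, 0.5, 0.75, 1)

cv.err <- sapply(alphas, function(a) {
  cv <- cv.glmnet(x, y, alpha = a, foldid = foldid)
  min(cv$cvm)                 # mean CV error (MSE) at lambda.min for this alpha
})
names(cv.err) <- alphas

print(sqrt(cv.err))           # CV RMSE per alpha
best.alpha <- alphas[which.min(cv.err)]
print(best.alpha)
```

Fixing the folds removes one source of noise from the comparison, although small differences between alphas should still be read with the sampling variation in mind.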
The tabs below show the complete, self-contained R code for every section of this analysis. Each tab corresponds to one analysis step. Running all chunks top-to-bottom in a fresh R session reproduces every result and figure exactly.
library(ISLR)    # College dataset
library(caret)   # createDataPartition()
library(MASS)    # stepAIC()

set.seed(3456)   # fix the RNG so the split is reproducible
trainIndex <- createDataPartition(College$Grad.Rate,
                                  p = 0.70,
                                  list = FALSE,
                                  times = 1)
train <- College[ trainIndex, ]
test  <- College[-trainIndex, ]
x.train <- model.matrix(Grad.Rate ~ ., data = train)[, -1]   # drop the intercept column
y.train <- train$Grad.Rate
x.test  <- model.matrix(Grad.Rate ~ ., data = test)[, -1]
y.test  <- test$Grad.Rate
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

ols.full <- lm(Grad.Rate ~ ., data = train)
step.mod <- stepAIC(ols.full, direction = "backward", trace = FALSE)   # backward elimination by AIC
step.train.pred <- predict(step.mod, newdata = train)
step.test.pred  <- predict(step.mod, newdata = test)
step.train.rmse <- rmse(y.train, step.train.pred)
step.test.rmse  <- rmse(y.test, step.test.pred)

Report prepared for ALY6015: Intermediate Analytics, Northeastern University · Knitted with R Markdown on April 23, 2026