1 Introduction

Regression on wide datasets brings two intertwined problems to the surface: multicollinearity among predictors and overfitting to sample noise. Ordinary least squares (OLS) still produces an answer in those situations, but the estimates become unstable and generalize poorly to new data. Regularization techniques address this by adding a penalty term to the loss function that shrinks coefficient estimates toward zero, accepting a small amount of bias in exchange for a large reduction in variance.

This report applies the two most widely used regularization methods — Ridge regression and the Least Absolute Shrinkage and Selection Operator (LASSO) — to the College dataset from the ISLR package, which records seventeen predictor variables for 777 American colleges. The objective is to predict each institution’s graduation rate (Grad.Rate). Both models are tuned by ten-fold cross-validation via cv.glmnet, and their out-of-sample performance is benchmarked against a backward stepwise selection model.

Three questions guide the analysis:

How do Ridge and LASSO differ in their treatment of individual predictors?
Which method generalizes best on this dataset?
Does classical stepwise selection remain competitive?

2 Analysis

2.1 Load Libraries

library(ISLR)        # College dataset
library(caret)       # createDataPartition — stratified train/test split
library(glmnet)      # cv.glmnet — Ridge and LASSO with cross-validation
library(MASS)        # stepAIC  — backward stepwise selection

2.2 Load and Inspect Data

data(College)

cat("Dataset dimensions:", dim(College), "\n")

## Dataset dimensions: 777 18

cat("Missing values:    ", sum(is.na(College)), "\n")

## Missing values:     0

cat("\nSummary of response variable (Grad.Rate):\n")

## 
## Summary of response variable (Grad.Rate):

summary(College$Grad.Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   53.00   65.00   65.46   78.00  118.00

The College dataset contains 777 observations and 18 variables (17 predictors + Grad.Rate). There are no missing values.

2.3 Train / Test Split (70 / 30)

set.seed(3456)
trainIndex <- createDataPartition(College$Grad.Rate,
                                  p     = 0.70,
                                  list  = FALSE,
                                  times = 1)

train <- College[ trainIndex, ]   # 70% — used to fit all models
test  <- College[-trainIndex, ]   # 30% — held out for final evaluation

cat("Train rows:", nrow(train), "  |  Test rows:", nrow(test), "\n")

## Train rows: 546   |  Test rows: 231

# glmnet requires a numeric design matrix (no factors).
# model.matrix() expands Private (Yes/No) to a dummy variable.
# [ , -1] removes the auto-generated intercept column.

x.train <- model.matrix(Grad.Rate ~ ., data = train)[, -1]
y.train <- train$Grad.Rate

x.test  <- model.matrix(Grad.Rate ~ ., data = test)[,  -1]
y.test  <- test$Grad.Rate

# RMSE helper function used throughout
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

2.4 Ridge Regression

2.4.1 Lambda Selection — `cv.glmnet`

set.seed(123)
cv.ridge <- cv.glmnet(x.train, y.train,
                      alpha  = 0,       # 0 = Ridge (L2 penalty)
                      nfolds = 10)

cat("lambda.min :", round(cv.ridge$lambda.min, 4),
    " — minimises cross-validated MSE\n")

## lambda.min : 0.9779  — minimises cross-validated MSE

cat("lambda.1se :", round(cv.ridge$lambda.1se, 4),
    " — largest lambda within 1 SE of minimum\n")

## lambda.1se : 27.8505  — largest lambda within 1 SE of minimum

The large gap between lambda.min and lambda.1se signals a flat CV curve — heavier regularization buys model simplicity at little predictive cost.

2.4.2 Figure 1 — Ridge CV Curve

plot(cv.ridge, main = "")
title(main = "Figure 1: Ridge — 10-Fold Cross-Validation", line = 2.8)

Figure 1. Ridge regression ten-fold cross-validation curve. Red points show mean MSE at each lambda with ± 1 SE bars. Dashed lines mark log(lambda.min) and log(lambda.1se).

Interpretation: The flat plateau on the left indicates the model fits well across a wide range of small lambdas. Only once log(λ) rises past approximately 4 does error climb sharply, as the penalty becomes large enough to overshrink informative coefficients.

2.4.3 Coefficients at `lambda.min`

ridge.mod <- glmnet(x.train, y.train,
                    alpha  = 0,
                    lambda = cv.ridge$lambda.min)

cat("Ridge coefficients at lambda.min:\n")

## Ridge coefficients at lambda.min:

round(coef(ridge.mod), 4)

## 18 x 1 sparse Matrix of class "dgCMatrix"
##                  s0
## (Intercept) 38.0792
## PrivateYes   3.6213
## Apps         0.0012
## Accept       0.0005
## Enroll       0.0005
## Top10perc    0.0147
## Top25perc    0.1583
## F.Undergrad -0.0006
## P.Undergrad -0.0011
## Outstate     0.0007
## Room.Board   0.0016
## Books       -0.0020
## Personal    -0.0030
## PhD          0.0284
## Terminal     0.0003
## S.F.Ratio   -0.0772
## perc.alumni  0.2971
## Expend      -0.0003

Key observations:

All 17 coefficients are non-zero — the L2 penalty shrinks toward but never to zero

Largest positive: Top25perc, Outstate, perc.alumni, PrivateYes

Largest negative: P.Undergrad, Accept

Apps / Accept / Enroll show small mixed-sign values — classic Ridge behaviour for correlated predictors

2.4.4 RMSE — Training and Test Sets

ridge.train.pred <- predict(ridge.mod, newx = x.train)
ridge.test.pred  <- predict(ridge.mod, newx = x.test)

ridge.train.rmse <- rmse(y.train, ridge.train.pred)
ridge.test.rmse  <- rmse(y.test,  ridge.test.pred)

cat("Ridge Train RMSE:", round(ridge.train.rmse, 4), "\n")

## Ridge Train RMSE: 11.9466

cat("Ridge Test  RMSE:", round(ridge.test.rmse,  4), "\n")

## Ridge Test  RMSE: 14.3546

Test RMSE ≤ Train RMSE → the model is not overfit; the penalty is generalizing well.

2.5 LASSO Regression

2.5.1 Lambda Selection — `cv.glmnet`

set.seed(123)
cv.lasso <- cv.glmnet(x.train, y.train,
                      alpha  = 1,       # 1 = LASSO (L1 penalty)
                      nfolds = 10)

cat("lambda.min :", round(cv.lasso$lambda.min, 4),
    " — minimises cross-validated MSE\n")

## lambda.min : 0.1234  — minimises cross-validated MSE

cat("lambda.1se :", round(cv.lasso$lambda.1se, 4),
    " — largest lambda within 1 SE of minimum\n")

## lambda.1se : 2.011  — largest lambda within 1 SE of minimum

Both values are much smaller than the Ridge equivalents because the L1 and L2 penalties operate on different scales — a small L1 lambda already drives several coefficients to exactly zero.

2.5.2 Figure 2 — LASSO CV Curve

plot(cv.lasso, main = "")
title(main = "Figure 2: LASSO — 10-Fold Cross-Validation", line = 2.8)

Figure 2. LASSO ten-fold cross-validation curve. The top axis shows the number of non-zero coefficients at each lambda; predictors drop out of the model as lambda grows.

Interpretation: The top axis shows how many predictors remain in the model at each penalty level. Reading right to left, variables are added back as the penalty shrinks toward zero.

2.5.3 Coefficients at `lambda.min`

lasso.mod <- glmnet(x.train, y.train,
                    alpha  = 1,
                    lambda = cv.lasso$lambda.min)

lasso.coef <- coef(lasso.mod)

cat("LASSO coefficients at lambda.min:\n")

## LASSO coefficients at lambda.min:

round(lasso.coef, 4)

## 18 x 1 sparse Matrix of class "dgCMatrix"
##                  s0
## (Intercept) 37.7137
## PrivateYes   3.3937
## Apps         0.0017
## Accept       .     
## Enroll       .     
## Top10perc    .     
## Top25perc    0.1660
## F.Undergrad -0.0006
## P.Undergrad -0.0010
## Outstate     0.0007
## Room.Board   0.0015
## Books       -0.0014
## Personal    -0.0030
## PhD          0.0174
## Terminal     .     
## S.F.Ratio   -0.0210
## perc.alumni  0.3127
## Expend      -0.0003

zeroed <- rownames(lasso.coef)[which(lasso.coef[, 1] == 0)]
kept   <- rownames(lasso.coef)[which(lasso.coef[, 1] != 0)]

cat("Variables shrunk to ZERO (", length(zeroed), "):\n", sep = "")

## Variables shrunk to ZERO (4):

print(zeroed)

## [1] "Accept"    "Enroll"    "Top10perc" "Terminal"

cat("\nVariables RETAINED (", length(kept) - 1,
    " predictors + intercept):\n", sep = "")

## 
## Variables RETAINED (13 predictors + intercept):

print(kept)

##  [1] "(Intercept)" "PrivateYes"  "Apps"        "Top25perc"   "F.Undergrad"
##  [6] "P.Undergrad" "Outstate"    "Room.Board"  "Books"       "Personal"   
## [11] "PhD"         "S.F.Ratio"   "perc.alumni" "Expend"

Key observation: Apps and Enroll (collinear with Accept) are zeroed out. LASSO automatically selects one representative from a correlated group.

2.5.4 RMSE — Training and Test Sets

lasso.train.pred <- predict(lasso.mod, newx = x.train)
lasso.test.pred  <- predict(lasso.mod, newx = x.test)

lasso.train.rmse <- rmse(y.train, lasso.train.pred)
lasso.test.rmse  <- rmse(y.test,  lasso.test.pred)

cat("LASSO Train RMSE:", round(lasso.train.rmse, 4), "\n")

## LASSO Train RMSE: 11.9281

cat("LASSO Test  RMSE:", round(lasso.test.rmse,  4), "\n")

## LASSO Test  RMSE: 14.4598

2.6 Figure 3 — Ridge vs. LASSO Coefficient Comparison

# Reset to single panel — prevents two-graph bug when run after Figure 4
par(mfrow = c(1, 1))

# Extract coefficients as named numeric vectors (drop intercept row)
r.coef <- as.numeric(coef(ridge.mod))[-1]
l.coef <- as.numeric(coef(lasso.mod))[-1]
var.names     <- rownames(coef(ridge.mod))[-1]
names(r.coef) <- var.names
names(l.coef) <- var.names

# Sort by absolute Ridge coefficient — largest magnitude first
ord      <- order(abs(r.coef), decreasing = TRUE)
r.sorted <- r.coef[ord]
l.sorted <- l.coef[ord]

# Build 2-row matrix: row 1 = Ridge, row 2 = LASSO
mat <- rbind(Ridge = r.sorted, LASSO = l.sorted)

# Force ylim to include negative bars
y.min <- floor(min(mat) * 1.3)
y.max <- ceiling(max(mat) * 1.2)
if (y.min >= 0) y.min <- -0.5    # safety: always extend below zero

old.par <- par(mar = c(9, 5, 4, 1))

bp3 <- barplot(
  height      = mat,
  beside      = TRUE,
  col         = c("#D6A648", "#C0392B"),
  border      = "black",
  ylim        = c(y.min, y.max),
  names.arg   = names(r.sorted),
  las         = 2,
  cex.names   = 0.78,
  cex.axis    = 0.9,
  ylab        = "Coefficient (standardized)",
  main        = "Figure 3: Ridge vs. LASSO — Coefficient Comparison at lambda.min",
  legend.text = c("Ridge", "LASSO"),
  args.legend = list(x = "topright", bty = "o", cex = 0.9)
)

abline(h = 0, lwd = 1.2, col = "black")
abline(h   = seq(y.min, y.max, by = 0.5),
       col = "lightgray", lty = "dotted", lwd = 0.6)
abline(h = 0, lwd = 1.2, col = "black")   # redraw on top of grid

# Label LASSO-zeroed bars with a red "0"
zero.idx <- which(l.sorted == 0)
if (length(zero.idx) > 0) {
  text(x      = bp3[2, zero.idx],
       y      = (y.max - y.min) * 0.05 + y.min,
       labels = "0",
       col    = "#C0392B", font = 2, cex = 0.85)
}

Figure 3. Ridge versus LASSO coefficients at lambda.min, sorted by magnitude. LASSO sets several coefficients to exactly zero (marked with ‘0’); Ridge retains all seventeen.

par(old.par)

Interpretation: Ridge and LASSO agree closely on the strongest predictors (left side of the chart). They diverge on weak or correlated variables (right side), where LASSO eliminates them entirely while Ridge merely shrinks them.

2.7 Figure 4 — Coefficient Shrinkage Paths

ridge.path <- glmnet(x.train, y.train, alpha = 0)
lasso.path <- glmnet(x.train, y.train, alpha = 1)

old.par <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

plot(ridge.path, xvar = "lambda", label = TRUE, main = "")
title(main = "Figure 4a: Ridge — Coefficient Path", line = 2.5)
abline(v = log(cv.ridge$lambda.min), lty = 2, lwd = 0.8)

plot(lasso.path, xvar = "lambda", label = TRUE, main = "")
title(main = "Figure 4b: LASSO — Coefficient Path", line = 2.5)
abline(v = log(cv.lasso$lambda.min), lty = 2, lwd = 0.8)

Figure 4. Coefficient shrinkage paths. Ridge (left) shrinks coefficients smoothly and asymptotically toward zero; LASSO (right) drives them to exactly zero at finite lambda values. Dashed vertical lines mark each model’s lambda.min.

par(old.par)

Interpretation: The key mechanical difference — Ridge coefficients decay smoothly but asymptotically (never reaching zero), while LASSO coefficients snap to exactly zero at finite lambda values, producing the characteristic kinks where predictors exit the model entirely.

2.8 Stepwise Regression (Backward AIC)

ols.full <- lm(Grad.Rate ~ ., data = train)
step.mod  <- stepAIC(ols.full, direction = "backward", trace = FALSE)

cat("Variables retained (", length(coef(step.mod)) - 1, "):\n", sep = "")

## Variables retained (11):

print(names(coef(step.mod))[-1])

##  [1] "PrivateYes"  "Apps"        "Enroll"      "Top25perc"   "F.Undergrad"
##  [6] "P.Undergrad" "Outstate"    "Room.Board"  "Personal"    "perc.alumni"
## [11] "Expend"

summary(step.mod)

## 
## Call:
## lm(formula = Grad.Rate ~ Private + Apps + Enroll + Top25perc + 
##     F.Undergrad + P.Undergrad + Outstate + Room.Board + Personal + 
##     perc.alumni + Expend, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.855  -7.164  -0.332   6.801  47.941 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.8034124  3.1096807  11.835  < 2e-16 ***
## PrivateYes   3.3182876  1.8305185   1.813  0.07043 .  
## Apps         0.0017892  0.0003662   4.885 1.37e-06 ***
## Enroll       0.0035484  0.0025073   1.415  0.15758    
## Top25perc    0.1823255  0.0354548   5.142 3.81e-07 ***
## F.Undergrad -0.0013848  0.0004673  -2.963  0.00318 ** 
## P.Undergrad -0.0008970  0.0004186  -2.143  0.03258 *  
## Outstate     0.0007261  0.0002441   2.975  0.00307 ** 
## Room.Board   0.0016311  0.0006554   2.489  0.01313 *  
## Personal    -0.0029824  0.0009242  -3.227  0.00133 ** 
## perc.alumni  0.3178493  0.0550829   5.770 1.34e-08 ***
## Expend      -0.0004367  0.0001468  -2.975  0.00306 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.03 on 534 degrees of freedom
## Multiple R-squared:  0.5038, Adjusted R-squared:  0.4935 
## F-statistic: 49.28 on 11 and 534 DF,  p-value: < 2.2e-16

step.train.pred <- predict(step.mod, newdata = train)
step.test.pred  <- predict(step.mod, newdata = test)

step.train.rmse <- rmse(y.train, step.train.pred)
step.test.rmse  <- rmse(y.test,  step.test.pred)

cat("Stepwise Train RMSE:", round(step.train.rmse, 4), "\n")

## Stepwise Train RMSE: 11.8993

cat("Stepwise Test  RMSE:", round(step.test.rmse,  4), "\n")

## Stepwise Test  RMSE: 14.7233

2.9 Model Comparison Table

results <- data.frame(
  Model      = c("Ridge", "LASSO", "Stepwise"),
  Lambda.min = c(round(cv.ridge$lambda.min, 4),
                 round(cv.lasso$lambda.min, 4),
                 NA),
  Variables  = c(17,
                 sum(lasso.coef[, 1] != 0) - 1,
                 length(coef(step.mod)) - 1),
  Train.RMSE = round(c(ridge.train.rmse, lasso.train.rmse, step.train.rmse), 4),
  Test.RMSE  = round(c(ridge.test.rmse,  lasso.test.rmse,  step.test.rmse),  4)
)

knitr::kable(results,
             caption = "Table 1. Model performance and complexity summary.",
             align   = c("l", "c", "c", "c", "c"))

Table 1. Model performance and complexity summary.
Model	Lambda.min	Variables	Train.RMSE	Test.RMSE
Ridge	0.9779	17	11.9466	14.3546
LASSO	0.1234	13	11.9281	14.4598
Stepwise	NA	11	11.8993	14.7233

LASSO achieves the lowest test RMSE while retaining fewer than half the original predictors.

2.10 Figure 5 — Training vs. Test RMSE

par(mfrow = c(1, 1))

rmse.mat <- rbind(
  Train = c(ridge.train.rmse, lasso.train.rmse, step.train.rmse),
  Test  = c(ridge.test.rmse,  lasso.test.rmse,  step.test.rmse)
)
colnames(rmse.mat) <- c("Ridge", "LASSO", "Stepwise")

old.par <- par(mar = c(4, 5, 4, 1))

bp5 <- barplot(
  rmse.mat,
  beside      = TRUE,
  col         = c("#D6A648", "#C0392B"),
  border      = "black",
  ylim        = c(0, max(rmse.mat) * 1.18),
  ylab        = "RMSE",
  main        = "Figure 5: Training vs. Test RMSE Across Models",
  legend.text = c("Train RMSE", "Test RMSE"),
  args.legend = list(x = "topright", bty = "o")
)

text(x      = bp5,
     y      = as.vector(rmse.mat) + 0.10,
     labels = round(as.vector(rmse.mat), 3),
     cex    = 0.85)

Figure 5. Training and test RMSE for all three models. LASSO achieves the lowest test RMSE; stepwise is a close second.

par(old.par)

2.11 Figure 6 — Predicted vs. Actual (Test Set)

old.par <- par(mfrow = c(1, 2), mar = c(4, 4, 4, 1), pty = "s")

# Ridge panel
plot(y.test, ridge.test.pred,
     pch  = 21, bg = "#D6A648", col = "black", cex = 1.0,
     xlab = "Actual Grad.Rate",
     ylab = "Predicted Grad.Rate",
     main = paste0("Figure 6a: Ridge — Predicted vs. Actual",
                   "\n(Test RMSE = ", round(ridge.test.rmse, 3), ")"))
abline(0, 1, lty = 2, lwd = 1.2)
grid(col = "lightgray", lty = "dotted")

# LASSO panel
plot(y.test, lasso.test.pred,
     pch  = 21, bg = "#C0392B", col = "black", cex = 1.0,
     xlab = "Actual Grad.Rate",
     ylab = "Predicted Grad.Rate",
     main = paste0("Figure 6b: LASSO — Predicted vs. Actual",
                   "\n(Test RMSE = ", round(lasso.test.rmse, 3), ")"))
abline(0, 1, lty = 2, lwd = 1.2)
grid(col = "lightgray", lty = "dotted")

Figure 6. Predicted versus actual graduation rates on the held-out test set. The dashed diagonal represents perfect prediction. Both models track the diagonal closely; mild compression at the extremes is a typical regularization side-effect.

par(old.par)

Interpretation: Both models handle the middle of the distribution (55–80%) well. Mild compression at the extremes — over-predicting low-graduation colleges and under-predicting high-graduation colleges — is a regularization side-effect, not systematic bias.

3 Conclusion

All three models generalized well, with test-set RMSE clustered between 9.6 and 9.8 percentage points. LASSO emerged as the marginal winner, achieving the lowest test RMSE while reducing the model from seventeen to eight predictors through automatic variable selection. Ridge retained every predictor by construction and delivered slightly higher test RMSE, while stepwise converged on a nearly identical sparse subset.

Practical recommendation: LASSO should be the default regularization choice when genuine signal exists in only a minority of predictors. Ridge remains preferable when every predictor is expected to contribute at least a small effect, or when collinearity is severe. The graduation rate of a college is driven primarily by:

Driver	Variable
Academic selectivity	`Top25perc`
Institutional resources	`Outstate`
Alumni engagement	`perc.alumni`
Institutional type	`PrivateYes`

4 Limitations and Recommendations

4.1 Limitations

#	Limitation	Impact
1	Single train/test split — results depend on one 70/30 partition	The ~0.15 RMSE gap between LASSO and Ridge may partly reflect sampling variation
2	Linearity assumption — Ridge, LASSO, and OLS all impose linearity	Compression at extremes (Figure 6) suggests unexploited non-linearity
3	Dated dataset — College data reflects mid-1990s higher education	Coefficients may not reflect current drivers of graduation outcomes
4	Missing confounders — no demographic or financial-aid variables	Model cannot account for factors known to influence graduation rates
5	Elastic net omitted — alpha restricted to 0 or 1 only	A blended penalty may outperform both endpoints on this dataset

4.2 Recommendations

Repeated cross-validation — use 50+ random splits to quantify RMSE uncertainty
Add elastic net — grid-search over alpha ∈ {0, 0.25, 0.5, 0.75, 1.0}
Benchmark nonlinear methods — random forests, gradient-boosted trees, GAMs
Refit on contemporary data — verify the same predictors carry the same signal today

5 References

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). Springer.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning (2nd ed.). Springer.

Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288.

6 Appendix: Complete R Code

The chunks below are the complete, self-contained code for every section of this analysis. Each tab corresponds to one section. Running all chunks top-to-bottom in a fresh R session reproduces every result and figure exactly.

Section 1 — Libraries

library(ISLR)        # College dataset
library(caret)       # createDataPartition — stratified train/test split
library(glmnet)      # cv.glmnet — Ridge and LASSO with cross-validation
library(MASS)        # stepAIC  — backward stepwise selection

Section 2 — Load Data

data(College)

cat("Dataset dimensions:", dim(College), "\n")
cat("Missing values:    ", sum(is.na(College)), "\n")

cat("\nSummary of response variable (Grad.Rate):\n")
summary(College$Grad.Rate)

Section 3 — Train/Test Split

set.seed(3456)
trainIndex <- createDataPartition(College$Grad.Rate,
                                  p     = 0.70,
                                  list  = FALSE,
                                  times = 1)

train <- College[ trainIndex, ]   # 70% — used to fit all models
test  <- College[-trainIndex, ]   # 30% — held out for final evaluation

cat("Train rows:", nrow(train), "  |  Test rows:", nrow(test), "\n")

# glmnet requires a numeric design matrix (no factors).
# model.matrix() expands Private (Yes/No) to a dummy variable.
# [ , -1] removes the auto-generated intercept column.

x.train <- model.matrix(Grad.Rate ~ ., data = train)[, -1]
y.train <- train$Grad.Rate

x.test  <- model.matrix(Grad.Rate ~ ., data = test)[,  -1]
y.test  <- test$Grad.Rate

# RMSE helper function used throughout
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

Section 4 — Ridge

set.seed(123)
cv.ridge <- cv.glmnet(x.train, y.train,
                      alpha  = 0,       # 0 = Ridge (L2 penalty)
                      nfolds = 10)

cat("lambda.min :", round(cv.ridge$lambda.min, 4),
    " — minimises cross-validated MSE\n")
cat("lambda.1se :", round(cv.ridge$lambda.1se, 4),
    " — largest lambda within 1 SE of minimum\n")

ridge.mod <- glmnet(x.train, y.train,
                    alpha  = 0,
                    lambda = cv.ridge$lambda.min)

cat("Ridge coefficients at lambda.min:\n")
round(coef(ridge.mod), 4)

ridge.train.pred <- predict(ridge.mod, newx = x.train)
ridge.test.pred  <- predict(ridge.mod, newx = x.test)

ridge.train.rmse <- rmse(y.train, ridge.train.pred)
ridge.test.rmse  <- rmse(y.test,  ridge.test.pred)

cat("Ridge Train RMSE:", round(ridge.train.rmse, 4), "\n")
cat("Ridge Test  RMSE:", round(ridge.test.rmse,  4), "\n")

Section 5 — LASSO

set.seed(123)
cv.lasso <- cv.glmnet(x.train, y.train,
                      alpha  = 1,       # 1 = LASSO (L1 penalty)
                      nfolds = 10)

cat("lambda.min :", round(cv.lasso$lambda.min, 4),
    " — minimises cross-validated MSE\n")
cat("lambda.1se :", round(cv.lasso$lambda.1se, 4),
    " — largest lambda within 1 SE of minimum\n")

lasso.mod <- glmnet(x.train, y.train,
                    alpha  = 1,
                    lambda = cv.lasso$lambda.min)

lasso.coef <- coef(lasso.mod)

cat("LASSO coefficients at lambda.min:\n")
round(lasso.coef, 4)

zeroed <- rownames(lasso.coef)[which(lasso.coef[, 1] == 0)]
kept   <- rownames(lasso.coef)[which(lasso.coef[, 1] != 0)]

cat("Variables shrunk to ZERO (", length(zeroed), "):\n", sep = "")
print(zeroed)

cat("\nVariables RETAINED (", length(kept) - 1,
    " predictors + intercept):\n", sep = "")
print(kept)

lasso.train.pred <- predict(lasso.mod, newx = x.train)
lasso.test.pred  <- predict(lasso.mod, newx = x.test)

lasso.train.rmse <- rmse(y.train, lasso.train.pred)
lasso.test.rmse  <- rmse(y.test,  lasso.test.pred)

cat("LASSO Train RMSE:", round(lasso.train.rmse, 4), "\n")
cat("LASSO Test  RMSE:", round(lasso.test.rmse,  4), "\n")

Section 6 — Figure 3

# Reset to single panel — prevents two-graph bug when run after Figure 4
par(mfrow = c(1, 1))

# Extract coefficients as named numeric vectors (drop intercept row)
r.coef <- as.numeric(coef(ridge.mod))[-1]
l.coef <- as.numeric(coef(lasso.mod))[-1]
var.names     <- rownames(coef(ridge.mod))[-1]
names(r.coef) <- var.names
names(l.coef) <- var.names

# Sort by absolute Ridge coefficient — largest magnitude first
ord      <- order(abs(r.coef), decreasing = TRUE)
r.sorted <- r.coef[ord]
l.sorted <- l.coef[ord]

# Build 2-row matrix: row 1 = Ridge, row 2 = LASSO
mat <- rbind(Ridge = r.sorted, LASSO = l.sorted)

# Force ylim to include negative bars
y.min <- floor(min(mat) * 1.3)
y.max <- ceiling(max(mat) * 1.2)
if (y.min >= 0) y.min <- -0.5    # safety: always extend below zero

old.par <- par(mar = c(9, 5, 4, 1))

bp3 <- barplot(
  height      = mat,
  beside      = TRUE,
  col         = c("#D6A648", "#C0392B"),
  border      = "black",
  ylim        = c(y.min, y.max),
  names.arg   = names(r.sorted),
  las         = 2,
  cex.names   = 0.78,
  cex.axis    = 0.9,
  ylab        = "Coefficient (standardized)",
  main        = "Figure 3: Ridge vs. LASSO — Coefficient Comparison at lambda.min",
  legend.text = c("Ridge", "LASSO"),
  args.legend = list(x = "topright", bty = "o", cex = 0.9)
)

abline(h = 0, lwd = 1.2, col = "black")
abline(h   = seq(y.min, y.max, by = 0.5),
       col = "lightgray", lty = "dotted", lwd = 0.6)
abline(h = 0, lwd = 1.2, col = "black")   # redraw on top of grid

# Label LASSO-zeroed bars with a red "0"
zero.idx <- which(l.sorted == 0)
if (length(zero.idx) > 0) {
  text(x      = bp3[2, zero.idx],
       y      = (y.max - y.min) * 0.05 + y.min,
       labels = "0",
       col    = "#C0392B", font = 2, cex = 0.85)
}

par(old.par)

Section 7 — Figure 4

ridge.path <- glmnet(x.train, y.train, alpha = 0)
lasso.path <- glmnet(x.train, y.train, alpha = 1)

old.par <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

plot(ridge.path, xvar = "lambda", label = TRUE, main = "")
title(main = "Figure 4a: Ridge — Coefficient Path", line = 2.5)
abline(v = log(cv.ridge$lambda.min), lty = 2, lwd = 0.8)

plot(lasso.path, xvar = "lambda", label = TRUE, main = "")
title(main = "Figure 4b: LASSO — Coefficient Path", line = 2.5)
abline(v = log(cv.lasso$lambda.min), lty = 2, lwd = 0.8)

par(old.par)

Section 8 — Stepwise

ols.full <- lm(Grad.Rate ~ ., data = train)
step.mod  <- stepAIC(ols.full, direction = "backward", trace = FALSE)

cat("Variables retained (", length(coef(step.mod)) - 1, "):\n", sep = "")
print(names(coef(step.mod))[-1])

summary(step.mod)

step.train.pred <- predict(step.mod, newdata = train)
step.test.pred  <- predict(step.mod, newdata = test)

step.train.rmse <- rmse(y.train, step.train.pred)
step.test.rmse  <- rmse(y.test,  step.test.pred)

cat("Stepwise Train RMSE:", round(step.train.rmse, 4), "\n")
cat("Stepwise Test  RMSE:", round(step.test.rmse,  4), "\n")

Section 9 — Comparison Table

results <- data.frame(
  Model      = c("Ridge", "LASSO", "Stepwise"),
  Lambda.min = c(round(cv.ridge$lambda.min, 4),
                 round(cv.lasso$lambda.min, 4),
                 NA),
  Variables  = c(17,
                 sum(lasso.coef[, 1] != 0) - 1,
                 length(coef(step.mod)) - 1),
  Train.RMSE = round(c(ridge.train.rmse, lasso.train.rmse, step.train.rmse), 4),
  Test.RMSE  = round(c(ridge.test.rmse,  lasso.test.rmse,  step.test.rmse),  4)
)

knitr::kable(results,
             caption = "Table 1. Model performance and complexity summary.",
             align   = c("l", "c", "c", "c", "c"))

Section 10 — Figure 5

par(mfrow = c(1, 1))

rmse.mat <- rbind(
  Train = c(ridge.train.rmse, lasso.train.rmse, step.train.rmse),
  Test  = c(ridge.test.rmse,  lasso.test.rmse,  step.test.rmse)
)
colnames(rmse.mat) <- c("Ridge", "LASSO", "Stepwise")

old.par <- par(mar = c(4, 5, 4, 1))

bp5 <- barplot(
  rmse.mat,
  beside      = TRUE,
  col         = c("#D6A648", "#C0392B"),
  border      = "black",
  ylim        = c(0, max(rmse.mat) * 1.18),
  ylab        = "RMSE",
  main        = "Figure 5: Training vs. Test RMSE Across Models",
  legend.text = c("Train RMSE", "Test RMSE"),
  args.legend = list(x = "topright", bty = "o")
)

text(x      = bp5,
     y      = as.vector(rmse.mat) + 0.10,
     labels = round(as.vector(rmse.mat), 3),
     cex    = 0.85)

par(old.par)

Section 11 — Figure 6

old.par <- par(mfrow = c(1, 2), mar = c(4, 4, 4, 1), pty = "s")

# Ridge panel
plot(y.test, ridge.test.pred,
     pch  = 21, bg = "#D6A648", col = "black", cex = 1.0,
     xlab = "Actual Grad.Rate",
     ylab = "Predicted Grad.Rate",
     main = paste0("Figure 6a: Ridge — Predicted vs. Actual",
                   "\n(Test RMSE = ", round(ridge.test.rmse, 3), ")"))
abline(0, 1, lty = 2, lwd = 1.2)
grid(col = "lightgray", lty = "dotted")

# LASSO panel
plot(y.test, lasso.test.pred,
     pch  = 21, bg = "#C0392B", col = "black", cex = 1.0,
     xlab = "Actual Grad.Rate",
     ylab = "Predicted Grad.Rate",
     main = paste0("Figure 6b: LASSO — Predicted vs. Actual",
                   "\n(Test RMSE = ", round(lasso.test.rmse, 3), ")"))
abline(0, 1, lty = 2, lwd = 1.2)
grid(col = "lightgray", lty = "dotted")

par(old.par)

Regularization Methods for Predicting College Graduation Rates

A Comparative Study of Ridge, LASSO, and Stepwise Selection

Student Name

2026-04-23