Thoughts on Ridge and Lasso Regression DoF
In life sciences, models are often built using a large number of gene expression markers without prior knowledge of which are relevant. This “kitchen sink” approach includes all features in the model and relies on regularisation techniques, like Lasso or Ridge regression, to automatically eliminate or downweight those that don’t appear to affect the outcome based on the data. It effectively filters out noise and highlights potentially meaningful biological signals.
Ridge regression reduces overfitting by accepting a bias–variance tradeoff: it minimises a penalised cost function in which an added term discourages large coefficients. The strength of this penalty is tuned by ten-fold cross-validation, choosing the value that gives the lowest cross-validated error and hence the best generalisation. This combination of penalty and tuning is regularisation.
Regularisation constrains the model (much as fixing a parameter would, but by a continuous amount) and so consumes degrees of freedom in an interesting way. A small sketch of the cross-validated tuning step follows.
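As a concrete illustration of that tuning step, here is a minimal sketch (assuming numpy and scikit-learn, with synthetic data standing in for real expression markers) of choosing the ridge penalty strength by ten-fold cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for a "kitchen sink" design: many features, few informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Candidate penalty strengths (lambda; scikit-learn calls it alpha),
# scanned on a log scale and scored by ten-fold cross-validation.
alphas = np.logspace(-3, 3, 25)
model = RidgeCV(alphas=alphas, cv=10).fit(X, y)

print("selected penalty strength:", model.alpha_)
print("largest |coefficient|:", np.abs(model.coef_).max())
```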
Basic Overview of Technique
I’m not especially interested in the application itself, but the technique illustrates a few interesting things about biased estimators and fractional degrees of freedom in parameter estimation.
The parameterised cost function
For many models, you can write the objective function as: \[ \min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}; X, \mathbf{y}) + \lambda \, \Omega(\boldsymbol{\beta}) \]
where
- \(\mathcal{L}(\boldsymbol{\beta}; X, \mathbf{y})\) is the cost function (the aggregated per-observation loss) measuring model fit; it depends on the model and problem type, e.g. squared error for linear regression or log-loss for logistic regression.
- \(\Omega(\boldsymbol{\beta})\) is the penalty function (e.g., Ridge uses the \(\ell_2\) norm, Lasso uses the \(\ell_1\) norm).
- \(\boldsymbol{\beta}\) are the model parameters
- \(\lambda\) is the regularisation parameter, selected by cross-validation to minimise estimated out-of-sample error (a small numerical sketch of the full objective follows this list).
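To make the notation concrete, this small numpy sketch (illustrative only; the data and coefficients are made up) evaluates \(\mathcal{L}(\boldsymbol{\beta}) + \lambda\,\Omega(\boldsymbol{\beta})\) for a squared-error cost with either penalty plugged in:

```python
import numpy as np

def squared_error(beta, X, y):
    """Cost function L(beta; X, y): sum of squared residuals."""
    resid = y - X @ beta
    return float(resid @ resid)

def l2_penalty(beta):
    """Ridge penalty Omega(beta): sum of squared coefficients."""
    return float(beta @ beta)

def l1_penalty(beta):
    """Lasso penalty Omega(beta): sum of absolute coefficients."""
    return float(np.abs(beta).sum())

def penalised_objective(beta, X, y, lam, loss=squared_error, penalty=l2_penalty):
    """General form: L(beta; X, y) + lambda * Omega(beta)."""
    return loss(beta, X, y) + lam * penalty(beta)

# Tiny made-up example: the same coefficients scored under each penalty.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=10)
beta = np.array([0.9, 0.1, -1.8])

print(penalised_objective(beta, X, y, lam=1.0, penalty=l2_penalty))
print(penalised_objective(beta, X, y, lam=1.0, penalty=l1_penalty))
```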
Example: Linear Regression - Ordinary Least Squares (OLS)
Unregularised Linear Regression: \[\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2\]
Ridge Regularised Linear Regression: \[ \min_{\boldsymbol{\beta}} \; \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p \beta_j^2 \]
Lasso Regularised Linear Regression: \[\min_{\boldsymbol{\beta}} \; \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p |\beta_j|\]
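The practical difference between the two penalties shows up directly in the fitted coefficients. A short sketch using scikit-learn's Ridge and Lasso on synthetic data (note scikit-learn's Lasso scales the squared-error term by \(1/(2n)\), so its alpha matches \(\lambda\) only up to that factor):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

# Ridge shrinks every coefficient but rarely sets any exactly to zero;
# Lasso drives many coefficients exactly to zero (sparsity).
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```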
Example: Logistic Regression - Negative Log-Likelihood
Unregularised Logistic Regression: \[ \mathcal{L}(\boldsymbol{\beta}) = - \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \qquad\text{where the logistic function is:}\quad p_i = \frac{1}{1 + e^{-\mathbf{x}_i^T \boldsymbol{\beta}}} \]
Ridge Regularised Logistic Regression: \[ \min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}) + \lambda \sum_{j=1}^p \beta_j^2 \]
Lasso Regularised Logistic Regression: \[ \min_{\boldsymbol{\beta}} \; \mathcal{L}(\boldsymbol{\beta}) + \lambda \sum_{j=1}^p |\beta_j| \]
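The same pattern holds in the classification case. A sketch with scikit-learn's LogisticRegression on synthetic data (scikit-learn exposes the penalty through C, the inverse of the regularisation strength, so a larger \(\lambda\) corresponds to a smaller C):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=2)

# C is the *inverse* regularisation strength (roughly 1/lambda above).
ridge_logit = LogisticRegression(penalty="l2", C=0.5, max_iter=5000).fit(X, y)
lasso_logit = LogisticRegression(penalty="l1", C=0.5, solver="liblinear",
                                 max_iter=5000).fit(X, y)

print("l2 zero coefficients:", int(np.sum(ridge_logit.coef_ == 0)))
print("l1 zero coefficients:", int(np.sum(lasso_logit.coef_ == 0)))
```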
Degrees of Freedom in Regularised Regression
For Ridge regression, the effective DoF is the trace of the ridge "hat" matrix \(H_\lambda = X (X^T X + \lambda I)^{-1} X^T\) that maps \(\mathbf{y}\) to the fitted values (the estimator behind it is derived in the addendum below):
\[ \text{DoF}_{\text{ridge}} = \mathrm{trace}\left( X \left(X^T X + \lambda I\right)^{-1} X^T \right) \]
For Lasso regression, the degrees of freedom is approximately the number of non-zero coefficients, i.e. the sparsity of the solution. (I had assumed the name "sparsity" came from a sparse projection matrix, but the Lasso fit cannot be expressed as a closed-form linear map of \(\mathbf{y}\) the way Ridge can.)
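Both statements are easy to check numerically. A sketch (numpy plus scikit-learn, synthetic data) that computes the ridge trace formula directly and counts non-zero Lasso coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=3)
lam = 50.0
p = X.shape[1]

# Ridge: effective DoF = trace( X (X^T X + lambda I)^{-1} X^T ).
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
dof_ridge = np.trace(H)

# Lasso: DoF is (approximately) the number of non-zero fitted coefficients.
dof_lasso = int(np.sum(Lasso(alpha=1.0).fit(X, y).coef_ != 0))

print(f"ridge effective DoF at lambda={lam}: {dof_ridge:.2f}")  # fractional, < p
print("lasso DoF (non-zero coefficients):", dof_lasso)
```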
Addendum: Ridge Regression Estimator Derivation
\[ \begin{align} \mathcal{L}(\boldsymbol{\beta})&= \| \mathbf{y} - X \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_2^2 \\ &=(\mathbf{y} - X \boldsymbol{\beta})^T (\mathbf{y} - X \boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T \boldsymbol{\beta} \\ &= \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T X^T \mathbf{y} + \boldsymbol{\beta}^T X^T X \boldsymbol{\beta} + \lambda \boldsymbol{\beta}^T \boldsymbol{\beta} \end{align} \;\\ \;\\ \therefore\quad\min_{\boldsymbol{\beta}}\left\{ \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T X^T \mathbf{y} + \boldsymbol{\beta}^T X^T X \boldsymbol{\beta} + \lambda \boldsymbol{\beta}^T \boldsymbol{\beta} \right\} \]
Taking the derivative with respect to \(\boldsymbol{\beta}\) and setting to zero:
\[ \frac{d}{d \boldsymbol{\beta}} \left( \mathbf{y}^T \mathbf{y} - 2 \boldsymbol{\beta}^T X^T \mathbf{y} + \boldsymbol{\beta}^T X^T X \boldsymbol{\beta} + \lambda \boldsymbol{\beta}^T \boldsymbol{\beta} \right) = 0 \]
Calculating derivatives term by term:
\[ -2 X^T \mathbf{y} + 2 X^T X \boldsymbol{\beta} + 2 \lambda \boldsymbol{\beta} = 0\\ (X^T X + \lambda I) \boldsymbol{\beta} = X^T \mathbf{y}\\ \;\\ \boxed{ \hat{\boldsymbol{\beta}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y} } \]
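As a sanity check on the boxed estimator, the closed form can be compared against scikit-learn's Ridge. A sketch on made-up data (fit_intercept=False so both solve exactly the objective above, since scikit-learn otherwise leaves the intercept unpenalised):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n, p, lam = 50, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Closed form: beta_hat = (X^T X + lambda I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge minimises ||y - X beta||^2 + alpha ||beta||^2,
# which is the same objective when no intercept is fitted.
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn, atol=1e-6))  # True
```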