class: center, middle

# Introduction and Regularization

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández

#### Maestría en Economía Otoño 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Introduction]**
- Ridge Regression
- Lasso Estimator
- Elastic Net

---

## Motivation

Locate the intersection between **Applied Econometrics** and **Machine Learning** to solve empirical problems with modern data.

Two cultures:

- **Data modeling** (assumes a stochastic data-generating process)
- **Algorithmic modeling** (treats the data mechanism as unknown)

The economics community has been committed to the almost exclusive use of data models.

Statistics and ML are now **converging**; in economics, adoption has been *slower*.

---

## Motivation

Adoption of ML in economics is important because data have changed:

- **Big data:** we now observe information on a large number of units.
- Many features/X for each unit.
- Often beyond the simple single cross-section setting.

Move away from dependence on parametric models toward a more diverse set of tools, yet:

- Preserve the strengths of applied econometrics (identification, external validity).
- Leverage ML for: selection/regularization, prediction, heterogeneity, and big data.

---

## Econometrics vs. Machine Learning

Econometrics:

- Emphasizes **large-sample properties**: consistency, asymptotic normality, efficiency. Theoretical proofs.

Machine Learning:

- Emphasizes **algorithmic performance**: practical behavior in specific settings, with an "error-rate" orientation and no asymptotic proofs.
- **No formal proofs**: are neural networks uniformly superior to methods like regression trees or random forests?

While **valid confidence intervals** are crucial (e.g., when estimating an ATE), methods with no formal inference should not be dismissed.

**Out-of-sample predictive performance** from ML can be valuable, even if prediction is **rare in modern econometrics**.

---

## Econometrics + Machine Learning

Similarities:

- Nonparametric regression is, in ML terminology, **supervised learning for regression problems**.
- Nonparametric regression for discrete responses is, in ML terminology, **supervised learning for classification problems**.

Adding:

- **Unsupervised learning**: clustering analysis and density estimation.
- Estimates of heterogeneous treatment effects and optimal policy mapping.
- *Bandit* approaches for effective experimentation.
- Matrix completion problems.
- Analysis of text data.

---

## Terminology

ML uses **new terms for old ideas**:

- Estimation: *Training*
- Regressors: *Features*
- Parameters: *Weights*

**Supervised learning:** observe (X, Y) and make predictions.

**Unsupervised learning:** observe X only; clustering or structure discovery.

**Classification:** ML term for discrete outcome models.

In econometrics the dimension of `\(X\)` is `\(k\)`; in machine learning it is `\(p\)`.

---

## Estimation in Econometrics

We model the conditional distribution of an outcome:

`$$Y_i \mid X_i \sim \mathcal{N}(\alpha + \beta' X_i, \sigma^2)$$`

For example, estimate the parameters by **least squares**:

`$$(\hat{\alpha}_{LS}, \hat{\beta}_{LS}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} (Y_i - \alpha - \beta' X_i)^2$$`

If the model is correct, the estimator:

- Is **unbiased** and **BLUE**.
- Is also the **MLE** under normality.
- Has desirable **large-sample efficiency**.

---

## Prediction in Machine Learning

In **ML**, the focus is on **prediction** rather than estimation.

Predict a new outcome:

`$$\hat{Y}_{N+1} = \hat{\alpha} + \hat{\beta}'X_{N+1}$$`

Minimize the **out-of-sample loss**:

`$$(Y_{N+1} - \hat{Y}_{N+1})^2$$`

- The estimators `\((\hat{\alpha}, \hat{\beta})\)` need **not** come from OLS.
- Any method that improves predictive accuracy (e.g., **regularization, trees, or boosting**) may be preferred.

---

## (Cross)Validation

**Econometrics (Validation):**

- The form of the regression model, parametric or nonparametric, and the regressors are *given from the outside*, e.g., by economic theory.
- When model selection is discussed, it is often in the form of testing null hypotheses concerning the validity of a particular model.

---

## (Cross)Validation

**Machine Learning (Cross-validation):**

- Out-of-sample cross-validation can help guide such decisions. There are two components:
- The goal is predictive power, rather than estimation of a particular structural or causal parameter.
- The method uses out-of-sample comparisons, rather than in-sample goodness-of-fit measures.

---

## Overfitting and Regularization

**Overfitting (more of an ML concern):**

- Select flexible models that fit well, but not so well that out-of-sample prediction is compromised.
- Less emphasis on formal results showing that particular methods are superior in large samples (asymptotically). Instead, methods are compared on specific data sets.

---

## Overfitting and Regularization

**Regularization (Metrics + ML):**

- When optimizing, e.g., maximizing the log-likelihood function, a term is added to the objective function to penalize complexity.
- In settings with **many models or parameters**, we add a **penalty** to limit complexity.
- E.g., in MLE (maximum likelihood), we add a term to the log-likelihood function equal to `\(-(k/2)\ln(n)\)`, leading to the Bayesian Information Criterion.

---

## Regularization

**Econometrics antecedents:**

- Information criteria (AIC, BIC) penalize the number of parameters.
- Bayesian methods use **priors** on the parameter distribution (e.g., centered at zero) as implicit regularization.

**ML approaches**, once asymptotic theory is set aside:

- Regularization is **data-driven**, tuned by **out-of-sample performance** rather than subjective priors.

---

## Regularization

Model example:

`$$(Y_i \mid X_i) \sim \mathcal{N}(\beta' X_i, \sigma^2)$$`

- Suppose we think the true coefficients **shouldn't be too large**. Maybe each predictor has only a **moderate** effect on Y.
- We express that belief as a **prior**: we place a prior (Bayesian view) distribution on the `\(\beta_k\)` coefficients (for example, with standardized Y and X).

`$$\beta_k \sim \mathcal{N}(0, \tau^2)$$`

---

## Regularization

Then the posterior mean for `\(\beta\)` solves:

`$$\arg\min_{\beta}\sum_{i=1}^{N}(Y_i - \beta' X_i)^2+ \frac{\sigma^2}{\tau^2}\|\beta\|_2^2$$`

Where `\(\|\beta\|_2\)` is `\(\left(\sum_{k=1}^K \beta_k^2\right)^{1/2}\)`: the Euclidean length or `\(L_2\)` norm, the usual distance from the origin in K-dimensional space.

This is **Ridge Regularization (Shrinkage)**:

- Penalizing large coefficients, or shrinking them toward zero when they don't improve fit much.
- Each `\(\beta\)` is pulled slightly toward 0, making predictions more stable and reducing overfitting risk.

---

## Regularization

In ML notation:

`$$\arg\min_{\beta} \sum_{i=1}^{N}(Y_i - \beta' X_i)^2 + \lambda\|\beta\|_2^2$$`

- The **penalty parameter `\(\lambda\)`** controls the degree of shrinkage.
- In **Bayesian** settings `\(\lambda\)` reflects prior beliefs.
- In **ML**, `\(\lambda\)` is chosen via **cross-validation** to optimize predictive performance (see the sketch below).
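
A minimal R sketch of this last point, assuming the `glmnet` package is available and using simulated data (all names and values are illustrative only): `alpha = 0` requests the ridge (`\(L_2\)`) penalty and `cv.glmnet()` picks `\(\lambda\)` by K-fold cross-validation.

```r
# Minimal sketch (simulated data; assumes the glmnet package is installed):
# ridge penalty with lambda chosen by cross-validation.
library(glmnet)

set.seed(123)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)            # illustrative features
beta <- c(rep(0.5, 5), rep(0, p - 5))      # a few moderate true effects
y <- drop(X %*% beta) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 0)       # alpha = 0 -> ridge (L2) penalty
cv_fit$lambda.min                          # lambda with the lowest CV error
coef(cv_fit, s = "lambda.min")             # shrunken coefficients at that lambda
```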

---

## Regularization

- The econometrics motivation is to reduce the degree of collinearity among the regressors.
- The ML motivation is regularization of high-dimensional problems (too many regressors).
- Traditional parametric asymptotic theory assumes that `\(p\)` is fixed as `\(n\to \infty\)`, implying that `\(p\)` is much smaller than `\(n\)`.
- **A high-dimensional setup** is used to describe the context where `\(p\to\infty\)`, including when `\(p>n\)`.

---

## Outline

- Introduction
- **.green[Ridge Regression]**
- Lasso Estimator
- Elastic Net

---

## High Dimensional Problem

Given `\(\beta_{ols}= (X'X)^{-1} X'Y\)`:

- When `\(p > n\)` the estimator `\(\beta_{ols}\)` is not defined since `\(X'X\)` has deficient rank: more unknowns than equations.
- If `\(p < n\)` but `\(p\)` is large we can invert `\(X'X\)`, but it is ill-conditioned or nearly singular: multicollinearity is near perfect.
- The eigenvalues of `\(X'X\)` tell us how much independent information the data contain along different linear combinations.
- Small eigenvalues of `\(X'X\)` arise because predictors/features convey nearly redundant information.

---

## High Dimensional Problem

Remember that `\((X'X)^{-1}\)` reflects the covariance among the predictors/features.

- It can be expressed in terms of the form `\(1/\text{eigenvalue}\)`, which explode with small variations in the data.
- Many of the variables may be low-information, but it is difficult to know a priori which ones.

Consequently, **we turn to estimation methods other than least squares**: ridge regression, Lasso, elastic net, regression trees, and random forests.

---

## Ridge Regression

Given that `\(X'X\)` is ill-conditioned, we can use the ridge regression estimator:

`$$\hat\beta_{ridge} = \left(X'X + \lambda I_p \right)^{-1} X'Y$$`

- Where `\(\lambda > 0\)` is the shrinkage parameter: the estimator is well defined and not ill-conditioned.
- Even if `\(p>n\)`! It chooses the minimum-norm solution among the infinitely many OLS fits.
- **`\(\lambda\)` is the tuning parameter.**

---

## Ridge Regression

**Spectral decomposition:**

- `\(X'X = H'DH\)` where H is orthonormal (a rotation matrix: eigenvectors).
- Orthonormal means `\(H'H = HH' = I\)`. So H only rotates the system; it does not rescale.
- `\(D = diag \{r_1,...,r_p\}\)` is a diagonal matrix with the eigenvalues (stretch/shrink factors) `\(r_j\)` of `\(X'X\)`.

Each **eigenvector** defines a direction in `\(\beta\)`-space, and each **eigenvalue** tells how much variation or information the data contain in that direction.

---

## Ridge Regression

Set `\(\Lambda = \lambda I_p\)`:

`$$X'X + \lambda I_p = H'DH + \lambda H'H = H'(D+\Lambda)H$$`

which has strictly positive eigenvalues `\(r_j + \lambda > 0\)`.

- `\(\lambda = 0\)` gives the OLS estimator.
- When inverting, `\(H'(D+\Lambda)^{-1}H\)`: think of it as dividing by `\(r_j + \lambda\)` along each direction.
- If eigenvalues are small and `\(\lambda = 0\)` (i.e., high correlation between regressors), `\(\hat\beta\)` "explodes".

---

## Ridge Regression

`$$X'X + \lambda I_p = H'DH + \lambda H'H = H'(D+\Lambda)H$$`

- Ridge adds `\(\lambda\)` to every eigenvalue, stabilizing the inverse while keeping the same eigenvectors.
- Estimation becomes more stable as `\(\lambda\)` increases (at the cost of more shrinkage):

`$$\hat\beta_{ridge} = \left(X'X + \lambda I_p \right)^{-1} X'Y$$`

[Some Code](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/1_ridge_matrixes.R)

*For the code, remember:* for a coefficient vector the **Euclidean norm** (`\(L_2\)` norm) measures the *magnitude* of the vector:

$$ \|\beta\|_2 = \sqrt{\beta_1^2 + \beta_2^2 + \cdots + \beta_p^2}. $$

The distance of `\(\beta\)` from the origin in *p*-dimensional space.
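
As a complement to the linked script, here is a minimal base-R sketch (simulated data; names and values are illustrative assumptions) showing how a small eigenvalue of `\(X'X\)` destabilizes OLS and how adding `\(\lambda\)` stabilizes the inverse and shrinks the `\(L_2\)` norm of the coefficients.

```r
# Minimal sketch (simulated data): a nearly singular X'X makes OLS erratic;
# adding lambda to the diagonal stabilizes the inverse (ridge).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # almost a copy of x1 -> near-perfect collinearity
X  <- cbind(x1, x2)
y  <- x1 + rnorm(n)

XtX <- crossprod(X)                    # X'X
eigen(XtX)$values                      # one eigenvalue is close to zero

lambda     <- 1
beta_ols   <- solve(XtX, crossprod(X, y))                      # large, unstable
beta_ridge <- solve(XtX + lambda * diag(2), crossprod(X, y))   # stabilized by lambda

cbind(beta_ols, beta_ridge)
c(ols = sqrt(sum(beta_ols^2)), ridge = sqrt(sum(beta_ridge^2)))  # L2 norms
```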

---

## Ridge Regression (Regularization)

Hence, the second motivation for using Ridge is:

- When `\(X'X\)` is ill-conditioned its inverse is *ill-posed*: nearly singular, or not unique when `\(p\ge n\)`.
- Techniques to deal with ill-posed estimators are called *regularization*.
- This can be done through *penalization*.

---

## Ridge Regression (Regularization)

Consider:

$$ SSE_2(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta = \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2. $$

The minimizer of `\(SSE_2\)` is a regularized least squares estimator:

`$$SSE_2(\beta, \lambda)= \underbrace{\|Y - X\beta\|_2^2}_{\text{Fit term}}+ \underbrace{\lambda \|\beta\|_2^2}_{\text{Penalty term}}$$`

- The **fit term** rewards accurate predictions.
- The **penalty term** discourages large coefficients.
- Penalizing large coefficient vectors keeps them from being too large and erratic.
- So Ridge keeps the `\(\beta\)` estimates from exploding and regularizes them toward zero.

---

## Numeric Example

Let `\(y = 2\)`, `\(x = 1\)`, and `\(\lambda = 1\)`:

| `\(\beta\)` | Fit term `\((y - x\beta)^2\)` | Penalty term `\(\lambda\beta^2\)` | Total `\(SSE_2(\beta,\lambda)\)` |
|---:|------------------------:|----------------------:|----------------------:|
| 0 | 4 | 0 | 4 |
| 1 | 1 | 1 | .green[2 (minimum)] |
| 2 | 0 | 4 | 4 |
| 3 | 1 | 9 | 10 |

- The optimal `\(\beta\)` is smaller than the OLS value of 2.
- Ridge **shrinks** coefficients toward zero.
- Large `\(\lambda\)`: stronger "pull" toward zero.
- Small `\(\lambda\)`: weaker penalty, closer to OLS.

---

## Ridge Regression (Regularization)

You can think of Ridge regression either as penalization (`\(\lambda\)`) or as a constraint (`\(\tau\)`):

- Alternatively:

`$$\min_{\beta} \ \|Y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_2^2 \le \tau$$`

- Here, `\(\tau \ge 0\)` is the maximum allowed "size" of `\(\beta\)`. You *control* the allowed size of coefficients.

Hence the larger `\(\tau\)`, the smaller `\(\lambda\)`, and vice versa.

---

## Ridge Regression (Regularization)

The Lagrangian of the constrained problem is:

$$ \min_{\beta} \ (Y - X\beta)'(Y - X\beta) + \lambda \, (\beta'\beta - \tau) $$

- The first order condition for both minimization problems is identical:

$$ -2X'(Y - X\beta) + 2\lambda\beta = 0. $$

- They are connected since the values of `\(\lambda\)` and `\(\tau\)` satisfy the relationship:

$$ Y'X (X'X + \lambda I_p)^{-1} (X'X + \lambda I_p)^{-1} X'Y = \tau. $$

- You find `\(\lambda\)` given `\(\tau\)` numerically (more of an econometrics approach: dual minimization).
- The ML solution is to find `\(\lambda\)` directly (not subject to a constraint) through *cross-validation* methods.

---

## Ridge Regression: Dual minimization

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#ridge_dualmin.png" alt=" " width="180%" />
<p class="caption"> </p>
</div>

The Ridge "path" is given by varying sizes of `\(\lambda\)`. The contour of the sphere is given by `\(\tau\)`.

---

## Ridge Regression in Practice

One can choose `\(\tau\)` in multiple ways: one is a prior (Bayesian view).

Or cross-validation on `\(\tau\)`:

- Pick a grid of `\(\tau\)` values (small: strong shrinkage).
- For each `\(\tau\)`, solve the constrained problem, compute the CV error, and pick the `\(\tau\)` with the lowest CV error.

---

## Ridge Regression in Practice

**CV error** = average prediction error on data not used for fitting.

Steps:

1) Split the data into `\(K\)` folds.
2) For each `\(\tau\)`:
  - fit the model on `\(K-1\)` folds,
  - predict the left-out fold,
  - compute the mean squared error.
3) Average across folds: CV(`\(\tau\)`).

- Choose the `\(\tau\)` that minimizes the CV error.
- Best balance between bias (underfitting) and variance (overfitting).

---

## Ridge Regression: Cross-validation

But the most common method is cross-validation over `\(\lambda\)`.

**Formally:** The leave-one-out (LOO) ridge estimator, prediction errors, and CV criterion are:

`$$\hat{\beta}_{-i}(\lambda) = \left( \sum_{j \ne i} X_j X_j' + \Lambda \right)^{-1} \left( \sum_{j \ne i} X_j Y_j \right),$$`

`$$\tilde{e}_i(\lambda) = Y_i - X_i' \hat{\beta}_{-i}(\lambda),$$`

`$$CV(\lambda) = \sum_{i=1}^n \tilde{e}_i(\lambda)^2.$$`

Choose the `\(\hat\lambda\)` that minimizes the cross-validation error `\(CV(\lambda)\)` and use it to compute the cross-validated ridge estimator.

[Gimme the code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/2_ridge_regression.R)

---

## Ridge Regression: Covariance Matrix

Under random sampling, the covariance matrix of the Ridge estimator is:

`$$\operatorname{Var}[\hat{\beta}_{ridge} \mid X] = (X'X + \lambda I_p)^{-1} (X'DX) (X'X + \lambda I_p)^{-1},$$`

where

`$$D = \operatorname{diag}\{\sigma^2(X_1), \ldots, \sigma^2(X_n)\} \quad \text{and} \quad \sigma^2(x) = \mathbb{E}[e^2 \mid X = x].$$`

---

## Ridge Regression: Covariance Matrix

- When errors have constant variance (homoskedasticity), `\(D = \sigma^2 I_n\)`, so:

`$$\operatorname{Var}[\hat{\beta}_{ridge} \mid X] = \sigma^2 (X'X + \lambda I_p)^{-1} X'X (X'X + \lambda I_p)^{-1}.$$`

- Under clustering or serial correlation, the middle term `\(X'DX\)` is modified accordingly.
- This expression shows how `\(\lambda\)` stabilizes the inversion of `\(X'X\)`, **reducing the variance of the estimated coefficients.**

Together, `\(\hat\beta_{ridge}\)` and `\(Var[\hat\beta_{ridge}|X]\)` describe the nature of the **bias-variance trade-off of the Ridge estimator.**

---

## Ridge Regression: Covariance Matrix

**.blue[Warning:]**

- The interpretation of standard errors in *Ridge Regression* is non-standard: the `\(\beta\)`'s are biased by construction.
- Confidence intervals will have deficient coverage.
- Ridge is generally employed for prediction, with no standard errors reported.
- Note: Ridge regression will include all `\(p\)` predictors in the final model.
- Now **in Stata:** download and perform OLS and Ridge Regression: [Da code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/3_Ridge_Regression.do)

---

## Outline

- Introduction
- Ridge Regression
- **.green[Lasso Estimator]**
- Elastic Net

---

## The Lasso Estimator

An intermediate case relative to Ridge uses the `\(L_1\)` norm penalty.

- Known as the **Lasso** (*Least Absolute Shrinkage and Selection Operator*).

*Objective Function:*

- The least squares criterion with an `\(L_1\)` (absolute) penalty is:

`$$SSE_1(\beta, \lambda)= (Y - X\beta)'(Y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\ = \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$`

- The Lasso estimator is the minimizer of this objective:

`$$\hat{\beta}_{Lasso}= \arg\min_{\beta} SSE_1(\beta, \lambda)$$`

---

## The Lasso Estimator

- For `\(\lambda > 0\)`, the Lasso estimator is **well-defined even when p > n**.
- The solution generally must be found **numerically**.
- The `\(L_1\)` penalty induces **sparsity**: some coefficients are set exactly to zero, combining shrinkage and variable selection (see the sketch below).
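
A minimal R sketch of the sparsity property (simulated data; the `glmnet` package is assumed to be installed): with `alpha = 1`, the `\(L_1\)` penalty drives many coefficients exactly to zero.

```r
# Minimal sketch (simulated data): the L1 penalty sets some coefficients exactly to zero.
library(glmnet)

set.seed(42)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))   # only three truly nonzero coefficients
y <- drop(X %*% beta) + rnorm(n)

fit <- glmnet(X, y, alpha = 1)         # alpha = 1 -> Lasso (L1) penalty
b   <- coef(fit, s = 0.5)              # coefficients at an arbitrary lambda = 0.5
sum(b != 0)                            # only a handful of nonzero entries remain
```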

---

## The Lasso Estimator

The Lasso minimization problem has a dual constrained minimization problem:

`$$\hat{\beta}_{Lasso} = \arg\min_{\|\beta\|_1 \le \tau} SSE_1(\beta)$$`

Observe that the constrained minimization has the Lagrangian:

`$$\min_{\beta} (Y - X\beta)'(Y - X\beta) + \lambda \left( \sum_{j=1}^{p} |\beta_j| - \tau \right)$$`

which has first order conditions:

`$$-2X'_j (Y - X\beta) + \lambda \, \text{sgn}(\beta_j) = 0.$$`

These are the same as those for the minimization of the Lasso penalized criterion. Thus, for every value of `\(\lambda\)`, there exists a `\(\tau\)` that yields the same `\(\hat\beta\)`.

---

## Lasso: Dual minimization

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#lasso_dualmin.png" alt=" " width="140%" />
<p class="caption"> </p>
</div>

The Lasso path is drawn with the dashed line. This is the sequence of solutions obtained as the constraint set is varied.

---

## Lasso: Dual minimization

- Since we minimize a quadratic subject to a polytope constraint, the solution tends to lie at a vertex. This eliminates a subset of coefficients (in this case `\(\beta_1 = 0\)`).
- While Lasso is a shrinkage estimator, it does not shrink individual coefficients monotonically (as `\(\beta_1\)` shrinks to zero, `\(\beta_2\)` grows).
- In Ridge regression all coefficients are shrunk toward zero by a similar amount.
- In Lasso, sufficiently small coefficients are shrunk all the way to zero.

---

## Lasso Solution (Orthogonal Case)

Suppose that `\(X'X = I_p\)`. Then the first-order condition for minimization simplifies to:

`$$-2(\hat{\beta}_{ols,j} - \hat{\beta}_{Lasso,j}) + \lambda \, \text{sgn}(\hat{\beta}_{Lasso,j}) = 0$$`

which has the explicit solution:

`$$\hat{\beta}_{Lasso,j} = \begin{cases} \hat{\beta}_{ols,j} - \lambda/2, & \hat{\beta}_{ols,j} > \lambda/2 \\[6pt] 0, & |\hat{\beta}_{ols,j}| \le \lambda/2 \\[6pt] \hat{\beta}_{ols,j} + \lambda/2, & \hat{\beta}_{ols,j} < -\lambda/2 \end{cases}$$`

- The Lasso estimate is a **continuous transformation** of the OLS estimate.
- For small values of the OLS estimate, the Lasso estimate is set to zero.
- For all other values it moves toward zero by `\(\lambda/2\)`.

---

## Lasso Solution (Orthogonal Case)

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#lasso_vsols.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

- When `\(X'X = I_p\)`, the ridge estimator equals: `\(\hat{\beta}_{ridge} = (1 + \lambda)^{-1} \hat{\beta}_{ols}\)`
- So it shrinks the coefficients toward zero by a common multiple.

---

## Lasso Regression

**Note:**

- The penalty has a different meaning depending on the scale of the regressors.
- Consequently, it is important to scale the regressors appropriately before applying Lasso.
- It is conventional to scale all the variables to have mean zero and unit variance.

---

## Lasso Penalty Selection

- Picking `\(\lambda\)` induces a trade-off between complexity and parsimony.
- As `\(\lambda\)` increases, the number of selected variables falls.
- The K-fold criterion aims to select models with good forecast accuracy, but not necessarily for other purposes such as accurate inference.
- Another popular choice is the "1se" rule: the `\(\lambda\)` which yields the most parsimonious model among values within one standard error of the minimum.
- The idea is to select a model similar to, but more parsimonious than, the CV-minimizing choice (see the sketch below).
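
A minimal R sketch of this rule (simulated data; `glmnet` assumed installed): `cv.glmnet()` reports both the CV-minimizing `lambda.min` and the more parsimonious `lambda.1se`.

```r
# Minimal sketch (simulated data): CV-minimizing lambda vs. the "1se" rule.
library(glmnet)

set.seed(7)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1.5)) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # K-fold CV for the Lasso
cv_fit$lambda.min                                  # lambda with the lowest CV error
cv_fit$lambda.1se                                  # largest lambda within 1 SE of the minimum
sum(coef(cv_fit, s = "lambda.1se") != 0)           # typically fewer selected variables
```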

[Codito Here.](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/4_lasso_regression.R)

---

## Elastic Net

Taking a weighted average of the Ridge and Lasso penalties, we obtain the **Elastic Net** criterion:

$$SSE(\beta, \lambda, \alpha) = (Y - X\beta)'(Y - X\beta) + \lambda \left( \alpha \|\beta\|_2^2 + (1 - \alpha)\|\beta\|_1 \right) $$

- With weight `\(0 \leq \alpha \leq 1\)`: this nests the Lasso (`\(\alpha = 0\)`) and Ridge (`\(\alpha = 1\)`).
- The parameters `\((\alpha, \lambda)\)` are selected by **joint minimization of the K-fold cross-validation criterion**.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h1>The End</h1>
</div>