class: center, middle

# Introduction and Regularization

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández

#### Maestría en Economía Otoño 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Introduction]**
- Ridge Regression
- Lasso Estimator
- Elastic Net

---

## Motivation

Locate the intersection between **Applied Econometrics** and **Machine Learning** to solve empirical problems with modern data.

Two cultures:

- **Data modeling** (assumes a stochastic data-generating process)
- **Algorithmic modeling** (treats the data mechanism as unknown)

The economics community has been committed to the almost exclusive use of data models.

Statistics and ML are now **converging**; in economics, adoption has been *slower*.

---

## Motivation

Adoption of ML in economics is important because data have changed:

- **Big data:** we now observe information on a large number of units.
- Many features/X for each unit.
- Often beyond the simple single cross-section setting.

Move away from dependence on parametric models toward a more diverse set of tools, yet:

- Preserve the strengths of applied econometrics (identification, external validity).
- Leverage ML for: selection/regularization, prediction, heterogeneity, and big data.

---

## Econometrics vs. Machine Learning

Econometrics:

- Emphasizes **large-sample properties**: consistency, asymptotic normality, efficiency. Theoretical proofs.

Machine Learning:

- Emphasizes **algorithmic performance**: practical behavior in specific settings, with an "error-rate" orientation and no asymptotic proofs.
- **No formal proofs**: are neural networks uniformly superior to methods like regression trees or random forests?

While **valid confidence intervals** are crucial (e.g., when estimating an ATE), methods with no formal inference should not be dismissed.

**Out-of-sample predictive performance** from ML can be valuable, even if prediction is **rare in modern econometrics**.

---

## Econometrics + Machine Learning

Similarities:

- Nonparametric regression is, in ML terminology, **supervised learning for regression problems**.
- Nonparametric regression for discrete responses is, in ML terminology, **supervised learning for classification problems**.

Adding:

- **Unsupervised learning**: clustering analysis and density estimation.
- Estimates of heterogeneous treatment effects and optimal policy mapping.
- *Bandit* approaches for effective experimentation.
- Matrix completion problems.
- Analysis of text data.

---

## Terminology

ML uses **new terms for old ideas**:

- Estimation: *Training*
- Regressors: *Features*
- Parameters: *Weights*

**Supervised learning:** observe (X, Y) and make predictions.

**Unsupervised learning:** observe X only; clustering or structure discovery.

**Classification:** ML term for discrete outcome models.

In econometrics the dimension of `\(X\)` is `\(k\)`; in machine learning it is `\(p\)`.

---

## Estimation in Econometrics

We model the conditional distribution of an outcome:

`$$Y_i \mid X_i \sim \mathcal{N}(\alpha + \beta' X_i, \sigma^2)$$`

For example, estimate the parameters by **least squares**:

`$$(\hat{\alpha}_{LS}, \hat{\beta}_{LS}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} (Y_i - \alpha - \beta' X_i)^2$$`

If the model is correct, the estimator:

- Is **unbiased** and **BLUE**.
- Is also the **MLE** under normality.
- Has desirable **large-sample efficiency**.

---

## Prediction in Machine Learning

In **ML**, the focus is on **prediction** rather than estimation.

Predict a new outcome:

`$$\hat{Y}_{N+1} = \hat{\alpha} + \hat{\beta}'X_{N+1}$$`

Minimize the **out-of-sample loss**:

`$$(Y_{N+1} - \hat{Y}_{N+1})^2$$`

- The estimators `\((\hat{\alpha}, \hat{\beta})\)` need **not** come from OLS.
- Any method that improves predictive accuracy (e.g., **regularization, trees, or boosting**) may be preferred.

---

## (Cross)Validation

**Econometrics (Validation):**

- The form of the regression model, parametric or nonparametric, and the regressors are *given from the outside*, e.g., by economic theory.
- When model selection is discussed, it is often in the form of testing null hypotheses concerning the validity of a particular model.

---

## (Cross)Validation

**Machine Learning (Cross-validation):**

- Out-of-sample cross-validation can help guide such decisions. There are two components:
- The goal is predictive power, rather than estimation of a particular structural or causal parameter.
- The method uses out-of-sample comparisons, rather than in-sample goodness-of-fit measures.

---

## Overfitting and Regularization

**Overfitting (more of an ML concern):**

- Select flexible models that fit well, but not so well that out-of-sample prediction is compromised.
- Less emphasis on formal results showing that particular methods are superior in large samples (asymptotically). Instead, methods are compared on specific data sets.

---

## Overfitting and Regularization

**Regularization (Metrics + ML):**

- When optimizing, e.g., maximizing the log-likelihood function, a term is added to the objective function to penalize complexity.
- In settings with **many models or parameters**, we add a **penalty** to limit complexity.
- E.g., in MLE (maximum likelihood), we add a term to the log-likelihood function equal to `\(-(k/2)\ln(n)\)`, leading to the Bayesian Information Criterion.

---

## Regularization

**Econometrics antecedents:**

- Information criteria (AIC, BIC) penalize the number of parameters.
- Bayesian methods use **priors** on the parameter distribution (e.g., centered at zero) as implicit regularization.

**ML approaches**, once asymptotic theory is set aside:

- Regularization is **data-driven**, tuned by **out-of-sample performance** rather than subjective priors.

---

## Regularization

Model example:

`$$(Y_i \mid X_i) \sim \mathcal{N}(\beta' X_i, \sigma^2)$$`

- Suppose we think the true coefficients **shouldn't be too large**. Maybe each predictor has only a **moderate** effect on Y.
- We express that belief as a **prior**: we place a prior (Bayesian view) distribution on the `\(\beta_k\)` coefficients (for example, with standardized Y and X).

`$$\beta_k \sim \mathcal{N}(0, \tau^2)$$`

---

## Regularization

Then the posterior mean for `\(\beta\)` solves:

`$$\arg\min_{\beta}\sum_{i=1}^{N}(Y_i - \beta' X_i)^2+ \frac{\sigma^2}{\tau^2}\|\beta\|_2^2$$`

Where `\(\|\beta\|_2\)` is `\(\left(\sum_{k=1}^K \beta_k^2\right)^{1/2}\)`: the Euclidean length or `\(L_2\)` norm, the usual distance from the origin in K-dimensional space.

This is **Ridge Regularization (Shrinkage)**:

- Penalizing large coefficients, or shrinking them toward zero when they don't improve fit much.
- Each `\(\beta\)` is pulled slightly toward 0, making predictions more stable and reducing overfitting risk.

---

## Regularization

In ML notation:

`$$\arg\min_{\beta} \sum_{i=1}^{N}(Y_i - \beta' X_i)^2 + \lambda\|\beta\|_2^2$$`

- The **penalty parameter `\(\lambda\)`** controls the degree of shrinkage.
- In **Bayesian** settings `\(\lambda\)` reflects prior beliefs.
- In **ML**, `\(\lambda\)` is chosen via **cross-validation** to optimize predictive performance (see the sketch below).
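
A minimal R sketch of this last point, assuming the `glmnet` package is available and using simulated data (all names and values are illustrative only): `alpha = 0` requests the ridge (`\(L_2\)`) penalty and `cv.glmnet()` picks `\(\lambda\)` by K-fold cross-validation.

```r
# Minimal sketch (simulated data; assumes the glmnet package is installed):
# ridge penalty with lambda chosen by cross-validation.
library(glmnet)

set.seed(123)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)            # illustrative features
beta <- c(rep(0.5, 5), rep(0, p - 5))      # a few moderate true effects
y <- drop(X %*% beta) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 0)       # alpha = 0 -> ridge (L2) penalty
cv_fit$lambda.min                          # lambda with the lowest CV error
coef(cv_fit, s = "lambda.min")             # shrunken coefficients at that lambda
```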

---

## Regularization

- The econometrics motivation is to reduce the degree of collinearity among the regressors.
- The ML motivation is regularization of high-dimensional problems (too many regressors).
- Traditional parametric asymptotic theory assumes that `\(p\)` is fixed as `\(n\to \infty\)`, implying that `\(p\)` is much smaller than `\(n\)`.
- **A high-dimensional setup** is used to describe the context where `\(p\to\infty\)`, including when `\(p>n\)`.

---

## Outline

- Introduction
- **.green[Ridge Regression]**
- Lasso Estimator
- Elastic Net

---

## High Dimensional Problem

Given `\(\beta_{ols}= (X'X)^{-1} X'Y\)`:

- When `\(p > n\)` the estimator `\(\beta_{ols}\)` is not defined since `\(X'X\)` has deficient rank: more unknowns than equations.
- If `\(p < n\)` but `\(p\)` is large we can invert `\(X'X\)`, but it is ill-conditioned or nearly singular: multicollinearity is near perfect.
- The eigenvalues of `\(X'X\)` tell us how much independent information the data contain along different linear combinations.
- Small eigenvalues of `\(X'X\)` arise because predictors/features convey nearly redundant information.

---

## High Dimensional Problem

Remember that `\((X'X)^{-1}\)` reflects the covariance among the predictors/features.

- It can be expressed in terms of the form `\(1/\text{eigenvalue}\)`, which explode with small variations in the data.
- Many of the variables may be low-information, but it is difficult to know a priori which ones.

Consequently, **we turn to estimation methods other than least squares**: ridge regression, Lasso, elastic net, regression trees, and random forests.

---

## Ridge Regression

Given that `\(X'X\)` is ill-conditioned, we can use the ridge regression estimator:

`$$\hat\beta_{ridge} = \left(X'X + \lambda I_p \right)^{-1} X'Y$$`

- Where `\(\lambda > 0\)` is the shrinkage parameter: the estimator is well defined and not ill-conditioned.
- Even if `\(p>n\)`! It chooses the minimum-norm solution among the infinitely many OLS fits.
- **`\(\lambda\)` is the tuning parameter.**

---

## Ridge Regression

**Spectral decomposition:**

- `\(X'X = H'DH\)` where H is orthonormal (a rotation matrix: eigenvectors).
- Orthonormal means `\(H'H = HH' = I\)`. So H only rotates the system; it does not rescale.
- `\(D = diag \{r_1,...,r_p\}\)` is a diagonal matrix with the eigenvalues (stretch/shrink factors) `\(r_j\)` of `\(X'X\)`.

Each **eigenvector** defines a direction in `\(\beta\)`-space, and each **eigenvalue** tells how much variation or information the data contain in that direction.

---

## Ridge Regression

Set `\(\Lambda = \lambda I_p\)`:

`$$X'X + \lambda I_p = H'DH + \lambda H'H = H'(D+\Lambda)H$$`

which has strictly positive eigenvalues `\(r_j + \lambda > 0\)`.

- `\(\lambda = 0\)` gives the OLS estimator.
- When inverting, `\(H'(D+\Lambda)^{-1}H\)`: think of it as dividing by `\(r_j + \lambda\)` along each direction.
- If eigenvalues are small and `\(\lambda = 0\)` (i.e., high correlation between regressors), `\(\hat\beta\)` "explodes".

---

## Ridge Regression

`$$X'X + \lambda I_p = H'DH + \lambda H'H = H'(D+\Lambda)H$$`

- Ridge adds `\(\lambda\)` to every eigenvalue, stabilizing the inverse while keeping the same eigenvectors.
- Estimation becomes more stable as `\(\lambda\)` increases (at the cost of more shrinkage):

`$$\hat\beta_{ridge} = \left(X'X + \lambda I_p \right)^{-1} X'Y$$`

[Some Code](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/1_ridge_matrixes.R)

*For the code, remember:* for a coefficient vector the **Euclidean norm** (`\(L_2\)` norm) measures the *magnitude* of the vector:

$$ \|\beta\|_2 = \sqrt{\beta_1^2 + \beta_2^2 + \cdots + \beta_p^2}. $$

The distance of `\(\beta\)` from the origin in *p*-dimensional space.
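
As a complement to the linked script, here is a minimal base-R sketch (simulated data; names and values are illustrative assumptions) showing how a small eigenvalue of `\(X'X\)` destabilizes OLS and how adding `\(\lambda\)` stabilizes the inverse and shrinks the `\(L_2\)` norm of the coefficients.

```r
# Minimal sketch (simulated data): a nearly singular X'X makes OLS erratic;
# adding lambda to the diagonal stabilizes the inverse (ridge).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # almost a copy of x1 -> near-perfect collinearity
X  <- cbind(x1, x2)
y  <- x1 + rnorm(n)

XtX <- crossprod(X)                    # X'X
eigen(XtX)$values                      # one eigenvalue is close to zero

lambda     <- 1
beta_ols   <- solve(XtX, crossprod(X, y))                      # large, unstable
beta_ridge <- solve(XtX + lambda * diag(2), crossprod(X, y))   # stabilized by lambda

cbind(beta_ols, beta_ridge)
c(ols = sqrt(sum(beta_ols^2)), ridge = sqrt(sum(beta_ridge^2)))  # L2 norms
```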

---

## Ridge Regression (Regularization)

Hence, the second motivation for using Ridge is:

- When `\(X'X\)` is ill-conditioned its inverse is *ill-posed*: nearly singular, or not unique when `\(p\ge n\)`.
- Techniques to deal with ill-posed estimators are called *regularization*.
- This can be done through *penalization*.

---

## Ridge Regression (Regularization)

Consider:

$$ SSE_2(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta = \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2. $$

The minimizer of `\(SSE_2\)` is a regularized least squares estimator:

`$$SSE_2(\beta, \lambda)= \underbrace{\|Y - X\beta\|_2^2}_{\text{Fit term}}+ \underbrace{\lambda \|\beta\|_2^2}_{\text{Penalty term}}$$`

- The **fit term** rewards accurate predictions.
- The **penalty term** discourages large coefficients.
- Penalizing large coefficient vectors keeps them from being too large and erratic.
- So Ridge keeps the `\(\beta\)` estimates from exploding and regularizes them toward zero.

---

## Numeric Example

Let `\(y = 2\)`, `\(x = 1\)`, and `\(\lambda = 1\)`:

| `\(\beta\)` | Fit term `\((y - x\beta)^2\)` | Penalty term `\(\lambda\beta^2\)` | Total `\(SSE_2(\beta,\lambda)\)` |
|---:|------------------------:|----------------------:|----------------------:|
| 0 | 4 | 0 | 4 |
| 1 | 1 | 1 | .green[2 (minimum)] |
| 2 | 0 | 4 | 4 |
| 3 | 1 | 9 | 10 |

- The optimal `\(\beta\)` is smaller than the OLS value of 2.
- Ridge **shrinks** coefficients toward zero.
- Large `\(\lambda\)`: stronger "pull" toward zero.
- Small `\(\lambda\)`: weaker penalty, closer to OLS.

---

## Ridge Regression (Regularization)

You can think of Ridge regression either as penalization (`\(\lambda\)`) or as a constraint (`\(\tau\)`):

- Alternatively:

`$$\min_{\beta} \ \|Y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_2^2 \le \tau$$`

- Here, `\(\tau \ge 0\)` is the maximum allowed "size" of `\(\beta\)`. You *control* the allowed size of coefficients.

Hence the larger `\(\tau\)`, the smaller `\(\lambda\)`, and vice versa.

---

## Ridge Regression (Regularization)

The Lagrangian of the constrained problem is:

$$ \min_{\beta} \ (Y - X\beta)'(Y - X\beta) + \lambda \, (\beta'\beta - \tau) $$

- The first order condition for both minimization problems is identical:

$$ -2X'(Y - X\beta) + 2\lambda\beta = 0. $$

- They are connected since the values of `\(\lambda\)` and `\(\tau\)` satisfy the relationship:

$$ Y'X (X'X + \lambda I_p)^{-1} (X'X + \lambda I_p)^{-1} X'Y = \tau. $$

- You find `\(\lambda\)` given `\(\tau\)` numerically (more of an econometrics approach: dual minimization).
- The ML solution is to find `\(\lambda\)` directly (not subject to a constraint) through *cross-validation* methods.

---

## Ridge Regression: Dual minimization

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#ridge_dualmin.png" alt=" " width="180%" />
<p class="caption"> </p>
</div>

The Ridge "path" is given by varying sizes of `\(\lambda\)`. The contour of the sphere is given by `\(\tau\)`.

---

## Ridge Regression in Practice

One can choose `\(\tau\)` in multiple ways: one is a prior (Bayesian view).

Or cross-validation on `\(\tau\)`:

- Pick a grid of `\(\tau\)` values (small: strong shrinkage).
- For each `\(\tau\)`, solve the constrained problem, compute the CV error, and pick the `\(\tau\)` with the lowest CV error.

---

## Ridge Regression in Practice

**CV error** = average prediction error on data not used for fitting.

Steps:

1) Split the data into `\(K\)` folds.
2) For each `\(\tau\)`:
  - fit the model on `\(K-1\)` folds,
  - predict the left-out fold,
  - compute the mean squared error.
3) Average across folds: CV(`\(\tau\)`).

- Choose the `\(\tau\)` that minimizes the CV error.
- Best balance between bias (underfitting) and variance (overfitting).

---

## Ridge Regression: Cross-validation

But the most common method is cross-validation over `\(\lambda\)`.

**Formally:** The leave-one-out (LOO) ridge estimator, prediction errors, and CV criterion are:

`$$\hat{\beta}_{-i}(\lambda) = \left( \sum_{j \ne i} X_j X_j' + \Lambda \right)^{-1} \left( \sum_{j \ne i} X_j Y_j \right),$$`

`$$\tilde{e}_i(\lambda) = Y_i - X_i' \hat{\beta}_{-i}(\lambda),$$`

`$$CV(\lambda) = \sum_{i=1}^n \tilde{e}_i(\lambda)^2.$$`

Choose the `\(\hat\lambda\)` that minimizes the cross-validation error `\(CV(\lambda)\)` and use it to compute the cross-validated ridge estimator.

[Gimme the code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/2_ridge_regression.R)

---

## Ridge Regression: Covariance Matrix

Under random sampling, the covariance matrix of the Ridge estimator is:

`$$\operatorname{Var}[\hat{\beta}_{ridge} \mid X] = (X'X + \lambda I_p)^{-1} (X'DX) (X'X + \lambda I_p)^{-1},$$`

where

`$$D = \operatorname{diag}\{\sigma^2(X_1), \ldots, \sigma^2(X_n)\} \quad \text{and} \quad \sigma^2(x) = \mathbb{E}[e^2 \mid X = x].$$`

---

## Ridge Regression: Covariance Matrix

- When errors have constant variance (homoskedasticity), `\(D = \sigma^2 I_n\)`, so:

`$$\operatorname{Var}[\hat{\beta}_{ridge} \mid X] = \sigma^2 (X'X + \lambda I_p)^{-1} X'X (X'X + \lambda I_p)^{-1}.$$`

- Under clustering or serial correlation, the middle term `\(X'DX\)` is modified accordingly.
- This expression shows how `\(\lambda\)` stabilizes the inversion of `\(X'X\)`, **reducing the variance of the estimated coefficients.**

Together, `\(\hat\beta_{ridge}\)` and `\(Var[\hat\beta_{ridge}|X]\)` describe the nature of the **bias-variance trade-off of the Ridge estimator.**

---

## Ridge Regression: Covariance Matrix

**.blue[Warning:]**

- The interpretation of standard errors in *Ridge Regression* is non-standard: the `\(\beta\)`'s are biased by construction.
- Confidence intervals will have deficient coverage.
- Ridge is generally employed for prediction, with no standard errors reported.
- Note: Ridge regression will include all `\(p\)` predictors in the final model.
- Now **in Stata:** download and perform OLS and Ridge Regression: [Da code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/3_Ridge_Regression.do)

---

## Outline

- Introduction
- Ridge Regression
- **.green[Lasso Estimator]**
- Elastic Net

---

## The Lasso Estimator

An intermediate case relative to Ridge uses the `\(L_1\)` norm penalty.

- Known as the **Lasso** (*Least Absolute Shrinkage and Selection Operator*).

*Objective Function:*

- The least squares criterion with an `\(L_1\)` (absolute) penalty is:

`$$SSE_1(\beta, \lambda)= (Y - X\beta)'(Y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\ = \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$$`

- The Lasso estimator is the minimizer of this objective:

`$$\hat{\beta}_{Lasso}= \arg\min_{\beta} SSE_1(\beta, \lambda)$$`

---

## The Lasso Estimator

- For `\(\lambda > 0\)`, the Lasso estimator is **well-defined even when p > n**.
- The solution generally must be found **numerically**.
- The `\(L_1\)` penalty induces **sparsity**: some coefficients are set exactly to zero, combining shrinkage and variable selection (see the sketch below).
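
A minimal R sketch of the sparsity property (simulated data; the `glmnet` package is assumed to be installed): with `alpha = 1`, the `\(L_1\)` penalty drives many coefficients exactly to zero.

```r
# Minimal sketch (simulated data): the L1 penalty sets some coefficients exactly to zero.
library(glmnet)

set.seed(42)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))   # only three truly nonzero coefficients
y <- drop(X %*% beta) + rnorm(n)

fit <- glmnet(X, y, alpha = 1)         # alpha = 1 -> Lasso (L1) penalty
b   <- coef(fit, s = 0.5)              # coefficients at an arbitrary lambda = 0.5
sum(b != 0)                            # only a handful of nonzero entries remain
```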

---

## The Lasso Estimator

The Lasso minimization problem has a dual constrained minimization problem:

`$$\hat{\beta}_{Lasso} = \arg\min_{\|\beta\|_1 \le \tau} SSE_1(\beta)$$`

Observe that the constrained minimization has the Lagrangian:

`$$\min_{\beta} (Y - X\beta)'(Y - X\beta) + \lambda \left( \sum_{j=1}^{p} |\beta_j| - \tau \right)$$`

which has first order conditions:

`$$-2X'_j (Y - X\beta) + \lambda \, \text{sgn}(\beta_j) = 0.$$`

These are the same as those for the minimization of the Lasso penalized criterion. Thus, for every value of `\(\lambda\)`, there exists a `\(\tau\)` that yields the same `\(\hat\beta\)`.

---

## Lasso: Dual minimization

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#lasso_dualmin.png" alt=" " width="140%" />
<p class="caption"> </p>
</div>

The Lasso path is drawn with the dashed line. This is the sequence of solutions obtained as the constraint set is varied.

---

## Lasso: Dual minimization

- Since we minimize a quadratic subject to a polytope constraint, the solution tends to lie at a vertex. This eliminates a subset of coefficients (in this case `\(\beta_1 = 0\)`).
- While Lasso is a shrinkage estimator, it does not shrink individual coefficients monotonically (as `\(\beta_1\)` shrinks to zero, `\(\beta_2\)` grows).
- In Ridge regression all coefficients are shrunk toward zero by a similar amount.
- In Lasso, sufficiently small coefficients are shrunk all the way to zero.

---

## Lasso Solution (Orthogonal Case)

Suppose that `\(X'X = I_p\)`. Then the first-order condition for minimization simplifies to:

`$$-2(\hat{\beta}_{ols,j} - \hat{\beta}_{Lasso,j}) + \lambda \, \text{sgn}(\hat{\beta}_{Lasso,j}) = 0$$`

which has the explicit solution:

`$$\hat{\beta}_{Lasso,j} = \begin{cases} \hat{\beta}_{ols,j} - \lambda/2, & \hat{\beta}_{ols,j} > \lambda/2 \\[6pt] 0, & |\hat{\beta}_{ols,j}| \le \lambda/2 \\[6pt] \hat{\beta}_{ols,j} + \lambda/2, & \hat{\beta}_{ols,j} < -\lambda/2 \end{cases}$$`

- The Lasso estimate is a **continuous transformation** of the OLS estimate.
- For small values of the OLS estimate, the Lasso estimate is set to zero.
- For all other values it moves toward zero by `\(\lambda/2\)`.

---

## Lasso Solution (Orthogonal Case)

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#lasso_vsols.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

- When `\(X'X = I_p\)`, the ridge estimator equals: `\(\hat{\beta}_{ridge} = (1 + \lambda)^{-1} \hat{\beta}_{ols}\)`
- So it shrinks the coefficients toward zero by a common multiple.

---

## Lasso Regression

**Note:**

- The penalty has a different meaning depending on the scale of the regressors.
- Consequently, it is important to scale the regressors appropriately before applying Lasso.
- It is conventional to scale all the variables to have mean zero and unit variance.

---

## Lasso Penalty Selection

- Picking `\(\lambda\)` induces a trade-off between complexity and parsimony.
- As `\(\lambda\)` increases, the number of selected variables falls.
- The K-fold criterion aims to select models with good forecast accuracy, but not necessarily for other purposes such as accurate inference.
- Another popular choice is the "1se" rule: the `\(\lambda\)` which yields the most parsimonious model among values within one standard error of the minimum.
- The idea is to select a model similar to, but more parsimonious than, the CV-minimizing choice (see the sketch below).
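
A minimal R sketch of this rule (simulated data; `glmnet` assumed installed): `cv.glmnet()` reports both the CV-minimizing `lambda.min` and the more parsimonious `lambda.1se`.

```r
# Minimal sketch (simulated data): CV-minimizing lambda vs. the "1se" rule.
library(glmnet)

set.seed(7)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1.5)) + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # K-fold CV for the Lasso
cv_fit$lambda.min                                  # lambda with the lowest CV error
cv_fit$lambda.1se                                  # largest lambda within 1 SE of the minimum
sum(coef(cv_fit, s = "lambda.1se") != 0)           # typically fewer selected variables
```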

[Codito Here.](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/4_lasso_regression.R)

---

## Elastic Net

Taking a weighted average of the Ridge and Lasso penalties, we obtain the **Elastic Net** criterion:

$$SSE(\beta, \lambda, \alpha) = (Y - X\beta)'(Y - X\beta) + \lambda \left( \alpha \|\beta\|_2^2 + (1 - \alpha)\|\beta\|_1 \right) $$

- With weight `\(0 \leq \alpha \leq 1\)`: this nests the Lasso (`\(\alpha = 0\)`) and Ridge (`\(\alpha = 1\)`).
- The parameters `\((\alpha, \lambda)\)` are selected by **joint minimization of the K-fold cross-validation criterion**.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h1>The End</h1>
</div>