class: center, middle

# Averaging, Selection and Applications

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía Otoño 2025
##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Model Averaging and Selection]**
- Double Selection Lasso
- Double Debiased Machine Learning

---

## Ensembling

- Suppose you have a set of estimators (e.g., kernel regression, LLR, series regression, ridge regression, Lasso, regression tree, or random forest).

- Ensembling, or model averaging, selects a set of models and takes a weighted average of them.

- A popular method is **stacking**, or (from econometrics) **Jackknife Model Averaging**.

- It selects the model-averaging weights by minimizing a CV criterion, subject to the weights being non-negative and summing to one.

- **The theoretical basis comes from econometrics, not from ML.**

---

## Jackknife Estimator

The jackknife is a resampling method used to estimate the sampling variability of an estimator using *leave-one-out* samples.

Given a sample of size `\(n\)` and an estimator `\(\hat{\theta}\)`:

1. For each observation `\(i = 1,\ldots,n\)`, compute the *leave-one-out* estimate `\(\hat{\theta}_{(i)}\)` by removing observation `\(i\)`.

2. Compute the average leave-one-out estimate: `\(\bar{\theta} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{(i)}\)`

3. The jackknife variance estimator is:

`$$\widehat{\mathrm{Var}}_{\text{jack}}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^n \left( \hat{\theta}_{(i)} - \bar{\theta} \right)^2$$`

The jackknife measures how sensitive the estimator is to the removal of each individual observation.

---

## Jackknife vs Bootstrap

**Jackknife:**

- Uses leave-one-out resampling.
- Produces exactly `\(n\)` replicates.
- Deterministic; no randomness.
- Performs poorly for non-smooth estimators (e.g., quantiles).

**Bootstrap:**

- Resamples the data with replacement to create samples of size `\(n\)`.
- Produces `\(B\)` replicates, where `\(B\)` is user-chosen.
- Monte Carlo method; involves randomness.
- Applicable to a wide range of estimators, including non-smooth or complex ones.

---

## Jackknife (CV) Model Averaging

Model averaging combines predictions from several candidate models. Jackknife (CV) averaging is a common method among other averaging procedures.

- Consider `\(M\)` linear models, each producing least squares estimates `\(\hat{\beta}_m\)`.

- For fixed weights `\(w = (w_1,\ldots,w_M)\)`, the averaged regression can be written as:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_m + \hat{e}_i(w)$$`

---

## Jackknife (CV) Model Averaging

In ML this method is called **stacking**. *Do not confuse it with stacking in the DiD context.*

- For leave-one-out cross-validation, observation `\(i\)` is removed and each model is refit, yielding `\(\hat{\beta}_{m,(-i)}\)`. Hence:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_{m,(-i)} + \tilde{e}_i(w)$$`

- The leave-one-out prediction errors satisfy:

`$$\tilde{e}_i(w) = \sum_{m=1}^M w_m \tilde{e}_{mi}$$`

where `\(\tilde{e}_{mi}\)` are the model-specific leave-one-out errors.

---

## Jackknife (CV) Model Averaging

Let `\(\tilde{E}\)` be the `\(n \times M\)` matrix whose `\((i,m)\)` entry is `\(\tilde{e}_{mi}\)`.

- In matrix notation, the jackknife CV criterion is:

`$$\mathrm{CV}(w) = w' \tilde{E}' \tilde{E} w$$`

- The jackknife model averaging weights are obtained by solving:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} \mathrm{CV}(w)$$`

The resulting estimator is known as the Jackknife Model Averaging (JMA) estimator.
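---

## JMA Weights in R (Sketch)

A minimal sketch of the computation (not the course code linked later): the weights can be obtained with `quadprog::solve.QP`. The simulated data, the three candidate models, and all object names below are illustrative assumptions.

```r
# Sketch: leave-one-out errors for M candidate OLS models, then JMA weights
# by quadratic programming (simulated data; for illustration only).
library(quadprog)

set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.5)

# Three nested polynomial regressions as candidate models.
models <- list(y ~ x, y ~ poly(x, 3), y ~ poly(x, 5))
M <- length(models)

# Leave-one-out residuals via the OLS shortcut e_i / (1 - h_ii).
Etilde <- sapply(models, function(f) {
  fit <- lm(f)
  residuals(fit) / (1 - hatvalues(fit))
})

# Minimize w' E'E w subject to sum(w) = 1 and w >= 0.
Dmat <- crossprod(Etilde) + diag(1e-8, M)  # E'E, tiny ridge for numerical stability
Amat <- cbind(rep(1, M), diag(M))          # first column: equality; rest: w_m >= 0
bvec <- c(1, rep(0, M))
w_jma <- solve.QP(Dmat, dvec = rep(0, M), Amat = Amat, bvec = bvec, meq = 1)$solution
round(w_jma, 3)                            # typically several weights are (near) zero
```

The sparsity of `w_jma` illustrates the boundary solutions discussed on the next slides.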
---

## Optimization

Weights are selected by minimizing a cross-validation criterion that is a convex quadratic function:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} w' \tilde{E}' \tilde{E} w$$`

- where `\(\tilde{E}\)` is the matrix of leave-one-out prediction errors,

- and the feasible set is the simplex:

`$$\mathcal{W} = \{ w \in \mathbb{R}^M : w_m \ge 0,\; \sum_{m=1}^{M} w_m = 1 \}$$`

The simplex contains all convex combinations of the `\(M\)` candidate models.

---

## Optimization

- Minimizing a quadratic form subject to a simplex restriction typically produces **sparse** solutions, as in **Lasso** (unlike Ridge regression, which does not yield sparsity).

- Interior points of the simplex combine multiple models; such mixtures tend to raise the value of `\(w' \tilde{E}' \tilde{E} w\)`.

- This makes the optimizer prefer solutions located on the boundary of the simplex, where many weights are zero.

- Jackknife (CV) averaging therefore performs implicit selection and shrinkage.

---

## Three Models Reasoning

- When there are three candidate models, the weight vector is `\(w = (w_1, w_2, w_3)\)` with `\(w_1 \ge 0\)`, `\(w_2 \ge 0\)`, `\(w_3 \ge 0\)`, and `\(w_1 + w_2 + w_3 = 1\)`.

- The feasible region is a two-dimensional simplex: an equilateral triangle.

- The three vertices represent the choices `\((1,0,0)\)`, `\((0,1,0)\)`, `\((0,0,1)\)`, each corresponding to selecting a single model.

- Each edge represents a mixture of exactly two models.

- The interior of the triangle consists of all combinations with `\(w_1 > 0\)`, `\(w_2 > 0\)`, and `\(w_3 > 0\)`.

---

## Three Models Reasoning

- For the JMA criterion `\(w'\tilde{E}'\tilde{E}w\)`, interior points combine all three columns of `\(\tilde{E}\)`.

- Such combinations tend to increase the value of `\(w'\tilde{E}'\tilde{E}w\)`.

- As a result, the minimizer tends to lie on an edge (positive weights on two models) or at a vertex (positive weight on one model).

The same geometric intuition applies to higher-dimensional cases when `\(M > 3\)`.

- The solution is found numerically by quadratic programming, which is computationally simple and fast even when the number of models `\(M\)` is large.

[JMA code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/9_JK_averaging.R)

---

## Outline

- Model Averaging and Selection
- **.green[Double Selection Lasso]**
- Double Debiased Machine Learning

---

## Double-Selection Lasso

- Post-estimation inference is difficult with most machine learning estimators.

- For example, consider the **post-Lasso estimator** (OLS on the regressors selected by the Lasso).

- This is a post-model-selection estimator with poor coverage probability (incorrect standard errors).

- Belloni, Chernozhukov, and Hansen (2014) proposed an alternative that achieves better coverage rates.

- They show that the coverage deficiencies arise from, and increase with, the correlation between `\(D\)` (treatment) and `\(X\)` (high-dimensional covariates).

- Improved coverage accuracy can be achieved if `\(X\)` is included in the regression (below) whenever `\(X\)` and `\(D\)` are correlated.

---

## Double-Selection Lasso

Consider the partially linear model:

`$$Y = D\theta_0 + X'\beta_0 + \varepsilon,\qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

- When `\(X\)` is high-dimensional (or `\(p > n\)`), naive model selection can omit important confounders of `\(D\)` and `\(Y\)`, leading to biased estimates of `\(\theta_0\)`.

- **Lasso** is suitable in this setting because it exploits sparsity.

- (Approximate) sparsity means that although `\(X\)` may contain many variables, only a relatively small number have non-negligible effects on `\(Y\)` or `\(D\)`.

- Lasso recovers key confounding variables while discarding irrelevant ones.

---

## Double-Selection Lasso

The first-stage (treatment) equation is:

`$$D = X'\gamma + V, \qquad \mathbb{E}[V \mid X] = 0 \tag{1}$$`

The reduced-form outcome equation is:

`$$Y = X'\eta + U, \qquad \mathbb{E}[U \mid X] = 0 \tag{2}$$`

- Double selection applies model selection (Lasso) to (1) and (2) and takes the union of the selected regressors: `\(\tilde{X} = X_1 \cup X_2\)`.

- Then it estimates the structural equation:

`$$Y = D\theta + \tilde{X}'\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

by least squares with the selected controls, which allows conventional SE estimation (a sketch follows).
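---

## Double-Selection Lasso in R (Sketch)

A minimal sketch with `glmnet` on simulated data. The data-generating process and object names are illustrative assumptions, and the cross-validated penalty below is a simplification of the plug-in penalty used by Belloni, Chernozhukov, and Hansen (2014).

```r
# Double-selection Lasso sketch (simulated data; for illustration only).
library(glmnet)

set.seed(2)
n <- 500; p <- 100
X <- matrix(rnorm(n * p), n, p)
D <- X[, 1] - 0.5 * X[, 2] + rnorm(n)            # treatment depends on a few X's
Y <- 1 * D + X[, 1] + 0.25 * X[, 2] + rnorm(n)   # true theta = 1

# Step 1: Lasso of Y on X and Lasso of D on X.
b_y <- as.numeric(coef(cv.glmnet(X, Y), s = "lambda.min"))[-1]
b_d <- as.numeric(coef(cv.glmnet(X, D), s = "lambda.min"))[-1]

# Step 2: union of the regressors selected in either equation.
sel <- union(which(b_y != 0), which(b_d != 0))

# Step 3: OLS of Y on D and the selected controls; conventional SEs for theta.
fit <- lm(Y ~ D + X[, sel, drop = FALSE])
summary(fit)$coefficients["D", ]
```

In practice, the `hdm` package (Chernozhukov, Hansen, and Spindler) implements double selection with theoretically justified penalty levels.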
---

## DSL Summary

- A variable can be an important confounder even if its coefficient in `\(\beta_0\)` is small.

- Such a variable may strongly predict the treatment `\(D\)` but have only a weak relationship with `\(Y\)`.

- If Lasso is applied only to the outcome regression `\(Y \sim D + X\)`, the penalty may shrink its coefficient to zero, excluding it.

- This omission biases the estimate of `\(\theta_0\)`, because a relevant variable that predicts `\(D\)` has been excluded.

- Double selection fixes this by also selecting variables that are important in the treatment regression, even if they are not important in the outcome regression.

---

## Outline

- Model Averaging and Selection
- Double Selection Lasso
- **.green[Double Debiased Machine Learning]**

---

## Post-Regularization Lasso

A refinement of the double-selection Lasso is the **post-regularization Lasso** estimator of Chernozhukov, Hansen, and Spindler (2015).

- Called **partialling-out Lasso** in Stata.

Start with the structural equation:

`$$Y = D\theta + X'\beta + e$$`

- To eliminate the high-dimensional component `\(X'\beta\)`, take expectations conditional on `\(X\)` and subtract:

`$$Y - \mathbb{E}[Y \mid X] = (D - \mathbb{E}[D \mid X])\theta + e$$`

- This removes the regressor `\(X\)` from the equation.

---

## Post-Regularization Lasso

Using the linear projections

`$$\mathbb{E}[Y \mid X] = X'\eta, \qquad \mathbb{E}[D \mid X] = X'\gamma$$`

we obtain:

`$$Y - X'\eta = (D - X'\gamma)\theta + e$$`

- If `\(\eta\)` and `\(\gamma\)` were known, `\(\theta\)` could be estimated by OLS.

- In practice they are estimated, typically by Lasso or post-Lasso, as in the double-selection Lasso.

---

## Post-Regularization Lasso

Let `\(\hat{\gamma}\)` be the estimated coefficient from the regression of `\(D\)` on `\(X\)`, and define the residual:

`$$\hat{V}_i = D_i - X_i' \hat{\gamma}$$`

Let `\(\hat{\eta}\)` be the estimated coefficient from the regression of `\(Y\)` on `\(X\)`, and define the residual:

`$$\hat{U}_i = Y_i - X_i' \hat{\eta}$$`

Estimate `\(\theta\)` by OLS using the residualized equation:

`$$\hat{U}_i = \theta\, \hat{V}_i + \hat{e}_i$$`

- This is the "partialling-out" approach: we "purge" `\(Y\)` and `\(D\)` of all predictable components coming from `\(X\)` (a sketch follows).

- The resulting estimator has a conventional asymptotic normal distribution, valid under **approximate sparsity**.
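---

## Partialling-Out Lasso in R (Sketch)

A minimal sketch on simulated data, again with cross-validated penalties rather than the plug-in penalties of Chernozhukov, Hansen, and Spindler; the DGP and object names are illustrative assumptions.

```r
# Partialling-out (post-regularization) Lasso sketch (simulated data).
library(glmnet)

set.seed(3)
n <- 500; p <- 100
X <- matrix(rnorm(n * p), n, p)
D <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
Y <- 1 * D + X[, 1] + 0.25 * X[, 2] + rnorm(n)   # true theta = 1

# Lasso fits for E[D|X] and E[Y|X].
fit_d <- cv.glmnet(X, D)
fit_y <- cv.glmnet(X, Y)

# Residualize D and Y, then regress residual Y on residual D.
V <- as.numeric(D - predict(fit_d, newx = X, s = "lambda.min"))
U <- as.numeric(Y - predict(fit_y, newx = X, s = "lambda.min"))
summary(lm(U ~ V))$coefficients["V", ]           # theta_hat and its SE
```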
---

## Double/Debiased Machine Learning (DML)

The post-regularization estimator first estimates the coefficients `\(\gamma\)` and `\(\eta\)`, and then estimates the coefficient `\(\theta\)`.

The DML estimator takes this a step further by using **K-fold partitioning**.

1. Randomly split the sample into `\(K\)` folds `\(A_k,\quad k = 1,\ldots,K\)`, each with roughly `\(n/K\)` observations.

- Each fold serves as a small test set while the model is trained on the remaining `\(K-1\)` folds.

- This prevents overfitting and ensures orthogonality.

---

## Double/Debiased Machine Learning (DML)

2. Write the data matrices for each fold as `\((Y_k, D_k, X_k)\)`.

3. For `\(k = 1, \ldots, K\)`:

  (a) Use all observations except those in fold `\(k\)` to estimate the coefficients `\(\gamma\)` and `\(\eta\)` by Lasso or post-Lasso.

  (b) Write these leave-fold-out estimators as `\(\hat{\gamma}_{-k}\)` and `\(\hat{\eta}_{-k}\)`.

- These predictions are out-of-sample, ensuring debiasing.

  (c) Set `\(\hat{V}_k = D_k - X_k \hat{\gamma}_{-k}\)` and `\(\hat{U}_k = Y_k - X_k \hat{\eta}_{-k}\)`.

- The residuals isolate the variation in `\(Y\)` and `\(D\)` not explained by `\(X\)`.

- We use ML to partial out the effect of many controls `\(X\)`, but avoid overfitting by leaving fold `\(k\)` out during training.

---

## Double/Debiased Machine Learning (DML)

4. Stack `\(\hat{V}_k\)` and `\(\hat{U}_k\)` into `\(n \times 1\)` vectors `\(\hat{V}\)` and `\(\hat{U}\)`, and set:

`$$\hat{\theta}_{DML} = \left( \sum_{k=1}^K \hat{V}_k' \hat{V}_k \right)^{-1} \left( \sum_{k=1}^K \hat{V}_k' \hat{U}_k \right) = (\hat{V}' \hat{V})^{-1} (\hat{V}' \hat{U})$$`

We regress the "residual `\(Y\)`" on the "residual `\(D\)`".

---

## DML Summary

- Lasso and other ML methods estimate the nuisance functions `\(\mathbb{E}[Y \mid X]\)` and `\(\mathbb{E}[D \mid X]\)` (often denoted `\(g(X)\)` and `\(m(X)\)`) flexibly.

- Sample splitting avoids overfitting and ensures clean residuals.

- The final regression of `\(\hat{U}\)` on `\(\hat{V}\)` gives the causal effect `\(\theta\)`, after partialling out.

- This relies on the identifying assumption `\(\mathbb{E}[e \mid D, X] = 0\)`.

- DML produces valid inference even with high-dimensional `\(X\)`.

- A minimal cross-fitting sketch in R is included as an appendix slide after the end.

---

<style> .centered-word { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); } </style>

<div class="centered-word"> <h1>The End</h1> </div>
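---

## Appendix: DML Cross-Fitting in R (Sketch)

A minimal cross-fitting sketch on simulated data. The DGP, the choice `\(K = 5\)`, and the cross-validated penalties are illustrative assumptions, not the course code.

```r
# DML sketch: cross-fitted partialling-out with Lasso (simulated data).
library(glmnet)

set.seed(4)
n <- 500; p <- 100; K <- 5
X <- matrix(rnorm(n * p), n, p)
D <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
Y <- 1 * D + X[, 1] + 0.25 * X[, 2] + rnorm(n)   # true theta = 1

fold <- sample(rep(1:K, length.out = n))         # step 1: random K-fold split
V <- U <- numeric(n)

for (k in 1:K) {
  in_k <- fold == k
  # (a) nuisance Lassos fitted on the other K - 1 folds
  fit_d <- cv.glmnet(X[!in_k, ], D[!in_k])
  fit_y <- cv.glmnet(X[!in_k, ], Y[!in_k])
  # (b)-(c) out-of-fold residuals for fold k
  V[in_k] <- D[in_k] - predict(fit_d, newx = X[in_k, ], s = "lambda.min")
  U[in_k] <- Y[in_k] - predict(fit_y, newx = X[in_k, ], s = "lambda.min")
}

# Step 4: regress residual Y on residual D, theta_hat = (V'V)^{-1} V'U.
theta_dml <- sum(V * U) / sum(V^2)
summary(lm(U ~ V - 1))$coefficients              # same point estimate, with an SE
c(theta_dml = theta_dml)
```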