class: center, middle

# Averaging, Selection and Applications.

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía, Otoño 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Model Averaging and Selection]**
- Double Selection Lasso
- Double Debiased Machine Learning

---

## Ensembling

- Suppose you have a set of estimators (e.g., kernel regression, LLR, series regression, ridge regression, Lasso, regression trees, or random forests).

- Ensembling, or model averaging, selects a set of models and takes a weighted average of them.

- A popular method is **stacking** or, in econometrics, **Jackknife Model Averaging.**

- It selects the model averaging weights by minimizing a CV criterion, subject to the weights being non-negative and summing to one.

- **There is no theory for this from ML; the basis comes from econometrics.**

---

## Jackknife Estimator

The jackknife is a resampling method used to estimate the sampling variability of an estimator using *leave-one-out* samples.

Given a sample of size `\(n\)` and an estimator `\(\hat{\theta}\)`:

1. For each observation `\(i = 1,\ldots,n\)`, compute the *leave-one-out* estimate `\(\hat{\theta}_{(i)}\)` by removing observation `\(i\)`.

2. Compute the average leave-one-out estimate: `\(\bar{\theta} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{(i)}\)`

3. The jackknife variance estimator is:

`$$\widehat{\mathrm{Var}}_{\text{jack}}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^n \left( \hat{\theta}_{(i)} - \bar{\theta} \right)^2$$`

The jackknife measures how sensitive the estimator is to the removal of each individual observation.

---

## Jackknife vs Bootstrap

**Jackknife:**

- Uses leave-one-out resampling.
- Produces exactly `\(n\)` replicates.
- Deterministic; no randomness.
- Performs poorly for non-smooth estimators (e.g., quantiles).

**Bootstrap:**

- Resamples the data with replacement to create samples of size `\(n\)`.
- Produces `\(B\)` replicates, where `\(B\)` is user-chosen.
- Monte Carlo method; involves randomness.
- Applicable to a wide range of estimators, including non-smooth or complex ones.

---

## Jackknife (CV) Model Averaging

Model averaging combines predictions from several candidate models. Jackknife (CV) averaging is one common procedure among several.

- Consider `\(M\)` linear models, each producing least squares estimates `\(\hat{\beta}_m\)`.

- For fixed weights `\(w = (w_1,\ldots,w_M)\)`, the averaged regression can be written as:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_m + \hat{e}_i(w)$$`

---

## Jackknife (CV) Model Averaging

In ML this method is called **stacking**. *Do not confuse it with stacking in the DiD context.*

- For leave-one-out cross-validation, observation `\(i\)` is removed and each model is refit, yielding `\(\hat{\beta}_{m,(-i)}\)`. Hence:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_{m,(-i)} + \tilde{e}_i(w)$$`

- The leave-one-out prediction errors satisfy:

`$$\tilde{e}_i(w) = \sum_{m=1}^M w_m \tilde{e}_{mi}$$`

where `\(\tilde{e}_{mi}\)` are the model-specific leave-one-out errors.

---

## Jackknife (CV) Model Averaging

Let `\(\tilde{E}\)` be the `\(n \times M\)` matrix whose `\((i,m)\)` entry is `\(\tilde{e}_{mi}\)`.

- In matrix notation, the jackknife CV criterion is:

`$$\mathrm{CV}(w) = w' \tilde{E}' \tilde{E} w$$`

- The jackknife model averaging weights are obtained by solving:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} \mathrm{CV}(w)$$`

The resulting estimator is known as the Jackknife Model Averaging (JMA) estimator. A short code sketch follows on the next slide.
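---

## JMA: Code Sketch

A minimal sketch of the JMA weights in R (not the course's linked code): three nested OLS models on simulated data, leave-one-out residuals via the closed form `\(\tilde{e}_i = \hat{e}_i/(1-h_{ii})\)`, and the simplex-constrained quadratic program solved with the `quadprog` package. The data-generating process and all variable names are illustrative assumptions.

```r
library(quadprog)

set.seed(123)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

# Three nested candidate models
models <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))
M <- length(models)

# Leave-one-out residuals via the OLS shortcut e_i / (1 - h_ii)
E_tilde <- sapply(models, function(m) residuals(m) / (1 - hatvalues(m)))

# Minimize CV(w) = w' E'E w over the simplex {w >= 0, sum(w) = 1}
D <- crossprod(E_tilde) + diag(1e-8, M)  # tiny ridge keeps D positive definite
A <- cbind(rep(1, M), diag(M))           # first column: equality sum(w) = 1
b <- c(1, rep(0, M))
w_jma <- solve.QP(Dmat = 2 * D, dvec = rep(0, M),
                  Amat = A, bvec = b, meq = 1)$solution
round(w_jma, 3)                          # weights are typically sparse
```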
---

## Optimization

The weights are selected by minimizing a cross-validation criterion that is a convex quadratic function:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} w' \tilde{E}' \tilde{E} w$$`

- Where `\(\tilde{E}\)` is the matrix of leave-one-out prediction errors.

- And the feasible set is the simplex:

`$$\mathcal{W} = \{ w \in \mathbb{R}^M : w_m \ge 0,\; \sum_{m=1}^{M} w_m = 1 \}$$`

The simplex contains all convex combinations of the `\(M\)` candidate models.

---

## Optimization

- Minimizing a quadratic form subject to a simplex restriction typically produces **sparse** solutions, much like the `\(\ell_1\)` constraint in **Lasso regression**.

- Interior points of the simplex combine multiple models; such mixtures tend to raise the value of `\(w' \tilde{E}' \tilde{E} w\)`.

- This makes the optimizer prefer solutions located on the boundary of the simplex, where many weights are zero.

- Jackknife (CV) averaging therefore performs implicit selection and shrinkage.

---

## Three Models Reasoning

- When there are three candidate models, the weight vector is `\(w = (w_1, w_2, w_3)\)` with `\(w_1 \ge 0\)`, `\(w_2 \ge 0\)`, `\(w_3 \ge 0\)`, and `\(w_1 + w_2 + w_3 = 1\)`.

- The feasible region is a two-dimensional simplex: an equilateral triangle.

- The three vertices, `\((1,0,0)\)`, `\((0,1,0)\)`, and `\((0,0,1)\)`, each correspond to selecting a single model.

- Each edge represents a mixture of exactly two models.

- The interior of the triangle consists of all combinations with `\(w_1 > 0\)`, `\(w_2 > 0\)`, and `\(w_3 > 0\)`.

---

## Three Models Reasoning

- For the JMA criterion `\(w'\tilde{E}'\tilde{E}w\)`, interior points combine all three columns of `\(\tilde{E}\)`.

- Such combinations tend to increase the value of `\(w'\tilde{E}'\tilde{E}w\)`.

- As a result, the minimizer tends to lie on an edge (positive weights on two models) or at a vertex (positive weight on one model).

- The same geometric intuition applies in higher dimensions, when `\(M > 3\)`.

- The solution is found numerically by quadratic programming, which is computationally simple and fast even when the number of models `\(M\)` is large.

[JMA code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/9_JK_averaging.R)

---

## Outline

- Model Averaging and Selection
- **.green[Double Selection Lasso]**
- Double Debiased Machine Learning

---

## Double Selection Lasso

Consider the partially linear model:

`$$Y = D\theta_0 + X'\beta_0 + \varepsilon,\qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

- When `\(X\)` is high-dimensional (or `\(p > n\)`), naive model selection can omit important confounders of `\(D\)` and `\(Y\)`, leading to biased estimates of `\(\theta_0\)`.

- Lasso is suitable in this setting because of sparsity.

- Sparsity means that although `\(X\)` may contain many variables, only a relatively small number have non-negligible effects on `\(Y\)` or `\(D\)`.

- Lasso recovers key confounding variables while discarding irrelevant ones.

---

## Double Selection Lasso

The first-stage (treatment) equation is:

`$$D = X'\gamma + V, \qquad \mathbb{E}[V \mid X] = 0 \tag{1}$$`

The reduced-form outcome equation is:

`$$Y = X'\eta + U, \qquad \mathbb{E}[U \mid X] = 0 \tag{2}$$`

- Double selection applies model selection to (1) and (2) and takes the union of the regressors selected in each equation: `\(\tilde{X} = X_1 \cup X_2\)`.

- It then estimates the structural equation:

`$$Y = D\theta + \tilde{X}'\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

using least squares with the selected controls, which allows for standard error estimation (a code sketch follows on the next slide).
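---

## Double Selection: Code Sketch

A minimal sketch of post-double-selection in R with `glmnet`, on simulated data in which `\(X_2\)` predicts `\(D\)` strongly but `\(Y\)` only weakly. The design, the tuning choices, and the helper `lasso_select()` are illustrative assumptions; the `hdm` package provides ready-made routines for this estimator.

```r
library(glmnet)

set.seed(123)
n <- 500; p <- 100
X <- matrix(rnorm(n * p), n, p)
d <- 0.8 * X[, 1] + 0.8 * X[, 2] + rnorm(n)          # treatment equation
y <- 1 * d + 0.5 * X[, 1] + 0.1 * X[, 2] + rnorm(n)  # outcome; theta_0 = 1

# Indices of the controls kept by cross-validated Lasso
lasso_select <- function(x, outcome) {
  fit <- cv.glmnet(x, outcome)
  co  <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop intercept
  which(co != 0)
}

S <- union(lasso_select(X, d), lasso_select(X, y))   # union of both selections

# Post-double-selection OLS of y on d and the selected controls
post <- lm(y ~ d + X[, S, drop = FALSE])
summary(post)$coefficients["d", ]                    # theta_hat and its SE
```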
---

## Double Selection Lasso

- A variable can be an important confounder even if its coefficient in `\(\beta_0\)` is small.

- Such a variable may strongly predict the treatment `\(D\)` but have only a weak relationship with `\(Y\)`.

- If Lasso is applied only to the outcome regression `\(Y \sim D + X\)`, the penalty may shrink its coefficient to zero, excluding it.

- This omission produces omitted-variable bias in the estimate of `\(\theta_0\)`: the excluded variable is correlated with `\(D\)` and still affects `\(Y\)`.

- Double selection fixes this by also selecting variables that are important in the treatment regression, even if they appear unimportant in the outcome regression.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h1>The End</h1>
</div>