class: center, middle

# Averaging, Selection and Applications

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía Otoño 2025
##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Model Averaging and Selection]**
- Double Selection Lasso
- Double Debiased Machine Learning

---

## Ensembling

- Suppose you have a set of estimators (e.g., kernel regression, LLR, series regression, ridge regression, Lasso, regression tree, or random forest).

- Ensembling, or model averaging, selects a set of models and takes a weighted average of them.

- A popular method is **stacking**, or (from econometrics) **Jackknife Model Averaging**.

- It selects the model-averaging weights by minimizing a CV criterion, subject to the weights being non-negative and summing to one.

- **The theoretical basis comes from econometrics, not from ML.**

---

## Jackknife Estimator

The jackknife is a resampling method used to estimate the sampling variability of an estimator using *leave-one-out* samples.

Given a sample of size `\(n\)` and an estimator `\(\hat{\theta}\)`:

1. For each observation `\(i = 1,\ldots,n\)`, compute the *leave-one-out* estimate `\(\hat{\theta}_{(i)}\)` by removing observation `\(i\)`.

2. Compute the average leave-one-out estimate: `\(\bar{\theta} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{(i)}\)`

3. The jackknife variance estimator is:

`$$\widehat{\mathrm{Var}}_{\text{jack}}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^n \left( \hat{\theta}_{(i)} - \bar{\theta} \right)^2$$`

The jackknife measures how sensitive the estimator is to the removal of each individual observation.

---

## Jackknife vs Bootstrap

**Jackknife:**

- Uses leave-one-out resampling.
- Produces exactly `\(n\)` replicates.
- Deterministic; no randomness.
- Performs poorly for non-smooth estimators (e.g., quantiles).

**Bootstrap:**

- Resamples the data with replacement to create samples of size `\(n\)`.
- Produces `\(B\)` replicates, where `\(B\)` is user-chosen.
- Monte Carlo method; involves randomness.
- Applicable to a wide range of estimators, including non-smooth or complex ones.

---

## Jackknife (CV) Model Averaging

Model averaging combines predictions from several candidate models. Jackknife (CV) averaging is a common method among other averaging procedures.

- Consider `\(M\)` linear models, each producing least squares estimates `\(\hat{\beta}_m\)`.

- For fixed weights `\(w = (w_1,\ldots,w_M)\)`, the averaged regression can be written as:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_m + \hat{e}_i(w)$$`

---

## Jackknife (CV) Model Averaging

In ML this method is called **stacking**. *Do not confuse it with stacking in the DiD context.*

- For leave-one-out cross-validation, observation `\(i\)` is removed and each model is refit, yielding `\(\hat{\beta}_{m,(-i)}\)`. Hence:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_{m,(-i)} + \tilde{e}_i(w)$$`

- The leave-one-out prediction errors satisfy:

`$$\tilde{e}_i(w) = \sum_{m=1}^M w_m \tilde{e}_{mi}$$`

where `\(\tilde{e}_{mi}\)` are the model-specific leave-one-out errors.

---

## Jackknife (CV) Model Averaging

Let `\(\tilde{E}\)` be the `\(n \times M\)` matrix whose `\((i,m)\)` entry is `\(\tilde{e}_{mi}\)`.

- In matrix notation, the jackknife CV criterion is:

`$$\mathrm{CV}(w) = w' \tilde{E}' \tilde{E} w$$`

- The jackknife model averaging weights are obtained by solving:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} \mathrm{CV}(w)$$`

The resulting estimator is known as the Jackknife Model Averaging (JMA) estimator.
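---

## JMA Weights in R (Sketch)

A minimal sketch of the computation (not the course code linked later): the weights can be obtained with `quadprog::solve.QP`. The simulated data, the three candidate models, and all object names below are illustrative assumptions.

```r
# Sketch: leave-one-out errors for M candidate OLS models, then JMA weights
# by quadratic programming (simulated data; for illustration only).
library(quadprog)

set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.5)

# Three nested polynomial regressions as candidate models.
models <- list(y ~ x, y ~ poly(x, 3), y ~ poly(x, 5))
M <- length(models)

# Leave-one-out residuals via the OLS shortcut e_i / (1 - h_ii).
Etilde <- sapply(models, function(f) {
  fit <- lm(f)
  residuals(fit) / (1 - hatvalues(fit))
})

# Minimize w' E'E w subject to sum(w) = 1 and w >= 0.
Dmat <- crossprod(Etilde) + diag(1e-8, M)  # E'E, tiny ridge for numerical stability
Amat <- cbind(rep(1, M), diag(M))          # first column: equality; rest: w_m >= 0
bvec <- c(1, rep(0, M))
w_jma <- solve.QP(Dmat, dvec = rep(0, M), Amat = Amat, bvec = bvec, meq = 1)$solution
round(w_jma, 3)                            # typically several weights are (near) zero
```

The sparsity of `w_jma` illustrates the boundary solutions discussed on the next slides.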
---

## Optimization

Weights are selected by minimizing a cross-validation criterion that is a convex quadratic function:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} w' \tilde{E}' \tilde{E} w$$`

- where `\(\tilde{E}\)` is the matrix of leave-one-out prediction errors,

- and the feasible set is the simplex:

`$$\mathcal{W} = \{ w \in \mathbb{R}^M : w_m \ge 0,\; \sum_{m=1}^{M} w_m = 1 \}$$`

The simplex contains all convex combinations of the `\(M\)` candidate models.

---

## Optimization

- Minimizing a quadratic form subject to a simplex restriction typically produces **sparse** solutions, as in **Lasso** (unlike Ridge regression, which does not yield sparsity).

- Interior points of the simplex combine multiple models; such mixtures tend to raise the value of `\(w' \tilde{E}' \tilde{E} w\)`.

- This makes the optimizer prefer solutions located on the boundary of the simplex, where many weights are zero.

- Jackknife (CV) averaging therefore performs implicit selection and shrinkage.

---

## Three Models Reasoning

- When there are three candidate models, the weight vector is `\(w = (w_1, w_2, w_3)\)` with `\(w_1 \ge 0\)`, `\(w_2 \ge 0\)`, `\(w_3 \ge 0\)`, and `\(w_1 + w_2 + w_3 = 1\)`.

- The feasible region is a two-dimensional simplex: an equilateral triangle.

- The three vertices represent the choices `\((1,0,0)\)`, `\((0,1,0)\)`, `\((0,0,1)\)`, each corresponding to selecting a single model.

- Each edge represents a mixture of exactly two models.

- The interior of the triangle consists of all combinations with `\(w_1 > 0\)`, `\(w_2 > 0\)`, and `\(w_3 > 0\)`.

---

## Three Models Reasoning

- For the JMA criterion `\(w'\tilde{E}'\tilde{E}w\)`, interior points combine all three columns of `\(\tilde{E}\)`.

- Such combinations tend to increase the value of `\(w'\tilde{E}'\tilde{E}w\)`.

- As a result, the minimizer tends to lie on an edge (positive weights on two models) or at a vertex (positive weight on one model).

The same geometric intuition applies to higher-dimensional cases when `\(M > 3\)`.

- The solution is found numerically by quadratic programming, which is computationally simple and fast even when the number of models `\(M\)` is large.

[JMA code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/9_JK_averaging.R)

---

## Outline

- Model Averaging and Selection
- **.green[Double Selection Lasso]**
- Double Debiased Machine Learning

---

## Double-Selection Lasso

- Post-estimation inference is difficult with most machine learning estimators.

- For example, consider the **post-Lasso estimator** (OLS on the regressors selected by the Lasso).

- This is a post-model-selection estimator with poor coverage probability (incorrect standard errors).

- Belloni, Chernozhukov, and Hansen (2014) proposed an alternative that achieves better coverage rates.

- They show that the coverage deficiencies arise from, and increase with, the correlation between `\(D\)` (treatment) and `\(X\)` (high-dimensional covariates).

- Improved coverage accuracy can be achieved if `\(X\)` is included in the regression (below) whenever `\(X\)` and `\(D\)` are correlated.

---

## Double-Selection Lasso

Consider the partially linear model:

`$$Y = D\theta_0 + X'\beta_0 + \varepsilon,\qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

- When `\(X\)` is high-dimensional (or `\(p > n\)`), naive model selection can omit important confounders of `\(D\)` and `\(Y\)`, leading to biased estimates of `\(\theta_0\)`.

- **Lasso** is suitable in this setting because it exploits sparsity.

- (Approximate) sparsity means that although `\(X\)` may contain many variables, only a relatively small number have non-negligible effects on `\(Y\)` or `\(D\)`.

- Lasso recovers key confounding variables while discarding irrelevant ones.

---

## Double-Selection Lasso

The first-stage (treatment) equation is:

`$$D = X'\gamma + V, \qquad \mathbb{E}[V \mid X] = 0 \tag{1}$$`

The reduced-form outcome equation is:

`$$Y = X'\eta + U, \qquad \mathbb{E}[U \mid X] = 0 \tag{2}$$`

- Double selection applies model selection (Lasso) to (1) and (2) and takes the union of the selected regressors: `\(\tilde{X} = X_1 \cup X_2\)`.

- Then it estimates the structural equation:

`$$Y = D\theta + \tilde{X}'\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

by least squares with the selected controls, which allows conventional SE estimation (a sketch follows).
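---

## Double-Selection Lasso in R (Sketch)

A minimal sketch with `glmnet` on simulated data. The data-generating process and object names are illustrative assumptions, and the cross-validated penalty below is a simplification of the plug-in penalty used by Belloni, Chernozhukov, and Hansen (2014).

```r
# Double-selection Lasso sketch (simulated data; for illustration only).
library(glmnet)

set.seed(2)
n <- 500; p <- 100
X <- matrix(rnorm(n * p), n, p)
D <- X[, 1] - 0.5 * X[, 2] + rnorm(n)            # treatment depends on a few X's
Y <- 1 * D + X[, 1] + 0.25 * X[, 2] + rnorm(n)   # true theta = 1

# Step 1: Lasso of Y on X and Lasso of D on X.
b_y <- as.numeric(coef(cv.glmnet(X, Y), s = "lambda.min"))[-1]
b_d <- as.numeric(coef(cv.glmnet(X, D), s = "lambda.min"))[-1]

# Step 2: union of the regressors selected in either equation.
sel <- union(which(b_y != 0), which(b_d != 0))

# Step 3: OLS of Y on D and the selected controls; conventional SEs for theta.
fit <- lm(Y ~ D + X[, sel, drop = FALSE])
summary(fit)$coefficients["D", ]
```

In practice, the `hdm` package (Chernozhukov, Hansen, and Spindler) implements double selection with theoretically justified penalty levels.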
---

## DSL Summary

- A variable can be an important confounder even if its coefficient in `\(\beta_0\)` is small.

- Such a variable may strongly predict the treatment `\(D\)` but have only a weak relationship with `\(Y\)`.

- If Lasso is applied only to the outcome regression `\(Y \sim D + X\)`, the penalty may shrink its coefficient to zero, excluding it.

- This omission biases the estimate of `\(\theta_0\)`, because a relevant variable that predicts `\(D\)` has been excluded.

- Double selection fixes this by also selecting variables that are important in the treatment regression, even if they are not important in the outcome regression.

---

## Outline

- Model Averaging and Selection
- Double Selection Lasso
- **.green[Double Debiased Machine Learning]**

---

## Post-Regularization Lasso

A refinement of the double-selection Lasso is the **post-regularization Lasso** estimator of Chernozhukov, Hansen, and Spindler (2015).

- Called **partialling-out Lasso** in Stata.

Start with the structural equation:

`$$Y = D\theta + X'\beta + e$$`

- To eliminate the high-dimensional component `\(X'\beta\)`, take expectations conditional on `\(X\)` and subtract:

`$$Y - \mathbb{E}[Y \mid X] = (D - \mathbb{E}[D \mid X])\theta + e$$`

- This removes the regressor `\(X\)` from the equation.

---

## Post-Regularization Lasso

Using the linear projections

`$$\mathbb{E}[Y \mid X] = X'\eta, \qquad \mathbb{E}[D \mid X] = X'\gamma$$`

we obtain:

`$$Y - X'\eta = (D - X'\gamma)\theta + e$$`

- If `\(\eta\)` and `\(\gamma\)` were known, `\(\theta\)` could be estimated by OLS.

- In practice they are estimated, typically by Lasso or post-Lasso, as in the double-selection Lasso.

---

## Post-Regularization Lasso

Let `\(\hat{\gamma}\)` be the estimated coefficient from the regression of `\(D\)` on `\(X\)`, and define the residual:

`$$\hat{V}_i = D_i - X_i' \hat{\gamma}$$`

Let `\(\hat{\eta}\)` be the estimated coefficient from the regression of `\(Y\)` on `\(X\)`, and define the residual:

`$$\hat{U}_i = Y_i - X_i' \hat{\eta}$$`

Estimate `\(\theta\)` by OLS using the residualized equation:

`$$\hat{U}_i = \theta\, \hat{V}_i + \hat{e}_i$$`

- This is the "partialling-out" approach: we "purge" `\(Y\)` and `\(D\)` of all predictable components coming from `\(X\)` (a sketch follows).

- The resulting estimator has a conventional asymptotic normal distribution, valid under **approximate sparsity**.
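---

## Partialling-Out Lasso in R (Sketch)

A minimal sketch on simulated data, again with cross-validated penalties rather than the plug-in penalties of Chernozhukov, Hansen, and Spindler; the DGP and object names are illustrative assumptions.

```r
# Partialling-out (post-regularization) Lasso sketch (simulated data).
library(glmnet)

set.seed(3)
n <- 500; p <- 100
X <- matrix(rnorm(n * p), n, p)
D <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
Y <- 1 * D + X[, 1] + 0.25 * X[, 2] + rnorm(n)   # true theta = 1

# Lasso fits for E[D|X] and E[Y|X].
fit_d <- cv.glmnet(X, D)
fit_y <- cv.glmnet(X, Y)

# Residualize D and Y, then regress residual Y on residual D.
V <- as.numeric(D - predict(fit_d, newx = X, s = "lambda.min"))
U <- as.numeric(Y - predict(fit_y, newx = X, s = "lambda.min"))
summary(lm(U ~ V))$coefficients["V", ]           # theta_hat and its SE
```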
---

## Double/Debiased Machine Learning (DML)

The post-regularization estimator first estimates the coefficients `\(\gamma\)` and `\(\eta\)`, and then estimates the coefficient `\(\theta\)`.

The DML estimator takes this a step further by using **K-fold partitioning**.

1. Randomly split the sample into `\(K\)` folds `\(A_k,\quad k = 1,\ldots,K\)`, each with roughly `\(n/K\)` observations.

- Each fold serves as a small test set while the model is trained on the remaining `\(K-1\)` folds.

- This prevents overfitting and ensures orthogonality.

---

## Double/Debiased Machine Learning (DML)

2. Write the data matrices for each fold as `\((Y_k, D_k, X_k)\)`.

3. For `\(k = 1, \ldots, K\)`:

  (a) Use all observations except those in fold `\(k\)` to estimate the coefficients `\(\gamma\)` and `\(\eta\)` by Lasso or post-Lasso.

  (b) Write these leave-fold-out estimators as `\(\hat{\gamma}_{-k}\)` and `\(\hat{\eta}_{-k}\)`.

- These predictions are out-of-sample, ensuring debiasing.

  (c) Set `\(\hat{V}_k = D_k - X_k \hat{\gamma}_{-k}\)` and `\(\hat{U}_k = Y_k - X_k \hat{\eta}_{-k}\)`.

- The residuals isolate the variation in `\(Y\)` and `\(D\)` not explained by `\(X\)`.

- We use ML to partial out the effect of many controls `\(X\)`, but avoid overfitting by leaving fold `\(k\)` out during training.

---

## Double/Debiased Machine Learning (DML)

4. Stack `\(\hat{V}_k\)` and `\(\hat{U}_k\)` into `\(n \times 1\)` vectors `\(\hat{V}\)` and `\(\hat{U}\)`, and set:

`$$\hat{\theta}_{DML} = \left( \sum_{k=1}^K \hat{V}_k' \hat{V}_k \right)^{-1} \left( \sum_{k=1}^K \hat{V}_k' \hat{U}_k \right) = (\hat{V}' \hat{V})^{-1} (\hat{V}' \hat{U})$$`

We regress the "residual `\(Y\)`" on the "residual `\(D\)`".

---

## DML Summary

- Lasso and other ML methods estimate the nuisance functions `\(\mathbb{E}[Y \mid X]\)` and `\(\mathbb{E}[D \mid X]\)` (often denoted `\(g(X)\)` and `\(m(X)\)`) flexibly.

- Sample splitting avoids overfitting and ensures clean residuals.

- The final regression of `\(\hat{U}\)` on `\(\hat{V}\)` gives the causal effect `\(\theta\)`, after partialling out.

- This relies on the identifying assumption `\(\mathbb{E}[e \mid D, X] = 0\)`.

- DML produces valid inference even with high-dimensional `\(X\)`.

- A minimal cross-fitting sketch in R is included as an appendix slide after the end.

---

<style> .centered-word { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); } </style>

<div class="centered-word"> <h1>The End</h1> </div>
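---

## Appendix: DML Cross-Fitting in R (Sketch)

A minimal cross-fitting sketch on simulated data. The DGP, the choice `\(K = 5\)`, and the cross-validated penalties are illustrative assumptions, not the course code.

```r
# DML sketch: cross-fitted partialling-out with Lasso (simulated data).
library(glmnet)

set.seed(4)
n <- 500; p <- 100; K <- 5
X <- matrix(rnorm(n * p), n, p)
D <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
Y <- 1 * D + X[, 1] + 0.25 * X[, 2] + rnorm(n)   # true theta = 1

fold <- sample(rep(1:K, length.out = n))         # step 1: random K-fold split
V <- U <- numeric(n)

for (k in 1:K) {
  in_k <- fold == k
  # (a) nuisance Lassos fitted on the other K - 1 folds
  fit_d <- cv.glmnet(X[!in_k, ], D[!in_k])
  fit_y <- cv.glmnet(X[!in_k, ], Y[!in_k])
  # (b)-(c) out-of-fold residuals for fold k
  V[in_k] <- D[in_k] - predict(fit_d, newx = X[in_k, ], s = "lambda.min")
  U[in_k] <- Y[in_k] - predict(fit_y, newx = X[in_k, ], s = "lambda.min")
}

# Step 4: regress residual Y on residual D, theta_hat = (V'V)^{-1} V'U.
theta_dml <- sum(V * U) / sum(V^2)
summary(lm(U ~ V - 1))$coefficients              # same point estimate, with an SE
c(theta_dml = theta_dml)
```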