class: center, middle

# Averaging, Selection and Applications.

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía, Otoño 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Model Averaging and Selection]**
- Double Selection Lasso
- Double Debiased Machine Learning

---

## Ensembling

- Suppose you have a set of estimators (e.g., kernel regression, LLR, series regression, ridge regression, Lasso, regression trees, or random forests).

- Ensembling, or model averaging, selects a set of models and takes a weighted average of them.

- A popular method is **stacking** or, in econometrics, **Jackknife Model Averaging.**

- It selects the model averaging weights by minimizing a CV criterion, subject to the weights being non-negative and summing to one.

- **There is no theory for this from ML; the basis comes from econometrics.**

---

## Jackknife Estimator

The jackknife is a resampling method used to estimate the sampling variability of an estimator using *leave-one-out* samples.

Given a sample of size `\(n\)` and an estimator `\(\hat{\theta}\)`:

1. For each observation `\(i = 1,\ldots,n\)`, compute the *leave-one-out* estimate `\(\hat{\theta}_{(i)}\)` by removing observation `\(i\)`.

2. Compute the average leave-one-out estimate: `\(\bar{\theta} = \frac{1}{n} \sum_{i=1}^n \hat{\theta}_{(i)}\)`

3. The jackknife variance estimator is:

`$$\widehat{\mathrm{Var}}_{\text{jack}}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^n \left( \hat{\theta}_{(i)} - \bar{\theta} \right)^2$$`

The jackknife measures how sensitive the estimator is to the removal of each individual observation.

---

## Jackknife vs Bootstrap

**Jackknife:**

- Uses leave-one-out resampling.
- Produces exactly `\(n\)` replicates.
- Deterministic; no randomness.
- Performs poorly for non-smooth estimators (e.g., quantiles).

**Bootstrap:**

- Resamples the data with replacement to create samples of size `\(n\)`.
- Produces `\(B\)` replicates, where `\(B\)` is user-chosen.
- Monte Carlo method; involves randomness.
- Applicable to a wide range of estimators, including non-smooth or complex ones.

---

## Jackknife (CV) Model Averaging

Model averaging combines predictions from several candidate models. Jackknife (CV) averaging is one common procedure among several.

- Consider `\(M\)` linear models, each producing least squares estimates `\(\hat{\beta}_m\)`.

- For fixed weights `\(w = (w_1,\ldots,w_M)\)`, the averaged regression can be written as:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_m + \hat{e}_i(w)$$`

---

## Jackknife (CV) Model Averaging

In ML this method is called **stacking**. *Do not confuse it with stacking in the DiD context.*

- For leave-one-out cross-validation, observation `\(i\)` is removed and each model is refit, yielding `\(\hat{\beta}_{m,(-i)}\)`. Hence:

`$$Y_i = \sum_{m=1}^M w_m X_{mi}' \hat{\beta}_{m,(-i)} + \tilde{e}_i(w)$$`

- The leave-one-out prediction errors satisfy:

`$$\tilde{e}_i(w) = \sum_{m=1}^M w_m \tilde{e}_{mi}$$`

where `\(\tilde{e}_{mi}\)` are the model-specific leave-one-out errors.

---

## Jackknife (CV) Model Averaging

Let `\(\tilde{E}\)` be the `\(n \times M\)` matrix whose `\((i,m)\)` entry is `\(\tilde{e}_{mi}\)`.

- In matrix notation, the jackknife CV criterion is:

`$$\mathrm{CV}(w) = w' \tilde{E}' \tilde{E} w$$`

- The jackknife model averaging weights are obtained by solving:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} \mathrm{CV}(w)$$`

The resulting estimator is known as the Jackknife Model Averaging (JMA) estimator. A short code sketch follows on the next slide.
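---

## JMA: Code Sketch

A minimal sketch of the JMA weights in R (not the course's linked code): three nested OLS models on simulated data, leave-one-out residuals via the closed form `\(\tilde{e}_i = \hat{e}_i/(1-h_{ii})\)`, and the simplex-constrained quadratic program solved with the `quadprog` package. The data-generating process and all variable names are illustrative assumptions.

```r
library(quadprog)

set.seed(123)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

# Three nested candidate models
models <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))
M <- length(models)

# Leave-one-out residuals via the OLS shortcut e_i / (1 - h_ii)
E_tilde <- sapply(models, function(m) residuals(m) / (1 - hatvalues(m)))

# Minimize CV(w) = w' E'E w over the simplex {w >= 0, sum(w) = 1}
D <- crossprod(E_tilde) + diag(1e-8, M)  # tiny ridge keeps D positive definite
A <- cbind(rep(1, M), diag(M))           # first column: equality sum(w) = 1
b <- c(1, rep(0, M))
w_jma <- solve.QP(Dmat = 2 * D, dvec = rep(0, M),
                  Amat = A, bvec = b, meq = 1)$solution
round(w_jma, 3)                          # weights are typically sparse
```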
---

## Optimization

The weights are selected by minimizing a cross-validation criterion that is a convex quadratic function:

`$$\hat{w}_{\mathrm{JMA}} = \arg\min_{w \in \mathcal{W}} w' \tilde{E}' \tilde{E} w$$`

- Where `\(\tilde{E}\)` is the matrix of leave-one-out prediction errors.

- And the feasible set is the simplex:

`$$\mathcal{W} = \{ w \in \mathbb{R}^M : w_m \ge 0,\; \sum_{m=1}^{M} w_m = 1 \}$$`

The simplex contains all convex combinations of the `\(M\)` candidate models.

---

## Optimization

- Minimizing a quadratic form subject to a simplex restriction typically produces **sparse** solutions, much like the `\(\ell_1\)` constraint in **Lasso regression**.

- Interior points of the simplex combine multiple models; such mixtures tend to raise the value of `\(w' \tilde{E}' \tilde{E} w\)`.

- This makes the optimizer prefer solutions located on the boundary of the simplex, where many weights are zero.

- Jackknife (CV) averaging therefore performs implicit selection and shrinkage.

---

## Three Models Reasoning

- When there are three candidate models, the weight vector is `\(w = (w_1, w_2, w_3)\)` with `\(w_1 \ge 0\)`, `\(w_2 \ge 0\)`, `\(w_3 \ge 0\)`, and `\(w_1 + w_2 + w_3 = 1\)`.

- The feasible region is a two-dimensional simplex: an equilateral triangle.

- The three vertices, `\((1,0,0)\)`, `\((0,1,0)\)`, and `\((0,0,1)\)`, each correspond to selecting a single model.

- Each edge represents a mixture of exactly two models.

- The interior of the triangle consists of all combinations with `\(w_1 > 0\)`, `\(w_2 > 0\)`, and `\(w_3 > 0\)`.

---

## Three Models Reasoning

- For the JMA criterion `\(w'\tilde{E}'\tilde{E}w\)`, interior points combine all three columns of `\(\tilde{E}\)`.

- Such combinations tend to increase the value of `\(w'\tilde{E}'\tilde{E}w\)`.

- As a result, the minimizer tends to lie on an edge (positive weights on two models) or at a vertex (positive weight on one model).

- The same geometric intuition applies in higher dimensions, when `\(M > 3\)`.

- The solution is found numerically by quadratic programming, which is computationally simple and fast even when the number of models `\(M\)` is large.

[JMA code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/9_JK_averaging.R)

---

## Outline

- Model Averaging and Selection
- **.green[Double Selection Lasso]**
- Double Debiased Machine Learning

---

## Double Selection Lasso

Consider the partially linear model:

`$$Y = D\theta_0 + X'\beta_0 + \varepsilon,\qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

- When `\(X\)` is high-dimensional (or `\(p > n\)`), naive model selection can omit important confounders of `\(D\)` and `\(Y\)`, leading to biased estimates of `\(\theta_0\)`.

- Lasso is suitable in this setting because of sparsity.

- Sparsity means that although `\(X\)` may contain many variables, only a relatively small number have non-negligible effects on `\(Y\)` or `\(D\)`.

- Lasso recovers key confounding variables while discarding irrelevant ones.

---

## Double Selection Lasso

The first-stage (treatment) equation is:

`$$D = X'\gamma + V, \qquad \mathbb{E}[V \mid X] = 0 \tag{1}$$`

The reduced-form outcome equation is:

`$$Y = X'\eta + U, \qquad \mathbb{E}[U \mid X] = 0 \tag{2}$$`

- Double selection applies model selection to (1) and (2) and takes the union of the regressors selected in each equation: `\(\tilde{X} = X_1 \cup X_2\)`.

- It then estimates the structural equation:

`$$Y = D\theta + \tilde{X}'\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid D, X] = 0$$`

using least squares with the selected controls, which allows for standard error estimation (a code sketch follows on the next slide).
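---

## Double Selection: Code Sketch

A minimal sketch of post-double-selection in R with `glmnet`, on simulated data in which `\(X_2\)` predicts `\(D\)` strongly but `\(Y\)` only weakly. The design, the tuning choices, and the helper `lasso_select()` are illustrative assumptions; the `hdm` package provides ready-made routines for this estimator.

```r
library(glmnet)

set.seed(123)
n <- 500; p <- 100
X <- matrix(rnorm(n * p), n, p)
d <- 0.8 * X[, 1] + 0.8 * X[, 2] + rnorm(n)          # treatment equation
y <- 1 * d + 0.5 * X[, 1] + 0.1 * X[, 2] + rnorm(n)  # outcome; theta_0 = 1

# Indices of the controls kept by cross-validated Lasso
lasso_select <- function(x, outcome) {
  fit <- cv.glmnet(x, outcome)
  co  <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop intercept
  which(co != 0)
}

S <- union(lasso_select(X, d), lasso_select(X, y))   # union of both selections

# Post-double-selection OLS of y on d and the selected controls
post <- lm(y ~ d + X[, S, drop = FALSE])
summary(post)$coefficients["d", ]                    # theta_hat and its SE
```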
---

## Double Selection Lasso

- A variable can be an important confounder even if its coefficient in `\(\beta_0\)` is small.

- Such a variable may strongly predict the treatment `\(D\)` but have only a weak relationship with `\(Y\)`.

- If Lasso is applied only to the outcome regression `\(Y \sim D + X\)`, the penalty may shrink its coefficient to zero, excluding it.

- This omission produces omitted-variable bias in the estimate of `\(\theta_0\)`: the excluded variable is correlated with `\(D\)` and still affects `\(Y\)`.

- Double selection fixes this by also selecting variables that are important in the treatment regression, even if they appear unimportant in the outcome regression.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h1>The End</h1>
</div>