class: center, middle

# Non-Parametric Regression and Trees

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía Otoño 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Non-parametric Regression]**
- Series and Spline Regression
- Regression Trees
- Bagging and Random Forest

---

## Motivation

- Regression trees, or CART (Classification and Regression Trees).
- They extend familiar econometric procedures: e.g. a nonparametric regression using a large number of step functions.
- They are less interpretable, as we do not observe a `\(\beta\)` effect, but they are useful for prediction.
- Especially useful (for prediction) when there is a combination of continuous and discrete regressors.

---

## Non-parametric regression

Nonparametric estimation of the conditional expectation function:

$$
\mathbb{E}[Y \mid X = x] = m(x)
$$

An economic model restricts the form of `\(m(x)\)` to a parametric function.

If instead `\(m(x)\)` can take any nonlinear shape, it is **nonparametric**.

**We will discuss nonparametric kernel smoothing estimators of `\(m(x)\)`.**

---

## Non-parametric regression

Consider a single real-valued regressor X.

- The nonparametric regression model is:

$$
Y = m(X) + e, \qquad \mathbb{E}[e \mid X] = 0, \qquad \mathbb{E}[e^2 \mid X] = \sigma^2(X).
$$

- We have `\(n\)` observations for the pair `\((Y, X)\)`.
- The goal is to estimate `\(m(x)\)` either at a single point `\(x\)` or at a set of points.
- For most of the theory we focus on estimation at a single point `\(x\)`.
- In addition to the GM assumptions we assume that both `\(m(x)\)` and `\(f(x)\)` (the marginal density of X) are continuous in `\(x\)`.

---

## Binned Means Estimator

Fix the point `\(x\)` and consider estimation of `\(m(x)\)`:

- We take the mean of `\(Y\)` for random pairs `\((Y,X)\)` such that `\(X = x\)`.
- When `\(X\)` is discrete we estimate `\(m(x)\)` by taking the average of the sub-sample of observations `\(Y_i\)` for which `\(X_i = x\)`.
- When `\(X\)` is continuous the probability that `\(X\)` exactly equals `\(x\)` is zero. But if `\(m(x)\)` is continuous we can take the average of the observations for which `\(X_i\)` is close to `\(x\)`.
- Say `\(|X_i - x| \le h\)` for some small `\(h > 0\)`; `\(h\)` is the **bandwidth**.

---

## Binned Means Estimator

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#non_param_dicrete_cont.png" alt=" " width="140%" />
<p class="caption"> </p>
</div>

*Notes: For discrete X we compute `\(m(x) = \mathbb{E}[Y \mid X = x]\)` by taking the mean of `\(Y\)` for all observations where `\(X_i = x\)`. For continuous X we estimate `\(m(x)\)` using the average of `\(Y\)` values for observations where `\(X_i\)` is close to `\(x\)` (within a chosen bandwidth).*

---

## Binned Means Estimator

The **nonparametric estimator** of `\(m(x)\)` using a local (uniform) window of width `\(h\)` is

`$$\hat{m}(x) = \frac{\displaystyle \sum_{i=1}^{n} \mathbf{1}\{ |X_i - x| \le h \} Y_i }{\displaystyle \sum_{i=1}^{n} \mathbf{1}\{ |X_i - x| \le h \}}$$`

This is a **step function** estimator of the regression function `\(m(x)\)`.

---

## Binned Means Estimator

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#non_param_tech.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

- Considering the step function (dashed horizontal lines in (a)), the jumps at the edges of the bandwidths are an artifact of the discrete binning.
- We can evaluate the estimator on a fine grid of values `\(x\)` (a code sketch follows) or use local linear regression (LLR).
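---

## Binned Means Estimator: code sketch

A minimal R sketch of the uniform-window (binned means) estimator evaluated on a grid. The data are simulated and the bandwidth and grid are illustrative choices, not taken from the linked course code.

```r
# Binned means: local average of Y within a window of width h around x
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)      # true m(x) = sin(x)

h    <- 0.5                           # bandwidth (illustrative)
grid <- seq(0, 10, by = 0.1)          # evaluation points

m_hat <- sapply(grid, function(x) {
  in_window <- abs(X - x) <= h        # indicator 1{|X_i - x| <= h}
  mean(Y[in_window])                  # local average of Y
})

plot(X, Y, col = "grey70", pch = 16)
lines(grid, m_hat, lwd = 2)           # estimate of m(x) on the grid
```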
---

## Kernel regression

- The source of the discontinuity in the step function is that the weights are discontinuous indicator functions.
- i.e. the function `\(\mathbf{1}\{|X_i - x| \le h\}\)` assigns equal weight = 1 to all points in the window `\([x-h, x+h]\)` and 0 to all others.
- If instead the weights are continuous functions then `\(\hat{m}(x)\)` will also be continuous in `\(x\)`.
- e.g. the **Nadaraya–Watson** estimator uses a smooth kernel function `\(K(\cdot)\)`, which gives more weight to points closer to `\(x\)` and less weight to points farther away.
- A kernel function is a bounded probability density function, symmetric about zero, normalized so that:

`$$\int_{-\infty}^{\infty} u^2 K(u) \, du = 1.$$`

---

## Nadaraya–Watson Estimator

The Nadaraya–Watson method estimates `\(m(x) = \mathbb{E}[Y|X=x]\)` as a **weighted local average** of nearby `\(Y_i\)` values:

`$$\hat{m}(x) = \frac{ \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) Y_i }{ \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) }$$`

- `\(X_i - x\)` measures how close observation `\(i\)` is to the point of interest.
- `\(K(\cdot)\)`: kernel function (smooth, symmetric, positive); the ratio form makes `\(\hat{m}(x)\)` invariant to the scale of the kernel.
- `\(h\)`: bandwidth (controls smoothness). The larger `\(h\)`, the smoother the fit.
- **Kernels**: Rectangular, **Gaussian**, Epanechnikov, Triangular.
- Generalizes the discrete-window estimator by assigning **smooth weights** that decline with distance from `\(x\)`.

---

## Local linear regression

- The Nadaraya–Watson estimator fits a local constant, which works well if `\(m(x)\)` is roughly constant within the bandwidth `\(h\)`. **The local slope is ignored!**

The local linear estimator (LLR) fits a local line at each point `\(x\)`:

`$$\min_{a,b} \sum_{i=1}^n \big(Y_i - a - b(X_i - x)\big)^2 K\!\left(\frac{X_i - x}{h}\right).$$`

- If `\(b=0\)` we get the minimization problem of Nadaraya–Watson.

**Note**: in LLR if `\(h\to\infty\)` we simply get the OLS estimator, including all observations with equal weight.

- Hence LLR also provides an estimator of the slope at `\(x\)`.

---

## Local Polynomial Estimator

We approximate `\(m(X_i)\)` near `\(x\)` with a polynomial of degree `\(p\)`:

$$
m(X_i) \approx a_0 + a_1 (X_i - x) + a_2 (X_i - x)^2 + \cdots + a_p (X_i - x)^p.
$$

Estimate the coefficients `\((a_0, \ldots, a_p)\)` by:

`$$\min_{a_0,\ldots,a_p} \sum_{i=1}^n \Big[ Y_i - \sum_{j=0}^p a_j (X_i - x)^j \Big]^2 K\!\left(\frac{X_i - x}{h}\right).$$`

The fitted value at `\(x\)` is `\(\hat{m}_{LP}(x) = \hat{a}_0\)`.

- Higher `\(p\)` → lower bias, higher variance due to more parameters.
- Usually `\(p = 1\)` (local linear) or `\(p = 2\)` (local quadratic) in practice.

---

## The Bandwidth

The bandwidth `\(h\)` determines how *local* the nonparametric regression is.

| Bandwidth | Effect | Bias | Variance |
|------------|---------|------|-----------|
| Small `\(h\)` | Very local fit; follows noise | Low bias | High variance |
| Large `\(h\)` | Very smooth fit; averages widely | High bias | Low variance |

The bandwidth controls the **bias–variance tradeoff**.

---

## Rule of Thumb

A **rule-of-thumb bandwidth** gives a practical starting point based on theoretical approximations.

For a Gaussian kernel and roughly normal `\(X\)`:

$$
h_{\text{rot}} = 1.06 n^{-1/5} \, \hat{\sigma}_X
$$

- Larger samples: smaller `\(h\)` (more local detail).
- More dispersed `\(X\)`: larger `\(h\)` (wider smoothing window).
- It provides a reasonable default but does not adapt to the actual data pattern.
- More elaborate calculations are explained in Hansen (see p. 676). A code sketch follows.
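---

## Nadaraya–Watson: code sketch

A hand-coded sketch of the Nadaraya–Watson estimator with a Gaussian kernel and the rule-of-thumb bandwidth. The data are simulated for illustration; in practice one would typically use a package routine (e.g. `KernSmooth::locpoly` for local polynomial fits, or `np::npreg` with a data-driven bandwidth).

```r
# Nadaraya-Watson estimator with Gaussian kernel weights
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

h_rot <- 1.06 * sd(X) * n^(-1/5)      # rule-of-thumb bandwidth

nw <- function(x, X, Y, h) {
  w <- dnorm((X - x) / h)             # kernel weight for each observation
  sum(w * Y) / sum(w)                 # weighted local average
}

grid  <- seq(0, 10, by = 0.1)
m_hat <- sapply(grid, nw, X = X, Y = Y, h = h_rot)

plot(X, Y, col = "grey70", pch = 16)
lines(grid, m_hat, lwd = 2)           # smooth estimate of m(x)
```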
---

## Cross-Validation (Data-Driven)

Cross-validation (CV) chooses `\(h\)` by minimizing out-of-sample prediction error.

Leave-one-out CV criterion:

`$$CV(h) = \frac{1}{n} \sum_{i=1}^n \left( Y_i - \hat{m}_{-i}(X_i; h) \right)^2$$`

- Where `\(\hat{m}_{-i}(X_i; h)\)` is estimated without observation `\(i\)`.

As described before:

- Pick a grid of `\(h\)` values.
- Compute `\(CV(h)\)` for each, leaving out one observation (or fold) at a time.
- Choose `\(\hat{h}\)` that minimizes `\(CV(h)\)`.

---

## Mincer non-parametric regression

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#LLR_example.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Conditional Variance

- In this course we do not study the conditional or asymptotic variance in detail, but it is worth knowing that it also takes a weighted "sandwich" form:

`$$\widehat{V}_{\hat{\beta}}(x) = (Z'KZ)^{-1} \left( \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)^2 Z_i(x) Z_i(x)' \hat{e}_i^2 \right) (Z'KZ)^{-1}.$$`

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#conditional_variance.png" alt=" " width="75%" />
<p class="caption"> </p>
</div>

[R code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/5_nonparam_regression.R)

---

## Outline

- Non-parametric Regression
- **.green[Series and Spline Regression]**
- Regression Trees
- Bagging and Random Forest

---

## Series Regression

An alternative class of nonparametric regression estimators.

- A series regression model is a sequence `\(K = 1, 2, \ldots\)` of approximating models `\(m_K(x)\)` with `\(K\)` parameters.
- Approximating models could be polynomials or *splines*.
- Linear series regression models take the form

$$
Y = X_K' \beta_K + e_K
$$

- where `\(X_K = X_K(X)\)` is a vector of regressors obtained by making transformations of `\(X\)` and `\(\beta_K\)` is a coefficient vector.

---

## Series Regression

Instead of assuming a simple linear model

$$
m(x) = \beta_0 + \beta_1 X,
$$

we represent `\(m(x)\)` as a **series expansion** that allows nonlinear shapes:

$$
m_K(x) = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_p X^p.
$$

Here:

- `\(K = p + 1\)` is the number of basis functions (constant + `\(p\)` polynomial terms).
- Increasing `\(K\)` makes the model more flexible.
- The series regression approximates `\(m(x)\)` by fitting a combination of these basis functions, weighted by the coefficients `\(\beta_j\)`.

---

## Series Regression

We build a **design matrix** `\(X_K\)` that contains transformations of `\(X\)`:

`$$X_K = \begin{bmatrix} 1 & X_1 & X_1^2 & \dots & X_1^p \\ 1 & X_2 & X_2^2 & \dots & X_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n & X_n^2 & \dots & X_n^p \end{bmatrix}.$$`

Then we estimate the coefficients by OLS:

$$
\hat{\beta}_K = (X_K' X_K)^{-1} X_K' Y.
$$

- That is, we run a single regression of `\(Y\)` on all polynomial terms of `\(X\)`.
- Each column in `\(X_K\)` represents one basis function (e.g., `\(1\)`, `\(X\)`, `\(X^2\)`).
- The estimator of `\(m(x)\)` is `\(\hat{m}_K(x) = X_K(x)'\hat{\beta}_K\)`.

*It is called series regression because it is obtained by adding the series of terms `\(X, X^2, \ldots, X^p\)`.*
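---

## Series Regression: code sketch

A sketch of polynomial series regression in R, choosing the degree `\(p\)` over a small grid by leave-one-out cross-validation. The simulated data, the grid of degrees, and the LOO-CV shortcut are illustrative choices.

```r
# Polynomial series regression: OLS on (1, X, X^2, ..., X^p)
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

# Leave-one-out CV for OLS via the hat-matrix shortcut
loocv <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)

p_grid <- 1:8
cv     <- sapply(p_grid, function(p) loocv(lm(Y ~ poly(X, p, raw = TRUE))))
p_hat  <- p_grid[which.min(cv)]        # degree with lowest LOO-CV error

fit   <- lm(Y ~ poly(X, p_hat, raw = TRUE))
grid  <- data.frame(X = seq(0, 10, by = 0.1))
m_hat <- predict(fit, newdata = grid)  # estimated m(x) on the grid
```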
---

## Spline Regression

A **spline** is a piecewise polynomial.

- The flexibility of the model is determined by the number of polynomial segments.
- The join points between segments are called **knots**.
- To impose smoothness and parsimony it is common to constrain the spline function to have continuous derivatives up to one less than the order of the spline.
- Typically the order of the polynomial is pre-selected to be linear, quadratic, or cubic.

---

## Spline Regression

A **linear spline** with one knot `\(\tau\)` is

$$
m_K(x)=\beta_0+\beta_1 x+\beta_2 (x-\tau)\,\mathbf{1}\{x\ge \tau\}.
$$

- For `\(x\le \tau\)` the function `\(m_K(x)=\beta_0+\beta_1 x\)` is linear with slope `\(\beta_1\)`;
- For `\(x\ge \tau\)` the function `\(m_K(x)\)` is linear with slope `\(\beta_1+\beta_2\)`;
- Note that `\(\beta_2\)` is the change in the slope at `\(\tau\)`.

A **linear spline** with two knots `\(\tau_1<\tau_2\)` is

$$
m_K(x)=\beta_0+\beta_1 x+\beta_2 (x-\tau_1)\,\mathbf{1}\{x\ge \tau_1\} +\beta_3 (x-\tau_2)\,\mathbf{1}\{x\ge \tau_2\}.
$$

---

## Spline Regression

A **quadratic spline** with one knot is

$$
m_K(x)=\beta_0+\beta_1 x+\beta_2 x^2+\beta_3 (x-\tau)^2\,\mathbf{1}\{x\ge \tau\}.
$$

- Observe that for `\(x\le \tau\)` the function is the quadratic `\(\beta_0+\beta_1 x+\beta_2 x^2\)` with second derivative `\(m_K''(\tau)=2\beta_2\)`;
- For `\(x\ge \tau\)` the second derivative is `\(m_K''(\tau)=2(\beta_2+\beta_3)\)`; so `\(2\beta_3\)` is the change in the second derivative at `\(\tau\)`.

From either side, the first derivative at `\(x=\tau\)` is `\(m_K'(\tau)=\beta_1+2\beta_2 \tau\)`.

**This ensures the function and its first derivative are continuous at the knot.**

---

## Spline Regression

In general, a `\(p^{\text{th}}\)`-order spline with `\(N\)` knots `\(\tau_1<\tau_2<\cdots<\tau_N\)` is

`$$m_K(x)=\sum_{j=0}^{p}\beta_j x^j+\sum_{k=1}^{N}\beta_{p+k}(x-\tau_k)^p\,\mathbf{1}\{x\ge \tau_k\}.$$`

---

## Splines

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#spline_linear.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Splines

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#spline_cubic.png" alt=" " width="70%" />
<p class="caption"> </p>
</div>

Cubic spline with two knots: smoother and continuously differentiable across segments.

---

## Splines

- In practice a spline will depend critically on the choice of the knots `\(\tau_k\)`.
- When `\(X\)` is bounded with an approximately uniform distribution it is common to space the knots evenly so all segments have the same length.
- When the distribution of `\(X\)` is not uniform, an alternative is to set the knots at the `\(j/(N+1)\)` quantiles of `\(X\)`, for `\(j = 1, \ldots, N\)`.
- A third alternative is to set the knots at the points where `\(m(x)\)` has the greatest change in curvature.

---

## Regression Spline Example

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#spline_example.png" alt=" " width="80%" />
<p class="caption"> </p>
</div>
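---

## Spline Regression: code sketch

A sketch that builds the linear-spline basis directly from the formulas above with two illustrative knots, plus a cubic spline with the same knots via `splines::bs()`. The data and knot locations are simulated/arbitrary choices for illustration.

```r
# Linear spline: m(x) = b0 + b1*x + b2*(x - tau1)_+ + b3*(x - tau2)_+
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

tau1 <- 3; tau2 <- 7                   # knots (illustrative)

fit_lin <- lm(Y ~ X + pmax(X - tau1, 0) + pmax(X - tau2, 0))

# Cubic spline with the same knots (continuous up to the 2nd derivative)
library(splines)
fit_cub <- lm(Y ~ bs(X, knots = c(tau1, tau2), degree = 3))

grid <- data.frame(X = seq(0, 10, by = 0.1))
plot(X, Y, col = "grey70", pch = 16)
lines(grid$X, predict(fit_lin, newdata = grid), lwd = 2, lty = 2)
lines(grid$X, predict(fit_cub, newdata = grid), lwd = 2)
```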
---

## Outline

- Non-parametric Regression
- Series and Spline Regression
- **.green[Regression Trees]**
- Bagging and Random Forest

---

## Regression Trees

CARTs (Classification and Regression Trees):

- A regression tree is a nonparametric regression using a large number of step functions.
- With a large number of split points a step function can approximate any function.

Useful when there is a combination of continuous and discrete regressors:

- Kernel and series methods are challenging to implement.
- Regression trees can be thought of as a `\(0^{th}\)`-order spline with free knots.

---

## Regression Trees

Estimate `\(m(x) = E[Y | X = x]\)` for scalar Y and vector X.

The elements of X can be continuous, binary, or ordinal.

- If a regressor is categorical it should first be transformed into a set of binary variables.

Language based on the metaphor of a living tree:

- A subsample is a **branch**.
- Terminal branches are **nodes** or **leaves**.
- Increasing the number of branches is **growing** a tree.
- Decreasing the number of branches is **pruning** a tree.

---

## Regression Trees

- The basic algorithm starts with a single branch.
- Grow a large tree by sequentially splitting the branches. Then **prune back** using an information criterion.
- The goal of the growth stage is to develop a rich, data-determined tree with small estimation bias.
- Pruning back has the goal of reducing **over-parameterization** and estimation variance (as in backward stepwise regression).

---

## Regression Trees

At the beginning of the regression tree algorithm:

- The entire dataset is treated as one branch (no splits).
- The predicted value is the **overall mean**: `$${m}(x) = \mu$$`

Then the regression tree algorithm makes use of a **regression sample split algorithm**.

- The method uses non-linear least squares (NLLS) to estimate the model:

$$
Y = \mu_1 \mathbf{1}\lbrace{X_d \le \gamma \rbrace} + \mu_2 \mathbf{1}\lbrace{X_d > \gamma\rbrace} + e
$$

- Linear in `\(\mu_1\)`, `\(\mu_2\)`, but **nonlinear in the cutoff `\(\gamma\)`**.

---

## Regression Trees

- We estimate `\(\mu_j\)` and the split `\(\gamma\)` to **minimize `\(SSR(\gamma)\)`**:

`$$SSR(\gamma) =\sum_{i: X_{di} \le \gamma} (Y_i - \bar{Y}_L)^2 \;+\; \sum_{i: X_{di} > \gamma} (Y_i - \bar{Y}_R)^2$$`

---

## Grow a tree

Fix the minimum size of a leaf (final node) at `\(N_{min}\)`:

**(a)** Apply the regression sample split algorithm to split each branch into two sub-branches, each with size at least `\(N_{min}\)`.

**(b)** Among the candidate splits, choose the branch that minimizes:

`$$\text{SSR}(\gamma) = \sum (Y_i - \hat{m}(X_{d}; \gamma))^2$$`

**(c)** Split that branch into two sub-branches.

- Repeat **(a)–(c)** until no branch can be further split (due to `\(N_{min}\)`).

---

## Grow a tree

- The terminal (unsplit) branches are the leaves.
- Before that point, each additional branch improves predictive capacity.
- e.g. Two branches: step function with one cutoff. Many branches: piecewise-constant approximation (a “grown” tree).
- Each split **reduces (in-sample) prediction error** and increases flexibility.

---

## Prune a tree

Define the *Mallows-type* information criterion (which penalizes model complexity via the number of leaves):

$$
C = \sum_{i=1}^{n} \hat{e}_i^2 + \lambda N
$$

- where `\(N\)` is the number of leaves and `\(\lambda\)` is a penalty parameter.
- Compute the criterion `\(C\)` for the current tree.
- **Use backward stepwise regression to reduce the number of leaves:**

**(a)** Identify the leaf whose removal most decreases `\(C\)`.

**(b)** Prune (remove) this leaf.

**(c)** If there is no leaf whose removal decreases `\(C\)`, then stop pruning.

**(d)** Otherwise, repeat **(a)–(c)**.

*The penalty parameter `\(\lambda\)` is selected by K-fold cross-validation.*
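---

## Regression sample split: code sketch

A sketch of the single-split step: search over cutoffs `\(\gamma\)` for one regressor and keep the split that minimizes `\(SSR(\gamma)\)`. The simulated data, the candidate-cutoff grid, and `\(N_{min}\)` are illustrative; growing and pruning a full tree (with the penalty chosen by cross-validation) is what packages such as `rpart` automate.

```r
# Best single split of X at cutoff gamma, minimizing SSR(gamma)
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

n_min   <- 20                                        # minimum leaf size
cutoffs <- sort(unique(X))
cutoffs <- cutoffs[n_min:(length(cutoffs) - n_min)]  # keep >= n_min obs per side

ssr <- sapply(cutoffs, function(g) {
  left  <- Y[X <= g]
  right <- Y[X >  g]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})

gamma_hat <- cutoffs[which.min(ssr)]                 # estimated split point
c(gamma    = gamma_hat,
  mu_left  = mean(Y[X <= gamma_hat]),                # leaf means
  mu_right = mean(Y[X >  gamma_hat]))
```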
---

## Regression Trees

Finding the best split, which minimizes the deviations from the mean in each leaf.

- This is a tree of depth 1:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#tree_example.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Regression Trees

This is a tree with three covariates:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#tree_example_mult.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Regression Trees: Disadvantages

- The results are difficult to interpret as there are no regression coefficients.
- The fitted regression `\(\hat{m}(x)\)` is a discrete step function, a crude approximation when `\(m(x)\)` is continuous and smooth.
- A regression tree may require a high number of leaves, which can result in a non-parsimonious model with high estimation variance.

---

## Tree vs. Linear Model

Depending on the problem at hand:

- **Linear regression:**

$$
m(X) = \beta_0 + \sum_{j=1}^{K} X_j \beta_j
$$

- **Regression tree:**

$$
m(X) = \sum_{j=1}^{J} c_j \mathbf{1}(x \in R_j)
$$

- If the relationship between `\(y\)` and `\((x_1, \ldots, x_K)\)` is **linear**, a linear model should perform better.
- If the relationship between `\(y\)` and `\((x_1, \ldots, x_K)\)` is **highly nonlinear and complex**, a tree model should perform better.

---

## A Classification Tree

- This is a tree with three covariates:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#tree_clasification.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

[Regression Tree Code](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/6_regression_tree.R)

---

## Outline

- Non-parametric Regression
- Series and Spline Regression
- Regression Trees
- **.green[Bagging and Random Forest]**

---

## Bagging

Bagging (bootstrap aggregating) is a method to reduce the variance of a predictor.

**(1)** Generate a large number **B** of bootstrap samples.

**(2)** Estimate your regression model on each bootstrap sample, and take the average of the estimates.

The mean of the bootstrap estimates is the **bagging estimator** of the conditional mean.

---

## Bagging

Let `\(m(x) = \mathbb{E}[Y \mid X = x]\)` be the conditional mean and `\(\hat{m}(x)\)` an estimator such as a regression tree.

Let `\(\hat{m}_b^{*}(x)\)` be the same estimator constructed on a bootstrap sample.

The **bagging** estimator of `\(m(x)\)` is:

`$$\hat{m}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{m}_b^{*}(x).$$`

- As `\(B\)` increases this converges in bootstrap probability to the ideal bagging estimator `\(\mathbb{E}^{*}\!\left[\hat{m}^{*}(x)\right]\)`.

---

## Bagging

- Useful when the conditional mean estimator has low bias but high variance.
- Useful for hard threshold estimators such as regression trees.
- Each tree uses around 2/3 of the observations.
- The remaining 1/3 are the **out-of-bag (OOB)** observations, used for out-of-sample prediction and for estimating the mean squared error (MSE).
- Bagging may exaggerate bias.

---

## Bagging

- Each regression tree might split the data differently depending on which points got resampled.
- Some trees will split earlier; others will be smoother.
- When you average them, the random idiosyncrasies cancel out, leaving a smoother and more stable prediction function.

[Gimme the code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/7_bagging.R)

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h1>The End</h1>
</div>