class: center, middle

# Non-Parametric Regression and Trees

## Econometría Aplicada y Ciencia de Datos

#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía Otoño 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Outline

- **.green[Non-parametric Regression]**
- Series and Spline Regression
- Regression Trees
- Bagging and Random Forest

---

## Motivation

- Regression trees, or CART (Classification and Regression Trees).
- They extend familiar econometric procedures: e.g. a nonparametric regression using a large number of step functions.
- They are less interpretable, as we do not observe a `\(\beta\)` effect, but they are useful for prediction.
- Especially useful (for prediction) when there is a combination of continuous and discrete regressors.

---

## Non-parametric regression

Nonparametric estimation of the conditional expectation function:

$$
\mathbb{E}[Y \mid X = x] = m(x)
$$

An economic model restricts the form of `\(m(x)\)` to a parametric function.

If instead `\(m(x)\)` can take any nonlinear shape, it is **nonparametric**.

**We will discuss nonparametric kernel smoothing estimators of `\(m(x)\)`.**

---

## Non-parametric regression

Consider a single real-valued regressor X.

- The nonparametric regression model is:

$$
Y = m(X) + e, \qquad \mathbb{E}[e \mid X] = 0, \qquad \mathbb{E}[e^2 \mid X] = \sigma^2(X).
$$

- We have `\(n\)` observations for the pair `\((Y, X)\)`.
- The goal is to estimate `\(m(x)\)` either at a single point `\(x\)` or at a set of points.
- For most of the theory we focus on estimation at a single point `\(x\)`.
- In addition to the GM assumptions we assume that both `\(m(x)\)` and `\(f(x)\)` (the marginal density of X) are continuous in `\(x\)`.

---

## Binned Means Estimator

Fix the point `\(x\)` and consider estimation of `\(m(x)\)`:

- We take the mean of `\(Y\)` for random pairs `\((Y,X)\)` such that `\(X = x\)`.
- When `\(X\)` is discrete we estimate `\(m(x)\)` by taking the average of the sub-sample of observations `\(Y_i\)` for which `\(X_i = x\)`.
- When `\(X\)` is continuous the probability that `\(X\)` exactly equals `\(x\)` is zero. But if `\(m(x)\)` is continuous we can take the average of the observations for which `\(X_i\)` is close to `\(x\)`.
- Say `\(|X_i - x| \le h\)` for some small `\(h > 0\)`; `\(h\)` is the **bandwidth**.

---

## Binned Means Estimator

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#non_param_dicrete_cont.png" alt=" " width="140%" />
<p class="caption"> </p>
</div>

*Notes: For discrete X we compute `\(m(x) = \mathbb{E}[Y \mid X = x]\)` by taking the mean of `\(Y\)` for all observations where `\(X_i = x\)`. For continuous X we estimate `\(m(x)\)` using the average of `\(Y\)` values for observations where `\(X_i\)` is close to `\(x\)` (within a chosen bandwidth).*

---

## Binned Means Estimator

The **nonparametric estimator** of `\(m(x)\)` using a local (uniform) window of width `\(h\)` is

`$$\hat{m}(x) = \frac{\displaystyle \sum_{i=1}^{n} \mathbf{1}\{ |X_i - x| \le h \} Y_i }{\displaystyle \sum_{i=1}^{n} \mathbf{1}\{ |X_i - x| \le h \}}$$`

This is a **step function** estimator of the regression function `\(m(x)\)`.

---

## Binned Means Estimator

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#non_param_tech.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

- Considering the step function (dashed horizontal lines in (a)), the jumps at the edges of the bandwidths are an artifact of the discrete binning.
- We can evaluate the estimator on a fine grid of values `\(x\)` (a code sketch follows) or use local linear regression (LLR).
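---

## Binned Means Estimator: code sketch

A minimal R sketch of the uniform-window (binned means) estimator evaluated on a grid. The data are simulated and the bandwidth and grid are illustrative choices, not taken from the linked course code.

```r
# Binned means: local average of Y within a window of width h around x
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)      # true m(x) = sin(x)

h    <- 0.5                           # bandwidth (illustrative)
grid <- seq(0, 10, by = 0.1)          # evaluation points

m_hat <- sapply(grid, function(x) {
  in_window <- abs(X - x) <= h        # indicator 1{|X_i - x| <= h}
  mean(Y[in_window])                  # local average of Y
})

plot(X, Y, col = "grey70", pch = 16)
lines(grid, m_hat, lwd = 2)           # estimate of m(x) on the grid
```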
---

## Kernel regression

- The source of the discontinuity in the step function is that the weights are discontinuous indicator functions.
- i.e. the function `\(\mathbf{1}\{|X_i - x| \le h\}\)` assigns equal weight = 1 to all points in the window `\([x-h, x+h]\)` and 0 to all others.
- If instead the weights are continuous functions then `\(\hat{m}(x)\)` will also be continuous in `\(x\)`.
- e.g. the **Nadaraya–Watson** estimator uses a smooth kernel function `\(K(\cdot)\)`, which gives more weight to points closer to `\(x\)` and less weight to points farther away.
- A kernel function is a bounded probability density function, symmetric about zero, normalized so that:

`$$\int_{-\infty}^{\infty} u^2 K(u) \, du = 1.$$`

---

## Nadaraya–Watson Estimator

The Nadaraya–Watson method estimates `\(m(x) = \mathbb{E}[Y|X=x]\)` as a **weighted local average** of nearby `\(Y_i\)` values:

`$$\hat{m}(x) = \frac{ \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) Y_i }{ \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) }$$`

- `\(X_i - x\)` measures how close observation `\(i\)` is to the point of interest.
- `\(K(\cdot)\)`: kernel function (smooth, symmetric, positive); the ratio form makes `\(\hat{m}(x)\)` invariant to the scale of the kernel.
- `\(h\)`: bandwidth (controls smoothness). The larger `\(h\)`, the smoother the fit.
- **Kernels**: Rectangular, **Gaussian**, Epanechnikov, Triangular.
- Generalizes the discrete-window estimator by assigning **smooth weights** that decline with distance from `\(x\)`.

---

## Local linear regression

- The Nadaraya–Watson estimator fits a local constant, which works well if `\(m(x)\)` is roughly constant within the bandwidth `\(h\)`. **The local slope is ignored!**

The local linear estimator (LLR) fits a local line at each point `\(x\)`:

`$$\min_{a,b} \sum_{i=1}^n \big(Y_i - a - b(X_i - x)\big)^2 K\!\left(\frac{X_i - x}{h}\right).$$`

- If `\(b=0\)` we get the minimization problem of Nadaraya–Watson.

**Note**: in LLR if `\(h\to\infty\)` we simply get the OLS estimator, including all observations with equal weight.

- Hence LLR also provides an estimator of the slope at `\(x\)`.

---

## Local Polynomial Estimator

We approximate `\(m(X_i)\)` near `\(x\)` with a polynomial of degree `\(p\)`:

$$
m(X_i) \approx a_0 + a_1 (X_i - x) + a_2 (X_i - x)^2 + \cdots + a_p (X_i - x)^p.
$$

Estimate the coefficients `\((a_0, \ldots, a_p)\)` by:

`$$\min_{a_0,\ldots,a_p} \sum_{i=1}^n \Big[ Y_i - \sum_{j=0}^p a_j (X_i - x)^j \Big]^2 K\!\left(\frac{X_i - x}{h}\right).$$`

The fitted value at `\(x\)` is `\(\hat{m}_{LP}(x) = \hat{a}_0\)`.

- Higher `\(p\)` → lower bias, higher variance due to more parameters.
- Usually `\(p = 1\)` (local linear) or `\(p = 2\)` (local quadratic) in practice.

---

## The Bandwidth

The bandwidth `\(h\)` determines how *local* the nonparametric regression is.

| Bandwidth | Effect | Bias | Variance |
|------------|---------|------|-----------|
| Small `\(h\)` | Very local fit; follows noise | Low bias | High variance |
| Large `\(h\)` | Very smooth fit; averages widely | High bias | Low variance |

The bandwidth controls the **bias–variance tradeoff**.

---

## Rule of Thumb

A **rule-of-thumb bandwidth** gives a practical starting point based on theoretical approximations.

For a Gaussian kernel and roughly normal `\(X\)`:

$$
h_{\text{rot}} = 1.06 n^{-1/5} \, \hat{\sigma}_X
$$

- Larger samples: smaller `\(h\)` (more local detail).
- More dispersed `\(X\)`: larger `\(h\)` (wider smoothing window).
- It provides a reasonable default but does not adapt to the actual data pattern.
- More elaborate calculations are explained in Hansen (see p. 676). A code sketch follows.
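---

## Nadaraya–Watson: code sketch

A hand-coded sketch of the Nadaraya–Watson estimator with a Gaussian kernel and the rule-of-thumb bandwidth. The data are simulated for illustration; in practice one would typically use a package routine (e.g. `KernSmooth::locpoly` for local polynomial fits, or `np::npreg` with a data-driven bandwidth).

```r
# Nadaraya-Watson estimator with Gaussian kernel weights
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

h_rot <- 1.06 * sd(X) * n^(-1/5)      # rule-of-thumb bandwidth

nw <- function(x, X, Y, h) {
  w <- dnorm((X - x) / h)             # kernel weight for each observation
  sum(w * Y) / sum(w)                 # weighted local average
}

grid  <- seq(0, 10, by = 0.1)
m_hat <- sapply(grid, nw, X = X, Y = Y, h = h_rot)

plot(X, Y, col = "grey70", pch = 16)
lines(grid, m_hat, lwd = 2)           # smooth estimate of m(x)
```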
---

## Cross-Validation (Data-Driven)

Cross-validation (CV) chooses `\(h\)` by minimizing out-of-sample prediction error.

Leave-one-out CV criterion:

`$$CV(h) = \frac{1}{n} \sum_{i=1}^n \left( Y_i - \hat{m}_{-i}(X_i; h) \right)^2$$`

- Where `\(\hat{m}_{-i}(X_i; h)\)` is estimated without observation `\(i\)`.

As described before:

- Pick a grid of `\(h\)` values.
- Compute `\(CV(h)\)` for each, leaving out one observation (or fold) at a time.
- Choose `\(\hat{h}\)` that minimizes `\(CV(h)\)`.

---

## Mincer non-parametric regression

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#LLR_example.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Conditional Variance

- In this course we do not study the conditional or asymptotic variance in detail, but it is worth knowing that it also takes a weighted "sandwich" form:

`$$\widehat{V}_{\hat{\beta}}(x) = (Z'KZ)^{-1} \left( \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)^2 Z_i(x) Z_i(x)' \hat{e}_i^2 \right) (Z'KZ)^{-1}.$$`

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#conditional_variance.png" alt=" " width="75%" />
<p class="caption"> </p>
</div>

[R code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/5_nonparam_regression.R)

---

## Outline

- Non-parametric Regression
- **.green[Series and Spline Regression]**
- Regression Trees
- Bagging and Random Forest

---

## Series Regression

An alternative class of nonparametric regression estimators.

- A series regression model is a sequence `\(K = 1, 2, \ldots\)` of approximating models `\(m_K(x)\)` with `\(K\)` parameters.
- Approximating models could be polynomials or *splines*.
- Linear series regression models take the form

$$
Y = X_K' \beta_K + e_K
$$

- where `\(X_K = X_K(X)\)` is a vector of regressors obtained by making transformations of `\(X\)` and `\(\beta_K\)` is a coefficient vector.

---

## Series Regression

Instead of assuming a simple linear model

$$
m(x) = \beta_0 + \beta_1 X,
$$

we represent `\(m(x)\)` as a **series expansion** that allows nonlinear shapes:

$$
m_K(x) = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_p X^p.
$$

Here:

- `\(K = p + 1\)` is the number of basis functions (constant + `\(p\)` polynomial terms).
- Increasing `\(K\)` makes the model more flexible.
- The series regression approximates `\(m(x)\)` by fitting a combination of these basis functions, weighted by the coefficients `\(\beta_j\)`.

---

## Series Regression

We build a **design matrix** `\(X_K\)` that contains transformations of `\(X\)`:

`$$X_K = \begin{bmatrix} 1 & X_1 & X_1^2 & \dots & X_1^p \\ 1 & X_2 & X_2^2 & \dots & X_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_n & X_n^2 & \dots & X_n^p \end{bmatrix}.$$`

Then we estimate the coefficients by OLS:

$$
\hat{\beta}_K = (X_K' X_K)^{-1} X_K' Y.
$$

- That is, we run a single regression of `\(Y\)` on all polynomial terms of `\(X\)`.
- Each column in `\(X_K\)` represents one basis function (e.g., `\(1\)`, `\(X\)`, `\(X^2\)`).
- The estimator of `\(m(x)\)` is `\(\hat{m}_K(x) = X_K(x)'\hat{\beta}_K\)`.

*It is called series regression because it is obtained by adding the series of terms `\(X, X^2, \ldots, X^p\)`.*
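---

## Series Regression: code sketch

A sketch of polynomial series regression in R, choosing the degree `\(p\)` over a small grid by leave-one-out cross-validation. The simulated data, the grid of degrees, and the LOO-CV shortcut are illustrative choices.

```r
# Polynomial series regression: OLS on (1, X, X^2, ..., X^p)
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

# Leave-one-out CV for OLS via the hat-matrix shortcut
loocv <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)

p_grid <- 1:8
cv     <- sapply(p_grid, function(p) loocv(lm(Y ~ poly(X, p, raw = TRUE))))
p_hat  <- p_grid[which.min(cv)]        # degree with lowest LOO-CV error

fit   <- lm(Y ~ poly(X, p_hat, raw = TRUE))
grid  <- data.frame(X = seq(0, 10, by = 0.1))
m_hat <- predict(fit, newdata = grid)  # estimated m(x) on the grid
```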
---

## Spline Regression

A **spline** is a piecewise polynomial.

- The flexibility of the model is determined by the number of polynomial segments.
- The join points between segments are called **knots**.
- To impose smoothness and parsimony it is common to constrain the spline function to have continuous derivatives up to one less than the order of the spline.
- Typically the order of the polynomial is pre-selected to be linear, quadratic, or cubic.

---

## Spline Regression

A **linear spline** with one knot `\(\tau\)` is

$$
m_K(x)=\beta_0+\beta_1 x+\beta_2 (x-\tau)\,\mathbf{1}\{x\ge \tau\}.
$$

- For `\(x\le \tau\)` the function `\(m_K(x)=\beta_0+\beta_1 x\)` is linear with slope `\(\beta_1\)`;
- For `\(x\ge \tau\)` the function `\(m_K(x)\)` is linear with slope `\(\beta_1+\beta_2\)`;
- Note that `\(\beta_2\)` is the change in the slope at `\(\tau\)`.

A **linear spline** with two knots `\(\tau_1<\tau_2\)` is

$$
m_K(x)=\beta_0+\beta_1 x+\beta_2 (x-\tau_1)\,\mathbf{1}\{x\ge \tau_1\} +\beta_3 (x-\tau_2)\,\mathbf{1}\{x\ge \tau_2\}.
$$

---

## Spline Regression

A **quadratic spline** with one knot is

$$
m_K(x)=\beta_0+\beta_1 x+\beta_2 x^2+\beta_3 (x-\tau)^2\,\mathbf{1}\{x\ge \tau\}.
$$

- Observe that for `\(x\le \tau\)` the function is the quadratic `\(\beta_0+\beta_1 x+\beta_2 x^2\)` with second derivative `\(m_K''(\tau)=2\beta_2\)`;
- For `\(x\ge \tau\)` the second derivative is `\(m_K''(\tau)=2(\beta_2+\beta_3)\)`; so `\(2\beta_3\)` is the change in the second derivative at `\(\tau\)`.

From either side, the first derivative at `\(x=\tau\)` is `\(m_K'(\tau)=\beta_1+2\beta_2 \tau\)`.

**This ensures the function and its first derivative are continuous at the knot.**

---

## Spline Regression

In general, a `\(p^{\text{th}}\)`-order spline with `\(N\)` knots `\(\tau_1<\tau_2<\cdots<\tau_N\)` is

`$$m_K(x)=\sum_{j=0}^{p}\beta_j x^j+\sum_{k=1}^{N}\beta_{p+k}(x-\tau_k)^p\,\mathbf{1}\{x\ge \tau_k\}.$$`

---

## Splines

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#spline_linear.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Splines

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#spline_cubic.png" alt=" " width="70%" />
<p class="caption"> </p>
</div>

Cubic spline with two knots: smoother and continuously differentiable across segments.

---

## Splines

- In practice a spline will depend critically on the choice of the knots `\(\tau_k\)`.
- When `\(X\)` is bounded with an approximately uniform distribution it is common to space the knots evenly so all segments have the same length.
- When the distribution of `\(X\)` is not uniform, an alternative is to set the knots at the `\(j/(N+1)\)` quantiles of `\(X\)`, for `\(j = 1, \ldots, N\)`.
- A third alternative is to set the knots at the points where `\(m(x)\)` has the greatest change in curvature.

---

## Regression Spline Example

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#spline_example.png" alt=" " width="80%" />
<p class="caption"> </p>
</div>
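---

## Spline Regression: code sketch

A sketch that builds the linear-spline basis directly from the formulas above with two illustrative knots, plus a cubic spline with the same knots via `splines::bs()`. The data and knot locations are simulated/arbitrary choices for illustration.

```r
# Linear spline: m(x) = b0 + b1*x + b2*(x - tau1)_+ + b3*(x - tau2)_+
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

tau1 <- 3; tau2 <- 7                   # knots (illustrative)

fit_lin <- lm(Y ~ X + pmax(X - tau1, 0) + pmax(X - tau2, 0))

# Cubic spline with the same knots (continuous up to the 2nd derivative)
library(splines)
fit_cub <- lm(Y ~ bs(X, knots = c(tau1, tau2), degree = 3))

grid <- data.frame(X = seq(0, 10, by = 0.1))
plot(X, Y, col = "grey70", pch = 16)
lines(grid$X, predict(fit_lin, newdata = grid), lwd = 2, lty = 2)
lines(grid$X, predict(fit_cub, newdata = grid), lwd = 2)
```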
---

## Outline

- Non-parametric Regression
- Series and Spline Regression
- **.green[Regression Trees]**
- Bagging and Random Forest

---

## Regression Trees

CARTs (Classification and Regression Trees):

- A regression tree is a nonparametric regression using a large number of step functions.
- With a large number of split points a step function can approximate any function.

Useful when there is a combination of continuous and discrete regressors:

- Kernel and series methods are challenging to implement.
- Regression trees can be thought of as a `\(0^{th}\)`-order spline with free knots.

---

## Regression Trees

Estimate `\(m(x) = E[Y | X = x]\)` for scalar Y and vector X.

The elements of X can be continuous, binary, or ordinal.

- If a regressor is categorical it should first be transformed into a set of binary variables.

Language based on the metaphor of a living tree:

- A subsample is a **branch**.
- Terminal branches are **nodes** or **leaves**.
- Increasing the number of branches is **growing** a tree.
- Decreasing the number of branches is **pruning** a tree.

---

## Regression Trees

- The basic algorithm starts with a single branch.
- Grow a large tree by sequentially splitting the branches. Then **prune back** using an information criterion.
- The goal of the growth stage is to develop a rich, data-determined tree with small estimation bias.
- Pruning back has the goal of reducing **over-parameterization** and estimation variance (as in backward stepwise regression).

---

## Regression Trees

At the beginning of the regression tree algorithm:

- The entire dataset is treated as one branch (no splits).
- The predicted value is the **overall mean**: `$${m}(x) = \mu$$`

Then the regression tree algorithm makes use of a **regression sample split algorithm**.

- The method uses non-linear least squares (NLLS) to estimate the model:

$$
Y = \mu_1 \mathbf{1}\lbrace{X_d \le \gamma \rbrace} + \mu_2 \mathbf{1}\lbrace{X_d > \gamma\rbrace} + e
$$

- Linear in `\(\mu_1\)`, `\(\mu_2\)`, but **nonlinear in the cutoff `\(\gamma\)`**.

---

## Regression Trees

- We estimate `\(\mu_j\)` and the split `\(\gamma\)` to **minimize `\(SSR(\gamma)\)`**:

`$$SSR(\gamma) =\sum_{i: X_{di} \le \gamma} (Y_i - \bar{Y}_L)^2 \;+\; \sum_{i: X_{di} > \gamma} (Y_i - \bar{Y}_R)^2$$`

---

## Grow a tree

Fix the minimum size of a leaf (final node) at `\(N_{min}\)`:

**(a)** Apply the regression sample split algorithm to split each branch into two sub-branches, each with size at least `\(N_{min}\)`.

**(b)** Among the candidate splits, choose the branch that minimizes:

`$$\text{SSR}(\gamma) = \sum (Y_i - \hat{m}(X_{d}; \gamma))^2$$`

**(c)** Split that branch into two sub-branches.

- Repeat **(a)–(c)** until no branch can be further split (due to `\(N_{min}\)`).

---

## Grow a tree

- The terminal (unsplit) branches are the leaves.
- Before that point, each additional branch improves predictive capacity.
- e.g. Two branches: step function with one cutoff. Many branches: piecewise-constant approximation (a “grown” tree).
- Each split **reduces (in-sample) prediction error** and increases flexibility.

---

## Prune a tree

Define the *Mallows-type* information criterion (which penalizes model complexity via the number of leaves):

$$
C = \sum_{i=1}^{n} \hat{e}_i^2 + \lambda N
$$

- where `\(N\)` is the number of leaves and `\(\lambda\)` is a penalty parameter.
- Compute the criterion `\(C\)` for the current tree.
- **Use backward stepwise regression to reduce the number of leaves:**

**(a)** Identify the leaf whose removal most decreases `\(C\)`.

**(b)** Prune (remove) this leaf.

**(c)** If there is no leaf whose removal decreases `\(C\)`, then stop pruning.

**(d)** Otherwise, repeat **(a)–(c)**.

*The penalty parameter `\(\lambda\)` is selected by K-fold cross-validation.*
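---

## Regression sample split: code sketch

A sketch of the single-split step: search over cutoffs `\(\gamma\)` for one regressor and keep the split that minimizes `\(SSR(\gamma)\)`. The simulated data, the candidate-cutoff grid, and `\(N_{min}\)` are illustrative; growing and pruning a full tree (with the penalty chosen by cross-validation) is what packages such as `rpart` automate.

```r
# Best single split of X at cutoff gamma, minimizing SSR(gamma)
set.seed(123)
n <- 500
X <- runif(n, 0, 10)
Y <- sin(X) + rnorm(n, sd = 0.5)

n_min   <- 20                                        # minimum leaf size
cutoffs <- sort(unique(X))
cutoffs <- cutoffs[n_min:(length(cutoffs) - n_min)]  # keep >= n_min obs per side

ssr <- sapply(cutoffs, function(g) {
  left  <- Y[X <= g]
  right <- Y[X >  g]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
})

gamma_hat <- cutoffs[which.min(ssr)]                 # estimated split point
c(gamma    = gamma_hat,
  mu_left  = mean(Y[X <= gamma_hat]),                # leaf means
  mu_right = mean(Y[X >  gamma_hat]))
```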
---

## Regression Trees

Finding the best split, which minimizes the deviations from the mean in each leaf.

- This is a tree of depth 1:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#tree_example.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Regression Trees

This is a tree with three covariates:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#tree_example_mult.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

---

## Regression Trees: Disadvantages

- The results are difficult to interpret as there are no regression coefficients.
- The fitted regression `\(\hat{m}(x)\)` is a discrete step function, a crude approximation when `\(m(x)\)` is continuous and smooth.
- A regression tree may require a high number of leaves, which can result in a non-parsimonious model with high estimation variance.

---

## Tree vs. Linear Model

Depending on the problem at hand:

- **Linear regression:**

$$
m(X) = \beta_0 + \sum_{j=1}^{K} X_j \beta_j
$$

- **Regression tree:**

$$
m(X) = \sum_{j=1}^{J} c_j \mathbf{1}(x \in R_j)
$$

- If the relationship between `\(y\)` and `\((x_1, \ldots, x_K)\)` is **linear**, a linear model should perform better.
- If the relationship between `\(y\)` and `\((x_1, \ldots, x_K)\)` is **highly nonlinear and complex**, a tree model should perform better.

---

## A Classification Tree

- This is a tree with three covariates:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#tree_clasification.png" alt=" " width="100%" />
<p class="caption"> </p>
</div>

[Regression Tree Code](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/6_regression_tree.R)

---

## Outline

- Non-parametric Regression
- Series and Spline Regression
- Regression Trees
- **.green[Bagging and Random Forest]**

---

## Bagging

Bagging (bootstrap aggregating) is a method to reduce the variance of a predictor.

**(1)** Generate a large number **B** of bootstrap samples.

**(2)** Estimate your regression model on each bootstrap sample, and take the average of the estimates.

The mean of the bootstrap estimates is the **bagging estimator** of the conditional mean.

---

## Bagging

Let `\(m(x) = \mathbb{E}[Y \mid X = x]\)` be the conditional mean and `\(\hat{m}(x)\)` an estimator such as a regression tree.

Let `\(\hat{m}_b^{*}(x)\)` be the same estimator constructed on a bootstrap sample.

The **bagging** estimator of `\(m(x)\)` is:

`$$\hat{m}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{m}_b^{*}(x).$$`

- As `\(B\)` increases this converges in bootstrap probability to the ideal bagging estimator `\(\mathbb{E}^{*}\!\left[\hat{m}^{*}(x)\right]\)`.

---

## Bagging

- Useful when the conditional mean estimator has low bias but high variance.
- Useful for hard threshold estimators such as regression trees.
- Each tree uses around 2/3 of the observations.
- The remaining 1/3 are the **out-of-bag (OOB)** observations, used for out-of-sample prediction and for estimating the mean squared error (MSE).
- Bagging may exaggerate bias.

---

## Bagging

- Each regression tree might split the data differently depending on which points got resampled.
- Some trees will split earlier; others will be smoother.
- When you average them, the random idiosyncrasies cancel out, leaving a smoother and more stable prediction function.

[Gimme the code!](https://github.com/fcabrerahz/EconometricsDS/blob/main/Code/7_bagging.R)

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h1>The End</h1>
</div>