class: center, middle

# Quantile Regression and Binary Choice

### Dr. Francisco J. Cabrera-Hernández
#### Econometría
#### Maestría en Economía Primavera 2025
##### CIDE Santa Fe, Ciudad de México.

---
## Introduction

We introduce Least Absolute Deviations, Quantile Regression, and Binary Choice.

We have discussed projections and conditional means. You could instead project on conditional medians or, more generally, on any quantile.

We focus on continuously distributed `\(Y\)`, where quantiles are uniquely defined.

---
## Median Regression

The **median** of `\(Y\)` is the value `\(m = \text{med}[Y]\)` such that:

$$
\mathbb{P}[Y \leq m] = \mathbb{P}[Y \geq m] = 0.5
$$

It represents the **typical realization** of `\(Y\)`.

We define the **conditional median** of `\(Y\)` given `\(X = x\)` as:

`$$m(x) = \text{med}[Y \mid X = x]$$`

Such that:

`$$\mathbb{P}[Y \leq m(x) \mid X = x] = 0.5$$`

---
## Median Regression

We can write the relationship between `\(Y\)` and `\(X\)` as the **median regression model**:

`$$Y = m(X) + e$$`
`$$\text{med}[e \mid X] = 0$$`

The error `\(e\)` is the deviation of `\(Y\)` from its conditional median and has a conditional median of zero.

The **linear median regression model** is:

`$$Y = X'\beta + e$$`

Remember that the **true** median regression function is not necessarily linear; linearity is an assumption.

---
## Median Regression

To estimate `\(\beta\)` it is useful to characterize it as a function of the distribution.

In **OLS** the estimator minimizes the sum of squared residuals:

`$$\min_{\beta} \sum_{i=1}^n (Y_i - X_i'\beta)^2$$`

In **median regression**, the estimator minimizes the sum of absolute residuals:

`$$\min_{\beta} \sum_{i=1}^n |Y_i - X_i'\beta|$$`

---
## Median Regression

The minimization for the median departs from the sign function. With `\(e = Y - m\)`:

$$
\frac{d}{de} |e| = \text{sgn}(Y - m) = `\begin{cases} \mathbf{1}\{Y > m\} - \mathbf{1}\{Y < m\}, & e \neq 0 \\\\ \text{undefined}, & e = 0 \end{cases}`
$$

Note `\(\mathbb{P}(Y = m) = 0\)` for continuous `\(Y\)`, so the kink point can be ignored.

By definition of the median:

`$$\mathbb{P}(Y < m) = \mathbb{P}(Y > m) = 0.5$$`

Hence:

`$$\mathbb{E}[\text{sgn}(Y - m)] = (-1)(0.5) + (1)(0.5) = 0$$`

`$$\mathbb{E}[\text{sgn}(Y - m)] = 0$$`

---
## Median Regression

If the conditional distribution `\(F(y \mid x)\)` of `\(Y\)` given `\(X = x\)` is continuous in `\(y\)`, the conditional median error `\(e = Y - m(X)\)` satisfies:

`$$\mathbb{E}[\text{sgn}(e) \mid X] = 0$$`

If in addition `\(\mathbb{E}|Y| < \infty\)`, the conditional median satisfies:

`$$m(x) = \arg\min_{\theta} \mathbb{E}[|Y - \theta| \mid X = x]$$`

---
## Median Regression

Hence, if `\((Y, X)\)` satisfy the linear median regression model (i.e. the conditional median of `\(Y\)` given `\(X\)` is linear in `\(X\)`), the coefficient `\(\beta\)` satisfies:

$$
\beta = \arg\min_{b} \mathbb{E}[|Y - X'b|]
$$

The FOC for minimization implies `\(\mathbb{E}[X \,\text{sgn}(e)] = 0\)`.

Analogously to OLS, this is the **best linear median predictor**.

---
## Median Regression

The difference with OLS is the loss function.

The OLS loss (`\(e^2\)`) grows quadratically with the residual, heavily penalizing outliers.

The LAD loss (`\(|e|\)`) grows linearly, so it is more robust to large errors.

[Yummy Code](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/13_OLSvsLAD_loss.R)

---
## Least Absolute Deviations

The sample estimator of the expected absolute deviation is the average of absolute errors:

`$$M_n(\beta) = \frac{1}{n} \sum_{i=1}^n |Y_i - X_i'\beta|$$`

The M-estimator for `\(\beta\)` is the minimizer of `\(M_n(\beta)\)`:

$$
\hat{\beta} = \arg\min_{\beta} M_n(\beta)
$$

This is called the **Least Absolute Deviations (LAD)** estimator.
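
---
## Least Absolute Deviations

A minimal sketch (an assumed simulated example, separate from the linked course code; it assumes the `quantreg` package is installed): the LAD fit from `quantreg::rq()` can also be recovered by minimizing `\(M_n(\beta)\)` numerically.

```r
# Sketch: LAD vs OLS on simulated data with heavy-tailed errors (assumed example)
library(quantreg)

set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rt(n, df = 2)   # heavy-tailed errors produce outliers

ols <- lm(y ~ x)                 # minimizes the sum of squared residuals
lad <- rq(y ~ x, tau = 0.5)      # minimizes the sum of absolute residuals

# Direct numerical minimization of M_n(beta) gives (approximately) the same fit:
Mn <- function(b) mean(abs(y - b[1] - b[2] * x))
optim(coef(ols), Mn)$par

coef(ols)
coef(lad)
```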

---
## Least Absolute Deviations

The function `\(\hat{m}(x) = x'\hat{\beta}\)` is the **median regression** estimator.

*LAD refers to the minimization criterion; median regression refers to the targeted quantile.*

[More Yummy Code](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/14_LAD_MINcriteria.R)

---
## Residuals

The LAD residuals are `\(\hat{e}_i = Y_i - X_i'\hat{\beta}\)`. They approximately satisfy the property:

$$
\frac{1}{n} \sum_{i=1}^n X_i \, \text{sgn}(\hat{e}_i) \simeq 0.
$$

The approximation holds exactly if `\(\hat{e}_i \ne 0\)` for all `\(i\)`, which can occur when `\(Y\)` is continuously distributed.

---
## Least Absolute Deviations

The **first-order condition** (analogous to the normal equations in OLS) involves the sign function:

$$
\sum_{i=1}^n \text{sgn}(Y_i - X_i'\beta) X_i = 0
$$

This condition is discontinuous in `\(\beta\)`: the function `\(f(e) = |e|\)` has a **kink** at `\(e = 0\)`, so it is not differentiable there.

The minimizer can be characterized by a set of linear constraints, so **linear programming methods** are appropriate.

---
## Least Absolute Deviations

The LAD estimator `\(\hat{\beta}\)` must be found by **numerical optimization**. For example:

- Linear programming methods (e.g., the simplex method).
- Iterative optimization (e.g., interior point or subgradient methods).
- **MLE**, if we make a distributional assumption (Laplace) on the errors.

In R, LAD can be estimated using `quantreg::rq()`. In Stata, with the `qreg` command.

[Code Code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/15_LADvsOLSreg.R)

---
<style> .centered-word { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); } </style>

<div class="centered-word">
  <h2>Quantile Regression</h2>
</div>

---
## Quantile Regression

For `\(\tau \in [0, 1]\)`, the `\(\tau^{\text{th}}\)` quantile `\(q_\tau\)` of `\(Y\)` is defined as the value such that:

`$$\mathbb{P}(Y \leq q_\tau) = \tau$$`

The **median** is the special case `\(\tau = 0.5\)`.

The **conditional quantile** of `\(Y\)` given `\(X = x\)` is the value `\(q_\tau(x)\)` such that:

`$$\mathbb{P}(Y \leq q_\tau(x) \mid X = x) = \tau$$`

---
## Quantile Regression

The function `\(q_\tau(x)\)` is also called the **quantile regression function**.

Note `\(q_\tau(x)\)` is the true conditional quantile function that we will approximate.

We define the conditional quantile operators as:

`$$\mathbb{Q}_\tau[Y \mid X = x]$$`

And:

`$$\mathbb{Q}_\tau[Y \mid X]$$`

---
## Quantile Regression

<img src="data:image/png;base64,#qreg.png" width="80%" style="display: block; margin: auto;" />

---
## Quantile Regression

We define the **quantile regression model** analogously to the median regression model:

`$$Y = q_\tau(X) + e$$`
`$$\mathbb{Q}_\tau[e \mid X] = 0$$`

The error `\(e\)` is centered so that its `\(\tau^{\text{th}}\)` quantile is zero.

The **linear quantile regression model** is:

`$$Y = X'\beta_\tau + e$$`
`$$\mathbb{Q}_\tau[e \mid X] = 0$$`

---
## Quantile Regression

The **median** regression minimizes the absolute error loss. There is an analog for the quantile.

Define the **tilted absolute loss function**:

$$
\rho_\tau(x) = `\begin{cases} -x(1 - \tau) & \text{if } x < 0 \\\\ x\tau & \text{if } x \geq 0 \end{cases}`
$$

Or

`$$\rho_\tau(x) = x \cdot (\tau - \mathbb{1}\{x < 0\})$$`

---
## Quantile Regression

This is also called the **check function**.

For `\(\tau = 0.5\)`, this becomes the scaled absolute loss `\(\frac{1}{2}|x|\)`.

- When `\(\tau < 0.5\)`, the function is **tilted to the right**.
- When `\(\tau > 0.5\)`, it is **tilted to the left**.

[Codi!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/16_qfunction.R)
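
---
## Quantile Regression

A minimal sketch (an assumed example, separate from the linked course code) of the check function `\(\rho_\tau(x) = x(\tau - \mathbb{1}\{x < 0\})\)` for two values of `\(\tau\)`:

```r
# Sketch: the tilted absolute (check) loss for tau = 0.5 and tau = 0.9
rho <- function(x, tau) x * (tau - (x < 0))

x <- seq(-2, 2, by = 0.01)
plot(x, rho(x, 0.5), type = "l", ylab = "loss", main = "Check function")
lines(x, rho(x, 0.9), lty = 2)
legend("top", legend = c("tau = 0.5", "tau = 0.9"), lty = 1:2)
```

For `\(\tau = 0.9\)` negative residuals are penalized much less than positive ones, which pushes the fitted line up toward the 90th percentile.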

---
## Quantile Regression

How does it work? Note:

`$$\psi_\tau(x) = \frac{d}{dx} \rho_\tau(x) = \tau - \mathbb{1}\{x < 0\}$$`

This is the slope of the quantile loss function `\(\rho_\tau(x)\)`:

- If `\(x \geq 0\)`: `\(\psi_\tau(x) = \tau\)`
- If `\(x < 0\)`: `\(\psi_\tau(x) = \tau - 1\)`

This gives us a piecewise slope:

- Positive residuals (underpredictions) get a weight of `\(\tau\)`
- Negative residuals (overpredictions) get a weight of `\(\tau - 1\)`

This **asymmetry** is what makes quantile regression target a **specific quantile**, not the mean.

---
## Quantile Regression

| Residual | `\(\psi_{0.5}(e)\)` | `\(\psi_{0.2}(e)\)` |
|----------|---------------------|---------------------|
| -2       | -0.5                | -0.8                |
| -0.5     | -0.5                | -0.8                |
| 0        | 0.5                 | 0.2                 |
| 0.3      | 0.5                 | 0.2                 |
| 1.5      | 0.5                 | 0.2                 |

---
## Quantile Regression

<img src="data:image/png;base64,#qloss.png" width="75%" style="display: block; margin: auto;" />

---
## Quantile Regression

If `\((Y, X)\)` satisfy the linear quantile regression model and `\(\mathbb{E}|Y| < \infty\)`, then the coefficient `\(\beta_\tau\)` satisfies:

`$$\beta_\tau = \arg\min_b \, \mathbb{E}\left[ \rho_\tau(Y - X'b) \right]$$`

where `\(e = Y - X'b\)` and `\(\rho_\tau(e) = e \cdot (\tau - \mathbb{1}\{e < 0\})\)`.

This equals the **true conditional quantile coefficient when the true function is linear.**

Otherwise, `\(\beta_\tau\)` produces an **approximation** `\(x'\beta_\tau\)` to the true conditional quantile function `\(q_\tau(x)\)`.

---
## Quantile Regression

Remember `\(\psi_\tau(x) = \dfrac{d}{dx} \rho_\tau(x) = \tau - \mathbb{1}\{x < 0\}\)` for `\(x \neq 0\)`.

The first-order condition for minimization implies that:

`$$\mathbb{E}\left[ X \, \psi_\tau(e) \right] = 0$$`

The predictors `\(X\)`, when weighted by the "direction and magnitude" of the quantile gradient `\(\psi_\tau(e)\)`, should balance out to zero in expectation.

In OLS, the analogous condition `\(\mathbb{E}[Xe] = 0\)` requires **mean-zero** residuals.

---
## Quantile Regression

When quantile regression adds little:

<img src="data:image/png;base64,#qregparallel.png" width="25%" style="display: block; margin: auto;" />

Quantile regression provides little gain when all quantile lines have the **same slope**.

The model is then **conditionally homoskedastic**: the spread of the conditional distribution does not vary with `\(X\)`.

---
## Estimation

The sample quantile loss function is:

`$$M_n(\beta; \tau) = \frac{1}{n} \sum_{i=1}^n \rho_\tau(Y_i - X_i'\beta)$$`

The **M-estimator** for `\(\beta_\tau\)` is the minimizer of `\(M_n(\beta; \tau)\)`:

`$$\hat{\beta}_\tau = \arg\min_\beta M_n(\beta; \tau)$$`

This is called the **Quantile Regression** estimator of `\(\beta_\tau\)`.

Again, the coefficient `\(\hat{\beta}_\tau\)` does not have a closed-form solution, so it must be found by **numerical minimization**.

---
## Estimation

The quantile regression residuals `\(\hat{e}_i(\tau) = Y_i - X_i'\hat{\beta}_\tau\)` satisfy the approximate condition:

`$$\frac{1}{n} \sum_{i=1}^n X_i \, \psi_\tau(\hat{e}_i(\tau)) \simeq 0$$`

As for LAD, this holds exactly if `\(\hat{e}_i(\tau) \ne 0\)` for all `\(i\)`. This occurs with high probability if `\(Y\)` is continuously distributed.
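
---
## Estimation

A minimal sketch (an assumed simulated example, separate from the linked course code; it assumes `quantreg` is installed) verifying the approximate first-order condition for the fitted residuals:

```r
# Sketch: check that (1/n) * sum_i X_i * psi_tau(e_hat_i) is close to zero
library(quantreg)

set.seed(1)
n <- 500
x <- runif(n)
y <- 1 + x + (1 + x) * rnorm(n)   # heteroskedastic, so quantile slopes differ

tau <- 0.75
fit <- rq(y ~ x, tau = tau)
e   <- residuals(fit)
psi <- tau - (e < 0)              # psi_tau(e) = tau - 1{e < 0}

X <- cbind(1, x)
colMeans(X * psi)                 # approximately zero in each component
```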
[Codi Codi!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/19_regresiduals.R)

[Gimme the Code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/17_qregvsOLS.R)

---
## Covariance Matrix

The quantile regression estimator is asymptotically normal with a sandwich asymptotic covariance matrix.

Three levels of covariance matrices:

1. General case (even nonlinear and heteroskedastic):
`$$\mathbf{V}_\tau = \mathbf{Q}_\tau^{-1} \Omega_\tau \mathbf{Q}_\tau^{-1}$$`

2. Correct specification:
`$$\mathbf{V}_\tau^c = \tau(1 - \tau)\mathbf{Q}_\tau^{-1} \mathbf{Q} \mathbf{Q}_\tau^{-1}$$`

3. Quantile independence (homoskedastic):
`$$\mathbf{V}_\tau^0 = \frac{\tau(1 - \tau)}{f_\tau(0)^2} \mathbf{Q}^{-1}$$`

---
## Covariance Matrix

The expression is easiest to interpret in the quantile independence case:

`$$\mathbf{V}_\tau^0 = \frac{\tau(1 - \tau)}{f_\tau(0)^2} \mathbf{Q}^{-1}$$`

`\(f_\tau(0)\)` is the **density of the error** `\(e = Y - X'\beta_\tau\)` evaluated at **zero**. It measures how concentrated the distribution is near the `\(\tau\)`-th quantile.

The **precision** of `\(\hat{\beta}_\tau\)` depends on `\(f_\tau(0)\)`:

- High `\(f_\tau(0)\)`: more data near the quantile, **higher precision**.
- Low `\(f_\tau(0)\)`: fewer data near the quantile, **lower precision**.

---
## Covar Matrix Estimation

The easiest estimator is based on the **quantile independence assumption**, leading to:

`$$\hat{\mathbf{V}}_\tau^0 = \tau(1 - \tau) \, \hat{f}_\tau(0)^{-2} \, \hat{\mathbf{Q}}^{-1}$$`

where:

`$$\hat{\mathbf{Q}} = \frac{1}{n} \sum_{i=1}^n X_i X_i'$$`

Here, `\(\hat{f}_\tau(0)^{-2}\)` is a **nonparametric estimator** of `\(f_\tau(0)^{-2}\)`, generally obtained from a **bandwidth-based** estimator of `\(f_\tau(0)\)` directly.

---
## Covar Matrix Estimation

An estimator of `\(\mathbf{V}_\tau^c\)` assuming **correct specification** is:

`$$\hat{\mathbf{V}}_\tau^c = \tau(1 - \tau) \, \hat{\mathbf{Q}}_\tau^{-1} \, \hat{\mathbf{Q}} \, \hat{\mathbf{Q}}_\tau^{-1}$$`

where `\(\hat{\mathbf{Q}}_\tau\)` is a **nonparametric estimator** of `\(\mathbf{Q}_\tau\)`. A feasible choice given a bandwidth `\(h\)` is:

`$$\hat{\mathbf{Q}}_\tau = \frac{1}{2nh} \sum_{i=1}^n X_i X_i' \mathbb{1} \{|\hat{e}_i| < h\}$$`

A small `\(h\)` uses fewer residuals close to zero, giving more variable, potentially unstable estimates.

*Misspecification refers to situations where the conditional quantile function is not linear in `\(X\)` or is heteroskedastic.*

---
## Covar Matrix Estimation

An estimator of `\(\mathbf{V}_\tau\)` that **allows for misspecification** is:

`$$\hat{\mathbf{V}}_\tau = \hat{\mathbf{Q}}_\tau^{-1} \hat{\Omega}_\tau \hat{\mathbf{Q}}_\tau^{-1}$$`

where:

`$$\hat{\Omega}_\tau = \frac{1}{n} \sum_{i=1}^n X_i X_i' \hat{\psi}_{i\tau}^2$$`

`$$\hat{\psi}_{i\tau} = \tau - \mathbb{1}\{Y_i < X_i' \hat{\beta}_\tau\}$$`

It estimates the variability of the gradient of the loss function (through `\(\hat{\psi}_{i\tau}\)`) nonparametrically.

Remember `\(\hat{\psi}_{i\tau}\)` is like a directional "weight" that guides the estimation. It tells us how much and in which direction an observation "pulls" on the regression fit at quantile `\(\tau\)`.

---
## Covar Matrix Estimation

Of the three methods (`\(\hat{\mathbf{V}}_\tau^0\)`, `\(\hat{\mathbf{V}}_\tau^c\)`, and `\(\hat{\mathbf{V}}_\tau\)`):

- Avoid `\(\hat{\mathbf{V}}_\tau^0\)` (classical, assumes homoskedasticity).
- `\(\hat{\mathbf{V}}_\tau\)` is most robust (does **not** require correct specification), but it is **not available in software**.
- `\(\hat{\mathbf{V}}_\tau^c\)` is a **practical and recommended** choice in applications.

In practice, the most common way to obtain covariance matrices, CIs, and SEs in quantile regression is **bootstrapping**.

[More Code?](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/18_qregbandwith.R)
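
---
## Covar Matrix Estimation

A minimal sketch (an assumed simulated example; it assumes `quantreg` is installed) of the standard-error options in `quantreg`; `se = "boot"` is the bootstrap typically used in practice:

```r
# Sketch: standard errors for a quantile regression fit under different assumptions
library(quantreg)

set.seed(2)
n <- 500
x <- runif(n)
y <- 1 + x + (1 + x) * rnorm(n)
fit <- rq(y ~ x, tau = 0.75)

summary(fit, se = "iid")             # assumes iid errors (homoskedastic case)
summary(fit, se = "nid")             # sandwich form allowing non-iid errors
summary(fit, se = "boot", R = 500)   # bootstrap standard errors
```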

---
## Covar Matrix Estimation

Under **clustered dependence**, the asymptotic covariance matrix changes.

`$$\mathbf{V}_\tau = \mathbf{Q}_\tau^{-1} \Omega_\tau \mathbf{Q}_\tau^{-1}$$`

can be estimated by:

`$$\hat{\Omega}_\tau^{\text{cluster}} = \frac{1}{n} \sum_{g=1}^G \left[ \left( \sum_{\ell = 1}^{n_g} X_{\ell g} \hat{\psi}_{\ell g \tau} \right) \left( \sum_{\ell = 1}^{n_g} X_{\ell g} \hat{\psi}_{\ell g \tau} \right)' \right]$$`

---
## Covar Matrix Estimation

This leads to the **cluster-robust asymptotic covariance matrix estimator**:

`$$\hat{\mathbf{V}}_\tau^{\text{cluster}} = \hat{\mathbf{Q}}_\tau^{-1} \hat{\Omega}_\tau^{\text{cluster}} \hat{\mathbf{Q}}_\tau^{-1}$$`

The cluster-robust estimator `\(\hat{\mathbf{V}}_\tau^{\text{cluster}}\)` is **not implemented** in Stata nor in the R `quantreg` package.

Instead, the **clustered bootstrap** (sampling clusters with replacement) is recommended.

---
<style> .centered-word { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); } </style>

<div class="centered-word">
  <h2>Binary Choice</h2>
</div>

---
## Binary Choice

The simplest case is where `\(Y\)` is **binary**, thus `\(Y\)` has support `\(\{0, 1\}\)`.

In econometrics we call this class of models **binary choice**.

The goal in binary choice analysis is estimation of the conditional or **response probability** `\(\mathbb{P}[Y = 1 \mid X]\)` given a set of regressors `\(X\)`.

We may be interested in some transformation of it, such as its derivative: the **marginal effect**.

A traditional approach to binary choice modeling is **parametric**, with estimation by **maximum likelihood**.

---
## Binary Choice Models

Let `\((Y, X)\)` be random with `\(Y \in \{0,1\}\)` and `\(X \in \mathbb{R}^k\)`. The **response probability** of `\(Y\)` with respect to `\(X\)` is:

$$
P(x) = \mathbb{P}[Y = 1 \mid X = x] = \mathbb{E}[Y \mid X = x]
$$

The response probability completely describes the conditional distribution.

The **marginal effect** is:

$$
\frac{\partial}{\partial x} P(x) = \frac{\partial}{\partial x} \mathbb{P}[Y = 1 \mid X = x] = \frac{\partial}{\partial x} \mathbb{E}[Y \mid X = x]
$$

What we model is the response probability; afterwards we differentiate.

---
## Binary Regression

The variables satisfy the regression framework:

`$$Y = P(X) + e$$`
`$$\mathbb{E}[e \mid X] = 0$$`

The error `\(e\)` is **not classical**. It has a two-point conditional distribution:

$$
e = `\begin{cases} 1 - P(X), & \text{with probability } P(X) \\\\ -P(X), & \text{with probability } 1 - P(X) \end{cases}` \tag{25.1}
$$

It is also highly **heteroskedastic**, with conditional variance:

$$
\text{Var}[e \mid X] = P(X)(1 - P(X)) \tag{25.2}
$$

---
## Residual Distribution

<img src="data:image/png;base64,#distresidprob.png" width="50%" style="display: block; margin: auto;" />

---
## Models for Response Probability `\(P(x)\)`

**Linear Probability Model**: `\(P(x) = x'\beta\)`, where `\(\beta\)` is a coefficient vector.

The response probability is a linear function of the regressors.

The coefficients `\(\beta\)` equal the marginal effects (when `\(X\)` does not include nonlinear transformations).

We estimate by OLS, which is an advantage: more advanced designs (DiD, RDD, IV, etc.) can be implemented and standard estimators can be employed.
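
---
## Models for Response Probability

A minimal sketch (an assumed simulated example; it assumes the `sandwich` and `lmtest` packages) of a linear probability model with heteroskedasticity-robust standard errors:

```r
# Sketch: LPM by OLS; the error is heteroskedastic by construction, so use robust SEs
library(sandwich)
library(lmtest)

set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(-0.2 + 0.8 * x))   # binary outcome

lpm <- lm(y ~ x)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))   # robust standard errors

range(fitted(lpm))   # fitted "probabilities" can fall outside [0, 1]
```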

---
## Models for Response Probability

[Problems (Code):](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/20_outboundsprob.R)

If `\(X\)` takes on very large or very small values, the linear combination `\(X'\beta\)` can exceed 1 or drop below 0.

Example: if `\(X \sim N(0, 2^2)\)` and `\(\beta = 0.8\)`, then `\(X'\beta\)` will range roughly from `\(-4\)` to `\(4 \Rightarrow\)` predictions far outside `\([0, 1]\)`.

A steep slope means that small changes in `\(X\)` cause large changes in the predicted probability.

Bounded covariates like `\(X \in [-1, 1]\)` can help keep predictions in `\([0, 1]\)`.

---
## Models for Response Probability

**Index Models:** `\(P(x) = G(x'\beta)\)`, where `\(G(u)\)` is a **link function** and `\(\beta\)` is a coefficient vector.

In practice, `\(G\)` is typically the **normal** or **logistic** distribution function, both symmetric:

`$$G(-u) = 1 - G(u)$$`

Index models always respect the `\([0, 1]\)` probability bounds and allow for **nonlinear relationships** between `\(X\)` and `\(P(X)\)`.

---
## Index Models

**Probit Model:** `\(P(x) = \Phi(x'\beta)\)`, where `\(\Phi(u)\)` is the standard normal CDF.

**Logit Model:** `\(P(x) = \Lambda(x'\beta)\)`, where `\(\Lambda(u) = (1 + \exp(-u))^{-1}\)`.

For example, in the logit model:

$$
P(X) = \Lambda(X'\beta) = \frac{1}{1 + e^{-X'\beta}}
$$

Here, `\(X'\beta\)` is the **linear predictor** (also called the **index** or **score**).

The term `\(e^{-X'\beta}\)` is what makes the **logistic function** S-shaped.

---
## Index Models

In the probit model:

$$
P(X) = \Phi(X'\beta)
$$

Here, `\(\Phi(X'\beta)\)` is the **CDF** of the standard normal distribution, defined as:

$$
\Phi(X'\beta) = \int_{-\infty}^{X'\beta} \phi(t) \, dt
$$

`\(\phi(t)\)` is the **standard normal density function**.

---
## Logit and Probit Estimation

Probit and logit models are typically estimated by **maximum likelihood**.

To construct the likelihood, we need the distribution of observation `\(i\)`.

If `\(Y\)` is Bernoulli, such that `\(\mathbb{P}[Y = 1] = p\)` and `\(\mathbb{P}[Y = 0] = 1 - p\)`, then `\(Y\)` has the probability mass function:

$$
\pi(y) = p^y (1 - p)^{1 - y}, \quad y = 0, 1
$$

---
## Logit and Probit Estimation

In the index model `\(\mathbb{P}[Y = 1 \mid X] = G(X'\beta)\)`, `\(Y\)` is conditionally Bernoulli, so its conditional probability mass function is:

`$$\pi(Y \mid X) = G(X'\beta)^Y \left(1 - G(X'\beta)\right)^{1 - Y}$$`
`$$= G(X'\beta)^Y G(-X'\beta)^{1 - Y} = G(Z'\beta)$$`

where:

`$$Z = \begin{cases} X & \text{if } Y = 1 \\\\ -X & \text{if } Y = 0 \end{cases}$$`

e.g. for `\(Y = 0\)`, by symmetry we have `\(1 - G(X'\beta) = G(-X'\beta)\)`.

---
## Logit and Probit Estimation

Taking logs and summing across observations, the **log-likelihood function** is:

`$$\ell_n(\beta) = \sum_{i=1}^n \log G(Z_i'\beta)$$`

Probit model:

`$$\ell_n^{\text{probit}}(\beta) = \sum_{i=1}^n \log \Phi(Z_i'\beta)$$`

Logit model:

`$$\ell_n^{\text{logit}}(\beta) = \sum_{i=1}^n \log \Lambda(Z_i'\beta)$$`

---
## Logit and Probit Estimation

The MLE is the value which maximizes `\(\ell_n(\beta)\)`. We write this as:

$$
\hat{\beta}^{\text{probit}} = \arg\max_{\beta} \, \ell_n^{\text{probit}}(\beta)
$$

$$
\hat{\beta}^{\text{logit}} = \arg\max_{\beta} \, \ell_n^{\text{logit}}(\beta)
$$

Since the probit and logit log-likelihoods are **globally concave** (see Hansen, p. 806), `\(\hat{\beta}^{\text{probit}}\)` and `\(\hat{\beta}^{\text{logit}}\)` are **unique**.
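
---
## Logit and Probit Estimation

A minimal sketch (an assumed simulated example) of probit and logit fitted by maximum likelihood with `glm()`:

```r
# Sketch: probit and logit estimated by ML; glm() maximizes the log-likelihood
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * x))   # logistic DGP

probit <- glm(y ~ x, family = binomial(link = "probit"))
logit  <- glm(y ~ x, family = binomial(link = "logit"))

coef(probit)
coef(logit)
logLik(logit)   # the maximized log-likelihood
```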

---
## Marginal Effects

**What is `\(\hat\beta\)`?**

Each element of the vector `\(\hat\beta\)` reflects the direction and strength of the relationship between its associated variable and the latent variable `\(Y^*\)`.

e.g.:

`\(X'\hat\beta = -0.5 + 1.2 + 0.5 = 1.2\)`

`\(\hat P(X) = \Phi(1.2) = 0.8849\)`

So this person has a predicted 88.5% probability of `\(Y = 1\)`.

Because the probit model is nonlinear, `\(\hat\beta\)` does not directly equal the marginal effects on `\(P(Y = 1)\)`, unlike in the linear probability model.

---
## Latent Variable

A binary choice model can be viewed as a **latent variable model**. Define:

$$
Y^* = X'\beta + e, \quad e \sim G(e)
$$

We only observe:

$$
Y = `\begin{cases} 1 & \text{if } Y^* > 0 \\ 0 & \text{otherwise} \end{cases}`
$$

That is, `\(Y = 1\)` if the latent `\(Y^*\)` exceeds 0. This is equivalent to:

$$
X'\beta + e > 0
$$

So the response probability is:

$$
P(x) = \mathbb{P}(e > -X'\beta) = 1 - G(-X'\beta) = G(X'\beta)
$$

[Showing Code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/19_latentvar.R)

---
## Marginal Effects

Take the index model `\(\mathbb{P}[Y = 1 \mid X = x] = G(x'\beta)\)`, assuming `\(x\)` does not include nonlinear transformations.

In this case, the marginal effects are:

$$
\delta(x) = \frac{\partial}{\partial x} P(x) = \beta g(x'\beta)
$$

where `\(g(u) = G'(u)\)` is the density of the link function.

This varies with `\(x\)`. Hence, it is common to report an **average marginal effect (AME)**:

$$
\text{AME} = \mathbb{E}[\delta(X)] = \beta \mathbb{E}[g(X'\beta)]
$$

---
## Estimation of Marginal Effects

An estimator of `\(\delta(x)\)` is:

$$
\hat{\delta}(x) = \hat{\beta} g(x'\hat{\beta})
$$

An estimator of the AME is:

`$$\widehat{\text{AME}} = \frac{1}{n} \sum_{i=1}^n \hat{\delta}(X_i) = \hat{\beta} \frac{1}{n} \sum_{i=1}^n g(X_i'\hat{\beta})$$`

[Code it!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/20_probit.R)

---
## Nonlinear Case

When `\(X\)` includes nonlinear transformations, e.g.:

$$
\mathbb{P}[Y = 1 \mid X = x] = G(\beta_0 + \beta_1 x + \cdots + \beta_p x^p)
$$

The marginal effect is:

$$
\delta(x) = (\beta_1 + \cdots + p \beta_p x^{p-1}) \, g(\beta_0 + \beta_1 x + \cdots + \beta_p x^p)
$$

And its estimator:

$$
\hat{\delta}(x) = (\hat{\beta}_1 + \cdots + p \hat{\beta}_p x^{p-1}) \, g(\hat{\beta}_0 + \hat{\beta}_1 x + \cdots + \hat{\beta}_p x^p)
$$

The estimator of the AME is:

$$
\widehat{\text{AME}} = \frac{1}{n} \sum_{i=1}^n \hat{\delta}(X_i)
$$

---
<style> .centered-word { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); } </style>

<div class="centered-word">
  <h2>The End</h2>
</div>