class: center, middle

# Normal Regression

### Dr. Francisco J. Cabrera-Hernández
#### Econometría
#### Maestría en Economía, Spring 2024
##### CIDE Santa Fe, Ciudad de México.

---
## Introduction

The normal regression model is a special case of the linear regression model.

It allows for precise distributional characterizations and sharp inferences.

It is a fully parametric setting where maximum likelihood estimation is appropriate.

Exact distributions are useful for inference (critical values and p-values).

---
## The standard normal distribution

A random variable `\(Z\)` has the standard normal distribution, `\(Z \sim N(0,1)\)`, if it has the

.green[Density Function:]

$$\phi(x) = {1 \over \sqrt{2\pi}}\exp\big( -{x^2 \over 2} \big), \quad -\infty < x < \infty $$

with .green[Distribution Function] denoted `\(\Phi(z)\)`.

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#norm_dens.png" alt=" " width="75%" />
<p class="caption"> </p>
</div>

---
## Univariate normal distribution

`\(X\)` has the **univariate normal distribution** (the distribution of a single continuous random variable), denoted:

`$$X \sim N(\mu, \sigma^2)$$`

with density:

$$f(x|\mu,\sigma^2) = {1 \over \sqrt{2\pi\sigma^2}}\exp\big( - {(x - \mu)^2 \over 2\sigma^2} \big), \quad -\infty < x < \infty $$

The mean and variance of X are `\(\mu\)` and `\(\sigma^2\)`.

When `\(\mu = 0\)` and `\(\sigma^2 = 1\)`, this collapses to the standard normal defined above.

The normal distribution and its relatives (chi-squared, t, F) are frequently used for inference and p-values.

---
## Multivariate standard normal

A k-vector `\(Z\)` has a multivariate standard normal distribution, `\(Z \sim N(0,I_k)\)`, with joint density:

$$\phi(x) = {1 \over (2\pi)^{k/2}}\exp\big( - {x'x \over 2} \big), \quad x \in \mathbb{R}^k $$

The mean and covariance matrix of Z are 0 and `\(I_k\)`.

---
## Multivariate standard normal

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#mult_std_normal.png" alt=" " width="80%" />
<p class="caption"> </p>
</div>

---
## Multivariate normal distribution

The k-vector `\(X\)` has a multivariate normal distribution, `\(X \sim N(\mu,\Sigma)\)`, with mean vector `\(\mu\)` and covariance matrix `\(\Sigma \ge 0\)`:

$$f(x) = {1 \over (2\pi)^{k/2}\det(\Sigma)^{1/2}}\exp\big( - {(x -\mu)' \Sigma^{-1} (x -\mu) \over 2} \big), \quad x \in \mathbb{R}^k $$

If `\(k=1\)` the multivariate normal simplifies to the univariate normal.

---
## Multivariate normal distribution

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#mult_normal.png" alt=" " width="70%" />
<p class="caption"> </p>
</div>

[Here the code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/9_mult_normal_simulations.R)

---
## Multivariate normal distribution

If `\((Y,X)\)` are multivariate normal:

`$$\left( \begin{array}{c} Y\\ X\\ \end{array} \right) \sim N \left( \left( \begin{array}{c} \mu_Y\\ \mu_X\\ \end{array} \right) , \left( \begin{array}{cc} \Sigma_{YY} & \Sigma_{YX}\\ \Sigma_{XY} & \Sigma_{XX}\\ \end{array} \right) \right)$$`

Affine functions of normal random vectors, such as `\(Y=a+BX\)`, are multivariate normal.

Theorem: If `\(X \sim N(\mu,\Sigma)\)` and `\(Y=a+BX\)`, then `\(Y \sim N (a+B\mu, B\Sigma B')\)`.

Multiplying a normal random vector by a matrix `\(B\)` and adding a constant vector `\(a\)` does not alter its normality, only its mean and covariance matrix.

*If the vector X is multivariate normal, each component of X is univariate normal*, and every linear combination of the components is also normal.
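---
## Multivariate normal distribution

A minimal simulation sketch of the theorem above: draws from a bivariate normal are transformed by `\(Y = a + BX\)` and the empirical mean and covariance are compared with `\(a + B\mu\)` and `\(B\Sigma B'\)`. The particular values of `\(\mu\)`, `\(\Sigma\)`, `\(a\)`, `\(B\)` and the use of `MASS::mvrnorm()` are illustrative choices.

```r
# Illustrative check: X ~ N(mu, Sigma), Y = a + BX  =>  Y ~ N(a + B mu, B Sigma B')
library(MASS)                                 # mvrnorm() draws multivariate normals
set.seed(123)

mu    <- c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
a     <- c(0, 1)
B     <- matrix(c(1, 1, 0, -1), 2, 2, byrow = TRUE)

X <- mvrnorm(100000, mu = mu, Sigma = Sigma)  # one draw per row
Y <- t(a + B %*% t(X))                        # each row is a + B x_i

colMeans(Y); a + B %*% mu                     # empirical vs. theoretical mean
var(Y); B %*% Sigma %*% t(B)                  # empirical vs. theoretical covariance
```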
---
## Multivariate normal distribution

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#affine_normal.png" alt=" " width="85%" />
<p class="caption"> </p>
</div>

---
## Joint Normality and Linear Regression

Suppose `\((Y,X)\)` are jointly normally distributed, and consider the best linear predictor (BLP) of Y given X:

`$$Y = X'\beta + \alpha + e$$`

Since `\((e, X)\)` is an affine transformation of the normal vector `\((Y,X)\)`, `\((e,X)\)` is jointly normal.

By the BLP's properties, `\(E[Xe]=0\)` and `\(E[e]=0\)`, so `\(X\)` and `\(e\)` are uncorrelated; under joint normality, zero correlation implies independence.

This independence implies that the error satisfies the conditions of the homoskedastic linear CEF model:

`$$E[e|X]=E[e]=0$$`

`$$E[e^2|X]= E[e^2]=\sigma^2$$`

where `\(e \sim N(0,\sigma^2)\)` is independent of X.

---
## Normal Regression Model

Hence, the **normal regression model** is the linear regression model with an independent normal error:

`$$Y = X'\beta + e$$`

`$$e \sim N(0,\sigma^2)$$`

Normal regression does not require joint normality of `\((Y,X)\)`, only that the conditional distribution of `\(Y\)` given `\(X\)` is normal. The marginal distribution of X is unrestricted.

Normal regression is a parametric model where likelihood methods can be used for estimation.

---
## The likelihood

The likelihood "is the joint probability density of the data, evaluated at the observed sample, viewed as a function of the model parameters."

In other words: "Given our data, how plausible are different parameter values?", for example `\(\mu\)` and `\(\sigma^2\)`.

The maximum likelihood estimator (MLE) is the parameter value that maximizes this likelihood function.

---
## Log-Likelihood Function

The model `\(Y = X'\beta + e\)` with `\(e \sim N(0,\sigma^2)\)` independent of X implies that the conditional density of Y given X has the form:

$$f(y|x) = {1 \over (2\pi\sigma^2)^{1/2}}\exp\big( - {1 \over 2\sigma^2} (y - x'\beta )^{2} \big), \quad y \in \mathbb{R} $$

With mutually independent observations, the conditional density of `\((Y_1,...,Y_n)\)` given `\((X_1,...,X_n)\)` is:

`$$f (Y_1,...,Y_n|X_1,...,X_n) = \prod_{i=1}^n {1 \over (2\pi\sigma^2)^{1/2}}\exp\big( - {1 \over 2\sigma^2} (y_i - x_i'\beta )^{2} \big)$$`

`$$= {1 \over (2\pi\sigma^2)^{n/2}}\exp\big( - {1 \over 2\sigma^2} \sum_{i=1}^n (y_i - x_i'\beta )^{2} \big) =_{def} L_n(\beta,\sigma^2)$$`

If you choose values for `\(\beta\)` and `\(\sigma^2\)` and evaluate this expression at the sample, you obtain a certain "likelihood" of observing the data `\((Y|X)\)` under those parameters.

---
## Maximum Likelihood Estimator

For convenience we work with the logarithm, `\(\log L_n(\beta,\sigma^2)\)`:

`$$l_n(\beta, \sigma^2) = -{n \over 2}\log(2\pi\sigma^2)-{1 \over {2\sigma^2}}\sum_{i=1}^n (Y_i - X'_i\beta)^2$$`

We want the estimator values `\(\hat{\beta}\)` and `\(\hat{\sigma}^2\)` that maximize this log-likelihood function.

The maximizers `\((\hat{\beta},\hat{\sigma}^2)\)` of the log-likelihood solve the first-order conditions (FOC):

`$$0 = \left. \frac{\partial}{\partial \beta} l_n(\beta, \sigma^2) \right|_{\beta = \hat{\beta}_{mle}, \sigma^2 = \hat{\sigma}^2_{mle} } = {1 \over \hat\sigma^2_{mle}}\sum_{i=1}^n X_i(Y_i-X_i'\hat{\beta}_{mle})$$`

`$$0 = \left. \frac{\partial}{\partial \sigma^2} l_n(\beta, \sigma^2) \right|_{\beta = \hat{\beta}_{mle}, \sigma^2 = \hat{\sigma}^2_{mle} } = -{n \over 2\hat\sigma^2_{mle}}+{1 \over 2\hat\sigma^4_{mle}}\sum_{i=1}^n (Y_i-X_i'\hat{\beta}_{mle})^2$$`

---
## Maximum Likelihood Estimator

The first FOC is proportional to the first-order condition of the least-squares minimization problem (the normal equations).
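---
## Maximum Likelihood Estimator

Before solving the FOC analytically, a minimal numerical sketch on simulated data (all names and parameter values below are illustrative): maximizing `\(l_n(\beta,\sigma^2)\)` with a generic optimizer reproduces the `lm()` estimates.

```r
# Illustrative check: numerical MLE of the normal regression model vs. lm()
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))                          # intercept + one regressor
y <- drop(X %*% c(1, 2)) + rnorm(n, sd = 1.5)    # simulated outcome

negloglik <- function(par) {
  beta   <- par[1:2]
  sigma2 <- exp(par[3])                          # parameterize log(sigma2) so sigma2 > 0
  0.5 * n * log(2 * pi * sigma2) + sum((y - X %*% beta)^2) / (2 * sigma2)
}

fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
fit$par[1:2]                                     # ~ OLS coefficients
exp(fit$par[3])                                  # ~ mean of squared OLS residuals
coef(lm(y ~ X[, 2]))                             # OLS for comparison
mean(residuals(lm(y ~ X[, 2]))^2)
```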
---
## Maximum Likelihood Estimator

Hence:

`$$\hat{\beta}_{mle} = \big(\sum_{i=1}^nX_iX'_i \big)^{-1}\big(\sum_{i=1}^nX_iY_i \big) = \hat{\beta}_{ols}$$`

The MLE for `\(\beta\)` is algebraically identical to the OLS estimator.

---
## Maximum Likelihood Estimator

Solving the second FOC for `\(\hat{\sigma}^{2}_{mle}\)`:

`$$\hat{\sigma}^2_{mle} = {1 \over n}\sum_{i=1}^n (Y_i-X'_i\hat{\beta}_{mle})^2$$`

`$$= {1 \over n}\sum_{i=1}^n (Y_i-X'_i\hat{\beta}_{ols})^2={1 \over n}\sum_{i=1}^n \hat{e}_i^2 = \hat{\sigma}^2_{ols}$$`

The MLE for `\(\sigma^2\)` is algebraically identical to the OLS moment estimator.

MLE and OLS are therefore equivalent under normality, that is, when the error `\(e\)` has a known normal distribution.

---
## Maximum Likelihood Estimator

Plugging the estimators `\(\hat\beta\)` and `\(\hat\sigma^2\)` into the log-likelihood function:

`$$l_n(\hat\beta, \hat\sigma^2) = -{n \over 2}\log(2\pi\hat\sigma^2)-{1 \over {2\hat\sigma^2}}\sum_{i=1}^n (Y_i - X'_i\hat\beta)^2 = -{n \over 2}\log(2\pi\hat\sigma^2)-{n \over 2}$$`

gives the maximized log-likelihood, which is used as a measure of fit (for example, when comparing models with different explanatory variables on the same sample).

---
## Distribution of OLS coefficient vector

In the normal regression model the error vector `\(e\)` is independent of `\(X\)` and normally distributed:

`$$e | X \sim \color{green}{N (0 , I_n\sigma^2)}$$`

Recall that `\(\hat{\beta}-\beta = (X'X)^{-1}X'e\)`, a linear function of `\(e\)`. Since linear functions of normals are also normal:

`$$\hat{\beta}-\beta|X \sim (X'X)^{-1}X'\color{green}{N(0,I_n\sigma^2)}$$`

`$$\sim N(0,\sigma^2(X'X)^{-1}X'X(X'X)^{-1})$$`

`$$= N(0,\sigma^2(X'X)^{-1})$$`

With normality of errors, the OLS estimator has an exact normal distribution.

---
## Distribution of OLS coefficient vector

Hence, in the normal regression model:

`$$\hat{\beta}|X \sim N(\beta, \color{blue}{\sigma^2(X'X)^{-1}})$$`

Any affine function of the OLS estimator is also normally distributed.

---
## Distribution of OLS residual vector

Recall that `\(\hat{e} = Me\)`, where `\(M = I_n - X(X'X)^{-1}X'\)`, so `\(\hat e\)` is linear in `\(e\)`. Hence, since `\(M\)` is idempotent:

`$$\hat e= Me |X \sim N(0,\sigma^2MM) = N (0, \color{red}{\sigma^2 M})$$`

Hence, the residual vector has an exact normal distribution.

---
## Distribution of OLS residual vector

The joint distribution of `\(\hat{\beta}\)` and `\(\hat{e}\)` is a linear function of e:

`$$\left( \begin{array}{c} \hat{\beta}-\beta\\ \hat e\\ \end{array} \right) = \left( \begin{array}{c} (X'X)^{-1}X'e\\ Me\\ \end{array} \right) = \left( \begin{array}{c} (X'X)^{-1}X'\\ M\\ \end{array} \right) e$$`

The vector has a joint normal distribution with covariance matrix:

`$$\left( \begin{array}{cc} \color{blue}{\sigma^2(X'X)^{-1}} & 0\\ 0 & \color{red}{\sigma^2M}\\ \end{array} \right)$$`

The off-diagonal blocks are zero because `\(X'M = 0\)`; under joint normality, zero covariance implies that `\(\hat \beta\)` and `\(\hat e\)` are statistically independent.

This implies that `\(\hat{\beta}\)` is independent of any function of the residuals, including `\(\hat{e}_i\)` and the variance estimators `\(s^2\)` and `\(\hat{\sigma}^2\)`.

---
## t-statistic

`$$\hat{\beta}_j|X \sim N(\beta_j,\sigma^2[(X'X)^{-1}]_{jj})$$`

can be written as:

`$${\hat{\beta}_j - \beta_j \over \sqrt{\sigma^2[(X'X)^{-1}]_{jj}}} \sim N(0,1)$$`

Replacing the unknown variance with its estimator `\(s^2\)`:

`$$T = {\hat{\beta}_j - \beta_j \over \sqrt{s^2[(X'X)^{-1}]_{jj}}} = {\hat\beta_j-\beta_j \over s(\hat{\beta_j})} \sim t_{n-k}$$`

---
## t-statistic

Notes:

`\(s(\hat{\beta_j})\)` is the classic **homoskedastic** standard error for `\(\hat{\beta_j}\)`.

Robust t-statistics can have finite-sample distributions that deviate from `\(t_{n-k}\)`.

**The use of the t distribution in finite samples is exact only under normality.**
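---
## t-statistic

A minimal sketch on simulated data (variable names are illustrative) computing the homoskedastic standard errors, t-statistics, and p-values by hand, for comparison with `summary(lm())`.

```r
# Illustrative check: s(beta_j), T, and p-values "by hand" vs. summary(lm())
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 0.5 + 1 * x + rnorm(n)

X     <- cbind(1, x)
k     <- ncol(X)
bhat  <- solve(crossprod(X), crossprod(X, y))     # (X'X)^{-1} X'Y
ehat  <- y - X %*% bhat
s2    <- sum(ehat^2) / (n - k)                    # unbiased variance estimator
se    <- sqrt(s2 * diag(solve(crossprod(X))))     # s(beta_j)
tstat <- bhat / se                                # tests H0: beta_j = 0
pval  <- 2 * pt(-abs(tstat), df = n - k)          # exact under normality

cbind(bhat, se, tstat, pval)
summary(lm(y ~ x))$coefficients                   # should match
```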
---
## Confidence intervals for regression coefficients

`\(\hat{\beta}\)` is a point estimator; an interval estimator is a **random interval** `\(\hat{C}=[\hat L,\hat U]\)`.

The interval estimator `\(\hat{C}\)` is called a `\(1-\alpha\)` confidence interval when `\(\mathbb{P}[\beta \in \hat{C}] = 1-\alpha\)`.

`\(1-\alpha\)` is the coverage probability, typically 0.95 or 0.99.

Note: `\(\mathbb{P}[\beta \in \hat{C}]\)` treats the point `\(\beta\)` as fixed, so it is the probability that the random set `\(\hat{C}\)` contains the fixed coefficient `\(\beta\)`.

The typical choice of CI is:

`$$\hat{C} = \big[\hat{\beta} - c \times s(\hat{\beta}),\ \hat{\beta} + c \times s(\hat{\beta})\big]$$`

where `\(c \approx 1.96\)` for a 95% interval (e.g. with `\(n-k \ge 61\)`, `\(c \le 2\)`).

---
## t Test and P-value

`$$\mathbb{H}_0: \beta = \beta_0$$`

`$$\mathbb{H}_1: \beta \ne \beta_0$$`

`$$|T| = \left| \frac{\hat\beta - \beta_0}{s(\hat{\beta})} \right|$$`

Reject `\(\mathbb{H}_0\)` if `\(|T| > c\)`.

We can interpret the p-value instead: the p-value is the probability, under `\(\mathbb{H}_0\)`, of a t-statistic at least as large in absolute value as the observed `\(|T|\)`. If it is very small, this is evidence against `\(\mathbb{H}_0\)`.

---
## P-value

By lowering the significance level we make it less and less likely to reject a correct `\(\mathbb{H}_0\)` (type I error).

However, this increases the probability of a type II error (failing to reject a false `\(\mathbb{H}_0\)`).

A small p-value is evidence against the null hypothesis; a large p-value indicates the data are consistent with the null hypothesis.

P-values are more informative than tests at fixed significance levels.

---
## t Test and P-value

```
## 
## Call:
## lm(formula = lwage ~ educ + exper, data = wage1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05800 -0.30136 -0.04539  0.30601  1.44425 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.216854   0.108595   1.997   0.0464 *  
## educ        0.097936   0.007622  12.848  < 2e-16 ***
## exper       0.010347   0.001555   6.653 7.24e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4614 on 523 degrees of freedom
## Multiple R-squared:  0.2493, Adjusted R-squared:  0.2465 
## F-statistic: 86.86 on 2 and 523 DF,  p-value: < 2.2e-16
```

---
## t Test and P-value

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#Normal_v1_files/figure-html/unnamed-chunk-6-1.png" alt=" " />
<p class="caption"> </p>
</div>

---
## Likelihood Ratio test

Useful to assess a set of coefficients. An F test is ideal and, in the normal regression model, can be derived from a likelihood ratio (LR) test.

Model:

`$$Y = X'_1 \beta_1 + X'_2\beta_2 + e$$`

$$ \mathbb{H}_0: \beta_2 = 0$$

Under `\(\mathbb{H}_0\)` we have the "constrained model":

`$$Y = X'_1 \beta_1 + e$$`

---
## Likelihood Ratio test

Recall that the maximized log-likelihood of the unconstrained model is:

$$l_n(\hat\beta , \hat\sigma^2) = -{n\over2}\log(2\pi\hat\sigma^2)-{n\over2} $$

For the constrained model:

$$l_n(\tilde\beta , \tilde\sigma^2) = -{n\over2}\log(2\pi\tilde\sigma^2)-{n\over2} $$

---
## Likelihood Ratio test

`$$LR = 2\big(l_n(\hat\beta , \hat\sigma^2) - l_n(\tilde\beta , \tilde\sigma^2)\big)$$`

$$ = 2 \left( \big(-{n\over2}\log(2\pi\hat\sigma^2)-{n\over2}\big) - \big(-{n\over2}\log(2\pi\tilde\sigma^2)-{n\over2}\big) \right) $$

`$$= n \log \left({\tilde\sigma^2 \over \hat\sigma^2}\right)$$`

We reject `\(\mathbb{H}_0\)` for large values of LR.
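---
## Likelihood Ratio test

For the wage regression shown earlier, a minimal sketch of the LR statistic. It assumes the `wage1` data used above is loaded (e.g. from the `wooldridge` package); dropping `exper` as the constraint is an illustrative choice.

```r
# Illustrative LR test of H0: coefficient on exper = 0
unres <- lm(lwage ~ educ + exper, data = wage1)   # unconstrained model
res   <- lm(lwage ~ educ,         data = wage1)   # constrained model

n  <- nobs(unres)
LR <- n * log(sum(residuals(res)^2) / sum(residuals(unres)^2))
LR                                                # = n * log(sigma2_tilde / sigma2_hat)
as.numeric(2 * (logLik(unres) - logLik(res)))     # same number via log-likelihoods

anova(res, unres)                                 # the equivalent F test
```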
---
## Likelihood Ratio test

Equivalently, the same hypothesis can be tested with the F statistic:

`$$F = {(\tilde\sigma^2 - \hat\sigma^2)/q \over \hat\sigma^2/(n-k)} \sim F_{q,n-k}$$`

where `\(q\)` is the number of restrictions (the dimension of `\(\beta_2\)`).

Reject `\(\mathbb{H}_0\)` in favor of `\(\mathbb{H}_1\)` if `\(F>c\)` with significance level `\(\alpha\)`, where `\(c\)` is chosen so that `\(\mathbb{P}[F_{q,n-k} \ge c] = \alpha\)`.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h2>The End</h2>
</div>