class: center, middle

# OLS Finite Sample Properties

### Dr. Francisco J. Cabrera-Hernández
#### Econometría
#### Maestría en Economía Primavera 2024
#####CIDE Santa Fe, Ciudad de México.

---
##Introduction

We now investigate OLS **finite-sample** properties.

We recap the finite sample means and covariance matrix.

We focus on Standard Error propositions.

---
##Assumptions

For a sample of 1,000 people from Mexico, assuming that their responses are mutually independent is reasonable.

Assumption 1. **Random variables** `\(\{(X_n,Y_n)\}\)` are i.i.d. (from the same distribution)

Assumption 2. Variables `\((X,Y)\)` satisfy the **linear equation**:

`$$Y = X'\beta + e$$`
`$$E[e|X] = 0$$`

---
##Assumptions

Finite second moments: `\(E[Y^2] < \infty\)`; `\(E||X||^2 < \infty\)`

And **invertible matrix** `\(Q_{xx} = E[XX'] > 0\)`

Assumption 3 (if necessary). **Homoskedasticity:** `\(E[e^2|X] = \sigma^2(X) = \sigma^2\)`

---
##Expectation of LS estimator (Unbiased)

Using `\(\hat{\beta}=(X'X)^{-1}(X'Y)\)`

Assuming independence across `\(i\)` and linearity of expectations:

`$$E[Y_i|X_1,...,X_n] = E[Y_i|X_i] = X'_i\beta$$`

Stacking over `\(i\)`: `\(E[Y|X] = X\beta\)`.

Given the conditioning theorem `\(\color{green}{E[g(X)Y|X]=g(X)E[Y|X]}\)`:

`$$E[\hat{\beta}|X] = E[(X'X)^{-1}X'Y|X]$$`
`$$= (X'X)^{-1}X'E[Y|X]$$`
`$$=(X'X)^{-1}X'X\beta = \beta$$`

The key here is that `\(g(X)\)` is non-random, **given X**! And expectation distributes over linear transformations.

[Some code here!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/6_OLS_linearity.R)

---
##Expectation of LS estimator (Unbiased)

Similarly:

`$$\hat{\beta} = (X'X)^{-1} (X'(X\beta+e))$$`
`$$=(X'X)^{-1}X'X\beta+(X'X)^{-1}(X'e)$$`
`$$=\beta + (X'X)^{-1}X'e$$`

This is `\(\hat\beta\)` = `\(\beta\)` plus a stochastic component.

---
##Expectation of LS estimator (Unbiased)

Given:

`$$\hat\beta=\beta + (X'X)^{-1}X'e$$`

Then:

`$$E[\hat{\beta} - \beta|X] = E[(X'X)^{-1} X'e|X]$$`
`$$=(X'X)^{-1}X'E[e|X]=0$$`

`\(E[\hat{\beta}|X] = \beta\)`: the conditional distribution of `\(\hat{\beta}\)` centers at `\(\beta\)`, *for any realization of matrix X*.

Hence with i.i.d. sampling: `\(E(\hat{\beta}|X) = \beta\)` (conditionally unbiased).
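---
## Unbiasedness: a simulation check

A minimal sketch, not from the course repository: the data-generating process, seed, and sample size below are illustrative assumptions. It computes `\(\hat\beta=(X'X)^{-1}X'Y\)` by hand over repeated samples and checks that its average is close to `\(\beta\)`.

``` r
# Minimal sketch (assumed DGP): average of beta_hat across samples is ~ beta
set.seed(123)
beta <- c(1, 2)                  # true coefficients (intercept, slope)
reps <- 2000
est  <- matrix(NA, reps, 2)

for (r in 1:reps) {
  x <- rnorm(100)
  X <- cbind(1, x)                           # n x k regressor matrix
  e <- rnorm(100)                            # E[e|X] = 0
  y <- X %*% beta + e
  est[r, ] <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y
}

colMeans(est)   # close to c(1, 2): unbiasedness of the LS estimator
```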
---
## Variance of least square estimators

For any `\(r \times 1\)` random vector Z define the `\(r \times r\)` covariance matrix:

`$$Var[Z] = E[(Z-E[Z])(Z-E[Z])'] = E[ZZ'] - (E[Z])(E[Z])'$$`

For any pair (Z,X) define the conditional covariance matrix:

`$$Var[Z|X] = E[(Z-E[Z|X])(Z-E[Z|X])'|X]$$`

---
## Variance of least square estimators

We define `\(V_{\hat{\beta}} =_{def} Var[\hat{\beta}|X]\)` as the covariance matrix of the regression coefficients.

The conditional covariance matrix of the `\(n \times 1\)` regression error e is the `\(n \times n\)` matrix:

`$$var[e|X] = E[ee'|X]=_{def}D$$`

The `\(i_{th}\)` diagonal element of D is:

`$$E[e^2_i|X] = E[e^2_i|X_i] = \sigma^2_i$$`

The `\(ij_{th}\)` off-diagonal element of D is:

`$$E[e_ie_j|X] = E[e_i|X_i]E[e_j|X_j] = 0$$`

*This equality uses independence of observations.*

---
## Variance of least square estimators

`$$D = diag(\sigma^2_1,...,\sigma^2_n) = \left( \begin{array}{cccc} \sigma^2_1 & 0 & ... & 0 \\ 0 & \sigma^2_2 & ... & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & ... & \sigma^2_n \\ \end{array} \right)$$`

In the rare homoskedastic case `\(E[e^2_i|X_i] = \sigma^2_i = \sigma^2\)`:

`$$D = I_n\sigma^2$$`

---
## Variance of least square estimators

And for any `\(n \times r\)` matrix `\(A=A(X)\)`,

`$$var [A'Y|X] = var [A'e|X] = A'DA.$$`

For `\(\hat{\beta} = A'Y\)` where `\(A=X(X'X)^{-1}\)`, we have:

`$$V_{\hat{\beta}} = var[\hat{\beta}|X] = A'DA = (X'X)^{-1}X'D X(X'X)^{-1}$$`

Note that `\(X'DX = \sum_{i = 1}^{n}X_iX'_i\sigma^2_i\)` is a weighted version of `\(X'X.\)`

If homoskedastic: `\(D= I_n \sigma^2\)`; `\(X'DX = X'X\sigma^2\)` and the varcovar matrix simplifies to:

`$$V_{\hat{\beta}}= (X'X)^{-1}\sigma^2$$`

---
## Variance of least square estimators

`\(Y \sim N(\mu, \sigma^2) \to \mu = X\beta \to Y \sim N(X\beta, D)\)`, where if homoskedastic: `\(D=\sigma^2I\)`

`\(Y: n\times1\)`; `\(X: n\times k\)`; `\(\beta: k\times 1\)`; `\(D:n \times n\)`

- Covariance matrix (no assumptions)

`$$D = \begin{pmatrix} \sigma_{11}^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22}^2 & \sigma_{23} & \cdots & \sigma_{2n} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 & \cdots & \sigma_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \sigma_{n3} & \cdots & \sigma_{nn}^2 \end{pmatrix}$$`

---
## Variance of least square estimators

- Assuming Independence:

`$$D = \begin{pmatrix} \sigma_{1}^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_{2}^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_{3}^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{n}^2 \end{pmatrix}$$`

- Assuming Homoskedasticity:

`$$D = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} = \sigma^2 I_n$$`

---
## Variance of least square estimators (proof)

`$$\hat\beta = (X'X)^{-1}X'Y$$`
`$$var(\hat\beta) = var[(X'X)^{-1}X'Y]$$`

- This uses `\(var(aX)= a^2 var(X)\)`, and given `\(Y \sim N(X\beta, \color{green}{D})\)` with `\(\color{green}{D=\sigma^2 I}\)`:

`$$var(\hat\beta) = [(X'X)^{-1}X'] \color{green}{\sigma^2 I} [(X'X)^{-1}X']'$$`
`$$var(\hat\beta) = \sigma^2[(X'X)^{-1}X'] I [X(X'X)^{-1}]$$`

- Given: `\([(X'X)^{-1}]'= (X'X)^{-1}\)`

Under homoskedasticity and with independence of errors:

`$$var(\hat\beta) = \sigma^2(X'X)^{-1}$$`
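---
## Homoskedastic variance: a quick check

A minimal sketch with simulated data (DGP and seed are illustrative assumptions): compute `\(s^2(X'X)^{-1}\)` by hand, using the degrees-of-freedom corrected error variance `\(s^2\)` discussed later, and compare it with `vcov()` from `lm()`, which uses the same homoskedastic formula.

``` r
# Minimal sketch: s^2 (X'X)^{-1} equals the lm() covariance matrix
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                        # homoskedastic errors

model <- lm(y ~ x)
X  <- model.matrix(model)                        # n x k design matrix
s2 <- sum(residuals(model)^2) / (n - ncol(X))    # s^2 = RSS / (n - k)

V_manual <- s2 * solve(t(X) %*% X)               # s^2 (X'X)^{-1}
all.equal(V_manual, vcov(model), check.attributes = FALSE)  # TRUE
```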
---
## Gauss-Markov Theorem

The LS estimator is the case when `\(A=X(X'X)^{-1}\)`. What is the best choice of A?

The Gauss-Markov theorem states *LS is the best choice* of `\(A\)`, among linear unbiased estimators, **when errors are homoskedastic.**

With `\(E[Y|X]=X\beta\)` and for any linear `\(\tilde{\beta}=A'Y\)` we have:

`$$E[\tilde{\beta}|X] = A'E[Y|X] = A'X\beta$$`

`\(\tilde{\beta}\)` is unbiased if `\(A'X = I_{k}\)`

---
## Gauss-Markov Theorem

Furthermore:

`$$Var[\tilde{\beta}|X]=var[A'Y|X]=A'DA=A'A\sigma^2$$`

The last equality comes from: `\(D=I_n\sigma^2\)`

The BLUE comes from finding the matrix `\(A_o\)` that satisfies `\(A_o'X=I_k\)` such that `\(A_o'A_o\)` is minimized in the positive semi-definite sense.

Any other choice of `\(A\)` yields a variance `\(A'A\sigma^2\)` that is at least as large.

---
## Gauss-Markov Theorem

We have seen that LS satisfies this among linear estimators with i.i.d. sampling.

So if `\(\tilde{\beta}\)` is a linear unbiased estimator of `\(\beta\)` then:

`$$Var[\tilde{\beta}|X] \ge \sigma^2(X'X)^{-1}$$`

No unbiased linear estimator can have a variance matrix smaller (in the positive definite sense) than `\(\sigma^2(X'X)^{-1}\)`

---
## Gauss-Markov Theorem (proof)

Let `\(A\)` be any `\(n \times k\)` function of `\(X\)` such that `\(\color{green}{A'X = I_k}\)`

The estimator `\(A'Y\)` is unbiased for `\(\beta\)` with variance `\(A'A\sigma^2\)`

It is sufficient to show that the difference between `\(A'A\)` and `\((X'X)^{-1}\)` is positive semi-definite, or:

`$$A'A-(X'X)^{-1} \ge 0$$`

---
## Gauss-Markov Theorem (proof)

Set:

`$$C= A-X(X'X)^{-1}$$`
`$$\color{green}{A = C+X(X'X)^{-1}}$$`

Note that `\(X'C=0\)` because `\(A'X = I_k\)`.

`$$A'A-(X'X)^{-1} = (\color{green}{C+X(X'X)^{-1}})'(C + X(X'X)^{-1}) - (X'X)^{-1}$$`
`$$=C'C + C'X(X'X)^{-1} + (X'X)^{-1}X'C + \\ \color{red}{(X'X)^{-1}X'X(X'X)^{-1} - (X'X)^{-1}}$$`
`$$=C'C\ge 0$$`

The cross terms vanish because `\(X'C = 0\)`, and `\(C'C\)` is positive semi-definite.

---
## Generalized Least Squares

Model in matrix form:

`$$Y = X\beta + e$$`

Consider a generalized situation where the errors are heteroskedastic:

`$$E[e|X]=0$$`
`$$var[e|X]= \Omega$$`

`\(\Omega\)` allows for i.i.d. sampling, where `\(\Omega=D\)`, but also for non-diagonal covariance matrices.

Hence:

`$$E[\hat{\beta}|X] = \beta$$`
`$$var[\hat{\beta}|X] = (X'X)^{-1} (X'\Omega X) (X'X)^{-1}$$`

---
## Generalized Least Squares

A generalized Gauss-Markov bound is:

`$$var[\tilde{\beta}|X] \ge \ (X'\Omega^{-1} X)^{-1}$$`

This holds when we know `\(\Omega\)` up to scale. Under homoskedasticity and i.i.d. sampling this bound reduces to `\(\sigma^2(X'X)^{-1}\)`; otherwise the OLS variance is (weakly) larger than the bound.

Suppose that we know `\(\Omega=c^2\Sigma\)`, where `\(c^2>0\)` and real; `\(\Sigma\)` is `\(n \times n\)` **and known**

---
## Generalized Least Squares

A case of GLS is where we pre-multiply by `\(\Sigma^{-1/2}\)`, producing:

`$$\tilde{\beta}_{gls} = (\tilde{X}'\tilde{X})^{-1} (\tilde{X}'\tilde{Y})$$`
`$$= ((\Sigma^{-1/2}{X})' (\Sigma^{-1/2}{X}))^{-1} (\Sigma^{-1/2}{X})' (\Sigma^{-1/2}{Y})$$`
`$$= (X' \Sigma^{-1}X)^{-1} X'\Sigma^{-1}Y$$`

Hence:

`$$E[\tilde{\beta}_{gls}|X] = \beta$$`
`$$Var[\tilde{\beta}_{gls}|X] = (X'\Omega^{-1} X)^{-1}$$`

---
## Generalized Least Squares

The variance lower bound is sharp when `\(\Sigma\)` **is known**. And GLS is efficient under heteroskedasticity.

In the linear regression model with independent observations and known conditional variances, so that `\(\Omega = \Sigma = D = diag(\sigma^2_1,...,\sigma^2_n)\)`, GLS takes the form:

`$$\tilde\beta_{gls}=(X'D^{-1}X)^{-1}X'D^{-1}Y$$`

In practice the covariance matrix `\(\Omega\)` is unknown; it can be estimated (Feasible GLS).

**No longer common in current applied econometric practice.**
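---
## GLS: a small sketch

A minimal sketch assuming a *known* diagonal `\(\Sigma\)` (the variance function below is an illustrative assumption): it computes `\((X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y\)` directly and via `lm()` with weights `\(1/\sigma^2_i\)`, which give the same estimate.

``` r
# Minimal sketch: GLS with known diagonal Sigma vs. weighted least squares
set.seed(123)
n  <- 200
x  <- rnorm(n)
s2 <- (1 + abs(x))^2                      # known conditional variances sigma_i^2
y  <- 1 + 2 * x + rnorm(n, sd = sqrt(s2))

X <- cbind(1, x)
Sigma_inv <- diag(1 / s2)                 # Sigma^{-1}, n x n and known

beta_gls <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% y)
beta_gls                                  # (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y

coef(lm(y ~ x, weights = 1 / s2))         # identical: WLS with weights 1/sigma_i^2
```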
---
## Estimation of Error Variance

The error variance `\(\sigma^2 = E[e^2]\)` measures the unexplained part of the regression.

Its method of moments estimator is:

`$$\hat{\sigma}^2 = {1 \over n} \sum_{i = 1}^{n}\hat{e_i}^2$$`

It can be shown (BH p. 108) that under conditional homoskedasticity `\(E[e^2|X] = \sigma^2\)`, so that `\(D = I_{n}\sigma^2\)`:

`$$E[\hat{\sigma}^2 | X] = {1 \over n} tr(M\sigma^2) = {\sigma^2 \left({n-k \over n}\right)}$$`

Showing that `\(\hat{\sigma}^2\)` **is biased towards zero**; this bias is more important if `\(k/n\)` is large.

---
## Estimation of Error Variance

This can be rescaled to:

`$$s^2= {1 \over n-k}\sum_{i = 1}^{n}\hat{e_i}^2$$`
`$$E[s^2|X] = \sigma^2$$`

---
## Homoskedastic Covariance Matrix Estimation

For inference we need to estimate the covariance matrix `\(V_{\hat{\beta}}\)`

Under homoskedasticity: `\(V^0_{\hat{\beta}} = (X'X)^{-1}\sigma^2\)` or `\(\hat{V}^0_{\hat{\beta}} = (X'X)^{-1}s^2\)`

Conditionally unbiased:

`$$E[\hat{V}^0_{\hat{\beta}}|X] = (X'X)^{-1} E[s^2|X] = (X'X)^{-1} \sigma^2 = V_{\hat{\beta}}$$`

*If the regression error is heteroskedastic it is possible for `\(\hat{V}^0_{\hat{\beta}}\)` to be biased.*

---
## Biased Covariance Matrix Estimation

Remember that `\(X'DX = \color{green} {\sum_{i=1}^n X_iX_i'\sigma^2_i}\)` (shown earlier)

Suppose `\(k=1\)` and `\(\sigma^2_i = X^2_i\)`, implying `\(\sigma^2 = E[\sigma^2_i] = E[X^2]\)`.

If we use `\(\hat{V}^0_{\hat{\beta}} = (X'X)^{-1}s^2\)` even though the error is heteroskedastic, the ratio of the true variance to the expectation of the variance estimator is:

`$${V_{\hat{\beta}} \over E[\hat{V}^0_{\hat{\beta}}|X]} = {\sum_{i = 1}^{n}X_i^{4} \over \sigma^2\sum_{i = 1}^{n}X_i^{2}}= {E[X_i^4] \over (E[X_i^2])^2} =_{def} \kappa$$`

Where `\(\kappa\)` is the standardized kurtosis. If `\(X \sim N(0,\sigma^2)\)`, `\(\kappa = 3\)`, so the true variance is 3 times higher than the estimator suggests in this example.

---
## Heteroskedastic Covariance Matrix Estimation

We can construct a varcovar matrix estimator not requiring homoskedasticity. General form:

`$$V_{\hat{\beta}} = var[\hat{\beta}|X] = (X'X)^{-1}X'D X(X'X)^{-1}$$`
`$$D = diag(\sigma^2_1,...,\sigma^2_n)=E[ee'|X] = E[\tilde{D}|X]$$`

Where `\(\tilde{D}=diag(e_1^2,...,e_n^2)\)`. `\(\tilde{D}\)` is a conditionally unbiased estimator of `\(D\)`.

If `\(e^2_i\)` were observable we could construct:

`$$\hat{V}^{ideal}_{\hat{\beta}} = (X'X)^{-1}X'\tilde{D} X(X'X)^{-1}$$`
`$$= (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_ie^{2}_i)(X'X)^{-1}$$`

---
## Heteroskedastic Covariance Matrix Estimation

From here:

`$$E[\hat{V}^{ideal}_{\hat{\beta}}|X] = (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_iE[e^{2}_i|X])(X'X)^{-1}$$`
`$$= (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_i\sigma^2_i)(X'X)^{-1}$$`
`$$= (X'X)^{-1}X'D X(X'X)^{-1} = V_{\hat{\beta}}$$`
`$$E[\hat{V}^{ideal}_{\hat{\beta}}] = V_{\hat{\beta}}$$`

Verifying that it is unbiased.

---
## Heteroskedastic Covariance Matrix Estimation

Under heteroskedasticity:

`$$var(\hat\beta) = [(X'X)^{-1}X']D[X(X'X)^{-1}]$$`

- `\(D: n \times n\)`, and you cannot estimate `\(n \times n\)` elements with n observations.

- Instead of estimating the full covar matrix, the diagonal can be estimated with residuals. A "weighted" version *(White, 1980)*

---
## Heteroskedastic Covariance Matrix Estimation

Also, as the `\(e^2_i\)` are unobserved, `\(\hat{V}^{ideal}_{\hat{\beta}}\)` is not feasible. So we replace `\(e^2_i\)` with the residuals `\(\hat{e}^2_i\)`, obtaining:

`$$\hat{V}^{HC0}_{\hat{\beta}} = (X'X)^{-1} \large( \sum_{i = 1}^{n}X_iX'_i\hat{e}^{2}_i \large)(X'X)^{-1}$$`

This is the "baseline" heteroskedasticity-consistent covar matrix estimator.

---
## Heteroskedastic Covariance Matrix Estimation

Furthermore, as `\(\hat{e}^{2}_i\)` is biased towards zero, we rescale by `\(n/(n-k)\)`:

`$$\hat{V}^{HC1}_{\hat{\beta}} = ({n \over {n-k}}) (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}$$`

These are robust, heteroskedasticity-consistent, or heteroskedasticity-robust covar matrices. HC0 is the Eicker-White or White covariance matrix estimator.

**HC errors are not the default in Stata**. If *robust* is added, it is HC1.

Standard errors are the square root of the diagonal elements of `\(\hat{V}^{}_{\hat{\beta}}\)`
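---
## HC0 and HC1 by hand

A minimal sketch with simulated heteroskedastic data (DGP and seed are illustrative assumptions): build the sandwich `\((X'X)^{-1}(\sum_{i}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}\)` by hand and compare it with `sandwich::vcovHC()`.

``` r
# Minimal sketch: HC0/HC1 by hand vs. sandwich::vcovHC()
library(sandwich)

set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))   # heteroskedastic errors

model <- lm(y ~ x)
X  <- model.matrix(model)
e2 <- residuals(model)^2
k  <- ncol(X)

bread    <- solve(t(X) %*% X)
meat_hat <- t(X) %*% (X * e2)                # sum_i X_i X_i' e_i^2

V_hc0 <- bread %*% meat_hat %*% bread
V_hc1 <- (n / (n - k)) * V_hc0

all.equal(V_hc0, vcovHC(model, type = "HC0"), check.attributes = FALSE)  # TRUE
all.equal(V_hc1, vcovHC(model, type = "HC1"), check.attributes = FALSE)  # TRUE
sqrt(diag(V_hc1))                            # robust standard errors
```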
---
## Estimation Example

With `\(n \to \infty\)` and under homoskedasticity:

`$$var(\hat \beta_1) = \frac {\hat\sigma^2} {\sum_{i=1}^n (x_i - \bar x)^2}$$`

`\(\hat{var}(\hat\beta_1)\)` falls at a rate 1/n. This rate refers to *efficiency*.

The heteroskedasticity-robust covar matrix falls at a lower rate.

---
##Homoskedastic Convergence

``` r
library(sandwich)                      # for vcovHC()

repet <- 5000
beta_1_true <- 2                       # true slope used in the simulation
running_variances <- NULL

# Set seed for reproducibility
set.seed(123456)

# For increasing sample sizes, estimate the model and store the HC1
# variance of beta_1
for (i in 50:repet) {
  x <- rnorm(i)                        # Regressor x
  u <- rnorm(i, sd = 2)                # Homoskedastic random error
  y <- 2 + beta_1_true * x + u         # Define y, with beta_1 = 2
  model <- lm(y ~ x)                   # Estimate the model

  # Store the robust variance of beta_1 for the current sample size
  robust_vcov <- vcovHC(model, type = "HC1")
  running_variances[i] <- diag(robust_vcov)[2]
}
# we then plot the running variance...
```

---
##Homoskedastic Convergence

<img src="data:image/png;base64,#Finite_v1_files/figure-html/unnamed-chunk-2-1.png" width="65%" style="display: block; margin: auto;" />

---
##Heteroskedastic Convergence

``` r
library(sandwich)                      # for vcovHC()

repet <- 5000
beta_1_true <- 2                       # true slope used in the simulation
running_variances <- NULL

# Set seed for reproducibility
set.seed(123456)

# For increasing sample sizes, estimate the model and store the HC1
# variance of beta_1
for (i in 50:repet) {
  x <- rnorm(i)                        # Regressor x
  u <- rnorm(i, sd = 2 + abs(x))       # Heteroskedastic random error
  y <- 2 + beta_1_true * x + u         # Define y, with beta_1 = 2
  model <- lm(y ~ x)                   # Estimate the model

  # Store the robust variance of beta_1 for the current sample size
  robust_vcov <- vcovHC(model, type = "HC1")
  running_variances[i] <- diag(robust_vcov)[2]
}
# we then plot the robust variance estimates...
```

---
##Variance of `\(\hat\beta_1\)` Convergence

<img src="data:image/png;base64,#Finite_v1_files/figure-html/unnamed-chunk-4-1.png" width="75%" style="display: block; margin: auto;" />

---
## Other Heteroskedastic Variance Estimations

HC2 and HC3 come from standardized errors `\(\bar{e}\)` and prediction errors `\(\tilde{e}\)`, respectively.

Where: `\(\hat{V}^{HC0}_{\hat{\beta}} < \hat{V}^{HC2}_{\hat{\beta}} < \hat{V}^{HC3}_{\hat{\beta}}\)`

---
## Other Heteroskedastic Variance Estimations

Before, we define **Leverage Values:**

There are `\(n\)` leverage values denoted as `\(h_{ii}\)` for `\(i=1,...,n\)`

`$$h_{ii} = X'_i(X'X)^{-1}X_i$$`

The leverage value is a normalized length of the observed regressor vector `\(X_i\)` and is between 0 and 1.

It measures how unusual the `\(i_{th}\)` observation `\(X_i\)` is relative to the other observations in the sample.

An **extreme** example of `\(h_{ii}=1\)` is a dummy variable that equals 1 for only one observation.

---
## Heteroskedastic Covariance Matrix Estimation

`$$\hat{V}^{HC2}_{\hat{\beta}} = (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_i\bar{e}^{2}_i)(X'X)^{-1}$$`
`$$\hat{V}^{HC2}_{\hat{\beta}} = (X'X)^{-1}(\sum_{i = 1}^{n}(1-h_{ii})^{-1}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}$$`

If there is an observation with `\(h_{ii}\)` close to one, then `\((1-h_{ii})^{-1}\)` is large, giving this observation more weight.

---
## Heteroskedastic Covariance Matrix Estimation

While:

`$$\hat{V}^{HC3}_{\hat{\beta}} = (X'X)^{-1}(\sum_{i = 1}^{n}(1-h_{ii})^{-2}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}$$`

In **actividad 2** you will show that HC2 is unbiased (see BH p.113)
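---
## Leverage and HC2/HC3 in R

A minimal sketch with simulated data (the DGP is an illustrative assumption): `hatvalues()` returns the leverage values `\(h_{ii}\)`, and `sandwich::vcovHC()` implements the HC2 and HC3 weightings.

``` r
# Minimal sketch: leverage values and HC0/HC2/HC3 standard errors
library(sandwich)

set.seed(123)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))
model <- lm(y ~ x)

h <- hatvalues(model)                 # leverage h_ii = X_i'(X'X)^{-1}X_i
range(h)                              # each h_ii lies between 0 and 1

se <- function(V) sqrt(diag(V))
se(vcovHC(model, type = "HC0"))
se(vcovHC(model, type = "HC2"))       # weights (1 - h_ii)^{-1}
se(vcovHC(model, type = "HC3"))       # weights (1 - h_ii)^{-2}, typically largest
```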
---
## Clustered Sampling

Samples could be correlated within groups (not across). For example when studying schools, firms, households or localities.

This is `\(Y_{ig}, X_{ig}\)` where `\(g = 1,...,G\)` indexes the cluster.

The number of observations per cluster is `\(n_g\)` and `\(n=\sum_{g = 1}^{G}n_g\)`.

A model is:

`$$Y_{ig} = X'_{ig}\beta + e_{ig}$$`

---
## Clustered Sampling

Or we can use cluster notation:

`$$Y_g= X_g\beta+e_g$$`

Where `\(e_g = (e_{1g},..., e_{n_gg})'\)` is an `\(n_g \times 1\)` error vector.

We can write the sums over observations as `\(\sum_{g = 1}^{G}\sum_{i=1}^{n_g}\)`

This is the sum across clusters of the sum across observations within each cluster.

---
## Clustered Sampling

OLS is:

`$$\hat\beta= (\sum_{g = 1}^{G}\sum_{i=1}^{n_g}X_{ig}X'_{ig})^{-1} (\sum_{g = 1}^{G}\sum_{i=1}^{n_g}X_{ig}Y_{ig})$$`
`$$= (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}Y_{g})$$`
`$$=(X'X)^{-1}(X'Y)$$`

With residuals `\(\hat{e}_{ig}= Y_{ig}-X'_{ig}\hat{\beta}\)` or `\(\hat{e}_{g}= Y_{g}-X_{g}\hat{\beta}\)` (in cluster level notation)

---
## Clustered Sampling

We assume that clusters are mutually independent and that errors are conditionally mean zero: `\(E[e_{g}|X_{g}]=0\)`.

This holds if **all interaction effects within clusters** have been accounted for in the specification of the individual regressors `\(X_{ig}\)`.

e.g. the achievement of any student is unaffected by the individual `\(X_i\)` (e.g. age, gender and test scores) of other students within the same school.

---
## Clustered Sampling

We can calculate the mean of the OLS estimator by substituting

`$$Y_g= X_g\beta+e_g$$`

into

`$$\hat\beta = (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}Y_{g})$$`

If we subtract `\(\beta\)`:

`$$\hat{\beta}-\beta = (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}e_{g})$$`

---
## Clustered Sampling

The mean of `\(\hat{\beta}-\beta\)` conditioning on all X is:

`$$E[\hat{\beta}-\beta|X] = (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}E[e_{g}|X_g]) = 0$$`

As clusters are assumed independent of each other we can write `\(X\)` as `\(X_g\)`.

This shows that OLS is unbiased under clustering if the conditional mean is linear, allowing `\(E[e_{g}|X_g]=0\)`

---
## Clustered Sampling (Example)

From Duflo et al. (2011) in 121 primary schools in Kenya. Students are randomly assigned into "tracking" classrooms or heterogeneous classrooms.

Discuss:

`$$TestScore_{ig} = -0.071 + 0.138Tracking_{g} + e_{ig}$$`

---
## Variance with clusters.

Let:

`$$\Sigma_g= E[e_ge'_g|X_g]$$`

Denoting the `\(n_g \times n_g\)` conditional covariance matrix of the errors within the `\(g_{th}\)` cluster.

- `\(e_g\)` is the vector of errors for all `\(n_g\)` observations in cluster `\(g\)`.
- `\(X_g\)` is the matrix of regressors corresponding to those observations.
- Conditional on `\(X_g\)`, we are focusing on the variation in the errors not explained by `\(X_g\)`.

This covariance matrix captures both the variance of individual errors and their correlation within the cluster. Off-diagonal elements are not zero.

---
## Variance of `\(\hat\beta\)` (reminder)

`$$\hat{\beta} = (X'X)^{-1}X'Y$$`
`$$Y = X\beta + e$$`
`$$\hat{\beta} = (X'X)^{-1}X'(X\beta + e)$$`
`$$\hat{\beta} = \color{green}{(X'X)^{-1}X'X}\beta + (X'X)^{-1}X'e$$`
`$$\hat{\beta} = \beta + (X'X)^{-1}X'e$$`

The conditional variance is:

`$$Var (\hat{\beta} | X) = Var (\beta + (X'X)^{-1}X'e | X)$$`
`$$\text{Var} (\hat{\beta} | X) = (X'X)^{-1}X' \text{Var}(e | X) X (X'X)^{-1}$$`

---
## Variance of `\(\hat\beta\)` with clusters.

Let: `\(\Sigma_g= E[e_ge'_g|X_g]\)`

`$$\text{Var}(e | X) = \text{blockdiag}(\Sigma_1, \Sigma_2, \dots, \Sigma_G).$$`

[Some code to view!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/7_covar_matrix_cluster_view.R)
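---
## Block-diagonal `\(\text{Var}(e|X)\)`: a toy example

A minimal sketch, separate from the linked course code: the cluster sizes, `\(\sigma^2\)` and `\(\rho\)` below are illustrative assumptions. It builds `\(\text{blockdiag}(\Sigma_1,...,\Sigma_G)\)` with equicorrelated errors within each cluster.

``` r
# Minimal sketch: block-diagonal Var(e|X) with within-cluster correlation rho
sigma2 <- 1
rho    <- 0.5
n_g    <- c(3, 2, 4)                          # cluster sizes, G = 3

Sigma_g <- lapply(n_g, function(m)
  sigma2 * (matrix(rho, m, m) + diag(1 - rho, m)))   # within-cluster Sigma_g

Omega <- as.matrix(Matrix::bdiag(Sigma_g))    # blockdiag(Sigma_1, ..., Sigma_G)
round(Omega, 2)                               # zero blocks across clusters
```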
---
## Variance of `\(\hat\beta\)` with clusters.

Let: `\(\Sigma_g= E[e_ge'_g|X_g]\)`

Hence:

`$$var[(\sum_{g=1}^G X'_ge_g)|X]=\sum_{g=1}^G var [X'_ge_g|X_g]$$`
`$$= \sum_{g=1}^G X'_g E[e_ge'_g|X_g]X_g$$`
`$$= \sum_{g=1}^G X'_g \Sigma_g X_g =_{def} \Omega_n$$`

`\(\Omega_n\)` captures how within-cluster error correlation contributes to the overall uncertainty in `\(\hat\beta\)`

---
## Variance of `\(\hat\beta\)` with clusters.

Hence:

`$$V_{\hat{\beta}}= var[\hat{\beta}|X] = (X'X)^{-1} \Omega_n(X'X)^{-1}$$`

This differs from the formula of the independent case, due to correlation within clusters.

---
## Variance with clusters (intuitively)

The variance difference depends on the degree of correlation between observations within clusters.

e.g. suppose the same number of observations within each cluster, `\(n_g = N\)`, with `\(E[e^2_{ig}|X] = \sigma^2\)`, `\(E[e_{ig}e_{lg}|X] = \sigma^2\rho\)` for `\(i\ne l\)`, and the same regressors within clusters. Hence:

`$$V_\hat{\beta} = (X'X)^{-1} \sigma^2 \color{green}{(1 + \rho(N-1))}$$`

For `\(\rho>0\)` this is approximately a multiple `\(\rho N\)` of the conventional formula.

**If cluster size is 100 and `\(\rho = 0.25\)`, the exact variance should be 25 times bigger, with SE five times bigger.**

But this depends on the number of clusters, the within-cluster `\(n\)` and the size of `\(\rho\)`.

---
## Variance with clusters

Arellano (1987) gives the cluster-robust covariance matrix that extends White:

The squared error `\(e^2_i\)` is unbiased for `\(E[e^2_i|X_i]=\sigma^2_i\)`

With cluster dependence the matrix `\(e_ge'_g\)` is unbiased for `\(E[e_ge'_g|X_g]=\Sigma_g\)`

The unbiased estimator for `\({\Omega_n}\)` is `\(\tilde{\Omega}_{n} = \sum^G_{g=1} X'_ge_ge'_gX_g\)`; replacing with residuals:

`$$\hat{\Omega}_n = \sum^G_{g=1} X'_g\hat{e}_g\hat{e}'_gX_g$$`

---
## Variance with clusters

`$$\hat{\Omega}_n = \sum^G_{g=1} X'_g\hat{e}_g\hat{e}'_gX_g$$`
`$$= \sum_{g=1}^G \sum_{i=1}^{n_g} \sum_{l=1}^{n_g} X_{ig} X'_{lg} \hat{e}_{ig} \hat{e}_{lg}$$`
`$$= \sum_{g=1}^G(\sum_{i=1}^{n_g}X_{ig}\hat{e}_{ig}) (\sum_{l=1}^{n_g}X_{lg}\hat{e}_{lg})'$$`

[Some beautiful code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/8_covar_matrix_cluster_estimation.R)

---
## Variance with clusters

A finite sample adjustment is: `\(a_n(X'X)^{-1}\hat{\Omega}_n(X'X)^{-1}\)`.

Where `\(a_n = ({n-1 \over n-k}) ({G \over G-1})\)` to improve performance when G is small.

This is the *Liang-Zeger* clustering adjustment. **Stata uses this when the *cluster* option is used.**

Example:

`$$TestScore_{ig} = -0.071 + 0.138 Tracking_g + e_{ig}$$`
`$$\quad (0.019) \quad (0.026)$$`
`$$\quad [0.054] \quad [0.054]$$`
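---
## Cluster-robust covariance in R

A minimal sketch with simulated clustered data (DGP, seed and cluster structure are illustrative assumptions): build `\(\hat{\Omega}_n\)` from the cluster sums `\(X'_g\hat{e}_g\)`, apply the `\(a_n\)` adjustment, and compare with `sandwich::vcovCL()`.

``` r
# Minimal sketch: cluster-robust (Arellano / Liang-Zeger) covariance by hand
library(sandwich)

set.seed(123)
G   <- 50                                  # number of clusters
n_g <- 20                                  # observations per cluster
id  <- rep(1:G, each = n_g)
v   <- rnorm(G)[id]                        # cluster-level error component
x   <- rnorm(G * n_g)
y   <- 1 + 2 * x + v + rnorm(G * n_g)

model <- lm(y ~ x)
X <- model.matrix(model)
e <- residuals(model)
n <- nrow(X); k <- ncol(X)

Xe    <- rowsum(X * e, id)                 # row g: sum_i X_ig * e_ig
Omega <- t(Xe) %*% Xe                      # sum_g X_g' e_g e_g' X_g
a_n   <- ((n - 1) / (n - k)) * (G / (G - 1))
V_cl  <- a_n * solve(t(X) %*% X) %*% Omega %*% solve(t(X) %*% X)

sqrt(diag(V_cl))                                       # clustered SEs by hand
sqrt(diag(vcovCL(model, cluster = id, type = "HC1")))  # should be very close
```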
---
#Multicollinearity

If `\(X'X\)` is singular then `\((X'X)^{-1}\)` and `\(\hat{\beta}\)` are not defined. This strict multicollinearity happens, for example, when `\(X_k=X_j\)`.

If we have near multicollinearity, coefficient estimates are imprecise. With `\(V_\hat{\beta} = (X'X)^{-1}\sigma^2\)`:

`$$Y = X_1\beta_1 + X_2\beta_2 + e$$`
`$${1 \over n} X'X = \left( \begin{array}{cc} 1 & \rho \\ \rho & 1\\ \end{array} \right)$$`
`$$var[\hat\beta|X] = {\sigma^2\over n}\left( \begin{array}{cc} 1 & \rho \\ \rho & 1\\ \end{array} \right)^{-1} = {\sigma^2 \over n(1-\rho^2)} \left( \begin{array}{cc} 1 & -\rho \\ -\rho & 1\\ \end{array} \right)$$`

The more "collinear" the regressors, the worse the precision of the estimates.

---
## Measures of Fit

`\(R^2\)` is defined as:

`$$R^2 = 1 - {{\sum_{i = 1}^{n}\hat{e_i}^2} \over \sum_{i = 1}^{n} (Y_i - \bar{Y})^2} = 1- {\hat{\sigma}^2 \over \hat{\sigma}^2_Y}$$`

Yet `\(\hat{\sigma}^2\)` and `\(\hat{\sigma}^2_Y\)` are biased. Hence:

`$$\bar{R^2} = 1 - {s^2 \over \tilde{\sigma}^2_Y} = 1- {{(n-k)^{-1}\sum_{i = 1}^{n}\hat{e}^2_i} \over (n-1)^{-1}\sum_{i = 1}^{n} (Y_i - \bar{Y})^2}$$`

This is the adjusted R-squared, commonly used.

---
## Measures of Fit

But it is preferred to use:

`$$\tilde{R^2} = 1 - {\tilde{\sigma}^2 \over \hat{\sigma}^2_Y} = 1- {{\sum_{i = 1}^{n}\tilde{e}^2_i} \over \sum_{i = 1}^{n} (Y_i - \bar{Y})^2}$$`

Where the `\(\tilde{e}^2_i\)` are **prediction** errors, not residuals.

`\(\tilde{R^2}\)` and `\(\bar{R^2}\)` are non-monotonic in the number of regressors.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h2>The End</h2>
</div>