class: center, middle

# OLS Finite Sample Properties

### Dr. Francisco J. Cabrera-Hernández
#### Econometría
#### Maestría en Economía Primavera 2024
#####CIDE Santa Fe, Ciudad de México.

---
##Introduction

We now investigate OLS **finite-sample** properties.

We recap the finite sample means and covariance matrix.

We focus on Standard Error propositions.

---
##Assumptions

For a sample of 1,000 people from Mexico, assuming that their responses are mutually independent is reasonable.

Assumption 1. **Random variables** `\(\{(X_n,Y_n)\}\)` are i.i.d. (from the same distribution)

Assumption 2. Variables `\((X,Y)\)` satisfy the **linear equation**:

`$$Y = X'\beta + e$$`
`$$E[e|X] = 0$$`

---
##Assumptions

Finite second moments: `\(E[Y^2] < \infty\)`; `\(E||X||^2 < \infty\)`

And **invertible matrix** `\(Q_{xx} = E[XX'] > 0\)`

Assumption 3 (if necessary). **Homoskedasticity:** `\(E[e^2|X] = \sigma^2(X) = \sigma^2\)`

---
##Expectation of LS estimator (Unbiased)

Using `\(\hat{\beta}=(X'X)^{-1}(X'Y)\)`

Assuming independence across `\(i\)` and linearity of expectations:

`$$E[Y_i|X_1,...,X_n] = E[Y_i|X_i] = X'_i\beta$$`

Stacking over `\(i\)`: `\(E[Y|X] = X\beta\)`.

Given the conditioning theorem `\(\color{green}{E[g(X)Y|X]=g(X)E[Y|X]}\)`:

`$$E[\hat{\beta}|X] = E[(X'X)^{-1}X'Y|X]$$`
`$$= (X'X)^{-1}X'E[Y|X]$$`
`$$=(X'X)^{-1}X'X\beta = \beta$$`

The key here is that `\(g(X)\)` is non-random, **given X**! And expectation distributes over linear transformations.

[Some code here!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/6_OLS_linearity.R)

---
##Expectation of LS estimator (Unbiased)

Similarly:

`$$\hat{\beta} = (X'X)^{-1} (X'(X\beta+e))$$`
`$$=(X'X)^{-1}X'X\beta+(X'X)^{-1}(X'e)$$`
`$$=\beta + (X'X)^{-1}X'e$$`

This is `\(\hat\beta\)` = `\(\beta\)` plus a stochastic component.

---
##Expectation of LS estimator (Unbiased)

Given:

`$$\hat\beta=\beta + (X'X)^{-1}X'e$$`

Then:

`$$E[\hat{\beta} - \beta|X] = E[(X'X)^{-1} X'e|X]$$`
`$$=(X'X)^{-1}X'E[e|X]=0$$`

`\(E[\hat{\beta}|X] = \beta\)`: the conditional distribution of `\(\hat{\beta}\)` centers at `\(\beta\)`, *for any realization of matrix X*.

Hence with i.i.d. sampling: `\(E(\hat{\beta}|X) = \beta\)` (conditionally unbiased).
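---
## Unbiasedness: a simulation check

A minimal sketch, not from the course repository: the data-generating process, seed, and sample size below are illustrative assumptions. It computes `\(\hat\beta=(X'X)^{-1}X'Y\)` by hand over repeated samples and checks that its average is close to `\(\beta\)`.

``` r
# Minimal sketch (assumed DGP): average of beta_hat across samples is ~ beta
set.seed(123)
beta <- c(1, 2)                  # true coefficients (intercept, slope)
reps <- 2000
est  <- matrix(NA, reps, 2)

for (r in 1:reps) {
  x <- rnorm(100)
  X <- cbind(1, x)                           # n x k regressor matrix
  e <- rnorm(100)                            # E[e|X] = 0
  y <- X %*% beta + e
  est[r, ] <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y
}

colMeans(est)   # close to c(1, 2): unbiasedness of the LS estimator
```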
---
## Variance of least square estimators

For any `\(r \times 1\)` random vector Z define the `\(r \times r\)` covariance matrix:

`$$Var[Z] = E[(Z-E[Z])(Z-E[Z])'] = E[ZZ'] - (E[Z])(E[Z])'$$`

For any pair (Z,X) define the conditional covariance matrix:

`$$Var[Z|X] = E[(Z-E[Z|X])(Z-E[Z|X])'|X]$$`

---
## Variance of least square estimators

We define `\(V_{\hat{\beta}} =_{def} Var[\hat{\beta}|X]\)` as the covariance matrix of the regression coefficients.

The conditional covariance matrix of the `\(n \times 1\)` regression error e is the `\(n \times n\)` matrix:

`$$var[e|X] = E[ee'|X]=_{def}D$$`

The `\(i_{th}\)` diagonal element of D is:

`$$E[e^2_i|X] = E[e^2_i|X_i] = \sigma^2_i$$`

The `\(ij_{th}\)` off-diagonal element of D is:

`$$E[e_ie_j|X] = E[e_i|X_i]E[e_j|X_j] = 0$$`

*This equality uses independence of observations.*

---
## Variance of least square estimators

`$$D = diag(\sigma^2_1,...,\sigma^2_n) = \left( \begin{array}{cccc} \sigma^2_1 & 0 & ... & 0 \\ 0 & \sigma^2_2 & ... & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & ... & \sigma^2_n \\ \end{array} \right)$$`

In the rare homoskedastic case `\(E[e^2_i|X_i] = \sigma^2_i = \sigma^2\)`:

`$$D = I_n\sigma^2$$`

---
## Variance of least square estimators

And for any `\(n \times r\)` matrix `\(A=A(X)\)`,

`$$var [A'Y|X] = var [A'e|X] = A'DA.$$`

For `\(\hat{\beta} = A'Y\)` where `\(A=X(X'X)^{-1}\)`, we have:

`$$V_{\hat{\beta}} = var[\hat{\beta}|X] = A'DA = (X'X)^{-1}X'D X(X'X)^{-1}$$`

Note that `\(X'DX = \sum_{i = 1}^{n}X_iX'_i\sigma^2_i\)` is a weighted version of `\(X'X.\)`

If homoskedastic: `\(D= I_n \sigma^2\)`; `\(X'DX = X'X\sigma^2\)` and the varcovar matrix simplifies to:

`$$V_{\hat{\beta}}= (X'X)^{-1}\sigma^2$$`

---
## Variance of least square estimators

`\(Y \sim N(\mu, \sigma^2) \to \mu = X\beta \to Y \sim N(X\beta, D)\)`, where if homoskedastic: `\(D=\sigma^2I\)`

`\(Y: n\times1\)`; `\(X: n\times k\)`; `\(\beta: k\times 1\)`; `\(D:n \times n\)`

- Covariance matrix (no assumptions)

`$$D = \begin{pmatrix} \sigma_{11}^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22}^2 & \sigma_{23} & \cdots & \sigma_{2n} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 & \cdots & \sigma_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \sigma_{n3} & \cdots & \sigma_{nn}^2 \end{pmatrix}$$`

---
## Variance of least square estimators

- Assuming Independence:

`$$D = \begin{pmatrix} \sigma_{1}^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_{2}^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_{3}^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{n}^2 \end{pmatrix}$$`

- Assuming Homoskedasticity:

`$$D = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} = \sigma^2 I_n$$`

---
## Variance of least square estimators (proof)

`$$\hat\beta = (X'X)^{-1}X'Y$$`
`$$var(\hat\beta) = var[(X'X)^{-1}X'Y]$$`

- This uses `\(var(aX)= a^2 var(X)\)`, and given `\(Y \sim N(X\beta, \color{green}{D})\)` with `\(\color{green}{D=\sigma^2 I}\)`:

`$$var(\hat\beta) = [(X'X)^{-1}X'] \color{green}{\sigma^2 I} [(X'X)^{-1}X']'$$`
`$$var(\hat\beta) = \sigma^2[(X'X)^{-1}X'] I [X(X'X)^{-1}]$$`

- Given: `\([(X'X)^{-1}]'= (X'X)^{-1}\)`

Under homoskedasticity and with independence of errors:

`$$var(\hat\beta) = \sigma^2(X'X)^{-1}$$`
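---
## Homoskedastic variance: a quick check

A minimal sketch with simulated data (DGP and seed are illustrative assumptions): compute `\(s^2(X'X)^{-1}\)` by hand, using the degrees-of-freedom corrected error variance `\(s^2\)` discussed later, and compare it with `vcov()` from `lm()`, which uses the same homoskedastic formula.

``` r
# Minimal sketch: s^2 (X'X)^{-1} equals the lm() covariance matrix
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                        # homoskedastic errors

model <- lm(y ~ x)
X  <- model.matrix(model)                        # n x k design matrix
s2 <- sum(residuals(model)^2) / (n - ncol(X))    # s^2 = RSS / (n - k)

V_manual <- s2 * solve(t(X) %*% X)               # s^2 (X'X)^{-1}
all.equal(V_manual, vcov(model), check.attributes = FALSE)  # TRUE
```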
---
## Gauss-Markov Theorem

The LS estimator is the case when `\(A=X(X'X)^{-1}\)`. What is the best choice of A?

The Gauss-Markov theorem states *LS is the best choice* of `\(A\)`, among linear unbiased estimators, **when errors are homoskedastic.**

With `\(E[Y|X]=X\beta\)` and for any linear `\(\tilde{\beta}=A'Y\)` we have:

`$$E[\tilde{\beta}|X] = A'E[Y|X] = A'X\beta$$`

`\(\tilde{\beta}\)` is unbiased if `\(A'X = I_{k}\)`

---
## Gauss-Markov Theorem

Furthermore:

`$$Var[\tilde{\beta}|X]=var[A'Y|X]=A'DA=A'A\sigma^2$$`

The last equality comes from: `\(D=I_n\sigma^2\)`

The BLUE comes from finding the matrix `\(A_o\)` that satisfies `\(A_o'X=I_k\)` such that `\(A_o'A_o\)` is minimized in the positive semi-definite sense.

Any other choice of `\(A\)` yields a variance `\(A'A\sigma^2\)` that is at least as large.

---
## Gauss-Markov Theorem

We have seen that LS satisfies this among linear estimators with i.i.d. sampling.

So if `\(\tilde{\beta}\)` is a linear unbiased estimator of `\(\beta\)` then:

`$$Var[\tilde{\beta}|X] \ge \sigma^2(X'X)^{-1}$$`

No unbiased linear estimator can have a variance matrix smaller (in the positive definite sense) than `\(\sigma^2(X'X)^{-1}\)`

---
## Gauss-Markov Theorem (proof)

Let `\(A\)` be any `\(n \times k\)` function of `\(X\)` such that `\(\color{green}{A'X = I_k}\)`

The estimator `\(A'Y\)` is unbiased for `\(\beta\)` with variance `\(A'A\sigma^2\)`

It is sufficient to show that the difference between `\(A'A\)` and `\((X'X)^{-1}\)` is positive semi-definite, or:

`$$A'A-(X'X)^{-1} \ge 0$$`

---
## Gauss-Markov Theorem (proof)

Set:

`$$C= A-X(X'X)^{-1}$$`
`$$\color{green}{A = C+X(X'X)^{-1}}$$`

Note that `\(X'C=0\)` because `\(A'X = I_k\)`.

`$$A'A-(X'X)^{-1} = (\color{green}{C+X(X'X)^{-1}})'(C + X(X'X)^{-1}) - (X'X)^{-1}$$`
`$$=C'C + C'X(X'X)^{-1} + (X'X)^{-1}X'C + \\ \color{red}{(X'X)^{-1}X'X(X'X)^{-1} - (X'X)^{-1}}$$`
`$$=C'C\ge 0$$`

The cross terms vanish because `\(X'C = 0\)`, and `\(C'C\)` is positive semi-definite.

---
## Generalized Least Squares

Model in matrix form:

`$$Y = X\beta + e$$`

Consider a generalized situation where the errors are heteroskedastic:

`$$E[e|X]=0$$`
`$$var[e|X]= \Omega$$`

`\(\Omega\)` allows for i.i.d. sampling, where `\(\Omega=D\)`, but also for non-diagonal covariance matrices.

Hence:

`$$E[\hat{\beta}|X] = \beta$$`
`$$var[\hat{\beta}|X] = (X'X)^{-1} (X'\Omega X) (X'X)^{-1}$$`

---
## Generalized Least Squares

A generalized Gauss-Markov bound is:

`$$var[\tilde{\beta}|X] \ge \ (X'\Omega^{-1} X)^{-1}$$`

This holds when we know `\(\Omega\)` up to scale. Under homoskedasticity and i.i.d. sampling this bound reduces to `\(\sigma^2(X'X)^{-1}\)`; otherwise the OLS variance is (weakly) larger than the bound.

Suppose that we know `\(\Omega=c^2\Sigma\)`, where `\(c^2>0\)` and real; `\(\Sigma\)` is `\(n \times n\)` **and known**

---
## Generalized Least Squares

A case of GLS is where we pre-multiply by `\(\Sigma^{-1/2}\)`, producing:

`$$\tilde{\beta}_{gls} = (\tilde{X}'\tilde{X})^{-1} (\tilde{X}'\tilde{Y})$$`
`$$= ((\Sigma^{-1/2}{X})' (\Sigma^{-1/2}{X}))^{-1} (\Sigma^{-1/2}{X})' (\Sigma^{-1/2}{Y})$$`
`$$= (X' \Sigma^{-1}X)^{-1} X'\Sigma^{-1}Y$$`

Hence:

`$$E[\tilde{\beta}_{gls}|X] = \beta$$`
`$$Var[\tilde{\beta}_{gls}|X] = (X'\Omega^{-1} X)^{-1}$$`

---
## Generalized Least Squares

The variance lower bound is sharp when `\(\Sigma\)` **is known**. And GLS is efficient under heteroskedasticity.

In the linear regression model with independent observations and known conditional variances, so that `\(\Omega = \Sigma = D = diag(\sigma^2_1,...,\sigma^2_n)\)`, GLS takes the form:

`$$\tilde\beta_{gls}=(X'D^{-1}X)^{-1}X'D^{-1}Y$$`

In practice the covariance matrix `\(\Omega\)` is unknown; it can be estimated (Feasible GLS).

**No longer common in current applied econometric practice.**
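---
## GLS: a small sketch

A minimal sketch assuming a *known* diagonal `\(\Sigma\)` (the variance function below is an illustrative assumption): it computes `\((X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y\)` directly and via `lm()` with weights `\(1/\sigma^2_i\)`, which give the same estimate.

``` r
# Minimal sketch: GLS with known diagonal Sigma vs. weighted least squares
set.seed(123)
n  <- 200
x  <- rnorm(n)
s2 <- (1 + abs(x))^2                      # known conditional variances sigma_i^2
y  <- 1 + 2 * x + rnorm(n, sd = sqrt(s2))

X <- cbind(1, x)
Sigma_inv <- diag(1 / s2)                 # Sigma^{-1}, n x n and known

beta_gls <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% y)
beta_gls                                  # (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y

coef(lm(y ~ x, weights = 1 / s2))         # identical: WLS with weights 1/sigma_i^2
```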
---
## Estimation of Error Variance

The error variance `\(\sigma^2 = E[e^2]\)` measures the unexplained part of the regression.

Its method of moments estimator is:

`$$\hat{\sigma}^2 = {1 \over n} \sum_{i = 1}^{n}\hat{e_i}^2$$`

It can be shown (BH p. 108) that under conditional homoskedasticity `\(E[e^2|X] = \sigma^2\)`, so that `\(D = I_{n}\sigma^2\)`:

`$$E[\hat{\sigma}^2 | X] = {1 \over n} tr(M\sigma^2) = {\sigma^2 \left({n-k \over n}\right)}$$`

Showing that `\(\hat{\sigma}^2\)` **is biased towards zero**; this bias is more important if `\(k/n\)` is large.

---
## Estimation of Error Variance

This can be rescaled to:

`$$s^2= {1 \over n-k}\sum_{i = 1}^{n}\hat{e_i}^2$$`
`$$E[s^2|X] = \sigma^2$$`

---
## Homoskedastic Covariance Matrix Estimation

For inference we need to estimate the covariance matrix `\(V_{\hat{\beta}}\)`

Under homoskedasticity: `\(V^0_{\hat{\beta}} = (X'X)^{-1}\sigma^2\)` or `\(\hat{V}^0_{\hat{\beta}} = (X'X)^{-1}s^2\)`

Conditionally unbiased:

`$$E[\hat{V}^0_{\hat{\beta}}|X] = (X'X)^{-1} E[s^2|X] = (X'X)^{-1} \sigma^2 = V_{\hat{\beta}}$$`

*If the regression error is heteroskedastic it is possible for `\(\hat{V}^0_{\hat{\beta}}\)` to be biased.*

---
## Biased Covariance Matrix Estimation

Remember that `\(X'DX = \color{green} {\sum_{i=1}^n X_iX_i'\sigma^2_i}\)` (shown earlier)

Suppose `\(k=1\)` and `\(\sigma^2_i = X^2_i\)`, implying `\(\sigma^2 = E[\sigma^2_i] = E[X^2]\)`.

If we use `\(\hat{V}^0_{\hat{\beta}} = (X'X)^{-1}s^2\)` even though the error is heteroskedastic, the ratio of the true variance to the expectation of the variance estimator is:

`$${V_{\hat{\beta}} \over E[\hat{V}^0_{\hat{\beta}}|X]} = {\sum_{i = 1}^{n}X_i^{4} \over \sigma^2\sum_{i = 1}^{n}X_i^{2}}= {E[X_i^4] \over (E[X_i^2])^2} =_{def} \kappa$$`

Where `\(\kappa\)` is the standardized kurtosis. If `\(X \sim N(0,\sigma^2)\)`, `\(\kappa = 3\)`, so the true variance is 3 times higher than the estimator suggests in this example.

---
## Heteroskedastic Covariance Matrix Estimation

We can construct a varcovar matrix estimator not requiring homoskedasticity. General form:

`$$V_{\hat{\beta}} = var[\hat{\beta}|X] = (X'X)^{-1}X'D X(X'X)^{-1}$$`
`$$D = diag(\sigma^2_1,...,\sigma^2_n)=E[ee'|X] = E[\tilde{D}|X]$$`

Where `\(\tilde{D}=diag(e_1^2,...,e_n^2)\)`. `\(\tilde{D}\)` is a conditionally unbiased estimator of `\(D\)`.

If `\(e^2_i\)` were observable we could construct:

`$$\hat{V}^{ideal}_{\hat{\beta}} = (X'X)^{-1}X'\tilde{D} X(X'X)^{-1}$$`
`$$= (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_ie^{2}_i)(X'X)^{-1}$$`

---
## Heteroskedastic Covariance Matrix Estimation

From here:

`$$E[\hat{V}^{ideal}_{\hat{\beta}}|X] = (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_iE[e^{2}_i|X])(X'X)^{-1}$$`
`$$= (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_i\sigma^2_i)(X'X)^{-1}$$`
`$$= (X'X)^{-1}X'D X(X'X)^{-1} = V_{\hat{\beta}}$$`
`$$E[\hat{V}^{ideal}_{\hat{\beta}}] = V_{\hat{\beta}}$$`

Verifying that it is unbiased.

---
## Heteroskedastic Covariance Matrix Estimation

Under heteroskedasticity:

`$$var(\hat\beta) = [(X'X)^{-1}X']D[X(X'X)^{-1}]$$`

- `\(D: n \times n\)`, and you cannot estimate `\(n \times n\)` elements with n observations.

- Instead of estimating the full covar matrix, the diagonal can be estimated with residuals. A "weighted" version *(White, 1980)*

---
## Heteroskedastic Covariance Matrix Estimation

Also, as the `\(e^2_i\)` are unobserved, `\(\hat{V}^{ideal}_{\hat{\beta}}\)` is not feasible. So we replace `\(e^2_i\)` with the residuals `\(\hat{e}^2_i\)`, obtaining:

`$$\hat{V}^{HC0}_{\hat{\beta}} = (X'X)^{-1} \large( \sum_{i = 1}^{n}X_iX'_i\hat{e}^{2}_i \large)(X'X)^{-1}$$`

This is the "baseline" heteroskedasticity-consistent covar matrix estimator.

---
## Heteroskedastic Covariance Matrix Estimation

Furthermore, as `\(\hat{e}^{2}_i\)` is biased towards zero, we rescale by `\(n/(n-k)\)`:

`$$\hat{V}^{HC1}_{\hat{\beta}} = ({n \over {n-k}}) (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}$$`

These are robust, heteroskedasticity-consistent, or heteroskedasticity-robust covar matrices. HC0 is the Eicker-White or White covariance matrix estimator.

**HC errors are not the default in Stata**. If *robust* is added, it is HC1.

Standard errors are the square root of the diagonal elements of `\(\hat{V}^{}_{\hat{\beta}}\)`
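---
## HC0 and HC1 by hand

A minimal sketch with simulated heteroskedastic data (DGP and seed are illustrative assumptions): build the sandwich `\((X'X)^{-1}(\sum_{i}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}\)` by hand and compare it with `sandwich::vcovHC()`.

``` r
# Minimal sketch: HC0/HC1 by hand vs. sandwich::vcovHC()
library(sandwich)

set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))   # heteroskedastic errors

model <- lm(y ~ x)
X  <- model.matrix(model)
e2 <- residuals(model)^2
k  <- ncol(X)

bread    <- solve(t(X) %*% X)
meat_hat <- t(X) %*% (X * e2)                # sum_i X_i X_i' e_i^2

V_hc0 <- bread %*% meat_hat %*% bread
V_hc1 <- (n / (n - k)) * V_hc0

all.equal(V_hc0, vcovHC(model, type = "HC0"), check.attributes = FALSE)  # TRUE
all.equal(V_hc1, vcovHC(model, type = "HC1"), check.attributes = FALSE)  # TRUE
sqrt(diag(V_hc1))                            # robust standard errors
```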
---
## Estimation Example

With `\(n \to \infty\)` and under homoskedasticity:

`$$var(\hat \beta_1) = \frac {\hat\sigma^2} {\sum_{i=1}^n (x_i - \bar x)^2}$$`

`\(\hat{var}(\hat\beta_1)\)` falls at a rate 1/n. This rate refers to *efficiency*.

The heteroskedasticity-robust covar matrix falls at a lower rate.

---
##Homoskedastic Convergence

``` r
library(sandwich)                      # for vcovHC()

repet <- 5000
beta_1_true <- 2                       # true slope used in the simulation
running_variances <- NULL

# Set seed for reproducibility
set.seed(123456)

# For increasing sample sizes, estimate the model and store the HC1
# variance of beta_1
for (i in 50:repet) {
  x <- rnorm(i)                        # Regressor x
  u <- rnorm(i, sd = 2)                # Homoskedastic random error
  y <- 2 + beta_1_true * x + u         # Define y, with beta_1 = 2
  model <- lm(y ~ x)                   # Estimate the model

  # Store the robust variance of beta_1 for the current sample size
  robust_vcov <- vcovHC(model, type = "HC1")
  running_variances[i] <- diag(robust_vcov)[2]
}
# we then plot the running variance...
```

---
##Homoskedastic Convergence

<img src="data:image/png;base64,#Finite_v1_files/figure-html/unnamed-chunk-2-1.png" width="65%" style="display: block; margin: auto;" />

---
##Heteroskedastic Convergence

``` r
library(sandwich)                      # for vcovHC()

repet <- 5000
beta_1_true <- 2                       # true slope used in the simulation
running_variances <- NULL

# Set seed for reproducibility
set.seed(123456)

# For increasing sample sizes, estimate the model and store the HC1
# variance of beta_1
for (i in 50:repet) {
  x <- rnorm(i)                        # Regressor x
  u <- rnorm(i, sd = 2 + abs(x))       # Heteroskedastic random error
  y <- 2 + beta_1_true * x + u         # Define y, with beta_1 = 2
  model <- lm(y ~ x)                   # Estimate the model

  # Store the robust variance of beta_1 for the current sample size
  robust_vcov <- vcovHC(model, type = "HC1")
  running_variances[i] <- diag(robust_vcov)[2]
}
# we then plot the robust variance estimates...
```

---
##Variance of `\(\hat\beta_1\)` Convergence

<img src="data:image/png;base64,#Finite_v1_files/figure-html/unnamed-chunk-4-1.png" width="75%" style="display: block; margin: auto;" />

---
## Other Heteroskedastic Variance Estimations

HC2 and HC3 come from standardized errors `\(\bar{e}\)` and prediction errors `\(\tilde{e}\)`, respectively.

Where: `\(\hat{V}^{HC0}_{\hat{\beta}} < \hat{V}^{HC2}_{\hat{\beta}} < \hat{V}^{HC3}_{\hat{\beta}}\)`

---
## Other Heteroskedastic Variance Estimations

Before, we define **Leverage Values:**

There are `\(n\)` leverage values denoted as `\(h_{ii}\)` for `\(i=1,...,n\)`

`$$h_{ii} = X'_i(X'X)^{-1}X_i$$`

The leverage value is a normalized length of the observed regressor vector `\(X_i\)` and is between 0 and 1.

It measures how unusual the `\(i_{th}\)` observation `\(X_i\)` is relative to the other observations in the sample.

An **extreme** example of `\(h_{ii}=1\)` is a dummy variable that equals 1 for only one observation.

---
## Heteroskedastic Covariance Matrix Estimation

`$$\hat{V}^{HC2}_{\hat{\beta}} = (X'X)^{-1}(\sum_{i = 1}^{n}X_iX'_i\bar{e}^{2}_i)(X'X)^{-1}$$`
`$$\hat{V}^{HC2}_{\hat{\beta}} = (X'X)^{-1}(\sum_{i = 1}^{n}(1-h_{ii})^{-1}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}$$`

If there is an observation with `\(h_{ii}\)` close to one, then `\((1-h_{ii})^{-1}\)` is large, giving this observation more weight.

---
## Heteroskedastic Covariance Matrix Estimation

While:

`$$\hat{V}^{HC3}_{\hat{\beta}} = (X'X)^{-1}(\sum_{i = 1}^{n}(1-h_{ii})^{-2}X_iX'_i\hat{e}^{2}_i)(X'X)^{-1}$$`

In **actividad 2** you will show that HC2 is unbiased (see BH p.113)
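---
## Leverage and HC2/HC3 in R

A minimal sketch with simulated data (the DGP is an illustrative assumption): `hatvalues()` returns the leverage values `\(h_{ii}\)`, and `sandwich::vcovHC()` implements the HC2 and HC3 weightings.

``` r
# Minimal sketch: leverage values and HC0/HC2/HC3 standard errors
library(sandwich)

set.seed(123)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))
model <- lm(y ~ x)

h <- hatvalues(model)                 # leverage h_ii = X_i'(X'X)^{-1}X_i
range(h)                              # each h_ii lies between 0 and 1

se <- function(V) sqrt(diag(V))
se(vcovHC(model, type = "HC0"))
se(vcovHC(model, type = "HC2"))       # weights (1 - h_ii)^{-1}
se(vcovHC(model, type = "HC3"))       # weights (1 - h_ii)^{-2}, typically largest
```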
---
## Clustered Sampling

Samples could be correlated within groups (not across). For example when studying schools, firms, households or localities.

This is `\(Y_{ig}, X_{ig}\)` where `\(g = 1,...,G\)` indexes the cluster.

The number of observations per cluster is `\(n_g\)` and `\(n=\sum_{g = 1}^{G}n_g\)`.

A model is:

`$$Y_{ig} = X'_{ig}\beta + e_{ig}$$`

---
## Clustered Sampling

Or we can use cluster notation:

`$$Y_g= X_g\beta+e_g$$`

Where `\(e_g = (e_{1g},..., e_{n_gg})'\)` is an `\(n_g \times 1\)` error vector.

We can write the sums over observations as `\(\sum_{g = 1}^{G}\sum_{i=1}^{n_g}\)`

This is the sum across clusters of the sum across observations within each cluster.

---
## Clustered Sampling

OLS is:

`$$\hat\beta= (\sum_{g = 1}^{G}\sum_{i=1}^{n_g}X_{ig}X'_{ig})^{-1} (\sum_{g = 1}^{G}\sum_{i=1}^{n_g}X_{ig}Y_{ig})$$`
`$$= (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}Y_{g})$$`
`$$=(X'X)^{-1}(X'Y)$$`

With residuals `\(\hat{e}_{ig}= Y_{ig}-X'_{ig}\hat{\beta}\)` or `\(\hat{e}_{g}= Y_{g}-X_{g}\hat{\beta}\)` (in cluster level notation)

---
## Clustered Sampling

We assume that clusters are mutually independent and that errors are conditionally mean zero: `\(E[e_{g}|X_{g}]=0\)`.

This holds if **all interaction effects within clusters** have been accounted for in the specification of the individual regressors `\(X_{ig}\)`.

e.g. the achievement of any student is unaffected by the individual `\(X_i\)` (e.g. age, gender and test scores) of other students within the same school.

---
## Clustered Sampling

We can calculate the mean of the OLS estimator by substituting

`$$Y_g= X_g\beta+e_g$$`

into

`$$\hat\beta = (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}Y_{g})$$`

If we subtract `\(\beta\)`:

`$$\hat{\beta}-\beta = (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}e_{g})$$`

---
## Clustered Sampling

The mean of `\(\hat{\beta}-\beta\)` conditioning on all X is:

`$$E[\hat{\beta}-\beta|X] = (\sum_{g = 1}^{G}X'_gX_{g})^{-1} (\sum_{g = 1}^{G}X'_{g}E[e_{g}|X_g]) = 0$$`

As clusters are assumed independent of each other we can write `\(X\)` as `\(X_g\)`.

This shows that OLS is unbiased under clustering if the conditional mean is linear, allowing `\(E[e_{g}|X_g]=0\)`

---
## Clustered Sampling (Example)

From Duflo et al. (2011) in 121 primary schools in Kenya. Students are randomly assigned into "tracking" classrooms or heterogeneous classrooms.

Discuss:

`$$TestScore_{ig} = -0.071 + 0.138Tracking_{g} + e_{ig}$$`

---
## Variance with clusters.

Let:

`$$\Sigma_g= E[e_ge'_g|X_g]$$`

Denoting the `\(n_g \times n_g\)` conditional covariance matrix of the errors within the `\(g_{th}\)` cluster.

- `\(e_g\)` is the vector of errors for all `\(n_g\)` observations in cluster `\(g\)`.
- `\(X_g\)` is the matrix of regressors corresponding to those observations.
- Conditional on `\(X_g\)`, we are focusing on the variation in the errors not explained by `\(X_g\)`.

This covariance matrix captures both the variance of individual errors and their correlation within the cluster. Off-diagonal elements are not zero.

---
## Variance of `\(\hat\beta\)` (reminder)

`$$\hat{\beta} = (X'X)^{-1}X'Y$$`
`$$Y = X\beta + e$$`
`$$\hat{\beta} = (X'X)^{-1}X'(X\beta + e)$$`
`$$\hat{\beta} = \color{green}{(X'X)^{-1}X'X}\beta + (X'X)^{-1}X'e$$`
`$$\hat{\beta} = \beta + (X'X)^{-1}X'e$$`

The conditional variance is:

`$$Var (\hat{\beta} | X) = Var (\beta + (X'X)^{-1}X'e | X)$$`
`$$\text{Var} (\hat{\beta} | X) = (X'X)^{-1}X' \text{Var}(e | X) X (X'X)^{-1}$$`

---
## Variance of `\(\hat\beta\)` with clusters.

Let: `\(\Sigma_g= E[e_ge'_g|X_g]\)`

`$$\text{Var}(e | X) = \text{blockdiag}(\Sigma_1, \Sigma_2, \dots, \Sigma_G).$$`

[Some code to view!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/7_covar_matrix_cluster_view.R)
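---
## Block-diagonal `\(\text{Var}(e|X)\)`: a toy example

A minimal sketch, separate from the linked course code: the cluster sizes, `\(\sigma^2\)` and `\(\rho\)` below are illustrative assumptions. It builds `\(\text{blockdiag}(\Sigma_1,...,\Sigma_G)\)` with equicorrelated errors within each cluster.

``` r
# Minimal sketch: block-diagonal Var(e|X) with within-cluster correlation rho
sigma2 <- 1
rho    <- 0.5
n_g    <- c(3, 2, 4)                          # cluster sizes, G = 3

Sigma_g <- lapply(n_g, function(m)
  sigma2 * (matrix(rho, m, m) + diag(1 - rho, m)))   # within-cluster Sigma_g

Omega <- as.matrix(Matrix::bdiag(Sigma_g))    # blockdiag(Sigma_1, ..., Sigma_G)
round(Omega, 2)                               # zero blocks across clusters
```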
---
## Variance of `\(\hat\beta\)` with clusters.

Let: `\(\Sigma_g= E[e_ge'_g|X_g]\)`

Hence:

`$$var[(\sum_{g=1}^G X'_ge_g)|X]=\sum_{g=1}^G var [X'_ge_g|X_g]$$`
`$$= \sum_{g=1}^G X'_g E[e_ge'_g|X_g]X_g$$`
`$$= \sum_{g=1}^G X'_g \Sigma_g X_g =_{def} \Omega_n$$`

`\(\Omega_n\)` captures how within-cluster error correlation contributes to the overall uncertainty in `\(\hat\beta\)`

---
## Variance of `\(\hat\beta\)` with clusters.

Hence:

`$$V_{\hat{\beta}}= var[\hat{\beta}|X] = (X'X)^{-1} \Omega_n(X'X)^{-1}$$`

This differs from the formula of the independent case, due to correlation within clusters.

---
## Variance with clusters (intuitively)

The variance difference depends on the degree of correlation between observations within clusters.

e.g. suppose the same number of observations within each cluster, `\(n_g = N\)`, with `\(E[e^2_{ig}|X] = \sigma^2\)`, `\(E[e_{ig}e_{lg}|X] = \sigma^2\rho\)` for `\(i\ne l\)`, and the same regressors within clusters. Hence:

`$$V_\hat{\beta} = (X'X)^{-1} \sigma^2 \color{green}{(1 + \rho(N-1))}$$`

For `\(\rho>0\)` this is approximately a multiple `\(\rho N\)` of the conventional formula.

**If cluster size is 100 and `\(\rho = 0.25\)`, the exact variance should be 25 times bigger, with SE five times bigger.**

But this depends on the number of clusters, the within-cluster `\(n\)` and the size of `\(\rho\)`.

---
## Variance with clusters

Arellano (1987) gives the cluster-robust covariance matrix that extends White:

The squared error `\(e^2_i\)` is unbiased for `\(E[e^2_i|X_i]=\sigma^2_i\)`

With cluster dependence the matrix `\(e_ge'_g\)` is unbiased for `\(E[e_ge'_g|X_g]=\Sigma_g\)`

The unbiased estimator for `\({\Omega_n}\)` is `\(\tilde{\Omega}_{n} = \sum^G_{g=1} X'_ge_ge'_gX_g\)`; replacing with residuals:

`$$\hat{\Omega}_n = \sum^G_{g=1} X'_g\hat{e}_g\hat{e}'_gX_g$$`

---
## Variance with clusters

`$$\hat{\Omega}_n = \sum^G_{g=1} X'_g\hat{e}_g\hat{e}'_gX_g$$`
`$$= \sum_{g=1}^G \sum_{i=1}^{n_g} \sum_{l=1}^{n_g} X_{ig} X'_{lg} \hat{e}_{ig} \hat{e}_{lg}$$`
`$$= \sum_{g=1}^G(\sum_{i=1}^{n_g}X_{ig}\hat{e}_{ig}) (\sum_{l=1}^{n_g}X_{lg}\hat{e}_{lg})'$$`

[Some beautiful code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/8_covar_matrix_cluster_estimation.R)

---
## Variance with clusters

A finite sample adjustment is: `\(a_n(X'X)^{-1}\hat{\Omega}_n(X'X)^{-1}\)`.

Where `\(a_n = ({n-1 \over n-k}) ({G \over G-1})\)` to improve performance when G is small.

This is the *Liang-Zeger* clustering adjustment. **Stata uses this when the *cluster* option is used.**

Example:

`$$TestScore_{ig} = -0.071 + 0.138 Tracking_g + e_{ig}$$`
`$$\quad (0.019) \quad (0.026)$$`
`$$\quad [0.054] \quad [0.054]$$`
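---
## Cluster-robust covariance in R

A minimal sketch with simulated clustered data (DGP, seed and cluster structure are illustrative assumptions): build `\(\hat{\Omega}_n\)` from the cluster sums `\(X'_g\hat{e}_g\)`, apply the `\(a_n\)` adjustment, and compare with `sandwich::vcovCL()`.

``` r
# Minimal sketch: cluster-robust (Arellano / Liang-Zeger) covariance by hand
library(sandwich)

set.seed(123)
G   <- 50                                  # number of clusters
n_g <- 20                                  # observations per cluster
id  <- rep(1:G, each = n_g)
v   <- rnorm(G)[id]                        # cluster-level error component
x   <- rnorm(G * n_g)
y   <- 1 + 2 * x + v + rnorm(G * n_g)

model <- lm(y ~ x)
X <- model.matrix(model)
e <- residuals(model)
n <- nrow(X); k <- ncol(X)

Xe    <- rowsum(X * e, id)                 # row g: sum_i X_ig * e_ig
Omega <- t(Xe) %*% Xe                      # sum_g X_g' e_g e_g' X_g
a_n   <- ((n - 1) / (n - k)) * (G / (G - 1))
V_cl  <- a_n * solve(t(X) %*% X) %*% Omega %*% solve(t(X) %*% X)

sqrt(diag(V_cl))                                       # clustered SEs by hand
sqrt(diag(vcovCL(model, cluster = id, type = "HC1")))  # should be very close
```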
---
#Multicollinearity

If `\(X'X\)` is singular then `\((X'X)^{-1}\)` and `\(\hat{\beta}\)` are not defined. This strict multicollinearity happens, for example, when `\(X_k=X_j\)`.

If we have near multicollinearity, coefficient estimates are imprecise. With `\(V_\hat{\beta} = (X'X)^{-1}\sigma^2\)`:

`$$Y = X_1\beta_1 + X_2\beta_2 + e$$`
`$${1 \over n} X'X = \left( \begin{array}{cc} 1 & \rho \\ \rho & 1\\ \end{array} \right)$$`
`$$var[\hat\beta|X] = {\sigma^2\over n}\left( \begin{array}{cc} 1 & \rho \\ \rho & 1\\ \end{array} \right)^{-1} = {\sigma^2 \over n(1-\rho^2)} \left( \begin{array}{cc} 1 & -\rho \\ -\rho & 1\\ \end{array} \right)$$`

The more "collinear" the regressors, the worse the precision of the estimates.

---
## Measures of Fit

`\(R^2\)` is defined as:

`$$R^2 = 1 - {{\sum_{i = 1}^{n}\hat{e_i}^2} \over \sum_{i = 1}^{n} (Y_i - \bar{Y})^2} = 1- {\hat{\sigma}^2 \over \hat{\sigma}^2_Y}$$`

Yet `\(\hat{\sigma}^2\)` and `\(\hat{\sigma}^2_Y\)` are biased. Hence:

`$$\bar{R^2} = 1 - {s^2 \over \tilde{\sigma}^2_Y} = 1- {{(n-k)^{-1}\sum_{i = 1}^{n}\hat{e}^2_i} \over (n-1)^{-1}\sum_{i = 1}^{n} (Y_i - \bar{Y})^2}$$`

This is the adjusted R-squared, commonly used.

---
## Measures of Fit

But it is preferred to use:

`$$\tilde{R^2} = 1 - {\tilde{\sigma}^2 \over \hat{\sigma}^2_Y} = 1- {{\sum_{i = 1}^{n}\tilde{e}^2_i} \over \sum_{i = 1}^{n} (Y_i - \bar{Y})^2}$$`

Where the `\(\tilde{e}^2_i\)` are **prediction** errors, not residuals.

`\(\tilde{R^2}\)` and `\(\bar{R^2}\)` are non-monotonic in the number of regressors.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h2>The End</h2>
</div>