class: center, middle

# Asymptotic Properties of OLS

### Dr. Francisco J. Cabrera-Hernández

#### Econometrics

#### Master's in Economics, Spring 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Introduction

We now investigate the asymptotic properties of OLS. These apply to both the CEF model and the projection model.

`$$Y = X'\beta + e$$`

`$$\beta = (E[XX'])^{-1} E[XY]$$`

We maintain the assumptions:

1. `\((Y_i, X_i),\ i=1,\ldots,n,\)` are i.i.d.
2. `\(E[Y^2] < \infty\)`
3. `\(E||X||^2 < \infty\)`
4. `\(Q_{XX} = E[XX']\)` is positive definite.

---

## Consistency

**Definition 1**. A sequence of random vectors `\(Z_n \in \mathbb{R}^k\)` converges in probability to `\(Z\)` as `\(n \to \infty\)` if, for every `\(\delta > 0\)`,

`$$\lim_{n \to \infty} \mathbb{P}[||Z_n-Z||<\delta]=1$$`

`\(Z\)` is the probability limit of `\(Z_n\)`:

`$$Z_n \to_p Z \quad \text{as} \quad n \to \infty$$`

For a random vector, this holds if and only if each element of the vector converges in probability to its limit.

---

## Consistency

**Definition 2**. Let `\(Z_n\)` be random vectors with distributions `\(F_n(u) = \mathbb{P}[Z_n \le u]\)`, where `\(u \in \mathbb{R}^k\)` is a fixed vector.

If `\(F_n(u) \to F(u)\)` as `\(n \to \infty\)` for all `\(u\)` at which `\(F(u) = \mathbb{P}[Z \le u]\)` is continuous, we say that `\(Z_n \to_d Z\)`, or that `\(Z_n\)` **converges in distribution** to `\(Z\)` as `\(n \to \infty\)`.

`\(Z\)` and `\(F(u)\)` are called the asymptotic distribution, large-sample distribution, or limit distribution of `\(Z_n\)`.

[Some nice code](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/10_distribution_convergence.R)

---

## Consistency

**Weak Law of Large Numbers (WLLN)**

If `\(Y_i \in \mathbb{R}^k\)` are i.i.d. and `\(E||Y|| < \infty\)` then as `\(n \to \infty\)`

`$$\bar{Y} = {1 \over n} \sum_{i=1}^{n} Y_i \to_p E[Y]$$`

The sample mean `\(\bar{Y}\)` converges in probability to the true population expectation `\(\mu = E[Y]\)`.

An estimator `\(\hat{\theta}\)` is **consistent** if `\(\hat{\theta} \to_p \theta\)` as `\(n \to \infty\)`.

---

## Central Limit Theorem

If `\(Y_i \in \mathbb{R}^k\)` are i.i.d. and `\(E||Y||^2 < \infty\)` then as `\(n \to \infty\)`

`$$\sqrt{n}(\bar{Y}-\mu) \to_d N(0,V)$$`

where `\(\mu = E[Y]\)` and `\(V= E[(Y-\mu)(Y-\mu)']\)`.

The central limit theorem shows that the distribution of the sample mean is approximately normal in large samples. It allows for singular `\(V\)`.

---

## Summary

- Being i.i.d. does not imply normality.
- A sample can be i.i.d. from a non-normal distribution, such as Uniform, Exponential, Bernoulli, or Poisson.
- The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) hold under the i.i.d. assumption, even when the distribution is not normal.
- But the limit in the CLT is normal.
- Normality is typically required only for small-sample exact inference, such as when using t-statistics.
- In large samples, asymptotic normality of estimators emerges, even from non-normal data, as long as the observations are i.i.d. and regularity conditions are met (e.g. consistency and identification!). A simulation sketch follows on the next slide.
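---

## WLLN and CLT in simulation

A minimal sketch (assumed code, separate from the linked course scripts): i.i.d. draws from a non-normal (Exponential) distribution still obey the WLLN and the CLT. The sample sizes and the Exponential choice are illustrative only.

``` r
set.seed(123)
mu <- 1                                  # E[Y] for Exponential(rate = 1)

# WLLN: the sample mean approaches mu as n grows
sapply(c(10, 100, 1000, 10000), function(n) mean(rexp(n, rate = 1)))

# CLT: sqrt(n)*(Ybar - mu) is approximately N(0, Var[Y]) in large samples
n <- 500; repet <- 2000
z <- replicate(repet, sqrt(n) * (mean(rexp(n, rate = 1)) - mu))
hist(z, breaks = 40, freq = FALSE, main = "sqrt(n)(Ybar - mu)")
curve(dnorm(x, 0, 1), add = TRUE, col = "blue", lwd = 2)  # Var[Y] = 1 here
```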
---

## Continuous Mapping Theorem (CMT)

The CMT makes use of convergence in probability and convergence in distribution.

Let `\(Z_n \in \mathbb{R}^k\)` and `\(g(u)\)`: `\(\mathbb{R}^k \to \mathbb{R}^q\)`:

If `\(Z_n \to_p c\)` as `\(n \to \infty\)` and `\(g(u)\)` is continuous at `\(c\)`, then `\(g(Z_n) \to_p g(c)\)` as `\(n \to \infty\)`.

If a sequence of random vectors `\(Z_n\)` converges in probability to a constant vector `\(c\)`, and you apply a continuous function `\(g\)` to each `\(Z_n\)`, then the transformed sequence `\(g(Z_n)\)` also converges in probability to `\(g(c)\)`.

The CMT is needed for deriving the asymptotic distributions of estimators after transformation (e.g., log, inverse, square root, etc.).

---

## Illustration of the Continuous Mapping Theorem (CMT)

Let:

`$$Z_n \sim \mathcal{N}(1, 1/n),$$`

`$$Z_n \xrightarrow{p} 1 \quad \text{as } n \to \infty.$$`

A continuous function:

`$$g(u) = \log(u),$$`

* which is continuous at `\(u = 1\)`.

Then, by the CMT:

`$$\log(Z_n) \xrightarrow{p} \log(1) = 0.$$`

[Beautiful Code](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/11_CMT.R)

---

## Continuous Mapping Theorem (CMT)

If `\(Z_n \to_d Z\)` as `\(n \to \infty\)` and `\(g\)`: `\(\mathbb{R}^m \to \mathbb{R}^k\)` has a set of discontinuity points `\(D_g\)` such that `\(\mathbb{P}[Z\in D_g]=0\)`, then `\(g(Z_n) \to_d g(Z)\)` as `\(n \to \infty\)`.

Differentiable functions of asymptotically normal estimators are asymptotically normal.

This version of the Continuous Mapping Theorem applies when we have **convergence in distribution** rather than in probability.

It allows applying a function `\(g(\cdot)\)`, even one with discontinuities, to a converging sequence of random variables, **as long as the limiting random vector `\(Z\)` does not land in the discontinuity set `\(D_g\)` with positive probability**.

---

## Consistency of Least Squares Estimators

The OLS estimator can be written as a continuous function of a set of sample moments.

The WLLN shows that sample moments converge in probability to population moments.

The CMT states that continuous functions preserve convergence in probability.

OLS is a function of the sample moments `\(\hat{Q}_{XX}^{-1}\)` and `\(\hat{Q}_{XY}\)`:

`$$\hat{\beta} = ({1 \over n} \sum_{i=1}^n X_iX_i')^{-1}({1 \over n} \sum_{i=1}^n Y_iX_i) = \hat{Q}^{-1}_{XX}\hat{Q}_{XY}$$`

---

## Consistency of Least Squares Estimators

Using the **WLLN**, these sample moments converge in probability to their population expectations.

As `\((Y_i,X_i)\)` are i.i.d., any function of them is i.i.d., including `\(X_iX_i'\)` and `\(X_iY_i\)`.

With finite expectations, as `\(n \to \infty\)`:

`$$\hat{Q}_{XX} = {1 \over n} \sum_{i=1}^n X_iX_i' \to_p \mathbb{E}[XX'] = Q_{XX}$$`

`$$\hat{Q}_{XY} = {1 \over n} \sum_{i=1}^n X_iY_i \to_p \mathbb{E}[XY] = Q_{XY}$$`

---

## Consistency of Least Squares Estimators

By the **CMT** we can combine the equations above:

`$$\hat{\beta}=\hat{Q}^{-1}_{XX}\hat{Q}_{XY} \to_p Q_{XX}^{-1}Q_{XY} = \beta$$`

The OLS estimator converges in probability to the projection coefficient vector `\(\beta\)` as the sample size `\(n\)` gets large.

Because:

`$$\hat{\beta}=g(\hat{Q}_{XX},\hat{Q}_{XY})$$`

and `\(g\)` is a continuous function of `\(Q_{XX}\)` and `\(Q_{XY}\)` at all argument values for which `\({Q}^{-1}_{XX}\)` exists. This justifies the use of the CMT.

---

## Consistency of Least Squares Estimators

A different demonstration:

`$$\hat{\beta} - \beta = \hat{Q}^{-1}_{XX}\hat{Q}_{Xe}$$`

where:

`$$\hat{Q}_{Xe}= {1 \over n} \sum_{i=1}^nX_ie_i$$`

The WLLN implies:

`$$\hat{Q}_{Xe} \to_p \mathbb{E}[Xe]=0$$`

`$$\hat{\beta} - \beta = \hat{Q}^{-1}_{XX}\hat{Q}_{Xe} \to_p Q_{XX}^{-1} \cdot 0=0$$`

That is, `\(\hat{\beta} \to_p \beta\)` as `\(n \to \infty\)`. Thus, `\(\hat{\beta}\)` is consistent for `\(\beta\)`. The next slide sketches this convergence in simulation.
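---

## Consistency in simulation

A minimal sketch (assumed code, separate from the course scripts): the same regression estimated on growing samples, showing `\(\hat\beta_1\)` settling at the true value of 2 as `\(n\)` grows.

``` r
set.seed(1234567)
ns <- c(50, 500, 5000, 50000)            # increasing sample sizes
beta_hat <- sapply(ns, function(n) {
  x <- rnorm(n)
  u <- rnorm(n)                          # u independent of x
  y <- 2 + 2 * x + u                     # true slope beta_1 = 2
  coef(lm(y ~ x))[2]                     # OLS slope estimate
})
round(cbind(n = ns, beta_hat = beta_hat), 4)   # beta_hat approaches 2
```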
---

## Unbiased estimators `\((\hat\beta_1)\)`, n = 1000

- Demonstration with simulated data for `\(\hat\beta_1\)`:

``` r
repet <- 1000
n <- 1000
beta <- NULL
set.seed(1234567)
for (i in 1:repet){
  x <- rnorm(n)              # n draws of x from N(0,1)
  u <- rnorm(n)              # u is NOT correlated with x
  y <- 2 + 2*x + u           # PRF: the true slope is 2 by definition
  beta[i] <- lm(y~x)$coef[2] # collect the 1000 slope estimates in one vector
}
hist(beta, main="Unbiased estimator", xlim = c(1.9,2.1))
abline(v = mean(beta), col="red", lwd=3, lty=2)
abline(v = 2, col="blue", lwd=3, lty=2)
```

---

## Unbiased estimators `\((\hat\beta_1)\)`, n = 1000

- Demonstration with simulated data for `\(\hat\beta_1\)`:

<div class="figure" style="text-align: center">
<img src="asymptotic_v1_files/figure-html/unnamed-chunk-2-1.png" alt=" " width="55%" />
<p class="caption"> </p>
</div>

---

## Unbiased estimators `\((\hat\beta_1)\)`, n = 30

- Demonstration with simulated data for `\(\hat\beta_1\)`:

``` r
repet <- 1000
n <- 30
beta <- NULL
set.seed(1234567)
for (i in 1:repet){
  x <- rnorm(n)              # n draws of x from N(0,1)
  u <- rnorm(n)              # u is NOT correlated with x
  y <- 2 + 2*x + u           # PRF: the true slope is 2 by definition
  beta[i] <- lm(y~x)$coef[2] # collect the 1000 slope estimates in one vector
}
hist(beta, main="Unbiased estimator", xlim = c(1.4,2.6))
abline(v = mean(beta), col="red", lwd=3, lty=2)
abline(v = 2, col="blue", lwd=3, lty=2)
```

---

## Unbiased estimators `\((\hat\beta_1)\)`, n = 30

- Demonstration with simulated data for `\(\hat\beta_1\)`:

<div class="figure" style="text-align: center">
<img src="asymptotic_v1_files/figure-html/unnamed-chunk-4-1.png" alt=" " width="55%" />
<p class="caption"> </p>
</div>

---

## Biased estimators `\((\hat\beta_1)\)`

- Demonstration with simulated data for `\(\hat\beta_1\)`:

``` r
repet <- 1000
n <- 1000
beta <- NULL
set.seed(1234567)
for (i in 1:repet){
  x <- rnorm(n)              # n draws of x from N(0,1)
  u <- (rnorm(n) + .1*x)     # correlate u with x: this biases OLS and makes it
                             # inconsistent (the higher the correlation, the bigger the bias)
  y <- 2 + 2*x + u           # PRF: the true slope is 2 by definition
  beta[i] <- lm(y~x)$coef[2] # collect the 1000 slope estimates in one vector
}
hist(beta, main="Biased estimator", xlim = c(1.9,2.3))
abline(v = mean(beta), col="red", lwd=3, lty=2)
abline(v = 2, col="blue", lwd=3, lty=2)
```

---

## Biased estimators `\((\hat\beta_1)\)`

- Demonstration with simulated data (Monte Carlo) for `\(\hat\beta_1\)`:

<div class="figure" style="text-align: center">
<img src="asymptotic_v1_files/figure-html/unnamed-chunk-6-1.png" alt=" " width="55%" />
<p class="caption"> </p>
</div>

---

## Asymptotic Normality

Consistency is a good first step, but it does not describe the distribution of the estimator.

In terms of moments, `\(\sqrt{n}(\hat{\beta} - \beta)\)` can be written using a sum of zero-mean random vectors, normalized so that the CLT applies:

`$$\sqrt{n} (\hat{\beta} - \beta) = ({1 \over n} \sum_{i=1}^n X_iX_i')^{-1} \color{green}{ ({1 \over \sqrt{n}} \sum_{i=1}^n X_ie_i)}$$`

`\(\sqrt{n} (\hat{\beta} - \beta)\)` is a function of the sample average `\(({1 \over n} \sum_{i=1}^n X_iX_i')\)` and the normalized sample average `\(({1 \over \sqrt{n}} \sum_{i=1}^n X_ie_i)\)`.

---

## Asymptotic Normality

Any function of `\((Y_i,X_i)\)` is i.i.d.; this includes `\(X_ie_i\)`.
It is mean zero, `\(\mathbb{E}[Xe]=0\)`, with `\(k \times k\)` covariance matrix:

`$$\Omega= E[(Xe)(Xe)']=E[XX'e^2]$$`

Since `\(\Omega < \infty\)` and `\(X_ie_i\)` is i.i.d., mean zero, and has finite variance, the CLT implies that as `\(n \to \infty\)`:

`$$\color{green}{ {1 \over \sqrt{n}} \sum_{i=1}^nX_ie_i} \to_d N(0,\Omega)$$`

---

## Asymptotic Normality

Hence:

`$$\sqrt{n}(\hat{\beta} - \beta) \to_d Q^{-1}_{XX} \color{green}{N(0,\Omega)} = N(0, Q^{-1}_{XX}\Omega Q^{-1}_{XX})$$`

*because linear combinations of normal vectors are also normal.*

Therefore, as `\(n \to \infty\)`:

`$$\sqrt{n}(\hat{\beta} - \beta) \to_d N(0,V_\beta)$$`

where:

`$$V_\beta = Q^{-1}_{XX}\Omega Q^{-1}_{XX}$$`

is the asymptotic covariance matrix of `\(\hat\beta\)`: the (sandwich) variance of the asymptotic distribution of `\(\sqrt{n}(\hat{\beta} - \beta)\)`.

---

## Asymptotic Normality

Under `\(\text{cov}(XX', e^2)=0\)` (as implied by conditional homoskedasticity) the asymptotic variance simplifies to:

`$$\Omega = E[XX']E[e^2]= Q_{XX}\sigma^2$$`

`$$V_\beta = Q^{-1}_{XX}\Omega Q^{-1}_{XX} = Q_{XX}^{-1}\sigma^2 \equiv V^0_\beta$$`

This is the **homoskedastic asymptotic covariance matrix**.

---

## Asymptotic Normality

Recall that the exact conditional variance in the (possibly heteroskedastic) CEF model is:

`$$V_{\hat\beta} = var[\hat\beta|X]= (X'X)^{-1} (X'DX) (X'X)^{-1}$$`

Note that `\(V_{\hat\beta}\)` is the exact conditional variance of `\(\hat\beta\)`, while `\(V_\beta\)` is the asymptotic variance of `\(\sqrt{n}({\hat\beta}-\beta)\)`.

Thus `\(V_\beta\)` should be roughly `\(n\)` times as large as `\(V_{\hat\beta}\)`: `\(V_\beta \approx nV_{\hat\beta}\)`.

---

## Asymptotic Normality

Indeed:

`$$n V_{\hat\beta} = \left( \frac{1}{n} X'X \right)^{-1} \left( \frac{1}{n} X' D X \right) \left( \frac{1}{n} X'X \right)^{-1}$$`

which converges to the asymptotic variance `\(V_\beta\)` as `\(n \to \infty\)`.

`\(V_{\hat\beta}\)` is useful for practical inference (standard errors and hypothesis tests), as it is the variance of `\(\hat\beta\)`. When not used asymptotically, it **assumes** normality.

`\(V_{\beta}\)` is useful for asymptotic theory. It is well defined as `\(n\to\infty\)`.

---

## But how large should `\(n\)` be?

There is no simple answer. For some data distributions the normal approximation is poor.

e.g. Let `\(Y= \beta_1X + \beta_2 + e\)` where `\(X\)` is `\(N(0,1)\)` and `\(e\)` is independent of `\(X\)`.

`\(e\)` has the **double Pareto density** `\(f(e)= {\alpha \over 2} |e|^{-\alpha-1}\)`, `\(|e| \ge 1\)` (i.e. extreme values are highly likely).

If `\(\alpha >2\)` the error has zero mean and variance `\(\alpha/(\alpha-2)\)`. As `\(\alpha \to 2\)` its variance diverges to infinity.

Note that in `\(\sqrt {n {\alpha-2 \over \alpha}}(\hat\beta_1 - \beta_1)\)` the scaling degenerates as `\(\alpha \to 2\)`, and the normal approximation deteriorates.
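---

## Asymptotic normality in simulation

A minimal sketch (assumed code, separate from the course scripts): with `\(x \sim N(0,1)\)` and homoskedastic `\(u \sim N(0,1)\)`, the asymptotic variance of the slope is `\(\sigma^2/var(x) = 1\)`, so `\(\sqrt{n}(\hat\beta_1 - \beta_1)\)` should be close to `\(N(0,1)\)`.

``` r
set.seed(1234567)
repet <- 2000; n <- 500
z <- replicate(repet, {
  x <- rnorm(n)
  u <- rnorm(n)                         # homoskedastic, independent of x
  y <- 2 + 2 * x + u                    # true slope beta_1 = 2
  sqrt(n) * (coef(lm(y ~ x))[2] - 2)    # normalized estimation error
})
hist(z, breaks = 40, freq = FALSE, main = "sqrt(n)(beta_hat - beta)")
curve(dnorm(x, 0, 1), add = TRUE, col = "blue", lwd = 2)  # asymptotic N(0,1)
```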
---

## Consistency of error variance

`\(\hat{\sigma}^2\)` and `\(s^2\)` are consistent for `\(\sigma^2\)`.

To prove this, note that the residual `\(\hat{e}_i\)` is the error `\(e_i\)` plus a deviation:

`\(\hat{e}_i = Y_i - X'_i\hat{\beta} = e_i - X'_i (\hat{\beta}-\beta)\)`

Hence:

`\(\hat{e}_i^2 = e_i^2- 2e_iX'_i(\hat{\beta} - \beta) + (\hat{\beta} - \beta)' X_iX'_i(\hat{\beta} - \beta)\)`

`$$\hat{\sigma}^2 = {1 \over n} \sum_{i=1}^n e_i^2 - 2 \big({1 \over n} \sum_{i=1}^n \color{green}{e_iX'_i \big) (\hat{\beta} - \beta)} + (\hat{\beta} - \beta)' \big({1 \over n} \sum_{i=1}^n X_iX'_i \big) (\hat{\beta} - \beta)$$`

(*) We obtain the average of the squared errors, plus two terms that are (hopefully) asymptotically negligible.

---

## Consistency of error variance

The WLLN shows that:

`$${1 \over n}\sum_{i=1}^n e_i^2 \to_p {\sigma^2}$$`

`$${1 \over n}\sum_{i=1}^n e_iX'_i \to_p E[eX'] = 0$$`

`$$\hat{\beta} \to_p \beta$$`

So expression (*) converges in probability to `\(\sigma^2\)`.

Since `\(n/(n-k) \to 1\)` as `\(n \to \infty\)`, `\(s^2 = {n \over n-k} \hat{\sigma}^2 \to_p \sigma^2\)`.

Thus both estimators are consistent.

---

## Homoskedastic Covariance Matrix Estimation

The asymptotic variance-covariance matrix of `\(\sqrt{n}(\hat{\beta} - \beta)\)` is `\(V_\beta = Q^{-1}_{XX}\Omega Q^{-1}_{XX}\)`.

For asymptotic inference (tests) we need a consistent estimator of `\(V_{\beta}\)`.

Under homoskedasticity, `\(V_\beta\)` simplifies to `\(V^0_\beta = Q^{-1}_{XX}\sigma^2\)`.

The moment estimator is `\(\hat{V}^0_\beta = \hat{Q}^{-1}_{XX}s^2\)`.

We have established that `\(\hat{Q}_{XX} \to_p Q_{XX}\)` and that `\(s^2 \to_p \sigma^2\)`. Also `\(V^0_\beta=Q^{-1}_{XX}\sigma^2\)` is a continuous function of `\(Q_{XX}\)` and `\(\sigma^2\)`, provided `\(Q_{XX}>0\)`.

**By the CMT:** `\(\hat{V}^0_\beta = \hat{Q}^{-1}_{XX}s^2 \to_p V^0_\beta = Q^{-1}_{XX}\sigma^2\)`

So `\(\hat{V}^0_\beta\)` is consistent for `\(V^0_\beta\)` as `\(n \to \infty\)`.

---

## Heteroskedastic Covariance Matrix Estimation

The asymptotic variance-covariance matrix of `\(\sqrt{n}(\hat{\beta} - \beta)\)` is `\(V_\beta = Q^{-1}_{XX}\Omega Q^{-1}_{XX}\)`.

The moment estimator for `\(\Omega\)` is:

`$$\hat{\Omega} = {1 \over n} \sum_{i=1}^n X_iX_i'\hat{e}^2_i$$`

Hence:

`$$\hat{V}_{\beta}^{HC0} = \hat{Q}^{-1}_{XX}\hat{\Omega}\hat{Q}^{-1}_{XX}$$`

satisfying the simple relationship `\(\hat{V}_{\beta}^{HC0} = n\hat{V}_\hat{\beta}^{HC0}\)`.

We have shown that `\(\hat{Q}_{XX} \to_p {Q}_{XX}\)`; now we need to show consistency of `\(\hat{\Omega}\)`.

---

## Heteroskedastic Covariance Matrix Consistency

`$$\hat{\Omega} = {1 \over n} \sum_{i=1}^n X_iX_i'\hat{e}^2_i$$`

`$$\hat{\Omega} = {1 \over n} \sum_{i=1}^n X_iX_i'{e}^2_i + {1 \over n} \sum_{i=1}^n X_iX_i'(\hat{e}^2_i - {e}^2_i)$$`

**By the WLLN:** `\({1 \over n}\sum_{i=1}^n X_iX_i'{e}^2_i \to_p E[XX'e^2]=\Omega\)`

Recall that the residual is the error plus a deviation:

`$$\hat{e}_i = Y_i - X'_i\hat{\beta} = e_i - X'_i (\hat{\beta}-\beta)$$`

and `\(\hat{\beta} - \beta \to_p 0\)`, so that `\(\hat{e}_i\)` is close to `\(e_i\)` when `\(n\)` is large.

Hence: `\(\hat{\Omega} \to_p \Omega\)` as `\(n \to \infty\)`.

---

## Summing up Covariance Matrix

The exact variance of `\(\hat\beta\)`, conditional on `\(X\)` (classical finite-sample result), is:

`$$V_\hat{\beta} = var[\hat{\beta}|X] = (X'X)^{-1} (X'DX) (X'X)^{-1}$$`

The asymptotic variance of `\(\sqrt{n}(\hat{\beta} - \beta)\)` (under the more general assumptions) is:

`$$V_\beta = avar[\sqrt{n} (\hat\beta - \beta)] = Q^{-1}_{XX} \Omega Q^{-1}_{XX}$$`

With HC0 estimators:

`$$\hat{V}_{\hat\beta}^{HC0} = (X'X)^{-1} (\sum_{i=1}^n X_i X'_i \hat{e}^2_i)(X'X)^{-1}$$`

`$$\hat{V}_{\beta}^{HC0} = \hat{Q}_{XX}^{-1} \hat\Omega \hat{Q}_{XX}^{-1}$$`

---

## Summing up Covariance Matrix

Finally, `\(\bar\Omega \to_p \Omega\)` and `\(\tilde\Omega \to_p \Omega\)` as well, where:

`$$\hat{V}_{\beta}^{HC2} = \hat{Q}_{XX}^{-1} \bar\Omega \, \hat{Q}_{XX}^{-1}, \qquad \bar\Omega = {1 \over n}\sum_{i=1}^n (1-h_{ii})^{-1} X_i X'_i \hat{e}^2_i$$`

`$$\hat{V}_{\beta}^{HC3} = \hat{Q}_{XX}^{-1} \tilde\Omega \, \hat{Q}_{XX}^{-1}, \qquad \tilde\Omega = {1 \over n}\sum_{i=1}^n (1-h_{ii})^{-2} X_i X'_i \hat{e}^2_i$$`

The intuition is that the leverage values `\(h_{ii}\)` are asymptotically negligible. A hands-on sketch follows on the next slide.
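---

## Robust covariance matrices by hand

A minimal sketch (assumed code) computing `\(\hat{V}_{\hat\beta}^{HC0}\)` and `\(\hat{V}_{\hat\beta}^{HC2}\)` directly from the formulas above; `hatvalues()` returns the leverages `\(h_{ii}\)`.

``` r
set.seed(1234567)
n <- 500
x <- rnorm(n)
u <- rnorm(n) * (1 + 0.5 * abs(x))      # heteroskedastic errors
y <- 2 + 2 * x + u
fit  <- lm(y ~ x)
X    <- model.matrix(fit)               # n x k regressor matrix
ehat <- residuals(fit)
h    <- hatvalues(fit)                  # leverage values h_ii

bread <- solve(crossprod(X))            # (X'X)^{-1}
meat0 <- crossprod(X * ehat)            # sum of X_i X_i' e_i^2  (HC0)
meat2 <- crossprod(X * (ehat / sqrt(1 - h)))   # HC2 weighting
V_HC0 <- bread %*% meat0 %*% bread
V_HC2 <- bread %*% meat2 %*% bread
sqrt(diag(V_HC0)); sqrt(diag(V_HC2))    # heteroskedasticity-robust std. errors
# cross-check (if installed): sandwich::vcovHC(fit, type = "HC0")
```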
---

## t-Statistic

Let `\(\theta = r(\beta): \mathbb{R}^k \to \mathbb{R}\)` be a parameter of interest, `\(\hat{\theta}\)` its estimator and `\(s(\hat{\theta})\)` its asymptotic standard error:

`$$T(\theta)= {\hat{\theta}-\theta \over s(\hat{\theta})}$$`

Since `\(\sqrt{n}(\hat{\theta}-\theta) \to_d N(0,V_\theta)\)` and `\(\hat{V}_{\theta} \to_p V_\theta\)`:

`$$T(\theta)= {\hat{\theta}-\theta \over s(\hat{\theta})} = {{\sqrt{n}(\hat\theta - \theta)} \over {\sqrt{\hat{V}_{\theta}}}}$$`

`$$\to_d {{N(0,V_\theta)} \over {\sqrt{{V}_{\theta}}}}= Z \sim N(0,1)$$`

The asymptotic distribution of the t-statistic is standard normal.

---

## Type 1 Error

The probability of a Type 1 error is called the **size** of the test:

`$$\mathbb{P}[\text{Reject } H_0 \mid H_0 \, \text{true}] = \mathbb{P}[T>c \mid H_0 \, \text{true}]$$`

In typical econometric models the exact sampling distribution of estimators is unknown, and we rely on asymptotic approximations.

We suppose that the test statistic has an asymptotic distribution under `\(H_0\)`: `\(T \to_d \xi\)` as `\(n \to \infty\)`.

For the absolute t-statistic, if the null hypothesis holds then `\(T \to_d |Z|\)` as `\(n\to \infty\)`, where `\(Z \sim N(0,1)\)`.

---

## Type 1 Error

The asymptotic probability of a Type 1 error is:

`$$\lim_{n\to\infty} \mathbb{P}[T>c|H_0 \, true] = \mathbb{P}[\xi >c] = 1-G(c)$$`

where `\(G\)` is the distribution of `\(\xi\)`. For the absolute t-statistic, `\(2(1-\Phi(1.96)) = 0.05\)`.

Hence the 5% asymptotic critical value for the absolute t-statistic is `\(c = 1.96\)`.

---

## Type 2 Error

The Type 2 error relates to the **power** of the test, which is one minus the probability of making a Type 2 error:

`$$\pi(\theta) = \mathbb{P}[\text{reject } H_0 \mid H_1 \, true] = P[T>c|H_1 \, true]$$`

Hence the power of a test is the *probability of rejecting `\(H_0\)` when `\(H_1\)` is true.*

Increasing `\(c\)` reduces the Type 1 error (decreases size) but increases the Type 2 error (reduces power).

The power increases as `\(\theta\)` moves away from the null hypothesis `\(\theta_0\)`, and as the sample size increases.

---

## Power and Test Consistency

Suppose `\(Y_i\)` is i.i.d. `\(N(\theta, \sigma^2)\)` with `\(\sigma^2\)` known. Consider `\(T(\theta) = \sqrt{n}(\bar{Y} - \theta)/\sigma\)` and tests of `\(H_0: \theta = 0\)` against `\(H_1: \theta >0\)`.

We reject `\(H_0\)` if `\(T = T(0)>c\)`, where:

`$$T = T(\theta) + \sqrt{n} \theta/\sigma$$`

- `\(\sqrt{n} \theta / \sigma\)` captures a (possibly small) deviation from the null, scaled up by `\(\sqrt{n}\)`.
- It reflects the effect of testing under an alternative, and it grows as `\(n \to \infty\)`.
- `\(T(\theta)\)` has an exact `\(N(0,1)\)` distribution.

---

## Power and Test Consistency

The **power of the test** is:

`$$\mathbb{P}[T>c|\theta] = \mathbb{P}[Z + \sqrt{n}\theta/\sigma > c] = 1 - \Phi (c - \sqrt{n}\theta/\sigma)$$`

For any `\(c\)` and `\(\theta \ne 0\)` the power increases to 1 as `\(n \to \infty\)`.

Hence if `\(\theta \in H_1\)`, the test will reject `\(H_0\)` with probability approaching 1. This is test consistency! (See the sketch on the next slide.)
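---

## Power function in R

A minimal sketch (assumed code) evaluating the power formula `\(1 - \Phi(c - \sqrt{n}\,\theta/\sigma)\)` from the previous slide, with `\(c = 1.645\)` (5% one-sided test) and `\(\sigma = 1\)`.

``` r
power <- function(theta, n, sigma = 1, c = qnorm(0.95)) {
  1 - pnorm(c - sqrt(n) * theta / sigma)   # P[reject H0 | theta]
}
theta <- 0.2                               # a fixed alternative
sapply(c(25, 100, 400, 1600), function(n) power(theta, n))
# the power rises toward 1 as n grows: the test is consistent
```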
---

## Confidence Intervals

Asymptotic normality is used to justify confidence intervals and tests for parameters.

The point estimator `\(\hat{\theta}\)` for `\(\theta\)` is a single value in `\(\mathbb{R}^q\)`. A set estimator, `\(\hat{C}\)`, is a collection of values in `\(\mathbb{R}^q\)`.

When the parameter `\(\theta\)` is real-valued, it is common to focus on sets of the form `\(\hat{C} = [\hat{L},\hat{U}]\)`.

This is the interval estimator for `\(\theta\)`.

---

## Confidence Intervals

Because it comes from random data, an interval estimate is also random; the coverage probability of `\(\hat{C} = [\hat{L},\hat{U}]\)` is `\(\mathbb{P} [\theta \in \hat{C}]\)`.

The randomness comes from `\(\hat{C}\)`, as the parameter `\(\theta\)` is a fixed scalar.

When we cannot rely on the exact normal distribution we used for normal regression, we use asymptotic approximations. This allows building intervals for general parameters, not only regression coefficients.

When `\(\hat{\theta}\)` is asymptotically normal with standard error `\(s(\hat{\theta})\)`, the confidence interval takes the form:

`$$\hat{C} = [\hat{\theta}-c \times s(\hat{\theta}),\ \hat{\theta} + c \times s(\hat{\theta})]$$`

---

## Confidence Intervals

A `\(1-\alpha\)` confidence interval satisfies `\(\mathbb{P}_\theta [\theta \in \hat{C}] = 1-\alpha\)`.

The goal is to set the coverage probability equal to a pre-specified level `\(1-\alpha\)`, such as 90% or 95%.

e.g. asymptotically, for `\(c = 1.96\)`, `\(\mathbb{P}[\theta \in \hat{C}] \to 0.95\)`.

The critical values `\(c\)` are calculated using the `\(Z\)` distribution.

---

## Confidence Intervals

**In normal regression, CIs were constructed under homoskedasticity.**

Asymptotic CIs can be constructed with heteroskedasticity-robust standard errors. **Still compare to the normal `\(Z\)` distribution.**

*Note: Stata by default reports a 95% CI for each coefficient where `\(c\)` is calculated using the `\(t_{n-k}\)` distribution (normal-homoskedastic), giving a wider CI than `\(Z\)`.*

**This is only exact for homoskedastic standard errors and under normality.**

With small `\(n\)`, heteroskedasticity-robust asymptotic standard errors can be biased, and hence so are t-stats and CIs (see the estimator `\(\hat\Omega\)` when `\(n\)` is "not large enough").

---

## Wald Statistic

With the t-test we check whether a single coefficient is zero, not restrictions on functions of the vector `\(\beta\)`.

We want to test `\(\theta=\theta_0\)` vs. `\(\theta \ne \theta_0\)`.

Let `\(\theta = r(\beta) : \mathbb{R}^k \to \mathbb{R}^q\)` be any parameter vector of interest, `\(\hat\theta=r(\hat\beta)\)` its estimator, and `\(\hat{V}_{\hat{\theta}}\)` its covariance matrix estimator.

`\(\theta\)` can encode multivariate restrictions, so it is a vector. A measure is:

`$$W(\theta) = (\hat\theta - \theta)' \hat{V}_\hat\theta^{-1} (\hat\theta - \theta) = n \, (\hat\theta - \theta)' \hat{V}_\theta^{-1} (\hat\theta - \theta)$$`

a weighted Euclidean measure of how much `\(\hat\theta\)` deviates from `\(\theta\)`, where:

`$$\hat{V}_\theta = n \hat{V}_\hat\theta$$`

---

## Wald Statistic

When `\(q=1\)`, `\(W(\theta)=T(\theta)^2\)`. When `\(q \ge 2\)`, `\(W(\theta)\)` is the Wald statistic.

An asymptotic test rejects `\(H_0\)` in favor of `\(H_1\)` if `\(W > c\)`.

As `\({\sqrt{n}(\hat\theta - \theta)} \to_d Z \sim N(0,V_\theta)\)` and `\(\hat{V}_\theta \to_p {V}_\theta\)`, then:

`$$W(\theta) = \sqrt{n} (\hat\theta - \theta)' \hat{V}_{\theta}^{-1} \sqrt{n} (\hat\theta - \theta) \to_d Z'V_{\theta}^{-1}Z$$`

A quadratic form in a normal random vector is chi-square distributed: `\(\chi_q^2\)` with `\(q\)` degrees of freedom. As `\(n \to \infty\)`, `\(W(\theta) \to_d \chi^2_q\)`.

For a given significance level `\(\alpha\)` the asymptotic critical value `\(c\)` satisfies `\(\alpha = 1 - G_q(c)\)`, where `\(G_q\)` is the `\(\chi^2_q\)` distribution. e.g. for `\(\alpha = 0.05\)` and `\(q = 3\)`: `\(c = 7.82\)`.
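---

## Wald test in R (sketch)

A minimal sketch (assumed code, separate from the linked course script): a Wald test of the joint restriction `\(\beta_1 = \beta_2 = 0\)` using the HC0 covariance matrix, compared with the `\(\chi^2_2\)` critical value.

``` r
set.seed(1234567)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 0.3 * x1 + 0 * x2 + rnorm(n)    # true beta_2 = 0
fit  <- lm(y ~ x1 + x2)
X    <- model.matrix(fit); ehat <- residuals(fit)
V    <- solve(crossprod(X)) %*% crossprod(X * ehat) %*% solve(crossprod(X))  # HC0
R    <- rbind(c(0, 1, 0), c(0, 0, 1))     # selects (beta_1, beta_2)
theta_hat <- R %*% coef(fit)              # estimates of the restricted parameters
W <- t(theta_hat) %*% solve(R %*% V %*% t(R)) %*% theta_hat
c(W = W, crit_5pct = qchisq(0.95, df = 2)) # reject H0 if W exceeds the critical value
```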
---

## Homoskedastic Wald Statistic

We can construct the Wald statistic using the homoskedastic covariance matrix estimator `\(\hat{V}_{\theta}^0\)`.

Under conditional homoskedasticity `\(\mathbb{E}[e^2|X]=\sigma^2\)` it has the same asymptotic distribution as `\(W(\theta)\)`:

`$$W^0(\theta) = (\hat\theta - \theta)' {(\hat{V}_\hat{\theta}^0})^{-1} (\hat\theta - \theta) = n \, (\hat\theta - \theta)' {(\hat{V}_{\theta}^0})^{-1} (\hat\theta - \theta)$$`

*Note: the F version of the general Wald test usually reported by software is `\(F = W/q \sim F_{q,n-k}\)`, which is approximately valid when `\(n-k\)` is large.*

[Some code](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/12_t_and_wald.R)

---

## Summary

- Wald tests arise naturally from the maximum likelihood framework.
- If the estimator `\(\hat{\theta}\)` is asymptotically normal (via the CLT), the Wald statistic has an asymptotic `\(\chi^2\)` distribution, even when small-sample approximations are poor.
- In contrast, the t-test relies on the `\(t\)` distribution (and asymptotically on `\(Z\)`) and compares against a critical value from these standardized distributions.
- This makes the t-test sensitive to violations of the classical assumptions if `\(n\)` is not large.
- The t-test is exact in small samples under strict assumptions. The Wald test is more flexible and consistent as the sample size grows.

---

## Bonferroni Corrections

When testing multiple hypotheses, the chance of finding a "significant" result by chance increases.

- Suppose we test `\(k\)` hypotheses with individual significance level `\(\alpha\)`.

`$$\text{FWER} = \mathbb{P}(\text{Reject at least one } H_{0j} \mid \text{All } H_{0j} \text{ true}) \approx 1 - (1 - \alpha)^k$$`

e.g. with `\(k = 5\)` tests at `\(\alpha = 0.05\)`:

`$$\text{FWER} = 1 - (1 - 0.05)^5 \approx 0.23$$`

- The approximation `\(1 - (1 - \alpha)^k\)` assumes independent tests.
- In general (worst case), the probability that **at least one** test falsely rejects (the familywise error) is bounded by:

`$$\mathbb{P}\left(\min_{j \leq k} p_j < \alpha\right) \leq \alpha k$$`

---

## Bonferroni Corrections

**Bonferroni Rule:** To control this error at level `\(\alpha\)`, reject only if

`$$\min_{j \leq k} p_j < \frac{\alpha}{k}$$`

**Bonferroni-adjusted p-value:**

`$$\text{FWER p-value} = k \cdot \min_{j \leq k} p_j$$`

---

## Bonferroni Corrections

**We control the probability that at least one of the tests falsely rejects.**

Two p-values: 0.04 and 0.15

- Bonferroni-adjusted p-value: `\(0.04 \times 2 = 0.08\)` → **Not significant** at the 5% level

Now: p-values 0.01 and 0.15

- Adjusted p-value: `\(0.01 \times 2 = 0.02\)` → **Significant**

Bonferroni rejects only if the smallest p-value is `\(< \alpha / k\)`.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h2>The End</h2>
</div>