1 Asymptotic Theory
This first section gives a short introduction to asymptotic theory. Before we derive some important results that will be used throughout the notes, we need to define some central convergence concepts. Then we can state some useful central limit theorems (CLTs) and laws of large numbers (LLNs) that will be used when discussing the asymptotic properties of different estimators.
1.1 Some Convergence Concepts
For convenience, in all subsequent sections a sequence \((z_1,z_2,\ldots)\) will be denoted by \(\left\{z_n \right\}.\) With that said, let's define some convergence concepts.
Definition 1.1 (Convergence in probability) A sequence of random vectors \(\left\{\mathbf{z}_n\right\}\) converges in probability to a (nonrandom) vector of constants \(\boldsymbol{\alpha}\) if, for any \(\epsilon>0,\) \[ \lim_{n\to \infty}Pr(\left| \mathbf{z}_n-\boldsymbol{\alpha} \right|>\epsilon)=0.\] The constant is called the probability limit of \(\mathbf{z}_n\) and is written as \(\text{plim}_{n\to \infty}\mathbf{z}_n=\boldsymbol{\alpha},\) or \(\mathbf{z}_n\to_p\boldsymbol{\alpha}.\) Also, \(\mathbf{z}_n\to_p\boldsymbol{\alpha}\) is the same as \(\mathbf{z}_n-\boldsymbol{\alpha}\to_p \mathbf{0}.\)
Definition 1.2 (Almost sure convergence) A sequence of random vectors \(\left\{ \mathbf{z}_n \right\}\) converges almost surely to a vector of constants \(\boldsymbol{\alpha}\) if \[Pr\left(\lim_{n\to \infty}\mathbf{z}_n=\boldsymbol{\alpha} \right)=1.\] We write this as \(\mathbf{z}_n\to_{a.s.}\boldsymbol{\alpha}.\)
Remark: Almost sure convergence is a stronger convergence concept than convergence in probability. Hence, if a sequence converges almost surely, it also converges in probability. In the context of estimators, almost sure convergence is sometimes referred to as strong consistency, while convergence in probability is referred to as weak consistency.
Definition 1.3 (Convergence in distribution) A sequence of random vectors \(\left\{ \mathbf{z}_n \right\}\), with CDFs \(F_n\), converges in distribution to \(\mathbf{z}\) with CDF \(F\) if \[\lim_{n\to \infty}F_n(\mathbf{z})=F(\mathbf{z})\] at all points \(\mathbf{z}\) at which \(F\) is continuous. We write this as \(\mathbf{z}_n\to_d \mathbf{z}.\)
Remark: Distributional convergence is what we seek to establish when we talk about asymptotic normality.
1.1.1 Some useful convergence results
Lemma 1.1 (The continuous mapping results) Let \(\mathbf{a}:\mathbb{R}^K\to \mathbb{R}^r\) be a vector-valued function that does not depend on \(n.\) Then:
(1) If \(\mathbf{a}(\cdot)\) is continuous at the (nonrandom) point \(\boldsymbol{\alpha}\in\mathbb{R}^K,\) then \[\mathbf{z}_n\to_p\boldsymbol{\alpha} \Longrightarrow \mathbf{a}(\mathbf{z}_n)\to_p\mathbf{a}(\boldsymbol{\alpha}).\]
(2) If \(\mathbf{a}(\cdot)\) is continuous everywhere and \(\mathbf{z}\) is a random vector, then \[\mathbf{z}_n\to_d\mathbf{z}\Longrightarrow \mathbf{a}(\mathbf{z}_n)\to_d\mathbf{a}(\mathbf{z}).\]
(3) If \(\mathbf{a}(\cdot)\) is continuous at the (nonrandom) point \(\boldsymbol{\alpha}\in\mathbb{R}^K,\) then \[\mathbf{z}_n\to_{a.s.} \boldsymbol{\alpha}\Longrightarrow \mathbf{a}(\mathbf{z}_n)\to_{a.s.}\mathbf{a}(\boldsymbol{\alpha}).\]
Remark: Observe that (1) follows from (3) immediately.
Lemma 1.2
(a) If \(\mathbf{x}_n\to_d\mathbf{x}\) and \(\mathbf{y}_n\to_p\boldsymbol{\alpha}\), then \[\mathbf{x}_n+\mathbf{y}_n\to_d\mathbf{x}+\boldsymbol{\alpha}.\]
(b) If \(\mathbf{x}_n\to_d\mathbf{x}\) and \(\mathbf{y}_n\to_p\mathbf{0}\), then \[\mathbf{y}'_n\mathbf{x}_n\to_p 0.\]
(c) If \(\mathbf{x}_n\to_d\mathbf{x}\) and \(\mathbf{A}_n\to_p\mathbf{A}\), where \(\mathbf{A}\) is a nonrandom matrix, then \[\mathbf{A}_n\mathbf{x}_n\to_d\mathbf{Ax},\] provided that \(\mathbf{A}_n\) and \(\mathbf{x}_n\) are conformable. In particular, if \(\mathbf{x}_n\to_d\mathbf{x}\) with \(\mathbf{x}\sim N(\mathbf{0},\boldsymbol{\Sigma}),\) then \(\mathbf{A}_n\mathbf{x}_n \to_d N(\mathbf{0},\boldsymbol{A\Sigma A'}).\)
Remark: Conditions (a) and (c) are sometimes referred to as Slutsky's theorem.
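As a small illustration of (a) and (c), consider the studentized mean: the CLT gives \(\sqrt{n}(\bar x_n-\mu)\to_d N(0,\sigma^2)\) and the LLN gives \(s_n\to_p\sigma\), so the ratio converges in distribution to \(N(0,1)\). Below is a minimal R sketch of this (the exponential population, the sample size and the number of replications are my own illustrative choices, not taken from the text).
#Slutsky illustration: studentized means from an exponential(1) population
set.seed(1)
n <- 500
reps <- 5000
tstat <- replicate(reps, {
x <- rexp(n) #population mean and standard deviation are both 1
sqrt(n)*(mean(x) - 1)/sd(x) #CLT numerator, LLN-consistent denominator
})
hist(tstat, breaks = 50, freq = FALSE,
main = "Studentized mean", xlab = "t")
curve(dnorm(x), add = TRUE, col = "blue") #limiting N(0,1) density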
Lemma 1.3 (The delta method) Let \(\left\{\mathbf{x}_n\right\}\) be a sequence of \(K\)-dimensional random vectors such that \(\mathbf{x}_n\to_p \boldsymbol{\beta}\) and \[\sqrt{n}(\mathbf{x}_n-\boldsymbol{\beta})\to_d\mathbf{z},\] and suppose that \(\mathbf{a}(\cdot):\mathbb{R}^K\to \mathbb{R}^r\) has continuous first derivatives, with \(\mathbf{A}(\boldsymbol{\beta})\) denoting the \(r\times K\) matrix of first derivatives evaluated at \(\boldsymbol{\beta}\): \[\underset{(r\times K)}{\mathbf{A}(\boldsymbol{\beta})}=\frac{\partial \mathbf{a}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}'}.\] Then, \[\sqrt{n}\left[\mathbf{a}(\mathbf{x}_n)-\mathbf{a}(\boldsymbol{\beta}) \right]\to_d\mathbf{A}(\boldsymbol{\beta})\mathbf{z}.\] In particular, if \(\sqrt{n}(\mathbf{x}_n-\boldsymbol{\beta})\to_d N(\mathbf{0},\boldsymbol{\Sigma})\), then \[\sqrt{n}\left[\mathbf{a}(\mathbf{x}_n)-\mathbf{a}(\boldsymbol{\beta}) \right]\to_d N(\boldsymbol{0, A(\beta)\Sigma A(\beta)'}).\]
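Here is a minimal R sketch of the delta method in the scalar case (the normal DGP, \(a(x)=x^2\) and \(\beta=2\) are my own illustrative choices): with \(\text{Var}(x_i)=1\), the delta method predicts \(\sqrt{n}\left[a(\bar x_n)-a(\beta)\right]\to_d N\!\left(0,(2\beta)^2\right)=N(0,16).\)
#Delta method sketch: a(x) = x^2 applied to a sample mean, beta = 2, variance 1
set.seed(1)
n <- 1000
reps <- 5000
beta <- 2
dm <- replicate(reps, {
xbar <- mean(rnorm(n, mean = beta, sd = 1))
sqrt(n)*(xbar^2 - beta^2) #sqrt(n)[a(xbar) - a(beta)]
})
var(dm) #should be close to (2*beta)^2 = 16
hist(dm, breaks = 50, freq = FALSE, main = "Delta method", xlab = "Scaled difference")
curve(dnorm(x, sd = 2*beta), add = TRUE, col = "blue") #limiting N(0,16) density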
1.2 Limit Theorems for Random Samples
Now let us consider two important limit theorems (a CLT and an LLN) that will help us make statements about the asymptotic properties of certain estimators. The only requirement we impose is that the sample is randomly drawn from the population, or in other words, that the observations are independent and identically distributed (iid).
Proposition 1.1 (Kolmogorov's second strong LLN) Let \(\left\{z_i \right\}\) be iid with \(E(z_i)=\mu.\) Then \[n^{-1}\sum_{i=1}^n z_i\to_{a.s.} \mu.\]
This proposition just states that given an independent and identically distributed sequence of observations, the sample mean will converge almost surely to the population mean \(\mu.\)
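A minimal R sketch of this statement (the Bernoulli population and the sample size are my own choices): the running sample mean settles down at the population mean as \(n\) grows.
#LLN sketch: running sample mean of iid Bernoulli(0.3) draws
set.seed(1)
n <- 10000
z <- rbinom(n, size = 1, prob = 0.3)
running_mean <- cumsum(z)/seq_len(n)
plot(running_mean, type = "l", xlab = "n", ylab = "Sample mean",
main = "Law of large numbers")
abline(h = 0.3, col = "red") #population mean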
Next, let \(\widehat{\boldsymbol{\theta}}_n\) be an estimate of \(\boldsymbol{\theta}\). Central limit theorems will help us make statements about the distribution of the difference \(\widehat{\boldsymbol{\theta}}_n-\boldsymbol{\theta}\), scaled by a suitable function of \(n\) (typically \(\sqrt{n}\)), as \(n\to \infty.\) Let us start with a general result, and then move on to the general properties of estimators.
Proposition 1.2 (The Lindeberg-Levy CLT) Let \(\left\{\mathbf{z}_i \right\}\) be iid with \(E(\mathbf{z}_i)=\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}=E\left[(\mathbf{z}_i-\boldsymbol{\mu})(\mathbf{z}_i-\boldsymbol{\mu})' \right]\). Then \[\sqrt{n}(\overline{\mathbf{z}}_n-\boldsymbol{\mu})\to_d N(\boldsymbol{0,\Sigma}),\] where \(\overline{\mathbf{z}}_n=n^{-1}\sum_{i=1}^n \mathbf{z}_i.\)
This proposition states that, given an independent and identically distributed sequence of observations with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\), the scaled and centered sample mean \(\sqrt{n}(\overline{\mathbf{z}}_n-\boldsymbol{\mu})\) converges in distribution to a multivariate normal distribution with mean zero and variance \(\boldsymbol{\Sigma}\).
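A minimal R sketch of the scalar case (the skewed chi-square population and the sample size are my own choices): even though the underlying data are far from normal, the scaled and centered sample mean is approximately normal.
#CLT sketch: scaled sample means of chi-square(1) data (mean 1, variance 2)
set.seed(1)
n <- 200
reps <- 5000
zbar <- replicate(reps, mean(rchisq(n, df = 1)))
scaled <- sqrt(n)*(zbar - 1)
hist(scaled, breaks = 50, freq = FALSE,
main = "Lindeberg-Levy CLT", xlab = "sqrt(n)(mean - mu)")
curve(dnorm(x, sd = sqrt(2)), add = TRUE, col = "blue") #limiting N(0,2) density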
1.3 Asymptotic Properties of Estimators
Now we shall make use of Propositions 1.1 and 1.2 by defining some very important concepts regarding the asymptotic properties of estimators. These properties form the basis of any large sample analysis of estimators, and the usefulness of a particular estimator depends, partly yet crucially, on them.
Definition 1.4 (Consistency of an estimator) Let \(\widehat{\boldsymbol{\theta}}_n\) be a sequence of estimates of \(\boldsymbol{\theta}\in \boldsymbol{\Theta}\). If \[\widehat{\boldsymbol{\theta}}_n \to_p \boldsymbol{\theta}\] for any value of \(\boldsymbol{\theta}\), we say that \(\widehat{\boldsymbol{\theta}}_n\) is a consistent estimator of \(\boldsymbol{\theta}.\)
Notice here that weak consistency suffices, since we only require the estimator to converge in probability. Because we do not know the true value \(\boldsymbol{\theta}\) in practice, consistency is a very useful (and essential) property of an estimator.
Definition 1.5 (Asymptotically normally distributed) Let \(\widehat{\boldsymbol{\theta}}_n\) be a sequence of estimates of \(\boldsymbol{\theta}\in \boldsymbol{\Theta}\). If \[\sqrt{n}(\widehat{\boldsymbol{\theta}}_n-\boldsymbol{\theta})\to_dN(\boldsymbol{0,\Sigma}),\] where \(\boldsymbol{\Sigma}\) is a positive semidefinite variance-covariance matrix, we say that \(\widehat{\boldsymbol{\theta}}_n\) is asymptotically normally distributed and \(\boldsymbol{\Sigma}\) is the asymptotic variance denoted \(\text{Avar}(\widehat{\boldsymbol{\theta}})\).
Remark: Of course, we cannot observe the population asymptotic variance, so it has to be estimated as well. Therefore, further analysis also requires us to prove that the asymptotic variance can be consistently estimated. We will discuss this in more detail in the next chapter when we investigate the asymptotic properties of the OLS estimator.
Definition 1.6 (Asymptotic efficiency of an estimator) Let \(\widehat{\boldsymbol{\theta}}_n\) and \(\tilde{\boldsymbol{\theta}}_n\) be two estimators with asymptotic variances \(\text{Avar}(\widehat{\boldsymbol{\theta}})\) and \(\text{Avar}(\tilde{\boldsymbol{\theta}})\). We say that \(\widehat{\boldsymbol{\theta}}_n\) is asymptotically more efficient than \(\tilde{\boldsymbol{\theta}}_n\) if \(\text{Avar}(\widehat{\boldsymbol{\theta}})\leq \text{Avar}(\tilde{\boldsymbol{\theta}})\), in the sense that the difference \(\text{Avar}(\tilde{\boldsymbol{\theta}})-\text{Avar}(\widehat{\boldsymbol{\theta}})\) is positive semidefinite.
2 Regression Analysis
This section assumes previous knowledge of the OLS estimator, so no particular focus will be placed on explaining what ordinary least squares is. Instead, I will start by briefly presenting some finite sample properties of the OLS estimator and the assumptions under which these properties hold. Then, I will discuss some asymptotic results for the OLS estimator and show under which assumptions OLS is consistent and asymptotically normal.
2.1 Finite Sample Properties of the OLS Estimator
The finite sample assumptions for the OLS estimator are:
Assumption 1 (Linearity): The data generating process is given by \(y_i=\beta_1x_{i1}+\cdots+\beta_Kx_{iK}+\epsilon_i, \quad i=1,\ldots,n,\) where \(\beta_1,\ldots,\beta_K\) are real-valued, unknown parameters to be estimated, \(\epsilon_i\) is a stochastic unobserved error term, and \(n<\infty\) is the sample size.
Assumption 2 (Strict exogeneity): The conditional expectation of the error term is zero. That is, \(E(\epsilon_i\mid \mathbf{X})=E(\epsilon_i\mid \mathbf x_1,\ldots,\mathbf x_K)=0\) for all \(i=1,\ldots,n.\)
Assumption 3 (Full rank): The rank of the \(n\times K\) data matrix \(\mathbf{X}\) is \(K\) with probability \(1.\)
Assumption 4 (Conditional homoskedasticity): \(E(\epsilon_i^2\mid \mathbf{X})=\sigma^2\) for each \(i\), where \(0<\sigma^2 <\infty.\)
Assumption 5 (Conditionally normal errors): Conditional on the data matrix \(\mathbf{X}\), the errors \(\epsilon_1,\ldots,\epsilon_n\) are jointly multivariate normal.
Using various combinations of these assumptions, we can formulate some important results that hold in the finite sample analysis of the OLS estimator.
Proposition 2.1 (Conditional unbiasedness of the OLS estimator) Under assumptions 1-3, the conditional expectation of the OLS estimate is equal to the true population parameter; that is \[E(\mathbf{b}\mid \mathbf X)=\boldsymbol \beta,\] where \(\mathbf b\) is the OLS estimate.
Proof. Notice that \(E(\mathbf b-\boldsymbol \beta\mid \mathbf X)=\mathbf 0\) is equivalent to \(E(\mathbf b\mid \mathbf X)=\boldsymbol \beta\), so we can prove the former. Construct the difference \[\begin{align} \mathbf{b}-\boldsymbol{\beta}&=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}-\boldsymbol{\beta}\notag\\ &=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon})-\boldsymbol{\beta}\notag\\ &=(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\boldsymbol{\beta}+(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}-\boldsymbol{\beta}\notag\\ &=\boldsymbol{\beta}+(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}-\boldsymbol{\beta}\notag\\ & =(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}. \label{finiteprop1} \end{align}\] Taking the conditional expectation of the last expression gives \[\begin{align*} \mathbb{E}\left[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\mid \mathbf{X}\right]&=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\, \mathbb{E}(\boldsymbol{\epsilon}\mid \mathbf{X})\\ &=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\cdot \mathbf{0}\\ &=\mathbf{0}, \end{align*}\] and the proof is done.
Proposition 2.2 (Expression for conditional variance) Under assumptions 1-4, the conditional variance of the OLS estimate is \[\text{Var}(\mathbf{b}\mid \mathbf{X})=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}.\]
Proof. From Proposition 2.1 above we have that \((\mathbf{b}-\boldsymbol{\beta})=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}.\) This gives \[\begin{align*} \mathbb{E}\left[ (\mathbf{b}-\boldsymbol{\beta})(\mathbf{b}-\boldsymbol{\beta})'\mid \mathbf{X}\right]&=\mathbb{E}\left[ \left( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\right)\left( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\right)'\mid \mathbf{X} \right]\\ &=\mathbb{E}\left[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\boldsymbol{\epsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mid \mathbf{X}\right]\\ &=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbb{E}(\boldsymbol{\epsilon}\boldsymbol{\epsilon}'\mid \mathbf{X})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\\ &=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \quad \text{(Assumption 4)}\\ &=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\\ &=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}, \end{align*}\] and the proof is done.
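As a quick numerical check of Propositions 2.1 and 2.2, here is a minimal R sketch with an arbitrary DGP of my own; the design matrix \(\mathbf{X}\) is held fixed across replications, as the conditioning requires. The simulated mean and covariance of \(\mathbf{b}\) should be close to \(\boldsymbol{\beta}\) and \(\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\).
#Monte Carlo check of E(b|X) = beta and Var(b|X) = sigma^2 (X'X)^{-1}, with X held fixed
set.seed(1)
n <- 200
reps <- 5000
sigma <- 1.5
beta <- c(1, 2)
X <- cbind(1, rnorm(n)) #fixed design: only the errors are redrawn below
b_store <- t(replicate(reps, {
y <- X %*% beta + rnorm(n, sd = sigma)
as.vector(solve(crossprod(X), crossprod(X, y))) #OLS estimate
}))
colMeans(b_store) #close to beta = (1, 2)
cov(b_store) #close to the matrix below
sigma^2 * solve(crossprod(X))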
Proposition 2.3 (The Gauss-Markov Theorem) Under Assumptions 1-4, the OLS estimator is efficient in the class of linear unbiased estimators. That is, for any unbiased estimator \(\widehat{\boldsymbol{\beta}}\) that is linear in \(\mathbf{y}\), \(\text{Var}(\widehat{\boldsymbol{\beta}}\mid \mathbf{X})\geq \text{Var}(\mathbf{b}\mid \mathbf{X}).\)
Proof. Since \(\widehat{\boldsymbol{\beta}}\) is linear in \(\mathbf{y},\) it can be written as \(\widehat{\boldsymbol{\beta}}=\mathbf{C}\mathbf{y}\) for some matrix \(\mathbf{C}\). Let \(\mathbf{D}:=\mathbf{C}-\mathbf{A}\) or \(\mathbf{C}=\mathbf{D}+\mathbf{A}\) where \(\mathbf{A}:=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'.\) Then \[\begin{align*} \widehat{\boldsymbol{\beta}}&=(\mathbf{D}+\mathbf{A})\mathbf{y}\\ &=\mathbf{D}\mathbf{y}+\mathbf{A}\mathbf{y}\\ &=\mathbf{D}(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon})+\mathbf{b}\\ &=\mathbf{D}\mathbf{X}\boldsymbol{\beta}+\mathbf{D}\boldsymbol{\epsilon}+\mathbf{b}, \end{align*}\] where the third equality follows from the fact that \(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\) and \(\mathbf{A}\mathbf{y}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}=\mathbf{b}.\) Taking the conditional expectation of both sides gives \[\mathbb{E}(\widehat{\boldsymbol{\beta}}\mid \mathbf{X})=\mathbf{D}\mathbf{X}\boldsymbol{\beta}+\mathbb{E}(\mathbf{D}\boldsymbol{\epsilon}\mid \mathbf{X})+\mathbb{E}(\mathbf{b}\mid \mathbf{X}).\] Since both \(\mathbf{b}\) and \(\widehat{\boldsymbol{\beta}}\) are unbiased and since \(\mathbb{E}(\mathbf{D}\boldsymbol{\epsilon}\mid \mathbf{X})=\mathbf{D}\mathbb{E}(\boldsymbol{\epsilon}\mid \mathbf{X})=\mathbf{0},\) it follows that \(\mathbf{D}\mathbf{X}\boldsymbol{\beta}=\mathbf{0}.\) For this to be true for any given \(\boldsymbol{\beta}\), it is necessary that \(\mathbf{D}\mathbf{X}=0.\) So, \(\widehat{\boldsymbol{\beta}}=\mathbf{D}\boldsymbol{\epsilon}+\mathbf{b}\) and \[\begin{align*} \widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}&=\mathbf{D}\boldsymbol{\epsilon}+(\mathbf{b}-\boldsymbol{\beta})\\ &=(\mathbf{D}+\mathbf{A})\boldsymbol{\epsilon}. \end{align*}\] So the variance is \[\begin{align*} \text{Var}(\widehat{\boldsymbol{\beta}}\mid \mathbf{X})&=\text{Var}(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}\mid \mathbf{X})\\ &=\text{Var}\left[(\mathbf{D}+\mathbf{A})\boldsymbol{\epsilon}\mid \mathbf{X}\right]\\ &=(\mathbf{D}+\mathbf{A})\text{Var}(\boldsymbol{\epsilon}\mid \mathbf{X})(\mathbf{D}'+\mathbf{A}')\\ &=\sigma^2\cdot (\mathbf{D}+\mathbf{A})(\mathbf{D}'+\mathbf{A}')\\ &=\sigma^2\cdot(\mathbf{D}\mathbf{D}'+\mathbf{A}\mathbf{D}'+\mathbf{D}\mathbf{A}'+\mathbf{A}\mathbf{A}'). \end{align*}\] But \(\mathbf{D}\mathbf{A}'=\mathbf{D}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}=0\) since \(\mathbf{D}\mathbf{X}=0.\) Also, \(\mathbf{A}\mathbf{A}'=(\mathbf{X}'\mathbf{X})^{-1},\) so \[\begin{align*} \text{Var}(\widehat{\boldsymbol{\beta}}\mid \mathbf{X})&=\sigma^2\cdot \left[\mathbf{D}\mathbf{D}'+(\mathbf{X}'\mathbf{X})^{-1} \right]\\ &\geq \sigma^2\cdot(\mathbf{X}'\mathbf{X})^{-1}\\ &=\text{Var}(\mathbf{b}\mid \mathbf{X}), \end{align*}\] so we have shown that \(\mathbf{b}\) is more efficient than any other linear unbiased estimator, and the proof is done.
2.1.1 Simulation study: Gauss-Markov Theorem
Before we move to the asymptotic properties of the OLS estimator, I just want to take a quick detour to provide some intuition for the Gauss-Markov Theorem. In particular, I want to compare the OLS estimator to another linear, unbiased estimator whose parameter estimate is a weighted average of the outcome (see the code for details). We set up a simulation and expect to observe that the variance of the distribution of the OLS estimates is smaller than that of the alternative estimator; this is what the Gauss-Markov Theorem predicts.
windowsFonts(A = windowsFont("LM Roman 10"))
#Number of observations and simulation reps
n <- 100
reps <- 10000
#Create alternative (linear, unbiased) estimator: weights sum to one but are unequal
eps <- 0.9
w <- c(rep((1+eps)/n, n/2),
rep((1-eps)/n, n/2))
#Generate DGP and run simulation
ols <- rep(NA, reps)
nols <- rep(NA, reps)
for (i in 1:reps){
y <- rnorm(n) #intercept-only model: the OLS estimate is the sample mean
ols[i] <- mean(y)
nols[i] <- crossprod(w, y)
}
#Plot both densities; take the axis limits from the wider (non-OLS) density
d_ols <- density(ols)
d_nols <- density(nols)
plot(d_nols,
col="red",
xlab="Estimates",
ylab="Density",
main="Illustration of the Gauss-Markov Theorem",
ylim=range(d_ols$y, d_nols$y),
family="A")
lines(d_ols,
col="blue")
abline(v=0)
legend("topright", legend=c("OLS", "Alternative estimator"), col=c("blue", "red"), lty=1)
As we can see, the variance of the blue density (OLS) is smaller than the variance of the red density (the alternative estimator), as we expected from the Gauss-Markov Theorem.
2.2 Large Sample Properties
Now we shall show the large sample properties of the OLS estimator. As noted above, we strive to show that OLS is consistent and asymptotically normally distributed. Indeed, it is relatively straightforward to do this using the theory we have discussed above.
2.2.1 Consistency
Let \(\mathbf{b}\) denote the OLS estimate of \(\boldsymbol{\beta}\). To show consistency, we first note a few things. First, from Proposition 2.1 we know that under OLS assumptions 1-3 the OLS estimate is conditionally unbiased. We can also show, using the Law of Iterated Expectations (LIE), that OLS is also unconditionally unbiased, and we can then prove consistency using Kolmogorov's second strong LLN (Proposition 1.1).
Corollary 2.1 (Unconditional unbiasedness of the OLS estimator) The unconditional expectation of the OLS estimate is equal to the true parameter under OLS assumptions 1-3. That is, \[E(\mathbf{b})=\boldsymbol{\beta}.\]
Proof. The result follows directly from the LIE: \[E(\mathbf{b})=E\left[E(\mathbf{b}\mid \mathbf{X}) \right]=E(\boldsymbol{\beta})=\boldsymbol{\beta}.\]
Proposition 2.4 (Consistency of the OLS estimator) Under OLS assumptions 1-3, the OLS estimator is consistent. That is, \[\mathbf{b}\to_{a.s.}\boldsymbol{\beta}.\]
Proof. Think of the OLS estimates as a sequence \(\left\{ \mathbf{b}_n\right\}\) and write, with \(\mathbf{x}_i\) denoting the \(i\)th row of \(\mathbf{X}\), \[\mathbf{b}=\boldsymbol{\beta}+\left(n^{-1}\sum_{i=1}^n\mathbf{x}_i'\mathbf{x}_i\right)^{-1}\left(n^{-1}\sum_{i=1}^n\mathbf{x}_i'\epsilon_i\right).\] By Kolmogorov's second strong LLN (Proposition 1.1), \(n^{-1}\sum_{i=1}^n\mathbf{x}_i'\mathbf{x}_i\to_{a.s.}E(\mathbf{x}_i'\mathbf{x}_i)\) and \(n^{-1}\sum_{i=1}^n\mathbf{x}_i'\epsilon_i\to_{a.s.}E(\mathbf{x}_i'\epsilon_i)=\mathbf{0}\) (the latter equality follows from strict exogeneity and the LIE). Provided \(E(\mathbf{x}_i'\mathbf{x}_i)\) is nonsingular, the continuous mapping results (Lemma 1.1) then give \(\mathbf{b}\to_{a.s.}\boldsymbol{\beta}+\left[E(\mathbf{x}_i'\mathbf{x}_i)\right]^{-1}\mathbf{0}=\boldsymbol{\beta}.\)
In the proof above we used assumptions 1-3, but these can actually be relaxed. We can disregard the linearity assumption by thinking of the regression model as a linear projection. Also, when doing asymptotic theory we do not need to impose strict exogeneity; it suffices that the regressors and the error term are uncorrelated, that is, \(E(\mathbf{x}_i'\epsilon_i)=\mathbf{0}.\) But since we rely on the unbiasedness results above, we keep the strict exogeneity assumption; one could obtain consistency even without it (but not necessarily unbiasedness).
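To visualize consistency, here is a minimal R sketch with a DGP of my own (distinct from the one used in Section 2.2.3 below): compute the OLS slope for increasing sample sizes and watch the estimate settle at the true value.
#Consistency sketch: OLS slope estimate as the sample size grows
set.seed(1)
beta <- c(1, 2)
n_grid <- seq(100, 20000, by = 100)
slope <- sapply(n_grid, function(n){
x <- rnorm(n)
y <- beta[1] + beta[2]*x + rnorm(n)
coef(lm(y ~ x))[2] #slope estimate for this sample size
})
plot(n_grid, slope, type = "l", xlab = "n", ylab = "Slope estimate",
main = "Consistency of OLS")
abline(h = beta[2], col = "red") #true value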
2.2.2 Asymptotic normality
Let's now focus on the asymptotic normality of the OLS estimator. I will provide a general proposition whose variance takes different forms depending on the structure placed on the error variance.
Proposition 2.5 (Asymptotic normality of the OLS estimator) Under OLS assumptions 1-3 (together with suitable moment conditions), \[\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})\to_d N(\mathbf{0}, \Sigma_{\mathbf{x}\mathbf{x}}^{-1}\mathbf{S}\Sigma_{\mathbf{x}\mathbf{x}}^{-1}),\] where \(\Sigma_{\mathbf{x}\mathbf{x}}=E(\mathbf{x}_i' \mathbf{x}_i)\) and \(\mathbf{S}=E(\epsilon_i^2\mathbf{x}_i'\mathbf{x}_i).\)
Proof. Using the same decomposition as in the proof of Proposition 2.4, \[\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})=\left(n^{-1}\sum_{i=1}^n\mathbf{x}_i'\mathbf{x}_i\right)^{-1}\left(n^{-1/2}\sum_{i=1}^n\mathbf{x}_i'\epsilon_i\right).\] By the LLN the first factor converges in probability to \(\Sigma_{\mathbf{x}\mathbf{x}}^{-1}\), and by the Lindeberg-Levy CLT (Proposition 1.2) the second factor converges in distribution to \(N(\mathbf{0},\mathbf{S})\). Lemma 1.2(c) then gives \(\sqrt{n}(\mathbf{b}-\boldsymbol{\beta})\to_d N(\mathbf{0},\Sigma_{\mathbf{x}\mathbf{x}}^{-1}\mathbf{S}\Sigma_{\mathbf{x}\mathbf{x}}^{-1}).\)
Under conditional homoskedasticity (Assumption 4), \(\mathbf{S}=E(\epsilon_i^2\mathbf{x}_i'\mathbf{x}_i)=\sigma^2\Sigma_{\mathbf{x}\mathbf{x}}\) (the population analogue of Proposition 2.2), so the asymptotic variance of the OLS estimator simplifies to \[\begin{equation} \text{Avar}(\mathbf{b})=\sigma^2[E(\mathbf{x}_i'\mathbf{x}_i)]^{-1}. \tag{2.1} \end{equation}\]
Under OLS assumptions 1-4 one can also show that this asymptotic variance is consistently estimated; however, as soon as the homoskedasticity assumption breaks down, the variance estimator based on (2.1) becomes inconsistent. To account for this, we use heteroskedasticity-robust standard errors. Essentially, we estimate \(\mathbf{S}\) with a structure that allows the variance of the error term to be a function of the values of the regressors. In particular, using \[\widehat{\text{Avar}}(\mathbf{b})=(\mathbf{X}'\mathbf{X})^{-1}\left(\sum_{i=1}^n\widehat{\epsilon}_i^{\,2}\mathbf{x}_i'\mathbf{x}_i \right)(\mathbf{X}'\mathbf{X})^{-1},\] where \(\widehat{\epsilon}_i\) are the OLS residuals, accounts for the problems that arise when the homoskedasticity assumption fails. The standard errors based on this formulation of \(\mathbf{S}\) are referred to as heteroskedasticity-robust standard errors. Notice also that the estimator based on (2.1) estimates the asymptotic variance consistently under homoskedasticity.
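To make the two formulas concrete, here is a minimal R sketch on a simulated heteroskedastic DGP of my own (the same computations are carried out on real data in the answer to Question 2 below).
#Homoskedastic vs heteroskedasticity-robust variance estimates for OLS
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n, sd = abs(x)) #error variance depends on x
X <- cbind(1, x)
b <- solve(crossprod(X), crossprod(X, y)) #OLS estimate
e <- as.vector(y - X %*% b) #residuals
XtX_inv <- solve(crossprod(X))
vcov_homo <- sum(e^2)/(n - ncol(X)) * XtX_inv #homoskedastic formula
vcov_robust <- XtX_inv %*% crossprod(X * e) %*% XtX_inv #sandwich with sum e_i^2 x_i'x_i
sqrt(diag(vcov_homo))
sqrt(diag(vcov_robust)) #noticeably larger for the slope under this DGP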
2.2.3 Simulation study: Asymptotic normality and consistency of the OLS estimator
Now let us do some simulations in R to illustrate consistency and asymptotic normality of the OLS estimates. It is relatively straightforward: construct a data generating process (DGP) and simulate a sufficient number of observations (remember we operate in large sample territory). To illustrate consistency and asymptotic normality, plotting a simple histogram will suffice.
Consider the following DGP \[y_i=1+2x_{i1}+\epsilon_i,\] where \(\epsilon_i\sim N(0,1).\) Let's run this simulation.
library(gets)
windowsFonts(A = windowsFont("LM Roman 10"))
n <-2500 #Number of observations
set.seed(123) #For replication
ols_est <- matrix(NA, nrow = n, ncol = 1)
colnames(ols_est) <- c("beta2")
for (i in 1:n){
#1: Set up dgp
x <- rnorm(n) #Exogenous variable
eps <- rnorm(n) #Error term
beta1 <- 1
beta2 <- 2
y <-beta1+beta2*x+eps #DGP
#2: Run regression and store estimates for each repetition i
X <- cbind(1, x)
ols_model <- ols(y[1:n], X[1:n,])
ols_est[i,1] <- ols_model$coefficients[2]
}
hist(ols_est,
col="gray",
xlab="Estimate of beta",
ylab="Frequency",
main="Distribution of OLS estimate",
family="A")
First we observe that the distribution clusters around the true parameter value \(\beta_2=2,\) which gives us an indication of consistency. Moreover, the distribution looks approximately normal.
3 Chapter Questions & Answers
3.1 Questions
- Solve question 4.2-4.4 in Wooldridge
- Download a small dataset, for example into Excel. Compute OLS (one X is fine) and the standard errors manually, both the robust and the not-so-robust version of the standard errors.
- Generate data from a linear regression model, then do a parametric bootstrap, a nonparametric bootstrap and standard OLS to assess the standard errors. In addition to using the bootstraps to compute standard errors, also use them to assess the approximate normality of the estimates (graphically).
- Generate data for a treatment effect, with random treatment and heterogeneous treatment effects. Show that you estimate the average treatment effect through linear regression.
3.2 Answers
3.2.1 Question 1
3.2.1.1 4.2 in Wooldridge
- This is about showing unbiasedness of the OLS estimate. We have
\[E(\widehat{\beta}\mid \mathbf{X})=\boldsymbol{\beta}+(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\boldsymbol{\epsilon}\mid \mathbf{X})=\boldsymbol{\beta}\]
since \(E(\boldsymbol{\epsilon}\mid \mathbf{X})=0\) by assumption. Notice also that \((\mathbf{X}'\mathbf{X})^{-1}\) has to exist, and it does so only if \(\mathbf{X}'\mathbf{X}\) has full rank; hence, the full rank assumption also has to hold. This is exactly the same as in Proposition 2.1.
- This is the same as in Proposition 2.2. We have,
\[\begin{align*} \mathbb{E}\left[ (\mathbf{b}-\boldsymbol{\beta})(\mathbf{b}-\boldsymbol{\beta})'\mid \mathbf{X}\right]&=\mathbb{E}\left[ \left( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\right)\left( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\right)'\mid \mathbf{X} \right]\\&=\mathbb{E}\left[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\epsilon}\boldsymbol{\epsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mid \mathbf{X}\right]\\&=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbb{E}(\boldsymbol{\epsilon}\boldsymbol{\epsilon}'\mid \mathbf{X})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\\&=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \quad \text{(Assumption 4)}\\&=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\\&=\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\end{align*}\]
3.2.1.2 4.3 In Wooldridge
- Not in general. The conditional variance can always be written as
\[ \text{Var}(\epsilon\mid\mathbf{x})=E(\epsilon^2\mid \mathbf{x})-\left[E(\epsilon\mid \mathbf{x}) \right]^2 \]
and if \(E(\epsilon\mid \mathbf{x})\neq0,\) then \(\text{Var}(\epsilon\mid \mathbf{x})\neq E(\epsilon^2\mid \mathbf{x}).\)
- It could be the case that \(E(\mathbf{x}' \epsilon)=0,\) in which case OLS is consistent, and \(\text{Var}(\epsilon\mid \mathbf{x})\) is constant (homoskedastic).
3.2.1.3 4.4 In Wooldridge
For each \(i\), \(\widehat{u}_i=y_i-\mathbf{x}_i\widehat{\boldsymbol{\beta}}=u_i-\mathbf{x}_i(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}),\) and so \(\widehat{u}_i^2=u_i^2-2u_i\mathbf{x}_i(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})+\left[\mathbf{x}_i(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}) \right]^2.\) Therefore, we can write
\[ N^{-1}\sum_{i=1}^N\widehat{u}_i^2\mathbf{x}_i'\mathbf{x}_i=N^{-1}\sum_{i=1}^Nu_i^2\mathbf{x}_i'\mathbf{x}_i-2N^{-1}\sum_{i=1}^N[u_i\mathbf{x}_i(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})]\mathbf{x}_i'\mathbf{x}_i+N^{-1}\sum_{i=1}^N[\mathbf{x}_i(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})]^2\mathbf{x}_i'\mathbf{x}_i. \]
Dropping the “-2”, the second term can be written as the sum of \(K\) terms of the form
\[ N^{-1}\sum_{i=1}^N[u_ix_{ij}(\widehat{\beta}_j-\beta_j)]\mathbf{x}_i'\mathbf{x}_i=(\widehat{\beta}_j-\beta_j)N^{-1}\sum_{i=1}^N(u_ix_{ij})\mathbf{x}_i'\mathbf{x}_i=o_P(1)\cdot O_P(1), \]
where we have used \(\widehat{\beta}_j-\beta_j=o_P(1)\) and \(N^{-1} \sum_{i=1}^N(u_ix_{ij})\mathbf{x}_i'\mathbf{x}_i=O_P(1)\) whenever \(E[\mid u_ix_{ij}x_{ih}x_{ik}\mid]<\infty\) for all \(j,h,k\) (as we assumed). Similarly, the third term can be written as the sum of \(K^2\) terms of the form
\[ (\widehat{\beta}_j-\beta_j)(\widehat{\beta}_h-\beta_h)N^{-1}\sum_{i=1}^N(x_{ij}x_{ih})\mathbf{x}_i'\mathbf{x}_i=o_P(1)\cdot o_P(1)\cdot O_P(1)=o_P(1), \]
where we have used \(N^{-1} \sum_{i=1}^N(x_{ij}x_{ih})\mathbf{x}_i'\mathbf{x}_i=O_P(1)\) whenever \(E[\mid x_{ij}x_{ih}x_{ik}x_{im}\mid]<\infty\) for all \(j,h,k,m\). We have shown that \(N^{-1} \sum_{i=1}^N \widehat{u}_i^2 \mathbf{x}_i'\mathbf{x}_i=N^{-1}\sum_{i=1}^N u_i^2\mathbf{x}_i'\mathbf{x}_i+o_P(1),\) which is what we wanted to show.
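A quick numerical check of this argument, as a minimal R sketch with a heteroskedastic DGP of my own: the feasible matrix built from the OLS residuals and the infeasible matrix built from the true errors should be close when N is large.
#Check that N^{-1} sum uhat_i^2 x_i'x_i is close to N^{-1} sum u_i^2 x_i'x_i
set.seed(1)
N <- 5000
x <- rnorm(N)
u <- rnorm(N, sd = sqrt(1 + x^2)) #heteroskedastic true errors
y <- 1 + 2*x + u
X <- cbind(1, x)
uhat <- residuals(lm(y ~ x)) #OLS residuals
A_hat <- crossprod(X * uhat)/N #feasible: uses residuals
A_true <- crossprod(X * u)/N #infeasible: uses the true errors
A_hat - A_true #entries are small for large N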
3.2.2 Question 2
3.2.2.1 Stata Code
**Let's use the auto.dta dataset to do this question in Mata.
sysuse auto.dta
reg mpg weight //Regression of interest
reg mpg weight, robust
mata
*OLS estimates
y = st_data(.,"mpg") //Dependent var
X = st_data(.,"weight") //Independent vars
n = rows(X) //Number of obs
X =X,J(n,1,1) //Add constant
X_prod=quadcross(X,X) //X'X
inv_X=invsym(X_prod) //inv(X'X)
beta =inv_X*quadcross(X,y) //OLS formula
beta //Correct values
*Normal SE
e = y-X*beta //Residual term
e2 = e:^2 //Squared residual
k = cols(X) //Number of cols
vcov =(quadsum(e2)/(n-k))*inv_X //Variance-covariance matrix: (sum_{i=1}^n e_i^2/(n-k))*inv(X'X)
se =sqrt(diagonal(vcov))' //Standard errors (non-robust)
*Robust SE
B = quadcross(X, e2, X) //Sandwich term (middle term) in varcov expression
vcov_robust = (n/(n-k))*inv_X*B*inv_X //Robust variance covariance matrix
se_robust =sqrt(diagonal(vcov_robust))
end
3.2.2.2 R-Code
library(haven)
library(fixest)
setwd("C:/Users/A2210391/OneDrive - BI Norwegian Business School (BIEDU)/Course folder/Microeconometrics")
auto <-read_dta("auto.dta")
##Initial regression
ols <- feols(mpg ~ weight, auto)
ols_r <- feols(mpg ~ weight, auto)
summary(ols)
OLS estimation, Dep. Var.: mpg
Observations: 74
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.4403 1.6140 24.4363 < 2.2e-16 ***
weight -6008.6870 517.8782 -11.6025 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.3921 Adj. R2: 0.646691
summary(ols, vcov="hetero")
OLS estimation, Dep. Var.: mpg
Observations: 74
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.4403 1.98832 19.8360 < 2.2e-16 ***
weight -6008.6870 584.08394 -10.2874 8.7973e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.3921 Adj. R2: 0.646691
##Set up programming
y <- auto$mpg #depvar
X <- auto$weight #indepvar
n <- nrow(auto) #Number of obs
X <- cbind(X, rep(1, nrow(auto))) #Add constant
#matrix(X, nrow = nrow(auto), ncol = 2)
X_prod <- t(X)%*%X #X'X
inv_X <- solve(X_prod) #inv(X'X)
beta <- inv_X%*%t(X)%*%y #OlS formula
#Normal SE
e <- y-X%*%beta #Residual term
e2 <-e*e #Squared residual
k <- ncol(X) #Number of cols
se2 <- sum(e2)
vcov <- (se2/(n-k))*inv_X #Variance-covariance matrix: (sum_{i=1}^n e_i^2/(n-k))*inv(X'X)
se_1 <- sqrt(vcov[1,1])
se_2 <- sqrt(vcov[2,2])
se <- cbind(se_1, se_2) #Standard errors (non-robust)
#Robust SE
e2 <- c(e2) #Convert to a plain vector so that diag() builds a diagonal matrix from it
de2 <- diag(e2)
b <- t(X)%*%de2
B <- b%*%X #Sandwich term (middle term) X'diag(e2)X
vcov_robust <- (n/(n-k))*inv_X%*%B%*%inv_X #Robust variance covariance matrix
ser_1<-sqrt(vcov_robust[1,1])
ser_2 <-sqrt(vcov_robust[2,2])
se_robust <- cbind(ser_1, ser_2)
##Comparison:
summary(ols)
OLS estimation, Dep. Var.: mpg
Observations: 74
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.4403 1.6140 24.4363 < 2.2e-16 ***
weight -6008.6870 517.8782 -11.6025 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.3921 Adj. R2: 0.646691
se
se_1 se_2
[1,] 517.8782 1.614003
summary(ols, vcov="hetero")
OLS estimation, Dep. Var.: mpg
Observations: 74
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.4403 1.98832 19.8360 < 2.2e-16 ***
weight -6008.6870 584.08394 -10.2874 8.7973e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.3921 Adj. R2: 0.646691
se_robust
ser_1 ser_2
[1,] 584.0839 1.98832
beta
[,1]
X -6008.68696
39.44028
ols$coefficients
(Intercept) weight
39.44028 -6008.68696
3.2.3 Question 3
3.2.3.1 Stata Code
clear
*Generate the data and program
capture program drop se_mce //avoids an error if the program is not yet defined
program define se_mce, rclass
drop _all
set obs 1000
gen x=runiform() //indepvar
gen eps=rnormal(0,1) //error
gen beta1 =1 //beta1
gen beta2=5 //beta2
gen y=beta1+beta2*x+eps //DGP
reg y x
end
*simulation
clear
se_mce
simulate _b, reps(1000): se_mce
sum _b_x //Good representation of "true values"
global mean_b2=r(mean)
global sd_b2=r(sd)
sum _b_cons
global mean_b1=r(mean)
global sd_b1=r(sd) //display $mean_b1
*compare "true" with estimation
se_mce
reg y x //Normal
reg y x, robust //Robust
reg y x, vce(bootstrap, reps(1000)) //Bootstrap
*Assess Normality
simulate _b _se, reps(1000): se_mce
hist _se_x, normal graphregion(color(white))
hist _se_cons, normal graphregion(color(white))
3.2.3.2 R-Code
##Set up simulation
library(gets)
n <- 1000
set.seed(123)
ols_est <- matrix(NA, nrow = n, ncol = 2)
ols_se <- matrix(NA, nrow = n, ncol = 2)
colnames(ols_est) <- c("beta1", "beta2")
colnames(ols_se) <- c("se1", "se2")
##Run simulation
for (i in 1:n){
#DGP
x <- runif(n)
eps <- rnorm(n)
beta1 <- 1
beta2 <- 5
y <- beta1+beta2*x+eps #DGP
#Estimate ols
X <- cbind(1, x)
model_ols <- ols(y[1:n], X[1:n,])
ols_est[i,1] <- model_ols$coefficients[1]
ols_est[i,2] <- model_ols$coefficients[2]
ols_se[i,1] <- sqrt(model_ols$vcov[1,1])
ols_se[i,2] <- sqrt(model_ols$vcov[2,2])
}
mean_coef <- colMeans(ols_est)
mean_sd <- colMeans(ols_se)
mean_coef
beta1 beta2
1.000446 4.998079
mean_sd
se1 se2
0.06320083 0.10952651
#Compare "true" with estimation
data <- data.frame(cbind(y, x))
ols <- feols(y ~ x, data = data)
ols_iid <- summary(ols)
ols_robust <- summary(ols, vcov="hetero")
#ols_boots <- summary (ols, vcov="bootstrap") not possible with feols
ols_iid
OLS estimation, Dep. Var.: y
Observations: 1,000
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00368 0.062594 16.0347 < 2.2e-16 ***
x 4.96478 0.107152 46.3339 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.98664 Adj. R2: 0.682336
ols_robust
OLS estimation, Dep. Var.: y
Observations: 1,000
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00368 0.062063 16.1718 < 2.2e-16 ***
x 4.96478 0.109917 45.1686 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.98664 Adj. R2: 0.682336
mean_sd
se1 se2
0.06320083 0.10952651
#Assess normality
windowsFonts(A = windowsFont("LM Roman 10"))
#Histograms of the simulated standard errors, one panel per coefficient
par(mfrow = c(1, 2))
hist(ols_se[, "se1"], col="gray", xlab="Standard error", ylab="Frequency",
main="Distribution of se1 (intercept)", family="A")
hist(ols_se[, "se2"], col="gray", xlab="Standard error", ylab="Frequency",
main="Distribution of se2 (slope)", family="A")
par(mfrow = c(1, 1))
3.2.4 Question 4
3.2.4.1 Stata Code
*Set up environment
clear
set obs 100
set seed 123
gen x=rnormal(20,5) //Exogenous regressor
gen treat=rbinomial(1,0.5) //Treatment assignment with probability 0.5
gen eps=rnormal(0,1) //Error
gen y =1+x+2*treat+eps //DGP
*ATE from regression
reg y treat, robust
reg y treat x, robust //No heterogeneity
reg y i.treat##c.x, robust //Heterogeneity
*Same using margins command
margins, dydx(treat)
marginsplot, graphregion(color(white))
3.2.4.2 R-Code
#Set up environment
library(fixest)
library(margins)
n <- 100
set.seed(123)
x <- rnorm(n, mean = 20, sd=5) #Exogenous regressor
treat <- rbinom(n, 1, 0.5) #Treatment assignment with probability 0.5
eps <- rnorm(n) #Error
y <- 1+x+2*treat+eps #DGP
#ATE from regression
data=data.frame(cbind(y,x,treat))
ate_reg1 <- feols(y ~ treat, data=data)
ate_reg2 <- feols(y ~ treat + x, data=data)
ate_reg3 <- feols(y ~ treat + x + treat*x, data=data)
ate1 <- summary(ate_reg1, vcov="hetero")
ate2 <- summary(ate_reg2, vcov="hetero")
ate3 <- summary(ate_reg3, vcov="hetero")
ate1
OLS estimation, Dep. Var.: y
Observations: 100
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.10475 0.622106 33.92467 < 2.2e-16 ***
treat 2.75504 0.911467 3.02265 0.0031976 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 4.50281 Adj. R2: 0.076123
ate2
OLS estimation, Dep. Var.: y
Observations: 100
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.529212 0.366319 4.17454 6.5193e-05 ***
treat 2.054639 0.192159 10.69241 < 2.2e-16 ***
x 0.973582 0.017808 54.67002 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.922188 Adj. R2: 0.960849
ate3
OLS estimation, Dep. Var.: y
Observations: 100
Standard-errors: Heteroskedasticity-robust
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.122666 0.503427 2.23005 0.02807417 *
treat 2.861248 0.756931 3.78006 0.00027228 ***
x 0.993802 0.025659 38.73162 < 2.2e-16 ***
treat:x -0.039429 0.034756 -1.13447 0.25942302
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.917863 Adj. R2: 0.960812
#Variant of margins
ate_reg3 <- lm(y ~ treat + x + treat*x, data=data) #(has to use lm instead of feols)
m <- margins(ate_reg3)
plot(m)