One of the classical assumptions is strict exogeneity. This assumption is usually presented as:
\[ E(\epsilon_i|X)=0 \]
(a) Re-write the assumption without using the matrix \(X\), using instead the vectors \(x_i\) and \(x_j\), which contain, respectively, observation \(i\) and observation \(j\) of all regressors:
\(E(\epsilon_i|x_i, x_j)=0\)
Where:
\(x_i=[1, x_{i,2},...,x_{i,k}]', x_j=[1, x_{j,2},...,x_{j,k}]'\)
\(E(\epsilon_i |1, x_{i,2},...,x_{i,k}, 1, x_{j,2},...,x_{j,k}) =0\)
However, since a constant (1) carries no information, the expectation conditional on the vectors including the constant is the same as the expectation conditional on \((x_{i,2},...,x_{i,k}, x_{j,2},...,x_{j,k})\).
(b) Explain the meaning of this assumption.
The expected value of \(\epsilon_i\) given \(X\) is zero. It can also be stated as \(E(\epsilon_i|x_i)=0\), that is, the expected value of \(\epsilon_i\) given \(x_i\) is zero; the latter is implied by the former. Under \(E(\epsilon_i|X) = 0\) we can interpret the regression model as describing the conditional expected value of \(y_i\) given all observations of \(X\).
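One way to see that the former implies the latter: since \(x_i\) is contained in the information set \(X\), the law of iterated expectations gives
\[ E(\epsilon_i|x_i) = E\big[E(\epsilon_i|X)\,\big|\,x_i\big] = E[0\,|\,x_i] = 0 \]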
(c) Prove, one by one, that strict exogeneity implies:
\[ E(\epsilon_i) =0 \]
The law of total expectation implies that \(E[E(\epsilon_i|X)]= E(\epsilon_i)\), so the unconditional mean of the error term is zero, i.e. \(E(\epsilon_i)=0\) for \(i=1,2,...,n\). Therefore, in our case:
\[ E[E(\epsilon_i|X)]= E(\epsilon_i)=0 \]
\[ E(x_j\cdot\epsilon_i)=0 \]
Which can also be expressed as:
\[ E(x_j\cdot\epsilon_i)= \begin{bmatrix} E(x_{j,1}\epsilon_i) \\ E(x_{j,2}\epsilon_i) \\ \vdots\\ E(x_{j,k}\epsilon_i) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots\\ 0 \end{bmatrix} \]
Where:
\[ \begin{align*} E(x_{j,k}\epsilon_i)&= E[E(x_{j,k}\epsilon_i|x_{j,k})] \leftarrow \text{LTE}\\ &= E[x_{j,k}\cdot E(\epsilon_i|x_{j,k})]\leftarrow \text{LCE} \\ &= E[x_{j,k}\cdot 0]\\ &=0 \end{align*} \]
where LTE denotes the Law of Total Expectation and LCE the linearity of conditional expectations.
\[ Cov(x_{j,k},\epsilon_i) = 0 \]
We can express covariance as:
\[ Cov(X,Y) = E(XY) - E(X)E(Y) \]
And therefore we can show that:
\[ Cov(x_{j,k},\epsilon_i) = E(x_{j,k}\cdot\epsilon_i)- E(x_{j,k}) E(\epsilon_i) = 0 - E(x_{j,k})\cdot 0 = 0 \]
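As an informal numerical check, here is a minimal simulation sketch (hypothetical variable names, assuming the error is generated independently of the regressor): the sample analogues of these three moments are all close to zero.
% Minimal sketch: simulate an exogenous regressor and error and check the moments
n_sim = 1e6;                         % number of draws
x_sim = unifrnd(0, 100, n_sim, 1);   % a regressor, drawn independently of the error
e_sim = sqrt(64).*randn(n_sim, 1);   % error term with mean zero
mean(e_sim)                          % approx. 0  -> E(eps_i) = 0
mean(x_sim.*e_sim)                   % approx. 0  -> E(x_j*eps_i) = 0
cov_mat = cov(x_sim, e_sim);         % 2x2 sample covariance matrix
cov_mat(1,2)                         % approx. 0  -> Cov(x_j, eps_i) = 0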
Consider the following DGP
Suppose we estimate the parameters of the regression by OLS using samples generated by the DGP above:
\[ y_i = \beta_1 +\beta_2x_{i,2} +\beta_3x_{i,3} + \epsilon_i \]
A1, linearity, holds because the model is linear in the parameters (additive).
A2, possibly stochastic regressors with no perfect collinearity, holds because the regressors are drawn from different uniform distributions with different bounds.
A3, strict exogeneity, holds because the errors are drawn from a distribution that is independent of the regressors.
A4, conditionally spherical disturbances, holds because the errors are drawn independently from a distribution with constant variance.
A5, disturbances conditionally jointly normal, holds because the disturbances are drawn from a normal distribution.
%Exercise 2.b.
% DGP: y = 2 + 2*x_2 + 2*x_3 + e
%Sample size
n = 90;
%Error
var_error = 64;
e = 0 +sqrt(var_error).*randn(n,1);
v = 0 +sqrt(16).*randn(n,1);
%Size of the regressor vectors
sz = [n 1];
%Regressor x_3
x_3 = unifrnd(0,90,sz);
%Alternative regressor x_2 (correlated with x_3 through v)
x_2_alt = x_3 + v;
%Regressor x_2
U_b = 100;
x_2 = unifrnd(0,U_b,sz);
%Describing the DGP for y
y = 2 + 2*x_2 + 2*x_3 + e;
%Let's compute B_ols
X = [ones(n,1) x_2 x_3];
[b_ols1,~,~,~,~, R2_1] = ols_A(y, X);
display(b_ols1)
%Set the # of repetitions of the experiment;
reps = 10000;
%Create the output vector/matrix to be filled out;
k = size(X,2);
betas = nan(reps,k);
%Loop over all the repetitions storing in the output vector/matrix.
for i = 1:reps
e = sqrt(var_error).*randn(n,1); %The error term follows a Normal(0,64)
y = 2 + 2*x_2 + 2*x_3 + e;
betas(i,:) = (X'*X\X'*y)';
end
% MC simulation outcome:
mean(betas)
figure()
subplot(3,1,1)
histfit(betas(:,2))
txt_A = ['B_{0,2}~ U(0,' int2str(U_b) ')'];
% txt_A = [ 'B_{0,2}~ x_2= x_3+ v']
txt_B = ['N = ' int2str(n) ' Obs' ' Var Error ' int2str(var_error)];
title({txt_A;txt_B})
% txt = '{\it\mu} = 10, {\it\sigma} = 5';
% subtitle(txt)
xlim([1.85 2.15])
subplot(3,1,2) ; scatter(x_2, y,'filled');
xlim([0 100])
ylim([0 400])
xlabel('x_2');
ylabel('y');
subplot(3,1,3); scatter(x_3, y, 'filled');
xlim([0 100])
ylim([0 400])
xlabel('x_3');
ylabel('y');
saveas(gcf,'Barchart.png')
var(betas(:,2))
mean(betas(:,2))
var(x_2)
var(x_2_alt)
The plot: a histogram of the simulated \(\hat{\beta}_2\) values, together with scatter plots of \(x_2\) and \(x_3\) against \(y\).
It would decrease the variance of the distribution of the estimator, while its mean should remain approximately the same; conversely, a higher variance carries over to our OLS estimator as well.
As we see below, the variance of our estimator increases. This may be counterintuitive at first, because we are decreasing the variance of the distribution from which we sample our regressor. However, when we look at the scatter plots of \(x_2\) and \(x_3\) against \(y\), it is easy to see what is happening. On the left, we have the scatter when our regressor is sampled from U(0,10); there, \(x_3\) plays a bigger role in explaining fluctuations in \(y\). In short, when we decrease the variance of \(x_2\), the variance of our OLS estimator of \(\beta_2\) is higher.
The larger the sample variance of X, the smaller the variance of the OLS estimates and vice versa.
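This matches the usual expression for the conditional variance of a single OLS slope, where \(SST_j\) is the total sample variation in \(x_j\) and \(R^2_j\) is the R-squared from regressing \(x_j\) on the other regressors:
\[ Var(\hat{\beta}_j|X) = \frac{\sigma^2}{SST_j(1-R^2_j)}, \qquad SST_j=\sum_{i=1}^n(x_{i,j}-\bar{x}_j)^2 \]
so a smaller spread in \(x_2\) (a smaller \(SST_2\)) mechanically inflates \(Var(\hat{\beta}_2)\), just as a larger error variance does.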
No, we were not surprised: as we increase \(n\), the distribution narrows down. That is, as we increase the sample size the estimator gets closer to its true value. (With repeated sampling we can expect that the OLS estimator is on average equal to the true value.)
Consider Nerlove’s famous example estimating a cost function for electricity supply.
%Question 3
%We want to estimate b_hat and SE(b_hat) of a linear model:
nerlove = readtable('nerlove.csv'); %Insert data
data = table2array(nerlove);
X = log(data); %transform our data into log form
y = X(:,2); %isolate our dependent variable, cost
X(:,2) = []; %Remove the second column, cost, from X so we have output, labor, fuel and capital
X(:,1) = 1; %Replace the first column with a constant equal to 1
[n, k] = size(X); %The size of X is n=145 and k=5
b_hat = inv(X'*X)*X'*y; %OLS coefficient estimates
se = ((y'*(y - X*b_hat))/(n - k))*inv(X'*X); %Estimated covariance matrix of b_hat
stderrs = nan(1,k); %Preallocate the vector of standard errors
for i = 1:k
stderrs(i) = sqrt(se(i,i));
end
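As a small illustration, the table below could be printed from the quantities already computed (the label list here is added just for readability):
%Print the coefficient table from b_hat and stderrs
labels = {'Constant','Ln(Output)','Ln(Labour)','Ln(Fuel)','Ln(Capital)'};
for i = 1:k
    fprintf('%-12s %10.4f %10.4f\n', labels{i}, b_hat(i), stderrs(i));
end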
| Variable | Coef. | Std. Err. |
|---|---|---|
| Constant | -3.5265 | 1.7744 |
| Ln(Output) | 0.7204 | 0.0175 |
| Ln(Labour) | 0.4363 | 0.2910 |
| Ln(Fuel) | 0.4265 | 0.1004 |
| Ln(Capital) | -0.2199 | 0.3394 |
All classical assumptions (A1 to A4), excluding normality, are needed. Under assumptions A1 to A4 we have that:
\[ E(\hat{\beta}_j)=\beta_j \quad j=0,1,...,k \]
for any value of the population parameter \(\beta_j\). In other words, the OLS estimators are unbiased estimators of the population parameters. When we say that OLS is unbiased under assumptions A1 to A4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view it as being applied across all possible random samples. We hope that we have obtained a sample that gives us an estimate close to the population value but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or more likely to be too small.
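The standard argument behind this statement (conditioning on \(X\)) is:
\[ \hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta+\epsilon)=\beta+(X'X)^{-1}X'\epsilon \]
\[ E(\hat{\beta}|X) = \beta + (X'X)^{-1}X'E(\epsilon|X) = \beta \quad\Rightarrow\quad E(\hat{\beta})=\beta \]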
Exercise 4
\[ \tilde{\sigma}^2\equiv \frac{\hat{\epsilon}'\hat{\epsilon}}{n} \]
First, we need to characterize the residuals:
\[ Y = X\hat{\beta}+ \hat{\epsilon} \quad \rightarrow \quad \hat{\epsilon} = Y - X\hat{\beta} \]
The residuals inherit a zero expectation from the error term:
\[ E(\hat{\epsilon}) = E(Y - X\hat{\beta}) = E(X\beta +\epsilon)-XE(\hat{\beta})= X\beta + E(\epsilon) - X\beta = 0 \]
Therefore:
\[ \tilde{\sigma}^2\equiv \frac{\hat{\epsilon}'\hat{\epsilon}}{n} \]
\[ \begin{align*} \hat{\epsilon}&= Y-X\hat{\beta}=Y-X(X'X)^{-1}X'Y=(I-X(X'X)^{-1}X')Y = MY\\ &=(I - X(X'X)^{-1}X')(X\beta+\epsilon) = (X\beta-X(X'X)^{-1}X'X\beta)+M\epsilon\\ &=(X\beta - X\beta)+ M\epsilon\\ &=M\epsilon \end{align*} \]
We see that \(\hat{\epsilon}\) depends on the error term, so the residuals are random because \(\epsilon\) is random. Now we will make use of two properties: first, the matrix \(M\) is symmetric and idempotent; second, \(\epsilon'M\epsilon\) is a scalar because it is a product of matrices of dimensions \(1\times n\), \(n\times n\) and \(n\times 1\).
\[ E(\tilde{\sigma}^2) =\frac{1}{n}E(\hat{\epsilon}'\hat{\epsilon})= \frac{1}{n}E(\epsilon'M'M\epsilon)= \frac{1}{n}E(\epsilon'MM\epsilon) = \frac{1}{n}E(\epsilon'M\epsilon) \]
Now, we apply the trace operator, bearing in mind that the trace of a scalar is a scalar itself:
\[ \begin{align*} E(\tilde{\sigma}^2) &=\frac{1}{n} E[tr(\epsilon'M\epsilon)]\\ &= \frac{1}{n} E[tr(M\epsilon\epsilon')]\\ &= \frac{1}{n} tr[E(M\epsilon\epsilon')]\\ &= \frac{1}{n} tr[M\,E(\epsilon\epsilon')] \leftarrow\quad E(\epsilon\epsilon')=\sigma^2 I_n \text{, treating } X \text{ (and hence } M \text{) as given} \\ &= \frac{\sigma^2}{n} tr[M]\\ &=\frac{\sigma^2}{n} tr[I_n-X(X'X)^{-1}X'] \\ &=\frac{\sigma^2}{n}\big( tr[I_n]-tr[X(X'X)^{-1}X']\big) \\ &=\frac{\sigma^2}{n}( n-k ) \neq \sigma^2 \end{align*} \]
where \(tr[X(X'X)^{-1}X'] = tr[(X'X)^{-1}X'X] = tr[I_k] = k\).
Since the biased estimator ignores the \(k\) degrees of freedom used in estimating \(\beta\), it underestimates the true variance because its denominator is larger \((n>n-k)\). Hence it is negatively biased.
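A minimal Monte Carlo sketch of this bias, re-using the DGP and objects (reps, n, X, x_2, x_3, var_error) from Exercise 2 above; the estimator names here are just illustrative:
%Compare the biased (divide by n) and unbiased (divide by n-k) variance estimators
sigma2_tilde = nan(reps,1); %epsilon_hat'*epsilon_hat / n
s2 = nan(reps,1);           %epsilon_hat'*epsilon_hat / (n-k)
for r = 1:reps
    e = sqrt(var_error).*randn(n,1);
    y = 2 + 2*x_2 + 2*x_3 + e;
    e_hat = y - X*(X\y);    %OLS residuals
    sigma2_tilde(r) = (e_hat'*e_hat)/n;
    s2(r) = (e_hat'*e_hat)/(n - size(X,2));
end
mean(sigma2_tilde) %approx. 64*(n-k)/n, i.e. below the true value of 64
mean(s2)           %approx. 64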
As \(n\) increases the bias goes to zero, because as \(n\rightarrow\infty\), \(n\approx n-k\). Because \(\hat{\beta}_j\) is unbiased under Assumptions A1 through A4, its distribution has mean value \(\beta_j\). If this estimator is consistent, then the distribution of \(\hat{\beta}_j\) becomes more and more tightly concentrated around \(\beta_j\) as the sample size grows. As \(n\) tends to infinity, the distribution of \(\hat{\beta}_j\) collapses to the single point \(\beta_j\). In effect, this means that we can make our estimator arbitrarily close to \(\beta_j\) if we can collect as much data as we want.
Consider the following DGP:
Where . To estimate parameter , the following estimators are proposed:
Where is the \(n \times 1\) vector of OLS residuals of estimating the following regression model with a sample generated by the DGP above:
\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{n}= \frac{\epsilon'\epsilon}{n}- \frac{\epsilon'X}{n}(\frac{X'X}{n})^{-1}\frac{X'\epsilon}{n} \]
Did you need to introduce any assumption from the DGP for your proof?
We can write residuals in vector notation as:
\[ \hat{\epsilon}=y- \hat{y}= y-X\hat{\beta}=My=M(X\beta +\epsilon)=MX\beta+M\epsilon =M\epsilon, \quad \text{where} \quad M=I_n-X(X'X)^{-1}X' \]
\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{n}= \frac{(My)'My}{n}= \frac{y'M'My}{n}=\frac{y'My}{n} \]
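Since \(My = M\epsilon\), expanding \(M\) makes the identity stated above explicit; only the linear DGP \(y = X\beta + \epsilon\) is used in this algebraic step, no exogeneity assumption is needed:
\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{n} = \frac{\epsilon'M\epsilon}{n} = \frac{\epsilon'\big(I_n - X(X'X)^{-1}X'\big)\epsilon}{n} = \frac{\epsilon'\epsilon}{n} - \frac{\epsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\epsilon}{n} \]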
b. Prove that:
\[ plim(\frac{\sum^n_{i=1}e_i^2}{n}) = \sigma^2_0 \]
First, we examine the consistency of the error-variance estimator:
\[ \begin{align*} \tilde{\sigma}^2&= \frac{1}{n}\sum^n_{i=1}\hat{e}_i^2 = \frac{1}{n}\sum(y_i - X_i'\hat{\beta})^2\\ &=\frac{1}{n}\sum(e_i + X_i'\beta - X_i'\hat{\beta})^2\\ &=\frac{1}{n}\sum\big(e_i - X_i'(\hat{\beta}-\beta)\big)^2\\ &=\frac{1}{n}\sum\big(e^2_i - 2e_iX_i'(\hat{\beta}-\beta)+(\hat{\beta}-\beta)'X_iX_i'(\hat{\beta}-\beta)\big)\\ &= \frac{1}{n}\sum e^2_i - 2\Big(\frac{1}{n}\sum e_iX_i'\Big)(\hat{\beta}-\beta)+(\hat{\beta}-\beta)'\Big(\frac{1}{n}\sum X_iX_i'\Big)(\hat{\beta}-\beta) \end{align*} \]
By the Weak Law of Large Numbers:
\[ plim(\frac{\sum^n_{i=1}e_i^2}{n}) = \sigma^2_0 \\plim(\frac{\sum e_ix_i'}{n}) =E(e_ix_i')= 0\\ plim(\frac{\sum x_ix_i'}{n}) = E(x_ix_i') < \infty \]
And we know that:
\[ \hat{\beta} \xrightarrow{p}\beta \quad \text{and therefore} \quad \tilde{\sigma}^2 \xrightarrow{p} \sigma^2_0 \]
\[ s^2= \frac{1}{n-k}\sum\hat{e}_i^2 \]
Following the same process, and using that \(n/(n-k)\rightarrow 1\) as \(n \rightarrow \infty\), \(s^2\) is consistent as well.
In other words:
\[ plim(\tilde{\sigma}^2)= plim(\frac{\hat{\epsilon}'\hat{\epsilon}}{n}) =plim(\frac{\epsilon'\epsilon}{n}) = \sigma^2_0 \]
We can show that \(\hat{\sigma}^2\) is also a consistent estimator of \(\sigma^2_0\) as:
\[ plim(\hat{\sigma}^2)= plim(\frac{\hat{\epsilon}'\hat{\epsilon}}{n-k}) \]
because, using the result above and the fact that \(n/(n-k)\rightarrow 1\) as \(n\rightarrow\infty\):
\[ plim(\hat{\sigma}^2)= plim\left(\frac{n}{n-k}\cdot\frac{\hat{\epsilon}'\hat{\epsilon}}{n}\right) = 1\cdot\sigma^2_0 = \sigma^2_0 \]
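A minimal sketch of this consistency result, reusing the DGP of Exercise 2 (true error variance 64; the sample sizes in the loop are just illustrative): both estimators settle near \(\sigma^2_0\) as \(n\) grows.
%Both variance estimators approach sigma^2_0 = 64 as the sample size grows
for n_big = [100 1000 100000]
    x2_big = unifrnd(0,100,n_big,1);
    x3_big = unifrnd(0,90,n_big,1);
    e_big = sqrt(64).*randn(n_big,1);
    y_big = 2 + 2*x2_big + 2*x3_big + e_big;
    X_big = [ones(n_big,1) x2_big x3_big];
    e_hat = y_big - X_big*(X_big\y_big); %OLS residuals
    fprintf('n = %6d: e_hat''*e_hat/n = %6.2f, /(n-k) = %6.2f\n', ...
        n_big, (e_hat'*e_hat)/n_big, (e_hat'*e_hat)/(n_big - size(X_big,2)));
end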