Exercise 1

One of the classical assumptions is strict exogeneity. This assumption is usually presented as:

\[ E(\epsilon_i|X)=0 \]

(a) Re-write the assumption without using the matrix \(X\), but using instead the vectors \(x_i\) and \(x_j\), which contain, respectively, observation i and observation j of all regressors:

\(E(\epsilon_i|x_i, x_j)=0\)

Where:

\(x_i=[1, x_{i,2},...,x_{i,k}]', x_j=[1, x_{j,2},...,x_{j,k}]'\)

\(E(\epsilon_i |1, x_{i,2},...,x_{i,k}, 1, x_{j,2},...,x_{j,k}) =0\)

However, since a constant (1) does not provide any information, the expectation conditional on the vectors including the constant is the same as the expectation conditional on \((x_{i,2},...,x_{i,k}, x_{j,2},...,x_{j,k})\).

(b) Explain the meaning of this assumption.

The expected value of \(\epsilon_i\) given all observations of all regressors, \(X\), is zero. It can also be stated that \(E(\epsilon_i|x_i)=0\), i.e. the expected value of \(\epsilon_i\) given \(x_i\) is zero; the latter is implied by the former. Under \(E(\epsilon_i|X) = 0\) we can interpret the regression model as describing the conditional expected value of \(y_i\) given all observations of \(X\).
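This implication can be made explicit with the Law of Iterated Expectations: since \(x_i\) is a subvector of \(X\),

\[ E(\epsilon_i|x_i) = E[E(\epsilon_i|X)\,|\,x_i] = E[0\,|\,x_i] = 0 \]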

(c) Prove, one by one, that strict exogeneity implies:

\[ E(\epsilon_i) =0 \]

The Law of Total Expectation implies that \(E[E(\epsilon_i|X)]= E(\epsilon_i)\), so the unconditional mean of the error term is zero, i.e. \(E(\epsilon_i)=0\quad for \quad i=1,2,...,n\). Therefore, in our case we can see that:

\[ E[E(\epsilon_i|X)]= E(\epsilon_i)=0 \]

\[ E(x_j\cdot\epsilon_i)=0 \]

Which can also be expressed as:

\[ E(x_j\cdot\epsilon_i)= \begin{bmatrix} E(x_{j,1}\cdot\epsilon_i) \\ E(x_{j,2}\cdot\epsilon_i) \\ \vdots\\ E(x_{j,k}\cdot\epsilon_i) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots\\ 0 \end{bmatrix} \]

Where:

\[ \begin{align*} E(x_{j,k}\cdot\epsilon_i) &= E[E(x_{j,k}\epsilon_i|x_{j,k})] \leftarrow LTE\\ &= E[x_{j,k}\cdot E(\epsilon_i|x_{j,k})]\leftarrow LCE \\ &= E[x_{j,k}\cdot0]\\ &=0 \end{align*} \]

where LTE is the Law of Total Expectation and LCE is the linearity of conditional expectations (a variable that is conditioned on can be pulled out of the conditional expectation).

\[ Cov(x_{j,k},\epsilon_i) = 0 \]

We can express covariance as:

\[ Cov(X,Y) = E(XY) - E(X)E(Y) \]

And therefore we can show that:

\[ Cov(x_{j,k},\epsilon_i) = E(x_{j,k}\cdot\epsilon_i)- E(x_{j,k}) E(\epsilon_i) = 0- E(x_{j,k})\cdot 0 = 0 \]
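As a purely illustrative check (not part of the formal proof), a short simulation under a hypothetical DGP in which strict exogeneity holds by construction should give sample analogues of the three implications that are all close to zero:

%Illustrative check of the implications of strict exogeneity (hypothetical DGP)
n = 100000;                     %number of observations
x = unifrnd(0,10,[n 1]);        %regressor drawn independently of the error
e = randn(n,1);                 %error independent of x, so E(e|x) = 0 by construction
mean(e)                         %sample analogue of E(e_i); should be close to 0
mean(x.*e)                      %sample analogue of E(x_j*e_i) (here with j = i); close to 0
cov(x,e)                        %2x2 sample covariance matrix; off-diagonal close to 0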

Exercise 2

Consider the following DGP

Consider that we estimate the parameters of the following regression by OLS, using samples generated by the DGP above:

\[ y_i = \beta_1 +\beta_2x_{i,2} +\beta_3x_{i,3} + \epsilon_i \]

  1. Do you think assumptions [A1] to [A5] hold? Briefly discuss whether that is the case, one at a time.
  1. Write a Matlab .m file that allows you to experimentally approximate the conditional sampling distribution of the OLS estimator of parameter \(\beta_2\) by drawing 10000 samples of 90 observations each. Run the program and, with the 10000 estimates you got, plot the density histogram of \(\hat{\beta}_2|X\). Present the .m file and the plot in your answer. Briefly comment on the shape and location of the density. Are they as you expected? Rigorously justify.
%Exercise 2.b. 

% DGP: y = 2 + 2*x_2 + 2*x_3 + e

%Sample size
n = 90;
%Error term e ~ N(0,var_error) and noise v ~ N(0,16) used for question 2e
var_error = 64;
e = sqrt(var_error).*randn(n,1);
v = sqrt(16).*randn(n,1);
%Size of the regressor vectors
sz = [n 1];
%Regressor x_3
x_3 = unifrnd(0,90,sz);
%Alternative regressor x_2 for question 2e: x_2 = x_3 + v
x_2_alt = x_3 + v;
%Regressor x_2, drawn from a Uniform(0,U_b)
U_b = 100;
x_2 = unifrnd(0,U_b,sz);
%Describing the DGP for y
y = 2 + 2*x_2 + 2*x_3 + e;
%Let's compute B_ols
X = [ones(n,1) x_2 x_3];
[b_ols1,~,~,~,~, R2_1] = ols_A(y, X);  
display(b_ols1) 
%Set the # of repetitions of the experiment; 
reps = 10000;
%Create the output vector/matrix to be filled out;
k = size(X,2);
betas = nan(reps,k);
%Loop over all the repetitions storing in the output vector/matrix.
    for i = 1:reps
    e = sqrt(64).*randn(n,1);      %The error term follows a Normal(0,64)
    y = 2 + 2*x_2 + 2*x_3 + e;
    betas(i,:) = (X'*X\X'*y)';     
    end
% MC simulation outcome:
mean(betas)

figure()
subplot(3,1,1)
histfit(betas(:,2))
txt_A = [ 'B_{0,2}~ U(0,' int2str(U_b) ')'];
% txt_A = [ 'B_{0,2}~ x_2= x_3+ v']
txt_B = ['N = ' int2str(n) ' Obs' ' Var Error ' int2str(var_error)];

title({txt_A;txt_B})
% txt = '{\it\mu} = 10, {\it\sigma} = 5';
% subtitle(txt)

xlim([1.85 2.15])
subplot(3,1,2) ; scatter(x_2, y,'filled');
    xlim([0 100])
    ylim([0 400])
    xlabel('x_2');
    ylabel('y');
subplot(3,1,3); scatter(x_3, y, 'filled');
    xlim([0 100])
    ylim([0 400])
    xlabel('x_3');
    ylabel('y');

saveas(gcf,'Barchart.png')

var(betas(:,2))
mean(betas(:,2))

var(x_2) 
var(x_2_alt)

The plot:

  1. Now, if you changed the value of \(\sigma^2_0\) from 64 to 144, and repeated the experiment you performed in 2b, how do you think the estimated density would change?

It would increase the variance of the sampling distribution of the estimator, but the mean should remain approximately the same. The error variance is higher, and that carries over to the variance of our OLS estimator as well.
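One way to quantify this is the standard expression for the conditional variance of a single slope coefficient (written here for \(\hat{\beta}_2\); \(R^2_2\) denotes the R-squared from regressing \(x_2\) on the other regressors):

\[ Var(\hat{\beta}_2|X) = \frac{\sigma^2_0}{SST_2(1-R^2_2)}, \qquad SST_2 = \sum^n_{i=1}(x_{i,2}-\bar{x}_2)^2 \]

Raising \(\sigma^2_0\) from 64 to 144 scales this variance up by a factor of \(144/64 = 2.25\), while \(E(\hat{\beta}_2|X)=\beta_2\) is unchanged.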

  1. Now, consider the true value of \(\sigma^2_0\) remained equal to 64, but the 90 observations of \(x_2\) were drawn from a \(Uniform(0, 10)\) instead. How do you think the estimated density from 2b would change?

As we see below, the variance of our estimator increases. This may be counterintuitive at first, because we are decreasing the variance of the distribution from which our regressor is sampled. However, when we look at the scatter plots of \(x_2\) and \(x_3\) against \(y\), it is easy to see what is happening: when \(x_2\) is sampled from \(U(0,10)\) it varies very little, and \(x_3\) plays a much bigger role in explaining the fluctuations in \(y\). In short, when we decrease the sample variance of \(x_2\), the variance of our OLS estimator of \(\beta_2\) is higher.
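The variance expression above also quantifies the change: under the simulated design the population variance of \(x_2\) falls from \(Var(U(0,100)) = 100^2/12 \approx 833\) to \(Var(U(0,10)) = 10^2/12 \approx 8.3\), roughly a factor of 100, so with \(\sigma^2_0\), \(n\) and the (near-zero) correlation between \(x_2\) and \(x_3\) unchanged, \(Var(\hat{\beta}_2|X)\) becomes roughly 100 times larger.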

  1. Now, consider the true value of \(\sigma^2_0\) remained equal to 64, but \(x_{i,2} = x_{i,3} + v_i\), where \(v_i \sim i.i.d.\,N(0, 16)\). How do you think the density from 2b would change?

In general, the larger the sample variation of a regressor (holding everything else fixed), the smaller the variance of its OLS estimate, and vice versa. Here, however, the dominant effect is that \(x_2\) and \(x_3\) become highly correlated, and this near-collinearity increases the variance of \(\hat{\beta}_2\).
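A back-of-the-envelope calculation with the population moments of this design (assuming, as in the code, \(x_3 \sim U(0,90)\) independent of \(v\)) makes the collinearity explicit: \(Var(x_3) = 90^2/12 = 675\) and \(Var(x_2) = Var(x_3)+Var(v) = 675+16 = 691\), so that

\[ Corr(x_2,x_3) = \sqrt{\frac{675}{691}} \approx 0.988, \qquad \frac{1}{1-R^2_2} \approx \frac{1}{1-0.977} \approx 43 \]

Even though the total variation of \(x_2\) is of the same order as in 2b, the \((1-R^2_2)\) term inflates \(Var(\hat{\beta}_2|X)\) by a factor of roughly 43 relative to a design with uncorrelated regressors.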

  1. Finally, consider we repeat 2b, but drawing 10000 samples of 900 observations each. Run the program and, with the 10000 estimates you got, plot the density histogram of \(\hat{\beta}_2|X\). Side by side, plot the density histogram you got in answering 2b with the one you got in this question, keeping the range of the x-axis the same in both graphs. Are you surprised by how the histogram has changed when we move from n = 90 to n = 900?

No, we were not surprised: as we increase \(n\), the histogram narrows. That is, as we increase the sample size, the estimator becomes more tightly concentrated around its true value. (With repeated sampling we can expect the OLS estimator to be on average equal to the true value, and its sampling variance shrinks as \(n\) grows.)
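A minimal sketch of how the 2b loop could be rerun with n = 900 and plotted side by side (assuming the same DGP as in 2b; the variable names below are hypothetical and the `betas` matrix from 2b is reused for the left panel):

%Sketch: repeat the 2b experiment with n = 900 and compare the two histograms
n2 = 900; reps = 10000;
x_2b = unifrnd(0,100,[n2 1]);          %same design as 2b, but with 900 observations
x_3b = unifrnd(0,90,[n2 1]);
Xb = [ones(n2,1) x_2b x_3b];
betas_900 = nan(reps,3);
for i = 1:reps
    eb = sqrt(64).*randn(n2,1);        %error term N(0,64), as in 2b
    yb = 2 + 2*x_2b + 2*x_3b + eb;
    betas_900(i,:) = (Xb'*Xb\Xb'*yb)';
end
figure()
subplot(1,2,1); histfit(betas(:,2));     xlim([1.85 2.15]); title('n = 90')
subplot(1,2,2); histfit(betas_900(:,2)); xlim([1.85 2.15]); title('n = 900')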

Exercise 3

Consider Nerlove’s famous example estimating a cost function for electricity supply.

  1. Using data file nerlove.csv, write a Matlab .m file that, using the same data set, calculates the following elements without using any pre-defined OLS function:
%Question 3
%We want to estimate b_hat and SE(b_hat) of a linear model:
nerlove = readtable('nerlove.csv'); %Insert data
data = table2array(nerlove);

X = log(data); %transform our data into log form
y = X(:,2); %isolate our dependent variable, cost
X(:,2) = []; %Remove the second column, cost, from X so we have output, labor, fuel and capital
X(:,1) = 1; %Replace the first column with a column of ones (the constant)
[n, k] = size(X); %n = 145 observations, k = 5 columns (constant plus four regressors)

b_hat = inv(X'*X)*X'*y;                            %OLS coefficient estimates
se = ((y'*(y - X*b_hat))/( n - k ))*inv(X'*X);     %estimated covariance matrix s^2*inv(X'X), using y'(y - X*b_hat) = e_hat'e_hat

stderrs = nan(1,k);                                %standard errors: square roots of the diagonal
for i= 1:k
    stderrs(i) = sqrt(se(i,i));
end
|             | Coef.   | Std. Err. |
|-------------|---------|-----------|
| Constant    | -3.5265 | 1.7744    |
| Ln(Output)  | 0.7204  | 0.0175    |
| Ln(Labour)  | 0.4363  | 0.2910    |
| Ln(Fuel)    | 0.4265  | 0.1004    |
| Ln(Capital) | -0.2199 | 0.3394    |
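For reference, the quantities the script computes are the usual OLS formulas (using that \(y'\hat{\epsilon} = \hat{\epsilon}'\hat{\epsilon}\), since \(X'\hat{\epsilon}=0\)):

\[ \hat{\beta} = (X'X)^{-1}X'y, \qquad \widehat{Var}(\hat{\beta}|X) = s^2(X'X)^{-1}, \qquad s^2 = \frac{\hat{\epsilon}'\hat{\epsilon}}{n-k} \]

and the reported standard errors are the square roots of the diagonal elements of \(\widehat{Var}(\hat{\beta}|X)\).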
  1. List which assumptions regarding the DGP behind the data would be necessary to justify the way these standard errors were calculated.

All the classical assumptions (A1 to A4), i.e. excluding normality. Under assumptions A1 to A4 we have that:

\[ E(\hat{\beta}_j)=\beta_j \quad j=0,1,...,k \]

for any value of the population parameter \(\beta_j\). In other words, the OLS estimators are unbiased estimators of the population parameters. When we say that OLS is unbiased under assumptions A1 to A4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples. We hope that we have obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or more likely to be too small. In addition, the formula \(s^2(X'X)^{-1}\) used for the standard errors relies on the errors being homoskedastic and uncorrelated across observations conditional on \(X\).

Exercise 4

  1. We are asked to show that the estimator of \(\sigma^2_0\) defined below is biased, using the homoskedasticity assumption \(E(\epsilon_i^2|X)=\sigma^2_0\), which may be read as "the conditional variance of the error term is constant and does not vary as a function of the explanatory variables".

\[ \tilde{\sigma}^2\equiv \frac{\hat{\epsilon}'\hat{\epsilon}}{n} \]

First, we need to characterize the residuals:

\[ Y = X\hat{\beta}+ \hat{\epsilon} \quad \rightarrow \quad \hat{\epsilon} = Y - X\hat{\beta} \]

First, note that the residuals have zero expectation (they are an unbiased estimator of the error term, whose mean is zero):

\[ E(\hat{\epsilon}) = E(Y - X\hat{\beta}) =E(X\beta +\epsilon)-XE(\hat{\beta})= X\beta + E(\epsilon) - X\beta = 0 \]

Therefore, to evaluate the expectation of \(\tilde{\sigma}^2\equiv \hat{\epsilon}'\hat{\epsilon}/n\), we first express the residuals in terms of the true errors:

\[ \begin{align*} \hat{\epsilon} &= Y-X\hat{\beta}=Y-X(X'X)^{-1}X'Y=(I-X(X'X)^{-1}X')Y = MY\\ &=(I - X(X'X)^{-1}X')(X\beta+\epsilon) = X\beta-X(X'X)^{-1}X'X\beta+M\epsilon\\ &=(X\beta - X\beta)+ M\epsilon\\ &=M\epsilon \end{align*} \]

We see that \(\hat{\epsilon}\) depends on the error term: the residuals are random because \(\epsilon\) is random. Now we will make use of two properties: first, \(M\) is a symmetric and idempotent matrix, and second, \(\epsilon'M\epsilon\) is a scalar, because it is a matrix product of dimensions \((1\times n)(n\times n)(n\times 1)\).

\[ E(\tilde{\sigma}^2) =\frac{1}{n}E(\hat{\epsilon}'\hat{\epsilon})= \frac{1}{n}E(\epsilon'M'M\epsilon)= \frac{1}{n}E(\epsilon'MM\epsilon) = \frac{1}{n}E(\epsilon'M\epsilon) \]

Now we apply the trace operator, bearing in mind that the trace of a scalar is the scalar itself:

\[ \begin{align*} E(\tilde{\sigma}^2) &=\frac{1}{n} E[tr(\epsilon'M\epsilon)] \leftarrow\quad trace\quad of\quad a\quad scalar\\ &= \frac{1}{n} E[tr(M\epsilon\epsilon')] \leftarrow\quad cyclic\quad property\quad of\quad the\quad trace\\ &= \frac{1}{n} E[tr(M\,E(\epsilon\epsilon'|X))] \leftarrow\quad LTE;\quad M\quad depends\quad only\quad on\quad X\\ &= \frac{1}{n} E[tr(M\,\sigma^2_0I_n)] \leftarrow\quad homoskedasticity:\quad E(\epsilon\epsilon'|X)=\sigma^2_0I_n\\ &= \frac{\sigma^2_0}{n} E[tr(M)]\\ &= \frac{\sigma^2_0}{n} E[tr(I_n)-tr(X(X'X)^{-1}X')]\\ &= \frac{\sigma^2_0}{n}\big(n-tr[(X'X)^{-1}X'X]\big)\\ &= \frac{\sigma^2_0}{n}(n-k) \neq \sigma^2_0 \end{align*} \]
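As an illustrative check of this result (not required by the exercise), a small Monte Carlo with a hypothetical DGP should give an average of \(\hat{\epsilon}'\hat{\epsilon}/n\) close to \(\sigma^2_0(n-k)/n\) rather than \(\sigma^2_0\):

%Sketch: Monte Carlo check that the mean of e_hat'e_hat/n is sigma2*(n-k)/n, not sigma2
n = 30; k = 3; sigma2 = 64; reps = 10000;
X = [ones(n,1) unifrnd(0,100,[n 1]) unifrnd(0,90,[n 1])];   %regressors held fixed across replications
sig_tilde = nan(reps,1);
for i = 1:reps
    e = sqrt(sigma2).*randn(n,1);
    y = 2 + 2*X(:,2) + 2*X(:,3) + e;
    e_hat = y - X*((X'*X)\(X'*y));     %OLS residuals
    sig_tilde(i) = (e_hat'*e_hat)/n;
end
mean(sig_tilde)                        %should be close to sigma2*(n-k)/n = 64*27/30 = 57.6
sigma2*(n-k)/n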

  1. Can you say anything about whether \(\tilde{\sigma}^2\) is positively or negatively biased?

Since the biased estimator ignores the \(k\) degrees of freedom lost in estimating \(\beta\), it underestimates the true variance, as its denominator is larger \((n>n-k)\). Hence it is negatively biased.
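The size and sign of the bias follow directly from the result in (a):

\[ Bias(\tilde{\sigma}^2) = E(\tilde{\sigma}^2) - \sigma^2_0 = \frac{n-k}{n}\sigma^2_0 - \sigma^2_0 = -\frac{k}{n}\sigma^2_0 < 0 \]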

  1. What can you say about the size of the bias of \(\tilde{\sigma}^2\) as \(n\) increases?

As \(n\) increases, the bias goes to zero, because when \(n\rightarrow\infty\), \(n\approx n-k\). Moreover, because \(\hat{\beta}_j\) is unbiased under Assumptions A1 through A4, its sampling distribution has mean value \(\beta_j\); if the estimator is also consistent, then the distribution of \(\hat{\beta}_j\) becomes more and more tightly concentrated around \(\beta_j\) as the sample size grows. As \(n\) tends to infinity, the distribution of \(\hat{\beta}_j\) collapses to the single point \(\beta_j\). In effect, this means that we can make our estimator arbitrarily close to \(\beta_j\) if we can collect as much data as we want.

Exercise 5

Consider the following DGP:

To estimate the parameter \(\sigma^2_0\), the following estimators are proposed: \(\tilde{\sigma}^2\equiv \hat{\epsilon}'\hat{\epsilon}/n\) and \(\hat{\sigma}^2\equiv \hat{\epsilon}'\hat{\epsilon}/(n-k)\).

Where \(\hat{\epsilon}\) is the \(n \times 1\) vector of OLS residuals from estimating the following regression model with a sample generated by the DGP above:

  1. Prove that:

\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{n}= \frac{\epsilon'\epsilon}{n}- \frac{\epsilon'X}{n}(\frac{X'X}{n})^{-1}\frac{X'\epsilon}{n} \]

Did you need to introduce any assumption from the DGP for your proof?

We can write residuals in vector notation as:

\[ \hat{\epsilon}=y- \hat{y}= y-X\hat{\beta}=My=M(X\beta +\epsilon)=MX\beta+M\epsilon =M\epsilon, \quad where \quad M=I_n-X(X'X)^{-1}X' \quad and \quad MX=0 \]

\[ \frac{\hat{\epsilon}'\hat{\epsilon}}{n}= \frac{(M\epsilon)'M\epsilon}{n}= \frac{\epsilon'M'M\epsilon}{n}=\frac{\epsilon'M\epsilon}{n}= \frac{\epsilon'(I_n-X(X'X)^{-1}X')\epsilon}{n}= \frac{\epsilon'\epsilon}{n}- \frac{\epsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\epsilon}{n} \]

For this step we only used the linearity of the model (\(y=X\beta+\epsilon\)) and the invertibility of \(X'X\); no exogeneity assumption was needed.
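As a quick numerical sanity check of this identity (purely illustrative, using a hypothetical DGP), both sides can be computed on a simulated sample and compared:

%Sketch: numerical check of e_hat'e_hat/n = e'e/n - (e'X/n)(X'X/n)^{-1}(X'e/n)
n = 200;
X = [ones(n,1) unifrnd(0,100,[n 1]) unifrnd(0,90,[n 1])];
e = sqrt(64).*randn(n,1);
y = 2 + 2*X(:,2) + 2*X(:,3) + e;       %hypothetical linear DGP
e_hat = y - X*((X'*X)\(X'*y));         %OLS residuals
lhs = (e_hat'*e_hat)/n;
rhs = (e'*e)/n - (e'*X/n)*((X'*X/n)\(X'*e/n));
disp([lhs rhs])                        %the two numbers should coincide (up to rounding)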

  1. Prove that:

\[ plim\left(\frac{\sum^n_{i=1}\hat{e}_i^2}{n}\right) = \sigma^2_0 \]

First, we expand the average of the squared residuals:

\[ \begin{align*} \frac{1}{n}\sum^n_{i=1}\hat{e}_i^2 &= \frac{1}{n}\sum^n_{i=1}(y_i - x_i'\hat{\beta})^2\\ &=\frac{1}{n}\sum^n_{i=1}(e_i + x_i'\beta - x_i'\hat{\beta})^2\\ &=\frac{1}{n}\sum^n_{i=1}\big(e_i - x_i'(\hat{\beta}-\beta)\big)^2\\ &=\frac{1}{n}\sum^n_{i=1}\big(e^2_i - 2e_ix_i'(\hat{\beta}-\beta)+(\hat{\beta}-\beta)'x_ix_i'(\hat{\beta}-\beta)\big)\\ &= \frac{1}{n}\sum^n_{i=1} e^2_i - 2\Big(\frac{1}{n}\sum^n_{i=1} e_ix_i'\Big)(\hat{\beta}-\beta)+(\hat{\beta}-\beta)'\Big(\frac{1}{n}\sum^n_{i=1} x_ix_i'\Big)(\hat{\beta}-\beta) \end{align*} \]

By the Weak Law of Large Numbers:

\[ plim\left(\frac{\sum^n_{i=1} e_i^2}{n}\right) = E(e_i^2)=\sigma^2_0, \qquad plim\left(\frac{\sum^n_{i=1} e_ix_i'}{n}\right) =E(e_ix_i')= 0, \qquad plim\left(\frac{\sum^n_{i=1} x_ix_i'}{n}\right) = E(x_ix_i') < \infty \]

And we know that \(\hat{\beta} \xrightarrow{p}\beta\), so the last two terms converge in probability to zero. Therefore:

\[ \tilde{\sigma}^2=\frac{1}{n}\sum^n_{i=1}\hat{e}_i^2 \xrightarrow{p} \sigma^2_0 \]

\[ s^2= \frac{1}{n-k}\sum\hat{e}_i^2 \]

The estimator \(s^2\) (that is, \(\hat{\sigma}^2\)) is consistent by the same argument, since \(s^2 = \frac{n}{n-k}\cdot\frac{1}{n}\sum\hat{e}_i^2\) and \(n/(n-k) \rightarrow 1\) as \(n \rightarrow \infty\).

In other words:

\[ plim(\tilde{\sigma}^2)= plim(\frac{\hat{\epsilon}'\hat{\epsilon}}{n}) =plim(\frac{\epsilon'\epsilon}{n}) = \sigma^2_0 \]

We can show that \(\hat{\sigma}^2\) is also a consistent estimator of \(\sigma^2_0\) as:

\[ plim(\hat{\sigma}^2)= plim(\frac{\hat{\epsilon}'\hat{\epsilon}}{n-k}) \]

because \(n/(n-k) \rightarrow 1\) as \(n\rightarrow\infty\) and, by the Weak Law of Large Numbers (as shown above), \(plim(\hat{\epsilon}'\hat{\epsilon}/n)=\sigma^2_0\):

\[ plim(\hat{\sigma}^2)= plim\left(\frac{n}{n-k}\cdot\frac{\hat{\epsilon}'\hat{\epsilon}}{n}\right) = \lim_{n \rightarrow \infty}\left(\frac{n}{n-k}\right)\cdot plim\left(\frac{\hat{\epsilon}'\hat{\epsilon}}{n}\right) = 1\cdot\sigma^2_0 = \sigma^2_0 \]
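A simple way to visualise both consistency results (again only a sketch, under a hypothetical DGP with \(\sigma^2_0 = 64\)) is to compute \(\tilde{\sigma}^2\) and \(\hat{\sigma}^2\) for increasing sample sizes and check that both approach 64:

%Sketch: sigma_tilde^2 = e_hat'e_hat/n and sigma_hat^2 = e_hat'e_hat/(n-k) for growing n
sigma2 = 64; k = 3;
for n = [100 1000 10000 100000]
    X = [ones(n,1) unifrnd(0,100,[n 1]) unifrnd(0,90,[n 1])];
    e = sqrt(sigma2).*randn(n,1);
    y = 2 + 2*X(:,2) + 2*X(:,3) + e;
    e_hat = y - X*((X'*X)\(X'*y));
    fprintf('n = %6d: sigma_tilde^2 = %7.3f  sigma_hat^2 = %7.3f\n', ...
        n, (e_hat'*e_hat)/n, (e_hat'*e_hat)/(n-k));
end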