What are the properties of the distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) over different random samples from the population?
What are the expected values and variances of OLS estimators?
We will first examine finite sample properties: unbiasedness and efficiency. These are valid for any sample size n.
Recall that unbiasedness means that the mean of the sampling distribution of an estimator is equal to the unknown parameter value.
Efficiency is related to the variance of the estimators.
An estimator is said to be efficient if its variance is the smallest among a set of unbiased estimators.
We need the following assumptions for unbiasedness:
(SLR.1) Model is linear in parameters: \(y = \beta_0 + \beta_1x + u\)
(SLR.2) Random sampling: we have a random sample from the target population.
(SLR.3) Sample variation in the explanatory variable: the variance of \(x\) must not be zero, i.e., \(\sum_{i=1}^n (x_i - \overline{x})^2 > 0\)
(SLR.4) Zero conditional mean: \(E(u|x) = 0\). Since we have a random sample we can write:
\[ E(u_i|x_i) = 0, \; \forall i = 1,2,\cdots,n \]
THEOREM:
If all SLR.1-SLR.4 assumptions hold then OLS estimators are unbiased:
\[\begin{align} E(\hat{\beta}_0) &= \beta_0 \notag \\ E(\hat{\beta}_1) &= \beta_1 \notag \end{align}\]
Proof: (see Wooldridge, pp 43-44)
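The key step in the proof is to rewrite \(\hat{\beta}_1\) as the true parameter plus a weighted sum of the errors (a sketch; see Wooldridge for the full argument):
\[\begin{align} \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \overline{x})y_i}{\sum_{i=1}^n (x_i - \overline{x})^2} = \beta_1 + \frac{\sum_{i=1}^n (x_i - \overline{x})u_i}{\sum_{i=1}^n (x_i - \overline{x})^2} \notag \end{align}\]
Taking expectations conditional on the sample values of \(x\), SLR.4 implies \(E(u_i|x) = 0\) for every \(i\), so the second term has expectation zero and \(E(\hat{\beta}_1) = \beta_1\).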
Unbiasedness is a feature of the sampling distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that are obtained via repeated random sampling.
As such, it does not say anything about the estimate that we obtain for a given sample: it is always possible to obtain an estimate that is far from the true value.
Unbiasedness generally fails if any of the assumptions SLR.1-SLR.4 fails.
SLR.2 needs to be relaxed for time series data, but there are also ways it can fail to hold in cross-sectional data.
If SLR.4 fails then the OLS estimators will generally be biased. This is the most important issue in nonexperimental data: whenever \(x\) and \(u\) are correlated, the estimators are biased.
Spurious correlation: we find a relationship between \(y\) and \(x\) that is really due to other unobserved factors that affect \(y\).
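A quick way to see this bias is to simulate data in which \(u\) is correlated with \(x\). In the illustrative sketch below (names and numbers are hypothetical), the error contains a component of \(x\), so \(E(u|x) \neq 0\) and the slope estimate is pushed away from its true value of 0.5:
# illustrative sketch: bias when x and u are correlated (SLR.4 fails)
set.seed(12345)
n <- 1000
x <- 10*runif(n)
u <- 0.8*(x - mean(x)) + rnorm(n)  # error depends on x: E(u|x) != 0
y <- 1 + 0.5*x + u
coefficients( lm(y~x) )  # slope is close to 0.5 + 0.8 = 1.3, not 0.5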
Population model (DGP, Data Generating Process): \[ y = 1 + 0.5x + 2 \times N(0,1) \]
True parameter values are known: \(\beta_0 = 1\), \(\beta_1 = 0.5\), \(u = 2 \times N(0,1)\) (what is the variance of \(u\)?). Here \(N(0,1)\) represents a random draw from the standard normal distribution.
The values of \(x\) are drawn from the uniform distribution: \(x \sim 10 \times Unif(0,1)\)
Using random numbers we can generate artificial data sets. Then, for each data set we can apply the OLS method to find estimates.
After repeating these steps many times, say 1000, we would obtain 1000 slope and intercept estimates. Then we can analyze the sampling distribution of these estimates.
This is a simple example of Monte Carlo simulation experiment. These experiments may be useful in analyzing properties of estimators.
# Set the random seed
# So that we will obtain the same results
# Otherwise, simulation results will change
set.seed(1234567)
# set sample size
n <- 50
# the number of simulations
MCreps <- 10000
# set true parameters: betas and standard deviation of u
beta0 <- 1
beta1 <- 0.5
su <- 2
# initialize b0hat and b1hat to store results later:
b0hat <- numeric(MCreps)
b1hat <- numeric(MCreps)
# Draw a sample of x
# this is going to be fixed in repeated samples
x <- 10*runif(n,0,1)
# repeat MCreps times:
for(i in 1:MCreps) {
  # Draw a sample of y:
  u <- rnorm(n,0,su)
  y <- beta0 + beta1*x + u
  # estimate parameters by OLS and store them in the vectors
  bhat <- coefficients( lm(y~x) )
  b0hat[i] <- bhat["(Intercept)"]
  b1hat[i] <- bhat["x"]
}
# draw histogram and summary statistics
hist(b0hat)
summary(b0hat)
mean(b0hat)
sd(b0hat)
hist(b1hat)
summary(b1hat)
mean(b1hat)
sd(b1hat)
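Since the true parameter values are known in the simulation, the bias can be estimated directly by comparing the Monte Carlo averages with \(\beta_0 = 1\) and \(\beta_1 = 0.5\); under unbiasedness both differences should be close to zero:
# estimated bias: Monte Carlo average minus the true value
mean(b0hat) - beta0
mean(b1hat) - beta1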
# histogram of b1hat with a kernel density overlay
hist(b1hat,
     freq = FALSE,
     breaks = seq(0, 1, 0.025),
     axes = FALSE,
     main = expression("Sampling Distribution of b1hat"))
axis(1, at = seq(0, 1, 0.1), labels = TRUE, pos = 0)
axis(2, pos = 0)
lines(density(b1hat), lwd = 2, col = "blue")
# histogram of b0hat with a kernel density overlay
hist(b0hat,
     freq = FALSE,
     breaks = seq(-2, 4, 0.1),
     axes = FALSE,
     main = "Sampling Distribution of b0hat")
axis(1, at = seq(-1, 3, 1), labels = TRUE, pos = 0)
axis(2, pos = -2)
lines(density(b0hat), lwd = 2, col = "blue")
Unbiasedness of the OLS estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) is a feature of the center of the sampling distributions.
We should also know how far we can expect \(\hat{\beta}_1\) to be away from \(\beta_1\) on average.
In other words, we should know the sampling variation in OLS estimators in order to establish efficiency and to calculate standard errors.
SLR.5: Homoscedasticity (constant variance assumption): This says that the variance of \(u\) conditional on \(x\) is constant, \(var(u|x) = var(u) = \sigma^2\)
Assumptions SLR.4 and SLR.5 can be rewritten in terms of the conditional mean and variance of \(y\):
\[\begin{align} E(y|x) &= \beta_0 + \beta_1 x \notag \\ var(y|x) &= \sigma^2 \notag \end{align}\]
(Figure: Simple regression model under homoscedasticity)
(Figure: Simple regression model under heteroscedasticity)
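The difference between the two cases can be visualized with simulated data. A minimal sketch (variable names are illustrative): in the first panel the error variance is constant, in the second it grows with \(x\):
# illustrative sketch: homoscedastic vs. heteroscedastic errors
set.seed(123)
ns <- 200
xs <- 10*runif(ns)
y_hom <- 1 + 0.5*xs + rnorm(ns, 0, 2)       # var(u|x) constant
y_het <- 1 + 0.5*xs + rnorm(ns, 0, 0.5*xs)  # sd(u|x) increases with x
par(mfrow = c(1,2))
plot(xs, y_hom, main = "Homoscedastic")
abline(1, 0.5, col = "blue")
plot(xs, y_het, main = "Heteroscedastic")
abline(1, 0.5, col = "blue")
par(mfrow = c(1,1))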
\[\begin{align} Var(\hat{\beta}_1) &= \frac{\sigma^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \notag \\ \text{and} \notag \\ Var(\hat{\beta}_0) &= \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \overline{x})^2} \notag \end{align}\]
These formulas are not valid under heteroscedasticity (if SLR.5 does not hold).
Sampling variances of OLS estimators increase with the error variance and decrease with the sampling variation in \(x\)
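These formulas can be checked against the Monte Carlo experiment above. With \(x\) held fixed in repeated samples, the theoretical standard deviation of \(\hat{\beta}_1\) is \(\sigma/\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2}\), which should be close to the simulated one:
# theoretical sd of b1hat, using x and su from the simulation above
su/sqrt(sum((x - mean(x))^2))
# compare with the Monte Carlo standard deviation
sd(b1hat)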
We would like to find an unbiased estimator for \(\sigma^2\).
Since by assumption we have \(E(u^2) = \sigma^2\), an unbiased estimator would be the sample average of the squared errors:
\[ \frac{1}{n}\sum_{i=1}^n u_i^2 \]
However, the errors \(u_i\) are unobservable. Replacing them with the OLS residuals \(\hat{u}_i\) gives
\[ \frac{1}{n}\sum_{i=1}^n \hat{u}_i^2 = \frac{SSE}{n} \]
This estimator is biased because the residuals must satisfy two restrictions, \(\sum_{i=1}^n \hat{u}_i = 0\) and \(\sum_{i=1}^n x_i\hat{u}_i = 0\). Correcting for these two degrees of freedom gives the unbiased estimator:
\[ \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n \hat{u}_i^2 = \frac{SSE}{n-2} \]
Its square root is \(\hat{\sigma} = \sqrt{SSE/(n-2)}\). The standard error of the OLS slope estimator can then be written as:
\[ se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2}} = \frac{\hat{\sigma}}{s_x} \]
where \(s_x = \sqrt{\sum_{i=1}^n (x_i - \overline{x})^2}\).
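These quantities can be computed by hand and compared with the output of summary(). A sketch using the last simulated sample (x, y) left over from the Monte Carlo code above:
# manual computation of se(b1hat) for the last simulated sample
fit <- lm(y~x)
uhat <- residuals(fit)                  # OLS residuals
sigmahat <- sqrt(sum(uhat^2)/(n - 2))   # sqrt(SSE/(n-2))
sigmahat/sqrt(sum((x - mean(x))^2))     # se(b1hat) by hand
coef(summary(fit))["x", "Std. Error"]   # should match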
In some rare cases we want \(y = 0\) whenever \(x = 0\). For example, tax revenue is zero whenever income is zero.
We can redefine the simple regression model without the constant term as follows: \(\tilde{y} = \tilde{\beta}_1x\)
Using the OLS principle, we minimize the sum of squared residuals:
\[ \min_{\tilde{\beta}_1} \sum_{i=1}^n (y_i - \tilde{\beta}_1x_i)^2 \]
The first order condition is:
\[ \sum_{i=1}^n x_i (y_i - \tilde{\beta}_1x_i) = 0 \]
Solving this we obtain the OLS estimator of the slope parameter: \[ \tilde{\beta}_1 = \frac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^n x_i^2} \]
For example,
# Regression through the origin
# load the ceosal1 data set (available in the wooldridge package)
data(ceosal1, package = "wooldridge")
res1 <- lm(salary ~ 0 + roe, data = ceosal1)
summary(res1)
##
## Call:
## lm(formula = salary ~ 0 + roe, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1697.4 -309.1 -34.3 459.2 13589.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## roe 63.538 5.156 12.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1429 on 208 degrees of freedom
## Multiple R-squared: 0.422, Adjusted R-squared: 0.4193
## F-statistic: 151.9 on 1 and 208 DF, p-value: < 2.2e-16
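The reported coefficient can be reproduced directly from the closed-form expression \(\tilde{\beta}_1 = \sum_{i=1}^n x_iy_i / \sum_{i=1}^n x_i^2\):
# closed-form slope for the regression through the origin
with(ceosal1, sum(roe*salary)/sum(roe^2))  # matches the estimate above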
# Regression on a constant
res2 <- lm(salary ~ 1, data = ceosal1)
summary(res2)
##
## Call:
## lm(formula = salary ~ 1, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1058.1 -545.1 -242.1 125.9 13540.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1281.12 94.93 13.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1372 on 208 degrees of freedom
# Full SLR
res3 <- lm(salary ~ roe, data = ceosal1)
summary(res3)
##
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.2 -526.0 -254.0 138.8 13499.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 963.19 213.24 4.517 1.05e-05 ***
## roe 18.50 11.12 1.663 0.0978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared: 0.01319, Adjusted R-squared: 0.008421
## F-statistic: 2.767 on 1 and 207 DF, p-value: 0.09777
# scatter plot with the three fitted lines
plot(x = ceosal1$roe,
     y = ceosal1$salary,
     ylim = c(0, 4000),
     xlab = "Return on equity",
     ylab = "CEO salary")
abline(res1, col = "blue")   # regression through the origin
abline(res2, col = "red")    # constant only
abline(res3, col = "black")  # full SLR
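To tell the three fitted lines apart, a legend can be added (a minimal addition):
# label the three fitted lines
legend("topleft",
       legend = c("through origin", "constant only", "full SLR"),
       col = c("blue", "red", "black"), lty = 1)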