I. Gauss-Markov Conditions
There are five assumptions in the Gauss-Markov theorem, which together ensure that the coefficient estimates are BLUE, an acronym for “Best Linear Unbiased Estimator.” The five assumptions are:
1. Random Sample
2. Linear in Parameters
3. Zero Conditional Mean
4. No Perfect Collinearity
5. Homoskedasticity
The random sample assumption is necessary so our estimates are not skewed in one direction. For example, if we investigated the impact of exercise on cardiovascular health but our sample included only unhealthy individuals who do not exercise, adding training to their daily routine would have a much larger effect on them than on the rest of the population, potentially biasing the estimates.
The second assumption requires our regression equation to be linear in parameters. This means that every \(\beta\) enters the equation linearly, that is, raised to the first power and not wrapped inside any nonlinear function. The variables themselves do not have to be linear (they may be squared, logged, and so on), but the coefficients must be.
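As a small illustrative sketch (using the built-in iris data that appears later in this report), the model below is linear in parameters even though it includes a squared regressor, while a model with a coefficient inside a nonlinear function could not be estimated by lm():
# Linear in parameters: each beta enters linearly, even though one of the
# regressors is a squared term; this is still estimable by OLS.
quad_model <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
coef(quad_model)

# Not linear in parameters: a model such as y = b0 + exp(b1 * x) + u places a
# coefficient inside a nonlinear function and would require nls() instead of lm().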
The zero conditional mean assumption is perhaps the most important one. It dictates that the explanatory variables we use cannot be correlated with factors left outside the model (and therefore captured by the error term) that also affect our dependent variable. For example, if we ran a regression such as:
\(FutureIncome_{i} = \beta_{0} + \beta_{1} Education_{i} + \epsilon_{i}\)
Investigating the impact of education on future earnings, we might expect a significant positive relationship. However, the effect we observe might be more than just that of education. Wealthier people are more likely to obtain more education thanks to family support and resources such as tutoring, and they are also more likely to have higher future earnings, since their families tend to have better connections and a better understanding of the labor market. This means that our \(\beta_{1}\) coefficient is biased upwards in this specific regression, since we are picking up the effect of family wealth as well as education. This correlation between education and the error term violates the zero conditional mean assumption, which can be mathematically expressed as:
\(E(u|x)=0\)
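To make the direction of this bias concrete, here is a small simulation sketch; the variable names and coefficient values are invented purely for illustration. Family wealth raises both education and income, and omitting it pushes the education estimate upward:
set.seed(42)
n <- 5000
wealth    <- rnorm(n)                                       # unobserved family wealth
education <- 1 + 0.5 * wealth + rnorm(n)                    # wealth raises education
income    <- 2 + 1.0 * education + 2.0 * wealth + rnorm(n)  # true education effect = 1

coef(lm(income ~ education))             # wealth omitted: estimate biased above 1
coef(lm(income ~ education + wealth))    # wealth controlled for: estimate close to 1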
The fourth condition is the no perfect collinearity assumption. In plain terms, this means that no variable in the econometric model can be written as an exact linear combination of the other variables (including the constant). A standard example would be including dummy variables for both genetic sexes:
\(PsychosisRisk_{i} = \beta_{0} + \beta_{1} THCExposure_{i} + \beta_{2} Female_{i} + \beta_{3} Male_{i} + \epsilon_{i}\)
Here \(Female_{i} + Male_{i} = 1\) for every observation, so either dummy is an exact linear combination of the intercept and the other dummy. The issue that arises from perfect collinearity is that the columns of the design matrix are linearly dependent. This makes the determinant of \(X'X\) equal to zero, so the inverse of the matrix cannot be calculated and, with that, no \(\beta\) estimates can be obtained.
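A quick way to see this in R (the data below are simulated, since the psychosis example above is only hypothetical): when the two dummies always sum to one, lm() cannot estimate both of them alongside the intercept, and the \(X'X\) matrix cannot be inverted.
set.seed(1)
n <- 100
female <- rbinom(n, 1, 0.5)
male   <- 1 - female                 # female + male = 1: perfect collinearity
y      <- 3 + 0.5 * female + rnorm(n)

lm(y ~ female + male)                # lm() drops 'male' and reports NA for it

X <- cbind(1, female, male)          # design matrix including the intercept
det(t(X) %*% X)                      # (numerically) zero: X'X has no inverse
# solve(t(X) %*% X)                  # would fail: the matrix is singular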
The final condition is homoskedasticity. This condition requires that the variance of the error term, conditional on the independent variables, is the same across all observations. If that is not the case, we are dealing with heteroskedasticity, where the variance of the error term changes with the values of the independent variables. The mathematical expression for homoskedasticity is:
\(Var(u|x)=\sigma^2\)
Unlike the previous conditions, homoskedasticity is not required to avoid bias in the model; it only affects efficiency, that is, the precision of the estimates.
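The contrast can be sketched with simulated data (the numbers are purely illustrative): under homoskedasticity the spread of the errors is the same at every value of x, while under heteroskedasticity it changes with x.
set.seed(7)
n <- 500
x <- runif(n, 0, 10)

u_homo   <- rnorm(n, sd = 1)         # Var(u|x) = 1 for every x (homoskedastic)
u_hetero <- rnorm(n, sd = 0.3 * x)   # Var(u|x) grows with x (heteroskedastic)

par(mfrow = c(1, 2))
plot(x, u_homo,   main = "Homoskedastic errors",   ylab = "u")
plot(x, u_hetero, main = "Heteroskedastic errors", ylab = "u")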
II. Regressions
The regression I will be running is:
\(SepalLength_{i} = \beta_{0} + \beta_{1} SepalWidth_{i} + \epsilon_{i}\)
My dependent variable is the sepal length of iris flowers, and my independent variable is sepal width.
library(AER)
library(sjPlot)       # tab_model() regression tables
library(sjmisc)
library(sjlabelled)

# Simple regression: sepal length on sepal width
my_reg <- lm(Sepal.Length ~ Sepal.Width, data = iris)
tab_model(my_reg)
Dependent variable: Sepal.Length

| Predictors       | Estimates     | CI           | p      |
|------------------|---------------|--------------|--------|
| (Intercept)      | 6.53          | 5.58 – 7.47  | <0.001 |
| Sepal Width      | -0.22         | -0.53 – 0.08 | 0.152  |
| Observations     | 150           |              |        |
| R² / R² adjusted | 0.014 / 0.007 |              |        |
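For reference, the estimates, confidence interval, and p-value shown in the table can also be pulled directly from base R:
summary(my_reg)                  # estimates, standard errors, t statistics, p-values
confint(my_reg, level = 0.95)    # the 95% confidence intervals reported in the table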
As we can see, sepal width in this regression has no statistically significant effect on sepal length. With a p-value of 0.152, there is a 15.2% chance of observing an estimate at least this extreme if the null hypothesis is true (in this case, \(H_0: \beta_1 = 0\)). We can also see that the result is not statistically significant because zero is included in the confidence interval for \(\beta_1\). The interpretation of the \(\beta_1\) coefficient would be: for each additional centimeter of sepal width, the sepal length of an iris decreases, on average, by 0.22 centimeters, ceteris paribus (CP). The \(\beta_0\) coefficient shows that when sepal width is 0, sepal length is, on average, 6.53 centimeters, CP. These results feel somewhat counterintuitive, so I decided to run an additional regression with some added controls. The model for this regression is:
\(SepalLength_{i} = \beta_{0} + \beta_{1} SepalWidth_{i} + \beta_{2} PetalLength_{i} + \beta_{3} PetalWidth_{i} + \epsilon_{i}\)
library(stargazer)
library(AER)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
# Multiple regression: add petal length and petal width as controls
OLS <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
tab_model(OLS)
Dependent variable: Sepal.Length

| Predictors       | Estimates     | CI            | p      |
|------------------|---------------|---------------|--------|
| (Intercept)      | 1.86          | 1.36 – 2.35   | <0.001 |
| Sepal Width      | 0.65          | 0.52 – 0.78   | <0.001 |
| Petal Length     | 0.71          | 0.60 – 0.82   | <0.001 |
| Petal Width      | -0.56         | -0.81 – -0.30 | <0.001 |
| Observations     | 150           |               |        |
| R² / R² adjusted | 0.859 / 0.856 |               |        |
We can see that all three of our \(\beta\) coefficients are statistically significant, with p-values below one-tenth of a percent. This is also reflected in the confidence intervals, none of which include 0. Therefore, there is less than a one percent chance of obtaining estimates at least this extreme if the null hypotheses were true (in this case, \(H_0: \beta_n = 0\)).
For the sepal width estimate, we can say that for each
additional centimeter in sepal width, there is an expected increase of
0.65 centimeters in sepal length, CP. For
\(\beta_2\), we can say that for each
additional centimeter in petal length, on average, sepal length
increases by 0.71 centimeters, CP. Finally, for petal width, we see that
for each centimeter increase in petal width, on average, sepal length
decreases by 0.56 centimeters, CP.
\(\beta_0\), or the intercept coefficient, indicates that if all the right-hand-side variables were zero, the sepal length of the flower would be, on average, 1.86 centimeters, CP. The interpretation of this model is fairly straightforward because we are using a level-level specification.
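To illustrate the level-level interpretation, we can compare the model's predictions for two hypothetical flowers that differ by exactly one centimeter of sepal width; the petal values held fixed below are arbitrary illustrative choices:
# Two hypothetical flowers that differ only by 1 cm of sepal width
new_flowers <- data.frame(
  Sepal.Width  = c(3.0, 4.0),
  Petal.Length = c(4.0, 4.0),
  Petal.Width  = c(1.3, 1.3)
)

preds <- predict(OLS, newdata = new_flowers)
diff(preds)   # about 0.65 cm, matching the sepal width coefficient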
III. Residual Analysis
library(stargazer)
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(color = "blue") +
  geom_smooth(aes(color = "Trend Line"), se = FALSE, method = "lm") +
  scale_color_manual(values = c("Trend Line" = "red")) +
  labs(y = "Sepal Length (cm)", x = "Sepal Width (cm)",
       title = "The relationship of sepal width and sepal length of irises",
       color = "Legend")
There are four graphs we use for residual analysis:
my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=1)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=1)
The first graph plots the residuals of our model against its fitted values, which is mainly a check of the model's fit and of the variability of the residuals. The goal is for the red line to stay as flat and as close to zero as possible. If a pattern emerges, it can indicate non-linearity or variance issues, in which case we might have to transform the model in some way. As we can see in the second version of graph one, which uses the second model with controls, the pattern visible in the simple OLS model is now gone, and the curve is much closer to zero than before.
my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=2)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=2)
The second residual analysis graph is a quantile-quantile (Q-Q) plot. This plot compares the distribution of our residuals to a theoretical normal distribution. If the points lie on the theoretical line, the residuals are likely normally distributed. We can see that this is roughly the case in our second graph, while there are some slight deviations in the first graph without controls. The plot also gives a sense of the variance of the sample: if the points follow a different slope than the theoretical line, the variance is either larger or smaller than that of the theoretical distribution, though this does not seem to be the case in our graphs. The plot can also reveal outliers and skewness, which do not seem to be a major concern in our current sample.
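If we want a formal check to accompany the Q-Q plots, one common option is the Shapiro-Wilk test on the residuals (available in base R); a small p-value suggests the residuals deviate from normality.
shapiro.test(residuals(my_reg))   # simple model
shapiro.test(residuals(OLS))      # model with controls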
my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=3)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=3)
The scale-location graph is used to assess homoskedasticity. If the line is roughly flat, it indicates that the variance of the residuals is the same everywhere. If there is a pattern in the line, however, we are likely dealing with heteroskedasticity. We see in the first graph that the simple regression model has some heteroskedasticity issues; after including controls, the problem becomes much more contained.
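The visual impression from the scale-location plots can be backed up with a Breusch-Pagan test from the lmtest package; the null hypothesis is homoskedasticity, so a small p-value points to heteroskedasticity.
library(lmtest)   # provides bptest()

bptest(my_reg)    # simple model
bptest(OLS)       # model with controls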
my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=4)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=4)
The final plot we use is the Cook's distance plot (which = 4), which flags observations with a large influence on our estimates. We see that in both graphs there are a few influential observations. If this poses any issues, it might be wise to remove these observations, or at least investigate why they have such a strong effect on our estimates.
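To see which observations the plot is flagging, one common rule of thumb (a convention rather than a strict rule) is to look at points whose Cook's distance exceeds 4/n:
cd <- cooks.distance(OLS)
which(cd > 4 / nrow(iris))          # observations above the common 4/n cutoff
head(sort(cd, decreasing = TRUE))   # the most influential observations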
Our model with no controls clearly has some issues: the residuals vs. fitted plot shows problems with the model's fit, pointing to possible misspecification or omitted variables; the Q-Q plot indicates issues with normality; and the scale-location plot shows possible heteroskedasticity. After the inclusion of controls, these problems are much smaller. To further improve the model, it might be beneficial to log transform it, which can improve the fit of the line and reduce the influence of some of the outliers.
# Log-log version of the simple model
my_regLOG <- lm(log(Sepal.Length) ~ log(Sepal.Width), data = iris)
par(mfrow = c(2, 2))   # show all four diagnostic plots together
plot(my_regLOG)

tab_model(my_regLOG)
Dependent variable: log(Sepal.Length)

| Predictors        | Estimates     | CI           | p      |
|-------------------|---------------|--------------|--------|
| (Intercept)       | 1.88          | 1.70 – 2.06  | <0.001 |
| Sepal Width [log] | -0.11         | -0.27 – 0.05 | 0.174  |
| Observations      | 150           |              |        |
| R² / R² adjusted  | 0.012 / 0.006 |              |        |
# Log-log version of the model with controls
OLSLOG <- lm(log(Sepal.Length) ~ log(Sepal.Width) + log(Petal.Length) + log(Petal.Width), data = iris)
par(mfrow = c(2, 2))
plot(OLSLOG)

tab_model(OLSLOG)
Dependent variable: log(Sepal.Length)

| Predictors         | Estimates     | CI           | p      |
|--------------------|---------------|--------------|--------|
| (Intercept)        | 1.01          | 0.88 – 1.14  | <0.001 |
| Sepal Width [log]  | 0.37          | 0.29 – 0.44  | <0.001 |
| Petal Length [log] | 0.29          | 0.22 – 0.35  | <0.001 |
| Petal Width [log]  | -0.03         | -0.07 – 0.01 | 0.155  |
| Observations       | 150           |              |        |
| R² / R² adjusted   | 0.816 / 0.812 |              |        |
While the changes are not large, the diagnostic plots suggest the fit of both models has improved somewhat after log transforming the data, even though the reported R² values are slightly lower. We now only have to be careful when interpreting the results, since the regressions are log-log specified: each coefficient is an elasticity, giving the approximate percent change in sepal length associated with a one percent change in the regressor, CP.
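As a quick sanity check on the elasticity interpretation, the fitted model can be used to compare predictions at a baseline sepal width and at a width one percent larger; the baseline values below are arbitrary illustrative choices.
# Predicted log(Sepal.Length) at a baseline sepal width and at a width 1% larger,
# holding the other regressors fixed at arbitrary illustrative values
base <- data.frame(Sepal.Width = 3.00, Petal.Length = 4.0, Petal.Width = 1.3)
plus <- data.frame(Sepal.Width = 3.03, Petal.Length = 4.0, Petal.Width = 1.3)

100 * (predict(OLSLOG, newdata = plus) - predict(OLSLOG, newdata = base))
# roughly 0.37: a 1% increase in sepal width predicts about a 0.37% longer sepal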