I. Gauss-Markov Conditions

      There are five assumptions in the Gauss-Markov theorem, which together ensure that the OLS coefficient estimates are BLUE, an acronym for “Best Linear Unbiased Estimator.” The five assumptions are:

1. Random Sample
2. Linear in Parameters
3. Zero-Conditional Mean
4. No Perfect Collinearity
5. Homoskedasticity

      The random sample assumption is necessary so that our estimates are not systematically pushed in one direction. For example, were we to investigate the impact of exercise on cardiovascular health, but our sample only included unhealthy individuals who do not exercise, adding training to their daily routine would likely have a much larger effect on them than on the rest of the population, potentially biasing our estimates.

      The second assumption requires the regression equation to be linear in parameters. This means that every \(\beta\) enters the equation raised to the power of one. The variables themselves do not have to enter linearly; it is the coefficients that must, as illustrated in the sketch below.
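      As a small sketch of the distinction (using R's built-in iris data, which also appears later in this paper), the two models below are nonlinear in the variables but still linear in the parameters, so lm() can estimate them; a model where a coefficient appears as an exponent cannot be handled by OLS.

# Linear in parameters: every beta enters with power one, even though the
# regressors themselves are transformed.
quadratic  <- lm(Sepal.Length ~ Sepal.Width + I(Sepal.Width^2), data = iris)
log_linear <- lm(Sepal.Length ~ log(Sepal.Width), data = iris)

# By contrast, a model such as  y = b0 + x^b1 + u  is nonlinear in b1,
# so it cannot be estimated by OLS / lm(); it would require e.g. nls().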
      The zero conditional mean (Z.C.M.) assumption is perhaps the most important one. It dictates that the regressors we use cannot be correlated with factors outside the model that also affect the dependent variable. For example, suppose we ran a regression such as:
\(Future\,Income_{i} = \beta_{0} + \beta_{1}Education_{i} + \epsilon_{i}\)

      Investigating the impact of education on future earnings, we might expect a significant positive relationship. However, the effect we observe might reflect more than just education. Wealthier people are more likely to obtain more education thanks to family support and resources such as tutoring. They are also more likely to have higher future earnings, since their families tend to have better connections and a better understanding of the labor market. This means that our \(\beta_{1}\) coefficient is biased upwards in this regression, since we are picking up the effect of family wealth as well as education. Such a correlation violates the Z.C.M. assumption, which can be expressed mathematically as:
\(E(u|x)=0\)
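      To make the bias concrete, here is a minimal simulation of the education/earnings story; all variable names and magnitudes are invented purely for illustration.

set.seed(42)
n <- 5000
family_wealth <- rnorm(n)                            # the omitted factor
education     <- 2 + 0.8 * family_wealth + rnorm(n)  # correlated with wealth
income        <- 10 + 1.5 * education + 2 * family_wealth + rnorm(n)

coef(lm(income ~ education))                   # slope is biased above the true 1.5
coef(lm(income ~ education + family_wealth))   # controlling for wealth recovers ~1.5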

      The fourth condition is the no perfect collinearity assumption. In plain terms, this means that no variable in the model can be an exact linear combination of the other variables (including the constant). A standard example is including indicators for both genetic sexes alongside the intercept:
\(Psychosis\,Risk_{i} = \beta_{0} + \beta_{1}THCExposure_{i} + \beta_{2}Female_{i} + \beta_{3}Male_{i} + \epsilon_{i}\)

      The issue with perfect collinearity is that the columns of the design matrix are linearly dependent. The determinant of \(X'X\) is then zero, so its inverse cannot be computed, and since the OLS estimator is \(\hat{\beta} = (X'X)^{-1}X'y\), no \(\beta\) estimate can be obtained.
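      A quick sketch of this dummy-variable trap with simulated data (all numbers invented): because the Female and Male indicators always sum to the intercept column, the cross-product matrix is singular and lm() has to drop one of them.

set.seed(1)
n      <- 100
female <- rbinom(n, 1, 0.5)
male   <- 1 - female
x      <- rnorm(n)
y      <- 1 + 0.5 * x + 0.3 * female + rnorm(n)

X <- cbind(intercept = 1, x, female, male)
det(t(X) %*% X)                    # numerically zero: the inverse of X'X does not exist

coef(lm(y ~ x + female + male))    # lm() returns NA for the redundant dummy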
      The final condition is homoskedasticity. This condition requires that the variance of the error term is the same across all values of the independent variables. If that is not the case, we have heteroskedasticity, meaning the variance of the error term changes with the regressors. The homoskedasticity assumption can be expressed mathematically as:
\(Var(u|x) = \sigma^2\)

Unlike the previous conditions, homoskedasticity is not required to avoid bias in the coefficient estimates. It is only related to efficiency, that is, to how precisely the coefficients are estimated and whether the usual standard errors are valid.
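      A small simulation (all numbers invented) of this point: the error variance below grows with x, yet the average of the estimated slopes stays at the true value, so heteroskedasticity affects precision and standard errors rather than bias.

set.seed(123)
true_slope <- 2
slopes <- replicate(2000, {
  x <- runif(200)
  u <- rnorm(200, sd = 2 * x)      # Var(u|x) increases with x: heteroskedastic
  y <- 1 + true_slope * x + u
  coef(lm(y ~ x))["x"]
})
mean(slopes)                       # close to the true value of 2, i.e. no bias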

II. Regressions

      The regression I will be running is:
\(Sepal\,Length_{i} = \beta_{0} + \beta_{1}SepalWidth_{i} + \epsilon_{i}\)


      My dependent variable is the sepal length of iris flowers, and my independent variable is their sepal width.
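      For context, the iris data set ships with base R, and all four flower measurements are recorded in centimeters.

str(iris)     # 150 observations: four numeric measurements plus the species factor
head(iris)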

library(AER)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
# Simple regression of sepal length on sepal width, using R's built-in iris data
my_reg <- lm(Sepal.Length ~ Sepal.Width, data = iris)
tab_model(my_reg)
Dependent variable: Sepal.Length
Predictors          Estimates   95% CI           p
(Intercept)         6.53        5.58 – 7.47      <0.001
Sepal Width         -0.22       -0.53 – 0.08     0.152
Observations        150
R² / R² adjusted    0.014 / 0.007
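      Before interpreting the table, its numbers can be reproduced directly from the fitted object with standard extractor functions:

summary(my_reg)$coefficients    # estimates, standard errors, t statistics, p values
confint(my_reg, level = 0.95)   # the 95% confidence intervals shown above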

      As we can see, sepal width has no statistically significant effect on sepal length in this regression. With a p-value of 0.152, there would be a 15.2% chance of obtaining an estimate at least this far from zero if the null hypothesis were true (in this case, \(H_0: \beta_1 = 0\)). The lack of significance is also visible in the 95% confidence interval for \(\beta_1\), which includes zero. The interpretation of the \(\beta_1\) coefficient is that for each additional centimeter of sepal width, the sepal length of an iris decreases, on average, by 0.22 centimeters, ceteris paribus (CP). The \(\beta_0\) coefficient indicates that when sepal width is 0, the average sepal length is 6.53 centimeters, CP. These results feel somewhat counterintuitive, so I decided to run an additional regression with some added controls. The model for this regression is:
\(Sepal\,Length_{i} = \beta_{0} + \beta_{1}SepalWidth_{i} + \beta_{2}PetalLength_{i} + \beta_{3}PetalWidth_{i} + \epsilon_{i}\)


library(stargazer)
library(AER)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
# Multiple regression: add petal length and petal width as controls
OLS <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
tab_model(OLS)
Dependent variable: Sepal.Length
Predictors          Estimates   95% CI            p
(Intercept)         1.86        1.36 – 2.35       <0.001
Sepal Width         0.65        0.52 – 0.78       <0.001
Petal Length        0.71        0.60 – 0.82       <0.001
Petal Width         -0.56       -0.81 – -0.30     <0.001
Observations        150
R² / R² adjusted    0.859 / 0.856

      We can see that all three of our \(\beta\) coefficients are statistically significant, with p-values below one-tenth of a percent. This is consistent with none of the 95% confidence intervals including zero. In other words, were the null hypotheses true (in this case, \(H_0: \beta_n = 0\)), the probability of obtaining estimates at least this extreme would be less than 0.1 percent.
      For the sepal width estimate, we can say that for each additional centimeter in sepal width, sepal length is expected to increase by 0.65 centimeters, CP. For \(\beta_2\), each additional centimeter in petal length is associated, on average, with a 0.71 centimeter increase in sepal length, CP. Finally, for petal width, each additional centimeter in petal width is associated, on average, with a 0.56 centimeter decrease in sepal length, CP. \(\beta_0\), the intercept, indicates that were all the right-hand-side variables equal to zero, sepal length would be, on average, 1.86 centimeters, CP. The interpretation of this model is straightforward because we are using a level-level specification.
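      One way to read the level-level coefficients together is through a prediction; the measurement values below are made up, chosen only to lie inside the sample range.

new_flower <- data.frame(Sepal.Width = 3.0, Petal.Length = 4.0, Petal.Width = 1.3)
predict(OLS, newdata = new_flower, interval = "confidence")   # predicted sepal length in cm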

III. Residual Analysis


library(ggplot2)

# Scatter plot of the raw data with the fitted OLS line overlaid
ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(color = "blue") +
  geom_smooth(aes(color = "Trend Line"), method = "lm", se = FALSE) +
  scale_color_manual(values = c("Trend Line" = "red")) +
  labs(x = "Sepal Width (cm)", y = "Sepal Length (cm)",
       title = "The relationship of sepal width and sepal length of irises",
       color = "Legend")


      There are four graphs we use for residual analysis:

my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=1)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=1)


      The first graph plots the residuals against the fitted values of the model. It is used to check whether the residuals are centered on zero across the range of fitted values. The goal is for the red line to stay flat and as close to zero as possible. If a clear pattern emerges, it can indicate misspecification, in which case we might have to transform the model or add controls. As we can see in the second version of graph one, which uses the model with controls, the pattern visible in the simple OLS version is gone, and the curve is much closer to zero than before.

my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=2)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=2)


      The second residual analysis graph is a quantile-quantile (Q-Q) plot. This plot compares the distribution of the residuals to a theoretical normal distribution. If the points lie on the theoretical line, the residuals are likely close to normally distributed. That is the case in our second graph, while the first graph, without controls, shows some slight deviations. The plot also says something about variance: if the points follow a steeper or flatter slope than the theoretical line, the residuals have a larger or smaller variance than the theoretical distribution, though that does not seem to be the case in our graphs. Finally, the plot can reveal outliers and skewness, neither of which appears to be a major issue in our sample.
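      As a formal complement to eyeballing the Q-Q plots, one option is the Shapiro-Wilk test on each model's residuals; a small p-value would point to departures from normality.

shapiro.test(resid(my_reg))   # residuals of the simple model
shapiro.test(resid(OLS))      # residuals of the model with controls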

my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=3)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=3)


      The scale-location graph is used to assess homoskedasticity. If the red line is roughly flat, the variance of the residuals is about the same across fitted values. If there is a clear pattern in the line, we are likely dealing with heteroskedasticity. We see in the first graph that the simple regression model has some heteroskedasticity issues; after including controls, however, the problem is much more contained.
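      The visual impression from the scale-location plots can be checked with a Breusch-Pagan test from the lmtest package (already loaded alongside AER above); a small p-value indicates heteroskedasticity.

library(lmtest)   # repeated here so the chunk is self-contained
bptest(my_reg)
bptest(OLS)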

my_reg <- lm(Sepal.Length~Sepal.Width, data = iris)
plot(my_reg,which=4)

OLS <- lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data = iris)
plot(OLS,which=4)


      The final plot we use looks at influence. With which = 4, R draws the Cook's distance of each observation, which, much like the residuals-vs.-leverage plot, flags observations that have a large influence on the estimates. We see that in both graphs there are a few such observations. If this poses any issues, it might be wise to investigate why they have such a large effect on our estimates, and possibly to re-estimate the model without them.
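      To see which observations the influence plots are flagging, one can list the largest Cook's distances directly; the 4/n cutoff below is a common but arbitrary rule of thumb.

cooks <- cooks.distance(OLS)
sort(cooks, decreasing = TRUE)[1:5]      # the five most influential observations
which(cooks > 4 / nrow(iris))            # observations above the 4/n rule of thumb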
      Our model with no controls clearly has some issues. The residuals vs. fitted values graph points to problems with the model's fit, suggesting possible endogeneity from omitted variables; the Q-Q plot indicates issues with normality; and the scale-location graph shows some possible heteroskedasticity. After the inclusion of controls, however, these problems are much smaller.
      To further improve the model, it might be beneficial to log transform both specifications to improve the fit of the line and reduce the influence of some outliers.

my_regLOG <- lm(log(Sepal.Length) ~ log(Sepal.Width), data = iris)
par(mfrow = c(2, 2))   # arrange all four diagnostic plots in one panel
plot(my_regLOG)

tab_model(my_regLOG)
Dependent variable: log(Sepal.Length)
Predictors          Estimates   95% CI           p
(Intercept)         1.88        1.70 – 2.06      <0.001
Sepal Width [log]   -0.11       -0.27 – 0.05     0.174
Observations        150
R² / R² adjusted    0.012 / 0.006

OLSLOG <- lm(log(Sepal.Length)~log(Sepal.Width)+log(Petal.Length)+log(Petal.Width), data = iris)
par(mfrow = c(2, 2))
plot(OLSLOG)

tab_model(OLSLOG)
Dependent variable: log(Sepal.Length)
Predictors          Estimates   95% CI           p
(Intercept)         1.01        0.88 – 1.14      <0.001
Sepal Width [log]   0.37        0.29 – 0.44      <0.001
Petal Length [log]  0.29        0.22 – 0.35      <0.001
Petal Width [log]   -0.03       -0.07 – 0.01     0.155
Observations        150
R² / R² adjusted    0.816 / 0.812


      While the changes are not large, the residual diagnostics improve somewhat after log transforming the data; note that the R² values are not directly comparable with the level models, since the dependent variable is now measured in logs. We only need to be careful when interpreting the results, since the regression is now log-log specified: each coefficient is an elasticity. In the model with controls, for example, a one percent increase in sepal width is associated with roughly a 0.37 percent increase in sepal length, CP.