Gauss-Markov assumptions described

- Linearity: in a single-variable regression, the model should be linear in both the parameters (β0 and β1) and the variables (Y and X).
- Full column rank: there should be no exact collinearity among the predictor variables, i.e. the independent variables.
- Constant variance, or homoscedasticity: the error term has the same variance across all levels of the independent variables.
- Normality: disturbances, or errors, should be normally distributed.
- No autocorrelation: error terms should not be correlated with each other.
- Zero conditional mean of error: the expected value of the error term is zero given any set of independent variables.
- Randomness: the data used to build the model must come from a random sample of the population.
Gauss-Markov in plain English

The Gauss-Markov assumptions are a set of criteria which, if met, indicate that a dataset is a good candidate for ordinary least squares regression. The first is that the relationship between the dependent and independent variables should be linear in the parameters. This implies that the effect of a one-unit change in an independent variable on the dependent variable is constant and can be expressed as a fixed change in the expected value of the dependent variable. The next criterion is that the independent variables should have no perfectly linear relationships with one another. In other words, we don't want to be able to perfectly predict one independent variable using another; this diminishes their usefulness in the model and sometimes makes the computation impossible (see the short sketch after this paragraph). Next is homoscedasticity, which means that the variability of the model's errors (the distances from the best-fit line) is consistent from the minimum value of the independent variable to the maximum. Next is randomness: we want our data to be randomly sampled so the model is not biased. For example, if we are trying to predict a population's shoe sizes by age and only sample males, we're not going to have a very effective model. Finally, the errors in a model's results should be independent of one another: one error should tell us nothing about the direction of the next.
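To make the collinearity point concrete, here is a small sketch (not part of the original analysis) using made-up data in which one predictor is an exact linear function of another; R's lm() cannot estimate both coefficients and reports the second as NA.

set.seed(42)
x1 <- rnorm(100)
x2 <- 2 * x1 + 3                 # x2 is perfectly collinear with x1
y  <- 1 + 0.5 * x1 + rnorm(100)
coef(lm(y ~ x1 + x2))            # the x2 coefficient comes back NA (aliased)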
Gauss-Markov put more technically
- Linearity: in a single-variable regression, the model should be linear in both the parameters (β0 and β1) and the variables (Y and X). Formally, it is expressed as \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon\] where β0, β1, β2, …, βn are the population parameters and ε represents the error term.
- Full column rank: the matrix of independent variables should be of full rank, meaning there is no perfect linear relationship between the independent variables.
- Constant variance, or homoscedasticity: for all values of the independent variables, the variance of the error should be constant, i.e. \(Var(\epsilon \mid X) = \sigma^2\), a constant.
- Zero conditional mean of error: the expected value of the error term (ε) given the values of the independent variables (X) is zero. In mathematical terms, \(E[\epsilon \mid X] = 0\), indicating that there is no systematic bias in the errors based on the values of X.
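These conditions are easy to see in action with simulated data. A minimal sketch (not in the original post), assuming made-up parameters: the errors are drawn independently with mean zero and constant variance, so OLS should recover the true coefficients.

set.seed(1)
n <- 500
x <- runif(n, 0, 10)              # predictor drawn as a random sample
e <- rnorm(n, mean = 0, sd = 2)   # E[e|X] = 0, constant variance, independent
y <- 3 + 1.5 * x + e              # linear in the parameters: beta0 = 3, beta1 = 1.5
coef(lm(y ~ x))                   # estimates should land close to 3 and 1.5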
I looked at recreation demand data from the AER library. Among other things, this cross-sectional dataset has income and expenditure figures for visitors to three different Texas lakes - Somerville, Conroe, and Houston - in 1985. I decided to use the expenditure figures to model income, hoping there was a strong relationship between spending at these lakes and income. In this first model, I used a level-level approach. The cost figures are in USD and income is in thousands of USD.
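Before modeling, a quick look at the variables involved (a step not shown in the original) confirms the units and ranges; the column names come from the RecreationDemand dataset used below.

library(AER)
data("RecreationDemand")
# income is in thousands of USD; the cost columns are in USD
summary(RecreationDemand[, c("income", "costH", "costS", "costC")])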
\[Income = \beta_0 + \beta_1 costH + \beta_2 costS + \beta_3 costC + \epsilon\]

The intercept of about 3.58 represents the estimated income (in thousands of USD) when spending is $0 at all three lakes. The coefficients show either a positive or negative relationship between lake expenditure and income. In the cases of Somerville and Conroe, a one-dollar increase in expenditure is associated with a slight increase in income (about $4.60 and $28, respectively). The opposite is true of Lake Houston: a one-dollar increase in spending there is associated with a roughly $28 decrease in income. The Houston (costH) and Conroe (costC) coefficients appear to be significant at the 5% level. The costS coefficient does not appear to be significant, as its p-value is very high.
rm(list=ls())
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# bring in the recreation demand data
data("RecreationDemand")
my_reg <- lm(income ~ costH + costS + costC, data = RecreationDemand)
summary(my_reg)
##
## Call:
## lm(formula = income ~ costH + costS + costC, data = RecreationDemand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7373 -1.0407 -0.4553 0.8548 5.5188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.575525 0.120109 29.769 < 2e-16 ***
## costH -0.027767 0.009427 -2.946 0.00334 **
## costS 0.004623 0.007153 0.646 0.51827
## costC 0.028055 0.011439 2.453 0.01444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.826 on 655 degrees of freedom
## Multiple R-squared: 0.03229, Adjusted R-squared: 0.02786
## F-statistic: 7.285 on 3 and 655 DF, p-value: 8.234e-05
plot(my_reg)
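By default, plot() on an lm object cycles through four diagnostic plots. A small presentation tweak (my addition, not in the original code) shows them together in a 2x2 grid:

par(mfrow = c(2, 2))   # split the plotting area into a 2x2 grid
plot(my_reg)           # the four diagnostic plots discussed below
par(mfrow = c(1, 1))   # restore the default layout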
Residuals vs Fitted: this plot helps assess whether the residuals behave linearly. There are some fairly distinct patterns in the residuals here, so I am a little worried we may have some non-linearity.
Normal Q-Q: this gives us a visual sense of whether the residuals are normally distributed. The closer the points sit to the line, the more normally distributed the residuals are. Here I like what I see from quantiles -3 to 2, but above that the distribution seems to have some skew.
Scale-Location: this plot shows the spread of residuals across the different levels of the fitted values. If the points are spread evenly and the red line is roughly horizontal, you've likely met the homoscedasticity criterion. Neither appears to be true in this case, so I think this model fails that criterion.
Residuals vs Leverage: this plot flags potential leverage points in the model. Outliers show up in the upper and lower right-hand corners. These may or may not be influential, though: an outlier is a problematic leverage point when the red line bends toward it, or equivalently when it falls beyond the dashed Cook's distance lines.
I think the assumptions are being violated. In particular, I suspect some heteroscedasticity based on the Scale-Location plot.
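Rather than relying on the plots alone, a formal check is available: the lmtest package (already loaded above as a dependency of AER) provides the Breusch-Pagan test. A quick sketch, not part of the original analysis:

# Breusch-Pagan test: the null hypothesis is homoscedastic errors,
# so a small p-value supports the suspicion of heteroscedasticity
bptest(my_reg)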
Results of the log-log specification appear to have improved the model, but not enough to rule out heteroscedasticity:
- Residuals look more evenly distributed.
- There are still some odd patterns in the Scale-Location plot that do not confirm homoscedasticity.
- The coefficient results are pretty similar in terms of sign and significance.
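The code that produced the output below is not shown in the original; judging from the Call line, it would have been something like the following (my_log_reg is my hypothetical name for the fit):

my_log_reg <- lm(log(income) ~ log(costH) + log(costS) + log(costC),
                 data = RecreationDemand)
summary(my_log_reg)
plot(my_log_reg)   # diagnostic plots referenced in the bullets above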
##
## Call:
## lm(formula = log(income) ~ log(costH) + log(costS) + log(costC),
## data = RecreationDemand)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3925 -0.2145 0.0416 0.3029 1.2971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.88818 0.12465 7.126 2.74e-12 ***
## log(costH) -0.39328 0.10993 -3.577 0.000373 ***
## log(costS) 0.05314 0.07238 0.734 0.463068
## log(costC) 0.43238 0.12370 3.496 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4912 on 655 degrees of freedom
## Multiple R-squared: 0.04387, Adjusted R-squared: 0.03949
## F-statistic: 10.02 on 3 and 655 DF, p-value: 1.838e-06
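If heteroscedasticity remains a concern, one common follow-up (not pursued in the original post) is to keep the OLS point estimates but report heteroscedasticity-consistent standard errors from the sandwich package, which AER also loads. A sketch, reusing the hypothetical my_log_reg fit from above:

# coefficient tests with HC1 robust standard errors
coeftest(my_log_reg, vcov = vcovHC(my_log_reg, type = "HC1"))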