Gauss-Markov Week 3

Gauss-Markov Assumptions
Cross-Sectional Dataset
Linear Regression Plots

I. Gauss-Markov Assumptions

Overview

The Gauss-Markov assumptions describe the conditions under which the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE). First, the model must be linear in parameters (linearity). Second, there cannot be exact linear relationships among predictors (full column rank). Third, the error term must have an expected value of zero conditional on the predictors (exogeneity). Fourth, error terms must have constant variance (homoscedasticity) and be uncorrelated across observations (no autocorrelation). Fifth, the independent variables may be fixed or random (data generation). Sixth, the error terms are normally distributed (normality). The first three assumptions determine whether the estimate of beta is unbiased, while the last three primarily affect the standard errors and inference. Normality is not needed for the BLUE result itself; it is added so that t-tests and confidence intervals are exact in finite samples.

Non-Technical

A set of six assumptions helps make sure an econometric model gives reliable and meaningful results. Linearity (1) means the relationship is consistent, so changes in inputs lead to predictable changes in outputs. Even though real-world relationships are rarely this simple, the assumption provides a useful approximation of how variables are related. Full column rank (2) means that each variable contains unique information. Otherwise, they would overlap, and we wouldn’t be able to tell which variable is driving the outcome. Exogeneity (3) means that whatever is left out of the model (the error) is unrelated to the independent variables. One common way this fails is reverse causality: the independent variables should influence the dependent variable (x’s influence y) and not vice versa. In other words, the two cannot be determined simultaneously; there is a clear direction of cause and effect.

Homoscedasticity (4a) means that the size of the errors is roughly the same across all the observations, rather than some being very small and others very large, which helps keep the estimates stable. No autocorrelation (4b) means that one observation’s error does not affect another’s, so each data point contributes independent information. Next, data generation (5) assumes the data are obtained from an impartial process, not systematically distorted or constructed. Finally, normality (6) means that the errors are distributed in a balanced, bell-shaped way, which allows standard hypothesis tests and confidence intervals to be used. Together, the assumptions ensure the model produces fair and dependable results that are not driven by biases or hidden patterns in the data.

Technical

The six Gauss-Markov assumptions define the conditions under which the OLS estimator is considered BLUE. Linearity in the parameters allows for OLS estimation by expressing the dependent variable as a linear combination of coefficients and independent variables, plus an error term. Full column rank requires that there is no perfect multicollinearity among the predictors, ensuring that the matrix X'X is invertible and a unique set of coefficient estimates exists. Exogeneity requires that the error term has an expected value of zero, conditional on the independent variables, ensuring the coefficient estimates are unbiased. These first three assumptions affect the estimation of the coefficients.
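
In matrix form, the first three assumptions and the unbiasedness result they deliver can be sketched as follows. The notation is the standard textbook one and is an assumption here, since these notes do not define it: $y$ is the vector of outcomes, $X$ the matrix of predictors with $k$ columns (including the constant), $\beta$ the coefficient vector, and $\epsilon$ the error vector.

\[ y = X\beta + \epsilon, \qquad \operatorname{rank}(X) = k, \qquad E[\epsilon \mid X] = 0 \]

\[ \hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\epsilon \quad \Rightarrow \quad E[\hat{\beta} \mid X] = \beta \]

Full column rank is what makes $(X'X)^{-1}$ exist, and exogeneity is what makes the second term vanish in expectation.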

Next, homoscedasticity assumes that the variance of the error term is constant across observations, while no autocorrelation assumes that error terms are uncorrelated across observations. The data generation assumption allows independent variables to be either fixed or random, provided that they are not perfectly collinear and satisfy the other assumptions. Normality assumes the error terms are normally distributed, which is important for exact t-tests and confidence intervals in small samples, but is not required for unbiasedness. These final three assumptions mainly affect the variance of the estimators and the validity of statistical inferences.
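
The estimator and its variance can be checked by hand on simulated data. The sketch below is illustrative only: the sample size, true coefficients, and error standard deviation are made-up values, and the errors are drawn so that all six assumptions hold by construction.

```r
# Simulated data (hypothetical values, chosen only for illustration)
set.seed(1)
n   <- 200
x   <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 2)  # homoscedastic, uncorrelated, normal errors
y   <- 1 + 3 * x + eps             # model is linear in the parameters

X <- cbind(1, x)                   # design matrix with a column of ones
stopifnot(qr(X)$rank == ncol(X))   # full column rank check

XtX_inv  <- solve(crossprod(X))            # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X, y)    # (X'X)^{-1} X'y

res        <- y - X %*% beta_hat
sigma2_hat <- sum(res^2) / (n - ncol(X))   # residual variance estimate
vcov_hat   <- sigma2_hat * XtX_inv         # estimated Var(beta_hat | X)

# Compare the hand-computed values with lm()
fit <- lm(y ~ x)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
all.equal(unname(vcov_hat), unname(vcov(fit)))
```

Both comparisons should agree up to numerical tolerance, since `lm()` solves the same least squares problem (via a QR decomposition rather than an explicit inverse).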

II. Cross-Sectional Dataset

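library(MASS)   # MASS provides the Boston housing data
data("Boston")  # load the data set into the workspace
?Boston         # open the help page describing each variable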
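
Before estimating anything, it can help to look at the two variables used below. This is a minimal sketch using only base R functions; it is not part of the original analysis.

```r
library(MASS)  # the Boston data ships with MASS

dim(Boston)                          # number of rows and columns
summary(Boston[, c("medv", "rm")])   # five-number summary and mean of each variable
cor(Boston$medv, Boston$rm)          # sample correlation between value and rooms
```

In a simple regression with one predictor, the square of this correlation equals the R-squared reported by `summary()`.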

Estimating Equation:

\[ medv_i = \beta_0 + \beta_1 rm_i + \epsilon_i \]

Variables:

Dependent variable (medv): Median home value (in $1000s)
Independent variable (rm): Average number of rooms per dwelling

Regression:

my_reg <- lm(medv ~ rm, data = Boston)
summary(my_reg)


Call:
lm(formula = medv ~ rm, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.346  -2.547   0.090   2.986  39.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -34.671      2.650  -13.08   <2e-16 ***
rm             9.102      0.419   21.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared:  0.4835,    Adjusted R-squared:  0.4825 
F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16

Interpretation:

Slope ($\beta_1$ = 9.102)

A one-unit increase in the average number of rooms is associated with an increase of about $9,102 in median home value, on average. This relationship is positive, indicating that larger homes tend to have higher prices.
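
One way to see the slope interpretation concretely is to compare predictions for two hypothetical neighborhoods that differ by exactly one room. The values 6 and 7 are arbitrary choices for illustration.

```r
library(MASS)
my_reg <- lm(medv ~ rm, data = Boston)  # refit so this block runs on its own

# Two hypothetical neighborhoods averaging 6 and 7 rooms per dwelling
new_homes <- data.frame(rm = c(6, 7))
pred <- predict(my_reg, newdata = new_homes)

pred                # predicted median values, in $1000s
diff(pred)          # gap between the two predictions
coef(my_reg)["rm"]  # the same number: the estimated slope
```

Because the model is linear, the gap is the same for any two values of rm that are one unit apart.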

The coefficient on rm has a p-value of <2e-16 (<0.01), meaning it is statistically significant at the 1% level. This value indicates there is strong evidence that the number of rooms is related to housing prices.
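
The same conclusion can be read from a confidence interval. The sketch below pulls the coefficient row out of the summary table and computes a 99% interval, which matches the 1% significance level used above.

```r
library(MASS)
my_reg <- lm(medv ~ rm, data = Boston)  # refit so this block runs on its own

# Estimate, standard error, t value, and p-value for rm
coef(summary(my_reg))["rm", ]

# 99% confidence interval for the slope
ci_99 <- confint(my_reg, "rm", level = 0.99)
ci_99
```

An interval that excludes zero corresponds to rejecting the null hypothesis of no relationship at the 1% level.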

The economic magnitude of the coefficient is meaningful: the predicted median home value increases by over $9,000 per additional room. This is a large change in value, which suggests home size is an important determinant of price and may inform building decisions.

Intercept ($\beta_0$ = -34.671)

When the average number of rooms is zero, the predicted median home value would be -$34,671. This is not economically meaningful since a home cannot have zero rooms, but it serves as a baseline for the regression line.
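
A common way to obtain a more interpretable intercept is to center the predictor at its mean. This is a sketch of that idea and not part of the original analysis; the names `Boston_c`, `rm_c`, and `my_reg_c` are introduced here for illustration.

```r
library(MASS)

range(Boston$rm)  # zero rooms lies outside the observed data

# Center rm at its sample mean
Boston_c <- transform(Boston, rm_c = rm - mean(rm))
my_reg_c <- lm(medv ~ rm_c, data = Boston_c)

coef(my_reg_c)      # slope is unchanged; intercept is now the fitted value at the mean of rm
mean(Boston$medv)   # which equals the sample mean of medv
```

Centering only relabels the horizontal axis, so the fit and the slope are identical to the original regression.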

III. Linear Regression Plots

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in one 2 x 2 panel
plot(my_reg)

The four diagnostic plots provide a visual way to assess whether the assumptions of the linear regression model are satisfied. The first plot, “Residuals vs Fitted”, reveals whether there is any clear pattern among the residuals. Ideally, they are randomly scattered around zero, indicating the relationship is approximately linear. A curved pattern, for example, may indicate that higher-order terms, such as a quadratic, are missing from the model.
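
A simple numerical companion to this plot is to add a quadratic term and test whether it improves the fit. This is a sketch of one possible check, not part of the original analysis; `my_reg_quad` is a name introduced here.

```r
library(MASS)
my_reg <- lm(medv ~ rm, data = Boston)  # refit so this block runs on its own

# Same model with a squared term added
my_reg_quad <- lm(medv ~ rm + I(rm^2), data = Boston)

# F-test comparing the linear model with the quadratic model
quad_test <- anova(my_reg, my_reg_quad)
quad_test
```

A small p-value in the second row would point to curvature that the straight-line model misses.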

The second plot, “Q-Q Residuals”, compares the distribution of the residuals to a normal distribution. The assumption of normality is supported if the residual points fall roughly along the straight line. Large deviations, especially in the tails, suggest the errors may be skewed or heavy-tailed.
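
The visual check can be paired with a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals; choosing this particular test is an assumption made here, not something the original analysis used.

```r
library(MASS)
my_reg <- lm(medv ~ rm, data = Boston)  # refit so this block runs on its own

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(my_reg))
sw
```

A small p-value is evidence against normality. With several hundred observations the test can flag even mild departures, so it is best read alongside the Q-Q plot and not in place of it.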

The third plot, “Scale-Location”, shows the spread of the residuals across fitted values to assess homoscedasticity. Constant variance would appear as a roughly horizontal line with evenly spread points. A funnel shape would suggest heteroscedasticity.
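
Heteroscedasticity can also be tested numerically. The sketch below computes the studentized (Koenker) version of the Breusch-Pagan test by hand in base R, so no extra package is assumed; the variable names are introduced here for illustration.

```r
library(MASS)
my_reg <- lm(medv ~ rm, data = Boston)  # refit so this block runs on its own

# Auxiliary regression of the squared residuals on the predictor
e2  <- residuals(my_reg)^2
aux <- lm(e2 ~ rm, data = Boston)

# LM statistic: n times the R-squared of the auxiliary regression
lm_stat <- nobs(aux) * summary(aux)$r.squared

# Under homoscedasticity, approximately chi-squared with 1 degree of freedom
# (one predictor in the auxiliary regression)
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p.value = p_value)
```

A small p-value suggests the error variance changes with rm, in which case heteroscedasticity-robust standard errors would be a natural next step.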

The fourth plot, “Residuals vs Leverage”, uses Cook’s distance lines to identify observations that might have a disproportionately large influence on the regression results. Removing observations that fall beyond the threshold could noticeably change the estimates.
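
Cook’s distance can be computed directly to list the most influential observations. In the sketch below, the 4/n cutoff is a common rule of thumb and an assumption made here; it is not the threshold drawn on the plot.

```r
library(MASS)
my_reg <- lm(medv ~ rm, data = Boston)  # refit so this block runs on its own

cd <- cooks.distance(my_reg)

# Five observations with the largest Cook's distance
head(sort(cd, decreasing = TRUE), 5)

# Observations above the 4/n rule-of-thumb cutoff (assumed cutoff)
flagged <- which(cd > 4 / nobs(my_reg))
length(flagged)
```

Refitting the model without the flagged rows and comparing coefficients shows how much these points actually matter.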

The diagnostic plots suggest that the Gauss-Markov assumptions may not fully hold, but are not seriously violated. “Residuals vs Fitted” shows a slight dip, but not one severe enough to indicate strong nonlinearity. The Q-Q plot shows some deviation in the tails, suggesting the errors are not perfectly normally distributed. “Scale-Location” shows a more defined curve and some increase in spread, indicating potential heteroscedasticity. Finally, “Residuals vs Leverage” identifies a few potentially influential points, such as observation 366, but nothing too far past the threshold.

Transformation

my_reg_log <- lm(log(medv) ~ rm, data = Boston)
summary(my_reg_log)


Call:
lm(formula = log(medv) ~ rm, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20386 -0.09419  0.06416  0.16991  1.36088 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.72374    0.12699   5.699 2.05e-08 ***
rm           0.36769    0.02008  18.309  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3171 on 504 degrees of freedom
Multiple R-squared:  0.3995,    Adjusted R-squared:  0.3983 
F-statistic: 335.2 on 1 and 504 DF,  p-value: < 2.2e-16

plot(my_reg_log)

Taking the log of the dependent variable medv brings the model closer to the linearity assumption by reducing the curve observed in “Residuals vs Fitted”. “Scale-Location” shows a slightly more consistent spread of residuals, though deviations at the tails remain in the Q-Q plot. The R-squared values of the two models are not directly comparable, because the dependent variables differ (medv versus log(medv)). The transformation also changes the interpretation of the coefficient on rm: it now describes an approximate percentage change in medv per additional room instead of a dollar amount, which is slightly less intuitive.
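
The percentage interpretation can be made explicit. In a log-level model, 100 times the coefficient is the usual approximation, while 100 × (exp(coefficient) − 1) gives the exact implied change. The sketch below computes both; `b1` is a name introduced here.

```r
library(MASS)
my_reg_log <- lm(log(medv) ~ rm, data = Boston)  # refit so this block runs on its own

b1 <- coef(my_reg_log)["rm"]

100 * b1              # approximate % change in medv per additional room
100 * (exp(b1) - 1)   # exact % change implied by the log model
```

With a coefficient of about 0.368, the approximation gives roughly 37%, while the exact figure is about 44%. The gap arises because the approximation is only accurate for small coefficients.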
