Gauss-Markov Assumptions and Residual Analysis

Gauss-Markov Assumptions: What They Are & What They Mean

1. Linearity

The true relationship between your outcome and your predictors is linear in the parameters. In the simplest case, think of it as assuming every extra year of school adds the same dollar bump to your paycheck: a straight line, not a curve. If the real relationship bends but you force a straight line through it, your predictions will be systematically off, too high over some ranges and too low over others.

The model is Y = Xβ + u, where linearity is in β, not the variables, so X², log(X), etc. are all valid regressors. OLS minimizes the sum of squared residuals, and because the model is linear in β the minimizer has the closed form β̂ = (X’X)⁻¹X’Y. If the true DGP is nonlinear in β, the linear model is misspecified and OLS is biased even asymptotically.
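As a minimal sketch of the closed form, the R code below computes β̂ = (X’X)⁻¹X’Y by hand on simulated data and compares it with lm(). The sample size and the true coefficients (1, 2, −0.5) are assumptions chosen for illustration. Note the squared regressor: the model bends in x but stays linear in β.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 5)
u <- rnorm(n)
y <- 1 + 2 * x - 0.5 * x^2 + u        # nonlinear in x, linear in beta

X <- cbind(1, x, x^2)                  # design matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'y

fit <- lm(y ~ x + I(x^2))
cbind(by_hand = drop(beta_hat), lm = coef(fit))    # the two columns should agree
```

lm() uses a QR decomposition rather than inverting X’X directly, which is numerically safer, but the estimates agree up to rounding.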

2. Random Sampling

Observations are drawn randomly from the population, giving everyone a fair shot at being included. If your sample is self-selected or cherry-picked, your findings won’t generalize beyond that unrepresentative slice. No amount of clever math fixes a biased sample: garbage in, garbage out.

Observations are i.i.d. draws from the joint population distribution, which licenses the LLN and CLT needed for consistency and asymptotic inference. Without it, the sample may not represent the population; when inclusion depends on the outcome (or on unobservables correlated with it), OLS suffers selection bias. Cluster or stratified sampling requires corrected standard errors.
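A small simulation can make the selection problem concrete. This is a sketch under assumed values (true slope 2, selection rule y > 1), not a statement about any real data set: the same population is estimated once on the full sample and once on a sample that only keeps high-outcome observations.

```r
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)               # true slope is 2

full     <- lm(y ~ x)
selected <- lm(y ~ x, subset = y > 1)   # kept only when the outcome is high

c(full = coef(full)[["x"]], selected = coef(selected)[["x"]])
```

The selected-sample slope is pulled toward zero, and collecting more observations under the same selection rule does not repair it.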

3. No Perfect Multicollinearity

No predictor can be an exact linear copy of another (or of a combination of the others), like including both Fahrenheit and Celsius temperatures in the same model. When two variables move in perfect lockstep, the model can’t tell which one deserves the credit and the math breaks down entirely. Each variable needs to bring unique information to the table.

X must have full column rank so that (X’X) is invertible and β̂ = (X’X)⁻¹X’Y has a unique solution. Perfect multicollinearity makes X’X singular; near-multicollinearity inflates Var(β̂) = σ²(X’X)⁻¹ by pushing (X’X) toward singularity. VIF quantifies the inflation; a common rule of thumb treats values above 10 as a concern.
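The sketch below illustrates both cases on simulated data (all variable names and values are assumptions for illustration). First it reproduces the Fahrenheit/Celsius example, then it computes a VIF by hand from its definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors.

```r
set.seed(3)
n       <- 100
celsius <- rnorm(n, mean = 15, sd = 8)
fahren  <- 32 + 1.8 * celsius            # exact linear function of celsius
y       <- 5 + 0.3 * celsius + rnorm(n)

X <- cbind(1, celsius, fahren)
qr(X)$rank                                # 2, not 3: X lacks full column rank
coef(lm(y ~ celsius + fahren))            # lm() returns NA for the redundant column

# Near-multicollinearity: x2 is x1 plus a little noise
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared  # R^2 of x1 on the other predictor
vif_x1 <- 1 / (1 - r2_x1)
vif_x1                                    # far above the rule-of-thumb value of 10
```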

4. Zero Conditional Mean

The unexplained part of your model (the error) must be pure random noise, not secretly correlated with your predictors. If it leans one way for certain values of X, something important was left out and is hiding in the residuals, biasing your slope. This is omitted variable bias, and it’s the most dangerous threat to a trustworthy regression.

E(u | X) = 0 implies Cov(xⱼ, u) = 0 for all j, which delivers unbiasedness: E(β̂) = β. Violations from omitted variables, simultaneity, or measurement error cause plim(β̂) ≠ β — inconsistency that more data cannot fix. IV/2SLS restores consistency when a valid instrument exists.
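To see omitted variable bias at work, here is a minimal simulation. The variable `ability` and every coefficient are hypothetical; the variables are simulated and are not taken from wage1. The "short" regression leaves ability in the error term, so the error is correlated with educ.

```r
set.seed(4)
n       <- 5000
ability <- rnorm(n)
educ    <- 12 + 2 * ability + rnorm(n)            # educ is correlated with ability
wage    <- 1 + 0.5 * educ + 1.0 * ability + rnorm(n)

short <- lm(wage ~ educ)              # ability omitted
long  <- lm(wage ~ educ + ability)    # ability controlled for
c(short = coef(short)[["educ"]], long = coef(long)[["educ"]])

# In-sample omitted-variable identity:
# short slope = long slope + (coef on ability) * Cov(educ, ability) / Var(educ)
gap <- coef(long)[["ability"]] * cov(educ, ability) / var(educ)
coef(long)[["educ"]] + gap            # reproduces the short-regression slope
```

The long regression recovers a slope near the true 0.5, while the short one absorbs part of the ability effect.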

5. Homoskedasticity

The spread of your errors must be roughly constant across all values of your predictor, like darts scattering equally whether you aim high or low. If errors fan out as X grows (heteroskedasticity), your standard errors are wrong and your significance tests become unreliable, even though the coefficient estimates themselves stay unbiased. The easy fix is White’s robust standard errors.

Var(uᵢ | X) = σ² for all i gives Var(u | X) = σ²I, the structure needed for OLS to be BLUE. Under heteroskedasticity, β̂ stays unbiased but σ²(X’X)⁻¹ is misspecified, invalidating standard errors and test statistics. The sandwich estimator (X’X)⁻¹(X’ΩX)(X’X)⁻¹ gives White’s robust SEs; GLS/WLS restore efficiency.
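The sandwich formula can be computed directly in base R. The sketch below uses simulated data whose error spread grows with x (an assumed design) and replaces Ω with diag(ûᵢ²), which is the basic White (HC0) version with no small-sample correction.

```r
set.seed(5)
n <- 500
x <- runif(n, 1, 10)
u <- rnorm(n, sd = 0.5 * x)           # error spread grows with x
y <- 2 + 3 * x + u

fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- resid(fit)

bread <- solve(crossprod(X))          # (X'X)^{-1}
meat  <- crossprod(X * e)             # X' diag(e^2) X: row i of X scaled by e_i
V_hc0 <- bread %*% meat %*% bread     # White (HC0) covariance matrix

cbind(classical = sqrt(diag(vcov(fit))),
      robust    = sqrt(diag(V_hc0)))
```

In practice you would likely reach for a package such as sandwich, which implements this estimator along with small-sample refinements; the hand computation is only meant to show what the formula does.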

6. No Autocorrelation

The error from one observation must not predict the error in another — each residual should be its own independent surprise. This matters most in time-series data, where yesterday’s mistake tends to echo into today’s. Like heteroskedasticity, autocorrelation doesn’t bias your coefficients but it breaks your standard errors and inflates false confidence.

Cov(uᵢ, uⱼ | X) = 0 for all i ≠ j sets all off-diagonal elements of Ω to zero; combined with A5 this gives Ω = σ²I, required for OLS to be BLUE. Autocorrelation leaves β̂ unbiased but makes the usual standard errors wrong; with positive serial correlation (the typical case) they are understated, causing systematic overrejection of true nulls. Newey-West HAC standard errors or GLS are the standard remedies.
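For intuition, a bare-bones Newey-West estimator fits in a few lines of base R. This is a sketch on simulated AR(1) data with an assumed truncation lag of 4 and Bartlett weights; it omits the lag-selection rules and small-sample adjustments that dedicated packages apply.

```r
set.seed(6)
n <- 300
x <- as.numeric(arima.sim(list(ar = 0.7), n = n))   # persistent regressor
u <- as.numeric(arima.sim(list(ar = 0.7), n = n))   # AR(1) errors
y <- 1 + 0.5 * x + u

fit     <- lm(y ~ x)
X       <- model.matrix(fit)
g       <- X * resid(fit)        # row t holds x_t * e_t
max_lag <- 4                     # illustrative choice, not a recommendation

S <- crossprod(g)                # lag-0 term
for (l in seq_len(max_lag)) {
  w     <- 1 - l / (max_lag + 1)                     # Bartlett weight
  Gamma <- crossprod(g[(l + 1):n, , drop = FALSE],   # sum of g_t g_{t-l}'
                     g[1:(n - l), , drop = FALSE])
  S <- S + w * (Gamma + t(Gamma))
}
bread <- solve(crossprod(X))
V_nw  <- bread %*% S %*% bread   # HAC covariance matrix

cbind(classical = sqrt(diag(vcov(fit))), hac = sqrt(diag(V_nw)))
```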

7. Normality of Errors

The unexplained portion of the model should follow a bell-curve shape: small errors are common, large ones are rare. This assumption is needed for exact t-tests and F-tests in small samples, but the Central Limit Theorem rescues you in large samples (a common rule of thumb is n above roughly 120, though there is no hard cutoff): the sampling distribution of your estimates becomes approximately normal regardless of what the errors look like. With large datasets, this assumption is the least worrisome of the seven.

u | X ~ N(0, σ²I) is not part of the Gauss-Markov theorem and is not needed for BLUE, which follows from A1–A6. It is needed for exact finite-sample t- and F-distributions under H₀. By the CLT, (β̂ − β)/SE → N(0,1) asymptotically regardless of the error distribution (given finite variance), so the assumption matters little in large samples; n > 120 is a rule of thumb, not a theorem.
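A quick Monte Carlo check of the CLT claim, as a sketch under assumed settings (n = 200, 2,000 replications, exponential errors shifted to mean zero): even with clearly skewed errors, the t-statistic for the slope should reject a true null about 5% of the time.

```r
set.seed(7)
reps <- 2000
n    <- 200
t_stats <- replicate(reps, {
  x <- rnorm(n)
  u <- rexp(n) - 1                 # right-skewed errors with mean zero
  y <- 1 + 0.5 * x + u
  s <- summary(lm(y ~ x))$coefficients
  (s["x", "Estimate"] - 0.5) / s["x", "Std. Error"]   # t-stat against the true slope
})
mean(abs(t_stats) > 1.96)          # empirical rejection rate, nominal level 0.05
```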

Cross-Sectional Data Set

library(wooldridge)  # provides the wage1 data set
data(wage1)

dim(wage1)

[1] 526  24

head(wage1)

  wage educ exper tenure nonwhite female married numdep smsa northcen south
1 3.10   11     2      0        0      1       0      2    1        0     0
2 3.24   12    22      2        0      1       1      3    1        0     0
3 3.00   11     2      0        0      0       0      2    0        0     0
4 6.00    8    44     28        0      0       1      0    1        0     0
5 5.30   12     7      2        0      0       1      1    0        0     0
6 8.75   16     9      8        0      0       1      0    1        0     0
  west construc ndurman trcommpu trade services profserv profocc clerocc
1    1        0       0        0     0        0        0       0       0
2    1        0       0        0     0        1        0       0       0
3    1        0       0        0     1        0        0       0       0
4    1        0       0        0     0        0        0       0       1
5    1        0       0        0     0        0        0       0       0
6    1        0       0        0     0        0        1       1       0
  servocc    lwage expersq tenursq
1       0 1.131402       4       0
2       1 1.175573     484       4
3       0 1.098612       4       0
4       0 1.791759    1936     784
5       0 1.667707      49       4
6       0 2.169054      81      64

my_reg <- lm(wage ~ educ, data = wage1)
summary(my_reg)


Call:
lm(formula = wage ~ educ, data = wage1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3396 -2.1501 -0.9674  1.1921 16.6085 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.90485    0.68497  -1.321    0.187    
educ         0.54136    0.05325  10.167   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.378 on 524 degrees of freedom
Multiple R-squared:  0.1648,    Adjusted R-squared:  0.1632 
F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16

Intercept (β̂₀ = −0.905): The predicted hourly wage for someone with zero years of education is −$0.90. A negative wage makes no economic sense, but that is not alarming: very few workers in the sample have so little schooling, so the line is essentially extrapolated at educ = 0. The intercept is not statistically significant either (p = 0.187), so we don’t stress over it.

Slope (β̂₁ = 0.541): Each additional year of education is associated with a $0.54 increase in hourly wage, on average. This is a simple regression with no controls, so nothing else is held fixed and the slope should not be read as a causal effect. It is statistically significant at the 0.1% level (α = 0.001): the t-stat of 10.17 is enormous, and the p-value is essentially zero.

Economic magnitude: Over a 4-year college degree that’s roughly +$2.16/hr over a high school grad. At 2,000 hours/year that’s ~$4,300/year — meaningful for 1976 wages. R² = 16.5%, so education alone explains about a sixth of wage variation, which is solid for a single variable.
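The back-of-the-envelope arithmetic can be reproduced in a couple of lines. The slope is typed in from the summary output; the 2,000 hours per year figure is the same assumption used in the text.

```r
b1 <- 0.54136                              # educ slope from the levels regression
four_year_gain <- 4 * b1                   # hourly premium for four extra years
annual_gain    <- four_year_gain * 2000    # assuming 2,000 hours worked per year
round(c(hourly = four_year_gain, annual = annual_gain), 2)
# hourly: 2.17, annual: 4330.88
```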

par(mfrow = c(2, 2))
plot(my_reg)

Plot 1 — Residuals vs Fitted:

Ideally, the residuals scatter randomly around zero with a flat red line. Here the spread fans out as fitted values increase, which is classic heteroskedasticity. The red line is also slightly curved, hinting at mild nonlinearity.

Plot 2 — Q-Q Plot:

Ideally, points hug the diagonal line. Here the upper tail lifts off significantly, meaning large positive residuals are more extreme than a normal distribution would predict: right skew, as expected with wage data.

Plot 3 — Scale-Location:

The red line should be flat. Here it slopes downward and then levels off, again pointing to heteroskedasticity: residual variance is not constant across fitted values.

Plot 4 — Residuals vs Leverage:

No points fall beyond Cook’s distance = 1, so no single observation is dangerously distorting the estimates. A few labeled points (e.g. 112) are worth noting but not alarming.
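The plots are visual checks; a formal test can back them up. The sketch below is an addition to the original analysis: a hand-rolled Breusch-Pagan style LM test (n·R² from regressing squared residuals on the regressors). The helper name `bp_lm_test` is hypothetical, and the demo runs on simulated data so that it is self-contained.

```r
# LM test for heteroskedasticity: under homoskedasticity, n * R^2 from the
# auxiliary regression is approximately chi-squared with k degrees of freedom.
bp_lm_test <- function(fit) {
  Z    <- model.matrix(fit)[, -1, drop = FALSE]   # regressors without the intercept
  e2   <- resid(fit)^2
  aux  <- lm(e2 ~ Z)                              # auxiliary regression (with intercept)
  stat <- length(e2) * summary(aux)$r.squared
  df   <- ncol(Z)
  c(statistic = stat, df = df,
    p_value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(8)
x   <- runif(400, 1, 10)
y   <- 2 + 3 * x + rnorm(400, sd = 0.5 * x)       # spread grows with x
fit <- lm(y ~ x)
bp_lm_test(fit)                                    # small p-value: reject constant variance
max(cooks.distance(fit))                           # compare with the 0.5 and 1 contours
```

For the wage regression in these notes, the corresponding calls would be bp_lm_test(my_reg) and cooks.distance(my_reg).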

Log transformation

my_reg_log <- lm(log(wage) ~ educ, data = wage1)
summary(my_reg_log)


Call:
lm(formula = log(wage) ~ educ, data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.21158 -0.36393 -0.07263  0.29712  1.52339 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.583773   0.097336   5.998 3.74e-09 ***
educ        0.082744   0.007567  10.935  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4801 on 524 degrees of freedom
Multiple R-squared:  0.1858,    Adjusted R-squared:  0.1843 
F-statistic: 119.6 on 1 and 524 DF,  p-value: < 2.2e-16

Wage distributions are naturally right-skewed — most people earn modest wages but a few earn extremely high ones, stretching the right tail. When you take the log of wage, you compress those large values, pulling the distribution closer to symmetric. Think of it like a rubber band — the log squishes the long right tail back toward the middle, making the errors more evenly spread and the straight-line assumption more defensible.
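The log also changes how the slope is read: in a log-level model, 100·β̂₁ is the approximate percentage change in wage per extra year of education, and 100·(exp(β̂₁) − 1) is the exact one. A short sketch, with the coefficient typed in from the log-wage output:

```r
b1 <- 0.082744                        # educ coefficient from the log(wage) regression
approx_pct <- 100 * b1                # log-point approximation
exact_pct  <- 100 * (exp(b1) - 1)     # exact percentage change
round(c(approx = approx_pct, exact = exact_pct), 2)
# approx: 8.27, exact: 8.63
```

So each additional year of education is associated with roughly 8 to 9 percent higher hourly wages, rather than a fixed dollar amount.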

Diagnostic plots

par(mfrow = c(2, 2))
plot(my_reg_log)

Plot 1 — Residuals vs Fitted:

Much flatter red line than before. The fanning is largely gone, and residuals are more evenly scattered around zero. Heteroskedasticity is substantially reduced.

Plot 2 — Q-Q Plot:

Points hug the diagonal far more tightly across the full range. The upper tail still lifts slightly but nothing like the levels model. Normality assumption much more defensible.

Plot 3 — Scale-Location:

The red line is noticeably flatter than in the levels model. Variance is more stable across fitted values, which confirms that the log transformation did its job on heteroskedasticity.

Plot 4 — Residuals vs Leverage:

Observation 379 stands out: it combines high leverage with a large residual and sits near the Cook’s distance = 0.5 contour. Worth investigating. Observations 112 and 139 are also labeled but less concerning.
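One simple way to investigate a flagged point is to refit the model without it and see how much the coefficients move. The helper `drop_and_compare` below is a hypothetical name, and the demo uses simulated data with one planted outlier so that it runs on its own.

```r
# Refit the same formula without the given rows and compare coefficients.
drop_and_compare <- function(fit, data, rows) {
  refit <- lm(formula(fit), data = data[-rows, , drop = FALSE])
  cbind(all_rows = coef(fit), without = coef(refit))
}

set.seed(9)
d   <- data.frame(x = rnorm(100))
d$y <- 1 + 0.5 * d$x + rnorm(100, sd = 0.5)
d[100, c("x", "y")] <- c(6, -5)          # plant a high-leverage, large-residual point

fit <- lm(y ~ x, data = d)
which.max(cooks.distance(fit))           # identifies row 100
drop_and_compare(fit, d, rows = 100)
```

For the log-wage model here, the analogous call would be drop_and_compare(my_reg_log, wage1, rows = 379). If the educ coefficient barely changes, the point is unusual but not influential enough to alter the conclusions.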

Conclusion

The log transformation meaningfully improves three of the four diagnostics: heteroskedasticity is reduced, errors are closer to normal, and the linear specification is more defensible. The leverage plot is the exception, since observation 379 now merits a closer look.