Detecting and Fixing Heteroskedasticity and Multicollinearity (Using the Gretl Interface)


Session 1: Heteroskedasticity

1. Why It Matters

Discussion Question:
“Why might women’s earnings have less variance than men’s? How would this affect OLS results?”

Key Points:
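- Heteroskedasticity → Biased conventional standard errors → Wrong conclusions about significance. The OLS coefficient estimates themselves stay unbiased, but they are no longer efficient.
- Example: If high earners have more volatile salaries, the conventional OLS standard errors can understate the uncertainty of estimates that are driven by the top income brackets.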
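
To make the first key point concrete, here is a minimal sketch in Python with NumPy (an assumption; the handout itself uses only the Gretl menus). It takes a made-up design in which the error standard deviation is proportional to x and compares the true standard error of the OLS slope with what the conventional variance formula delivers on average. The function name, the grid of x values, and the variance pattern are all illustrative.

```python
import numpy as np

def slope_se_true_vs_conventional(x, error_var):
    """Return (true SE, SE implied by the conventional formula on average).

    x         : regressor values (1-D array)
    error_var : known error variance of each observation (1-D array)
    """
    n = len(x)
    X = np.column_stack([np.ones(n), x])          # intercept + regressor
    XtX_inv = np.linalg.inv(X.T @ X)

    # True sampling variance: (X'X)^-1 X' Omega X (X'X)^-1
    meat = X.T @ (X * error_var[:, None])
    true_var = XtX_inv @ meat @ XtX_inv

    # Conventional formula s^2 (X'X)^-1 with s^2 replaced by its expectation,
    # E[s^2] = sum_i var_i * (1 - h_ii) / (n - k)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    expected_s2 = np.sum(error_var * (1.0 - leverage)) / (n - X.shape[1])
    conventional_var = expected_s2 * XtX_inv

    return float(np.sqrt(true_var[1, 1])), float(np.sqrt(conventional_var[1, 1]))

x = np.linspace(0.0, 10.0, 200)      # illustrative regressor values
error_var = x ** 2                   # assumed: error sd proportional to x
true_se, conventional_se = slope_se_true_vs_conventional(x, error_var)
print(f"true SE: {true_se:.4f}   conventional formula: {conventional_se:.4f}")
```

In this design the conventional formula comes out below the true standard error of the slope, which is the sense in which it "understates uncertainty".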


2. Detection

A. Visual Inspection (RVF Plot)

  1. Run Regression:
    • Model > Ordinary Least Squares
    • Dependent: AHE
    • Regressors: Age, Female, Bachelor
    • Click OK.
  2. Plot Residuals:
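    • In the model results window: Graphs > Residual plot (plot the residuals against the fitted values or against a regressor such as Age; the exact submenu entries depend on the Gretl version).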
    • Look for: Fan-shaped pattern (residuals spread out as fitted values increase).
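
A numeric counterpart to the visual check, sketched in Python with NumPy (an assumption, not part of the Gretl workflow): sort the observations by fitted value, split them into a few bins, and compare the spread of the residuals across bins. The helper name and the synthetic residuals are illustrative only.

```python
import numpy as np

def residual_spread_by_fitted(fitted, residuals, n_bins=3):
    """Standard deviation of the residuals within bins of the fitted values."""
    order = np.argsort(fitted)                            # sort by fitted value
    groups = np.array_split(np.asarray(residuals)[order], n_bins)
    return [float(np.std(group)) for group in groups]

# Synthetic illustration: residuals whose size grows with the fitted value
fitted = np.arange(1.0, 301.0)
signs = np.where(np.arange(300) % 2 == 0, 1.0, -1.0)
residuals = 0.1 * fitted * signs
spread = residual_spread_by_fitted(fitted, residuals)
print(spread)   # spread rising from the first bin to the last suggests a fan shape
```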

B. Formal Tests

White Test:
1. After regression: Tests > Heteroskedasticity > White’s test.
2. Criterion: Reject null (homoskedasticity) if p-value < 0.05.

Breusch-Pagan Test:
1. Tests > Heteroskedasticity > Breusch-Pagan.
2. Criterion: Reject null if p-value < 0.05.

Discussion Question:
“Why might White’s test be preferred over Breusch-Pagan?”
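- Answer: White’s test is more general. Its auxiliary regression includes the squares and cross-products of the regressors, so it can pick up nonlinear forms of heteroskedasticity; BP assumes the error variance is a linear function of the regressors. The cost is that White’s test uses more degrees of freedom, so it has less power in small samples.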
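
Both tests can be sketched as an auxiliary regression of the squared OLS residuals, with test statistic LM = n·R². The Python/NumPy code below is a minimal illustration under stated assumptions: one regressor, synthetic data, and the n·R² (Koenker) form of the BP statistic, which may differ numerically from the statistic Gretl reports by default. With a single regressor, the White regression adds only the squared term (there are no cross-products).

```python
import numpy as np

def ols_residuals(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def r_squared(X, y):
    resid = ols_residuals(X, y)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def lm_statistics(x, y):
    """Return (BP, White) statistics, each computed as n * R^2."""
    n = len(y)
    const = np.ones(n)
    u2 = ols_residuals(np.column_stack([const, x]), y) ** 2          # squared residuals
    bp = n * r_squared(np.column_stack([const, x]), u2)              # linear in x, df = 1
    white = n * r_squared(np.column_stack([const, x, x ** 2]), u2)   # adds x^2, df = 2
    return float(bp), float(white)

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=500)
y = 2.0 + 0.5 * x + x * rng.normal(size=500)    # synthetic: error sd proportional to x
bp, white = lm_statistics(x, y)

# 5% chi-square critical values: 3.841 (df = 1) and 5.991 (df = 2)
print("reject homoskedasticity (BP):   ", bp > 3.841)
print("reject homoskedasticity (White):", white > 5.991)
```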


3. Fixing Heteroskedasticity

Robust Standard Errors

  1. Re-run regression:
    • Check “Robust standard errors” in the OLS dialog box.
  2. Compare:
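    • Non-robust SEs (original) vs. robust SEs. The coefficients are identical; only the SEs change. Robust SEs are often larger, but they can also be smaller; either way they remain valid under heteroskedasticity.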
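
What the robust option computes can be sketched with the sandwich formula. The Python/NumPy code below uses the HC1 variant on synthetic data; which variant Gretl applies depends on its version and settings, so treat this as an illustration rather than a replica of Gretl's output. The function name and the data-generating process are assumptions.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Return (coefficients, conventional SEs, HC1 robust SEs)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Conventional: s^2 (X'X)^-1
    s2 = resid @ resid / (n - k)
    conventional = np.sqrt(np.diag(s2 * XtX_inv))

    # HC1 sandwich: n/(n-k) * (X'X)^-1 X' diag(u^2) X (X'X)^-1
    meat = X.T @ (X * resid[:, None] ** 2)
    hc1 = (n / (n - k)) * (XtX_inv @ meat @ XtX_inv)
    robust = np.sqrt(np.diag(hc1))
    return beta, conventional, robust

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + x * rng.normal(size=n)     # synthetic heteroskedastic data
X = np.column_stack([np.ones(n), x])
beta, conventional, robust = ols_with_standard_errors(X, y)
print("slope:", beta[1])
print("conventional SE:", conventional[1], " robust SE:", robust[1])
```

Comparing the two SE vectors element by element mirrors the side-by-side comparison of the two Gretl outputs.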

Discussion Question:
“When would you use WLS instead of robust SEs?”
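- Answer: If you know the variance structure (e.g., variance ∝ education level). With the correct weights, WLS is more efficient than OLS with robust SEs; with the wrong weights it is not, which is why robust SEs are the safer default.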
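
A minimal WLS sketch in Python with NumPy, assuming the error variance is proportional to a known variable (here a hypothetical years-of-education column): each observation is divided by its standard deviation factor and OLS is run on the rescaled data. The function name and the numbers are illustrative.

```python
import numpy as np

def wls(X, y, variance_factor):
    """Weighted least squares when Var(u_i) is proportional to variance_factor[i]."""
    w = 1.0 / np.sqrt(variance_factor)             # scale each row by 1 / sd_i
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta

# Sanity check on noise-free data: WLS must recover the true line
educ = np.array([10.0, 12.0, 12.0, 14.0, 16.0, 16.0, 18.0])   # hypothetical values
X = np.column_stack([np.ones(len(educ)), educ])
y = 2.0 + 3.0 * educ
beta = wls(X, y, variance_factor=educ)
print(beta)
```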


Session 2: Multicollinearity

1. Why It Matters

Discussion Question:
“Why can’t we include all region dummies (Northeast, Midwest, South, West) plus an intercept?”
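- Answer: Perfect multicollinearity (the dummy variable trap). The four dummies sum to 1 for every observation, which exactly reproduces the intercept column, so the coefficients cannot be separately identified.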
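
The trap can be checked numerically. The sketch below (Python with NumPy, hypothetical region codes) builds the design matrix with and without the fourth dummy and compares the number of columns with the matrix rank.

```python
import numpy as np

region = np.arange(12) % 4                                   # hypothetical region codes 0..3
dummies = (region[:, None] == np.arange(4)).astype(float)    # Northeast, Midwest, South, West
const = np.ones((12, 1))

X_all = np.hstack([const, dummies])            # intercept + all four dummies
X_drop = np.hstack([const, dummies[:, 1:]])    # intercept + three dummies (Northeast omitted)

rank_all = int(np.linalg.matrix_rank(X_all))
rank_drop = int(np.linalg.matrix_rank(X_drop))
print("columns:", X_all.shape[1], "rank:", rank_all)     # rank below column count: X'X is singular
print("columns:", X_drop.shape[1], "rank:", rank_drop)   # full column rank: OLS is well defined
```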


2. Detection

A. Correlation Matrix

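  1. View > Correlation matrix > Select Age, Age², Female, Bachelor (create Age² first if it is not yet in the dataset).
  2. Rule of thumb: |r| > 0.8 between two regressors → Possible problem. Pairwise correlations can miss collinearity that involves three or more variables, so also check the VIFs.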
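
The same screening can be sketched in Python with NumPy. The sample below is synthetic (40 workers, ages 25 to 34, balanced dummies) and the helper name is an assumption; it only illustrates how the 0.8 rule of thumb is applied to every pair of regressors.

```python
import numpy as np

def flag_high_correlations(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns with |r| above threshold."""
    names = list(columns)
    corr = np.corrcoef(np.array([columns[name] for name in names]))
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged

# Synthetic sample, not real earnings data
idx = np.arange(40)
age = 25.0 + idx % 10
columns = {
    "Age": age,
    "Age_sq": age ** 2,
    "Female": ((idx // 10) % 2).astype(float),
    "Bachelor": (idx // 20).astype(float),
}
flagged = flag_high_correlations(columns)
print(flagged)
```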

B. Variance Inflation Factor

  1. Run regression: Model > OLS > Include Age, Age², Female, Bachelor.
  2. Tests > Variance inflation factors.
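  3. Criterion (rule of thumb): VIF > 10 → Severe multicollinearity. For regressor j, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing regressor j on all the other regressors.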
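
The VIF itself is simple to compute by hand: regress each regressor on all the others (plus a constant) and take 1 / (1 − R²). The sketch below does this in Python with NumPy on a synthetic sample (ages 25 to 34, balanced dummies); the function name and data are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus a constant."""
    names = list(columns)
    data = np.column_stack([columns[name] for name in names])
    n = data.shape[0]
    vifs = {}
    for j, name in enumerate(names):
        target = data[:, j]
        others = np.column_stack([np.ones(n), np.delete(data, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs[name] = float(1.0 / (1.0 - r2))
    return vifs

# Synthetic sample, not real earnings data
idx = np.arange(40)
age = 25.0 + idx % 10
vifs = variance_inflation_factors({
    "Age": age,
    "Age_sq": age ** 2,
    "Female": ((idx // 10) % 2).astype(float),
    "Bachelor": (idx // 20).astype(float),
})
for name, value in vifs.items():
    print(f"{name}: {value:.1f}")
```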

Discussion Question:
“Why does multicollinearity inflate standard errors but not bias coefficients?”
- Answer: OLS estimates remain unbiased, but uncertainty increases because predictors “compete” to explain the same variation.


3. Fixing Multicollinearity

A. Drop Redundant Variables

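  • Example: Omit one region dummy, e.g., Northeast, which becomes the baseline category. The remaining region coefficients are then measured relative to Northeast.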

B. Center Variables for Polynomials

  1. Add > Define new variable:
    • Name: Age_centered
    • Formula: Age - mean(Age)
  2. Create Age_centered² the same way (e.g., Name: Age_centered_sq, Formula: Age_centered^2).
  3. Re-run regression with centered terms.

Discussion Question:
“Why does centering reduce multicollinearity between Age and Age²?”
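- Answer: Over a narrow range of positive values (such as ages 25 to 34), Age and Age² rise almost in lockstep, so they are nearly collinear. After centering, the linear term takes negative and positive values while its square is roughly symmetric around zero, so the two are nearly uncorrelated (exactly uncorrelated if Age is symmetrically distributed). Centering does not change the fitted values or the coefficient on the squared term; it changes the meaning of the linear coefficient, which becomes the slope at mean Age.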
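
A quick numeric check of the centering argument, sketched in Python with NumPy on a synthetic, evenly spread set of ages (25 to 34 is an assumption taken from the ages used in the exercises):

```python
import numpy as np

age = np.tile(np.arange(25.0, 35.0), 4)       # synthetic ages 25..34, four workers each
age_centered = age - age.mean()               # same as the formula Age - mean(Age)

corr_raw = float(np.corrcoef(age, age ** 2)[0, 1])
corr_centered = float(np.corrcoef(age_centered, age_centered ** 2)[0, 1])
print(f"corr(Age, Age^2)                   = {corr_raw:.4f}")
print(f"corr(Age_centered, Age_centered^2) = {corr_centered:.4f}")
```

With real data the centered correlation will not be exactly zero unless Age is symmetrically distributed, but it is typically far below the uncentered value.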


Session 3: Recap & Application

1. Group Exercise (30 min)

Task: Diagnose and fix issues in:
\[ \ln(\text{AHE}) = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Age}^2 + \beta_3 \text{Female} + \beta_4 \text{Bachelor} + u \]

Steps:
1. Check heteroskedasticity (White’s test).
2. Check multicollinearity (VIF).
3. Apply fixes (centering + robust SEs).

2. Key Takeaways

Issue              | Test                    | Solution           | Gretl Path
Heteroskedasticity | White’s test (p < 0.05) | Robust SEs         | Tests > Heteroskedasticity
Multicollinearity  | VIF > 10                | Centering/dropping | Tests > Variance inflation factors

Investigating the Relationship Between Age and Earnings (Exercises)

a. Linear Regression of AHE on Age, Gender, and Education
- Run a regression of average hourly earnings (AHE) on:
  - Age (continuous)
  - Female (binary: 1 if female, 0 if male)
  - Bachelor (binary: 1 if holds a bachelor’s degree, 0 otherwise).
- Interpretation:
  - If Age increases from 25 to 26, by how much do earnings change?
  - If Age increases from 33 to 34, by how much do earnings change?

b. Log-Linear Regression of ln(AHE) on Age, Gender, and Education
- Run a regression of ln(AHE) on the same variables (Age, Female, Bachelor).
- Interpretation:
  - For a one-year increase in Age from 25 to 26, what is the expected percentage change in earnings?
  - For a one-year increase in Age from 33 to 34, what is the expected percentage change in earnings?
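
As a worked step for the interpretation, write model (b) with generic coefficients (the symbols below are notation introduced here, not estimates) and take the difference for a one-year change in Age:

```latex
\[ \ln(\text{AHE}) = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Female} + \beta_3 \text{Bachelor} + u \]
\[ \Delta \ln(\text{AHE}) = \beta_1 \, \Delta\text{Age} = \beta_1
   \quad\Rightarrow\quad
   \%\Delta\text{AHE} \approx 100\,\beta_1
   \quad \text{(exact: } 100\,(e^{\beta_1} - 1)\text{)} \]
```

The starting age drops out of the difference, so the same expression applies to both questions.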

c. Log-Log Regression of ln(AHE) on ln(Age), Gender, and Education
- Run a regression of ln(AHE) on:
  - ln(Age) (natural log of Age)
  - Female
  - Bachelor.
- Interpretation:
  - If Age increases from 25 to 26, what is the expected percentage change in earnings?
  - If Age increases from 33 to 34, what is the expected percentage change in earnings?
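
A worked step for the log-log case, with β₁ denoting the coefficient on ln(Age) (generic notation introduced here, not an estimate). The change in ln(AHE) now depends on the change in ln(Age), which is smaller at higher ages:

```latex
\[ \Delta \ln(\text{AHE}) = \beta_1 \left[ \ln(26) - \ln(25) \right] = \beta_1 \ln\!\left(\tfrac{26}{25}\right) \approx 0.0392\,\beta_1 \]
\[ \Delta \ln(\text{AHE}) = \beta_1 \left[ \ln(34) - \ln(33) \right] = \beta_1 \ln\!\left(\tfrac{34}{33}\right) \approx 0.0299\,\beta_1 \]
```

Multiplying each result by 100 gives the approximate percentage change in earnings.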

d. Quadratic (Polynomial) Regression of ln(AHE) on Age and Age²
- Run a regression of ln(AHE) on:
  - Age
  - Age² (Age squared)
  - Female
  - Bachelor.
- Interpretation:
  - If Age increases from 25 to 26, what is the expected percentage change in earnings?
  - If Age increases from 33 to 34, what is the expected percentage change in earnings?
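
A worked step for the quadratic case, with β₁ and β₂ denoting the coefficients on Age and Age² (generic notation introduced here, not estimates). For a move from age a to a + 1:

```latex
\[ \Delta \ln(\text{AHE}) = \beta_1 \left[ (a+1) - a \right] + \beta_2 \left[ (a+1)^2 - a^2 \right] = \beta_1 + \beta_2 \,(2a + 1) \]
\[ a = 25:\ \ \Delta \ln(\text{AHE}) = \beta_1 + 51\,\beta_2
   \qquad
   a = 33:\ \ \Delta \ln(\text{AHE}) = \beta_1 + 67\,\beta_2 \]
```

Unlike the log-linear model, the effect of one more year depends on the starting age whenever β₂ is not zero.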

e. Model Comparison (c vs. b)
- Do you prefer the log-log model (c) over the log-linear model (b)? Explain why.

f. Model Comparison (d vs. b)
- Do you prefer the quadratic model (d) over the log-linear model (b)? Explain why.

g. Model Comparison (d vs. c)
- Do you prefer the quadratic model (d) over the log-log model (c)? Explain why.

h. Graphical Analysis of Regression Functions
- Plot the estimated relationship between Age and ln(AHE) for:
  - Males with only a high school diploma (no bachelor’s degree), using models (b), (c), and (d).
- Describe:
  - Similarities and differences between the three regression curves.
- Additional Consideration:
  - Would the results differ if plotted for females with a bachelor’s degree?
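
One way to prepare the three curves for plotting is to evaluate each fitted equation over a grid of ages. The Python sketch below does this for the reference group (Female = 0, Bachelor = 0). The coefficient values are PLACEHOLDERS, not estimates, and must be replaced with the numbers from your own Gretl output; the age grid 25 to 34 is an assumption based on the ages used in the exercises.

```python
import math

def log_linear(age, b0, b1):
    """Model (b): ln(AHE) = b0 + b1*Age, with Female = 0 and Bachelor = 0."""
    return b0 + b1 * age

def log_log(age, b0, b1):
    """Model (c): ln(AHE) = b0 + b1*ln(Age), with Female = 0 and Bachelor = 0."""
    return b0 + b1 * math.log(age)

def quadratic(age, b0, b1, b2):
    """Model (d): ln(AHE) = b0 + b1*Age + b2*Age^2, with Female = 0 and Bachelor = 0."""
    return b0 + b1 * age + b2 * age ** 2

ages = list(range(25, 35))   # assumed age grid

# PLACEHOLDER coefficients: replace every number with your estimates
curve_b = [log_linear(a, b0=1.0, b1=0.02) for a in ages]
curve_c = [log_log(a, b0=0.5, b1=0.6) for a in ages]
curve_d = [quadratic(a, b0=0.1, b1=0.08, b2=-0.001) for a in ages]

for a, vb, vc, vd in zip(ages, curve_b, curve_c, curve_d):
    print(a, round(vb, 3), round(vc, 3), round(vd, 3))
```

For another group, add that group's dummy coefficients from the same model to b0 before evaluating the curve.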