Session Overview

Multicollinearity

The dummy variable trap is a classic example of perfect multicollinearity: one regressor is an exact linear function of others. Perfect multicollinearity prevents OLS estimation. It often arises from including all categories of a categorical variable along with an intercept.

Perfect Multicollinearity – More Examples

  • Including both PctEL (percentage of English learners) and FracEL (fraction of English learners) because PctEL = 100 * FracEL.
  • Including a binary variable that is constant for all observations (e.g., NVSi).

What is NVSi?

NVSi stands for “not very small” classes. It is a binary variable that equals 1 if the student‑teacher ratio is 12 or more, and 0 otherwise. However, in the California data set, the smallest STR is 14, so NVSi = 1 for every district. Thus NVSi is perfectly collinear with the constant term (since the constant is also 1 for all observations). Trying to include NVSi along with an intercept causes perfect multicollinearity.

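  • Including PctEL and PctES (percentage of English speakers) because PctES = 100 - PctEL. Here the dependence runs through the constant: PctEL + PctES equals 100 times the constant regressor, so the problem appears only when an intercept is included.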
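The three examples above can be checked numerically: a design matrix is perfectly collinear exactly when its rank is smaller than its number of columns. The following is a minimal sketch using NumPy on synthetic data (the values are made up for illustration and are not the California data; the helper name `rank_and_width` is hypothetical).

```python
import numpy as np

# Synthetic data for illustration only; these are NOT the California values.
rng = np.random.default_rng(0)
n = 50
frac_el = rng.uniform(0.0, 0.8, size=n)  # fraction of English learners
pct_el = 100.0 * frac_el                 # PctEL = 100 * FracEL
pct_es = 100.0 - pct_el                  # PctES = 100 - PctEL
nvs = np.ones(n)                         # NVS = 1 for every district
const = np.ones(n)                       # intercept column


def rank_and_width(*columns):
    """Return (rank, number of columns) of the design matrix."""
    X = np.column_stack(columns)
    return int(np.linalg.matrix_rank(X)), X.shape[1]


cases = {
    "const, PctEL, FracEL": rank_and_width(const, pct_el, frac_el),
    "const, NVS": rank_and_width(const, nvs),
    "const, PctEL, PctES": rank_and_width(const, pct_el, pct_es),
    "const, PctEL": rank_and_width(const, pct_el),
}

for name, (rank, width) in cases.items():
    status = "perfectly collinear" if rank < width else "full rank"
    print(f"{name}: rank {rank} of {width} -> {status}")
```

Only the last design, which contains no redundant column, has full rank; in the other three, OLS cannot separate the coefficients.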

How to Fix Perfect Multicollinearity

  1. Drop one of the redundant variables. For dummy variables, omit one category. For linear dependencies like PctEL and FracEL, keep only one of them.
  2. Drop the constant term (rarely recommended). If you have all categories, you can suppress the intercept. But this makes interpretation less intuitive.
  3. Check your data. Sometimes multicollinearity arises because a variable has no variation (e.g., all observations have the same value). In that case, drop that variable.

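In GRETL, perfect multicollinearity is flagged in the output: depending on the case, the software either reports an error or automatically drops one of the offending variables and tells you which one. Either way, you should understand why it happened and make a conscious choice about which variable to omit.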
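To see fixes 1 and 2 at work for the dummy variable case, the sketch below builds dummies for a hypothetical three-category variable (`region` and its values are invented for illustration) and compares the rank of three designs using NumPy.

```python
import numpy as np

# Hypothetical categorical variable with three categories (illustration only).
region = np.array(["north", "south", "west", "south",
                   "north", "west", "south", "north"])
categories = sorted(set(region.tolist()))  # ['north', 'south', 'west']

# One 0/1 column per category.
dummies = np.column_stack([(region == c).astype(float) for c in categories])
const = np.ones(len(region))

designs = {
    # Dummy variable trap: the three dummies sum to the constant.
    "intercept + all 3 dummies": np.column_stack([const, dummies]),
    # Fix 1: omit one category (here the first one becomes the base group).
    "intercept + 2 dummies": np.column_stack([const, dummies[:, 1:]]),
    # Fix 2: keep all categories but suppress the intercept.
    "no intercept + all 3 dummies": dummies,
}

ranks = {name: int(np.linalg.matrix_rank(X)) for name, X in designs.items()}
for name, X in designs.items():
    print(f"{name}: {X.shape[1]} columns, rank {ranks[name]}")
```

Both fixes give a full-rank design with three columns. With fix 1 the dummy coefficients are differences relative to the omitted category; with fix 2 they are the category means themselves, which is why the interpretation changes.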

Imperfect Multicollinearity

Imperfect multicollinearity occurs when two or more regressors are highly correlated but not perfectly. This does not prevent estimation, but it inflates standard errors, making coefficients imprecisely estimated.

  • Example: Including PctEL and PctImmigrants (percentage of first‑generation immigrants). These are positively correlated because immigrant families often have children learning English.
  • The OLS estimates remain unbiased, but the variance of the coefficients becomes large. The t‑statistics may be small even when the true effects are large.

In GRETL: Look for high pairwise correlations (View → Correlation matrix). Also check variance inflation factors (VIF) after regression (Analysis → Collinearity). A VIF above 10 (or 5, depending on context) indicates problematic multicollinearity.
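For reference, the VIF can be written in terms of an auxiliary regression. These are the standard textbook expressions under homoskedastic errors, stated here as background rather than as GRETL output; \(R_j^2\) denotes the R² from regressing regressor j on all the other regressors, and \(\sigma^2\) the error variance:

\[ \text{VIF}_j = \frac{1}{1 - R_j^2}, \qquad \operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_{ji} - \bar{X}_j)^2} \cdot \text{VIF}_j \]

A VIF of 10 therefore corresponds to \(R_j^2 = 0.90\) and a VIF of 5 to \(R_j^2 = 0.80\), and the standard error of \(\hat{\beta}_j\) is inflated by the factor \(\sqrt{\text{VIF}_j}\).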

Why it matters: Imperfect multicollinearity is not a mistake – it is a data limitation. You may need a larger sample or a different research design to precisely estimate the separate effects.


Exercise: Detecting Multicollinearity in the California Test Score Data

We will examine a regression model that explains test scores using the following variables:

  • l_avginc = log of average district income
  • STRsmall = dummy for student‑teacher ratio < 20
  • STRlarge = dummy for STR > 25
  • calw_pct = percentage of students qualifying for CalWORKS (welfare)
  • meal_pct = percentage of students qualifying for reduced‑price lunch
  • el_pct = percentage of English learners

Goal: Detect imperfect multicollinearity using correlation matrix and VIF.

Step 1: Compute the Correlation Matrix

In GRETL:
View → Correlation matrix → select all six variables.

The output (partial) for the three economic disadvantage variables is:

            calw_pct   meal_pct   el_pct
calw_pct      1.0000     0.7394   0.3196
meal_pct      0.7394     1.0000   0.6531
el_pct        0.3196     0.6531   1.0000

Critical values of the correlation coefficient for n = 420 (two‑sided test):

  • 5% significance: 0.0957
  • 1% significance: 0.1256

All correlations shown are far above 0.1256, so they are statistically significant at the 1% level.
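These critical values can be reproduced from the t distribution. A short worked calculation, assuming a two‑sided test of zero correlation and using approximate t quantiles with n - 2 = 418 degrees of freedom (about 1.966 at 5% and 2.588 at 1%):

\[ r_{\text{crit}} = \frac{t_{\alpha/2,\,n-2}}{\sqrt{t_{\alpha/2,\,n-2}^2 + n - 2}} \]

\[ r_{0.05} \approx \frac{1.966}{\sqrt{1.966^2 + 418}} \approx 0.0957, \qquad r_{0.01} \approx \frac{2.588}{\sqrt{2.588^2 + 418}} \approx 0.1256 \]

With a sample this large, even small correlations are statistically significant, so significance alone says little about whether multicollinearity is a practical problem.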

Interpretation of Correlations

  • calw_pct and meal_pct (0.7394): Strong positive correlation. Both measure economic disadvantage. Districts with many CalWORKS families also tend to have many students qualifying for free/reduced‑price lunch. This is expected because both programs target low‑income households.

  • meal_pct and el_pct (0.6531): Moderately strong positive correlation. Schools with many English learners often serve low‑income communities, so these variables move together.

  • calw_pct and el_pct (0.3196): Weak to moderate correlation. While both are poverty‑related, the link is weaker because English learners are not always from welfare‑dependent families.

Implication: Including calw_pct, meal_pct, and el_pct together may cause imperfect multicollinearity, inflating standard errors. However, the correlations are below 0.8, so multicollinearity may not be severe. We need to check VIFs.

Step 2: Compute Variance Inflation Factors (VIF)

Run the full regression in GRETL:
Model → OLS → Dependent: TestScore → Regressors: l_avginc, STRsmall, STRlarge, calw_pct, meal_pct, el_pct.
After estimation, go to Analysis → Collinearity.

Typical VIF results (illustrative):

Variable    VIF
l_avginc    2.57
STRsmall    1.06
STRlarge    1.02
calw_pct    2.48
meal_pct    6.20
el_pct      2.12

Interpretation:

  • meal_pct has VIF = 6.20, which is below the conventional threshold of 10. This suggests moderate but not severe multicollinearity.
  • All other VIFs are below 3, indicating little collinearity.

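Conclusion: The correlation between meal_pct and the other poverty measures (especially calw_pct) inflates the variance of meal_pct’s coefficient by a factor of 6.20, so its standard error is about 2.5 times (√6.20 ≈ 2.49) what it would be if meal_pct were uncorrelated with the other regressors. By the conventional threshold this is not problematic and the model is still reliable, but the meal_pct coefficient is estimated noticeably less precisely than the others.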
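The VIF computation itself is simple enough to sketch by hand: regress each regressor on the others and apply 1 / (1 - R²). The example below uses NumPy and synthetic data in which two variables are noisy measures of one hypothetical latent factor; the function `vif` and all variable names are illustrative assumptions, and the numbers are not the California results.

```python
import numpy as np


def vif(X):
    """VIF of each column of X (n x k, without a constant column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Auxiliary regression: column j on a constant and all other columns.
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Synthetic data: x1 and x2 share a common factor, x3 is unrelated.
rng = np.random.default_rng(1)
n = 420
factor = rng.normal(size=n)
x1 = factor + 0.3 * rng.normal(size=n)
x2 = factor + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.2f}, SE inflation = {np.sqrt(value):.2f}")
```

The two correlated measures should show clearly elevated VIFs, while the unrelated regressor stays close to 1.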

Step 3: Decision

  • Do not drop variables solely because of their VIFs, especially when all VIFs are below 10, as they are here. Each variable may capture a different aspect of economic disadvantage.
  • If you are concerned about efficiency, you could combine them into a single index (e.g., principal component) or keep only meal_pct as a general poverty proxy.
  • In practice, many researchers retain all three because each has a distinct policy relevance (e.g., English learners may need different interventions than low‑income native speakers).

Takeaway: Imperfect multicollinearity is a data limitation, not an error. It does not bias coefficients, only reduces precision. With a sample of 420 observations, the loss of precision is acceptable.

Heteroskedasticity

What Is Heteroskedasticity? Intuitive Explanation

In a regression model, we assume that the error term has constant variance across all observations. This property is called homoskedasticity (from Greek: homos = same, skedasis = dispersion). When the variance of the errors changes depending on the value of the independent variables, we have heteroskedasticity (heteros = different).

Simple analogy: Imagine two archers. One shoots arrows at a target from 10 meters away; the other shoots from 100 meters away. The second archer’s arrows will have much more spread (variance) because distance magnifies errors. Similarly, in regression, the spread of residuals can increase (or decrease) as the predicted value changes.

Real‑World Examples of Heteroskedasticity

  1. Income and consumption: Low‑income households spend almost all their income on necessities, so consumption is very predictable (low variance). High‑income households have more discretionary spending, so consumption varies widely (high variance). → Variance of errors increases with income.

  2. Age and medical expenses: Young adults have low and relatively constant medical costs. Elderly individuals have highly variable costs – some are healthy, others have expensive surgeries or chronic conditions. → Variance increases with age.

  3. Firm size and profit margins: Small businesses often have similar profit margins within an industry. Large corporations have huge variation – some have razor‑thin margins (e.g., retail), others have very high margins (e.g., software). → Variance increases with firm size.

  4. Test scores and district income (our dataset): Poor school districts may have consistently low test scores (small variance). Affluent districts may have a mix of very high and moderate scores (larger variance) because they serve diverse student populations, or because of unmeasured factors like parental involvement. → Variance might increase with income.

Why Heteroskedasticity Matters

  • OLS coefficients remain unbiased (still correct on average).
  • However, the usual (homoskedasticity‑only) standard errors become biased → t‑statistics and confidence intervals based on them are invalid.
  • This can lead to incorrect conclusions: we might think a variable is statistically significant when it is not, or vice versa.

Detecting Heteroskedasticity in GRETL (Using Test Score Data)

A. Visual Inspection – Residuals vs. Fitted Plot

Run a regression of TestScore on l_avginc, calw_pct, meal_pct, el_pct, STRsmall, STRlarge.

In GRETL:

  • Model → OLS → select dependent and regressors → OK.
  • In the model window: Graph → Fitted, actual plot → Residual plot.

What to look for: A fan‑shaped pattern (residuals spread out as fitted values increase) or any systematic change in spread.

Example interpretation: If the residuals become more dispersed for districts with high predicted test scores, that suggests heteroskedasticity.

B. Formal Tests

White’s Test (detects general heteroskedasticity, including nonlinear forms)

  • After regression: Tests → Heteroskedasticity → White’s test.
  • Null hypothesis \(H_0\): homoskedasticity (constant variance).
  • If p‑value < 0.05, reject \(H_0\) → evidence of heteroskedasticity.

Breusch‑Pagan Test (assumes linear relationship between squared residuals and regressors)

  • Tests → Heteroskedasticity → Breusch‑Pagan.
  • Same decision rule: p‑value < 0.05 indicates heteroskedasticity.

Which test to trust? White’s test is more general and often preferred. If both tests agree, the evidence is strong.
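The logic behind these tests can be sketched in a few lines: regress the squared OLS residuals on the regressors and check whether that auxiliary regression explains anything. The example below implements the n·R² (Koenker) variant of the Breusch‑Pagan test with a single regressor, so the statistic is compared with a chi‑square distribution with 1 degree of freedom (5% critical value 3.841). It uses NumPy and synthetic data; the function names are hypothetical, and GRETL's own Breusch‑Pagan output may be based on a different variant of the statistic, so the numbers need not match.

```python
import math
import numpy as np


def ols_residuals(X, y):
    """Residuals from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


def bp_test_one_regressor(x, y):
    """Return (LM, p-value) for the n*R^2 Breusch-Pagan test, 1 regressor."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    e2 = ols_residuals(X, y) ** 2
    # Auxiliary regression: squared residuals on a constant and x.
    u = ols_residuals(X, e2)
    r2 = 1.0 - u @ u / np.sum((e2 - e2.mean()) ** 2)
    lm = max(float(n * r2), 0.0)
    # Chi-square(1) upper tail probability: P(Z^2 > lm) = erfc(sqrt(lm / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))


# Synthetic data: same mean function, different error structure.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y_homo = 2.0 + 0.5 * x + rng.normal(size=n)        # constant error variance
y_hetero = 2.0 + 0.5 * x + x * rng.normal(size=n)  # error sd proportional to x

lm_homo, p_homo = bp_test_one_regressor(x, y_homo)
lm_hetero, p_hetero = bp_test_one_regressor(x, y_hetero)
print(f"homoskedastic data:   LM = {lm_homo:.2f}, p = {p_homo:.3f}")
print(f"heteroskedastic data: LM = {lm_hetero:.2f}, p = {p_hetero:.3g}")
```

White's test follows the same recipe, except that the auxiliary regression also includes the squares and cross products of the regressors, which is why it can pick up nonlinear forms of heteroskedasticity.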

Fixing Heteroskedasticity

Robust Standard Errors (Huber‑White)

The simplest fix: use standard errors that are valid even under heteroskedasticity.

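In GRETL: Re‑run the regression and, in the OLS dialog box, check “Robust standard errors”. For cross‑sectional data such as these, the relevant variant is heteroskedasticity‑consistent (HC, Huber‑White); HAC (heteroskedasticity‑ and autocorrelation‑consistent) standard errors are the time‑series counterpart. Compare the new standard errors with the original ones. Robust SEs are often, though not always, larger, in which case inference becomes more conservative.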

When to use: Always a good practice, especially in cross‑sectional data like test scores or earnings. Many economists report robust SEs by default.
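For readers who want to see what the robust option computes, here is a minimal sketch of the Huber‑White “sandwich” estimator with the common HC1 degrees‑of‑freedom correction, written with NumPy on synthetic data. Which HC variant GRETL applies by default is not stated in these notes, so treat HC1 as an assumption; the function name `ols_with_se` is hypothetical.

```python
import numpy as np


def ols_with_se(X, y):
    """Return OLS coefficients, classical SEs, and HC1 robust SEs."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical formula: one common error variance for all observations.
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Sandwich formula: each observation contributes its own squared residual.
    meat = (X * resid[:, None] ** 2).T @ X          # X' diag(e^2) X
    cov_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
    se_hc1 = np.sqrt(np.diag(cov_hc1))
    return beta, se_classical, se_hc1


# Synthetic data whose error spread grows strongly with x.
rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + 0.1 * x ** 2 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, se_classical, se_hc1 = ols_with_se(X, y)
for name, b, s0, s1 in zip(["const", "x"], beta, se_classical, se_hc1):
    print(f"{name}: coef = {b:.3f}, classical SE = {s0:.4f}, HC1 SE = {s1:.4f}")
```

The coefficients are identical under both approaches; only the standard errors differ. In this design the error variance is largest where x is far from its mean, so the robust SE of the slope should come out larger than the classical one.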

Exercise: Heteroskedasticity in the California Test Score Data

  1. Estimate the model: \[ \text{TestScore} = \beta_0 + \beta_1 l\_avginc + \beta_2 STRsmall + \beta_3 STRlarge + \beta_4 calw\_pct + \beta_5 meal\_pct + \beta_6 el\_pct + \varepsilon \]

  2. Produce a residual‑vs‑fitted plot. Describe what you see.

The spread of residuals increases as test scores increase (fan shape). This suggests:

  • Heteroskedasticity is likely present
  • Variance of errors is not constant

Implication:

  • OLS coefficients are still unbiased
  • BUT standard errors are unreliable
  • You should use robust standard errors (White) in GRETL

The residual plot shows that residuals are generally centered around zero, suggesting no major bias in the model. However, there is a slight upward pattern, indicating that the model tends to overpredict low test scores and underpredict high test scores, which suggests possible functional form misspecification. (If the residuals are plotted against actual rather than fitted test scores, part of this slope is mechanical, because OLS residuals are positively correlated with the dependent variable whenever the fit is not perfect.) Additionally, the variance of residuals appears to increase with test scores, providing visual evidence of heteroskedasticity. On the basis of the plot alone, robust standard errors would be the cautious choice for inference.

  3. Run White’s test. Report the p‑value. Is there evidence of heteroskedasticity at the 5% level?

The White test yields a test statistic of 17.93 with a p-value of 0.328. Since the p-value is greater than 0.05, we fail to reject the null hypothesis of homoskedasticity. The test therefore finds no statistically significant evidence of heteroskedasticity, even when allowing for nonlinearities and interaction effects, and by this criterion the OLS standard errors can be considered reliable.

  4. Run the Breusch‑Pagan test. Compare the results.

The Breusch–Pagan test yields an LM statistic of 4.146 with a p-value of 0.5285. Since the p-value is greater than 0.05, we again fail to reject the null hypothesis of homoskedasticity. Both tests agree: there is no statistically significant evidence of heteroskedasticity, and the OLS standard errors can be considered reliable.

Note: Graphical evidence ≠ statistical evidence. “Plots suggest, tests decide.” Here the plot hints at a fan shape, but neither formal test rejects homoskedasticity at the 5% level.

  5. Re‑estimate the model with robust standard errors.

    • Which coefficients changed their t‑statistics the most?
    • Did any coefficient become insignificant after using robust SEs?
  6. Discussion: Based on your findings, would you recommend using robust SEs for this model? Why or why not?

Final takeaway: Heteroskedasticity does not bias coefficients but invalidates standard errors. Always check for it in cross‑sectional data. Robust standard errors are an easy and reliable fix.


Exercise: Heteroskedasticity and Multicollinearity in Earnings Data

Task: Diagnose and fix issues in:
\[ \ln(\text{AHE}) = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Age}^2 + \beta_3 \text{Female} + \beta_4 \text{Bachelor} + \varepsilon \]

Steps:
1. Check heteroskedasticity (White’s test).
2. Check multicollinearity (VIF).
3. Apply fixes (robust SEs).
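One thing to expect in step 2: Age and Age² are almost perfectly correlated by construction, so their VIFs will be very large even though nothing is wrong with the model. Centering Age before squaring removes most of this correlation. The sketch below illustrates the effect with NumPy on hypothetical ages between 25 and 34 (not the actual earnings data), ignoring Female and Bachelor so that the VIF reduces to 1 / (1 - r²).

```python
import numpy as np

# Hypothetical ages, drawn uniformly from 25..34 (illustration only).
rng = np.random.default_rng(4)
age = rng.integers(25, 35, size=1000).astype(float)


def corr_and_vif(x):
    """Correlation between x and x^2, and the implied two-regressor VIF."""
    r = float(np.corrcoef(x, x ** 2)[0, 1])
    return r, 1.0 / (1.0 - r ** 2)


raw_r, raw_vif = corr_and_vif(age)
centered_r, centered_vif = corr_and_vif(age - age.mean())

print(f"raw Age:      corr(Age, Age^2) = {raw_r:.4f}, VIF = {raw_vif:.1f}")
print(f"centered Age: corr(Age, Age^2) = {centered_r:.4f}, VIF = {centered_vif:.2f}")
```

Centering leaves the fitted values unchanged, because the centered and uncentered regressors span the same space; what changes is the meaning of the coefficient on Age, which becomes the slope at the mean age rather than at age zero.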

Key Takeaways

Issue                Test                       Solution             GRETL Path
Heteroskedasticity   White’s test (p < 0.05)    Robust SEs           Tests → Heteroskedasticity
Multicollinearity    VIF > 10                   Centering/dropping   Analysis → Collinearity