Duration: 2 sessions of 45 minutes each
Tools: GRETL (GUI)
Datasets:
- Test scores: TestScore, l_avginc (log income), STRsmall, STRlarge, calw_pct, meal_pct, el_pct
- Earnings: AHE (average hourly earnings), Age, Age², Female, Bachelor

Objective: Use GRETL’s interface to detect and correct violations of OLS assumptions: multicollinearity and heteroskedasticity. Students will learn to interpret correlation matrices, variance inflation factors (VIF), residual plots, and formal tests (White, Breusch‑Pagan), and apply fixes such as robust standard errors.
Session 1 (45 min) – Multicollinearity
calw_pct, meal_pct, el_pct.
Interpret results.
Session 2 (45 min) – Heteroskedasticity
Key Takeaways Table
Summary of tests and solutions for quick reference.
The dummy variable trap is a classic example of perfect multicollinearity: one regressor is an exact linear function of others. Perfect multicollinearity prevents OLS estimation. It often arises from including all categories of a categorical variable along with an intercept.
Examples:
- PctEL (percentage of English learners) and FracEL (fraction of English learners), because PctEL = 100 * FracEL.
- The “not very small” classes dummy (NVSi).

What is NVSi?
NVSi stands for “not very small” classes. It is a binary
variable that equals 1 if the student‑teacher ratio is 12 or more, and 0
otherwise. However, in the California data set, the smallest STR is 14,
so NVSi = 1 for every district. Thus NVSi is
perfectly collinear with the constant term (since the constant is also 1
for all observations). Trying to include NVSi along with an
intercept causes perfect multicollinearity.
- PctEL and PctES (percentage of English speakers), because PctES = 100 - PctEL. This also involves the constant: PctEL + PctES equals 100 times the constant regressor.

For PctEL and FracEL, keep only one of them.
In GRETL, perfect multicollinearity triggers a message about exact collinearity. The software may automatically drop one variable, but you should understand why and make a conscious choice.
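To see why estimation fails, here is a minimal sketch in Python. The FracEL values and the `other` regressor are made up for illustration and are not taken from the California data. With a constant, FracEL, and PctEL = 100 * FracEL, the matrix X'X has a determinant of exactly zero, so the OLS normal equations have no unique solution. Exact fractions are used so that rounding does not blur the result.

```python
from fractions import Fraction

# Hypothetical FracEL values for five districts (illustrative only).
frac_el = [Fraction(1, 20), Fraction(1, 10), Fraction(1, 4),
           Fraction(2, 5), Fraction(3, 5)]

def xtx(columns):
    """Return X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

const = [Fraction(1)] * len(frac_el)
pct_el = [100 * f for f in frac_el]             # exact linear function of frac_el
other = [Fraction(v) for v in (3, 1, 4, 1, 5)]  # unrelated hypothetical regressor

det_collinear = det3(xtx([const, frac_el, pct_el]))
det_ok = det3(xtx([const, frac_el, other]))

print("det(X'X) with FracEL and PctEL:", det_collinear)  # exactly 0
print("det(X'X) with FracEL and other:", det_ok)         # nonzero
```

Dropping either FracEL or PctEL removes the exact linear dependence and makes X'X invertible again.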
Imperfect multicollinearity occurs when two or more regressors are highly correlated but not perfectly. This does not prevent estimation, but it inflates standard errors, making coefficients imprecisely estimated.
Example: PctEL and PctImmigrants (percentage of first‑generation immigrants). These are positively correlated because immigrant families often have children learning English.
In GRETL: Look for high pairwise correlations (View → Correlation matrix). Also check variance inflation factors (VIF) after regression (Analysis → Collinearity). A VIF above 10 (or 5, depending on context) indicates problematic multicollinearity.
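For reference, the VIF of regressor j comes from an auxiliary regression of that regressor on all the other regressors. The symbol \(R_j^2\) for the R² of that auxiliary regression is notation introduced here, not taken from the notes:

\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]

A VIF of 5 therefore corresponds to \(R_j^2 = 0.80\) and a VIF of 10 to \(R_j^2 = 0.90\). The standard error of coefficient j is inflated by a factor of \(\sqrt{\text{VIF}_j}\) relative to the case \(R_j^2 = 0\).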
Why it matters: Imperfect multicollinearity is not a mistake – it is a data limitation. You may need a larger sample or a different research design to precisely estimate the separate effects.
We will examine a regression model that explains test scores using the following variables:
- l_avginc = log of average district income
- STRsmall = dummy for student‑teacher ratio < 20
- STRlarge = dummy for STR > 25
- calw_pct = percentage of students qualifying for CalWORKS (welfare)
- meal_pct = percentage of students qualifying for reduced‑price lunch
- el_pct = percentage of English learners

Goal: Detect imperfect multicollinearity using correlation matrix and VIF.
In GRETL:
View → Correlation matrix → select all six variables.
The output (partial) for the three economic disadvantage variables is:
| | calw_pct | meal_pct | el_pct |
|---|---|---|---|
| calw_pct | 1.0000 | 0.7394 | 0.3196 |
| meal_pct | 0.7394 | 1.0000 | 0.6531 |
| el_pct | 0.3196 | 0.6531 | 1.0000 |
Critical values for n = 420:
- 5% significance: 0.0957
- 1% significance: 0.1256
All correlations shown are far above 0.1256, so they are statistically significant at the 1% level.
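These cutoffs can be reproduced from the t test of zero correlation. The arithmetic below is a sketch that assumes a two-sided test with n - 2 = 418 degrees of freedom and t critical values of roughly 1.9656 (5%) and 2.5876 (1%); those two values are assumptions taken from standard t tables, not from the notes:

\[ r_{\text{crit}} = \frac{t_{\text{crit}}}{\sqrt{t_{\text{crit}}^2 + n - 2}}, \qquad \frac{1.9656}{\sqrt{1.9656^2 + 418}} \approx 0.0957, \qquad \frac{2.5876}{\sqrt{2.5876^2 + 418}} \approx 0.1256 \]

Any sample correlation larger than \(r_{\text{crit}}\) in absolute value is significantly different from zero at the corresponding level.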
calw_pct and meal_pct (0.7394): Strong positive correlation. Both measure economic disadvantage. Districts with many CalWORKS families also tend to have many students qualifying for free/reduced‑price lunch. This is expected because both programs target low‑income households.
meal_pct and el_pct (0.6531): Moderately strong positive correlation. Schools with many English learners often serve low‑income communities, so these variables move together.
calw_pct and el_pct (0.3196): Weak to moderate correlation. While both are poverty‑related, the link is weaker because English learners are not always from welfare‑dependent families.
Implication: Including calw_pct,
meal_pct, and el_pct together may cause
imperfect multicollinearity, inflating standard errors. However, the
correlations are below 0.8, so multicollinearity may not be severe. We
need to check VIFs.
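As a rough preliminary check, the three correlations in the table already determine the VIFs that would arise if calw_pct, meal_pct, and el_pct were the only regressors. The sketch below uses the fact that each VIF is the matching diagonal entry of the inverse correlation matrix, which equals 1 / (1 - R²) from regressing that variable on the others. The small `invert` helper is written here for illustration. These are not the VIFs of the full six-regressor model.

```python
# Pairwise correlations reported in the table above.
names = ["calw_pct", "meal_pct", "el_pct"]
corr = [
    [1.0000, 0.7394, 0.3196],
    [0.7394, 1.0000, 0.6531],
    [0.3196, 0.6531, 1.0000],
]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect multicollinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# VIF of each variable = matching diagonal entry of the inverse correlation matrix.
inverse = invert(corr)
vifs = {name: inverse[i][i] for i, name in enumerate(names)}

for name, vif in vifs.items():
    print(f"{name}: VIF = {vif:.2f}")
```

This gives roughly 2.46 for calw_pct, 3.85 for meal_pct, and 1.94 for el_pct. Adding more regressors can only raise each auxiliary R², so VIFs from the full model will be at least this large.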
Run the full regression in GRETL:
Model → OLS → Dependent: TestScore →
Regressors: l_avginc, STRsmall,
STRlarge, calw_pct, meal_pct,
el_pct.
After estimation, go to Analysis → Collinearity.
Typical VIF results (illustrative):
| Variable | VIF |
|---|---|
| l_avginc | 2.57 |
| STRsmall | 1.06 |
| STRlarge | 1.02 |
| calw_pct | 2.48 |
| meal_pct | 6.20 |
| el_pct | 2.12 |
Interpretation: meal_pct has VIF = 6.20, which is above the stricter cutoff of 5 but below the conventional threshold of 10. This suggests moderate but not severe multicollinearity.
Conclusion: The correlation between meal_pct and other poverty measures (especially calw_pct) inflates the variance of meal_pct’s coefficient, but not to a problematic level. The model is still reliable, but the standard error on meal_pct is larger than if only one poverty measure were used: a VIF of 6.20 means it is about √6.20 ≈ 2.5 times what it would be if meal_pct were uncorrelated with the other regressors.
meal_pct as a general poverty proxy.
Takeaway: Imperfect multicollinearity is a data limitation, not an error. It does not bias coefficients, only reduces precision. With a sample of 420 observations, the loss of precision is acceptable.
In a regression model, we assume that the error term has constant variance across all observations. This property is called homoskedasticity (from Greek: homos = same, skedasis = dispersion). When the variance of the errors changes depending on the value of the independent variables, we have heteroskedasticity (heteros = different).
Simple analogy: Imagine two archers. One shoots arrows at a target from 10 meters away; the other shoots from 100 meters away. The second archer’s arrows will have much more spread (variance) because distance magnifies errors. Similarly, in regression, the spread of residuals can increase (or decrease) as the predicted value changes.
Income and consumption: Low‑income households spend almost all their income on necessities, so consumption is very predictable (low variance). High‑income households have more discretionary spending, so consumption varies widely (high variance). → Variance of errors increases with income.
Age and medical expenses: Young adults have low and relatively constant medical costs. Elderly individuals have highly variable costs – some are healthy, others have expensive surgeries or chronic conditions. → Variance increases with age.
Firm size and profit margins: Small businesses often have similar profit margins within an industry. Large corporations have huge variation – some have razor‑thin margins (e.g., retail), others have very high margins (e.g., software). → Variance increases with firm size.
Test scores and district income (our dataset): Poor school districts may have consistently low test scores (small variance). Affluent districts may have a mix of very high and moderate scores (larger variance) because they serve diverse student populations, or because of unmeasured factors like parental involvement. → Variance might increase with income.
Run a regression of TestScore on l_avginc,
calw_pct, meal_pct, el_pct,
STRsmall, STRlarge.
In GRETL:
- Model → OLS → select dependent and regressors → OK.
- In the model window: Graph → Fitted, actual plot → Residual plot.
What to look for: A fan‑shaped pattern (residuals spread out as fitted values increase) or any systematic change in spread.
Example interpretation: If the residuals become more dispersed for districts with high predicted test scores, that suggests heteroskedasticity.
White’s Test (detects general heteroskedasticity, including nonlinear forms)
Breusch‑Pagan Test (assumes linear relationship between squared residuals and regressors)
Which test to trust? White’s test is more general and often preferred. If both tests agree, the evidence is strong.
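The logic behind the Breusch‑Pagan idea can be sketched in a few lines of Python. This is a simplified illustration, not GRETL's implementation: it uses a single regressor, the studentized (Koenker) form LM = n × R² from regressing squared residuals on x, and synthetic data built with alternating-sign errors so the result is reproducible. GRETL's own output may scale the statistic differently.

```python
import math

def ols_simple(x, y):
    """Fit y = a + b*x by OLS; return (a, b, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid

def r_squared(x, y):
    """R-squared from a simple regression of y on x (with intercept)."""
    _, _, resid = ols_simple(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum(e ** 2 for e in resid)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def koenker_bp(x, y):
    """Studentized Breusch-Pagan test with one regressor: LM = n * R^2."""
    _, _, resid = ols_simple(x, y)
    squared = [e ** 2 for e in resid]
    lm = max(len(x) * r_squared(x, squared), 0.0)  # guard against rounding below 0
    # Chi-square(1) upper tail: P(X > lm) = erfc(sqrt(lm / 2)).
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Synthetic data (illustrative only): errors alternate in sign.
x = [float(i) for i in range(1, 41)]
signs = [1.0 if i % 2 == 0 else -1.0 for i in range(1, 41)]
y_hetero = [2.0 + 0.5 * xi + 0.1 * xi * s for xi, s in zip(x, signs)]  # spread grows with x
y_homo = [2.0 + 0.5 * xi + 1.0 * s for xi, s in zip(x, signs)]         # constant spread

lm_het, p_het = koenker_bp(x, y_hetero)
lm_hom, p_hom = koenker_bp(x, y_homo)
print(f"growing spread:  LM = {lm_het:.2f}, p = {p_het:.4f}")
print(f"constant spread: LM = {lm_hom:.2f}, p = {p_hom:.4f}")
```

With the growing-spread data the LM statistic lies far above the 5% chi-square(1) critical value of 3.84, so homoskedasticity is rejected. With the constant-spread data it is not.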
The simplest fix: use standard errors that are valid even under heteroskedasticity.
In GRETL: Re‑run the regression, and in the OLS dialog box, check “Robust standard errors” (or “HAC”, which stands for heteroskedasticity‑ and autocorrelation‑consistent). Compare the new standard errors with the original ones. Robust SEs are often larger, leading to more conservative inference.
When to use: Always a good practice, especially in cross‑sectional data like test scores or earnings. Many economists report robust SEs by default.
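To make the difference concrete, the following sketch computes both kinds of standard error for the slope of a simple regression. It is an illustration under stated assumptions rather than GRETL's code: one regressor, the HC1 variant of White's estimator (HC0 scaled by n / (n - 2)), and synthetic data with alternating-sign errors.

```python
import math

def slope_standard_errors(x, y):
    """Return (classical_se, hc1_se) for the slope in y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dev = [xi - x_bar for xi in x]
    sxx = sum(d ** 2 for d in dev)
    b = sum(d * (yi - y_bar) for d, yi in zip(dev, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Classical formula: assumes one common error variance for all observations.
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    classical = math.sqrt(s2 / sxx)
    # White (HC0) variance: each squared residual is weighted by (x - x_bar)^2.
    hc0_var = sum((d ** 2) * (e ** 2) for d, e in zip(dev, resid)) / sxx ** 2
    # HC1 applies a degrees-of-freedom correction n / (n - k), with k = 2 here.
    hc1 = math.sqrt(hc0_var * n / (n - 2))
    return classical, hc1

# Synthetic data (illustrative only): errors alternate in sign.
x = [float(i) for i in range(1, 41)]
signs = [1.0 if i % 2 == 0 else -1.0 for i in range(1, 41)]
y_hetero = [2.0 + 0.5 * xi + 0.1 * xi * s for xi, s in zip(x, signs)]  # spread grows with x
y_homo = [2.0 + 0.5 * xi + 1.0 * s for xi, s in zip(x, signs)]         # constant spread

se_het = slope_standard_errors(x, y_hetero)
se_hom = slope_standard_errors(x, y_homo)
print("growing spread  (classical, HC1):", se_het)
print("constant spread (classical, HC1):", se_hom)
```

In the growing-spread case the robust standard error exceeds the classical one. In the constant-spread case the two are nearly identical, which is why reporting robust SEs by default costs little.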
Estimate the model: \[ \text{TestScore} = \beta_0 + \beta_1 l\_avginc + \beta_2 STRsmall + \beta_3 STRlarge + \beta_4 calw\_pct + \beta_5 meal\_pct + \beta_6 el\_pct + \varepsilon \]
Produce a residual‑vs‑fitted plot. Describe what you see.
The spread of residuals increases as test scores increase (fan shape). This suggests heteroskedasticity.
Implication:
The residual plot shows that residuals are generally centered around zero, suggesting no major bias in the model. However, there is a slight upward pattern, indicating that the model tends to overpredict low test scores and underpredict high test scores, which suggests possible functional form misspecification. Additionally, the variance of residuals appears to increase with test scores, providing evidence of heteroskedasticity. Therefore, robust standard errors should be used for inference.
The White test yields a test statistic of 17.93 with a p-value of 0.328. Since the p-value is greater than 0.05, we fail to reject the null hypothesis of homoskedasticity. Therefore, there is no evidence of heteroskedasticity in the model, even when allowing for nonlinearities and interaction effects. The OLS standard errors can be considered reliable.
The Breusch–Pagan test yields an LM statistic of 4.146 with a p-value of 0.5285. Since the p-value is greater than 0.05, we fail to reject the null hypothesis of homoskedasticity. Therefore, there is no evidence of heteroskedasticity in the model, and the OLS standard errors can be considered reliable.
Graphical evidence ≠ statistical evidence: “Plots suggest, tests decide.”
Re‑estimate the model with robust standard errors.
Discussion: Based on your findings, would you recommend using robust SEs for this model? Why or why not?
Final takeaway: Heteroskedasticity does not bias coefficients but invalidates the conventional (non-robust) standard errors. Always check for it in cross‑sectional data. Robust standard errors are an easy and reliable fix.
Task: Diagnose and fix issues in:
\[ \ln(\text{AHE}) = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Age}^2 + \beta_3 \text{Female} + \beta_4 \text{Bachelor} + \varepsilon \]
Steps:
1. Check heteroskedasticity (White’s test).
2. Check multicollinearity (VIF).
3. Apply fixes (robust SEs).
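When checking VIFs in this model, expect Age and Age² to be highly collinear by construction. The sketch below uses a hypothetical age range of 25 to 34 (an assumption; the actual sample range is not given here). It shows the raw correlation, the implied two-regressor VIF, and what happens when Age is centered at its mean before squaring.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

ages = [float(a) for a in range(25, 35)]   # hypothetical ages 25..34
age_sq = [a ** 2 for a in ages]

mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]    # Age minus its mean
centered_sq = [c ** 2 for c in centered]

r_raw = correlation(ages, age_sq)
r_centered = correlation(centered, centered_sq)
vif_raw = 1.0 / (1.0 - r_raw ** 2)         # VIF when these are the only two regressors

print(f"corr(Age, Age^2)           = {r_raw:.4f}")
print(f"VIF implied by that corr   = {vif_raw:.0f}")
print(f"corr(centered, centered^2) = {r_centered:.4f}")
```

The centered correlation is zero here only because the hypothetical ages are evenly spread around their mean; in real data centering usually lowers the correlation without eliminating it. Centering leaves the fitted values unchanged and only changes how the Age coefficient is interpreted, so a high VIF on Age and Age² is structural rather than a sign of bad data.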
| Issue | Test | Solution | GRETL Path |
|---|---|---|---|
| Heteroskedasticity | White’s test (p < 0.05) | Robust SEs | Tests → Heteroskedasticity |
| Multicollinearity | VIF > 10 | Centering/dropping | Analysis → Collinearity |