Teaching Note for Applied Econometrics and Economic Modelling Lab Session - Day 8
Investigating Age and Earnings in GRETL: Functional Forms, Diagnostics, and Model Selection
Session Overview
- Duration: 1 session of 90 minutes (or two 45‑minute sessions)
- Dataset: Earnings.xlsx, with variables ahe (average hourly earnings), age, female (1 = female, 0 = male), and yrseduc (years of education)
- Objective: Students will estimate multiple regression models (linear, log‑linear, log‑log, quadratic), interpret coefficients correctly, compare model performance, conduct diagnostic tests (heteroskedasticity, multicollinearity), and build a final improved model with robust standard errors and centering.
- Focus: Hands‑on GRETL implementation, interpretation of coefficients in different functional forms, and practical diagnostic testing.
🎯 Learning Outcomes
By the end of this lab, you will be able to:
- Estimate linear, log‑linear, log‑log, and quadratic models in GRETL.
- Interpret coefficients correctly depending on the functional form.
- Compare models using \(R^2\) and adjusted \(R^2\).
- Detect heteroskedasticity using Breusch‑Pagan and White tests.
- Apply robust standard errors to correct for heteroskedasticity.
- Detect multicollinearity using variance inflation factors (VIF) and fix it by centering.
- Build a final improved model based on theory and diagnostics.
STEP 1 — Load the Data
In GRETL:
- File → Open data → Import → Excel
- Select your dataset:
Earnings.xlsx
Check variables
You should see:
| Variable | Description |
|---|---|
| ahe | Average hourly earnings (in dollars) |
| age | Age in years |
| female | 1 if female, 0 if male |
| yrseduc | Years of education |
STEP 2 — Create Variables
Go to Add → Define new variable and create the following:
lnahe = ln(ahe)
lnage = ln(age)
agesq = age^2
Explanation
- lnahe → allows percentage interpretation of coefficients (log‑linear and log‑log models).
- lnage → enables elasticity interpretation (log‑log model).
- agesq → captures nonlinear (quadratic) effects, allowing earnings to increase at a decreasing rate and eventually peak.
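To check the three transformations outside GRETL, a minimal Python sketch follows. The three observations are made up for illustration and are not taken from Earnings.xlsx.

```python
import math

# Hypothetical observations (NOT from Earnings.xlsx): hourly earnings and age.
rows = [
    {"ahe": 12.50, "age": 25},
    {"ahe": 18.00, "age": 30},
    {"ahe": 21.75, "age": 34},
]

for row in rows:
    row["lnahe"] = math.log(row["ahe"])  # natural log of earnings
    row["lnage"] = math.log(row["age"])  # natural log of age
    row["agesq"] = row["age"] ** 2       # age squared

for row in rows:
    print(row)
```

The values printed here should match what GRETL shows for the same inputs, which is a quick way to confirm that the new series were defined as intended.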
STEP 3 — Model A (Linear Model)
Model → Ordinary Least Squares (OLS)
- Dependent variable: ahe
- Independent variables: age, female, yrseduc
Interpretation
The coefficient on age = change in earnings (in dollars) for a one‑year increase in age, holding gender and education constant.
Effect is constant across all ages.
From age 25 to 26, earnings increase by the same dollar amount as from age 33 to 34.
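The constant dollar effect can be illustrated with a short Python sketch. The coefficients below are hypothetical placeholders chosen for easy arithmetic, not estimates from the dataset.

```python
# Hypothetical coefficients for Model A (illustrative only, not estimates).
b_const, b_age, b_female, b_educ = -5.0, 0.50, -3.0, 1.5

def predict_ahe(age, female, yrseduc):
    """Predicted hourly earnings (dollars) from the linear model."""
    return b_const + b_age * age + b_female * female + b_educ * yrseduc

# Same person (male, 12 years of education) at different ages.
gain_25_to_26 = predict_ahe(26, 0, 12) - predict_ahe(25, 0, 12)
gain_33_to_34 = predict_ahe(34, 0, 12) - predict_ahe(33, 0, 12)

# Both gains equal the age coefficient, whatever the starting age.
print(gain_25_to_26, gain_33_to_34)
```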
STEP 4 — Model B (Log‑Linear Model)
Model → OLS
- Dependent variable: lnahe
- Independent variables: age, female, yrseduc
Interpretation
The coefficient on age, multiplied by 100, approximates the percentage change in earnings for a one‑year increase in age, holding gender and education constant.
Use the exact formula:
\(\%\Delta \text{earnings} = 100 \times \left(e^{\beta_{\text{age}}} - 1\right)\)
For small coefficients (e.g., \(\beta_{\text{age}} \approx 0.01\)), the approximation \(\%\Delta \text{earnings} \approx 100 \times \beta_{\text{age}}\) is very close.
- Effect is a constant percentage change for each additional year.
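A small Python sketch shows how the approximation and the exact formula diverge as the coefficient grows. The coefficient values are hypothetical and serve only to illustrate the formula.

```python
import math

def pct_change_exact(beta):
    """Exact percentage change in earnings: 100 * (exp(beta) - 1)."""
    return 100.0 * (math.exp(beta) - 1.0)

def pct_change_approx(beta):
    """Approximate percentage change in earnings: 100 * beta."""
    return 100.0 * beta

# Hypothetical coefficient values, from small to large.
for beta in (0.01, 0.05, 0.20):
    print(beta, round(pct_change_approx(beta), 2), round(pct_change_exact(beta), 2))
```

For a coefficient of 0.20 the exact change is about 22.1%, compared with 20% from the approximation, so the exact formula matters mainly for larger coefficients such as those on dummy variables.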
STEP 5 — Model C (Log‑Log Model)
Model → OLS
- Dependent variable: lnahe
- Independent variables: lnage, female, yrseduc
Interpretation
The coefficient on lnage is the elasticity of earnings with respect to age.
\[ 1\% \text{ increase in age} \;\rightarrow\; \beta_{\text{lnage}}\% \text{ increase in earnings} \]
A 1% increase in age is a larger absolute change at higher ages (from 25 to 25.25 years versus from 33 to 33.33 years), but the percentage effect on earnings is the same at every age.
This model often fits better when the relationship is multiplicative.
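The elasticity interpretation can be checked numerically. In this sketch the elasticity of 0.5 is a hypothetical value, not an estimate; the function computes the exact percentage change implied by a log‑log model.

```python
def pct_change_earnings(elasticity, pct_change_age):
    """Exact % change in earnings implied by a log-log model."""
    return 100.0 * ((1.0 + pct_change_age / 100.0) ** elasticity - 1.0)

elasticity = 0.5  # hypothetical value, for illustration only

for age in (25, 33):
    extra_years = age * 0.01  # a 1% increase in age, in years
    effect = pct_change_earnings(elasticity, 1.0)
    print(age, round(extra_years, 2), round(effect, 3))
```

The absolute change in age differs (0.25 versus 0.33 years), while the percentage effect on earnings is identical and close to the elasticity itself.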
STEP 6 — Model D (Quadratic Model)
Model → OLS
- Dependent variable: lnahe
- Independent variables: age, agesq, female, yrseduc
Interpretation
The marginal effect of age is:
\[ \frac{\partial \ln(\text{earnings})}{\partial \text{age}} = \beta_{\text{age}} + 2\,\beta_{\text{agesq}} \cdot \text{age} \]
Earnings increase with age but at a decreasing rate (if \(\beta_{\text{age}} > 0\) and \(\beta_{\text{agesq}} < 0\)).
Eventually, earnings may peak and then decline (typical life‑cycle pattern).
Turning point (peak age)
\(\text{Turning point} = -\frac{\beta_{\text{age}}}{2\beta_{\text{agesq}}}\)
Provided \(\beta_{\text{agesq}} < 0\), this is the age at which earnings reach their maximum before starting to decline.
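The marginal effect and the turning point can be computed directly from the two age coefficients. The sketch below uses hypothetical coefficients chosen so that the arithmetic is easy to follow; substitute the estimates from your own GRETL output.

```python
# Hypothetical Model D coefficients (illustrative only, not estimates).
b_age, b_agesq = 0.08, -0.001

def marginal_effect(age):
    """d ln(earnings) / d age = b_age + 2 * b_agesq * age."""
    return b_age + 2.0 * b_agesq * age

def turning_point():
    """Age at which the marginal effect is zero: -b_age / (2 * b_agesq)."""
    return -b_age / (2.0 * b_agesq)

print(round(marginal_effect(30), 4))  # effect of one more year at age 30
print(round(turning_point(), 2))      # peak age under these coefficients
```

With these placeholder values the marginal effect at age 30 is about 0.02 (roughly 2% per year) and the peak is at age 40. If the computed turning point lies outside the age range in the sample, the peak is an extrapolation and should be reported with caution.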
STEP 7 — Compare Models
Look at:
- \(R^2\) and adjusted \(R^2\) (higher is better when the dependent variable is the same).
- Models B, C, and D all use lnahe as the dependent variable, so they are directly comparable.
- Model A uses ahe (raw dollars); do not compare its \(R^2\) with the others, because the dependent variable is different.
Expected conclusion:
The quadratic model (Model D) usually fits best because it captures the life‑cycle pattern of earnings – rising steeply in young ages, flattening, and eventually declining.
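Because Model D has one more regressor than Models B and C, adjusted \(R^2\) is the fairer yardstick. The sketch below applies the usual formula \(\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}\), where \(k\) counts regressors excluding the constant. The \(R^2\) values and sample size are hypothetical, not results from the dataset.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2; k is the number of regressors excluding the constant."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 1000  # hypothetical sample size

# Hypothetical R^2 values: (model label, R^2, number of regressors).
models = [("B log-linear", 0.190, 3), ("C log-log", 0.191, 3), ("D quadratic", 0.192, 4)]

for label, r2, k in models:
    print(label, round(adjusted_r2(r2, n, k), 4))
```

GRETL reports adjusted \(R^2\) in the regression output, so this calculation is only needed to see how the penalty for extra regressors works.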
STEP 8 — Heteroskedasticity Tests
After estimating Model D (quadratic with lnahe), run:
Tests → Heteroskedasticity → Breusch‑Pagan
Tests → Heteroskedasticity → White
Interpretation
Null hypothesis: homoskedasticity (constant error variance).
If p‑value < 0.05, reject homoskedasticity → heteroskedasticity is present.
If the two tests disagree, trust White’s test (it is more general and detects nonlinear forms of heteroskedasticity).
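To see the mechanics behind such a test, here is a minimal Python sketch of Koenker's studentized variant of the Breusch‑Pagan test with a single regressor: regress the squared residuals on x and compare n times the auxiliary \(R^2\) with a chi‑square(1) distribution. This is an illustration under simplifying assumptions (one regressor, constructed data); GRETL's tests use all regressors of the model and may differ in detail.

```python
import math

def simple_ols(x, y):
    """Intercept, slope, and residuals from regressing y on a constant and x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid

def koenker_bp(x, y):
    """LM statistic (n * auxiliary R^2) and its chi-square(1) p-value."""
    _, _, resid = simple_ols(x, y)
    e2 = [e * e for e in resid]
    _, _, aux_resid = simple_ols(x, e2)  # auxiliary regression of e^2 on x
    mean_e2 = sum(e2) / len(e2)
    sst = sum((v - mean_e2) ** 2 for v in e2)
    r2_aux = 1.0 - sum(u * u for u in aux_resid) / sst
    lm = max(0.0, len(x) * r2_aux)  # guard against tiny negative rounding error
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-square(1) upper tail
    return lm, p_value

# Constructed data (not from Earnings.xlsx): alternating-sign errors.
x = list(range(1, 41))
signs = [1 if i % 2 == 0 else -1 for i in range(40)]
y_hetero = [1.0 + 0.5 * xi + 0.1 * xi * s for xi, s in zip(x, signs)]  # spread grows with x
y_homo = [1.0 + 0.5 * xi + 0.1 * s for xi, s in zip(x, signs)]         # constant spread

lm_het, p_het = koenker_bp(x, y_hetero)
lm_hom, p_hom = koenker_bp(x, y_homo)
print(round(lm_het, 2), p_het < 0.05)
print(round(lm_hom, 2), p_hom < 0.05)
```

In the first dataset the error spread grows with x, so the test rejects homoskedasticity; in the second the spread is constant and the test does not reject.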
STEP 9 — Robust Standard Errors
If heteroskedasticity is present, you must correct the standard errors.
Re‑estimate Model D:
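Model → Ordinary Least Squares → in the dialog box, check “Robust standard errors” and select HC1. HAC estimators are designed for time‑series data and are not needed for a cross‑sectional dataset such as this one.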
Explanation
Coefficients remain unchanged (they are still unbiased).
Standard errors become valid for inference (t‑statistics, p‑values, confidence intervals).
Use these corrected standard errors in your final report.
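The following sketch shows, for a regression with one regressor, how the HC1 standard error of the slope differs from the classical one while the slope itself is unchanged. The data are constructed for illustration (error spread grows with x), and the formulas are the textbook single‑regressor versions, so this is not a reproduction of GRETL's internal code.

```python
import math

def slope_with_standard_errors(x, y):
    """OLS slope with classical and HC1 standard errors (constant + one regressor)."""
    n, k = len(x), 2  # k counts the constant and the slope
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dev)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dev, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical: assumes one common error variance s^2.
    s2 = sum(e * e for e in resid) / (n - k)
    se_classical = math.sqrt(s2 / sxx)

    # HC1: weights each squared residual by its squared deviation in x,
    # then applies the n / (n - k) degrees-of-freedom correction.
    meat = sum(d * d * e * e for d, e in zip(dev, resid))
    se_hc1 = math.sqrt(n / (n - k) * meat / sxx ** 2)
    return slope, se_classical, se_hc1

# Constructed heteroskedastic data (not from Earnings.xlsx).
x = list(range(1, 41))
signs = [1 if i % 2 == 0 else -1 for i in range(40)]
y = [1.0 + 0.5 * xi + 0.1 * xi * s for xi, s in zip(x, signs)]

slope, se_classical, se_hc1 = slope_with_standard_errors(x, y)
print(round(slope, 4), round(se_classical, 5), round(se_hc1, 5))
```

Only the standard errors change between the two calculations. In this constructed example the robust standard error is larger than the classical one, but with real data it can be larger or smaller.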
STEP 10 — Multicollinearity Check
After estimating Model D (with age and agesq), run:
Analysis → Collinearity (or Tests → Variance inflation factors)
What you will see
- The VIFs for age and agesq will be very high (often > 10 or even > 100).
- This is normal for polynomials. It does not mean the model is bad, but it can cause numerical instability.
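The size of the problem is easy to reproduce. With only two regressors, \(\text{VIF} = 1/(1 - r^2)\), where \(r\) is their correlation. The sketch below assumes hypothetical, evenly spaced ages from 25 to 34 and ignores female and yrseduc, so the number will not match GRETL's output exactly.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def vif_two_regressors(u, v):
    """VIF when the model contains only these two regressors."""
    r = correlation(u, v)
    return 1.0 / (1.0 - r * r)

ages = list(range(25, 35))     # hypothetical ages, one per year
agesq = [a * a for a in ages]  # age squared

vif_raw = vif_two_regressors(ages, agesq)
print(round(correlation(ages, agesq), 4), round(vif_raw, 1))
```

Over such a narrow age range, age and its square are almost perfectly correlated, which is why the VIF is in the hundreds.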
STEP 11 — Fix by Centering Age
Centering removes the non‑essential multicollinearity between age and agesq.
Create the centered variables age_c and age_c_sq:
\[ \text{age}_c = \text{age} - \bar{\text{age}} \]
\[ \text{age}_c^2 = (\text{age}_c)^2 \]
Re‑estimate Model D using:
- Dependent variable: lnahe
- Independent variables: age_c, age_c_sq, female, yrseduc
Result
- Same model fit (same \(R^2\), same predicted values).
- VIFs will drop dramatically (often to around 1–2).
- The coefficient on age_c now represents the marginal effect of age at the mean age.
- The coefficient on age_c_sq is numerically identical to the coefficient on agesq and keeps the same interpretation (curvature).
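These results follow from simple algebra, which the sketch below verifies numerically. The uncentered coefficients and the age values are hypothetical; the mapping to the centered coefficients is exact for any values.

```python
import math

ages = list(range(25, 35))         # hypothetical ages, one per year
mean_age = sum(ages) / len(ages)

# Hypothetical uncentered coefficients: constant, age, agesq.
b0, b1, b2 = 0.4, 0.08, -0.001

# Implied coefficients after centering age at its mean.
c2 = b2                                        # curvature is unchanged
c1 = b1 + 2.0 * b2 * mean_age                  # marginal effect at the mean age
c0 = b0 + b1 * mean_age + b2 * mean_age ** 2   # fitted value at the mean age

def fit_raw(age):
    return b0 + b1 * age + b2 * age ** 2

def fit_centered(age):
    a = age - mean_age
    return c0 + c1 * a + c2 * a ** 2

max_gap = max(abs(fit_raw(a) - fit_centered(a)) for a in ages)

def vif_two_regressors(u, v):
    """VIF = 1 / (1 - r^2) when the model contains only these two regressors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    r = cov / math.sqrt(sum((p - mu) ** 2 for p in u) * sum((q - mv) ** 2 for q in v))
    return 1.0 / (1.0 - r * r)

age_c = [a - mean_age for a in ages]
age_c_sq = [a * a for a in age_c]
vif_centered = vif_two_regressors(age_c, age_c_sq)

print(max_gap, round(c1, 4), round(vif_centered, 3))
```

The predictions agree to rounding error and the VIF falls to about 1. Here it is essentially exactly 1 because the hypothetical ages are symmetric around their mean; with real data the centered VIF will typically be slightly higher.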
STEP 12 — Final Model
Your final improved model should be:
\[ \ln(\text{AHE}) = \beta_1 + \beta_2 \,\text{age}_c + \beta_3 \,\text{age}_c^2 + \beta_4 \,\text{female} + \beta_5 \,\text{yrseduc} + \varepsilon \]
with:
- Robust Standard Errors
- Centered age variables
- Quadratic term to capture life‑cycle pattern
- Log‑dependent variable for percentage interpretation
STEP 13 — Graph the Results
Graph → X-Y scatter
- X: age
- Y: lnahe
Then add fitted lines from:
Model B (log‑linear)
Model C (log‑log)
Model D (quadratic)
How to add fitted lines in GRETL:
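After estimating each model, use Save → Fitted values in the model window to store the fitted series under a distinct name, then plot the saved series against age on the same scatter. For a quick look at a single model, use Graphs → Fitted, actual plot in the model window.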
Interpretation
Compare shapes:
Model B (log‑linear) → straight line (constant percentage increase per year).
Model C (log‑log) → concave curve (the elasticity is constant, but the slope with respect to age falls as age rises).
Model D (quadratic) → hump‑shaped curve (increases, flattens, then declines).
The quadratic curve best matches the typical earnings life cycle.
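The three shapes can be previewed before plotting. The sketch below evaluates each functional form on a grid of ages, holding female and yrseduc fixed so that they are absorbed into the constant. All coefficients are hypothetical, and the age grid is wider than a typical sample so that the differences are visible.

```python
import math

ages = list(range(20, 66, 5))  # hypothetical age grid: 20, 25, ..., 65

# Hypothetical fitted equations for ln(earnings); other regressors held fixed.
def log_linear(age):
    return 1.5 + 0.02 * age

def log_log(age):
    return 0.5 + 0.6 * math.log(age)

def quadratic(age):
    return 0.4 + 0.08 * age - 0.001 * age ** 2

def steps(curve):
    """Change in fitted ln(earnings) between consecutive grid ages."""
    values = [curve(a) for a in ages]
    return [later - earlier for earlier, later in zip(values, values[1:])]

linear_steps = steps(log_linear)    # constant: straight line
loglog_steps = steps(log_log)       # positive but shrinking: concave
quad_steps = steps(quadratic)       # positive, then negative: hump
peak_age = max(ages, key=quadratic)

print([round(s, 3) for s in linear_steps])
print([round(s, 3) for s in loglog_steps])
print([round(s, 3) for s in quad_steps])
print(peak_age)
```

Only the quadratic form allows the step sizes to change sign, which is what produces the hump.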
STEP 14 — Final Interpretation (Project Ready)
Students should write a paragraph similar to:
Earnings increase with age, but at a decreasing rate, consistent with a life‑cycle pattern. The quadratic model provides the best fit and most realistic representation. Education has a strong positive effect on earnings, while females earn less on average, holding other factors constant. Diagnostic tests indicated heteroskedasticity, so robust standard errors are used. Centering age eliminated multicollinearity between age and its square without changing the model’s fit.
Key Takeaways
| Concept | Takeaway |
|---|---|
| Functional form | Linear, log‑linear, log‑log, and quadratic models give different interpretations. |
| Log‑linear | Coefficient × 100 ≈ percentage change (use exact formula for precision). |
| Log‑log | Coefficient = elasticity (percentage change in Y for 1% change in X). |
| Quadratic | Captures increasing‑then‑decreasing patterns (life cycle). |
| Heteroskedasticity | Use White test; fix with robust standard errors. |
| Multicollinearity | Polynomials create high VIF; centering solves it without changing fit. |
| Model selection | Choose based on theory, fit (R²), and diagnostics. |
Quick GRETL Reference
| Task | Menu Path |
|---|---|
| Load Excel | File → Open data → Import → Excel |
| Create new variable | Add → Define new variable |
| OLS regression | Model → Ordinary Least Squares |
| Robust standard errors | In OLS dialog → Robust standard errors (HC1) |
| Heteroskedasticity tests | Model window → Tests → Heteroskedasticity |
| VIF (collinearity) | Model window → Analysis → Collinearity |
| Scatter plot with fitted lines | Graph → X-Y scatter → then add series |