Teaching Note for Applied Econometrics and Economic Modelling Lab Session - Day 8

Investigating Age and Earnings in GRETL: Functional Forms, Diagnostics, and Model Selection

Author

Gül Ertan Özgüzer

Published

April 17, 2026

Session Overview

  • Duration: 1 session of 90 minutes (or two 45‑minute sessions)
  • Dataset: Earnings.xlsx – variables: ahe (average hourly earnings), age, female (1 = female, 0 = male), yrseduc (years of education)
  • Objective: Students will estimate multiple regression models (linear, log‑linear, log‑log, quadratic), interpret coefficients correctly, compare model performance, conduct diagnostic tests (heteroskedasticity, multicollinearity), and build a final improved model with robust standard errors and centering.
  • Focus: Hands‑on GRETL implementation, interpretation of coefficients in different functional forms, and practical diagnostic testing.

🎯 Learning Outcomes

By the end of this lab, you will be able to:

  • Estimate linear, log‑linear, log‑log, and quadratic models in GRETL.
  • Interpret coefficients correctly depending on the functional form.
  • Compare models using \(R^2\) and adjusted \(R^2\).
  • Detect heteroskedasticity using Breusch‑Pagan and White tests.
  • Apply robust standard errors to correct for heteroskedasticity.
  • Detect multicollinearity using variance inflation factors (VIF) and fix it by centering.
  • Build a final improved model based on theory and diagnostics.

STEP 1 — Load the Data

In GRETL:

  1. File → Open data → Import → Excel
  2. Select your dataset: Earnings.xlsx

Check variables

You should see:

| Variable | Description |
|----------|-------------|
| ahe | Average hourly earnings (in dollars) |
| age | Age in years |
| female | 1 if female, 0 if male |
| yrseduc | Years of education |

STEP 2 — Create Variables

Go to Add → Define new variable and create the following:

lnahe = ln(ahe)
lnage = ln(age)
agesq = age^2

Explanation

  • lnahe → allows percentage interpretation of coefficients (log‑linear and log‑log models).

  • lnage → enables elasticity interpretation (log‑log model).

  • agesq → captures nonlinear (quadratic) effects, allowing earnings to increase at a decreasing rate and eventually peak.
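As a cross-check outside GRETL, the three transformations can be sketched in plain Python. The two observations below are made-up values for illustration only (they are not taken from Earnings.xlsx), and the helper name `add_transforms` is hypothetical.

```python
import math

def add_transforms(row):
    """Return a copy of one observation with lnahe, lnage and agesq added."""
    out = dict(row)
    out["lnahe"] = math.log(row["ahe"])  # natural log of hourly earnings
    out["lnage"] = math.log(row["age"])  # natural log of age
    out["agesq"] = row["age"] ** 2       # squared age for the quadratic model
    return out

# Hypothetical observations for illustration only (not from Earnings.xlsx).
sample = [
    {"ahe": 15.0, "age": 25, "female": 1, "yrseduc": 12},
    {"ahe": 22.5, "age": 30, "female": 0, "yrseduc": 16},
]

transformed = [add_transforms(r) for r in sample]
for r in transformed:
    print(r)
```

Note that the natural log is only defined for positive values, so any observation with ahe equal to zero would need to be handled before creating lnahe.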

STEP 3 — Model A (Linear Model)

Model → Ordinary Least Squares (OLS)

  • Dependent variable: ahe

  • Independent variables: age female yrseduc

Interpretation

The coefficient on age is the change in average hourly earnings (in dollars) associated with a one‑year increase in age, holding gender and education constant.

  • Effect is constant across all ages.

  • From age 25 to 26, earnings increase by the same dollar amount as from age 33 to 34.

STEP 4 — Model B (Log‑Linear Model)

Model → OLS

  • Dependent: lnahe

  • Independent: age female yrseduc

Interpretation

Multiplied by 100, the coefficient on age approximates the percentage change in earnings for a one‑year increase in age, holding gender and education constant.
The exact percentage change is:

\(\%\Delta \text{earnings} = 100 \times \left(e^{\beta_{\text{age}}} - 1\right)\)

For small coefficients (e.g., \(\beta_{\text{age}} \approx 0.01\)), the approximation \(\%\Delta \text{earnings} \approx 100 \times \beta_{\text{age}}\) is very close.

  • Effect is a constant percentage change for each additional year.
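The gap between the approximation and the exact formula can be checked numerically. The sketch below uses hypothetical coefficient values (they are not estimates from this dataset), and the function names are illustrative.

```python
import math

def pct_change_exact(beta):
    """Exact percentage change in earnings for a one-unit increase in x."""
    return 100.0 * (math.exp(beta) - 1.0)

def pct_change_approx(beta):
    """Approximation that reads the coefficient directly as a percentage."""
    return 100.0 * beta

# Hypothetical coefficients, chosen only to show how the gap grows.
for beta in (0.01, 0.05, 0.20):
    approx = pct_change_approx(beta)
    exact = pct_change_exact(beta)
    print(f"beta={beta:.2f}  approx={approx:.3f}%  exact={exact:.3f}%")
```

For a coefficient of 0.01 the two numbers are almost identical, while for a coefficient of 0.20 the exact change is about 22.1% rather than 20%. The same exact formula is the safer choice for the coefficient on a dummy variable such as female, which is often not small.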

STEP 5 — Model C (Log‑Log Model)

Model → OLS

  • Dependent: lnahe

  • Independent: lnage female yrseduc

Interpretation

The coefficient on lnage is the elasticity of earnings with respect to age.

\[ 1\% \text{ increase in age} \;\rightarrow\; \beta_{\text{lnage}}\% \text{ change in earnings} \]

  • The percentage effect on earnings is constant, but what 1% means in years depends on the age level: a 1% increase at age 25 is 0.25 years (25 to 25.25), while at age 33 it is 0.33 years (33 to 33.33).

  • This model often fits better when the relationship is multiplicative.
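The elasticity reading can be made concrete with a small calculation. This is a sketch with a hypothetical elasticity of 0.5 (not an estimate from this dataset); the function names are illustrative.

```python
def pct_change_loglog(elasticity, pct_change_x):
    """Exact % change in y implied by a log-log model for a % change in x."""
    return 100.0 * ((1.0 + pct_change_x / 100.0) ** elasticity - 1.0)

def one_percent_in_years(age):
    """Size of a 1% increase in age, measured in years."""
    return 0.01 * age

elasticity = 0.5  # hypothetical value for illustration

# A 1% increase in age gives roughly elasticity% more earnings at any age...
print(pct_change_loglog(elasticity, 1.0))

# ...but 1% corresponds to a different number of years at each age.
for age in (25, 33):
    print(age, one_percent_in_years(age))
```

The elasticity interpretation is itself an approximation that works well for small percentage changes; for larger changes in age the exact expression in `pct_change_loglog` should be used.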

STEP 6 — Model D (Quadratic Model)

Model → OLS

  • Dependent: lnahe

  • Independent: age agesq female yrseduc

Interpretation

The marginal effect of age is:

\[ \frac{\partial \ln(\text{earnings})}{\partial \text{age}} = \beta_{\text{age}} + 2\,\beta_{\text{agesq}} \cdot \text{age} \]

  • Earnings increase with age but at a decreasing rate (if \(\beta_{\text{age}} > 0\) and \(\beta_{\text{agesq}} < 0\)).

  • Eventually, earnings may peak and then decline (typical life‑cycle pattern).

Turning point (peak age)

\(\text{Turning point} = -\frac{\beta_{\text{age}}}{2\beta_{\text{agesq}}}\)

If \(\beta_{\text{agesq}} < 0\), this is the age at which predicted earnings reach their maximum before starting to decline.
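The marginal effect and turning point formulas can be evaluated directly once the coefficients are known. The sketch below uses hypothetical coefficients (0.08 on age and -0.001 on agesq), not estimates from this dataset; replace them with the values from your own GRETL output.

```python
def marginal_effect(beta_age, beta_agesq, age):
    """Derivative of ln(earnings) with respect to age at a given age."""
    return beta_age + 2.0 * beta_agesq * age

def turning_point(beta_age, beta_agesq):
    """Age at which the marginal effect of age equals zero."""
    if beta_agesq == 0:
        raise ValueError("No turning point when the squared term is zero.")
    return -beta_age / (2.0 * beta_agesq)

# Hypothetical coefficients for illustration only.
b_age, b_agesq = 0.08, -0.001

for age in (25, 30, 34):
    effect = 100.0 * marginal_effect(b_age, b_agesq, age)
    print(f"age {age}: about {effect:.1f}% more earnings per extra year")

print("turning point:", turning_point(b_age, b_agesq))
```

If the computed turning point lies outside the range of ages in the sample, the model is extrapolating and the peak should not be given much weight.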

STEP 7 — Compare Models

Look at:

  • \(R^2\) and adjusted \(R^2\) (higher is better for the same dependent variable).

  • Models B, C, and D all use lnahe as the dependent variable, so they are directly comparable.

  • Model A uses ahe (raw dollars) – do not compare its \(R^2\) with the others because the dependent variable is different.
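Adjusted \(R^2\) matters here because Model D has one more regressor than Models B and C. A minimal sketch of the standard formula follows; the \(R^2\) value and sample size are hypothetical, and `k` counts slope regressors excluding the constant.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (no constant)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: the same R-squared with 3 versus 4 regressors.
n = 101
print(adjusted_r2(0.50, n, 3))  # e.g. Models B and C (age or lnage, female, yrseduc)
print(adjusted_r2(0.50, n, 4))  # e.g. Model D (age, agesq, female, yrseduc)
```

With an identical \(R^2\), the model with more regressors gets the lower adjusted \(R^2\), so Model D has to improve the fit enough to offset the penalty for the extra term.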

Expected conclusion:
The quadratic model (Model D) usually fits best because it captures the life‑cycle pattern of earnings: rising steeply at young ages, flattening, and eventually declining.

STEP 8 — Heteroskedasticity Tests

After estimating Model D (quadratic with lnahe), run:

  • Tests → Heteroskedasticity → Breusch‑Pagan

  • Tests → Heteroskedasticity → White

Interpretation

  • Null hypothesis: homoskedasticity (constant error variance).

  • If p‑value < 0.05, reject homoskedasticity → heteroskedasticity is present.

  • If the two tests disagree, trust White’s test (it is more general and detects nonlinear forms of heteroskedasticity).
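To see the idea behind the Breusch‑Pagan test, here is a minimal sketch in Python, assuming a single regressor and simulated data whose error spread grows with x. It implements the Koenker (n·R²) variant: regress the squared residuals on the regressor and compare n·R² with a chi‑square distribution with one degree of freedom. This may differ from the exact statistic GRETL prints, and all names and numbers are illustrative assumptions.

```python
import math
import random

def simple_ols(x, y):
    """Return (intercept, slope) of a one-regressor OLS fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y):
    """R-squared from regressing y on x with a constant."""
    intercept, slope = simple_ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def koenker_bp_test(x, y):
    """Return (LM statistic, p-value) for H0: homoskedasticity."""
    intercept, slope = simple_ols(x, y)
    resid_sq = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of squared residuals on x; LM = n * R-squared.
    lm = max(len(x) * r_squared(x, resid_sq), 0.0)
    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Simulated data (not Earnings.xlsx): the error spread increases with x.
random.seed(0)
x = [random.uniform(25, 34) for _ in range(500)]
y = [1.0 + 0.02 * xi + random.gauss(0.0, 0.05 * (xi - 24.0)) for xi in x]

lm_stat, p_value = koenker_bp_test(x, y)
print(f"LM = {lm_stat:.2f}, p-value = {p_value:.4g}")
```

White's test follows the same logic but adds squares and cross‑products of the regressors to the auxiliary regression, which is why it can pick up nonlinear forms of heteroskedasticity.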

STEP 9 — Robust Standard Errors

If heteroskedasticity is present, you must correct the standard errors.

Re‑estimate Model D:

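Model → OLS → in the dialog box, check “Robust standard errors” and make sure the HC1 variant is selected. (HAC standard errors are designed for time‑series data with autocorrelated errors, so they are not the relevant option for this cross‑sectional dataset.)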

Explanation

  • Coefficients remain unchanged (they are still unbiased).

  • Standard errors become valid for inference (t‑statistics, p‑values, confidence intervals).

  • Use these corrected standard errors in your final report.
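The points above can be illustrated with a small sketch, assuming a single regressor and simulated heteroskedastic data (not Earnings.xlsx). It computes the slope once, then the conventional standard error and the HC1 standard error for that same slope. The function name and all numbers are illustrative assumptions.

```python
import math
import random

def slope_with_standard_errors(x, y):
    """Return (slope, conventional SE, HC1 SE) for a one-regressor OLS fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dx, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Conventional SE: assumes one common error variance for all observations.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_conventional = math.sqrt(s2 / sxx)

    # HC1 SE: weights each squared residual by its own squared deviation,
    # then applies the n / (n - k) small-sample correction with k = 2.
    meat = sum((d * e) ** 2 for d, e in zip(dx, resid))
    se_hc1 = math.sqrt(n / (n - 2) * meat / sxx ** 2)
    return slope, se_conventional, se_hc1

# Simulated data: the error spread increases with x.
random.seed(0)
x = [random.uniform(25, 34) for _ in range(500)]
y = [1.0 + 0.02 * xi + random.gauss(0.0, 0.05 * (xi - 24.0)) for xi in x]

slope, se_conv, se_hc1 = slope_with_standard_errors(x, y)
print(f"slope = {slope:.4f}")
print(f"conventional SE = {se_conv:.4f}, HC1 SE = {se_hc1:.4f}")
```

The slope is computed only once, which is why switching to robust standard errors in GRETL leaves the coefficient column unchanged and alters only the standard errors, t‑statistics, and p‑values.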

STEP 10 — Multicollinearity Check

After estimating Model D (with age and agesq), run:

Analysis → Collinearity (or Tests → Variance inflation factors)

What you will see

  • VIF for age and agesq will be very high (often > 10 or even > 100).

  • This is normal for polynomials – it does not mean the model is bad, but it can cause numerical instability.

STEP 11 — Fix by Centering Age

Centering removes the non‑essential multicollinearity between age and agesq.

Create centered variables:

\[ \text{age}_c = \text{age} - \bar{\text{age}} \]

\[ \text{age}_c^2 = (\text{age}_c)^2 \]

Re‑estimate Model D using:

  • Dependent: lnahe

  • Independent: age_c age_c_sq female yrseduc

Result

  • Same model fit (same \(R^2\), same predicted values).

  • VIFs will drop dramatically (often to around 1–2).

  • The coefficient on age_c now represents the marginal effect at the mean age.

  • The coefficient on age_c_sq is numerically identical to the coefficient on agesq and keeps the same interpretation (curvature).
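The drop in VIF can be reproduced with a small sketch. It assumes only two regressors (age and its square), in which case the VIF equals 1 / (1 − r²) with r the correlation between them, and it uses a hypothetical, evenly spread set of ages from 25 to 34 rather than the actual data. In the full model with female and yrseduc the numbers will differ.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equally long lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def vif_two_regressors(u, v):
    """VIF of either regressor when the model has exactly two regressors."""
    r = correlation(u, v)
    return 1.0 / (1.0 - r * r)

# Hypothetical ages for illustration only (not from Earnings.xlsx).
ages = list(range(25, 35))
mean_age = sum(ages) / len(ages)
age_c = [a - mean_age for a in ages]

vif_raw = vif_two_regressors(ages, [a ** 2 for a in ages])
vif_centered = vif_two_regressors(age_c, [a ** 2 for a in age_c])

print(f"VIF with age and agesq:      {vif_raw:.1f}")
print(f"VIF with age_c and age_c_sq: {vif_centered:.1f}")
```

In this symmetric example the centered VIF is exactly 1 because age_c and its square are uncorrelated. With real data the age distribution is rarely perfectly symmetric, so the centered VIF will be somewhat above 1.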

STEP 12 — Final Model

Your final improved model should be:

\[ \ln(\text{AHE}) = \beta_1 + \beta_2 \,\text{age}_c + \beta_3 \,\text{age}_c^2 + \beta_4 \,\text{female} + \beta_5 \,\text{yrseduc} + \varepsilon \]

with:

  • Robust Standard Errors
  • Centered age variables
  • Quadratic term to capture life‑cycle pattern
  • Log‑dependent variable for percentage interpretation

STEP 13 — Graph the Results

Graph → X-Y scatter

  • X: age

  • Y: lnahe

Then add fitted lines from:

  • Model B (log‑linear)

  • Model C (log‑log)

  • Model D (quadratic)

How to add fitted lines in GRETL:
After estimating each model, either use Graph → Fitted, actual plot in the model window, or save the fitted values as a new series (Add → Series) and plot the saved series on the same scatter.

Interpretation

Compare shapes:

  • Model B (log‑linear) → straight line (constant percentage increase per year).

  • Model C (log‑log) → concave curve when the elasticity is positive (the elasticity is constant, but the slope with respect to age in years falls as age rises).

  • Model D (quadratic) → hump‑shaped curve (increases, flattens, then declines).

The quadratic curve best matches the typical earnings life cycle.
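The three shapes can also be checked numerically before plotting. The sketch below evaluates each functional form on a grid of ages using hypothetical coefficients (not estimates from this dataset) and inspects the second differences: zero for a straight line, negative for a concave curve.

```python
import math

# Hypothetical coefficients for illustration only.
def log_linear(age, b0=2.0, b1=0.01):
    return b0 + b1 * age

def log_log(age, b0=1.0, b1=0.4):
    return b0 + b1 * math.log(age)

def quadratic(age, b0=0.5, b1=0.08, b2=-0.001):
    return b0 + b1 * age + b2 * age ** 2

def second_differences(f, ages):
    """Discrete curvature of f on an evenly spaced grid of ages."""
    vals = [f(a) for a in ages]
    return [vals[i + 1] - 2 * vals[i] + vals[i - 1]
            for i in range(1, len(vals) - 1)]

ages = list(range(20, 61, 5))
for name, f in (("log-linear", log_linear),
                ("log-log", log_log),
                ("quadratic", quadratic)):
    fitted = [round(f(a), 3) for a in ages]
    print(name, fitted)
```

With these illustrative coefficients the quadratic form peaks at age 40 and is the only one of the three that turns downward, which is the hump shape described above. The wide age grid is chosen only to make the shapes visible.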

STEP 14 — Final Interpretation (Project Ready)

Students should write a paragraph similar to:

Earnings increase with age, but at a decreasing rate, consistent with a life‑cycle pattern. The quadratic model provides the best fit and most realistic representation. Education has a strong positive effect on earnings, while females earn less on average, holding other factors constant. Diagnostic tests indicated heteroskedasticity, so robust standard errors are used. Centering age eliminated multicollinearity between age and its square without changing the model’s fit.

Key Takeaways

| Concept | Takeaway |
|---------|----------|
| Functional form | Linear, log‑linear, log‑log, and quadratic models give different interpretations. |
| Log‑linear | Coefficient × 100 ≈ percentage change (use exact formula for precision). |
| Log‑log | Coefficient = elasticity (percentage change in Y for 1% change in X). |
| Quadratic | Captures increasing‑then‑decreasing patterns (life cycle). |
| Heteroskedasticity | Use White test; fix with robust standard errors. |
| Multicollinearity | Polynomials create high VIF; centering solves it without changing fit. |
| Model selection | Choose based on theory, fit (R²), and diagnostics. |

Quick GRETL Reference

| Task | Menu Path |
|------|-----------|
| Load Excel | File → Open data → Import → Excel |
| Create new variable | Add → Define new variable |
| OLS regression | Model → Ordinary Least Squares |
| Robust standard errors | In OLS dialog → Robust standard errors (HC1) |
| Heteroskedasticity tests | Model window → Tests → Heteroskedasticity |
| VIF (collinearity) | Model window → Analysis → Collinearity |
| Scatter plot with fitted lines | Graph → X-Y scatter → then add series |