Teaching Notes: Day 5 - Multiple Regression Application Lab

Session Overview

Duration: 2 × 45-minute sessions
Objective: Apply multiple regression concepts to analyze wage determinants, test model specifications, create dummies for education levels, test whether the education effect is linear, and interpret results.
Tools: GRETL


Session 1: Model Estimation & Interpretation

1. Review of Key Concepts

The Fundamental Problem: Why Simple Regression Isn’t Enough

A. The Naïve Approach: Wage vs. Gender (Simple Regression)

  • Question: “Do women earn less than men?”

  • Simple Regression Approach:
    \[ \log(\text{Wage}) = \beta_0 + \beta_1 \text{Female} + \epsilon \]

    • Finds: Women earn 22% less on average.
    • But is this all discrimination?
    • Problem: This bundles all gender-related differences (education, job type, experience) into one coefficient.

B. The Reality: Gender Affects Wage Through Multiple Channels

  • Structural Pathways:

    Gender → Education → Wage  
    Gender → Part-Time Status → Wage  
    Gender → (Direct Discrimination) → Wage
  • Example:

    • Suppose:
      • Women, on average, have less education (48% in Level 1 vs. 34% for men).
      • More education leads to higher wages, so leaving education out biases the gender coefficient (omitted variable bias).
    • Result: Simple regression overstates discrimination because it doesn’t account for education differences.

C. The Need for Multiple Regression

  • Goal: Isolate the direct effect of gender, holding other factors constant.
  • Analogy:
    • “Comparing wages of men and women with the same education and job status.”
    • Like a controlled experiment, where we “adjust” for the factors that affect wage (dependent variable).

2. Total Effect vs. Partial Effect: Two Different Research Questions

(Policy vs. Economic Analysis)

A. When to Use Simple Regression (Total Effect)

  • Question: “What is the overall wage gap society observes?”
    • Useful for policy debates (e.g., “Do we need gender equity laws?”).
    • Example: If women earn less because they choose lower-paying careers, should policymakers intervene?

B. When to Use Multiple Regression (Partial Effect)

  • Question: “Is there wage discrimination after accounting for qualifications?”
    • Used in legal cases (e.g., suing for pay discrimination).
    • Example: If women with the same education and job status still earn less, this suggests bias.

C. Real-World Parallel: The “College Premium” Debate

  • Simple Regression: College grads earn more.
  • Multiple Regression: But what if they also come from wealthier families?
  • Key Insight: Multiple regression helps separate causation from correlation.

3. Omitted Variable Bias: The Core Issue

A. Visualizing Bias

  1. Draw:
    • X-axis: Female (0 = Male, 1 = Female)
    • Y-axis: Log(Wage)
    • Regression line: Downward slope (coefficient -0.25, which corresponds to the 22% gap because e^-0.25 - 1 ≈ -0.22).
  2. Now overlay:
    • Education: Women cluster at lower education → pulls their wages down further.
    • Part-Time: More women work part-time → also lowers wages.

B. The Two Biases at Play

  1. Downward Bias from Education
    • Women have less education → lowers wages.
    • If we don’t control for education, the gender gap looks worse than it is.
  2. Downward Bias from Part-Time Work
    • Women work part-time more often → lowers wages.
    • If we don’t control for part-time status, the gender gap looks worse than it is.

C. Net Effect in Simple Regression

  • The observed -0.25 is a mixture:
    • True discrimination effect (unknown).
    • Minus education bias (makes gap seem bigger).
    • Minus part-time bias (makes gap seem bigger).
  • Conclusion: We can’t trust the simple regression!
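The mixture described above can be made concrete with a minimal Python sketch. The data are hypothetical and noise-free (8 workers, with assumed true coefficients of -0.10 for Female and 0.20 for Educ); they are not taken from the lab dataset. The small `solve` and `ols` helpers are written out only to keep the example self-contained; in the lab, GRETL does this step.

```python
# Hypothetical toy data: (female, education level). Women have less education.
rows = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 1), (1, 1), (1, 2), (1, 3)]

# Assumed "true" coefficients used to build log wages without noise
B0, B_FEMALE, B_EDUC = 4.0, -0.10, 0.20
log_wage = [B0 + B_FEMALE * f + B_EDUC * e for f, e in rows]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Simple regression: constant + Female (education omitted)
simple = ols([[1.0, f] for f, _ in rows], log_wage)
# Multiple regression: constant + Female + Educ
multiple = ols([[1.0, f, e] for f, e in rows], log_wage)

print(f"Female coefficient, simple regression:   {simple[1]:.3f}")
print(f"Female coefficient, multiple regression: {multiple[1]:.3f}")
```

In this toy setup the simple-regression coefficient equals -0.10 + 0.20 × (1.75 - 2.5) = -0.25: the direct effect plus the education effect times the gender gap in mean education. Controlling for education recovers the assumed direct effect of -0.10.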

4. GRETL: What We Test

(Linking Theory to Lab)

A. Residual Analysis: Detecting Omitted Variables

  • If residuals correlate with education → education was omitted!
  • If residuals correlate with part-time → part-time was omitted!
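A minimal Python sketch of this residual check, again on hypothetical noise-free toy data (the coefficients 4.0, -0.10 and 0.20 are assumptions for illustration, not lab estimates). It runs the simple regression on Female only and then correlates the residuals with the omitted education variable.

```python
from math import sqrt

# Hypothetical toy data: (female, education level); log wage built without noise
rows = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 1), (1, 1), (1, 2), (1, 3)]
log_wage = [4.0 - 0.10 * f + 0.20 * e for f, e in rows]


def mean(values):
    return sum(values) / len(values)


def corr(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# With a single dummy regressor, the OLS fitted value is the group mean,
# so each residual is the deviation from the worker's own group mean.
group_mean = {
    g: mean([w for (f, _), w in zip(rows, log_wage) if f == g]) for g in (0, 1)
}
residuals = [w - group_mean[f] for (f, _), w in zip(rows, log_wage)]

female = [f for f, _ in rows]
educ = [e for _, e in rows]

corr_resid_educ = corr(residuals, educ)      # omitted variable
corr_resid_female = corr(residuals, female)  # included regressor

print(f"corr(residuals, Educ)   = {corr_resid_educ:.3f}")
print(f"corr(residuals, Female) = {corr_resid_female:.3f}")
```

The residuals are uncorrelated with the included regressor by construction, but strongly correlated with education, which signals that education belongs in the model.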

Summary

| Concept | Simple Regression | Multiple Regression |
|---|---|---|
| What it measures | Total gender gap | Partial (direct) effect |
| Use case | Policy discussions | Legal discrimination |
| Main problem | Omitted variable bias | Requires more data |
| Gender gap interpretation | “Women earn 22% less” | “Women earn X% less for the same qualifications” |

Lab Exercise: Multiple Regression in GRETL

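Task: Estimate the model below. The equation reports the estimated coefficients; Age is also entered as a control in step 2, but its coefficient is not shown here.
\[ \log(\text{Wage}) = 4.27 - 0.09\text{Female} + 0.22\text{Educ} - 0.14\text{Parttime} + e \]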
Steps:
1. Load dataset in GRETL.
2. Model > Ordinary Least Squares > log(Wage) ~ Female + Age + Educ + Parttime.

Regression Outcomes Discussion:
| Variable | Coeff. | Interpretation (Relative Effect) |
|---|---|---|
| Female | -0.095 | Women earn 9% less (p=0.000) |
| Educ | 0.219 | +25% per level (p=0.000) |
| Parttime | -0.142 | -13% (p=0.000) |

Key Points:

  • Gender effect shrinks from -22% (simple) to -9% (multiple regression).
  • Education and part-time are highly significant confounders.

Model Extensions & Testing

5. Absolute vs. Relative Effects

Mathematical Interpretation:

  • Log-wage model (relative effects):
    \[ \beta_j = \frac{\partial \log(\text{Wage})}{\partial x_j} = \frac{\partial \text{Wage}}{\partial x_j} \cdot \frac{1}{\text{Wage}}, \qquad 100 \cdot \beta_j \approx \% \Delta \text{Wage} \]
    • e.g., \(e^{0.219} - 1 \approx 0.245\), about a 25% wage increase per education level.
  • Linear wage model (absolute effects):
    \[ \text{Wage} = -77.87 - 2.12\text{Female} + 29.47\text{Educ} + \dots \]
    • Education adds $29.47 per level (vs. 25% in the log model).

Conceptual Discussion:

  • “Use log models for % or relative effects, linear for $ effects.”
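A short Python sketch of the conversion from log-wage coefficients to percentage effects, using the coefficients reported in the regression table. The helper name `pct_effect` is just an illustrative choice. It compares the quick approximation 100·β with the exact value 100·(e^β - 1).

```python
from math import exp


def pct_effect(beta):
    """Exact percentage change in wage implied by a log-wage coefficient."""
    return 100 * (exp(beta) - 1)


# Coefficients from the log-wage regression table
coefficients = {"Female": -0.095, "Educ": 0.219, "Parttime": -0.142}

for name, beta in coefficients.items():
    print(f"{name}: 100*beta = {100 * beta:.1f}%, exact = {pct_effect(beta):.1f}%")
```

The two numbers are close for small coefficients and drift apart as the coefficient grows, which is why the exact formula is used for the larger education effects.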

6. Testing Education Effects

Task: Test whether the returns to education are constant across levels. Using educ as a single numeric regressor forces the relationship to be linear: it assumes that going from level 1→2 has the same effect as going from level 2→3 or from 3→4. That is, each additional education level adds the same wage increase. That is a strong assumption, and it is often wrong.

  • Dummies allow each level to have its own effect, without any constraint. That’s the unrestricted model.

Steps:

  1. Create dummies (DE2, DE3, DE4) for education levels.

    • Add → Dummies for selected discrete variable.
    • Select education as the variable.
    • Choose “Create dummies for all levels except the first”.
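For readers who want to see what this menu step produces, here is a minimal Python sketch. The six education values and the helper `make_dummies` are hypothetical; GRETL creates the columns for you. Level 1 is dropped so that it serves as the reference category.

```python
# Hypothetical education levels for six workers (1 = lowest, 4 = highest)
educ = [1, 2, 4, 3, 2, 1]


def make_dummies(levels, reference=1):
    """Return one 0/1 column per level, skipping the reference level."""
    categories = sorted(set(levels) - {reference})
    return {f"DE{c}": [int(v == c) for v in levels] for c in categories}


dummies = make_dummies(educ)
for name, column in dummies.items():
    print(name, column)
```

A worker at level 1 has zeros in all three columns, which is why the level-1 wage is absorbed by the constant.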

  2. Run OLS with all dummies and other controls.
    \[ \log(\text{Wage}) = 4.52 - 0.08\text{Female} + 0.03\text{Age} + 0.11\text{DE2} + 0.34\text{DE3} + 0.73\text{DE4} - 0.15\text{Parttime} + e \]

    • Reference category: Education level 1 (lowest). Its effect is captured in the constant (4.52).

    • DE2 coefficient (0.11): Level 2 vs. level 1 → 0.11 log points → about 12% higher wage (since e^0.11 ≈ 1.12).

    • DE3 coefficient (0.34): Level 3 vs. level 1 → 0.34 log points → about 40% higher wage (since e^0.34 ≈ 1.40).

    • DE4 coefficient (0.73): Level 4 vs. level 1 → 0.73 log points → about 108% higher wage (more than double!).

But the incremental gains are:

  • Level 1 → 2: +12%

  • Level 2 → 3: 0.34 - 0.11 = 0.23 log points → about +26%. Subtracting the two percentage effects directly would be misleading, because percentage effects compound; take the difference in log points instead.

  • Level 3 → 4: 0.73 - 0.34 = 0.39 log points → about +48%.

  • So the returns increase as education level rises – the highest level gives a huge jump.
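The incremental gains can be reproduced with a few lines of Python. The coefficients are the dummy estimates from the model above; the dictionary layout is only an illustrative choice.

```python
from math import exp

# Dummy coefficients in log points; level 1 is the reference category (0)
coef = {1: 0.0, 2: 0.11, 3: 0.34, 4: 0.73}

steps = {}
for level in (2, 3, 4):
    diff = coef[level] - coef[level - 1]   # step gain in log points
    steps[level] = 100 * (exp(diff) - 1)   # step gain in percent
    print(f"Level {level - 1} -> {level}: {diff:.2f} log points, {steps[level]:.1f}%")
```

Working in log points first and converting at the end avoids the mistake of subtracting percentages that are measured against different bases.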

The F-test: Testing if the “step pattern” is just a straight line

  • The null hypothesis (linear model) says: The increase from 1→2 = some amount δ.

  • Then 1→3 must be 2δ, and 1→4 must be 3δ.

  • In terms of dummy coefficients: β_DE3 = 2 × β_DE2 and β_DE4 = 3 × β_DE2.

  • If that’s true, then the three dummy coefficients lie exactly on a straight line when plotted against education level.

  • The F-test quantifies whether the deviation from that line is larger than what random chance would produce. Here F ≈ 8.10, well above the 5% critical value of about 3.0 for 2 restrictions, so we reject linearity.
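Before running the F-test, a rough eyeball check is to compare the estimated dummy coefficients with the values a straight line would imply. The Python sketch below takes the 1→2 step (0.11) as δ, which is a simplifying assumption: the restricted model estimates its own slope, and only the F-test accounts for sampling error.

```python
# Estimated dummy coefficients (log points relative to level 1)
b_de2, b_de3, b_de4 = 0.11, 0.34, 0.73

# Values implied by the linear restriction if the 1->2 step were delta
implied_de3 = 2 * b_de2
implied_de4 = 3 * b_de2

gap_de3 = b_de3 - implied_de3
gap_de4 = b_de4 - implied_de4

print(f"DE3: estimated {b_de3:.2f}, implied {implied_de3:.2f}, gap {gap_de3:.2f}")
print(f"DE4: estimated {b_de4:.2f}, implied {implied_de4:.2f}, gap {gap_de4:.2f}")
```

Both estimates sit above the implied line, and the gap widens at the top level, which matches the pattern of increasing returns.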

Why does this matter for policy?

  • If returns were constant, encouraging any additional education would yield the same benefit. But here, the biggest reward comes from reaching the highest level (e.g., a university degree vs. high school). Policies that focus on getting people from level 3 to level 4 (e.g., university access) may be more effective than those that only push from level 1 to 2.

Summary:

  • Dummies let each education level have its own wage effect.

  • The pattern shows increasing returns: the jump from high school to college is much larger than from primary to secondary.

  • The F-test confirms that this pattern is not a straight line – it’s steeper at the top.

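  3. F-test:
    • \(H_0\): Linear education effects (\(\beta_{DE3} = 2\beta_{DE2}\), \(\beta_{DE4} = 3\beta_{DE2}\)).
    • Compute \(F = \frac{(R^2_u - R^2_r)/q}{(1 - R^2_u)/(n - k)} = \frac{(0.39 - 0.37)/2}{(1-0.39)/494} \approx 8.10\), where \(R^2_u\) and \(R^2_r\) are the R² values of the unrestricted (dummy) and restricted (linear) models and \(q = 2\) is the number of restrictions. This exceeds the 5% critical value of about 3.0 for \(F(2, 494)\).
    • Reject \(H_0\): Returns are non-linear.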
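The F computation can be checked with a small Python function. The R² values (0.39 for the dummy model, 0.37 for the linear model), the 2 restrictions and the 494 residual degrees of freedom are taken from the formula above; the function and argument names are illustrative.

```python
def f_statistic(r2_unrestricted, r2_restricted, q, df_resid):
    """F-statistic for q linear restrictions, computed from the two R² values."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / df_resid
    return numerator / denominator


f_value = f_statistic(0.39, 0.37, q=2, df_resid=494)
print(f"F = {f_value:.3f}")
```

Because the R² values are rounded to two decimals, the result is itself approximate; GRETL's own restriction test works from the unrounded sums of squares.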

Wage Effects by Education Level:
- Level 1 → 2: +12%
- Level 2 → 3: +26%
- Level 3 → 4: +48%

Discussion:
- “Higher education yields increasing returns, especially at top levels.”

Lab Session on Nonlinear Regression Functions

Introduction

This lab session focuses on detecting and modeling nonlinear relationships in regression analysis using the California Test Score dataset. We’ll explore the case where the effect of a predictor X₁ on Y depends on the value of X₁ itself.

1. Visualizing Nonlinearity: Test Scores vs. District Income

Scatterplot with Linear OLS Regression

Let’s begin by examining the relationship between test scores (Y) and district average income (X) with a simple linear regression:

View → Graph Specified Vars → X-Y Scatter

Observation: The linear fit doesn’t capture the apparent curvature in the data - most points are below the line at very low and very high incomes, but above the line in the middle range.

2. Quadratic Regression Model

Specification

To model this curvature, we’ll estimate a quadratic regression: TestScoreᵢ = β₀ + β₁Incomeᵢ + β₂Incomeᵢ² + uᵢ

avginc → Add → square of selected variables
Model → Ordinary Least Squares → testscr avginc avginc_sq

Estimated equation: TestScore = 607.3 + 3.85 Income - 0.0423 Income²

Testing Nonlinearity

We can test whether the quadratic term is needed:

H₀: β₂ = 0 vs. H₁: β₂ ≠ 0

The large t-statistic (-8.81) and small p-value (<0.01%) lead us to reject H₀: the quadratic term improves the model.

Interpreting Effects

The effect of a change in income depends on the initial income level (income is measured in thousands of dollars):

  1. Increase from $10K to $11K: ΔTestScore = [607.3 + 3.85×11 - 0.0423×11²] - [607.3 + 3.85×10 - 0.0423×10²] = 2.96 points

  2. Increase from $40K to $41K: ΔTestScore = [607.3 + 3.85×41 - 0.0423×41²] - [607.3 + 3.85×40 - 0.0423×40²] = 0.42 points

Key Insight: A $1,000 increase has a larger effect in poorer districts than in wealthy ones.
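The two calculations above can be reproduced in Python. The coefficients come from the estimated quadratic equation; the function names are illustrative, and income is entered in thousands of dollars.

```python
def predicted_score(income):
    """Fitted test score from the estimated quadratic (income in $1,000s)."""
    return 607.3 + 3.85 * income - 0.0423 * income ** 2


def effect_of_increase(start, step=1.0):
    """Predicted change in test score when income rises from start to start + step."""
    return predicted_score(start + step) - predicted_score(start)


for start in (10, 40):
    print(f"{start} -> {start + 1}: {effect_of_increase(start):.2f} points")
```

Changing `start` or `step` makes it easy to evaluate the effect at other income levels that are meaningful for the discussion.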

3. General Approach to Modeling Nonlinearities (Recipe)

  1. Identify possible nonlinearity: Examine scatterplots and consider theoretical relationships
  2. Specify nonlinear function: Start with quadratic terms
  3. Test against linear model: Use t-tests on nonlinear terms
  4. Plot estimated function: Visualize fit
  5. Interpret effects: Calculate effects at meaningful X values

4. Cubic Regression Model

For more flexibility, we can estimate a cubic model: TestScoreᵢ = β₀ + β₁Incomeᵢ + β₂Incomeᵢ² + β₃Incomeᵢ³ +uᵢ

Add → Define new variable → avginc_cube = avginc^3

Estimated equation: TestScore = 600.1 + 5.02 Income - 0.096 Income² + 0.00069 Income³
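To see what the cubic fit implies, the same kind of calculation can be run on the estimated cubic equation. This Python sketch only evaluates the fitted curve at two income levels chosen for illustration; it does not replace the significance test on the cubic term.

```python
def predicted_score_cubic(income):
    """Fitted test score from the estimated cubic (income in $1,000s)."""
    return 600.1 + 5.02 * income - 0.096 * income ** 2 + 0.00069 * income ** 3


def cubic_effect(start, step=1.0):
    """Predicted change in test score when income rises from start to start + step."""
    return predicted_score_cubic(start + step) - predicted_score_cubic(start)


for start in (10, 40):
    print(f"{start} -> {start + 1}: {cubic_effect(start):.2f} points")
```

The cubic curve still shows a much smaller gain at high income than at low income, so the qualitative conclusion is the same under this specification.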

Testing Higher-Order Terms

H₀: β₃ = 0 vs. H₁: β₃ ≠ 0

Polynomial Modeling Recipe

  1. Start with quadratic (X²) terms
  2. Add cubic (X³) if theory suggests or data shows more complexity
  3. Test significance of highest-order term
  4. If insignificant, remove and use lower-order model
  5. Continue until highest-order term is significant

Conclusion

We’ve explored several approaches to modeling when the effect of X₁ on Y depends on X₁’s value: