Session Overview


Session 1: Exploring the Gender Wage Gap (90 minutes)

1. Learning Objectives for Session 1

  • Load and inspect the Wage Gender dataset.
  • Compare raw wage vs. log(wage) distributions.
  • Visualize the gender wage gap using boxplots.
  • Run a simple linear regression of log(wage) on a female indicator.
  • Interpret the coefficient as a percentage wage gap.
  • Identify potential confounders (education, part-time work).
  • Discuss the difference between total and partial effects.

2. Data Loading and Initial Inspection (15 min)

2.1 Load the dataset in GRETL

  • Open GRETL.
  • File → Open data → Import → Excel (the data file is an .xls workbook, not a CSV file).
  • Select the Wage Gender.xls file.
  • Name variables: obs, wage, female, age, educ, parttime.

Variable definitions:

  • wage: hourly wage (in some currency unit)
  • female: 1 if woman, 0 if man
  • age: age in years
  • educ: education level (1 = low, 2 = medium, 3 = high, 4 = very high)
  • parttime: 1 if part-time worker, 0 if full-time

2.2 Summary statistics

GRETL: View → Summary statistics (select all variables).

  • Mean wage, standard deviation.
  • Proportion of females (mean of female).
  • Distribution of educ and parttime by gender.

Exercise:

  • Compute mean wage for men and women separately:
  • View → Summary statistics → check “by group” → select female as group variable.

Expected observation:

  • Women are less educated on average.
  • Women are more likely to work part-time.
  • Women earn less on average than men.
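
The grouped comparison can also be sketched outside GRETL. The following is a minimal Python illustration on hypothetical toy rows (not the Wage Gender data); the helper name `mean_by_group` is an assumption made for this example.

```python
from statistics import mean

# Hypothetical toy rows: (hourly wage, female). NOT the Wage Gender data.
rows = [
    (20.0, 0), (24.0, 0), (30.0, 0),
    (15.0, 1), (18.0, 1), (21.0, 1),
]

def mean_by_group(pairs):
    """Return {group: mean value} for a list of (value, group) pairs."""
    groups = {}
    for value, group in pairs:
        groups.setdefault(group, []).append(value)
    return {group: mean(values) for group, values in groups.items()}

wage_means = mean_by_group(rows)
raw_gap = wage_means[1] - wage_means[0]  # women's mean minus men's mean
print(wage_means, raw_gap)
```

A negative `raw_gap` corresponds to the third expected observation: women earn less on average.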

3. Raw Wage vs. Log(Wage) Distributions (20 min)

3.1 Why use log(wage)?

Intuitive explanation:

  • Wages are typically right‑skewed (a few people earn very high wages).
  • Log transformation compresses the long right tail, bringing the distribution closer to normal, which suits the usual assumptions behind regression inference.
  • More importantly: In economics, effects are often multiplicative (e.g., a 10% raise, not a $2 raise). Using log(wage) allows us to interpret coefficients as percentage differences.

Create log(wage) in GRETL:

  • Add → Logs of selected variables → select wage → natural log.
  • New variable l_wage appears.
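
Both arguments for the log can be checked on a small example. The sketch below uses hypothetical right-skewed toy wages (not the Wage Gender data) and a hand-written `skewness` helper, which is an assumption of this example rather than a GRETL function.

```python
import math

# Hypothetical right-skewed toy wages; NOT the Wage Gender data.
wages = [8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 18.0, 40.0]
l_wages = [math.log(w) for w in wages]  # natural log, like GRETL's l_wage

def skewness(values):
    """Third standardized moment (population version); > 0 means right-skewed."""
    n = len(values)
    avg = sum(values) / n
    m2 = sum((v - avg) ** 2 for v in values) / n
    m3 = sum((v - avg) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

skew_raw = skewness(wages)
skew_log = skewness(l_wages)

# A 10% raise shifts log(wage) by the same amount at every wage level.
log_shift_low = math.log(1.10 * 10.0) - math.log(10.0)
log_shift_high = math.log(1.10 * 40.0) - math.log(40.0)

print(skew_raw, skew_log, log_shift_low, log_shift_high)
```

On these toy numbers the logged series is less skewed than the raw one, and the two log shifts coincide, which is why a coefficient in a log(wage) regression reads as a percentage difference.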

3.2 Compare histograms

Instructions:

  • Select wage, then Variable → Frequency distribution → check “Test against normal distribution”.
  • Repeat for l_wage.

Discussion questions:

  • Which distribution looks more symmetric?
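  • Why do we care about normality in regression? (Answer: normally distributed errors matter for exact inference in small samples; in large samples OLS t-tests are approximately valid without them. The log form also often reduces heteroskedasticity.)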

3.3 Side-by-side boxplots

GRETL: View → Graph Specified Vars → Factorized Box-plots → check “Draw boxplots for each level of a categorical variable” → select wage as the variable to plot and female as the factor.

Observe:

  • Median wage for men (higher) vs. women (lower).
  • Spread: Are women’s wages more or less variable?
  • Outliers: Do both groups have high earners?

Expected findings:

  • Men’s median wage is noticeably higher.
  • Women have more observations in the lower part of the distribution.

4. First Regression: Simple Log-Wage on Female (20 min)

4.1 Run the regression

Model:
\[ \ln(\text{wage}_i) = \beta_1 + \beta_2 \text{female}_i + \varepsilon_i \]

GRETL: Model → Ordinary Least Squares

  • Dependent variable: l_wage
  • Independent variables: female
  • Click OK.
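
With a single 0/1 regressor, OLS has a simple reading: the intercept is the mean of l_wage for men, and the slope is the difference in mean l_wage between women and men. The sketch below checks this on hypothetical toy rows (not the Wage Gender data); `simple_ols` is a helper written for this example.

```python
import math
from statistics import mean

# Hypothetical toy rows: (hourly wage, female). NOT the Wage Gender data.
rows = [
    (20.0, 0), (24.0, 0), (30.0, 0),
    (15.0, 1), (18.0, 1), (21.0, 1),
]
y = [math.log(wage) for wage, _ in rows]    # l_wage
x = [float(female) for _, female in rows]   # female dummy

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b1 + b2 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b1, b2 = simple_ols(x, y)

# Group means of l_wage, computed directly for comparison.
mean_men = mean([yi for xi, yi in zip(x, y) if xi == 0.0])
mean_women = mean([yi for xi, yi in zip(x, y) if xi == 1.0])
print(b1, b2, mean_men, mean_women - mean_men)
```

The printed pairs agree up to rounding error, so the regression adds nothing beyond the grouped means at this stage; its value appears once further regressors are included.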

4.2 Interpret the coefficient on female

  • The estimated coefficient \(\hat{\beta}_2\) on female is -0.25 log points.
  • To convert it to a percentage difference: \(100 \times (e^{\hat{\beta}_2} - 1)\%\).
  • Compute: \(e^{-0.25} \approx 0.7788\), so \((0.7788 - 1) \times 100 = -22.12\%\).

Interpretation:
On average, women earn about 22% less than men in this sample.
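
The conversion is easy to reproduce. This short Python sketch applies the formula to the coefficient quoted above; the function name `log_points_to_percent` is an assumption made for this example.

```python
import math

def log_points_to_percent(beta):
    """Exact percentage difference implied by a log-point coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)

beta_female = -0.25  # coefficient quoted in the text
exact_gap = log_points_to_percent(beta_female)
approx_gap = 100.0 * beta_female  # quick reading, close only for small |beta|
print(round(exact_gap, 2), approx_gap)  # -22.12 -25.0
```

Reading the coefficient directly as -25% overstates the gap here; the shortcut is reasonable only when the coefficient is close to zero.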

4.3 Critical discussion – Is this discrimination?

  • “Does this simple regression prove that employers pay women less for the same work?”
  • “What other factors might explain part of this gap?”

Expected answers:

  • Education levels may differ.
  • Women may work part-time more often.
  • Age/experience could differ.
  • Occupation choice is not in the model.

Key insight: The simple regression measures the total wage gap (including all differences), not the causal effect of gender holding other factors constant.


5. Identifying Key Confounders (20 min)

5.1 Education levels by gender

GRETL: View → Summary statistics → factorized by female, select educ.

Create a cross‑tabulation manually or use Data → Frequency tables → two variables: educ and female.

Expected pattern:

  • Lower education levels (educ=1) more frequent among women.
  • Higher education levels (educ=4) less frequent among women.

Discussion:

  • “If women have less education on average, and education raises wages, then part of the raw wage gap is due to education differences, not discrimination.”
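
A cross-tabulation is just a count of observations per (educ, female) cell. The following minimal Python sketch builds one from hypothetical toy rows (not the Wage Gender data); `column_shares` is a helper name assumed for this example.

```python
from collections import Counter

# Hypothetical toy rows: (educ, female). NOT the Wage Gender data.
rows = [(1, 1), (1, 1), (2, 1), (3, 1), (2, 0), (3, 0), (4, 0), (4, 0)]

counts = Counter(rows)  # counts[(educ, female)] = number of workers in that cell

def column_shares(counts, female):
    """Share of each education level within one gender (shares sum to 1)."""
    total = sum(n for (_, f), n in counts.items() if f == female)
    return {
        level: n / total
        for (level, f), n in sorted(counts.items())
        if f == female
    }

shares_women = column_shares(counts, 1)
shares_men = column_shares(counts, 0)
print(shares_women, shares_men)
```

Comparing shares within each gender, rather than raw counts, keeps the comparison valid when the sample contains unequal numbers of men and women.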

5.2 Part-time work by gender

GRETL: Summary statistics for parttime by female.

Expected result:

  • Proportion of part‑time workers among women: ~56%
  • Among men: ~22%

Discussion:

  • Part‑time jobs typically pay less per hour (fewer benefits, less seniority).
  • Women may choose part‑time work for family reasons, but the wage penalty should not be ignored.

5.3 Visualizing confounder relationships

Factorized box-plots:

  • l_wage vs. educ – shows positive relationship.
  • l_wage vs. parttime – shows negative relationship.

GRETL: View → Graph Specified Vars → Factorized Box-plots

Summary table:

Variable  | Men (female=0) | Women (female=1) | Relationship with wage
----------|----------------|------------------|-----------------------
Education | Higher average | Lower average    | Positive
Part‑time | Lower share    | Higher share     | Negative

Conclusion: Both confounders likely make the raw gender gap larger than the gap between men and women with the same education and part‑time status. A multiple regression can estimate the direct effect of gender after controlling for these factors.


6. Wrap-up of Session 1 (5 min)

Key takeaways:

  • The simple regression shows a 22% total wage gap.
  • But women also have lower education and higher part‑time rates.
  • To isolate discrimination, we need to control for these confounders.

Preview of Session 2:
“We will diagnose omitted variable bias by analyzing residuals, then learn how multiple regression solves the problem.”


Session 2: Diagnosing Omitted Variable Bias (90 minutes)

1. Learning Objectives for Session 2

  • Understand the concept of omitted variable bias (OVB).
  • Compute residuals from the simple regression.
  • Analyze correlations between residuals and omitted variables.
  • Use GRETL to perform residual diagnostics.
  • Distinguish between total and partial effects.
  • Prepare for multiple regression in the next session.

2. Conceptual Framework: Total vs. Partial Effects (25 min)

2.1 Total effect (simple regression)

Definition: The total effect of gender on wage includes all pathways:

  • Direct discrimination (employers pay women less for same job)
  • Indirect effects via education, occupation, part‑time status, etc.

Formula:
\[ \text{Total gap} = \underbrace{\text{Direct effect}}_{\text{causal}} + \underbrace{\text{Indirect via confounders}}_{\text{non‑discriminatory}} \]

Policy use: Total gap measures overall inequality, useful for broad social policy.

2.2 Partial effect (multiple regression)

Definition: The effect of gender holding other variables constant (e.g., same education, same part‑time status).

Interpretation: The partial effect is often interpreted as the upper bound of discrimination (if all relevant confounders are controlled).

Legal use: Courts often require evidence of discrimination after accounting for qualifications.

2.3 Omitted Variable Bias (OVB) – intuitive explanation

For education:

  • If women have lower education (negative correlation between female and educ), and education increases wage (positive effect), then omitting educ causes the female coefficient to be more negative (biased downward).
  • Direction: sign of bias = sign(correlation between omitted variable and included variable) × sign(effect of omitted variable on outcome). Here: negative × positive = negative.

For part‑time:

  • Women more likely to work part‑time (positive correlation with female).
  • Part‑time reduces wage (negative effect).
  • Omitting parttime makes the female coefficient more negative (downward bias: positive × negative = negative).

Conclusion: The simple regression overestimates the gender gap (makes it look larger) because it fails to control for factors that are correlated with gender and also affect wages.
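
The sign rule can be made precise with the standard omitted variable bias identity. The notation here is an assumption introduced for illustration: \(\tilde{\beta}_2\) is the female coefficient from the simple regression, \(\hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4\) are the coefficients on female, educ and parttime in the regression that includes all three, and \(\hat{\delta}_{\text{educ}}\), \(\hat{\delta}_{\text{pt}}\) are the slopes from simple regressions of educ and parttime on female.

\[ \tilde{\beta}_2 = \hat{\beta}_2 + \hat{\beta}_3 \, \hat{\delta}_{\text{educ}} + \hat{\beta}_4 \, \hat{\delta}_{\text{pt}} \]

Because female is a 0/1 variable, each \(\hat{\delta}\) is simply the difference in group means (women minus men). With \(\hat{\beta}_3 > 0\) and \(\hat{\delta}_{\text{educ}} < 0\), and with \(\hat{\beta}_4 < 0\) and \(\hat{\delta}_{\text{pt}} > 0\), both added terms are negative, so \(\tilde{\beta}_2 < \hat{\beta}_2\): the simple regression shows a more negative gap.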


3. Hands‑on Residual Analysis (40 min)

3.1 What are residuals?

Residual \(e_i = \ln(\text{wage}_i) - \hat{\beta}_1 - \hat{\beta}_2 \text{female}_i\).

Interpretation: The part of log‑wage not explained by gender alone.

If the model is correctly specified, the residuals should show no systematic relationship with variables left out of the model. If residuals correlate with education or part‑time, that means those variables belong in the model.

3.2 Save residuals from the simple regression

GRETL steps:

  1. After running the regression (l_wage on female), click Save → Residuals.
  2. Name the new variable e_simple.
  3. Also save fitted values (optional).
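
What the saved residuals contain can be sketched in a few lines of Python. The example uses hypothetical toy rows (not the Wage Gender data) and relies on the fact that, with one 0/1 regressor, the OLS fitted value is the mean of l_wage in each gender group.

```python
import math
from statistics import mean

# Hypothetical toy rows: (hourly wage, female). NOT the Wage Gender data.
rows = [
    (20.0, 0), (24.0, 0), (30.0, 0),
    (15.0, 1), (18.0, 1), (21.0, 1),
]
l_wage = [math.log(w) for w, _ in rows]
female = [f for _, f in rows]

# For a regression on one dummy, the fitted value is the group mean of l_wage.
group_mean = {
    g: mean([y for y, f in zip(l_wage, female) if f == g]) for g in (0, 1)
}
fitted = [group_mean[f] for f in female]
e_simple = [y - y_hat for y, y_hat in zip(l_wage, fitted)]

# Residuals average to zero within each group, so they cannot correlate with female.
resid_mean_men = mean([e for e, f in zip(e_simple, female) if f == 0])
resid_mean_women = mean([e for e, f in zip(e_simple, female) if f == 1])
print(e_simple, resid_mean_men, resid_mean_women)
```

Since the residuals are unrelated to female by construction, any pattern they show against another variable points to something the model leaves out.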

3.3 Check correlation between residuals and education

Scatterplot:

  • View → Graph Specified Vars → X-Y scatter → X variable: educ, Y variable: e_simple. Add jitter.

Expected pattern: Positive correlation.

  • Higher education → positive residuals (model underpredicts wage for high‑educated).
  • Lower education → negative residuals.

Calculate correlation coefficient:
  • View → Summary statistics → select e_simple and educ → check “Correlation matrix”.

Expected correlation: roughly +0.53 (positive).

Interpretation:
“Residuals are not random – they increase with education. This means our model is missing education. Women have less education, so the model incorrectly attributes the lower wage from less education to gender.”

3.4 Check correlation between residuals and part‑time

Scatterplot: e_simple vs. parttime (part‑time is 0/1, so use boxplot or scatter with jitter).

Expected pattern: Negative relationship.

  • Part‑time workers → negative residuals (model overpredicts their wage because it doesn’t account for part‑time penalty).
  • Full‑time workers → positive or zero residuals.

Calculate correlation: should be negative (e.g., –0.2 to –0.3).

Interpretation:
“Residuals are lower for part‑time workers. Since women are more likely to work part‑time, the simple regression mistakenly treats part‑time wage penalty as part of the gender gap.”
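
Both residual correlations can be reproduced on a small constructed example. The sketch below uses hypothetical toy workers and an assumed noise-free wage rule (neither comes from the Wage Gender data), so only the signs of the correlations are meaningful, not their sizes.

```python
from statistics import mean

# Hypothetical toy workers: (female, educ, parttime). NOT the Wage Gender data.
workers = [
    (0, 2, 0), (0, 3, 0), (0, 4, 0), (0, 3, 1), (0, 4, 0), (0, 1, 0),
    (1, 1, 1), (1, 2, 1), (1, 2, 0), (1, 3, 1), (1, 1, 0), (1, 3, 0),
]
# Assumed data-generating rule, for illustration only (no noise term).
l_wage = [2.0 - 0.10 * f + 0.20 * e - 0.15 * p for f, e, p in workers]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

# Residuals of l_wage on female alone = deviations from the gender group mean.
group_mean = {
    g: mean([y for y, (f, _, _) in zip(l_wage, workers) if f == g])
    for g in (0, 1)
}
e_simple = [y - group_mean[f] for y, (f, _, _) in zip(l_wage, workers)]

female = [f for f, _, _ in workers]
educ = [e for _, e, _ in workers]
parttime = [p for _, _, p in workers]
corr_educ = pearson(e_simple, educ)
corr_parttime = pearson(e_simple, parttime)
corr_female = pearson(e_simple, female)
print(corr_educ, corr_parttime, corr_female)
```

The residuals correlate positively with educ and negatively with parttime, while their correlation with female is zero up to rounding, matching the expected patterns in sections 3.3 and 3.4.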

3.5 Joint correlation: residuals vs. both confounders

Multiple regression of residuals on educ and parttime:

  • Model → OLS → dependent: e_simple, independent: educ, parttime.

Expected output: Both coefficients significant, \(R^2\) around 0.2–0.3.

Interpretation:
“The omitted variables explain a substantial portion of the residuals. This confirms that the simple regression suffers from omitted variable bias.”


4. Why Residual Analysis Matters (10 min)

Key message:

  • “If your residuals correlate with any variable not in the model, your model is incomplete.”
  • “Residual analysis is like a diagnostic checkup for your regression.”
  • “In real-world econometrics, you always check residuals before trusting your coefficients.”

Practical rule:

  • Plot residuals against each potential omitted variable. If you see a pattern, add that variable to the model.

5. Summary Table: Simple vs. Multiple Regression (10 min)

Concept | Simple Regression (l_wage ~ female) | Multiple Regression (l_wage ~ female + educ + parttime)
--------|-------------------------------------|---------------------------------------------------------
What it measures | Total wage gap (including confounders) | Direct gender effect (holding education and part‑time constant)
Interpretation of female coefficient | –22% (biased downward) | Expected to be smaller (closer to zero)
Key limitation | Omitted variable bias | Requires that no other confounders exist
Use case | Describing overall inequality | Estimating discrimination (partial effect)
Residual diagnostics | Correlated with educ and parttime | Uncorrelated with included variables by construction; check against remaining candidates (e.g., age)
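
The contrast between the two columns can be sketched numerically. The example below uses hypothetical toy workers and an assumed noise-free wage rule (not the Wage Gender data), and a small `ols` helper written for this illustration that solves the normal equations by Gauss-Jordan elimination; it is a teaching sketch, not how GRETL computes OLS.

```python
# Hypothetical toy workers: (female, educ, parttime). NOT the Wage Gender data.
workers = [
    (0, 2, 0), (0, 3, 0), (0, 4, 0), (0, 3, 1), (0, 4, 0), (0, 1, 0),
    (1, 1, 1), (1, 2, 1), (1, 2, 0), (1, 3, 1), (1, 1, 0), (1, 3, 0),
]
# Assumed noise-free rule, for illustration only.
l_wage = [2.0 - 0.10 * f + 0.20 * e - 0.15 * p for f, e, p in workers]

def ols(columns, y):
    """OLS coefficients (constant first) from the normal equations X'X b = X'y."""
    X = [[1.0] + [float(v) for v in row] for row in zip(*columns)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

female = [f for f, _, _ in workers]
educ = [e for _, e, _ in workers]
parttime = [p for _, _, p in workers]

short_fit = ols([female], l_wage)                  # l_wage ~ female
long_fit = ols([female, educ, parttime], l_wage)   # l_wage ~ female + educ + parttime
d_educ = ols([female], educ)[1]                    # educ gap, women minus men
d_pt = ols([female], parttime)[1]                  # part-time gap, women minus men

# Simple-regression coefficient rebuilt from the long regression and the gaps.
implied_short = long_fit[1] + long_fit[2] * d_educ + long_fit[3] * d_pt
print(short_fit[1], long_fit[1], implied_short)
```

On this constructed data the female coefficient is more negative in the simple regression than in the multiple regression, and the difference equals the education and part‑time terms exactly.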

Appendix: GRETL Commands Quick Reference

Task | GRETL Menu Path
-----|----------------
Load CSV / XLSX / XLS | File → Open data → Import → CSV or …
Summary statistics | View → Summary statistics
Grouped stats | View → Summary statistics → factorized
Histogram | Variable → Frequency distribution → Normal distribution
Boxplot | View → Graph Specified Vars → Factorized Box-plots → select categorical variable (factor)
OLS regression | Model → Ordinary Least Squares
Save residuals | From model window: Save → Residuals
Correlation matrix | View → Summary statistics → check “Correlation matrix”
Create log variable | Add → Logs of selected variables → natural log

Discussion Questions for Students

  1. Why do economists prefer log(wage) over wage as the dependent variable?
  2. The simple regression shows women earn 22% less. Does this mean employers are discriminatory? Why or why not?
  3. How would the gender coefficient change if we also controlled for age (experience)? Make a hypothesis.
  4. What other omitted variables might bias the gender coefficient? (e.g., occupation, industry, union status)
  5. If after adding all reasonable controls the gender coefficient is still –5%, how would you interpret that?