Who Earns More?

Predicting High Income Using Human Capital and Labor Market Characteristics

Sude Arslan & Selhan Çil

2026-06-04

Motivation & Economic Question

“To what extent do human capital and labor market characteristics predict membership in the high-income group earning more than $50,000 per year?”

Motivation

  • Income inequality is one of the central concerns of labor economics
  • Human capital theory predicts that education and experience drive earnings
  • Occupation and investment income add further explanatory power

Our Approach

  • Dataset: Adult Census Income (UCI, 1994)
  • 30,162 individuals after cleaning
  • Binary classification: >$50K vs ≤$50K
  • Method: Logistic Regression

Dataset at a Glance

Variable                   | Type        | Role
age                        | Numeric     | Human capital proxy
education_num              | Numeric     | Human capital
occupation                 | Categorical | Labor market position
hours_per_week             | Numeric     | Work intensity
gender                     | Categorical | Demographic
capital_gain, capital_loss | Numeric     | Investment income

Outcome Variable

high_income

1 → earns > $50K/year

0 → earns ≤ $50K/year

30,162 observations

Binary classification problem

Stage 1: Distribution Analysis

75.1% earn ≤ $50K

24.9% earn > $50K (sample proportion p = 0.249)

Theoretical Distribution: Bernoulli(p)

The outcome is binary — each individual either belongs to the high-income group or does not.

Distribution | Appropriate? | Reason
Bernoulli    | ✅           | Binary outcome
Normal       | ❌           | Continuous outcomes only
Log-normal   | ❌           | Positively skewed continuous outcomes
Exponential  | ❌           | Durations / waiting times
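As a small worked example of the Bernoulli(p) description above, the sketch below computes the implied mean and variance from the reported sample proportion p = 0.249. It also computes the accuracy of a naive rule that always predicts the majority class, which is a useful reference point when reading accuracy figures. The variable names are illustrative and not taken from our analysis code.

```r
# Sample share of high earners reported above
p <- 0.249

# Moments of a Bernoulli(p) outcome
bern_mean <- p            # E[Y]   = p
bern_var  <- p * (1 - p)  # Var[Y] = p(1 - p)

# Accuracy of a naive rule that always predicts the majority class (<= $50K)
baseline_accuracy <- max(p, 1 - p)

cat("mean:", bern_mean,
    "| variance:", round(bern_var, 3),
    "| majority-class accuracy:", round(baseline_accuracy, 3), "\n")
```

Any classifier should beat the majority-class accuracy of about 0.751 to be informative.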

Stage 1: What the Data Reveals

Exploratory analysis shows clear patterns:

Associated with higher income

  • 📚 More years of education
  • 💼 Executive & professional roles
  • ⏰ More hours worked per week
  • 🎂 Older age — returns to experience
  • 💰 Higher capital gains & losses

Associated with lower income

  • 🌾 Agricultural & service occupations
  • 👩 Female — gender wage gap documented
  • 📉 Lower education levels
  • ⏱️ Part-time or fewer hours

These patterns are consistent with human capital theory and motivate the predictors used in our logistic regression models

Modeling Strategy

Reproducible Train/Test Split

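library(rsample)  # provides initial_split()
set.seed(465)     # fixes the random split so it is reproducible
data_split <- initial_split(data, prop = 0.8)  # 80% training, 20% test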
Split        | Observations
Training set | 24,574
Test set     | 6,144

80/20 split ensures honest evaluation on unseen data

Two Logistic Regression Models

Model 1 (Baseline): high_income ~ age + education_num + hours_per_week

Model 2 (Extended): high_income ~ age + education_num + hours_per_week + occupation + gender + capital_gain + capital_loss

Model 2 adds labor market, demographic, and investment income variables
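The nested structure of the two models can be sketched with base R's glm(). This is a minimal illustration on synthetic data, not our actual fitting code: the data frame sim, its simulated values, and the coefficients used to generate it are all hypothetical, and occupation and capital_loss are left out to keep the example short.

```r
set.seed(465)
n <- 2000

# Synthetic stand-in data (hypothetical; NOT the Adult Census sample)
sim <- data.frame(
  age            = round(runif(n, 18, 65)),
  education_num  = sample(1:16, n, replace = TRUE),
  hours_per_week = round(runif(n, 10, 70)),
  gender         = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  capital_gain   = rexp(n, rate = 1 / 1000)
)

# Hypothetical data-generating process, chosen only so the models have signal
eta <- -7 + 0.04 * sim$age + 0.25 * sim$education_num +
  0.03 * sim$hours_per_week + 1.0 * (sim$gender == "Male") +
  0.0003 * sim$capital_gain
sim$high_income <- rbinom(n, 1, plogis(eta))

# Baseline model: human capital and work intensity only
model_1 <- glm(high_income ~ age + education_num + hours_per_week,
               data = sim, family = binomial)

# Extended model: same terms plus demographic and investment income variables
model_2 <- update(model_1, . ~ . + gender + capital_gain)

# Likelihood-ratio test for the added terms (the models are nested)
print(anova(model_1, model_2, test = "Chisq"))
```

Because Model 1 is nested in Model 2, the extended model can never fit the training data worse in terms of deviance; the comparison that matters is performance on held-out data.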

Results: Model Comparison

Metric    | Model 1 | Model 2 | Gain (percentage points)
Accuracy  | 0.786   | 0.825   | +3.9
Precision | 0.622   | 0.720   | +9.8
Recall    | 0.334   | 0.477   | +14.3

Model 2 outperforms Model 1 across all metrics — we select Model 2 as our final model

Recall improved most: in relative terms, Model 2 correctly identifies about 43% more high-income individuals than Model 1 (recall 0.334 → 0.477)
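To make the metrics concrete, here is a small sketch of how accuracy, precision, and recall follow from predicted and true labels, followed by the gain arithmetic on the reported values. The function classification_metrics() and the toy vectors are illustrative assumptions, not part of our pipeline.

```r
# Accuracy, precision, and recall from 0/1 truth and 0/1 predictions
classification_metrics <- function(truth, predicted) {
  tp <- sum(predicted == 1 & truth == 1)  # true positives
  fp <- sum(predicted == 1 & truth == 0)  # false positives
  fn <- sum(predicted == 0 & truth == 1)  # false negatives
  tn <- sum(predicted == 0 & truth == 0)  # true negatives
  c(accuracy  = (tp + tn) / length(truth),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Toy labels (hypothetical): 3 high earners, 5 others
truth     <- c(1, 1, 1, 0, 0, 0, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 0, 0, 0)
toy_metrics <- classification_metrics(truth, predicted)
print(toy_metrics)

# Reported test-set metrics for the two models
reported_m1 <- c(accuracy = 0.786, precision = 0.622, recall = 0.334)
reported_m2 <- c(accuracy = 0.825, precision = 0.720, recall = 0.477)

gain_pp  <- 100 * (reported_m2 - reported_m1)      # absolute gain, percentage points
gain_rel <- 100 * (reported_m2 / reported_m1 - 1)  # relative gain, percent

print(round(rbind(gain_pp, gain_rel), 1))
```

The absolute recall gain is 14.3 percentage points, while the relative gain is roughly 43%; the two figures answer different questions and should not be mixed.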

Economic Interpretation of Coefficients

Factors increasing the probability of high income

Variable        | Coefficient (log-odds) | Significance
Education       | +0.25                  | ***
Age             | +0.04                  | ***
Exec-managerial | +0.77                  | ***
Male (gender)   | +1.27                  | ***
Capital gain    | +0.0003                | ***

Factors decreasing the probability of high income

Variable        | Coefficient (log-odds) | Significance
Farming/fishing | −1.29                  | ***
Other-service   | −1.04                  | ***
Handler/cleaner | −0.86                  | ***

All key variables are significant at p < 0.001 (Wald z-test; *** denotes p < 0.001)

Results are consistent with an education premium, returns to experience, and occupational wage differentials
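Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below applies exp() to the coefficients reported above. The vector names are illustrative, and the occupation effects are relative to whichever reference category the model used, which is not shown here.

```r
# Reported Model 2 coefficients (log-odds scale)
coefs <- c(education       = 0.25,
           age             = 0.04,
           exec_managerial = 0.77,
           male            = 1.27,
           capital_gain    = 0.0003,
           farming_fishing = -1.29,
           other_service   = -1.04,
           handler_cleaner = -0.86)

# Odds ratio: multiplicative change in the odds of high income
# for a one-unit increase, holding the other predictors fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# capital_gain is measured in dollars, so a $1,000 change is more readable
or_capital_gain_1000 <- exp(1000 * coefs[["capital_gain"]])

# Illustration on the probability scale: start from the sample proportion
# p = 0.249 and add one year of education
p_before <- 0.249
p_after  <- plogis(qlogis(p_before) + coefs[["education"]])

cat("odds ratio per $1,000 capital gain:", round(or_capital_gain_1000, 2), "\n")
cat("probability after one extra year of education:", round(p_after, 3), "\n")
```

For example, the education coefficient of 0.25 corresponds to about 28% higher odds of high income per additional year of education, and the male coefficient of 1.27 to odds roughly 3.6 times those of otherwise similar women.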

Cross-Validation: Is the Model Reliable?

5-Fold Cross-Validation on Model 2

Metric   | CV Mean | Std Error
Accuracy | 0.825   | 0.002

Each fold: ~24,000 training / ~6,000 validation

Small standard error → consistent performance across folds

CV Accuracy = 0.825

Test Accuracy = 0.825

Difference = 0.000

No overfitting detected

The model generalizes well — results are not specific to one random data split
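The mechanics of k-fold cross-validation can be sketched in base R as follows. This is an illustration on synthetic data under stated assumptions, not our actual resampling code: the data frame sim, the predictors x1 and x2, and the 0.5 classification threshold are hypothetical choices.

```r
set.seed(465)
n <- 1000

# Synthetic binary-outcome data (hypothetical)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 1.2 * sim$x1 + 0.8 * sim$x2))

# Randomly assign each row to one of k folds of near-equal size
k <- 5
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_accuracy <- sapply(1:k, function(i) {
  fit  <- glm(y ~ x1 + x2, data = sim[fold_id != i, ], family = binomial)
  prob <- predict(fit, newdata = sim[fold_id == i, ], type = "response")
  mean((prob > 0.5) == sim$y[fold_id == i])
})

# Summary across folds: mean and standard error of the mean
cv_mean <- mean(fold_accuracy)
cv_se   <- sd(fold_accuracy) / sqrt(k)

cat("CV accuracy:", round(cv_mean, 3), "| std error:", round(cv_se, 3), "\n")
```

The standard error here is the standard deviation of the fold accuracies divided by the square root of the number of folds, which is how a small value signals consistent performance across folds.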

Limitations

Data Limitations

1. Historical dataset: the data come from the 1994 US Census, and the labor market has changed substantially since then

2. Class imbalance: 75% of observations are ≤$50K, and a recall of 0.477 means the model still misses more than half of high earners

Methodological Limitations

3. Correlation, not causation: logistic regression identifies associations, not causal effects, and omitted variable bias may exist

4. Gender wage gap: the coefficient captures the gap but cannot explain the structural mechanisms behind it

Conclusion & Future Directions

Human capital and labor market characteristics predict high income with 82.5% accuracy — consistent with economic theory

What we would do differently

  • Use more recent data (post-2010)
  • Address class imbalance with oversampling
  • Add interaction terms: education × occupation

Future research question

Has the predictive power of education on high income changed over time, as returns to college degrees have shifted in the US labor market?

Thank you — we welcome any questions