Who Earns More?

Predicting High Income Using Human Capital and Labor Market Characteristics

Sude Arslan & Selhan Çil

2026-06-04

Motivation & Economic Question

“To what extent do human capital and labor market characteristics predict membership in the high-income group earning more than $50,000 per year?”

Motivation

Income inequality is one of the central concerns of labor economics
Human capital theory predicts that education and experience drive earnings
Occupation and investment income add further explanatory power

Our Approach

Dataset: Adult Census Income (UCI, 1994)
30,162 individuals after cleaning
Binary classification: >$50K vs ≤$50K
Method: Logistic Regression

Dataset at a Glance

Variable	Type	Role
`age`	Numeric	Human capital proxy
`education_num`	Numeric	Human capital
`occupation`	Categorical	Labor market position
`hours_per_week`	Numeric	Work intensity
`gender`	Categorical	Demographic
`capital_gain`, `capital_loss`	Numeric	Investment income

Outcome Variable

high_income

1 → earns > $50K/year

0 → earns ≤ $50K/year

30,162 observations

Binary classification problem

Stage 1: Distribution Analysis

75.1%

earn ≤ $50K

24.9%

earn > $50K, so the sample proportion is p̂ = 0.249

Theoretical Distribution: Bernoulli(p)

The outcome is binary — each individual either belongs to the high-income group or does not.

Distribution	Appropriate?
Bernoulli	✅ Binary outcome
Normal	❌ Continuous only
Log-normal	❌ Positive, right-skewed continuous
Exponential	❌ Duration/waiting time
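As a small worked check of the Bernoulli choice, the sketch below computes the implied mean, variance, and the standard error of the sample proportion. It uses only the two figures reported on the slides (p̂ = 0.249 and n = 30,162); the variable names and the normal-approximation interval are illustrative assumptions, not part of the original analysis.

```r
# Sample proportion and sample size as reported on the slides
p_hat <- 0.249
n     <- 30162

# Bernoulli(p): mean = p, variance = p * (1 - p)
bern_mean <- p_hat
bern_var  <- p_hat * (1 - p_hat)

# Standard error of the sample proportion and a
# normal-approximation 95% confidence interval
se_p  <- sqrt(bern_var / n)
ci_95 <- p_hat + c(-1, 1) * qnorm(0.975) * se_p

round(c(mean = bern_mean, variance = bern_var, se = se_p), 4)
round(ci_95, 3)
```

With a sample this large the interval is narrow, so the class shares shown above are estimated precisely.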

Stage 1: What the Data Reveals

Exploratory analysis shows clear patterns:

Associated with higher income

📚 More years of education
💼 Executive & professional roles
⏰ More hours worked per week
🎂 Older age — returns to experience
💰 Higher capital gains & losses

Associated with lower income

🌾 Agricultural & service occupations
👩 Female — gender wage gap documented
📉 Lower education levels
⏱️ Part-time or fewer hours

These patterns are consistent with human capital theory — motivating a logistic regression model

Modeling Strategy

Reproducible Train/Test Split

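library(rsample)  # provides initial_split()
set.seed(465)  # fixed seed so the split is reproducible
data_split <- initial_split(data, prop = 0.8)  # 80% training, 20% test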

Split	Observations
Training set	24,574
Test set	6,144

The 20% test set is held out from model fitting, so test metrics reflect performance on unseen data

Two Logistic Regression Models

Model 1 (Baseline): high_income ~ age + education + hours

Model 2 (Extended): high_income ~ age + education + hours + occupation + gender + capital_gain + capital_loss

Model 2 adds labor market and demographic variables
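The two specifications can be sketched in base R with `glm(..., family = binomial)`. The data below are simulated, not the Adult dataset; the data-generating coefficients are hypothetical, and `occupation` and `capital_loss` are left out only to keep the example short. The likelihood-ratio comparison at the end is one possible way to compare nested models and is not necessarily what the original analysis used.

```r
set.seed(465)  # seed reused from the slides; the data here are simulated

n <- 2000
sim <- data.frame(
  age            = round(runif(n, 18, 65)),
  education_num  = sample(1:16, n, replace = TRUE),
  hours_per_week = round(runif(n, 10, 60)),
  gender         = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  capital_gain   = rexp(n, rate = 1 / 1000)
)

# Hypothetical data-generating process, chosen only so both models have signal
eta <- -8 + 0.04 * sim$age + 0.25 * sim$education_num +
  0.03 * sim$hours_per_week + 1.0 * (sim$gender == "Male") +
  0.0003 * sim$capital_gain
sim$high_income <- rbinom(n, size = 1, prob = plogis(eta))

# Model 1: baseline human capital variables
m1 <- glm(high_income ~ age + education_num + hours_per_week,
          data = sim, family = binomial)

# Model 2: adds demographic and investment-income variables
m2 <- glm(high_income ~ age + education_num + hours_per_week +
            gender + capital_gain,
          data = sim, family = binomial)

# Nested models can be compared with a likelihood-ratio test and AIC
anova(m1, m2, test = "Chisq")
AIC(m1, m2)
```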

Results: Model Comparison

Metric	Model 1	Model 2	Gain (percentage points)
Accuracy	0.786	0.825	+3.9
Precision	0.622	0.720	+9.8
Recall	0.334	0.477	+14.3

Model 2 outperforms Model 1 across all metrics — we select Model 2 as our final model

Recall improved most: Model 2 correctly identifies about 43% more high-income individuals than Model 1 in relative terms (recall 0.477 vs. 0.334)
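For reference, the three metrics can be computed from true and predicted labels as in the sketch below. The helper `classification_metrics` and the toy label vectors are hypothetical and are not the study's predictions; only the two recall values at the end come from the table above. The last lines show why the absolute and relative recall gains differ.

```r
# Accuracy, precision and recall from 0/1 vectors of true and predicted labels
classification_metrics <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)  # true positives
  fp <- sum(pred == 1 & truth == 0)  # false positives
  fn <- sum(pred == 0 & truth == 1)  # false negatives
  tn <- sum(pred == 0 & truth == 0)  # true negatives
  c(accuracy  = (tp + tn) / length(truth),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Toy labels for illustration only
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred  <- c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0)
toy   <- classification_metrics(truth, pred)

# Absolute vs. relative recall gain, using the reported recall values
recall_m1 <- 0.334
recall_m2 <- 0.477
abs_gain  <- recall_m2 - recall_m1      # difference in recall
rel_gain  <- recall_m2 / recall_m1 - 1  # relative improvement
```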

Economic Interpretation of Coefficients

Factors increasing the probability of high income

Variable	Coefficient (log-odds)	Significance
Education	+0.25	***
Age	+0.04	***
Exec-managerial	+0.77	***
Male (gender)	+1.27	***
Capital gain	+0.0003	***

Factors decreasing the probability of high income

Variable	Coefficient (log-odds)	Significance
Farming/fishing	−1.29	***
Other-service	−1.04	***
Handler/cleaner	−0.86	***

All coefficients shown are significant at p < 0.001 (Wald z-test)

Results are consistent with an education premium, returns to experience, and occupational wage differentials
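Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below applies `exp()` to the coefficients reported in the tables above. The vector names are illustrative labels, and each occupation effect is relative to whatever reference category the fitted model used, which the slides do not state.

```r
# Coefficients (log-odds) as reported in the tables above
coefs <- c(education = 0.25, age = 0.04, exec_managerial = 0.77,
           male = 1.27, capital_gain = 0.0003,
           farming_fishing = -1.29, other_service = -1.04,
           handler_cleaner = -0.86)

# Odds ratio for a one-unit increase (or for belonging to the category),
# holding the other predictors fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 2)

# capital_gain is measured in dollars, so a $1,000 increase is easier to read
or_gain_1000 <- exp(1000 * coefs["capital_gain"])
```

Read this way, one additional unit of education multiplies the odds of high income by roughly 1.28, other variables held constant.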

Cross-Validation: Is the Model Reliable?

5-Fold Cross-Validation on Model 2

Metric	CV Mean	Std Error
Accuracy	0.825	0.002

Each fold: ~24,000 training / ~6,000 validation

Small standard error → consistent performance across folds

CV Accuracy = 0.825

Test Accuracy = 0.825

Difference = 0.000 ✅

No overfitting detected

The model generalizes well — results are not specific to one random data split
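The cross-validation procedure can be sketched in base R as follows. The data are simulated, the predictors `x1` and `x2` are hypothetical, and the 0.5 classification threshold is an assumption; the original analysis may have used a resampling package instead. The standard error is computed as the fold-to-fold standard deviation divided by the square root of the number of folds, which is one common definition.

```r
set.seed(465)  # seed reused from the slides; data are simulated for illustration

n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 1.2 * sim$x1 + 0.8 * sim$x2))

k <- 5
fold_id <- sample(rep(1:k, length.out = n))  # random, near-equal fold assignment

fold_acc <- sapply(1:k, function(i) {
  train <- sim[fold_id != i, ]  # fit on the other k - 1 folds
  valid <- sim[fold_id == i, ]  # evaluate on the held-out fold
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  prob  <- predict(fit, newdata = valid, type = "response")
  mean((prob > 0.5) == (valid$y == 1))  # accuracy at a 0.5 threshold
})

cv_mean <- mean(fold_acc)
cv_se   <- sd(fold_acc) / sqrt(k)  # standard error of the mean across folds
```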

Limitations

Data Limitations

1. Historical dataset: 1994 US Census data; the labor market has changed substantially since then

2. Class imbalance: 75% of observations are ≤$50K, and a recall of 0.477 means the model still misses more than half of high earners

Methodological Limitations

3. Correlation, not causation: logistic regression identifies associations, not causal effects; omitted variable bias may exist

4. Gender wage gap: the coefficient captures the gap but cannot explain the structural mechanisms behind it

Conclusion & Future Directions

Human capital and labor market characteristics predict high income with 82.5% accuracy, compared with 75.1% from always predicting ≤$50K; the coefficient signs are consistent with economic theory

What we would do differently

Use more recent data (post-2010)
Address class imbalance with oversampling
Add interaction terms: education × occupation
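One simple form of the oversampling idea is random oversampling of the minority class, sketched below in base R. The data are simulated and the single predictor `x` is hypothetical. Resampling would normally be applied to the training set only, so that the test set keeps the original class shares.

```r
set.seed(465)  # seed reused from the slides; data are simulated

n <- 1000
train <- data.frame(x = rnorm(n))
train$high_income <- rbinom(n, 1, plogis(-1.5 + 1.0 * train$x))

minority <- train[train$high_income == 1, ]  # high earners (fewer rows)
majority <- train[train$high_income == 0, ]

# Resample minority rows with replacement until both classes are the same size
extra_idx <- sample(nrow(minority), size = nrow(majority), replace = TRUE)
balanced  <- rbind(majority, minority[extra_idx, ])

table(balanced$high_income)

# Refit on the balanced training data
fit_balanced <- glm(high_income ~ x, data = balanced, family = binomial)
```

In an R model formula, an education × occupation interaction would be written as `education_num * occupation`, which expands to both main effects plus their interaction.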

Future research question

Has the predictive power of education on high income changed over time, as returns to college degrees have shifted in the US labor market?

Thank you — we welcome any questions