Predicting High Income Using Human Capital and Labor Market Characteristics
2026-06-04
“To what extent do human capital and labor market characteristics predict membership in the high-income group earning more than $50,000 per year?”
Motivation
Our Approach
| Variable | Type | Role |
|---|---|---|
| age | Numeric | Human capital proxy |
| education_num | Numeric | Human capital |
| occupation | Categorical | Labor market position |
| hours_per_week | Numeric | Work intensity |
| gender | Categorical | Demographic |
| capital_gain/loss | Numeric | Investment income |
Outcome Variable
high_income
1 → earns > $50K/year
0 → earns ≤ $50K/year
30,162 observations
Binary classification problem
75.1% earn ≤ $50K
24.9% earn > $50K (sample probability p = 0.249)
Theoretical Distribution: Bernoulli(p)
The outcome is binary — each individual either belongs to the high-income group or does not.
| Distribution | Appropriate? |
|---|---|
| Bernoulli | ✅ Binary outcome |
| Normal | ❌ Continuous only |
| Log-normal | ❌ Positive skewed continuous |
| Exponential | ❌ Duration/waiting time |
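As a minimal sketch of what the Bernoulli(p) choice implies, the snippet below evaluates the probability mass function, mean, and variance at the sample value p = 0.249. The function name `bernoulli_pmf` is illustrative and not part of the original analysis.

```python
# Sample share of high earners reported on the slide (p = 0.249).
p = 0.249


def bernoulli_pmf(y: int, p: float) -> float:
    """Return P(Y = y) for a Bernoulli(p) variable, where y is 0 or 1."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return p if y == 1 else 1.0 - p


mean = p                # E[Y] = p
variance = p * (1 - p)  # Var(Y) = p(1 - p)

print(f"P(Y=1) = {bernoulli_pmf(1, p):.3f}")
print(f"P(Y=0) = {bernoulli_pmf(0, p):.3f}")
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
```

The two probabilities match the 24.9% / 75.1% class shares, which is all a Bernoulli model needs to describe the outcome before any predictors are added.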
Exploratory analysis shows clear patterns:
Associated with higher income
Associated with lower income
These patterns are consistent with human capital theory — motivating a logistic regression model
Reproducible Train/Test Split
| Split | Observations |
|---|---|
| Training set | 24,574 |
| Test set | 6,144 |
The 80/20 split holds out 20% of observations, so performance is evaluated on data the model did not see during fitting
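One way a reproducible 80/20 split could be produced is with a seeded shuffle of row indices. This is a sketch under assumptions: the seed value 465, the function name `train_test_split_indices`, and the 1,000-row example are hypothetical and do not come from the original analysis.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=465):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_rows)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# Illustrative row count only; not the size of the census sample.
train_idx, test_idx = train_test_split_indices(1000)
again_train, again_test = train_test_split_indices(1000)

print(len(train_idx), len(test_idx))
print(test_idx == again_test)  # same seed, same split
```

Because the seed is fixed, rerunning the script assigns every observation to the same set, which is what makes reported test metrics repeatable.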
Two Logistic Regression Models
Model 1 (Baseline): high_income ~ age + education + hours
Model 2 (Extended): high_income ~ age + education + hours + occupation + gender + capital_gain + capital_loss
Model 2 adds labor market, demographic, and investment income variables
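Both models map a linear combination of predictors to a probability through the logistic function. The sketch below shows that mapping for the baseline specification. The intercept, coefficients, and example person are hypothetical values chosen only for illustration; they are not estimates from this study.

```python
import math


def logistic(z: float) -> float:
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """Return P(high_income = 1) for one person under a logistic model."""
    z = intercept + sum(
        coefficients[name] * features[name] for name in coefficients
    )
    return logistic(z)


# Hypothetical parameter values, NOT the fitted Model 1 estimates.
intercept = -8.0
coefficients = {"age": 0.04, "education": 0.30, "hours": 0.03}
person = {"age": 40, "education": 13, "hours": 45}

probability = predict_probability(intercept, coefficients, person)
print(f"Predicted P(high_income = 1) = {probability:.3f}")
```

Adding a predictor, as Model 2 does, simply adds another term to `z` before the logistic transformation is applied.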
| Metric | Model 1 | Model 2 | Gain (percentage points) |
|---|---|---|---|
| Accuracy | 0.786 | 0.825 | +3.9 |
| Precision | 0.622 | 0.720 | +9.8 |
| Recall | 0.334 | 0.477 | +14.3 |
Model 2 outperforms Model 1 across all metrics — we select Model 2 as our final model
Recall improved most, from 0.334 to 0.477: Model 2 correctly identifies about 43% more high-income individuals than Model 1 in relative terms
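The three metrics can be computed from a confusion matrix as sketched below, along with the absolute and relative recall gains between the two models. The confusion-matrix counts are hypothetical placeholders; only the two recall values are taken from the comparison table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration, not the study's confusion matrix.
metrics = classification_metrics(tp=50, fp=20, tn=300, fn=30)
print(metrics)

# Recall values reported in the model comparison table.
recall_m1, recall_m2 = 0.334, 0.477
point_gain = recall_m2 - recall_m1      # absolute gain in recall
relative_gain = point_gain / recall_m1  # gain relative to Model 1

print(f"Absolute gain: {point_gain * 100:.1f} percentage points")
print(f"Relative gain: {relative_gain * 100:.0f}%")
```

Keeping the two gains separate avoids mixing percentage points (differences between rates) with percent (ratios of rates).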
Factors increasing income probability
| Variable | Coefficient | Significance |
|---|---|---|
| Education | +0.25 | *** |
| Age | +0.04 | *** |
| Exec-managerial | +0.77 | *** |
| Male (gender) | +1.27 | *** |
| Capital gain | +0.0003 | *** |
Factors decreasing income probability
| Variable | Coefficient | Significance |
|---|---|---|
| Farming/fishing | −1.29 | *** |
| Other-service | −1.04 | *** |
| Handler/cleaner | −0.86 | *** |
All key variables significant at p < 0.001 (z-test)
Results are consistent with an education premium, returns to experience, and occupational wage differentials
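Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below applies this to the coefficients reported in the two tables. The occupation effects are relative to a reference category that the slides do not name, so that part of the interpretation is an assumption.

```python
import math

# Model 2 coefficients as reported in the two tables (log-odds scale).
coefficients = {
    "Education": 0.25,
    "Age": 0.04,
    "Exec-managerial": 0.77,
    "Male (gender)": 1.27,
    "Capital gain": 0.0003,
    "Farming/fishing": -1.29,
    "Other-service": -1.04,
    "Handler/cleaner": -0.86,
}

# exp(beta) is the multiplicative change in the odds of high income
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:16s} odds ratio = {ratio:.3f}")
```

An odds ratio above 1 corresponds to a positive coefficient and below 1 to a negative one. These ratios describe changes in odds, not changes in probability.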
5-Fold Cross-Validation on Model 2
| Metric | CV Mean | Std Error |
|---|---|---|
| Accuracy | 0.825 | 0.002 |
Each fold: ~24,000 training / ~6,000 validation
Small standard error → consistent performance across folds
CV Accuracy = 0.825
Test Accuracy = 0.825
Difference = 0.000 ✅
No overfitting detected
The model generalizes well — results are not specific to one random data split
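A 5-fold procedure of this kind can be sketched as follows: shuffle the row indices, cut them into five folds, and summarize the per-fold accuracies by their mean and standard error. The seed, function names, and per-fold accuracies are hypothetical. The sketch also assumes the reported standard error is the fold standard deviation divided by the square root of the number of folds.

```python
import math
import random
import statistics


def k_fold_indices(n_rows, k=5, seed=465):
    """Yield (training_idx, validation_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def mean_and_standard_error(scores):
    """Return the mean of the scores and the standard error of that mean."""
    mean = statistics.mean(scores)
    std_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, std_error


splits = list(k_fold_indices(100))
print([(len(train), len(val)) for train, val in splits])

# Hypothetical per-fold accuracies, invented for illustration only.
fold_accuracies = [0.820, 0.830, 0.825, 0.821, 0.829]
cv_mean, cv_se = mean_and_standard_error(fold_accuracies)
print(f"CV mean = {cv_mean:.3f}, standard error = {cv_se:.4f}")
```

Each observation serves as validation data exactly once, so the mean summarizes performance over the whole sample rather than over one particular split.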
Data Limitations
1. Historical dataset: 1994 US Census; the labor market has changed substantially since then
2. Class imbalance: 75% of observations are ≤$50K; recall of 0.477 means the model still misses more than half of high earners
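To put the class imbalance in context, the arithmetic below compares the reported accuracy with a naive rule that always predicts the majority class. It assumes the test set has the same class shares as the full sample, which the slides do not state.

```python
share_low_income = 0.751  # reported share earning <= $50K
model_accuracy = 0.825    # reported Model 2 test accuracy
model_recall = 0.477      # reported Model 2 recall

# Always predicting the majority class is correct for that class's share.
baseline_accuracy = max(share_low_income, 1 - share_low_income)
lift = model_accuracy - baseline_accuracy
missed_share = 1 - model_recall  # share of high earners the model misses

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Accuracy gain over baseline: {lift * 100:.1f} percentage points")
print(f"Share of high earners missed: {missed_share * 100:.1f}%")
```

With imbalanced classes, accuracy alone can look strong even for a rule that never identifies a high earner, which is why recall is reported alongside it.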
Methodological Limitations
3. Correlation, not causation: logistic regression identifies associations, not causal effects; omitted variable bias may exist
4. Gender wage gap: the coefficient captures the gap but cannot explain the structural mechanisms behind it
Human capital and labor market characteristics predict high income with 82.5% accuracy — consistent with economic theory
What we would do differently
Future research question
Has the predictive power of education on high income changed over time, as returns to college degrees have shifted in the US labor market?
Thank you — we welcome any questions
ECON 465 – Sude Arslan & Selhan Çil | June 2026