library(tidyverse)
library(tidymodels)
library(readr)
library(knitr)
library(scales)
set.seed(465)
theme_set(theme_minimal(base_size = 12))ECON 465 – Stage 3: Final Analysis Report
Predicting High Income Using Human Capital and Labor Market Characteristics
1. Economic Question
This report investigates the following economic question:
To what extent do education, occupation, age, and weekly working hours predict whether an individual belongs to the high-income group earning more than $50,000 per year?
Understanding the determinants of high income is a central concern in labor economics. Human capital theory predicts that education and work experience are the primary drivers of earnings. In addition, occupation, gender, and investment income are known to create substantial wage differentials across individuals. By building a classification model on the Adult Census Income dataset, this analysis aims to identify which socioeconomic characteristics are most strongly associated with belonging to the high-income group.
The findings have implications for labor market policy, educational investment decisions, and the study of income inequality.
2. Data
2.1 Dataset Description
The dataset used in this analysis is the Adult Census Income dataset, originally derived from the 1994 U.S. Census Bureau data. It is publicly available through the UCI Machine Learning Repository and Kaggle.
- Source: UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/2/adult
- Observations: 30,162 individuals after cleaning
- Outcome variable:
high_income— 1 if annual income > $50,000, 0 otherwise
The dataset includes demographic and labor market variables such as age, education, occupation, work class, marital status, sex, capital gain, capital loss, and weekly working hours.
2.2 Data Import and Cleaning
adult_raw <- read_csv("adult.csv", show_col_types = FALSE)
adult_clean <- adult_raw |>
mutate(across(where(is.character), ~ str_trim(.x))) |>
mutate(across(where(is.character), ~ na_if(.x, "?"))) |>
rename(
education_num = education.num,
marital_status = marital.status,
capital_gain = capital.gain,
capital_loss = capital.loss,
hours_per_week = hours.per.week
) |>
select(
income, age, education_num, occupation,
hours_per_week, sex, marital_status,
capital_gain, capital_loss
) |>
mutate(
income = factor(income, levels = c("<=50K", ">50K")),
high_income = factor(
if_else(income == ">50K", "Yes", "No"),
levels = c("No", "Yes")
),
high_income_num = if_else(income == ">50K", 1L, 0L),
occupation = as.factor(occupation),
sex = as.factor(sex),
marital_status = as.factor(marital_status)
) |>
drop_na()
dataset_size <- tibble(
observations = nrow(adult_clean),
variables = ncol(adult_clean)
)
kable(dataset_size, caption = "Cleaned Dataset Size")| observations | variables |
|---|---|
| 30718 | 11 |
After cleaning, the dataset contains 30718 observations and 11 variables. Missing values coded as ? were converted to NA and removed.
3. Probability Analysis
3.1 Target Variable Distribution
class_balance <- adult_clean |>
count(income, name = "observations") |>
mutate(share = observations / sum(observations))
kable(
class_balance,
digits = 3,
caption = "Target Variable Class Balance"
)| income | observations | share |
|---|---|---|
| <=50K | 23068 | 0.751 |
| >50K | 7650 | 0.249 |
ggplot(adult_clean, aes(x = income, fill = income)) +
geom_bar() +
scale_fill_manual(values = c("#c0392b", "#2471a3")) +
scale_y_continuous(labels = comma) +
labs(
title = "Distribution of Income Groups",
x = "Income Group",
y = "Number of Individuals"
) +
theme(legend.position = "none")The target variable is imbalanced: 75.1% of individuals earn ≤$50K while only 24.9% earn >$50K. The sample probability of belonging to the high-income group is p = 0.249.
3.2 Theoretical Distribution
The most appropriate theoretical distribution for this binary outcome is the Bernoulli distribution. Each individual has exactly two possible outcomes — high income (1) or not (0). The parameter p = 0.249 represents the probability of belonging to the high-income group.
Normal, log-normal, and exponential distributions are not appropriate because they are designed for continuous outcomes, whereas this target variable is strictly binary.
3.3 Summary Statistics
income_summary <- adult_clean |>
summarise(
mean = mean(high_income_num),
median = median(high_income_num),
standard_deviation = sd(high_income_num),
minimum = min(high_income_num),
maximum = max(high_income_num)
)
kable(
income_summary,
digits = 3,
caption = "Summary Statistics for the Binary High-Income Target"
)| mean | median | standard_deviation | minimum | maximum |
|---|---|---|---|---|
| 0.249 | 0 | 0.432 | 0 | 1 |
Because high_income is binary, the mean equals the sample probability of high income (p = 0.249). The standard deviation of 0.432 reflects the spread of a Bernoulli distribution.
3.4 Key Patterns from Exploratory Analysis
adult_clean |>
group_by(occupation, income) |>
summarise(n = n(), .groups = "drop") |>
group_by(occupation) |>
mutate(share = n / sum(n)) |>
filter(income == ">50K") |>
arrange(desc(share)) |>
ggplot(aes(x = reorder(occupation, share), y = share, fill = share)) +
geom_col() +
coord_flip() +
scale_fill_gradient(low = "#aed6f1", high = "#1a5276") +
scale_y_continuous(labels = percent) +
labs(
title = "Share of High-Income Individuals by Occupation",
x = "Occupation",
y = "Share Earning > $50K"
) +
theme(legend.position = "none")Executive and managerial roles show the highest share of high-income earners, while farming, service, and handler occupations show the lowest. This pattern motivates the inclusion of occupation as a key predictor in the classification model.
4. Modeling
4.1 Train/Test Split
set.seed(465)
class_split <- initial_split(adult_clean, prop = 0.8)
class_train <- training(class_split)
class_test <- testing(class_split)
split_sizes <- tibble(
dataset = c("Training set", "Test set"),
observations = c(nrow(class_train), nrow(class_test))
)
kable(split_sizes, caption = "Train/Test Split")| dataset | observations |
|---|---|
| Training set | 24574 |
| Test set | 6144 |
The data was split into 80% training (24574 observations) and 20% test (6144 observations) using set.seed(465) for reproducibility.
4.2 Logistic Regression Model 1 — Baseline
class_model_1 <- glm(
high_income ~ age + education_num + hours_per_week,
data = class_train,
family = binomial
)
summary(class_model_1)
Call:
glm(formula = high_income ~ age + education_num + hours_per_week,
family = binomial, data = class_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.418065 0.125777 -66.93 <2e-16 ***
age 0.048430 0.001361 35.57 <2e-16 ***
education_num 0.338738 0.007379 45.90 <2e-16 ***
hours_per_week 0.041494 0.001479 28.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27621 on 24573 degrees of freedom
Residual deviance: 22410 on 24570 degrees of freedom
AIC: 22418
Number of Fisher Scoring iterations: 5
Model 1 predicts high income using age, education level, and hours worked per week. All three coefficients are statistically significant at the 1% level. The positive coefficient on education reflects the well-documented education premium in earnings. The positive coefficient on age is consistent with returns to experience in labor markets.
4.3 Logistic Regression Model 2 — Extended
class_model_2 <- glm(
high_income ~ age + education_num + hours_per_week +
occupation + sex + capital_gain + capital_loss,
data = class_train,
family = binomial
)
summary(class_model_2)
Call:
glm(formula = high_income ~ age + education_num + hours_per_week +
occupation + sex + capital_gain + capital_loss, family = binomial,
data = class_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.157e+00 1.575e-01 -51.809 < 2e-16 ***
age 4.213e-02 1.524e-03 27.642 < 2e-16 ***
education_num 2.532e-01 9.537e-03 26.548 < 2e-16 ***
hours_per_week 3.087e-02 1.666e-03 18.523 < 2e-16 ***
occupationArmed-Forces -7.927e-01 1.245e+00 -0.637 0.5241
occupationCraft-repair 7.080e-02 7.963e-02 0.889 0.3739
occupationExec-managerial 7.657e-01 7.532e-02 10.167 < 2e-16 ***
occupationFarming-fishing -1.285e+00 1.454e-01 -8.837 < 2e-16 ***
occupationHandlers-cleaners -8.632e-01 1.465e-01 -5.894 3.78e-09 ***
occupationMachine-op-inspct -2.172e-01 1.037e-01 -2.094 0.0362 *
occupationOther-service -1.043e+00 1.208e-01 -8.636 < 2e-16 ***
occupationPriv-house-serv -4.503e+00 1.815e+00 -2.481 0.0131 *
occupationProf-specialty 3.886e-01 7.801e-02 4.982 6.31e-07 ***
occupationProtective-serv 3.618e-01 1.235e-01 2.929 0.0034 **
occupationSales 1.911e-01 8.031e-02 2.379 0.0174 *
occupationTech-support 5.613e-01 1.104e-01 5.085 3.67e-07 ***
occupationTransport-moving -1.147e-01 1.020e-01 -1.124 0.2608
sexMale 1.271e+00 4.943e-02 25.707 < 2e-16 ***
capital_gain 3.122e-04 1.123e-05 27.810 < 2e-16 ***
capital_loss 6.457e-04 3.845e-05 16.795 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27621 on 24573 degrees of freedom
Residual deviance: 18980 on 24554 degrees of freedom
AIC: 19020
Number of Fisher Scoring iterations: 8
Model 2 adds occupation, sex, capital gain, and capital loss. Most occupation categories are statistically significant. The positive coefficient on sex (Male) reflects the gender wage gap. Capital gain and capital loss are both highly significant, capturing investment income strongly associated with high earners.
4.4 Test Set Performance
class_probs_1 <- predict(class_model_1, newdata = class_test, type = "response")
class_probs_2 <- predict(class_model_2, newdata = class_test, type = "response")
class_test_predictions <- class_test |>
mutate(
pred_model_1 = factor(
ifelse(class_probs_1 > 0.5, "Yes", "No"),
levels = c("No", "Yes")
),
pred_model_2 = factor(
ifelse(class_probs_2 > 0.5, "Yes", "No"),
levels = c("No", "Yes")
)
)
class_metrics_1 <- tibble(
model = "Model 1",
accuracy = accuracy_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_1
),
precision = precision_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_1,
event_level = "second"
),
recall = recall_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_1,
event_level = "second"
)
)
class_metrics_2 <- tibble(
model = "Model 2",
accuracy = accuracy_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_2
),
precision = precision_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_2,
event_level = "second"
),
recall = recall_vec(
truth = class_test_predictions$high_income,
estimate = class_test_predictions$pred_model_2,
event_level = "second"
)
)
class_comparison <- bind_rows(class_metrics_1, class_metrics_2)
kable(
class_comparison,
digits = 3,
caption = "Model Comparison: Test Set Performance"
)| model | accuracy | precision | recall |
|---|---|---|---|
| Model 1 | 0.786 | 0.622 | 0.334 |
| Model 2 | 0.825 | 0.720 | 0.477 |
4.5 Cross-Validation
set.seed(465)
logistic_spec <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")
class_folds <- vfold_cv(adult_clean, v = 5)
class_cv_results <- fit_resamples(
logistic_spec,
high_income ~ age + education_num + hours_per_week +
occupation + sex + capital_gain + capital_loss,
resamples = class_folds,
metrics = metric_set(accuracy)
)
class_cv_summary <- collect_metrics(class_cv_results) |>
select(.metric, mean, std_err)
kable(
class_cv_summary,
digits = 3,
caption = "Model 2: 5-Fold Cross-Validation Results"
)| .metric | mean | std_err |
|---|---|---|
| accuracy | 0.825 | 0.002 |
The cross-validated accuracy is 0.825, which is identical to the test set accuracy of 0.825. The difference of zero confirms that the model is stable and does not overfit.
5. Results
5.1 Model Comparison
Model 2 outperforms Model 1 across all three metrics:
- Accuracy: 0.825 vs 0.786 (+3.9%)
- Precision: 0.720 vs 0.622 (+9.8%)
- Recall: 0.477 vs 0.334 (+14.3%)
The most important improvement is in recall — Model 2 correctly identifies 43% more high-income individuals than Model 1. We select Model 2 as the final model.
class_comparison |>
pivot_longer(
cols = c(accuracy, precision, recall),
names_to = "metric",
values_to = "value"
) |>
ggplot(aes(x = metric, y = value, fill = model)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("#c0392b", "#2471a3")) +
labs(
title = "Model Comparison: Test Set Performance",
x = "Metric",
y = "Value",
fill = "Model"
)6. Economic Interpretation
6.1 Answering the Economic Question
Our analysis confirms that education, occupation, age, hours worked, sex, and capital income are strong predictors of belonging to the high-income group. Model 2 achieves 82.5% accuracy with stable cross-validated performance, providing a clear answer to our economic question.
6.2 Coefficient Interpretation
Education (coefficient: +0.25, p < 0.001) Each additional year of education significantly increases the probability of high income. This confirms the education premium documented extensively in labor economics — more educated individuals earn substantially more.
Age (coefficient: +0.04, p < 0.001) Older individuals have higher income probability, consistent with returns to experience. As workers accumulate experience over time, their productivity and earnings increase.
Executive-Managerial Occupation (coefficient: +0.77, p < 0.001) Workers in executive and managerial roles have substantially higher income probability. This reflects occupational wage differentials — high-skill, high-responsibility roles command a wage premium.
Sex — Male (coefficient: +1.27, p < 0.001) The male coefficient is the strongest effect in the model. Even after controlling for education, occupation, and hours worked, male individuals have substantially higher income probability. This confirms the gender wage gap documented in labor economics literature.
Capital Gain (coefficient: +0.0003, p < 0.001) Capital gain is highly significant, capturing investment income. Individuals who earn returns on investments are more likely to belong to the high-income group, reflecting wealth inequality alongside labor income inequality.
Farming/Fishing Occupation (coefficient: −1.29, p < 0.001) Agricultural workers have the lowest income probability among all occupations. These roles require lower levels of formal education and human capital, which reduces the probability of earning above $50,000.
6.3 Policy Implications
These findings have several implications:
- Educational investment: The strong education premium suggests that policies expanding access to higher education can reduce income inequality by helping more individuals enter the high-income group.
- Occupational mobility: The large wage differentials across occupations suggest that retraining programs for workers in low-wage occupations could improve income mobility.
- Gender wage gap: The large sex coefficient — even after controlling for occupation and education — suggests that structural barriers beyond human capital differences contribute to gender-based income inequality. Targeted policy interventions may be needed.
7. Limitations & Reproducibility
7.1 Limitations
1. Historical dataset The dataset is from the 1994 U.S. Census. Labor market conditions, occupational structures, and income distributions have changed substantially since then. The findings reflect the 1994 U.S. labor market and may not generalize to other countries or time periods.
2. Class imbalance 75% of observations are in the ≤$50K group. The model is biased toward predicting low income, which reduces recall to 0.477. Many actual high-income individuals are missed. Techniques such as oversampling (SMOTE) or threshold adjustment could improve recall.
3. Correlation, not causation Logistic regression identifies associations between predictors and the outcome — not causal effects. Omitted variable bias may exist. For example, unobserved factors such as family background or social networks may drive both education attainment and income.
4. Gender wage gap mechanism While the model captures the gender wage gap through the sex coefficient, it cannot explain the structural mechanisms behind it. Discrimination, occupational sorting, and work interruptions are possible explanations that cannot be identified from this dataset alone.
7.2 Reproducibility
The following steps were taken to ensure reproducibility:
set.seed(465)was used before all random operations (train/test split and cross-validation) to ensure identical results across runs.- The
adult.csvfile is stored in the same folder as the.qmdfile and is referenced using a relative path ("adult.csv"), ensuring portability across computers. - All required packages (
tidyverse,tidymodels,readr,knitr,scales) are loaded at the top of the document. - The document was verified to render without errors using
quarto render.
8. AI Use Log
8.1 Tools Used
During this project, we used ChatGPT and Claude as AI assistants.
8.2 Detailed Interaction
Tool: ChatGPT
Prompt: “In R, I am using tidymodels and I want to calculate precision and recall for a logistic regression model where the positive class is the second factor level. My outcome variable is a factor with levels No and Yes. When I use precision_vec() and recall_vec(), I get results for the wrong class. How do I fix this?”
AI Response (relevant excerpt): The AI explained that by default, tidymodels treats the first factor level as the positive class. To calculate precision and recall for the second level (Yes), the argument event_level = "second" should be passed to precision_vec() and recall_vec().
How we used it: We had already written the model building and prediction code ourselves. The issue was specifically that our precision and recall results looked unexpectedly low. We used the AI explanation to understand the event_level argument and added it to our metric calculations. The rest of the code was written and adapted by us.
How we verified it: We verified the fix by checking that the resulting precision and recall values were consistent with a manually computed confusion matrix.
Reflection: The AI was helpful for resolving a specific technical issue not covered in the course labs. The main lesson was that factor level ordering in R affects how classification metrics are computed, and it is important to specify the positive class explicitly.
9. Final Reflections
9.1 One Improvement
With more time or better data, we would use a more recent dataset — ideally post-2010 U.S. Census data — to capture the current labor market structure. The 1994 dataset predates significant shifts in the labor market, including the rise of the technology sector, changes in education premiums, and evolving gender wage dynamics.
We would also address class imbalance using oversampling techniques such as SMOTE (Synthetic Minority Oversampling Technique), which would likely improve recall for the high-income group.
9.2 Future Research Question
Has the predictive power of education on high income changed over time, as returns to college degrees have shifted in the U.S. labor market?
Recent research suggests that the college wage premium has plateaued or even declined for some fields. Comparing classification models trained on 1994, 2005, and 2020 Census data could reveal whether education remains as strong a predictor of high income as our current analysis suggests.
10. Conclusion
This analysis demonstrates that human capital and labor market characteristics predict membership in the high-income group with 82.5% accuracy. Model 2 — which includes occupation, sex, and capital income in addition to age, education, and hours worked — substantially outperforms the baseline model across all three metrics.
The key findings are consistent with human capital theory: education, experience (age), and occupational position are the strongest positive predictors of high income. The gender wage gap is clearly visible in the data, even after controlling for education and occupation. Cross-validation confirms that the model is stable and generalizes well to unseen data.
A key limitation is that the dataset reflects the 1994 U.S. labor market. Future work should use more recent data and address class imbalance to improve the model’s ability to identify high-income individuals.