ECON 465 – Stage 3: Final Analysis Report

Predicting High Income Using Human Capital and Labor Market Characteristics

Author

Sude Arslan & Selhan Çil

Published

June 5, 2026

library(tidyverse)
library(tidymodels)
library(readr)
library(knitr)
library(scales)

set.seed(465)
theme_set(theme_minimal(base_size = 12))

1. Economic Question

This report investigates the following economic question:

To what extent do education, occupation, age, and weekly working hours predict whether an individual belongs to the high-income group earning more than $50,000 per year?

Understanding the determinants of high income is a central concern in labor economics. Human capital theory predicts that education and work experience are the primary drivers of earnings. In addition, occupation, gender, and investment income are known to create substantial wage differentials across individuals. By building a classification model on the Adult Census Income dataset, this analysis aims to identify which socioeconomic characteristics are most strongly associated with belonging to the high-income group.

The findings have implications for labor market policy, educational investment decisions, and the study of income inequality.


2. Data

2.1 Dataset Description

The dataset used in this analysis is the Adult Census Income dataset, originally derived from the 1994 U.S. Census Bureau data. It is publicly available through the UCI Machine Learning Repository and Kaggle.

  • Source: UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/2/adult
  • Observations: 30,162 individuals after cleaning
  • Outcome variable: high_income — 1 if annual income > $50,000, 0 otherwise

The dataset includes demographic and labor market variables such as age, education, occupation, work class, marital status, sex, capital gain, capital loss, and weekly working hours.

2.2 Data Import and Cleaning

adult_raw <- read_csv("adult.csv", show_col_types = FALSE)

adult_clean <- adult_raw |>
  mutate(across(where(is.character), ~ str_trim(.x))) |>
  mutate(across(where(is.character), ~ na_if(.x, "?"))) |>
  rename(
    education_num  = education.num,
    marital_status = marital.status,
    capital_gain   = capital.gain,
    capital_loss   = capital.loss,
    hours_per_week = hours.per.week
  ) |>
  select(
    income, age, education_num, occupation,
    hours_per_week, sex, marital_status,
    capital_gain, capital_loss
  ) |>
  mutate(
    income = factor(income, levels = c("<=50K", ">50K")),
    high_income = factor(
      if_else(income == ">50K", "Yes", "No"),
      levels = c("No", "Yes")
    ),
    high_income_num = if_else(income == ">50K", 1L, 0L),
    occupation     = as.factor(occupation),
    sex            = as.factor(sex),
    marital_status = as.factor(marital_status)
  ) |>
  drop_na()

dataset_size <- tibble(
  observations = nrow(adult_clean),
  variables    = ncol(adult_clean)
)

kable(dataset_size, caption = "Cleaned Dataset Size")
Cleaned Dataset Size
observations variables
30718 11

After cleaning, the dataset contains 30718 observations and 11 variables. Missing values coded as ? were converted to NA and removed.


3. Probability Analysis

3.1 Target Variable Distribution

class_balance <- adult_clean |>
  count(income, name = "observations") |>
  mutate(share = observations / sum(observations))

kable(
  class_balance,
  digits  = 3,
  caption = "Target Variable Class Balance"
)
Target Variable Class Balance
income observations share
<=50K 23068 0.751
>50K 7650 0.249
ggplot(adult_clean, aes(x = income, fill = income)) +
  geom_bar() +
  scale_fill_manual(values = c("#c0392b", "#2471a3")) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Distribution of Income Groups",
    x     = "Income Group",
    y     = "Number of Individuals"
  ) +
  theme(legend.position = "none")

The target variable is imbalanced: 75.1% of individuals earn ≤$50K while only 24.9% earn >$50K. The sample probability of belonging to the high-income group is p = 0.249.

3.2 Theoretical Distribution

The most appropriate theoretical distribution for this binary outcome is the Bernoulli distribution. Each individual has exactly two possible outcomes — high income (1) or not (0). The parameter p = 0.249 represents the probability of belonging to the high-income group.

Normal, log-normal, and exponential distributions are not appropriate because they are designed for continuous outcomes, whereas this target variable is strictly binary.

3.3 Summary Statistics

income_summary <- adult_clean |>
  summarise(
    mean               = mean(high_income_num),
    median             = median(high_income_num),
    standard_deviation = sd(high_income_num),
    minimum            = min(high_income_num),
    maximum            = max(high_income_num)
  )

kable(
  income_summary,
  digits  = 3,
  caption = "Summary Statistics for the Binary High-Income Target"
)
Summary Statistics for the Binary High-Income Target
mean median standard_deviation minimum maximum
0.249 0 0.432 0 1

Because high_income is binary, the mean equals the sample probability of high income (p = 0.249). The standard deviation of 0.432 reflects the spread of a Bernoulli distribution.

3.4 Key Patterns from Exploratory Analysis

adult_clean |>
  group_by(occupation, income) |>
  summarise(n = n(), .groups = "drop") |>
  group_by(occupation) |>
  mutate(share = n / sum(n)) |>
  filter(income == ">50K") |>
  arrange(desc(share)) |>
  ggplot(aes(x = reorder(occupation, share), y = share, fill = share)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low = "#aed6f1", high = "#1a5276") +
  scale_y_continuous(labels = percent) +
  labs(
    title = "Share of High-Income Individuals by Occupation",
    x     = "Occupation",
    y     = "Share Earning > $50K"
  ) +
  theme(legend.position = "none")

Executive and managerial roles show the highest share of high-income earners, while farming, service, and handler occupations show the lowest. This pattern motivates the inclusion of occupation as a key predictor in the classification model.


4. Modeling

4.1 Train/Test Split

set.seed(465)

class_split <- initial_split(adult_clean, prop = 0.8)
class_train <- training(class_split)
class_test  <- testing(class_split)

split_sizes <- tibble(
  dataset      = c("Training set", "Test set"),
  observations = c(nrow(class_train), nrow(class_test))
)

kable(split_sizes, caption = "Train/Test Split")
Train/Test Split
dataset observations
Training set 24574
Test set 6144

The data was split into 80% training (24574 observations) and 20% test (6144 observations) using set.seed(465) for reproducibility.

4.2 Logistic Regression Model 1 — Baseline

class_model_1 <- glm(
  high_income ~ age + education_num + hours_per_week,
  data   = class_train,
  family = binomial
)

summary(class_model_1)

Call:
glm(formula = high_income ~ age + education_num + hours_per_week, 
    family = binomial, data = class_train)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.418065   0.125777  -66.93   <2e-16 ***
age             0.048430   0.001361   35.57   <2e-16 ***
education_num   0.338738   0.007379   45.90   <2e-16 ***
hours_per_week  0.041494   0.001479   28.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27621  on 24573  degrees of freedom
Residual deviance: 22410  on 24570  degrees of freedom
AIC: 22418

Number of Fisher Scoring iterations: 5

Model 1 predicts high income using age, education level, and hours worked per week. All three coefficients are statistically significant at the 1% level. The positive coefficient on education reflects the well-documented education premium in earnings. The positive coefficient on age is consistent with returns to experience in labor markets.

4.3 Logistic Regression Model 2 — Extended

class_model_2 <- glm(
  high_income ~ age + education_num + hours_per_week +
    occupation + sex + capital_gain + capital_loss,
  data   = class_train,
  family = binomial
)

summary(class_model_2)

Call:
glm(formula = high_income ~ age + education_num + hours_per_week + 
    occupation + sex + capital_gain + capital_loss, family = binomial, 
    data = class_train)

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -8.157e+00  1.575e-01 -51.809  < 2e-16 ***
age                          4.213e-02  1.524e-03  27.642  < 2e-16 ***
education_num                2.532e-01  9.537e-03  26.548  < 2e-16 ***
hours_per_week               3.087e-02  1.666e-03  18.523  < 2e-16 ***
occupationArmed-Forces      -7.927e-01  1.245e+00  -0.637   0.5241    
occupationCraft-repair       7.080e-02  7.963e-02   0.889   0.3739    
occupationExec-managerial    7.657e-01  7.532e-02  10.167  < 2e-16 ***
occupationFarming-fishing   -1.285e+00  1.454e-01  -8.837  < 2e-16 ***
occupationHandlers-cleaners -8.632e-01  1.465e-01  -5.894 3.78e-09 ***
occupationMachine-op-inspct -2.172e-01  1.037e-01  -2.094   0.0362 *  
occupationOther-service     -1.043e+00  1.208e-01  -8.636  < 2e-16 ***
occupationPriv-house-serv   -4.503e+00  1.815e+00  -2.481   0.0131 *  
occupationProf-specialty     3.886e-01  7.801e-02   4.982 6.31e-07 ***
occupationProtective-serv    3.618e-01  1.235e-01   2.929   0.0034 ** 
occupationSales              1.911e-01  8.031e-02   2.379   0.0174 *  
occupationTech-support       5.613e-01  1.104e-01   5.085 3.67e-07 ***
occupationTransport-moving  -1.147e-01  1.020e-01  -1.124   0.2608    
sexMale                      1.271e+00  4.943e-02  25.707  < 2e-16 ***
capital_gain                 3.122e-04  1.123e-05  27.810  < 2e-16 ***
capital_loss                 6.457e-04  3.845e-05  16.795  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27621  on 24573  degrees of freedom
Residual deviance: 18980  on 24554  degrees of freedom
AIC: 19020

Number of Fisher Scoring iterations: 8

Model 2 adds occupation, sex, capital gain, and capital loss. Most occupation categories are statistically significant. The positive coefficient on sex (Male) reflects the gender wage gap. Capital gain and capital loss are both highly significant, capturing investment income strongly associated with high earners.

4.4 Test Set Performance

class_probs_1 <- predict(class_model_1, newdata = class_test, type = "response")
class_probs_2 <- predict(class_model_2, newdata = class_test, type = "response")

class_test_predictions <- class_test |>
  mutate(
    pred_model_1 = factor(
      ifelse(class_probs_1 > 0.5, "Yes", "No"),
      levels = c("No", "Yes")
    ),
    pred_model_2 = factor(
      ifelse(class_probs_2 > 0.5, "Yes", "No"),
      levels = c("No", "Yes")
    )
  )

class_metrics_1 <- tibble(
  model     = "Model 1",
  accuracy  = accuracy_vec(
    truth    = class_test_predictions$high_income,
    estimate = class_test_predictions$pred_model_1
  ),
  precision = precision_vec(
    truth       = class_test_predictions$high_income,
    estimate    = class_test_predictions$pred_model_1,
    event_level = "second"
  ),
  recall = recall_vec(
    truth       = class_test_predictions$high_income,
    estimate    = class_test_predictions$pred_model_1,
    event_level = "second"
  )
)

class_metrics_2 <- tibble(
  model     = "Model 2",
  accuracy  = accuracy_vec(
    truth    = class_test_predictions$high_income,
    estimate = class_test_predictions$pred_model_2
  ),
  precision = precision_vec(
    truth       = class_test_predictions$high_income,
    estimate    = class_test_predictions$pred_model_2,
    event_level = "second"
  ),
  recall = recall_vec(
    truth       = class_test_predictions$high_income,
    estimate    = class_test_predictions$pred_model_2,
    event_level = "second"
  )
)

class_comparison <- bind_rows(class_metrics_1, class_metrics_2)

kable(
  class_comparison,
  digits  = 3,
  caption = "Model Comparison: Test Set Performance"
)
Model Comparison: Test Set Performance
model accuracy precision recall
Model 1 0.786 0.622 0.334
Model 2 0.825 0.720 0.477

4.5 Cross-Validation

set.seed(465)

logistic_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

class_folds <- vfold_cv(adult_clean, v = 5)

class_cv_results <- fit_resamples(
  logistic_spec,
  high_income ~ age + education_num + hours_per_week +
    occupation + sex + capital_gain + capital_loss,
  resamples = class_folds,
  metrics   = metric_set(accuracy)
)

class_cv_summary <- collect_metrics(class_cv_results) |>
  select(.metric, mean, std_err)

kable(
  class_cv_summary,
  digits  = 3,
  caption = "Model 2: 5-Fold Cross-Validation Results"
)
Model 2: 5-Fold Cross-Validation Results
.metric mean std_err
accuracy 0.825 0.002

The cross-validated accuracy is 0.825, which is identical to the test set accuracy of 0.825. The difference of zero confirms that the model is stable and does not overfit.


5. Results

5.1 Model Comparison

Model 2 outperforms Model 1 across all three metrics:

  • Accuracy: 0.825 vs 0.786 (+3.9%)
  • Precision: 0.720 vs 0.622 (+9.8%)
  • Recall: 0.477 vs 0.334 (+14.3%)

The most important improvement is in recall — Model 2 correctly identifies 43% more high-income individuals than Model 1. We select Model 2 as the final model.

class_comparison |>
  pivot_longer(
    cols      = c(accuracy, precision, recall),
    names_to  = "metric",
    values_to = "value"
  ) |>
  ggplot(aes(x = metric, y = value, fill = model)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("#c0392b", "#2471a3")) +
  labs(
    title = "Model Comparison: Test Set Performance",
    x     = "Metric",
    y     = "Value",
    fill  = "Model"
  )


6. Economic Interpretation

6.1 Answering the Economic Question

Our analysis confirms that education, occupation, age, hours worked, sex, and capital income are strong predictors of belonging to the high-income group. Model 2 achieves 82.5% accuracy with stable cross-validated performance, providing a clear answer to our economic question.

6.2 Coefficient Interpretation

Education (coefficient: +0.25, p < 0.001) Each additional year of education significantly increases the probability of high income. This confirms the education premium documented extensively in labor economics — more educated individuals earn substantially more.

Age (coefficient: +0.04, p < 0.001) Older individuals have higher income probability, consistent with returns to experience. As workers accumulate experience over time, their productivity and earnings increase.

Executive-Managerial Occupation (coefficient: +0.77, p < 0.001) Workers in executive and managerial roles have substantially higher income probability. This reflects occupational wage differentials — high-skill, high-responsibility roles command a wage premium.

Sex — Male (coefficient: +1.27, p < 0.001) The male coefficient is the strongest effect in the model. Even after controlling for education, occupation, and hours worked, male individuals have substantially higher income probability. This confirms the gender wage gap documented in labor economics literature.

Capital Gain (coefficient: +0.0003, p < 0.001) Capital gain is highly significant, capturing investment income. Individuals who earn returns on investments are more likely to belong to the high-income group, reflecting wealth inequality alongside labor income inequality.

Farming/Fishing Occupation (coefficient: −1.29, p < 0.001) Agricultural workers have the lowest income probability among all occupations. These roles require lower levels of formal education and human capital, which reduces the probability of earning above $50,000.

6.3 Policy Implications

These findings have several implications:

  • Educational investment: The strong education premium suggests that policies expanding access to higher education can reduce income inequality by helping more individuals enter the high-income group.
  • Occupational mobility: The large wage differentials across occupations suggest that retraining programs for workers in low-wage occupations could improve income mobility.
  • Gender wage gap: The large sex coefficient — even after controlling for occupation and education — suggests that structural barriers beyond human capital differences contribute to gender-based income inequality. Targeted policy interventions may be needed.

7. Limitations & Reproducibility

7.1 Limitations

1. Historical dataset The dataset is from the 1994 U.S. Census. Labor market conditions, occupational structures, and income distributions have changed substantially since then. The findings reflect the 1994 U.S. labor market and may not generalize to other countries or time periods.

2. Class imbalance 75% of observations are in the ≤$50K group. The model is biased toward predicting low income, which reduces recall to 0.477. Many actual high-income individuals are missed. Techniques such as oversampling (SMOTE) or threshold adjustment could improve recall.

3. Correlation, not causation Logistic regression identifies associations between predictors and the outcome — not causal effects. Omitted variable bias may exist. For example, unobserved factors such as family background or social networks may drive both education attainment and income.

4. Gender wage gap mechanism While the model captures the gender wage gap through the sex coefficient, it cannot explain the structural mechanisms behind it. Discrimination, occupational sorting, and work interruptions are possible explanations that cannot be identified from this dataset alone.

7.2 Reproducibility

The following steps were taken to ensure reproducibility:

  • set.seed(465) was used before all random operations (train/test split and cross-validation) to ensure identical results across runs.
  • The adult.csv file is stored in the same folder as the .qmd file and is referenced using a relative path ("adult.csv"), ensuring portability across computers.
  • All required packages (tidyverse, tidymodels, readr, knitr, scales) are loaded at the top of the document.
  • The document was verified to render without errors using quarto render.

8. AI Use Log

8.1 Tools Used

During this project, we used ChatGPT and Claude as AI assistants.

8.2 Detailed Interaction

Tool: ChatGPT

Prompt: “In R, I am using tidymodels and I want to calculate precision and recall for a logistic regression model where the positive class is the second factor level. My outcome variable is a factor with levels No and Yes. When I use precision_vec() and recall_vec(), I get results for the wrong class. How do I fix this?”

AI Response (relevant excerpt): The AI explained that by default, tidymodels treats the first factor level as the positive class. To calculate precision and recall for the second level (Yes), the argument event_level = "second" should be passed to precision_vec() and recall_vec().

How we used it: We had already written the model building and prediction code ourselves. The issue was specifically that our precision and recall results looked unexpectedly low. We used the AI explanation to understand the event_level argument and added it to our metric calculations. The rest of the code was written and adapted by us.

How we verified it: We verified the fix by checking that the resulting precision and recall values were consistent with a manually computed confusion matrix.

Reflection: The AI was helpful for resolving a specific technical issue not covered in the course labs. The main lesson was that factor level ordering in R affects how classification metrics are computed, and it is important to specify the positive class explicitly.


9. Final Reflections

9.1 One Improvement

With more time or better data, we would use a more recent dataset — ideally post-2010 U.S. Census data — to capture the current labor market structure. The 1994 dataset predates significant shifts in the labor market, including the rise of the technology sector, changes in education premiums, and evolving gender wage dynamics.

We would also address class imbalance using oversampling techniques such as SMOTE (Synthetic Minority Oversampling Technique), which would likely improve recall for the high-income group.

9.2 Future Research Question

Has the predictive power of education on high income changed over time, as returns to college degrees have shifted in the U.S. labor market?

Recent research suggests that the college wage premium has plateaued or even declined for some fields. Comparing classification models trained on 1994, 2005, and 2020 Census data could reveal whether education remains as strong a predictor of high income as our current analysis suggests.


10. Conclusion

This analysis demonstrates that human capital and labor market characteristics predict membership in the high-income group with 82.5% accuracy. Model 2 — which includes occupation, sex, and capital income in addition to age, education, and hours worked — substantially outperforms the baseline model across all three metrics.

The key findings are consistent with human capital theory: education, experience (age), and occupational position are the strongest positive predictors of high income. The gender wage gap is clearly visible in the data, even after controlling for education and occupation. Cross-validation confirms that the model is stable and generalizes well to unseen data.

A key limitation is that the dataset reflects the 1994 U.S. labor market. Future work should use more recent data and address class imbalance to improve the model’s ability to identify high-income individuals.