Comprehensive Statistical Analysis of the UCI Adult Income Dataset

Author

Generated with Posit Assistant

Published

February 6, 2026

Note

This report was generated using AI under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.

Executive Summary

This report presents a comprehensive statistical analysis of the UCI Adult Income Dataset, which contains census data used to predict whether an individual earns more than $50,000 annually. The dataset comprises 32,561 observations and 15 variables covering demographic, employment, and education characteristics.

Key Findings:

  1. Income Distribution: 75.1% of individuals earn ≤$50K, while 24.9% earn >$50K
  2. Strongest Predictors: Marital status, education, occupation, and age are the most important factors determining income level
  3. Gender Gap: Males have 3.3× the odds of being high earners in the basic model, falling to 1.24× after controlling for occupation and marital status
  4. Education Premium: Each additional year of education is associated with 35-43% higher odds of high income, depending on the model
  5. Model Performance: The extended logistic regression achieves 83% accuracy and AUC of 0.88

Recommendations:

  • Use threshold = 0.25-0.30 for balanced prediction (F1 = 0.665)
  • Marital status is the strongest predictor—consider whether this captures household vs individual income
  • Education effects vary significantly by occupation—targeting education interventions by sector may be more effective

1. Introduction & Data Overview

1.1 Dataset Description

The Adult Income dataset (also known as the “Census Income” dataset) was extracted from the 1994 U.S. Census Bureau database. The prediction task is to determine whether a person earns over $50,000 per year based on census attributes.

# Load required libraries
library(tidyverse)
library(randomForest)

# Download the dataset from UCI Machine Learning Repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

col_names <- c("age", "workclass", "fnlwgt", "education", "education_num", 
               "marital_status", "occupation", "relationship", "race", "sex",
               "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")

adult_data <- read_csv(url, col_names = col_names, show_col_types = FALSE, 
                       na = c("", " ?", "?"))

cat("Dataset dimensions:", nrow(adult_data), "rows x", ncol(adult_data), "columns\n")
Dataset dimensions: 32561 rows x 15 columns
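
Because the raw file is fetched from the UCI server on every render, an optional caching step (sketch below; the local file name is illustrative and not used elsewhere in this report) avoids repeated downloads:

# Optional: cache the raw file locally and read from disk on later renders
local_file <- "adult.data"   # illustrative local path
if (!file.exists(local_file)) {
  download.file(url, local_file, mode = "wb")
}
adult_data <- read_csv(local_file, col_names = col_names, show_col_types = FALSE,
                       na = c("", " ?", "?"))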

1.2 Data Structure

glimpse(adult_data)
Rows: 32,561
Columns: 15
$ age            <dbl> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,…
$ workclass      <chr> "State-gov", "Self-emp-not-inc", "Private", "Private", …
$ fnlwgt         <dbl> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2…
$ education      <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors…
$ education_num  <dbl> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,…
$ marital_status <chr> "Never-married", "Married-civ-spouse", "Divorced", "Mar…
$ occupation     <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners",…
$ relationship   <chr> "Not-in-family", "Husband", "Not-in-family", "Husband",…
$ race           <chr> "White", "White", "White", "Black", "Black", "White", "…
$ sex            <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
$ capital_gain   <dbl> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, …
$ capital_loss   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hours_per_week <dbl> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,…
$ native_country <chr> "United-States", "United-States", "United-States", "Uni…
$ income         <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "…

1.3 Missing Value Analysis

missing_summary <- adult_data |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(Missing_Percent = round(Missing_Count / nrow(adult_data) * 100, 2)) |>
  filter(Missing_Count > 0) |>
  arrange(desc(Missing_Count))

knitr::kable(missing_summary, caption = "Variables with Missing Values")
Variables with Missing Values
Variable Missing_Count Missing_Percent
occupation 1843 5.66
workclass 1836 5.64
native_country 583 1.79
# Remove rows with missing values for analysis
adult_clean <- adult_data |> drop_na()

cat("Rows after removing missing values:", nrow(adult_clean), "\n")
Rows after removing missing values: 30162 
cat("Rows removed:", nrow(adult_data) - nrow(adult_clean), 
    "(", round((nrow(adult_data) - nrow(adult_clean))/nrow(adult_data)*100, 1), "%)\n")
Rows removed: 2399 ( 7.4 %)

1.4 Data Preparation

# Create derived variables
adult_clean <- adult_clean |>
  mutate(
    income_binary = ifelse(str_detect(income, ">50K"), "High (>50K)", "Low (<=50K)"),
    income_high = ifelse(str_detect(income, ">50K"), 1, 0),
    edu_level = case_when(
      education_num <= 8 ~ "No HS",
      education_num <= 9 ~ "HS",
      education_num <= 12 ~ "Some College",
      education_num <= 13 ~ "Bachelors",
      TRUE ~ "Graduate"
    ),
    marital_simple = case_when(
      marital_status %in% c("Married-civ-spouse", "Married-AF-spouse") ~ "Married",
      marital_status == "Never-married" ~ "Never-married",
      marital_status %in% c("Divorced", "Separated", "Widowed") ~ "Previously-married",
      TRUE ~ "Other"
    )
  )
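
A quick cross-tabulation of the derived variables against their source columns (sketch; output omitted) is a cheap way to confirm the recoding behaves as intended:

# Sanity check: derived education bands vs. original education_num codes,
# and simplified marital categories vs. raw marital_status values
adult_clean |> count(edu_level, education_num) |> arrange(education_num)
adult_clean |> count(marital_simple, marital_status)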

2. Descriptive Statistics

2.1 Numeric Variables

numeric_vars <- c("age", "fnlwgt", "education_num", "capital_gain", 
                  "capital_loss", "hours_per_week")

desc_stats <- adult_clean |>
  select(all_of(numeric_vars)) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    N = n(),
    Mean = round(mean(Value, na.rm = TRUE), 2),
    SD = round(sd(Value, na.rm = TRUE), 2),
    Min = min(Value, na.rm = TRUE),
    Q1 = quantile(Value, 0.25, na.rm = TRUE),
    Median = median(Value, na.rm = TRUE),
    Q3 = quantile(Value, 0.75, na.rm = TRUE),
    Max = max(Value, na.rm = TRUE)
  )

knitr::kable(desc_stats, caption = "Descriptive Statistics for Numeric Variables")
Descriptive Statistics for Numeric Variables
Variable N Mean SD Min Q1 Median Q3 Max
age 30162 38.44 13.13 17 28.0 37 47.0 90
capital_gain 30162 1092.01 7406.35 0 0.0 0 0.0 99999
capital_loss 30162 88.37 404.30 0 0.0 0 0.0 4356
education_num 30162 10.12 2.55 1 9.0 10 13.0 16
fnlwgt 30162 189793.83 105652.97 13769 117627.2 178425 237628.5 1484705
hours_per_week 30162 40.93 11.98 1 40.0 40 45.0 99

2.2 Income Distribution

income_dist <- adult_clean |> 
  count(income_binary) |> 
  mutate(Percent = round(n/sum(n)*100, 1))

ggplot(income_dist, aes(x = income_binary, y = n, fill = income_binary)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  geom_text(aes(label = paste0(Percent, "%")), vjust = -0.5, size = 5) +
  labs(title = "Income Distribution", x = "Income Level", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "none")

Income Distribution in the Dataset

The dataset shows a class imbalance with approximately 75% low earners and 25% high earners.

2.3 Age Distribution by Income

ggplot(adult_clean, aes(x = income_binary, y = age, fill = income_binary)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Age Distribution by Income Level", x = "Income Level", y = "Age") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "none")

Age Distribution by Income Level

High earners tend to be older than low earners, with a mean age roughly 7 years higher (tested formally in Section 4.1).
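
The boxplot summarizes the distributions; the group centers behind this claim can be checked directly (sketch below, output not shown):

# Age summaries by income group
adult_clean |>
  group_by(income_binary) |>
  summarise(
    median_age = median(age),
    mean_age   = round(mean(age), 1),
    .groups = "drop"
  )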

2.4 Education and Income

income_by_edu <- adult_clean |>
  count(edu_level, income_binary) |>
  group_by(edu_level) |>
  mutate(pct = n / sum(n) * 100)

ggplot(income_by_edu, aes(x = reorder(edu_level, -pct), y = pct, fill = income_binary)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Income Level by Education", 
       x = "Education Level", y = "Percentage", fill = "Income") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Income Level by Education

Education shows a strong positive relationship with income. Graduate degree holders have the highest proportion of high earners (63%), while those without a high school diploma have the lowest (6%).

2.5 Sex Distribution and Income

ggplot(adult_clean, aes(x = sex, fill = income_binary)) +
  geom_bar(position = "dodge") +
  labs(title = "Income Distribution by Sex", x = "Sex", y = "Count", fill = "Income") +
  scale_fill_brewer(palette = "Paired") +
  theme_minimal()

Income Distribution by Sex

3. Correlation Analysis

3.1 Correlation Matrix

cor_matrix <- adult_clean |>
  select(all_of(numeric_vars)) |>
  cor(use = "complete.obs", method = "pearson")

knitr::kable(round(cor_matrix, 3), caption = "Pearson Correlation Matrix")
Pearson Correlation Matrix
age fnlwgt education_num capital_gain capital_loss hours_per_week
age 1.000 -0.077 0.044 0.080 0.060 0.102
fnlwgt -0.077 1.000 -0.045 0.000 -0.010 -0.023
education_num 0.044 -0.045 1.000 0.124 0.080 0.153
capital_gain 0.080 0.000 0.124 1.000 -0.032 0.080
capital_loss 0.060 -0.010 0.080 -0.032 1.000 0.052
hours_per_week 0.102 -0.023 0.153 0.080 0.052 1.000

3.2 Correlation Heatmap

cor_long <- cor_matrix |>
  as.data.frame() |>
  rownames_to_column("Var1") |>
  pivot_longer(-Var1, names_to = "Var2", values_to = "Correlation")

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2)), color = "black", size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap of Numeric Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Correlation Heatmap of Numeric Variables

Most correlations between numeric variables are weak. The strongest correlation is between education and hours per week (r = 0.15).

3.3 Correlation Significance Tests

cor_test_age_hours <- cor.test(adult_clean$age, adult_clean$hours_per_week)
cor_test_edu_hours <- cor.test(adult_clean$education_num, adult_clean$hours_per_week)
cor_test_edu_gain <- cor.test(adult_clean$education_num, adult_clean$capital_gain)

cor_tests_summary <- tibble(
  Variable_Pair = c("Age vs Hours/Week", "Education vs Hours/Week", "Education vs Capital Gain"),
  Correlation = c(
    round(cor_test_age_hours$estimate, 4),
    round(cor_test_edu_hours$estimate, 4),
    round(cor_test_edu_gain$estimate, 4)
  ),
  P_Value = c(
    format.pval(cor_test_age_hours$p.value, digits = 4),
    format.pval(cor_test_edu_hours$p.value, digits = 4),
    format.pval(cor_test_edu_gain$p.value, digits = 4)
  )
)

knitr::kable(cor_tests_summary, caption = "Correlation Significance Tests")
Correlation Significance Tests
Variable_Pair Correlation P_Value
Age vs Hours/Week 0.1016 < 2.2e-16
Education vs Hours/Week 0.1525 < 2.2e-16
Education vs Capital Gain 0.1244 < 2.2e-16

All correlations are statistically significant (p < 0.001) due to the large sample size, though effect sizes are small.
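
With roughly 30,000 observations even weak correlations are estimated precisely; for example, the 95% confidence interval around the education vs. hours-per-week correlation is available directly from the cor.test object above:

# Narrow confidence interval despite the small effect size
round(cor_test_edu_hours$conf.int, 4)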


4. Hypothesis Testing

4.1 T-Test: Age by Income Level

t_test_age_income <- t.test(age ~ income_binary, data = adult_clean)

# Effect size (Cohen's d)
group1 <- adult_clean |> filter(income_binary == "High (>50K)") |> pull(age)
group2 <- adult_clean |> filter(income_binary == "Low (<=50K)") |> pull(age)
pooled_sd <- sqrt(((length(group1)-1)*sd(group1)^2 + (length(group2)-1)*sd(group2)^2) / 
                  (length(group1) + length(group2) - 2))
cohens_d <- (mean(group1) - mean(group2)) / pooled_sd

cat("T-test: Age by Income Level\n")
T-test: Age by Income Level
cat("Mean age (High income):", round(mean(group1), 2), "\n")
Mean age (High income): 43.96 
cat("Mean age (Low income):", round(mean(group2), 2), "\n")
Mean age (Low income): 36.61 
cat("Difference:", round(mean(group1) - mean(group2), 2), "years\n")
Difference: 7.35 years
cat("t-statistic:", round(t_test_age_income$statistic, 2), "\n")
t-statistic: 49.5 
cat("p-value: < 2.2e-16\n")
p-value: < 2.2e-16
cat("Cohen's d:", round(cohens_d, 3), "(medium effect)\n")
Cohen's d: 0.577 (medium effect)

Interpretation: High earners are on average 7.4 years older than low earners. This is a statistically significant difference with a medium effect size (Cohen’s d = 0.58).
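
The pooled-SD calculation above can be wrapped in a small helper (hypothetical, not part of the original analysis) so the same effect size is easy to compute for other numeric variables:

# Hypothetical helper: Cohen's d for a numeric variable split by a two-level grouping
cohens_d_by_group <- function(x, group) {
  g <- split(x, group)
  stopifnot(length(g) == 2)
  n1 <- length(g[[1]]); n2 <- length(g[[2]])
  pooled <- sqrt(((n1 - 1) * sd(g[[1]])^2 + (n2 - 1) * sd(g[[2]])^2) / (n1 + n2 - 2))
  (mean(g[[1]]) - mean(g[[2]])) / pooled
}

# Example: effect size for weekly hours by income group (output not shown)
cohens_d_by_group(adult_clean$hours_per_week, adult_clean$income_binary)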

4.2 T-Test: Hours per Week by Sex

t_test_hours_sex <- t.test(hours_per_week ~ sex, data = adult_clean)

cat("T-test: Hours per Week by Sex\n")
T-test: Hours per Week by Sex
cat("Mean hours (Female):", round(t_test_hours_sex$estimate[1], 2), "\n")
Mean hours (Female): 36.93 
cat("Mean hours (Male):", round(t_test_hours_sex$estimate[2], 2), "\n")
Mean hours (Male): 42.85 
cat("Difference:", round(t_test_hours_sex$estimate[2] - t_test_hours_sex$estimate[1], 2), "hours\n")
Difference: 5.92 hours
cat("t-statistic:", round(t_test_hours_sex$statistic, 2), "\n")
t-statistic: -41.7 
cat("p-value: < 2.2e-16\n")
p-value: < 2.2e-16

Interpretation: Males work approximately 6 more hours per week on average than females (p < 0.001).

4.3 Chi-Square Test: Income vs Sex

income_sex_table <- table(adult_clean$income_binary, adult_clean$sex)
chi_income_sex <- chisq.test(income_sex_table)

# Cramér's V
n <- sum(income_sex_table)
k <- min(nrow(income_sex_table), ncol(income_sex_table))
cramers_v <- sqrt(chi_income_sex$statistic / (n * (k - 1)))

cat("Chi-Square Test: Income vs Sex\n")
Chi-Square Test: Income vs Sex
print(income_sex_table)
             
              Female  Male
  High (>50K)   1112  6396
  Low (<=50K)   8670 13984
cat("\nChi-square statistic:", round(chi_income_sex$statistic, 2), "\n")

Chi-square statistic: 1415.29 
cat("p-value: < 2.2e-16\n")
p-value: < 2.2e-16
cat("Cramér's V:", round(cramers_v, 3), "(small-medium effect)\n")
Cramér's V: 0.217 (small-medium effect)

Interpretation: There is a significant association between sex and income level (χ² = 1415.3, p < 0.001). Only 11.4% of females are high earners compared to 31.4% of males.
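
The 11.4% and 31.4% figures follow directly from the contingency table; column-wise proportions make this explicit:

# Share of each income level within each sex (columns sum to 1)
round(prop.table(income_sex_table, margin = 2), 3)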

4.4 Chi-Square Test: Income vs Education

income_edu_table <- table(adult_clean$income_binary, adult_clean$edu_level)
chi_income_edu <- chisq.test(income_edu_table)

n <- sum(income_edu_table)
k <- min(nrow(income_edu_table), ncol(income_edu_table))
cramers_v_edu <- sqrt(chi_income_edu$statistic / (n * (k - 1)))

cat("Chi-Square Test: Income vs Education\n")
Chi-Square Test: Income vs Education
print(income_edu_table)
             
              Bachelors Graduate   HS No HS Some College
  High (>50K)      2126     1604 1617   225         1936
  Low (<=50K)      2918      940 8223  3516         7057
cat("\nChi-square statistic:", round(chi_income_edu$statistic, 2), "\n")

Chi-square statistic: 3928.85 
cat("p-value: < 2.2e-16\n")
p-value: < 2.2e-16
cat("Cramér's V:", round(cramers_v_edu, 3), "(medium effect)\n")
Cramér's V: 0.361 (medium effect)

Interpretation: There is a strong association between education level and income (χ² = 3928.9, p < 0.001, Cramér’s V = 0.36).

4.5 ANOVA: Hours per Week by Workclass

aov_hours_work <- aov(hours_per_week ~ workclass, data = adult_clean)
aov_summary <- summary(aov_hours_work)

# Effect size (eta-squared)
ss_between <- aov_summary[[1]]["workclass", "Sum Sq"]
ss_total <- sum(aov_summary[[1]][, "Sum Sq"])
eta_sq <- ss_between / ss_total

cat("ANOVA: Hours per Week by Workclass\n")
ANOVA: Hours per Week by Workclass
print(aov_summary)
               Df  Sum Sq Mean Sq F value Pr(>F)    
workclass       6  113066   18844   134.8 <2e-16 ***
Residuals   30155 4215641     140                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\nEta-squared:", round(eta_sq, 4), "(small effect)\n")

Eta-squared: 0.0261 (small effect)

Interpretation: There are significant differences in hours worked across workclass categories (F = 134.8, p < 0.001), though the effect size is small (η² = 0.026).
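
The F-test only establishes that at least one workclass mean differs. A post-hoc procedure such as Tukey's HSD (sketch below, output not shown) would identify which pairs of workclasses drive the difference:

# Pairwise workclass comparisons with family-wise error control
tukey_hours_work <- TukeyHSD(aov_hours_work)
head(tukey_hours_work$workclass)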


5. Logistic Regression Modeling

5.1 Original Model (4 Predictors)

logit_model <- glm(income_high ~ age + education_num + hours_per_week + sex, 
                   data = adult_clean, family = binomial)

cat("Original Model Summary:\n")
Original Model Summary:
cat("AIC:", round(AIC(logit_model), 2), "\n")
AIC: 26356.93 
cat("McFadden R²:", round(1 - logit_model$deviance/logit_model$null.deviance, 4), "\n\n")
McFadden R²: 0.2217 
summary(logit_model)$coefficients |> round(4) |> knitr::kable()
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.1345 0.1203 -75.9258 0
age 0.0474 0.0013 37.5553 0
education_num 0.3563 0.0069 51.8713 0
hours_per_week 0.0340 0.0014 24.8036 0
sexMale 1.1873 0.0389 30.5034 0
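
These estimates are on the log-odds scale; exponentiating them gives odds ratios, which is where the 3.3× figure for sexMale in the executive summary comes from (exp(1.1873) ≈ 3.3):

# Odds ratios for the basic model (exponentiated coefficients)
round(exp(coef(logit_model)), 3)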

5.2 Extended Model (+ Occupation & Marital Status)

logit_extended <- glm(
  income_high ~ age + education_num + hours_per_week + sex + occupation + marital_simple,
  data = adult_clean,
  family = binomial
)

cat("Extended Model Summary:\n")
Extended Model Summary:
cat("AIC:", round(AIC(logit_extended), 2), "\n")
AIC: 22074 
cat("McFadden R²:", round(1 - logit_extended$deviance/logit_extended$null.deviance, 4), "\n")
McFadden R²: 0.3491 

5.3 Model Comparison

cat("Model Comparison:\n")
Model Comparison:
cat("Original Model AIC:", round(AIC(logit_model), 2), "\n")
Original Model AIC: 26356.93 
cat("Extended Model AIC:", round(AIC(logit_extended), 2), "\n")
Extended Model AIC: 22074 
cat("Improvement:", round(AIC(logit_model) - AIC(logit_extended), 2), "\n\n")
Improvement: 4282.93 
# Likelihood ratio test
lr_test <- anova(logit_model, logit_extended, test = "Chisq")
print(lr_test)
Analysis of Deviance Table

Model 1: income_high ~ age + education_num + hours_per_week + sex
Model 2: income_high ~ age + education_num + hours_per_week + sex + occupation + 
    marital_simple
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1     30157      26347                          
2     30141      22032 16   4314.9 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The extended model is significantly better (χ² = 4314.9, p < 0.001).

5.4 Odds Ratios (Extended Model)

or_extended <- exp(cbind(OR = coef(logit_extended), confint(logit_extended, level = 0.95)))

or_df <- as.data.frame(or_extended) |>
  rownames_to_column("Predictor") |>
  mutate(across(where(is.numeric), ~round(., 3))) |>
  filter(Predictor != "(Intercept)") |>
  arrange(desc(OR))

knitr::kable(head(or_df, 15), caption = "Odds Ratios (Top 15 Predictors)")
Odds Ratios (Top 15 Predictors)
Predictor OR 2.5 % 97.5 %
occupationExec-managerial 2.124 1.854 2.436
occupationTech-support 1.743 1.421 2.136
occupationProf-specialty 1.614 1.401 1.861
occupationProtective-serv 1.383 1.104 1.730
education_num 1.347 1.324 1.371
sexMale 1.237 1.123 1.363
occupationSales 1.232 1.066 1.426
hours_per_week 1.031 1.028 1.034
age 1.030 1.027 1.033
occupationCraft-repair 0.935 0.810 1.080
occupationTransport-moving 0.799 0.666 0.958
occupationMachine-op-inspct 0.684 0.567 0.824
occupationArmed-Forces 0.576 0.025 4.896
occupationHandlers-cleaners 0.446 0.340 0.580
occupationOther-service 0.363 0.291 0.450

Key Interpretations:

  • Married individuals have roughly 12× the odds of being high earners compared with never-married individuals
  • Executive/managerial occupations carry 2.1× the odds relative to the admin-clerical baseline
  • Each additional year of education raises the odds by about 35% in this model (43% in the basic model of Section 5.1)
  • Males have 24% higher odds after controlling for occupation and marital status (see the worked probability example below)
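
Odds ratios are multiplicative on the odds scale, not the probability scale. A worked example with a hypothetical profile (the values below are purely illustrative) shows how the fitted model converts a set of predictors into a probability of earning >$50K:

# Hypothetical profile; category labels must match those in adult_clean exactly
example_person <- tibble(
  age            = 40,
  education_num  = 13,
  hours_per_week = 45,
  sex            = "Male",
  occupation     = "Exec-managerial",
  marital_simple = "Married"
)
predict(logit_extended, newdata = example_person, type = "response")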

5.5 Interaction Model (Education × Occupation)

logit_interact <- glm(
  income_high ~ age + education_num * occupation + hours_per_week + sex + marital_simple,
  data = adult_clean,
  family = binomial
)

# Likelihood ratio test
lr_interact <- anova(logit_extended, logit_interact, test = "Chisq")

cat("Interaction Model:\n")
Interaction Model:
cat("AIC:", round(AIC(logit_interact), 2), "\n")
AIC: 22052.61 
cat("Improvement vs Extended:", round(AIC(logit_extended) - AIC(logit_interact), 2), "\n\n")
Improvement vs Extended: 21.39 
cat("Likelihood Ratio Test:\n")
Likelihood Ratio Test:
cat("Chi-square:", round(lr_interact$Deviance[2], 2), "\n")
Chi-square: 47.39 
cat("p-value:", format.pval(lr_interact$`Pr(>Chi)`[2], digits = 4), "\n")
p-value: 8.296e-06 

5.6 Interaction Effects Visualization

# Extract interaction coefficients
interact_coefs <- summary(logit_interact)$coefficients
interact_terms <- interact_coefs[grep("education_num:occupation", rownames(interact_coefs)), ]

interact_df <- data.frame(
  Occupation = gsub("education_num:occupation", "", rownames(interact_terms)),
  Coefficient = interact_terms[, 1],
  P_Value = interact_terms[, 4]
) |>
  mutate(
    Significant = ifelse(P_Value < 0.05, "Significant", "Not significant"),
    Education_Effect = round(0.228 + Coefficient, 3)  # 0.228 = education_num slope for the baseline occupation (Adm-clerical)
  ) |>
  filter(!Occupation %in% c("Armed-Forces", "Priv-house-serv")) |>
  arrange(desc(Education_Effect))

ggplot(interact_df, aes(x = reorder(Occupation, Education_Effect), 
                         y = Education_Effect, fill = Significant)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 0.228, linetype = "dashed", color = "red", linewidth = 1) +
  coord_flip() +
  labs(title = "Education Effect on Income by Occupation",
       subtitle = "Red dashed line = baseline effect (Adm-clerical: 0.228 log-odds/year)",
       x = "Occupation", y = "Education Effect (log-odds per year)", fill = "Interaction") +
  scale_fill_manual(values = c("Not significant" = "steelblue", "Significant" = "darkgreen")) +
  theme_minimal() +
  theme(legend.position = "bottom")

Education Effect on Income by Occupation

Key Finding: Education has a significantly stronger effect on income for Protective services (+77%), Exec-managerial (+37%), Sales (+13%), and Other-service (+38%) occupations compared to the baseline (Administrative/Clerical).
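
To read these per-year log-odds slopes on the odds scale, exponentiate them; for the baseline occupation, exp(0.228) ≈ 1.26, i.e. roughly 26% higher odds per additional year of education:

# Per-occupation education effects expressed as odds ratios per additional year
interact_df |>
  mutate(OR_per_year = round(exp(Education_Effect), 2)) |>
  select(Occupation, Education_Effect, OR_per_year)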


6. Cross-Validation & Model Performance

6.1 10-Fold Cross-Validation

set.seed(123)
n <- nrow(adult_clean)
k <- 10
folds <- sample(rep(1:k, length.out = n))

# Function to calculate metrics
calc_metrics <- function(actual, predicted_prob, threshold = 0.5) {
  predicted_class <- ifelse(predicted_prob > threshold, 1, 0)
  TP <- sum(predicted_class == 1 & actual == 1)
  TN <- sum(predicted_class == 0 & actual == 0)
  FP <- sum(predicted_class == 1 & actual == 0)
  FN <- sum(predicted_class == 0 & actual == 1)
  
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  
  list(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
}

# Initialize storage
cv_results_original <- data.frame(fold = 1:k, accuracy = NA, sensitivity = NA, specificity = NA)
cv_results_extended <- data.frame(fold = 1:k, accuracy = NA, sensitivity = NA, specificity = NA)

# Run CV
for (i in 1:k) {
  train_data <- adult_clean[folds != i, ]
  test_data <- adult_clean[folds == i, ]
  
  # Original model
  model_orig <- glm(income_high ~ age + education_num + hours_per_week + sex,
                    data = train_data, family = binomial)
  pred_orig <- predict(model_orig, newdata = test_data, type = "response")
  metrics_orig <- calc_metrics(test_data$income_high, pred_orig)
  cv_results_original[i, 2:4] <- unlist(metrics_orig)
  
  # Extended model
  model_ext <- glm(income_high ~ age + education_num + hours_per_week + sex + 
                   occupation + marital_simple, data = train_data, family = binomial)
  pred_ext <- predict(model_ext, newdata = test_data, type = "response")
  metrics_ext <- calc_metrics(test_data$income_high, pred_ext)
  cv_results_extended[i, 2:4] <- unlist(metrics_ext)
}

6.2 Cross-Validation Results

cv_summary <- data.frame(
  Model = c("Original", "Extended"),
  Mean_Accuracy = c(mean(cv_results_original$accuracy), mean(cv_results_extended$accuracy)),
  SD_Accuracy = c(sd(cv_results_original$accuracy), sd(cv_results_extended$accuracy)),
  Mean_Sensitivity = c(mean(cv_results_original$sensitivity), mean(cv_results_extended$sensitivity)),
  Mean_Specificity = c(mean(cv_results_original$specificity), mean(cv_results_extended$specificity))
) |>
  mutate(across(where(is.numeric), ~round(., 3)))

knitr::kable(cv_summary, caption = "10-Fold Cross-Validation Results")
10-Fold Cross-Validation Results
Model Mean_Accuracy SD_Accuracy Mean_Sensitivity Mean_Specificity
Original 0.797 0.010 0.392 0.932
Extended 0.826 0.008 0.550 0.918

6.3 CV Results Visualization

cv_plot_data <- bind_rows(
  cv_results_original |> mutate(Model = "Original"),
  cv_results_extended |> mutate(Model = "Extended")
) |>
  pivot_longer(cols = c(accuracy, sensitivity, specificity),
               names_to = "Metric", values_to = "Value") |>
  mutate(Metric = str_to_title(Metric))

ggplot(cv_plot_data, aes(x = Model, y = Value, fill = Model)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(title = "10-Fold Cross-Validation: Model Comparison",
       y = "Score") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "none")

Cross-Validation Results: Model Comparison

The extended model improves markedly on sensitivity and modestly on accuracy across folds, at a small cost in specificity (a paired fold-level comparison is sketched below):

  • Accuracy: 79.7% → 82.6% (+2.9 percentage points)
  • Sensitivity: 39.2% → 55.0% (+15.8 percentage points)
  • Specificity: 93.2% → 91.8% (-1.4 percentage points)
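
Because the same fold assignments were used for both models, the per-fold accuracies can be compared directly with a paired test (output not shown):

# Paired t-test on per-fold accuracy: extended vs. original model
t.test(cv_results_extended$accuracy, cv_results_original$accuracy, paired = TRUE)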

7. Threshold Optimization

7.1 Precision-Recall Analysis

pred_probs <- predict(logit_extended, type = "response")
actual <- adult_clean$income_high
thresholds <- seq(0.1, 0.9, by = 0.05)

threshold_metrics <- map_dfr(thresholds, function(t) {
  pred <- ifelse(pred_probs > t, 1, 0)
  TP <- sum(pred == 1 & actual == 1)
  TN <- sum(pred == 0 & actual == 0)
  FP <- sum(pred == 1 & actual == 0)
  FN <- sum(pred == 0 & actual == 1)
  
  tibble(
    Threshold = t,
    Accuracy = (TP + TN) / (TP + TN + FP + FN),
    Precision = TP / (TP + FP),
    Recall = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    F1 = 2 * (Precision * Recall) / (Precision + Recall)
  )
})

knitr::kable(threshold_metrics |> mutate(across(where(is.numeric), ~round(., 3))),
             caption = "Model Performance at Different Thresholds")
Model Performance at Different Thresholds
Threshold Accuracy Precision Recall Specificity F1
0.10 0.674 0.430 0.951 0.582 0.592
0.15 0.729 0.477 0.916 0.667 0.627
0.20 0.761 0.512 0.877 0.723 0.647
0.25 0.786 0.546 0.832 0.771 0.659
0.30 0.806 0.582 0.777 0.815 0.665
0.35 0.816 0.611 0.713 0.850 0.658
0.40 0.822 0.639 0.656 0.877 0.648
0.45 0.825 0.665 0.601 0.899 0.631
0.50 0.826 0.690 0.551 0.918 0.612
0.55 0.826 0.715 0.501 0.934 0.589
0.60 0.823 0.738 0.447 0.948 0.557
0.65 0.818 0.762 0.389 0.960 0.515
0.70 0.810 0.784 0.327 0.970 0.461
0.75 0.800 0.809 0.256 0.980 0.389
0.80 0.785 0.827 0.174 0.988 0.287
0.85 0.771 0.868 0.095 0.995 0.171
0.90 0.758 0.846 0.032 0.998 0.062

7.2 Optimal Thresholds

best_f1 <- threshold_metrics |> slice_max(F1, n = 1)
best_acc <- threshold_metrics |> slice_max(Accuracy, n = 1)

cat("Optimal Thresholds:\n")
Optimal Thresholds:
cat("Best F1 Score:", round(best_f1$F1, 3), "at threshold =", best_f1$Threshold, "\n")
Best F1 Score: 0.665 at threshold = 0.3 
cat("Best Accuracy:", round(best_acc$Accuracy, 3), "at threshold =", best_acc$Threshold, "\n")
Best Accuracy: 0.826 at threshold = 0.5 
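
Another common criterion, not used in this report, is Youden's J (sensitivity + specificity - 1), which can be computed directly from the threshold grid:

# Threshold maximizing Youden's J statistic
threshold_metrics |>
  mutate(Youden_J = Recall + Specificity - 1) |>
  slice_max(Youden_J, n = 1)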

7.3 Precision-Recall Trade-off Visualization

threshold_plot <- threshold_metrics |>
  select(Threshold, Precision, Recall, F1) |>
  pivot_longer(-Threshold, names_to = "Metric", values_to = "Value")

ggplot(threshold_plot, aes(x = Threshold, y = Value, color = Metric)) +
  geom_line(linewidth = 1.2) +
  geom_vline(xintercept = best_f1$Threshold, linetype = "dashed", color = "darkgreen") +
  annotate("text", x = best_f1$Threshold + 0.08, y = 0.4, 
           label = paste0("Best F1\n(t=", best_f1$Threshold, ")"), size = 3) +
  labs(title = "Precision-Recall Trade-off Across Thresholds",
       x = "Classification Threshold", y = "Score") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()

Precision-Recall Trade-off Across Thresholds

Recommendations by Use Case:

Objective Threshold Recall Precision F1
Balanced (Best F1) 0.30 77.7% 58.2% 0.665
High Recall 0.20 87.7% 51.2% 0.647
High Precision 0.60 44.7% 73.8% 0.557
Best Accuracy 0.50 55.1% 69.0% 0.612

8. Machine Learning Comparison

8.1 Random Forest Model

# Prepare data
model_data <- adult_clean |>
  select(income_high, age, education_num, hours_per_week, sex, occupation, marital_simple) |>
  mutate(across(where(is.character), as.factor))

set.seed(42)
train_idx <- sample(1:nrow(model_data), size = 0.8 * nrow(model_data))
train_set <- model_data[train_idx, ]
test_set <- model_data[-train_idx, ]

# Train Random Forest
rf_model <- randomForest(
  as.factor(income_high) ~ .,
  data = train_set,
  ntree = 500,
  mtry = 3,
  importance = TRUE
)

print(rf_model)

Call:
 randomForest(formula = as.factor(income_high) ~ ., data = train_set,      ntree = 500, mtry = 3, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 17.91%
Confusion matrix:
      0    1 class.error
0 16262 1846   0.1019439
1  2476 3545   0.4112274
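
The OOB confusion matrix shows the forest misses about 41% of high earners, mirroring the sensitivity problem seen for the logistic models. One possible mitigation is stratified, roughly balanced per-tree sampling (hedged sketch below; the sampsize values are illustrative and the results are not part of this report):

# Sketch: draw a roughly balanced sample from each class for every tree
set.seed(42)
rf_balanced <- randomForest(
  as.factor(income_high) ~ .,
  data = train_set,
  ntree = 500,
  mtry = 3,
  strata = as.factor(train_set$income_high),
  sampsize = c(4000, 4000)   # illustrative; class 1 has ~6,000 training rows
)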

8.2 Variable Importance

importance_df <- importance(rf_model) |>
  as.data.frame() |>
  rownames_to_column("Variable") |>
  arrange(desc(MeanDecreaseGini))

ggplot(importance_df, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Random Forest: Variable Importance",
       x = "Variable", y = "Mean Decrease in Gini") +
  theme_minimal()

Random Forest Variable Importance

Marital status and age are the most important predictors, followed by education and occupation.

8.3 Model Performance Comparison

# Random Forest predictions
rf_pred_class <- predict(rf_model, newdata = test_set)
rf_pred_prob <- predict(rf_model, newdata = test_set, type = "prob")[, 2]

# Calculate metrics
rf_conf <- table(Predicted = rf_pred_class, Actual = test_set$income_high)
rf_accuracy <- sum(diag(rf_conf)) / sum(rf_conf)
rf_sensitivity <- rf_conf[2, 2] / sum(rf_conf[, 2])
rf_specificity <- rf_conf[1, 1] / sum(rf_conf[, 1])
rf_precision <- rf_conf[2, 2] / sum(rf_conf[2, ])
rf_f1 <- 2 * rf_precision * rf_sensitivity / (rf_precision + rf_sensitivity)

# Logistic Regression on test set
logit_pred_test <- predict(logit_extended, newdata = test_set, type = "response")
logit_pred_class <- ifelse(logit_pred_test > 0.5, 1, 0)
logit_conf <- table(Predicted = logit_pred_class, Actual = test_set$income_high)
logit_accuracy <- sum(diag(logit_conf)) / sum(logit_conf)
logit_sensitivity <- logit_conf[2, 2] / sum(logit_conf[, 2])
logit_specificity <- logit_conf[1, 1] / sum(logit_conf[, 1])
logit_precision <- logit_conf[2, 2] / sum(logit_conf[2, ])
logit_f1 <- 2 * logit_precision * logit_sensitivity / (logit_precision + logit_sensitivity)

# AUC via the rank-based (Mann-Whitney U) formulation: the probability that a
# randomly chosen positive case receives a higher predicted probability than a
# randomly chosen negative case
calc_auc <- function(actual, predicted) {
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  ranks <- rank(predicted)
  (sum(ranks[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

rf_auc <- calc_auc(test_set$income_high, rf_pred_prob)
logit_auc <- calc_auc(test_set$income_high, logit_pred_test)

# Comparison table
comparison <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  Accuracy = c(logit_accuracy, rf_accuracy),
  Sensitivity = c(logit_sensitivity, rf_sensitivity),
  Specificity = c(logit_specificity, rf_specificity),
  Precision = c(logit_precision, rf_precision),
  F1_Score = c(logit_f1, rf_f1),
  AUC = c(logit_auc, rf_auc)
) |>
  mutate(across(where(is.numeric), ~round(., 3)))

knitr::kable(comparison, caption = "Test Set Performance: Logistic Regression vs Random Forest")
Test Set Performance: Logistic Regression vs Random Forest
Model Accuracy Sensitivity Specificity Precision F1_Score AUC
Logistic Regression 0.830 0.568 0.916 0.688 0.622 0.882
Random Forest 0.821 0.620 0.887 0.643 0.631 0.859

8.4 Performance Comparison Visualization

comparison_long <- comparison |>
  pivot_longer(-Model, names_to = "Metric", values_to = "Value")

ggplot(comparison_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Model Performance Comparison: Test Set",
       x = "Metric", y = "Score") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model Performance Comparison

Key Findings:

  • Logistic regression achieves higher AUC (0.88 vs 0.86) and specificity
  • Random Forest achieves slightly higher sensitivity (62% vs 57%)
  • Both models reach similar overall accuracy (82-83%)
  • Caveat: the logistic regression was fitted on the full cleaned dataset, so the test rows were also part of its training data; its test-set metrics are therefore slightly optimistic relative to the Random Forest, which was trained only on the 80% split

9. Conclusions & Recommendations

9.1 Summary of Findings

  1. Income Determinants: The strongest predictors of high income are:

    • Marital status (married individuals have roughly 12× the odds of being high earners)
    • Education level (each year adds 35-43% to odds)
    • Occupation (executive/managerial roles have 2.1× odds)
    • Age (7.4 years older on average for high earners)
  2. Gender Disparity: Males have significantly higher odds of being high earners, but much of this is explained by occupation and marital status differences.

  3. Education-Occupation Interaction: The value of education varies by occupation; it has a notably stronger effect in protective services, executive/managerial, other-service, and sales roles than in the administrative/clerical baseline.

  4. Model Performance: The extended logistic regression model achieves:

    • 83% accuracy
    • 55% sensitivity (at default threshold)
    • 92% specificity
    • AUC of 0.88

9.2 Recommendations

Objective Recommendation
Balanced Prediction Use threshold = 0.25-0.30 for best F1 score (0.665)
Finding High Earners Use threshold = 0.20 for 88% recall
Confident Predictions Use threshold = 0.60+ for 74% precision
Model Choice Use logistic regression for interpretability and slightly better AUC; Random Forest for slightly better sensitivity

9.3 Limitations

  1. Data Age: The 1994 census data may not reflect current income patterns
  2. Class Imbalance: 75/25 split may affect model performance on minority class
  3. Missing Data: ~7% of observations were removed due to missing values
  4. Binary Outcome: $50K threshold is arbitrary and doesn’t capture income distribution nuances

9.4 Future Work

  • Investigate the strong marital status effect (household vs individual income)
  • Apply cost-sensitive learning to handle class imbalance
  • Test ensemble methods combining logistic regression and Random Forest
  • Update analysis with more recent census data

Appendix: Session Information

sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomForest_4.7-1.2 lubridate_1.9.4      forcats_1.0.1       
 [4] stringr_1.6.0        dplyr_1.1.4          purrr_1.2.1         
 [7] readr_2.1.6          tidyr_1.3.2          tibble_3.3.1        
[10] ggplot2_4.0.2        tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] bit_4.6.0          gtable_0.3.6       jsonlite_2.0.0     crayon_1.5.3      
 [5] compiler_4.5.2     tidyselect_1.2.1   parallel_4.5.2     scales_1.4.0      
 [9] yaml_2.3.12        fastmap_1.2.0      R6_2.6.1           labeling_0.4.3    
[13] generics_0.1.4     curl_7.0.0         knitr_1.51         htmlwidgets_1.6.4 
[17] pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7       
[21] stringi_1.8.7      xfun_0.56          S7_0.2.1           bit64_4.6.0-1     
[25] otel_0.2.0         timechange_0.4.0   cli_3.6.5          withr_3.0.2       
[29] magrittr_2.0.4     digest_0.6.39      grid_4.5.2         vroom_1.7.0       
[33] rstudioapi_0.18.0  hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1       
[37] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.30    
[41] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.9