library(ISLR)
library(dplyr)
library(stringr)
library(caTools)
library(caret)
library(pROC)
library(DescTools)
Today our analysis will help to determine whether demographic, age, marital status, employment, and health characteristics can be used to predict whether someone earns a (comparably) higher or lower wage. Creating a binary wage outcome based on the median wage with these different variables, conducting classical statistical tests (t-tests, ANOVA, and chi-square), and building a logistic regression model will help us draw conclusions on what factors can predict one’s income.
data("Wage")
str(Wage)
'data.frame': 3000 obs. of 11 variables:
$ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
$ age : int 18 24 45 43 50 54 44 30 41 52 ...
$ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
$ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
$ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
$ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
$ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
$ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
$ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
$ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
$ wage : num 75 70.5 131 154.7 75 ...
wage_median <- median(Wage$wage, na.rm = TRUE)
wage_median
[1] 104.9215
Wage <- Wage %>%
mutate(
WageCategory = case_when(
wage > wage_median ~ "High",
wage < wage_median ~ "Low",
TRUE ~ NA_character_ # rows exactly equal to median
)
)
Wage <- Wage %>% filter(!is.na(WageCategory))
Wage$WageCategory <- factor(Wage$WageCategory, levels = c("Low", "High"))
table(Wage$WageCategory)
Low High
1447 1483
clean_factor <- function(x) {
x <- as.character(x)
x <- str_replace(x, "^[0-9]+\\.\\s*", "")
factor(x)
}
Wage <- Wage %>%
mutate(
race = clean_factor(race),
education = clean_factor(education),
jobclass = clean_factor(jobclass),
health = clean_factor(health),
health_ins = clean_factor(health_ins),
maritl = clean_factor(maritl)
)
summary(Wage[, c("race", "education", "jobclass", "health", "health_ins", "maritl")])
race education jobclass health
Asian: 187 < HS Grad :264 Industrial :1507 <=Good : 841
Black: 280 Advanced Degree:423 Information:1423 >=Very Good:2089
Other: 36 College Grad :661
White:2427 HS Grad :949
Some College :633
health_ins maritl
No : 901 Divorced : 200
Yes:2029 Married :2019
Never Married: 639
Separated : 55
Widowed : 17
Wage %>%
group_by(WageCategory) %>%
summarise(
mean_age = mean(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
n = n()
)
# A tibble: 2 × 4
WageCategory mean_age sd_age n
<fct> <dbl> <dbl> <int>
1 Low 40.0 12.7 1447
2 High 44.7 9.82 1483
age_ttest <- t.test(age ~ WageCategory, data = Wage)
age_ttest
Welch Two Sample t-test
data: age by WageCategory
t = -11.143, df = 2724.7, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Low and group High is not equal to 0
95 percent confidence interval:
-5.496555 -3.851525
sample estimates:
mean in group Low mean in group High
40.01106 44.68510
A Welch two-sample t-test compared the mean age of “high” vs. “low” income levels. *Results showed a significant difference, t(2724.7) = −11.14, p < .001.
Mean age (Low): 40.01 years
Mean age (High): 44.69 years
Mean difference: ≈ 4.7 years
Interpretation
We can interpret these findings as higher income earners are on average older than lower income earners. Age is positively associated with income.
wage_edu_aov <- aov(wage ~ education, data = Wage)
summary(wage_edu_aov)
Df Sum Sq Mean Sq F value Pr(>F)
education 4 1242703 310676 228.5 <2e-16 ***
Residuals 2925 3976086 1359
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An ANOVA tested whether income differs by education level:
F(4, 2925) = 228.5, p < .001
We can interpret these findings as a strong effect on education, meaning the highest wages made the most money, compared to lowest incomes(across the five income brackets). As education level went up, so did income/ wage level.
wage_maritl_table <- table(Wage$WageCategory, Wage$maritl)
wage_maritl_table
Divorced Married Never Married Separated Widowed
Low 114 817 470 37 9
High 86 1202 169 18 8
chisq_wage_maritl <- chisq.test(wage_maritl_table)
chisq_wage_maritl
Pearson's Chi-squared test
data: wage_maritl_table
X-squared = 225.33, df = 4, p-value < 2.2e-16
chisq_wage_maritl$expected
Divorced Married Never Married Separated Widowed
Low 98.77133 997.0966 315.5744 27.16212 8.395563
High 101.22867 1021.9034 323.4256 27.83788 8.604437
cramer_v <- CramerV(wage_maritl_table)
cramer_v
[1] 0.2773195
A chi-square test evaluated whether wage category is associated with marital status:
χ²(4) = 225.33, p < .001
Cramer’s V = 0.277
We can interpret the chi-square test with a moderate association between marital status and wage category. However, married participants are over-represented in the high wage group and un-married individuals are over-represented in the low wage group, which can skew the data inaccurately.
set.seed(123)
split_flag <- sample.split(Wage$WageCategory, SplitRatio = 0.7)
train_data <- subset(Wage, split_flag == TRUE)
test_data <- subset(Wage, split_flag == FALSE)
nrow(train_data); nrow(test_data)
[1] 2051
[1] 879
logit_model <- glm(
WageCategory ~ age + education + jobclass + maritl + health + health_ins,
data = train_data,
family = binomial
)
summary(logit_model)
Call:
glm(formula = WageCategory ~ age + education + jobclass + maritl +
health + health_ins, family = binomial, data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.906039 0.410689 -9.511 < 2e-16 ***
age 0.016967 0.005301 3.201 0.00137 **
educationAdvanced Degree 2.878189 0.272729 10.553 < 2e-16 ***
educationCollege Grad 2.356742 0.243963 9.660 < 2e-16 ***
educationHS Grad 0.654597 0.231391 2.829 0.00467 **
educationSome College 1.399887 0.238876 5.860 4.62e-09 ***
jobclassInformation 0.128700 0.110463 1.165 0.24398
maritlMarried 0.816820 0.209849 3.892 9.92e-05 ***
maritlNever Married -0.480261 0.245800 -1.954 0.05072 .
maritlSeparated -0.213755 0.474737 -0.450 0.65252
maritlWidowed 0.191127 0.599294 0.319 0.74979
health>=Very Good 0.325054 0.121269 2.680 0.00735 **
health_insYes 1.372473 0.123232 11.137 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2843.0 on 2050 degrees of freedom
Residual deviance: 2115.4 on 2038 degrees of freedom
AIC: 2141.4
Number of Fisher Scoring iterations: 4
odds_ratios <- exp(coef(logit_model))
odds_ratios
(Intercept) age educationAdvanced Degree
0.02012005 1.01711169 17.78204218
educationCollege Grad educationHS Grad educationSome College
10.55650582 1.92436665 4.05474320
jobclassInformation maritlMarried maritlNever Married
1.13734877 2.26329134 0.61862205
maritlSeparated maritlWidowed health>=Very Good
0.80754605 1.21061266 1.38410575
health_insYes
3.94509346
conf_int <- exp(confint(logit_model))
Waiting for profiling to be done...
conf_int
2.5 % 97.5 %
(Intercept) 0.008880154 0.04447506
age 1.006623973 1.02776935
educationAdvanced Degree 10.544345226 30.76750089
educationCollege Grad 6.618263517 17.25899258
educationHS Grad 1.236224160 3.07016179
educationSome College 2.566408373 6.56196069
jobclassInformation 0.915418699 1.41171317
maritlMarried 1.503694264 3.42685266
maritlNever Married 0.382337937 1.00301632
maritlSeparated 0.305655850 1.99075661
maritlWidowed 0.368417266 3.95933760
health>=Very Good 1.091532922 1.75622946
health_insYes 3.104800954 5.03437155
test_probs <- predict(logit_model, newdata = test_data, type = "response")
test_pred_class <- ifelse(test_probs > 0.5, "High", "Low")
test_pred_class <- factor(test_pred_class, levels = c("Low", "High"))
conf_mat <- confusionMatrix(
data = test_pred_class,
reference = test_data$WageCategory,
positive = "High"
)
conf_mat
Confusion Matrix and Statistics
Reference
Prediction Low High
Low 304 119
High 130 326
Accuracy : 0.7167
95% CI : (0.6857, 0.7463)
No Information Rate : 0.5063
P-Value [Acc > NIR] : <2e-16
Kappa : 0.4332
Mcnemar's Test P-Value : 0.5263
Sensitivity : 0.7326
Specificity : 0.7005
Pos Pred Value : 0.7149
Neg Pred Value : 0.7187
Prevalence : 0.5063
Detection Rate : 0.3709
Detection Prevalence : 0.5188
Balanced Accuracy : 0.7165
'Positive' Class : High
roc_obj <- roc(response = test_data$WageCategory,
predictor = test_probs,
levels = c("Low", "High")) # "High" is the event
Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve for WageCategory Model")
auc_value <- auc(roc_obj)
auc_value
Area under the curve: 0.8063
Our analysis shows that high earners were older on average, and education had one of the strongest effects. Incomes increased with each higher level of education. Married workers seemed more likely to be high earners and never-married workers seemed more likely to fall into the low-wage group, but the representations was not equally distributed.The logistic regression also confirmed these patterns. Education was the biggest predictor, with people with advanced degrees having higher odds of being high earners. Age, being married, better health, and having health insurance also increased the likelihood of earning a higher wage, while job class didn’t add much once other factors were included.
The model performed well on new data, with an accuracy of about 71.7%, which is much better than guessing. If we were to continue this analysis, it would be interesting to adjust some variables and explore additional factors like childhood family income, parental involvement, family structure, or early life stability(such as parental divorce, death in family, etc). These kinds of variables might help explain how early experiences shape income later in adulthood. Overall, this project highlights the characteristics most connected to current wage levels and points toward meaningful directions for future research on how upbringing might influence economic outcomes.