table(wage_data$race)
##
## 1. White 2. Black 3. Asian 4. Other
## 2427 280 187 36
table(wage_data$education)
##
## 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad
## 264 949 633 661
## 5. Advanced Degree
## 423
table(wage_data$maritl)
##
## 1. Never Married 2. Married 3. Widowed 4. Divorced
## 639 2019 17 200
## 5. Separated
## 55
table(wage_data$region)
##
## 1. New England 2. Middle Atlantic 3. East North Central
## 0 2930 0
## 4. West North Central 5. South Atlantic 6. East South Central
## 0 0 0
## 7. West South Central 8. Mountain 9. Pacific
## 0 0 0
table(wage_data$health)
##
## 1. <=Good 2. >=Very Good
## 841 2089
table(wage_data$health_ins)
##
## 1. Yes 2. No
## 2029 901
table(wage_data$jobclass)
##
## 1. Industrial 2. Information
## 1507 1423
clean_levels <- function(x) {
factor(trimws(sub("^[0-9]+\\.\\s*", "", as.character(x))))
}
wage_data <- wage_data %>%
mutate(
race = clean_levels(race),
education = clean_levels(education),
maritl = clean_levels(maritl),
region = clean_levels(region),
health = clean_levels(health),
health_ins = clean_levels(health_ins),
jobclass = clean_levels(jobclass)
)
age_summary <- wage_data %>%
group_by(wage_category) %>%
summarize(
mean_age = mean(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
n = n()
)
print(age_summary)
## # A tibble: 2 × 4
## wage_category mean_age sd_age n
## <fct> <dbl> <dbl> <int>
## 1 Low 40.0 12.7 1447
## 2 High 44.7 9.82 1483
age_ttest <- t.test(age ~ wage_category, data = wage_data)
age_ttest
##
## Welch Two Sample t-test
##
## data: age by wage_category
## t = -11.143, df = 2724.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
## -5.496555 -3.851525
## sample estimates:
## mean in group Low mean in group High
## 40.01106 44.68510
Low wage earners (M = 40.0, SD = 12.7) were significantly younger than high wage earners (M = 44.7, SD = 9.8), t(2724.7) = -11.14, p<.001. Unsurprisingly, individuals who are high wage earners are typically older.
table(wage_data$jobclass)
##
## Industrial Information
## 1507 1423
anova_jobclass <- aov(wage ~ jobclass, data = wage_data)
summary(anova_jobclass)
## Df Sum Sq Mean Sq F value Pr(>F)
## jobclass 1 228631 228631 134.2 <2e-16 ***
## Residuals 2928 4990158 1704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_jobclass
## Call:
## aov(formula = wage ~ jobclass, data = wage_data)
##
## Terms:
## jobclass Residuals
## Sum of Squares 228631 4990158
## Deg. of Freedom 1 2928
##
## Residual standard error: 41.28303
## Estimated effects may be unbalanced
Our one-way ANOVA examined differences in wages between job classes. There was a significant effect of job class on wage, F(1,2928) = 134.2, p<.001, indicating that Information workers earned significantly higher wages than Industrial workers.
wage_maritl_table <- table(wage_data$wage_category, wage_data$maritl)
wage_maritl_table
##
## Divorced Married Never Married Separated Widowed
## Low 114 817 470 37 9
## High 86 1202 169 18 8
chisq_maritl <- chisq.test(wage_maritl_table)
chisq_maritl
##
## Pearson's Chi-squared test
##
## data: wage_maritl_table
## X-squared = 225.33, df = 4, p-value < 2.2e-16
cv_maritl <- cramerV(wage_maritl_table)
cv_maritl
## Cramer V
## 0.2773
Our Chi-Square test of independence examined the relationship between wage category (High vs. Low) and marital status. The association was significant, χ²(4,N=2950) = 225.33, p<.001, with a moderate effect size (Cramer’s V = .28). Married individuals are more likely to be high earners, whereas those who have never been married were more likely to be low earners. In context of our ANOVA analysis, this makes sense, since we can assume younger people are less likely to be married than older people.
set.seed(69)
split_flag <- sample.split(wage_data$wage_category, SplitRatio = 0.7)
train_data <- subset(wage_data, split_flag == TRUE)
test_data <- subset(wage_data, split_flag == FALSE)
table(train_data$wage_category)
##
## Low High
## 1013 1038
table(test_data$wage_category)
##
## Low High
## 434 445
wage_model <- glm(wage_category ~ age + education + jobclass + maritl + health + health_ins + race,
family = "binomial", data = train_data)
summary(wage_model)
##
## Call:
## glm(formula = wage_category ~ age + education + jobclass + maritl +
## health + health_ins + race, family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.048243 0.472767 -8.563 < 2e-16 ***
## age 0.026011 0.005456 4.768 1.86e-06 ***
## educationAdvanced Degree 2.848053 0.276567 10.298 < 2e-16 ***
## educationCollege Grad 2.325086 0.247394 9.398 < 2e-16 ***
## educationHS Grad 0.542361 0.234337 2.314 0.020643 *
## educationSome College 1.490482 0.241067 6.183 6.30e-10 ***
## jobclassInformation 0.266892 0.112748 2.367 0.017925 *
## maritlMarried 0.808737 0.208416 3.880 0.000104 ***
## maritlNever Married -0.608027 0.246409 -2.468 0.013604 *
## maritlSeparated 0.106907 0.478814 0.223 0.823323
## maritlWidowed -0.084486 0.738917 -0.114 0.908971
## health>=Very Good 0.241388 0.125084 1.930 0.053631 .
## health_insYes 1.236373 0.125745 9.832 < 2e-16 ***
## raceBlack -0.503718 0.290358 -1.735 0.082773 .
## raceOther -0.268700 0.552787 -0.486 0.626909
## raceWhite -0.073372 0.236710 -0.310 0.756586
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2843.0 on 2050 degrees of freedom
## Residual deviance: 2053.4 on 2035 degrees of freedom
## AIC: 2085.4
##
## Number of Fisher Scoring iterations: 5
exp(coef(wage_model))
## (Intercept) age educationAdvanced Degree
## 0.0174530 1.0263524 17.2541473
## educationCollege Grad educationHS Grad educationSome College
## 10.2275631 1.7200630 4.4392363
## jobclassInformation maritlMarried maritlNever Married
## 1.3058999 2.2450709 0.5444239
## maritlSeparated maritlWidowed health>=Very Good
## 1.1128303 0.9189848 1.2730145
## health_insYes raceBlack raceOther
## 3.4431022 0.6042796 0.7643727
## raceWhite
## 0.9292549
A logistic regression was conducted to predict wage category (High vs. Low) using age, education, job class, marital status, health status, health-insurance coverage, and race. The model revealed that education was the strongest predictor of wage category, followed by health insurance, marital status, and age. Job class and race did not contribute any significant predictive value to the model. Nothing really surprising stands out.
pR2(wage_model)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.2777218
varImp(wage_model) %>% arrange(desc(Overall))
## Overall
## educationAdvanced Degree 10.2978674
## health_insYes 9.8323672
## educationCollege Grad 9.3983250
## educationSome College 6.1828487
## age 4.7676420
## maritlMarried 3.8803957
## maritlNever Married 2.4675553
## jobclassInformation 2.3671605
## educationHS Grad 2.3144532
## health>=Very Good 1.9298077
## raceBlack 1.7348198
## raceOther 0.4860820
## raceWhite 0.3099673
## maritlSeparated 0.2232736
## maritlWidowed 0.1143372
vif(wage_model)
## GVIF Df GVIF^(1/(2*Df))
## age 1.255144 1 1.120332
## education 1.184975 4 1.021442
## jobclass 1.078005 1 1.038270
## maritl 1.267417 4 1.030066
## health 1.078323 1 1.038423
## health_ins 1.030917 1 1.015341
## race 1.096340 3 1.015448
test_data$pred_prob <- predict(wage_model, newdata = test_data, type = "response")
test_data$pred_class <- ifelse(test_data$pred_prob > 0.5, "High", "Low") %>% as.factor()
conf_matrix <- confusionMatrix(test_data$pred_class, test_data$wage_category, positive = "High")
## Warning in confusionMatrix.default(test_data$pred_class,
## test_data$wage_category, : Levels are not in the same order for reference and
## data. Refactoring data to match.
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 307 138
## High 127 307
##
## Accuracy : 0.6985
## 95% CI : (0.667, 0.7287)
## No Information Rate : 0.5063
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.3971
##
## Mcnemar's Test P-Value : 0.539
##
## Sensitivity : 0.6899
## Specificity : 0.7074
## Pos Pred Value : 0.7074
## Neg Pred Value : 0.6899
## Prevalence : 0.5063
## Detection Rate : 0.3493
## Detection Prevalence : 0.4937
## Balanced Accuracy : 0.6986
##
## 'Positive' Class : High
##
The logistic regression model accurately classified 70% of wage earners in the test set, which seems to suggest that it is a good, but far from perfect model. The model’s accuracy (0.70) was significantly greater than the No-Information Rate (0.51; p<.001), meaning that using the model is better at predicting wage category than randomly guessing.
The analyses here examine whether we can predict wage category (High vs. Low) based off of age, education, job class, marital status, health status, health-insurance coverage, and race. Overall, the answer is yes.
Our logistic regression model had an accuracy of 70%, meaning it classified respondents’ wage category correctly 70% of the time based off of their responses to the variables included in the model. The model’s accuracy significantly exceeded the No-Information Rate (p<.001), confirming that it generalizes meaningfully better than random guessing or always predicting the majority class.
While the overall model has good predictive value, not all of the variables contributed meaningfully. For example, no information related to race (White vs. Black vs. Asian vs. Other) or job class (Information vs. Industrial) provided statistically significant predictive value.
On the opposite end, education was the most important variable, followed by health insurance (yes vs. no), marital status, and age. In particular, having an Advanced Degree was the most telling piece of information within the model, followed by having health insurance, having a bachelor’s degree as your highest level of education, and having some college education.
Sensitivity (≈0.69) and specificity (≈0.71) were nearly equal, suggesting that the model predicts high and low wage earners with similar accuracy. The slightly higher specificity indicates that it performs marginally better at identifying low earners than high earners, though the difference is small.
If I were to repeat this analysis with the exact same dataset, I would simply remove the non-significant predictors (race and job class). If I could improve the dataset, I would add variables related to geographic region, urban vs. suburban vs. rural, parent education, immigrant status, and more.