Data Cleaning

table(wage_data$race)
## 
## 1. White 2. Black 3. Asian 4. Other 
##     2427      280      187       36
table(wage_data$education)
## 
##       1. < HS Grad         2. HS Grad    3. Some College    4. College Grad 
##                264                949                633                661 
## 5. Advanced Degree 
##                423
table(wage_data$maritl)
## 
## 1. Never Married       2. Married       3. Widowed      4. Divorced 
##              639             2019               17              200 
##     5. Separated 
##               55
table(wage_data$region)
## 
##        1. New England    2. Middle Atlantic 3. East North Central 
##                     0                  2930                     0 
## 4. West North Central     5. South Atlantic 6. East South Central 
##                     0                     0                     0 
## 7. West South Central           8. Mountain            9. Pacific 
##                     0                     0                     0
table(wage_data$health)
## 
##      1. <=Good 2. >=Very Good 
##            841           2089
table(wage_data$health_ins)
## 
## 1. Yes  2. No 
##   2029    901
table(wage_data$jobclass)
## 
##  1. Industrial 2. Information 
##           1507           1423
clean_levels <- function(x) {
  # Drop the numeric prefixes ("1. ", "2. ", ...) from the level labels,
  # trim any leftover whitespace, and rebuild the factor
  factor(trimws(sub("^[0-9]+\\.\\s*", "", as.character(x))))
}

wage_data <- wage_data %>%
  mutate(
    race       = clean_levels(race),
    education  = clean_levels(education),
    maritl     = clean_levels(maritl),
    region     = clean_levels(region),
    health     = clean_levels(health),
    health_ins = clean_levels(health_ins),
    jobclass   = clean_levels(jobclass)
  )
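
The seven repeated calls could be written more compactly with dplyr's across(); an equivalent sketch of the same recoding:

# Same cleaning, applied across all factor columns in one mutate()
wage_data <- wage_data %>%
  mutate(across(
    c(race, education, maritl, region, health, health_ins, jobclass),
    clean_levels
  ))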

T Test

age_summary <- wage_data %>%
  group_by(wage_category) %>%
  summarize(
    mean_age = mean(age, na.rm = TRUE),
    sd_age = sd(age, na.rm = TRUE),
    n = n()
  )
print(age_summary)
## # A tibble: 2 × 4
##   wage_category mean_age sd_age     n
##   <fct>            <dbl>  <dbl> <int>
## 1 Low               40.0  12.7   1447
## 2 High              44.7   9.82  1483
age_ttest <- t.test(age ~ wage_category, data = wage_data)
age_ttest
## 
##  Welch Two Sample t-test
## 
## data:  age by wage_category
## t = -11.143, df = 2724.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
##  -5.496555 -3.851525
## sample estimates:
##  mean in group Low mean in group High 
##           40.01106           44.68510

Low wage earners (M = 40.0, SD = 12.7) were significantly younger than high wage earners (M = 44.7, SD = 9.8), t(2724.7) = -11.14, p < .001. Unsurprisingly, high wage earners tend to be older.
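
For a sense of practical magnitude (not part of the original output), Cohen's d can be computed by hand from age_summary; the pooled-SD formula below assumes roughly equal group variances, which the Welch test itself does not require:

# Cohen's d for the age difference, from the group summaries above
m <- age_summary$mean_age
s <- age_summary$sd_age
n <- age_summary$n
sd_pooled <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (sum(n) - 2))
(m[2] - m[1]) / sd_pooled  # roughly 0.4, a small-to-moderate effect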

ANOVA

table(wage_data$jobclass)
## 
##  Industrial Information 
##        1507        1423
anova_jobclass <- aov(wage ~ jobclass, data = wage_data)
summary(anova_jobclass)
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## jobclass       1  228631  228631   134.2 <2e-16 ***
## Residuals   2928 4990158    1704                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_jobclass
## Call:
##    aov(formula = wage ~ jobclass, data = wage_data)
## 
## Terms:
##                 jobclass Residuals
## Sum of Squares    228631   4990158
## Deg. of Freedom        1      2928
## 
## Residual standard error: 41.28303
## Estimated effects may be unbalanced

Our one-way ANOVA examined differences in wages between job classes. There was a significant effect of job class on wage, F(1, 2928) = 134.2, p < .001. The F-test alone does not indicate which group earns more; comparing the group means (see the sketch below) shows that Information workers earned higher average wages than Industrial workers.
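
A minimal base-R check of that direction:

# Mean wage by job class; Information should show the higher mean
aggregate(wage ~ jobclass, data = wage_data, FUN = mean)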

Chi-Sq

wage_maritl_table <- table(wage_data$wage_category, wage_data$maritl)
wage_maritl_table
##       
##        Divorced Married Never Married Separated Widowed
##   Low       114     817           470        37       9
##   High       86    1202           169        18       8
chisq_maritl <- chisq.test(wage_maritl_table)
chisq_maritl
## 
##  Pearson's Chi-squared test
## 
## data:  wage_maritl_table
## X-squared = 225.33, df = 4, p-value < 2.2e-16
cv_maritl <- cramerV(wage_maritl_table)
cv_maritl
## Cramer V 
##   0.2773

Our Chi-Square test of independence examined the relationship between wage category (High vs. Low) and marital status. The association was significant, χ²(4, N = 2930) = 225.33, p < .001, with a moderate effect size (Cramér's V = .28). Married individuals were more likely to be high earners, whereas those who had never married were more likely to be low earners. This fits the t-test results above: younger people are both less likely to be married and less likely to be high earners.
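
The omnibus test does not say which cells drive the association; the standardized residuals stored on the test object do (a quick sketch, not part of the original output):

# Pearson standardized residuals: |value| > 2 flags cells that depart
# notably from independence
round(chisq_maritl$stdres, 2)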

Logistic Regression Model

set.seed(69)
# Stratified 70/30 split (caTools::sample.split preserves the
# High/Low ratio in both partitions)
split_flag <- sample.split(wage_data$wage_category, SplitRatio = 0.7)
train_data <- subset(wage_data, split_flag == TRUE)
test_data  <- subset(wage_data, split_flag == FALSE)

table(train_data$wage_category)
## 
##  Low High 
## 1013 1038
table(test_data$wage_category)
## 
##  Low High 
##  434  445
wage_model <- glm(wage_category ~ age + education + jobclass + maritl + health + health_ins + race,
                  family = "binomial", data = train_data)

summary(wage_model)
## 
## Call:
## glm(formula = wage_category ~ age + education + jobclass + maritl + 
##     health + health_ins + race, family = "binomial", data = train_data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -4.048243   0.472767  -8.563  < 2e-16 ***
## age                       0.026011   0.005456   4.768 1.86e-06 ***
## educationAdvanced Degree  2.848053   0.276567  10.298  < 2e-16 ***
## educationCollege Grad     2.325086   0.247394   9.398  < 2e-16 ***
## educationHS Grad          0.542361   0.234337   2.314 0.020643 *  
## educationSome College     1.490482   0.241067   6.183 6.30e-10 ***
## jobclassInformation       0.266892   0.112748   2.367 0.017925 *  
## maritlMarried             0.808737   0.208416   3.880 0.000104 ***
## maritlNever Married      -0.608027   0.246409  -2.468 0.013604 *  
## maritlSeparated           0.106907   0.478814   0.223 0.823323    
## maritlWidowed            -0.084486   0.738917  -0.114 0.908971    
## health>=Very Good         0.241388   0.125084   1.930 0.053631 .  
## health_insYes             1.236373   0.125745   9.832  < 2e-16 ***
## raceBlack                -0.503718   0.290358  -1.735 0.082773 .  
## raceOther                -0.268700   0.552787  -0.486 0.626909    
## raceWhite                -0.073372   0.236710  -0.310 0.756586    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2843.0  on 2050  degrees of freedom
## Residual deviance: 2053.4  on 2035  degrees of freedom
## AIC: 2085.4
## 
## Number of Fisher Scoring iterations: 5
exp(coef(wage_model))
##              (Intercept)                      age educationAdvanced Degree 
##                0.0174530                1.0263524               17.2541473 
##    educationCollege Grad         educationHS Grad    educationSome College 
##               10.2275631                1.7200630                4.4392363 
##      jobclassInformation            maritlMarried      maritlNever Married 
##                1.3058999                2.2450709                0.5444239 
##          maritlSeparated            maritlWidowed        health>=Very Good 
##                1.1128303                0.9189848                1.2730145 
##            health_insYes                raceBlack                raceOther 
##                3.4431022                0.6042796                0.7643727 
##                raceWhite 
##                0.9292549

A logistic regression was conducted to predict wage category (High vs. Low) from age, education, job class, marital status, health status, health-insurance coverage, and race. Education was the strongest predictor of wage category, followed by health-insurance coverage, marital status, and age. Job class reached significance but contributed comparatively little (p = .018), and race added no significant predictive value. None of these patterns is particularly surprising.
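
The output above shows point-estimate odds ratios only; a minimal sketch for attaching 95% Wald confidence intervals (confint() would give profile-likelihood intervals instead, at more computational cost):

# Odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(wage_model), confint.default(wage_model)))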

Model Evaluation

pR2(wage_model)["McFadden"]
## fitting null model for pseudo-r2
##  McFadden 
## 0.2777218
varImp(wage_model) %>% arrange(desc(Overall))
##                             Overall
## educationAdvanced Degree 10.2978674
## health_insYes             9.8323672
## educationCollege Grad     9.3983250
## educationSome College     6.1828487
## age                       4.7676420
## maritlMarried             3.8803957
## maritlNever Married       2.4675553
## jobclassInformation       2.3671605
## educationHS Grad          2.3144532
## health>=Very Good         1.9298077
## raceBlack                 1.7348198
## raceOther                 0.4860820
## raceWhite                 0.3099673
## maritlSeparated           0.2232736
## maritlWidowed             0.1143372
vif(wage_model)
##                GVIF Df GVIF^(1/(2*Df))
## age        1.255144  1        1.120332
## education  1.184975  4        1.021442
## jobclass   1.078005  1        1.038270
## maritl     1.267417  4        1.030066
## health     1.078323  1        1.038423
## health_ins 1.030917  1        1.015341
## race       1.096340  3        1.015448
test_data$pred_prob  <- predict(wage_model, newdata = test_data, type = "response")
# Build the predicted-class factor with levels matching the reference,
# which avoids caret's "Levels are not in the same order" warning
test_data$pred_class <- factor(ifelse(test_data$pred_prob > 0.5, "High", "Low"),
                               levels = levels(test_data$wage_category))

conf_matrix <- confusionMatrix(test_data$pred_class, test_data$wage_category, positive = "High")
conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low  307  138
##       High 127  307
##                                          
##                Accuracy : 0.6985         
##                  95% CI : (0.667, 0.7287)
##     No Information Rate : 0.5063         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.3971         
##                                          
##  Mcnemar's Test P-Value : 0.539          
##                                          
##             Sensitivity : 0.6899         
##             Specificity : 0.7074         
##          Pos Pred Value : 0.7074         
##          Neg Pred Value : 0.6899         
##              Prevalence : 0.5063         
##          Detection Rate : 0.3493         
##    Detection Prevalence : 0.4937         
##       Balanced Accuracy : 0.6986         
##                                          
##        'Positive' Class : High           
## 

The logistic regression model correctly classified 70% of wage earners in the test set, which suggests a good, though far from perfect, model. Its accuracy (0.70) was significantly greater than the No-Information Rate (0.51, p < .001), meaning the model predicts wage category better than always guessing the majority class. The McFadden pseudo-R² of 0.28 and the VIFs near 1 reported above point the same way: a reasonable fit with no multicollinearity concerns.
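
Accuracy is tied to the 0.5 cutoff used above. As a threshold-free complement, the AUC could be computed; a sketch assuming the pROC package (not loaded elsewhere in this analysis) is installed:

library(pROC)
roc_obj <- roc(test_data$wage_category, test_data$pred_prob)
auc(roc_obj)  # 0.5 = chance discrimination, 1 = perfect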

Final Summary

The analyses here examine whether wage category (High vs. Low) can be predicted from age, education, job class, marital status, health status, health-insurance coverage, and race. Overall, the answer is yes.

Our logistic regression model had an accuracy of 70%, meaning it classified respondents' wage category correctly 70% of the time from the variables included in the model. Its accuracy significantly exceeded the No-Information Rate (p < .001), confirming that it generalizes meaningfully better than always predicting the majority class.

While the overall model has good predictive value, not all of the variables contributed equally. Race (White vs. Black vs. Asian vs. Other) provided no statistically significant predictive value, and job class (Information vs. Industrial), though it reached significance, was one of the weakest predictors in the model.

At the opposite end, education was the most important variable, followed by health-insurance coverage (yes vs. no), marital status, and age. In particular, holding an advanced degree was the single most informative predictor in the model, followed by having health insurance, holding a bachelor's degree as one's highest level of education, and having some college education.

Sensitivity (≈0.69) and specificity (≈0.71) were nearly equal, suggesting that the model predicts high and low wage earners with similar accuracy. The slightly higher specificity indicates that it performs marginally better at identifying low earners than high earners, though the difference is small.
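
If one kind of error mattered more than the other in practice, the classification cutoff could be tuned; a rough sketch of the trade-off using the test-set probabilities from above:

# How the confusion matrix shifts as the cutoff moves away from 0.5
for (cut in c(0.4, 0.5, 0.6)) {
  pred <- factor(ifelse(test_data$pred_prob > cut, "High", "Low"),
                 levels = levels(test_data$wage_category))
  cat("cutoff =", cut, "\n")
  print(table(Prediction = pred, Reference = test_data$wage_category))
}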

If I were to repeat this analysis with the exact same dataset, I would remove the clearly non-significant predictor (race) and re-check whether the weaker predictors still earn their place. If I could improve the dataset, I would add variables with genuine geographic variation, since the existing region variable is uninformative here (every respondent is from the Middle Atlantic): urban vs. suburban vs. rural, parental education, immigrant status, and more.
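
Before settling on a reduced model, a likelihood-ratio comparison would confirm that dropping race costs nothing; a minimal sketch:

# Refit without race and compare against the full model
wage_model_reduced <- update(wage_model, . ~ . - race)
anova(wage_model_reduced, wage_model, test = "Chisq")  # likelihood-ratio test
AIC(wage_model, wage_model_reduced)                    # lower AIC is better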