1 Mean ages
2 Test statistics
3 Interpretation
4 From the ANOVA table
5 Interpretation
6 Test statistics
7 Interpreation
8 Significant predictors (P<.05)
9 Non-significant predictors
10 Direction and size (using Odds Ratios)
11 Findings
12 Confusion matrix results
13 Interpretation
14 Interpreation
15 Final Interpretation

library(ISLR)
data(Wage)
wage <- Wage

data("Wage", package = "ISLR")
w <- Wage
str(w)

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

summary(w$wage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.09   85.38  104.92  111.70  128.68  318.34

# create WageCategory: High if wage > median, Low if wage < median.
med_wage <- median(w$wage, na.rm = TRUE)
w <- w %>%
  mutate(WageCategory = case_when(
    wage > med_wage ~ "High",
    wage < med_wage ~ "Low",
    TRUE ~ NA_character_
  )) %>%
  mutate(WageCategory = factor(WageCategory, levels = c("Low", "High")))

table(w$WageCategory, useNA = "ifany")

## 
##  Low High <NA> 
## 1447 1483   70

# Fix factor variables that have numeric prefixes
fix_factor_prefix <- function(f) {
  lv <- levels(f)
  lv <- gsub("^[0-9]+-", "", lv)
  levels(f) <- lv
  return(f)
}

factor_vars <- names(wage)[sapply(wage, is.factor)]

for (v in factor_vars) {
  wage[[v]] <- fix_factor_prefix(wage[[v]])
}
# Verify
lapply(wage[factor_vars], levels)

## $maritl
## [1] "1. Never Married" "2. Married"       "3. Widowed"       "4. Divorced"     
## [5] "5. Separated"    
## 
## $race
## [1] "1. White" "2. Black" "3. Asian" "4. Other"
## 
## $education
## [1] "1. < HS Grad"       "2. HS Grad"         "3. Some College"   
## [4] "4. College Grad"    "5. Advanced Degree"
## 
## $region
## [1] "1. New England"        "2. Middle Atlantic"    "3. East North Central"
## [4] "4. West North Central" "5. South Atlantic"     "6. East South Central"
## [7] "7. West South Central" "8. Mountain"           "9. Pacific"           
## 
## $jobclass
## [1] "1. Industrial"  "2. Information"
## 
## $health
## [1] "1. <=Good"      "2. >=Very Good"
## 
## $health_ins
## [1] "1. Yes" "2. No"

age_by_wage <- w %>% filter(!is.na(WageCategory)) %>% select(age, WageCategory)
means <- age_by_wage %>% group_by(WageCategory) %>% summarise(mean_age = mean(age, na.rm = TRUE), n = n())
means

## # A tibble: 2 × 3
##   WageCategory mean_age     n
##   <fct>           <dbl> <int>
## 1 Low              40.0  1447
## 2 High             44.7  1483

t_res <- t.test(age ~ WageCategory, data = age_by_wage, var.equal = FALSE)
t_res

## 
##  Welch Two Sample t-test
## 
## data:  age by WageCategory
## t = -11.143, df = 2724.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
##  -5.496555 -3.851525
## sample estimates:
##  mean in group Low mean in group High 
##           40.01106           44.68510

tidy(t_res)

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1    -4.67      40.0      44.7     -11.1 3.14e-28     2725.    -5.50     -3.85
## # ℹ 2 more variables: method <chr>, alternative <chr>

1 Mean ages

Low wage earners: 40.01 years
High wage earners: 44.69 years

2 Test statistics

t = -11.14
df = 2724.7
p < 2.2 x 10^-16

3 Interpretation

There is a statistically significant age difference between High and Low wage earners. On average, High earners are about 4.7 years older than Low earners. Because the p-value is extremely small, we conclude that age is meaningfully associated with earning a hgih wage in this dataset.

aov_mod <- aov(wage ~ education, data = w)
anova_table <- summary(aov_mod)
anova_table

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## education      4 1226364  306591   229.8 <2e-16 ***
## Residuals   2995 3995721    1334                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Pairwise comparisons (Tukey) to see which education levels differ
tukey_res <- TukeyHSD(aov_mod)
tukey_res

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = wage ~ education, data = w)
## 
## $education
##                                        diff       lwr      upr    p adj
## 2. HS Grad-1. < HS Grad            11.67894  4.799846 18.55803 3.68e-05
## 3. Some College-1. < HS Grad       23.65115 16.413966 30.88834 0.00e+00
## 4. College Grad-1. < HS Grad       40.32349 33.140488 47.50650 0.00e+00
## 5. Advanced Degree-1. < HS Grad    66.81336 59.040518 74.58621 0.00e+00
## 3. Some College-2. HS Grad         11.97221  6.919816 17.02461 0.00e+00
## 4. College Grad-2. HS Grad         28.64456 23.670077 33.61904 0.00e+00
## 5. Advanced Degree-2. HS Grad      55.13443 49.340723 60.92813 0.00e+00
## 4. College Grad-3. Some College    16.67234 11.213367 22.13132 0.00e+00
## 5. Advanced Degree-3. Some College 43.16221 36.947555 49.37687 0.00e+00
## 5. Advanced Degree-4. College Grad 26.48987 20.338392 32.64134 0.00e+00

4 From the ANOVA table

F(4, 2995) = 229.8
p < 2 x 10^-16

5 Interpretation

There is a significant effect of education level on wage. Individuals with higher levels of education earn significantly higher wages on average. Tukey post-hoc comparisons show that every education level differs significantly from the one below it, with wages increasing steadiy from “<HS Grad” up to “Advanced Degree.” Education appears to be one of the strongest predictors of wage in the dataset.

ct <- table(w$WageCategory, w$jobclass)
ct

##       
##        1. Industrial 2. Information
##   Low            878            569
##   High           629            854

chisq_res <- chisq.test(ct)
chisq_res

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ct
## X-squared = 97.065, df = 1, p-value < 2.2e-16

# Compute Cramer's V
cramer_v <- CramerV(ct)
cramer_v

## [1] 0.1826937

6 Test statistics

X^2 = 97.07
df = 1
p < 2.2 x 10^-16
Cramer’s V = 0.183 (small-medium association)

7 Interpreation

Wage catergor is significantly associated with type of job class. - Low earners are more common in Industrial jobs. - High earners are more common in Information jobs. Although the association is not extremely strong (V≈0.18), it is meaningful and indicates job class plays a role in wage outcomes.

set.seed(2025)
library(caTools)
split <- sample.split(w$WageCategory, SplitRatio = 0.70)
train_data <- subset(w, split == TRUE)
test_data <- subset(w, split == FALSE)
nrow(train_data); nrow(test_data)

## [1] 2051

## [1] 949

prop.table(table(train_data$WageCategory))

## 
##       Low      High 
## 0.4939054 0.5060946

prop.table(table(test_data$WageCategory))

## 
##       Low      High 
## 0.4937429 0.5062571

# Make sure factors are clean and relevel sensible refernce levels
train_data <- train_data %>%
  mutate(education = droplevels(education),
         jobclass = droplevels(jobclass),
         maritl = droplevels(maritl),
         race = droplevels(race),
         health = droplevels(health),
         health_ins = droplevels(health_ins))

# Fit logistic regression
logit_mod <- glm(WageCategory ~ age + education + jobclass + maritl + race + health + health_ins + year,
                 data = train_data, family = binomial)
summary(logit_mod)

## 
## Call:
## glm(formula = WageCategory ~ age + education + jobclass + maritl + 
##     race + health + health_ins + year, family = binomial, data = train_data)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.843e+02  5.361e+01  -3.439 0.000585 ***
## age                          2.310e-02  5.461e-03   4.230 2.34e-05 ***
## education2. HS Grad          5.113e-01  2.186e-01   2.339 0.019349 *  
## education3. Some College     1.381e+00  2.270e-01   6.080 1.20e-09 ***
## education4. College Grad     2.279e+00  2.307e-01   9.879  < 2e-16 ***
## education5. Advanced Degree  2.837e+00  2.631e-01  10.781  < 2e-16 ***
## jobclass2. Information       4.450e-02  1.117e-01   0.398 0.690321    
## maritl2. Married             1.244e+00  1.549e-01   8.033 9.50e-16 ***
## maritl3. Widowed             8.607e-01  7.734e-01   1.113 0.265812    
## maritl4. Divorced            5.827e-01  2.482e-01   2.348 0.018856 *  
## maritl5. Separated           4.309e-01  4.516e-01   0.954 0.339988    
## race2. Black                -5.851e-01  1.914e-01  -3.057 0.002234 ** 
## race3. Asian                -7.874e-02  2.300e-01  -0.342 0.732111    
## race4. Other                -9.274e-02  4.685e-01  -0.198 0.843081    
## health2. >=Very Good         2.443e-01  1.206e-01   2.025 0.042871 *  
## health_ins2. No             -1.237e+00  1.228e-01 -10.074  < 2e-16 ***
## year                         9.038e-02  2.672e-02   3.382 0.000720 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2843.0  on 2050  degrees of freedom
## Residual deviance: 2117.3  on 2034  degrees of freedom
## AIC: 2151.3
## 
## Number of Fisher Scoring iterations: 4

# Odds ratios
exp_coef <- exp(coef(logit_mod))
exp_confint <- exp(confint.default(logit_mod))
odds_table <- cbind(Estimate = coef(logit_mod), OR = exp_coef, `2.5%_OR` = exp_confint[,1], `97.5%_OR` = exp_confint[,2])
knitr::kable(odds_table, digits = 3)

	Estimate	OR	2.5%_OR	97.5%_OR
(Intercept)	-184.339	0.000	0.000	0.000
age	0.023	1.023	1.012	1.034
education2. HS Grad	0.511	1.667	1.086	2.559
education3. Some College	1.381	3.977	2.549	6.206
education4. College Grad	2.279	9.768	6.215	15.353
education5. Advanced Degree	2.837	17.059	10.186	28.570
jobclass2. Information	0.044	1.046	0.840	1.301
maritl2. Married	1.244	3.470	2.562	4.701
maritl3. Widowed	0.861	2.365	0.519	10.768
maritl4. Divorced	0.583	1.791	1.101	2.913
maritl5. Separated	0.431	1.539	0.635	3.728
race2. Black	-0.585	0.557	0.383	0.811
race3. Asian	-0.079	0.924	0.589	1.451
race4. Other	-0.093	0.911	0.364	2.283
health2. >=Very Good	0.244	1.277	1.008	1.617
health_ins2. No	-1.237	0.290	0.228	0.369
year	0.090	1.095	1.039	1.153

8 Significant predictors (P<.05)

Age
Education: all levels 2-5
Married, divorced
Race: Black
Health: ≥Very Good
Health insurance: No
Year

9 Non-significant predictors

Job class
Widowed, separated
Race: Asian, Other

10 Direction and size (using Odds Ratios)

Education is the strongest predictor. For example:
- Advanced Degree OR = 17.06 -College Grade OR = 9.77
Age: OR = 1.023 -> each additional year increases odds of High wage by ~2.3%.
Married: OR = 3.47 -> much higher offs of being a High earner.
Race (Black): OR = 0.56 -> lower odds of hgih wage compared to White.
No health insurance: OR = 0.29 -> surprisingly strong decrease in odds.
Year: OR = 1.095 -> later survey years have higher odds of high wages.

11 Findings

Job class was not significant, even though the Chi-square test showed an association. This often happens when other stronger variables absorb the predictive power. Also, individuals without health insurance had much lower odds of hgih wage, which may reflect socioeconoic or employer-benefit differences.

# Predicted probabilities on test set
test_data$pred_prob <- predict(logit_mod, newdata = test_data, type = "response")

# Choose default cutoff 0.5 for class
test_data$pred_class <- factor(ifelse(test_data$pred_prob >= 0.5, "High", "Low"), levels = c("Low", "High"))

library(caret)
conf_mat <- confusionMatrix(test_data$pred_class, test_data$WageCategory, positive = "High")
conf_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low  325  126
##       High 109  319
##                                           
##                Accuracy : 0.7327          
##                  95% CI : (0.7021, 0.7617)
##     No Information Rate : 0.5063          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4655          
##                                           
##  Mcnemar's Test P-Value : 0.2966          
##                                           
##             Sensitivity : 0.7169          
##             Specificity : 0.7488          
##          Pos Pred Value : 0.7453          
##          Neg Pred Value : 0.7206          
##              Prevalence : 0.5063          
##          Detection Rate : 0.3629          
##    Detection Prevalence : 0.4869          
##       Balanced Accuracy : 0.7329          
##                                           
##        'Positive' Class : High            
##

12 Confusion matrix results

Accuracy: 0.733
Sensitivity (High earners detected): 0.717
Specificity (Low earners detected): 0.749
Balanced Accuracy: 0.733
No Information Rate (NIR): 0.506
Accuracy is much higher than the NIR, and p<2e-16 confirms the model significantly beats chance.

13 Interpretation

The model performs well on unseen data. It predicts both wage categories reasonably well, with slightly stronger performance for Low earners (higher specificity). Overall accuracy is strong relative to the baseline.

library(pROC)
roc_obj <- roc(response = test_data$WageCategory, predictor = test_data$pred_prob, levels = c("Low", "High"), direction = "<")
auc_val <- auc(roc_obj)
roc_obj

## 
## Call:
## roc.default(response = test_data$WageCategory, predictor = test_data$pred_prob,     levels = c("Low", "High"), direction = "<")
## 
## Data: test_data$pred_prob in 434 controls (test_data$WageCategory Low) < 445 cases (test_data$WageCategory High).
## Area under the curve: 0.8196

plot(roc_obj, main = paste0("ROC curve (AUC = ", round(auc_val,3), ")"))

AUC = 0.820
An AUC aboce 0.80 indicates excellent classification performance.

14 Interpreation

The ROC curve shows the model is effective at distinguishing High vs. Low earners. Sensitivity (correctly identifying High earners) is strong, and specificity is also good. The AUC well exceeds the chance level of 0.50, and accuracy is far above the NIR, confirming that the model performs reliably better than random guessing.

15 Final Interpretation

The results across descriptive statistics, group comparisons, and predictive modeling all suggest that wage outcomes are systematically related to age, education, marital status, and job-related characteristics. Age shows a notable difference between wage groups, with High earners being significantly older on average. Education emerges as one of the most significant correlates of wage, confirmed by both ANOVA and logistic regression. Wages consistently increase as educational attainment rises, and each step up from “<HS” to “Advanced Degree” adds a considerable wage increase.

Job class also demonstrates an association with wage category in the Chi-square analysis, with Information jobs containing a higher proportion of High earners. However, when included alongside other predictors in the logistic regression, job class is no longer a significant predictor. This suggests that job class and wage category are linked, but much of the predictive value is explained by other variables, such as education. Marital status also plays a notable role; married individuals have substantially higher odds of being High earners, while divorce offers a smaller and significant increase.

The logistic regression model provided strong predictive performance. Accuracy on the test set (73.3%) was well above the No Information Rate (50.6%), and the AUC of 0.82 shows excellent ability to discriminate between High and Low earners. The model predicted Low earners slightly more accurately, but performance was balanced overall. These results indicate that the model generalizes well to unseen data and captures meaningful patterns in the wage structure.

If this analysis were repeated, it might be valuable to add variables related to type of industry, years of experience, workload, or employer characteristics, which might further clarify wage drivers. Removing non-significant predictors like job class or some of the race categories might slightly simplify the model without harming predictive power. Overall, the results emphasize the strong importance of education, marital status, age, and health insurance status in predicting wage category within the Wage dataset.

Wage Analytics - Predicting High vs. Low Earners Using Logistic Regression

Emma Valentina Tupone

2025-11-17