# Packages used in this analysis (cramersV() assumed to come from the lsr package)
library(ISLR)     # Wage and Default datasets
library(skimr)    # skim()
library(caTools)  # sample.split()
library(caret)    # confusionMatrix()
library(pROC)     # roc(), auc()
library(lsr)      # cramersV()

wage <- Default
skim(wage)
| Name                    | wage  |
|:------------------------|:------|
| Number of rows          | 10000 |
| Number of columns       | 4     |
| Column type frequency:  |       |
| factor                  | 2     |
| numeric                 | 2     |
| Group variables         | None  |
**Variable type: factor**
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| default | 0 | 1 | FALSE | 2 | No: 9667, Yes: 333 |
| student | 0 | 1 | FALSE | 2 | No: 7056, Yes: 2944 |
**Variable type: numeric**
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| balance | 0 | 1 | 835.37 | 483.71 | 0.00 | 481.73 | 823.64 | 1166.31 | 2654.32 | ▆▇▅▁▁ |
| income | 0 | 1 | 33516.98 | 13336.64 | 771.97 | 21340.46 | 34552.64 | 43807.73 | 73554.23 | ▂▇▇▅▁ |
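The remainder of the analysis uses the ISLR Wage dataset (3,000 workers) rather than Default, and the same kind of summary can be produced for it; a minimal sketch (output omitted):

# Overview of the Wage data modeled in the rest of this report
skim(Wage)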
# Split wage at its median to create a binary High/Low outcome
median_wage <- median(Wage$wage, na.rm = TRUE)
Wage$WageCategory <- ifelse(Wage$wage > median_wage,
                            "High",
                            "Low")
Wage$WageCategory <- as.factor(Wage$WageCategory)
table(Wage$WageCategory)
##
## High Low
## 1483 1517
str(Wage$WageCategory)
## Factor w/ 2 levels "High","Low": 2 2 1 1 2 1 1 1 1 1 ...
vars_to_clean <- c("maritl", "race", "education", "region",
"jobclass", "health", "health_ins")
Wage[vars_to_clean] <- lapply(Wage[vars_to_clean], function(x) {
  x <- gsub("^[0-9]+\\.\\s*", "", x)  # drop the leading "1. ", "2. ", ... prefixes from the ISLR level labels
  factor(x)                           # re-encode as a factor with the cleaned labels
})
head(Wage)
## year age maritl race education region jobclass
## 231655 2006 18 Never Married White < HS Grad Middle Atlantic Industrial
## 86582 2004 24 Never Married White College Grad Middle Atlantic Information
## 161300 2003 45 Married White Some College Middle Atlantic Industrial
## 155159 2003 43 Married Asian College Grad Middle Atlantic Information
## 11443 2005 50 Divorced White HS Grad Middle Atlantic Information
## 376662 2008 54 Married White College Grad Middle Atlantic Information
## health health_ins logwage wage WageCategory
## 231655 <=Good No 4.318063 75.04315 Low
## 86582 >=Very Good No 4.255273 70.47602 Low
## 161300 <=Good Yes 4.875061 130.98218 High
## 155159 >=Very Good Yes 5.041393 154.68529 High
## 11443 <=Good Yes 4.318063 75.04315 Low
## 376662 >=Very Good Yes 4.845098 127.11574 High
tapply(Wage$age, Wage$WageCategory, mean)
## High Low
## 44.68510 40.19512
t_test_result <- t.test(age ~ WageCategory, data = Wage)
t_test_result
##
## Welch Two Sample t-test
##
## data: age by WageCategory
## t = 10.888, df = 2855, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
## 3.681416 5.298535
## sample estimates:
## mean in group High mean in group Low
## 44.68510 40.19512
The mean age of High wage earners is 44.69 years, compared with 40.20 years for Low wage earners. The Welch t-test shows a statistically significant difference between the two groups (t = 10.89, df = 2855, p < 2.2e-16), meaning that higher earners tend to be older, on average, than lower earners.
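To put this roughly 4.5-year gap on a standardized scale, an effect size such as Cohen's d could be added; a minimal sketch, assuming the lsr package (also used below for cramersV()) is loaded:

# Standardized mean difference in age between High and Low earners
cohensD(age ~ WageCategory, data = Wage)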
wage_job <- aov(wage ~ jobclass, data = Wage)
summary(wage_job)
## Df Sum Sq Mean Sq F value Pr(>F)
## jobclass 1 223538 223538 134.1 <2e-16 ***
## Residuals 2998 4998547 1667
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
wage_job
## Call:
## aov(formula = wage ~ jobclass, data = Wage)
##
## Terms:
## jobclass Residuals
## Sum of Squares 223538 4998547
## Deg. of Freedom 1 2998
##
## Residual standard error: 40.83251
## Estimated effects may be unbalanced
A one-way ANOVA was conducted to examine whether jobclass influences wage. The test revealed a significant effect of jobclass on wage, F(1, 2998) = 134.1, p < 2.2e-16, so mean wage differs between the two job classes. In this dataset, Information jobs are associated with higher average wages than Industrial jobs.
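The direction of the difference can be confirmed by comparing the group means directly; a quick sketch:

# Mean wage within each job class
tapply(Wage$wage, Wage$jobclass, mean)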
tab <- table(Wage$maritl, Wage$WageCategory)
tab
##
## High Low
## Divorced 86 118
## Married 1202 872
## Never Married 169 479
## Separated 18 37
## Widowed 8 11
chi_result <- chisq.test(tab)
chi_result
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 212.51, df = 4, p-value < 2.2e-16
cramersV(tab)
## [1] 0.2661507
X^2 = 212.51 with a p-value < 2.2e-16, and Cramér's V is 0.27. The chi-square test indicates a significant association between marital status and wage category, and the Cramér's V of 0.27 suggests that the association is of moderate strength.
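To see which marital-status groups drive this association, the standardized residuals from the chi-square test can be inspected; a short sketch using the chisq.test object already created:

# Cells with large positive or negative standardized residuals contribute most to X^2
round(chi_result$stdres, 2)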
set.seed(123)
split <- sample.split(Wage$WageCategory, SplitRatio = 0.7)
train <- subset(Wage, split == TRUE)
test <- subset(Wage, split == FALSE)
my_model <- glm(WageCategory ~ age + jobclass + education + maritl, data = train, family = binomial)
summary(my_model)
##
## Call:
## glm(formula = WageCategory ~ age + jobclass + education + maritl,
## family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.153364 0.369325 8.538 < 2e-16 ***
## age -0.020615 0.004927 -4.185 2.86e-05 ***
## jobclassInformation -0.226817 0.104394 -2.173 0.029802 *
## educationAdvanced Degree -3.347753 0.269660 -12.415 < 2e-16 ***
## educationCollege Grad -2.386828 0.228674 -10.438 < 2e-16 ***
## educationHS Grad -0.842377 0.218824 -3.850 0.000118 ***
## educationSome College -1.543362 0.225951 -6.831 8.46e-12 ***
## maritlMarried -0.872011 0.200928 -4.340 1.43e-05 ***
## maritlNever Married 0.348751 0.236068 1.477 0.139586
## maritlSeparated -0.233811 0.424155 -0.551 0.581469
## maritlWidowed 0.335188 0.661889 0.506 0.612568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2910.9 on 2099 degrees of freedom
## Residual deviance: 2324.4 on 2089 degrees of freedom
## AIC: 2346.4
##
## Number of Fisher Scoring iterations: 4
exp(coef(my_model))
## (Intercept) age jobclassInformation
## 23.41469156 0.97959593 0.79706683
## educationAdvanced Degree educationCollege Grad educationHS Grad
## 0.03516328 0.09192081 0.43068570
## educationSome College maritlMarried maritlNever Married
## 0.21366149 0.41810994 1.41729692
## maritlSeparated maritlWidowed
## 0.79151129 1.39820373
Because WageCategory has the levels "High" and "Low" in that order, the binomial glm models the probability of the second level, "Low", so odds ratios below 1 correspond to higher odds of being a High earner. Age, education, and marital status are significant predictors: each additional year of age slightly lowers the odds of being in the Low category (OR = 0.98 per year), education has the largest effect (an advanced degree has OR = 0.04 relative to the < HS Grad reference), and being married also lowers the odds of being Low (OR = 0.42) relative to being divorced. Job class is significant as well, but its effect (OR = 0.80 for Information) is more modest than the earlier ANOVA might have suggested.
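If the goal is for the model to estimate the probability of the High category directly, the response can be releveled before fitting; a minimal sketch (the outputs above still reflect the original coding, and the model is not refit here):

# glm() treats the second factor level as the modeled event
levels(train$WageCategory)
train$WageCategory <- relevel(train$WageCategory, ref = "Low")  # makes "High" the modeled event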
test$pred_prob <- predict(my_model, newdata = test, type = "response")
Predicted probabilities from the model; because "Low" is the second level of WageCategory, these are probabilities of being in the Low category.
test$pred_class <- ifelse(test$pred_prob >= 0.5, "High", "Low")  # note: pred_prob is P(Low), so this labeling inverts the classes
Class predictions at a 0.5 threshold.
test$pred_class <- factor(test$pred_class, levels = c("Low", "High"))
Setting the factor levels of the predicted classes.
cm <- confusionMatrix(test$pred_class, test$WageCategory, positive = "High")
## Warning in confusionMatrix.default(test$pred_class, test$WageCategory, positive
## = "High"): Levels are not in the same order for reference and data. Refactoring
## data to match.
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low
## High 138 330
## Low 307 125
##
## Accuracy : 0.2922
## 95% CI : (0.2627, 0.3231)
## No Information Rate : 0.5056
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : -0.4149
##
## Mcnemar's Test P-Value : 0.3834
##
## Sensitivity : 0.3101
## Specificity : 0.2747
## Pos Pred Value : 0.2949
## Neg Pred Value : 0.2894
## Prevalence : 0.4944
## Detection Rate : 0.1533
## Detection Prevalence : 0.5200
## Balanced Accuracy : 0.2924
##
## 'Positive' Class : High
##
Confusion Matrix.
roc_wage <- roc(test$WageCategory, test$pred_prob, levels = c("Low", "High"), direction = "<")
plot(roc_wage, col = "blue", main = "ROC Curve")
ROC Curve.
auc(roc_wage)
## Area under the curve: 0.2246
AUC value.
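Because pred_prob is the probability of the Low category, the classification and the ROC direction can be recomputed so that they match what the model actually predicts; a minimal sketch (pred_class2 is a new, hypothetical column; output not shown):

# Assign "Low" when P(Low) >= 0.5, keep the same levels as the reference, and re-evaluate
test$pred_class2 <- factor(ifelse(test$pred_prob >= 0.5, "Low", "High"),
                           levels = levels(test$WageCategory))
confusionMatrix(test$pred_class2, test$WageCategory, positive = "High")
auc(roc(test$WageCategory, test$pred_prob, levels = c("Low", "High"), direction = ">"))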
The goal of this project was to investigate how different variables contribute to wage. A t-test, a one-way ANOVA, and a chi-square test showed that wage was significantly associated with age, job class, and marital status, respectively, and older workers tended to earn more.
A logistic regression model was then built to identify which variables predict the probability of being in the High versus the Low wage category. Age, job class, education, and marital status were all statistically significant predictors, although some individual levels (Never Married, Separated, and Widowed) were not.
Testing the model on unseen data gave very poor classification metrics: accuracy (29.2%), sensitivity (31.0%), specificity (27.5%), and AUC (0.22) are all well below chance. Rather than showing that the predictors carry no signal, these below-chance values indicate that the predicted labels were inverted relative to the true classes: the model estimates the probability of the "Low" category, while the 0.5 threshold assigned those observations to "High". With the labels aligned (see the sketch above), discrimination is substantially better (AUC of about 0.78). Overall, age, education, job class, and marital status were significantly associated with wage, but the classification step has to be coded consistently with the model before it can be trusted on new data. If I repeated the analysis, I would also add years of experience, hours worked (part-time vs. full-time), and location (city vs. suburb).