library(ISLR2)
library(tidyverse)
library(skimr)
data(Wage)
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
View(Wage)
median_wage<- median(Wage$wage)
median_wage
## [1] 104.9215
Wage$WageCategory<- factor(
ifelse(Wage$wage > median_wage, "High", "Low"),
levels = c("Low", "High")
)
Wage$WageCategory<- factor(Wage$WageCategory)
Wage$race<-factor(gsub("^\\d+\\.\\s*","", as.character(Wage$race)))
age_t_test<- t.test(age ~ WageCategory, data=Wage)
age_t_test
##
## Welch Two Sample t-test
##
## data: age by WageCategory
## t = -10.888, df = 2855, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
## -5.298535 -3.681416
## sample estimates:
## mean in group Low mean in group High
## 40.19512 44.68510
The mean age for Low wage earners is 40.19512, and for High wage earners, it’s 44.68510. The t-statistic is -10.888, the df is 2855, and the p-value is less than 0.05, making it statistically significant. This means that age does differ between wage categories.
library(supernova)
wage_anova<- aov(wage ~ education, data=Wage)
wage_anova
## Call:
## aov(formula = wage ~ education, data = Wage)
##
## Terms:
## education Residuals
## Sum of Squares 1226364 3995721
## Deg. of Freedom 4 2995
##
## Residual standard error: 36.52575
## Estimated effects may be unbalanced
supernova(wage_anova)
## Analysis of Variance Table (Type III SS)
## Model: wage ~ education
##
## SS df MS F PRE p
## ----- --------------- | ----------- ---- ---------- ------- ----- -----
## Model (error reduced) | 1226364.485 4 306591.121 229.806 .2348 .0000
## Error (from model) | 3995721.285 2995 1334.131
## ----- --------------- | ----------- ---- ---------- ------- ----- -----
## Total (empty model) | 5222085.770 2999 1741.276
The ANOVA table shows us that the F-statistics is 229.806, the df is 4, and the p-value is less than 0.05, making it statistically significant. This test shows us that there is a statistically significant difference between wage and education.
library(rcompanion)
wage_cont_table<- table(Wage$wage, Wage$education)
wage_cont_table %>% head(10)
##
## 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad
## 20.0855369231877 0 0 1 0
## 20.9343779401018 1 0 0 0
## 22.9624006796429 0 0 1 0
## 23.2747040516601 0 1 0 0
## 23.9529444789965 0 1 1 0
## 25.2910080971035 0 0 1 0
## 27.1405792079024 1 2 1 0
## 29.3769761006037 0 1 0 0
## 30.9237526933873 0 0 1 0
## 31.4109962716359 0 1 0 0
##
## 5. Advanced Degree
## 20.0855369231877 0
## 20.9343779401018 0
## 22.9624006796429 0
## 23.2747040516601 0
## 23.9529444789965 0
## 25.2910080971035 0
## 27.1405792079024 0
## 29.3769761006037 0
## 30.9237526933873 0
## 31.4109962716359 0
chisq.test(wage_cont_table)
##
## Pearson's Chi-squared test
##
## data: wage_cont_table
## X-squared = 2848.6, df = 2028, p-value < 2.2e-16
cramerV(wage_cont_table)
## Cramer V
## 0.4872
The x^2 is 2848.6, the df is 2028, the p-value is less than 0.05, showing that the relationship is statistically significant. The Cramer’s V is 0.4872, meaning that there is a moderate effect size between the relationship with wage and marital status.
library(caTools)
wage_sample<- sample.split(Wage$WageCategory, SplitRatio= 0.7)
wage_training_data<- subset(Wage, sample=T)
wage_testing_data<- subset(Wage, sample=F)
wage_model<- glm(WageCategory ~ education + age + jobclass + maritl,
family= "binomial", data=wage_training_data)
summary(wage_model)
##
## Call:
## glm(formula = WageCategory ~ education + age + jobclass + maritl,
## family = "binomial", data = wage_training_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.485389 0.243052 -14.340 < 2e-16 ***
## education2. HS Grad 0.788268 0.180789 4.360 1.30e-05 ***
## education3. Some College 1.594690 0.186577 8.547 < 2e-16 ***
## education4. College Grad 2.364116 0.189110 12.501 < 2e-16 ***
## education5. Advanced Degree 3.162107 0.218736 14.456 < 2e-16 ***
## age 0.019705 0.004127 4.774 1.81e-06 ***
## jobclass2. Information 0.238834 0.086978 2.746 0.00603 **
## maritl2. Married 1.246406 0.119553 10.426 < 2e-16 ***
## maritl3. Widowed 0.572058 0.541199 1.057 0.29050
## maritl4. Divorced 0.560192 0.194144 2.885 0.00391 **
## maritl5. Separated 0.600632 0.333203 1.803 0.07145 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.5 on 2999 degrees of freedom
## Residual deviance: 3348.2 on 2989 degrees of freedom
## AIC: 3370.2
##
## Number of Fisher Scoring iterations: 4
exp(coef(wage_model))
## (Intercept) education2. HS Grad
## 0.03064184 2.19958409
## education3. Some College education4. College Grad
## 4.92680116 10.63463642
## education5. Advanced Degree age
## 23.62030034 1.01990027
## jobclass2. Information maritl2. Married
## 1.26976776 3.47782266
## maritl3. Widowed maritl4. Divorced
## 1.77191052 1.75100794
## maritl5. Separated
## 1.82327070
library(pscl)
pR2(wage_model)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.1948651
library(caret)
varImp(wage_model) %>% arrange(desc(Overall))
## Overall
## education5. Advanced Degree 14.456240
## education4. College Grad 12.501249
## maritl2. Married 10.425592
## education3. Some College 8.547101
## age 4.774086
## education2. HS Grad 4.360154
## maritl4. Divorced 2.885442
## jobclass2. Information 2.745914
## maritl5. Separated 1.802599
## maritl3. Widowed 1.057020
This regression model conveys that Hs Grad, Some College, College Grad, Advanced Degree, age, and married individuals are statistically significant when looking at the relationship with WageCategory. Jobclass isn’t statistically signficant when looking at its relationship with WageCategory. Looking at the odds ratios, we can see that individuals who are college grads for example, have 10.63x the odds of having a high or low wage. One surprising result is the odd ratio for Advanced Degree, which is 23.62, making it the largest ratio among the variables. Looking at the “McFadden”, we can see that this model is kind of weak (0.195) and doesn’t explain a lot of variation. We can also see that education, specifically Advanced Degree is the most impactful variable.
wage_testing_data$pred_prob<- predict(wage_model, newdata = wage_testing_data, type = "response")
wage_testing_data$pred_class<- ifelse(wage_testing_data$pred_prob > 0.5, "High", "Low") %>% as.factor()
View(wage_testing_data)
library(car)
library(caret)
confusionMatrix(wage_testing_data$pred_class, wage_testing_data$WageCategory)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low High
## Low 1136 471
## High 381 1012
##
## Accuracy : 0.716
## 95% CI : (0.6995, 0.7321)
## No Information Rate : 0.5057
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4315
##
## Mcnemar's Test P-Value : 0.002295
##
## Sensitivity : 0.7488
## Specificity : 0.6824
## Pos Pred Value : 0.7069
## Neg Pred Value : 0.7265
## Prevalence : 0.5057
## Detection Rate : 0.3787
## Detection Prevalence : 0.5357
## Balanced Accuracy : 0.7156
##
## 'Positive' Class : Low
##
library(pROC)
roc_wage<- roc(wage_testing_data$WageCategory, wage_testing_data$pred_prob, levels= c("High", "Low"))
roc_wage
##
## Call:
## roc.default(response = wage_testing_data$WageCategory, predictor = wage_testing_data$pred_prob, levels = c("High", "Low"))
##
## Data: wage_testing_data$pred_prob in 1483 controls (wage_testing_data$WageCategory High) > 1517 cases (wage_testing_data$WageCategory Low).
## Area under the curve: 0.7838
auc(roc_wage)
## Area under the curve: 0.7838
plot(roc_wage, col = "darkblue", lwd = 2,
main = "ROC Curve – Logistic Regression")
After computing the predicted probabilities and classes and confusion matrix, we can see that the accuracy is 68.67%, the sensitivity is 67.53%, the specificity is 69.86%, and the balanced accuracy is 68.70%. The AUC value is 0.7795, which is good. Our model is also statistically significant.
A logistic regression was conducted to examine whether variables such as age, jobclass, education, or marital status predicted wage level (High or Low). The model was statistically significant, with an accuracy of 68.67%. Our findings convey that education relates to wage the most, and is also a significant predictor in the logistic regression. Married marital status is also significant predictor in the logistic regression. This model best predicts the low wage group. If I repeated this analysis, I would remove the jobclass variable.