The goal of this paper is to conduct an exploratory data analysis, test three different models predicting wages in the Wages dataset from the ISLR library 1, and choose the best model for predicting the top and worst earners. Some general comments:
caret package 2.
In the first step, let’s explore whether any variables are highly correlated. Because most of the interesting variables are categorical, I cannot compute Pearson correlation coefficients. Instead, I’m using the vcd package to build contingency tables and compute Cramér’s V to quickly assess relations between variables. The variables have up to 5 levels and there are 3000 observations, so the chi-square statistics should be reliable.
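To make the statistic concrete, here is a minimal sketch of Cramér’s V in base R. The helper name `cramers_v` and the toy data frame are invented for illustration and are not part of the original analysis; the vcd package mentioned above reports the same statistic through `assocstats()`.

```r
# Cramér's V for two categorical vectors, using base R only.
# `cramers_v` is a hypothetical helper name, not a function from vcd or caret.
cramers_v <- function(x, y) {
  tab <- table(x, y)                       # contingency table
  # Pearson chi-square without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)                            # number of observations
  k <- min(nrow(tab), ncol(tab)) - 1       # smaller table dimension minus one
  unname(sqrt(chi2 / (n * k)))
}

# Toy data, invented for illustration only (not the ISLR data).
toy <- data.frame(
  education = c("HS", "HS", "College", "College",
                "HS", "College", "HS", "College"),
  jobclass  = c("Industrial", "Industrial", "Information", "Information",
                "Industrial", "Information", "Information", "Industrial")
)

v_toy <- cramers_v(toy$education, toy$jobclass)
v_toy
```

Values close to 0 indicate no association and values close to 1 indicate a near-perfect one, so the same helper could be applied to every pair of categorical columns to flag redundant predictors.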
Categorical variables are not highly correlated with each other, so there is no need to exclude any of them from the models. Besides the obvious wage~logwage relation, there is some overlap with age and in the education~jobclass pair.
Now let’s look more closely at what kind of relation we can find between the predictors and wages. After some trial and error, I have chosen a few graphs showing how they relate to each other:
It looks promising. After the EDA, it seems that at least age, education and health insurance have some influence on the wage and can be used in our predictive models. I decided to exclude two variables: logwage and maritl (highly imbalanced).
As I mentioned at the beginning, before modeling, let’s convert the continuous wage variable into 3 categorical levels on the basis of its distribution:
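The cut points used for this step are not shown, so the following is only a minimal sketch of one way to do it, assuming equal-frequency (tertile) bins. The toy `wage` vector is invented for illustration; the labels follow the three levels that appear later in the confusion matrix.

```r
# Toy wage values, invented for illustration only (not the ISLR data).
wage <- c(20, 45, 60, 75, 82, 90, 105, 118, 130, 150, 200, 310)

# Assumption: split at the 1/3 and 2/3 quantiles of the distribution.
breaks <- quantile(wage, probs = c(0, 1/3, 2/3, 1))

wage_level <- cut(wage,
                  breaks = breaks,
                  labels = c("low", "medium", "high"),
                  include.lowest = TRUE)   # keep the minimum in the first bin

table(wage_level)
```

Quantile-based breaks keep the three classes roughly balanced; fixed thresholds chosen from a histogram would be an equally valid reading of "on the basis of the distribution".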
To be able to verify the model, I will split the provided dataset into training and test subsets in an 80/20 proportion.
library(caret) # createDataPartition(), train(), confusionMatrix()
library(dplyr) # as_tibble()
# Stratified 80/20 split on the outcome variable
inTrain <- createDataPartition(y = wages$wage_level, p = 0.8, list = FALSE)
training <- wages[inTrain, ]
testing <- wages[-inTrain, ]
training <- as_tibble(training) # tibble; tbl_df() is deprecated in current dplyr
testing <- as_tibble(testing)

After some experimentation, for this assignment I decided to compare three classification models with 10-fold cross-validation and default parameters:
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
fit_dt <- train(wage_level ~ ., data = training, method = "rpart", trControl = control) # decision tree
fit_rf <- train(wage_level ~ ., data = training, method = "rf", trControl = control) # random forest
fit_nb <- train(wage_level ~ ., data = training, method = "nb", trControl = control) # naive Bayes

The accuracy of all three models is not very high. The random forest, with an accuracy of 0.57 and a kappa of 0.34, seems to be the best choice. The low kappa suggests that the classes are unbalanced and that a good part of the agreement could arise by chance, so observations are likely to be misclassified, particularly in the less common categories. Let’s look more closely at the estimates of predictor importance for the random forest:
The predictor importance confirms the first intuitions from the exploratory analysis. The most important variables are age, lack of health insurance and higher education. Finally, let’s test the fit_rf model on our test dataset.
Since the test dataset was not used during model building, it provides an unbiased estimate of the model’s out-of-sample performance. Let’s predict wage levels with the random forest model and check the results with a confusion matrix:
prediction <- predict(fit_rf, testing)
conf <- confusionMatrix(prediction, testing$wage_level)
knitr::kable(head(conf$table))

| | high | low | medium |
|---|---|---|---|
| high | 115 | 23 | 60 |
| low | 9 | 92 | 41 |
| medium | 52 | 65 | 142 |
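
As a check on the table above, accuracy and Cohen’s kappa can be recomputed by hand. This is a minimal sketch in base R that retypes the counts from the table, assuming caret’s usual layout (rows are predictions, columns are reference classes); the variable names are chosen for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = predictions, columns = reference classes).
lv <- c("high", "low", "medium")
cm <- matrix(c(115, 23,  60,
                 9, 92,  41,
                52, 65, 142),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = lv, reference = lv))

n <- sum(cm)                                       # test observations
accuracy <- sum(diag(cm)) / n                      # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Share of each true class that was predicted correctly (sensitivity).
recall <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = cohen_kappa), 2)
round(recall, 2)
```

The test-set accuracy comes out at roughly 0.58, close to the cross-validated estimate, and the per-class recall shows which wage level the model recovers least reliably.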