Objective
Your goal is to build a classifier that can predict the income levels in order to create more effective policy.
The dataset below includes Census data on 32,000+ individuals with a variety of variables and a target variable for above or below 50k in salary.
Your goal is to build a Random Forest Classifier to be able to predict income levels above or below 50k.
Building the Model
Base Rate
The base rate for an income >50k in the dataset is 24.08%. This is the probability that an individual picked at random from the dataset earns more than 50k.
mtry Value
The mtry value is the number of variables randomly sampled as candidates at each split. The default for classification is sqrt(# of variables). This dataset contains 15 variables, 14 once we subtract the target variable. Taking the square root of 14 gives 3.74, so we round up to an mtry of 4.
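As a quick check of that arithmetic, here is a minimal sketch. It is written in Python purely for illustration (the models in this report were fit in R), and the variable names are our own.

```python
import math

n_columns = 15                  # columns in the dataset, including the target
n_predictors = n_columns - 1    # drop the target variable (income)

raw_mtry = math.sqrt(n_predictors)   # square-root rule for classification
mtry = math.ceil(raw_mtry)           # round up, as chosen in this report

print(f"sqrt({n_predictors}) = {raw_mtry:.2f}, mtry = {mtry}")
```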
Random Forest - 500 Trees
Initially, we will generate a random forest made up of 500 trees with an mtry of 4. To ensure that these trees are not all identical and have the opportunity to specialize in different subsets of the data, we set the replace argument to TRUE.
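To illustrate what sampling with replacement does, the sketch below draws row indices for a few trees. It is a Python illustration only, not the randomForest internals. The row count of 29,305 is the total of the confusion matrix in the model output, the sample size of 100 mirrors sampsize in the call, and the seed and helper name are assumptions.

```python
import random

def bootstrap_indices(n_rows, sample_size, rng):
    """Draw row indices with replacement for one tree (hypothetical helper)."""
    return [rng.randrange(n_rows) for _ in range(sample_size)]

rng = random.Random(42)   # fixed seed, chosen arbitrarily for reproducibility
n_rows = 29305            # training rows implied by the confusion matrix
sample_size = 100         # mirrors sampsize = 100 in the randomForest call

tree_samples = [bootstrap_indices(n_rows, sample_size, rng) for _ in range(3)]
for i, sample in enumerate(tree_samples, start=1):
    in_bag = set(sample)               # distinct rows this tree sees
    out_of_bag = n_rows - len(in_bag)  # rows left over for OOB error estimates
    print(f"tree {i}: {len(in_bag)} distinct in-bag rows, {out_of_bag} out-of-bag rows")
```

Each tree receives a different draw, and the rows it never sees are what the out-of-bag error is computed on.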
Model Output
##
## Call:
## randomForest(formula = income ~ ., data = census.train, ntree = 500, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 15.75%
## Confusion matrix:
## 0 1 class.error
## 0 20756 1457 0.06559222
## 1 3159 3933 0.44543147
The error rate is 15.75% which is fairly high.
Initial Confusion Matrix
## 0 1 class.error
## 0 20756 1457 0.06559222
## 1 3159 3933 0.44543147
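From the initial confusion matrix we can calculate an accuracy of 84.25%. The ~24% base rate is not the right yardstick here, though: a model that always predicts <=50k would already be right about 76% of the time, so the forest improves on that baseline by roughly 8 percentage points.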
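The accuracy can be reproduced directly from the four cells of the matrix. The sketch below (Python, for illustration only) also computes the accuracy of always predicting the majority class, which is a useful point of comparison. The variable names are our own.

```python
# OOB confusion matrix of the initial model (rows = actual, columns = predicted)
tn, fp = 20756, 1457   # actual <=50k
fn, tp = 3159, 3933    # actual >50k

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total   # always predict <=50k

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```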
Vote Percentages
We can also review the percentage of trees that voted for each data point to be in each class. The first 500 data points are displayed:
Variable Importance
Error Visualization
Above is a visualization with the x-axis showing the number of trees and the y-axis showing error. Four lines are plotted: the difference in error rates, the Out-of-Bag error, and the error for each of our classes, <=50k and >50k. At first glance, the error rates flatten out to a roughly constant value between about 200 and 300 trees, which suggests that this random forest has more trees than it needs and the number could be reduced.
Additionally, the error rate for the <=50k class is much lower than for the >50k class, a difference of around 0.38 (about 0.07 versus 0.45). The model is therefore much better at predicting <=50k than >50k.
Other patterns show a lot of initial fluctuation between the error rates of the class of individuals with salaries >50k and the difference in error rates, as well as the fact that these lines mimic each other quite closely. The OOB error and class of <=50k have decreasing error rates and flatten out quite early.
Confusion Matrix
Confusion Matrix Stats:
Accuracy = 84.25%
Sensitivity = 55.46%
False Positive Rate = 6.56%
Specificity = 93.44%
The accuracy is reasonable but flatters the model, because most of it comes from the negative class, which the model predicts very well. The sensitivity is only fair: the model finds 3,933 true positives but produces 3,159 false negatives, so roughly 45% of individuals earning >50k are classified in the lower bracket. The 1,457 false positives are a sizable count in absolute terms, but the false positive rate stays low because they are divided by a very large number of actual negatives (22,213).
Both error types matter for policy making. If higher earners are falsely classified in the lower bracket, they could benefit from policies intended for those with lower incomes. Conversely, about 6.6% of lower-income individuals are classified as higher income, which could backfire if policy targets the upper salary levels. Even a false positive rate of “only” 6.6% affects thousands of people when the model is applied to a large population.
For this context, I believe it is more important to reduce the error with the ‘positive class’, which in this case is the salary above 50k. Misclassifying someone in this class could cause large financial repercussions.
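The statistics in this section follow from the confusion matrix as sketched below (Python, for illustration only). The population of 100,000 lower-income individuals is a made-up figure used only to show how the false positive rate scales.

```python
# OOB confusion matrix of the initial model; positive class = income >50k
tn, fp = 20756, 1457
fn, tp = 3159, 3933

sensitivity = tp / (tp + fn)          # share of >50k individuals found
specificity = tn / (tn + fp)          # share of <=50k individuals found
false_positive_rate = fp / (fp + tn)  # equals 1 - specificity

# Hypothetical population size, for illustration only
lower_income_population = 100_000
expected_false_positives = false_positive_rate * lower_income_population

print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"FPR         = {false_positive_rate:.4f}")
print(f"expected false positives per {lower_income_population:,}: "
      f"{expected_false_positives:.0f}")
```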
Error Table
By sorting the error table above, we can find the number of trees with the lowest out-of-bag error rate. The lowest rate, 15.7%, is reached with 486 trees. Therefore, we’ll rerun the model with 486 trees.
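The lookup amounts to finding the position of the minimum in the OOB error column. A minimal Python sketch follows. The error values are invented for illustration and are not the values from our error table.

```python
# Hypothetical OOB error after 1, 2, 3, ... trees (illustrative values only)
oob_error = [0.210, 0.185, 0.171, 0.162, 0.158, 0.157, 0.159, 0.158]

# Position k holds the error of the forest built from the first k + 1 trees,
# not the error of the (k + 1)-th tree on its own.
best_index = min(range(len(oob_error)), key=oob_error.__getitem__)
best_n_trees = best_index + 1

print(f"lowest OOB error {oob_error[best_index]:.3f} at {best_n_trees} trees")
```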
New Model
Output
##
## Call:
## randomForest(formula = income ~ ., data = census.train, ntree = 486, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 486
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 15.7%
## Confusion matrix:
## 0 1 class.error
## 0 20762 1451 0.06532211
## 1 3149 3943 0.44402143
After running the new model, the error rate comes out to 15.7% which is not much better than the initial model. This makes sense as 500 and 486 trees is not much of a difference especially when looking at the error visualization plot in the earlier section.
Confusion Matrix
Initial model confusion matrix:
## 0 1 class.error
## 0 20756 1457 0.06559222
## 1 3159 3933 0.44543147
New model confusion matrix:
## 0 1 class.error
## 0 20762 1451 0.06532211
## 1 3149 3943 0.44402143
As we can see above, the confusion matrices are virtually the same. The new model performs slightly better than the initial one, with 6 fewer false positives (1,451 vs. 1,457) and 10 fewer false negatives (3,149 vs. 3,159). The accuracy of the new model is 84.30% compared to 84.25% for the initial model, which again makes sense as the number of trees is nearly the same in both models.
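A cell-by-cell comparison of the two matrices can be sketched as follows (Python, for illustration only; the dictionary layout and function name are our own).

```python
# OOB confusion matrices; positive class = income >50k
initial = {"tn": 20756, "fp": 1457, "fn": 3159, "tp": 3933}
new = {"tn": 20762, "fp": 1451, "fn": 3149, "tp": 3943}

def accuracy(m):
    """Share of correctly classified rows in a confusion matrix."""
    return (m["tn"] + m["tp"]) / sum(m.values())

fewer_false_positives = initial["fp"] - new["fp"]
fewer_false_negatives = initial["fn"] - new["fn"]

print(f"fewer false positives: {fewer_false_positives}")
print(f"fewer false negatives: {fewer_false_negatives}")
print(f"accuracy: {accuracy(initial):.4f} -> {accuracy(new):.4f}")
```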
Confusion Matrix
Below is a more in depth confusion matrix output for the new model:
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 20773 3140
## 1 1440 3952
##
## Accuracy : 0.8437
## 95% CI : (0.8395, 0.8479)
## No Information Rate : 0.758
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5362
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5572
## Specificity : 0.9352
## Pos Pred Value : 0.7329
## Neg Pred Value : 0.8687
## Precision : 0.7329
## Recall : 0.5572
## F1 : 0.6331
## Prevalence : 0.2420
## Detection Rate : 0.1349
## Detection Prevalence : 0.1840
## Balanced Accuracy : 0.7462
##
## 'Positive' Class : 1
##
As shown above, the accuracy is 84.37%, which is barely better than the initial model. The sensitivity is 55.72%, which is poor, while the specificity is 93.52%. This once again reflects that the model is excellent at classifying the negative case (<=50k) and poor at predicting the positive case (>50k). The F1 score of 0.63 supports this.
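For reference, the precision, recall, F1, and balanced accuracy reported in the output can be recomputed from the four cells of this matrix. The sketch below uses Python for illustration only, with variable names of our own choosing.

```python
# Cells of the detailed confusion matrix; positive class = 1 (income >50k)
tp, fp = 3952, 1440    # predicted 1
fn, tn = 3140, 20773   # predicted 0

precision = tp / (tp + fp)   # Pos Pred Value
recall = tp / (tp + fn)      # Sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"precision         = {precision:.4f}")
print(f"recall            = {recall:.4f}")
print(f"F1                = {f1:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

F1 sits well below the accuracy because it ignores true negatives, which is where this model does most of its correct work.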
Variable Importance Plot
Above is a variable importance plot for the new model, showing both the mean decrease in accuracy and the mean decrease in Gini. Both rank occupation as the most important variable, which makes sense as most people acquire their income from the job they have. Education also ranks high in both, which makes sense as most high-paying jobs require higher education. Relationship, marital status, and age are other important variables.
TuneRF Model
By using the tuneRF function we can assess how many variables we should be using in our random forest model.
## mtry OOBError
## 3.OOB 3 0.1347552
## 5.OOB 5 0.1406245
## 10.OOB 10 0.1448558
Based on the function output, an mtry of 3 gives the lowest out-of-bag error rate (13.48%).
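Reading the best mtry off the tuneRF table is a simple minimum over the OOB error column. A minimal Python sketch using the three rows printed above follows (the dictionary is our own representation of that table).

```python
# mtry -> OOB error, copied from the tuneRF output above
tune_results = {3: 0.1347552, 5: 0.1406245, 10: 0.1448558}

best_mtry = min(tune_results, key=tune_results.get)
best_error_pct = tune_results[best_mtry] * 100

print(f"best mtry = {best_mtry}, OOB error = {best_error_pct:.2f}%")
```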
New Random Forest with mtry = 3 and 486 trees
##
## Call:
## randomForest(formula = income ~ ., data = census.train, ntree = 486, mtry = 3, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 486
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.65%
## Confusion matrix:
## 0 1 class.error
## 0 20848 1365 0.0614505
## 1 3222 3870 0.4543147
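The error rate drops to 15.65%, which is barely better than the 15.7% of the second model. The accuracy comes out to 84.35%, only about 0.05 percentage points better. The gain comes entirely from the negative class: the error for the >50k class actually rises from 44.40% to 45.43%.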
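Looking at the per-class errors of the second and third models side by side shows where the small overall change comes from. This is a Python sketch for illustration only, with helper and variable names of our own.

```python
# OOB confusion matrices; positive class = income >50k
second = {"tn": 20762, "fp": 1451, "fn": 3149, "tp": 3943}  # mtry = 4
third = {"tn": 20848, "fp": 1365, "fn": 3222, "tp": 3870}   # mtry = 3

def class_errors(m):
    """Return (error for <=50k, error for >50k, overall OOB error)."""
    neg_error = m["fp"] / (m["tn"] + m["fp"])
    pos_error = m["fn"] / (m["fn"] + m["tp"])
    overall = (m["fp"] + m["fn"]) / sum(m.values())
    return neg_error, pos_error, overall

for name, m in (("mtry = 4", second), ("mtry = 3", third)):
    neg_error, pos_error, overall = class_errors(m)
    print(f"{name}: <=50k error {neg_error:.4f}, "
          f">50k error {pos_error:.4f}, overall {overall:.4f}")
```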
Visualization for Tree Sizes
We can also create a histogram to visualize how large the various trees are in our model:
Based on the histogram, most trees tend to have a size of 15.
ROC and AUC
We can also create an ROC curve for our final model to visualize the TPR vs FPR and calculate an AUC (area under curve) metric:
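The AUC came out to 0.89, which is strong. Since the AUC summarizes the TPR/FPR trade-off across all classification thresholds, it indicates that the model ranks individuals well even though sensitivity at the default cutoff is poor.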
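One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The Python sketch below computes it that way. The labels and scores are invented for illustration and are not taken from our model.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = income >50k) and predicted probabilities
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.50, 0.30]

print(f"AUC = {auc(labels, scores):.4f}")
```

Because every pair is compared regardless of cutoff, this value does not depend on the threshold used to turn scores into class labels.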
Summary
After building 3 random forest models, the last one (486 trees and an mtry of 3) performed fairly well, with high accuracy and a fairly low false positive rate. However, I would recommend the government evaluate other classification methods that are stronger at classifying the positive class (>50k), or try to optimize another forest that is better at this. The current model is very good at classifying the negative class (<=50k), but much less so with the positive one. When developing policies, the most important variables to consider when determining an income bracket would be occupation, relationship, and education.
I would also be wary of the error rates of this model, particularly falsely classifying someone with a higher salary as having a lower one. Policies may not work as effectively as desired if they target lower salaries but end up being applied to people who actually have a larger income. Those individuals could benefit unfairly, or be unhappy with the policies.
Additionally, the dataset is imbalanced, with about 76% of individuals in the lower income bracket, which may explain this model’s strong ability to classify this class but not the other.