Divide our training set into a train and a validation set
From our training set, we will split off approximately 80% for the actual training set and 20% for our validation set.
train = sample(c(TRUE, FALSE), size = nrow(train_data), replace = TRUE, prob = c(0.8, 0.2)) # Randomly flag ~80% of the rows as training
small_train_data = train_data[train,] # ~80% of the original training set
validation_data = train_data[!train,] # Remaining ~20% kept for validation
Rebalance the training set
As we have seen at the beginning of our report, our original training set is highly unbalanced, with far more low salary observations than high salary ones:
- proportion of high salary: 6.21%
- proportion of low salary: 93.79%
To prevent our model from exploiting this imbalance and simply classifying every observation as low salary (doing so would already yield an accuracy of around 94% purely because of the class proportions, while missing every high salary person), we need to rebalance the training data.
The code below modifies our small train dataset (80% of our original training set) to tackle this issue.
We will keep every high salary observation and sample exactly the same number of low salary observations.
set.seed(1)
temp_high = filter(small_train_data, salary_classification == "+50,000") # Take only the high salary observations
temp_low = filter(small_train_data, salary_classification != "+50,000") # Take only the low salary observations
temp_low = sample_n(temp_low, nrow(temp_high), replace = FALSE) # Sample, without replacement, as many low salary observations as there are high salary observations
balanced_small_train_data = rbind(temp_high, temp_low) # Create the balanced dataset
kable(data.frame(table(balanced_small_train_data$salary_classification)), col.names = c("salary_classification", "number_of_observations"))
| salary_classification | number_of_observations |
|-----------------------|------------------------|
| -50,000               | 9893                   |
| +50,000               | 9893                   |
Choose the right cutoff for the Random Forest model
Train the model
First, let’s try to fit a random forest to our balanced small training set.
rf_fit = randomForest(salary_classification ~ ., data = balanced_small_train_data, importance = TRUE) # We set importance to be TRUE so that we can find the importance of each feature in our classification
print(rf_fit)
##
## Call:
## randomForest(formula = salary_classification ~ ., data = balanced_small_train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 12.87%
## Confusion matrix:
## -50,000 +50,000 class.error
## -50,000 8245 1648 0.16658243
## +50,000 898 8995 0.09077125
Tune the right cutoff using our validation set
Our random forest will output the probability of being in the higher salary classification.
Instead of assigning an observation to “+50,000” whenever this probability is greater than 0.5, let’s try several cutoff values and keep the one that leads to the best score.
For that, we first need to create a function that will give us a score for each model.
There are several options, but we will use the MCC (Matthews Correlation Coefficient), a measure of the quality of a binary classification: it takes into account true and false positives and negatives, and it is generally regarded as a balanced measure that can be used even when the classes are of very different sizes.
Our function will also output the Precision, Recall and F1-score.
In our problem, the True Positive is an observation that is classified as “+50,000” when the real classification is indeed “+50,000”.
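For reference, the MCC is computed from the four confusion matrix counts as follows (this is exactly what the function below implements):
MCC = (TP × TN - FP × FN) / sqrt((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))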
calculate_scores = function(confusion_matrix) {
  # The confusion matrix has the predicted classes as rows and the real classes as columns, with "-50,000" first and "+50,000" second
  TP = confusion_matrix[2,2]
  TN = confusion_matrix[1,1]
  FP = confusion_matrix[2,1]
  FN = confusion_matrix[1,2]
  precision = TP / (TP + FP)
  recall = TP / (TP + FN)
  f1_score = 2 * (precision * recall) / (precision + recall)
  MCC = (TP * TN - FP * FN) / (sqrt(TP + FP) * sqrt(TP + FN) * sqrt(TN + FP) * sqrt(TN + FN)) # Take the square root of each factor separately, otherwise the product is too big and we get the error "NAs produced by integer overflow"
  return(list("precision" = precision
              , "recall" = recall
              , "f1_score" = f1_score
              , "MCC" = MCC))
}
Let’s try now several cutoff parameters and output the MCC for each of them.
We will take the cutoff that leads to the highest MCC.
# Let's first predict the probabilities of being in class "+50,000" in our validation set
validation_prob = predict(rf_fit, validation_data, type = "prob")[,2]
# Let's create a vector of cutoff values from 0.1 to 1 in 0.01 increments
validation_cutoff_result = data.frame("cutoff_value" = seq(0.1, 1, by = 0.01))
# Let's test our model for each of those cutoffs
cutoff_MCC = c()
for (i in 1:nrow(validation_cutoff_result)) {
  temp_pred_class = ifelse(validation_prob >= validation_cutoff_result$cutoff_value[i], "+50,000", "-50,000")
  temp_table = table(temp_pred_class, validation_data$salary_classification)
  cutoff_MCC = c(cutoff_MCC, calculate_scores(temp_table)[["MCC"]])
}
validation_cutoff_result$MCC_result = cutoff_MCC
# Let's find the cutoff value that gives the highest MCC score
best_cutoff = validation_cutoff_result$cutoff_value[which.max(validation_cutoff_result$MCC_result)]
Here is a plot of the MCC score for each cutoff value, with a vertical line marking the optimal cutoff that maximizes the MCC:
best_MCC = validation_cutoff_result$MCC_result[validation_cutoff_result$cutoff_value == best_cutoff]
ggplot(data = validation_cutoff_result, aes(y = MCC_result, x = cutoff_value)) +
  geom_point(aes(colour = MCC_result)) +
  scale_colour_gradient(low = "blue") +
  geom_vline(xintercept = best_cutoff, colour = "blue") +
  geom_text(aes(best_cutoff, 0, label = best_cutoff), vjust = -1, size = 5) +
  geom_text(aes(best_cutoff, best_MCC, label = round(best_MCC, 4)), size = 5)

We see that the best cutoff is 0.83, which gives an MCC of 0.5335164.
We will now test our model on the test set using this cutoff.
Test our model on the test set
We will test our model on the test set using the cutoff found in the previous section, which is 0.83.
If the predicted probability of being in the higher salary classification is greater than or equal to this cutoff, we will classify the observation in the “+50,000” bucket.
# Calculate the predicted probabilities of being in the class "+50,000"
test_predict_prob = predict(rf_fit, test_data, type = "prob")[,2]
# Assign the observation to the class "+50,000" if the calculated predicted probability is higher than the best cutoff found in the earlier section
test_predict_class = ifelse(test_predict_prob >= best_cutoff, "+50,000", "-50,000")
# Calculate the confusion matrix for our test set
test_confusion_matrix = table(test_predict_class, test_data$salary_classification)
# Calculate the different scores of our predicted model
test_rf_results = calculate_scores(test_confusion_matrix)
# Calculate the accuracy which is the number of well classified observations out of all the observations
rf_accuracy = sum(test_predict_class == test_data$salary_classification) / length(test_predict_class)
We have the following confusion matrix for our model (the rows being the predicted classes and the columns being the real ones):
| predicted / actual | -50,000 | +50,000 |
|--------------------|---------|---------|
| -50,000            | 89223   | 2097    |
| +50,000            | 4353    | 4089    |
Our model has those scores (a True Positive being a “+50,000” observation being correctly classified as “+50,000”):
- Precision: 48.44%
- Recall: 66.1%
- F1-score: 0.5590648
- MCC: 0.5324719
- Accuracy: 93.53%
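As a quick sanity check, those scores can be recomputed by hand from the confusion matrix above (the snippet below simply hard-codes the printed counts):
# Recompute the test scores directly from the printed confusion matrix counts
TP = 4089; FN = 2097; FP = 4353; TN = 89223
TP / (TP + FP) # precision for "+50,000": ~0.4844
TP / (TP + FN) # recall for "+50,000": ~0.6610
TN / (TN + FN) # precision for "-50,000": ~0.9770
TN / (TN + FP) # recall for "-50,000": ~0.9535
(TP + TN) / (TP + TN + FP + FN) # overall accuracy: ~0.9353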
Model conclusions
- Accuracy
- We see that our accuracy is 93.53%, which is pretty good
- The test set contains 93.8% low salary observations, so a model that assigns every observation to the low salary classification would reach an accuracy of 93.8%, slightly better than our random forest; such a model, however, would miss all the high salary people
- MCC:
- Our MCC is 0.5324719, which is pretty good, as this score always lies between -1 and 1 (0 meaning picking at random, 1 a perfect classification and -1 a perfect misclassification)
- Precision and Recall for being in the “+50,000” classification:
- Our precision is 48.44% and our recall is 66.1%, which are not particularly high values
- This seems to be due to our strongly unbalanced test set: since we trained our model on an artificially balanced dataset, it tends to put more observations in the high salary classification, which lowers the precision. Lowering the cutoff would increase the recall, but the precision would drop drastically. The analysis in the previous section allowed us to find the cutoff that maximizes the MCC, which takes into account the precision and recall of both classifications
- Precision and Recall for being in the “-50,000” classification:
- Our precision is 97.7%
- Our recall is 95.35%
- We can clearly see that the recall and precision for this classification are very high, which can also be attributed to the unbalanced structure of our dataset
- We can now graph the importance of each feature used in our Random Forest model:
varImpPlot(rf_fit, type = 1, main = "Features Importance of our Optimal Random Forest")

The mean decrease in accuracy caused by a variable is determined during the out-of-bag error calculation phase: the more the accuracy of the random forest decreases when a single variable is excluded, the more important that variable is.
In other words, variables with a large mean decrease in accuracy are more important for classifying our observations.
Here we can see that sex, education, external financial balance and age are the most important ones in our classification process, followed closely by the major occupation and the weeks worked in the year.
Those features make complete sense, as we would expect them to carry a lot of information about the salary bucket of a given observation.
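If we want the underlying numbers rather than only the plot, the per-feature mean decrease in accuracy can also be extracted directly from the fitted model (a minimal sketch reusing the rf_fit object from above):
# Extract the mean decrease in accuracy (type = 1) of each feature
feature_importance = importance(rf_fit, type = 1)
# Order the features from the most to the least important
feature_importance[order(feature_importance[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]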