The Xgboost package in R is a powerful library that can be used to solve a variety of different issues.
One of great importance among these is the class-imbalance problem, whereby the levels in a categorical target variable are unevenly distributed. This distribution can effect the results of a machine learning prediction. Generally, the predictions on the majority class are good whereas those of the minority class may be woefully inaccurate.
For example, the dataset we will be using in this article is employment data from 2002 to 2012. The outcome we are trying to predict is “employmentstatus” which has 3 levels: employed, unemployed, and not in labor force. In the training data, these outcomes are distributed as follows:
Employment Status | Count |
---|---|
Employed | 41098 |
Not in labor force | 19630 |
Unemployed | 3278 |
From this we can see that the number of individuals that are unemployed vs. not in the labor force is clearly imbalanced. This affects our predictions: for example, the Random Forest that was used to make predictions on this data had a 20% error rate for predicting unemployed whereas the predictions for the other two classes were both more than 95% accurate.
In order to use the XGBoost, we must first manipulate the input data into a format that can be used by the package.
There is one key point that we must remember when using XGboost-
Anyway, lets get started:
# Create a training and validation sets
trainObs <- sample(nrow(train), .8 * nrow(train), replace = FALSE)
valObs <- sample(nrow(train), .2 * nrow(train), replace = FALSE)
train_dat <- train[trainObs,]
val_dat <- train[valObs,]
# Create numeric labels with one-hot encoding
train_labs <- as.numeric(train_dat$employmentstatus) - 1
val_labs <- as.numeric(val_dat$employmentstatus) - 1
new_train <- model.matrix(~ . + 0, data = train_dat[, -5])
new_val <- model.matrix(~ . + 0, data = train[valObs, -5])
# Prepare matrices
xgb_train <- xgb.DMatrix(data = new_train, label = train_labs)
xgb_val <- xgb.DMatrix(data = new_val, label = val_labs)
Now, it is time to outline the parameters of the model. For this, we must keep in mind that our objective is a multi-class classification. This will be clarified in the objective
parameter. We will also have to consequently set a num_class
parameter as well, and an evaluation metric- which is defined as “mlogloss” for multiclass problems.
# Set parameters(default)
params <- list(booster = "gbtree", objective = "multi:softprob", num_class = 3, eval_metric = "mlogloss")
# Calculate # of folds for cross-validation
xgbcv <- xgb.cv(params = params, data = xgb_train, nrounds = 100, nfold = 5, showsd = TRUE, stratified = TRUE, print.every.n = 10, early_stop_round = 20, maximize = FALSE, prediction = TRUE)
## [1] train-mlogloss:0.751527+0.000486 test-mlogloss:0.752013+0.001056
## [11] train-mlogloss:0.162974+0.001208 test-mlogloss:0.169053+0.004298
## [21] train-mlogloss:0.132703+0.000719 test-mlogloss:0.148054+0.004466
## [31] train-mlogloss:0.123582+0.001039 test-mlogloss:0.147539+0.004299
## [41] train-mlogloss:0.114668+0.001869 test-mlogloss:0.148498+0.004435
## [51] train-mlogloss:0.106774+0.002176 test-mlogloss:0.149133+0.004574
## [61] train-mlogloss:0.099195+0.002091 test-mlogloss:0.150232+0.004556
## [71] train-mlogloss:0.092473+0.001362 test-mlogloss:0.151361+0.004664
## [81] train-mlogloss:0.086219+0.001263 test-mlogloss:0.152508+0.005045
## [91] train-mlogloss:0.080046+0.001225 test-mlogloss:0.153651+0.005307
## [100] train-mlogloss:0.075534+0.001644 test-mlogloss:0.154566+0.005404
The multi:softprob
objective parameter essentially gives us a fuzzy clustering in which each observation is given a distinct probability of belonging to each class. In order to use these probabilities for classification, we will have to determine the max probability for each observation and assign a class.
# Function to compute classification error
classification_error <- function(conf_mat) {
conf_mat = as.matrix(conf_mat)
error = 1 - sum(diag(conf_mat)) / sum(conf_mat)
return (error)
}
# Mutate xgb output to deliver hard predictions
xgb_train_preds <- data.frame(xgbcv$pred) %>% mutate(max = max.col(., ties.method = "last"), label = train_labs + 1)
# Examine output
head(xgb_train_preds)
## X1 X2 X3 max label
## 1 0.9999969006 0.000001076692 0.0000019829622 1 1
## 2 0.0001608563 0.980735719204 0.0191034991294 2 2
## 3 0.0237003975 0.935675859451 0.0406237915158 2 2
## 4 0.0975659713 0.775012969971 0.1274210512638 2 2
## 5 0.9999982119 0.000001305091 0.0000004595854 1 1
## 6 0.0673223436 0.273968368769 0.6587093472481 3 1
# Confustion Matrix
xgb_conf_mat <- table(true = train_labs + 1, pred = xgb_train_preds$max)
# Error
cat("XGB Training Classification Error Rate:", classification_error(xgb_conf_mat), "\n")
## XGB Training Classification Error Rate: 0.05489805
# Automated confusion matrix using "caret"
xgb_conf_mat_2 <- confusionMatrix(factor(xgb_train_preds$label),
factor(xgb_train_preds$max),
mode = "everything")
print(xgb_conf_mat_2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 32383 432 26
## 2 18 15377 340
## 3 3 1992 633
##
## Overall Statistics
##
## Accuracy : 0.9451
## 95% CI : (0.9431, 0.9471)
## No Information Rate : 0.6328
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8871
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.9994 0.8638 0.63363
## Specificity 0.9756 0.9893 0.96026
## Pos Pred Value 0.9861 0.9772 0.24087
## Neg Pred Value 0.9989 0.9317 0.99247
## Precision 0.9861 0.9772 0.24087
## Recall 0.9994 0.8638 0.63363
## F1 0.9927 0.9170 0.34905
## Prevalence 0.6328 0.3476 0.01951
## Detection Rate 0.6324 0.3003 0.01236
## Detection Prevalence 0.6414 0.3073 0.05132
## Balanced Accuracy 0.9875 0.9266 0.79695
The XGB model achieved a 6% classification error rate on the training set, which is roughly the same as that of the RF model utlized elsewhere. Now lets see how it does on the validation set.
# Create the model
xgb_model <- xgb.train(params = params, data = xgb_train, nrounds = 100)
# Predict for validation set
xgb_val_preds <- predict(xgb_model, newdata = xgb_val)
xgb_val_out <- matrix(xgb_val_preds, nrow = 3, ncol = length(xgb_val_preds) / 3) %>%
t() %>%
data.frame() %>%
mutate(max = max.col(., ties.method = "last"), label = val_labs + 1)
# Confustion Matrix
xgb_val_conf <- table(true = val_labs + 1, pred = xgb_val_out$max)
cat("XGB Validation Classification Error Rate:", classification_error(xgb_val_conf), "\n")
## XGB Validation Classification Error Rate: 0.03648152
# Automated confusion matrix using "caret"
xgb_val_conf2 <- confusionMatrix(factor(xgb_val_out$label),
factor(xgb_val_out$max),
mode = "everything")
print(xgb_val_conf2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 8123 76 4
## 2 0 3892 31
## 3 0 356 319
##
## Overall Statistics
##
## Accuracy : 0.9635
## 95% CI : (0.9601, 0.9667)
## No Information Rate : 0.6346
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9253
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.9001 0.90113
## Specificity 0.9829 0.9963 0.97140
## Pos Pred Value 0.9902 0.9921 0.47259
## Neg Pred Value 1.0000 0.9513 0.99711
## Precision 0.9902 0.9921 0.47259
## Recall 1.0000 0.9001 0.90113
## F1 0.9951 0.9439 0.62002
## Prevalence 0.6346 0.3378 0.02765
## Detection Rate 0.6346 0.3040 0.02492
## Detection Prevalence 0.6408 0.3065 0.05273
## Balanced Accuracy 0.9914 0.9482 0.93626
The primary purpose of this post was to get some experience using the XGBoost package.
I was hoping that the package would prove to be more accurate than the previously employed random forest model for the employment predictions- however, it seems like the two algorithms are very comparable and almost identical in terms of performance. However, one thing is for sure: while the XGBoost model may not necessarily be more accurate than the random forest- it is definitely a whole lot faster.