Introduction:

The Xgboost package in R is a powerful library that can be used to solve a variety of different issues.

One of great importance among these is the class-imbalance problem, whereby the levels in a categorical target variable are unevenly distributed. This distribution can effect the results of a machine learning prediction. Generally, the predictions on the majority class are good whereas those of the minority class may be woefully inaccurate.

For example, the dataset we will be using in this article is employment data from 2002 to 2012. The outcome we are trying to predict is “employmentstatus” which has 3 levels: employed, unemployed, and not in labor force. In the training data, these outcomes are distributed as follows:

Employment Status	Count
Employed	41098
Not in labor force	19630
Unemployed	3278

From this we can see that the number of individuals that are unemployed vs. not in the labor force is clearly imbalanced. This affects our predictions: for example, the Random Forest that was used to make predictions on this data had a 20% error rate for predicting unemployed whereas the predictions for the other two classes were both more than 95% accurate.

Preparing the Data:

Preprocessing-

In order to use the XGBoost, we must first manipulate the input data into a format that can be used by the package.

There is one key point that we must remember when using XGboost-

Categorical variables must be converted to numeric.

Anyway, lets get started:

# Create a training and validation sets
trainObs <- sample(nrow(train), .8 * nrow(train), replace = FALSE)
valObs <- sample(nrow(train), .2 * nrow(train), replace = FALSE)

train_dat <- train[trainObs,]
val_dat <- train[valObs,]

# Create numeric labels with one-hot encoding
train_labs <- as.numeric(train_dat$employmentstatus) - 1
val_labs <- as.numeric(val_dat$employmentstatus) - 1

new_train <- model.matrix(~ . + 0, data = train_dat[, -5])
new_val <- model.matrix(~ . + 0, data = train[valObs, -5])

# Prepare matrices
xgb_train <- xgb.DMatrix(data = new_train, label = train_labs)
xgb_val <- xgb.DMatrix(data = new_val, label = val_labs)

Now, it is time to outline the parameters of the model. For this, we must keep in mind that our objective is a multi-class classification. This will be clarified in the objective parameter. We will also have to consequently set a num_class parameter as well, and an evaluation metric- which is defined as “mlogloss” for multiclass problems.

# Set parameters(default)
params <- list(booster = "gbtree", objective = "multi:softprob", num_class = 3, eval_metric = "mlogloss")

# Calculate # of folds for cross-validation
xgbcv <- xgb.cv(params = params, data = xgb_train, nrounds = 100, nfold = 5, showsd = TRUE, stratified = TRUE, print.every.n = 10, early_stop_round = 20, maximize = FALSE, prediction = TRUE)

## [1]  train-mlogloss:0.751527+0.000486    test-mlogloss:0.752013+0.001056 
## [11] train-mlogloss:0.162974+0.001208    test-mlogloss:0.169053+0.004298 
## [21] train-mlogloss:0.132703+0.000719    test-mlogloss:0.148054+0.004466 
## [31] train-mlogloss:0.123582+0.001039    test-mlogloss:0.147539+0.004299 
## [41] train-mlogloss:0.114668+0.001869    test-mlogloss:0.148498+0.004435 
## [51] train-mlogloss:0.106774+0.002176    test-mlogloss:0.149133+0.004574 
## [61] train-mlogloss:0.099195+0.002091    test-mlogloss:0.150232+0.004556 
## [71] train-mlogloss:0.092473+0.001362    test-mlogloss:0.151361+0.004664 
## [81] train-mlogloss:0.086219+0.001263    test-mlogloss:0.152508+0.005045 
## [91] train-mlogloss:0.080046+0.001225    test-mlogloss:0.153651+0.005307 
## [100]    train-mlogloss:0.075534+0.001644    test-mlogloss:0.154566+0.005404

The multi:softprob objective parameter essentially gives us a fuzzy clustering in which each observation is given a distinct probability of belonging to each class. In order to use these probabilities for classification, we will have to determine the max probability for each observation and assign a class.

# Function to compute classification error
classification_error <- function(conf_mat) {
  conf_mat = as.matrix(conf_mat)
  
  error = 1 - sum(diag(conf_mat)) / sum(conf_mat)
  
  return (error)
}

# Mutate xgb output to deliver hard predictions
xgb_train_preds <- data.frame(xgbcv$pred) %>% mutate(max = max.col(., ties.method = "last"), label = train_labs + 1)

# Examine output
head(xgb_train_preds)

##             X1             X2              X3 max label
## 1 0.9999969006 0.000001076692 0.0000019829622   1     1
## 2 0.0001608563 0.980735719204 0.0191034991294   2     2
## 3 0.0237003975 0.935675859451 0.0406237915158   2     2
## 4 0.0975659713 0.775012969971 0.1274210512638   2     2
## 5 0.9999982119 0.000001305091 0.0000004595854   1     1
## 6 0.0673223436 0.273968368769 0.6587093472481   3     1

# Confustion Matrix
xgb_conf_mat <- table(true = train_labs + 1, pred = xgb_train_preds$max)

# Error 
cat("XGB Training Classification Error Rate:", classification_error(xgb_conf_mat), "\n")

## XGB Training Classification Error Rate: 0.05489805

# Automated confusion matrix using "caret"
xgb_conf_mat_2 <- confusionMatrix(factor(xgb_train_preds$label),
                                  factor(xgb_train_preds$max),
                                  mode = "everything")

print(xgb_conf_mat_2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     2     3
##          1 32383   432    26
##          2    18 15377   340
##          3     3  1992   633
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9451          
##                  95% CI : (0.9431, 0.9471)
##     No Information Rate : 0.6328          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8871          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9994   0.8638  0.63363
## Specificity            0.9756   0.9893  0.96026
## Pos Pred Value         0.9861   0.9772  0.24087
## Neg Pred Value         0.9989   0.9317  0.99247
## Precision              0.9861   0.9772  0.24087
## Recall                 0.9994   0.8638  0.63363
## F1                     0.9927   0.9170  0.34905
## Prevalence             0.6328   0.3476  0.01951
## Detection Rate         0.6324   0.3003  0.01236
## Detection Prevalence   0.6414   0.3073  0.05132
## Balanced Accuracy      0.9875   0.9266  0.79695

The XGB model achieved a 6% classification error rate on the training set, which is roughly the same as that of the RF model utlized elsewhere. Now lets see how it does on the validation set.

# Create the model
xgb_model <- xgb.train(params = params, data = xgb_train, nrounds = 100)

# Predict for validation set
xgb_val_preds <- predict(xgb_model, newdata = xgb_val)

xgb_val_out <- matrix(xgb_val_preds, nrow = 3, ncol = length(xgb_val_preds) / 3) %>% 
               t() %>%
               data.frame() %>%
               mutate(max = max.col(., ties.method = "last"), label = val_labs + 1) 

# Confustion Matrix
xgb_val_conf <- table(true = val_labs + 1, pred = xgb_val_out$max)

cat("XGB Validation Classification Error Rate:", classification_error(xgb_val_conf), "\n")

## XGB Validation Classification Error Rate: 0.03648152

# Automated confusion matrix using "caret"
xgb_val_conf2 <- confusionMatrix(factor(xgb_val_out$label),
                                 factor(xgb_val_out$max),
                                 mode = "everything")

print(xgb_val_conf2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3
##          1 8123   76    4
##          2    0 3892   31
##          3    0  356  319
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9635          
##                  95% CI : (0.9601, 0.9667)
##     No Information Rate : 0.6346          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9253          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   0.9001  0.90113
## Specificity            0.9829   0.9963  0.97140
## Pos Pred Value         0.9902   0.9921  0.47259
## Neg Pred Value         1.0000   0.9513  0.99711
## Precision              0.9902   0.9921  0.47259
## Recall                 1.0000   0.9001  0.90113
## F1                     0.9951   0.9439  0.62002
## Prevalence             0.6346   0.3378  0.02765
## Detection Rate         0.6346   0.3040  0.02492
## Detection Prevalence   0.6408   0.3065  0.05273
## Balanced Accuracy      0.9914   0.9482  0.93626

Conclusion:

The primary purpose of this post was to get some experience using the XGBoost package.

I was hoping that the package would prove to be more accurate than the previously employed random forest model for the employment predictions- however, it seems like the two algorithms are very comparable and almost identical in terms of performance. However, one thing is for sure: while the XGBoost model may not necessarily be more accurate than the random forest- it is definitely a whole lot faster.

Multi-Class Classification Using XGBOOST

ZXS107020

3/10/2018

Introduction:

Preparing the Data:

Preprocessing-

Conclusion: