This notebook shows basic methods for multiclass classification with XGBoost in R: preparing the data, splitting it into train and test sets, estimating error with k-fold cross-validation, fitting a full model and checking it against the hold-out test set, and assessing variable importance.

This notebook was developed in response to a query from a fellow archaeologist, so I am using an archaeological dataset for this analysis. Unfortunately, I did not have a multiclass ICPS elemental dataset on hand, so I simulated and bound a third class to the RBGlass1 dataset from the archdata package. The code is a bit verbose and inefficient because I wanted it to be readable, so feel free to streamline it in real use. If there are any errors or omissions, please let me know at mr.ecos (@t) gmail (.dot) com.

Required Packages

library("xgboost")  # the main algorithm
library("archdata") # for the sample dataset
library("caret")    # for the confusionmatrix() function (also needs e1071 package)
library("dplyr")    # for some data preperation
library("Ckmeans.1d.dp") # for xgb.ggplot.importance

Data Preparation

Here is where we load the RBGlass1 data, convert the Site factor to zero-based numeric labels, and simulate a third class to bind onto the dataset.

The XGBoost algorithm requires that the class labels (Site names) start at 0 and increase sequentially to one less than the number of classes. This is a bit of an inconvenience, as you need to keep track of which Site name goes with which label. You also need to be very careful when you add or subtract 1 to move between the zero-based labels and the one-based labels.
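
For example, a small lookup table makes it easy to translate between the original Site names and the zero-based labels XGBoost expects. This is a minimal sketch, not part of the original notebook; the names site_levels and label_lookup are hypothetical, and it assumes RBGlass1 has already been loaded with data(RBGlass1) and that Site is stored as a factor:

# hypothetical helper: map original Site factor levels to zero-based labels
site_levels  <- levels(RBGlass1$Site)                           # original Site names
label_lookup <- data.frame(label = seq_along(site_levels) - 1,  # zero-based labels
                           Site  = site_levels)
label_lookup  # use this to translate predictions back to Site names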

The simulated class is created by taking the data from Site == 1, adding an offset of one tenth of each value (.x / 10) to random normal noise rnorm(1, .x, .x * 0.1) that is centered on the original value with a spread proportional to it. You can pretty much ignore this.

# set random seed
set.seed(717)
data(RBGlass1)
dat <- RBGlass1 
dat$Site <- as.numeric(dat$Site)
dat_add <- dat[which(dat$Site == 1), ] %>%
  rowwise() %>%
  # offset of one tenth of the value plus noise centered on the value
  mutate(across(everything(), ~ .x / 10 + rnorm(1, .x, .x * 0.1))) %>%
  mutate(across(everything(), ~ round(.x, 2))) %>%
  ungroup() %>%
  mutate(Site = 3)
dat <- rbind(dat, dat_add) %>%
  mutate(Site = Site - 1)

summary(dat)
##       Site         Al              Fe               Mg        
##  Min.   :0   Min.   :2.060   Min.   :0.3200   Min.   :0.3900  
##  1st Qu.:0   1st Qu.:2.330   1st Qu.:0.4900   1st Qu.:0.5300  
##  Median :1   Median :2.450   Median :0.6550   Median :0.5500  
##  Mean   :1   Mean   :2.479   Mean   :0.6626   Mean   :0.5594  
##  3rd Qu.:2   3rd Qu.:2.592   3rd Qu.:0.8200   3rd Qu.:0.5800  
##  Max.   :2   Max.   :3.120   Max.   :1.3600   Max.   :0.7200  
##        Ca               Na              K                Ti        
##  Min.   : 4.890   Min.   :13.66   Min.   :0.4800   Min.   :0.0600  
##  1st Qu.: 6.360   1st Qu.:17.36   1st Qu.:0.6700   1st Qu.:0.0800  
##  Median : 6.825   Median :18.34   Median :0.7200   Median :0.1000  
##  Mean   : 6.998   Mean   :18.57   Mean   :0.7349   Mean   :0.0978  
##  3rd Qu.: 7.503   3rd Qu.:19.68   3rd Qu.:0.7725   3rd Qu.:0.1100  
##  Max.   :10.620   Max.   :24.58   Max.   :1.5400   Max.   :0.1600  
##        P                Mn               Sb               Pb         
##  Min.   :0.0900   Min.   :0.0700   Min.   :0.0000   Min.   :0.01000  
##  1st Qu.:0.1100   1st Qu.:0.2500   1st Qu.:0.0850   1st Qu.:0.02000  
##  Median :0.1300   Median :0.2800   Median :0.2700   Median :0.03000  
##  Mean   :0.1282   Mean   :0.3244   Mean   :0.2225   Mean   :0.03073  
##  3rd Qu.:0.1400   3rd Qu.:0.3825   3rd Qu.:0.3200   3rd Qu.:0.04000  
##  Max.   :0.2200   Max.   :0.9000   Max.   :0.5300   Max.   :0.09000

Train and Test Split

I am using a basic 75% train vs. 25% test holdout split here. The test set will not be used in model fitting in this example, since we get a cross-validation error estimate from the training data. Instead, the test set is used as a hold-out validation sample for the final model fit to all the training data. Change the 0.75 in the line below to change the proportion of the train and test split.

The XGBoost algorithm requires the data to be passed as a matrix. Here I use the xgb.DMatrix() function to make a dataset of class xgb.DMatrix, which is native to XGBoost. The advantage over a basic matrix is that the label is stored inside the DMatrix object alongside the predictor variables, so I do not need to pass separate label objects for the train and test sets to the training functions (a quick check of this follows the code block below).

# Make split index
train_index <- sample(1:nrow(dat), nrow(dat)*0.75)
# Full data set
data_variables <- as.matrix(dat[,-1])
data_label <- dat[,"Site"]
data_matrix <- xgb.DMatrix(data = data_variables, label = data_label)
# split train data and make xgb.DMatrix
train_data   <- data_variables[train_index,]
train_label  <- data_label[train_index]
train_matrix <- xgb.DMatrix(data = train_data, label = train_label)
# split test data and make xgb.DMatrix
test_data  <- data_variables[-train_index,]
test_label <- data_label[-train_index]
test_matrix <- xgb.DMatrix(data = test_data, label = test_label)
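
As a quick check (this snippet is an addition, not part of the original notebook), the label really does travel inside the xgb.DMatrix object and can be pulled back out with getinfo():

# the label is stored in the DMatrix, so no separate label object is needed
head(getinfo(train_matrix, "label"))
all(getinfo(train_matrix, "label") == train_label)  # should be TRUE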

K-folds Cross-validation to Estimate Error

Here we set the fitting parameters and then fit a series of XGBoost models, one per fold of cv.nfold [a note on this below]. The code block starts by assigning the number of classes, a set of parameters for the XGBoost fit, and our CV parameters. The xgb_params list holds some critical information for multiclass prediction. Here we set the objective to multi:softprob and the eval_metric to mlogloss. These two parameters tell the XGBoost algorithm that we want to do probabilistic classification and use a multiclass log loss as our evaluation metric. Use of the multi:softprob objective also requires that we tell it the number of classes we have with num_class. See the documentation page on XGBoost training parameters.

The other parameters of note are nrounds and prediction. The nrounds parameter tells XGBoost how many boosting iterations to run. This is inherently tied to the learning rate eta and will most likely require tuning for your data. Read up on hyper-parameter tuning, as it is very important to get right. Finally, the prediction argument tells xgb.cv() to save the out-of-fold predictions so that we can use them to assess generalization error.
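
As a hedged illustration of how nrounds and eta interact, one option is to let early stopping inside xgb.cv() pick the number of rounds for a chosen eta. This sketch is not part of the original analysis; it uses the train_matrix built above, and the tune_params list simply mirrors the xgb_params defined in the next code block with a learning rate added:

# sketch: choose nrounds via early stopping for a fixed learning rate
tune_params <- list(objective = "multi:softprob",
                    eval_metric = "mlogloss",
                    num_class = length(unique(dat$Site)),
                    eta = 0.1)
cv_tune <- xgb.cv(params = tune_params,
                  data = train_matrix,
                  nrounds = 500,                # generous upper bound
                  nfold = 5,
                  early_stopping_rounds = 10,   # stop if mlogloss stops improving
                  verbose = FALSE)
cv_tune$best_iteration                          # a data-driven candidate for nround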

Note on k-folds CV

The cv.nfold value here is arbitrarily chosen. The number of folds will typically be 5 or 10, but it really depends on data size, signal-to-noise ratio, and training time. Read up on how to select the best k for k-fold CV.

numberOfClasses <- length(unique(dat$Site))
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = numberOfClasses)
nround    <- 50 # number of XGBoost rounds
cv.nfold  <- 5

# Fit cv.nfold * cv.nround XGB models and save OOF predictions
cv_model <- xgb.cv(params = xgb_params,
                   data = train_matrix, 
                   nrounds = nround,
                   nfold = cv.nfold,
                   verbose = FALSE,
                   prediction = TRUE)

Assess Out-of-Fold Prediction Error

To assess prediction error, we use the data returned in the pred element of the cv_model object. The pred object contains the predicted value for each observation in our train dataset from the fold in which it was held out, known as the out-of-fold sample. This sample was not used to train that fold's model, so it acts as an independent sample for testing (all caveats about k-fold CV apply!). This step allows us to derive an error estimate for all of our train samples as if they were independent of the model fit.

The pred object returned from xgb.cv() contains a column for each of our classes (Sites) giving the probability that each observation belongs to that class. This is the output of the multi:softprob objective, as opposed to the multi:softmax objective, which predicts only the most likely class (not probabilities). However, since we want to test the prediction, we do need to assign a class, so we use max.col() to pick the class with the highest probability. Following this, I tack on the true class as the label column. We use this table to derive a multiclass confusion matrix to assess the error. At this random seed, set.seed(717), the Accuracy should be 0.7967 with a 95% CI of 0.7148 to 0.8639. These values are seemingly decent, but remember that class 3 is simulated data.

Note on Confusion Matrix

The confusion matrix is so named because it shows where the classification algorithm confuses one class for another (gets it wrong). This is one of the most important tools in classification, as it gives you a wide range of metrics to assess error rate and the ability of the model to generalize. The first big caveat is that the confusion matrix requires predicted class labels, not probabilities. Therefore, you need to establish a probability threshold (often 0.5 in the binary case) or take the maximum probability, as done here. Using a different threshold or method will have a big effect on the metrics in the confusion matrix. The second big caveat is that the metrics provided are generalized and may not be appropriate for your modeling purpose. Each metric has its own host of assumptions and interpretations; these need to be researched and understood in the context of your problem to figure out which metric is the most appropriate. A small worked example of how two of these metrics fall out of the table follows the confusion matrix output below.

OOF_prediction <- data.frame(cv_model$pred) %>%
  mutate(max_prob = max.col(., ties.method = "last"),
         label = train_label + 1)
head(OOF_prediction)
# confusion matrix
confusionMatrix(factor(OOF_prediction$max_prob),
                factor(OOF_prediction$label),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 36  7  5
##          2  6 27  3
##          3  3  1 35
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7148, 0.8639)
##     No Information Rate : 0.3659          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6935          
##  Mcnemar's Test P-Value : 0.6646          
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.8000   0.7714   0.8140
## Specificity            0.8462   0.8977   0.9500
## Pos Pred Value         0.7500   0.7500   0.8974
## Neg Pred Value         0.8800   0.9080   0.9048
## Precision              0.7500   0.7500   0.8974
## Recall                 0.8000   0.7714   0.8140
## F1                     0.7742   0.7606   0.8537
## Prevalence             0.3659   0.2846   0.3496
## Detection Rate         0.2927   0.2195   0.2846
## Detection Prevalence   0.3902   0.2927   0.3171
## Balanced Accuracy      0.8231   0.8346   0.8820
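
To make the note on the confusion matrix above concrete, here is a small sketch (an addition, not part of the original notebook) showing how precision and recall for each class fall directly out of a table of predicted vs. true labels; the values should match the Precision and Recall rows reported by confusionMatrix():

# rows are predicted classes, columns are true classes
cm <- table(predicted = OOF_prediction$max_prob,
            true      = OOF_prediction$label)
diag(cm) / rowSums(cm)  # precision: correct / everything predicted as that class
diag(cm) / colSums(cm)  # recall: correct / everything truly in that class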

Train Full Model and Assess Test Set Error

If the errors assessed from the CV routine are acceptable and within a reasonable range of variation (e.g. all folds show relatively consistent error rates), then a common practice is to fit a model to the full training data set. This may be done with an optimized set of hyper-parameters [see note below] or with defaults. By doing so, we assume that the CV was properly executed and that the training error observed there is consistent with what could be expected from new and unseen data. Therefore, by fitting the full training set, we assume we are not over-fitting our data; and we still have the hold-out test set to confirm that assumption. To fit the full training model, the xgb.train() function is used with the parameters set above.

Following the model fit, the test set test_matrix is predicted using bst_model. The results are transformed into a matrix of class probabilities, the maximum probability is taken, and the true label is added, all in the same way as done above for the OOF data. Of note, though, is that the prediction is returned as a single vector and needs to be reshaped back into a matrix with care. Otherwise, the labels will be jumbled and the confusion matrix will show very poor performance.

The confusion matrix is computed just as above. We see that the test set provided an overall accuracy of 85% with a 95% CI of 71% to 94%. Note that the mean accuracy is a tad higher than the OOF training set, but the CI is a good bit wider as there are only one-third as many data points in the test as the training set. Less data = less confidence.

Note on Hyper-parameters

The k-fold CV steps above could have been used to find the optimum hyper-parameters (nrounds, eta, max.depth, etc.) for the XGBoost algorithm. We did not do that here, but it is a wise thing to do if you want to find the best performance. However, you must be very mindful of over-fitting with optimized hyper-parameters. There are many ways to search for hyper-parameters (e.g. grid search, random search, genetic search, Bayesian search), but the end result is a set of hyper-parameters that gives you the best performance on your OOF observations. Once that set is known, it would be plugged into this step to give you the full train set model.
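
As a hedged sketch of what that search could look like (an addition, not part of the original analysis), a small grid over eta and max_depth can be evaluated with xgb.cv(), reusing the objective, eval_metric, and num_class from xgb_params above:

# sketch: tiny grid search over two hyper-parameters scored by CV mlogloss
grid <- expand.grid(eta = c(0.05, 0.1, 0.3), max_depth = c(2, 4, 6))
grid$mlogloss <- NA
for (i in seq_len(nrow(grid))) {
  cv_i <- xgb.cv(params = c(xgb_params,
                            list(eta = grid$eta[i], max_depth = grid$max_depth[i])),
                 data = train_matrix,
                 nrounds = nround,
                 nfold = cv.nfold,
                 verbose = FALSE)
  # mean CV mlogloss at the final round for this parameter combination
  grid$mlogloss[i] <- tail(cv_i$evaluation_log$test_mlogloss_mean, 1)
}
grid[which.min(grid$mlogloss), ]  # best combination found on the grid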

bst_model <- xgb.train(params = xgb_params,
                       data = train_matrix,
                       nrounds = nround)

# Predict hold-out test set
test_pred <- predict(bst_model, newdata = test_matrix)
test_prediction <- matrix(test_pred, nrow = numberOfClasses,
                          ncol=length(test_pred)/numberOfClasses) %>%
  t() %>%
  data.frame() %>%
  mutate(label = test_label + 1,
         max_prob = max.col(., "last"))
# confusion matrix of test set
confusionMatrix(factor(test_prediction$max_prob),
                factor(test_prediction$label),
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 12  1  1
##          2  2  9  1
##          3  0  1 14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8537          
##                  95% CI : (0.7083, 0.9443)
##     No Information Rate : 0.3902          
##     P-Value [Acc > NIR] : 1.284e-09       
##                                           
##                   Kappa : 0.779           
##  Mcnemar's Test P-Value : 0.7212          
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.8571   0.8182   0.8750
## Specificity            0.9259   0.9000   0.9600
## Pos Pred Value         0.8571   0.7500   0.9333
## Neg Pred Value         0.9259   0.9310   0.9231
## Precision              0.8571   0.7500   0.9333
## Recall                 0.8571   0.8182   0.8750
## F1                     0.8571   0.7826   0.9032
## Prevalence             0.3415   0.2683   0.3902
## Detection Rate         0.2927   0.2195   0.3415
## Detection Prevalence   0.3415   0.2927   0.3659
## Balanced Accuracy      0.8915   0.8591   0.9175

A final step that is often done (for better or worse) is to carry the confidence from the OOF error and test set error and assume that new data we have never seen will be fit just as well by this model. To implement this, we would fit a final model that contains all of our data (test + train) under the assumption that more data is better. One very important thing to note here, again, is the hyper-parameters. If the hyper-parameters used by the CV on the OOF data, or by the full train set model when predicting the test set, are drastically different from those used by the final model, then the assessed error rates are no longer very meaningful. The best route would be to optimize for a set of hyper-parameters, test them on the OOF and test sets, then carry those into the final model. But it all depends on the requirements of the project, amount of data, time, and inclination. Your mileage may vary! A minimal sketch of this final fit follows below.
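
A minimal sketch of that final, all-data fit (an addition to the original notebook) reuses the data_matrix built in the Train and Test Split section, which holds every observation and its label:

# sketch: final model on all data (train + test) with the same parameters
final_model <- xgb.train(params = xgb_params,
                         data = data_matrix,
                         nrounds = nround)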

Variable Importance

The final step in this process, and potentially the first step in the process of understanding the model, is assessing variable importance. Basically, this is a way of using all the splits in the XGBoost trees to understand how much each feature contributes to the accuracy of the classifications. This is quantified with the Gain measurement in the variable importance table obtained from the xgb.importance() function. According to the XGBoost documentation:

Gain is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

The convenient xgb.ggplot.importance() function takes the Gain information and plots it using ggplot2. The variables are also clustered using k-means clustering, with \(k\) chosen by Bayesian Information Criterion (BIC) over \(k = 1\) to \(k = 10\), or you can pass a specific \(k\) (a small example of this follows the plot below). Here we can see that the elements Mn, Mg, Ca, and Na are apparently the most important in classifying each artifact to the accuracy demonstrated by bst_model, given the data and hyper-parameters.

# get the feature real names
names <- colnames(dat[,-1])
# compute feature importance matrix
importance_matrix <- xgb.importance(feature_names = names, model = bst_model)
head(importance_matrix)
# plot
gp <- xgb.ggplot.importance(importance_matrix)
print(gp)
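
If you would rather fix the number of clusters than let BIC choose it, the n_clusters argument of xgb.ggplot.importance() can be set directly. This one-liner is an optional addition, not part of the original notebook, and assumes a version of the xgboost package that exposes n_clusters:

# force a specific number of importance clusters instead of the BIC-chosen k
xgb.ggplot.importance(importance_matrix, n_clusters = 3)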