This simple example, written in R, shows you how to train an XGBoost model to predict unknown flower species using the famous Iris data set. XGBoost (Extreme Gradient Boosting) is known to regularly outperform many other traditional regression and classification algorithms. The iris flower species problem is a multi-class (multinomial) classification problem.

Load the packages and data

The xgboost package contains the XGBoost algorithm and associated tools. You can view the formal documentation online: https://cran.r-project.org/web/packages/xgboost/xgboost.pdf.

library(xgboost)
data(iris)

Label conversion

XGBoost requires the class labels to be integers starting at 0, so the first class must be 0, the second 1, and so on. The code below converts the Species factor to this integer format.

# Convert the Species factor to an integer class starting at 0
# This is picky, but it's a requirement for XGBoost
species = iris$Species
label = as.integer(iris$Species)-1
iris$Species = NULL
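
If you want to double-check the mapping, a quick cross-tabulation with base R's table() shows which integer each species was assigned (this check is optional and just for sanity):

# Sanity check: each species should map to exactly one of 0, 1, 2
table(species, label)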

Split the data for training and testing (75/25 split)

The full iris data set is split into a training set (75%) and a testing set (25%). The training set is used to fit the model, and the testing set is held out for validation. This lets us assess how the model performs on data it has never seen (known as holdout validation).

n = nrow(iris)
train.index = sample(n,floor(0.75*n))
train.data = as.matrix(iris[train.index,])
train.label = label[train.index]
test.data = as.matrix(iris[-train.index,])
test.label = label[-train.index]
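
Note that sample() draws a different random split on every run, so your exact results will differ slightly from mine. If you want a reproducible split, fix the random seed before sampling; a minimal sketch (the seed value 42 is arbitrary):

# Optional: fix the seed so the 75/25 split is reproducible
set.seed(42)
train.index = sample(n,floor(0.75*n))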

Create the xgb.DMatrix objects

Next, we transform the training and testing data sets into xgb.DMatrix objects that are used for fitting the XGBoost model and predicting new outcomes.

# Transform the two data sets into xgb.DMatrix objects
xgb.train = xgb.DMatrix(data=train.data,label=train.label)
xgb.test = xgb.DMatrix(data=test.data,label=test.label)

Define the main parameters

XGBoost, like most other algorithms, works best when its hyperparameters are tuned for the problem at hand. The algorithm requires that we define the booster, objective, learning rate, and other parameters. This example uses a set of parameters that I found to work well through simple cross-validation (a minimal sketch of that cross-validation step follows the parameter list below). You'll need to spend most of your time in this step; it's imperative that you understand your data and use cross-validation.

The multi:softprob objective tells the algorithm to calculate probabilities for every possible outcome (in this case, a probability for each of the three flower species), for every observation.

# Define the parameters for multinomial classification
num_class = length(levels(species))
params = list(
  booster="gbtree",
  eta=0.001,
  max_depth=5,
  gamma=3,
  subsample=0.75,
  colsample_bytree=1,
  objective="multi:softprob",
  eval_metric="mlogloss",
  num_class=num_class
)
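
If you want to run that cross-validation step yourself, the xgboost package includes xgb.cv, which fits the model across k folds and reports the validation metric for each round. Below is a minimal sketch using the parameters and the xgb.train matrix defined above; the choice of 5 folds is a common default, not a tuned value:

# Cross-validate the parameters before committing to a final model
xgb.cv.fit = xgb.cv(
  params=params,
  data=xgb.train,
  nrounds=10000,
  nfold=5,
  early_stopping_rounds=10,
  verbose=0
)
# The mean training and validation mlogloss per round are stored in the evaluation log
head(xgb.cv.fit$evaluation_log)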

Train the model

We can finally train the XGBoost model! I only use one thread (versus parallel execution across multiple threads) because the data set is relatively small and the algorithm converges quickly. The test data set, xgb.test, is listed in the watchlist. This tells the algorithm to evaluate performance on the test data after every round, and training stops early if the validation metric does not improve for 10 consecutive rounds. I also include the training data in the watchlist so I can compare the training and testing error and watch for overfitting.

The verbose parameter is set to 0 here to suppress the per-round output; set it to 1 if you want to see the results for each round.

# Train the XGBoost classifier
xgb.fit=xgb.train(
  params=params,
  data=xgb.train,
  nrounds=10000,
  nthreads=1,
  early_stopping_rounds=10,
  watchlist=list(val1=xgb.train,val2=xgb.test),
  verbose=0
)

# Review the final model and results
xgb.fit
## ##### xgb.Booster
## raw: 3.5 Mb 
## call:
##   xgb.train(params = params, data = xgb.train, nrounds = 10000, 
##     watchlist = list(val1 = xgb.train, val2 = xgb.test), verbose = 0, 
##     early_stopping_rounds = 10, nthreads = 1)
## params (as set within xgb.train):
##   booster = "gbtree", eta = "0.001", max_depth = "5", gamma = "3", subsample = "0.75", colsample_bytree = "1", objective = "multi:softprob", eval_metric = "mlogloss", num_class = "3", nthreads = "1", silent = "1"
## xgb.attributes:
##   best_iteration, best_msg, best_ntreelimit, best_score, niter
## callbacks:
##   cb.evaluation.log()
##   cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     verbose = verbose)
## # of features: 4 
## niter: 3249
## best_iteration : 3239 
## best_ntreelimit : 3239 
## best_score : 0.196657 
## nfeatures : 4 
## evaluation_log:
##     iter val1_mlogloss val2_mlogloss
##        1      1.097368      1.097407
##        2      1.096104      1.096117
## ---                                 
##     3248      0.181839      0.196666
##     3249      0.181838      0.196673
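
The printed summary shows the best round, but you can also pull these values out of the fitted booster directly, which is handy for programmatic use. In the xgboost version used here, the early-stopping callback stores them on the model object:

# The round with the lowest validation mlogloss, and its score
xgb.fit$best_iteration
xgb.fit$best_score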

Predict new outcomes

Awesome, the model converged! Now we can predict new outcomes with the testing data set that we set aside earlier. We use the predict function to estimate, for each observation in test.data, the probability that it belongs to each flower species.

Don't forget to re-convert your labels back to the species names by adding 1 back to the integer values (done a couple of steps below, when the predictions are compared to the true labels).

# Predict outcomes with the test data
xgb.pred = predict(xgb.fit,test.data,reshape=T)
xgb.pred = as.data.frame(xgb.pred)
colnames(xgb.pred) = levels(species)
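
Because we used the multi:softprob objective, each row of xgb.pred holds one probability per species, and the three probabilities in a row should sum to (approximately) 1. A quick peek confirms this:

# Inspect the first few probability rows; each row should sum to ~1
head(xgb.pred)
rowSums(head(xgb.pred))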

Identify the class with the highest probability for each prediction

Iterate over the predictions and identify the label (class) with the highest probability. This allows us to evaluate the true performance of the model by comparing the actual labels with the predicted labels.

# Use the predicted label with the highest probability
xgb.pred$prediction = apply(xgb.pred,1,function(x) colnames(xgb.pred)[which.max(x)])
xgb.pred$label = levels(species)[test.label+1]

How accurate are the predictions?

Calculate the accuracy of the predictions. This compares the true labels from the test data set with the predicted labels (those with the highest probability), and it represents the percentage of flower species that were accurately predicted by the XGBoost model. My results suggest that XGBoost can consistently achieve an accuracy of at least 90%!

# Calculate the final accuracy
result = sum(xgb.pred$prediction==xgb.pred$label)/nrow(xgb.pred)
print(paste("Final Accuracy =",sprintf("%1.2f%%", 100*result)))
## [1] "Final Accuracy = 97.37%"