This simple example, written in R, shows you how to train an XGBoost model to predict unknown flower species using the famous Iris data set. XGBoost (Extreme Gradient Boosting) is known to regularly outperform many other traditional algorithms for regression and classification. The iris flower species problem is an example of multi-class (multinomial) classification.
The xgboost package contains the XGBoost algorithm and associated tools. You can view the formal documentation online: https://cran.r-project.org/web/packages/xgboost/xgboost.pdf.
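If the xgboost package is not already installed, you can install it from CRAN first (a one-time step).
# Install the xgboost package from CRAN (skip if already installed)
install.packages("xgboost")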
library(xgboost)
data(iris)
XGBoost requires the classes to be in an integer format starting at 0, so the first class is labeled 0. The Species factor is converted to this integer format.
# Convert the Species factor to an integer class starting at 0
# This is picky, but it's a requirement for XGBoost
species = iris$Species
label = as.integer(iris$Species)-1
iris$Species = NULL
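As a quick, optional sanity check, you can confirm that the new labels run from 0 to 2 and that each class has 50 observations.
# Optional sanity check: labels should be 0, 1, 2 with 50 observations each
table(label)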
The full iris data set is split into training (75%) and testing (25%) sets. The training set is used to fit the model, and the testing set is held out for validation. This allows us to validate the performance of the model on data it has not seen (known as hold-out validation).
n = nrow(iris)
train.index = sample(n,floor(0.75*n))
train.data = as.matrix(iris[train.index,])
train.label = label[train.index]
test.data = as.matrix(iris[-train.index,])
test.label = label[-train.index]
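Note that sample() draws a random split, so your exact split (and the numbers that follow) will vary from run to run. If you want a reproducible split, you could set a random seed before sampling; the value 42 below is just an arbitrary example.
# Optional: fix the random seed before sampling for a reproducible split
# (42 is an arbitrary example value)
set.seed(42)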
Next, we transform the training and testing data sets into xgb.DMatrix objects that are used for fitting the XGBoost model and predicting new outcomes.
# Transform the two data sets into xgb.DMatrix objects
xgb.train = xgb.DMatrix(data=train.data,label=train.label)
xgb.test = xgb.DMatrix(data=test.data,label=test.label)
XGBoost, like most other algorithms, works best when its hyperparameters are tuned for optimal performance. The algorithm requires that we define the booster, objective, learning rate, and other parameters. This example uses a set of parameters that I found to work well through simple cross-validation (see the xgb.cv sketch after the parameter block below). You’ll need to spend most of your time in this step; it’s imperative that you understand your data and use cross-validation.
The multi:softprob objective tells the algorithm to calculate probabilities for every possible outcome (in this case, a probability for each of the three flower species) for every observation.
# Define the parameters for multinomial classification
num_class = length(levels(species))
params = list(
booster="gbtree",
eta=0.001,
max_depth=5,
gamma=3,
subsample=0.75,
colsample_bytree=1,
objective="multi:softprob",
eval_metric="mlogloss",
num_class=num_class
)
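If you want to vet these parameters yourself, a minimal sketch using xgb.cv from the xgboost package is shown below. The 5-fold setup and the round limit are illustrative choices for this sketch, not the exact tuning procedure used to pick the values above.
# Sketch: 5-fold cross-validation with the parameters defined above
xgb.cv.fit = xgb.cv(
params=params,
data=xgb.train,
nrounds=10000,
nfold=5,
early_stopping_rounds=10,
verbose=0
)
# The best round and its cross-validated mlogloss
xgb.cv.fit$best_iteration
xgb.cv.fit$evaluation_log[xgb.cv.fit$best_iteration,]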
We can finally train the XGBoost model! I only use one thread (versus parallel execution using multiple threads) because the data set is relatively small and the algorithm converges quickly. The test data set, xgb.test, is listed in the watchlist. This tells the algorithm to evaluate performance on the test data after every round, and training stops early if performance does not improve for 10 consecutive rounds. I also include the training data in the watchlist so I can compare training and testing performance and watch for overfitting.
You can set the verbose parameter to 1 to print the results for each round; here it is set to 0 to keep the output quiet.
# Train the XGBoost classifier
xgb.fit=xgb.train(
params=params,
data=xgb.train,
nrounds=10000,
nthreads=1,
early_stopping_rounds=10,
watchlist=list(val1=xgb.train,val2=xgb.test),
verbose=0
)
# Review the final model and results
xgb.fit
## ##### xgb.Booster
## raw: 3.5 Mb
## call:
## xgb.train(params = params, data = xgb.train, nrounds = 10000,
## watchlist = list(val1 = xgb.train, val2 = xgb.test), verbose = 0,
## early_stopping_rounds = 10, nthreads = 1)
## params (as set within xgb.train):
## booster = "gbtree", eta = "0.001", max_depth = "5", gamma = "3", subsample = "0.75", colsample_bytree = "1", objective = "multi:softprob", eval_metric = "mlogloss", num_class = "3", nthreads = "1", silent = "1"
## xgb.attributes:
## best_iteration, best_msg, best_ntreelimit, best_score, niter
## callbacks:
## cb.evaluation.log()
## cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize,
## verbose = verbose)
## # of features: 4
## niter: 3249
## best_iteration : 3239
## best_ntreelimit : 3239
## best_score : 0.196657
## nfeatures : 4
## evaluation_log:
## iter val1_mlogloss val2_mlogloss
## 1 1.097368 1.097407
## 2 1.096104 1.096117
## ---
## 3248 0.181839 0.196666
## 3249 0.181838 0.196673
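If you prefer to inspect convergence programmatically rather than reading the printout, a quick sketch using the fields shown in the output above:
# Pull the stopping point and the per-round evaluation log from the fitted model
xgb.fit$best_iteration
head(xgb.fit$evaluation_log)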
Awesome, the model converged! Now we can predict new outcomes given the testing data set that we set aside earlier. We use the predict function to predict the likelihood of each observation in test.data being each flower species.
Don’t forget to re-convert your labels back to the names of the species by adding 1 back to the integer values.
# Predict outcomes with the test data
xgb.pred = predict(xgb.fit,test.data,reshape=T)
xgb.pred = as.data.frame(xgb.pred)
colnames(xgb.pred) = levels(species)
Iterate over the predictions and identify the label (class) with the highest probability. This allows us to evaluate the true performance of the model by comparing the actual labels with the predicted labels.
# Use the predicted label with the highest probability
xgb.pred$prediction = apply(xgb.pred,1,function(x) colnames(xgb.pred)[which.max(x)])
xgb.pred$label = levels(species)[test.label+1]
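Before collapsing everything into a single accuracy number, it can be useful to see which species get confused with which. A quick cross-tabulation of predicted versus actual labels does the job (the dimension names here are just for readability).
# Confusion table: predicted species versus actual species
table(Predicted = xgb.pred$prediction, Actual = xgb.pred$label)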
Calculate the accuracy of the predictions. This compares the true labels from the test data set with the predicted labels (those with the highest probability), and it represents the percentage of flower species that were accurately predicted by the XGBoost model. My results suggest that XGBoost can consistently achieve an accuracy of at least 90%!
# Calculate the final accuracy
result = sum(xgb.pred$prediction==xgb.pred$label)/nrow(xgb.pred)
print(paste("Final Accuracy =",sprintf("%1.2f%%", 100*result)))
## [1] "Final Accuracy = 97.37%"