load("financial_data_pred.rda") # Load dataset # already into test and train

XGBoost

XGBoost (eXtreme Gradient Boosting) is an efficient implementation of gradient boosting. It can be used for both classification and regression.

XGBoost works as follows:

    • Create a naive model (e.g. predict the overall mean or majority class)
    • Calculate the errors of the current model
    • Build a model predicting the errors (equivalently, re-weight the incorrectly predicted samples and rebuild the model)
    • Add the new model to the ensemble
    • Repeat steps 2-4 until the model converges or the chosen number of trees is reached

To make a prediction, the outputs of all the models in the ensemble are combined, as in the sketch below.
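
To make this concrete, here is a minimal sketch of the boosting loop for a toy regression problem, using small rpart trees as the base learners. The toy data frame df, the learning rate and the tree settings are made up purely for illustration; this shows the general idea, not how xgboost is implemented internally.

library(rpart) # Small trees as base learners for this illustration

# Toy regression data (made up for illustration only)
set.seed(111111)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- df$x1 + 2 * df$x2 + rnorm(200)

n_trees <- 50 # Number of boosting rounds
eta <- 0.1 # Learning rate
pred <- rep(mean(df$y), nrow(df)) # Step 1: naive model (overall mean)
models <- list()
boost_df <- df

for (i in seq_len(n_trees)) {
  boost_df$resid <- df$y - pred # Step 2: errors of the current ensemble
  fit <- rpart(resid ~ . - y, data = boost_df, # Step 3: model the errors
               control = rpart.control(maxdepth = 2))
  models[[i]] <- fit # Step 4: add the model to the ensemble
  pred <- pred + eta * predict(fit, boost_df)
}

# To predict, every model in the ensemble contributes
final_pred <- mean(df$y) + eta * Reduce(`+`, lapply(models, predict, newdata = boost_df))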

Preparing data for XGBoost

XGBoost cannot be passed a data frame directly; instead it requires the data in matrix format:

  • A matrix can only contain numbers, unlike a data frame, which can mix types
  • Convert categorical variables to dummy variables so they can be stored in a matrix (see the example after this list)
  • Numerical data includes both numeric variables and TRUE/FALSE variables (stored as 1/0)
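
As a quick illustration of the dummy-variable step, model.matrix() expands factor (and TRUE/FALSE) columns into numeric 0/1 columns. The small data frame below is made up for illustration and is not part of the financial data set.

# Hypothetical data frame with a numeric, a factor and a logical column
example_df <- data.frame(income = c(50000, 62000, 48000),
                         region = factor(c("north", "south", "north")),
                         retired = c(TRUE, FALSE, FALSE))
# Expand factors/logicals to dummy variables; drop the intercept column
example_mat <- model.matrix(~ . - 1, data = example_df)
example_mat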

Convert data to DMatrix

XGBoost uses its own data type, called DMatrix, which stores sparse matrices (matrices with many zeros) efficiently.

# Create training matrix (subtract 1 so the class labels are 0/1, as binary:logistic expects)
dtrain <- xgboost::xgb.DMatrix(data = as.matrix(train_data[, 1:220]), label = as.numeric(train_data$class) - 1)
# Create test matrix
dtest <- xgboost::xgb.DMatrix(data = as.matrix(test_data[, 1:220]), label = as.numeric(test_data$class) - 1)

Training an XGBoost Model

Next we want to train our XGBoost model. For this, XGBoost needs:

  • Training data - The data we want to train our model on
  • Number of rounds - The number of rounds (iterations/trees) of training to use
  • Objective Function - This determines the output of XGBoost and depends on the type of response variable we are using. Here, as we have a binary response variable, we choose the objective binary:logistic. By default XGBoost predicts a numerical response.

set.seed(111111)
bst_1 <- xgboost(data = dtrain, # Set training data
                 nrounds = 100, # Set number of rounds
                 verbose = 1, # 1 - Prints out fit
                 print_every_n = 20, # Prints out result every 20th iteration
                 objective = "binary:logistic", # Set objective
                 eval_metric = "auc", # Set evaluation metrics to use
                 eval_metric = "error")
## [1]  train-auc:0.737380  train-error:0.330500 
## [21] train-auc:0.954374  train-error:0.114625 
## [41] train-auc:0.988477  train-error:0.054750 
## [61] train-auc:0.998311  train-error:0.020000 
## [81] train-auc:0.999815  train-error:0.007375 
## [100]    train-auc:0.999983  train-error:0.001375

Predicting

boost_preds_1 <- predict(bst_1, dtest) # Create predictions for xgboost model

pred_dat <- cbind.data.frame(boost_preds_1, test_data$class) # Combine predicted probabilities with true classes
# Convert predicted probabilities to classes using a 0.5 cut-off
boost_pred_class <- rep(0, length(boost_preds_1))
boost_pred_class[boost_preds_1 >= 0.5] <- 1


t <- table(boost_pred_class, test_data$class) # Create table
confusionMatrix(t, positive = "1") # Produce confusion matrix
## Confusion Matrix and Statistics
## 
##                 
## boost_pred_class   0   1
##                0 385 462
##                1 357 796
##                                           
##                Accuracy : 0.5905          
##                  95% CI : (0.5686, 0.6122)
##     No Information Rate : 0.629           
##     P-Value [Acc > NIR] : 0.999819        
##                                           
##                   Kappa : 0.1473          
##                                           
##  Mcnemar's Test P-Value : 0.000279        
##                                           
##             Sensitivity : 0.6328          
##             Specificity : 0.5189          
##          Pos Pred Value : 0.6904          
##          Neg Pred Value : 0.4545          
##              Prevalence : 0.6290          
##          Detection Rate : 0.3980          
##    Detection Prevalence : 0.5765          
##       Balanced Accuracy : 0.5758          
##                                           
##        'Positive' Class : 1               
## 
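
Since the raw XGBoost predictions are probabilities (and pred_dat pairs them with the true classes), we can also look at the ROC curve and AUC on the test set. A brief sketch using the pROC package, which is not loaded elsewhere in this workbook, so treat it as an optional extra:

library(pROC) # For roc() and auc()
roc_1 <- roc(response = test_data$class, predictor = boost_preds_1) # ROC on the test set
plot(roc_1) # Plot ROC curve
auc(roc_1) # Area under the curve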

Tuning with XGBoost

There are several parameters which we can tune for XGBoost (a sketch using them follows this list):

  • max.depth (also accepted as max_depth) - The depth of our trees (controls the maximum order of interactions a tree can capture). (Values: 3-10)
  • nrounds - The number of rounds used to train the model. (Values: aim for about 100 trees)
  • eta - The learning rate of the model. (Values: 0.01 - 0.3)
  • gamma - Minimum loss reduction necessary to make a further partition in a node.
  • min_child_weight - Minimum sum of instance weights necessary to partition a node (think of it as a minimum number of samples, weighted by how badly the ensemble currently predicts them). (Values: highly problem specific)
  • subsample - The ratio of the training data to use in each tree (bootstrap samples). (Values: 0.5 - 1)
  • colsample_bytree - The ratio of columns to sample for each tree (like random forest, but per tree rather than per split). (Values: 0.5 - 1)
  • early_stopping_rounds - Stops fitting if performance has not improved for this number of rounds. We can use this to stop adding trees early and to decide the optimal number of trees.
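
As a sketch of how these parameters are passed, here is 5-fold cross-validation with xgb.cv and early stopping; the specific values are illustrative starting points, not tuned results.

set.seed(111111)
cv_1 <- xgb.cv(data = dtrain, # Set training data
               nfold = 5, # Use 5-fold cross-validation
               nrounds = 1000, # Upper limit on the number of trees
               early_stopping_rounds = 20, # Stop if no improvement for 20 rounds
               eta = 0.1, # Learning rate
               max_depth = 5, # Tree depth
               min_child_weight = 1, # Minimum child weight
               gamma = 0, # Minimum loss reduction
               subsample = 0.8, # Row sampling ratio per tree
               colsample_bytree = 0.8, # Column sampling ratio per tree
               objective = "binary:logistic", # Set objective
               eval_metric = "auc", # Set evaluation metric
               verbose = 0) # Suppress per-round output
cv_1$best_iteration # Number of rounds chosen by early stopping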

For XGBoost, a good parameter-tuning approach is:

  • Choose a relatively high learning rate, from 0.05 to 0.3; a value of 0.1 generally works well. Determine the optimal number of trees for this learning rate.
  • Tune the tree-specific parameters, such as max.depth, min_child_weight, gamma, subsample and colsample_bytree, for the chosen learning rate and number of trees.
  • Lower the learning rate and decide the optimal number of trees again (see the sketch below).
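
A sketch of the last step, again with illustrative values: re-run the cross-validation at a lower learning rate to pick the number of trees, then fit the final model with those settings.

set.seed(111111)
cv_final <- xgb.cv(data = dtrain, nfold = 5, nrounds = 2000, # Re-run CV at the lower rate
                   early_stopping_rounds = 20,
                   eta = 0.05, max_depth = 5, subsample = 0.8, colsample_bytree = 0.8,
                   objective = "binary:logistic", eval_metric = "auc", verbose = 0)
bst_final <- xgboost(data = dtrain,
                     nrounds = cv_final$best_iteration, # Trees chosen at the lower rate
                     eta = 0.05, max_depth = 5, subsample = 0.8, colsample_bytree = 0.8,
                     objective = "binary:logistic", eval_metric = "auc", verbose = 0)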

Visualize variable importance

# Extract importance
imp_mat <- xgb.importance(model = bst_1)
# Plot importance (top 10 variables)
xgb.plot.importance(imp_mat, top_n = 10)

For tuning work, see prediction_2_workbook