load("financial_data_pred.rda") # Load dataset # already into test and train
XGBoost (eXtreme Gradient Boosting) is an efficient implementation of gradient boosting. It can be used for both classification and regression.
XGBoost works as follows: models are added to the ensemble one at a time, with each new model trained to correct the errors of the ensemble built so far. To make a prediction, the outputs of all of the models in the ensemble are combined; the sketch below illustrates this additive structure.
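As a minimal sketch of this (to be run after fitting the bst_1 model below; the round numbers here are purely illustrative), we can compare predictions built from only the first 20 boosting rounds against predictions from the full ensemble:

preds_20 <- predict(bst_1, dtest, iterationrange = c(1, 21)) # Use only the first 20 trees
# Note: older xgboost versions use ntreelimit = 20 instead of iterationrange
preds_all <- predict(bst_1, dtest) # Use all 100 trees
head(cbind(preds_20, preds_all)) # Partial-ensemble predictions differ from the full ensemble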
XGBoost cannot simply be passed a data frame; instead it requires the data to be in matrix format. XGBoost uses its own data type, called DMatrix, which is an efficient format for storing sparse matrices (matrices containing many zeros).
# Create training matrix (labels must be numeric 0/1, hence the - 1 on the factor levels)
dtrain <- xgboost::xgb.DMatrix(data = as.matrix(train_data[, 1:220]), label = as.numeric(train_data$class) - 1)
# Create test matrix
dtest <- xgboost::xgb.DMatrix(data = as.matrix(test_data[, 1:220]), label = as.numeric(test_data$class) - 1)
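As a quick sanity check on the conversion, dim() works on a DMatrix just as it does on a regular matrix:

dim(dtrain) # Rows and columns of the training DMatrix
dim(dtest) # Rows and columns of the test DMatrix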
Next we want to train our XGBoost model. For this XGBoost needs to be given an objective function: for binary classification we use binary:logistic, since by default XGBoost predicts a numerical response.
set.seed(111111) # Set seed for reproducibility
bst_1 <- xgboost(data = dtrain, # Set training data
                 nrounds = 100, # Set number of rounds
                 verbose = 1, # 1 - Prints out fit
                 print_every_n = 20, # Prints out result every 20th iteration
                 objective = "binary:logistic", # Set objective
                 eval_metric = "auc", # Set evaluation metrics to use
                 eval_metric = "error")
## [1] train-auc:0.737380 train-error:0.330500
## [21] train-auc:0.954374 train-error:0.114625
## [41] train-auc:0.988477 train-error:0.054750
## [61] train-auc:0.998311 train-error:0.020000
## [81] train-auc:0.999815 train-error:0.007375
## [100] train-auc:0.999983 train-error:0.001375
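Note that by round 100 the training AUC is nearly 1 and the training error is close to zero, which suggests the model may be overfitting the training data. As a sketch of one way to monitor this, the lower-level xgb.train interface accepts a watchlist so test-set performance can be tracked each round (dtrain and dtest come from above; the watchlist argument name varies slightly across xgboost versions):

set.seed(111111)
bst_watch <- xgboost::xgb.train(data = dtrain, # Set training data
                                nrounds = 100, # Set number of rounds
                                watchlist = list(train = dtrain, test = dtest), # Track both sets
                                print_every_n = 20, # Prints out result every 20th iteration
                                objective = "binary:logistic", # Set objective
                                eval_metric = "auc") # Set evaluation metric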
boost_preds_1 <- predict(bst_1, dtest) # Create predictions for xgboost model
pred_dat <- cbind.data.frame(boost_preds_1, test_data$class) # Join predictions with actual classes
# Convert predictions to classes using a 0.5 cut-off
# (a sketch for choosing a data-driven cut-off follows the confusion matrix below)
boost_pred_class <- rep(0, length(boost_preds_1))
boost_pred_class[boost_preds_1 >= 0.5] <- 1
t <- table(boost_pred_class, test_data$class) # Create table of predicted vs actual classes
confusionMatrix(t, positive = "1") # Produce confusion matrix (requires the caret package)
## Confusion Matrix and Statistics
##
##
## boost_pred_class   0   1
##                0 385 462
##                1 357 796
##
## Accuracy : 0.5905
## 95% CI : (0.5686, 0.6122)
## No Information Rate : 0.629
## P-Value [Acc > NIR] : 0.999819
##
## Kappa : 0.1473
##
## Mcnemar's Test P-Value : 0.000279
##
## Sensitivity : 0.6328
## Specificity : 0.5189
## Pos Pred Value : 0.6904
## Neg Pred Value : 0.4545
## Prevalence : 0.6290
## Detection Rate : 0.3980
## Detection Prevalence : 0.5765
## Balanced Accuracy : 0.5758
##
## 'Positive' Class : 1
##
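The 0.5 cut-off above is just a default, and the low specificity suggests it may not be the best choice here. As a sketch of choosing a cut-off from the data (this uses the pROC package, which the notes themselves do not load, so treat it as an assumption), we could pick the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1):

library(pROC) # Assumed package for ROC analysis
roc_obj <- roc(as.numeric(test_data$class) - 1, boost_preds_1) # ROC curve on test set
coords(roc_obj, "best", best.method = "youden") # Threshold maximizing sensitivity + specificity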
There are several parameters which we can tune for XGBoost:

- eta: the learning rate, which scales the contribution of each new tree (lower values need more rounds but are more robust)
- max_depth: the maximum depth of each tree
- min_child_weight: the minimum sum of instance weight needed in a child node to allow a further split
- gamma: the minimum loss reduction required to make a split
- subsample: the fraction of observations sampled for each tree
- colsample_bytree: the fraction of variables sampled for each tree
- nrounds: the number of boosting iterations

For XGBoost a good parameter tuning approach to take is:

1. Fix a moderate learning rate (e.g. eta = 0.1) and use cross-validation to choose the number of rounds (a sketch follows this list).
2. Tune the tree parameters (max_depth and min_child_weight).
3. Tune gamma and the sampling parameters (subsample and colsample_bytree).
4. Lower eta and increase nrounds for the final model.
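As a minimal sketch of step 1 (the parameter values here are illustrative starting points, not tuned results), we can use xgb.cv with early stopping to pick the number of rounds:

set.seed(111111)
cv_res <- xgboost::xgb.cv(data = dtrain, # Set training data
                          nfold = 5, # 5-fold cross-validation
                          nrounds = 500, # Upper limit on rounds
                          eta = 0.1, # Moderate learning rate
                          objective = "binary:logistic", # Set objective
                          eval_metric = "auc", # Set evaluation metric
                          early_stopping_rounds = 20, # Stop if no improvement for 20 rounds
                          verbose = 0) # Suppress per-round output
cv_res$best_iteration # Cross-validated choice for nrounds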
# Extract importance
imp_mat <- xgb.importance(model = bst_1)
# Plot importance (top 10 variables)
xgb.plot.importance(imp_mat, top_n = 10)
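The importance matrix itself can also be printed; for tree models it contains the Gain, Cover, and Frequency of each feature, with Gain usually the most informative measure:

head(imp_mat, 10) # Top 10 features with their Gain, Cover, and Frequency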