XGBoost, short for Extreme Gradient Boosting, is my second favorite classifier after LightGBM. It has become an extremely popular tool among Kaggle competitors and data scientists in industry. It is used for supervised learning problems, where training data is used to predict a target variable.

Basic Settings and Data Import

Let’s begin by loading the required libraries and importing the data set we are going to use for this model.

#set working directory
setwd("C:/Users/awani/Desktop/50daysofAnalytics")
options(scipen = 999)

# load required libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(pscl, ggplot2, ROCR, xgboost, magrittr, Matrix, readr, stringr, caret, car)

#read data
claims = read.csv("Insurance.csv", stringsAsFactors = F)

Data Format Correction

Several of the categorical and ordinal variables are stored as integers, so we convert them to factors; in particular, the dependent variable fraudulent becomes a factor with levels 0 and 1. We also drop the identifier column claimid, which carries no predictive information.

str(claims)
## 'data.frame':    4415 obs. of  19 variables:
##  $ claimid          : int  351069569 806984053 654100160 653220231 226637568 46113373 397237121 917836504 277901331 16290780 ...
##  $ claim_type       : int  3 3 5 1 5 4 1 2 4 5 ...
##  $ uninhabitable    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ claim_amount     : num  192.29 355.9 3.53 33.45 4.03 ...
##  $ fraudulent       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ claim_days       : int  7662 9197 7351 3171 2487 5028 715 4923 10922 697 ...
##  $ coverage         : int  436 925 79 607 119 144 636 254 291 114 ...
##  $ deductible       : int  2000 1000 1000 1000 3000 500 1000 2000 1000 1000 ...
##  $ townsize         : int  1 1 1 1 3 1 2 1 1 3 ...
##  $ gender           : int  1 0 1 0 1 0 0 1 0 1 ...
##  $ age              : int  65 75 69 37 40 78 26 64 60 27 ...
##  $ edcat            : int  2 2 2 3 1 3 4 1 4 1 ...
##  $ work_ex          : int  27 26 4 9 4 39 0 20 11 3 ...
##  $ retire           : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ income           : int  193 203 49 118 18 25 36 13 91 23 ...
##  $ marital          : int  0 0 0 1 0 1 0 1 0 1 ...
##  $ residents        : int  1 1 1 3 1 2 1 2 1 4 ...
##  $ primary_residence: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ occupancy        : int  30 37 31 15 8 15 5 22 40 7 ...
#change categorical or ordinal variables to factor
for ( i in c(2,3,5,9,10,12,14,16,18))
{
  claims[,i]= as.factor(claims[,i])
}

#drop claimid
claims$claimid = NULL

Exploratory Data Analysis

Before we jump into any kind of model training, it is essential that we understand the data well. Quickly eyeballing uni-variate and bi-variate plots can make us familiar with the data set.

#dependent variable
table(claims$fraudulent)
## 
##    0    1 
## 3952  463
# independent variables

# Fraud by claim amount and claim days
ggplot(claims, aes(x = claim_amount, y = claim_days, shape = fraudulent, color = fraudulent)) +
  geom_point() +ggtitle("Frauds by Claim Amount and Claim Days")

ggplot(claims, aes(x = claim_days, y = coverage, shape = fraudulent, color = fraudulent)) +
  geom_point() + ggtitle("Frauds by Coverage and Claim Days")
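
We only plotted bi-variate views above, so a couple of quick uni-variate plots can round out the picture; the sketch below assumes the same claims data frame and column names used throughout.

#distribution of claim amount
ggplot(claims, aes(x = claim_amount)) +
  geom_histogram(bins = 50) + ggtitle("Distribution of Claim Amount")

#fraud share within each claim type
ggplot(claims, aes(x = claim_type, fill = fraudulent)) +
  geom_bar(position = "fill") + ggtitle("Fraud Share by Claim Type")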

Data Preparation

Unlike logistic regression and other classifiers, XGBoost requires the training and validation data to be prepared in a particular way: the features go into an xgb.DMatrix, and the target variable is supplied separately as the label. Also, as always, we split the data into train and val sets.

# dependent variable back to numeric 0/1 for the xgboost label
claims$fraudulent = as.numeric(as.character(claims$fraudulent))

#training and Validation dataset
set.seed(123)
smp_size = floor(0.7 * nrow(claims))
train_ind = sample(seq_len(nrow(claims)), size = smp_size)

train = claims[train_ind, ]
val = claims[-train_ind, ]

#prepare training data
trainm = sparse.model.matrix(fraudulent ~., data = train)
train_label = train[,"fraudulent"]
train_matrix = xgb.DMatrix(data = as.matrix(trainm), label = train_label)

#prepare validation data
valm = sparse.model.matrix(fraudulent ~., data= val)
val_label = val[,"fraudulent"]
val_matrix = xgb.DMatrix(data = as.matrix(valm), label = val_label)
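
Because sparse.model.matrix builds the design matrix from the factor levels it sees, it is worth checking that the train and val matrices ended up with identical columns; the lines below are just a sanity check, not part of the modeling pipeline.

#check that train and val design matrices line up
identical(colnames(trainm), colnames(valm))
dim(train_matrix)
dim(val_matrix)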

Training XGBoost Model

There are many parameters in an XGBoost model that can be used to fine-tune it. Here we set the objective to binary:logistic and the max depth to 3 to limit over-fitting. Finally, we run 1000 boosting rounds.

#parameters
xgb_params = list(objective   = "binary:logistic",
                   eval_metric = "error",
                   max_depth   = 3,
                   eta         = 0.01,
                   gamma       = 1,
                   colsample_bytree = 0.5,
                   min_child_weight = 1)

#model
bst_model = xgb.train(params = xgb_params, data = train_matrix,
                      nrounds = 1000)
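
Instead of fixing nrounds at 1000, we could also let xgboost watch the validation error and stop early; this is an optional sketch (the bst_es name is just illustrative) that reuses the matrices and parameters defined above.

#optional: early stopping against the validation set
watch = list(train = train_matrix, eval = val_matrix)
bst_es = xgb.train(params = xgb_params, data = train_matrix,
                   nrounds = 1000, watchlist = watch,
                   early_stopping_rounds = 50, verbose = 0)
bst_es$best_iteration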

Prediction and Model Evaluation

The XGB importance plot is a quick way to visualize the importance of the independent variables; it can then be used to limit the features in our model. When we compare accuracy, sensitivity and specificity, the model's performance is better than what we had with logistic regression.

#feature importance
imp = xgb.importance(colnames(train_matrix), model = bst_model)
xgb.plot.importance(imp)
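
The importance table behind the plot can also be inspected directly, which is handy if we later want to retrain on only the top features; head() here is just a quick peek.

#top features from the importance table
head(imp, 10)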

#prediction & confusion matrix
p = predict(bst_model, newdata = val_matrix)
val$predicted = ifelse(p > 0.15,1,0)
confusionMatrix(table(val$predicted, val$fraudulent))
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1033  117
##   1  150   25
##                                           
##                Accuracy : 0.7985          
##                  95% CI : (0.7759, 0.8198)
##     No Information Rate : 0.8928          
##     P-Value [Acc > NIR] : 1.00000         
##                                           
##                   Kappa : 0.0447          
##  Mcnemar's Test P-Value : 0.05019         
##                                           
##             Sensitivity : 0.8732          
##             Specificity : 0.1761          
##          Pos Pred Value : 0.8983          
##          Neg Pred Value : 0.1429          
##              Prevalence : 0.8928          
##          Detection Rate : 0.7796          
##    Detection Prevalence : 0.8679          
##       Balanced Accuracy : 0.5246          
##                                           
##        'Positive' Class : 0               
## 
# Evaluation curve: accuracy vs. cutoff
pred = prediction(p, val$fraudulent)
eval = performance(pred, "acc")
plot(eval)
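
Since the accuracy curve is plotted against the cutoff, we can also read off the cutoff that maximises accuracy; the slot access below is standard ROCR, though with classes this imbalanced accuracy alone is a blunt criterion.

#cutoff with the highest accuracy
best = which.max(eval@y.values[[1]])
eval@x.values[[1]][best]
eval@y.values[[1]][best]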

#ROC curve
roc = performance(pred, "tpr", "fpr")
plot(roc, main = "ROC curve")
abline(a = 0, b = 1)
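
Finally, the same ROCR prediction object gives us the area under the ROC curve as a single summary number.

#AUC
auc = performance(pred, "auc")
auc@y.values[[1]]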