XGBoost, which stands for Extreme Gradient Boosting, is my second favorite classifier after LightGBM. It has become an extremely popular tool among Kaggle competitors and data scientists in industry. It is used for supervised learning problems, where training data is used to predict a target variable.
Basic Settings and Data Import
Let’s begin by loading the required libraries and importing the data set we are going to use for this model.
#set working directory
setwd("C:/Users/awani/Desktop/50daysofAnalytics")
options(scipen = 999)
# load required libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(pscl, ggplot2, ROCR, xgboost, magrittr, Matrix, readr, stringr, caret, car)
#read data
claims = read.csv("Insurance.csv", stringsAsFactors = F)
Data Format Correction
Several categorical and ordinal columns are read in as integers, so we convert them to factors. In particular, the dependent variable needs to be a factor with levels 0 and 1.
str(claims)
## 'data.frame': 4415 obs. of 19 variables:
## $ claimid : int 351069569 806984053 654100160 653220231 226637568 46113373 397237121 917836504 277901331 16290780 ...
## $ claim_type : int 3 3 5 1 5 4 1 2 4 5 ...
## $ uninhabitable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ claim_amount : num 192.29 355.9 3.53 33.45 4.03 ...
## $ fraudulent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ claim_days : int 7662 9197 7351 3171 2487 5028 715 4923 10922 697 ...
## $ coverage : int 436 925 79 607 119 144 636 254 291 114 ...
## $ deductible : int 2000 1000 1000 1000 3000 500 1000 2000 1000 1000 ...
## $ townsize : int 1 1 1 1 3 1 2 1 1 3 ...
## $ gender : int 1 0 1 0 1 0 0 1 0 1 ...
## $ age : int 65 75 69 37 40 78 26 64 60 27 ...
## $ edcat : int 2 2 2 3 1 3 4 1 4 1 ...
## $ work_ex : int 27 26 4 9 4 39 0 20 11 3 ...
## $ retire : int 0 0 0 0 0 1 0 1 0 0 ...
## $ income : int 193 203 49 118 18 25 36 13 91 23 ...
## $ marital : int 0 0 0 1 0 1 0 1 0 1 ...
## $ residents : int 1 1 1 3 1 2 1 2 1 4 ...
## $ primary_residence: int 1 1 1 1 1 1 1 1 1 1 ...
## $ occupancy : int 30 37 31 15 8 15 5 22 40 7 ...
#change categorical or ordinal variables to factor
for (i in c(2, 3, 5, 9, 10, 12, 14, 16, 18)) {
  claims[, i] = as.factor(claims[, i])
}
#drop claimid
claims$claimid = NULL
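As a quick sanity check (not strictly required), we can confirm that the dependent variable is now a factor with levels 0 and 1:
# sanity check: fraudulent should be a factor with levels 0 and 1
class(claims$fraudulent)
levels(claims$fraudulent)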
Exploratory Data Analysis
Before we jump into any kind of model training, it is essential that we understand the data well. A quick look at univariate and bivariate plots can make us familiar with the data set.
#dependent variable
table(claims$fraudulent)
##
## 0 1
## 3952 463
# independent variable
# Fraud by claim amount and claim days
ggplot(claims, aes(x = claim_amount, y = claim_days, shape = fraudulent, color = fraudulent)) +
  geom_point() + ggtitle("Frauds by Claim Amount and Claim Days")
ggplot(claims, aes(x = claim_days, y = coverage, shape = fraudulent, color = fraudulent)) +
  geom_point() + ggtitle("Frauds by Coverage and Claim Days")
Data Preparation
Unlike logistic regression and many other classifiers, XGBoost requires the training and validation data in a special format: the predictors go into an xgb.DMatrix and the target variable is supplied separately as a label. As always, we also split the data into training and validation sets.
# convert the dependent variable back to numeric (0/1) so it can be passed as a label
claims$fraudulent = as.numeric(as.character(claims$fraudulent))
#training and Validation dataset
set.seed(123)
smp_size = floor(0.7 * nrow(claims))
train_ind = sample(seq_len(nrow(claims)), size = smp_size)
train = claims[train_ind, ]
val = claims[-train_ind, ]
#prepare training data
trainm = sparse.model.matrix(fraudulent ~ ., data = train)
train_label = train[,"fraudulent"]
train_matrix = xgb.DMatrix(data = trainm, label = train_label)
#prepare validation data
valm = sparse.model.matrix(fraudulent ~ ., data = val)
val_label = val[,"fraudulent"]
val_matrix = xgb.DMatrix(data = valm, label = val_label)
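As a quick check (not required for training), we can inspect the design matrix. sparse.model.matrix one-hot encodes the factor columns and adds an intercept, so the encoded matrix has more columns than the original data frame:
# inspect the encoded design matrix
dim(trainm)
head(colnames(trainm))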
Training XGBoost Model
There are many parameters in an XGBoost model that can be used to fine-tune it. Here we set the objective to binary logistic and a max depth of 3 to limit over-fitting. Finally, we run 1000 boosting iterations to train the model.
#parameters
xgb_params = list(objective = "binary:logistic",
eval_metric = "error",
max_depth = 3,
eta = 0.01,
gamma = 1,
colsample_bytree = 0.5,
min_child_weight = 1)
#model
bst_model = xgb.train(params = xgb_params, data = train_matrix,
nrounds = 1000)
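Rather than fixing nrounds at 1000, we could also let cross-validation pick the number of boosting rounds. Below is a minimal sketch using xgb.cv with early stopping; the fold count and stopping patience are illustrative choices, not tuned values.
# 5-fold cross-validation with early stopping to estimate a good nrounds
cv_model = xgb.cv(params = xgb_params, data = train_matrix,
                  nrounds = 1000, nfold = 5,
                  early_stopping_rounds = 50, verbose = 0)
cv_model$best_iteration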
Prediction and Model Evaluation
The XGBoost importance plot is a quick way to visualize the importance of the independent variables, and it can then be used to trim the feature set of our model. When we compare accuracy, sensitivity and specificity, the model performance is better than what we had with logistic regression.
#feature importance
imp = xgb.importance(colnames(train_matrix), model = bst_model)
xgb.plot.importance(imp)
#prediction & confusion matrix
p = predict(bst_model, newdata = val_matrix)
val$predicted = ifelse(p > 0.15,1,0)
confusionMatrix(table(val$predicted, val$fraudulent))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 1033 117
## 1 150 25
##
## Accuracy : 0.7985
## 95% CI : (0.7759, 0.8198)
## No Information Rate : 0.8928
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.0447
## Mcnemar's Test P-Value : 0.05019
##
## Sensitivity : 0.8732
## Specificity : 0.1761
## Pos Pred Value : 0.8983
## Neg Pred Value : 0.1429
## Prevalence : 0.8928
## Detection Rate : 0.7796
## Detection Prevalence : 0.8679
## Balanced Accuracy : 0.5246
##
## 'Positive' Class : 0
##
# Evaluation curve: accuracy vs. cutoff
pred = prediction(p, val$fraudulent)
eval = performance(pred, "acc")
plot(eval)
#ROC
roc = performance(pred, "tpr", "fpr")
plot(roc, main = "ROC curve")
abline(a = 0, b = 1)
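The same ROCR prediction object can also give us the area under the ROC curve and the cutoff that maximises accuracy on the validation set, which is a more systematic way to arrive at a threshold than the 0.15 used above (though with an imbalanced target like this one, maximising overall accuracy is not necessarily the best criterion). A small sketch:
# area under the ROC curve
auc = performance(pred, "auc")
auc@y.values[[1]]
# cutoff that maximises validation accuracy
best_idx = which.max(eval@y.values[[1]])
eval@x.values[[1]][best_idx]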