LightGBM is a fast, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks. Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

 

Basic Settings and Data Import

Let’s begin by loading the required libraries and importing the data set we are going to use for this model.

#set working directory
setwd("C:/Users/awani/Desktop/50daysofAnalytics")
options(scipen = 999)

# load required libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(pscl, ggplot2, ROCR, lightgbm, methods, Matrix, caret)


#read data
claims = read.csv("Insurance.csv", stringsAsFactors = F)

 

Data Format Correction

The categorical and ordinal variables, including the dependent variable fraudulent, need to be converted to factors, and the claim identifier can be dropped since it carries no predictive information.

str(claims)
## 'data.frame':    4415 obs. of  19 variables:
##  $ claimid          : int  351069569 806984053 654100160 653220231 226637568 46113373 397237121 917836504 277901331 16290780 ...
##  $ claim_type       : int  3 3 5 1 5 4 1 2 4 5 ...
##  $ uninhabitable    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ claim_amount     : num  192.29 355.9 3.53 33.45 4.03 ...
##  $ fraudulent       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ claim_days       : int  7662 9197 7351 3171 2487 5028 715 4923 10922 697 ...
##  $ coverage         : int  436 925 79 607 119 144 636 254 291 114 ...
##  $ deductible       : int  2000 1000 1000 1000 3000 500 1000 2000 1000 1000 ...
##  $ townsize         : int  1 1 1 1 3 1 2 1 1 3 ...
##  $ gender           : int  1 0 1 0 1 0 0 1 0 1 ...
##  $ age              : int  65 75 69 37 40 78 26 64 60 27 ...
##  $ edcat            : int  2 2 2 3 1 3 4 1 4 1 ...
##  $ work_ex          : int  27 26 4 9 4 39 0 20 11 3 ...
##  $ retire           : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ income           : int  193 203 49 118 18 25 36 13 91 23 ...
##  $ marital          : int  0 0 0 1 0 1 0 1 0 1 ...
##  $ residents        : int  1 1 1 3 1 2 1 2 1 4 ...
##  $ primary_residence: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ occupancy        : int  30 37 31 15 8 15 5 22 40 7 ...
#change categorical or ordinal variables to factor
for ( i in c(2,3,5,9,10,12,14,16,18))
{
  claims[,i]= as.factor(claims[,i])
}

#drop claimid
claims$claimid = NULL

 

Exploratory Data Analysis

Before we jump into any kind of model training, it is essential that we understand the data well. A quick look at uni-variate and bi-variate plots can make us familiar with the data set.

#dependent variable
table(claims$fraudulent)
## 
##    0    1 
## 3952  463
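# class proportions: the target is imbalanced (about 10% of claims are fraudulent),
# which is worth keeping in mind when choosing a probability cut-off later
prop.table(table(claims$fraudulent))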
# independent variable

# Fraud by claim amount and claim days
ggplot(claims, aes(x = claim_amount, y = claim_days, shape = fraudulent, color = fraudulent)) +
  geom_point() +ggtitle("Frauds by Claim Amount and Claim Days")

ggplot(claims, aes(x = claim_days, y = coverage, shape = fraudulent, color = fraudulent)) +
  geom_point() + ggtitle("Frauds by Coverage and Claim Days")

 

Data Preparation

As always, we will split the data into training and validation sets before moving forward.

# convert the dependent variable back to numeric, as LightGBM expects a numeric label
claims$fraudulent = as.numeric(as.character(claims$fraudulent))

#training and Validation dataset
set.seed(123)
smp_size = floor(0.7 * nrow(claims))
train_ind = sample(seq_len(nrow(claims)), size = smp_size)

train = claims[train_ind, ]
val = claims[-train_ind, ]

 

Unlike logistic regression or other classifiers, LightGBM requires the training and validation data to be prepared in a specific format: the data must be wrapped in an lgb.Dataset object with the target variable supplied as the label.

#prepare training and validation data
trainm = sparse.model.matrix(fraudulent ~ ., data = train)
train_label = train[, "fraudulent"]

valm = sparse.model.matrix(fraudulent ~ ., data = val)
val_label = val[, "fraudulent"]

train_matrix = lgb.Dataset(data = as.matrix(trainm), label = train_label)
val_matrix = lgb.Dataset(data = as.matrix(valm), label = val_label)
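
Note that lgb.Dataset can also take the sparse dgCMatrix returned by sparse.model.matrix directly, so the as.matrix() conversion above is optional; a minimal alternative sketch:

# alternative: keep the model matrices sparse instead of densifying them
train_matrix = lgb.Dataset(data = trainm, label = train_label)
val_matrix = lgb.Dataset(data = valm, label = val_label)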

 

Training LightGBM Model

There are many parameters in the LightGBM model that can be used to fine-tune it. Here we set the objective to binary, keep the max bin size at 5 to limit over-fitting, and use binary log loss as the fit metric. Finally, we run 1000 boosting rounds to find the best model.

valid = list(test = val_matrix)

# model parameters
params = list(max_bin = 5,
               learning_rate = 0.001,
               objective = "binary",
               metric = 'binary_logloss')

#model training
bst = lgb.train(params = params, data = train_matrix, nrounds = 1000, valids = valid)
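
Once the booster is trained, it can be useful to see which variables drive the predictions. A minimal sketch using the package's importance helpers (top_n = 10 is just an illustrative choice):

# variable importance from the trained booster
imp = lgb.importance(bst, percentage = TRUE)
lgb.plot.importance(imp, top_n = 10, measure = "Gain")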

 

Prediction and Model Evaluation

LightGBM is certainly faster than XGBoost and slightly better in terms of fit, although results vary case by case. It also gives a better fit than logistic regression when we compare accuracy, sensitivity and specificity.

#prediction & confusion matrix
p = predict(bst, valm)
val$predicted = ifelse(p > 0.3,1,0)
confusionMatrix(factor(val$predicted), factor(val$fraudulent))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1098  125
##          1   85   17
##                                           
##                Accuracy : 0.8415          
##                  95% CI : (0.8207, 0.8608)
##     No Information Rate : 0.8928          
##     P-Value [Acc > NIR] : 1.000000        
##                                           
##                   Kappa : 0.0546          
##  Mcnemar's Test P-Value : 0.007118        
##                                           
##             Sensitivity : 0.9281          
##             Specificity : 0.1197          
##          Pos Pred Value : 0.8978          
##          Neg Pred Value : 0.1667          
##              Prevalence : 0.8928          
##          Detection Rate : 0.8287          
##    Detection Prevalence : 0.9230          
##       Balanced Accuracy : 0.5239          
##                                           
##        'Positive' Class : 0               
## 
# Evaluation curve: accuracy across probability cut-offs
pred = prediction(p, val$fraudulent)
eval = performance(pred, "acc")
plot(eval)

# ROC curve
roc = performance(pred, "tpr", "fpr")
plot(roc, main = "ROC curve")
abline(a = 0, b = 1)
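
To summarize the ROC curve in a single number, the AUC can be pulled from the same ROCR prediction object; a minimal sketch:

# area under the ROC curve
auc = performance(pred, "auc")
auc@y.values[[1]]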