Background

Our client is an insurance company that has been providing health insurance to its customers. It now needs help building a model to predict whether policyholders (customers) from the past year will also be interested in the vehicle insurance the company provides.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000, so that if, God forbid, you fall ill and need to be hospitalised that year, the insurance provider will bear the cost of hospitalisation up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000, that is where probability comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would actually be hospitalised that year. This way everyone shares everyone else's risk.
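To make that arithmetic concrete, here is a toy calculation of the pooled premiums against the cover amount (illustrative numbers taken from the example above, not real data):

# toy risk-pooling arithmetic from the example above
premium   <- 5000     # annual premium per customer (Rs.)
cover     <- 200000   # hospitalisation cover per customer (Rs.)
customers <- 100      # customers in the pool

pool <- customers * premium   # Rs. 500,000 collected per year
pool / cover                  # the pool can fund 2.5 full claims per year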

Just like medical insurance, with vehicle insurance the customer pays a premium of a certain amount to the insurance provider every year, so that in case of an unfortunate accident involving the vehicle, the provider will pay compensation (called the ‘sum assured’) to the customer.

Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company, because it can then plan its communication strategy accordingly, reach out to those customers, and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code), the vehicle (vehicle age, damage) and the policy (premium, sourcing channel).

Variables

id : Unique ID for the customer
Gender : Gender of the customer
Age : Age of the customer
Driving_License : 0 : Customer does not have DL, 1 : Customer already has DL
Region_Code : Unique code for the region of the customer
Previously_Insured : 1 : Customer already has Vehicle Insurance, 0 : Customer doesn’t have Vehicle Insurance
Vehicle_Age : Age of the Vehicle
Vehicle_Damage : Yes : Customer’s vehicle was damaged in the past. No : Customer’s vehicle was not damaged in the past.
Annual_Premium : The amount the customer needs to pay as premium in the year
Policy_Sales_Channel : Anonymised code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
Vintage : Number of days the customer has been associated with the company
Response : 1 : Customer is interested, 0 : Customer is not interested

Load Libraries

library(caret)     # data splitting, upsampling, and model training
library(e1071)     # Naive Bayes
library(tidyverse) # data wrangling and plotting
library(dplyr)     # attached by tidyverse as well; loaded explicitly here
library(ROCR)      # ROC curves and AUC
library(partykit)  # conditional inference trees (ctree)

Load Dataset

data<-read.csv("data/train.csv", stringsAsFactors = T)

head(data)
##   id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age
## 1  1   Male  44               1          28                  0   > 2 Years
## 2  2   Male  76               1           3                  0    1-2 Year
## 3  3   Male  47               1          28                  0   > 2 Years
## 4  4   Male  21               1          11                  1    < 1 Year
## 5  5 Female  29               1          41                  1    < 1 Year
## 6  6 Female  24               1          33                  0    < 1 Year
##   Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
## 1            Yes          40454                   26     217        1
## 2             No          33536                   26     183        0
## 3            Yes          38294                   26      27        1
## 4             No          28619                  152     203        0
## 5             No          27496                  152      39        0
## 6            Yes           2630                  160     176        0

Data Wrangling

Get Data

Checking the distribution of Region_Code and Policy_Sales_Channel:

data%>%
  group_by(Region_Code)%>%
  count()%>%
  ungroup()%>%
  arrange(desc(n))
## # A tibble: 53 x 2
##    Region_Code      n
##          <dbl>  <int>
##  1          28 106415
##  2           8  33877
##  3          46  19749
##  4          41  18263
##  5          15  13308
##  6          30  12191
##  7          29  11065
##  8          50  10243
##  9           3   9251
## 10          11   9232
## # ... with 43 more rows
data%>%
  group_by(Policy_Sales_Channel)%>%
  count()%>%
  ungroup()%>%
  arrange(desc(n))
## # A tibble: 155 x 2
##    Policy_Sales_Channel      n
##                   <dbl>  <int>
##  1                  152 134784
##  2                   26  79700
##  3                  124  73995
##  4                  160  21779
##  5                  156  10661
##  6                  122   9930
##  7                  157   6684
##  8                  154   5993
##  9                  151   3885
## 10                  163   2893
## # ... with 145 more rows

I decided to take Region_Code 28 and Policy_Sales_Channel 152 as the sample for our data, as each of these levels on its own still covers more than 10% of the total data we have.
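As a quick sanity check (a small sketch run before the filter below), the share of rows covered by each of these two levels can be computed directly:

# share of all rows in Region_Code 28 and in Policy_Sales_Channel 152
mean(data$Region_Code == 28)
mean(data$Policy_Sales_Channel == 152)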

data<-data%>%
  filter(Region_Code==28, Policy_Sales_Channel==152)

NA checker

colSums(is.na(data))
##                   id               Gender                  Age 
##                    0                    0                    0 
##      Driving_License          Region_Code   Previously_Insured 
##                    0                    0                    0 
##          Vehicle_Age       Vehicle_Damage       Annual_Premium 
##                    0                    0                    0 
## Policy_Sales_Channel              Vintage             Response 
##                    0                    0                    0

Data Types

data<-data%>%
  mutate_at(vars(Response, Previously_Insured,Driving_License,Region_Code,Policy_Sales_Channel), as.factor)
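mutate_at() still works, but it has been superseded in more recent dplyr releases; an equivalent form using across() (assuming dplyr >= 1.0) would be:

# same conversion written with across() instead of mutate_at()
data <- data %>%
  mutate(across(c(Response, Previously_Insured, Driving_License,
                  Region_Code, Policy_Sales_Channel), as.factor))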

Drop Unnecessary Variables

insurance<-data%>%
  select(-c(id,Region_Code,Policy_Sales_Channel))

Balance Check

insurance$Response%>%table()%>%prop.table()
## .
##          0          1 
## 0.95800219 0.04199781

Summary

summary(insurance)
##     Gender          Age        Driving_License Previously_Insured
##  Female:7390   Min.   :20.00   0:    1         0: 3499           
##  Male  :6325   1st Qu.:24.00   1:13714         1:10216           
##                Median :26.00                                     
##                Mean   :27.11                                     
##                3rd Qu.:28.00                                     
##                Max.   :80.00                                     
##     Vehicle_Age    Vehicle_Damage Annual_Premium      Vintage      Response 
##  < 1 Year :12691   No :10458      Min.   :  2630   Min.   : 10.0   0:13139  
##  > 2 Years:    2   Yes: 3257      1st Qu.: 31411   1st Qu.: 83.0   1:  576  
##  1-2 Year : 1022                  Median : 39271   Median :156.0            
##                                   Mean   : 41217   Mean   :155.1            
##                                   3rd Qu.: 48528   3rd Qu.:228.0            
##                                   Max.   :495106   Max.   :299.0

Pre-Processing

Split Dataset

set.seed(156)
cut<- sample(seq_len(nrow(insurance)), size = nrow(insurance)*0.75)

train <- insurance[cut, ]
test <- insurance[-cut, ]
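The split above is a simple random sample. Since only about 4% of customers respond, a stratified split that preserves the class ratio in both sets could also be used, for example with caret::createDataPartition (a sketch, not the split used for the results below):

# stratified 75/25 split on the Response factor
set.seed(156)
idx <- createDataPartition(insurance$Response, p = 0.75, list = FALSE)
train_strat <- insurance[idx, ]
test_strat  <- insurance[-idx, ]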

Upsample Train

set.seed(156)
train_up <- upSample(x = train %>% select(-Response),
                     y = train$Response,
                     list = F, yname = "Response"
                     )

train_up$Response%>%table()
## .
##    0    1 
## 9860 9860

Model

Naive Bayes

State Model

# column 9 of train_up is Response; laplace = 1 applies add-one smoothing
mod_naive <- naiveBayes(train_up[-9], train_up$Response, laplace = 1)

Predict Model

pred_naive <- predict(mod_naive, test)  # predicted class labels
pred <- as.integer(pred_naive)          # factor levels coerced to integers (1/2)
real <- as.integer(test$Response)       # actual labels, same coding
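Note that pred holds hard class labels, so the ROC curve below is effectively built from a single cutoff. A smoother curve could be obtained from the posterior probabilities via type = "raw" (a sketch; this is not what produced the numbers reported below):

# posterior probability of class "1" from the naive Bayes model
pred_naive_prob <- predict(mod_naive, test, type = "raw")[, "1"]
roc_nb_prob <- prediction(pred_naive_prob, real)
performance(roc_nb_prob, measure = "auc")@y.values[[1]]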

ROC

# prediction
roc_b <- prediction(pred, real)

# performance
roc_vec_b <- performance(roc_b, "tpr", "fpr")

# plot the ROC curve
plot(roc_vec_b, main = "ROC curve - Naive Bayes",col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

AUC Rate

auc1 <- performance(roc_b, measure = "auc")
unlist(auc1@y.values)
## [1] 0.8936932

Decision Tree

State Model

# mincriterion = 0.99: a split is made only when 1 - p-value exceeds 0.99
# maxsurrogate = 3: keep up to 3 surrogate splits per node
ctrl <- ctree_control(mincriterion = 0.99, maxsurrogate = 3)
mod_tree <- ctree(Response ~ ., data = train_up, control = ctrl)

Plot Model

Should we visualise the full decision tree, comparing all of the predictors to the target variable? If you ask me, not really, unless it is mandatory. But why?

Let’s look at the graph below:
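The plot itself comes from partykit’s plot method for ctree objects; a call along these lines (the original chunk is not shown) produces it:

# plot the fitted conditional inference tree
plot(mod_tree)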

As you can see, this is the tree fitted on all predictors. From it we can conclude that customers in region 28 reached through channel ‘152’ are, on the whole, unlikely to respond to the vehicle insurance campaign.

Customers who are likely to take the vehicle insurance mostly:
1. Have been using the insurance company’s products for more than 20 days
2. Pay an annual premium of less than Rs. 56,297 per year
3. Are more than 21 years old or less than 35 years old

Also, customers who do not yet have vehicle insurance have a higher chance of taking the company’s vehicle insurance than customers who have purchased this kind of insurance before, unless they have been in touch with the company for only 26-28 days.

Predict

tree <- predict(mod_tree, test, type = "prob")  # class probabilities
pred_tree <- tree[ , 2]                         # probability of class "1"

ROC

roc_t <- prediction(pred_tree, test$Response)

roc_vec_t <- performance(roc_t,"tpr","fpr")

# plot the ROC curve
plot(roc_vec_t, main = "ROC curve - Decision Tree",col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 1, lty = 1)

AUC Rate

auc2 <- performance(roc_t, measure = "auc")@y.values[[1]]

auc2
## [1] 0.8731778

Random Forest

State Model

Cross-Validation (Repeated K-Fold)

# The cross-validated training below is slow, so it was run once,
# saved to disk with saveRDS(), and is reloaded here.
# set.seed(156)
# 
# ctrl <- trainControl(method = "repeatedcv", number = 5,  repeats = 3) 
# rfores <- train(Response ~ ., data = train_up, method = "rf", trControl = ctrl)
# rfores

# saveRDS(rfores, file = "rfores.rds")
rfores <- readRDS("rfores.rds")
varImp(rfores)
## rf variable importance
## 
##                       Overall
## Previously_Insured1  100.0000
## Annual_Premium        24.3374
## Vintage               23.5082
## Age                    8.9656
## GenderMale             1.9407
## Vehicle_DamageYes      1.8302
## Vehicle_Age1-2 Year    0.2050
## Driving_License1       0.0142
## Vehicle_Age> 2 Years   0.0000
rfores$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 1.21%
## Confusion matrix:
##      0    1 class.error
## 0 9622  238  0.02413793
## 1    0 9860  0.00000000

Plot

# OOB and per-class error rates versus the number of trees
plot(rfores$finalModel)
legend("topleft", colnames(rfores$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)

Predict

pred_forest <- predict(rfores, test, type = "prob")  # class probabilities
pred2 <- pred_forest$`1`                             # probability of class "1"
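In addition to the ROC analysis below, the class-level performance on the test set could be summarised with caret's confusion matrix (a sketch; its output is not part of the original results):

# hard class predictions and confusion matrix on the test set
confusionMatrix(predict(rfores, test), test$Response, positive = "1")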

ROC

# prediction
roc_f <- prediction(pred2, real)

# performance
roc_vec_f <- performance(roc_f, "tpr", "fpr")

# plot the ROC curve
plot(roc_vec_f, main = "ROC curve - Random Forest",col = "red", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)

AUC

auc3 <- performance(roc_f, measure = "auc")@y.values[[1]]

auc3
## [1] 0.8932957

Model Comparisons
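
To put the three models side by side, the test-set AUC values computed above can be collected into one small table (using the objects already in the workspace; auc1 is a ROCR performance object, while auc2 and auc3 are plain numbers):

# test-set AUC per model
data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  AUC   = c(unlist(auc1@y.values), auc2, auc3)
)

From the outputs above, Naive Bayes (0.894) and the random forest (0.893) are almost tied, with the single decision tree slightly behind (0.873).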

Conclusion

All three models reach a broadly similar AUC on the held-out test set (roughly 0.87-0.89), with Naive Bayes and the random forest slightly ahead of the single decision tree. The random forest's variable importance also shows that Previously_Insured is by far the strongest predictor, followed by Annual_Premium and Vintage, which is consistent with the decision tree's finding that customers who do not already hold vehicle insurance are the most promising targets for the campaign.
