Business Question

Perusahaan asuransi yang telah memberikan Asuransi Kesehatan kepada pelanggannya kini membutuhkan bantuan Anda dalam membangun model untuk memprediksi apakah pemegang polis (pelanggan) tahun lalu juga akan tertarik dengan Asuransi Kendaraan yang disediakan oleh perusahaan. Membangun model untuk memprediksi apakah pelanggan akan tertarik dengan Asuransi Kendaraan sangat membantu perusahaan karena kemudian dapat merencanakan strategi komunikasinya untuk menjangkau pelanggan tersebut dan mengoptimalkan model bisnis dan pendapatannya.

Let’s get started!

Import Library

library(dplyr)
library(splitstackshape)
library(caret)
library(e1071)
library(rpart)
library(rattle)
library(rpart.plot)
library(partykit)
library(randomForest)
library(ROCR)

Read Data

Source: Health Insurance Cross Sell Prediction Dataset - Anmol Kumar

asuransi <- read.csv("train.csv", stringsAsFactors = T)
str(asuransi)

## 'data.frame':    381109 obs. of  12 variables:
##  $ id                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender              : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 2 1 1 1 ...
##  $ Age                 : int  44 76 47 21 29 24 23 56 24 32 ...
##  $ Driving_License     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Region_Code         : num  28 3 28 11 41 33 11 28 3 6 ...
##  $ Previously_Insured  : int  0 0 0 1 1 0 0 0 1 1 ...
##  $ Vehicle_Age         : Factor w/ 3 levels "< 1 Year","> 2 Years",..: 2 3 2 1 1 1 1 3 1 1 ...
##  $ Vehicle_Damage      : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 2 2 1 1 ...
##  $ Annual_Premium      : num  40454 33536 38294 28619 27496 ...
##  $ Policy_Sales_Channel: num  26 26 26 152 152 160 152 26 152 152 ...
##  $ Vintage             : int  217 183 27 203 39 176 249 72 28 80 ...
##  $ Response            : int  1 0 1 0 0 0 0 1 0 0 ...

Deskripsi Data

Variable	Definition
id	Unique ID for the customer
Gender	Gender of the customer
Age	Age of the customer
Driving_License	0 - Customer does not have DL; 1 - Customer already has DL
Region_Code	Unique code for the region of the customer
Previously_Insured	1 - Customer already has Vehicle Insurance, 0 - Customer doesn’t have Vehicle Insurance
Vehicle_Age	Age of the Vehicle
Vehicle_Damage	1 - Customer got his/her vehicle damaged in the past. 0 - Customer didn’t get his/her vehicle damaged in the past.
Annual_Premium	The amount customer needs to pay as premium in the year
PolicySalesChannel	Anonymized Code for the channel of outreaching to the customer ie. Different Agents, Over Mail, Over Phone, In Person, etc.
Vintage	Number of Days, Customer has been associated with the company
Response	1 - Customer is interested; 0 - Customer is not interested

Data Wrangling

asuransi <- asuransi %>%
  select(-id) %>% 
    mutate(Driving_License = as.factor (Driving_License),
         Previously_Insured = as.factor(Previously_Insured),
         Vehicle_Age = as.factor(Vehicle_Age),
         Vehicle_Damage = as.factor(Vehicle_Damage),
         Response = as.factor(Response))

glimpse(asuransi)

## Rows: 381,109
## Columns: 11
## $ Gender               <fct> Male, Male, Male, Male, Female, Female, Male, ...
## $ Age                  <int> 44, 76, 47, 21, 29, 24, 23, 56, 24, 32, 47, 24...
## $ Driving_License      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Region_Code          <dbl> 28, 3, 28, 11, 41, 33, 11, 28, 3, 6, 35, 50, 1...
## $ Previously_Insured   <fct> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0...
## $ Vehicle_Age          <fct> > 2 Years, 1-2 Year, > 2 Years, < 1 Year, < 1 ...
## $ Vehicle_Damage       <fct> Yes, No, Yes, No, No, Yes, Yes, Yes, No, No, Y...
## $ Annual_Premium       <dbl> 40454, 33536, 38294, 28619, 27496, 2630, 23367...
## $ Policy_Sales_Channel <dbl> 26, 26, 26, 152, 152, 160, 152, 26, 152, 152, ...
## $ Vintage              <int> 217, 183, 27, 203, 39, 176, 249, 72, 28, 80, 4...
## $ Response             <fct> 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1...

EDA

Pertama mari kita lihat apakah ada missing value.

colSums(is.na(asuransi))

##               Gender                  Age      Driving_License 
##                    0                    0                    0 
##          Region_Code   Previously_Insured          Vehicle_Age 
##                    0                    0                    0 
##       Vehicle_Damage       Annual_Premium Policy_Sales_Channel 
##                    0                    0                    0 
##              Vintage             Response 
##                    0                    0

dari data di atas bisa dilihat bahwa tidak ada missing value, jadi kita tidark perlu melakukan EDA lebih lenjut.

Cross-Validation

Karena data yang tersedia terlalu banyak yaitu 381,109 baris, maka saya melakukan stratified random sampling untuk mengambil 20% data menggunakan fungsi stratified dari library splitstackshape.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)
loan_sample <- stratified(indt = asuransi,
                          group = c("Gender", "Driving_License", "Previously_Insured","Vehicle_Age", "Vehicle_Damage",
                                    "Response"),
                          size = 0.2,
                          bothSets = TRUE)

data_asuransi <- loan_sample$SAMP1

str(data_asuransi)

## Classes 'data.table' and 'data.frame':   76220 obs. of  11 variables:
##  $ Gender              : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age                 : int  45 61 45 43 40 67 44 62 58 62 ...
##  $ Driving_License     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Region_Code         : num  28 28 28 28 8 28 28 28 28 4 ...
##  $ Previously_Insured  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Vehicle_Age         : Factor w/ 3 levels "< 1 Year","> 2 Years",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Vehicle_Damage      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Annual_Premium      : num  69687 43637 60600 34998 42523 ...
##  $ Policy_Sales_Channel: num  26 122 26 125 124 25 124 55 124 124 ...
##  $ Vintage             : int  40 228 82 271 21 128 225 229 163 113 ...
##  $ Response            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>

lalu dari 20% didapati 76.220 baris observasi, setelah itu saya baru membaginya menjadi 80% data_train dan 20% data_test dengan fungsi sample

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index <- sample(nrow(data_asuransi), nrow(data_asuransi)*0.8)
data_train <- data_asuransi[index, ]
data_test <- data_asuransi[-index, ]

# check class proportion
prop.table(table(data_train$Response))

## 
##         0         1 
## 0.8776896 0.1223104

dari proportion table di atas bisa kita lihat bahwa data cenderung imbalance (87.8% data not interested dan 12.2% data interested), maka kita bisa coba downsample sebagai salah satu bentuk model tuning.

data_train <- downSample(x = data_train %>% select(-Response), # data prediktor
                         y = data_train$Response, # data label
                         yname = "Response") # nama kolom label

setelah proses downsample mari kita lihat proporsi datanya.

prop.table(table(data_train$Response))

## 
##   0   1 
## 0.5 0.5

sekarang kita sudah mendapatkan proporsi data yang seimbang dan dataset siap digunakan.

Pembuatan Model

Langkah selanjutnya adalah merancang model klasifikasi menggunakan algoritma yang berbeda dan membandingkan akurasi model dari seluruh model yang telah dibuat yaitu algoritma Naive Bayes, Decision Tree dan RandomForest.

Naive Bayes

Pertama kita buat terlebih dahulu model Naive Bayes

model_naive <- naiveBayes(x = data_train %>% select(-Response),
                          y = data_train$Response,
                          laplace = 1)

Prediction

Prediksi kelas target ke dataset data_test

naive_class <- predict(model_naive, data_test, type = "class")

Evaluation

Pertama mari kita lihat performa model pada data training

confusionMatrix(naive_class, data_test$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9293  242
##          1 4066 1643
##                                           
##                Accuracy : 0.7174          
##                  95% CI : (0.7102, 0.7245)
##     No Information Rate : 0.8763          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3031          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8716          
##             Specificity : 0.6956          
##          Pos Pred Value : 0.2878          
##          Neg Pred Value : 0.9746          
##              Prevalence : 0.1237          
##          Detection Rate : 0.1078          
##    Detection Prevalence : 0.3745          
##       Balanced Accuracy : 0.7836          
##                                           
##        'Positive' Class : 1               
##

Dari Confusion Matrix model Naive Bayes yang diaplikasikan ke data test di atas bisa dilihat bahwa akurasi model Naive Bayes terhadap data_test adalah 71.74% dan nilai recallnya 87,16%. Meskipun begitu model ini sangat buruk untuk memprediksi positif valuenya.

Decision Tree

Selanjutnya mari kita uji dataset menggunakan Klasifikasi Decision Tree.

Model

# model building
dtree <- rpart(Response ~ ., data = data_train, cp = .01)

fancyRpartPlot(dtree, sub = NULL)

asuransi_tree <- ctree(formula = Response ~ .,
                       data = data_train,
                       control = ctree_control(mincriterion=0.999, minsplit=30, minbucket=2))

# pembuatan plot tipe simple
plot(asuransi_tree, type = "simple")

Prediction

Prediksi kelas ke data_test dan simpan ke objek dtree_class:

dtree_class <- predict(asuransi_tree, data_test, type = "response")

Evaluation

confusion matrix pada data test

confusionMatrix(dtree_class, data_test$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8231   54
##          1 5128 1831
##                                           
##                Accuracy : 0.6601          
##                  95% CI : (0.6525, 0.6676)
##     No Information Rate : 0.8763          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2725          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9714          
##             Specificity : 0.6161          
##          Pos Pred Value : 0.2631          
##          Neg Pred Value : 0.9935          
##              Prevalence : 0.1237          
##          Detection Rate : 0.1201          
##    Detection Prevalence : 0.4565          
##       Balanced Accuracy : 0.7937          
##                                           
##        'Positive' Class : 1               
##

Dari hasil yang didapatkan, performa model tidak cukup baik. Mari kita cek apakah hal tersebut karena model mengalami overfitting:

Prediksi kelas ke data_train dan simpan ke objek dtree_class:

# prediksi `kelas` data train
dtree_train_class <-  predict(object = asuransi_tree, newdata = data_train %>% select(-Response),
                              type = "response")

# confusion matrix data train
confusionMatrix(dtree_train_class, data_train$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4580  246
##          1 2878 7212
##                                           
##                Accuracy : 0.7906          
##                  95% CI : (0.7839, 0.7971)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5811          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9670          
##             Specificity : 0.6141          
##          Pos Pred Value : 0.7148          
##          Neg Pred Value : 0.9490          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4835          
##    Detection Prevalence : 0.6765          
##       Balanced Accuracy : 0.7906          
##                                           
##        'Positive' Class : 1               
##

Dari Confusion Matrix di atas bisa diasumsikan bahwa akurasi model Decision Tree di atas mengalami overfitting, hal ini bisa dilihat dari akurasi

Dari Confusion Matrix di atas bisa dilihat bahwa akurasi model Naive Bayes terhadap data_test adalah 71.74% dan nilai recallnya 87,16%

Random Forest

Model

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

ctrl <- trainControl(method="repeatedcv", number = 2, repeats = 3)

model_forest <- train(Response ~ ., data = data_train, method = "rf", trControl = ctrl)

model_forest

## Random Forest 
## 
## 14916 samples
##    10 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 3 times) 
## Summary of sample sizes: 7458, 7458, 7458, 7458, 7458, 7458, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7921918  0.5843837
##    6    0.7802807  0.5605614
##   11    0.7767051  0.5534102
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Dari hasil RF itu yang paling bagus mtry paling bagus adalah mtry = 2 (2 variabel), dalam crossvalidation ini menggunakan 2 variabel adalah yang terbaik dari keseluruhan mtry yang dicoba. Dari sini kita mengetahui bahwa mtry adalah banyak variabel yang digunakan pada pembentukan model dan secara sistem default kita juga bisa melihat bahwa Random Forest mencoba berbagai macam nilai mtry.

Mari lihat Out of Bag Error yang dihasilkan dari model random forest:

model_forest$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.86%
## Confusion matrix:
##      0    1 class.error
## 0 4911 2547  0.34151247
## 1  564 6894  0.07562349

OOB yang dihasilkan mengindikasikan bahwa model akan memiliki akurasi sebesar 79.27% di data baru.

Kita juga bisa melihat variable apa saja yang penting dalam pembuatan random forest dengan menggunakan varImp().

varImp(model_forest)

## rf variable importance
## 
##                      Overall
## Vehicle_DamageYes    100.000
## Previously_Insured1   85.804
## Age                   40.234
## Policy_Sales_Channel  23.208
## Vintage               13.685
## Annual_Premium        13.080
## Region_Code           10.876
## Vehicle_Age1-2 Year   10.840
## Vehicle_Age> 2 Years   2.476
## GenderMale             1.604
## Driving_License1       0.000

Dari hasil di atas, dapat terlihat bahwa Vehicle_DamageYes memiliki pengaruh paling tinggi terhadap hasil.

Prediction

Lakukan prediksi kelas ke data_test dan simpan ke objek forest_class:

forest_class <- predict(model_forest, data_test, type = "raw")

confusionMatrix(forest_class, data_test$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8903  128
##          1 4456 1757
##                                           
##                Accuracy : 0.6993          
##                  95% CI : (0.6919, 0.7066)
##     No Information Rate : 0.8763          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3014          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9321          
##             Specificity : 0.6664          
##          Pos Pred Value : 0.2828          
##          Neg Pred Value : 0.9858          
##              Prevalence : 0.1237          
##          Detection Rate : 0.1153          
##    Detection Prevalence : 0.4076          
##       Balanced Accuracy : 0.7993          
##                                           
##        'Positive' Class : 1               
##

Terlihat, nilai akurasi hanya di angka 69.94% dan Recall yang tinggi di angka 93,16%.

Overall Evaluation

Confusion Matrix

Untuk melihat model mana yang memiiliki performa yang lebih baik, pertama mari kita bandingkan Akurasi dari Confusion Matrix tiap Model.

Naive Bayes

confusionMatrix(naive_class, data_test$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9293  242
##          1 4066 1643
##                                           
##                Accuracy : 0.7174          
##                  95% CI : (0.7102, 0.7245)
##     No Information Rate : 0.8763          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3031          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8716          
##             Specificity : 0.6956          
##          Pos Pred Value : 0.2878          
##          Neg Pred Value : 0.9746          
##              Prevalence : 0.1237          
##          Detection Rate : 0.1078          
##    Detection Prevalence : 0.3745          
##       Balanced Accuracy : 0.7836          
##                                           
##        'Positive' Class : 1               
##

Decision Tree

confusionMatrix(dtree_class, data_test$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8231   54
##          1 5128 1831
##                                           
##                Accuracy : 0.6601          
##                  95% CI : (0.6525, 0.6676)
##     No Information Rate : 0.8763          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2725          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9714          
##             Specificity : 0.6161          
##          Pos Pred Value : 0.2631          
##          Neg Pred Value : 0.9935          
##              Prevalence : 0.1237          
##          Detection Rate : 0.1201          
##    Detection Prevalence : 0.4565          
##       Balanced Accuracy : 0.7937          
##                                           
##        'Positive' Class : 1               
##

Random Forest

confusionMatrix(forest_class, data_test$Response, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8903  128
##          1 4456 1757
##                                           
##                Accuracy : 0.6993          
##                  95% CI : (0.6919, 0.7066)
##     No Information Rate : 0.8763          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3014          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9321          
##             Specificity : 0.6664          
##          Pos Pred Value : 0.2828          
##          Neg Pred Value : 0.9858          
##              Prevalence : 0.1237          
##          Detection Rate : 0.1153          
##    Detection Prevalence : 0.4076          
##       Balanced Accuracy : 0.7993          
##                                           
##        'Positive' Class : 1               
##

dari 3 Confusion Matrix di atas bisa dilihat bahwa meskipun hanya berbeda tipis, model Naive Bayes memiliki nilai akurasi yang paling tinggi yaitu di angka 71.74%, sedangkan Random Forest 69.94% dan Decision Tree 66.01%. Melihat nilai akurasi yang belum terlalu tinggi maka mari kita cek kembali evaluasi model ini dengan AUC dan ROC value.

AUC dan ROC value

Naive Bayes

naive_predprob <- predict(model_naive, data_test, type = "raw")

# membuat objek prediction
naive_roc <- prediction(predictions = naive_predprob[,2], # prob kelas positif
                        labels = as.numeric(data_test$Response == "1"))

# buat performance dari objek prediction
perf <- performance(prediction.obj = naive_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate
                    
# buat plot
plot(perf)
abline(0,1, lty = 2) # utk buat garis diagonal saja = kurva utk model yang buruk (utk jadi pembanding)

auc <- performance(prediction.obj = naive_roc, 
                   measure = "auc")
auc@y.values

## [[1]]
## [1] 0.8240812

Decision Tree

Dtree_predprob <- predict(asuransi_tree, data_test, type = "prob")

# membuat objek prediction
Dtree_roc <- prediction(predictions = Dtree_predprob[,2], # prob kelas positif
                        labels = as.numeric(data_test$Response == "1"))

# buat performance dari objek prediction
Dtree_perf <- performance(prediction.obj = Dtree_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate
                    
# buat plot
plot(Dtree_perf)
abline(0,1, lty = 2) # utk buat garis diagonal saja = kurva utk model yang buruk (utk jadi pembanding)

auc <- performance(prediction.obj = Dtree_roc, 
                   measure = "auc")
auc@y.values

## [[1]]
## [1] 0.8344786

Random Forest

#menyiapkan hasil probability
forestProb <- predict(model_forest, data_test, type = "prob")

#membuat objek prediction
forest_auc <- prediction(predictions = forestProb[, 2], # prob kelas positif
                        labels = as.numeric(data_test$Response == "1")) # label kelas positif

forest_perf <- performance(prediction.obj = forest_auc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate
                    
# buat plot
plot(forest_perf)
abline(0,1, lty = 2)

auc <- performance(prediction.obj = forest_auc, 
                   measure = "auc")
auc@y.values

## [[1]]
## [1] 0.8477199

Dari AUC 3 model di atas bisa dilihat bahwa nilai AUC nya juga berbeda tipis, terlihat juga model Random Forest memiliki nilai AUC yang paling tinggi yaitu di angka 84.77%, setelah itu diikuti oleh Decision Tree 83.44% dan Naive Bayes 82.4%.

Health Insurance Cross Sell Prediction

L. FERUSTA NOVA ADI P.

Business Question

Import Library

Read Data

Data Wrangling

EDA

Cross-Validation

Pembuatan Model

Naive Bayes

Decision Tree

Random Forest

Overall Evaluation

Confusion Matrix

AUC dan ROC value