Travel Insurance Prediction Using Naive Bayes, Decision Tree and Random Forest

Nasya Nurul Arsy

3/3/2022

Introduction

In this case study, I will examine travel insurance prediction data, looking for indicators that can be used to predict customers to be able to participate in travel insurance.

This dataset contains nine features from which we must predict whether or not the customer will purchase travel insurance:
- Age: The Customer’s Age
- Employment Type: The Industry in which the customer works
- Graduate Or Not : Whether the customer is college Graduate Or Not
- Annual Income: The customer’s annual income in Indian rupees [rounded to the nearest 50 thousand rupees]
- Family Members: The number of people in the customer’s family
- Chronic Disease: Whether the customer has any major diseases or conditions such as diabetes, high blood pressure, or asthama, etc.
- Frequent Flyer: Data derived from a customer’s history of booking air tickets at least four times in the last two years [2017-2019]
- Ever Traveled Abroad : Has the customer ever traveled To a foreign Country[Not Using The Company’s Services]
- Travel Insurance : Did the customer buy travel insurance package during introductory offering held in the year 2019 (Whether The Customer Bought The Travel Insurance Or Not; 0: Yes, 1: No)

Data Preparation

Import data:

travel_ins <- read.csv("data_input/TravelInsurancePrediction.csv", stringsAsFactors = T)
travel_ins

Data Cleansing

Check data types :

str(travel_ins)

#> 'data.frame':    1987 obs. of  10 variables:
#>  $ X                  : int  0 1 2 3 4 5 6 7 8 9 ...
#>  $ Age                : int  31 31 34 28 28 25 31 31 28 33 ...
#>  $ Employment.Type    : Factor w/ 2 levels "Government Sector",..: 1 2 2 2 2 2 1 2 2 1 ...
#>  $ GraduateOrNot      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
#>  $ AnnualIncome       : int  400000 1250000 500000 700000 700000 1150000 1300000 1350000 1450000 800000 ...
#>  $ FamilyMembers      : int  6 7 4 3 8 4 4 3 6 3 ...
#>  $ ChronicDiseases    : int  1 0 1 1 1 0 0 0 1 0 ...
#>  $ FrequentFlyer      : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 2 2 ...
#>  $ EverTravelledAbroad: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
#>  $ TravelInsurance    : int  0 0 1 0 0 0 0 1 1 0 ...

I’ll replace name for levels in target variabel:

travel_ins$TravelInsurance <- factor(travel_ins$TravelInsurance,levels = c(0,1) , labels=c("Yes","No"))

Change the data type and remove any unnecessary columns:

library(dplyr)
travel_ins_clear <- travel_ins %>% 
  select(-X) %>% 
  mutate(ChronicDiseases = as.factor(ChronicDiseases))

str(travel_ins_clear)

#> 'data.frame':    1987 obs. of  9 variables:
#>  $ Age                : int  31 31 34 28 28 25 31 31 28 33 ...
#>  $ Employment.Type    : Factor w/ 2 levels "Government Sector",..: 1 2 2 2 2 2 1 2 2 1 ...
#>  $ GraduateOrNot      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
#>  $ AnnualIncome       : int  400000 1250000 500000 700000 700000 1150000 1300000 1350000 1450000 800000 ...
#>  $ FamilyMembers      : int  6 7 4 3 8 4 4 3 6 3 ...
#>  $ ChronicDiseases    : Factor w/ 2 levels "0","1": 2 1 2 2 2 1 1 1 2 1 ...
#>  $ FrequentFlyer      : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 2 2 ...
#>  $ EverTravelledAbroad: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
#>  $ TravelInsurance    : Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 1 2 2 1 ...

Check missing value:

colSums(is.na(travel_ins_clear))

#>                 Age     Employment.Type       GraduateOrNot        AnnualIncome 
#>                   0                   0                   0                   0 
#>       FamilyMembers     ChronicDiseases       FrequentFlyer EverTravelledAbroad 
#>                   0                   0                   0                   0 
#>     TravelInsurance 
#>                   0

Check summary data:

summary(travel_ins_clear)

#>       Age                            Employment.Type GraduateOrNot
#>  Min.   :25.00   Government Sector           : 570   No : 295     
#>  1st Qu.:28.00   Private Sector/Self Employed:1417   Yes:1692     
#>  Median :29.00                                                    
#>  Mean   :29.65                                                    
#>  3rd Qu.:32.00                                                    
#>  Max.   :35.00                                                    
#>   AnnualIncome     FamilyMembers   ChronicDiseases FrequentFlyer
#>  Min.   : 300000   Min.   :2.000   0:1435          No :1570     
#>  1st Qu.: 600000   1st Qu.:4.000   1: 552          Yes: 417     
#>  Median : 900000   Median :5.000                                
#>  Mean   : 932763   Mean   :4.753                                
#>  3rd Qu.:1250000   3rd Qu.:6.000                                
#>  Max.   :1800000   Max.   :9.000                                
#>  EverTravelledAbroad TravelInsurance
#>  No :1607            Yes:1277       
#>  Yes: 380            No : 710       
#>                                     
#>                                     
#>                                     
#>

Checek propotion table:

prop.table(table(travel_ins_clear$TravelInsurance))

#> 
#>       Yes        No 
#> 0.6426774 0.3573226

Naive Bayes

Cross-Validation

Split the data for training and testing data, with 75% of the data used for training:

RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(travel_ins_clear), nrow(travel_ins_clear)*0.75)
travelins_train <- travel_ins_clear[intrain, ]
travelins_test <- travel_ins_clear[-intrain, ]

Checek propotion table:

prop.table(table(travelins_train$TravelInsurance))

#> 
#>       Yes        No 
#> 0.6463087 0.3536913

Build Model

Created a model based on business knowledge:

library(e1071)

# Train Model
model_naive <- naiveBayes(formula = TravelInsurance~., data = travelins_train, laplace=1)
  
# check model
model_naive

#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>       Yes        No 
#> 0.6463087 0.3536913 
#> 
#> Conditional probabilities:
#>      Age
#> Y         [,1]     [,2]
#>   Yes 29.47560 2.633184
#>   No  29.85389 3.337125
#> 
#>      Employment.Type
#> Y     Government Sector Private Sector/Self Employed
#>   Yes         0.3274611                    0.6725389
#>   No          0.1795841                    0.8204159
#> 
#>      GraduateOrNot
#> Y            No       Yes
#>   Yes 0.1585492 0.8414508
#>   No  0.1436673 0.8563327
#> 
#>      AnnualIncome
#> Y          [,1]     [,2]
#>   Yes  817808.9 330118.5
#>   No  1137476.3 371507.4
#> 
#>      FamilyMembers
#> Y         [,1]     [,2]
#>   Yes 4.638629 1.575621
#>   No  4.952562 1.657919
#> 
#>      ChronicDiseases
#> Y             0         1
#>   Yes 0.7181347 0.2818653
#>   No  0.7126654 0.2873346
#> 
#>      FrequentFlyer
#> Y            No       Yes
#>   Yes 0.8632124 0.1367876
#>   No  0.6672968 0.3327032
#> 
#>      EverTravelledAbroad
#> Y             No        Yes
#>   Yes 0.93056995 0.06943005
#>   No  0.56899811 0.43100189

Model Prediction

Predict class from the test data with the predict() function:

# predict
naive_predClass <- predict(object = model_naive, newdata = travelins_test, type = "class")
head(naive_predClass)

#> [1] Yes Yes No  Yes No  No 
#> Levels: Yes No

Model Evaluation

Evaluate the model with the confusion matrix:

library(caret)
confusionMatrix(data = naive_predClass, reference = travelins_test$TravelInsurance, positive = "Yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Yes  No
#>        Yes 283  90
#>        No   31  93
#>                                           
#>                Accuracy : 0.7565          
#>                  95% CI : (0.7163, 0.7937)
#>     No Information Rate : 0.6318          
#>     P-Value [Acc > NIR] : 1.884e-09       
#>                                           
#>                   Kappa : 0.439           
#>                                           
#>  Mcnemar's Test P-Value : 1.344e-07       
#>                                           
#>             Sensitivity : 0.9013          
#>             Specificity : 0.5082          
#>          Pos Pred Value : 0.7587          
#>          Neg Pred Value : 0.7500          
#>              Prevalence : 0.6318          
#>          Detection Rate : 0.5694          
#>    Detection Prevalence : 0.7505          
#>       Balanced Accuracy : 0.7047          
#>                                           
#>        'Positive' Class : Yes             
#>

Based on the Confusion Matrix results, it can be concluded that the model made is fairy well, with an accuracy value of 75.65%. Then because in this case, we offer our customers a travel investment, and we want to predict where the customer will invest. So, we’ll be using the Sensitivity/Recall Metric, which has a 90,13% success rate.

ROC

ROC and AUC are two additional evaluation tools used after the Confusion Matrix to evaluate the model’s classification performance the two classes (positive and negative):

pred_naiveProb <- predict(model_naive, newdata = travelins_test, type = "raw")

prepare data frame for ROC:

df_roc_naive <- data.frame(prob = pred_naiveProb[,1], 
                           label = as.numeric(travelins_test$TravelInsurance == "Yes")) 
df_roc_naive

We can make a ROC by using the ‘prediction()’ object:

library(ROCR)

# create object prediction
prediction_roc_naive <- prediction(predictions = df_roc_naive$prob, 
                                  labels = df_roc_naive$label) 

# make plot of prediction object
plot(performance(prediction.obj = prediction_roc_naive, 
                 measure = "tpr", 
                 x.measure = "fpr"))

KLIK IMAGE TO OPEN

Area Under ROC Curve (AUC)

AUC shows the area under the ROC curve. The closer to 1, the better the model’s performance. To get the AUC value, write auc in the measure parameter of performance() and take the value y.values:

auc_naive <- performance(prediction.obj = prediction_roc_naive, measure = "auc")
auc_naive@y.values

#> [[1]]
#> [1] 0.749913

The Naive Bayes model’s AUC value is nearly equal to 1 that has 0.74, its mean that our model is able to classify positive and negative classes.

Decision Tree

Cross-Validation

Split the data for training and testing data, with 75% of the data used for training:

RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(travel_ins_clear), nrow(travel_ins_clear)*0.75)
travelins_train <- travel_ins_clear[intrain, ]
travelins_test <- travel_ins_clear[-intrain, ]

Use a downsample technique to show the majority observation into balance with the minority.

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

library(caret)
travel_train_down <- downSample(x = travelins_train %>% select(-TravelInsurance), 
                              y = travelins_train$TravelInsurance, 
                              yname = "TravelInsurance") 

dim(travel_train_down)

#> [1] 1054    9

head(travel_train_down)

Model Fitting

library(partykit)
model_traveldt <- ctree(formula = TravelInsurance~., data = travel_train_down)
model_traveldt

#> 
#> Model formula:
#> TravelInsurance ~ Age + Employment.Type + GraduateOrNot + AnnualIncome + 
#>     FamilyMembers + ChronicDiseases + FrequentFlyer + EverTravelledAbroad
#> 
#> Fitted party:
#> [1] root
#> |   [2] AnnualIncome <= 1300000
#> |   |   [3] Age <= 32: Yes (n = 607, err = 29.2%)
#> |   |   [4] Age > 32
#> |   |   |   [5] FamilyMembers <= 5: Yes (n = 120, err = 35.0%)
#> |   |   |   [6] FamilyMembers > 5: No (n = 75, err = 10.7%)
#> |   [7] AnnualIncome > 1300000
#> |   |   [8] EverTravelledAbroad in No
#> |   |   |   [9] Age <= 25: No (n = 28, err = 3.6%)
#> |   |   |   [10] Age > 25: No (n = 9, err = 44.4%)
#> |   |   [11] EverTravelledAbroad in Yes: No (n = 215, err = 2.8%)
#> 
#> Number of inner nodes:    5
#> Number of terminal nodes: 6

Insight:

When a customer has an annual income <= 1300000 and an age <= 32, 607 people with a 29.2% error will choose to purchase travel insurance.
When a person has an annual income of more than > 1300000, has never traveled abroad, and is over the age <= 25, they will choose not to purchase travel insurance, as many as 28 people with a 3.6 percent error rate.

# decision tree visualization
plot(model_traveldt, type = "simple")

KLIK IMAGE TO OPEN

Model Evaluation

Let’s evaluate travelinsturance_tree using a confusion matrix based on the prediction results in the test data:

# prediksi kelas di data test
pred_test_travel <- predict(object = model_traveldt, newdata = travelins_test, type = "response")

# confusion matrix data test
confusionMatrix(pred_test_travel, travelins_test$TravelInsurance, positive = "Yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Yes  No
#>        Yes 305  71
#>        No    9 112
#>                                           
#>                Accuracy : 0.839           
#>                  95% CI : (0.8037, 0.8703)
#>     No Information Rate : 0.6318          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.6277          
#>                                           
#>  Mcnemar's Test P-Value : 9.104e-12       
#>                                           
#>             Sensitivity : 0.9713          
#>             Specificity : 0.6120          
#>          Pos Pred Value : 0.8112          
#>          Neg Pred Value : 0.9256          
#>              Prevalence : 0.6318          
#>          Detection Rate : 0.6137          
#>    Detection Prevalence : 0.7565          
#>       Balanced Accuracy : 0.7917          
#>                                           
#>        'Positive' Class : Yes             
#>

Based on the Confusion Matrix results, it can be concluded that the model made is so good, with an accuracy value of 83.9%. Then because in this case, we offer our customers a travel investment, and we want to predict where the customer will invest. So, we’ll be using the Sensitivity/Recall Metric, which has a 97,13% success rate.

Random Forest

Cross Validation

Split the data for training and testing data, with 75% of the data used for training:

RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(travel_ins_clear), nrow(travel_ins_clear)*0.75)
travelins_train <- travel_ins_clear[intrain, ]
travelins_test <- travel_ins_clear[-intrain, ]

Model Fitting

library(randomForest)
model_rf <- randomForest(TravelInsurance ~ ., data = travelins_train, importance=TRUE, 
                         ntree = 100)

model_rf

#> 
#> Call:
#>  randomForest(formula = TravelInsurance ~ ., data = travelins_train,      importance = TRUE, ntree = 100) 
#>                Type of random forest: classification
#>                      Number of trees: 100
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 17.18%
#> Confusion matrix:
#>     Yes  No class.error
#> Yes 933  30  0.03115265
#> No  226 301  0.42884250

The results show that based on OOB (out-of-bag) observations, the misclassification rate is around 17.32 percent, with an accuracy of 82.68 percent.

Prediction dan Model Evaluation

pred_rf_test <- predict(model_rf, newdata = travelins_test)
confusionMatrix(as.factor(pred_rf_test), travelins_test$TravelInsurance, positive="Yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Yes  No
#>        Yes 308  76
#>        No    6 107
#>                                           
#>                Accuracy : 0.835           
#>                  95% CI : (0.7994, 0.8666)
#>     No Information Rate : 0.6318          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.6146          
#>                                           
#>  Mcnemar's Test P-Value : 2.541e-14       
#>                                           
#>             Sensitivity : 0.9809          
#>             Specificity : 0.5847          
#>          Pos Pred Value : 0.8021          
#>          Neg Pred Value : 0.9469          
#>              Prevalence : 0.6318          
#>          Detection Rate : 0.6197          
#>    Detection Prevalence : 0.7726          
#>       Balanced Accuracy : 0.7828          
#>                                           
#>        'Positive' Class : Yes             
#>

Based on the Confusion Matrix results, it can be concluded that the model made is so good, with an accuracy value of 82.7%. Then because in this case, we offer our customers a travel investment, and we want to predict where the customer will invest. So, we’ll be using the Sensitivity/Recall Metric, which has a 98,09% success rate.

Conclusion

In this case, we have created an intelligent model that can predict whether a customer will be enticed to purchase a travel insurance based on certain parameters, and we will use precisionor Pos Pred Value metrics because we want to approach / offer the customer as much as possible.

When compared to the other three models, the Naive Bayes, Decision Tree and Random Forest Model. So, the Random Forest model has a precision value of 98.09%, while the Naive Bayes model has a precision value of 90.13% and the Decision Tree model has a precision value of 97.13%. Companies can better predict potential customers and reduce the risk of providing travel insurance by using this model.