Introduction
In this case study, I will examine travel insurance prediction data, looking for indicators that can be used to predict customers to be able to participate in travel insurance.
This dataset contains nine features from which we must predict whether or not the customer will purchase travel insurance:
- Age: The Customer’s Age
- Employment Type: The Industry in which the customer works
- Graduate Or Not : Whether the customer is college Graduate Or Not
- Annual Income: The customer’s annual income in Indian rupees [rounded to the nearest 50 thousand rupees]
- Family Members: The number of people in the customer’s family
- Chronic Disease: Whether the customer has any major diseases or conditions such as diabetes, high blood pressure, or asthama, etc.
- Frequent Flyer: Data derived from a customer’s history of booking air tickets at least four times in the last two years [2017-2019]
- Ever Traveled Abroad : Has the customer ever traveled To a foreign Country[Not Using The Company’s Services]
- Travel Insurance : Did the customer buy travel insurance package during introductory offering held in the year 2019 (Whether The Customer Bought The Travel Insurance Or Not; 0: Yes, 1: No)
Data Preparation
Import data:
travel_ins <- read.csv("data_input/TravelInsurancePrediction.csv", stringsAsFactors = T)
travel_ins
Data Cleansing
Check data types :
str(travel_ins)
#> 'data.frame': 1987 obs. of 10 variables:
#> $ X : int 0 1 2 3 4 5 6 7 8 9 ...
#> $ Age : int 31 31 34 28 28 25 31 31 28 33 ...
#> $ Employment.Type : Factor w/ 2 levels "Government Sector",..: 1 2 2 2 2 2 1 2 2 1 ...
#> $ GraduateOrNot : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
#> $ AnnualIncome : int 400000 1250000 500000 700000 700000 1150000 1300000 1350000 1450000 800000 ...
#> $ FamilyMembers : int 6 7 4 3 8 4 4 3 6 3 ...
#> $ ChronicDiseases : int 1 0 1 1 1 0 0 0 1 0 ...
#> $ FrequentFlyer : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 2 2 ...
#> $ EverTravelledAbroad: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
#> $ TravelInsurance : int 0 0 1 0 0 0 0 1 1 0 ...
I’ll replace name for levels in target variabel:
travel_ins$TravelInsurance <- factor(travel_ins$TravelInsurance,levels = c(0,1) , labels=c("Yes","No"))
Change the data type and remove any unnecessary columns:
library(dplyr)
travel_ins_clear <- travel_ins %>%
select(-X) %>%
mutate(ChronicDiseases = as.factor(ChronicDiseases))
str(travel_ins_clear)
#> 'data.frame': 1987 obs. of 9 variables:
#> $ Age : int 31 31 34 28 28 25 31 31 28 33 ...
#> $ Employment.Type : Factor w/ 2 levels "Government Sector",..: 1 2 2 2 2 2 1 2 2 1 ...
#> $ GraduateOrNot : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
#> $ AnnualIncome : int 400000 1250000 500000 700000 700000 1150000 1300000 1350000 1450000 800000 ...
#> $ FamilyMembers : int 6 7 4 3 8 4 4 3 6 3 ...
#> $ ChronicDiseases : Factor w/ 2 levels "0","1": 2 1 2 2 2 1 1 1 2 1 ...
#> $ FrequentFlyer : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 2 2 ...
#> $ EverTravelledAbroad: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 2 1 ...
#> $ TravelInsurance : Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 1 2 2 1 ...
Check missing value:
colSums(is.na(travel_ins_clear))
#> Age Employment.Type GraduateOrNot AnnualIncome
#> 0 0 0 0
#> FamilyMembers ChronicDiseases FrequentFlyer EverTravelledAbroad
#> 0 0 0 0
#> TravelInsurance
#> 0
Check summary data:
summary(travel_ins_clear)
#> Age Employment.Type GraduateOrNot
#> Min. :25.00 Government Sector : 570 No : 295
#> 1st Qu.:28.00 Private Sector/Self Employed:1417 Yes:1692
#> Median :29.00
#> Mean :29.65
#> 3rd Qu.:32.00
#> Max. :35.00
#> AnnualIncome FamilyMembers ChronicDiseases FrequentFlyer
#> Min. : 300000 Min. :2.000 0:1435 No :1570
#> 1st Qu.: 600000 1st Qu.:4.000 1: 552 Yes: 417
#> Median : 900000 Median :5.000
#> Mean : 932763 Mean :4.753
#> 3rd Qu.:1250000 3rd Qu.:6.000
#> Max. :1800000 Max. :9.000
#> EverTravelledAbroad TravelInsurance
#> No :1607 Yes:1277
#> Yes: 380 No : 710
#>
#>
#>
#>
Checek propotion table:
prop.table(table(travel_ins_clear$TravelInsurance))
#>
#> Yes No
#> 0.6426774 0.3573226
Naive Bayes
Cross-Validation
Split the data for training and testing data, with 75% of the data used for training:
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(travel_ins_clear), nrow(travel_ins_clear)*0.75)
travelins_train <- travel_ins_clear[intrain, ]
travelins_test <- travel_ins_clear[-intrain, ]
Checek propotion table:
prop.table(table(travelins_train$TravelInsurance))
#>
#> Yes No
#> 0.6463087 0.3536913
Build Model
Created a model based on business knowledge:
library(e1071)
# Train Model
model_naive <- naiveBayes(formula = TravelInsurance~., data = travelins_train, laplace=1)
# check model
model_naive
#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#>
#> A-priori probabilities:
#> Y
#> Yes No
#> 0.6463087 0.3536913
#>
#> Conditional probabilities:
#> Age
#> Y [,1] [,2]
#> Yes 29.47560 2.633184
#> No 29.85389 3.337125
#>
#> Employment.Type
#> Y Government Sector Private Sector/Self Employed
#> Yes 0.3274611 0.6725389
#> No 0.1795841 0.8204159
#>
#> GraduateOrNot
#> Y No Yes
#> Yes 0.1585492 0.8414508
#> No 0.1436673 0.8563327
#>
#> AnnualIncome
#> Y [,1] [,2]
#> Yes 817808.9 330118.5
#> No 1137476.3 371507.4
#>
#> FamilyMembers
#> Y [,1] [,2]
#> Yes 4.638629 1.575621
#> No 4.952562 1.657919
#>
#> ChronicDiseases
#> Y 0 1
#> Yes 0.7181347 0.2818653
#> No 0.7126654 0.2873346
#>
#> FrequentFlyer
#> Y No Yes
#> Yes 0.8632124 0.1367876
#> No 0.6672968 0.3327032
#>
#> EverTravelledAbroad
#> Y No Yes
#> Yes 0.93056995 0.06943005
#> No 0.56899811 0.43100189
Model Prediction
Predict class from the test data with the predict() function:
# predict
naive_predClass <- predict(object = model_naive, newdata = travelins_test, type = "class")
head(naive_predClass)
#> [1] Yes Yes No Yes No No
#> Levels: Yes No
Model Evaluation
Evaluate the model with the confusion matrix:
library(caret)
confusionMatrix(data = naive_predClass, reference = travelins_test$TravelInsurance, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Yes No
#> Yes 283 90
#> No 31 93
#>
#> Accuracy : 0.7565
#> 95% CI : (0.7163, 0.7937)
#> No Information Rate : 0.6318
#> P-Value [Acc > NIR] : 1.884e-09
#>
#> Kappa : 0.439
#>
#> Mcnemar's Test P-Value : 1.344e-07
#>
#> Sensitivity : 0.9013
#> Specificity : 0.5082
#> Pos Pred Value : 0.7587
#> Neg Pred Value : 0.7500
#> Prevalence : 0.6318
#> Detection Rate : 0.5694
#> Detection Prevalence : 0.7505
#> Balanced Accuracy : 0.7047
#>
#> 'Positive' Class : Yes
#>
Based on the Confusion Matrix results, it can be concluded that the model made is fairy well, with an accuracy value of 75.65%. Then because in this case, we offer our customers a travel investment, and we want to predict where the customer will invest. So, we’ll be using the Sensitivity/Recall Metric, which has a 90,13% success rate.
ROC
ROC and AUC are two additional evaluation tools used after the Confusion Matrix to evaluate the model’s classification performance the two classes (positive and negative):
pred_naiveProb <- predict(model_naive, newdata = travelins_test, type = "raw")
prepare data frame for ROC:
df_roc_naive <- data.frame(prob = pred_naiveProb[,1],
label = as.numeric(travelins_test$TravelInsurance == "Yes"))
df_roc_naive
We can make a ROC by using the ‘prediction()’ object:
library(ROCR)
# create object prediction
prediction_roc_naive <- prediction(predictions = df_roc_naive$prob,
labels = df_roc_naive$label)
# make plot of prediction object
plot(performance(prediction.obj = prediction_roc_naive,
measure = "tpr",
x.measure = "fpr"))
KLIK IMAGE TO OPEN
Area Under ROC Curve (AUC)
AUC shows the area under the ROC curve. The closer to 1, the better the model’s performance. To get the AUC value, write auc in the measure parameter of performance() and take the value y.values:
auc_naive <- performance(prediction.obj = prediction_roc_naive, measure = "auc")
auc_naive@y.values
#> [[1]]
#> [1] 0.749913
The Naive Bayes model’s AUC value is nearly equal to 1 that has 0.74, its mean that our model is able to classify positive and negative classes.
Decision Tree
Cross-Validation
Split the data for training and testing data, with 75% of the data used for training:
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(travel_ins_clear), nrow(travel_ins_clear)*0.75)
travelins_train <- travel_ins_clear[intrain, ]
travelins_test <- travel_ins_clear[-intrain, ]
Use a downsample technique to show the majority observation into balance with the minority.
# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
travel_train_down <- downSample(x = travelins_train %>% select(-TravelInsurance),
y = travelins_train$TravelInsurance,
yname = "TravelInsurance")
dim(travel_train_down)
#> [1] 1054 9
head(travel_train_down)
Model Fitting
library(partykit)
model_traveldt <- ctree(formula = TravelInsurance~., data = travel_train_down)
model_traveldt
#>
#> Model formula:
#> TravelInsurance ~ Age + Employment.Type + GraduateOrNot + AnnualIncome +
#> FamilyMembers + ChronicDiseases + FrequentFlyer + EverTravelledAbroad
#>
#> Fitted party:
#> [1] root
#> | [2] AnnualIncome <= 1300000
#> | | [3] Age <= 32: Yes (n = 607, err = 29.2%)
#> | | [4] Age > 32
#> | | | [5] FamilyMembers <= 5: Yes (n = 120, err = 35.0%)
#> | | | [6] FamilyMembers > 5: No (n = 75, err = 10.7%)
#> | [7] AnnualIncome > 1300000
#> | | [8] EverTravelledAbroad in No
#> | | | [9] Age <= 25: No (n = 28, err = 3.6%)
#> | | | [10] Age > 25: No (n = 9, err = 44.4%)
#> | | [11] EverTravelledAbroad in Yes: No (n = 215, err = 2.8%)
#>
#> Number of inner nodes: 5
#> Number of terminal nodes: 6
Insight:
- When a customer has an annual income <= 1300000 and an age <= 32, 607 people with a 29.2% error will choose to purchase travel insurance.
- When a person has an annual income of more than > 1300000, has never traveled abroad, and is over the age <= 25, they will choose not to purchase travel insurance, as many as 28 people with a 3.6 percent error rate.
# decision tree visualization
plot(model_traveldt, type = "simple")
KLIK IMAGE TO OPEN
Model Evaluation
Let’s evaluate travelinsturance_tree using a confusion matrix based on the prediction results in the test data:
# prediksi kelas di data test
pred_test_travel <- predict(object = model_traveldt, newdata = travelins_test, type = "response")
# confusion matrix data test
confusionMatrix(pred_test_travel, travelins_test$TravelInsurance, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Yes No
#> Yes 305 71
#> No 9 112
#>
#> Accuracy : 0.839
#> 95% CI : (0.8037, 0.8703)
#> No Information Rate : 0.6318
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6277
#>
#> Mcnemar's Test P-Value : 9.104e-12
#>
#> Sensitivity : 0.9713
#> Specificity : 0.6120
#> Pos Pred Value : 0.8112
#> Neg Pred Value : 0.9256
#> Prevalence : 0.6318
#> Detection Rate : 0.6137
#> Detection Prevalence : 0.7565
#> Balanced Accuracy : 0.7917
#>
#> 'Positive' Class : Yes
#>
Based on the Confusion Matrix results, it can be concluded that the model made is so good, with an accuracy value of 83.9%. Then because in this case, we offer our customers a travel investment, and we want to predict where the customer will invest. So, we’ll be using the Sensitivity/Recall Metric, which has a 97,13% success rate.
Random Forest
Cross Validation
Split the data for training and testing data, with 75% of the data used for training:
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(travel_ins_clear), nrow(travel_ins_clear)*0.75)
travelins_train <- travel_ins_clear[intrain, ]
travelins_test <- travel_ins_clear[-intrain, ]
Model Fitting
library(randomForest)
model_rf <- randomForest(TravelInsurance ~ ., data = travelins_train, importance=TRUE,
ntree = 100)
model_rf
#>
#> Call:
#> randomForest(formula = TravelInsurance ~ ., data = travelins_train, importance = TRUE, ntree = 100)
#> Type of random forest: classification
#> Number of trees: 100
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 17.18%
#> Confusion matrix:
#> Yes No class.error
#> Yes 933 30 0.03115265
#> No 226 301 0.42884250
The results show that based on OOB (out-of-bag) observations, the misclassification rate is around 17.32 percent, with an accuracy of 82.68 percent.
Prediction dan Model Evaluation
pred_rf_test <- predict(model_rf, newdata = travelins_test)
confusionMatrix(as.factor(pred_rf_test), travelins_test$TravelInsurance, positive="Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Yes No
#> Yes 308 76
#> No 6 107
#>
#> Accuracy : 0.835
#> 95% CI : (0.7994, 0.8666)
#> No Information Rate : 0.6318
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.6146
#>
#> Mcnemar's Test P-Value : 2.541e-14
#>
#> Sensitivity : 0.9809
#> Specificity : 0.5847
#> Pos Pred Value : 0.8021
#> Neg Pred Value : 0.9469
#> Prevalence : 0.6318
#> Detection Rate : 0.6197
#> Detection Prevalence : 0.7726
#> Balanced Accuracy : 0.7828
#>
#> 'Positive' Class : Yes
#>
Based on the Confusion Matrix results, it can be concluded that the model made is so good, with an accuracy value of 82.7%. Then because in this case, we offer our customers a travel investment, and we want to predict where the customer will invest. So, we’ll be using the Sensitivity/Recall Metric, which has a 98,09% success rate.
Conclusion
In this case, we have created an intelligent model that can predict whether a customer will be enticed to purchase a travel insurance based on certain parameters, and we will use precisionor Pos Pred Value metrics because we want to approach / offer the customer as much as possible.
When compared to the other three models, the Naive Bayes, Decision Tree and Random Forest Model. So, the Random Forest model has a precision value of 98.09%, while the Naive Bayes model has a precision value of 90.13% and the Decision Tree model has a precision value of 97.13%. Companies can better predict potential customers and reduce the risk of providing travel insurance by using this model.