1 Greetings

Welcome to my Rmd. I created it to improve my understanding of classification in machine learning using Naive Bayes, Decision Tree and Random Forest.

2 Brief explanation about the data

Our client is an insurance company that has provided health insurance to its customers. They now need your help in building a model to predict whether last year's policyholders (customers) will also be interested in the vehicle insurance provided by the company.

Columns Insight :

1. id: Unique ID for the customer
2. Gender: Gender of the customer
3. Age: Age of the customer
4. Driving_License: 0 -> Customer does not have DL, 1 -> Customer already has DL
5. Region_Code: Unique code for the region of the customer
6. Previously_Insured: 1 -> Customer already has Vehicle Insurance, 0 -> Customer doesn’t have Vehicle Insurance
7. Vehicle_Age: Age of the Vehicle
8. Vehicle_Damage: Yes -> Customer got his/her vehicle damaged in the past, No -> Customer didn’t get his/her vehicle damaged in the past.
9. Annual_Premium: The amount customer needs to pay as premium in the year
10. Policy_Sales_Channel: Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc.
11. Vintage : Number of Days, Customer has been associated with the company
12. Response : 1 -> Customer is interested, 0 -> Customer is not interested

You may download the data set from kaggle: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction

3 Business Question

Predict Health Insurance Owners who will be interested in Vehicle Insurance

4 Data Preparation

4.1 Import necessary library

library(dplyr)
library(UBL)
library(GGally)
library(ggplot2)
library(gridExtra) 
library(inspectdf)
library(tidymodels) 
library(caret)
library(MASS)
library(e1071)
library(ROCR)
library(partykit)
library(rpart)
library(rattle)
library(rpart.plot)
library(randomForest)

4.2 Read the dataset

insur <- read.csv("insur.csv")
glimpse(insur)

## Rows: 10,428
## Columns: 12
## $ id                   <int> 9198, 9199, 9200, 9201, 9202, 9203, 9204, 9205, 9~
## $ Gender               <chr> "Female", "Male", "Female", "Female", "Male", "Ma~
## $ Age                  <int> 21, 64, 44, 23, 50, 42, 60, 38, 69, 21, 42, 29, 2~
## $ Driving_License      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ Region_Code          <int> 40, 33, 29, 6, 28, 28, 28, 18, 15, 30, 36, 15, 36~
## $ Previously_Insured   <int> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1~
## $ Vehicle_Age          <chr> "< 1 Year", "1-2 Year", "1-2 Year", "< 1 Year", "~
## $ Vehicle_Damage       <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No"~
## $ Annual_Premium       <int> 2630, 28046, 28932, 32819, 25200, 2630, 2630, 373~
## $ Policy_Sales_Channel <int> 160, 124, 26, 152, 26, 124, 125, 26, 122, 160, 26~
## $ Vintage              <int> 20, 220, 61, 75, 53, 158, 246, 125, 218, 243, 95,~
## $ Response             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

5 Exploratory Data Analysis

5.1 Check Missing Value

colSums(is.na(insur))

##                   id               Gender                  Age 
##                    0                    0                    0 
##      Driving_License          Region_Code   Previously_Insured 
##                    0                    0                    0 
##          Vehicle_Age       Vehicle_Damage       Annual_Premium 
##                    0                    0                    0 
## Policy_Sales_Channel              Vintage             Response 
##                    0                    0                    0

5.2 Drop Unnecessary Columns

Column id is a unique identifier for each customer, so it can be ignored. Columns Region_Code and Policy_Sales_Channel are dropped as well, since the region code and the anonymized code for the sales channel are not very useful in this case.

insur <- insur %>% 
  dplyr::select(-id, -Region_Code, -Policy_Sales_Channel)

glimpse(insur)

## Rows: 10,428
## Columns: 9
## $ Gender             <chr> "Female", "Male", "Female", "Female", "Male", "Male~
## $ Age                <int> 21, 64, 44, 23, 50, 42, 60, 38, 69, 21, 42, 29, 26,~
## $ Driving_License    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ Previously_Insured <int> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, ~
## $ Vehicle_Age        <chr> "< 1 Year", "1-2 Year", "1-2 Year", "< 1 Year", "1-~
## $ Vehicle_Damage     <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", ~
## $ Annual_Premium     <int> 2630, 28046, 28932, 32819, 25200, 2630, 2630, 37367~
## $ Vintage            <int> 20, 220, 61, 75, 53, 158, 246, 125, 218, 243, 95, 1~
## $ Response           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

5.3 Change Data Type

insur_clean <- insur %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(Driving_License = as.factor(Driving_License),
         Previously_Insured = as.factor(Previously_Insured),
         Response = as.factor(Response))

glimpse(insur_clean)

## Rows: 10,428
## Columns: 9
## $ Gender             <fct> Female, Male, Female, Female, Male, Male, Male, Mal~
## $ Age                <int> 21, 64, 44, 23, 50, 42, 60, 38, 69, 21, 42, 29, 26,~
## $ Driving_License    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ Previously_Insured <fct> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, ~
## $ Vehicle_Age        <fct> < 1 Year, 1-2 Year, 1-2 Year, < 1 Year, 1-2 Year, 1~
## $ Vehicle_Damage     <fct> No, No, Yes, No, Yes, Yes, No, No, No, Yes, Yes, No~
## $ Annual_Premium     <int> 2630, 28046, 28932, 32819, 25200, 2630, 2630, 37367~
## $ Vintage            <int> 20, 220, 61, 75, 53, 158, 246, 125, 218, 243, 95, 1~
## $ Response           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

5.4 Check the distribution / pattern data

summary(insur_clean)

##     Gender          Age        Driving_License Previously_Insured
##  Female:4511   Min.   :20.00   0:   18         0:7691            
##  Male  :5917   1st Qu.:27.00   1:10410         1:2737            
##                Median :41.00                                     
##                Mean   :40.72                                     
##                3rd Qu.:50.00                                     
##                Max.   :83.00                                     
##     Vehicle_Age   Vehicle_Damage Annual_Premium      Vintage      Response
##  < 1 Year :3275   No :3020       Min.   :  2630   Min.   : 10.0   0:5240  
##  > 2 Years: 698   Yes:7408       1st Qu.: 24609   1st Qu.: 81.0   1:5188  
##  1-2 Year :6455                  Median : 32427   Median :153.0           
##                                  Mean   : 31299   Mean   :153.7           
##                                  3rd Qu.: 40562   3rd Qu.:225.0           
##                                  Max.   :508073   Max.   :299.0

From the result above there is some information that might help the modeling:
- Most of the customers do not have vehicle insurance yet (7,691 of 10,428).
- Most of the customers' vehicles are quite new; nearly all of them are under 2 years old.
- Ideally, the target variable is balanced when its classes are split 50:50. Here Response has 5,240 customers in class 0 and 5,188 in class 1, so the target can be considered balanced.

6 Cross Validation

Cross validation is used to find out how well the model performs on data it has not seen. Here it is done with a single hold-out split of the data into data train and data test (80:20, stratified by Response).

- Data train: will be used for model training.
- Data test: will be used for testing model performance. The model predicts the test data, and the predicted results are compared with the actual values to validate the model performance.

set.seed(123)

init <- initial_split(data = insur_clean,
                      prop = 0.8,
                      strata = Response)

insur_train <- training(init)
insur_test <- testing(init)

Recheck class imbalance for target variable after splitting the data.

prop.table(table(insur_train$Response))

## 
##         0         1 
## 0.5025168 0.4974832

table(insur_train$Response)

## 
##    0    1 
## 4193 4151

prop.table(table(insur_test$Response))

## 
##         0         1 
## 0.5023992 0.4976008

table(insur_test$Response)

## 
##    0    1 
## 1047 1037

7 Build Model

There are three approaches that can be used to predict whether the customer will take the vehicle insurance or not: Naive Bayes, Decision Tree and Random Forest. Each model has its own pros and cons, so let’s implement all three and compare them to find which one is better.

7.1 Naive Bayes Model

In machine learning, Naive Bayes is a classification algorithm based on Bayes' theorem, one of the fundamental theorems in probability. It is called naive because it assumes that the predictors are independent of each other given the class.

Modeling

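First, put column Response as the target variable and the rest of the columns as predictors into function naiveBayes().

Whenever using Naive Bayes, it is better to apply Laplace smoothing. Laplace smoothing handles the zero-probability problem: if a level of a categorical predictor never appears together with a class in the training data, its conditional probability is 0 and wipes out the whole product for that class. Adding a small count (here laplace = 1) to every combination avoids this. In naiveBayes() the smoothing only affects categorical predictors; numeric predictors such as Age are modeled with a normal distribution, which is why the output shows a mean and a standard deviation for them.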
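To make the zero-probability problem concrete, here is a minimal base R sketch of Laplace smoothing for one categorical predictor. The toy vectors `vehicle_age` and `response` are made up for illustration and are not taken from the dataset. The rule shown, (count + laplace) / (class total + laplace x number of levels), is the usual additive smoothing formula and is assumed here to match what laplace = 1 does.

```r
# Toy data (hypothetical): class "1" never occurs together with "> 2 Years"
vehicle_age <- factor(c("< 1 Year", "1-2 Year", "> 2 Years",
                        "< 1 Year", "1-2 Year", "1-2 Year"),
                      levels = c("< 1 Year", "1-2 Year", "> 2 Years"))
response <- factor(c("0", "0", "0", "1", "1", "1"))

counts <- table(response, vehicle_age)

# Without smoothing: P(level | class) = count / class total
raw_prob <- counts / rowSums(counts)

# With Laplace smoothing: add `laplace` to every cell before normalizing
laplace <- 1
smooth_prob <- (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))

raw_prob["1", "> 2 Years"]     # 0, which would zero out the whole product
smooth_prob["1", "> 2 Years"]  # 1/6, small but no longer zero
```

Each row of `smooth_prob` still sums to 1, so the smoothed values remain valid conditional probabilities.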

# Build the Naive Bayes model
model_naive <- naiveBayes(Response ~., insur_train, laplace = 1)

model_naive

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.5025168 0.4974832 
## 
## Conditional probabilities:
##    Gender
## Y      Female      Male
##   0 0.4669845 0.5330155
##   1 0.3934505 0.6065495
## 
##    Age
## Y       [,1]     [,2]
##   0 38.22633 15.89517
##   1 43.30980 12.01581
## 
##    Driving_License
## Y             0           1
##   0 0.002622169 0.997377831
##   1 0.001203949 0.998796051
## 
##    Previously_Insured
## Y             0           1
##   0 0.482240763 0.517759237
##   1 0.994702625 0.005297375
## 
##    Vehicle_Age
## Y     < 1 Year  > 2 Years   1-2 Year
##   0 0.47044805 0.03479504 0.49475691
##   1 0.15214251 0.09942224 0.74843524
## 
##    Vehicle_Damage
## Y           No        Yes
##   0 0.55542312 0.44457688
##   1 0.02143029 0.97856971
## 
##    Annual_Premium
## Y       [,1]     [,2]
##   0 30706.98 18862.11
##   1 32018.05 18847.50
## 
##    Vintage
## Y       [,1]     [,2]
##   0 154.0925 83.25753
##   1 154.0482 83.98995

The conditional probabilities above are read per class: each row shows how a predictor is distributed among the customers of that class.

Based on Vehicle_Damage:
- Among customers who are interested in vehicle insurance (Response = 1), 0.98 have had their vehicle damaged and 0.02 have not.
- Among customers who are not interested (Response = 0), 0.44 have had their vehicle damaged and 0.56 have not.

Based on Vehicle_Age, among customers who are interested in vehicle insurance, the share whose vehicle is:
- Still under 1 year is 0.15
- Between 1 and 2 years is 0.75
- Above 2 years is 0.10

Among customers who are not interested, the share whose vehicle is:
- Still under 1 year is 0.47
- Between 1 and 2 years is 0.49
- Above 2 years is 0.03

That is how to interpret the results of the naiveBayes() model. The same can be done for the other predictors to get additional insight.
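The conditional probabilities in the model output are P(predictor | class). Bayes' theorem turns them into the probability of a class given a predictor value. The sketch below is a small worked example using the a-priori probabilities and the Vehicle_Damage = Yes column printed above. It uses that single predictor only, so the result is illustrative and is not the model's full prediction; the variable names are chosen here for illustration.

```r
# A-priori probabilities of each class, copied from the model output
prior <- c("0" = 0.5025168, "1" = 0.4974832)

# P(Vehicle_Damage = Yes | class), copied from the model output
p_damage_yes <- c("0" = 0.44457688, "1" = 0.97856971)

# Bayes' theorem: posterior is proportional to prior * likelihood
joint <- prior * p_damage_yes
posterior <- joint / sum(joint)

round(posterior, 3)  # P(class | Vehicle_Damage = Yes), about 0.31 and 0.69
```

So knowing only that the vehicle has been damaged moves the probability of interest from about 0.50 to about 0.69.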

Predict

After creating the Naive Bayes model object, predictions can be made with predict() using two values of the type argument: class and raw.

- Parameter Class

Parameter class produces the predicted label, in this case 1 (interested) or 0 (not interested).

preds_naive_class <- predict(model_naive,
                             newdata = insur_test,
                             type = "class")

head(preds_naive_class)

## [1] 0 1 0 1 1 1
## Levels: 0 1

- Parameter Raw

Parameter raw will produce the probability for every class.

preds_naive_raw <- predict(model_naive,
                             newdata = insur_test,
                             type = "raw")

head(preds_naive_raw)

##               0            1
## [1,] 0.99174619 0.0082538068
## [2,] 0.09844969 0.9015503070
## [3,] 0.99925646 0.0007435421
## [4,] 0.09006940 0.9099306030
## [5,] 0.10169503 0.8983049662
## [6,] 0.10628814 0.8937118641

Evaluation

Since there are two kinds of prediction, the evaluation for each one is also different.

- Confusion Matrix

For the prediction with parameter class, function confusionMatrix() is used since the prediction result is a label.

eval_nb <- confusionMatrix(data = preds_naive_class,
                reference = as.factor(insur_test$Response),
                positive = "1")

eval_nb

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 722 120
##          1 325 917
##                                           
##                Accuracy : 0.7865          
##                  95% CI : (0.7682, 0.8039)
##     No Information Rate : 0.5024          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5733          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8843          
##             Specificity : 0.6896          
##          Pos Pred Value : 0.7383          
##          Neg Pred Value : 0.8575          
##              Prevalence : 0.4976          
##          Detection Rate : 0.4400          
##    Detection Prevalence : 0.5960          
##       Balanced Accuracy : 0.7869          
##                                           
##        'Positive' Class : 1               
##
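The headline metrics can be recomputed by hand from the four counts in the confusion matrix above, taking class 1 as the positive class. A short base R sketch (the variable names are chosen here for illustration):

```r
# Counts read from the Naive Bayes confusion matrix (positive class = 1)
tp <- 917  # predicted 1, actually 1
fp <- 325  # predicted 1, actually 0
fn <- 120  # predicted 0, actually 1
tn <- 722  # predicted 0, actually 0

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
recall      <- tp / (tp + fn)  # Sensitivity
precision   <- tp / (tp + fp)  # Pos Pred Value
specificity <- tn / (tn + fp)

round(c(accuracy = accuracy, recall = recall,
        precision = precision, specificity = specificity), 4)
```

The rounded values agree with Accuracy, Sensitivity, Pos Pred Value and Specificity in the confusionMatrix() output.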

- ROC-AUC Curve

For the prediction with parameter raw, function confusionMatrix() cannot be used since the prediction result is not a label. It is evaluated with the ROC curve and its AUC; ROC is Receiver Operating Characteristic and AUC is Area Under the Curve.
- ROC plots the true positive rate (TPR or Sensitivity) against the false positive rate (FPR or 1 - Specificity) across all thresholds.
- AUC is the area under the ROC curve and represents the degree or measure of separability.

# ROC
# Score with the probability of the positive class ("1") so that it matches the labels
naive_roc <- prediction(predictions = preds_naive_raw[, "1"],
                        labels = as.numeric(insur_test$Response == "1"))

perf <- performance(prediction.obj = naive_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") # fpr = false positive rate

# Plot
plot(perf)
abline(0, 1, lty = 2)

# AUC calculation
auc_nb <- performance(prediction.obj = naive_roc,
                      measure = "auc")
auc_nb@y.values

## [[1]]
## [1] 0.8278398
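AUC can also be read as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The base R sketch below computes AUC that way on made-up scores and shows that scoring with the probability of the other class gives exactly the complement. The helper `auc_by_rank()` is hypothetical and is not part of ROCR.

```r
# AUC as the share of (positive, negative) pairs ranked correctly
auc_by_rank <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  # Compare every positive with every negative; ties count as half
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

label  <- c(1, 1, 1, 0, 0, 0)
prob_1 <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.1)  # hypothetical P(class 1)
prob_0 <- 1 - prob_1                       # matching P(class 0)

auc_pos <- auc_by_rank(prob_1, label)  # 8 of 9 pairs ranked correctly
auc_neg <- auc_by_rank(prob_0, label)  # the same pairs, ranked in reverse

c(auc_pos = auc_pos, auc_neg = auc_neg, total = auc_pos + auc_neg)
```

An AUC far below 0.5 is therefore usually a sign that the scores and the labels refer to different classes rather than a sign of a useless model.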

7.2 Decision Tree Model

The second model that can be used to predict whether the customer is also interested in vehicle insurance is the Decision Tree model. A decision tree can be used to visually and explicitly represent decisions and decision making. In this case, the model will be used to visualize the customers' decisions on cross-selling vehicle insurance.

Modeling

First, put column Response as the target variable and the rest of the columns as predictors into function ctree().

dtree_model <- ctree(formula = Response ~., data = insur_train)

dtree_model

## 
## Model formula:
## Response ~ Gender + Age + Driving_License + Previously_Insured + 
##     Vehicle_Age + Vehicle_Damage + Annual_Premium + Vintage
## 
## Fitted party:
## [1] root
## |   [2] Vehicle_Damage in No
## |   |   [3] Previously_Insured in 0: 0 (n = 366, err = 21.6%)
## |   |   [4] Previously_Insured in 1: 0 (n = 2051, err = 0.4%)
## |   [5] Vehicle_Damage in Yes
## |   |   [6] Previously_Insured in 0
## |   |   |   [7] Vehicle_Age < 1 Year
## |   |   |   |   [8] Age <= 27
## |   |   |   |   |   [9] Age <= 22
## |   |   |   |   |   |   [10] Gender in Female: 0 (n = 120, err = 26.7%)
## |   |   |   |   |   |   [11] Gender in Male
## |   |   |   |   |   |   |   [12] Age <= 20: 0 (n = 20, err = 15.0%)
## |   |   |   |   |   |   |   [13] Age > 20: 0 (n = 103, err = 47.6%)
## |   |   |   |   |   [14] Age > 22: 1 (n = 481, err = 47.8%)
## |   |   |   |   [15] Age > 27: 1 (n = 332, err = 23.5%)
## |   |   |   [16] Vehicle_Age > 2 Years, 1-2 Year
## |   |   |   |   [17] Age <= 63
## |   |   |   |   |   [18] Age <= 46: 1 (n = 2513, err = 22.2%)
## |   |   |   |   |   [19] Age > 46
## |   |   |   |   |   |   [20] Vehicle_Age > 2 Years: 1 (n = 287, err = 20.2%)
## |   |   |   |   |   |   [21] Vehicle_Age in 1-2 Year: 1 (n = 1468, err = 29.2%)
## |   |   |   |   [22] Age > 63: 1 (n = 462, err = 48.3%)
## |   |   [23] Previously_Insured in 1: 0 (n = 141, err = 8.5%)
## 
## Number of inner nodes:    11
## Number of terminal nodes: 12

plot(dtree_model, type = "simple")

From both the model and the plot above, the Decision Tree produces a very detailed model. This is both an advantage and a disadvantage: a decision tree can keep splitting the data in great detail, in the extreme until a leaf node holds a single observation, which fits the training data well but generalizes poorly. The tree therefore needs to know when to stop branching so that it stays simple. Cutting back tree branches is known as pruning.

Model Tuning

Parameter tuning for the Decision Tree model is done through ctree(control = ctree_control()). There are three parameters that can be set:
- mincriterion: The value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. (default: 0.95)
- minsplit: The minimum number of observations that must exist in a node in order for a split to be attempted. (default: 20)
- minbucket: The minimum number of observations at the terminal node. If not fulfilled, no branching is done. (default: 7)

In this case, to reduce the complexity of the model:
- mincriterion: the value is set to 0.5, lower than the default. On its own this makes a split easier to accept, so the reduction in complexity comes from the two settings below.
- minsplit: the value is enlarged to 1500 observations (about 18% of data train).
- minbucket: the value is enlarged to 500 observations (about 6% of data train).

dtree_model2 <- ctree(formula = Response ~ ., 
                      data = insur_train,
                      control = ctree_control(mincriterion = 0.5,
                                            minsplit = 1500,
                                            minbucket = 500))

dtree_model2

## 
## Model formula:
## Response ~ Gender + Age + Driving_License + Previously_Insured + 
##     Vehicle_Age + Vehicle_Damage + Annual_Premium + Vintage
## 
## Fitted party:
## [1] root
## |   [2] Vehicle_Damage in No
## |   |   [3] Annual_Premium <= 23033: 0 (n = 500, err = 7.0%)
## |   |   [4] Annual_Premium > 23033
## |   |   |   [5] Vehicle_Age < 1 Year: 0 (n = 1216, err = 1.6%)
## |   |   |   [6] Vehicle_Age in 1-2 Year: 0 (n = 701, err = 4.7%)
## |   [7] Vehicle_Damage in Yes
## |   |   [8] Vehicle_Age < 1 Year: 1 (n = 1109, err = 46.6%)
## |   |   [9] Vehicle_Age > 2 Years, 1-2 Year
## |   |   |   [10] Age <= 62
## |   |   |   |   [11] Age <= 46
## |   |   |   |   |   [12] Annual_Premium <= 18345: 1 (n = 534, err = 30.5%)
## |   |   |   |   |   [13] Annual_Premium > 18345
## |   |   |   |   |   |   [14] Age <= 42: 1 (n = 1312, err = 20.0%)
## |   |   |   |   |   |   [15] Age > 42: 1 (n = 707, err = 23.8%)
## |   |   |   |   [16] Age > 46: 1 (n = 1733, err = 28.6%)
## |   |   |   [17] Age > 62: 1 (n = 532, err = 48.5%)
## 
## Number of inner nodes:    8
## Number of terminal nodes: 9

plot(dtree_model2,type= "simple")

From the plot of dtree_model2, the main filter to determine whether the customer will be interested is whether the customer's vehicle has been damaged. After that, the model relies mostly on vehicle age, customer age and annual premium to make the final decision. Unlike the previous model dtree_model, the tuned tree no longer splits on Previously_Insured or Gender.

Prediction

As with the Naive Bayes model, the Decision Tree model also has two values of the type argument that can be used to predict: prob and response.

- Parameter Prob

Parameter prob will produce the probability for every class.

pred_dtree_prob <- predict(object = dtree_model2, newdata = insur_test, type = "prob")

head(pred_dtree_prob)

##            0          1
## 1  0.9300000 0.07000000
## 31 0.2862089 0.71379111
## 35 0.9529244 0.04707561
## 41 0.3052434 0.69475655
## 44 0.4849624 0.51503759
## 45 0.2862089 0.71379111

- Parameter Response

Parameter Response will produce prediction result label, in this case 1(interested) or 0(not interested)

pred_dtree_res <- predict(object = dtree_model2, newdata = insur_test, type = "response")

head(pred_dtree_res)

##  1 31 35 41 44 45 
##  0  1  0  1  1  1 
## Levels: 0 1

Evaluation

Since there are two kinds of prediction, the evaluation for each one is also different.

- Confusion Matrix

For the prediction with parameter response, function confusionMatrix() is used since the prediction result is a label.

eval_dt <- confusionMatrix(pred_dtree_res,
                reference = as.factor(insur_test$Response),
                positive = "1")

eval_dt

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  578   25
##          1  469 1012
##                                           
##                Accuracy : 0.763           
##                  95% CI : (0.7441, 0.7811)
##     No Information Rate : 0.5024          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5269          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9759          
##             Specificity : 0.5521          
##          Pos Pred Value : 0.6833          
##          Neg Pred Value : 0.9585          
##              Prevalence : 0.4976          
##          Detection Rate : 0.4856          
##    Detection Prevalence : 0.7107          
##       Balanced Accuracy : 0.7640          
##                                           
##        'Positive' Class : 1               
##

- ROC-AUC Curve

For the prediction with parameter prob, the ROC curve and its AUC are used since the prediction result is a probability.

# ROC
# Score with the probability of the positive class ("1") so that it matches the labels
dtree_roc <- prediction(predictions = pred_dtree_prob[, "1"],
                        labels = as.numeric(insur_test$Response == "1"))

perf <- performance(prediction.obj = dtree_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") # fpr = false positive rate

# Plot
plot(perf)
abline(0, 1, lty = 2)

# AUC calculation
auc_dt <- performance(prediction.obj = dtree_roc,
                      measure = "auc")
auc_dt@y.values

## [[1]]
## [1] 0.8155206

7.3 Random Forest Model

Last but not least is the Random Forest model. Random Forest makes predictions by building many decision trees. Each tree is trained on its own bootstrap sample and considers a random subset of predictors at each split, so the trees differ from one another and each makes its own prediction. Majority voting is then carried out over those predictions, and the class with the most votes becomes the final prediction.
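The majority voting step can be sketched in a few lines of base R. The `votes` matrix below is made up for illustration (five trees, four customers), and `majority_vote()` is a hypothetical helper, not a function of randomForest.

```r
# Hypothetical votes: one row per tree, one column per customer
votes <- matrix(c("1", "0", "1", "0",
                  "1", "0", "0", "0",
                  "0", "0", "1", "1",
                  "1", "1", "1", "0",
                  "1", "0", "0", "0"),
                nrow = 5, byrow = TRUE)

# The class with the most votes wins
majority_vote <- function(x) names(which.max(table(x)))

final_pred <- apply(votes, 2, majority_vote)
final_pred  # one final label per customer
```

With an odd number of trees and two classes there is never a tie, which keeps the sketch simple.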

Modeling

The first step is to set the control object, which needs two values: the number of folds for K-fold cross validation and how many times the process is repeated.

K-fold cross validation divides the data train into \(k\) equal parts; each part is used in turn as validation data while the model is trained on the remaining \(k - 1\) parts. Here it is used inside train() to choose the tuning parameter mtry, not to replace the split into data train and data test made earlier. As \(k\) gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller. The usual choice of \(k\) is 5 or 10, but there is no formal rule.
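A minimal base R sketch of how K-fold assigns rows to folds, assuming a tiny data set of 20 rows. The values of `n` and `k` and the variable names are illustrative; caret handles this internally.

```r
set.seed(100)
n <- 20  # hypothetical number of training rows
k <- 5   # number of folds

# Assign each row to one of k folds of equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  valid_idx <- which(fold_id == i)  # rows held out in this round
  train_idx <- which(fold_id != i)  # rows used to fit the model
  cat("Fold", i, ": train on", length(train_idx),
      "rows, validate on", length(valid_idx), "rows\n")
}
```

Every row is used for validation exactly once, and repeating the whole procedure with a new random assignment gives the repeated variant.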

# set.seed(100)
# 
# ctrl <- trainControl(method = "repeatedcv",
#                       number = 5, # k-fold
#                       repeats = 3) #repetition
# 
# insur_forest <- train(Response ~ .,
#                     data = insur_train,
#                     method = "rf", # random forest
#                     trControl = ctrl)
# 
# saveRDS(insur_forest, "insur_forest.RDS")

insur_forest <- readRDS("insur_forest.RDS")
insur_forest

## Random Forest 
## 
## 8344 samples
##    8 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 6675, 6676, 6675, 6675, 6675, 6674, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7871534  0.5748894
##   5     0.7762077  0.5528557
##   9     0.7670586  0.5344918
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

From the model summary, the optimum number of variables considered for splitting at each tree node is mtry = 2, since it produced the largest accuracy.

To find out which predictors are the most important in the random forest, function varImp() can be used.

varImp(insur_forest)

## rf variable importance
## 
##                      Overall
## Vehicle_DamageYes    100.000
## Previously_Insured1   77.690
## Age                   35.569
## Vehicle_Age1-2 Year   10.670
## Annual_Premium         9.772
## Vintage                9.546
## Vehicle_Age> 2 Years   2.823
## GenderMale             1.271
## Driving_License1       0.000

From the result above, the most important variable in the Random Forest model is whether the vehicle has been damaged, and the least important is whether the customer has a driving license.

Prediction

rm_pred_raw <- predict(insur_forest, insur_test, type = "raw")
head(rm_pred_raw)

## [1] 0 1 0 1 1 1
## Levels: 0 1

rm_pred_prob <- predict(insur_forest, insur_test, type = "prob")
head(rm_pred_prob)

Evaluation

- Out-of-Bag Error

At the bootstrap sampling stage, some of the data is not used to build a given tree; this is known as Out-of-Bag (OOB) data. The Random Forest model uses the OOB data as test data for the trees that did not see it and calculates the error. This error is known as the OOB error. In the case of classification, the OOB error is the percentage of misclassified OOB data.
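A short base R sketch of where OOB data comes from, assuming 1,000 training rows (an illustrative number). Drawing rows with replacement leaves some rows out of each bootstrap sample.

```r
set.seed(100)
n <- 1000  # hypothetical number of training rows

# One bootstrap sample: draw n row indices with replacement
in_bag <- sample(seq_len(n), size = n, replace = TRUE)

# Rows that were never drawn are out-of-bag for this tree
oob <- setdiff(seq_len(n), in_bag)

oob_share <- length(oob) / n
oob_share  # typically close to exp(-1), roughly 0.37
```

Each tree therefore has roughly a third of the training rows available as unseen data, which is what makes the OOB error usable as a built-in test estimate.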

insur_forest$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 21.48%
## Confusion matrix:
##      0    1 class.error
## 0 2707 1486  0.35440019
## 1  306 3845  0.07371718

- Confusion Matrix

eval_rf <- confusionMatrix(data = rm_pred_raw,
                reference = insur_test$Response,
                positive = "1")

eval_rf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 691  86
##          1 356 951
##                                           
##                Accuracy : 0.7879          
##                  95% CI : (0.7697, 0.8053)
##     No Information Rate : 0.5024          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5763          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9171          
##             Specificity : 0.6600          
##          Pos Pred Value : 0.7276          
##          Neg Pred Value : 0.8893          
##              Prevalence : 0.4976          
##          Detection Rate : 0.4563          
##    Detection Prevalence : 0.6272          
##       Balanced Accuracy : 0.7885          
##                                           
##        'Positive' Class : 1               
##

- ROC-AUC Curve

# ROC
# Score with the probability of the positive class ("1") so that it matches the labels
rm_roc <- prediction(predictions = rm_pred_prob[, "1"],
                     labels = as.numeric(insur_test$Response == "1"))

perf <- performance(prediction.obj = rm_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") # fpr = false positive rate

# Plot
plot(perf)
abline(0, 1, lty = 2)

# AUC calculation, based on the random forest object rm_roc
auc_rf <- performance(prediction.obj = rm_roc,
                      measure = "auc")
auc_rf_values <- auc_rf@y.values
auc_rf_values

8 Insight

1. Comparison between model Naive Bayes, Decision Tree and Random Forest based on confusion matrix:

# tibble() replaces the deprecated data_frame(); metrics are looked up by name
# and AUC is taken from the objects computed above instead of being typed in
compare_nb <- tibble(Model = "Naive Bayes",
                     Accuracy = round(eval_nb$overall[["Accuracy"]] * 100, 2),
                     Recall = round(eval_nb$byClass[["Sensitivity"]] * 100, 2),
                     Precision = round(eval_nb$byClass[["Pos Pred Value"]] * 100, 2),
                     AUC = round(auc_nb@y.values[[1]], 2))

compare_dt <- tibble(Model = "Decision Tree",
                     Accuracy = round(eval_dt$overall[["Accuracy"]] * 100, 2),
                     Recall = round(eval_dt$byClass[["Sensitivity"]] * 100, 2),
                     Precision = round(eval_dt$byClass[["Pos Pred Value"]] * 100, 2),
                     AUC = round(auc_dt@y.values[[1]], 2))

compare_rf <- tibble(Model = "Random Forest",
                     Accuracy = round(eval_rf$overall[["Accuracy"]] * 100, 2),
                     Recall = round(eval_rf$byClass[["Sensitivity"]] * 100, 2),
                     Precision = round(eval_rf$byClass[["Pos Pred Value"]] * 100, 2),
                     AUC = round(auc_rf@y.values[[1]], 2))

rbind(compare_nb, compare_dt, compare_rf)

Taking class 1 (interested) as the positive class for all three models, the confusion matrices give:
- Naive Bayes: accuracy 78.65%, recall 88.43%, precision 73.83%.
- Decision Tree: accuracy 76.30%, recall 97.59%, precision 68.33%.
- Random Forest: accuracy 78.79%, recall 91.71% (951 of the 1,037 interested customers are found), precision 72.76% (951 of the 1,307 positive predictions are correct).

Random Forest has the highest accuracy, Decision Tree the highest recall and Naive Bayes the highest precision. Overall, however, the three models yield roughly the same result.

In this case, the main goal is to find the current customers who are interested in vehicle insurance. For the insurance company, missing a customer who is actually interested costs more than making an offer to one who is not, so a low precision is easier to accept than a low recall. The price is paid by the sales team, who have to work harder because many customers who are not really interested will still be offered.

So the final model used in this case is Random Forest with mtry = 2. Decision Tree reaches a higher recall, but at the cost of 469 false positives and the lowest accuracy, while Random Forest combines the highest accuracy with a recall above 90% and a precision close to that of Naive Bayes.

2. Comparison between model Naive Bayes, Decision Tree and Random Forest based on AUC-ROC Value:

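# AUC is taken from the performance objects computed above
compare_nb <- tibble(Model = "Naive Bayes",
                     AUC = round(auc_nb@y.values[[1]], 2))

compare_dt <- tibble(Model = "Decision Tree",
                     AUC = round(auc_dt@y.values[[1]], 2))

compare_rf <- tibble(Model = "Random Forest",
                     AUC = round(auc_rf@y.values[[1]], 2))

rbind(compare_nb, compare_dt, compare_rf)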

An excellent model has an AUC near 1, which means it has a good measure of separability. A model with no separability has an AUC near 0.5, which is no better than random guessing. An AUC near 0 means the classes are reciprocated: the scores rank negative cases above positive ones.

An AUC near 0 does not have to mean that the model itself is wrong. It also appears when the scores and the labels refer to different classes, for example when the probability of class 0 (the first column of the prediction output) is scored against labels that mark class 1. In that situation the AUC for class 1 is 1 minus the reported value. The confusion matrices support this reading: all three models reach an accuracy of 76% to 79%, so they are not swapping the classes. Scored on the probability of class 1, the AUC is about 0.83 for Naive Bayes and about 0.82 for Decision Tree.

The AUC summarizes the ROC curve over all thresholds, so adjusting the threshold does not change it. What manual threshold tuning does change are the metrics from the confusion matrix (accuracy, recall and precision). Adjusting the threshold higher or lower could give a model that is better or even worse than the initial one on those metrics.

Disclaimer: Manual tuning of the threshold value has no special rules, so it can be adjusted freely. The threshold can be increased or decreased little by little until it produces the desired model, but doing so in small steps requires a lot of time.
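A minimal sketch of manual threshold tuning on made-up probabilities. It shows how lowering the threshold raises recall at the expense of precision. The vectors `prob_1` and `actual` and the helper `metrics_at()` are hypothetical and only serve the illustration.

```r
# Hypothetical predicted probabilities of class 1 and the true labels
prob_1 <- c(0.92, 0.71, 0.55, 0.48, 0.35, 0.10)
actual <- c(1, 1, 0, 1, 0, 0)

# Recall and precision for class 1 at a given threshold
metrics_at <- function(threshold) {
  pred <- as.integer(prob_1 >= threshold)
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  c(threshold = threshold,
    recall = tp / (tp + fn),
    precision = tp / (tp + fp))
}

# One row per candidate threshold
res <- t(sapply(c(0.3, 0.5, 0.7), metrics_at))
res
```

Looping over a grid of thresholds like this is quicker than adjusting the value by hand, and the threshold that best fits the business goal can then be read off the table.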