Objective

Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees, how can you change this perception when using the decision tree you created to solve a real problem? Format: document with screen captures & analysis.

Car Insurance Claim Prediction

As part of this homework I decided to use the Kaggle Car Insurance Claim Prediction dataset (https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification).

The dataset contains 58,592 observations on 44 variables. As part of this exercise, I will analyze the data and predict whether a policy results in an insurance claim. I will solve this classification problem using a Decision Tree and a Random Forest, then evaluate and compare the performance of both models.

Load the required libraries

library(stats)
library(corrplot)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(caret)          # createDataPartition(), confusionMatrix()
library(rpart.plot)     # loads rpart; plots fitted trees
library(DataExplorer)   # plot_missing()
library(skimr)          # skim()

library(performanceEstimation)  # smote()
library(randomForest)           # randomForest()

Exploratory Data Analysis

trainData <- read_csv("train.csv", col_types = "cnnnfnffffffffffffffffffffffffffffffffffffff")
dim(trainData)
## [1] 58592    44

The dataset has 58,592 observations on 44 variables.

Analyze the data

Let's analyze the distribution of the various features and their values.

# Snippet of the data
head(trainData)
## # A tibble: 6 x 44
##   policy_id policy_tenure age_of_car age_of_policyholder area_cluster
##   <chr>             <dbl>      <dbl>               <dbl> <fct>       
## 1 ID00001           0.516       0.05               0.644 C1          
## 2 ID00002           0.673       0.02               0.375 C2          
## 3 ID00003           0.841       0.02               0.385 C3          
## 4 ID00004           0.900       0.11               0.433 C4          
## 5 ID00005           0.596       0.11               0.635 C5          
## 6 ID00006           1.02        0.07               0.519 C6          
## # ... with 39 more variables: population_density <dbl>, make <fct>,
## #   segment <fct>, model <fct>, fuel_type <fct>, max_torque <fct>,
## #   max_power <fct>, engine_type <fct>, airbags <fct>, is_esc <fct>,
## #   is_adjustable_steering <fct>, is_tpms <fct>, is_parking_sensors <fct>,
## #   is_parking_camera <fct>, rear_brakes_type <fct>, displacement <fct>,
## #   cylinder <fct>, transmission_type <fct>, gear_box <fct>,
## #   steering_type <fct>, turning_radius <fct>, length <fct>, width <fct>, ...
# glimpse and summary of the data
glimpse(trainData)
## Rows: 58,592
## Columns: 44
## $ policy_id                        <chr> "ID00001", "ID00002", "ID00003", "ID0~
## $ policy_tenure                    <dbl> 0.51587359, 0.67261851, 0.84111026, 0~
## $ age_of_car                       <dbl> 0.05, 0.02, 0.02, 0.11, 0.11, 0.07, 0~
## $ age_of_policyholder              <dbl> 0.6442308, 0.3750000, 0.3846154, 0.43~
## $ area_cluster                     <fct> C1, C2, C3, C4, C5, C6, C7, C8, C7, C~
## $ population_density               <dbl> 4990, 27003, 4076, 21622, 34738, 1305~
## $ make                             <fct> 1, 1, 1, 1, 2, 3, 4, 1, 3, 1, 1, 1, 1~
## $ segment                          <fct> A, A, A, C1, A, C2, B2, B2, C2, B2, A~
## $ model                            <fct> M1, M1, M1, M2, M3, M4, M5, M6, M4, M~
## $ fuel_type                        <fct> CNG, CNG, CNG, Petrol, Petrol, Diesel~
## $ max_torque                       <fct> 60Nm@3500rpm, 60Nm@3500rpm, 60Nm@3500~
## $ max_power                        <fct> 40.36bhp@6000rpm, 40.36bhp@6000rpm, 4~
## $ engine_type                      <fct> F8D Petrol Engine, F8D Petrol Engine,~
## $ airbags                          <fct> 2, 2, 2, 2, 2, 6, 2, 2, 6, 6, 2, 2, 2~
## $ is_esc                           <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_adjustable_steering           <fct> No, No, No, Yes, No, Yes, Yes, Yes, Y~
## $ is_tpms                          <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_parking_sensors               <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes~
## $ is_parking_camera                <fct> No, No, No, Yes, Yes, Yes, No, No, Ye~
## $ rear_brakes_type                 <fct> Drum, Drum, Drum, Drum, Drum, Disc, D~
## $ displacement                     <fct> 796, 796, 796, 1197, 999, 1493, 1497,~
## $ cylinder                         <fct> 3, 3, 3, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4~
## $ transmission_type                <fct> Manual, Manual, Manual, Automatic, Au~
## $ gear_box                         <fct> 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5~
## $ steering_type                    <fct> Power, Power, Power, Electric, Electr~
## $ turning_radius                   <fct> 4.6, 4.6, 4.6, 4.8, 5, 5.2, 5, 4.8, 5~
## $ length                           <fct> 3445, 3445, 3445, 3995, 3731, 4300, 3~
## $ width                            <fct> 1515, 1515, 1515, 1735, 1579, 1790, 1~
## $ height                           <fct> 1475, 1475, 1475, 1515, 1490, 1635, 1~
## $ gross_weight                     <fct> 1185, 1185, 1185, 1335, 1155, 1720, 1~
## $ is_front_fog_lights              <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_rear_window_wiper             <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_washer            <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_defogger          <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_brake_assist                  <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_power_door_locks              <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_central_locking               <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_power_steering                <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ is_driver_seat_height_adjustable <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_day_night_rear_view_mirror    <fct> No, No, No, Yes, Yes, No, No, Yes, No~
## $ is_ecw                           <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_speed_alert                   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ ncap_rating                      <fct> 0, 0, 0, 2, 2, 3, 5, 2, 3, 0, 0, 2, 2~
## $ is_claim                         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
summary(trainData)
##   policy_id         policy_tenure        age_of_car      age_of_policyholder
##  Length:58592       Min.   :0.002735   Min.   :0.00000   Min.   :0.2885     
##  Class :character   1st Qu.:0.210250   1st Qu.:0.02000   1st Qu.:0.3654     
##  Mode  :character   Median :0.573792   Median :0.06000   Median :0.4519     
##                     Mean   :0.611246   Mean   :0.06942   Mean   :0.4694     
##                     3rd Qu.:1.039104   3rd Qu.:0.11000   3rd Qu.:0.5481     
##                     Max.   :1.396641   Max.   :1.00000   Max.   :1.0000     
##                                                                             
##   area_cluster   population_density make         segment          model      
##  C8     :13654   Min.   :  290      1:38126   A      :17321   M1     :14948  
##  C2     : 7342   1st Qu.: 6112      2: 2373   C1     : 3557   M4     :14018  
##  C5     : 6979   Median : 8794      3:14018   C2     :14018   M6     :13776  
##  C3     : 6101   Mean   :18827      4: 1961   B2     :18314   M8     : 4173  
##  C14    : 3660   3rd Qu.:27003      5: 2114   B1     : 4173   M7     : 2940  
##  C13    : 3423   Max.   :73430                Utility: 1209   M3     : 2373  
##  (Other):17433                                                (Other): 6364  
##   fuel_type              max_torque                max_power    
##  CNG   :20330   113Nm@4400rpm :17796   88.50bhp@6000rpm :17796  
##  Petrol:20532   60Nm@3500rpm  :14948   40.36bhp@6000rpm :14948  
##  Diesel:17730   250Nm@2750rpm :14018   113.45bhp@4000rpm:14018  
##                 82.1Nm@3400rpm: 4173   55.92bhp@5300rpm : 4173  
##                 91Nm@4250rpm  : 2373   67.06bhp@5500rpm : 2373  
##                 200Nm@1750rpm : 2114   97.89bhp@3600rpm : 2114  
##                 (Other)       : 3170   (Other)          : 3170  
##                 engine_type    airbags   is_esc      is_adjustable_steering
##  F8D Petrol Engine    :14948   2:40425   No :40191   No :23066             
##  1.5 L U2 CRDi        :14018   6:16958   Yes:18401   Yes:35526             
##  K Series Dual jet    :13776   1: 1209                                     
##  K10C                 : 4173                                               
##  1.2 L K Series Engine: 2940                                               
##  1.0 SCe              : 2373                                               
##  (Other)              : 6364                                               
##  is_tpms     is_parking_sensors is_parking_camera rear_brakes_type
##  No :44574   Yes:56219          No :35704         Drum:44574      
##  Yes:14018   No : 2373          Yes:22888         Disc:14018      
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##   displacement   cylinder  transmission_type gear_box   steering_type  
##  1197   :17796   3:21857   Manual   :38181   5:44211   Power   :33502  
##  796    :14948   4:36735   Automatic:20411   6:14381   Electric:23881  
##  1493   :14018                                         Manual  : 1209  
##  998    : 4173                                                         
##  999    : 2373                                                         
##  1498   : 2114                                                         
##  (Other): 3170                                                         
##  turning_radius      length          width           height     
##  4.6    :14948   3445   :14948   1515   :14948   1475   :14948  
##  4.8    :14856   4300   :14018   1735   :14856   1635   :14018  
##  5.2    :14018   3845   :13776   1790   :14018   1530   :13776  
##  4.7    : 4173   3990   : 4538   1620   : 4173   1675   : 4173  
##  5      : 3971   3655   : 4173   1745   : 2940   1500   : 2940  
##  4.85   : 2940   3995   : 3194   1579   : 2373   1490   : 2373  
##  (Other): 3686   (Other): 3945   (Other): 5284   (Other): 6364  
##   gross_weight   is_front_fog_lights is_rear_window_wiper is_rear_window_washer
##  1185   :14948   No :24664           No :41634            No :41634            
##  1335   :14856   Yes:33928           Yes:16958            Yes:16958            
##  1720   :14018                                                                 
##  1340   : 4173                                                                 
##  1410   : 2940                                                                 
##  1155   : 2373                                                                 
##  (Other): 5284                                                                 
##  is_rear_window_defogger is_brake_assist is_power_door_locks is_central_locking
##  No :38077               No :26415       No :16157           No :16157         
##  Yes:20515               Yes:32177       Yes:42435           Yes:42435         
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  is_power_steering is_driver_seat_height_adjustable
##  Yes:57383         No :24301                       
##  No : 1209         Yes:34291                       
##                                                    
##                                                    
##                                                    
##                                                    
##                                                    
##  is_day_night_rear_view_mirror is_ecw      is_speed_alert ncap_rating is_claim 
##  No :36309                     No :16157   Yes:58229      0:19097     0:54844  
##  Yes:22283                     Yes:42435   No :  363      2:21402     1: 3748  
##                                                           3:14018              
##                                                           5: 1961              
##                                                           4: 2114              
##                                                                                
## 
plot_missing(trainData)


From our analysis it is clear that there are no missing data.

Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:

skimr::skim(trainData)
Data summary
Name trainData
Number of rows 58592
Number of columns 44
_______________________
Column type frequency:
character 1
factor 39
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
policy_id 0 1 7 7 0 58592 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
area_cluster 0 1 FALSE 22 C8: 13654, C2: 7342, C5: 6979, C3: 6101
make 0 1 FALSE 5 1: 38126, 3: 14018, 2: 2373, 5: 2114
segment 0 1 FALSE 6 B2: 18314, A: 17321, C2: 14018, B1: 4173
model 0 1 FALSE 11 M1: 14948, M4: 14018, M6: 13776, M8: 4173
fuel_type 0 1 FALSE 3 Pet: 20532, CNG: 20330, Die: 17730
max_torque 0 1 FALSE 9 113: 17796, 60N: 14948, 250: 14018, 82.: 4173
max_power 0 1 FALSE 9 88.: 17796, 40.: 14948, 113: 14018, 55.: 4173
engine_type 0 1 FALSE 11 F8D: 14948, 1.5: 14018, K S: 13776, K10: 4173
airbags 0 1 FALSE 3 2: 40425, 6: 16958, 1: 1209
is_esc 0 1 FALSE 2 No: 40191, Yes: 18401
is_adjustable_steering 0 1 FALSE 2 Yes: 35526, No: 23066
is_tpms 0 1 FALSE 2 No: 44574, Yes: 14018
is_parking_sensors 0 1 FALSE 2 Yes: 56219, No: 2373
is_parking_camera 0 1 FALSE 2 No: 35704, Yes: 22888
rear_brakes_type 0 1 FALSE 2 Dru: 44574, Dis: 14018
displacement 0 1 FALSE 9 119: 17796, 796: 14948, 149: 14018, 998: 4173
cylinder 0 1 FALSE 2 4: 36735, 3: 21857
transmission_type 0 1 FALSE 2 Man: 38181, Aut: 20411
gear_box 0 1 FALSE 2 5: 44211, 6: 14381
steering_type 0 1 FALSE 3 Pow: 33502, Ele: 23881, Man: 1209
turning_radius 0 1 FALSE 9 4.6: 14948, 4.8: 14856, 5.2: 14018, 4.7: 4173
length 0 1 FALSE 9 344: 14948, 430: 14018, 384: 13776, 399: 4538
width 0 1 FALSE 10 151: 14948, 173: 14856, 179: 14018, 162: 4173
height 0 1 FALSE 11 147: 14948, 163: 14018, 153: 13776, 167: 4173
gross_weight 0 1 FALSE 10 118: 14948, 133: 14856, 172: 14018, 134: 4173
is_front_fog_lights 0 1 FALSE 2 Yes: 33928, No: 24664
is_rear_window_wiper 0 1 FALSE 2 No: 41634, Yes: 16958
is_rear_window_washer 0 1 FALSE 2 No: 41634, Yes: 16958
is_rear_window_defogger 0 1 FALSE 2 No: 38077, Yes: 20515
is_brake_assist 0 1 FALSE 2 Yes: 32177, No: 26415
is_power_door_locks 0 1 FALSE 2 Yes: 42435, No: 16157
is_central_locking 0 1 FALSE 2 Yes: 42435, No: 16157
is_power_steering 0 1 FALSE 2 Yes: 57383, No: 1209
is_driver_seat_height_adjustable 0 1 FALSE 2 Yes: 34291, No: 24301
is_day_night_rear_view_mirror 0 1 FALSE 2 No: 36309, Yes: 22283
is_ecw 0 1 FALSE 2 Yes: 42435, No: 16157
is_speed_alert 0 1 FALSE 2 Yes: 58229, No: 363
ncap_rating 0 1 FALSE 5 2: 21402, 0: 19097, 3: 14018, 4: 2114
is_claim 0 1 FALSE 2 0: 54844, 1: 3748

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
policy_tenure 0 1 0.61 0.41 0.00 0.21 0.57 1.04 1.4 ▇▅▃▆▃
age_of_car 0 1 0.07 0.06 0.00 0.02 0.06 0.11 1.0 ▇▁▁▁▁
age_of_policyholder 0 1 0.47 0.12 0.29 0.37 0.45 0.55 1.0 ▇▇▃▁▁
population_density 0 1 18826.86 17660.17 290.00 6112.00 8794.00 27003.00 73430.0 ▇▃▂▁▁

Data Visualization

trainData %>% 
  ggplot(aes(x = policy_tenure, fill = is_claim)) +
  geom_boxplot()

trainData %>% 
  ggplot(aes(x = age_of_car, fill = is_claim)) +
  geom_boxplot()

trainData %>% 
  ggplot(aes(x = age_of_policyholder, fill = is_claim)) +
  geom_boxplot()

trainData %>% 
  ggplot(aes(x = population_density, fill = is_claim)) +
  geom_boxplot()

Distribution of the data across categorical variables

long_df1 <- trainData %>% 
  select(is_speed_alert, ncap_rating, is_claim) %>% 
  pivot_longer(cols = -is_claim, names_to = "kpi")

ggplot(long_df1, aes(x=value, fill=is_claim)) + 
  geom_bar() + 
  facet_wrap(~kpi, scales="free_x") + 
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) + 
  ggtitle('Comparing Categorical Features and Target')

long_df2 <- trainData %>% 
  select(segment, area_cluster, is_claim) %>% 
  pivot_longer(cols = -is_claim, names_to = "kpi")

ggplot(long_df2, aes(x=value, fill=is_claim)) + 
  geom_bar() + 
  facet_wrap(~kpi, scales="free_x") + 
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) + 
  ggtitle('Comparing Categorical Features and Target')

long_df3 <- trainData %>% 
  select(model, is_parking_camera, is_claim) %>% 
  pivot_longer(cols = -is_claim, names_to = "kpi")
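
Plotting this third set of categorical features in the same way as the two plots above:

ggplot(long_df3, aes(x=value, fill=is_claim)) + 
  geom_bar() + 
  facet_wrap(~kpi, scales="free_x") + 
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) + 
  ggtitle('Comparing Categorical Features and Target')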

We are considering only the major categorical variables for this analysis.

From our analysis, we can see that only a subset of the features is relevant, so we keep just those features:

trainData2 <- trainData %>% 
  select(policy_tenure, age_of_car, age_of_policyholder, area_cluster,
         population_density, make, segment, model, max_torque,
         is_parking_sensors, is_parking_camera, is_brake_assist,
         is_power_door_locks, is_central_locking, is_speed_alert,
         ncap_rating, is_claim)

Target Variable Distribution

Let’s visualize our target distribution.

# Bar plot for the target (insurance claim)
ggplot(trainData, aes(x=is_claim, fill=is_claim)) + 
  geom_bar() +
  xlab("Insurance Claim") +
  ylab("Count") +
  ggtitle("Analysis of Insurance Claim") +
  scale_fill_discrete(name = "Insurance Claim", labels = c("No Claim", "Claim"))

round(prop.table(table(select(trainData, 'is_claim'))),2)
## 
##    0    1 
## 0.94 0.06

Class imbalance

According to our analysis there is a class imbalance in this dataset: only about 6% of policies have a claim. Imbalanced data can be quite problematic when classifying the minority class, so I will oversample the training data using SMOTE before modeling to give the model a better chance at classifying the minority class.

One way to address this imbalance is the Synthetic Minority Oversampling Technique, often abbreviated SMOTE. Rather than simply duplicating minority observations, SMOTE synthesizes new minority-class observations by interpolating between existing minority examples and their nearest neighbors, producing a training set with more balanced classes.
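
As a minimal sketch of the interpolation idea behind SMOTE for numeric features (illustration only; the smote() function from performanceEstimation used below also handles neighbor selection, categorical features, and undersampling of the majority class):

# A synthetic minority point lies on the segment between a minority
# observation x and one of its nearest minority-class neighbors x_nn
synthetic_point <- function(x, x_nn) {
  gap <- runif(1)         # random position along the segment
  x + gap * (x_nn - x)    # new synthetic observation
}

synthetic_point(c(0.5, 0.1), c(0.7, 0.3))  # a new point between the two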

Splitting the Dataset into Training and Testing

Our first step will be to separate the data into training and test datasets so that we can later measure performance on the held-out test set. We perform an 80/20 training-to-testing split.

# Splitting the data 80/20
set.seed(1234)

training.samples <- trainData2$is_claim %>% 
  createDataPartition(p = 0.8, list = FALSE)

train.data <- trainData2[training.samples, ]
test.data <- trainData2[-training.samples, ]

round(prop.table(table(select(train.data, 'is_claim'))),2)
## 
##    0    1 
## 0.94 0.06
round(prop.table(table(select(test.data, 'is_claim'))),2)
## 
##    0    1 
## 0.94 0.06

We can see that the class proportions of the target variable are preserved in both the training and test sets, matching the original data (createDataPartition performs a stratified split on the outcome).

Since there is a class imbalance in the target variable, we will address it by oversampling the minority class with SMOTE.

set.seed(12345)
train.balanced <- smote(is_claim ~ ., data = train.data, perc.over = 1)

train.balanced %>% ggplot(aes(is_claim)) +
  geom_bar(fill = "#04354F") +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")

We can see that the target variable is now balanced in the training data.

Supervised Machine Learning

Decision tree

A Decision Tree is a supervised machine learning algorithm that uses a set of rules to make decisions, similar to how humans make decisions. It can be used for both classification and regression problems.
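
For classification, rpart's default splitting criterion is Gini impurity: at each node the tree picks the split that most reduces impurity in the resulting child nodes. A quick illustration of the measure:

# Gini impurity of a node: 1 - sum(p^2) over class proportions p;
# 0 means the node is pure
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

gini(c(0, 0, 0, 1))  # 0.375 - a mixed node
gini(c(1, 1, 1, 1))  # 0     - a pure node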

Construct the First Decision Tree Model

I will now construct a classification tree using the balanced training dataset, considering the full set of features identified in our earlier analysis.

DT_modelAll1 <- rpart(is_claim~., data = train.balanced, method = 'class')
rpart.plot(DT_modelAll1)

We can see that policy_tenure and age_of_car play an important role in the tree's decisions.
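
We can also confirm this from the fitted model's split-based variable importance, which rpart records on the model object:

# Variable importance recorded by rpart (higher = more influential)
DT_modelAll1$variable.importance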

Measure the performance

predDTAll1 <-predict(DT_modelAll1, test.data, type="class")
dtAllCM <- confusionMatrix(data = predDTAll1, reference = test.data$is_claim, positive = "1")
dtAllCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4238  157
##          1 6730  592
##                                           
##                Accuracy : 0.4122          
##                  95% CI : (0.4033, 0.4212)
##     No Information Rate : 0.9361          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0347          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.79039         
##             Specificity : 0.38640         
##          Pos Pred Value : 0.08085         
##          Neg Pred Value : 0.96428         
##              Prevalence : 0.06392         
##          Detection Rate : 0.05052         
##    Detection Prevalence : 0.62490         
##       Balanced Accuracy : 0.58839         
##                                           
##        'Positive' Class : 1               
## 

From the confusion matrix we can see that this model has very low accuracy (41.2%); although its sensitivity is good (79.0%), it performs poorly on specificity (38.6%).
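
For reference, these metrics fall directly out of the 2x2 table above when class "1" is treated as positive:

# Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)
tab <- table(Prediction = predDTAll1, Reference = test.data$is_claim)
tab["1", "1"] / sum(tab[, "1"])  # sensitivity: 592 / 749   = 0.790
tab["0", "0"] / sum(tab[, "0"])  # specificity: 4238 / 10968 = 0.386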

Switching the variables

DT_modelSub1 <- rpart(is_claim ~ age_of_car + age_of_policyholder + population_density +
                        policy_tenure + is_parking_sensors + is_power_door_locks +
                        ncap_rating + is_brake_assist,
                      data = train.balanced, method = 'class')
rpart.plot(DT_modelSub1)

predSub1 <-predict(DT_modelSub1, test.data, type="class")
dtSub1CM <- confusionMatrix(data = predSub1, reference = test.data$is_claim, positive = "1")
dtSub1CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5565  226
##          1 5403  523
##                                           
##                Accuracy : 0.5196          
##                  95% CI : (0.5105, 0.5287)
##     No Information Rate : 0.9361          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0487          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.69826         
##             Specificity : 0.50739         
##          Pos Pred Value : 0.08826         
##          Neg Pred Value : 0.96097         
##              Prevalence : 0.06392         
##          Detection Rate : 0.04464         
##    Detection Prevalence : 0.50576         
##       Balanced Accuracy : 0.60282         
##                                           
##        'Positive' Class : 1               
## 

Measure the performance

We can see that the performance has improved when considering only a few relevant features, and the decision tree now uses one additional feature, is_brake_assist, in its splits.

The accuracy is now 52.0%, sensitivity is 69.8%, and specificity is 50.7%.
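
Putting the two trees side by side, using the confusionMatrix objects computed above:

# Compare the two decision trees on the held-out test set
rbind(
  all_features   = dtAllCM$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")],
  feature_subset = dtSub1CM$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]
)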

Random Forest

Random forest is a Supervised Learning algorithm which uses an ensemble learning method for classification and regression.

A random forest algorithm consists of many decision trees. The ‘forest’ generated by the random forest algorithm is trained through bagging or bootstrap aggregating. Bagging is an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.

The random forest algorithm establishes its outcome from the predictions of its decision trees: for classification it takes a majority vote across the trees, and for regression it averages their outputs. Increasing the number of trees generally makes the outcome more stable.
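
To make the voting mechanism concrete, here is a toy sketch of the bagging idea (illustration only; randomForest() additionally samples a random subset of features at each split, which is what distinguishes a random forest from plain bagging):

# Fit one rpart tree per bootstrap sample, then take a majority vote
bagged_vote <- function(formula, data, newdata, n_trees = 25) {
  votes <- replicate(n_trees, {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap sample
    tree <- rpart::rpart(formula, data = boot, method = "class")
    as.character(predict(tree, newdata, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
}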

Model Creation

set.seed(123)
fit.forest <- randomForest(is_claim ~ age_of_car + age_of_policyholder + population_density +
                             policy_tenure + is_parking_sensors + is_power_door_locks +
                             ncap_rating + is_brake_assist,
                           data = train.balanced, importance = TRUE, ntree = 200)

# display model details
fit.forest
## 
## Call:
##  randomForest(formula = is_claim ~ age_of_car + age_of_policyholder +      population_density + policy_tenure + is_parking_sensors +      is_power_door_locks + ncap_rating + is_brake_assist, data = train.balanced,      importance = TRUE, ntree = 200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 31.9%
## Confusion matrix:
##      0    1 class.error
## 0 3468 2530   0.4218073
## 1 1297 4701   0.2162387
rf.pred <- predict(fit.forest, newdata=test.data, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, test.data$is_claim))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6258  280
##          1 4710  469
##                                           
##                Accuracy : 0.5741          
##                  95% CI : (0.5651, 0.5831)
##     No Information Rate : 0.9361          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0524          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.57057         
##             Specificity : 0.62617         
##          Pos Pred Value : 0.95717         
##          Neg Pred Value : 0.09056         
##              Prevalence : 0.93608         
##          Detection Rate : 0.53410         
##    Detection Prevalence : 0.55799         
##       Balanced Accuracy : 0.59837         
##                                           
##        'Positive' Class : 0               
## 

Analysis

We can see that the accuracy of the random forest model is 57.4%, which is better than either decision tree. Note that in this confusion matrix the positive class defaulted to "0" (we did not pass positive = "1"), so the reported sensitivity (57.1%) is the recall for non-claims and the reported specificity (62.6%) is the recall for claims.
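
Since the forest was fit with importance = TRUE, we can also inspect which features drive its predictions:

# Permutation- and Gini-based importance from the fitted forest
importance(fit.forest)
varImpPlot(fit.forest)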

Summary

From our analysis we noticed that decision trees are simple and easy to understand and interpret: trees can be visualized, and they require little data preparation. Their main drawback is overfitting: while a decision tree performs well on the training data it was grown from, it may not perform well on unseen data.
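
One standard remedy is to prune a tree back to the complexity parameter (cp) with the lowest cross-validated error; a sketch against the second tree (we did not tune cp in the models above):

# Pick the cp with minimum cross-validated error and prune
best_cp <- DT_modelSub1$cptable[which.min(DT_modelSub1$cptable[, "xerror"]), "CP"]
pruned  <- prune(DT_modelSub1, cp = best_cp)
rpart.plot(pruned)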

Since a random forest is built from many decision trees, it reduces this variance and gives better performance, as we saw in our analysis.