Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees, how can you change this perception when using the decision tree you created to solve a real problem? Format: document with screen captures & analysis.
As part of this homework I decided to use the [Kaggle car insurance claim dataset](https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification).
The dataset contains 58592 observations on 44 variables. As part of this exercise, I will analyze the data and predict whether a policy results in an insurance claim. I will solve this classification problem using a Decision Tree and a Random Forest, then evaluate and compare the performance of both models.
Load the required libraries
library(stats)
library(corrplot)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(caret)
library(rpart.plot)
library(DataExplorer)
library(skimr)
library(performanceEstimation)
library(randomForest)
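# col_types: c = character (policy_id); n = numeric (policy_tenure, age_of_car,
# age_of_policyholder, population_density); f = factor (all remaining columns)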
trainData <- read_csv("train.csv", col_types = "cnnnfnffffffffffffffffffffffffffffffffffffff")
dim(trainData)
## [1] 58592 44
The dataset has 58592 observations on 44 variables.
Let’s analyze the distribution of the various features and their values.
# Snippet of the data
head(trainData)
## # A tibble: 6 x 44
## policy_id policy_tenure age_of_car age_of_policyholder area_cluster
## <chr> <dbl> <dbl> <dbl> <fct>
## 1 ID00001 0.516 0.05 0.644 C1
## 2 ID00002 0.673 0.02 0.375 C2
## 3 ID00003 0.841 0.02 0.385 C3
## 4 ID00004 0.900 0.11 0.433 C4
## 5 ID00005 0.596 0.11 0.635 C5
## 6 ID00006 1.02 0.07 0.519 C6
## # ... with 39 more variables: population_density <dbl>, make <fct>,
## # segment <fct>, model <fct>, fuel_type <fct>, max_torque <fct>,
## # max_power <fct>, engine_type <fct>, airbags <fct>, is_esc <fct>,
## # is_adjustable_steering <fct>, is_tpms <fct>, is_parking_sensors <fct>,
## # is_parking_camera <fct>, rear_brakes_type <fct>, displacement <fct>,
## # cylinder <fct>, transmission_type <fct>, gear_box <fct>,
## # steering_type <fct>, turning_radius <fct>, length <fct>, width <fct>, ...
# glimpse and summary of the data
glimpse(trainData)
## Rows: 58,592
## Columns: 44
## $ policy_id <chr> "ID00001", "ID00002", "ID00003", "ID0~
## $ policy_tenure <dbl> 0.51587359, 0.67261851, 0.84111026, 0~
## $ age_of_car <dbl> 0.05, 0.02, 0.02, 0.11, 0.11, 0.07, 0~
## $ age_of_policyholder <dbl> 0.6442308, 0.3750000, 0.3846154, 0.43~
## $ area_cluster <fct> C1, C2, C3, C4, C5, C6, C7, C8, C7, C~
## $ population_density <dbl> 4990, 27003, 4076, 21622, 34738, 1305~
## $ make <fct> 1, 1, 1, 1, 2, 3, 4, 1, 3, 1, 1, 1, 1~
## $ segment <fct> A, A, A, C1, A, C2, B2, B2, C2, B2, A~
## $ model <fct> M1, M1, M1, M2, M3, M4, M5, M6, M4, M~
## $ fuel_type <fct> CNG, CNG, CNG, Petrol, Petrol, Diesel~
## $ max_torque <fct> 60Nm@3500rpm, 60Nm@3500rpm, 60Nm@3500~
## $ max_power <fct> 40.36bhp@6000rpm, 40.36bhp@6000rpm, 4~
## $ engine_type <fct> F8D Petrol Engine, F8D Petrol Engine,~
## $ airbags <fct> 2, 2, 2, 2, 2, 6, 2, 2, 6, 6, 2, 2, 2~
## $ is_esc <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_adjustable_steering <fct> No, No, No, Yes, No, Yes, Yes, Yes, Y~
## $ is_tpms <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_parking_sensors <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes~
## $ is_parking_camera <fct> No, No, No, Yes, Yes, Yes, No, No, Ye~
## $ rear_brakes_type <fct> Drum, Drum, Drum, Drum, Drum, Disc, D~
## $ displacement <fct> 796, 796, 796, 1197, 999, 1493, 1497,~
## $ cylinder <fct> 3, 3, 3, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4~
## $ transmission_type <fct> Manual, Manual, Manual, Automatic, Au~
## $ gear_box <fct> 5, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 5~
## $ steering_type <fct> Power, Power, Power, Electric, Electr~
## $ turning_radius <fct> 4.6, 4.6, 4.6, 4.8, 5, 5.2, 5, 4.8, 5~
## $ length <fct> 3445, 3445, 3445, 3995, 3731, 4300, 3~
## $ width <fct> 1515, 1515, 1515, 1735, 1579, 1790, 1~
## $ height <fct> 1475, 1475, 1475, 1515, 1490, 1635, 1~
## $ gross_weight <fct> 1185, 1185, 1185, 1335, 1155, 1720, 1~
## $ is_front_fog_lights <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_rear_window_wiper <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_washer <fct> No, No, No, No, No, Yes, No, No, Yes,~
## $ is_rear_window_defogger <fct> No, No, No, Yes, No, Yes, No, No, Yes~
## $ is_brake_assist <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_power_door_locks <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_central_locking <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_power_steering <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ is_driver_seat_height_adjustable <fct> No, No, No, Yes, No, Yes, No, Yes, Ye~
## $ is_day_night_rear_view_mirror <fct> No, No, No, Yes, Yes, No, No, Yes, No~
## $ is_ecw <fct> No, No, No, Yes, Yes, Yes, Yes, Yes, ~
## $ is_speed_alert <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
## $ ncap_rating <fct> 0, 0, 0, 2, 2, 3, 5, 2, 3, 0, 0, 2, 2~
## $ is_claim <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1~
summary(trainData)
## policy_id policy_tenure age_of_car age_of_policyholder
## Length:58592 Min. :0.002735 Min. :0.00000 Min. :0.2885
## Class :character 1st Qu.:0.210250 1st Qu.:0.02000 1st Qu.:0.3654
## Mode :character Median :0.573792 Median :0.06000 Median :0.4519
## Mean :0.611246 Mean :0.06942 Mean :0.4694
## 3rd Qu.:1.039104 3rd Qu.:0.11000 3rd Qu.:0.5481
## Max. :1.396641 Max. :1.00000 Max. :1.0000
##
## area_cluster population_density make segment model
## C8 :13654 Min. : 290 1:38126 A :17321 M1 :14948
## C2 : 7342 1st Qu.: 6112 2: 2373 C1 : 3557 M4 :14018
## C5 : 6979 Median : 8794 3:14018 C2 :14018 M6 :13776
## C3 : 6101 Mean :18827 4: 1961 B2 :18314 M8 : 4173
## C14 : 3660 3rd Qu.:27003 5: 2114 B1 : 4173 M7 : 2940
## C13 : 3423 Max. :73430 Utility: 1209 M3 : 2373
## (Other):17433 (Other): 6364
## fuel_type max_torque max_power
## CNG :20330 113Nm@4400rpm :17796 88.50bhp@6000rpm :17796
## Petrol:20532 60Nm@3500rpm :14948 40.36bhp@6000rpm :14948
## Diesel:17730 250Nm@2750rpm :14018 113.45bhp@4000rpm:14018
## 82.1Nm@3400rpm: 4173 55.92bhp@5300rpm : 4173
## 91Nm@4250rpm : 2373 67.06bhp@5500rpm : 2373
## 200Nm@1750rpm : 2114 97.89bhp@3600rpm : 2114
## (Other) : 3170 (Other) : 3170
## engine_type airbags is_esc is_adjustable_steering
## F8D Petrol Engine :14948 2:40425 No :40191 No :23066
## 1.5 L U2 CRDi :14018 6:16958 Yes:18401 Yes:35526
## K Series Dual jet :13776 1: 1209
## K10C : 4173
## 1.2 L K Series Engine: 2940
## 1.0 SCe : 2373
## (Other) : 6364
## is_tpms is_parking_sensors is_parking_camera rear_brakes_type
## No :44574 Yes:56219 No :35704 Drum:44574
## Yes:14018 No : 2373 Yes:22888 Disc:14018
##
##
##
##
##
## displacement cylinder transmission_type gear_box steering_type
## 1197 :17796 3:21857 Manual :38181 5:44211 Power :33502
## 796 :14948 4:36735 Automatic:20411 6:14381 Electric:23881
## 1493 :14018 Manual : 1209
## 998 : 4173
## 999 : 2373
## 1498 : 2114
## (Other): 3170
## turning_radius length width height
## 4.6 :14948 3445 :14948 1515 :14948 1475 :14948
## 4.8 :14856 4300 :14018 1735 :14856 1635 :14018
## 5.2 :14018 3845 :13776 1790 :14018 1530 :13776
## 4.7 : 4173 3990 : 4538 1620 : 4173 1675 : 4173
## 5 : 3971 3655 : 4173 1745 : 2940 1500 : 2940
## 4.85 : 2940 3995 : 3194 1579 : 2373 1490 : 2373
## (Other): 3686 (Other): 3945 (Other): 5284 (Other): 6364
## gross_weight is_front_fog_lights is_rear_window_wiper is_rear_window_washer
## 1185 :14948 No :24664 No :41634 No :41634
## 1335 :14856 Yes:33928 Yes:16958 Yes:16958
## 1720 :14018
## 1340 : 4173
## 1410 : 2940
## 1155 : 2373
## (Other): 5284
## is_rear_window_defogger is_brake_assist is_power_door_locks is_central_locking
## No :38077 No :26415 No :16157 No :16157
## Yes:20515 Yes:32177 Yes:42435 Yes:42435
##
##
##
##
##
## is_power_steering is_driver_seat_height_adjustable
## Yes:57383 No :24301
## No : 1209 Yes:34291
##
##
##
##
##
## is_day_night_rear_view_mirror is_ecw is_speed_alert ncap_rating is_claim
## No :36309 No :16157 Yes:58229 0:19097 0:54844
## Yes:22283 Yes:42435 No : 363 2:21402 1: 3748
## 3:14018
## 5: 1961
## 4: 2114
##
##
plot_missing(trainData)
#hist(trainData$is_claim)
From the missing-data plot it is clear that there is no missing data in this dataset.
Next, we’ll look at a full summary of our features, including rudimentary distributions of each of our continuous variables:
skimr::skim(trainData)
| Name | trainData |
| Number of rows | 58592 |
| Number of columns | 44 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 39 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| policy_id | 0 | 1 | 7 | 7 | 0 | 58592 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| area_cluster | 0 | 1 | FALSE | 22 | C8: 13654, C2: 7342, C5: 6979, C3: 6101 |
| make | 0 | 1 | FALSE | 5 | 1: 38126, 3: 14018, 2: 2373, 5: 2114 |
| segment | 0 | 1 | FALSE | 6 | B2: 18314, A: 17321, C2: 14018, B1: 4173 |
| model | 0 | 1 | FALSE | 11 | M1: 14948, M4: 14018, M6: 13776, M8: 4173 |
| fuel_type | 0 | 1 | FALSE | 3 | Pet: 20532, CNG: 20330, Die: 17730 |
| max_torque | 0 | 1 | FALSE | 9 | 113: 17796, 60N: 14948, 250: 14018, 82.: 4173 |
| max_power | 0 | 1 | FALSE | 9 | 88.: 17796, 40.: 14948, 113: 14018, 55.: 4173 |
| engine_type | 0 | 1 | FALSE | 11 | F8D: 14948, 1.5: 14018, K S: 13776, K10: 4173 |
| airbags | 0 | 1 | FALSE | 3 | 2: 40425, 6: 16958, 1: 1209 |
| is_esc | 0 | 1 | FALSE | 2 | No: 40191, Yes: 18401 |
| is_adjustable_steering | 0 | 1 | FALSE | 2 | Yes: 35526, No: 23066 |
| is_tpms | 0 | 1 | FALSE | 2 | No: 44574, Yes: 14018 |
| is_parking_sensors | 0 | 1 | FALSE | 2 | Yes: 56219, No: 2373 |
| is_parking_camera | 0 | 1 | FALSE | 2 | No: 35704, Yes: 22888 |
| rear_brakes_type | 0 | 1 | FALSE | 2 | Dru: 44574, Dis: 14018 |
| displacement | 0 | 1 | FALSE | 9 | 119: 17796, 796: 14948, 149: 14018, 998: 4173 |
| cylinder | 0 | 1 | FALSE | 2 | 4: 36735, 3: 21857 |
| transmission_type | 0 | 1 | FALSE | 2 | Man: 38181, Aut: 20411 |
| gear_box | 0 | 1 | FALSE | 2 | 5: 44211, 6: 14381 |
| steering_type | 0 | 1 | FALSE | 3 | Pow: 33502, Ele: 23881, Man: 1209 |
| turning_radius | 0 | 1 | FALSE | 9 | 4.6: 14948, 4.8: 14856, 5.2: 14018, 4.7: 4173 |
| length | 0 | 1 | FALSE | 9 | 344: 14948, 430: 14018, 384: 13776, 399: 4538 |
| width | 0 | 1 | FALSE | 10 | 151: 14948, 173: 14856, 179: 14018, 162: 4173 |
| height | 0 | 1 | FALSE | 11 | 147: 14948, 163: 14018, 153: 13776, 167: 4173 |
| gross_weight | 0 | 1 | FALSE | 10 | 118: 14948, 133: 14856, 172: 14018, 134: 4173 |
| is_front_fog_lights | 0 | 1 | FALSE | 2 | Yes: 33928, No: 24664 |
| is_rear_window_wiper | 0 | 1 | FALSE | 2 | No: 41634, Yes: 16958 |
| is_rear_window_washer | 0 | 1 | FALSE | 2 | No: 41634, Yes: 16958 |
| is_rear_window_defogger | 0 | 1 | FALSE | 2 | No: 38077, Yes: 20515 |
| is_brake_assist | 0 | 1 | FALSE | 2 | Yes: 32177, No: 26415 |
| is_power_door_locks | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_central_locking | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_power_steering | 0 | 1 | FALSE | 2 | Yes: 57383, No: 1209 |
| is_driver_seat_height_adjustable | 0 | 1 | FALSE | 2 | Yes: 34291, No: 24301 |
| is_day_night_rear_view_mirror | 0 | 1 | FALSE | 2 | No: 36309, Yes: 22283 |
| is_ecw | 0 | 1 | FALSE | 2 | Yes: 42435, No: 16157 |
| is_speed_alert | 0 | 1 | FALSE | 2 | Yes: 58229, No: 363 |
| ncap_rating | 0 | 1 | FALSE | 5 | 2: 21402, 0: 19097, 3: 14018, 4: 2114 |
| is_claim | 0 | 1 | FALSE | 2 | 0: 54844, 1: 3748 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| policy_tenure | 0 | 1 | 0.61 | 0.41 | 0.00 | 0.21 | 0.57 | 1.04 | 1.4 | ▇▅▃▆▃ |
| age_of_car | 0 | 1 | 0.07 | 0.06 | 0.00 | 0.02 | 0.06 | 0.11 | 1.0 | ▇▁▁▁▁ |
| age_of_policyholder | 0 | 1 | 0.47 | 0.12 | 0.29 | 0.37 | 0.45 | 0.55 | 1.0 | ▇▇▃▁▁ |
| population_density | 0 | 1 | 18826.86 | 17660.17 | 290.00 | 6112.00 | 8794.00 | 27003.00 | 73430.0 | ▇▃▂▁▁ |
trainData %>%
ggplot(aes(x = policy_tenure, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = age_of_car, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = age_of_policyholder, fill = is_claim)) +
geom_boxplot()
trainData %>%
ggplot(aes(x = population_density, fill = is_claim)) +
geom_boxplot()
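The same four distributions can also be drawn in a single faceted panel; this is an equivalent sketch using the tidyverse tools already loaded:
trainData %>%
  pivot_longer(cols = c(policy_tenure, age_of_car, age_of_policyholder, population_density),
               names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value, fill = is_claim)) +
  geom_boxplot() +
  facet_wrap(~feature, scales = "free_x")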
Distribution of the data across categorical variables
long_df1 <- trainData %>% select(c('is_speed_alert','ncap_rating','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df1, aes(x=value, fill=is_claim)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
ggtitle('Comparing Categorical Features and Target')
long_df2 <- trainData %>% select(c('segment','area_cluster','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
ggplot(long_df2, aes(x=value, fill=is_claim)) +
geom_bar() +
facet_wrap(~kpi, scales="free_x") +
scale_fill_manual(values = c("#2bbac0", "#f06e64")) + ggtitle('Comparing Categorical Features and Target')
long_df3 <- trainData %>% select(c('model','is_parking_camera','is_claim')) %>% pivot_longer(cols=-c(is_claim), names_to='kpi')
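The plot for long_df3 appears to have been omitted; following the same pattern as the previous two panels:
ggplot(long_df3, aes(x=value, fill=is_claim)) +
  geom_bar() +
  facet_wrap(~kpi, scales="free_x") +
  scale_fill_manual(values = c("#2bbac0", "#f06e64")) +
  ggtitle('Comparing Categorical Features and Target')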
We are considering only the major categorical variables in these comparisons. Based on this analysis, only a subset of the features appears relevant, so we keep just those:
trainData2 <- trainData %>% select(policy_tenure, age_of_car, age_of_policyholder, area_cluster, population_density, make, segment, model, max_torque,
                                   is_parking_sensors, is_parking_camera, is_brake_assist, is_power_door_locks, is_central_locking, is_speed_alert, ncap_rating, is_claim)
Target Variable Distribution
Let’s visualize our target distribution.
# Bar plot for the target (insurance claim)
ggplot(trainData, aes(x=is_claim, fill=is_claim)) +
geom_bar() +
xlab("Insurance Claim") +
ylab("Count") +
ggtitle("Analysis of Insurance Claim") +
scale_fill_discrete(name = "Insurance Claim", labels = c("Absence", "Presence"))
round(prop.table(table(select(trainData, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
Our analysis shows a clear class imbalance in this dataset: only 6% of the policies have a claim. Imbalanced data can be quite problematic for classifying the minority class, so I will oversample before modeling to give the model a better chance at identifying claims.
One way to address this imbalance is the Synthetic Minority Oversampling Technique (SMOTE). This technique creates a new dataset by oversampling observations from the minority class, producing more balanced classes.
Our first step is to separate the data into training and test sets, so we can later measure accuracy on the holdout test set. We perform an 80/20 training-to-testing split.
# Splitting the data 80/20
set.seed(1234)
training.samples <- trainData2$is_claim %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- trainData2[training.samples, ]
test.data <- trainData2[-training.samples, ] # use the same feature subset as the training data
round(prop.table(table(select(train.data, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
round(prop.table(table(select(test.data, 'is_claim'))),2)
##
## 0 1
## 0.94 0.06
We can see that the class proportions of the target variable are preserved in both the training and test data, matching the original dataset.
Since the target classes are imbalanced, we will address this by oversampling the minority class.
set.seed(12345)
train.balanced <- smote(is_claim ~ ., data = train.data, perc.over = 1)
train.balanced %>% ggplot(aes(is_claim)) +
geom_bar(fill = "#04354F") +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
We can see the target classes are now balanced in the training data; the test data is left untouched.
A Decision Tree is a supervised machine learning algorithm that uses a set of rules to make decisions, similar to how humans make decisions. It can be used for both classification and regression problems.
Construct the First Decision Tree Model
I will now construct a classification tree using the training dataset, considering the full set of features identified in the earlier analysis.
DT_modelAll1 <- rpart(is_claim~., data = train.balanced, method = 'class')
rpart.plot(DT_modelAll1)
We can see that policy_tenure and age_of_car play an important role in the tree’s decisions.
Measure the performance
predDTAll1 <-predict(DT_modelAll1, test.data, type="class")
dtAllCM <- confusionMatrix(data = predDTAll1, reference = test.data$is_claim, positive = "1")
dtAllCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4238 157
## 1 6730 592
##
## Accuracy : 0.4122
## 95% CI : (0.4033, 0.4212)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0347
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79039
## Specificity : 0.38640
## Pos Pred Value : 0.08085
## Neg Pred Value : 0.96428
## Prevalence : 0.06392
## Detection Rate : 0.05052
## Detection Prevalence : 0.62490
## Balanced Accuracy : 0.58839
##
## 'Positive' Class : 1
##
From the confusion matrix we can see that this model has very low accuracy (41.2%); even though its sensitivity is good (79.03%), it performs poorly on specificity (38.6%).
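Before switching variables, it is worth checking the first tree’s complexity-parameter table to see how cross-validated error changes with tree size; a quick diagnostic sketch using rpart’s built-in functions:
printcp(DT_modelAll1)  # cross-validated error (xerror) at each pruning level
plotcp(DT_modelAll1)   # visual aid for choosing the complexity parameter cp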
Switching the variables
DT_modelSub1 <- rpart(is_claim ~ age_of_car+age_of_policyholder + population_density + policy_tenure +is_parking_sensors + is_power_door_locks
+ncap_rating+is_brake_assist, data = train.balanced, method = 'class')
rpart.plot(DT_modelSub1)
predSub1 <-predict(DT_modelSub1, test.data, type="class")
dtSub1CM <- confusionMatrix(data = predSub1, reference = test.data$is_claim, positive = "1")
dtSub1CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5565 226
## 1 5403 523
##
## Accuracy : 0.5196
## 95% CI : (0.5105, 0.5287)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0487
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.69826
## Specificity : 0.50739
## Pos Pred Value : 0.08826
## Neg Pred Value : 0.96097
## Prevalence : 0.06392
## Detection Rate : 0.04464
## Detection Prevalence : 0.50576
## Balanced Accuracy : 0.60282
##
## 'Positive' Class : 1
##
We can see that performance improved when considering only a few relevant features, and the tree now uses one additional parameter, is_brake_assist. The accuracy is now 51.9%, sensitivity 69.82%, and specificity 50.73%.
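Since both confusion matrices are stored, the two trees can be compared side by side; here is a small sketch using the statistics caret keeps in each confusionMatrix object:
# Side-by-side comparison of the two decision trees
tibble(
  model       = c("DT: all selected features", "DT: switched subset"),
  accuracy    = c(dtAllCM$overall["Accuracy"], dtSub1CM$overall["Accuracy"]),
  sensitivity = c(dtAllCM$byClass["Sensitivity"], dtSub1CM$byClass["Sensitivity"]),
  specificity = c(dtAllCM$byClass["Specificity"], dtSub1CM$byClass["Specificity"])
)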
Random Forest is a supervised learning algorithm that uses an ensemble method for classification and regression.
A random forest consists of many decision trees. The ‘forest’ is trained through bagging (bootstrap aggregating), an ensemble meta-algorithm that improves the accuracy and stability of machine learning models.
The random forest algorithm establishes the outcome from the predictions of its trees: for classification it takes a majority vote across the trees, and for regression it averages their outputs. Increasing the number of trees generally makes the outcome more stable.
Model Creation
set.seed(123)
fit.forest <- randomForest(is_claim ~ age_of_car+age_of_policyholder + population_density + policy_tenure +is_parking_sensors + is_power_door_locks
+ncap_rating+is_brake_assist, data = train.balanced, importance=TRUE, ntree=200)
# display model details
fit.forest
##
## Call:
## randomForest(formula = is_claim ~ age_of_car + age_of_policyholder + population_density + policy_tenure + is_parking_sensors + is_power_door_locks + ncap_rating + is_brake_assist, data = train.balanced, importance = TRUE, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 31.9%
## Confusion matrix:
## 0 1 class.error
## 0 3468 2530 0.4218073
## 1 1297 4701 0.2162387
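Because the forest was fit with importance = TRUE, we can also check which features drive its predictions using randomForest’s built-in importance plot:
varImpPlot(fit.forest, main = "Random Forest Variable Importance")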
rf.pred <- predict(fit.forest, newdata=test.data, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, test.data$is_claim))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6258 280
## 1 4710 469
##
## Accuracy : 0.5741
## 95% CI : (0.5651, 0.5831)
## No Information Rate : 0.9361
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0524
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.57057
## Specificity : 0.62617
## Pos Pred Value : 0.95717
## Neg Pred Value : 0.09056
## Prevalence : 0.93608
## Detection Rate : 0.53410
## Detection Prevalence : 0.55799
## Balanced Accuracy : 0.59837
##
## 'Positive' Class : 0
##
We can see that the accuracy of the random forest model is 57.4%, with sensitivity 57.05% and specificity 62.6%, better overall than either decision tree. Note, however, that confusionMatrix defaulted to “0” as the positive class here (the earlier models used positive = "1"), so sensitivity and specificity are reported with respect to the no-claim class.
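For a like-for-like comparison with the tree models, we can recompute the metrics with “1” as the positive class; a short sketch reusing the stored predictions:
# Recompute with "1" (claim) as the positive class, matching the tree models
forest.cm <- confusionMatrix(rf.pred, test.data$is_claim, positive = "1")
forest.cm$byClass[c("Sensitivity", "Specificity")]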
From our analysis we noticed that decision trees are simple, easy to understand and interpret, and can be visualized; they also require little data preparation. Their main drawback is overfitting: while a decision tree performs well against the training data, it may not perform well on unseen data.
Since a Random Forest is built from multiple decision trees, it gives better performance, as we saw in our analysis.
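Finally, one concrete way to push back on the overfitting criticism when using a single tree to solve a real problem is to prune it on cross-validated error. This is a sketch using rpart’s cptable; the pruned tree would still need to be re-evaluated on the test set:
# Choose the complexity parameter that minimizes cross-validated error, then prune
cp_best <- DT_modelSub1$cptable[which.min(DT_modelSub1$cptable[, "xerror"]), "CP"]
DT_pruned <- prune(DT_modelSub1, cp = cp_best)
rpart.plot(DT_pruned)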